There’s been a lot of commentary on a recent study by the Replication Project that attempted to replicate 100 published studies in psychology, all of which found statistically significant effects of some kind. The results were pretty dismal. Only about one-third of the replications observed a statistically significant effect, and the average effect size was about half that originally reported.
Unfortunately, most of the discussion of this study I’ve seen, notably in the New York Times, has missed the key point, namely the problem of publication bias. The big problem is that, under standard 20th century procedures, research reports will only be published if the effect observed is “statistically significant”, which, broadly speaking means that the average value of the observed effect is more than twice as large as the estimated standard error. According to the standard classical hypothesis testing theory, the probability that such an effect will be observed by chance, when in reality there is no effect, is less than 5 per cent.
There are two problems here, traditionally called Type I and Type II error. The classical hypothesis testing focuses on reducing Type I error, the possibility of finding an effect when none exists in reality, to 5 per cent. Unfortunately, when you do lots of tests, you get 5 per cent of a large number. If all the original studies were Type I errors, we’d expect only 5 per cent to survive replication.
In fact, the outcome observed in the Replication Study is entirely consistent with the possibility that all the failed replications are subject to Type II error, that is, failure to demonstrate an effect that is there in reality
I’m going to illustrate this with a numerical example[^1].