The great replication crisis

There’s been a lot of commentary on a recent study by the Replication Project that attempted to replicate 100 published studies in psychology, all of which found statistically significant effects of some kind. The results were pretty dismal. Only about one-third of the replications observed a statistically significant effect, and the average effect size was about half that originally reported.

Unfortunately, most of the discussion of this study I’ve seen, notably in the New York Times, has missed the key point, namely the problem of publication bias. The big problem is that, under standard 20th century procedures, research reports will only be published if the effect observed is “statistically significant”, which, broadly speaking, means that the average value of the observed effect is more than twice as large as its estimated standard error. According to standard classical hypothesis testing theory, the probability that such an effect will be observed by chance, when in reality there is no effect, is less than 5 per cent.

There are two problems here, traditionally called Type I and Type II error. Classical hypothesis testing focuses on reducing Type I error, the possibility of finding an effect when none exists in reality, to 5 per cent. Unfortunately, when you do lots of tests, you get 5 per cent of a large number. If all the original studies were Type I errors, we’d expect only 5 per cent to survive replication.

In fact, the outcome observed in the Replication Study is entirely consistent with the possibility that all the failed replications are subject to Type II error, that is, a failure to demonstrate an effect that is there in reality.

I’m going to illustrate this with a numerical example[^1].

Suppose each of the 100 studies was looking at a treatment (any kind of intervention or change) which shifts some variable of interest by 0.1 standard deviations (in the context of IQ test scores, for example, this would be a shift of 1.5 IQ points). Suppose the population parameters in the absence of treatment are known, and we have a sample of 225 treated subjects. We’d expect the sample mean obtained in this way to be, on average, 0.1 standard deviations higher than the value for the population at large. But the sample mean is itself a random variable, with a standard deviation equal to the population standard deviation divided by sqrt(225) = 15. That is, if we normalize the population distribution to have mean zero and standard deviation 1, the sample mean will have mean 0.1 and standard deviation 1/15, or about 0.067. That in turn means that about 30 per cent of the observed sample means will exceed twice the standard error (about 0.133), which is roughly the level required to find statistical significance.
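The arithmetic in this paragraph can be checked with a short calculation. A sketch, assuming the numbers given above (true effect 0.1 SD, sample of 225) and treating “twice the standard error” as the significance cutoff, a simplification of the usual 1.96:

```python
from scipy.stats import norm

true_effect = 0.1        # assumed treatment effect, in population SDs
se = 1 / 15              # standard error of the sample mean: 1/sqrt(225)
cutoff = 2 * se          # ≈ 0.133, the rough significance threshold

# Power: chance the sample mean clears the cutoff when the true effect is 0.1.
power = 1 - norm.cdf(cutoff, loc=true_effect, scale=se)
print(round(power, 3))   # ≈ 0.309, i.e. about 30 per cent
```

The power here is just the probability that a normal variable with mean 0.1 and SD 1/15 exceeds 0.133, which is the 30 per cent figure quoted.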

Under best-practice 20th century procedure, the experimenters would report the effect if it passed the standard test for statistical significance, and dump the experiment otherwise[^2]. The resulting population of reported results will have an average effect size of around 0.2 population standard deviations[^3].
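The truncated mean that footnote 3 asks for can be computed exactly via the inverse Mills ratio. A sketch, again assuming the paragraph’s numbers (mean 0.1, SD 1/15, cutoff of twice the standard error):

```python
from scipy.stats import norm

mu, sigma = 0.1, 1 / 15      # mean and SD of the sample-mean distribution
c = 2 / 15                   # significance cutoff ≈ 0.133

# Mean of a normal truncated below at c, via the inverse Mills ratio:
# E[X | X > c] = mu + sigma * pdf(alpha) / (1 - cdf(alpha)), alpha = (c - mu)/sigma.
alpha = (c - mu) / sigma
trunc_mean = mu + sigma * norm.pdf(alpha) / (1 - norm.cdf(alpha))
print(round(trunc_mean, 3))  # ≈ 0.176 population SDs
```

So the exact figure is about 0.18 standard deviations, close to the eyeballed 0.2 in the text.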

Now think about what happens when a study like this is replicated. There’s only a 30 per cent chance that the original finding of statistical significance will be repeated. Moreover, the average effect size will be close to the true effect size, which is half the reported effect size.
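The whole story can also be checked by simulation. A sketch, assuming the numerical example’s parameters (true effect 0.1 SD, n = 225, significance at twice the standard error), with publication conditional on significance:

```python
import numpy as np

rng = np.random.default_rng(0)
n, true_effect, trials = 225, 0.1, 100_000
se = 1 / np.sqrt(n)          # 1/15
cutoff = 2 * se              # significance threshold for the sample mean

# Draw sample means for the original studies and independent replications.
original = rng.normal(true_effect, se, trials)
replication = rng.normal(true_effect, se, trials)

published = original > cutoff            # only significant originals get published
print(published.mean())                  # ~0.31: share of studies published
print(original[published].mean())        # ~0.18: average reported effect
print((replication[published] > cutoff).mean())  # ~0.31: replication "success" rate
print(replication[published].mean())     # ~0.10: replication effect ≈ true effect
```

The replications, unfiltered by publication, recover the true effect of 0.1 on average, about half the reported 0.18, even though only about 30 per cent of them reach significance. This matches the broad pattern of the replication results.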

I don’t claim that the results of the replications can all be explained this way. At a rough guess, half of the observed failures were probably Type I errors in the original study, and half were Type II errors in the replication.

The broader problem is that the classical approach to hypothesis testing doesn’t have any real theoretical foundations: that is, there is no question to which the proposal “accept H1 if it would be true by chance only 5 per cent of the time, retain H0 otherwise” represents a generally sensible answer. But, we are stuck with it as a social convention, and we need to make it work better.

Replication is one way to improve things. Another, designed to prevent the kind of tweaking pejoratively referred to as ‘data mining’ or ‘data dredging’, is to require researchers to register the statistical model they plan to use before collecting the data. Finally, the dominant response in practice has been to disregard the “95 per cent” number associated with classical hypothesis testing theory and to treat research findings as a kind of Bayesian update on our beliefs about the issue in question. If we have no prior beliefs one way or the other, a rough estimate is that a finding reported with “95 per cent” confidence is about 50 per cent likely to be right. Turning this around, and adding a little more scepticism, we get the provocative conclusion of Ioannidis: “most published research findings are false”.
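The Bayesian-update point can be made concrete with Bayes’ rule. A sketch with purely illustrative numbers: the posterior probability that a “significant” finding reflects a real effect depends on the prior share of true hypotheses and on the power of the test, neither of which the 5 per cent threshold tells you anything about:

```python
def prob_true_given_significant(prior, power, alpha=0.05):
    """P(effect is real | result is significant), by Bayes' rule."""
    return prior * power / (prior * power + (1 - prior) * alpha)

# Even prior odds and the ~30 per cent power from the numerical example:
print(prob_true_given_significant(0.5, 0.3))   # ≈ 0.86
# If only 1 in 10 tested hypotheses is true, Ioannidis-style pessimism:
print(prob_true_given_significant(0.1, 0.3))   # = 0.4, most findings wrong
```

With a pessimistic enough prior (or lower power), the posterior falls below one half, which is one way to arrive at the “most published research findings are false” conclusion.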

[^1]: Which will probably include an error, since I’m prone to them, but a fixable one, since the underlying argument is valid.

[^2]: In reality, a more common response, especially with nearly-significant results, is to tweak the test until it is passed.

[^3]: I eyeballed this because I was too lazy to look up or calculate the truncated mean for the normal, so I’d appreciate it if a commenter would do my work for me.

30 thoughts on “The great replication crisis”

  1. @Donald Oats

    And at another corner of the things psych can and will investigate is the science of psychogeography, which “looks at how strongly the place we’re in can influence how we think and feel.” From the quantum level to the universal, psych just wants to understand the patterns that are to be found, and why these patterns in human behaviour can look like fractals.

    I don’t like to think of the not-theories as toothbrushes though. I like to think of them as glass beads some of which are connected by conceptual/narrative ‘threads’; for example effects like Dunning-Kruger and Motivated Cognition are quite robust and the patterns of human behaviour that can be seen in the description of these cognitive processes can be linked to the patterns that Freud saw in his environment. Some of Freud’s stories about why people do what they do are wrong in our environment because our environment has changed.

    Anyway, about ‘psychogeography’; if one thinks about the sense of place that is so strong in aboriginal cultures and in the other people who live on ‘the land’ – although the neo-liberal impulse to get ahead has overruled that ‘instinct’ in some of these people and they will sell anything if the price is right – this is another way ‘psychology’ can be used to create links – stories – that make it possible to understand what we do.

    It’s on Sunday Extra.

    The most valuable knowledge that I personally gained from my undergrad psych course – and this is so indulgently off topic – was that one could ‘research’ things, almost anything; that was before the internet, which developed while I was at Uni. But mostly it became obvious how complicated and how impossible to understand fully life is, but that it doesn’t matter if you keep aiming for the truth/objectivity, even if it is ‘just a horizon value’.

    So, that is probably why it seems to me that intro psych and basic stats could be like the Arts degree once was and provide a set of skills that are very useful for negotiating one’s way through the diversity that is human nature/culture.

    Ikon did you watch the TV series The Samurai with Shintaro the ninja?

  2. @Donald Oats

    Psychology experiments must be like trying to herd cats, I reckon. All the complexities of sentient beings are present, and somehow this must be corralled into something meaningful. Definitely not easy.

    Cats are easy. 🙂

    A friend of mine was doing a Ph.D and wanted to do some research with school kids in school settings (ten-year-olds, IIRC). Nothing to it:

    Get usable idea, sell idea to committee/advisor

    Find some funding

    Get approval from the ethics people

    Prepare testing and communication material

    Negotiate with local school board at staff and board levels. Remember you will need your advisor to sit in on several (most?) meetings

    School boards usually only meet once a month so hope your pitch is good the first time.

    Individually contact parents of all suitable children and get written permission for their participation. Pray enough children are available.

    Run the study – remember you probably only have 4 or 5 months in a year to do the data gathering.

    Work around sports and cultural activities at the several schools.

    Hope that children will cooperate and no flu epidemics strike during the data gathering periods.

  3. The way I remember it, it should be

    The nature of Monkey was —

    But I always wondered:
    How does an eon wheel?
