‘Data mining’ is an interesting term. It’s used very positively in some academic circles, such as departments of marketing, and very negatively in others, most notably departments of economics. The term refers to the use of clever automatedsearch techniques to discover putatively significant relationships in large data sets.
The paradigm example, though a very old-fashioned one is ‘stepwise regression’. You take a variable of interest then set up a multivariate regression. The computer then tries out all the other variables in the data set one at a time. If the variable comes up significant, it stays in, otherwise it’s dropped. In the end you have what is, arguably, the best possible regression.
Economists were early and enthusiastic users of stepwise regression, but they rapidly became disillusioned. To see the problem, consider the simpler case of testing correlations. Suppose, in a given dataset you find that consumption of restaurant meals is positively correlated with education. This correlation might have arisen by chance or it might reflect a real causal relationship of some kind (not necessarily a direct or obvious one). The standard statistical test involves determining how likely it is that you would have seen the observed correlation if there was in fact no relationship. If this probability is lower than, say, 5 per cent, you say that the relationship is statistically significant.
Now suppose you have a data set with 10 variables. That makes 45 (=10*9/2) distinct pairs you can test. Just by chance you’d expect two or three correlations that appear statistically significant correlations. So if your only goal is to find a significant relationship that you can turn into a publication, this strategy works wonders.
But perhaps you have views about the ‘right’ sign of the correlation, perhaps based on some economic theory or political viewpoint. On average, half of all random correlations will have the ‘wrong’ sign, but you can at expect to find at least one ‘right-signed’ and statistically significant correlation in a set of 10 variables. So, if data mining is extensive enough, the usual statistical checks on spurious results become worthless.
In principle, there is a simple solution to this problem, reflecting Popper’s distinction between the context of discovery and the context of justification. There’s nothing wrong with using data mining as a method of discovery, to suggest testable hypotheses. Once you have a testable hypothesis, you can discard the data set you started with and test the hypothesis on new data untainted by the process of ‘pretesting’ that you applied to the original data set.
Unfortunately, at least for economists, it’s not that simple. Data is scarce and expensive. Moreover, no-one gets their specification right first time, as the simple testing model would require. Inevitably, therefore, there has to be some exploration (mining) of the data before hypotheses are tested. As a result, statistical tests of significance never mean precisely what they are supposed to.
In practice, there’s not much that can be done except to rely on the honesty of investigators in reporting the procedures they went through before settling on the model they estimate. If the results are interesting enough, someone will find another data set to check or will wait for new data to allow ‘out of sample’ testing. Some models survive this stringent testing, but many do not.
I don’t know how the marketing guys solve this problem. Perhaps their budgets are so large that they can discard used data sets like disposable syringes, never infecting their analysis with the virus of pretesting. Or perhaps they don’t know or don’t care.
Update Kevin Drum at CalPundit gives the perspective of a marketing guy, with lots of interesting points (for example, loyalty programs are there to collect data for mining). He doesn’t accept my main point and raises the dreaded “B” word – Bayesian.
For those familiar with debates among statisticians, this is the point at which things typically become both heated and incomprehensible (just like a lot of blogs, really). A real challenge, which I may tackle at some point, is to explain the Bayesian concept of statistical reasoning in ordinary language. For the moment, though, I’ll just agree that a debate like the one over data mining ultimately makes sense only if it’s cast in Bayesian terms, that is, with a discussion of the beliefs we hold before we begin the statistical analysis.