David
Colquhoun (2014) recently wrote “If you use p = 0.05 to suggest that you have
made a discovery, you will be wrong at least 30% of the time.” At the same
time, you might have learned that if you set your alpha at 5%, the Type 1 error
rate (or false positive rate) will not be higher than 5%. How are these two statements
related?
First of
all, the statement by David Colquhoun is obviously incorrect – peer reviewers
nowadays are not what they never were – but we can correct his sentence by
changing ‘will’ into ‘might, under specific circumstances, very well be’. After all, if you only examined true effects, you could never be wrong when you suggested, based on p = 0.05, that you had made a discovery.
The probability that you are correct when you state that a single study is indicative of a true effect depends on the percentage of studies you perform in which there is an effect (H1 is true) versus no effect (H0 is true), the statistical power, and the alpha level. The false discovery rate is the percentage of
positive results that are false positives (not
the percentage of all studies that
are false positives). If you perform 200 tests with 80% power, and 50% (i.e.,
100) of the tests examine a true effect, you’ll find 80 true positives (0.8*100),
but in the 50% of the tests that do not examine a true effect, you’ll find 5
false positives (0.05*100). For the 85 positive results (80 + 5), the false
discovery rate is 5/85=0.0588, or approximately 6% (see the Figure below, from Lakens & Evers, 2014, for a visualization).
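The arithmetic above is easy to check yourself. A minimal sketch in Python, using the numbers from the example (200 tests, 50% true effects, 80% power, alpha = 0.05):

```python
# False discovery rate for the example above:
# 200 tests, 50% examine a true effect, 80% power, alpha = 0.05.
n_tests = 200
prop_true = 0.5   # proportion of tests where H1 is true
power = 0.8       # probability of detecting a true effect
alpha = 0.05      # Type 1 error rate when H0 is true

true_positives = power * (n_tests * prop_true)          # 0.8 * 100 = 80
false_positives = alpha * (n_tests * (1 - prop_true))   # 0.05 * 100 = 5
fdr = false_positives / (true_positives + false_positives)

print(round(true_positives), round(false_positives), round(fdr, 4))  # 80 5 0.0588
```

Try changing `prop_true` to see how quickly the false discovery rate climbs when true effects become rare.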
At the same
time, the alpha of 5% guarantees that not more than 5% of all your studies will be Type 1 errors. This is also true in the Figure above. Of 200 studies, at most 0.05*200 = 10 will be false positives.
This happens only when H0 is true for all 200 studies. In our situation, only 5
studies (2.5% of all studies) are Type 1 errors, which is indeed less than 5%
of all the studies we’ve performed.
So what’s
the problem? The problem is that you should not try to translate your Type 1
error rate into the evidential value of a single study. If you want to make a
statement about a single p < 0.05 study representing a true effect, there is no way to
quantify this without knowing the power in the studies where H1 is true, and
the percentage of studies where H1 is true. P-values and
evidential value are not completely unrelated, in the long run, but a single study
won’t tell you a lot – especially when you investigate counterintuitive findings that are unlikely to be true.
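If you *did* know the power and the percentage of true hypotheses, you could quantify it. A sketch of that calculation (the function name is mine; the formula is the standard positive predictive value, i.e., one minus the false discovery rate):

```python
def prob_true_given_significant(power, alpha, prop_true):
    """Probability that a significant result reflects a true effect
    (the positive predictive value), given the statistical power, the
    alpha level, and the proportion of studies in which H1 is true."""
    true_pos = power * prop_true
    false_pos = alpha * (1 - prop_true)
    return true_pos / (true_pos + false_pos)

# With 80% power and half of all examined effects true, a significant
# result is very likely to be real:
print(round(prob_true_given_significant(0.8, 0.05, 0.5), 3))  # 0.941

# For counterintuitive hypotheses that are true only 10% of the time,
# the same p < 0.05 is much weaker evidence:
print(round(prob_true_given_significant(0.8, 0.05, 0.1), 3))  # 0.64
```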
So what
should you do? The solution is to never say you’ve made a discovery based on a
single p-value. This will not just make statisticians, but also philosophers of
science, very happy. And instead of making a fool out of yourself perhaps as often as 30% of the time, you won't make a fool out of yourself at all.
A statistically
significant difference might be ‘in line with’ predictions from a theory. After
all, your theory predicts data patterns, and the p-value tells you the probability of observing your data (or more extreme data), assuming the null hypothesis is true. ‘In line with’ is a nice
way to talk about your results. It is not a quantifiable statement about your
hypothesis (that would be silly, based on a p-value!),
but it is a fair statement about your data.
P-values
are important tools because they allow you to control error rates. Not the
false discovery rate, but the false positive rate. If you do 200 studies in your life, and you control your error rates, you won't say that there is an effect when there is none
more than 10 times (on average). That’s pretty sweet. Obviously, there are also Type 2
errors to take into account, which is why you should design high-powered studies, but that’s a different story.
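This long-run guarantee is easy to illustrate with a small simulation. The sketch below relies on the fact that, when H0 is true, p-values are uniformly distributed, and takes the worst case where all 200 studies in a career examine true null hypotheses:

```python
import random

# Under H0 the p-value is uniformly distributed, so long-run error
# control can be illustrated by drawing uniform 'p-values' and counting
# how often they fall below alpha.
random.seed(1)  # for reproducibility
alpha = 0.05
studies_per_career = 200
n_careers = 2_000  # repeat the 200-study career many times

false_positives = [
    sum(random.random() < alpha for _ in range(studies_per_career))
    for _ in range(n_careers)
]
mean_fp = sum(false_positives) / n_careers
print(round(mean_fp, 1))  # on average close to 0.05 * 200 = 10
```

Individual careers fluctuate around that average, but alpha caps the long-run rate at which you claim effects that are not there.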
Some people
recommend lowering p-value thresholds to as low as 0.001 before you announce a ‘discovery’ (I've already explained why we should ignore this), and others
say we should get rid of p-values altogether. But I think we should get rid of ‘discovery’, and use p-values to control our error
rates.
It’s
difficult to know, for any single dataset, whether a significant effect is
indicative of a true hypothesis. With Bayesian statistics, you can convince everyone
who has the same priors. Or you can collect such a huge amount of data that
you can convince almost everyone (irrespective of their priors). But perhaps we should not try to get too much out of single studies. It might just be the case that, as long as we share all our results, a bunch of close replications extended with pre-registered novel predictions of a pattern in the data will be more
useful for cumulative science than quantifying the likelihood a single study
provides support for a hypothesis. And if you agree we need multiple studies, you'd better control your Type 1 errors in the long run.