David
Colquhoun (2014) recently wrote “If you use p = 0.05 to suggest that you have
made a discovery, you will be wrong at least 30% of the time.” At the same
time, you might have learned that if you set your alpha at 5%, the Type 1 error
rate (or false positive rate) will not be higher than 5%. How are these two statements
related?
First of
all, the statement by David Colquhoun is obviously incorrect – peer reviewers
nowadays are not what they never were – but we can correct his sentence by
changing ‘will’ into ‘might, under specific circumstances, very well be’. After all, if you only examined true effects, you could never be wrong when you suggested, based on p = 0.05, that you had made a discovery.
The probability that you are correct when you state that a single study is indicative of a true effect depends on the percentage of studies you perform in which there is an effect (H1 is true) versus no effect (H0 is true), the statistical power, and the alpha level. The false discovery rate is the percentage of
positive results that are false positives (not
the percentage of all studies that
are false positives). If you perform 200 tests with 80% power, and 50% (i.e.,
100) of the tests examine a true effect, you’ll find 80 true positives (0.8*100),
but in the 50% of the tests that do not examine a true effect, you’ll find 5
false positives (0.05*100). For the 85 positive results (80 + 5), the false
discovery rate is 5/85=0.0588, or approximately 6% (see the Figure below, from Lakens & Evers, 2014, for a visualization).
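The arithmetic above is easy to check yourself. A minimal sketch in Python, using the numbers from the example (200 tests, 50% true effects, 80% power, alpha = 0.05):

```python
# False discovery rate for the example above:
# 200 tests, 50% examine a true effect, 80% power, alpha = 0.05.
n_tests = 200
prop_true = 0.5   # proportion of tests where H1 is true
power = 0.8       # probability of detecting a true effect
alpha = 0.05      # Type 1 error rate when H0 is true

true_positives = power * (n_tests * prop_true)          # 0.8 * 100 = 80
false_positives = alpha * (n_tests * (1 - prop_true))   # 0.05 * 100 = 5
fdr = false_positives / (true_positives + false_positives)

print(round(true_positives), round(false_positives), round(fdr, 4))  # 80 5 0.0588
```

Try changing `prop_true` to see how quickly the false discovery rate climbs when true effects become rare.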
At the same
time, the alpha of 5% guarantees that not more than 5% of all your studies will be Type 1 errors. This is also true in the Figure above. Of 200 studies, at most 0.05*200 = 10 will be false positives.
This happens only when H0 is true for all 200 studies. In our situation, only 5
studies (2.5% of all studies) are Type 1 errors, which is indeed less than 5%
of all the studies we’ve performed.
So what’s
the problem? The problem is that you should not try to translate your Type 1
error rate into the evidential value of a single study. If you want to make a
statement about a single p < 0.05 study representing a true effect, there is no way to
quantify this without knowing the power in the studies where H1 is true, and
the percentage of studies where H1 is true. P-values and
evidential value are not completely unrelated, in the long run, but a single study
won’t tell you a lot – especially when you investigate counterintuitive findings that are unlikely to be true.
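If you *did* know the power and the percentage of true hypotheses, you could quantify it. A sketch of that calculation (the function name is mine; the formula is the standard positive predictive value, i.e., one minus the false discovery rate):

```python
def prob_true_given_significant(power, alpha, prop_true):
    """Probability that a significant result reflects a true effect
    (the positive predictive value), given the statistical power, the
    alpha level, and the proportion of studies in which H1 is true."""
    true_pos = power * prop_true
    false_pos = alpha * (1 - prop_true)
    return true_pos / (true_pos + false_pos)

# With 80% power and half of all examined effects true, a significant
# result is very likely to be real:
print(round(prob_true_given_significant(0.8, 0.05, 0.5), 3))  # 0.941

# For counterintuitive hypotheses that are true only 10% of the time,
# the same p < 0.05 is much weaker evidence:
print(round(prob_true_given_significant(0.8, 0.05, 0.1), 3))  # 0.64
```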
So what
should you do? The solution is to never say you’ve made a discovery based on a
single p-value. This will not just make statisticians, but also philosophers of
science, very happy. And instead of making a fool out of yourself perhaps as often as 30% of the time, you won't make a fool out of yourself at all.
A statistically
significant difference might be ‘in line with’ predictions from a theory. After
all, your theory predicts data patterns, and the p-value tells you the probability of observing your data (or more extreme data), assuming the null hypothesis is true. ‘In line with’ is a nice
way to talk about your results. It is not a quantifiable statement about your
hypothesis (that would be silly, based on a p-value!),
but it is a fair statement about your data.
P-values
are important tools because they allow you to control error rates. Not the
false discovery rate, but the false positive rate. If you do 200 studies in your life, and you control your error rates, you won't say that there is an effect when there is none
more than 10 times (on average). That’s pretty sweet. Obviously, there are also Type 2
errors to take into account, which is why you should design high-powered studies, but that’s a different story.
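This long-run guarantee is easy to illustrate with a small simulation. The sketch below relies on the fact that, when H0 is true, p-values are uniformly distributed, and takes the worst case where all 200 studies in a career examine true null hypotheses:

```python
import random

# Under H0 the p-value is uniformly distributed, so long-run error
# control can be illustrated by drawing uniform 'p-values' and counting
# how often they fall below alpha.
random.seed(1)  # for reproducibility
alpha = 0.05
studies_per_career = 200
n_careers = 2_000  # repeat the 200-study career many times

false_positives = [
    sum(random.random() < alpha for _ in range(studies_per_career))
    for _ in range(n_careers)
]
mean_fp = sum(false_positives) / n_careers
print(round(mean_fp, 1))  # on average close to 0.05 * 200 = 10
```

Individual careers fluctuate around that average, but alpha caps the long-run rate at which you claim effects that are not there.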
Some people
recommend lowering p-value thresholds to as low as 0.001 before you announce a ‘discovery’ (I've already explained why we should ignore this), and others
say we should get rid of p-values altogether. But I think we should get rid of ‘discovery’, and use p-values to control our error
rates.
It’s
difficult to know, for any single dataset, whether a significant effect is
indicative of a true hypothesis. With Bayesian statistics, you can convince everyone
who has the same priors. Or you can collect such a huge amount of data that
you can convince almost everyone (irrespective of their priors). But perhaps we should not try to get too much out of single studies. It might just be the case that, as long as we share all our results, a bunch of close replications extended with pre-registered novel predictions of a pattern in the data will be more
useful for cumulative science than quantifying the likelihood a single study
provides support for a hypothesis. And if you agree we need multiple studies, you'd better control your Type 1 errors in the long run.