Post provided by JARROD HADFIELD
Last week the Center for Open Science held a meeting with the aim of improving inference in ecology and evolution. The organisers (Tim Parker, Jessica Gurevitch & Shinichi Nakagawa) brought together the Editors-in-chief of many journals to try to build a consensus on how improvements could be made. I was brought in due to my interest in statistics and type I errors – be warned, my summary of the meeting is unlikely to be 100% objective.
True Positives and False Positives
The majority of findings in psychology and cancer biology cannot be replicated in repeat experiments. As evolutionary ecologists we might be tempted to dismiss this because psychology is often seen as a “soft science” that lacks rigour and cancer biologists are competitive and unscrupulous. Luckily, we as evolutionary biologists and ecologists have that perfect blend of intellect and integrity. This argument is wrong for an obvious reason and a not so obvious reason.
We tend to concentrate on significant findings, and with good reason: a true positive is usually more informative than a true negative. However, of all the published positives what fraction are true positives rather than false positives? The knee-jerk response to this question is 95%. However, the probability of a false positive (the significance threshold, alpha) is usually set to 0.05, and the probability of a true positive (the statistical power, 1 − beta) in ecological studies is generally less than 0.5 for moderate-sized effects. The probability that a published positive is true is therefore 0.5/(0.5+0.05) = 91%. Not so bad. But, this assumes that the hypothesis and the null hypothesis are equally likely. If that were true, rejecting the null would give us very little information about the world (a single bit actually) and is unlikely to be published in a widely read journal. A hypothesis that had a plausibility of 1 in 25 prior to testing would, if true, be more informative, but then the true positive rate would be down to (1/25)*0.5/((1/25)*0.5+(24/25)*0.05) = 29%. So we can see that high false positive rates aren’t always the result of sloppiness or misplaced ambition, but an inevitable consequence of doing interesting science with a rather lenient significance threshold.
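The arithmetic above is the positive predictive value (PPV) of a significant result; a minimal sketch, with `prior` standing in for the pre-test plausibility of the hypothesis:

```python
# Probability that a significant result reflects a true effect,
# given the prior plausibility of the hypothesis, the power of
# the test, and the significance threshold (alpha).

def ppv(prior, power, alpha):
    """Positive predictive value of a significant result."""
    true_pos = prior * power          # true effects that reach significance
    false_pos = (1 - prior) * alpha   # null effects that reach significance
    return true_pos / (true_pos + false_pos)

print(round(ppv(0.5, 0.5, 0.05), 2))   # 50:50 prior -> 0.91
print(round(ppv(1/25, 0.5, 0.05), 2))  # 1-in-25 prior -> 0.29
```

The two calls reproduce the 91% and 29% figures in the text.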
Where the sloppiness comes in is a failure to acknowledge these facts, a reluctance to act on them appropriately and the use of various practices that increase the false positive rate even further. Psychologists and cancer biologists are now tackling these issues, and it is time we joined them.
Flexibility in data collection and analysis
From initial data collection to publication we are given great freedom in what we choose to do. On finding that y does not correlate with x we could shift our attention to z. If our treatment has no effect we could test whether that is because its effects are sex-specific. These types of exploratory analyses play an important role in science, but it would be lunacy to subject them to the lenient significance thresholds that we use for predefined tests. In these two simple cases the true positive rate would fall to an abysmal (1/25)*0.5/((1/25)*0.5+(24/25)*(1-0.95^2)) = 18%.
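The inflation works through the effective significance threshold: with k chances to reject the null, the probability that at least one test comes up significant by chance is 1 − (1 − alpha)^k. A short sketch, extending the PPV calculation in the text:

```python
# How flexibility in testing inflates the false positive rate.
# With k independent chances to reject a true null, the effective
# alpha is 1 - (1 - alpha)^k, which lowers the PPV accordingly.

def ppv_flexible(prior, power, alpha, k):
    """PPV when the null effectively gets k chances to be rejected."""
    effective_alpha = 1 - (1 - alpha) ** k
    true_pos = prior * power
    false_pos = (1 - prior) * effective_alpha
    return true_pos / (true_pos + false_pos)

print(round(ppv_flexible(1/25, 0.5, 0.05, 2), 2))  # two tests -> 0.18
```

With k = 2 this reproduces the 18% figure above; with more undisclosed tests the effective alpha, and the damage, only grows.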
Is this a common problem? My feeling is yes, but since there is no obligation for a scientist to disclose whether tests and analysis decisions are post-hoc or not, it is hard to say. It was suggested during the meeting that scientists should be made to declare this information. I can think of no reasonable argument against it: it is a very valuable piece of information when assessing scientific evidence, it can easily be given, and an abbreviation indicating the type of test (post-hoc, a priori, unsure) takes up little space in a publication.
Two other, more extreme, suggestions were proposed. The first was for authors to provide a full transcript of analysis decisions and then use (new) statistical methods to ascertain the appropriate significance threshold. In my opinion this is impractical, requiring too much time and statistical expertise and ignoring the fact that some modelling decisions are made in good faith based on scientific knowledge that even the most sophisticated algorithm would be blind to.
A second suggestion, which I personally hope will become the gold-standard of future research, is to encourage the use of preregistered analysis and data-collection plans. These plans would provide a more honest and more accurate record of intent that readers, reviewers and, perhaps most importantly, authors themselves can trust. Once the embargoes on preregistration plans expire, the underworld of unpublished studies would be exposed and their detrimental effects could be adjusted for.
Few people at the meeting would have argued that clinical trials shouldn’t follow preregistration guidelines, and so I was surprised that many were lukewarm about trying to meet those standards in our own field. The most substantive concerns over preregistration plans were the amount of work they would entail and the notion that ecological studies encounter many more unforeseen circumstances than controlled clinical experiments. Both these concerns are, I think, overstated. Preregistration plans would surely form the backbone of methods sections, which have to be written anyway, and although ecological studies may well deviate further from initial plans, preregistration is not intended as a tool to unilaterally penalise such behaviour. It is simply an effective means by which readers can judge how decisions made after observing the data affect final research outcomes.
The capacity to predict future observations is central to science and without replication it is impossible to say whether this requirement is met. An important question is: what breadth of contexts should we allow these future observations to come from? And the answer is the breadth of contexts we wish to generalise to. A valid response to a clinical study that cannot be replicated is not ‘of course not, you used different patients’.
It is hard to say at which level results in evolutionary biology and ecology should generalise to, but I think a good answer most of the time is that they should generalise across species. Replication at this level has been called quasi-replication and I dislike the connotation that this name brings. Part of the meeting was to provide incentives for true replication, where an original study is replicated as closely as possible in the same species. Prior to the meeting I was skeptical that this level of replication is the best use of limited resources, but I was persuaded that replicating a number of influential studies makes sense for two reasons:
- Iconic studies often have an influence disproportionate to the evidence they provide, and replication can go some way to mitigating this bias.
- The discussions of many papers include (semi) plausible biological explanations for why their conclusions differ from previous studies, and only rarely is it suggested that the original study may have been a false positive.
Exact replication, with sufficient power, allows us – as far as is possible – to differentiate false positives from context dependency. I expect that exact replication will show, as logic dictates (see above) and data suggest, that context-dependency is an untenable get-out-of-jail-free card and that playing it undermines scientific progress.
The perceived benefits of replicating a study are low compared to engaging in new research. To make replication of studies more common researchers need to be offered incentives that funding bodies and journals seem unwilling to provide currently. There is hope this will change, but in the meantime many research groups continue to collect data on their study systems that could be used to validate previous findings. Allowing authors to publish short (two paragraph) addenda to their original publications would lower the costs of writing and submitting replication studies, and over time these addenda may reduce the stigma associated with publishing false positives and increase transparency.
Journals are very particular about the minutiae of how we cite published work, and yet can be relatively laissez-faire about how we report the quantitative information on which many of our scientific conclusions are ultimately based. Synthetic analyses, including meta-analyses, are the most objective way we have of drawing evidence from a body of studies and require access to this quantitative information in a useable form. Anyone who has conducted a meta-analysis will be familiar with how many studies have to be discarded because of poor reporting standards. For many common analyses, standards should be fairly straightforward to develop and would place a minimal burden on authors. Unlike citation formats, the opportunity to standardise across journals should make it a relatively painless process.
A recurrent theme throughout the meeting was that proposed solutions are open to cheating and evolutionary biologists and ecologists (despite their perfect blend of intellect and integrity) will abuse them. So be it. My guess is that this would be a very small minority, and if such people wish to waste their time and energy on false positives, more fool them. The majority won’t abuse them, and the reward that these suggestions offer should be justification enough to try them out. The alternative is more algorithm-assisted storytelling and more flukes with half-lives of a decade. We could blame faceless publishers and government agencies for any lack of change, but this is passing the buck: the editors of journals and the awarding panels of funding bodies are none other than ourselves, and the buck stops with us.