Thursday, December 27, 2012

The Reproducible Research Guilt Trip May Finally Be Paying Off

We might be closer to killing off the "Just take my word for it - I'm pretty sure I did this right" methods section

There is no shortage of well-reasoned articles filled with persuasive arguments about the need for higher reproducible research standards in the scientific literature. For all the good posts about the virtues of reproducible research, they boil down to one overarching concept:

write shit down

Why is this even an issue? Biologists in particular seem to be collectively and subconsciously reacting to those awful General Chemistry labs where they had you copy down pages of instructions verbatim into your lab notebook. It should come as no surprise that bioinformatics is ground zero for reproducibility activism.

It is unfortunate that reproducible research is tied up with all sorts of other holier-than-thou practices: open access, open source, open data, literate programming, blogging, functional programming. This all-encompassing evangelism tends to polarize people. While wonky über-programmers like C. Titus Brown lay out fundamental practices for reproducibility, most PIs have been publicly giving lip service to the idea of reproducible research, belying an "I don't wanna eat my vegetables"-type disdain. There are now "consortia" and an "initiative" to compel scientists to actually write their shit down, preferably with door prizes. If you think this has a "posture pals" (video) feel to it, you're not alone. Even as the number of pro-RR articles has steadily increased, few take them to heart.

This head-against-wall bashing has been the pattern for many years - better tools are now available (RStudio, knitr, Galaxy, cloud computing, figshare, github, bitbucket) and there is more rah-rah from the blogosphere - but there has been little enforcement from major journals. Now a recent development has raised my hopes, because it indicates editors have been tightening the screws enough to cause discomfort:

People have actually started to argue against reproducible research!

Hearts and Minds

The founder of the irreplicability movement is Christopher Drummond, author of “Reproducible Research: a Dissenting Opinion”. I will attempt to paraphrase his arguments here:
  1. Richard Feynman never had a Github account.
  2. No one is really going to read your damn code anyway.
  3. Writing shit down == A big drag, man.
  4. The Anil Potti incident proves liars always lie about their Rhodes Scholarships first. We should crack down on curricula vitae, not veritas curat.
Drummond's points are challenged here by statistician and Coursera favorite Roger Peng.

A precursor to the dissenting opinion article is Drummond's "Replicability is not Reproducibility: Nor is it Good Science". A distinction is drawn between reproducibility and replicability, the latter being what is advocated and the former being more generalizable or scientifically provable. In this view, requiring researchers to submit their data and code (replicable research) is a narrow concept really only useful for ferreting out scientific misconduct.

Black-footed Ferret
I would argue that ignorance of biological sequence analysis, and even more so of statistics, is a bigger threat than the outright fraud seen in the Duke case. Most bioinformatics manuscripts feature analysis which is not replicable, which is frightening to consider when GWAS and exome NGS variant papers implicate so many genes in disease, many of them residing along a razor-thin p-value threshold tweaked by several incomprehensible, cherry-picked program parameters.

It is not clear science can efficiently self-correct. So while replicability is not reproducibility, reproducibility is too slow to substitute for replicability. A manuscript that describes real reproducible biological phenomena is essentially conjecture until it can be repeated. The greatest ferret-legger the world has ever known will live in obscurity until they buy a ferret. We have a culture of scientists who refuse to buy a ferret.

Accounting for Tastes

The other dissenting opinion (here) is from UCSC's Kevin Karplus, who replies to Iddo Friedberg's post recommending a panel of white coat mechanics to help biologists get their code ready for publication. Karplus raises two points:
  • It is difficult to make polished software for others to use and that is not the point of research.
  • Replicability is not reproducibility.
Regardless of Friedberg's proposal, railing against "polished software" is simply a straw man argument. Reproducible research in 2013 does not mean robust, extensible, or even well-documented code. Most sequence analysis papers feature very little compiled code, but rely on a series of executable programs glued together with scripting languages, producing intermediate data that is then digested into a report, often written in R.

Getting these sequence analysis workflows to be reproducible will not require a highly skilled platoon of developers. Any willing researcher can submit a shell script or a build script of commands provided they avoid these common pitfalls:
  • Using bioinformatics web applications with no web service capability
  • Using desktop bioinformatics software with no logging capability
  • Relying on proprietary institutional databases, perhaps with stored procedures that prove too unwieldy to dump
  • Using command line programs without a directory-based bash history
  • Using Excel to manually manipulate data
As our toolset and research community mature, these excuses, er, obstacles will eventually disappear. But there is one scenario which will always be true in some of the more competitive arenas of bioinformatics programming (e.g. structure prediction, de novo assembly):
  • The researcher was perfectly capable of submitting code but decided to retain a competitive advantage.
"Over-CASPed" researchers who are unwilling to divulge their secret sauce should be relegated to appropriate sandboxes.
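
For the pitfalls listed above, here is a rough sketch of what "writing it down" could look like for a command-line workflow. It is only an illustration, not anyone's official method: the helper name run_logged and the log file name commands.log are made up for this example. The idea is to keep a record of every command right next to the data it touched, which a global bash history does not give you.

```python
import datetime
import shlex
import subprocess
import sys
from pathlib import Path

# Hypothetical file name: one log per analysis directory.
LOG_NAME = "commands.log"


def run_logged(cmd, workdir="."):
    """Run cmd (a list of arguments) inside workdir and append a
    timestamped record of it to workdir/commands.log."""
    workdir = Path(workdir)
    started = datetime.datetime.now().isoformat(timespec="seconds")
    result = subprocess.run(cmd, cwd=workdir, capture_output=True, text=True)
    # One tab-separated line per command: start time, exit status, command.
    with open(workdir / LOG_NAME, "a") as log:
        log.write(f"{started}\texit={result.returncode}\t{shlex.join(cmd)}\n")
    return result


if __name__ == "__main__":
    # Stand-in for a real aligner or variant caller invocation.
    out = run_logged([sys.executable, "-c", "print('pretend analysis step')"])
    print(out.stdout, end="")
```

Each call adds one line to commands.log in the directory where the work happened, so the log travels with the results and can be cleaned up into a shell script at submission time.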

Replication does not prove a biological truth but we often don't even have the fleeting proof that a scientist did what they said they did.

Which brings us back to those damn chemistry labs. While many public access talk shows find chemists willing to argue against evolution, you would be hard pressed to find one who would argue against writing shit down.

In other words: Not writing shit down is an even worse idea than creationism.


There, I blogged in 2012.