OPINION

September/October 2012

Open and Free: Software and Scientific Reproducibility

doi: 10.1785/0220120091

OPEN AND FREE

Recent scandals, retractions, and the proliferation of scientific research have reached a stage where scrutiny of scientific debate is now routinely reported in the public press as evidence of bungling or, worse, dishonesty in our profession (see Zimmer, 2012; Reay, 2010). High-profile cases, like the discredited cancer research at Duke University or the notorious “Climategate”, can be traced back to poor implementation of checks and balances on standard scientific practice. One remedy is to require researchers to use open methods of analysis, to share software, and to submit only reproducible analysis where it is feasible. Barriers to reproducibility are proprietary (closed) software and computer platforms that encourage, rather than prevent, sloppy documentation. It is time to remind ourselves of efforts, perhaps started in the 1980s, to reestablish standards of reproducibility (see Schwab et al., 1996).

For a couple of years now I have acted as editor-in-chief of Seismological Research Letters (SRL). In that capacity I handle papers submitted by authors all over the world. Some time ago a paper came across my desk formatted with an older version of a very popular word processor. I will not mention any names, but this software is used, nearly universally, by scientists and professionals around the world. The paper included equations formatted with an outdated system, a set of routines supplied by the corporate publisher of the word processor that has since been superseded by newer, costly upgrades.

Since the submission was formatted in this manner, I could not view the text correctly. I had to seek out an older version on a computer that somehow survived numerous upgrades since applied. Of course, I could have sent the paper back to the authors and required them to reformat the submission using a more recent version. But that would require them to purchase the new, possibly expensive, upgrades the corporate owners of the software require. For developing world colleagues this would be an onerous burden. I would rather recommend to the authors that they use one of the much more powerful, open software packages for managing word processing and numerical calculations. Without the burden of licensing, they could share with me, or anyone, the codes used to process the data and text. True, obsolete open code also exists and can be frustrating, but it does not cost much to upgrade or even to find legacy versions.

Students who take my classes are dissuaded from using proprietary software. As I explain to them, it may seem that one can log on to computers across campus and get access to bundles of software for free, but that is an illusion. Costs to the University for site licenses are enormous. As I often tell audiences around the world when I promote my own software, one day the chancellor of the University of North Carolina will be leaning over and perusing complex University budgets, looking for ways to slash and save. He will notice a line item for certain very popular software that costs a considerable amount and is licensed on a yearly basis.

“What is this?” he might ask his assistant.

“Yes, that is for software our scientists use to analyze their data.”

“Oh, I see. Well, then, let the scientists pay.”

Those who are entrenched and dependent will cry out. The rest of us may smirk and nod knowingly.

Of course, open and free software is not a new concept in the seismology/geophysics community. In fact, we are very likely among the pioneers in this endeavor (certainly in the distribution of open data). The USGS, IRIS, and UNAVCO are a few of the organizations dedicated to open, freely distributed software. The software on these sites has varying degrees of openness, and users should scrutinize license agreements carefully. Popular packages, including SAC, GMT, OpenSHA, and SEISAN, extend the flexibility of earth scientists to exchange information and analysis without being concerned with expensive licensing and, of course, overlord control. For younger researchers and students the variety of these packages and the choices available can be daunting.

REPRODUCIBILITY

The recent UseR! conference in Nashville, TN, scheduled several sessions on the problem of scientific reproducibility (http://biostat.mc.vanderbilt.edu/wiki/Main/UseR-2012). There is an ongoing and strong emphasis on developing tools that will enable statisticians to share documents that report the results of complicated analyses along with all the code and data that lead to the conclusions. In the biomedical fields, detailed documentation and more careful procedures are critical for ensuring transparency and reliability. I recommend we do a better job of training the next generation of young seismologists to adhere to these principles and to invest the extra effort that good scientific practice requires.

With multitudes of scientific journals proliferating throughout the web, it is difficult to keep up with the latest scientific results, let alone check them for correctness of analysis. This is a critical issue in fields where lives are at stake, as in the biomedical fields. Earth sciences in general, and seismology especially, are not immune to social impact. For this reason it is of utmost importance to emphasize scientific reproducibility. We should not allow our students and colleagues to insert ad hoc analyses, calculations that are not documented, and figures that have no provenance. So-called “user friendly” software (clickware) that allows authors to modify results with no trail or documentation should be disallowed.

A recent scandal in medical research uncovered a sequence of publications in high-profile journals reporting statistical analyses that could not be duplicated. Incomplete data, ambiguous analysis procedures, and misuse of statistical methodology can create chains of erroneous science that propagate through the field. Once results have been published, authors are reluctant to admit to error and will not share their results. In most cases, perhaps, the intention is not malicious. The authors in this case, however, were not forthcoming in producing the proper files for open investigation, and years passed before the improprieties were sorted out.

Proprietary software is not reproducible, unless one considers passing on the costs of purchase to those attempting to reproduce. Thus, I recommend we all return to the days of LaTeX or even troff and nroff. To some it may seem like a step backwards, but it is actually a step forward to a world of reproducibility and testable scholarly contribution. A large body of new software is available now that is easy to use, open, and free, and that incorporates structures that encourage reproducibility. I am referring to R, Sweave, knitr, SAGE, and a host of other downloadable software packages that we should all be using. See Pebesma et al. (2012) or explore http://www.r-project.org.
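
As a rough illustration of what a documented calculation with a provenance trail might look like, here is a minimal sketch. It is written in Python rather than R, uses only the standard library, and is not taken from any of the packages named above. The function names, the trimmed-mean analysis, and the sample values are all hypothetical.

```python
import hashlib
import json
import platform
import statistics


def analyze(samples, trim=0):
    """Return the mean of samples after dropping `trim` values from each end."""
    ordered = sorted(samples)
    kept = ordered[trim:len(ordered) - trim] if trim else ordered
    return statistics.fmean(kept)


def run_with_provenance(samples, trim=0):
    """Run the analysis and return the result with a provenance record."""
    # Hash a canonical text form of the input so that a reader can confirm
    # they are working from the same data.
    payload = json.dumps(list(samples)).encode("utf-8")
    return {
        "result": analyze(samples, trim=trim),
        "parameters": {"trim": trim},
        "input_sha256": hashlib.sha256(payload).hexdigest(),
        "python_version": platform.python_version(),
    }


if __name__ == "__main__":
    # Hypothetical readings, for illustration only.
    record = run_with_provenance([2.0, 4.0, 6.0, 100.0], trim=1)
    print(json.dumps(record, indent=2))
```

The point of the sketch is only that the number reported travels with the parameters, a fingerprint of the input data, and the software version used, so that someone else can rerun the script and compare.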

I realize it may be painful to learn how to report results in a more authoritative way. But I believe the effort is worth the struggle. Scandals like Climategate and other, more insidious misconduct that affects our health care system (Ioannidis et al., 2009) must be avoided, especially in a society where suspicion and mistrust are rampant. I strongly urge scientific journals to insist on open software solutions and submissions that can be checked by independent means. I am not sure when or whether SRL will be able to join this effort, but I think we should anticipate something along these lines some time in the future.

REFERENCES

Ioannidis, J. P. A., D. B. Allison, C. A. Ball, I. Coulibaly, X. Cui, A. C. Culhane, M. Falchi, C. Furlanello, L. Game, G. Jurman, J. Mangion, T. Mehta, M. Nitzberg, G. P. Page, E. Petretto, and V. van Noort (2009). Repeatability of published microarray gene expression analyses, Nat. Genet. 41, no. 2, 149–155, http://dx.doi.org/10.1038/ng.295.

Pebesma, E., D. Nüst, and R. Bivand (2012). The R software environment in reproducible geoscientific research, Eos Trans. AGU 93, no. 16, 163, http://dx.doi.org/10.1029/2012EO160003.

Reay, D. S. (2010). Lessons from climategate, Nature 467, no. 7312, 157, http://dx.doi.org/10.1038/467157a.

Schwab, M., M. Karrenbach, and J. Claerbout (1996). Making scientific computations reproducible, http://sepwww.stanford.edu/lib/exe/fetch.php?media=sep:research:reproducible:cip.pdf.

Zimmer, C. (2012). A sharp rise in retractions prompts calls for reform. The New York Times, April 16, 2012.

Jonathan M. Lees
University of North Carolina, Chapel Hill
Department of Geological Sciences
CB #3315, Mitchell Hall
Chapel Hill, NC 27599-3315
jonathan [dot] lees [at] unc [dot] edu


To send a letter to the editor regarding this opinion or to write your own opinion, you may contact the SRL editor by sending e-mail to
<srled [at] seismosoc [dot] org>.




Posted: 19 July 2012