Five Rules for Computational Scientists

This post originally appeared on the Software Carpentry website.

Stepping back from the details for a moment, here are five rules every computational scientist should (try to) follow:

1. Version Control

Put every primary artifact (source code, raw data files, parameters, etc.) in a version control system so that you have a record of exactly what you did, and when. There's no need to store things you can re-create, such as the graphs you generate from your data files, as long as you have the raw material archived and timestamped.

The one major exception to this rule is very large data sets: tools like Subversion aren't designed to handle the terabytes and petabytes that come out of the LHC or the Hubble. However, the teams managing those experiments include people whose job is archiving and managing data using specialized (and often one-of-a-kind) systems.

2. Provenance

Track the provenance of your code and data. Museums use the term "provenance" to mean the paper trail of ownership and transfer for a particular piece; in the scientific world, provenance is a record of what raw data was combined or processed to produce a particular result, what tools were used to do the processing, what parameters were given to those tools, and so on. If raw data sources and source files have unique version numbers (which they will if you're keeping them in a version control system), then it's a simple matter of programming to copy those IDs forward each time you derive a new result, such as an aggregate data set, a graph for a paper, or the paper itself.
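As a rough illustration of copying those IDs forward, here is a minimal Python sketch that writes a small JSON "sidecar" file next to each derived result. The function names, the sidecar naming scheme, and the use of a SHA-256 hash to identify each input file are all assumptions made for this example, not part of any standard; the revision ID is simply whatever string your version control system reports.

```python
import hashlib
import json
import platform
from datetime import datetime, timezone
from pathlib import Path


def file_sha256(path):
    """Return the SHA-256 hex digest of a file's contents."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        # Read in chunks so large data files do not have to fit in memory.
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def provenance_record(inputs, code_version, parameters):
    """Describe how a derived result was produced.

    inputs: paths of the raw data files that were read
    code_version: revision ID reported by the version control system
    parameters: dictionary of settings passed to the analysis
    (All three names are hypothetical, chosen for this sketch.)
    """
    return {
        "created": datetime.now(timezone.utc).isoformat(),
        "python": platform.python_version(),
        "code_version": code_version,
        "parameters": parameters,
        "inputs": [
            {"path": str(path), "sha256": file_sha256(path)} for path in inputs
        ],
    }


def write_with_provenance(result_path, result_text, record):
    """Write a result file plus a JSON sidecar holding its provenance."""
    result_path = Path(result_path)
    result_path.write_text(result_text)
    sidecar = result_path.with_name(result_path.name + ".prov.json")
    sidecar.write_text(json.dumps(record, indent=2, sort_keys=True))
    return sidecar
```

A script that produces an aggregate data set could call `provenance_record([...], revision_id, params)` once and pass the record to `write_with_provenance`, so every derived file carries a note of which inputs, code revision, and parameters it came from.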

The good news is, tools to do this tracking automatically are finally entering production: see the Open Provenance Model website for updates on efforts to standardize the kinds of information they record, and how they communicate.

3. Design for Test

Write testable software. Tangled monolithic programs are very hard to test; instead, programs should be built out of small, more-or-less independent components, each of which can be tested in isolation. Building programs this way requires discipline on the part of the developer, but there are lots of places to turn for guidance, such as Michael Feathers' excellent book Working Effectively With Legacy Code.
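To make "small, more-or-less independent components" concrete, here is one possible shape for a tiny analysis script, sketched in Python. The task (averaging a column of readings) and all of the function names are invented for illustration. The point is only that parsing, computing, and file handling live in separate functions, so the first two can be tested without touching the disk.

```python
import csv


def parse_readings(lines):
    """Turn an iterable of CSV lines into a list of floats, skipping blank lines."""
    reader = csv.reader(lines)
    return [float(row[0]) for row in reader if row and row[0].strip()]


def mean(values):
    """Arithmetic mean; raises ValueError on empty input instead of dividing by zero."""
    if not values:
        raise ValueError("mean() needs at least one value")
    return sum(values) / len(values)


def summarize(lines):
    """Combine the two pieces above; still free of any file or screen I/O."""
    readings = parse_readings(lines)
    return {"count": len(readings), "mean": mean(readings)}


def main(path):
    """The only function that touches the file system or prints anything."""
    with open(path, newline="") as handle:
        print(summarize(handle))
```

Because `summarize` accepts any iterable of strings, a test can hand it a plain list such as `["1.0", "2.0", "", "3.0"]` instead of a real data file.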

Modularizing code and defining clear interfaces between modules also helps speed things up. One application programmer working at Lawrence Livermore National Laboratory typically found that simply by tidying up the code scientists brought to him, he could speed it up by a factor of 10 or 20, even before parallelizing it (which he could only do after cleaning it up).

4. Test

Actually test the software you've written. Yes, it's much harder to test most scientific applications than it is to test games or banking software, both because of numerical accuracy issues, and because scientists usually don't know what the right answer is. (If they did, they'd be writing up their paper, not writing software.) However, as Diane Kelly and other researchers have found, there's a lot scientists can do. Run simple cases that can be solved analytically; compare the program's output against experimental data; compare the output of the parallel Fortran version using the hyper-efficient algorithm against the output of the sequential MATLAB version using the slow, naive, but comprehensible algorithm, and so on. And do code reviews: study after study has shown that having someone else read your code is the most effective, and most cost-effective, way to find bugs in it.
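Two of the strategies above, checking a case with a known analytic answer and comparing an optimized version against a slow but obvious one, can be sketched in a few lines of Python. The trapezoid-rule integrator here is only a stand-in for whatever your real code computes, and the tolerances are assumptions that would need to be chosen with the numerical accuracy of your own problem in mind.

```python
import math


def trapezoid_naive(f, a, b, n):
    """Slow, obvious composite trapezoid rule: add up the area of each panel."""
    h = (b - a) / n
    total = 0.0
    for i in range(n):
        left = a + i * h
        right = a + (i + 1) * h
        total += 0.5 * h * (f(left) + f(right))
    return total


def trapezoid_fast(f, a, b, n):
    """Same rule, but each interior point is evaluated only once."""
    h = (b - a) / n
    interior = math.fsum(f(a + i * h) for i in range(1, n))
    return h * (0.5 * (f(a) + f(b)) + interior)


def test_analytic_case():
    # The integral of sin(x) over [0, pi] is exactly 2, so the numerical
    # answer should agree to within the method's discretization error.
    approx = trapezoid_fast(math.sin, 0.0, math.pi, 1000)
    assert math.isclose(approx, 2.0, abs_tol=1e-5)


def test_against_reference():
    # No analytic answer is needed here: the fast version only has to
    # agree with the comprehensible one up to floating-point round-off.
    def f(x):
        return math.exp(-x * x)

    for n in (1, 2, 10, 257):
        fast = trapezoid_fast(f, -1.0, 2.0, n)
        slow = trapezoid_naive(f, -1.0, 2.0, n)
        assert math.isclose(fast, slow, rel_tol=1e-12)
```

Note that neither test uses exact equality: floating-point results are compared with a tolerance, which is one way of living with the numerical accuracy issues mentioned above.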

5. Review

Finally and most importantly, insist on access to the software used to produce the results in papers you are reviewing. No, you won't be able to read or review all of the ATLAS particle detector software, and no, the folks at Wolfram Research aren't going to give you the source of Mathematica, but not having access to the engineering schematics of today's high-throughput sequencing machines doesn't stop us from reviewing the rest of our peers' wet lab protocols. Most scientific software is neither very large nor closed source; we can and should start to treat it according to the same rules we've used for physical experiments for the last 300 years.