Wanted: An Entry-Level Provenance Library

This post originally appeared on the Software Carpentry website.

One of the reasons we keep teaching Subversion is that it allows us to show students a simple but useful trick. If you add the following to a text file:

$Revision: $

and then tell Subversion to set the "Revision" keyword on that file, the next time you commit it, Subversion will automatically update the text to:

$Revision: 423$

or whatever the revision number actually is. This is handy if you're mailing files around, and want people to be able to tell exactly which revision they have, but what makes it really useful is this:

1. Embed the revision number in a string:

version_string = "$Revision: 423$"

2. Extract it (I'll show the Python, but the trick works in any language):

version_number = int(version_string.strip("$").split()[1]) # version_number is now 423

3. Print this as a comment at the start of any output, along with parameters:

import sys

print('#', sys.argv[0], version_number)
print('# ...alpha', alpha)
print('# ...beta', beta)
for result in all_results:
    print(result)

so that the program's output is:

# analyze.py 423
# ...alpha 0.5
# ...beta 1.7
22,43,17.5
22,44,18.5
...,...,... # and so on

This is a quick and easy way to keep track of the provenance of the data: if done systematically, it ensures that every result contains a record of how it was produced.
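The three steps above can be folded into one small helper. What follows is only a minimal sketch, not part of any existing library: the name `provenance_header` and its parameters are made up for illustration, and it assumes Subversion has already expanded the keyword string.

```python
import sys


def provenance_header(version_string, program=None, **params):
    """Return comment lines recording program name, revision, and parameters.

    Hypothetical helper for illustration; not taken from any existing library.
    """
    # "$Revision: 423$" -> 423 (the same extraction as in step 2)
    version_number = int(version_string.strip("$").split()[1])
    if program is None:
        program = sys.argv[0]
    lines = ["# {} {}".format(program, version_number)]
    # Sort parameter names so the header is identical from run to run
    for name in sorted(params):
        lines.append("# ...{} {}".format(name, params[name]))
    return lines


if __name__ == "__main__":
    header = provenance_header("$Revision: 423$", program="analyze.py",
                               alpha=0.5, beta=1.7)
    for line in header:
        print(line)
```

Returning a list of lines rather than printing directly lets the same header go to standard output, a results file, or a log.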

Of course, a real provenance system needs to do more than this: it needs to track the inputs to the program, so that if analyze.py was run on something preprocess.py produced, we can trace backward from analyze.py's output all the way to preprocess.py. There was an abortive effort a few years ago to standardize provenance information, but it got bogged down in XML schemas and ontologies and all the other details that standards committees love and working scientists find irrelevant.
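One way to picture that backward trace, well short of a full provenance system, is to have each program copy the comment header of its input into its own header. The sketch below assumes the `#`-prefixed header convention shown earlier. The function names `read_header` and `chain_header`, the `# input:` marker, and the sample revision and parameter for preprocess.py are all invented for illustration.

```python
def read_header(lines):
    """Return the leading '#' comment lines of an input, without newlines."""
    header = []
    for line in lines:
        if not line.startswith("#"):
            break  # the first data line ends the header
        header.append(line.rstrip("\n"))
    return header


def chain_header(own_header, input_header):
    """Append the input's header to our own, marked as inherited."""
    # "# input:" is a made-up marker, not an established convention
    inherited = ["# input: " + line.lstrip("# ") for line in input_header]
    return own_header + inherited


if __name__ == "__main__":
    # Stand-in for a file written by preprocess.py (values are hypothetical)
    preprocess_output = ["# preprocess.py 410\n", "# ...threshold 3\n",
                         "22,43\n", "22,44\n"]
    own = ["# analyze.py 423", "# ...alpha 0.5"]
    for line in chain_header(own, read_header(preprocess_output)):
        print(line)
```

With this in place, analyze.py's output would carry both its own revision and parameters and those of the preprocess.py run that fed it.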

What the scientists we're trying to help actually need right now is something a lot simpler: a suite of interoperable libraries for various languages that are no more complicated than the various xUnit libraries for testing, or the argparse and CLI libraries for parsing command-line arguments in Python and Java respectively. It's OK if those libraries don't capture all the information that anyone might conceivably want; what's most important is that they capture enough to be useful, with close to no effort on the scientist's part, so that we can get this ball rolling.

If this sounds like something you'd be interested in helping with, please give us a shout. It would be a good contribution to the scientific programming community, and a good way to meet other believers in better scientific software.