How Software Carpentry Helped Me Write a Paper

This post originally appeared on the Software Carpentry website.

A paper I wrote, "Increasing the Efficiency of a Thermionic Engine Using a Negative Electron Affinity Collector," was recently published in the Journal of Applied Physics. I found that applying some of the best practices advocated by the Software Carpentry project eliminated much of the drudgery and tedium that lead to manual errors when developing software and composing a manuscript. As a result, I was able to focus on the scientific problems instead of on tedious bookkeeping.

As background, I have a general interest in energy because its production and efficient use are critical to modern society. Specifically, I am interested in the direct conversion of heat to electricity using a device called a thermionic engine. In the paper, I apply modern materials to the device to simulate its performance, and I show that these devices can potentially achieve greater than 20% efficiency.

I first stumbled upon Software Carpentry in 2006 or so when I was a graduate student because I knew I was bad at software development, but I had no idea how to improve. Since then, my software development skills have improved tremendously. I have contributed to the materials used in the Software Carpentry bootcamps, and I have taught at three bootcamps (University of Chicago, Johns Hopkins, and Carnegie Mellon). In this post I describe how some of what I've learned directly improved my work.

Version Control

I put every digital file associated with this project under version control; I had a repository for the source code of the software I wrote, and a separate repository for the source files for the manuscript itself. Since I was the sole author of the paper, inter-personal collaboration wasn't an issue. However, version control allowed me to work on both the code and paper from multiple computers without the tedium of manually synchronizing changes among different machines. Personally, having a system to synchronize the code and the paper was a huge relief. Before I started using version control I always had a nagging worry that somewhere along the line I forgot to sync important corrections to the code and my results would be unexpectedly but subtly wrong.

In addition to synchronization, the Git-based workflow of branching and merging allowed me to try changes to my manuscript or code without the fear of breaking some functionality that I couldn't recover later. Before making a major change, I would create a branch in the Git repo. If I ended up not liking the change, or the change didn't work as I expected, I'd just check out the master branch, nuke the feature branch, and be on my way.
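As a rough illustration of that try-it-and-throw-it-away cycle, here is a minimal sketch. It runs in a scratch repository; the file name, branch name, identity, and commit messages are all made up for the example and are not taken from my actual repos.

```shell
set -eu
# Throwaway repository so the sketch does not touch real work.
repo=$(mktemp -d)
cd "$repo"
git init --quiet
git config user.name "Example Author"       # placeholder identity
git config user.email "author@example.com"  # placeholder identity

echo "First draft" > manuscript.tex
git add manuscript.tex
git commit --quiet -m "Add first draft"
# The default branch may be "master" or "main", depending on Git configuration.
main_branch=$(git symbolic-ref --short HEAD)

# Try a risky change on its own branch (hypothetical branch name).
git checkout --quiet -b restructure-intro
echo "Experimental rewrite" > manuscript.tex
git commit --quiet -am "Try restructuring the introduction"

# Change of heart: return to the main branch and delete the experiment.
git checkout --quiet "$main_branch"
git branch -D restructure-intro > /dev/null
cat manuscript.tex
```

The capital `-D` is needed because the experimental branch was never merged; the lowercase `-d` refuses to delete unmerged work.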

Third, I relied heavily on Git's tagging feature to mark particular revisions of the manuscript. I would tag each commit that was sent to colleagues for review, and I would tag each revision I sent to the editor during the review process. I used tags structured as YYYY-MM-DD, and described the tagged revision (who I sent it to, what step in the process I was at, etc.) in the tag's message.
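A minimal sketch of the tagging step, assuming annotated tags (`git tag -a`) carry the description. The scratch repository, file name, identity, and messages are placeholders for illustration.

```shell
set -eu
# Scratch repository with a single commit to tag.
repo=$(mktemp -d)
cd "$repo"
git init --quiet
git config user.name "Example Author"       # placeholder identity
git config user.email "author@example.com"  # placeholder identity
echo "Draft for review" > manuscript.tex
git add manuscript.tex
git commit --quiet -m "Finish draft for internal review"

# Name the tag after today's date (YYYY-MM-DD) and record who received it.
tag=$(date +%Y-%m-%d)
git tag -a "$tag" -m "Sent to colleagues for review"

# List tags together with the first line of each tag message.
git tag -n1
```

Because the date format sorts lexicographically, a plain `git tag` listing doubles as a chronological record of what was sent and when.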

Finally, cloning repositories across multiple machines is an implicit backup strategy (not the best, but better than nothing). If my notebook caught on fire, at worst I was only a few commits behind on my desktop machine.

I found that the proper use of version control completely eliminates an entire class of problems from the software development process as well as the manuscript composition process. I delegated the administrative and bookkeeping problems to the version control system and I was therefore able to spend my time and creative energy on the scientific problems instead.

Open-Source Tools

All the code and third-party modules used to perform the calculations appearing in the paper (Python, NumPy, SciPy, matplotlib), the tools I used to manage my workflow (Git, Make, Bash, IPython, Homebrew, virtualenv, Sphinx), as well as the system I used to prepare the manuscript (LaTeX and BibTeX) are non-proprietary and open-source. This decision was driven by practicality and not ideology. The major advantage of open-source tools is that they remove the need to ask permission to do your work. Instead of spending my time getting purchase orders prepared to buy software licenses, or tracking down IT people to install license files on my machines, I could simply brew install what I needed. As a result, more of my time was available to spend solving the scientific problem at hand.

Automation

One day during the preparation of this manuscript, I discovered the Makefile for building LaTeX and my whole life improved. Automating the build for the manuscript such that the output files were contained in their own folder (which I could .gitignore) had the surprising side-effect of decreasing the cognitive load of composing the paper. I'm not sure why those extra files were such a mental drag, but being able to issue a make display command at the command prompt freed up my energy to focus on the issues of structuring the argument of my paper. In the future, I plan on extending the Makefile functionality to recalculate data or re-plot figures when necessary.
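The sketch below is not my actual Makefile; it is a stripped-down guess at the idea, under several assumptions: the manuscript is a single file called manuscript.tex, output goes to a folder called build, `pdflatex` is the LaTeX engine, and the display target opens the PDF with the macOS `open` command. It only performs a dry run, so nothing is compiled.

```shell
set -eu
# Scratch directory with a stand-in manuscript (file names are hypothetical).
work=$(mktemp -d)
cd "$work"
touch manuscript.tex

# The "target: prerequisite ; command" rule form avoids tab-indented recipes.
cat > Makefile <<'EOF'
.PHONY: display
display: build/manuscript.pdf ; open build/manuscript.pdf
build/manuscript.pdf: manuscript.tex ; mkdir -p build && pdflatex -output-directory=build manuscript.tex
EOF

# Keep every generated file out of version control.
echo "build/" > .gitignore

# Dry run: print the commands make would execute without running them.
make -n display
```

Dropping the `-n` runs the build for real, and a full manuscript would also need a BibTeX pass and repeated LaTeX runs, which this sketch leaves out.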

Documentation

I made every effort to write good documentation for all the submodules, classes, and methods of my library as well as notes to explain the design decisions I made during development. I wrote this documentation with future collaborators in mind: I believe my code is good, but the real test is whether other people use it. Not understanding how a software package works is a huge barrier for someone to use it, and I want to lower that barrier as much as possible. Since I wrote the documentation simultaneously with the software development, I was able to easily build HTML docs for the library with Sphinx.

Conclusions

I didn't hit all the points that were mentioned in the Best Practices paper, but hopefully you have gotten a sense of how some of these techniques helped me. If you are wondering if learning a seemingly arcane system like Git or Make can help you, I can emphatically say that they helped me. I am not an expert at any of these systems, but even a small amount of knowledge allowed me to delegate most non-science tasks to my computer so I could spend my time on solving the scientific problems.

Postscript

Please feel free to fork or star my code on GitHub. The scripts I used to calculate the data can be found as supplementary material on the JAP website. You can install the module if you have pip (and the required dependencies) by simply executing

pip install git+git://github.com/jrsmith3/tec.git

at the command line.

I would love to hear about your research, particularly if you are interested in collaborating. Please don't hesitate to email me at joshua.r.smith@gmail.com.