What We Teach in Two Days
This post originally appeared on the Software Carpentry website.
This week's workshops at MBARI and NERSC both had more lecturing and less hands-on practical work than either I or the students would have liked, but when we're trying to squeeze so many things into two days, that's probably unavoidable. We hope that the online tutorials we're going to run over the next few weeks will make up for that by giving learners a chance to practice their skills at leisure.
On the other hand, I'm quite pleased with the topics and sequence: I think we did a pretty good job of explaining how to do scientific data processing, and how the pieces fit together. Here's what we covered:
- The morning of day 1 is the Unix shell. After `ls`, `cd`, `mkdir`, `rm`, `mv`, and an editor [1], we introduce text filters like `head`, `tail`, `wc`, `sort`, `uniq`, and `cut` so that we can teach pipes and redirection. We then spend the middle of the morning on the Unix philosophy of "little pieces loosely joined", and wrap up by showing them how to save commands in files (to re-execute), and how to use for-loops to run their data pipelines once for each source file. We also talk about repeating commands with up-arrow or `!123`, and about using `history | tail -whatever > the-steps-i-used.txt` to keep a record of how they produced results.
- The afternoon of day 1 is a quick (and unfortunately shallow) introduction to Python. "Open a file, for-loop over the lines, convert them from strings to floats, add 'em up, and print the total" is the first hour's goal; once they've got that, we cover `if` statements, command-line arguments, and standard input and output, so that by the end of the afternoon they are building little tools of their own that play nicely in a Unix pipeline. (For example, Michelle Levesque had the NERSC students implement very simple versions of `head` and `cut`.) We close off by showing them how to factor repeated code into functions, and how to put those functions into files of their own so that they can be re-used in several different tools. We tell them (but don't actually show them) that all of these ideas apply equally well to R, MATLAB, Perl, or whatever else they want to use; we also point out things like using sensible variable names, breaking code into digestible lumps, and other transferable bits of programming hygiene.
- The morning of day 2 is version control [2]. We start with the introduction that's on the web site, which is the only time we use slides (everything else is live coding), then walk them through the update-merge-edit-commit cycle. We also show them how to use `svn status`, `svn log`, `svn blame`, and `svn revert`, but do not actually show them how to merge things (either across branches or from old revisions to new ones): based on past experience, that's a step too far for an introductory lecture. What we do instead is show them how to use keyword expansion to put the revision numbers of files into the files themselves, so that they can start tracking data provenance with just a few extra lines of code in their pipelines. This is the capstone of the "how to program" part of the bootcamp.
- The afternoon of day 2 introduces the basics of SQL: filtering, aggregation (but not `group by`), simple joins, `NULL` if there's time (and my voice hasn't run out), `insert` and `delete`, and then how to put SQL in a Python program. Again, we emphasize that the ideas transfer to other languages, and that database queries can and should be thought of as just another stage in a pipeline.
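
To make the day 1 Python goal concrete, here is a minimal sketch of the kind of little tool described above: it adds up one number per line, reading either from files named on the command line or from standard input. The names `total` and `main` are illustrative choices, not the code actually written in the workshops, and the demonstration at the end uses in-memory text in place of a real file or pipe.

```python
import io
import sys


def total(lines):
    """Add up the numbers found one per line, skipping blank lines."""
    result = 0.0
    for line in lines:
        text = line.strip()
        if text:                       # ignore empty lines
            result += float(text)      # convert the string to a float
    return result


def main(filenames, stdin=sys.stdin, stdout=sys.stdout):
    """Sum the named files, or standard input if no files are given."""
    if not filenames:
        grand_total = total(stdin)
    else:
        grand_total = 0.0
        for name in filenames:
            with open(name) as reader:
                grand_total += total(reader)
    stdout.write(f"{grand_total}\n")


# Demonstration: in-memory text stands in for data arriving on a pipe.
demo_input = io.StringIO("1.5\n2.5\n\n4\n")
main([], stdin=demo_input)             # prints 8.0
```

Saved in a file (say, a hypothetical `total.py`) with the demonstration replaced by `main(sys.argv[1:])`, something like this could sit at the end of a pipeline, for example `cut -f 2 data.txt | python total.py`. Keeping `total` as a separate function is what allows it to be moved into a file of its own and re-used by other tools.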
The big idea that ties all of this together isn't actually the Unix philosophy; it's that programming is a human activity:
- Short-term memory can only hold so much at a time, so build things to fit into it.
- We're most productive when we're not being interrupted (or interrupting ourselves), so use tools that support an interactive do-and-see flow.
- People are fallible, so make defense in depth a habit (i.e., check your data, figure out how to test things before you write them, run regression tests, etc.).
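
As one possible illustration of defense in depth, the sketch below checks its data before using it and comes with a few small tests that can be re-run after every change. The function `mean_reading`, its default bounds, and the test names are all hypothetical, chosen only for this example.

```python
import math


def mean_reading(values, low=-50.0, high=60.0):
    """Return the mean of the readings after checking they look sane.

    The default bounds are illustrative assumptions, not real limits.
    """
    if not values:
        raise ValueError("no readings supplied")
    for value in values:
        if math.isnan(value) or not (low <= value <= high):
            raise ValueError(f"suspicious reading: {value}")
    return sum(values) / len(values)


def test_mean_of_known_values():
    # Expected answer worked out by hand before writing the function.
    assert mean_reading([10.0, 20.0, 30.0]) == 20.0


def test_rejects_empty_input():
    try:
        mean_reading([])
    except ValueError:
        return
    raise AssertionError("empty input was accepted")


def test_rejects_out_of_range_reading():
    try:
        mean_reading([10.0, 999.0])
    except ValueError:
        return
    raise AssertionError("out-of-range reading was accepted")


def run_tests():
    """Run every test; re-running after each change is the regression check."""
    tests = [
        test_mean_of_known_values,
        test_rejects_empty_input,
        test_rejects_out_of_range_reading,
    ]
    for test in tests:
        test()
    return len(tests)


print(f"{run_tests()} tests passed")   # prints "3 tests passed"
```

The point is less the particular checks than the habit: bad input fails loudly and early, and the tests record what "correct" meant at the time the code was written.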
So that's what we do. I think it works well—I'd enjoy hearing everyone else's thoughts.
[1] If learners already use a plain-text editor, we encourage them to keep using that; otherwise, we show them Nano, not because anyone should actually use it for programming, but because it's so simple that we don't really have to explain anything more than "control-X to exit".
[2] Unless Dreamhost has screwed up creation of a temporary Subversion repository for students to use, in which case some last-minute juggling is required.