Data Science Workflows
This post originally appeared on the Software Carpentry website.
Half a dozen of us got together yesterday morning to chat about *The Bad Data Handbook*, what the curriculum for a Software Carpentry-style bootcamp for data scientists ought to be, and a bunch of other things. The Etherpad is unfortunately down right now, so you can't read David Flanders' excellent notes, but as one outcome of the discussion, I'd like to ask you all to do a bit of homework.
First, though, some context. One of the best ways to design instructional material is called "backward design". It's described in Wiggins & McTighe's *Understanding by Design*, but since that book takes four pages to say what most people would understand in two sentences, here's the gist:
- Write the final exam.
- Work backward from that to write the exercises learners will do to practice for what's on the exam.
- Figure out what you need to tell them so that they can do those exercises.
This summary makes it sound like "teaching to the test", but it's not. Instead, the point of writing the exam first is to be as concrete as possible, as early as possible, about what you're actually going to change in your learners' heads. It's all too easy to say, "Students will understand basic image processing", but three different people will interpret that four different ways. Operationalizing it in the form of an exam gives everyone a straw man to argue over.
Two dozen of us went through this exercise last week at a workshop on what to teach biologists about computing. The first step was to put together a "driver's license" exam—something that would tell you fairly quickly whether the author of the paper you were about to review was computationally competent. I posted the groups' responses a couple of days ago. As a follow-on, we did before-and-after user stories: what do people do now, and what will they (hopefully) do better after our training? Those responses are also online.
To focus discussion, and to help us figure out what's common to everyone's data-crunching needs and what's specific to particular disciplines, I'd like to ask you all to write and post a point-form description of an actual data-crunching task that you, or someone you work with, recently did:
- Pick something that took about an hour to do (because these scenarios are things we might actually work through live in a classroom).
- If possible, choose a "first contact" scenario, i.e., describe your first encounter with a particular data set.
- Go all the way from "find the URL for the data so I can download it" to "here's a usable result".
- Outline your activities at the level of one bullet point for every 2-3 minutes of work, so that you wind up with 20-30 bullet points.
- Give the names of tools and commands, but more importantly, explain briefly *why* you're doing each step.
- Include mistakes: if you accidentally overwrite a file, or join two data sets the wrong way and don't realize it for 20 minutes, it'll be informative to see how you eventually spot the mistake and recover from it.
When you're done, please post your result on the web and add a link as a comment on this blog post. I'll collate the submissions, code each step, and then sort by frequency to give us an idea of how often people do various things. I expect we'll see a long-tailed distribution: a handful of common tasks that almost everyone does, plus many rarer, discipline-specific ones. The common tasks then become the core of our curriculum.
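To make the collating step concrete, here's a minimal sketch of what I have in mind. It assumes each submitted scenario has been hand-coded into short task labels (one per line, in plain-text files under a `coded/` directory); the directory name, file layout, and labels are all hypothetical:

```python
# A minimal sketch of the collate-and-sort-by-frequency step.
# Assumes each scenario's bullet points have been hand-coded into
# short task labels (e.g. "download", "clean", "join", "plot"),
# one label per line, in *.txt files under ./coded/ -- all of these
# names are assumptions, not an agreed format.
from collections import Counter
from pathlib import Path

counts = Counter()
for path in Path("coded").glob("*.txt"):
    with path.open() as reader:
        for line in reader:
            label = line.strip().lower()
            if label:                     # skip blank lines
                counts[label] += 1

# Most common tasks first: candidates for the core curriculum.
for label, count in counts.most_common():
    print(f"{count:4d}  {label}")
```

The output is just a ranked list of task labels, which should make the head and the tail of the distribution easy to eyeball.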
If you have the hour this will take, I'd be grateful if you could get stuff to the list or online by the end of next week (Friday July 26) so that I can have results back the Tuesday after.