Dark Matter, Public Health, and Scientific Computing

This post originally appeared on the Software Carpentry website.

This is the text of a talk given at the 8th IEEE International Conference on eScience, October 10, 2012. The slides are also available.

Back in March, Scott Hanselman wrote a blog post titled Dark Matter Developers: The Unseen 99% that crystallized something I'd been thinking about for a while. In it, he said:

[We] hypothesize that there is another kind of developer than the ones we meet all the time. We call them Dark Matter Developers. They don't read a lot of blogs, they never write blogs, they don't go to user groups, they don't tweet or facebook, and you don't often see them at large conferences... [A]s one of the loud-online-pushing-things-forward 1%, I might think I need to find these Dark Matter Developers and explain to them how they need to get online! Join the community! Get a blog, start changing stuff, mix it up! But...those dark matter 99% have a lot to teach us about GETTING STUFF DONE... They aren't chasing the latest beta or pushing any limits, they are just producing.

I'm not as optimistic as Scott, at least not when it comes to scientific computing. I agree that 95% of them spend their time with their heads down, working hard, instead of talking about using GPU clouds to personalize collaborative management of reproducible peta-scale workflows, or some other permutation of currently-fashionable buzzwords. But it isn't because they don't know there's a better way. It's because, for them, that better way is out of their reach.

Let me back up a few years. In 1997, while I was on holiday in Venezuela, that country took delivery of its first CT scanner. It was the lead story on the evening news, complete with a few seconds of video showing a military convoy escorting the device to the university hospital. Why a military convoy? Because to get from the airport to the center of the city, the truck carrying the scanner had to pass through a slum where three quarters of a million people didn't have clean water, much less first-world health care.

That image has stuck in my head ever since because it's the most accurate summary of the state of scientific computing that I know. While you are here talking about the CT scanners of computational science, the 95% (and yes, I do think it is 95%) are suffering from computational dysentery. If you think I'm exaggerating, ask yourself:

  1. How many graduate students write shell scripts to analyze data sets in batches instead of running those analyses manually?
  2. How many use version control to track what they've done and collaborate with colleagues? (In the largest computer science department in Canada, the answer is only 10%.)
  3. How many of them routinely and instinctively break large computational problems down into pieces small enough to be comprehensible, testable, and reusable? For bonus marks, how many of them know those are really all the same thing?

Now, you could say this isn't your problem, but you'd be wrong: it's actually the biggest problem you have. Why? Because if people are chronically malnourished, giving them access to a CT scanner when they're in their twenties doesn't make a damn bit of difference to their well being.

And in many ways, you're the biggest problem they have. Why? Because you're the only "real" programmers they know, and when they come to you and ask for clean water, your answer is, "Let's talk about brain scans. They're cool."

If you set aside googling for things, the overwhelming majority of scientists don't use computers any more effectively today than they did twenty-five years ago. They're no more likely to know that routine tasks can be automated; they're no more likely to understand the difference between structured and unstructured data, and it takes them just as long to write a 300-line data analysis script as it did when people would actually get a little giddy at the thought of having a 16 megahertz Sun-3 workstation with 8 megabytes of RAM on their desk.

Let's pause for a moment and fill in some details. First, is it true that only a few percent of research scientists are computationally competent? As I said, I don't have data to put in front of you, but I've been helping scientists of all kinds do computational work since 1986, not just at supercomputing centers. Working in those gives you a biased view of the world, just like working in the CT lab at a third-world hospital whose patients can all afford first-world health care gives you a biased view of how the general population is doing. And I've been teaching scientists at universities and government labs as a full-time job for most of the last two and a half years, and talking to a wide variety of people who are doing the same thing. One percent would be pessimistic hyperbole, but there's no way the actual number is more than five percent.

Second, what do I actually mean by "computationally competent"? We've all heard of "computational thinking", but that phrase has been completely devalued by people jumping on a bandwagon without actually changing direction. When I say that someone is computationally competent, I mean the same thing I mean when I say they're statistically competent: they know enough to do routine tasks without breaking a sweat, where to look to find answers they can understand to harder problems, and when to go and find an expert to solve their problems for them. More specifically, I think a scientist is computationally competent if she knows how to build, use, validate, and share software to:

  1. manage and process data,
  2. tell if it's been processed correctly,
  3. find and fix problems when it hasn't been,
  4. keep track of what she's done,
  5. share work with others, and
  6. do all of these things efficiently.

You can't do these things without understanding some fundamental concepts—that's what "computational thinking" would mean if it still meant anything. But mastering those concepts is inseparable from mastering the tools used to put them into practice: you cannot use tools effectively if you're working by rote, but equally, you cannot grasp abstractions without concrete examples to hang them on.

Are you computationally competent? Let's find out. Please grab a pen and a piece of paper, or shut down Facebook and open an editor instead. I'm going to show you an outline of the "driver's license" exam we put together for physicists who want to use the new DiRAC supercomputing facility. I won't ask you to actually answer the questions; instead, I'll show you what you need to do in order to get full marks. For each step, give yourself one point if you're sure you could do it, half a point if you think you might come up with a solution after some struggle, zero if you're sure you couldn't, and -1 if you don't understand what the question says. Ready? Here goes.

Question 1: Check out a working copy of the examination materials from Subversion.
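
For full marks, this is a single command (plus a cd); the repository URL below is made up, since the real exam supplies one:

    svn checkout https://svn.example.org/exam/trunk exam
    cd exam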

Question 2: Use find and grep together in a single command to create a list of all .dat files in the working copy, and redirect the output to create a file called all-dat-files.txt, then commit that file to the repository.
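
One of several pipelines that would earn full marks, assuming the working copy is the current directory:

    find . -type f | grep '\.dat$' > all-dat-files.txt
    svn add all-dat-files.txt
    svn commit -m "Add list of all .dat files" all-dat-files.txt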

Question 3: Write a shell script that takes one or more numbers as command-line parameters and runs a legacy Python program once for each number.
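
A minimal sketch in bash; legacy.py stands in for the legacy program's real name:

    #!/usr/bin/env bash
    # Run the legacy program once for each command-line argument.
    for num in "$@"; do
        python legacy.py "$num"
    done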

Question 4: Edit a Makefile so that if any .dat file in the input directory changes, the program analyze.py is run to create a corresponding .out file.
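
Here is the kind of rule we're looking for. The input/ and output/ layout is an assumption, recipe lines must be indented with a tab, and listing analyze.py as a prerequisite means editing the analysis script also triggers a re-run:

    # Every input/*.dat file produces a matching output/*.out file.
    DAT := $(wildcard input/*.dat)
    OUT := $(patsubst input/%.dat,output/%.out,$(DAT))

    all : $(OUT)

    output/%.out : input/%.dat analyze.py
            python analyze.py $< > $@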

Question 5: Write four tests using an xUnit-style unit testing framework for a function that calculates running totals. Explain why you think your four tests are the most likely to uncover bugs in the function.
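
One possible full-marks answer using Python's unittest, assuming the function under test is called running_total and returns the list of cumulative sums (so running_total([1, 2, 3]) is [1, 3, 6]). The four cases target the failure modes experience says are most common: empty input, a single value, a typical case, and values that change sign:

    # test_totals.py: 'totals' is a hypothetical module name.
    import unittest
    from totals import running_total

    class TestRunningTotal(unittest.TestCase):

        def test_empty_list(self):
            self.assertEqual(running_total([]), [])

        def test_single_value(self):
            self.assertEqual(running_total([5]), [5])

        def test_several_values(self):
            self.assertEqual(running_total([1, 2, 3]), [1, 3, 6])

        def test_sign_changes(self):
            self.assertEqual(running_total([1, -1, 2]), [1, 0, 2])

    if __name__ == '__main__':
        unittest.main()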

Question 6: Explain when and how the function that calculates running totals might still produce wrong answers, even though it passes your tests.
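
One acceptable answer among several: the tests above use small integers, so they say nothing about floating-point roundoff (or, in languages with fixed-width integers, about overflow on large inputs):

    total = 0.0
    for _ in range(10):
        total += 0.1
    print(total == 1.0)  # False: roundoff accumulates
    print(total)         # 0.9999999999999999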

Question 7: Do a code review of the legacy program used in Question 3 (which is about 50 lines long) and describe the four most important improvements you would make to it.

How many of you think you'd get 7 out of 7? How many would get at least 5? How many had positive scores? Now, how many think the median score among graduate students in science and engineering would be non-negative?

And before we go on: the point of the exam isn't the specific tools. We could use Git instead of Subversion, or MATLAB instead of Python, and in fact, we're preparing variants of the exam to do exactly that. Ten years from now, the exam might allow for direct neural interfaces, but the core ideas of automating repetitive tasks and being able to tell good code from bad will, I think, remain the same.

Now, do you think that someone could use that GPU provenance peta-cloud without knowing how to do the things this test assesses? More importantly, do you think that someone who doesn't have these skills, and doesn't understand the concepts they embody, will be able to debug that GPU provenance whatever when something goes wrong? Or think of new ways to use it to advance their research? Because the real point isn't to give scientists a handful of tools—the real point is to give them what they need to build tools for themselves. And if you're only helping the small minority of scientists lucky enough to have acquired the skills that mastering your shiny toy depends on, your potential user base is many times smaller than it could be.

All right: now that we've diagnosed the problem, the cure seems obvious. All we have to do is get universities to put more computing in their undergrad programs. However, we've been banging that drum for at least twenty-five years now, with no real success. Yes, there are a few programs in physics and computing or bioinformatics, but having worked with a few of their graduates, I don't think those programs do any better than the "soak it up by osmosis in grad school" approach. The problem is that everyone's curriculum is already full to bursting. If we want to put more computing into a four-year undergrad program in chemistry, we have to drop—what? Thermodynamics, or quantum mechanics? And please don't pretend that we can just put a bit into every course. First, five minutes out of every lecture hour adds up to four courses over the span of a degree. Second, those five minutes will be the first thing dropped when the lecturer is running late. And third, are you familiar with the phrase "the blind leading the blind"?

Ah, but we have an Internet! Everything scientists need to know is online, and there are now dozens of free online courses as well. But neither forums nor a MESS (Massively Enhanced Sage on the Stage) is effective for most novices, who are still trying to construct the conceptual categories they need before they can assimilate mere information. Somebody needs to get these people from A to B so that they can get themselves from B to M, Z, θ, and beyond.

The only thing that works—at least, the only thing that has worked for us in fourteen years of experimentation—is to give graduate students a few days of intensive training in practical skills followed by a few weeks of slower-paced instruction. Let's break that down:

  • Target graduate students because they have an immediate personal need (particularly if they're six months or a year into their research and have realized just how painful it's going to be to brute-force their way to a solution), and because they have time (which faculty usually don't).
  • Teach them for a few days of intensive training because that's what they can actually schedule. At the low end, Software Carpentry's workshops are two days long (three if the host adds a day of discipline-specific material at the end). At the high end, Titus Brown's Next Generation Sequencing course at Michigan State runs for two weeks, which means there's time for volleyball and beer. Anything less than two days, and you can't cover enough to make it worthwhile. Anything more than two weeks, and people can't put the rest of their lives aside to attend.
  • Focus on practical skills so that they see benefits immediately. That way, when we come to them and say, "Here's something that's going to take a little longer to pay off," they're more likely to trust us enough to invest the required time.
  • Follow up with a few weeks of slower-paced instruction, such as meeting once a week for an hour to work through a few problems. We've tried doing this with online video conferencing, and while that's better than nothing, it's like old dishwater compared to the hearty organic beer of sitting side by side.

What do we actually teach? It depends on the audience, but our core is:

  • The Unix shell. We only cover a dozen basic commands; our real aim is to introduce people to pipes, loops, history, and the idea of scripting.
  • Python. Here, our goal is to show them how to build components for use in pipelines (so that they'll see there's no magic; there's a sketch of one such component just after this list), and when and why to break code into functions.
  • Version control, for collaboration and reproducibility.
  • Testing. We teach them to use tests to specify behavior and make refactoring safe as well as to check correctness.
  • And we usually include one other topic as well, like a quick intro to SQL or matrix programming, depending on the audience and how much time is available.
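
To make the first two bullets concrete, here is a minimal sketch (the file name is made up) of the kind of pipeline component we have people build: a Python filter that reads numbers from standard input and writes their running mean to standard output, so that it composes with other tools via shell pipes, e.g. cut -d , -f 2 results.csv | python running_mean.py:

    # running_mean.py: read one number per line from stdin,
    # print the running mean after each value.
    import sys

    def main():
        total, count = 0.0, 0
        for line in sys.stdin:
            line = line.strip()
            if not line:        # skip blank lines
                continue
            total += float(line)
            count += 1
            print(total / count)

    if __name__ == '__main__':
        main()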

All of this is "merely useful". It's certainly not publishable any longer, which means that by definition, it's not interesting to most computer scientists from a career point of view. However, two independent assessments have found that it's enough to set between a third and two thirds of scientists on the road that leads to those reproducible peta-scale GPU cloud workflows I mentioned earlier. Even if you take the lower of those two figures, that's a six-fold increase in the number of people who understand what you're trying to do, and are able to take advantage of it. If you think that's not going to help your project, you're either incredibly arrogant, hopelessly naive, independently wealthy, or a die-hard Lisp programmer.

Anatole France once wrote, "The law, in its majestic equality, forbids the rich and the poor alike to sleep under bridges, to beg in the streets, and to steal bread." Thanks to modern computers, every scientist can now devote her working life to wrestling with installation and configuration issues that she doesn't have the conceptual tools to deal with effectively.

You can help. In fact, we can't succeed without your help. As Terry Pratchett said, "If you build a man a fire, you'll keep him warm for a night. If you set a man on fire, you'll keep him warm for the rest of his life."

The first thing you can do is host a workshop. A growing number of our alumni have become instructors in their own right—there are even a few here in the audience today. They're all volunteers, so the only cost is a couple of plane tickets, a couple of hotel rooms, and a few pots of coffee. If you're willing to book a room and do some advertising, we can send people to you to get things started. This will particularly help those of you in support roles: the people who've been through workshops probably won't ask fewer questions, but they'll certainly ask better ones.

The second thing you can do is teach a workshop yourself. All of our materials are available under open licenses, and we will teach you how to use them, and how to teach more effectively in general.

Finally, you can help shine some light on the "dark matter" of scientific computing. There's a lot of discussion now about requiring scientists to share their software. What I'd like even more is for scientists to share their computational practices. I'd like every paper I review to include a few lines telling me where the version control repository holding the code is, what percentage of the code is exercised by unit tests, whether the analyses we're being shown were automated or done by hand, and so on. I'm not suggesting that we should require people to meet any particular targets—not yet, anyway—but the first step in any public health campaign has to be finding out how many people are sick with what.

To conclude, it isn't really a choice between increasing the productivity of the top 5% of scientists ten-fold or doubling the productivity of the other 95%. It's really a choice between seeing your best ideas left on the shelf because they're out of most scientists' reach, and raising up a generation of scientists who can all do the things we think are exciting. A few months shy of my fiftieth birthday, with a wonderful little girl at home who's going to inherit all the problems we didn't get around to solving, and my sister eight months dead from cancer, I know which matters more to me. If you'd like to help, please visit our web site or mail us at team@carpentries.org. We look forward to hearing from you.
