Three Studies (Maybe Four)

This post originally appeared on the Software Carpentry website.

We're in the thick of picking students and projects for Google Summer of Code, which has inspired some less-random-than-usual thoughts. Here are two studies I'd like to do (or see done):

  1. What has happened to previous students? How many are still involved in open source? How many have gone on to {start a company, grad school, prison}? What do they think they learned from the program? How much of the software they wrote is still in use? Etc.
  2. Every one of the 175 organizations blessed by Google this year is using the same web application for collecting and voting on projects. From what I can tell, they're all using it in different ways: +4 means something very different to the Python Software Foundation than it does to Eclipse or SWIG. They're also using a bewildering variety of other channels for communication: wikis, IRC, Skype chat sessions, mailing lists (the most popular), and so on. Why? Is this another reflection of Jorge Aranda's finding that every small development group evolves a different process, but all those processes "work" in some sense, or is it—actually, I don't have any competing hypotheses right now, but I'm sure there are some.

And while we're on the subject of studies, I just read Hochstein et al.'s paper "Experiments to Understand HPC Time to Development" (CT Watch Quarterly, 2(4A), November 2006). They watched a bunch of grad students at different universities develop some simple parallel applications using a variety of tools, and measured productivity as (relative speedup)/(relative effort), where relative speedup is (reference execution time)/(parallel execution time) and relative effort is (parallel effort)/(reference effort). The speedup measure is unproblematic, but as far as I can tell, they don't explain where their "reference effort" measure comes from. I suspect it's the effort required to build a serial solution to the problem, and that "parallel effort" is then the additional time required to parallelize; I've mailed the authors to ask, but haven't heard back yet.
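To make the metric concrete, here's a minimal sketch in Python of how the numbers combine. The timings and efforts are made up, and it assumes (as I guessed above) that "reference effort" is the time to build the serial solution and "parallel effort" is the extra time needed to parallelize it:

```python
# A minimal sketch of the productivity metric described above, with made-up
# numbers. It assumes the "reference" program is the serial version.

def productivity(reference_exec_time, parallel_exec_time,
                 reference_effort, parallel_effort):
    """Productivity = relative speedup / relative effort."""
    relative_speedup = reference_exec_time / parallel_exec_time
    relative_effort = parallel_effort / reference_effort
    return relative_speedup / relative_effort

# Hypothetical example: the serial run takes 800 s and the parallel run 100 s
# (an 8x speedup); the serial version took 10 hours to write and parallelizing
# it took another 20 hours (relative effort of 2). Productivity = 8 / 2 = 4.
print(productivity(reference_exec_time=800.0, parallel_exec_time=100.0,
                   reference_effort=10.0, parallel_effort=20.0))
```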

I wasn't surprised when I realized that the authors hadn't done the other half of the study, i.e., they hadn't benchmarked the productivity of a QDE (quantitative development environment) like MATLAB—many people talk and think as if scientific computing and high-performance computing were the same thing. At first glance, it doesn't seem like it would be hard to do—you could use the performance of the MATLAB or NumPy code over the performance of a functionally equivalent C or Fortran program for the numerator. You have to be careful about the denominator, though: if my guess is right, then if things were done in real-world order, you'd be comparing:

(time to write parallel code after writing serial code) / (time to write serial code from scratch)

vs.

(time to write MATLAB from scratch) / (time to write serial code having written MATLAB)
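Here's a toy calculation with entirely hypothetical hours to show why the ordering matters: the task done second (with the first version already in hand) sits in the numerator of one ratio but the denominator of the other, so the two ratios aren't measuring the same kind of work.

```python
# A toy illustration of the ordering problem, with entirely hypothetical hours.

# HPC case: serial code is written from scratch, then parallelized.
serial_from_scratch = 10.0       # hours: serial C/Fortran, written first
parallel_after_serial = 20.0     # hours: parallelizing it, serial code in hand
hpc_effort_ratio = parallel_after_serial / serial_from_scratch    # 2.0

# QDE case: the MATLAB/NumPy version comes first, the serial code second.
matlab_from_scratch = 4.0        # hours: MATLAB/NumPy version, written first
serial_after_matlab = 8.0        # hours: serial C/Fortran, MATLAB version in hand
qde_effort_ratio = matlab_from_scratch / serial_after_matlab      # 0.5

# The "written second" task benefits from the first attempt, but it appears in
# the numerator of one ratio and the denominator of the other.
print(hpc_effort_ratio, qde_effort_ratio)
```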

Even with that, I strongly suspect that MATLAB (or any other full-featured QDE) would come out well ahead of any parallel programming environment currently in existence on problems of this size. Yes, you need big iron to simulate global climate change over the course of centuries, but that's not what most scientists do, and the needs of that minority shouldn't dominate the needs of the desktop majority.

I'd also be interested in re-doing this study using MATLAB parallelized with Interactive Supercomputing's tools. I have no idea what the performance would be, but the parallelization effort would be so low that I suspect it would once again leave today's mainstream HPC tools in the dust.

And now let's double back for a moment. I used the phrase "desktop majority" a couple of paragraphs ago, but is that really the case? What do most computational scientists use? What if we include scientists who don't think of themselves as computationalists, but find themselves doing a lot of programming anyway, just because they have to? If you plotted rank vs. frequency, would you get a power law distribution, i.e., does Zipf's Law hold in scientific computing? Last term, I calculated a Gini coefficient for each team in my undergraduate software engineering class using lines of code instead of income as a raw metric; what's the Gini coefficient for the distribution of computing cycles used by scientists (i.e., how evenly or unevenly is computing power distributed)? And how should the answers to these questions shape research directions, the development of new tools, and what we teach in courses like Software Carpentry?
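For anyone who wants to repeat that classroom exercise, here's a minimal sketch of the calculation. The lines-of-code counts are invented, and the function is just the standard sample formula for the Gini coefficient applied to per-student contributions:

```python
# A minimal sketch of the Gini-coefficient calculation mentioned above, using
# lines of code per team member instead of income (all values hypothetical).

def gini(values):
    """Gini coefficient of non-negative values: 0 means perfectly even,
    values near 1 mean the total is concentrated in a few contributors."""
    xs = sorted(values)
    n = len(xs)
    total = sum(xs)
    weighted = sum((i + 1) * x for i, x in enumerate(xs))
    return (2.0 * weighted) / (n * total) - (n + 1.0) / n

# Hypothetical five-person team: lines of code contributed by each student.
print(gini([1200, 800, 300, 150, 50]))   # roughly 0.47
```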
