Python Supercomputing Statistics
This post originally appeared on the Software Carpentry website.
I have a major grant application and (hopefully final) revisions to my next children's book due on Friday, so of course I'm reading white papers about Python-friendly supercomputing from Interactive Supercomputing, a Boston-area firm that's about three years old. IS offers several kinds of parallelism for MATLAB, Python, R, SAS, and other high-level languages; I don't know if their tools are any easier to use than anyone else's, but they have an impressive team (including Russ Barbour, ex-Apollo, and Steve Reinhardt, ex-Cray).
What's more immediately interesting to me are two of their papers (free, but registration required). The first, "Python Technical Computing End-User Study", was prepared by Fletcher Spaght, Inc.; based on 604 responses to a survey, it concludes that:
- significantly increased performance of Python codes would cause large or revolutionary improvements for 35% of technical users (8% would experience revolutionary benefits from a 10X performance boost);
- most (52%) organizations using Python for technical applications consider their codes to be important to accomplishing their mission;
- technical Python program run times are long (31% typically over 1 hour);
- Python data sets are large (41% GBs or larger) for technical applications;
- large amounts of time are spent optimizing codes to run them productively on desktop workstations;
- in organizations using Python, tools such as C (91%), MATLAB (49%), and Fortran (32%) are also widely used for developing technical applications;
- most (63%) organizations surveyed are interested in running Python on HPC resources and at least 65% of Python technical users have access to such systems; and
- half of survey respondents have ported their technical Python codes [doesn't say to what], but only 17% do so with any frequency.
Some of the details in the paper are interesting too. 36% use Python for test & measurement, 29% for communications [presumably communications applications, rather than inter-application communication and coordination, but this is not clear], and 24% each for signal/image processing and physical design. 33% describe their use of Python as a "glue language", while 42% use the numerical libraries, and 24% use external libraries. 91% of users also use C/C++, 49% use MATLAB, 32% use Fortran, and 22% each use Mathematica and R.
The other paper was prepared by the Simon Management Group. Its conclusions are more motherhood-and-apple-pie-ish: for example, "HPC software development environments vary widely by factors such as size and focus." There are still a few interesting items, though: the median team size is 4-6 developers, 50% of respondents report that their organization works on 1-5 projects at a time (and 11.5% report working on more than 30 at a time), the expected median data set within three years ranges from 200 to 600 GB, and 42% indicated that projects typically last 6 months, while 23.1% describe their projects as open-ended. I'm not sure what it all means just yet, but they're good numbers to know...