Teaching to the Workflow

This post originally appeared on the Software Carpentry website.

"Teaching to the test" has a deservedly bad reputation, but what about "teaching to the workflow"? A group of us came together at the NCEAS Open Science Codefest last year and put together a paper on open science in ecology. In it, we sketch three examples of possible open science workflows (Figure 2 in the paper). In response, I was asked what Software Carpentry should teach to prepare people for working in those ways. My top three things are (in order):

  • The idea that it's OK to be wrong in public. As long as people are honest about what they're trying to do and what they think their work means, and they try to make it right when they are wrong. Everyone is wrong sometimes, and a lot of people are willing to help fix honest mistakes or misunderstandings (see, e.g., Stack Overflow). This, it seems to me, is how science overall should (and sometimes does) work: we're all in this together. This is why I think it's so important for instructors to screw up when teaching and to talk through, slowly, how they untangle themselves. No grad student or programmer starts out like Athena from the head of Zeus, fully formed and brilliant from the beginning. Writing, documenting, and openly sharing scientific thinking and the code behind it, even with mistakes, is one of the best ways we have as scientists to be honest with ourselves and with others.

  • The idea of provenance. Not just in the formal sense but also in the broader sense, related to a whole ton of important ideas, including but not limited to: metadata, version control, code modularity, Don't Repeat Yourself, testing, open access, etc. Science is deeply and fundamentally about provenance of ideas and provenance of evidence. All scientists get this intuitively, but the ways to document this trail are changing and multiplying rapidly. Where is the idea from (articles, DOIs, open access)? What is the evidence for it (figures, tables, statistics)? How were the analyses done (code, version control, formal computational provenance)? Where did the data come from (metadata, data archiving, well-documented field or lab methods)? Software Carpentry has for a long time taught people how to structure and document code and make analyses reproducible, but I think a lot of people (myself included) also want and need to learn how to better structure, from start to finish, their project organizational schemes and filesystems (and their science!) with provenance in mind. Because the process of science is messy, it's not and will never be easy. But it seems to me that many of us want to know how to do it better.
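As one concrete illustration of organizing a project with provenance in mind, here is a minimal layout sketch. The names are hypothetical and conventions vary; the principle is that every derived file can be traced back to raw data plus versioned code:

```
my-project/
├── README.md        # what, why, who; how to reproduce the analysis
├── data/
│   ├── raw/         # original data, never edited by hand
│   └── clean/       # derived data, regenerated by scripts
├── code/            # analysis scripts, under version control
├── results/         # figures and tables, regenerated by scripts
└── doc/             # manuscript, metadata, field/lab notes
```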

  • The idea of structured data. This could be anything from how to use Excel responsibly, to munging/slicing/dicing in Python/R, to designing bespoke SQLite databases, but the idea itself is crucial. It is very rarely taught in any formal setting, yet it is the key that enables everything analytical that follows. Most people, in my experience, struggle through until they, by chance, hit upon the format that lets pivot tables/JMP/ggplot work well. Structuring data well from the start leads to easier and less frustrating analyses, easier and less frustrating collaborations, and easier and less frustrating archiving. In the end, science and scientists win when data are well-structured. Format code for people and format data for machines.
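As a small, hedged sketch of what "hitting upon the right format" looks like in Python: the data below are invented, but the reshape from a spreadsheet-style wide table (one column per site) to a long "tidy" table (one observation per row) is exactly the step that makes ggplot-style tools and pivot tables work well.

```python
import pandas as pd

# Hypothetical field counts in the wide layout spreadsheets encourage:
# one row per species, one column per site.
wide = pd.DataFrame({
    "species": ["A. alba", "B. nigra"],
    "site_1": [12, 3],
    "site_2": [7, 19],
})

# Reshape to long/tidy form: one (species, site, count) observation per row.
tidy = wide.melt(id_vars="species", var_name="site", value_name="count")
print(tidy)
```

The long form is what grouping, plotting, and archiving tools generally expect; the wide form is only convenient for eyeballing.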

What would you teach to prepare people to do the best science possible given the changing technological landscape?
