Software Carpentry as a University Course

This post originally appeared on the Software Carpentry website.

The inaugural Software Carpentry and Data Carpentry Instructor and Helper Retreat is over! It was a long day, packed with tutorials, demos, and discussions. I (Daniel Chen) led a round table discussion with Tiffany Timbers and Jenny Bryan. You can watch the discussion and/or read the notes on etherpad.

First, thanks to everyone who joined the discussion:

  • Tiffany Timbers / @TiffanyTimbers
  • Jenny Bryan / @JennyBryan
  • Daniel Chen / @chendaniely
  • Camille Avestruz
  • David LeBauer
  • Cam Macdonell
  • Drew Tyre / @atiretoo
  • Rayna Harris / @raynamharris and the +5 from Austin Texas including: Rebecca Tarvin / @frogsicles
  • Jon Pipitone
  • John Moreau
  • Gayathri Swaminathan
  • Nichole Bennett
  • Tracy Teal
  • Raniere Silva
  • Sarah Stevens
  • Lex Nederbragt
  • Kate Hertweck
  • Greg Wilson
  • Becca Tarvin
  • Azalee Bostroem

Hopefully I got everyone!


Last year I asked the discuss list if anyone had any experience with turning the Software Carpentry lessons into a university course. Bill Mills mentioned that the two main hurdles revolve around politics and curriculum control. Ethan White and Michael Jackson went on to describe their own experiences with setting up a course. I want to believe the push for code sharing and reproducible articles from Nature and the submission guidelines for PLOS ONE have addressed Greg's initial observations that universities find that a lab skills course is not worth graduate credit. The conversation continues here, here, and in a blog post by Damien Irving here.

The Software Carpentry and Data Carpentry Instructor and Helper Retreat seemed like a good means to revisit the discussion.

Who Are We?

Daniel Chen, Tiffany Timbers, and Jenny Bryan.

The Roundtable Discussion

Earlier this year, Anthony Scopatz and Kathryn Huff published Effective Computation in Physics and created a course around the book. Tiffany, can you talk about the course you've been beta testing?

[T]: Anthony has been teaching the course to undergraduate students and I've been building off Katy and Anthony's experiences from teaching as well as my own from teaching Software Carpentry. I decided to use this textbook for my course at Quest. The book is based on Python, but many of the principles are language agnostic. The book covers many topics SWC teaches, such as Bash, Python, Make, documentation, licenses, and many open science aspects. Katy and Anthony also created a Github repo of annotated IPython (Jupyter) notebooks of the code in the book. The notebook annotations aim to help those who are using the book as instructors and/or students.

A response to my original discuss question revolved around administrative and university department politics on who should and who may teach a course like this. For example, some people mentioned that the computer science department will push back against any course involving programming. How did you overcome these hurdles, or any others?

[T] This was not much of an issue for me because Quest is a liberal arts college, and this is essentially the only computer science course they are offering this year.

[J] I just sneak it into courses that are already on the books. I have created a course from scratch, and it was a nightmare. My advice is to teach it in courses that already have a number (i.e., approved by the university). But that's not a long-term solution. It is better if you pilot it in an existing course or a topics course.

How would other people go about piloting a course like this? For example, I'm just a graduate student. Would getting a faculty member to sponsor a special topics course be the only way to get started?

[J] Two good options for introducing SWC-type material: For faculty or others who teach regular courses, one could offer it as a special topics course. For grad students who normally TA, one could offer to run a hands-on computing lab for a related, existing course.

How do we organize the SWC material into a 30 week course? Should we expand the current material into 30 weeks, or offer 2 half semester courses? Bill Mills expressed a concern that a 30 week course could too easily devolve into a Machine Learning/Data science course because instructors overshoot what can be crammed into a computing course. SWC tries to teach fundamental skills. How do we break up the current material?

[T] My current course is taught as a 3.5 week course. This is supposed to be the same number of hours as a full semester course. We cover an abbreviated shell and 6 hours of version control. Since code review was part of the course, learning to send PRs is essential. These sessions were split up a week apart. Then we covered all the Python basics. Data munging and visualization are really important, so we spent a lot of time on visualization and Pandas. Because I think you learn a lot by doing, roughly the last 9 hours of the course are spent in hack sessions, where I am around to help. Quest also prefers courses not to be lecture based to promote active learning, so I'm not teaching during the hack sessions.

[J] I don't necessarily teach classic SWC material in the same format used in two-day workshops. I incorporate it as we go through an existing course. One of the nice things about spreading out the material is that you can make the examples less artificial. This has always been a complaint of mine with the existing SWC material. Because you have to move through so much technical stuff, you end up doing a bunch of stuff with the file foo, and a bunch of other unrealistic problems. By spreading everything out, you give the students a chance to use an example that is not boring.

Jenny, how do you sneak the SWC material into your existing course?

[J] I have a lot of material in both settings. I use the 2 courses as an excuse to develop each other. I mainly teach R and data analysis, so the SWC Git lessons are helpful. I don't teach as much Git formally as I should; I mainly teach things by doing. I've never drawn Git concept maps when I teach. Not sure if this is a good or bad thing. But the students still manage to get the work done. I haven't brought in SWC material explicitly, yet.

If you are teaching SWC as a graduate or undergrad class, you have to assess people. What assignments and quizzes do you give so you can get an understanding of how much material they are learning?

[J] I don't do small quizzes. I used to have a big giant project at the end, but this ends up with many big heroic things happening at the end of the course for both the students and myself. What would happen is that they would leave stuff to the end, pull 3 all-nighters, and not necessarily get it all done. I would then get 40 or 50 projects that are not amenable to a single rubric. So, I have evolved over the years to homework assignments. These HW assignments are more like prompts, and are given once a week. The goal is to have small and steady work throughout the semester versus a massive project at the end. Hopefully, this is more true to working in real life. I also have a pretty standard rubric, which makes grading much easier. Having TAs helps too! I teach and have students use Git and the Github infrastructure for commenting on commits and PRs. So, peer review is also invaluable.

The book has "physics" in the title. What department do we put a SWC class under? Should we limit the course to a particular major of students? Or have the course be completely free enrollment? Jenny, your class is under Statistics. Is your class closed off to just statistics people? Who can take your class?

[J] My class is wide open to any graduate student. I have undergrads too, but I have to sign forms for them. I have tables and figures that I will put in the etherpad on enrollment statistics. Since I've been tracking the people who attend my course, I've had people from 50 different programs, 25 of which have at least 2 students. So basically, they are all over. 40% are cumulatively statistics since I have all the incoming masters students. Computational biology and bioinformatics are usually the others taking the course. I think it's good to have this diversity in the class and not be hyper specific.

[T] I have been using Katy and Anthony's book for this course because I was hired by the department of physical science at Quest. Despite 'physics' being in the book and course title, I have a wide variety of people who take my class. Quest caps classes at 20 students, and my current class has 13 students. People are from all different departments: biology, physics, oceanography, marketing, and even sociology. We have a mid-block review for the course where the students provide feedback to the teaching. They started discussing the textbook and they all really like it.
The book has been very relevant to the class even though I was not teaching straight out of the book. For example, I was more comfortable teaching the SWC Git material, so this is what I taught in our Git workshops. I used the inflammation dataset where Katy and Anthony used different datasets. So, the book and class do not need to be catered to physicists. However, the book is very Python specific. Shell and Git would be relevant to other things, but more than half of the book is teaching Python.

What do the first few weeks of your course look like?

[J] I have the liberty to teach data analysis. The first thing we do is create figures using the cleanest data possible. This was the motivation behind the gapminder package on CRAN. Then week-by-week we slowly work in the details. We start from a 'happy' place. Then we edit a README on Github. Then we perform local git edits before pushing them back to Github. The goal is to focus on small milestones with visual and tangible results and motivate with success. Thinking about motivation is really important; then you can slowly develop the tools underneath. I have the luxury of doing this because it is a full semester 13 week course.

[T] I agree motivation is key to getting students engaged and on board. My course is taught in 3.5 weeks instead of 13 weeks. However, during this time, this is the only course the students are focused on. One of the drawbacks about my format is you miss the ability to consolidate information. It's more like a 3.5 week long SWC workshop.

To compare my course with Katy, Anthony, and Bill:

  • Anthony teaches git right after the shell. I did this in my course as well.
  • I wanted to use Github as a way to submit homework, so I teach an abbreviated shell (it includes the lessons on navigating the filesystem and creating things).

I did go with a big final project (maybe a decision I'll be regretting soon?), so during the day we learn something for 3 hours and then I have the students implement something from the day's lesson that applies to the final project. For example:

  • Day 1: navigate shell, create things, and shell scripts: The HW was to create a shell script to set up the directory structure for the project.
  • Day 2: Git: The HW was to make a git directory of their final project, and build up from there.
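
The Day 1 homework was a shell script. As a rough sketch of the same idea in Python (the language most of the course is taught in), the snippet below creates a project skeleton. The folder names (`data`, `src`, `results`, `doc`) and the function name `make_project_skeleton` are assumptions for illustration, not the layout the students were actually given.

```python
from pathlib import Path
import tempfile

# Hypothetical layout; the course's actual directory structure is not given.
SUBDIRS = ["data", "src", "results", "doc"]


def make_project_skeleton(root):
    """Create the project folder, its subfolders, and a starter README."""
    root = Path(root)
    for name in SUBDIRS:
        # parents=True creates root as needed; exist_ok=True makes reruns safe.
        (root / name).mkdir(parents=True, exist_ok=True)
    readme = root / "README.md"
    if not readme.exists():
        readme.write_text("# Final project\n")
    # Return the top-level entries so the result is easy to inspect.
    return sorted(p.name for p in root.iterdir())


# Demo in a temporary folder so nothing is left behind.
with tempfile.TemporaryDirectory() as tmp:
    created = make_project_skeleton(Path(tmp) / "final_project")
    print(created)
```

Running the function a second time on the same folder changes nothing, which is convenient once the folder becomes a Git repository on Day 2.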

I try to give a big picture in the beginning of class (e.g. websites hosted on Github), then show them an organized code repository (e.g. a formatted README and folder structure). However, I do not get to visualization until near the end of the 4th or 5th day.

The Gapminder dataset is on CRAN for the R folks, what about the Python people? Is there a CSV?

[J] The package itself contains the TSV, and the file is also available at non-HTTPS locations and on Github.

[T] Nancy Soontiens from UBC has developed a gapminder lesson for Python for the women in science workshop.
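
For the Python folks, a tab-separated file like this can be read with the standard library alone. The sketch below is a minimal example under stated assumptions: the column names follow the gapminder package, but the rows are made-up placeholders (not real Gapminder values), and the helper names `read_gapminder` and `mean_life_exp` are hypothetical.

```python
import csv
import io

# Illustrative rows only: the column names follow the gapminder package,
# but the values below are made-up placeholders, not real Gapminder data.
SAMPLE_TSV = (
    "country\tcontinent\tyear\tlifeExp\tpop\tgdpPercap\n"
    "Examplia\tAsia\t1952\t40.0\t1000000\t800.0\n"
    "Examplia\tAsia\t2007\t60.0\t3000000\t1600.0\n"
    "Samplestan\tEurope\t2007\t80.0\t5000000\t30000.0\n"
)


def read_gapminder(handle):
    """Parse tab-separated rows and convert the numeric columns."""
    rows = []
    for row in csv.DictReader(handle, delimiter="\t"):
        row["year"] = int(row["year"])
        row["lifeExp"] = float(row["lifeExp"])
        row["pop"] = int(row["pop"])
        row["gdpPercap"] = float(row["gdpPercap"])
        rows.append(row)
    return rows


def mean_life_exp(rows, year):
    """Average life expectancy across countries for one year."""
    values = [r["lifeExp"] for r in rows if r["year"] == year]
    return sum(values) / len(values)


rows = read_gapminder(io.StringIO(SAMPLE_TSV))
print(mean_life_exp(rows, 2007))
```

With a downloaded copy of the file, the `io.StringIO(...)` argument would be replaced by an open file handle, for example `open(path, newline="")`.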

Jenny teaches a 13 week course and Tiffany, you teach a 3.5 week course. How much time per week do you meet? And how do you split up the SWC material throughout the weeks/days?

[J] I teach less classic SWC. I have 3 hours with the students a week. About 80% of the time is setting the stage with slides. The rest is working in R. The way I do it, most of the learning is in the homeworks. The class is just enough information and guidance so they are not totally lost during the homeworks.

[T] We do 3 hours of shell the first day and 3 hours of git the next. I think it depends on how many active learning exercises you can fit in. It took me 9 hours to get through loops, functions, and if statements because I had a lot of active learning exercises and wanted them to fully understand these concepts. Testing took another 3 hours (in the third week of the course). Essentially I do 2.5 weeks of SWC, then we do code review. Each student is given another student's homework to review. I sit with them and they give me a summary of the pull request. At the same time, the rest of the class is working on their final projects.

What final projects do you have the students work on? Do you assign them a project or do they bring their own questions?

[T] I've been extremely flexible with this. I have them think about the final project from Day 1. They pick something in the area they are interested in, come up with a research question and a hypothesis, and do something that requires a computer. These are undergraduates, so my expectations are not as high because they do not have as much research experience. The project end goal is to have a Python script import data and save a plot or table from an analysis or visualization, and to have a record of the Python script inputs in either a Makefile or shell script to make sure everything is reproducible. There is also a 2500 word report because they need to be able to disseminate their findings and information. Finally, there is an 8 minute presentation as an IPython (Jupyter) notebook.
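
As a rough sketch of what such a project script might look like, the example below reads a CSV, averages one column per group, and saves the result as a table. It takes its inputs on the command line so that a Makefile or shell script can record them. The function names, the `--group` and `--value` options, and the tiny demo dataset are all assumptions for illustration; a real project would likely use Pandas and save a plot as well.

```python
import argparse
import csv
import os
import tempfile
from collections import defaultdict
from statistics import mean


def summarize(in_path, out_path, group_col, value_col):
    """Read a CSV, average value_col per group_col, and save a table."""
    groups = defaultdict(list)
    with open(in_path, newline="") as handle:
        for row in csv.DictReader(handle):
            groups[row[group_col]].append(float(row[value_col]))
    with open(out_path, "w", newline="") as handle:
        writer = csv.writer(handle)
        writer.writerow([group_col, "mean_" + value_col])
        for key in sorted(groups):
            writer.writerow([key, mean(groups[key])])


def main(argv=None):
    """Parse the command-line inputs and run the summary."""
    parser = argparse.ArgumentParser(description="Summarize a CSV file.")
    parser.add_argument("infile")
    parser.add_argument("outfile")
    parser.add_argument("--group", required=True)
    parser.add_argument("--value", required=True)
    args = parser.parse_args(argv)
    summarize(args.infile, args.outfile, args.group, args.value)


# Demo with a small made-up dataset written to a temporary folder.
with tempfile.TemporaryDirectory() as tmp:
    infile = os.path.join(tmp, "data.csv")
    outfile = os.path.join(tmp, "summary.csv")
    with open(infile, "w", newline="") as handle:
        handle.write("site,temp\nA,10\nA,14\nB,20\n")
    main([infile, outfile, "--group", "site", "--value", "temp"])
    with open(outfile, newline="") as handle:
        result = handle.read()
    print(result)
```

If the functions were saved in a file (say, a hypothetical `summarize.py` that calls `main()` when run), the single line invoking it with its input file, output file, and options is what would go into the Makefile or shell script as the record of inputs.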

How much literate programming do you teach? When do you teach it?

[T] I get them coding in the notebook and slowly during the first week I show them markdown cells. They don't necessarily use them at the very beginning, but because they see me use them in class, they eventually do. In the second week, they are asked to do a statistical analysis and, in a markdown cell, explain why they chose that test and what the test means. It's interesting to see them tell each other to use markdown cells in the peer code review. They push each other to do it. Their final presentation is in a Jupyter Notebook so they are well on their way to doing literate programming, but I do not give an explicit class on it.

[J] I teach rmarkdown on Day 1. It enables them to make a plot, which gets them excited. Then you can make a web page. Then you can easily put it on Github. It makes it extremely rewarding for the students. The peer review is helpful because they can provide feedback and have constant interaction with each other. The students discuss the layout of the repository, how easy it was to find scripts and understand what the code is doing, etc.

Grading code and technical work is difficult, how do you do it?

[J] One of the other reasons for doing peer review is to get feedback to the students faster than if only the instructor is grading.

Jenny, since you seem to have been doing this for a few years, do you have any available resources to share?

[J] My writing on the teaching of the course has always been talks. My courses are online, and I have speaker decks.

There's also Ethan White's courses here and here.

Software and Stats for biology is such a fast paced field, how do you even decide what to teach?

[J] Of the 2 courses I've taught over the past decade, "Data Analysis" is the one where I am chasing technology. It's exhausting, but I also want to learn it myself.

I currently do the opposite in my genomics class. I started by chasing the technology for years, but almost every year required a complete course rewrite. I eventually generalized it to the key genomics pipelines and statistical analysis, for example, multiple comparisons and empirical Bayes. I try not to chase specific sequencing platforms. You can only be chasing technology in a few places at a time.

Research-wise, I've only been able to put these classes together since getting tenure. This allows me to put more effort into developing the course with fewer outside distractions. I would say what I'm doing would not mesh well with being super productive research-wise. Jeff Leek, Roger Peng, and Rafael Irizarry, though, are doing online courses and are super productive research-wise.

Do you have TAs?

[J] An army of TAs is critical. I have 6-7 (not all full time). The department here was able to prove that this is not just a statistics class anymore, and people are coming from all over campus. So I got external funding for more TAs than the average graduate class. I only have graduate TAs, no undergraduate TAs.

[T] Unfortunately, Quest is a very new institute with limited funding. During class I have no helpers or TAs. This may be a burden or barrier to the students' success. However, there are peer helpers outside of the class who are available.

Jenny, what's your class enrollment like?

[J] Strictly anecdotally, the people who are most vocal on Github and in class, and the enrollment overall, are disproportionately female (in a good way!), which I find interesting. The long dialogs I have with students on Github are with my female students. Contrast this with my more professional dialogs on Github, which are mostly with men.

Additional Questions/Comments from the audience

[Cam] My university is putting together a diploma for Library Techs, probably closer to Data Carpentry. We are still figuring things out in terms of delivery, but really appreciate all the models of assessments and topics going on. This is more of a general course which includes undergraduates.

Something Good, Something Bad

Other than the small technical blunder in the beginning, where the laptop I was using for the Hangout spontaneously shut off and I ended up temporarily calling back in on my phone, there weren't really any blunders in the discussion. I (Daniel) realized after re-watching the stream for this blog post that I asked a few overlapping questions because I did not hear part of an answer. This came from trying to get my main laptop's Hangout up again while also trying to follow the etherpad for questions. I suppose the next time we run this, we can have a designated spot on the etherpad for questions so audience questions can be easily found.
