Teaching databases by example

This post originally appeared on the Software Carpentry website.

Over the last two weeks I've been spending most of my Software Carpentry time working on the database unit. They began as a fairly straight-forward translation of the Software Carpentry 3.0 lecture notes with only a few changes to the sequencing of the topics. The plan for the unit is fairly simple: each screencast introducing only one or two new topics, and builds on the previous screencasts with as few as possible forward references to later topics. The topics themselves are really just the different language features (SELECT, WHERE, JOIN, etc..) presented as tasks you might want to perform, and illustrated by working with toy data.

It might be clearer if I present the topic plan we had in mind:

An Introduction to Databases (What is a database and why/when would you use one?)
Getting data from a table, filtering, and sorting it. (SELECT, WHERE, and ORDER BY)
Aggregating and grouping results (GROUP BY, and aggregation functions: SUM, MAX, etc..)
Dealing with empty or missing data (NULL)
Combining data from multiple tables (inner JOINs)
Advanced queries (subqueries)
An overview other features (e.g. HAVING, expressions in SELECT, LIMIT, ...)

And then this week Greg suggested I take a look at this gem of a book (sadly, out of print):

The Essence of SQL: A Guide to Learning Most of SQL in the Least Amount of Time by David Rozenshtein

It takes a completely different and, I think, much more useful approach. It begins with a list of typical questions you might ask about a database of students/courses/profs. For example, "What are the student numbers and names of students who take CS112? ", "Who are the youngest students?", "Who does not take CS112?", or "Who takes a course which is not CS112?". These questions are meant as prototypes of the sorts of questions you would use any database to answer. The book proceeds through each question and explains how you'd use SQL to answer it, why it makes sense to do it that way, and why it even works in the first place.

My intuition about this approach is that it makes for a great way to learn about databases. Structuring the book around the prototypical questions will serve as a really useful way to refer back to the course later when you have a real problem to solve, as well as being much more motivating to have the unit be problem-based. My concern is that by organising the database unit around these questions we'll be stuck mixing the language features throughout the units with all sorts of cross-referencing needed in case people just want to learn about using a WHERE clause.

What do you think? Greg suggests that maybe the Rozenshtein approach is something we use in combination with our original approach: we first cover the language features in separate screencasts, we then use the Rozenshtein questions to pull it all together.

There's another question I have that's separate from the question of which approach we choose. In my opinion, having real data to teach around is more exciting and educational than creating toy data that's tuned just to the specific topic. The Nobel Prize data, and the Experiments and Researchers data sets we've used in the first few lectures are okay, but wouldn't it just be so much more interesting if we started with, say data about tabacco use and physical activity, and each screencast taught about a language feature by posing and answering questions about this dataset. This is something the book Data Analysis Using SQL and Excel by Gordon S. Linoff does really well.

So, another question for us (and now for you!) is, what real-world data should we use to teach? We need dataset we can tell a good story around. The data also needs to be varied enough so that the queries results in our small screencast size don't need to be scrolled, and span several tables so that we can illustrate different join types and such.