Splitting the Repository

This post originally appeared on the Software Carpentry website.

United Airlines messed up my travel again last weekend, so I finally had a chance to think some more about how Software Carpentry works and how we can make it work better. Having topic maintainers is one improvement; another, which was discussed at this month's lab meeting, is to break the bc repository that holds our lessons and workshop home pages into smaller and more manageable pieces.

The Problem

We face three related problems:

  1. The bc repository is too big. My copy (which admittedly includes a few pieces of ongoing work) takes up almost 100 MBytes. More importantly, it includes a bewildering mix of material: page templates, setup instructions, support tools, and more.

  2. Our scope keeps creeping. Software Carpentry's original aim was to teach scientists who were already programming how to program better. We now teach people who've never programmed before, in three languages, and have material on databases, regular expressions, and a bunch of other things. They're all useful, but they don't belong in a single repository any more than every useful piece of software in the world should be in a single large library.

  3. It's hard for people to explore and manage alternatives. Different instructors have different opinions about the best way to introduce certain topics: NumPy early vs. NumPy late in Python, branches before or after pull requests in Git, and so on. Instructors also want to use different examples for different audiences, e.g., bioinformatics data when teaching biologists and climate data when teaching climatologists. We could come up with a naming scheme and directory structure to manage this all in one repository, but again, that's not how we manage software.

The Proposal

Our proposal is to divide the current content of the bc repository between several new repositories. One of these will be the template for workshop websites; each of the others will contain one complete lesson, such as the introduction to the Unix shell for novices. Each of these repositories will be much smaller, and therefore easier to navigate, than what we currently have. As a fringe benefit, people will be able to subscribe to only the ones they care about most, so that they aren't flooded with comment and commit messages on topics they're less interested in.

We will seed these repositories using the content we currently have, but there's no reason to stop there. If a group of people want to fork one of those repositories and tweak the examples to be more relevant to economics, they can do so; if other people want to write a lesson that presents Git in an entirely new way, they can do that too. Software Carpentry will then be responsible for certifying lessons as fit for purpose, i.e., for saying, "This lesson is in our format, meets our quality standards, is aimed at our audience, covers the things we think are important." Any workshop that delivers certified lessons covering our core topics will then be allowed to use our name and logo, and people will have more freedom to develop and use "affiliated" lessons that are compatible with ours, but cover things we don't.

To implement this, we will start by specifying how a lesson should be laid out (just as Ruby on Rails and Django specify what key files must be called and where they must be put). We will then extract the novice Software Carpentry lessons from the bc repo, along with as much of their history as we can bring over. The bits and pieces needed to set up a home page for a workshop will be extracted in the same way into another repository. Almost all workshops will be able to set themselves up by cloning just that repository, which will (probably) contain only a couple of dozen files, then point at rendered lessons on the Software Carpentry website instead of having their own (redundant) copies.

FAQ

Who will certify lessons as "fit for purpose"?
In the short run, we'll certify the lessons we already have. In the long run, our topic maintainers, possibly in conjunction with some sort of scientific advisory board, will decide what meets our needs and standards.
Shouldn't we just use the best lesson for each level and topic?
Sure, but how will we know which lesson is best until we (a) have a better assessment process and (b) have lessons to assess? More importantly, what's best for an astronomer is unlikely to be best for an economist, so we're more likely to see lessons diverge by discipline than by teaching approach.
Can't we keep the lessons the same, but vary the datasets?
This doesn't work well in practice, since the wording throughout the lesson has to change when the data changes.
What happens if we wind up with too many lessons for people to sort through?
We'll cross that bridge when we get to it. (In reality, we're unlikely to have enough high-quality lessons in the next couple of years for it to be a real problem.)
How does this relate to Data Carpentry, Library Carpentry, etc.?
We will (hopefully) share lesson templates and any tools we build to support lesson creation and validation (e.g., something to check that a directory is laid out properly). Each project will then certify lessons as fit for its own purposes, and certify workshops based on which lessons they deliver.
What will we do about the outstanding pull requests against the bc repository?
We will merge as many as we can, then ask the authors of the rest to redirect them.
How will people find lessons once they're spread out across multiple repositories?
We will start with a page of links on our web site, and worry about how to handle explosive growth if such growth occurs.
How will people keep up with changes to lessons?
We don't know, but we don't think the problem will be any worse than it is right now.
What if someone creates a lesson on topic X for community Y, then we realize there's a better way to teach X?
That's up to the group that certifies lessons: if the old approach is merely not as good as the new one, there's no reason to de-certify it, but if the old approach turns out to be just plain wrong, the group may pull the certification. In practice, this is unlikely to come up in the next year, and we can figure it out then.
How will we handle resources that are shared by many or all lessons?
Dynamic resources like CSS files are easy: pages will load them from our web site. Compile-time resources (like included files and page layouts) can be duplicated for now: Git submodules should do this for us, but our past experience with them hasn't been encouraging. We can include small data sets in the base lesson template for now, and worry about larger or more specialized data sets as we grow.
What about software setup and installation instructions?
We can't put these in a central site because instructors really do customize these frequently. We will therefore duplicate the instructions for now and figure something else out later.

Next Steps

The next steps are to create a layout template for our lessons, then extract our current novice lessons into new repositories based on that template. The most important thing will be to keep the template simple: really, really simple. To ensure this, we need volunteers who aren't experts in Git or page layout to act as a reality check on those who are. If you'd like to help in either capacity, please give us a shout: this will be one of the biggest changes we ever make, and we'd like everyone to be part of it.

Appendix: A Straw Man Lesson Template

This template is guaranteed not to be what we eventually use, but we hope it will serve as a starting point for discussion. If you have specific questions or suggestions, please add them to this GitHub issue rather than putting them in comments on this blog post so that the whole conversation is in one place.

Overall Layout

Each lesson is in a directory laid out as follows:

  1. index.md: home page for lesson
  2. dd-slug.md: topics (where 'dd' is a sequence number and 'slug' is a mnemonic, e.g., 03-functions.md)
    • We'll discuss file formats (e.g., whether or not to use the IPython Notebook for the master copy of each lesson) in a separate thread.
  3. introduction.md: slides for a 3-minute "why learn this?" presentation to give to learners at the start of a lesson
  4. glossary.md: definitions of key terms
  5. reference.md: cheat sheet for key commands, etc.
  6. guide.md: instructor's guide
  7. code/: sub-directory containing all code samples, which are executed from the root directory
  8. data/: sub-directory containing all data specific to this lesson
    • index.md: describes data sets
    • filename.xyz: single-file dataset
    • folder/: multi-file datasets are all in their own directories
  9. img/: images (including plots) used in lesson
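
The layout above is simple enough to check automatically, which is the kind of validation tool mentioned in the FAQ. Here is a minimal sketch in Python, assuming the straw man layout is adopted exactly as listed; the function name `check_layout` and the choice of which files count as required are illustrative assumptions, not an agreed tool.

```python
import re
from pathlib import Path

# Files and directories the straw man layout calls for (an assumption:
# the final template may require a different set).
REQUIRED_FILES = ["index.md", "introduction.md", "glossary.md",
                  "reference.md", "guide.md"]
REQUIRED_DIRS = ["code", "data", "img"]

# Topic files are named dd-slug.md, e.g. 03-functions.md.
TOPIC_PATTERN = re.compile(r"^\d\d-[a-z0-9-]+\.md$")


def check_layout(root):
    """Return a list of human-readable problems found in a lesson directory."""
    root = Path(root)
    problems = []
    for name in REQUIRED_FILES:
        if not (root / name).is_file():
            problems.append(f"missing file: {name}")
    for name in REQUIRED_DIRS:
        if not (root / name).is_dir():
            problems.append(f"missing directory: {name}/")
    topics = [p.name for p in root.glob("*.md") if TOPIC_PATTERN.match(p.name)]
    if not topics:
        problems.append("no topic files matching dd-slug.md")
    # The data directory is expected to describe its own contents.
    if (root / "data").is_dir() and not (root / "data" / "index.md").is_file():
        problems.append("missing file: data/index.md")
    return problems
```

Returning a list of problems rather than stopping at the first one lets a lesson author fix everything in one pass; an empty list means the directory matches the layout.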

Overall Index

The index.md file is structured as follows:

    # Lesson Title

    Paragraph of introductory material.

    > ## Learning Objectives
    >
    > * Overall objective 1
    > * Overall objective 2

    ## Topics

    * [Topic Title](dd-slug.html)
    * [Topic Title](dd-slug.html)

    ## Other Resources

    * [Introduction](introduction.html)
    * [Glossary](glossary.html)
    * [Reference Guide](reference.html)
    * [Instructor's Guide](guide.html)
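
Because every link in the index points at a page rendered from a file in the same directory, broken links can be caught before anything is published. The sketch below assumes each `name.html` is generated from a `name.md` file sitting next to `index.md`; that mapping and the function name `missing_link_targets` are assumptions made for illustration.

```python
import re
from pathlib import Path

# Matches Markdown links such as [Glossary](glossary.html). Targets that
# contain ':' or '/' (e.g. external URLs) are deliberately not matched.
LINK_PATTERN = re.compile(r"\[([^\]]+)\]\(([^)#:/]+)\.html\)")


def missing_link_targets(lesson_dir):
    """Return pages linked from index.md whose Markdown source is absent."""
    lesson_dir = Path(lesson_dir)
    text = (lesson_dir / "index.md").read_text(encoding="utf-8")
    missing = []
    for _title, stem in LINK_PATTERN.findall(text):
        # Assumption: name.html is rendered from name.md in the same directory.
        if not (lesson_dir / f"{stem}.md").is_file():
            missing.append(f"{stem}.html")
    return missing
```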
  

Topics

Each topic file's name is dd-slug.md, where 'dd' is a sequence number and 'slug' is a mnemonic, e.g., '03-functions.md'. Each topic should take 10-30 minutes to cover, and should be structured as follows:

    # Topic Title

    > ## Learning Objectives {.objectives}
    >
    > * Learning objective 1
    > * Learning objective 2

    Paragraphs of introductory material.

    ## Sub-heading

    Paragraphs of text.

    ~~~ {.python}
    some code:
        to be displayed
    ~~~
    ~~~ {.output}
    output
    from
    program
    ~~~

    > ## Challenge Title {.challenge}
    >
    > Description of challenge

    ## Another Sub-heading

    As above...

    > ## Key Points {.keypoints}
    >
    > * Key point 1
    > * Key point 2
  

Some features of this are:

  • We will use Pandoc for Markdown-to-HTML conversion, so we can use {.attribute} syntax to attach classes to headings and code blocks instead of the clunky postfix syntax our current notes use (which is the only thing that Jekyll supports).
  • Rather than using <div class="whatever">...</div> to mark sections, this uses blockquotes whose headings have specific classes: the sections we need to mark are relatively small, and CSS will take care of displaying these the way we want.
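
Marking sections with classed blockquote headings also makes them easy for tools to find without parsing HTML. The following is a rough sketch, assuming sections are written exactly as in the template above (a `> ## Title {.class}` line followed by `>`-prefixed lines); the function name `find_sections` is hypothetical, and a real tool would more likely work from Pandoc's own parse of the file.

```python
import re

# A marked section starts with a blockquote heading carrying a class,
# e.g. "> ## Key Points {.keypoints}".
SECTION_HEADING = re.compile(r"^>\s*##\s+(.*?)\s*\{\.([\w-]+)\}\s*$")


def find_sections(markdown_text):
    """Return (class, title, body_lines) for each classed blockquote."""
    sections = []
    current = None
    for line in markdown_text.splitlines():
        match = SECTION_HEADING.match(line)
        if match:
            current = (match.group(2), match.group(1), [])
            sections.append(current)
        elif line.startswith(">") and current is not None:
            body = line[1:].strip()
            if body:  # skip the bare ">" spacer lines
                current[2].append(body)
        else:
            # Any line outside a blockquote ends the current section.
            current = None
    return sections
```

A checker could use this to confirm, for example, that every topic file has one `objectives` section and one `keypoints` section.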

Glossary

Each term in the glossary is laid out as a separate blockquote, with the term in a heading. Yes, this is odd, but we want to avoid putting HTML in Markdown, and we can't add identifiers to paragraphs using {#whatever} notation: that only works on headers.

    # Glossary

    > ## Term {#some-anchor}
    > The definition.
    > See also: [some word](#local-anchor)
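
Since every term carries its own anchor, "See also" links that point at terms nobody has defined can be detected mechanically. This is a minimal sketch assuming glossary entries follow the format above; the function name `undefined_anchors` is illustrative only.

```python
import re

# Term headings look like "> ## Term {#some-anchor}".
ANCHOR_DEF = re.compile(r"^>\s*##\s+.*\{#([\w-]+)\}[ \t]*$", re.MULTILINE)
# Local cross-references look like "[some word](#local-anchor)".
ANCHOR_REF = re.compile(r"\]\(#([\w-]+)\)")


def undefined_anchors(glossary_text):
    """Return referenced anchors that no term heading defines, sorted."""
    defined = set(ANCHOR_DEF.findall(glossary_text))
    referenced = set(ANCHOR_REF.findall(glossary_text))
    return sorted(referenced - defined)
```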
  

Introductory Slides

Every lesson must come with a short slide deck (2-3 minutes) that the instructor can use to explain to learners what the subject is, how knowing it will help learners, and what's going to be covered. Slides are written in Markdown, and compiled into HTML using reveal.js.