Pandoc and Building Pages

This post originally appeared on the Software Carpentry website.

Long-time readers of this blog and our discussion list will know that I'm unhappy with the choices we have for formatting our lessons. Thanks to a tweet from Karl Broman, I may have an answer. It's outlined below, and I'd be grateful for comments on usability and feasibility.

Here's a summary of the forces we need to balance:

  1. People should be able to write lessons in Markdown. We choose Markdown rather than LaTeX or HTML because it's easier to read, diff, and merge; we choose it rather than AsciiDoc or reStructuredText (reST) because it's much better known.
  2. People should be able to preview their lessons locally before publishing them, both to avoid embarrassment and because many people compose offline.
  3. Lessons should be easy to write and read. We shouldn't require people to put divs and other bits of HTML in their Markdown.
  4. It should be easy to add machine-comprehensible structure to lessons. We want to be able to build tools to extract lesson titles, count challenge exercises, etc., all of which requires machine-comprehensible source. This is in tension with the point above: everything we do to make lessons more readable to computers means extra work or less readability for people.
  5. We should use only off-the-shelf tools. We don't want to have to build, document, and maintain custom plugins for formatting tools. We do want to use GitHub's gh-pages magic.
  6. The workflow for creating and publishing lessons should be authentic, i.e., the way people write and publish lessons should be a way they might use to write and publish research papers.
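
To make the "machine-comprehensible structure" point concrete, here is a minimal sketch in Python of the kind of tool we'd like to be able to write. It assumes a convention that is not settled anywhere in this proposal: the lesson title is the first level-1 heading, and each challenge exercise is introduced by a heading whose text starts with "Challenge". The function name lesson_summary is made up for illustration.

```python
import re

# An ATX-style Markdown heading: one to six '#' characters, a space, then text.
HEADING = re.compile(r"^(#{1,6})\s+(.+?)\s*$")

# Code fence markers, built this way only to keep literal fences out of this listing.
FENCES = ("`" * 3, "~" * 3)


def lesson_summary(markdown_text):
    """Return the lesson title and the number of challenge headings.

    Assumed (hypothetical) convention: the title is the first level-1 heading,
    and a challenge is any heading whose text starts with "Challenge".
    """
    title = None
    challenges = 0
    in_fence = False
    for line in markdown_text.splitlines():
        # Skip fenced code blocks so shell comments are not mistaken for headings.
        if line.startswith(FENCES):
            in_fence = not in_fence
            continue
        if in_fence:
            continue
        match = HEADING.match(line)
        if match is None:
            continue
        level, text = len(match.group(1)), match.group(2)
        if level == 1 and title is None:
            title = text
        if text.lower().startswith("challenge"):
            challenges += 1
    return {"title": title, "challenges": challenges}
```

The tension mentioned in point 4 shows up right away: this only works if authors stick to the heading convention, which is exactly the sort of extra discipline we'd be asking of them.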

And here's the proposal:

  1. We stop relying on Jekyll and start using Pandoc instead.
  2. Every lesson is stored in a GitHub repository that has a gh-pages branch. (GitHub will automatically publish the files in that branch as a mini-website.)
  3. The root directory of that repository contains:
    • a README.md file with a one-liner about the lesson's content and authorship;
    • a sub-directory called src that contains the source files for the lesson;
    • the compiled versions of those files; and
    • an empty file called .nojekyll to tell GitHub that we don't want it to run Jekyll.
  4. The src directory contains all the source files for the lesson, and a simple Makefile that uses Pandoc instead of Jekyll to compile those files. Pandoc's output goes in the root directory, i.e., one level above the src directory, and the Makefile makes sure that other files (CSS, images, etc.) are copied up as well.
  5. When an author makes a change, she must build locally, then commit both the source files and the generated files to the GitHub repository. Yes, this means that generated files are stored in version control, which is normally regarded as a bad idea. But it does mean we can use Pandoc, which supports a nicer dialect of Markdown than Jekyll on GitHub, and we don't have to worry about compiling files on one branch and committing them to another.
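
The build step in points 4 and 5 can be sketched as follows. This is Python rather than the Makefile the proposal calls for, purely so the logic is easy to read and test; it is not the contents of the proof-of-concept repository. The function names and the list of asset extensions are assumptions. The only Pandoc options used are --standalone and --output.

```python
from pathlib import Path
import shutil
import subprocess

# Hypothetical set of asset types to copy up unchanged; adjust as needed.
ASSET_SUFFIXES = {".css", ".png", ".jpg", ".svg"}


def plan_build(src_dir, out_dir):
    """Return (pandoc_commands, copies) needed to build src_dir into out_dir.

    Nothing is executed here, so the plan can be inspected before it is run.
    """
    src_dir, out_dir = Path(src_dir), Path(out_dir)
    commands, copies = [], []
    for path in sorted(src_dir.iterdir()):
        if not path.is_file():
            continue
        if path.suffix == ".md":
            # Each Markdown page becomes an HTML page one level up.
            target = out_dir / (path.stem + ".html")
            commands.append(
                ["pandoc", "--standalone", "--output", str(target), str(path)]
            )
        elif path.suffix in ASSET_SUFFIXES:
            # CSS, images, etc. are copied alongside the generated pages.
            copies.append((path, out_dir / path.name))
    return commands, copies


def run_build(src_dir, out_dir):
    """Execute the plan: run Pandoc on each page, then copy assets up."""
    commands, copies = plan_build(src_dir, out_dir)
    for command in commands:
        subprocess.run(command, check=True)
    for source, target in copies:
        shutil.copyfile(source, target)
```

Run from inside src, this would be called as run_build(".", ".."), which mirrors what the Makefile would do: output lands in the root directory, where GitHub will publish it.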

I've created a proof-of-concept repository to show what this might look like in practice. It seems to work pretty well, and I think it satisfies the "authentic workflow" requirement (though I'd be grateful if others could tell me it doesn't). The only usability hiccup I can see is that authors will have to remember to commit the generated files: my usual workflow of git add -A followed by git commit -m only adds files in or below the current working directory (at least in Git versions before 2.0), so I would have to cd .. up from src to the root directory of my local copy of the repo first.
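
One way to take the current directory out of the picture is a small helper that finds the root of the repository and stages from there. The sketch below is only an illustration: find_repo_root and stage_everything are hypothetical names, and the approach simply walks up the directory tree looking for a .git entry before running git add -A in that directory.

```python
from pathlib import Path
import subprocess


def find_repo_root(start="."):
    """Walk upward from start and return the first directory containing .git.

    Returns None if no enclosing repository is found.
    """
    start = Path(start).resolve()
    for candidate in [start, *start.parents]:
        # .git is normally a directory, but may be a file in some setups.
        if (candidate / ".git").exists():
            return candidate
    return None


def stage_everything(start="."):
    """Run 'git add -A' from the repository root, wherever we were called."""
    root = find_repo_root(start)
    if root is None:
        raise RuntimeError("not inside a Git repository")
    subprocess.run(["git", "add", "-A"], cwd=root, check=True)
    return root
```

Calling stage_everything() from src would then pick up the generated files in the root directory as well as the edited sources, after which git commit -m works as usual.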

One variation on this, raised by Trevor King, is to keep the source files in the root directory of the master branch, and have the lesson maintainer merge changes into the src directory of the gh-pages branch and do the build. This frees authors from having to install the build tools (only the maintainers need them), but on balance, I think most people will want to preview before uploading, so the savings will be mostly theoretical.

If you have other thoughts, or can suggest other improvements, please add comments to this post. We'd particularly like to hear from people who aren't Git experts or aren't familiar with HTML templating systems, Makefiles, and the like. Does the workflow described above make sense? If not, what do you think would go wrong where, and why?
