Formats

This post originally appeared on the Software Carpentry website.

As I said in last week's announcement, and mentioned again in a later post, one of the main goals of this rewrite is to make it possible for students to do the course when and where they want to. That means recording audio and video, but much of the material will probably still be textual: code samples (obviously), lecture notes (for those who prefer skimming to viewing, or who want to teach the material locally), and exercises will still be words on a virtual page. And even the AV material will (probably) be accompanied by scripts or transcripts, depending on what turns out to work best.

Which brings up a question everyone working with computers eventually faces: what format(s) should material be stored in? For images, audio, and video, the choices are straightforward: SVG for line drawings, PNG for images, MP3 for audio, and MP4, MPEG, or FLV for video (I don't know enough yet to choose). But there's a bewildering variety of options for text, each with its pros and cons. These are the criteria I'm judging them by:

Authoring tools: do authors need to use a specialized editor? If so, is it freely available for the three major platforms (Windows, Linux, and Mac)?
Composition: can authors "just type", or do they need to spend a lot of keystrokes on markup?
Diffing and merging: does the format play nicely with version control systems, i.e., if two or more people edit independently, can their changes easily be merged after the fact?
Formatting: does the format allow fine-grained control over layout? (My personal test here is how easy it is to create tables with irregular arrangements of rows and columns.)
Multiple output formats: can HTML pages, slides, PDFs, and what-not all be produced from a single source?
Referencing: does the format take care of section and figure numbering, cross-references, and bibliographic citations automatically?
WYSIWYG: does the raw content have to be compiled or transformed to produce something viewable, or is what you see what you get?

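Here are the options as I see them. Each column is one of the criteria above, identified by its first letter (A for authoring tools, C for composition, and so on), each score is -1, 0, or +1, and the last column is the lowest score in the row: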

Format	A	C	D	F	M	R	W	Minimum
Microsoft Word	-1	+1	-1	+1	-1	+1	+1	-1
OpenOffice	0	+1	-1	+1	-1	+1	+1	-1
DocBook	0	-1	0	0	+1	0	-1	-1
Other XML	0	-1	0	-1	0	-1	-1	-1
Plain Old HTML	0	-1	0	-1	0	-1	+1	-1
S5 and its kin	0	-1	0	-1	0	-1	+1	-1
Wiki text	+1	+1	+1	-1	+1	0	-1	-1
LaTeX	+1	0	0	+1	0	+1	0	0
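
The Minimum column can be recomputed from the other seven, and it is easy to check how the ranking would change under a different summary. Here is a minimal Python sketch: the scores are copied from the table, but the names SCORES and rank are mine, invented for illustration.

```python
# Scores copied from the table above; column order is A, C, D, F, M, R, W.
SCORES = {
    "Microsoft Word": [-1, +1, -1, +1, -1, +1, +1],
    "OpenOffice": [0, +1, -1, +1, -1, +1, +1],
    "DocBook": [0, -1, 0, 0, +1, 0, -1],
    "Other XML": [0, -1, 0, -1, 0, -1, -1],
    "Plain Old HTML": [0, -1, 0, -1, 0, -1, +1],
    "S5 and its kin": [0, -1, 0, -1, 0, -1, +1],
    "Wiki text": [+1, +1, +1, -1, +1, 0, -1],
    "LaTeX": [+1, 0, 0, +1, 0, +1, 0],
}


def rank(scores, summarize):
    """Return format names sorted from best to worst under `summarize`."""
    return sorted(scores, key=lambda name: summarize(scores[name]), reverse=True)


by_minimum = rank(SCORES, min)  # the rule used in the table
by_total = rank(SCORES, sum)    # an alternative rule, for comparison

if __name__ == "__main__":
    for name in by_minimum:
        row = SCORES[name]
        print(f"{name:15} min={min(row):+d} total={sum(row):+d}")
```

With these numbers, LaTeX comes first under both rules: it is the only format whose minimum is not -1, and its total of +3 is also the highest.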

I use the minimum in evaluation, rather than the average or total score, because what you notice most when you're working with something is usually what's most annoying about it. Or maybe that's just me... But what do these numbers actually mean? In no particular order:

Binary file formats don't work well with version control systems, since the latter use textual differencing to reconcile changes between versions or by concurrent editors. This rules out the default formats used by Microsoft and OpenOffice.
Machine-generated XML doesn't fare any better, since the differencing tools used in version control systems ignore the semantics ("element inserted") and become confused by the representation ("18 lines changed"). This rules out various XML-based options for Word and OO.
In contrast, XML or HTML that has been written using a plain old text editor usually has line breaks in useful places (i.e., more of the semantics is reflected in the representation) so diff and merge work much better. On the other hand, if you're using a POTE, 20-40% of your keystrokes go into markup (all those angle brackets and attributes) rather than content. WYSIWYG XML/HTML editors help a bit (I'm using the one built into WordPress right now), but most generate the same tangled diff-hostile output as the options dismissed above. With respect to particular formats:
- "Real" DocBook is a lot of work to produce. O'Reilly's DocBook Lite (a subset of the official format) is less effort, but there are still a lot of angle brackets to type in—I haven't yet found an editor that will let me type Ctrl-B and switch to DocBook-compliant bolding, for example.
- Homebrew XML markups, like the one used by Pragmatic, all seem to converge on the features of DocBook Lite. There's also the problem of finding (or building), tweaking, and maintaining tools to produce the end result. (I created my own format, and built my own tools, for Version 2 of the course; won't make that mistake again.)
- Plain old HTML has all the disadvantages of homebrew XML markup, but does have the advantage of being viewable without a compilation step, so long as you don't care about numbering, cross-references, etc. For that, you need tools, which need to be created, maintained, and tweaked.
- Various HTML-based slideshow formats, like S5, add some semantic information to plain old HTML that a bit of in-browser JavaScript can use to produce PowerPoint-style effects. Numbering and cross-referencing still need tools, though, and S5 and various follow-ons are mostly orphaned these days.
Wiki text: easy to type in (that's the whole point), and plays well with version control, but (a) it needs processing tools (again), and (b) the degree of control over markup is usually fairly limited. That said, Wiki Creole and reStructuredText are appealing: there are lots of compilation/conversion tools for both. The downside is that both actually require compilation: so far as I can tell, there isn't a WYSIWYG editor for either that is still being maintained. (Update: there may be one for reST: I'd welcome input from anyone who has used it.)
LaTeX: ah, LaTeX, my old nemesis. It has been a while, hasn't it? It plays nicely with version control, handles cross-referencing, and gives users fine control over layout (very fine control, if you want it), and there is even a WYSIWYG editor. On the downside, its syntax is complicated, but I've already mastered it, and so have many other scientists. More importantly, though, my past attempts to produce pretty HTML from LaTeX using LaTeX2HTML and plasTeX have been frustrating.
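
To make the point about diffing concrete, here is a small Python sketch that uses the standard library's difflib as a stand-in for the line-based differencing a version control system does. The XML snippets and the function name changed_lines are invented for illustration. The same one-paragraph insertion is applied to a hand-written file with sensible line breaks and to a "machine-generated" file that is all one line.

```python
import difflib


def changed_lines(before, after):
    """Return the lines a line-based diff reports as removed or added."""
    diff = difflib.ndiff(before.splitlines(), after.splitlines())
    # ndiff prefixes removed lines with "- " and added lines with "+ ";
    # unchanged lines ("  ") and hint lines ("? ") are skipped.
    return [line[2:] for line in diff if line.startswith(("- ", "+ "))]


# Hand-written XML: one element per line.
hand_before = """<section>
  <title>Formats</title>
  <para>Text is easy to diff.</para>
</section>"""
hand_after = """<section>
  <title>Formats</title>
  <para>Text is easy to diff.</para>
  <para>Binary files are not.</para>
</section>"""

# The same two documents as a tool might write them: no line breaks at all.
machine_before = "".join(line.strip() for line in hand_before.splitlines())
machine_after = "".join(line.strip() for line in hand_after.splitlines())

hand_changes = changed_lines(hand_before, hand_after)
machine_changes = changed_lines(machine_before, machine_after)

if __name__ == "__main__":
    print("hand-written:", hand_changes)
    print("machine-generated:", machine_changes)
```

For the hand-written version, the diff reports only the new paragraph line. For the single-line version, it reports the entire old document as removed and the entire new one as added, which is the "representation, not semantics" problem in miniature.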

So, does that mean LaTeX is the right answer? My scoring says it is. What do you think?