An Exercise With Sets and Dictionaries

This post originally appeared on the Software Carpentry website.

You are working for a nanotechnology company that prides itself on manufacturing some of the finest molecules in the world. Your job is to rewrite parts of their ordering system, which keeps track of what molecules they can actually make. Before trying this exercise, please review:

Introduction
Storage
Dictionaries
Examples
Nanotech Example

Submit your work by mailing Greg:

your final program,
the input file(s) you used to test it, and
a shell script that runs all of your tests.

1. Reading

Your company stores information about molecules in files that contain formulas and names, one per line, like this:

# Molecular formulas and names
# $Revision: 4738$

chlorine : Cl*2
silver nitrate: Ag.N.O*3
sodium chloride :Na.Cl

More specifically:

Lines may be blank (in which case they are ignored).
Lines starting with '#' are comments (which are also ignored).
Each line of actual data has a molecule name, a colon, and a molecular formula. There may or may not be spaces around the colon.
Each formula has one or more atom-count values, separated by '.'
Each atom-count consists of an atomic symbol (which is either a single upper-case letter, or an upper-case letter followed by a lower-case letter), which may be followed by '*' and an integer greater than 1. If there is no count (i.e., if the '*' and integer are missing), the count is 1.
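To make the atom-count rules concrete, here is a minimal sketch that parses a single formula using string splitting only. The helper name parse_formula is made up for illustration (the exercise does not ask for it), and the sketch assumes the formula is well formed and that no symbol appears twice.

```python
def parse_formula(formula):
    """Turn a formula such as 'Ag.N.O*3' into a {symbol: count} dictionary."""
    counts = {}
    for part in formula.strip().split('.'):
        if '*' in part:
            # An explicit count follows the '*'.
            symbol, number = part.split('*')
            counts[symbol.strip()] = int(number)
        else:
            # No '*' means the count defaults to 1.
            counts[part.strip()] = 1
    return counts


print(parse_formula('Ag.N.O*3'))
print(parse_formula('Cl*2'))
```

A helper like this keeps the per-formula logic separate from the per-line logic (blank lines, comments, and the colon), which makes each piece easier to test.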

Write a function called read_molecules that takes a handle to an open file as its only argument, reads everything from that file, and returns a dictionary containing all the formulas in that file. (Here, "a handle to an open file" means either sys.stdin or the result of using open(filename, 'r') to open a file; in Python 2, file(filename, 'r') also works.) The result dictionary's keys should be the names of the molecules with leading and trailing whitespace removed. Its values should themselves be dictionaries that map atomic symbols to counts. For example, if the data shown above is contained in the file molecules.mol, then this Python:

reader = open('molecules.mol', 'r')
data = read_molecules(reader)
reader.close()
print(data)

should produce something like:

{
 'chlorine'        : {'Cl' : 2},
 'silver nitrate'  : {'Ag' : 1, 'N' : 1, 'O' : 3},
 'sodium chloride' : {'Na' : 1, 'Cl' : 1}
}

Note: if your tutorial group has already covered regular expressions, use them for this part of the exercise. If you have not yet met regular expressions, use string splitting instead.
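If you are taking the regular-expression route, one possible shape for read_molecules is sketched below. Treat it as a sketch under stated assumptions rather than a reference solution: it is written for Python 3, it uses io.StringIO as a stand-in for an open file, the names ATOM_COUNT and SAMPLE are invented for the example, and raising ValueError on a malformed atom-count is my own choice, since the exercise does not say what to do with bad input.

```python
import io
import re

# One atom-count: an upper-case letter, an optional lower-case letter,
# then optionally '*' and an integer. This sketch does not check that
# the integer is greater than 1.
ATOM_COUNT = re.compile(r'^([A-Z][a-z]?)(?:\*([0-9]+))?$')

SAMPLE = u"""# Molecular formulas and names
# $Revision: 4738$

chlorine : Cl*2
silver nitrate: Ag.N.O*3
sodium chloride :Na.Cl
"""


def read_molecules(reader):
    """Return {name: {symbol: count}} for every data line in reader."""
    molecules = {}
    for line in reader:
        line = line.strip()
        if not line or line.startswith('#'):
            continue  # blank lines and comments are ignored
        name, formula = line.split(':', 1)
        atoms = {}
        for part in formula.split('.'):
            match = ATOM_COUNT.match(part.strip())
            if match is None:
                # Assumption: complain loudly about malformed input.
                raise ValueError('bad atom-count: %r' % part)
            symbol, count = match.groups()
            atoms[symbol] = int(count) if count else 1
        molecules[name.strip()] = atoms
    return molecules


data = read_molecules(io.StringIO(SAMPLE))
print(data['silver nitrate'])
```

Because the function only iterates over its argument line by line, the same code works with sys.stdin, a file opened with open, or an in-memory StringIO used for testing.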

2. Merging

Write a function called merge_molecules that takes two dictionaries like the one shown above and produces a third dictionary that contains the contents of both according to the following rules:

If a molecule appears in only one of the input dictionaries, it also appears in the result.
If a molecule appears in both input dictionaries with the same formula, one copy of it appears in the result.
If a molecule appears in both input dictionaries with different formulas, it is not copied to the output dictionary at all. (This kind of "silent failure" is actually a really bad practice, but we won't see what we should do until we discuss exceptions.)

Your function must not modify either of its input arguments: the original dictionaries must be left as they were.
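The three rules and the "do not modify the inputs" requirement can be sketched as follows. This is one possible approach, not the only one; the sample dictionaries first and second are invented for illustration, and the second one deliberately carries a wrong formula for sodium chloride so that the conflict rule is exercised.

```python
def merge_molecules(left, right):
    """Combine two molecule dictionaries without modifying either one."""
    result = {}
    # Walk each dictionary in turn, comparing against the other one.
    for source, other in ((left, right), (right, left)):
        for name, formula in source.items():
            if name in other and other[name] != formula:
                continue  # conflicting formulas: leave the molecule out
            # Copy the inner dictionary so the result shares no state
            # with the inputs.
            result[name] = dict(formula)
    return result


first = {'chlorine': {'Cl': 2}, 'sodium chloride': {'Na': 1, 'Cl': 1}}
second = {
    'chlorine': {'Cl': 2},
    'sodium chloride': {'Na': 2, 'Cl': 1},  # deliberately conflicting
    'silver nitrate': {'Ag': 1, 'N': 1, 'O': 3},
}
merged = merge_molecules(first, second)
print(sorted(merged))
```

Copying each inner dictionary matters: if the result reused the same inner dictionaries as the inputs, a later change to the result would quietly change the originals too.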

3. What Can We Make?

Write a function called can_produce that takes a dictionary of molecular formulas (like the one shown above) and the atomic symbol of one kind of atom, and returns a set containing the names of all the molecules that contain that atom, i.e., the molecules we might be able to make with it. For example:

reader = open('molecules.mol', 'r')
data = read_molecules(reader)
reader.close()
print(can_produce(data, 'Cl'))

should print something like:

set(['chlorine', 'sodium chloride'])
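A minimal sketch of can_produce, assuming the nested-dictionary layout described in part 1, could look like this. The data dictionary is typed in by hand here so that the example runs without any input file.

```python
def can_produce(molecules, symbol):
    """Return the set of molecule names whose formula contains symbol."""
    return set(name for name, formula in molecules.items()
               if symbol in formula)


data = {
    'chlorine': {'Cl': 2},
    'silver nitrate': {'Ag': 1, 'N': 1, 'O': 3},
    'sodium chloride': {'Na': 1, 'Cl': 1},
}
print(sorted(can_produce(data, 'Cl')))
print(sorted(can_produce(data, 'C')))
```

Note that the test is made against the keys of each inner dictionary, not against the formula text, so asking for 'C' does not accidentally match 'Cl'.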

4. Putting the Pieces Together

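Write a program called produce.py that uses these three functions to tell us which molecules we could make using a particular kind of atom, based on the contents of one or more molecular formula files. The atomic symbol is the first command-line argument; any remaining arguments are the names of files to read, and if there are none, the program reads from standard input. For example: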

$ python produce.py Cl < molecules.mol

prints:

chlorine
sodium chloride

while:

$ python produce.py Na salts.mol organics.mol alloys.mol

reads and merges all the formulas in the three files salts.mol, organics.mol, and alloys.mol, and prints a list of all the molecules from those files that contain sodium.
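To show how the pieces might fit together, here is a rough sketch of produce.py. It is an illustration under several assumptions: it targets Python 3, the three functions are compact stand-ins (your own versions from parts 1 to 3 would replace them), main and its stdin and stdout parameters are invented to make the program easy to test, and the names are printed in sorted order, which the exercise does not require.

```python
import sys


def read_molecules(reader):
    """Compact stand-in: parse 'name : formula' lines by string splitting."""
    result = {}
    for line in reader:
        line = line.strip()
        if not line or line.startswith('#'):
            continue
        name, formula = line.split(':', 1)
        atoms = {}
        for part in formula.strip().split('.'):
            symbol, _, count = part.partition('*')
            atoms[symbol.strip()] = int(count) if count else 1
        result[name.strip()] = atoms
    return result


def merge_molecules(left, right):
    """Compact stand-in: keep molecules whose formulas do not conflict."""
    result = {}
    for source, other in ((left, right), (right, left)):
        for name, formula in source.items():
            if name not in other or other[name] == formula:
                result[name] = dict(formula)
    return result


def can_produce(molecules, symbol):
    """Compact stand-in: names of molecules that contain symbol."""
    return set(name for name in molecules if symbol in molecules[name])


def main(args, stdin=sys.stdin, stdout=sys.stdout):
    """args is [symbol, filename, ...]; read stdin if no files are given."""
    symbol, filenames = args[0], args[1:]
    merged = {}
    if not filenames:
        merged = read_molecules(stdin)
    for filename in filenames:
        with open(filename, 'r') as reader:
            merged = merge_molecules(merged, read_molecules(reader))
    for name in sorted(can_produce(merged, symbol)):
        stdout.write(name + '\n')


# Only run when an atomic symbol was actually supplied.
if __name__ == '__main__' and len(sys.argv) > 1:
    main(sys.argv[1:])
```

One design point to think about: this sketch folds the files in one at a time, so a molecule dropped because two files disagree can reappear if a later file lists it again. The exercise does not say how that case should be handled.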