Key Points
This post originally appeared on the Software Carpentry website.
On the flight back from Vancouver yesterday, I finally did what I should have done eight months ago and compiled the key points from our core lesson content. The results are presented below, broken down by lesson and topic; going forward, we're going to use something like this as a basis for defining what Software Carpentry is, and what workshop attendees can expect to learn.
The Shell
What and Why
- The shell is a program whose primary purpose is to read commands, run programs, and display results.
Files and Directories
- The file system is responsible for managing information on disk.
- Information is stored in files, which are stored in directories (folders).
- Directories can also store other directories, which forms a directory tree.
- `/` on its own is the root directory of the whole filesystem.
- A relative path specifies a location starting from the current location.
- An absolute path specifies a location from the root of the filesystem.
- Directory names in a path are separated with '/' on Unix, but '\' on Windows.
- '..' means "the directory above the current one"; '.' on its own means "the current directory".
- Most files' names are `something.extension`; the extension isn't required, and doesn't guarantee anything, but is normally used to indicate the type of data in the file.
- `cd path` changes the current working directory.
- `ls path` prints a listing of a specific file or directory; `ls` on its own lists the current working directory.
- `pwd` prints the user's current working directory (current default location in the filesystem).
- `whoami` shows the user's current identity.
- Most commands take options (flags) which begin with a '-'.
Creating Things
- Unix documentation uses '^A' to mean "control-A".
- The shell does not have a trash bin: once something is deleted, it's really gone.
- `mkdir path` creates a new directory.
- `cp old new` copies a file.
- `mv old new` moves (renames) a file or directory.
- `nano` is a very simple text editor; please use something else for real work.
- `rm path` removes (deletes) a file.
- `rmdir path` removes (deletes) an empty directory.
Pipes and Filters
- '*' is a wildcard pattern that matches zero or more characters in a pathname.
- '?' is a wildcard pattern that matches any single character.
- The shell matches wildcards before running commands.
- `command > file` redirects a command's output to a file.
- `first | second` is a pipeline: the output of the first command is used as the input to the second.
- The best way to use the shell is to use pipes to combine simple single-purpose programs (filters).
- `cat` displays the contents of its inputs.
- `head` displays the first few lines of its input.
- `sort` sorts its inputs.
- `tail` displays the last few lines of its input.
- `wc` counts lines, words, and characters in its inputs.
Loops
- Use a `for` loop to repeat commands once for every thing in a list.
- Every `for` loop needs a variable to refer to the current "thing".
- Use `$name` to expand a variable (i.e., get its value).
- Do not use spaces, quotes, or wildcard characters such as '*' or '?' in filenames, as it complicates variable expansion.
- Give files consistent names that are easy to match with wildcard patterns to make it easy to select them for looping.
- Use the up-arrow key to scroll up through previous commands to edit and repeat them.
- Use `history` to display recent commands, and `!number` to repeat a command by number.
- Use ^C (control-C) to terminate a running command.
Shell Scripts
- Save commands in files (usually called shell scripts) for re-use.
- Use `bash filename` to run saved commands.
- `$*` refers to all of a shell script's command-line arguments.
- `$1`, `$2`, etc., refer to specified command-line arguments.
- Letting users decide what files to process is more flexible and more consistent with built-in Unix commands.
Finding Things
- Everything is stored as bytes, but the bytes in binary files do not represent characters.
- Use nested loops to run commands for every combination of two lists of things.
- Use '\' to break one logical line into several physical lines.
- Use parentheses '()' to keep things combined.
- Use `$(command)` to insert a command's output in place.
- `find` finds files with specific properties that match patterns.
- `grep` selects lines in files that match patterns.
- `man command` displays the manual page for a given command.
Version Control with Subversion
- Version control is a better way to manage shared files than email or shared folders.
- The master copy is stored in a repository.
- Nobody ever edits the master directory: instead, each person edits a local working copy.
- People share changes by committing them to the master or updating their local copy from the master.
- The version control system prevents people from overwriting each other's work by forcing them to merge concurrent changes before committing.
- It also keeps a complete history of changes made to the master so that old versions can be recovered reliably.
- Version control systems work best with text files, but can also handle binary files such as images and Word documents.
Basic Use
- Every repository is identified by a URL.
- Working copies of different repositories may not overlap.
- Each change to the master copy is identified by a unique revision number.
- Revisions identify snapshots of the entire repository, not changes to individual files.
- Each change should be commented to make the history more readable.
- Commits are transactions: either all changes are successfully committed, or none are.
- The basic workflow for version control is update-change-commit.
- `svn add things` tells Subversion to start managing particular files or directories.
- `svn checkout url` checks out a working copy of a repository.
- `svn commit -m "message" things` sends changes to the repository.
- `svn diff` compares the current state of a working copy to the state after the most recent update.
- `svn diff -r HEAD` compares the current state of a working copy to the state of the master copy.
- `svn log` shows the history of a working copy.
- `svn status` shows the status of a working copy.
- `svn update` updates a working copy from the repository.
Merging Conflicts
- Conflicts must be resolved before a commit can be completed.
- Subversion puts markers in text files to show regions of conflict.
- For each conflicted file, Subversion creates auxiliary files containing the common parent, the master version, and the local version.
- `svn resolve files` tells Subversion that conflicts have been resolved.
Recovering Old Versions
- Old versions of files can be recovered by merging their old state with their current state.
- Recovering an old version of a file does not erase the intervening changes.
- Use branches to support parallel independent development.
- `svn merge` merges two revisions of a file.
- `svn revert` undoes local changes to files.
Setting up a Repository
- Repositories can be hosted locally, on local (departmental) servers, on hosting services, or on their owners' own domains.
- `svnadmin create name` creates a new repository.
Provenance
- `$Keyword:$` in a file can be filled in with a property value each time the file is committed.
- Put version numbers in programs' output to establish provenance for data.
- `svn propset svn:keywords property files` tells Subversion to start filling in property values.
Basic Programming
Basic Operations
- Use '=' to assign a value to a variable.
- Assigning to one variable does not change the values associated with other variables.
- Use `print` to display values.
- Variables are created when values are assigned to them.
- Variables cannot be used until they have been created.
- Addition ('+'), subtraction ('-'), and multiplication ('*') work as usual in Python.
- Use meaningful, descriptive names for variables.
Creating Programs
- Store programs in files whose names end in `.py` and run them with `python name.py`.
Types
- The most commonly used data types in Python are integers (`int`), floating-point numbers (`float`), and strings (`str`).
- Strings can start and end with either single quote (') or double quote (").
- Division ('/') produces an `int` result when given `int` values: one or both arguments must be `float` to get a `float` result.
- "Adding" strings concatenates them, multiplying strings by numbers repeats them.
- Strings and numbers cannot be added because the behavior is ambiguous: convert one to the other type first.
- Variables do not have types, but values do.
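The string rules above can be sketched as follows. This is only an illustration, written for Python 3, and the variable names are made up for the example.

```python
# "Adding" strings concatenates them; multiplying by an integer repeats them.
greeting = "ab" + "cd"
repeated = "ab" * 3

# A string and a number cannot be added directly, so convert one of them first.
count = 3
as_text = "count: " + str(count)   # number converted to a string
as_number = int("39") + count      # string converted to a number

print(greeting, repeated, as_text, as_number)
```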
Reading Files
- Data is either in memory, on disk, or far away.
- Most things in Python are objects, and have attached functions called methods.
- When lines are read from files, Python keeps their end-of-line characters.
- Use `str.strip` to remove leading and trailing whitespace (including end-of-line characters).
- Use `file(name, mode)` to open a file for reading ('r'), writing ('w'), or appending ('a').
- Opening a file for writing erases any existing content.
- Use `file.readline` to read a line from a file.
- Use `file.close` to close an open file.
- Use `print >> file` to print to a file.
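A minimal sketch of the file operations above, assuming Python 3, where `open(name, mode)` takes the place of `file(name, mode)` and `print(..., file=...)` takes the place of `print >> file`. The file name is hypothetical and lives in a temporary directory.

```python
import os
import tempfile

# Hypothetical file name inside a temporary directory, so nothing real is overwritten.
path = os.path.join(tempfile.mkdtemp(), "example.txt")

# Opening for writing ('w') erases any existing content.
writer = open(path, "w")
print("first line", file=writer)   # Python 3 spelling of "print >> file"
print("second line", file=writer)
writer.close()

# A line read back from the file keeps its end-of-line character...
reader = open(path, "r")
raw = reader.readline()
reader.close()

# ...so strip it before using the text.
clean = raw.strip()
print(repr(raw), repr(clean))
```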
Standard Input and Output
- The operating system automatically gives every program three open "files" called standard input, standard output, and standard error.
- Standard input gets data from the keyboard, from a file when redirected with '<', or from the previous stage in a pipeline with '|'.
- Standard output writes data to the screen, to a file when redirected with '>', or to the next stage in a pipeline with '|'.
- Standard error also writes data to the screen, and is not redirected by '>' or '|'.
- Use `import library` to import a library.
- Use `library.thing` to refer to something imported from a library.
- The `sys` library provides open "files" called `sys.stdin` and `sys.stdout` for standard input and output.
Repeating Things
- Use `for variable in something:` to loop over the parts of something.
- The body of a loop must be indented consistently.
- The parts of a string are its characters; the parts of a file are its lines.
Making Choices
- Use `if test` to do something only when a condition is true.
- Use `else` to do something when a preceding `if` test is not true.
- The body of an `if` or `else` must be indented consistently.
- Combine tests using `and` and `or`.
- Use '<', '<=', '>=', and '>' to compare numbers or strings.
- Use '==' to test for equality and '!=' to test for inequality.
- Use `variable += expression` as a shorthand for `variable = variable + expression` (and similarly for other arithmetic operations).
Flags
- The two Boolean values `True` and `False` can be assigned to variables like any other values.
- Programs often use Boolean values as flags to indicate whether something has happened yet or not.
Reading Data Files
- Use `str.split()` to split a string into pieces on whitespace.
- Values can be assigned to any number of variables at once.
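As a small illustration of these two points, the sketch below splits one record and assigns the pieces in a single statement. The record layout (date, site, reading) is invented for the example.

```python
# One whitespace-separated record; the field layout is made up for illustration.
line = "2012-05-14   DR-1   7.25\n"

# split() with no arguments splits on any run of whitespace, including the newline.
fields = line.split()

# Assign all three pieces to variables at once.
date, site, reading = fields
reading = float(reading)  # the pieces are strings until converted

print(date, site, reading)
```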
Provenance Revisited
- Put version numbers in programs' output to establish provenance for data.
Lists
- Use `[value, value, ...]` to create a list of values.
- `for` loops process the elements of a list, in order.
- `len(list)` returns the length of a list.
- `[]` is an empty list with no values.
More About Lists
- Lists are mutable: they can be changed in place.
- Use `list.append(value)` to append something to the end of a list.
- Use `list[index]` to access a list element by location.
- The index of the first element of a list is 0; the index of the last element is `len(list)-1`.
- Negative indices count backward from the end of the list, so `list[-1]` is the last element.
- Trying to access an element with an out-of-bounds index is an error.
- `range(number)` produces the list of numbers `[0, 1, ..., number-1]`.
- `range(len(list))` produces the list of legal indices for `list`.
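The indexing rules above can be checked with a short sketch. It assumes Python 3, where `range` produces its numbers lazily, so the example wraps it in `list()` to show them.

```python
values = [10, 20, 30]
values.append(40)                 # lists are mutable: this changes the list in place

first = values[0]                 # indices start at 0
last = values[len(values) - 1]    # the last legal index is len(list)-1
also_last = values[-1]            # negative indices count back from the end

# The legal indices for this list.
indices = list(range(len(values)))

# An out-of-bounds index raises an error rather than returning a value.
try:
    values[len(values)]
    out_of_bounds = False
except IndexError:
    out_of_bounds = True

print(first, last, also_last, indices, out_of_bounds)
```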
Checking and Smoothing Data
- `range(start, end)` creates the list of numbers from `start` up to, but not including, `end`.
- `range(start, end, stride)` creates the list of numbers from `start` up to `end` in steps of `stride`.
Nesting Loops
- Use nested loops to do things for combinations of things.
- Make the range of the inner loop depend on the state of the outer loop to automatically adjust how much data is processed.
- Use `min(...)` and `max(...)` to find the minimum and maximum of any number of values.
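One way to picture an inner loop whose range depends on the outer loop is a pairwise comparison, sketched below with made-up readings. Starting the inner loop just past the outer index means each pair is visited exactly once.

```python
readings = [3, 8, 5, 1]  # illustrative data

pairs = []
largest_gap = 0
for i in range(len(readings)):
    # The inner range starts after i, so it shrinks as the outer loop advances.
    for j in range(i + 1, len(readings)):
        pairs.append((readings[i], readings[j]))
        gap = max(readings[i], readings[j]) - min(readings[i], readings[j])
        largest_gap = max(largest_gap, gap)

print(len(pairs), largest_gap)
```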
Nesting Lists
- Use nested lists to store multi-dimensional data or values that have regular internal structure (such as XYZ coordinates).
- Use `list_of_lists[first]` to access an entire sub-list.
- Use `list_of_lists[first][second]` to access a particular element of a sub-list.
- Use nested loops to process nested lists.
Aliasing
- Several variables can alias the same data.
- If that data is mutable (e.g., a list), a change made through one variable is visible through all other aliases.
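A short sketch of aliasing with a list, contrasted with an explicit copy (the names are illustrative):

```python
original = [1, 2, 3]
alias = original            # a second name for the same list, not a copy
alias.append(4)             # a change made through one name...

separate = list(original)   # an explicit copy is independent of the original
separate.append(5)

print(original)             # ...is visible through the other name
print(separate)
```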
Functions and Libraries
How Functions Work
- Define a function using `def name(...)`.
- The body of a function must be indented.
- Use `name(...)` to call a function.
- Use `return` to return a value from a function.
- The values passed into a function are assigned to its parameters in left-to-right order.
- Function calls are recorded on a call stack.
- Every function call creates a new stack frame.
- The variables in a stack frame are discarded when the function call completes.
- Grouping operations in functions makes code easier to understand and re-use.
Global Variables
- Every function always has access to variables defined in the global scope.
- Programmers often write constants' names in upper case to make their intention easier to recognize.
- Functions should not communicate by modifying global variables.
Multiple Arguments
- A function may take any number of arguments.
- Define default values for parameters to make functions more convenient to use.
- Defining default values only makes sense when there are sensible defaults.
Returning Values
- A function may return values at any point.
- A function should have zero or more `return` statements at its start to handle special cases, and then one at the end to handle the general case.
- "Accidentally" correct behavior is hard to understand.
- If a function ends without an explicit `return`, it returns `None`.
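The pattern of early returns for special cases followed by one general return might look like the sketch below. The functions `average` and `report` are hypothetical examples, not part of the lesson code.

```python
def average(values):
    """Return the mean of a list of numbers, or None for an empty list."""
    # Special case handled up front with an early return.
    if len(values) == 0:
        return None
    # General case handled by a single return at the end.
    return sum(values) / float(len(values))


def report(values):
    """Print the average; there is no return statement, so the result is None."""
    print("average:", average(values))


print(average([1, 2, 3]))
print(report([4, 6]))
```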
Aliasing
- Values are actually passed into functions by reference, which means that they are aliased.
- Aliasing means that changes made to a mutable object like a list inside a function are visible after the function call completes.
Libraries
- Any Python file can be imported as a library.
- The code in a file is executed when it is imported.
- Every Python file is a scope, just like every function.
Standard Libraries
- Use `from library import something` to import something under its own name.
- Use `from library import something as alias` to import something under the name `alias`.
- `from library import *` imports everything in `library` under its own name, which is usually a bad idea.
- The `math` library defines common mathematical constants and functions.
- The system library `sys` defines constants and functions used in the interpreter itself.
- `sys.argv` is a list of all the command-line arguments used to run the program.
- `sys.argv[0]` is the program's name.
- `sys.argv[1:]` is everything except the program's name.
Building Filters
- If a program isn't told what files to process, it should process standard input.
- Programs that explicitly test values' types are more brittle than ones that rely on those values' common properties.
- The variable `__name__` is assigned the string `'__main__'` in a module when that module is the main program, and the module's name when it is imported by something else.
- If the first thing in a module or function is a string that isn't assigned to a variable, that string is used as the module or function's documentation.
- Use `help(name)` to display the documentation for something.
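A sketch of a filter built along these lines is shown below. The function names `count_lines` and `run` are hypothetical, and an in-memory `io.StringIO` stands in for standard input so the example runs without typed input.

```python
import io
import sys


def count_lines(stream):
    """Return the number of lines in an open file-like object."""
    total = 0
    for line in stream:
        total += 1
    return total


def run(filenames, stdin=sys.stdin):
    """Count lines in each named file, or in stdin if no files are named."""
    if len(filenames) == 0:
        return [("<stdin>", count_lines(stdin))]
    results = []
    for filename in filenames:
        with open(filename, "r") as reader:
            results.append((filename, count_lines(reader)))
    return results


# Demonstration: no files named, so the stand-in for standard input is used.
fake_stdin = io.StringIO("one\ntwo\nthree\n")
print(run([], stdin=fake_stdin))

# The docstring is what help(count_lines) would display.
print(count_lines.__doc__)
```

In a real script, the last lines would typically be a check that `__name__` equals `'__main__'` followed by a call such as `run(sys.argv[1:])`, so that importing the file as a library does not start processing input.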
Functions as Objects
- A function is just another kind of data.
- Defining a function creates a function object and assigns it to a variable.
- Functions can be assigned to other variables, put in lists, and passed as parameters.
- Writing higher-order functions helps eliminate redundancy in programs.
- Use `filter` to select values from a list.
- Use `map` to apply a function to each element of a list.
- Use `reduce` to combine the elements of a list.
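The three higher-order functions can be chained as in this sketch. It assumes Python 3, where `filter` and `map` return iterators (hence the `list()` calls) and `reduce` is imported from `functools`. The helper functions are made up for the example.

```python
from functools import reduce  # a builtin in Python 2; lives in functools in Python 3


def is_even(n):
    return n % 2 == 0


def square(n):
    return n * n


def add(a, b):
    return a + b


numbers = [1, 2, 3, 4, 5, 6]
evens = list(filter(is_even, numbers))   # select values
squares = list(map(square, evens))       # apply a function to each element
total = reduce(add, squares)             # combine the elements into one value

print(evens, squares, total)
```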
Databases
- A relational database stores information in tables with fields and records.
- A database manager is a program that manipulates a database.
- The commands or queries given to a database manager are usually written in a specialized language called SQL.
Selecting
- SQL is case insensitive.
- The rows and columns of a database table aren't stored in any particular order.
- Use `SELECT fields FROM table` to get all the values for specific fields from a single table.
- Use `SELECT * FROM table` to select everything from a table.
Removing Duplicates
- Use `SELECT DISTINCT` to eliminate duplicates from a query's output.
Calculating New Values
- Use expressions in place of field names to calculate per-record values.
Filtering
- Use `WHERE test` in a query to filter records based on logical tests.
- Use `AND` and `OR` to combine tests in filters.
- Use `IN` to test whether a value is in a set.
- Build up queries a bit at a time, and test them against small data sets.
Sorting
- Use `ORDER BY field ASC` (or `DESC`) to order a query's results in ascending (or descending) order.
Aggregation
- Use aggregation functions like `SUM` and `MAX` to combine many query results into a single value.
- Use the `COUNT` function to count the number of results.
- If some fields are aggregated, and others are not, the database manager displays an arbitrary result for the unaggregated field.
- Use `GROUP BY` to group values before aggregation.
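Grouping before aggregation can be tried out with Python's standard `sqlite3` module and an in-memory database, as sketched below. The `readings` table and its contents are invented for the example.

```python
import sqlite3

# In-memory database holding a made-up table of readings.
connection = sqlite3.connect(":memory:")
cursor = connection.cursor()
cursor.execute("CREATE TABLE readings(site TEXT, value REAL)")
cursor.executemany(
    "INSERT INTO readings VALUES(?, ?)",
    [("DR-1", 2.0), ("DR-1", 4.0), ("DR-3", 7.5)],
)

# One row of aggregates per site, ordered so the output is predictable.
cursor.execute(
    "SELECT site, COUNT(*), SUM(value), MAX(value) "
    "FROM readings GROUP BY site ORDER BY site ASC"
)
summary = cursor.fetchall()
connection.close()

print(summary)
```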
Database Design
- Each field in a database table should store a single atomic value.
- No fact in a database should ever be duplicated.
Combining Data
- Use `JOIN` to create all possible combinations of records from two or more tables.
- Use `JOIN tables ON test` to keep only those combinations that pass some test.
- Use `table.field` to specify a particular field of a particular table.
- Use aliases to make queries more readable.
- Every record in a table should be uniquely identified by the value of its primary key.
Self Join
- Use a self join to combine a table with itself.
Missing Data
- Use `NULL` in place of missing information.
- Almost every operation involving `NULL` produces `NULL` as a result.
- Test for nulls using `IS NULL` and `IS NOT NULL`.
- Most aggregation functions skip nulls when combining values.
Nested Queries
- Use nested queries to create temporary sets of results for further querying.
- Use nested queries to subtract unwanted results from all results to leave desired results.
Creating and Modifying Tables
- Use `CREATE TABLE name(...)` to create a table.
- Use `DROP TABLE name` to erase a table.
- Specify field names and types when creating tables.
- Specify `PRIMARY KEY`, `NOT NULL`, and other constraints when creating tables.
- Use `INSERT INTO table VALUES(...)` to add records to a table.
- Use `DELETE FROM table WHERE test` to erase records from a table.
- Maintain referential integrity when creating or deleting information.
Transactions
- Place operations in a transaction to ensure that they appear to be atomic, consistent, isolated, and durable.
Programming With Databases
- Most applications that use databases embed SQL in a general-purpose programming language.
- Database libraries use connections and cursors to manage interactions.
- Programs can fetch all results at once, or a few results at a time.
- If queries are constructed dynamically using input from users, malicious users may be able to inject their own commands into the queries.
- Dynamically-constructed queries can use the database library's placeholder mechanism (parameter substitution) instead of string formatting to safeguard against such attacks.
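As one possible illustration, the sketch below uses Python's standard `sqlite3` module, whose `?` placeholders keep user input separate from the SQL text. The `person` table, its rows, and the `find_family` helper are assumptions made for the example.

```python
import sqlite3

# Connection and cursor for an in-memory database with a made-up table.
connection = sqlite3.connect(":memory:")
cursor = connection.cursor()
cursor.execute("CREATE TABLE person(ident TEXT PRIMARY KEY, family TEXT NOT NULL)")
cursor.executemany(
    "INSERT INTO person VALUES(?, ?)",
    [("dyer", "Dyer"), ("pb", "Pabodie")],
)


def find_family(cursor, ident):
    """Look up a family name; the ? placeholder keeps user input out of the SQL text."""
    cursor.execute("SELECT family FROM person WHERE ident = ?", (ident,))
    return cursor.fetchall()   # fetch all results at once


safe = find_family(cursor, "dyer")

# An injection attempt is treated as an ordinary string that matches nothing.
attack = find_family(cursor, "dyer' OR '1'='1")

print(safe, attack)
```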
Number Crunching with NumPy
- High-level libraries are usually more efficient for numerical programming than hand-coded loops.
- Most such libraries use a data-parallel programming model.
- Arrays can be used as matrices, as physical grids, or to store general multi-dimensional data.
Basics
- NumPy is a high-level array library for Python.
- Use `import numpy` to import NumPy into a program.
- Use `numpy.array(values)` to create an array.
- Initial values must be provided in a list (or a list of lists).
- NumPy arrays store homogeneous values whose type is identified by `array.dtype`.
- Use `old.astype(newtype)` to create a new array with a different type rather than assigning to `dtype`.
- `numpy.zeros` creates a new array filled with 0.
- `numpy.ones` creates a new array filled with 1.
- `numpy.identity` creates a new identity matrix.
- `numpy.empty` creates an array but does not initialize its values (which means they are unpredictable).
- Assigning an array to a variable creates an alias rather than copying the array.
- Use `array.copy` to create a copy of an array.
- Put all array indices in a single set of square brackets, like `array[i0, i1]`.
- `array.shape` is a tuple of the array's size in each dimension.
- `array.size` is the total number of elements in the array.
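A brief sketch of several of these basics, assuming NumPy is installed. The array contents are arbitrary.

```python
import numpy

# Initial values come from a list of lists.
grid = numpy.array([[1, 2, 3], [4, 5, 6]])

as_float = grid.astype(float)   # a new array with a different type
alias = grid                    # another name for the same array
duplicate = grid.copy()         # an independent copy

# All indices go in one set of square brackets; the change is visible via grid too.
alias[0, 1] = 20

print(grid.shape, grid.size, grid.dtype, as_float.dtype)
print(grid[0, 1], duplicate[0, 1])
```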
Storage
- Arrays are stored using descriptors and data blocks.
- Many operations create a new descriptor, but alias the original data block.
- Array elements are stored in row-major order.
- `array.transpose` creates a transposed alias for an array's data.
- `array.ravel` creates a one-dimensional alias for an array's data.
- `array.reshape` creates an arbitrarily-shaped alias for an array's data.
- `array.resize` resizes an array's data in place, filling with zero as necessary.
Indexing
- Arrays can be sliced using `start:end:stride` along each axis.
- Values can be assigned to slices as well as read from them.
- Arrays can be used as subscripts to select items in arbitrary ways.
- Masks containing `True` and `False` can be used to select subsets of elements from arrays.
- Use '&' and '|' (or `logical_and` and `logical_or`) to combine tests when subscripting arrays.
- Use `where`, `choose`, or `select` to select elements or alternatives in a single step.
Linear Algebra
- Addition, multiplication, and other arithmetic operations work on arrays element-by-element.
- Operations involving arrays and scalars combine the scalar with each element of the array.
- `array.dot` performs "real" matrix multiplication.
- `array.sum` calculates sums or partial sums of array elements.
- `array.mean` calculates array averages.
Making Recommendations
- Getting data in the right format for processing often requires more code than actually processing it.
- Data with many gaps should be stored in sparse arrays.
- `numpy.cov` calculates variances and covariances.
The Game of Life
- Padding arrays with fixed elements is an easy way to implement boundary conditions.
- `scipy.signal.convolve` applies a weighted mask to each element of an array.
Quality
Defensive Programming
- Design programs to catch both internal errors and usage errors.
- Use assertions to check whether things that ought to be true in a program actually are.
- Assertions help people understand how programs work.
- Fail early, fail often.
- When bugs are fixed, add assertions to the program to prevent their reappearance.
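Assertions for both usage errors and internal errors might be arranged as in this sketch. The `normalize` function is a hypothetical example.

```python
def normalize(values):
    """Scale a list of non-negative numbers so that they sum to 1.0."""
    # Usage errors: catch bad input before doing any work (fail early).
    assert len(values) > 0, "need at least one value"
    assert all(v >= 0 for v in values), "values must be non-negative"
    total = sum(values)
    assert total > 0, "values must not all be zero"

    result = [v / float(total) for v in values]

    # Internal error check: something that ought to be true afterwards.
    assert abs(sum(result) - 1.0) < 1e-9, "normalized values should sum to 1"
    return result


print(normalize([1, 1, 2]))
```

The assertions also document what the function expects and promises, which is part of how they help readers understand the program.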
Handling Errors
- Use `raise` to raise exceptions.
- Raise exceptions to report errors rather than trying to handle them inline.
- Use `try` and `except` to handle exceptions.
- Catch exceptions where something useful can be done about the underlying problem.
- An exception raised in a function may be caught anywhere in the active call stack.
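The sketch below shows an exception raised in one function, passing through a second, and caught in a third where something useful can be done. All three function names are hypothetical.

```python
def parse_reading(text):
    """Convert text to a non-negative float, raising ValueError otherwise."""
    value = float(text)   # float() itself raises ValueError for non-numeric text
    if value < 0:
        raise ValueError("negative reading: " + text)
    return value


def total_readings(lines):
    """Add up readings; no try/except here, so errors pass straight through."""
    return sum(parse_reading(line) for line in lines)


def summarize(lines):
    """Catch the error here, where a useful message can be produced."""
    try:
        return "total: " + str(total_readings(lines))
    except ValueError as error:
        return "skipped: " + str(error)


print(summarize(["1.5", "2.5"]))
print(summarize(["1.5", "-2"]))
```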
Unit Testing
- Testing cannot prove that a program is correct, but is still worth doing.
- Use a unit testing library like Nose to test short pieces of code.
- Write each test as a function that creates a fixture, executes an operation, and checks the result using assertions.
- Every test should be able to run independently: tests should not depend on one another.
- Focus testing on boundary cases.
- Writing tests helps us design better code by clarifying our intentions.
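Tests in this style are plain functions whose names start with `test_`, which a runner such as Nose can discover and run. The function under test, `running_total`, is a made-up example; each test builds its own fixture and checks a boundary or typical case.

```python
def running_total(values):
    """Return the list of cumulative sums of values."""
    result = []
    total = 0
    for v in values:
        total += v
        result.append(total)
    return result


def test_empty_list():
    fixture = []                      # boundary case: nothing to add
    assert running_total(fixture) == []


def test_single_value():
    fixture = [5]                     # boundary case: one element
    assert running_total(fixture) == [5]


def test_several_values():
    fixture = [1, 2, 3]               # typical case
    assert running_total(fixture) == [1, 3, 6]
```

Because each test creates its own fixture, the tests can run in any order.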
Numbers
- Floating point numbers are approximations to actual values.
- Use tolerances rather than exact equality when comparing floating point values.
- Use integers to count and floating point numbers to measure.
- Most tests should be written in terms of relative error rather than absolute error.
- When testing scientific software, compare results to exact analytic solutions, experimental data, or results from simpler or previously-tested programs.
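A small sketch of comparing floating point values by relative error instead of exact equality. The `relative_error` helper and the tolerance of `1e-9` are illustrative choices, not recommendations from the lesson.

```python
def relative_error(actual, expected):
    """Return |actual - expected| / |expected|; expected must be nonzero."""
    return abs(actual - expected) / abs(expected)


computed = 0.1 + 0.2

# Exact comparison fails because 0.1 and 0.2 are only approximated in binary.
exactly_equal = (computed == 0.3)

# Comparison within a tolerance succeeds.
close_enough = relative_error(computed, 0.3) < 1e-9

print(exactly_equal, close_enough)
```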
Coverage
- Use a coverage analyzer to see which parts of a program have been tested and which have not.
Debugging
- Use an interactive symbolic debugger instead of `print` statements to diagnose problems.
- Set breakpoints to halt the program at interesting points instead of stepping through execution.
- Try to get things right the first time.
- Make sure you know what the program is supposed to do before trying to debug it.
- Make sure the program is actually running the test case you think it is.
- Make the program fail reliably.
- Simplify the test case or the program in order to localize the problem.
- Change one thing at a time.
- Be humble.
Designing Testable Code
- Separating interface from implementation makes code easier to test and re-use.
- Replace some components with simplified versions of themselves in order to simplify testing of other components.
- Do not create arbitrary, variable, or random results, as they are extremely hard to test.
- Isolate interactions with the outside world when writing tests.
Sets and Dictionaries
Sets
- Use sets to store distinct values.
- Create sets using `set()` or `{v1, v2, ...}`.
- Sets are mutable, i.e., they can be updated in place like lists.
- A loop over a set produces each element once, in arbitrary order.
- Use sets to find unique things.
Storage
- Sets are stored in hash tables, which guarantee fast access for arbitrary keys.
- The values in sets must be immutable to prevent hash tables misplacing them.
- Use tuples to store multi-part elements in sets.
Dictionaries
- Use dictionaries to store key-value pairs with distinct keys.
- Create dictionaries using `{k1:v1, k2:v2, ...}`.
- Dictionaries are mutable, i.e., they can be updated in place.
- Dictionary keys must be immutable, but values can be anything.
- Use tuples to store multi-part keys in dictionaries.
- `dict[key]` refers to the dictionary entry with a particular key.
- `key in dict` tests whether a key is in a dictionary.
- `len(dict)` returns the number of entries in a dictionary.
- A loop over a dictionary produces each key once, in arbitrary order.
- `dict.keys()` creates a list of the keys in a dictionary.
- `dict.values()` creates a list of the values in a dictionary.
Simple Examples
- Use dictionaries to count things.
- Initialize values from actual data instead of trying to guess what values could "never" occur.
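Both points can be sketched briefly. `count_words` and `smallest` are hypothetical helpers: the first counts with a dictionary, and the second starts from the first real value instead of guessing a number that could "never" occur.

```python
def count_words(words):
    """Return a dictionary mapping each word to the number of times it appears."""
    counts = {}
    for word in words:
        if word in counts:
            counts[word] += 1
        else:
            counts[word] = 1   # first sighting of this word
    return counts


def smallest(values):
    """Return the smallest value in a non-empty list."""
    result = values[0]         # initialize from actual data
    for v in values[1:]:
        if v < result:
            result = v
    return result


print(count_words(["a", "b", "a"]))
print(smallest([7, 3, 9]))
```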
Phylogenetic Trees
- Problems that are described using matrices can often be solved more efficiently using dictionaries.
- When using tuples as multi-part dictionary keys, order the tuple entries to avoid accidental duplication.
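Ordering the entries of a tuple key can be done with a small helper, sketched here. The `pair_key` function, the species names, and the distance value are all made up for illustration.

```python
def pair_key(a, b):
    """Return the two names as a tuple in sorted order."""
    if a <= b:
        return (a, b)
    return (b, a)


distances = {}
distances[pair_key("human", "chimp")] = 1.2
distances[pair_key("chimp", "human")] = 1.2   # same key, so no duplicate entry

print(distances)
```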
Development
The Grid
- Get something simple working, then start to add features, rather than putting everything in the program at the start.
- Leave FIXME markers in programs as you are developing them to remind yourself what still needs to be done.
Aliasing
- Draw pictures of data structures to aid debugging.
Randomness
- Use a well-tested random number generation library to generate pseudorandom values.
- If a random number generation library is given the same seed, it will produce the same sequence of values.
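Seeding can be demonstrated with Python's standard `random` module. The `sample` function and the seed values are arbitrary choices for the example.

```python
import random


def sample(seed, count):
    """Return count pseudorandom die rolls from a generator with the given seed."""
    generator = random.Random(seed)   # a separate generator with its own seed
    return [generator.randint(1, 6) for _ in range(count)]


first = sample(42, 5)
second = sample(42, 5)   # same seed, so the same sequence

print(first == second)
```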
Neighbors
- `and` and `or` stop evaluating arguments as soon as they have an answer.
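Short-circuit evaluation is what makes a bounds check safe to combine with an index operation, as in this sketch. The `is_filled` function and the one-dimensional `cells` list are hypothetical stand-ins for a grid lookup.

```python
def is_filled(cells, index):
    """Return True if index is in bounds and the cell there holds 1."""
    # If the bounds test is False, "and" stops and cells[index] is never evaluated.
    return 0 <= index < len(cells) and cells[index] == 1


cells = [1, 0, 1]
print(is_filled(cells, 0), is_filled(cells, 1), is_filled(cells, 5), is_filled(cells, -1))
```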
Bugs
- Test programs with successively more complex cases.
Refactoring
- Refactor programs as necessary to make testing easier.
- Replace randomness with predictability to make testing easier.
Performance
- Scientists want faster programs both to handle bigger problems and to handle more problems with available resources.
- Before speeding a program up, ask, "Does it need to be faster?" and, "Is it correct?"
- Recording start and end times is a simple way to measure performance.
- Analyze algorithms to predict how a program's performance will change with problem size.
Profiling
- Use a profiler to determine which parts of a program are responsible for most of its running time.
A New Beginning
- Better algorithms are better than better hardware.