An Exercise With Matplotlib and Numpy
This post originally appeared on the Software Carpentry website.
For this tutorial, we'll be plotting some weather data from a site call Weather Underground. You can download temperature readings and weather events for your local area in a comma-separated file.
I've put weather data for Bloomington, IN in a file called weather.csv. Each row is one day, and there are columns for min/mean/max temperature, dew point, wind speed, etc. We'll be plotting temperature and weather event data (e.g., rain, snow).
0. Installing matplotlib
I covered installing matplotlib in a previous tutorial. The matplotlib site also has installation instructions. I'll assume for the rest of the tutorial that you have matplotlib installed and working. If you can type this code at a Python shell:
from matplotlib import pyplot
and not receive any errors, then you're good to go.
1. Numpy Crash Course
The numpy module is how you do matrix-y stuff in Python. I'll give a quick example of why we'll need it. Imagine you were to type the following code into a Python shell:
x = [1, 2, 3, 4] print x * 5
What does this print? Why, this of course:
[1, 2, 3, 4, 1, 2, 3, 4, 1, 2, 3, 4, 1, 2, 3, 4, 1, 2, 3, 4]
By default, the * operator in Python copies the contents of a list however many times you specify. So x * 5 copied the contents of x five times and stuck them all together.
When we're doing matrix math in Python, it would be nicer if x * 5 produced [5, 10, 15, 20]. We could do this manually with a loop:
for i in range(len(x)): x[i] = x[i] * 5
We could even get fancy with Python's list comprehension syntax:
x = [x_i * 5 for x_i in x]
For a list with only four elements, this won't be so bad. For larger lists, however, it will be quite slow. Using numpy avoids the performance hit by doing the heavy lifting in C instead of in Python. Here's how we'd do the previous example with numpy:
import numpy as np x = np.array([1, 2, 3, 4]) x = x * 5 print x
This prints array([ 5, 10, 15, 20]) which is what we would expect. The array(...) lets you know that x is a numpy array. Onward!
2. Reading the Data
As with many programming problems, our first step is to read the data into memory. I've started a script called plot_data.py with a few import statements and some utility functions. I'll explain these functions in detail as we go forward.
import numpy as np import matplotlib.pyplot as pyplot from datetime import datetime import os event_types = ['Rain', 'Thunderstorm', 'Snow', 'Fog'] num_events = len(event_types) def event2int(event): return event_types.index(event) def date2int(date_str): date = datetime.strptime(date_str, '%Y-%m-%d') return date.toordinal() def r_squared(actual, ideal): actual_mean = np.mean(actual) ideal_dev = np.sum([(val - actual_mean)**2 for val in ideal]) actual_dev = np.sum([(val - actual_mean)**2 for val in actual]) return ideal_dev / actual_dev
In past tutorials, we've either manually parsed our data file(s) or used Python's csv reader. Because of our focus on numpy here, we're going to use the loadtxt function. By passing in the right options, we can get loadtxt to parse our weather.csv file directly into a numpy array.
For a first pass, I've written the following function to read in the weather data:
def read_weather(file_name): data = np.loadtxt(file_name, delimiter=',', skiprows=1, converters = { 0 : date2int }, usecols=(0,1,2,3,21)) return data #-------------------------------------------------- data = read_weather('data/weather.csv') print data
The first two parameters, delmiter and skiprows tell loadtxt to split fields based on commas and skip the first row of the file (which contains column names). numpy doesn't handle dates, so I've used the converters parameter to have have loadtxt convert column 0 (a date string) into an integer using my date2int function. The last parameter, usecols, tells loadtxt to ignore all columns in the file except the first, second, third, forth, and twenty-second column (the date, temperature, and weather event columns).
Unfortunately, running this code produces the following error:
$ python plot_data.py Traceback (most recent call last): File "plot_data-2.py", line 34, in <module> data = read_weather("data/weather.csv") File "plot_data-2.py", line 28, in read_weather usecols=(0,1,2,3,21)) File "/usr/lib/python2.7/dist-packages/numpy/lib/npyio.py", line 796, in loadtxt items = [conv(val) for (conv, val) in zip(converters, vals)] ValueError: could not convert string to float: Rain
The final line tells us that numpy can't convert the string "Rain" into a floating point number. This is from the weather events column in our data, which contains text like "Rain" or "Snow-Fog". We could try and write a converter for this column too, but I've chosen to simply have numpy bring the column in as a string (which we'll manually parse later).
To do this, we pass in a special object for the dtype parameter of loadtxt. This object can be constructed by giving a dictionary to the numpy.dtype function. The code below provides names and data types for all of the columns we'll be using:
def read_weather(file_name): dtypes = np.dtype({ 'names' : ('timestamp', 'max temp', 'mean temp', 'min temp', 'events'), 'formats' : [np.int, np.float, np.float, np.float, 'S100'] }) data = np.loadtxt(file_name, delimiter=',', skiprows=1, converters = { 0 : date2int }, usecols=(0,1,2,3,21), dtype=dtypes) return data
The last column format is given as "S100", which means "string up to 100 characters in length." numpy needs to know the maximum size of the column for efficiency, so I gave myself plenty of room with 100 characters. Running this new code produces the following output:
$ python plot_data.py [(734503, 53.0, 43.0, 32.0, 'Rain') (734504, 32.0, 25.0, 18.0, 'Snow') (734505, 27.0, 20.0, 12.0, '') (734506, 42.0, 34.0, 26.0, '') (734509, 52.0, 40.0, 28.0, '') (734510, 47.0, 36.0, 24.0, '') (734511, 51.0, 38.0, 24.0, '') (734512, 57.0, 43.0, 28.0, '') (734513, 45.0, 43.0, 40.0, 'Rain') (734514, 43.0, 29.0, 15.0, 'Fog-Snow') (734515, 19.0, 17.0, 15.0, 'Snow') (734516, 27.0, 18.0, 9.0, 'Snow') ...
We're finally in business. Each row in our data set consists of a timestamp (the date converted to an integer), the maximum, mean, and minimum temperature, and the weather events that occurred that day. We're going to start by plotting the mean temperature versus the day of the year. Since we've given names to each of our columns, we can pull them out easily:
def read_weather(file_name): dtypes = np.dtype({ 'names' : ('timestamp', 'max temp', 'mean temp', 'min temp', 'events'), 'formats' : [np.int, np.float, np.float, np.float, 'S100'] }) data = np.loadtxt(file_name, delimiter=',', skiprows=1, converters = { 0 : date2int }, usecols=(0,1,2,3,21), dtype=dtypes) return data #-------------------------------------------------- data = read_weather('data/weather.csv') min_temps = data['min temp'] mean_temps = data['mean temp'] max_temps = data['max temp'] dates = [datetime.fromordinal(d) for d in data['timestamp']] events = data['events'] for date, temp in zip(dates, mean_temps): print '{0:%b %d}: {1}'.format(date, temp)
Each column can be extract individually from the data array by using data['column name']. I've used the datetime.fromordinal function on the timestamp column to convert the integers back into datetime objects.
Using the handy built-in zip function, I've printed out pairs of dates and mean temperatures. I use advanced string formatting to print the month, day, and temperature (see the datetime documentation for date formatting information). The program now gives the following output:
Jan 01: 43.0 Jan 02: 25.0 Jan 03: 20.0 Jan 04: 34.0 ... May 11: 59.0 May 12: 62.0 May 14: 69.0
Everything looks good, so let's get started plotting.
3. Temperature Plot
We're going to start with a simple line plot that has the day of the year on the x-axis and the mean temperature for that day on the y-axis. Our plotting function, called temp_plot, will take in dates and times, and give us back a matplotlib figure object. Here's the code:
def temp_plot(dates, mean_temps): year_start = datetime(2012, 1, 1) days = [(d - year_start).days + 1 for d in dates] fig = pyplot.figure() pyplot.title('Temperatures in Bloomington 2012') pyplot.ylabel('Mean Temperature (F)') pyplot.xlabel('Day of Year') pyplot.plot(days, mean_temps, marker='o') return fig
We start by computing the day of the year for each date. The datetime module lets us subtract dates from each other, producing a timedelta object. We subtract each date from January 1st of 2012, adding 1 so that our count will start from 1 instead of 0. The days field on a timedelta object gives the total number of days (in this case, from January 1st).
Next, we create a new matplotlib figure. In between calls to pyplot.figure, matplotlib's plotting functions will draw new plots on top of old ones. We'll use this fact to add a trend line to our plot shortly.
After adding a title and some axis labels to our figure, we call pyplot.plot with our days (x values) and mean_temps arrays (y values). I've also passed in 'o' to the optional marker parameter so that small circles will be plotted for each data point.
In the main body of the program, we use the os module to create a "plots" directory (checking if it exists first). Next, we call our temp_plot function and then use savefig to save the figure out to a png file:
data = read_weather('data/weather.csv') min_temps = data['min temp'] mean_temps = data['mean temp'] max_temps = data['max temp'] dates = [datetime.fromordinal(d) for d in data['timestamp']] events = data['events'] if not os.path.exists('plots'): os.mkdir('plots') fig = temp_plot(dates, mean_temps) fig.savefig('plots/day_vs_temp.png')
Running $ python plot_data.py should create a "plots" folder and put a file inside called "day_vs_temp.png" that looks like this:
Not bad! Let's add a trend line to the plot based on a simple linear model of the data.
3.1 Adding a trend line
By using numpy's polyfit function, adding a trend line is a snap. This function takes our x and y values (days and mean_temps), and gives us back a slope and intercept (the final parameter is the degree of the fitted polynomial — we pass 1 for a linear fit).
slope, intercept = np.polyfit(days, mean_temps, 1)
Using the slope and intercept, we can plot a trend line by computing "ideal" temperatures for each day according to the old y = mx + b formula. With our variables below, this will be ideal_temps = (slope * days) + intercept. Note that I've changed the days = ... line to days = np.array(...) so that we can do mathematical operations directly on the array.
def temp_plot(dates, mean_temps): year_start = datetime(2012, 1, 1) days = np.array([(d - year_start).days + 1 for d in dates]) fig = pyplot.figure() pyplot.title('Temperatures in Bloomington 2012') pyplot.ylabel('Mean Temperature (F)') pyplot.xlabel('Day of Year') pyplot.plot(days, mean_temps, marker='o') slope, intercept = np.polyfit(days, mean_temps, 1) ideal_temps = intercept + (slope * days) r_sq = r_squared(mean_temps, ideal_temps) fit_label = 'Linear fit ({0:.2f})'.format(slope) pyplot.plot(days, ideal_temps, color='red', linestyle='--', label=fit_label) pyplot.annotate('r^2 = {0:.2f}'.format(r_sq), (0.05, 0.9), xycoords='axes fraction') pyplot.legend(loc='lower right') return fig
To make the plot a little more useful, I've annotated the plot with the R-squared value of the fit. pyplot.annotate lets you put text on the figure in a variety of ways. Here, I've set the xycoords parameter to "axes fraction" so that annotate interprets my coordinates (0.05, 0.9) as fractions between 0 and 1 relative to the figure axes. The (0.05, 0.9) means to place the text horizontally 5% from the y-axis (left) and 90% from the x-axis (bottom).
The final call to pyplot.legend places a legend on the figure. You must include a label parameter on at least one plot object for this to work (I've included it on the trend line plot call). By default, the legend will show up in the upper-right corner of the figure. This will get in the way on our current plot, so I moved the figure to the lower-right with the loc parameter.
With the changes above, here's the new plot:
Notice that the string formatting ({0:.3f}) has rounded the R-squared value and slope label for us to three decimal places.
3.2 Adding "error" bars
Since we also have the min and max temperatures in our data, let's add "error" bars to our plot to show the temperature range on each day. We'll modify temp_plot to take in two additional parameters (min_temps and max_temps), and plot the temperature range if they both have values (i.e., are not None).
Adding error bars requires us to use the pyplot.errorbar function instead of pyplot.plot. It takes additional parameters (xerr and yerr) for the x and y errors. We will use yerr, and pass in an array with two rows: one for error above each data point, and one for the error below. This array is easily computed by subtracting the max and min temperatures from the mean, and then stacking the two arrays together row-wise with numpy.vstack.
def temp_plot(dates, mean_temps, min_temps = None, max_temps = None): year_start = datetime(2012, 1, 1) days = np.array([(d - year_start).days + 1 for d in dates]) fig = pyplot.figure() pyplot.title('Temperatures in Bloomington 2012') pyplot.ylabel('Mean Temperature (F)') pyplot.xlabel('Day of Year') if (max_temps is None or min_temps is None): # Normal plot without error bars pyplot.plot(days, mean_temps, marker='o') else: # Compute min/max temperature difference from the mean temp_err = np.row_stack((mean_temps - min_temps, max_temps - mean_temps)) # Make line plot with error bars to show temperature range pyplot.errorbar(days, mean_temps, marker='o', yerr=temp_err) pyplot.title('Temperatures in Bloomington 2012 (max/min)') slope, intercept = np.polyfit(days, mean_temps, 1) ideal_temps = intercept + (slope * days) r_sq = r_squared(mean_temps, ideal_temps) fit_label = 'Linear fit ({0:.2f})'.format(slope) pyplot.plot(days, ideal_temps, color='red', linestyle='--', label=fit_label) pyplot.annotate('r^2 = {0:2f}'.format(r_sq), (0.05, 0.9), xycoords='axes fraction') pyplot.legend(loc='lower right') return fig #-------------------------------------------------- data = read_weather('data/weather.csv') min_temps = data['min temp'] mean_temps = data['mean temp'] max_temps = data['max temp'] dates = [datetime.fromordinal(d) for d in data['timestamp']] events = data['events'] if not os.path.exists('plots'): os.mkdir('plots') # Plot without error bars fig = temp_plot(dates, mean_temps) fig.savefig('plots/day_vs_temp.png') # Plot with error bars fig = temp_plot(dates, mean_temps, min_temps, max_temps) fig.savefig('plots/day_vs_temp-all.png')
The new plot is saved to a file named day_vs_temp-all.png and looks like this:
If you need to compute standard error for your errorbar plot, you can use scipy.stats.sem from the scipy module.
For our next plot, we'll do a multi-part histogram of the weather events for each month.
4. Event Histogram
Histograms in matplotlib are generated using the pyplot.hist function. This function takes an array of data, which can itself contain arrays (for a multi-part histogram). We want to count events per month, so we'll need to create an array for each type of event. Inside these arrays will be observations like [1, 1, 2, 3, 3] for "January", "January", "February", "March", "March". Here's a diagram to help out:
When pyplot.hist receives our data, it will attempt to "bin" the month observations automatically. By default, it will break observations into 10 bins. We want a bin for each month instead, and we want the bins aligned properly to the month numbers (1 = January, 2 = February, etc.). The bins parameter to pyplot.hist takes either a number (representing the desired number of bins) or a sequence (representing the desired bin edges). In the code below, we pass range(1, 5 + 2) to ensure that our bins start at 1 (for January) and go through 5 (for May).
def hist_events(dates, events): event_months = [] for i in range(num_events): event_months.append([]) # Build up lists of months where events occurred for date, event_str in zip(dates, events): if len(event_str) == 0: # Skip blank events continue month = date.month # Multiple events in a day are separated by '-' for event in event_str.split('-'): event_code = event2int(event) event_months[event_code].append(month) # Plot histogram fig = pyplot.figure() pyplot.title('Weather Events in Bloomington 2012') pyplot.xlabel('Month') pyplot.ylabel('Event Count') bins = np.arange(1, 5 + 2) pyplot.hist(event_months, bins=bins, label=event_types) pyplot.legend() return fig
The main body of the program is updated to call hist_events and save the resulting figure to plots/event_histogram.png.
data = read_weather('data/weather.csv') min_temps = data['min temp'] mean_temps = data['mean temp'] max_temps = data['max temp'] dates = [datetime.fromordinal(d) for d in data['timestamp']] events = data['events'] if not os.path.exists('plots'): os.mkdir('plots') fig = temp_plot(dates, mean_temps) fig.savefig('plots/day_vs_temp.png') fig = temp_plot(dates, mean_temps, min_temps, max_temps) fig.savefig('plots/day_vs_temp-all.png') fig = hist_events(dates, events) fig.savefig(os.path.join('plots', 'event_histogram.png'))
When we run $ python plot_data.py, the new plot looks like this:
Each collection of bars represents a month, and the individual bars represent the number of Rain, Thunderstorm, etc. events observed for that month. The figure's legend was populated by passing event_types in for the label parameter of pyplot.hist.
The plot looks good, but it would be nice to properly label the months. We could do this manually with pyplot.xticks as follows:
pyplot.xticks( (1.5, 2.5, 3.5, 4.5, 5.5), ('January', 'February', 'March', 'April', 'May') )
This will label each bin in the center (hence the .5 added to each number) with the proper month name. If our data grows to include more months, however, we'll have to manually extend the number of bins and our labels. Let's change hist_events to keep track of the range of months in the data. Additionally, we'll use Python's calendar module to automatically get the month names.
At the top of the program, we'll import the calendar module:
import calendar
and then redefine hist_events as follows:
def hist_events(dates, events): event_months = [] for i in range(num_events): event_months.append([]) # Build up lists of months where events occurred min_month = 13 max_month = 0 for date, event_str in zip(dates, events): if len(event_str) == 0: # Skip blank events continue month = date.month min_month = min(month, min_month) max_month = max(month, max_month) # Multiple events in a day are separated by '-' for event in event_str.split('-'): event_code = event2int(event) event_months[event_code].append(month) # Plot histogram fig = pyplot.figure() pyplot.title('Weather Events in Bloomington 2012') pyplot.xlabel('Month') pyplot.ylabel('Event Count') pyplot.axes().yaxis.grid() num_months = max_month - min_month + 1; bins = np.arange(1, num_months + 2) # Bin edges pyplot.hist(event_months, bins=bins, label=event_types) # Align month labels to bin centers month_names = calendar.month_name[min_month:max_month+1] pyplot.xticks(bins + 0.5, month_names) pyplot.legend() return fig
During the process of building our observation arrays, we now track the minimum and maximum months observed. This allows us to automatically create our bin edges, and let's us grab months names from the calendar module by indexing into the calendar.month_name list.
Note that the bins variables was created using numpy.arange, which is a shortcut for bins = numpy.array(range(1, num_months + 2)). Making bins a numpy array lets us call pyplot.xticks with bins + 0.5, centering month_names on each bin.
As a bonus, I've also added a horizontal grid using the axis.grid function. You can add both a horizontal and vertical grid at the same time by calling pyplot.grid.
Here's the updated plot:
Looks ready for publication!