Here I am going to show just some basic pandas stuff for time series analysis, as I think for the Earth Scientists it's the most interesting topic. If you find this small tutorial useful, I encourage you to watch this video, where Wes McKinney give extensive introduction to the time series data analysis with pandas.
On the official website you can find explanation of what problems pandas solve in general, but I can tell you what problem pandas solve for me. It makes analysis and visualisation of 1D data, especially time series, MUCH faster. Before pandas working with time series in python was a pain for me, now it's fun. Ease of use stimulate in-depth exploration of the data: why wouldn't you make some additional analysis if it's just one line of code? Hope you will also find this great tool helpful and useful. So, let's begin.
Thanks to Will Bryant for the 2017 update.
As an example we are going to use time series of Arctic Oscillation (AO) and North Atlantic Oscillation (NAO) data sets.
Module import¶
First we have to import necessary modules:
import pandas as pd
import numpy as np
from pandas import Series, DataFrame, Panel
pd.set_option('display.max_rows',15) # this limit maximum numbers of rows
And "switch on" inline graphic for the notebook:
%pylab inline
Pandas developing very fast, and while we are going to use only basic functionality, some details may still change in the newer versions. This example will only work in version > 0.8. Check yours:
pd.__version__
Loading data¶
Now, when we are done with preparations, let's get some data. If you work on Windows download monthly AO data from here. If you on *nix machine, you can do it directly from ipython notebook using system call to wget command or on Mac a curl command:
#!curl http://www.cpc.ncep.noaa.gov/products/precip/CWlink/daily_ao_index/monthly.ao.index.b50.current.ascii >> 'monthly.ao.index.b50.current.ascii'
!wget http://www.cpc.ncep.noaa.gov/products/precip/CWlink/daily_ao_index/monthly.ao.index.b50.current.ascii
Pandas has very good IO capabilities, but we not going to use them in this tutorial in order to keep things simple. For now we open the file simply with numpy loadtxt:
ao = np.loadtxt('monthly.ao.index.b50.current.ascii')
Every line in the file consist of three elements: year, month, value:
ao[0:2]
And here is the shape of our array:
ao.shape
Time Series¶
We would like to convert this data in to time series, that can be manipulated naturally and easily. First step, that we have to do is to create the range of dates for our time series. From the file it is clear, that record starts at January 1950. We will start our dates at January 1950 and generate as many time stamps as we have records. Frequency of the data is one month (freq='M').
dates = pd.date_range('1950-01', periods=ao.shape[0], freq='M')
As you see syntax is quite simple, and this is one of the reasons why I love Pandas so much :) You can check if the range of dates is properly generated:
dates
dates.shape
Now we are ready to create our first time series. Dates from the dates variable will be our index, and AO values will be our, hm... values:
AO = Series(ao[:,2], index=dates)
AO
Now we can plot complete time series:
AO.plot()
or its part:
AO['1980':'1990'].plot()
or even smaller part:
AO['1980-05':'1981-03'].plot()
Reference to the time periods is done in a very natural way. You, of course, can also get individual values. By number:
AO[120]
or by index (date in our case):
AO['1960-01']
And what if we choose only one year?
AO['1960']
Isn't that great? :)
One bonus example :)
AO[AO > 0]
Data Frame¶
Now let's make life a bit more interesting and download more data. This will be NOA time series (Windows users can get it here, Mac users can use curl).
#!curl http://www.cpc.ncep.noaa.gov/products/precip/CWlink/pna/norm.nao.monthly.b5001.current.ascii >> 'norm.nao.monthly.b5001.current.ascii'
!wget http://www.cpc.ncep.noaa.gov/products/precip/CWlink/pna/norm.nao.monthly.b5001.current.ascii
Create Series the same way as we did for AO:
nao = np.loadtxt('norm.nao.monthly.b5001.current.ascii')
dates_nao = pd.date_range('1950-01', periods=nao.shape[0], freq='M')
NAO = Series(nao[:,2], index=dates_nao)
Time period is the same:
NAO.index
Now we create Data Frame, that will contain both AO and NAO data. It sort of an Excel table where the first row contain headers for the columns and firs column is an index:
aonao = DataFrame({'AO' : AO, 'NAO' : NAO})
One can plot the data straight away:
aonao.plot(subplots=True)
Or have a look at the first several rows:
aonao.head()
We can reference each column by its name:
aonao['NAO']
or as method of the Data Frame variable (if name of the variable is a valid python name):
aonao.NAO
We can simply add column to the Data Frame:
aonao['Diff'] = aonao['AO'] - aonao['NAO']
aonao.head()
And delete it:
del aonao['Diff']
aonao.tail()
Slicing will also work:
aonao['1981-01':'1981-03']
even in some crazy combinations:
import datetime
aonao.loc[(aonao.AO > 0) & (aonao.NAO < 0)
& (aonao.index > datetime.datetime(1980,1,1))
& (aonao.index < datetime.datetime(1989,1,1)),
'NAO'].plot(kind='barh')
Here we use special advanced indexing attribute .ix. We choose all NAO values in the 1980s for months where AO is positive and NAO is negative, and then plot them. Magic :)
Statistics¶
Back to simple stuff. We can obtain statistical information over elements of the Data Frame. Default is column wise:
aonao.mean()
aonao.max()
aonao.min()
You can also do it row-wise:
aonao.mean(1)
Or get everything at once:
aonao.describe()
Resampling¶
Pandas provide easy way to resample data to different time frequency. Two main parameters for resampling is time period you resemple to and the method that you use. By default the method is mean. Following example calculates annual ('A') mean:
AO_mm = AO.resample("A").mean()
AO_mm.plot(style='g--')
median:
AO_mm = AO.resample("A").median()
AO_mm.plot()
You can use your methods for resampling, for example np.max (in this case we change resampling frequency to 3 years):
AO_mm = AO.resample("3A").apply(np.max)
AO_mm.plot()
You can specify several functions at once as a list:
AO_mm = AO.resample("A").apply(['mean', np.min, np.max])
AO_mm['1900':'2020'].plot(subplots=True)
AO_mm['1900':'2020'].plot()
AO_mm
Moving (rolling) statistics¶
Pandas provide good collection of moving statistics.
Rolling mean:
aonao.rolling(window=12, center=False).mean().plot(style='-g')
Rolling correlation:
aonao.AO.rolling(window=120).corr(other=aonao.NAO).plot(style='-g')
By the way getting correlation coefficients for members of the Data Frame is as simple as:
aonao.corr()
That's it. I hope you at least get a rough impression of what pandas can do for you. Comments are very welcome (below). If you have intresting examples of pandas usage in Earth Science, we would be happy to put them on EarthPy.
Comments !