Time series analysis with pandas. Part 2

Task:

continue interactive analysis of time series (AO, NAO indexes)

Module:

pandas

In the previous part we looked at very basic ways of working with pandas. Here I am going to introduce a couple of more advanced tricks. We will use pandas' very powerful IO capabilities to create time series directly from a text file, then create seasonal means with resample and multi-year monthly means with groupby. At the end I will show how new functionality from the upcoming IPython 2.0 (the interact function) can be used to explore your data more efficiently with a sort of simple GUI. There might be easier or better ways to do some of the things discussed here, and I will be happy to hear about them in the comments :)

Import the usual suspects and change some output formatting:

In [76]:
import pandas as pd
import numpy as np
%matplotlib inline
pd.set_option('display.max_rows', 15) # this limits the maximum number of rows displayed

We are also going to download the necessary files. Their description can be found in the first part.

In [ ]:
!wget http://www.cpc.ncep.noaa.gov/products/precip/CWlink/daily_ao_index/monthly.ao.index.b50.current.ascii
!wget http://www.cpc.ncep.noaa.gov/products/precip/CWlink/pna/norm.nao.monthly.b5001.current.ascii

Pandas IO

Pandas is equipped with very rich IO functionality that converts essentially any text-table-based data format directly to a Series or DataFrame. There is very good, extensive documentation with a lot of examples. Here we are going to open the AO file in the same way we did in the first part, and the NAO file with pandas IO. Then we are going to combine the two in one DataFrame.

Simple numpy loadtxt, then create the dates and a Series:

In [55]:
ao = np.loadtxt('monthly.ao.index.b50.current.ascii')
dates = pd.date_range('1950-01', '2014-03', freq='M')
AO = pd.Series(ao[:,2], index=dates)

Now let's open NAO. First, let's remind ourselves what the file looks like:

In [56]:
!tail norm.nao.monthly.b5001.current.ascii
 2013    5  0.56906E+00
 2013    6  0.52076E+00
 2013    7  0.67216E+00
 2013    8  0.97019E+00
 2013    9  0.24060E+00
 2013   10 -0.12801E+01
 2013   11  0.90082E+00
 2013   12  0.94566E+00
 2014    1  0.29026E+00
 2014    2  0.13352E+01

We have 3 space-separated columns, with the first two columns containing years and months. Here is the expression that will create a time series out of this file:

In [57]:
NAO = pd.read_table('norm.nao.monthly.b5001.current.ascii', sep='\s*', \
              parse_dates={'dates':[0, 1]}, header=None, index_col=0, squeeze=True)
In [58]:
NAO
Out[58]:
dates
1950-01-15    0.92
1950-02-15    0.40
1950-03-15   -0.36
1950-04-15    0.73
1950-05-15   -0.59
...
2013-09-15    0.24060
2013-10-15   -1.28010
2013-11-15    0.90082
2013-12-15    0.94566
2014-01-15    0.29026
2014-02-15    1.33520
Name: 2, Length: 770

Some explanations:

  • the first argument is obviously the file name
  • '\s*' - a regular expression that describes the separator (whitespace)
  • parse_dates - combine columns 0 and 1, convert the resulting column to dates and give it the name "dates"
  • header - don't use row 0 as a header
  • index_col - make column 0 the index (by this point column 0 is already the result of the parse_dates step)
  • squeeze - create a Series instead of a DataFrame
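
For readers on a more recent pandas, where some of the arguments above may no longer be accepted, here is a minimal sketch of the same idea done in explicit steps. It is not the call used in this post: the column names (year, month, nao), the variable names and the three-line sample (copied from the tail of the file shown above) are assumptions made only for illustration, and it builds monthly periods directly instead of time stamps.

```python
import io
import pandas as pd

# Three lines copied from the tail of the NAO file shown above,
# so that the example runs without downloading anything.
sample = """ 2013   12  0.94566E+00
 2014    1  0.29026E+00
 2014    2  0.13352E+01
"""

# r'\s+' means "one or more whitespace characters" between columns.
# The column names are hypothetical; the file itself has no header.
raw = pd.read_csv(io.StringIO(sample), sep=r'\s+', header=None,
                  names=['year', 'month', 'nao'])

# Assemble time stamps from the year and month columns (day fixed to 1),
# then convert them to monthly periods.
stamps = pd.to_datetime(raw[['year', 'month']].assign(day=1))
index = pd.DatetimeIndex(stamps).to_period('M')

nao_sample = pd.Series(raw['nao'].to_numpy(), index=index, name='NAO')
print(nao_sample)
```

To read the real file, the io.StringIO(sample) argument would be replaced by the file name.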

Now we would like to combine the AO and NAO Series. But there is a little problem: the dates in our two Series are different. The pandas date parser returns time stamps, and since the file contains no day, it fills in the present day number (15 in my case) and interprets the indexes in NAO as points in time. A similar thing happened with the AO series: its index has monthly frequency, but every value is interpreted as a point in time associated with the last day of the month. As a consequence, the simple approach will not work:

In [59]:
aonao = pd.DataFrame({'AO':AO, 'NAO':NAO})
In [60]:
aonao.head(10)
Out[60]:
                  AO   NAO
1950-01-15       NaN  0.92
1950-01-31 -0.060310   NaN
1950-02-15       NaN  0.40
1950-02-28  0.626810   NaN
1950-03-15       NaN -0.36
1950-03-31 -0.008127   NaN
1950-04-15       NaN  0.73
1950-04-30  0.555100   NaN
1950-05-15       NaN -0.59
1950-05-31  0.071577   NaN

10 rows × 2 columns

But our data are monthly means, so they relate not to a particular point in time but rather to a time interval, or time span. We can convert the time stamps in our Series to time periods, and then combine them.

In [61]:
aonao = pd.DataFrame({'AO':AO.to_period(freq='M'), 'NAO':NAO.to_period(freq='M')} )
In [62]:
aonao.head(10)
Out[62]:
               AO   NAO
1950-01 -0.060310  0.92
1950-02  0.626810  0.40
1950-03 -0.008127 -0.36
1950-04  0.555100  0.73
1950-05  0.071577 -0.59
1950-06  0.538570 -0.06
1950-07 -0.802480 -1.26
1950-08 -0.851010 -0.05
1950-09  0.357970  0.25
1950-10 -0.378900  0.85

10 rows × 2 columns

Note that the index now shows only years and months. Below you can see that the index types of the original time series and the converted one differ:

In [63]:
type(AO.index)
Out[63]:
pandas.tseries.index.DatetimeIndex
In [64]:
type(AO.to_period(freq='M').index)
Out[64]:
pandas.tseries.period.PeriodIndex
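
The difference between time stamps and periods is easy to see on a toy example. The sketch below uses made-up values and hypothetical variable names (mid_month, month_end), not the real indexes: two monthly series stamped on different days do not line up until both are converted to monthly periods.

```python
import pandas as pd

# Two toy monthly series (made-up values) stamped on different days of the month.
mid_month = pd.Series(
    [1.0, 2.0, 3.0],
    index=pd.to_datetime(['1950-01-15', '1950-02-15', '1950-03-15']))
month_end = pd.Series(
    [10.0, 20.0, 30.0],
    index=pd.to_datetime(['1950-01-31', '1950-02-28', '1950-03-31']))

# The time stamps never coincide, so every row has a NaN in one column.
by_stamp = pd.DataFrame({'a': mid_month, 'b': month_end})

# As monthly periods the indexes are identical, so the rows line up.
by_period = pd.DataFrame({'a': mid_month.to_period('M'),
                          'b': month_end.to_period('M')})

print(by_stamp.shape, by_period.shape)
```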

Seasonal means with resample

Pandas was initially created for the analysis of financial information, so it thinks not in seasons but in quarters. So we have to resample our data to quarters. We also need to shift the standard quarters so that they correspond to seasons. This is done by using 'Q-NOV' as the time frequency, indicating that the year in our case ends in November, which makes the first quarter run from December to February:

In [65]:
q_mean = aonao.resample('Q-NOV')
In [66]:
q_mean.head()
Out[66]:
              AO       NAO
1950Q1  0.283250  0.660000
1950Q2  0.206183 -0.073333
1950Q3 -0.371640 -0.456667
1950Q4 -0.178680 -0.053333
1951Q1 -0.804333 -0.080000

5 rows × 2 columns
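
To see which months end up in which 'Q-NOV' quarter, here is a small sketch on toy data where every value is simply its month number. It is an alternative formulation rather than the resample call used above: each monthly period is mapped to its quarter with asfreq and the rows are then grouped by that quarter. The names toy, season and seasonal_mean are made up for the illustration.

```python
import numpy as np
import pandas as pd

# Toy data (not the real indexes): each value is just its month number.
months = pd.period_range('1950-01', '1951-02', freq='M')
toy = pd.DataFrame({'value': np.asarray(months.month, dtype=float)},
                   index=months)

# Map every month to a quarter of a year that ends in November,
# so that Q1 = Dec-Feb, Q2 = Mar-May, Q3 = Jun-Aug, Q4 = Sep-Nov.
season = toy.index.asfreq('Q-NOV')

# Average the months that fall into the same quarter.
seasonal_mean = toy.groupby(season).mean()

# Keep winters only (first quarter of the shifted year).
winters = seasonal_mean[seasonal_mean.index.quarter == 1]
print(seasonal_mean)
```

The first winter here is averaged over January and February only, because December 1949 is not in the data; the same is true for 1950Q1 in the table above.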

In [67]:
q_mean[q_mean.index.quarter==1].plot(figsize=(8,5))
Out[67]:

If you don't mind sacrificing the first two data points (which strictly speaking can't represent the whole winter of 1949-1950), there is another way to do a similar thing: just resample to a 3M (3 months) interval starting from March (the third data point):

In [68]:
m3_mean = aonao[2:].resample('3M', closed='left')
In [69]:
m3_mean.head()
Out[69]:
                  AO       NAO
1950-05-31  0.206183 -0.073333
1950-08-31 -0.371640 -0.456667
1950-11-30 -0.178680 -0.053333
1951-02-28 -0.804333 -0.080000
1951-05-31 -1.191120 -0.610000

5 rows × 2 columns

Now, in order to select all winters, we have to choose Februaries (the last month of the season):

In [70]:
m3_mean[m3_mean.index.month==2].plot(figsize=(8,5))
Out[70]:

The result is the same except for the first point. This method allows any time frequency to be used, but one has to deal with time stamps again, since periods for arbitrary frequencies are not yet implemented.

Multi-year monthly means with groupby

The first step is to add another column with month numbers to our DataFrame:

In [71]:
aonao['mon'] = aonao.index.month
aonao
Out[71]:
               AO   NAO  mon
1950-01 -0.060310  0.92    1
1950-02  0.626810  0.40    2
1950-03 -0.008127 -0.36    3
1950-04  0.555100  0.73    4
1950-05  0.071577 -0.59    5
1950-06  0.538570 -0.06    6
1950-07 -0.802480 -1.26    7
1950-08 -0.851010 -0.05    8
1950-09  0.357970  0.25    9
1950-10 -0.378900  0.85   10
1950-11 -0.515110 -1.26   11
1950-12 -1.928100 -1.02   12
1951-01 -0.084969  0.08    1
1951-02 -0.399930  0.70    2
1951-03 -1.934100 -1.02    3
...           ...   ...  ...

770 rows × 3 columns

Now we can use groupby to group our values by month and calculate the mean for each of the groups (months in our case):

In [72]:
monmean = aonao['1950':'2013'].groupby('mon').aggregate('mean')
monmean.plot(kind='bar')
Out[72]:
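
The grouping step can be checked by hand on a toy example. The sketch below uses made-up data (a simple counter over three years) and hypothetical names (toy, monthly_clim); it also groups by the month number taken straight from the index, which is an alternative to adding a separate mon column.

```python
import numpy as np
import pandas as pd

# Toy data: three full years of monthly values 0, 1, 2, ..., 35.
months = pd.period_range('1950-01', '1952-12', freq='M')
toy = pd.DataFrame({'value': np.arange(len(months), dtype=float)},
                   index=months)

# Group rows by calendar month (1..12) and average across years.
monthly_clim = toy.groupby(toy.index.month).mean()

# January is the mean of 0, 12 and 24; December is the mean of 11, 23 and 35.
print(monthly_clim)
```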

There are very large negative values for the winter months of AO. To see what is going on there, it is useful to look at the box plots for every month:

In [73]:
ax = aonao.boxplot(column=['AO'], by='mon')
ax = aonao.boxplot(column=['NAO'], by='mon')

While NAO shows a more or less uniform spread, AO has pronounced seasonal variations, with the largest spread during winter months.

Interactive exploration of data (only for IPython 2.0)

Say we would like to look at the variability of our indexes in individual months, and also (if necessary) do a bit of smoothing to filter out high frequencies. Something like this will work:

In [74]:
pd.rolling_mean(aonao[['AO','NAO']][aonao.mon==1], window=1).plot()
Out[74]:

This is the data for January, and there is no smoothing (window=1).

In [75]:
pd.rolling_mean(aonao[['AO','NAO']][aonao.mon==2], window=10).plot()
Out[75]:
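
A note for readers on later pandas releases: the module-level pd.rolling_mean function was eventually removed in favour of the .rolling(...).mean() method. Below is a minimal sketch of the method form on made-up data; the series januaries and its values are assumptions for illustration only, standing in for one value per year.

```python
import numpy as np
import pandas as pd

# Toy "one value per year" series (made-up numbers 0..9 for 1950..1959).
januaries = pd.Series(np.arange(10, dtype=float), index=range(1950, 1960))

# A window of 1 leaves the data untouched (no smoothing).
unsmoothed = januaries.rolling(window=1).mean()

# A 3-year window averages each value with the two before it;
# the first two years have no complete window and become NaN.
smoothed = januaries.rolling(window=3).mean()

print(smoothed)
```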

This one is February, with a rolling mean over a 10-year window applied. Wouldn't it be nice to be able to change our two parameters (month and window) interactively? The IPython developers include a very simple way to do this in the upcoming 2.0 version. The following code will only work on a local machine, and only with IPython >= 2.0.

Import interact:

In [45]:
from IPython.html.widgets import interact

Define a function that takes our parameters as input:

In [46]:
def kp(mm=1, wind=1):
    pd.rolling_mean(aonao[['AO','NAO']][aonao.mon==mm], window=wind).plot(ylim=(-4,4))

And now call interact with our previously defined function as the first argument. The other arguments are our parameters with their value limits and step size:

In [47]:
cc = interact(kp, mm=(1, 12, 1), wind=(1,10,1))

As simple as that, you get two controls, for month and window size, that you can operate with the mouse or arrow keys. You can find more about this feature in this talk by Brian Granger.

Those who do not have IPython 2.0 yet can enjoy a video of the process below :)

In [50]:
from IPython.display import YouTubeVideo
YouTubeVideo('Ba9GAq5PR_8')
Out[50]:
