Climatology data access with ulmo

Task:

easy access to climatology data 

Solution:

ulmo

Notebook file

One of the things that bothers me most at work is data conversion. The world would be a much better place for somebody like me if everybody used the netCDF file format for data distribution. While the situation is slowly changing, and more and more organisations switch to netCDF, there are still plenty of those who distribute their data in some crazy formats.

Wouldn't it be nice if somebody once and for all created converters for all these formats and provided a way to directly search and access the data from python? Imagine - instead of spending time writing regular expressions for yet another converter you could do more actual research (or watch cat videos on youtube). The ulmo project tries to do something like this (AFAIU) and provides "clean, simple and fast access to public hydrology and climatology data".

At the moment they certainly cover more hydrology than climatology data, but there is also something interesting for me - historical measurements from meteorological stations. In the following I will show you how to access those, so you get an idea of what ulmo is doing.

Necessary preparations, as usual:

In [1]:
%pylab inline
Welcome to pylab, a matplotlib-based Python environment [backend: module://IPython.kernel.zmq.pylab.backend_inline].
For more information, type 'help(pylab)'.

Installation is very simple with pip, but if you need the development version, go to the github repository.

In [2]:
!pip install ulmo

Import ulmo and pandas (you already use pandas for data analysis, right? If not, you should - go and read this post first).

In [3]:
import ulmo
import pandas

We are going to work with data from the National Climatic Data Center Global Historical Climatology Network - Daily dataset. Say we would like to find a station in our home town. It's a good idea to start by getting information about the stations in the country of interest. Ulmo has a function ulmo.ncdc.ghcn_daily.get_stations that obtains information about the stations available in the GHCN dataset and also lets you define some conditions for the search, like country, time span, and specific variables.

The country should be provided as a country code; the list of codes is available here. We will search for stations in Germany (GM) and ask for the result as a pandas DataFrame. Be patient, it might take some time.

In [4]:
st = ulmo.ncdc.ghcn_daily.get_stations(country='GM', as_dataframe=True)
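
If you already know which variable and time span you are after, you can narrow the search down right away. The keyword names below (elements, start_year, end_year) are my reading of the ulmo documentation, so take this as a sketch rather than gospel; for the rest of the post we stick with the unfiltered st DataFrame from above.

# a sketch: restrict the search to German stations that report TMAX
# and have data in (roughly) 1950-2010; the keywords elements,
# start_year and end_year are assumptions based on the ulmo docs
st_tmax = ulmo.ncdc.ghcn_daily.get_stations(country='GM',
                                            elements=['TMAX'],
                                            start_year=1950,
                                            end_year=2010,
                                            as_dataframe=True)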

Now we have a nice table with information about all available German meteorological stations:

In [17]:
st.head()
Out[17]:
country network network_id latitude longitude elevation state name gsn_flag hcn_flag wm_oid id
id
GM000001153 GM 0 00001153 51.9506 7.5914 62 NaN MUENSTER NaN NaN 10313 GM000001153
GM000001474 GM 0 00001474 53.0464 8.7992 4 NaN BREMEN NaN NaN 10224 GM000001474
GM000002277 GM 0 00002277 49.7517 6.6467 144 NaN TRIER NaN NaN 10609 GM000002277
GM000002288 GM 0 00002288 49.4253 7.7367 285 NaN KAISERSLAUTERN NaN NaN NaN GM000002288
GM000002698 GM 0 00002698 49.0392 8.3650 112 NaN KARLSRUHE NaN NaN 10727 GM000002698

Let's search for Hamburg stations:

In [45]:
st[st.name.str.contains('HAMBURG')]
Out[45]:
country network network_id latitude longitude elevation state name gsn_flag hcn_flag wm_oid id
id
GM000003865 GM 0 00003865 53.4806 10.2428 35 NaN HAMBURG BERGEDORF NaN NaN NaN GM000003865
GM000010147 GM 0 00010147 53.6350 9.9900 11 NaN HAMBURG FUHLSBUETTEL GSN NaN 10147 GM000010147

There are only two stations, and we are interested only in data from HAMBURG FUHLSBUETTEL.
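
By the way, you don't have to copy the id by hand: the station ids form the index of the stations DataFrame, so plain pandas can pull it out for you (nothing ulmo-specific here):

# grab the id of the FUHLSBUETTEL station from the DataFrame index
station_id = st[st.name.str.contains('FUHLSBUETTEL')].index[0]
station_id   # 'GM000010147'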

Getting the data is also very easy. The only thing you need is the id of your station:

In [7]:
data = ulmo.ncdc.ghcn_daily.get_data('GM000010147', as_dataframe=True)

This function returns a dictionary with the names of the variables as keys and pandas DataFrames with the measurements as values:

In [9]:
data
Out[9]:
{'PRCP': 
PeriodIndex: 44591 entries, 1891-01-01 to 2013-01-31
Data columns:
value    44591  non-null values
mflag    0  non-null values
qflag    0  non-null values
sflag    44591  non-null values
dtypes: object(4),
 'SNWD': 
PeriodIndex: 28855 entries, 1934-02-01 to 2013-01-31
Data columns:
value    27829  non-null values
mflag    0  non-null values
qflag    0  non-null values
sflag    27829  non-null values
dtypes: object(4),
 'TMAX': 
PeriodIndex: 44591 entries, 1891-01-01 to 2013-01-31
Data columns:
value    44591  non-null values
mflag    0  non-null values
qflag    195  non-null values
sflag    44591  non-null values
dtypes: object(4),
 'TMIN': 
PeriodIndex: 44591 entries, 1891-01-01 to 2013-01-31
Data columns:
value    44591  non-null values
mflag    0  non-null values
qflag    195  non-null values
sflag    44591  non-null values
dtypes: object(4)}
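
If you are only interested in one or two variables, you probably don't have to download everything: judging from the ulmo documentation, get_data also accepts an elements argument (take the exact keyword as my assumption, the rest is the same call as above):

# fetch only the daily maximum temperature for the same station;
# the elements keyword is an assumption based on the ulmo docs
tmax_data = ulmo.ncdc.ghcn_daily.get_data('GM000010147',
                                          elements=['TMAX'],
                                          as_dataframe=True)
tmax_data.keys()   # should contain only 'TMAX'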

Let's get the daily maximum temperatures:

In [10]:
tm = data['TMAX'].copy()
In [11]:
tm.head()
Out[11]:
value mflag qflag sflag
1891-01-01 -72 NaN NaN E
1891-01-02 -43 NaN NaN E
1891-01-03 -32 NaN NaN E
1891-01-04 12 NaN NaN E
1891-01-05 -29 NaN NaN E

Values have to be divided by 10 in order to get degrees Celsius (GHCN-Daily stores temperatures in tenths of a degree):

In [12]:
tm.value = tm.value / 10.0

Now you can plot the data, as you would usually do with pandas:

In [13]:
tm['value']['1980':'2010'].plot()
Out[13]: [plot of daily maximum temperature, 1980-2010]

Or do some statistical analysis:

In [14]:
pandas.rolling_mean(tm.value, window=365).plot()
Out[14]: [plot of the 365-day rolling mean of daily maximum temperature]

Unfortunately something like

tm.value.resample('A')

will not work, since value has the object dtype, which pandas can't resample. We first have to convert the value column to float:

In [15]:
tm.value = tm.value.astype('float')

Now it's working:

In [16]:
tm['1950':'2012'].value.resample('A').plot()
title('Annual mean daily maximum temperature in Hamburg')
Out[16]: [plot of annual mean daily maximum temperature in Hamburg, 1950-2012]
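
A side note in case you run this with a newer pandas than the one used here: pandas.rolling_mean has been removed and resample needs an explicit aggregation step, so the two analysis steps above would look roughly like this:

# rolling mean moved onto the Series itself
tm.value.rolling(window=365).mean().plot()

# annual resampling now needs an explicit aggregation
tm['1950':'2012'].value.resample('A').mean().plot()
title('Annual mean daily maximum temperature in Hamburg')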

Now you know how to easily find a meteo station and get its data with ulmo. If you only want data from a single station, ulmo is maybe not that useful, but when you begin to collect statistics from many stations, it becomes very handy (see the sketch below). If you use hydrological data you certainly have to give ulmo a try (see the list of supported data sets). Hopefully the authors will continue development and add more data sources in the future :)
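
As a small illustration of the many-stations case, here is a minimal sketch that loops over the two Hamburg stations found above and collects their daily maximum temperatures into one DataFrame (station ids taken from the table earlier, everything else is plain pandas):

# ids of the two Hamburg stations from the search above
hamburg_ids = ['GM000003865', 'GM000010147']

tmax = {}
for station_id in hamburg_ids:
    d = ulmo.ncdc.ghcn_daily.get_data(station_id, as_dataframe=True)
    if 'TMAX' in d:
        # convert from tenths of degrees Celsius to degrees Celsius
        tmax[station_id] = d['TMAX'].value.astype('float') / 10.0

# one column per station, aligned on the common date index
tmax = pandas.DataFrame(tmax)
tmax.describe()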
