One of the things that bothers me most at work is data conversion. The world would be a much better place for somebody like me if everybody used the netCDF file format for data distribution. While the situation is slowly changing, and more and more organisations are switching to netCDF, there are still plenty that distribute their data in some crazy formats.
Wouldn't it be nice if somebody created converters for all these formats once and for all, and provided a way to search and access the data directly from Python? Imagine: instead of spending time writing regular expressions for yet another converter, you could watch cat videos on YouTube (I mean, do more actual research). The ulmo project tries to do something like this (AFAIU), providing "clean, simple and fast access to public hydrology and climatology data".
At the moment they certainly cover more hydrology than climatology data, but there is something interesting for me as well: historical measurements from meteorological stations. Below I will give you an example of how to access them, so you get an idea of what ulmo does.
Necessary preparations, as usual:
%pylab inline
Installation is very simple with pip, but if you need the development version, go to the github repository.
!pip install ulmo
Import ulmo and pandas (you already use pandas for data analysis, right? If not, you should; go and read this post first).
import ulmo
import pandas
We are going to work with data from the National Climatic Data Center's Global Historical Climatology Network - Daily (GHCN-Daily) dataset. Say we would like to find a station in our home town. It's a good idea to start by getting information about the stations in the country of interest. Ulmo has a function, ulmo.ncdc.ghcn_daily.get_stations, that obtains information about the stations available in the GHCN dataset and also lets you narrow the search with conditions like country, time span, and specific variables.
The country should be provided as a country code; the list of countries is available here. We will search for stations in Germany (GM) and ask for the result as a pandas DataFrame. Be patient, it might take some time.
st = ulmo.ncdc.ghcn_daily.get_stations(country='GM', as_dataframe=True)
Now we have a nice table with information about all available German meteorological stations:
st.head()
Let's search for Hamburg stations:
st[st.name.str.contains('HAMBURG')]
There are only two stations, and we are interested in data from HAMBURG FUHLSBUETTEL.
Getting the data is also very easy. The only thing you need is the id of your station:
data = ulmo.ncdc.ghcn_daily.get_data('GM000010147', as_dataframe=True)
This function returns a dictionary with the names of the variables as keys and pandas DataFrames with the measurements as values:
data
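To get a feel for that structure without hitting the network, here is a toy stand-in for what get_data returns: a dict mapping GHCN variable codes to DataFrames of daily values (the shape is modeled on the real output; the numbers are made up):

```python
import pandas as pd

# Toy stand-in for ulmo's return value: variable codes -> DataFrames
# of daily measurements (structure assumed; numbers made up)
days = pd.period_range('1950-01-01', periods=3, freq='D')
data = {
    'TMAX': pd.DataFrame({'value': [251, 263, 249]}, index=days),
    'PRCP': pd.DataFrame({'value': [0, 12, 3]}, index=days),
}

print(sorted(data))        # which variables are available
print(data['TMAX'].shape)  # one row per day, one 'value' column
```

Listing the keys first is the quickest way to see which variables a station actually reports before you pick one out.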
Let's get maximum daily temperatures:
tm = data['TMAX'].copy()
tm.head()
Values have to be divided by 10 to get degrees Celsius (GHCN-Daily stores temperatures in tenths of a degree):
tm.value=tm.value/10.0
Now you can plot the data, as you would usually do with pandas:
tm['value']['1980':'2010'].plot()
Or do some statistical analysis:
pandas.rolling_mean(tm.value, window=365).plot()
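Note that pandas.rolling_mean was removed in newer versions of pandas; the modern equivalent is the .rolling() method on the Series itself. A minimal sketch on a synthetic daily series standing in for tm.value:

```python
import numpy as np
import pandas as pd

# Synthetic daily series standing in for tm.value
idx = pd.date_range('2000-01-01', periods=1000, freq='D')
s = pd.Series(np.sin(np.arange(1000) / 50.0), index=idx)

# Old: pandas.rolling_mean(s, window=365); new:
smooth = s.rolling(window=365).mean()
print(smooth.dropna().size)  # 636 = 1000 - 365 + 1 full windows
```

By default the result is NaN until a full 365-day window is available; pass min_periods if you want earlier values.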
Unfortunately something like
tm.value.resample('A')
will not work, since value has a dtype that pandas can't process in this case. We first have to convert the value column to float:
tm.value = tm.value.astype('float')
Now it's working:
tm['1950':'2012'].value.resample('A').plot()
title('Annual mean daily maximum temperature in Hamburg')
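A word of warning if you run this with a recent pandas: resample('A') alone no longer computes annual means, it returns a Resampler object, and you have to call an aggregation explicitly. A sketch on synthetic data:

```python
import numpy as np
import pandas as pd

# Five years of synthetic daily values
idx = pd.date_range('2000-01-01', '2004-12-31', freq='D')
s = pd.Series(np.arange(len(idx), dtype='float'), index=idx)

# Old pandas defaulted to the mean; now say it explicitly
annual = s.resample('A').mean()
print(len(annual))  # 5 yearly values, 2000 through 2004
```

So the plotting line from above becomes tm['1950':'2012'].value.resample('A').mean().plot() on current pandas.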
Now you know how to easily find a meteo station and get its data with ulmo. If you only want data for a single station, ulmo is maybe not that useful, but when you start collecting statistics from many stations, it becomes very handy. If you use hydrological data you certainly have to give ulmo a try (see the list of supported data sets). Hopefully the authors will continue development and add more data sources in the future :)