Using Flow Duration Curves to Determine Basin Characteristics and Estimate Flow
This notebook uses tools from the Scientific Python (SciPy) stack to generate flow duration curves from current USGS NWIS data.
Using recipes from this notebook, you can make:
- USGS Station Summaries
- Flow duration curves
- Iterative import and compilation of USGS station information and data
- Boxplots using pandas
- Iterative charts (one monthly summary boxplot per station)
- Gantt charts of USGS stations
Background
Import the necessary packages. ulmo is not a standard package and will have to be installed in your local Python environment for some of these functions to work. Use the following command in your command prompt to install ulmo:
pip install ulmo
You might also be missing the pandas and scipy packages, which is a shame, because pandas is awesome:
pip install pandas
pip install scipy
Check out this page for some great pandas applications: http://earthpy.org/time_series_analysis_with_pandas_part_2.html
%matplotlib inline
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import matplotlib.dates as dates
import scipy.stats as sp
import scipy.optimize as op
import statsmodels.api as sm
import ulmo
from pandas.stats.api import ols
from datetime import datetime, date, timedelta
from matplotlib.backends.backend_pdf import PdfPages
from matplotlib.pyplot import cm
from pylab import rcParams
rcParams['figure.figsize'] = 10, 8
import logging
logging.basicConfig()
logger = logging.getLogger('ulmo.usgs.nwis.core')
logger.setLevel(logging.WARNING)
rpy2 allows one to run R files within the IPython Notebook. You may need to configure environment variable settings for this to work properly. If you get errors, use the Google.
import rpy2
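If the import fails because rpy2 cannot find R, a common fix is to point the R_HOME environment variable at your R installation before importing. A minimal sketch, with a hypothetical install path:
import os
os.environ['R_HOME'] = '/usr/lib/R'  # hypothetical path; point this at your own R installation
import rpy2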
Set a directory to store output figures:
pwd
rootname = '/home/magik/INTERNET/EARTHPY/data/'
dataroot = '/home/magik/INTERNET/EARTHPY/data/'
# rootname = 'U:\\GWP\\Groundwater\\UMSS_Manti\\Data\\Hydrology\\plots\\'
# dataroot = 'U:\\GWP\\Groundwater\\UMSS_Manti\\Data\\Hydrology\\'
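The figure- and data-saving calls below assume these folders already exist. A small convenience sketch (assuming the paths above point somewhere writable on your machine) that creates them if needed:
import os
for folder in [rootname, dataroot]:
    if not os.path.exists(folder):
        os.makedirs(folder)  # create the output directory if it is missing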
Functions
The function importusgssite uses the ulmo package to retrieve U.S. Geological Survey surface-water site data from the National Water Information System (NWIS) website and puts the data into a usable format. getusgssiteinfo gets site information from NWIS, including site name, watershed size, and coordinates.
def importusgssite(siteno):
sitename = {}
sitename = ulmo.usgs.nwis.get_site_data(siteno, service="daily", period="all")
sitename = pd.DataFrame(sitename['00060:00003']['values'])
sitename['dates'] = pd.to_datetime(pd.Series(sitename['datetime']))
sitename.set_index(['dates'],inplace=True)
sitename[siteno] = sitename['value'].astype(float)
sitename[str(siteno)+'qual'] = sitename['qualifiers']
sitename = sitename.drop(['datetime','qualifiers','value'],axis=1)
sitename = sitename.replace('-999999',np.NAN)
sitename = sitename.dropna()
#sitename['mon']=sitename.index.month
return sitename
def getusgssiteinfo(siteno):
siteinfo = ulmo.usgs.nwis.get_site_data(siteno, service="daily", period="all")
siteinfo = pd.DataFrame(siteinfo['00060:00003']['site'])
siteinfo['latitude'] = siteinfo.loc['latitude','location']
siteinfo['longitude'] = siteinfo.loc['longitude','location']
siteinfo['latitude'] = siteinfo['latitude'].astype(float)
siteinfo['longitude'] = siteinfo['longitude'].astype(float)
siteinfo = siteinfo.drop(['default_tz','dst_tz','srs','uses_dst','longitude'],axis=0)
siteinfo = siteinfo.drop(['agency','timezone_info','location','state_code','network'],axis=1)
return siteinfo
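Before looping over the full station list, it can help to try both functions on a single gage. A quick sketch, using one of the station ids from the sites list defined later in this notebook:
# sanity check on a single gage; '09310500' is one of the ids used below
test_q = importusgssite('09310500')
test_info = getusgssiteinfo('09310500')
print(test_q.head())
print(test_info)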
fdc generates a flow duration curve for hydrologic data. A flow duration curve is a percent point function (ppf), displaying discharge as a function of the probability of that discharge occurring. The ppf is the inverse of the better-known cumulative distribution function (cdf). See this USGS publication for more information.
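In the functions below, exceedance probability is computed with the Weibull plotting position, P = m/(n + 1), where m is the rank of a flow value (the largest flow has rank 1) and n is the number of values. A minimal sketch of that calculation on a toy array:
import numpy as np
import scipy.stats as sp
q = np.sort(np.array([12., 3., 45., 7., 20.]))  # toy discharge values, sorted ascending
ranks = sp.rankdata(q, method='average')[::-1]  # reverse so the largest flow gets rank 1
prob = ranks / (len(q) + 1)  # fraction of time each flow is equaled or exceeded
print(list(zip(q, prob)))  # largest flow -> ~0.17, smallest flow -> ~0.83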
def fdc(df,site,begyear,endyear):
'''
Generate flow duration curve for hydrologic time series data
df = pandas dataframe containing data
site = column within dataframe that contains the flow values
begyear = start year for analysis
endyear = end year for analysis
'''
data = df[(df.index.to_datetime() > pd.datetime(begyear,1,1))&(df.index.to_datetime() < pd.datetime(endyear,1,1))]
data = data[site].dropna().values
data = np.sort(data)
ranks = sp.rankdata(data, method='average')
ranks = ranks[::-1]
prob = [100*(ranks[i]/(len(data)+1)) for i in range(len(data)) ]
plt.figure()
plt.scatter(prob,data,label=site)
plt.yscale('log')
plt.grid(which = 'both')
plt.xlabel('% of time that indicated discharge was exceeded or equaled')
plt.ylabel('discharge (cfs)')
plt.xticks(range(0,100,5))
plt.title('Flow duration curve for ' + site)
def fdc_simple(df, site, begyear=1900, endyear=2015, normalizer=1):
'''
Generate flow duration curve for hydrologic time series data
PARAMETERS:
df = pandas dataframe of interest; must have a date or date-time as the index
site = pandas column containing discharge data; must be within df
begyear = beginning year of analysis; defaults to 1900
endyear = end year of analysis; defaults to 2015
normalizer = value to use to normalize discharge; defaults to 1 (no normalization)
RETURNS:
matplotlib plot displaying the flow duration curve of the data
REQUIRES:
numpy as np
pandas as pd
matplotlib.pyplot as plt
scipy.stats as sp
'''
# limit dataframe to only the site
df = df[[site]]
# filter dataframe to only include dates of interest
data = df[(df.index.to_datetime() > pd.datetime(begyear,1,1))&(df.index.to_datetime() < pd.datetime(endyear,1,1))]
# remove na values from dataframe
data = data.dropna()
# take average of each day of year (from 1 to 366) over the selected period of record
data['doy']=data.index.dayofyear
dailyavg = data[site].groupby(data['doy']).mean()
data = np.sort(dailyavg)
## uncomment the following to use normalized discharge instead of discharge
#mean = np.mean(data)
#std = np.std(data)
#data = [(data[i]-np.mean(data))/np.std(data) for i in range(len(data))]
data = [(data[i])/normalizer for i in range(len(data))]
# ranks data from smallest to largest
ranks = sp.rankdata(data, method='average')
# reverses rank order
ranks = ranks[::-1]
# calculate probability of each rank
prob = [(ranks[i]/(len(data)+1)) for i in range(len(data)) ]
# plot data via matplotlib
plt.plot(prob,data,label=site+' '+str(begyear)+'-'+str(endyear))
The function fdcmatch uses scipy's optimization routines to fit natural logarithm and exponential functions to the flow duration curves.
def fdcmatch(df, site, begyear=1900, endyear=2015, normalizer=1, fun=1):
'''
* This function creates a flow duration curve (or its inverse) and then matches a natural logarithmic function (or its inverse - exp)
to the flow duration curve
* The flow duration curve will be framed for averaged daily data for the duration of one year (366 days)
PARAMETERS:
df = pandas dataframe of interest; must have a date or date-time as the index
site = pandas column containing discharge data; must be within df
begyear = beginning year of analysis; defaults to 1900
endyear = end year of analysis; defaults to 2015
normalizer = value to use to normalize discharge; defaults to 1 (no normalization)
fun = 1 for probability as a function of discharge; 0 for discharge as a function of probability; default=1
* 1 will choose:
prob = a*ln(discharge*b+c)+d
* 0 will choose:
discharge = a*exp(prob*b+c)+d
RETURNS:
para, parb, parc, pard, r_squared_value, stderr
par = modifying variables for functions = a,b,c,d
r_squared_value = r squared value for model
stderr = standard error of the estimate
REQUIREMENTS:
pandas, scipy, numpy
'''
df = df[[site]]
# filter dataframe to only include dates of interest
data = df[(df.index.to_datetime() > pd.datetime(begyear,1,1))&(df.index.to_datetime() < pd.datetime(endyear,1,1))]
# remove na values from dataframe
data = data.dropna()
# take average of each day of year (from 1 to 366) over the selected period of record
data['doy']=data.index.dayofyear
dailyavg = data[site].groupby(data['doy']).mean()
data = np.sort(dailyavg)
## uncomment the following to use normalized discharge instead of discharge
#mean = np.mean(data)
#std = np.std(data)
#data = [(data[i]-np.mean(data))/np.std(data) for i in range(len(data))]
data = [(data[i])/normalizer for i in range(len(data))]
# ranks data from smallest to largest
ranks = sp.rankdata(data, method='average')
# reverses rank order
ranks = ranks[::-1]
# calculate probability of each rank
prob = [(ranks[i]/(len(data)+1)) for i in range(len(data)) ]
# choose which function to use
try:
if fun==1:
# function to determine probability as a function of discharge
def func(x,a,b,c,d):
return a*np.log(x*b+c)+d
# matches func to data
par, cov = op.curve_fit(func, data, prob)
# checks fit of curve match
slope, intercept, r_value, p_value, stderr = \
sp.linregress(prob, [par[0]*np.log(data[i]*par[1]+par[2])+par[3] for i in range(len(data))])
else:
# function to determine discharge as a function of probability
def func(x,a,b,c,d):
return a*np.exp(x*b+c)+d
# matches func to data
par, cov = op.curve_fit(func, prob, data)
# checks fit of curve match
slope, intercept, r_value, p_value, stderr = \
sp.linregress(data,[par[0]*np.exp(prob[i]*par[1]+par[2])+par[3] for i in range(len(prob))])
# return parameters (a,b,c,d), r-squared of model fit, and standard error of model fit
return par[0], par[1], par[2], par[3], round(r_value**2,2), round(stderr,5)
except (RuntimeError,TypeError):
return np.nan, np.nan, np.nan, np.nan, np.nan, np.nan
The list sites designates the USGS surface sites you are interested in analyzing.
sites = ['09309600','09309800',
'09310000','09310500','09310700','09311500','09312700','09313000',
'09317000','09317919','09317920','09317997','09318000','09318500','09319000',
'09323000','09324000','09324200','09324500','09325000','09325100','09326500','09327500','09327550',
'09330500','09331900','09331950','09332100',
'10148500',
'10205030','10206000','10206001','10208500',
'10210000','10211000','10215700','10215900','10216210','10216400','10217000']
sitelab = ['Q'+site for site in sites]
print(len(sites))
Import Stream Gage Discharge Data
Now we can run through all of the sites and import their data!
#d = {site: importusgssite(site) for site in sites}
d = {}
for site in sites:
#print site #use this line to error check site numbers
d[site] = importusgssite(site);
Let's merge all of the site data into one dataframe, so that all of the daily data are aligned and together in one place. We will call that one place USGS_Site_Data! While we are at it, we will add month, year, and day-of-year columns to make summarizing the data easier.
f = {}
f[sites[0]] = d[sites[0]]
for i in range(len(sites)-1):
f[sites[i+1]] = pd.merge(d[sites[i+1]], f[sites[i]], left_index=True, right_index=True, how='outer')
USGS_Site_Data = f[sites[-1]]
USGS_Site_Data['mon']=USGS_Site_Data.index.month
USGS_Site_Data['yr']=USGS_Site_Data.index.year
USGS_Site_Data['dy']=USGS_Site_Data.index.dayofyear
USGS_Site_Data.to_csv(dataroot+'USGS_Site_Data.csv')
Import Stream Gage Site Information
Now we should import the information on each site.
z = {}
for site in sites:
z[site] = getusgssiteinfo(site);
Like we did with the data, we will combine all of the site information together in one table, so that it is easy to read.
USGS_Site_Info = pd.concat(z)
USGS_Site_Info = USGS_Site_Info.reset_index()
USGS_Site_Info = USGS_Site_Info.drop(['level_1'],axis=1)
USGS_Site_Info = USGS_Site_Info.set_index(['level_0'])
USGS_Site_Info = USGS_Site_Info.drop(['code'],axis=1)
Let's extract the measurement start and end dates from the station data, as well as some basic summary statistics. We can tack this information onto the site information table we created above.
q = {}
m = {}
for site in sites:
q[site] = USGS_Site_Data[site].first_valid_index()
m[site] = USGS_Site_Data[site].last_valid_index()
USGS_start_date = pd.DataFrame(data=q, index=[0])
USGS_finish_date = pd.DataFrame(data=m, index=[0])
USGS_start_date = USGS_start_date.transpose()
USGS_start_date['start_date'] = USGS_start_date[0]
USGS_start_date = USGS_start_date.drop([0],axis=1)
USGS_finish_date = USGS_finish_date.transpose()
USGS_finish_date['fin_date'] = USGS_finish_date[0]
USGS_finish_date = USGS_finish_date.drop([0],axis=1)
USGS_start_fin = pd.merge(USGS_finish_date,USGS_start_date, left_index=True, right_index=True, how='outer' )
USGS_Site_Info = pd.merge(USGS_start_fin,USGS_Site_Info, left_index=True, right_index=True, how='outer' )
USGS_sum_stats = USGS_Site_Data[sites].describe()
USGS_sum_stats = USGS_sum_stats.transpose()
USGS_Site_Info = pd.merge(USGS_sum_stats,USGS_Site_Info, left_index=True, right_index=True, how='outer' )
The next line copies the site information table to the clipboard, where it can be pasted into a spreadsheet for those who like visualizing information in Excel and similar products.
USGS_Site_Info.to_clipboard()
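Note that to_clipboard needs a working clipboard backend (for example xclip or xsel on Linux). If one is not available, writing the table to a CSV next to the other output is an equivalent alternative:
# clipboard-free alternative
USGS_Site_Info.to_csv(dataroot + 'USGS_Site_Info.csv')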
We can use the site information to generate a Gantt chart showing the period of record for each station.
# designate variables
x2 = USGS_Site_Info['fin_date'].astype(datetime).values
x1 = USGS_Site_Info['start_date'].astype(datetime).values
y = USGS_Site_Info.index.astype(np.int)
names = USGS_Site_Info['name'].values
labs, tickloc, col = [], [], []
# create color iterator for multi-color lines in gantt chart
color=iter(cm.Dark2(np.linspace(0,1,len(y))))
plt.figure(figsize=[8,10])
fig, ax = plt.subplots()
# generate a line and line properties for each station
for i in range(len(y)):
c=next(color)
plt.hlines(i+1, x1[i], x2[i], label=y[i], color=c, linewidth=2)
labs.append(names[i].title()+" ("+str(y[i])+")")
tickloc.append(i+1)
col.append(c)
plt.ylim(0,len(y)+1)
plt.yticks(tickloc, labs)
# create custom x labels
plt.xticks(np.arange(datetime(np.min(x1).year,1,1),np.max(x2)+timedelta(days=365.25),timedelta(days=365.25*5)),rotation=45)
plt.xlim(datetime(np.min(x1).year,1,1),np.max(x2)+timedelta(days=365.25))
plt.xlabel('Date')
plt.ylabel('USGS Official Station Name and Station Id')
plt.grid()
plt.title('USGS Station Measurement Duration')
# color y labels to match lines
gytl = plt.gca().get_yticklabels()
for i in range(len(gytl)):
gytl[i].set_color(col[i])
plt.tight_layout()
plt.savefig(rootname+'gantt.pdf');
Visualizing Data
Exporting Figures
The following script will generate a series of box and whisker plots and save them in a pdf. It makes a box plot for each station, breaking the data into monthly intervals. Make sure to change the directory name in the script so it ends up in a recognizable place on your computer.
# create dictionary of integers and their month equivalent
months = {'1':'Jan.', '2':'Feb.', '3':'Mar.', '4':'Apr.', '5':'May', '6':'Jun.',
'7':'Jul.', '8':'Aug.', '9':'Sep.', '10':'Oct.', '11':'Nov.', '12':'Dec.', 'Total':'Total'}
# create empty dictionary to hold pandas Dataframes
j = {}
with PdfPages(rootname + 'station_boxplots.pdf') as pdfs:
ymax = 10000
ymin = 0.01
for i in range(len(sites)):
# make a dataframe containing summary statistics and store it in the j dictionary
j[sites[i]] = USGS_Site_Data.groupby('mon')[sites[i]].agg({'min':np.min, 'mean':np.mean,
'median':np.median, 'max':np.max, 'std':np.std,
'cnt':(lambda x: np.count_nonzero(~np.isnan(x)))}).reset_index()
# make a list of the custom labels you will use for your boxplot; this one will show the number of samples used to make the plot
labs = [months[(str(j[sites[i]]['mon'][b]))] + " (n=" + str(int(j[sites[i]]['cnt'][b])) + ")" for b in range(len(j[sites[i]]))]
# designate the location of each custom label
tickloc = [b+1 for b in range(len(j[sites[i]]['mon']))]
plt.figure()
USGS_Site_Data.boxplot(column=sites[i],by='mon', rot=70)
strtdt = str(USGS_Site_Info.ix[sites[i],'start_date'])[0:10]
findt = str(USGS_Site_Info.ix[sites[i],'fin_date'])[0:10]
siteName = USGS_Site_Info.ix[sites[i],'name'].title()
plt.title( siteName + ' (' + sites[i] + ') ' + strtdt + ' to ' + findt )
plt.suptitle('')
plt.yscale('log')
plt.ylabel('Discharge (cfs)')
plt.ylim((ymin,ymax))
plt.xlabel('Month')
# here is where your lists for the custom label come into play
plt.xticks(tickloc, labs)
pdfs.savefig()
plt.close()
# Save metadata of the pdf so you can find it later
d = pdfs.infodict()
d['Title'] = 'Monthly Station USGS Boxplots UMSS'
d['Author'] = u'Paul C. Inkenbrandt'
d['Subject'] = 'Boxplots of several USGS Surface Stations'
d['Keywords'] = 'USGS Surface NWIS Boxplot'
d['CreationDate'] = datetime.today()
d['ModDate'] = datetime.today()
Let's plot a few of the boxplots so you can see what they look like.
for i in range(1,3):
j[sites[i]] = USGS_Site_Data.groupby('mon')[sites[i]].agg([np.min, np.mean, np.median, np.max, np.std, np.size]).reset_index()
# make a list of the custom labels you will use for your boxplot; this one will show the number of samples used to make the plot
labs = [months[(str(j[sites[i]]['mon'][b]))] + " (n=" + str(int(j[sites[i]]['size'][b])) + ")" for b in range(len(j[sites[i]]))]
# designate the location of each custom label
tickloc = [b+1 for b in range(len(j[sites[i]]['mon']))]
plt.figure()
USGS_Site_Data.boxplot(column=sites[i],by='mon', rot=70)
plt.title(USGS_Site_Info.ix[sites[i],'name'].title() + ' (' + sites[i] + ')')
plt.suptitle('')
plt.yscale('log')
plt.ylabel('Discharge (cfs)')
plt.ylim((ymin,ymax))
plt.xlabel('Month')
# here is where your lists for the custom label come into play
plt.xticks(tickloc, labs)
plt.show()
plt.close()
This script will generate boxplots showing all of the station data.
# This script summarizes discharge for all sites and limits the number of box plots on one graph to the n variable
j=0
with PdfPages(rootname + 'sum_boxplots.pdf') as pdf:
while j < len(sites):
ymax = 10000
ymin = 0.01
n=10
# if statement allows for uneven number of sites on last page
if j+n >= len(sites):
plt.figure()
USGS_Site_Data[sites[j:]].plot(kind='box')
plt.title('Sites '+sites[j]+' to '+sites[-1] )
plt.yscale('log')
plt.xlabel('USGS Site')
plt.xticks(rotation=45)
plt.ylabel('discharge (cfs)')
plt.ylim((ymin,ymax))
pdf.savefig()
plt.show()
plt.close()
j = j+n
else:
plt.figure()
USGS_Site_Data[sites[j:j+n]].plot(kind='box')
plt.title('Sites '+sites[j]+' to '+sites[j+n] )
plt.yscale('log')
plt.xlabel('USGS Site')
plt.xticks(rotation=45)
plt.ylabel('discharge (cfs)')
plt.ylim((ymin,ymax))
pdf.savefig()
plt.show()
plt.close()
j = j+n
# Save metadata of the pdf so you can find it later
d = pdf.infodict()
d['Title'] = 'Summary USGS Boxplots UMSS'
d['Author'] = u'Paul C. Inkenbrandt'
d['Subject'] = 'Boxplots of several USGS Surface Stations'
d['Keywords'] = 'USGS Surface NWIS Boxplot'
d['CreationDate'] = datetime.today()
d['ModDate'] = datetime.today()
We should also produce hydrographs of each station.
xmax = USGS_Site_Data.index.astype(datetime).values[-1]
xmin = USGS_Site_Data.index.astype(datetime).values[0]
pdfs = PdfPages(rootname + 'station_hydrographs.pdf')
ymax = 10000
ymin = 0.1
for i in range(len(sites)):
x = USGS_Site_Data.index.values
y = USGS_Site_Data[sites[i]].values
plt.figure()
plt.plot(x,y)
strtdt = str(USGS_Site_Info.ix[sites[i],'start_date'])[0:10]
findt = str(USGS_Site_Info.ix[sites[i],'fin_date'])[0:10]
siteName = USGS_Site_Info.ix[sites[i],'name'].title()
plt.title( siteName + ' (' + sites[i] + ') ' + strtdt + ' to ' + findt )
plt.suptitle('')
plt.yscale('log')
plt.ylabel('Discharge (cfs)')
plt.ylim((ymin,ymax))
plt.xlabel('Year')
plt.xticks(np.arange(datetime(1905,1,1),xmax+timedelta(days=365.25),timedelta(days=365.25*5)),rotation=45)
plt.xlim(xmin,xmax)
pdfs.savefig()
plt.close()
# Save metadata of the pdf so you can find it later
d = pdfs.infodict()
d['Title'] = 'Monthly Station USGS Hydrographs UMSS'
d['Author'] = u'Paul C. Inkenbrandt'
d['Subject'] = 'Hydrograph of several USGS Surface Stations'
d['Keywords'] = 'USGS Surface NWIS Hydrograph'
d['CreationDate'] = datetime.today()
d['ModDate'] = datetime.today()
pd.date_range(start=xmin, end=xmax, freq='5AS').year
xmax = USGS_Site_Data.index[-1]
xmin = USGS_Site_Data.index[0]
plt.figure()
ticks = pd.date_range(start=xmin, end=xmax, freq='4AS')
USGS_Site_Data[sites[0:3]].plot(subplots=True,sharex=True,figsize=(10,8),logy=True, rot=90)
plt.xlim(xmin,xmax)
labs = pd.date_range(start=xmin, end=xmax, freq='4AS').year
plt.xticks(ticks,labs)
plt.show()
plt.close()
def lumped_hydro(i1,i2):
pdfs = PdfPages(rootname + 'station_hydrographs_lumped.pdf')
plt.figure()
ticks = pd.date_range(start=xmin, end=xmax, freq='4AS')
USGS_Site_Data[sites[i1:i2]].plot(subplots=True, sharex=True, figsize=(10,24),logy=True, rot=90)
plt.xlim(xmin,xmax)
labs = pd.date_range(start=xmin, end=xmax, freq='4AS').year
plt.xticks(ticks,labs)
pdfs.savefig()
plt.close()
lumped_hydro(0,10)
lumped_hydro(10,20)
lumped_hydro(20,30)
lumped_hydro(30,-1)
This script will iteratively produce flow duration curves for each of the stations. A flow duration curve is a percent point function (ppf), displaying discharge as a function of the probability of that discharge occurring. The ppf is the inverse of the better-known cumulative distribution function (cdf). See this USGS publication for more information.
with PdfPages(rootname+'station_fdc.pdf') as pdf:
ymax = 10000
ymin = 0.01
for i in range(len(sites)):
plt.figure()
fdc_simple(USGS_Site_Data,sites[i],1900,2015)
fdc_simple(USGS_Site_Data,sites[i],1900,1970)
fdc_simple(USGS_Site_Data,sites[i],1970,2015)
plt.ylim(0.01,10000)
plt.xlim(-.05,1.05)
plt.grid(which = 'both')
plt.legend()
plt.xlabel('probability that discharge was exceeded or equaled')
plt.title('Flow duration curve for ' + str(USGS_Site_Info['name'][i]).title() + ' ('+ sites[i] +')'+'\n'+
'Record: ' + str(USGS_Site_Info['start_date'][i])[0:10] + ' to ' + str(USGS_Site_Info['fin_date'][i])[0:10])
plt.yscale('log')
plt.ylabel('discharge (cfs)')
plt.xticks(np.arange(0,1.05,0.05))
pdf.savefig()
plt.close()
# Save metadata of the pdf so you can find it later
d = pdf.infodict()
d['Title'] = 'Flow Duration Curves USGS'
d['Author'] = u'Paul C. Inkenbrandt'
d['Subject'] = 'Flow Duration Curves of several USGS Surface Stations'
d['Keywords'] = 'USGS Surface NWIS FDC Flow Duration'
d['CreationDate'] = datetime.today()
d['ModDate'] = datetime.today()
For brevity, here is an example plot from the output of the script above.
plt.figure()
fdc_simple(USGS_Site_Data,sites[38],1900,2015)
fdc_simple(USGS_Site_Data,sites[38],1900,1970)
fdc_simple(USGS_Site_Data,sites[38],1970,2015)
plt.yscale('log')
plt.grid(which = 'both')
plt.xlabel('% of time that indicated discharge was exceeded or equaled')
plt.ylabel('discharge (cfs)')
plt.xticks(np.arange(0.0,1.05,0.05))
plt.title('Flow duration curve for ' + str(USGS_Site_Info['name'][38]).title() + ' ('+ sites[38] +')'+'\n'+
'Record: ' + str(USGS_Site_Info['start_date'][38])[0:10] + ' to ' + str(USGS_Site_Info['fin_date'][38])[0:10])
plt.legend()
Below are some sad attempts to model the flow duration curves:
site = sites[28]
df = USGS_Site_Data[[site]]
begyear = 1900
endyear = 2015
data = df[(df.index.to_datetime() > pd.datetime(begyear,1,1))&(df.index.to_datetime() < pd.datetime(endyear,1,1))]
data = data.dropna()
data['doy']=data.index.dayofyear
dailyavg = data[site].groupby(data['doy']).mean()
data = np.sort(dailyavg)
mean = np.mean(data)
std = np.std(data)
f = [(data[i]) for i in range(len(data))]
#f = [(data[i])/mean for i in range(len(data))]
#f = [(data[i]-std)/mean for i in range(len(data))]
ranks = sp.rankdata(f, method='average')
ranks = ranks[::-1]
prob = [(ranks[i]/(len(f)+1)) for i in range(len(f)) ]
x = prob
y = f
plt.figure()
plt.scatter(y,x,label=site,color='blue')
#plt.yscale('log')
#plt.xlim(-.05,1.05)
plt.grid(which = 'both')
plt.ylabel('probability that discharge was exceeded or equaled')
plt.xlabel('normalized discharge')
#plt.xticks(np.arange(0,1.00,0.05))
plt.title('Flow duration curve for ' + site + ' averaged from ' + str(begyear) + ' to ' + str(endyear))
def func(x,a,b,c,d):
return a*np.log(x*b+c)+d
par, cov = op.curve_fit(func,y,x,p0=[-0.16528617, 1.54535185, -24.70440088, 0.9])
plt.plot(y, [par[0]*np.log(y[i]*par[1]+par[2])+par[3] for i in range(len(y))], color='red')
print('curve fit', sp.linregress(x,[par[0]*np.log(y[i]*par[1]+par[2])+par[3] for i in range(len(y))]))
plt.figure()
plt.scatter(x,y,label=site,color='blue')
#plt.yscale('log')
#plt.xlim(-.05,1.05)
plt.grid(which = 'both')
plt.xlabel('probability that discharge was exceeded or equaled')
plt.ylabel('normalized discharge')
#plt.xticks(np.arange(0,1.00,0.05))
plt.title('Flow duration curve for ' + site + ' averaged from ' + str(begyear) + ' to ' + str(endyear))
def func2(x,a,b,c,d):
return a*np.exp(x*b+c)+d
parm, covm = op.curve_fit(func2,x,y,p0=[-0.16528617, 0.02, 0.70440088, 0.9])
plt.plot(x, [parm[0]*np.exp(x[i]*parm[1]+parm[2])+parm[3] for i in range(len(x))], color='red')
print('curve fit', sp.linregress(y,[parm[0]*np.exp(x[i]*parm[1]+parm[2])+parm[3] for i in range(len(x))]))
def dic2df(dic,head):
df = pd.DataFrame(data=dic)
df = df.transpose()
df.columns = [str(head)+'_var1',str(head)+'_var2',str(head)+'_var3',str(head)+'_var4',str(head)+'_r2',str(head)+'_err']
return df
q = {}; m = {}; n = {}; k = {}; j = {}; p = {}
for site in sites:
q[site] = fdcmatch(USGS_Site_Data,site,1900,2015,1,1)
m[site] = fdcmatch(USGS_Site_Data,site,1900,1970,1,1)
n[site] = fdcmatch(USGS_Site_Data,site,1970,2015,1,1)
k[site] = fdcmatch(USGS_Site_Data,site,1900,2015,1,0)
j[site] = fdcmatch(USGS_Site_Data,site,1900,1970,1,0)
p[site] = fdcmatch(USGS_Site_Data,site,1970,2015,1,0)
dics = [q,m,n,k,j,p]
heads = ['all','to70','fm70','allin','to70in','fm70in']
USGS_q = dic2df(q,'all')
USGS_m = dic2df(m,'to70')
USGS_parms = pd.merge(USGS_q,USGS_m, left_index=True, right_index=True, how='outer' )
for i in range(2,6,1):
x = dic2df(dics[i],heads[i])
USGS_parms = pd.merge(USGS_parms,x, left_index=True, right_index=True, how='outer' )
USGS_parms.to_clipboard()
#USGS_finish_date = USGS_finish_date.drop([0],axis=1)
#USGS_start_fin = pd.merge(USGS_finish_date,USGS_start_date, left_index=True, right_index=True, how='outer' )
The equations from the modeled fdc plots can be used to estimate the discharge for a site based on data from a similar site. However, the results are mediocre.
n1 = 12
n2 = 11
USGS_Site_Data[sites[n1]+'p'] = [USGS_parms['all_var1'][n1]*np.log(USGS_Site_Data[sites[n1]][i]*USGS_parms['all_var2'][n1]+\
USGS_parms['all_var3'][n1])+USGS_parms['all_var4'][n1] for i in \
range(len(USGS_Site_Data[sites[n1]]))]
USGS_Data = USGS_Site_Data[USGS_Site_Data[sites[n1]+'p']>0]
USGS_Data[sites[n2]+'d'] = [USGS_parms['allin_var1'][n2]*np.exp(USGS_Data[sites[n1]+'p'][i]*USGS_parms['allin_var2'][n2]+\
USGS_parms['allin_var3'][n2])+USGS_parms['allin_var4'][n2] for i in \
range(len(USGS_Data[sites[n1]+'p']))]
y1 = USGS_Data[sites[n2]+'d'].values
y2 = USGS_Site_Data[sites[n2]].values
x1 = USGS_Data.index.to_datetime()
x2 = USGS_Site_Data.index.to_datetime()
y3 = USGS_Data[sites[n1]]
plt.figure()
plt.plot(x1,y1, label="modeled discharge")
plt.plot(x2,y2, label="actual discharge")
plt.plot(x1,y3, label="reference discharge")
plt.yscale('log')
plt.legend()
#plt.xlim(USGS_Site_Info['start_date'][n2],USGS_Site_Info['fin_date'][n2])
plt.xlim('1/1/1970','1/1/1974')
#plt.ylim(1,100)
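One rough way to quantify how mediocre the estimate is (a sketch that reuses the modeled and observed columns created above) is to regress the modeled series against the observed one over their overlapping dates:
# align modeled and observed discharge on their common dates, then regress
overlap = pd.concat([USGS_Data[sites[n2]+'d'], USGS_Site_Data[sites[n2]]], axis=1).dropna()
overlap.columns = ['modeled', 'observed']
slope, intercept, r_value, p_value, stderr = sp.linregress(overlap['observed'], overlap['modeled'])
print('r-squared of modeled vs. observed discharge:', round(r_value**2, 3))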
Run the following script if you want to see a map of your stations. This assumes that you have the Basemap package installed.
from mpl_toolkits.basemap import Basemap
X = USGS_Site_Info['longitude'].astype(float).values.tolist()
Y = USGS_Site_Info['latitude'].astype(float).values.tolist()
n = 0.05
m = Basemap(llcrnrlon=min(X)+n,llcrnrlat=min(Y)+n,urcrnrlon=max(X)+n,urcrnrlat=max(Y)+n,
resolution='h',projection='cyl',lon_0=np.mean(X),lat_0=np.mean(Y))
m.drawrivers(color='blue',linewidth=0.5)
m.drawcounties(color='red',linewidth=0.5)
m.arcgisimage()
#m.etopo(scale=0.5)
lons = X
lats = Y
x,y = m(lons,lats)
m.plot(x,y,'ro', markersize=8)
#m.drawmapscale(lon=-114, lat=43.5, length=100, lon0=-114, lat0=39, barstyle='simple', units='km')
Analysis in R
col = USGS_Site_Data.columns
import rpy2.robjects as robj
import rpy2.rlike.container as rlc
def pandas_data_frame_to_rpy2_data_frame(pDataframe):
orderedDict = rlc.OrdDict()
for columnName in pDataframe:
columnValues = pDataframe[columnName].values
filteredValues = [value if pd.notnull(value) else robj.NA_Real
for value in columnValues]
try:
orderedDict[columnName] = robj.FloatVector(filteredValues)
except ValueError:
orderedDict[columnName] = robj.StrVector(filteredValues)
rDataFrame = robj.DataFrame(orderedDict)
rDataFrame.rownames = robj.StrVector(pDataframe.index)
return rDataFrame
USD = pandas_data_frame_to_rpy2_data_frame(USGS_Site_Data)
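With the data frame converted, R functions can be called on it directly through rpy2. A minimal sketch (assuming a working R installation) that runs R's built-in summary on the converted frame:
summary_r = robj.r['summary'](USD)  # call R's summary() on the rpy2 data frame
print(summary_r)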