Task:
Make your Python scripts run faster
Solution:
multiprocessing, cython, numba
One of the counterarguments that you constantly hear about using Python is that it is slow. This is true to some extent, but most of the tools that scientists mainly use, like numpy, scipy and pandas, have big chunks written in C, so they are very fast. For most geoscientific applications the main advice is to use vectorisation whenever possible and to avoid loops. However, sometimes loops are unavoidable, and then Python's speed can get on your nerves. Fortunately there are several easy ways to make your Python loops faster.
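As a quick illustration of what vectorisation means here (the array is made up for the example), both snippets below compute the same sum, but the second one runs in numpy's compiled C code instead of the Python interpreter:
import numpy as np

a = np.random.rand(1000000)

# explicit Python loop: every addition goes through the interpreter
total = 0.0
for x in a:
    total += x

# vectorised equivalent: a single call into numpy's compiled C code
total = a.sum()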
Multiprocessing¶
Let's first download some data to work with (NCEP reanalysis air temperature):
#variabs = ['air']
#for vvv in variabs:
#    for i in range(2000,2010):
#        !wget ftp://ftp.cdc.noaa.gov/Datasets/ncep.reanalysis.dailyavgs/surface/{vvv}.sig995.{i}.nc
ls *nc
These are netCDF files, so we need the netCDF4 library:
from netCDF4 import Dataset
Now we create a useless but time-consuming function that has a lot of loops in it. It takes a year as input and simply sums up all the numbers from the file one by one.
def useless(year):
    from netCDF4 import Dataset
    f = Dataset('air.sig995.'+year+'.nc')
    a = f.variables['air'][:]
    a_cum = 0
    for i in range(a.shape[0]):
        for j in range(a.shape[1]):
            for n in range(a.shape[2]):
                a_cum = a_cum+a[i,j,n]

    a_cum.tofile(year+'.bin')
    print(year)
    return a_cum
It is slow enough:
%%time
useless('2000')
We can create a loop that will process several files one by one. First make a list of years:
years = [str(x) for x in range(2000,2008)]
years
And now the loop:
%%time
for yy in years:
    useless(yy)
Processing of each file is independent of the others. This is an "embarrassingly parallel" problem, and it can be parallelised very easily with the multiprocessing module from the standard library.
Most modern computers have more than one processor. Even the five-year-old machine on which I am editing this notebook has two. One way to check how many processors you have is to run:
!nproc
Now let's import multiprocessing:
import multiprocessing
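As a side note, nproc is a Linux command; a portable alternative is to ask the module itself how many processors are available:
multiprocessing.cpu_count()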
And create a pool with 2 worker processes (you can use the number of processors you have):
pool = multiprocessing.Pool(processes=2)
Let's have a look at how fast we get:
%%time
r = pool.map(useless, years)
pool.close()
More than two times faster - not bad! But we can do better!
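A small aside before moving on: in Python 3 a Pool can also be used as a context manager, which shuts the pool down automatically even if an error occurs, so the two cells above are equivalent to this sketch:
# the with-statement takes care of pool cleanup for us
with multiprocessing.Pool(processes=2) as pool:
    r = pool.map(useless, years)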
Cython¶
Cython is an optimising static compiler for Python. The Cython magic ships with Cython itself, and we can just load it (you have to have Cython already installed; in old IPython versions the extension was called cythonmagic):
%load_ext Cython
The only thing we do here is add the %%cython magic at the top of the cell. The function will then be compiled with Cython.
%%cython
def useless_cython(year):
    from netCDF4 import Dataset
    f = Dataset('air.sig995.'+year+'.nc')
    a = f.variables['air'][:]
    a_cum = 0
    for i in range(a.shape[0]):
        for j in range(a.shape[1]):
            for n in range(a.shape[2]):
                a_cum = a_cum+a[i,j,n]

    a_cum.tofile(year+'.bin')
    print(year)
    return a_cum
This alone gives us a good boost:
%%time
useless_cython('2000')
One processor:
%%time
for yy in years:
    useless_cython(yy)
Two processors:
%%time
pool = multiprocessing.Pool(processes=2)
r = pool.map(useless_cython, years)
pool.close()
But the true power of Cython is revealed only when you provide the types of your variables, using the cdef keyword inside the function. There are also a couple of other modifications to the function.
%%cython
import numpy as np

def useless_cython(year):
    # define types of variables
    cdef int i, j, n
    cdef double a_cum

    from netCDF4 import Dataset
    f = Dataset('air.sig995.'+year+'.nc')
    a = f.variables['air'][:]

    a_cum = 0.
    for i in range(a.shape[0]):
        for j in range(a.shape[1]):
            for n in range(a.shape[2]):
                # here we have to convert the numpy value to a plain float
                a_cum = a_cum+float(a[i,j,n])

    # since a_cum is not a numpy variable anymore,
    # we introduce a new variable d in order to save
    # the data to the file easily
    d = np.array(a_cum)
    d.tofile(year+'.bin')
    print(year)
    return d
%%time
useless_cython('2000')
One processor:
%%time
for yy in years:
    useless_cython(yy)
Multiprocessing:
%%time
pool = multiprocessing.Pool(processes=2)
r = pool.map(useless_cython, years)
pool.close()
From 9 seconds to 4 on two processors - not bad! But we can do better! :)
Numba¶
Numba is a just-in-time specializing compiler which compiles annotated Python and NumPy code to LLVM (through decorators). The easiest way to install it is to use the Anaconda distribution.
import numpy as np
from numba import jit
We now have to split our function in two (which would have been a good idea from the beginning). One is just the number-crunching part, the other is responsible for IO. The only thing we have to do afterwards is to put the jit decorator in front of the first function.
@jit
def calc_sum(a):
    a_cum = 0.
    for i in range(a.shape[0]):
        for j in range(a.shape[1]):
            for n in range(a.shape[2]):
                a_cum = a_cum+a[i,j,n]
    return a_cum

def useless_numba(year):
    #from netCDF4 import Dataset
    f = Dataset('air.sig995.'+year+'.nc')
    a = f.variables['air'][:]
    a_cum = calc_sum(a)

    d = np.array(a_cum)
    d.tofile(year+'.bin')
    print(year)
    return d
%%time
useless_numba('2000')
One processor:
%%time
for yy in years:
    useless_numba(yy)
Two processors:
%%time
pool = multiprocessing.Pool(processes=2)
r = pool.map(useless_numba, years)
pool.close()
Nice! Maybe we can speed up a bit more?
You can also provide the types of the input and output (here f8 stands for a float64 return value and f4[:,:,:] for a 3-dimensional float32 array):
@jit('f8(f4[:,:,:])')
def calc_sum(a):
    a_cum = 0.
    for i in range(a.shape[0]):
        for j in range(a.shape[1]):
            for n in range(a.shape[2]):
                a_cum = a_cum+a[i,j,n]
    return a_cum

def useless_numba2(year):
    #from netCDF4 import Dataset
    f = Dataset('air.sig995.'+year+'.nc')
    a = f.variables['air'][:]
    a_cum = calc_sum(a)

    d = np.array(a_cum)
    d.tofile(year+'.bin')
    print(year)
    return d
%%time
useless_numba2('2000')
%%time
pool = multiprocessing.Pool(processes=2)
r = pool.map(useless_numba2, years)
pool.close()
just a tiny bit...
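A side note for newer Numba versions: autojit has been removed, and the recommended spelling for fully compiled code is njit (short for jit(nopython=True)), which refuses to fall back to the slow object mode. A sketch, equivalent to calc_sum above:
from numba import njit

@njit
def calc_sum_njit(a):
    # same triple loop, forced into nopython (machine code) mode
    a_cum = 0.
    for i in range(a.shape[0]):
        for j in range(a.shape[1]):
            for n in range(a.shape[2]):
                a_cum = a_cum + a[i, j, n]
    return a_cum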
Native numpy¶
This is how you really should solve this problem: with numpy.sum(). Note that the result will be different from the previous examples; it becomes the same only if you first convert the array to float64. Be careful when summing up huge numbers of values!
import numpy as np

def calc_sum(a):
    a = np.float64(a)
    return a.sum()

def useless_good(year):
    from netCDF4 import Dataset
    f = Dataset('air.sig995.'+year+'.nc')
    a = f.variables['air'][:]
    a_cum = calc_sum(a)

    d = np.array(a_cum)
    d.tofile(year+'.bin')
    print(year)
    return d
%%time
useless_good('2000')
%%time
for yy in years:
    useless_good(yy)
%%time
pool = multiprocessing.Pool(processes=2)
r = pool.map(useless_good, years)
pool.close()
Actually the numpy version is a bit slower than the numba version, but it's not clear to me how much I can trust this result.
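To see why the float64 conversion mentioned above matters, here is a small self-contained illustration; the random array is made up for the example, with a shape and value range similar to the temperature data:
import numpy as np

a = np.random.uniform(230., 310., size=(1464, 73, 144)).astype(np.float32)

# summing ~15 million float32 values accumulates rounding error,
# so the two results typically differ in the last digits
print(a.sum())
print(a.astype(np.float64).sum())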
Cython 2nd try¶
I was surprised to see the Cython results differ so much from the Numba results, so I decided to make a second try. Here I make Cython aware of numpy arrays.
%%cython
import numpy as np
cimport numpy as np
cimport cython

def calc_sum(np.ndarray[float, ndim=3] a):
    cdef int i, j, n
    cdef float a_cum
    a_cum = 0
    for i in range(a.shape[0]):
        for j in range(a.shape[1]):
            for n in range(a.shape[2]):
                a_cum = a_cum+(a[i,j,n])
    return a_cum

def useless_cython2(year):
    from netCDF4 import Dataset
    f = Dataset('air.sig995.'+year+'.nc')
    a = f.variables['air'][:]
    a_cum = calc_sum(a)

    d = np.array(a_cum)
    d.tofile(year+'.bin')
    print(year)
    return d
%%time
useless_cython2('2000')
%%time
for yy in years:
    useless_cython2(yy)
%%time
pool = multiprocessing.Pool(processes=2)
r = pool.map(useless_cython2, years)
pool.close()
Now it's comparable to the other competitors :)
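If you want to squeeze out a little more, Cython can also drop the bounds checking and negative-index handling it normally performs on every array access. This is a sketch under the assumption that all indices stay in range (they do here, since they come straight from a.shape):
%%cython
import numpy as np
cimport numpy as np
cimport cython

@cython.boundscheck(False)  # no bounds check on each a[i,j,n] access
@cython.wraparound(False)   # no negative-index handling
def calc_sum_fast(np.ndarray[float, ndim=3] a):
    cdef int i, j, n
    cdef double a_cum = 0.
    for i in range(a.shape[0]):
        for j in range(a.shape[1]):
            for n in range(a.shape[2]):
                a_cum = a_cum + a[i, j, n]
    return a_cum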