How to make your Python code run faster

Task:

 Make your Python scripts run faster

Solution:

multiprocessing, Cython, Numba

One of the counterarguments that you constantly hear about using Python is that it is slow. This is somewhat true in many cases, although most of the tools that scientists mainly use, like numpy, scipy and pandas, have big chunks written in C, so they are very fast. For most geoscientific applications the main advice is to use vectorisation whenever possible and avoid loops. However, sometimes loops are unavoidable, and then Python's speed can get on your nerves. Fortunately there are several easy ways to make your Python loops faster.
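For example, summing a big array element by element in a pure Python loop is dramatically slower than calling the vectorised numpy equivalent. A minimal sketch, with a made-up random array:

import numpy as np

a = np.random.rand(1000000)

# slow: a pure Python loop touches every element one by one
total = 0.0
for x in a:
    total += x

# fast: the same sum, computed in compiled C code inside numpy
total = a.sum()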

Multiprocessing

Let's first download some data to work with (NCEP reanalysis air temperature):

In [2]:
#variabs = ['air']
#for vvv in variabs:
#    for i in range(2000,2010):
#        !wget ftp://ftp.cdc.noaa.gov/Datasets/ncep.reanalysis.dailyavgs/surface/{vvv}.sig995.{i}.nc
In [3]:
ls *nc
air.sig995.2000.nc  air.sig995.2002.nc  air.sig995.2004.nc  air.sig995.2006.nc  air.sig995.2008.nc  air.sig995.2012.nc
air.sig995.2001.nc  air.sig995.2003.nc  air.sig995.2005.nc  air.sig995.2007.nc  air.sig995.2009.nc

These are netCDF files, so we need the netCDF4 library:

In [38]:
from netCDF4 import Dataset

Now we create a useless but time-consuming function that has a lot of loops in it. It takes a year as input and then just sums up all the numbers from the file one by one.

In [39]:
def useless(year):
    from netCDF4 import Dataset
    # open the file for the given year and read the air temperature field
    f = Dataset('air.sig995.'+year+'.nc')
    a = f.variables['air'][:]
    # sum every value with three nested pure-Python loops
    a_cum = 0
    for i in range(a.shape[0]):
        for j in range(a.shape[1]):
            for n in range(a.shape[2]):
                a_cum = a_cum+a[i,j,n]

    # dump the result to a binary file and return it
    a_cum.tofile(year+'.bin')
    print(year)
    return a_cum

It is slow enough:

In [40]:
%%time
useless('2000')
2000
CPU times: user 3.49 s, sys: 24 ms, total: 3.52 s
Wall time: 3.49 s
Out[40]:
1068708186.2315979

We can create a loop that will process several files one by one. First make a list of years:

In [41]:
years = [str(x) for x in range(2000,2008)]
years
Out[41]:
['2000', '2001', '2002', '2003', '2004', '2005', '2006', '2007']

And now the loop:

In [42]:
%%time
for yy in years:
    useless(yy)
2000
2001
2002
2003
2004
2005
2006
2007
CPU times: user 27.2 s, sys: 288 ms, total: 27.5 s
Wall time: 28 s

Processing of each file is independent of the others. This is an "embarrassingly parallel" problem, and it can be parallelized very easily with the multiprocessing module from the standard library.

Most modern computers have more than one processor. Even the five-year-old machine where I am now editing this notebook has two. One way to check how many processors you have is to run:

In [43]:
!nproc
2
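If you prefer to stay inside Python, the standard library can report the same number:

import multiprocessing
print(multiprocessing.cpu_count())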

Now let's import multiprocessing:

In [44]:
import multiprocessing

And create a pool with 2 processes (use your own number of processors):

In [45]:
pool = multiprocessing.Pool(processes=2)

Let's have a look at how fast we get:

In [46]:
%%time
r = pool.map(useless, years)
pool.close()
CPU times: user 4 ms, sys: 0 ns, total: 4 ms
Wall time: 9.69 s
2000
2001
20022003

20042005

20072006

More than two times faster - not bad! (The year labels above are printed out of order and sometimes run together because both worker processes write to the screen at the same time.) But we can do better!
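A side note: here we close the pool by hand with pool.close(); in Python 3.3 and later the pool can also be used as a context manager, which cleans it up automatically. A minimal sketch of the same computation:

with multiprocessing.Pool(processes=2) as pool:
    r = pool.map(useless, years)
# the pool is terminated automatically when the with-block exits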

Cython

Cython is an optimising static compiler for Python. The Cython magic is one of the default extensions, and we can just load it (you have to have Cython installed already; note that in newer IPython/Jupyter versions the extension is loaded with %load_ext Cython instead of %load_ext cythonmagic):

In [47]:
%load_ext cythonmagic
The cythonmagic extension is already loaded. To reload it, use:
  %reload_ext cythonmagic

The only thing we do here is add the %%cython magic at the top of the cell. The function will then be compiled with Cython.

In [48]:
%%cython
def useless_cython(year):
    from netCDF4 import Dataset
    f = Dataset('air.sig995.'+year+'.nc')
    a = f.variables['air'][:]
    a_cum = 0
    for i in range(a.shape[0]):
        for j in range(a.shape[1]):
            for n in range(a.shape[2]):
                a_cum = a_cum+a[i,j,n]
                
    a_cum.tofile(year+'.bin')
    print(year)
    return a_cum

This alone gives us a good boost:

In [49]:
%%time
useless_cython('2000')
2000
CPU times: user 2.57 s, sys: 40 ms, total: 2.61 s
Wall time: 2.62 s
Out[49]:
1068708186.2315979

One processor:

In [50]:
%%time
for yy in years:
    useless_cython(yy)
2000
2001
2002
2003
2004
2005
2006
2007
CPU times: user 19.7 s, sys: 200 ms, total: 19.9 s
Wall time: 19.9 s

Two processors:

In [51]:
%%time
pool = multiprocessing.Pool(processes=2)
r = pool.map(useless_cython, years)
pool.close()
CPU times: user 0 ns, sys: 16 ms, total: 16 ms
Wall time: 7.27 s
2000
2001
20022003

20052004

20072006

But the true power of Cython is revealed only when you provide the types of your variables, using the cdef keyword in the function definition. There are also a couple of other modifications to the function.

In [52]:
%%cython
import numpy as np

def useless_cython(year):
    
    # define types of variables
    cdef int i, j, n
    cdef double a_cum
    
    from netCDF4 import Dataset
    f = Dataset('air.sig995.'+year+'.nc')
    a = f.variables['air'][:]
    
    a_cum = 0.
    for i in range(a.shape[0]):
        for j in range(a.shape[1]):
            for n in range(a.shape[2]):
                #here we have to convert numpy value to simple float
                a_cum = a_cum+float(a[i,j,n])
    
    # since a_cum is not numpy variable anymore,
    # we introduce new variable d in order to save
    # data to the file easily
    d = np.array(a_cum)
    d.tofile(year+'.bin')
    print(year)
    return d
In [53]:
%%time
useless_cython('2000')
2000
CPU times: user 1.16 s, sys: 20 ms, total: 1.18 s
Wall time: 1.19 s
Out[53]:
array(1068708186.2315979)

One processor:

In [54]:
%%time
for yy in years:
    useless_cython(yy)
2000
2001
2002
2003
2004
2005
2006
2007
CPU times: user 11.1 s, sys: 180 ms, total: 11.3 s
Wall time: 11.3 s

Multiprocessing:

In [55]:
%%time
pool = multiprocessing.Pool(processes=2)
r = pool.map(useless_cython, years)
pool.close()
CPU times: user 0 ns, sys: 12 ms, total: 12 ms
Wall time: 4.3 s
2000
2001
20022003

20052004

20072006

From 9 seconds to 4 on two processors - not bad! But we can do better! :)

Numba

Numba is a just-in-time specializing compiler which compiles annotated Python and NumPy code to LLVM (through decorators). The easiest way to install it is to use the Anaconda distribution.

In [56]:
from numba import jit, autojit

We now have to split our function in two (which would have been a good idea from the beginning): one part does just the number crunching, and the other is responsible for IO. The only thing we have to do afterwards is put the jit decorator in front of the first function.

In [57]:
@autojit
def calc_sum(a):
    a_cum = 0.
    for i in range(a.shape[0]):
        for j in range(a.shape[1]):
            for n in range(a.shape[2]):
                a_cum = a_cum+a[i,j,n]
    return a_cum

def useless_numba(year):
    #from netCDF4 import Dataset
    f = Dataset('air.sig995.'+year+'.nc')
    a = f.variables['air'][:]
    a_cum = calc_sum(a)
    
    d = np.array(a_cum)
    d.tofile(year+'.bin')
    print(year)
    return d
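A word of warning: autojit belongs to older Numba releases and has since been removed. In modern Numba a bare @jit decorator does the same lazy compilation, so the equivalent definition would look like this (a sketch, not tested with the versions used in this notebook):

from numba import jit

@jit(nopython=True)  # lazy compilation; replaces the removed autojit
def calc_sum(a):
    a_cum = 0.
    for i in range(a.shape[0]):
        for j in range(a.shape[1]):
            for n in range(a.shape[2]):
                a_cum = a_cum+a[i,j,n]
    return a_cum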
In [58]:
%%time
useless_numba('2000')
2000
CPU times: user 464 ms, sys: 12 ms, total: 476 ms
Wall time: 483 ms
Out[58]:
array(1068708186.2315979)

One processor:

In [59]:
%%time
for yy in years:
    useless_numba(yy)
2000
2001
2002
2003
2004
2005
2006
2007
CPU times: user 1.53 s, sys: 152 ms, total: 1.68 s
Wall time: 1.7 s

Two processors:

In [60]:
%%time
pool = multiprocessing.Pool(processes=2)
r = pool.map(useless_numba, years)
pool.close()
CPU times: user 4 ms, sys: 8 ms, total: 12 ms
Wall time: 912 ms
2000
2001
20022003

20042005

20062007

Nice! Maybe we can speed up a bit more?

You can also provide types for the input and output:

In [61]:
@jit('f8(f4[:,:,:])')
def calc_sum(a):
    a_cum = 0.
    for i in range(a.shape[0]):
        for j in range(a.shape[1]):
            for n in range(a.shape[2]):
                a_cum = a_cum+a[i,j,n]
    return a_cum

def useless_numba2(year):
    #from netCDF4 import Dataset
    f = Dataset('air.sig995.'+year+'.nc')
    a = f.variables['air'][:]
    a_cum = calc_sum(a)
    
    d = np.array(a_cum)
    d.tofile(year+'.bin')
    print(year)
    return d
In [62]:
%%time
useless_numba2('2000')
2000
CPU times: user 216 ms, sys: 16 ms, total: 232 ms
Wall time: 244 ms
Out[62]:
array(1068708186.2315979)
In [65]:
%%time
pool = multiprocessing.Pool(processes=2)
r = pool.map(useless_numba2, years)
pool.close()
CPU times: user 8 ms, sys: 4 ms, total: 12 ms
Wall time: 884 ms
2000
2001
20022003

20042005

20062007

Just a tiny bit...

Native numpy

This is how you really should solve this problem, using numpy.sum(). Note that the result would be different from the previous examples if we summed the original float32 data directly; only if you first convert to float64 does it become the same. Be careful when summing huge amounts of numbers!
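A tiny illustration of why the dtype matters when the numbers get big (a 32-bit float has only about 7 significant decimal digits):

import numpy as np

# in float32, adding 1 to 1e8 is lost to rounding entirely
print(np.float32(1e8) + np.float32(1) == np.float32(1e8))   # True
# float64 has about 16 digits, so the same addition is still exact
print(np.float64(1e8) + np.float64(1) == np.float64(1e8))   # False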

In [66]:
import numpy as np
def calc_sum(a):
    a = np.float64(a)
    return a.sum()

def useless_good(year):
    from netCDF4 import Dataset
    f = Dataset('air.sig995.'+year+'.nc')
    a = f.variables['air'][:]
    a_cum = calc_sum(a)
    
    d = np.array(a_cum)
    d.tofile(year+'.bin')
    print(year)
    return d
In [67]:
%%time
useless_good('2000')
2000
CPU times: user 172 ms, sys: 44 ms, total: 216 ms
Wall time: 224 ms
Out[67]:
array(1068708186.2315979)
In [68]:
%%time
for yy in years:
    useless_good(yy)
2000
2001
2002
2003
2004
2005
2006
2007
CPU times: user 1.76 s, sys: 296 ms, total: 2.06 s
Wall time: 2.07 s
In [69]:
%%time
pool = multiprocessing.Pool(processes=2)
r = pool.map(useless_good, years)
pool.close()
CPU times: user 4 ms, sys: 8 ms, total: 12 ms
Wall time: 1.04 s
2001
2000
20032002

20052004

20072006

Actually the numpy version is a bit slower than the Numba version, but it's not clear to me how much I can trust this result.
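For a more trustworthy comparison one could time just the number crunching with IPython's %timeit magic, which repeats the statement several times and reports the best run. A sketch (calc_sum here is the numpy version defined above):

f = Dataset('air.sig995.2000.nc')
a = f.variables['air'][:]
%timeit calc_sum(a)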

Cython 2nd try

I was surprised to see the Cython results differ so much from Numba, and decided to make a second try. Here I make Cython aware of the numpy arrays.

In [76]:
%%cython
import numpy as np
cimport numpy as np
cimport cython

def calc_sum(np.ndarray[float, ndim=3] a):
        
    cdef int i, j, n
    cdef float a_cum
    
    a_cum = 0
    for i in range(a.shape[0]):
        for j in range(a.shape[1]):
            for n in range(a.shape[2]):
                a_cum = a_cum+(a[i,j,n])
    return a_cum
In [77]:
def useless_cython2(year):
    from netCDF4 import Dataset
    f = Dataset('air.sig995.'+year+'.nc')
    a = f.variables['air'][:]
    a_cum = calc_sum(a)
    
    d = np.array(a_cum)
    d.tofile(year+'.bin')
    print(year)
    return d
In [78]:
%%time
useless_cython2('2000')
2000
CPU times: user 184 ms, sys: 36 ms, total: 220 ms
Wall time: 225 ms
Out[78]:
array(1068708186.2315979)
In [79]:
%%time
for yy in years:
    useless_cython2(yy)
2000
2001
2002
2003
2004
2005
2006
2007
CPU times: user 1.44 s, sys: 164 ms, total: 1.6 s
Wall time: 1.62 s
In [80]:
%%time
pool = multiprocessing.Pool(processes=2)
r = pool.map(useless_cython2, years)
pool.close()
CPU times: user 4 ms, sys: 8 ms, total: 12 ms
Wall time: 866 ms
2000
2001
20022003

20042005

20062007

Now it's comparable to the other competitors :)
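To put the wall-clock timings from this notebook side by side (one file / loop over 8 files / pool with 2 processes):

version                  one file    8 files, loop    8 files, pool
pure Python loop         3.49 s      28 s             9.69 s
Cython, no types         2.62 s      19.9 s           7.27 s
Cython, typed            1.19 s      11.3 s           4.3 s
Numba, autojit           483 ms      1.7 s            912 ms
Numba, typed             244 ms      --               884 ms
numpy sum (float64)      224 ms      2.07 s           1.04 s
Cython + numpy arrays    225 ms      1.62 s           866 ms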
