**Task:**

```
Make your python scripts run faster
```

**Solution:**

```
multiprocessor, cython, numba
```

One of the counterarguments that you constantly hear about using python is that it is slow. This is somehow true for many cases, while most of the tools that scientist mainly use, like `numpy`

, `scipy`

and `pandas`

have big chunks written in `C`

, so they are very fast. For most of the geoscientific applications main advice would be to use vectorisation whenever possible, and avoid loops. However sometimes loops are unavoidable, and then python speed can get on to your nerves. Fortunately there are several easy ways to make your python loops faster.

## Multiprocessor

Let's first download some data to work with (NCEP reanalysis air temperature):

```
#variabs = ['air']
#for vvv in variabs:
# for i in range(2000,2010):
# !wget ftp://ftp.cdc.noaa.gov/Datasets/ncep.reanalysis.dailyavgs/surface/{vvv}.sig995.{i}.nc
```

```
ls *nc
```

This is netCDF files, so we need `netCDF4`

library:

```
from netCDF4 import Dataset
```

Now we create useless but time consuming function, that have a lot of loops in it. It takes year as an input and then just summs up all the numbers from the file one by one.

```
def useless(year):
from netCDF4 import Dataset
f = Dataset('air.sig995.'+year+'.nc')
a = f.variables['air'][:]
a_cum = 0
for i in range(a.shape[0]):
for j in range(a.shape[1]):
for n in range(a.shape[2]):
a_cum = a_cum+a[i,j,n]
a_cum.tofile(year+'.bin')
print(year)
return a_cum
```

It works slow enough:

```
%%time
useless('2000')
```

We can create a loop that will process several files one by one. First make a list of years:

```
years = [str(x) for x in range(2000,2008)]
years
```

And now the loop:

```
%%time
for yy in years:
useless(yy)
```

Processing of each file is independent from others. This is "embarrassingly parallel" problem and can be done very easily in parallel with `multiprocessing`

module of the standard library.

Most of the modern computers have more than one processor. Even the 5 year old machine, where I now edit this notebook have two. The way to check how many processors you have is to run:

```
!nproc
```

Now let's import `multiprocessing`

:

```
import multiprocessing
```

And create a `pool`

with 2 processors (you can use your number of processors):

```
pool = multiprocessing.Pool(processes=2)
```

Let's have a look how fast do we get:

```
%%time
r = pool.map(useless, years)
pool.close()
```

More than two times faster - not bad! But we can do better!

## Cython

Cython is an optimising static compiler for Python. Cython magic is one of the default extensions, and we can just load it (you have to have cython already installed):

```
%load_ext cythonmagic
```

The only thing we do here - add `%%cython`

magic at the top of the cell. This function will be compiled with cython.

```
%%cython
def useless_cython(year):
from netCDF4 import Dataset
f = Dataset('air.sig995.'+year+'.nc')
a = f.variables['air'][:]
a_cum = 0
for i in range(a.shape[0]):
for j in range(a.shape[1]):
for n in range(a.shape[2]):
a_cum = a_cum+a[i,j,n]
a_cum.tofile(year+'.bin')
print(year)
return a_cum
```

Only this give us a good boost:

```
%%time
useless_cython('2000')
```

One processor:

```
%%time
for yy in years:
useless_cython(yy)
```

Two processors:

```
%%time
pool = multiprocessing.Pool(processes=2)
r = pool.map(useless_cython, years)
pool.close()
```

But the true power of cython revealed only when you provide types of your variables. You have to use `cdef`

keyword in the function definition to do so. There are also couple other modifications to the function

```
%%cython
import numpy as np
def useless_cython(year):
# define types of variables
cdef int i, j, n
cdef double a_cum
from netCDF4 import Dataset
f = Dataset('air.sig995.'+year+'.nc')
a = f.variables['air'][:]
a_cum = 0.
for i in range(a.shape[0]):
for j in range(a.shape[1]):
for n in range(a.shape[2]):
#here we have to convert numpy value to simple float
a_cum = a_cum+float(a[i,j,n])
# since a_cum is not numpy variable anymore,
# we introduce new variable d in order to save
# data to the file easily
d = np.array(a_cum)
d.tofile(year+'.bin')
print(year)
return d
```

```
%%time
useless_cython('2000')
```

One processor:

```
%%time
for yy in years:
useless_cython(yy)
```

Multiprocessing:

```
%%time
pool = multiprocessing.Pool(processes=2)
r = pool.map(useless_cython, years)
pool.close()
```

From 9 seconds to 4 on two processors - not bad! But we can do better! :)

## Numba

Numba is an just-in-time specializing compiler which compiles annotated Python and NumPy code to LLVM (through decorators). The easiest way to install it is to use Anaconda distribution.

```
from numba import jit, autojit
```

We now have to split our function in two (that would be a good idea from the beggining). One is just number crunching part, and another responsible for IO. The only thing that we have to do afterwards is to put `jit`

decorator in front of the first function.

```
@autojit
def calc_sum(a):
a_cum = 0.
for i in range(a.shape[0]):
for j in range(a.shape[1]):
for n in range(a.shape[2]):
a_cum = a_cum+a[i,j,n]
return a_cum
def useless_numba(year):
#from netCDF4 import Dataset
f = Dataset('air.sig995.'+year+'.nc')
a = f.variables['air'][:]
a_cum = calc_sum(a)
d = np.array(a_cum)
d.tofile(year+'.bin')
print(year)
return d
```

```
%%time
useless_numba('2000')
```

One processor:

```
%%time
for yy in years:
useless_numba(yy)
```

Two processors:

```
%%time
pool = multiprocessing.Pool(processes=2)
r = pool.map(useless_numba, years)
pool.close()
```

Nice! Maybe we can speed up a bit more?

You can also provide type for the input and output:

```
@jit('f8(f4[:,:,:])')
def calc_sum(a):
a_cum = 0.
for i in range(a.shape[0]):
for j in range(a.shape[1]):
for n in range(a.shape[2]):
a_cum = a_cum+a[i,j,n]
return a_cum
def useless_numba2(year):
#from netCDF4 import Dataset
f = Dataset('air.sig995.'+year+'.nc')
a = f.variables['air'][:]
a_cum = calc_sum(a)
d = np.array(a_cum)
d.tofile(year+'.bin')
print(year)
return d
```

```
%%time
useless_numba2('2000')
```

```
%%time
pool = multiprocessing.Pool(processes=2)
r = pool.map(useless_numba2, years)
pool.close()
```

just a tiny bit...

## Native numpy

This is how you really should solve this problem using numpy.sum(). Note, that the result will be different compared to previous examples. Only if you first convert to `float64`

it becomes the same. Be careful when dealing with huge numbers!

```
import numpy as np
def calc_sum(a):
a = np.float64(a)
return a.sum()
def useless_good(year):
from netCDF4 import Dataset
f = Dataset('air.sig995.'+year+'.nc')
a = f.variables['air'][:]
a_cum = calc_sum(a)
d = np.array(a_cum)
d.tofile(year+'.bin')
print(year)
return d
```

```
%%time
useless_good('2000')
```

```
%%time
for yy in years:
useless_good(yy)
```

```
%%time
pool = multiprocessing.Pool(processes=2)
r = pool.map(useless_good, years)
pool.close()
```

Actually `numpy`

version is a bit slower than `numba`

version, but it's not clear to me how much I can trust this result.

## Cython 2nd try

I was surprised to see Cython results so much different from Numba, and decide to make a second try. Here I make cython to be aware of numpy arrays.

```
%%cython
import numpy as np
cimport numpy as np
cimport cython
def calc_sum(np.ndarray[float, ndim=3] a):
cdef int i, j, n
cdef float a_cum
a_cum = 0
for i in range(a.shape[0]):
for j in range(a.shape[1]):
for n in range(a.shape[2]):
a_cum = a_cum+(a[i,j,n])
return a_cum
```

```
def useless_cython2(year):
from netCDF4 import Dataset
f = Dataset('air.sig995.'+year+'.nc')
a = f.variables['air'][:]
a_cum = calc_sum(a)
d = np.array(a_cum)
d.tofile(year+'.bin')
print(year)
return d
```

```
%%time
useless_cython2('2000')
```

```
%%time
for yy in years:
useless_cython2(yy)
```

```
%%time
pool = multiprocessing.Pool(processes=2)
r = pool.map(useless_cython2, years)
pool.close()
```

Now it's comparable to other competitors :)

## Comments !