
Numba vs Numpy: some sums

Numba is an open-source optimizing compiler for Python. It seems to work like magic: just add a simple decorator to your pure-python function, and it immediately becomes 200 times faster – at least, so claims the Wikipedia article about Numba. That alone is hard to believe, but Wikipedia goes further and claims that a very naive implementation of a sum over a numpy array is 30% faster than numpy.sum. I would expect numpy.sum to be about as optimized as it can get, so this claim sounds even more ambitious. Let’s check it with some benchmarks!

System setup

As usual, I’m working on my Lenovo T440p laptop running AltLinux. It has a mobile Haswell processor with 2 cores and 4 hardware threads. To make the benchmarks more reproducible, I’ve disabled TurboBoost and locked the frequency scaling to just under 2 GHz (~1995 MHz).
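For reference, with the intel_pstate driver this can be done roughly like so, as root – a sketch, since the exact knobs depend on the cpufreq driver and distro (cpupower ships with the kernel tools):

# disable TurboBoost via the intel_pstate sysfs knob
echo 1 > /sys/devices/system/cpu/intel_pstate/no_turbo
# pin the scaling range to ~2 GHz on all cores
cpupower frequency-set --min 2GHz --max 2GHz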

%%sh
uname -a
Linux imelnikov.localdomain 4.14.24-un-def-alt0.M80P.1 #1 SMP PREEMPT Tue Mar 6 15:00:33 UTC 2018 x86_64 GNU/Linux
%%sh
cat /proc/cpuinfo | egrep '^model name|Hz'
model name	: Intel(R) Core(TM) i5-4300M CPU @ 2.60GHz
cpu MHz		: 1995.544
model name	: Intel(R) Core(TM) i5-4300M CPU @ 2.60GHz
cpu MHz		: 1995.545
model name	: Intel(R) Core(TM) i5-4300M CPU @ 2.60GHz
cpu MHz		: 1995.545
model name	: Intel(R) Core(TM) i5-4300M CPU @ 2.60GHz
cpu MHz		: 1995.543

Let’s import some modules and check the versions:

import sys
import numba
import numpy as np
import matplotlib.pyplot as plt
plt.style.use('ggplot')
print("python version:", ''.join(sys.version.splitlines()))
print("numpy version:", np.__version__)
print("numba version:", numba.__version__)
python version: 3.6.4 |Anaconda custom (64-bit)| (default, Jan 16 2018, 18:10:19) [GCC 7.2.0]
numpy version: 1.14.0
numba version: 0.36.2
np.random.seed(42)

System under test

Here is the code from Wikipedia:

def sum1d(my_double_array):
    total = 0.0
    for i in range(my_double_array.shape[0]):
        total += my_double_array[i]
    return total

Numba’s jit is intended to be used as a decorator; the decorator syntax is just sugar for calling jit on the function, so a hypothetical sum1d_decorated like the one below (not used anywhere else in this post) would behave identically:
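@numba.jit(nopython=True)
def sum1d_decorated(my_double_array):
    # same body as sum1d above; the decorator wraps the function
    # in a numba dispatcher at definition time
    total = 0.0
    for i in range(my_double_array.shape[0]):
        total += my_double_array[i]
    return total

We’re going to call jit explicitly instead, to make sure we run the same code with and without jit: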

sum1d_jit = numba.jit(nopython=True)(sum1d)
sum1d_jit
CPUDispatcher(<function sum1d at 0x7fb4eee77158>)

We’ll also check if parallelization will make any difference for us:

sum1d_par = numba.jit(nopython=True, parallel=True)(sum1d)

Let’s run a quick test to make sure we’ve done everything correctly, and to force numba to compile a float64 version of our sum1d:

sample_data = np.random.randn(42)
sum1d(sample_data), sum1d_jit(sample_data), sum1d_par(sample_data), np.sum(sample_data)
(-7.835638675532193,
 -7.835638675532193,
 -7.835638675532193,
 -7.835638675532195)
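As a sanity check, the dispatcher keeps a record of every signature it has compiled, so we can confirm that the call above produced a float64 specialization:

# one entry per compiled specialization; the call above should have
# added a single signature taking a 1-d float64 array
sum1d_jit.signatures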

It’s interesting: the naive python version and the jitted version return exactly the same number, but numpy.sum apparently does the accumulation a bit differently. Floating-point addition is not associative, and numpy.sum uses a pairwise summation scheme, so a different grouping of the additions changes the last bits of the result.
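Here is a toy sketch of the pairwise idea – not numpy’s exact implementation, just the shape of it:

def pairwise(a):
    # recursively sum the two halves instead of accumulating left to
    # right; the different grouping of additions rounds differently
    n = len(a)
    if n == 1:
        return a[0]
    mid = n // 2
    return pairwise(a[:mid]) + pairwise(a[mid:])

pairwise(sample_data), np.sum(sample_data)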

Measurements

We’ll make a few runs for various array sizes – powers of four from 1 up to 4**12 = 16777216:

measures = 13
sizes = 4 ** np.arange(measures)

We’ll save the average runtime here:

res_numpy = np.zeros(measures)
res_python = np.zeros(measures)
res_numba = np.zeros(measures)
res_numba_par = np.zeros(measures)

It’s time to measure all the things! This will take a few minutes:

%%time
for idx, size in enumerate(sizes):
    print("size =", size)
    data = np.random.randn(size).astype('f8')
    res = %timeit -o np.sum(data)
    res_numpy[idx] = res.average
    res = %timeit -o sum1d(data)
    res_python[idx] = res.average
    res = %timeit -o sum1d_jit(data)
    res_numba[idx] = res.average
    res = %timeit -o sum1d_par(data)
    res_numba_par[idx] = res.average
del data
size = 1
5.53 µs ± 298 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
1.04 µs ± 93.7 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
448 ns ± 8.96 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
469 ns ± 19.5 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
size = 4
4.07 µs ± 182 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
1.6 µs ± 26.5 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
473 ns ± 6.14 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
453 ns ± 6.65 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
size = 16
4.17 µs ± 205 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
4.35 µs ± 172 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
526 ns ± 77.8 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
585 ns ± 26 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
size = 64
4.54 µs ± 418 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
16.6 µs ± 2.29 µs per loop (mean ± std. dev. of 7 runs, 100000 loops each)
678 ns ± 46.8 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
676 ns ± 22.4 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
size = 256
5.89 µs ± 401 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
75.3 µs ± 1.61 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
1.02 µs ± 35.7 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
958 ns ± 14.4 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
size = 1024
6.33 µs ± 228 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
319 µs ± 32.8 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
2.28 µs ± 169 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
2.21 µs ± 118 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
size = 4096
8.31 µs ± 158 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
1.2 ms ± 49.9 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
7.06 µs ± 59.2 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
7.02 µs ± 23.4 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
size = 16384
18.2 µs ± 323 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
4.7 ms ± 149 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
26.7 µs ± 363 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
26.5 µs ± 136 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
size = 65536
52.9 µs ± 2.09 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
19.7 ms ± 670 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
102 µs ± 950 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
101 µs ± 144 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
size = 262144
176 µs ± 3.17 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
77.9 ms ± 6.35 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
440 µs ± 4.42 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
457 µs ± 15.9 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
size = 1048576
948 µs ± 36.6 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
320 ms ± 18.3 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
1.89 ms ± 52.7 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
1.86 ms ± 64.7 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
size = 4194304
3.51 ms ± 67.8 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
1.22 s ± 27.5 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
7.31 ms ± 74.7 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
7.32 ms ± 77.5 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
size = 16777216
14.8 ms ± 583 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
4.87 s ± 116 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
29.2 ms ± 88.9 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
29.7 ms ± 349 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
CPU times: user 6min 4s, sys: 267 ms, total: 6min 5s
Wall time: 6min 5s

Looking at the data

First, let’s look at various speedups and slowdowns:

numba_vs_naive = res_python / res_numba
numba_vs_naive.max()
192.3750032906321

Well, it’s not quite the 200 times that Wikipedia claimed, but it’s pretty close.

numba_vs_numpy = res_numba / res_numpy
numba_vs_numpy.min(), numba_vs_numpy.max()
(0.08103373848130296, 2.505059846262734)

Numba can be 2.5 times slower than numpy, but it can also be faster. Let’s look at the graphs below.

numba_vs_par = res_numba / res_numba_par
numba_vs_par.min(), numba_vs_par.max()
(0.8986877079731856, 1.0664777926145088)

Apparently, numba cannot parallelize our function as written: parallel=True auto-parallelizes array expressions and explicit numba.prange loops, but our plain range loop stays serial, so the extra flag buys us nothing here.
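For the record, here is a sketch of a variant that numba should be able to parallelize, using numba.prange – I haven’t benchmarked it here, so take it as an illustration rather than a recommendation:

def sum1d_prange(my_double_array):
    total = 0.0
    # prange marks the loop as parallelizable; numba recognizes the +=
    # accumulation as a reduction and combines per-thread partial sums
    for i in numba.prange(my_double_array.shape[0]):
        total += my_double_array[i]
    return total

sum1d_prange_jit = numba.jit(nopython=True, parallel=True)(sum1d_prange)

Let’s look at how the speedups/slowdowns change with the array size: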

fig, (ax1, ax2, ax3) = plt.subplots(1,3)

fig.set_size_inches(16, 5)

ax1.set_title('numba vs numpy')
ax1.set_xlabel('array size')
ax1.set_ylabel('slowdown')
ax1.semilogx(sizes, numba_vs_numpy, basex=2);


ax2.set_title('numba vs naive python')
ax2.set_xlabel('array size')
ax2.set_ylabel('speedup')
ax2.semilogx(sizes, numba_vs_naive, basex=2, color='b');

ax3.set_title('effect of parallel=True')
ax3.set_xlabel('array size')
ax3.set_ylabel('speedup')
ax3.semilogx(sizes, numba_vs_par, basex=2, color='g');

[figure: slowdown/speedup vs array size – numba vs numpy, numba vs naive python, and the effect of parallel=True]

plt.figure().set_size_inches(12, 10)
plt.loglog(sizes, res_numba, label='numba', basex=2)
plt.loglog(sizes, res_numba_par, label='numba parallel', basex=2)
plt.loglog(sizes, res_numpy, label='numpy', basex=2)
plt.loglog(sizes, res_python, label='naive python', basex=2)
plt.xlabel("array size")
plt.ylabel("runtime, seconds");
plt.legend();

[figure: runtime in seconds vs array size on a log-log scale, for numba, numba parallel, numpy, and naive python]

The jitted version is more than 5 times faster on smaller arrays, but for larger arrays numpy becomes ~2.5 times faster than numba; they show similar performance at array sizes around 4096.
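To put rough numbers on that crossover, reusing the result arrays from above (a ratio above 1 means numba is faster):

# numpy time / numba time at a few sizes around the crossover
for s in (1024, 4096, 16384):
    i = list(sizes).index(s)
    print(s, res_numpy[i] / res_numba[i])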

Conclusions

I consider both claims of the Wikipedia article confirmed: numba brings a tremendous speedup to naive python code at the cost of just one decorator (and one huge additional dependency – numba itself ^_-); and for arrays of quite reasonable size numba can be much faster than numpy – e.g. for size 1024 it is ~2.8 times faster!

This allows me to conclude that numba is awesome. Let’s see how it behaves on some more real-life projects of mine.

You can download the full notebook here.