# Python Data Structures are Fast

**tl;dr: Our intuition that Python is slow is often incorrect. Data structure
bound Python computations are fast.**

You may also want to see the companion post, Introducing CyToolz.

## We think that Python is slow

Our intuition says that Python is slow:

```
>>> # Python speeds
>>> L = range(1000000)
>>> timeit sum(L)
timeit np.s100 loops, best of 3: 7.79 ms per loop
>>> # C speeds
>>> import numpy as np
>>> A = np.arange(1000000)
>>> timeit np.sum(A)
1000 loops, best of 3: 725 µs per loop
```

Numerical Python with lots of loops is much slower than the equivalent C or Java code. For this we use one of the numeric projects like NumPy, Cython, Theano, or Numba.

## But that only applies to normally cheap operations

This slowdown occurs for cheap operations for which the Python overhead is large relative to their cost in C. However for more complex operations, like data structure random access, this overhead is less important. Consider the relative difference between integer addition and dictionary assignment.

```
>>> x, y = 3, 3
>>> timeit x + y
10000000 loops, best of 3: 43.7 ns per loop
>>> d = {1: 1, 2: 2}
>>> timeit d[x] = y
10000000 loops, best of 3: 65.7 ns per loop
```

A Python dictionary assignment is about as fast as a Python add.

*Disclaimer: this benchmark gets a point across but is is very artificial,
micro-benchmarks like this are hard to do well.*

## Micro-Benchmark: Frequency Counting

*Warning: cherry-picked*

To really show off the speed of Python data structures lets count frequencies of strings. I.e. given a long list of strings

`>>> data = ['Alice', 'Bob', 'Charlie', 'Dan', 'Edith', 'Frank'] * 1000000`

We want to count the occurence of each name. In principle we would write a
little function like `frequencies`

```
def frequencies(seq):
""" Count the number of occurences of each element in seq """
d = dict()
for item in seq:
if item not in d:
d[item] = 1
else:
d[item] = d[item] + 1
return d
>>> frequencies(data)
{'Alice': 1000000,
'Bob': 1000000,
'Charlie': 1000000,
'Dan': 1000000,
'Edith': 1000000,
'Frank': 1000000}
```

This simple operation tests grouping reductions on non-numerical data. This represents an emerging class of problems that doesn’t fit our performance intuition from our history with numerics.

We compare the naive `frequencies`

function against the following equivalent implementations

- The standard library’s
`collections.Counter`

- PyToolz’ benchmarked and tuned
`frequencies`

operation - Pandas’
`Series.value_counts`

method - A naive implementation in Java, found here

We present the results from worst to best:

```
>>> timeit collections.Counter(data) 1.59 s # Standard Lib
>>> timeit frequencies(data) 805 ms # Naive Python
>>> timeit toolz.frequencies(data) 522 ms # Tuned Python
>>> series = Series(data)
>>> timeit series.value_counts() 286 ms # Pandas
```

```
$ java Frequencies 207 ms # Straight Java
```

Lets observe the following:

- The standard library
`collections.Counter`

performs surprisingly poorly. This is unfair because the`Counter`

object is more complex, providing more exotic functionality that we don’t use here. - The Pandas solution uses C code and C data structures to beat the Python solution, but not by a huge amount. This isn’t the 10x-100x speedup that we expect from numerical applications.
- The
`toolz.frequencies`

function improves on the standard Python solution and gets to within a factor of 2x of Pandas. The PyToolz development team has benchmarked and tuned several implementatations. I believe that this is the fastest solution available in Pure Python. - The compiled Java Solution
is generally fast but, as with the Pandas case it’s not
*that*much faster.

For data structure bound computations, like frequency counting, Python is generally fast enough for me. I’m willing to pay a 2x cost in order to gain access to Pure Python’s streaming data structures and low entry cost.

## CyToolz

Personally, I’m fine with fast Python speeds. Erik Welch on the other hand,
wanted unreasonably fast C speeds so he rewrote `toolz`

in Cython; he calls it
CyToolz. His results are pretty amazing.

```
>>> # import toolz
>>> import cytoolz
>>> timeit toolz.frequencies(data) 522 ms
>>> timeit series.value_counts() 286 ms
>>> timeit cytoolz.frequencies(data) 214 ms
```

```
$ java Frequencies 207 ms
```

CyToolz actually beats the Pandas solution (in this one particular benchmark.) Lets appreciate this for a moment.

Cython on raw Python data structures runs at Java speeds. We discuss CyToolz further in our next blog post

## Conclusion

We learn that data structure bound computations aren’t as slow in Python as we might think. Although we incur a small slowdown (2x-5x), probably due to Python method dispatching, this can be avoided through Cython. When using Cython, the use of Python data structures can match perofrmance we expect from compiled languages like Java.

blog comments powered by Disqus