Python Data Structures are Fast
tl;dr: Our intuition that Python is slow is often incorrect. Data structure bound Python computations are fast.
You may also want to see the companion post, Introducing CyToolz.
We think that Python is slow
Our intuition says that Python is slow:
Numerical Python with lots of loops is much slower than the equivalent C or Java code. For this we use one of the numeric projects like NumPy, Cython, Theano, or Numba.
But that only applies to normally cheap operations
This slowdown occurs for cheap operations for which the Python overhead is large relative to their cost in C. However for more complex operations, like data structure random access, this overhead is less important. Consider the relative difference between integer addition and dictionary assignment.
A Python dictionary assignment is about as fast as a Python add.
Disclaimer: this benchmark gets a point across but is is very artificial, micro-benchmarks like this are hard to do well.
Micro-Benchmark: Frequency Counting
Warning: cherry-picked
To really show off the speed of Python data structures lets count frequencies of strings. I.e. given a long list of strings
We want to count the occurence of each name. In principle we would write a
little function like frequencies
This simple operation tests grouping reductions on non-numerical data. This represents an emerging class of problems that doesn’t fit our performance intuition from our history with numerics.
We compare the naive frequencies
function against the following equivalent implementations
- The standard library’s
collections.Counter
- PyToolz’ benchmarked and tuned
frequencies
operation - Pandas’
Series.value_counts
method - A naive implementation in Java, found here
We present the results from worst to best:
$ java Frequencies 207 ms # Straight Java
Lets observe the following:
- The standard library
collections.Counter
performs surprisingly poorly. This is unfair because theCounter
object is more complex, providing more exotic functionality that we don’t use here. - The Pandas solution uses C code and C data structures to beat the Python solution, but not by a huge amount. This isn’t the 10x-100x speedup that we expect from numerical applications.
- The
toolz.frequencies
function improves on the standard Python solution and gets to within a factor of 2x of Pandas. The PyToolz development team has benchmarked and tuned several implementatations. I believe that this is the fastest solution available in Pure Python. - The compiled Java Solution is generally fast but, as with the Pandas case it’s not that much faster.
For data structure bound computations, like frequency counting, Python is generally fast enough for me. I’m willing to pay a 2x cost in order to gain access to Pure Python’s streaming data structures and low entry cost.
CyToolz
Personally, I’m fine with fast Python speeds. Erik Welch on the other hand,
wanted unreasonably fast C speeds so he rewrote toolz
in Cython; he calls it
CyToolz. His results are pretty amazing.
$ java Frequencies 207 ms
CyToolz actually beats the Pandas solution (in this one particular benchmark.) Lets appreciate this for a moment.
Cython on raw Python data structures runs at Java speeds. We discuss CyToolz further in our next blog post
Conclusion
We learn that data structure bound computations aren’t as slow in Python as we might think. Although we incur a small slowdown (2x-5x), probably due to Python method dispatching, this can be avoided through Cython. When using Cython, the use of Python data structures can match perofrmance we expect from compiled languages like Java.
blog comments powered by Disqus