Introducing CyToolz Functional Python, now in C
tl;dr: We reimplement PyToolz, a functional standard library, in Cython. It’s fast.
This post highlights work done, and was partially written by, Erik N. Welch. When I say “we” below, I really mean “Erik”
Over the last year a number of excellent
benchmarked and tuned these functions to the point where they often beat the
standard library. When you couple these tuned functions with the power of pure
Python data structures you get a nice analytics platform. In my experience
toolz is often fast enough
for large streaming data projects.
Personally, I’m fine with fast Python speeds. Erik Welch on the other hand,
wanted unreasonably fast C speeds so he rewrote
toolz in Cython; he calls it
For data structure bound computations this approach competes with Java. Note that CyToolz accomplishes these speeds even on standard Python data structures. This differs from the traditional NumPy/Pandas approach of applying Cython code onto non-Pythonic C data structures.
Erik just released CyToolz yesterday. Get it while it’s hot
$ pip install cytoolz
Cython is most effective when leveraging C type information for tight inner loops or C data structures like numpy arrays, and speed improvements of 10-100x are common for these cases. We have found that by utilizing Cython and Python’s C API, significant improvements (typically 2-5x and sometimes much more) can be achieved when using pure Python data structures.
We actually don’t know precisely where most of the performance increases come from. Developing for performance was primarily done through trial and error and was driven by curiosity. Cython employs many optimizations, and the code compiles to a native C extension, which is generally faster than the Python interpreter. Still, we were able to improve upon the original Python code for nearly all functions. Here are some of our best guesses for how this was achieved:
- Python’s C API exposes functionality unavailable in the Python interpreter.
- Directly using the C API is faster in some cases even when Python has an equivalent function.
- The C API allows pointers to be used. In many cases this is a “borrowed reference”, and it avoids reference counting and the garbage collector.
- Some C API functions don’t check types or raise exceptions; these must be used with caution, but can be much faster.
- Checking types with
isinstancein Cython is really fast. So is checking whether a pointer is
- The overhead of calling C extension types developed in Cython is low. This, for instance, makes iterables really fast.
- Early binding–that is, pre-declaring variables with
cdef–usually improves performance.
We love NumPy and Pandas, so why do we use toolz? Two reasons
- Streaming analytics - Python’s iterators and Toolz support for lazy operations allows me to crunch over Pretty-Big-Data without the hassle of setting up a distributed machine.
- Trivial parallelism - The functional constructs in PyToolz, coupled with the promise of serialization, make parallelizing PyToolz applications to multicore or cluster computing trivial. See the toolz docs page on the subject.
CyToolz perfectly satisfies the PyToolz test suite. This is especially notable given that PyToolz has 100% coverage.
PyToolz is stable enough that we were able to just copy over the tests en masse. We’d like to find a cleaner way to share test suites between codebases though. Ideas and experiences welcome.
Many Toolz operations provide functional ways of doing plain old Python
pluck operation gets out elements from items in a collection.
In PyToolz we work hard to ensure that we’re not much slower than straight Python (this definitely requires work.)
But CyToolz just beats Python hands down.
A note on Functional Programming
PyToolz integrates functional principles into traditional Python programming. CyToolz supports these same functional principles, in the same workflow, but now backed by C speeds
I started PyToolz because I liked Clojure’s standard library but couldn’t stay on the JVM (I needed to interact with native code). Originally I thought of Python and PyToolz as a hack providing functional programming abstractions in an imperative language. I’ve now come to think of Python as a performant and serious functional language in its own right. While it lacks macros, monads, or any sort of type system, it is just fine for 99% of the pedestrian programming that I do every day.
The toolz functions are simple, fast, and a great way to compose clear and performant code. Check out the docs and find a function that you didn’t know you needed, or a function that you needed, wrote, but didn’t benchmark quite as heavily as we did.
If you’re already a savvy toolz user and want Cython speed then you’ll be happy to know that the cytoolz library is a drop in replacement for toolz.
$ pip install cytoolz
Most functions improve by 2x-5x with some fantastic exceptions.
blog comments powered by Disqus