Parallelism and Serialization: how poor pickling breaks multiprocessing
tl;dr: Multiprocessing in Python is crippled by `pickle`'s poor function serialization. The more robust serialization package `dill` improves the situation. Dill-based solutions for both `multiprocessing` and `IPython.parallel` make distributed computing simple again.
To leverage the cores found in modern processors we need to communicate functions between different processes. That is, if we have some function in one process, then we need to communicate that function, and all of the functionality on which it depends, to our other worker processes.
To communicate this function we translate it down into a blob of text, ship that text over a wire, and then retranslate that text back into a fully operational function. This process, called serialization, is like the teleporters in Star Trek: it takes an important thing (a function or a crew member), translates it into something manageable (text or bits), moves it quickly to some other location, and then reassembles it correctly (we hope!). Just as accidents happen in Star Trek, it's easy for function serialization to go awry.
The standard serialization package in Python is `pickle`. It can serialize and deserialize most Python objects, not just functions.
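For example, a round trip through pickle reproduces an ordinary data structure:

```python
import pickle

data = {"name": "Alice", "scores": [1, 2, 3]}
blob = pickle.dumps(data)      # serialize to a blob of bytes
restored = pickle.loads(blob)  # deserialize back to an equal object
print(restored == data)        # True
```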
How does pickle go about serializing functions? Pickle specifies a function using its module name (e.g. `math`) and its function name (e.g. `sin`). Sadly this approach fails for
many cases. In particular, `pickle` fails to serialize the following:

- Lambdas
- Closures
- Some functions defined interactively
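A minimal sketch of the difference: a named standard-library function round-trips because pickle stores only its module and function name, while a lambda has no importable name and is rejected.

```python
import math
import pickle

# A module-level function pickles by reference (module name + function name)
payload = pickle.dumps(math.sin)
f = pickle.loads(payload)
print(f(0.0))  # 0.0

# A lambda has no importable name, so stdlib pickle rejects it
try:
    pickle.dumps(lambda x: x + 1)
except Exception as err:
    print(type(err).__name__)
```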
Most large projects use at least one (often all) of these features. This makes multiprocessing a pain.
We care about function serialization because we want to send one function to many processes in order to leverage parallelism. The standard way to do this is with the `multiprocessing` module, for example through a `multiprocessing.Pool`. Unfortunately `multiprocessing` serializes tasks with `pickle` and so inherits its limitations. Here it fails to serialize and broadcast a lambda.
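A minimal sketch of the failure, using `multiprocessing.Pool` (the exact exception type and message vary by Python version):

```python
from multiprocessing import Pool

if __name__ == "__main__":
    with Pool(2) as pool:
        try:
            pool.map(lambda x: x + 1, range(10))
        except Exception as err:
            # pickle cannot serialize the lambda for the workers
            print("broadcast failed:", type(err).__name__)
```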
I rarely see `multiprocessing` used in the wild. I suspect that this is because poor function serialization makes it a pain for all but the most trivial applications.

The `dill` library is a drop-in alternative to `pickle` that can robustly handle function serialization. As a result, most of the speed bumps of using `multiprocessing` should disappear.
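For instance, dill (a third-party package, installable with `pip install dill`) happily round-trips a lambda by serializing the function's actual code rather than a reference to its name:

```python
import dill  # third-party: pip install dill

f = lambda x: x + 1
payload = dill.dumps(f)  # serializes the code object itself, not just a name
g = dill.loads(payload)
print(g(10))  # 11
```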
Dill and Multiprocessing
The makers of `dill` apparently know this and so have developed their own fork of `multiprocessing` that uses dill. This fork resides in the `pathos` project.
Dill and IPython Parallel
You should know about IPython parallel.
The IPython notebook has gotten a lot of press recently. The notebook became possible after the project was restructured to separate computation and interaction. One important result is that we can now perform computation in a process while interacting in a web browser, giving rise to the ever-popular notebook.
This same computation-is-separate-from-interaction concept supports other innovations. In particular IPython parallel uses this to create a simple platform for both multiprocessing and distributed computing.
```
mrocklin@notebook:~$ ipcluster start --n=4
mrocklin@notebook:~$ ipython
```
Note that this system handles the lambda without failing. IPython performs some custom serializations on top of `pickle`. Unfortunately these customizations still don't cover all use cases. Fortunately IPython provides hooks to specify your preferred serialization technique. Thanks to a recent change, IPython views now provide a convenient `use_dill` method.
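Putting the pieces together, a sketch of such a session (assuming a cluster started with `ipcluster` as above and `dill` installed on the engines; in recent IPython releases this machinery lives in the separate `ipyparallel` package):

```python
from IPython.parallel import Client  # `from ipyparallel import Client` in modern releases

rc = Client()  # connect to the running cluster
view = rc[:]   # a direct view over all engines

# IPython's custom serialization already handles this lambda
print(view.map_sync(lambda x: x + 1, range(10)))

# For the cases it misses, switch the view's serializer over to dill
view.use_dill()
```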
A more explicit treatment of switching IPython’s serializers to dill can be found in this notebook.
My interest in multiprocessing and serialization was originally spurred by a talk by Ian Langmore.
The `dill` project is developed by Mike McKerns. Several people have pointed it out to me. Thanks to @mmckerns and @minrk for their recent interactions resolving issues related to this topic.