A Weekend with Asyncio

This work is supported by Continuum Analytics and the XDATA Program as part of the Blaze Project

tl;dr: I learned asyncio and rewrote part of dask.distributed with it; this details my experience

asyncio

The asyncio library provides concurrent programming in the style of Go, Clojure’s core.async library, or more traditional libraries like Twisted. It offers a programming paradigm that lets many moving parts interact without involving separate threads. These parts explicitly yield control to each other and to a central authority (the event loop), then regain control as others yield to them. This lets one escape traps like race conditions on shared state, webs of callbacks, lost error reporting, and general confusion.
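
To make the "explicitly yield control" idea concrete, here is a minimal sketch. It assumes modern Python 3 async/await syntax rather than the Trollius style I actually used, and the names worker and log are made up for illustration.

```python
import asyncio

log = []  # records which coroutine ran at each step

async def worker(name, steps):
    for i in range(steps):
        log.append((name, i))
        # Explicitly hand control back to the event loop so that
        # other coroutines get a turn. No threads are involved.
        await asyncio.sleep(0)

async def main():
    # Run both coroutines concurrently on a single thread.
    await asyncio.gather(worker("a", 2), worker("b", 2))

asyncio.run(main())
print(log)
```

The two workers take turns at each await, so their log entries interleave even though everything runs in one thread.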

I’m not going to write too much about asyncio. Instead I’m going to briefly describe my problem, link to a solution, and then dive into good-and-bad points about using asyncio while they’re fresh in my mind.

Exercise

I won’t actually discuss the application much after this section; you can safely skip this.

I decided to rewrite the dask.distributed Worker using asyncio. This worker has to do the following:

Store local data in a dictionary (easy)
Perform computations on that data as requested by a remote connection (act as a server in a client-server relationship)
Collect data from other workers when we don’t have all of the necessary data for a computation locally (peer-to-peer)
Serve data to other workers who need our data for their own computations (peer-to-peer)

It’s a sort of distributed RPC mechanism with peer-to-peer value sharing. Metadata for who-has-what data is stored in a central metadata store; this could be something like Redis.
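
The four responsibilities above might be sketched roughly as follows. This is not the code in my repository: it is an in-process toy, assuming modern async/await syntax, in which a hypothetical Worker class talks to its peers through direct method calls instead of sockets and a plain dict stands in for the central metadata store.

```python
import asyncio

class Worker:
    """Hypothetical in-process stand-in for a networked worker."""

    def __init__(self, name, who_has, peers):
        self.name = name
        self.data = {}          # local key -> value store
        self.who_has = who_has  # shared metadata: key -> worker name
        self.peers = peers      # worker name -> Worker (stands in for connections)
        peers[name] = self

    def put(self, key, value):
        # Store locally and record ownership in the metadata store.
        self.data[key] = value
        self.who_has[key] = self.name

    async def serve(self, key):
        # Serve data to a peer; a real worker would do this over the network.
        await asyncio.sleep(0)
        return self.data[key]

    async def collect(self, keys):
        # Fetch any keys we lack from whichever peers hold them, concurrently.
        missing = [k for k in keys if k not in self.data]
        values = await asyncio.gather(
            *(self.peers[self.who_has[k]].serve(k) for k in missing)
        )
        self.data.update(zip(missing, values))

    async def compute(self, func, keys, out_key):
        # Gather inputs, run the function, and store the result locally.
        await self.collect(keys)
        result = func(*(self.data[k] for k in keys))
        self.put(out_key, result)
        return result

async def demo():
    who_has, peers = {}, {}
    alice = Worker("alice", who_has, peers)
    bob = Worker("bob", who_has, peers)
    alice.put("x", 1)
    bob.put("y", 2)
    result = await alice.compute(lambda x, y: x + y, ["x", "y"], "z")
    return result, who_has

result, who_has = asyncio.run(demo())
```

Here alice holds x, fetches y from bob, and records that she now holds the result z.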

The current implementation of this is a nest of threads, queues, and callbacks. It’s not bad and performs well, but it tends to be hard for others to work on.

Additionally I want to separate the worker code because it’s useful outside of dask.distributed. Other distributed computation solutions exist in my head that rely on this technology.

For the moment the code lives here: https://github.com/mrocklin/dist. I like the design. The module-level docstring of worker.py is short and informative. But again, I’m not going to discuss the application yet; instead, here are some thoughts on learning/developing with asyncio.

General Thoughts

Disclaimer: I am a novice concurrent programmer. I write lots of parallel code but little concurrent code. I have never used existing frameworks like Twisted.

I liked the experience of using asyncio and recommend the paradigm to anyone building concurrent applications.

The Good:

I can write complex code that involves multiple asynchronous calls, complex logic, and exception handling all in a single place. Complex application logic is no longer spread in many places.
Debugging is much easier now that I can throw import pdb; pdb.set_trace() lines into my code and expect them to work (this fails when using threads).
My code fails more gracefully, further improving the debug experience. Ctrl-C works.
The paradigm shared by Go, Clojure’s core.async, and Python’s asyncio felt viscerally good. I was able to reason well about my program as I was building it, and I drew nice diagrams showing explicitly which sequential processes interacted with which others over which channels. I am much more confident in the correctness of both the implementation and the design of my program. However, having gone through this exercise, I suspect that I could now implement just about the same design without asyncio. The design paradigm was perhaps as important as the library itself.
I have to support Python 2. Fortunately I found the trollius port of asyncio to be very usable. It looks like it was a direct fork-then-modify of tulip.
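
As a small illustration of sequential processes talking over channels, with the error handling kept next to the logic it belongs to, here is a sketch. It assumes modern async/await syntax and uses asyncio.Queue as the channel; the producer and consumer names are made up for the example.

```python
import asyncio

async def producer(channel, items):
    for item in items:
        await channel.put(item)   # waits while the channel is full
    await channel.put(None)       # sentinel: no more items

async def consumer(channel, results):
    while True:
        item = await channel.get()  # waits until something arrives
        if item is None:
            break
        try:
            results.append(10 / item)
        except ZeroDivisionError:
            # The failure is handled right where the work happens,
            # not in a callback somewhere else.
            results.append("error")

async def main():
    channel = asyncio.Queue(maxsize=1)
    results = []
    await asyncio.gather(producer(channel, [1, 0, 5]), consumer(channel, results))
    return results

results = asyncio.run(main())
print(results)
```

Each coroutine reads top to bottom like ordinary sequential code, and the only interaction between the two is the queue.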

The Bad:

There wasn’t a ZeroMQ connectivity layer for Trollius (though aiozmq exists for Python 3), so I ended up having to use threads anyway for inter-node I/O. This, combined with ZeroMQ’s finicky behavior, meant that my program sometimes crashed hard. Because of this I’m considering switching to plain sockets, which are supported natively by both Trollius and asyncio.
While exceptions raise cleanly, I can’t determine where they originate. There are no line numbers or tracebacks. Debugging in a concurrent environment is hard; my experience was definitely better than with threads but could still be improved. I hope that asyncio in Python 3.4 has better debugging support.
The API documentation is thorough, but Stack Overflow answers, general best practices, and examples are very sparse. The project is new so there isn’t much to go on. I found that reading documentation for Go and watching presentations on Clojure’s core.async were far more helpful in preparing me to use asyncio than any of the asyncio docs or presentations.
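
For reference, one common way to mix blocking I/O threads with an event loop is to push the blocking call into a thread pool and await the result. This is a sketch of that pattern in modern asyncio, not what my code does; blocking_recv is a hypothetical stand-in for a blocking socket call.

```python
import asyncio
import time

def blocking_recv(message):
    # Hypothetical stand-in for a blocking network call.
    time.sleep(0.05)
    return message.upper()

async def main():
    loop = asyncio.get_running_loop()
    # Passing None selects the loop's default thread pool executor.
    # Each call runs in a worker thread while the event loop stays free.
    futures = [
        loop.run_in_executor(None, blocking_recv, m) for m in ["ping", "pong"]
    ]
    return await asyncio.gather(*futures)

replies = asyncio.run(main())
print(replies)
```

The threads are still there, but they are confined to the blocking calls, and the rest of the program sees only awaitable results.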

Future

I intend to pursue this further and, if the debugging experience is better in Python 3, am considering rewriting the dask.distributed Scheduler in Python 3 with asyncio proper. This is possible because the Scheduler doesn’t have to be compatible with user code.

I found these videos to be useful: