This work is supported by Continuum Analytics, the XDATA Program, and the Data Driven Discovery Initiative from the Moore Foundation.

To increase transparency I’m blogging weekly(ish) about the work done on Dask and related projects during the previous week. This log covers work done between 2017-02-01 and 2017-02-20. Nothing here is ready for production. This blogpost is written in haste, so refined polish should not be expected.

Themes of the last couple of weeks:

1. Profiling experiments with Dask-GLM
2. Subsequent graph optimizations, both non-linear fusion and avoiding repeatedly creating new graphs
3. TensorFlow and Keras experiments
4. XGBoost experiments
5. Rewrite of the Dask tutorial
6. Cleanup of the Dask + SKLearn project

### Profiling Dask-GLM

Dask-GLM is currently just a bunch of solvers, like Newton's method, gradient descent, BFGS, proximal gradient descent, and ADMM. These are useful for solving problems like logistic regression, among several others. The mathematical side of this work is mostly done by Chris White and Hussain Sultan at Capital One.

We’ve also been using this project to see how Dask can scale out machine learning algorithms. To this end we ran a few benchmarks here: https://github.com/dask/dask-glm/issues/26 . This just generates and solves some random problems, but at larger scales.

What we found is that some algorithms, like ADMM, perform beautifully, while for others, like gradient descent, scheduler overhead can become a substantial bottleneck at scale. This is mostly just because the actual in-memory NumPy operations are so fast; any sluggishness on Dask’s part becomes very apparent. Here is a profile of gradient descent:

Notice all the white space. This is Dask figuring out what to do during different iterations. We’re now working to bring this down to make all of the colored parts of this graph squeeze together better. This will result in general overhead improvements throughout the project.
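To make the overhead concrete, here is a toy least-squares gradient descent loop on a Dask array. This is an illustrative sketch, not dask-glm's actual solver; the point is that every iteration builds and schedules a brand-new task graph, which is where the scheduler overhead shows up:

```python
import numpy as np
import dask.array as da

# Random least-squares problem, chunked so Dask has parallel work to do
rng = np.random.RandomState(0)
X = da.from_array(rng.randn(1000, 5), chunks=(250, 5))
y = da.from_array(rng.randn(1000), chunks=(250,))

beta = np.zeros(5)
step = 1e-4
for _ in range(10):
    grad = X.T.dot(X.dot(beta) - y)  # a fresh task graph every iteration
    beta = beta - step * grad.compute()
```

Each `grad.compute()` call pays graph-construction and scheduling costs that are easy to amortize for big chunks but dominate when the per-chunk NumPy work is tiny.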

### Graph Optimizations - Aggressive Fusion

We’re approaching this in two ways:

1. More aggressively fuse tasks together so that there are fewer blocks for the scheduler to think about
2. Avoid repeated work when generating very similar graphs

x = f(w)
y = g(x)
z = h(y)


Dask (along with every other compiler-like project since the 1980s) already turns this into the following:

z = h(g(f(w)))

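In Dask terms, this linear fusion can be sketched directly on the graph dictionary. This is a minimal hand-fused example with placeholder functions `f`, `g`, and `h`, not Dask's optimization code itself:

```python
from dask.threaded import get

def f(w):
    return w + 1

def g(x):
    return x * 2

def h(y):
    return y - 3

# Unfused graph: three separate tasks for the scheduler to track
dsk = {
    'w': 5,
    'x': (f, 'w'),
    'y': (g, 'x'),
    'z': (h, 'y'),
}

# Fused graph: the linear chain collapses into one nested task
fused = {
    'w': 5,
    'z': (h, (g, (f, 'w'))),
}

assert get(dsk, 'z') == get(fused, 'z') == 9
```

Both graphs compute the same value; the fused one just gives the scheduler a single block of work instead of three.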

What’s tricky with a lot of these mathematical or optimization algorithms though is that they are mostly, but not entirely linear. Consider the following example:

y = exp(x) - 1/x


Visualized as a node-link diagram, this graph looks like a diamond:

         o     exp(x) - 1/x
        / \
       o   o   exp(x), 1/x
        \ /
         o     x


Graphs like this generally don’t get fused together because we could compute both exp(x) and 1/x in parallel. However when we’re bound by scheduling overhead and when we have plenty of parallel work to do, we’d prefer to fuse these into a single task, even though we lose some potential parallelism. There is a tradeoff here and we’d like to be able to exchange some parallelism (of which we have a lot) for less overhead.

The PR for this is dask/dask #1979, by Erik Welch (Erik has written and maintained most of Dask’s graph optimizations).

### Graph Optimizations - Structural Sharing

Additionally, we no longer make copies of graphs in dask.array. Every collection like a dask.array or dask.dataframe holds onto a Python dictionary holding all of the tasks that are needed to construct that array. When we perform an operation on a dask.array we get a new dask.array with a new dictionary pointing to a new graph. The new graph generally has all of the tasks of the old graph, plus a few more. As a result, we frequently made copies of the underlying task graph.

For example, adding one to an array yields a new graph containing all of the old tasks plus one more:

y = (x + 1)


Normally this doesn’t matter (copying graphs is usually cheap) but it can become very expensive for large arrays when you’re doing many mathematical operations.

Now we keep dask graphs in a custom mapping (dict-like object) that shares subgraphs with other arrays. As a result, we rarely make unnecessary copies and some algorithms incur far less overhead. Work done in dask/dask #1985.
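The standard-library `ChainMap` gives the flavor of this kind of sharing mapping; this is a sketch of the idea, not Dask's actual implementation:

```python
from collections import ChainMap

# A big existing task graph (a stand-in for a large dask.array's tasks)
old_graph = {('x', i): ('load_chunk', i) for i in range(100_000)}

# Copy-based approach: every new operation duplicates the whole dict
copied = dict(old_graph)
copied[('y', 0)] = ('add', ('x', 0), 1)

# Sharing-based approach: layer the few new tasks over the old mapping,
# referencing it instead of copying it -- O(1) per operation, not O(n)
shared = ChainMap({('y', 0): ('add', ('x', 0), 1)}, old_graph)

assert shared[('y', 0)] == copied[('y', 0)]
assert shared[('x', 42)] == ('load_chunk', 42)
```

Because the old graph is referenced rather than copied, building up a long chain of operations no longer costs time proportional to the size of the full graph at every step.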

### TensorFlow and Keras experiments

Two weeks ago I gave a talk with Stan Seibert (Numba developer) on Deep Learning (Stan’s bit) and Dask (my bit). As part of that talk I decided to launch TensorFlow from Dask and feed it from a distributed Dask array. See this blogpost for more information.

That experiment was nice in that it showed how easy it is to deploy and interact with other distributed services from Dask. However, from a deep learning perspective it was immature. Fortunately, it succeeded in attracting the attention of other potential developers (the true goal of all blogposts) and now Brett Naul is using Dask to manage his GPU workloads with Keras. Brett contributed code to help Dask move around Keras models. He seems to particularly value Dask’s ability to manage resources to help him fully saturate the GPUs on his workstation.

### XGBoost experiments

After deploying TensorFlow we asked what it would take to do the same for XGBoost, another very popular (though very different) machine learning library. The conversation for that is here: dmlc/xgboost #2032, with prototype code here: mrocklin/dask-xgboost. As with TensorFlow, the integration is relatively straightforward (if perhaps a bit simpler in this case). The challenge for me is that I have little concrete experience with the applications that these libraries were designed to solve. Feedback and collaboration from open source developers who use these libraries in production is welcome.

### Dask Tutorial Rewrite

The dask/dask-tutorial project on GitHub was originally written for PyData Seattle in July 2015 (roughly 19 months ago). Dask has evolved substantially since then, but this is still our only educational material. Fortunately, Martin Durant is doing a pretty serious rewrite, both correcting parts that no longer reflect the modern API and adding new material around distributed computing and debugging.