Dask is one year old

This work is supported by Continuum Analytics and the XDATA Program as part of the Blaze Project

tl;dr: Dask turned one yesterday. We discuss success and failures.

Dask began one year ago yesterday with the following commit (with slight edits here for clarity’s sake).

def istask(x):
    return isinstance(x, tuple) and x and callable(x[0])


def get(d, key):
    v = d[key]
    if istask(v):
        func, args = v[0], v[1:]
        return func(*[get(d, arg) for arg in args])
    else:
        return v

 ... (and around 50 lines of tests)

this is a very inefficient scheduler

Since then dask has matured, expanded to new domains, gathered excellent developers, and spawned other open source projects. I thought it’d be a good time to look back on what worked, what didn’t, and what we should work on in the future.

Collections

Most users experience dask through the high-level collections of dask.array/bag/dataframe/imperative. Each of these evolve as projects of their own with different user groups and different levels of maturity.

dask.array

The parallel larger-than-memory array module dask.array has seen the most success of the dask components. It is the oldest, most mature, and most sophisticated subproject. Much of dask.array’s use comes from downstream projects, notably xray which seems to have taken off in climate science. Dask.array also sees a fair amount of use in imaging, genomics, and numerical algorithms research.

People that I don’t know now use dask.array to do scientific research. From my perspective that’s mission accomplished.

There are still tweaks to make to algorithms, particularly as we scale out to distributed systems (see far below).

dask.bag

Dask.bag started out as a weekend project and didn’t evolve much beyond that. Fortunately there wasn’t much to do and this submodule probably has the highest value/effort ratio .

Bag doesn’t get as much attention as its older sibling array though. It’s handy but not as well used and so not as robust.

dask.dataframe

Dataframe is an interesting case, it’s both pretty sophisticated, pretty mature, and yet also probably generates the most user frustration.

Dask.dataframe gains a lot of value by leveraging Pandas both under the hood (one dask DataFrame is many pandas DataFrames) and by copying its API (Pandas users can use dask.dataframe without learning a new API.) However, because dask.dataframe only implements a core subset of Pandas, users end up tripping up on the missing functionality.

This can be decomposed into to issues:

It’s not clear that there exists a core subset of Pandas that would handle most use cases. Users touch many diffuse parts of Pandas in a single workflow. What one user considers core another user considers fringe. It’s not clear how to agree on a sufficient subset to implement.
Once you implement this subset (and we’ve done our best) it’s hard to convey expectations to the user about what is and is not available.

That being said, dask.dataframe is pretty solid. It’s very fast, expressive, and handles common use cases well. It probably generates the most StackOverflow questions. This signals both confusion and active use.

Special thanks here go out to Jeff Reback, for making Pandas release the GIL and to Masaaki Horikoshi (@sinhrks) for greatly improving the maturity of dask.dataframe.

dask.imperative

Also known as dask.do this little backend remains one of the most powerful and one of the least used (outside of myself.) We should rethink the API here and improve learning materials.

General thoughts on collections

Warning: this section is pretty subjective

Big data collections are cool but perhaps less useful than people expect. Parallel applications are often more complex than can be easily described by a big array or a big dataframe. Many real-world parallel computations end up being more particular in their parallelism needs. That’s not to say that the array and dataframe abstractions aren’t central to parallel computing, it’s just that we should not restrict ourselves to them. The world is more complex.

However, it’s reasonable to break this “world is complex” rule within particular domains. NDArrays seem to work well in climate science. Specialized large dataframes like Dato’s SFrame seem to be effective for a particular class of machine learning algorithms. The SQL table is inarguably an effective abstraction in business intelligence. Large collections are useful in specific contexts, but they are perhaps the focus of too much attention. The big dataframe in particular is over-hyped.

Most of the really novel and impressive work I’ve seen with dask has been done either with custom graphs or with the dask.imperative API. I think we should consider APIs that enable users to more easily express custom algorithms.

Avoid Parallelism

When giving talks on parallelism I’ve started to give a brief “avoid parallelism” section. From the problems I see on stack overflow and from general interactions when people run into performance challenges their first solution seems to be to parallelize. This is sub-optimal. It’s often far cheaper to improve storage formats, use better algorithms, or use C/Numba accelerated code than it is to parallelize. Unfortunately storage formats and C aren’t as sexy as big data parallelism, so they’re not in the forefront of people’s minds. We should change this.

I’ll proudly buy a beer for anyone that helps to make storage formats a sexier topic.

Scheduling

Single Machine

The single machine dynamic task scheduler is very very solid. It has roughly two objectives:

Use all the cores of a machine
Choose tasks that allow the release of intermediate results

This is what allows us to quickly execute complex workflows in small space. This scheduler underlies all execution within dask. I’m very happy with it. I would like to find ways to expose it more broadly to other libraries. Suggestions are very welcome here.

We still run into cases where it doesn’t perform optimally (see issue 874), but so far we’ve always been able to enhance the scheduler whenever these cases arise.

Distributed Cluster

Over the last few months we’ve been working on another scheduler for distributed memory computation. It should be a nice extension to the existing dask collections out to “big data” systems. It’s experimental but usable now with documentation at the follow links:

http://distributed.readthedocs.org/en/latest/
https://github.com/blaze/distributed
http://matthewrocklin.com/distributed-by-example/

Feedback is welcome. I recommend waiting for a month or two if you prefer clean and reliable software. It will undergo a name-change to something less generic.