Dask Scaling Limits
This work is supported by Anaconda Inc.
For the first year of Dask’s life it focused exclusively on single node parallelism. We felt then that efficiently supporting 100+GB datasets on personal laptops or 1TB datasets on large workstations was a sweet spot for productivity, especially when avoiding the pain of deploying and configuring distributed systems. We still believe in the efficiency of single-node parallelism, but in the years since, Dask has extended itself to support larger distributed systems.
After that first year, Dask focused equally on both single-node and distributed parallelism. We maintain two entirely separate schedulers, one optimized for each case. This allows Dask to be very simple to use on single machines, but also scale up to thousand-node clusters and 100+TB datasets when needed with the same API.
Dask’s distributed system has a single central scheduler and many distributed workers. This is a common architecture today that scales out to a few thousand nodes. Roughly speaking Dask scales about the same as a system like Apache Spark, but less well than a high-performance system like MPI.
Most Dask examples in blogposts or talks are on modestly sized datasets, usually in the 10-50GB range. This, combined with Dask’s history with medium-data on single-nodes may have given people a more humble impression of Dask than is appropriate.
As a small nudge, here is an example using Dask to interact with 50 36-core nodes on an artificial terabyte dataset.
This is a common size for a typical modestly sized Dask cluster. We usually see Dask deployment sizes either in the tens of machines (usually with Hadoop style or ad-hoc enterprise clusters), or in the few-thousand range (usually with high performance computers or cloud deployments). We’re showing the modest case here just due to lack of resources. Everything in that example should work fine scaling out a couple extra orders of magnitude.
Challenges to Scaling Out
For the rest of the article we’ll talk about common causes that we see today that get in the way of scaling out. These are collected from experience working both with people in the open source community, as well as private contracts.
Simple Map-Reduce style
If you’re doing simple map-reduce style parallelism then things will be pretty smooth out to a large number of nodes. However, there are still some limitations to keep in mind:
The scheduler will have at least one, and possibly a few connections open to each worker. You’ll want to ensure that your machines can have many open file handles at once. Some Linux distributions cap this at 1024 by default, but it is easy to change.
The scheduler has an overhead of around 200 microseconds per task. So if each task takes one second then your scheduler can saturate 5000 cores, but if each task takes only 100ms then your scheduler can only saturate around 500 cores, and so on. Task duration imposes an inversely proportional constraint on scaling.
If you want to scale larger than this then your tasks will need to start doing more work in each task to avoid overhead. Often this involves moving inner for loops within tasks rather than spreading them out to many tasks.
More complex algorithms
If you’re doing more complex algorithms (which is common among Dask users) then many more things can break along the way. High performance computing isn’t about doing any one thing well, it’s about doing nothing badly. This section lists a few issues that arise for larger deployments:
Dask collection algorithms may be suboptimal.
The parallel algorithms in Dask-array/bag/dataframe/ml are pretty good, but as Dask scales out to larger clusters and its algorithms are used by more domains we invariably find that small corners of the API fail beyond a certain point. Luckily these are usually pretty easy to fix after they are reported.
The graph size may grow too large for the scheduler
The metadata describing your computation has to all fit on a single machine, the Dask scheduler. This metadata, the task graph, can grow big if you’re not careful. It’s nice to have a scheduler process with at least a few gigabytes of memory if you’re going to be processing million-node task graphs. A task takes up around 1kB of memory if you’re careful to avoid closing over any unnecessary local data.
The graph serialization time may become annoying for interactive use
Again, if you have million node task graphs you’re going to be serializaing them up and passing them from the client to the scheduler. This is fine, assuming they fit at both ends, but can take up some time and limit interactivity. If you press
computeand nothing shows up on the dashboard for a minute or two, this is what’s happening.
The interactive dashboard plots stop being as useful
Those beautiful plots on the dashboard were mostly designed for deployments with 1-100 nodes, but not 1000s. Seeing the start and stop time of every task of a million-task computation just isn’t something that our brains can fully understand.
This is something that we would like to improve. If anyone out there is interested in scalable performance diagnostics, please get involved.
Other components that you rely on, like distributed storage, may also start to break
Dask provides users more power than they’re accustomed to. It’s easy for them to accidentally clobber some other component of their systems, like distributed storage, a local database, the network, and so on, with too many requests.
Many of these systems provide abstractions that are very well tested and stable for normal single-machine use, but that quickly become brittle when you have a thousand machines acting on them with the full creativity of a novice user. Dask provies some primitives like distributed locks and queues to help control access to these resources, but it’s on the user to use them well and not break things.
Dask scales happily out to tens of nodes, like in the example above, or to thousands of nodes, which I’m not showing here simply due to lack of resources.
Dask provides this scalability while still maintaining the flexibility and freedom to build custom systems that has defined the project since it began. However, the combination of scalability and freedom makes it hard for Dask to fully protect users from breaking things. It’s much easier to protect users when you can constrain what they can do. When users stick to standard workflows like Dask dataframe or Dask array they’ll probably be ok, but when operating with full creativity at the thousand-node scale some expertise will invariably be necessary. We try hard to provide the diagnostics and tools necessary to investigate issues and control operation. The project is getting better at this every day, in large part due to some expert users out there.
A Call for Examples
Do you use Dask on more than one machine to do interesting work? We’d love to hear about it either in the comments below, or in this online form.
blog comments powered by Disqus