Dask

Dask Arrays Parallelize NumPy

Follow the NumPy API
Use NumPy under the hood
Blocked algorithms for sophisticated processing

import dask.array as da
x = da.random.random((10000, 10000), chunks=(1000, 1000))
y = x.dot(x.T + 1) - x.mean(axis=0)
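
As a minimal sketch of what "blocked" means here, the snippet below splits a small NumPy array into 2x2 chunks and evaluates the same expression as above. The array size and chunk shape are arbitrary choices made for illustration, and nothing is computed until `compute()` is called.

```python
import numpy as np
import dask.array as da

# Small illustrative input; the size and chunk shape are assumptions.
data = np.arange(16, dtype=float).reshape(4, 4)
x = da.from_array(data, chunks=(2, 2))  # split into four 2x2 blocks

# Same expression as the example above; this only builds a task graph.
y = x.dot(x.T + 1) - x.mean(axis=0)

result = y.compute()  # run the blocked computation
expected = data.dot(data.T + 1) - data.mean(axis=0)  # plain NumPy reference

print(x.numblocks, np.allclose(result, expected))
```

The blocked result should agree with plain NumPy; only the execution strategy differs.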

Process stacks of images

Stack many images into a single logical array
Re-align and communicate data between chunks
Used for satellite, biological, and medical imagery

import dask.array as da
from dask.array.image import imread
x = imread('data/2018-*-*.png')
x = x.rechunk((-1, 10, 10))  # one chunk along time: arrange into time-series
y = da.fft.fft(x, axis=0)

Query Slabs of Geospatial data

Read stacks of HDF5 or NetCDF data
Leverage powerful upstream libraries such as Xarray and Iris
Leverage HPC or Cloud hardware and data sources

import xarray
ds = xarray.open_mfdataset('model/2018-*-*.nc')
climatology = ds.groupby('time.month').mean(dim='time')
climatology.mean(dim=['latitude', 'longitude']).compute()

Design novel algorithms

Blocked linear algebra routines powered by BLAS/LAPACK
High performance dynamic scheduling
Efficient network transport

Z = da.tensordot(X, Y.T, axes=[(2, 0), (1, 3)])
q, r = da.linalg.qr(Z)
u, s, v = da.linalg.svd(Z)

Dask Dataframes Parallelize Pandas

Follow the Pandas API
Use Pandas under the hood
Partitioned index for efficient processing

import dask.dataframe as dd
df = dd.read_parquet('s3://bucket/*.parq')
df.groupby(df.name).amount.mean()

Access data from many formats and locations

Supports major formats like CSV, HDF, and Parquet
Reads from local, Hadoop, and cloud file systems
Easy to extend with anything Pandas supports

dd.read_parquet('s3://bucket/*.parquet')
dd.read_csv('hdfs://bucket/*.csv')
dd.read_hdf('/path/to/*.hdf', '/data/path')

Query with the Pandas API

Supports groupby-aggregations
Supports Joins
Set a sorted index for accelerated access

df.groupby(df.name).amount.mean()
dd.merge(accounts, customers, on='id')
df.set_index('time')

Intelligent timeseries support

Dataframe is partitioned along a sorted column
Dask knows where each piece of data lives
Random access and Pandas time series algorithms benefit

df = df.set_index('time')  # returns a new dataframe sorted by time
df.loc['2018-01']
df.rolling('7D').high.mean()  # fixed-width one-week window

Dask-ML Parallelizes Scikit-Learn

Follow the Scikit-Learn API
Supports existing workflows
And develops new scalable algorithms

from dask_ml.linear_model import LogisticRegression
est = LogisticRegression(...)
est.fit(X_train, y_train)  # features, labels

Parallelize SKLearn through Joblib

Parallelizes existing SKLearn code
Only change is to wrap with a context manager
Works for big computation on small data

rf = RandomForest(...)
pipe = Pipeline(..., rf, ...)
grid = GridSearchCV(pipe, ...)

with joblib.parallel_backend("dask"):
    grid.fit(X, y)

Train on large data with new algorithms

Use Dask arrays for scalability
Novel algorithm design
Consistent with scikit-learn API

from dask_ml.linear_model import LogisticRegression
est = LogisticRegression(...)
est.fit(X_train, y_train)  # features, labels

Coordinate with other distributed systems

Use Dask.dataframe for data access and cleaning
Automatically co-deploy other systems like XGBoost
Hand off data to train

from dask_ml.xgboost import XGBRegressor
est = XGBRegressor(...)
est.fit(X_train, y_train)  # features, labels

Dask enables custom algorithms

Support applications that aren't arrays or dataframes
Leverage task-based parallelism with data dependencies
Enables parallelization of complex custom applications

for x in X:
    for y in Y:
        if x < y:
            z = f(x, y)
        else:
            z = g(x, y)
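
One way to parallelize a loop like this is to wrap each call in `dask.delayed`, which records the call as a task instead of running it. The sketch below assumes placeholder definitions for `f`, `g`, `X`, and `Y` (they are invented for illustration and are not part of Dask) and adds a final `sum` so there is a single result to compute.

```python
import dask

# Placeholder functions and inputs, invented for this sketch.
def f(x, y):
    return x + y

def g(x, y):
    return x - y

X = [1, 2, 3]
Y = [2, 3]

tasks = []
for x in X:
    for y in Y:
        if x < y:
            z = dask.delayed(f)(x, y)  # recorded as a task, not run yet
        else:
            z = dask.delayed(g)(x, y)
        tasks.append(z)

total = dask.delayed(sum)(tasks)  # depends on every task above
value = total.compute()           # executes the whole graph
print(value)
```

The branching still happens in ordinary Python while the graph is built; only the calls to `f` and `g` are deferred.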

Complex Modeling

Wrap model code in simple decorators
Separates model complexity from parallelism
Easy for novices to build complex systems

@dask.delayed
def risk(person, occupation, situation):
    ...
a = risk(alice, ...)  # delayed execution
b = risk(bob, ...)
c = evaluate(a, b)  # chain dependencies

Real-time control

Submit and manage work during computation
Respond to real-world events
Prioritize important work for fast response

from dask.distributed import as_completed

futures = [client.submit(func, x) for x in L]
for future in as_completed(futures):  # yields futures as they finish
    if future.result() > 10:
        client.submit(binop, future, retries=1)  # retry once on failure
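
The same submit-and-react pattern can be tried without a cluster using the standard library's `concurrent.futures`, whose interface Dask's futures API resembles. This is only a stand-in sketch: the thread pool replaces the Dask client, and `square` and `follow_up` are hypothetical functions invented for illustration.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed

# Hypothetical work functions, invented for this sketch.
def square(x):
    return x * x

def follow_up(value):
    return value + 1

with ThreadPoolExecutor(max_workers=4) as pool:
    futures = [pool.submit(square, x) for x in range(6)]
    follow_ups = []
    for future in as_completed(futures):  # handle results as they arrive
        if future.result() > 10:
            # Submit new work in response to a result, mid-computation.
            follow_ups.append(pool.submit(follow_up, future.result()))
    results = sorted(f.result() for f in follow_ups)

print(results)
```

Results arrive in completion order rather than submission order, which is why the follow-up values are sorted before use.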

Integrate with Asynchronous Applications

Support Go-like concurrency model
Integrate seamlessly with web servers
Use Python-3 style async-await syntax

async def respond(request):
    future = client.submit(predict, request.body)  # submit work to cluster
    response = await future  # yield to the event loop until the result is ready
    return response
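
To experiment with this shape of handler without a cluster or web server, the sketch below uses only the standard library: a thread pool stands in for the Dask client, and `predict` is a hypothetical placeholder function. It illustrates awaiting offloaded work from a coroutine and is not how Dask itself is implemented.

```python
import asyncio
from concurrent.futures import ThreadPoolExecutor

# Hypothetical model function, invented for this sketch.
def predict(body):
    return body.upper()

async def respond(body, pool):
    loop = asyncio.get_running_loop()
    future = loop.run_in_executor(pool, predict, body)  # offload the work
    response = await future  # other coroutines can run while we wait
    return response

async def main():
    with ThreadPoolExecutor(max_workers=2) as pool:
        # Serve two "requests" concurrently; gather keeps input order.
        return await asyncio.gather(respond("a", pool), respond("b", pool))

answers = asyncio.run(main())
print(answers)
```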

Scales Up

Scales to 1000s of computers on cloud or HPC

Scales Down

Trivial to use on a laptop

Flexible

Enables sophisticated algorithms beyond traditional big data

Native

Plays nicely with native code and GPUs without the JVM

Ecosystem

Part of the broader community

Dask Started with NumPy

Dask began as a project to parallelize NumPy with multi-dimensional blocked algorithms. These algorithms are complex and proved challenging for existing parallel frameworks like Apache Spark or Hadoop, so we developed a lightweight task scheduler that was flexible enough to handle them. It wasn't as highly optimized for SQL-like queries, but it could do everything else.

From here it was easy to extend the solution to Python lists, Pandas, and other libraries whose algorithms were somewhat simpler.

Dask Grew to Support Custom Systems

As Dask was adopted by more groups it encountered more problems that did not fit the large array or dataframe programming models. The Dask task schedulers were of value even when the Dask arrays and dataframes were not.

Dask grew APIs like dask.delayed and futures that exposed the task scheduler without forcing big array and dataframe abstractions. This freedom to explore fine-grained task parallelism gave users the control to parallelize other libraries, and build custom distributed systems within their work.

Today

Today Dask is used because it scales Python comfortably, and because it affords users more flexibility, without sacrificing scale. We hope it serves you well.

conda install dask

pip install "dask[complete]"

git clone https://github.com/dask/dask
cd dask
pip install -e .

>>> from dask.distributed import Client
>>> client = Client()  # Creates a local cluster

$ dask-scheduler
Starting scheduler at tcp://localhost:8786

$ dask-worker tcp://localhost:8786  # point workers to scheduler
$ dask-worker tcp://localhost:8786
$ dask-worker tcp://localhost:8786

$ pip install knit

>>> from knit.dask_yarn import YarnCluster
>>> cluster = YarnCluster(...)

$ pip install daskernetes

>>> from daskernetes import KubeCluster
>>> cluster = KubeCluster.from_yaml('worker-template.yaml')

$ helm repo add dask https://dask.github.io/helm-chart
$ helm repo update
$ helm install dask/dask

$ mpirun --np 10 dask-mpi

Learn more

Why use Dask?

Getting Started

Install and deploy Dask on your laptop or cluster