This work is supported by Anaconda Inc.

To increase transparency, I’m trying to blog more often about the current work going on around Dask and related projects. Nothing here is ready for production. This blogpost is written in haste, so refined polish should not be expected.

Current efforts for June 2018 in Dask and Dask-related projects include the following:

  1. Yarn Deployment
  2. More examples for machine learning
  3. Incremental machine learning
  4. HPC Deployment configuration

Yarn deployment

Dask developers often get asked “How do I deploy Dask on my Hadoop/Spark/Hive cluster?” We haven’t had a very good answer until recently.

Most Hadoop/Spark/Hive clusters are actually Yarn clusters. Yarn is the cluster manager behind most clusters that run Hadoop/Spark/Hive jobs, including any cluster purchased from a vendor like Cloudera or Hortonworks. If your application can run on Yarn then it can be a first-class citizen on these clusters.

Unfortunately Yarn has really only been accessible through a Java API, and so has been difficult for Dask to interact with. That’s changing now with a few projects, including:

  • dask-yarn: an easy way to launch Dask on Yarn clusters
  • skein: an easy way to launch generic services on Yarn clusters (this is primarily what backs dask-yarn)
  • conda-pack: an easy way to bundle a conda environment into a redeployable archive, which is useful when launching Python applications on Yarn

This work is all being done by Jim Crist, who is, I believe, currently writing up a blogpost on the topic at large. Dask-yarn was soft-released last week though, so people should give it a try and report feedback on the dask-yarn issue tracker. If you ever wanted direct help on your cluster, now is the right time: Jim is working on this actively and is not yet drowned in user requests, so he generally has a fair bit of time to investigate particular cases.
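
Below is a small sketch of the full workflow. The environment.tar.gz passed to YarnCluster is a packaged conda environment; assuming you already have a conda environment (named my_env here purely for illustration) with dask and dask-yarn installed, conda-pack can bundle it up roughly like this:

import conda_pack

# Pack the existing "my_env" conda environment into a relocatable archive
conda_pack.pack(name="my_env", output="environment.tar.gz")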

from dask_yarn import YarnCluster
from dask.distributed import Client

# Create a cluster where each worker has two cores and eight GB of memory
cluster = YarnCluster(environment='environment.tar.gz',
                      worker_vcores=2,
                      worker_memory="8GB")
# Scale out to ten such workers
cluster.scale(10)

# Connect to the cluster
client = Client(cluster)
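
Once connected, the client behaves like any other Dask client. A trivial, made-up computation, just to confirm that work runs on the Yarn workers:

import dask.array as da

# Build a blocked random array and compute its mean on the cluster
x = da.random.random((10000, 10000), chunks=(1000, 1000))
print(x.mean().compute())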

More examples for machine learning

Dask maintains a Binder of simple examples that show off various ways to use the project. This allows people to click a link on the web and quickly be taken to a Jupyter notebook running on the cloud. It’s a fun way to quickly experience and learn about a new project.

Previously we had a single example each for arrays, dataframes, delayed, machine learning, and so on.

Now Scott Sievert is expanding the examples within the machine learning section. He has submitted the following two so far:

  1. Incremental training with Scikit-Learn and large datasets
  2. Dask and XGBoost (sketched briefly below)

I believe he’s planning on more. If you use dask-ml and have recommendations or want to help, you might want to engage on the dask-ml or dask-examples issue trackers.
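
For a flavor of the second example, here is a minimal dask-xgboost sketch; the data and parameters are entirely made up and only illustrate the call signatures:

from dask.distributed import Client
import dask.array as da
import dask_xgboost

client = Client()  # or any existing Dask client

# Made-up random data, purely for illustration
X = da.random.random((100000, 20), chunks=(10000, 20))
y = (da.random.random(100000, chunks=10000) > 0.5).astype(int)

params = {'objective': 'binary:logistic', 'max_depth': 4, 'eta': 0.1}
bst = dask_xgboost.train(client, params, X, y)       # trained xgboost.Booster
predictions = dask_xgboost.predict(client, bst, X)   # Dask array of predictions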

Incremental training

The incremental training mentioned as an example above is also new-ish. This is a Scikit-Learn style meta-estimator that wraps around other estimators that support the partial_fit method. It enables training on large datasets in an incremental or batchwise fashion.

Before

from sklearn.linear_model import SGDClassifier

sgd = SGDClassifier(...)

import pandas as pd

# Train incrementally, one file's worth of data at a time
for filename in filenames:
    df = pd.read_csv(filename)
    X, y = ...

    sgd.partial_fit(X, y)

After

from sklearn.linear_model import SGDClassifier
from dask_ml.wrappers import Incremental

sgd = SGDClassifier(...)
inc = Incremental(sgd)

import dask.dataframe as dd

# Read all of the files as a single lazy Dask dataframe; Incremental
# calls partial_fit on each chunk of data for us
df = dd.read_csv(filenames)
X, y = ...
inc.fit(X, y)
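
After fitting, the wrapper can also be used for prediction; a minimal sketch, where X_test is a hypothetical held-out Dask array or dataframe:

# Predict with the fitted underlying estimator; with Dask inputs
# this can be applied chunk by chunk rather than all at once
predictions = inc.predict(X_test)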

Analysis

From a parallel computing perspective this is a very simple and un-sexy way of doing things. However, my understanding is that it’s also quite pragmatic. In a distributed context we leave a lot of possible computation on the table (the solution is inherently sequential), but it’s fun to see the model jump around the cluster as it absorbs various chunks of data and then moves on.

[Animation: incremental training with Dask-ML]

There’s ongoing work on how best to combine this with tools like pipelines and hyper-parameter searches in order to fill in that extra computation.

This work was primarily done by Tom Augspurger with help from Scott Sievert.

Dask User Stories

Dask developers are often asked “Who uses Dask?” This is a hard question to answer because, even though we’re inundated with thousands of requests for help from various companies and research groups, it’s never fully clear who minds having their information shared with others.

We’re now trying to crowdsource this information in a more explicit way by having users tell their own stories. Hopefully this helps other users in their field understand how Dask can help and when it might (or might not) be useful to them.

We originally collected this information in a Google Form but have since moved it to a GitHub repository. Eventually we’ll publish this as a proper website and include it in our documentation.

If you use Dask and want to share your story, this is a great way to contribute to the project. Arguably Dask needs more help with spreading the word than it does with technical solutions.

HPC Deployments

The Dask-jobqueue package for deploying Dask on traditional HPC machines is nearing another release. We’ve changed many of the parameters and configuration options in order to improve the onboarding experience for new users. This has gone very smoothly in recent engagements with new groups, but it will mean a breaking change for existing users of the sub-project.
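
As a rough sketch of what this looks like with the renamed parameters (the queue, resource sizes, and walltime below are made up, and details may still shift before the release):

from dask_jobqueue import PBSCluster
from dask.distributed import Client

# Each job asks the PBS scheduler for 24 cores and 100 GB of memory
cluster = PBSCluster(cores=24, memory='100GB', queue='regular', walltime='02:00:00')
cluster.scale(10)  # ask for ten workers

client = Client(cluster)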

