Matthew Rocklin
NVIDIA
Work by Peter Entschev: blog.dask.org/2019/06/27/single-gpu-cupy-benchmarks
import numpy as np
import dask.array

rs = dask.array.random.RandomState()
x = rs.random((1000000, 1000), chunks=(10000, 1000))
u, s, v = np.linalg.svd(x)
u, s, v = dask.compute(u, s, v)
A complex algorithm built on the NumPy API
import numpy as np
import dask.array
import cupy

rs = dask.array.random.RandomState(RandomState=cupy.random.RandomState)
x = rs.random((1000000, 1000), chunks=(10000, 1000))
u, s, v = np.linalg.svd(x)
u, s, v = dask.compute(u, s, v)
Works today on GPU arrays too
>>> import pandas, cudf
>>> %time pandas.read_csv("nyc-taxi-2015-01.csv")
Wall time: 29.2s
>>> %time cudf.read_csv("nyc-taxi-2015-01.csv")
Wall time: 2.12s
$ du -h nyc-taxi-2015-01.csv
1.9G
>>> import umap, cuml
>>> %time umap.UMAP(n_neighbors=5, init="spectral").fit_transform(cpu_data)
Wall time: 1min 49s
>>> %time cuml.UMAP(n_neighbors=5, init="spectral").fit_transform(cpu_data)
Wall time: 19.5s
$ du -h my_data
400M
2019: NumPy, TensorFlow, PyTorch, CuPy, JAX, Sparse, Dask, ...
Fractured community
__iter__
Implement the __iter__
protocol to operate in for loops
class MyObject:
    def __iter__(self):
        ...

for x in MyObject():
    ...
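To make the protocol concrete, here is a minimal sketch; Countdown is an invented class, not from any library. Any object whose __iter__ returns an iterator works in a for loop.

# Hypothetical example: any object implementing __iter__ is iterable
class Countdown:
    def __init__(self, n):
        self.n = n

    def __iter__(self):
        # yield makes this method a generator, satisfying the protocol
        for i in range(self.n, 0, -1):
            yield i

for x in Countdown(3):
    print(x)   # prints 3, 2, 1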
__array__
Implement the __array__
protocol to convert to a NumPy array
import numpy as np
import matplotlib.pyplot as plt

x = np.array(...)
plt.plot(x)
__array__
Implement the __array__
protocol to convert to a NumPy array
import pandas
import matplotlib.pyplot as plt

df = pandas.read_csv('myfile.csv')
plt.plot(df.balance)
__array__
Implement the __array__
protocol to convert to a NumPy array
# pandas/core/series.py (simplified)
class Series:
    def __array__(self):
        return np.array(self._pointer_to_data)
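The same hook works for any container. A minimal runnable sketch, with an invented Balances class standing in for a real data structure:

import numpy as np

# Hypothetical container: np.asarray() calls __array__ to convert it
class Balances:
    def __init__(self, data):
        self._data = list(data)

    def __array__(self, dtype=None):
        return np.asarray(self._data, dtype=dtype)

np.asarray(Balances([10.0, 20.5, 31.0]))   # -> array([10. , 20.5, 31. ])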
__array__
Implement the __array__
protocol to convert to a NumPy array
import h5py
import numpy

x = numpy.ones(10000)
h5py.File('myfile.h5', 'w')['x'] = x
__array__
Implement the __array__
protocol to convert to a NumPy array
import h5py
import pandas

df = pandas.read_csv('myfile.csv')
h5py.File('myfile.h5', 'w')['x'] = df.balance  # <-- mediates two non-NumPy libraries
__array__
Implement the __array__
protocol to convert to a NumPy array
import h5py
import dask.dataframe

df = dask.dataframe.read_csv('myfile.csv')
h5py.File('myfile.h5', 'w')['x'] = df.balance  # <-- mediates two non-NumPy libraries
__array__
Implement the __array__
protocol to convert to a NumPy array
import h5py
import dask.array

x = dask.array.random.random(1000000)
h5py.File('myfile.h5', 'w')['x'] = x
__array__
Implement the __array__
protocol to convert to a NumPy array
import h5py, zarr

x = zarr.create(...)
h5py.File('myfile.h5', 'w')['x'] = x
__array__
Implement the __array__
protocol to convert to a NumPy array
import h5py, zarr

x = h5py.File('myfile.h5', 'r')['x']
zarr.open_group(...)['x'] = x
fit/predict
Implement fit/transform/predict to work with Scikit-Learn
from sklearn.pipeline import Pipeline
from sklearn.cluster import DBSCAN

pipeline = Pipeline(..., DBSCAN(), ...)
pipeline.fit(X, y)
fit/predict
Implement fit/transform/predict to work with Scikit-Learn
from sklearn.pipeline import Pipeline
from umap import UMAP

pipeline = Pipeline(..., UMAP(), ...)
pipeline.fit(X, y)
fit/predict
Implement fit/transform/predict to work with Scikit-Learn
from sklearn.pipeline import Pipeline
from cuml import UMAP

pipeline = Pipeline(..., UMAP(), ...)
pipeline.fit(X, y)
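For completeness, a rough sketch of what implementing that contract looks like; Standardize is an invented transformer, not part of scikit-learn. Any object with fit and transform drops into a Pipeline.

import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.cluster import DBSCAN
from sklearn.pipeline import Pipeline

class Standardize(BaseEstimator, TransformerMixin):
    # Hypothetical transformer: fit learns statistics, transform applies them
    def fit(self, X, y=None):
        X = np.asarray(X)
        self.mean_ = X.mean(axis=0)
        self.std_ = X.std(axis=0)
        return self

    def transform(self, X):
        return (np.asarray(X) - self.mean_) / self.std_

pipeline = Pipeline([("scale", Standardize()), ("cluster", DBSCAN())])
pipeline.fit(np.random.random((100, 5)))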
.ipynb
__array_function__
Implement __array_function__
to use NumPy functions
import numpy as np

x = np.random.random((10000, 10000))
u, s, v = np.linalg.svd(x)
__array_function__
Implement __array_function__
to use NumPy functions
import numpy as np
import dask.array

x = dask.array.random.random((10000, 10000))
u, s, v = np.linalg.svd(x)
__array_function__
Implement __array_function__
to use NumPy functions
import numpy as np
import cupy

x = cupy.random.random((10000, 10000))
u, s, v = np.linalg.svd(x)
__array_function__
Implement __array_function__
to use NumPy functions
import cupy, xarray

x = cupy.random.random((10000, 10000))
d = xarray.DataArray(x)
For example, an OpenCL NumPy implementation could gain traction quickly
__array_function__
Implement __array_function__
to use NumPy functions
import numpy as np
import clpy

x = clpy.random.random((10000, 10000))
u, s, v = np.linalg.svd(x)
__array_function__
Implement __array_function__
to use NumPy functions
import clpy, xarray

x = clpy.random.random((10000, 10000))
d = xarray.DataArray(x)
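From the library author's side, the protocol is one method plus a dispatch table. A minimal sketch, loosely following the DiagonalArray example from the NumPy documentation (NEP 18); it assumes a NumPy recent enough to have __array_function__ dispatch enabled.

import numpy as np

HANDLED_FUNCTIONS = {}

class DiagonalArray:
    # Hypothetical array that stores an n-by-n diagonal matrix as one scalar
    def __init__(self, n, value):
        self._n, self._value = n, value

    def __array__(self, dtype=None):
        return np.eye(self._n, dtype=dtype) * self._value

    def __array_function__(self, func, types, args, kwargs):
        if func not in HANDLED_FUNCTIONS:
            return NotImplemented   # NumPy then raises TypeError as usual
        return HANDLED_FUNCTIONS[func](*args, **kwargs)

def implements(np_function):
    # Register an override for a NumPy function
    def decorator(func):
        HANDLED_FUNCTIONS[np_function] = func
        return func
    return decorator

@implements(np.sum)
def diagonal_sum(arr):
    return arr._n * arr._value

np.sum(DiagonalArray(5, 2.0))   # dispatches to diagonal_sum -> 10.0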
"Let's just throw sparse arrays in Dask and see what happens"
"Let's just throw GPU dataframes in Dask and see what happens"
Works with either Pandas or RAPIDS cuDF for GPUs
PhD students need more opportunities to distract them from research
Pandas has many custom array-like internal types
>>> import pandas as pd
>>> pd.Series(["Healthy", "No change", "No change"]).astype("category")
0      Healthy
1    No change
2    No change
dtype: category
Categories (2, object): [Healthy, No change]
Externalizing that API enables external packages to be Pandas native
>>> import pandas as pd
>>> from cyberpandas import IPArray # <--- External library
>>> arr = IPArray([0, 1, 2, 3])
>>> pd.Series(arr)                 # <--- Native integration
0    0.0.0.0
1    0.0.0.1
2    0.0.0.2
3    0.0.0.3
dtype: ip                          # <--- Neat!
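The externalized API is pandas' ExtensionArray interface. A deliberately minimal sketch of it, with an invented DecibelDtype/DecibelArray pair; cyberpandas' real implementation is far more complete.

import numpy as np
import pandas as pd
from pandas.api.extensions import (
    ExtensionArray, ExtensionDtype, register_extension_dtype, take)

@register_extension_dtype
class DecibelDtype(ExtensionDtype):
    name = "decibel"              # appears as `dtype: decibel`
    type = float
    na_value = np.nan

    @classmethod
    def construct_array_type(cls):
        return DecibelArray

class DecibelArray(ExtensionArray):
    def __init__(self, values):
        self._data = np.asarray(values, dtype=float)

    # construction hooks pandas calls internally
    @classmethod
    def _from_sequence(cls, scalars, dtype=None, copy=False):
        return cls(scalars)

    @classmethod
    def _concat_same_type(cls, to_concat):
        return cls(np.concatenate([arr._data for arr in to_concat]))

    # minimal required interface
    @property
    def dtype(self):
        return DecibelDtype()

    @property
    def nbytes(self):
        return self._data.nbytes

    def __len__(self):
        return len(self._data)

    def __getitem__(self, item):
        result = self._data[item]
        return result if np.ndim(result) == 0 else type(self)(result)

    def isna(self):
        return np.isnan(self._data)

    def take(self, indices, allow_fill=False, fill_value=None):
        if allow_fill and fill_value is None:
            fill_value = self.dtype.na_value
        data = take(self._data, indices,
                    allow_fill=allow_fill, fill_value=fill_value)
        return type(self)(data)

    def copy(self):
        return type(self)(self._data.copy())

pd.Series(DecibelArray([3.0, 6.0, 9.0]))   # dtype: decibel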
What would previously have been a community fork is now native and integrated
>>> import pandas as pd
>>> from cyberpandas import IPArray
>>> arr = IPArray([0, 1, 2, 3])
>>> s = pd.Series(arr)
>>> s
0    0.0.0.0
1    0.0.0.1
2    0.0.0.2
3    0.0.0.3
dtype: ip
>>> import dask.dataframe
>>> ds = dask.dataframe.from_pandas(s, npartitions=2)
>>> ds
Dask Series Structure:
npartitions=2
0     ip
2    ...
3    ...
dtype: ip                          # <--- Downstream benefits
Dask Name: from_pandas, 2 tasks
Users: Experiment with new technologies, share experiences
Developers: Build on standards, not closed systems
Build new things!
Core Maintainers: Build extension points