Towards Out-of-core ND-Arrays -- Frontend
This work is supported by Continuum Analytics and the XDATA Program as part of the Blaze Project.
tl;dr: Blaze adds usability to our last post on out-of-core ND-Arrays.
Disclaimer: This post is about experimental, buggy code. It is not ready for public use.
Setup
This follows my last post designing a simple task scheduler for use with out-of-core (or distributed) nd-arrays. We encoded tasks-with-data-dependencies as simple dictionaries. We then built functions to create dictionaries that describe blocked array operations. We found that this was an effective-but-unfriendly way to solve some important-but-cumbersome problems.
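To make the "tasks as dictionaries" idea concrete, here is a minimal sketch of such a dictionary and a toy recursive evaluator. The function name get and the keys "x", "y", and "z" are illustrative assumptions, not the scheduler from the earlier post.

```python
from operator import add, mul

def get(dsk, key):
    """Evaluate `key` in the task dictionary `dsk` (toy, sequential, no caching)."""
    task = dsk[key]
    # A task is a tuple whose first element is a callable; anything else is data.
    if isinstance(task, tuple) and task and callable(task[0]):
        func, args = task[0], task[1:]
        # String arguments that name other keys are computed first; the rest are literals.
        resolved = [get(dsk, a) if isinstance(a, str) and a in dsk else a for a in args]
        return func(*resolved)
    return task

# Hypothetical graph: the data dependencies are encoded by key names.
dsk = {
    "x": 1,
    "y": (add, "x", 10),   # y = x + 10
    "z": (mul, "y", 2),    # z = y * 2
}

result = get(dsk, "z")
```

The dictionary says nothing about how or where it is executed, which is what makes it a convenient exchange format between the code that builds it and the code that runs it.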
This post sugars the programming experience with the blaze and into libraries to give a NumPy-like experience out-of-core.
Old low-level code
Here is the code we wrote for an out-of-core transpose/dot-product (actually a symmetric rank-k update).
Create random array on disk
Define computation A.T * A
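As a rough, dependency-free illustration of what a hand-built definition of this kind involves, the sketch below writes a blocked A.T * A as a task dictionary. It is not the code from the earlier post: nested lists stand in for on-disk NumPy blocks, and get, tdot, madd, assemble, and the key names ("A", "partial", "AtA") are all hypothetical.

```python
def transpose(m):
    return [list(row) for row in zip(*m)]

def matmul(a, b):
    bt = transpose(b)
    return [[sum(x * y for x, y in zip(row, col)) for col in bt] for row in a]

def tdot(a, b):
    """Compute a.T * b for two blocks."""
    return matmul(transpose(a), b)

def madd(*ms):
    """Elementwise sum of equally shaped blocks."""
    return [[sum(vals) for vals in zip(*rows)] for rows in zip(*ms)]

def get(dsk, key):
    """Toy sequential evaluator: tuple arguments that are keys get computed first."""
    task = dsk[key]
    if isinstance(task, tuple) and task and callable(task[0]):
        args = [get(dsk, a) if isinstance(a, tuple) and a in dsk else a
                for a in task[1:]]
        return task[0](*args)
    return task

A = [[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12], [13, 14, 15, 16]]
n, size = 2, 2  # a 2x2 grid of 2x2 blocks

dsk = {}
for i in range(n):
    for j in range(n):
        # Leaf tasks: block (i, j) of A
        dsk[("A", i, j)] = [row[size * j:size * (j + 1)]
                            for row in A[size * i:size * (i + 1)]]
for i in range(n):
    for j in range(n):
        # (A.T * A)[i, j] = sum over k of A[k, i].T * A[k, j]
        for k in range(n):
            dsk[("partial", i, j, k)] = (tdot, ("A", k, i), ("A", k, j))
        dsk[("AtA", i, j)] = (madd,) + tuple(("partial", i, j, k) for k in range(n))

def assemble(dsk, name, n):
    """Compute every output block and stitch the blocks back into one matrix."""
    rows = []
    for i in range(n):
        blocks = [get(dsk, (name, i, j)) for j in range(n)]
        rows.extend(sum(parts, []) for parts in zip(*blocks))
    return rows

result = assemble(dsk, "AtA", n)
```

Even at this size the bookkeeping of block indices dominates the code, which is the unfriendliness the rest of this post tries to hide.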
New pleasant-feeling code with Blaze
Targeting users
The last section “Define computation” is written in a style that is great for library writers and automated systems but is challenging to users accustomed to Matlab/NumPy or R/Pandas style.
We wrap this process with Blaze, an extensible front-end for analytic computations.
Redefine computation A.T * A with Blaze
Under the hood
Under the hood, Blaze creates the same dask dicts we created by hand last time. I’ve doctored the result rendered here to include suggestive names.
We then compute this sequentially on a single core. However, we could have passed this on to a distributed system. This result contains all necessary information to go from on-disk arrays to a computed result in whatever manner you choose.
Separating Backend from Frontend
Recall that Blaze is an extensible front-end to data analytics technologies. It lets us wrap messy computational APIs with a pleasant and familiar user-centric API. Extending Blaze to dask dicts was the straightforward work of an afternoon. This separation allows us to continue to build out dask-oriented solutions without worrying about user-interface. By separating backend work from frontend work we allow both sides to be cleaner and to progress more swiftly.
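The separation described above can be sketched with a toy front end: a small expression class with operator overloading that is compiled into a task dictionary, which a separate evaluator then runs. This is only an illustration of the pattern under simplifying assumptions (scalars instead of arrays); Expr, to_dask, and get are hypothetical names and do not reflect how Blaze is implemented.

```python
import itertools
from operator import add, mul

class Expr:
    """Front end: records what the user asked for, computes nothing."""
    _counter = itertools.count()

    def __init__(self, func=None, args=(), name=None):
        self.func, self.args = func, args
        self.name = name or "tmp-%d" % next(Expr._counter)

    def __add__(self, other):
        return Expr(add, (self, other))

    def __mul__(self, other):
        return Expr(mul, (self, other))

def to_dask(expr, data, dsk=None):
    """Compiler: walk the expression tree and emit one task per node."""
    dsk = {} if dsk is None else dsk
    if expr.func is None:
        # Leaf symbol: bind it to concrete data
        dsk[expr.name] = data[expr.name]
    else:
        keys = []
        for arg in expr.args:
            if isinstance(arg, Expr):
                to_dask(arg, data, dsk)
                keys.append(arg.name)
            else:
                keys.append(arg)  # plain literal
        dsk[expr.name] = (expr.func,) + tuple(keys)
    return dsk

def get(dsk, key):
    """Back end: toy sequential evaluator for the task dictionary."""
    task = dsk[key]
    if isinstance(task, tuple) and task and callable(task[0]):
        args = [get(dsk, a) if isinstance(a, str) and a in dsk else a
                for a in task[1:]]
        return task[0](*args)
    return task

x = Expr(name="x")
expr = (x + 10) * 2             # user-facing syntax
dsk = to_dask(expr, {"x": 1})   # plain dictionary handed to the back end
result = get(dsk, expr.name)
```

Because the front end only ever produces a dictionary, get could be swapped for a threaded or distributed evaluator without touching Expr.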
Future work
I’m on vacation right now. Work for recent posts has been done in evenings while watching TV with the family. It isn’t particularly robust. Still, it’s exciting how effective this approach has been with relatively little effort.
Perhaps now would be a good time to mention that Continuum has ample grant funding. We’re looking for people who want to create usable large-scale data analytics tools. For what it’s worth, I quit my academic postdoc to work on this and couldn’t be happier with the switch.
Source
This code is experimental and buggy. I don’t expect it to stay around forever in its current form (it’ll improve). Still, if you’re reading this when it comes out then you might want to check out the following: