Introducing Blaze - Expressions interfaces for tabular data
This work is supported by Continuum Analytics and the XDATA Grant as part of the Blaze Project
tl;dr Blaze abstracts tabular computation, providing uniform access to a variety of database technologies
Introduction
NumPy and Pandas couple a high level interface with fast low-level computation. They allow us to manipulate data intuitively and efficiently.
Occasionally we run across a dataset that is too big to fit in our computer’s memory. In this case NumPy and Pandas don’t fit our needs and we look to other tools to manage and analyze our data. Popular choices include databases like Postgres and MongoDB, out-of-disk storage systems like PyTables and BColz and the menagerie of tools on top of the Hadoop File System (Hadoop, Spark, Impala and derivatives.)
Each of these systems has their own strengths and weaknesses and an experienced data analyst will choose the right tool for the problem at hand. Unfortunately learning how each system works and pushing data into the proper form often takes most of the data scientist’s time.
The startup costs of learning to munge and migrate data between new technologies often dominate biggish-data analytics.
Blaze strives to reduce this friction. Blaze provides a uniform interface to a variety of database technologies and abstractions for migrating data.
Expressions
At its core, Blaze is a way to express data and computations.
In the following example we build an abstract table for accounts in a
simple bank. We then describe a query, deadbeats
, to find the names of the
account holders with a negative balance.
from blaze import TableSymbol, compute
accounts = TableSymbol('accounts', '{id: int, name: string, amount: int}')
# The names of account holders with negative balance
deadbeats = accounts[accounts['amount'] < 0]['name']
Programmers familiar with Pandas should find the syntax to create deadbeats
familiar. Note that we haven’t actually done any work yet. The table
accounts
is purely imaginary and so the deadbeats
expression is just an
expression of intent. The Pandas-like syntax builds up a graph of operations
to perform later.
However, if we happen to have some similarly shaped data lying around
L = [[1, 'Alice', 100],
[2, 'Bob', -200],
[3, 'Charlie', 300],
[4, 'Dennis', 400],
[5, 'Edith', -500]]
We can combine our expression, deadbeats
with our data L
to compute an
actual result
>>> compute(deadbeats, L) # an iterator as a result
<itertools.imap at 0x7fdd7fcde450>
>>> list(compute(deadbeats, L))
['Bob', 'Edith']
So in its simplest incarnation, Blaze is a way to write down computations abstractly which can later be applied to real data.
Multiple Backends - Pandas
Fortunately the deadbeats
expression can run against many different kinds of
data. We just computed deadbeats
against Python lists, here we compute it
against a Pandas DataFrame
from pandas import DataFrame
df = DataFrame([[1, 'Alice', 100],
[2, 'Bob', -200],
[3, 'Charlie', 300],
[4, 'Dennis', 400],
[5, 'Edith', -500]],
columns=['id', 'name', 'amount'])
>>> compute(deadbeats, df)
1 Bob
4 Edith
Name: name, dtype: object
Note that Blaze didn’t perform the computation here, Pandas did (it’s good at that), Blaze just told Pandas what to do. Blaze doesn’t compute results; Blaze drives other systems to compute results.
Multiple Backends - MongoDB
To demonstrate some breadth, let’s show Blaze driving a Mongo Database.
$ # We install and run MongoDB locally
$ sudo apt-get install mongodb-server
$ mongod &
$ pip install pymongo
import pymongo
db = pymongo.MongoClient().db
db.mycollection.insert([{'id': 1, 'name': 'Alice', 'amount': 100},
{'id': 2, 'name': 'Bob', 'amount': -200},
{'id': 3, 'name': 'Charlie', 'amount': 300},
{'id': 4, 'name': 'Dennis', 'amount': 400},
{'id': 5, 'name': 'Edith', 'amount': -500}])
>>> compute(deadbeats, db.mycollection)
[u'Bob', u'Edith']
Recap
To remind you we created a single Blaze query
accounts = TableSymbol('accounts', '{id: int, name: string, amount: int}')
deadbeats = accounts[accounts['amount'] < 0]['name']
And then executed that same query against multiple backends
>>> list(compute(deadbeats, L)) # Python
['Bob', 'Edith']
>>> compute(deadbeats, df) # Pandas
1 Bob
4 Edith
Name: name, dtype: object
>>> compute(deadbeats, db.mycollection) # MongoDB
[u'Bob', u'Edith']
At the time of this writing Blaze supports the following backends
- Pure Python
- Pandas
- MongoDB
- SQL
- PySpark
- PyTables
- BColz
Interactivity
The separation of expressions and computation is core to Blaze. It’s also
confusing for new Blaze users.
NumPy and Pandas demonstrated the value of immediate data interaction and
having to explicitly call compute
is a step backward from that goal.
To this end we create the Table
abstraction, a TableSymbol
that knows about
a particular data resource. Operations on this Table
object produce abstract
expressions just like normal, but statements that would normally print results
to the screen initiate calls to compute
and then print those results, giving
an interactive feel in a console or notebook
>>> from blaze import Table
>>> t = Table(db.mycollection) # give MongoDB resource to Table
>>> t
amount id name
0 100 1 Alice
1 -200 2 Bob
2 300 3 Charlie
3 400 4 Dennis
4 -500 5 Edith
>>> t[t.amount < 0]
amount id name
0 -200 2 Bob
1 -500 5 Edith
>>> type(t[t.amount < 0]) # Still just an expression
blaze.expr.table.Selection
>>> print repr(t[t.amount < 0]) # repr triggers call to compute
amount id name
0 -200 2 Bob
1 -500 5 Edith
These expressions generate the appropriate MongoDB queries, call compute
only
when we print a result to the screen, and then push the result into a
DataFrame
to use Pandas’ excellent tabular printing. For large datasets we
always append a .head(10)
call to the expression to only retrieve the sample
of the output necessary to print to the screen; this avoids large data
transfers when not necessary.
Using the interactive Table
object we can interact with a variety of
computational backends with the familiarity of a local DataFrame.
More Information
- Documentation: blaze.pydata.org/
- Source: github.com/ContinuumIO/blaze/
-
Install with Anaconda:
conda install blaze
blog comments powered by Disqus