tl;dr: into efficiently migrates data between formats.
We spend a lot of time migrating data from common interchange formats, like
CSV, to efficient computation formats like an array, a database or binary
store. Worse, many don’t migrate data to efficient formats because they don’t
know how or can’t manage the particular migration process for their tools.
Your choice of data format is important. It strongly impacts performance (10x
is a good rule of thumb) and who can easily use and interpret your data.
When advocating for Blaze I often say
“Blaze can help you query your data in a variety of formats.” This assumes
that you can actually get your data into that format in the first place.
Enter the into project
The into function efficiently migrates data between formats.
These formats include both in-memory data structures like the following:
list, set, tuple, Iterator
numpy.ndarray, pandas.DataFrame, dynd.array
Streaming Sequences of any of the above
as well as persistent data living outside of Python like the following:
CSV, JSON, line-delimited-JSON
Remote versions of the above
HDF5 (both standard and Pandas formatting), BColz, SAS
SQL databases (anything supported by SQLAlchemy), Mongo
The into project migrates data between any pair of these formats efficiently
by using a network of pairwise conversions (visualized towards the bottom of
this post).
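The idea behind such a network can be sketched in a few lines: pairwise converters are edges in a graph, and a migration composes the converters along a path between two formats. This is a toy model with made-up names, not into's actual machinery:

```python
from collections import deque

# A toy conversion network: each edge is a pairwise converter.
# (Names and converters are illustrative; into's real graph is much larger.)
converters = {
    ("csv", "dataframe"): lambda d: f"df({d})",
    ("dataframe", "ndarray"): lambda d: f"arr({d})",
    ("ndarray", "list"): lambda d: f"list({d})",
}

def find_path(source, target):
    """Breadth-first search for a chain of pairwise converters."""
    queue = deque([(source, [])])
    seen = {source}
    while queue:
        fmt, path = queue.popleft()
        if fmt == target:
            return path
        for (a, b), func in converters.items():
            if a == fmt and b not in seen:
                seen.add(b)
                queue.append((b, path + [func]))
    raise ValueError(f"no conversion path from {source} to {target}")

def migrate(data, source, target):
    # Apply each converter along the discovered path in order.
    for func in find_path(source, target):
        data = func(data)
    return data

print(migrate("data.csv", "csv", "list"))
# → list(arr(df(data.csv)))
```

Because each edge can be an expert, format-specific routine, the composed path stays fast while the user-facing call remains a single function.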
How to use it
The into function takes two arguments, a source and a target. It moves data
in the source to the target. The source and target can each take the following
forms:

A type, like list or pd.DataFrame
A particular object, like a specific DataFrame or list
So the following would be valid calls to into
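For instance, into(list, (1, 2, 3)) builds a new list, while passing an existing object as the target appends the source's contents to it. A toy stand-in (illustrative only, not into's implementation; the real function also accepts string targets like filenames and database URIs) makes that dispatch concrete:

```python
def toy_into(target, source):
    """A minimal stand-in for into(target, source).

    - If target is a type, build a new object of that type from source.
    - If target is an existing object, append source's contents to it.
      (This toy only handles lists for the append case.)
    """
    if isinstance(target, type):
        return target(source)   # e.g. toy_into(list, (1, 2, 3)) -> [1, 2, 3]
    target.extend(source)       # fill an existing container in place
    return target

print(toy_into(list, (1, 2, 3)))   # → [1, 2, 3]
print(toy_into(set, [1, 1, 2]))    # → {1, 2}
```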
Note that into is a single function. We’re used to doing this with various
to_csv and from_sql methods scattered across various types. The into API is
very small. Here is what you need in order to get started:
Translate line-delimited JSON into a Pandas DataFrame
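Assuming into is installed (pip install into, then from into import into), that migration is a single call along the lines of into(pd.DataFrame, 'accounts.json'), with the filename hypothetical. To show concretely what such a migration produces, here is the same job done by hand with pandas on made-up sample data:

```python
import io
import pandas as pd

# Two records of line-delimited JSON: one JSON object per line.
# (The sample data here is made up.)
ldjson = io.StringIO(
    '{"name": "Alice", "balance": 100}\n'
    '{"name": "Bob", "balance": 200}\n'
)

# pandas reads this format directly; into wires loaders like this one
# into its conversion network so a single into() call picks it for you.
df = pd.read_json(ldjson, lines=True)
print(df)
```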
How does it work?
This is challenging. Robust and efficient conversion between any pair of
formats is fraught with special cases and bizarre libraries. The common
solution is to convert through a common format like a DataFrame, or streaming
in-memory lists, dicts, etc. (see dat)
or through a serialization format like
Thrift. These are excellent options and often
what you want. Sometimes, however, they can be slow, particularly when dealing
with live computational systems or with finicky storage solutions.
Consider, for example, migrating between a numpy.recarray and a
pandas.DataFrame. We can migrate this data very quickly in place. The bytes
of data don’t need to change, only the metadata surrounding them. We don’t
need to serialize to an interchange format or translate to intermediate
pure Python objects.
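A sketch of that fast path using the public pandas API (into's internal route may differ):

```python
import numpy as np
import pandas as pd

# A record array: typed columns backed by one contiguous buffer.
rec = np.array(
    [(1, 2.0), (3, 4.0)],
    dtype=[("id", "i8"), ("amount", "f8")],
).view(np.recarray)

# Building a DataFrame from it is a cheap, per-column operation on the
# existing typed data -- no text serialization, no Python objects.
df = pd.DataFrame.from_records(rec)

# And back: DataFrame -> record array.
rec2 = df.to_records(index=False)
print(rec2.dtype.names)   # → ('id', 'amount')
```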
Consider migrating data from a CSV file to a PostgreSQL database. Using
Python iterators through SQLAlchemy we rarely exceed migration speeds of
2,000 records per second, yet the direct CSV loader native to
PostgreSQL achieves speeds greater than 50,000 records per second. This
is the difference between an overnight job and a cup of coffee, but it
requires that we’re flexible enough to use special code in special situations.
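That special code in PostgreSQL's case is its COPY command. A hedged sketch follows: the table name and file name are hypothetical, and the psycopg2 calls are shown in comments rather than executed, since they need a live connection:

```python
# PostgreSQL's COPY loads a CSV in one streamed pass, bypassing
# per-row Python iteration. Table and file names are hypothetical.
table = "accounts"
copy_sql = f"COPY {table} FROM STDIN WITH (FORMAT csv, HEADER true)"

# With a live psycopg2 connection `conn`, the whole load is one call:
#   with open("accounts.csv") as f, conn.cursor() as cur:
#       cur.copy_expert(copy_sql, f)
#   conn.commit()
print(copy_sql)
# → COPY accounts FROM STDIN WITH (FORMAT csv, HEADER true)
```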
Expert pairwise interactions are often an order of magnitude faster than
generic solutions. Into is a network of these pairwise migrations. We
visualize that network below.