ReIntroducing Into: Clean data migration (with graphs!)
This work is supported by Continuum Analytics and the XDATA Program as part of the Blaze Project
tl;dr: into efficiently migrates data between formats.
Motivation
We spend a lot of time migrating data from common interchange formats, like CSV, to efficient computation formats like an array, a database or binary store. Worse, many don’t migrate data to efficient formats because they don’t know how or can’t manage the particular migration process for their tools.
Your choice of data format is important. It strongly impacts performance (10x is a good rule of thumb) and who can easily use and interpret your data.
When advocating for Blaze I often say “Blaze can help you query your data in a variety of formats.” This assumes that you’re able to actually get it into that format.
Enter the into project
The into function efficiently migrates data between formats.
These formats include both in-memory data structures like the following:
list, set, tuple, Iterator
numpy.ndarray, pandas.DataFrame, dynd.array
Streaming Sequences of any of the above
as well as persistent data living outside of Python like the following:
CSV, JSON, line-delimited-JSON
Remote versions of the above
HDF5 (both standard and Pandas formatting), BColz, SAS
SQL databases (anything supported by SQLAlchemy), Mongo
The into project migrates data between any pair of these formats efficiently by using a network of pairwise conversions (visualized towards the bottom of this post).
How to use it
The into function takes two arguments, a source and a target. It moves data in the source to the target. The source and target can take the following forms:
Target | Source | Example
------ | ------ | -------
Object | Object | A particular DataFrame or list
String | String | 'file.csv', 'postgresql://hostname::tablename'
Type   |        | Like list or pd.DataFrame
So the following would be valid calls to into:
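For instance, calls along these lines (a sketch; df and the file name are placeholders rather than examples from the original post, while the connection string mirrors the one in the table above):

>>> from into import into
>>> import pandas as pd

>>> df = into(pd.DataFrame, 'accounts.csv')       # type target: build a new DataFrame
>>> into([], df)                                  # object target: append onto an existing list
>>> into('postgresql://hostname::tablename', df)  # string target: write to a destination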
Note that into is a single function. We’re used to doing this with various to_csv and from_sql methods on various types. The into API is very small. Here is what you need in order to get started:
$ pip install into
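Then, in a Python session (assuming the package installs under the name into):

>>> from into import into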
Examples
We now show some of those same examples in more depth.
Turn list into numpy array
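A minimal sketch of this first example, assuming into dispatches list → numpy.ndarray directly, as the format list above suggests:

>>> import numpy as np
>>> from into import into

>>> into(np.ndarray, [1, 2, 3])
array([1, 2, 3])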
Load CSV file into Python list
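Roughly the following, assuming an accounts.csv file with the id, name, and balance columns shown in the JSON below (the file name is a placeholder):

>>> into(list, 'accounts.csv')   # a list with one record per CSV row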
Translate CSV file into JSON
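That migration is a single call along these lines (accounts.csv again being a stand-in source file), after which the output file looks like the head output below:

>>> into('accounts.json', 'accounts.csv')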
$ head accounts.json
{"balance": 100, "id": 1, "name": "Alice"}
{"balance": 200, "id": 2, "name": "Bob"}
{"balance": 300, "id": 3, "name": "Charlie"}
{"balance": 400, "id": 4, "name": "Denis"}
{"balance": 500, "id": 5, "name": "Edith"}
Translate line-delimited JSON into a Pandas DataFrame
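A sketch of that call; whether the explicit json:// prefix is needed to flag the line-delimited format, or into can infer it from the file itself, is worth checking against the current documentation:

>>> import pandas as pd
>>> from into import into

>>> into(pd.DataFrame, 'json://accounts.json')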
How does it work?
This is challenging. Robust and efficient conversion between any pair of formats is fraught with special cases and bizarre libraries. The common solution is to convert through a common format like a DataFrame, or through streaming in-memory lists, dicts, etc. (see dat), or through a serialization format like ProtoBuf or Thrift. These are excellent options and often what you want. Sometimes, however, this can be slow, particularly when dealing with live computational systems or with finicky storage solutions.
Consider, for example, migrating between a numpy.recarray and a pandas.DataFrame. We can migrate this data very quickly in place. The bytes of data don’t need to change, only the metadata surrounding them. We don’t need to serialize to an interchange format or translate to intermediate pure Python objects.
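For instance, something like this (the record array is made up for illustration; the point is that the call need not pass through an interchange format):

>>> import numpy as np
>>> import pandas as pd
>>> from into import into

>>> rec = np.array([(1, 'Alice', 100.0), (2, 'Bob', 200.0)],
...                dtype=[('id', 'i8'), ('name', 'U10'), ('balance', 'f8')]).view(np.recarray)
>>> df = into(pd.DataFrame, rec)   # column data can be reused directly; no per-row Python objects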
Consider migrating data from a CSV file to a PostgreSQL database. Using Python iterators through SQLAlchemy we rarely exceed 2000 records per second. However, using the CSV loaders native to PostgreSQL we can achieve speeds greater than 50000 records per second. This is the difference between an overnight job and a cup of coffee. It requires, however, that we be flexible enough to use special code in special situations.
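From the user’s point of view that special-casing stays behind the same call. A sketch, with a made-up connection string, under the assumption that into routes this particular pair through PostgreSQL’s native bulk CSV loader rather than row-by-row inserts:

>>> from into import into

>>> # hypothetical database URI, with the target table named after '::'
>>> into('postgresql://hostname::accounts', 'accounts.csv')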
Expert pairwise interactions are often an order of magnitude faster than generic solutions.
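To see how such expert pairwise pieces can compose, here is a toy sketch of a conversion network: converters sit on the edges of a graph and data moves along a path between formats. This is only an illustration of the idea, not into’s actual code; the networkx dependency and the tiny list → set → frozenset chain are made up for the example.

import networkx as nx

# Each edge carries an "expert" pairwise conversion function.
graph = nx.DiGraph()
graph.add_edge(list, set, convert=set)
graph.add_edge(set, frozenset, convert=frozenset)

def migrate(target, source):
    """Follow a chain of pairwise conversions from type(source) to target."""
    path = nx.shortest_path(graph, type(source), target)
    data = source
    for a, b in zip(path, path[1:]):
        data = graph.edges[a, b]['convert'](data)
    return data

print(migrate(frozenset, [1, 2, 3]))   # frozenset({1, 2, 3})

Swapping in a faster converter on a single edge (say, a native bulk loader) speeds up every route that passes through it.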
Into is a network of these pairwise migrations. We visualize that network below: