This work is supported by Continuum Analytics and the XDATA Program as part of the Blaze Project

tl;dr into efficiently migrates data between formats.

## Motivation

We spend a lot of time migrating data from common interchange formats, like CSV, to efficient computation formats like an array, a database or binary store. Worse, many don’t migrate data to efficient formats because they don’t know how or can’t manage the particular migration process for their tools.

Your choice of data format is important. It strongly impacts performance (10x is a good rule of thumb) and who can easily use and interpret your data.

When advocating for Blaze I often say “Blaze can help you query your data in a variety of formats.” This assumes that you’re able to actually get it in to that format.

## Enter the into project

The into function efficiently migrates data between formats. These formats include both in-memory data structures like the following:

list, set, tuple, Iterator
numpy.ndarray, pandas.DataFrame, dynd.array
Streaming Sequences of any of the above


as well as persistent data living outside of Python like the following:

CSV, JSON, line-delimited-JSON
Remote versions of the above
HDF5 (both standard and Pandas formatting), BColz, SAS
SQL databases (anything supported by SQLAlchemy), Mongo


The into project migrates data between any pair of these formats efficiently by using a network of pairwise conversions. (visualized towards the bottom of this post)

## How to use it

The into function takes two arguments, a source and a target. It moves data in the source to the target. The source and target can take the following forms

 Target Source Example Object Object A particular DataFrame or list String String 'file.csv', 'postgresql://hostname::tablename' Type Like list or pd.DataFrame

So the following would be valid calls to into

Note that into is a single function. We’re used to doing this with various to_csv, from_sql methods on various types. The into api is very small; Here is what you need in order to get started:

$pip install into  See into on github ## Examples We now show some of those same examples in more depth. Turn list into numpy array Load CSV file into Python list Translate CSV file into JSON $ head accounts.json
{"balance": 100, "id": 1, "name": "Alice"}
{"balance": 200, "id": 2, "name": "Bob"}
{"balance": 300, "id": 3, "name": "Charlie"}
{"balance": 400, "id": 4, "name": "Denis"}
{"balance": 500, "id": 5, "name": "Edith"}


Translate line-delimited JSON into a Pandas DataFrame

## How does it work?

This is challenging. Robust and efficient conversions between any two pairs of formats is fraught with special cases and bizarre libraries. The common solution is to convert through a common format like a DataFrame, or streaming in-memory lists, dicts, etc. (see dat) or through a serialization format like ProtoBuf or Thrift. These are excellent options and often what you want. Sometimes however this can be slow, particularly when dealing with live computational systems or with finicky storage solutions.

Consider for example, migrating between a numpy.recarray and a pandas.DataFrame. We can migrate this data very quickly in place. The bytes of data don’t need to change, only the metadata surrounding them. We don’t need to serialize to an interchange format or translate to intermediate pure Python objects.

Consider migrating data from a CSV file to a PostgreSQL database. Using Python iterators through SQLAlchemy we rarely exceed migration speeds greater than 2000 records per second. However using direct CSV loaders native to PostgreSQL we can achieve speeds greater than 50000 records per second. This is the difference between an overnight job and a cup of coffee. However this requires that we’re flexible enough to use special code in special situations.

Expert pairwise interactions are often an order of magnitude faster than generic solutions.

Into is a network of these pairwise migrations. We visualize that network below: