Into and Remote Data Mucking about with `ssh` and `hdfs`
into now handles data on remote machines, including HDFS and the Hive
Last week I wrote about
into, a library to migrate data between
formats. We saw that a network of pairwise data conversions can robustly
migrate data, eliminating some of the frustration of data science.
This frustration compounds when data lives on other computers or distributed file systems like HDFS. Moving data from your local machine into something like the Hive metastore often requires several steps.
scpdata to cluster
hadoop fs -cpdata to HDFS
CREATE TABLEin Hive/Impala to register data with metastore
- Write SQL queries
While each of these steps may be relatively straightforward, their combination can be daunting to the casual analyst.
Remote data and into
So we took this as a case study and extended the
into network appropriately.
We integrate the following libraries and protocols
ssh://hostname:myfile.csvaccesses data on remote machines through
hdfs://hostname:myfile.csvaccesses data on the Hadoop distributed file system through WebHDFS using the
hive://hostname::tablenameaccesses data on the Hive Metastore using a combination of SQLAlchemy and hand crafted
into is now a fancy
Because we’re connected to the network, lots of other things work too.
Note that we’re not calling any code on the remote machine so fancy conversions always happen locally.
If you’d like to use ssh generally you might want to take a look at Paramiko which is really doing all of the heavy lifting here.
WebHDFS is a web interface to the Hadoop File System. It is surprisingly high
performance (I often erroneously think of HTTP as slow) but isn’t always turned
on in every instance. If it is then you should be able to transfer data in and
out easily, just as we did for
The interesting piece comes when we come to Hive, which, in
expects one of the following kinds of data:
ssh://single-file.csv ssh://directory-of-files/*.csv hdfs://directory-of-files/*.csv
And so we build these routes, enabling operations like the following:
But Hive is also a bit finicky.
Blaze uses the
PyHive sqlalchemy dialect to query Hive
tables; unfortunately the way Hive works we need to create them by hand. Hive
is different from most databases in that it doesn’t have an internal format.
Instead, it represents tables as directories of CSV files (or other things).
This distinction mucks up
into’s approach a bit but things work ok in normal
We had to add a couple new ideas to
into to expand out to these systems.
First, we needed a way to refer to different variants of the same format of file. For example, for CSV files we now have the following variants
A local CSV file A CSV file accessible through HDFS A CSV file accessible through SSH A directory of CSV files A directory of CSV files on HDFS ...
And the same for JSON, text, etc.. Into decides what conversion functions to run based on the type of the data, so in principle we need subclasses for all combinations of format and location. Yuck.
To solve this problem we create functions,
SSH, HDFS, Directory to create
subclasses, we call these type modifiers. So
SSH(CSV) is a new type that
acts like a CSV file and like the hidden
Note that users don’t usually see these (unless they want to be explicit) they usually interact with uri strings.
Second, we need a way to route through temporary files. E.g. consider the following route:
SSH(CSV) -> CSV -> pd.DataFrame
Both steps of this path are easy given
pandas. However we
don’t want the intermediate CSV file to hang around (users would hate us if we
slowly filled up their
/tmp folder.) We need to clean it up when we’re done.
To solve this problem, we introduce a new type modifier,
itself on garbage collection (
drop is another magic function in
docs). This lets us tie the
Python garbage collector to persistent data outside of the Python process.
It’s not fool-proof, but it is pretty effective.
SSH(CSV) -> Temp(CSV) -> pd.DataFrame
This is also a good example of how we build type modifiers. You can safely ignore the following code:
I won’t be surprised if this approach concerns a few people but I’ve found it to be effective so far.
The number of possible keyword arguments to a single
into call is increasing.
We don’t have a good mechanism to help users discover the valid options for
their situation. Docstrings are hard here because the options depend on the
source and target inputs. For the moment we’re solving this with online
documentation for each complicated format but
there is probably a better solution out there.
The new behavior around
hive:// is new, error
prone, and could really use play-testing. I strongly welcome feedback and
error reporting here. You could
file an issue
or e-mail email@example.com.
I didn’t mention anything about
RedShift support that was also
recently merged. This is because I think Phil Cloud might write a separate
blogpost about it. We did this work in parallel in an effort to hash out how
best to solve the problems above. I think it worked decently well
Also, we’ve added an
into command line interface. It works just like the
into function with strings, except that we’ve reversed the order of the
arguments to be more like
cp. An example is below:
$ into source target --key value --key value --key value $ into myfile.csv ssh://hostname:myfile.json --delimter ','
We also have docs! http://into.readthedocs.org/en/latest/
blog comments powered by Disqus