This work is supported by Continuum Analytics and the XDATA Grant as part of the Blaze Project

We look at data from the Home Mortgage Disclosure Act, a collection of actions taken on housing loans by various governmental agencies (gzip-ed csv file here) (thanks to Aron Ahmadia for the pointer). Uncompressed this dataset is around 10GB on disk and so we don’t want to load it up into memory with a modern commercial notebook.

Instead, we use Blaze to investigate the data, select down to the data we care about, and then migrate that data into a suitable computational backend.

In this post we’re going to use the interactive Table object, which wraps up a dataset and calls compute whenever we ask for something to be printed to the screen. It retains the abstract delayed-evaluation nature of Blaze with the interactive feel of NumPy and Pandas.

from blaze import CSV, Table
csv = CSV('hmda_lar-2012.csv')  # Open the CSV file
t = Table(csv)                  # Interact with CSV file using interactive Table object
t

	action_taken	action_taken_name	agency_code	agency_abbr	agency_name	applicant_ethnicity	applicant_ethnicity_name	applicant_income_000s	applicant_race_1	applicant_race_name_1	applicant_sex	applicant_sex_name	application_date_indicator	as_of_year	census_tract_number	co_applicant_ethnicity	co_applicant_ethnicity_name	co_applicant_race_1	co_applicant_race_name_1	co_applicant_sex	co_applicant_sex_name	county_code	county_name	denial_reason_1	edit_status	edit_status_name	hoepa_status	hoepa_status_name	lien_status	lien_status_name	loan_purpose	loan_purpose_name	loan_type	loan_type_name	msamd	msamd_name	owner_occupancy	owner_occupancy_name	preapproval	preapproval_name	property_type	property_type_name	purchaser_type	purchaser_type_name	respondent_id	sequence_number	state_code	state_abbr	state_name	hud_median_family_income	loan_amount_000s	number_of_1_to_4_family_units	number_of_owner_occupied_units	minority_population	population	rate_spread	tract_to_msamd_income
0	1	Loan originated	7	HUD	Department of Housing and Urban Development	2	Not Hispanic or Latino	173	5	White	1	Male	0	2012	8803.06	2	Not Hispanic or Latino	5	White	2	Female	197	Will County	None	NaN		2	Not a HOEPA loan	1	Secured by a first lien	3	Refinancing	1	Conventional	16974	Chicago, Joliet, Naperville - IL	1	Owner-occupied as a principal dwelling	3	Not applicable	1	One-to-four family dwelling (other than manufa...	6	Commercial bank, savings bank or savings assoc...	36-4176531	2712	17	IL	Illinois	77300	264	2153	1971	45.820000	7894	NaN	170.679993
1	1	Loan originated	5	NCUA	National Credit Union Administration	2	Not Hispanic or Latino	83	5	White	1	Male	0	2012	2915.00	5	No co-applicant	8	No co-applicant	5	No co-applicant	111	Midland County	None	NaN		2	Not a HOEPA loan	1	Secured by a first lien	3	Refinancing	1	Conventional	NaN		1	Owner-occupied as a principal dwelling	3	Not applicable	1	One-to-four family dwelling (other than manufa...	0	Loan was not originated or was not sold in cal...	0000060137	328	26	MI	Michigan	52100	116	1662	1271	3.340000	4315	NaN	102.760002
2	6	Loan purchased by the institution	9	CFPB	Consumer Financial Protection Bureau	4	Not applicable	70	7	Not applicable	4	Not applicable	2	2012	212.01	4	Not applicable	7	Not applicable	4	Not applicable	7	Benton County	None	6	Quality edit failure only	2	Not a HOEPA loan	4	Not applicable	3	Refinancing	1	Conventional	22220	Fayetteville, Springdale, Rogers - AR, MO	1	Owner-occupied as a principal dwelling	3	Not applicable	1	One-to-four family dwelling (other than manufa...	8	Affiliate institution	0000476810	43575	5	AR	Arkansas	58200	159	1194	708	21.870001	4239	NaN	127.639999
3	6	Loan purchased by the institution	9	CFPB	Consumer Financial Protection Bureau	2	Not Hispanic or Latino	108	5	White	2	Female	2	2012	407.06	5	No co-applicant	8	No co-applicant	5	No co-applicant	123	Ramsey County	None	NaN		2	Not a HOEPA loan	4	Not applicable	3	Refinancing	1	Conventional	33460	Minneapolis, St. Paul, Bloomington - MN, WI	1	Owner-occupied as a principal dwelling	3	Not applicable	1	One-to-four family dwelling (other than manufa...	1	Fannie Mae (FNMA)	0000451965	2374657	27	MN	Minnesota	83900	100	1927	1871	13.680000	4832	NaN	137.669998
4	1	Loan originated	3	FDIC	Federal Deposit Insurance Corporation	2	Not Hispanic or Latino	NaN	5	White	1	Male	0	2012	104.00	2	Not Hispanic or Latino	5	White	2	Female	3	Allen County	None	6	Quality edit failure only	2	Not a HOEPA loan	1	Secured by a first lien	2	Home improvement	1	Conventional	23060	Fort Wayne - IN	1	Owner-occupied as a principal dwelling	3	Not applicable	1	One-to-four family dwelling (other than manufa...	0	Loan was not originated or was not sold in cal...	0000013801	11	18	IN	Indiana	63800	267	1309	1160	4.680000	3612	NaN	139.100006
5	1	Loan originated	7	HUD	Department of Housing and Urban Development	2	Not Hispanic or Latino	144	5	White	1	Male	0	2012	8057.01	2	Not Hispanic or Latino	5	White	1	Male	31	Cook County	None	NaN		2	Not a HOEPA loan	1	Secured by a first lien	3	Refinancing	1	Conventional	16974	Chicago, Joliet, Naperville - IL	1	Owner-occupied as a principal dwelling	3	Not applicable	1	One-to-four family dwelling (other than manufa...	3	Freddie Mac (FHLMC)	36-4327855	22594	17	IL	Illinois	77300	260	1390	1700	6.440000	5074	NaN	140.550003
6	1	Loan originated	7	HUD	Department of Housing and Urban Development	2	Not Hispanic or Latino	51	3	Black or African American	1	Male	0	2012	17.00	5	No co-applicant	8	No co-applicant	5	No co-applicant	19	Calcasieu Parish	None	NaN		2	Not a HOEPA loan	1	Secured by a first lien	1	Home purchase	1	Conventional	29340	Lake Charles - LA	1	Owner-occupied as a principal dwelling	3	Not applicable	1	One-to-four family dwelling (other than manufa...	1	Fannie Mae (FNMA)	7056000000	34177	22	LA	Louisiana	62400	115	3500	2797	29.260000	8745	NaN	86.739998
7	1	Loan originated	7	HUD	Department of Housing and Urban Development	2	Not Hispanic or Latino	162	5	White	1	Male	0	2012	2.01	2	Not Hispanic or Latino	5	White	2	Female	49	Grand County	None	NaN		2	Not a HOEPA loan	1	Secured by a first lien	3	Refinancing	2	FHA-insured	NaN		1	Owner-occupied as a principal dwelling	3	Not applicable	1	One-to-four family dwelling (other than manufa...	6	Commercial bank, savings bank or savings assoc...	87-0623581	2141	8	CO	Colorado	61000	283	5706	1724	9.650000	4817	NaN	128.559998
8	1	Loan originated	3	FDIC	Federal Deposit Insurance Corporation	2	Not Hispanic or Latino	32	5	White	2	Female	0	2012	103.04	5	No co-applicant	8	No co-applicant	5	No co-applicant	3	Allen County	None	NaN		2	Not a HOEPA loan	1	Secured by a first lien	3	Refinancing	1	Conventional	23060	Fort Wayne - IN	1	Owner-occupied as a principal dwelling	3	Not applicable	1	One-to-four family dwelling (other than manufa...	0	Loan was not originated or was not sold in cal...	0000013801	10	18	IN	Indiana	63800	40	2384	2210	6.640000	6601	3.33	149.690002
9	1	Loan originated	9	CFPB	Consumer Financial Protection Bureau	2	Not Hispanic or Latino	38	5	White	1	Male	0	2012	9608.00	2	Not Hispanic or Latino	5	White	2	Female	41	Talbot County	None	NaN		2	Not a HOEPA loan	1	Secured by a first lien	3	Refinancing	1	Conventional	NaN		1	Owner-occupied as a principal dwelling	3	Not applicable	1	One-to-four family dwelling (other than manufa...	3	Freddie Mac (FHLMC)	0000504713	18459	24	MD	Maryland	72600	108	1311	785	6.040000	1920	NaN	77.279999
10	1	Loan originated	7	HUD	Department of Housing and Urban Development	2	Not Hispanic or Latino	47	5	White	1	Male	0	2012	22.04	1	Hispanic or Latino	5	White	2	Female	19	Calcasieu Parish	None	NaN		2	Not a HOEPA loan	1	Secured by a first lien	1	Home purchase	3	VA-guaranteed	29340	Lake Charles - LA	1	Owner-occupied as a principal dwelling	3	Not applicable	1	One-to-four family dwelling (other than manufa...	6	Commercial bank, savings bank or savings assoc...	7056000000	33791	22	LA	Louisiana	62400	158	1854	1463	12.410000	4955	NaN	112.010002

Reducing the Dataset

That’s a lot of columns, most of which are redundant or don’t carry much information. Let’s clean up our dataset a bit by selecting a smaller subset of columns. Already this quick investigation improves our comprehension and reduces the size of the dataset.

columns = ['action_taken_name', 'agency_abbr', 'applicant_ethnicity_name',
           'applicant_race_name_1', 'applicant_sex_name', 'county_name',
           'loan_purpose_name', 'state_abbr']

t = t[columns]
t

	action_taken_name	agency_abbr	applicant_ethnicity_name	applicant_race_name_1	applicant_sex_name	county_name	loan_purpose_name	state_abbr
0	Loan originated	HUD	Not Hispanic or Latino	White	Male	Will County	Refinancing	IL
1	Loan originated	NCUA	Not Hispanic or Latino	White	Male	Midland County	Refinancing	MI
2	Loan purchased by the institution	CFPB	Not applicable	Not applicable	Not applicable	Benton County	Refinancing	AR
3	Loan purchased by the institution	CFPB	Not Hispanic or Latino	White	Female	Ramsey County	Refinancing	MN
4	Loan originated	FDIC	Not Hispanic or Latino	White	Male	Allen County	Home improvement	IN
5	Loan originated	HUD	Not Hispanic or Latino	White	Male	Cook County	Refinancing	IL
6	Loan originated	HUD	Not Hispanic or Latino	Black or African American	Male	Calcasieu Parish	Home purchase	LA
7	Loan originated	HUD	Not Hispanic or Latino	White	Male	Grand County	Refinancing	CO
8	Loan originated	FDIC	Not Hispanic or Latino	White	Female	Allen County	Refinancing	IN
9	Loan originated	CFPB	Not Hispanic or Latino	White	Male	Talbot County	Refinancing	MD
10	Loan originated	HUD	Not Hispanic or Latino	White	Male	Calcasieu Parish	Home purchase	LA

More Complex Computation

Now that we can more clearly see what’s going on let’s ask a simple question:

How many times does each action occur in the state of New York?

t2 = t[t.state_abbr == 'NY']
t2

%%time
from blaze import into, by
from pandas import DataFrame
# Group on action_taken_name, count each group
into(DataFrame, by(t2.action_taken_name,
                   t2.action_taken_name.count()).sort('action_taken_name_count',
                                                      ascending=False))

CPU times: user 13min 50s, sys: 5.23 s, total: 13min 55s
Wall time: 13min 55s

	action_taken_name	action_taken_name_count
0	Loan originated	285106
1	Application denied by financial institution	109423
2	Loan purchased by the institution	75241
3	Application withdrawn by applicant	50563
4	Application approved but not accepted	25632
5	File closed for incompleteness	20585
6	Preapproval request approved but not accepted	259
7	Preapproval request denied by financial instit...	171

Great! Sadly, because it was reading through the CSV file and because it was using a Pure Python backend, that computation took fourteen minutes.

Moving to a Faster Backend

By default computations on CSV files use the streaming Python backend. While robust for large files and decently fast, this backend parses the CSV file each time we do a full-data operation, and this parsing is very slow. Let’s move our reduced dataset to a more efficient and widely accessible backend, sqlite.

from blaze import SQL
sql = SQL('sqlite:///hmda.db', 'data', schema=t.schema) # A SQLite database
into(sql, t)  # Migrate data

Yup, a little sqlite database just arrived

$ ls -lh hmda*

-rw-r--r-- 1 mrocklin mrocklin 2.7G Aug 25 13:38 hmda.db
-rw-r--r-- 1 mrocklin mrocklin  12G Jul 10 12:15 hmda_lar-2012.csv

Working with SQL

Now that we’ve migrated our csv file into a sqlite database let’s redefine t to use the SQL backend and repeat our computation.

# t = Table(csv)
t = Table(sql)
t2 = t[t.state_abbr == 'NY']

%%time
into(DataFrame, by(t2.action_taken_name,
                   t2.action_taken_name.count()).sort('action_taken_name_count',
                                                      ascending=False))

CPU times: user 5.55 s, sys: 1.64 s, total: 7.19 s
Wall time: 7.46 s

	action_taken_name	action_taken_name_count
0	Loan originated	285106
1	Application denied by financial institution	109423
2	Loan purchased by the institution	75241
3	Application withdrawn by applicant	50563
4	Application approved but not accepted	25632
5	File closed for incompleteness	20585
6	Preapproval request approved but not accepted	259
7	Preapproval request denied by financial instit...	171

We’re about to repeat this same computation many times. We’ll omit the table result from here on out. It will always be the same.

Create an index on state name

This was much faster, largely because the data was stored in a binary format. We can improve the query speed significantly by placing an index on the state_abbr field. This will cause the selection t[t.state_abbr == 'NY'] to return more quickly, eliminating the need for an expensive full table scan.

from blaze import create_index
create_index(sql, 'state_abbr', name='state_abbr_index')

Now we can ask this same query for many states at interactive timescales.

t2 = t[t.state_abbr == 'NY']

%%time
into(DataFrame, by(t2.action_taken_name,
                   t2.action_taken_name.count()).sort('action_taken_name_count',
                                                      ascending=False))

CPU times: user 1.74 s, sys: 430 ms, total: 2.17 s
Wall time: 2.17 s

Comparing against MongoDB

Because moving between computational backends is now easy, we can quickly compare performance between backends. SQLite and MongoDB are similarly available technologies, each being trivial to set up on a personal computer. However they’re also fairly different technologies with varying communities.

Which performs faster for our sample computation?

import pymongo
db = pymongo.MongoClient().db

into(db.hmda, sql)  # Migrate to Mongo DB from SQLite database

# t = Table(csv)
# t = Table(sql)
t = Table(db.hmda)

t2 = t[t.state_abbr == 'NY']

%%time
into(DataFrame, by(t2.action_taken_name,
                   t2.action_taken_name.count()).sort('action_taken_name_count',
                                                      ascending=False))

CPU times: user 4.05 ms, sys: 701 µs, total: 4.76 ms
Wall time: 7.61 s

Almost exactly the same time as for SQLite.

We just did a complex thing easily. If we weren’t familiar with MongoDB we would need to learn how to set up a database, how to migrate data from SQL to MongoDB, and finally how to perform queries. Blaze eased that process considerably.

Create an index on state name

Again we create an index on the state name and observe the performance difference.

create_index(db.hmda, 'state_abbr', name='state_abbr_index')

t2 = t[t.state_abbr == 'NY']

%%time
into(DataFrame, by(t2.action_taken_name,
                   t2.action_taken_name.count()).sort('action_taken_name_count',
                                                      ascending=False))

CPU times: user 4.13 ms, sys: 844 µs, total: 4.97 ms
Wall time: 954 ms

Here the indexed MongoDB system seems about twice as fast as the comparably indexed SQLite system.

Results

Disclaimer: These results come from a single run. No attempt was made to optimize the backend configuration, nor was any consideration taken into account about databases being warmed up. These numbers are far from conclusive, and are merely here to present the ease with which intuitive-building experiments are easy with Blaze and the value of choosing the right backend.

Backend	Duration
NumPy/Pandas	need bigger machine
Python/CSV	14 minutes
SQLite	7 seconds
MongoDB	7 seconds
SQLite (indexed)	2 seconds
MongoDB (indexed)	1 second

Conclusion

Blaze enables you to investigate, transform, and migrate large data intuitively. You can choose the right technology for your application without having to worry about learning a new syntax.

We hope that by lowering this barrier more users will use the right tool for the job.

More Information

Documentation: blaze.pydata.org/
Source: github.com/ContinuumIO/blaze/
Install with Anaconda:
```
conda install blaze
```

Introducing Blaze - Practice in which we solve a real problem

Reducing the Dataset

More Complex Computation

Moving to a Faster Backend

Working with SQL

Create an index on state name

Comparing against MongoDB

Create an index on state name

Results

Conclusion

More Information