This work is supported by Continuum Analytics and the XDATA Grant as part of the Blaze Project

We look at data from the Home Mortgage Disclosure Act, a collection of actions taken on housing loans by various governmental agencies (gzip-ed csv file here) (thanks to Aron Ahmadia for the pointer). Uncompressed this dataset is around 10GB on disk and so we don’t want to load it up into memory with a modern commercial notebook.

Instead, we use Blaze to investigate the data, select down to the data we care about, and then migrate that data into a suitable computational backend.

In this post we’re going to use the interactive Table object, which wraps up a dataset and calls compute whenever we ask for something to be printed to the screen. It retains the abstract delayed-evaluation nature of Blaze with the interactive feel of NumPy and Pandas.

from blaze import CSV, Table
csv = CSV('hmda_lar-2012.csv')  # Open the CSV file
t = Table(csv)                  # Interact with CSV file using interactive Table object
t
action_taken action_taken_name agency_code agency_abbr agency_name applicant_ethnicity applicant_ethnicity_name applicant_income_000s applicant_race_1 applicant_race_2 applicant_race_3 applicant_race_4 applicant_race_5 applicant_race_name_1 applicant_race_name_2 applicant_race_name_3 applicant_race_name_4 applicant_race_name_5 applicant_sex applicant_sex_name application_date_indicator as_of_year census_tract_number co_applicant_ethnicity co_applicant_ethnicity_name co_applicant_race_1 co_applicant_race_2 co_applicant_race_3 co_applicant_race_4 co_applicant_race_5 co_applicant_race_name_1 co_applicant_race_name_2 co_applicant_race_name_3 co_applicant_race_name_4 co_applicant_race_name_5 co_applicant_sex co_applicant_sex_name county_code county_name denial_reason_1 denial_reason_2 denial_reason_3 denial_reason_name_1 denial_reason_name_2 denial_reason_name_3 edit_status edit_status_name hoepa_status hoepa_status_name lien_status lien_status_name loan_purpose loan_purpose_name loan_type loan_type_name msamd msamd_name owner_occupancy owner_occupancy_name preapproval preapproval_name property_type property_type_name purchaser_type purchaser_type_name respondent_id sequence_number state_code state_abbr state_name hud_median_family_income loan_amount_000s number_of_1_to_4_family_units number_of_owner_occupied_units minority_population population rate_spread tract_to_msamd_income
0 1 Loan originated 7 HUD Department of Housing and Urban Development 2 Not Hispanic or Latino 173 5 White 1 Male 0 2012 8803.06 2 Not Hispanic or Latino 5 White 2 Female 197 Will County None NaN 2 Not a HOEPA loan 1 Secured by a first lien 3 Refinancing 1 Conventional 16974 Chicago, Joliet, Naperville - IL 1 Owner-occupied as a principal dwelling 3 Not applicable 1 One-to-four family dwelling (other than manufa... 6 Commercial bank, savings bank or savings assoc... 36-4176531 2712 17 IL Illinois 77300 264 2153 1971 45.820000 7894 NaN 170.679993
1 1 Loan originated 5 NCUA National Credit Union Administration 2 Not Hispanic or Latino 83 5 White 1 Male 0 2012 2915.00 5 No co-applicant 8 No co-applicant 5 No co-applicant 111 Midland County None NaN 2 Not a HOEPA loan 1 Secured by a first lien 3 Refinancing 1 Conventional NaN 1 Owner-occupied as a principal dwelling 3 Not applicable 1 One-to-four family dwelling (other than manufa... 0 Loan was not originated or was not sold in cal... 0000060137 328 26 MI Michigan 52100 116 1662 1271 3.340000 4315 NaN 102.760002
2 6 Loan purchased by the institution 9 CFPB Consumer Financial Protection Bureau 4 Not applicable 70 7 Not applicable 4 Not applicable 2 2012 212.01 4 Not applicable 7 Not applicable 4 Not applicable 7 Benton County None 6 Quality edit failure only 2 Not a HOEPA loan 4 Not applicable 3 Refinancing 1 Conventional 22220 Fayetteville, Springdale, Rogers - AR, MO 1 Owner-occupied as a principal dwelling 3 Not applicable 1 One-to-four family dwelling (other than manufa... 8 Affiliate institution 0000476810 43575 5 AR Arkansas 58200 159 1194 708 21.870001 4239 NaN 127.639999
3 6 Loan purchased by the institution 9 CFPB Consumer Financial Protection Bureau 2 Not Hispanic or Latino 108 5 White 2 Female 2 2012 407.06 5 No co-applicant 8 No co-applicant 5 No co-applicant 123 Ramsey County None NaN 2 Not a HOEPA loan 4 Not applicable 3 Refinancing 1 Conventional 33460 Minneapolis, St. Paul, Bloomington - MN, WI 1 Owner-occupied as a principal dwelling 3 Not applicable 1 One-to-four family dwelling (other than manufa... 1 Fannie Mae (FNMA) 0000451965 2374657 27 MN Minnesota 83900 100 1927 1871 13.680000 4832 NaN 137.669998
4 1 Loan originated 3 FDIC Federal Deposit Insurance Corporation 2 Not Hispanic or Latino NaN 5 White 1 Male 0 2012 104.00 2 Not Hispanic or Latino 5 White 2 Female 3 Allen County None 6 Quality edit failure only 2 Not a HOEPA loan 1 Secured by a first lien 2 Home improvement 1 Conventional 23060 Fort Wayne - IN 1 Owner-occupied as a principal dwelling 3 Not applicable 1 One-to-four family dwelling (other than manufa... 0 Loan was not originated or was not sold in cal... 0000013801 11 18 IN Indiana 63800 267 1309 1160 4.680000 3612 NaN 139.100006
5 1 Loan originated 7 HUD Department of Housing and Urban Development 2 Not Hispanic or Latino 144 5 White 1 Male 0 2012 8057.01 2 Not Hispanic or Latino 5 White 1 Male 31 Cook County None NaN 2 Not a HOEPA loan 1 Secured by a first lien 3 Refinancing 1 Conventional 16974 Chicago, Joliet, Naperville - IL 1 Owner-occupied as a principal dwelling 3 Not applicable 1 One-to-four family dwelling (other than manufa... 3 Freddie Mac (FHLMC) 36-4327855 22594 17 IL Illinois 77300 260 1390 1700 6.440000 5074 NaN 140.550003
6 1 Loan originated 7 HUD Department of Housing and Urban Development 2 Not Hispanic or Latino 51 3 Black or African American 1 Male 0 2012 17.00 5 No co-applicant 8 No co-applicant 5 No co-applicant 19 Calcasieu Parish None NaN 2 Not a HOEPA loan 1 Secured by a first lien 1 Home purchase 1 Conventional 29340 Lake Charles - LA 1 Owner-occupied as a principal dwelling 3 Not applicable 1 One-to-four family dwelling (other than manufa... 1 Fannie Mae (FNMA) 7056000000 34177 22 LA Louisiana 62400 115 3500 2797 29.260000 8745 NaN 86.739998
7 1 Loan originated 7 HUD Department of Housing and Urban Development 2 Not Hispanic or Latino 162 5 White 1 Male 0 2012 2.01 2 Not Hispanic or Latino 5 White 2 Female 49 Grand County None NaN 2 Not a HOEPA loan 1 Secured by a first lien 3 Refinancing 2 FHA-insured NaN 1 Owner-occupied as a principal dwelling 3 Not applicable 1 One-to-four family dwelling (other than manufa... 6 Commercial bank, savings bank or savings assoc... 87-0623581 2141 8 CO Colorado 61000 283 5706 1724 9.650000 4817 NaN 128.559998
8 1 Loan originated 3 FDIC Federal Deposit Insurance Corporation 2 Not Hispanic or Latino 32 5 White 2 Female 0 2012 103.04 5 No co-applicant 8 No co-applicant 5 No co-applicant 3 Allen County None NaN 2 Not a HOEPA loan 1 Secured by a first lien 3 Refinancing 1 Conventional 23060 Fort Wayne - IN 1 Owner-occupied as a principal dwelling 3 Not applicable 1 One-to-four family dwelling (other than manufa... 0 Loan was not originated or was not sold in cal... 0000013801 10 18 IN Indiana 63800 40 2384 2210 6.640000 6601 3.33 149.690002
9 1 Loan originated 9 CFPB Consumer Financial Protection Bureau 2 Not Hispanic or Latino 38 5 White 1 Male 0 2012 9608.00 2 Not Hispanic or Latino 5 White 2 Female 41 Talbot County None NaN 2 Not a HOEPA loan 1 Secured by a first lien 3 Refinancing 1 Conventional NaN 1 Owner-occupied as a principal dwelling 3 Not applicable 1 One-to-four family dwelling (other than manufa... 3 Freddie Mac (FHLMC) 0000504713 18459 24 MD Maryland 72600 108 1311 785 6.040000 1920 NaN 77.279999
10 1 Loan originated 7 HUD Department of Housing and Urban Development 2 Not Hispanic or Latino 47 5 White 1 Male 0 2012 22.04 1 Hispanic or Latino 5 White 2 Female 19 Calcasieu Parish None NaN 2 Not a HOEPA loan 1 Secured by a first lien 1 Home purchase 3 VA-guaranteed 29340 Lake Charles - LA 1 Owner-occupied as a principal dwelling 3 Not applicable 1 One-to-four family dwelling (other than manufa... 6 Commercial bank, savings bank or savings assoc... 7056000000 33791 22 LA Louisiana 62400 158 1854 1463 12.410000 4955 NaN 112.010002

Reducing the Dataset

That’s a lot of columns, most of which are redundant or don’t carry much information. Let’s clean up our dataset a bit by selecting a smaller subset of columns. Already this quick investigation improves our comprehension and reduces the size of the dataset.

columns = ['action_taken_name', 'agency_abbr', 'applicant_ethnicity_name',
           'applicant_race_name_1', 'applicant_sex_name', 'county_name',
           'loan_purpose_name', 'state_abbr']

t = t[columns]
t
action_taken_name agency_abbr applicant_ethnicity_name applicant_race_name_1 applicant_sex_name county_name loan_purpose_name state_abbr
0 Loan originated HUD Not Hispanic or Latino White Male Will County Refinancing IL
1 Loan originated NCUA Not Hispanic or Latino White Male Midland County Refinancing MI
2 Loan purchased by the institution CFPB Not applicable Not applicable Not applicable Benton County Refinancing AR
3 Loan purchased by the institution CFPB Not Hispanic or Latino White Female Ramsey County Refinancing MN
4 Loan originated FDIC Not Hispanic or Latino White Male Allen County Home improvement IN
5 Loan originated HUD Not Hispanic or Latino White Male Cook County Refinancing IL
6 Loan originated HUD Not Hispanic or Latino Black or African American Male Calcasieu Parish Home purchase LA
7 Loan originated HUD Not Hispanic or Latino White Male Grand County Refinancing CO
8 Loan originated FDIC Not Hispanic or Latino White Female Allen County Refinancing IN
9 Loan originated CFPB Not Hispanic or Latino White Male Talbot County Refinancing MD
10 Loan originated HUD Not Hispanic or Latino White Male Calcasieu Parish Home purchase LA

More Complex Computation

Now that we can more clearly see what’s going on let’s ask a simple question:

How many times does each action occur in the state of New York?

t2 = t[t.state_abbr == 'NY']
t2
%%time
from blaze import into, by
from pandas import DataFrame
# Group on action_taken_name, count each group
into(DataFrame, by(t2.action_taken_name,
                   t2.action_taken_name.count()).sort('action_taken_name_count',
                                                      ascending=False))
CPU times: user 13min 50s, sys: 5.23 s, total: 13min 55s
Wall time: 13min 55s
action_taken_name action_taken_name_count
0 Loan originated 285106
1 Application denied by financial institution 109423
2 Loan purchased by the institution 75241
3 Application withdrawn by applicant 50563
4 Application approved but not accepted 25632
5 File closed for incompleteness 20585
6 Preapproval request approved but not accepted 259
7 Preapproval request denied by financial instit... 171

Great! Sadly, because it was reading through the CSV file and because it was using a Pure Python backend, that computation took fourteen minutes.

Moving to a Faster Backend

By default computations on CSV files use the streaming Python backend. While robust for large files and decently fast, this backend parses the CSV file each time we do a full-data operation, and this parsing is very slow. Let’s move our reduced dataset to a more efficient and widely accessible backend, sqlite.

from blaze import SQL
sql = SQL('sqlite:///hmda.db', 'data', schema=t.schema) # A SQLite database
into(sql, t)  # Migrate data

Yup, a little sqlite database just arrived

$ ls -lh hmda*

-rw-r--r-- 1 mrocklin mrocklin 2.7G Aug 25 13:38 hmda.db
-rw-r--r-- 1 mrocklin mrocklin  12G Jul 10 12:15 hmda_lar-2012.csv

Working with SQL

Now that we’ve migrated our csv file into a sqlite database let’s redefine t to use the SQL backend and repeat our computation.

# t = Table(csv)
t = Table(sql)
t2 = t[t.state_abbr == 'NY']
%%time
into(DataFrame, by(t2.action_taken_name,
                   t2.action_taken_name.count()).sort('action_taken_name_count',
                                                      ascending=False))
CPU times: user 5.55 s, sys: 1.64 s, total: 7.19 s
Wall time: 7.46 s
action_taken_name action_taken_name_count
0 Loan originated 285106
1 Application denied by financial institution 109423
2 Loan purchased by the institution 75241
3 Application withdrawn by applicant 50563
4 Application approved but not accepted 25632
5 File closed for incompleteness 20585
6 Preapproval request approved but not accepted 259
7 Preapproval request denied by financial instit... 171

We’re about to repeat this same computation many times. We’ll omit the table result from here on out. It will always be the same.

Create an index on state name

This was much faster, largely because the data was stored in a binary format. We can improve the query speed significantly by placing an index on the state_abbr field. This will cause the selection t[t.state_abbr == 'NY'] to return more quickly, eliminating the need for an expensive full table scan.

from blaze import create_index
create_index(sql, 'state_abbr', name='state_abbr_index')

Now we can ask this same query for many states at interactive timescales.

t2 = t[t.state_abbr == 'NY']
%%time
into(DataFrame, by(t2.action_taken_name,
                   t2.action_taken_name.count()).sort('action_taken_name_count',
                                                      ascending=False))
CPU times: user 1.74 s, sys: 430 ms, total: 2.17 s
Wall time: 2.17 s

Comparing against MongoDB

Because moving between computational backends is now easy, we can quickly compare performance between backends. SQLite and MongoDB are similarly available technologies, each being trivial to set up on a personal computer. However they’re also fairly different technologies with varying communities.

Which performs faster for our sample computation?

import pymongo
db = pymongo.MongoClient().db

into(db.hmda, sql)  # Migrate to Mongo DB from SQLite database
# t = Table(csv)
# t = Table(sql)
t = Table(db.hmda)
t2 = t[t.state_abbr == 'NY']
%%time
into(DataFrame, by(t2.action_taken_name,
                   t2.action_taken_name.count()).sort('action_taken_name_count',
                                                      ascending=False))
CPU times: user 4.05 ms, sys: 701 µs, total: 4.76 ms
Wall time: 7.61 s

Almost exactly the same time as for SQLite.

We just did a complex thing easily. If we weren’t familiar with MongoDB we would need to learn how to set up a database, how to migrate data from SQL to MongoDB, and finally how to perform queries. Blaze eased that process considerably.

Create an index on state name

Again we create an index on the state name and observe the performance difference.

create_index(db.hmda, 'state_abbr', name='state_abbr_index')
t2 = t[t.state_abbr == 'NY']
%%time
into(DataFrame, by(t2.action_taken_name,
                   t2.action_taken_name.count()).sort('action_taken_name_count',
                                                      ascending=False))
CPU times: user 4.13 ms, sys: 844 µs, total: 4.97 ms
Wall time: 954 ms

Here the indexed MongoDB system seems about twice as fast as the comparably indexed SQLite system.

Results

Disclaimer: These results come from a single run. No attempt was made to optimize the backend configuration, nor was any consideration taken into account about databases being warmed up. These numbers are far from conclusive, and are merely here to present the ease with which intuitive-building experiments are easy with Blaze and the value of choosing the right backend.

Backend Duration
NumPy/Pandas need bigger machine
Python/CSV 14 minutes
SQLite 7 seconds
MongoDB 7 seconds
SQLite (indexed) 2 seconds
MongoDB (indexed) 1 second

Conclusion

Blaze enables you to investigate, transform, and migrate large data intuitively. You can choose the right technology for your application without having to worry about learning a new syntax.

We hope that by lowering this barrier more users will use the right tool for the job.

More Information


blog comments powered by Disqus