Streaming Analytics with `toolz`

By Matthew Rocklin

tl;dr

Streaming Python enables SQL/Pandas-like computations on out-of-core datasets

Problem: Some datasets are too big for memory

In [25]:

!head /home/mrocklin/data/bitcoin/data-code/user_edges.txt

1,2,2,20130410142250,24.375
1,2,782477,20130410142250,0.7709
2,620423,4571210,20111227114312,614.17495129
2,620423,3,20111227114312,128.0405196
3,3,782479,20130410142250,47.1405196
3,3,4,20130410142250,150.0
4,39337,39337,20120617120202,0.31081764
4,39337,3,20120617120202,69.1
5,2071196,2070358,20130304143805,61.60235182
5,2071196,5,20130304143805,100.0

>>> import pandas
>>> df = pandas.read_csv('user_edges.txt')
MemoryError(...)

Solution: Solve problems by streaming data through memory
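As a rough illustration of what "streaming through memory" means, the sketch below sums the fifth column of a CSV stream one row at a time. It uses only the standard library. The in-memory `sample` (three rows copied from the listing above) and the helper name `total_value` are stand-ins chosen for this example and are not part of the original notebook.

```python
import csv
import io

# Stand-in for the large file: three rows copied from the sample above.
sample = io.StringIO(
    "1,2,2,20130410142250,24.375\n"
    "1,2,782477,20130410142250,0.7709\n"
    "2,620423,4571210,20111227114312,614.17495129\n"
)

def total_value(lines):
    """Sum the fifth column while holding only one row in memory at a time."""
    total = 0.0
    for row in csv.reader(lines):
        total += float(row[4])
    return total

result = total_value(sample)
```

The same function would accept an open file object in place of `sample`, since `csv.reader` only asks its input for one line at a time.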

Lazy Evaluation

In [1]:

from toolz import *

book = open('tale-of-two-cities.txt')
book = drop(112, book)  # drop header

In [2]:

next(book)

Out[2]:

'It was the best of times,\r\n'

In [3]:

next(book)

Out[3]:

'it was the worst of times,\r\n'
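To make the laziness visible, here is a small standard-library sketch. It assumes that `drop(n, seq)` behaves like `itertools.islice(seq, n, None)`; the generator `numbered_lines` and the `produced` log are hypothetical helpers invented for this illustration.

```python
import itertools

produced = []  # records which lines the source has actually generated

def numbered_lines(n):
    """Yield fake lines one at a time, noting each one as it is produced."""
    for i in range(n):
        produced.append(i)
        yield "line %d\n" % i

lines = numbered_lines(1000)
body = itertools.islice(lines, 3, None)  # lazily skip 3 "header" lines
produced_before = list(produced)         # nothing has been read yet

first = next(body)                       # pulls lines 0..3, returns line 3
produced_after = list(produced)
```

Building `body` costs nothing; work happens only when `next` asks for a value, and the remaining 996 lines are never generated unless something requests them.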

Lazy `map`

In [4]:

from toolz import map  # toolz' map is lazy by default

loud_book = map(str.upper, book)
next(loud_book)

Out[4]:

'IT WAS THE AGE OF WISDOM,\r\n'

In [5]:

loud_book = map(str.strip, loud_book)
next(loud_book)

Out[5]:

'IT WAS THE AGE OF FOOLISHNESS,'

Finalize in Memory with Reductions

In [6]:

frequencies(concat(loud_book))  # Frequencies is not lazy

Out[6]:

{' ': 126002,
 '!': 955,
 '"': 5681,
 '$': 2,
 '%': 1,
 "'": 1268,
 '(': 151,
 ')': 151,
 '*': 84,
 ',': 13265,
 '-': 2419,
 '.': 6811,
 '/': 24,
 '0': 17,
 '1': 61,
 '2': 10,
 '3': 12,
 '4': 9,
 '5': 13,
 '6': 9,
 '7': 13,
 '8': 14,
 '9': 14,
 ':': 263,
 ';': 1108,
 '?': 913,
 '@': 2,
 'A': 48036,
 'B': 8402,
 'C': 13812,
 'D': 28000,
 'E': 74624,
 'F': 13527,
 'G': 12517,
 'H': 38856,
 'I': 40866,
 'J': 708,
 'K': 4764,
 'L': 22002,
 'M': 15274,
 'N': 42305,
 'O': 46409,
 'P': 9891,
 'Q': 666,
 'R': 37090,
 'S': 37498,
 'T': 53858,
 'U': 16710,
 'V': 5175,
 'W': 14091,
 'X': 694,
 'Y': 12165,
 'Z': 215,
 '_': 182,
 '\xa9': 2,
 '\xc3': 2}
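The reduction step can be mimicked with the standard library, which may help show where the stream is finally consumed. In this sketch `itertools.chain.from_iterable` plays the role of `concat` and `collections.Counter` plays the role of `frequencies`; that correspondence is an approximation for illustration, and the two input lines are made up.

```python
from collections import Counter
from itertools import chain

# A tiny stand-in for the (already upper-cased, stripped) book.
lines = iter(["IT WAS THE BEST OF TIMES,", "IT WAS THE WORST OF TIMES,"])

chars = chain.from_iterable(lines)  # lazy stream of single characters
counts = Counter(chars)             # eager: reads the stream to the end
```

Only the dictionary of counts stays in memory. Its size depends on the number of distinct characters, not on the length of the book.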

Recall `groupby`

In [7]:

from toolz.curried import *

names = ['Alice', 'Bob', 'Charlie', 'Dan', 'Edith', 'Frank']

groupby(len, names)

Out[7]:

{3: ['Bob', 'Dan'], 5: ['Alice', 'Edith', 'Frank'], 7: ['Charlie']}

Common Question: I like groupby from SQL/Pandas, what else does toolz have that looks like SQL?

Common Answer: Probably your data fits in memory, so use Pandas

If you insist:

toolz: map, filter, groupby, reduceby, join, take, unique
Python: sorted, max, min, sum, ...

Some of these are streaming, some aren't
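One way to tell the two kinds apart is to feed them an endless stream. The sketch below assumes Python 3, where the built-in `map` and `filter` are lazy, and uses `itertools.islice` as a stand-in for `take`. Operations such as `sorted` or `groupby` have to see every element before they can return, so they would never finish on this input.

```python
from itertools import count, islice

naturals = count(1)                                    # endless: 1, 2, 3, ...
squares = map(lambda n: n * n, naturals)               # streaming, nothing computed yet
even_squares = filter(lambda n: n % 2 == 0, squares)   # still streaming
first_three = list(islice(even_squares, 3))            # pulls only what it needs

# sorted(naturals) would loop forever: it must read the whole input first.
```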

In [26]:

from toolz.curried import *

names = ['Alice', 'Bob', 'Charlie', 'Dan', 'Edith', 'Frank']

groupby(len, names)

Out[26]:

{3: ['Bob', 'Dan'], 5: ['Alice', 'Edith', 'Frank'], 7: ['Charlie']}

SQL-like Queries with `(cy)toolz`

Dataset

In [38]:

accounts = [(1, 'Alice',   100, 'F'),  # id, name, balance, gender
            (2, 'Bob',     200, 'M'),
            (3, 'Charlie', 150, 'M'),
            (4, 'Dennis',   50, 'M'),
            (5, 'Edith',   300, 'F')]

SELECT ... FROM ... WHERE

SELECT name, balance
FROM accounts
WHERE balance > 150;

In [39]:

from toolz.curried import pipe, map, filter, get, pluck
pipe(accounts, filter(lambda acct: acct[2] > 150),  # acct = (id, name, balance, gender)
               pluck([1, 2]),
               list)

Out[39]:

[('Bob', 200), ('Edith', 300)]

In [40]:

[(name, balance) for (id, name, balance, gender) in accounts
                 if balance > 150]

Out[40]:

[('Bob', 200), ('Edith', 300)]
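The `pipe` call above relies on two ideas: curried functions that wait for their data, and threading a value through a chain of functions. A minimal standard-library sketch of both follows, assuming Python 3. `pipe_sketch` is a hypothetical helper written for this example, and `functools.partial` only approximates what `toolz.curried` does.

```python
from functools import partial, reduce

def pipe_sketch(data, *funcs):
    """Thread data through funcs left to right: pipe(x, f, g) == g(f(x))."""
    return reduce(lambda value, func: func(value), funcs, data)

accounts = [(1, 'Alice',   100, 'F'),  # id, name, balance, gender
            (2, 'Bob',     200, 'M'),
            (3, 'Charlie', 150, 'M'),
            (4, 'Dennis',   50, 'M'),
            (5, 'Edith',   300, 'F')]

# Each stage is a function still waiting for its sequence argument.
where_rich = partial(filter, lambda acct: acct[2] > 150)
select_name_balance = partial(map, lambda acct: (acct[1], acct[2]))

result = pipe_sketch(accounts, where_rich, select_name_balance, list)
```

Because `filter` and `map` are lazy here, no intermediate list is built; `list` at the end is what drives the rows through the stages.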

In Memory Split-Apply-Combine

SELECT gender, SUM(balance)
FROM accounts
GROUP BY gender;

In [41]:

groupby(get(3), accounts)

Out[41]:

{'F': [(1, 'Alice', 100, 'F'), (5, 'Edith', 300, 'F')],
 'M': [(2, 'Bob', 200, 'M'), (3, 'Charlie', 150, 'M'), (4, 'Dennis', 50, 'M')]}

In [42]:

valmap(pluck(2), _)

Out[42]:

{'F': <itertools.imap at 0x7f97e938f410>,
 'M': <itertools.imap at 0x7f97e938f590>}

In [43]:

valmap(sum, _)

Out[43]:

{'F': 400, 'M': 400}

In [30]:

pipe(accounts, groupby(get(3)),
               valmap(pluck(2)),
               valmap(sum))

Out[30]:

{'F': 400, 'M': 400}

Streaming Split-Apply-Combine

In [31]:

def iseven(n):
    return n % 2 == 0

def add(x, y):
    return x + y

reduceby(iseven, add, [1, 2, 3, 4])

Out[31]:

{False: 4, True: 6}

In [15]:

groups = groupby(iseven, [1, 2, 3, 4])
groups

Out[15]:

{False: [1, 3], True: [2, 4]}

In [32]:

valmap(sum, groups)

Out[32]:

{False: 4, True: 6}
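Both routes give the same answer, but `groupby` first stores every element while `reduceby` only needs one running value per group. The sketch below is a plain-Python approximation of that idea, not toolz's actual implementation; `reduceby_sketch` and the `_MISSING` sentinel are names invented for this example.

```python
_MISSING = object()  # sentinel meaning "no initial value given"

def reduceby_sketch(key, binop, seq, init=_MISSING):
    """Group and reduce in one pass, storing one running value per key."""
    totals = {}
    for item in seq:
        k = key(item)
        if k in totals:
            totals[k] = binop(totals[k], item)
        elif init is _MISSING:
            totals[k] = item              # first item seeds the group
        else:
            totals[k] = binop(init, item)
    return totals

def iseven(n):
    return n % 2 == 0

def add(x, y):
    return x + y

# A generator stands in for a stream that cannot be rewound or held in memory.
stream = (n for n in [1, 2, 3, 4])
result = reduceby_sketch(iseven, add, stream)
```

Memory use grows with the number of distinct keys rather than with the length of the input, which is what makes the approach suitable for out-of-core data.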

Streaming Split-Apply-Combine

In [17]:

accounts = [(1, 'Alice',   100, 'F'),  # id, name, balance, gender
            (2, 'Bob',     200, 'M'),
            (3, 'Charlie', 150, 'M'),
            (4, 'Dennis',   50, 'M'),
            (5, 'Edith',   300, 'F')]

In [18]:

key = lambda acct: acct[3]                   # acct = (id, name, balance, gender)
binop = lambda total, acct: total + acct[2]  # add this account's balance

reduceby(key, binop, accounts, 0)

Out[18]:

{'F': 400, 'M': 400}

confused? it's ok. I am too.

Bitcoin Again

In [35]:

import csv
filename = '/home/mrocklin/data/bitcoin/data-code/user_edges.txt'

key = get(1)
binop = lambda total, row: total + float(row[4])  # row = (t, s, r, ts, value)

pipe(filename, open, csv.reader,  # Open file
                     reduceby(key, binop, init=0),  # do split-apply-combine
                     dict.items, sorted(key=second, reverse=True), # sort by values
                     take(10), list)  # take top ten as list

Out[35]:

[('11', 52461821.94165766),
 ('1374', 23394277.034151807),
 ('25', 13178095.975724494),
 ('29', 5330179.983046564),
 ('12564', 3669712.399824968),
 ('782688', 2929023.064647781),
 ('74', 2122710.961163437),
 ('91638', 2094827.8251607446),
 ('27', 2058124.131470339),
 ('20', 1182868.148780274)]
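For readers without the full file, the same query can be tried on the ten sample rows shown at the start. This sketch uses only the standard library and swaps in a different final step: `heapq.nlargest` instead of sorting all totals and taking the first few. That substitution is a choice made for this example, not what the cell above does.

```python
import csv
import heapq
import io
from collections import defaultdict

# Stand-in for the real file: the ten rows printed by `head` earlier.
sample = io.StringIO("""1,2,2,20130410142250,24.375
1,2,782477,20130410142250,0.7709
2,620423,4571210,20111227114312,614.17495129
2,620423,3,20111227114312,128.0405196
3,3,782479,20130410142250,47.1405196
3,3,4,20130410142250,150.0
4,39337,39337,20120617120202,0.31081764
4,39337,3,20120617120202,69.1
5,2071196,2070358,20130304143805,61.60235182
5,2071196,5,20130304143805,100.0
""")

totals = defaultdict(float)
for row in csv.reader(sample):
    totals[row[1]] += float(row[4])  # group by second column, sum the fifth

# Keep only the 3 largest totals without sorting every group.
top = heapq.nlargest(3, totals.items(), key=lambda kv: kv[1])
top_keys = [k for k, _ in top]
```

The keys come out as strings because `csv.reader` does no type conversion, which matches the string keys in the output above.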

Semi-Streaming Join

In [19]:

accounts = [(1, 'Alice',   100, 'F'),  # id, name, balance, gender
            (2, 'Bob',     200, 'M'),
            (3, 'Charlie', 150, 'M'),
            (4, 'Dennis',   50, 'M'),
            (5, 'Edith',   300, 'F')]

addresses = [(1,  '123 Main Street'),  # id, address
             (2,      '5 Adams Way'),
             (5, '34 Rue St Michel')]

In [20]:

list(join(first, addresses, first, accounts))

Out[20]:

[((1, '123 Main Street'), (1, 'Alice', 100, 'F')),
 ((2, '5 Adams Way'), (2, 'Bob', 200, 'M')),
 ((5, '34 Rue St Michel'), (5, 'Edith', 300, 'F'))]

In [21]:

list(join(0, addresses, 0, accounts))

Out[21]:

[((1, '123 Main Street'), (1, 'Alice', 100, 'F')),
 ((2, '5 Adams Way'), (2, 'Bob', 200, 'M')),
 ((5, '34 Rue St Michel'), (5, 'Edith', 300, 'F'))]

In [22]:

for (id, address), (id, name, balance, gender) in join(0, addresses, 0, accounts):
    print((address, name, balance))

('123 Main Street', 'Alice', 100)
('5 Adams Way', 'Bob', 200)
('34 Rue St Michel', 'Edith', 300)

Another `join` example

In [23]:

friends = [('Alice', 'Edith'),
           ('Alice', 'Zhao'),
           ('Edith', 'Alice'),
           ('Zhao', 'Alice'),
           ('Zhao', 'Edith')]

cities = [('Alice', 'NYC'),
          ('Alice', 'Chicago'),
          ('Dan', 'Sydney'),
          ('Edith', 'Paris'),
          ('Edith', 'Berlin'),
          ('Zhao', 'Shanghai')]

Vacation opportunities

In what cities do people have friends?

In [24]:

result = join(second, friends,
              first, cities)

for ((name, friend), (friend, city)) in sorted(unique(result)):
    print((name, city))

('Alice', 'Berlin')
('Alice', 'Paris')
('Alice', 'Shanghai')
('Edith', 'Chicago')
('Edith', 'NYC')
('Zhao', 'Chicago')
('Zhao', 'NYC')
('Zhao', 'Berlin')
('Zhao', 'Paris')

Join Performance

Left sequence must fit in memory, right sequence can stream

cytoolz.join is fast. It easily competes with joins in Pandas.

Like groupby, join is a powerful abstraction. Often when you write code, you're actually just writing join.
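The memory requirement on the left sequence follows from how a hash join works: the left side is indexed in a dictionary, then the right side is scanned once. The sketch below shows that idea in plain Python. It is an approximation for illustration rather than the library's code, and `join_sketch` is a name invented here.

```python
from collections import defaultdict
from operator import itemgetter

def join_sketch(leftkey, leftseq, rightkey, rightseq):
    """Inner join: index the left side in a dict, then stream the right side."""
    index = defaultdict(list)
    for left in leftseq:                 # the whole left side is held in memory
        index[leftkey(left)].append(left)
    for right in rightseq:               # the right side is visited one item at a time
        for left in index.get(rightkey(right), ()):
            yield (left, right)

addresses = [(1,  '123 Main Street'),  # id, address
             (2,      '5 Adams Way'),
             (5, '34 Rue St Michel')]

accounts = [(1, 'Alice',   100, 'F'),  # id, name, balance, gender
            (2, 'Bob',     200, 'M'),
            (3, 'Charlie', 150, 'M'),
            (4, 'Dennis',   50, 'M'),
            (5, 'Edith',   300, 'F')]

# The right side can be a one-shot generator, e.g. rows read from a large file.
accounts_stream = (acct for acct in accounts)
matches = list(join_sketch(itemgetter(0), addresses, itemgetter(0), accounts_stream))
```

Putting the smaller table on the left keeps the index small, while the larger table flows through on the right.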

Recap

Appeal to use core data structures
Don't use map, filter, reduce, but do think about them
They have friends like groupby in toolz
Python data structures are surprisingly fast, particularly with cytoolz
How to handle datasets that don't fit in memory with streaming computation
