Data Bandwidth
This work is supported by Continuum Analytics and the XDATA Program as part of the Blaze Project
tl;dr: We list and combine common bandwidths relevant in data science
Understanding data bandwidths helps us to identify bottlenecks and write efficient code. Both hardware and software can be characterized by how quickly they churn through data. We present a rough list of relevant data bandwidths and discuss how to use this list when optimizing a data pipeline.
| Name | Bandwidth (MB/s) |
|------|------------------|
| Memory copy | 3000 |
| Basic filtering in C/NumPy/Pandas | 3000 |
| Fast decompression | 1000 |
| SSD large sequential read | 500 |
| Interprocess communication (IPC) | 300 |
| msgpack deserialization | 125 |
| Gigabit Ethernet | 100 |
| Pandas read_csv | 100 |
| JSON deserialization | 50 |
| Slow decompression (e.g. gzip/bz2) | 50 |
| SSD small random read | 20 |
| Wireless network | 1 |
Disclaimer: all numbers in this post are rule-of-thumb estimates and vary by situation.
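Because these numbers depend so much on your hardware and data, it can be worth measuring a few yourself. Here is a minimal sketch (not the benchmark code behind the table above) that times a memory copy and JSON deserialization using NumPy and the standard library:

```python
import json
import time

import numpy as np

def bandwidth_mb_s(nbytes, seconds):
    """Convert bytes processed and elapsed time into MB/s."""
    return nbytes / seconds / 1e6

# Memory copy bandwidth
x = np.random.random(50_000_000)          # ~400 MB of float64
start = time.perf_counter()
y = x.copy()
elapsed = time.perf_counter() - start
print("memory copy: %.0f MB/s" % bandwidth_mb_s(x.nbytes, elapsed))

# JSON deserialization bandwidth
text = json.dumps(list(range(1_000_000)))  # ~7 MB of JSON text
start = time.perf_counter()
json.loads(text)
elapsed = time.perf_counter() - start
print("json loads:  %.0f MB/s" % bandwidth_mb_s(len(text), elapsed))
```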
Understanding these scales can help you identify how to speed up your program. For example, there is no need to buy a faster network or disk if you store your data as JSON; deserialization will remain the bottleneck.
Combining Bandwidths
Complex data pipelines involve many stages. The rule for combining bandwidths is to add the inverses of the individual bandwidths, then take the inverse of the sum:

\[\textrm{total bandwidth} = \left(\sum_i \frac{1}{x_i}\right)^{-1}\]

This is the same principle behind combining conductances in series in an electrical circuit. One quickly learns to optimize the slowest link in the chain first.
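As a sketch, this rule is a one-liner in Python (the function name is ours):

```python
def total_bandwidth(*stage_bandwidths):
    """Combine the bandwidths (MB/s) of pipeline stages run in series.

    Every byte passes through every stage, so the per-byte times
    (the inverses of the bandwidths) add up.
    """
    return 1 / sum(1 / b for b in stage_bandwidths)
```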
Example
When we read data from disk (500 MB/s) and then deserialize it from JSON (50 MB/s), our full bandwidth is about 45 MB/s:

\[\left(\frac{1}{500} + \frac{1}{50}\right)^{-1} = 45.4\]

If we invest in a faster hard drive system with 2 GB/s of read bandwidth then we get only a marginal performance improvement:

\[\left(\frac{1}{2000} + \frac{1}{50}\right)^{-1} = 48.8\]

However, if we invest in a faster serialization technology, like msgpack (125 MB/s), then we double our effective bandwidth:

\[\left(\frac{1}{500} + \frac{1}{125}\right)^{-1} = 100\]

This example demonstrates that we should focus on the weakest bandwidth first. Cheap changes, like switching from JSON to msgpack, can be more effective than expensive ones, like purchasing fast storage hardware.
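Using the `total_bandwidth` helper from above, we can reproduce these numbers directly:

```python
print(total_bandwidth(500, 50))    # disk + JSON     -> 45.45
print(total_bandwidth(2000, 50))   # faster disk     -> 48.78
print(total_bandwidth(500, 125))   # disk + msgpack  -> 100.0
```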
Overlapping Bandwidths
We can overlap certain classes of bandwidths. In particular, we can often overlap communication bandwidths with computation bandwidths. In our disk + JSON example above we can probably hide the disk read time completely. The same goes for network applications if we handle sockets correctly.
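As a rough sketch of this idea in Python, a single background thread can read the next file while the main thread deserializes the current one; the `read`/`process` helpers and the list of JSON files are hypothetical:

```python
import json
from concurrent.futures import ThreadPoolExecutor

def read(filename):
    with open(filename) as f:
        return f.read()

def process(filenames):
    """Yield parsed documents, overlapping disk reads with parsing."""
    with ThreadPoolExecutor(max_workers=1) as pool:
        future = pool.submit(read, filenames[0])   # start the first read
        for nxt in filenames[1:]:
            text = future.result()                 # wait for the current read
            future = pool.submit(read, nxt)        # prefetch the next file
            yield json.loads(text)                 # parse while the read proceeds
        yield json.loads(future.result())
```

In the disk + JSON case the read is ten times faster than the parse, so it disappears entirely behind the computation.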
Parallel Bandwidths
We can parallelize some computational bandwidths. For example, on a four-core machine we can parallelize JSON deserialization across cores to roughly quadruple the effective bandwidth: `50 MB/s * 4 = 200 MB/s`. Communication bandwidths, by contrast, typically do not scale with the number of cores.
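A minimal sketch with the standard library's `multiprocessing` module, assuming the data has already arrived as a list of JSON strings:

```python
import json
from multiprocessing import Pool

def parse_all(texts):
    # One worker process per core by default; each core parses
    # its own share of the strings.
    with Pool() as pool:
        return pool.map(json.loads, texts)
```

Note that shipping the strings to the worker processes is itself interprocess communication (300 MB/s in the table above), which is exactly the kind of communication bandwidth that does not parallelize.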