I pushed out this text on several platforms. On Twitter / Mastodon it went out as
a thread. On Reddit / LinkedIn it was a single post.
Spark, Dask, DuckDB, Polars: TPC-H Benchmarks at Scale
I gave this talk at PyData NYC last week. It was fun working with devs from various projects (Dask, Arrow, Polars, Spark) in the week leading up to the event. Thought I’d share a re-recording of it here.
This is the result of a couple weeks of work comparing large dataframe frameworks on benchmarks ranging in size from 10 GB to 10 TB. No project wins outright. The results are really interesting to analyze, though.
DuckDB and Dask are the only projects that reliably finish things (although Dask’s success here may have to do with me knowing Dask better than the others). DuckDB is way faster at small scale (along with Polars). Dask and Spark are generally more robust and performant at large scale, mostly because they’re able to parallelize S3 access. Really good S3 access seems to be how you win at real-world cloud performance.
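To make the “parallelize S3 access” point concrete, here’s a toy sketch of the idea: instead of one long sequential read, split an object into byte ranges and fetch them concurrently, the way Dask and Spark issue many ranged GETs against S3 in parallel. A local file stands in for the S3 object, and `fetch_range` is a hypothetical helper, not any project’s actual API.

```python
import concurrent.futures
import os
import tempfile

def fetch_range(path, start, length):
    """Read one byte range (stands in for a ranged S3 GET)."""
    with open(path, "rb") as f:
        f.seek(start)
        return f.read(length)

def parallel_read(path, n_chunks=8):
    """Split a file into n_chunks byte ranges and read them concurrently."""
    size = os.path.getsize(path)
    chunk = -(-size // n_chunks)  # ceiling division
    ranges = [(i * chunk, min(chunk, size - i * chunk))
              for i in range(n_chunks) if i * chunk < size]
    with concurrent.futures.ThreadPoolExecutor() as pool:
        parts = pool.map(lambda r: fetch_range(path, *r), ranges)
    return b"".join(parts)

# Demo against a throwaway local file.
with tempfile.NamedTemporaryFile(delete=False) as f:
    f.write(b"x" * 1_000_003)
    tmp = f.name

data = parallel_read(tmp)
assert len(data) == 1_000_003
```

With threads the ranged reads overlap their wait time, which is where the win comes from against high-latency object stores; the real engines go further and spread the ranges across whole worker machines.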
Looking more deeply at the Dask results, we’re wildly inefficient. There’s at least a 2x–5x performance increase to be had here. Given that Dask does about as well as any other project in the cloud, this really means that no one has optimized for the cloud well yet.
This talk also goes into how we attempted to address bias (which is super hard to do in benchmarks). We had active collaborations with Polars and Spark people (we actually made Polars quite a bit faster during this process). See https://matthewrocklin.com/biased-benchmarks.html for more thoughts.
This also shows the improvement Dask made in the last six months. Dask used to suck at benchmarks. Now it doesn’t win, but reliably places among the top. This is due to …
- New shuffling algorithms
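For readers who haven’t run into shuffling: the core idea is that every input partition hashes each row’s key to decide which output partition the row belongs to, so rows with the same key always land together. Below is a toy in-memory sketch of that idea only; it is not Dask’s actual implementation, which streams these buckets between workers over the network.

```python
from collections import defaultdict

def shuffle(partitions, key, n_out):
    """Hash-partition rows from input partitions into n_out output partitions."""
    buckets = defaultdict(list)
    for part in partitions:
        for row in part:
            # Rows with equal keys hash to the same bucket.
            buckets[hash(row[key]) % n_out].append(row)
    return [buckets[i] for i in range(n_out)]

parts = [
    [{"id": 1, "v": 10}, {"id": 2, "v": 20}],
    [{"id": 1, "v": 30}, {"id": 3, "v": 40}],
]
out = shuffle(parts, "id", 2)
# Every "id" now lives in exactly one output partition, which is what
# makes downstream joins and groupbys possible.
```

The hard part at scale isn’t this logic; it’s moving the buckets between machines without blowing up memory, which is what the newer algorithms improve.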
There’s a lot of work for projects like Dask and Polars to fix themselves up in this space. They’re both moving pretty fast right now. I’m curious to see how they progress in the next few months.
For future work I’d like to expand this out a bit beyond TPC-H. TPC-H is great because its queries are fairly serious (lots of tables, lots of joins) rather than micro-benchmarks. We could use broader coverage though. Any ideas?
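To give a flavor of what “serious queries” means here, this is a miniature join-plus-aggregation in the spirit of TPC-H, using stdlib sqlite3 so it runs anywhere. The two-table schema is made up for illustration; it is not the actual TPC-H schema or one of its 22 queries.

```python
import sqlite3

# A toy query in the TPC-H spirit: multiple tables, a join, and an
# aggregation, rather than a single-column micro-benchmark.
con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE customer (c_id INTEGER, c_region TEXT);
    CREATE TABLE orders (o_id INTEGER, o_c_id INTEGER, o_total REAL);
    INSERT INTO customer VALUES (1, 'EU'), (2, 'US'), (3, 'EU');
    INSERT INTO orders VALUES (10, 1, 100.0), (11, 2, 50.0),
                              (12, 3, 75.0), (13, 1, 25.0);
""")
rows = con.execute("""
    SELECT c_region, SUM(o_total) AS revenue
    FROM orders JOIN customer ON o_c_id = c_id
    GROUP BY c_region ORDER BY revenue DESC
""").fetchall()
# → [('EU', 200.0), ('US', 50.0)]
```

The real benchmark runs queries like this over eight tables at terabyte scale, which is exactly where shuffle quality and S3 access start to dominate.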