GroupBy and Package Management

I believe that packages are an artificial abstraction. I’ll demonstrate this with my favorite function, groupby.

Groupby

groupby partitions a collection by a key function. Here is a test demonstrating two examples

Groupby is a cousin of map and filter, other higher order functions that take a function and a collection as inputs. Groupby returns a dictionary that associates groups of that collection to their value under that function. For example names 'Bob' and 'Dan' have value 3 under the function len and so were associated to the group with key 3 while 'Alice' was put into the group associated to 5. It places all items of the collection into the proper group in one pass of the data.

Groupby is a common function implemented in several standard libraries. It exists in C#, Lisp languages (Scheme, Clojure), and is a keyword in SQL. It doesn’t exist in Python (see footnote) but fortunately is easy to implement.

A final example, the common function histogram is a trivial extension of groupby

Map, filter, and reduce/fold each replace a commonly occuring programming pattern. When these functions don’t fit we often revert to traditional for loops or list comprehensions. Groupby efficiently handles another surprisingly large class of problems. You probably implement groupby relatively frequently without realizing it.

Package Management

As I mentioned, this is my favorite function; I implement it in every project I have. It lives in a util.py file at the base of each project’s directory structure. This is code duplication. Code duplication is bad. The common solution is to put groupby into a separate package and then import that single package within each project.

In which project should groupby live?

I’ve often thought of making an itertools2 (in my head often called itertoolz) that contains more utility functions for iterables. Or perhaps a rocklin_util project the for general utility functions that I often use. At times I’ve thought of just a groupby project that has only one function, also named groupby, e.g.

from groupby import groupby

Which decision is best? What other functions should live in the project that houses groupby?

To me this question is subjective and all answers I can come up with feel artificial. If you have ideas I’d love to hear them. Please post in the comments below.

At the moment my answer is that package management is not the correct abstraction. Instead perhaps the solution is to manage functions directly. I have some thoughts on this but this post is already longer than I like.

Footnote

itertools has a function called groupby. It is a streaming variant of this operation. It requires that all elements of a group be adjacent to one another and so does not satisfy the traditional interface. I know of no other implementation within the standard library. SymPy has a version which it calls sift.