GroupBy and Package Management
I believe that packages are an artificial abstraction. I’ll demonstrate this with my favorite function, groupby.
Groupby
groupby partitions a collection by a key function. Here is a test demonstrating two examples.
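A minimal sketch of such a test follows. It assumes the call signature groupby(key, seq) and includes a small stand-in implementation so that it runs on its own; the list of names and the even/odd example are illustrative choices.

```python
def groupby(key, seq):
    """Stand-in implementation (an assumption): group items of seq by key(item)."""
    groups = {}
    for item in seq:
        groups.setdefault(key(item), []).append(item)
    return groups


def test_groupby():
    # Example 1: group names by their length.
    names = ['Alice', 'Bob', 'Charlie', 'Dan', 'Edith', 'Frank']
    assert groupby(len, names) == {
        3: ['Bob', 'Dan'],
        5: ['Alice', 'Edith', 'Frank'],
        7: ['Charlie'],
    }

    # Example 2: group integers by whether they are even.
    def iseven(n):
        return n % 2 == 0

    assert groupby(iseven, [1, 2, 3, 4, 5, 6, 7]) == {
        True: [2, 4, 6],
        False: [1, 3, 5, 7],
    }


test_groupby()
```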
Groupby is a cousin of map and filter, other higher order functions that take a function and a collection as inputs. Groupby returns a dictionary that maps each value of the function to the group of items in the collection that produce that value. For example, the names 'Bob' and 'Dan' have value 3 under the function len and so are placed in the group with key 3, while 'Alice' is placed in the group with key 5. It places all items of the collection into the proper group in one pass over the data.
Groupby is a common function implemented in several standard libraries. It exists in C# and in Lisp languages (Scheme, Clojure), and it is a keyword in SQL. It doesn’t exist in Python (see footnote) but fortunately it is easy to implement.
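One possible implementation is sketched below. The signature groupby(key, seq) and the choice to return a plain dict of lists are assumptions; the one-pass behavior matches the description above.

```python
from collections import defaultdict


def groupby(key, seq):
    """Group the items of seq by the value of key(item).

    Returns a plain dict mapping each key value to the list of items that
    produced it, in the order the items were encountered.
    """
    groups = defaultdict(list)
    for item in seq:                    # a single pass over the data
        groups[key(item)].append(item)  # missing keys start as empty lists
    return dict(groups)                 # hand back an ordinary dict


by_length = groupby(len, ['Alice', 'Bob', 'Dan'])
```

Converting back to a dict at the end means a lookup of a missing key raises KeyError instead of silently creating an empty group.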
A final example, the common function histogram is a trivial extension of groupby.
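One way to read this, assuming a histogram maps each distinct item to the number of times it occurs: group the items by identity, then take the size of each group. The groupby defined here is a stand-in with the assumed signature groupby(key, seq).

```python
def groupby(key, seq):
    """Stand-in implementation (an assumption): group items of seq by key(item)."""
    groups = {}
    for item in seq:
        groups.setdefault(key(item), []).append(item)
    return groups


def histogram(seq):
    """Count how many times each distinct item occurs in seq."""
    groups = groupby(lambda x: x, seq)  # equal items land in the same group
    return {item: len(group) for item, group in groups.items()}


counts = histogram('mississippi')
```

This builds the full groups only to count them, so it trades some memory for brevity compared with a dedicated counter.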
Map, filter, and reduce/fold each replace a commonly occurring programming pattern. When these functions don’t fit we often revert to traditional for loops or list comprehensions. Groupby efficiently handles another surprisingly large class of problems. You probably implement groupby relatively frequently without realizing it.
Package Management
As I mentioned, this is my favorite function; I implement it in every project I have. It lives in a util.py file at the base of each project’s directory structure. This is code duplication. Code duplication is bad. The common solution is to put groupby into a separate package and then import that single package within each project.
In which project should groupby live?
I’ve often thought of making an itertools2 (in my head often called itertoolz) that contains more utility functions for iterables. Or perhaps a rocklin_util project for the general utility functions that I often use. At times I’ve thought of just a groupby project that has only one function, also named groupby, e.g. from groupby import groupby.
Which decision is best? What other functions should live in the project that houses groupby?
To me this question is subjective and all answers I can come up with feel artificial. If you have ideas I’d love to hear them. Please post in the comments below.
At the moment my answer is that package management is not the correct abstraction. Instead perhaps the solution is to manage functions directly. I have some thoughts on this but this post is already longer than I like.
Footnote
itertools has a function called groupby. It is a streaming variant of this operation. It requires that all elements of a group be adjacent to one another and so does not satisfy the traditional interface. I know of no other implementation within the standard library. SymPy has a version which it calls sift.
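The adjacency requirement can be seen in a short sketch. The list of names is illustrative, and the alias stream_groupby is introduced here only to keep the standard-library function distinct from the groupby discussed in this post.

```python
from itertools import groupby as stream_groupby

names = ['Alice', 'Bob', 'Charlie', 'Dan', 'Edith', 'Frank']

# itertools.groupby takes the iterable first and the key function second.
# It starts a new group every time the key changes, so items with the same
# key that are not adjacent end up in separate groups.
runs = [(k, list(g)) for k, g in stream_groupby(names, key=len)]
# 'Bob' and 'Dan' both have length 3 but appear in two different runs.

# Sorting by the key first makes equal keys adjacent, at the cost of a sort
# and of losing the one-pass, streaming behavior.
by_length = {k: list(g)
             for k, g in stream_groupby(sorted(names, key=len), key=len)}
```

Because Python's sort is stable, items within each group keep their original relative order after sorting by the key.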