GroupBy and Package Management
I believe that packages are an artificial abstraction. I’ll demonstrate this with my favorite function,
groupby partitions a collection by a key function. Here is a test demonstrating two examples
Groupby is a cousin of
filter, other higher order functions that take a function and a collection as inputs. Groupby returns a dictionary that associates groups of that collection to their value under that function. For example names
'Dan' have value
3 under the function
len and so were associated to the group with key
'Alice' was put into the group associated to
5. It places all items of the collection into the proper group in one pass of the data.
Groupby is a common function implemented in several standard libraries. It exists in C#, Lisp languages (Scheme, Clojure), and is a keyword in SQL. It doesn’t exist in Python (see footnote) but fortunately is easy to implement.
A final example, the common function
histogram is a trivial extension of
Map, filter, and reduce/fold each replace a commonly occuring programming pattern. When these functions don’t fit we often revert to traditional for loops or list comprehensions. Groupby efficiently handles another surprisingly large class of problems. You probably implement groupby relatively frequently without realizing it.
As I mentioned, this is my favorite function; I implement it in every project I have. It lives in a
util.py file at the base of each project’s directory structure. This is code duplication. Code duplication is bad. The common solution is to put
groupby into a separate package and then import that single package within each project.
In which project should
I’ve often thought of making an
itertools2 (in my head often called
itertoolz) that contains more utility functions for iterables. Or perhaps a
rocklin_util project the for general utility functions that I often use. At times I’ve thought of just a
groupby project that has only one function, also named
Which decision is best? What other functions should live in the project that houses
To me this question is subjective and all answers I can come up with feel artificial. If you have ideas I’d love to hear them. Please post in the comments below.
At the moment my answer is that package management is not the correct abstraction. Instead perhaps the solution is to manage functions directly. I have some thoughts on this but this post is already longer than I like.
itertoolshas a function called groupby. It is a streaming variant of this operation. It requires that all elements of a group be adjacent to one another and so does not satisfy the traditional interface. I know of no other implementation within the standard library. SymPy has a version which it calls
blog comments powered by Disqus