Thread First Pattern
My physics colleague is learning Python and asked me for a simple way to parse the following file header
CPMD CUBE FILE. OUTER LOOP: X, MIDDLE LOOP: Y, INNER LOOP: Z 3 0.000000 0.000000 0.000000 40 0.283459 0.000000 0.000000 40 0.000000 0.283459 0.000000 40 0.000000 0.000000 0.283459 8 0.000000 5.570575 5.669178 5.593517 1 0.000000 5.562867 5.669178 7.428055 1 0.000000 7.340606 5.669178 5.111259 -0.25568E-04 0.59213E-05 0.81068E-05 0.10868E-04 0.11313E-04 0.35999E-05 : : : : : : : : : : : : : : : : : : In this case there will be 40 x 40 x 40 floating point values : : : : : : : : : : : : : : : : : :
Normally for tabular data I would just point him towards
numpy.loadtxt or the
csv module. This particular dataset is a tad annoying though, look closely for the following complications
- There are three separate numerical tables
- Lines 2-5 are three lines with four columns
- Lines 6-9 are three lines with five columns
- Lines 11 onward is a traditional table with many float columns
- The first column to the first two sections contains integers instead of floating point numbers
csv module nor the
numpy.loadtxt/genfromtext functions are capable of wrapping this complexity. This is commonplace, we often have data that doesn’t quite fit our module’s expectations. At this point we often fall back to traditional programming.
My colleague’s solution was a sequence of for loops, appending data onto a set of lists.
This solution is standard but, in my colleague’s words, not particularity “cute”. He wanted to know if there was a “cute” language idiom to handle tasks like these. I’ve been programming in a functional style for a while now and decided to see if that paradigm could offer a better option. After some back and forth with him I stabilized on the following solution.
This solution is very different from my colleague’s. Rather than provide explicit instructions on how to manipulate the data at a low-level (for loop with list appends) it describes a sequence of high-level transformations. The
thread_first function takes the first argument (the data) and runs it through the following functions sequentially. I’ll describe those functions below:
||Our original input, a filename like ‘file.txt’|
||Open the file for reading|
||Take lines 2:5|
||Join them together into a single string|
||Turn that string into a file-like object|
||Parse into an array|
||Separate array into first and all other columns|
It is equivalent to the following lines:
Why this Solution is Worse
Here are my thoughts on why the functional solution is worse. If you have other thoughts please list them in the comments:
- It’s weird and the current base of programmers will be put off
- It uses functions that are not commonly known like
numpy.genfromtxt. It demands a richer vocabulary.
- It is inefficient because it opens and reads the file for each section. It does not optimally manage state.
Why this Solution is Better
Here are my thoughts on why the functional solution is better. Again, if you have other thoughts please do list them in the comments:
- The intent of the operation is more explicit.
- The component functions used are well tested. There are fewer opportunities for bugs.
- The information content of each term is high. This solution reinvents the minimum amount of technology.
Today I wouldn’t use this solution in production code mainly because my usual colleagues aren’t familiar with it. However I do think that the ideas behind the solution do have substantial merit. In particular
- Function reuse: Reinventing wheels is both wasteful to the programmer and harmful to the longevity the code.
- Standard library - The use of standard library functions supports more rapid understanding of your code.
- Composition - Mechanisms for function composition and encapsulation (like
thread_first) promote rapid development of novel and robust solutions.
In general I think that while for loops and low-level code are globally accessible they also demand significant investment to understand their particular role in a certain application. High-level/broad vocabulary solutions more directly present their intent but are limited to those programmers who understand them.
functoolz: The functional solution uses
thread_first, a function from the non-standard
blog comments powered by Disqus