Blaze Datasets

This work is supported by Continuum Analytics and the XDATA Grant as part of the Blaze Project

tl;dr Blaze aids exploration by supporting full databases and collections of datasets.

This post was composed using Blaze version 0.6.6. You can follow along with the following conda command.

conda install -c blaze blaze=0.6.6

When we encounter new data we need to explore broadly to find what exists before we can meaningfully perform analyses. The tools for this task are often overlooked.

This post outlines how Blaze explores collections and hierarchies of datasets, cases where one entity like a database or file or directory might hold many tables or arrays. We use examples from HDF5 files and SQL databases. Blaze understands how the underlying libraries work so that you don’t have to.

For example, if we want to understand the contents of this set of HDF5 files encoding meteorological data then we need to navigate a hierarchy of arrays. This is common among HDF5 files.

Typically we navigate these files in Python with h5py or pytables.

>>> import h5py
>>> f = h5py.File('OMI-Aura_L2-OMAERO_2014m1105t2304-o54838_v003-2014m1106t215558.he5')

HDF5 files organize datasets with an internal file system. The h5py library accesses this internal file system through successive calls to .keys() and item access.

>>> f.keys()
['HDFEOS', 'HDFEOS INFORMATION']

>>> f['/HDFEOS'].keys()
['ADDITIONAL', 'SWATHS']

>>> f['/HDFEOS/SWATHS'].keys()
['ColumnAmountAerosol']

>>> f['/HDFEOS/SWATHS/ColumnAmountAerosol'].keys()
['Data Fields', 'Geolocation Fields']

>>> f['/HDFEOS/SWATHS/ColumnAmountAerosol/Data Fields'].keys()
['AerosolIndexUV',
 'AerosolIndexVIS',
 'AerosolModelMW',
 'AerosolModelsPassedThreshold',
 'AerosolOpticalThicknessMW',
 ...

>>> f['/HDFEOS/SWATHS/ColumnAmountAerosol/Data Fields/TerrainPressure']
<HDF5 dataset "TerrainPressure": shape (1643, 60), type "<i2">

>>> f['/HDFEOS/SWATHS/ColumnAmountAerosol/Data Fields/TerrainPressure'][:]
array([[1013, 1013, 1013, ..., 1013, 1013, 1013],
       [1013, 1013, 1013, ..., 1013, 1013, 1013],
       [1013, 1013, 1013, ..., 1013, 1013, 1013],
       ...,
       [1010,  992,  986, ..., 1013, 1013, 1013],
       [ 999,  983,  988, ..., 1013, 1013, 1013],
       [1006,  983,  993, ..., 1013, 1013, 1013]], dtype=int16)

This interaction between programmer and interpreter feels like a long and awkward conversation.

Blaze improves the exploration user experience by treating the entire HDF5 file as a single Blaze variable. This allows users to both explore and compute on larger collections of data in the same workflow. This isn’t specific to HDF5 but works anywhere many datasets are bundled together.

Exploring a SQL Database

For example, a SQL database can be viewed as a collection of tables. Blaze makes it easy to to access a single table in a database using a string URI specifying both the database and the table name.

>>> iris = Data('sqlite:////my.db::iris')  # protocol://database::table-name
>>> iris
    sepal_length  sepal_width  petal_length  petal_width      species
0            5.1          3.5           1.4          0.2  Iris-setosa
1            4.9          3.0           1.4          0.2  Iris-setosa
2            4.7          3.2           1.3          0.2  Iris-setosa
3            4.6          3.1           1.5          0.2  Iris-setosa
4            5.0          3.6           1.4          0.2  Iris-setosa
...

This only works if we know what table we want ahead of time. The approach above assumes that the user is already familiar with their data. To resolve this problem we omit the table name and access the database as a variable instead. We use the same interface to access the entire database as we would a specific table.

>>> db = Data('sqlite:////my.db')  # protocol://database
>>> db
Data:       Engine(sqlite:////home/mrocklin/workspace/blaze/blaze/examples/data/iris.db)
DataShape:  {
  iris: var * {
    sepal_length: ?float64,
    sepal_width: ?float64,
    petal_length: ?float64,
    petal_width: ?float64,
    species: ?string
  ...

The db expression points to a SQLAlchemy engine. We print the engine alongside a truncated datashape showing the metadata of the tables in that database. Note that the datashape maps table names to datashapes of those tables, in this case a variable length collection of records with fields for measurements of flowers.

Blaze isn’t doing any work of the grunt work here, SQLAlchemy is. SQLAlchemy is a mature Python library that interacts with a wide variety of SQL databases. It provides both database reflection (as we see above) along with general querying (as we see below). Blaze provides a convenient front-end.

We seamlessly transition from exploration to computation. We query for the shortest and longest sepal per species.

>>> by(db.iris.species, shortest=db.iris.sepal_length.min(),
...                      longest=db.iris.sepal_length.max())
           species  longest  shortest
0      Iris-setosa      5.8       4.3
1  Iris-versicolor      7.0       4.9
2   Iris-virginica      7.9       4.9

Blaze doesn’t pull data into local memory, instead it generates SQLAlchemy which generates SQL which executes on the foreign database. The (much smaller) result is pulled back into memory and rendered nicely using Pandas.

A Larger Database

Improved metadata discovery on SQL databases overlaps with the excellent work done by yhat on db.py. We steal their example, the Lahman Baseball statistics database, as an example of a richer database with a greater variety of tables.

>>> lahman = Data('sqlite:///baseball-archive-2012.sqlite')
>>> lahman.dshape  # ask for the entire datashape
dshape("""{
  battingpost: var * {
    yearID: ?int32,
    round: ?string,
    playerID: ?string,
    teamID: ?string,
    lgID: ?string,
    G: ?int32,
    AB: ?int32,
    R: ?int32,
    H: ?int32,
    2B: ?int32,
    3B: ?int32,
    HR: ?int32,
    RBI: ?int32,
    SB: ?int32,
    CS: ?int32,
    BB: ?int32,
    SO: ?int32,
    IBB: ?int32,
    HBP: ?int32,
    SH: ?int32,
    SF: ?int32,
    GIDP: ?int32
    },
  awardsmanagers: var * {
    managerID: ?string,
    awardID: ?string,
    yearID: ?int32,
    lgID: ?string,
    tie: ?string,
    notes: ?string
    },
  hofold: var * {
    hofID: ?string,
    yearid: ?int32,
    votedBy: ?string,
    ballots: ?int32,
    votes: ?int32,
    inducted: ?string,
    category: ?string
    },
  salaries: var * {
    yearID: ?int32,
    teamID: ?string,
    lgID: ?string,
    playerID: ?string,
    salary: ?float64
    },
  pitchingpost: var * {
    playerID: ?string,
    yearID: ?int32,
    round: ?string,
    teamID: ?string,
    lgID: ?string,
    W: ?int32,
    L: ?int32,
    G: ?int32,
    GS: ?int32,
    CG: ?int32,
    SHO: ?int32,
    SV: ?int32,
    IPouts: ?int32,
    H: ?int32,
    ER: ?int32,
    HR: ?int32,
    BB: ?int32,
    SO: ?int32,
    BAOpp: ?float64,
    ERA: ?float64,
    IBB: ?int32,
    WP: ?int32,
    HBP: ?int32,
    BK: ?int32,
    BFP: ?int32,
    GF: ?int32,
    R: ?int32,
    SH: ?int32,
    SF: ?int32,
    GIDP: ?int32
    },
  managers: var * {
    managerID: ?string,
    yearID: ?int32,
    teamID: ?string,
    lgID: ?string,
    inseason: ?int32,
    G: ?int32,
    W: ?int32,
    L: ?int32,
    rank: ?int32,
    plyrMgr: ?string
    },
  teams: var * {
    yearID: ?int32,
    lgID: ?string,
    teamID: ?string,
    franchID: ?string,
    divID: ?string,
    Rank: ?int32,
    G: ?int32,
    Ghome: ?int32,
    W: ?int32,
    L: ?int32,
    DivWin: ?string,
    WCWin: ?string,
    LgWin: ?string,
    WSWin: ?string,
    R: ?int32,
    AB: ?int32,
    H: ?int32,
    2B: ?int32,
    3B: ?int32,
    HR: ?int32,
    BB: ?int32,
    SO: ?int32,
    SB: ?int32,
    CS: ?int32,
    HBP: ?int32,
    SF: ?int32,
    RA: ?int32,
    ER: ?int32,
    ERA: ?float64,
    CG: ?int32,
    SHO: ?int32,
    SV: ?int32,
    IPouts: ?int32,
    HA: ?int32,
    HRA: ?int32,
    BBA: ?int32,
    SOA: ?int32,
    E: ?int32,
    DP: ?int32,
    FP: ?float64,
    name: ?string,
    park: ?string,
    attendance: ?int32,
    BPF: ?int32,
    PPF: ?int32,
    teamIDBR: ?string,
    teamIDlahman45: ?string,
    teamIDretro: ?string
    },
  teamshalf: var * {
    yearID: ?int32,
    lgID: ?string,
    teamID: ?string,
    Half: ?string,
    divID: ?string,
    DivWin: ?string,
    Rank: ?int32,
    G: ?int32,
    W: ?int32,
    L: ?int32
    },
  fieldingpost: var * {
    playerID: ?string,
    yearID: ?int32,
    teamID: ?string,
    lgID: ?string,
    round: ?string,
    POS: ?string,
    G: ?int32,
    GS: ?int32,
    InnOuts: ?int32,
    PO: ?int32,
    A: ?int32,
    E: ?int32,
    DP: ?int32,
    TP: ?int32,
    PB: ?int32,
    SB: ?int32,
    CS: ?int32
    },
  master: var * {
    lahmanID: ?int32,
    playerID: ?string,
    managerID: ?string,
    hofID: ?string,
    birthYear: ?int32,
    birthMonth: ?int32,
    birthDay: ?int32,
    birthCountry: ?string,
    birthState: ?string,
    birthCity: ?string,
    deathYear: ?int32,
    deathMonth: ?int32,
    deathDay: ?int32,
    deathCountry: ?string,
    deathState: ?string,
    deathCity: ?string,
    nameFirst: ?string,
    nameLast: ?string,
    nameNote: ?string,
    nameGiven: ?string,
    nameNick: ?string,
    weight: ?int32,
    height: ?float64,
    bats: ?string,
    throws: ?string,
    debut: ?string,
    finalGame: ?string,
    college: ?string,
    lahman40ID: ?string,
    lahman45ID: ?string,
    retroID: ?string,
    holtzID: ?string,
    bbrefID: ?string
    },
  fieldingof: var * {
    playerID: ?string,
    yearID: ?int32,
    stint: ?int32,
    Glf: ?int32,
    Gcf: ?int32,
    Grf: ?int32
    },
  pitching: var * {
    playerID: ?string,
    yearID: ?int32,
    stint: ?int32,
    teamID: ?string,
    lgID: ?string,
    W: ?int32,
    L: ?int32,
    G: ?int32,
    GS: ?int32,
    CG: ?int32,
    SHO: ?int32,
    SV: ?int32,
    IPouts: ?int32,
    H: ?int32,
    ER: ?int32,
    HR: ?int32,
    BB: ?int32,
    SO: ?int32,
    BAOpp: ?float64,
    ERA: ?float64,
    IBB: ?int32,
    WP: ?int32,
    HBP: ?int32,
    BK: ?int32,
    BFP: ?int32,
    GF: ?int32,
    R: ?int32,
    SH: ?int32,
    SF: ?int32,
    GIDP: ?int32
    },
  managershalf: var * {
    managerID: ?string,
    yearID: ?int32,
    teamID: ?string,
    lgID: ?string,
    inseason: ?int32,
    half: ?int32,
    G: ?int32,
    W: ?int32,
    L: ?int32,
    rank: ?int32
    },
  appearances: var * {
    yearID: ?int32,
    teamID: ?string,
    lgID: ?string,
    playerID: ?string,
    G_all: ?int32,
    G_batting: ?int32,
    G_defense: ?int32,
    G_p: ?int32,
    G_c: ?int32,
    G_1b: ?int32,
    G_2b: ?int32,
    G_3b: ?int32,
    G_ss: ?int32,
    G_lf: ?int32,
    G_cf: ?int32,
    G_rf: ?int32,
    G_of: ?int32,
    G_dh: ?int32,
    G_ph: ?int32,
    G_pr: ?int32
    },
  halloffame: var * {
    hofID: ?string,
    yearid: ?int32,
    votedBy: ?string,
    ballots: ?int32,
    needed: ?int32,
    votes: ?int32,
    inducted: ?string,
    category: ?string
    },
  awardsplayers: var * {
    playerID: ?string,
    awardID: ?string,
    yearID: ?int32,
    lgID: ?string,
    tie: ?string,
    notes: ?string
    },
  schoolsplayers: var * {
    playerID: ?string,
    schoolID: ?string,
    yearMin: ?int32,
    yearMax: ?int32
    },
  schools: var * {
    schoolID: ?string,
    schoolName: ?string,
    schoolCity: ?string,
    schoolState: ?string,
    schoolNick: ?string
    },
  teamsfranchises: var * {
    franchID: ?string,
    franchName: ?string,
    active: ?string,
    NAassoc: ?string
    },
  allstarfull: var * {
    playerID: ?string,
    yearID: ?int32,
    gameNum: ?int32,
    gameID: ?string,
    teamID: ?string,
    lgID: ?string,
    GP: ?int32,
    startingPos: ?int32
    },
  tmp_batting: var * {
    playerID: ?string,
    yearID: ?int32,
    stint: ?int32,
    teamID: ?string,
    lgID: ?string,
    G: ?int32,
    G_batting: ?int32,
    AB: ?int32,
    R: ?int32,
    H: ?int32,
    2B: ?int32,
    3B: ?int32,
    HR: ?int32,
    RBI: ?int32,
    SB: ?int32,
    CS: ?int32,
    BB: ?int32,
    SO: ?int32,
    IBB: ?int32,
    HBP: ?int32,
    SH: ?int32,
    SF: ?int32,
    GIDP: ?int32,
    G_old: ?int32
    },
  seriespost: var * {
    yearID: ?int32,
    round: ?string,
    teamIDwinner: ?string,
    lgIDwinner: ?string,
    teamIDloser: ?string,
    lgIDloser: ?string,
    wins: ?int32,
    losses: ?int32,
    ties: ?int32
    },
  awardssharemanagers: var * {
    awardID: ?string,
    yearID: ?int32,
    lgID: ?string,
    managerID: ?string,
    pointsWon: ?int32,
    pointsMax: ?int32,
    votesFirst: ?int32
    },
  awardsshareplayers: var * {
    awardID: ?string,
    yearID: ?int32,
    lgID: ?string,
    playerID: ?string,
    pointsWon: ?float64,
    pointsMax: ?int32,
    votesFirst: ?float64
    },
  fielding: var * {
    playerID: ?string,
    yearID: ?int32,
    stint: ?int32,
    teamID: ?string,
    lgID: ?string,
    POS: ?string,
    G: ?int32,
    GS: ?int32,
    InnOuts: ?int32,
    PO: ?int32,
    A: ?int32,
    E: ?int32,
    DP: ?int32,
    PB: ?int32,
    WP: ?int32,
    SB: ?int32,
    CS: ?int32,
    ZR: ?float64
    }
  }""")

Seeing at once all the tables in the database, all the columns in those tables, and all the types of those columns provides a clear and comprehensive overview of our data. We represent this information as a datashape, a type system covers everything from numpy arrays to SQL databases and Spark collections.

We use standard Blaze expressions to navigate more deeply. Things like auto-complete work fine.

>>> lahman.<tab>
lahman.allstarfull          lahman.fieldingpost         lahman.pitchingpost
lahman.appearances          lahman.fields               lahman.relabel
lahman.awardsmanagers       lahman.halloffame           lahman.salaries
lahman.awardsplayers        lahman.hofold               lahman.schema
lahman.awardssharemanagers  lahman.isidentical          lahman.schools
lahman.awardsshareplayers   lahman.like                 lahman.schoolsplayers
lahman.battingpost          lahman.managers             lahman.seriespost
lahman.data                 lahman.managershalf         lahman.teams
lahman.dshape               lahman.map                  lahman.teamsfranchises
lahman.fielding             lahman.master               lahman.teamshalf
lahman.fieldingof           lahman.pitching             lahman.tmp_batting

And we finish by a fun multi-table computation, joining two tables on year, team, and league and then computing the average salary by team name and year

>>> joined = join(lahman.teams, lahman.salaries, ['yearID', 'teamID', 'lgID'])

>>> by(joined[['name', 'yearID']], average_salary=joined.salary.mean())
                    name  yearID  average_salary
0         Anaheim Angels    1997  1004370.064516
1         Anaheim Angels    1998  1214147.058824
2         Anaheim Angels    1999  1384704.150000
3         Anaheim Angels    2000  1715472.233333
4         Anaheim Angels    2001  1584505.566667
5         Anaheim Angels    2002  2204345.250000
6         Anaheim Angels    2003  2927098.777778
7         Anaheim Angels    2004  3723506.185185
8   Arizona Diamondbacks    1998   898527.777778
9   Arizona Diamondbacks    1999  2020705.852941
...

Looks good, we compute and store to CSV file with into

>>> into('salaries.csv',
...      by(j[['name', 'yearID']], total_salary=j.salary.mean()))

(Final result here: salaries.csv)

Beyond SQL

SQL isn’t unique, many systems hold collections or hierarchies of datasets. Blaze supports navigation over Mongo databases, Blaze servers, HDF5 files, or even just dictionaries of pandas DataFrames or CSV files.

>>> d = {'accounts 1': CSV('accounts_1.csv'),
...      'accounts 2': CSV('accounts_2.csv')}

>>> accounts = Data(d)

>>> accounts.dshape
dshape("""{
  accounts 1: var * {id: ?int64, name: string, amount: ?int64},
  accounts 2: var * {id: ?int64, name: string, amount: ?int64}
  }""")

None of this behavior was special-cased in Blaze. The same mechanisms that select a table from a SQL database select a column from a Pandas DataFrame.

Finish with HDF5 example

To conclude we revisit our motivating problem, HDF5 file navigation.

Raw H5Py

Recall that we previously had a long back-and-forth conversation with the Python interpreter.

>>> import h5py
>>> f = h5py.File('OMI-Aura_L2-OMAERO_2014m1105t2304-o54838_v003-2014m1106t215558.he5')

>>> f.keys()
['HDFEOS', 'HDFEOS INFORMATION']

>>> f['/HDFEOS'].keys()
['ADDITIONAL', 'SWATHS']
...

H5Py with Blaze expressions

With Blaze we get a quick high-level overview and an API that is shared with countless other systems.

>>> from blaze import Data
>>> d = Data(f)  # Wrap h5py file with Blaze interactive expression
>>> d
Data:       <HDF5 file "OMI-Aura_L2-OMAERO_2014m1105t2304-o54838_v003-2014m1106t215558.he5" (mode r+)>
DataShape:  {
  HDFEOS: {
    ADDITIONAL: {FILE_ATTRIBUTES: {}},
    SWATHS: {
      ColumnAmountAerosol: {
        Data Fields: {
          AerosolIndexUV: 1643 * 60 * int16,
  ...

By default we see the data and a truncated datashape.

We ask for the datashape explicitly to see the full picture.

>>> d.dshape
dshape("""{
  HDFEOS: {
    ADDITIONAL: {FILE_ATTRIBUTES: {}},
    SWATHS: {
      ColumnAmountAerosol: {
        Data Fields: {
          AerosolIndexUV: 1643 * 60 * int16,
          AerosolIndexVIS: 1643 * 60 * int16,
          AerosolModelMW: 1643 * 60 * uint16,
          AerosolModelsPassedThreshold: 1643 * 60 * 10 * uint16,
          AerosolOpticalThicknessMW: 1643 * 60 * 14 * int16,
          AerosolOpticalThicknessMWPrecision: 1643 * 60 * int16,
          AerosolOpticalThicknessNUV: 1643 * 60 * 2 * int16,
          AerosolOpticalThicknessPassedThreshold: 1643 * 60 * 10 * 9 * int16,
          AerosolOpticalThicknessPassedThresholdMean: 1643 * 60 * 9 * int16,
          AerosolOpticalThicknessPassedThresholdStd: 1643 * 60 * 9 * int16,
          CloudFlags: 1643 * 60 * uint8,
          CloudPressure: 1643 * 60 * int16,
          EffectiveCloudFraction: 1643 * 60 * int8,
          InstrumentConfigurationId: 1643 * uint8,
          MeasurementQualityFlags: 1643 * uint8,
          NumberOfModelsPassedThreshold: 1643 * 60 * uint8,
          ProcessingQualityFlagsMW: 1643 * 60 * uint16,
          ProcessingQualityFlagsNUV: 1643 * 60 * uint16,
          RootMeanSquareErrorOfFitPassedThreshold: 1643 * 60 * 10 * int16,
          SingleScatteringAlbedoMW: 1643 * 60 * 14 * int16,
          SingleScatteringAlbedoMWPrecision: 1643 * 60 * int16,
          SingleScatteringAlbedoNUV: 1643 * 60 * 2 * int16,
          SingleScatteringAlbedoPassedThreshold: 1643 * 60 * 10 * 9 * int16,
          SingleScatteringAlbedoPassedThresholdMean: 1643 * 60 * 9 * int16,
          SingleScatteringAlbedoPassedThresholdStd: 1643 * 60 * 9 * int16,
          SmallPixelRadiancePointerUV: 1643 * 2 * int16,
          SmallPixelRadiancePointerVIS: 1643 * 2 * int16,
          SmallPixelRadianceUV: 6783 * 60 * float32,
          SmallPixelRadianceVIS: 6786 * 60 * float32,
          SmallPixelWavelengthUV: 6783 * 60 * uint16,
          SmallPixelWavelengthVIS: 6786 * 60 * uint16,
          TerrainPressure: 1643 * 60 * int16,
          TerrainReflectivity: 1643 * 60 * 9 * int16,
          XTrackQualityFlags: 1643 * 60 * uint8
          },
        Geolocation Fields: {
          GroundPixelQualityFlags: 1643 * 60 * uint16,
          Latitude: 1643 * 60 * float32,
          Longitude: 1643 * 60 * float32,
          OrbitPhase: 1643 * float32,
          SolarAzimuthAngle: 1643 * 60 * float32,
          SolarZenithAngle: 1643 * 60 * float32,
          SpacecraftAltitude: 1643 * float32,
          SpacecraftLatitude: 1643 * float32,
          SpacecraftLongitude: 1643 * float32,
          TerrainHeight: 1643 * 60 * int16,
          Time: 1643 * float64,
          ViewingAzimuthAngle: 1643 * 60 * float32,
          ViewingZenithAngle: 1643 * 60 * float32
          }
        }
      }
    },
  HDFEOS INFORMATION: {
    ArchiveMetadata.0: string[65535, 'A'],
    CoreMetadata.0: string[65535, 'A'],
    StructMetadata.0: string[32000, 'A']
    }
  }""")

When we see metadata for everything in this dataset all at once the full structure becomes apparent. We have two collections of arrays, all shaped (1643, 60); the collection in Data Fields holds what appear to be weather measurements while the collection in Geolocation Fields holds coordinate information.

Moreover we can quickly navigate this structure to compute relevant information.

>>> d.HDFEOS.SWATHS.ColumnAmountAerosol.Geolocation_Fields.Latitude
array([[-67., -67., -67., ..., -61., -60., -60.],
       [-67., -67., -67., ..., -61., -60., -60.],
       [-67., -67., -68., ..., -61., -61., -60.],
       ...,
       [ 69.,  70.,  71., ...,  85.,  84.,  84.],
       [ 69.,  70.,  71., ...,  85.,  85.,  84.],
       [ 69.,  70.,  71., ...,  85.,  85.,  84.]], dtype=float32)

>>> d.HDFEOS.SWATHS.ColumnAmountAerosol.Geolocation_Fields.Longitude
array([[  46.,   43.,   40., ...,   -3.,   -5.,   -7.],
       [  46.,   43.,   40., ...,   -3.,   -5.,   -7.],
       [  46.,   43.,   40., ...,   -4.,   -5.,   -7.],
       ...,
       [ 123.,  124.,  124., ..., -141., -131., -121.],
       [ 123.,  124.,  124., ..., -141., -130., -120.],
       [ 123.,  123.,  124., ..., -140., -130., -119.]], dtype=float32)

>>> d.HDFEOS.SWATHS.ColumnAmountAerosol.Data_Fields.TerrainPressure
array([[1013, 1013, 1013, ..., 1013, 1013, 1013],
       [1013, 1013, 1013, ..., 1013, 1013, 1013],
       [1013, 1013, 1013, ..., 1013, 1013, 1013],
       ...,
       [1010,  992,  986, ..., 1013, 1013, 1013],
       [ 999,  983,  988, ..., 1013, 1013, 1013],
       [1006,  983,  993, ..., 1013, 1013, 1013]], dtype=int16)

>>> d.HDFEOS.SWATHS.ColumnAmountAerosol.Data_Fields.TerrainPressure.min()
620

Final Thoughts

Often the greatest challenge is finding what you already have.

Discovery and exploration are just as important as computation. By extending the Blaze’s expression system to hierarchies of datasets we create a smooth user experience from first introductions to data all the way to analytic queries and saving results.

This work was easy. The pluggable architecture of Blaze made it surprisingly simple to extend the Blaze model from tables and arrays to collections of tables and arrays. We wrote about 40 significant lines of code for each supported backend.

Blaze Datasets Often the greatest challenge is finding what you already have

Motivating problem - Intuitive HDF5 File Navigation