Making data archives

Problem: Having expended a lot of time (both your own and your computers’) on producing a dataset in a notebook, you want to be able to store it and share it over a long period.

Solution: This is a perennial problem with computational science: how do we make data readable, and keep it that way? Even more than code (which we discussed under Reproducing experiments reliably), data suffers from “bit rot” and becomes unreadable, both in technical and semantic terms.

The technical part – a file that’s in an outdated format – is the easier problem to deal with. We can use a format that’s already survived the test of time, that has widespread support, and that – although it eventually will go out of date – will have emough commitment that it’ll be possible to convert and upgrade it. HDF5, as used by the HDF5LabNotebook, meets these criteria well, and can be accessed natively by epyc.

Note that epyc also records the class names of experiments in their results. This is only a guide, of course: there’s nothing that automatically identifies where the code of a class is stored, or which version was used. It’s possible to address these issues as part of dataset semantics, though.

The semantic problem requires that we maintain an understanding of what each field in a dataset means. At a trivial level, sensible field names help, as do free-text descriptions of how and why a datset was collected. This metadata is all stored within a persistent result set or notebook, and can be accessed when the notebook is re-loaded or used within some other tool.

One can be even more structured. Each parameter and result field in a result set (and each metadata field, for that matter) will presumably have a particular purpose and likely some units. We can use attributes to store this metadata too:

from epyc import HDF5LabNotebook

# load the notebook and give it a new description
with HDF5LabNotebook('my-important-dataset.h5') as nb:
    # set the description
    nb.setDescription('A notebook I want to understand later')

    # select the result set we want to annotate with metadata
    rs = nb.select('first-experiment')
    rs.setDescription('Some physics stuff')

    # create attributes for each parameter and result
    rs[MyExperiment.VELOCITY] = 'Velocity of particle (ms^-1)'
    rs[MyExperiment.MASS] = 'Mass of particle (g)'
    rs[MyExperiment.NPARTICLES] = 'Number of particls (number)'
    rs[MyExperiment.DENSITY] = 'Final particle density (m^-2)'

    # lock the result set against further updates
    rs.finish()

We’ve assumed we have a class MyExperiment that defines field names for its parameter and result fields. For each of these we create an attribute of the result set holding a text description and units. Now, when sometime later we examine the notebook, we’ll have at least some idea of what’s what. Admittedly that metadata isn’t machine-readable to allow a program to (for example) work out that masses are measured in grams: that would require a far more sophisticated system using ontologies to describe the structure of information. But it’s a start to have the information recorded in a human-readable form, closely associated with the data.

In particular application domains it may also be worth adhering to specific standards for metadata. The UK Digital Curation Centre maintains a list that may be useful.

Finally, we called ResultSet.finish() to finish and lock the result set. This will (hopefully) prevent accidental corruption, and will also tidy up the final file by cancelling any submitted-but-not-completed pending results. (Any such results will still be recorded in the dataset for audit purposes.)