HDF5LabNotebook: Results stored in a standard format

class epyc.HDF5LabNotebook(name: str, create: bool = False, description: str = None)

A lab notebook that persists itself to an HDF5 file. HDF5 is a very common format for sharing large scientific datasets, allowing epyc to interoperate with a larger toolchain.

epyc is built on top of the h5py Python binding to HDF5, which handles most of the heavy lifting, including the type machinery that maps values to and from numpy. Note that the limitations of HDF5’s type system mean that some values may have different types when read than when acquired. (See HDF5 type management for details.)

The name of the notebook can be a file or a URL. Only files can be created or updated: if a URL is provided then the notebook will be read and immediately marked as locked. This implies that create=True won’t work in conjunction with URLs.

Important

Note that because of the design of the requests library used for handling URLs, using a file:-schema URL will result in an exception being raised. Use filenames for accessing files.

Parameters:
  • name – HDF5 file or URL backing the notebook
  • create – (optional) if True, erase any existing file (defaults to False)
  • description – (optional) free text description of the notebook

Managing result sets

HDF5LabNotebook.addResultSet(tag: str, description: str = None) → epyc.resultset.ResultSet

Add the necessary structure to the underlying file when creating the new result set. This ensures that, even if no results are added, there will be structure in the persistent store to indicate that the result set was created.

Parameters:
  • tag – the tag
  • description – (optional) the description
Returns:

the result set

Persistence

HDF5 notebooks are persistent, with the data being saved into a file identified by the notebook’s name. Committing the notebook forces any changes to be saved.

HDF5LabNotebook.isPersistent() → bool

Return True to indicate the notebook is persisted to an HDF5 file.

Returns:True
HDF5LabNotebook.commit()

Persist any changes in the result sets in the notebook to disc.

HDF5 file access

The notebook will open the underlying HDF5 file as required, and generally will leave it open. If you want more control, for example to make sure that the file is closed and finalised, HDF5LabNotebook also behaves as a context manager and so can be used in code such as:

nb = HDF5LabNotebook(name='test.h5')
with nb.open():
    nb.addResult(rc1)
    nb.addResult(rc2)

After this the notebook’s underlying file will be closed, with the new results having been saved. Alternatively simply use LabNotebook.commit() to flush any changes to the underlying file, for example:

nb = HDF5LabNotebook(name='test.h5')
nb.addResult(rc1)
nb.addResult(rc2)
nb.commit()

Remote notebooks

Remote notebooks can be accessed by providing a URL instead of a filename to the notebook constructor:

nb = HDF5LabNotebook(name='http://example.com/test.h5')

Since a remote notebook can’t be updated, any notebook loaded from a URL is treated as “finished” (as though you’d called LabNotebook.finish()).

Structure of the HDF5 file

Note

The structure inside an HDF5 file is only really of interest if you’re planning on using an epyc-generated dataset with some other tools.

HDF5 is a “container” file format, meaning that it behaves like an archive containing directory-like structure. epyc structures its storage by using a group for each result set, held within the “root” group of the container. The root group has attributes that hold “housekeeping” information about the notebook.

HDF5LabNotebook.VERSION = 'version'

Attribute holding the version of file structure used.

HDF5LabNotebook.DESCRIPTION = 'description'

Attribute holding the notebook and result set descriptions.

HDF5LabNotebook.CURRENT = 'current-resultset'

Attribute holding the tag of the current result set.

Any attributes of the notebook are also written as top-level attributes in this group. Then, for each ResultSet in the notebook, there is a group whose name corresponds to the result set’s tag. This group contains any attributes of the result set, always including three attributes storing the metadata, parameter, and experimental result field names.

Note

Attributes are all held as strings at the moment. There’s a case for giving them richer types in the future.

The attributes also include the description of the result set and a flag indicating whether it has been locked.

HDF5LabNotebook.DESCRIPTION = 'description'

Attribute holding the notebook and result set descriptions.

HDF5LabNotebook.LOCKED = 'locked'

Attribute flagging a result set or notebook as being locked to further changes.

Within the group are two datasets: one holding the results of experiments, and one holding pending results yet to be resolved.

HDF5LabNotebook.RESULTS_DATASET = 'results'

Name of results dataset within the HDF5 group for a result set.

HDF5LabNotebook.PENDINGRESULTS_DATASET = 'pending'

Name of pending results dataset within the HDF5 group for a result set.

If there are no pending results then there will be no pending results dataset. This makes for cleaner interaction when archiving datasets, as there are no extraneous datasets hanging around.

So an epyc notebook containing a result set called “my_data” will give rise to an HDF5 file containing a group called “my_data”, within which will be a dataset named by HDF5LabNotebook.RESULTS_DATASET and possibly another dataset named by HDF5LabNotebook.PENDINGRESULTS_DATASET. There will also be a group named by LabNotebook.DEFAULT_RESULTSET which is where results are put “by default” (i.e., if you don’t define explicit result sets).
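As an illustration, the layout for a notebook containing a result set tagged “my_data” (with some pending results) would look roughly like this. Group and dataset names follow the constants defined above; the name of the default result set’s group is whatever LabNotebook.DEFAULT_RESULTSET defines:

```
/                           # root group
  @version                  # attribute: file structure version
  @description              # attribute: notebook description
  @current-resultset        # attribute: tag of the current result set
  my_data/                  # one group per result set
    @description            # attribute: result set description
    @locked                 # attribute: locked flag
    results                 # dataset of completed experimental results
    pending                 # dataset of pending results (only if any exist)
  <DEFAULT_RESULTSET>/      # group for the default result set
```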

HDF5 type management

epyc takes a very Pythonic view of experimental results, storing them in a results dict with an unconstrained set of keys and types: an experiment can store anything it likes as a result. The ResultSet class handles mapping Python types to numpy dtypes: see Type mapping and inference for details.

The HDF5 type mapping follows the numpy approach closely. Some types are mapped more restrictively than in numpy: this is as one would expect, of course, since HDF5 is essentially an archive format whose files need to be readable by a range of tools over a long period. Specifically this affects exceptions, tracebacks, and datetime values, all of which are mapped to HDF5 strings (in ISO standard date format for the latter). Strings are in turn stored in ASCII, not Unicode.

A little bit of patching happens for “known” metadata values (specifically Experiment.START_TIME and Experiment.END_TIME) which are automatically patched to datetime instances when loaded. List-valued results are supported, and can be “ragged” (not have the same length) across results.
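The round trip for datetime metadata can be sketched as follows. This is a minimal illustration using only the standard library, not epyc’s own conversion code:

```python
from datetime import datetime

# Storing: datetime values are serialised as ISO-format strings,
# since HDF5 has no native timestamp type.
start = datetime(2024, 1, 15, 9, 30, 0)
stored = start.isoformat()            # '2024-01-15T09:30:00'

# Loading: "known" metadata fields such as Experiment.START_TIME
# are patched back into datetime instances.
recovered = datetime.fromisoformat(stored)

assert recovered == start
```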

Warning

Because of the differences between Python’s and HDF5’s type systems you may not get back a value with exactly the same type as the one you saved. Specifically, lists come back as numpy arrays. The values and the behaviours are the same, though. If you need a specific type, be sure to cast the value before use.
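For example, a list-valued result comes back as a numpy array, which holds the same values as the original list but has a different type. Casting recovers the list (an illustrative sketch, not epyc’s own code):

```python
import numpy as np

saved = [1, 2, 3]            # a list-valued experimental result
loaded = np.array(saved)     # what you get back from the notebook

# Same values and behaviour...
assert list(loaded) == saved
assert loaded[0] == saved[0]

# ...but not the same type, so cast if a real list is needed.
assert not isinstance(loaded, list)
restored = loaded.tolist()
assert isinstance(restored, list) and restored == saved
```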

See the experiment types reference for a list of “safe” types.

Tuning parameters

Some parameters are available for tuning the notebook’s behaviour.

The default size of a new dataset can be increased if desired, to pre-allocate space for more results.

HDF5LabNotebook.DefaultDatasetSize = 10

Default initial size for a new HDF5 dataset.

The dataset will expand and contract automatically to accommodate the size of a result set: it’s hard to see why this value would need to be changed.

Low-level protocol

The low-level handling of the HDF5 file is performed by a small number of private methods – never needed directly in client code, but possibly in need of sub-classing for some specialist applications.

Three methods handle file creation and access.

HDF5LabNotebook._create(name: str)

Create the HDF5 file to back this notebook.

Parameters:
  • name – the filename
HDF5LabNotebook._open()

Open the HDF5 file that backs this notebook.

HDF5LabNotebook._close()

Close the underlying HDF5 file.

Five other methods control notebook-level and result-set-level I/O. These all assume that the file is opened and closed around them, and will fail if not.

HDF5LabNotebook._load()

Load the notebook and all result sets.

HDF5LabNotebook._save()

Save all dirty result sets. These are written out completely.

HDF5LabNotebook._purge()

Delete any HDF5 datasets that relate to deleted result sets.

HDF5LabNotebook._read(tag: str)

Read the given result set into memory.

Parameters:tag – the result set tag
HDF5LabNotebook._write(tag: str)

Write the given result set to the file.

Parameters:tag – the result set tag

There are also two private methods that handle the conversion of numpy dtypes to the (ever so slightly different) h5py dtypes.

HDF5LabNotebook._HDF5simpledtype(dtype: numpy.dtype) → numpy.dtype

Patch a simple numpy dtype to the formats available in HDF5.

Parameters:dtype – the numpy dtype
Returns:the HDF5 dtype
HDF5LabNotebook._HDF5dtype(dtype: numpy.dtype) → numpy.dtype

Patch a numpy dtype into its HDF5 equivalent. This method handles structured types with named fields.

Parameters:dtype – the numpy dtype
Returns:the HDF5 dtype
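A simplified sketch of the kind of patching involved (hypothetical: the function name and cases below are illustrative only, and the real methods handle more situations). Since HDF5 strings are stored in ASCII and there is no native timestamp type, a Unicode or datetime dtype must be mapped to a fixed-width byte-string dtype:

```python
import numpy as np

def hdf5_simple_dtype(dt: np.dtype) -> np.dtype:
    """Illustrative stand-in for _HDF5simpledtype: map a simple
    numpy dtype to one that HDF5 can store."""
    if dt.kind == 'U':                            # Unicode string
        # numpy Unicode uses 4 bytes per character; map to
        # a fixed-width byte string of the same length
        return np.dtype(f'S{dt.itemsize // 4}')
    if dt.kind == 'M':                            # datetime64
        return np.dtype('S26')                    # ISO-format string
    return dt                                     # numeric types pass through

assert hdf5_simple_dtype(np.dtype('U10')) == np.dtype('S10')
assert hdf5_simple_dtype(np.dtype('float64')) == np.dtype('float64')
```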