HDF5LabNotebook: Results stored in a standard format
- class epyc.HDF5LabNotebook(name: str, create: bool = False, description: str = None)
A lab notebook that persists itself to an HDF5 file. HDF5 is a very common format for sharing large scientific datasets, allowing epyc to interoperate with a larger toolchain. epyc is built on top of the h5py Python binding to HDF5, which handles most of the heavy lifting, matching its typing machinery with numpy. Note that the limitations of HDF5’s types mean that some values may have different types when read than when acquired. (See HDF5 type management for details.)
The name of the notebook can be a file or a URL. Only files can be created or updated: if a URL is provided then the notebook will be read and immediately marked as locked. This implies that create=True won’t work in conjunction with URLs.
Important
Note that because of the design of the requests library used for handling URLs, using a file:-schema URL will result in an exception being raised. Use filenames for accessing files.
Parameters: - name – HDF5 file or URL backing the notebook
- create – (optional) if True, erase any existing file (defaults to False)
- description – (optional) free text description of the notebook
Managing result sets
- HDF5LabNotebook.addResultSet(tag: str, description: str = None) → epyc.resultset.ResultSet
Add the necessary structure to the underlying file when creating the new result set. This ensures that, even if no results are added, there will be structure in the persistent store to indicate that the result set was created.
Parameters: - tag – the tag
- description – (optional) the description
Returns: the result set
Persistence
HDF5 notebooks are persistent, with the data being saved into a file identified by the notebook’s name. Committing the notebook forces any changes to be saved.
- HDF5LabNotebook.isPersistent() → bool
Return True to indicate the notebook is persisted to an HDF5 file.
Returns: True
- HDF5LabNotebook.commit()
Persist any changes in the result sets in the notebook to disc.
HDF5 file access
The notebook will open the underlying HDF5 file as required, and
generally will leave it open. If you want more control, for example to
make sure that the file is closed and finalised,
HDF5LabNotebook
also behaves as a context manager and so can
be used in code such as:
nb = HDF5LabNotebook(name='test.h5')
with nb.open():
    nb.addResult(rc1)
    nb.addResult(rc2)
After this the notebook’s underlying file will be closed, with the new
results having been saved. Alternatively simply use
LabNotebook.commit()
to flush any changes to the underlying
file, for example:
nb = HDF5LabNotebook(name='test.h5')
nb.addResult(rc1)
nb.addResult(rc2)
nb.commit()
Remote notebooks
Remote notebooks can be accessed by providing a URL instead of a filename to the notebook constructor:
nb = HDF5LabNotebook(name='http://example.com/test.h5')
Since remote updating doesn’t usually work, any notebook loaded from a
URL is treated as “finished” (as though you’d called
LabNotebook.finish()).
Structure of the HDF5 file
Note
The structure inside an HDF5 file is only really of interest if you’re planning on
using an epyc-generated dataset with some other tools.
HDF5 is a “container” file format, meaning that it behaves like an archive containing
a directory-like structure. epyc structures its storage by using a group for each
result set, held within the “root” group of the container. The root group has
attributes that hold “housekeeping” information about the notebook.
- HDF5LabNotebook.VERSION = 'version'
Attribute holding the version of file structure used.
- HDF5LabNotebook.DESCRIPTION = 'description'
Attribute holding the notebook and result set descriptions.
- HDF5LabNotebook.CURRENT = 'current-resultset'
Attribute holding the tag of the current result set.
Any attributes of the notebook are also written as top-level
attributes in this group. Then, for each ResultSet in the
notebook, there is a group whose name corresponds to the result set’s
tag. This group contains any attributes of the result set, always
including three attributes storing the metadata, parameter, and
experimental result field names.
Note
Attributes are all held as strings at the moment. There’s a case for giving them richer types in the future.
The attributes also include the description of the result set and a flag indicating whether it has been locked.
- HDF5LabNotebook.DESCRIPTION = 'description'
Attribute holding the notebook and result set descriptions.
- HDF5LabNotebook.LOCKED = 'locked'
Attribute flagging a result set or notebook as being locked to further changes.
Within the group are two datasets: one holding the results of experiments, and one holding pending results yet to be resolved.
- HDF5LabNotebook.RESULTS_DATASET = 'results'
Name of results dataset within the HDF5 group for a result set.
- HDF5LabNotebook.PENDINGRESULTS_DATASET = 'pending'
Name of pending results dataset within the HDF5 group for a result set.
If there are no pending results then there will be no pending results dataset. This makes for cleaner interaction when archiving datasets, as there are no extraneous datasets hanging around.
So an epyc notebook containing a result set called “my_data” will
give rise to an HDF5 file containing a group called “my_data”, within
which will be a dataset named by HDF5LabNotebook.RESULTS_DATASET
and possibly another dataset named by
HDF5LabNotebook.PENDINGRESULTS_DATASET. There will also be a group
named by LabNotebook.DEFAULT_RESULTSET, which is where results are put
“by default” (i.e., if you don’t define explicit result sets).
HDF5 type management
epyc takes a very Pythonic view of experimental results, storing
them in a results dict with an unconstrained set of keys and
types: an experiment can store anything it likes as a result. The
ResultSet class handles mapping Python types to numpy
dtypes: see Type mapping and inference for details.
The HDF5 type mapping follows the numpy
approach closely. Some
types are mapped more restrictively than in numpy
: this is as one
would expect, of course, since HDF5 is essentially an archive format
whose files need to be readable by a range of tools over a long
period. Specifically this affects exceptions, tracebacks, and
datetime
values, all of which are mapped to HDF5 strings (in ISO
standard date format for the latter). Strings are in turn stored in
ASCII, not Unicode.
A little bit of patching happens for “known” metadata values
(specifically Experiment.START_TIME
and
Experiment.END_TIME
) which are automatically patched to
datetime
instances when loaded. List-valued results are supported,
and can be “ragged” (not have the same length) across results.
Warning
Because of the differences between Python’s and HDF5’s type
systems you may not get back a value with exactly the same type as
the one you saved. Specifically, lists come back as numpy
arrays. The values and the behaviours are the same, though. If you
need a specific type, be sure to cast the value before use.
See types-experiment for a list of “safe” types.
Tuning parameters
Some parameters are available for tuning the notebook’s behaviour.
The default size of a new dataset can be increased if desired, to pre-allocate space for more results.
- HDF5LabNotebook.DefaultDatasetSize = 10
Default initial size for a new HDF5 dataset.
The dataset will expand and contract automatically to accommodate the size of a result set: it’s hard to see why this value would need to be changed.
Low-level protocol
The low-level handling of the HDF5 file is performed by a small number of private methods – never needed directly in client code, but possibly in need of sub-classing for some specialist applications.
Three methods handle file creation and access.
- HDF5LabNotebook._create(name: str)
Create the HDF5 file to back this notebook.
Parameters: name – the filename
- HDF5LabNotebook._open()
Open the HDF5 file that backs this notebook.
- HDF5LabNotebook._close()
Close the underlying HDF5 file.
Five other methods control notebook-level and result-set-level I/O. These all assume that the file is opened and closed around them, and will fail if not.
- HDF5LabNotebook._load()
Load the notebook and all result sets.
- HDF5LabNotebook._save()
Save all dirty result sets. These are written out completely.
- HDF5LabNotebook._purge()
Delete any HDF5 datasets that relate to deleted result sets.
- HDF5LabNotebook._read(tag: str)
Read the given result set into memory.
Parameters: tag – the result set tag
- HDF5LabNotebook._write(tag: str)
Write the given result set to the file.
Parameters: tag – the result set tag
There are also two private methods that handle the conversion of
numpy
dtypes to the (ever so slightly different) h5py
dtypes.
- HDF5LabNotebook._HDF5simpledtype(dtype: numpy.dtype) → numpy.dtype
Patch a simple numpy dtype to the formats available in HDF5.
Parameters: dtype – the numpy dtype
Returns: the HDF5 dtype
- HDF5LabNotebook._HDF5dtype(dtype: numpy.dtype) → numpy.dtype
Patch a numpy dtype into its HDF5 equivalent. This method handles structured types with named fields.
Parameters: dtype – the numpy dtype
Returns: the HDF5 dtype