Third tutorial: Larger notebooks

So far we’ve generated fairly modest datasets from experiments. That can change rapidly if you start running lots of repetitions of an experiment, if you want to collect the results of different (but related) experiments within the same notebook, or if you want to share these datasets with others or study them with tools other than epyc and your own analytics code. For these reasons we need to be able to handle larger-scale datasets.

Structuring larger datasets

A single result set is a homogeneous collection of results from experimental runs. Each “row” of the result set contains values for the same set of parameters, experimental results, and metadata. Typically that means that all the results in the result set were collected for the same experiment, or at least for experiments with the same overall structure.

That might not always be the case, though. You might want to collect together the results of lots of different sorts of experiment that share a common theme. An example might be a model of a disease that you want to run over different underlying models of human contacts, each described by a different combination of parameters. Putting all of these in the same result set could be quite confusing.

An epyc LabNotebook can therefore store multiple separate result sets, each with its own structure. You choose which result set to add results to by selecting that result set as “current”, and similarly can access the results in the selected result set.

Each result set has a unique “tag” that identifies it within the notebook. epyc places no restrictions on tags, except that they must be strings: a meaningful name would probably make sense.

When a notebook is first created, it has a single default result set ready to receive results immediately.

from epyc import Experiment, LabNotebook, ResultSet

nb = LabNotebook()
print(len(nb.resultSets()))

The default result set is named according to LabNotebook.DEFAULT_RESULTSET.

print(nb.resultSets())

We can add some results, assuming we have an experiment lying around:

e = MyExperiment()

# first result
params = dict(a=12, b='one')
rc = e.set(params).run()
nb.addResult(rc)

# second result
params['b']='two'
rc = e.set(params).run()
nb.addResult(rc)

These results will set the “shape” of the default result set.
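
Once results are in, you can pull them out for analysis: a notebook can return the results in the current result set as a pandas DataFrame. (A small sketch; the column names a and b follow the parameters used above.)

# extract the current result set as a pandas DataFrame,
# one row per experimental run
df = nb.dataframe()
print(df[['a', 'b']])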

You can create a new result set simply by providing a tag:

nb.addResultSet('my-first-results')

Adding a result set also makes it current. We can then add some other results from a different experiment, perhaps with a different set of parameters:

e = MyOtherExperiment()
params = dict(a=12, c=55.67)
rc = e.set(params).run()
nb.addResult(rc)

We can check how many results we have:

print(nb.numberOfResults())

The result will be 1: the number of results in the current result set. If we select the default result set instead, we’ll see those two results:

nb.select(LabNotebook.DEFAULT_RESULTSET)
print(nb.numberOfResults())

The two result sets are entirely separate and can be selected between as required. They can also be given attributes that, for example, describe the circumstances under which they were collected or the significance of the different parameters. This kind of documentation metadata becomes more important as datasets become larger and more complicated, and are stored for longer.
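
For example, attributes can be attached to the current result set. (A sketch, assuming dict-style attribute access on the result set; the keys used here are invented for illustration.)

rs = nb.current()

# record documentation metadata alongside the results
# (dict-style attribute access, assumed here)
rs['description'] = 'Sweep of parameter b at fixed a'
rs['author'] = 'me@example.com'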

Persistent datasets

We already met epyc’s JSONLabNotebook class, which stores notebooks using a portable JSON format. While JSON itself is standard, the notebook’s layout is epyc-specific and not compact: it takes a lot of disc space for large notebooks, and doesn’t immediately interoperate with other tools.

The HDF5LabNotebook by contrast uses the HDF5 file format which is supported by a range of other tools. It can be used in the same way as any other notebook:

from epyc import Lab, HDF5LabNotebook

lab = Lab(notebook=HDF5LabNotebook('mydata.h5'))

Changes to result sets are persisted into the underlying file. It’s possible to apply compression and other filters to optimise storage, and the file format is, as far as possible, self-describing, with metadata to allow it to be read and interpreted at a later date.
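
Because HDF5 is a standard format, the file can also be opened directly by other tools. (A minimal sketch using h5py, which must be installed; the group layout inside the file is epyc’s own, but listing the top level is enough to see the stored structure.)

import h5py

# open the notebook file directly and list its top-level groups
with h5py.File('mydata.h5', 'r') as f:
    for name in f:
        print(name)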

Pending results

Both multiple result sets and persistent notebooks really come into their own when combined with the use of ClusterLab instances to perform remote computations.

Once you have a connection to a cluster, you can start computations in multiple result sets.

from epyc import ClusterLab

lab = ClusterLab(profile="mycluster",
                 notebook=HDF5LabNotebook('mydata.h5', create=True))
nb = lab.notebook()

# perform some of the first experiments
nb.addResultSet('first-experiments')
lab['a'] = 12
lab['b'] = range(1000)
e = MyFirstExperiment()
lab.runExperiment(e)

# and then some others
nb.addResultSet('second-experiments')
lab['a'] = 15
lab['b'] = ['cat', 'dog', 'snake']
lab['c'] = range(200)
e = MySecondExperiment()
lab.runExperiment(e)

These experiments will give rise to “pending” results as they are computed on the cluster: 1000 experimental runs in the first case and 600 in the second (the sizes of the respective parameter spaces). Whenever we check the fraction of results that are ready in a dataset, any results that have completed are retrieved from the cluster and stored in the notebook:

nb.select('first-experiments')
print(lab.readyFraction())

Note that results that are ready are collected for all datasets, not just the one we query, but the fraction refers only to the selected current result set. This means that over time all the results in the notebook will be collected from the cluster and stored.
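
Since checking triggers retrieval, a simple polling loop is enough to drain all the pending results into the notebook. (A sketch; the polling interval is arbitrary.)

import time

# poll until every result in the current result set has arrived
while lab.readyFraction() < 1.0:
    time.sleep(10)            # arbitrary polling interval
print(nb.numberOfResults())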

Clusters support disconnected operation, which works well with persistent notebooks. See Using a cluster without staying connected to it for details.

Locking result sets

If you’ve just burned hundreds of core-hours on significant experiments, you want to be careful with the results! The persistent notebooks try to ensure that results obtained are committed to persistent storage, which you can always force with a call to LabNotebook.commit().

Because notebooks can store the results of different experimental configurations, you may find yourself managing several result sets under different tags. There’s a risk that you’ll accidentally use one result set when you meant to use another one, for example by not calling LabNotebook.select() to select the correct one.

This risk can be minimised by locking a result set once all its experiments have completed. We might rewrite the experiments we did above to lock each result set once its experiments are all done:

# perform some more experiments
nb.addResultSet('third-experiments')
lab['a'] = 12
lab['b'] = range(1000)
e = MyThirdExperiment()
lab.runExperiment(e)

# wait for the results to complete ... time passes ...
lab.wait()

# the results are in, lock and commit
nb.current().finish()
nb.commit()

The ResultSet.finish() method does two things. It cancels any pending results still missing (there won’t be any of these in this example, because of the call to ClusterLab.wait()), and then locks the result set against any further updates. This locking is persistent, so re-loading the notebook later will still leave this result set locked, in perpetuity.
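
Once locked, a result set rejects further additions. (A sketch: the isLocked() check and the exact exception epyc raises are assumptions here, so we catch a generic exception for illustration.)

# the lock is recorded on the result set itself
print(nb.current().isLocked())

# trying to add another result now fails
rc = e.set(dict(a=1, b=0)).run()
try:
    nb.addResult(rc)
except Exception as err:
    print(f'rejected: {err}')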