epyc: Computational experiment management in Python¶
Vision: Automated, repeatable management of computational experiments¶
epyc
aims to simplify the process of conducting large-scale computational experiments.
What is epyc?¶
epyc is a Python module that conducts computational experiments. An experiment is simply a Python object that is passed a tuple of parameters, which it uses to perform some computation (typically a simulation) and return a tuple of results. epyc allows a single experiment to be conducted across a space of parameters in a single command, configuring and running the experiment at each point in the parameter space. The results are collected together into a notebook along with the parameters for each result and some metadata about the performance of the experiment. Notebooks are immutable, can be persistent, and can be imported into pandas for analysis.
epyc can run the same experiment sequentially on a single local machine, in parallel on a multicore machine, or in parallel on a distributed compute cluster of many machines. This allows simulations to be scaled out with minimal re-writing of code. Moreover, jobs running on compute servers don't require the submitting machine to remain connected, meaning one can work from a laptop, start a simulation, and come back later to collect the results. Results can be collected piecemeal, to fully support disconnected operation.
Current features¶
- Single-class definition of experiments
- Run experiments across an arbitrarily-dimensioned parameter space with a single command
- Different experimental designs for managing how many experimental runs are needed, and at what points
- Experiment combinators to control repetition and summarisation, separate from the core logic of an experiment
- Immutable storage of results
- Experiments can run locally, remotely, and in parallel
- Works with Python 3.6 and later, and (to a limited extent) with PyPy3
- Remote simulation jobs run asynchronously
- Start and walk away from remote simulations, collect results as they become available
- Fully integrated with ipython and ipyparallel for parallel and distributed simulation
- Fully compatible with jupyter notebooks and labs
- Annotated with typing type annotations
Installation¶
In a nutshell¶
Pythons: 3.6 or later, and PyPy3
Operating systems: Linux, OS X
License: GNU General Public License v3
Repository: https://github.com/simoninireland/epyc
Maintainer: Simon Dobson
Installation with pip¶
Use the following command:
pip install epyc
epyc works perfectly in virtual environments, and indeed doing so is good practice for scientific code. See Reproducing experiments reliably for a discussion.
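As a minimal sketch (assuming a Unix-like shell with Python 3 available), creating a virtual environment and installing epyc into it looks like:
python3 -m venv venv
source venv/bin/activate
pip install epyc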
Tutorials¶
This page is a tutorial introduction to epyc.
First tutorial: Getting started¶
In the first tutorial we'll look at the basic components of epyc, and how to define and run experiments on a single machine.
Note
The code for this tutorial can be found in the Github repository.
Concepts¶
epyc is built around three core concepts: experiments, labs, and notebooks.
An experiment is just an object inheriting from the Experiment
class. An experiment can contain
pretty much any code you like, which is then called according to a strict protocol (described in detail
in The lifecycle of an experiment). The most important thing about experiments is that they can be parameterised with a set
of parameters passed to your code as a dict.
Parameters can come directly from calling code if the experiment is called directly, but it is more common for
experiments to be invoked from a lab. A lab defines a parameter space, a collection of parameters whose values
are each taken from a range. The set of all possible combinations of parameter values forms a multi-dimensional space,
and the lab runs an experiment for each point in that space (each combination of values). This allows the same
experiment to be done across a whole range of parameters with a single command. Different lab implementations provide
simple local sequential execution of experiments, parallel execution on a multicore machine, or parallel and
distributed execution on a compute cluster or in the cloud. The same experiment can usually be performed in any
lab, with epyc
taking care of all the house-keeping needed. This makes it easy to scale-out experiments
over larger parameter spaces or numbers of repetitions.
Running the experiment produces results, which are stored in result sets collected
together to form a notebook. Result sets collect together the results from multiple
experimental runs. Each result contains the parameters of the experiment, the results
obtained, and some metadata describing how the experimental run proceeded. A result
set is homogeneous, in the sense that the names of parameters, results, and metadata will be the same for all the results in the result set (although the values will of course be different). Results can be searched and retrieved either as Python dicts or as pandas dataframes.
However, result sets are immutable: once results have been entered, they can’t be modified
or discarded.
By contrast, notebooks are heterogeneous, containing result sets with different experiments and sets of parameters and results. Notebooks can also be persistent, for example storing their results in JSON or HDF5 for easy access from other tools.
A repeatable computational environment¶
Before we dive in, we should adopt best practices and make sure that we have a controlled computational environment in which to perform experiments. See Reproducing experiments reliably for a thorough discussion of these topics.
A simple experiment¶
Computational experiments that justify using infrastructure like epyc are by definition usually large and complicated – and not suitable for a tutorial. So we'll start with a very simple example: admittedly you could do this more easily with straight Python code, but simplicity is an advantage when describing how to use a more complicated set-up.
So suppose we want to compute a set of values for some function so that we can plot them as a graph. A complex function, or one that involved simulation, might justify using epyc. For the time being let's use a simple function: \(\sin \sqrt{x^2 + y^2}\).
We’ll plot this function about \((0, 0)\) extending \(2 \pi\) radians in each axial direction.
Defining the experiment¶
We first create a class describing our experiment. We do this by extending Experiment and overriding Experiment.do() to provide the code actually executed:
from epyc import Experiment, Lab, JSONLabNotebook
import numpy

class CurveExperiment(Experiment):

    def do(self, params):
        '''Compute the sin value from two parameters x and y, returning a dict
        containing a result key with the computed value.

        :param params: the parameters
        :returns: the result dict'''
        x = params['x']
        y = params['y']
        r = numpy.sin(numpy.sqrt(x**2 + y**2))
        return dict(result=r)
That's it: the code for the computational experiment, which will be executed at a point driven by the provided parameters dict.
Types of results from the experiment¶
epyc tries to be very Pythonic as regards the types one can return from experiments. However, a lot of scientific computing must interoperate with other tools, which makes some Python types less than attractive. The following Python types are safe to use in epyc result sets:
- int
- float
- complex
- string
- bool
- one-dimensional arrays of the above
- empty arrays
There are some elements of experimental metadata that use exceptions or datestamps: these get special handling.
Also note that epyc can handle lists and one-dimensional arrays in
notebooks, but it can’t handle higher-dimensional arrays. If you
have a matrix, for example, it needs to be unpacked into one or more
one-dimensional vectors. This is unfortunate in general but not often
an issue in practice.
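As a hedged sketch of the unpacking (computeMatrix() is a hypothetical helper, not part of epyc), an experiment producing a matrix might flatten it row by row in its do() method:
def do(self, params):
    m = self.computeMatrix(params)    # hypothetical matrix-valued computation
    res = dict()
    for i in range(m.shape[0]):
        res['row{n}'.format(n=i)] = list(m[i, :])   # one one-dimensional vector per row
    return res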
There are some conversions that happen when results are saved to persistent notebooks, using either JSON (JSONLabNotebook) or HDF5 (HDF5LabNotebook): see the class documentation for the details (or if unexpected things happen), but generally it's fairly transparent when you stick to the types listed above.
Testing the experiment¶
Good development practice demands that we now test the experiment before running it in anger.
Usually this would involve writing unit tests within a framework provided by Python's unittest library, but that's beyond the scope of this tutorial: we'll simply run the experiment at a point for which we know the answer:
# set up the test
params = dict()
params['x'] = 0
params['y'] = 0
res = numpy.sin(numpy.sqrt(params['x']**2 + params['y']**2))   # so we know the answer

# run the experiment
e = CurveExperiment()
rc = e.set(params).run()
print(rc[Experiment.RESULTS]['result'] == res)
The result should be True. Don't worry about how we've accessed the result: that'll become clear in a minute.
A lab for the experiment¶
To perform our experiment properly, we need to run the experiment at a lot of points, to give us a "point cloud" dataset that we can then plot to see the shape of the function.
epyc
lets us define the space of parameters over which we want to run the experiment,
and then will automatically run and collect results.
The object that controls this process is a Lab, which we'll create first:
lab = Lab()
This is the most basic use of labs, which will store the results in an in-memory LabNotebook. For more serious use, if we wanted to save the results for later, then we can create a persistent JSONLabNotebook that stores results in a file in a JSON encoding:
lab = Lab(notebook = JSONLabNotebook("sin.json",
                                     create = True,
                                     description = "A point cloud of $\\sin \\sqrt{x^2 + y^2}$"))
This creates a JSON file with the name given in the first argument. The create argument, if set to True, will overwrite the contents of the file; it defaults to False, which will load the contents of the file instead, allowing the notebook to be extended with further results. The description is just free text.
Important
epyc lab notebooks are always immutable: you can delete them, but you can't change their contents (at least not from within epyc). This is intended to avoid the loss of data.
Specifying the experimental parameters¶
We next need to specify the parameter space over which we want the lab to run the experiment. This is done by mapping variables to values in a dict. The keys of the dict match the parameter names referenced in the experiment; the values can be single values (constants) or ranges of values. The lab will then run the experiment for all combinations of the values provided.
For our purposes we want to run the experiment over a range \([-2 \pi, 2 \pi]\) in two axial directions.
We can define this using numpy:
lab['x'] = numpy.linspace(-2 * numpy.pi, 2 * numpy.pi)
lab['y'] = numpy.linspace(-2 * numpy.pi, 2 * numpy.pi)
How many points are created in these ranges? We've simply let numpy use its default, which is 50 points: we could have passed the num argument (for example numpy.linspace(-2 * numpy.pi, 2 * numpy.pi, num=100)) to get finer or coarser resolution for the point cloud. Notice that the lab itself behaves as a dict for the parameters.
What experiments will the lab now run? We can check by retrieving the entire parameter space for the lab:
print(lab.parameterSpace())
This returns a list of the combinations of parameters that the lab will use for running experiments. If you're only interested in how many experiments will run, you can simply take the length of this list: print(len(lab.parameterSpace())).
Running the experiment¶
We can now run the entire experiment with one command:
lab.runExperiment(CurveExperiment())
What experiments will be run depends on the lab’s experimental
design. By default labs use a FactorialDesign
that performs
an experiment for each combination of parameter values, which in this
case will have 2500 points: 50 points along each of the two axes.
Time will now pass until all the experiments are finished.
Where are the results? They’ve been stored into the notebook we associated with the lab, either in-memory or in a JSON file on disk.
Accessing the results¶
There are several ways to get at the results. The simplest is that we can simply get back a list of dicts:
results = lab.results()
results now contains a list, each element of which is a results dict: a Python dict that's structured in a particular way (unpacked in the sketch below). It contains three top-level keys:
- Experiment.PARAMETERS, which maps to a dict of the parameters that were used for this particular run of the experiment (x and y in our case, each mapped to a value taken from the parameter space);
- Experiment.RESULTS, which maps to a dict of the experimental results generated by the Experiment.do() method (result in our case); and
- Experiment.METADATA, which contains some metadata about this particular experimental run, including the time taken for it to execute, any exceptions raised, and so forth. The standard metadata elements are described in Experiment: sub-classes can add extra metadata.
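For example (a sketch, re-using the CurveExperiment results from above), an individual results dict can be unpacked like this:
rc = results[0]
print(rc[Experiment.PARAMETERS]['x'], rc[Experiment.PARAMETERS]['y'])
print(rc[Experiment.RESULTS]['result'])
print(rc[Experiment.METADATA][Experiment.EXPERIMENT_TIME])   # seconds spent in do()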
A list isn’t a very convenient way to get at results, and analysing an
experiment typically requires some more machinery. Many experiments
will use pandas to perform analysis, and the lab can generate a pandas.DataFrame structure directly:
import pandas
df = lab.dataframe()
The dataframe contains all the information from the runs: each row
holds a single run, with columns for each result, parameter, and metadata element. We can now do analysis in pandas as appropriate: for example we can use matplotlib to draw the results as a point cloud:
import matplotlib
from mpl_toolkits.mplot3d import Axes3D
from matplotlib import cm
import matplotlib.pyplot as plt
fig = plt.figure(figsize = (8, 8))
ax = fig.add_subplot(projection = '3d')
ax.scatter(df['x'], df['y'], df['result'],
c=df['result'], depthshade=False, cmap=cm.coolwarm)
ax.set_xlim(numpy.floor(df['x'].min()), numpy.ceil(df['x'].max()))
ax.set_ylim(numpy.floor(df['y'].min()), numpy.ceil(df['y'].max()))
ax.set_zlim(numpy.floor(df['result'].min()), numpy.ceil(df['result'].max()))
plt.title(lab.notebook().description())
ax.set_xlabel('$x$')
ax.set_ylabel('$y$')
plt.show()

Running more experiments¶
We can run more experiments from the same lab if we choose: simply change the parameter bindings, as one would with a dict. It’s also possible to remove parameters as one would expect:
del lab['x']
del lab['y']
For convenience there's also a method Lab.deleteAllParameters() that returns the lab to an empty parameter state. This can be useful
for using the same lab for multiple sets of experiments. If you’re
going to do this, it’s often advisable to use multiple result sets and
a more structured approach to notebooks, as described in the
Third tutorial: Larger notebooks.
Second tutorial: Parallel execution¶
epyc's main utility comes from being able to run experiments, like those we defined in
the first tutorial and ran on a single machine, on multicore machines and clusters of machines.
In this tutorial we’ll explain how epyc
manages parallel machines.
(If you know about parallel computing, then it’ll be enough for you to know that epyc
creates
a task farm of experiments across multiple cores. If this didn’t make sense, then you
should first read Parallel processing concepts.)
Two ways to get parallelism¶
epyc
arranges the execution of experiments around the Lab
class, which handles
the execution of experiments across a parameter space. The default Lab
executes
experiments sequentially.
But what if you have more than one core? – very common on modern workstations. Or if
you have access to a cluster of machines? Then epyc
can make use of these resources
with no change to your experiment code.
If you have a multicore machine, the easiest way to use it with epyc
is to replace
the Lab
managing the experiments with a ParallelLab
to get
local parallelism. This will execute
experiments in parallel using the available cores. (You can limit the number of cores
used if you want to.) For example:
from epyc import ParallelLab, HDF5LabNotebook
nb = HDF5LabNotebook('mydata.h5', create=True)
lab = ParallelLab(nb, cores=-1) # leave one core free
e = MyExperiment()
lab['first'] = range(1, 1000)
lab.runExperiment(e)
On a machine with, say, 16 cores, this will use 15 of the cores to run experiments and return when they’re all finished.
If you have a cluster, things are a little more complicated, as you need to set up some extra software to manage the cluster for you. Once that's done, though, accessing the cluster from epyc is largely identical to accessing local parallelism.
Setting up a compute cluster¶
epyc
doesn’t actually implement parallel computing itself: instead it builds on top of
existing Python infrastructure for this purpose. The underlying library epyc
uses is
called ipyparallel, which provides
portable parallel processing on both multicore machines and collections of machines.
Warning
Confusingly, there's also a system called PyParallel, which is a completely different beast to ipyparallel.
epyc wraps up ipyparallel within the framework of experiments, labs, and notebooks, so that, when using epyc, there's no need to interact directly with ipyparallel. However, before we get to that stage we do need to set up the parallel compute cluster that epyc will use, and (at present) this does require interacting to some degree with ipyparallel's commands.
Setting up a cluster depends on what kind of cluster you have, and we’ll describe each one individually. It’s probably easiest to start with the simplest system to which you have access, and then – if and when you need more performance – move onto the more advanced systems.
Setting up a machine with a single core¶
You might ask why you’d do this: isn’t a single-core machine useless for parallel processing? Well, yes … and no, it’s the same basic architecture as for a multicore machine, so it’s useful to understand how things go together.
The first thing we need to do is create a “profile”, which is just a small description of how we want the cluster to behave. Creating a profile requires one command:
ipython profile create --parallel cluster
This creates a profile called cluster
– you can choose any name you like for yours. Profiles let
us run multiple clusters (should we want to), each with a different name.
We can now start our compute cluster using this profile:
ipcluster start --profile=cluster
That’s it! (If you used a different name for your profile, of course, use that instead of cluster
above.)
You’ll see some debugging information streaming past, which indicates that the cluster has
started and is ready for use.
Unsurprisingly, if you want to halt the cluster, you execute:
ipcluster stop --profile=cluster
You need to provide the profile name to make sure the right cluster is stopped.
Setting up a machine with multiple cores¶
A multicore machine is a far more sensible system on which to do large-scale computing, and they're now surprisingly common: despite what we said before, a lot of laptops (and even a lot of phones) are now multicore. Setting up a cluster on a multicore machine is just as easy as for a single-core machine – once we know how many cores we have. If you're running on Linux, you can ask the operating system how many cores you have:
grep -c processor /proc/cpuinfo
This prints the number of cores available.
Note
At the moment querying the number of available cores only works on Linux.
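As a portable alternative (our suggestion, not part of epyc), you can ask Python itself, which works across operating systems:
python3 -c "import os; print(os.cpu_count())"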
We use ipcluster
to start the cluster as before, but tell it how many cores it has to work with.
If your machine has 16 cores, then the command would be:
ipcluster start --profile=cluster --n=16
(Obviously we’re assuming your cluster is called cluster
as before.)
Note
If you have a machine with only 8 cores, there's no point telling the cluster you have more: you won't get any extra speedup, and in fact things will probably run slower than if you let ipyparallel optimise itself for the actual hardware it has available.
Occasionally you may want to run a cluster with fewer cores than you actually have, to stop epyc
monopolising the machine: for example if you’re sharing it with others, or if you want to also run
other things. In this case you can tell ipcluster
to use a smaller number of cores, leaving the
remainder free for other programs to use.
Running experiments on a cluster¶
Having set up the cluster of whatever kind, we can now let epyc run experiments on it. This involves using a ClusterLab, which is simply a Lab that runs experiments remotely on a cluster rather than locally on the machine with the lab.
We create a ClusterLab
in the same way as “ordinary” labs:
clab = epyc.ClusterLab(profile = 'cluster',
notebook = epyc.JSONLabNotebook('our-work.json'))
The lab thus created will connect to the cluster described in the cluster profile (which must already have been created and started).
A ClusterLab behaves like a Lab in most respects: we can set a
in most respects: we can set a
parameter space, run a set of experiments at each point in the space, and so forth.
But they differ in one important respect. Running an experiment in a Lab is a synchronous process: when you call Lab.runExperiment() you wait until the experiments finish before regaining control. That's fine for small cases, but what if you want to run a huge computation? – many repetitions of experiments across a large parameter space? That, after all, is the reason we want to do parallel computing: to support large computations. It would be inconvenient, to say the least, if performing such experiments locked up a computer for a long period.
ClusterLab differs from Lab by being asynchronous. When you call ClusterLab.runExperiment(), the experiments are submitted to the cluster in one
go and control returns to your program: the computation happens “in the background”
on the cluster.
So suppose we go back to our example of computing a curve. This wasn’t a great example for a sequential lab, and it’s monumentally unrealistic for parallel computation except as an example. We can set up the parameter space and run them all in parallel using the same syntax as before:
clab['x'] = numpy.linspace(-2 * numpy.pi, 2 * numpy.pi)
clab['y'] = numpy.linspace(-2 * numpy.pi, 2 * numpy.pi)
clab.runExperiment(CurveExperiment())
Control will return immediately, as the computation is spun-up on the cluster.
How can we tell when we're finished? There are three ways. The first is to make the whole computation synchronous by waiting for it to finish:
clab.wait()
This will lock up your computer waiting for all the experiments to finish. That's not very flexible. We can instead test whether the computations have finished:
clab.ready()
which will return True when everything has finished. But that might take a long time, and we might want to get results as they become available – for example to plot them partially. We can see what fraction of experiments are finished using:
clab.readyFraction()
which returns a number between 0 and 1 indicating how far along we are.
As results come in, they’re stored in the lab’s notebook and can be retrieved
as normal: as a list of dicts, as a DataFrame, and so forth. As long as ClusterLab.ready() is returning False (and ClusterLab.readyFraction() is therefore returning less than 1), there are
still “pending” results that will be filled in later. Each call to one of these
“query” methods synchronises the notebook with the results computed on the
cluster.
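As a sketch (assuming the clab lab from above, and relying on each query synchronising the notebook), a simple polling loop for working with partial results might look like:
import time

while not clab.ready():
    print('{pc:.0%} of results ready'.format(pc=clab.readyFraction()))
    df = clab.dataframe()    # partial results retrieved so far
    # ... update plots or checkpoint from df ...
    time.sleep(30)           # poll every half-minute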
In fact ClusterLab
has an additional trick up its sleeve, allowing
completely disconnected operation. But that’s another topic.
Common problems with clusters¶
ipyparallel is a fairly basic cluster management system, but one that's adequate for a lot of straightforward experiments. That means it sometimes needs tweaking to
work effectively, in ways that rely on you (the user) rather than being automated
as might be the case in a more advanced system.
The most common problem is one of overloading. This can occur for both multicore and multi-machine set-ups, and arises when the machine spends so long doing your experiments that it stops being able to do other work. While this may sound like a good thing – an efficient use of resources – some of that "other" work includes communicating with the cluster controller. It's possible that too many engines crowd out something essential, which often manifests itself in one of two ways:
- You can’t log-in to the machine or run simple processes; or
- You can’t retrieve results.
The solution is actually quite straightforward: don’t run as much work! This can easily be done by, for example, always leaving one or two cores free on each machine you use: so an eight-core machine would run six engines, leaving two free for other things.
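For example, to run six engines on an eight-core machine using the cluster profile from earlier:
ipcluster start --profile=cluster --n=6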
Clusters versus local parallelism¶
You probably noticed that, if you have a single multicore workstation, there are two ways
to let epyc use it:
- a ParallelLab; or
- a ClusterLab that happens to only run engines locally.
There are pros and cons to each approach. For the ParallelLab we have:
- it’s very simple to start, requiring no extra software to manage; but
- you only get (at most) as many cores as you have on your local machine; and
- experiments run synchronously, meaning the program that runs them is locked out until they complete (this is especially inconvenient when using Jupyter).
For the ClusterLab:
- you need to set up the cluster outside epyc; but
- experiments run asynchronously, meaning you can get on with other things; and
- you can use all the cores of all the machines you can get access to.
As a rule of thumb, a suite of experiments likely to take hours or days will be better run on a cluster; shorter campaigns can use local parallelism to get a useful speed-up.
Third tutorial: Larger notebooks¶
So far we’ve generated fairly modest datasets from experiments. That can rapidly
change if you start running lots of repetitions of an experiment, and if you
want to collect the results of different (but related) experiments within
the same notebook, and to share these datasets with others or make use of tools
other than epyc
and your own analytics code to study them. For these reasons we need to be able to handle larger-scale datasets.
Structuring larger datasets¶
A single result set is a homogeneous collection of results from experimental runs. Each "row" of the result set contains values for the same set of parameters, experimental results, and metadata. Typically that means that all the results in the result set were collected for the same experiment, or at least for experiments with the same overall structure.
That might not be the case. You might want to collect together the results of lots of different sorts of experiment that share a common theme. An example of this might be where you have a model of a disease which you want to run over different underlying models of human contacts described with different combinations of parameters. Putting all these in the same result set would be potentially quite confusing.
An epyc LabNotebook can therefore store multiple separate result sets,
each with its own structure. You choose which result set to add results to
by selecting that result set as “current”, and similarly can access the
results in the selected result set.
Each result set has a unique “tag” that identifies it within the notebook. epyc
places no restrictions on tags, except that they must be strings: a meaningful
name would probably make sense.
When a notebook is first created, it has a single default result set ready to receive results immediately.
from epyc import Experiment, LabNotebook, ResultSet
nb = LabNotebook()
print(len(nb.resultSets()))
The default result set is named according to LabNotebook.DEFAULT_RESULTSET.
print(nb.resultSets())
We can add some results, assuming we have an experiment lying around:
e = MyExperiment()
# first result
params = dict(a=12, b='one')
rc = e.set(params).run()
nb.addResult(rc)
# second result
params['b']='two'
rc = e.set(params).run()
nb.addResult(rc)
These results will set the “shape” of the default result set.
You can create a new result set simply by providing a tag:
nb.addResultSet('my-first-results')
Adding the result set selects it and makes it current. We can then add some other results from a different experiment, perhaps with a different set of parameters:
e = MyOtherExperiment()
params = dict(a=12, c=55.67)
rc = e.set(params).run()
nb.addResult(rc)
We can check how many results we have:
print(nb.numberOfResults())
The result will be 1: the number of results in the current result set. If we select the default result set, we'll see those 2 results instead:
nb.select(LabNotebook.DEFAULT_RESULTSET)
print(nb.numberOfResults())
The two result sets are entirely separate and can be selected between as required. They can also be given attributes that, for example, describe the circumstances under which they were collected or the significance of the different parameters. This kind of documentation metadata becomes more important as datasets become larger, more complicated, and stored for longer.
Persistent datasets¶
We already met epyc's JSONLabNotebook class, which stores notebooks using a portable JSON format. This is neither
compact nor standard: it takes a lot of disc space for large notebooks, and
doesn’t immediately interoperate with other tools.
The HDF5LabNotebook, by contrast, uses the HDF5 file format, which is
supported by a range of other tools. It can be used in the same way as
any other notebook:
from epyc import Lab, HDF5LabNotebook
lab = Lab(notebook=HDF5LabNotebook('mydata.h5'))
Changes to result sets are persisted into the underlying file. It's possible to apply compression and other filters to optimise storage, and the file format is as far as possible self-describing, with metadata to allow it to be read and interpreted at a later date.
Pending results¶
Both multiple result sets and persistent notebooks really come into their own when combined
with the use of ClusterLab
instances to perform remote computations.
Once you have a connection to a cluster, you can start computations in multiple result sets.
from epyc import ClusterLab
lab = ClusterLab(profile="mycluster",
notebook=HDF5LabNotebook('mydata.h5', create=True))
nb = lab.notebook()
# perform some of the first experiments
nb.addResultSet('first-experiments')
lab['a'] = 12
lab['b'] = range(1000)
e = MyFirstExperiment()
lab.runExperiment(e)
# and then some others
nb.addResultSet('second-experiments')
lab['a'] = 15
lab['b'] = ['cat', 'dog', 'snake']
lab['c'] = range(200)
e = MySecondExperiment()
lab.runExperiment(e)
These experiments will give rise to "pending" results as they are computed on the cluster: 1000 experimental runs in the first case, 600 in the second (the sizes of the respective parameter spaces). If we check the fraction of results that are ready in each dataset, any that are complete will be retrieved from the cluster and stored in the notebook:
nb.select('first-experiments')
print(lab.readyFraction())
Note that results that are ready are collected for all datasets, not just the one we query, but the fraction refers only to the selected current result set. This means that over time all the results in the notebook will be collected from the cluster and stored.
Clusters support disconnected operation, which works well with persistent notebooks. See Using a cluster without staying connected to it for details.
Locking result sets¶
If you've just burned hundreds of core-hours on significant experiments,
you want to be careful of the results! The persistent notebooks try
to ensure that results obtained are committed to persistent storage,
which you can always force with a call to LabNotebook.commit().
Because notebooks can store the results of different experimental configurations,
you may find yourself managing several result sets under different
tags. There’s a risk that you’ll accidentally use one result set when you
meant to use another one, for example by not calling LabNotebook.select()
to select the correct one.
The risk of this can be minimised by locking a result set once all its experiments have been completed. We might re-write the experiments we did above to lock each result set once the experiments are all done.
# perform some more experiments
nb.addResultSet('third-experiments')
lab['a'] = 12
lab['b'] = range(1000)
e = MyThirdExperiment()
lab.runExperiment(e)
# wait for the results to complete ... time passes ...
lab.wait()
# the results are in, lock and commit
nb.current().finish()
nb.commit()
The ResultSet.finish() method does two things. It cancels any pending results still missing (there won't be any of these in this example, because of the call to ClusterLab.wait()), and then locks the result set against any further updates.
This locking is persistent, so re-loading the notebook later will still leave
this result set locked, in perpetuity.
Fourth tutorial: Integration with Jupyter¶
Computational experiments are increasingly happening within the Jupyter ecosystem
of notebooks, labs, binders, and so on. epyc
integrates well with Jupyter tools,
and in this tutorial we’ll show you how.
What is Jupyter¶
Jupyter is an interactive notebook system that provides a flexible front-end onto a number of different language "kernels" – including Python. Notebooks can include text, graphics, and executable code, and so are an excellent framework for the interactive running of computational experiments. The ecosystem has expanded in recent years and now includes a multi-window IDE (Jupyter Lab), a publishing system (Jupyter Book), and various interactive widget libraries for visualisation.
epyc
can be used from Jupyter notebooks in a naive way simply by creating a local
lab in a notebook and working with it as normal:
from epyc import Lab
lab = Lab()
However this provides access only to the simplest kind of lab, running on a single local core. It is clearly desirable to be able to use more flexible resources. It’s also desirable to be able to make use of code and classes defined in notebooks.
Jupyter notebooks and compute servers¶
The interactive nature of Jupyter notebooks means that they lend themselves to "disconnected" operation with a ClusterLab. To recap, this is where
we submit experiments to a remote compute cluster and then re-connect later
to retrieve the results. See Second tutorial: Parallel execution for a description of
how to set up and use a remote cluster.
This mode of working is perfect for a notebook. You simply create a connection to the cluster and run it as you wish:
from epyc import ClusterLab, HDF5LabNotebook
import numpy
nb = HDF5LabNotebook('notebook-dataset.h5', description='My data', create=True)
lab = ClusterLab(url_file='ipcontroller-client.json', notebook=nb)
lab['first'] = numpy.linspace(0.0, 1.0, num=100)
lab.runExperiment(MyExperiment())
Running these experiments happens in the background rather than freezing your notebook (as would happen with an ordinary Lab). It also doesn't attempt
to melt your laptop by running too many experiments locally. Instead the
experiments are scattered onto the cluster and can be collected later – even after
turning your computer off, if necessary.
Note
We used an HDF5LabNotebook
to hold the results so that everything is
saved between sessions. If you used a non-persistent notebook then you’d lose
the details of the pending results. And anyway, do you really not want to
save all your experimental results for later?
All that then needs to happen is that you re-connect to the cluster and check for results:
from epyc import ClusterLab, HDF5LabNotebook
import numpy
nb = HDF5LabNotebook('notebook-dataset.h5', description='My data')
lab = ClusterLab(url_file='ipcontroller-client.json', notebook=nb)
print(lab.readyFraction())
This will print the fraction of results that are now ready, a number between 0 and 1.
Warning
Notice that we omitted the create=True
flag from the HDF5LabNotebook
the second time: we want to re-use the notebook (which holds all the details
of the pending results we’re expecting to collect), not create it afresh.
Many core-hours of computation have been lost this way….
Checking the results pulls in any that are completed, and you can immediately start using them if you wish: you don't have to wait for everything to finish. Just be careful that whatever analysis you start to perform understands that this is a partial set of results.
Avoiding repeated computation¶
Consider the following use case. You create a Jupyter notebook for calculating your exciting new results. You create a notebook, add some result sets, do some stunning long computations, and save the notebook for later sharing.
After some time, you decide you want to add some more computations to the notebook. You open it, and realise that all the results you computed previously are still available in the notebook. But if you were to re-execute the notebook, you'd re-compute all those results when instead you could simply load them and get straight on with your new computations.
Or perhaps you share your notebook “live” using Binder. The notebook is included in the binder, and people can see the code you used to get them (as with all good notebooks). But again they may want to use and re-analyse your results but not re-compute them.
One way round this is to have two cells, something like this to do (or re-do) the calculations:
# execute this cell if you want to compute the results
nb = epyc.HDF5LabNotebook('lots-of-results.h5')
nb.addResultSet('first-results',
                description='My long-running first computation')
e = LongComputation()
lab = Lab(nb)
lab[LongComputation.INITIAL] = int(1e6)
lab.runExperiment(e)
and then another one to re-use the ones you prepared earlier:
# execute this cell if you want to re-load the results
nb = epyc.HDF5LabNotebook('lots-of-results.h5')
nb.select('first-results')
Of course this is quite awkward: it relies on the notebook user to decide which cells to execute – and means that you can't simply run all the cells to get back to where you started.
epyc
has two solutions to this problem.
Using createWith¶
The first (and preferred) solution uses the Lab.createWith() method, which takes a function used to create a result set. When
called, the method checks if the result set already exists in the
lab’s notebook and, if it does, selects it for use; if it doesn’t,
then it is created, selected, and the creation function is called to
populate it.
To use this method we first define a function to create the data we want:
def createResults(lab):
    e = LongComputation()
    lab[LongComputation.INITIAL] = int(1e6)
    lab.runExperiment(e)
We then use this function to create a result set, if it hasn’t been created already:
lab.createWith('first-results',
createResults,
description='My long-running first computation')
Note that the creation function is called with the lab it should use
as argument. Lab.createWith()
will return when the result set
has been created, which will be when all experiments have been run.
By default the lab’s parameter space is reset before the creation
function is called, and any exception raised in the creation function
causes the result set to be deleted (to avoid partial results) and
propagates the exception to the caller. There are several ways to
customise this behaviour, described in the Lab.createWith()
reference.
Using already¶
The second solution uses the LabNotebook.already()
method. This
is appropriate if you want to avoid creating additional functions just
for data creation, and instead have the code inline.
nb = epyc.HDF5LabNotebook('lots-of-results.h5')
if not nb.already('first-results',
                  description='My long-running first computation'):
    e = LongComputation()
    lab = Lab(nb)
    lab[LongComputation.INITIAL] = int(1e6)
    lab.runExperiment(e)
If this cell is run on a lab notebook that doesn’t contain the result
set, that set is created and the body of the conditional executes to
compute the results. If the cell is executed when the result set
already exists, it selects it ready for use (and any description
passed to LabNotebook.already() is ignored). In either event,
subsequent code can assume that the result set exists, is selected,
and is populated with results.
Note
Note that Lab.createWith()
is called against a lab while
LabNotebook.already()
is called against a notebook. (The
former uses the latter internally.)
In general Lab.createWith()
is easier to use than
LabNotebook.already()
, as the former handles exceptions,
parameter space initialisation, result set locking and the like.
Limitations¶
There are some limitations to be aware of, of course:
- Neither approach works with disconnected operation when the results come back over a long period.
- You need to be careful about your choice of result set tags, so that they’re meaningful to you later. This also makes the description and metadata more important.
- We assume that all the results in the set are computed in one go, since future cells protected by the same code pattern wouldn’t be run.
The latter can be addressed by locking the result set after the computation has happened (by calling ResultSet.finish()) to fix the results. Lab.createWith() can do this automatically.
Sharing code between notebooks and engines¶
Of course if you use Jupyter as your primary coding platform, you’ll probably define experiments in the notebook too. This will unfortunately immediately introduce you to one of the issues with distributed computation in Python.
Here’s an example. Suppose you define an experiment in a notebook, something like this:
from epyc import Experiment

class MyExperiment(Experiment):

    def __init__(self):
        super(MyExperiment, self).__init__()
        # your initialisation code here

    def setUp(self, params):
        super(MyExperiment, self).setUp(params)
        # your setup code here

    def do(self, params):
        # whatever your experiment does
        pass
Pretty obvious, yes? And it works fine when you run it locally. Then,
when you try to run it on a cluster, you create your experiment and submit it with ClusterLab.runExperiment(), but all the instances fail with an error complaining about there being no such class as MyExperiment.
What’s happening is that epyc
is creating an instance of your
class in your notebook (or program) where MyExperiment
is known
(so the call in __init__()
works fine). Then it’s passing objects
(instances of this class) over to the cluster, where MyExperiment
isn’t known. When the cluster calls Experiment.setUp()
as part
of the experiment’s lifecycle, it goes looking for MyExperiment
–
and fails, even though it does actually have all the code it
needs. This happens because of the way Python dynamically looks for code at run-time, which is often a useful feature but in this case trips things up.
To summarise, the problem is that code you define here (in the notebook) isn’t immediately available there (in the engines running the experiments): it has to be transferred there. And the easiest way to do that is to make sure that all classes defined here are also defined there as a matter of course.
To do this we make use of an under-used feature of Jupyter, the cell magic. These are annotations placed on cells that let you control the code that executes the cell itself. So rather than a Python code cell being executed by the notebook’s default mechanism, you can insert code that provides a new mechanism. In this case we want to have a cell execute both here and there, so that the class is defined on notebook and engine.
Important
If you haven't taken to heart the advice about Reproducing experiments reliably,
now would be a really good time to do so. Create a venv for both
the notebook and the engines: the venv at the engine side doesn’t
need Jupyter, but it mostly does no harm to use the same
requirements.txt
file on both sides.
The cell magic we need uses the following code, so put it into a cell and execute it:
# from https://nbviewer.jupyter.org/gist/minrk/4470122
def pxlocal(line, cell):
    ip = get_ipython()
    ip.run_cell_magic("px", line, cell)
    ip.run_cell(cell)
get_ipython().register_magic_function(pxlocal, "cell")
This defines a new cell magic, %%pxlocal
. The built-in cell magic
%%px
runs a cell on a set of engines. %%pxlocal
runs a cell
both on the engines and locally (in the notebook). If you decorate
your experiment classes this way, then they’re defined here and
there as required:
%%pxlocal
class MyExperiment(Experiment):

    def __init__(self):
        super(MyExperiment, self).__init__()
        # your initialisation code here

    def setUp(self, params):
        super(MyExperiment, self).setUp(params)
        # your setup code here

    def do(self, params):
        # whatever your experiment does
        pass
Now when you submit your experiments they will function as required.
Important
You only need to use %%pxlocal
for cells in which you’re
defining classes. When you’re running the experiments all the
code runs notebook-side only, and epyc
handles passing the
necessary objects around the network.
API reference¶
Core classes¶
Experiment: A single computational experiment¶
- class epyc.Experiment¶
Base class for an experiment conducted in a lab.
An Experiment defines a computational experiment that can be run independently or (more usually) controlled from an instance of the Lab class. Experiments should be long-lasting, able to conduct repeated runs at several different parameter points.
From an experimenter's (or a lab's) perspective, an experiment has public methods set() and run(). The former sets the parameters for the experiment; the latter runs the experiment, producing a set of results that include direct experimental results and metadata on how the experiment ran. A single run may produce a list of result dicts if desired, each filled in with the correct metadata.
Experimental results, parameters, and metadata can be accessed directly from the Experiment object. The class also exposes an indexing interface to access experimental results by name.
Important
Experiments have quite a detailed lifecycle that it is important to understand when writing any but the simplest experiments. See The lifecycle of an experiment for a detailed description.
Creating the results dict¶
The results dict is the structure returned from running an Experiment. Results dicts are simply nested Python dicts, which can be created using a static method.
- static Experiment.resultsdict() → Dict[str, Dict[str, Any]]¶
Create an empty results dict, structured correctly.
Returns: an empty results dict
The ResultsDict type is an alias for this structure. The dict has three top-level keys:
- Experiment.PARAMETERS¶
Results dict key for describing the point in the parameter space the experiment ran on.
- Experiment.RESULTS¶
Results dict key for the experimental results generated at the experiment's parameter point.
- Experiment.METADATA¶
Results dict key for metadata values, mainly timing.
The contents of the parameters and results dicts are defined by the Experiment
designer. The metadata dict includes a number of standard elements.
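As a small sketch of the structure (the values here are purely illustrative):
from epyc import Experiment

rc = Experiment.resultsdict()
rc[Experiment.PARAMETERS]['x'] = 0.5        # the parameter point
rc[Experiment.RESULTS]['result'] = 0.479    # the experimental results
print(rc[Experiment.METADATA])              # metadata, empty until a run fills it in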
Standard metadata elements¶
The metadata elements include:
- Experiment.STATUS¶
Metadata element that will be True if the experiment completed successfully, False otherwise.
- Experiment.EXCEPTION¶
Metadata element containing the exception thrown if the experiment failed.
- Experiment.TRACEBACK¶
Metadata element containing the traceback from the exception (as a string).
- Experiment.START_TIME¶
Metadata element for the datetime the experiment started.
- Experiment.END_TIME¶
Metadata element for the datetime the experiment ended.
- Experiment.SETUP_TIME¶
Metadata element for the time spent on setup, in seconds.
- Experiment.EXPERIMENT_TIME¶
Metadata element for the time spent on the experiment itself, in seconds.
- Experiment.TEARDOWN_TIME¶
Metadata element for the time spent on teardown, in seconds.
Experiment sub-classes may add other metadata elements as required.
Note
Since metadata can come from many sources, it’s important to consider the names given to the different values. epyc uses structured names based on the class names to avoid collisions.
If the Experiment has run successfully, the Experiment.STATUS key will be True; if not, it will be False, the Experiment.EXCEPTION key will contain the exception that was raised to cause it to fail, and the Experiment.TRACEBACK key will hold the traceback for that exception.
Warning
The exception traceback, if present, is a string, not a traceback
object, since these
do not work well in a distributed environment.
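Putting these elements together, a typical post-run check (a sketch, assuming an experiment e and parameters params as in the tutorials) might read:
rc = e.set(params).run()
md = rc[Experiment.METADATA]
if not md[Experiment.STATUS]:
    print(md[Experiment.EXCEPTION])    # the exception that caused the failure
    print(md[Experiment.TRACEBACK])    # its traceback, as a string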
Configuring the experiment¶
An Experiment is given its parameters, a "point" in the parameter space being explored, by calling Experiment.set(). This takes a dict of named parameters and returns the Experiment itself.
- Experiment.set(params: Dict[str, Any]) → epyc.experiment.Experiment¶
Set the parameters for the experiment, returning the now-configured experiment.
Parameters: params – the parameters
Returns: the experiment
- Experiment.configure(params: Dict[str, Any])¶
Configure the experiment for the given parameters. The default stores the parameters for later use. Be sure to call this base method when overriding.
Parameters: params – the parameters
- Experiment.deconfigure()¶
De-configure the experiment prior to setting new parameters. The default removes the parameters. Be sure to call this base method when overriding.
Important
Be sure to call the base methods when overriding Experiment.configure() and Experiment.deconfigure(). (There should be no need to override Experiment.set().)
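As a sketch of a compliant override (TableExperiment and its scale parameter are hypothetical, chosen purely for illustration):
from epyc import Experiment

class TableExperiment(Experiment):
    '''Hypothetical experiment that pre-computes a lookup table.'''

    def configure(self, params):
        super().configure(params)              # always call the base method
        self._table = [i * params['scale'] for i in range(10)]

    def deconfigure(self):
        self._table = None
        super().deconfigure()                  # and the base method on the way out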
Running the experiment¶
To run the experiment, a call to Experiment.run()
will run the experiment
at the given parameter point.
The dict of experimental results returned by Experiment.do() is formed into a results dict by the private Experiment.report() method. Note the division of responsibilities here: Experiment.do() returns the results of the experiment (as a dict), which are then wrapped in a further dict by Experiment.report().
If the experiment returns a list of results dicts instead of just a single set, then by default they are each wrapped in the same parameters and metadata and returned as a list of results dicts.
- Experiment.setUp(params: Dict[str, Any])¶
Set up the experiment. Default does nothing.
Parameters: params – the parameters of the experiment
- Experiment.run(fatal: bool = False) → Dict[str, Dict[str, Any]]¶
Run the experiment, using the parameters set using set(). A "run" consists of calling setUp(), do(), and tearDown(), followed by collecting and storing (and returning) the experiment's results. If running the experiment raises an exception, that will be returned in the metadata along with its traceback to help with experiment debugging. If fatal is True it will also be raised from this method: the default prints the exception but doesn't raise it.
Parameters: fatal – (optional) raise any exceptions (default suppresses them into the results dict)
Returns: a results dict
- Experiment.do(params: Dict[str, Any]) → Union[Dict[str, Any], List[Dict[str, Dict[str, Any]]]]¶
Do the body of the experiment. This should be overridden by sub-classes. Default does nothing.
An experiment can return two types of results:
- a dict mapping names to values for experimental results; or
- a list of results dicts, each of which represents a fully-formed experiment.
Parameters: params – a dict of parameters for the experiment
Returns: the experimental results
- Experiment.tearDown()¶
Tear down the experiment. Default does nothing.
- Experiment.report(params: Dict[str, Any], meta: Dict[str, Any], res: Union[Dict[str, Any], List[Dict[str, Dict[str, Any]]]]) → Dict[str, Dict[str, Any]]¶
Return a properly-structured dict of results. The default returns a dict with results keyed by Experiment.RESULTS, the data point in the parameter space keyed by Experiment.PARAMETERS, and timing and other metadata keyed by Experiment.METADATA. Overriding this method can be used to record extra metadata values, but be sure to call the base method as well.
If the experimental results are a list of results dicts, then we report a results dict whose results are a list of results dicts. This is used by RepeatedExperiment and other experiments that want to report multiple sets of results.
Parameters:
- params – the parameters we ran under
- meta – the metadata for this run
- res – the direct experimental results from do()
Returns: a results dict
Important
Again, if you override any of these methods, be sure to call the base class to get the default management functionality. (There's no such basic functionality for Experiment.do(), though, so it can be overridden freely.)
Note
You can update the parameters controlling the experiment from within Experiment.setUp() and Experiment.do(), and these changes will be saved in the results dict eventually returned by Experiment.run().
Accessing results¶
The easiest way to access an Experiment's results is to store the results dict returned by Experiment.run(). It is also possible to access the results post facto from the Experiment object itself, or using a dict-like interface keyed by name. These operations only make sense on a newly-run Experiment.
- Experiment.success() → bool¶
Test whether the experiment has been run successfully. This will be False if the experiment hasn't been run, or if it's been run and failed.
Returns: True if the experiment has been run successfully
- Experiment.failed() → bool¶
Test whether an experiment failed. This will be True if the experiment has been run and has failed, which means that there will be an exception and traceback information stored in the metadata. It will be False if the experiment hasn't been run.
Returns: True if the experiment has failed
- Experiment.results() → Union[Dict[str, Dict[str, Any]], List[Dict[str, Dict[str, Any]]]]¶
Return a complete results dict. Only really makes sense for recently-executed experimental runs.
Returns: the results dict, or a list of them
- Experiment.experimentalResults() → Union[Dict[str, Any], List[Dict[str, Any]]]¶
Return the experimental results from our last run. This will be None if we haven't been run, or if we ran and failed.
Returns: the experimental results dict, which may be empty, and may be a list of dicts
- Experiment.__getitem__(k: str) → Any¶
Return the given element of the experimental results. This only gives access to the experimental results, not to the parameters or metadata.
Parameters: k – the result key
Returns: the value
Raises: KeyError if there is no such result
- Experiment.parameters() → Dict[str, Any]¶
Return the current experimental parameters, which will be None if none have been given by a call to set().
Returns: the parameters, which may be empty
- Experiment.metadata() → Dict[str, Any]¶
Return the metadata we collected at our last execution, which will be None if we've not been executed and an empty dict if we're mid-run (i.e., if this method is called from do() for whatever reason).
Returns: the metadata, which may be empty
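Together these allow post-run access in the style of the following sketch, re-using the tutorial's CurveExperiment:
e = CurveExperiment()
rc = e.set(dict(x=0, y=0)).run()
if e.success():
    print(e['result'])                         # index into the experimental results
else:
    print(e.metadata()[Experiment.EXCEPTION])  # why the run failed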
ResultSet: A homogeneous collection of results from experiments¶
- class epyc.ResultSet(description: str = None)¶
A "page" in a lab notebook for the results of a particular set of experiments. This will consist of metadata, notes, and a data table resulting from the execution of the experiment. Each experiment runs with a specific set of parameters: the parameter names are fixed once set initially, with the specific values being stored alongside each result. There may be multiple results for the same parameters, to allow for repetition of experiments at a data point. Results committed to result sets are immutable: once entered, a result can't be deleted or changed.
Result sets also record “pending” results, allowing us to record experiments in progress. A pending result can be finalised by providing it with a value, or can be cancelled.
A result set can be used very Pythonically using a results dict holding the metadata, parameters, and results of experiments. For larger experiment sets the results are automatically typed using numpy's dtype system, which both provides more checking and works well with more archival storage formats like HDF5 (see HDF5LabNotebook).
Parameters:
- nb – notebook this result set is part of
- description – (optional) description for the result set (defaults to a datestamp)
Important
Most interactions with results should go through a LabNotebook to allow for management of persistence and so on.
Adding results¶
Results can be added one at a time to the result set. Since result sets are persistent there are no other operations.
- ResultSet.addSingleResult(rc: Dict[str, Dict[str, Any]])¶
Add a single result. This should be a single results dict as returned from an instance of Experiment, containing metadata, parameters, and results.
The results dict may add metadata, parameters, or results to the result set, and these will be assumed to be present from then on. Missing values in previously-saved results will receive default values.
Parameters: rc – a results dict
The LabNotebook.addResult() method has a much more flexible approach to addition that handles adding lists of results at one time.
Retrieving results¶
A result set offers two distinct ways to access results: as results dicts, or as a pandas.DataFrame. The former is often easier on small scales, the latter for large scales.
-
ResultSet.
numberOfResults
() → int¶ Return the number of results in the result set, including any repetitions at the same parameter point.
Returns: the total number of results
-
ResultSet.
__len__
() → int¶ Return the number of results in the result set, including any repetitions at the same parameter point. Equivalent to
numberOfResults()
.Returns: the number of results
-
ResultSet.
results
() → List[Dict[str, Dict[str, Any]]]¶ Return all the results as a list of results dicts. This is useful for avoiding the use of
pandas
and having a more Pythonic interface – which is also a lot less efficient and more memory-hungry.Returns: a list of results dicts
-
ResultSet.
resultsFor
(params: Dict[str, Any]) → List[Dict[str, Dict[str, Any]]]¶ Return all the results for the given parameters as a list of results dicts. This is useful for avoiding the use of
pandas
and having a more Pythonic interface – which is also a lot less efficient and more memory-hungry. The parameters are interpreted as fordataframeFor()
, with lists or other iterators being converted into disjunctions of values.Parameters: params – the parameters Returns: a list of results dicts
-
ResultSet.
dataframe
(only_successful: bool = False) → pandas.core.frame.DataFrame¶ Return all the available results. The results are returned as a pandas DataFrame object, which is detached from the results held in the result set, thereby keeping the result set itself immutable.
You can pre-filter the contents of the dataframe to only include results for specific parameter values using
dataframeFor()
. You can also discard any unsuccessful results using the only_successful flag.Parameters: only_successful – (optional) filter out any failed results (defaults to False) Returns: a dataframe of results
-
ResultSet.
dataframeFor
(params: Dict[str, Any], only_successful: bool = False) → pandas.core.frame.DataFrame¶ Extract a dataframe of the results for only the given set of parameters. These need not be all the parameters for the experiments, so it’s possible to project out all results for a subset of the parameters. If a parameter is mapped to an iterator or list then these are treated as disjunctions and select all results with any of these values for that parameter.
An empty set of parameters filters out nothing and so returns all the results. This is far less efficient than calling
dataframe()
.The results are returned as a pandas DataFrame object, which is detached from the results held in the result set, thereby keeping the result set itself immutable.
You can also discard any unsuccessful results using the only_successful flag.
Parameters: - params – a dict of parameters and values
- only_successful – (optional) filter out any failed results (defaults to False)
Returns: a dataframe containing results matching the parameter constraints
Important
The results dict access methods return all experiments, or all that have the specified parameters, regardless of whether they were successful or not. The dataframe access methods can pre-filter to extract only the successful experiments.
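A sketch of the two access styles, assuming a result set rs holding results over hypothetical parameters a and b:

df = rs.dataframe(only_successful=True)   # all successful results, as a DataFrame
df_a = rs.dataframeFor({'a': [1, 2]})     # only results where a == 1 or a == 2
rcs = rs.resultsFor({'b': 0.5})           # the same filtering, as a list of results dicts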
Parameter ranges¶
A result set can hold results for a range of parameter values. These are all returned as part of the results dicts or dataframes, but it can be useful to access them alone as well, independently of specific results. The ranges returned by these methods refer only to real results.
-
ResultSet.
parameterRange
(param: str) → Set[Any]¶ Return all the values for this parameter for which we have results.
Parameters: param – the parameter name Returns: a collection of values for which we have data
-
ResultSet.
parameterSpace
() → Dict[str, Any]¶ Return a dict mapping parameter names to all their values, which is the space of all possible parameter points at which results could have been collected. This does not guarantee that all combinations of values have results associated with them: that function is provided by
parameterCombinations()
.Returns: a dict mapping parameter names to their ranges
-
ResultSet.
parameterCombinations
() → List[Dict[str, Any]]¶ Return a list of all combinations of parameters for which we have results, as a list of dicts. This means that there are results (possibly more than one set) associated with the combination of parameters in each dict. The ranges of the parameters can be found using
parameterSpace()
.Returns: a list of dicts
Managing pending results¶
Pending results are those that are in the process of being computed based on a set of experimental parameters.
-
ResultSet.
pendingResults
() → List[str]¶ Return the job identifiers of all pending results.
Returns: a list of pending job identifiers
-
ResultSet.
numberOfPendingResults
() → int¶ Return the number of pending results.
Returns: the number of pending results
-
ResultSet.
pendingResultsFor
(params: Dict[str, Any]) → List[str]¶ Return the ids of all pending results with the given parameters. Not all parameters have to be provided, allowing for partial matching.
Parameters: params – the experimental parameters Returns: a list of job ids
-
ResultSet.
pendingResultParameters
(jobid: str) → Dict[str, Any]¶ Return a dict of the parameters for the given pending result.
Parameters: jobid – the job id Returns: a dict of parameter values
-
ResultSet.
ready
() → bool¶ Test whether all the results are ready – that is, none are pending.
Returns: True if all pending results have been either resolved or cancelled
Three methods within the interface are used by LabNotebook
to manage
pending results. They shouldn’t be needed from user code.
-
ResultSet.
addSinglePendingResult
(params: Dict[str, Any], jobid: str)¶ Add a pending result for the given point in the parameter space under the given job identifier. The identifier will generally be meaningful to the lab that submitted the request, and must be unique.
Parameters: - params – the experimental parameters
- jobid – the job id
-
ResultSet.
cancelSinglePendingResult
(jobid: str)¶ Cancel a pending job. This records the cancellation using a
CancelledException
, storing a traceback to show where the cancellation was triggered from. User code should callLabNotebook.cancelPendingResult()
rather than using this method directly.Cancelling a result generates a message to standard output.
Parameters: jobid – the job id
-
ResultSet.
resolveSinglePendingResult
(jobid: str)¶ Resolve the given pending result. This drops the job from the pending results table. User code should call
LabNotebook.resolvePendingResult()
rather than using this method directly, since this method doesn’t actually store the completed pending result: it simply removes the job’s pending status.Parameters: jobid – the job id
Metadata access¶
The result set gives access to its description and the names of the various elements it stores. These names may change over time, if for example you add a results dict that has more results than those added earlier.
-
ResultSet.
description
() → str¶ Return the free text description of the result set.
Returns: the description
-
ResultSet.
setDescription
(d: str)¶ Set the free text description of the result set.
Parameters: d – the description
Important
You can change the description of a result set after it’s been created – but you can’t change any results that’ve been added to it.
-
ResultSet.
names
() → Dict[str, Optional[List[str]]]¶ Return a dict of sets of names, corresponding to the entries in the results dicts for this result set. If only pending results have so far been added the
Experiment.METADATA
andExperiment.RESULTS
sets will be empty.Returns: the dict of names
-
ResultSet.
metadataNames
() → List[str]¶ Return the set of metadata names associated with this result set. If no results have been submitted, this set will be empty.
Returns: the set of experimental metadata names
-
ResultSet.
parameterNames
() → List[str]¶ Return the set of parameter names associated with this result set. If no results (pending or real) have been submitted, this set will be empty.
Returns: the set of experimental parameter names
-
ResultSet.
resultNames
() → List[str]¶ Return the set of result names associated with this result set. If no results have been submitted, this set will be empty.
Returns: the set of experimental result names
The result set can also have attributes set, which can be accessed either using methods or by treating the result set as a dict.
-
ResultSet.
setAttribute
(k: str, v: Any)¶ Set the given attribute.
Parameters: - k – the key
- v – the attribute value
-
ResultSet.
getAttribute
(k: str) → Any¶ Retrieve the given attribute. A KeyError will be raised if the attribute doesn’t exist.
Parameters: k – the attribute Returns: the attribute value
-
ResultSet.
keys
() → Set[str]¶ Return the set of attributes.
Returns: the attribute keys
-
ResultSet.
__contains__
(k: str)¶ True if there is an attribute with the given name.
Parameters: k – the attribute Returns: True if that attribute exists
-
ResultSet.
__setitem__
(k: str, v: Any)¶ Set the given attribute. The dict-like form of
setAttribute()
.Parameters: - k – the key
- v – the attribute value
-
ResultSet.
__getitem__
(k: str) → Any¶ Retrieve the given attribute. The dict-like form of
getAttribute()
.Parameters: k – the attribute Returns: the attribute value
-
ResultSet.
__delitem__
(k: str)¶ Delete the named attribute. This method is invoked by the
del
operator. A KeyError will be raised if the attribute doesn’t exist.Parameters: k – the attribute
There are various uses for these attributes: see Making data archives for one common use case.
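A sketch of the two styles of attribute access (the attribute names are hypothetical):

rs.setAttribute('source', 'simulation v2')   # method style
rs['contact'] = 'someone@example.com'        # dict style, equivalent to setAttribute()
print(rs['source'])
if 'contact' in rs:
    del rs['contact']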
Important
The length of a result set (ResultSet.__len__()
) refers to the
number of results, not to the number of attributes (as would be the
case for a dict).
Locking¶
Once the set of experiments to be held in a result set is finished, it’s probably sensible to prevent any further updates. This is accomplished by “finishing” the result set, leaving it locked against any further updates.
-
ResultSet.
finish
()¶ Finish and lock this result set. This cancels any pending results and locks the result set against future additions. This is useful to tidy up after experiments are finished, and protects against accidentally re-using a result set for something else.
One can check the lock in two ways, either by polling or as an assertion that
raises a ResultSetLockedException
when called on a locked result set. This
is mainly used to protect update methods.
-
ResultSet.
isLocked
() → bool¶ Returns true if the result set is locked.
Returns: True if the result set is locked
-
ResultSet.
assertUnlocked
()¶ Tests whether the result set is locked, and raises a
ResultSetLockedException
if so. This is used to protect update methods, since locked result sets are never updated.
Dirtiness¶
Adding results or pending results to a result set makes it dirty, in need of storing if being used with a persistent notebook. This is used to avoid unnecessary writing of unchanged data.
-
ResultSet.
dirty
(f: bool = True)¶ Mark the result set as dirty (the default) or clean.
Parameters: f – True if the result set is dirty
-
ResultSet.
isDirty
() → bool¶ Test whether the result set is dirty, i.e., if its contents need persisting (if the containing notebook is persistent).
Returns: True if the result set is dirty
Type mapping and inference¶
A result set types all the elements within a results dict using numpy
’s
“dtype” data type system.
Note
This approach is transparent to user code, and is explained here purely for the curious.
There are actually two types involved: the dtype of results dicts formed from the metadata, parameters, and experimental results added to the result set; and the dtype of pending results which includes just the parameters.
-
ResultSet.
dtype
() → numpy.dtype¶ Return the dtype of the results, combining the metadata, parameters, and results elements.
Returns: the dtype
-
ResultSet.
pendingdtype
() → numpy.dtype¶ Return the dtype of pending results, using just parameter elements.
Returns: the dtype
The default type mapping maps each Python type we expect to see to a corresponding
dtype
. The type mapping can be changed on a per-result set basis if required.
-
ResultSet.
TypeMapping
¶ Default type mapping from Python types to
numpy
dtypes
.
There is also a mapping from numpy
type kinds to appropriate default values, used
to initialise missing fields.
-
ResultSet.
zero
(dtype: numpy.dtype) → Any¶ Return the appropriate “zero” for the given simple dtype.
Parameters: dtype – the dtype Returns: “zero”
The type mapping is used to generate a dtype for each Python type, but preserving
any numpy
types used.
-
ResultSet.
typeToDtype
(t: type) → numpy.dtype¶ Return the dtype of the given Python type. An exception is thrown if there is no appropriate mapping.
Parameters: t – the (Python) type Returns: the dtype of the value
-
ResultSet.
valueToDtype
(v: Any) → numpy.dtype¶ Return the dtype of a Python value. An exception is thrown if there is no appropriate mapping.
Parameters: v – the value Returns: the dtype
The result set infers the numpy
-level types automatically as results (and pending
results) are added.
-
ResultSet.
inferDtype
(rc: Dict[str, Dict[str, Any]])¶ Infer the dtype of the given result dict. This will include all the standard and exceptional metadata defined for an
Experiment
, plus the parameters and results (if present) for the results dict.If more elements are provided than have previously been seen, the underlying results dataframe will be extended with new columns.
This method will be called automatically if no explicit dtype has been provided for the result set by a call to
setDtype()
.Returns: the dtype
-
ResultSet.
inferPendingResultDtype
(params: Dict[str, Any])¶ Infer the dtype of the pending results of given dict of experimental parameters. This is essentially the same operation as
inferDtype()
but restricted to experimental parameters and including a string job identifier.Parameters: params – the experimental parameters Returns: the pending results dtype
This behaviour can be sidestepped by explicitly setting the dtypes (with care!).
-
ResultSet.
setDtype
(dtype) → numpy.dtype¶ Set the dtype for the results. This should be done with care, ensuring that the element names all match. It does however allow precise control over the way data is stored (if required).
Parameters: dtype – the dtype
-
ResultSet.
setPendingResultDtype
(dtype) → numpy.dtype¶ Set the dtype for pending results. This should be done with care, ensuring that the element names all match.
Parameters: dtype – the dtype
The progressive nature of typing a result set means that the type may change as new results are added. This “type-level dirtiness” is controlled by two methods:
-
ResultSet.
typechanged
(f: bool = True)¶ Mark the result set as having changed type (the default) or not.
Parameters: f – True if the result set has changed type
-
ResultSet.
isTypeChanged
() → bool¶ Test whether the result set has changed its metadata, parameters, or results. This is used by persistent notebooks to re-construct the backing storage.
Returns: True if the result set has changed type
LabNotebook
: A persistent store for results¶
-
class
epyc.
LabNotebook
(name: str = '', description: str = None)¶ A “laboratory notebook” collecting together the results obtained from different sets of experiments. A notebook is composed of
ResultSet
objects, which are homogeneous collections of results of experiments performed at different values for the same set of parameters. Each result set is tagged for access, with the notebook using one result set as “current” at any time.The notebook collects together pending results from all result sets so that they can be accessed uniformly. This is used by labs to resolve pending results if there are multiple sets of experiments running simultaneously.
Result sets can be added to and deleted from notebooks freely, but they are themselves immutable: once added, their contents cannot be changed.
Parameters: - name – (optional) the notebook name (may be meaningful for sub-classes)
- description – (optional) a free text description
Metadata access¶
-
LabNotebook.
name
() → str¶ Return the name of the notebook. If the notebook is persistent, this likely relates to its storage in some way (for example a file name).
Returns: the notebook name or None
-
LabNotebook.
description
() → str¶ Return the free text description of the notebook.
Returns: the notebook description
-
LabNotebook.
setDescription
(d: str)¶ Set the free text description of the notebook.
Parameters: d – the description
Persistence¶
Notebooks may be persistent, storing results and metadata to disc. The default implementation is simply in-memory and volatile. Committing a notebook ensures its data is written-through to persistent storage (where applicable).
-
LabNotebook.
isPersistent
() → bool¶ By default notebooks are not persistent.
Returns: False
-
LabNotebook.
commit
()¶ Commit to persistent storage. By default does nothing. This should be called periodically to save intermediate results: it may happen automatically in some sub-classes, depending on their implementation.
With blocks¶
Notebooks support with
blocks, like files. For persistent notebooks
this will ensure that the notebook is committed. (For the default in-memory
notebook this does nothing.)
-
LabNotebook.
open
()¶ Open and close the notebook using a
with
block. For persistent notebooks this will cause the notebook to be committed to persistent storage in a robust manner.
(See JSON file access and HDF5 file access for examples of this method in use.)
The with
block approach is slightly more robust than the explicit
use of LabNotebook.commit()
as the notebook will be committed
even if exceptions are thrown while it is open, ensuring no changes
are lost accidentally. However notebooks are often held open for a
long time while experiments are run and/or analysed, so the explicit
commit can be more natural.
Result sets¶
Results are stored as ResultSet
objects, each with a unique tag.
The notebook allows them to be created, and to be selected to receive
results.
They can also be deleted altogether.
-
LabNotebook.
addResultSet
(tag: str, description: str = None) → epyc.resultset.ResultSet¶ Start a new experiment. This creates a new result set to hold the results, which will receive any results and notes.
Parameters: - tag – unique tag for this result set - description – (optional) free text description of the result set Returns: the result set
-
LabNotebook.
deleteResultSet
(rs: Union[str, epyc.resultset.ResultSet])¶ Delete a result set. The default result set can’t be deleted: this ensures that a notebook always has at least one result set.
Parameters: rs – the result set or its tag
-
LabNotebook.
resultSet
(tag: str) → epyc.resultset.ResultSet¶ Return the tagged result set.
Parameters: tag – the tag Returns: the result set
-
LabNotebook.
resultSets
() → List[str]¶ Return the tags for all the result sets in this notebook.
Returns: a list of keys
-
LabNotebook.
keys
() → List[str]¶ Return the result set tags in this notebook. The same as
resultSets()
.Returns: the result set tags
-
LabNotebook.
numberOfResultSets
() → int¶ Return the number of result sets in this notebook.
Returns: the number of result sets
-
LabNotebook.
__len__
() → int¶ Return the number of result sets in this notebook. Same as
numberOfResultSets()
.Returns: the number of result sets
-
LabNotebook.
__contains__
(tag: str) → bool¶ Tests if the given result set is contained in this notebook.
Parameters: tag – the result set tag Returns: True if the result set exists
-
LabNotebook.
resultSetTag
(rs: epyc.resultset.ResultSet) → str¶ Return the tag associated with the given result set.
Parameters: rs – the result set Returns: the tag
-
LabNotebook.
current
() → epyc.resultset.ResultSet¶ Return the current result set.
Returns: the result set
-
LabNotebook.
currentTag
() → str¶ Return the tag of the current result set.
Returns: the tag
-
LabNotebook.
select
(tag: str) → epyc.resultset.ResultSet¶ Select the given result set as current. Sub-classes may use this to manage memory, for example by swapping-out non-current result sets.
Parameters: tag – the tag Returns: the result set
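A sketch of creating and selecting result sets:

from epyc import LabNotebook

nb = LabNotebook(description='my experiments')
nb.addResultSet('sweep-1', description='first parameter sweep')
nb.addResultSet('sweep-2')
print(nb.resultSets())     # the tags of all the result sets
nb.select('sweep-1')       # make sweep-1 the current result set
print(nb.currentTag())     # 'sweep-1'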
Conditional creation of result sets¶
Sometimes it’s useful to create a result set in an “all or nothing” fashion: if it already exists then do nothing.
-
LabNotebook.
already
(tag: str, description: str = None) → bool¶ Check whether a result set exists. If it does, select it and return True; if it doesn’t, add it and return False. This is a single-call combination of
contains()
andselect()
that’s useful for avoiding repeated computation.Parameters: - tag – the result set tag
- description – (optional) description if a result set is created
Returns: True if the set existed
Note
See the Lab.createWith()
method for a more convenient way to
use this function.
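For example (a sketch, with a hypothetical tag):

if not nb.already('expensive-sweep'):
    # the result set didn't exist: it has now been created and selected,
    # so run the experiments that populate it here
    pass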
Result storage and access¶
Results are stored using the results dict structure of parameters, experimental results, and metadata. There may be many results dicts associated with each parameter point.
-
LabNotebook.
addResult
(results: Union[Dict[str, Dict[str, Any]], List[Dict[str, Dict[str, Any]]]], tag: str = None)¶ Add one or more results dicts to the current result set. Each should be a results dict as returned from an instance of
Experiment
, that contains metadata, parameters, and result.The results may include one or more nested results dicts, for example as returned by
RepeatedExperiment
, whose results are a list of results at the same point in the parameter space. In this case the embedded results will themselves be unpacked and added.One may also add a list of results dicts, in which case they will be added individually.
Any structure of results dicts that can’t be handled will raise a
ResultsStructureException
Parameters: - results – a results dict or collection of them
- tag – (optional) result set to add to (defaults to the current result set)
Results can be accessed in a number of ways: all together; as a
pandas.DataFrame
object for easier analysis; or as a list
corresponding to a particular parameter point.
-
LabNotebook.
numberOfResults
(tag: str = None) → int¶ Return the number of results in the tagged dataset.
Params tag: (optional) the result set tag (defaults to the current set) Returns: the number of results
-
LabNotebook.
results
(tag: str = None) → List[Dict[str, Dict[str, Any]]]¶ Return results as a list of results dicts. If no tag is provided, use the current result set. This is a lot slower and more memory-hungry than using
dataframe()
(which is therefore to be preferred), but may be useful for small sets of results that need a more Pythonic interface than that provided by DataFrames. You can pre-filter the results dicts to those matching only some parameters combinations usingresultsFor()
Params tag: (optional) the tag of the result set (defaults to the currently selected result set) Returns: the results dicts
-
LabNotebook.
resultsFor
(params: Dict[str, Any], tag: str = None) → List[Dict[str, Dict[str, Any]]]¶ Return results for the given parameter values a list of results dicts. If no tag is provided, use the current result set. This is a lot slower and more memory-hungry than using
dataframeFor()
(which is therefore to be preferred), but may be useful for small sets of results that need a more Pythonic interface than that provided by DataFrames.Parameters: params – the experimental parameters Returns: results dicts
-
LabNotebook.
dataframe
(tag: str = None, only_successful: bool = True) → pandas.core.frame.DataFrame¶ Return results as a
pandas.DataFrame
. If no tag is provided, use the current result set.If the only_successful flag is set (the default), then the DataFrame will only include results that completed without an exception; if it is set to False, the DataFrame will include all results and also the exception details.
If you are only interested in results corresponding to some sets of parameters you can pre-filter the dataframe using
dataframeFor()
Params tag: (optional) the tag of the result set (defaults to the currently selected result set) Parameters: only_successful – include only successful experiments (defaults to True) Returns: the parameters, results, and metadata in a DataFrame
-
LabNotebook.
dataframeFor
(params: Dict[str, Any], tag: str = None, only_successful: bool = True) → pandas.core.frame.DataFrame¶ Return results for the given parameter values as a
pandas.DataFrame
. If no tag is provided, the current result set is queried. If the only_successful flag is set (the default), then the DataFrame will only include results that completed without an exception; if it is set to False, the DataFrame will include all results and also the exception details.Parameters: - params – the experimental parameters
- only_successful – include only successful experiments (defaults to True)
Params tag: (optional) the tag of the result set (defaults to the currently selected result set)
Returns: the parameters, results, and metadata in a DataFrame
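A sketch of adding and retrieving results at the notebook level (rc is again a results dict from a completed experiment):

nb.addResult(rc)                           # added to the current result set
df = nb.dataframe(only_successful=True)    # successful results as a DataFrame
rcs = nb.resultsFor({'a': 1})              # results dicts for one parameter value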
Pending results¶
Pending results allow a notebook to keep track of on-going
experiments, and are used by some Lab
sub-classes (for
example ClusterLab
) to manage submissions to a compute
cluster. A pending result is identified by some unique identifier,
typically a job id. Pending results can be resolved (have their
results filled in) using LabNotebook.addResult()
, or can be
cancelled, which removes the record from the notebook but not from
the lab managing the underlying job.
Since a notebook can have multiple result sets, the pending results interface is split into three parts. Firstly there are the operations on the currently-selected result set.
-
LabNotebook.
addPendingResult
(params: Dict[str, Any], jobid: str, tag: str = None)¶ Add a pending result for the given point in the parameter space under the given job identifier to the current result set. The identifier will generally be meaningful to the lab that submitted the request, and must be unique.
Parameters: - params – the experimental parameters
- jobid – the job id
- tag – (optional) the tag of the result set receiving the pending result (defaults to the current result set)
-
LabNotebook.
numberOfPendingResults
(tag: str = None) → int¶ Return the number of results pending in the tagged dataset.
Params tag: (optional) the result set tag (defaults to the current set) Returns: the number of results
-
LabNotebook.
pendingResults
(tag: str = None) → List[str]¶ Return the identifiers of the results pending in the tagged dataset.
Params tag: (optional) the result set tag (defaults to the current set) Returns: a set of job identifiers
Secondly, there are operations that work on any result set. You can resolve or cancel a pending result simply by knowing its job id and regardless of which is the currently selected result set.
-
LabNotebook.
resolvePendingResult
(rc: Dict[str, Dict[str, Any]], jobid: str)¶ Resolve the pending result with the given job id with the given results dict. The experimental parameters of the result are sanity-checked against what the result set expected for that job.
The result need not be pending within the current result set: it may be within any result set in the notebook. This will not affect the result set that is selected as current.
Parameters: - rc – the results dict
- jobid – the job id
-
LabNotebook.
cancelPendingResult
(jobid: str)¶ Cancel the given pending result.
The result need not be pending within the current result set: it may be within any result set in the notebook. This will not affect the result set that is selected as current.
Parameters: jobid – the job id
You can also check whether there are pending results remaining in a given result set, defaulting to the currently selected result set.
-
LabNotebook.
ready
(tag: str = None) → bool¶ Test whether all the results in the result set are ready – that is, none are pending.
Params tag: (optional) the result set tag (defaults to the current set) Returns: True if all pending results have been resolved (or cancelled)
-
LabNotebook.
readyFraction
(tag: str = None) → float¶ Return the fraction of results that are available in the tagged result set.
Params tag: (optional) the result set tag (defaults to the current set) Returns: the fraction of available results
Thirdly, there are operations that work on all result sets.
-
LabNotebook.
allPendingResults
() → Set[str]¶ Return the identifiers for all pending results in all result sets.
Returns: a set of job identifiers
-
LabNotebook.
numberOfAllPendingResults
() → int¶ Return the number of results pending in all result sets.
Returns: the total number of pending results
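A sketch of the pending-results lifecycle (the job identifier is hypothetical):

nb.addPendingResult({'a': 1}, 'job-17')   # record an experiment in progress
print(nb.ready())                         # False while the job is outstanding
# ... later, when the job completes with results dict rc ...
nb.resolvePendingResult(rc, 'job-17')
print(nb.readyFraction())                 # 1.0 once nothing is pending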
Locking the notebook¶
Locking a notebook prevents further updates: result sets cannot be added, all pending results are cancelled, and all individual result sets locked. Locking is preserved for persistent notebooks, so once locked a notebook is locked forever.
-
LabNotebook.
finish
(commit: bool = True)¶ Mark the entire notebook as finished, closing and locking all result sets against further changes. Finishing a persistent notebook commits it.
By default the finished notebook is committed as such. In certain cases it may be desirable to finish the notebook but not commit it, i.e., to stop updates in memory without changing the backing file. Setting
commit=False
will accomplish this.Parameters: commit – (optional) commit the notebook (defaults to True)
-
LabNotebook.
isLocked
() → bool¶ Returns true if the notebook is locked.
Returns: True if the notebook is locked
Lab
: An environment for running experiments¶
-
class
epyc.
Lab
(notebook: epyc.labnotebook.LabNotebook = None, design: epyc.design.Design = None)¶ A laboratory for computational experiments.
A
Lab
conducts an experiment at different points in a multi-dimensional parameter space. The default performs all the experiments locally; sub-classes exist to perform remote parallel experiments.A
Lab
stores its result in a notebook, an instance ofLabNotebook
. By default the baseLab
class uses an in-memory notebook, essentially just a dict; sub-classes use persistent notebooks to manage larger sets of experiments.Each lab has an associated
Design
that turns a set of parameter ranges into a set of individual “points” of the parameter space at which to perform actual experiments. The default is to use aFactorialDesign
that performs an experiment for every combination of parameter values. This might be a lot of experiments, and other designs can be used to reduce or modify the space.Parameters: - notebook – the notebook used to store results (defaults to an empty
LabNotebook
) - design – the experimental design to use (defaults to a
FactorialDesign
)
- notebook – the notebook used to store results (defaults to an empty
Lab creation and management¶
-
Lab.
__init__
(notebook: epyc.labnotebook.LabNotebook = None, design: epyc.design.Design = None)¶ Create a lab using the given notebook (defaulting to an in-memory LabNotebook) and experimental design (defaulting to a FactorialDesign).
-
Lab.
open
()¶ Open a lab for business. Sub-classes might insist that they are opened and closed explicitly when experiments are being performed. The default does nothing.
-
Lab.
close
()¶ Shut down a lab. Sub-classes might insist that they are opened and closed explicitly when experiments are being performed. The default does nothing.
-
Lab.
updateResults
()¶ Update the lab’s results. This method is called by all other methods that return results in some sense, and may be overridden to let the results “catch up” with external processing. The default does nothing.
Parameter management¶
A Lab
is equipped with a multi-dimensional parameter space
over which to run experiments, one experiment per point. The
dimensions of the space can be defined by single values, lists, or
iterators that give the points along that dimension. Strings are
considered to be single values, even though they’re technically
iterable in Python. Experiments are then conducted on the cross
product of the dimensions.
-
Lab.
addParameter
(k: str, r: Any)¶ Add a parameter to the experiment’s parameter space. k is the parameter name, and r is its range. The range can be a single value or a list, or any other iterable. (Strings are counted as single values.)
Parameters: - k – parameter name
- r – parameter range
-
Lab.
parameters
() → List[str]¶ Return a list of parameter names.
Returns: a list of parameter names
-
Lab.
__len__
() → int¶ The length of an experiment is the total number of data points that will be explored. This is the length of the experimental configuration returned by
experiments()
.Returns: the number of experimental runs
-
Lab.
__getitem__
(k: str) → Any¶ Access a parameter range using array notation.
Parameters: k – parameter name Returns: the parameter range
-
Lab.
__setitem__
(k: str, r: Any)¶ Add a parameter using array notation.
Parameters: - k – the parameter name
- r – the parameter range
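For example, a parameter space can be built up using array notation:

from epyc import Lab

lab = Lab()
lab['a'] = range(10)          # ten points along dimension a
lab['b'] = [0.1, 0.5, 1.0]    # three points along b
lab['label'] = 'run-1'        # a single (string) value
print(len(lab))               # 30 runs under the default factorial design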
Parameters can be dropped, either individually or en masse, to
prepare the lab for another experiment. This will often accompany
creating or selecting a new result set in the LabNotebook
.
-
Lab.
__delitem__
(k: str)¶ Delete a parameter using array notation.
Parameters: k – the key
-
Lab.
deleteParameter
(k: str)¶ Delete a parameter from the parameter space. If the parameter doesn’t exist then this is a no-op.
Parameters: k – the parameter name
-
Lab.
deleteAllParameters
()¶ Delete all parameters from the parameter space.
Building the parameter space¶
The parameter ranges defined above need to be translated into “points”
in the parameter space at which to conduct experiments. This function
is delegated to the experimental design, an instance of
Design
, which turns ranges into points. The design is
provided at construction time: by default a FactorialDesign
is used, and this will be adequate for most use cases.
-
Lab.
design
() → epyc.design.Design¶ Return the experimental design this lab uses.
Returns: the design
-
Lab.
experiments
(e: epyc.experiment.Experiment) → List[Tuple[epyc.experiment.Experiment, Dict[str, Any]]]¶ Return the experimental configuration, a list consisting of experiments and the points at which they should be run. The structure of the experimental space is defined by the lab’s experimental design, which may also change the experiment to be run.
Parameters: e – the experiment Returns: an experimental configuration
Running experiments¶
Running experiments involves providing an Experiment
object
which can then be executed by setting its parameter point (using Experiment.set()
)
and then run (by calling Experiment.run()
). The Lab
co-ordinates the running of the experiment at all the points chosen by
the design.
-
Lab.
runExperiment
(e: epyc.experiment.Experiment)¶ Run an experiment over all the points in the parameter space. The results will be stored in the notebook.
Parameters: e – the experiment
-
Lab.
ready
(tag: str = None) → bool¶ Test whether all the results are ready in the tagged result set – that is, none are pending.
Parameters: tag – (optional) the result set to check (default is the current result set) Returns: True if the results are in
-
Lab.
readyFraction
(tag: str = None) → float¶ Return the fraction of results available (not pending) in the tagged result set after first updating the results.
Parameters: tag – (optional) the result set to check (default is the current result set) Returns: the ready fraction
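Putting this together (a sketch, with MyExperiment a hypothetical Experiment sub-class):

lab.runExperiment(MyExperiment())   # run at every point chosen by the design
if lab.ready():
    df = lab.dataframe()            # collect the results for analysis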
Conditional experiments¶
Sometimes it is useful to run experiments conditionally, for example
to create a result set only if it doesn’t already
exist. Lab
can do this by providing a function to execute in
order to populate a result set.
Note
This technique works especially well with Jupyter notebooks, to avoid re-computing some cells. See Avoiding repeated computation.
-
Lab.
createWith
(tag: str, f: Callable[[Lab], bool], description: str = None, propagate: bool = True, delete: bool = True, finish: bool = False, deleteAllParameters: bool = True)¶ Use a function to create a result set.
If the result set already exists in the lab’s notebook, it is selected; if it doesn’t, it is created, selected, and the creation function is called. The creation function is passed a reference to the lab it is populating.
By default any exception in the creation function will cause the incomplete result set to be deleted and the previously current result set to be re-selected: this can be inhibited by setting
delete=False
. Any raised exception is propagated by default: this can be inhibited by setting propagate=False
. The result set can be locked after creation by setting finish=True
, as long as the creation was successful: poorly-created result sets aren’t locked.By default the lab has its parameters cleared before calling the creation function, so that it starts “clean”. Set
deleteAllParameters=False
to inhibit this.Parameters: - tag – the result set tag
- f – the creation function (taking Lab as argument)
- description – (optional) description if a result set is created
- propagate – (optional) propagate any exception (defaults to True)
- delete – (optional) delete on exception (default is True)
- finish – (optional) lock the result set after creation (defaults to False)
- deleteAllParameters – (optional) delete all lab parameters before creation (defaults to True)
Returns: True if the result set exists already or was properly created
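A sketch of conditional creation, again with the hypothetical MyExperiment:

def populate(lab):
    # only called if the 'sweep' result set doesn't already exist
    lab['a'] = range(100)
    lab.runExperiment(MyExperiment())
    return True

lab.createWith('sweep', populate, description='a large sweep')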
Accessing results¶
Results of experiments can be accessed directly via the lab’s
underlying LabNotebook
, or as a DataFrame
from
the pandas
analysis package.
-
Lab.
notebook
() → epyc.labnotebook.LabNotebook¶ Return the notebook being used by this lab.
Returns: the notebook
-
Lab.
results
() → List[Dict[str, Dict[str, Any]]]¶ Return the current results as a list of results dicts after resolving any pending results that have completed. This makes use of the underlying notebook’s current result set. For finer control, access the notebook’s
LabNotebook.results()
or LabNotebook.resultsFor() methods directly.Note that this approach to acquiring results is a lot slower and more memory-hungry than using
dataframe()
, but may be useful for small sets of results that benefit from a more Pythonic interface.
-
Lab.
dataframe
(only_successful: bool = True) → pandas.core.frame.DataFrame¶ Return the current results as a pandas DataFrame after resolving any pending results that have completed. This makes use of the underlying notebook’s current result set. For finer control, access the notebook’s
LabNotebook.dataframe()
or LabNotebook.dataframeFor() methods directly.Parameters: only_successful – only return successful results Returns: the resulting dataset as a DataFrame
Persistent storage¶
JSONLabNotebook
: A persistent store in JSON format¶
Note
This style of notebook is fine for storing small datasets, and those
that need to be accessed in a very portable manner, but is very
wasteful for large datasets, for which an HDF5LabNotebook
is almost certainly a better choice.
-
class
epyc.
JSONLabNotebook
(name: str, create: bool = False, description: str = None)¶ A lab notebook that persists itself to a JSON file. This is the most basic kind of persistent notebook, readable by virtually any tooling.
Using JSON presents some disadvantages, as not all types can be represented. Specifically, exceptions from the metadata of failed experiments (with
Experiment.EXCEPTION
) will be saved as strings. We also need to convert datetime objects to ISO-format strings when saving.Parameters: - name – JSON file to persist the notebook to
- create – if True, erase existing file (defaults to False)
- description – free text description of the notebook
Persistence¶
JSON notebooks are persistent, with the data being saved into a file identified by the notebook’s name. Committing the notebook forces a save.
-
JSONLabNotebook.
isPersistent
() → bool¶ Return True to indicate the notebook is persisted to a JSON file.
Returns: True
-
JSONLabNotebook.
commit
()¶ Persist to disc.
JSON file access¶
If you want to make sure that the file is closed and committed after use you can use code such as:
with JSONLabNotebook(name='test.json', create=True).open() as nb:
nb.addResult(rc1)
nb.addResult(rc2)
After this the notebook’s underlying file will be closed, with the new results having been saved.
Structure of the JSON file¶
The version 1 file format is flat and stores all results in a single block. This has been replaced by the version 2 format, whose structure follows that of the result sets in the notebook.
Important
epyc
can still read version 1 JSON notebooks, but will only save
in the version 2 format.
The top level JSON object consists of elements holding the notebook title and some housekeeping attributes. There is also a nested dict holding result sets, keyed by their tag.
Each result set object contains elements for its description and any attributes. There are also two further nested JSON objects, one holding results and one holding pending results. Each result is simply a results dict rendered in JSON; each pending result is a job identifier mapped to the parameters controlling the pending result.
HDF5LabNotebook
: Results stored in a standard format¶
-
class
epyc.
HDF5LabNotebook
(name: str, create: bool = False, description: str = None)¶ A lab notebook that persists itself to an HDF5 file. HDF5 is a very common format for sharing large scientific datasets, allowing
epyc
to interoperate with a larger toolchain.epyc
is built on top of theh5py
Python binding to HDF5, which handles most of the heavy lifting using a lot of machinery for typing and so on matched withnumpy
. Note that the limitations of HDF5’s types mean that some values may have different types when read than when acquired. (See HDF5 type management for details.)The name of the notebook can be a file or a URL. Only files can be created or updated: if a URL is provided then the notebook will be read and immediately marked as locked. This implies that
create=True
won’t work in conjunction with URLs.Important
Note that because of the design of the
requests
library used for handling URLs, using afile:
-schema URL will result in an exception being raised. Use filenames for accessing files.Parameters: - name – HDF5 file or URL backing the notebook
- create – (optional) if True, erase any existing file (defaults to False)
- description – (optional) free text description of the notebook
Managing result sets¶
-
HDF5LabNotebook.
addResultSet
(tag: str, description: str = None) → epyc.resultset.ResultSet¶ Add the necessary structure to the underlying file when creating the new result set. This ensures that, even if no results are added, there will be structure in the persistent store to indicate that the result set was created.
Parameters: - tag – the tag
- description – (optional) the description
Returns: the result set
Persistence¶
HDF5 notebooks are persistent, with the data being saved into a file identified by the notebook’s name. Committing the notebook forces any changes to be saved.
-
HDF5LabNotebook.
isPersistent
() → bool¶ Return True to indicate the notebook is persisted to an HDF5 file.
Returns: True
-
HDF5LabNotebook.
commit
()¶ Persist any changes in the result sets in the notebook to disc.
HDF5 file access¶
The notebook will open the underlying HDF5 file as required, and
generally will leave it open. If you want more control, for example to
make sure that the file is closed and finalised,
HDF5LabNotebook
also behaves as a context manager and so can
be used in code such as:
nb = HDF5LabNotebook(name='test.h5')
with nb.open():
nb.addResult(rc1)
nb.addResult(rc2)
After this the notebook’s underlying file will be closed, with the new
results having been saved. Alternatively simply use
LabNotebook.commit()
to flush any changes to the underlying
file, for example:
nb = HDF5LabNotebook(name='test.h5')
nb.addResult(rc1)
nb.addResult(rc2)
nb.commit()
Remote notebooks¶
Remote notebooks can be accessed by providing a URL instead of a filename to the notebook constructor:
nb = HDF5LabNotebook(name='http://example.com/test.h5')
Since remote updating doesn’t usually work, any notebook loaded from a
URL is treated as “finished” (as though you’d called
LabNotebook.finish()
).
Structure of the HDF5 file¶
Note
The structure inside an HDF5 file is only really of interest if you’re planning on
using an epyc
-generated dataset with some other tools.
HDF5 is a “container” file format, meaning that it behaves like an archive containing
directory-like structure. epyc
structures its storage by using a group for each
result set, held within the “root” group of the container. The root group has
attributes that hold “housekeeping” information about the notebook.
-
HDF5LabNotebook.
VERSION
= 'version'¶ Attribute holding the version of file structure used.
-
HDF5LabNotebook.
DESCRIPTION
= 'description'¶ Attribute holding the notebook and result set descriptions.
-
HDF5LabNotebook.
CURRENT
= 'current-resultset'¶ Attribute holding the tag of the current result set.
Any attributes of the notebook are also written as top-level
attributes in this group. Then, for each
in the
notebook, there is a group whose name corresponds to the result set’s
tag. This group contains any attributes of the result set, always
including three attributes storing the metadata, parameter, and
experimental result field names.
Note
Attributes are all held as strings at the moment. There’s a case for giving them richer types in the future.
The attributes also include the description of the result set and a flag indicating whether it has been locked.
-
HDF5LabNotebook.
LOCKED
= 'locked'¶ Attribute flagging a result set or notebook as being locked to further changes.
Within the group are two datasets: one holding the results of experiments, and one holding pending results yet to be resolved.
-
HDF5LabNotebook.
RESULTS_DATASET
= 'results'¶ Name of results dataset within the HDF5 group for a result set.
-
HDF5LabNotebook.
PENDINGRESULTS_DATASET
= 'pending'¶ Name of pending results dataset within the HDF5 group for a result set.
If there are no pending results then there will be no pending results dataset. This makes for cleaner interaction when archiving datasets, as there are no extraneous datasets hanging around.
So an epyc
notebook containing a result set called “my_data” will
give rise to an HDF5 file containing a group called “my_data”, within
which will be a dataset named by
HDF5LabNotebook.RESULTS_DATASET
and possibly another dataset
named by HDF5LabNotebook.PENDINGRESULTS_DATASET
. There will
also be a group named by LabNotebook.DEFAULT_RESULTSET
which
is where results are put “by default” (i.e., if you don’t define
explicit result sets).
HDF5 type management¶
epyc
takes a very Pythonic view of experimental results, storing
them in a results dict with an unconstrained set of keys and
types: an experiment can store anything it likes as a result. The
ResultSet
class handles mapping Python types to numpy
dtypes: see Type mapping and inference for details.
The HDF5 type mapping follows the numpy
approach closely. Some
types are mapped more restrictively than in numpy
: this is as one
would expect, of course, since HDF5 is essentially an archive format
whose files need to be readable by a range of tools over a long
period. Specifically this affects exceptions, tracebacks, and
datetime
values, all of which are mapped to HDF5 strings (in ISO
standard date format for the latter). Strings are in turn stored in
ASCII, not Unicode.
A little bit of patching happens for “known” metadata values
(specifically Experiment.START_TIME
and
Experiment.END_TIME
) which are automatically patched to
datetime
instances when loaded. List-valued results are supported,
and can be “ragged” (not have the same length) across results.
Warning
Because of the differences between Python’s and HDF5’s type
systems you may not get back a value with exactly the same type as
the one you saved. Specifically, lists come back as numpy
arrays. The values and the behaviours are the same, though. If you
need a specific type, be sure to cast the value before use.
See Type mapping and inference for a list of “safe” types.
Tuning parameters¶
Some parameters are available for tuning the notebook’s behaviour.
The default size of a new dataset can be increased if desired, to pre-allocate space for more results.
-
HDF5LabNotebook.
DefaultDatasetSize
= 10¶ Default initial size for a new HDF5 dataset.
The dataset will expand and contract automatically to accommodate the size of a result set: it’s hard to see why this value would need to be changed.
Low-level protocol¶
The low-level handling of the HDF5 file is performed by a small number of private methods – never needed directly in client code, but possibly in need of sub-classing for some specialist applications.
Three methods handle file creation and access.
-
HDF5LabNotebook.
_create
(name: str)¶ Create the HDF5 file to back this notebook.
Parameters: name – the filename
-
HDF5LabNotebook.
_open
()¶ Open the HDF5 file that backs this notebook.
-
HDF5LabNotebook.
_close
()¶ Close the underlying HDF5 file.
Five other methods control notebook-level and result-set-level I/O. These all assume that the file is opened and closed around them, and will fail if not.
-
HDF5LabNotebook.
_load
()¶ Load the notebook and all result sets.
-
HDF5LabNotebook.
_save
()¶ Save all dirty result sets. These are written out completely.
-
HDF5LabNotebook.
_purge
()¶ Delete any HDF5 datasets that relate to deleted result sets.
-
HDF5LabNotebook.
_read
(tag: str)¶ Read the given result set into memory.
Parameters: tag – the result set tag
-
HDF5LabNotebook.
_write
(tag: str)¶ Write the given result set to the file.
Parameters: tag – the result set tag
There are also two private methods that handle the conversion of
numpy
dtypes to the (ever so slightly different) h5py
dtypes.
-
HDF5LabNotebook.
_HDF5simpledtype
(dtype: numpy.dtype) → numpy.dtype¶ Patch a simple
numpy
dtype to the formats available in HDF5.Parameters: dtype – the numpy
dtypeReturns: the HDF5 dtype
-
HDF5LabNotebook.
_HDF5dtype
(dtype: numpy.dtype) → numpy.dtype¶ Patch a
numpy
dtype into its HDF5 equivalent. This method handles structured types with named fields.Parameters: dtype – the numpy
dtypeReturns: the HDF5 dtype
Extended functionality¶
ExperimentCombinator
: Building experiments from experiments¶
-
class
epyc.
ExperimentCombinator
(ex: epyc.experiment.Experiment)¶ Bases:
epyc.experiment.Experiment
An experiment that wraps-up another, underlying experiment. This is an abstract class that just provides the common wrapping logic.
Experiment combinators aren’t expected to have parameters of their own: they simply use the parameters of their underlying experiment. They may however give rise to metadata of their own, and modify the results returned by running their underlying experiment.
Accessing the underlying experiment¶
-
ExperimentCombinator.
experiment
() → epyc.experiment.Experiment¶ Return the underlying experiment.
Returns: the underlying experiment
-
ExperimentCombinator.
set
(params: Dict[str, Any]) → epyc.experiment.Experiment¶ Set the parameters for the experiment, returning the now-configured experiment.
Parameters: params – the parameters Returns: the experiment combinator itself
-
ExperimentCombinator.
parameters
() → Dict[str, Any]¶ Return the current experimental parameters, taken from the underlying experiment.
Returns: the parameters
RepeatedExperiment
: Repeating an experiment¶
-
class
epyc.
RepeatedExperiment
(ex: epyc.experiment.Experiment, N: int)¶ Bases:
epyc.experimentcombinator.ExperimentCombinator
An experiment combinator that takes a “base” experiment and runs it several times. This means you can define a single experiment separately from its repeating logic.
When run, a repeated experiment runs a number of repetitions of the underlying experiment at the same point in the parameter space. The result of the repeated experiment is the list of results from the underlying experiment. If the underlying experiment itself returns a list of results, these are all flattened into a single list.
Parameters: - ex – the underlying experiment - N – the number of repetitions to perform
Performing repetitions¶
-
RepeatedExperiment.
__init__
(ex: epyc.experiment.Experiment, N: int)¶ Create a combinator based on the given experiment.
Parameters: - ex – the underlying experiment - N – the number of repetitions to perform
-
RepeatedExperiment.
repetitions
() → int¶ Return the number of repetitions of the underlying experiment we expect to perform.
Returns: the number of repetitions
Extra metadata elements in the results dict¶
-
RepeatedExperiment.
REPETITIONS
¶ Metadata element for number of repetitions performed.
Running the experiment¶
-
RepeatedExperiment.
do
(params: Dict[str, Any]) → List[Dict[str, Dict[str, Any]]]¶ Perform the number of repetitions we want. The results returned will be a list of the results dicts generated by the repeated experiments. The metadata for each experiment will include an entry
RepeatedExperiment.REPETITIONS
for the number of repetitions that occurred (which will be the length of this list) and an entryRepeatedExperiment.I
for the index of the result in that sequence.Parameters: params – the parameters to the experiment Returns: a list of result dicts
SummaryExperiment
: Statistical summaries of experiments¶
-
class
epyc.
SummaryExperiment
(ex: epyc.experiment.Experiment, summarised_results: List[str] = None)¶ Bases:
epyc.experimentcombinator.ExperimentCombinator
An experiment combinator that takes an underlying experiment and returns summary statistics for some of its results. This only really makes sense for experiments that return lists of results, such as those conducted using
RepeatedExperiment
, but it works with any experiment.When run, a summary experiment summarises the experimental results, creating a new set of results that include the mean and variance for each result that the underyling experiments generated. (You can also select which results to summarise.) The raw results are discarded. The new results have the names of the raw results with suffices for mean, median, variance, and extrema.
The summarisation obviously only works on result keys coming from the underlying experiments that are numeric. The default behaviour is to try to summarise all keys: you can restrict this by providing a list of keys to the constructor in the summarised_results keyword argument. Attempts to summarise non-numeric results will be ignored (with a warning).
The summary calculations only include those experimental runs that succeeded, that is that have their status set to True. Failed runs are ignored.
Extra metadata elements in the results dict¶
Summarisation removes the raw results of the various experiments from the results dict and replaces them with summary values. Each summarised value is replaced by five derived values for the mean, median, variance, and extrema, with standard suffixes.
-
SummaryExperiment.
MEAN_SUFFIX
¶ Suffix for the mean of the underlying values.
-
SummaryExperiment.
MEDIAN_SUFFIX
¶ Suffix for the median of the underlying values.
-
SummaryExperiment.
VARIANCE_SUFFIX
¶ Suffix for the variance of the underlying values.
-
SummaryExperiment.
MIN_SUFFIX
¶ Suffix for the minimum of the underlying values.
-
SummaryExperiment.
MAX_SUFFIX
¶ Suffix for the maximum of the underlying values.
The metadata also enumerates the number of experiments performed, the number summarised (since unsuccessful experiments are omitted), and any exceptions raised.
-
SummaryExperiment.
UNDERLYING_RESULTS
¶ Metadata element for the number of results that were obtained.
-
SummaryExperiment.
UNDERLYING_SUCCESSFUL_RESULTS
¶ Metadata element for the number of results that were summarised.
Running the experiment¶
-
SummaryExperiment.
do
(params: Dict[str, Any]) → Dict[str, Any]¶ Perform the underlying experiment and summarise its results. Our results are the summary statistics extracted from the results of the instances of the underlying experiment that we performed.
We drop from the calculations any experiments whose completion status was False, indicating an error. Our own completion status will be True unless we had an error summarising a field (usually caused by trying to summarise non-numeric data).
We record the exceptions generated by any experiment we summarise under the metadata key
SummaryExperiment.UNDERLYING_EXCEPTIONS
Parameters: params – the parameters to the underlying experiment Returns: the summary statistics of the underlying results
Creating and changing the summary statistics¶
-
SummaryExperiment.
summarise
(results: List[Dict[str, Dict[str, Any]]]) → Dict[str, Dict[str, Any]]¶ Generate a summary of results from a list of experimental results dicts returned by running the underlying experiment. By default we generate mean, median, variance, and extrema for each value recorded.
Override this method to create different or extra summary statistics.
Parameters: results – an array of experimental results dicts Returns: a dict of summary statistics
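As a minimal sketch of such an override, assuming the underlying experiments return a numeric value under a hypothetical key 'x' and using numpy for the calculation, we might add a percentile to the standard statistics:
import numpy
from epyc import Experiment, SummaryExperiment

class PercentileSummary(SummaryExperiment):
    def summarise(self, results):
        # keep the standard statistics...
        summary = super().summarise(results)

        # ...and add a 90th percentile of the 'x' values
        xs = [rc[Experiment.RESULTS]['x'] for rc in results]
        summary['x_p90'] = numpy.percentile(xs, 90)
        return summary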
Parallel experiments¶
ParallelLab
: Running experiments locally in parallel¶
-
class
epyc.
ParallelLab
(notebook: epyc.labnotebook.LabNotebook = None, cores: int = 0)¶ A
Lab
that uses local parallelism. Unlike a basic
Lab
, this class runs multiple experiments in parallel to accelerate throughput. Unlike a ClusterLab
it runs all jobs synchronously and locally, and so can’t make use of a larger compute cluster infrastructure or run tasks in the background to be collected later. This does, however, mean that epyc
can make full use of a multicore machine quite trivially. The optional
cores
parameter selects the number of cores to use:
- a value of 1 uses 1 core (sequential mode);
- a value of +n uses n cores;
- a value of 0 uses all available cores; and
- a value of -n uses (available - n) cores.
So a value of
cores=-1
will run on one fewer core than the total number of physical cores available on the machine.
Important
This behaviour is slightly different to that of
joblib
as described here. Note that you can specify more cores than there are physical cores on the machine: this will have no positive effect. Note also that using all the cores on a machine may lock you out of the user interface while your experiments consume all available computational resources, and may be regarded as an unfriendly act by any other users with whom you share the machine.
Parameters: - notebook – (optional) the notebook used to store results
- cores – (optional) number of cores to use (defaults to all available)
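For example, here is a minimal sketch that leaves one core free for interactive work (the experiment class MyExperiment and the file name are hypothetical):
from epyc import ParallelLab, HDF5LabNotebook

lab = ParallelLab(notebook=HDF5LabNotebook('results.h5', create=True),
                  cores=-1)   # use all physical cores except one
lab['a'] = range(10)
lab.runExperiment(MyExperiment())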
-
ParallelLab.
numberOfCores
() → int¶ Return the number of cores we will use to run experiments.
Returns: maximum number of concurrent experiments
Running experiments¶
As with the sequential Lab
class, experiments run on a
ParallelLab
will be run synchronously: the calling thread
will block until all the experiments have completed.
Note
If you need asynchronous behaviour then you need to use a ClusterLab
.
-
ParallelLab.
runExperiment
(e: epyc.experiment.Experiment)¶ Run the experiment across the parameter space in parallel using the allowed cores. The experiments are all run synchronously.
Parameters: e – the experiment
Warning
ParallelLab
uses Python’s joblib
internally to create
parallelism, and joblib
in turn creates sub-processes in which to
run experiments. This means that the experiment is running in a
different process than the lab, and hence in a different address
space. The upshot of this is that any changes made to variables in
an experiment will only be visible to that experiment, and won’t be
seen by either other experiments or the lab. You can’t, for
example, have a class variable that’s accessed and updated by all
instances of the same experiment: this would work in a “normal”
Lab
, but won’t work on a ParallelLab
(or indeed
on a ClusterLab
).
The way to avoid any issues with this is to only communicate via
the Experiment
API, accepting parameters to set the
experiment up and returning values through a results dict.
Any updates to experimental parameters or metadata are also
communicated correctly (see Advanced experimental parameter handling).
ClusterLab
: Flexible, parallel, asynchronous experiments¶
-
class
epyc.
ClusterLab
(notebook: epyc.labnotebook.LabNotebook = None, url_file=None, profile=None, profile_dir=None, ipython_dir=None, context=None, debug=False, sshserver=None, sshkey=None, password=None, paramiko=None, timeout=10, cluster_id=None, **extra_args)¶ A
Lab
running on an ipyparallel
compute cluster. Experiments are submitted to engines in the cluster for execution in parallel, with the experiments being performed asynchronously to allow for disconnection and subsequent retrieval of results. Combined with a persistent
LabNotebook
, this allows fully decoupled access to an on-going computational experiment, with piecewise retrieval of results. This class requires a cluster to already be set up and running, configured for persistent access, with access to the necessary code and libraries, and with appropriate security information available to the client.
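A minimal construction sketch, assuming a cluster is already running under an ipyparallel profile (the profile and file names are illustrative):
from epyc import ClusterLab, HDF5LabNotebook

lab = ClusterLab(profile='mycluster',
                 notebook=HDF5LabNotebook('results.h5', create=True))
print(lab.numberOfEngines())   # the degree of available parallelism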
Interacting with the cluster¶
The ClusterLab
can be queried to determine the number of
engines available in the cluster to which it is connected, which
essentially defines the degree of available parallelism. The lab also
provides a ClusterLab.sync_imports()
method that allows modules
to be imported into the namespace of the cluster’s engines. This needs
to be done before running experiments, to make all the code used by an
experiment available in the cluster.
-
ClusterLab.
numberOfEngines
() → int¶ Return the number of engines available to this lab.
Returns: the number of engines
-
ClusterLab.
engines
() → ipyparallel.client.view.DirectView¶ Return a list of the available engines.
Returns: a list of engines
-
ClusterLab.
sync_imports
(quiet: bool = False) → contextlib.AbstractContextManager¶ Return a context manager to control imports onto all the engines in the underlying cluster. This method is used within a
with
statement. Any imports should be done with no experiments running, otherwise the method will block until the cluster is quiet. Generally imports will be one of the first things done when connecting to a cluster. (But be careful not to accidentally re-import if re-connecting to a running cluster.)
Parameters: quiet – if True, suppresses messages (defaults to False) Returns: a context manager
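For example, assuming lab is a connected ClusterLab, imports are typically done like this (the module names are illustrative):
# import the modules experiments need into every engine's namespace
with lab.sync_imports():
    import numpy
    import networkx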
Running experiments¶
Cluster experiments are run as with a normal Lab
, by setting
a parameter space and submitting an experiment to ClusterLab.runExperiment()
.
The experiment is replicated and passed to each engine, and
experiments are run on points in the parameter space in
parallel. Experiments are run asynchronously: runExperiment()
returns as soon as the experiments have been sent to the cluster.
-
ClusterLab.
runExperiment
(e: epyc.experiment.Experiment)¶ Run the experiment across the parameter space in parallel using all the engines in the cluster. This method returns immediately.
The experiments are run asynchronously, with the points in the parameter space explored in random order so that intermediate retrievals of results are more representative of the overall result set. Put another way, for many experiments the available results converge towards the final answer, so intermediate results can be plotted to watch the answer emerge.
Parameters: e – the experiment
The ClusterLab.readyFraction()
method returns the fraction of
results that are ready for retrieval, i.e., the fraction of the
parameter space that has been explored. ClusterLab.ready()
tests
whether all results are ready. For cases where it is needed (which
will hopefully be few and far between), ClusterLab.wait()
blocks
until all results are ready.
-
ClusterLab.
readyFraction
(tag: str = None) → float¶ Return the fraction of results available (not pending) in the tagged result set after first updating the results.
Parameters: tag – (optional) the result set to check (default is the current result set) Returns: the ready fraction
-
ClusterLab.
ready
(tag: str = None) → bool¶ Test whether all the results are ready in the tagged result set – that is, none are pending.
Parameters: tag – (optional) the result set to check (default is the current result set) Returns: True if the results are in
-
ClusterLab.
wait
(timeout: int = -1) → bool¶ Wait for all pending results in all result sets to be finished. If timeout is set, return after this many seconds regardless.
Parameters: timeout – timeout period in seconds (defaults to forever) Returns: True if all the results completed
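A sketch of blocking with a timeout and then retrieving the results (assuming the notebook's pandas access via dataframe()):
if lab.wait(timeout=600):   # block for at most ten minutes
    df = lab.notebook().dataframe()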
Results management¶
A cluster lab performs its computation remotely, typically on another machine or machines. This means that pending results may become ready spontaneously (from the lab’s perspective). Most of the operations that access results first synchronise the lab’s notebook with the cluster, retrieving any results that have been resolved since the previous check. (Checks can also be carried out directly.)
-
ClusterLab.
updateResults
(purge: bool = False) → int¶ Update our results with any pending results that have completed since we last retrieved results from the cluster. Optionally purges any jobs that have crashed, which can be due to engine failure within the cluster. This prevents individual crashes from blocking the retrieval of other jobs.
Parameters: purge – (optional) cancel any jobs that have crashed (defaults to False) Returns: the number of pending results completed at this call
Connection management¶
A ClusterLab
can be opened and closed to
connect and disconnect from the cluster: the class’ methods do this
automatically, and try to close the connection where possible to avoid
occupying network resources. Closing the connection explicitly will
cause no problems, as it re-opens automatically when needed.
Important
Connection management is intended to be transparent, so there will seldom be a need to use any of these methods directly.
-
ClusterLab.
open
()¶ Connect to the cluster. This may involve several retries.
-
ClusterLab.
close
()¶ Close down the connection to the cluster.
In a very small number of circumstances it may be necessary to take control of (or override) the basic connection functionality, which is provided by two other helper methods.
-
ClusterLab.
connect
()¶ Low-level connection to the cluster. Most code should use
open()
to open the connection: this method performs a single connection attempt, raising an exception if it fails.
-
ClusterLab.
activate
()¶ Make the connection active to ipyparallel/Jupyter. This is usually only needed when there are several labs active in one program, where this method selects the lab used by, for example, parallel magics.
Tuning parameters¶
There is a small set of tuning parameters that can be adjusted to cope with particular circumstances.
-
ClusterLab.
WaitingTime
= 30¶ Waiting time for checking for job completion. Lower values increase network traffic.
-
ClusterLab.
Reconnections
= 5¶ Number of attempts when re-connecting to a cluster.
-
ClusterLab.
Retries
= 3¶ Number of re-tries for failed jobs.
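These are class-level defaults, so one way to adjust them is simply to assign new values before creating a lab (a sketch; the values are illustrative):
from epyc import ClusterLab

ClusterLab.WaitingTime = 60   # poll the cluster less frequently
ClusterLab.Retries = 5        # be more tolerant of failing jobs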
Experimental designs¶
Design
: Experimental designs¶
-
class
epyc.
Design
¶ Base class for experimental designs.
A “design” is a protocol for conducting a set of experiments so as to maximise the amount of useful data collected. It is a common topic in real-world experiments, and can be applied to computational experiments as well.
A design in
epyc
converts a set of experimental parameters into an experimental configuration: a list of pairs, each consisting of an experiment to run and the parameters at which to run it. A design must be able to cope with being passed None as an experiment, in which case it should return None for all the experiments in the configuration: this allows pre-checks to be performed.
A design is associated with each
Lab
. By default the standardFactorialDesign
is used, and no further action is needed. Other designs can be selected at lab creation time.
Creating experiments¶
-
Design.
experiments
(e: epyc.experiment.Experiment, ps: Dict[str, Any]) → List[Tuple[epyc.experiment.Experiment, Dict[str, Any]]]¶ Convert a mapping from parameter names to lists of values into a list of mappings from parameter names to single values, each paired with the experiment to run at that point, according to the requirements of the design. This method must be overridden by sub-classes.
Parameters: ps – a dict of parameter values Returns: an experimental configuration
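As a minimal sketch (not part of epyc), a custom design that runs the experiment only at the first value of each parameter's range might look like this:
from epyc import Design

class FirstPointDesign(Design):
    def experiments(self, e, ps):
        p = {}
        for (k, vs) in ps.items():
            try:
                p[k] = next(iter(vs))   # first value of a range
            except TypeError:
                p[k] = vs               # a singleton value
        return [(e, p)]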
Standard experimental designs¶
epyc comes with a small set of experimental designs: we intend to add more to reflect experience with a wider range of experiments.
FactorialDesign
: All combinations of parameters¶
-
class
epyc.
FactorialDesign
¶ A simple factorial design.
In a factorial design, an experiment is performed for every combination of a lab’s parameters. Essentially this forms the cross-product of all parameter values, returned as a list of dicts. If the lab was set up with the following parameters:
lab['a'] = [1, 2]
lab['b'] = [3, 4]
then this design would generate a space consisting of four points:
- {a=1, b=3}
- {a=1, b=4}
- {a=2, b=3}
- {a=2, b=4}
at which it would run the given experiment. The experiments are returned in random order.
-
FactorialDesign.
experiments
(e: epyc.experiment.Experiment, ps: Dict[str, Any]) → List[Tuple[epyc.experiment.Experiment, Dict[str, Any]]]¶ Form the cross-product of all parameters.
Parameters: ps – a dict of parameter values Returns: an experimental configuration
PointwiseDesign
: Corresponding parameters combined¶
-
class
epyc.
PointwiseDesign
¶ A design whose space is the sequence of values taken from the range of each parameter. If the lab was set up with the following parameters:
lab['a'] = [1, 2]
lab['b'] = [3, 4]
then this design would generate a space consisting of two points:
- {a=1, b=3}
- {a=2, b=4}
This design requires that all parameters have ranges of the same length: if a parameter is a singleton (has only a single value), it will be extended across the whole space. So if the parameters were:
lab['a'] = 1
lab['b'] = [3, 4]
the design would generate:
- {a=1, b=3}
- {a=1, b=4}
-
PointwiseDesign.
experiments
(e: epyc.experiment.Experiment, ps: Dict[str, Any]) → List[Tuple[epyc.experiment.Experiment, Dict[str, Any]]]¶ Form experimental points from corresponding values in the parameter ranges, extending any singletons.
Parameters: ps – a dict of parameter values Returns: an experimental configuration
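Using this design is simply a matter of passing it to the lab at creation time (MyExperiment is a hypothetical experiment):
from epyc import Lab, PointwiseDesign

lab = Lab(design=PointwiseDesign())
lab['a'] = [1, 2]
lab['b'] = [3, 4]
lab.runExperiment(MyExperiment())   # runs at (1, 3) and (2, 4) only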
Exceptions¶
CancelledException
: A result was cancelled¶
-
class
epyc.
CancelledException
¶ An exception stored within the
Experiment
results dict when a pending result is cancelled without completing the experiment. This means that all experiments started either complete successfully (and have their results recorded), fail within the experiment itself (and have that exception stored, without results), or are cancelled (and have this exception and a traceback stored).
ResultSetLockedException
: Trying to change a locked result set¶
-
class
epyc.
ResultSetLockedException
¶ An exception raised if an attempt is made to write new results to a result set that’s been locked by a call to
ResultSet.finish()
.
LabNotebookLockedException
: Trying to change a locked lab notebook¶
-
class
epyc.
LabNotebookLockedException
¶ An exception raised if an attempt is made to write to a notebook that’s been locked by a call to
LabNotebook.finish()
. This includes attempting to add result sets.
PendingResultException
: Unrecognised pending result job identifier¶
-
class
epyc.
PendingResultException
(jobid: str)¶ An exception raised if an invalid pending result job identifier is used. A common cause of this is a pending result that failed on submission and so was never actually started.
Parameters: jobid – the job id
-
PendingResultException.
jobid
() → str¶ Return the unrecognised job id.
Returns: the job id
ResultsStructureException
: Badly-structured results dict (or dicts)¶
-
class
epyc.
ResultsStructureException
(rc: Union[Dict[str, Dict[str, Any]], List[Dict[str, Dict[str, Any]]]])¶ An exception raised when there is a problem with the structure of a results dict.
Parameters: rc – the results dict structure that causes the problem
-
ResultsStructureException.
resultsdict
() → Union[Dict[str, Dict[str, Any]], List[Dict[str, Dict[str, Any]]]]¶ Return the results dict that caused the problem.
Returns: the results dict or list of them
NotebookVersionException
: Unexpected version of a notebook file¶
-
class
epyc.
NotebookVersionException
(expected: str, actual: str)¶ An exception raised when a notebook encounters an unexpected version of a persistent file format.
Parameters: - expected – the expected version
- actual – the actual version
-
NotebookVersionException.
expectedVersion
() → str¶ Return the expected version of the file.
Returns: the expected version
-
NotebookVersionException.
actualVersion
() → str¶ Return the actual version of the file.
Returns: the actual version
DesignException
: Impossible design¶
-
class
epyc.
DesignException
(msg: str)¶ An exception raised whenever a set of parameter ranges can’t be used as the basis for a design.
Command-line interface¶
epyc
includes a simple command-line tool for interacting with HDF5
notebooks. This allows you to edit notebooks without needing to write
Python code, which can be good for curating datasets ready for
publication.
Note
This interface is still at a very early stage of development, and is likely to change considerably in future releases.
The command is unimaginatively called epyc
, and is a “container”
command that provides access to sub-commands for different
operations:
- Copying result sets between notebooks (
epyc copy
) - Selecting a result set as current (
epyc select
) - Deleting a result set (
epyc remove
) - Showing the structure of a notebook (
epyc show
)
The details of each sub-command can be found using the --help
option, for example:
epyc remove --help
A possible curation workflow would be to list all the result sets in
a notebook using epyc show
and then delete any that shouldn’t be
published using epyc remove
. Note that in keeping with epyc
’s
philosophy of immutability you can only remove whole result sets:
there’s no way to remove individual experiments from a result set.
Cookbook¶
This section is a work-in-progress cookbook of using epyc
in practice.
Note
In this cookbook we assume you’re using either Linux or OS X (or some other Unix variant). We don’t venture into Windows, as it’s less used for (and less convenient for) scientific computing.
Reproducing experiments reliably¶
Problem: Over time the version numbers of the packages you use change as their code is developed. You’re worried this might affect your code, either by breaking it or by changing its results somehow.
Solution: This is a real problem with computational science. Fortunately it’s fairly easy to address, at least at a simple level.
Python includes a feature called virtual environments or venvs. A venv is an installation of Python and its libraries that’s closed-off from any other installations you may have on your machine. Essentially it takes the global installation of Python and throws away anything that’s not part of the core distribution. You can “enter” the venv and install exactly those packages you want – and only those packages, with specific version numbers if you like – secure in the knowledge that if the global environment, or another venv, wants different packages and version numbers they won’t interfere with you. You can also “freeze” your venv by grabbing a list of the packages and version numbers installed, and then install this exact environment again later – or indeed elsewhere, on another machine.
Let’s assume we want to create a venv that we’ve imaginatively named
venv
. (You can pick any name you like.) You create venvs from the
command line:
python3 -m venv ./venv
We next need to activate the environment, making it the “current” one that Python will use. This is again done from the command line:
. venv/bin/activate
This alters the various include paths, command paths, and other elements to make sure that, when you execute the Python interpreter or any of the related tools, it runs the ones in the venv and not any others.
We next need to populate the venv, that is, add the packages we
want. We do this using pip
as normal:
pip3 install ipython ipyparallel
Note
In some installations, pip
always refers to the pip
tool
of Python 2.7, while Python 3’s tool is called pip3
. It never
hurts when unsure to run pip3
explicitly if you’re working
with Python 3. Similarly you may find there’s a tool called
python3
or even python3.7
in your venv.
Remember that because we’ve activated the venv, the Python tools we
run (including pip
) are those of the venv, and they affect the
venv: thus this call to pip
will install the latest versions of
ipython
and ipyparallel
just as we’d expect – but into the
venv, not into the global environment. We can call pip
repeatedly
to install all the packages we need. If we then run some Python code
(either interactively or as a script) from the shell in which we
activated the venv, it will use the packages we’ve installed. If we’ve
missed out a package that the code needs, then an exception will be
raised even if the package is available globally: only what’s
explicitly loaded into the venv is available in the venv. Conversely
if we run the same code from a shell in which we haven’t activated
this (or any other) venv, it will run with the packages installed
globally: what happens in the venv stays in the venv.
Suppose we now want to be able to reproduce this venv for later
use. We can use pip
to freeze the state of the venv for us:
pip freeze > requirements.txt
This generates a requirements.txt
file including all the packages
and their version numbers: remember to execute this command from the
shell in which we activated the venv. If we later want to reproduce
this environment, so we’re sure of the package versions our code will
use, we can create another venv that uses this file to reproduce the
frozen venv:
python3 -m venv ./venv2
. venv2/bin/activate
pip install -r requirements.txt
This new venv now has exactly the structure of the old one, meaning we can move the computational environment across machines.
Warning
This sometimes doesn’t work as well as it might: Python’s requirements files aren’t very well structured, not all packages (or all package versions) are available on all operating systems, Python on OS X has some unique packages, Anaconda includes a huge set by default, and so forth. But at least you get to start from a place where the environment is well-known.
A handy debugging strategy is to run pip install -r
requirements.txt
and, if it fails, delete the offending line
from requirements.txt
and try again. If you remove a package
that’s needed by another, then a compatible version should be
found by pip
– but possibly not the one you were using
originally. This doesn’t often cause problems in real life.
Advanced experimental parameter handling¶
Problem: The parameters affecting your experiment come from a range of sources, some only found at set-up time (or later).
Solution: Ideally everything you need to know to run an
experiment is known when the experiment is first configured, either
directly or from a Lab
. Sometimes this isn’t the case,
though: it may be that, in setting up an experiment, you want to
record additional material about the experiment. You can do this in
three ways:
- by adding to the metadata of the experiment;
- by adding to the experimental parameters; or
- by returning it as part of the experimental results.
Which to choose? Whichever makes most sense. These three different sets of values are intended to represent different sorts of things: monitoring information, configuration information, and computed information respectively. Generally speaking we expect experiments to yield results (only). Sometimes it’s also worth adding (for example) timing information to the metadata.
Occasionally one might also want to extend the set of experimental
parameters – because, for example, in the process of setting-up the
experiment according to the parameters given, additional information
comes about that’s also pertinent to how the experiment was run. In
that case it’s entirely legitimate to add to the experimental
parameters. You can do this simply by writing to the parameters passed
to Experiment.setUp()
:
def setUp(self, params):
    super().setUp(params)

    # do our setup
    ...

    # update the parameters
    params['variance'] = var
This change to the dict of experimental parameters will be stored with the rest of the parameters of the experiment.
It’s probably only sensible to add parameters in this way, not to delete or change them.
Different experimental designs¶
Problem: The default behaviour of a Lab
is to run an
experiment at every combination of parameter points. You want to do
something different – for example use specific combinations of
parameters only.
Solution: This is a problem of experimental design: how many experiments to run, and with what parameters?
epyc encapsulates experimental designs in the Design
class. The default design is a FactorialDesign
that runs
experiments at every combination of points: essentially this design
forms the cross-product of all the possible values of all the
parameters, and runs an experiment at each. This is a sensible
default, but possibly too generous in some applications. You can
therefore sub-class Design
to implement other strategies.
In the specific case above, the PointwiseDesign
performs the
necessary function. This design takes each parameter range and
combines the corresponding values, with any parameters with only a
single value in their range being extended to all experiments. (This
implies that all parameters are either singletons or have
ranges of the same size.)
We can create a lab that uses this design:
lab = Lab(design=epyc.PointwiseDesign())
lab['a'] = range(100)
lab['b'] = range(100, 200)
lab['c'] = 4
When an experiment is run under this design, it will generate 100 experimental runs (one per corresponding pair of elements of the ranges of parameters ‘a’ and ‘b’, with ‘c’ held constant at 4) rather than the 10,000 runs that a factorial design would generate under the same conditions. Of course that’s not a sensible comparison: the pointwise design doesn’t explore the parameter space the way the factorial design does.
Using a cluster without staying connected to it¶
Problem: You’re using a remote machine to run your simulations on, and don’t want your local machine to have to stay connected while they’re running because you’re doing a lot of computation.
Solution: epyc
’s cluster labs can work asynchronously, so you
submit the experiments you want to do and then come back to collect
them later. This is good for long-running sets of experiments, and
especially good when your front-end machine is a laptop that you want
to be able to take off the network when you go home.
Asynchronous operation is actually the default for
ClusterLab
. Starting experiments by default creates a pending
result that will be resolved when it’s been computed on the cluster.
from epyc import ClusterLab
lab = ClusterLab(profile="mycluster",
notebook=HDF5LabNotebook('mydata.h5', create=True))
nb = lab.notebook()
# perform some of the first experiments
nb.addResultSet('first-experiments')
lab['a'] = 12
lab['b'] = range(1000)
e = MyFirstExperiment()
lab.runExperiment(e)
# and then some others
nb.addResultSet('second-experiments')
lab['a'] = 15
lab['b'] = ['cat', 'dog', 'snake']
lab['c'] = range(200)
e = MySecondExperiment()
lab.runExperiment(e)
You can then wait to get all the results:
lab.wait()
which will block until all the results become available, implying that your machine has to stay connected to the cluster until the experiments finish: possibly a long wait. Alternatively you can check what fraction of each result set has been successfully computed:
lab = ClusterLab(profile="mycluster",
notebook=HDF5LabNotebook('mydata.h5'))
nb = lab.notebook()
nb.select('first-experiments')
print(lab.readyFraction())
nb.select('second-experiments')
print(lab.readyFraction())
(This is an important use case especially when using a remote cluster with a Jupyter notebook, detailed more in the Fourth tutorial: Integration with Jupyter.) The notebook will gradually be emptied of pending results and filled with completed results, until none remain.
import time

allReady = False
tags = nb.resultSets()
while not allReady:
    time.sleep(5)    # wait 5s between checks
    allReady = all(map(lambda tag: lab.ready(tag), tags))
print('All ready!')
The system for retrieving completed results is quite robust, in that it commits the notebook as results come in, minimising the possibility of loss through a crash.
Important
If you look at the API for LabNotebook
you’ll see methods for
LabNotebook.ready()
and LabNotebook.readyFraction()
. These
check the result set without updating; the corresponding methods
Lab.ready()
and Lab.readyFraction()
check the result set
after updating with newly-completed results.
You can also, if you prefer, force an update of pending results directly:
lab.updateResults()
The call to ClusterLab.updateResults()
connects to the cluster
and pulls down any results that have completed, entering them into the
notebook. You can then query the notebook (rather than the lab) about
what fraction of results are ready, taking control of when the cluster
is interrogated.
Making data archives¶
Problem: Having expended a lot of time (both your own and your computers’) on producing a dataset in a notebook, you want to be able to store it and share it over a long period.
Solution: This is a perennial problem with computational science: how do we make data readable, and keep it that way? Even more than code (which we discussed under Reproducing experiments reliably), data suffers from “bit rot” and becomes unreadable, both in technical and semantic terms.
The technical part – a file that’s in an outdated format – is the easier problem
to deal with. We can use a format that’s already survived the test of time,
that has widespread support, and that – although it eventually will go out
of date – will have enough commitment that it’ll be possible to convert and
upgrade it. HDF5, as used by the HDF5LabNotebook
, meets these criteria
well, and can be accessed natively by epyc
.
Note that epyc
also records the class names of experiments in their results.
This is only a guide, of course: there’s nothing that automatically identifies where
the code of a class is stored, or which version was used. It’s possible to address
these issues as part of dataset semantics, though.
The semantic problem requires that we maintain an understanding of what each field in a dataset means. At a trivial level, sensible field names help, as do free-text descriptions of how and why a dataset was collected. This metadata is all stored within a persistent result set or notebook, and can be accessed when the notebook is re-loaded or used within some other tool.
One can be even more structured. Each parameter and result field in a result set (and each metadata field, for that matter) will presumably have a particular purpose and likely some units. We can use attributes to store this metadata too:
from epyc import HDF5LabNotebook
# load the notebook and give it a new description
with HDF5LabNotebook('my-important-dataset.h5') as nb:
# set the description
nb.setDescription('A notebook I want to understand later')
# select the result set we want to annotate with metadata
rs = nb.select('first-experiment')
rs.setDescription('Some physics stuff')
# create attributes for each parameter and result
rs[MyExperiment.VELOCITY] = 'Velocity of particle (ms^-1)'
rs[MyExperiment.MASS] = 'Mass of particle (g)'
rs[MyExperiment.NPARTICLES] = 'Number of particles (number)'
rs[MyExperiment.DENSITY] = 'Final particle density (m^-2)'
# lock the result set against further updates
rs.finish()
We’ve assumed we have a class MyExperiment
that defines field names for its
parameter and result fields. For each of these we create an attribute of the result
set holding a text description and units. Now, when we examine the notebook some time later,
we’ll have at least some idea of what’s what. Admittedly that metadata isn’t machine-readable
to allow a program to (for example) work out that masses are measured in grams: that
would require a far more sophisticated system using ontologies to describe the structure
of information. But it’s a start to have the information recorded in a human-readable form,
closely associated with the data.
In particular application domains it may also be worth adhering to specific standards for metadata. The UK Digital Curation Centre maintains a list that may be useful.
Finally, we called ResultSet.finish()
to finish and lock the result set. This
will (hopefully) prevent accidental corruption, and will also tidy up the final
file by cancelling any submitted-but-not-completed pending results. (Any such results
will still be recorded in the dataset for audit purposes.)
Getting access to more run-time information¶
Problem: You need to get more information out of epyc
.
Solution: epyc
makes use of Python’s standard logging
module. Various operations emit logging messages that can be
intercepted and used in various ways.
epyc
uses its own logger, whose name is stored in the constant
epyc.Logger
: unsurprisingly it is called “epyc”. You can
use this name to configure the details of logging that epyc
performs. For example, if you want to suppress all messages except for
those that are errors (or worse), you could use code such as:
import logging
import epyc
epycLogger = logging.getLogger(epyc.Logger)
epycLogger.setLevel(logging.ERROR)
There are lots of other configuration options, including logging to files or to management services: see the Python logging HOWTO for details.
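For example, the standard logging machinery can be used to send epyc’s messages to a file as well (a sketch; the file name is illustrative):
import logging
import epyc

handler = logging.FileHandler('epyc.log')
handler.setLevel(logging.INFO)
logging.getLogger(epyc.Logger).addHandler(handler)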
Glossary¶
- experiment
- A computational experiment, inheriting from
Experiment
. Experiments are run at a point in a multi-dimensional parameter space, and should be designed to be repeatable. - experiment combinators
- Experiments that wrap-up other, underlying experiments and perform them in some way, perhaps repeating them or summarising or re-writing their results. They allow common experimental patterns to be coded.
- experimental configuration
- A list of pairs of an experiment and the parameters at which it will be run, created according to an experimental design.
- experimental design
- The way in which a set of parameters is converted into points at which experiments are run.
- experimental parameters
- The values used to position an individual experimental run in the “space” of all experiments. Each experiment has its own parameters, which it can use to configure itself and perform set-up (see The lifecycle of an experiment).
- experimental results
- The collection of values returned by an experimental run.
- lab
- A computational laboratory co-ordinating the execution of
multiple experiments, inheriting from
Lab
. - metadata
- Additional information about an experiment, returned as part of a results dict.
- notebook
- An immutable and often persistent store of experimental results and
metadata, inheriting from
LabNotebook
. - parameter space
- The set of experimental parameters at which experiments
will be run. The parameter space is defined by a
Design
. - result set
- A collection of results within a notebook, inheriting
from
ResultSet
. Result sets can be created, deleted, and added to by running new experiments – but can’t have their contents changed. - results dict
- A dict structured according to a particular convention. The dict uses three top-level keys, defined by the Experiment class, for the parameter values of the experiment, the experimental results, and some metadata values. Each of these top-level keys maps to a dict of further values; for some experiments, the experimental results key may refer to a list of such dicts.
Contributing¶
epyc
is an open-source project, and we welcome comments, issue
reports, requests for new features, and (especially!) code for new
features.
To report an issue¶
Issue (“bug”) reports are handled through epyc
’s GitHub
repository. To report an issue, go to
https://github.com/simoninireland/epyc/issues
and click the “New issue” button.
Please be as specific as possible about the problem. Code that illustrates an issue is very welcome, but please make it as simple as possible!
To request a feature¶
If you simply want to suggest a feature, please open an issue report as above.
To contribute a feature¶
If on the other hand you have proposed code for a new feature, please
create a pull request containing your proposal, either using git
directly or through GitHub’s pull request manager.
In submitting a pull request, please include:
- a clear description of what the code does, and what it adds to
epyc
for a general user; - well-commented code, including docstrings for methods;
- types for all methods using Python’s type hints;
- a tutorial and/or cookbook recipe to illustrate the new feature in use; and
- tests in the
test/
sub-directory that let us automatically test any new features.
Please don’t neglect the tests. We use continuous integration for
epyc
to keep everything working, so it’s important that new
features provide automated unit tests. Please also don’t neglect
documentation, and remember that docstrings aren’t enough on their own.
We use the Python black coding style, and it’d be helpful if any submitted code did the same. We use type annotations to improve maintainability.
Installing the codebase¶
To get your own copy of the codebase, simply clone the repo from GitHub and (optionally) create your own branch to work on:
# clone the repo
git clone git@github.com:simoninireland/epyc.git
cd epyc
# create a new branch to work on
git branch my-new-feature
The makefile has several targets that are needed for development:
make env
builds a virtual environment with all the necessary libraries. This includes both those that epyc
needs to run (specified in requirements.txt
), and those that are needed only when developing and testing (specified in dev-requirements.txt
)
make test
runs the test suite. This consists of a lot of tests, and so may take some time
make cluster
starts an ipyparallel
compute cluster. Run this in one shell, and then run make test
in another shell to include the tests of cluster behaviour. (Cluster tests are skipped unless there’s a cluster called epyctest
running locally.)
make testclusterlab
runs only the cluster tests rather than the whole suite
make clean
deletes a lot of constructed files for a clean build
make reallyclean
also deletes the venv
Calling make
on its own prints all the available targets.
Copyrights on code¶
You retain copyright over any code you submit that’s incorporated in
epyc
’s code base, and this will be noted in the source code
comments and elsewhere.
We will only accept code that’s licensed with the same license as
epyc
itself (currently GPLv3). Please indicate
this clearly in the headers of all source files to avoid confusion.
Please also note that we may need an explicit declaration from your
employer that this work can be released under GPL: see
https://www.gnu.org/licenses/ for details.
Citing¶
If you use epyc
in your work, find it useful, and want to
acknowledge it, you can cite the following reference:
Simon Dobson. epyc: Computational experiment management in Python. Journal of Open Source Software 7(72). 2022. https://doi.org/10.21105/joss.03764
One possible BibTeX record of this is:
@article{epyc-joss,
author = {Simon Dobson},
title = "{epyc}: Computational experiment management in {P}ython",
journal = {Journal of Open Source Software},
year = {2022},
number = {72},
volume = {7},
doi = {10.21105/joss.03764},
}