Fourth tutorial: Integration with Jupyter¶
Computational experiments are increasingly happening within the Jupyter ecosystem
of notebooks, labs, binders, and so on. epyc
integrates well with Jupyter tools,
and in this tutorial we’ll show you how.
What is Jupyter?¶
Jupyter is an interactive notebook system that provides a flexible front-end onto a number of different language “kernels” – including Python. Notebooks can include text, graphics, and executable code, and so are an excellent framework for the interactive running of computational experiments. The ecosystem has expanded in recent years and now includes a multi-window IDE (Jupyter Lab), a publishing system (Jupyter Book), and various interactive widget libraries for visualisation.
epyc
can be used from Jupyter notebooks in a naive way simply by creating a local
lab in a notebook and working with it as normal:
from epyc import Lab

lab = Lab()
However, this provides access only to the simplest kind of lab, running on a single local core. It is clearly desirable to be able to use more flexible resources. It’s also desirable to be able to make use of code and classes defined in notebooks.
Jupyter notebooks and compute servers¶
The interactive nature of Jupyter notebooks means that they lend themselves to
“disconnected” operation with a ClusterLab
. To recap, this is where
we submit experiments to a remote compute cluster and then re-connect later
to retrieve the results. See Second tutorial: Parallel execution for a description of
how to set up and use a remote cluster.
This mode of working is perfect for a notebook. You simply create a connection to the cluster and run it as you wish:
from epyc import ClusterLab, HDF5LabNotebook
import numpy
nb = HDF5LabNotebook('notebook-dataset.h5', description='My data', create=True)
lab = ClusterLab(url_file='ipcontroller-client.json', notebook=nb)
lab['first'] = numpy.linspace(0.0, 1.0, num=100)
lab.runExperiment(MyExperiment())
Running these experiments happens in the background rather than freezing your
notebook (as would happen with an ordinary Lab
). It also doesn’t attempt
to melt your laptop by running too many experiments locally. Instead the
experiments are scattered onto the cluster and can be collected later – even after
turning your computer off, if necessary.
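Setting lab['first'] above defines a parameter range: epyc runs one experiment per point in the parameter space (the cross product of ranges, when several parameters are set). As a quick, self-contained check of what that particular range contains:

```python
import numpy

# the same range used above: 100 evenly-spaced points from 0.0 to 1.0 inclusive
first = numpy.linspace(0.0, 1.0, num=100)

assert len(first) == 100
assert first[0] == 0.0 and first[-1] == 1.0
```

Each of these 100 values becomes the 'first' parameter of one experimental run on the cluster.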
Note
We used an HDF5LabNotebook
to hold the results so that everything is
saved between sessions. If you used a non-persistent notebook then you’d lose
the details of the pending results. And anyway, do you really not want to
save all your experimental results for later?
All that then needs to happen is that you re-connect to the cluster and check for results:
from epyc import ClusterLab, HDF5LabNotebook
import numpy
nb = HDF5LabNotebook('notebook-dataset.h5', description='My data')
lab = ClusterLab(url_file='ipcontroller-client.json', notebook=nb)
print(lab.readyFraction())
This will print the fraction of results that are now ready, a number between 0 and 1.
Warning
Notice that we omitted the create=True
flag from the HDF5LabNotebook
the second time: we want to re-use the notebook (which holds all the details
of the pending results we’re expecting to collect), not create it afresh.
Many core-hours of computation have been lost this way….
Checking the results pulls any that are completed, and you can immediately start using them if you wish: you don’t have to wait for everything to finish. Just be careful that whatever analysis you start to perform understands that this is a partial set of results.
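If you’d rather block until everything has come back, a small polling loop around the fraction check does the job. The sketch below is generic, not part of epyc: wait_until_ready is a hypothetical helper, and ready_fraction stands for any zero-argument callable returning a value in [0, 1] – such as the readyFraction method of a ClusterLab.

```python
import time

def wait_until_ready(ready_fraction, poll=60.0, sleep=time.sleep):
    # Poll until all pending results have been retrieved.
    # ready_fraction: callable returning a number in [0, 1]
    # poll: seconds to wait between checks
    while ready_fraction() < 1.0:
        sleep(poll)

# Simulate a cluster that completes after three checks
fractions = iter([0.2, 0.7, 1.0])
naps = []
wait_until_ready(lambda: next(fractions), poll=1.0, sleep=naps.append)
# the loop slept twice before the final check reported 1.0
```

In a real notebook you’d pass the lab’s own readiness check and let sleep default to time.sleep.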
Avoiding repeated computation¶
Consider the following use case. You create a Jupyter notebook for calculating your exciting new results. You create a notebook, add some result sets, do some stunning long computations, and save the notebook for later sharing.
After some time, you decide you want to add some more computations to the notebook. You open it, and realise that all the results you computed previously are still available in the notebook. But if you were to re-execute the notebook, you’d re-compute all those results when instead you could simply load them and get straight on with your new computations.
Or perhaps you share your notebook “live” using Binder. The notebook is included in the binder, and people can see the code you used to get them (as with all good notebooks). But again they may want to use and re-analyse your results but not re-compute them.
One way round this is to have two cells, something like this to do (or re-do) the calculations:
# execute this cell if you want to compute the results
nb = epyc.HDF5LabNotebook('lots-of-results.h5')
nb.addResultSet('first-results',
                description='My long-running first computation')
e = LongComputation()
lab = Lab(nb)
lab[LongComputation.INITIAL] = int(1e6)
lab.runExperiment(e)
and then another one to re-use the ones you prepared earlier:
# execute this cell if you want to re-load the results
nb = epyc.HDF5LabNotebook('lots-of-results.h5')
nb.select('first-results')
Of course this is quite awkward: it relies on the notebook user to decide which cells to execute – and means that you can’t simply run all the cells to get back to where you started.
epyc
has two solutions to this problem.
Using createWith
¶
The first (and preferred) solution uses the Lab.createWith()
method, which takes a function used to create a result set. When
called, the method checks if the result set already exists in the
lab’s notebook and, if it does, selects it for use; if it doesn’t,
then it is created, selected, and the creation function is called to
populate it.
To use this method we first define a function to create the data we want:
def createResults(lab):
    e = LongComputation()
    lab[LongComputation.INITIAL] = int(1e6)
    lab.runExperiment(e)
We then use this function to create a result set, if it hasn’t been created already:
lab.createWith('first-results',
               createResults,
               description='My long-running first computation')
Note that the creation function is called with the lab it should use
as argument. Lab.createWith()
will return when the result set
has been created, which will be when all experiments have been run.
By default the lab’s parameter space is reset before the creation
function is called, and any exception raised in the creation function
causes the result set to be deleted (to avoid partial results) and
propagates the exception to the caller. There are several ways to
customise this behaviour, described in the Lab.createWith()
reference.
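The guard logic is easy to picture with a toy model. The sketch below is not epyc code: the notebook is modelled as a plain dict, and create_with is a hypothetical stand-in illustrating the behaviour just described – skip the computation if the result set exists, and delete the partial set if the creation function raises.

```python
def create_with(notebook, tag, creator):
    # If the result set already exists, just "select" it and return.
    if tag in notebook:
        return False
    notebook[tag] = []            # create an (empty) result set
    try:
        creator(notebook[tag])    # populate it
    except Exception:
        del notebook[tag]         # avoid leaving partial results behind
        raise
    return True

nb = {}
calls = []
def creator(rs):
    calls.append(1)
    rs.append('result')

create_with(nb, 'first-results', creator)   # runs the computation
create_with(nb, 'first-results', creator)   # second call is a no-op
```

The real Lab.createWith() adds more machinery (parameter space resets, locking, and so on), but this is the shape of the idempotency guarantee that makes “run all cells” safe.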
Using already
¶
The second solution uses the LabNotebook.already()
method. This
is appropriate if you want to avoid creating additional functions just
for data creation, and instead have the code inline.
nb = epyc.HDF5LabNotebook('lots-of-results.h5')
if not nb.already('first-results',
                  description='My long-running first computation'):
    e = LongComputation()
    lab = Lab(nb)
    lab[LongComputation.INITIAL] = int(1e6)
    lab.runExperiment(e)
If this cell is run on a lab notebook that doesn’t contain the result
set, that set is created and the body of the conditional executes to
compute the results. If the cell is executed when the result set
already exists, it selects it ready for use (and any description
passed to LabNotebook.already() is ignored). In either event,
subsequent code can assume that the result set exists, is selected,
and is populated with results.
Note
Note that Lab.createWith()
is called against a lab while
LabNotebook.already()
is called against a notebook. (The
former uses the latter internally.)
In general Lab.createWith()
is easier to use than
LabNotebook.already()
, as the former handles exceptions,
parameter space initialisation, result set locking and the like.
Limitations¶
There are some limitations to be aware of, of course:
- Neither approach works with disconnected operation when the results come back over a long period.
- You need to be careful about your choice of result set tags, so that they’re meaningful to you later. This also makes the description and metadata more important.
- We assume that all the results in the set are computed in one go, since future cells protected by the same code pattern wouldn’t be run.
The latter can be addressed by locking the result set after the
computation has happened (by calling ResultSet.finish()
) to fix
the results. Lab.createWith()
can do this automatically.
Sharing code between notebooks and engines¶
Of course if you use Jupyter as your primary coding platform, you’ll probably define experiments in the notebook too. This will unfortunately immediately introduce you to one of the issues with distributed computation in Python.
Here’s an example. Suppose you define an experiment in a notebook, something like this:
from epyc import Experiment
class MyExperiment(Experiment):

    def __init__(self):
        super(MyExperiment, self).__init__()
        # your initialisation code here

    def setUp(self, params):
        super(MyExperiment, self).setUp(params)
        # your setup code here

    def do(self, params):
        # whatever your experiment does
Pretty obvious, yes? And it works fine when you run it locally. Then,
when you try to run it in on a cluster, you create your experiment and
submit it with ClusterLab.runExperiment()
, but all the instances
fail with an error complaining about there being no such class as
MyExperiment
.
What’s happening is that epyc
is creating an instance of your
class in your notebook (or program) where MyExperiment
is known
(so the call in __init__()
works fine). Then it’s passing objects
(instances of this class) over to the cluster, where MyExperiment
isn’t known. When the cluster calls Experiment.setUp()
as part
of the experiment’s lifecycle, it goes looking for MyExperiment
–
and fails, even though it does actually have all the code it
needs. This happens because of the way Python dynamically looks for
code at run-time, which is often a useful feature but in this case
trips us up.
To summarise, the problem is that code you define here (in the notebook) isn’t immediately available there (in the engines running the experiments): it has to be transferred there. And the easiest way to do that is to make sure that all classes defined here are also defined there as a matter of course.
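The mechanism behind this is Python pickling: a pickled instance carries only a reference to its class (module and name), not the class’s code, so the receiving side must be able to look the class up for itself. A quick self-contained demonstration, independent of epyc:

```python
import pickle

class MyExperiment:
    pass

# Pickling records *where to find* the class, not its definition
blob = pickle.dumps(MyExperiment())

# Unpickling works here only because MyExperiment is defined
# in this process; on an engine that has never seen the class,
# pickle.loads(blob) would raise an AttributeError instead.
copy = pickle.loads(blob)
assert isinstance(copy, MyExperiment)
```

This is exactly the failure mode above: the cluster engines receive the pickled experiment object but have no definition of MyExperiment to attach it to.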
To do this we make use of an under-used feature of Jupyter, the cell magic. These are annotations placed on cells that let you control the code that executes the cell itself. So rather than a Python code cell being executed by the notebook’s default mechanism, you can insert code that provides a new mechanism. In this case we want to have a cell execute both here and there, so that the class is defined on notebook and engine.
Important
If you haven’t taken to heart the advice about Reproducing experiments reliably,
now would be a really good time to do so. Create a venv for both
the notebook and the engines: the venv at the engine side doesn’t
need Jupyter, but it mostly does no harm to use the same
requirements.txt
file on both sides.
The cell magic we need uses the following code, so put it into a cell and execute it:
# from https://nbviewer.jupyter.org/gist/minrk/4470122
def pxlocal(line, cell):
    ip = get_ipython()
    ip.run_cell_magic("px", line, cell)
    ip.run_cell(cell)
get_ipython().register_magic_function(pxlocal, "cell")
This defines a new cell magic, %%pxlocal
. The built-in cell magic
%%px
runs a cell on a set of engines. %%pxlocal
runs a cell
both on the engines and locally (in the notebook). If you decorate
your experiment classes this way, then they’re defined here and
there as required:
%%pxlocal
class MyExperiment(Experiment):

    def __init__(self):
        super(MyExperiment, self).__init__()
        # your initialisation code here

    def setUp(self, params):
        super(MyExperiment, self).setUp(params)
        # your setup code here

    def do(self, params):
        # whatever your experiment does
Now when you submit your experiments they will function as required.
Important
You only need to use %%pxlocal
for cells in which you’re
defining classes. When you’re running the experiments all the
code runs notebook-side only, and epyc
handles passing the
necessary objects around the network.