Fourth tutorial: Integration with Jupyter

Computational experiments are increasingly happening within the Jupyter ecosystem of notebooks, labs, binders, and so on. epyc integrates well with Jupyter tools, and in this tutorial we’ll show you how.

What is Jupyter?

Jupyter is an interactive notebook system that provides a flexible front-end onto a number of different language “kernels” – including Python. Notebooks can include text, graphics, and executable code, and so are an excellent framework for the interactive running of computational experiments. The ecosystem has expanded in recent years and now includes a multi-window IDE (Jupyter Lab), a publishing system (Jupyter Book), and various interactive widget libraries for visualisation.

epyc can be used from Jupyter notebooks in a naive way simply by creating a local lab in a notebook and working with it as normal:

from epyc import Lab

lab = Lab()
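
Working with the lab then proceeds just as it would from a script. For example (a minimal sketch, assuming MyExperiment is an Experiment subclass defined earlier in the notebook, and that the notebook’s dataframe() method returns the collected results as a pandas DataFrame):

import numpy

# sweep a single parameter and run the experiment at each point
lab['first'] = numpy.linspace(0.0, 1.0, num=100)
lab.runExperiment(MyExperiment())

# pull the results back for analysis
df = lab.notebook().dataframe()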

However, this provides access only to the simplest kind of lab, running on a single local core. It’s clearly desirable to be able to use more flexible resources, and also to make use of code and classes defined in notebooks.

Jupyter notebooks and compute servers

The interactive nature of Jupyter notebooks means that they lend themselves to “disconnected” operation with a ClusterLab. To recap, this is where we submit experiments to a remote compute cluster and then re-connect later to retrieve the results. See Second tutorial: Parallel execution for a description of how to set up and use a remote cluster.

This mode of working is perfect for a notebook. You simply create a connection to the cluster and run it as you wish:

from epyc import ClusterLab, HDF5LabNotebook
import numpy

nb = HDF5LabNotebook('notebook-dataset.h5', description='My data', create=True)
lab = ClusterLab(url_file='ipcontroller-client.json', notebook=nb)

lab['first'] = numpy.linspace(0.0, 1.0, num=100)
lab.runExperiment(MyExperiment())

Running these experiments happens in the background rather than freezing your notebook (as would happen with an ordinary Lab), and it doesn’t attempt to melt your laptop by running too many experiments locally. Instead the experiments are scattered onto the cluster and can be collected later – even after turning your computer off, if necessary.

Note

We used an HDF5LabNotebook to hold the results so that everything is saved between sessions. If you used a non-persistent notebook then you’d lose the details of the pending results. And anyway, do you really not want to save all your experimental results for later?
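
It’s also worth making sure that the notebook, including the records of the pending results, has actually been written to its file before you shut down. A minimal sketch, assuming the notebook’s commit() method forces a save:

# flush the notebook (and its pending-result records) to disk before
# closing the session; the results themselves are collected later
nb.commit()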

All that then needs to happen is that you re-connect to the cluster and check for results:

from epyc import ClusterLab, HDF5LabNotebook
import numpy

nb = HDF5LabNotebook('notebook-dataset.h5', description='My data')
lab = ClusterLab(url_file='ipcontroller-client.json', notebook=nb)

print(lab.readyFraction())

This will print the fraction of results that are now ready, a number between 0 and 1.

Warning

Notice that we omitted the create=True flag from the HDF5LabNotebook the second time: we want to re-use the notebook (which holds all the details of the pending results we’re expecting to collect), not create it afresh. Many core-hours of computation have been lost this way….

Checking the results pulls any that are completed, and you can immediately start using them if you wish: you don’t have to wait for everything to finish. Just be careful that whatever analysis you perform understands that this is a partial set of results.
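
For example, a minimal sketch of working with a partial set of results, assuming the notebook’s dataframe() method returns only the results collected so far as a pandas DataFrame:

# see how far the computation has got
print('{f:.0%} of results ready'.format(f=lab.readyFraction()))

# analyse whatever has been completed up to now
df = lab.notebook().dataframe()
print(len(df), 'results collected so far')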

Avoiding repeated computation

Consider the following use case. You create a Jupyter notebook for calculating your exciting new results: you add some result sets, do some stunningly long computations, and save the notebook for later sharing.

After some time, you decide you want to add some more computations to the notebook. You open it, and realise that all the results you computed previously are still available. But if you were to re-execute the notebook you’d re-compute all those results, when instead you could simply load them and get straight on with your new work.

Or perhaps you share your notebook “live” using Binder. The notebook is included in the binder, and people can see the code you used to get your results (as with all good notebooks). But again, they may want to use and re-analyse those results without re-computing them.

One way round this is to have two cells, something like this to do (or re-do) the calculations:

# execute this cell if you want to compute the results
from epyc import Lab, HDF5LabNotebook

nb = HDF5LabNotebook('lots-of-results.h5')
nb.addResultSet('first-results',
                description='My long-running first computation')
e = LongComputation()
lab = Lab(nb)
lab[LongComputation.INITIAL] = int(1e6)
lab.runExperiment(e)

and then another one to re-use the ones you prepared earlier:

# execute this cell if you want to re-load the results
from epyc import HDF5LabNotebook

nb = HDF5LabNotebook('lots-of-results.h5')
nb.select('first-results')

Of course this is quite awkward: it relies on the notebook user deciding which cells to execute, and it means that you can’t simply run all the cells to get back to where you started.

epyc has two solutions to this problem.

Using createWith

The first (and preferred) solution uses the Lab.createWith() method, which takes a function used to create a result set. When called, the method checks whether the result set already exists in the lab’s notebook and, if it does, selects it for use; if it doesn’t, then it is created, selected, and the creation function is called to populate it.

To use this method we first define a function to create the data we want:

def createResults(lab):
    e = LongComputation()
    lab[LongComputation.INITIAL] = int(1e6)
    lab.runExperiment(e)

We then use this function to create a result set, if it hasn’t been created already:

lab.createWith('first-results',
               createResults,
               description='My long-running first computation')

Note that the creation function is called with the lab it should use as its argument. Lab.createWith() will return when the result set has been created, which will be when all the experiments have been run.

By default the lab’s parameter space is reset before the creation function is called, and any exception raised in the creation function causes the result set to be deleted (to avoid partial results) and propagates the exception to the caller. There are several ways to customise this behaviour, described in the Lab.createWith() reference.
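For example, since a failure in the creation function deletes the partially-populated result set and re-raises the exception, you can catch the failure notebook-side if you prefer (a minimal sketch):

try:
    lab.createWith('first-results',
                   createResults,
                   description='My long-running first computation')
except Exception as err:
    # createWith() has already deleted the partial result set
    print('Computation failed: {e}'.format(e=err))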

Using already

The second solution uses the LabNotebook.already() method. This is appropriate if you want to avoid creating additional functions just for data creation, and instead have the code inline.

from epyc import Lab, HDF5LabNotebook

nb = HDF5LabNotebook('lots-of-results.h5')
if not nb.already('first-results',
                  description='My long-running first computation'):
    e = LongComputation()
    lab = Lab(nb)
    lab[LongComputation.INITIAL] = int(1e6)
    lab.runExperiment(e)

If this cell is run on a lab notebook that doesn’t contain the result set, that set is created and the body of the conditional executes to compute the results. If the cell is executed when the result set already exists, the set is simply selected ready for use (and any description passed to LabNotebook.already() is ignored). In either event, subsequent code can assume that the result set exists, is selected, and is populated with results.
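
Subsequent cells can then use the selected result set directly. A minimal sketch, assuming the notebook’s dataframe() method returns the results of the currently-selected result set as a pandas DataFrame:

# works whether the results were just computed or simply re-loaded
df = nb.dataframe()
print(len(df), 'results available')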

Note

Note that Lab.createWith() is called against a lab while LabNotebook.already() is called against a notebook. (The former uses the latter internally.)

In general Lab.createWith() is easier to use than LabNotebook.already(), as the former handles exceptions, parameter space initialisation, result set locking and the like.

Limitations

There are some limitations to be aware of, of course:

  • Neither approach works with disconnected operation when the results come back over a long period.
  • You need to be careful about your choice of result set tags, so that they’re meaningful to you later. This also makes the description and metadata more important.
  • We assume that all the results in the set are computed in one go, since future cells protected by the same code pattern wouldn’t be run.

The latter can be addressed by locking the result set after the computation has happened (by calling ResultSet.finish()) to fix the results. Lab.createWith() can do this automatically.
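
A minimal sketch of locking the results by hand, assuming the notebook’s current() method returns the currently-selected result set:

# freeze the result set so that later runs of the notebook can't
# accidentally add to or re-compute these results
nb.current().finish()
nb.commit()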

Sharing code between notebooks and engines

Of course if you use Jupyter as your primary coding platform, you’ll probably define experiments in the notebook too. This will unfortunately immediately introduce you to one of the issues with distributed computation in Python.

Here’s an example. Suppose you define an experiment in a notebook, something like this:

from epyc import Experiment

class MyExperiment(Experiment):

    def __init__(self):
        super(MyExperiment, self).__init__()
        # your initialisation code here

    def setUp(self, params):
        super(MyExperiment, self).setUp(params)
        # your setup code here

    def do(self, params):
        # whatever your experiment does, returning a dict of results
        return {}

Pretty obvious, yes? And it works fine when you run it locally. Then, when you try to run it on a cluster, you create your experiment and submit it with ClusterLab.runExperiment(), but all the instances fail with an error complaining about there being no such class as MyExperiment.

What’s happening is that epyc is creating an instance of your class in your notebook (or program) where MyExperiment is known (so the call in __init__() works fine). Then it’s passing objects (instances of this class) over to the cluster, where MyExperiment isn’t known. When the cluster calls Experiment.setUp() as part of the experiment’s lifecycle, it goes looking for MyExperiment – and fails, even though it does actually have all the code it needs. This happens because of the way Python dynamically looks for code at run-time, which is often a useful feature but in this case trips things up.

To summarise, the problem is that code you define here (in the notebook) isn’t immediately available there (in the engines running the experiments): it has to be transferred there. And the easiest way to do that is to make sure that all classes defined here are also defined there as a matter of course.

To do this we make use of an under-used feature of Jupyter, the cell magic. These are annotations placed on cells that let you control how the cell itself is executed. So rather than a Python code cell being executed by the notebook’s default mechanism, you can insert code that provides a new mechanism. In this case we want a cell to execute both here and there, so that the class is defined on both the notebook and the engines.

Important

If you haven’t taken to heart the advice about Reproducing experiments reliably, now would be a really good time to do so. Create a venv for both the notebook and the engines: the venv at the engine side doesn’t need Jupyter, but it mostly does no harm to use the same requirements.txt file on both sides.

The cell magic we need uses the following code, so put it into a cell and execute it:

# from https://nbviewer.jupyter.org/gist/minrk/4470122
def pxlocal(line, cell):
    ip = get_ipython()
    ip.run_cell_magic("px", line, cell)   # run the cell on the engines
    ip.run_cell(cell)                     # ...and run it locally in the notebook
get_ipython().register_magic_function(pxlocal, "cell")

This defines a new cell magic, %%pxlocal. The built-in cell magic %%px runs a cell on a set of engines. %%pxlocal runs a cell both on the engines and locally (in the notebook). If you decorate your experiment classes this way, then they’re defined here and there as required:

%%pxlocal

class MyExperiment(Experiment):

    def __init__(self):
        super(MyExperiment, self).__init__()
        # your initialisation code here

    def setUp(self, params):
        super(MyExperiment, self).setUp(params)
        # your setup code here

    def do(self, params):
        # whatever your experiment does, returning a dict of results
        return {}

Now when you submit your experiments they will function as required.
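
For example (a minimal sketch, re-using the cluster lab and parameter space from earlier):

# this cell runs notebook-side only: the MyExperiment instance is created
# here and shipped to the engines, where the class is now defined too
lab['first'] = numpy.linspace(0.0, 1.0, num=100)
lab.runExperiment(MyExperiment())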

Important

You only need to use %%pxlocal for cells in which you’re defining classes. When you’re running the experiments all the code runs notebook-side only, and epyc handles passing the necessary objects around the network.