ClusterLab: Flexible, parallel, asynchronous experiments

class epyc.ClusterLab(notebook: epyc.labnotebook.LabNotebook = None, url_file=None, profile=None, profile_dir=None, ipython_dir=None, context=None, debug=False, sshserver=None, sshkey=None, password=None, paramiko=None, timeout=10, cluster_id=None, **extra_args)

A Lab running on an ipyparallel compute cluster.

Experiments are submitted to engines in the cluster for execution in parallel, with the experiments being performed asynchronously to allow for disconnection and subsequent retrieval of results. Combined with a persistent LabNotebook, this allows for fully decoupled access to an on-going computational experiment with piecewise retrieval of results.

This class requires a cluster to already be set up and running, configured for persistent access, with access to the necessary code and libraries, and with appropriate security information available to the client.

Interacting with the cluster

The ClusterLab can be queried to determine the number of engines available in the cluster to which it is connected, which essentially defines the degree of available parallelism. The lab also provides a ClusterLab.sync_imports() method that allows modules to be imported into the namespace of the cluster’s engines. This needs to be done before running experiments, to make all the code used by an experiment available in the cluster.

ClusterLab.numberOfEngines() → int

Return the number of engines available to this lab.

Returns:the number of engines
ClusterLab.engines() → ipyparallel.client.view.DirectView

Return a list of the available engines.

Returns:a list of engines
ClusterLab.sync_imports(quiet: bool = False) → contextlib.AbstractContextManager

Return a context manager to control imports onto all the engines in the underlying cluster. This method is used within a with statement.

Any imports should be done with no experiments running, otherwise the method will block until the cluster is quiet. Generally imports will be one of the first things done when connecting to a cluster. (But be careful not to accidentally try to re-import if re-connecting to a running cluster.)

Parameters:quiet – if True, suppresses messages (defaults to False)
Returns:a context manager
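
The with-statement pattern described above can be sketched as follows. StubLab is a hypothetical stand-in, introduced only so the snippet runs without a cluster; with a real ClusterLab the imports inside the block would also be performed on the engines.

```python
import contextlib


class StubLab:
    """Hypothetical stand-in for ClusterLab so the sketch runs without a cluster."""

    def sync_imports(self, quiet=False):
        # The real method returns a context manager that mirrors imports
        # onto the engines; this stub returns one that does nothing.
        return contextlib.nullcontext()


def import_experiment_modules(lab):
    """Import the modules an experiment needs inside the lab's import context."""
    with lab.sync_imports(quiet=True):
        import math
        import random
    return math, random


lab = StubLab()
math_mod, random_mod = import_experiment_modules(lab)
```

Doing this once, straight after connecting and before submitting any experiments, matches the advice above about avoiding imports while the cluster is busy.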

Running experiments

Cluster experiments are run as with a normal Lab, by setting a parameter space and submitting an experiment to ClusterLab.runExperiment(). The experiment is replicated and passed to each engine, and experiments are run on points in the parameter space in parallel. Experiments are run asynchronously: runExperiment() returns as soon as the experiments have been sent to the cluster.

ClusterLab.runExperiment(e: epyc.experiment.Experiment)

Run the experiment across the parameter space in parallel using all the engines in the cluster. This method returns immediately.

The experiments are run asynchronously, with the points in the parameter space being explored randomly so that intermediate retrievals of results are more representative of the overall result. Put another way, for a lot of experiments the results available will converge towards a final answer, so we can plot them and see the answer emerge.

Parameters:e – the experiment
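
The benefit of exploring points in random order can be illustrated without a cluster. The sketch below is not the library's code: it uses a made-up experiment whose result at each point is simply the parameter value, and compares the estimate of the overall mean after 20% of the points have been explored in order with the estimate after 20% have been explored in shuffled order.

```python
import random


def running_means(values):
    """Return the mean of the first k values, for k = 1..len(values)."""
    means, total = [], 0.0
    for k, v in enumerate(values, start=1):
        total += v
        means.append(total / k)
    return means


# Hypothetical experiment: the "result" at each point is the parameter value
points = list(range(100))
true_mean = sum(points) / len(points)

# Shuffle a copy with a fixed seed so the sketch is repeatable
rng = random.Random(42)
shuffled = points[:]
rng.shuffle(shuffled)

# Estimates of the mean after the first 20 points in each ordering
in_order_estimate = running_means(points)[19]
shuffled_estimate = running_means(shuffled)[19]
```

The in-order estimate only sees the lowest 20 parameter values, so it is far from the true mean, while the shuffled estimate samples the whole space and lands much closer.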

The ClusterLab.readyFraction() method returns the fraction of results that are ready for retrieval, i.e., the fraction of the parameter space that has been explored. ClusterLab.ready() tests whether all results are ready. For cases where it is needed (which will hopefully be few and far between), ClusterLab.wait() blocks until all results are ready.

ClusterLab.readyFraction(tag: str = None) → float

Return the fraction of results available (not pending) in the tagged result set after first updating the results.

Parameters:tag – (optional) the result set to check (default is the current result set)
Returns:the ready fraction
ClusterLab.ready(tag: str = None) → bool

Test whether all the results are ready in the tagged result set – that is, none are pending.

Parameters:tag – (optional) the result set to check (default is the current result set)
Returns:True if the results are in
ClusterLab.wait(timeout: int = -1) → bool

Wait for all pending results in all result sets to be finished. If timeout is set, return after this many seconds regardless.

Parameters:timeout – timeout period in seconds (defaults to forever)
Returns:True if all the results completed
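
A polling loop built on readyFraction() and ready() might look like the sketch below. StubLab is a hypothetical stand-in that pretends one more job finishes on every check, and poll_until_ready is an illustrative helper, not part of epyc.

```python
import time


class StubLab:
    """Hypothetical stand-in for ClusterLab: one more job completes per check."""

    def __init__(self, jobs):
        self._jobs = jobs
        self._done = 0

    def readyFraction(self, tag=None):
        # Simulate the cluster finishing one more job since the last check
        self._done = min(self._jobs, self._done + 1)
        return self._done / self._jobs

    def ready(self, tag=None):
        return self._done == self._jobs


def poll_until_ready(lab, interval=0.0, max_polls=100):
    """Record the ready fraction at each poll until all results are in."""
    history = []
    for _ in range(max_polls):
        history.append(lab.readyFraction())
        if lab.ready():
            break
        time.sleep(interval)
    return history


history = poll_until_ready(StubLab(jobs=4))
```

Bounding the loop with max_polls keeps the client free to disconnect and come back later, rather than blocking indefinitely as wait() would with its default timeout.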

Results management

A cluster lab performs its computation remotely, typically on another machine or machines. This means that pending results may become ready spontaneously (from the lab’s perspective). Most of the operations that access results first synchronise the lab’s notebook with the cluster, retrieving any results that have been resolved since the previous check. (Checks can also be carried out directly.)

ClusterLab.updateResults(purge: bool = False) → int

Update our results with any pending results that have completed since we last retrieved results from the cluster. Optionally purges any jobs that have crashed, which can be due to engine failure within the cluster. This prevents individual crashes from blocking the retrieval of other jobs.

Parameters:purge – (optional) cancel any jobs that have crashed (defaults to False)
Returns:the number of pending results completed at this call
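
Because updateResults() returns only the number of results completed at that call, a client that wants a running total has to accumulate it. The sketch below assumes a hypothetical StubLab that replays a fixed sequence of completion counts; collect is an illustrative helper, not part of epyc.

```python
class StubLab:
    """Hypothetical stand-in for ClusterLab: replays fixed completion counts."""

    def __init__(self, completions):
        self._completions = list(completions)
        self.purge_calls = []

    def updateResults(self, purge=False):
        # Record how we were called, then report the next scripted count
        self.purge_calls.append(purge)
        return self._completions.pop(0) if self._completions else 0


def collect(lab, checks, purge=False):
    """Call updateResults() `checks` times and total the newly completed results."""
    return sum(lab.updateResults(purge=purge) for _ in range(checks))


lab = StubLab([3, 0, 2])
total = collect(lab, checks=4, purge=True)
```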

Connection management

A ClusterLab can be opened and closed to connect and disconnect from the cluster: the class’ methods do this automatically, and try to close the connection where possible to avoid occupying network resources. Closing the connection explicitly will cause no problems, as it re-opens automatically when needed.

Important

Connection management is intended to be transparent, so there will seldom be a need to use any of these methods directly.

ClusterLab.open()

Connect to the cluster. This will involve several possible re-tries.

ClusterLab.close()

Close down the connection to the cluster.

In a very small number of circumstances it may be necessary to take control of (or override) the basic connection functionality, which is provided by two other helper methods.

ClusterLab.connect()

Low-level connection to the cluster. Most code should use open() to open the connection: this method performs a single connection attempt, raising an exception if it fails.
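
The relationship between a single-attempt connect() and a retrying open() can be sketched as below. This is an illustration of the general retry pattern under assumed behaviour, not epyc's actual implementation: open_with_retries and flaky_connect are hypothetical names.

```python
import time


def open_with_retries(connect, attempts=5, delay=0.0):
    """Call connect() up to `attempts` times; re-raise the last error if all fail."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    last_error = None
    for attempt in range(1, attempts + 1):
        try:
            connect()
            return attempt          # number of attempts actually used
        except Exception as e:
            last_error = e
            time.sleep(delay)
    raise last_error


# Hypothetical connection that fails twice before succeeding
calls = {"n": 0}


def flaky_connect():
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionError("cluster not reachable")


attempts_used = open_with_retries(flaky_connect)
```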

ClusterLab.activate()

Make the connection active to ipyparallel/Jupyter. Usually only needed when there are several labs active in one program, where this method selects the lab used by, for example, parallel magics.

Tuning parameters

There is a small set of tuning parameters that can be adjusted to cope with particular circumstances.

ClusterLab.WaitingTime = 30

Waiting time for checking for job completion. Lower values increase network traffic.

ClusterLab.Reconnections = 5

Number of attempts when re-connecting to a cluster.

ClusterLab.Retries = 3

Number of re-tries for failed jobs.
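
One way to adjust these values is to override them in a subclass, sketched below. This assumes the library looks the attributes up through the instance or its class, which is not confirmed here. PatientLab is a hypothetical name, and the fallback ClusterLab class is a stand-in used only when epyc is not installed.

```python
try:
    from epyc import ClusterLab
except ImportError:
    class ClusterLab:
        """Stand-in carrying the documented defaults, used when epyc is unavailable."""
        WaitingTime = 30
        Reconnections = 5
        Retries = 3


class PatientLab(ClusterLab):
    """Hypothetical lab for a slow or unreliable network."""
    WaitingTime = 60        # check for completed jobs less often
    Reconnections = 10      # try harder to re-connect
```

Subclassing leaves the defaults on ClusterLab untouched, so other labs in the same program are unaffected.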