ClusterLab: Flexible, parallel, asynchronous experiments
-
class
epyc.
ClusterLab
(notebook: epyc.labnotebook.LabNotebook = None, url_file=None, profile=None, profile_dir=None, ipython_dir=None, context=None, debug=False, sshserver=None, sshkey=None, password=None, paramiko=None, timeout=10, cluster_id=None, **extra_args)¶ A
Lab
running on an ipyparallel
compute cluster.
Experiments are submitted to engines in the cluster for execution in parallel, with the experiments being performed asynchronously to allow for disconnection and subsequent retrieval of results. Combined with a persistent
LabNotebook
, this allows for fully decoupled access to an on-going computational experiment with piecewise retrieval of results.
This class requires a cluster to already be set up and running, configured for persistent access, with access to the necessary code and libraries, and with appropriate security information available to the client.
Interacting with the cluster
The ClusterLab
can be queried to determine the number of
engines available in the cluster to which it is connected, which
essentially defines the degree of available parallelism. The lab also
provides a ClusterLab.sync_imports()
method that allows modules
to be imported into the namespace of the cluster’s engines. This needs
to be done before running experiments, to make all the code used by an
experiment available in the cluster.
-
ClusterLab.
numberOfEngines
() → int¶ Return the number of engines available to this lab.
Returns: the number of engines
-
ClusterLab.
engines
() → ipyparallel.client.view.DirectView¶ Return a list of the available engines.
Returns: a list of engines
-
ClusterLab.
sync_imports
(quiet: bool = False) → contextlib.AbstractContextManager¶ Return a context manager to control imports onto all the engines in the underlying cluster. This method is used within a
with
statement.
Any imports should be done with no experiments running; otherwise the method will block until the cluster is quiet. Generally imports will be one of the first things done when connecting to a cluster. (But be careful not to accidentally try to re-import if re-connecting to a running cluster.)
Parameters: quiet – if True, suppresses messages (defaults to False)
Returns: a context manager
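As a minimal sketch of the start-up sequence described above, the helper below reports the available parallelism and then imports a module onto the engines. The name `prepare_engines` is hypothetical, `lab` is assumed to be a `ClusterLab` already connected to a running cluster, and `math` merely stands in for whatever libraries an experiment actually needs.

```python
def prepare_engines(lab):
    """Report the number of engines and import modules onto them.

    `lab` is assumed to be a ClusterLab connected to a running cluster.
    This helper is illustrative only and is not part of epyc.
    """
    n = lab.numberOfEngines()
    print(f"{n} engines available")

    # Imports made inside the block are carried out on the engines.
    # Do this before submitting any experiments.
    with lab.sync_imports(quiet=True):
        import math  # placeholder for the experiment's real dependencies

    return n
```

Calling this once, straight after creating the lab, matches the advice to perform imports while the cluster is quiet.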
Running experiments
Cluster experiments are run as with a normal Lab
, by setting
a parameter space and submitting an experiment to ClusterLab.runExperiment()
.
The experiment is replicated and passed to each engine, and
experiments are run on points in the parameter space in
parallel. Experiments are run asynchronously: runExperiment()
returns as soon as the experiments have been sent to the cluster.
-
ClusterLab.
runExperiment
(e: epyc.experiment.Experiment)¶ Run the experiment across the parameter space in parallel using all the engines in the cluster. This method returns immediately.
The experiments are run asynchronously, with the points in the parameter space being explored randomly so that intermediate retrievals of results are more representative of the overall result. Put another way, for a lot of experiments the results available will converge towards a final answer, so we can plot them and see the answer emerge.
Parameters: e – the experiment
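The submission step might be sketched as follows. The helper name `submit_sweep` is hypothetical, and the `lab[name] = values` form for setting the parameter space is assumed from the base `Lab` class rather than taken from this section; `experiment` is assumed to be an `epyc.experiment.Experiment` instance.

```python
def submit_sweep(lab, experiment, parameters):
    """Set a parameter space on the lab and submit an experiment.

    `parameters` maps parameter names to the values to explore.
    Illustrative helper only: not part of epyc.
    """
    # Assumption: the lab accepts parameter ranges by indexing,
    # as the base Lab class does.
    for name, values in parameters.items():
        lab[name] = values

    # Returns as soon as the jobs have been sent to the cluster;
    # results arrive later and are retrieved separately.
    lab.runExperiment(experiment)
```

Because the call does not block, the client can disconnect after submission and pick up results in a later session.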
The ClusterLab.readyFraction()
method returns the fraction of
results that are ready for retrieval, i.e., the fraction of the
parameter space that has been explored. ClusterLab.ready()
tests
whether all results are ready. For cases where it is needed (which
will hopefully be few and far between), ClusterLab.wait()
blocks
until all results are ready.
-
ClusterLab.
readyFraction
(tag: str = None) → float¶ Return the fraction of results available (not pending) in the tagged result set after first updating the results.
Parameters: tag – (optional) the result set to check (default is the current result set)
Returns: the ready fraction
-
ClusterLab.
ready
(tag: str = None) → bool¶ Test whether all the results are ready in the tagged result set – that is, none are pending.
Parameters: tag – (optional) the result set to check (default is the current result set)
Returns: True if the results are in
-
ClusterLab.
wait
(timeout: int = -1) → bool¶ Wait for all pending results in all result sets to be finished. If timeout is set, return after this many seconds regardless.
Parameters: timeout – timeout period in seconds (defaults to forever)
Returns: True if all the results completed
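One way to follow progress without blocking in `wait()` is to poll `ready()` and `readyFraction()`. The sketch below assumes `lab` is a `ClusterLab` with experiments already submitted; the function name `poll_progress` and its `interval` and `max_polls` parameters are hypothetical.

```python
import time

def poll_progress(lab, interval=5.0, max_polls=None, tag=None):
    """Poll the lab until all results in a result set are ready.

    Returns True once everything is ready, or False if `max_polls`
    checks pass first. Illustrative helper only: not part of epyc.
    """
    polls = 0
    while not lab.ready(tag):
        # Report how much of the parameter space has been explored.
        print(f"{lab.readyFraction(tag):.0%} of results ready")
        polls += 1
        if max_polls is not None and polls >= max_polls:
            return False
        time.sleep(interval)
    return True
```

Polling leaves the client free to plot or inspect partial results between checks, which `wait()` does not.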
Results management
A cluster lab performs its computation remotely, typically on another machine or machines. This means that pending results may become ready spontaneously (from the lab’s perspective). Most of the operations that access results first synchronise the lab’s notebook with the cluster, retrieving any results that have been resolved since the previous check. (Checks can also be carried out directly.)
-
ClusterLab.
updateResults
(purge: bool = False) → int¶ Update our results with any pending results that have completed since we last retrieved results from the cluster. Optionally purges any jobs that have crashed, which can be due to engine failure within the cluster. This prevents individual crashes blocking the retrieval of other jobs.
Parameters: purge – (optional) cancel any jobs that have crashed (defaults to False)
Returns: the number of pending results completed at this call
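A direct check might look like the sketch below, which synchronises with the cluster and reports how many results arrived. The helper name `harvest` and its `purge_crashed` parameter are hypothetical; `lab` is assumed to be a `ClusterLab` with pending results.

```python
def harvest(lab, purge_crashed=True):
    """Pull newly completed results from the cluster into the notebook.

    Returns the number of pending results completed by this call.
    Illustrative helper only: not part of epyc.
    """
    # With purge=True, crashed jobs are cancelled so that they do not
    # block the retrieval of the others.
    completed = lab.updateResults(purge=purge_crashed)
    print(f"{completed} pending results completed since the last check")
    return completed
```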
Connection management
A ClusterLab
can be opened and closed to
connect and disconnect from the cluster: the class’ methods do this
automatically, and try to close the connection where possible to avoid
occupying network resources. Closing the connection explicitly will
cause no problems, as it re-opens automatically when needed.
Important
Connection management is intended to be transparent, so there will seldom be a need to use any of these methods directly.
-
ClusterLab.
open
()¶ Connect to the cluster. This may involve several re-tries.
-
ClusterLab.
close
()¶ Close down the connection to the cluster.
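For the rare cases where explicit control is wanted, opening and closing can be paired so that the connection is always released. This is a sketch only: the `connected` context manager is hypothetical and not part of epyc, and `lab` is assumed to be a `ClusterLab`.

```python
from contextlib import contextmanager

@contextmanager
def connected(lab):
    """Hold the lab's cluster connection open for the duration of a block.

    Illustrative helper only: the lab normally manages its own connection.
    """
    lab.open()
    try:
        yield lab
    finally:
        # Always release the connection, even if the block raises.
        # The lab re-opens it automatically when next needed.
        lab.close()
```

A block such as `with connected(lab): ...` then keeps the connection open only while it is in use.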
In a very small number of circumstances it may be necessary to take control of (or override) the basic connection functionality, which is provided by two other helper methods.
-
ClusterLab.
connect
()¶ Low-level connection to the cluster. Most code should use
open()
to open the connection: this method performs a single connection attempt, raising an exception if it fails.
-
ClusterLab.
activate
()¶ Make the connection active to ipyparallel/Jupyter. Usually only needed when there are several labs active in one program, where this method selects the lab used by, for example, parallel magics.
Tuning parameters
A small set of tuning parameters can be adjusted to cope with particular circumstances.
-
ClusterLab.
WaitingTime
= 30¶ Waiting time for checking for job completion. Lower values increase network traffic.
-
ClusterLab.
Reconnections
= 5¶ Number of attempts when re-connecting to a cluster.
-
ClusterLab.
Retries
= 3¶ Number of re-tries for failed jobs.