Using a cluster without staying connected to it
Problem: You’re running your simulations on a remote machine and, because they involve a lot of computation, you don’t want your local machine to have to stay connected while they run.
Solution: epyc’s cluster labs can work asynchronously, so you
submit the experiments you want to do and then come back to collect
them later. This is good for long-running sets of experiments, and
especially good when your front-end machine is a laptop that you want
to be able to take off the network when you go home.
Asynchronous operation is actually the default for ClusterLab.
Starting experiments by default creates a pending
result that will be resolved when it’s been computed on the cluster.
from epyc import ClusterLab, HDF5LabNotebook

# connect to the cluster, storing results in a new HDF5 notebook
lab = ClusterLab(profile="mycluster",
                 notebook=HDF5LabNotebook('mydata.h5', create=True))
nb = lab.notebook()
# perform some of the first experiments
nb.addResultSet('first-experiments')
lab['a'] = 12
lab['b'] = range(1000)
e = MyFirstExperiment()
lab.runExperiment(e)
# and then some others
nb.addResultSet('second-experiments')
lab['a'] = 15
lab['b'] = ['cat', 'dog', 'snake']
lab['c'] = range(200)
e = MySecondExperiment()
lab.runExperiment(e)
You can then wait to get all the results:
lab.wait()
which will block until all the results become available, implying that your machine has to stay connected to the cluster until the experiments finish: possibly a long wait. Alternatively you can check what fraction of each result set has been successfully computed:
lab = ClusterLab(profile="mycluster",
                 notebook=HDF5LabNotebook('mydata.h5'))
nb = lab.notebook()
nb.select('first-experiments')
print(lab.readyFraction())
nb.select('second-experiments')
print(lab.readyFraction())
(This is an important use case, especially when using a remote cluster with a Jupyter notebook, and is described in more detail in the Fourth tutorial: Integration with Jupyter.) The notebook will gradually be emptied of pending results and filled with completed results, until none remain. You can poll until every result set is ready:
import time
allReady = False
tags = nb.resultSets()
while not allReady:
    time.sleep(5)    # wait 5s
    allReady = all(map(lambda tag: lab.ready(tag), tags))
print('All ready!')
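A loop like this runs forever if a result never arrives. As a minimal sketch, the same polling can be wrapped in a helper with an optional timeout. The name wait_until and its parameters are hypothetical and are not part of epyc; the helper only assumes you can supply a callable that reports whether everything is ready.

```python
import time
from typing import Callable, Optional


def wait_until(is_ready: Callable[[], bool],
               interval: float = 5.0,
               timeout: Optional[float] = None) -> bool:
    """Poll is_ready() until it returns True or the timeout expires.

    Hypothetical helper, not part of epyc. Returns True as soon as
    is_ready() is true, or False if `timeout` seconds pass first.
    A timeout of None means keep polling indefinitely.
    """
    deadline = None if timeout is None else time.monotonic() + timeout
    while not is_ready():
        if deadline is not None and time.monotonic() >= deadline:
            return False          # gave up: results still pending
        time.sleep(interval)      # pause between checks
    return True
```

With the lab and notebook from the recipe, a call might look like wait_until(lambda: all(lab.ready(tag) for tag in nb.resultSets()), interval=5, timeout=3600), which gives up after an hour instead of blocking indefinitely.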
The system for retrieving completed results is quite robust in that it commits the notebook as results come in, minimising the possibility of loss through a crash.
Important
If you look at the API for LabNotebook you’ll see methods for
LabNotebook.ready() and LabNotebook.readyFraction(). These
check the result set without updating; the corresponding methods
Lab.ready() and Lab.readyFraction() check the result set
after updating with newly-completed results.
You can also, if you prefer, force an update of pending results directly:
lab.updateResults()
The call to ClusterLab.updateResults() connects to the cluster
and pulls down any results that have completed, entering them into the
notebook. You can then query the notebook (rather than the lab) about
what fraction of results are ready, taking control of when the cluster
is interrogated.
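One way to use this is to contact the cluster exactly once and then read the progress of every result set from the notebook. The sketch below assumes that LabNotebook.readyFraction() accepts a result-set tag in the same way that Lab.ready(tag) does in the polling loop; the function name progress_by_result_set is hypothetical and not part of epyc.

```python
def progress_by_result_set(lab, nb):
    """Return a dict mapping each result-set tag to its ready fraction.

    Hypothetical helper, not part of epyc. It assumes `lab` provides
    updateResults() and `nb` provides resultSets() and
    readyFraction(tag), as described in the text.
    """
    lab.updateResults()    # the only step that talks to the cluster
    # every query below is answered from the notebook alone
    return {tag: nb.readyFraction(tag) for tag in nb.resultSets()}
```

Because only the first line interrogates the cluster, the cost of a progress report does not grow with the number of result sets being checked.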