DMTCP (Checkpointing)
caution
DMTCP is still being tested on Hyak. The module name may change after testing. Please report any issues to help@uw.edu with "Hyak" in the subject.
DMTCP is a tool that transparently checkpoints and restarts jobs: it saves a running job's state to disk so the job can be resumed later. It requires no changes to application code. With checkpointing, you can request shorter job times and rely on requeuing, and make better use of ckpt resources, which increases throughput for your jobs. More extensive documentation can be found in the general DMTCP documentation or the relevant man pages.
info
DMTCP currently does not support the following:
- Jobs using GPUs
- Jobs using Apptainer without built-in checkpointing enabled
- Jobs using MPI. We hope to provide support through MANA in the future.
DMTCP Usage
We provide some opinionated examples of DMTCP usage on Hyak here. For more information, see the general DMTCP documentation or the man pages.
To use DMTCP on Hyak, first load the module using module load testing/dmtcp/3.0.0.
Set the directory where checkpoints will be stored with the environment variable DMTCP_CHECKPOINT_DIR. For example:
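A minimal sketch of setting the variable, assuming a checkpoint directory named dmtcp_checkpoints under the current working directory (a hypothetical location; substitute any path you can write to):

```shell
# Hypothetical checkpoint location; replace with a path of your choice.
export DMTCP_CHECKPOINT_DIR="$PWD/dmtcp_checkpoints"

# Confirm the value that DMTCP will pick up.
echo "Checkpoints will be written to: $DMTCP_CHECKPOINT_DIR"
```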
To start a job, use the dmtcp_launch command, e.g.:
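A minimal sketch of such a launch, assuming the checkpoint interval is passed with the -i (--interval) option of dmtcp_launch. The 300-second interval and the guard around the command are illustrative choices, not requirements:

```shell
# <NUM>: seconds between checkpoints (300 is an illustrative value).
INTERVAL=300

# Guard so the snippet fails gracefully when the DMTCP module is not loaded.
if command -v dmtcp_launch >/dev/null 2>&1; then
    # -i sets the checkpoint interval; everything after it is <COMMANDS TO RUN>.
    dmtcp_launch -i "$INTERVAL" python3 do_research.py
else
    echo "dmtcp_launch not found; run 'module load testing/dmtcp/3.0.0' first" >&2
fi
```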
Where <NUM> is the number of seconds between checkpoints and <COMMANDS TO RUN> is your application (e.g. python3 do_research.py). The application will be checkpointed in the DMTCP_CHECKPOINT_DIR directory every <NUM> seconds.
To restart a stopped job, use the dmtcp_restart command, e.g.:
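A minimal sketch of a restart, assuming the checkpoint images in DMTCP_CHECKPOINT_DIR follow DMTCP's usual ckpt_*.dmtcp naming and that the interval is again passed with -i. The fallback directory and the 300-second interval are hypothetical values:

```shell
# <NUM>: seconds between checkpoints after the restart (illustrative value).
INTERVAL=300

# Fall back to a hypothetical directory if the variable is not set.
CKPT_DIR="${DMTCP_CHECKPOINT_DIR:-$PWD/dmtcp_checkpoints}"

# Restart only if DMTCP is available and checkpoint images exist.
if command -v dmtcp_restart >/dev/null 2>&1 \
    && ls "$CKPT_DIR"/ckpt_*.dmtcp >/dev/null 2>&1; then
    dmtcp_restart -i "$INTERVAL" "$CKPT_DIR"/ckpt_*.dmtcp
else
    echo "Nothing to restart: no DMTCP module or no images in $CKPT_DIR" >&2
fi
```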
This will reload a job from a saved checkpoint. The restarted job will also be checkpointed every <NUM> seconds.
DMTCP can also be used in a batch script which automatically resumes from a prior checkpoint if it exists. A brief outline is as follows:
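The following is a sketch of such a script, not a definitive template. The file name dmtcp_job.slurm, the job name, the time limit, and the checkpoint directory are hypothetical, and site-specific options such as an account are omitted. The snippet writes the batch script to a file, and the script decides between launching and restarting by looking for existing checkpoint images:

```shell
# Write a hypothetical batch script to dmtcp_job.slurm.
cat > dmtcp_job.slurm <<'EOF'
#!/bin/bash
# Job name and time limit below are illustrative values.
#SBATCH --job-name=dmtcp_example
#SBATCH --partition=ckpt
#SBATCH --time=01:00:00
#SBATCH --requeue

module load testing/dmtcp/3.0.0

# Hypothetical checkpoint location; it should not exist before the first run.
export DMTCP_CHECKPOINT_DIR="$PWD/dmtcp_checkpoints"

if ls "$DMTCP_CHECKPOINT_DIR"/ckpt_*.dmtcp >/dev/null 2>&1; then
    # Checkpoint images exist: resume and keep checkpointing every 300 s.
    dmtcp_restart -i 300 "$DMTCP_CHECKPOINT_DIR"/ckpt_*.dmtcp
else
    # First run: start the application with a 300 s checkpoint interval.
    dmtcp_launch -i 300 python3 do_research.py
fi
EOF
```

The generated file can then be submitted with sbatch dmtcp_job.slurm.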
This batch script runs the application python3 do_research.py and checkpoints it every five minutes. It assumes that the DMTCP_CHECKPOINT_DIR directory doesn't exist before the job first starts.
Jobs that use checkpointing can be requeued, either with the --requeue Slurm flag or automatically by the ckpt partition. This allows better use of the ckpt partition and shorter requested run times, both of which get your jobs done sooner.
Acknowledgements
This documentation is inspired by Clemson's DMTCP documentation and NERSC's DMTCP documentation.