# Resource Monitoring
The Hyak clusters use the Slurm scheduler to submit and run jobs. The scheduler provides a rich set of commands (e.g., `sacct`, `sinfo`) to query the state of the cluster, but the extensive options can be daunting to navigate. We provide some useful example calls below, along with information about our custom resource monitoring program, `hyakalloc`.
## squeue
`squeue` monitors the Slurm queue. By default, `squeue` lists all jobs in the queue on Hyak, both running and pending. To monitor a specific user's jobs, use `squeue -u UWNetID`, replacing `UWNetID` with the UW Net ID of the user of interest. To view the queue for a specific account, use `squeue -A accountname`.
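
As a minimal sketch of the two calls described above, the snippet below wraps them in small shell functions. The function names `jobs_for_user` and `jobs_for_account` are hypothetical conveniences, not part of Slurm or Hyak, and the fallback to `$USER` assumes your login name matches your UW Net ID.

```shell
# Hypothetical helpers (not part of Slurm): thin wrappers around squeue.

# List queued jobs for one user. With no argument, fall back to the
# current login name, assuming it matches your UW Net ID.
jobs_for_user() {
    squeue -u "${1:-$USER}"
}

# List queued jobs charged to one account.
jobs_for_account() {
    squeue -A "$1"
}
```

On a login node, `jobs_for_user` with no argument would show your own jobs, and `jobs_for_account accountname` would show the queue for that account.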
## sacct
`sacct` displays accounting data for all jobs and job steps in the Slurm job accounting log or Slurm database.
### See all running jobs
The `-a` flag selects jobs from all users; replace it with `-u netID` to see a specific user's jobs. You can also change the output fields with the `-o` flag.
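
A call along the following lines lists running jobs. This is a sketch rather than the exact command the Hyak team recommends: it assumes the standard `sacct` options `-s R` (filter on the RUNNING state) and `-X` (one line per job instead of one per job step), and both the field list and the function name `running_jobs` are illustrative choices.

```shell
# Hypothetical helper: show currently running jobs.
# Pass -a for all users or -u netID for a specific user.
running_jobs() {
    # -X: one line per job allocation (skip individual job steps)
    # -s R: only jobs in the RUNNING state
    # -o: illustrative output fields, adjust to taste
    sacct "$@" -X -s R -o jobid,user,account,partition,state,elapsed,nodelist
}
```

For example, `running_jobs -a` would query every user, while `running_jobs -u netID` would restrict the output to one user.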
### See all pending jobs
The `-a` flag selects jobs from all users; replace it with `-u netID` to see a specific user's jobs. You can also change the output fields with the `-o` flag.
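
A pending-jobs query might look like the sketch below. It assumes the standard `sacct` state filter `-s PD` (PENDING) and an illustrative field list; the function name `pending_jobs` is hypothetical. Depending on your Slurm version and configuration, you may also need to give a time window with `-S` and `-E` for pending jobs to appear.

```shell
# Hypothetical helper: show jobs that are waiting to start.
# Pass -a for all users or -u netID for a specific user.
pending_jobs() {
    # -X: one line per job allocation
    # -s PD: only jobs in the PENDING state
    # -o: illustrative output fields, including submit time and time limit
    sacct "$@" -X -s PD -o jobid,user,account,partition,state,submit,timelimit
}
```

For example, `pending_jobs -u netID` would list one user's waiting jobs together with when each was submitted.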
## sinfo
`sinfo` shows information about Slurm nodes and partitions.
### GPUs
`sinfo` can summarize all the GPUs on the cluster and their current state. You can adjust the output fields with `-O` to show other resources of interest (e.g., `/tmp` space on a node).
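
One way to build such a summary is sketched below. The field names passed to `-O` (`partition`, `nodehost`, `gres`, `gresused`, `statelong`, `disk`) are standard `sinfo` format fields, but the particular selection and the function name `gpu_summary` are assumptions made for illustration. Note that `disk` reports the temporary disk space configured for a node in megabytes, not the space currently free.

```shell
# Hypothetical helper: per-node view of GPU resources.
# Extra arguments are passed through, e.g. gpu_summary -p ckpt
gpu_summary() {
    # -N: one line per node
    # -O: choose output fields
    #     gres      generic resources (such as GPUs) configured on the node
    #     gresused  generic resources currently in use
    #     statelong node state
    #     disk      configured temporary disk space in MB
    sinfo -N "$@" -O partition,nodehost,gres,gresused,statelong,disk
}
```

Nodes without GPUs typically show an empty or `(null)` value in the `gres` column, so you may want to filter those lines out.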
## hyakalloc
While you can use built-in Slurm commands to query which resources are in use and which are available, the Hyak team provides a utility called `hyakalloc`. This program makes those queries on your behalf and presents the results in a user-friendly format.
### Default
If you run `hyakalloc` without any arguments, it displays all the account and partition combinations you have access to, that is, where you can submit jobs that will start right away and run without interruption.
- For each resource, you can see your limits, what is currently in use, and what is available.
- At the bottom, you can also see the overall checkpoint (or `ckpt`) partition, where you can access idle resources from other groups across the cluster.
- The output concludes with a notice about the next cluster maintenance. Keep this date in mind when submitting jobs, and specify a job runtime that ends before the maintenance begins.
### Options
The `hyakalloc` program has a rich set of command-line arguments for more complex queries. For example, you can check another user's resource access and limits with `-u netID`, or audit your group's limits with `-g GROUP`. If the output is too verbose, you can filter it down to specific partitions with the `-p` flag.
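
The three flags mentioned above can be sketched as small wrappers. The function names are hypothetical; only the `-u`, `-g`, and `-p` flags come from `hyakalloc` itself, and each wrapper passes a single flag so no assumption is made about how the flags combine.

```shell
# Hypothetical helpers wrapping the hyakalloc flags described above.

# Another user's resource access and limits.
user_alloc() {
    hyakalloc -u "$1"
}

# A group's limits.
group_alloc() {
    hyakalloc -g "$1"
}

# Restrict the output to one partition, e.g. partition_alloc ckpt
partition_alloc() {
    hyakalloc -p "$1"
}
```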