Scheduling Jobs
When you first ssh into `klone`, you land on one of the two login nodes (e.g., `klone-login01`). Login nodes are shared among all users for transferring data, navigating the file system, and requesting resource slices for heavy-duty computing. You should never use login nodes for heavy computing; automated mechanisms monitor the login nodes and enforce this policy. The tool used to notify users of violations is arbiter2, and you will receive an email for each offending process (Gardner, Migacz, and Haymore 2019).
To keep the login nodes in stable working order and ensure fair usage of this community resource, Hyak uses job scheduling software that gives you access to other nodes (i.e., different computers that are part of the `klone` cluster). The job scheduler is called Slurm, and regular users of Hyak need to learn how to use Slurm to make effective and efficient use of Hyak as a resource for research computing.
Relevant Vocabulary
Account: In the context of using Slurm, "account" refers to the group(s) you belong to, not your UWnetID. The `hyakalloc` command will display accounts you can submit jobs with (i.e., under the Slurm `sbatch` directive `--account`).
Checkpoint partitions: Abbreviated `ckpt`, `ckpt-g2`, and `ckpt-all`, these partitions represent idle resources across the cluster at any moment. All cluster users are eligible to submit jobs to these partitions, and the jobs will run subject to availability. To provide some regular churn in pending checkpoint jobs, jobs running for more than 4 hours (HPC jobs) or more than 8 hours (GPU jobs) are re-queued (i.e., re-submitted to the checkpoint partition queue). Jobs continue in this manner until they exit or the requested runtime is fulfilled. For more information see Using Idle Resources.
Idle Resource: A cluster resource is "idle" when it currently has no running jobs. Requested idle resources are not guaranteed.
Interactive Session: An interactive session on the cluster allows users to access a compute node in real time for tasks that require direct interaction, exploration, or debugging. Request an interactive job with the `salloc` command.
Node: In HPC, a server is synonymous with a node. 1 server = 1 node so it is OK to use those two terms interchangeably. You can also think of nodes as distinct but network-connected computers.
- HPC Node: A standard compute node with no additional components and variable amounts of memory at time of procurement.
- GPU Node: A standard node with GPU cards added in at time of procurement. GPUs are typically used for machine learning workflows and in rarer cases for applications that have been specifically ported over to GPUs to speed up the runtime.
Partition: A partition is a logical subdivision of the Hyak cluster resources. Specifically, each partition represents a class of node. For example, the partitions on the cluster include `compute`, `cpu-g2`, `ckpt`, `ckpt-g2`, `ckpt-all`, `compute-bigmem`, `cpu-g2-mem2x`, and the GPU partitions. `hyakalloc` will display partitions, in addition to `ckpt`, that you can submit jobs with (i.e., under the Slurm `sbatch` directive `--partition`).
Queue: A queue is a waiting area for jobs that have been submitted to the cluster but are not yet executing. The scheduler manages the order in which jobs are taken from the queue for execution. The Slurm queue can be monitored with the command `squeue`; running `squeue -u UWNetID` (replacing the word `UWNetID` with your UW NetID) will show your submitted jobs that are waiting in the queue or being executed.
Scheduler: A job scheduler is a component or software system responsible for managing and optimizing the allocation of computing resources and tasks within a distributed computing environment. It orchestrates the execution of jobs, tasks, or processes across available resources such as CPUs, memory, and storage.
Slurm: The job scheduler used on Hyak. Slurm stands for Simple Linux Utility (for) Resource Management. See Slurm documentation for detailed help using the job scheduler.
Set Up
If you haven't already, log on to `klone` for this tutorial.
For the following exercises, we will create a working directory. We recommend placing your working directory in a filesystem location where you have a large storage quota, not in your Home directory (limit 10GB; Click here to learn more about Home directory storage limits). For this demonstration, we will create a working directory in Hyak's free community storage under `/gscratch/scrubbed` (Click here to learn more about Scrubbed storage). First navigate to `/gscratch/scrubbed`:
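From a login-node shell:

```bash
cd /gscratch/scrubbed
```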
If you have not already, make a directory with your UW NetID using `mkdir` and go into it:
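Replace the word `UWNetID` with your own UW NetID:

```bash
mkdir UWNetID
cd UWNetID
```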
This will be your working directory for this section. Note that files and directories will be deleted after 21 days if they are not used.
To start, copy the necessary tutorial materials to your working directory. Because we are copying an entire directory, make sure to use `-r` to recursively copy:
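A sketch of the copy command; the source path shown here is a placeholder, so substitute the actual location of the tutorial materials:

```bash
# Hypothetical source path -- replace with the real location of the tutorial materials
cp -r /path/to/tutorial-materials/basics .
```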
Ensure all materials were copied into your working directory:
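For example, assuming the copied directory is named `basics` as in the rest of this tutorial:

```bash
ls basics
```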
This directory contains other materials, but the materials listed below will be used in the next exercises.
Accounts and Partitions
The first stop in understanding job scheduling is to understand that every user is part of an account and thus has access to certain partitions. Your account is usually related to a lab or research group that you belong to; for example, you may be part of a lab group that has contributed resources to Hyak, affording you priority usage of those resources, which are organized into one or more partitions. Alternatively, you may be a student user who is part of the Research Computing Club, or account `stf`, meaning that you have priority access under the `stf` account, which allows you to use several partitions. Additionally, all users can use Hyak resources when they are idle by scheduling jobs on the `ckpt`, `ckpt-g2`, or `ckpt-all` partitions (Click here to learn more about `ckpt` jobs).
Pro Tip - Get an STF account
If you are a student who pays the student technology fee (STF), you are eligible for an `stf` account, which will increase your access and improve your user experience on Hyak because there are resources designated for students. Click here to find out how to get an STF account. NOTE: The Hyak Team doesn't manage the `stf` account group.
Let's start by checking which accounts and partitions you have access to with the `hyakalloc` command.
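Run it from a login node:

```bash
hyakalloc
```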
The result will look different for each user. Yours might look something like this:
This example output, for a user from the fictional account called "account", shows access to a `compute` partition and a `gpu-rtx6k` partition. The displayed partitions could include compute partitions, larger memory partitions, and GPUs. The table also shows which resources under the account were in use when the `hyakalloc` command was executed (as shown, all resources are free at the moment). The bottom table shows how many CPUs and GPUs are idle under the checkpoint partitions.
The `hyakalloc` results are a tool for you to prepare your job request. For example, if you want to use the `compute` partition but all of its CPUs are in use, you might consider using another partition with free CPUs or the `ckpt-all` partition.
Hyak Demo Account Users
If you are using Hyak with a demonstration account, your `hyakalloc` table will look like the following, because you are not part of a Hyak account group and have not been given priority access to a partition.
For some of the exercises presented next, you can ignore the `--account` flag, and you will only be able to request jobs from the checkpoint partitions (`ckpt`, `ckpt-g2`, and `ckpt-all`) with the `--partition` flag.
While you won't necessarily have access to all of them, it might be useful for you to see a list of Hyak's partitions. The `sinfo` command reports information about the servers or nodes that compose Hyak, and `sinfo -s` gives you a summary of this information, including the partitions and the hostnames that fall into each partition.
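For example:

```bash
sinfo -s
```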
Pro Tip - Monitoring the Slurm Job Queue
In the following sections, it is often useful to have two terminal windows open and logged into `klone`: one for editing scripts and issuing commands, and one for monitoring active jobs in the queue. Open up a second terminal and use `ssh` to log in to Hyak. In this terminal, monitor jobs using the command:
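Replace the word `UWNetID` with your UW NetID:

```bash
watch -n 10 squeue -u UWNetID
```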
`watch -n 10` will issue the `squeue -u UWNetID` command every 10 seconds, allowing you to see the jobs you have submitted enter the queue and change states. Right now the queue is likely empty because we haven't requested any jobs yet, but jobs will appear in this window as we continue with the tutorial.
The state of the job is listed under "ST" in this window. Some of the most common job states are:
- PD: Pending job
- R: Running job
- S: Suspended job
- CG: Completing job
- CD: Completed job
In the next exercises, leave this terminal open running `watch -n 10 squeue -u UWNetID`, and continue with the exercises in the other terminal window.
Interactive Jobs
An interactive session on the cluster allows users to access a compute node in real time for tasks that require direct interaction, exploration, or debugging. Use the `salloc` command to request an interactive job. If you have a quick job, need to test many commands individually that will later become the components of a script, or are preparing software to use later, an interactive session may be the best choice. Let's start an interactive job on the `ckpt-all` partition (feel free to use another partition if you like). We will specify that we want a single CPU with the flag `--cpus-per-task=1`, 10G of RAM with `--mem=10G`, and a maximum time of 2 hours with `--time=2:00:00`. The job will automatically end after 2 hours if we don't end it sooner.
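A sketch of that request (depending on your access, you may also need to add an `--account` flag):

```bash
salloc --partition=ckpt-all --cpus-per-task=1 --mem=10G --time=2:00:00
```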
The output will look something like this:
Relevant information from this output is:
- The JobID, which in this example is 18981043 but will be a different integer for every job on Hyak, is a unique identifier for the job. If you have specific questions about a job you ran, it is good practice to make note of the JobID so that Hyak support staff can find the job and better understand its request and behavior.
- The node or hostname that was provisioned for the job; in this example, the job is running on a compute node called `n3424`.
After your job has been requested and has started, your shell prompt will show that you are no longer on the login node. It will look something like this:
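A rough sketch of the prompt, based on the description that follows (the exact format can vary with your shell configuration):

```
[UWNetID@n3424 basics]$
```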
Except that the word `UWNetID` will be replaced with your NetID, `n3424` will be replaced with the node Slurm assigned to your interactive job, and `basics` will be replaced with the name of your current directory (your location on the filesystem). If you have been following along, your current directory should be the `basics` directory that we copied as part of the Set Up for this tutorial.
By completing the `salloc` command exercise, you have successfully scheduled a job on Hyak using Slurm. You might also hear this method described as "starting an interactive job on a compute node." In the next section, we will use this interactive job to run a simple script.
Pro tip - Requesting a GPU job
You can also request an interactive session on a GPU with `salloc`.
Requesting GPUs from a GPU partition
Use the `hyakalloc` command to see the GPU partitions associated with the accounts you are part of; you will be able to schedule jobs on these GPU resources with priority. Using the earlier example output, the command to start an interactive job on a GPU node would look like the following:
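A sketch, assuming the fictional account `account` and its `gpu-rtx6k` partition from the `hyakalloc` example above:

```bash
salloc --account=account --partition=gpu-rtx6k --gpus=1 --mem=10G --time=2:00:00
```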
Which would give you a job with 1 CPU (default), 1 GPU, and 10G of RAM for 2 hours.
Requesting GPUs from Checkpoint
If you are requesting a GPU job as a demonstration account user, or as a user without priority access to a GPU partition, you will request GPU jobs from checkpoint idle resources. If you need to request a GPU node via checkpoint, the `salloc` command will be similar to the following:
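A sketch of a checkpoint GPU request:

```bash
salloc --partition=ckpt --gpus-per-node=a40:1 --mem=10G --time=2:00:00
```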
Which would give you a job with 1 CPU (default), 1 A40 GPU, and 10G of RAM for 2 hours. Note that when requesting GPU jobs from the `ckpt`, `ckpt-g2`, or `ckpt-all` partitions, you must include the GPU model or type after `--gpus-per-node`, along with the number of GPUs you want to allocate. In this example, we are requesting one NVIDIA A40 GPU.
Identifying Idle GPU Types
When requesting GPUs from checkpoint idle resources, it might be useful to check which GPU types are currently idle.
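One way to do this is to compare each node's configured GPUs (`Gres`) with those currently allocated (`GresUsed`) using `sinfo`; a sketch of such a query:

```bash
# List each checkpoint node's configured GPUs and the GPUs currently in use
sinfo -p ckpt-all -O "NodeHost:15,Partition:15,Gres:30,GresUsed:40"
```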
In the output, each row shows a node's configured GPUs and how many are in use. If, for example, a node shows at least seven (out of eight) idle 2080ti GPUs, one of them could be requested with:
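A sketch of that request (using `2080ti` as the GPU type string):

```bash
salloc --partition=ckpt --gpus-per-node=2080ti:1 --mem=10G --time=2:00:00
```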
Confirming the GPU is Active
After requesting a GPU job, you can check whether the GPU is active using the `nvidia-smi` command:
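Run it on the GPU node that was allocated to you:

```bash
nvidia-smi
```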
The output will be two tables. The first table shows information such as the temperature (degrees Celsius), performance state (ranging from P0-P12, where P0 is the maximum performance state) and how much memory is used for all available GPUs. The second table provides information on all the processes using GPUs.
To continuously update the output every 5 seconds, use the flag `--loop=5`:
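For example:

```bash
nvidia-smi --loop=5
```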
A Simple Script as a Command Proxy
A CPU job and a GPU job will work equivalently in this section. After requesting an interactive job, let's try to run a simple script on the compute node. If you have been following along, you should have `loop_script.sh` in the `basics` directory.
Use the `cat` or `nano` command to view this script.
Important
The point of showing you this script is not how the script is coded or what the script does. We will use this script in the next sections to demonstrate executing commands with the Slurm job scheduler. This script is a proxy for scripts or commands you might use on Hyak. Our main goal is to prepare you to adapt the Slurm commands for your research computing projects on Hyak.
`loop_script.sh` takes a starting value and an ending value and counts until the variable `i` reaches the ending value. To execute it, use `./` with the desired starting and ending values:
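For example, with hypothetical start and end values of 0 and 10:

```bash
./loop_script.sh 0 10
```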
The output should look like this:
To see how long a job took, use the `time` command:
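For example, timing a count from 0 to 1000000 (the same command we will reuse as a batch job below):

```bash
time ./loop_script.sh 0 1000000
```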
The output should look something like this:
Understanding the time output
The real time is the wall-clock time it takes for a job to finish. In this case, the job completed in 4.216 seconds; it might take more or less time for you. The user time refers to the amount of time the CPU spends in user mode within the process, and the system time is the amount of time the CPU spends in kernel (or supervisor) mode.
In the next section, we will execute `time ./loop_script.sh 0 1000000` again, but this time as a batch job that is executed without our supervision.
OPTIONAL: To end an interactive job, type `exit` into the terminal.
Batch Jobs
For longer running jobs, we recommend using a Slurm or `sbatch` script to submit a batch job rather than executing commands in an interactive session. Batch jobs are executed "in the background," that is, submitted to be executed on a compute node without the user's supervision. With batch jobs, a user can prepare their commands; instruct Slurm which resources (the number of CPUs, GPUs, memory, and time) the job needs to execute successfully; submit the job and end their connection to Hyak; and then return to view the results of the job when it has completed.
In this section, we will use the `loop_job.slurm` script in the `basics` directory to run `loop_script.sh` as a batch job. Use `nano` to view and edit `loop_job.slurm`, and we will walk through sections of the script.
The first few lines of `loop_job.slurm` should look like this:
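Based on the options described below, the header will look roughly like the following sketch (your copy of the script may differ in flag order or exact values):

```bash
#!/bin/bash

#SBATCH --job-name=loop_job
#SBATCH --partition=ckpt
#SBATCH --ntasks=1
#SBATCH --mem=1G
#SBATCH --time=00:05:00
#SBATCH -o log/%x_%j.out
```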
Slurm job scripts are written in the coding language bash and, as such, start with `#!/bin/bash`, also known as a "shebang." This ensures that the bash shell is used to run the script. The subsequent flags starting with `#SBATCH` are options for the `sbatch` command and communicate the specifications of the job being requested. Notice how the flags are reminiscent of the `salloc` command flags.
As written, this script requests a single task with 1G of RAM, named `loop_job`, to be sent to the `ckpt` partition. The maximum time for the job is set to 5 minutes. The `#SBATCH -o log/%x_%j.out` line requests that the standard output (stdout), the information usually printed to the screen during the execution of a command, be saved to an output file. This line also sets up the filename for the output file; in this case, the file will be saved to a directory called `log/`, and the filename will be similar to `loop_job_123456789.out` because `%x` is shorthand for the job name (`loop_job`) and `%j` is shorthand for the JobID, which is assigned when the job is submitted.
Notice that we haven't specified an account; Slurm will choose an account by default. However, if you would like to specify the account, you would include an additional option, `#SBATCH --account=`, followed by the desired account. Use `hyakalloc` to see the available accounts and partitions you can use when requesting jobs.
The commands you wish to execute follow the `#SBATCH` option lines. In this case, we want to run `loop_script.sh` from 0 to 1000000 and see how long it takes:
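In the script body, that command would be something like:

```bash
time ./loop_script.sh 0 1000000
```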
At the bottom of `loop_job.slurm`, we explain all options in some detail for your reference.
Exit the nano text editor with ctrl+x. The command to submit batch jobs is `sbatch`. Submit the job using `sbatch`:
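From your working copy of the `basics` directory:

```bash
sbatch loop_job.slurm
```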
If you set up a separate window to monitor your jobs (see the pro tip above), details about `loop_job` should appear in that window. The new `log` directory containing the output file should also have been created by now. The listed output file name will look something like this:
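Listing the directory will show a file named after the job name and JobID, for example (your JobID will differ):

```bash
ls log
# loop_job_123456789.out
```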
Examine the contents of the output file to see how long the sequence took:
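For example (substitute the actual filename from your `log` directory):

```bash
cat log/loop_job_123456789.out
```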
All outputs and error messages will appear in this file:
The purpose of this exercise was to execute a command via a batch job: the `loop_job.slurm` script requested the resources and executed the command to run `loop_script.sh`.
Pro tip - multithreading
TODO
Literature Cited
Gardner, Dylan, Robben Migacz, and Brian Haymore. "Arbiter: Dynamically Limiting Resource Consumption on Login Nodes." Proceedings of the Practice and Experience in Advanced Research Computing on Rise of the Machines (learning), 2019, pp. 1-7. [DOI: 10.1145/3332186.3333043] [Code: GitLab]