Slurm

Tillicum uses the Slurm workload manager for scheduling and running jobs. While Slurm is also used on Klone, there are some important differences in how access and priorities work.

Key Differences from Klone

  • Tillicum uses a usage-based model. All users have the same priority, and access is controlled through QOS (Quality of Service) rather than partitions.
  • No checkpoint partition exists on Tillicum.
  • Simpler access – you don’t need to determine partitions; just select the appropriate QOS.
  • Klone uses a condo model. Research groups have dedicated accounts and partitions tied to the resources they purchased. This makes partition choice complex, and we provide hyakalloc to help users determine access. Klone also provides the checkpoint partitions (i.e., ckpt, ckpt-g2, and ckpt-all) for accessing idle resources outside priority accounts.
  • On Tillicum, use squeue and sinfo to monitor your jobs and cluster traffic. Learn more below.

Tillicum Usage Rates

GPU Hour = Elapsed Time x N GPUs

Usage Rate: $0.90/GPU Hour - Billing is monthly and handled as a subscription in UW-IT's ITBill system. Every scheduled job on Tillicum is subject to the usage rate and requires at least 1 GPU (141 GB GPU memory).

  • Jobs are bound by a maximum of ~200 GB system RAM and 8 CPUs per GPU requested
  • If more system RAM or more CPUs are required, additional GPUs must be added
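
As a quick worked example of the rate above (illustrative numbers only): a job that runs for 6 hours on 4 GPUs accrues 6 x 4 = 24 GPU hours, or roughly 24 x $0.90 = $21.60. The small shell snippet below reproduces the same arithmetic; the variable names are just placeholders.

GPUS=4
HOURS=6
# GPU hours = elapsed hours x GPUs; billed at $0.90 per GPU hour
awk -v g="$GPUS" -v h="$HOURS" 'BEGIN { printf "Estimated cost: $%.2f\n", g * h * 0.90 }'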

Tillicum QOS

Tillicum jobs are submitted under a "Quality-of-Service" or QOS, which defines limits like wall time, GPU count, and concurrent jobs.

  • All Tillicum compute nodes have 8 GPUs (141 GB each), and each GPU is provisioned with 200 GB of system RAM and 8 CPUs.
  • You must request at least 1 GPU. CPU-only jobs are not allowed.

QOS           Max Time     Max GPUs per Job   Concurrent GPU Limit   Notes
normal        24 hours     16                 96 GPUs                Standard production work
debug         1 hour       1                  1 job                  Quick testing and setup
interactive   8 hours      2                  2 jobs                 For real-time work or debugging
long          by request   details TBA        details TBA            For special long jobs
wide          by request   details TBA        details TBA            For distributed jobs
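
If you want to confirm the limits currently enforced for each QOS yourself, a standard Slurm accounting query such as the one below should list them. This is generic Slurm tooling rather than a Tillicum-specific command, and the exact field names can vary slightly between Slurm versions.

sacctmgr show qos format=Name%12,MaxWall,MaxTRES%30,MaxTRESPU%30,MaxJobsPU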

We will continually evaluate this policy based on user feedback.


Understanding Job Types

There are two main ways to run work on Tillicum:

Job Type           Command   Best For                       Runs On
Interactive Jobs   salloc    Exploratory or hands-on work   A compute node you connect to directly
Batch Jobs         sbatch    Long or unattended jobs        Runs automatically when resources are available

Running Jobs

Interactive job with salloc

Run a single-GPU debug job with the maximum allowable resources for the QOS and a time limit of 30 minutes:

salloc --qos=debug --gpus=1 --cpus-per-task=8 --mem=200G --time=00:30:00

Run a normal QOS job with 2 GPUs:

salloc --qos=normal --gpus=2 --cpus-per-task=16 --mem=400G --time=04:00:00

Note: If you don’t specify --qos, the job will default to normal.
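
When you are finished with an interactive session, type exit (or press Ctrl-D) to leave the shell and release the allocation; salloc will report that the job allocation has been relinquished, which is standard Slurm behavior. Since billing is based on GPU hours actually consumed, releasing the allocation promptly avoids unnecessary charges.

exit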

warning

You must specify the number of GPUs you are requesting; jobs without GPUs are not permitted on Tillicum. A request that does not specify GPUs results in the following error:

salloc --qos=normal --cpus-per-task=1 --mem=4G --time=01:00:00
salloc: error: Req GPUs: 0
salloc: error: ERROR: Jobs must request at least 1 GPU, use -G <num> or --gpus <num> or --gres=gpu:<num>.
salloc: error: Job submit/allocate failed: Unspecified error

Batch job with sbatch

Example job.slurm script:

job.slurm
#!/bin/bash
#SBATCH --job-name=myjob
#SBATCH --qos=normal
#SBATCH --gres=gpu:4
#SBATCH --cpus-per-task=32
#SBATCH --mem=800G
#SBATCH --time=08:00:00
#SBATCH --output=slurm-%j.out

module load conda
conda activate my_env
python my_script.py
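
Submit the script by passing it to sbatch:

sbatch job.slurm

Slurm responds with a line of the form "Submitted batch job <JOB_ID>"; use that ID with squeue, seff, or scancel.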

Monitoring Jobs and Resource Availability

Your best tool for monitoring the progress of your jobs is the squeue command, which shows all jobs running or requested on the cluster. A quick look at squeue output lets you estimate cluster traffic. Running squeue with the -u flag and your NetID shows the jobs you have submitted.

squeue -u $USER

If your job is in state "PD" (pending) under the "ST" column, check the "REASON" column to determine why the job is being held. Common reasons include "ReqNodeNotAvail", meaning your job overlaps with a maintenance reservation, and "QOSResourceLimit", which indicates your job exceeds your individual resource limit but will run when additional resources become available (i.e., when your other jobs finish). Guide to job reasons.
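
To list only your pending jobs together with their QOS and hold reason, a standard squeue format string like the one below can help; the specific columns are just a suggestion.

squeue -u $USER -t PD -o "%.12i %.10q %.2t %.10M %R"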

sinfo can also be helpful for determining how many nodes are available. The following command provides a useful summary.

sinfo -r
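
For a per-partition count of nodes in each state (allocated, idle, down, and so on), a generic sinfo format string such as this also works; the chosen columns are an assumption about what is most useful.

sinfo -r -o "%.12P %.6a %.6D %.10t"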

Budgeting and Tillicum Usage

To help guide your work, our Slurm job submit script shows an estimate of how much your job will cost when you schedule it. For example:

salloc --qos=normal --gpus=1 --time=2:00:00
salloc: Req GPUs: 1
salloc: Req Time: 2.00 hrs
salloc: YOUR COST: $1.80 (Est.)
salloc: NOTE: This is only an estimate based upon GPU hours requested. Billing is rounded DOWN to the nearest GPU hour on actual GPU hours consumed at a rate of $0.90 per 1 GPU per 1 hour. You can still cancel this job at no charge (i.e., scancel <JOB_ID>).
salloc: Granted job allocation 4809
salloc: Waiting for resource configuration
salloc: Nodes g020 are ready for job

Monitoring Job Efficiency with seff

Tillicum has the seff utility installed, which reports resource efficiency for your completed jobs.

Example usage:

seff 231

Example output:

Job ID: 231
Cluster: tillicum
User/Group: UWNetID/account
State: COMPLETED (exit code 0)
Nodes: 1
Cores per node: 32
CPU Utilized: 01:23:45
CPU Efficiency: 85.23% of 02:00:00 core-walltime
Memory Utilized: 150.00 GB
Memory Efficiency: 75.00% of 200.00 GB
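
If you don't remember a completed job's ID, the standard Slurm accounting command below lists your recent jobs; it assumes job accounting is enabled, which the seff output above implies for Tillicum.

sacct -u $USER --format=JobID,JobName,State,Elapsed,AllocTRES%40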