Scheduling Jobs

sinfo -s

PARTITION        AVAIL  TIMELIMIT   NODES(A/I/O/T) NODELIST
compute-bigmem      up   infinite        28/0/0/28 n[3008-3011,3064,3066,3132-3133,3190,3244-3247,3252-3255,3353-3355,3400-3407]
ckpt                up   infinite   327/173/10/510 g[3001-3007,3010-3017,3020-3027,3030-3037,3040-3047,3050-3057,3060-3067,3070-3077,3080-3085],n[3000-3431],z[3001-3002,3005-3011]
ckpt-all            up   infinite   368/195/10/573 g[3001-3007,3010-3017,3020-3027,3030-3037,3040-3047,3050-3057,3060-3067,3070-3077,3080-3085,3090-3122],n[3000-3461],z[3001-3002,3005-3011]
ckpt-g2             up   infinite       41/22/0/63 g[3090-3122],n[3432-3461]
compute*            up   infinite    197/170/7/374 n[3012-3015,3024-3063,3068-3131,3134-3189,3191-3239,3248-3251,3256-3299,3304-3352,3356-3363,3368-3399,3408-3431]
cpu-g2              up   infinite        7/19/0/26 n[3432-3438,3440,3442,3444,3446-3461]
cpu-g2-mem2x        up   infinite          2/2/0/4 n[3439,3441,3443,3445]
gpu-2080ti          up   infinite        10/1/2/13 g[3001-3007,3014-3017,3027],z3001
gpu-a100            up   infinite          8/0/0/8 g[3080-3085],z[3010-3011]
gpu-a40             up   infinite        32/0/0/32 g[3040-3047,3050-3057,3060-3067,3070-3077]
gpu-l40             up   infinite        15/0/0/15 g[3090-3099,3115-3119]
gpu-l40s            up   infinite        17/1/0/18 g[3100-3114,3120-3122]
gpu-p100            up   infinite          2/0/0/2 z[3005-3006]
gpu-rtx6k           up   infinite        19/0/0/19 g[3010-3013,3020-3026,3030-3037]
gpu-titan           up   infinite          1/0/0/1 z3002
compute-hugemem     up   infinite        27/2/1/30 n[3000-3007,3016-3023,3065,3067,3240-3243,3300-3303,3364-3367]
compute-ultramem    up   infinite          3/0/0/3 z[3007-3009]

# Below replace the word account with an account name you belong to 
# Use hyakalloc to see your accounts and partitions
salloc -A account -p ckpt-all -N 1 -c 4 --mem=10G --time=2:30:00

[netID@klone1 ~]$ salloc -N 2 -p compute -A stf --time=5 --mem=5G
salloc: Pending job allocation 2620960
salloc: job 2620960 queued and waiting for resources
salloc: job 2620960 has been allocated resources
salloc: Granted job allocation 2620960
salloc: Waiting for resource configuration
salloc: Nodes n[3148-3149] are ready for job
[netID@n3148 ~]$ srun hostname
n3148
n3149
[netID@n3148 ~]$ scontrol show hostnames
n3148
n3149

salloc -p <partition_name>-int -A <group_name> --time=<time> --mem=<size>G

#!/bin/bash

#SBATCH --job-name=<name>
#SBATCH --mail-type=<status>
#SBATCH --mail-user=<email>

#SBATCH --account=<lab>
#SBATCH --partition=<node_type>
#SBATCH --nodes=<num_nodes>
#SBATCH --ntasks-per-node=<cores_per_node>
#SBATCH --mem=<size[unit]>
#SBATCH --gpus=<type:quantity> 
#SBATCH --time=<time> # Max runtime in DD-HH:MM:SS format.

#SBATCH --chdir=<working directory>
#SBATCH --export=all
#SBATCH --output=<file> # where STDOUT goes
#SBATCH --error=<file> # where STDERR goes

# Modules to use (optional).
<e.g., module load apptainer>

# Your programs to run.
<my_programs>

SBATCH --nodes=4

SBATCH --ntasks-per-node=40

Arguments	Command Flags	Notes
Account	`-A` or `--account`	What lab are you part of? If you run the `groups` command you can see what groups (usually labs) you're a member of, these are associated with resource limits on the cluster. See the accounts section for additional information.
Partition	`-p` or `--partition`	What resource partition are you interested in using? This could be anything you see when you run `sinfo -s` as each partition corresponds to a class of nodes (e.g., high memory, GPU). See the partitions section for additional information.
Nodes	`-N` or `--nodes`	How many nodes are these resources spread across? In the overwhelming number of cases this is 1 (for a single node) but more sophisticated multi-node jobs could be run if your code supports it.
Cores	`-c` or `--cpus-per-task`	How many compute cores do you need? Not all codes can make use of multiple cores and if they do, the performance of the code is not always linear with the resources requested. If in doubt consider contacting the research computing team to assist in this optimization.
Memory	`--mem`	How much memory do you need for this job? This is in the format `size[units]` were size is a number and units are either `M`, `G`, or `T` for megabyte, gigabyte, and terabyte respectively. Megabyte is the default unit if none is provided.
Time	`-t` or `--time`	What's the maximum runtime for this job? Common acceptable time formats include `hours:minutes:seconds`, `days-hours`, and `minutes`.

Scheduling Jobs

Compute Resources

Job Types

Slurm Arguments

Interactive Jobs (Single Node)

Interactive Jobs (Multi Node)

Interactive Node Partitions

Slurm Environment Variables

Batch Jobs

Single Node Batch Jobs

Multiple Node Batch Jobs

Common Slurm Error Messages

Utility Commands

Man Pages

References

Compute Resources​

Job Types​

Slurm Arguments​

Interactive Jobs (Single Node)​

Interactive Jobs (Multi Node)​

Interactive Node Partitions​

Slurm Environment Variables​

Batch Jobs​

Single Node Batch Jobs​

Multiple Node Batch Jobs​

Common Slurm Error Messages​

Utility Commands​

Man Pages​

References​