Start Here
GPUs offer significant performance enhancements for computationally intensive tasks. GPU cores are designed for parallel computation, making them a useful tool for training machine learning models, running molecular dynamics simulations, and data mining. Unlike CPUs, which excel at sequential tasks, GPUs can handle large numbers of simultaneous operations.
GPU Jobs
You can view the available GPUs on Hyak with the sinfo -s command. To view which GPUs are available on the ckpt partition, use:
sinfo -p ckpt-all -O nodehost,cpusstate,freemem,gres,gresused -S nodehost | grep -v null
GPU Jobs on Checkpoint
A GPU job can be requested from ckpt by specifying the type and number of GPUs to allocate with the --gpus-per-node flag:
salloc --partition=ckpt-all --gpus-per-node=2080ti:1 --mem=10G --time=2:00:00
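The same request can also be submitted as a batch job. The sketch below mirrors the salloc example above as a job script; the job name and the final command are illustrative assumptions, and note that jobs on the checkpoint partition may be preempted and requeued when resource owners reclaim their nodes:

```shell
#!/bin/bash
#SBATCH --job-name=gpu-test          # illustrative name (assumption)
#SBATCH --partition=ckpt-all         # checkpoint partition, as above
#SBATCH --gpus-per-node=2080ti:1     # one 2080 Ti, matching the salloc example
#SBATCH --mem=10G
#SBATCH --time=2:00:00

# Print the GPU(s) Slurm assigned to this job.
nvidia-smi
```

Submit the script with sbatch, e.g. `sbatch myjob.slurm`.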
GPU Jobs on a Specific GPU Partition
If you have a GPU partition, you can start an interactive session on a GPU node by using the following command:
salloc --account=account --partition=gpu-rtx6k --gpus=1 --mem=10G --time=2:00:00
# Replace the account and partition flags to match your account and partitions.
If you are unsure whether your accounts have GPU partitions, use the hyakalloc command to see all of your available resources. A detailed walkthrough for requesting a GPU job can be found HERE.
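Once an interactive session starts, you can confirm that a GPU was actually allocated. A minimal check, assuming the session landed on a GPU node:

```shell
# Inside the salloc session: list the GPU(s) visible to this job.
nvidia-smi

# Slurm also exports the device indices assigned to the job:
echo $CUDA_VISIBLE_DEVICES
```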
You now know how to view all GPUs supported on Hyak with the sinfo -s command. Additional information about each GPU is listed below:
L40 and L40s: 48GB of GDDR6 memory per GPU card
A40: 48GB of GDDR6 memory per GPU card
2080 Ti: 11GB of GDDR6 memory per GPU card
Titan: 24GB of GDDR6 memory per GPU card
RTX6k: 48GB of GDDR6 memory per GPU card
A100: 40GB of HBM2 memory per GPU card
P100: 16GB of HBM2 memory per GPU card
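If your workload needs a particular amount of GPU memory, you can request a model by name in the same way as the earlier examples. A hedged sketch requesting one 48GB A40; the account and partition names are placeholders to replace with your own (as reported by hyakalloc):

```shell
salloc --account=account --partition=gpu-a40 --gpus=a40:1 --mem=10G --time=2:00:00
```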
The next section aims to provide additional context on GPUs and the NVIDIA NGC containers used to train LLMs.