April 2024 Maintenance Details

Kristen Finch

HPC Staff Scientist

Hello HYAK Community,

Thank you for your patience this month while we took more scheduled downtime than usual to allow for electrical reconfiguration work in the UW Tower data center. We understand how disruptive this work has been in recent weeks. Please keep in mind that this work by the data center team has been critical: it allows the facility to increase the power available to the cluster and provides capacity for future growth, the lack of which had been limiting deployment of new equipment in recent months.

The HYAK team was able to use the interruption to implement the following changes:

  • Increased checkpoint (--partition=ckpt) runtime for GPU jobs from 4-5 hours to 8-9 hours (pre-emption for requeuing will still occur, subject to cluster utilization). Please see the updated documentation page for information about using idle resources; a minimal job-script sketch follows this list.
  • The NVIDIA driver has been updated for all GPUs.
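
For reference, a minimal checkpoint GPU job script might look like the sketch below. This is only a sketch: the account name, resource sizes, and file name are illustrative, so check the idle-resources documentation for the authoritative options.

$ cat ckpt_gpu_example.slurm
#!/bin/bash
# Account name and resource sizes below are illustrative; use your own group's account.
#SBATCH --job-name=ckpt-gpu-test
#SBATCH --partition=ckpt
#SBATCH --account=mylab
#SBATCH --gpus=1
#SBATCH --cpus-per-task=4
#SBATCH --mem=32G
# Requested wall time; pre-emption and requeue can still occur before it elapses.
#SBATCH --time=8:00:00

nvidia-smi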

Our next scheduled maintenance will be Tuesday May 14, 2024.

Training Opportunities#

Follow the NSF ACCESS Training and Events postings HERE to find online webinars about containers, parallel computing, using GPUs, and more from HPC providers around the USA.

Questions?#

If you have any questions for us, please reach out to the team by emailing help@uw.edu with Hyak in the subject line.

AI Research Needs Survey

Kristen Finch

HPC Staff Scientist

Hello HYAK Community,

The Research Working Group of the UW AI Task Force would like faculty and research staff input on the needs and challenges of using AI in research at UW across a broad spectrum of disciplines. Please help by responding to a short survey at: https://forms.gle/mZrV3aCgJYNNBV6j8. Responses by April 25 would be most helpful, but the survey will remain open until April 30.

Thank you in advance for your time,

HYAK Team

Disk Storage Management with Conda

Kristen Finch

HPC Staff Scientist

Hello HYAK Users,

It has come to our attention that Miniconda's default configuration, which places conda environments and the package cache in the user's home directory, leads to hitting storage limits and the dreaded error Disk quota exceeded. We thought we would take some time to guide users in configuring their conda environment directories and package caches to avoid this error and proceed with their research computing.

Conda's config#

Software is usually accompanied by a configuration file (aka "config file"), a text file used to store configuration data for the application. It typically contains parameters and settings that dictate how the software behaves and interacts with its environment. Familiarity with config files allows for efficient troubleshooting, optimization, and adaptation of software to specific environments, like HYAK's shared HPC environment, enhancing overall usability and performance. Conda's config file, .condarc, is customizable and lets you determine where conda stores packages and environments.

Understanding your Conda#

First let's take a look at your conda settings. The conda info command provides information about the current conda installation and its configuration.

note

The following assumes you have already installed Miniconda in your home directory or elsewhere such that conda is in your $PATH. Miniconda installation instructions are here.

$ conda info
active environment : None
shell level : 0
user config file : /mmfs1/home/UWNetID/.condarc
populated config files : /mmfs1/home/UWNetID/.condarc
conda version : 4.14.0
conda-build version : not installed
python version : 3.9.5.final.0
virtual packages : __linux=4.18.0=0
__glibc=2.28=0
__unix=0=0
__archspec=1=x86_64
base environment : /mmfs1/home/UWNetID/miniconda3 (writable)
conda av data dir : /mmfs1/home/UWNetID/miniconda3/etc/conda
conda av metadata url : None
channel URLs : https://conda.anaconda.org/conda-forge/linux-64
. . .
package cache : /mmfs1/home/UWNetID/conda_pkgs
envs directories : /mmfs1/home/UWNetID/miniconda3/envs
platform : linux-64
user-agent : conda/4.14.0 requests/2.26.0 CPython/3.9.5 Linux/4.18.0-513.18.1.el8_9.x86_64 rocky/8.9 glibc/2.28
UID:GID : 1209843:226269
netrc file : None
offline mode : False

The paths shown above will contain your username in place of UWNetID. Notice the lines showing the absolute path to your config file in your home directory (e.g., /mmfs1/home/UWNetID/.condarc), the directory designated for your package cache (e.g., /mmfs1/home/UWNetID/conda_pkgs), and the directory or directories designated for your environments (e.g., /mmfs1/home/UWNetID/miniconda3/envs). Conda designates directories for your package cache and your environments by default, but under HYAK, your home directory has a 10G storage limit, which can quickly be maxed out by package tarballs and their contents. We can change the location of your package cache and your environments to avoid this.

tip

When you ls your home directory (ls /mmfs1/home/UWNetID/) you might not see .condarc listed. It is there! To list all hidden files (files beginning with .), use ls -a /mmfs1/home/UWNetID/.
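
To gauge how much of your 10G home quota conda is currently consuming before relocating anything, you can size the default directories reported by conda info. The paths below match the example output above; substitute the paths your own conda info reports.

$ du -sh /mmfs1/home/UWNetID/miniconda3/envs /mmfs1/home/UWNetID/conda_pkgs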

Configuring your package cache and envs directories#

Edit the envs_dirs and pkgs_dirs lines in .condarc to designate directories with higher storage quotas. Use an editor preloaded on HYAK, like nano or vim, to edit .condarc in place. More about nano. More about vim. Your .condarc will look like this:

$ nano ~/.condarc
channels:
- conda-forge
- bioconda
- defaults
auto_activate_base: true
envs_dirs:
- /mmfs1/home/UWNetID/miniconda3/envs
pkgs_dirs:
- /mmfs1/home/UWNetID/conda_pkgs

In this exercise, we will assign our envs_dirs and pkgs_dirs directories to directories in /gscratch/scrubbed/ where we have more storage, although remember scrubbed storage is temporary. Alternatively, your lab/research group might have another directory in /gscratch/ that can be used.

important

Remember to replace the word UWNetID in the paths below with YOUR username/UWNetID.

Here is what your edited .condarc should look like.

$ cat /mmfs1/home/UWNetID/.condarc
channels:
- conda-forge
- bioconda
- defaults
auto_activate_base: true
envs_dirs:
- /gscratch/scrubbed/UWNetID/envs
pkgs_dirs:
- /gscratch/scrubbed/UWNetID/conda_pkgs

warning

If you don't have a directory under your UWNetID in /gscratch/scrubbed/ (or wherever you intend to designate these directories), you will need to create it now for this to work. Use the mkdir command, for example mkdir /gscratch/scrubbed/UWNetID, replacing UWNetID with your username. Then create directories for your package cache and envs directory, for example, mkdir /gscratch/scrubbed/UWNetID/conda_pkgs and mkdir /gscratch/scrubbed/UWNetID/envs.
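
A single mkdir -p invocation also creates any missing parent directories, so both targets can be made in one step (same illustrative paths as above):

$ mkdir -p /gscratch/scrubbed/UWNetID/conda_pkgs /gscratch/scrubbed/UWNetID/envs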

After .condarc is edited, we can use conda info to see if our changes have been incorporated.

$ conda info |grep cache
/gscratch/scrubbed/UWNetID/conda_pkgs
$ conda info |grep envs
/gscratch/scrubbed/UWNetID/envs
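
Alternatively, if you prefer not to edit .condarc by hand, conda config can append these settings for you; the result should be equivalent to the edits shown above (same illustrative paths):

$ conda config --add envs_dirs /gscratch/scrubbed/UWNetID/envs
$ conda config --add pkgs_dirs /gscratch/scrubbed/UWNetID/conda_pkgs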

Cleaning up disk storage#

After you have reset the package cache and environment directories with your conda config file, you can delete the previous directories to free up storage. Before doing that, you can check how much storage is occupied by each item in your home directory with the command du -h --max-depth=1. Remove the directories previously used for the package cache and environments recursively with rm -r. The following is an example of checking storage and removing directories.

warning

rm -r is permanent. We cannot recover your directory. You were warned.

$ du -h --max-depth=1 /mmfs1/home/UWNetID/
6.7G ./miniconda3
4.0G ./conda_pkgs
. . .
$ rm -r /mmfs1/home/UWNetID/miniconda3/envs
$ du -h --max-depth=1 /mmfs1/home/UWNetID/
2.6G ./miniconda3
4.0G ./conda_pkgs
. . .

note

The hyakstorage command is not updated in real time. Although you have cleaned up your home directory, hyakstorage might not yet show the new storage estimates. du -sh will give you the most up-to-date information.

Storage can also be managed by cleaning up the package cache periodically. After your conda packages have been installed, get rid of the large tar archives with conda clean --all.
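
If you would like to preview what will be deleted before committing, conda clean supports a dry run. A minimal example:

$ conda clean --all --dry-run
$ conda clean --all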

Lastly, regular maintenance of conda environments is crucial for keeping disk usage in check. Review your list of conda environments with conda env list and remove unused environments using the conda remove --name ENV_NAME --all command. Consider creating lightweight environments by installing only necessary packages to conserve disk space. For example, create an environment for each project (project1_env) rather than an environment for all projects combined (myenv).
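
To illustrate, list your environments and then delete one you no longer need (the environment name below is illustrative):

$ conda env list
$ conda remove --name old_project_env --all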

Disk quota STILL exceeded#

Be aware that many software packages are configured similarly to conda. Explore the documentation of your software to locate the configuration file and anticipate where storage limitations might become an issue. In some cases, you may need to edit or create a config file for the software to use. pip and R are two other common offenders that balloon disk storage in your home directory.

Configuring pip#

If you are installing with pip, you might have a pip cache in ~/.cache/pip. Let's locate your pip config file; pip config list -v prints the file path pip will try for each variant (user, global, site). You might have to activate a previously built conda environment to do this. For this exercise we will use an environment called project1_env.

$ conda activate project1_env
(project1_env) $ pip config list -v
. . .
For variant 'user', will try loading '/mmfs1/home/UWNetID/.pip/pip.conf'
. . .

The message "will try loading," rather than a listing of the contents of pip.conf, means that a pip config file has not yet been created. We will create our config file and set our pip cache. Create a directory in your home directory (e.g., /mmfs1/home/UWNetID/.pip) to hold your pip config file and create a file called pip.conf with the touch command. Remember to also create the new directory for your new pip cache if you haven't yet.

$ mkdir /mmfs1/home/UWNetID/.pip/
$ touch /mmfs1/home/UWNetID/.pip/pip.conf
$ mkdir /gscratch/scrubbed/UWNetID/pip_cache

Open pip.conf with nano or vim and add the following lines to designate the location of your pip cache.

[global]
cache-dir=/gscratch/scrubbed/UWNetID/pip_cache

Check that your pip cache has been designated.

(project1_env) $ pip config list
global.cache-dir='/gscratch/scrubbed/UWNetID/pip_cache'
(project1_env) $ pip cache dir
/gscratch/scrubbed/UWNetID/pip_cache
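
As an alternative to creating pip.conf by hand, pip can write the setting for you with pip config set. Note that, without additional options, pip may place its user-level config at ~/.config/pip/pip.conf rather than ~/.pip/pip.conf, but the effect on the cache location is the same (same illustrative path):

(project1_env) $ pip config set global.cache-dir /gscratch/scrubbed/UWNetID/pip_cache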

Configuring R#

We previously covered this in our documentation. Edit or create a config file called .Renviron in your home directory. Use nano or vim to designate the location of your R package libraries. The contents of the file should be something like the following example.

$ cat ~/.Renviron
R_LIBS="/gscratch/scrubbed/UWNetID/R/"

The directory designated by R_LIBS will be where R installs your package libraries.
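
To confirm the change, make sure the directory exists and then ask R for its library search path; the directory set in R_LIBS should appear first in the output. The path is illustrative, and this assumes R/Rscript is available in your current environment.

$ mkdir -p /gscratch/scrubbed/UWNetID/R
$ Rscript -e '.libPaths()'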

I'm still stuck#

Please reach out to us by emailing help@uw.edu with "hyak" in the subject line to open a help ticket.

March 2024 Maintenance Details

Kristen Finch

HPC Staff Scientist

Hello HYAK Users,

For our March maintenance we had some notable changes we wanted to share with the community.

Login Node#

Over the last several months the login node has been crashing on occasion. We have been monitoring and dissecting the kernel dumps from each crash and this behavior seems to be highly correlated with VS Code Remote-SSH extension activity. To prevent node instability, we have upgraded the storage drivers to the latest version. If you are a VS Code user and connect to klone via Remote-SSH, we have some recommendations to help limit the possibility that your work would cause system instability on the login node.

Responsible Usage of VS Code Extension Remote-SSH#

While developing your code with connectivity to the server is a great use of our services, connecting directly to the login node via the Remote-SSH extension results in VS Code server processes running silently in the background, leading to node instability. As a reminder, we prohibit users from running processes on the login node.

New Documentation

The steps discussed here for responsible use of VS Code have been added to our documentation. Please review the solutions for connecting VS Code to HYAK.

  1. Check which processes are running on the login node, especially if you have been receiving klone usage violations when you are not aware of jobs running. Look for vscode-server among the listed processes.

    $ ps aux | grep UWNetID
  2. If you need to develop your code with connectivity to VS Code, use a ProxyJump to open a connection directly to a compute node (Step 1 documentation), and then use the Remote-SSH extension to connect to that node through VS Code on your local machine, preserving the login node for the rest of the community (Step 2 documentation). A sketch of the SSH configuration appears after this list.

  3. Lastly, VS Code's high resource usage is due to it silently installing its built-in features into the user's home directory (~/.vscode on klone) to enable intelligent autocomplete. This is a well-known issue, and there is a solution that involves disabling the @builtin TypeScript plugin in VS Code on your local machine. Here is a link to a blog post about the issue and the super-easy solution. Disabling @builtin TypeScript will reduce your usage of shared resources and avoid problems.
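
For orientation, the ProxyJump arrangement from step 2 usually comes down to an entry in the SSH config on your local machine along these lines. This is a sketch only: the compute node name is illustrative and must match the node your interactive job was actually assigned, and the linked Step 1 and Step 2 documentation remains the authoritative walkthrough.

$ cat ~/.ssh/config
Host klone
    HostName klone.hyak.uw.edu
    User UWNetID

Host klone-node
    HostName n3000
    User UWNetID
    ProxyJump klone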

In addition to the upgrade of the storage driver, we performed updates to security packages.

Training Opportunities#

We wanted to make you aware of two training opportunities with the San Diego Supercomputer Center. If you are interested in picking up some additional skills and experience in HPC, check this blog post.

Questions?#

If you have any questions for us, please reach out to the team by emailing help@uw.edu with Hyak in the subject line.

Upcoming HPC Training Opportunities

Kristen Finch

HPC Staff Scientist

Hello HYAK Community!

We wanted to make you aware of two training opportunities with the San Diego Supercomputer Center (SDSC). If you are interested in picking up some additional skills and experience in HPC, please check them out.

  • SDSC Cyberinfrastructure-Enabled Machine Learning (CIML) Summer Institute: The project is focused on teaching researchers and students the best practices for effectively running machine learning (ML) and data science applications on advanced cyberinfrastructure (CI) and high-performance computing (HPC) systems. Applications due 12 April 2024. https://www.sdsc.edu/education_and_training/ciml_summer_institute.html

  • SDSC HPC and Data Science Summer Institute: The program is aimed at researchers in academia and industry, especially in domains not traditionally engaged in supercomputing, who have problems that cannot typically be solved using local computing resources. Applications due 26 April 2024. https://www.sdsc.edu/education_and_training/summer_institute.html

  • SDSC Virtual Workshop, COMPLECS: Batch Computing: Getting Started with Batch Job Scheduling - Slurm Edition: Learn how to use Slurm, Hyak's batch job scheduler. "In our series on Batch Computing, we will introduce you to the concept of a distributed batch job scheduler — what they are, why they exist, and how they work — using the Slurm Workload Manager as our reference implementation and testbed. You will then learn how to write your first job script and submit it to an HPC System running Slurm as its scheduler. We will also discuss the best practices for how to structure your batch job scripts, teach you how to leverage Slurm environment variables, and provide tips on how to request resources from the scheduler to get your work done faster." Event held virtually on Thursday, March 21, 2024, 11:00 AM - 12:30 PM PDT. Link to Registration

Keep an eye on our blog for more opportunities and HYAK updates.

If you have any questions, please reach out to the team by emailing help@uw.edu and be sure to mention Hyak in the subject line. Thanks!

February 2024 Maintenance Details

Nam Pho

Director for Research Computing

Hello HYAK community! We have a few notable announcements regarding this month’s maintenance. If the hyak-users mailing list e-mail didn’t fully satisfy your curiosity, hopefully this expanded version will answer any lingering questions.

GPUs#

  • Software: The GPU driver was upgraded to the latest stable version (545.29.06). The latest CUDA 12.3.2 is also now provided as a module. You are also encouraged to explore container-based (i.e., Apptainer) workflows, which bundle various versions of CUDA with your software of interest (e.g., PyTorch) over at NGC. NOTE: Be sure to pass the --nv flag to Apptainer when working with GPUs; see the sketch after this list.

  • Hardware: The HYAK team has also begun the early deployments of our first Genoa-Ada GPU nodes. These are cutting-edge NVIDIA L40-based GPUs (code named “Ada”) running on the latest AMD processors (code named “Genoa”) with 64 GPUs released to their groups two weeks ago and an additional 16 GPUs to be released later this week. These new resources are not currently part of the checkpoint partition but we will be releasing guidance on making use of idle resources here over the coming weeks directly to the HYAK user documentation as we receive feedback from these initial researchers.
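
As a rough sketch of the container workflow mentioned in the Software bullet above, the commands below pull a CUDA-enabled PyTorch image from NGC and run it with GPU access on a GPU node; the image tag is illustrative, so pick whichever NGC tag matches your needs.

$ apptainer pull pytorch.sif docker://nvcr.io/nvidia/pytorch:24.01-py3
$ apptainer exec --nv pytorch.sif python -c "import torch; print(torch.cuda.is_available())"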

Storage#

  • Performance Upgrade: In recent weeks, AI/ML workloads have been increasingly stressing the primary storage on KLONE (i.e., "gscratch"). Part of this was attributed to the run-up to the International Conference on Machine Learning (ICML) 2024 full paper deadline on Friday, February 2. However, it also reflects a broader trend in the increasing demands of data-intensive research. The IO profile was so heavy at times that our systems automation throttled the checkpoint capacity to near 0 in order to keep storage performance up and prioritize general cluster navigation and contributed resources. We have an internal tool called iopsaver that automatically reduces IOPS by intelligently requeuing the checkpoint jobs generating the highest IOPS, while concurrently limiting the total number of active checkpoint jobs until the overall storage is within its operating capacity. At times over the past few weeks you may have noticed that iopsaver had reduced the checkpoint job capacity to near 0 to maintain overall storage usability.

    During today’s maintenance, we upgraded the memory on existing storage servers so that we could enable Local Read-Only Cache (LROC), although we don’t anticipate it will be live until tomorrow. Once enabled, LROC allows the storage cluster to make use of previously idle SSD capacity to cache frequently accessed files on this more performant storage tier. We expect LROC to make a big difference, as the majority of the IO bottlenecking over the last several weeks was attributed to a high volume of read operations. As always, we will continue to monitor developments and adjust our policies and solutions accordingly to benefit the most researchers and users of HYAK.

  • Scrubbed Policy: In the recent past this space has filled up. As a reminder, this is a free-for-all space and a communal resource for when you temporarily need to burst past your usual allocations from your other group affiliations. To ensure greater equity in its use, we have instituted a limit of 10TB and 10M files per user in scrubbed. This impacts <1% of users, as only a handful of users were using more than 10TB of scrubbed quota.

Questions?#

Hopefully you found these extra details informative. If you have any questions for us, please reach out to the team by emailing help@uw.edu with Hyak somewhere in the subject or body. Thanks!

Dataset policy & guidelines

Michael Wanek

HPC Engineer

Some context on /gscratch/data#

The Klone Data Commons is our cluster-wide, shared dataset storage located at /gscratch/data.

Historically, we've addressed requests to add datasets to the Commons on a case-by-case basis. We've seen a growing number of these types of requests over the past few weeks, so we thought we should make the guidelines clear. That's the purpose of this blog post today, as well as the new Data Commons documentation section here.

Requirements#

In order for a dataset to be approved, the following criteria must be met:

  1. The requester must create a new page of documentation, and submit a pull request, describing the dataset:

    • A full description of the dataset, publication date, licenses, etc.
    • Instructions for using the dataset, i.e. any required modules, the structure of the data, etc.
    • Contact information for dataset maintainers (typically, the group/user submitting the request) and the intended audience or discipline of the data.
  2. The requester must name a minimum of 3 separate groups/labs & 3 specific users who will be using the data.

  3. The requester emails help@uw.edu with:

    • A link to the documentation PR.
    • The following people CC'd: the lab/group owners & all initial users. This will be at least 6 people.
  4. Every person included in the request (again, at least 6), must individually attest that the dataset has been vetted: that, to the best of their knowledge, the dataset contains no material where its download/storage/use violates any State or Federal law and/or the rules/policies of UW, including intellectual property laws.

Questions?#

Hopefully this clears up our expectations going forward. If you have any questions for us, please reach out to the team by emailing help@uw.edu with Hyak somewhere in the subject or body. Thanks!

The State of Enterprise Linux

Michael Wanek

HPC Engineer

Rocky Linux#

Hyak uses Rocky Linux on our compute nodes, our login nodes, and our backend nodes. We switched from CentOS to Rocky in early 2022, after Red Hat permanently ended CentOS development, and we wrote a blog post about our transition here.

Operating system migrations are difficult, and we were hoping to use Rocky for as long as possible. Now—less than two years after our change—Red Hat has thrown another curveball at the open-source enterprise Linux community.

The Latest from Red Hat#

You can read the update about Red Hat source code here: Furthering the evolution of CentOS Stream.

Our team doesn't have any position on these changes, but we understand the implication: this may make downstream, bug-for-bug compatible Linuxes—like Rocky—more difficult to maintain.

You can read Rocky Linux's official response here: Rocky Linux Expresses Confidence Despite Red Hat's Announcement.

The Rocky Linux team's confidence belies the significance of Red Hat's change. Red Hat's blog post, a mere 318 words, sparked colossal action. Multiple corporations, including tech giants SUSE and Oracle, joined forces to establish a collaborative trade association: OpenELA, the Open Enterprise Linux Association.

You can read the OpenELA announcement here: CIQ, Oracle and SUSE Create Open Enterprise Linux Association for a Collaborative and Open Future.

What this means for Hyak#

It's too early to tell how this will impact Hyak. Rocky Linux may continue to be the de facto standard for stable, OSS Enterprise Linux. It's possible some flavor of SUSE takes the lead, like openSUSE Leap. We also need to see what will come out of OpenELA.

Our plan is for Hyak to remain on Rocky Linux: we will let you know if & when anything changes.

Hyak Huskies at ICML 2023

Michael Wanek

HPC Engineer

We were delighted to see so many Huskies in attendance at the Fortieth International Conference on Machine Learning, which took place at the end of July. The researchers using Hyak are doing incredible work, and we wanted to say congratulations to those who had their papers accepted:

Also, special congratulations to those with accepted shadow papers:

Can I put this on my CV?

August maintenance completed

Michael Wanek

HPC Engineer

August's scheduled maintenance is complete and the Hyak clusters have resumed normal operations: logins have been reenabled & jobs are already running.

This month's maintenance actions were our standard fare: node image and firmware updates. We keep our maintenance all-clear emails as brief as possible, but here's the rundown:

Node image updates#

Our compute nodes are stateless: their operating system is loaded into memory over the network, so we keep the node images as small as possible. This means that when we update the images, we're actually rebuilding them from scratch. All the operating system packages we include in our template are installed as their latest versions.

Any software on the node image beyond system packages is managed separately, which brings me to the only major update this month:

We upgraded Apptainer from 1.1.8 to 1.2.2. The update from 1.1 to 1.2 implements quite a few new features, modifications to default behavior, and other changes. You can read about them in the Apptainer 1.2.0 Patch Notes.

Node firmware updates#

Since firmware updates shouldn't impact cluster users, we normally don't even mention them. That said, this was the main part of our work today. We updated the firmware (including BIOS & BMCs) for our backend nodes, login nodes, and all 400+ compute nodes.