March 2024 Maintenance Details

Kristen Finch

HPC Staff Scientist

Hello HYAK Users,

For our March maintenance we had some notable changes we wanted to share with the community.

Login Node#

Over the last several months, the login node has been crashing on occasion. We have been monitoring and dissecting the kernel dumps from each crash, and this behavior appears to be highly correlated with VS Code Remote-SSH extension activity. To prevent node instability, we have upgraded the storage drivers to the latest version. If you are a VS Code user and connect to klone via Remote-SSH, we have some recommendations to help limit the possibility that your work causes system instability on the login node.

Responsible Usage of VS Code Extension Remote-SSH#

While developing your code with connectivity to the server is a great use of our services, connecting directly to the login node via the Remote-SSH extension results in VS Code server processes running silently in the background, leading to node instability. As a reminder, we prohibit users from running processes on the login node.

  1. Check which processes are running on the login node, especially if you have been receiving klone usage violations when you are not aware of jobs running. Look for vscode-server among the listed processes.

    $ ps aux | grep UWNetID
  2. If you need to develop your code with connectivity to VS Code, use a ProxyJump to open a connection directly to a compute node (Step 1 documentation), and then use the Remote-SSH extension to connect to that node through VS Code on your local machine, preserving the login node for the rest of the community (Step 2 documentation). A minimal sketch of this setup follows this list.

  3. Lastly, much of VS Code’s high usage comes from it silently installing its built-in features into the user's home directory (~/.vscode) on klone to enable intelligent autocomplete. This is a well-known issue, and there is a solution that involves disabling the @builtin TypeScript plugin in VS Code on your local machine. Here is a link to a blog post about the issue and the super-easy solution. Disabling @builtin TypeScript will reduce your usage of the shared resources and avoid problems.
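For step 2 above, here is a minimal sketch of a local ~/.ssh/config that Remote-SSH can use; the host aliases, the compute node name (n3088), and UWNetID are placeholders, and the actual node name comes from the allocation you open in Step 1 (see the linked documentation for the exact procedure):

    # ~/.ssh/config on your local machine (sketch; names are placeholders)
    Host klone-login
        HostName klone.hyak.uw.edu
        User UWNetID

    Host klone-node
        HostName n3088              # replace with the compute node Slurm allocated to you
        User UWNetID
        ProxyJump klone-login       # hop through the login node instead of working on it

In VS Code, point Remote-SSH at the klone-node entry so the vscode-server processes run on your allocated compute node rather than on the login node.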

In addition to the upgrade of the storage driver, we performed updates to security packages.

Training Opportunities#

We wanted to make you aware of two training opportunities with the San Diego Supercomputer Center. If you are interested in picking up some additional skills and experience in HPC, check this blog post.

Questions?#

If you have any questions for us, please reach out to the team by emailing help@uw.edu with Hyak in the subject line.

Upcoming HPC Training Opportunities

Kristen Finch

HPC Staff Scientist

Hello HYAK Community!

We wanted to make you aware of two training opportunities with the San Diego Supercomputer Center (SDSC). If you are interested in picking up some additional skills and experience in HPC, please check them out.

  • SDSC Cyberinfrastructure-Enabled Machine Learning (CIML) Summer Institute: The project is focused on teaching researchers and students the best practices for effectively running machine learning (ML) and data science applications on advanced cyberinfrastructure (CI) and high-performance computing (HPC) systems. Applications due 12 April 2024. https://www.sdsc.edu/education_and_training/ciml_summer_institute.html

  • SDSC HPC and Data Science Summer Institute: The program is aimed at researchers in academia and industry, especially in domains not traditionally engaged in supercomputing, who have problems that cannot typically be solved using local computing resources. Applications due 26 April 2024. https://www.sdsc.edu/education_and_training/summer_institute.html

  • SDSC Virtual Workshop, COMPLECS: Batch Computing: Getting Started with Batch Job Scheduling - Slurm Edition: Learn how to use Slurm, Hyak's batch job scheduler. "In our series on Batch Computing, we will introduce you to the concept of a distributed batch job scheduler — what they are, why they exist, and how they work — using the Slurm Workload Manager as our reference implementation and testbed. You will then learn how to write your first job script and submit it to an HPC System running Slurm as its scheduler. We will also discuss the best practices for how to structure your batch job scripts, teach you how to leverage Slurm environment variables, and provide tips on how to request resources from the scheduler to get your work done faster." Event held virtually on Thursday, March 21, 2024, 11:00 AM - 12:30 PM PDT. Link to Registration. If you have never written a Slurm job script before, a minimal sketch follows this list.
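Here is a minimal sketch of a Slurm job script and how to submit it; the account name is a placeholder for your own lab's account, and ckpt is Hyak's checkpoint partition (use whichever partition your group has access to):

    #!/bin/bash
    #SBATCH --job-name=hello
    #SBATCH --account=mylab          # placeholder: your lab's Slurm account
    #SBATCH --partition=ckpt         # placeholder: or a partition your lab contributes to
    #SBATCH --cpus-per-task=1
    #SBATCH --mem=4G
    #SBATCH --time=00:10:00

    echo "Hello from $(hostname)"

Save it as hello.slurm and submit it with:

    $ sbatch hello.slurm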

Keep an eye on our blog for more opportunities and HYAK updates.

If you have any questions, please reach out to the team by emailing help@uw.edu and be sure to mention Hyak in the subject line. Thanks!

February 2024 Maintenance Details

Nam Pho

Director for Research Computing

Hello HYAK community! We have a few notable announcements regarding this month’s maintenance. If the hyak-users mailing list e-mail didn’t fully satisfy your curiosity, hopefully this expanded version will answer any lingering questions.

GPUs#

  • Software: The GPU driver was upgraded to the latest stable version (545.29.06). The latest CUDA 12.3.2 is also now provided as a module. You are also encouraged to explore container-based (i.e., Apptainer) workflows, which bundle various versions of CUDA with your software of interest (e.g., PyTorch) over at NGC. NOTE: Be sure to pass the --nv flag to Apptainer when working with GPUs; a short example follows this list.

  • Hardware: The HYAK team has also begun the early deployments of our first Genoa-Ada GPU nodes. These are cutting-edge NVIDIA L40-based GPUs (code-named “Ada”) running on the latest AMD processors (code-named “Genoa”), with 64 GPUs released to their groups two weeks ago and an additional 16 GPUs to be released later this week. These new resources are not currently part of the checkpoint partition, but we will be releasing guidance on making use of idle resources over the coming weeks, directly in the HYAK user documentation, as we receive feedback from these initial researchers.
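As a hedged illustration of the container workflow mentioned in the Software item above, the following pulls a PyTorch image from NGC and runs it with GPU support; the image tag is only an example and may not match what NGC currently publishes, and this should be run from a GPU node rather than the login node:

    $ apptainer pull pytorch.sif docker://nvcr.io/nvidia/pytorch:24.01-py3    # example tag
    $ apptainer exec --nv pytorch.sif python -c "import torch; print(torch.cuda.is_available())"

The --nv flag is what exposes the node's GPUs and driver libraries inside the container.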

Storage#

  • Performance Upgrade: In recent weeks, AI/ML workloads have been increasingly stressing the primary storage on KLONE (i.e., "gscratch"). Part of this was attributed to the run up to the International Conference for Machine Learning (ICML) 2024 full paper deadline on Friday, February 2. However, it also reflects a broader trend in the increasing demands of data-intensive research. The IO profile was so heavy at times that our systems automation throttled the checkpoint capacity to near 0 in order to keep storage performance up and prioritize general cluster navigation and contributed resources. We have an internal tool called iopsaver that automatically reduces IOPS by intelligently requeuing checkpoint jobs generating the highest IOPS while concurrently limiting the number of total active checkpoint jobs until the overall storage is within its operating capacity. At times over the past few weeks you may have noticed that iopsaver had reduced the checkpoint job capacity to near 0 to maintain overall storage usability.

    During today’s maintenance, we upgraded the memory on existing storage servers so that we could enable Local Read-Only Cache (LROC), although we don’t anticipate it will be live until tomorrow. Once enabled, LROC allows the storage cluster to make use of previously idle SSD capacity to cache frequently accessed files on this more performant storage tier. We expect LROC to make a big difference because, over the last several weeks, the majority of the IO bottlenecking was attributed to a high volume of read operations. As always, we will continue to monitor developments and adjust our policies and solutions accordingly to benefit the most researchers and users of HYAK.

  • Scrubbed Policy: In the recent past this space has filled up. As a reminder, scrubbed is a free-for-all, communal resource for data you only need temporarily, beyond your usual allocations from your other group affiliations. To ensure greater equity in its use, we have instituted a limit of 10TB and 10M files for each user in scrubbed. This impacts <1% of users, as only a handful of users were using more than 10TB of scrubbed quota. A quick way to check your own usage is sketched below.
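If you want to check your own footprint in scrubbed, one option is our hyakstorage tool (covered in an earlier post on this blog); the fileset path below is an assumption, so adjust it to match what you see on klone:

    $ hyakstorage /mmfs1/gscratch/scrubbed --show-user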

Questions?#

Hopefully you found these extra details informative. If you have any questions for us, please reach out to the team by emailing help@uw.edu with Hyak somewhere in the subject or body. Thanks!

Dataset policy & guidelines

Michael Wanek

HPC Engineer

Some context on /gscratch/data#

The Klone Data Commons is our cluster-wide, shared dataset storage located at /gscratch/data.

Historically, we've addressed requests to add datasets to the Commons on a case-by-case basis. We've seen a growing number of these types of requests over the past few weeks, so we thought we should make the guidelines clear. That's the purpose of this blog post today, as well as the new Data Commons documentation section here.

Requirements#

In order for a dataset to be approved, the following criteria must be met:

  1. The requester must create a new page of documentation, and submit a pull request, describing the dataset:

    • A full description of the dataset, publication date, licenses, etc.
    • Instructions for using the dataset, i.e. any required modules, the structure of the data, etc.
    • Contact information for dataset maintainers (typically, the group/user submitting the request) and the intended audience or discipline of the data.
  2. The requester must name a minimum of 3 separate groups/labs & 3 specific users who will be using the data.

  3. The requester emails help@uw.edu with:

    • A link to the documentation PR.
    • The following people CC'd: the lab/group owners & all initial users. This will be at least 6 people.
  4. Every person included in the request (again, at least 6), must individually attest that the dataset has been vetted: that, to the best of their knowledge, the dataset contains no material where its download/storage/use violates any State or Federal law and/or the rules/policies of UW, including intellectual property laws.

Questions?#

Hopefully this clears up our expectations going forward. If you have any questions for us, please reach out to the team by emailing help@uw.edu with Hyak somewhere in the subject or body. Thanks!

The State of Enterprise Linux

Michael Wanek

HPC Engineer

Rocky Linux#

Hyak uses Rocky Linux on our compute nodes, our login nodes, and our backend nodes. We switched from CentOS to Rocky in early 2022, after Red Hat permanently ended CentOS development, and we wrote a blog post about our transition here.

Operating system migrations are difficult, and we were hoping to use Rocky for as long as possible. Now—less than two years after our change—Red Hat has thrown another curveball at the open-source enterprise Linux community.

The Latest from Red Hat#

You can read the update about Red Hat source code here: Furthering the evolution of CentOS Stream.

Our team doesn't have any position on these changes, but we understand the implication: this may make downstream, bug-for-bug compatible Linuxes—like Rocky—more difficult to maintain.

You can read Rocky Linux's official response here: Rocky Linux Expresses Confidence Despite Red Hat's Announcement.

The Rocky Linux team's confidence belies the significance of Red Hat's change. Red Hat's blog post—a mere 318 words—sparked colossal action. Multiple corporations, including tech giants SUSE and Oracle, joined forces to establish a collaborative trade association: OpenELA, the Open Enterprise Linux Association.

You can read the OpenELA announcement here: CIQ, Oracle and SUSE Create Open Enterprise Linux Association for a Collaborative and Open Future.

What this means for Hyak#

It's too early to tell how this will impact Hyak. Rocky Linux may continue to be the de facto standard for stable, OSS Enterprise Linux. It's possible some flavor of SUSE takes the lead, like openSUSE Leap. We also need to see what will come out of OpenELA.

Our plan is for Hyak to remain on Rocky Linux: we will let you know if & when anything changes.

Hyak Huskies at ICML 2023

Michael Wanek

HPC Engineer

We were delighted to see so many Huskies in attendance at the Fortieth International Conference on Machine Learning, which took place at the end of July. The researchers using Hyak are doing incredible work, and we wanted to say congratulations to those who had their papers accepted:

Also, special congratulations for those with accepted shadow papers:

Can I put this on my CV?

August maintenance completed

Michael Wanek

HPC Engineer

August's scheduled maintenance is complete and the Hyak clusters have resumed normal operations: logins have been reenabled & jobs are already running.

This month's maintenance actions were our standard fare: node image and firmware updates. We keep our maintenance all-clear emails as brief as possible, but here's the rundown:

Node image updates#

Our compute nodes are stateless: their operating system is loaded into memory over the network, so we keep the node images as small as possible. This means that when we update the images, we're actually rebuilding them from scratch. All the operating system packages we include in our template are installed as their latest versions.

Any software on the node image beyond system packages is managed separately, which brings me to the only major update this month:

We upgraded Apptainer from 1.1.8 to 1.2.2. The update from 1.1 to 1.2 implements quite a few new features, modifications to default behavior, and other changes. You can read about them in the Apptainer 1.2.0 Patch Notes.
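If you want to confirm which version a node is running after the maintenance, the standard CLI flag works:

    $ apptainer --version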

Node firmware updates#

Since firmware updates shouldn't impact cluster users, we normally don't even mention them. That said, this was the main part of our work today. We updated the firmware (including BIOS & BMCs) for our backend nodes, login nodes, and all 400+ compute nodes.

Update on the hyakstorage command

Nam Pho

Director for Research Computing

We’ve made an update to our storage accounting tool, hyakstorage, and with this update we are also phasing out usage_report.txt. That text file contained minimally parsed internal metrics of the storage cluster, and we found it caused as many questions as it answered. Moving forward, the hyakstorage tool will display only the four relevant pieces of information for each fileset you query: storage space used vs. the storage space limit, and the current number of files (inodes) vs. the maximum number of files.

The default operation (running hyakstorage with no arguments) will show your home directory & the gscratch directories you have access to, and it will only show the fileset totals & your contributions.

You can also specify which filesets you want to view, in a few different ways: you can use the flag --home to show your home directory, --gscratch to show your gscratch directories, and --contrib to show your group’s contrib directories. You can also specify an exact gscratch directory with the group name (e.g. hyakstorage stf), contrib directory (e.g. hyakstorage stf-src), or full path to a fileset (e.g. hyakstorage /mmfs1/gscratch/stf).

If you want more detailed metrics, you can use the flags --show-user or --show-group to break down the fileset totals by individual users or groups. Those detailed metrics can be sorted by space with --by-disk (the default) or by files with --by-files.
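Putting those flags together, a few illustrative invocations (stf is just the example group used above; substitute your own group):

    $ hyakstorage                                # default: home + gscratch fileset totals and your contributions
    $ hyakstorage --home --show-user             # home directory, broken down by user
    $ hyakstorage stf --show-user --by-files     # the stf gscratch fileset, sorted by file counts
    $ hyakstorage /mmfs1/gscratch/stf            # the same fileset, addressed by full path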


Terminology Reset

Nam Pho

Director for Research Computing
Note: There is no operational change; this is an administrative clarification of HYAK-specific terminology.

The HYAK community has grown substantially over the past year, including the administrative teams that work with us to support the service. Some terms (e.g., nodes, servers) have been loosely used in communication, but have specific meanings for different backend teams. Beginning today, we are harmonizing all the terms to alleviate any confusion between the different teams supporting HYAK and our end users. This is only a clarification of language: there is no change to how HYAK operates.

At the physical layer we have nodes or servers: the smallest individual physical units of the HPC cluster. These are what we, the HYAK engineering team, purchase from our vendor partners. Historically, a physical node or server was what a lab would purchase to join the cluster. However, since HYAK’s inception, resource density has increased to such an extent (servers with hundreds of CPU cores, hundreds of gigabytes of RAM, multiple graphics cards, etc.) that it no longer made sense to require labs to purchase an entire physical node.

Once we crossed a certain threshold, we began to offer labs an amount of computing resources (a specific number of CPU cores, amount of RAM, number of GPUs, etc.) rather than discrete servers, but we kept the node nomenclature. For a while now, when labs have purchased a "node," it has not meant they were purchasing a server, even though those words are interchangeable in many computing contexts. From today forward, we are restoring those terms to their original meanings for HYAK: one node is one physical server.

When labs join HYAK, they will not purchase a physical node, they will purchase a slice. A slice represents an amount of on-demand compute capacity–CPUs, RAM, GPUs, etc. Again, this is only a terminology clarification: HYAK has operated this way for a while. One of the benefits of this model is that slices–representing resources, not specific, physical pieces of hardware–make resource scheduling considerably easier for our cluster's scheduling software, Slurm. This efficiency is returned to the entire community both as depth of the checkpoint partition’s resources, and as faster scheduling for non-checkpoint jobs.

We've seen some users refer to this as "virtualization", and that is a misnomer. We want to emphasize that there is no hardware virtualization taking place here: your job will run on the bare-metal, physical resources you have requested from Slurm.
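As an illustration of what "an amount of on-demand compute capacity" looks like in practice, here is a hedged sketch of requesting resources from Slurm; the account name is a placeholder and the sizes are arbitrary:

    $ salloc --account=mylab --partition=ckpt --cpus-per-task=8 --mem=32G --time=2:00:00

Slurm then places those cores and that memory on whichever physical node(s) can satisfy the request.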

While this may seem like a minor change in language, it will greatly ease the coordination among many groups working behind the scenes to support the HYAK service. As always, we appreciate your understanding and patience as we continue to refine and improve the support provided.


HYAK Team Storage Optimizations

Nam Pho

Director for Research Computing
Note: The HYAK team has taken six concrete steps to stabilize and optimize storage on KLONE over the past few weeks.

While the storage on KLONE (i.e., mmfs1 or gscratch) may appear to be a monolithic device, it is an extremely complex cluster in its own right. This storage cluster is mounted on every KLONE node: so despite appearing as "on the node", gscratch physically resides on specialized storage hardware separated from the compute resources of KLONE. The storage is accessed across a high-speed, ultra low-latency HDR Infiniband network, and is designed to be scalable independent of KLONE’s compute resources.

As mentioned in an earlier blog post today, our incoming hardware expansion will drastically increase the amount of demand the storage cluster can handle. In the meantime, the Hyak team has taken measures to help maintain a usable level of storage performance for users and jobs:

1. Improved internal storage metrics gathering and visibility.#

[Figure: KLONE Slurm metrics]

The HYAK team improved storage-cluster metric gathering and visibility, allowing us to correlate those metrics to reports of poor user experience, and to make data-driven tuning and storage policy decisions.

The figure above gives us visibility into whether an abnormally high number of jobs are reporting errors that might suggest underlying storage or other user-experience issues.

2. Created custom filesystem migration policies to optimize the use of the NVMe layer.#

The bulk of the storage capacity on KLONE resides on rotary hard disk drives totalling approximately 1.7 petabytes (PB) of raw storage. In addition to the hard disk storage, there is a much smaller, extremely fast (and expensive) pool of NVMe "flash" storage that functions both as a write buffer for new files written to the filesystem and as a read-cache-like layer where files can be read without causing load on the rotary disks.

The HYAK team has also optimized the file placement policy: files most likely to generate heavy load reside in the limited space of the NVMe layer, ensuring that no storage load is generated on the hard disk layer when those files are repeatedly accessed.

[Figure: KLONE storage policy]

In the figure above you can see that the flash tier (green line) is allowed to fill up to 80% of capacity from job writes; at that point the migration policy kicks in and runs until the flash tier is back down to 65% full. For the majority of the past several weeks, things worked as expected. However, there were a few recent events where jobs were producing so much data that the flash tier reached 100% full faster than the storage system could move data off it. Giving the migration process too high a priority results in "slowness" in the user experience, so we have since been tuning the aggressiveness of this migration process to reduce the likelihood of it occurring again.

3. Added QoS policies to improve worst-case filesystem responsiveness.#

The KLONE filesystem has a coarse Quality-of-Service (QoS) tuning facility that allows the filesystem to cap the rate of storage operations for various types of storage input-output (IO). The HYAK team has used this facility in two different ways:

  1. First, to limit the storage load impact when the NVMe layer, described above, needs to free up space by moving files to the hard drive layer.

  2. Secondly, to moderate the amount of storage load that can be generated by any single compute node in the cluster. This way, outlier jobs in terms of storage load generation are less likely to have an outsized performance impact on the storage.

4. Manually identifying jobs causing a disproportionate impact on storage performance.#

[Figure: KLONE storage metrics]

Utilizing metrics and old-fashioned sleuthing, we have been manually tracking down individual jobs that appear to be having a disproportionate and/or unnecessary impact on storage performance, and working with users to address the storage performance impact of these jobs.

In the figure above we can see that job IO follows a power-law dynamic: a small handful of jobs is often responsible for the majority of the load. In this case, a single job on a single node is responsible. When users report storage "slowness" this discrepancy can be even more pronounced, but we are able to quickly narrow down which specific nodes are responsible and address these corner cases.

5. Dynamically reducing the number of running checkpoint partition jobs.#

As of April 19th, 2022, we have implemented data-driven automation to moderate storage load by dynamically managing the number of running checkpoint (ckpt) partition jobs. When the number of running ckpt jobs is being limited, pending jobs will show AssocGrpJobsLimit as the REASON for not starting.
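To see why your pending ckpt jobs have not started, something like the following (a sketch using standard Slurm squeue options) prints the scheduler's reason, e.g. AssocGrpJobsLimit:

    $ squeue -u $USER -p ckpt -t PENDING -o "%.12i %.10P %.8T %r"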

Please note that non-ckpt jobs (i.e., jobs submitted to nodes your lab contributed to the cluster) are not limited in any way. The social contract when joining the HYAK community is that you get access to the nodes your lab contributes on-demand, and–if and when they are idle–access to other labs’ resources on the cluster. However, access to other labs’ resources isn’t and hasn’t ever been guaranteed: it’s just that there’s often a steady state idle capacity for users to "burst" into by submitting ckpt jobs.

In aggregate, 'Storage Load' is a consumable resource just like CPU cores or memory, albeit one that impacts the whole cluster when it is over-consumed. The SLURM cluster scheduler cannot directly consider storage load availability when evaluating resources for starting ckpt jobs, hence our need to automate. Our new tooling limits the storage performance impact from ckpt jobs in order to improve storage stability for everyone.

[Figure: KLONE storage load]

The red and blue lines represent the two storage servers we have most closely tied to the user experience; 50% load is the threshold we aim to remain at or under by dynamically reducing the number of running ckpt jobs whenever it is exceeded.

So far, this appears to be very effective at moderating the overall storage load, preventing the storage cluster from becoming unusably slow and avoiding other storage-performance issues. We will continue to tune it in search of the best balance between idle resource utilization via ckpt and storage performance.

6. Expanding the team#

The storage sub-system is a complicated machine in its own right, and it needs more care and attention than the current HYAK team, stretched incredibly thin as it is, can provide. We have started the process of hiring a dedicated research data storage systems engineer to focus on optimizing storage going forward.
