18 posts tagged with "hpc"

July 2024 Maintenance Details

July 9, 2024 · 1 min read

Kristen Finch

HPC Staff Scientist

Hello HYAK Community,

Thanks again for your patience with our monthly scheduled maintenance. During this maintenance session, we were able to provide package updates to node images to ensure compliance with the latest operating system level security fixes and performance optimizations.

The next maintenance will be Tuesday August 13, 2024.

Questions? If you have any questions for us, please reach out to the team by emailing help@uw.edu with Hyak in the subject line.

June 2024 Maintenance Details

June 11, 2024 · 5 min read

Kristen Finch

HPC Staff Scientist

Hello HYAK Community,

Thanks again for your patience with our monthly scheduled maintenance. This month, we deployed new node resources that were purchased by various UW Researchers from across campus. These nodes are a little different, so we wanted to bring your attention to them and provide guidance on their use when they are idle with the checkpoint partition.

New G2 Nodes

A new class of nodes has been deployed on klone. We are calling these nodes g2 because they are the second generation of nodes, and we will retroactively refer to the first-generation nodes as g1. g2 CPU nodes feature AMD EPYC 9000-series 'Genoa' processors, and the new GPU nodes feature either NVIDIA L40 or L40S GPUs (with H100 GPUs possibly becoming available in the future). These nodes will join our community resources that can be used when idle (ckpt) under the new partitions:

ckpt-g2 for scheduling jobs on g2 nodes only.
ckpt-all for scheduling jobs on either g1 or g2 nodes.
ckpt will now schedule jobs on g1 nodes only.
Please review our documentation HERE for specific instructions for accessing these resources. Additionally, please see the blog post HERE where we discuss additional considerations for their usage.

To accompany the new g2 node deployments, we are providing a new Open MPI module (ompi/4.1.6-2), which is now the default module when module load ompi is executed. Previous Open MPI modules will cause errors if used with the AMD processors on the new g2 nodes due to how the software was compiled. ompi/4.1.6-2 (and any Open MPI module version we provide in the future) is compiled to support both Intel and AMD processors. If your MPI jobs are submitted to a partition that includes g2 nodes, you should use module load ompi to get the new module by default, or explicitly load ompi/4.1.6-2 (or a newer version in the future) via module load ompi/4.1.6-2.
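As a rough illustration of the guidance above, the following sketch writes an example Slurm batch script that targets ckpt-all and loads the default Open MPI module. The account name (mylab), the node and task counts, the time limit, the JOB_SCRIPT variable, and the program name (./my_mpi_program) are all placeholders rather than HYAK recommendations. The script only writes a file and submits nothing.

```shell
#!/usr/bin/env bash
# Sketch only: writes an example Slurm batch script to a file for review.
# JOB_SCRIPT is an illustrative variable; it defaults to a temporary location.
JOB_SCRIPT="${JOB_SCRIPT:-$(mktemp -d)/mpi_ckpt_all.slurm}"

cat > "$JOB_SCRIPT" <<'EOF'
#!/bin/bash
#SBATCH --job-name=mpi-test
#SBATCH --account=mylab            # placeholder: your group's account
#SBATCH --partition=ckpt-all       # idle g1 or g2 nodes
#SBATCH --nodes=2                  # placeholder size
#SBATCH --ntasks-per-node=4        # placeholder size
#SBATCH --time=01:00:00            # placeholder time limit

# Load the default Open MPI module (ompi/4.1.6-2 at the time of this post).
module load ompi

# Launch the MPI program; replace with your own executable.
mpirun ./my_mpi_program
EOF

echo "Wrote $JOB_SCRIPT; review it, then submit with: sbatch $JOB_SCRIPT"
```

Swapping ckpt-all for ckpt-g2 would restrict the job to g2 nodes, which can be useful for a first test of software that was compiled on g1 nodes.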

If you have compiled software on g1 nodes, you should test it on g2 nodes before bulk submitting jobs to partitions with g2 nodes (i.e., ckpt-g2 and ckpt-all), as it may or may not function properly depending on exactly how it was compiled.

Student Opportunities

In addition, we have two student opportunities to bring to your attention.

Job Opportunity: The Research Computing (RC) team at the University of Washington (UW) is looking for a student intern to spearhead projects that could involve: (1) the development of new tools and software, (2) research computing documentation and user tutorials, or (3) improvements to public-facing service catalog descriptions and service requests. HYAK is an ecosystem of high-performance computing (HPC) resources and supporting infrastructure available to UW researchers, students, and associated members of the UW community. Our team administers and maintains HYAK as well as provides support to HYAK users. Our intern will be given the choice of projects that fit their interest and experience while filling a need for the UW RC community. This role will provide students with valuable hands-on experiences that enhance academic and professional growth.

The position pays $19.97-21.50 per hour depending on experience with a maximum of 20 hours per week (Fall, Winter, and Spring) and a maximum of 40 hours allowed during summer quarter. How to apply: Please apply by emailing: 1) a current resume and 2) a cover letter detailing your qualifications, experience, and interest in the position to me, Kristen Finch (UWNetID: finchkn). Due to the volume of applications, we regret that we are unable to respond to every applicant or provide feedback.

Minimum Qualifications:

Student interns must hold at least a 2nd year standing if an undergraduate student.
Student interns must be able to access the internet.
Student interns must be able to demonstrate an ability to work independently on their selected project/s and expect to handle challenges by consulting software manuals and publicly available resources.

We encourage applicants who:

meet the minimum qualifications.
have an interest in website accessibility and curation.
have experience in research computing and HYAK specifically. This could include experience in any of the following: 1) command-line interface in a Linux environment, 2) Slurm job scheduler, 3) Python, 4) shell scripting.
have an interest in computing systems administration.
have an interest in developing accessible computing-focused tutorials for the HYAK user community.

Conference Opportunity: The 2024 NSF Cybersecurity Summit program committee is now accepting applications to the Student Program. This year’s summit will be held October 7th-10th at Carnegie Mellon University in Pittsburgh, PA. Both undergraduate and graduate students may apply. No specific major or course of study is required, as long as the student is interested in learning and applying cybersecurity innovations to scientific endeavors. Selected applicants will receive invitations from the Program Committee to attend the Summit in-person. Attendance includes your participation in a poster session. The deadline for applications is Friday June 28th at 12 am CDT, with notification of acceptance to be sent by Monday July 29th. Click Here to Apply

Our next scheduled maintenance will be Tuesday, July 9, 2024.

Questions? If you have any questions for us, please reach out to the team by emailing help@uw.edu with Hyak in the subject line. Student intern applications sent to help@uw.edu will not be considered. Email applications to finchkn at uw.edu.

May 2024 Maintenance Details

May 14, 2024 · 2 min read

Kristen Finch

HPC Staff Scientist

Hello HYAK Community,

Thanks again for your patience with our monthly scheduled maintenance. We implemented some notable improvements today.

KLONE node image: Over the past few weeks, you may have noticed some KLONE instability. This was a result of some behind the scenes storage upgrades that inadvertently introduced wider impacts to the existing cluster automation. At the time, we introduced a temporary fix to get the cluster back online but with today’s maintenance we implemented a more comprehensive fix.

InfiniBand firmware: The KLONE cluster is built on the InfiniBand HPC interconnect for node-to-node communication. While KLONE originally launched with the HDR generation of InfiniBand, we have since upgraded mid-KLONE to an HDR-NDR hybrid interconnect. NDR InfiniBand is required to support the latest compute slices we offer. We updated the firmware on our NDR switches following vendor recommendations for increased stability.

Apptainer on MOX: Apptainer (formerly Singularity) is the rootless containerization solution we provide on both HYAK clusters. Apptainer version 1.3.1 was deployed on both KLONE and MOX. As a reminder, on KLONE Apptainer is accessed through a module and is only available on compute nodes after module load apptainer. On MOX, Apptainer is default software and can be accessed with Apptainer commands directly after starting an interactive job, for example, apptainer --version.

Training Opportunities: COMPLECS (San Diego Supercomputer Center) is hosting an Intermediate Linux Shell Scripting online workshop on Thursday, May 16 at 11:00 am Pacific Time. Register here.

Our next scheduled maintenance will be Tuesday, June 11, 2024. Stay informed by joining our mailing list. Sign up here.

Questions? If you have any questions for us, please reach out to the team by emailing help@uw.edu with Hyak in the subject line.

April 2024 Maintenance Details

April 22, 2024 · 2 min read

Kristen Finch

HPC Staff Scientist

Hello HYAK Community,

Thank you for your patience this month while there was more scheduled downtime than usual to allow for electrical reconfiguration work in the UW Tower data center. We appreciate how disruptive this work has been in recent weeks. Please keep in mind that this work by the data center team has been critical in allowing the facility to increase available power to the cluster to provide future growth capacity, which was limiting deployment of new equipment in recent months.

The HYAK team was able to use the interruption to implement the following changes:

Increase in checkpoint (--partition=ckpt) runtime for GPU jobs from 4-5 hours to 8-9 hours (pre-emption for requeuing will still occur subject to cluster utilization). Please see the updated documentation page for information about using idle resources.
The NVIDIA driver has been updated for all GPUs.

Our next scheduled maintenance will be Tuesday May 14, 2024.

Training Opportunities

Follow NSF ACCESS Training and Events posting HERE to find online webinars about containers, parallel computing, using GPUs, and more from HPC providers around the USA.

Questions? If you have any questions for us, please reach out to the team by emailing help@uw.edu with Hyak in the subject line.

AI Research Needs Survey

April 11, 2024 · 1 min read

Kristen Finch

HPC Staff Scientist

Hello HYAK Community,

The Research Working Group of the UW AI Task Force would like faculty and research staff input on the needs and challenges of using AI in research at UW across a broad spectrum of disciplines. Please help by responding to a short survey at: https://forms.gle/mZrV3aCgJYNNBV6j8. Responses by April 25 would be most helpful, but the survey will remain open until April 30.

Thank you in advance for your time,

HYAK Team

Disk Storage Management with Conda

April 4, 2024 · 7 min read

Kristen Finch

HPC Staff Scientist

Hello HYAK Users,

It has come to our attention that the default configuration of Miniconda and conda environments in the user's home directory leads to hitting storage limitations and the dreaded error Disk quota exceeded. We thought we would take some time to guide users in configuring their conda environment directories and package caches to avoid this error and proceed with their research computing.

Error Message

warning post under construction

We have been made aware that the solutions for disk storage presented here result in additional problems with conda environments, specifically with hardlinks to the install directory for Miniconda3 when envs_dirs and pkgs_dirs are configured to a different storage location. Please see this Issue for detailed information. We hope to have a better solution soon.

Conda's config

Software is usually accompanied by a configuration file (aka "config file"), a text file used to store configuration data for software applications. It typically contains parameters and settings that dictate how the software behaves and interacts with its environment. Familiarity with config files allows for efficient troubleshooting, optimization, and adaptation of software to specific environments, like HYAK's shared HPC environment, enhancing overall usability and performance. Conda's config file, .condarc, is customizable and lets you determine where packages and environments are stored by conda.

Understanding your Conda

First let's take a look at your conda settings. The conda info command provides information about the current conda installation and its configuration.

note

The following assumes you have already installed Miniconda in your home directory or elsewhere such that conda is in your $PATH. Install Miniconda instructions here.

$ conda info

     active environment : None
            shell level : 0
       user config file : /mmfs1/home/UWNetID/.condarc
 populated config files : /mmfs1/home/UWNetID/.condarc

          conda version : 4.14.0
    conda-build version : not installed
         python version : 3.9.5.final.0
       virtual packages : __linux=4.18.0=0
                          __glibc=2.28=0
                          __unix=0=0
                          __archspec=1=x86_64
       base environment : /mmfs1/home/UWNetID/miniconda3  (writable)
      conda av data dir : /mmfs1/home/UWNetID/miniconda3/etc/conda
  conda av metadata url : None
           channel URLs : https://conda.anaconda.org/conda-forge/linux-64
                          . . .
          package cache : /mmfs1/home/UWNetID/conda_pkgs
       envs directories : /mmfs1/home/UWNetID/miniconda3/envs
               platform : linux-64
             user-agent : conda/4.14.0 requests/2.26.0 CPython/3.9.5 Linux/4.18.0-513.18.1.el8_9.x86_64 rocky/8.9 glibc/2.28
                UID:GID : 1209843:226269
             netrc file : None
           offline mode : False

The paths shown above will show your username in place of UWNetID. Notice the highlighted lines above showing the absolute path to your config file in your home directory (e.g., /mmfs1/home/UWNetID/.condarc), the directory designated for your package cache (e.g., /mmfs1/home/UWNetID/conda_pkgs), and the directory/ies designated for your environments (e.g., /mmfs1/home/UWNetID/miniconda3/envs). Conda designates directories for your package cache and your environments by default, but under HYAK, your home directory has a 10G storage limit, which can quickly be maxed out by package tarballs and their contents. We can change the location for your package cache and your environments to avoid this.

tip

When you list your home directory with ls /mmfs1/home/UWNetID/, you might not see .condarc listed. It is there! To list all hidden files (files beginning with .), use ls -a /mmfs1/home/UWNetID/.

Configuring your package cache and envs directories

Edit the highlighted lines in .condarc to designate directories with higher storage quotas for your envs_dirs and pkgs_dirs. Use an editor preinstalled on HYAK, like nano or vim, to edit .condarc in place. More about nano. More about vim. Your .condarc will look like this:

$ nano ~/.condarc

channels:
  - conda-forge
  - bioconda
  - defaults
auto_activate_base: true
envs_dirs:
  - /mmfs1/home/UWNetID/miniconda3/envs
pkgs_dirs:
  - /mmfs1/home/UWNetID/conda_pkgs

In this exercise, we will assign our envs_dirs and pkgs_dirs directories to directories in /gscratch/scrubbed/ where we have more storage, although remember scrubbed storage is temporary. Alternatively, your lab/research group might have another directory in /gscratch/ that can be used.

important

Remember to replace the word UWNetID in the paths below with YOUR username/UWNetID.

Here is what your edited .condarc should look like.

$ cat /mmfs1/home/UWNetID/.condarc

channels: 
  - conda-forge
  - bioconda
  - defaults
auto_activate_base: true
envs_dirs:
  - /gscratch/scrubbed/UWNetID/envs
pkgs_dirs:
  - /gscratch/scrubbed/UWNetID/conda_pkgs

warning

If you don't have a directory under your UWNetID in /gscratch/scrubbed/ or wherever you intend to designate these directories, you will need to create them now for this to work. Use the mkdir command, for example mkdir /gscratch/scrubbed/UWNetID, and replace UWNetID with your username. Then create directories for your package cache and envs directory, for example, mkdir /gscratch/scrubbed/UWNetID/conda_pkgs and mkdir /gscratch/scrubbed/UWNetID/envs.
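The directory setup described in this warning can be done in one step with mkdir -p. This is a minimal sketch, not an official HYAK script. SCRATCH_BASE is an illustrative variable name that you would set to /gscratch/scrubbed/UWNetID (with your own UWNetID) on HYAK. It falls back to a throwaway temporary directory so the commands can be tried safely anywhere.

```shell
# Sketch: create the directories that .condarc will point to.
# SCRATCH_BASE is an illustrative variable; on HYAK set it to
# /gscratch/scrubbed/UWNetID before running this.
SCRATCH_BASE="${SCRATCH_BASE:-$(mktemp -d)}"

# -p creates missing parent directories and does nothing if they already exist.
mkdir -p "$SCRATCH_BASE/conda_pkgs" "$SCRATCH_BASE/envs"

# Confirm that both directories exist and are writable before editing .condarc.
for d in "$SCRATCH_BASE/conda_pkgs" "$SCRATCH_BASE/envs"; do
  if [ -d "$d" ] && [ -w "$d" ]; then
    echo "ok: $d"
  else
    echo "missing or not writable: $d" >&2
  fi
done
```

Because mkdir -p is safe to repeat, the same lines can be rerun after scrubbed storage has been purged and the directories need to be recreated.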

After .condarc is edited, we can use conda info to see if our changes have been incorporated.

$ conda info |grep cache 
/gscratch/scrubbed/UWNetID/conda_pkgs
$ conda info |grep envs
/gscratch/scrubbed/UWNetID/envs

Cleaning up disk storage

After you have reset the package cache and environment directories with your conda config file, you can delete the previous directories to free up storage. Before doing that, you can monitor how much storage was being occupied by each item in your home directory with the command du -h --max-depth=1. Remove directories previously used as cache and envs_dir recursively with rm -r. The following is an example of monitoring storage and removing directories.

warning

rm -r is permanent. We cannot recover your directory. You were warned.

$ du -h --max-depth=1 /mmfs1/home/UWNetID/
6.7G    ./miniconda3/envs
4.0G    ./conda_pkgs
. . .
$ rm -r /mmfs1/home/UWNetID/miniconda3/envs
$ du -h --max-depth=1 /mmfs1/home/UWNetID/
2.6G    ./miniconda3/
4.0G    ./conda_pkgs
. . .

note

The hyakstorage command is not simultaneously updated. Although you have cleaned up your home directory, hyakstorage might not yet show new storage estimates. du -sh will give you the most up-to-date information.

Storage can also be managed by cleaning up package cache periodically. Get rid of the large-storage tar archives after your conda packages have been installed with conda clean --all.

Lastly, regular maintenance of conda environments is crucial for keeping disk usage in check. Review your list of conda environments with conda env list and remove unused environments using the conda remove --name ENV_NAME --all command. Consider creating lightweight environments by installing only necessary packages to conserve disk space. For example, create an environment for each project (project1_env) rather than an environment for all projects combined (myenv).

Disk quota STILL exceeded

Be aware that many software packages are configured similarly to conda. Explore the documentation of your software to locate the configuration file and anticipate where storage limitations might become an issue. In some cases, you may need to edit or create a config file for the software to use. pip and R are two other common offenders ballooning the disk storage in your home directory.

Configuring PIP

If you are installing with pip, you might have a pip cache in ~/.cache/pip. Let's locate your pip config file under variant 'user'. You might have to activate a previously built conda environment to do this. For this exercise we will use an environment called project1_env.

$ conda activate project1_env
(project1_env) $ pip config list -v
. . .
For variant 'user', will try loading '/mmfs1/home/UWNetID/.pip/pip.conf'
. . .

The message "will try loading" rather than listing the config file pip.conf means that a pip config file has not been created. We will create our config file and set our pip cache. Create a directory in your home directory (e.g., /mmfs1/home/UWNetID/.pip) to hold your pip config file and create a file called pip.conf with the touch command. Remember to also create the new directory for your new pip cache if you haven't yet.

$ mkdir /mmfs1/home/UWNetID/.pip/
$ touch /mmfs1/home/UWNetID/.pip/pip.conf
$ mkdir /gscratch/scrubbed/UWNetID/pip_cache

Open pip.conf with nano or vim and add the following lines to designate the location of your pip cache.

[global]
cache-dir=/gscratch/scrubbed/UWNetID/pip_cache

Check that your pip cache has been designated.

(project1_env) $ pip config list
global.cache-dir='/gscratch/scrubbed/UWNetID/pip_cache'
(project1_env) $ pip cache dir
/gscratch/scrubbed/UWNetID/pip_cache
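The pip steps above can be collected into one short script. This is a sketch under stated assumptions: PIP_CONF_DIR and SCRATCH_BASE are illustrative variable names. On HYAK they would be set to /mmfs1/home/UWNetID/.pip and /gscratch/scrubbed/UWNetID. Here they default to throwaway temporary directories so that no real configuration is touched. Note that the script overwrites any pip.conf already present in PIP_CONF_DIR.

```shell
# Sketch: create a pip config file that redirects pip's cache.
# PIP_CONF_DIR and SCRATCH_BASE are illustrative variables (see the text above).
PIP_CONF_DIR="${PIP_CONF_DIR:-$(mktemp -d)}"
SCRATCH_BASE="${SCRATCH_BASE:-$(mktemp -d)}"

# Create the config directory and the new cache directory if they are missing.
mkdir -p "$PIP_CONF_DIR" "$SCRATCH_BASE/pip_cache"

# Write the [global] section with the new cache location.
# This replaces any existing pip.conf in PIP_CONF_DIR.
cat > "$PIP_CONF_DIR/pip.conf" <<EOF
[global]
cache-dir=$SCRATCH_BASE/pip_cache
EOF

# Show the result for a quick visual check.
cat "$PIP_CONF_DIR/pip.conf"
```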

Configuring R

We previously covered this in our documentation. Edit or create a config file called .Renviron in your home directory. Use nano or vim to designate the location of your R package libraries. The contents of the file should be something like the following example.

$ cat ~/.Renviron 
R_LIBS="/gscratch/scrubbed/UWNetID/R/"

The directory designated by R_LIBS will be where R installs your package libraries.
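A minimal sketch of the same setup from the shell follows. RENVIRON_FILE and SCRATCH_BASE are illustrative variable names. On HYAK they would be ~/.Renviron and /gscratch/scrubbed/UWNetID. They default to temporary locations here so the sketch can be tried without changing a real configuration. The library directory is created first because R skips library paths that do not exist.

```shell
# Sketch: point R's package library at a directory outside the home quota.
# RENVIRON_FILE and SCRATCH_BASE are illustrative variables (see the text above).
RENVIRON_FILE="${RENVIRON_FILE:-$(mktemp -d)/.Renviron}"
SCRATCH_BASE="${SCRATCH_BASE:-$(mktemp -d)}"

# Create the library directory before R is asked to use it.
mkdir -p "$SCRATCH_BASE/R"

# Add R_LIBS only if the file does not already define it.
if ! grep -q '^R_LIBS=' "$RENVIRON_FILE" 2>/dev/null; then
  printf 'R_LIBS="%s/R/"\n' "$SCRATCH_BASE" >> "$RENVIRON_FILE"
fi

cat "$RENVIRON_FILE"
```

Appending only when no R_LIBS line exists keeps the script safe to rerun and leaves other settings in the file untouched.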

I'm still stuck

Please reach out to us by emailing help@uw.edu with "hyak" in the subject line to open a help ticket.

March 2024 Maintenance Details

March 12, 2024 · 3 min read

Kristen Finch

HPC Staff Scientist

Hello HYAK Users,

For our March maintenance we had some notable changes we wanted to share with the community.

Login Node

Over the last several months the login node has been crashing on occasion. We have been monitoring and dissecting the kernel dumps from each crash and this behavior seems to be highly correlated with VS Code Remote-SSH extension activity. To prevent node instability, we have upgraded the storage drivers to the latest version. If you are a VS Code user and connect to klone via Remote-SSH, we have some recommendations to help limit the possibility that your work would cause system instability on the login node.

Responsible Usage of VS Code Extension `Remote-SSH`

While developing your code with connectivity to the server is a great usage of our services, connecting directly to the login node via the Remote-SSH extension will result in VS Code server processes running silently in the background and leading to node instability. As a reminder, we prohibit users from running processes on the login node.

New Documentation

The steps discussed here for responsible use of VS Code have been added to our documentation. Please review the solutions for connecting VS Code to HYAK.

Check which processes are running on the login node, especially if you have been receiving klone usage violations when you are not aware of jobs running. Look for vscode-server among the listed processes.
$ ps aux | grep UWNetID
If you need to develop your code with connectivity to VS Code, use a ProxyJump to open a connection directly to a compute node (Step 1 documentation), and then use the Remote-SSH extension to connect to that node through VS Code on your local machine, preserving the login node for the rest of the community (Step 2 documentation).
Lastly, VS Code’s high usage is due to it silently installing its built-in features into the user's home directory ~/.vscode on klone to enable intelligent autocomplete features. This is a well-known issue, and there is a solution that involves disabling the @builtin TypeScript plugin from VS Code on your local machine. Here is a link to a blog post about the issue and the super-easy solution. Disabling @builtin TypeScript will reduce your usage of the shared resources and avoid problems.
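To narrow the process check above to VS Code only, a small helper like the following can be used. It is a sketch, not a HYAK-provided command: list_vscode_servers is an illustrative function name, and it assumes the standard procps ps found on Linux systems.

```shell
# Sketch: list your own vscode-server processes on the current node.
list_vscode_servers() {
  # $1 = username to check (defaults to the current user).
  # ps -u limits output to one user, and -o selects PID, elapsed time,
  # and the full command line. The [v] in the pattern keeps grep from
  # matching its own process.
  ps -u "${1:-$(id -un)}" -o pid,etime,args | grep '[v]scode-server' \
    || echo "no vscode-server processes found"
}

list_vscode_servers
```

Any PIDs that are listed while you are on the login node belong to leftover VS Code server processes, which you can stop with kill followed by the PID.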

In addition to the upgrade of the storage driver, we performed updates to security packages.

Training Opportunities

We wanted to make you aware of two training opportunities with the San Diego Supercomputer Center. If you are interested in picking up some additional skills and experience in HPC, check this blog post.

Questions?

If you have any questions for us, please reach out to the team by emailing help@uw.edu with Hyak in the subject line.

Upcoming HPC Training Opportunities

March 8, 2024 · 2 min read

Kristen Finch

HPC Staff Scientist

Hello HYAK Community!

We wanted to make you aware of two training opportunities with the San Diego Supercomputer Center (SDSC). If you are interested in picking up some additional skills and experience in HPC, please check them out.

SDSC Cyberinfrastructure-Enabled Machine Learning (CIML) Summer Institute: The project is focused on teaching researchers and students the best practices for effectively running machine learning (ML) and data science applications on advanced cyberinfrastructure (CI) and high-performance computing (HPC) systems. Applications due 12 April 2024. https://www.sdsc.edu/education_and_training/ciml_summer_institute.html
SDSC HPC and Data Science Summer Institute: The program is aimed at researchers in academia and industry, especially in domains not traditionally engaged in supercomputing, who have problems that cannot typically be solved using local computing resources. Applications due 26 April 2024. https://www.sdsc.edu/education_and_training/summer_institute.html
SDSC Virtual Workshop; COMPLECS: Batch Computing: Getting Started with Batch Job Scheduling - Slurm Edition: Learn how to use Slurm, Hyak's batch job scheduler. In "our series on Batch Computing, we will introduce you to the concept of a distributed batch job scheduler — what they are, why they exist, and how they work — using the Slurm Workload Manager as our reference implementation and testbed. You will then learn how to write your first job script and submit it to an HPC System running Slurm as its scheduler. We will also discuss the best practices for how to structure your batch job scripts, teach you how to leverage Slurm environment variables, and provide tips on how to request resources from the scheduler to get your work done faster." Event held virtually on Thursday, March 21, 2024 11:00 AM - 12:30 PM PDT Link to Registration

Keep an eye on our blog for more opportunities and HYAK updates.

If you have any questions, please reach out to the team by emailing help@uw.edu and be sure to mention Hyak in the subject line. Thanks!

February 2024 Maintenance Details

February 13, 2024 · 3 min read

Nam Pho

Director for Research Computing

Hello HYAK community! We have a few notable announcements regarding this month’s maintenance. If the hyak-users mailing list e-mail didn’t fully satisfy your curiosity, hopefully this expanded version will answer any lingering questions.

GPUs

Software: The GPU driver was upgraded to the latest stable version (545.29.06). The latest CUDA 12.3.2 is also now provided as a module. You are also encouraged to explore the use of container (i.e., Apptainer) based workflows, which bundle various versions of CUDA with your software of interest (e.g., PyTorch) over at NGC. NOTE: Be sure to pass the --nv flag to Apptainer when working with GPUs.
Hardware: The HYAK team has also begun the early deployments of our first Genoa-Ada GPU nodes. These are cutting-edge NVIDIA L40-based GPUs (code named “Ada”) running on the latest AMD processors (code named “Genoa”) with 64 GPUs released to their groups two weeks ago and an additional 16 GPUs to be released later this week. These new resources are not currently part of the checkpoint partition but we will be releasing guidance on making use of idle resources here over the coming weeks directly to the HYAK user documentation as we receive feedback from these initial researchers.

Storage

Performance Upgrade: In recent weeks, AI/ML workloads have been increasingly stressing the primary storage on KLONE (i.e., "gscratch"). Part of this was attributed to the run-up to the International Conference on Machine Learning (ICML) 2024 full paper deadline on Friday, February 2. However, it also reflects a broader trend in the increasing demands of data-intensive research. The IO profile was so heavy at times that our systems automation throttled the checkpoint capacity to near 0 in order to keep storage performance up and prioritize general cluster navigation and contributed resources. We have an internal tool called iopsaver that automatically reduces IOPS by intelligently requeuing checkpoint jobs generating the highest IOPS while concurrently limiting the number of total active checkpoint jobs until the overall storage is within its operating capacity. At times over the past few weeks you may have noticed that iopsaver had reduced the checkpoint job capacity to near 0 to maintain overall storage usability.
During today’s maintenance, we upgraded the memory on existing storage servers so that we could enable Local Read-Only Cache (LROC), although we don’t anticipate it will be live until tomorrow. Once enabled, LROC allows the storage cluster to use previously idle SSD capacity to cache frequently accessed files on this faster storage tier. We expect LROC to make a big difference because, over the last several weeks, the majority of the IO bottlenecking was attributed to a high volume of read operations. As always, we will continue to monitor developments and adjust our policies and solutions accordingly to benefit the most researchers and users of HYAK.
Scrubbed Policy: In the recent past this space has filled up. As a reminder, this is a free-for-all space and a communal resource for when you temporarily need to burst beyond your usual allocations from your other group affiliations. To ensure greater equity in its use, we have instituted a limit of 10TB and 10M files for each user in scrubbed. This impacts <1% of users, as only a handful were using more than 10TB in scrubbed.

Questions?

Hopefully you found these extra details informative. If you have any questions for us, please reach out to the team by emailing help@uw.edu with Hyak somewhere in the subject or body. Thanks!

Update on the hyakstorage command

August 9, 2022 · 2 min read

Nam Pho

Director for Research Computing

We’ve made an update to our storage accounting tool, hyakstorage, and with this update we are also phasing out usage_report.txt. That text file contained minimally-parsed internal metrics of the storage cluster, and we found it caused as many questions as it answered. Moving forward, the hyakstorage tool will display only the four relevant pieces of information for each fileset you query: storage space used vs. the storage space limit, and current amount of files (inodes) vs. maximum number of files.

The default operation–running hyakstorage with no arguments–will show your home directory & the gscratch directories you have access to, and it will only show the fileset totals & your contributions.

You can also specify which filesets you want to view, in a few different ways: you can use the flag --home to show your home directory, --gscratch to show your gscratch directories, and --contrib to show your group’s contrib directories. You can also specify an exact gscratch directory with the group name (e.g. hyakstorage stf), contrib directory (e.g. hyakstorage stf-src), or full path to a fileset (e.g. hyakstorage /mmfs1/gscratch/stf).

If you want more detailed metrics, you can use the flags --show-user or --show-group to break down the fileset totals by individual users or groups. Those detailed metrics can be sorted by space with --by-disk (the default) or by files with --by-files.


Terminology Reset

August 9, 2022 · 3 min read

Nam Pho

Director for Research Computing

note

There is no operational change; this is an administrative clarification of HYAK-specific terminology.

The HYAK community has grown substantially over the past year, including the administrative teams that work with us to support the service. Some terms (e.g., nodes, servers) have been loosely used in communication, but have specific meanings for different backend teams. Beginning today, we are harmonizing all the terms to alleviate any confusion between the different teams supporting HYAK and our end users. This is only a clarification of language: there is no change to how HYAK operates.

At the physical layer we have nodes or servers: the smallest individual physical units of the HPC cluster. These are what we, the HYAK engineering team, purchase from our vendor partners. Historically, a physical node or server was what a lab would purchase to join the cluster. However, since HYAK’s inception, resource density has increased to such an extent–servers with hundreds of CPU cores, hundreds of gigabytes of RAM, multiple graphics cards, etc.–it no longer made sense to require labs to purchase an entire physical node.

Once we crossed a certain threshold, we began to offer labs an amount of computing resources–a specific number of CPU cores, amount of RAM, number of GPUs, etc.–rather than discrete servers, but we kept the node nomenclature. For a while, when labs purchased a node, it no longer meant they were purchasing a server, even though those words are identical in many computing contexts. From today forward, we are restoring those terms to their original meanings for HYAK: one node is one physical server.

When labs join HYAK, they will not purchase a physical node, they will purchase a slice. A slice represents an amount of on-demand compute capacity–CPUs, RAM, GPUs, etc. Again, this is only a terminology clarification: HYAK has operated this way for a while. One of the benefits of this model is that slices–representing resources, not specific, physical pieces of hardware–make resource scheduling considerably easier for our cluster's scheduling software, Slurm. This efficiency is returned to the entire community both as depth of the checkpoint partition’s resources, and as faster scheduling for non-checkpoint jobs.

We've seen some users refer to this as "virtualization", and that is a misnomer. We want to emphasize that there is no hardware virtualization taking place here: your job will run on the bare-metal, physical resources you have requested from Slurm.

While this may seem like a minor change in language, it will greatly ease the coordination among many groups working behind the scenes to support the HYAK service. As always, we appreciate your understanding and patience as we continue to refine and improve the support provided.

See also:

Glossary of HYAK specific terms.

HYAK Team Storage Optimizations

April 21, 2022 · 6 min read

Nam Pho

Director for Research Computing

note

The HYAK team has taken six concrete steps to stabilize and optimize storage on KLONE over the past few weeks.

While the storage on KLONE (i.e., mmfs1 or gscratch) may appear to be a monolithic device, it is an extremely complex cluster in its own right. This storage cluster is mounted on every KLONE node: so despite appearing as "on the node", gscratch physically resides on specialized storage hardware separated from the compute resources of KLONE. The storage is accessed across a high-speed, ultra low-latency HDR Infiniband network, and is designed to be scalable independent of KLONE’s compute resources.

As mentioned in an earlier blog post today, our incoming hardware expansion will drastically increase the amount of demand the storage cluster can handle. In the meantime, the Hyak team has taken measures to help maintain a usable level of storage performance for users and jobs:

1. Improved internal storage metrics gathering and visibility.#

The HYAK team improved storage-cluster metric gathering and visibility, allowing us to correlate those metrics to reports of poor user experience, and to make data-driven tuning and storage policy decisions.

The figure above shows whether an abnormally high number of jobs have errors, which might suggest underlying storage or other user-experience issues.

2. Created custom filesystem migration policies to optimize the use of the NVMe layer.#

The bulk of the storage capacity on KLONE is stored on rotary hard disk drives totalling approximately 1.7 Petabytes (PB) of raw storage. In addition to the hard disk storage, there is a much smaller, extremely fast–and expensive–pool of NVMe "flash" storage that functions both as a write buffer for new files written to the filesystem, and also as a read-cache-like layer where files can be read without causing load on the rotary disks.

The HYAK team has also optimized the file placement policy: files most likely to generate heavy load reside in the limited space of the NVMe layer, ensuring that no storage load is generated on the hard disk layer when those files are repeatedly accessed.

In the figure above you can see that the flash tier (green line) is allowed to fill to 80% capacity from job writes, at which point the migration policy runs until the flash tier is back down to 65% full. For the majority of the past several weeks, things worked as expected. However, there were a few recent events where jobs produced so much data that the flash tier reached 100% full faster than the storage system could move data off it. Giving the migration process too high a priority, on the other hand, results in "slowness" in the user experience. We have since been tuning the aggressiveness of this migration process to reduce the likelihood of this occurring again.
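
The fill-then-drain behavior described above is a form of hysteresis: migration starts at one threshold and stops at a lower one. The following is a minimal sketch of that logic, not the filesystem's actual policy engine. The class `FlashTierPolicy`, the function `simulate`, and the per-step write and drain rates are hypothetical; only the 80% and 65% thresholds come from the text.

```python
from dataclasses import dataclass


@dataclass
class FlashTierPolicy:
    """Toy hysteresis controller; only the 80/65 thresholds come from the post."""
    start_pct: int = 80   # begin migrating at or above this fill level
    stop_pct: int = 65    # stop migrating at or below this fill level
    migrating: bool = False

    def update(self, fill_pct: int) -> bool:
        """Return True if migration should run at the given fill level."""
        if not self.migrating and fill_pct >= self.start_pct:
            self.migrating = True
        elif self.migrating and fill_pct <= self.stop_pct:
            self.migrating = False
        return self.migrating


def simulate(write_pct: int, drain_pct: int, steps: int, fill_pct: int = 50) -> list:
    """Return the flash-tier fill level observed after each step's writes.

    write_pct and drain_pct are hypothetical per-step rates (percentage points).
    """
    policy = FlashTierPolicy()
    observed = []
    for _ in range(steps):
        fill_pct = min(100, fill_pct + write_pct)   # jobs write new data
        observed.append(fill_pct)
        if policy.update(fill_pct):                 # drain only while migrating
            fill_pct = max(0, fill_pct - drain_pct)
    return observed


# Migration keeps up: the tier oscillates between the two thresholds.
healthy = simulate(write_pct=5, drain_pct=10, steps=12)
# Writes outpace migration: the tier climbs to 100% full.
overloaded = simulate(write_pct=10, drain_pct=5, steps=8)
```

In this toy model, `healthy` never rises above 80, while `overloaded` reaches 100 because each step writes more than the migration can drain, which mirrors the events described above.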

3. Added QoS policies to improve worst-case filesystem responsiveness.#

The KLONE filesystem has a coarse Quality-of-Service (QoS) tuning facility that allows the filesystem to cap the rate of storage operations for various types of storage input-output (IO). The HYAK team has used this facility in two different ways:

First, to limit the storage load impact when the NVMe layer, described above, needs to free up space by moving files to the hard drive layer.
Secondly, to moderate the amount of storage load that can be generated by any single compute node in the cluster. This way, outlier jobs in terms of storage load generation are less likely to have an outsized performance impact on the storage.

4. Manually identifying jobs causing a disproportionate impact on storage performance.#

Utilizing metrics and old-fashioned sleuthing, we have been manually tracking down individual jobs that appear to be having a disproportionate and/or unnecessary impact on storage performance, and working with users to address the storage performance impact of these jobs.

In the above figure we can see that job IO follows a power-law dynamic: a small handful of jobs are often responsible for the majority of the load. In this case, a single job on a single node is responsible. When users report storage "slowness" this discrepancy can be even more pronounced, but we are able to quickly narrow down which specific nodes are responsible and address these corner cases.

5. Dynamically reducing the number of running checkpoint partition jobs.#

As of April 19th, 2022, we have implemented data-driven automation to moderate storage load by dynamically managing the number of running checkpoint (ckpt) partition jobs. When the number of running ckpt jobs is being limited, pending jobs will show AssocGrpJobsLimit as the REASON for not starting.

Please note that non-ckpt jobs (i.e., jobs submitted to nodes your lab contributed to the cluster) are not limited in any way. The social contract when joining the HYAK community is that you get access to the nodes your lab contributes on-demand, and–if and when they are idle–access to other labs’ resources on the cluster. However, access to other labs’ resources isn’t and hasn’t ever been guaranteed: it’s just that there’s often a steady state idle capacity for users to "burst" into by submitting ckpt jobs.

In aggregate, 'Storage Load' is a consumable resource just like CPU cores or memory, albeit one that impacts the whole cluster when it is over-consumed. The SLURM cluster scheduler cannot directly consider storage load availability when evaluating resources for starting ckpt jobs, hence our need to automate. Our new tooling limits the storage performance impact from ckpt jobs in order to improve storage stability for everyone.

The red and blue lines represent the two storage servers whose load we have most closely tied to the user experience. We aim to keep their load at or under 50% by dynamically reducing the number of running ckpt jobs whenever load exceeds that threshold.

So far, this appears to be very effective at moderating the overall storage load, preventing the storage cluster from becoming unusably slow and avoiding other storage-performance issues. We will continue to tune it in search of the best balance between idle resource utilization via ckpt and storage performance.
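
The post does not describe how the automation chooses a job limit, so the following is only a hedged sketch of one way such a feedback rule could work. The function `next_ckpt_job_limit`, its proportional scale-down rule, and the `min_limit`, `max_limit`, and `grow_step` parameters are all hypothetical; only the 50% load target comes from the text.

```python
def next_ckpt_job_limit(current_limit, server_load_pcts, target_pct=50,
                        min_limit=100, max_limit=5000, grow_step=50):
    """Propose a new cap on running ckpt jobs from observed storage load.

    server_load_pcts: load (0-100) of the monitored storage servers.
    All limits and step sizes are illustrative assumptions, not HYAK settings.
    """
    worst_pct = max(server_load_pcts)
    if worst_pct > target_pct:
        # Over target: shrink the cap in proportion to the overshoot.
        proposed = current_limit * target_pct // worst_pct
    else:
        # At or under target: let the cap recover gradually.
        proposed = current_limit + grow_step
    # Never drop below a floor or rise above a ceiling.
    return max(min_limit, min(max_limit, proposed))
```

For example, with a current cap of 1000 jobs and the busiest server at 80% load, this rule would propose a cap of 625; with both servers under 50%, it would raise the cap by one step.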

6. Expanding the team#

The storage sub-system is a complicated machine in its own right and needs much more care and attention, and the current HYAK team is stretched incredibly thin as is. We have started the process of hiring a dedicated research data storage systems engineer to focus on optimizing storage going forward.

See also:

KLONE Users Storage Optimizations

April 21, 2022 · 5 min read

Nam Pho

Director for Research Computing

note

There are steps you, as a researcher using KLONE, can take to limit the impact of whatever else is happening on the cluster on your individual workflows.

While some of what precipitated this conversation is the current state of the storage (i.e., mmfs1 or gscratch), there are several things you can do as a researcher to both reduce the load on gscratch as well as help insulate your jobs from cluster-wide storage slowdowns.

1. Use local node SSDs.#

Each node on the cluster has a local SSD drive with 350+ GB of space available for use by user jobs. This space is available only to jobs running on that node and all contents are purged when the user’s last job running on the node completes. It is mounted as /scr and /tmp (both paths go to the same place) on all the compute nodes.

If input data, Apptainer (Singularity) images, or other files used by your job will fit, copying those files to the SSD (via cp, rsync, etc.) once at the beginning of your job and reading them from there during the remainder of the job run results in less load on the central storage, helps insulate your job from any instances of central storage slowness, and can often result in better overall job performance.

SLURM has a command called sbcast [www] that is useful for efficiently copying files to all nodes used in a multi-node job as part of an sbatch script.

For files being written that need to be kept after the job run, it is generally best to write these directly to the central storage. Because new files are written directly to the very fast NVMe layer, such writes are less likely to impact overall storage performance. That said, it is still beneficial to write intermediate job files to the local SSD whenever possible.
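
As a hedged illustration of the "copy once, then read locally" pattern described above, the sketch below stages an input file into node-local scratch at the start of a job. The helper `stage_to_local` and the per-job subdirectory naming are hypothetical; `/scr` is the mount point named above and `SLURM_JOB_ID` is the environment variable Slurm sets inside a job. The demo uses temporary directories as stand-ins so it can run anywhere.

```python
import os
import shutil
import tempfile
from pathlib import Path


def stage_to_local(src, scratch_root="/scr"):
    """Copy src (file or directory) to node-local scratch once; return the local path."""
    src = Path(src)
    # Keep each job's files in its own subdirectory (naming is an assumption).
    job_id = os.environ.get("SLURM_JOB_ID", "nojob")
    user = os.environ.get("USER", "user")
    dest_dir = Path(scratch_root) / f"{user}_{job_id}"
    dest_dir.mkdir(parents=True, exist_ok=True)
    dest = dest_dir / src.name
    if not dest.exists():            # skip the copy if it was already staged
        if src.is_dir():
            shutil.copytree(src, dest)
        else:
            shutil.copy2(src, dest)
    return dest


# Demo: temporary directories stand in for gscratch and /scr.
with tempfile.TemporaryDirectory() as fake_gscratch, \
        tempfile.TemporaryDirectory() as fake_scr:
    original = Path(fake_gscratch) / "input.dat"
    original.write_text("example input\n")
    local_copy = stage_to_local(original, scratch_root=fake_scr)
    staged_text = local_copy.read_text()          # later reads hit the local copy
    staged_under_scratch = Path(fake_scr) in local_copy.parents
```

In a real job you would call `stage_to_local` with the default `scratch_root` and point the rest of the job at the returned path.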

2. Code for efficient file IO.#

While this can be a very complicated topic, a great deal of overall job performance can be gained by thoughtful and judicious use of file input-output (IO). Some general tips:

Keep in mind that file access is orders of magnitude slower than memory access, and processes often have to completely "stop and wait" for disk IO operations to complete. Minimizing file IO operations, especially inside "inner loops" of programs can greatly speed up job completion, and helps to reduce load on the cluster central storage.
Fewer, larger file IO operations are generally more efficient than multiple smaller file operations accessing the same data.
When possible, store data in an efficient format such as HDF5 instead of many small files.
"Open/read once, access many times" if job memory permits.

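To make the "open/read once, access many times" tip concrete, here is a small sketch contrasting two ways of looking up values in a tab-separated file. The function names and the file layout are made up for illustration; the point is only that the second version performs one pass over the file regardless of how many lookups follow.

```python
import tempfile
from pathlib import Path


def lookup_rescan(path, keys):
    """Anti-pattern: re-open and re-scan the file for every key."""
    results = {}
    for key in keys:
        with open(path) as fh:
            for line in fh:
                name, value = line.rstrip("\n").split("\t")
                if name == key:
                    results[key] = value
                    break
    return results


def lookup_read_once(path, keys):
    """Read the file once into memory, then answer every lookup from the dict."""
    with open(path) as fh:
        table = dict(line.rstrip("\n").split("\t") for line in fh)
    return {key: table[key] for key in keys if key in table}


# Demo on a throwaway file with 1000 rows.
with tempfile.TemporaryDirectory() as tmp:
    table_path = Path(tmp) / "samples.tsv"
    table_path.write_text("".join(f"sample{i}\t{i * i}\n" for i in range(1000)))
    wanted = ["sample3", "sample500", "sample999"]
    slow = lookup_rescan(table_path, wanted)
    fast = lookup_read_once(table_path, wanted)
```

Both functions return the same answers, but `lookup_rescan` opens the file once per key while `lookup_read_once` opens it exactly once, which is what matters when the file lives on shared storage.
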
3. Containerize your environment.#

As mentioned above, minimizing the number of files you need to access can help reduce the number of input / output operations per second (IOPS) happening on the cluster. For example, a Python miniconda environment can create hundreds or even thousands of small files when you install different library dependencies. While Python is a common compute environment, this can be generalized to most other programs you may need. When you containerize your environment, this gets reduced to a single file. A brief introduction to Singularity (now called Apptainer) can be found here. As a side benefit, containerizing your environment–making it a single file–makes it much easier to move it around (see #1 above).

4. Stay under quota.#

Constantly hitting your inode (e.g., file) or block (e.g., number of GBs or TBs) quotas can cause extra storage slowness. If you need a bump on either, please reach out to discuss your options. As a reminder, you can use the hyakstorage command on KLONE to display current quota usage for all of your filesets as well as your home directory. Please note that this output is updated once an hour so it will take time to reflect any overages.

5. Report issues.#

While the HYAK team has an extensive monitoring and alerting framework in place to help us proactively determine when things may be going wrong, not all causes of slow user experience are currently correlated to metrics. Furthermore, our team generally interfaces with the cluster in different ways than our users do, so we may not feel the same pain until it is reported to us. If you’ve run into a performance issue, please submit a ticket by emailing help@uw.edu. Please provide any symptoms you are observing, along with the date, timeframe, job IDs (if applicable), commands you are running with their full output, etc. Even if you don’t need or want a reply from us, it is still helpful for us to hear from you; feel free to say "no response needed" or something along these lines so we know how to respond.

See also:

An update on KLONE storage

April 20, 2022 · 5 min read

Nam Pho

Director for Research Computing

note

KLONE has experienced exponential growth over the first year since its launch, making long-planned storage upgrades necessary. The current estimate is between June and July 2022 for deployment of this hardware.

The 3rd generation HYAK cluster, KLONE, launched in spring 2021 with 144 HPC nodes and 192 GPUs. In just a single year, we’ve grown to over 384 HPC nodes (a 166% increase) and 448 GPUs (a 133% increase). KLONE has more than doubled in size, and while some of this growth comes from long-standing HYAK members migrating to the new cluster, much of our increased capacity comes from hundreds of new researchers joining the HYAK community. We’ve seen existing sponsors such as the College of Engineering increase their already substantial footprints by 60%, we’ve welcomed new sponsors such as UW Bothell, UW Tacoma, and the Puget Sound Institute, and seen over 1000% growth–seriously–in our new self-sponsored tier for investigators and faculty without an existing HYAK sponsor affiliation. As with any large project, during KLONE’s initial planning stages we made assumptions about our growth rate & the types of research we would be supporting: assumptions that have been shattered by our growth over the past year. It was never a question of if we would need to upgrade our support infrastructure–like storage–but when, and our rapid growth significantly accelerated our upgrade timeline.

Monitoring – and developing more monitoring for – the HYAK clusters is a central responsibility of our team. The status quo at the beginning of 2022 was to track down errant jobs or workflows when storage issues came up. In almost every instance, we were able to pinpoint the problematic job and work with the researcher to shape their code into a normal IO profile. Pausing jobs and providing best practices was sufficient to keep the storage performance solid for everyone. However, starting around the last week of March 2022, we started having trouble finding an obvious job, or even a set of jobs, impacting storage performance.

The truth is that our baseline load had shifted. Due to our tremendous growth, things researchers had previously been doing without issue were now causing problems. We also noticed an evolution of the types of research happening on KLONE. The HYAK community diversified from traditional HPC workflows (e.g., simulations) into more data-intensive areas like data science (e.g., R jobs), deep learning, and artificial intelligence research. We accelerated our discussions with storage vendors: in a few short months, an expansion went from an eventuality to an immediate and pressing need. Still, we tried several last-minute optimizations to see if we could prevent spending all that money. We are serious about our fiduciary duty, as stewards of this research platform, to provide the most value for the HYAK community with the dollars we are entrusted with. We knew a storage upgrade for KLONE would cost hundreds of thousands of dollars and we needed absolute certainty that we couldn’t engineer a way around that expense.

The storage on KLONE (i.e., mmfs1 or gscratch) might pretend to be a mere folder or directory, but in truth it’s an abstraction of a highly complex system. To provide cost-effective, high-performance storage, a small high-speed NVMe "flash" layer acts both as a write buffer for the slower spinning disks–which make up the vast majority of the cluster’s capacity–and as a high-speed "cache" for recently and frequently accessed small files. While presented as a single folder to the researcher, behind the scenes the storage cluster moves data between these tiers to balance performance. As seen in the figure above, when the flash layer reaches 80% capacity, a process begins to drain it by moving less frequently used files to the spinning-disk layer until the flash layer reaches 65% capacity. You might also notice that despite our precautions and monitoring, as of April 9, 2022, we were no longer able to migrate data from flash to spinning disks faster than our users were writing. This was the final deciding factor for us, and we initiated our long-standing plan to upgrade the storage for KLONE.

This necessary investment to upgrade storage will double both the maximum input-output operations-per-second (IOPS) and throughput (storage bandwidth), providing much needed overhead for current workflows as well as accommodating future growth. We are excited for this upgrade – and are doing everything we can to expedite its deployment – but due to the sheer amount of hardware we’re purchasing, we’ve been swept up in the pandemic-induced global supply chain crunch. Our vendors have predicted that the end of July is the worst-case scenario, but that a June delivery is also possible. We will update the HYAK community as we know more. As always, we welcome any questions: if you want to speak with us about something, send an email to the HYAK team via help@uw.edu and we’ll follow up with you.

See also:

OS upgrade for KLONE

February 8, 2022 · 3 min read

Nam Pho

Director for Research Computing

note

KLONE has a new OS: we upgraded from CentOS 8 to Rocky Linux.

Background#

In late 2020, while we were building the current-generation cluster, KLONE, our previous-generation cluster, MOX, was running CentOS 7 – which was nearing end-of-life support. We used the transition to KLONE as an opportunity to deploy CentOS 8, the world’s most popular OS in academic research computing environments. Unfortunately, around the time we were wrapping up KLONE’s software stack, the CentOS project announced [1, 2] a transition of their own: Red Hat unilaterally terminated the development of CentOS as an open-source version of Red Hat Enterprise Linux (RHEL). CentOS would become an upstream version of RHEL – in other words, more experimental and ultimately less stable.

As the dust from this announcement settled, a consensus emerged: Rocky Linux, led by the initial founder of the CentOS project, Greg Kurtzer, would become the CentOS successor.

The Transition#

Fast-forward to late 2021: after our summer ‘21 launch of KLONE, and our fall ‘21 cluster capacity expansion, we were finally able to turn our attention to the CentOS to Rocky migration. And just in time, too, because CentOS 8–the operating system we deployed just months earlier–would be officially unsupported after December 31, 2021.

Our goal was to make this OS transition as smooth and unnoticeable to our users as possible. After all, this is our mission: we take care of the tech so that you can take care of the science. Rocky, like CentOS, is intended to be a bug-for-bug, open-source version of RHEL, and with its talented, globe-spanning team of developers, we were confident that the impact of this transition would be minimal.

We began the transition with our backend during the December ‘21 maintenance: the KLONE head node, our SLURM scheduler, was successfully migrated to Rocky 8. So far so good! During our next maintenance, January ‘22, we migrated all the compute node images to Rocky. A handful of users reported code-compiling issues, which we were able to resolve, but otherwise it was uneventful. We took extra care on the final piece of the Rocky migration–the login nodes–due to their accessibility from the wider internet. And, as of today’s maintenance, we are excited and relieved to report that KLONE is now a 100% Rocky cluster! 🥳

Summary#

The HYAK team was forced to revisit a major OS migration, mere months after the initial launch of KLONE. This is highly unusual–and no small feat–but we have prevailed. We deployed a widely-supported, open-source OS with enterprise-level stability, while remaining cost-effective to the research community at the University of Washington. With this work behind us, we’ve arrived at a sustainable platform for the life of the KLONE cluster. We’re excited for the future of KLONE, and excited to redirect our time back to feature development.

We want to give a huge thank-you to our users for their patience during this migration period. Spoiler alert: Rocky won and it’s a good thing!

Fairshare improvements on KLONE

October 12, 2021 · 4 min read

Nam Pho

Director for Research Computing

note

We have adjusted legacy fairshare-related settings to account for GPUs and large memory contributions and usage in order to help more fairly allocate checkpoint resources.

History#

In fall 2019 (almost two years ago to the day) the HYAK team received our first Turing generation GPU node. HYAK has had a modest GPU footprint in the past as far back as a decade ago with the first generation cluster (called "IKT") and its pre-Pascal generation cards. In 2015 we acquired a smaller test bed of Pascal generation GPUs for the second generation cluster (called "MOX"). There were never more than a dozen GPUs in either the IKT or MOX clusters, but the introduction of Turing GPUs marked a resurgence of interest in these accelerators among the UW research community. In the last two years, we've substantially expanded our capabilities to over 300 GPUs.

Background#

HYAK clusters work on a "condo" model: labs are able to utilize their contributed hardware on-demand as well as take advantage of idle capacity from other groups' hardware via the checkpoint (ckpt) partition. Your checkpoint priority — or "fairshare" in SLURM scheduler parlance — is weighted such that your fairshare is directly proportional to your lab’s contribution to the cluster. In the MOX days, GPU users tended to stay within their contributed hardware partitions and rarely made use of checkpoint. We attributed this to a mental shift: students were used to using a single resource, like a desktop computer, rather than a shared cluster of computing resources. However, with the migration to the third generation HYAK cluster (called "KLONE") and its new QoS scheduling system and the increasing comfort of students using a shared platform, GPU utilization in the checkpoint partition has increased as well. This is a good thing: we want groups to benefit from their HYAK membership in the cluster and take advantage of idle cluster resources beyond their initial hardware contributions. This is a primary tenet of our social contract with the HYAK community: as a node contributor to the cluster, you have access to idle resources of the whole cluster.

Problem#

Fairshare was simpler to calculate in the pre-GPU days because our infrastructure was homogeneous: one node contributed to the cluster equaled one fairshare unit. During the last two years of exponential GPU adoption on HYAK, the fairshare calculation did not evolve: 1 HPC node was the same as 1 GPU node at 1 fairshare unit. This didn’t hold up because a GPU node can cost 4 to 8 times (or more) as much as a traditional HPC node. The result was that labs with GPU or other specialty (e.g., high-memory) nodes tended to have smaller fairshares compared to groups with the same dollar investment but only in traditional CPU nodes. In practice, this meant these GPU users often directly competed for resources with non-GPU jobs in the checkpoint partition on a non-level playing field.

Solution#

Taking into consideration all of this information, as well as the fact that you can request as little as 1 GPU or 1 CPU from the scheduler, we have adjusted the fairshare calculations as follows:

Financially: 1 GPU card is roughly equivalent to 40 CPU cores (on a dollar basis), therefore the cost normalization is 40:1 in favor of GPUs.
Scarcity: 1 server typically holds 8 GPU cards or 40 CPU cores, therefore the scarcity normalization is 5:1 in favor of GPUs.
Combining the financial and scarcity considerations in the points above, the final weighting is 200:1 in favor of GPUs. In other words, 1 GPU card is worth 200 times more than a single CPU core in the eyes of the scheduler and factored into your checkpoint fairshare. Please note that this example only applies to the higher GPU memory cards (i.e., gpu-rtx6k) while less expensive GPUs have commensurately less weight.
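
The arithmetic behind the 200:1 figure can be written out as a short sketch. The constant names and the helper `ckpt_weight` are hypothetical and only restate the numbers given above for the higher-memory GPU cards; this is not the scheduler's actual configuration.

```python
# Figures taken from the text above (higher-memory GPU cards only).
CPU_CORES_PER_GPU_BY_COST = 40   # 1 GPU card costs about as much as 40 CPU cores
GPUS_PER_SERVER = 8              # a server typically holds 8 GPU cards ...
CPU_CORES_PER_SERVER = 40        # ... or 40 CPU cores

# Scarcity: how many CPU cores fit in the space of one GPU card.
SCARCITY_FACTOR = CPU_CORES_PER_SERVER // GPUS_PER_SERVER      # 5

# Final weighting of one GPU card relative to one CPU core.
GPU_WEIGHT = CPU_CORES_PER_GPU_BY_COST * SCARCITY_FACTOR       # 200


def ckpt_weight(cpu_cores=0, gpus=0):
    """Weight of a contribution in CPU-core equivalents (illustrative only)."""
    return cpu_cores + gpus * GPU_WEIGHT
```

Under these assumptions, a 40-core CPU node carries a weight of 40, while a single high-memory GPU card carries a weight of 200.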

Summary#

With the October monthly maintenance today we have introduced a new fairshare weighting system on the KLONE cluster's checkpoint (ckpt) partition that commensurately acknowledges GPU labs for their contributions to the HYAK community. This has no impact on jobs submitted to non-ckpt partitions.

Migrating from MOX to KLONE

May 1, 2021 · 4 min read

Nam Pho

Director for Research Computing

If you were previously a proficient MOX user and now find yourself on KLONE, what's new / different? This is a high-level summary; please consult the documentation [link] for more details.

note

Updated August 10, 2021 to include additional information specific for GPU users.

Login#

You previously logged in to mox.hyak.uw.edu; now it's klone.hyak.uw.edu.
As a reminder, login nodes are only for connecting to the cluster, navigating the cluster file system, and submitting jobs. This applies to both KLONE and MOX. Do not compile codes on the login node or run any programs that require significant compute (get a session with SLURM).

Data Transfer#

Only use the login node to transfer data on KLONE. On MOX you'd have used a build node or could have used the login node if it wasn't very computationally heavy.

Storage#

The path to lab storage is still /gscratch/mylab on both KLONE and MOX. You'll need to copy any data you want to continue using from MOX to KLONE.
Home directories are still 10GB per user, same on both clusters.
Scrubbed exists on KLONE just as it did on MOX at /gscratch/scrubbed. This is a free-for-all space on both clusters where files are automatically deleted after 21 days.
Some new benefits of the KLONE storage compared to MOX:
- There are snapshots for gscratch! Look inside the /gscratch/mylab/.snapshots folder for a copy of your lab folder once an hour, every hour, for 24 hours. This is not a backup copy nor a replacement for version management (e.g., git) but useful for retrieving recent versions or something accidentally deleted. This is currently disabled.
- More storage! Previously you received 500GB or 0.5TB of gscratch quota per node (or pair of GPUs) contributed to MOX. Now on KLONE we've doubled your associated storage quota! For example, 2 nodes on MOX would mean 1TB of gscratch but 2 nodes on KLONE now means 2TB of gscratch. If you had an 8 x GPU node on MOX you would have received 2TB of gscratch but an 8 x GPU node on KLONE now means 4TB of gscratch.
- It's faster! We've had reports of performance averaging a 30% speed up, all else being equal. There is nothing you need to do aside from using KLONE instead of MOX.
- It's faster than fast! While KLONE storage is faster than MOX storage overall, gscratch on KLONE is further turbo charged with an NVMe flash based tier. NVMe flash is among the fastest storage mediums you can get, and it is a further differentiating benefit of using gscratch rather than scrubbed on KLONE.
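
The quota examples in the list above can be reproduced with a short sketch. The function `gscratch_quota_tb` is hypothetical; it simply encodes the stated rule of 0.5TB per node (or pair of GPUs) on MOX and double that on KLONE.

```python
# TB of gscratch quota per contributed node (or pair of GPUs), per the text.
TB_PER_UNIT = {"mox": 0.5, "klone": 1.0}


def gscratch_quota_tb(cluster, nodes=0, gpus=0):
    """Return the gscratch quota in TB for a contribution.

    nodes: number of non-GPU nodes contributed.
    gpus:  number of GPUs contributed; every pair of GPUs counts as one unit.
    """
    units = nodes + gpus // 2
    return units * TB_PER_UNIT[cluster.lower()]
```

For instance, `gscratch_quota_tb("klone", nodes=2)` gives 2.0 and `gscratch_quota_tb("klone", gpus=8)` gives 4.0, matching the examples in the text.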

Compute#

When submitting a SLURM job, whether interactive (i.e., salloc) or batch (i.e., sbatch) you'll want to first decide which account to use. This is the group you're part of. You can run the command groups to see your affiliated accounts and run hyakalloc to see all the resources (e.g., compute cores, memory, GPUs) used and available associated with each affiliated account.
Then decide if you want to run this job to count under your resource allocation by submitting to the compute partition (i.e., -p compute) or if you want this job to use idle resources from other groups across the cluster using the checkpoint partition (i.e., -p ckpt).

Non-standard partitions. Run sinfo to see the list of all possible partitions. This is only needed if your group contributed non-standard nodes (e.g., high memory, GPUs) and you need to identify the appropriate partition names to get immediate use. Otherwise, you'd only be able to get them in a checkpoint capacity. For GPU users this is currently either the gpu-2080ti or the gpu-rtx6k partitions for 11GB and 24GB of GPU memory cards, respectively.
There is no build node on KLONE. Get an interactive session (e.g., salloc) under an existing account and partition combination you have access to.
All nodes have internet now on KLONE. Do all data transfers to and from KLONE on the KLONE login nodes, which have dual 40 Gbps uplinks to the internet. While the compute nodes on KLONE have internet routing now, they are bottlenecked at 1 Gbps and so are not suitable for big data transfers.
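
To give a rough sense of why the link speed matters, the sketch below computes a best-case transfer time from the bandwidth figures above. The function `min_transfer_seconds` is hypothetical, it assumes decimal terabytes and a single fully utilized link, and it ignores protocol overhead and the speed of the remote end, so real transfers will be slower.

```python
def min_transfer_seconds(size_tb, link_gbps):
    """Lower bound on transfer time for size_tb terabytes over a link_gbps link.

    Assumes 1 TB = 10**12 bytes and a fully utilized link (idealized).
    """
    bits = size_tb * 8 * 10**12
    return bits / (link_gbps * 10**9)


# 1 TB over a compute node's 1 Gbps path vs. one 40 Gbps login-node uplink.
compute_node_s = min_transfer_seconds(1, 1)    # 8000 seconds, over 2 hours
login_node_s = min_transfer_seconds(1, 40)     # 200 seconds, a few minutes
```

Even as an idealized bound, the 40x gap between the two paths shows why large transfers belong on the login nodes.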

Software#

Singularity containers work the same on both clusters, we encourage this when possible. Refer to our container documentation [link].
Modules are updated to the latest versions of the core software that the HYAK team maintains (e.g., gcc, Intel, Matlab). Refresh yourself about modules [link].
If neither Singularity nor existing modules work for you, you may have to re-compile your codes on KLONE. "contrib" modules work differently now on KLONE vs MOX; please check out the details [link].

Klone Soft Launch

February 25, 2021 · 4 min read

Nam Pho

Director for Research Computing

February 25, 2021#

The UW research computing team celebrates the soft launch of project KLONE, the 3rd generation HYAK supercomputer. Welcome to those researchers invited to participate in the early access program 🥳 🎉

caution

There will be weekly maintenance days on Tuesday during the soft launch period after which we will move back to our regular cadence of monthly maintenance windows.

The user documentation [link] has been updated to reflect the changes and new features of KLONE but this will be an ongoing process.

Compute#

Soft launch with 1,920 compute cores over 48 nodes:
- 28 x mem1 nodes (192GB of memory each) in the compute partition,
- 4 x mem2 nodes (384GB of memory each) in the compute-bigmem partition,
- 16 x mem3 nodes (768GB of memory each) in the compute-hugemem partition.
build nodes no longer exist on klone as they did on mox. All instances have the potential to be interactive and all have internet routing by default (even non-interactive jobs).

Storage#

gscratch on klone is 1.4PB total capacity with a new 500TB NVMe flash tier. Data tiering happens automagically, if you use a file frequently it will be moved to the faster storage.
Storage quota is still charged back at the same rate ($10 / TB / month). Researchers receive 1TB per node purchased and contributed to klone.

Data#

gscratch is not backed up; that is the responsibility of the researcher (e.g., LOLO, the cloud, external hard drive). Feel free to email us if you have any questions.
While all nodes have internet access now, transfer data using the login nodes. Login nodes have full 2 x 40 Gbps bandwidth. If you transfer using a compute node interactive session you are limited to 1 x 1 Gbps connection.

Software#

modules works the same as it did on mox. This is an improved implementation called LMOD on klone compared to environment modules on mox.
We provide the basic compilers (e.g., GNU, Intel) as modules.
The HYAK team is encouraging a container first world (i.e., use Singularity).

March 3, 2021#

The updated total is 3,840 cores and 96 nodes on klone.

Compute#

Compute has doubled by adding another rack to klone, an additional 1,920 compute cores over 48 nodes:
- 44 x mem1 nodes (192GB of memory each) in the compute partition,
- 2 x mem2 nodes (384GB of memory each) in the compute-bigmem partition,
- 2 x mem3 nodes (768GB of memory each) in the compute-hugemem partition.

Software#

We created a module for cmake.

March 5, 2021#

Storage#

Implemented usage_report.txt files in the base folder of /gscratch/yourlab/ that are updated once an hour to reflect both your block quota and inode capacity usage. This is similar to the gscratch experience on the MOX cluster.

Website#

We migrated our site from https://UWrc.github.io to its new home at https://hyak.uw.edu.

March 9, 2021#

Storage#

Snapshots are here! We are piloting once an hour for 24 hours for every lab storage folder under /gscratch/. Check out the updated documentation here on how to access past snapshots.

Software#

We created more LMOD software modules:
- Matlab R2020b [docs]
- OpenMPI-4.1.0

March 12, 2021#

LMOD software modules:
- Intel has bundled their software suite (e.g., compiler, MPI) as oneAPI and we created this module (i.e., module load intel/oneAPI).
- There is now a "contrib" framework for groups to store their shared codes separately from their /gscratch/labname/ data. You can get 100GB of storage to compile codes at /sw/contrib/labname-src/ and then put your LMOD module file in /sw/contrib/modulefiles/labname/. Your module will then appear when anyone runs module avail. This is created upon request, so if you'd like to opt your group in, please let us know.

April 13, 2021#

Things have been going steady the past week and changes are coming less frequently. We are now increasing the time between maintenance periods on klone from weekly on Tuesdays to monthly and aligning it with the mox maintenance on the 2nd Tuesday of every month.

That wraps up our klone soft launch blog updates here; other updates will appear on our HYAK users mailing list. Don't forget to subscribe: instructions are at the bottom of this page.
