2 posts tagged with "apptainer"

May 2024 Maintenance Details

Kristen Finch

HPC Staff Scientist

Hello Hyak Community,

Thanks again for your patience with our monthly scheduled maintenance. We implemented some notable improvements today.

klone node image: Over the past few weeks, you may have noticed some klone instability. This was the result of behind-the-scenes storage upgrades that inadvertently affected the existing cluster automation. At the time, we introduced a temporary fix to get the cluster back online, but with today’s maintenance we implemented a more comprehensive fix.

InfiniBand firmware: The klone cluster is built on the InfiniBand HPC interconnect for node-to-node communication. While klone originally launched with the HDR generation of InfiniBand, we have since upgraded, partway through klone's lifetime, to an HDR-NDR hybrid interconnect. NDR InfiniBand is required to support the latest compute slices we offer. We updated the firmware on our NDR switches following vendor recommendations for increased stability.

Apptainer on MOX: Apptainer (formerly Singularity) is the rootless containerization solution we provide on both Hyak clusters. Apptainer version 1.3.1 was deployed on both klone and MOX. As a reminder, on klone Apptainer is accessed through a module and is only available on compute nodes after running module load apptainer. On MOX, Apptainer is default software, so Apptainer commands (for example, apptainer --version) can be run directly after starting an interactive job.
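The per-cluster difference described above can be summarized in a small helper. This is only a minimal sketch, not an official Hyak tool: the function name apptainer_check_commands and the lowercase cluster labels are hypothetical, and the only shell commands it returns are the two mentioned above (module load apptainer and apptainer --version).

```python
from __future__ import annotations


def apptainer_check_commands(cluster: str) -> list[str]:
    """Return the shell commands to run on a compute node to confirm Apptainer works.

    Hypothetical helper for illustration only; cluster names are matched
    case-insensitively and surrounding whitespace is ignored.
    """
    name = cluster.strip().lower()
    if name == "klone":
        # On klone, Apptainer is provided through a module on compute nodes.
        return ["module load apptainer", "apptainer --version"]
    if name == "mox":
        # On MOX, Apptainer is default software, so no module load is needed.
        return ["apptainer --version"]
    raise ValueError(f"unknown cluster: {cluster!r}")


if __name__ == "__main__":
    for cluster in ("klone", "mox"):
        print(f"{cluster}: " + " && ".join(apptainer_check_commands(cluster)))
```

In both cases the commands are meant to be run inside a job on a compute node, not on a login node.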

Training Opportunities: COMPLECS (San Diego Supercomputer Center) is hosting an Intermediate Linux Shell Scripting online workshop on Thursday, May 16 at 11:00 am Pacific Time. Register here.

Our next scheduled maintenance will be Tuesday, June 11, 2024. Stay informed by joining our mailing list. Sign up here.

Questions? If you have any questions for us, please reach out to the team by emailing help@uw.edu with Hyak in the subject line.

August maintenance completed

Michael Wanek

HPC Engineer

August's scheduled maintenance is complete and the Hyak clusters have resumed normal operations: logins have been re-enabled and jobs are already running.

This month's maintenance actions were our standard fare: node image and firmware updates. We keep our maintenance all-clear emails as brief as possible, but here's the rundown:

Node image updates

Our compute nodes are stateless: their operating system is loaded into memory over the network, so we keep the node images as small as possible. This means that when we update the images, we're actually rebuilding them from scratch. All the operating system packages we include in our template are installed as their latest versions.

Any software on the node image beyond system packages is managed separately, which brings me to the only major update this month:

We upgraded Apptainer from 1.1.8 to 1.2.2. The update from 1.1 to 1.2 implements quite a few new features, modifications to default behavior, and other changes. You can read about them in the Apptainer 1.2.0 Patch Notes.
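If a workflow depends on behavior introduced in the 1.2 series, it can be useful to check the installed version before running. The following is a minimal sketch under one stated assumption: that the output of apptainer --version contains a dotted version number, as in "apptainer version 1.2.2". The function names are hypothetical and not part of Apptainer.

```python
import re


def parse_apptainer_version(output: str) -> tuple:
    """Extract (major, minor, patch) from the text printed by `apptainer --version`.

    Assumes the output contains a dotted version such as "apptainer version 1.2.2";
    any suffix after the patch number is ignored.
    """
    match = re.search(r"(\d+)\.(\d+)\.(\d+)", output)
    if match is None:
        raise ValueError(f"no version number found in: {output!r}")
    return tuple(int(part) for part in match.groups())


def meets_minimum(output: str, minimum: tuple) -> bool:
    """Return True if the reported version is at least `minimum`."""
    # Tuples of integers compare element by element, so 1.10.0 sorts after 1.2.2.
    return parse_apptainer_version(output) >= minimum


if __name__ == "__main__":
    for text in ("apptainer version 1.1.8", "apptainer version 1.2.2"):
        print(text, "->", meets_minimum(text, (1, 2, 0)))
```

Comparing integer tuples avoids the pitfall of comparing version strings directly, where "1.10.0" would sort before "1.2.2".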

Node firmware updates

Since firmware updates shouldn't impact cluster users, we normally don't even mention them. That said, this was the main part of our work today. We updated the firmware (including BIOS & BMCs) for our backend nodes, login nodes, and all 400+ compute nodes.