June Maintenance Complete - Infrastructure Improvements to Hyak Klone

Kristen Finch
Director of Research Computing Solutions

Our June maintenance is now complete for both clusters.

  • Tillicum maintenance was completed on June 9 and consisted primarily of routine operating system updates, security patching, and platform maintenance.
  • Klone maintenance was completed on June 17 and included significant backend infrastructure modernization work. While most of these changes are intentionally transparent to users, they improve system reliability, simplify future development, reduce technical debt, and provide a foundation for new capabilities planned for Hyak services.

What Changed on Klone?

Operating System and Filesystem Updates

Klone received a kernel upgrade and an update to the GPFS client software.

These updates improve overall system stability and address several issues we have been tracking, including login node stability concerns. Keeping core system software current also ensures continued vendor support and access to future platform improvements.

Intel oneAPI Module Usage

The default Intel oneAPI module has been updated from intel/oneAPI/2021.1.1 to intel/oneAPI/2026.0.0. This change was announced during the May maintenance. We made it because we observed MPI-related segmentation faults with intel/oneAPI/2021.1.1 and intel/oneAPI/2023.2.1 following the Hyak Klone kernel upgrade.

With this change, module load intel/oneAPI loads intel/oneAPI/2026.0.0 by default. If your workflow still requires the previous version, load it explicitly with module load intel/oneAPI/2021.1.1.

Please note that intel/oneAPI/2026.0.0 has removed the legacy Intel compilers, including icc, icpc, and ifort. Users are encouraged to migrate to the newer active compilers such as icx, icpx, and ifx when possible.
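For build scripts that hard-code the legacy compiler names, the replacements listed above can be sketched as a small shell helper. The function name modern_intel_compiler is hypothetical and is not something provided on Hyak; it only encodes the icc, icpc, ifort to icx, icpx, ifx pairing.

```shell
# Hypothetical helper (not provided on Hyak): print the current Intel
# compiler that replaces a legacy one.
modern_intel_compiler() {
    case "$1" in
        icc)   echo icx ;;    # C
        icpc)  echo icpx ;;   # C++
        ifort) echo ifx ;;    # Fortran
        *)     echo "$1" ;;   # any other name is passed through unchanged
    esac
}
```

After loading the module, a build script could then set its compiler with something like CC=$(modern_intel_compiler icc). Flags and behavior can differ between the legacy and current compilers, so treat this as a starting point for migration rather than a drop-in replacement.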

For workflows that must continue using the legacy Intel compilers but also require MPI, you may use the legacy Intel compilers together with the MPI stack provided by ompi/4.1.6-intel, which is OpenMPI compiled with the legacy Intel compilers.

If you have any questions about this change, please contact Research Computing support.

Slurm Account Structure Modernization

One of the largest changes during this maintenance was a redesign of Klone's internal Slurm accounting hierarchy.

Historically, each project was represented by a single Slurm account. Each group resource allocation is now represented by one Slurm account per device type, so users may notice that they belong to more Slurm accounts than before.

For users, job submission remains exactly the same. Existing submission workflows, batch scripts, and allocation usage practices do not need to change.

The primary differences are internal:

  • Simpler partition and resource management
  • Improved usage tracking and reporting accuracy
  • Removal of dependencies on deprecated software components
  • Improved flexibility for future scheduler enhancements
  • Better alignment between Klone and Tillicum administration

These changes make the platform easier to operate and maintain while positioning us to introduce future capabilities with less disruption.

Preserving Existing User Workflows

Following the Slurm modernization, the account query parameters of commands such as squeue and sacct would normally have changed.

To preserve continuity, we implemented compatibility wrapper scripts that maintain the familiar query formats most users expect.

As a result, the majority of users should see little or no difference in their day-to-day workflows.

Advanced users should note that invoking these commands through their full system paths bypasses the compatibility wrappers, and --account queries will then expect the new Slurm subaccount naming scheme.
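To check whether a script is likely to bypass the wrappers, it can help to see what the shell actually runs for a bare command name. The sketch below uses the POSIX command -v builtin; the helper name show_resolution is hypothetical, and how the wrappers are installed on Klone (a script earlier in PATH, an alias, or a shell function) is an assumption this example does not depend on.

```shell
# Hypothetical helper: report what the shell will run for a command name.
# `command -v` prints a path for an executable, or the alias or function
# definition or name if one shadows it.
show_resolution() {
    command -v "$1" || echo "$1: not found"
}
```

For example, compare the output of show_resolution squeue with the path your script calls. If the script uses a different full path than the one reported, it may be calling the underlying binary directly rather than the wrapper.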

Improvements to Group and Access Management

Over the last week, we completed a significant restructuring of Hyak account and permissions management for both Klone and Tillicum clusters.

Historically, some UW Groups served multiple purposes simultaneously, including:

  • Granting cluster login access
  • Authorizing Slurm account usage
  • Providing access to shared filesets and storage resources

Over time, this created complexity and made it difficult to clearly distinguish between computing access and storage permissions.

The new model separates these responsibilities into distinct functions. This separation makes access management more predictable, more secure, and easier to understand.

Login and Compute Access

Users who need to access a cluster must be members of the appropriate cluster login group.

Cluster login groups follow the group stem pattern u_hyak_<CLUSTER>_<GROUP>.
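As a quick illustration of the stem pattern, the sketch below fills in the two placeholders. The helper name login_group_name and the example values klone and mylab are hypothetical; the exact spelling and casing of cluster and group names in UW Groups is an assumption.

```shell
# Hypothetical helper: build a cluster login group name from the stem
# pattern u_hyak_<CLUSTER>_<GROUP>. Arguments are used exactly as given.
login_group_name() {
    printf 'u_hyak_%s_%s\n' "$1" "$2"
}
```

Calling login_group_name klone mylab prints u_hyak_klone_mylab, which is the shape of group a member manager would look for when adding or removing users.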

Storage and File Permissions

In most cases, Klone and Tillicum allocations are accompanied by storage allocations on that cluster, and enrollment in the cluster login group automatically grants permissions to the associated filesets and storage resources.

However, there are limited special cases:

  • Subgroups whose file access is restricted to individual group members
  • Shared compute and storage allocations (e.g., shared department resources)
  • Storage-only allocations

Separate permissions groups now control access to these special-case storage resources. Membership in a permissions-only group grants access to the associated Linux group and storage resources but does not automatically provide cluster login access.

What Does This Mean for Group Member Managers?

Most users will not notice any changes. We have thoroughly reviewed this change, and it has already been in place for over one week with zero user-reported impact.

Group member managers should be aware that login access and storage-only permissions are now managed separately. However, most member managers will only add or remove members from the cluster login group, u_hyak_<CLUSTER>_<GROUP>.

Only member managers responsible for the special cases described above will edit the permissions-only groups, u_hyak_<GROUP>.

UW Groups display names and descriptions have been updated to guide group member managers on the type of access that is granted with group membership.

Research Computing staff will continue creating and maintaining the underlying UW Group structure. Researchers and project administrators will continue managing membership in the groups assigned to their projects.

These changes also establish a consistent framework that will support future improvements to project onboarding, account management, and access administration.

Looking Ahead

Although much of this maintenance was focused on backend infrastructure, it represents one of the largest administrative and scheduling modernization efforts performed on Klone in recent years.

Most users should experience no disruption beyond the maintenance window itself, which is exactly the outcome we were aiming for.

Our next maintenance window is scheduled for Tuesday, July 14, 2026.

As always, if you encounter unexpected behavior or have questions about any of these changes, please contact the Research Computing team.

Happy Computing,

Hyak Team