November 2024 Maintenance Details

Kristen Finch

Kristen Finch

HPC Staff Scientist

Hello Hyak Community,

Our November maintenance is complete. Thank you for your patience while we make package updates to node images, ensuring the security and behavior you expect from Hyak klone.

The next maintenance will be Tuesday December 10, 2024.

Noteable Updates#

  • We updated the GPU drivers to version 565.57.01 to patch security vulnerabilities CVE-2024-0117 through CVE-2024-0121 and CVE-2024-0126.
  • We migrated user Home directories to Solid State Drives (SSDs). Users may notice increased speed of software stored in Home directories.

Upcoming Training#

Hyak: Scheduling jobs with Slurm workshop is on Thursday, November 14th, from 10 a.m. to 12 p.m. in the WRF Data Science Studio (UW Physics/Astronomy Tower, 6th Floor; 3910 15th Ave NE, Seattle, WA 98195), and will cover Hyak鈥檚 job scheduler Slurm, interactive jobs, batch jobs, and array jobs. REGISTER HERE!

Hyak: Introduction to Deep Learning workshop is on Tuesday, December 3rd, from 10 to 11:50 a.m. in CSE2 (Gates Center) Room 371. Participants will start with a computer vision example, training a model on a sample dataset, and then learn how to execute the training process on Hyak. REGISTER HERE!

Office Hours#

Zoom office hours will be held on Wednesdays at 2pm. Attendees need only register once and can attend any of the occurrences with the Zoom link that will arrive via email. Click here to Register for Zoom Office Hours

In-person office hours will be held on Thursdays at 2pm at the eScience Institute (address: WRF Data Science Studio, UW Physics/Astronomy Tower, 6th Floor, 3910 15th Ave NE, Seattle, WA 98195).

The Research Computing Club will be holding office hours fall term. In-person office hours will be held at the eScience Institute, WRF Data Science Studio.

OfficerDateTime
Sam Shin19 Nov2pm
Teerth Mehta3 Dec2pm

If you would like to request 1 on 1 help, please send a ticket to help@uw.edu with "Hyak Office Hour" in the subject line to coordinate a meeting.

October 2024 Maintenance Details

Kristen Finch

Kristen Finch

HPC Staff Scientist

Hello Hyak Community,

Our October maintenance is complete. Thank you for your patience while we make package updates to node images, ensuring the security and behavior you expect from Hyak klone.

The next maintenance will be Tuesday November 12, 2024.

New Tools Documentation#

Our research computing interns have been hard at work adding documentation for new user tools that might help optimize your research computing. Click the links below to review the docs for these tools.

Squash Fuse - SquashFS packages multiple small files into a single read-only, compressed filesystem, reducing metadata calls and improving performance. This minimizes server load, enhancing throughput and efficiency in handling storage requests.

Use case on Hyak: In HPC, datasets often consist of numerous small files, which can lead to performance bottlenecks due to excessive metadata operations. By utilizing SquashFS, HPC applications can significantly reduce metadata overhead, improving data access speeds and enhancing overall system performance, particularly in large-scale distributed storage systems.

Checkpointing with DMTCP - DMTCP is a tool to transparently checkpoint and restart jobs, saving it to disk to be resumed at a later time. It requires no changes to application code, allowing easy use. Using DMTCP with your code allows checkpointing at regular intervals so if your job is pre-empted or reaches the time limit, it will resume from its last checkpoint.

Use case on Hyak: DMTCP offers a solution for folks who would like to use Hyak's ckpt partitions, but have jobs that exceed the ckpt time limits of 5 hours for CPU-noly jobs and 8 hours for GPU-only jobs. Checkpointing with DMTCP facilitates efficient use of ckpt resources, allowing higher throughput for your jobs.

Tools for Kopah Storage Users - We have installed Command Line Interface tools like s3cmd and s5cmd on klone and provide insctructions for using Python library boto3 for Kopah interaction and retreival to build Kopah S3 storage usage into your research computing applciations on Hyak.

If you have any issue using these tools, please open a ticket by emailing help@uw.edu with "Hyak" in the subject line. We appreciate any feedback about how to improve ease of use for tools presented in our documentation.

Upcoming Training#

We have planned 2 Hyak-specific trainings for this Fall (more to come, stay tuned). These trainings will be held in person and will not be recorded since recorded materials are already publicly accessible. Capacity is limited, follow the links below to register today to guarantee your spot.

Hyak: Containers are your friend - Monday October 28 10am - 12pm.

Hyak: Scheduling Jobs with Slurm - Thursday November 14 10am - 12pm

In the first hour and a half, we will go over content. The last 30 minutes will be reserved for questions.

Location: eScience Classroom; WRF Data Science Studio, UW Physics/Astronomy Tower, 6th Floor; 3910 15th Ave NE, Seattle, WA 98195

Keep an eye on your inbox for updates about additional trainings this fall.

Fall Office Hours#

Hyak HPC Staff Scientist and Facilitator, Kristen Finch, will be holding office hours fall term.

Zoom office hours will be held on Wednesdays at 2pm. Attendees need only register once and can attend any of the occurrences with the Zoom link that will arrive via email.

Click here to Register for Zoom Office Hours

In-person office hours will be held on Thursdays at 2pm at the eScience Institute (address: WRF Data Science Studio, UW Physics/Astronomy Tower, 6th Floor, 3910 15th Ave NE, Seattle, WA 98195). Click here to RSVP for in-person Office Hours.

Click here to visit the eScience Office Hours page to see additional eScience office hours including AI/ML, R, Earth Data, and Python (not available to help with Homework).

The Research Computing Club will be holding office hours fall term. In-person office hours will be held at the eScience Institute (address: WRF Data Science Studio, UW Physics/Astronomy Tower, 6th Floor, 3910 15th Ave NE, Seattle, WA 98195).

OfficerDateTime
Brenden Pelkie16 Oct1pm
Nels Schimek23 Octipm
Nels Schimek6 Nov1pm
Sam Shin19 Nov2pm
Teerth Mehta3 Dec2pm

If you would like to request 1 on 1 help, please send a ticket to help@uw.edu with "Hyak Office Hour" in the subject line to coordinate a meeting with Kristen.

Please don't hesitate to reach out to the Hyak team with issues and feedback by opening a tickey by emailing help@uw.edu with "Hyak" in the subject.

Have a great October!

Happy computing,

Hyak Team

September 2024 Maintenance Details

Kristen Finch

Kristen Finch

HPC Staff Scientist

Hello Hyak Community,

Thanks again for your patience with our monthly scheduled maintenance. During this maintenance session, we were able to provide package updates to node images to ensure compliance with the latest operating system level security fixes and performance optimizations. Of note, the Nvidia GPU driver on Klone has been updated to the latest production datacenter release, version 560.35.03.

The next maintenance will be Tuesday October 8, 2024.

AMD Math libraries#

In June we announced the addition of AMD Nodes and Slices to klone which make up our generation 2 or g2 collection of resources. Click here to read more about the difference between our g1 and g2 resources. During August, we installed a new AMD compiler suite, AOCC, along with specialized math libraries like AOCL, OpenBLAS, and ScaLAPACK as modules to make the most of this upgade. The new modules are useful on all partitions. The AOCC and AOCL modules are particularly relevant for partitions cpu-g2, cpu-g2-mem2x, and ckpt-g2. These tools are designed to optimize performance on AMD processors, speeding up complex mathematical computations. Whether you're working on simulations, data analysis, or any number-crunching tasks, these libraries may help ensure you get faster, more efficient results. If you're looking to boost your workflow, it's worth exploring how these libraries can benefit your projects. Here are the names of the new modules:

aocc/4.2.0
aocl/4.2.0
openblas/0.3.28
scalapack/2.2.0

In our benchmarking tests, performance of these libraries was similar on g1 and g2 CPUs for each math library, regardless of architecture. The best library performer on AMD CPUs is AOCC+AOCL, and for Intel CPUs it鈥檚 OpenBLAS+ScaLAPACK:

Image displays graph indicating that the best library performer on AMD CPUs is AOCC plus AOCL, and for Intel CPUs it鈥檚 OpenBLAS plus ScaLAPACK

Fall Office Hours#

Hyak HPC Staff Scientist and Facilitator, Kristen Finch, will be holding office hours fall term. Zoom office hours will be held on Wednesdays at 2pm. Attendees need only register once and can attend any of the occurrences with the Zoom link that will arrive via email.

Click here to Register

In-person office hours will be held on Thursdays at 2pm at the eScience Institute (address: WRF Data Science Studio, UW Physics/Astronomy Tower, 6th Floor, 3910 15th Ave NE, Seattle, WA 98195). Click here to RSVP for in-person Office Hours.

Click here to visit the eScience Office Hours page to see additional eScience office hours including AI/ML, R, Earth Data, and Python (not available to help with Homework).

If you would like to request 1 on 1 help, please send a ticket to help@uw.edu with "Hyak Office Hour" in the subject line to coordinate a meeting with Kristen.

August 2024 Training Videos#

In case you missed it, we recorded the August 2024 Wednesday training sessions and posted them on the UW-IT YouTube channel under the playlist, "Hyak Training." Here are the links:

Keep an eye on your indox for updates about our Fall training schedule; training sessions are currently TBA. Trainings will be announced via the Hyak mailing list, click here to join the mailing list.

August 2024 Maintenance Details

Kristen Finch

Kristen Finch

HPC Staff Scientist

Hello Hyak Community,

Thanks again for your patience with our monthly scheduled maintenance. During this maintenance session, we were able to provide package updates to node images to ensure compliance with the latest operating system level security fixes and performance optimizations.

The next maintenance will be Tuesday September 10, 2024.

New self-hosted S3 storage option: KOPAH#

We are happy to announce the preview launch of our self-hosted S3 storage called KOPAH. S3 storage is a solution for securely storing and managing large amounts of data, whether for personal use or research computing. It works like an online storage locker where you store can files of any size, accessible from anywhere with an internet connection. For researchers and those involved in data-driven studies, it provides a reliable and scalable platform to store, access, and analyze large datasets, supporting high-performance computing tasks and complex workflows.

S3 uses buckets as containers to store data, where each bucket can hold 100,000,000 objects, which are the actual files or data you store. Each object within a bucket is identified by a unique key, making it easy to organize and retrieve your data efficiently. Public links can be generated for KOPAH objects so that users can share buckets and objects with collaborators.

Click here to learn more about KOPAH S3.

Who should use KOPAH?#

KOPAH is a storage solution for anyone. Just like other storage options out there, you can upload, download, and view your storage bucket with specialized tools and share your data via the internet. For Hyak users, KOPAH provides another storage option for research computing. It is more affordable than /gscratch storage and can be used for active research computing with a few added steps for retreiving stored data prior to a job.

Test Users Wanted#

Prior to September, we are inviting test users to try KOPAH and provide feedback about their experience. If you are interested in becoming a KOPAH test user, please each help@uw.edu with Hyak or KOPAH in the subject line.

Requirements:

  1. While we will not charge for the service until September 1, to sign up as a test user, we require a budget number and worktag. If the service doesn't work for you, you can cancel before September.
  2. We will ask for a name for the account. If your groups has an existing account on Hyak, klone /gscratch, it makes sense of the names to match across services.
  3. Please be ready to respond with your feedback about the service.

Opportunities#

PhD students should check out this opportunity for funding from NVIDIA: Graduate Research Fellowship Program

Questions? If you have any questions for us, please reach out to the team by emailing help@uw.edu with Hyak in the subject line.

July 2024 Maintenance Details

Kristen Finch

Kristen Finch

HPC Staff Scientist

Hello Hyak Community,

Thanks again for your patience with our monthly scheduled maintenance. During this maintenance session, we were able to provide package updates to node images to ensure compliance with the latest operating system level security fixes and performance optimizations.

The next maintenance will be Tuesday August 13, 2024.

Questions? If you have any questions for us, please reach out to the team by emailing help@uw.edu with Hyak in the subject line.

June 2024 Maintenance Details

Kristen Finch

Kristen Finch

HPC Staff Scientist

Hello Hyak Community,

Thanks again for your patience with our monthly scheduled maintenance. This month, we deployed new node resources that were purchased by various UW Researchers from across campus. These nodes are a little different, so we wanted to bring your attention to them and provide guidance on their use when they are idle with the checkpoint partition.

New G2 Nodes#

A new class of nodes have been deployed on klone which we are calling g2 because they are the second generation of nodes, and we will retroactively refer to the first generation nodes as g1. g2 CPU nodes feature AMD EPYC 9000-series 'Genoa' processors, and new GPU nodes featuring either NVIDIA L40 or L40S GPUs (with H100 GPUs possibly becoming available in the future). These nodes will join our community resources that can be used when idle (ckpt) under the new partitions:

  • ckpt-g2 for scheduling jobs on g2 nodes only.

  • ckpt-all for scheduling jobs on either g1 or g2 nodes.

  • ckpt will now schedule jobs on g1 nodes only.

    Please review our documentation HERE for specific instructions for accessing these resources. Additionally, please see the blog post HERE where we discuss additional considerations for their usage.

To accompany the new g2 node deployments, we are providing a new Open MPI module (ompi/4.1.6-2), which is now the default module when module load ompi is executed. Previous OpenMPI modules will cause errors if used with the AMD processors on the new g2 nodes due to how the software was compiled. ompi/4.1.6-2 (and any openmpi module versions we provide in the future) are compiled to support both Intel and AMD processors. If your MPI jobs are submitted to a partition that includes g2 nodes, you should use module load ompi to use the new module by default, or explicitly load ompi/4.1.6-2 (or a newer version in the future) via module load ompi/4.1.6-2.

If you have compiled software on g1 nodes, you should test them on g2 nodes before bulk submitting jobs to partitions with g2 nodes (i.e., ckpt-g2 and ckpt-all), as they may or may not function properly depending on exactly how they were compiled.

Student Opportunities#

In addition, we have two student opportunities to bring to your attention.

Job Opportunity: The Research Computing (RC) team at the University of Washington (UW) is looking for a student intern to spearhead projects that could involve: (1) the development of new tools and software, (2) research computing documentation and user tutorials, or (3) improvements to public-facing service catalog descriptions and service requests. Hyak is an ecosystem of high-performance computing (HPC) resources and supporting infrastructure available to UW researchers, students, and associated members of the UW community. Our team administers and maintains Hyak as well as provides support to Hyak users. Our intern will be given the choice of projects that fit their interest and experience while filling a need for the UW RC community. This role will provide students with valuable hands-on experiences that enhance academic and professional growth.

The position pays $19.97-21.50 per hour depending on experience with a maximum of 20 hours per week (Fall, Winter, and Spring) and a maximum of 40 hours allowed during summer quarter. How to apply: Please apply by emailing: 1) a current resume and 2) a cover letter detailing your qualifications, experience, and interest in the position to me, Kristen Finch (UWNetID: finchkn). Due to the volume of applications, we regret that we are unable to respond to every applicant or provide feedback.

Minimum Qualifications:

  • Student interns must hold at least a 2nd year standing if an undergraduate student.
  • Student interns must be able to access the internet.
  • Student interns must be able to demonstrate an ability to work independently on their selected project/s and expect to handle challenges by consulting software manuals and publicly available resources.

We encourage applicants that:

  • meet the minimum qualifications.
  • have an interest in website accessibility and curation.
  • have experience in research computing and Hyak specifically. This could include experience in any of the following: 1) command-line interface in a Linux environment, 2) Slurm job scheduler, 3) python, 4) shell scripting.
  • have an interest in computing systems administration.
  • have an interest in developing accessible computing-focused tutorials for the Hyak user community.

Conference Opportunity: The 2024 NSF Cybersecurity Summit program committee is now accepting applications to the Student Program. This year鈥檚 summit will be held October 7th-10th at Carnegie Mellon University in Pittsburgh, PA. Both undergraduate and graduate students may apply. No specific major or course of study is required, as long as the student is interested in learning and applying cybersecurity innovations to scientific endeavors. Selected applicants will receive invitations from the Program Committee to attend the Summit in-person. Attendance includes your participation in a poster session. The deadline for applications is Friday June 28th at 12 am CDT, with notification of acceptance to be sent by Monday July 29th. Click Here to Apply

Our next scheduled maintenance will be Tuesday July, 9, 2024.

Questions? If you have any questions for us, please reach out to the team by emailing help@uw.edu with Hyak in the subject line. Student intern applications sent to help@uw.edu will not be considered. Email applications to finchkn at uw.edu

G1 vs G2 Nodes

Nam Pho

Nam Pho

Director for Research Computing

Hello Hyak community, you may have noticed we've been relatively quiet infrastructure-wise over the past few months. Part of this has been due to data center constraints that have limited our ability to grow the cluster, which have since been addressed (for now) in April 2024. The good news is that we expect to begin ramping up deliveries of previously purchased slices over the coming weeks and months and, as a result, expanding the overall size of the cluster and checkpoint partitions.

G1 vs G2#

Large clusters are preferably homogenous to help with resource scheduling as any part of the cluster should be interchangeable with another. However, this is typical for fully built systems at the time of launch and not gradual build outs of condo-like systems such as Hyak. The primary driver behind this change from g1 (generation 1) to g2 are the rapid advances in technology providing a performance gain that make it untenable to maintain an older processor for the sake of homogeneity.

Intel vs AMD and Nodes vs Slices#

The first half of klone from its initial launch until June 11, 2024 when the first g2 nodes were introduced was a fully Intel-based cluster. Specifically, all g1 nodes are based on the Intel XEON 6230 Gold or "Cascade Lake" generation CPUs. The g2 nodes are based on the AMD EPYC 9000 series or "Genoa" generation CPUs.

Nodes refer to a physical unit we procure and install, such as a server. You don't need to worry about this. Slices are what researchers actually buy and contribute to the cluster in their "condo". Click here to learn more about Hyak's condo model.. A slice is a resource limit for your Hyak group backed by a commensurate contribution to the cluster. Sometimes (as in the g1 era) 1 node was 1 slice. However, in the g2 era, 1 node can consist of up to 6 slices to maintain a consistent price point for performance across generations and to provide flexibility to the Hyak team to select more cost-effective configurations.

CPU and Architecture Optimizations#

On a core-per-core basis, a g2 CPU core are faster than a g1 CPU core. If you are using an existing module that was compiled under a g1 (or Intel) slice there's a strong chance it will continue to work on a g2 (or AMD) slice. Intel and AMD are both x86 based CPUs and use an overlapping instruction set. Unless you compile code that is highly optimized for Intel or AMD, your code should run cross-platform. In our spot check of a few current modules (many contributed by our users) there seems to be no issues running existing Intel compiled code on the newer g2 slices. However, if you choose to recompile your code on a g2 (or retroactively for a g1) slice, you may see a performance improvement on that specific architecture at the expense of generalizibility and limiting your resource availability.

Storage and Memory#

There are no special considerations for taking advantage of the local node SSDs on either node type. This is accessible as /src across both g1 and g2 nodes and you can copy your data to and from there during a job. Note that the /src directory is wiped on job exit and not persistent beyond the life of your job.

There are no special considerations for taking advantage of the faster memory on the g2 nodes.

Ada and Hopper GPUs#

There are no special considerations for any pre-Ada (i.e., L40(S)) or pre-Hopper (i.e., H100) GPU code. All NVIDIA-based GPU codes are fully compatible across generations. As these are a newer GPU generation, there are performance improvements by using them alone. However, they are attached to the newer g2 architecture so benefit from the supporting case of improved CPU and memory performance of the surrounding system. If your support (i.e., non-GPU) code relies on any architecture optimizations, see the caveats above.

Resource Requests#

If you purchased any of these slices and contributed them to the cluster you will have received the specific partition names to use once they are deployed.

If you are interested in using these new resources when they are idle via the checkpoint partition there are now new considerations. You can read about it here. The new checkpoint partitions are:

  • ckpt-all if you want access to the entire cluster across g1 and g2 resources (and every possible GPU). One possible concern if you run a multi-node job that spans g1 and g2 nodes, is that you will probably see a performance hit. Multi-node jobs often rely on gather operations and will be as slow as the slowest worker, so your g2 nodes will be held back waiting for computation done on the slower g1 nodes to complete.
  • ckpt if you want to optimize for Intel specific processors (or only want to use the older GPUs). This is the status quo and you shouldn't need to change your job scripts if you don't want to use the newer resources.
  • ckpt-g2 if you want to optimize for AMD specific processors (or use only the newer GPUs).

As always, if you have any questions please reach us at help@uw.edu.

May 2024 Maintenance Details

Kristen Finch

Kristen Finch

HPC Staff Scientist

Hello Hyak Community,

Thanks again for your patience with our monthly scheduled maintenance, there are some notable improvements we implemented today.

klone node image: Over the past few weeks, you may have noticed some klone instability. This was a result of some behind the scenes storage upgrades that inadvertently introduced wider impacts to the existing cluster automation. At the time, we introduced a temporary fix to get the cluster back online but with today鈥檚 maintenance we implemented a more comprehensive fix.

Infiniband firmware: The klone cluster is built on the infiniband HPC interconnect for node-to-node communication. While klone originally launched with the HDR generation of infiniband, we have since upgraded mid-klone to have a HDR-NDR hybrid interconnect. NDR infiniband is required to support the latest compute slices we offer. We updated the firmware on our NDR switches following vendor recommendations for increased stability.

Apptainer on MOX: Apptainer (formerly Singularity) is the root-less containerization solution we provide on both Hyak clusters. Apptainer version 1.3.1 was deployed on both klone and MOX. As a reminder, on klone Apptainer is accessed through a module and is only available on compute nodes after module load apptainer. On MOX, Apptainer is default software and can be accessed with Apptainer commands directly after starting an interactive job for example, apptainer --version.

Training Opportunities: COMPLECS (San Diego Supercomputer) is hosting an Intermediate Linux Shell Scripting online workshop on Thursday May, 16 at 11:00 am Pacific Time. Register here.

Our next scheduled maintenance will be Tuesday June, 11, 2024. Stay informed by joining our mailing list. Sign up here.

Questions? If you have any questions for us, please reach out to the team by emailing help@uw.edu with Hyak in the subject line.

April 2024 Maintenance Details

Kristen Finch

Kristen Finch

HPC Staff Scientist

Hello Hyak Community,

Thank you for your patience this month while there was more scheduled downtime than usual to allow for electrical reconfiguration work in the UW Tower data center. We appreciate how disruptive this work has been in recent weeks. Please keep in mind that this work by the data center team has been critical in allowing the facility to increase available power to the cluster to provide future growth capacity, which was limiting deployment of new equipment in recent months.

The Hyak team was able to use the interruption to implement the following changes:

  • Increase in checkpoint (--partition=ckpt) runtime for GPU jobs from 4-5 hours to 8-9 hours (pre-emption for requeuing will still occur subject to cluster utilization). Please see the updated documentation page for information about using idle resources.
  • The NVIDIA driver has been updated for all GPUs.

Our next scheduled maintenance will be Tuesday May 14, 2024.

Training Opportunities#

Follow NSF ACCESS Training and Events posting HERE to find online webinars about containers, parallel computing, using GPUs, and more from HPC providers around the USA.

Questions? If you have any questions for us, please reach out to the team by emailing help@uw.edu with Hyak in the subject line.

AI Research Needs Survey

Kristen Finch

Kristen Finch

HPC Staff Scientist

Hello Hyak Community,

The Research Working Group of the UW AI Task Force would like faculty and research staff input on the needs and challenges of using AI in research at UW across a broad spectrum of disciplines. Please help by responding to a short survey at: https://forms.gle/mZrV3aCgJYNNBV6j8. Responses by April 25 would be most helpful, but the survey will remain open until April 30.

Thank you in advance for your time,

Hyak Team