
4 posts tagged with "nvidia"


May 2025 Maintenance Update

· 7 min read
Kristen Finch
HPC Staff Scientist

During May's maintenance, we've refreshed the operating system images for both login and compute nodes with the latest Linux security updates and patches, and enhanced the node image to include NVIDIA Fabric Manager and software upgrades for next-generation GPU switching fabrics. There’s still time to register for GPU in EDU on Thursday, May 15th, 10:00 a.m.–4:00 p.m., where we team up with NVIDIA and Cambridge Computer for a day of demos, teaching guidance, and research highlights. Stay informed by subscribing to our mailing list and the UW-IT Research Computing Events Calendar. The next maintenance is scheduled for Tuesday June 10, 2025 (AKA the 2nd Tuesday of the month).

Notable Updates

  • Operating system - The images for both the login and compute nodes have been refreshed to incorporate the latest Linux OS security updates and system patches.
  • Node image enhancements - This version of the node image includes NVIDIA Fabric Manager and software upgrades necessary to support next-generation GPU switching fabrics.

Upcoming Events

Subscribe to event updates and bookmark our UW-IT Research Computing Events Calendar.

  • There's still time to sign up for GPU in EDU on Thursday, May 15th, 10:00 a.m. - 4:00 p.m.
    • Join us for a full day of learning about GPUs with experts from NVIDIA and Cambridge Computer. The event will feature recommendations for building GPU workflows, guidance for using GPUs for teaching, highlights of GPU-powered research at UW, and more. Don't miss it: lunch is provided, and registration is still open!

Spring Office Hours

If you would like to request one-on-one help, please send an email to help@uw.edu with "Hyak Office Hour" in the subject line to coordinate a meeting.

Training Resources

Opportunities

Computing Training from eScience and more

  • Intro Programming Workshop - eScience is holding a Software Carpentry workshop on May 27th–30th (9:00 a.m. – noon each day). The workshop will teach software tools that can make researchers more effective, automate research tasks, and track research over time. Specifically, the Unix Shell, Git, and Python will be taught with a focus on reproducible research. Register here.
  • Teach your own LLM - On Friday, May 23rd from 10:30 a.m. to noon in the Open Scholarship Commons, Jose Cols will lead the workshop “Teach your own LLM: Fine-tuning Models on Custom Datasets,” covering how LLMs work and how to fine-tune a Llama 3 model for tasks like sentiment analysis and summarization.

External Training Opportunities

  • COMPLECS: Using Regular Expressions with Linux Tools - Thursday, May 29, 2025 - 11:00 a.m. – 12:30 p.m. (Pacific Time) Regular expressions (regexes) provide a way to identify strings that match a specified pattern. They are extremely useful for preprocessing text and extracting results from high-performance computing and data science workloads. Primarily in the context of the Linux grep utility, we incrementally introduce the main features of regexes: string literals, specifying multiple characters, quantifiers, wildcards, anchors, character classes, grouping, and alternation. We also explore more advanced topics such as word boundaries, lazy and greedy matching, regex flavors (basic, extended, and Perl compatible), regexes with awk and sed, searching compressed files, and using large language models (LLMs) to create regexes. Register here!
  • Accelerated Python Tutorial - 8:00 - 11:30 a.m. (Pacific time), Wednesday, May 28, 2025 - LANL is hosting an Accelerated Python Tutorial, presented by Scot Halverson from NVIDIA. Attendees will learn to measure, understand, and improve the performance of their Python applications, including ML workflows using PyTorch and TensorFlow. This event is open to NERSC users. Learn more and register.
  • Building GPU-Accelerated Differentiable Simulations with NVIDIA Warp (Python) - 1:00 - 4:00 p.m. (Pacific time), Wednesday, May 28, 2025 - LANL is hosting a Building GPU-Accelerated Differentiable Simulations with NVIDIA Warp training, presented by Eric Shi from NVIDIA. This approach lets developers harness GPU performance while maintaining the simplicity and flexibility of Python. This event is open to NERSC users. Learn more and register.
  • NERSC GPU Hackathon - NERSC, in conjunction with NVIDIA and the OpenACC organization, will be hosting an Open Hackathon from July 16th-18th with an opening day on July 9th as part of the annual Open Hackathon Series. Please note the deadline to submit a proposal is 11:59 PM Pacific, May 28, 2025. So apply now! Learn more.
  • A Deep Dive into the HPC SDK - 1:00 - 4:00 p.m. (Pacific time), Thursday, May 29, 2025 - LANL is hosting a Deep Dive into the NVIDIA HPC SDK training, presented by Scot Halverson from NVIDIA. This talk will cover the broad set of compilers, tools, and libraries that make up the NVIDIA HPC SDK. This event is open to NERSC users. Learn more and register.
  • High Throughput Workflow Tools and Strategies - 8:30 am - 12:00 pm (Pacific time), Friday, May 30, 2025 - NERSC is hosting an online webinar presented by William Arndt of NERSC and Geoffrey Lentner from Advanced Computing, Purdue University. This training session will discuss and demonstrate multiple software tools for managing high throughput workloads: GNU Parallel, Snakemake, and Hypershell. The seminar is open to the general public. Learn more and register.
  • Solving Data Management Challenges with Globus - June 6, 2025, 9 a.m. – 12 p.m. (Pacific Time) Participants will engage in hands-on exercises to explore how Globus can streamline data movement across cloud and high-performance computing systems. Whether managing large datasets, enabling secure collaboration, or automating workflows, this session will equip participants with the knowledge and skills to maximize the benefits of Globus. Enroll here.
  • COMPLECS: Code Migration - Thursday, June 12, 2025 - 11:00 a.m. – 12:30 p.m. (Pacific Time) We will cover typical approaches to moving your computations to HPC resources: using applications/software packages already available on the system through Linux environment modules; compiling code from source with information on compilers, libraries, and optimization flags to use; setting up Python and R environments; using conda-based environments; managing workflows; and using containerized solutions via Singularity. Register here!
  • Automating Research with Globus: The Modern Research IT Platform - Aug. 18, 2025, 9 a.m. – 12 p.m. (Pacific Time) This workshop introduces Globus Flows and its role in automating research workflows. Participants will explore data portals, science gateways, and commons, enabling seamless data discovery and access. Enroll here.

If you have any questions about using Hyak, please start a help request by emailing help@uw.edu with "Hyak" in the subject line.

Happy Computing,

Hyak Team

June 2024 Maintenance Details

· 5 min read
Kristen Finch
HPC Staff Scientist

Thanks again for your patience with our monthly scheduled maintenance. This month, we deployed new node resources that were purchased by various UW researchers from across campus. These nodes are a little different, so we wanted to bring them to your attention and provide guidance on using them when they are idle via the checkpoint partitions.

New G2 Nodes

A new class of nodes has been deployed on klone, which we are calling g2 because they are the second generation of nodes; we will retroactively refer to the first-generation nodes as g1. The g2 CPU nodes feature AMD EPYC 9000-series 'Genoa' processors, and the new GPU nodes feature either NVIDIA L40 or L40S GPUs (with H100 GPUs possibly becoming available in the future). These nodes will join our community resources that can be used when idle (ckpt) under the new partitions (a minimal example job script follows this list):

  • ckpt-g2 for scheduling jobs on g2 nodes only.
  • ckpt-all for scheduling jobs on either g1 or g2 nodes.
  • ckpt will now schedule jobs on g1 nodes only.
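
For reference, below is a minimal checkpoint job script sketch; the account name (mylab), job name, and resource amounts are placeholders to adapt to your own group and workload:

```bash
#!/bin/bash
#SBATCH --job-name=g2-test        # hypothetical job name
#SBATCH --account=mylab           # placeholder: replace with your Hyak group account
#SBATCH --partition=ckpt-all      # schedule on idle g1 or g2 nodes
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=4
#SBATCH --mem=16G
#SBATCH --time=4:00:00

# Report which node (and therefore which generation) the job landed on.
hostname
lscpu | grep "Model name"
```

Submit it with sbatch and the scheduler will place the job on whichever idle g1 or g2 node becomes available.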

Please review our documentation HERE for specific instructions on accessing these resources. Additionally, please see the blog post HERE, where we discuss further considerations for their usage.

To accompany the new g2 node deployments, we are providing a new Open MPI module (ompi/4.1.6-2), which is now the default module when module load ompi is executed. Previous Open MPI modules will cause errors if used with the AMD processors on the new g2 nodes because of how the software was compiled. ompi/4.1.6-2 (and any Open MPI module versions we provide in the future) is compiled to support both Intel and AMD processors. If your MPI jobs are submitted to a partition that includes g2 nodes, run module load ompi to pick up the new default, or load it explicitly with module load ompi/4.1.6-2 (or a newer version in the future).
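
As a quick sketch of how that looks inside a job script (./my_mpi_app is a placeholder binary):

```bash
# Load the new default Open MPI module on a partition that includes g2 nodes.
module load ompi                  # resolves to ompi/4.1.6-2, the new default
# ...or pin the version explicitly:
# module load ompi/4.1.6-2

# Launch the MPI application (placeholder binary) across the allocated tasks.
mpirun -np "${SLURM_NTASKS:-4}" ./my_mpi_app
```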

If you have compiled software on g1 nodes, you should test it on g2 nodes before bulk submitting jobs to partitions that include g2 nodes (i.e., ckpt-g2 and ckpt-all), as it may or may not function properly depending on exactly how it was compiled.
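
One way to run that test is a short interactive session on a g2 node; the sketch below assumes a placeholder account name (mylab) and binary path:

```bash
# Request a short interactive session on a g2 node via the idle-resource partition.
salloc --account=mylab --partition=ckpt-g2 \
       --nodes=1 --ntasks=1 --mem=8G --time=00:30:00

# Once the session starts on the compute node, confirm it is an AMD 'Genoa'
# node and smoke-test your g1-built binary (placeholder path).
lscpu | grep "Model name"
./path/to/my_g1_built_binary --help
```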

Student Opportunities

In addition, we have two student opportunities to bring to your attention.

Job Opportunity: The Research Computing (RC) team at the University of Washington (UW) is looking for a student intern to spearhead projects that could involve: (1) the development of new tools and software, (2) research computing documentation and user tutorials, or (3) improvements to public-facing service catalog descriptions and service requests. Hyak is an ecosystem of high-performance computing (HPC) resources and supporting infrastructure available to UW researchers, students, and associated members of the UW community. Our team administers and maintains Hyak as well as provides support to Hyak users. Our intern will be given the choice of projects that fit their interest and experience while filling a need for the UW RC community. This role will provide students with valuable hands-on experiences that enhance academic and professional growth.

The position pays $19.97-21.50 per hour depending on experience, with a maximum of 20 hours per week (Fall, Winter, and Spring) and a maximum of 40 hours allowed during summer quarter. How to apply: Please apply by emailing 1) a current resume and 2) a cover letter detailing your qualifications, experience, and interest in the position to me, Kristen Finch (UWNetID: finchkn). Due to the volume of applications, we regret that we are unable to respond to every applicant or provide feedback.

Minimum Qualifications:

  • Student interns must hold at least a 2nd year standing if an undergraduate student.
  • Student interns must be able to access the internet.
  • Student interns must be able to demonstrate an ability to work independently on their selected project/s and expect to handle challenges by consulting software manuals and publicly available resources.

We encourage applicants who:

  • meet the minimum qualifications.
  • have an interest in website accessibility and curation.
  • have experience in research computing and Hyak specifically. This could include experience in any of the following: 1) the command-line interface in a Linux environment, 2) the Slurm job scheduler, 3) Python, 4) shell scripting.
  • have an interest in computing systems administration.
  • have an interest in developing accessible computing-focused tutorials for the Hyak user community.

Conference Opportunity: The 2024 NSF Cybersecurity Summit program committee is now accepting applications to the Student Program. This year’s summit will be held October 7th-10th at Carnegie Mellon University in Pittsburgh, PA. Both undergraduate and graduate students may apply. No specific major or course of study is required, as long as the student is interested in learning and applying cybersecurity innovations to scientific endeavors. Selected applicants will receive invitations from the Program Committee to attend the Summit in-person. Attendance includes your participation in a poster session. The deadline for applications is Friday June 28th at 12 am CDT, with notification of acceptance to be sent by Monday July 29th. Click Here to Apply

Our next scheduled maintenance will be Tuesday, July 9, 2024.

Questions? If you have any questions for us, please reach out to the team by emailing help@uw.edu with Hyak in the subject line. Student intern applications sent to help@uw.edu will not be considered; email applications to finchkn at uw.edu.

G1 vs G2 Nodes

· 5 min read
Nam Pho
Director for Research Computing

Hello Hyak community, you may have noticed we've been relatively quiet infrastructure-wise over the past few months. Part of this has been due to data center constraints that limited our ability to grow the cluster; those constraints were addressed (for now) in April 2024. The good news is that we expect to begin ramping up deliveries of previously purchased slices over the coming weeks and months and, as a result, to expand the overall size of the cluster and checkpoint partitions.

G1 vs G2

Large clusters are preferably homogeneous to help with resource scheduling, since any part of the cluster should be interchangeable with another. However, that is typical only for systems built out fully at launch, not for gradual build-outs of condo-like systems such as Hyak. The primary driver behind this change from g1 (generation 1) to g2 is the rapid advance in processor technology, which provides performance gains that make it untenable to keep an older processor for the sake of homogeneity.

Intel vs AMD and Nodes vs Slices

The first era of klone, from its initial launch until June 11, 2024, when the first g2 nodes were introduced, was fully Intel-based. Specifically, all g1 nodes are based on Intel Xeon Gold 6230 or "Cascade Lake" generation CPUs. The g2 nodes are based on AMD EPYC 9000-series or "Genoa" generation CPUs.

Nodes refer to the physical units we procure and install, such as servers; you don't need to worry about nodes. Slices are what researchers actually buy and contribute to the cluster in their "condo" (click here to learn more about Hyak's condo model). A slice is a resource limit for your Hyak group backed by a commensurate contribution to the cluster. Sometimes (as in the g1 era) 1 node was 1 slice. However, in the g2 era, 1 node can consist of up to 6 slices, both to maintain a consistent price point for performance across generations and to give the Hyak team flexibility to select more cost-effective configurations.

CPU and Architecture Optimizations

On a core-for-core basis, a g2 CPU core is faster than a g1 CPU core. If you are using an existing module that was compiled under a g1 (or Intel) slice, there is a strong chance it will continue to work on a g2 (or AMD) slice. Intel and AMD are both x86-based CPUs and use an overlapping instruction set, so unless you compile code that is highly optimized for Intel or AMD, your code should run cross-platform. In our spot check of a few current modules (many contributed by our users), we found no issues running existing Intel-compiled code on the newer g2 slices. However, if you choose to recompile your code on a g2 (or, retroactively, a g1) slice, you may see a performance improvement on that specific architecture at the expense of generalizability, which limits which resources your jobs can run on.
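
As a rough illustration of that trade-off (a sketch only; my_app.c, the output names, and the exact flags depend on your compiler and code):

```bash
# Portable build: target the common x86-64 baseline so the same binary
# runs on both g1 (Intel) and g2 (AMD) slices.
gcc -O2 -march=x86-64 -o my_app_portable my_app.c

# Architecture-tuned build: compile on the node type you plan to run on;
# -march=native tunes for that host's CPU (e.g., 'Genoa' on a g2 slice),
# which may be faster there but is not guaranteed to run on the other generation.
gcc -O3 -march=native -o my_app_tuned my_app.c
```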

Storage and Memory

There are no special considerations for taking advantage of the local node SSDs on either node type. They are accessible as /src on both g1 and g2 nodes, and you can copy your data to and from there during a job. Note that the /src directory is wiped on job exit and is not persistent beyond the life of your job.
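
A minimal staging sketch is below, assuming you can create a per-job directory under /src; the /gscratch paths, mylab, and my_analysis are placeholders:

```bash
# Stage input data onto the node-local SSD, compute against the fast local
# copy, then copy results back to persistent storage before the job ends.
LOCAL_DIR="/src/${USER}/${SLURM_JOB_ID}"
mkdir -p "${LOCAL_DIR}"
cp -r /gscratch/mylab/inputs "${LOCAL_DIR}/"

./my_analysis --input "${LOCAL_DIR}/inputs" --output "${LOCAL_DIR}/results"

# /src is wiped when the job exits, so copy results back before it does.
cp -r "${LOCAL_DIR}/results" /gscratch/mylab/
```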

There are no special considerations for taking advantage of the faster memory on the g2 nodes.

Ada and Hopper GPUs

There are no special considerations for running pre-Ada or pre-Hopper GPU code on the new Ada (i.e., L40 and L40S) and Hopper (i.e., H100) GPUs; NVIDIA GPU code is fully compatible across generations. Because these are a newer GPU generation, using them provides performance improvements on its own, and since they are attached to the newer g2 architecture, they also benefit from the improved CPU and memory performance of the surrounding system. If your supporting (i.e., non-GPU) code relies on any architecture optimizations, see the caveats above.
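
If you do choose to rebuild GPU code and want a single binary tuned for both GPU generations, a minimal nvcc sketch (my_kernel.cu is a placeholder source file) might look like:

```bash
# Build a fat binary with native code for Ada (sm_89: L40/L40S) and
# Hopper (sm_90: H100), plus PTX for forward compatibility with newer GPUs.
nvcc -O3 my_kernel.cu -o my_kernel \
     -gencode arch=compute_89,code=sm_89 \
     -gencode arch=compute_90,code=sm_90 \
     -gencode arch=compute_90,code=compute_90
```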

Resource Requests

If you purchased any of these slices and contributed them to the cluster you will have received the specific partition names to use once they are deployed.

If you are interested in using these new resources when they are idle via the checkpoint partition, there are some new considerations; you can read about them here. The new checkpoint partitions are:

  • ckpt-all if you want access to the entire cluster across g1 and g2 resources (and every possible GPU). One possible concern: if you run a multi-node job that spans g1 and g2 nodes, you will probably see a performance hit. Multi-node jobs often rely on gather operations and are only as fast as the slowest worker, so your g2 nodes will be held back waiting for computation on the slower g1 nodes to complete (see the example job script after this list).
  • ckpt if you want to target Intel processors specifically (or only want to use the older GPUs). This is the status quo, and you shouldn't need to change your job scripts if you don't want to use the newer resources.
  • ckpt-g2 if you want to target AMD processors specifically (or use only the newer GPUs).
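
For example, here is a sketch of a multi-node MPI job pinned to a single generation; the account, resource amounts, and binary name are placeholders:

```bash
#!/bin/bash
# Keep a multi-node MPI job on one node generation so no rank is held
# back by slower g1 CPUs.
#SBATCH --account=mylab           # placeholder account
#SBATCH --partition=ckpt-g2       # g2 (AMD) nodes only; use ckpt for g1 only
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=8
#SBATCH --time=4:00:00

module load ompi                  # Open MPI built for both Intel and AMD
mpirun ./my_mpi_app               # placeholder MPI binary
```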

As always, if you have any questions please reach us at help@uw.edu.

April 2024 Maintenance Details

· 2 min read
Kristen Finch
HPC Staff Scientist

Thank you for your patience this month while there was more scheduled downtime than usual to allow for electrical reconfiguration work in the UW Tower data center. We recognize how disruptive this work has been in recent weeks. Please keep in mind that this work by the data center team has been critical in allowing the facility to increase the power available to the cluster and provide capacity for future growth; limited power had been holding back deployment of new equipment in recent months.

The Hyak team was able to use the interruption to implement the following changes:

  • Increase in checkpoint (--partition=ckpt) runtime for GPU jobs from 4-5 hours to 8-9 hours (pre-emption for requeuing will still occur subject to cluster utilization). Please see the updated documentation page for information about using idle resources.
  • The NVIDIA driver has been updated for all GPUs.

Our next scheduled maintenance will be Tuesday May 14, 2024.

Training Opportunities

Follow NSF ACCESS Training and Events posting HERE to find online webinars about containers, parallel computing, using GPUs, and more from HPC providers around the USA.

Questions? If you have any questions for us, please reach out to the team by emailing help@uw.edu with Hyak in the subject line.