June 2024 Maintenance Details

Kristen Finch

HPC Staff Scientist

Hello HYAK Community,

Thanks again for your patience with our monthly scheduled maintenance. This month, we deployed new node resources that were purchased by various UW researchers from across campus. These nodes are a little different, so we wanted to bring them to your attention and provide guidance on using them through the checkpoint partition when they are idle.

New G2 Nodes

A new class of nodes has been deployed on klone. We are calling these nodes g2 because they are the second generation of nodes, and we will retroactively refer to the first-generation nodes as g1. The g2 CPU nodes feature AMD EPYC 9000-series 'Genoa' processors, and the new GPU nodes feature either NVIDIA L40 or L40S GPUs (with H100 GPUs possibly becoming available in the future). These nodes will join our community resources that can be used when idle (ckpt) under the following partitions:

  • ckpt-g2 for scheduling jobs on g2 nodes only.

  • ckpt-all for scheduling jobs on either g1 or g2 nodes.

  • ckpt will now schedule jobs on g1 nodes only.

    Please review our documentation HERE for specific instructions for accessing these resources. Additionally, please see the blog post HERE where we discuss additional considerations for their usage.

To accompany the new g2 node deployments, we are providing a new Open MPI module (ompi/4.1.6-2), which is now the default module when module load ompi is executed. Previous Open MPI modules will cause errors if used with the AMD processors on the new g2 nodes because of how the software was compiled. ompi/4.1.6-2 (and any Open MPI module version we provide in the future) is compiled to support both Intel and AMD processors. If your MPI jobs are submitted to a partition that includes g2 nodes, you should either run module load ompi to get the new module by default, or explicitly load it (or a newer version in the future) via module load ompi/4.1.6-2.

If you have compiled software on g1 nodes, you should test it on g2 nodes before bulk submitting jobs to partitions with g2 nodes (i.e., ckpt-g2 and ckpt-all), as it may or may not function properly depending on exactly how it was compiled.
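The partition rules above can be summarized in a short sketch. The table below only restates the partition list from this post; the helper names (`includes_g2`, `partitions_needing_g2_testing`) are hypothetical and are not part of any HYAK tooling.

```python
# Node generations reachable from each checkpoint partition, as listed above.
PARTITION_GENERATIONS = {
    "ckpt": {"g1"},
    "ckpt-g2": {"g2"},
    "ckpt-all": {"g1", "g2"},
}

# Module named in this post as compiled for both Intel and AMD processors.
G2_SAFE_OMPI = "ompi/4.1.6-2"


def includes_g2(partition):
    """Return True if jobs in this partition may land on g2 (AMD) nodes."""
    return "g2" in PARTITION_GENERATIONS[partition]


def partitions_needing_g2_testing():
    """Partitions where g1-compiled software should be tested on g2 first."""
    return sorted(p for p in PARTITION_GENERATIONS if includes_g2(p))


for name in sorted(PARTITION_GENERATIONS):
    if includes_g2(name):
        print(f"{name}: load {G2_SAFE_OMPI} (or newer) and test g1 builds")
    else:
        print(f"{name}: g1 nodes only")
```

Under these assumptions, only ckpt-g2 and ckpt-all are flagged, which matches the guidance above: jobs sent to plain ckpt stay on g1 nodes.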

Student Opportunities

In addition, we have two student opportunities to bring to your attention.

Job Opportunity: The Research Computing (RC) team at the University of Washington (UW) is looking for a student intern to spearhead projects that could involve: (1) the development of new tools and software, (2) research computing documentation and user tutorials, or (3) improvements to public-facing service catalog descriptions and service requests. HYAK is an ecosystem of high-performance computing (HPC) resources and supporting infrastructure available to UW researchers, students, and associated members of the UW community. Our team administers and maintains HYAK as well as provides support to HYAK users. Our intern will be given the choice of projects that fit their interest and experience while filling a need for the UW RC community. This role will provide students with valuable hands-on experiences that enhance academic and professional growth.

The position pays $19.97-21.50 per hour depending on experience with a maximum of 20 hours per week (Fall, Winter, and Spring) and a maximum of 40 hours allowed during summer quarter. How to apply: Please apply by emailing: 1) a current resume and 2) a cover letter detailing your qualifications, experience, and interest in the position to me, Kristen Finch (UWNetID: finchkn). Due to the volume of applications, we regret that we are unable to respond to every applicant or provide feedback.

Minimum Qualifications:

  • Student interns must hold at least a 2nd year standing if an undergraduate student.
  • Student interns must be able to access the internet.
  • Student interns must be able to demonstrate an ability to work independently on their selected project/s and expect to handle challenges by consulting software manuals and publicly available resources.

We encourage applicants who:

  • meet the minimum qualifications.
  • have an interest in website accessibility and curation.
  • have experience in research computing and HYAK specifically. This could include experience in any of the following: 1) command-line interface in a Linux environment, 2) SLURM job scheduler, 3) Python, 4) shell scripting.
  • have an interest in computing systems administration.
  • have an interest in developing accessible computing-focused tutorials for the HYAK user community.

Conference Opportunity: The 2024 NSF Cybersecurity Summit program committee is now accepting applications to the Student Program. This year’s summit will be held October 7th-10th at Carnegie Mellon University in Pittsburgh, PA. Both undergraduate and graduate students may apply. No specific major or course of study is required, as long as the student is interested in learning and applying cybersecurity innovations to scientific endeavors. Selected applicants will receive invitations from the Program Committee to attend the Summit in-person. Attendance includes your participation in a poster session. The deadline for applications is Friday June 28th at 12 am CDT, with notification of acceptance to be sent by Monday July 29th. Click Here to Apply

Our next scheduled maintenance will be Tuesday, July 9, 2024.

Questions? If you have any questions for us, please reach out to the team by emailing help@uw.edu with Hyak in the subject line. Student intern applications sent to help@uw.edu will not be considered. Email applications to finchkn at uw.edu.

Fairshare improvements on KLONE

Nam Pho

Director for Research Computing

We have adjusted legacy fairshare-related settings to account for GPUs and large memory contributions and usage in order to help more fairly allocate checkpoint resources.

History

In fall 2019 (almost two years ago to the day) the HYAK team received our first Turing generation GPU node. HYAK has had a modest GPU footprint in the past as far back as a decade ago with the first generation cluster (called "IKT") and its pre-Pascal generation cards. In 2015 we acquired a smaller test bed of Pascal generation GPUs for the second generation cluster (called "MOX"). There were never more than a dozen GPUs in either the IKT or MOX clusters, but the introduction of Turing GPUs marked a resurgence of interest in these accelerators among the UW research community. In the last two years, we've substantially expanded our capabilities to over 300 GPUs.

Background

HYAK clusters work on a "condo" model: labs are able to utilize their contributed hardware on-demand as well as take advantage of idle capacity from other groups' hardware via the checkpoint (ckpt) partition. Your checkpoint priority — or "fairshare" in SLURM scheduler parlance — is weighted such that your fairshare is directly proportional to your lab’s contribution to the cluster. In the MOX days, GPU users tended to stay within their contributed hardware partitions and rarely made use of checkpoint. We attributed this to a mental shift: students were used to using a single resource, like a desktop computer, rather than a shared cluster of computing resources. However, with the migration to the third generation HYAK cluster (called "KLONE") and its new QoS scheduling system and the increasing comfort of students using a shared platform, GPU utilization in the checkpoint partition has increased as well. This is a good thing: we want groups to benefit from their HYAK membership in the cluster and take advantage of idle cluster resources beyond their initial hardware contributions. This is a primary tenet of our social contract with the HYAK community: as a node contributor to the cluster, you have access to idle resources of the whole cluster.

Problem

Fairshare was simpler to calculate in the pre-GPU days because our infrastructure was homogeneous: one node contributed to the cluster equaled one fairshare unit. During the last two years of exponential GPU adoption on HYAK, the fairshare calculation did not evolve: 1 GPU node counted the same as 1 HPC node, at 1 fairshare unit each. This equivalence did not hold because a GPU node can cost between 4 and 8 times (or more) as much as a traditional HPC node. The result was that labs with GPU or other specialty (e.g., high-memory) nodes tended to have smaller fairshares than groups with the same dollar investment in traditional CPU nodes only. In practice, this meant these GPU users often competed directly for resources with non-GPU jobs in the checkpoint partition on a non-level playing field.
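A small worked example makes the imbalance concrete. This is a sketch under stated assumptions: the budget is expressed in units of one traditional HPC node's price, the 4x and 8x multipliers come from the range quoted above, and the function name `legacy_fairshare_units` is hypothetical.

```python
from fractions import Fraction


def legacy_fairshare_units(budget_in_hpc_nodes, node_cost_multiplier):
    """Fairshare under the old rule: one unit per contributed node, any type.

    budget_in_hpc_nodes: investment expressed in traditional HPC node prices.
    node_cost_multiplier: price of the purchased node type relative to one
        traditional HPC node (1 for HPC nodes, 4 to 8 or more for GPU nodes).
    """
    return Fraction(budget_in_hpc_nodes, node_cost_multiplier)


budget = 8  # hypothetical: enough to buy 8 traditional HPC nodes

print(legacy_fairshare_units(budget, 1))  # CPU-only lab: 8 units
print(legacy_fairshare_units(budget, 4))  # GPU lab, nodes cost 4x: 2 units
print(legacy_fairshare_units(budget, 8))  # GPU lab, nodes cost 8x: 1 unit
```

For the same investment, the GPU lab in this illustration ends up with one quarter to one eighth of the CPU-only lab's fairshare.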

Solution

Taking into consideration all of this information, as well as the fact that you can request as little as 1 GPU or 1 CPU from the scheduler, we have adjusted the fairshare calculations as follows:

  • Financially: 1 GPU card is roughly equivalent to 40 CPU cores (on a dollar basis), therefore the cost normalization is 40:1 in favor of GPUs.
  • Scarcity: 1 server typically holds 8 GPU cards or 40 CPU cores, therefore the scarcity normalization is 5:1 in favor of GPUs.
  • Combining the financial and scarcity considerations in the points above, the final weighting is 200:1 in favor of GPUs. In other words, 1 GPU card is worth 200 times more than a single CPU core in the eyes of the scheduler and factored into your checkpoint fairshare. Please note that this example only applies to the higher GPU memory cards (i.e., gpu-rtx6k) while less expensive GPUs have commensurately less weight.
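The arithmetic behind the 200:1 figure can be written out as follows. The ratios are taken from the bullets above; the `weighted_contribution` helper is a hypothetical illustration of expressing a contribution in CPU-core equivalents, and it is not the scheduler's actual configuration, which this post does not describe.

```python
# Figures quoted in the bullets above.
COST_RATIO = 40          # 1 GPU card is roughly 40 CPU cores on a dollar basis
GPUS_PER_SERVER = 8      # a server typically holds 8 GPU cards ...
CORES_PER_SERVER = 40    # ... or 40 CPU cores

# Scarcity: 40 cores per server versus 8 GPUs per server gives 40 / 8 = 5.
SCARCITY_RATIO = CORES_PER_SERVER // GPUS_PER_SERVER

# Final weight of one high-memory GPU card relative to one CPU core.
GPU_WEIGHT = COST_RATIO * SCARCITY_RATIO


def weighted_contribution(cpu_cores, gpu_cards):
    """Hypothetical helper: a contribution in CPU-core equivalents."""
    return cpu_cores + GPU_WEIGHT * gpu_cards


print(SCARCITY_RATIO)                                   # 5
print(GPU_WEIGHT)                                       # 200
print(weighted_contribution(cpu_cores=40, gpu_cards=0))  # 40
print(weighted_contribution(cpu_cores=0, gpu_cards=8))   # 1600
```

As noted above, the 200:1 weight applies only to the higher-memory GPU cards; a less expensive card would use a smaller `COST_RATIO` and therefore a smaller weight.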

Summary

With today's October monthly maintenance, we have introduced a new fairshare weighting system on the KLONE cluster's checkpoint (ckpt) partition that commensurately acknowledges GPU labs for their contributions to the HYAK community. This has no impact on jobs submitted to non-ckpt partitions.