4 posts tagged with "training"

View All Tags

May 2024 Maintenance Details

Kristen Finch

Kristen Finch

HPC Staff Scientist

Hello HYAK Community,

Thanks again for your patience with our monthly scheduled maintenance, there are some notable improvements we implemented today.

KLONE node image: Over the past few weeks, you may have noticed some KLONE instability. This was a result of some behind the scenes storage upgrades that inadvertently introduced wider impacts to the existing cluster automation. At the time, we introduced a temporary fix to get the cluster back online but with today’s maintenance we implemented a more comprehensive fix.

Infiniband firmware: The KLONE cluster is built on the infiniband HPC interconnect for node-to-node communication. While KLONE originally launched with the HDR generation of infiniband, we have since upgraded mid-KLONE to have a HDR-NDR hybrid interconnect. NDR infiniband is required to support the latest compute slices we offer. We updated the firmware on our NDR switches following vendor recommendations for increased stability.

Apptainer on MOX: Apptainer (formerly Singularity) is the root-less containerization solution we provide on both HYAK clusters. Apptainer version 1.3.1 was deployed on both KLONE and MOX. As a reminder, on KLONE Apptainer is accessed through a module and is only available on compute nodes after module load apptainer. On MOX, Apptainer is default software and can be accessed with Apptainer commands directly after starting an interactive job for example, apptainer --version.

Training Opportunities: COMPLECS (San Diego Supercomputer) is hosting an Intermediate Linux Shell Scripting online workshop on Thursday May, 16 at 11:00 am Pacific Time. Register here.

Our next scheduled maintenance will be Tuesday June, 11, 2024. Stay informed by joining our mailing list. Sign up here.

Questions? If you have any questions for us, please reach out to the team by emailing help@uw.edu with Hyak in the subject line.

April 2024 Maintenance Details

Kristen Finch

Kristen Finch

HPC Staff Scientist

Hello HYAK Community,

Thank you for your patience this month while there was more scheduled downtime than usual to allow for electrical reconfiguration work in the UW Tower data center. We appreciate how disruptive this work has been in recent weeks. Please keep in mind that this work by the data center team has been critical in allowing the facility to increase available power to the cluster to provide future growth capacity, which was limiting deployment of new equipment in recent months.

The HYAK team was able to use the interruption to implement the following changes:

  • Increase in checkpoint (--partition=ckpt) runtime for GPU jobs from 4-5 hours to 8-9 hours (pre-emption for requeuing will still occur subject to cluster utilization). Please see the updated documentation page for information about using idle resources.
  • The NVIDIA driver has been updated for all GPUs.

Our next scheduled maintenance will be Tuesday May 14, 2024.

Training Opportunities#

Follow NSF ACCESS Training and Events posting HERE to find online webinars about containers, parallel computing, using GPUs, and more from HPC providers around the USA.

Questions? If you have any questions for us, please reach out to the team by emailing help@uw.edu with Hyak in the subject line.

March 2024 Maintenance Details

Kristen Finch

Kristen Finch

HPC Staff Scientist

Hello HYAK Users,

For our March maintenance we had some notable changes we wanted to share with the community.

Login Node#

Over the last several months the login node has been crashing on occasion. We have been monitoring and dissecting the kernel dumps from each crash and this behavior seems to be highly correlated with VS Code Remote-SSH extension activity. To prevent node instability, we have upgraded the storage drivers to the latest version. If you are a VS Code user and connect to klone via Remote-SSH, we have some recommendations to help limit the possibility that your work would cause system instability on the login node.

Responsible Usage of VS Code Extension Remote-SSH#

While developing your code with connectivity to the server is a great usage of our services, connecting directly to the login node via the Remote-SSH extension will result in VS Code server processes running silently in the background and leading to node instability. As a reminder, we prohibit users running processes on the login node.

New Documentation

The steps discussed here for responsible use of VS Code have been added to our documentation. Please review the solutions for connecting VS Code to HYAK.

  1. Check which processes are running on the login node, especially if you have been receiving klone usage violations when you are not aware of jobs running. Look for vscode-server among the listed processes.

    $ ps aux | grep UWNetID
  2. If you need to develop your code with connectivity to VS Code, use a ProxyJump to open a connection directly to a compute node. Step 1 documentation. and then use the Remote-SSH extension to connect to that node through VS Code on your local machine, preserving the login node for the rest of the community. Step 2 documentation.

  3. Lastly, VS Code’s high usage is due to it silently installing its built in features into the user's home directory ~/.vscode on klone enabling intelligent autocomplete features. This is a well known issue, and there is a solution that involves disabling the @builtin TypeScript plugin from the VS Code on your local machine. Here is a link to a blog post about the issue and the super-easy solution. Disabling @builtin TypeScript will reduce your usage of the shared resources and avoid problems.

In addition to the upgrade of the storage driver, we performed updates to security packages.

Training Opportunities#

We wanted to make you aware of two training opportunities with the San Diego Supercomputer Center. If you are interested in picking up some additional skills and experience in HPC, check this blog post.

Questions?#

If you have any questions for us, please reach out to the team by emailing help@uw.edu with Hyak in the subject line.

Upcoming HPC Training Opportunities

Kristen Finch

Kristen Finch

HPC Staff Scientist

Hello HYAK Community!

We wanted to make you aware of two training opportunities with the San Diego Supercomputer Center (SDSC). If you are interested in picking up some additional skills and experience in HPC, please check them out.

  • SDSC Cyberinfrastructure-Enabled Machine Learning (CIML) Summer Institute: The project is focused on teaching researchers and students the best practices for effectively running machine learning (ML) and data science applications on advanced cyberinfrastructure (CI) and high-performance computing (HPC) systems. Applications due 12 April 2024. https://www.sdsc.edu/education_and_training/ciml_summer_institute.html

  • SDSC HPC and Data Science Summer Institute: The program is aimed at researchers in academia and industry, especially in domains not traditionally engaged in supercomputing, who have problems that cannot typically be solved using local computing resources. Applications due 26 April 2024. https://www.sdsc.edu/education_and_training/summer_institute.html

  • SDSC Virtual Workshop; COMPLECS: Batch Computing: Getting Started with Batch Job Scheduling - Slurm Edition: Learn how to use Slurm, Hyak's batch job scheduler. In "our series on Batch Computing, we will introduce you to the concept of a distributed batch job scheduler — what they are, why they exist, and how they work — using the Slurm Workload Manager as our reference implementation and testbed. You will then learn how to write your first job script and submit it to an HPC System running Slurm as its scheduler. We will also discuss the best practices for how to structure your batch job scripts, teach you how to leverage Slurm environment variables, and provide tips on how to request resources from the scheduler to get your work done faster." Event held virtually on Thursday, March 21, 2024 11:00 AM - 12:30 PM PDT Link to Registration

Keep an eye on our blog for more opportunities and HYAK updates.

If you have any questions, please reach out to the team by emailing help@uw.edu and we sure to mention Hyak in the subject line. Thanks!