October 2024 Maintenance Details

Kristen Finch

HPC Staff Scientist

Hello Hyak Community,

Our October maintenance is complete. Thank you for your patience while we made package updates to node images to ensure the security and behavior you expect from Hyak klone.

The next maintenance will be Tuesday November 12, 2024.

New Tools Documentation

Our research computing interns have been hard at work adding documentation for new user tools that might help optimize your research computing. Click the links below to review the docs for these tools.

Squash Fuse - SquashFS packages multiple small files into a single read-only, compressed filesystem, reducing metadata calls and improving performance. This minimizes server load, enhancing throughput and efficiency in handling storage requests.

Use case on Hyak: In HPC, datasets often consist of numerous small files, which can lead to performance bottlenecks due to excessive metadata operations. By utilizing SquashFS, HPC applications can significantly reduce metadata overhead, improving data access speeds and enhancing overall system performance, particularly in large-scale distributed storage systems.
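As a rough way to judge whether a dataset fits the many-small-files pattern described above, the sketch below walks a directory and reports how many files it holds and how many fall under a size threshold. It is a minimal illustration, not part of the Hyak tooling: the names `TreeSummary` and `summarize_tree` and the 1 MiB default threshold are assumptions chosen for the example.

```python
import os
from dataclasses import dataclass


@dataclass
class TreeSummary:
    file_count: int        # regular files found under the root
    total_bytes: int       # combined size of those files
    small_file_count: int  # files smaller than the threshold


def summarize_tree(root: str, small_threshold: int = 1024 * 1024) -> TreeSummary:
    """Count regular files under `root` and how many are below `small_threshold` bytes.

    The 1 MiB default is an arbitrary value for illustration, not a Hyak guideline.
    """
    file_count = 0
    total_bytes = 0
    small_file_count = 0
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            # Skip symlinks and anything that is not a regular file.
            if os.path.islink(path) or not os.path.isfile(path):
                continue
            size = os.path.getsize(path)
            file_count += 1
            total_bytes += size
            if size < small_threshold:
                small_file_count += 1
    return TreeSummary(file_count, total_bytes, small_file_count)
```

A tree that reports a very large `file_count` with most files counted as small is the kind of dataset that may benefit from being packaged into a single SquashFS image.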

Checkpointing with DMTCP - DMTCP is a tool to transparently checkpoint and restart jobs, saving their state to disk so they can be resumed at a later time. It requires no changes to application code, which makes it easy to adopt. Using DMTCP with your code allows checkpointing at regular intervals, so if your job is pre-empted or reaches the time limit, it will resume from its last checkpoint.

Use case on Hyak: DMTCP offers a solution for folks who would like to use Hyak's ckpt partitions, but have jobs that exceed the ckpt time limits of 5 hours for CPU-only jobs and 8 hours for GPU-only jobs. Checkpointing with DMTCP facilitates efficient use of ckpt resources, allowing higher throughput for your jobs.
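To get a feel for the arithmetic, the sketch below estimates how many ckpt runs a long job would need if each run used the full time limit. It is a simplified illustration: the function name `ckpt_segments_needed` is hypothetical, and the estimate ignores checkpoint and restart overhead, queue wait, and early pre-emption, all of which can raise the real number.

```python
import math


def ckpt_segments_needed(total_hours: float, limit_hours: float) -> int:
    """Minimum number of ckpt runs needed to cover `total_hours` of work.

    Assumes every run lasts the full `limit_hours` and that no time is
    lost to writing or restoring checkpoints (an idealized lower bound).
    """
    if total_hours <= 0 or limit_hours <= 0:
        raise ValueError("total_hours and limit_hours must be positive")
    return math.ceil(total_hours / limit_hours)
```

For example, `ckpt_segments_needed(24, 5)` returns 5 for a 24-hour job under the 5-hour limit, and `ckpt_segments_needed(24, 8)` returns 3 under the 8-hour limit.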

Tools for Kopah Storage Users - We have installed command-line interface tools such as s3cmd and s5cmd on klone, and we provide instructions for using the Python library boto3 to interact with Kopah and retrieve data, so you can build Kopah S3 storage into your research computing applications on Hyak.

If you have any issues using these tools, please open a ticket by emailing help@uw.edu with "Hyak" in the subject line. We appreciate any feedback about how to improve ease of use for tools presented in our documentation.

Upcoming Training

We have planned two Hyak-specific trainings for this fall (more to come, stay tuned). These trainings will be held in person and will not be recorded, since recorded materials are already publicly accessible. Capacity is limited, so follow the links below to register today and guarantee your spot.

Hyak: Containers are your friend - Monday October 28 10am - 12pm

Hyak: Scheduling Jobs with Slurm - Thursday November 14 10am - 12pm

In the first hour and a half, we will go over content. The last 30 minutes will be reserved for questions.

Location: eScience Classroom; WRF Data Science Studio, UW Physics/Astronomy Tower, 6th Floor; 3910 15th Ave NE, Seattle, WA 98195

Keep an eye on your inbox for updates about additional trainings this fall.

Fall Office Hours

Hyak HPC Staff Scientist and Facilitator, Kristen Finch, will be holding office hours during fall term.

Zoom office hours will be held on Wednesdays at 2pm. Attendees need only register once and can attend any of the occurrences with the Zoom link that will arrive via email.

Click here to Register for Zoom Office Hours

In-person office hours will be held on Thursdays at 2pm at the eScience Institute (address: WRF Data Science Studio, UW Physics/Astronomy Tower, 6th Floor, 3910 15th Ave NE, Seattle, WA 98195). Click here to RSVP for in-person Office Hours.

Click here to visit the eScience Office Hours page to see additional eScience office hours including AI/ML, R, Earth Data, and Python (not available to help with Homework).

The Research Computing Club will be holding office hours during fall term. In-person office hours will be held at the eScience Institute (address: WRF Data Science Studio, UW Physics/Astronomy Tower, 6th Floor, 3910 15th Ave NE, Seattle, WA 98195).

| Officer | Date | Time |
| --- | --- | --- |
| Brenden Pelkie | 16 Oct | 1pm |
| Nels Schimek | 23 Oct | 1pm |
| Nels Schimek | 6 Nov | 1pm |
| Sam Shin | 19 Nov | 2pm |
| Teerth Mehta | 3 Dec | 2pm |

If you would like to request one-on-one help, please send a ticket to help@uw.edu with "Hyak Office Hour" in the subject line to coordinate a meeting with Kristen.

Please don't hesitate to reach out to the Hyak team with issues and feedback by emailing help@uw.edu with "Hyak" in the subject line to open a ticket.

Have a great October!

Happy computing,

Hyak Team

August 2024 Maintenance Details

Kristen Finch

HPC Staff Scientist

Hello Hyak Community,

Thanks again for your patience with our monthly scheduled maintenance. During this maintenance session, we were able to provide package updates to node images to ensure compliance with the latest operating system level security fixes and performance optimizations.

The next maintenance will be Tuesday September 10, 2024.

New self-hosted S3 storage option: KOPAH

We are happy to announce the preview launch of our self-hosted S3 storage called KOPAH. S3 storage is a solution for securely storing and managing large amounts of data, whether for personal use or research computing. It works like an online storage locker where you can store files of any size, accessible from anywhere with an internet connection. For researchers and those involved in data-driven studies, it provides a reliable and scalable platform to store, access, and analyze large datasets, supporting high-performance computing tasks and complex workflows.

S3 uses buckets as containers to store data, where each bucket can hold 100,000,000 objects, which are the actual files or data you store. Each object within a bucket is identified by a unique key, making it easy to organize and retrieve your data efficiently. Public links can be generated for KOPAH objects so that users can share buckets and objects with collaborators.
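To make the bucket and key relationship concrete, here is a small sketch that splits an `s3://` style address into its bucket name and object key. The function name `split_s3_uri` and the example bucket and key are made up for illustration and do not refer to real KOPAH resources.

```python
from typing import Tuple


def split_s3_uri(uri: str) -> Tuple[str, str]:
    """Split 's3://bucket/some/key' into ('bucket', 'some/key').

    The key is everything after the first '/' that follows the bucket
    name; slashes inside the key are part of the key itself.
    """
    prefix = "s3://"
    if not uri.startswith(prefix):
        raise ValueError("expected an address starting with s3://")
    bucket, _, key = uri[len(prefix):].partition("/")
    if not bucket:
        raise ValueError("missing bucket name")
    return bucket, key
```

For example, `split_s3_uri("s3://lab-data/run01/reads.fastq")` returns `("lab-data", "run01/reads.fastq")`: one bucket, and one key that uniquely identifies the object within it.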

Click here to learn more about KOPAH S3.

Who should use KOPAH?

KOPAH is a storage solution for anyone. Just like other storage options out there, you can upload, download, and view your storage bucket with specialized tools and share your data via the internet. For Hyak users, KOPAH provides another storage option for research computing. It is more affordable than /gscratch storage and can be used for active research computing with a few added steps for retrieving stored data prior to a job.
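The added step of retrieving stored data before a job could look something like the sketch below, which uses the boto3 library to copy objects from a bucket into a local directory. Treat it as an illustration under assumptions: the helper names `make_s3_client` and `stage_inputs`, the endpoint, bucket, keys, and destination path are all placeholders, and credentials would come from your own KOPAH account. Only `boto3.client` and `download_file` are actual boto3 calls.

```python
import os
from typing import Iterable, List


def make_s3_client(endpoint_url: str, access_key: str, secret_key: str):
    """Create an S3 client pointed at a self-hosted endpoint (placeholder values)."""
    import boto3  # third-party library; imported here so the helpers below load without it

    return boto3.client(
        "s3",
        endpoint_url=endpoint_url,
        aws_access_key_id=access_key,
        aws_secret_access_key=secret_key,
    )


def stage_inputs(client, bucket: str, keys: Iterable[str], dest_dir: str) -> List[str]:
    """Download each object in `keys` from `bucket` into `dest_dir`.

    Local files are named after the last path component of each key, so
    two keys with the same file name would overwrite each other.
    """
    os.makedirs(dest_dir, exist_ok=True)
    local_paths = []
    for key in keys:
        local_path = os.path.join(dest_dir, os.path.basename(key))
        client.download_file(bucket, key, local_path)
        local_paths.append(local_path)
    return local_paths


# Example with placeholders throughout:
# client = make_s3_client("https://<kopah-endpoint>", "<access-key>", "<secret-key>")
# stage_inputs(client, "my-bucket", ["inputs/sample1.csv"], "/path/to/job/inputs")
```

Running a step like this at the start of a job keeps the long-term copy in the bucket while the job reads from local storage.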

Test Users Wanted

Prior to September, we are inviting test users to try KOPAH and provide feedback about their experience. If you are interested in becoming a KOPAH test user, please email help@uw.edu with Hyak or KOPAH in the subject line.

Requirements:

  1. While we will not charge for the service until September 1, to sign up as a test user, we require a budget number and worktag. If the service doesn't work for you, you can cancel before September.
  2. We will ask for a name for the account. If your group has an existing account on Hyak (klone /gscratch), it makes sense for the names to match across services.
  3. Please be ready to respond with your feedback about the service.

Opportunities

PhD students should check out this opportunity for funding from NVIDIA: Graduate Research Fellowship Program

Questions? If you have any questions for us, please reach out to the team by emailing help@uw.edu with Hyak in the subject line.

May 2024 Maintenance Details

Kristen Finch

HPC Staff Scientist

Hello Hyak Community,

Thanks again for your patience with our monthly scheduled maintenance. There are some notable improvements we implemented today.

klone node image: Over the past few weeks, you may have noticed some klone instability. This was a result of some behind the scenes storage upgrades that inadvertently introduced wider impacts to the existing cluster automation. At the time, we introduced a temporary fix to get the cluster back online but with today’s maintenance we implemented a more comprehensive fix.

InfiniBand firmware: The klone cluster is built on the InfiniBand HPC interconnect for node-to-node communication. While klone originally launched with the HDR generation of InfiniBand, we have since upgraded mid-klone to have an HDR-NDR hybrid interconnect. NDR InfiniBand is required to support the latest compute slices we offer. We updated the firmware on our NDR switches following vendor recommendations for increased stability.

Apptainer on MOX: Apptainer (formerly Singularity) is the root-less containerization solution we provide on both Hyak clusters. Apptainer version 1.3.1 was deployed on both klone and MOX. As a reminder, on klone Apptainer is accessed through a module and is only available on compute nodes after module load apptainer. On MOX, Apptainer is default software and can be accessed with Apptainer commands directly after starting an interactive job, for example, apptainer --version.

Training Opportunities: COMPLECS (San Diego Supercomputer Center) is hosting an Intermediate Linux Shell Scripting online workshop on Thursday, May 16 at 11:00 am Pacific Time. Register here.

Our next scheduled maintenance will be Tuesday June 11, 2024. Stay informed by joining our mailing list. Sign up here.

Questions? If you have any questions for us, please reach out to the team by emailing help@uw.edu with Hyak in the subject line.

April 2024 Maintenance Details

Kristen Finch

HPC Staff Scientist

Hello Hyak Community,

Thank you for your patience this month while there was more scheduled downtime than usual to allow for electrical reconfiguration work in the UW Tower data center. We appreciate how disruptive this work has been in recent weeks. Please keep in mind that this work by the data center team has been critical in allowing the facility to increase available power to the cluster to provide future growth capacity, which was limiting deployment of new equipment in recent months.

The Hyak team was able to use the interruption to implement the following changes:

  • Increase in checkpoint (--partition=ckpt) runtime for GPU jobs from 4-5 hours to 8-9 hours (pre-emption for requeuing will still occur subject to cluster utilization). Please see the updated documentation page for information about using idle resources.
  • The NVIDIA driver has been updated for all GPUs.

Our next scheduled maintenance will be Tuesday May 14, 2024.

Training Opportunities

Follow NSF ACCESS Training and Events posting HERE to find online webinars about containers, parallel computing, using GPUs, and more from HPC providers around the USA.

Questions? If you have any questions for us, please reach out to the team by emailing help@uw.edu with Hyak in the subject line.

March 2024 Maintenance Details

Kristen Finch

HPC Staff Scientist

Hello Hyak Users,

For our March maintenance we had some notable changes we wanted to share with the community.

Login Node

Over the last several months the login node has been crashing on occasion. We have been monitoring and dissecting the kernel dumps from each crash and this behavior seems to be highly correlated with VS Code Remote-SSH extension activity. To prevent node instability, we have upgraded the storage drivers to the latest version. If you are a VS Code user and connect to klone via Remote-SSH, we have some recommendations to help limit the possibility that your work would cause system instability on the login node.

Responsible Usage of VS Code Extension Remote-SSH

While developing your code with connectivity to the server is a great usage of our services, connecting directly to the login node via the Remote-SSH extension will result in VS Code server processes running silently in the background, which can lead to node instability. As a reminder, we prohibit users from running processes on the login node.

New Documentation

The steps discussed here for responsible use of VS Code have been added to our documentation. Please review the solutions for connecting VS Code to Hyak.

  1. Check which processes are running on the login node, especially if you have been receiving klone usage violations when you are not aware of jobs running. Look for vscode-server among the listed processes.

    $ ps aux | grep UWNetID
  2. If you need to develop your code with connectivity to VS Code, use a ProxyJump to open a connection directly to a compute node (Step 1 documentation), and then use the Remote-SSH extension to connect to that node through VS Code on your local machine, preserving the login node for the rest of the community (Step 2 documentation).

  3. Lastly, VS Code’s high usage is due to it silently installing its built-in features into the user's home directory ~/.vscode on klone to enable intelligent autocomplete features. This is a well-known issue, and there is a solution that involves disabling the @builtin TypeScript plugin in VS Code on your local machine. Here is a link to a blog post about the issue and the super-easy solution. Disabling @builtin TypeScript will reduce your usage of the shared resources and avoid problems.
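The check in step 1 can also be scripted. The sketch below filters the text produced by `ps aux` for lines that belong to one user and mention vscode-server. It is a minimal illustration rather than an official Hyak tool: the function name `find_vscode_server` is hypothetical, and it assumes the standard 11-column `ps aux` layout with the command in the last column.

```python
from typing import List


def find_vscode_server(ps_output: str, user: str) -> List[str]:
    """Return the `ps aux` lines owned by `user` whose command mentions vscode-server.

    Assumes the usual layout: USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND.
    Note that ps may abbreviate long usernames, in which case the match on
    the USER column would need adjusting.
    """
    hits = []
    for line in ps_output.splitlines():
        # Split into at most 11 fields so the full command stays in one piece.
        fields = line.split(None, 10)
        if len(fields) < 11:
            continue
        if fields[0] == user and "vscode-server" in fields[10]:
            hits.append(line)
    return hits


# Example (replace UWNetID with your own):
# import subprocess
# output = subprocess.run(["ps", "aux"], capture_output=True, text=True, check=True).stdout
# for line in find_vscode_server(output, "UWNetID"):
#     print(line)
```

An empty result on the login node suggests no lingering VS Code server processes are running under your account.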

In addition to the upgrade of the storage driver, we performed updates to security packages.

Training Opportunities

We wanted to make you aware of two training opportunities with the San Diego Supercomputer Center. If you are interested in picking up some additional skills and experience in HPC, check this blog post.

Questions?

If you have any questions for us, please reach out to the team by emailing help@uw.edu with Hyak in the subject line.

Upcoming HPC Training Opportunities

Kristen Finch

HPC Staff Scientist

Hello Hyak Community!

We wanted to make you aware of three training opportunities with the San Diego Supercomputer Center (SDSC). If you are interested in picking up some additional skills and experience in HPC, please check them out.

  • SDSC Cyberinfrastructure-Enabled Machine Learning (CIML) Summer Institute: The project is focused on teaching researchers and students the best practices for effectively running machine learning (ML) and data science applications on advanced cyberinfrastructure (CI) and high-performance computing (HPC) systems. Applications due 12 April 2024. https://www.sdsc.edu/education_and_training/ciml_summer_institute.html

  • SDSC HPC and Data Science Summer Institute: The program is aimed at researchers in academia and industry, especially in domains not traditionally engaged in supercomputing, who have problems that cannot typically be solved using local computing resources. Applications due 26 April 2024. https://www.sdsc.edu/education_and_training/summer_institute.html

  • SDSC Virtual Workshop; COMPLECS: Batch Computing: Getting Started with Batch Job Scheduling - Slurm Edition: Learn how to use Slurm, Hyak's batch job scheduler. In "our series on Batch Computing, we will introduce you to the concept of a distributed batch job scheduler — what they are, why they exist, and how they work — using the Slurm Workload Manager as our reference implementation and testbed. You will then learn how to write your first job script and submit it to an HPC System running Slurm as its scheduler. We will also discuss the best practices for how to structure your batch job scripts, teach you how to leverage Slurm environment variables, and provide tips on how to request resources from the scheduler to get your work done faster." Event held virtually on Thursday, March 21, 2024 11:00 AM - 12:30 PM PDT Link to Registration

Keep an eye on our blog for more opportunities and Hyak updates.

If you have any questions, please reach out to the team by emailing help@uw.edu and be sure to mention Hyak in the subject line. Thanks!