
13 posts tagged with "storage"


August 2025 Maintenance Update

· 6 min read
Kristen Finch
Director of Research Computing Solutions

During August’s maintenance, we refreshed the operating system images for both login and compute nodes, upgraded Slurm to version 25.05.2, and upgraded Klone's filesystem (GPFS) for increased stability. We also introduced a new Globus OneDrive connector, making it easier than ever to transfer files between OneDrive and Hyak Klone or Kopah Storage.

Stay informed by subscribing to our mailing list and the UW-IT Research Computing Events Calendar. The next maintenance is scheduled for Tuesday, September 9, 2025 (the second Tuesday of the month).

Notable Updates

  • Node image updates – Routine updates plus installation of new Slurm utilities that we will test for job efficiency monitoring.
  • Slurm upgrade to 25.05.2 – Resolves a bug in version 25.05.0 where the --gres flag allowed the same resources to be allocated to more than one job, and fixes X11 forwarding. Read more about this version. A minimal example batch script using --gres follows this list.
  • GPFS upgrade to 5.1.9.11 – Improves stability and includes several bug fixes. Read more about this version.
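
For reference, here is a minimal sketch of a batch script that requests a single GPU with --gres; the account and partition names are placeholders you would replace with your own.

#!/bin/bash
#SBATCH --job-name=gpu-test
#SBATCH --account=mylab        # placeholder account name
#SBATCH --partition=gpu-a40    # placeholder partition name
#SBATCH --gres=gpu:1           # request one GPU; 25.05.2 no longer allows the same GRES to be allocated to multiple jobs
#SBATCH --cpus-per-task=4
#SBATCH --mem=16G
#SBATCH --time=01:00:00

nvidia-smi    # confirm the allocated GPU is visible inside the job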

New Features

Globus OneDrive Connector – UW-IT Research Computing has added OneDrive as a connector to Globus, making transfers between OneDrive and Hyak Klone or OneDrive and Kopah Storage easier than ever before!

Search for "UW OneDrive" in Globus to use the connector.
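
If you prefer the Globus CLI to the web app, a transfer through the new connector looks roughly like the sketch below; the collection UUIDs, paths, and the Klone collection name are placeholders, and you would substitute the IDs returned by the search commands. The web interface works just as well.

globus login
globus endpoint search "UW OneDrive"    # note the UUID of the OneDrive collection
globus endpoint search "klone"          # note the UUID of the Klone collection

# Placeholder UUIDs; replace with the IDs returned above.
ONEDRIVE_ID=aaaaaaaa-bbbb-cccc-dddd-eeeeeeeeeeee
KLONE_ID=11111111-2222-3333-4444-555555555555

# Recursively copy a folder from OneDrive to gscratch on Klone.
globus transfer --recursive --label "OneDrive to Klone" \
  "$ONEDRIVE_ID:/Documents/project-data/" \
  "$KLONE_ID:/gscratch/mylab/project-data/"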

Things to note

  • Did you know that the UW community is eligible for 5 TB of storage on OneDrive as part of the Office 365 Suite? Click here to learn more.
  • While OneDrive is HIPAA and FERPA compatible, encryption is not enforced for Globus transfers on any of our connectors (OneDrive, Kopah, Klone). As a reminder, Klone and Kopah are NOT aligned with HIPAA; please keep this in mind now that OneDrive can transfer to either cluster.
  • Sharing with external partners is not enabled for our OneDrive or Klone connectors via Globus. Sharing is permitted for Kopah.
  • Read more

Office Hours

  • Wednesdays at 2pm on Zoom. Attendees need only register once and can attend any of the occurrences with the Zoom link that will arrive via email. Click here to Register for Zoom Office Hours.
  • Thursdays at 2pm in person in eScience. (address: WRF Data Science Studio, UW Physics/Astronomy Tower, 6th Floor, 3910 15th Ave NE, Seattle, WA 98195).
  • See our office hours schedule, subscribe to event updates, and bookmark our UW-IT Research Computing Events Calendar.

If you would like to request one-on-one help, please send an email to help@uw.edu with "Hyak Office Hour" in the subject line to coordinate a meeting.

UW Community Opportunities

  • The Data Science and AI Accelerator pairs eScience Institute data scientists with researchers from any field of study to work on focused, collaborative projects. Collaborations may center on analysis of an existing dataset to answer a specific research question, an implementation of software for processing or analyzing data, data visualization tools, or tools for data interpretation. Accelerator Projects may be submitted at any time. Projects for Fall 2025 must be received by August 14th, 2025. LEARN MORE HERE.
  • Applications for the CSDE Data Science and Demography Training program are due Friday, August 22nd by 5pm. An information session will take place Wednesday, August 13th at 10:00 a.m. DETAILS HERE.
  • Cloud Clinic August 14 10-11am - guest presenter Niris Okram from AWS presenting on “The Utility of Capacity Blocks: Optimizing computing horsepower per budget dollar.” This will be followed by a short presentation on building small-scale (“Littlest”) JupyterHubs. LEARN MORE HERE.
  • DubHacks - October 18 - October 19, 2025 - DubHacks 2025 takes you back to where it all began—the childhood bedroom. A space for imagination, curiosity, and bold ideas. Now, with code instead of crayons, you get to build what makes your younger self proud. No limits, just pure creativity. What will you create when you let yourself play?

External Training Opportunities

  • Automating Research with Globus: The Modern Research IT Platform - Aug. 18, 2025, 9 a.m. – 12 p.m. (Pacific Time) This workshop introduces Globus Flows and its role in automating research workflows. Participants will explore data portals, science gateways, and commons, enabling seamless data discovery and access. Enroll here.
  • CU-RMACC Webinars: Should I be Scared of AI? Aug. 18, 2025 - 3:00 PM - 4:00 PM EDT Throughout history, new technologies have sparked both excitement and fear—AI is no different. In this talk, Dr. Shelley Knuth, Assistant Vice Chancellor for Research Computing at the University of Colorado explores the common fears surrounding artificial intelligence, why we feel them, and how we can shift our perspective to focus on positive outcomes. We’ll look at practical ways to address risks, embrace innovation, and move forward with AI as a powerful tool rather than something to fear. Learn more and register.
  • COMPLECS: Batch Computing (Part II): Getting Started with Batch Job Scheduling 08/21/25 - 2:00 PM - 3:30 PM EDT Learn more and register.
  • NUG Community Call: A Birds-Eye View of Using Cuda with C/C++ on Perlmutter (Part 2) August 27, 2025, 11 a.m. - 12:30 p.m. PDT - In this two-part training series, users will be introduced to the basics of using CUDA on Perlmutter at NERSC. The training will focus on the basics of the Perlmutter architecture and NVIDIA GPUs and programming concepts with CUDA using C/C++. Event 2 focuses on advanced kernels and custom CUDA kernels in C/C++. Learn more and register.
  • COMPLECS: Linux Tools for Text Processing 09/04/25 - 2:00 PM - 3:30 PM EDT Learn more and register.
  • Python for HPC 09/09/25 - 2:00 PM - 3:30 PM EDT Learn more and register.
  • COMPLECS: Data Transfer 09/18/25 - 2:00 PM - 3:30 PM EDT Learn more and register.
  • COMPLECS: Interactive Computing 10/09/25 - 2:00 PM - 3:30 PM EDT Learn more and register.
  • COMPLECS: Linux Shell Scripting 10/23/25 - 2:00 PM - 3:30 PM EDT Learn more and register.
  • COMPLECS: Using Regular Expressions with Linux Tools 11/06/25 - 2:00 PM - 3:30 PM EST Learn more and register.
  • COMPLECS: Batch Computing (Part III) High-Throughput and Many-Task Computing - Slurm Edition 12/04/25 - 2:00 PM - 3:30 PM EST Learn more and register.
  • R for HPC 12/04/25 - 2:00 PM - 3:30 PM EST Learn more and register.

Positions

  • Two PhD positions in Artificial Intelligence - in collaboration with German Aerospace Center and TU Dresden, Germany. Deadline to apply: 27 August 2025. Apply Now!

Questions about Hyak Klone, Tillicum, or any other UW-IT Research Computing Service? Fill out our Research Computing Consulting intake form. We are here to help!

Happy Computing,

Hyak Team

July 2025 Maintenance Update

· 6 min read
Kristen Finch
Director of Research Computing Solutions

During July's maintenance, we refreshed the operating system images for both login and compute nodes, including the newest version of Slurm, and we implemented some changes critical to provisioning our new GPU system, Tillicum (launching in Fall 2025). Stay informed by subscribing to our mailing list and the UW-IT Research Computing Events Calendar. The next maintenance is scheduled for Tuesday, August 12, 2025 (AKA the 2nd Tuesday of the month).

Notable Updates

  • Routine package updates – images for both the login and compute nodes have been refreshed to incorporate the latest Linux OS security updates and system patches.
  • Slurm Upgrade to version 25.05 – Slurm 25.05 introduces encrypted job communication, improved support for complex network topologies, and new features like optional TLS, job start events in Kafka, and better license request handling. While you won’t notice major changes in your day-to-day workflow, this upgrade improves security, enables more flexible job scheduling, and lays the groundwork for new features in the future. Learn more from Slurm's release notes.
  • SSHD changes – We’ve updated some behind-the-scenes SSH settings to improve login handling. These changes help ensure account access stays consistent across Klone and Tillicum, but you won’t need to do anything differently when connecting.

New Training Videos

This month we uploaded several training videos to our YouTube Playlist that may be of interest.

Summer Office Hours

  • Wednesdays at 2pm on Zoom. Attendees need only register once and can attend any of the occurrences with the Zoom link that will arrive via email. Click here to Register for Zoom Office Hours.
  • Thursdays at 2pm in person in eScience. (address: WRF Data Science Studio, UW Physics/Astronomy Tower, 6th Floor, 3910 15th Ave NE, Seattle, WA 98195).
  • See our office hours schedule, subscribe to event updates, and bookmark our UW-IT Research Computing Events Calendar.

If you would like to request one-on-one help, please send an email to help@uw.edu with "Hyak Office Hour" in the subject line to coordinate a meeting.

External Training Opportunities

  • NVIDIA Workshop: Building Transformer-Based Natural Language Processing Applications - 07/09/25 - 10:00 AM - 6:00 PM EDT Learn how to apply and fine-tune a Transformer-based deep learning model to Natural Language Processing (NLP) tasks. In this course, you'll construct a Transformer neural network in PyTorch, build a named-entity recognition (NER) application with BERT, and deploy the NER application with ONNX and TensorRT to a Triton inference server. Upon completion, you'll be proficient in task-agnostic applications of Transformer-based models. Learn More and Register.

  • COMPLECS: Intermediate Linux - 07/10/25 - 2:00 PM - 3:30 PM EDT Knowledge of Linux is indispensable for using advanced CI. While GUIs are becoming more prevalent, being able to work at the command line interface (CLI) provides the greatest power and flexibility. In this session, we assume that participants are already comfortable with basic Linux operations such as creating, deleting and renaming files, and navigating between directories. Topics covered include the filesystem hierarchy, file permissions, symbolic and hard links, wildcards and file globbing, finding commands and files, environment variables and modules, configuration files, aliases, and history. Learn More and Register.

  • Codee for Beginners: Automatic Code Optimization with Codee - July 29, 2025, 9 - 10:30 a.m. PDT This is an introductory webinar showing how Codee’s AutoFix feature can automatically accelerate computational kernels, representing performance hotspots, on both CPUs and GPUs. With AutoFix, developers can simply instruct Codee to insert OpenMP, OpenACC, and compiler-specific directives, as well as language-specific constructs (e.g., Fortran’s “do concurrent”) to vectorize, parallelize, and offload compute-intensive loops. AutoFix can even combine optimization techniques, such as multithreading and vectorization for nested loops, or OpenACC alongside OpenMP to maximize compatibility, allowing even novice programmers to write expert-level parallel code.

  • NUG Community Call: A Birds-Eye View of Using Cuda with C/C++ on Perlmutter (Part 1) - July 30, 2025, 11 a.m. - 12:30 p.m. PDT In this two-part training series, users will be introduced to the basics of using CUDA on Perlmutter at NERSC. The training will focus on the basics of the Perlmutter architecture and NVIDIA GPUs and programming concepts with CUDA using C/C++. This training is also open to non-NERSC users. Learn More and Register.

  • Accelerating and Scaling Python for HPC - August 8, 2025, 9 a.m. - 5 p.m. PDT In this interactive tutorial you’ll learn how to write, debug, profile, and optimize high-performance, multi-node GPU applications in Python. You'll learn and master: CuPy for drop-in GPU acceleration of NumPy workflows; Nvmath-python for high level API for integrating Python with NVIDIA math libraries; Numba for writing custom kernels that match the performance of C++ and Fortran; and mpi4py for scaling across thousands of nodes. Along the way we’ll learn how to profile our code, debug tricky kernels, and leverage foundational and domain-specific accelerated libraries. Learn More and Register.

  • Automating Research with Globus: The Modern Research IT Platform - Aug. 18, 2025, 9 a.m. – 12 p.m. (Pacific Time) This workshop introduces Globus Flows and its role in automating research workflows. Participants will explore data portals, science gateways, and commons, enabling seamless data discovery and access. Enroll here.

Questions about Hyak Klone, Tillicum, or any other UW-IT Research Computing Service? Fill out our Research Computing Consulting intake form. We are here to help!

Happy Computing,

Hyak Team

June 2025 Maintenance Update

· 6 min read
Kristen Finch
Director of Research Computing Solutions

During June's maintenance, we've refreshed the operating system images for both login and compute nodes, and we've responded to user feedback with a solution to make Cron jobs persistent. Good news: we are holding office hours all summer to support your research grind. Stay informed by subscribing to our mailing list and the UW-IT Research Computing Events Calendar. The next maintenance is scheduled for Tuesday July 8, 2025 (AKA the 2nd Tuesday of the month).

Notable Updates

  • Routine package updates - images for both the login and compute nodes have been refreshed to incorporate the latest Linux OS security updates and system patches.
  • Slurm database lock timeout settings adjusted to match documentation best practices.
  • Cron job system improvements – our users provided feedback that their Cron jobs were being lost after monthly maintenance. We resolved this as follows:
    • User crontabs moved to GPFS (/gscratch) for persistence across maintenance
    • Only one login node will now run user cron jobs (preventing duplication)
    • Users will need to re-create their crontabs one more time after this maintenance
    • This is intended to be a permanent fix—no more resets in future maintenance
    • FYI - Cron jobs are recurring scheduled tasks run by the system using each user's crontab.
    • We recommend scrontab for routine operations. Learn more from Slurm. Learn more from NERSC. A minimal scrontab sketch follows this list.
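
If you are new to scrontab, here is a minimal sketch; the account, partition, and script path are placeholders. Run scrontab -e to edit your entries, and the #SCRON lines set sbatch-style options for the recurring job.

# Inside scrontab -e (account, partition, and script path are placeholders)
#SCRON --account=mylab
#SCRON --partition=cpu-small
#SCRON --time=00:15:00
#SCRON --job-name=nightly-cleanup
# Run every day at 03:00
0 3 * * * /gscratch/mylab/scripts/cleanup.sh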

Action Required: Research Computing Club (stf account) Members

To keep your access to RCC-supported Hyak accounts, please fill out the following form by Friday, June 13, 2025:

2025 RCC Usage Check-In Form

This short form is required for Student Technology Fee reporting and ensures the RCC can continue offering free computing resources to UW students. It only takes a few minutes; just tell us how you’ve used RCC resources this past year. Thanks for helping us keep RCC resources funded and accessible!

Spotlight: Kopah Object Storage

Our Kopah S3-compatible storage service is available to all campus researchers and staff. It’s a flexible, scalable storage solution to complement your research computing portfolio.

If you missed our recent Data Storage Day on May 5, we’ve published the full set of demonstration videos, along with the topics covered and relevant links, on our YouTube Playlist.

Whether you're just getting started or looking to expand your use of campus storage resources, this is a great place to learn more.

Summer Office Hours

  • Wednesdays at 2pm on Zoom. Attendees need only register once and can attend any of the occurrences with the Zoom link that will arrive via email. Click here to Register for Zoom Office Hours.
  • Thursdays at 2pm in person in eScience. (address: WRF Data Science Studio, UW Physics/Astronomy Tower, 6th Floor, 3910 15th Ave NE, Seattle, WA 98195).
  • See our office hours schedule, subscribe to event updates, and bookmark our UW-IT Research Computing Events Calendar.

If you would like to request one-on-one help, please send an email to help@uw.edu with "Hyak Office Hour" in the subject line to coordinate a meeting.

Opportunities

Computing Training from eScience and more


Introduction to Text Mining - Friday, June 27, 2025, 2 – 3 p.m., Open Scholarship Commons Presentation Space. Discover the power of text mining in this interactive workshop, where you will learn techniques for collecting, cleaning, and analyzing textual data through hands-on exercises using Python and Jupyter Notebooks. In this session, you will:

  • Learn methods for scraping and extracting text from web sources.
  • Gain skills in preprocessing textual data, such as removing HTML tags, tokenization, and handling stop words.
  • Explore techniques for visualizing and analyzing word frequencies to uncover hidden themes and trends.
  • Use deep learning models to identify relevant text by analyzing semantic similarity.
  • Register here

External Training Opportunities

  • COMPLECS: Code Migration - Thursday, June 12, 2025 - 11:00 a.m. – 12:30 p.m. (Pacific Time) We will cover typical approaches to moving your computations to HPC resources: using applications/software packages already available on the system through Linux environment modules; compiling code from source with information on compilers, libraries, and optimization flags to use; setting up Python and R environments; using conda-based environments; managing workflows; and using containerized solutions via Singularity. Register here!
  • Automating Research with Globus: The Modern Research IT Platform - Aug. 18, 2025, 9 a.m. – 12 p.m. (Pacific Time) This workshop introduces Globus Flows and its role in automating research workflows. Participants will explore data portals, science gateways, and commons, enabling seamless data discovery and access. Enroll here.
  • HPC Fundamentals: June 11, 9 am – 4 pm PDT & June 12, 9 am – 12 pm PDT - This 1.5-day hybrid training, provided in collaboration with HPC Carpentries, is for novice HPC users to learn the basic skills they will need to start using an HPC resource. Capacity is limited to 40 learners; application and registration are required. Register here.
  • OLCF Julia for Science: June 19, 10 am – 1 pm PDT; also June 26, 10 am – 1 pm PDT - The Oak Ridge Leadership Computing Facility (OLCF), in conjunction with the Oak Ridge National Laboratory Computer Science and Mathematics Division (CSMD), will host Julia for Science, a 3-hour tutorial focused on introductory aspects of the Julia programming language, and ecosystem for computation and data analysis. This training provides a hands-on way to learn more about using Julia and parallel code in scientific computing. Register here.
  • Crash Course in Supercomputing: June 23, 9 am – 4 pm PDT - In this course, students will learn to write parallel programs that can be run on a supercomputer. We begin by discussing the concepts of parallelization before introducing MPI and OpenMP, the two leading parallel programming libraries. Finally, the students will put together all the concepts from the class by programming, compiling, and running a parallel code on one of the NERSC supercomputers. Training accounts will be provided for students who have not yet set up a NERSC account. This hybrid training, as part of the 2025 Berkeley Lab Computational Sciences Summer Student Program, is also open to NERSC, ALCF, LANL, OLCF, and TACC users. This training is geared towards novice parallel programmers. Register here.

If you have any questions about using Hyak, please start a help request by emailing help@uw.edu with "Hyak" in the subject line.

Happy Computing,

Hyak Team

JuiceFS or using Kopah on Klone

· 7 min read
Nam Pho
Research Computing

If you haven't heard, we recently launched an on-campus S3-compatible object storage service called Kopah (docs) that is available to the research community at the University of Washington. Kopah is built on top of Ceph and is designed to be a low-cost, high-performance storage solution for data-intensive research.

warning

From our testing we have observed significant performance challenges for JuiceFS in standalone mode, as demonstrated in this blog post. We do not recommend JuiceFS as a solution for demanding workflows.

While the deployment of Kopah was welcome news to those who are comfortable working with S3-compatible cloud solutions, we recognize some folks may be hesitant to give up their familiarity with POSIX file systems. If that sounds like you, we explored the use of JuiceFS, a distributed file system that provides a POSIX interface on top of object storage, as a potential solution.

info

Simplistically, object storage is usually presented as a pair of API keys, with data accessed through a command-line tool that wraps API calls, whereas POSIX is the familiar file-and-directory interface you typically get when interacting with cluster storage from the command line.
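
As a rough illustration of the difference (assuming s3cmd is already configured with your Kopah keys; the bucket name and gscratch path are placeholders):

# Object storage: API keys plus a CLI tool that wraps API calls
s3cmd put results.tar.gz s3://my-bucket/
s3cmd get s3://my-bucket/results.tar.gz

# POSIX: ordinary paths on the cluster's mounted file system
cp results.tar.gz /gscratch/mylab/
ls -lh /gscratch/mylab/results.tar.gz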

Installation

JuiceFS isn't installed by default so you will need to compile it yourself or download the pre-compiled binary from their release page.

As of January 2025 the latest version is 1.2.3, and you want the amd64 version if using it from Klone. The command below will download and extract the binary to your current working directory.

wget https://github.com/juicedata/juicefs/releases/download/v1.2.3/juicefs-1.2.3-linux-amd64.tar.gz -O - | tar xzvf -

I move it to a folder in my $PATH so I can run it from anywhere by just calling the binary; your personal environment may vary here.

mv -v juicefs ~/bin/

Verify you can run JuiceFS.

npho@klone-login03:~ $ juicefs --version
juicefs version 1.2.3+2025-01-22.4f2aba8
npho@klone-login03:~ $

Cool, now we can start using JuiceFS!

Standalone Mode

There are two ways to run JuiceFS: standalone or distributed mode. This blog post explores the former. Standalone mode is meant to present Kopah via POSIX on Klone only. The key points are:

  1. An active juicefs process must be running for as long as you want to access the file system.
  2. The file system is only available on the node where you run that process.

If you want to run JuiceFS on multiple nodes or with multiple users, we will have another proof-of-concept covering distributed mode in the future.

Create Filesystem

JuiceFS separates the data (placed into S3 object storage) and the metadata, which is kept locally in a database. The command below will create the myjfs filesystem and store the metadata in a SQLite database called myjfs.db in the directory where the command is run. It puts the data itself into a Kopah bucket called npho-project.

juicefs format \
--storage s3 \
--bucket https://s3.kopah.uw.edu/npho-project \
--access-key REDACTED \
--secret-key REDACTED \
sqlite3://myjfs.db myjfs

You can rename the metadata file and the filesystem name to whatever you want (they don't have to match). The same goes for the bucket name on Kopah. However, I would strongly recommend having unique metadata file names that match the file system names for ease of tracking alongside the bucket name itself.

npho@klone-login03:~ $ juicefs format \
> --storage s3 \
> --bucket https://s3.kopah.uw.edu/npho-project \
> --access-key REDACTED \
> --secret-key REDACTED \
> sqlite3://myjfs.db myjfs
2025/01/31 11:52:47.940709 juicefs[1668088] <INFO>: Meta address: sqlite3://myjfs.db [interface.go:504]
2025/01/31 11:52:47.944930 juicefs[1668088] <INFO>: Data use s3://npho-project/myjfs/ [format.go:484]
2025/01/31 11:52:48.666657 juicefs[1668088] <INFO>: Volume is formatted as {
"Name": "myjfs",
"UUID": "eb47ec30-c1f7-4a92-9b17-23c4beae7f76",
"Storage": "s3",
"Bucket": "https://s3.kopah.uw.edu/npho-project",
"AccessKey": "removed",
"SecretKey": "removed",
"BlockSize": 4096,
"Compression": "none",
"EncryptAlgo": "aes256gcm-rsa",
"KeyEncrypted": true,
"TrashDays": 1,
"MetaVersion": 1,
"MinClientVersion": "1.1.0-A",
"DirStats": true,
"EnableACL": false
} [format.go:521]
npho@klone-login03:~ $

You can verify there is now a myjfs.db file in your current working directory. It's a SQLite database file that stores your file system metadata.

We can also verify the npho-project bucket was created on Kopah to store the data itself.

npho@klone-login03:~ $ s3cmd -c ~/.s3cfg-default ls                                      
2025-01-31 19:48 s3://npho-project
npho@klone-login03:~ $

You should run juicefs format --help to view the full range of options and customize the parameters of your file system to your needs (an example combining several of these options follows this list), but briefly:

  • Encryption: When you create the file system and format it, you can see it has encryption by default using AES256. You can override this using the --encrypt-algo flag if you prefer chacha20-rsa, or you can use key-file-based encryption and provide your private key using the --encrypt-rsa-key flag.
  • Compression: This is not enabled by default, and enabling it carries a computational penalty when you access your files, since they need to be decompressed or recompressed on the fly.
  • Quota: By default there is no block quota (set with --capacity, in GiB) or inode quota (set with --inodes, in number of files) enforced at the file system level. If you do not explicitly set these, they will match whatever you get from Kopah. Setting them explicitly is still useful if you want multiple projects or file systems in JuiceFS to share the same Kopah account with some level of separation.
  • Trash: By default, files are not deleted immediately but moved to a trash folder, similar to most desktop systems. This is set with the --trash-days flag, and you can set it to 0 if you want files to be deleted immediately. The default is 1 day, after which the file is permanently deleted.
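
As a sketch, a format command combining several of these options might look like the following; the bucket URL and keys are placeholders, as in the example above, and the limits are arbitrary.

# 500 GiB block quota, 5,000,000 inode limit, lz4 compression, no trash retention
juicefs format \
  --storage s3 \
  --bucket https://s3.kopah.uw.edu/npho-project \
  --access-key REDACTED \
  --secret-key REDACTED \
  --capacity 500 \
  --inodes 5000000 \
  --compression lz4 \
  --trash-days 0 \
  sqlite3://myjfs.db myjfs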

Mount Filesystem

Running the command below will mount your newly created file system at the myproject folder in your home directory. The folder does not need to exist beforehand.

juicefs mount sqlite3://myjfs.db ~/myproject --background
warning

The SQLite database file is critical; do not lose it. You can move it around afterwards, but it contains all the metadata about your files.

This process occurs in the background.

warning

Where you mount your file system the first time is where it will be expected to be mounted going forward.

npho@klone-login03:~ $ juicefs mount sqlite3://myjfs.db ~/myproject --background
2025/01/31 11:57:01.652279 juicefs[1690855] <INFO>: Meta address: sqlite3://myjfs.db [interface.go:504]
2025/01/31 11:57:01.654920 juicefs[1690855] <INFO>: Data use s3://npho-project/myjfs/ [mount.go:629]
2025/01/31 11:57:02.156898 juicefs[1690855] <INFO>: OK, myjfs is ready at /mmfs1/home/npho/myproject [mount_unix.go:200]
npho@klone-login03:~ $

Use Filesystem

Now with the file system mounted (at ~/myproject) you can use it like any other POSIX file system.

npho@klone-login03:~ $ cp -v LICENSE myproject 
'LICENSE' -> 'myproject/LICENSE'
npho@klone-login03:~ $ ls myproject
LICENSE
npho@klone-login03:~ $

Remember, you won't be able to see it in the bucket because it is encrypted before being stored there.

Recover Deleted Files

If you enabled the trash can option then you can recover files up until the permanent delete date.

First delete a file on the file system.

npho@klone-login03:~ $ cd myproject 
npho@klone-login03:myproject $ rm -v LICENSE
removed 'LICENSE'
npho@klone-login03:myproject $

Verify the file is deleted, then recover it from the trash bin.

npho@klone-login03:myproject $ ls          
npho@klone-login03:myproject $ ls -alh
total 23K
drwxrwxrwx 2 root root 4.0K Jan 31 12:54 .
drwx------ 48 npho all 8.0K Jan 31 13:08 ..
-r-------- 1 npho all 0 Jan 31 11:57 .accesslog
-r-------- 1 npho all 2.6K Jan 31 11:57 .config
-r--r--r-- 1 npho all 0 Jan 31 11:57 .stats
dr-xr-xr-x 2 root root 0 Jan 31 11:57 .trash
npho@klone-login03:myproject $ ls .trash
2025-01-31-20
npho@klone-login03:myproject $ ls .trash/2025-01-31-20
1-2-LICENSE
npho@klone-login03:myproject $ cp -v .trash/2025-01-31-20/1-2-LICENSE LICENSE
'.trash/2025-01-31-20/1-2-LICENSE' -> 'LICENSE'
npho@klone-login03:myproject $ ls
LICENSE
npho@klone-login03:myproject $

As you can see, deleted files are tracked by their deletion date, and you recover a file by copying it back out of the trash.

Unmount Filesystem

When you are done using the file system you can unmount it with the command below.

npho@klone-login03:~ $ juicefs umount myproject
npho@klone-login03:~ $

Remember, the file system is only accessible in standalone mode so long as a juicefs process is running. Since we ran it in the background you will need to explicitly unmount it.

Questions?

Hopefully you found this proof-of-concept useful. If you have any questions for us, please reach out to the team by emailing help@uw.edu with Hyak somewhere in the subject or body. Thanks!

August 2024 Maintenance Details

· 3 min read
Kristen Finch
HPC Staff Scientist

Thanks again for your patience with our monthly scheduled maintenance. During this maintenance session, we were able to provide package updates to node images to ensure compliance with the latest operating system level security fixes and performance optimizations.

The next maintenance will be Tuesday September 10, 2024.

New self-hosted S3 storage option: KOPAH

We are happy to announce the preview launch of our self-hosted S3 storage called KOPAH. S3 storage is a solution for securely storing and managing large amounts of data, whether for personal use or research computing. It works like an online storage locker where you can store files of any size, accessible from anywhere with an internet connection. For researchers and those involved in data-driven studies, it provides a reliable and scalable platform to store, access, and analyze large datasets, supporting high-performance computing tasks and complex workflows.

S3 uses buckets as containers to store data, where each bucket can hold 100,000,000 objects, which are the actual files or data you store. Each object within a bucket is identified by a unique key, making it easy to organize and retrieve your data efficiently. Public links can be generated for KOPAH objects so that users can share buckets and objects with collaborators.
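
As a quick sketch of the bucket-and-object model (assuming s3cmd is configured with your KOPAH access and secret keys; the bucket name is a placeholder):

s3cmd mb s3://mylab-data                              # create a bucket
s3cmd put results.tar.gz s3://mylab-data/             # upload an object
s3cmd ls s3://mylab-data                              # list objects by key
s3cmd signurl s3://mylab-data/results.tar.gz +86400   # generate a shareable link valid for 24 hours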

Click here to learn more about KOPAH S3.

Who should use KOPAH?

KOPAH is a storage solution for anyone. Just like other storage options out there, you can upload, download, and view your storage bucket with specialized tools and share your data via the internet. For Hyak users, KOPAH provides another storage option for research computing. It is more affordable than /gscratch storage and can be used for active research computing with a few added steps for retrieving stored data prior to a job.

Test Users Wanted

Prior to September, we are inviting test users to try KOPAH and provide feedback about their experience. If you are interested in becoming a KOPAH test user, please email help@uw.edu with Hyak or KOPAH in the subject line.

Requirements:

  1. While we will not charge for the service until September 1, to sign up as a test user, we require a budget number and worktag. If the service doesn't work for you, you can cancel before September.
  2. We will ask for a name for the account. If your group has an existing account on Hyak (klone /gscratch), it makes sense for the names to match across services.
  3. Please be ready to respond with your feedback about the service.

Opportunities

PhD students should check out this opportunity for funding from NVIDIA: Graduate Research Fellowship Program

Questions? If you have any questions for us, please reach out to the team by emailing help@uw.edu with Hyak in the subject line.

April 2024 Maintenance Details

· 2 min read
Kristen Finch
HPC Staff Scientist

Thank you for your patience this month while there was more scheduled downtime than usual to allow for electrical reconfiguration work in the UW Tower data center. We appreciate how disruptive this work has been in recent weeks. Please keep in mind that this work by the data center team has been critical in allowing the facility to increase available power to the cluster to provide future growth capacity, which was limiting deployment of new equipment in recent months.

The Hyak team was able to use the interruption to implement the following changes:

  • Increase in checkpoint (--partition=ckpt) runtime for GPU jobs from 4-5 hours to 8-9 hours (pre-emption for requeuing will still occur subject to cluster utilization). Please see the updated documentation page for information about using idle resources. A sample checkpoint job script follows this list.
  • The NVIDIA driver has been updated for all GPUs.
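
For example, a checkpoint GPU job that takes advantage of the longer window might be submitted with a script like this sketch; the account name and command are placeholders, and your job should save and resume its own state since it can still be pre-empted.

#!/bin/bash
#SBATCH --partition=ckpt
#SBATCH --account=mylab-ckpt      # placeholder checkpoint account
#SBATCH --gpus=1
#SBATCH --time=8:00:00            # fits within the new 8-9 hour GPU runtime
#SBATCH --requeue                 # allow Slurm to requeue the job if pre-empted

python train.py --resume latest   # placeholder command that resumes from its last checkpoint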

Our next scheduled maintenance will be Tuesday May 14, 2024.

Training Opportunities

Follow NSF ACCESS Training and Events posting HERE to find online webinars about containers, parallel computing, using GPUs, and more from HPC providers around the USA.

Questions? If you have any questions for us, please reach out to the team by emailing help@uw.edu with Hyak in the subject line.

Disk Storage Management with Conda

· 8 min read
Kristen Finch
HPC Staff Scientist

It has come to our attention that the default configuration of Miniconda and conda environments in the user's home directory leads to hitting storage limitations and the dreaded error Disk quota exceeded. We thought we would take some time to guide users in configuring their conda environment directories and package caches to avoid this error and proceed with their research computing.


Conda's config

Software is usually accompanied by a configuration file (aka "config file"), a text file used to store configuration data for software applications. It typically contains parameters and settings that dictate how the software behaves and interacts with its environment. Familiarity with config files allows for efficient troubleshooting, optimization, and adaptation of software to specific environments, like Hyak's shared HPC environment, enhancing overall usability and performance. Conda's config file, .condarc, is customizable and lets you determine where packages and environments are stored by conda.

Understanding your Conda

First let's take a look at your conda settings. The conda info command provides information about the current conda installation and its configuration.

note

The following assumes you have already installed Miniconda in your home directory or elsewhere such that conda is in your $PATH. Install Miniconda instructions here.

conda info

The output should look something like this if you have installed Miniconda3.

     active environment : None
shell level : 0
user config file : /mmfs1/home/UWNetID/.condarc
populated config files : /mmfs1/home/UWNetID/.condarc

conda version : 4.14.0
conda-build version : not installed
python version : 3.9.5.final.0
virtual packages : __linux=4.18.0=0
__glibc=2.28=0
__unix=0=0
__archspec=1=x86_64
base environment : /mmfs1/home/UWNetID/miniconda3 (writable)
conda av data dir : /mmfs1/home/UWNetID/miniconda3/etc/conda
conda av metadata url : None
channel URLs : https://conda.anaconda.org/conda-forge/linux-64
. . .
package cache : /mmfs1/home/UWNetID/conda_pkgs
envs directories : /mmfs1/home/UWNetID/miniconda3/envs
platform : linux-64
user-agent : conda/4.14.0 requests/2.26.0 CPython/3.9.5 Linux/4.18.0-513.18.1.el8_9.x86_64 rocky/8.9 glibc/2.28
UID:GID : 1209843:226269
netrc file : None
offline mode : False

The paths shown above will show your username in place of UWNetID. Notice the highlighted lines above showing the absolute path to your config file in your home directory (e.g., /mmfs1/home/UWNetID/.condarc), the directory designated for your package cache (e.g., /mmfs1/home/UWNetID/conda_pkgs), and the directory/directories designated for your environments (e.g., /mmfs1/home/UWNetID/miniconda3/envs). Conda designates directories for your package cache and your environments by default, but under Hyak, your home directory has a 10G storage limit, which can quickly be maxed out by package tarballs and their contents. We can change the location for your package cache and your environments to avoid this.

tip

When you ls your home directory (i.e., ls /mmfs1/home/UWNetID/ or ls ~) you might not see .condarc listed. It might not exist yet, in which case you will create it in the next step; but if you already have one, you must use the following command

ls -a

to list all hidden files (files beginning with .).

Configuring your package cache and envs directories

If you don't have a .condarc in your home directory, you can create and edit it with a Hyak preloaded editor like nano or vim. Here we will use nano.

nano ~/.condarc

Edit OR ADD the highlighted lines to your .condarc to designate directories with higher storage quotas for envs_dirs and pkgs_dirs. In this exercise, we will assign our envs_dirs and pkgs_dirs to directories in /gscratch/scrubbed/ where we have more storage; remember, though, that scrubbed storage is temporary and files are deleted automatically after 21 days if the timestamps are not updated. Alternatively, your lab/research group might have another directory in /gscratch/ that can be used.

important

Remember to replace the word UWNetID in the paths below with YOUR username/UWNetID.

Here is what your edited .condarc should look like.

~/.condarc
envs_dirs:
- /gscratch/scrubbed/UWNetID/envs
pkgs_dirs:
- /gscratch/scrubbed/UWNetID/conda_pkgs
always_copy: true

In addition to designating the directories, please include always_copy: true, which is required on the Hyak filesystem for configuring your conda in this way.

After .condarc is edited, we can use conda info with grep to see if our changes have been incorporated.

conda info |grep cache 

The result should be something like

/gscratch/scrubbed/UWNetID/conda_pkgs

And for the environments directory

conda info |grep envs

Result

/gscratch/scrubbed/UWNetID/envs
warning

If you don't have the directories you intend to use under your UWNetID in /gscratch/scrubbed/ (or wherever you intend to designate these directories), you will need to create them now for this to work. Use the mkdir command, for example mkdir /gscratch/scrubbed/UWNetID, and replace UWNetID with your username. Then create directories for your package cache and envs directory, for example, mkdir /gscratch/scrubbed/UWNetID/conda_pkgs and mkdir /gscratch/scrubbed/UWNetID/envs.

Cleaning up disk storage

After you have reset the package cache and environment directories with your conda config file, you can delete the previous directories to free up storage. Before doing that, you can monitor how much storage was being occupied by each item in your home directory with the command du -h --max-depth=1. Remove directories previously used as cache and envs_dir recursively with rm -r. The following is an example of monitoring storage and removing directories.

warning

rm -r is permanent. We cannot recover your directory. You were warned.

Below is an example output from the du -h --max-depth=1 command

du -h --max-depth=1 /mmfs1/home/UWNetID/
6.7G ./miniconda3/
4.0G ./conda_pkgs
. . .
rm -r /mmfs1/home/UWNetID/miniconda3/envs
du -h --max-depth=1 /mmfs1/home/UWNetID/
2.6G ./miniconda3/
4.0G ./conda_pkgs
. . .
note

The hyakstorage command is not updated immediately. Although you have cleaned up your home directory, hyakstorage might not yet show the new storage estimates. du -sh will give you the most up-to-date information.

Storage can also be managed by cleaning up package cache periodically. Get rid of the large-storage tar archives after your conda packages have been installed with conda clean --all.

Lastly, regular maintenance of conda environments is crucial for keeping disk usage in check. Review your list of conda environments with conda env list and remove unused environments with the conda remove --name ENV_NAME --all command. Consider creating lightweight environments by installing only necessary packages to conserve disk space. For example, create an environment for each project (project1_env) rather than one environment for all projects combined (myenv).
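
Putting those maintenance commands together (the environment name is a placeholder):

conda clean --all                            # remove cached tarballs and index caches
conda env list                               # review existing environments
conda remove --name old_project_env --all    # delete an environment you no longer use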

Disk quota STILL exceeded

Be aware that many software packages are configured similarly to conda. Explore the documentation of your software to locate the configuration file and anticipate where storage limitations might become an issue. In some cases, you may need to edit or create a config file for the software to use. pip and R are two other common offenders ballooning the disk storage in your home directory.

Configuring PIP

If you are installing with pip, you might have a pip cache in ~/.cache/pip. Let's locate your pip config file location; pip config list -v lists the candidate paths by variant. You might have to activate a previously built conda environment to do this. For this exercise we will use an environment called project1_env.

conda activate project1_env
(project1_env) $ pip config list -v
. . .
For variant 'user', will try loading '/mmfs1/home/UWNetID/.pip/pip.conf'
. . .

The message "will try loading" rather than listing the config file pip.conf means that a pip config file has not been created. We will create our config file and set our pip cache. Create a directory in your home directory (e.g., /mmfs1/home/UWNetID/.pip) to hold your pip config file, and create a file called pip.conf with the touch command. Remember to also create the new directory for your new pip cache if you haven't yet.

mkdir /mmfs1/home/UWNetID/.pip/
touch /mmfs1/home/UWNetID/.pip/pip.conf
mkdir /gscratch/scrubbed/UWNetID/pip_cache

Open pip.conf with nano or vim and add the following lines to designate the location of your pip cache.

[global]
cache-dir=/gscratch/scrubbed/UWNetID/pip_cache

Check that your pip cache has been designated.

(project1_env) $ pip config list
/mmfs1/home/UWNetID/.pip/pip.conf
(project1_env) $ pip cache dir
/gscratch/scrubbed/UWNetID/pip_cache

Configuring R

We previously covered this in our documentation. Edit or create a config file called .Renviron in your home directory. Use nano or vim to designate the location of your R package libraries. The contents of the file should be something like the following example.

~/.Renviron
R_LIBS="/gscratch/scrubbed/UWNetID/R/"

The directory designated by R_LIBS will be where R installs your package libraries.

I'm still stuck

Please reach out to us by emailing help@uw.edu with "hyak" in the subject line to open a help ticket.

Acknowledgements

Several users noticed some idiosyncrasies when configuring conda to better use storage on Hyak. In short, by default miniconda3 uses softlinks to help preserve storage, storing one copy of essential packages (e.g., encodings) and using softlinks to make the single copy available to all conda environments. On Hyak, which utilizes a mounted filesystem server, these softlinks were broken, leading to broken environments after their first usage. We appreciate the help of the Miniconda team who helped us find a solution. More details about this can be found by following this link to the closed issue on Github.

March 2024 Maintenance Details

· 3 min read
Kristen Finch
HPC Staff Scientist

For our March maintenance we had some notable changes we wanted to share with the community.

Login Node

Over the last several months the login node has been crashing on occasion. We have been monitoring and dissecting the kernel dumps from each crash and this behavior seems to be highly correlated with VS Code Remote-SSH extension activity. To prevent node instability, we have upgraded the storage drivers to the latest version. If you are a VS Code user and connect to klone via Remote-SSH, we have some recommendations to help limit the possibility that your work would cause system instability on the login node.

Responsible Usage of VS Code Extension Remote-SSH

While developing your code with connectivity to the server is a great usage of our services, connecting directly to the login node via the Remote-SSH extension results in VS Code server processes running silently in the background, leading to node instability. As a reminder, we prohibit users from running processes on the login node.

New Documentation

The steps discussed here for responsible use of VS Code have been added to our documentation. Please review the solutions for connecting VS Code to Hyak.

  1. Check which processes are running on the login node, especially if you have been receiving klone usage violations when you are not aware of jobs running. Look for vscode-server among the listed processes.

    $ ps aux | grep UWNetID
  2. If you need to develop your code with connectivity to VS Code, use a ProxyJump to open a connection directly to a compute node (Step 1 documentation), and then use the Remote-SSH extension to connect to that node through VS Code on your local machine, preserving the login node for the rest of the community (Step 2 documentation). A sketch of the SSH configuration follows this list.

  3. Lastly, VS Code’s high usage is due to it silently installing its built-in features into the user's home directory (~/.vscode on klone) to enable intelligent autocomplete features. This is a well-known issue, and there is a solution that involves disabling the @builtin TypeScript plugin in VS Code on your local machine. Here is a link to a blog post about the issue and the super-easy solution. Disabling @builtin TypeScript will reduce your usage of the shared resources and avoid problems.
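
As a sketch of the ProxyJump setup on your local machine: UWNetID is your own netid, and the compute node name n3088 is a placeholder for whichever node your interactive job lands on.

# ~/.ssh/config on your local machine
Host klone
    HostName klone.hyak.uw.edu
    User UWNetID

# "n3088" is a placeholder; use the compute node assigned to your interactive job
Host klone-node
    HostName n3088
    User UWNetID
    ProxyJump klone

Point VS Code's Remote-SSH at the klone-node host instead of the login node.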

In addition to the upgrade of the storage driver, we performed updates to security packages.

Training Opportunities

We wanted to make you aware of two training opportunities with the San Diego Supercomputer Center. If you are interested in picking up some additional skills and experience in HPC, check this blog post.

Questions?

If you have any questions for us, please reach out to the team by emailing help@uw.edu with Hyak in the subject line.

February 2024 Maintenance Details

· 3 min read
Nam Pho
Director for Research Computing

Hello Hyak community! We have a few notable announcements regarding this month’s maintenance. If the hyak-users mailing list e-mail didn’t fully satisfy your curiosity, hopefully this expanded version will answer any lingering questions.

GPUs

  • Software: The GPU driver was upgraded to the latest stable version (545.29.06). The latest CUDA 12.3.2 is also now provided as a module. You are also encouraged to explore the use of container (i.e., Apptainer) based workflows, which bundle various versions of CUDA with your software of interest (e.g., PyTorch) over at NGC. NOTE: Be sure to pass the --nv flag to Apptainer when working with GPUs (see the example after this list).

  • Hardware: The Hyak team has also begun the early deployments of our first Genoa-Ada GPU nodes. These are cutting-edge NVIDIA L40-based GPUs (code named “Ada”) running on the latest AMD processors (code named “Genoa”) with 64 GPUs released to their groups two weeks ago and an additional 16 GPUs to be released later this week. These new resources are not currently part of the checkpoint partition but we will be releasing guidance on making use of idle resources here over the coming weeks directly to the Hyak user documentation as we receive feedback from these initial researchers.
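
For example, running an NGC PyTorch container with GPU access might look like this sketch; the image tag is a placeholder, so pick the version you need from NGC.

# Pull a CUDA-enabled PyTorch image from NGC, then run it with GPU support via --nv
apptainer pull pytorch.sif docker://nvcr.io/nvidia/pytorch:24.01-py3
apptainer exec --nv pytorch.sif python -c "import torch; print(torch.cuda.is_available())"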

Storage

  • Performance Upgrade: In recent weeks, AI/ML workloads have been increasingly stressing the primary storage on klone (i.e., "gscratch"). Part of this was attributed to the run up to the International Conference for Machine Learning (ICML) 2024 full paper deadline on Friday, February 2. However, it also reflects a broader trend in the increasing demands of data-intensive research. The IO profile was so heavy at times that our systems automation throttled the checkpoint capacity to near 0 in order to keep storage performance up and prioritize general cluster navigation and contributed resources. We have an internal tool called iopsaver that automatically reduces IOPS by intelligently requeuing checkpoint jobs generating the highest IOPS while concurrently limiting the number of total active checkpoint jobs until the overall storage is within its operating capacity. At times over the past few weeks you may have noticed that iopsaver had reduced the checkpoint job capacity to near 0 to maintain overall storage usability.

    During today’s maintenance, we have upgraded the memory on existing storage servers so that we could enable Local Read-Only Cache (LROC) although we don’t anticipate it will be live until tomorrow. Once enabled, LROC allows the storage cluster to make use of a previously idle SSD capacity to cache frequently accessed files on this more performant storage tier medium. We expect LROC to make a big difference as during this period of the last several weeks, the majority of the recent IO bottlenecking was attributed to a high volume of read operations. As always, we will continue to monitor developments and adjust our policies and solutions accordingly to benefit the most researchers and users of Hyak.

  • Scrubbed Policy: In the recent past this space has filled up. As a reminder, this is a free-for-all space and a communal resource for data you only need temporarily, when you burst past your usual allocations from your other group affiliations. To ensure greater equity among its use, we have instituted a 10TB and 10M-file limit for each user in scrubbed. This impacts <1% of users, as only a handful of users were using more than 10TB of scrubbed quota.

Questions?

Hopefully you found these extra details informative. If you have any questions for us, please reach out to the team by emailing help@uw.edu with Hyak somewhere in the subject or body. Thanks!

Update on the hyakstorage command

· 2 min read
Nam Pho
Director for Research Computing

We’ve made an update to our storage accounting tool, hyakstorage, and with this update we are also phasing out usage_report.txt. That text file contained minimally-parsed internal metrics of the storage cluster, and we found it caused as many questions as it answered. Moving forward, the hyakstorage tool will display only the four relevant pieces of information for each fileset you query: storage space used vs. the storage space limit, and the current number of files (inodes) vs. the maximum number of files.

The default operation (running hyakstorage with no arguments) will show your home directory and the gscratch directories you have access to, and it will only show the fileset totals and your contributions.

You can also specify which filesets you want to view, in a few different ways: you can use the flag --home to show your home directory, --gscratch to show your gscratch directories, and --contrib to show your group’s contrib directories. You can also specify an exact gscratch directory with the group name (e.g. hyakstorage stf), contrib directory (e.g. hyakstorage stf-src), or full path to a fileset (e.g. hyakstorage /mmfs1/gscratch/stf).

If you want more detailed metrics, you can use the flags --show-user or --show-group to break down the fileset totals by individual users or groups. Those detailed metrics can be sorted by space with --by-disk (the default) or by files with --by-files.
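
For example, using the stf group from above:

hyakstorage                              # home directory plus the gscratch filesets you can access
hyakstorage --home                       # just your home directory
hyakstorage stf --show-user              # the stf gscratch fileset, broken down by user
hyakstorage stf --show-user --by-files   # the same breakdown, sorted by file counts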

See also: