4 posts tagged with "features"

View All Tags

OS upgrade for klone

Nam Pho

Nam Pho

Director for Research Computing
note

klone has a new OS, we upgraded to Rocky Linux from CentOS 8.

Background#

In late 2020, while building the current-generation cluster, klone, our previous-generation cluster, MOX, was running CentOS 7 ā€“ which was nearing end-of-life support. We used the transition to klone as an opportunity to deploy CentOS 8, the worldā€™s most popular OS in academic research computing environments. Unfortunately, around the time we were wrapping up KLONEā€™s software stack, the CentOS project announced [1, 2] a transition of their own: Red Hat unilaterally terminated the development of CentOS as an open-source version of Red Hat Enterprise Linux (RHEL). CentOS would become an upstream version of RHEL ā€“ in other words, more experimental and ultimately less stable.

Rocky Linux

As the dust from this announcement settled, a consensus emerged: Rocky Linux, led by the initial founder of the CentOS project, Greg Kurtzer, would become the CentOS successor.

The Transition#

Fast-forward to late 2021: after our summer ā€˜21 launch of klone, and our fall ā€˜21 cluster capacity expansion, we were finally able to turn our attention to the CentOS to Rocky migration. And just in time, too, because CentOS 8ā€“the operating system we deployed just months earlierā€“would be officially unsupported after December 21, 2021.

Drake on CentOS and Rocky

Our goal was to make this OS transition as smooth and unnoticeable to our users as possible. After all, this is our mission: we take care of the tech so that you can take care of the science. Rocky, like CentOS, is intended to be a bug-for-bug, open-source version of RHEL, and with its talented, globe-spanning team of developers, we were confident that the impact of this transition would be minimal.

We began the transition with our backend during the December ā€˜21 maintenance: the klone head node, our Slurm scheduler, was successfully migrated to Rocky 8. So far so good! During our next maintenance, January ā€˜22, we migrated all the compute node images to Rocky. A handful of users reported code-compiling issues, which we were able to resolve, but otherwise it was uneventful. We took extra care on the final piece of the Rocky migrationā€“the login nodesā€“due to their accessibility from the wider internet. And, as of todayā€™s maintenance, we are excited and relieved to report that klone is now a 100% Rocky cluster! šŸ„³

Summary#

The Hyak team was forced to revisit a major OS migration, mere months after the initial launch of klone. This is highly unusualā€“and no small featā€“but we have prevailed. We deployed a widely-supported, open-source OS with enterprise-level stability, while remaining cost-effective to the research community at the University of Washington. With this work behind us, weā€™ve arrived at a sustainable platform for the life of the klone cluster. Weā€™re excited for the future of klone, and excited to redirect our time back to feature development.

We want to give a huge thank-you to our users for their patience during this migration period. Spoiler alert: Rocky won and itā€™s a good thing!

Rocky Wins!

Fairshare improvements on klone

Nam Pho

Nam Pho

Director for Research Computing
note

We have adjusted legacy fairshare-related settings to account for GPUs and large memory contributions and usage in order to help more fairly allocate checkpoint resources.

History#

In fall 2019 (almost two years ago to the day) the Hyak team received our first Turing generation GPU node. Hyak has had a modest GPU footprint in the past as far back as a decade ago with the first generation cluster (called "IKT") and its pre-Pascal generation cards. In 2015 we acquired a smaller test bed of Pascal generation GPUs for the second generation cluster (called "MOX"). There were never more than a dozen GPUs in either the IKT or MOX clusters, but the introduction of Turing GPUs marked a resurgence of interest in these accelerators among the UW research community. In the last two years, we've substantially expanded our capabilities to over 300 GPUs.

Background#

Hyak clusters work on a "condo" model: labs are able to utilize their contributed hardware on-demand as well as take advantage of idle capacity from other groups' hardware via the checkpoint (ckpt) partition. Your checkpoint priority ā€” or "fairshare" in Slurm scheduler parlance ā€” is weighted such that your fairshare is directly proportional to your labā€™s contribution to the cluster. In the MOX days, GPU users tended to stay within their contributed hardware partitions and rarely made use of checkpoint. We attributed this to a mental shift: students were used to using a single resource, like a desktop computer, rather than a shared cluster of computing resources. However, with the migration to the third generation Hyak cluster (called "klone") and its new QoS scheduling system and the increasing comfort of students using a shared platform, GPU utilization in the checkpoint partition has increased as well. This is a good thing: we want groups to benefit from their Hyak membership in the cluster and take advantage of idle cluster resources beyond their initial hardware contributions. This is a primary tenet of our social contract with the Hyak community: as a node contributor to the cluster, you have access to idle resources of the whole cluster.

Problem#

Fairshare was simpler to calculate in the pre-GPU days because our infrastructure was homogenous: one node contributed to the cluster equaled one fairshare unit. During the last two years of exponential GPU adoption on Hyak, the fairshare calculation has not evolved: 1 HPC node was the same as 1 GPU node at 1 fairshare unit. This didnā€™t hold because a GPU node can cost between 4 to 8 times (or more) than a traditional HPC node. The result was that labs with GPU or other speciality (e.g., high-memory) nodes tended to have smaller fairshares compared to groups with the same dollar investment but only in traditional CPU nodes. In practice, this meant these GPU users often directly competed for resources with non-GPU jobs in the checkpoint partition on a non-level playing field.

Solution#

Taking into consideration all of this information, as well as the fact that you can request as little as 1 GPU or 1 CPU from the scheduler, we have adjusted the fairshare calculations as follows:

  • Financially: 1 GPU card is roughly equivalent to 40 CPU cores (on a dollar basis), therefore the cost normalization is 40:1 in favor of GPUs.
  • Scarcity: 1 server typically holds 8 GPU cards or 40 CPU cores, therefore the scarcity normalization is 5:1 in favor of GPUs.
  • Combining the financial and scarcity considerations in the points above, the final weighting is 200:1 in favor of GPUs. In other words, 1 GPU card is worth 200 times more than a single CPU core in the eyes of the scheduler and factored into your checkpoint fairshare. Please note that this example only applies to the higher GPU memory cards (i.e., gpu-rtx6k) while less expensive GPUs have commensurately less weight.

Summary#

With the October monthly maintenance today we have introduced a new fairshare weighting system on the klone cluster's checkpoint (ckpt) partition that commensurately acknowledges GPU labs for their contributions to the Hyak community. This has no impact on jobs submitted to non-ckpt partitions.

Migrating from mox to klone

Nam Pho

Nam Pho

Director for Research Computing

If you were previously a proficient mox user and now find yourself on klone, what's new / different? This is a high-level summary, please consult the documentation [link] for more details.

note

Updated August 10, 2021 to include additional information specific for GPU users.

Login#

  • Logging in was previously to mox.hyak.uw.edu now it's klone.hyak.uw.edu.
  • As a reminder login nodes are only to connect to the cluster, navigate the cluster file system, and submit jobs. This applies to both klone and mox. Do not compile codes on the login node or run any programs that require significant compute (get a session with Slurm).

Data Transfer#

  • Only use the login node to transfer data on klone. On mox you'd have used a build node or could have used the login node if it wasn't very computationally heavy.

Storage#

  • The path to lab storage is still /gscratch/mylab on both klone and mox. You'll need to copy over the data from mox to klone you want to continue using.
  • Home directories are still 10GB per user, same on both clusters.
  • Scrubbed exists on klone just as it did on mox at /gscratch/scrubbed this is a free-for-all space on both clusters where files are automatically deleted after 21 days.
  • Some new benefits of the klone storage compared to mox:
    • There are snapshots for gscratch! Look inside the /gscratch/mylab/.snapshots folder for a copy of your lab folder once an hour, every hour, for 24 hours. This is not a backup copy nor a replacement for version management (e.g., git) but useful for retrieving recent versions or something accidentally deleted. This is currently disabled.
    • More storage! Previously you received 500GB or 0.5TB of gscratch quota per node (or pair of GPUs) contributed to mox. Now on klone we've doubled your associated storage quota! For example, 2 nodes on mox would mean 1TB of gscratch but 2 nodes on klone now means 2TB of gscratch. If you had an 8 x GPU node on mox you would have received 2TB of gscratch but an 8 x GPU node on klone now means 4TB of gscratch.
    • It's faster! We've had reports of performance that's averaging a 30% speed up all else being equal, nothing you need to do aside from use klone instead of mox.
    • It's faster than fast! While klone storage is faster than mox storage overall, gscratch on klone is further turbo charged with a NVMe flash based tier. NVMe flash is among the fastest storage mediums you can get and further differentiating benefit if you use gscratch vs scrubbed on klone.

Compute#

  1. When submitting a Slurm job, whether interactive (i.e., salloc) or batch (i.e., sbatch) you'll want to first decide which account to use. This is the group you're part of. You can run the command groups to see your affiliated accounts and run hyakalloc to see all the resources (e.g., compute cores, memory, GPUs) used and available associated with each affiliated account.
  2. Then decide if you want to run this job to count under your resource allocation by submitting to the compute partition (i.e., -p compute) or if you want this job to use idle resources from other groups across the cluster using the checkpoint partition (i.e., -p ckpt).
  • Non-standard partitions. Run sinfo to see the list of all possible partitions, this is only if your group contributed non-standard nodes (e.g., high memory, GPUs) and need to idenitify the appropriate partition names to get immediate use. Otherwise, you'd only be able to get them in a checkpoint capacity. For GPU users this is currently either the gpu-2080ti or the gpu-rtx6k partitions for 11GB and 24GB of GPU memory cards, respectively.
  • There is no build node on klone. Get an interactive session (e.g., salloc) under an existing account and partition combination you have access to.
  • All nodes have internet now on klone. Do all data transfers to and from klone on the klone login nodes, the login nodes on klone have dual 40 Gbps uplinks to the internet. While the compute nodes on klone have internet routing now, they are bottlenecked at 1 Gbps so not suitable for big data transfers.

Software#

  • Singularity containers work the same on both clusters, we encourage this when possible. Refer to our container documentation [link].
  • Modules is updated to the latest versions of the most core parts that the Hyak team maintains (e.g., gcc, Intel, Matlab). Refresh yourself about modules [link].
  • If neither Singularity nor existing modules works for you, you may have to re-compile your codes on klone. "contrib" modules works different now on klone vs mox, please check out the details [link].

klone Soft Launch

Nam Pho

Nam Pho

Director for Research Computing

February 25, 2021#

The UW research computing team celebrates the soft launch of project klone, the 3rd generation Hyak supercomputer. Welcome to those researchers invited to participate in the early access program šŸ„³ šŸŽ‰

caution

There will be weekly maintenance days on Tuesday during the soft launch period after which we will move back to our regular cadence of monthly maintenance windows.

The user documentation [link] has been updated to reflect the changes and new features of klone but this will be an ongoing process.

Compute#

  • Soft launch with 1,920 compute cores over 48 nodes:
    • 28 x mem1 nodes (192GB of memory each) in the compute partition,
    • 4 x mem2 nodes (384GB of memory each) in the compute-bigmem partition,
    • 16 x mem3 nodes (768GB of memory each) in the compute-hugemem partition.
  • build nodes no longer exist on klone as they did on mox. All instances have the potential to be interactive and all have internet routing by default (even non-interactive jobs).

Storage#

  • gscratch on klone is 1.4PB total capacity with a new 500TB NVMe flash tier. Data tiering happens automagically, if you use a file frequently it will be moved to the faster storage.
  • Storage quota is still charged back at the same rate ($10 / TB / month). Researchers receive 1TB per node purchased and contributed to klone.

Data#

  • gscratch is not backed up that is the responsibility of the researcher (e.g., LOLO, the cloud, external hard drive). Feel free to email us if you have any questions.
  • While all nodes have internet access now, transfer data using the login nodes. Login nodes have full 2 x 40 Gbps bandwidth. If you transfer using a compute node interactive session you are limited to 1 x 1 Gbps connection.

Software#

  • modules works the same as it did on mox. This is an improved implementation called LMOD on klone compared to environment modules on mox.
  • We provide the basic compilers (e.g., GNU, Intel) as modules.
  • The Hyak team is encouraging a container first world (i.e., use Singularity).

March 3, 2021#

The updated total is 3,840 cores and 96 nodes on klone.

Compute#

  • Compute has doubled by adding another rack to klone, an additional 1,920 compute cores over 48 nodes:
    • 44 x mem1 nodes (192GB of memory each) in the compute partition,
    • 2 x mem2 nodes (384GB of memory each) in the compute-bigmem partition,
    • 2 x mem3 nodes (768GB of memory each) in the compute-hugemem partition.

Software#

  • We created a module for cmake.

March 5, 2021#

Storage#

  • Implemented usage_report.txt files in the base folder of /gscratch/yourlab/ that is updated once an hour to reflect both your block quota and inode capacity usage. This is similar to the gscratch experience on the MOX cluster.

Website#

March 9, 2021#

Storage#

  • Snapshots are here! We are piloting once an hour for 24 hours for every lab storage folder under /gscratch/. Check out the updated documentation here on how to access past snapshots.

Software#

  • We created more LMOD software modules:
    • Matlab R2020b [docs]
    • OpenMPI-4.1.0

March 12, 2021#

  • LMOD software modules:
    • Intel has bundled their software suite (e.g., compiler, MPI) as oneCLI and we created this module (i.e., module load intel/oneCLI).
    • There is now a "contrib" framework for groups to store their shared codes separately from their /gscratch/labname/ data. You can get 100GB of storage to compile codes at /sw/contrib/labname-src/ and then put your LMOD module file in /sw/contrib/modulefiles/labname/. Your module would appear when anyone runs module avail. This is created upon request so if you'd like to opt-in your group please let us know.

April 13, 2021#

Things have been going steady the past week and changes are coming less frequently. We are now increasing time between maintenance periods on klone from weekly on Tuesdays to monthly and aligning it with the mox maintenance as the 2nd Tuesday of every month.

That wraps up our klone soft launch blog updates here, other updates will appear on our Hyak users mailing list. Don't forget to subscribe, instructions on this page at the bottom.