Condo Model

The HYAK clusters operate on a condo model: the cluster itself consists of resource slices contributed by various groups across campus. The HYAK team, funded through the Office of Research and sponsoring entities, provides the core infrastructure (e.g., networking, storage, support staff). This is why faculty affiliated with sponsoring entities have no annual, ongoing costs for their slices beyond the initial cost of the hardware; the leadership of their sponsoring entities covers this. Faculty not affiliated with a sponsoring entity must shoulder the annual, ongoing cost associated with any slices they wish to contribute.

You get on-demand access to resources equivalent to the slices your account contributes to the cluster. A cluster account also provides access to all the other contributed slices from other entities, subject to availability (i.e., when their contributors aren't actively using them). This is referred to as the "checkpoint" partition due to the lack of job run-time guarantees. Once a checkpoint job starts it can be re-queued at any moment, but it is not uncommon for a job to run for 4 to 5 hours before requeue. Longer checkpoint jobs will continue to run and be re-queued until they complete, which is why it is important that your job be able to checkpoint (save state) and resume gracefully. Checkpoint access can provide substantial resources beyond what you contribute and is the main benefit of joining a shared cluster like HYAK compared to buying the same hardware and operating your own server.
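As a minimal sketch of what "checkpoint and resume gracefully" can look like, the script below records its progress to a file after each unit of work, so a requeued job picks up where it left off instead of starting over. The file name and step count here are illustrative, not HYAK conventions.

```shell
#!/bin/sh
# Resumable job sketch: progress is checkpointed to a file after every
# unit of work, so a requeued job continues from the last saved step.
# STATE_FILE and TOTAL_STEPS are illustrative names, not HYAK defaults.
STATE_FILE="${STATE_FILE:-progress.txt}"
TOTAL_STEPS="${TOTAL_STEPS:-10}"

# Resume from the last saved step, or start from 0 on a fresh run.
if [ -f "$STATE_FILE" ]; then
  start=$(cat "$STATE_FILE")
else
  start=0
fi

step=$start
while [ "$step" -lt "$TOTAL_STEPS" ]; do
  step=$((step + 1))
  # ... one unit of real work goes here ...
  echo "$step" > "$STATE_FILE"   # checkpoint after each unit of work
done
echo "finished at step $step"
```

The same pattern applies whatever the real checkpoint is (a model snapshot, a restart file, a row offset): save it atomically after each unit of work, and read it on startup.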

The total cost considerations for compute nodes in HYAK can be broken down into the sum of the following two components.
  1. Slice Annual Costs
  2. Slice Hardware Costs

Slice Annual Costs

Self-Sponsored Slices (Annual)

$1,750 / 1 slice / 1 year

What's included?
  • Cluster membership evaluated annually.
  • Access to the checkpoint partition for additional resources and compute time beyond what you contribute to the cluster.
  • Grant application support.
  • Scientific consultation for workflows and researcher onboarding.
  • Access to workshops and other training as provided.
  • Next business day support for questions.
  • 24 / 7 / 365 monitoring of the cluster as a whole.
  • Regular (cyber)security patching and updates.
  • Historical uptime better than 99% for the cluster as a whole, excluding scheduled maintenance days.

NOTE: Slices purchased separately (below).

Sponsored Slices (Annual)

$0 / year

What's included?
  • Everything that comes with self-sponsored slices.
  • Slice lifetime guaranteed for a minimum of 4 years.
  • No annual costs beyond the up-front cost of the slices.

NOTE: Slices purchased separately (below).

If your lab has a faculty affiliation with a sponsoring entity (listed below), then you are only responsible for a one-time, up-front cost of the slices. You get 4 years of guaranteed, fully supported utilization per slice; beyond that, continued use is subject to capacity and other conditions. You can skip down to the section below for specific slice configurations.

If your lab does not have a faculty affiliation with a sponsoring entity (listed below), then there is an annual cost of $1,750 per 1 slice per 1 year (Self-Sponsored Slices above).

  • UW Seattle
    • College of Arts & Sciences
    • College of Engineering
    • College of the Environment
    • Institute for Protein Design
    • School of Medicine (Pending)
  • UW Bothell
  • UW Tacoma

Slice Hardware Configurations

Type             | HPC Slices               | GPU Slices
Slice Count      | 1 x HPC slice            | 1 x GPU slice
Compute Cores    | 32 cores                 | 32 cores
Memory (System)  | 256GB, 512GB, or >512GB  | 384GB
GPU Type         | N/A                      | 2 x L40 or 2 x H100
Memory (GPU)     | N/A                      | 48GB per GPU (L40) or 80GB per GPU (H100)
Pricing ($)      | Email Us                 | Email Us

General FAQ:

  • All hardware is procured at cost (market value with substantial university-negotiated bulk discounts), with no sales tax or university overhead applied.
  • We reserve the 2nd Tuesday of every month for cluster maintenance.
  • Slice Service Life:
    • Sponsored Slices: All sponsored slices are supported for a guaranteed minimum lifetime of 4 years. Beyond 4 years, slices continue to be made available subject to hardware viability (i.e., it hasn't broken) and the sponsoring entity still having capacity. Historically, this has averaged 6 years; however, past performance is not a guarantee of future experience.
    • Self-Sponsored Slices: Since self-sponsored slices have an ongoing annual cost, slice life is reviewed yearly, subject to the lab's willingness to continue, hardware viability, and overall cluster capacity.
  • Storage:
    • Local: Each full node has 1.5TB or more of local NVMe SSD storage. This storage is non-persistent and is cleared after a job ends; to use it, data must be copied onto the local SSD at the start of each job and results copied off before the job finishes.
    • Group: Each slice purchase includes 1TB of shared group storage (i.e., gscratch) with a 1 million file count limit, accessible from every node. Additional quota can be purchased for $10 per month per additional 1TB of space and 1 million files. Additional "scrubbed" shared storage is available for short-term use, but files are automatically deleted if not accessed for several weeks.
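The local-SSD workflow above boils down to a stage-in / compute / stage-out pattern. The sketch below uses placeholder paths (a real job would use the node-local scratch path and your gscratch directory), and creates a small demo input only so it is self-contained.

```shell
#!/bin/sh
# Stage-in / compute / stage-out sketch for node-local SSD scratch.
# LOCAL_SCRATCH and GROUP_DIR are placeholder paths, not HYAK defaults.
LOCAL_SCRATCH="${LOCAL_SCRATCH:-/tmp/demo-scratch}"
GROUP_DIR="${GROUP_DIR:-.}"

# Demo input so the sketch is self-contained; a real job's input would
# already live in group storage (gscratch).
echo "hello" > "$GROUP_DIR/input.dat"

mkdir -p "$LOCAL_SCRATCH"

# 1. Stage input onto the fast local disk before computing.
cp "$GROUP_DIR/input.dat" "$LOCAL_SCRATCH/"

# 2. Run the computation against the local copy.
tr 'a-z' 'A-Z' < "$LOCAL_SCRATCH/input.dat" > "$LOCAL_SCRATCH/output.dat"

# 3. Copy results back to persistent storage before the job ends;
#    local scratch is cleared once the job finishes.
cp "$LOCAL_SCRATCH/output.dat" "$GROUP_DIR/"
```

Step 3 matters most: anything left only on local scratch when the job ends is lost.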
HPC Slices:
  • All slices are standardized on AMD EPYC 9654 CPUs ("Genoa").
  • A physical server (or node) has 192-cores and >1.5TB of memory packaged in a single box. This is in turn sub-divided into 6 equal "slices" of compute resources that are sold to researchers.
  • They are identically configured with your choice of memory (or RAM).
  • Any jobs requiring multiple nodes should either be independent computations (i.e., "embarrassingly parallel") or make use of message passing libraries (e.g., OpenMPI) to scale across multiple nodes simultaneously.
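For the independent-computation ("embarrassingly parallel") case, a Slurm job array is a common fit: each array task handles one input and tasks never communicate. A minimal sketch, with illustrative directive values and file names:

```shell
#!/bin/sh
#SBATCH --job-name=sweep
#SBATCH --array=1-100
# Embarrassingly parallel sketch: a Slurm job array where each task
# processes one independent input chunk with no inter-task communication.
# Directive values and file names here are illustrative.
TASK="${SLURM_ARRAY_TASK_ID:-1}"   # Slurm sets this per array task

# Each task reads its own chunk and writes its own result file, so
# tasks can land on any mix of nodes without coordination.
echo "result for chunk $TASK" > "result_${TASK}.txt"
```

If the tasks do need to exchange data mid-computation, that is the point at which an MPI library rather than a job array becomes necessary.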
GPU Slices:
  • All slices are standardized on AMD EPYC 9534 CPUs ("Genoa"). We are on the NVIDIA "Ada" and "Hopper" generation of GPUs.
  • 4 x GPU slices constitute a single physical server (or node). It is a single box with 128-cores, 1.5TB of memory, and 8 x GPUs of the same type. They are sold in resource slices to make this a more tractable cost for labs with more modest GPU needs.
  • Any jobs requiring more than 8 x GPUs of the same type should be prepared to make use of distributed training frameworks (e.g., PyTorch Lightning) to scale across multiple servers. Any job up to the equivalent of 4 x GPU slices (i.e., 8 x GPU cards) can be run on the same physical machine and therefore scales easily without much further modification to the codebase.
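Beyond 8 GPUs, a multi-node run is typically driven from the batch script. A hedged sketch of what that might look like with PyTorch's torchrun launcher; the node count, port, and script name are illustrative assumptions, not HYAK-specific values:

```shell
#!/bin/sh
#SBATCH --nodes=2                  # 2 nodes x 8 GPUs = 16 GPUs total
#SBATCH --gpus-per-node=8
#SBATCH --ntasks-per-node=1
# One torchrun launcher per node; torch.distributed handles the
# cross-node communication. train.py is a placeholder script name.
srun torchrun \
  --nnodes="$SLURM_NNODES" \
  --nproc_per_node=8 \
  --rdzv_backend=c10d \
  --rdzv_endpoint="$(scontrol show hostnames "$SLURM_JOB_NODELIST" | head -n1):29500" \
  train.py
```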