Condo Model

The HYAK clusters operate on a condo model. This means that the cluster is itself a collection of resource slices contributed by various groups across campus. The HYAK team, funded through the office of research and sponsoring entities, provides the core infrastructure (e.g., networking, storage, support staff). This is why faculty from sponsoring entities do not have any annual, ongoing costs associated with their nodes beyond the initial cost of the hardware: the leadership of their sponsoring entity covers this. Faculty who are not affiliated with a sponsoring entity have to shoulder this annual, ongoing cost for any slices they wish to contribute.

Whatever slices your group contributes to the cluster, you get access to on demand. A cluster account also gives you access to all the nodes contributed by other labs, subject to their availability (i.e., if the contributing labs aren't actively using them). This is referred to as the "checkpoint" partition due to the lack of job run-time guarantees. Once a checkpoint job starts, it can be re-queued at any moment; historically, jobs have run in continuous segments of 5 hours on average. Longer checkpoint jobs continue to run and be re-queued until they complete, which is why it is important that your job be able to checkpoint or save state and resume gracefully. Checkpoint access can provide substantial resources beyond what you contribute, and it is the benefit of joining a shared cluster like HYAK compared to buying the same hardware and setting up your own dedicated mini server.
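As a minimal sketch of what "save state to resume gracefully" can look like, the following Python example periodically writes its progress to a file and picks up from that file when it is started again. The file name, the `save_every` interval, and the toy workload (summing integers) are illustrative assumptions, not part of HYAK; a real job would store whatever state its own computation needs.

```python
import json
import os
import tempfile


def save_state(path, state):
    """Write state atomically so a re-queue mid-write cannot corrupt it."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    with os.fdopen(fd, "w") as handle:
        json.dump(state, handle)
    os.replace(tmp_path, path)  # atomic rename within one filesystem


def load_state(path):
    """Return the saved state, or a fresh one if no checkpoint exists."""
    if os.path.exists(path):
        with open(path) as handle:
            return json.load(handle)
    return {"next_step": 0, "total": 0}


def run(path, n_steps, save_every=10, stop_after=None):
    """Sum 0..n_steps-1, resuming from the last checkpoint if present.

    stop_after (hypothetical, for demonstration only) simulates the job
    being interrupted after that many steps in the current run.
    """
    state = load_state(path)
    steps_done = 0
    while state["next_step"] < n_steps:
        if stop_after is not None and steps_done >= stop_after:
            return None  # interrupted: unsaved progress is lost
        state["total"] += state["next_step"]
        state["next_step"] += 1
        steps_done += 1
        if state["next_step"] % save_every == 0:
            save_state(path, state)
    save_state(path, state)
    return state["total"]
```

Running the same command again after an interruption repeats only the work done since the last saved checkpoint, so a shorter `save_every` interval trades extra I/O for less repeated work.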

Therefore, the total cost considerations for compute nodes in HYAK can be broken down into the sum of the following two components.
  1. Slice Annual Costs
  2. Slice Hardware Costs

Slice Annual Costs

Self-Sponsored Slices (Annual)

$1,750 / 1 node / 1 year

What's included?
  • Cluster membership evaluated annually.
  • Access to the checkpoint partition for additional resources and compute time beyond what you contribute to the cluster.
  • Grant application support.
  • Scientific consultation for workflows and researcher onboarding.
  • Access to workshops and other training as provided.
  • Next business day support for questions.
  • 24 / 7 / 365 monitoring of the cluster as a whole.
  • Regular (cyber)security patching and updates.
  • Historical uptimes better than 99% for the cluster not including previously scheduled maintenance days.

NOTE: Slices purchased separately (below).

Sponsored Slices (Annual)

$0 / year

What's included?
  • Everything that comes with self-sponsored slices.
  • Slice lifetime guaranteed for a minimum of 4 years.
  • No annual costs beyond the up front cost of the slices.

NOTE: Slices purchased separately (below).

If your lab has a faculty affiliation with a sponsoring entity (listed below), then you are only responsible for a one-time, up-front cost for the number of slices you would like. You get 4 years of guaranteed and fully supported utilization per slice; beyond that, continued use is subject to capacity and other conditions. You can skip down to the section below for specific slice configurations.

If your lab does not have a faculty affiliation with a sponsoring entity (listed below), then there is an annual cost of $1,750 per 1 slice per 1 year.

  • UW Seattle
    • College of Arts & Sciences
    • College of Engineering
    • College of the Environment
    • Institute for Protein Design
    • School of Medicine
  • UW Bothell
  • UW Tacoma
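The two cost components described above can be combined in a short calculation. This is a rough sketch only: the $1,750 annual fee comes from this page, but hardware pricing is not published here, so `hardware_price_per_slice` is a hypothetical input you would replace with a quoted price.

```python
ANNUAL_FEE_PER_SLICE = 1750  # USD per slice per year (self-sponsored only)


def total_cost(n_slices, hardware_price_per_slice, years, sponsored):
    """Return hardware cost plus annual cost over the given number of years.

    hardware_price_per_slice is a placeholder; actual pricing is by quote.
    """
    hardware = n_slices * hardware_price_per_slice
    annual = 0 if sponsored else n_slices * ANNUAL_FEE_PER_SLICE * years
    return hardware + annual


# Example with a made-up hardware price of $10,000 per slice over 4 years.
sponsored_total = total_cost(2, 10_000, years=4, sponsored=True)
self_sponsored_total = total_cost(2, 10_000, years=4, sponsored=False)
```

With these illustrative numbers, the difference between the two totals is exactly the annual fee: 2 slices x $1,750 x 4 years.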

Slice Hardware Configurations

Type             | HPC Slices                     | GPU Slices
Slice Count      | 1 x HPC slice                  | 1 x GPU slice
Compute Cores    | 40 cores                       | 13 cores
Memory (System)  | 192GB, 384GB, 768GB, or 1.5TB  | 250GB
GPU Type         | N/A                            | 2 x A40 or 2 x A100
Memory (GPU)     | N/A                            | 48GB per GPU (A40) or 80GB per GPU (A100)
Pricing ($)      | Email Us                       | Email Us

General FAQ:

  • All hardware is procured at cost (market value with substantial university-negotiated bulk discounts), with no sales tax or university overhead applied.
  • We reserve the 2nd Tuesday of every month for cluster maintenance.
  • Slice Service Life:
    • Sponsored Slices: All sponsored slices are supported for a minimum guaranteed lifetime of 4 years. Beyond 4 years, slices continue to be made available subject to hardware viability (i.e., it didn't break) and the sponsoring entity still having capacity. Historically, slice lifetime has been 6 years on average. However, past performance is not a guarantee of future experiences.
    • Self-Sponsored Slices: Since self-sponsored slices have an on-going annual cost, this means slice life is reviewed on a yearly basis subject to the lab's willingness to continue, hardware viability, and overall cluster capacity.
  • Storage:
    • Local: Each slice comes with 480GB of local SSD. This is non-persistent storage and is cleared after a job ends. Data must be copied to and from local SSD before and after each job to utilize this.
    • Group: Each node purchase comes with 1TB per node of scalable, shared group storage (i.e., gscratch). Additional storage can be purchased for $10 / 1 TB / 1 month, or free options such as scrubbed are available.
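Because local SSD is cleared when a job ends, a job that wants to use it has to stage data in and copy results out itself. The following Python sketch shows that pattern under stated assumptions: `scratch_root` stands in for wherever the local SSD is mounted (not specified on this page), and `uppercase` is a hypothetical stand-in for the real computation.

```python
import shutil
import tempfile
from pathlib import Path


def run_with_local_scratch(input_file, output_dir, work, scratch_root=None):
    """Copy input to local scratch, run work() there, copy the result back.

    scratch_root is an assumed mount point for the local SSD; None falls
    back to the system temporary directory.
    """
    input_file = Path(input_file)
    output_dir = Path(output_dir)
    output_dir.mkdir(parents=True, exist_ok=True)
    with tempfile.TemporaryDirectory(dir=scratch_root) as scratch:
        local_in = Path(scratch) / input_file.name
        shutil.copy2(input_file, local_in)    # stage in
        local_out = work(local_in)            # compute on fast local disk
        final = output_dir / local_out.name
        shutil.copy2(local_out, final)        # stage out before cleanup
    return final


def uppercase(local_in):
    """Toy workload: write an upper-cased copy next to the input file."""
    local_out = local_in.with_suffix(".out")
    local_out.write_text(local_in.read_text().upper())
    return local_out
```

The copy back to persistent storage happens inside the `with` block on purpose, since the scratch directory is deleted as soon as the block exits.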
HPC Slices:
  • All slices are standardized on Intel 6230 CPUs ("Cascade Lake").
  • Each slice is a physical server (or node).
  • They are identically configured with your choice of memory (or RAM).
  • Any jobs requiring multiple nodes should be prepared to run as independent computations (i.e., "embarrassingly parallel") or make use of message passing libraries (e.g., OpenMPI) to scale across multiple nodes simultaneously.
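For the "embarrassingly parallel" case, each task only needs to know which share of the inputs is its own. The sketch below assumes the scheduler is Slurm and that tasks are launched as a job array numbered from 0, in which case Slurm sets the `SLURM_ARRAY_TASK_ID` and `SLURM_ARRAY_TASK_COUNT` environment variables; this page does not state the scheduler, so treat that as an assumption. The helper names are hypothetical.

```python
import os


def my_share(items, task_id, n_tasks):
    """Return the items this task should process (round-robin split)."""
    return items[task_id::n_tasks]


def task_from_env(default_id=0, default_count=1):
    """Read the task index and task count from Slurm job-array variables.

    Falls back to a single task when the variables are not set, so the
    same script also runs outside the scheduler.
    """
    task_id = int(os.environ.get("SLURM_ARRAY_TASK_ID", default_id))
    n_tasks = int(os.environ.get("SLURM_ARRAY_TASK_COUNT", default_count))
    return task_id, n_tasks


inputs = list(range(10))          # stand-in for files or parameter sets
task_id, n_tasks = task_from_env()
for item in my_share(inputs, task_id, n_tasks):
    pass                          # process one independent work item here
```

Since no task depends on another task's output, each one can be re-queued or rerun on its own without coordinating with the rest.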
GPU Slices:
  • All slices are standardized on Intel 6230R CPUs ("Cascade Lake"). We are on the NVIDIA "Ampere" generation of GPUs.
  • 4 x GPU slices constitute a single physical server (or node). It is a single box with 52 cores, 1TB of memory, and 8 x GPUs of the same type. They are sold in resource slices to make this a more tractable cost for labs with more modest GPU needs.
  • Any jobs requiring more than 8 x GPUs of the same type should be prepared to make use of distributed training or message passing libraries (e.g., PyTorch Lightning) to scale across multiple servers. Any job up to the equivalent of 4 x GPU slices (i.e., 8 x GPU cards) can be run on the same physical machine and therefore scale easily without much further modification to the codebase.
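The slice arithmetic above can be worked through in a few lines. This sketch uses only the figures given on this page (2 GPUs and 13 cores per GPU slice, 4 GPU slices per physical node); the function name is hypothetical.

```python
import math

GPUS_PER_SLICE = 2     # each GPU slice has 2 GPUs
CORES_PER_SLICE = 13   # each GPU slice has 13 compute cores
SLICES_PER_NODE = 4    # 4 GPU slices make up one physical server


def gpu_footprint(n_gpus):
    """Return the slices, nodes, and cores needed for a job using n_gpus."""
    slices = math.ceil(n_gpus / GPUS_PER_SLICE)
    nodes = math.ceil(slices / SLICES_PER_NODE)
    return {
        "slices": slices,
        "nodes": nodes,
        "cores": slices * CORES_PER_SLICE,
        "multi_node": nodes > 1,  # True means cross-server scaling is needed
    }
```

A job needing 8 GPUs fits in 4 slices on one node, while a job needing 10 GPUs requires 5 slices spread over 2 nodes and therefore a library that can scale across servers.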