3 posts tagged with "launch"


Migrating from MOX to KLONE

Nam Pho

Director for Research Computing

If you were previously a proficient MOX user and now find yourself on KLONE, what's new or different? This is a high-level summary; please consult the documentation [link] for more details.

note

Updated August 10, 2021 to include additional information specific for GPU users.

Login#

  • Logging in was previously to mox.hyak.uw.edu; now it's klone.hyak.uw.edu.
  • As a reminder, login nodes are only for connecting to the cluster, navigating the cluster file system, and submitting jobs. This applies to both KLONE and MOX. Do not compile code on the login nodes or run any programs that require significant compute (get a session with SLURM instead).
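For example, connecting from your local terminal (replace netid with your UW NetID):

```bash
# KLONE login node (previously you would have used mox.hyak.uw.edu)
ssh netid@klone.hyak.uw.edu
```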

Data Transfer#

  • On KLONE, only use the login node to transfer data. On MOX you would have used a build node, or the login node if the transfer wasn't very computationally heavy.
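As a sketch, run from your local machine, where netid, mylab, and the dataset directory are placeholders for your own:

```bash
# Push a local directory to your lab's gscratch folder through the KLONE login node.
rsync -avP ./dataset/ netid@klone.hyak.uw.edu:/gscratch/mylab/dataset/
```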

Storage#

  • The path to lab storage is still /gscratch/mylab on both KLONE and MOX. You'll need to copy the data you want to continue using from MOX to KLONE (see the example after this list).
  • Home directories are still 10GB per user, same on both clusters.
  • Scrubbed exists on KLONE just as it did on MOX at /gscratch/scrubbed. This is a free-for-all space on both clusters where files are automatically deleted after 21 days.
  • Some new benefits of the KLONE storage compared to MOX:
    • There are snapshots for gscratch! Look inside the /gscratch/mylab/.snapshots folder for a copy of your lab folder taken once an hour, every hour, for 24 hours. This is neither a backup nor a replacement for version management (e.g., git), but it is useful for retrieving a recent version of a file or something accidentally deleted. This is currently disabled.
    • More storage! Previously you received 500GB or 0.5TB of gscratch quota per node (or pair of GPUs) contributed to MOX. Now on KLONE we've doubled your associated storage quota! For example, 2 nodes on MOX would mean 1TB of gscratch but 2 nodes on KLONE now means 2TB of gscratch. If you had an 8 x GPU node on MOX you would have received 2TB of gscratch but an 8 x GPU node on KLONE now means 4TB of gscratch.
    • It's faster! We've had reports of roughly a 30% speedup on average, all else being equal; there's nothing you need to do aside from using KLONE instead of MOX.
    • It's faster than fast! While KLONE storage is faster than MOX storage overall, gscratch on KLONE is further turbocharged with an NVMe flash based tier. NVMe flash is among the fastest storage media you can get and is a further differentiating benefit of using gscratch vs. scrubbed on KLONE.
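A minimal sketch of the MOX-to-KLONE copy mentioned above, run from a KLONE login node; netid, mylab, and project1 are placeholders for your own NetID, lab folder, and project directory:

```bash
# Pull an existing project folder from MOX into your KLONE lab storage.
rsync -avP netid@mox.hyak.uw.edu:/gscratch/mylab/project1/ /gscratch/mylab/project1/
```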

Compute#

  1. When submitting a SLURM job, whether interactive (i.e., salloc) or batch (i.e., sbatch), first decide which account to use. This is the group you're part of. Run the command groups to see your affiliated accounts, and run hyakalloc to see the resources (e.g., compute cores, memory, GPUs) used and available for each of those accounts (see the example after this list).
  2. Then decide whether you want this job to count against your group's resource allocation by submitting to the compute partition (i.e., -p compute), or to use idle resources from other groups across the cluster via the checkpoint partition (i.e., -p ckpt).
  • Non-standard partitions: run sinfo to see the list of all partitions. This only matters if your group contributed non-standard nodes (e.g., high memory, GPUs) and you need to identify the appropriate partition names to use them immediately; otherwise you'd only be able to get them in a checkpoint capacity. For GPU users this is currently either the gpu-2080ti or the gpu-rtx6k partition, for cards with 11GB and 24GB of GPU memory, respectively.
  • There is no build node on KLONE. Get an interactive session (e.g., salloc) under an account and partition combination you have access to.
  • All nodes on KLONE now have internet access. Still, do all data transfers to and from KLONE on the login nodes, which have dual 40 Gbps uplinks to the internet. The compute nodes have internet routing, but they are bottlenecked at 1 Gbps and are not suitable for big data transfers.
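Putting those steps together, a hedged sketch where the account name mylab, the resource sizes, and the job script are placeholders:

```bash
# Step 1: see your accounts and what resources they have.
groups
hyakalloc

# Step 2a: interactive session charged against your group's allocation.
salloc -A mylab -p compute -N 1 -c 4 --mem=10G --time=2:00:00

# Step 2b: the same resources as a batch job on idle (checkpoint) capacity.
sbatch -A mylab -p ckpt -N 1 -c 4 --mem=10G --time=2:00:00 myjob.sh
```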

Software#

  • Singularity containers work the same on both clusters, and we encourage their use when possible. Refer to our container documentation [link].
  • Modules have been updated to the latest versions of the core software that the HYAK team maintains (e.g., gcc, Intel, Matlab). Refresh yourself on modules [link].
  • If neither Singularity nor the existing modules work for you, you may have to re-compile your code on KLONE. "contrib" modules work differently on KLONE vs. MOX; please check out the details [link].
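For example (the module and image names below are illustrative, not a list of what's installed):

```bash
# Browse and load centrally maintained modules.
module avail
module load gcc            # load a compiler module, if available

# Or run your software from a Singularity container instead.
singularity exec mycontainer.sif python3 myscript.py
```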

Klone Soft Launch

Nam Pho

Director for Research Computing

February 25, 2021#

The UW research computing team celebrates the soft launch of project KLONE, the 3rd generation HYAK supercomputer. Welcome to those researchers invited to participate in the early access program 🥳 🎉

caution

There will be weekly maintenance days on Tuesdays during the soft launch period, after which we will move back to our regular cadence of monthly maintenance windows.

The user documentation [link] has been updated to reflect the changes and new features of KLONE but this will be an ongoing process.

Compute#

  • Soft launch with 1,920 compute cores over 48 nodes:
    • 28 x mem1 nodes (192GB of memory each) in the compute partition,
    • 4 x mem2 nodes (384GB of memory each) in the compute-bigmem partition,
    • 16 x mem3 nodes (768GB of memory each) in the compute-hugemem partition.
  • build nodes no longer exist on klone as they did on mox. Any job can be interactive, and all jobs have internet routing by default (even non-interactive ones).
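For instance, a hedged sketch of targeting one of these partitions; the account name, sizing, and job script are placeholders:

```bash
# Batch job on a high-memory node in the compute-hugemem partition.
sbatch -A mylab -p compute-hugemem -N 1 -c 8 --mem=512G --time=4:00:00 bigjob.sh
```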

Storage#

  • gscratch on klone is 1.4PB of total capacity with a new 500TB NVMe flash tier. Data tiering happens automagically: if you use a file frequently, it is moved to the faster storage.
  • Storage quota is still charged back at the same rate ($10 / TB / month). Researchers receive 1TB per node purchased and contributed to klone.

Data#

  • gscratch is not backed up; that is the responsibility of the researcher (e.g., LOLO, the cloud, an external hard drive). Feel free to email us if you have any questions.
  • While all nodes have internet access now, transfer data using the login nodes. Login nodes have the full 2 x 40 Gbps of bandwidth; if you transfer using a compute node interactive session you are limited to a single 1 Gbps connection.

Software#

  • modules work the same as they did on mox, though klone uses an improved implementation called LMOD compared to environment modules on mox.
  • We provide the basic compilers (e.g., GNU, Intel) as modules.
  • The HYAK team is encouraging a container first world (i.e., use Singularity).
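A few LMOD basics (the gcc name below is just illustrative):

```bash
module avail          # list modules currently visible on your MODULEPATH
module spider gcc     # LMOD-specific: search the full module tree
module load gcc       # load a module, if available
module list           # show what's currently loaded
```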

March 3, 2021#

The updated total is 3,840 cores and 96 nodes on klone.

Compute#

  • Compute has doubled by adding another rack to klone, an additional 1,920 compute cores over 48 nodes:
    • 44 x mem1 nodes (192GB of memory each) in the compute partition,
    • 2 x mem2 nodes (384GB of memory each) in the compute-bigmem partition,
    • 2 x mem3 nodes (768GB of memory each) in the compute-hugemem partition.

Software#

  • We created a module for cmake.

March 5, 2021#

Storage#

  • Implemented usage_report.txt files in the base folder of /gscratch/yourlab/ that are updated once an hour to reflect both your block quota and inode usage. This is similar to the gscratch experience on the MOX cluster.
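For example (yourlab is a placeholder for your lab folder):

```bash
# Check your lab's current block quota and inode usage (refreshed hourly).
cat /gscratch/yourlab/usage_report.txt
```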


March 9, 2021#

Storage#

  • Snapshots are here! We are piloting snapshots taken once an hour and kept for 24 hours for every lab storage folder under /gscratch/. Check out the updated documentation here on how to access past snapshots.
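A minimal sketch of recovering a file, where mylab, the snapshot name, and the file path are placeholders:

```bash
# List the available hourly snapshots of your lab folder.
ls /gscratch/mylab/.snapshots/

# Copy an accidentally deleted file back from a chosen snapshot.
cp /gscratch/mylab/.snapshots/<snapshot>/results/run1.csv /gscratch/mylab/results/
```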

Software#

  • We created more LMOD software modules:
    • Matlab R2020b [docs]
    • OpenMPI-4.1.0

March 12, 2021#

  • LMOD software modules:
    • Intel has bundled their software suite (e.g., compiler, MPI) as oneAPI, and we created a module for it (i.e., module load intel/oneAPI).
    • There is now a "contrib" framework for groups to store their shared codes separately from their /gscratch/labname/ data. You can get 100GB of storage to compile codes at /sw/contrib/labname-src/ and then put your LMOD module file in /sw/contrib/modulefiles/labname/; your module would then appear when anyone runs module avail (see the sketch below). This is created upon request, so if you'd like to opt in your group please let us know.
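A hedged sketch of how that might look once your group has opted in; the tool name, version, and build steps are placeholders, and the modulefile uses LMOD's Lua syntax:

```bash
# Build your group's shared tool into the contrib source area (placeholder paths).
cd /sw/contrib/labname-src/mytool-1.0
./configure --prefix=/sw/contrib/labname-src/mytool-1.0 && make && make install

# Add a minimal LMOD modulefile so "module load labname/mytool" works.
mkdir -p /sw/contrib/modulefiles/labname
cat > /sw/contrib/modulefiles/labname/mytool.lua <<'EOF'
-- illustrative LMOD modulefile
whatis("mytool 1.0 built by labname")
prepend_path("PATH", "/sw/contrib/labname-src/mytool-1.0/bin")
EOF

module avail    # labname/mytool should now be listed
```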

April 13, 2021#

Things have been going steadily the past week and changes are coming less frequently. We are now increasing the time between maintenance periods on klone from weekly (Tuesdays) to monthly, aligning it with the mox maintenance window on the 2nd Tuesday of every month.

That wraps up our klone soft launch blog updates; further updates will appear on our HYAK users mailing list. Don't forget to subscribe; instructions are at the bottom of this page.

Hello world!

Nam Pho

Director for Research Computing

tl;dr (1) decommissioned a cluster, (2) got a bunch of GPUs for machine learning, (3) launched a cluster, and (4) new and improved documentation.

2020 has definitely been an eventful year, but here on Team Hyak we've been trying to make the best of a bad situation (lemonade out of lemons and such). This year saw the decommissioning of the 1st generation Hyak cluster, ikt, and the soft launch of our 3rd generation Hyak cluster, klone. Our partnership with the Allen School and other departments across campus has enabled an explosion in on-campus GPU capacity for the current 2nd generation Hyak cluster, mox. This is all very exciting, and machine learning is only going to get bigger. We realize that whether you do your research on your laptop, on Hyak, or in the cloud, at the end of the day it's all just a computer, and what matters is what you can actually do with it. Therefore, we are placing more emphasis on new and improved documentation (this website) and will be running more regular research tutorials on Hyak throughout the coming year.

We hope you have weathered the adversity 2020 brought upon everyone. It has been a tough year for sure, but may your 2021 be brighter and have improvements in store. The Hyak team has many efforts in the works to better support your research, and they will hit full stride in the coming year. That is one improvement we can all look forward to in 2021.