2 posts tagged with "quota"

View All Tags

Disk Storage Management with Conda

Kristen Finch

Kristen Finch

HPC Staff Scientist

Hello HYAK Users,

It has come to our attention that the default configuration of Miniconda and conda environments in the user's home directory leads to hitting storage limitations and the dreaded error Disk quota exceeded. We thought we would take some time to guide users in configuring their conda environment directories and package caches to avoid this error and proceed with their research computing.

Conda's config#

Software is usually accompanied by a configuration file (aka "config file") or a text file used to store configuration data for software applications. It typically contains parameters and settings that dictate how the software behaves and interacts it's environment. Familiarity with config files allows for efficient troubleshooting, optimization, and adaptation of software to specific environments, like HYAK's shared HPC environment, enhancing overall usability and performance. Conda's config file .condarc, is customizable and lets you determine where packages and environments are stored by conda.

Understanding your Conda#

First let's take a look at your conda settings. The conda info command provides information about the current conda installation and its configuration.

note

The following assumes you have already installed Miniconda in your home directory or elsewhere such that conda is in your $PATH. Install Miniconda instructions here.

$ conda info
active environment : None
shell level : 0
user config file : /mmfs1/home/UWNetID/.condarc
populated config files : /mmfs1/home/UWNetID/.condarc
conda version : 4.14.0
conda-build version : not installed
python version : 3.9.5.final.0
virtual packages : __linux=4.18.0=0
__glibc=2.28=0
__unix=0=0
__archspec=1=x86_64
base environment : /mmfs1/home/UWNetID/miniconda3 (writable)
conda av data dir : /mmfs1/home/UWNetID/miniconda3/etc/conda
conda av metadata url : None
channel URLs : https://conda.anaconda.org/conda-forge/linux-64
. . .
package cache : /mmfs1/home/UWNetID/conda_pkgs
envs directories : /mmfs1/home/UWNetID/miniconda3/envs
platform : linux-64
user-agent : conda/4.14.0 requests/2.26.0 CPython/3.9.5 Linux/4.18.0-513.18.1.el8_9.x86_64 rocky/8.9 glibc/2.28
UID:GID : 1209843:226269
netrc file : None
offline mode : False

The paths shown above will show your username in place of UWNetID. Notice the highlighted lines above showing the absolute path to your config file in your home directory (e.g., /mmfs1/home/UWNetID/.condarc), the directory designated for your package cache (e.g., /mmfs1/home/UWNetID/conda_pkgs), and the directory/ies designated for your environments (e.g., /mmfs1/home/UWNetID/miniconda3/envs). Conda designates directories for your package cache and your environments by default, but under HYAK, your home directory has a 10G storage limit, which can quickly be maxed out by package tarballs and their contents. We can change the location for your package cache and your environments to avoid this.

tip

when you ls your home directory ls /mmfs1/home/UWNetID/ you might not see .condarc listed. It is there! To list all hidden files (files beginning with .) use ls -a /mmfs1/home/UWNetID/.

Configuring your package cache and envs directories#

Edit the highlighted lines in .condarc to designate directories with higher storage quotas for our envs_dirs and pkgs_dirs. Use a hyak preloaded editor like nano or vim to edit .condarc in place. More about nano. More about vim. Your .condarc will look like this:

$ nano ~/.condarc
channels:
- conda-forge
- bioconda
- defaults
auto_activate_base: true
envs_dirs:
- /mmfs1/home/UWNetID/miniconda3/envs
pkgs_dirs:
- /mmfs1/home/UWNetID/conda_pkgs

In this exercise, we will assign our envs_dirs and pkgs_dirs directories to directories in /gscratch/scrubbed/ where we have more storage, although remember scrubbed storage is temporary. Alternatively, your lab/research group might have another directory in /gscratch/ that can be used.

important

Remember to replace the word UWNetID in the paths below with YOUR username/UWNetID.

Here is what your edited .condarc should look like.

$ cat /mmfs1/home/UWNetID/.condarc
channels:
- conda-forge
- bioconda
- defaults
auto_activate_base: true
envs_dirs:
- /gscratch/scrubbed/UWNetID/envs
pkgs_dirs:
- /gscratch/scrubbed/UWNetID/conda_pkgs
warning

If you don't have a directory under your UWNetID in /gscratch/scrubbed/or whereever you intend to designate these directories you will need to create them now for this to work. Use the mkdir command, for example mkdir /gscratch/scrubbed/UWNetID and replace UWNetID with your username. Then create directories for your package cache and envs directory, for example, mkdir /gscratch/scrubbed/UWNetID/conda_pkgs and mkdir /gscratch/scrubbed/UWNetID/envs.

After .condarc is edited, we can use conda info to see if our changes have been incorporated.

$ conda info |grep cache
/gscratch/scrubbed/UWNetID/conda_pkgs
$ conda info |grep envs
/gscratch/scrubbed/UWNetID/envs

Cleaning up disk storage#

After you have reset the package cache and environment directories with your conda config file, you can delete the previous directories to free up storage. Before doing that, you can monitor how much storage was being occupied by each item in your home directory with the command du -h --max-depth=1. Remove directories previously used as cache and envs_dir recursively with rm -r. The following is an example of monitoring storage and removing directories.

warning

rm -r is permanent. We cannot your recover directory. You were warned.

$ du -h --max-depth=1 /mmfs1/home/UWNetID/
6.7G ./miniconda3/envs
4.0G ./conda_pkgs
. . .
$ rm -r /mmfs1/home/UWNetID/envs
$ du -h --max-depth=1 /mmfs1/home/UWNetID/
2.6G ./miniconda3/
4.0G ./conda_pkgs
. . .
note

The hyakstorage command is not simultaneously updated. Although you have cleaned up your home directory, hyakstorage might not yet show new storages estimates. du -sh will give you the most up to date information.

Storage can also be managed by cleaning up package cache periodically. Get rid of the large-storage tar archives after your conda packages have been installed with conda clean --all.

Lastly, regular maintenance of conda environments is crucial for keeping disk usage in check. Review you list of conda environments with conda env list and remove unused environments using the conda remove --name ENV_NAME --all command. Consider creating lightweight environments by installing only necessary packages to conserve disk space. For example, create an environment for each project (project1_env) rather than an environment for all projects combined (myenv).

Disk quota STILL exceeded#

Be aware that many software packages are configured similarly to conda. Explore the documentation of your software to locate the configuration file and anticipate where storage limitations might become an issue. In some cases, you may need to edit or create a config file for the software to use. pip and R are two other common offenders ballooning the disk storage in your home directory.

Configuring PIP#

If you are installing with pip, you might have a pip cache in ~/.cache/pip. Let's locate your the pip config file location under variant "global." You might have to activate a previously built conda environment to do this. For this exercise we will use an environment called project1_env.

$ conda activate project1_env
(project1_env) $ pip config list -v
. . .
For variant 'user', will try loading '/mmfs1/home/UWNetID/.pip/pip.conf'
. . .

The message "will try loading" rather than listing the config file pip.conf means that a pip config file has not been created. We will create our config file and set our pip cache. Create a directory in your home directory (e.g.,/mmfs1/home/UWNetID/.pip) to hold your pip config file and create a file called pip.conf with the touch command. Remember to also create the new directory for your new pip cache if you haven't yet.

$ mkdir /mmfs1/home/UWNetID/.pip/
$ touch /mmfs1/home/UWNetID/.pip/pip.conf
$ mkdir /gscratch/scrubbed/UWNetID/pip_cache

Open pip.conf with nano or vim and add the following lines to designate the location of your pip cache.

[global]
cache-dir=/gscratch/scrubbed/UWNetID/pip_cache

Check that your pip cache has been designated.

(project1_env) $ pip config list
/mmfs1/home/UWNetID/.pip/pip.conf
(project1_env) $ pip cache dir
/gscratch/scrubbed/UWNetID/pip_cache

Configuring R#

We previously covered this in our documentation.. Edit or create a config file called .Renviron in your home directory. Use nano oe vim to designate the location of your R package libraries. The contents of the file should be something like the following example.

$ cat ~/.Renviron
R_LIBS="/gscratch/scrubbed/UWNetID/R/"

The directory designated by R_LIBS will be where R installs your package libraries.

I'm still stuck#

Please reach out to us by emailing help@uw.edu with "hyak" in the subject line to open a help ticket.

Update on the hyakstorage command

Nam Pho

Nam Pho

Director for Research Computing

We’ve made an update to our storage accounting tool, hyakstorage, and with this update we are also phasing out usage_report.txt. That text file contained minimally-parsed internal metrics of the storage cluster, and we found it caused as many questions as it answered. Moving forward, the hyakstorage tool will display only the four relevant pieces of information for each fileset you query: storage space used vs. the storage space limit, and current amount of files (inodes) vs. maximum number of files.

The default operation–running hyakstorage with no arguments–will show your home directory & the gscratch directories you have access to, and it will only show the fileset totals & your contributions.

You can also specify which filesets you want to view, in a few different ways: you can use the flag --home to show your home directory, --gscratch to show your gscratch directories, and --contrib to show your group’s contrib directories. You can also specify an exact gscratch directory with the group name (e.g. hyakstorage stf), contrib directory (e.g. hyakstorage stf-src), or full path to a fileset (e.g. hyakstorage /mmfs1/gscratch/stf).

If you want more detailed metrics, you can use the flags --show-user or --show-group to break down the fileset totals by individual users or groups. Those detailed metrics can be sorted by space with --by-disk (the default) or by files with --by-files.

See also: