Interactive and Batch Jobs
caution
This documentation is under construction.
Interactive Jobs

An interactive session on the cluster allows users to access a compute node in real time for tasks that require direct interaction, exploration, or debugging. Request an interactive job with the `salloc` command. If you have a quick job or you are preparing software to use later, an interactive session is the best choice. Let's start an interactive job on the `ckpt` partition. We will specify that we want a single CPU with the flag `--cpus-per-task=1`, 10G of RAM with `--mem=10G`, and a maximum time of 2 hours with `--time=2:00:00`. The job will automatically end after 2 hours if we don't end it ourselves before then.
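A minimal sketch of such a request (the account name `mylab` is a placeholder; use `hyakalloc` to see which accounts and partitions are available to you):

```bash
# "mylab" is a placeholder account; substitute one of your own accounts
salloc --account=mylab --partition=ckpt --cpus-per-task=1 --mem=10G --time=2:00:00
```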
The output will look something like this:
Finally, your shell prompt will show that you are no longer on the login node.
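A sketch of a typical compute-node prompt (the exact format may vary):

```
[UWNetID@n3424 ~]$
```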
Except that the word `UWNetID` will be replaced with your UW NetID, `n3424` will be replaced with the node Slurm assigned to your interactive job, and the `~` will be replaced with the name of your current directory (your location on the filesystem).
Using Locator in interactive mode

Now that we have a job open on a compute node, we can work interactively in the container and test out our code. If the container allows it (most do), you can open a shell within the container to access the software installed there, run software-specific commands, and test and debug your code before submitting jobs to run in the background. This is also a good way to run a shorter job that doesn't need to be submitted to complete in the background.
Before we do that, we will need a directory where our locator results will be stored. I'm going to call my locator results directory `locator_out`.
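One way to create it from your current directory:

```bash
mkdir locator_out
```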
Copy the container to your current directory if you haven't already.
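A sketch of the copy; `/path/to/locator.sif` is a placeholder for wherever the container image actually lives on your system:

```bash
# Replace /path/to with the actual location of the container image
cp /path/to/locator.sif .
```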
Next, open a shell inside the locator container, `locator.sif`, with the following command.
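Assembled from the parts broken down below, the command looks like this:

```bash
apptainer shell --cleanenv --bind /gscratch/ locator.sif
```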
Let's break this command down into its parts to understand it:
- `apptainer shell` - Apptainer is the container program on Hyak, and with `shell` we are asking Apptainer to open a shell within the container.
- `--cleanenv` - Containers have their own environment variables that must be set for the software they contain to work properly. However, sometimes the environment variables from the host are too similar to those of the container, which can cause conflicts. The `--cleanenv` flag instructs the container to ignore environment variables from the host.
- `--bind /gscratch/` - The `--bind` flag mounts a filesystem to the container. The locator container, like many containers, does not include your data. Mounting the filesystem `/gscratch` means that the container can access data files that only exist outside of the container.
- `locator.sif` - The last part of the full command is the name of the locator container, which is passed to Apptainer.
You will know that you are inside of the container when your shell prompt changes.
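With Apptainer, the prompt typically changes to something like this (a sketch; the exact prompt can vary):

```
Apptainer>
```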
Let's explore within the container by listing the root directory `/`.
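For example:

```bash
ls /
```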
Notice that we have all the directories we saw when we listed the root directory of `klone`, but now we also have a directory `/locator/`, which contains the files associated with the Locator GitHub Repository. Specifically, the `/locator/scripts/` subdirectory contains a file called `locator.py`, which is the Python script used to run the Locator neural network.
Additionally, the container includes its own version of Python, which we can start from the shell.
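For example (how the interpreter is named inside the container may vary):

```bash
python3    # or simply: python
```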
Use `exit()` or hold the `Ctrl` key and press the `d` key to exit Python.
Next, we can run locator with the Populus trichocarpa dataset. Copy the data to your current directory if you haven't already.
First let's take a look at the data.
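For example, peek at the first few lines of the sample data:

```bash
head data/potr_m_pred1.txt
```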
10% of the tree origins in the sample data were randomly replaced with NA. These trees will serve as the test set. Locator will train the neural network based on the genotypes of 90% of the trees of known origin, validate the neural network on the remaining 10% of the trees of known origin, and then predict the origins of the trees in the test set, providing a set of longitudes and latitudes that can be compared with the true origins of the test set trees.
Let's test the code by running locator on the first test set, `data/potr_m_pred1.txt`.
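Assembled from the parts broken down below, the command is:

```bash
python /locator/scripts/locator.py \
    --matrix data/potr_genotypes.txt \
    --sample_data data/potr_m_pred1.txt \
    --out locator_out/potr_predictions1
```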
Let's break this command down into its parts to understand it:
- `python /locator/scripts/locator.py` - starts Python and executes the `locator.py` script.
- `--matrix data/potr_genotypes.txt` - `--matrix` is the argument that indicates the provided file `data/potr_genotypes.txt` is the genotype matrix.
- `--sample_data data/potr_m_pred1.txt` - `--sample_data` is the argument that indicates the provided file `data/potr_m_pred1.txt` is the sample data.
- `--out locator_out/potr_predictions1` - `--out` is the argument that indicates that results should be saved into the `locator_out/` directory and that the files should have the prefix `potr_predictions1`.
You'll know it is working when it starts printing messages. The first messages are errors from TensorFlow related to GPU support; since we won't be using a GPU, these can safely be ignored. The following indicates a successful start of a locator run:
Congratulations, you just trained a neural network on the genotypes of sampled Populus trichocarpa trees, and you have predicted origins for a test set of Populus trichocarpa trees based on their DNA alone. Let's look at your results.
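For example, list the files locator wrote:

```bash
ls locator_out/
```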
See the Locator publication (Battey et al. 2020) and the Locator GitHub Repository for a full explanation of the output files.
Batch Jobs

Next we are going to execute the EXACT same code, but as a batch job and with the second test set, `potr_m_pred2.txt`. Batch jobs are ideal for operations that take a longer time to run. These jobs are submitted to the job scheduler Slurm and run in the background until completed.
We made a Slurm batch script for this tutorial. You can use this script as a template for submitting a single job to Slurm, replacing the main command with your own command(s).
First copy the template to your current directory if you haven't already.
If you have been following along, the following script should work without error (except for the GPU-related errors, which can be ignored). However, you will want to read the comments in the script carefully and edit it to fit your needs for a different task.
Use `cat` to view the script, and use the text editor `nano` to edit it as needed.
The lines in the script beginning with `#SBATCH` are sbatch directives, or flags passed to sbatch that give instructions about the job we are requesting. This script requests a single-node, single-task job with 10G of RAM for a maximum time of 1 hour. See the Slurm sbatch documentation for the full list of options. Remember to use `hyakalloc` to find which accounts and partitions are available to you. If you have a `compute` partition, replace `--partition=ckpt` with `--partition=compute` and your job will be scheduled faster because you will be requesting a job on resources to which you have priority access.
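The tutorial's actual template is the one you copied above. As a rough sketch of the structure being described here (the account name `mylab` and the output prefix `potr_predictions2` are placeholders/assumptions, not the exact contents of the template):

```bash
#!/bin/bash
#SBATCH --job-name=locator_job
#SBATCH --account=mylab              # placeholder; use an account from hyakalloc
#SBATCH --partition=ckpt             # or --partition=compute if you have one
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --mem=10G
#SBATCH --time=1:00:00
#SBATCH --output=locator_job_%j.out  # %j is replaced with the Slurm JobID

# Run locator inside the container with the second test set
apptainer exec --cleanenv --bind /gscratch/ locator.sif \
    python /locator/scripts/locator.py \
    --matrix data/potr_genotypes.txt \
    --sample_data data/potr_m_pred2.txt \
    --out locator_out/potr_predictions2
```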
Once you have edited the script to fit your needs, you can submit it with `sbatch`.
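For example, if you saved the template as `locator_job.slurm` (a placeholder file name):

```bash
sbatch locator_job.slurm
```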
Pro Tip
Monitor the job with `squeue` and your `UWNetID`.
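For example (replace `UWNetID` with your own):

```bash
squeue -u UWNetID
```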
Slurm will save a file called `locator_job_12345678.out`, where the number is replaced with the JobID Slurm assigned to your job. The output that would normally be printed to the screen while locator is running (which we saw when we ran locator interactively) will be saved to this file. View this file with `cat`.
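For example, with the number replaced by your job's ID:

```bash
cat locator_job_12345678.out
```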
Or follow the messages in real time with the `tail` command and the flag `--follow`.
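For example (press `Ctrl`+`c` to stop following):

```bash
tail --follow locator_job_12345678.out
```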
Congratulations, you just trained a neural network on the genotypes of sampled Populus trichocarpa trees and predicted origins for a second test set of Populus trichocarpa trees based on their DNA alone. But this time you did it with a batch job. Let's look at your results.
That Slurm job ran entirely in the background, meaning that we could have submitted the job, ended our connection to `klone` by logging out, and returned later to view the progress or results. You can instruct Slurm to send messages about jobs completing by adding the following sbatch directives to your Slurm script, replacing the word `UWNetID` with your UW NetID.
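A sketch of those directives, assuming notifications should go to your UW email address (adjust `--mail-type` to the events you want):

```bash
#SBATCH --mail-user=UWNetID@uw.edu   # replace UWNetID with your UW NetID
#SBATCH --mail-type=END,FAIL         # email when the job completes or fails
```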
In the next section, we will use a Slurm batch script to submit multiple jobs as an array to be executed in the background in parallel.