Parallel Computing
caution
This documentation is under construction.
Background
In machine learning, there is some inherent randomness (e.g., a random starting point when the network begins training), and error estimates can fluctuate across training iterations. Additionally, when training a neural network on a biological system, like our Populus trichocarpa trees, the origin of some trees might be easier to predict than others for practical or biological reasons. For example, DNA quality could be a practical reason that a tree's origin is uncertain, and a biological reason could be that trees from a large region of the species distribution are genetically homogeneous. Because of this uncertainty and randomness, we want to train the neural network on multiple test sets to get a better understanding of the distribution of origin prediction error.
For this worked example, we have 5 test sets of Populus trichocarpa trees, each with a different random draw of 10% of individuals whose true origin has been replaced with NA. We want to train the neural network with each test set, so that later we can combine the results and calculate prediction error from a broader diversity of P. trichocarpa trees.
Array Jobs
The method for solving this embarrassingly parallel computing problem is very similar to what we set up in the last section. We will use a Slurm batch script to submit an array of jobs to be executed in parallel by adding the sbatch directive #SBATCH --array=. In our case, that is #SBATCH --array=0-4, which will execute 5 jobs, one for each test set. Let's take a look at the script before we test it out.
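As a minimal sketch of the idea (not the tutorial's actual script), an array-job header could look like the following. The job name and output pattern are inferred from the output file names described in this section, and the fallback value of 0 for the task ID is an assumption added only so the sketch also runs outside of Slurm.

```shell
#!/bin/bash
# Sketch of an array-job header; values are illustrative assumptions.
#SBATCH --job-name=locator_array     # assumed from the output file names
#SBATCH --partition=ckpt             # swap for a partition you can access
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --mem=10G
#SBATCH --array=0-4                  # five array tasks with indices 0..4
#SBATCH --output=%x_%A_%a.out        # job name, array job ID, array index

# Slurm sets SLURM_ARRAY_TASK_ID for each array task.
# The default of 0 is only a convenience for running this sketch by hand.
TASK_ID="${SLURM_ARRAY_TASK_ID:-0}"

echo "Running array task ${TASK_ID}"
```

Each of the five array tasks runs the same script body; only the value of SLURM_ARRAY_TASK_ID differs between them.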
Use cat to view the script.
And use the text editor nano to edit it as needed. Remember to use hyakalloc to find which accounts and partitions are available to you. If you have a compute partition, replace --partition=ckpt with --partition=compute and your job will be scheduled faster because you will be requesting resources to which you have priority access.
The work of transforming this batch job into an array job is done by using the Slurm environment variable SLURM_ARRAY_TASK_ID to select a test set. SLURM_ARRAY_TASK_ID is an index (0-4) that corresponds to one file in the data/ directory. The file list is saved as a variable FILES, and the file at the position given by SLURM_ARRAY_TASK_ID is saved as a variable FILE, which is passed as the input with the flag --sample_data. We also use the SLURM_ARRAY_TASK_ID index as a suffix for the results that will be saved in the locator_out/ directory.
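The indexing described above can be sketched roughly as follows. This is an illustration under stated assumptions, not the tutorial's script: the temporary directory and empty placeholder files stand in for the real data/ directory, the file names follow the data/potr_m_pred0.txt pattern mentioned in this section, and the default task ID of 0 exists only so the sketch runs outside of Slurm.

```shell
#!/bin/bash
# Hypothetical stand-in for the data/ directory: five empty placeholder files.
workdir="$(mktemp -d)"
mkdir -p "${workdir}/data"
for i in 0 1 2 3 4; do
    touch "${workdir}/data/potr_m_pred${i}.txt"
done

# Slurm sets SLURM_ARRAY_TASK_ID per array task; default to 0 for a dry run.
TASK_ID="${SLURM_ARRAY_TASK_ID:-0}"

# Save the file list as a bash array (the glob expands in sorted order),
# then pick the one file whose position matches this task's index.
FILES=("${workdir}"/data/*)
FILE="${FILES[${TASK_ID}]}"

# Reuse the index as a suffix so each task writes to its own results prefix.
OUT_PREFIX="locator_out/array_potr_predictions_${TASK_ID}"

echo "task ${TASK_ID}: --sample_data ${FILE}"
echo "task ${TASK_ID}: results prefix ${OUT_PREFIX}"
```

Because the index picks a position in the sorted file list, the range given to --array needs to match the number of files in the directory.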
This single script schedules an array of 5 jobs, one for each test set (${FILE}). Each job in the array will run as one task on one node with 10G of RAM. Each job in the array will produce an output file like locator_array_12345678_0.out, using %x as shorthand for the job-name, %A as shorthand for the array-jobID that Slurm assigns when the job is submitted, and %a for the index of the job within the array. The array-jobID will replace 12345678 in locator_array_12345678_0.out, and there will be 5 output files, one for each job: locator_array_12345678_0.out through locator_array_12345678_4.out.
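To make the naming concrete, the short sketch below imitates how a pattern of the form %x_%A_%a.out would expand for the five array tasks. The pattern itself is inferred from the example file name above, and 12345678 is the same placeholder array-jobID used in the text; Slurm performs this substitution itself, so the loop is only an illustration.

```shell
#!/bin/bash
job_name="locator_array"    # what %x expands to (the job name)
array_job_id="12345678"     # what %A expands to (placeholder array job ID)

outputs=()
for task_id in 0 1 2 3 4; do    # what %a expands to (the array index)
    outputs+=("${job_name}_${array_job_id}_${task_id}.out")
done

# One output file name per array task.
printf '%s\n' "${outputs[@]}"
```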
Once you have edited the script to fit your needs, you can submit it with sbatch.
And use squeue with watch to monitor the progress of the jobs in real time.
The watch command executes the squeue command every 2 seconds, allowing you to watch the jobs in real time. List the current directory to see the output files there.
Use tail to compare two of the output files to each other.
As you can see, test set 0 (data/potr_m_pred0.txt) took slightly longer to execute, and the validation error mean and median differ between the runs. The results files locator_out/array_potr_predictions_0_predlocs.txt and locator_out/array_potr_predictions_1_predlocs.txt are distinct as well and contain predictions for the trees whose origins were NA in the test set. The next step would be to combine the results and calculate the distance between the true and predicted origins, but that data analysis is outside the scope of this tutorial.
Congratulations! Each job executed in parallel took around 4 minutes to complete; had you executed them serially, it would have taken about 20 minutes. What will you do with all of your extra time? Go forth and parallelize your workflows.
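The time savings can be checked with a little arithmetic. The sketch below assumes that every job takes about 4 minutes and that all five array tasks start at the same time; in practice, queue wait times and uneven runtimes will change the numbers.

```shell
#!/bin/bash
n_jobs=5             # number of test sets / array tasks
minutes_per_job=4    # approximate runtime of one job, from the text

# Serial: jobs run one after another. Parallel: all run at once (assumed).
serial_minutes=$(( n_jobs * minutes_per_job ))
parallel_minutes=$minutes_per_job
speedup=$(( serial_minutes / parallel_minutes ))

echo "serial: ~${serial_minutes} min, parallel: ~${parallel_minutes} min, speedup: ~${speedup}x"
```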