In previous chapters we have worked with examples where the code can be feasibly run on a standard personal computer, but at some point datasets get large enough that it’s simply not feasible to do the work on a single machine. For many scientists, the next move is to use a cluster, which involves a number of computers connected to one another. It is certainly possible to build one’s own cluster, and some researchers insist on doing so; occasionally there are good reasons for this (I did it for several years early in my career), but if centralized high-performance computing (HPC) resources are available at one’s institution (as they are at many universities) then it often makes sense to take advantage of these resources. First, they are generally administered by professional system administrators (known as sysadmins), who can ensure that they operate smoothly and securely. This also helps one avoid having to spend one’s personal time scrambling to get the cluster working after a problem occurs. Second, they are generally going to be much larger than any system that an individual researcher or research group can afford to create. While it might make sense to have dedicated hardware if a group can utilize it at full capacity most of the time, it’s more common for researchers to have usage patterns that are bursty, such that they only occasionally need a large system, meaning that a dedicated system would be sitting idle most of the time. Third, centralized systems will generally have dedicated power and cooling resources, whereas it’s not uncommon for the “cluster in a closet” to overheat or overload a power system that wasn’t designed to handle such a load.
There are two different kinds of high-performance computing resources that one might take advantage of. The first is on-premise (often called “on-prem”) resources, which are physically located within the institution. Most large research institutions have an on-prem system that is available to researchers, often for free. The other alternative is cloud systems, which are run by commercial cloud vendors and come with costs including machine usage and data ingress/egress charges. Because most scientists end up using on-prem HPC resources, I am going to focus on those in this chapter, but I will later discuss cloud computing options.
Anatomy of an HPC system¶
A high-performance computer (often called a supercomputer) is composed of a set of individual computers that are known as nodes. These are usually compact rack-mounted systems that can be stacked tightly into a data center. Most of the nodes in a large HPC system are known as compute nodes because their primary role is to perform computations; users generally cannot log into these nodes unless they are actively performing computations on them. A small number of nodes are known as login nodes, which are available for users to log into in order to use the system. While logging in through SSH is most common, many systems also provide web-based interfaces that allow use of Jupyter notebooks or other graphical interfaces. Login nodes also usually have fewer compute resources (cores/RAM) than compute nodes.
Most compute-intensive jobs on an HPC system are run in a batch mode, in which jobs are submitted to a queue via a job scheduler and then run when sufficient resources are available. This is a very different model from the interactive mode of working on one’s own computer, where one can simply run a job directly. The most important rule of working on a shared HPC system is this: never run computing jobs on a login node! Because the login nodes are shared by all users, they can sometimes have hundreds of users logged in on a popular system. If even just a few of these ran computationally intensive jobs, the node would quickly grind to a halt. On most HPC systems, the sysadmins will quickly kill any large jobs that are run on a login node, and may revoke access privileges for serial abusers of the policy. I’ll discuss later how to obtain an interactive session on a compute node where it’s permissible to run computationally intensive jobs, but this is not generally the mode that is used when working on an HPC system.
The nodes in an HPC system, which can number into the thousands on major systems, are connected to one another by a high-performance network, either an Infiniband network or a high-speed (10-100 Gb) Ethernet network. This is important because while both of these network fabrics have high bandwidth (meaning that they can push a large amount of data at once), Infiniband has much lower latency, meaning that the messages travel much more quickly (by at least an order of magnitude). Further, Infiniband offers a special ability called Remote Direct Memory Access (RDMA), which allows it to inject information directly into memory without going through the CPU. For distributed computations across many nodes this provides a massive performance boost by allowing the CPU to focus solely on computing rather than on memory management. I’ll return to this later when I discuss distributed computing on HPC systems.
HPC filesystems¶
Most HPC systems have several different filesystems with different purposes/use cases.
Home¶
The home filesystem is the place where users’ home directories are located. It will generally have relatively small storage limits, and is generally mounted using the Network File System (NFS) protocol, which is a common and reliable way to share file systems but is not optimized for high-speed data transfer or parallel access. For this reason, it’s important to never use the home directory as an output location for jobs run on an HPC system, as doing so could both reduce the performance of the job and degrade the performance of the system for other users. The home filesystem is likely to be backed up regularly.
High-performance (“scratch”) storage¶
The high-performance storage on an HPC system is generally referred to as scratch storage. This storage is connected to the compute nodes by a high-performance network, and generally uses a parallel filesystem that can handle a high amount of parallel activity (such as Lustre or GPFS). This is the system that one should generally write to when running jobs that involve a substantial amount of output. However, scratch filesystems are not meant for long-term storage, and most have policies such that data will be automatically removed after some period of disuse (often 60-90 days). Thus, it’s best practice to move important outputs from scratch to a more persistent storage system immediately.
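For example, a wrap-up step at the end of a run (or a command run afterwards from a login node) might copy the results over to project storage; the paths here are placeholders for whatever your own system provides:

# Illustrative only: copy finished outputs from scratch to longer-term project storage.
# The source and destination paths are placeholders for your system's layout.
rsync -av /scratch/users/$USER/myproject/results/ /projects/mylab/myproject/results/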
High-capacity storage¶
Most HPC systems have a storage system that is meant for longer-term storage of large data, often with many petabytes of capacity. This is where one would store data that are meant to persist for the duration of a project, which could last months or years. On most systems this storage is not backed up, but as I discussed in the chapter on Data Management, the data are generally stored on a system with redundant storage, making data loss possible but unlikely. These systems are usually mounted on the compute nodes, so that it is possible to write to them directly from a compute job, but the performance of these systems will usually be lower than the scratch system. Jobs that need to write large amounts of data should thus usually write to scratch, and then transfer the necessary data to high-capacity storage for safe keeping.
One important thing to know about both scratch and high-capacity systems is that they generally have limits both on the amount of space that one has available and on the number of files (often referred to as “inodes”). In addition, these systems usually have a metadata server that must be accessed for each file that is opened, and opening a large number of small files across many nodes can cause the system to hang for everyone. It can thus be problematic to generate and work with many small files on these systems. If your workflows involve many small files, it’s worth considering whether they can be refactored to either use storage formats that can combine many files into a single one (like HDF5) or to use a database system (such as an SQLite database, which stores information in a local file).
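As a minimal illustration of the consolidation idea (using plain CSV outputs rather than HDF5 or SQLite, and assuming that all of the per-task files share a common header line), many small result files can be merged into a single file before further processing:

# Merge many small per-task CSV files into one file to reduce inode usage and
# metadata-server load; assumes every file in results/ has the same header line.
out=combined_results.csv
first=$(ls results/*.csv | head -n 1)
head -n 1 "$first" > "$out"          # copy the header once
for f in results/*.csv; do
    tail -n +2 "$f" >> "$out"        # append data rows, skipping each file's header
done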
Local node storage¶
The fastest storage available on a high-performance computing system is the local disk storage on each compute node. Because these disks are attached directly to the node, they are much faster than the other filesystems that are connected via the network. They make a good place for storage of intermediate work products that don’t need to be saved. However, there are a few caveats to their use. First, they often have much smaller capacity than the network filesystems, and exceeding their capacity will cause crashes that can sometimes be hard to diagnose. In the worst case it can fill the root partition and cause a kernel panic, requiring extra work for the sysadmin to recover the node. Second, some systems actually use a RAM disk, so that any information that you write to “disk” is actually taking RAM away from your computation; check the documentation for your local system to find out more. Third, the data will generally be removed as soon as your job ends (if you used the temporary directory function correctly), making it impossible to retrieve any logs or other information that can help debug crashes. If you are going to use local storage, it’s important to do the following (a minimal sketch follows the list):
Make sure that it’s a physical drive and not a RAM disk
Use the temporary directory environment variable provided by your job submission system (such as $TMPDIR)
Clean up any files that you create, in case the job submission system fails to clean them up
Copy any important information to the global scratch system before the job completes
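Here is a minimal sketch of that pattern within the command section of a batch script; the script name, output file, and scratch path are purely illustrative, and $TMPDIR is assumed to be set by the scheduler to point at node-local storage (check that it isn’t a RAM disk!):

# Work in a job-specific directory on node-local storage (assumed to be $TMPDIR)
WORKDIR=$TMPDIR/job_${SLURM_JOB_ID}
mkdir -p $WORKDIR

# run the computation, writing intermediate files locally (names are illustrative)
python3 $HOME/code/myproject/analysis.py --workdir $WORKDIR

# copy the outputs that matter back to global scratch before the job ends
cp $WORKDIR/results.h5 /scratch/users/$USER/myproject/

# clean up in case the scheduler does not do it for us
rm -rf $WORKDIR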
Job schedulers¶
When I was a postdoc, we had a couple of computers in the lab that were used to analyze brain imaging data (a Sun Microsystems workstation and a Linux box). Because there were several people in the lab analyzing data, it was a constant struggle to avoid overloading the machines by running too many jobs at once. This is exactly the role that a job scheduler plays: ensuring that the resources of the system are used as efficiently and consistently as possible while preventing overloads. The most common job scheduler in use today is the Simple Linux Utility for Resource Management (Slurm), which I will focus on in my discussion. However, it is important to know that Slurm installations can differ substantially between different HPC systems, so some of the suggestions below may not work on all systems.
Working on a shared HPC system generally involves submitting a batch job to the job scheduler. This is very different from “running a program” on a personal computer. Instead of executing the program immediately, the scheduler puts the job into a queue, in which the program waits until the resources required to run it are available. The job request includes details about the amount of compute (number of nodes/cores and amount of RAM) as well as the maximum amount of time requested. If the job requires few resources and requests a small amount of time then it may start running within seconds; if it requests many resources and/or a long run time, it might have to wait a significant amount of time in the queue until the resources become available, depending on how busy the system is. HPC systems often become very busy in the weeks before major conference abstract deadlines, leading to very long wait times; this is simply the tradeoff that one must make to gain access to such large systems. On the large systems I have used at Stanford and the University of Texas, I have seen large jobs wait a day or two to run during busy times. On some systems it is possible for researchers to purchase their own nodes, which are placed into a private queue, allowing dedicated resources for the group while still allowing those resources to be used by others when they are idle.
Job schedulers generally use a Fairshare system to allocate resources to users. This system computes the resources used by each user according to a sliding window over time, and assigns priority in such a way that users with heavy recent usage have a lower priority score for job execution. One can see one’s priority score using the sshare command on systems that use Slurm for scheduling:
$ sshare
Account User RawShares NormShares RawUsage EffectvUsage FairShare
---------- --------- ---------- ----------- --------- ------------- ----------
russpold russpold 100 0.033333 6533 0.000065 0.816451
The important number here is the “FairShare” value of 0.81. This is a value that ranges from 1 (lowest usage, highest priority) to zero (very high usage, lowest priority), with 0.5 representing having used one’s “fair share”, with neutral priority. In order to protect one’s FairShare score, it can be useful to spread jobs out over time. If you end up with a low score, it should improve after a few days with low usage.
Anatomy of a batch job¶
A batch job is defined by a batch file, which has a specific format depending on the scheduler; I will focus here on the Slurm format. Here is the first part of a Slurm file that runs a single Python program:
#!/bin/bash
# ----------------SLURM Parameters----------------
#SBATCH --partition=normal
#SBATCH --nodes=1
#SBATCH --ntasks=4
#SBATCH --cpus-per-task=4
#SBATCH --mem=32G
#SBATCH --time=02:00:00
#SBATCH --mail-user=myemail@university.edu
#SBATCH --mail-type=END,FAIL
#SBATCH -J my_python_job
#SBATCH -D /scratch/workdir/pyjob
#SBATCH -o pyjob-%j.out
#SBATCH -e pyjob-%j.err
Note the Bash directive at the top of the file; a Slurm script is simply a Bash script that includes some directives (the lines starting with #SBATCH) for Slurm to use in creating the job request:
--partition=normal: the partition (or queue) that the job should be run in, in this case using the queue called normal
--nodes=1: minimum number of nodes requested
--ntasks=4: number of distinct tasks that will be run
--cpus-per-task=4: number of cores required for each task (defaults to one core per task)
--mem=32G: total amount of memory requested per node
--time=02:00:00: maximum runtime for the job (after which it will be automatically cancelled)
--mail-user=myemail@university.edu: email address to which notifications will be sent
--mail-type=END,FAIL: events that trigger email notification (in this case successful completion or failure)
-J my_python_job: a name for the job, which will appear in the queue listing
-D /scratch/workdir/pyjob: working directory for the job
-o pyjob-%j.out: file containing standard output from the job, with the job ID substituted for %j
-e pyjob-%j.err: file containing standard error from the job
After specifying these directives, we specify the modules to be loaded (more about that below) and then the commands that we want to run:
# ----------------Modules------------------------
module load python/3.12.1
# ----------------Commands------------------------
# 1. PREVENT OVER-SUBSCRIPTION
export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK
export MKL_NUM_THREADS=$SLURM_CPUS_PER_TASK
# 2. RUN COMMANDS IN PARALLEL
srun --ntasks=1 --nodes=1 --cpus-per-task=1 python3 fibnumber.py -i 1000001 &
srun --ntasks=1 --nodes=1 --cpus-per-task=1 python3 fibnumber.py -i 1000002 &
srun --ntasks=1 --nodes=1 --cpus-per-task=1 python3 fibnumber.py -i 1000003 &
srun --ntasks=1 --nodes=1 --cpus-per-task=1 python3 fibnumber.py -i 1000004 &
# 3. THE "WAIT" COMMAND
# Without this, the script finishes immediately, and Slurm kills your Python jobs.
wait
There are three sections in the commands portion of the file. The first sets some important environment variables (OMP_NUM_THREADS and MKL_NUM_THREADS) that are used by Numpy and other packages to determine how many cores are available for multithreading. If these variables are not set then Numpy will by default try to use all of the available cores, which can lead to excessive context-switching that can actually cause the code to run slower. The second section runs the commands using the srun command, with settings that specify the number of tasks, nodes, and cores for each script. Ending each line with & causes it to be run in the background, which allows the next command to start; otherwise the script would wait for that line to complete before starting the next command. The final section includes a wait command, which tells the script to wait until the background tasks are complete before exiting; without it, the batch script would finish immediately and Slurm would kill the running Python jobs.
Running a batch job¶
To submit the batch job, we simply use the sbatch call:
$ sbatch fibnumber.sbatch
Submitted batch job 14653975
We can then look at the queue to see the status of the job, using the squeue command. Running that command would return a very long list, depending on the size of the system. For reference, here are the results from the Sherlock supercomputer at Stanford that I used for testing:
$ squeue | wc
8473 68399 684460
$ squeue | grep PD | wc
1933 16076 161026
At the time I ran my jobs there were more than 8,000 jobs in the queue, of which almost 2000 were pending. We can use grep to pull out the jobs that are owned by me (using my account name russpold); otherwise it would be a long list of all jobs on the system:
$ squeue | grep russpold
14653975 normal fibnumbe russpold PD 0:00 1 (None)
The status PD means that the job is pending. The parenthetical note at the end ((None)) specifies the reason that the job hasn’t started running yet. It will initially say None until the scheduler has actually processed the job and determined its rank in the queue; this status should only last a few seconds. There are several other reasons that might also be listed:
Priority: there are other jobs in the queue that are ahead of yours
Resources: you have a high enough priority to run but the required resources are not available
Dependency: the job has dependencies that are not yet finished
JobLimit: you have reached your maximum number of allowable tasks
ReqNodeNotAvail: there is a hardware problem or upcoming maintenance window
If you find yourself stuck for a long time due to Resources, it likely means that you are requesting more resources than your project actually needs.
Once your job starts running, the status turns to R:
$ squeue | grep russpold
14654437 normal fibnumbe russpold R 1:11 1 sh02-01n36
The information at the end (sh02-01n36) is the name of the compute node where the job is running. In general it’s only possible to SSH into a compute node while one has an active job running, as I did here in order to look at the processes running under my name:
$ ssh sh02-01n36
------------------------------------------
Sherlock compute node
------------------------------------------
$ ps aux | grep russpold
russpold 20439 0.0 0.0 113504 1788 ? S 10:26 0:00 /bin/bash /var/spool/slurmd/job14654437/slurm_script
russpold 20467 0.0 0.0 116332 6580 ? S 10:26 0:00 srun --ntasks=1 --nodes=1 --cpus-per-task=1 python3 fibnumber.py -i 1000001
russpold 20468 0.0 0.0 116332 6576 ? S 10:26 0:00 srun --ntasks=1 --nodes=1 --cpus-per-task=1 python3 fibnumber.py -i 1000002
russpold 20469 0.0 0.0 116332 6576 ? S 10:26 0:00 srun --ntasks=1 --nodes=1 --cpus-per-task=1 python3 fibnumber.py -i 1000003
russpold 20470 0.0 0.0 421832 9228 ? Sl 10:26 0:00 srun --ntasks=1 --nodes=1 --cpus-per-task=1 python3 fibnumber.py -i 1000004
russpold 20480 0.0 0.0 116332 2608 ? S 10:26 0:00 srun --ntasks=1 --nodes=1 --cpus-per-task=1 python3 fibnumber.py -i 1000004
russpold 20522 12.3 0.0 150888 14224 ? S 10:26 0:11 /share/software/user/open/python/3.12.1/bin/python3 fibnumber.py -i 1000004
russpold 20667 1.1 0.0 115048 3516 pts/0 Ss 10:27 0:00 -bash
russpold 20915 0.0 0.0 153564 1836 pts/0 R+ 10:27 0:00 ps aux
russpold 20916 0.0 0.0 112828 1000 pts/0 S+ 10:27 0:00 grep --color=auto russpold
This output is a bit odd: for one of the inputs (-i 1000004) there are two srun processes along with the actual Python process, while for the others there is only a single srun process, and those Python processes never actually got run before the time expired. This occurs because Slurm assumes that each srun command has full access to all of the resources that have been requested, and the first of the srun commands to hit the Slurm controller (which in this case was the -i 1000004 one) will run while the others will be stuck in a holding pattern waiting for the required resources. In order to cause the multiple processes to run at the same time, we need to add the --exclusive flag, as well as specify how the total memory should be divided between the processes:
srun --ntasks=1 --nodes=1 --cpus-per-task=1 --mem=250M --exclusive python3 $CODEDIR/fibnumber.py -i 1000001 &
srun --ntasks=1 --nodes=1 --cpus-per-task=1 --mem=250M --exclusive python3 $CODEDIR/fibnumber.py -i 1000002 &
srun --ntasks=1 --nodes=1 --cpus-per-task=1 --mem=250M --exclusive python3 $CODEDIR/fibnumber.py -i 1000003 &
srun --ntasks=1 --nodes=1 --cpus-per-task=1 --mem=250M --exclusive python3 $CODEDIR/fibnumber.py -i 1000004 &
Logging into the compute node we now see that each Python process is running along with two srun processes for each input:
russpold 30556 0.0 0.0 113504 1788 ? S 11:05 0:00 /bin/bash /var/spool/slurmd/job14658999/slurm_script
russpold 30611 0.4 0.0 421832 9224 ? Sl 11:05 0:00 srun --ntasks=1 --nodes=1 --cpus-per-task=1 --mem=250M --exclusive python3 /home/users/russpold/code/bettercode/src/bettercode/slurm/fibnumber.py -i 1000001
russpold 30612 0.4 0.0 421832 9216 ? Sl 11:05 0:00 srun --ntasks=1 --nodes=1 --cpus-per-task=1 --mem=250M --exclusive python3 /home/users/russpold/code/bettercode/src/bettercode/slurm/fibnumber.py -i 1000002
russpold 30613 0.4 0.0 421832 9228 ? Sl 11:05 0:00 srun --ntasks=1 --nodes=1 --cpus-per-task=1 --mem=250M --exclusive python3 /home/users/russpold/code/bettercode/src/bettercode/slurm/fibnumber.py -i 1000003
russpold 30614 0.3 0.0 421832 9232 ? Sl 11:05 0:00 srun --ntasks=1 --nodes=1 --cpus-per-task=1 --mem=250M --exclusive python3 /home/users/russpold/code/bettercode/src/bettercode/slurm/fibnumber.py -i 1000004
russpold 30634 0.0 0.0 116332 2608 ? S 11:05 0:00 srun --ntasks=1 --nodes=1 --cpus-per-task=1 --mem=250M --exclusive python3 /home/users/russpold/code/bettercode/src/bettercode/slurm/fibnumber.py -i 1000004
russpold 30635 0.0 0.0 116332 2608 ? S 11:05 0:00 srun --ntasks=1 --nodes=1 --cpus-per-task=1 --mem=250M --exclusive python3 /home/users/russpold/code/bettercode/src/bettercode/slurm/fibnumber.py -i 1000003
russpold 30648 0.0 0.0 116332 2604 ? S 11:05 0:00 srun --ntasks=1 --nodes=1 --cpus-per-task=1 --mem=250M --exclusive python3 /home/users/russpold/code/bettercode/src/bettercode/slurm/fibnumber.py -i 1000001
russpold 30658 0.0 0.0 116332 2608 ? S 11:05 0:00 srun --ntasks=1 --nodes=1 --cpus-per-task=1 --mem=250M --exclusive python3 /home/users/russpold/code/bettercode/src/bettercode/slurm/fibnumber.py -i 1000002
russpold 30731 44.8 0.0 150888 14228 ? S 11:05 0:09 /share/software/user/open/python/3.12.1/bin/python3 /home/users/russpold/code/bettercode/src/bettercode/slurm/fibnumber.py -i 1000004
russpold 30746 45.1 0.0 150508 13716 ? S 11:05 0:09 /share/software/user/open/python/3.12.1/bin/python3 /home/users/russpold/code/bettercode/src/bettercode/slurm/fibnumber.py -i 1000003
russpold 30751 57.1 0.0 150044 13388 ? S 11:05 0:11 /share/software/user/open/python/3.12.1/bin/python3 /home/users/russpold/code/bettercode/src/bettercode/slurm/fibnumber.py -i 1000002
russpold 30754 56.9 0.0 149892 13296 ? S 11:05 0:11 /share/software/user/open/python/3.12.1/bin/python3 /home/users/russpold/code/bettercode/src/bettercode/slurm/fibnumber.py -i 1000001
Why are there two srun processes? It turns out that srun first starts a lead process, whose job it is to communicate with the Slurm controller (in this case that’s PID 30614 for the -i 1000004 job). This is how the job can be cancelled if the user cancels it (using scancel) or when the allotted time expires. This process then starts a helper process (in this case PID 30634), which sets up the environment and actually runs the Python script, which is running in PID 30731. These processes are treated as part of a single group, which ensures that if the lead runner gets killed, the helper and actual Python script also get killed, preventing zombie processes from persisting on the compute node.
Parametric sweeps¶
A very common use case on HPC systems is a parametric sweep, where we want to run the same code with different combinations of parameter settings. We did a small version of that with the example above, but what if we wanted to run hundreds of jobs? The most efficient way to do this is to use a job array, which allows submission of a set of related jobs. This has several advantages over submitting a large number of independent jobs:
It only submits a single job to the scheduler, which makes it easier to monitor or cancel the job
It allows one to limit the number of jobs run at once
It is easy to create an index file that contains the relevant parameters and then read those into the sbatch file or the code directly
First let’s implement our previous analysis as a job array; we wouldn’t usually use a job array for such a small set, but it’s a useful example. Here is what the sbatch file would look like:
#!/bin/bash
#SBATCH --partition=normal
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=1
#SBATCH --mem=250M
#SBATCH --time=00:05:00
#SBATCH --mail-user=russpold@stanford.edu
#SBATCH --mail-type=END,FAIL
#SBATCH -J fibnumber
# 1. Use small indices (1, 2, 3, 4)
#SBATCH --array=1-4
# 2. use indices and job ID in the output file names
#SBATCH -o fibnumber-%A_%a.out
#SBATCH -e fibnumber-%A_%a.err
#SBATCH -D /scratch/users/russpold/bettercode/fibnumber
module load python/3.12.1
CODEDIR=/home/users/russpold/code/bettercode/src/bettercode/slurm
# 3. Perform the math to get your large number
# Starting at 1000000 and adding the array index (1, 2, 3, or 4)
ITER_VALUE=$((1000000 + SLURM_ARRAY_TASK_ID))
echo "Running task $SLURM_ARRAY_TASK_ID with iteration value $ITER_VALUE"
srun python3 $CODEDIR/fibnumber.py -i $ITER_VALUE
First note that this batch file specifies the resources needed for each individual job, rather than for the combined job as before; this is because jobs in a job array are treated separately. We create array indices (from 1 to 4) that are used to identify the jobs; these end up in the file names for the error and output files via the %a specifier. This script is then called once for each value of the array index, and the environment variable $SLURM_ARRAY_TASK_ID contains the value of the index; we add this to 1000000 to get the values that we want to pass along to the Python script. Once the job begins, we see this in the Slurm queue:
$ squeue | grep russpold
14679499_1 normal fibnumbe russpold R 2:46 1 sh02-01n49
14679499_2 normal fibnumbe russpold R 2:46 1 sh02-01n49
14679499_3 normal fibnumbe russpold R 2:46 1 sh02-01n41
14679499_4 normal fibnumbe russpold R 2:46 1 sh02-01n41
There is a single job number (14679499), but each individual job in the array is shown separately in the queue; also notice that two different nodes are being used for the different jobs.
Job arrays become very useful when one wants to run a large number of parametric variations. In these cases it can be useful to specify the parameters in a text file and then read that file into the Slurm script, rather than using Bash to create the parameters. For example, let’s say that we have a prediction problem and we want to run a multiverse analysis to assess analytic variability across models and cross-validation schemes, running each combination with 100 different random seeds. We could generate a text file called params.txt containing a column for each of these:
Lasso ShuffleSplit 42
Ridge ShuffleSplit 593
...
We could then pass the task ID to our Python script, which would read in the parameter file and select the row specified by the task ID to retrieve the parameters for that run:
#SBATCH --array=1-100%10 # Run 100 tasks, but only 10 at a time (%)
# The task ID tells the Python script which line to use from the parameter file
python3 runmodel.py $SLURM_ARRAY_TASK_ID
Job arrays work well up to about 1000 jobs, beyond which schedulers often get unhappy; it’s often useful to think about reorganizing the work so that there are fewer jobs but each job does more work. It’s very important to include a throttle on the job array (the %10 in --array=1-100%10) when the array is large, which causes only a subset of the jobs to run at once, in order to prevent the filesystem from being overwhelmed if the jobs are reading data. It’s also useful to create a file for each job that marks successful completion, which makes it possible to resubmit the array and rerun only the jobs that crashed, without rerunning those that were successful; alternatively one might consider using Snakemake, which interoperates very well with Slurm.
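Here is a sketch of how these ideas might look inside the array batch script; the field names, the done/ directory, and the command-line flags for runmodel.py are hypothetical (the example above simply passes the task ID and lets the Python script read the parameter file itself):

# Pull the parameter line for this task from params.txt (line N for task N)
PARAMS=$(sed -n "${SLURM_ARRAY_TASK_ID}p" params.txt)
read MODEL CVSCHEME SEED <<< "$PARAMS"

# Skip tasks that already have a completion marker from a previous run
mkdir -p done
DONEFILE=done/${SLURM_ARRAY_TASK_ID}.done
if [ -f "$DONEFILE" ]; then
    echo "Task $SLURM_ARRAY_TASK_ID already completed; skipping"
    exit 0
fi

# Run the model, and write the marker only if it exits successfully
srun python3 runmodel.py --model "$MODEL" --cv "$CVSCHEME" --seed "$SEED" && touch "$DONEFILE"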
Job dependencies¶
Sometimes there are dependencies between jobs, such that one job must complete before another can start. These are the kinds of dependencies that would often be handled by a workflow engine, but they can also be specified directly in Slurm using the --dependency flag. There are several dependency flags that can be used:
afterok:jobid: only run after jobid has completed successfully
afternotok:jobid: only run after jobid has failed
afterany:jobid: run after jobid completes regardless of status
after:jobid: only run after jobid has started
singleton: only run one job with the current job’s name at a time
For example, if you had a job that you wanted to run only after job 999444 had completed, you could use:
sbatch --dependency=afterok:999444 part2.sbatch
You can also retrieve the job ID from an sbatch command more easily by using the --parsable flag, which allows easy chaining of jobs:
# Submit first job, capture job ID
JOB1=$(sbatch --parsable preprocess.sbatch)
# Second job waits for first
JOB2=$(sbatch --parsable --dependency=afterok:$JOB1 analyze.sbatch)
# Third job waits for second
sbatch --dependency=afterok:$JOB2 postprocess.sbatch
Resource estimation¶
Effective and efficient use of HPC resources requires a good understanding of the resource needs of the code that is being run. There are really four important resources to think about:
CPUs/job: How many cores do you need for each job?
Memory/job: How much RAM does each process need?
Jobs/run: how many jobs will be run in parallel?
Time/run: How long will the entire run take?
Time/run and jobs/run are directly related: more jobs take more time to complete, holding other resources constant. In addition, longer jobs generally spend a longer time in the queue before being executed. Most schedulers also have a maximum time limit (usually 24-72 hours), though longer runs can often be allowed upon special request (another reason to be kind to the sysadmin). In general, I aim to keep jobs to no more than 12 hours, as I have found that longer jobs will often sit in the queue for a very long time.
Estimating the CPUs required for each job requires an understanding of the parallelism that the job can take advantage of. For example, code that performs operations using Numpy without any explicit multiprocessing might seem like it only needs a single CPU, but remember that Numpy actually takes advantage of implicit parallelization when multiple cores are available. Thus, optimal performance might occur with more cores per job; keeping in mind the diminishing returns of increasing parallelism, something like 4 cores per job is probably a sweet spot for Numpy jobs, though it’s worth doing some benchmarking for large jobs where performance is a major concern.
In my experience, memory is the resource that is trickiest to estimate and causes the most problems, in part because it is often used in a bursty way across the computation. I have seen workflows that require less than 1 GB of RAM for most of their runtime, but then require over 10GB for a very short period. The memory allocation request must be sufficient to cover the maximum amount needed for the entire run. If a workflow has a single operation that takes a large amount of RAM, then it might make sense to isolate this operation as a separate job, since requesting large amounts of RAM for a long-running workflow can result in long waits in the queue.
When estimating resource needs, it’s generally a good idea to pad the estimated memory and runtime estimates by about 20%, since performance can vary across jobs. This will help prevent jobs from being killed due to excessive runtime or from crashing due to insufficient RAM.
Job profiling with Slurm¶
One essential tool for resource estimation is the seff tool provided by Slurm, which provides a report for completed jobs on their resource usage. Here is an example from one of the example jobs run earlier:
Job ID: 14654437
Cluster: sherlock
User/Group: russpold/russpold
State: TIMEOUT (exit code 0)
Nodes: 1
Cores per node: 4
CPU Utilized: 00:00:01
CPU Efficiency: 0.07% of 00:22:28 core-walltime
Job Wall-clock time: 00:05:37
Memory Utilized: 20.83 MB
Memory Efficiency: 2.03% of 1.00 GB (1.00 GB/node)
This report tells us that I requested much more CPU time and more RAM than I utilized for the job. If this were a real job, I would first try to pack more computations into a single job, since it’s not very efficient to perform one second’s worth of computation in an HPC job; the overhead costs much more time. Instead I would want to perform thousands of operations within this job, so that the overhead cost is spread out over many more computations (known as amortizing the overhead). I would also greatly reduce the amount of memory requested, given that I only needed about 2% of what I asked for.
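On systems where job accounting is enabled, the sacct command can provide similar information for completed jobs; the field list below is just one reasonable selection:

# Query the accounting database for a completed job's elapsed time and peak memory usage
sacct -j 14654437 --format=JobID,JobName,Elapsed,MaxRSS,State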
Checkpointing for HPC¶
I previously discussed the challenge of jobs with long-tailed completion time distributions, which are also a problem for HPC resource estimation: If we estimate the runtime based on the worst-case scenario, then we could end up sitting on a lot of resources while a few long-running jobs finish. If we request less runtime, then we will end up with some jobs being cancelled, wasting computation. For this reason, it’s good practice to use checkpoints to save intermediate results so that computations don’t have to be repeated, and to save information that specifies which jobs have been completed successfully. This allows a multi-stage process where we first run the entire set of jobs with a moderate runtime limit, and then rerun any jobs that failed with a longer time limit.
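A sketch of the second stage of that process, assuming that each task writes a sentinel file into a done/ directory on success (as in the earlier parameter-sweep sketch): build the list of unfinished task IDs and resubmit only those, overriding the time limit from the command line. The file names and the 1-100 range are assumptions.

# Collect the task IDs that have no completion marker
MISSING=""
for i in $(seq 1 100); do
    if [ ! -f done/${i}.done ]; then
        MISSING="${MISSING:+$MISSING,}$i"
    fi
done

# Resubmit only the unfinished tasks, with a longer runtime limit
if [ -n "$MISSING" ]; then
    sbatch --array=$MISSING --time=24:00:00 runmodel_array.sbatch
fi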
We already discussed checkpointing in the context of workflows, and the ease with which workflow managers implement checkpointing is a good reason to consider using one in the context of HPC. However, it’s worth mentioning that this adds a layer of complexity that can make debugging of problems challenging.
Software configuration on HPC systems¶
New HPC users, particularly those with strong system administration skills, often struggle with the fact that they don’t have the same level of control over the HPC system as they would over their own machine. This is particularly true when it comes to the installation of software, which often requires administrator (“root”) access to the system, which HPC users generally don’t have. Fortunately there are several ways to gain access to the software that HPC users need.
Modules¶
Most HPC systems offer the ability to load software modules that provide access to many different software packages. The module function provides access to these. Here is a listing of the modules that are loaded by default when I log into my local HPC system, obtained using module list:
$ module list
Currently Loaded Modules:
1) devel (S) 5) ucx/1.12.1 (g) 9) openmpi/4.1.2 (g,m)
2) math (S) 6) libfabric/1.14.0 10) hdf5/1.12.2 (m)
3) gcc/10.1.0 7) system
4) cuda/11.5.0 (g) 8) munge/0.5.16
Where:
S: Module is Sticky, requires --force to unload or purge
g: GPU support
m: MPI support
A listing of all available modules can be printed using module avail, but this will often be a very long list! A more useful tool is module spider, which searches the available modules for a specific term. For example, we could search for the available Python modules:
$ module spider python
----------------------------------------------------------------------------
python:
----------------------------------------------------------------------------
Description:
Python is an interpreted, interactive, object-oriented programming
language.
Versions:
python/2.7.13 (devel)
python/3.6.1 (devel)
python/3.9.0 (devel)
python/3.12.1 (devel)
python/3.14.2 (devel)
Other possible modules matches:
py-biopython py-bx-python py-cuda-python py-cuquantum-python ...
Let’s say that I want to load the latest version of Python. By default our system provides access to an older version of Python 3:
$ python3 --version
Python 3.6.8
By loading the python/3.14.2 module, we can gain access to a newer version:
$ module load python/3.14.2
$ python3 --version
Python 3.14.2
We could also set this as a default using module save, and the next time we log in we would see that this module is loaded. If there is a package that you need that isn’t in the current list of packages, a friendly email to the help desk is usually sufficient to get it installed as a module.
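On systems that use Lmod, the default-saving step described above is simply:

module load python/3.14.2
module save    # save the currently loaded modules as the default collection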
Virtual environments¶
Throughout the book I have talked about the utility of virtual environments, and they are commonly used on HPC systems to gain access to packages or package versions that are not available as modules on the system. There is, however, one issue that should be kept in mind when using virtual environments in the HPC context. When we install a virtual environment, the environment folder contains all of the dependencies that are installed in the environment. For some projects this can end up being quite large, to the degree that one can run into disk quota issues if environments are stored in the home directory. For example, the full Anaconda installation is almost 10 GB, which would largely fill the 15 GB quota for my home directory on the local HPC system; for this reason, I always recommend using miniconda, which is a much more minimal installation. uv does a better job of caching, but its local cache directory can also get very large over many projects. For this reason, we generally install Conda-based environments outside of the home directory, on a filesystem that has a larger quota. When using uv, we generally set the $UV_CACHE_DIR environment variable to a location with a larger quota as well.
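Here is an illustrative configuration along those lines; the target paths are placeholders for whatever filesystem on your system has a larger quota:

# Keep the uv cache out of the home directory
export UV_CACHE_DIR=/scratch/users/$USER/uv_cache

# Tell Conda to create environments and cache packages outside of home as well
conda config --add envs_dirs /scratch/users/$USER/conda/envs
conda config --add pkgs_dirs /scratch/users/$USER/conda/pkgs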
Containers¶
Containers solve at least two important problems for HPC users. First, as I noted in the earlier discussion, they provide a strong platform for reproducible computing; given that the HPC user has no control over operating system upgrades, containers help ensure that software dependencies will remain consistent over time. Second, they allow users to install software that may not be runnable on the HPC system; for example, the user may require a version of a package that is too old to run on the HPC’s current operating system. For these reasons, containers have become very popular on HPC systems.
However, there is a rub: because Docker requires that the user have root access to the system, it can’t be run by users on HPC systems (which restrict root access to sysadmins). Instead, most HPC systems support Apptainer (formerly called Singularity) as a platform for running containers. On most HPC systems, however, users cannot build new container images from scratch; Apptainer is generally used with existing images, defined through an image file or downloaded from an image registry such as Dockerhub. In many cases, there are existing containers on Dockerhub that have the appropriate software installed, which can be used directly from Apptainer. However, if no suitable pre-existing container image is available, the user will need to first generate a container using Docker on a system where they have administrative access (usually their own personal computer), and then convert the Docker image into an Apptainer image for use on the HPC system. This can lead to long debugging cycles if there are problems with the container that require rebuilding, but once it’s working it should continue working in the long term.
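When a suitable image already exists on a registry, the basic Apptainer workflow is to pull it (which converts it to a local .sif file) and then run commands inside it; the image here is just an example, and on some systems you may first need to load an apptainer module:

# Pull an existing image from Docker Hub and convert it to Apptainer's .sif format
apptainer pull mycontainer.sif docker://python:3.12-slim

# Run a command inside the container
apptainer exec mycontainer.sif python3 --version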
Interactive access to HPC systems¶
I have focused so far primarily on usage of HPC systems from the command line shell, which for many years was the only way to access these systems. However, HPC systems increasingly offer graphical interfaces to interact with their system. One common platform is OnDemand, which offers access to a large number of applications on an HPC system via a web interface. Other systems offer access to Jupyter notebooks via JupyterHub, and RStudio Server for R users. These interactive tools still usually require launching a Slurm job, which is done automatically, but this means that one can still end up waiting in a queue before being able to access the system.
Software development on an HPC system¶
Software development on HPC systems can be difficult compared to using a personal computer. While some IDEs like VSCode support the use of remote systems, the security policies of HPC centers often prevent the use of those systems for remote development. Thus, one is often stuck having to edit code by hand using a text-based editor in the UNIX shell. This is one reason why I think it is essential that computational scientists have decent skill using a text-based editor like vi or emacs. In general, my development workflow for HPC software is as follows; this won’t work for every situation but should be fairly generalizable.
First, I develop the code using a small version of the problem on my laptop. This provides me with access to all of the coding tools that I’m accustomed to, and avoids the need to wait for submitted jobs to run on an HPC system. I might start prototyping with Jupyter notebooks, but ultimately would move to a Python package, with my code implemented as Python modules and scripts managed by uv. This makes it easy to portably install the package on the HPC system when it’s ready. Depending on the nature of the code (e.g. whether it has additional non-Python dependencies) I would also consider implementing it within a Docker container.
Second, I would install the code on the HPC system and run it interactively within an interactive development shell, which is available using the srun command:
srun --nodes=1 --ntasks=1 --cpus-per-task=4 --mem=16G --time=2:00:00 --pty bash
This provides an interactive shell on a compute node, so that one can run computationally intensive jobs. I would run my code in this shell in order to ensure that it executes successfully on the system, and also to profile its memory usage for the Slurm batch job.
Third, once the code is running properly in the interactive shell I would implement a small Slurm batch job that checks to see whether the code runs properly in the batch environment. I would start with a very short runtime (e.g. 5 minutes) and a single core, simply to check if the code is able to run; often this will uncover environment issues, such as modules that need to be loaded or environment variables that need to be set in the Slurm batch script. Using a short runtime and a single core helps the test job make it through the queue more effectively.
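A minimal smoke-test script along those lines might look like the following; the module version, script path, and --smoke-test flag are placeholders for whatever your own code provides:

#!/bin/bash
#SBATCH --partition=normal
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=1
#SBATCH --mem=1G
#SBATCH --time=00:05:00
#SBATCH -J smoketest
#SBATCH -o smoketest-%j.out
#SBATCH -e smoketest-%j.err

module load python/3.12.1
srun python3 $HOME/code/myproject/run_analysis.py --smoke-test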
Finally, I would create the Slurm script to run the full job at scale on the system. This workflow is probably overkill for small jobs, but for large jobs that require many hours across many CPUs it can save a lot of time, since those jobs may spend hours or even days in the queue before they are run.
Distributed computing using MPI¶
So far I have focused on using HPC resources for jobs that are embarrassingly parallel, such that we can run a large number of jobs without having to coordinate between them. This is an increasingly common use case for HPC, particularly with the advent of “big data”, but historically a major use case for HPC was the execution of massive computations across many nodes that require coordination between nodes to achieve parallelism. This is particularly the case for very large simulations, such as cosmological models, climate models, and dynamical models such as molecular or fluid dynamics. These applications commonly use a framework called the Message Passing Interface (MPI), which allows the computation to run simultaneously across many nodes and coordinates computations by sending messages between nodes, taking advantage of high-performance interconnects like Infiniband when available. I’m not going to go into detail about MPI here since it has become relatively niche (I personally have never used it in my 15+ years of using HPC systems), but it is important to be aware of if you are working on a problem that exceeds the memory of a single node and requires intensive communication between nodes.
Cloud computing¶
In addition to the HPC resources available at many research institutions, another option for large compute needs is the commercial cloud systems offered by companies such as Amazon Web Services, Google Cloud Platform, and Microsoft Azure. These systems offer almost unlimited resources, compared to the hard limits imposed by on-prem HPC; instead, the limit becomes one’s budget for computing, since these systems are pay-to-play. There are a few cases where cloud computing makes a lot of sense, assuming that the budget allows:
There is no on-prem HPC center available
The compute/storage needs of the project exceed the locally available resources
One needs access to specialized hardware (such as specific GPUs or large-memory instances)
Computing needs to be reproducible by a broad group of individuals (who may not be able to access the on-prem system)
Benefits of commercial cloud computing are that the resources are generally available immediately, and that the user has complete control and root access to the resources. The downside of commercial cloud computing is that it can become very expensive very quickly. Every component costs money: computing time, data storage, and data egress, as well as other features (e.g. database hosting). There are ways to reduce costs, such as using pre-emptible resources (which may be interrupted), but these systems can still be very costly for large data/compute needs.
We have used cloud computing for various projects, and it can be very useful for obtaining specific kinds of compute resources quickly. However, I generally feel that pay-to-play is a bad model for scientific computing, because it inhibits the kind of creativity and exploration that I think scientists need to be able to engage in. When every computing cycle costs money, researchers are less likely to play around and possibly discover something new. That said, cloud systems are an important tool in the computational toolbox, and can often help solve problems that would be intractable using on-prem HPC resources.