# Running Jobs on the CSP Cluster with the Slurm Workload Manager

In Fall 2022, we refreshed the CSP cluster by adding new nodes, removing old low-memory nodes, and switching to a new OS and job manager/scheduler. The new cluster uses the Slurm Workload Manager, which is what the GACRC also uses. There are a few things you need to know to get started using Slurm. Below, we show the basic steps of running your code in the queue system, as well as where to go for more information.

## Get a cluster account

Cluster accounts are no longer linked to departmental accounts, which means that your cluster home directory is not the same as your department home directory.

## Cluster Access

Only the login/compile node, csp1.csp.uga.edu, is accessible from outside the cluster. You can log in via ssh to csp1.csp.uga.edu from on campus, or from off campus if you are logged in to the UGA VPN. Transferring files to the cluster via rsync, scp, or sftp must also be done through this node. (Note: the terminal prompt will display the host name as node0, which is its hostname on the internal cluster network.)

## LMOD Modules System

Modules are used to connect to the different software packages needed to run your scripts. They handle all the required paths and environment variables associated with the software, so different versions of the software can be switched quickly and easily using the module commands. The most common use case is switching from the gnu compilers and OpenMPI to the Intel compilers and IMPI.

### Default Modules

A set of modules is loaded automatically on login to set up the default environment. This includes the gnu compiler suite (gcc, g++, gfortran, etc.), currently version 9.4.0, and the OpenMPI 4.1.1 runtime, which adds mpiCC, mpic++, mpicc, mpicxx, mpif77, mpif90, mpifort, and mpirun built with the gnu compilers.

*The default modules loaded on login via the ohpc module*

### Available Modules

To see what modules are available, use the `module avail` command or the `ml av` shortcut. The `module spider` command shows detailed info about software versions.

*Details for the intel module*

### Changing Loaded Modules

If you want to use the Intel compiler suite instead of the gnu compilers, you will need to unload the gnu modules and load the intel modules. There are a couple of ways to do that.

First option: `module swap`

```
module swap gnu9 intel/2022.1.0
module swap openmpi4 impi/2021.6.0
```

Second option: unload all modules with `module purge`, then load the intel modules

```
module purge
module load intel/19.0.5.281
module load impi/2019.5.281
```

These module commands can be included in your job scripts as well.

### Module Commands

| Command | Syntax | Description |
|---|---|---|
| module avail | `module avail` | Shows all the available modules |
| module spider | `module spider <software-name>` | Shows detailed information about the specified module |
| module load | `module load <software-name>` | Loads the specified module into your environment |
| module list | `module list` | Lists all the currently loaded modules |
| module swap | `module swap <m1> <m2>` | Unloads module m1 and loads module m2 |
| module unload | `module unload <software-name>` | Unloads the specified module |
| module purge | `module purge` | Removes all the currently loaded modules |
| module help | `module help` | Prints the help information |

See the LMOD Documentation to learn more about the modules system.
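To see how the loaded modules determine which compiler toolchain you get, the session below is a minimal sketch rather than site-specific documentation: the `hello.c` source file is hypothetical, and the exact wrapper names provided by the `impi` module (e.g. `mpiicc`) may differ depending on the Intel MPI version installed.

```bash
# With the default gnu9/openmpi4 modules, the MPI wrappers use the GNU compilers
which mpicc                      # shows the OpenMPI wrapper provided by the openmpi4 module
mpicc -O2 -o hello_gnu hello.c   # builds with gcc behind the scenes

# Swap to the Intel toolchain, as described above
module swap gnu9 intel/2022.1.0
module swap openmpi4 impi/2021.6.0

# Intel MPI typically provides mpiicc/mpiifort wrappers for the Intel compilers
mpiicc -O2 -o hello_intel hello.c
module list                      # confirm which compiler and MPI modules are now loaded
```

The same `module swap` lines can be placed near the top of a job script so the job runs with the toolchain it was built against.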
## Basic Slurm Commands

| Command | Syntax | Description |
|---|---|---|
| sbatch | `sbatch <job-script>` | Submit a batch script to Slurm for processing. |
| squeue | `squeue -u <username>` | Show information about your job(s) in the queue. Run without the `-u` flag, it lists your jobs along with all other jobs in the queue. |
| srun | `srun <resource-parameters>` | Run jobs interactively on the cluster. |
| skill/scancel | `scancel <job-id>` | End or cancel a queued or running job. |
| sacct | `sacct` | Show information about current and previous jobs. |
| sinfo | `sinfo` | Get information about the resources on the available nodes that make up the HPC cluster. |

## Example Job Script

```bash
#!/bin/bash
#SBATCH --job-name=mclz            # Job name
#SBATCH --partition=batch          # Partition (queue) name
#SBATCH --nodes=1                  # Number of nodes
#SBATCH --ntasks-per-node=1        # Number of tasks to call on each node
#SBATCH --mem-per-cpu=200mb        # Memory per processor
#SBATCH --time=24:00:00            # Time limit hrs:min:sec
#SBATCH --output=mclz.%j.out       # Standard output log
#SBATCH --error=mclz.%j.err        # Standard error log

JOBDIR=${SLURM_JOBID}
mkdir $JOBDIR
export PATH=./bin:$PATH

time srun ./bin/mclz.sh -n -I C -Z 6 -A He -L TA -N "TA"

cp c6+he* $JOBDIR
```

## Submitting jobs

### sbatch

The command I would use to submit the script above is:

```
sbatch example.sh
```

The above command works assuming I'm in the same directory as example.sh. sbatch can also take arguments, but those can also be placed in the script as SBATCH directives. In example.sh, the lines starting with `#SBATCH` are directives that can also be given as command-line arguments. If you don't want to specify the number of tasks per node in the script, do the following:

```
sbatch --ntasks-per-node=1 example.sh
```

When the job is submitted, sbatch prints the job number it was assigned (463 in this example). This number is available in the $SLURM_JOBID environment variable inside your submit script, which means that in this example `mkdir $JOBDIR` runs as `mkdir 463`.

### squeue

squeue shows the currently running or pending jobs. If you only want to see your own jobs, run:

```
squeue --me
```

### scancel

scancel removes a job from the queue. This can be useful if you need to make changes before the job runs, or if the job needs to be stopped while it's running.

## Output

In addition to any file output that your program generates, Slurm captures the standard output and standard error and writes them to files. By default, Slurm combines the two into a single file called slurm-$JOBID.out, where $JOBID is the ID Slurm assigned to the job. You can specify a different name with the `--output` SBATCH directive; use %j as a placeholder for the job ID. For example:

```
#SBATCH --output=example.%j.out
```

To capture the standard error output in a separate file, use the `--error` SBATCH directive, e.g.

```
#SBATCH --error=example.%j.error
```

The error file contains any errors generated while your script was running. This is useful for debugging your submit script or executable.

Click here for more submit script samples.
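Finally, the `srun` command listed in the table above can also be used on its own to request an interactive session on a compute node. The following is a minimal sketch; the resource values, and the choice of `bash` as the shell, are illustrative rather than site-specific recommendations.

```bash
# Request an interactive shell on one node in the batch partition,
# with one task, 2 GB of memory, and a 1-hour time limit
srun --partition=batch --nodes=1 --ntasks-per-node=1 \
     --mem=2G --time=01:00:00 --pty bash

# Once the shell starts on the compute node, you can load modules and
# run commands interactively; type "exit" to release the allocation
```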