
Using the CSP Computational Cluster

Running Jobs on the CSP Cluster with the Slurm Workload Manager

In Fall 2022, we refreshed the CSP cluster by adding new nodes, removing old low-memory nodes, and switching to a new OS and job manager/scheduler. The new cluster uses the Slurm Workload Manager, which is what the GACRC also uses. There are a few things you need to know to get started with Slurm. Below, we show the basic steps for running your code through the queue system, as well as where to go for more information.

Get a cluster account

Cluster accounts are no longer linked to departmental accounts, which means that your cluster home directory is not the same as your department home directory.

Cluster Access

Only the login/compile node, csp1.csp.uga.edu, is accessible from outside the cluster. You can log in via ssh to csp1.csp.uga.edu from on campus, or from off campus if you are connected to the UGA VPN. Transferring files to the cluster via rsync, scp, or sftp must also be done through this node. (Note: the terminal prompt will display the host name as node0, which is its hostname on the internal cluster network.)
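For example, assuming your cluster username is MyID (a placeholder; use your own account name) and the files shown are your own, logging in and copying data to your cluster home directory would look something like this:

ssh MyID@csp1.csp.uga.edu
scp input_data.txt MyID@csp1.csp.uga.edu:~/
rsync -av my_project/ MyID@csp1.csp.uga.edu:~/my_project/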

LMOD Modules System

Modules are used to set up your environment for the software packages you want to run. They handle all of the paths and environment variables associated with a package, so different versions of the software can be switched quickly and easily using the module commands. The most common use case is switching from the GNU compilers and OpenMPI to the Intel compilers and Intel MPI (IMPI).

Default Modules

A set of modules is loaded automatically on login to set up the default environment. This includes the GNU compiler suite (gcc, g++, gfortran, etc.), currently version 9.4.0, and the OpenMPI 4.1.1 runtime, which provides mpiCC, mpic++, mpicc, mpicxx, mpif77, mpif90, mpifort, and mpirun built with the GNU compilers.

[Screenshot: the default modules loaded on login via the ohpc module]
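You can confirm what the default environment provides with a few quick checks, for example:

module list          # show the currently loaded modules
gcc --version        # should report the GNU 9.4.0 compilers
mpirun --version     # should report OpenMPI 4.1.1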

Available Modules

To see what modules are available, use the `module avail` command or the `ml av` shortcut.


The `module spider` command shows detailed info about software versions.

[Screenshot: details for the intel module]
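For example, to find out which Intel toolchain versions are installed and how to load one (intel/2022.1.0 is the version used in the swap example below):

module spider intel              # list the available intel module versions
module spider intel/2022.1.0     # show details and prerequisites for a specific version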

Changing Loaded Modules

If you want to use the Intel compiler suite instead of the GNU compilers, you will need to unload the GNU modules and load the Intel modules. There are a couple of ways to do that.

First option: module swap

module swap gnu9 intel/2022.1.0

module swap openmpi4 impi/2021.6.0

Second option: unload all modules with module purge, then load the Intel modules

module purge
module load intel/19.0.5.281
module load impi/2019.5.281

These module commands can be included in your job scripts as well.
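After swapping (or purging and reloading), a quick check confirms the Intel toolchain is active; the exact versions listed will depend on what you loaded:

module list          # intel and impi should now appear instead of gnu9 and openmpi4
which mpirun         # should point into the Intel MPI installation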

Module Commands

Command          Syntax                            Description
module avail     module avail                      Shows all the available modules
module spider    module spider <software-name>     Shows detailed information about a specific module
module load      module load <software-name>       Loads the specified module into your environment
module list      module list                       Lists all the currently loaded modules
module swap      module swap <m1> <m2>             Unloads module m1 and loads module m2
module unload    module unload <software-name>     Unloads the specified module from your environment
module purge     module purge                      Removes all the currently loaded modules
module help      module help                       Prints the help information for the module system

See the LMOD Documentation to learn more about the modules system.

Basic Slurm Commands

Command    Syntax                       Description
sbatch     sbatch <script-name>         Submits a batch script to Slurm for processing
squeue     squeue -u <username>         Shows information about your jobs in the queue; run without the -u flag to see all jobs in the queue
srun       srun <resource-parameters>   Runs a job interactively on the cluster
scancel    scancel <job-id>             Ends or cancels a queued or running job
sacct      sacct                        Shows information about current and previous jobs
sinfo      sinfo                        Shows information about the resources on the nodes that make up the HPC cluster
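A few typical invocations, with placeholders for the username and job ID:

sinfo                        # list partitions and node states
sbatch example.sh            # submit the batch script shown below
squeue -u <username>         # show only your jobs
sacct -j <job-id>            # accounting information for a specific job
scancel <job-id>             # cancel a queued or running job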

Example Job Script



#!/bin/bash
#SBATCH --job-name=mclz             # Job name
#SBATCH --partition=batch           # Partition (queue) name
#SBATCH --nodes=1                   # Number of nodes
#SBATCH --ntasks-per-node=1         # Number of tasks per node
#SBATCH --mem-per-cpu=200mb         # Memory per processor
#SBATCH --time=24:00:00             # Time limit hrs:min:sec
#SBATCH --output=mclz.%j.out        # Standard output log
#SBATCH --error=mclz.%j.err         # Standard error log

JOBDIR=${SLURM_JOBID}
mkdir $JOBDIR

export PATH=./bin:$PATH

time srun ./bin/mclz.sh -n -I C -Z 6 -A He -L TA -N "TA"
cp c6+he* $JOBDIR
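Since module commands can be included in job scripts, a multi-task MPI job that switches from the default GNU/OpenMPI environment to the Intel toolchain might look roughly like the sketch below. The program name mympi_prog and the resource numbers are placeholders; adjust them for your own code.

#!/bin/bash
#SBATCH --job-name=mympi            # Job name
#SBATCH --partition=batch           # Partition (queue) name
#SBATCH --nodes=2                   # Number of nodes
#SBATCH --ntasks-per-node=8         # MPI tasks per node
#SBATCH --mem-per-cpu=500mb         # Memory per processor
#SBATCH --time=24:00:00             # Time limit hrs:min:sec
#SBATCH --output=mympi.%j.out       # Standard output log
#SBATCH --error=mympi.%j.err        # Standard error log

# Swap the default GNU/OpenMPI modules for the Intel toolchain
module swap gnu9 intel/2022.1.0
module swap openmpi4 impi/2021.6.0

# srun launches one MPI rank per task requested above (2 nodes x 8 tasks)
srun ./mympi_prog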
 

Submitting jobs

sbatch

The command I would use to submit the first example script above, saved as example.sh, is:

sbatch example.sh

The above command works assuming I'm in the same directory as example.sh. sbatch can also take command-line arguments; these correspond to the lines starting with '#SBATCH' in the script, which are directives. For example, if you don't want to specify the number of tasks per node in the script, you can pass it on the command line instead:

sbatch --ntasks-per-node=1 example.sh
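Other directives can be overridden the same way; for example, to request more time and memory than the script asks for (the values here are arbitrary):

sbatch --time=48:00:00 --mem-per-cpu=500mb example.sh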


When the script is submitted, the sbatch command returns a job number (463 in this example). That number is available inside the job as $SLURM_JOBID, which the example script assigns to JOBDIR, which means that in this example:

mkdir $JOBDIR

runs as:

mkdir 463

squeue

squeue shows the currently running and pending jobs. If you only want to see your own jobs, run squeue --me.
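A few common variations, with placeholders for the username and job ID:

squeue --me                  # only your own jobs
squeue -u <username>         # the same, naming the user explicitly
squeue -j <job-id>           # a single job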


scancel

scancel removes a job from the queue. This can be useful if you need to make changes before the job runs, or if it needs to be stopped while it is running:

scancel <job-id>

Output

In addition to any file output that your program generates, Slurm captures the standard output and standard error and puts them into files. By default, Slurm combines the two into a single file called slurm-$JOBID.out, where $JOBID is the ID Slurm assigned to the job. You can specify a different name with the --output SBATCH directive, using %j as a placeholder for the job ID. For example:

#SBATCH --output=example.%j.out

To capture the standard error output in a separate file, use the --error SBATCH directive, e.g.:

#SBATCH --error=example.%j.error

The error file contains any errors generated while your script was running. This is useful for debugging your submit script or executable.
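Continuing with the example job from above (job 463, job name mclz, with --output=mclz.%j.out and --error=mclz.%j.err), the log files could be inspected like this once the job finishes:

ls mclz.463.*        # the standard output and error logs for job 463
tail mclz.463.out    # last lines of standard output
cat mclz.463.err     # any errors from the run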


Click here for more submit script samples.
