# Running Jobs on the CSP Cluster with the Slurm Workload Manager

In Fall 2022, we refreshed the CSP cluster by adding new nodes, removing old low-memory nodes, and switching to a new OS and job manager/scheduler. The new cluster uses the Slurm Workload Manager, which is what the GACRC also uses. There are a few things you need to know to get started using Slurm. Below, we show the basic steps of running your code in the queue system, as well as where to go for more information.

## Get a cluster account

Cluster accounts are no longer linked to departmental accounts, which means that your cluster home directory is not the same as your department home directory.

## Cluster Access

Only the login/compile node, csp1.csp.uga.edu, is accessible from outside the cluster. You can log in via ssh to csp1.csp.uga.edu from on campus, or from off campus if you are logged in to the UGA VPN. Transferring files to the cluster via rsync, scp, or sftp must also be done through this node. (Note: the terminal prompt will display the host name as node0, which is its hostname on the internal cluster network.)

## LMOD Modules System

Modules are used to connect to the different software packages needed to run your scripts. They handle all the required paths and environment variables associated with the software, so different versions of the software can be switched quickly and easily using the module commands. The most common use case is switching from the gnu compilers and OpenMPI to the Intel compilers and IMPI.

### Default Modules

A set of modules is loaded automatically on login to set up the default environment. This includes the gnu compiler suite (gcc, g++, gfortran, etc.), currently version 9.4.0, and the OpenMPI 4.1.1 runtime, which adds mpiCC, mpic++, mpicc, mpicxx, mpif77, mpif90, mpifort, and mpirun built with the gnu compilers.

*The default modules loaded on login via the ohpc module*

### Available Modules

To see what modules are available, use the `module avail` command or the `ml av` shortcut. The `module spider` command shows detailed info about software versions.

*Details for the intel module*

### Changing Loaded Modules

If you want to use the Intel compiler suite instead of the gnu compilers, you will need to unload the gnu modules and load the intel modules. There are a couple of ways to do that.

First option: `module swap`

```
module swap gnu9 intel/2022.1.0
module swap openmpi4 impi/2021.6.0
```

Second option: unload all modules with `module purge`, then load the intel modules

```
module purge
module load intel/19.0.5.281
module load impi/2019.5.281
```

These module commands can be included in your job scripts as well.

### Module Commands

| Command | Syntax | Description |
|---|---|---|
| module avail | `module avail` | Shows all the available modules |
| module spider | `module spider <software-name>` | Shows detailed information about the specified module |
| module load | `module load <software-name>` | Loads the specified module into your environment |
| module list | `module list` | Lists all the currently loaded modules |
| module swap | `module swap <m1> <m2>` | Unloads module m1 and loads module m2 |
| module unload | `module unload <software-name>` | Unloads the specified module |
| module purge | `module purge` | Removes all the currently loaded modules |
| module help | `module help` | Prints the help information |

See the LMOD Documentation to learn more about the modules system.
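To see how the loaded modules determine which compiler toolchain you get, the session below is a minimal sketch rather than site-specific documentation: the `hello.c` source file is hypothetical, and the exact wrapper names provided by the `impi` module (e.g. `mpiicc`) may differ depending on the Intel MPI version installed.

```bash
# With the default gnu9/openmpi4 modules, the MPI wrappers use the GNU compilers
which mpicc                      # shows the OpenMPI wrapper provided by the openmpi4 module
mpicc -O2 -o hello_gnu hello.c   # builds with gcc behind the scenes

# Swap to the Intel toolchain, as described above
module swap gnu9 intel/2022.1.0
module swap openmpi4 impi/2021.6.0

# Intel MPI typically provides mpiicc/mpiifort wrappers for the Intel compilers
mpiicc -O2 -o hello_intel hello.c
module list                      # confirm which compiler and MPI modules are now loaded
```

The same `module swap` lines can be placed near the top of a job script so the job runs with the toolchain it was built against.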
## Basic Slurm Commands

| Command | Syntax | Description |
|---|---|---|
| sbatch | `sbatch <job-script>` | Submit a batch script to Slurm for processing. |
| squeue | `squeue -u <username>` | Show information about your job(s) in the queue. Run without the `-u` flag, it lists your jobs along with all other jobs in the queue. |
| srun | `srun <resource-parameters>` | Run jobs interactively on the cluster. |
| skill/scancel | `scancel <job-id>` | End or cancel a queued or running job. |
| sacct | `sacct` | Show information about current and previous jobs. |
| sinfo | `sinfo` | Get information about the resources on the available nodes that make up the HPC cluster. |

## Example Job Script

```bash
#!/bin/bash
#SBATCH --job-name=mclz            # Job name
#SBATCH --partition=batch          # Partition (queue) name
#SBATCH --nodes=1                  # Number of nodes
#SBATCH --ntasks-per-node=1        # Number of tasks to call on each node
#SBATCH --mem-per-cpu=200mb        # Memory per processor
#SBATCH --time=24:00:00            # Time limit hrs:min:sec
#SBATCH --output=mclz.%j.out       # Standard output log
#SBATCH --error=mclz.%j.err        # Standard error log

JOBDIR=${SLURM_JOBID}
mkdir $JOBDIR
export PATH=./bin:$PATH

time srun ./bin/mclz.sh -n -I C -Z 6 -A He -L TA -N "TA"

cp c6+he* $JOBDIR
```

## Submitting jobs

### sbatch

The command I would use to submit the script above is:

```
sbatch example.sh
```

The above command works assuming I'm in the same directory as example.sh. sbatch can also take arguments, but those can also be placed in the script as SBATCH directives. In example.sh, the lines starting with `#SBATCH` are directives that can also be given as command-line arguments. If you don't want to specify the number of tasks per node in the script, do the following:

```
sbatch --ntasks-per-node=1 example.sh
```

When the job is submitted, sbatch prints the job number it was assigned (463 in this example). This number is available in the $SLURM_JOBID environment variable inside your submit script, which means that in this example `mkdir $JOBDIR` runs as `mkdir 463`.

### squeue

squeue shows the currently running or pending jobs. If you only want to see your own jobs, run:

```
squeue --me
```

### scancel

scancel removes a job from the queue. This can be useful if you need to make changes before the job runs, or if the job needs to be stopped while it's running.

## Output

In addition to any file output that your program generates, Slurm captures the standard output and standard error and writes them to files. By default, Slurm combines the two into a single file called slurm-$JOBID.out, where $JOBID is the ID Slurm assigned to the job. You can specify a different name with the `--output` SBATCH directive; use %j as a placeholder for the job ID. For example:

```
#SBATCH --output=example.%j.out
```

To capture the standard error output in a separate file, use the `--error` SBATCH directive, e.g.

```
#SBATCH --error=example.%j.error
```

The error file contains any errors generated while your script was running. This is useful for debugging your submit script or executable.

Click here for more submit script samples.
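Finally, the `srun` command listed in the table above can also be used on its own to request an interactive session on a compute node. The following is a minimal sketch; the resource values, and the choice of `bash` as the shell, are illustrative rather than site-specific recommendations.

```bash
# Request an interactive shell on one node in the batch partition,
# with one task, 2 GB of memory, and a 1-hour time limit
srun --partition=batch --nodes=1 --ntasks-per-node=1 \
     --mem=2G --time=01:00:00 --pty bash

# Once the shell starts on the compute node, you can load modules and
# run commands interactively; type "exit" to release the allocation
```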