Running Jobs on the CSP Cluster with the Slurm Workload Manager
In Fall 2022, we refreshed the CSP cluster by adding new nodes, removing old low-memory nodes, and switching to a new OS and job manager/scheduler. The new cluster uses the Slurm Workload Manager, which is also what the GACRC uses. There are a few things you need to know to get started with Slurm. Below, we show the basic steps of running your code through the queue system, as well as where to go for more information.
Get a cluster account
Cluster accounts are no longer linked to departmental accounts, which means that your cluster home directory is not the same as your department home directory.
Cluster Access
Only the login/compile node, csp1.csp.uga.edu, is accessible from outside the cluster. You can log in via ssh to csp1.csp.uga.edu from on campus, or from off campus if you are connected to the UGA VPN. Transferring files to the cluster via rsync, scp, or sftp must also be done through this node. (Note: the terminal prompt will display the host name as node0, which is its hostname on the internal cluster network.)
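For example, assuming your cluster username is MyID (a placeholder, as are the file and directory names here), you could connect and transfer files like this:
ssh MyID@csp1.csp.uga.edu                            # log in to the login/compile node
scp results.tar.gz MyID@csp1.csp.uga.edu:~/          # copy a single file to your cluster home directory
rsync -av project/ MyID@csp1.csp.uga.edu:~/project/  # sync a whole directory to the cluster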
LMOD Modules System
Modules are used to set up your environment for the different software packages you need to run your scripts. They handle all of the required paths and environment variables associated with the software, so different versions can be switched quickly and easily using the module commands. The most common use case is switching from the GNU compilers and OpenMPI to the Intel compilers and IMPI.
Default Modules
A set of modules is loaded automatically at login to set up the default environment. This includes the GNU compiler suite (gcc, g++, gfortran, etc.), currently version 9.4.0, and the OpenMPI 4.1.1 runtime, which provides mpiCC, mpic++, mpicc, mpicxx, mpif77, mpif90, mpifort, and mpirun built with the GNU compilers.
The default modules loaded on login via the ohpc module
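To check exactly which modules are loaded in your current session, you can run:
module list    # or use the shortcut: ml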
Available Modules
To see what modules are available, use the `module avail` command or the `ml av` shortcut.
The `module spider` command shows detailed info about software versions.
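For example, to see which versions of the Intel compiler suite are available (the versions listed depend on what is installed on the cluster), you could run:
module spider intel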
Details for the intel module
Changing Loaded Modules
If you want to use the Intel compiler suite instead of the GNU compilers, you will need to unload the GNU modules and load the Intel modules. There are a couple of ways to do that, as shown below.
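For example, you can unload the default GNU/OpenMPI modules and load the Intel ones instead. The module names shown here (gnu9, openmpi4, intel, impi) are an assumption based on the usual OpenHPC naming; check `module avail` for the exact names on the cluster.
module unload openmpi4 gnu9   # remove the default GNU compilers and OpenMPI runtime
module load intel impi        # load the Intel compilers and Intel MPI
LMOD also provides `module swap <old> <new>` to replace one loaded module with another in a single step.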
Common Slurm Commands
The most commonly used Slurm commands are listed below.
squeue
squeue -u <username>
Show information about your job(s) in the queue. When run without the -u flag, the command shows your job(s) along with all other jobs in the queue.
srun
srun <resource-parameters>
Run jobs interactively on the cluster (see the example after this list).
skill/scancel
scancel <job-id>
End or cancel a queued job.
sacct
sacct
Show information about current and previous jobs.
sinfo
sinfo
Get information about the resources on available nodes that make up the HPC cluster.
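As a quick illustration of srun, the following requests an interactive shell on a compute node. This is only a sketch, so adjust the partition, task count, and time limit to fit your work:
srun --partition=batch --ntasks=1 --time=1:00:00 --pty bash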
Example Job Script
#!/bin/bash
#SBATCH --job-name=mclz # Job name
#SBATCH --partition=batch # Partition (queue) name
#SBATCH --nodes=1 # Number of nodes
#SBATCH --ntasks-per-node=1 # Number of tasks to run on each node
#SBATCH --mem-per-cpu=200mb # Memory per processor
#SBATCH --time=24:00:00 # Time limit hrs:min:sec
#SBATCH --output=mclz.%j.out # Standard output log
#SBATCH --error=mclz.%j.err # Standard error log
JOBDIR=${SLURM_JOBID}                 # Use the Slurm job ID as the name of the output directory
mkdir $JOBDIR                         # Create the output directory
export PATH=./bin:$PATH               # Add the local bin directory to the PATH
time srun ./bin/mclz.sh -n -I C -Z 6 -A He -L TA -N "TA"   # Launch the job and report its run time
cp c6+he* $JOBDIR                     # Copy the results into the job directory
Submitting jobs
sbatch
The command I would use to submit the script above is:
sbatch example.sh
The above command works assuming I'm in the same directory as example.sh. sbatch can also take command-line arguments; in example.sh, the lines starting with '#SBATCH' are directives that can instead be given on the command line. For example, if you don't want to specify the number of tasks per node in the script, you can do the following:
sbatch --ntasks-per-node=1 example.sh
When sbatch accepts a job, it returns the assigned job number. This number is available as $SLURM_JOBID in your submit script, which means that in this example:
mkdir $JOBDIR
runs as:
mkdir 463
squeue
squeue shows the currently running and pending jobs. If you only want to see your own jobs, run squeue --me.
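For example:
squeue           # all running and pending jobs
squeue --me      # only your jobs
squeue -j 463    # only the job with ID 463 (the job number from the earlier example)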
scancel
scancel removes a job from the queue. This can be useful if you need to make changes before the job runs, or if the job needs to be stopped while it is running.
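For example, to cancel job 463 from the earlier example, or all of your own jobs at once (replace MyID with your username):
scancel 463       # cancel a single job by its job ID
scancel -u MyID   # cancel all jobs belonging to user MyID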
Output
In addition to any file output that your program generates, Slurm captures the standard output and standard error and writes them to files. By default, Slurm combines the two into a single file named slurm-$JOBID.out, where $JOBID is the ID Slurm assigned to the job.
You can specify a different name with the --output SBATCH directive, using %j as a placeholder for the job ID. For example:
#SBATCH --output=example.%j.out
To capture the standard error in a separate file, use the --error SBATCH directive, e.g.:
#SBATCH --error=example.%j.error
The error file contains any errors generated while your script was running, which is useful for debugging your submit script or executable.
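As a concrete illustration, with the --output and --error directives above and the job number 463 from the earlier example, Slurm would write the logs to:
example.463.out      # standard output
example.463.error    # standard error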