Using the CSP Computational Clusters
Using the serial cluster queues for single-threaded code
We use the Sun Grid Engine (SGE) for job control. There are a few things you need to know to get started using SGE. Below, we will show you the basic steps of running your code in the queue system as well as where to go for more info.
1. Compile your Code
If you are using Fortran, we provide the Intel Fortran compiler for Linux as well as gfortran. For non-Fortran code, we provide Intel's C/C++ compiler plus gcc/g++. The command for compiling with the Intel Fortran compiler is ifort. For example, to compile a file called hello.f into an executable called hello-x86, do the following:
ifort -o hello-x86 hello.f
Note: You should log in to one of the cluster nodes to compile your code.
2. Write a script for your code.
SGE will not allow you to submit a compiled executable directly. Instead, you must submit a script that points to the executable and provides any other parameters. Below is an example called 'simple.sh' using the code compiled above.
Special Note on running serial jobs from your home directory:
DON'T DO IT!
All CSP home directories are stored on a single file server and are network mounted on all of the cluster machines as well as on users' desktops in the center. This creates an enormous amount of network traffic, since all reads and writes to people's home directories happen over the network. When users run multiple serial jobs from their home directories on our 100+ cluster CPUs, the network file system can become overloaded, which can cause the desktop machines to hang. Additionally, more users are running parallel jobs using MPI, which have to run in their home directories. The easiest way to reduce unnecessary network load is to run serial jobs from local scratch directories on the cluster machines.
On each cluster machine, a scratch directory has been created for you at /scratch/username. In your submit script, you should:
- copy your executable and data files to a directory under your scratch directory,
- run your code, having it write its output to that directory,
- copy the results back to your home directory,
- clean up the scratch directory.
Since most of the cluster machines have multiple processors, you may be running multiple jobs on the same machine. To keep the jobs from trying to write to the same output file in your scratch directory, you should create unique sub-directories in your scratch directory. Since each submitted job is given a unique Job ID, you can use that number in the sub-directory name. As you will see in the example below, the Job ID is available in the shell variable $JOB_ID. The creation of this sub-directory should be done in your submit script.
### Tell queue system to use the directory of the submit script ###
### as the current working directory ###
#$ -cwd
### Tell queue system to submit this job to a specific queue ###
### (currently xeon) ###
#$ -q xeon
### Tell queue system to write its output and error files to the scratch directory ###
### PLEASE NOTE since these output files are created before this script is run, ###
### they must be in /scratch/username since /scratch/username/$JOB_ID doesn't exist yet ###
#$ -o /scratch/jeff/simple.sh.o$JOB_ID
#$ -e /scratch/jeff/simple.sh.e$JOB_ID
### make results directory in your home on hal ###
### (example path; any directory under your home will do) ###
resultdir=$HOME/results/$JOB_ID
mkdir -p $resultdir
### make directory on local machine in scratch ###
mkdir -p /scratch/jeff/$JOB_ID
### copy source file to job directory ###
cp hello.f /scratch/jeff/$JOB_ID
### change current working directory to job directory ###
cd /scratch/jeff/$JOB_ID
### compile code ###
ifort -o hello-x86 hello.f
### execute code ###
./hello-x86
### copy files back to home directory located on hal ###
cp * $resultdir
### clean up scratch directory since space is limited on local machines ###
rm -rf /scratch/jeff/$JOB_ID
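Before submitting anything, it can help to try the copy, run, copy back, clean up pattern on its own. The following is only a sketch of that pattern, not a submit script: the directory names, the SCRATCH_ROOT and RESULT_ROOT variables, and the stand-in "program" (tr) are all made up for illustration, and JOB_ID falls back to a fixed value because it is only set inside a running SGE job.

```shell
#!/bin/sh
# Dry run of the scratch-directory pattern; runs on any machine, no SGE needed.

# JOB_ID only exists inside an SGE job, so use a placeholder otherwise.
job_id="${JOB_ID:-demo}"

# Hypothetical locations. On the cluster, scratch_root would be /scratch/username
# and result_root a directory under your home.
scratch_root="${SCRATCH_ROOT:-${TMPDIR:-/tmp}/scratch-demo}"
result_root="${RESULT_ROOT:-${TMPDIR:-/tmp}/results-demo}"

jobdir="$scratch_root/$job_id"
resultdir="$result_root/$job_id"

# Unique per-job directories, so jobs on the same machine never collide.
mkdir -p "$jobdir" "$resultdir"

# 1. Copy inputs to scratch (here a stand-in input file is created instead).
echo "input data" > "$jobdir/input.txt"

# 2. Run the program inside the job directory (tr stands in for ./hello-x86).
( cd "$jobdir" && tr 'a-z' 'A-Z' < input.txt > output.txt )

# 3. Copy the results back.
cp "$jobdir/output.txt" "$resultdir/"

# 4. Clean up scratch.
rm -rf "$jobdir"
```

Once this behaves the way you expect, the same four steps go into the real submit script with your own paths and executable.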
3. SGE Commands
The command I would use to submit the script above to SGE is:
qsub simple.sh
The above command works assuming I'm in the same directory as simple.sh. qsub can also take arguments on the command line, but those can be placed in the script instead. In simple.sh, the lines starting with '#$' are exactly such command-line arguments. If you don't want to specify a queue in the script (the '#$ -q' line), pass it to qsub directly, for example:
qsub -q opteron simple.sh
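When the job is accepted, the qsub command returns with a job number. This number is assigned to $JOB_ID in your submit script, which means that for job number N in this example, the job runs in /scratch/jeff/N and its output and error files are /scratch/jeff/simple.sh.oN and /scratch/jeff/simple.sh.eN.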
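If you submit from another script and need the job number, you can pull it out of qsub's confirmation message. This is a rough sketch: parse_job_id is a made-up helper name, it assumes the usual "Your job N (...) has been submitted" wording, and the job number 12345 in the sample is invented.

```shell
# Extract the job number from qsub's confirmation message (read on stdin).
# Assumes the message starts with "Your job <number> ".
parse_job_id() {
    sed -n 's/^Your job \([0-9][0-9]*\) .*/\1/p'
}

# Sample message; 12345 is a made-up job number.
sample='Your job 12345 ("simple.sh") has been submitted'
job_id=$(printf '%s\n' "$sample" | parse_job_id)

# The same number is what $JOB_ID expands to inside simple.sh.
echo "job directory: /scratch/jeff/$job_id"
echo "stdout file:   /scratch/jeff/simple.sh.o$job_id"
echo "stderr file:   /scratch/jeff/simple.sh.e$job_id"
```

On the cluster you would replace the sample with the real thing, for example job_id=$(qsub simple.sh | parse_job_id).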
qstat shows the status of the queues. With qstat, you can see what jobs are running in the queue, how many jobs are waiting to run, and the states of the queues themselves. To see if all of the machines are full, use qstat -f. Here's a screenshot with my job waiting to run:
The 'qw' means that my job is queued and in the wait state. As you can see above, jobs that are currently running in the queue are shown with an 'r'.
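With many jobs in the queue, reading the state column by eye gets tedious. Here is a small sketch that tallies jobs by state. count_states is a made-up helper name, and it assumes qstat's default listing (two header lines, with the state in the fifth column), which may not hold for every SGE version or for job names containing spaces.

```shell
# Count jobs per state from qstat output read on stdin.
# Assumption: two header lines, state (r, qw, ...) in column 5.
count_states() {
    awk 'NR > 2 { n[$5]++ } END { for (s in n) print s, n[s] }' | sort
}

# Usage on the cluster (needs SGE, so it is left commented out here):
#   qstat -u "$USER" | count_states
```

Each output line is a state followed by the number of jobs in it, so a line starting with qw tells you how many of your jobs are still waiting.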
qdel removes a job from the queue. This can be useful if you need to make changes before the job is run, or if it needs to be stopped while it's running:
In addition to any file output that your program generates, SGE captures the standard output and standard error and puts them into files. As stated above, these files start with the name of your submit script and append .eN or .oN, where N is the job number SGE gave your job and .e and .o stand for standard error and standard output, respectively. The .e file contains any errors generated while your script ran, which makes it useful for debugging your submit script.
For example, I didn't have the executable where it was supposed to be. This was evident from the error file:
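A quick way to check a finished job is to see whether its error file is non-empty. The helper below is only a sketch: check_errors is a made-up name, and the path in the usage comment follows the -e setting from simple.sh with a made-up job number.

```shell
# Print a job's error file if it has any content.
# Returns 0 when the file is empty or missing, 1 when errors were found.
check_errors() {
    errfile=$1
    if [ -s "$errfile" ]; then
        echo "errors found in $errfile:"
        cat "$errfile"
        return 1
    fi
    echo "no errors in $errfile"
    return 0
}

# Usage on the cluster (12345 is a made-up job number):
#   check_errors /scratch/jeff/simple.sh.e12345
```

Keep in mind that the error file lives in /scratch on whichever machine ran the job, so you need to look for it there.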
Up next: Running parallel/mpi jobs on the cluster.