Using the CSP Queuing System


In the Center for Simulational Physics, we use the Sun Grid Engine (SGE) for job control. There are a few things you need to know to get started using SGE. Below, we will show you the basic steps of running your code in the queue system as well as where to go for more info.


1. Compile your Code

Right now, our production cluster consists of 9 machines running a total of 28 Opteron CPUs. If you are using Fortran, we provide the Intel Fortran compiler for Linux. The command for compiling with it is ifort. For example, to compile a file called hello.f into an executable called hello-x86, do the following:

ifort -o hello-x86 hello.f

2. Write a script for your code.

SGE will not allow you to submit a compiled executable to it. Instead, you must submit a script that points to the executable and supplies any other parameters. Below is an example called 'simple.sh' that uses the code compiled above.

simple.sh

#$ -cwd
#$ -q opteron
~/hello-x86
sleep 20

simple.sh explanation

Line 1: This command tells SGE to run the job from the current working directory, which is also where it writes its output files.

Line 2: This command tells SGE to submit this job to the opteron queue.

Line 3: This line tells SGE to run the executable hello-x86 from my home directory. If you submit your job from the directory that contains your executable and you have used the -cwd flag, you do not need to include the path to your executable. However, it is a good habit to do so.

Line 4: This line tells the script to sleep for 20 seconds. It is only there because my executable runs incredibly fast. By having the script sleep for 20 seconds, I can see the job's status in the queuing system. You will not need this line.

This is the simplest of submit scripts. You can also use the script to clean up your scratch data, move output to other locations, etc.
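As a rough sketch of the cleanup and output-moving mentioned above, a submit script could define a helper like the one below. The function name run_in_scratch, the scratch location under TMPDIR (or /tmp), and the idea of a separate results directory are all assumptions made for illustration. It also assumes your program writes its files into the directory it is run from.

```shell
#!/bin/sh
#$ -cwd
#$ -q opteron

# run_in_scratch PROGRAM RESULTS_DIR   (hypothetical helper, not part of SGE)
# Runs PROGRAM inside a fresh scratch directory, copies whatever it wrote
# to RESULTS_DIR, and removes the scratch directory afterwards.
run_in_scratch() {
    program=$1
    results=$2

    # mktemp -d creates a private directory, so parallel jobs do not collide.
    scratch=$(mktemp -d "${TMPDIR:-/tmp}/job.XXXXXX") || return 1
    mkdir -p "$results" || return 1

    # Run in a subshell so the cd does not affect the rest of the script.
    ( cd "$scratch" && "$program" )
    status=$?

    # Keep the output, then clean up the scratch data.
    cp -R "$scratch"/. "$results"/
    rm -rf "$scratch"
    return "$status"
}
```

The last line of the submit script would then be something like run_in_scratch "$HOME/hello-x86" "$HOME/results", with both paths adjusted to your own layout.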

3. SGE Commands

qsub

The command I would use to submit the script above to SGE is:

qsub simple.sh

The above command works assuming I'm in the same directory as simple.sh. qsub can also take arguments, but those can also be placed in the script. In simple.sh, the two lines starting with '#$' are command-line arguments. Were they not in the script I would do the following:

qsub -cwd -q opteron simple.sh

When the job is submitted, the qsub command returns a job number, 808 in this example. SGE will then create two files in the current working directory called simple.sh.o808 and simple.sh.e808. The o and e in the file names stand for output and error, respectively. The standard output from the script is in the .o808 file and the standard error from the script is in the .e808 file.
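If you want a script to pick up the job number and work out the two file names, something like the following could do it. This is only a sketch: it assumes qsub prints a confirmation of the form Your job 808 ("simple.sh") has been submitted, so check the message your installation actually prints. The function names job_id_from_qsub and job_files are made up for illustration.

```shell
# job_id_from_qsub: read qsub's confirmation message on standard input
# and print the job number.
# Assumes the message looks like:
#   Your job 808 ("simple.sh") has been submitted
job_id_from_qsub() {
    sed -n 's/^Your job \([0-9][0-9]*\) .*/\1/p'
}

# job_files SCRIPT JOBID: print the names of the output and error files
# that SGE creates for the job.
job_files() {
    echo "$1.o$2 $1.e$2"
}
```

For example, id=$(qsub simple.sh | job_id_from_qsub) stores the job number, and job_files simple.sh "$id" prints the two file names to look for.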

qstat

qstat shows the status of the queues. With qstat, you can see what jobs are running in the queue, how many jobs are waiting to run, and the states of the queues themselves. To see if all of the machines are full, use qstat -f.

A job that is queued and in the wait state is shown with a 'qw'. Jobs that are currently running in the queue are shown with an 'r'.
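To get a quick count of how many jobs are in a given state, you could filter qstat's listing with a small helper. The sketch below assumes the default job listing has two header lines and the state in the fifth column; verify this against the output on your system. The name count_jobs is hypothetical.

```shell
# count_jobs STATE: read qstat's job listing on standard input and print
# how many jobs are in STATE (for example qw or r).
# Assumes two header lines and the job state in the fifth column.
count_jobs() {
    awk -v state="$1" 'NR > 2 && $5 == state { n++ } END { print n + 0 }'
}
```

Typical use would be qstat | count_jobs qw to count waiting jobs, or qstat | count_jobs r to count running ones.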

qdel

qdel removes a job from the queue. This can be useful if you need to make changes before the job is run, or if it needs to be stopped while it's running.
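qdel takes the job number that qsub reported, so for job 808 the command would be qdel 808. The wrapper below is only a sketch (cancel_job is a made-up name): it passes a job number through to qdel and refuses anything that is not a plain number, so that a typo never reaches the queue.

```shell
# cancel_job JOBID: remove one job from the queue with qdel.
# Rejects empty or non-numeric arguments before calling qdel.
cancel_job() {
    case "$1" in
        ''|*[!0-9]*)
            echo "usage: cancel_job JOBID (a number such as 808)" >&2
            return 1
            ;;
    esac
    qdel "$1"
}
```

Restricting the argument to a number is a choice made for this example; calling qdel directly works just as well.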

4. Output

In addition to any file output that your program generates, SGE captures the standard output and standard error and puts them into files. As stated above, these files start with the name of your submit script and append .o<jobnumber> or .e<jobnumber>, where <jobnumber> is the job number SGE gave your job and .o and .e stand for standard output and standard error, respectively. The .e file contains any errors generated while your script ran, which makes it useful for debugging your submit script.
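A quick way to check whether a finished job reported anything on standard error is to test whether its .e file is empty. The helper below is a sketch with a made-up name, show_errors, and it assumes you run it from the directory that holds the job's files (the submission directory when -cwd is used).

```shell
# show_errors SCRIPT JOBID: report on the job's error file.
# Returns 0 if the file exists and is empty, 1 if it has content,
# and 2 if the file cannot be found in the current directory.
show_errors() {
    errfile="$1.e$2"
    if [ ! -e "$errfile" ]; then
        echo "Cannot find $errfile" >&2
        return 2
    fi
    if [ -s "$errfile" ]; then
        echo "Errors in $errfile:"
        cat "$errfile"
        return 1
    fi
    echo "No errors recorded in $errfile"
}
```

For the job above, show_errors simple.sh 808 would look at simple.sh.e808.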

For example, I didn't have the executable where it was supposed to be. This was evident from the error file, and the problem was corrected once the executable was in the right place.

More information.

The Sun Grid Engine website has more information. Here's a link to the basic usage document: Basic Usage