Batch Cluster Usage

When you log into the cluster, you are connecting to the cluster’s head node. Users can only access the compute nodes through the batch scheduling system.

The batch scheduling system allows users to submit job requests using the qsub command. Jobs run when the requested resources become available. The batch job system captures all output to files, and users can request e-mail notifications when jobs start or end.

To submit a batch job, a user must create a short text file called a batch job script. The batch job script contains instructions for running the batch job and the specific commands the batch job should execute. Batch jobs run as a new user session, so the batch job script should include any commands needed to set up the user session and navigate to the location of data files, etc.

Below is a sample batch job script for a job that will use a single CPU core.

#!/bin/bash -l
#PBS -N some_name
#PBS -l nodes=1:ppn=1
#PBS -l walltime=10:00:00
#PBS -q batch
#PBS -m abe
#Comment - batch job setup complete
cd mrbayes
module load mrbayes
mb batch.nex

The big memory nodes can be requested by adding a feature request to the third line above:

#PBS -l nodes=1:ppn=1:BIGMEM

and by replacing the fifth line with:

#PBS -q big_mem_queue

Similarly, to request a GPU node, the same line needs to read:

#PBS -l nodes=1:ppn=1:GPU

To submit a batch job script, execute:

qsub script.job

This prints a unique job identifier of the form <job-number>.<hostname>.
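
Several commands described later take only the numeric part of this identifier. The following is a minimal sketch of how a script could split it off. The function name job_number is hypothetical, and the qsub call is left as a comment because it only works on the head node.

```shell
#!/bin/bash
# Hypothetical helper: print the numeric part of a
# "<job-number>.<hostname>" identifier returned by qsub.
job_number() {
    local job_id="$1"
    # Drop the first dot and everything after it.
    printf '%s\n' "${job_id%%.*}"
}

# Typical use on the head node (needs the batch scheduler):
# job_id=$(qsub script.job)
# qstat -f "$(job_number "$job_id")"
```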

The batch job script is a Linux shell script, so the first line specifies the shell interpreter to use when running the script. 

Note that for the bash shell, the -l option must be used so that the job runs as a login shell.

Lines starting with #PBS are instructions to the batch scheduling system. Note that the #PBS options can be overridden on the qsub command line. For example, qsub -N bar foo.job will override any #PBS -N directives in the foo.job job script and name the job bar.

Common batch job options
-N Name of the batch job:
#PBS -N some_name
The job name is used to name the output files and is also displayed when using qstat to query the job status.
-j Join output and error output into one file:
#PBS -j oe
Use this option to collect all job output in a single file rather than having separate files for standard output and error.
-m Mail options:
#PBS -m abe
Send e-mail to the user when the job aborts (a), begins (b) or ends (e). Any combination of these three is allowed. Without this option, an email will only be sent when a job aborts.
#PBS -m n
No e-mail will be sent.
-M E-mail user list.
#PBS -M user1@example.com,user2@example.com

List of additional e-mail addresses for messages. Note that e-mail is always sent to your address, so it does not need to be specified.
-l (dash lower case L) Resource list. There are two main types of resources, CPUs and time. Multiple #PBS -l lines can be used to request these separately.
#PBS -l walltime=100:00:00
This requests 100 hours of run time for the job.
#PBS -l nodes=2:ppn=8
This requests 2 physical nodes and all 8 processors on each node (ppn=processors per node).
Note that if you do not specify the number of processors, the job defaults to one processor core. The default walltime is 1 hour.
-q Destination - which batch queue to use.
#PBS -q batch
This sends the job to the default batch queue on the Redhawk cluster.
-V Inherit environment settings.
This will cause all environment variables in the Linux session that the job is submitted from to be inherited by the batch job.
-I (dash upper case i) Interactive.
Use "qsub -I" to request an interactive batch job.
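
As an illustration of how these options combine, here is a sketch that prints the directive header of a job script from a few arguments. The function make_job_script and its argument order are hypothetical, and it assumes the batch queue and the oe form of the -j option. The commands the job should run would still need to be appended after the header.

```shell
#!/bin/bash
# Hypothetical helper: print the #PBS header of a batch job script.
# Arguments: job name, walltime in whole hours, number of nodes, cores per node.
make_job_script() {
    local name="$1" hours="$2" nodes="$3" ppn="$4"
    cat <<EOF
#!/bin/bash -l
#PBS -N ${name}
#PBS -l nodes=${nodes}:ppn=${ppn}
#PBS -l walltime=${hours}:00:00
#PBS -q batch
#PBS -j oe
#PBS -m abe
EOF
}

# Example: a 100-hour job on 2 nodes with 8 cores each.
make_job_script big_run 100 2 8
```

Redirecting the output to a file (for example, make_job_script big_run 100 2 8 > big_run.job) gives a starting point that can be edited and then submitted with qsub.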

Checking Job Status

There are several ways to check the status of a job. The qstat command shows all jobs that are currently running or queued. The displayed information includes the state of each job:

  • Q = queued
  • R = running
  • E = exiting
  • C = complete

The qstat command also shows the length of time the job has been running in the format hhh:mm:ss.
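
When many jobs are listed, a per-state count can be easier to read. The sketch below assumes the default qstat layout, with two header lines followed by one line per job and the state letter in the fifth column. The function name count_states is hypothetical, and the column position should be checked against the actual output on the cluster.

```shell
#!/bin/bash
# Hypothetical helper: count jobs per state from qstat output on stdin.
# Assumes two header lines and the state letter (Q, R, E, C) in column 5.
count_states() {
    awk 'NR > 2 { count[$5]++ } END { for (s in count) print s, count[s] }'
}

# Typical use on the head node (needs the batch scheduler):
# qstat | count_states
```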

Details on a specific job can be seen using qstat -f <job-number>, where <job-number> is the numeric portion of the identifier returned by the qsub command.
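
The full listing is long, so a script that only needs the state can filter it. This is a minimal sketch, assuming the listing contains a line of the form "job_state = R". The function name job_state is hypothetical.

```shell
#!/bin/bash
# Hypothetical helper: print the state letter from "qstat -f" output on stdin.
# Assumes a line of the form "    job_state = R" in the listing.
job_state() {
    awk -F' = ' '$1 ~ /job_state/ { print $2; exit }'
}

# Typical use on the head node (needs the batch scheduler):
# qstat -f <job-number> | job_state
```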

A different view of the job queues can be seen using the showq command. This shows separate blocks for active (i.e., running), eligible, and blocked jobs. Waiting jobs are divided into eligible and blocked jobs based on the queue parameters. For example, the queues on Redhawk limit the number of running jobs per user, so queued jobs belonging to a user who already has the maximum number of running jobs are listed as blocked.

HPC Cluster Architecture

Schematic: an EGB lab PC, using NX NoMachine or WinSCP, connects to the cluster head node, where jobs are submitted with qsub. The head node in turn feeds 32 additional machines. The head node and the 32 machines together form the Redhawk cluster.