Directions for using Torque on the NMSU CS Bigdat cluster
written by Jonathan Cook, May 29, 2014, joncook@nmsu.edu

Torque is the job submission and execution system that our cluster uses. Think of it as a batch operating system. You submit a job (a program to run), and it queues it up, selects nodes/resources to allocate to it when they are available, runs the job, and cleans up after it.

Shortcut demo:

  1. Copy the files in /nmsu/examples/hello-mpi/ to a directory of your own
  2. Look at the Makefile and make sure the required modules are loaded
  3. Build by running make
  4. Do "qsub sub-hello.sh" to submit and run the job
  5. Do "qstat" until the job no longer appears as active
  6. Look at the output files whose names end in .o### (stdout) and .e### (stderr)
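
Put together, the demo is roughly the following command sequence (the working directory, job ID, and output file name here are made up for illustration):

    mkdir ~/hello-mpi && cd ~/hello-mpi
    cp /nmsu/examples/hello-mpi/* .
    make                      # build the example
    qsub sub-hello.sh         # prints the new job ID
    qstat                     # repeat until the job is no longer listed
    cat sub-hello.sh.o1234    # stdout; the matching .e1234 file holds stderr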

**DO NOT** log in to individual compute nodes and directly run programs on them! Yes, you CAN log in (using ssh) to the compute nodes from the head node, but use this only to check on your jobs as they are running or for other inspection tasks (e.g., you can run "nvidia-smi" on a GPU compute node to see info about the NVIDIA card). IF YOU NEED to interactively run some compute-intensive programs, use a Torque interactive session to do this (see below).

**DO NOT** run "mpirun" on the head node (bigdat.nmsu.edu) with a hostlist file that lists compute nodes! If you need small testing runs while developing, you can use "mpirun" on the head node to run purely locally -- the head node has 16 cores (32 hyperthreads), so it can handle a few MPI processes -- but only do this for short test runs, not for actual computation runs.
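
For example, a short local test on the head node might look like this (the program name is a placeholder for your own binary):

    mpirun -np 4 ./myMpiProgram    # 4 processes, all on the head node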


The Torque commands all begin with "q", and options in submission scripts and environment variables are identified with "PBS" (because Torque is a derivative of the original PBS job system).

The most common commands you will use are:
Command              Description
-------              -----------
qsub <job-script>    submit a job that is embodied in the named script
qstat                show the status of your queued and running jobs
qstat -f <jobid>     show detailed information on an existing job ID
qdel <jobid>         delete a queued job and/or kill a running job
qpeek <jobid>        look at the latest stdout output of a running job
qsub -I -l <res>     start an interactive job with the listed resources


QSUB (noninteractive)

The "qsub" command is the basic job execution command. You need to provide a job script file that does the work of running your job. This is a real shell script -- it begins with "#!/bin/bash" or any other shell you prefer to use. Special comment lines beginning with "#PBS" are used to set options for Torque, most commonly to specify the resources needed. See the example in /nmsu/examples/hello-mpi/

Resource Specification

For both the batch and interactive usage of the "-l" (dash-ell) resource specification option, the format of the options is:

      nodes=#:ppn=#:resname

where the first # is the number of cluster nodes you want to reserve (up to 10), the second # is the number of cores/threads you want per node (up to 48), and the "resname" is the resource name you want to allocate the nodes from. We have the following resources:
Resource     Description
--------     -----------
nodes        all cluster nodes
gpu-all      all the nodes with GPUs on them
gpu-himem    the nodes with GPUs that are memory-expandable
gpu          the nodes with GPUs that are not memory-expandable
fpga-all     all the nodes with FPGAs on them (but not installed yet)
fpga-himem   the nodes with FPGAs that are memory-expandable
fpga         the nodes with FPGAs that are not memory-expandable
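
For example (the counts and resource names here are just illustrations):

    #PBS -l nodes=2:ppn=48:gpu-all    # two full GPU nodes, no sharing
    #PBS -l nodes=1:ppn=8:nodes       # 8 cores on any node (may be shared)
    qsub -I -l nodes=1:ppn=48:gpu     # interactive job, one full GPU node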

The IMPORTANT thing to know is that Torque tracks resource allocation per core/thread, so if you do not request ppn=48, Torque may place another job on the same node as yours (if that job also does not ask for ppn=48). So if you want to make sure that no other jobs share your nodes (e.g., for consistent performance evaluation data for a paper), make sure you use ppn=48. BUT SEE the next paragraph for MPI.

If you are running an MPI program, mpirun will automatically create one MPI process for each allocated core/thread across all nodes (i.e., nodes x ppn processes). This is because Torque automatically creates a host file for mpirun to use, and it repeats each node name once per unit of the ppn value. If you are using MPI+OpenMP (or some other local threading capability) and want only one MPI process per node while still allocating the entire node, you need to do something like the following in your job script:

# collapse the generated host file to one entry per unique node name
uniq ${PBS_NODEFILE} > ${PBS_O_WORKDIR}/tmphostfile
mpirun -hostfile ${PBS_O_WORKDIR}/tmphostfile myMpiProgram ...

This takes the generated host file ${PBS_NODEFILE} and creates a file with just one entry per unique node name (uniq suffices because Torque lists each node's entries consecutively). You may also need to give mpirun an explicit "-np #" option.
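
For example, a sketch that derives the process count from the host file (myMpiProgram is again a placeholder):

    # one MPI process per allocated node
    NP=$(uniq ${PBS_NODEFILE} | wc -l)
    mpirun -np ${NP} -hostfile ${PBS_O_WORKDIR}/tmphostfile myMpiProgram ...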


QSUB Interactive

If you need to use an interactive session (e.g., you want to run your GPU program or other program by hand), you can use "qsub -I". This will create and allocate a job, and then open up a shell on the first node in your job allocation. You end the session/job by typing "exit" in the shell.

If you just need one node, use something like "qsub -I -l nodes=1:ppn=48:gpu-all". Using "ppn=48" ensures that no one else will be allocated on your node. The "gpu-all" ensures that your node will be one of the nodes with a GPU on it.
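
A session might look roughly like this (the job ID, node name, and program name are made up):

    [you@bigdat ~]$ qsub -I -l nodes=1:ppn=48:gpu-all
    qsub: waiting for job 1234.bigdat.nmsu.edu to start
    qsub: job 1234.bigdat.nmsu.edu ready
    [you@gpunode1 ~]$ nvidia-smi       # inspect the GPU on your node
    [you@gpunode1 ~]$ ./myGpuProgram   # run your program by hand
    [you@gpunode1 ~]$ exit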

You CAN ask for more nodes in an interactive session, but such a use is more complicated, and I would certainly recommend against running MPI programs this way! It might be useful for debugging and testing, though.

Be SURE to "exit" from the interactive session!