Torque is the job submission and execution system that our cluster uses. Think of it as a batch operating system. You submit a job (a program to run), and it queues it up, selects nodes/resources to allocate to it when they are available, runs the job, and cleans up after it.
**DO NOT** login to individual compute nodes and directly run programs on them! Yes, you CAN login (using ssh) to the compute nodes from the head node, but use this only to check on your jobs as they are running or for other inspection tasks (e.g., you can run "nvidia-smi" on a GPU compute node to see info about the NVidia card). IF YOU NEED to interactively run some compute-intensive programs, use a Torque interactive session to do this (see below).
**DO NOT** run "mpirun" on the head node (bigdat.nmsu.edu) with a hostlist file that lists compute nodes! If you need to do small testing runs while developing, you can use "mpirun" on the head node and just run locally -- the head node has 16 cores (32 hyperthreads) and so you can run a few MPI processes locally, but only do this for short test runs, not for actual computation runs.
The Torque commands all begin with "q", and options in submission scripts and environment variables are identified with "PBS" (because Torque is derived from the original PBS job system).
The most common commands you will use are:
Command | Description |
---|---|
qsub <job-script> | submit a job that is embodied in the named script |
qstat | show the status of your queued and running jobs |
qstat -f <jobid> | show detailed information on an existing job id # |
qdel <jobid> | delete a queued job and/or kill a running job |
qpeek <jobid> | look at the latest stdout output of a running job |
qsub -I -l <res> | start an interactive job with resources listed |
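To illustrate "qsub &lt;job-script&gt;", here is a minimal job script sketch. The job name "demo", the resource line, and the echo are placeholders, not a prescribed template; adjust them for your own job:

```shell
#!/bin/bash
# Minimal example job script (a sketch -- names and resources are
# placeholders). "#PBS" lines are directives read by qsub, and are
# ordinary comments as far as bash is concerned.
#PBS -N demo                      # job name shown by qstat
#PBS -l nodes=1:ppn=48:gpu-all    # one whole GPU node

# Torque starts the job in your home directory; move to the directory
# the job was submitted from (":-." lets this also run outside Torque).
cd ${PBS_O_WORKDIR:-.}

echo "Job running on $(hostname)"
```

Submit it with "qsub demo.sh"; when the job completes, Torque writes the job's stdout and stderr back as "demo.o&lt;jobid&gt;" and "demo.e&lt;jobid&gt;" files.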
For both the batch and interactive usage of the "-l" (dash-ell) resource specification option, the format of the options is:
nodes=#:ppn=#:resname
where the first # is the number of cluster nodes you want to reserve (up to 10), the second # is the number of cores/threads you want per node (up to 48), and the "resname" is the resource name you want to allocate the nodes from. We have the following resources:
Resource | Description |
---|---|
nodes | all cluster nodes |
gpu-all | all the nodes with GPUs on them |
gpu-himem | the nodes with GPUs that are memory-expandable |
gpu | the nodes with GPUs that are not memory expandable |
fpga-all | all the nodes with FPGAs on them (not yet installed) |
fpga-himem | the nodes with FPGAs that are memory-expandable |
fpga | the nodes with FPGAs that are not memory expandable |
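Putting the pieces together, a hypothetical request for two whole GPU nodes would look like this (the script name "myjob.sh" is a placeholder):

```shell
# Reserve 2 whole nodes (all 48 threads each) from the "gpu" resource
# for a batch job:
qsub -l nodes=2:ppn=48:gpu myjob.sh

# The same reservation for an interactive session:
qsub -I -l nodes=2:ppn=48:gpu
```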
The IMPORTANT thing to know is that Torque tracks resource allocation per core/thread, so if you do not request ppn=48, Torque may place another job on the same node as yours (if that job also asks for fewer than 48 cores). If you want to guarantee that no other jobs share your nodes (e.g., to collect consistent performance data for a paper), request ppn=48. BUT SEE the next paragraph for MPI.
If you are running an MPI program, mpirun will automatically create one MPI process for each core/thread on all of your nodes (i.e., nodes x ppn processes in total). If you are using MPI+OpenMP (or some other local threading capability) and want only one MPI process per node, you'll need to do something special. This is because Torque automatically creates a host file for mpirun to use, and it repeats each node name ppn times. If you want only one MPI process per node but still need to allocate the entire node, do something like the following in your job script:

```shell
uniq ${PBS_NODEFILE} > ${PBS_O_WORKDIR}/tmphostfile
mpirun -hostfile ${PBS_O_WORKDIR}/tmphostfile myMpiProgram ...
```
This takes the generated host file ${PBS_NODEFILE} and creates a file with just one entry per unique node name. You may also need to pass an explicit "-np #" option to mpirun.
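Outside a real job, the uniq step can be sketched with a fake node file (the node names and the nodes=2:ppn=2 shape are made up for the demo; inside a job, Torque sets ${PBS_NODEFILE} for you):

```shell
#!/bin/bash
# Simulate the host file Torque would generate for nodes=2:ppn=2
# (hypothetical node names -- in a real job, use ${PBS_NODEFILE}).
nodefile=$(mktemp)
printf 'node01\nnode01\nnode02\nnode02\n' > "${nodefile}"

# Collapse the adjacent repeats: one line per node, suitable for
# launching one MPI process per node. (Torque lists each node's
# repeats together, so uniq is sufficient here.)
uniq "${nodefile}" > tmphostfile
cat tmphostfile

# In the job script you would then run something like:
#   mpirun -hostfile tmphostfile -np 2 myMpiProgram ...
rm -f "${nodefile}"
```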
If you need to use an interactive session (e.g., you want to run your GPU program or other program by hand), you can use "qsub -I". This will create and allocate a job, and then open up a shell on the first node in your job allocation. You end the session/job by typing "exit" in the shell.
If you just need one node, use something like "qsub -I -l nodes=1:ppn=48:gpu-all". Using "ppn=48" ensures that no one else will be allocated on your node, and "gpu-all" ensures that your node will be one of the nodes with a GPU on it.
You CAN ask for more nodes in an interactive session, but such a use is more complicated, and I would certainly recommend against running MPI programs this way! It might be useful for debugging and testing, though.
Be SURE to "exit" from the interactive session!