Running batch jobs on matrix

Matrix

Matrix is a cluster of 8 dual-CPU nodes dedicated to CDF. Six of the subnodes can be logged into by typing 'ssh node1', 'ssh node3' ... 'ssh node7' (node2 does not seem to work). The status of the subnodes can be checked by typing 'pbsnodes -a'.

The cluster has two 1.8TB disk arrays available under /data1 and /data2. The free capacity of the disks can be checked by typing 'df' (disk free).

Batch jobs with pbs

Jobs which require a long time to run (typically more than 1 CPU hour) are best run as batch jobs. Matrix uses pbs (portable batch system) for submitting batch jobs.

Prepare a shell script that executes your job, for example prime.sh (all examples can be found on matrix:~oldeman/pbs). Type 'qsub prime.sh' to submit the job. A line of the form 'xxxxx.matrix.ph.liv.ac.uk' appears, where xxxxx is the job number. Type 'qstat' to check the status of the job. When the job has finished, two files are produced: prime.sh.oxxxxx contains the output of the job, and prime.sh.exxxxx contains its error messages.

(This example script runs for about 30 seconds and calculates the 200th prime number.)
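For readers without access to matrix:~oldeman/pbs, the following is a minimal sketch of what a script like prime.sh could look like. It is not the original script: the function name nth_prime, the variable N and the trial-division method are assumptions made for illustration, and it will finish much faster than the original.

```shell
#!/bin/bash
# Hypothetical stand-in for prime.sh; the real script is in ~oldeman/pbs.

N=200   # which prime to compute

# Print the n-th prime number, found by simple trial division.
nth_prime() {
    local target=$1 count=0 candidate=1 divisor is_prime
    while [ "$count" -lt "$target" ]; do
        candidate=$((candidate + 1))
        is_prime=1
        divisor=2
        # Only divisors up to sqrt(candidate) need to be tried.
        while [ $((divisor * divisor)) -le "$candidate" ]; do
            if [ $((candidate % divisor)) -eq 0 ]; then
                is_prime=0
                break
            fi
            divisor=$((divisor + 1))
        done
        if [ "$is_prime" -eq 1 ]; then
            count=$((count + 1))
        fi
    done
    echo "$candidate"
}

echo "Prime number $N is $(nth_prime "$N")"
```

Anything the script writes to standard output, such as the final echo, ends up in the job's output file.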

Note:

Command line arguments cannot be passed on to the script.
By default, the job starts in your home directory.
Make the script executable by typing 'chmod +x prime.sh'
Make sure to specify the shell on the first line of the script, for example by putting '#!/bin/bash' as the first line (the examples here use the bash shell).
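The submission steps and notes above can be collected in a small helper script. This is only a sketch: the function name job_output_files is made up for illustration, and the qsub part is skipped when PBS or prime.sh is not present, so the script is harmless to run elsewhere.

```shell
#!/bin/bash
# Sketch of the submit-and-check cycle for prime.sh.

# Print the names of the output and error files PBS creates for a job.
# $1: script name, $2: job identifier as printed by qsub
job_output_files() {
    local script=$1 job_id=$2
    local job_number=${job_id%%.*}   # keep only the part before the first dot
    echo "$script.o$job_number $script.e$job_number"
}

# Submit only where PBS is installed and the script exists (i.e. on matrix).
if command -v qsub >/dev/null 2>&1 && [ -f prime.sh ]; then
    chmod +x prime.sh                # the script must be executable
    job_id=$(qsub prime.sh)          # qsub prints the job identifier
    qstat                            # show the status of the queue
    echo "Results will appear in: $(job_output_files prime.sh "$job_id")"
fi
```

For a made-up job identifier 12345.matrix.ph.liv.ac.uk, job_output_files prints 'prime.sh.o12345 prime.sh.e12345'.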

Submitting multiple jobs

The main advantage of running batch jobs on the cluster is that you can run multiple jobs (up to 12) in parallel. Instead of submitting each job by hand, you can write a script that launches multiple jobs. Of the many ways to do that, I find the most straightforward is to write a launch_prime.sh script that uses sed to make modified copies of the original script and then submits them.
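A launch script along these lines might look as follows. This is a sketch and not the original launch_prime.sh: it assumes that the template script sets its parameter on a line of the form 'N=200', and the helper name make_job_copy, the copy names prime_<n>.sh and the values 100, 200 and 300 are chosen for illustration only.

```shell
#!/bin/bash
# Hypothetical sketch of a launch script; the original is in ~oldeman/pbs.
# Assumes the template contains a line of the form "N=200".

template=prime.sh

# Write a copy of the template in which the value of N is replaced.
# $1: template script, $2: new value of N, $3: name of the copy
make_job_copy() {
    local template=$1 n=$2 copy=$3
    sed "s/^N=.*/N=$n/" "$template" > "$copy"
    chmod +x "$copy"
}

if [ -f "$template" ]; then
    for n in 100 200 300; do
        copy="prime_$n.sh"
        make_job_copy "$template" "$n" "$copy"
        # Submit only where PBS is installed.
        if command -v qsub >/dev/null 2>&1; then
            qsub "$copy"
        fi
    done
fi
```

Because each copy has its own name, the output and error files of the jobs are also kept apart.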

Running CDF software in multiple batch jobs

CDF software is always a bit more complicated to run than normal jobs. Things to take into account:

The script should start by setting up the CDF software environment (using 'source ~cdfsoft/cdf2.shrc; setup cdfsoft2 5.3.3').
The script should change to a suitable subdirectory (by default the job starts in your home directory).
When running multiple MC production jobs the random numbers must be independent between jobs.
If (temporary) output files are generated with the same name in each job, each job must run in a separate subdirectory to avoid interference.

An example script that produces 1000 inclusive B decays (generator-level only) is Bincl.sh; the corresponding launch script is launch_Bincl.sh.
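The points listed above can be sketched as a per-job wrapper script. This is not the original Bincl.sh: the variables JOB, BASE_SEED and WORKDIR, the helper functions and the directory layout are all assumptions, and how the seed is handed to the generator is left open because it depends on the CDF job configuration. Giving every job a different seed is the simplest approach; whether that is sufficient depends on the random number generator used.

```shell
#!/bin/bash
# Hypothetical per-job wrapper for CDF jobs; not the original Bincl.sh.

JOB=1                      # job number; a launch script could replace this with sed
BASE_SEED=1000             # hypothetical base value for the random seed
WORKDIR=$HOME/Bincl_jobs   # hypothetical parent directory for all jobs

# Create a private directory for job number $2 under $1 and print its name.
make_job_dir() {
    local dir="$1/job_$2"
    mkdir -p "$dir" && echo "$dir"
}

# Give every job a different seed by adding the job number to a base seed.
job_seed() {
    echo $(( $1 + $2 ))
}

# Do the real work only where the CDF software is installed.
if [ -f ~cdfsoft/cdf2.shrc ]; then
    source ~cdfsoft/cdf2.shrc
    setup cdfsoft2 5.3.3
    # Run in a separate subdirectory so that output files do not interfere.
    cd "$(make_job_dir "$WORKDIR" "$JOB")" || exit 1
    seed=$(job_seed "$BASE_SEED" "$JOB")
    echo "Job $JOB running in $PWD with seed $seed"
    # Start the MC production here, passing $seed to the generator.
fi
```

A launch script can then use sed to give each copy its own JOB value, in the same way as for the prime number example.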

Last updated on 25/05/05, Rolf Oldeman.