Running batch jobs on matrix

Matrix

Matrix is a cluster of 8 dual-CPU machines dedicated to CDF. Six subnodes can be logged into by typing 'ssh node1', 'ssh node3' ... 'ssh node7' (node2 does not seem to work). The status of the subnodes can be checked by typing 'pbsnodes -a'.

The cluster has two 1.8TB disk arrays available under /data1 and /data2. The free capacity of the disks can be checked by typing 'df' (disk free).
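Before writing large job output it can be handy to check the free space from a script rather than by eye. The following is a minimal sketch, not part of the original examples; the helper name free_kb is made up for illustration, and it assumes a POSIX df and awk.

```shell
#!/bin/sh
# Print the free space, in kilobytes, of the file system holding directory $1.
# free_kb is a hypothetical helper name, not something installed on matrix.
free_kb() {
    # -P selects the portable one-line-per-file-system format,
    # -k reports sizes in 1024-byte blocks;
    # field 4 of the second output line is the available space.
    df -Pk "$1" | awk 'NR == 2 { print $4 }'
}

# Report on the two disk arrays if they are mounted on this machine.
for d in /data1 /data2; do
    if [ -d "$d" ]; then
        echo "$d: $(free_kb "$d") kB free"
    fi
done
```

For a quick interactive check, 'df -k /data1 /data2' gives the same numbers.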

Batch jobs with PBS

Jobs that take a long time to run (typically more than 1 CPU hour) are best run as batch jobs. Matrix uses PBS (Portable Batch System) for submitting batch jobs.

Prepare a shell script that executes your job, for example prime.sh (all examples can be found on matrix:~oldeman/pbs). Type 'qsub prime.sh' to submit the job. A line of the form 'xxxxx.matrix.ph.liv.ac.uk' appears, where xxxxx is the job number. Type 'qstat' to check the status of the job. When the job is finished, two files are produced: prime.sh.oxxxxx contains the output of the job, and prime.sh.exxxxx contains the error messages of the job.

(This example script runs for about 30 seconds and calculates the 200th prime number.)
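For readers without access to matrix:~oldeman/pbs, here is a minimal sketch of what such a job script could look like. It is not the original prime.sh: the function name nth_prime and the walltime request are assumptions made for illustration, and this version uses plain trial division, so it finishes much faster than the 30 seconds quoted above.

```shell
#!/bin/sh
# Assumed resource request: lines starting with "#PBS" are read by qsub
# and are ordinary comments to the shell. Adjust or remove as needed.
#PBS -l walltime=00:05:00

# Print the n-th prime number, found by trial division.
# nth_prime is a hypothetical name, not taken from the original script.
nth_prime() {
    n=$1
    count=0
    candidate=1
    while [ "$count" -lt "$n" ]; do
        candidate=$((candidate + 1))
        is_prime=1
        d=2
        # Only divisors up to sqrt(candidate) need to be tried.
        while [ $((d * d)) -le "$candidate" ]; do
            if [ $((candidate % d)) -eq 0 ]; then
                is_prime=0
                break
            fi
            d=$((d + 1))
        done
        if [ "$is_prime" -eq 1 ]; then
            count=$((count + 1))
        fi
    done
    echo "$candidate"
}

# Everything written to standard output ends up in the .o file of the job.
echo "The 200th prime is $(nth_prime 200)"
```

Submitted with 'qsub prime.sh', the line printed by the final echo would be found in the job's output file once the job has finished.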

Note:

Submitting multiple jobs

The main advantage of running batch jobs on the cluster is that you can run multiple jobs (up to 12) in parallel. Instead of submitting each job by hand, you can write a script that launches multiple jobs. Of the many ways to do that, I find the most straightforward is to write a launch_prime.sh script that uses sed to make modified copies of the original script and then submits them.
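The sed-based approach can be sketched as follows. This is an illustration under stated assumptions, not the original launch_prime.sh: it assumes the job script sets its parameter on a line of the form 'N=200', and the names launch_copies and SUBMIT are made up for this example.

```shell
#!/bin/sh
# Command used to submit each copy. Defaults to qsub; run with
# SUBMIT=echo for a dry run that only prints the file names.
SUBMIT=${SUBMIT:-qsub}

# launch_copies TEMPLATE VALUE...
# For each VALUE, write a copy of TEMPLATE in which the assumed "N=..."
# line is replaced by "N=VALUE", then submit that copy.
launch_copies() {
    template=$1
    shift
    for n in "$@"; do
        copy="${template%.sh}_$n.sh"
        sed "s/^N=.*/N=$n/" "$template" > "$copy"
        $SUBMIT "$copy"
    done
}

# Usage: sh launch_prime.sh prime.sh 100 200 300
# Nothing is launched unless a template and at least one value are given.
if [ "$#" -ge 2 ]; then
    launch_copies "$@"
fi
```

Each copy gets its own name (prime_100.sh, prime_200.sh, ...), so the output and error files of the individual jobs do not overwrite each other. Keeping the number of values at or below 12 matches the number of jobs that can run in parallel.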

Running CDF software in multiple batch jobs

CDF software is always a bit more complicated to run than normal jobs. Things to take into account:

An example script that produces 1000 inclusive B decays (generator-level only) is Bincl.sh, with a matching launch script launch_Bincl.sh.


Last updated on 25/05/05, Rolf Oldeman.