HEP Interactive and Batch Processing


There are many ways of accessing high performance compute systems for your simulations and analysis. Interactive work (code development, testing binaries, graphical work) should be performed on dedicated Linux desktops or on interactive nodes. The latter is better for high memory or high throughput work.

Batch processing is supported with Slurm for Centos 7.

Interactive Work

For code development and testing, a number of central interactive servers are provided. These systems have fast network connections to the main cluster storage systems and local disk systems, and can be accessed from any system on or off campus via SSH. Please do not start lots of long-running processes on these nodes; use the batch queues for those instead.

64bit Linux desktops should also have the same range of resources available, although on a lower bandwidth network connection.

The available nodes for HEP users are
Hostname  OS             Server type
kappa     Centos7 64bit  8-core, 9GB/core, 2x Tesla M2070 GPGPU
phi       Centos7 64bit  20-core, 6GB/core
gamma     Centos7 64bit  32-core, 12GB/core
hepcuda1  Centos7 64bit  4-core, 4GB/core, 1x GTX980Ti GPGPU

A node reserved for LIV.DAT use is available
Hostname  OS             Server type
livdat1   Centos7 64bit  32-core, 4GB/core
There are also experiment-specific nodes (not in the lists above). Please do not use them unless you have permission.

Submitting Batch Jobs

Slurm

The Centos 7 desktops and nodes use the Slurm batch system. Access isn't automatic, but we try to add everyone when their accounts are registered. If you are denied access to submit jobs, please email helpdesk to be added.

There is much documentation on the Slurm website.

To ease the transition from old-style PBS jobs Slurm provides wrapper scripts to give similar utilities to the old PBS batch system, eg qsub, qstat, pbsnodes. PBS-style job scripts can be used with these wrappers, usually with no modifications. There are some subtle differences in the way the commands work compared to the old system (eg the way stdout/stderr files are handled) so be sure to check your scripts first.

Slurm has native commands for submitting jobs and querying the system (most tools give instructions with the --help switch or man pages). The equivalent of the PBS 'queue' is the Slurm 'partition'.
Slurm Command  Action                              Notes                                 PBS Equivalent
sacct          Job accounting details              -j to show specific job
               eg CPU, RAM consumption             -a to show all jobs
salloc/srun    Start interactive jobs or jobsteps  Usually called within a jobscript     qsub
sbatch         Submit a batch script job           -p to specify partition (queue)       qsub
                                                   -t HH:MM:SS to request walltime
                                                   -c N to request N CPUs
                                                   --mem=N(KMG) to request N RAM
                                                   --gpus=(type:)N to request N GPUs
                                                   --tmp=N(KMG) to request N disk space
scancel        Cancel job(s)                       Give a list of jobids                 qdel
sinfo          List partitions or nodes            -N for node format, -l for more info  qstat -Q
smap           Text view of jobs and nodes         -i N to update every N seconds
squeue         List queued jobs and their state    -l for long form output               qstat
sshare         List user usage and job priority    -a to show all users
                                                   Values updated every 5mins
sview          Graphical view of jobs and nodes
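
As a rough illustration of how the commands in the table fit together, the sketch below captures the job ID that sbatch prints on submission so it can be passed to squeue, sacct or scancel. The helper name extract_jobid is made up for this example, and it assumes sbatch's usual "Submitted batch job N" message; my_jobscript.sh is a placeholder.

```shell
#!/bin/bash
# extract_jobid: read sbatch's submission message on stdin and print the job ID.
# Assumes the usual message format "Submitted batch job 12345".
extract_jobid() {
    awk '/^Submitted batch job/ { print $NF }'
}

# Typical use on a system with Slurm (my_jobscript.sh is a placeholder):
#   jobid=$(sbatch my_jobscript.sh | extract_jobid)
#   squeue -j "$jobid"      # is it queued or running?
#   sacct -j "$jobid"       # accounting details for the job
#   scancel "$jobid"        # cancel it if something is wrong
```

Keeping the job ID in a variable avoids having to pick it out of the squeue listing by hand.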

There are currently three partitions ('queues') for general use but the partitions and their resources will vary over time as systems are added or retired:
Queue name  Max RAM/CPU  Total RAM/node   CPUs/node  Default/Max Walltime  Total CPUs  GPUs(type)   Notes
desktop     1000MB       1000MB           2          24hrs/24hrs           32          0            Runs on desktops so keep IO (network/disk usage) light
short       1900MB       15800MB          8          4hrs/8hrs             32          0            Dedicated old batch nodes, fast network.
compute     5300MB       128,384,1024GB*  24,72*     24hrs/48hrs           408         5(rtx_2080)  Default. Dedicated new batch nodes, fast network.
* Some nodes have more RAM or CPUs available than others, you can check with eg sinfo -N -o "%N %c %m" to show node, max CPUs and max RAM in megabytes.
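
If you only want the nodes with at least a certain amount of RAM, the sinfo output above can be filtered with a small helper. This is only a sketch: the function name nodes_with_ram is invented here, and it assumes sinfo prints its usual one-line header followed by one "node CPUs memory" row per node.

```shell
#!/bin/bash
# nodes_with_ram MIN_MB: read 'sinfo -N -o "%N %c %m"' output on stdin and
# print the nodes whose memory column is at least MIN_MB megabytes.
# The first line is assumed to be sinfo's header and is skipped.
nodes_with_ram() {
    awk -v min="$1" 'NR > 1 && ($3 + 0) >= (min + 0) { print $1 }'
}

# Example on a system with Slurm (the 300000MB threshold is arbitrary):
#   sinfo -N -o "%N %c %m" | nodes_with_ram 300000
```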

There may be additional project-specific queues, please don't use these unless you have permission.

All queues can have jobs submitted from desktops, but advanced operations that require connecting directly to the compute nodes will only work from interactive nodes (most job scripts shouldn't need this).

Jobs can be submitted with requirements on the command line, eg to submit a job that needs 2 rtx_2080 GPUs, 12 CPUs and 24GB of RAM:
  • sbatch -p compute --gpus=rtx_2080:2 --mem=24G -c 12 my_jobscript.sh
A simple example jobscript to be submitted with sbatch (for a job that runs on 1 node with 4 CPUs, and needs 4hrs 30minutes to run on the 'compute' partition, and outputs stdout and stderr files for each job to your scratch area)
#!/bin/bash
#SBATCH -N 1
#SBATCH -c 4
#SBATCH -p compute
#SBATCH -o /scratch/username/slurm-%j.out
#SBATCH -e /scratch/username/slurm-%j.err
#SBATCH -J myfirstjob
#SBATCH -t 04:30:00

#run the application:
/path/to/my/binary

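Or a job that needs 1 GPU of any type and 2 CPUs (submitted to the 'compute' partition, which is where the general-use GPUs listed above are)
#!/bin/bash
#SBATCH -N 1
#SBATCH --gres=gpu:1
#SBATCH -p compute
#SBATCH -c 2
#SBATCH -o /scratch/username/slurm-%j.out
#SBATCH -e /scratch/username/slurm-%j.err
#SBATCH -J gpujob

#run the application:
/path/to/my/binary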

Job resources are constrained by Linux cgroups. This effectively means that job processes won't be allowed to exceed their CPU or RAM limits (default or requested) eg if your job requests 1 CPU but runs 4 processes they will only get 25% of a CPU each.
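
One way to avoid oversubscribing your allocation is to take the number of worker processes from the job itself rather than hard-coding it. The sketch below assumes the job was submitted with -c, in which case Slurm sets SLURM_CPUS_PER_TASK; the function name worker_count and the --threads option are hypothetical.

```shell
#!/bin/bash
#SBATCH -c 4

# worker_count: number of processes to start, taken from the CPUs Slurm
# allocated to the job (SLURM_CPUS_PER_TASK), or 1 when that is not set.
worker_count() {
    echo "${SLURM_CPUS_PER_TASK:-1}"
}

echo "Starting $(worker_count) worker process(es)"
# /path/to/my/binary --threads "$(worker_count)"   # '--threads' is a hypothetical option
```

With this pattern, changing the -c request is enough to change how many workers are started.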

Temporary disk space

All nodes have some amount of local disk storage available. By default a temporary directory is created for each job with the location stored in the environment variable SLURM_TMPDIR. This directory is deleted after the job finishes.

The local disks aren't high performance but are often the best place to stream data output from a running job which can then be copied over to network storage once the job has completed. Single file copies are much more efficient than streaming data over the network.

If you're going to be storing many GBs of data locally, you should request a suitable amount with the --tmp switch. This allows the scheduler to limit the number of jobs on a single node so that the disk doesn't fill up.
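
The write-locally-then-copy pattern described in this section could look something like the sketch below. The function name run_staged, the output file name output.dat, the 20G request and the destination path are all placeholders chosen for this example; outside a batch job it falls back to a fresh temporary directory so it can be tried anywhere.

```shell
#!/bin/bash
#SBATCH -p compute
#SBATCH --tmp=20G

# run_staged DESTDIR CMD [ARGS...]
# Run CMD inside the job's local temporary directory with its stdout written
# to a local file, then copy that file to DESTDIR in a single operation.
run_staged() {
    staged_dest="$1"
    shift
    # SLURM_TMPDIR is set inside a job; use a fresh temp dir elsewhere.
    staged_work="${SLURM_TMPDIR:-$(mktemp -d)}"
    ( cd "$staged_work" && "$@" > output.dat ) || return 1
    mkdir -p "$staged_dest" && cp "$staged_work/output.dat" "$staged_dest/"
}

# Example call (paths are placeholders):
#   run_staged /scratch/username/results /path/to/my/binary
```

Because the copy only happens after the command succeeds, a failed job leaves nothing half-written on network storage.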

GPUs

A limited number of nodes have a GPU available. This will probably increase over time. At present there are only two types, the generic 'geforce' for older cards, and 'rtx_2080' for the 11GB RTX 2080Ti cards.

GPUs can be requested as a generic resource eg
  • sbatch -p compute --gpus=1
will request one available GPU on the compute queue. If you need a specific type of GPU this can also be requested, eg to run a job specifically on an RTX 2080 class of device use
  • sbatch -p compute --gpus=rtx_2080:1
GPU Devices requested through Slurm will always be enumerated starting at 0, regardless of which device they are running on, ie a job requesting one GPU will always see device GPU0, and another job requesting two GPUs will always see devices GPU0 and GPU1.

At present GPU resources are allocated exclusively, ie they cannot be shared by multiple jobs.
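
A job can check how many GPUs it has actually been given before starting any work. The sketch below assumes that Slurm exports the allocated devices in CUDA_VISIBLE_DEVICES as a comma-separated list (this is not stated above, so treat it as an assumption to verify on your system); the function name count_gpus is invented for the example.

```shell
#!/bin/bash
#SBATCH -p compute
#SBATCH --gpus=2

# count_gpus LIST: number of entries in a comma-separated device list
# such as "0,1"; an empty list counts as zero.
count_gpus() {
    if [ -z "$1" ]; then
        echo 0
    else
        echo "$1" | awk -F',' '{ print NF }'
    fi
}

# Assumption: Slurm exports CUDA_VISIBLE_DEVICES inside GPU jobs.
echo "GPUs visible to this job: $(count_gpus "${CUDA_VISIBLE_DEVICES:-}")"
```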

Good Practice

Jobs submitted to the batch system will by default use the compute queue and allocate 1 CPU, 5300MB of RAM and a walltime of 24hrs. This is more than adequate for a lot of jobs, but depending on the mix of jobs on the system it can tie up resources unnecessarily.

If you know that your job can use fewer resources, eg only 2GB of RAM or a run time of less than 24 hours, then specifying this when submitting the job allows the Slurm scheduler to allocate jobs more efficiently, particularly when there is a mix of large and small jobs.
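
As a small illustration, a reduced request might be built like the sketch below. The helper walltime_from_minutes is a made-up convenience for producing the HH:MM:SS string that -t expects, and my_jobscript.sh is a placeholder.

```shell
#!/bin/bash
# walltime_from_minutes N: format N minutes as the HH:MM:SS string used by -t.
walltime_from_minutes() {
    printf '%02d:%02d:00\n' "$(( $1 / 60 ))" "$(( $1 % 60 ))"
}

# Example (my_jobscript.sh is a placeholder): 2GB of RAM and 90 minutes.
#   sbatch -p compute --mem=2G -t "$(walltime_from_minutes 90)" my_jobscript.sh
```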

The scheduler employs a Fair Share system. Users who have been running lots of jobs recently will have a lower priority than those who haven't. This prevents any one user hogging the resources by queueing lots of jobs. Over time everyone should have a roughly equal share of the resources, but you may have to wait for running jobs to finish before yours start. Users should submit their jobs to the queues and allow the system to allocate the resources. Do not attempt to schedule job submission yourself; it is unlikely to work any better.

Interactive access can be granted through the salloc and srun commands, similar to sbatch. While this is useful for testing job steps or performing one-off tasks that need large compute resources, we don't recommend it for most workflows, as it can tie up resources in idle sessions. Interactive work should nearly always be performed on dedicated interactive nodes.

Torque/PBS

The old Torque batch queues on SL6 systems have been removed; all batch work should be submitted to the Slurm queues. SL6 compatibility on Slurm can be achieved using Singularity containers (see the HEPContainerGuide). The instructions below are left for reference only.

Batch jobs can be submitted from an SL6 system (desktop or interactive node). Here you can use the standard PBS tools to submit and control jobs. For example
  • qsub -q medium64 -e /scratch/user -o /scratch/user jobscript.sh
Please output your stdout and stderr to /scratch not your home area.

Your jobscript should be a shell script that sources any relevant profiles and configuration and runs the commands you wish to be run. By default it should be located in your home directory.

The grid UI software is configured by default on SL6 systems. This is (along with a valid grid certificate and proxy) required for submitting grid jobs and accessing grid storage.

Queued and running jobs can be viewed using
  • qstat
which will list job IDs, owner, queue, walltime etc. If you want to delete a job use
  • qdel JOBID
If this doesn't remove your job then please contact the admins (jobs can be lost on nodes that have crashed or been turned off).

To quickly remove all of your jobs (both queued and running)
  • qselect -u username | xargs qdel
Job scheduling is subject to 'fair shares'. Users who have recently run more jobs will have a lower priority than those who have run no or fewer jobs and so the latter's queued jobs will be started before the former's. This prevents single users monopolising the batch systems with lots of queued jobs but also allows users to make maximum usage of the resources when they are free. This essentially means you can submit as many jobs as you need but you can't hog the queues.

There are a number of different PBS queues available depending on the length of your job and the required resources.
Queue    CPUs per node  Node type               RAM/CPU   Max Job Slots  Walltime (hrs)
short64  8              2.5GHz Xeon 5420 64bit  2GB       Up to 16       1
desk64   2              2.66GHz Q8400 64bit     1GB       Up to 30       8
all64    2              Any available node      variable  Variable       24
short64 uses spare 64bit interactive node capacity so the number of available slots will vary.

desk64 uses spare 64bit desktop PC capacity so the number of available slots will vary and may be limited to out of office hours (6pm-6am).

all64 uses any available 64bit node. Note that job limits should be set to the lowest spec available (ie desk64).

If you need more RAM per job than is available you can request a number of CPUs eg if you want 4GB per job and the nodes have 2GB per CPU you could use
  • qsub -l nodes=1:ppn=2 ...other options
If your jobs need more RAM than can be supported by the machines on a queue please use a different queue. If you have unusual requirements please enquire with helpdesk for advice.
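
The number of CPUs to request for a given amount of RAM is just the RAM needed divided by the RAM per CPU, rounded up. A minimal sketch of that arithmetic follows; the function name cpus_for_ram is invented for the example and both arguments are whole numbers of GB.

```shell
#!/bin/bash
# cpus_for_ram NEED_GB PER_CPU_GB: smallest whole number of CPUs whose
# combined RAM covers NEED_GB (integer ceiling division).
cpus_for_ram() {
    echo "$(( ($1 + $2 - 1) / $2 ))"
}

# Example from the text: 4GB per job on nodes with 2GB per CPU.
echo "ppn=$(cpus_for_ram 4 2)"   # prints ppn=2
```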

Further Reading

Torque job submission documentation.
Topic revision: r29 - 31 Aug 2021, JohnBland