HEP Interactive and Batch Processing

There are many ways of accessing high performance compute systems for your simulations and analysis. Interactive work (code development, testing binaries, graphical work) should be performed on dedicated Linux desktops or on interactive nodes. The latter is better for high memory or high throughput work.

Batch processing is supported with Slurm for Centos 7.

Interactive Work

For code development and testing a number of central interactive servers are provided. These systems have fast network connections to the main cluster storage systems and local disk systems. They can be accessed from any system on or off campus via SSH. Please do not start lots of long-running processes on these nodes; use the batch queues for those instead.

64bit Linux desktops should also have the same range of resources available, although on a lower bandwidth network connection.

The available nodes for HEP users are

Hostname	OS	Server type
kappa	Centos7	64bit 8-core, 9GB/core
phi	Centos7	64bit 20-core, 6GB/core 2xRTX A4000 GPGPU
gamma	Centos7	64bit 32-core, 12GB/core
hepcuda1	Centos7	64bit 4-core, 4GB/core 1xGTX980Ti GPGPU

An AlmaLinux 9 node is also available

Hostname	OS	Server type
theta	AlmaLinux 9	64bit 32-core, 4GB/core

There are also experiment-specific nodes (not in the list above). Please do not use them unless you have permission.

Submitting Batch Jobs

Slurm

The Centos 7 desktops and nodes use the Slurm batch system. Access isn't automatic but we try to add everyone when their accounts are registered. If you are denied access to submit jobs please email helpdesk to be added.

There is much documentation on the Slurm website.

To ease the transition from old-style PBS jobs Slurm provides wrapper scripts to give similar utilities to the old PBS batch system, eg qsub, qstat, pbsnodes. PBS-style job scripts can be used with these wrappers, usually with no modifications. There are some subtle differences in the way the commands work compared to the old system (eg the way stdout/stderr files are handled) so be sure to check your scripts first.

Slurm has native commands for submitting jobs and querying the system (most tools give instructions with the --help switch or man pages). The equivalent of the PBS 'queue' is the Slurm 'partition'.

Slurm Command	Action	Notes	PBS Equivalent
salloc/srun	Start interactive jobs or jobsteps	Usually called within a jobscript	qsub
scancel	Cancel job(s)	Give a list of jobids	qdel
sbatch	Submit a batch script job	-p to specify partition (queue); -t HH:MM:SS to request walltime; -c N to request N CPUs; --mem=N(KMG) to request N RAM; --gpus=(type:)N to request N GPUs; --tmp=N(KMG) to request N disk space	qsub
sinfo	List partitions or nodes	-N for node format, -l for more info	qstat -Q
squeue	List queued jobs and their state	-l for long form output	qstat
sacct	Job accounting details eg CPU, RAM consumption	-j to show specific job; -a to show all jobs
smap	Text view of jobs and nodes	-i N to update every N seconds
sshare	List user usage and job priority	-a to show all users. Values updated every 5mins
sview	Graphical view of jobs and nodes
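As a rough illustration of how the query commands in the table fit together, the sketch below wraps two of them in a shell function. The function name my_slurm_summary is hypothetical and not something provided on the cluster. The -u switch to squeue (show one user's jobs) is a standard Slurm option that isn't listed in the table.

```shell
# Hypothetical helper (not installed anywhere): summarise one user's jobs
# and the state of the nodes using the native Slurm query commands.
my_slurm_summary() {
    user="${1:-$(id -un)}"    # default to the current user
    echo "== Jobs for $user =="
    squeue -u "$user" -l      # queued and running jobs, long format
    echo "== Nodes =="
    sinfo -N -l               # one line per node, with extra detail
}

# Usage, on a machine with the Slurm client tools installed:
#   my_slurm_summary
#   my_slurm_summary someotheruser
```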

There are currently two partitions ('queues') for general use but the partitions and their resources will vary over time as systems are added or retired:

Queue name	Max RAM/CPU	Total RAM/node	CPUs/node	Default/Max Walltime	Total CPUs	GPUs(type)	Notes
short	1900MB	15800MB	8	4hrs/8hrs	32	0	Dedicated old batch nodes, fast network.
compute	5300MB	128,384,1024GB*	24,72*	24hrs/48hrs	408	5(rtx_2080)	Default. Dedicated new batch nodes, fast network.

* Some nodes have more RAM or CPUs available than others. You can check with eg sinfo -N -o "%N %c %m" to show node, max CPUs and max RAM in megabytes.
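If you want to pick out the larger nodes from that listing, something like the following sketch may help. The function name nodes_with_at_least is hypothetical. It assumes input lines of the form "node cpus memory_in_MB", which is what the format string above produces once the header line is suppressed with the -h switch.

```shell
# Hypothetical filter: read "node cpus memory_MB" lines on stdin and print
# the names of nodes with at least MIN_CPUS CPUs and MIN_MB megabytes of RAM.
nodes_with_at_least() {
    min_cpus="$1"
    min_mb="$2"
    # A node in several partitions is listed more than once, hence sort -u.
    awk -v c="$min_cpus" -v m="$min_mb" '$2 >= c && $3 >= m { print $1 }' | sort -u
}

# Usage (-h drops the header line so only data rows are filtered):
#   sinfo -h -N -o "%N %c %m" | nodes_with_at_least 24 384000
```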

There may be additional project-specific queues, please don't use these unless you have permission.

All queues can have jobs submitted from desktops, but advanced operations that require connecting directly to the compute nodes will only work from interactive nodes (most job scripts shouldn't need this).

Jobs can be submitted with requirements on the command line eg to submit a job that needs 2 rtx_2080 GPUs, 12 CPUs and 24GB of RAM

sbatch -p compute --gpus=rtx_2080:2 --mem=24G -c 12 my_jobscript.sh
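When submitting from your own scripts it can be handy to capture the job ID. A minimal sketch, assuming the standard --parsable switch to sbatch (which prints just the job ID, optionally followed by ";clustername"); the wrapper name submit_and_show is hypothetical.

```shell
# Hypothetical wrapper: submit a job with the given sbatch arguments, then
# print the job ID and the job's current state in the queue.
submit_and_show() {
    jobid="$(sbatch --parsable "$@")" || return 1
    jobid="${jobid%%;*}"      # strip any ";clustername" suffix
    echo "Submitted job $jobid"
    squeue -j "$jobid"
}

# Usage:
#   submit_and_show -p compute --gpus=rtx_2080:2 --mem=24G -c 12 my_jobscript.sh
```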

A simple example jobscript to be submitted with sbatch (for a job that runs on 1 node with 4 CPUs, and needs 4hrs 30minutes to run on the 'compute' partition, and outputs stdout and stderr files for each job to your scratch area)

#!/bin/bash
#SBATCH -N 1
#SBATCH -c 4
#SBATCH -p compute
#SBATCH -o /scratch/username/slurm-%j.out
#SBATCH -e /scratch/username/slurm-%j.err
#SBATCH -J myfirstjob
#SBATCH -t 04:30:00
#run the application:
/path/to/my/binary

Or a job that needs 1 GPU of any type and 2 CPUs

#!/bin/bash
#SBATCH -N 1
#SBATCH --gres=gpu:1
#SBATCH -p compute
#SBATCH -c 2
#SBATCH -o /scratch/username/slurm-%j.out
#SBATCH -e /scratch/username/slurm-%j.err
#SBATCH -J gpujob
/path/to/my/binary

Job resources are constrained by Linux cgroups. This effectively means that job processes won't be allowed to exceed their CPU or RAM limits (default or requested) eg if your job requests 1 CPU but runs 4 processes they will only get 25% of a CPU each.
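To avoid starting more processes or threads than the job has CPUs, a jobscript can read the allocation from the environment. The sketch below assumes the job was submitted with -c, in which case Slurm sets SLURM_CPUS_PER_TASK. The function name use_allocated_cpus is hypothetical, and whether your application honours OMP_NUM_THREADS is also an assumption.

```shell
# Hypothetical helper for use inside a jobscript: match the thread count to
# the number of CPUs Slurm allocated, falling back to 1 if -c was not given.
use_allocated_cpus() {
    ncpus="${SLURM_CPUS_PER_TASK:-1}"
    export OMP_NUM_THREADS="$ncpus"   # read by OpenMP-based applications
    echo "$ncpus"
}

# Usage inside a jobscript submitted with eg "#SBATCH -c 4":
#   use_allocated_cpus > /dev/null
#   /path/to/my/binary
```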

Temporary disk space

All nodes have some amount of local disk storage available. By default a temporary directory is created for each job with the location stored in the environment variable SLURM_TMPDIR. This directory is deleted after the job finishes.

The local disks aren't high performance but are often the best place to stream data output from a running job which can then be copied over to network storage once the job has completed. Single file copies are much more efficient than streaming data over the network.

If you're going to be storing many GBs of data locally you should request a suitable amount with the --tmp switch. This allows the scheduler to limit the number of jobs on a single node so that the disk doesn't fill up.
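The write-locally-then-copy pattern described above could look something like this inside a jobscript. It is only a sketch: the function name run_with_local_output and the output file name are hypothetical, and it falls back to TMPDIR or /tmp if SLURM_TMPDIR is not set.

```shell
# Hypothetical pattern: run a command with its output written to the job's
# local temporary directory, then copy the finished file to network storage.
run_with_local_output() {
    dest_dir="$1"; shift
    workdir="${SLURM_TMPDIR:-${TMPDIR:-/tmp}}"
    outfile="$workdir/output.$$.dat"
    "$@" > "$outfile" || return 1              # stream output to local disk
    mkdir -p "$dest_dir" && cp "$outfile" "$dest_dir/"   # one single file copy
}

# Usage inside a jobscript (optionally with eg "#SBATCH --tmp=20G"):
#   run_with_local_output /scratch/username/results /path/to/my/binary
```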

GPUs

A limited number of nodes have a GPU available. This will probably increase over time. At present there are only two types, the generic 'geforce' for older cards, and 'rtx_2080' for the 11GB RTX 2080Ti cards.

GPUs can be requested as a generic resource eg

sbatch -p compute --gpus=1

will request one available GPU on the compute queue. If you need a specific type of GPU this can also be requested, eg to run a job specifically on an RTX 2080 class of device use

sbatch -p compute --gpus=rtx_2080:1

GPU devices requested through Slurm will always be enumerated starting at 0, regardless of which physical device they are running on, ie a job requesting one GPU will always see device GPU0, and another job requesting two GPUs will always see devices GPU0 and GPU1.
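A jobscript can check how many GPUs it has been given before starting work. The sketch below assumes that Slurm exports CUDA_VISIBLE_DEVICES for the allocated devices, which is typical for its GPU support but is not confirmed for this cluster. The function name visible_gpu_count is hypothetical.

```shell
# Hypothetical check: count the GPUs visible to this job from the
# comma-separated list in CUDA_VISIBLE_DEVICES (0 if unset or empty).
visible_gpu_count() {
    if [ -z "${CUDA_VISIBLE_DEVICES:-}" ]; then
        echo 0
    else
        echo "$CUDA_VISIBLE_DEVICES" | tr ',' '\n' | wc -l | tr -d ' '
    fi
}

# Usage inside a jobscript:
#   echo "This job can see $(visible_gpu_count) GPU(s)"
```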

At present GPU resources are allocated exclusively, ie they cannot be shared by multiple jobs.

Good Practice

Jobs submitted to the batch system will by default use the compute queue and allocate 1 CPU, 5300MB of RAM and a walltime of 24hrs. This is more than adequate for a lot of jobs, but depending on the mix of jobs on the system it can tie up resources unnecessarily.

If you know that your job can use fewer resources, eg only 2GB of RAM or less than 24 hours of walltime, then specifying this when submitting the job allows the Slurm scheduler to allocate jobs more efficiently, particularly when there is a mix of large and small jobs.
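One way to find out what a job really needs is to look at the accounting record of a previous run. A minimal sketch, using standard sacct format fields; the function name job_usage is hypothetical.

```shell
# Hypothetical helper: show elapsed time, CPU time and peak memory (MaxRSS)
# for a finished job, next to the memory that was requested (ReqMem).
job_usage() {
    sacct -j "$1" --format=JobID,JobName,Elapsed,TotalCPU,MaxRSS,ReqMem,State
}

# Usage:
#   job_usage 123456
# If MaxRSS and Elapsed are well below the defaults, request less next time, eg
#   sbatch --mem=2G -t 02:00:00 my_jobscript.sh
```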

The scheduler employs a Fair Share system. Users who have been running lots of jobs recently will have a lower priority than those who haven't. This prevents any one user hogging the resources by queueing lots of jobs. Over time everyone should have a roughly equal share of the resources, but you may have to wait for running jobs to finish before yours start. Users should submit their jobs to the queues and allow the system to allocate the resources. Do not attempt to schedule job submission yourself, as it is unlikely to work any better.

Interactive access can be granted through the salloc and srun commands, similar to sbatch. While this is useful for testing job steps or performing one-off tasks that need large compute resources, we don't recommend it for most workflows because it can tie up resources in idle sessions. Interactive work should nearly always be performed on dedicated interactive nodes.
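For the occasional case where an interactive batch session is justified, a sketch along these lines keeps the request small and time-limited so that a forgotten session is eventually cleaned up. The function name interactive_session and its defaults are hypothetical; --pty is the standard srun switch for attaching a terminal.

```shell
# Hypothetical helper: start an interactive shell on a batch node with an
# explicit partition, CPU count and walltime limit.
interactive_session() {
    srun -p "${1:-compute}" -c "${2:-1}" -t "${3:-01:00:00}" --pty bash -i
}

# Usage (2 CPUs for at most 30 minutes on the compute partition):
#   interactive_session compute 2 00:30:00
```

Remember to exit the shell when you have finished so that the resources are released.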

Torque/PBS

The old Torque batch queues on SL6 systems have been removed. All batch work should be submitted to the Slurm queues. SL6 compatibility on Slurm can be achieved using Singularity containers (see the HEPContainerGuide). These instructions are left for reference only.

Batch jobs can be submitted from an SL6 system (desktop or interactive node). Here you can use the standard PBS tools to submit and control jobs. For example

qsub -q medium64 -e /scratch/user -o /scratch/user jobscript.sh

Please output your stdout and stderr to /scratch not your home area.

Your jobscript should be a shell script that sources any relevant profiles and configuration and runs the commands you wish to be run. By default it should be located in your home directory.

The grid UI software is configured by default on SL6 systems. This is (along with a valid grid certificate and proxy) required for submitting grid jobs and accessing grid storage.

Queued and running jobs can be viewed using

qstat

which will list job IDs, owner, queue, walltime etc. If you want to delete a job use

qdel JOBID

If this doesn't remove your job then please contact the admins (jobs can be lost on nodes that have crashed or been turned off).

To quickly remove all of your jobs (both queued and running)

qselect -u username | xargs qdel

Job scheduling is subject to 'fair shares'. Users who have recently run more jobs will have a lower priority than those who have run no or fewer jobs and so the latter's queued jobs will be started before the former's. This prevents single users monopolising the batch systems with lots of queued jobs but also allows users to make maximum usage of the resources when they are free. This essentially means you can submit as many jobs as you need but you can't hog the queues.

There are a number of different PBS queues available depending on the length of your job and the required resources.

Queue	CPUs per node	Node type	RAM /CPU	Max Job Slots	Walltime (hrs)
short64	8	2.5GHz Xeon 5420 64bit	2GB	Up to 16	1
desk64	2	2.66GHz Q8400 64bit	1GB	Up to 30	8
all64	2	Any available node	variable	Variable	24

short64 uses spare 64bit interactive node capacity so the number of available slots will vary.

desk64 uses spare 64bit desktop PC capacity so the number of available slots will vary and may be limited to out of office hours (6pm-6am).

all64 uses any available 64bit node. Note that job limits should be set to the lowest spec available (ie desk64).

If you need more RAM per job than is available you can request a number of CPUs eg if you want 4GB per job and the nodes have 2GB per CPU you could use

qsub -l nodes=1:ppn=2 ...other options
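The number of CPUs to request is simply the RAM you need divided by the RAM per CPU, rounded up. As a small worked sketch (the function name cpus_for_ram is hypothetical, and both values are given in MB):

```shell
# Hypothetical helper: how many CPUs must be requested to get WANT_MB of RAM
# on nodes that provide PER_CPU_MB per CPU (integer division, rounded up).
cpus_for_ram() {
    want_mb="$1"
    per_cpu_mb="$2"
    echo $(( (want_mb + per_cpu_mb - 1) / per_cpu_mb ))
}

# Usage:
#   cpus_for_ram 4096 2048    # prints 2, ie ppn=2 as in the example above
#   cpus_for_ram 5000 2048    # prints 3
```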

If your jobs need more RAM than can be supported by the machines on a queue please use a different queue. If you have unusual requirements please enquire with helpdesk for advice.