Intro to SLURM - icaoberg


SLURM (Simple Linux Utility for Resource Management) is the workload manager used on Bridges-2 at the Pittsburgh Supercomputing Center (PSC). It schedules and manages jobs across the cluster, ensuring fair access to shared compute resources.

What is Bridges-2?

Bridges-2 is a high-performance computing system at PSC designed for a wide range of research workloads, from traditional parallel computing to AI and machine learning. It features CPU nodes with 256 GB to 4 TB of memory and GPU nodes equipped with NVIDIA H100, L40S, and V100 GPUs.

Partitions on Bridges-2

When submitting a job, you must select a partition (also called a queue). Each partition targets a different hardware tier and use case.

Partition	Description	Cores/Node	Memory/Node	GPUs	Max Time
RM	Regular Memory, full node	128	256 GB	—	72 hrs
RM-shared	Regular Memory, partial node	1–64	2 GB/core	—	72 hrs
RM-512	Regular Memory 512 GB, full node	128	512 GB	—	72 hrs
EM	Extreme Memory, full node	96	4 TB	—	120 hrs
GPU	GPU nodes, full node	all	512 GB–2 TB	8–16	48 hrs
GPU-shared	GPU nodes, partial node	—	—	1–4	48 hrs

GPU nodes include NVIDIA H100 (80 GB), L40S (48 GB), and V100 (32 GB/16 GB) models.
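
Because RM-shared memory scales with the number of cores requested, it helps to work out the memory a request comes with before submitting. The sketch below applies the 2 GB/core figure and the 1–64 core range from the table; `rm_shared_mem_gb` is a hypothetical helper name for illustration, not a SLURM or PSC command.

```shell
#!/bin/bash
# Sketch: estimate the memory (in GB) that accompanies an RM-shared core request.
# Assumes the 2 GB/core and 1-64 core figures from the partition table.
# rm_shared_mem_gb is a hypothetical name, not part of SLURM.
rm_shared_mem_gb() {
    # Reject empty or non-numeric input.
    case "$1" in
        ''|*[!0-9]*)
            echo "cores must be a whole number" >&2
            return 1
            ;;
    esac
    # Reject requests outside the RM-shared range.
    if [ "$1" -lt 1 ] || [ "$1" -gt 64 ]; then
        echo "RM-shared accepts 1 to 64 cores" >&2
        return 1
    fi
    echo $(( $1 * 2 ))
}

rm_shared_mem_gb 8    # 8 cores x 2 GB/core
```

If a job needs more memory than its core count provides, requesting more cores or moving to a full-node partition are the options the table suggests.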

Submitting a Batch Job with `sbatch`

Batch jobs are the most common way to run work on Bridges-2. You write a script, submit it, and SLURM runs it when resources are available.

CPU Example (RM-shared)

#!/bin/bash
#SBATCH --job-name=cpu_example
#SBATCH --partition=RM-shared
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=8
#SBATCH --time=01:00:00
#SBATCH --output=cpu_job_%j.out
#SBATCH --error=cpu_job_%j.err

# Load your modules
module load python/3.11

# Run your code
python my_script.py

Save the script as cpu_job.sh and submit it with:

sbatch cpu_job.sh
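
On success, `sbatch` prints a line of the form `Submitted batch job <job_id>`, and with `--parsable` it prints just the ID (optionally followed by `;cluster`). Capturing that ID is handy for later `squeue`, `scancel`, or `sacct` calls. The following is a minimal sketch; `extract_job_id` is a hypothetical helper name, and the submission step only runs where `sbatch` and `cpu_job.sh` are actually present.

```shell
#!/bin/bash
# Sketch: pull the numeric job ID out of sbatch's output.
# extract_job_id is a hypothetical helper, not a SLURM command.
# It accepts the default message ("Submitted batch job 12345")
# or `sbatch --parsable` output ("12345" or "12345;cluster").
extract_job_id() {
    # Take the last whitespace-separated field, then drop any ";cluster" suffix.
    printf '%s\n' "$1" | awk '{print $NF}' | cut -d';' -f1
}

# Only attempt a real submission where sbatch and the script exist.
if command -v sbatch >/dev/null 2>&1 && [ -f cpu_job.sh ]; then
    job_id=$(extract_job_id "$(sbatch cpu_job.sh)")
    echo "Submitted job $job_id"
fi
```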

GPU Example (GPU-shared)

#!/bin/bash
#SBATCH --job-name=gpu_example
#SBATCH --partition=GPU-shared
#SBATCH --nodes=1
#SBATCH --gpus=1
#SBATCH --time=02:00:00
#SBATCH --output=gpu_job_%j.out
#SBATCH --error=gpu_job_%j.err

# Load modules
module load cuda/12.2
module load python/3.11

# Run your GPU code
python train_model.py

GPU Example (Full GPU Node)

Use the GPU partition when you need every GPU on a node. This example requests 8; as the partition table shows, some nodes provide up to 16:

#!/bin/bash
#SBATCH --job-name=gpu_full_node
#SBATCH --partition=GPU
#SBATCH --nodes=1
#SBATCH --gpus=8
#SBATCH --time=04:00:00
#SBATCH --output=gpu_full_%j.out
#SBATCH --error=gpu_full_%j.err

module load cuda/12.2

python distributed_training.py
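
Before starting a long training run, it can be worth confirming that the job sees the GPUs it asked for. The sketch below counts the entries in `CUDA_VISIBLE_DEVICES`, on the assumption that the cluster's SLURM configuration sets that variable for GPU jobs (common, but not guaranteed); `count_visible_gpus` is a hypothetical helper name.

```shell
#!/bin/bash
# Sketch: count the GPUs exposed to the current shell.
# Assumes SLURM exports CUDA_VISIBLE_DEVICES for GPU jobs on this cluster.
# count_visible_gpus is a hypothetical helper, not a SLURM or CUDA command.
count_visible_gpus() {
    # An unset or empty variable is treated as "no GPUs exposed".
    if [ -z "${CUDA_VISIBLE_DEVICES:-}" ]; then
        echo 0
        return 0
    fi
    # Count comma-separated entries, e.g. "0,1,2,3" has four.
    printf '%s\n' "$CUDA_VISIBLE_DEVICES" | awk -F',' '{print NF}'
}

echo "GPUs visible to this job: $(count_visible_gpus)"
```

Placed near the top of a job script, a check like this makes a mismatch between `--gpus` and what the code actually sees show up in the first lines of the output file.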

Running Interactive Jobs with `srun`

`srun` is useful for interactive debugging or quick tests. It allocates resources and runs a command directly in your terminal; passing `--pty bash` as the command starts an interactive shell on the allocated node.

Interactive CPU Session

srun --partition=RM-shared --nodes=1 --ntasks-per-node=4 --time=01:00:00 --pty bash

Interactive GPU Session

srun --partition=GPU-shared --nodes=1 --gpus=1 --time=01:00:00 --pty bash

Once the shell launches, you can run commands interactively on the allocated node.
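
It is easy to lose track of whether a terminal is still on the login node or inside an allocation. SLURM sets environment variables such as `SLURM_JOB_ID`, `SLURM_JOB_PARTITION`, and `SLURM_JOB_NODELIST` inside a job, so a quick check might look like the sketch below; `describe_allocation` is a hypothetical helper name.

```shell
#!/bin/bash
# Sketch: report whether the current shell is running inside a SLURM job.
# describe_allocation is a hypothetical helper, not a SLURM command.
describe_allocation() {
    # SLURM_JOB_ID is only set inside a job or interactive allocation.
    if [ -z "${SLURM_JOB_ID:-}" ]; then
        echo "not inside a SLURM job"
        return 1
    fi
    echo "job ${SLURM_JOB_ID} on partition ${SLURM_JOB_PARTITION:-unknown}, node(s) ${SLURM_JOB_NODELIST:-unknown}"
}

# Print the status without aborting the shell when run outside a job.
describe_allocation || true
```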

Useful SLURM Commands

Command	Description
`sbatch job.sh`	Submit a batch job
`squeue -u $USER`	View your queued and running jobs
`scancel <job_id>`	Cancel a job
`sinfo`	Show available partitions and nodes
`sacct -j <job_id>`	View accounting info for a past job
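
These commands combine well with ordinary shell tools. As one illustration, the sketch below tallies your jobs by state using `squeue` with `-h` (no header) and `-o "%T"` (state only); `summarize_states` is a hypothetical helper name, and the `squeue` call is skipped on machines where the command is not installed.

```shell
#!/bin/bash
# Sketch: count jobs per state from squeue output.
# summarize_states is a hypothetical helper, not a SLURM command.
summarize_states() {
    # Reads one job state per line on stdin; prints "STATE count" sorted by state.
    awk 'NF {n[$1]++} END {for (s in n) print s, n[s]}' | sort
}

# Only query the scheduler where squeue is available.
if command -v squeue >/dev/null 2>&1; then
    # -h drops the header row; -o "%T" prints only each job's state.
    squeue -u "$USER" -h -o "%T" | summarize_states
fi
```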

For full documentation on Bridges-2 partitions and resource limits, see the PSC Bridges-2 User Guide.
