icaoberg / Intro to SLURM

Created Tue, 10 Feb 2026 00:00:00 +0000 Modified Wed, 29 Apr 2026 23:57:12 -0400

SLURM (Simple Linux Utility for Resource Management) is the workload manager used on Bridges-2 at the Pittsburgh Supercomputing Center (PSC). It schedules and manages jobs across the cluster, ensuring fair access to shared compute resources.


What is Bridges-2?

Bridges-2 is a high-performance computing system at PSC designed for a wide range of research workloads, from traditional parallel computing to AI and machine learning. It features CPU nodes with large memory configurations and GPU nodes equipped with NVIDIA H100, L40S, and V100 GPUs.


Partitions on Bridges-2

When submitting a job, you must select a partition (also called a queue). Each partition targets a different hardware tier and use case.

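| Partition  | Description                      | Cores/Node | Memory/Node | GPUs | Max Time |
|------------|----------------------------------|------------|-------------|------|----------|
| RM         | Regular Memory, full node        | 128        | 256 GB      |      | 72 hrs   |
| RM-shared  | Regular Memory, partial node     | 1–64       | 2 GB/core   |      | 72 hrs   |
| RM-512     | Regular Memory 512 GB, full node | 128        | 512 GB      |      | 72 hrs   |
| EM         | Extreme Memory, full node        | 96         | 4 TB        |      | 120 hrs  |
| GPU        | GPU nodes, full node             | all        | 512 GB–2 TB | 8–16 | 48 hrs   |
| GPU-shared | GPU nodes, partial node          |            |             | 1–4  | 48 hrs   |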

GPU nodes include NVIDIA H100 (80 GB), L40S (48 GB), and V100 (32 GB/16 GB) models.
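The limits in the table can be turned into a quick sanity check before you write a job script. The sketch below is a hypothetical helper: the function names `rm_shared_mem_gb` and `suggest_partition` are made up for illustration and are not SLURM or PSC tools. It only encodes what the table states (2 GB per core and at most 64 cores on RM-shared, at most 4 GPUs on GPU-shared) and ignores the large-memory partitions RM-512 and EM.

```shell
#!/bin/bash
# Hypothetical helpers that encode the partition table above.
# Illustrative only; not part of SLURM or PSC tooling.

# Memory (in GB) that comes with a given core count on RM-shared (2 GB/core).
rm_shared_mem_gb() {
    echo $(( $1 * 2 ))
}

# Suggest a partition from a core count and a GPU count.
# Usage: suggest_partition <cores> <gpus>
suggest_partition() {
    cores=$1
    gpus=$2
    if [ "$gpus" -gt 4 ]; then
        echo "GPU"          # more than 4 GPUs needs a full GPU node
    elif [ "$gpus" -gt 0 ]; then
        echo "GPU-shared"   # 1-4 GPUs fit on a shared GPU node
    elif [ "$cores" -le 64 ]; then
        echo "RM-shared"    # up to 64 cores fit on part of an RM node
    else
        echo "RM"           # otherwise request a full 128-core node
    fi
}

echo "8 cores on RM-shared come with $(rm_shared_mem_gb 8) GB of memory"
echo "8 cores, 0 GPUs  -> $(suggest_partition 8 0)"
echo "16 cores, 1 GPU  -> $(suggest_partition 16 1)"
```

Requesting only what the job needs on a shared partition generally means a shorter wait than asking for a full node.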


Submitting a Batch Job with sbatch

Batch jobs are the most common way to run work on Bridges-2. You write a script, submit it, and SLURM runs it when resources are available.

CPU Example (RM-shared)

#!/bin/bash
#SBATCH --job-name=cpu_example
#SBATCH --partition=RM-shared
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=8
#SBATCH --time=01:00:00
#SBATCH --output=cpu_job_%j.out
#SBATCH --error=cpu_job_%j.err

# Load your modules
module load python/3.11

# Run your code
python my_script.py

Save the script as cpu_job.sh and submit it with:

sbatch cpu_job.sh
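If you want to script around the submission, `sbatch --parsable` prints just the job ID (optionally followed by `;` and the cluster name) instead of the usual confirmation message. A minimal sketch, assuming the CPU example is saved as `cpu_job.sh`; the `submit_job` function name is hypothetical:

```shell
#!/bin/bash
# Submit a batch script and print only the numeric job ID.
# Usage: submit_job <script>
submit_job() {
    # --parsable makes sbatch print "jobid" or "jobid;cluster"
    out=$(sbatch --parsable "$1") || return 1
    # Drop the optional ";cluster" suffix
    echo "${out%%;*}"
}

# Only submit when sbatch is available (i.e. on a cluster login node)
# and the script exists in the current directory.
if command -v sbatch >/dev/null 2>&1 && [ -f cpu_job.sh ]; then
    job_id=$(submit_job cpu_job.sh) && echo "Submitted job $job_id"
fi
```

The captured ID can then be passed to `squeue`, `scancel`, or `sacct`.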

GPU Example (GPU-shared)

#!/bin/bash
#SBATCH --job-name=gpu_example
#SBATCH --partition=GPU-shared
#SBATCH --nodes=1
# Bridges-2 may require a GPU type here, e.g. --gpus=v100-32:1 (check the PSC user guide)
#SBATCH --gpus=1
#SBATCH --time=02:00:00
#SBATCH --output=gpu_job_%j.out
#SBATCH --error=gpu_job_%j.err

# Load modules
module load cuda/12.2
module load python/3.11

# Run your GPU code
python train_model.py

GPU Example (Full GPU Node)

Use the GPU partition when you need a full node. This example requests all 8 GPUs on an 8-GPU node:

#!/bin/bash
#SBATCH --job-name=gpu_full_node
#SBATCH --partition=GPU
#SBATCH --nodes=1
#SBATCH --gpus=8
#SBATCH --time=04:00:00
#SBATCH --output=gpu_full_%j.out
#SBATCH --error=gpu_full_%j.err

module load cuda/12.2

python distributed_training.py
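Before launching a multi-GPU run, it can help to confirm that the job actually sees the number of GPUs you asked for. The sketch below counts the entries in `CUDA_VISIBLE_DEVICES`. It assumes the cluster's SLURM GPU configuration sets that variable inside the job, which is common but worth verifying on Bridges-2; the `count_visible_gpus` name is hypothetical.

```shell
#!/bin/bash
# Count the GPUs listed in CUDA_VISIBLE_DEVICES (e.g. "0,1,2,3" -> 4).
# Assumption: SLURM exports this variable for GPU jobs on your cluster.
count_visible_gpus() {
    if [ -z "${CUDA_VISIBLE_DEVICES:-}" ]; then
        echo 0
        return
    fi
    # Split on commas and count the resulting lines
    echo "$CUDA_VISIBLE_DEVICES" | tr ',' '\n' | wc -l | tr -d ' '
}

expected=8
found=$(count_visible_gpus)
if [ "$found" -ne "$expected" ]; then
    echo "Warning: expected $expected GPUs but found $found" >&2
fi
```

Placing a check like this just before the `python` line in the job script surfaces a mismatched request in the `.err` file early in the run.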

Running Interactive Jobs with srun

srun is useful for interactive debugging or quick tests. It allocates resources and runs a command directly in your terminal.

Interactive CPU Session

srun --partition=RM-shared --nodes=1 --ntasks-per-node=4 --time=01:00:00 --pty bash

Interactive GPU Session

srun --partition=GPU-shared --nodes=1 --gpus=1 --time=01:00:00 --pty bash

Once the shell launches, you can run commands interactively on the allocated node.
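To confirm that the shell is really running inside the allocation rather than on a login node, you can inspect the environment variables SLURM sets for a job. A small sketch: the `describe_allocation` function is hypothetical, while `SLURM_JOB_ID`, `SLURM_JOB_PARTITION`, and `SLURM_JOB_NODELIST` are standard SLURM variables.

```shell
#!/bin/bash
# Report whether the current shell is running inside a SLURM job.
describe_allocation() {
    if [ -z "${SLURM_JOB_ID:-}" ]; then
        echo "Not inside a SLURM job (probably a login node)"
        return 1
    fi
    echo "Job ${SLURM_JOB_ID} on partition ${SLURM_JOB_PARTITION:-unknown}, node(s) ${SLURM_JOB_NODELIST:-unknown}"
}

# Print the status; "|| true" keeps the non-zero return from aborting a script.
describe_allocation || true
```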


Useful SLURM Commands

| Command           | Description                         |
|-------------------|-------------------------------------|
| sbatch job.sh     | Submit a batch job                  |
| squeue -u $USER   | View your queued and running jobs   |
| scancel <job_id>  | Cancel a job                        |
| sinfo             | Show available partitions and nodes |
| sacct -j <job_id> | View accounting info for a past job |
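These commands combine naturally: `squeue` only lists jobs that are still pending or running, so once a job disappears from it, `sacct` is where the final state lives. The sketch below wraps that lookup in a hypothetical `job_state` function. It assumes job accounting is enabled on the cluster so that `sacct` returns records.

```shell
#!/bin/bash
# Print the current state of a job (e.g. PENDING, RUNNING, COMPLETED).
# Usage: job_state <job_id>
job_state() {
    # squeue lists pending/running jobs; -h drops the header, %T is the state
    state=$(squeue -h -j "$1" -o %T 2>/dev/null)
    if [ -z "$state" ]; then
        # The job has left the queue, so ask the accounting database.
        # -X limits output to the job itself (no steps), -n drops the header.
        state=$(sacct -j "$1" -X -n --format=State 2>/dev/null | awk 'NR==1 {print $1}')
    fi
    # Fall back to UNKNOWN when neither command returned anything
    echo "${state:-UNKNOWN}"
}
```

For example, `job_state 12345` prints a single word such as `RUNNING`, which is convenient to test in a polling loop with `sleep` between checks.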

For full documentation on Bridges-2 partitions and resource limits, see the PSC Bridges-2 User Guide.