SLURM (Simple Linux Utility for Resource Management) is the workload manager used on Bridges-2 at the Pittsburgh Supercomputing Center (PSC). It schedules and manages jobs across the cluster, ensuring fair access to shared compute resources.
What is Bridges-2?
Bridges-2 is a high-performance computing system at PSC designed for a wide range of research workloads, from traditional parallel computing to AI and machine learning. It features CPU nodes with large memory configurations and GPU nodes equipped with NVIDIA V100, L40S, and H100 GPUs.
Partitions on Bridges-2
When submitting a job, you must select a partition (also called a queue). Each partition targets a different hardware tier and use case.
| Partition | Description | Cores/Node | Memory/Node | GPUs | Max Time |
|---|---|---|---|---|---|
| RM | Regular Memory, full node | 128 | 256 GB | — | 72 hrs |
| RM-shared | Regular Memory, partial node | 1–64 | 2 GB/core | — | 72 hrs |
| RM-512 | Regular Memory 512 GB, full node | 128 | 512 GB | — | 72 hrs |
| EM | Extreme Memory, full node | 96 | 4 TB | — | 120 hrs |
| GPU | GPU nodes, full node | all | 512 GB–2 TB | 8–16 | 48 hrs |
| GPU-shared | GPU nodes, partial node | — | — | 1–4 | 48 hrs |
GPU nodes include NVIDIA H100 (80 GB), L40S (48 GB), and V100 (32 GB/16 GB) models.
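The table above can be read as a small decision rule. The sketch below encodes it in a hypothetical helper, choose_partition (not a PSC or Slurm tool), which takes the number of GPUs, CPU cores, and memory in GB that a job needs and prints a partition name. The thresholds are copied from the table; the order of the checks and the fall-through choices are assumptions, so treat the result as a starting point rather than a rule.

```shell
#!/bin/bash
# Hypothetical helper (not a PSC or Slurm tool): suggest a Bridges-2 partition.
# Arguments: GPUs needed, CPU cores needed, memory needed in GB.
# Thresholds are taken from the partition table above.
choose_partition() {
    local gpus=$1 cores=$2 mem_gb=$3

    if [ "$gpus" -gt 0 ]; then
        # GPU-shared serves 1-4 GPUs; larger requests need a full GPU node.
        if [ "$gpus" -le 4 ]; then echo "GPU-shared"; else echo "GPU"; fi
    elif [ "$mem_gb" -gt 512 ]; then
        echo "EM"            # only the 4 TB nodes exceed 512 GB
    elif [ "$mem_gb" -gt 256 ]; then
        echo "RM-512"
    elif [ "$cores" -le 64 ] && [ "$mem_gb" -le $((cores * 2)) ]; then
        echo "RM-shared"     # up to 64 cores at 2 GB per core
    else
        echo "RM"
    fi
}

choose_partition 0 8 16     # small CPU job
choose_partition 1 0 0      # single-GPU job
```

A job that needs few cores but more than 2 GB per core falls through to RM in this sketch; requesting more cores on RM-shared to obtain the extra memory is another option the helper does not model.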
Submitting a Batch Job with sbatch
Batch jobs are the most common way to run work on Bridges-2. You write a script, submit it, and SLURM runs it when resources are available.
CPU Example (RM-shared)
#!/bin/bash
#SBATCH --job-name=cpu_example
#SBATCH --partition=RM-shared
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=8
#SBATCH --time=01:00:00
#SBATCH --output=cpu_job_%j.out
#SBATCH --error=cpu_job_%j.err
# Load your modules
module load python/3.11
# Run your code
python my_script.py
Save the script as cpu_job.sh and submit it with:
sbatch cpu_job.sh
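When a wrapper script needs the job ID, for example to cancel the job or look it up later, it has to pull the number out of the submission message. sbatch normally prints a line of the form "Submitted batch job <id>". A minimal sketch follows; the helper name job_id_from_sbatch_output is hypothetical, and the demonstration uses a sample message instead of a live submission.

```shell
#!/bin/bash
# Hypothetical helper: extract the numeric job ID from sbatch's
# "Submitted batch job <id>" message (the fourth whitespace-separated field).
job_id_from_sbatch_output() {
    printf '%s\n' "$1" | awk '/^Submitted batch job/ {print $4}'
}

# Demonstration with a sample message rather than a live submission.
sample="Submitted batch job 1234567"
job_id=$(job_id_from_sbatch_output "$sample")
echo "Job ID: $job_id"

# On the cluster the same function could wrap a real submission:
#   job_id=$(job_id_from_sbatch_output "$(sbatch cpu_job.sh)")
```

sbatch also has a --parsable option that prints only the job ID (followed by a semicolon and the cluster name on multi-cluster setups), which avoids the parsing step.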
GPU Example (GPU-shared)
#!/bin/bash
#SBATCH --job-name=gpu_example
#SBATCH --partition=GPU-shared
#SBATCH --nodes=1
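# Slurm also accepts --gpus=<type>:<count> to request a specific GPU model; see the PSC Bridges-2 User Guide for the type names
#SBATCH --gpus=1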
#SBATCH --time=02:00:00
#SBATCH --output=gpu_job_%j.out
#SBATCH --error=gpu_job_%j.err
# Load modules
module load cuda/12.2
module load python/3.11
# Run your GPU code
python train_model.py
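Before starting a long training run, it can help to confirm that the job sees the GPUs it asked for. Slurm typically exports CUDA_VISIBLE_DEVICES as a comma-separated list of device indices for GPU jobs; that Bridges-2 does so in every configuration is an assumption here, and count_visible_gpus is a hypothetical helper name. A sketch of a check that could sit just before the python line:

```shell
#!/bin/bash
# Hypothetical helper: count the GPUs listed in CUDA_VISIBLE_DEVICES.
# Assumption: Slurm sets this variable to a comma-separated list such as "0,1".
# Prints 0 when the variable is empty or unset.
count_visible_gpus() {
    local devices=${CUDA_VISIBLE_DEVICES:-}
    if [ -z "$devices" ]; then
        echo 0
        return
    fi
    # Split on commas and count the non-empty entries.
    printf '%s\n' "$devices" | tr ',' '\n' | grep -c .
}

expected=1   # matches --gpus=1 in the script above
found=$(count_visible_gpus)
if [ "$found" -ne "$expected" ]; then
    echo "Warning: expected $expected GPU(s), found $found" >&2
fi
```

Inside a running job, nvidia-smi gives a fuller view of the allocated devices, including model and memory.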
GPU Example (Full GPU Node)
Use the GPU partition when you need all the GPUs on a node. This example requests 8:
#!/bin/bash
#SBATCH --job-name=gpu_full_node
#SBATCH --partition=GPU
#SBATCH --nodes=1
#SBATCH --gpus=8
#SBATCH --time=04:00:00
#SBATCH --output=gpu_full_%j.out
#SBATCH --error=gpu_full_%j.err
module load cuda/12.2
python distributed_training.py
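Each partition caps the walltime a job may request (48 hours for the GPU partitions, according to the table above). The sketch below shows one way to check a --time value against that cap before submitting. The helper name walltime_to_seconds is hypothetical, and it handles only the HH:MM:SS form used in this guide, not the other time formats Slurm accepts.

```shell
#!/bin/bash
# Hypothetical helper: convert an HH:MM:SS walltime string to seconds.
# Only this one format is handled; other Slurm time formats are out of scope.
walltime_to_seconds() {
    printf '%s\n' "$1" | awk -F: '{print $1 * 3600 + $2 * 60 + $3}'
}

requested="04:00:00"             # value of --time in the script above
limit_seconds=$((48 * 3600))     # 48-hour GPU limit from the partition table

if [ "$(walltime_to_seconds "$requested")" -gt "$limit_seconds" ]; then
    echo "Requested time $requested exceeds the 48-hour GPU limit" >&2
else
    echo "Requested time $requested is within the 48-hour GPU limit"
fi
```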
Running Interactive Jobs with srun
srun is useful for interactive debugging or quick tests. It allocates resources and runs a command directly in your terminal.
Interactive CPU Session
srun --partition=RM-shared --nodes=1 --ntasks-per-node=4 --time=01:00:00 --pty bash
Interactive GPU Session
srun --partition=GPU-shared --nodes=1 --gpus=1 --time=01:00:00 --pty bash
Once the shell launches, you can run commands interactively on the allocated node.
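A quick way to confirm that the shell is running inside an allocation is to print the environment variables Slurm sets for a job. The function name describe_allocation below is hypothetical, and the exact set of variables present can vary with how the job was launched, so each one falls back to "unset".

```shell
#!/bin/bash
# Print a short summary of the current Slurm allocation.
# Each variable falls back to "unset" so the function also runs outside a job.
describe_allocation() {
    echo "Job ID:    ${SLURM_JOB_ID:-unset}"
    echo "Partition: ${SLURM_JOB_PARTITION:-unset}"
    echo "Node(s):   ${SLURM_JOB_NODELIST:-unset}"
    echo "Tasks:     ${SLURM_NTASKS:-unset}"
}

describe_allocation
```

If every field reads "unset", the shell is most likely still on a login node and the srun command has not started yet or has already exited.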
Useful SLURM Commands
| Command | Description |
|---|---|
| sbatch job.sh | Submit a batch job |
| squeue -u $USER | View your queued and running jobs |
| scancel <job_id> | Cancel a job |
| sinfo | Show available partitions and nodes |
| sacct -j <job_id> | View accounting info for a past job |
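These commands combine naturally in scripts. As one illustration, the sketch below polls squeue until a job has left the queue, after which sacct can report how it ended. The function name wait_for_job and the 30-second default interval are assumptions, not part of Slurm or the PSC tooling.

```shell
#!/bin/bash
# Hypothetical helper: block until a job is no longer pending, running, or completing.
# Arguments: job ID, then an optional polling interval in seconds (default 30).
wait_for_job() {
    local job_id=$1 interval=${2:-30}
    # -h drops the header, -j selects the job, -t filters by state,
    # and -o "%T" prints the state name. The output is empty once the
    # job has left all three states.
    while [ -n "$(squeue -h -j "$job_id" -t PENDING,RUNNING,COMPLETING -o "%T" 2>/dev/null)" ]; do
        sleep "$interval"
    done
}

# Example use on the cluster (commented out because it needs a real job ID):
#   wait_for_job 1234567 60
#   sacct -j 1234567
```

Keep the polling interval modest, since frequent squeue calls add load to the scheduler that all users share.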
For full documentation on Bridges-2 partitions and resource limits, see the PSC Bridges-2 User Guide.