GPU-Accelerated SPH for Fluid Pouring Simulation

Yashoditya Watal & Athan Ferber

URL: https://www.andrew.cmu.edu/user/athanf/SPH.html

Summary

We will implement a GPU-accelerated, CUDA-based Smoothed Particle Hydrodynamics (SPH) fluid simulator for modeling liquid pouring scenarios, targeting the GPUs on the GHC machines. Our focus is on parallelizing particle interactions and analyzing performance challenges such as memory bandwidth limitations, warp divergence, and load imbalance. We will evaluate how different GPU implementations and batching strategies affect scalability and efficiency.

Background

SPH is a particle-based method in which a fluid is represented as a set of discrete particles. Rather than using a fixed grid, SPH estimates fluid properties by combining contributions from nearby particles using a weighted average. This makes it well suited for splashing and pouring flows.

At a high level, each simulation timestep consists of several stages. First, for every particle, the simulator finds nearby particles within a fixed radius. Using those neighbors, it computes local density and pressure. Then it computes forces on each particle, including pressure forces, viscosity, gravity, and collisions with container boundaries. Finally, it updates each particle’s velocity and position based on these forces, moving the simulation forward in time. This process repeats every timestep, and in a pouring scenario, it captures behavior like liquid forming a stream, hitting a surface, and spreading or splashing.

This workload has clear opportunities for parallelism. Most of the work is done per particle, so we can assign one particle to each thread. A key operation is neighbor search: finding which particles are close enough to affect a given particle. To avoid checking every particle against every other particle, the simulation space is divided into a uniform grid accessed through a spatial hash, and each particle is placed into a grid cell based on its position. When processing one particle, we then only need to check particles in its own cell and the adjacent cells, which makes the computation much more efficient.

In our implementation, we plan to parallelize the main SPH stages on the GPU: grid construction, density and pressure computation, force computation, and the final timestep integration. However, although SPH appears highly parallel, the workload is not uniform. In a pouring simulation, particles are unevenly distributed. Some regions are sparse (like a falling stream), while others are very dense (like where the liquid lands). This means some threads do much more work than others, and memory access becomes irregular. These effects can lead to warp divergence, load imbalance, and poor memory performance, which are the key challenges we aim to analyze in this project.

Pseudocode

for each timestep:
    // Spatial Hashing
    for each particle i:
        cell_i = cell containing particle i
        add particle i to cell_i

    // Compute density and pressure for all particles
    for each particle i:
        neighbors = find_neighbors(i, spatial_grid)
        density_i = sum(neighbor_contributions)
        pressure_i = compute_pressure(density_i)

    // Compute forces using already-computed densities/pressures
    for each particle i:
        neighbors = find_neighbors(i, spatial_grid)
        force_i = sum(pressure_viscosity_interactions over neighbors)
        force_i += gravity and wall collisions

    // Move particles forward
    for each particle i:
        velocity_i += force_i * dt
        position_i += velocity_i * dt
    
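The density/pressure stage of the pseudocode can be sketched in plain C++; a CUDA kernel would execute the outer loop with one thread per particle. This is a sketch under assumptions: the poly6 smoothing kernel and the linear equation of state are standard SPH choices rather than decisions we have committed to, the constants are placeholder values, and the brute-force inner loop stands in for the spatial-hash neighbor scan for clarity.

```cpp
#include <cmath>
#include <vector>

struct Particle { float x, y, z; float density, pressure; };

const float h    = 0.1f;     // smoothing radius (placeholder value)
const float m    = 0.02f;    // particle mass
const float k    = 1000.0f;  // stiffness constant in the equation of state
const float rho0 = 1000.0f;  // rest density of water, kg/m^3

// Standard poly6 kernel: weights a neighbor by squared distance r2,
// falling smoothly to zero at the smoothing radius h.
float poly6(float r2) {
    float h2 = h * h;
    if (r2 >= h2) return 0.0f;
    float diff = h2 - r2;
    return 315.0f / (64.0f * M_PI * std::pow(h, 9)) * diff * diff * diff;
}

void computeDensityPressure(std::vector<Particle>& ps) {
    for (auto& pi : ps) {                 // one GPU thread per particle
        pi.density = 0.0f;
        for (const auto& pj : ps) {       // real code: spatial-hash neighbors only
            float dx = pi.x - pj.x, dy = pi.y - pj.y, dz = pi.z - pj.z;
            pi.density += m * poly6(dx * dx + dy * dy + dz * dz);
        }
        // Simple equation of state: pressure grows with compression.
        pi.pressure = k * (pi.density - rho0);
    }
}
```

Note that density must be complete for every particle before the force stage reads it, which is why the pseudocode keeps them as separate passes (separate kernel launches on the GPU).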

The Challenge

An SPH fluid simulation targeted specifically at pouring introduces a unique set of challenges and constraints that set this project apart from classic parallel SPH projects.

Warp Divergence & Load Imbalance

Pouring a thin stream of liquid from one cup into another creates a severe warp divergence problem: threads operating on sparse regions of the grid, such as the thin stream in mid-air, will likely finish their calculations far sooner than threads operating on denser regions, such as the point where the stream lands in the pool inside the receiving cup, or the dense fluid bodies inside both cups. In addition, liquid poured from cup to cup splashes significantly, which forces us to balance the tradeoff of how frequently we rebuild the spatial hash to maintain load balance.

Neighbor search and spatial hash rebuilding

SPH relies heavily on finding nearby particles efficiently. As noted above, particle distributions may change rapidly from dense fluid to a thin stream and then to splashing regions, which quickly makes the spatial grid outdated. Rebuilding it too often adds overhead, while rebuilding too infrequently makes neighbor searches slower and increases imbalance. Choosing when and how often to rebuild becomes a key challenge.

Memory Bandwidth & Cache Pressure

To push the memory hierarchy even further, we propose running a number of these simulations concurrently on the same GPU in batches (motivated by the need to evaluate multiple pour angles for MPC). While parallelizing independent trajectories is traditionally trivial, forcing them to share the cache and memory bandwidth while dynamically allocating spatial grids for SPH introduces severe cache thrashing and contention, which we will have to actively manage.

Goals and Deliverables

Plan to Achieve

Hope to Achieve

Fallback Goals

Performance Goals

Demo plan

Analysis Goals

System Capabilities and Expected Outcomes

Resources

Platform Choice

We will implement our system on the GHC lab machines, which provide NVIDIA GPUs with CUDA support. This platform is well suited for our workload because SPH fluid simulation is highly data parallel, with most computation occurring independently per particle. CUDA allows us to naturally assign particles to GPU threads and execute density and force computations in parallel at large scale. Additionally, GPUs are designed for high-throughput workloads with many concurrent threads, which matches the structure of SPH, where thousands of particles must be processed each timestep.

Our project specifically targets challenges such as irregular particle distributions, memory bandwidth pressure, and warp divergence, all of which are important performance considerations on modern GPU architectures. Using this platform allows us to both exploit the available parallelism and study how well the workload maps to the GPU’s execution and memory model.

Schedule

Milestone Report