CMU 15-418/618 (Spring 2013) Final Project Report
Parallel Fluid Simulation using SPH
Harry Gifford (hgifford) and Luo Yi Tan (luoyit)
Main Project Page

We implemented a fluid simulator using Smoothed Particle Hydrodynamics(SPH) on the GPU. We achieved a real-time simulation which performance scales linearly to the number of particles.

Smooth Particle Hydrodynamics is a way to simulate fluid flow, and has been used in many applications such as oceanography and astrophysics. It works by dividing the fluid into particles that have a certain distance between them, which is smoothed over by a kernel function to compute its physical properties and to calculate the forces that are acting upon each particle. This is done by dividing the particles up into grids and doing the calculations based on how the neighboring grids and particles interact with each other, and then force on each particle will then be calculated depending on it's neighbors. We do this because we can assume that particles that are too distant from a particular particle would not have any effect on it.

The actual computation of the physics of the particles was done using the Navier-Stokes equations:

Where v is the flow velocity, ρ is the fluid density, p is the pressure, T represents the drag term and f represents the external forces acting upon the fluid, eg: gravity.

Our implementation
Our CPU implemention is done using C++ and OpenGL. We first divide the space in which the fluid particles reside into grids, and determine which particles reside in which grid, and then compute the forces that are acting upon the particle based on the neighboring particles that are in its grid using the Navier-Stokes equation. Finally, we render the particles to the screen using OpenGL.

Our GPU implementation was done using CUDA, using the CUDA code examples from the NVIDIA website as starter code. The main SPH algorithm we used is based on this paper by Goswami et al: Interactive SPH Simulation and Rendering on the GPU. It makes use of CUDA data structures and Z-indexing to speed up the computation time of the interactions between the particles. Z-indexing is a way to generate a compact representation of the particles which allow us to store the particle positions in a 32 byte number while preserving the locality of that data by interleaving the one-dimensional representation of the positions.

Z-indexing in 3 dimensions

After we compute the Z-indices of the particles, we then use a radix sort from the thrust library to sort these indices based on the indices of the particles. After this, we compute the blocks for each particles where we will perform the computations in.

After this, we generate CUDA blocks which store an index to the first particle to the Z-index array, along with the particle count in each block. We then filter these blocks to remove the empty blocks, and then split the blocks with more than a certain number of particles based on the threads per CUDA block.

For each non-empty block in B, a CUDA block is generated in B' and launched with N threads (N = 4 here)

This brings the complexity of the algorithm to O(n^2) to O(mn), where n is the total number of particles and m is the number of neighboring particles. The grid size our implementation also changes depending on the number of particles to optimize the use of CUDA threads.

We then perform the physics computations(density and force and surface tension) and then collisions that act upon the particles. We do about 10 iterations each frame to improve the accuracy of the simulation. Finally, we render the particles using the resulting positions, and use a pointer to these positions to render them using OpenGL. We had researched some fluid rendering algorithms like marching cubes or raycasting to see if we would be able to use them to render on the GPU, but we unfortunately did not have the time to.

After we implemented this algorithm, we also attempted to improve our simulator by implementing a Hybrid SPH algorithm based on this paper: Hybrid Smoothed Particle Hydrodynamics, as it supposedly improved the stability of the simulation as the timestep increases, but we did not see a significant improvement when we tried to implement it.

All of our tests were run on an Intel Xeon W3520 with an NVIDIA GTX 670.
Serial implementation

Timing for serial implementation
Number of Particles Densities and Forces Collisions Rendering Total
100008.01 ms17.74 ms2.83 ms 28.58 ms
1500013.550 ms32.636 ms  4.366 ms 50.552 ms
2000020.80 ms53.71 ms 10.81 ms 89.686 ms
2500030.520 ms82.582 ms 113.103 ms 249.987 ms
3000041.063113.681 ms 155.491 ms 310.235 ms

Parallel implementation

Timing for parallel implementation
Number of Particles Data transfer Z-indexing Sorting Densities/Forces Collisions Rendering Total
1000010.716 ms0.0279 ms 2.460 ms 2.520 ms 0.004 ms 0.547 ms16.678 ms
1500010.688 ms0.027 ms 3.334 ms 1.680 ms 0.004 ms 0.531 ms16.322 ms
2000010.148 ms0.028 ms 3.257 ms 1.830 ms 0.004 ms 0.547 ms15.878 ms
250008.299 ms0.028 ms 3.261 ms 1.770 ms 0.004 ms 0.509 ms13.930 ms
300007.028 ms0.029 ms 3.363 ms 1.740 ms 0.004 ms 0.585 ms12.807 ms
10000012.999 ms0.028 ms 6.017 ms 1.770 ms 0.004 ms 0.707 ms21.586 ms
20000020.474 ms0.028 ms 9.13 ms 1.830 ms 0.004 ms 1.693 ms33.222 ms
30000027.749 ms0.029 ms 10.133 ms 1.860 ms 0.004 ms 2.365 ms42.203 ms
40000033.031 ms0.029 ms 11.112 ms 2.070 ms 0.004 ms 2.000 ms48.316 ms
50000040.473 ms0.029 ms 24.956 ms 2.820 ms 0.004 ms 0.751 ms56.552 ms
100000048.011 ms0.032 ms 26.037 ms 1.740 ms 0.005 ms 0.987 ms76.905 ms

When there are a lower number of particles, the constants affect the simulation lot more, which is why from 10000 - 30000 particles there is some fluctuation in the data. Also, some of the variations in timings are due to the positions of the particles, because when there are a large number of particles in the same grid, there would be a lot more calculations that need to be done on that particular grid. Investigating a technique to adapt to this would be a nice optimization we could make in the future. The density/force computation and the collision computation stay relatively constant as the number of particles increases.

As mentioned before, the performance of our implemention when there are a smaller number of particles is affected a lot more by constants, which explains the unusual noise in the beginning of the graph. In general, the time is takes to compute a frame scales almost linearly to the number of particles.

The x-axis denotes the number of particles

A large percentage of time is spent on transferring data to the GPU and sorting the particles for Z-indexing. We found it a little strange that the time spent on the data transfer increased as the number of particles increased, as we are only sending relatively constant values over to the GPU except for when the direction of gravity is changing. We can easily get around this by just sending the gravity direction instead of the entire struct that we store the constants in. From this, we determined that the actual bottleneck in the implementation would be the sorting of the particles, at least for a relatively large number of them (100000).

Here are some images of our CUDA implementation:
Rendering 300000 particles

Rendering 1 million particles

And here is a video of it in action:

It is possible to achieve real-time fluid simulation on the GPU, but fine-tuning it to be as optimized as possible proved to be quite a challenge. From our implemention, we realized that the main bottleneck in our performance is during the sorting the particles. Ideally if we could find a better way to decrease the time needed to sort the number of particles we would be able to achieve faster rendering times with a larger number of particles. The paper that we used as reference did do some optimizations there by reusing the memory where the sorting of the particles take place, but we could not think of a good way to implement this in the time we had.

Interactive SPH Simulation and Rendering on the GPU
Hybrid Smoothed Particle Hydrodynamics
Simple OpenGL and Particles examples from

Work done by each student
Both of us did roughly equal amounts of work, Harry did most of the base implementation while Luo did the timing and debugging and wrote most of the report.