Wen Jay Tan (wenjayt)
We are going to use CUDA to compute an N-body simulation with collisions using different algorithms. Our goal is to provide analysis of the speedup and scalability of various algorithms used to approximate the N-body simulation.
The N-body simulation is a force simulator that recalculates the force between every pair of bodies at each time step, resulting in O(N^2) calculations per step. Our version of the problem is further complicated by the existence of collisions, which means that individual bodies may have to communicate with one another to recalculate forces during a time step; this can severely limit the speedup obtainable by naively parallelizing over the individual bodies. There are ways to approximate the forces via algorithms like Barnes-Hut and the Fast Multipole Method to help scale the problem out. To calculate the force between two bodies, we will either use the gravitational force, Gm1m2/r^2, or apply no force between the two particles, in which case our simulation becomes a pure collision simulation. We will also try to add some form of friction, both between the particles and the air and between particles during collisions, in order to make this a stable system. In general, processing time is improved significantly by computing each time step in parallel, but the amount of parallelism is severely limited by the scaling of the algorithm we use to compute the time steps.
The challenge of this problem is minimizing the amount of communication between threads, especially during time steps where bodies collide with one another. We are especially interested in modeling interactions where there are a few collisions each time step, because this is where a body must recalculate where it is going to go based on the properties of the other ball. This means that a thread's ball may update not just its own values but the values of other balls too, which creates potential for race conditions and a myriad of other issues when trying to parallelize this problem.
We will be using various GPUs in order to see whether the problem scales properly with the number of available CUDA threads as well. The code base will most likely be written from scratch. In terms of literature, we do not have a specific reference to work from at the moment, but the N-body problem has been studied in great detail in computer science. We may consider using the latedays cluster to take advantage of the other GPUs on that system.
As part of the demo, we at least want to have some numbers showing how long a single tick takes to finish processing. We hope to build a working GUI that can show the balls in real time and also display how long each tick takes to process, but this is a stretch goal, as neither of us has extensive experience making such a GUI. However, having a GUI would make for a better presentation. Since our project is primarily an analysis project, we hope to learn about scaling and communication issues when calculating forces in the N-body problem. We don't know what kind of performance to expect, but we expect far from perfect scaling due to all of the communication issues in a dense system.
We are going to use C++. C is incredibly unfriendly for making GUIs, and is in fact generally unfriendly to any form of work. Since CUDA is primarily designed to work with C, C++, and Fortran, we think that C++ is the best option to code with, since neither of us knows Fortran either. We have an NVIDIA 940M on hand, which is helpful because it means we don't have to search for a GPU to use. We can also use the latedays cluster if we want to do analysis on other GPUs, but this may be difficult for GUI usage since it requires SSHing into the cluster.
So far, we have created the data structures for the N-body simulation and are almost done with the sequential version of the simulator. We have written code for advancing the time step and for graphics rendering. We created separate arrays for the x and y components of position, velocity, and force as an optimization for data locality. Currently, we are working out the physics for collisions and creating the GUI for the demo, which we expect to be done by this weekend. We are about half a week behind schedule due to other responsibilities during carnival weekend, but we expect to be back on track and to leave the next three weeks before the due date for optimizing the parallel version of the simulator and for the Barnes-Hut implementation.
For the presentation, we plan on having a demo of the simulator and some graphs produced from our analysis of the scalability of this program. We do not currently have any major concerns or preliminary results for our project.
So far, we have the serial and parallel versions of the naive algorithm, which calculates the force between every pair of bodies. We also have the serial version of Barnes-Hut, but we have not yet optimized it enough to make a valid speed comparison against the serial and parallel naive versions. Currently, we have a GUI, which we plan to show in our presentation, that displays how long various steps in the process take and how long it takes in total to compute a single tick. The GUI can also display the current state of the balls, but our final timing benchmarks will not run with the GUI because of potential interference between running CUDA code and displaying a frame using OpenGL.
We have the following videos demonstrating the GUI aspect of our code, run on a simulation of ~10k balls for both the serial and parallel code. We also plan on benchmarking our code against similar codes on the internet, which will be finished on Saturday.
For our 418 final project, we implemented serial and parallel versions of the N-body simulation using both the naive algorithm, which computes all interactions between particles without approximation, and the Barnes-Hut approximation algorithm. We used OpenGL to render the graphics for the simulation and used CUDA to parallelize our serial algorithms. For our deliverables, we will show videos of the simulation running on the different algorithms, along with graphs showing our analysis of the speedup between the serial and parallel versions of the code as well as how different approximation parameters affect our speedup. The machine we used to run our code has an i7 6th Gen CPU and an NVIDIA GeForce 940M graphics card. We found that the CUDA-parallelized version of our naive algorithm achieved a 6x speedup over our serial naive implementation. However, we found that CUDA imposes too much overhead when parallelizing the Barnes-Hut algorithm, and our serial Barnes-Hut and naive implementations actually perform better.
An N-body simulation is a simulation of a particle system under forces such as gravity and attraction. These simulations typically contain millions of particles and therefore present a scalability issue, as all particles have to update their force, position, and velocity parameters at each time step.
In our naive implementation, the particles are represented by integers from 0 to the total number of particles. There are separate arrays allocated for the x and y components of the position, velocity, and force of the particles in order to take advantage of cache locality. The key operations on these arrays are recalculating the force and updating the velocity and position based on the new force values at each time step. Since this application is a simulation, it does not take an input or produce an output; it opens a window where the particles are drawn by OpenGL and moves the particles to their new positions at each time step. The most computationally expensive parts of the algorithm are handling ball-ball collisions and the force calculations, since these are done between all pairs of balls. In order to allow for parallelization of ball-ball collisions, we had to alter the standard ball collision handling implementation, which is explained below. Wall collisions are also parallelizable, since wall collisions for each particle are independent of each other. Since the force calculations for each particle are independent of the calculations for other particles, they are inherently parallelizable. It is possible to use SIMD for the force calculations for the same reason.
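A minimal sketch of the structure-of-arrays layout described above; the struct and field names here are our own illustration, not the report's actual code.

```cpp
#include <vector>

// Structure-of-arrays layout: separate arrays for the x and y components
// of position, velocity, and force, indexed by particle number, so that
// sweeps over one component stay contiguous in cache.
struct ParticleArrays {
    std::vector<float> pos_x, pos_y;
    std::vector<float> vel_x, vel_y;
    std::vector<float> force_x, force_y;
    std::vector<float> mass;

    explicit ParticleArrays(std::size_t n)
        : pos_x(n), pos_y(n), vel_x(n), vel_y(n),
          force_x(n), force_y(n), mass(n, 1.0f) {}

    std::size_t size() const { return pos_x.size(); }
};
```

This layout also makes SIMD over the force arrays straightforward, since each component is a dense, contiguous array.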
In our Barnes-Hut implementation, the particle, position, velocity, and force arrays are allocated the same way as in the naive implementation. There is also a quadtree struct containing a size-4 array of quadtrees representing the node's children, as well as the mass, center of mass, and width data for the node. The key operations are inserting a particle into the tree, calculating the center of mass for each internal node in the tree, and calculating the force for a particle. These operations are explained further in the approach section. The implementations for ball and wall collisions are the same as in the naive implementation. Like the naive implementation, our altered implementation of ball collision handling, wall collisions, and the force calculations can be parallelized since they are independent of each other. Although it is possible to use SIMD for the force calculations, it is not easy due to the need for gather and scatter, which may hurt performance. The tree creation must also be done serially due to the way insertion collisions must be handled.
In the naive implementation, the force is recalculated between all pairs of particles in the system at each time step. For each pair, we compute the gravity term by dividing the product of the two particles' masses by the distance squared, compute their angle theta with respect to the x axis by taking the arctangent of the difference in positions, and find the force in the x direction as gravity*cos(theta) and the force in the y direction as gravity*sin(theta). After updating the force, each particle's velocity is updated by adding the time step delta times the particle's accumulated force divided by its mass, and each particle's position is updated by adding the time step delta times its velocity.
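The per-pair force update and the integration step above can be sketched as follows. The function names, the value of G, and the guard against coincident particles are our own assumptions; the report's actual code may differ.

```cpp
#include <cmath>

const float G = 1.0f;  // assumed gravitational constant

// Accumulate the gravitational pull of particle j on particle i:
// magnitude G*m1*m2/r^2, decomposed via theta = atan2 of the position
// difference, as described in the text.
void accumulate_force(float xi, float yi, float xj, float yj,
                      float mi, float mj, float& fx, float& fy) {
    float dx = xj - xi, dy = yj - yi;
    float r2 = dx * dx + dy * dy;
    if (r2 == 0.0f) return;               // skip coincident particles
    float gravity = G * mi * mj / r2;     // Gm1m2/r^2
    float theta = std::atan2(dy, dx);     // angle with respect to the x axis
    fx += gravity * std::cos(theta);
    fy += gravity * std::sin(theta);
}

// Integrate one time step: v += dt * F/m, then p += dt * v.
void step(float& x, float& y, float& vx, float& vy,
          float fx, float fy, float m, float dt) {
    vx += dt * fx / m;
    vy += dt * fy / m;
    x += dt * vx;
    y += dt * vy;
}
```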
We then handled collisions between each pair of particles. Since this is the most computationally expensive part of the simulation, we had to alter the standard ball collision implementation to support parallelism, since the standard version would create race conditions in a parallel setting. To make the collisions parallelizable, we analyzed the dependency tree between the balls and discovered that we need a total of 2 times the number of particles minus 2 iterations to finish computing all collisions between all pairs of particles. We find the highest-numbered particle that has finished collision handling by computing (iteration number + 1) / 2. As long as the current particle we are computing collisions for is below that highest finished particle, we can run its collisions in parallel with those of the other particles satisfying the condition. To compute the velocity and position of the balls after a collision, we simply solve the standard equations for an elastic collision using conservation of energy and momentum. To handle wall collisions, we simply test the position of each particle against the bounding box and negate the velocity if the particle is leaving the box.
For the ball to ball collisions, we adopt the following collision algorithm and apply it to the example below: For every ball pair (i,j), where i < j, we calculate the distance to move each ball such the balls do not collide anymore. Furthermore, we change the velocity of each ball depending on how far and which direction we move each ball. We consider each ball pair in ascending order of i, then j. For example, our algorithm for the image below would consider the collision between (0,1), then (0,2), then (0,3)... then (1,2), (1,3),... until we get to (5,6). Because the ball positions move in every potential collisions, we may get a race condition if we parallelize these computations blindly. For example, if we consider (0,4) before we consider (0,1), we could move ball 0 farther into ball 1, which increases the velocity (and reverses the direction) ball 0 would leave at the following tick.
To parallelize these computations, we adopt the following algorithm, which takes advantage of the dependency graph between collisions to minimize the span of computing them. We do the ball collision computations in runs. In the first run, we only calculate ball pair (0,1). In the following run, we can compute ball pair (0,2). On the third run, however, we run both (0,3) and (1,2). We are allowed to run the second ball pair in parallel in the third run because ball 1 has no more dependencies on balls before it: (0,1) has been computed, and ball 2 depends only on ball 1 for its calculations. If we follow this pattern, we add a new ball pair to compute per run after every odd-numbered run, and the span of the algorithm changes from O(N^2) serial pair computations to about 2n-3 of these parallelized runs.
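One way to enumerate the runs described above is to observe that pair (i, j) can be scheduled in run i + j: no two pairs with the same index sum share a ball, and every pair (i, j') with smaller j' lands in an earlier run. This enumeration is our reconstruction of the schedule, not the report's actual code.

```cpp
#include <utility>
#include <vector>

// runs[k-1] holds the ball pairs executed in parallel during run k.
// Pair (i, j), i < j, goes in run i + j; the last run is (n-2, n-1)
// at run 2n - 3, matching the 2n - 3 span quoted in the text.
std::vector<std::vector<std::pair<int, int>>> collision_runs(int n) {
    std::vector<std::vector<std::pair<int, int>>> runs(2 * n - 3);
    for (int i = 0; i < n; ++i)
        for (int j = i + 1; j < n; ++j)
            runs[i + j - 1].push_back({i, j});
    return runs;
}
```

For n = 4 this yields the runs {(0,1)}, {(0,2)}, {(0,3), (1,2)}, {(1,3)}, {(2,3)}, matching the pattern in the text.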
Since wall collisions are independent for all particles, we can do all wall collision handling in parallel. In the CUDA implementation, each thread handles the calculations for one ball. The parallel CUDA implementation of the naive algorithm produced a 6x speedup over the sequential implementation.
Rather than simply recalculating the force between all pairs of particles, the Barnes-Hut algorithm creates a quadtree representing the total area and the partition of the area each particle is in. Each node in the tree is either NULL, which means there is no particle in the quadrant; an external node, which represents the single particle in the quadrant; or an internal node, which means that the quadrant was further partitioned into another 4 quadrants. Each quadrant is partitioned into another 4 quadrants until each quadrant contains only a single particle. A graphic explanation is shown below. The empty circles represent internal nodes, the circles with numbers represent external nodes, and the branches with nothing are NULL.
For each particle, we find the quadrant it belongs in by testing the position of the particle against the walls of the 4 quadrants' bounding boxes. If the node representing that quadrant is NULL, then we can simply add the particle into that node and make it an external node. If the node representing that quadrant is an internal node, then we split that quadrant into 4 again, recompute the quadrant the particle belongs to in the smaller partition, and repeat until the node representing the correct quadrant is NULL. If the node representing that quadrant is an external node, then the particle in the external node is removed, the node is replaced with an internal node, and both the removed particle and the particle we are trying to insert are inserted into the subtree rooted at the new internal node. In our code, the tree is represented by a struct containing a size-4 array, one entry for each quadrant, the total mass, the center of mass, the particle number, and the width of the bounding box it represents. The total mass, center of mass, and width variables are only meaningful in internal nodes. If the particle number is -1, then the node is an internal node; if it is a number between 0 and the number of existing particles, it is an external node. At the beginning of each time step, we use a for loop to insert each particle into the tree, and at the end of each time step, we free the entire tree.
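The insertion procedure above can be sketched as follows. The node layout and names are our own assumptions (in particular, we store the external particle's position in the node to keep the sketch self-contained), and distinct particle positions are assumed so splitting terminates.

```cpp
// particle == -1 marks an internal node; a non-negative value marks an
// external node holding that single particle. A NULL child is an empty
// quadrant.
struct QuadNode {
    QuadNode* child[4] = {nullptr, nullptr, nullptr, nullptr};
    float cx, cy, width;         // center and width of this quadrant's box
    float mass = 0.0f;           // total mass of the subtree
    int particle = -1;
    float px = 0.0f, py = 0.0f;  // stored position when external
    QuadNode(float cx_, float cy_, float w) : cx(cx_), cy(cy_), width(w) {}
};

// Insert particle p at (x, y) with mass m into the subtree rooted at n,
// whose quadrant has center (cx, cy) and width w.
void insert(QuadNode*& n, int p, float x, float y, float m,
            float cx, float cy, float w) {
    if (n == nullptr) {            // empty quadrant: becomes an external node
        n = new QuadNode(cx, cy, w);
        n->particle = p; n->px = x; n->py = y; n->mass = m;
        return;
    }
    if (n->particle >= 0) {        // external: turn into an internal node and
        int old = n->particle;     // push the old particle one level down
        float ox = n->px, oy = n->py, om = n->mass;
        n->particle = -1;
        n->mass = 0.0f;
        insert(n, old, ox, oy, om, cx, cy, w);
    }
    n->mass += m;                  // internal: accumulate mass and recurse
    int q = (x >= cx ? 1 : 0) + (y >= cy ? 2 : 0);
    float off = w / 4.0f;
    insert(n->child[q], p, x, y, m,
           cx + ((q & 1) ? off : -off),
           cy + ((q & 2) ? off : -off), w / 2.0f);
}
```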
Each internal node contains total mass, center of mass, and width fields. The mass field is the sum of the masses of all particles in the subtree rooted at the internal node. The center of mass field is the center of mass of all particles contained in that subtree. The width field is the width of the bounding box for the quadrant the internal node represents. The mass field is computed during tree creation by adding the mass of each inserted particle to the appropriate internal nodes; the width field is also set during tree creation. To calculate the center of mass for the tree, we use a recursive function that sums each particle's position times its mass. At the end, we divide this sum by the total mass in the subtree to get the center of mass for the quadrant.
To compute the force for each particle, we have a constant theta that controls the degree of approximation. Given a particle, in our recursive function, if the node is an external node holding a particle other than our given particle, we calculate the attractive force between the two particles as in our naive implementation. Otherwise, the node is internal, so we divide its width by the distance from our given particle to the node's center of mass; if this ratio is less than theta, we calculate the force between our given particle and the node's center of mass. This is where the approximation happens, since we have grouped the particles together into one entity in the force calculation. If the ratio is not less than theta, we recurse into the internal node and calculate the forces individually, as in the naive implementation. Therefore, if theta is equal to zero, the Barnes-Hut algorithm reduces to the naive algorithm.
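The opening test above can be sketched as a small predicate (the function name is our own). Note that with theta = 0 the test never passes, which is why the algorithm degenerates to the naive all-pairs computation.

```cpp
#include <cmath>

// An internal node may be approximated by its center of mass when
// width / distance < theta, where distance is from the given particle
// to the node's center of mass.
bool can_approximate(float node_width, float dx, float dy, float theta) {
    float dist = std::sqrt(dx * dx + dy * dy);
    return node_width / dist < theta;
}
```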
After calculating the forces and updating the position and velocities, the ball collisions and wall collisions are done the same way as the naive algorithm.
In parallelizing the Barnes-Hut algorithm with CUDA, each thread handles the calculations for one ball. We realized that we can parallelize the force computation over all particles, since each particle's force calculation does not alter the tree and is independent of the others. The tree creation in each time step still has to be done serially due to insertion collisions. The ball collisions can also be parallelized, since they use the same implementation as in the naive version, and wall collisions can be done in parallel since they are independent of each other. The center of mass calculations, however, cannot be parallelized, since each depends on the centers of mass of other subtrees. Our program also crashes at 3200 particles because we run out of memory on our GPU. Analyzing the performance of the serial and parallel implementations of Barnes-Hut, we concluded that CUDA presented too much overhead for the parallel implementation, making the serial implementation faster.
We present the following graphs, which show the performance of the serial versus parallel naive algorithms, the serial versus parallel Barnes-Hut algorithm, and the cost of executing each portion of our algorithms.
For our naive code, we run up to 10-20 times faster when we have 5120 balls to compute, and up to a 6 times speedup for 2560 balls. This is most likely due to the parallelism of the force computation. For the serial code, up to 2.5 seconds were spent computing the force for 5120 balls; in the parallel code, this force computation takes only a few milliseconds. Instead, most of the time is spent computing ball-to-ball collisions, which is a fairly expensive operation. We also created another metric measuring how many balls could be simulated at 10 FPS, which would make our GUI run in 'real time'. The parallel algorithm handled up to 3000 balls at 10 FPS, whereas the serial code could only manage about 600 balls.
Our parallel speedup for the naive code was limited partly by the size of the datasets we worked with. In this application, we only consider data up to 5280 balls, which is about the limit for rendering frames at 1 FPS in the parallelized code. However, we can see that as we increase the number of balls, the serial code's execution time outpaces the parallel code's, resulting in a far bigger speedup as we add more balls. We believe that with even more balls, we could continue to realize larger speedups at the cost of rendering our simulation with a UI. Also, most research applications of N-body simulations only consider point masses. This means collisions are ignored, which results in much less computation. As we can see in the third graph, the serial code is mostly limited by the force calculations; in the parallel version, the force calculation finishes almost immediately, and we are instead limited by the ball-ball collision detection. If we only considered point masses, we would easily be able to realize a 100x speedup or more, even on smaller datasets.
For the Barnes-Hut code, the overhead of creating the tree far outweighs the speedup we can get for the datasets we are interested in. Up to 3000 balls, it is simply more efficient to run the Barnes-Hut code on a single thread than to deal with building the tree for CUDA and working on it in parallel. In addition, the tree grows very quickly relative to the number of balls, which results in a crash from running out of space on the GPU; this occurred when we tried to test even further than our second graph shows.
Overall, our choice of target machine was sound. N-body simulation remains a massively parallelizable application, which means that CUDA's ability to execute very many threads comes in handy for achieving a sizeable speedup on larger datasets. Videos of our UI can be found above in the second progress report section.
Both students did an equal amount of work on the entire project.