CMU 15-418/618 (Spring 2013) Final Project Report
Implementing a parallel CPU-based raytracer
Lingzhang Jiang
Main Project Page

Summary

I started from an existing basic serial raytracer that I wrote myself (with the help of starter code) for Computer Graphics (15-462), implemented a bounding volume hierarchy for it, and subsequently modified the code to run on multiple cores using OpenMP. All versions were run on a 6-core Gates 3000 cluster machine (supporting 12 virtual cores).

Background

Ray tracing is a method used for rendering graphics.

The difference between ray-tracing and rasterization (the traditional rendering method) is that ray-tracing takes a per-pixel approach while rasterization takes a per-object approach. The advantage of ray-tracing is that some characteristics of a scene, such as shadows and reflections, can be computed more accurately. The disadvantage is that it is usually slower.

The basic idea (in pseudocode) is as follows:

for each pixel do
  compute viewing ray
  find first object hit by ray and its surface normal n
  set pixel color to value computed from hit point, light, and n

My raytracer is a backward raytracer, whereby each ray is traced from the eye rather than from the light source. It takes a scene file as input and outputs a .png image rendered by raytracing. During raytracing itself, an "eye ray" is shot from the camera through each pixel into the scene. Intersection tests of the ray against the objects in the scene have to be done in order to determine the color of that pixel. For my raytracer, each scene consists of 3 different kinds of geometry: triangles, spheres, and models. A model is a list of triangles based on a mesh of vertices. Because the models in my available scenes made up by far the largest part of the scene, I implemented a BVH for these models.
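
To make the pseudocode above concrete, the serial per-pixel loop looks roughly like the sketch below. All names here (computeEyeRay, findClosestHit, shade, and the small structs) are illustrative stand-ins, not the raytracer's actual identifiers, and the helper bodies are placeholders.

#include <vector>

struct Color { float r = 0, g = 0, b = 0; };
struct Ray   { float origin[3] = {0, 0, 0}; float dir[3] = {0, 0, -1}; };
struct Hit   { bool found = false; /* hit point, surface normal, material, ... */ };

Ray   computeEyeRay(int x, int y, int w, int h) { return {}; } // camera ray through pixel (x, y)
Hit   findClosestHit(const Ray& ray)            { return {}; } // intersection tests against the scene
Color shade(const Hit& hit)                     { return {}; } // color from hit point, light, and normal

void raytrace(std::vector<Color>& image, int width, int height) {
    for (int y = 0; y < height; ++y) {
        for (int x = 0; x < width; ++x) {
            Ray ray = computeEyeRay(x, y, width, height);  // "eye ray" from the camera
            Hit hit = findClosestHit(ray);                 // first object hit by the ray
            image[y * width + x] = shade(hit);             // set the pixel color
        }
    }
}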

The BVH is a bounding volume hierarchy, essentially a binary tree in which each node is an Axis-Aligned Bounding Box (AABB) and each leaf contains a pointer to one MeshTriangle object. The BVH supports two operations: buildBVH, which builds a BVH from a model and returns a pointer to the root, and searchBVH, which returns the closest intersection for a given ray using a tree traversal starting from the root.
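
A sketch of what such a node and traversal might look like is shown below. The structure and names (BVHNode, searchBVH, hitsBox, intersectTriangle, HitRecord) are illustrative, not the project's actual code, and the two intersection helpers are left as placeholders (the box test is described with Figure 1 below).

struct Ray          { float o[3]; float d[3]; };             // origin and direction
struct MeshTriangle { float v[3][3]; };                      // one triangle of a model's mesh
struct HitRecord    { float t = 1e30f; const MeshTriangle* tri = nullptr; };

struct BVHNode {
    float minX, maxX, minY, maxY, minZ, maxZ;                // the axis-aligned bounding box
    BVHNode* parent = nullptr;                               // pointers to parent and children
    BVHNode* left   = nullptr;
    BVHNode* right  = nullptr;
    const MeshTriangle* leafTri = nullptr;                   // non-null only at a leaf
};

bool hitsBox(const Ray& r, const BVHNode& n)                              { return true;  } // placeholder box test
bool intersectTriangle(const Ray& r, const MeshTriangle& t, HitRecord& h) { return false; } // placeholder

// Record in `best` the closest triangle hit found in this subtree.
void searchBVH(const BVHNode* node, const Ray& ray, HitRecord& best) {
    if (!node || !hitsBox(ray, *node)) return;               // prune subtrees whose box the ray misses
    if (node->leafTri) {                                     // leaf: test its single triangle
        HitRecord h;
        if (intersectTriangle(ray, *node->leafTri, h) && h.t < best.t) best = h;
        return;
    }
    searchBVH(node->left, ray, best);                        // interior node: descend into both children
    searchBVH(node->right, ray, best);
}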

Figure 0: Perf report


As can be seen from the profiling analysis above, done with perf on an intermediate version of the raytracer, the most computationally expensive parts of the raytracer are the searchBVH operation and the bounding box tests. Given that I am working with static scenes, the cost of building the BVH itself is trivial compared to the searches, as it only has to be built once. While it is difficult to parallelize a single search of the BVH, it is intuitive to divide the pixels that are raytraced among different cores, which is an effective way to speed up the entire program.

At a fundamental level, the rays can all be traced in parallel, as they have no dependencies on each other. However, using a BVH implies some preprocessing: the BVH must be constructed before any raytracing can be done.

In addition, tree traversal of a BVH is inherently sequential, as the decision of whether to descend into a child bounding box depends on whether or not the ray intersects its parent's bounding box.

A frequently used concept in efficient ray-tracing is that of coherent rays: rays that traverse similar paths through the scene, hitting the same objects and so on. Coherent rays also take similar paths through the tree, so exploiting this coherence benefits the performance of a BVH search; in other words, parallel performance can be improved by exploiting locality.

Approach

To start with, I had a serial raytracer that I wrote for 15-462 Computer Graphics. It did not have any acceleration data structure for intersection tests; instead it used a naive for loop to check against all objects. The raytracer mainly uses C++ and OpenGL to render scenes. In my final program, I also make use of OpenMP to incorporate parallelism.

To get started, I have 6 simple scenes as well as 1 complex scene, as listed below:

Scene         Triangles   Spheres
Cornell Box          10         2
Spheres               0        11
Test                  1         2
Cube                 14         0
Stacks               62         1
Tetrahedron           4         4
Toy               55542         0

As some of them do not contain models while others simply have too few objects, I decided to use stacks.scene and toy.scene for my analysis. These scenes are shown below:

stacks.scene


toy.scene


Figure 1: BVH node


As seen in Figure 1, each BVH node contains the information required to carry out a bounding box test as well as pointers to its parent, left child, and right child. Because of the "axis-aligned" property of the bounding boxes, I could use the 6 values in Figure 1 to define 3 pairs of axis-aligned planes whose intersection is the bounding box. Using these planes, I implemented a simple bounding box test that checks whether a given ray passes through the intersection of the above-mentioned planes.
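
A minimal sketch of such a test (a standard "slab" test) is below, assuming a ray parameterized as origin + t * direction; the names and exact structure are illustrative rather than the project's actual code.

#include <algorithm>
#include <limits>

struct Ray  { float o[3]; float d[3]; };                      // origin and direction
struct AABB { float lo[3]; float hi[3]; };                    // the 6 values from Figure 1, grouped per axis

// Slab test: clip the ray's t-interval against each pair of axis-aligned planes;
// the ray passes through the box iff the three clipped intervals still overlap at some t >= 0.
bool hitsBox(const Ray& r, const AABB& b,
             float tMax = std::numeric_limits<float>::max()) {
    float tEnter = 0.0f, tExit = tMax;
    for (int axis = 0; axis < 3; ++axis) {
        float invD = 1.0f / r.d[axis];                        // +/- infinity if the ray is parallel to this axis
        float t0 = (b.lo[axis] - r.o[axis]) * invD;
        float t1 = (b.hi[axis] - r.o[axis]) * invD;
        if (invD < 0.0f) std::swap(t0, t1);                   // keep t0 as the nearer plane
        tEnter = std::max(tEnter, t0);
        tExit  = std::min(tExit, t1);
        if (tEnter > tExit) return false;                     // intervals no longer overlap
    }
    return true;
}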

The next challenge was how to efficiently parallelize the code. As observed from Figure 0, a very significant amount of time was spent in searchBVH. As this function requires a fair amount of information to be fetched from memory, I realized that bandwidth might become a problem when more processors came into play. Eventually, I arrived at the final parallel implementation in 3 main stages.

Stage 0 - No BVH

Just to have an idea of baseline performance, I recorded the results of running a parallel version of the non-BVH raytracer on a single core and on 6 cores. The first thing I did was to insert a timer at the start and end of the raytrace, as sketched below.
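
A minimal sketch of that timing harness, assuming std::chrono (the actual code may have used a different timer) and a stand-in entry point raytraceScene:

#include <chrono>
#include <cstdio>

void raytraceScene() { /* trace every pixel of the scene (stand-in for the real entry point) */ }

void timedRaytrace() {
    auto start = std::chrono::steady_clock::now();            // timer at the start of the raytrace
    raytraceScene();
    auto end = std::chrono::steady_clock::now();              // timer at the end
    std::printf("raytrace took %.3f s\n",
                std::chrono::duration<double>(end - start).count());
}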

Figure 2: Stage 1 - From Serial

Figure 2 sums up Stage 1, my first attempt to parallelize the raytracer (not all code is shown). In the serial implementation, the outermost for loop advances a row counter by a fixed increment STEP_SIZE, which was set to 8, every iteration, while the inner for loops iterate through the rows and columns respectively. It was a simple matter to insert an OpenMP parallel for directive to divide the rows among the processors. However, this did not lead to a satisfactory result, as will be detailed later.
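
A rough sketch of this structure, with illustrative names (tracePixel stands in for the per-pixel raytracing routine; the actual code in Figure 2 differs): the outer loop still advances by STEP_SIZE rows, and OpenMP divides the rows of each block among the threads.

#include <algorithm>

const int STEP_SIZE = 8;                                      // rows handled per outer iteration

void tracePixel(int x, int y) { /* compute eye ray, intersect, shade (stand-in) */ }

void raytraceStage1(int width, int height) {
    for (int y0 = 0; y0 < height; y0 += STEP_SIZE) {          // serial outer loop over row blocks
        int yEnd = std::min(y0 + STEP_SIZE, height);
        #pragma omp parallel for
        for (int y = y0; y < yEnd; ++y) {                     // at most 8 rows shared by all threads
            for (int x = 0; x < width; ++x)
                tracePixel(x, y);
        }
    }
}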

Figure 3: Stage 2 - by Pixel

In Stage 2, I attempted to partition the work finely into individual pixels and divide those among the processors, leaving OpenMP to do the scheduling, roughly as sketched below. Again, the result was not what I had hoped for.
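
A sketch of that partitioning, under the assumption that the per-pixel work was handed to OpenMP's dynamic scheduler (names are illustrative):

void tracePixel(int x, int y) { /* compute eye ray, intersect, shade (stand-in) */ }

void raytraceStage2(int width, int height) {
    // Every pixel is its own unit of work; schedule(dynamic, 1) is an assumption
    // about how the scheduling was left to OpenMP.
    #pragma omp parallel for schedule(dynamic, 1)
    for (int p = 0; p < width * height; ++p)
        tracePixel(p % width, p / width);
}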

Figure 4: Stage 3 - Dynamic Scheduling

In Stage 3, I implemented a simple dynamic scheduler with a shared variable. As can be seen in Figure 4, which shows an example with 4 threads, I initially divided half of the to-be-raytraced image equally among the 4 threads; after that, rows were given to threads on a per-request basis. This had the best performance in terms of both throughput and speedup among the 3 implementations.
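
A sketch of this scheme, assuming a shared atomic row counter and illustrative names (the actual shared-variable mechanism in my code may differ):

#include <algorithm>
#include <atomic>
#include <omp.h>

void traceRow(int y, int width) { /* trace every pixel in row y (stand-in) */ }

void raytraceStage3(int width, int height) {
    std::atomic<int> nextRow(height / 2);                     // first row not covered by the static half

    #pragma omp parallel
    {
        int tid      = omp_get_thread_num();
        int nthreads = omp_get_num_threads();

        // Static phase: each thread traces a contiguous block from the first half
        // of the image, giving good locality and no scheduling overhead.
        int blockSize = (height / 2 + nthreads - 1) / nthreads;
        int begin     = tid * blockSize;
        int end       = std::min(begin + blockSize, height / 2);
        for (int y = begin; y < end; ++y)
            traceRow(y, width);

        // Dynamic phase: pull single rows from the shared counter on request,
        // so no thread sits idle while others still have work.
        for (int y = nextRow.fetch_add(1); y < height; y = nextRow.fetch_add(1))
            traceRow(y, width);
    }
}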

Results

Figure 5

Figure 5 details the basic results I obtained from running my initial code without a BVH. Note that the time taken for toy.scene was left out because it was much higher than for every other scene: it turned out to be 2529.76 s on a single core and 658.859 s on 6 cores. Of course, this performance left much to be desired.

Figure 6

Next, using the approach described in the previous section, I implemented a bounding volume hierarchy built from axis-aligned bounding boxes (AABBs).

Analyzing the results, the speedup is essentially linear up to 4 cores. However, even though the machine has 6 cores, the speedup falls off completely beyond 4 cores. Upon re-examining the code, I realized that this was due to how the loop was structured. Because only 8 rows are raytraced in each outer iteration, any thread count that is not a factor or multiple of 8 gives less than optimal speedup: with 6 threads, for example, at least one thread must trace 2 of the 8 rows, so each iteration takes as long as 2 rows and the speedup is capped at about 4x, which matches the observed falloff.

Figure 7

The most direct way to correct this, I thought, was to simply break the work down into the smallest possible units (single pixels) and distribute them dynamically among threads. Clearly, according to the actual speedup graph in Figure 7, this did not work well. While I did not run a profiler to confirm it, I think the reason is the essentially arbitrary sequence of pixels each thread ends up processing: neighbouring rays no longer share data, which leads to more cache misses, more requests going out to memory, and ultimately a bandwidth bottleneck.

Figure 8

My final implementation, as described previously, uses an approach more similar to Stage 1 but ensures good locality and even workload distribution by assigning an initial large block to each thread, and then single rows once the first block is complete. As evidenced by the graphs in Figure 8, this implementation did better than the previous two in terms of both performance and speedup.

Ultimately, I think bandwidth limits will still cap the speedup beyond a certain number of cores. The fact that there is non-trivial speedup beyond 6 threads (and a falloff in speedup before 6) suggests that a fair amount of latency hiding is taking place.

The next stage would be to implement a packet-based raytracer that groups coherent rays into a single ray packet and traces the packet instead of single rays. Multiple rays would then traverse a single path of the BVH at once, so more computation is done per unit of data fetched from memory. This should theoretically improve both the throughput and the parallelism of the program. While I did manage to get started, I was unfortunately unable to complete this stage before the deadline.
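
As a rough sketch of the idea (not of the code I started), packet traversal could look like the following: a node is visited if any ray in the packet hits its bounding box, so each node is fetched from memory once and amortized over the whole packet. All names, and the placeholder intersection helpers, are illustrative.

#include <vector>

struct Ray          { float o[3]; float d[3]; };
struct MeshTriangle { float v[3][3]; };
struct HitRecord    { float t = 1e30f; const MeshTriangle* tri = nullptr; };
struct BVHNode {
    BVHNode* left = nullptr;
    BVHNode* right = nullptr;
    const MeshTriangle* leafTri = nullptr;
    /* axis-aligned bounds ... */
};

bool hitsBox(const Ray& r, const BVHNode& n)                              { return true;  } // placeholder slab test
bool intersectTriangle(const Ray& r, const MeshTriangle& t, HitRecord& h) { return false; } // placeholder

// Descend as long as ANY ray in the packet hits the node's box; the node is
// read once for the whole packet, increasing computation per byte fetched.
void searchBVHPacket(const BVHNode* node, const std::vector<Ray>& packet,
                     std::vector<HitRecord>& hits) {
    if (!node) return;
    bool anyHit = false;
    for (const Ray& r : packet)
        if (hitsBox(r, *node)) { anyHit = true; break; }
    if (!anyHit) return;

    if (node->leafTri) {                                       // leaf: test the triangle against every ray
        for (size_t i = 0; i < packet.size(); ++i) {
            HitRecord h;
            if (intersectTriangle(packet[i], *node->leafTri, h) && h.t < hits[i].t)
                hits[i] = h;
        }
        return;
    }
    searchBVHPacket(node->left, packet, hits);
    searchBVHPacket(node->right, packet, hits);
}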

CPU/GPU?

A GPU usually has more computing power and bandwidth than a CPU. However, the overhead of sending the required information to the GPU for computation may outweigh the performance gain. On the other hand, an efficient algorithm for grouping coherent rays into packets for a BVH-accelerated raytracer would be well suited to a GPU, which favors convergent thread execution. If done correctly with the proper optimizations, I believe that a BVH-based parallel raytracer would have better overall performance on an average GPU than on an average CPU.

References

Quadgnim: A guide on how to build a simple BVH
Ingo Wald, Solomon Boulos, and Peter Shirley: Ray Tracing Deformable Scenes Using Dynamic Bounding Volume Hierarchies