Final Report:
CPU-Based Parallel Ray Tracer
Abhishek Chugh, Chenxi Liu

Project Main Page

Project Proposal

Checkpoint Report

Project Summary
We have implemented a CPU-based fast ray tracer for dynamic scenes. To achieve this, we build a spatial data structure, BVH in our case, in parallel; and packetize the ray traversal. All the experiments were run on the GHC 3000 machines which are 6-core hyper threaded. At the parallelism competition we demonstrated our ray tracer rendering animated scenes at basic interactive rate.
Ray tracing is a technique for rendering 3D scene onto the screen using ray casting and physics law. A ray tracer simulates a real-world camera in a backward way. Instead of accumulating lights from light sources, a ray tracer shoots rays to every pixel on the screen and checks whether the rays eventually return to light sources. When tracking a ray, the ray tracer compute the color of corresponding pixel based on objects encountered along the path.

Figure 1. Backward Ray Tracing

A main task involved in ray tracing is visibility test, that is, checking if the ray hits an object first. The visibility test is increasingly costly when the number of objects grows. To avoid expensive test on each step, data structures for objects are often introduced to reduce computations. Several examples are BVH, kd-tree, and octree.

A ray tracer with the spatial data structure has two working phases. Phase one is maintaining the data structure, which may require to build a whole data structure from scratch or adjust the structure for changed scene or view point. Phase two is doing the regular ray tracing with the data structure. This phase could be further broken into smaller steps, such as ray casting and color computing.
Building BVH in Parallel
The BVH are widely used to accelerate intersection computation for ray tracing and have been shown to perform better than other common spatial structures for many types of scenes. A BVH is a binary tree structure on a set of primitives, like spheres and triangles, in our scene. Each node of the tree contains a bounding box (axis-aligned in our implementation) of a primitive subset; the root of the tree contains the bounding box over all the objects in the scene. Each internal node splits its set of primitives into two disjoint sets which are inherited by its two children. There are multiple choices on how to partition the primitives into two child nodes. We partition the primitive using the surface area heuristic, which estimates the computational expense of performing ray intersections tests, including the time spent traversing the nodes of the tree and expected time spent on ray-primitive intersection tests for a particular partition.

Figure 2. Backward Ray Tracing

Building a BVH from scratch is a typical divide-and-conquer process. There are two kinds of work that could be potentially improved by parallelism. At the early stage of BVH construction, work inside each node could be parallelized, which involves creating the bounding boxes for internal nodes and choosing best split based on SAH values for different splitting positions. As the number of nodes grows, the number primitive inside each node drops. This change reduces the performance gain that could be achieved by node-level parallelization. Thus, a possible solution is to switch work granularity from primitives to nodes. The dependency during construction is between parent nodes and their children. This dependency limits the subtree-level parallelization at the early stage but becomes negligible when more unprocessed nodes are exposed.

Figure 3. Different subtrees can be done in parallel
Figure 4. Internal Operations of a node can be performed in parallel

Our sequential implementation of BVH is borrowed from the PBRT book. The basic process inside one node is described as follow. First, the process builds the large bounding box containing the current node. Then, it starts to apply SAH. A bounding box of centroids is built to determine the axis with the maximum diversity. Along this best axis, the centroid bounding box is equally divided into multiple bins and primitives are hashed into corresponding bins based on their centroids. With these bins, the actual SAH computation is done by splitting along each bin's boundary. The split with minimum cost is chosen. After the split is determined, primitives inside the current node is partitioned into two parts on each side of split.
Packet Tracing
Ray tracing contains a large amount of independent jobs, since each ray could be handled in parallel. A trivial implementation is to assign each thread a ray. However, the performance could be further improved by using SIMD. Instead of one ray, a thread could process a ray on each vector lane. This very idea is the motivation for packet tracing, which uses a ray packet as the basic unit for rendering. There is still a problem, which is known as ray incoherence. This incoherence happens when we go from primary rays to secondary rays. The incoherence would cause inefficient use of vector lanes since a subtree needs to be descended even when only a few rays hit its root.
We started with a serial ray-tracer that we developed for the Computer Graphics course (15-462).
Building BVH in Parallel
Since our aim is to render changing scenes efficiently with an acceptable frame rate, it is important to develop a spatial data structure with high quality to avoid applying intersection detection with all objects in a scene. We borrowed the sequential implementation of BVH from the PBRT book[ref] and began to parallelize the construction.

The first approach we tried was to use different threads to do work for different nodes. Whenever a node is split it produces two pieces of parallelizable work. We used OpenMP to exploit parallelism with multiple threads. A critical issue affecting the speedup was the scheduling policy. We tried a few of them before settling on one.

Scheduling Policies:
  1. Single Queue: We tried the simplest strategy first by maintaining a single queue. Each thread working on a node would put two of its children on the queue. Whenever, a thread became free it would pick up a task from the queue and work on it. This approach, however, had severe high contention among the threads for a obtaining a task because of the single queue. We noticed that over 15% of the time was being spent to waiting for the lock of the queue.

  2. Multiple queues: The next obvious thing to do was to split and maintain separate queues for each thread. Whenever a thread's queue is empty, it would obtain a task by stealing a task from another queue. This reduced the contention to a great extent. Since each thread had its own queue, the threads would obtain task from the back of their own queue and steal from front of another. This would not only avoid contention between the owner thread and stealing thread for obtaining a task but also lead to higher cache hits since the memory locations required for tasks at the back of the queue would be closer to the ones at the front of the queue. A drawback of this method is the complexity of implementing the fine-granularity lock for dequeues.

  3. Multiple queues with a priority queue: Though the above scheme provides for low contention for obtaining the work, the size of the work (determined by the number of primitives in the node) obtained on each steal may be smaller than the largest piece of work available in the system. This would lead to higher number of steals in the system, which would reduce the performance severely with coarse locks. Thus, to ensure that each free thread necessarily obtains the largest chunk of work available in the system, we also maintain a priority queue. Each thread only steals a task from the priority queue. Though this leads to the presence of a task in two queues (the priority queue and the owner's thread queue) needs to be carefully handled, this scheduling strategy further improved the build time of the BVH (~5-10%).
Though this provided us decent results (3.3x using 6 threads on GHC 3K machines), a major factor limiting speed up was the lack of nodes available initially. This was significant limitation because this led to threads being idle for about 20% of time during the BVH build. Apart from the tree level parallelism, there are parallelism opportunities for work inside a node. This "Node-level parallelism" involves building bounding boxes and calculating the costs of split boundaries in parallel. Thus, the benefit of this parallelism is prominent only when the number of primitives handled by the node is high. Therefore, to obtain performance improvement in both cases, we use a hybrid solution.

Initially, all available threads work on building the same node while exploiting the Node-level parallelism. This means threads are synchronized and process nodes one by one. When number of nodes available for work is greater than the number of available threads, we switch to using one thread per node and building multiple nodes in parallel. The work is assigned such that a thread works on nodes that are close to each other - to maintain good spatial locality. Using this combined approach we achieve a speed-up of ~3.9x using 6 threads, on GHC 3K machines.

At this point we made an interesting observation in the system. On detailed profiling, it was determined that a major bottleneck in the system was the simple memory allocation of new BVH nodes. In fact it accounted for nearly 40% of the build time! Thus, we built a memory pool which would occasionally allocate a large memory space and provide parts of it to the new nodes. Though this doesn't affect the speedup much, it definitely improved the running time to a great extent.

The next step to improving the performance of the system was to make use of the multiple vector lanes of the system using ISPC. At first we naively converted the most time consuming sections of the code into ISPC code. As expected, the multiple scatters and gathers required, instead of improving the performance, severely degraded the performance. After this we completely restructured architecture and replaced all array of structures with structures of arrays (SOA), though we were able to get rid of all gathers and scatters, we still only achieved a 20% speedup over the non-ISPC version. We believe there could be two reasons for this. Firstly, the operations are I/O bound. The number of compute operations per load is small (less than 3 for some functions). Secondly, for each node we perform there are multiple ISPC function calls. The overhead of these functions calls may have neutralized the benefit achieved by using multiple lanes. This reasoning is also supported by the fact that the "node-level parallelism" which works on larger sized nodes (on average) achieves a greater speedup with ISPC than the "subtree-level parallelism".

There is another solution for "node-level parallelism" based on parallel partition. Since we observed that partitioning the primitives after each split is fully serialized in the node-level stage, we tried to use GNU's parallel partition inside high-level nodes. This library requires the data structure to be arrays of structures (AOS). As mentioned before, this layout is not very SIMD-friendly. We could only use 3/4 of vector lanes by assigning each lane one dimension. The result shows this approach is slightly worse that the first approach. The trade off between parallel partition and efficient vector lanes usage is not obvious. Therefore, we chose to stick to our first approach.
Packet Tracing
The operation of performing ray tracing itself is embarrassingly parallel. We can simply assign each pixel of the screen to be performed by an individual thread and using a good scheduling strategy achieve near linear speedup. Our goal here, thus, is to use better algorithms to achieve better performance than that. We are also trying to use the vector lanes of the CPU to get a better speedup.

The goal of packet tracing is to amortize the cost of tracing large packets with spatially coherent rays. It involves evaluating multiple rays at once. The normal ray tracing algorithm loops over all pixels on the screen, and for each pixel loops over all scene objects to find intersections. In this scheme, we group a block of screen pixels together and trace a frustum through the scene rather than individual rays. If the frustum does not intersect an object, then none of its interior rays will either, and the ray tracer can safely skip those rays in its computations. We follow the approach of [Ingo Wald ref], in which we also keep track of the first active ray in the packet as we move down the tree. This avoids many ray-to-bounding-box checks as we traverse through the tree when the packet size is large enough. By large enough, we mean the size of a packet is several times larger than the vector width so we could save one or more SIMD calls while sacrificing efficiency in one call.

We use packet tracing for primary and shadow rays. This is because reflected and refracted rays are no longer coherent and do not provide any further improvement. We use frustum check for primary rays only. This is because creating a frustum for shadow rays is not only challenging but also not very useful since the frustum formed for (divergent) shadow rays would be very wide and would not provide much performance benefit. Yet by packetizing shadow rays, we could still benefit from the usage of SIMD. For the SIMD part, each lane takes a ray and do the same operation against one primitive or bounding box.

Apart from the frustum and SIMD, we are using another strategy. Even though we start with coherent rays at the top of the tree, as the node becomes small, the rays can be divergent. We keep records of the first active ray in the packet. So we don't need to execute a SIMD call on the inactive rays at the beginning of the packet. As figure 5. shows, we only check the red or active rays in the middle.
Figure 5. Recording the Active Rays (red) Saves SIMD Calls
Combining all the strategies we take, the BVH traversal for packets could be described as follow.
  1. Check the first active ray in the packet against an interval node. If the active ray hits the box, recursively check against its children. Push the node togethter with the index of the first active ray into the stack.
  2. If the active ray misses the box, test collision for the frustum. If the frustum misses the box, all rays in the packet must miss this subtree. Trace back with the node and first active at the stack.
  3. If the first active ray changes, use SIMD to get the current first active ray. Instead of testing all the rays, the SIMD call returns once an active ray is found.
Once the packet arrives a leaf, use the current first active position to test collision with SIMD.
The experiments are made for Happy Buddha scene, which contains about 1M triangles. The environment is a GHC 3000 machine, which has the Six-Core Xeon CPU or twelve virtual processors and eight vector lanes with sse4-x2. We would show the performance profile first and then we would present the speedups. At last, we would provide current result for packet tracing.
Running Time Profile
We separate the main process of BVH construction into four parts: data initialization, node-level parallelization, subtree-level parallelization, and tree flattening. Data initialization is to assign values for supporting data structures based on the primitives. Node-level parallelization and subtree-level parallelization are doing the actual construction in parallel, which have been described in details in previous section. Tree flattening is converting the tree structure into a linked list based on depth-first order. This part is currently done sequentially. Without detailed timer, the performance for these four parts is:

Table 1. Running Time Table

Data initialization (ms) Node-level Parallelization (ms) Subtree-level Parallelization (ms) Tree Flattening (ms) Total (ms)

For node-level parallelization and subtree-level parallelization, the process could be further broken down into several stages as mentioned in background section. We start by allocating memory for the current node and building the bounding box for it (we changed the implementation so a parent node builds bounding box for its children). Then, the SAH values are computed, including building the centroid bounding box, binning the primitives and computing cost based on bins. After the split is determined, the primitives are rearranged. Queuing information is pushed for the future work.

We didn't add building leaves and median splitting to the breakdown graph for the simplification, since there are not much we could do to improve these two parts. Also, note that the running time here is longer than actual time because of logging.

Table 2. Running Time Table for Parallelization

Phases Node Allocation (ms) AABB for Current Node (ms) Centroid Bounding Box (ms) Binning (ms) Cost Computation (ms) Partition (ms) Enqueuing (ms) Building Leaves (ms) Meidan Splitting (ms)
Subtree-level (average)552041453716231025


Figure 5. Breakdowns for Total Running Time, Node-level Parallelism, and Subtree-level Parallelism
As we can see from the data, the subtree-level parallelism costs the major amount of time. It is also the part we spent most of our time to optimize. Instead of having an extremely expensive step, the running times of different steps are approximately in the range of 10~60 ms. The steps with slightly higher costs, such as node allocation, centroid bounding box, binning, have been already optimized with parallelization or better sequential code. Although the time distribution between node-level and subtree-level seems to be highly unpropotional, the current setting is reasonable since the overhead grows quickly when the size of node decreases.
Execution Time and Speedup
We measured the execution time for various number of threads using the same configuration. The speedup is relative to single threaded non-ISPC version. Also, notice that we show the execution time for the parallelized part only (node-level and subtree-level parallelism). We used a 1.1 Million triangle mesh for the experiments.

Figure 6. Speedup for ISPC and non-ISPC version
As the figures show, the speedup is not perfect but pretty linear before we reach six, the number of physical cores. After that point, the trend changes to be flat for non-ISPC version and to fluctuate for ISPC version. There isn't much difference between the ISPC and non-ISPC version performance but becomes distinct with larger number of threads.

It's hard to conclude whether our program is memory-bounded or compute-bounded, since we are using different strategies for different steps. There are some obvious factors stopping the speed-up growing perfectly--sequential parts, like partition and median splitting, and synchronization overhead. The unremarkable improvement by ISPC might be attributed to the ratio of work that is actually SIMDized. As we have mentioned, we only write bounding box construction in ISPC, which only takes 25% of the time. The overall speedup is still affected by other sequential and parallelized parts.
Packet Tracing Performance
In this case we are working on a scene with 55000 triangles, the following graph shows the speed up relative to the naive(non-ISPC) single threaded version of ray tracer. We see that in we get about 2x more speed up by using packets which check ray-node and ray-primitive intersections using ISPC.

Figure 7. Speed up for naive parallelism and packet tracing
The following graph shows the time spent in parts of the algorithm for different number of threads. Notice that even though the reflected rays that are much less in number than the primary or shadow rays they perform about the same amount of computation primarily because they are not packetised. Also, we see that with higher number of threads, the proportion of time spent for primary rays goes lower than that of shadow rays. This effect may be attributed to frustum culling. Since that we perform Monte-Carlo ray tracing and calling a random function is an essential component.

Figure 8. Breakdown of Packet Tracing Running Time
List of Work by Each Student
Equal work was performed by both project members.