Final Report

1. Introduction

1.1 Summary

Scotty 3D is a fully-featured software package developed in 15-462 Computer Graphics for modelling, rendering and animation. The modelling module includes operations that modify mesh objects; the rendering module is a raytracer with bounding volume hierarchy (BVH) that synthesizes high quality images; the animation module creates animation through movement of the meshes. For our project, we implemented each module of the software package, and implemented the ray tracing step using CUDA to utilize the Graphic Processing Unit (GPU) of a computer, in order to achieve high performance.

Our project consists of only two main parts: resolving compatibility issues between the original C++ software and the newly created CUDA software, and ray tracing through various primitives with different materials. We have implemented these steps, and were able to achieve, on average, over 60x speedups rendering various scenes with different numbers of primitives and materials.

1.2 Problem

1.2.1 Ray Tracing Computation

In order to render a photorealistic scene, a technique called ray tracing is applied in order to realistically capture the movement of light rays and photons in a scene, and rasterize the resulting color on the screen based on the radiance of each point captured by our camera. For a given point and ray, it involves integrating over each incoming light ray, either from a light source or from scattering light bounces. Below is the rendering equation:

Figure: The rendering equation for a given point and a given light ray

Since it is practically impossible to numerically solve an integration problem over a continuous space, ray tracers use a technique called Monte Carlo integration, which is sampling over the distribution of our integration space, in order to numerically approximate the results. We are interested in an unbiased and consistent solution. In order to reduce the variance in this approximation, many light rays are needed to generate a high quality image. For testing our scenes, we sampled 512 light rays for each pixel, and allowed each light ray to bounce up to 4 times, i.e. we allow an integration depth of 4 before terminating. With these parameters, we generated over 1.2 billion light rays to render our scenes! Improving the performance of ray tracing involves primarily two main methods: fully utilizing the computation power to trace light rays as quickly as possible, and improve the sampling method to reduce the number of light rays needed for high quality images. We are interested in investigating the former method, and utilizing the computation power of a GPU to improve the performance of a ray tracer.

1.2.2 CUDA Software

CUDA is an extremely powerful parallel computing platform that allows GPU to be used for general purpose computing. With CUDA, it is extremely easy to utilize the many cores in a GPU to parallelize and speed up computations. However, to fully integrate CUDA into existing software involves efficiently managing memory communication between CUDA devices and the CPU. Our Scotty 3D implementation utilizes C++ interfaces and objects in order to provide polymorphism and inheritance. It allows the software to render different shapes and objects, with different materials, and under various lighting conditions. With CUDA, memory management becomes an even bigger issue when many C++ objects and pointers are involved. For our software to be featured correctly, it must be able to correctly offload data to the CUDA devices, without significantly slowing down the program due to overhead and communication costs.

1.3 Our Approach

Ray tracing is inherently extremely parallelizable. Due to the independent nature of each light ray, each ray can be traced in parallel. There are many methods to improve the thread utilization of ray tracing. Most notably, packetizing coherent light rays together will improve a warp utilization within each core, and lead to a speed up in our performance.

We have also implemented a few methods and libraries for managing classes that are used on both the CUDA hosts and devices. We utilized the unified memory model introduced in CUDA 6, in order to reduce memory management overhead for classes that are heavily referenced in both the CPU and the GPU. We have also implemented pseudo-polymorphism by managing function pointers and class pointers, as CUDA does not support C++ polymorphism currently, but it is integral to our renderer being able to generate dynamic scenes and objects.

1.4 Related Work

Here are some research papers that we referred for optimization techniques.  Specifically, others have exploited parallelization of packet ray-tracing which we researched heavily. The following paper contains the implementation that we followed:

Fast Ray Sorting and Breadth-First Packet Traversal  for  GPU Ray Tracing

The following is an introduction to unified memory management published by NVIDIA:

Unified Memory in CUDA 6

The following links are resources provided by 15-462 Computer Graphics:

15-462 Computer Graphics

Skeleton Code

User Guide

Developer Manual

1.5 Contributions

Our project implements a sequential version of most mesh editing operations, ray tracing operations and animation operations. The key highlight of this project is tackling the two problems outlined above. We have implemented our ray tracer to work seamlessly between the application software coded in C++, as well as the ray tracing computation coded in CUDA. We have parallelized the ray tracing operations and used the computation power of the GPU, and achieved over 60x speedup for many of the testing scenes over the sequential version of the ray tracer.

2. Methods

2.1 CUDA Software

In order to seamlessly include CUDA in our code, we have implemented a few interfaces to allow C++ classes to be used in both the device code and the host code.

First, we resolved memory communication issues by utilizing CUDA Unified Memory. It allows memory to be allocated in both the host pageable memory, and the device memory, using virtual address pointers. In other words, memory allocated in them can be used in both device code and host code without CudaMemcpy. Coherence is maintained within CUDA framework, by transferring data as needed. This gives us two main advantages when converting the C++ code to CUDA. First, it no longer requires CudaMemcpy to explicitly transfer data between host and devices. For many C++ classes with advanced data structures, they often require deep copies to maintain correctness, which are time-consuming and contentious. Unified memory eliminates that need. Second, classes containing vast amounts of data, unused by the device code, do not have to be copied over to the device, as they would have if invoking CudaMemcpy (where every field within the class is copied over). Unified memory dynamically transfers data as needed, and reduces the amount of data stored in the cache of the GPU, as well as reduces the communication cost overall.

When CUDA compiles code using C++ interfaces, a v-table pointing to the inherited function is used to resolve inheritance issues, so that children classes will invoke the right methods, instead of their parents’ methods that they have overridden. CUDA does not support interfacing class inheritance, as the v-table function pointers reference function addresses in the Host memory, and they cannot be called. In order to get around the issue, we have to explicitly cast our classes to the inherited children classes, in order to reference the correct function.

2.2 Ray Tracing

In order to compute the color of the pixels to be rendered, the renderer simulates photon movement and activity. This is particularly challenging, as it requires significant runtime and computation, numerically integrating over an infinite number of light rays over their luminance to compute how much light is bouncing into the camera. This was the bottleneck of our project, and thus the most imperative area to parallelize. Using CUDA, we utilize the cores of GPU and optimize the performance of the algorithm by tracing the numerous light rays in parallel.

Because ray tracing involves an integration over an infinite number of light rays, our implementation performs Monte Carlo methods to compute the numerical integration. For any random sampling algorithm, we want to ensure its consistency and unbiasedness. Our ray tracer samples light ray directions uniformly randomly, which is consistent and unbiased; however, it requires many light ray bounces and samples to lead to a high quality photorealistic result.

We first implemented the naive parallel solution, where we parallelize ray tracing by assigning a pixel to each thread in a thread block, without reorganizing rays to ensure coherence. In this solution, we performed a depth-first ray tracing solution: the first ray is generated by the camera and then, at the intersection point, we sample secondary rays that can contribute to the luminance at that point, and proceed to trace them. In this solution, workload can be fairly unbalanced, as a ray that does not hit any bounding volume hierarchy will terminate fairly quickly. Also, light rays can be incoherent, which refers to them being scattered in different directions and origins, especially after a few bounces. When incoherent rays are traced together in a warp, the vector utilization is decreased, due to rays quickly diverging into different nodes in the BVH. However, despite the negatives, the naive version still results in almost about 60x speedup, which is more than we envisioned.

Secondly, we experimented with sorting rays in order to improve vector utilization within a warp, and tracing them breadth-first. It works as follows: when rays are generated, they are hashed to an integer. The least significant 11 bits are used to encode the direction of the ray, while the most significant 21 bits are used to encode the positions of the ray. Starting from the least significant bit, we encoded the θ of the ray by dividing the 2π domain into 64 parts, and mapping the resulting parts to the lower 6 bits. We do the same for ɸ, discretizing them into a 5 bit number. For each of the x,y,z, each of them is 7-bit, and divide the scene into 128*128*128 blocks. This is how we come up with the integer hash number. Rays with the same hash numbers are considered coherent, and are traced together in the same warp. According to our research, this method should lead to significant (about 5x) speed up over the naive version. However, we achieved similar results. We have hypothesized that the amount of work it takes to hash and sort the rays is significant, and leads to non-optimal speed up. Also, after the rays are traced, we generated the secondary rays, synchronized the thread blocks, and sorted them before proceeding. This step involves sending all the rays to the host, mapping them back to the devices, and so on. The additional synchronization and communication cost is limiting our speedup.

Figure: Packet ray tracing allow SIMD execution for many coherent rays at once.

3. Experimental Setup

Scotty 3D takes .dae files and renders them using ray tracing.

We ran our code in GHC 5207. The computers have NVIDIA GeForce GTX 1080 GPU with CUDA compute capability 2.0.  The GTX 1080  has 2560 NVIDIA CUDA cores, compared to the CPU’s 4 hyper-threaded cores. The GTX 1080 can also perform vector instructions on warp size of 32 bits.

To test the speedup of our parallel implementation, we can render a series of .dae files and compare their durations to our sequential reference times.  

To render an image in parallel using CUDA, first make in the build folder and then run it, specifying the light, samples, and bounces:

./scotty3d -c -l <light samples> -s <samples per pixel> -m <bounces> <dae file>

When the window appears, press ‘r’ to render the image.

Files to render are in the folder ../dae/, for example, ../dae/sky/CBspheres_lambertian.dae.

4. Experimental Evaluation

We observed around 60x speedup, on average, from the sequential version of our code.  Below are some results comparing the versions when rendering the metal and matte spheres:

4.1 CBspheres:

(8 CPU Threads)
CUDA parallel implementation Speedup
512 samples per pixel, 16 light samples, and 4 bounces 1574.1891 seconds 21.3792 seconds 73.6318x
512 samples per pixel, 64 light samples, and 4 bounces 9752.4263 seconds 211.7631 seconds 46.0535x

4.2 CBspheres_lambertian:

(8 CPU Threads)
CUDA parallel implementation Speedup
512 samples per pixel, 16 light samples, and 4 bounces 1486.2337 seconds 18.6020 seconds 79.8964x
512 samples per pixel, 64 light samples, and 4 bounces 9013.1422 seconds 177.3732 seconds 50.8146x

Seeing as the GeForce has 2560 cores, and the CPU has 8 doubly threaded cores, full speedup around 320x.  However, our “sequential” implementation actually utilized 8 CPU threads, and so wasn’t entirely sequential to begin with, as rendering would be incredibly slow and thus essentially infeasible.  We believe that the following reasons contribute to our non-perfect speedup:

Due to some ray-tracing inefficiencies detailed in section 2.2, the ray-tracing work is unevenly distributed.  Specifically, rays that hit the “black” areas of the scene terminate immediately, whereas more complex areas of the scene such as the reflective balls are more computation heavy.

Also, when altering the C++ software to be CUDA compatible, we were forced to make some inefficient choices, such as repeated function pointer casting as a work-around for inheritance (detailed in 2.1), as well as excessive memory references which take extra time.

Considering these sources of decreased speedup, we were still able to achieve very significant speedup and we believe that ⅓ of the max is a great starting point for continued work in further optimizing Scotty 3D.

5. Conclusions and Future Work

5.1. Challenges

When planning for our project, we underestimated the complexity of converting a C++ project into a CUDA project. Although many libraries and build tools are provided for the transition, there are many functional capability issues that require a nontrivial amount of work and computation to get around. Although CUDA does have extensive coverage and support for most C++ functionalities, one thing it does not support is seamless object inheritance. This is a problem that we spent weeks trying to solve, and had planned only for hours. Because we have created our project using many object-oriented programming principles, the lack of full support of object inheritance made the conversion extremely difficult, and not many related work is done in solving this issue. We ended needing to spend significant time investigating the behavior of CUDA compilation and linkage, and only through trial and error can we solve the problem.

We have also underestimated the complexity of implementing parallel ray tracing. Even though we have done extensive research, many of the methods were not quite relatable to the GPU thread block model. For example, while investigating the parallel construction of BVH, a lot of research is done on efficiently estimating the surface area heuristic using threads, but its inherent tree structure makes the parallel construction extremely difficult, and we ended up having to cut back our scope.

We also had difficulty testing individual parts of our implementations, due to various dependencies on each other, as well as long run time. CUDA error messages were extremely non-descriptive as well. Some of our errors include running out of memory, not meeting dependencies, failing to reference device methods and class properties, but CUDA will only return a message “invalid device function” during runtime, or even silently fail. As a result, we had to perform extensive testing to isolate our errors.

5.2. Lessons Learned

We were pretty surprised at how significant the speedup was using the NVIDIA GeForce GTX 1080 GPU.  Before the project, we had predicted speedups of around 10x, so we were very pleasantly surprised by the 60x we ultimately achieved. We had a lot of trouble integrating CUDA into our existing C++ code. We were very surprised by some of the features that were not supported by CUDA, that would silently fail and terminate despite not running properly. We were, however, able to extensively test our code to find these errors, and manipulate pointer references in order to make it work.

From this experience, we’ve seen first hand the power of graphics processing units. Even though we knew its incredible ability to parallelize computations over many cores, using the SPMD model, we are still stunned by the resulting speedups we were able to achieve. We have also found out that CUDA, despite being a leading API in performing computations on the GPU, is still very much so a work in progress, and requires a lot more features to be integrated natively into existing C++ projects, but it can be an incredibly powerful tool, if we spent enough time tuning the debugging our code.

5.3. Technical Conclusion

We were very happy to achieve significant speed up and render very detailed and complicated scenes in photorealistic quality. We have improved our knowledge of managing GPU memory, as well as its parallelization model. There are many methods to fully utilize the many cores and threads in the GPU, and maximize warp utilization as well. In order to fully optimize the quality of a program run on a GPU, one must, not only schedule work fairly onto each thread block to ensure each core is running and being utilized, but also maximize warp utility to maximize vectorization speedup. GPU memory transfer cost is significant as well despite its high bandwidth. In order to write efficient functions, we must also consider memory allocation and data transfer, and try to minimize the time it takes to synchronize data.

5.4. Future Work

We have not implemented CUDA parallel version of the modelling module and the animation module. Modelling module is mostly sequential, but there are global operations such as Catmull-Clark Subdivision, that is parallelizable and can be optimized and sped up. The animation module involves a series of computationally intensive functions such as skinning and inverse kinematic. It will be interesting to explore the GPU version of them also.

For our ray tracer, we can improve the speed up by parallelizing the construction of BVH tree, and potentially using a more complicated structure such as OctoTree, that is more GPU friendly. Even though it is not the bottleneck of our program, a fast BVH tree construction is essential to make real-time raytracing happen. For reference, a scene with amount 1 million primitives currently require about 2 seconds to construct the BVH sequentially. For real-time rendering, this must be kept to the magnitudes of milliseconds. Our ray tracer parallelism is not fully optimized as well. Further optimizing the work assignment, and ray packetization can allow us to get about 5x more speedup according to the research paper. Bidirectional ray tracing is a technique that generate rays also from light source that eventually end up in the camera space. It uses dynamic programming to reduce some operations, as well as improve our result due to consistent light samples.

Another way of improving our ray tracer, is to improve the sampling methods, so that it does not require as many samples to achieve photorealistic quality. As mentioned before, the numerical integration is approximated via Monte Carlo sampling methods. Various methods of biased importance sampling while maintaining consistency can reduce the number of samples that are contributing little to no radiance to our points. For example, metropolis light transport and photon mapping are techniques that biased the sampling methods, so that it will converge to the true integration a lot quicker, with less samples.

There is a lot of research being done, or has been done in the topic of high performance ray tracer using the GPU. Most of them can be applied and added to our ray tracer to improve the rendering quality, as well as the runtime of our software.