Our project was an effort to implement an engine to price path-dependent (Asian) stock options in parallel. Our implementation is written in C++ and CUDA, and was tested on the Gates cluster machines with the GTX480/680 graphics cards. Our deliverable is a program that calculates and returns an option price determined by a number of user-defined variables, within as short a period of time as possible. Here is a link to our presentation: link
A detailed mathematical explanation of what we are trying to implement can be found on our proposal page. To explain our objective in computer science terms, we are trying to add parallelism to the generation and computation of an inherently sequential tree induction problem. By using CUDA to generate the tree in parallel, and by using some clever decompositions of the tree, our program should achieve maximal speed-up in calculating an option price for making (hopefully) good investments.
Our pricing algorithm is divided into two parts, the first part is to generate a recombining tree of stock prices. At each node (i, j), we store a range of 2N possible averages (arithmetic) of prices of the stock at that time. These averages represent an pseudo-random subset of the iCj paths from (0,0) to (i,j). This approximation is necessary since we would need to store O(2^N) floats otherwise (compared to O(N^3) floats currently). Pseudocode for this part of the algorithm is as follows:
After the binomial tree of the average prices has been created, we need to apply the backwards induction algorithm to find the current value of our option (the value at node (0,0)). This part of the algorithm is as follows:
where - denotes the appropriate option price one level deeper in our option price tree (if it does not exist, we use a linear interpolation between two prices "close" to it).
To make our code more CUDA-friendly, we implemented our benchmark sequential algorithm with arrays to represent our average price and option price trees (we used the C++ STL only to generate random numbers). The operations performed were just reads and writes to those arrays. The inputs to the algorithm are the current stock price, the strike price of the option, the volatility of the stock returns, and the depth of the binomial tree. The output is the current price of the option at current time.
There are several computationally expensive parts of the problem that could potentially benefit from parallelization. In step 1, generating the 2N random paths to compute the running average at each point (i, j) and in step 2, implementing the backwards induction algorithm in parallel. In step 1, generating samples of random averages at every node is independent, so we expect significant of speedup. However there is some divergent SIMD execution at every node (e.g. the running averages to the node at (4,2) depends on the stock price from i=0 to i=4 while the running averages to the node at (2,2) only depends on the stock price from i=0 to i=2). Thus, as the depth of our tree N increases, there is more computation performed nodes near depth N compared to nodes near depth 1. Step 2 is inherently sequential since each depth of the tree, i, is dependent on the next level of the tree, i+1. Thus, we can only benefit from dividing the tree into smaller chunks and performing the backwards induction algorithm on those chunks sequentially.
The first step we took was to implement a serial version of our code in C++ to be run on a single thread on a CPU. Our implementation was written entirely from scratch.
When we worked on decomposing our serial implementation to a parallel one, we realized that the pieces to be executed in parallel were not very CPU intensive and could benefit greatly from a shared address space. Therefore, we decided to write our parallel implementation in CUDA to be run on the Gates Cluster machines equipped with Intel Xeon W3520 CPUs and a NVidia GTX480 GPU.
We also tried two separate approaches to implementing parallelism in CUDA: relying entirely on global memory accesses and utilizing shared memory. Therefore, we ended up with two different implementations for each of the steps mentioned above.Step 1: Global Memory Implementation
To reiterate, the first step of the calculation is the generation of the recombining binomial tree of depth N. To do this, we need random (up, down) paths of length i for each node (i,j). Using Thrust, we generated these paths, represented as randomly generated floats (<0.5f = down, >=0.5f = up), and passed a global structure containing these paths to the kernels, in addition to the global price tree. We assign one CUDA thread a single (i,j) node of the tree, each computing the path of length 2*N for (i,j) using the random paths of length N, and directly storing the values in the global tree structure. This constitues O(N) * O(N^2) * O(2N) = O(N^4) global accesses, and does not scale well, as demonstrated by our results below.Step 1: Shared Memory Implementation
In order to minimize the high traffic to global memory mentioned above, we wrote a different parallel implementation utilizing shared memory. In this implementation, each kernel call would process only the number of threads whose paths could be entirely contained within shared memory. We also added a hash function written in device code in order to approximate the random paths required by this step within each kernel call, eliminating the need for a global array of random paths generated by Thrust. We expected this to improve speedup by reducing the memory access bottleneck we suffered from in the other version. However, we ran into a different scaling problem: as N increased, the size of the 2*N path required for each (i,j) increased as well. Therefore, since the size of shared memory is a fixed 48K, the number of nodes (threads) we could process in parallel decreased as N increased. This slowdown is exhibited in the results before as well.Step 2: Global Memory Implementation
In this step, we recognized that each level (nodes of the same depth in the tree) could be processed in parallel, because it only depend on the prior level. Therefore, we processed each of the N levels in parallel, while writing updates to the global option tree directly.Step 2: Shared Memory Implementation
To better use shared memory, we decomposed the problem by assigning subtrees of the same level to different threads. We structured the subtrees such that the nodes in the subtree could be computed independently, and the subtrees could be entirely contained in shared memory, and written to global memory at the end of the computations. By doing this, we increased the granularity of the pieces of work for each thread, but we hoped that the reduction of global memory accesses again would overcome the reduction of parallelism. This did not turn out to be the case.
Figure 2 plots the speedup of running step 1 of our parallel implementation using shared memory and our parallel implementation using global memory against the serial implementation.
There are two notable features in the graph: the sharp increase in speedup from depth 16 to 32 and the approximately exponential decay of speedup after 32. We theorize that the speedup from 16 to 32 is because the overhead of transferring memory to and from the GPU for N = 16 outweighs the performance benefits of doing the pieces in parallel. N = 32 seems to be a “sweet spot” where the speedup overweighs the aforementioned overhead. For values of N greater than 32, we note that the number of (i, j) nodes that can be computed in parallel linearly decreases, since the size of the 2*N list for each node increases. At the same time, the total number of nodes in the tree in O(N^2).
Figure 3 plots the speedup of running step 2 of our parallel implementation using shared memory and our parallel implementation using global memory against the serial implementation.
The first thing to note is that the Y-axis (speedup) in this case is scaled to a value less than 1, implying speed-down. We suspect this is due to a number of factors. First, the number of pieces of work that can be done in parallel is very small: O(N) and decreasing. This is because of the inductive nature of the algorithm, where each level of the tree depends on the next. Also, in the case of the shared memory implementation, the number of pieces that can be done in parallel decreases very quickly, due to the limited size of shared memory and the large amounts of data that have to be stored for the subtrees.
Figure 4 plots the speedup of a "hybrid" algorithm consisting of our shared-memory, parallel implementation of step 1 and the serial implementation of step 2. Overall, we see a speedup of 4 at the previously mentioned optimum N=32, and a speedup tapering off to 2 as N increases.
We tried running our code on the Gates machines equipped with the GTX680 to see if there was a significant speedup. However, the speedup was very minor for the values of N we tried (within 10%), for different implementations. We are quite sure this is due to the fact that the decomposition of our problem is very coarse grained. Because the work assigned to each thread at at least some task for a single (i, j) in the tree, there are O(N^2) pieces of work, which for most N, is much smaller than the maximal number of threads on the GPUs (both 480 and 680). Therefore, the minor speedup was most likely due to the differences in the clock speed for the GPUs.
It would have been interesting to implement a version of our algorithm on a heterogeneous processor, since certain parts of our algorithms may have been faster if they were performed on the CPU and other parts on the GPU, while not facing certain limitations of transferring data between processors.
In our initial proposal, we mentioned we wanted to try running our program on real-time data, like many financial firms do with their own supercomputers in order to make investments. However, we discovered that access to such data is limited to people with brokerage accounts, and therefore as not accessible to us at the time of our analysis.
Equal work was performed by both project members.