An Exploration of the GPGPU Design Space

We will implement a parameterizable, cycle-accurate simulator of a general-purpose GPU. In particular, we intend to implement accurate models of parallel thread execution, a memory hierarchy with shared and global memory, and thread scheduler policies. This will allow us to study the runtime profile of a GPU’s hardware on a cycle-by-cycle basis and to identify which hardware features are most exercised. URL: TBD

Background

Modern SoCs rely heavily on GPUs to offload data-parallel workloads for faster computation. While CPU architecture has a rich open-source design space, GPU architecture has unfortunately remained primarily proprietary, with fewer open-source tools available. Due to the limited performance counters implemented in production designs, it is not always possible to conduct fine-grained studies of various elements of a GPU’s design to intuit the vendor’s design decisions or the effect that a certain design decision could have on performance.

When creating new micro-architectural features, it is necessary to create simulated models of those features to benchmark how they interact with the rest of the product. Unfortunately, hardware-level implementations are substantially more costly to develop and execute benchmarks far more slowly than simulations. While emulators provide functional models of the result of executing a program, they do not provide a cycle-accurate model of how hardware would actually execute it, which makes it impossible to instrument metrics related to microarchitectural performance.

GPGPUs, such as those exposed via the CUDA architecture, launch large blocks of cheaply created threads that are physically scheduled in smaller execution contexts within a specific multiprocessor. Each thread of the execution context maps to a specific hardware thread in the multiprocessor, which maintains its own program state but is ideally controlled in a SIMT paradigm: a single instruction is applied simultaneously to multiple threads running in parallel to improve the throughput of a given computation.

The Challenge

The primary challenge is implementing sufficient features to conduct an interesting design space exploration. In particular, not having sufficient instructions or other parameters will make it difficult to simulate realistic workloads. We will need to think carefully about which instructions need to be supported natively versus which can be expressed as a sequence of simpler instructions executed on the simulator. For example, memory operations are required, but a full menu of ALU instructions is not, because certain instructions can be simulated using others. There is always the opportunity to add more features to this model to make it more realistic relative to modern GPGPUs available on the market, but doing so may not be time efficient if those instructions are rarely used or certain schedulers would be inefficient to implement in hardware.

The second challenge is matching the design concerns that are considered in real GPU designs. For example, power (and, by extension, cooling) is a primary design concern for commercially available GPUs used to train LLMs. However, without a physical implementation, we will need to conduct research on current trends in memory design (such as modern HBM) to get realistic estimates for these parameters. In the core itself, we may not be able to weigh the actual efficacy of a feature because of hardware limitations that a simulation cannot properly embody, such as critical path or area complexity.

Resources Required

No special resources will be required. We intend to run this code on our local computers, and it will be single threaded (we’re simulating a parallel architecture, not building one, after all). We intend to draw on a previously implemented architectural simulator for decoding and issuing instructions. The execution logic will be different because there is now abundant parallelism and non-uniformity in memory that must be considered. We do not believe that particular resources will need to be cited here, as all the instructions we implement are ones that we previously used when completing assignment 2 and can be implemented with some thought. We may cite sources for memory performance characteristics, but this is an extra feature dependent on time and, as such, we have not yet investigated it.

Features we Must Implement

Support for Execution of Basic Operations

As a baseline, we intend to implement the following operations: add (to simulate ALU ops), branches (to simulate branch divergence), and loads and stores. In particular, we intend to support loads and stores from both shared memory and global memory; the key differences will be access latency and capacity. All loads and stores will be tagged as either global or shared. Branches will be annotated with the probability that the branch will be taken, the jump targets, and the percentage of threads that will experience branch divergence. Since we are not simulating the production of specific results, this will be sufficient to study the behavior of the GPU during instruction execution.
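To make this concrete, here is a minimal sketch of how such annotated instructions might be represented in a Python prototype. All field names and defaults are illustrative assumptions, not a final format:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Instr:
    """One simulated instruction. Fields are illustrative, not final."""
    op: str                       # "add", "branch", "load", "store"
    latency: int = 1              # cycles spent in the execute stage
    space: Optional[str] = None   # "shared" or "global" for loads/stores
    taken_prob: float = 0.0       # branches: probability the branch is taken
    divergence: float = 0.0       # branches: fraction of threads that diverge
    target: Optional[int] = None  # branches: jump target (instruction index)

# Example stream: an add, a slow global load, and a branch that is taken
# 90% of the time with 25% of threads diverging.
program = [
    Instr("add"),
    Instr("load", latency=100, space="global"),
    Instr("branch", taken_prob=0.9, divergence=0.25, target=0),
]
```

Because no numerical results are produced, the annotations alone carry everything the simulator needs to model timing behavior.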

Support for Performance Counters

We will use performance counters in our sensitivity studies. Performance counters are a key deliverable that we believe can be accomplished by creating key/value pairs for each relevant counter and updating them on every event that invokes the specific behavior. The key challenge is instrumenting the correct counters: we seek a balance between counting so many small details that important trends are lost and being too coarse-grained. For example, counting the direction of a branch might be less interesting than counting the stall cycles due to divergence, because only the latter guides further optimization.
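The key/value scheme described above might look like the following sketch; the event names are assumptions for illustration:

```python
from collections import defaultdict

class Counters:
    """Event counters stored as key/value pairs; every simulated event
    bumps the counter named after it."""
    def __init__(self):
        self.counts = defaultdict(int)

    def bump(self, event, n=1):
        self.counts[event] += n

    def report(self):
        # Stable ordering for readable end-of-run dumps.
        return dict(sorted(self.counts.items()))

# Usage: the simulator calls bump() wherever the event occurs.
ctrs = Counters()
ctrs.bump("instructions_issued")
ctrs.bump("divergence_stall_cycles", 4)
ctrs.bump("instructions_issued")
```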

The areas that we believe are relevant include the frontend, backend, and memory system.

On the frontend, we are particularly interested in the instruction mix fetched from memory, the effect of memory latency on keeping the parallel elements fed, and branching properties such as the percentage of issue slots left unfilled due to branch divergence and the number of stall cycles spent waiting for branch resolution.

On the backend, we intend to study the effect of shared memory on operand pressure. Due to the simplicity of our instruction mix, we do not believe that investigating properties like stalls due to non-pipelined operations will be interesting. We are also curious to document different pipeline depths to examine frequency vs. dependency tradeoffs; however, this is a second-order concern on GPUs (as opposed to CPUs) and therefore has a lower priority.

The memory subsystem has a rich design space that needs to be carefully benchmarked as well. We are particularly interested in average memory access time (AMAT) on dynamic workflows (as opposed to idealised assumptions of shared memory usage), stalls induced waiting for memory, the percentage of threads in an execution group that are stalled, and, if implemented, the effect of different cache hierarchies on AMAT. Another concern is the physical layout of the memory: the width of the memory bus affects the rate at which values are delivered to the core, and different banked memory layouts can leave data inaccessible in one bank while other banks sit idle. The specific counters will vary depending on the final implementation details and any synchronization points that might need to be added to ensure correctness of execution.
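As one example of modelling the banked-layout concern, a minimal sketch of counting the serialisation caused by shared-memory bank conflicts, assuming a 32-bank, 4-byte-word layout similar to common GPU designs:

```python
def bank_conflict_cycles(addresses, num_banks=32, word_bytes=4):
    """Cycles needed to service one execution group's shared-memory
    accesses when each bank serves one word per cycle: the worst-case
    number of accesses that map to the same bank."""
    per_bank = {}
    for addr in addresses:
        bank = (addr // word_bytes) % num_banks
        per_bank[bank] = per_bank.get(bank, 0) + 1
    return max(per_bank.values()) if per_bank else 0

# 32 threads striding by one word hit 32 distinct banks: 1 cycle.
# 32 threads striding by 32 words all hit bank 0: fully serialised.
```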

Optional Features

Thread Synchronization: For our necessary features, we did not explore synchronization operations such as barriers or reductions. These are not strictly necessary if true data parallelism can be achieved; however, that may not be a realistic goal in many cases, even for workloads that are good candidates for parallelization on a GPU. One optional feature would be to implement these primitives. This is challenging because synchronization bugs manifest subtly in a program’s final results. Still, this is a rich design space that would be interesting to consider: synchronization points raise the possibility that certain execution groups become blocked, leading to fragmented instruction issue, as well as the possibility of starvation if the scheduler deschedules an execution group for a lengthy period while it waits at a barrier.
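A minimal sketch of how a barrier and its blocking effect on execution groups could be modelled; the interface is hypothetical:

```python
class Barrier:
    """Minimal barrier model: an execution group that arrives is blocked
    until all groups in the block have arrived, at which point everyone
    is released and the barrier resets for reuse."""
    def __init__(self, num_groups):
        self.num_groups = num_groups
        self.arrived = set()

    def arrive(self, group_id):
        self.arrived.add(group_id)
        if len(self.arrived) == self.num_groups:
            self.arrived.clear()
            return True   # all released; scheduler can re-enable everyone
        return False      # group_id is now blocked

bar = Barrier(num_groups=2)
assert bar.arrive(0) is False  # group 0 waits
assert bar.arrive(1) is True   # last arrival releases the block
```

A scheduler layered on top of this would mark groups returned False as unrunnable, which is exactly where the fragmentation and starvation effects described above would surface.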

Several Multiprocessors: We have considered a single multiprocessor for simplicity of implementation, but this is not sufficient to capture all of the parallelism that a highly parallel workload could expose. This would be an interesting direction for further exploration, since execution groups may need to migrate between multiprocessors depending on the overall workload balance of the system (for example, a high percentage of threads stalled on memory). The challenge lies in moving thread groups and in designing an efficient scheduler to exploit the extra parallelism that has been exposed. Furthermore, this would require a new trace design to accurately expose additional parallelism.

Platform Choice

We’re simulating a platform, not exploiting the resources of one. Any modern CPU will do.

Schedule

Week 1: ISA Definition & Infrastructure

We intend to define the entire ISA required for the base implementation and create a simulator that is functionally capable of parsing the input file and making it ready for interpretation. The goal is to design a minimal notation for instructions that neither over-specifies details nor requires complex logic to disambiguate between several different cases. Once the input is read, we want a launch point for our threads to begin reading from the stream and issuing the specified work.
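As a sketch of the kind of minimal notation we have in mind, one instruction per line with whitespace-separated operands; the syntax shown is a placeholder, not the final format:

```python
def parse_program(text):
    """Parse a minimal one-instruction-per-line notation into a list of
    (opcode, operand-strings) tuples. '#' starts a comment."""
    program = []
    for line in text.splitlines():
        line = line.split("#")[0].strip()  # drop comments and whitespace
        if not line:
            continue
        op, *args = line.split()
        program.append((op, args))
    return program

prog = parse_program("""
add
load global        # 100-cycle global access
branch 0.9 0.25 0  # taken_prob divergence target
""")
```

Keeping operands as plain strings at this stage defers type decisions to the decode step, which makes the notation easy to extend as instructions are added.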

Week 2: Implementation of Operations

We intend to implement the ability to simulate basic operations (ALU and branches). These will not require any memory fetches but will test the core’s ability to fetch and decode instructions according to the format specified. When implementing branches, we will need to pay close attention to correct resolution of branch direction and the possibility that divergence occurs as specified by the instruction. During this time, we will need to develop basic execution traces that allow us to characterize the correctness of behavior.
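A branch annotated as described could be resolved with a sketch like the following, assuming a 32-thread execution group; the function name and mask representation are assumptions:

```python
import random

def resolve_branch(taken_prob, divergence, group_size=32, rng=random):
    """Resolve one branch per its annotations: choose a direction with
    probability taken_prob, then mark a divergence-sized fraction of
    threads as following the other path. Returns (taken, active_mask),
    where active_mask is True for threads on the chosen direction."""
    taken = rng.random() < taken_prob
    diverged = int(divergence * group_size)
    mask = [True] * (group_size - diverged) + [False] * diverged
    return taken, mask
```

When any mask entry is False, the simulator would charge extra issue slots for the minority path, which is the divergence cost the counters are meant to capture.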

Week 3: Counters and Benchmarking

We intend to implement basic counters that follow naturally from the instruction mix implemented so far. This will require not only the creation of counters but, more importantly, sufficiently long test cases to exercise the steady-state behavior of the GPU over time. We intend to generate these test cases by converting compiled CPU kernels for certain parallel operations and mapping them to the instruction mix executed by the simulator. Because there is no need to produce specific numerical results, we intend to capture the structure of the program (e.g., the loop structure of matrix multiplication) rather than the specific instructions that would be executed on a specific ISA.
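For example, the structural trace of a naive matrix multiply might be generated as follows; opcodes only, no values, and the tuple format is an assumption:

```python
def matmul_trace(n):
    """Emit the structural trace of a naive n x n matrix multiply:
    two loads and an add per inner iteration, a store per (i, j), and
    a branch for each loop back-edge. No actual values are computed."""
    trace = []
    for _ in range(n):              # i loop
        for _ in range(n):          # j loop
            for _ in range(n):      # k loop: A[i][k] * B[k][j]
                trace += [("load", "global"), ("load", "global"), ("add",)]
                trace.append(("branch",))   # k-loop back-edge
            trace.append(("store", "global"))
            trace.append(("branch",))       # j-loop back-edge
        trace.append(("branch",))           # i-loop back-edge
    return trace
```

Scaling n produces arbitrarily long traces with realistic loop-shaped branch behavior, which is what the steady-state counter studies need.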

Week 4: Memory Operations

We intend to implement memory operations under the assumption that memory bandwidth is not a limiting factor. This is to say that all threads executing in hardware can fetch their memory operands simultaneously within a single clock cycle and there is no need to serialise fetches over multiple cycles due to limited bandwidth. This will begin to open up richness in the scheduler design space, as threads can now become blocked for many cycles waiting for memory. When threads become blocked, it may no longer be optimal to consider scheduling a particular thread, instead selecting other runnable threads and running them until they too become blocked on memory.
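The blocked-thread policy described above might be sketched as a round-robin pick that skips execution groups stalled on memory; the field names are assumptions:

```python
def pick_next_group(groups, current):
    """Round-robin scheduling: starting after the current execution
    group, return the index of the first group not blocked on memory,
    or None if every group is stalled this cycle."""
    n = len(groups)
    for i in range(n):
        cand = (current + 1 + i) % n
        if groups[cand]["blocked_until"] <= 0:
            return cand
    return None

# Group 0 is blocked for 5 more cycles, group 2 for 2; group 1 is runnable.
groups = [{"blocked_until": 5}, {"blocked_until": 0}, {"blocked_until": 2}]
```

Cycles where this returns None are exactly the all-stalled cycles the memory counters would record.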

This brings us to a natural checkpoint because at this point we will have a model that implements the essential functionality that is required based on the criteria that we laid out earlier. This will give us the opportunity to take some time to think about bottlenecks that we may have discovered during the implementation phase and consider how to properly instrument performance counters to classify those bottlenecks. If none are present, this will give us the opportunity to think about interesting workloads that will exercise memory operations and reveal non-idealities in the scheduler.

Week 5: Detailed Counters and Sweeps

Given a correct implementation, we will spend a significant amount of time designing a proper experiment to sweep the various parameters created during the design phase. Our goal is to limit the design space exploration to avoid an exhaustive search, which is inefficient in time and requires no interesting thought about which parameters are most relevant. We will spend time selecting optimal sizes for each structure based on which ones are locally optimal, and then study interference between structures (for example, faster branch resolution might make it harder to hide memory latency because fewer threads are waiting on another operation).
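A sweep over two hypothetical parameters could be structured like this; run_simulation stands in for the simulator entry point and is not a real API, and the grid values are placeholders:

```python
from itertools import product

def sweep(run_simulation):
    """Run the simulator over a small cross-product of parameter values
    and collect one result per configuration."""
    results = {}
    for num_groups, mem_latency in product([4, 8, 16], [50, 100, 200]):
        results[(num_groups, mem_latency)] = run_simulation(
            num_groups=num_groups, global_mem_latency=mem_latency)
    return results
```

Restricting each axis to a few locally interesting values, rather than sweeping everything, keeps the cross-product small while still exposing interference between structures.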

After a convincing design space exploration, we will use the remaining time to produce a clear narrative about what bottlenecks exist and attempt to correlate them to trends in industry. Depending on time, we would very much like to increase the complexity of the ISA to explore more realistic workloads as we described earlier.