We will implement a parameterizable, cycle-accurate simulator of a general-purpose GPU. In particular, we intend to implement accurate models of parallel thread execution, a memory hierarchy with shared and global memory, and thread scheduling policies. This will allow us to study the runtime profile of a GPU’s hardware features cycle by cycle and to identify which hardware features are exercised most heavily. URL: TBD
Modern SoCs rely heavily on GPUs to offload data-parallel workloads
for faster computation. While CPU architecture has a rich open-source
design space, GPU architecture has unfortunately remained primarily
proprietary, with fewer open-source tools available. Due to the
limitations of the performance counters implemented in production designs,
it is not always possible to conduct fine-grained studies of the various
elements of a GPU’s design, whether to intuit the reasoning behind a
vendor’s design decisions or to measure the effect that a particular
design decision could have on performance.
When creating new microarchitectural features, it is necessary to
create simulated models of those features to benchmark how they interact
with the rest of the design. Unfortunately, hardware-level
implementations are substantially more costly to develop and run
benchmarks far more slowly than simulations do. While emulators provide
functional models of the result of executing a program, they do not
provide a cycle-accurate model of how hardware would actually execute
it. This makes instrumentation that tracks microarchitectural
performance metrics impossible.
GPGPUs, such as those exposed via the CUDA architecture, launch large
blocks of cheaply created threads that are physically scheduled in
smaller execution contexts within a specific multiprocessor. Each thread
of an execution context maps to a specific hardware thread in the
multiprocessor, which maintains its own program state. Ideally, threads
are controlled in a SIMT paradigm, where a single instruction is applied
simultaneously to multiple threads running in parallel to improve the
throughput of a given computation.
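The SIMT paradigm described above can be sketched as one scalar operation broadcast over a warp’s active lanes. This is a minimal illustration, not part of our design; the warp size and names below are our own (real GPUs commonly use 32 lanes).

```python
WARP_SIZE = 4  # illustrative; real GPUs commonly use 32 lanes per warp

def simt_step(registers, active_mask, op):
    """Apply a single scalar operation to every active lane of a warp."""
    for lane in range(WARP_SIZE):
        if active_mask[lane]:
            registers[lane] = op(registers[lane])
    return registers

regs = [0, 10, 20, 30]
mask = [True, True, False, True]   # lane 2 is inactive (e.g. diverged)
simt_step(regs, mask, lambda r: r + 1)
# regs is now [1, 11, 20, 31]: one instruction, three lanes updated
```

One instruction is issued, but multiple threads advance, which is exactly the throughput benefit described above.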
The primary challenge is implementing sufficient features to conduct
an interesting design space exploration. In particular, not having
enough instructions or other parameters will make it difficult to
simulate realistic workloads. We will need to think carefully about
which instructions must be supported directly versus which can be
expressed as a sequence of simpler instructions executed on the
simulator. For example, memory operations are required, but a full menu
of ALU instructions is not, because certain instructions can be
simulated using others. There is always the opportunity to add more
features to bring the model closer to modern GPGPUs available on the
market, but doing so may not be time efficient if those instructions are
rarely used or if certain schedulers would be inefficient to implement
in hardware.
The second challenge is matching the design concerns that are considered
in real GPU designs. For example, power (and, by extension, cooling) is
a primary design concern for commercially available GPUs used to train
LLMs. However, without a physical implementation, we will need to
research current trends in memory design (such as modern HBM) to get
realistic estimates for these parameters. In the core itself, we may not
be able to weigh the actual efficacy of a feature because of hardware
limitations that a simulation cannot properly embody, such as
critical-path length or area complexity.
No special resources will be required. We intend to run this code on our local computers, and it will be single threaded (we’re simulating a parallel architecture, not building one, after all). We intend to draw on a previously implemented architectural simulator for decoding instructions and issuing them. The execution logic will be different because there is now abundant parallelism and non-uniformity in memory that must be considered. We do not believe that there are particular resources that will need to be cited here, as all the instructions we implement are ones that we previously used when completing assignment 2 and can be implemented with some thought. We may cite sources for memory performance characteristics, but this is an extra feature that is dependent on time and, as such, we have not yet investigated it.
As a baseline, we intend to implement the following operations: add (to simulate ALU ops), branches (to simulate branch divergence), and loads and stores. In particular, we intend to support loads and stores from both shared memory and global memory; the key differences will be access latency and capacity. All loads and stores will be tagged as either global or shared. Branches will be marked with the probability that the branch is taken, the jump targets, and the percentage of threads that will experience branch divergence once implemented. Since we are not simulating the production of specific results, this will be sufficient to study the behavior of the GPU during instruction execution.
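As one hedged illustration, the baseline instructions above could be encoded in a simple text trace. The field names and syntax below are our own sketch, not a settled format:

```python
# Hypothetical trace lines for the baseline ISA described above:
#   add
#   ld global        (or: ld shared, st global, st shared)
#   br p_taken=0.7 target=12 diverge=0.25
def parse_line(line):
    """Parse one trace line into a small instruction dict."""
    parts = line.split()
    instr = {"op": parts[0]}
    if parts[0] in ("ld", "st"):
        instr["space"] = parts[1]          # "global" or "shared"
    elif parts[0] == "br":
        for field in parts[1:]:            # numeric fields parsed as floats
            key, val = field.split("=")
            instr[key] = float(val)
    return instr

parse_line("br p_taken=0.7 target=12 diverge=0.25")
# -> {"op": "br", "p_taken": 0.7, "target": 12.0, "diverge": 0.25}
```

Keeping every instruction on one line with at most a few key=value fields avoids the over-specification the next section warns against.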
We will use performance counters in our sensitivity studies.
Performance counters are a key deliverable that we believe can be
accomplished by creating key/value pairs for each relevant counter and
updating them on every event that invokes the specific behavior. The key
challenge is instrumenting the correct counters. We seek a balance
between counting so many small details that important trends are lost
and being too coarse-grained. For example, counting the direction of a
branch might be less interesting than the stall cycles due to
divergence, because only the latter guides further optimization.
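A minimal sketch of this key/value counter scheme; the counter names below are placeholders, not our final set:

```python
from collections import defaultdict

# Counters live in one key/value map; any missing key starts at zero.
counters = defaultdict(int)

def record(event, amount=1):
    """Bump a counter whenever the simulator performs the named event."""
    counters[event] += amount

# Example events a simulation step might emit:
record("divergence_stall_cycles", 3)
record("instructions_issued")
record("instructions_issued")
```

Because counters are just dictionary entries, adding or removing one during the instrumentation phase costs a single `record` call at the relevant event site.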
The areas that we believe are relevant include the frontend, backend,
and memory system. On the frontend, we are particularly interested in
the instruction mix fetched from memory, the effect of memory latency on
keeping the parallel elements fed, and branching properties such as the
percentage of slots not filled due to branch divergence and the number
of stall cycles spent waiting for branch resolution. On the backend, we
intend to study the effect of shared memory on operand pressure. Due to
the simplicity of our instruction mix, we do not believe that
investigating properties like stalls due to non-pipelined operations
will be interesting. We are also curious to document different pipeline
depths to examine frequency vs. dependency tradeoffs; however, this is a
second-order concern on GPUs (as opposed to CPUs) and therefore has a
lower priority. The memory subsystem has a rich design space that needs
to be carefully benchmarked as well. We are particularly interested in
average memory access time on dynamic workloads (as opposed to idealised
assumptions of shared memory usage), stalls induced waiting for memory,
the percentage of threads in an execution group that are stalled, and,
if implemented, the effect of different cache hierarchies on average
memory access time (AMAT). Another concern is the physical layout of the
memory, such as the width of the memory bus, which affects the rate of
delivering values to the core, and the effect of different banked memory
layouts, because bank conflicts can leave data inaccessible while other
banks are left idle. The specific counters will vary depending on the
final implementation details and any synchronization points that might
need to be added to ensure correctness of execution.
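For reference, the AMAT metric mentioned above reduces to the standard hit-time-plus-expected-miss-cost formula; the latencies and miss rate below are made-up example numbers, not measurements:

```python
def amat(hit_time, miss_rate, miss_penalty):
    """Average memory access time: hit time plus expected miss cost."""
    return hit_time + miss_rate * miss_penalty

# e.g. a 1-cycle shared-memory hit, with 25% of accesses missing to a
# 100-cycle global memory, gives an AMAT of 26 cycles.
amat(1, 0.25, 100)   # -> 26.0
```

The same function composes for a deeper hierarchy by using the next level’s AMAT as the miss penalty, which is how we would evaluate the optional cache-hierarchy feature.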
Thread Synchronization: For our necessary features,
we did not explore synchronization operations such as barriers or
reductions. These are not strictly necessary if true data parallelism
can be achieved. However, this may not be a realistic goal in many
cases, even when a workload is otherwise a good candidate for
parallelization on a GPU. One optional feature would be to implement
these primitives. This is challenging due to the subtle nature of
synchronization bugs in the final results produced by a program.
However, this is a rich design space that would be interesting to
consider, because sync points raise the possibility that certain
execution groups become blocked, leading to fragmented instruction
issue. There also arises the possibility of starvation if the scheduler
makes a poor decision to deschedule an execution group for a lengthy
period of time due to the presence of a barrier.
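One way the barrier primitive could look inside the simulator (a sketch; the class and field names are hypothetical):

```python
class Barrier:
    """A block-level barrier: warps block until all have arrived."""

    def __init__(self, n_warps):
        self.n_warps = n_warps
        self.waiting = set()        # warp ids currently blocked here

    def arrive(self, warp_id):
        """Returns True once every warp has arrived (barrier releases)."""
        self.waiting.add(warp_id)
        if len(self.waiting) == self.n_warps:
            self.waiting.clear()    # release: all warps become runnable
            return True
        return False

b = Barrier(2)
b.arrive(0)   # False: warp 0 now blocks, fragmenting instruction issue
b.arrive(1)   # True: all warps arrived, both become runnable again
```

The `waiting` set is exactly what a scheduler would consult to detect the blocked execution groups, and prolonged starvation, that this section describes.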
Several Multiprocessors: We have considered a single
multiprocessor for simplicity of implementation, but this is not
sufficient to capture all of the parallelism that a highly parallel
workload could expose. This would be an interesting direction for
further exploration due to the possible need to move execution groups
between multiprocessors depending on the overall workload balance of the
system, such as a high percentage of threads stalled on memory. The
challenge arises when moving thread groups and in designing an efficient
scheduler to exploit the extra parallelism that has been exposed.
Furthermore, this would require a new trace design to accurately expose
additional parallelism.
We’re simulating a platform, not exploiting the resources of one. Any modern CPU will do.
We intend to define the entire ISA required for the base implementation and create a simulator that is functionally capable of parsing the input file and making it ready for interpretation. The goal is to design a minimal notation for instructions that neither over-specifies details nor requires complex logic to disambiguate between several different cases. Once the file is read, we want a launch point for our threads to begin reading from the stream and issuing the specified work.
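The launch point might look like the following sketch, where each thread context carries its own program counter into a shared decoded stream; the context fields here are placeholders:

```python
def launch(program, n_threads):
    """Create one context per thread, all starting at the stream's head."""
    return [{"tid": t, "pc": 0, "blocked_until": 0}
            for t in range(n_threads)]

threads = launch(["add", "ld global", "add"], 4)
# every context starts at pc 0 over the same decoded instruction stream
```

Keeping the decoded program immutable and per-thread state in small dicts lets divergence and stalls be modeled purely as differences in `pc` and `blocked_until`.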
We intend to implement the ability to simulate basic operations (ALU and branches). These will not require any memory fetches but will test the core’s ability to fetch and decode instructions according to the format specified. When implementing branches, we will need to pay close attention to correct resolution of branch direction and the possibility that divergence occurs as specified by the instruction. During this time, we will need to develop basic execution traces that allow us to characterize the correctness of behavior.
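Resolving a probabilistic branch as specified in the trace could be sketched as follows. The deterministic choice of which lanes diverge is a simplification of ours, made purely for illustration:

```python
import random

def resolve_branch(mask, p_taken, diverge_frac, rng):
    """Decide the branch direction, then mask off the diverging lanes."""
    taken = rng.random() < p_taken
    active = [i for i, a in enumerate(mask) if a]
    n_diverge = int(len(active) * diverge_frac)
    new_mask = list(mask)
    for lane in active[:n_diverge]:   # simplification: first lanes diverge
        new_mask[lane] = False
    return taken, new_mask

resolve_branch([True] * 4, 1.0, 0.5, random.Random(0))
# -> (True, [False, False, True, True])
```

The masked-off lanes are exactly the divergence-induced empty issue slots our counters would track.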
We intend to implement basic counters that follow naturally from the instruction mix implemented so far. This will require not only the creation of counters but, more importantly, sufficiently long test cases to test the steady-state behavior of the GPU over time. We intend to generate these test cases by converting compiled CPU kernels for certain parallel operations and mapping them to the instruction mix executed by the simulator. Because there is no need to produce specific numerical results, we intend to capture the structure of the program (i.e., the loops taken in matrix multiplication) rather than the specific instructions that would be executed on a specific ISA.
We intend to implement memory operations under the assumption that
memory bandwidth is not a limiting factor. This is to say that all
threads executing in hardware can fetch their memory operands
simultaneously within a single clock cycle and there is no need to
serialise fetches over multiple cycles due to limited bandwidth. This
will begin to open up richness in the scheduler design space, as threads
can now become blocked for many cycles waiting for memory. When threads
become blocked, it may no longer be optimal to consider scheduling a
particular thread, instead selecting other runnable threads and running
them until they too become blocked on memory.
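The scheduling behavior described above, skipping warps blocked on memory and issuing from the next runnable one, could be sketched as a round-robin pick; the structure is hypothetical:

```python
def pick_warp(warps, cycle, last=0):
    """Return the next runnable warp id, or None if all are stalled."""
    n = len(warps)
    for i in range(n):
        w = (last + 1 + i) % n          # round-robin starting after last
        if warps[w]["blocked_until"] <= cycle:
            return w                     # this warp can issue this cycle
    return None                          # every warp is waiting on memory

warps = [{"blocked_until": 10}, {"blocked_until": 0}, {"blocked_until": 5}]
pick_warp(warps, cycle=3)   # -> 1 (warps 0 and 2 still waiting on memory)
```

A `None` result is precisely the "all threads stalled for memory" event that the memory-subsystem counters should record.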
This brings us to a natural checkpoint because at this point we will
have a model that implements the essential functionality that is
required based on the criteria that we laid out earlier. This will give
us the opportunity to take some time to think about bottlenecks that we
may have discovered during the implementation phase and consider how to
properly instrument performance counters to classify those bottlenecks.
If none are present, this will give us the opportunity to think about
interesting workloads that will exercise memory operations and reveal
non-idealities in the scheduler.
Given a correct implementation, we will spend a significant amount of
time designing a proper experiment to sweep the various parameters
created during the design phase. Our goal is to limit the design space
exploration to avoid an exhaustive search, as doing so is inefficient in
terms of time and does not require any interesting thought about which
parameters are most relevant. We will spend time selecting optimal sizes
for each structure based on which ones are locally optimal, and then
study interference between structures (for example, faster branch
resolution might make it harder to hide memory latency because there are
fewer threads waiting for another operation).
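The limited sweep we have in mind could be structured as a one-factor-at-a-time search rather than a full cross-product; the parameter names and the `run` callback below are placeholders for the simulator:

```python
def sweep(params, run):
    """Hold all but one parameter at its default and sweep the rest."""
    defaults = {name: values[0] for name, values in params.items()}
    results = {}
    for name, values in params.items():
        for v in values:
            cfg = dict(defaults, **{name: v})  # vary one parameter only
            results[(name, v)] = run(cfg)
    return results
```

This costs the sum of the per-parameter value counts rather than their product, leaving time to probe interference between the locally optimal settings by hand.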
After a convincing design space exploration, we will use the remaining
time to produce a clear narrative about what bottlenecks exist and
attempt to correlate them to trends in industry. Depending on time, we
would very much like to increase the complexity of the ISA to explore
more realistic workloads as we described earlier.