Current Status

We have successfully implemented a MIMD machine that supports the RISC-V RV32I instruction set, plus custom MIMD-specific instructions that fetch a thread's thread and block IDs and the block and grid dimensions. Currently, we resolve branch divergence by having each thread execute its own code independently and, as a result, track its own local copy of the PC. Memory is currently limited to global memory, and the order of memory accesses is dictated by thread number. Because memory returns in the same cycle it is requested, all specified threads execute on every cycle. We have verified the correctness of this implementation against golden models on single-threaded code and verified the correctness of the special instructions through custom test cases. We opted to expand the instruction set beyond the initial proposal of adds, loads, and stores to enable testing on compiler-generated code, which reduces the effort of generating traces by hand and instead allows us to write more natural code.
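The current divergence handling described above can be sketched as follows. This is an illustrative model only (the class and instruction names are assumptions, not our actual implementation): each thread carries its own PC, so a taken branch simply rewrites that thread's local PC and no masking or reconvergence logic is needed.

```python
# Illustrative sketch: each thread independently tracks a local PC, so
# divergent branches need no warp masking (names are hypothetical).

class Thread:
    def __init__(self, tid, entry_pc=0):
        self.tid = tid        # thread ID (exposed via the custom instruction)
        self.pc = entry_pc    # local program counter, private to this thread

    def step(self, program):
        """Execute one instruction; a taken branch rewrites the local PC."""
        op, target = program[self.pc]
        if op == "branch":
            self.pc = target  # jump: this thread diverges freely
        else:
            self.pc += 1      # any other op falls through
        return op

# Two threads start at different PCs, diverge, and land on the same PC.
program = {0: ("branch", 2), 1: ("add", None), 2: ("store", None)}
threads = [Thread(0), Thread(1, entry_pc=1)]
for t in threads:
    t.step(program)
print([t.pc for t in threads])  # [2, 2]
```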

Revised Schedule

April 15 - 17: Memory Realism

We intend to implement a more realistic memory model that adds delay to instruction fetches. This introduces some small design challenges: effectively tracking the status of each request and ensuring it is not served before it is ready. The larger complexity comes from the need to schedule threads only when their memory is ready. This will involve maintaining a scoreboard of all in-flight memory requests and selecting threads only once their requests have been fulfilled.
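A minimal sketch of the scoreboard idea, under the assumption of a fixed memory latency (the class and field names here are illustrative, not a committed design): each issued request records the cycle at which it completes, and the scheduler only selects threads whose requests have returned.

```python
# Minimal scoreboard sketch (assumed names): a thread becomes runnable
# again only after its in-flight memory request has completed.

class MemoryScoreboard:
    def __init__(self, latency):
        self.latency = latency
        self.ready_at = {}  # tid -> cycle at which the request completes

    def issue(self, tid, now):
        """Record an in-flight request for thread `tid` issued at cycle `now`."""
        self.ready_at[tid] = now + self.latency

    def runnable(self, tids, now):
        """Threads with no pending request, or whose request has returned."""
        return [t for t in tids if self.ready_at.get(t, now) <= now]

sb = MemoryScoreboard(latency=3)
sb.issue(0, now=0)                  # thread 0 issues a load at cycle 0
print(sb.runnable([0, 1], now=1))   # [1]: thread 0 is still waiting
print(sb.runnable([0, 1], now=3))   # [0, 1]: request fulfilled at cycle 3
```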

April 16 - 18: Branching Realism

We intend to make the scheduling and branching policies more realistic by limiting the number of concurrently executing threads to a fixed parameter and by broadcasting an instruction to all threads instead of having each thread fetch independently. This exposes branch divergence as an important condition to handle. We will simplify it by fetching and executing every PC requested by a thread group at a specific timestamp, for predictability and to avoid maintaining queues of needed instructions.
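The broadcast policy above can be sketched as a grouping step: each cycle, the threads of a group are bucketed by the PC they request (subject to the concurrency cap), and one fetch per distinct PC serves all threads in that bucket. This is a sketch under assumed names, not our final scheduler.

```python
# Hedged sketch of the proposed broadcast policy: one fetch per distinct PC
# requested by the thread group this cycle (names are illustrative).

from collections import defaultdict

def fetch_groups(pcs, max_threads):
    """Map each distinct requested PC to the (capped) thread indices wanting it."""
    groups = defaultdict(list)
    for tid, pc in list(enumerate(pcs))[:max_threads]:  # fixed concurrency cap
        groups[pc].append(tid)
    return dict(groups)

# Four threads, two of which diverged to PC 8: two fetches serve all four.
print(fetch_groups([4, 4, 8, 8], max_threads=4))  # {4: [0, 1], 8: [2, 3]}
```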

April 19-21: Performance Characterisation

After implementing all required functionality, we will conduct a detailed parameter sweep and collect relevant metrics: hardware thread utilisation (the number of threads executed per cycle), average runnable threads per cycle (to account for memory stalls), average memory bandwidth consumed, and frequency of thread switches.
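One way the counters for these metrics could be structured (the field names are assumptions for illustration): accumulate per-cycle samples, then divide by cycle count at the end of the run to obtain the per-cycle averages listed above.

```python
# Illustrative counter set for the planned sweep (field names are assumed):
# per-cycle samples are accumulated, then averaged over the whole run.

class PerfCounters:
    def __init__(self):
        self.cycles = 0
        self.threads_executed = 0   # numerator of hardware thread utilisation
        self.threads_runnable = 0   # runnable threads (memory stalls excluded)
        self.bytes_transferred = 0  # numerator of average memory bandwidth
        self.thread_switches = 0

    def sample(self, executed, runnable, mem_bytes, switched):
        self.cycles += 1
        self.threads_executed += executed
        self.threads_runnable += runnable
        self.bytes_transferred += mem_bytes
        self.thread_switches += switched

    def report(self):
        c = max(self.cycles, 1)  # guard against a zero-cycle run
        return {
            "util": self.threads_executed / c,
            "runnable": self.threads_runnable / c,
            "bw": self.bytes_transferred / c,
            "switch_freq": self.thread_switches / c,
        }

pc = PerfCounters()
pc.sample(executed=4, runnable=6, mem_bytes=16, switched=1)
pc.sample(executed=2, runnable=4, mem_bytes=0, switched=0)
print(pc.report())  # {'util': 3.0, 'runnable': 5.0, 'bw': 8.0, 'switch_freq': 0.5}
```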

April 21-23: Additional Functionality

Time allowing, we intend to simulate thread synchronisation (within a thread block) and caching of registers in shared memory (instead of storing them in ideal register files). Both features will improve the realism of the model and warrant further design space exploration, because shared memory will become a more important factor and synchronisation will widen the set of problems that can be handled.
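Block-level synchronisation could take the shape of a simple barrier: a thread that reaches the barrier stalls until every thread in its block has arrived. This is a speculative sketch (names and interface are assumptions), not a committed design.

```python
# Speculative sketch of block-level barrier synchronisation: a thread
# stalls at the barrier until the whole block has arrived.

class BlockBarrier:
    def __init__(self, block_size):
        self.block_size = block_size
        self.arrived = set()

    def arrive(self, tid):
        """Mark `tid` at the barrier; return True once the block may proceed."""
        self.arrived.add(tid)
        if len(self.arrived) == self.block_size:
            self.arrived.clear()  # reset for the next barrier instance
            return True
        return False              # caller must keep this thread stalled

bar = BlockBarrier(block_size=2)
print(bar.arrive(0))  # False: thread 0 waits
print(bar.arrive(1))  # True: the last arrival releases the block
```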

Meeting Deliverables

We are on track to meet all of the functionality deliverables and are optimistic that we will have time to implement some of the extra features before the deadline. We opted to swap the order of completing the model and conducting performance sweeps/implementing performance counters, because the sweeps will be more interesting on a firmer microarchitecture that implements a greater set of functions. Given how frequently the code is updated, we were concerned that performance counters added now would quickly become outdated and require regular maintenance, versus simply implementing them once on a firmer design. As a result, we do not have any intermediate plots to show (though we do have a lot of code).

Our main concerns for the remaining implementation are correctness and the development of interesting benchmarks. Correctness is a concern as we develop more complex multi-threaded code, especially code that is not race-free, because we have not implemented any synchronisation. Since the tool is meant for design space exploration (rather than as a reference model), correctness can be relaxed, but the results still need to be reasonable, and incorrectly executed instructions must not cause the program to stop before reaching interesting sections of the code. To characterise our design, we are focusing on data-parallel problems like matrix multiplication that produce correct results without any thread-to-thread communication.
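As an example of the communication-free partitioning we have in mind for the matmul benchmark (the helper names here are hypothetical, not our actual kernel code): each thread owns a disjoint set of output rows, so no two threads ever write the same location and the result is correct without synchronisation.

```python
# Hypothetical data-parallel partition for the matmul benchmark: each
# thread owns disjoint output rows, so threads never communicate or race.

def thread_rows(tid, nthreads, n):
    """Rows of the n x n output assigned to thread `tid` (round-robin)."""
    return list(range(tid, n, nthreads))

def matmul_thread(tid, nthreads, A, B, C):
    """One thread's share of C = A @ B; writes only its own rows of C."""
    n = len(A)
    for i in thread_rows(tid, nthreads, n):
        for j in range(n):
            C[i][j] = sum(A[i][k] * B[k][j] for k in range(n))

A = [[1, 2], [3, 4]]
B = [[5, 6], [7, 8]]
C = [[0, 0], [0, 0]]
for tid in range(2):  # each "thread" fills only its own rows
    matmul_thread(tid, 2, A, B, C)
print(C)  # [[19, 22], [43, 50]]
```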

Final Demonstration

We intend to present to the class the results of our design space exploration based on the figures that we believe were the most surprising/informative to us. Because we do not simulate a graphics pipeline or have rendering capabilities, we feel that a live demonstration is not an effective way to communicate the results.