Design and implementation of fork-join API: We defined and implemented the interfaces through which client code interacts with the “pool” of FJava threads. We wrote the infrastructure for scheduler, task runners and work queues (from scratch).
Work Stealing: We adapted and implemented sender initiated and receiver initiated strategies for work stealing from from this paper. We have successfully integrated and tested these strategies with our FJava framework. Additionally, we had to made modifications to these algorithms to support the sync operation of our framework. We have also analyzed the performance of these two strategies using key metrics like number of tasks stolen/delegated, overall time taken etc. Please see the section of preliminary results for further details.
Test suites and Benchmarking: We developed a bunch of test suites to evaluate performance of our framework against Java 7’s ForkJoin as well as sequential implementation for various workloads. The various workloads we chose are:
- Matrix multiplication
- “Computation heavy” map
- “Computation light” map
Metric Collection Framework: We have created a framework for metric collection built on top of parfait; parfait is an open source java monitoring library. This framework has proven extremely handy in identifying key performance metric for further analysis.
For the final demo, we will provide an indepth analysis (along with graphs) for various work stealing startegies and compare it against Java 7’s ForkJoin implementation. We will also show a live demo of our framework on different workloads; we will strive our best that it beats Java 7’s ForkJoin implementation.
Analysis of Preliminary Results:
So far we have competitive results with the Java 7 Fork Join implementation, with both of the task stealing algorithms, for all of the tasks (Matrix Multiplication, Fibonacci, Quicksort, Map, Primes).
Here are the results for a Matrix Multiplication test. The test case involves multiplicating two 2048x2048 matrices of floats. All of the tests were done on a MacBook Pro with an i7 processor.
The next image shows the results for the Sender Initiated algorithm in the Matrix Multiplication task.
FJava performs slightly better than Java 7 Fork Join. Same results hold for the receiver initiated algorithm, in the picture below.
Both implementations achieve a 2x speedup over the sequential version on a two core, hyperthreaded machine. The sequential threshold is set to a 64x64 matrix block. We have not tweaked this parameter yet.
On the next figure, we can see the amount of delegations done by each task runner when using the Sender Initiated algorithm.
Interestingly, the Sender Initiated algorithm does very few delegations, however the number of tasks completed by each processor is almost the same (differs by less than 1%). This tells us that the amount of work done by each processor is similar. One possible reason for this is that the delegations that take place occur at the very beginning of the algorithm, therefore all of the task runners acquire large tasks that generate large trees of descendant tasks.
In constrast, here are the number of steals done by each task runner when the Receiver Initiated algorithm is used:
All of the task runners had to do over 15 steals to stay busy. Possibly, the tasks that the task runners stole were not large enough to keep them busy for the entire run. However, the performance of this algorithm is the same as the Sender Initiated algorithm for the Matrix Multiplication task.
Even though FJava slightly outperforms Java 7 Fork Join on the Matrix Multiplication task, Java 7 outperforms FJava on the Primes task. This task has more work imbalance than the matrix multiplication task. Basically, it consists on determining which numbers of an array are primes. The numbers range from 1000 to 216. Here are the results for the Sender Initiated algorithm. The results for the Receiver initiated algorithm are similar.
Both implementations achieve over a 2.0x speedup over the sequential version on a 4 core machine. The sequential threshold was set to 4000. We have not tweaked this parameter yet.
There are no known issues with our framework. After performing more than a few hundred runs on different workloads, we are confident that our system doesn’t suffer from deadlocks or any other concurrency bugs.
We believe that performance still can be improved, most of our efforts will concentrate on experimenting with new algorithms for scheduling, improving locality, and measuring the effects of tweaking several parameters of the algorithms (for instance, varying the delta value of the Sender Initiated algorithm, varying the sequential threshold, and others).
The metric collection framework uses a Stopwatch from Guava’s library for timing short methods, such as getTask and stealTask. Unfortunately, the timing information from the stopwatch doesn’t seem consistent with other key metrics. Hence, we need to investigate the root cause and fix it appropriately.
You can see the schedule in the home page. Currently we’re ahead of schedule. :D