FJava: Fork Join for Java

By: Ankit Agarwal and Iosef Kaver


What is FJava?

FJava is a high level fork join framework for the Java programming language.

Our framework outperforms the native Java 8 Fork Join framework under some workloads, and gets competitive results for others. With our implementation, we demonstrate that private deques are an effective work stealing algorithm for a Fork Join framework.

FJava also includes an instrumentation feature, that provides metrics that allow you to fine tune your program easily for your machine.

In short, FJava allows you to run divide and conquer tasks in parallel with minimal changes to your sequential code.

Awesome.

Instructors: The final report is here.


Here is an example:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33

public class FJavaQuickSort extends FJavaTask {

  private long [] array;
  private int left;
  private int right;

  public FJavaQuickSort(long [] array, int left, int right) {
    this.array = array;
    this.left = left;
    this.right = right;
  }

  @Override
  public void compute() {
    if(right <= left) return;

    if(right - left <= Definitions.QUICKSORT_SEQ_THRESHOLD) {
      Arrays.sort(array, left, right+1);
      return;
    }
    int mid = partition();
    new FJavaQuickSort(array, left, mid-1).runAsync(this);
    new FJavaQuickSort(array, mid+1, right).runSync(this);
    sync();
  }

  public static void sort(long[] array, int left, int right) {
    FJavaPool pool = FJavaPoolFactory.getInstance().createPool();
    FJavaQuickSort task = new FJavaQuickSort(array, left, right);
    pool.run(task);
  }
}

Why is it cool?

Well, you can run any divide and conquer algorithm in parallel now.

This means that you can get almost a linear speedup on a multi core machine for most divide and conquer problems almost for free.

If you want to get fancy the framework allows you to change different settings, such as the task stealing algorithm that the framework uses. Currently we support the Sender Initiated and the Receiver Initiated private deques algorithms, as well as a simple concurrent deques algorithm.

Awesome.


Goals and Challenges

  • Implement a correct, fork join framework for the Java programming language.
  • Implement code to report metrics of the framework. For instance, total job time, statistics about the spawned job execution times, statistics about the idleness of the threads, and others. This information can help the user to tune the framework properly for the machine (for example, using an optimal threshold level for stopping parallelism).
  • Design, implement and compare at least 3 work stealing strategies.
  • Compare the system to the current Java’s implementation of ForkJoinPool.
  • Design and implement several test suites to measure the performance and correctness of the framework.

Preliminary results

The next figure shows the speedup achieved by FJava and Java Fork Join relative to the sequential version of the code for several problems.

  • Primes: Call isPrime for an array of 5,000,000 numbers.
  • Matrix Multiplication: Multiply two 2048x2048 matrices recursively.
  • Fibonacci: Solve fibonacci(50) recursively.
  • QuickSort: Sort 10,000,000 longs using QuickSort.
  • LU Decomposition: Decompose a 4096x4096 matrix.

The next figure shows the speedup achieved by FJava and Java Fork Join relative to the sequential version of the code for the Matrix Multiplication problem, as we increase the number of cores. We can see the speedup grows linearly until we reach 12 cores for both frameworks.


Lastly, since FJava uses private deques, it performs better when the sequential threshold used is small.

The sequential threshold indicates how small does the problem need to be before we start solving the problem directly instead of creating more child tasks. With a smaller sequential threshold, the size of the problems that will be solved sequentially is smaller. In some scenarios, a smaller sequential threshold is desirable since more work units are generated which leads to a better work balance. In other cases, a larger sequential threshold is more optimal since the overhead of generating a large number of child tasks is more than the time it takes to solve a problem sequentially. Thus, there is a tradeoff in choosing the right value of the sequential threshold.

The reason why private deques perform better with a smaller sequential threshold is because idle workers must wait for busy workers to stop working on a task before the busy worker can answer steal requests. The larger the sequential threshold, the more time busy workers will spend working on tasks instead of answering steal requests. Therefore, large sequential thresholds can lead to work imbalance when private deques are used.

To solve this problem we added to our API a tryLoadBalance call. When used appropriately, FJava shows competitive performance to Java 8 Fork Join even when the sequential threshold is large. The user is responsible for making sure they call tryLoadBalance periodically during their long computations

The next figure shows the speedup achieved with and without the tryLoadBalance call, for different values of the sequential threshold T, in the Primes problem described above, in a 12 core machine.



Demo


We will explain the general design and implementation of our framework. Also, we will demonstrate several graphs that compares the performance of FJava to Java 8 native Fork Join. Lastly, we will do a quick live demo that demonstrates that it is easy to parallelize sequential divide and conquer problems with this framework, briefly showcasing the instrumentation capabilities of the framework.

Learn more about FJava in the Blog Archives