C Library for fork-join style parallelism

Team Members: Ayush Agrawal (ayushagr@andrew.cmu.edu)

Project Proposal

Checkpoint Report

Final Project Report

Summary:

Fork-join parallelism is one of the ways of describing parallelism where execution branches off in parallel at designated points in the program, to join (merge) at a subsequent point and resume sequential execution[1]. It is particularly useful when it comes to parallelizing recursive algorithms. Silk is a library written in C that provides the users with an interface to parallelize their recursive algorithms using the fork-join model. The library is inspired from the "cilk" library implemented at MIT[2].

Motivation:

I plan to write a C library that provides support for fork-join pattern of parallelism to:

Background:

Parallel libraries/frameworks come in all flavors. Some are tailored specific to a domain while others are meant to be more general purpose. They differ significantly in their ease of use and the amount of control they provide to the user as well. A general workflow of specifying parallelism to mapping it onto the hardware consists of the following four steps:
  1. Decomposition
  2. Assignment
  3. Orchestration
  4. Mapping
The meaning of the four steps are explained in this section along with a brief description of what silk does in regard with each of them. Detailed explanation of how silk each of them can be found in Section II of this paper. Decomposition refers to how you break down a problem into smaller subproblems that can be dealt with in parallel. The subproblems can have dependencies on each other and they need not all be of the same size. Since Silk is meant to be a general purpose library for parallelizing recursive algorithms, it leaves the decomposition of the problem to the user of the library.

Assignment refers to assigning the subproblems created during decomposition to workers. A worker is an execution context which can be a process, thread, task, event, etc. Assignment of work can be left to the user of the library or can be conducted by the library itself or can be a mix of both. For Silk, a worker is a POSIX thread[4]. I choose to use posix threads for their compatibility and wide support. Note that many platforms offer certain extensions to the POSIX threads API. Silk refrains from using any such extension. Silk performs assignment of work on its own. It uses dynamic scheduling to do so[3].

Orchestration refers to the communication between the workers to ensure correctness and efficiency. Workers often need to talk to each other to signal things like availability of work, state of shared queues, end of a phase of computation, etc. The current version of Silk uses the implementation of mutex provided by the pthread library to implement synchronization among workers. Any another form of orchestration required to implement an algorithm is left for the user of the library to implement and is not part of the library itself.

Mapping refers to mapping the workers to actual hardware units. Most libraries won’t interface with the hardware directly and instead interface with the operating system and provide a mapping between worker and an operating system construct. Since silk uses POSIX threads, it simply uses embodies the mapping used by the pthread library which at the time of writing this paper meant a 1:1 mapping between user level threads and kernel threads.

Since there is an overhead associated with managing the workers, during the decomposition phase, the user of the library may wish to stop subdividing a problem when the size of the problem becomes less than a certain threshold and start using a sequential algorithm for problems of that size. This is necessary from the point of efficiency and is not a requirement by the library. The choice of the threshold is left to the user as well.

My Schedule:

Period Work Done
11/01/19 - 11/07/19 Read the two research papers specified in the project proposal
11/08/19 - 11/14/19 Spend time coming up with the interface to be exposed by the library
11/15/19 - 11/21/19 Implemented the work queue data structure and wrote quicksort using the Silk interface
11/20/19 - 11/24/19 Made the signle work queue implementation stable
11/25/19 - 11/28/19 Extended the implementation to a distributed work queue
11/28/19 - 12/02/19 Implemented child first (continuation stealing)
12/02/19 - 12/05/19 Wrote matrix vector multiplication using Silk and gathered stats
12/06/19 - 12/09/19 Wrote the final report and made the poster for the presentation

Implementation Details:

Four different versions of the Silk library were implemented:
  1. CoFSQ - Continuation First Single Work Queue
  2. CoFDQ - Continuation First Distributed Work Queue
  3. ChFSQ - Child First Single Work Queue
  4. ChFDQ - Child First Distributed Work Queue
The implementations details can be found in the final report.

Results:

A detailed description of the results can be found in the final report.

To summarize, the distributed work queue version is more performant than the single work queue version since there is no single queue which ends up being a point of contention. And the child first version has a shorter maximum queue length as compared to the continuation first version. A number of graphs created for the purpose of highlighting this summary are as follows:

Challeneges:

A lot of work went into getting the distribtued work queue to be both efficient and correct. I did try to fine tune the random victim selection to yield good results but I feel there is still room to improve more.

The most challenging part of this project according to me was thinking about what should go inside the silk_join routine - what it means to join? Once the semantics of join became clear, other parts of the code just fell into place. I believe the semantics of silk_join that I ended up implementing differs a lot from the standard notion of a join operation. I encourage one to go through the final report where I have talked about it in detail before attempting to use this library.

Future Work:

As mentioned in challenges, I feel there is scope to improve the distributed work queue implementation.

Also when the Silk library calls the user callback, it passes to it the id of the silk worker. This behavior can be extended to pass more state then just the id and then the user be allowed to modify some of that information to dynamically configure the behavior of the Silk runtime.

References:

  1. https://en.wikipedia.org/wiki/Fork-join_model
  2. Blumofe, R. D., & Leiserson, C. E. (1998). Space-Efficient Scheduling of Multithreaded Computations. SIAM Journal on Computing, 27(1), 202 - 229
  3. Blumofe, R. D., & Leiserson, C. E. (1999). Scheduling Multithreaded Computations by Work Stealing. Journal of the ACM, 46(5), 720 - 748
  4. https://en.wikipedia.org/wiki/POSIX_Threads
  5. http://web.eecs.utk.edu/~huangj/cs360/360/notes/Setjmp/lecture.html