I picked my project based on the approach that I wanted to use, rather than the problem that I wanted to solve. In particular, I like lockfree (LF) data structures and I took this class partly because I want a chance to practice writing them. But my experience is that, because of the danger of use-after-free, they're not easy or practical unless you already have special memory system support.
In light of this, I think a memory allocator might be an interesting place to try a LF approach because it might be possible to arrange for guarantees that make a simpler LF algorithm feasible. Or it might be possible to arrange memory system support in a simpler way.
An allocator splits a shared address space into chunks of variable size in order to assign ownership of chunks to threads.
I think that you can efficiently parallelize an allocator by reducing the problem to one of non-shared allocations or of non-variable sizes. Thread-local caches (aka data-parallel malloc) solve the former problem and slab allocation (which can be done with an LF stack) solves the latter.
I think that a simple LF allocator could be written by combining those two approaches into a global fixed-sized allocator that distributes caches to threads.
But as I mentioned, what really interests me is the chance to use a more interesting LF algorithm to solve the general problem directly.
In particular, the sequential program which I'm currently aiming to parallelize is a very basic allocator implemented as follows:
I want to support parallel execution by implementing as much of the same logic as I can without using any locks.
(EDIT: I've realized that there's actually a potential use for this: a chunk that was allocated from a cache belonging to thread A might be freed by thread B. Keeping that chunk in the B's cache would be possible, but it would complicate coalescing and is liable to create false sharing if the ends of the chunk aren't on cache line boundaries. It would be optimal if B could do a LF insertion of that chunk back into A's cache allocator - which now qualifies as a shared, variable-size chunk allocator. If you don't need to support deletion, that's still really easy and cheap. I might actually go in this direction, if this proves to be a novel idea. Is there any reason why you might want to support deletion/stealing from another thread's cache?)
(EDIT2: *If* this pans out, it'll be possible to write a wrapper that can turn any sequential allocator into a lockfree allocator. New project direction? Yes.)
LF lists are a solved problem (although doubly-linked list algorithms are real hard, and I might have to live without them), and their designers claim good scalability. I think that the main challenge will be attempting to develop a variant of those algorithms to ensure consistent access to boundary tags as well.
This has a risk of catastrophic failure. It's easy to write a correct CUDA program and hard to make it fast. On the other hand, it's hard just to write correct LF code, and I might wind up with nothing to show. I want to attempt a complete design as fast as possible so that I can switch to something simpler if this proves to be out of my league.
Because LF algorithms are so intricate, I don't think there's as much room to fiddle with my code to improve performance. I'll only get one shot to find an appropriate algorithm, and my scaling will probably be limited by it.
I've already done work on this as part of a hobby project. I have a testing framework, a locking segregated-lists allocator, and a LF slab allocator wrapper around it.
It's not fun to just implement someone else's design, but I know of several LF allocators and data structures which claim to scale well. I'll spend part of a week working independently, and then I'll compare my ideas to existing implementations.
Doubly-linked lists (ignore patent; see collected references):
My goal is to get practice using a specific parallel programming technique by shoehorning it into a systems problem - and not, necessarily, to solve the problem in the best way.
Because of the difficulty involved, I don't think that I can aim for perfect scaling, and especially not for impressive overall performance (global lists aren't very cache-friendly compared to thread-local caches). I'll consider the project a success if I have a correct lockfree allocator. If it turns out to perform poorly in comparison to an allocator using a different technique, then I think that it would still be a worthwhile investigation to analyze exactly why that is.
I'm really open to someone saying "this isn't the intent of the assignment". If it doesn't sound that interesting, then I'd like to hear that too.
I plan to produce pseudocode, scaling graphs, and potentially embarrassing comparisons to other allocators.
If I happen to finish early, I'd like to actually attempt a fast, scalable allocator, perhaps using a simpler design. Alternatively: there's a really interesting paper about a (CUDA) allocator that's able to exploit SIMD.
I plan to work on 4 and 6 core CPUs. I'm not sure how an allocator would even work on CUDA. It might be neat to target Blacklight-level parallelism, but dealing with 8 hour queues is miserable.
|Week||What We Plan To Do|
|Apr 1-7||Use wisdom gained from time travel to win at stocks.|
|Apr 8-14||Get sequential code back in shape. Design.|
|Apr 15-21||Finish design or change project. Read papers. Begin implementation?|
|Apr 22-28||Exam2 & other classes. Implementation + debug.|
|Apr 29-May 5||Inevitable rewrite.|
|May 6-11||Debug + optimization.|