Project Checkpoint

Team Members: Benson Qiu and Tommy Hung

Updated Schedule

 Days  Goal
 4/20 - 4/23  Improve cache locality of optimistic concurrent cuckoo hashmap.
 Create better test suite for benchmarking.
 4/24 - 4/30  Additional optimizations for our optimistic concurrent cuckoo hashmap implementation.
 5/1 - 5/9  Parallel cuckoo hashmap using BFS to speed up insertion time.

Work Completed So Far

We have implemented the following:

  1. Serial hashmap using separate chaining.
  2. Parallel implementation of (1) using chain-level locks.
  3. Sequential hashmap using cuckoo hashing.
  4. Sequential hashmap using cuckoo hashing with cache optimizations.
  5. Optimistic concurrent cuckoo hashmap. This implementation uses key version counters incremented using atomic instructions rather than mutexes for more fine grained parllelism. It also uses a modified path discovery/eviction policy that reduces false misses and allows for lower concurrent read latency during write conflicts.

Preliminary Results

 Implementation  20M Insert Time (Seconds)  20M Read Time (Seconds)
 C++11 Unordered Map  32.91  12.95
 Sequential Cuckoo  13.93  10.37
 Parallel (Bucket-Level) Separate Chaining with 48 Threads  3.50  1.47
 Optimistic Concurrent Cuckoo with 48 Threads  21.91  1.22

We have included preliminary results of our hashmap implementations above, running our test suite on an Intel Xeon E5645 CPU. As expected, the parallel bucket-level hashmap is faster than sequential implementations. One perhaps surprising result is that our optimistic concurrent cuckoo hashmap insertion is even slower than the sequential cuckoo hashmap. This is because the optimistic concurrent cuckoo hashmap is optimized for a read-heavy workload. We sacrifice insertion throughput but improve read throughput by minimizing the chances of concurrent read-write conflicts.

We have managed to implement most of what we originally planned to achieve by the end of the semester. However, to be fair, we believe that for versions (4) and (5), there are some performance bugs or implementation details that we may be missing that is preventing us from achieving even greater speedup. Specifically, we have not been able to achieve speedup by adding tags in (4) to improve cache locality and reduce pointer dereferences. (UPDATE: We have solved this issue!)

Presentation Plan

We will create a number of graphs to visualize our benchmarking results. These graphs will mostly show the performance of different key-value store implementations as we vary the percentage of read and write requests.

Current Issues

  1. It is difficult to debug optimizations when they pass our correctness tests but do not provide additional performance.
  2. We are having trouble finding easy-to-use and accurate tools to measure cache performance. We plan on looking into perf soon, but have not had a chance to do so yet.