We are going to implement and benchmark several algorithms for a high performance, concurrent, in-memory hashmap. Our code will be tested on an Intel Xeon Phi coprocessor (Latedays cluster) and an Intel Xeon Processor E5-2680 (Unix-6 Machine).
Fan et. al1 describes a set of algorithmic improvements to Memcached, an in-memory key-value store intended for use in speeding up dynamic web applications by alleviating database load. We plan to use several ideas from the paper for our hashmap implementation2.
Chained HashMaps (Figure 1) contain an array of buckets, where each bucket is a pointer to a linked list chain. Separate chaining is relatively easy to implement and provides expected constant-time insertions and lookups. The "next" pointers for each node in the linked list have poor cache locality, and using a pointer for each key-value entry requires additional memory overhead. Separate chaining is the algorithm used in Memcached.
Cuckoo HashMaps hash a key to two possible buckets. In Figure 2, each bucket is 2-way set associative, so a key can be inserted in one of four possible slots across two buckets. If all slots are full, a key will be randomly evicted and rehashed. Cuckoo HashMaps can be represented as a contiguous array of pointers to a HashEntry struct of key-value pairs. Compared to separate chaining, cuckoo hashing benefits from better cache locality, is more memory efficient because it contains no pointers, and has worst-case constant-time lookups.
Optimistic Concurrent Cuckoo HashMaps (Figure 3) have several optimizations for read-heavy workloads built onto the basic cuckoo hashing scheme. Each bucket is 4-way set associative, with each slot containing a HashPointer struct that has a tag and a pointer to the HashEntry. After generating two hash values for a key, we apply another hash to the hash values to determine the tag value. When reading or writing keys, we only make a pointer dereference if the tag values match. A HashPointer struct requires 16 bytes of memory on a 64-bit system, so we make the buckets 4-way set associative to achieve high cache locality on a 64-byte cache line. A key-version counter array is used to provide concurrent lock free reads. With 4-way set associative buckets, we hope to achieve 90% space efficiency.
Using the C++ std::unordered_map as a baseline, we will iteratively implement several hashmap algorithms, with a focus on maximizing concurrent performance under different read/write workloads. We will also interested in measuring memory efficiency of our implementations.
Our final implementation will hopefully parallelize well to concurrent, read-heavy workloads and be much more space efficient than a basic separate chaining implementation. The optimistic concurrent cuckoo hashmap will use atomic hardware instructions for lock-free reads, and we will choose a data structure that maximizes cache locality and space efficiency.
Concurrently accessing and modifying a shared data structure is inherently challenging and prone to concurrency issues such as deadlock and race condtions. We will also design a good test suite that accurately profiles how different hashmap implementations perform under various workloads. The throughput depends not only on percentage of reads and writes, but also on how the concurrent requests are interleaved. Simultaneous reads and writes with keys that hash to the same buckets will naturally generate more contention. We hope to capture this in our test cases.
We will be using Intel Xeon Processor E5-2680 from the Unix machines and the Intel Xeon Phi coprocessor from the Latedays cluster.
We plan to create speedup graphs for all of our implementations. For the implementations involving cuckoo hashing and optimistic concurrent cuckoo hashing, we will also see how speedup varies as we change the workload (percentage of read/write requests). By the time we reach our final implementation with optimistic cuckoo hashing, we hope to achieve a 10x speedup over the sequential version with 97.5% read requests and 2.5% write requests on the Latedays cluster. This is based on metrics from Fan et. al.1 The system that they used is similar to the Latedays cluster.
Our project will be implemented in C++ and should work on a variety of platforms. We will use the Xeon Phi on the Latedays cluster for final testing because it provides an isolated and consistent testing environment. Also, the Xeon Phi is one of the more powerful resources we have access to, allowing us to determine the maximum throughput that we can achieve.
|4/3 - 4/9||Read literature; implement sequential key-value store|
|4/10 - 4/16||Implement coarse and fine grained key-value store|
|4/17 - 4/23||Implement cuckoo hashing|
|4/24 - 4/30||Begin optimistic cuckoo hashing implementation|
|5/1 - 5/9||Finish optimistic cuckoo hashing implementation|