Theory and practice of chunked sequences

Sequence data structures that are parallel and cache friendly

Deepsea project

Team

Overview

In this project, we address the question:

Can we design asymptotically and practically efficient data structures for sequences that can support a broad range of operations, including push/pop operations on both ends, concatenation, and split at a specified position?

In our ESA’14 paper 1, we present two new algorithms for which we use adaptations of the classic chunking technique for accelerating tree-based representations of sequences by representing sequences as trees of small, fixed-capacity arrays (i.e., chunks). We prove that our algorithms deliver tight amortized, worst-case bounds with small constant factors. We present two C++ implementations of our two algorithms and a number of experiments comparing our implementations to highly tuned and specialized sequence data structures, such as STL deque and STL rope. This work presents the first data structures for shared memory that simultaneously match the semantic and asymptotic profile of the Finger Tree of Hinze and Patterson 2, deliver strong guarantees with respect to constant-factor performance (as well as asymptotic performance), and compete well with highly tuned but more specialized data structures.

We have made available a long version of our ESA’14 article which includes appendices. In the appendix, there is additional discussion of experiments, extra experimental results, and proofs.

Source code and documentation

Our source code is hosted on a Github repository.

Documentation is available in HTML or PDF format.

Running and extending our expriments

We encourage interested parties to evaluate the findings of our empirical study.

The source code of our prototype implementation is hosted on a Github repository.

Prerequisites

To have enough room to run the experiments, your filesystem should have about 30GB of free hard-drive space and your machine at least 128GB or RAM. These space requirements are so large because some of the input data we use are huge.

We use the nix package manager to handle the details of our experimental setup. The first step is to install nix on your machine.

Setting up the environment

First, we need to download the source code.

$ git clone https://github.com/deepsea-inria/chunkedseq.git
$ git clone https://github.com/deepsea-inria/sc15-pdfs.git
$ cd chunkedseq/script

If the machine does not already store a copy of the input data, we run the following command, which will just build the binaries for the experiments and the benchmark script.

$ nix-build benchmark.nix

If it succeeds, then the nix-build command should leave behind a new symlink named result in the script folder. This results folder stores all the files generated by nix build script.

The benchmarking binaries we are going to use are now reachable from result/bench. To save time typing, let us add this folder to our $PATH.

$ export PATH=`pwd`/result/bench/:$PATH

Running the experiment

The next and final step is to run the benchmark script. This process involves downloading all the input data (unless it’s already present on the machine), running the benchmarks, and generating plots. To get started, we can issue the command to run all the experiments that were reported in the paper.

$ bench.pbench paper

It may be the case that the download of one of the input graphs fails. If so, then to recover it may be necessary to manually delete the graph and try again. If the process fails completely, please contact the team members.

To get statistical significance, you will need to take multiple samples of each data point, which you can execute the following.

$ bench.pbench paper -runs 29 -mode append

If you are interested to run only those experiments listed in the paper, then you may want to run the following command instead.

$ bench.pbench fifo,lifo,dfs,bfs,pbfs

It will take at least a few hours to complete the thirty of runs for each data point. To get results faster, but with lower statistical significance, reduce the number of runs. If, after getting the initial results, you wish to add more runs, just run again, but this time, pass the argument -mode append.

When the experiment completes, the results table should appear as a pdf file in a newly created _results folder. The latex source for the table is also generated and stored in the same folder.

References

Get the bibtex file used to generate these references.

Acar, Umut A., Arthur Charguéraud, and Mike Rainey. 2014. “Theory and Practice of Chunked Sequences.” In ESA 2014, 8737:25–36. LNCS. Springer Berlin Heidelberg. doi:10.1007/978-3-662-44777-2_3.
Charguéraud, Arthur, and Mike Rainey. 2017. Efficient Representations for Large Dynamic Sequences in ML.” ML Family Workshop. https://hal.inria.fr/hal-01669407.
Hinze, Ralf, and Ross Paterson. 2006. “Finger Trees: A Simple General-Purpose Data Structure.” Journal of Functional Programming 16 (2). New York, NY, USA: Cambridge University Press: 197–217. doi:10.1017/S0956796805005769.