Distributed Key-Value Database Server on Raspberry Pis

A 15-418 Project by Sean Klein (smklein) and Xiaofan Li (xli2)


We are going to use properties of flash memory and consistent hashing to parallelize a webserver and database across a distributed system of Raspberry Pis.



Modern computers are often significantly impeded by issues stemming from memory access. As most systems use disk for large quantities of storage, there is a penalty in both random-read latency and power consumption. In a distributed system, this means that accessing data sorted by a hashed key can become a bottleneck for client operations. Additionally, the wattage and price required to power a large server rack can discourage those who want to host their own servers. Low-power, diskless nodes such as Raspberry Pis (with 700 MHz ARM CPUs) address many of these problems. By running a large number of small nodes in parallel, memory bandwidth comparable to a typical server rack can be achieved at only a fraction of the cost.


  1. Each Pi has a full OS, which gives it more computation capability than a disk controller. As a consequence, each Pi can be responsible for its portion of the data, with both client requests and data maintenance handled locally. Instead of prioritizing load balancing, we are going to focus our efforts on the computation capabilities of a parallel Pi system. This will hopefully include creating interesting parallelizable "compute" tasks, which can take advantage of the GPUs on standard Pis.
  2. Pis are flexible with memory when setting up a distributed system. In our prototype, a distributed, shared-nothing architecture is used to vastly improve scalability by reducing dependencies. However, a NUMA-style system with a centralized disk would be a completely feasible alternative to solve the problem of having too little storage (by default, there is 8 GB of storage per Pi).
  3. Pis are very portable, and they consume little power. Their small size can be useful for developing low-cost personal servers with integrated functionality. Also, because each Pi runs from a 5 V supply and draws at most ~2.5 W, they are very power efficient compared to traditional server blades.
  4. Pis are cheap! An 8-node Pi server costs approximately $350, whereas an 8-core server costs at least several thousand dollars.


    Some of the main challenges related to this project might be:
  1. Setting up the hardware. We need to understand how basic CPU usage + networking of the Raspberry Pis work, and also need to make sure that they can talk to each other on a local network before we connect to any sort of larger network. Also, as mentioned above, the specific topology of the network might affect the efficiency of the entire system. Thus there are some very interesting engineering choices that have to be made to ensure that we are maximizing both power and runtime efficiencies.
  2. The software implementation of the master/worker server is a non-trivial effort. The challenge of how to best balance the work between the workers is even more challenging with non-replicated database servers.
  3. The implementation of the consistent hashing ring is also challenging. This ties into the challenge of having very limited bandwidth on this hardware. Understanding the concept of consistent hashing and being able to implement and test it on real hardware are among the most difficult aspects of this project.
  4. Power consumption and runtime of computation tasks are two metrics worth measuring. Right now we are thinking about directly measuring the total current across each Pi and then calculating the overall power consumption of the entire system. In addition to power, measuring performance could also be tricky. We plan to write test cases and timing code using golang's built-in testing packages to time our server under large numbers of requests.
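To make the timing approach concrete, here is a minimal sketch of how Go's built-in testing package can time a batch of requests outside of `go test`. `fakeGet` is a hypothetical stand-in for our real client call, which differs in name and signature:

```go
package main

import (
	"fmt"
	"testing"
	"time"
)

// fakeGet is a hypothetical stand-in for a client GET request to the
// master node; the sleep simulates network + lookup latency.
func fakeGet(key string) (string, error) {
	time.Sleep(time.Microsecond)
	return "value", nil
}

func main() {
	// testing.Benchmark reuses the stdlib benchmark harness to pick a
	// suitable iteration count and report time per operation.
	res := testing.Benchmark(func(b *testing.B) {
		for i := 0; i < b.N; i++ {
			fakeGet("balance:alice")
		}
	})
	fmt.Println("ns/op:", res.NsPerOp())
}
```

In the real test suite, the loop body would issue actual RPCs against the master, so the reported ns/op captures end-to-end request latency under load.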


Pi Setup



A basic system capable of communication

We want to build a scalable, parallel server made of Raspberry Pis, each with its own CPU (700 MHz ARM) and storage (8 GB microSD). These hardware specs are suitable for parallel computation in a distributed environment. This server will not only look up values by key but also perform computation on the key-value pairs. This attempts to mimic a real-world webapp. For example, clients of a bank might want to know how much money they have at the current stock price, or how much they owe the bank on their student loans with compound interest. To recreate these problems, we will perform a non-trivial computation on the looked-up data. On the software side, we will need to create functions which can get information from the server, place information on the server, or compute new information using existing server information.
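As a sketch of what these functions might look like, the following Go snippet shows a toy in-memory store with get, put, and compute operations. All names, and the compound-interest example, are illustrative rather than our server's actual API:

```go
package main

import (
	"fmt"
	"sync"
)

// Store sketches a single worker's key-value state, guarded by a
// RWMutex so concurrent client requests are safe.
type Store struct {
	mu   sync.RWMutex
	data map[string]float64
}

func NewStore() *Store {
	return &Store{data: make(map[string]float64)}
}

// Put stores a value under a key.
func (s *Store) Put(key string, val float64) {
	s.mu.Lock()
	defer s.mu.Unlock()
	s.data[key] = val
}

// Get retrieves a value, reporting whether the key exists.
func (s *Store) Get(key string) (float64, bool) {
	s.mu.RLock()
	defer s.mu.RUnlock()
	v, ok := s.data[key]
	return v, ok
}

// Compute looks up a value and applies a function to it locally on
// the worker, returning the derived result.
func (s *Store) Compute(key string, f func(float64) float64) (float64, bool) {
	v, ok := s.Get(key)
	if !ok {
		return 0, false
	}
	return f(v), true
}

func main() {
	s := NewStore()
	s.Put("loan:alice", 1000)
	// Compound 5% interest over 3 periods, computed on the worker.
	owed, _ := s.Compute("loan:alice", func(v float64) float64 {
		for i := 0; i < 3; i++ {
			v *= 1.05
		}
		return v
	})
	fmt.Printf("%.2f\n", owed)
}
```

The point of Compute is that the derived value is produced on the worker that owns the data, so only the small result crosses the network.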

Lookups and distributed computation

Ideally, the client will talk to one dedicated master node in the server network, and this master will decide where to forward the request. The master forwards the request to one of the worker databases in the network and then waits for other requests. The worker, upon receiving a look-up-with-computation request, looks up the value based on the key. Since the entire database is partitioned among many worker nodes, the look-up time should be significantly reduced. Afterwards, the computation can be done locally on the worker's CPU. This is essentially the heart of the project: having a working version of a distributed server implemented with approximately 5 Raspberry Pis, using consistent hashing for database information.
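A minimal Go sketch of the consistent-hashing lookup the master might use follows. The hash function choice and worker names are illustrative, and a production ring would also place several virtual nodes per worker to balance load:

```go
package main

import (
	"fmt"
	"hash/fnv"
	"sort"
)

// Ring hashes worker IDs onto a circle; a key is routed to the first
// worker clockwise from the key's own hash position.
type Ring struct {
	hashes  []uint32          // sorted positions on the circle
	workers map[uint32]string // position -> worker ID
}

func hash32(s string) uint32 {
	h := fnv.New32a()
	h.Write([]byte(s))
	return h.Sum32()
}

func NewRing(workers []string) *Ring {
	r := &Ring{workers: make(map[uint32]string)}
	for _, w := range workers {
		h := hash32(w)
		r.hashes = append(r.hashes, h)
		r.workers[h] = w
	}
	sort.Slice(r.hashes, func(i, j int) bool { return r.hashes[i] < r.hashes[j] })
	return r
}

// Lookup returns the worker responsible for key.
func (r *Ring) Lookup(key string) string {
	h := hash32(key)
	// Binary-search for the first worker position at or after h.
	i := sort.Search(len(r.hashes), func(i int) bool { return r.hashes[i] >= h })
	if i == len(r.hashes) {
		i = 0 // wrap around the circle
	}
	return r.workers[r.hashes[i]]
}

func main() {
	ring := NewRing([]string{"pi-1", "pi-2", "pi-3", "pi-4", "pi-5"})
	fmt.Println(ring.Lookup("account:alice"))
}
```

The appeal of this scheme is that adding or removing a worker only remaps the keys between that worker and its neighbor on the circle, rather than rehashing the whole database.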


GPU-based computation:

As more of our framework has fallen into place, we have decided that moving worker tasks to the pi's GPU is a more interesting task in the context of this class than dynamic load balancing.


Our project is primarily focused on the construction of a parallel, balanced RPC system capable of passing certain benchmarks. We will display these benchmarks and discuss the implications of the results.


We are hesitant to make any predictions about absolute performance at this point, as there may be additional overheads beyond the bandwidth values quoted by the Raspberry Pi and microSD manufacturers. Rather, we can compare the distributed system to a smaller version of itself -- ideally, we should see that a distributed system with four workers under heavy load is ~3-4 times faster than an identical system with only one worker. This can be a guiding goal for removing bottlenecks between nodes, and for finding an appropriate task to balance computation and data-lookup time.


As we mentioned above, our hardware has been chosen for a variety of reasons -- it's easy to parallelize, flash memory is faster than disk for random lookups, Pis are cheap, and they consume minimal power (approximately 3.5 W each). We will be using the Go programming language to create this webserver, as it is well suited for thread-safe, distributed systems.


Our test bench tries to account for a variety of factors in our system, in order to discover both weaknesses and strengths. Some factors we considered include the following:

Currently, we have three types of tests:

As we can see in the chart above, the latency for multiple lightweight accesses to the Pis is less than 100 ms in all test cases.

As we can see here, in the general computation case, the performance on Pis is worse than on traditional CPUs.

However, when viewed as "performance/watt", the Pis are actually doing much better than their server counterparts.

Additionally, "performance/dollar" also shows the Pis doing better than the Andrew machines.

That being said, the Pis are not infallible -- they are limited to 17 MB/s of bandwidth.


Each Pi draws 0.3 A at 5 V over microUSB when running tasks; thus, each Pi consumes 1.5 W. A system of one master and five workers will consume about 10 W. According to http://www.dell.com/us/business/p/precision-t3500/pd , the cluster machines we are comparing against consume 525 W per machine. This means that as long as we can achieve 1/50th of the performance of one physical machine, the system of Pis is actually more cost efficient than running one large server.


We can infer different pieces of information from each of our tests.

Observing the simple GET/POST request test (basic test), we can see that our distributed Pi system achieves a very reasonable response time (typically under 50 ms per request).

The hash test shows us a lot of information, especially when viewed through a variety of lenses. We can see (as we expected) that the Pis are slower to finish computationally bound tasks. That being said, when viewed as "speed per watt" or "speed per dollar", the Pis actually excel compared to the traditional Andrew machines. Observing the Pi system by itself, we can also see that the compute tasks show a linear speedup when increasing either (1) the number of clients or (2) the number of worker nodes, as would be expected.

The bandwidth test identified a problem with our system -- a bottleneck on the master node, at 17 MB/second. As we can see from the last bar graph above, the average request time for uploading a 1 MB image increases drastically when this cutoff is crossed.


Week: 4/6 to 4/12
Plan: Purchase hardware components, assemble the parts of the distributed webserver, and achieve a simple "ping" communication between Raspberry Pis in a distributed system, connected by a networking switch, contacting static IP addresses.
Reality: Successful client-server connection achieved using golang on a Raspberry Pi with a DHCP-assigned IP address. We plan to integrate more Pis, and also attempt to use static IP addresses. Currently, one of our Pis was configured with a static IP, but it ended up becoming bricked.

Week: 4/13 to 4/19
Plan: Transfer communication into code. Additionally, create a startup mechanism for workers and masters to boot up and register with one another. After doing so, create functions to "get", "store", and "compute" using information stored on worker nodes.
Reality: Successfully created "get" and "store" tasks within a client - master - worker RPC framework. Tested on both unix.andrew machines and Pis. Some bugs regarding dynamic IP addresses remain, but there are workarounds. Additionally, our compute task is somewhat limited for now, but easy to expand, as a large amount of time has been spent refining the RPC framework between all three types of nodes in this system.

Week: 4/20 to 4/26
Plan: Make code from the previous week impervious to heavy-duty loads -- reduce potential race conditions. After doing so, run some basic benchmarking tests, using some client-side tasks. Compare the performance under a variety of conditions (heavy computation, heavy database, heavy networking) running the system with a variable number of workers. If time permits, compare the system to a "real" webserver, using the test suite.
Reality: We have implemented a testing framework which randomly generates GET/PUT/HASH/PICT tasks with certain probabilities. We have run very basic versions of the test on both the Pis and the Andrew machines, but we need to collect more results.

Week: 4/27 to 5/3
Plan: If time permits, and the system has been successful thus far, implement a GPU-based image processing task which can be parallelized across multiple workers.
Reality: We implemented a mechanism for clients to store images on the Pis, but actually processing the image and returning the rendered result to the client has turned out to be excessively tedious. The issue is primarily with the Pi's access to the GPU (OpenGL ES 2.0), which was not simple enough to configure in time. However, we have figured out how to run arbitrary C code from the Pis, which opens up the capabilities of the webserver for later development.

Week: 5/4 to 5/9
Plan: Add polish to the system, clean code, finish previous tasks, and prep for the demo. Also, finals.
Reality: Done!

Week: After 418...
We have actually received a SURG grant for our project, and plan to continue work on the database over the summer. Plans include adding a WAL, and periodically updating the file system to preserve durability. Furthermore, we would like to improve the consistent hashing ring to make it possible to add and remove workers easily. If time permits, implementing a PAXOS-style distributed replication scheme for each worker would be an interesting challenge as well!


The rationale behind implementing the image-transfer request (besides testing heavy bandwidth) is to enable GPU computation on the Pis, further increasing parallelism. However, this functionality has not been fully merged with the current system. We have a program that renders a triangle with the client's image and saves the rendered result in nonvolatile memory, to be sent back. We are currently working on displaying the returned image, as well as identifying tasks better suited to the GPU resources available.
