A 15-418 Project by Sean Klein (smklein) and Xiaofan Li (xli2)
We are going to use properties of flash memory and consistent hashing to parallelize a webserver and database across a distributed system of Raspberry Pis.
Modern computers are often significantly impeded by issues stemming from memory access. As most systems use disk for large quantities of storage, there is a penalty in both random read latency and power consumption. In a distributed system, this means that accessing data stored by a hashed key can become a bottleneck for client operations. Additionally, the wattage and price required to power a large server rack can be discouraging to those who want to host their own servers. Low-power, diskless nodes like Raspberry Pis (with 700 MHz ARM CPUs) can address many of these problems. By running a large quantity of small nodes in parallel, memory bandwidth comparable to a typical server rack can be achieved at a fraction of the cost.
We want to build a scalable and parallel server made of Raspberry Pis, each with its own CPU (700 MHz ARM) and storage (8 GB microSD). These hardware specs are suitable for parallel computation in a distributed environment. This server will not only look up values based on keys but also perform computation on the key/value pairs, mimicking a real-world webapp. For example, clients of a bank might want to know the value of their holdings at current stock prices, or how much they owe on a student loan with compound interest. To recreate these problems, we will perform a non-trivial computation on the looked-up data. On the software side, we will need to create functions which can get information from the server, place information on the server, or compute new information using existing server information.
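The three client-facing operations can be sketched as below. This is a minimal illustrative sketch, not the project's actual API: the names `Store`, `Get`, `Put`, and `Compute` are assumptions, and the compound-interest closure stands in for whatever non-trivial computation a worker would run locally.

```go
package main

import (
	"fmt"
	"math"
	"sync"
)

// Store is a hypothetical worker-side key/value store. A real worker would
// expose these methods over RPC rather than as local calls.
type Store struct {
	mu   sync.RWMutex
	data map[string]float64
}

func NewStore() *Store { return &Store{data: make(map[string]float64)} }

// Put places information on the server.
func (s *Store) Put(key string, val float64) {
	s.mu.Lock()
	defer s.mu.Unlock()
	s.data[key] = val
}

// Get retrieves information from the server.
func (s *Store) Get(key string) (float64, bool) {
	s.mu.RLock()
	defer s.mu.RUnlock()
	v, ok := s.data[key]
	return v, ok
}

// Compute looks up a value and applies a function locally on the worker,
// so only the result (not the raw data) crosses the network.
func (s *Store) Compute(key string, f func(float64) float64) (float64, bool) {
	v, ok := s.Get(key)
	if !ok {
		return 0, false
	}
	return f(v), true
}

func main() {
	s := NewStore()
	s.Put("loan:alice", 1000.0)
	// Example computation: 5% compound interest over 10 years.
	owed, _ := s.Compute("loan:alice", func(p float64) float64 {
		return p * math.Pow(1.05, 10)
	})
	fmt.Printf("alice owes %.2f\n", owed)
}
```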
Ideally, the client will talk to one dedicated master node in the server network, and this master will decide where to forward the request. The master node will forward the request to one of the worker databases in the network and then wait for other requests. The worker, upon receiving a look-up-with-computation request, will look up the value based on the key. Since the entire database is parallelized among many worker nodes, the look-up time should be significantly reduced. Afterwards, the computation can be done locally on the CPU of the worker. This is essentially the heart of the project: having a working version of a distributed server implemented with approximately 5 Raspberry Pis, using consistent hashing for database information.
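The master's forwarding decision can be sketched with a bare-bones consistent-hashing ring. This is an assumption-laden illustration (the names `Ring` and `Owner` and the CRC32 hash are our choices here, and a production ring would typically add virtual nodes for balance), but it shows the core idea: workers and keys hash onto the same ring, and each key belongs to the first worker clockwise from its hash.

```go
package main

import (
	"fmt"
	"hash/crc32"
	"sort"
)

// Ring maps keys to workers via consistent hashing: each worker name is
// hashed onto a ring of uint32 positions, and a key is owned by the first
// worker at or after the key's own hash (wrapping around at the end).
type Ring struct {
	hashes  []uint32          // sorted worker positions on the ring
	workers map[uint32]string // position -> worker name
}

func NewRing(names []string) *Ring {
	r := &Ring{workers: make(map[uint32]string)}
	for _, n := range names {
		h := crc32.ChecksumIEEE([]byte(n))
		r.hashes = append(r.hashes, h)
		r.workers[h] = n
	}
	sort.Slice(r.hashes, func(i, j int) bool { return r.hashes[i] < r.hashes[j] })
	return r
}

// Owner returns the worker responsible for key; the master forwards the
// client's request to this node.
func (r *Ring) Owner(key string) string {
	h := crc32.ChecksumIEEE([]byte(key))
	i := sort.Search(len(r.hashes), func(i int) bool { return r.hashes[i] >= h })
	if i == len(r.hashes) { // wrapped past the largest position
		i = 0
	}
	return r.workers[r.hashes[i]]
}

func main() {
	ring := NewRing([]string{"pi-0", "pi-1", "pi-2", "pi-3", "pi-4"})
	for _, k := range []string{"account:alice", "account:bob"} {
		fmt.Println(k, "->", ring.Owner(k))
	}
}
```

A nice property of this scheme is that adding or removing one worker only remaps the keys adjacent to it on the ring, rather than rehashing the entire database.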
As more of our framework has fallen into place, we have decided that moving worker tasks to the pi's GPU is a more interesting task in the context of this class than dynamic load balancing.
Our project is primarily focused on construction of a parallel, balanced, RPC system, capable of passing certain benchmarks. We will display these benchmarks, and discuss the implications of the results.
We are hesitant to make any predictions about absolute performance, as there may be additional overheads beyond the bandwidth values quoted by the Raspberry Pi and microSD manufacturers. Rather, we can compare the distributed system to a smaller version of itself -- ideally, we should see that a distributed system with four workers under heavy load will be ~3-4 times faster than an identical system with only one worker. This can be a guiding goal for removing bottlenecks between nodes, and for finding an appropriate task to balance computation and data-lookup time.
As we mentioned above, our hardware has been chosen for a variety of reasons -- it's easy to parallelize, flash memory is faster than disk for random lookups, Pis are cheap, and they consume minimal power (approximately 3.5 W each). We will be using the Go programming language to create this webserver, as it is well suited to thread-safe, distributed systems.
Our test bench tries to account for a variety of factors in our system, in order to discover both weaknesses and strengths. Some factors we considered include the following:
As we can see in the chart above, the latency for multiple lightweight accesses to the Pis is less than 100 ms in all test cases.
As we can see here, in the general computation case, the performance on Pis is worse than on traditional CPUs.
However, when viewed as "performance/watt", the Pis actually do much better than their server counterparts.
Additionally, "performance/dollar" also shows the Pis beating the Andrew machines.
That being said, the Pis are not infallible -- they are limited to a bandwidth of 17 MB/s.
We can infer different pieces of information from each of our tests.
Observing the simple GET/POST request test (basic test), we can see that our distributed Pi system achieves a very reasonable response time, typically under 50 ms per request.
The hash test shows us a lot of information, especially when viewed through a variety of lenses. We can see (as we expected) that the Pis are actually slower to finish computationally bound tasks. That being said, when viewed as "speed per watt" or "speed per dollar", the Pis actually excel compared to the traditional Andrew machines. Observing the Pi system by itself, we can also see that the compute tasks show a linear speedup while increasing either (1) the number of clients or (2) the number of worker nodes, as would be expected.
The bandwidth test identified a problem with our system -- a bottleneck on the master node at 17 MB/s. As we can see from the last bar graph above, the average request time for uploading a 1 MB image increases drastically when this cutoff is crossed.
|4/6 to 4/12||Purchase hardware components, assemble the parts of the distributed webserver, and achieve a simple "ping" communication between Raspberry Pis in the distributed system, connected by a networking switch and contacting static IP addresses.||Successful client-server connection achieved using golang on a Raspberry Pi with a DHCP-assigned IP address. We plan to integrate more Pis, and also attempt to use static IP addresses. One of our Pis was configured with a static IP, but it ended up bricked.|
|4/13 to 4/19||Transfer communication into code. Additionally, create a startup mechanism for workers and masters to boot up and register with one another. After doing so, create functions to "get", "store", and "compute" using information stored on worker nodes.||Successfully created "get" and "store" tasks within a client - master - worker RPC framework. Tested on both unix.andrew machines and Pis. Some bugs regarding dynamic IP addresses remain, but there are workarounds. Additionally, our compute task is somewhat limited for now, but easy to expand, as a large amount of time has been spent refining the RPC framework between all three types of nodes in this system.|
|4/20 to 4/26||Make code from the previous week impervious to heavy-duty loads -- reduce potential race conditions. After doing so, run some basic benchmarking tests using client-side tasks. Compare the performance under a variety of conditions (heavy computation, heavy database, heavy networking) while running the system with a variable number of workers. If time permits, compare the system to a "real" webserver using the test suite.||We have implemented a testing framework which randomly generates GET/PUT/HASH/PICT tasks with certain probabilities. We have run very basic versions of the test on both the Pis and the Andrew machines, but we need to collect more results.|
|4/27 to 5/3||If time permits, and the system has been successful thus far, implement a GPU-based image processing task which can be parallelized across multiple workers.||We implemented a mechanism for clients to store images on the Pis, but actually processing the image and returning the rendered result to the client has turned out to be excessively tedious. The issue is primarily with the Pi's access to the GPU (via OpenGL ES 2.0), which we could not configure in time. However, we have figured out how to run arbitrary C code from the Pis, which opens up the capabilities of the webserver for later development.|
|5/4 to 5/9||Add polish to the system, clean code, finish previous tasks, and prep for demo. Also, finals.||Done!|
|After 418...||We have actually received a SURG grant for our project, and plan to continue work on the database over the summer. Plans include adding a write-ahead log (WAL) and periodically flushing to the file system to preserve durability. Furthermore, we would like to improve the consistent hashing ring to make it possible to add and remove workers easily. If time permits, implementing a Paxos-style distributed replication scheme for each worker would be an interesting challenge as well!|
The rationale behind implementing the image-transfer request (besides testing heavy bandwidth) is to enable GPU computation on the Pis, further increasing parallelism. However, this functionality has not been fully merged with the current system. We have a program that renders a triangle textured with the client's image and saves the rendered result in nonvolatile memory, to be sent back. We are currently working on displaying the returned image, as well as identifying tasks better suited to the GPU resources available.