March 16, 2010 (Lecture 15)

Introduction To Replication

Today we are going to move into the next topic: replication. It is often the case that we want to replicate data in a distributed system. We might do this to make our system more robust or accessible in the face of failure, or to ensure that there is a copy of the data "nearby" in order to improve access latency.

But if we are not careful about how we manage replication, we can actually end up with a system that has higher latency, or is less likely to be available. During today's discussion, we are going to assume that we need one-copy semantics, despite replication. In other words, we will assume that we want the results of read and write operations to be the same as they would be if they were acting on a single, non-replicated data store.

Our goal is to understand conflict, how to prevent it, and the performance trade-offs that we may encounter. We will do this in the context of simple techniques. Next class we will continue our discussion with more sophisticated and subtle approaches.

Replication and Conflict

If we have replicated data, we have a choice about which of the copies of the data we will access to complete any operation. Perhaps we could access only one replica, or perhaps all of them, or perhaps any number in between. The decision that we make could affect consistency.

Consider a system where there are 4 replicas, R1, R2, R3, and R4. Suppose we implemented a policy in which each write updates only a single replica and each read consults only a single replica.

The above policy has a problem. Once writes occur, a read that happens to consult a replica that was not updated will read stale data. Another policy might be "write-one/read-all": writes still update only a single replica, but reads access all of the replicas.

As long as version numbers are used, the above policy prevents stale reads. Since a read sees all of the replicas, it will find the most recent version and report it to the application. But this policy isn't really a good one -- the most recent data isn't replicated, and reads, which require access to every replica, are usually more common than writes, which require access to only one server.

We can solve this problem by flipping our logic. Instead of the "write-one/read-all" policy above, we can use a "read-one/write-all" policy: each write updates every replica, while a read can be satisfied by any single replica.

The "write-all/read-one" policy is very frequently used, because it has many good characteristics: reads, the common case, are cheap, requiring access to only a single replica; every replica always holds the most recent data, so reads never return stale values; and a read can still be satisfied even if all but one of the replicas have failed.

In looking at these examples, we can see that the number of replicas that are required for a read operation to remain consistent depends on the number of servers required for a write operation. We must be guaranteed that the set of replicas selected by a read operation will intersect the set of replicas selected by any write operation. If we have 5 servers and it takes 3 to write, it must take at least 3 to read. If it takes 5 to write, it takes only 1 to read. If it takes 1 to write, reads must access all replicas. If this isn't the case, there is the potential for a read-write conflict -- a read might get stale data, because it doesn't see a recent write.

The general rule for avoiding a read-write conflict is this: if there are N replicas, writes go to some W replicas, and reads come from some R replicas, then we must have R + W > N.
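To make the intersection requirement concrete, here is a minimal sketch in Python (the function name is mine, not from the lecture) that computes the smallest read quorum that still overlaps every possible write quorum, and checks it against the 5-server examples above.

```python
def min_read_quorum(n, w):
    """Smallest r such that any r replicas must overlap any w replicas.

    A read quorum of size r and a write quorum of size w can only be
    disjoint if r + w <= n, so we need r + w > n, i.e. r = n - w + 1.
    """
    return n - w + 1

# The 5-server examples from the notes:
for w in (3, 5, 1):
    print(f"N=5, W={w}: reads must contact at least R={min_read_quorum(5, w)} replicas")
# N=5, W=3 -> R=3;  N=5, W=5 -> R=1;  N=5, W=1 -> R=5
```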

Read-read conflicts aren't a problem -- one read doesn't affect another read, even if they occur from disjoint sets of replicas. A read accesses data without changing it, so it can't affect future operations.

Write-write conflicts can occur if fewer than a majority of the processors are required for a write. Consider a system with 4 replicas, each containing version 0 of some object. Now assume that one write updates the object to version 1 on servers 1 and 2, while another write updates the object to version 1 on servers 3 and 4. A read now has no idea which copy to use, even if it reads all 4 -- two different values, each carrying version number 1, will exist. To solve this problem, writes must affect a majority of the replicas.
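A small simulation (a sketch with made-up names, not anything from the notes) shows what goes wrong when the write quorum is not a majority: two concurrent writes to disjoint halves of the replicas both "succeed", leaving two different values that carry the same version number.

```python
# Four replicas, each holding (version, value).
replicas = {i: (0, "initial") for i in range(1, 5)}

def write(targets, value):
    """Write to a chosen set of replicas, bumping the version seen there."""
    new_version = max(replicas[t][0] for t in targets) + 1
    for t in targets:
        replicas[t] = (new_version, value)

# Two concurrent writers, each using a non-majority quorum of 2 out of 4.
write({1, 2}, "A")   # servers 1 and 2 now hold version 1, value "A"
write({3, 4}, "B")   # servers 3 and 4 now hold version 1, value "B"

# Even a read of all four replicas cannot tell which "version 1" is the latest.
print(replicas)
# {1: (1, 'A'), 2: (1, 'A'), 3: (1, 'B'), 4: (1, 'B')}
```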

Processor Failure, Partitioning, and Replica Control

Now let's consider the impact of our policy decisions on availability in the light of failure. Let's assume that a single processor fails in a system using a write-all/read-one policy. In this system, reads will be unaffected, but writes will not be possible. The not-so-useful write-one/read-all policy would allow writes, but not reads.

Is it possible for us to define a policy such that both reads and writes can continue after a failure? Perhaps. Consider a system that has 5 replicas and requires 3 replicas to write and 3 replicas to read. This system can continue to read and write, even if 2 processors fail. If both reads and writes require a majority of the processors, they can continue, despite a failure of the minority of the processors. But the price that we pay is extra communication in the common case of a functioning system -- the quorums are larger.
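The arithmetic generalizes: with majority quorums of size floor(N/2) + 1 for both reads and writes, the system tolerates the failure of any minority of the replicas. A tiny sketch (illustrative only):

```python
def majority_quorum(n):
    """Smallest quorum size such that any two quorums must overlap."""
    return n // 2 + 1

def tolerated_failures(n):
    """Replicas that can fail while a majority quorum is still reachable."""
    return n - majority_quorum(n)

for n in (3, 4, 5):
    print(f"N={n}: quorum={majority_quorum(n)}, tolerates {tolerated_failures(n)} failure(s)")
# N=3: quorum=2, tolerates 1;  N=4: quorum=3, tolerates 1;  N=5: quorum=3, tolerates 2
```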

Tanenbaum and Van Renesse suggest another approach called voting with ghosts. This approach allows the counting of dead processors toward a write quorum. Basically, a ghost processor is set up that votes on behalf of the dead processor. When it receives a new object via a write, it just throws it away (it is a ghost, after all).

But this approach is somewhat problematic -- how does one know whether a processor has failed or communication has failed? Sometimes we may be able to tell the difference, but for the most part it is impossible (remember the discussion of failure last class). Now consider a partitioning of the network. Processors in each partition will assume that the processors in the other partitions are dead, and ghosts will cast their votes. Now both partitions are receiving updates. Once the network partitioning is repaired, there will be a write-write conflict.

For this reason, voting with ghosts isn't very practical, and, other than the original publication, as far as I know it is only discussed in Tanenbaum's own textbook. But I like to talk about it, because I'll use it as a bridge to discuss something later on -- so don't completely force it out of your mind.

Static Quorums

The decision about how many replicas should be involved in operations is known as quorum selection. What we have discussed so far implies a set of rules for selecting read and write quorums:
  1. There is a read quorum, r, such that at least r replicas must be accessed by a read operation.
  2. There is a write quorum, w, such that at least w replicas must be accessed by a write operation.
  3. Given N replicas, r + w > N
  4. w > N/2
  5. Each object has a version number or sufficiently consistent timestamp
A less formal statement of these rules follows:
  1. A read quorum is required for a read to succeed
  2. A write quorum is required for a write to succeed
  3. Read and write quorums must always intersect
  4. Write quorums must always intersect
  5. A read can tell which of the replicas it accesses is the most up-to-date
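Rule 5 is what lets a reader resolve the quorum's answers. Here is a minimal sketch (Python, with hypothetical names) of a quorum read that returns the most up-to-date copy it sees:

```python
def quorum_read(replicas, read_quorum):
    """Read from a quorum of replicas and return the freshest copy.

    `replicas` maps replica id -> (version, value); `read_quorum` is the
    set of replica ids we were able to contact.  Because the read quorum
    intersects every write quorum, the highest version seen here is the
    most recent committed write.
    """
    responses = [replicas[r] for r in read_quorum]
    return max(responses, key=lambda vv: vv[0])

# Example: replica 3 missed the last write, but the read quorum still
# contains at least one replica that saw it.
replicas = {1: (2, "new"), 2: (2, "new"), 3: (1, "old")}
print(quorum_read(replicas, {2, 3}))   # (2, 'new')
```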

Static quorum selection is a pessimistic approach to quorum selection, because in the event of a partitioning, updates can only occur in, at most, one of the partitions, for fear of a conflict. When we discussed Coda, we discussed an optimistic approach, where writes could occur even in a partitioned network (disconnected client). Such conflicts typically require human intervention to repair. Optimistic replication is acceptable only if conflicts are rare and the cost of detecting and repairing them is tolerable.

Voting with Static Quorums

A version of the static quorum technique, called voting with static quorums, provides a mechanism for assigning an importance to each replica.

It works exactly like the simple static quorum approach above, except that not all replicas count equally. Each replica is assigned a particular number of votes. Now, instead of defining a quorum in terms of a number of replicas, it is defined in terms of a number of votes. But the same rules as above still apply: we still need version numbers or synchronized timestamps, read and write quorums must still intersect, and writes require a majority of the votes.
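The earlier quorum check carries over directly once replicas are weighted. Here is a sketch (again with made-up names) that validates a read/write quorum configuration expressed in votes rather than replica counts:

```python
def valid_vote_quorums(votes, r_votes, w_votes):
    """Check the static-quorum rules when each replica carries a weight.

    `votes` maps replica id -> number of votes; `r_votes` and `w_votes`
    are the vote totals required for reads and writes.
    """
    total = sum(votes.values())
    read_write_intersect = r_votes + w_votes > total   # no stale reads
    write_write_intersect = w_votes > total / 2        # no conflicting writes
    return read_write_intersect and write_write_intersect

# Three replicas with unequal weights: 2 + 1 + 1 = 4 votes in total.
votes = {"replica1": 2, "replica2": 1, "replica3": 1}
print(valid_vote_quorums(votes, r_votes=2, w_votes=3))   # True
print(valid_vote_quorums(votes, r_votes=1, w_votes=2))   # False: writes can conflict
```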

This approach gives us a way of dealing with cached copies as replicas -- we can assign them 0 votes. Perhaps 0-vote replicas will require a version check from a read-quorum, but not a full data transfer.

It also gives us a way of preventing a handful of unreliable servers from blocking a quorum. They can be given a small number of votes, but they may still provide a useful replica in the event of a failure.

Example: Equally Weighted Replicas

This example assumes equally reliable, equidistant replicas and parallel updates. Since the replicas are equal, we'll give them the same number of votes:
            Votes   Access Time   P(Failure)
Replica 1     1        750ms         0.01
Replica 2     1        750ms         0.01
Replica 3     1        750ms         0.01

If we assume that reads are most important, we will select a read-one/write-all policy, since this gives us the minimum number of replicas for a read. Since all accesses are performed in parallel, the latency for both reads and writes is 750ms. Under this policy, it is very unlikely that a read will fail -- all 3 replicas would have to fail, so the probability is (0.01)^3 = 10^-6.

If we assume that writes are very important, we should select the smallest possible write quorum. We need a quorum of at least 2 votes to ensure that writes can't conflict, so we select w=2. This implies a read quorum, r=2, to ensure that reads and writes can't conflict. Since the operations can occur in parallel, the access time will still be 750ms. Since the read and write quorums are the same, the probability of failure is also the same for both. It will take at least 2 failures (2 or 3) to prevent us from satisfying the quorum: P(at least 2 failures) = P(exactly 2 failures) + P(3 failures) = 3 * (0.01)^2 * (0.99) + (0.01)^3 = 0.000298.
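These failure probabilities are easy to check numerically. The sketch below (illustrative only, using the table's per-replica failure probability of 0.01) computes the chance that fewer than q of the n replicas survive:

```python
from math import comb

def quorum_unavailable(n, q, p_fail):
    """Probability that fewer than q of n replicas are up,
    assuming independent failures with probability p_fail each."""
    return sum(comb(n, k) * p_fail**k * (1 - p_fail)**(n - k)
               for k in range(n - q + 1, n + 1))

print(quorum_unavailable(3, 1, 0.01))   # read-one: ~1e-06 (all three must fail)
print(quorum_unavailable(3, 2, 0.01))   # quorum of two: ~2.98e-04
```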

Example: Cached Replica

This example has 3 replicas as before, but it also includes a cached replica. We'll assume that using the cache requires a version check that is much faster than a read or write access. We'll also make the fairy-tale assumption that the cache will never fail.
            Votes   Access Time   Version Check   P(Failure)
Cache         0        100ms           0ms            0.00
Replica 1     1        750ms          75ms            0.01
Replica 2     1        750ms          75ms            0.01
Replica 3     1        750ms          75ms            0.01

For this example, let's assume that writes are very important and set w=2, the smallest possible value. Let's set r=2, the smallest value possible, given the write quorum. As you might have guessed, a version check requires a read quorum.

To use the cache, we must access the local cache, perform a version check, and, if the cached copy is stale, read the object from the remote replicas. This takes (75ms + 100ms) = 175ms in the best case, and (75ms + 750ms) = 825ms in the worst case.

Since the cache must be write-through and writes must still go to a quorum, a write will take 750ms.
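The latency arithmetic for the cached configuration can be written out as a short sketch (the numbers come straight from the table above; the variable names are mine):

```python
VERSION_CHECK = 75    # ms, against a read quorum of remote replicas
CACHE_READ    = 100   # ms, local cached copy
REMOTE_READ   = 750   # ms, read quorum of remote replicas
REMOTE_WRITE  = 750   # ms, write quorum of remote replicas

best_case_read  = VERSION_CHECK + CACHE_READ    # cache is fresh: 175ms
worst_case_read = VERSION_CHECK + REMOTE_READ   # cache is stale: 825ms
write_latency   = REMOTE_WRITE                  # write-through to a quorum: 750ms

print(best_case_read, worst_case_read, write_latency)   # 175 825 750
```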

It is important to note that the probability of failure is the same as before -- if stale values are unacceptable, caching does not improve availability.