Introduction to Logging
So far our discussion of failure and recovery has centered on checkpointing the state of the system. The next technique we will discuss is logging. Instead of preserving the state of the entire system at once, we will log, or record, each change to the state of the system. These changes are easy to identify -- we just log each message. In the event of a failure, we will play back this log and incrementally restore the system to the state it was in before the failure.
The most intuitive technique for logging a system is synchronous logging. Synchronous logging requires that all messages are logged before they are passed to the application. In the event of a failure, recovery can be achieved by playing back all of the logs. Occasional checkpointing can allow the deletion of prior log entries.
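A minimal sketch of synchronous, receiver-side logging might look like the following. The class name, the `apply` callback, and the one-JSON-record-per-line log format are assumptions made for illustration; the notes do not prescribe any of them. The point is only the ordering: the message is forced to stable storage before the application sees it.

```python
import json
import os
import tempfile


class SyncLoggingReceiver:
    """Logs every message to stable storage before delivering it."""

    def __init__(self, log_path, apply):
        self.log_path = log_path
        self.apply = apply  # hypothetical application callback

    def receive(self, message):
        # Commit to the log first; fsync asks the OS to push the write to the device.
        with open(self.log_path, "a", encoding="utf-8") as log:
            log.write(json.dumps(message) + "\n")
            log.flush()
            os.fsync(log.fileno())
        # Only now may the application see the message.
        self.apply(message)

    def recover(self):
        # Replay every logged message in the order it was logged.
        if not os.path.exists(self.log_path):
            return
        with open(self.log_path, encoding="utf-8") as log:
            for line in log:
                self.apply(json.loads(line))


def demo_sync_logging():
    path = os.path.join(tempfile.mkdtemp(), "messages.log")
    state = []
    receiver = SyncLoggingReceiver(path, state.append)
    for msg in ["a", "b", "c"]:
        receiver.receive(msg)
    # Simulate a crash: the in-memory state is lost, but the log survives.
    recovered = []
    SyncLoggingReceiver(path, recovered.append).recover()
    return state, recovered
```

The per-message `fsync` is exactly where the cost of this approach comes from: every delivery waits on the storage device.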
This approach works, but it is tremendously expensive. Messages cannot be passed to the application until they are committed to a stable store, and stable storage is usually much slower than RAM. Performance can sometimes be improved by carefully managing the I/O device, e.g., by dedicating one disk to the log file to avoid seek delays, but it is still much slower than volatile storage.
In order to avoid the performance penalty associated with synchronous logging, we may choose to collect messages in RAM and to write them out occasionally as a batch, perhaps during idle times. If we take this approach, we have a similar management problem to that of uncoordinated checkpoints. Some messages may be logged, while others may not be logged.
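The trade-off can be seen in a small sketch. Everything here is hypothetical: `stable_log` is an in-memory list standing in for stable storage, and the fixed `batch_size` stands in for "occasionally" or "during idle times". Messages are delivered immediately, buffered in RAM, and written out in batches, so a crash loses whatever is still in the buffer.

```python
class BatchLoggingReceiver:
    """Delivers messages immediately and logs them later, in batches."""

    def __init__(self, apply, batch_size=3):
        self.apply = apply            # hypothetical application callback
        self.batch_size = batch_size  # assumed flush policy
        self.buffer = []              # volatile: lost on a crash
        self.stable_log = []          # stand-in for stable storage

    def receive(self, message):
        self.apply(message)           # delivered before it is logged
        self.buffer.append(message)
        if len(self.buffer) >= self.batch_size:
            self.flush()

    def flush(self):
        # One batched write instead of one write per message.
        self.stable_log.extend(self.buffer)
        self.buffer.clear()

    def crash(self):
        # Whatever was still buffered never reached the log.
        lost = list(self.buffer)
        self.buffer.clear()
        return lost


def demo_batch_logging():
    delivered = []
    receiver = BatchLoggingReceiver(delivered.append, batch_size=3)
    for i in range(1, 6):
        receiver.receive(i)
    lost = receiver.crash()
    return delivered, receiver.stable_log, lost
```

In this run the application has seen five messages, but only the first three are in the log. That is the management problem described above: some messages are logged and others are not.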
Our examples above assumed that messages would be logged by the receiver. Most of our discussion of logging will, in fact, be focused on receiver-based techniques. But, before we dive into these, let's take a quick look at logging messages at the sender.
Sender-based logging is very important in those cases where receivers are thin or unreliable. In other words, we would want to log messages on the sender if the receiver does not have the resources to maintain the logs, or if the receiver is likely to fail. This might be the case, for example, if the sender is a reliable server and the receiver a portable mobile device.
So, let's assume that each sender logs each message it sends and that the receiver logs nothing. Recovery isn't quite so easy as having each sender play back its logs. Although each sender can play the messages back in the same order in which they were dispatched, there is no way to order the messages among the senders.
One solution to this problem is the following protocol:
- The sender logs the message and dispatches it.
- The receiver receives the message and ACKs it with the current time (local to the receiver).
- The sender adds the timestamp contained in the ACK to its log entry for the message -- the message is now fully logged.
If the above protocol is followed, the timestamp can be used to ensure that messages from multiple senders are processed by the receiver in the proper order. This is because all of the timestamps were assigned by the receiver, so clock skew among the senders is not a problem. (The timestamp can be as simple as a receive sequence number.)
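The three steps of the protocol can be sketched as follows, using a receive sequence number as the timestamp. The class and function names are illustrative assumptions, and message passing is modeled as a direct method call rather than a real network.

```python
class Receiver:
    """Assigns a receive sequence number to each message and logs nothing."""

    def __init__(self):
        self.next_seq = 0
        self.delivered = []

    def receive(self, payload):
        seq = self.next_seq
        self.next_seq += 1
        self.delivered.append(payload)
        return seq  # the ACK carries the receiver-assigned sequence number


class Sender:
    """Keeps the log on behalf of the receiver."""

    def __init__(self, name):
        self.name = name
        self.log = []  # entries are dicts; "seq" stays None until the ACK arrives

    def send(self, receiver, payload):
        entry = {"payload": payload, "seq": None}
        self.log.append(entry)               # step 1: log, then dispatch
        ack_seq = receiver.receive(payload)  # step 2: receiver ACKs with its number
        entry["seq"] = ack_seq               # step 3: entry is now fully logged


def merged_replay_order(senders):
    # Order fully logged entries from all senders by the receiver's own numbering.
    entries = [e for s in senders for e in s.log if e["seq"] is not None]
    return [e["payload"] for e in sorted(entries, key=lambda e: e["seq"])]


def demo_sender_logging():
    receiver = Receiver()
    s1, s2 = Sender("s1"), Sender("s2")
    s1.send(receiver, "a")
    s2.send(receiver, "b")
    s1.send(receiver, "c")
    return receiver.delivered, merged_replay_order([s1, s2])
```

Because every sequence number comes from the one receiver, merging the senders' logs reproduces the receiver's original delivery order without comparing any sender clocks.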
But there is one small problem. Consider a Sender sending a message, m, to a Receiver. Now consider the same Sender sending a second message, m', to the Receiver. If the Receiver fails before it ACKs m, it is unknown whether or not m was received by the Receiver before the Sender dispatched m' -- we can't establish whether or not a causal relationship exists.
One solution to this problem is to require that the sender send an ACK-ACK to the receiver. If the receiver blocks until it receives the ACK-ACK, the order of the messages will be clear.
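One way to picture the blocking behaviour is the sketch below. It is an illustration under assumptions, not a definitive design: the names are hypothetical, "blocking" is modeled by refusing a new message (returning `None`) rather than by suspending a thread, and the message is handed to the application at receive time.

```python
class BlockingReceiver:
    """Accepts no new message until the previous one has been ACK-ACKed."""

    def __init__(self):
        self.next_seq = 0
        self.awaiting_ack_ack = None  # sequence number still unconfirmed, if any
        self.delivered = []

    def receive(self, payload):
        if self.awaiting_ack_ack is not None:
            return None  # blocked: the previous ACK has not been confirmed
        seq = self.next_seq
        self.next_seq += 1
        self.awaiting_ack_ack = seq
        self.delivered.append(payload)
        return seq  # sent back to the sender in the ACK

    def ack_ack(self, seq):
        # The sender confirms that it has recorded the sequence number in its log.
        if seq == self.awaiting_ack_ack:
            self.awaiting_ack_ack = None


def demo_ack_ack():
    receiver = BlockingReceiver()
    first = receiver.receive("m")
    blocked = receiver.receive("m'")  # refused: no ACK-ACK for m yet
    receiver.ack_ack(first)
    second = receiver.receive("m'")   # accepted now
    return first, blocked, second, receiver.delivered
```

Since m' cannot be accepted until the sender has confirmed the sequence number for m, the sender's log always records m ahead of m'.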
Given this protocol, recovery is straightforward. Upon reinitialization, the failed receiver sends a message to all other hosts in the system with the number of the last message that it remembers ACKing. The senders then replay their logs, including the receiver-assigned sequence numbers. The receiver applies these messages in the order of their sequence numbers. The senders know to ignore the responses to these messages. The result is a simple recovery.
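The recovery steps can be sketched as below. The function names and the shape of `sender_logs` (a dict mapping a sender name to a list of `(seq, payload)` pairs) are assumptions. So is the reading of "the last message that it remembers ACKing" as the highest sequence number already reflected in the receiver's surviving state.

```python
def replay_from(sender_logs, last_remembered_seq):
    """Collect the logged messages the receiver must re-apply, in sequence order."""
    pending = []
    for log in sender_logs.values():
        for seq, payload in log:
            # Skip entries that never got a sequence number or are already applied.
            if seq is not None and seq > last_remembered_seq:
                pending.append((seq, payload))
    pending.sort()  # receiver-assigned numbers give a single total order
    return pending


def recover(state, sender_logs, last_remembered_seq):
    for _seq, payload in replay_from(sender_logs, last_remembered_seq):
        state.append(payload)  # re-apply; any responses produced would be ignored
    return state


def demo_recovery():
    sender_logs = {
        "s1": [(0, "a"), (2, "c")],
        "s2": [(1, "b"), (3, "d")],
    }
    # Assume the receiver's surviving state already covers sequence numbers 0 and 1.
    return recover(["a", "b"], sender_logs, last_remembered_seq=1)
```

Here the two senders replay interleaved entries, and sorting by the receiver-assigned number restores the state as `["a", "b", "c", "d"]`.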