## Understanding Paxos and Distributed Consensus

```(minor wording correction and more complaining added 10/2/2016,
minor edits 10/5/2016)```

Multi-proposer Paxos is a very clever and notoriously slippery algorithm for obtaining distributed consensus. In this note I try to explain it clearly and provide a correctness proof that gives some intuition why it works – to the extent that it does work. I am specifically concerned with Multi-Proposer Paxos, the first algorithm discussed in “Paxos Made Simple”.  What is often called “Paxos” involves single Proposer variants which are much simpler and less interesting.

I think this is right – let me know if you spot an error.

## Rules for how Paxos works

There is a finite set of processes or network sites, some of which are Proposers and some Acceptors (the sets can intersect). Each proposer has a unique id, confusingly called a sequence number. A proposal is a pair consisting of a proposal value and the sequence number of the Proposer. The goal is to ensure that if two Proposers believe that they have convinced the Acceptors to come to a consensus on a value, they must both agree on the same value, even though they may disagree on sequence number. The most clever part of Paxos is the observation that since we don’t care which value wins, even though we do care that some unique value wins, we can force Proposers to inherit values of the most likely previous proposal.

1. Proposers can ask Acceptors to approve sequence numbers and to accept proposals which include a value and the Proposer’s sequence number. Acceptors do not have to approve or accept but are limited to approving and accepting what Proposers send them.
2. When an Acceptor approves a sequence number it:
1. Promises to not approve any smaller sequence numbers
2. Promises to not accept any proposals with smaller sequence numbers
3. Returns to the Proposer the proposal with the highest sequence number it has already accepted, if any.
3. The Proposer cannot send any proposals or select a value for a proposal until it gets approval for its sequence number from a majority of Acceptors.
4. Once the Proposer has approval from a majority of Acceptors it must select the value of the proposal with the highest sequence number sent to it during the approval phase (the inherited proposal). If the approval phase did not turn up any accepted proposals, the Proposer can pick any value. In this case the Proposer “originates” the value.
5. Once the value is selected, the Proposer can never change the value and can only propose the pair of that value and its sequence number – unless it increases its sequence number, abandons the proposal and value, and starts over.
6. The choice of a new sequence number must preserve the property that each sequence number belongs to only one Proposer, ever.

(see the  code for what this looks like in a simulation)

## Why it works intuition

The first thing to see is that individual Acceptors are forced to order their approvals and accepts by sequence number. If an Acceptor has approved j and accepted (i,v) and j>i then we know that it must have approved j after it accepted (i,v). The sequence of operations for that Acceptor must act like this:

The Chang-Maxemchuk algorithm (US Patent 4,725,834 ) solves atomic broadcast (and in-order broadcast) problems for distributed networks in a far simpler and more efficient way than some popular alternatives. In fact, the obscurity of this method is hard to understand given the current interest in distributed consensus.

The basic idea is simple algebra. A source site or process broadcasts “data messages” to a list of sites n sites. Data messages are tagged with sequence numbers and each sequence number is associated with exactly one “responsible”destination site so that  n consecutive sequence numbers map to n sites (the entire list).  For example, if the list sites are numbered 0 … n-1, then sequence number q could be mapped to responsible site q mod n.  Sites on the list broadcast numbered acknowledgment messages to all sites on the list and the source. Only the responsible site for sequence number can create an acknowledgment message numbered  i and the responsible site will only create the acknowledgment if it has received data message i and all lower numbered data messages and acknowledgment messages.  As a result, when the source sees acknowledgment message n+i it is assured that all sites have received the data message numbered  and the acknowledgment.

That’s normal operation mode. There is a reformation mode which is used to create  a list after a failure.  Reading the reformation mode description in the original paper is a good education in how to describe standard “leader election” clearly:

Any site that detects a failure or recovery initiates a reformation and is called an originator. It invites other sites in the broadcast group, the slaves, to form a new list. The reformation process can be described in terms of the activities of sites joining and committing a valid list. A valid list satisfies a set of specific requirements, as explained below. When the reformation starts, a site is invited to join a new list and eventually commits to a valid list. When all of the sites in a valid list are committed to this list, the list will be authorized with a token and the reformation terminates. This list becomes the new token list. Multiple originators can exist if more than one site discovers the failure or recovery. During the reformation, it is possible that acknowledged messages from the old token list have been missed by all sites that join a new list.

To guarantee that there is only one new list and that this list has all of the committed messages, the list must be tested before it can be considered a valid list. Specifically, a list becomes valid if it passes the majority test, the sequence test, and the resiliency test.

Majority Test. The majority test requires that a valid list has a majority of the sites in the broadcast group. During the reformation, a site can join only one list. The majority test is necessary to ensure that only one valid list can be formed.

Sequence Test. The sequence test requires that a site only join a list with a higher version number than the list it previously belonged to. The version number of a token list is in the form of (version #, site number). Each site has a unique site number. When a new list is formed, the originator chooses the new version # to be the version # of the last list it has joined plus one. Therefore, token lists have unique version numbers.

The originator always passes the sequence test. If any of the slaves fail the sequence test, it tells the originator its version number. The originator increments the higher version # the next time it tries to form a new list. The combination of the majority and the sequence test ensures that all valid lists have increasing version numbers. This is true because any two valid lists must have at least one site in common, and a site can join a second list only if the second list has a higher version number. Therefore, the version numbers indicate the sequence in which token lists were formed.

This paper was published 1984 and the first Paxos paper was from 1988. In my opinion Paxos is a big step backwards from CM.

## Distributed consensus and network reliability

All of the distributed consensus algorithms I have been reviewing recently (Paxos, Raft, Zab, Chang Maxemchuck, Viewstamped, … ) are based on a number of assumptions about the network environment, including the assumption that messages may be lost but are not silently corrupted.  Is that a good assumption? Perhaps:

Real data 1995

Checksums disagree 2000

Koopman

## circularity problems in distributed consensus

The famous result of Fischer, Lynch, and Patterson [1] implies that a reliable algorithm for electing a proposer must use either randomness or real time—for example, by using timeouts. However, safety is ensured regardless of the success or failure of the election.

The FLP result is essentially a tautology: if an agent doesn’t ever get any information that reliably distinguishes between failure and slow response in a second agent, the first agent cannot reliably distinguish between failure of the second agent and slow response.  So the import of the first sentence is that leader election depends on timeouts or “randomness” (perhaps this means some analysis of probability of failure scenarios).  I don’t think this is correct, but it’s an interesting claim. The second sentence says nothing more than that an algorithm that fails to progress will never produce a false result – which I think is also a dubious claim.

Algorithm P solves problem X by assuming some other mechanism solves X and then by using that mechanism to make problem X simpler.  Ok.

## Making Paxos face facts

Lamport’s  “Paxos Made Simple” paper is notoriously hard to understand but at least part of the difficulty is that the algorithm  changes radically in the middle of the presentation.  The first part of the paper presents a subtle (maybe too subtle) method to permit multiple processes or network sites to agree on a consensus value.  The second part of the paper switches to a second, much simpler algorithm without much notice.

The paper begins with a problem statement:

Assume a collection of processes that can propose values. A consensus algorithm ensures that a single one among the proposed values is chosen. If no value is proposed, then no value should be chosen. If a value has been chosen, then processes should be able to learn the chosen value. The safety requirements for consensus are:

• Only a value that has been proposed may be chosen,
• Only a single value is chosen, and
• A process never learns that a value has been chosen unless it actually has been

The Paxos algorithm that is  presented first is liable to what we used to call livelock even in the absence of failure:

It’s easy to construct a scenario in which two proposers each keep issuing a sequence of proposals with increasing numbers, none of which are ever chosen.

It’s argued that this original Paxos works in the sense that it never can reach an inconsistent state (it is “safe” ), but the livelock scenario means it can easily fail to progress even without process failure or loss of messages. To avoid this scenario, on page seven of an eleven page paper, Paxos is redefined to restrict the set of  proposers to a single member – a distinguished proposer.

To guarantee progress, a distinguished proposer must be selected as the only one to try issuing proposals

Then, on page nine, when going over how Paxos can be used to build replicated state machines, Lamport writes:

In normal operation, a single server is elected to be the leader, which acts as the distinguished proposer (the only one that tries to issue proposals) in all instances of the consensus algorithm

Since much of the complexity of the first part of the paper involves describing how Paxos safely resolves contention between competing proposers, the modification is far from minor. It’s unfortunate that both the multi-proposer and single proposer algorithm are called by the same name which seems to cause confusion in the literature and certainly obscures the presentation in “Paxos Made Simple“. For example, the “Paxos Made Live”   paper appears to discuss an implementation based on the first (the multi-proposer) Paxos algorithm but then appears to revert to the second method via a mechanism called “master leases”).

In any case, the single proposer problem is much simpler than the original problem.

## A brute force commit solution to the single proposer problem.

What, exactly do the bolded words in the problem statement mean?  “Chosen” and “Learn” are the two hard ones. “Proposed” is pretty clear: a process sends messages to all the other processes that says ” I , process A, propose value V with supporting documentation x“.

A proposer sends a proposed value to a set of acceptors. An acceptor may accept the proposed value. The value is chosen when a large enough set of acceptors have accepted it.

Proposal is a simple action: the proposer sends a proposal message.

There are the usual assumptions: messages may be lost or duplicated or delivered out of order, but are neither corrupted or spurious. That is: if a process receives a message from a second process, the second process must have previously transmitted that message.   A process can propose a value, but that proposal may never arrive at any of the other process or it may arrive at all of them or some of them. It makes sense that an accept operation is the transmit of an accept message by some process  that received the proposal message.  Presumably then, the value is chosen when the original proposer receives accept messages from a large enough set of the Acceptor processes  – a majority of the processes. Learning is also simple:

To learn that a value has been chosen, a learner must find out that a proposal has been accepted by a majority of acceptors.

Suppose that there is a distinguished proposer process D. Suppose that process sends value to all Acceptors and marks the value as “chosen” if it receives an accept response from a majority of Acceptors. Acceptors only send accept messages to the distinguished proposer if they receive a proposal from the distinguished proposer.  Let Learners ask the distinguished proposer for chosen values. To make the learning process more efficient, the distinguished proposer can notify Acceptors that the value has been chosen and they can answer Learners too. All properties are now satisfied:

• Only a value that has been proposed may be chosen: only D proposes so all it has to do is to select a single value.
• Only a single value is chosen: follows from the first.
• A process never learns that a value has been chosen unless it actually has been: the process D knows reliably and can inform users and Acceptors that have received notification from D can also inform Learners reliably.

If there are no failures, there is nothing else to discuss. If processes can fail, there is still not much to discuss. As long as the distinguished proposer doesn’t fail, everything works. If the distinguished proposer and some  of the Acceptors fail before any Acceptor has accepted, a poll of Acceptors will reveal no accepted value – and we know that no Learner has been falsely told there is some consensus value. If some Learner was told about a chosen value and the leader and some minority of Acceptors fail,  since a majority of Acceptors must have accepted and a minority have failed, there is at least one Acceptor that knows the accepted value and it can be copied to all surviving Acceptors. There cannot be two different accepted values. If  no Learner was told about an accepted value, either no value was chosen or one was and nobody was informed. In either case, we can just copy an accepted value if any of the Acceptors has one, or start from a blank slate otherwise. No harm no foul.

Notice, we have not had to worry about reliable store. All we need is the absence of spurious or corrupted messages and the survival of at least 1/2 of the Acceptors. If we need a sequence of values, the distinguished proposer can just rerun the same process as needed and incorporate a sequence number in the values. The distinguished proposer is, of course, the single point of failure but an election mechanism can select new proposers.

(revised)

## The difference between unspecified, undefined, and non-deterministic

There is too much confusion in the “formal methods” computer science literature between these three different terms. Let me start with what this means for a state machine and then move on to engineering objects such as threads. Suppose we have a map α: X → Y  where X is the set of finite sequences over X. In an earlier post I explain how these maps are equivalent to state machines.  For a finite sequence z we can leave  α(z) unspecified – meaning, we don’t make any assurances about the value other than that  α(z)∈ X.  Alternatively, we can say that α(z) is undefined which is a definite assertion that α does not associate any element of X with z. If α is considered as a set of pairs (s,x) with the function property that (s,x)∈α and (s,x’)∈α implies x=x’, then specifying that  α(z) is undefined specifies that (z,x) ∉ α for any x.  So  “α(z) is undefined ” is, in fact, a quite precise specification. Finally, if we say “α(z) is non-deterministic ” we mean that α is not a function at all, it is a relation on E* cross X so that (s,x)∈α and (s,x’)∈α  and  x=x’ is true for at least one s, x and x’. These are all different, but it’s common to mix them up in “formal methods” which has bad mathematical effects and worse effects when trying to understand how computer systems work.

Programs are purely deterministic systems. Given the same inputs, they always compute the same outputs. Without this property, they would not be useful at all. Generally, when researchers say a program is non-deterministic, they are discussing a program that has unspecified or undefined behavior or inputs. It’s that last part that really introduces confusion. For example, it is common to say threaded programs are non-deterministic. But that’s not correct. The apparent non-determinism is due to differences in i/o and scheduling – both of which are inputs. If the operating system schedules a multi-threaded program the same way twice and the input data is the same, the computations will be identical.  The same can be said about i/o.

In fact, the underlying error is sloppy definition of system state. We can see this most clearly in Robin Milner’s CCS book where he speaks of the non-deterministic characteristic  of a device where he has somehow included the choices of the external operator or the weather conditions as part of system state. The consequence is that his state machines are non-deterministic, and this makes the math really complicated and slippery – for no good reason.