Regarding ease of use, it’s often struck me when reviewing data systems papers that the evaluation sections are full of performance and correctness criteria, but only rarely is there any discussion of how well a system helps its target users achieve their goals: how easy is it to build, maintain, and debug applications? How easy is it to operate and troubleshoot? Yet in an industry setting, systems that focus on ease of use (even at the expense of some of the other criteria) have tended to do very well. What would happen if a research program put ease of use (how easy is it to achieve the outcome the user desires) as its top evaluation criterion? – Adrian Colyer
All of the distributed consensus algorithms I have been reviewing recently (Paxos, Raft, Zab, Chang-Maxemchuk, Viewstamped Replication, …) are based on a number of assumptions about the network environment, including the assumption that messages may be lost but are not silently corrupted. Is that a good assumption? Perhaps:
The first time I saw this method was when I went to work for Parallel Computer Systems, later called Auragen, in the famous tech startup center of Englewood Cliffs, New Jersey. I commuted there from the East Village. (True story: I applied for the job after finding an advert in a discarded copy of the NY Times on the floor of a Brooklyn apartment while visiting friends. I sent via US mail a resume typed on a manual typewriter, I’m tempted to say “composed by the light of a tallow candle” but that would be over the top, and forgot to send the second page.)
The company built a parallel computer based on Motorola 68000s with a replicated message bus. The bus guaranteed that message delivery to 3 destinations would either succeed for all three or fail for all three. This property is called “reliable broadcast”. All interprocess communication was by message transfer (a fashionable idea at the time). Each process had a backup. Whenever a primary process sent a message, the message was also delivered to the backup and to the destination backup. If the primary failed, the backup could be run. The backup would have a queue of messages received by the primary and a count of messages sent by the primary. When the recovering backup tried to transmit a message, if the count was greater than zero, the count would be decremented and the message discarded, because it had already been transmitted by the old primary. When the recovering secondary did a receive operation, if there was a message on the input queue, it would get that message. In this way, the recovering backup would repeat the operations of the primary until it caught up. As an optimization, the primary could be periodically checkpointed and the queues of duplicated messages discarded.
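The replay bookkeeping described above can be sketched in a few lines. This is a hypothetical reconstruction, not code from the Auragen system or the patent; the class and method names are mine:

```python
# Hypothetical sketch of the Auragen-style backup replay described above.
# Names (Backup, record_sent, etc.) are illustrative, not from the patent.
from collections import deque

class Backup:
    def __init__(self):
        self.inbox = deque()     # copies of messages the primary received
        self.sent_count = 0      # count of messages the primary already sent

    # While the primary is alive, the reliable broadcast bus delivers
    # copies here so the backup can track the primary's progress.
    def record_received(self, msg):
        self.inbox.append(msg)

    def record_sent(self):
        self.sent_count += 1

    # After failover, the backup re-executes the primary's code; its sends
    # and receives are intercepted by these two methods until it catches up.
    def send(self, msg, transmit):
        if self.sent_count > 0:
            self.sent_count -= 1     # old primary already transmitted this
        else:
            transmit(msg)            # caught up: really send it

    def receive(self, wait_for_message):
        if self.inbox:
            return self.inbox.popleft()  # replay a message already consumed
        return wait_for_message()        # caught up: really receive
```

On replay, the first `sent_count` sends are suppressed and queued messages are consumed before any new ones, so the backup deterministically retraces the primary's steps.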
The operating system was an implementation of UNIX. In practice, it was discovered that making each UNIX system call into a message exchange, an idea advocated in the OS research community at the time, caused serious performance problems. Yet the replicated state machine operation depended on this design to make state machine execution deterministic. Suppose the primary requested, for example, the time and then made a decision based on the time. A recovering secondary would need exactly the same time to guarantee that it produced the same results as the primary. So every interaction between application and OS needed to be recorded in a message exchange. But a message exchange is nowhere near as fast as a system call (unless the OS developers are horrible).
The performance issue was mitigated by some clever engineering, but it was a problem discovered in parallel by a number of development teams working on the distributed OS designs and micro-kernels that were in vogue at the time. Execution of “ls -l” was particularly interesting.
Anyways, here’s the description from the patent.
To accomplish this object, the invention contemplates that instead of keeping the backup or secondary task exactly up to date, the backup is kept nearly up to date but is provided with all information necessary to bring itself up to the state of the primary task should there be a failure of the primary task. The inventive concept is based on the notion that if two tasks start out in identical states and are given identical input information, they will perform identically.
In particular, all inputs to a process running on a system according to the invention are provided via messages. Therefore, all messages sent to the primary task must be made available to the secondary or backup task so that upon failure of the primary task the secondary task catches up by recomputing based on the messages. In essence, then, this is accomplished by allowing every backup task to “listen in on” its primary’s message.
United States Patent 4,590,554, Glazer et al., May 20, 1986
See also: A message system supporting fault tolerance.
and a very similar later patent.
Distributed consensus involves organizing a collection of independent agents – processes or network sites – to agree on some value or sequence of values. Many distributed consensus methods depend on a leader-follower scheme in which the leader is an agent that essentially tells the followers what the values are. The challenges in such methods are to determine when enough of the followers have accepted the value and how to recover from failures of agents. In particular, failures of the leader trigger some procedure to select a new leader. Leader election, however, is itself a distributed consensus problem. In fact, leader election is the harder problem. Once there is a leader, consensus in the followers can be produced by a dead simple protocol (see the second part of this post). Oddly, leader election is generally treated as a minor issue. For example, in “Paxos made simple” we read:
The famous result of Fischer, Lynch, and Patterson  implies that a reliable algorithm for electing a proposer must use either randomness or real time—for example, by using timeouts. However, safety is ensured regardless of the success or failure of the election.
The FLP result is essentially a tautology: if an agent doesn’t ever get any information that reliably distinguishes between failure and slow response in a second agent, the first agent cannot reliably distinguish between failure of the second agent and slow response. So the import of the first sentence is that leader election depends on timeouts or “randomness” (perhaps this means some analysis of probability of failure scenarios). I don’t think this is correct, but it’s an interesting claim. The second sentence says nothing more than that an algorithm that fails to progress will never produce a false result – which I think is also a dubious claim.
Algorithm P solves problem X by assuming some other mechanism solves X and then by using that mechanism to make problem X simpler. Ok.
Lamport’s “Paxos Made Simple” paper is notoriously hard to understand but at least part of the difficulty is that the algorithm changes radically in the middle of the presentation. The first part of the paper presents a subtle (maybe too subtle) method to permit multiple processes or network sites to agree on a consensus value. The second part of the paper switches to a second, much simpler algorithm without much notice.
The paper begins with a problem statement:
Assume a collection of processes that can propose values. A consensus algorithm ensures that a single one among the proposed values is chosen. If no value is proposed, then no value should be chosen. If a value has been chosen, then processes should be able to learn the chosen value. The safety requirements for consensus are:
- Only a value that has been proposed may be chosen,
- Only a single value is chosen, and
- A process never learns that a value has been chosen unless it actually has been
The Paxos algorithm that is presented first is liable to what we used to call livelock even in the absence of failure:
It’s easy to construct a scenario in which two proposers each keep issuing a sequence of proposals with increasing numbers, none of which are ever chosen.
It’s argued that this original Paxos works in the sense that it can never reach an inconsistent state (it is “safe”), but the livelock scenario means it can easily fail to progress even without process failure or loss of messages. To avoid this scenario, on page seven of an eleven-page paper, Paxos is redefined to restrict the set of proposers to a single member – a distinguished proposer.
To guarantee progress, a distinguished proposer must be selected as the only one to try issuing proposals
Then, on page nine, when going over how Paxos can be used to build replicated state machines, Lamport writes:
In normal operation, a single server is elected to be the leader, which acts as the distinguished proposer (the only one that tries to issue proposals) in all instances of the consensus algorithm
Since much of the complexity of the first part of the paper involves describing how Paxos safely resolves contention between competing proposers, the modification is far from minor. It’s unfortunate that both the multi-proposer and the single-proposer algorithm are called by the same name, which seems to cause confusion in the literature and certainly obscures the presentation in “Paxos Made Simple“. For example, the “Paxos Made Live” paper appears to discuss an implementation based on the first (the multi-proposer) Paxos algorithm but then appears to revert to the second method via a mechanism called “master leases”.
In any case, the single proposer problem is much simpler than the original problem.
A brute force commit solution to the single proposer problem.
What, exactly, do the bolded words in the problem statement mean? “Chosen” and “learn” are the two hard ones. “Proposed” is pretty clear: a process sends messages to all the other processes that say “I, process A, propose value V with supporting documentation x“.
A proposer sends a proposed value to a set of acceptors. An acceptor may accept the proposed value. The value is chosen when a large enough set of acceptors have accepted it.
Proposal is a simple action: the proposer sends a proposal message.
There are the usual assumptions: messages may be lost or duplicated or delivered out of order, but are neither corrupted nor spurious. That is: if a process receives a message from a second process, the second process must have previously transmitted that message. A process can propose a value, but that proposal may never arrive at any of the other processes, or it may arrive at all of them or some of them. It makes sense that an accept operation is the transmission of an accept message by some process that received the proposal message. Presumably then, the value is chosen when the original proposer receives accept messages from a large enough set of the Acceptor processes – a majority of the processes. Learning is also simple:
To learn that a value has been chosen, a learner must find out that a proposal has been accepted by a majority of acceptors.
Suppose that there is a distinguished proposer process D. Suppose that process sends a value to all Acceptors and marks the value as “chosen” if it receives an accept response from a majority of Acceptors. Acceptors only send accept messages to the distinguished proposer if they receive a proposal from the distinguished proposer. Let Learners ask the distinguished proposer for chosen values. To make the learning process more efficient, the distinguished proposer can notify Acceptors that the value has been chosen and they can answer Learners too. All properties are now satisfied:
- Only a value that has been proposed may be chosen: only D proposes so all it has to do is to select a single value.
- Only a single value is chosen: follows from the first.
- A process never learns that a value has been chosen unless it actually has been: the process D knows reliably and can inform users and Acceptors that have received notification from D can also inform Learners reliably.
If there are no failures, there is nothing else to discuss. If processes can fail, there is still not much to discuss. As long as the distinguished proposer doesn’t fail, everything works. If the distinguished proposer and some of the Acceptors fail before any Acceptor has accepted, a poll of Acceptors will reveal no accepted value – and we know that no Learner has been falsely told there is some consensus value. If some Learner was told about a chosen value and the leader and some minority of Acceptors fail, since a majority of Acceptors must have accepted and a minority have failed, there is at least one Acceptor that knows the accepted value and it can be copied to all surviving Acceptors. There cannot be two different accepted values. If no Learner was told about an accepted value, either no value was chosen or one was and nobody was informed. In either case, we can just copy an accepted value if any of the Acceptors has one, or start from a blank slate otherwise. No harm no foul.
Notice that we have not had to worry about reliable storage. All we need is the absence of spurious or corrupted messages and the survival of a majority of the Acceptors. If we need a sequence of values, the distinguished proposer can just rerun the same process as needed and incorporate a sequence number in the values. The distinguished proposer is, of course, a single point of failure, but an election mechanism can select new proposers.
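The brute-force protocol above can be sketched in a few lines. This is an illustrative model, not an implementation: message loss is modeled only by passing a `reachable` subset of acceptors, and all names are my own:

```python
# Illustrative sketch of the brute-force single-proposer protocol.
# Message loss is modeled by the "reachable" set; names are hypothetical.

class Acceptor:
    def __init__(self):
        self.accepted = None   # value accepted from the distinguished proposer
        self.chosen = None     # set once the proposer notifies us

    def on_propose(self, value):
        self.accepted = value
        return True            # models sending an "accept" reply back

    def on_chosen(self, value):
        self.chosen = value    # now this acceptor can answer Learners too

class DistinguishedProposer:
    def __init__(self, acceptors):
        self.acceptors = acceptors
        self.chosen = None

    def propose(self, value, reachable):
        # Only replies from reachable acceptors arrive.
        accepts = sum(1 for a in reachable if a.on_propose(value))
        if accepts > len(self.acceptors) // 2:   # a majority accepted
            self.chosen = value
            for a in reachable:
                a.on_chosen(value)               # optimization: fan out
        return self.chosen

def learn(proposer, acceptors):
    # A Learner asks the proposer, or any acceptor that was notified.
    if proposer.chosen is not None:
        return proposer.chosen
    return next((a.chosen for a in acceptors if a.chosen is not None), None)
```

If the proposer never hears from a majority, nothing is chosen and no Learner is ever told otherwise, which is exactly the safety argument made above.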
When people think of mathematical models of state for programs and other computer systems, it’s natural and conventional to consider state as a map from symbolic names of state variables to values. This is an error for a number of reasons, including, most critically, the lack of any compositional structure. Let’s step back and think about what a discrete state system looks like.
Notation warning: I’m using _ for subscripts: x_p or f_r or one_(two).
- We have a set E of discrete “events” called the event alphabet. The event alphabet defines all the possible events and inputs that can cause the state to change.
- A finite sequence of events over E can be thought of as a system history, describing the series of events that drove the system from its initial state (the point at which no events at all had happened). You could complain at this point that deterministic, single-thread-of-execution (the sequence) models can’t describe non-determinism, parallelism, concurrency, and abstract properties, but you’d be wrong – keep reading.
- Now we are going to take an odd step – I don’t want to start with a state machine or other object that represents the whole system. I want to start with state variables that express some system property and how it changes in response to events. For example, a state variable Executing might tell us the process identifier of the currently running (executing) process or it could be set valued in a multi-processor environment. And priority_p might be the priority of the process with identifier, p. As the system advances state, the values of the state variable change but often we want to have properties that are true in any state. For example, (Executing=p and q in Waiting and priority_p > priority_q implies TimeSinceSwitch < t) would be a good property for some operating system that needs to schedule by priority but also let lower priority processes advance.
- State variables are dependent variables that depend on the event sequence and some map that extracts information from the event sequence. The event sequence is the free variable and the state variable is a dependent variable that embodies some aspect of system state. Everything that determines current state is in the event sequence, but state is not a completed thing, not a discrete object; it is the cumulative information of multiple state variables depending on that event sequence. The same event sequence may correspond to completely different collections of state variables for different systems.
- We might have many state variables to specify how an operating system works or even how a simple gate works. Let’s think of all of them at the same level of abstraction as depending on the same event sequence (finite, of course).
- The key to this approach is to be able to compactly define state variables and abstract properties of state variables even though the event alphabets and maps on sequences of events will often be enormously complicated.
- Say a variable y is a state variable for an event alphabet E if y=f(σ) for some function f where σ is a free variable in the set of finite sequences over E which includes the empty sequence Nil. The map f is a solution for y that extracts values from the sequence as state advances (as events are appended to the event sequence). Essentially y depends on two parameters: the sequence and the map. These parameters are often implicit.
- By the definition of state variable, if y is a state variable, then y=f(σ) for some f so y(z) = f(z) by the usual convention.
- If y is a state variable then y(Nil) is the value of y in the initial state since y(Nil) = f(Nil).
- The convention here is that σ is always a free variable over E* (the set of finite sequences over alphabet E).
- If z is a sequence and e is an event or a variable over events, write Append(z,e) or just ze for the sequence obtained by appending e to z on the right. So y(σe) is the value of y in the “next state” – the state after e – since y(σe) = f(σe).
- If y = y(σe) then the event “e” leaves the value of y unchanged.
- If y(σe) = y +2 then the event “e” increases the value of y by 2. The equation y(σe)=y+2 can be rewritten as f(σe)=f(σ) + 2 if we know f.
- A primitive recursive definition on the sequence will completely define the solution f if the recursive map is defined. So [y(Nil) = k and y(σe) = h(e, y)] defines f if h and k are defined. Note y(Nil) = f(Nil) = k. And y(σe) = h(e, y) = h(e, y(σ)) = h(e, f(σ)). Suppose the sequence w = <a,b,c>; then y(w) = f(<a,b,c>) = h(c, f(<a,b>)) = h(c, h(b, f(<a>))) = h(c, h(b, h(a, k))). In many cases we don’t want to or can’t specify f completely: f is a solution to the constraints on y and there may be many, infinitely many, or no solutions.
- For example, suppose C is a state variable and C(Nil)=0 and C(σe) = 1+C. Then we have defined C to be the length of the event sequence.
- Define L(σ e) = e so L is the most recent event. Note we do not need to define L(Nil).
- Define [SinceReset(Nil) = 0 and SinceReset(σ e) = (0 if e=RESET and 1+SinceReset otherwise)].
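These recursive definitions can be realized directly as folds over the event sequence. A small Python sketch, where the helper names are my own and events are plain values:

```python
# A state variable as a function of the event sequence: primitive
# recursion over the sequence is just a fold. Helper names are illustrative.
from functools import reduce

def state_var(init, step):
    """Return f with f(Nil) = init and f(ze) = step(e, f(z))."""
    return lambda events: reduce(lambda y, e: step(e, y), events, init)

# C(Nil) = 0 and C(σe) = 1 + C: the length of the event sequence.
C = state_var(0, lambda e, y: 1 + y)

# L(σe) = e: the most recent event (given an arbitrary initial value here,
# since L(Nil) was deliberately left undefined in the text).
L = state_var(None, lambda e, y: e)

# SinceReset(Nil) = 0; SinceReset(σe) = 0 if e = RESET, else 1 + SinceReset.
SinceReset = state_var(0, lambda e, y: 0 if e == "RESET" else 1 + y)
```

Applying any of these variables to a concrete sequence, such as `C(["a", "b", "c"])`, evaluates the recursion down to the initial state, exactly as in the hand expansion of h above.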
- Suppose we have a state variable Clock that somehow determines the time passed since the initial state from the event sequence. Then (Clock(σ e) - Clock) is the duration of event e in this state (and may be different for different values of σ). Define waiting_p(Nil)=0 and waiting_p(σ e)=0 if Executing(σ e)=p, and waiting_p(σ e)= waiting_p + (Clock(σ e) - Clock) otherwise. So waiting_p is the time that process p has been waiting to become the executing process. We might want a property (waiting_p > k only if priority_p < priority_q where q = Executing).
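Continuing the sketch, here is one hypothetical encoding of waiting_p: each event is assumed to be a pair (op, t) where t is the Clock reading after the event and op == ("run", p) models Executing(σe) = p. This is only one of many ways a Clock variable could be extracted from a sequence:

```python
# Illustrative sketch of waiting_p. Events are pairs (op, t); the encoding
# and names are assumptions, not part of the formalism in the text.
def waiting(p):
    def f(events):
        w, clock = 0, 0        # waiting_p(Nil) = 0; Clock(Nil) = 0
        for op, t in events:
            dt = t - clock     # Clock(σe) - Clock: duration of this event
            clock = t
            if op == ("run", p):
                w = 0          # p became the executing process
            else:
                w += dt        # p kept waiting for this event's duration
        return w
    return f
```

Note that waiting_p is still just a function of the event sequence; the running totals `w` and `clock` are only an implementation of the recursion.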
- Ok, now we can consider component composition – which is just function composition. Consider the state variable L defined above and suppose we have a new alphabet B consisting of Left_x and Right_x for x in some set X of data values. The goal is to construct a shift register from n instances of L. Define L_i = L(u_i) where u_i is also a state variable whose values are sequences of elements of X. Usually we want components to start in the initial state, so u_i(Nil) = Nil. This means L_i(Nil) = L(Nil), which is something we didn’t define. Now define u_i(σ e) where “e” is an element of the event alphabet of the composite system. Remember I’m using juxtaposition to mean “append on the right”, so, for example, (u_i L_(i-1)) means “append the current value of L_(i-1) to the current value of u_i”.
- u_i(σ e)= u_i x if i=1 and e=Right_x or if i=n and e=Left_x
- u_i(σ e)= u_i L_(i-1) if i>1 and e=Right_x
- u_i(σ e)= u_i L_(i+1) if i<n and e=Left_x
- The definition above causes all n components to change state in parallel when an event is appended to the event sequence of the composite system – but it’s just composition: L_i = L(u_i)= L(u_i(σ)) and L_i(σ e)= L(u_i(σ e) ) =L(u_i(σ)e’) = e’ where e’ is calculated as above.
- Suppose a device D is constructed from interconnected components C_1, … C_n. D is a function of σ, but the components don’t “see” the same events. For each component, we need a state variable u_i that is sequence-valued and a function of σ. Usually these are defined recursively: u_i(Nil)= Nil and u_i(σe)= concatenate(u_i, h_i(e, C_1(u_1), …, C_n(u_n))) or something like that. That is, each event e for D causes the event or events h_i(e, C_1(u_1), …, C_n(u_n)) to be appended to the sequence of events u_i.
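The shift register composition can be simulated directly, with each cell an instance of L driven by its own sequence u_i. The event encoding below is an assumption, and the arbitrary initial cell value is modeled as None since L(Nil) was left undefined:

```python
# Sketch of the n-cell shift register built from copies of L (the
# most-recent-event variable). Right_x feeds cell 1 and shifts the
# contents right; Left_x feeds cell n and shifts left. Encoding is assumed.
def shift_register(n, events):
    u = [[] for _ in range(n)]              # u_i, here 0-indexed
    for kind, x in events:                  # e.g. ("Right", "a")
        prev = [ui[-1] if ui else None for ui in u]   # L_i before this event
        for i in range(n):                  # all cells step in parallel
            if kind == "Right":
                u[i].append(x if i == 0 else prev[i - 1])
            else:                           # "Left"
                u[i].append(x if i == n - 1 else prev[i + 1])
    return [ui[-1] if ui else None for ui in u]       # current L_i values
```

One event of the composite system appends one event to every u_i, so all n cells change state together, which is exactly the "just composition" point above: L_i(σ e) = L(u_i(σ e)).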