The replicated state machine method of fault tolerance from 1980s

The first time I saw this method was when I went to work for Parallel Computer Systems, , later called Auragen, in the famous tech startup center of Englewood Cliffs, New Jersey. I commuted there from the East Village. (True story: I applied for the job after finding an advert in a discarded copy of the NY Times on the floor of a Brooklyn apartment while visiting friends. I sent via US mail a resume typed on a manual typewriter- I’m tempted to  say “composed by the light of a tallow candle” but that would be over the top- and forgot to send the second page. )

The company built a parallel computer based on Motorola 68000s  with a replicated message bus. The bus guaranteed message delivery to 3 destinations would either succeed to all three or fail to all three. This property is called “reliable broadcast”.  All interprocess communication was by message transfer (a fashionable idea at the time). Each process had a backup.  Whenever a primary process sent a message, the message was also delivered to the backup and to the destination backup. If the primary failed, the backup could be run. The backup would have a queue of messages received by the primary and a count of messages sent by the primary.  When the recovering backup tried to transmit a message, if the count was greater than zero, the count would be decremented and the message discarded because it has already been transmitted by the old primary. When the recovering secondary did a receive operation, if there was a message on the input queue, it would get that message.  In this way, the recovering backup would repeat the operations of the primary until it caught up. As an optimization, the primary could be periodically checkpointed and queues of duplicated messages could be discarded.

The operating system was an implementation of UNIX. In practice, it was discovered that making each UNIX system call into a message exchange, which was an idea advocated in the OS research community at the time, caused serious performance problems.  The replicated state machine operation depended on this design  in order to make the state machine operation deterministic. Suppose the primary requested, for example,  the time and then made a decision based on the time.  A recovering secondary would need exactly the same time to guarantee that it produced the same results as the primary. So every interaction between application and OS needed to be recorded in a message exchange.  But a message exchange is nowhere near as fast as a system call (unless the OS developers are horrible).

The performance issue was mitigated by some clever engineering, but  was a problem that was discovered in parallel by a number of development teams working on distributed OS designs and micro-kernels which were in vogue at the time. Execution of “ls -l” was particularly interesting.

Anyways, here’s the description from the patent.

To accomplish this object, the invention contemplates that instead of keeping the backup or secondary task exactly up to date, the backup is kept nearly up to date but is provided with all information necessary to bring itself up to the state of the primary task should there by a failure of the primary task. The inventive concept is based on the notion that if two tasks start out in identical states and are given identical input information, they will perform identically.

In particular, all inputs to a process running on a system according to the invention are provided via messages. Therefore, all messages sent to the primary task must be made available to the secondary or backup task so that upon failure of the primary task the secondary task catches up by recomputing based on the messages. In essence, then, this is accomplished by allowing every backup task to “listen in on” its primary’s message.

United States Patent 4,590,554 Glazer ,   et al.May 20, 1986

Inventors: Glazer; Sam D. (New York, NY), Baumbach; James (Brooklyn, NY), Borg; Anita (New York, NY), Wittels; Emanuel (Englewood Cliffs, NJ)
Assignee: Parallel Computers Systems, Inc. (Fort Lee, NJ)
Family ID: 23762790
Appl. No.: 06/443,937
Filed: November 23, 1982

See also: A message system supporting fault tolerance.

and a very similar later patent.

MiFID2 and security – keeping track of the money


A shorter version of this post is on the  FSMLabs web site.  MiFID2 is a new set of regulations for the financial services industry in Europe that includes a much more rigorous approach to timestamps.  Timestamps are in many ways the foundation for data integrity in modern processing systems – which are distributed, high speed, and generally gigantic. But when regulations or business or other constraints require timestamps to really work, the issues of fault tolerance and security come up. It doesn’t matter how precise your time distribution is if a mistake or a hacker can easily turn it off or control it.

TimeKeeper incorporates a defense-in-depth design to protect it from deliberate security attacks and errors due to equipment failure or misconfiguration. This engineering approach was born out of a conviction that precise time synchronization would become a business and regulatory imperative.KeystoneCops

  1. Recent disclosures of still more security problems in the NTPd implementation of NTP show how vulnerable time synchronization can be without proper attention to security. PTPd and related implementations of the PTP standard have similar vulnerabilities.
  2. Security and general failure tolerance should be on the minds of firms that are considering how to comply with the MiFID2 rules because time synchronization provides both a broad attack surface and a single point of failure unless properly implemented.

The first step towards time non-naive time synchronization is a skeptical attitude on the parts of IT managers and developers. Ask the right questions at acquisition and design time to prevent unpleasant surprises later.

One of the most dangerous aspects of the just disclosed NTPd exploit is that NTPd will accept a message from any random source telling it to stop synchronizing with its actual time sources. Remember, NTPd is an implementation of NTP, other implementations may not suffer from the same flaw. That d is easy to overlook, but it’s key. TimeKeeper’s NTP and PTP implementations will, for example, ignore commands that do not come from the associated time source and will apply analytical skepticism to commands that do appear to come from the source. TimeKeeper dismisses many of these types of attacks immediately and will start throwing off alerts to provoke automated and human counter-measures. The strongest protection TimeKeeper offers, however, comes from its multi-source capabilitiesthat allow it to compare multiple time sources in real-time and reject a primary source that has strayed.

Correct time travels a long, complex path from a source such as a GPS receiver or a feed like the one British Telecom is now providing. Among the questions system designers need to ask are the following two.

  1. Is the chain between source and client safeguarded comprehensively and instrumented end-to-end?
  2. Is there a way of cross-checking sources against other sources and rejecting bad sources?

Without positive answers to both of these questions, the time distribution technology is inherently fragile and robust MiFID2 timestamp compliance will be unavailable.

The painting is: “Quentin Massys 001” by Quentin Matsys (1456/1466–1530) – The Yorck Project: 10.000 Meisterwerke der Malerei. DVD-ROM, 2002. ISBN 3936122202. Distributed by DIRECTMEDIA Publishing GmbH.. Licensed under Public Domain via Commons – 

The Enterprise Profile for PTP and TimeKeeper

One of the most interesting things we saw in the proposed IEEE 1588 enterprise profile was a bold suggestion on fault tolerance that looked familiar. Here’s FSMLabs press release from September 2011

TimeKeeper 5.0 offers the ability to monitor multiple time distribution channels, even those operating on different time distribution standards or of different quality due to distance or network issues. As an example, a TimeKeeper client may monitor two different Precision Time Protocol (PTP) “master clocks” and three different Network Time Protocol (NTP) servers. In addition, if the time quality of TimeKeeper’s primary sources becomes questionable, TimeKeeper can now switch from tracking one time source to another, according to a fail-over list provided at configuration time.

This press release described products that were already in the field in production. I remember that although customers liked this capability, talks at timing conferences often provoked complaints from engineers who insisted that the PTP “Best Master Clock” protocol already solved the problem. Anyways, it was gratifying to see that by February 2015 a similar, scaled down, capability was being proposed for the PTP Enterprise Profile. 

Clocks SHOULD include support for multiple domains.  The purpose is to support multiple simultaneous masters for redundancy. Leaf devices (non-forwarding devices) can use timing information from multiple masters by combining information from multiple instantiations of a PTP stack, each operating in a different domain. Redundant sources of timing can be ensembled, and/or compared to check for faulty master clocks. The use of multiple simultaneous masters will help mitigate faulty masters reporting as healthy, network delay asymmetry, and security problems.  Security problems include man-in-the-middle attacks such as delay attacks, packet interception / manipulation attacks. Assuming the path to each master is different, failures malicious or otherwise would have to happen at more than one path simultaneously. Whenever feasible, the underlying network transport technology SHOULD be configured so that timing messages in different domains traverse different network paths.

Note that there are three things missing from this proposal that were in TimeKeeper 5.0 back in 2011: the ability to use NTP sources as well as PTP, the ability to use multiple PTP sources in the same domain, and working software. Stating “SHOULD” in a standard is a long way from “works in the field” but recognition of the problem is a good step.


From Jersey to Wall Street – or the equivalent

cartaretA common configuration for FSMLabs TimeKeeper customers is to cross-couple time sources in New Jersey and New York City or London and Slough or Chicago and Aurora or Singapore and Sidney- any two trading locations that are connected with high quality network. Sometimes the network connection does not even have to be that great. TimeKeeper will cross-check time sources, complain when things look wrong, and failover when needed.  Multiple time sources also produces a timestamp provenance. Trading applications will have a record showing that the  timestamps they produce were in agreement with two or more independent sources. A number of firms scale this basic design to multiple sites: increasing the depth of fault-tolerance and the strength of the provenance. Cross-coupling time feeds also  provides early warning on a number of network problems. Several customers saw TimeKeeper warnings about secondary sync  and found on investigation that their network providers were changing equipment or rerouting without notice.




Queues and algebra

Suppose we have a state machine Q, that implements a common first in first out queue. The input alphabet of Q consists of “Deq” and “Enq x” where “x” ranges over a set of values, say, V. Let’s fix the max length of Q at some constant k. As usual, given a sequence of events “w”, write Q(w) for the output of Q in the state reached by following “w” from the initial state. For the empty string NULL, define Q(NULL)=(). If a=Deq then define Q(wa) depending on Q(w) – if Q(w)=() or Q(w)=(x) then Q(wa)=(). If Q(w)=(x1 .. xn) then define Q(wa)=( x1 …). If a=Enq x then if Q(w)=() define Q(wa)=(x), if Q(w)=(x1 .. xn) where n<k define Q(wa)=(x,x1 … xn) and otherwise leave Q(wa)=Q(w). The number of states of Q is a function of the number of elements of V and the constant k and grows rapidly as they grow – sum[i=1, k](|V|^i  ).

If you visualize the graph of Q, you can see it contains non-trivial loops – which means that the monoid associated with Q contains groups. The advantage of decomposing Q into a product of smaller machines is that this might guide a smarter implementation. Well, I’m kind of cheating here, because programmers have found a variety of smart implementations that do strip out the group structure.

The easy decomposition produces two counters, and a collection of k storage elements that are loop free state machines: an input “store x” moves the machine to a “x” state. They do contain loops but only trivial ones. The counters are group machines. We number the storage cells 0 to k-1.  We can have one mod k counter that “points” to the tail of the queue and one that counts the size of the queue. On an enq operation the enqueued value is stored in the storage element identified by subtracting the size from the tail mod k,   and the size is increment mod k+1. An enq operation when the size=k is ignored and has no effect. On a deq, the size is decremented and the tail is incremented by one mod k+1. What we’ve done is construct a system in which the state machine monoids are two cyclic groups and k-aperiodic monoids. If we then switch to more complicated data structures, like ordered lists, it’s not that hard to see how a permutation group might be used to manipulate pointers to a pile of storage locations. For example (n1,n2,n3,n4,blank, n5,n6) might be an ordered list with n5 and n6 being free. An enqueue would generate some shuffle to say, (n1,n2,n5,n3,n4,blank,n6) and then a dequeue would produce (n2,n5,n3,n4,blank, n6,n1).

A nice property of groups is that “undo” always works. So if the data in the storage elements can be marked in some way, a deq or enq can be undone by subtracting mod k or permuting backwards.

Fault tolerant patent application for virtual machine

[0018]For incoming network packets, the following is done during the logging mode. When a packet is received, an event-request is posted for the VMM, at Block 210. When the VMM processes the event, it stops the VM, synchronizes the guest VCPU state, and then calls into the device emulator (i.e., device emulation software event handler), at Block 220. The device emulator logs an event at Block 230. Then, the device emulator receives the packets, and logs their contents, at Block 240. The last packet logged is marked so that during replay the device emulator can know when the last packet for this event occurs in the log.

[0019]During replay the following occurs: When an I/O event is encountered in the log, the device emulator (i.e., device emulation software event handler) is called by the VMM. The device emulator reads all packets that were logged, and copies them into the memory of the VM. In this way, the receive queue of packets is updated at the exact same point in the instruction execution sequence during logging and replay.

United States Patent Application 20090007111
Kind Code A1
Nelson; Michael ;   et al. January 1, 2009


VMWARE application.