The replicated state machine method of fault tolerance from 1980s

The first time I saw this method was when I went to work for Parallel Computer Systems, , later called Auragen, in the famous tech startup center of Englewood Cliffs, New Jersey. I commuted there from the East Village. (True story: I applied for the job after finding an advert in a discarded copy of the NY Times on the floor of a Brooklyn apartment while visiting friends. I sent via US mail a resume typed on a manual typewriter- I’m tempted to  say “composed by the light of a tallow candle” but that would be over the top- and forgot to send the second page. )

The company built a parallel computer based on Motorola 68000s  with a replicated message bus. The bus guaranteed message delivery to 3 destinations would either succeed to all three or fail to all three. This property is called “reliable broadcast”.  All interprocess communication was by message transfer (a fashionable idea at the time). Each process had a backup.  Whenever a primary process sent a message, the message was also delivered to the backup and to the destination backup. If the primary failed, the backup could be run. The backup would have a queue of messages received by the primary and a count of messages sent by the primary.  When the recovering backup tried to transmit a message, if the count was greater than zero, the count would be decremented and the message discarded because it has already been transmitted by the old primary. When the recovering secondary did a receive operation, if there was a message on the input queue, it would get that message.  In this way, the recovering backup would repeat the operations of the primary until it caught up. As an optimization, the primary could be periodically checkpointed and queues of duplicated messages could be discarded.

The operating system was an implementation of UNIX. In practice, it was discovered that making each UNIX system call into a message exchange, which was an idea advocated in the OS research community at the time, caused serious performance problems.  The replicated state machine operation depended on this design  in order to make the state machine operation deterministic. Suppose the primary requested, for example,  the time and then made a decision based on the time.  A recovering secondary would need exactly the same time to guarantee that it produced the same results as the primary. So every interaction between application and OS needed to be recorded in a message exchange.  But a message exchange is nowhere near as fast as a system call (unless the OS developers are horrible).

The performance issue was mitigated by some clever engineering, but  was a problem that was discovered in parallel by a number of development teams working on distributed OS designs and micro-kernels which were in vogue at the time. Execution of “ls -l” was particularly interesting.

Anyways, here’s the description from the patent.

To accomplish this object, the invention contemplates that instead of keeping the backup or secondary task exactly up to date, the backup is kept nearly up to date but is provided with all information necessary to bring itself up to the state of the primary task should there by a failure of the primary task. The inventive concept is based on the notion that if two tasks start out in identical states and are given identical input information, they will perform identically.

In particular, all inputs to a process running on a system according to the invention are provided via messages. Therefore, all messages sent to the primary task must be made available to the secondary or backup task so that upon failure of the primary task the secondary task catches up by recomputing based on the messages. In essence, then, this is accomplished by allowing every backup task to “listen in on” its primary’s message.

United States Patent 4,590,554 Glazer ,   et al.May 20, 1986

Inventors: Glazer; Sam D. (New York, NY), Baumbach; James (Brooklyn, NY), Borg; Anita (New York, NY), Wittels; Emanuel (Englewood Cliffs, NJ)
Assignee: Parallel Computers Systems, Inc. (Fort Lee, NJ)
Family ID: 23762790
Appl. No.: 06/443,937
Filed: November 23, 1982

See also: A message system supporting fault tolerance.

and a very similar later patent.


The Auragen file system.

This article on the interesting Wave Transactional File System inspired me to look up an earlier file system that also used copy on write semantics.


Anita Borg, Wolfgang Blau, Wolfgang Graetsch, Ferdinand Herrmann, and Wolfgang Oberle. 1989. Fault tolerance under UNIX. ACM Trans. Comput. Syst. 7, 1 (January 1989), 1-24. DOI=


4.3 Availability of the File System
Since a recovering file server reconstructs its buffers by reading blocks from the file system, the file system in the state as of the last sync must be available. The existence of that version of the file system is also necessary during recovery as the file server redoes requests. For example, if a file has been deleted since sync and a read request is reissued, the disk driver, and thus the recovering file server, will behave differently than the primary. Unfortunately, the contents of the disk can change between syncs, at least during the Fsync that constitutes the first phase of the sync operation.

The solution is to use a copy-on-write strategy between syncs, rather than overwriting existing blocks. Logically this corresponds to keeping two versions of a file system.3 An early version of the file system organization described here is discussed in Arnow [ 11].

There are two root nodes on disk. At any given time one of them is valid for recovery. We refer to the other as the alternate root. Associated with each root is state information (the state tables described above), the most recent being that associated with the currently valid root. Changes to the file system are done relative to a copy of the valid root kept in memory in the primary file server’s address space, and in a nondestructive manner, as seen in Figure 2(a-d). Freed blocks, which contain the old data, are added to a semi-free list, and cannot be reallocated until after the next sync. Therefore, the unmodified file system still exists rooted in the valid on-disk root node.

If a crash occurs at any time between syncs, the recovering file server is able to determine which root to use because of information sent on the primary’s last sync. It reads in the correct state information and reconstructs its buffers accordingly. Disk blocks that were used by the primary since the last sync appear to it as free blocks.

The difficult case is when a crash occurs during a sync. To see that the solution works in this case, consider the sequence of actions that take place during a sync. First, all dirty blocks except the root are written to disk, and old blocks are added to the semi-free list. Second, the state information is collected and written to the alternate state area. Third, the in-memory root is written to the alternate on disk root block, Finally, the sync message is constructed and sent to the backup. It contains the information necessary to update message queues as well as specifying which on-disk state information and root block to use on recovery.

Once the sync message has been sent, the semi-free list is added to the free list and the primary continues. Just before the sync message is sent, there are two copies of every modified data and indirect block. At any time before the sync message is sent, the old consistent state is available. Any time after it is sent, the new state and file system will be used and message queues consistently updated. An additional benefit of this organization is that the file system as a whole is considerably more robust than a standard UNIXstyle file system. Even if the entire system is shut down in an uncontrolled way as the result of multiple faults or operator error, there will always be an entire consistent file system on disk.

Annals of unintentional irony

Come back after a trip to see Marc Andreessen’s team of twitter posters complaining wryly about how government is so gosh durn big and citing experts like Milton Friedman.


That is, Marc Andreesen, the former National Center for Supercomputing Applications programmer who helped write the Mosaic Web Browser for the WWW that CERN scientists developed on top of the Internet technology that DARPA and NSF researchers and bureaucrats created on top of prior government funded technology like integrated circuits, and then built a company that made him rich with the team that worked together at NCSA,  engaging in the typical rich person bitching about the gummint.  Which shows how little reality matters to people.

Ikea and RedHat

This is state of the art for systems software now – which is not all that impressive.

Glantz explained that Ikea has more than 3,500 Red Hat Enterprise Linux (RHEL) servers deployed in Sweden and around the world. With Shellshock, every single one of those servers needed to be patched and updated to limit the risk of exploitation. So how did Ikea patch all those servers? Glantz showed a simple one-line Linux command and then jokingly walked away from the podium stating “That’s it, thanks for coming,” as the audience erupted into boisterous applause. On a more serious note, Glantz said that it took approximately 2.5 hours to test, deploy and upgrade Ikea’s entire IT infrastructure to defend against Shellshock.  Eweek.

How business works in the real world: Apple and GT and Beelzebub

GT Advanced declared bankruptcy and blamed Apple for its problems. Apple called GT Advanced’s story “defamatory”.  I have no idea about the specifics in this case but I do know about  big companies pushing insanely onerous and self-defeating terms on small ones.  Here’s the original claim by GT Advanced:

At the start of negotiations, Apple offered to buy 2,600 sapphire growing furnaces from GT Advanced, which GT Advanced would operate on behalf of Apple, the “ultimate technology client to land,” according to Squiller.

“In hindsight, it is unclear whether Apple even intended to purchase any sapphire furnaces from GTAT,” he wrote.

But after months of hard negotiating, Apple offered a deal under which it would shift away economic risk by lending GT Advanced the money to build the furnaces and grow the sapphire, and then sell it exclusively to Apple for less than market value, Squiller wrote.

GT Advanced was effectively forced to accept the unfair deal in October 2013 because its intense negotiations with Apple had left it unable to pursue deals with other smartphone makers, he said.

Punchinello_Mayor_HallBack when we were selling a real-time OS, I contacted an ex-boss who now had a high position in a telecommunications company to see if he could help us sell into it. His response was “don’ t touch this place, it is expert in destroying small vendors.” And, whatever the actual story with GT and Apple, the storyline is not at all unusual. The elements of a smaller company spellbound by prospects of a huge deal/giant customer, followed by time consuming negotiations, followed by onerous demands – been there, done that. We once were negotiating with a huge semiconductor company about a big deal that, over time, got worse and worse for us. We “finalized” with some terms that we thought might be survivable. And then the semiconductor company negotiators, one of whom by this point our negotiators were privately calling “Beelzebub”, announced they had to take the deal “to management” and came back with much more absurd demands. We were able to walk away but we saw other small companies make deals with Beelzebub and then fail. There is a strong impulse in some big companies, among some business units, to squeeze small company vendors way too far.

Once, after a sales visit to a big Wall Street Firm, two experienced sales people for a second supplier told me they were shocked by what I had said to the customer and advised me never to make the same error. What was my mistake? I had told the customer that we had produced a product that worked a lot better than what they were currently using, cost them a lot less, and was highly profitable to us. “Never do that”, counseled my colleagues, “they want to believe that, at best, you are breaking even on the deal, otherwise they think they left money on the table.”

The absurdity of these kinds of “negotiations” is that they are highly unprofitable for big companies. The potential savings are usually negligible for the bigger company. Negotiating purchase agreements is expensive for big companies. If they are even in the position of dealing with a small vendor, there must be some compelling business reason for getting whatever the small vendor is selling. This also means that a deal that would be unprofitable, perhaps damaging, to the vendor would put a critical part of the supply chain at risk for the purchaser. But instead of closing the deal and moving on, some purchasing groups in some companies want to grind the vendor down to “show value” to their management or maybe even just out of habit. Sometimes this requires the smaller company to walk away, sometimes to go over the heads of the purchasing group to business units waiting for the product. We’ve probably lost some deals that would have worked out well in the end by refusing to accept onerous terms, but we’ve also walked away from deals that would have been fatally unprofitable. The good thing is that walking away from an overbearing big customer is usually a good starting point for a sales effort aimed at its competitors.



What happens when you do not use enterprise quality technology.

News from last year.
Aug. 26, 2013 11:5primary_SafetyLastStillClock2 a.m. ET

A glitch in time synchronization caused German exchange operator Deutsche BoerseAG DB1.XE -0.16% ‘s Eurex Exchange arm to halt trading for slightly more than an hour early Monday, the latest in a thread of technical issues at global exchanges.

The Eurex derivatives trading market was halted at 0620 GMT, 20 minutes after trading started, “in order to protect the integrity of the market,” Deutsche Boerse said.

“An incorrect time synchronization within the system” triggered the market halt. The problem was solved, pre-trading started at 0720 GMT and, as of 0730 GMT all products were again tradable, the exchange said.

A person close to the matter said the glitch was caused by a problem with the system clock, not extreme data load, noting that current trading volumes are far below previous peaks.

Wall Street Journal

From what we understand, the problem was due to a number of GPS time servers that dropped leap year adjustments, so jumping the time back 36 seconds. Then PTP2 software adjusted server time to jump backward 36 seconds. It’s not the first failure of this sort (or the last). The legacy technology for time synchronization lacks the cross check, failover, and alerting that are built into TimeKeeper. That technology was not designed with enterprise in mind and although it has been heavily modified over time, it is very difficult to engineer a solution to basic architectural mismatch. For example, TimeKeeper is architected to treat all time sources the same, but software like PTPd and NTPd is designed for a specific protocol. So TimeKeeper can use a feed from one protocol to crosscheck another, but that requires a clumsy grafting process in one of the protocol specific time synchronization programs.  As another example, the PTP standard has a completely inappropriate fault-tolerance technique baked into the standard – a technique that is a holdover from the PTP standard origin in device control. The standard was designed for systems with really simple networks and time consumers. A single cable with some welding machines on the end is perhaps a typical intended use case. The idea was that the time sources would advertise how good they were and the consumers would simply pick the one that said it was the best. This is an absurd policy for an enterprise network with super-powerful server computers receiving time packets across complex networks dotted with routers and switches. TimeKeeper was designed to ignore this “best master” protocol for fault tolerance and to utilize the smarts of the consumers to detect problems in the feed and to select among alternatives.



Patents considered harmful, by some.

That’s not how it looks from here but I think part of the muddiness in the software patent argument is the result of arguments that really attack the entire idea of patents but are advanced as being specific to software patents. The whole idea of a patent is that you do foster innovation and competition by “handing out monopolies”. Maybe that idea is wrong, but if you think it is wrong you are likely opposed to patents – period.  Tim Lee and many of the other

Famous anti-innovation patenter.
Famous anti-innovation patenter.

people who dislike software patents confuse the issue by simultaneously claiming that (1) software patents are inherently worse/different than other patents and (2) that software patents choke off innovation because of properties that turn out to apply to all patents. Opposing patents in general, however, is a more radical proposition than opposing software patents – perhaps more a more radical proposition than people feel comfortable making.

My position is that many patents are wrongly granted for “inventions” that are neither novel nor un-obvious and that the system for adjudicating patents is way too slow, error prone, and expensive. But patents themselves serve a useful purpose.  The obvious example is Excel which is effectively a monopoly without the benefit of any patents at all. The work of the innovators was, without patent protection, rapidly copied by companies with better marketing and greater resources. Innovation stopped. End of story.

And “business method” patents, in general, are not really software or computer patents at all. Usually they are efforts to patent a well known method of doing business adding some technology to the mix to buttress a claim of novelty. One could have similarly claimed a hundred years ago that making sales calls via telephone was an invention or that delivering adverts by TV instead of by radio was an invention.