The replicated state machine method of fault tolerance from 1980s

The first time I saw this method was when I went to work for Parallel Computer Systems, , later called Auragen, in the famous tech startup center of Englewood Cliffs, New Jersey. I commuted there from the East Village. (True story: I applied for the job after finding an advert in a discarded copy of the NY Times on the floor of a Brooklyn apartment while visiting friends. I sent via US mail a resume typed on a manual typewriter- I’m tempted to  say “composed by the light of a tallow candle” but that would be over the top- and forgot to send the second page. )

The company built a parallel computer based on Motorola 68000s  with a replicated message bus. The bus guaranteed message delivery to 3 destinations would either succeed to all three or fail to all three. This property is called “reliable broadcast”.  All interprocess communication was by message transfer (a fashionable idea at the time). Each process had a backup.  Whenever a primary process sent a message, the message was also delivered to the backup and to the destination backup. If the primary failed, the backup could be run. The backup would have a queue of messages received by the primary and a count of messages sent by the primary.  When the recovering backup tried to transmit a message, if the count was greater than zero, the count would be decremented and the message discarded because it has already been transmitted by the old primary. When the recovering secondary did a receive operation, if there was a message on the input queue, it would get that message.  In this way, the recovering backup would repeat the operations of the primary until it caught up. As an optimization, the primary could be periodically checkpointed and queues of duplicated messages could be discarded.

The operating system was an implementation of UNIX. In practice, it was discovered that making each UNIX system call into a message exchange, which was an idea advocated in the OS research community at the time, caused serious performance problems.  The replicated state machine operation depended on this design  in order to make the state machine operation deterministic. Suppose the primary requested, for example,  the time and then made a decision based on the time.  A recovering secondary would need exactly the same time to guarantee that it produced the same results as the primary. So every interaction between application and OS needed to be recorded in a message exchange.  But a message exchange is nowhere near as fast as a system call (unless the OS developers are horrible).

The performance issue was mitigated by some clever engineering, but  was a problem that was discovered in parallel by a number of development teams working on distributed OS designs and micro-kernels which were in vogue at the time. Execution of “ls -l” was particularly interesting.

Anyways, here’s the description from the patent.

To accomplish this object, the invention contemplates that instead of keeping the backup or secondary task exactly up to date, the backup is kept nearly up to date but is provided with all information necessary to bring itself up to the state of the primary task should there by a failure of the primary task. The inventive concept is based on the notion that if two tasks start out in identical states and are given identical input information, they will perform identically.

In particular, all inputs to a process running on a system according to the invention are provided via messages. Therefore, all messages sent to the primary task must be made available to the secondary or backup task so that upon failure of the primary task the secondary task catches up by recomputing based on the messages. In essence, then, this is accomplished by allowing every backup task to “listen in on” its primary’s message.

United States Patent 4,590,554 Glazer ,   et al.May 20, 1986

Inventors: Glazer; Sam D. (New York, NY), Baumbach; James (Brooklyn, NY), Borg; Anita (New York, NY), Wittels; Emanuel (Englewood Cliffs, NJ)
Assignee: Parallel Computers Systems, Inc. (Fort Lee, NJ)
Family ID: 23762790
Appl. No.: 06/443,937
Filed: November 23, 1982

See also: A message system supporting fault tolerance.

and a very similar later patent.

The Auragen file system.

This article on the interesting Wave Transactional File System inspired me to look up an earlier file system that also used copy on write semantics.


Anita Borg, Wolfgang Blau, Wolfgang Graetsch, Ferdinand Herrmann, and Wolfgang Oberle. 1989. Fault tolerance under UNIX. ACM Trans. Comput. Syst. 7, 1 (January 1989), 1-24. DOI=


4.3 Availability of the File System
Since a recovering file server reconstructs its buffers by reading blocks from the file system, the file system in the state as of the last sync must be available. The existence of that version of the file system is also necessary during recovery as the file server redoes requests. For example, if a file has been deleted since sync and a read request is reissued, the disk driver, and thus the recovering file server, will behave differently than the primary. Unfortunately, the contents of the disk can change between syncs, at least during the Fsync that constitutes the first phase of the sync operation.

The solution is to use a copy-on-write strategy between syncs, rather than overwriting existing blocks. Logically this corresponds to keeping two versions of a file system.3 An early version of the file system organization described here is discussed in Arnow [ 11].

There are two root nodes on disk. At any given time one of them is valid for recovery. We refer to the other as the alternate root. Associated with each root is state information (the state tables described above), the most recent being that associated with the currently valid root. Changes to the file system are done relative to a copy of the valid root kept in memory in the primary file server’s address space, and in a nondestructive manner, as seen in Figure 2(a-d). Freed blocks, which contain the old data, are added to a semi-free list, and cannot be reallocated until after the next sync. Therefore, the unmodified file system still exists rooted in the valid on-disk root node.

If a crash occurs at any time between syncs, the recovering file server is able to determine which root to use because of information sent on the primary’s last sync. It reads in the correct state information and reconstructs its buffers accordingly. Disk blocks that were used by the primary since the last sync appear to it as free blocks.

The difficult case is when a crash occurs during a sync. To see that the solution works in this case, consider the sequence of actions that take place during a sync. First, all dirty blocks except the root are written to disk, and old blocks are added to the semi-free list. Second, the state information is collected and written to the alternate state area. Third, the in-memory root is written to the alternate on disk root block, Finally, the sync message is constructed and sent to the backup. It contains the information necessary to update message queues as well as specifying which on-disk state information and root block to use on recovery.

Once the sync message has been sent, the semi-free list is added to the free list and the primary continues. Just before the sync message is sent, there are two copies of every modified data and indirect block. At any time before the sync message is sent, the old consistent state is available. Any time after it is sent, the new state and file system will be used and message queues consistently updated. An additional benefit of this organization is that the file system as a whole is considerably more robust than a standard UNIXstyle file system. Even if the entire system is shut down in an uncontrolled way as the result of multiple faults or operator error, there will always be an entire consistent file system on disk.

Patent 5,995,745

Adding real-time support to general purpose operating systems

AbstractA general purpose computer operating system is run using a real time operating system. A real time operating system is provided for running real time tasks. A general purpose operating system is provided as one of the real time tasks. The general purpose operating system is preempted as needed for the real time tasks and is prevented from blocking preemption of the non-real time tasks.



What is claimed is:

  1. A process for running a general purpose computer operating system using a real time operating system, including the steps of:

providing a real time operating system for running real time tasks and components and non-real time tasks;

providing a general purpose operating system as one of the non-real time tasks;

preempting the general purpose operating system as needed for the real time tasks; and

preventing the general purpose operating system from blocking preemption of the non-real time tasks.

Continue reading “Patent 5,995,745”

RTLinux early paper from 20 years ago

An early RTLinux paper with Michael Barabanov.

Only 20 years later, the idea is now not too scary.


Real-Time Linux
Michael Barabanov Victor Yodaiken March 3, 1996

1 Introduction

If you wanted to control a camera or a robot or a scientific instrument from a PC, it would be natural to think of using Linux so that you could take advantage of the development environment, X-windows, and all the networking support. But, Linux cannot reliably run these kinds of hard real-time applications. A simple experiment will illustrate the problem. Take a speaker and hook it up to one of the pins from the parallel port. Then run a program that toggles the pin. If your program is the only one running, the speaker will produce a nice somewhat steady tone. Not completely steady, but not bad. When Linux updates the file system every couple of seconds, you might notice a small change in tone. If you move the mouse over a couple of windows the tone becomes irregular. If you start netscape in one of the windows, you will hear intervals of silence as your program waits for higher priority processes to run.

The problem is that Linux, like most general purpose operating systems, is designed to optimize average performance and to try to give every process a fair share of compute time. This is great for general purpose computing, but for real-time programming precise timing and predictable performance is more important than average performance. For example, if a camera fills a buffer every millisecond, then a momentary delay in the process reading that buffer may cause data loss. If a stepper motor in a lithography machine must be turned on and off in precise intervals in order to minimize vibration and to move a wafer into position at the correct time a momentary delay may cause an unrecoverable failure. And consider what might happen if the task that causes an emergency shutdown of a chemistry experiment must wait to run until Netscape redraws the window. It turns out that redesigning Linux to provide guaranteed performance would take an enormous amount of work. And taking on such a job would defeat our original purpose. Instead of having an off the shelf general purpose OS, we would have a custom made special purpose OS that would not be riding the wave of the main Linux development effort. So what we did was slip a small, simple, real-time operating system underneath Linux. Linux becomes a task that runs
only when there is no real-time task to run and we pre-empt Linux whenever a real-time task needs the processor. The changes needed in Linux itself are pretty minimal. Linux is mostly unaware of the real-time operating system as it goes about its business of running processes, catching interrupts, and controlling devices. But real-time tasks can run to a quite high level of precision. In our test P120 system, we can schedule tasks to run within a precision of about 20 microseconds.

And more ….

What does the UNIX file system do?

Unix, Linux, Windows and other operating systems and the world wide web all support file systems with the familiar path file names  like

 or "/system/passwords/secret/dontread.txt"

although sometimes with different separator characters between the individual “flat” file names. For example, Windows uses “”. As long as we know how to separate flat file names in the sequence, it doesn’t matter. The flat file names are chained together in a path through the file system that shows “where” a file can be found. URL’s in the world wide web are just path file names with some more information around them.  Constructing the file system involves a clever technique for embedding a tree in a simpler file system where file names are just numbers.

For historical reasons, the base file system uses numbers called “inode numbers” to name files.  Ignoring modifications, this file system looks like a function F:InodeNumbers → FileData. The tree emerges from information stored in some of the files. FileData includes some files that are just data and some files that are maps called “directories” (or “folders”). Directory maps have the form d: SimpleFileNames → InodeNumbers.  If we have a path file name “a/b/c” and a starting inode number i, we can first get d1 = F(i), the contents of file i which should be a directory, then get ia= d1(a) the inode number of the file named a, and then da= F(ia) and  ib= da(b) and db= F(ib) and  ic= db(c)  and then the contents of the file “a/b/c” is F(ic ) – assuming that the path is defined.  More concisely, we can write  ia= F(i)(a) and  ib= F(ia)(b) and so on where functions are resolved left to right: for example, F(i)  is a map which is then applied to a. 

Computing the translation of a path file name to an inode number can be defined recursively in terms of a function usually called namei (for names to inode numbers).  If the path file name is the empty path, then we are already where it leads: namei(i,Empty) = i.  If the path file name is not empty, it has the form a/p  where  is a simple file name (of any length) and  is a path file name with one less simple file name in it than the original path: namei(i,a/p) = namei(F(i)(a),p) It’s possible that namei(i,p) is not defined – for example, F(i) might not even be a directory function or it might be one but d=F(i) might not be defined on the leftmost simple file name in the path. In that case, we have “file not found” or “404” in the case of a URL.

A UNIX type file name has a special inode number for the “root” directory.  For any path  file contents is then U(p)= F(namei(root,p)).  A consistent file system will have at least the following properties.

  1. No orphans. For every  in InodeNumbers,  if F(i)  is defined there must be a path  so that namei(root, p) = i. 
  2. No dangling references. For every so that F(i)  is a directory function and for ever simple file name so that F(i)(a)  is defined, F( (F(i))(a)) must also be defined (that is, if F(i)=d  and d(a)=j  it must be the case that  F(j) is defined.)  

Another useful property limits cycles or loops through the file system and aliases. If U(p)  is a directory, let Children(p) = {a: U(p)(a) is defined} where  is a variable over flat file names.  Then define find(p) = {p} if is not a directory or Children(p) = emptyset and define find(p) = union{find(pa): a in Children(p)} . If there are no loops, this is a well defined function that terminates with the set of leaf nodes reachable from p. For example if one were in an organization concerned about security, there might be regular monitoring of find(/home/snowden) to see if any unauthorized data had been collected.

The most stringent non-alias requirement would be that if namei(root,p) = namei(root,q) then p=q. There can be no loops if there are no aliases. This requirement is usually relaxed to accommodate the “parent” and “self” pseudo file names, and hard and soft links. The simple file name “.” is usually reserved to mean “self” so that if F(i)  is a directory F(i)(“.”) = i. The pseudo-file-name “..” is used for “parent” so that if F(i)(a)=j  and F(j) is also a directory, then F(j)(“..”) = i. These pseudo-file names introduce both loops and aliases so we could just limit the requirement for no aliases to the cases where and  don’t contain any pseudo-file-names. Note that the definition of the parent pseudo-file-name limits many kinds of loops because it cannot be that a directory points back at two different parents.

Soft links, a later addition to UNIX files, are a more complex problem. For soft links we add file contents that are path file names and modify namei  so that if j=F(i)(a)  is a soft link with F(j)=q, then namei(i,a/p) = namei(root,concat(q,p)). The original definition of namei has a nice property that the path shrinks by one flat name at every step and this change loses that property and makes it easy to create loops that never finish. The solution to that is to count soft links and just give up if a path takes us to more than some set limit number of soft links.