Formal methods for doing what?

John Regehr’s question below gets to one of the basic problems I see in the field of “formal methods” - the general failure of researchers in the field to look at experimental data and the operation of actual systems. The normal operation of science in, say, physics or biology or even analysis of algorithms, involves interplay between the development of theory and practice. Theoretical developments that conflict with experimental data are considered to be problems for the theory, and strenuous efforts are made to fix the theory. “Formal methodists” have not, in general, been willing to dig into the data enough to follow this approach. The question of “liveness” provides a good example.

The informal concepts of “liveness” and “safety” seem interesting for understanding programs. The usual definitions [e.g. from Manna and Pnueli’s “Temporal Verification of Concurrent Programs”] are something like:

  • A liveness property says some assertion about system state is eventually true – for example, a program terminates or a program advances.
  • A safety property says some assertion about system state is never true – for example, the operating system deadlocks.
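These two definitions can be made concrete over finite execution traces. A minimal sketch in Python, where a trace is a list of state snapshots and the state fields (`deadlocked`, `terminated`) are hypothetical names chosen for illustration:

```python
# Check a safety property and a liveness property over a finite
# execution trace. A state here is just a dict of named booleans.

def safety_holds(trace, bad):
    """Safety: the assertion `bad` is never true in any state."""
    return not any(bad(s) for s in trace)

def liveness_holds(trace, goal):
    """Liveness: the assertion `goal` is eventually true in some state."""
    return any(goal(s) for s in trace)

trace = [
    {"deadlocked": False, "terminated": False},
    {"deadlocked": False, "terminated": False},
    {"deadlocked": False, "terminated": True},
]

print(safety_holds(trace, lambda s: s["deadlocked"]))    # True: never deadlocks
print(liveness_holds(trace, lambda s: s["terminated"]))  # True: eventually terminates
```

Note that over a finite trace the two checks are symmetric and unremarkable; the controversy discussed below is entirely about whether the traces should be allowed to be infinite.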

These are possibly useful concepts with some intuitive appeal and we should be able to represent these intuitive concepts mathematically. For me, the next step is to look in detail at the data and to try to formalize what engineers find pertinent about these properties in actual systems. But 25 years of follow-on research (M&P’s paper is 1981) involves all sorts of wacky but clever and technically difficult attempts to find a mathematical interpretation in which “liveness” is a property of infinite state sequences. For example, there is an influential paper by Alpern & Schneider (the tech report is 1986) which uses Büchi automata and infinite state sequences to model liveness properties. Motivation from empirical data is entirely absent from that work. I can’t imagine a “liveness” property of an operating system that has no bound. There are properties where we don’t really need to know the bound or where the bound depends on some partially known factors (e.g. how much memory will be available), but there is still a bound. When programmers say that every process makes progress, the natural liveness property of a time-shared system, they mean “eventually” to be within some time frame. Scheduling processes in the limit is not an interesting property of an operating system. Perhaps there are infinite state properties of interest, but I have not seen them described in the literature.

If a mathematical method involves counter-intuitive concepts or strange infinities or recourse to non-Euclidean geometries we don’t normally consider part of the operation of programs, but it helps solve a problem, then it would be justifiable to use it. Nobody runs algorithms in the limit, but analyzing them in the limit gives us a rough idea of performance and, in many cases, a good intuition about how real implementations may perform. Here’s Knuth:

The O-notation is a big help in approximation work, since it describes briefly a concept that occurs often and it suppresses detailed information that is usually irrelevant. Furthermore it can be manipulated algebraically in familiar ways … [Volume 1, 3rd edition, page 108]

We look in vain for credible arguments of this sort in the formal methods literature. Instead, all sorts of exotic concepts, primarily from meta-mathematics, are brought in without justification from the field of study – because formal methods became its own field of study, only superficially related to actual computer engineering.

Real computer systems are finite state. Either the system can loop without ever satisfying the assertion or it cannot and must enter a state where the assertion is true. (And calling a loop an “infinite path” only makes something simple into something complicated.) Boundedness is a property of the actual objects we are attempting to model. A reasonable argument for using Büchi automata or well-founded sets or lattice-theoretical methods would need to show how some properties of software systems that are important to engineers could be represented with suppression of “detailed information that is usually irrelevant” and advantages in algebraic manipulation. Instead, formal methods researchers generally justify these methods by the simplicity of the mathematics needed to formalize the concepts. The argument that some unbounded notion of eventuality can be captured by a certain simplified but complete logical construct that only needs 5 axioms and 3 deductive rules may be an interesting exercise, but it is a result about formal logic, not a result about computer science – unless that notion of eventuality can be shown to be important in actual systems design.
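The finite-state point can be made algorithmic. For a finite transition system, “the assertion eventually holds on every run” fails exactly when a cycle of non-assertion states is reachable through non-assertion states only – ordinary graph search settles the question. A sketch, using a hypothetical three-state system:

```python
# Decide "on every infinite run, `goal` eventually holds" for a finite
# transition system. This fails iff a cycle of non-goal states is
# reachable from the initial state through non-goal states only.

def eventually_on_all_runs(states, succ, init, goal):
    # Restrict attention to non-goal states reachable from init.
    sub = {s for s in states if not goal(s)}
    reach, stack = set(), [init] if init in sub else []
    while stack:
        s = stack.pop()
        if s in reach:
            continue
        reach.add(s)
        stack.extend(t for t in succ(s) if t in sub)
    # DFS cycle detection within the reachable non-goal subgraph.
    color = {}  # 0 = on the current DFS path, 1 = finished
    def has_cycle(s):
        color[s] = 0
        for t in succ(s):
            if t not in reach:
                continue
            if color.get(t) == 0 or (t not in color and has_cycle(t)):
                return True  # back edge: a loop avoiding the goal
        color[s] = 1
        return False
    return not any(has_cycle(s) for s in reach if s not in color)

# Hypothetical system: 0 -> 1 -> 2, where 2 is the goal state.
succ = {0: [1], 1: [2], 2: [2]}.__getitem__
print(eventually_on_all_runs({0, 1, 2}, succ, 0, lambda s: s == 2))  # True
```

With a two-state loop `{0: [1], 1: [0]}` and the same goal, the function returns `False`: the loop is exactly the “can loop without ever satisfying the assertion” case, and no infinite-sequence machinery was needed to detect it.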

So when I looked at liveness properties of scheduling in the second chapter, the implicit bound was one of the primary issues to analyze, and it is interesting to examine what happens when our high level constraint that “all processes must eventually progress” comes in contact with resource constraints that cost microseconds. Liveness properties remain usefully distinct from safety properties even if we discard unbounded liveness. “Every process eventually makes progress” is a perfectly sensible abstraction of one property of a time-shared operating system and it is different from safety properties like “the system never deadlocks”. We can use the bound to transform liveness into safety – “there is no reachable state where, for any process P, more than T nanoseconds have passed since P last executed, for some T that we do not specify now.” But I think this just reflects the way systems actually work and it is useful for designers to understand that there may be underlying real-time constraints even for time-shared processors. In fact, I believe that many of the most difficult problems of practical system design (e.g. the “fork bomb” and “OOM” in UNIX systems) are harder to solve because our design methods “suppress detailed information that is usually irrelevant” at the high level, but do not properly unsuppress this information when we need it. [This last paragraph may be particularly obscure, but I will try to clarify it more later]
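The liveness-to-safety transformation above can be written as a runtime monitor over a timed trace. In this sketch a trace is a list of `(timestamp_ns, running_pid)` samples; the trace shape, the process names, and the particular bound `T` are all assumptions for illustration:

```python
# Bounded liveness as safety: reject any reachable state where more
# than T nanoseconds have passed since some process last executed.

def progress_bound_holds(trace, processes, T):
    """trace: list of (timestamp_ns, running_pid), sorted by time."""
    last_ran = {p: trace[0][0] for p in processes}  # assume all fresh at start
    for t, pid in trace:
        for p in processes:
            if t - last_ran[p] > T:
                return False  # safety violation: p starved past the bound
        last_ran[pid] = t
    return True

trace = [(0, "A"), (10, "B"), (20, "A"), (30, "B"), (45, "A")]
print(progress_bound_holds(trace, {"A", "B"}, T=25))  # True
print(progress_bound_holds(trace, {"A", "B"}, T=10))  # False: a gap exceeds 10 ns
```

Nothing here requires specifying `T` in the high-level design; the monitor simply makes the implicit real-time constraint visible instead of leaving it suppressed.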

Operating system design and specification: Part 1.

[Note: This is the first in a series of “chapters”. I’ll be revising as I see errors and in response to comments. As usual, this material is copyright Victor Yodaiken and rights are given to make, but not sell, verbatim copies only (in addition to fair use rights) ]

Update 1: Evening June 6.

Update 2: Morning June 22 Link to chapter 2.

Even the most obvious and central properties of operating systems are poorly understood and are widely assumed to be out of the reach of the kinds of mathematical approaches used in other engineering disciplines. Design specifications for operating systems tend to feature API definitions, but only the vaguest gestures to describe scheduling properties, timing, I/O capacity, and what operations do (semantics). The first part of this note develops some “informal” (English) specifications of interesting properties and shows why these would help but don’t provide the right level of precision. The second part will introduce some mathematical tools for describing operating systems and their components in terms of state machines. The mathematics is not the usual “formal method” symbol-fest and the prerequisites are just familiarity with recursive functions and state machines.

The necessity for some mathematical approach is illustrated by the vagueness of security specifications, which often reduce to “more secure” or “very secure”. As a result, software security, in practice, is obtained by adding mechanisms and hoping. But this is the general condition. It would be remarkable if we had solid security specifications, given that supposedly simple properties, like the desired behavior of a scheduler, are elusive. General purpose operating systems are often assumed to provide “live” scheduling – so that every process eventually makes progress. Here’s an attempt at sketching what this means:

Live1: There is some N and some E so that for every process P, if P is ready to run at time T, then by time T+N nanoseconds, P will have either executed E additional instructions or P will have executed more than 0 additional instructions and requested an I/O operation which caused it to suspend.

Live1 is a useful property: assuring that the background page cleaning process and the process that checks for new hardware will eventually run even if the main database application is under heavy load. But many constraints on system design follow from this requirement. For example, we had better never have a situation where both P and Q have been waiting to run for N – NanosecondsToCompute(E) nanoseconds unless the OS somehow knows that both will have time to execute enough instructions to request an I/O operation. As usual in computer science, people spend endless hours trying to optimize the performance of schedulers, and little time trying to understand exactly what it is they are supposed to do. The result is a mishmash of conflicting local optimizations.
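Live1 can be turned into a checkable predicate over a finite execution log. A sketch under invented assumptions: each log entry is a hypothetical `(time_ns, pid, instructions, suspended_on_io)` record of what a process did each time it ran, and `ready_times` lists when each process became ready:

```python
# Check Live1 over a finite log: whenever process P becomes ready at
# time T, then within N ns it must either execute E instructions, or
# execute at least one instruction and suspend on an I/O request.
# Log entries: (time_ns, pid, instructions_run, suspended_on_io).

def live1_holds(ready_times, log, N, E):
    for pid, T in ready_times:
        done = 0
        for time_ns, p, instrs, io in log:
            if p != pid or time_ns < T or time_ns > T + N:
                continue  # outside this process's [T, T+N] window
            done += instrs
            if done >= E or (done > 0 and io):
                break  # Live1 satisfied for this ready event
        else:
            return False  # window closed without satisfying Live1
    return True

log = [(5, "P", 3, False), (8, "Q", 1, True), (12, "P", 4, False)]
print(live1_holds([("P", 0), ("Q", 0)], log, N=20, E=5))   # True
print(live1_holds([("P", 0), ("Q", 0)], log, N=20, E=50))  # False
```

Writing the check forces the constants N and E into the open, which is exactly where the design constraints in the paragraph above come from.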

As I’ve noted elsewhere, one common conflict is due to attempts to graft live schedulers to real-time schedulers that aim at minimizing the delay for the highest priority process.

Real1: There is some R so that the highest priority runnable process is never delayed more than R nanoseconds.

Real1 is much too vague – for example, it allows the system to run the highest priority task for one instruction and then delay R nanoseconds, repeatedly – but it still causes difficulties. Real1 has a hard time with Live1. What happens if low priority process P is assured run-time by Live1 but high priority process Q is assured run-time by Real1 and there is only one processor? It would be nice if there were rules that could help deduce conflicts between Live1 and Real1, just as we use rules to show that there is no integer solution to (a > b AND b > a). But even at this level of specification we can see that there may be a problem.
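The single-processor conflict can be exhibited directly. A hypothetical simulation: a strict-priority scheduler is one simple way to satisfy Real1, and if the high-priority process Q never blocks, the low-priority process P never runs, so no bound N can make Live1 true. Process names and the tick count are invented:

```python
# One CPU, strict-priority scheduling (a simple way to satisfy Real1):
# if Q (high priority) is always runnable, P starves and Live1 fails
# for every choice of N.

def pick(runnable, priority):
    """Real1-style choice: always run the highest-priority runnable process."""
    return max(runnable, key=priority)

executed = {"P": 0, "Q": 0}
priority = {"P": 1, "Q": 2}.__getitem__
for _ in range(1000):  # Q never blocks, so both stay runnable every tick
    executed[pick({"P", "Q"}, priority)] += 1

print(executed)  # {'P': 0, 'Q': 1000}: P makes no progress at all
```

Any scheduler satisfying both properties must either bound Q’s demand or give up one of the guarantees; the simulation just makes the (a > b AND b > a) flavor of the conflict concrete.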

Consider something a little more complex: what is involved in process switching. Of course, process switching is something we have been doing for 50 years, at least. At a high level, the state of the switched out process must be saved so that when or if it is eventually resumed it will start in exactly the same place – sort of. The “sort of” is there because the contents of process memory pages can change while the process is suspended and this is the common situation when the process suspends waiting for input. Even when the process is suspended waiting for output, the process I/O structures will change to mark the new write location pointer (seek pointer). Process state must be defined in such a way as to ignore the movement of pages to backing store from memory. And we need to account for shared data pages and files, asynchronous I/O, blocked signals that remain pending, and more.

ContextSave: If P is active at time T and the next instruction completes saving the state of P, then a resume of P at time T+K will restore the processor state so that the next instruction will appear to be the completion of the instruction at time T, and any changes to the P process state in the time between T and T+K should be legal suspend state transformations of P’s state.

The ContextSave specification is going to need a lot of fleshing out to mean anything solid. We have to define what “process state” is and what legal suspend state transformations are, and we might need some thought about instruction execution as the boundary between saved and unsaved states.
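One way to start the fleshing out is to make “process state” and “legal suspend state transformation” explicit data and explicit predicates. A hypothetical sketch: state is a small record, and the only transformation we permit while suspended is advancing the seek pointer (standing in for the I/O structure changes mentioned above):

```python
from copy import deepcopy

# Hypothetical process state: registers and PC must be restored exactly;
# the seek pointer may legally advance while the process is suspended.

def save(proc):
    """ContextSave: snapshot the full process state."""
    return deepcopy(proc)

def legal_suspend_change(saved, current):
    """Only the seek pointer may change, and only forward."""
    return (current["registers"] == saved["registers"]
            and current["pc"] == saved["pc"]
            and current["seek_ptr"] >= saved["seek_ptr"])

def resume(saved, current):
    """Restore the saved state, keeping any legal change made while suspended."""
    if not legal_suspend_change(saved, current):
        raise RuntimeError("illegal state change while suspended")
    restored = deepcopy(saved)
    restored["seek_ptr"] = current["seek_ptr"]
    return restored

p = {"registers": [1, 2, 3], "pc": 100, "seek_ptr": 4096}
saved = save(p)
p["seek_ptr"] = 8192            # output completed while suspended
q = resume(saved, p)
print(q["pc"], q["seek_ptr"])   # 100 8192
```

The real definitions would need entries for shared pages, pending signals, asynchronous I/O, and the rest; the point of the sketch is only that “legal suspend state transformation” becomes a checkable predicate rather than a phrase.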

In the next installment, I will introduce methods of constructing and composing the gigantic, partially determined, multi-level state machines we need to describe how operating systems work.