The replicated state machine method of fault tolerance from the 1980s

The first time I saw this method was when I went to work for Parallel Computer Systems, later called Auragen, in the famous tech startup center of Englewood Cliffs, New Jersey. I commuted there from the East Village. (True story: I applied for the job after finding an advert in a discarded copy of the NY Times on the floor of a Brooklyn apartment while visiting friends. I sent via US mail a resume typed on a manual typewriter - I’m tempted to say “composed by the light of a tallow candle” but that would be over the top - and forgot to send the second page.)

The company built a parallel computer based on Motorola 68000s with a replicated message bus. The bus guaranteed that message delivery to 3 destinations would either succeed to all three or fail to all three. This property is called “reliable broadcast”. All interprocess communication was by message transfer (a fashionable idea at the time). Each process had a backup. Whenever a primary process sent a message, the message was also delivered to the sender’s backup and to the destination’s backup. If the primary failed, the backup could be run. The backup would have a queue of messages received by the primary and a count of messages sent by the primary. When the recovering backup tried to transmit a message, if the count was greater than zero, the count would be decremented and the message discarded because it had already been transmitted by the old primary. When the recovering backup did a receive operation, if there was a message on the input queue, it would get that message. In this way, the recovering backup would repeat the operations of the primary until it caught up. As an optimization, the primary could be periodically checkpointed and the queues of duplicated messages could be discarded.
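Here is a minimal sketch, in Python and with invented names (an illustration of the scheme described above, not the Auragen code), of that bookkeeping: the backup queues copies of its primary’s received messages, counts its primary’s sends, and on recovery suppresses already-sent messages and replays already-received ones until it catches up.

    from collections import deque

    class BackupTask:
        """Illustrative sketch of the backup bookkeeping (invented names)."""

        def __init__(self):
            self.input_queue = deque()  # copies of messages the primary received
            self.sent_count = 0         # number of messages the primary sent

        # While the primary is alive, the reliable broadcast delivers copies here.
        def on_primary_received(self, msg):
            self.input_queue.append(msg)

        def on_primary_sent(self, msg):
            self.sent_count += 1

        def checkpoint(self):
            # After the primary is checkpointed, the accumulated replay state is dropped.
            self.input_queue.clear()
            self.sent_count = 0

        # After failover, the backup re-executes the primary's code through these calls.
        def send(self, msg, transmit):
            if self.sent_count > 0:
                self.sent_count -= 1    # the old primary already sent this one: discard
            else:
                transmit(msg)           # caught up: transmit for real

        def receive(self, wait_for_new_message):
            if self.input_queue:
                return self.input_queue.popleft()  # replay what the primary consumed
            return wait_for_new_message()          # caught up: take fresh input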

The operating system was an implementation of UNIX. In practice, it was discovered that turning each UNIX system call into a message exchange, an idea advocated in the OS research community at the time, caused serious performance problems. The replicated state machine method depended on this design to make execution deterministic. Suppose the primary requested, for example, the time and then made a decision based on the time. A recovering backup would need exactly the same time to guarantee that it produced the same results as the primary. So every interaction between application and OS needed to be recorded in a message exchange. But a message exchange is nowhere near as fast as a system call (unless the OS developers are horrible).
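To make the determinism problem concrete, here is a hypothetical sketch (invented names again, not the actual Auragen design) of how even a request for the time becomes a recorded message exchange: the primary’s clock reading is forwarded to its backup, and a recovering backup replays that value instead of reading its own clock, so both take the same branch.

    import time
    from collections import deque

    class ReplayClock:
        def __init__(self, log_to_backup=None, replayed_values=None):
            self.log_to_backup = log_to_backup                 # primary: forward readings
            self.replay_queue = deque(replayed_values or [])   # backup: primary's readings

        def now(self):
            if self.replay_queue:
                return self.replay_queue.popleft()  # recovering backup: replay
            t = time.time()                         # primary: real system call...
            if self.log_to_backup:
                self.log_to_backup(t)               # ...turned into a message exchange
            return t

    # Primary and recovering backup make the identical decision:
    def timeout_expired(clock, deadline):
        return clock.now() > deadline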

The performance issue was mitigated by some clever engineering, but it was a problem discovered in parallel by a number of development teams working on distributed OS designs and micro-kernels, which were in vogue at the time. Execution of “ls -l” was particularly interesting.

Anyways, here’s the description from the patent.

To accomplish this object, the invention contemplates that instead of keeping the backup or secondary task exactly up to date, the backup is kept nearly up to date but is provided with all information necessary to bring itself up to the state of the primary task should there be a failure of the primary task. The inventive concept is based on the notion that if two tasks start out in identical states and are given identical input information, they will perform identically.

In particular, all inputs to a process running on a system according to the invention are provided via messages. Therefore, all messages sent to the primary task must be made available to the secondary or backup task so that upon failure of the primary task the secondary task catches up by recomputing based on the messages. In essence, then, this is accomplished by allowing every backup task to “listen in on” its primary’s message.

United States Patent 4,590,554, Glazer et al., May 20, 1986

Inventors: Glazer; Sam D. (New York, NY), Baumbach; James (Brooklyn, NY), Borg; Anita (New York, NY), Wittels; Emanuel (Englewood Cliffs, NJ)
Assignee: Parallel Computers Systems, Inc. (Fort Lee, NJ)
Family ID: 23762790
Appl. No.: 06/443,937
Filed: November 23, 1982

See also: A message system supporting fault tolerance.

and a very similar later patent.


Cutting and pasting about Kodak's demise

This graph from Peter Diamandis about how Kodak entered the sinkhole is kind of amazing. Diamandis explains Kodak’s failure to swerve in what is, I think, the orthodox Silicon Valley analysis:

[in 1996] Kodak had a $28 billion market cap and 140,000 employees.

In 1976, 20 years earlier, Kodak had invented the digital camera. They owned the IP and had the first mover advantage. This is a company that should have owned it all.

Instead, in 2012, Kodak filed for bankruptcy, put out of business by the very technology they had invented.

What happened? Kodak was married to the “paper and chemicals” (film development) business… their most profitable division, while the R&D on digital cameras was a cost center. They saw the digital world coming on, but were convinced that digital cameras wouldn’t have traction outside of the professional market. They certainly had the expertise to design and build consumer digital cameras — Kodak actually built the Apple QuickTake (see photo), generally considered the world’s first consumer digital camera. Amazingly, Kodak decided they didn’t even want to put their name on the camera.

There is more of the same (2012 and before) in the MIT Technology Review. This is a totally convincing story (it had me convinced), but it leaves out three things:

  1. the boring old chemicals division of Eastman Kodak, which was spun off in 1993 (three years earlier), is still around, profitable ($10B/year revenue), and dwarfs what’s left of the original company;
  2. Fujifilm, Kodak’s also-ran in the chemical film space, managed to leap over the sinkhole and prosper, but not by relying on digital cameras;
  3. Kodak did briefly become a market leader in digital cameras but ran into a more fundamental problem.

Back in 1996, the business press and analysts thought Kodak was doing the right thing by divesting its chemical business.

NEW YORK — Eastman Kodak Co., struggling against poor profit and high debt, Tuesday took a big step in its corporate restructuring, announcing that it will divest Eastman Chemical Co. and in one fell swoop wipe out $2 billion of debt.

Such a spinoff would not have occurred just a few years ago, analysts said, and the move signals that Chief Executive Kay R. Whitmore is responding to new, tougher markets and stockholder pressure to improve financial results quickly.

“They are now recognizing that they are not a growth company, that they must go through this downsizing,” analyst Eugene Glazer of Dean Witter Reynolds said in an interview on CNBC.

Kodak’s shares, up sharply Monday in anticipation of the announcement, ended down $1.375 to $52.375 on the New York Stock Exchange.

[…] “We determined that there was little strategic reason related to our core imaging and health business for Kodak to continue to own Eastman,” Whitmore said at a news conference.

Kodak, best known for photography products but also a major pharmaceutical and chemicals group, has endured slow growth for years. Its photography business has been hit by changing demographics, foreign rivals and new technologies such as camcorders. Whitmore said Kodak sales, especially in photography and imaging, were weak.

[…] Costs will be reduced elsewhere in the company, Whitmore said. Other executives also have said Kodak will cut spending on research and development of new products.

In retrospect, dumping the cash-generating parts of the business and cutting R&D was not the best plan, even if Wall St. analysts loved the idea. But it’s easy to be a genius after the fact, as Willy Shih points out:

Responding to recommendations from management experts, from the mid-1990s to 2003 the company set up a separate division (which I ran) charged with tackling the digital opportunity. Not constrained by any legacy assets or practices, the new division was able to build a leading market share position in digital cameras — a position that was essentially decimated soon thereafter when smartphones with built-in cameras overtook the market.

Yes, those camera phones – which not too many people saw coming in the 1990s. Not only that, but Kodak’s path in digital imaging was not obvious.

The transition from analog to digital imaging brought several challenges. First, digital imaging was based on a general-purpose semiconductor technology platform that had nothing to do with film manufacturing — it had its own scale and learning curves. The broad applicability of the technology platform meant that it could be scaled up in numerous high-volume markets (such as microprocessors, logic circuits, and communications chips) apart from digital imaging. Suppliers selling components offered the technology to anyone who would pay, and there were few entry barriers. What’s more, digital technology is modular. A good engineer could buy all the building blocks and put together a camera. These building blocks abstracted almost all the technology required, so you no longer needed a lot of experience and specialized skills.

Semiconductor technology was well outside of Kodak’s core know-how and organizational capabilities. Even though the company invested lots of money in the basic research and manufacturing of solid-state semiconductor image sensors and developed some notable inventions (including the color filter array that is used on virtually every color image sensor), it had little hope of being a competitive volume supplier of image sensor components, and it was difficult for Kodak to offer something distinctive.

And Shih, perhaps unintentionally, reinforces Diamandis’s point that the top company managers failed to face up to the problem.

For many managers of legacy businesses, the survival instinct kicked in. Some who had worked at Kodak for decades felt they were entitled to be reassigned to the new businesses, or wished to control sales channels for digital products. But that just fueled internal strife. Kodak ended up merging the consumer digital, professional, and legacy consumer film divisions in 2003. Kodak then tried to make inroads in the inkjet printing business, spending heavily to compete with fortified incumbents such as HP, Canon, and Epson. But the effort failed, and Kodak exited the printer business after it filed for Chapter 11 bankruptcy reorganization in 2012.

Management chaos and “spending heavily to compete with fortified incumbents”.

Yep.

With the benefit of hindsight, it’s interesting to ask how Kodak might have been able to achieve a different outcome. One argument is that the company could have tried to compete on capabilities rather than on the markets it was in. This would have meant directing its skills in complex organic chemistry and high-speed coating toward other products involving complex materials — a path followed successfully by Fuji. However, this would have meant walking away from a great consumer franchise. That’s not the logic that managers learn at business schools, and it would have been a hard pill for Kodak leaders to swallow.

it would have been a hard pill for Kodak leaders to swallow.

But wasn’t that their job? So to conclude this exercise in cut and paste, what about Fuji? The Economist had an interesting take:

 the digital imaging sector accounts for only about one-fifth of Fujifilm’s revenue, down from more than half a decade ago.

How Fujifilm succeeded serves as a warning to American firms about the danger of trying to take the easy way out: competing through one’s marketing rather than taking the harder route of developing new products and new businesses. […]

Like Kodak, Fujifilm realised in the 1980s that photography would be going digital. Like Kodak, it continued to milk profits from film sales, invested in digital technologies, and tried to diversify into new areas. Like Kodak, the folks in the wildly profitable film division were in control and late to admit that the film business was a lost cause. As late as 2000 Fujifilm counted on a gentle 15 or 20-year decline of film—not the sudden free-fall that took place. Within a decade, film went from 60% of Fujifilm’s profits to basically nothing.

If the market forecast, strategy and internal politics were the same, why the divergent outcomes? The big difference was execution.

Fujifilm realised it needed to develop in-house expertise in the new businesses. In contrast, Kodak seemed to believe that its core strength lay in brand and marketing, and that it could simply partner or buy its way into new industries, such as drugs or chemicals. The problem with this approach was that without in-house expertise, Kodak lacked some key skills: the ability to vet acquisition candidates well, to integrate the companies it had purchased and to negotiate profitable partnerships. “Kodak was so confident about their marketing capability and their brand, that they tried to take the easy way out,” says Mr Komori.

Fujifilm realised it needed to develop in-house expertise in the new businesses.

ok.

 

MiFID II, GPS and UTC time

I have a post up on the FSMLabs web site about the use of GPS and other satellite time for MiFID II timestamp compliance. It’s fascinating how much effort has recently gone into trying to convince people that MiFID II will require time taken directly from a national lab or certified via a national lab, despite the clear wording of the proposed MiFID II regulations. To me, the deal is sealed in the Cost Benefit Analysis, in which the ESMA regulators write:

“The final draft RTS also reduces costs of the initial draft RTS proposed in the CP by allowing UTC disseminated via satellite systems (i.e. GPS receiver or the use of other satellite systems when available)”

That is not a promise one can easily walk away from. ESMA justifies the regulations with a cost/benefit analysis in which the costs of time stamping are limited precisely by permitting the use of GPS time. Of course, legal reasoning and logic are not always the same, but I’m trying to figure out how ESMA regulators could claim that they didn’t mean it, or why they would have such a motivation.


“South Sea Bubble” by Edward Matthew Ward, via Wikimedia Commons (https://commons.wikimedia.org/wiki/File:South_Sea_Bubble.jpg)

MiFID2 and security – keeping track of the money


A shorter version of this post is on the FSMLabs web site. MiFID2 is a new set of regulations for the financial services industry in Europe that includes a much more rigorous approach to timestamps. Timestamps are in many ways the foundation for data integrity in modern processing systems – which are distributed, high speed, and generally gigantic. But when regulations or business or other constraints require timestamps to really work, the issues of fault tolerance and security come up. It doesn’t matter how precise your time distribution is if a mistake or a hacker can easily turn it off or control it.

TimeKeeper incorporates a defense-in-depth design to protect it from deliberate security attacks and from errors due to equipment failure or misconfiguration. This engineering approach was born out of a conviction that precise time synchronization would become a business and regulatory imperative.

  1. Recent disclosures of still more security problems in the NTPd implementation of NTP show how vulnerable time synchronization can be without proper attention to security. PTPd and related implementations of the PTP standard have similar vulnerabilities.
  2. Security and general failure tolerance should be on the minds of firms that are considering how to comply with the MiFID2 rules because time synchronization provides both a broad attack surface and a single point of failure unless properly implemented.

The first step towards non-naive time synchronization is a skeptical attitude on the part of IT managers and developers. Ask the right questions at acquisition and design time to prevent unpleasant surprises later.

One of the most dangerous aspects of the just-disclosed NTPd exploit is that NTPd will accept a message from any random source telling it to stop synchronizing with its actual time sources. Remember, NTPd is an implementation of NTP; other implementations may not suffer from the same flaw. That “d” is easy to overlook, but it’s key. TimeKeeper’s NTP and PTP implementations will, for example, ignore commands that do not come from the associated time source and will apply analytical skepticism to commands that do appear to come from the source. TimeKeeper dismisses many of these types of attacks immediately and will start throwing off alerts to provoke automated and human counter-measures. The strongest protection TimeKeeper offers, however, comes from its multi-source capabilities that allow it to compare multiple time sources in real time and reject a primary source that has strayed.
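As a rough illustration of that multi-source idea (a sketch only, with invented names and thresholds, not TimeKeeper’s actual algorithm), a client can compare the offsets reported by several sources and refuse to follow a primary that disagrees with the consensus of the others:

    from statistics import median

    def select_offset(offsets, max_divergence=0.001):
        """offsets maps source name -> measured clock offset in seconds."""
        consensus = median(offsets.values())
        # Rank sources by how closely they agree with the consensus.
        ranked = sorted(offsets.items(), key=lambda kv: abs(kv[1] - consensus))
        best_name, best_offset = ranked[0]
        if abs(offsets["primary"] - consensus) > max_divergence:
            # The primary has strayed: alert and fall back to the best-agreeing source.
            print(f"ALERT: primary strayed from consensus; following {best_name}")
            return best_offset
        return offsets["primary"]

    # Example: the primary has drifted by 5 ms while GPS and a backup NTP feed agree.
    print(select_offset({"primary": 0.005, "gps": 0.0001, "ntp_backup": 0.0002}))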

Correct time travels a long, complex path from a source such as a GPS receiver or a feed like the one British Telecom is now providing. Among the questions system designers need to ask are the following two.

  1. Is the chain between source and client safeguarded comprehensively and instrumented end-to-end?
  2. Is there a way of cross-checking sources against other sources and rejecting bad sources?

Without positive answers to both of these questions, the time distribution technology is inherently fragile, and robust MiFID2 timestamp compliance will be unavailable.

The painting is “Quentin Massys 001” by Quentin Matsys (1456/1466–1530), from The Yorck Project: 10.000 Meisterwerke der Malerei (DVD-ROM, 2002, ISBN 3936122202), distributed by DIRECTMEDIA Publishing GmbH. Licensed under Public Domain via Wikimedia Commons.

Annals of unintentional irony

I come back from a trip to find Marc Andreessen’s team of Twitter posters complaining wryly about how government is so gosh-durn big and citing experts like Milton Friedman.


That is, Marc Andreessen, the former National Center for Supercomputing Applications programmer who helped write the Mosaic web browser for the WWW that CERN scientists developed on top of the Internet technology that DARPA and NSF researchers and bureaucrats created on top of prior government-funded technology like integrated circuits, and who then built a company that made him rich with the team that worked together at NCSA, engaging in the typical rich-person bitching about the gummint. Which shows how little reality matters to people.

Ikea and RedHat

This is the state of the art for systems software now – which is not all that impressive.

Glantz explained that Ikea has more than 3,500 Red Hat Enterprise Linux (RHEL) servers deployed in Sweden and around the world. With Shellshock, every single one of those servers needed to be patched and updated to limit the risk of exploitation. So how did Ikea patch all those servers? Glantz showed a simple one-line Linux command and then jokingly walked away from the podium stating “That’s it, thanks for coming,” as the audience erupted into boisterous applause. On a more serious note, Glantz said that it took approximately 2.5 hours to test, deploy and upgrade Ikea’s entire IT infrastructure to defend against Shellshock. (eWeek)

Gilbert and Sullivan's innovative business model

From the Financial Times: (and you should buy a subscription)

Piracy is a problem as old as the music industry itself. In Victorian times, it was illicitly copied sheet music that was the avowed enemy of the artist, and the operetta team Gilbert and Sullivan paid toughs to go round London pubs smashing up pianos with sledge hammers whenever they found bootlegged scores.