Saturday, December 26, 2009

Redundancy Considered Harmful

I mentioned Chris W. Johnson in our last blog entry. One paper of Chris's is of particular interest: The Dangers of Interaction with Modular and Self-Healing Avionics Applications: Redundancy Considered Harmful.

"Redundancy is one of the primary techniques for the engineering of safety-critical systems. Back-up resources can be called upon to mitigate the failure of primary systems. Traditionally, operator intervention can be required to manually switch between a failed unit and redundant resources. However, programmable systems are increasingly used to automatically detect failures and reconfigure underlying systems excluding faulty components. This creates problems if operators do not notice that their underlying systems have been reconfigured. In this paper, we examine a number of additional concerns that arise in the present generation of redundant, safety-critical applications."

IEEE Spectrum recently covered the same incident, Malaysia Airlines Flight 124, in "Automated to Death", in which Robert N. Charette investigates the causes and consequences of the automation paradox.

Both of the above involve taking the human out of the loop on two assumptions: A. humans are unreliable, and B. automation never goes wrong. The classic 1983 movie WarGames shows how badly this scenario can end. During a secret simulation of a nuclear attack, one of two United States Air Force officers is unwilling to turn a required key to launch a nuclear missile strike. The officer's refusal to perform his duty convinces systems engineers at NORAD that command of missile silos must be maintained through automation, without human intervention. The automated system ultimately leads to a near-catastrophic nuclear Armageddon. It seems our science fiction is once again becoming reality.

Taking us unreliable humans out of the safety loop makes sense in theory, but what the theory misses is that when things do go wrong, it is still up to us humans to solve the problem, and to solve it fast: before impact with the ground, for example.

In the Flight 124 incident, the fault-tolerant air data inertial reference unit (ADIRU) was designed to operate with a failed accelerometer. The redundant design of the ADIRU also meant that it was not mandatory to replace the unit when an accelerometer failed. Therefore the unit, with a now-known fault, was not replaced for many years.

Sensor feedback is a common methodology in safety systems to confirm that the output really did transition to the required state. The problem is that as you add components, your system becomes less reliable; see MIL-HDBK-217 Parts Count Analysis. A parts count analysis is a reliability prediction analysis that provides a rough estimate of a system's failure rate - how often the system will fail in a given time period. Parts count analyses are normally used early in a system design, when detailed information is not available.
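
To make the parts count idea concrete, here is a minimal sketch in Python. It assumes the usual parts count form, where the system failure rate is the sum over part types of quantity times generic failure rate times quality factor, and it treats every part as being in series (any one failure fails the system). The part names and failure rates are made-up numbers for illustration only; they are not taken from MIL-HDBK-217 or from any real design.

```python
from dataclasses import dataclass


@dataclass
class Part:
    name: str
    quantity: int
    generic_failure_rate: float  # failures per million hours (hypothetical value)
    quality_factor: float = 1.0  # quality multiplier (hypothetical value)


def system_failure_rate(parts):
    """Parts count estimate: sum of quantity * generic rate * quality factor.

    Assumes a series system, so every added part adds to the total.
    """
    return sum(p.quantity * p.generic_failure_rate * p.quality_factor for p in parts)


def mtbf_hours(failure_rate_per_million_hours):
    """Mean time between failures, in hours, for a constant failure rate."""
    return 1e6 / failure_rate_per_million_hours


# Illustrative numbers only: an actuator alone, then the same actuator
# with a feedback sensor and the two connectors needed to wire it in.
actuator_only = [Part("actuator", 1, 2.0)]
with_feedback = actuator_only + [
    Part("position sensor", 1, 5.0),
    Part("connector", 2, 0.5),
]

rate_before = system_failure_rate(actuator_only)
rate_after = system_failure_rate(with_feedback)

print(f"actuator only: {rate_before} per 1e6 h, MTBF {mtbf_hours(rate_before):.0f} h")
print(f"with feedback: {rate_after} per 1e6 h, MTBF {mtbf_hours(rate_after):.0f} h")
```

With these invented numbers the predicted failure rate goes from 2.0 to 8.0 failures per million hours once the sensor and its connectors are counted. The sensor tells you more about the output, but the arithmetic says the assembly as a whole fails more often.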

To put the problem in everyday practical terms, every failure that I've had with one of my vehicles has been a failure of a sensor, never the system being sensed by the sensor.

Am I saying that we should not use feedback sensors? No.

Do we put in two feedback sensors? Perhaps you have heard: "A man with one clock knows what time it is. A man with two clocks is never sure." More than two? At some point we run up against practical realities such as size, weight, power and cost.
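
The clock saying can be sketched in a few lines of Python. This is a generic illustration of one common approach, picking the middle of three readings, and not a description of how the ADIRU or any particular system does it. The function names, the tolerance and the readings (clock times in minutes past midnight) are all hypothetical.

```python
from statistics import median


def two_sensors_agree(a, b, tolerance):
    """Two sensors can detect a disagreement, but not say which one is wrong."""
    return abs(a - b) <= tolerance


def mid_value_select(a, b, c):
    """Three sensors: take the middle reading, so one bad value is outvoted."""
    return median([a, b, c])


# Hypothetical clock readings, in minutes past midnight.
clock_a, clock_b, clock_c = 720, 727, 721

# With two clocks we only learn that something is off.
print(two_sensors_agree(clock_a, clock_b, tolerance=2))

# With three clocks the outlier (727) is ignored.
print(mid_value_select(clock_a, clock_b, clock_c))
```

Two clocks can tell you that they disagree; it takes a third to suggest which one to believe, and that third clock brings its own size, weight, power, cost and parts count.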