Saturday, December 26, 2009

Redundancy Considered Harmful

I mentioned Chris W. Johnson in our last blog entry. One paper of Chris's is of particular interest: The Dangers of Interaction with Modular and Self-Healing Avionics Applications: Redundancy Considered Harmful.

"Redundancy is one of the primary techniques for the engineering of safety-critical systems. Back-up resources can be called upon to mitigate the failure of primary systems. Traditionally, operator intervention can be required to manually switch between a failed unit and redundant resources. However, programmable systems are increasingly used to automatically detect failures and reconfigure underlying systems excluding faulty components. This creates problems if operators do not notice that their underlying systems have been reconfigured. In this paper, we examine a number of additional concerns that arise in the present generation of redundant, safety-critical applications."

IEEE Spectrum recently covered the same incident, that of Malaysia Airlines Flight 124, in Automated to Death, in which Robert N. Charette investigates the causes and consequences of the automation paradox.

Both of the above involve taking the human out of the loop on two assumptions: A) humans are unreliable, and B) automation never goes wrong. The classic 1983 movie WarGames shows how badly this scenario can end. During a secret simulation of a nuclear attack, one of two United States Air Force officers is unwilling to turn the required key to launch a nuclear missile strike. The officer's refusal to perform his duty convinces systems engineers at NORAD that command of the missile silos must be maintained through automation, without human intervention. The automated system ultimately leads to a near-catastrophic nuclear Armageddon. It seems our science fiction is once again becoming reality.

Taking us unreliable humans out of the safety loop makes sense in theory. What gets lost in that theory is that when things do go wrong, it is up to us humans to solve the problem fast - for example, before impact with the ground.

In the Flight 124 incident the fault-tolerant air data inertial reference unit (ADIRU) was designed to keep operating with a failed accelerometer. The redundant design of the ADIRU also meant that it was not mandatory to replace the unit when an accelerometer failed. Therefore the unit, with a now known fault, was not replaced for many years.
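To make the fault-masking idea concrete, here is a minimal sketch in C. It is not the actual ADIRU logic - the channel count, readings, and status reporting are invented for illustration - but it shows how a redundant sensor set can keep producing valid output with one channel latched out, which is exactly why a unit with a known fault can stay in service:

#include <stdbool.h>
#include <stdio.h>

#define NUM_ACCELS 6   /* number of accelerometer channels, illustrative */

struct accel_channel {
    double value;   /* latest reading, in g  */
    bool   failed;  /* latched failure flag  */
};

/* Average only the channels still believed healthy. */
static double blended_accel(const struct accel_channel ch[], int n)
{
    double sum = 0.0;
    int good = 0;
    for (int i = 0; i < n; i++) {
        if (!ch[i].failed) {
            sum += ch[i].value;
            good++;
        }
    }
    return (good > 0) ? sum / good : 0.0;
}

int main(void)
{
    struct accel_channel ch[NUM_ACCELS] = {
        { 1.01, false }, { 0.99, false }, { 1.02, false },
        { 1.00, false }, { 9.80, true },  /* failed channel, latched out */
        { 0.98, false },
    };

    printf("Blended acceleration: %.2f g\n", blended_accel(ch, NUM_ACCELS));

    /* The unit still produces a valid output, so replacing it is not
     * mandatory - the known fault can quietly ride along for years. */
    printf("Unit status: SERVICEABLE (1 of %d channels failed)\n", NUM_ACCELS);
    return 0;
}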

Sensor feedback is a common technique in safety systems to confirm that an output really did transition to the required state. The problem is that as you add components your system becomes less reliable; see the MIL-HDBK-217 parts count analysis. A parts count analysis is a reliability prediction that provides a rough estimate of a system's failure rate - how often the system will fail in a given period of time. Parts count analyses are normally used early in a design, when detailed information is not yet available.
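As a rough illustration of the parts count idea, here is a small C sketch that sums per-part failure rates in the MIL-HDBK-217 style. The part names, generic failure rates, and quality factors are made-up placeholders, not handbook values:

#include <stdio.h>

struct part_type {
    const char *name;
    unsigned    quantity;   /* N: how many of this part are in the design   */
    double      lambda_g;   /* generic failure rate, failures per 1e6 hours */
    double      pi_q;       /* quality factor                               */
};

int main(void)
{
    /* Placeholder entries - real values come from the handbook tables for
     * the actual part types, quality levels and operating environment.   */
    const struct part_type parts[] = {
        { "microcontroller",    1, 0.10,   1.0 },
        { "resistor",          60, 0.0002, 1.0 },
        { "ceramic capacitor", 40, 0.0005, 1.0 },
        { "pressure sensor",    2, 0.50,   2.0 },
    };
    const size_t n = sizeof parts / sizeof parts[0];

    double lambda_system = 0.0;   /* failures per 1e6 hours */
    for (size_t i = 0; i < n; i++)
        lambda_system += parts[i].quantity * parts[i].lambda_g * parts[i].pi_q;

    /* Every added part, feedback sensors included, adds its own term to
     * the sum, so the predicted failure rate only ever goes up.          */
    printf("Predicted failure rate: %.4f failures per 1e6 hours\n", lambda_system);
    printf("Rough MTBF: %.0f hours\n", 1.0e6 / lambda_system);
    return 0;
}

The point to notice is that every part you add, feedback sensors included, contributes its own term to the sum; the prediction only ever gets worse.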

To put the problem in everyday practical terms, every failure that I've had with one of my vehicles has been a failure of a sensor, never of the system being sensed.

Am I saying that we should not use feedback sensors? No.

Do we put in two feedback sensors? Perhaps you have heard: a man with one clock knows what time it is; a man with two clocks is never sure. More than two? At some point we run up against practical realities such as size, weight, power, and cost.
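One common way past the two-clock problem, paid for with a third sensor's size, weight, power and cost, is voting: with three readings the odd one out can be both out-voted and flagged. Here is a minimal median-of-three sketch in C; the readings and the disagreement tolerance are invented for the example:

#include <stdio.h>

/* Return the middle of three readings; the median survives even if one
 * sensor has wandered well off. */
static double median3(double a, double b, double c)
{
    if ((a >= b && a <= c) || (a <= b && a >= c)) return a;
    if ((b >= a && b <= c) || (b <= a && b >= c)) return b;
    return c;
}

int main(void)
{
    /* Three redundant pressure readings; channel B has drifted. */
    double readings[3] = { 101.2, 87.4, 101.5 };
    const double tolerance = 2.0;   /* invented disagreement limit */

    double voted = median3(readings[0], readings[1], readings[2]);
    printf("Voted reading: %.1f\n", voted);

    /* Flag any channel that disagrees with the voted value, so the fault
     * is reported rather than silently masked. */
    for (int i = 0; i < 3; i++) {
        if (readings[i] < voted - tolerance || readings[i] > voted + tolerance)
            printf("Channel %c disagrees (%.1f) - schedule maintenance\n",
                   'A' + i, readings[i]);
    }
    return 0;
}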

Epistemic Questions in Software System Safety

C. Michael Holloway presented an interesting paper, Towards a Comprehensive Consideration of Epistemic Questions in Software System Safety, coauthored with Chris W. Johnson, at the 4th System Safety Conference 2009. You can watch the presentation here:

Towards a Comprehensive Consideration of Epistemic Questions in Software System Safety

C M Holloway

From: 4th System Safety Conference 2009

2009-10-26

"For any system upon which lives depend, the system should not only be safe, but the designers, operators, and regulators of the system should also know that it is safe. For software intensive systems, universal agreement on what is necessary to justify knowledge of safety does not exist."

To sum up Michael's paper and presentation in a nutshell, Michael says that we are not asking the correct questions to know whether our systems are safe. He explores the difference between believing the system is safe, thinking the system is safe, and knowing the system is safe. He covers twelve fundamental questions that we all need to agree on before we, as an industry, can agree that our systems are safe. He has thirty questions in all that need to be asked, many of which are only asked after there has been an accident. What additional questions would you ask to know if your system is truly safe?