Sunday, April 4, 2010

Evaluating Software's Impact on System and System of Systems Reliability from SEI at CMU

The Carnegie Mellon University Software Engineering Institute (SEI) has published a new white paper: Evaluating Software's Impact on System and System of Systems Reliability, by John B. Goodenough.

Alas, I have to say the title is more impressive than the paper itself. I'd sum up the paper as saying, "We need to define our terms for contractors and subcontractors when we talk to them about software."

There is one paragraph in the paper that I do think is particularly important:

"Hardware engineers typically think that software failures are deterministic because certain inputs or uses can reliably cause a failure. But although all software failures are deterministic in the sense that they occur every time certain conditions are met, the likelihood of the conditions being met becomes, eventually, a function of usage patterns and history, neither of which are deterministic. In effect, after egregious software faults have been removed, failure occurrences become non-deterministic. In fact, certain types of software failure are inherently non-deterministic because they depend on more knowledge of program state than is typically available. For example, failures due to race conditions and memory leaks typically depend on usage history and, for race conditions, subtle details of system state. Although these are removable design deficiencies, their occurrence appears to be random (although typically the frequency of such failures increases as the load on the software system increases). In short, it is not unreasonable to think of software failures as eventually mimicking hardware behavior in their seemingly non-deterministic occurrence."

The above describes why it can be so hard to write correct software, and even harder to test it. "Perfect Software" is something that only exists in theory.

Maybe a particular bug only manifests when the non-deterministic timing of the application has interrupts nested three or four levels deep (some Atmel and Zilog parts let you do this), while the user is turning on the radio, turning on the left turn signal, and pressing the brake simultaneously, and the unit has been running continuously for over 18 hours and 12 minutes, during an ESD event...
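To make that concrete, here is a minimal Python sketch of the idea. It is not from the paper. The names (faulty_step, ticks_until_failure, UPTIME_THRESHOLD) and the probabilities are made up for illustration. The fault itself is completely deterministic: it fires every time the radio, turn signal, and brake are all active after a certain uptime. Whether and when those conditions line up depends on the usage pattern, which is modeled here with a seeded random number generator.

```python
import random

# Hypothetical threshold: the fault is only reachable after this many ticks of uptime.
UPTIME_THRESHOLD = 500


def faulty_step(radio, signal, brake, uptime):
    """Return True if the latent fault fires.

    This is purely deterministic: the same inputs always give the same answer.
    """
    return radio and signal and brake and uptime > UPTIME_THRESHOLD


def ticks_until_failure(seed, max_ticks=100_000):
    """Simulate one unit in the field and return the tick of its first failure.

    The usage pattern (which inputs are active on each tick) is drawn from a
    random generator seeded with `seed`. The probabilities are assumptions
    chosen only for illustration. Returns None if no failure occurs.
    """
    rng = random.Random(seed)
    for uptime in range(1, max_ticks + 1):
        radio = rng.random() < 0.3    # assumed chance the radio is switched on
        signal = rng.random() < 0.1   # assumed chance the turn signal is on
        brake = rng.random() < 0.2    # assumed chance the brake is pressed
        if faulty_step(radio, signal, brake, uptime):
            return uptime
    return None


if __name__ == "__main__":
    # Same fault, different usage histories: the failure time varies per unit.
    for seed in range(5):
        print("unit", seed, "first failed at tick", ticks_until_failure(seed))
```

Replaying the same seed reproduces the same failure tick, which is the "deterministic" half of the quote. Across different seeds the failure times scatter, which is what a test lab or a field report actually sees.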