Something new I'm trying here in the blog is to open it up to an occasional Guest Blogger. If you have an idea for something that fits in with the general theme of the site, get in contact with me.
Today's Guest Blogger is Rajstennaj Barrabas, whom may be contacted at: RB (at) OkianWarrior (dot) com . If you have comments for RB use that address or leave comments below. I'll turn it over to RB now:
Examining past software failures gives us insight into how failures arise, so we can anticipate and avoid failures in the future.
One such failure is the Therac-25, a radiation therapy machine that killed several patients due to buggy software.
Very briefly, the Therac has a high-power mode which is used once a metal shield drops into place to protect the patient. A particular keyboard sequence entered by the operator (double-pressing the "return" key at just the right moment) caused a cascade of failures where the software eventually jumped into the middle of a function. The system engaged high-power mode without first lowering the shield, killing the patient.
To my mind, this was the first fatal accident caused by a software problem; or at least, the first well-known one. People suddenly realized that software could hurt people, and that perhaps special care should be taken.
(The link has a more detailed summary, with pointers to more in-depth reports.)
Analysis at the time noted that the high-power function did not check that the protective shield was in place. Lowering the shield was done prior to the call, and the function assumed that this had happened.
The conclusion was that software should always check its assumptions, and this became a set of "best practices" for safety-certified systems.
In safe software, every function should began with ASSERTs that check the arguments for validity, and these should be active in the released code. The execution time is negligible in most cases, and ASSERTs do a good job of ensuring proper behaviour.
As calculation proceeds, more ASSERTs should check the intermediate results; for example, to ensure nothing overflows, or that values are in range.
The Therac function could have ASSERTed that the shields were in place.
Most embedded programs don't use 100% of the CPU time. The spare capacity can be used to check up on the program and further ensure that everything is going well.
Each module can supply a function "xxxBIT" (where "BIT" stands for "Built In Test") which checks the module variables for consistency. As the program runs, it can call these functions during idle moments.
For example, the serial driver (SerialBIT) can check that the buffer pointers still point within the buffer, that the hardware registers haven't changed, and so on.
On bootup, the memory manager (MemoryBIT) knows the last-used static address for the program (ie - the end of .data), and should fill unused memory with a pattern. In it's spare time it checks to make sure the unused memory still has the pattern. This finds all sorts of "thrown pointer" errors. (Checking all of memory can take too long, so MemoryBIT can be coded to check a small portion each call.)
The stack pointer can be checked - put a pattern at the end of the stack, and if it changes you knew something went recursive or used too much stack (StackBIT).
The EPROM can be checksummed periodically.
Every module should have a BIT function which checks every imaginable error, to be called in the processor's spare time - over and over continuously.
The Therac could have continually tested its calculations to check for cascade failures.
The overall effect is a very "stiff" program - one that will either work completely or won't work at all. There is no intermediate behavior, no flexibility of action. Cascade failures are caught early, terminating the operation before things get out of hand.
A "stiff" program doesn't give erroneous or misleading results - it either works as intended or fails completely. Showing a blank screen is better than showing bad information, or even a frozen screen.
(Of course, this is situation specific. A blank screen is OK for aircraft - where the flight crew can take appropriate action - but perhaps not medical. You can still detect errors, log the problem, and alert the user.)
Some functions are time sensitive and can't afford the time spent error checking (interrupt handlers, for instance), but these can be identified and removed on a case-by-case basis.
Conventional wisdom says to use checking during development, and remove it on released code.
Done right, error checking has negligible impact on code speed but returns great gains in safety.
No comments:
Post a Comment