Saturday, December 26, 2009

Redundancy Considered Harmful

I mentioned Chris W. Johnson in our last blog entry. One paper of Chris's is of particular interest: The Dangers of Interaction with Modular and Self-Healing Avionics Applications: Redundancy Considered Harmful.

"Redundancy is one of the primary techniques for the engineering of safety-critical systems. Back-up resources can be called upon to mitigate the failure of primary systems. Traditionally, operator intervention can be required to manually switch between a failed unit and redundant resources. However, programmable systems are increasingly used to automatically detect failures and reconfigure underlying systems excluding faulty components. This creates problems if operators do not notice that their underlying systems have been reconfigured. In this paper, we examine a number of additional concerns that arise in the present generation of redundant, safety-critical applications."

IEEE Spectrum recently covered the same incident, Malaysia Airlines Flight 124, in Automated to Death, in which Robert N. Charette investigates the causes and consequences of the automation paradox.

Both of the above involve taking the human out of the loop on two assumptions: (A) humans are unreliable, and (B) automation never goes wrong. The classic 1983 movie War Games shows how badly this scenario can end. During a secret simulation of a nuclear attack, one of two United States Air Force officers is unwilling to turn a required key to launch a nuclear missile strike. The officer's refusal to perform his duty convinces systems engineers at NORAD that command of missile silos must be maintained through automation, without human intervention. The automated system ultimately leads to a near-catastrophic nuclear Armageddon. It seems our science fiction is once again becoming reality.

Taking us unreliable humans out of the safety loop makes sense in theory. What is lost in that theory is that when things do go wrong, it is up to us humans to solve the problem fast, for example before impact with the ground.

In the Flight 124 incident the fault-tolerant air data inertial reference unit (ADIRU) was designed to operate with a failed accelerometer. The redundant design of the ADIRU also meant that it was not mandatory to replace the unit when an accelerometer failed. Therefore the unit, with a now-known fault, was not replaced for many years.

Sensor feedback is a common methodology in safety systems to confirm that the output really did transition to the required state. The problem is that as you add components, your system becomes less reliable; see MIL-HDBK-217 Parts Count Analysis. A parts count analysis is a reliability prediction analysis that provides a rough estimate of a system's failure rate: how often the system will fail in a given time period. Parts count analyses are normally used early in a system design, when detailed information is not available.

To put the problem in everyday practical terms: every failure that I've had with one of my vehicles has been a failure of a sensor, never of the system being sensed.

Am I saying that we should not use feedback sensors? No.

Do we put in two feedback sensors? Perhaps you have heard: a man with one clock knows what time it is; a man with two clocks is never sure. More than two? At some point we run up against practical realities such as size, weight, power, and cost.
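For what "more than two" buys you, here is a sketch of the classic two-out-of-three arrangement; my own illustration, not from either paper above. With three sensors, a median vote masks one wild channel, which neither one nor two sensors can do.

```c
/* A two-out-of-three (2oo3) voter sketch; my own illustration, not
 * from either paper above. The median of three readings masks one
 * wild channel. */
static int vote2oo3(int a, int b, int c)
{
    /* Return whichever reading lies between the other two. */
    if ((a >= b && a <= c) || (a <= b && a >= c)) return a;
    if ((b >= a && b <= c) || (b <= a && b >= c)) return b;
    return c;
}
```

So vote2oo3(100, 101, 5000) returns 101: the failed channel is outvoted. You still want to log the disagreement, of course, so the bad sensor gets replaced.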

Epistemic Questions in Software System Safety

C. Michael Holloway presented an interesting paper [Towards a Comprehensive Consideration of Epistemic Questions in Software System Safety], coauthored with Chris W. Johnson, at the 4th System Safety Conference 2009, which you can watch here:

Towards a Comprehensive Consideration of Epistemic Questions in Software System Safety

C M Holloway

From: 4th System safety conference 2009

2009-10-26


"For any system upon which lives depend, the system should not only be safe, but the designers, operators, and regulators of the system should also know that it is safe. For software intensive systems, universal agreement on what is necessary to justify knowledge of safety does not exist."
To sum up Michael's paper and presentation in a nutshell: Michael says that we are not asking the correct questions to know whether our systems are safe. He explores the difference between believing a system is safe, thinking it is safe, and knowing it is safe. He covers twelve fundamental questions that we all need to agree on before we, as an industry, can agree our systems are safe. He has thirty questions in all that need to be asked, many of which are only asked after there has been an accident. What additional questions would you ask to know if your system is truly safe?

Sunday, September 27, 2009

E-Prescribing prior art examples and code.

In my previous post, I talked about what I believed was the first electronic based prescription. I have now posted some scans of that work from long ago, in the section on evils of Software Patents at http://www.softwaresafety.net.

Sunday, May 31, 2009

When was the very first electronic prescription used? I say ~1978.

I asked Fred Trotter, advocate of Free and Open Source Software in the medical field, a question about FOSS certification of medical software.

I mentioned that I did work in this field back in the '70s; in actuality, it seems we were creating the field in the '70s. Fred asked that I cover this in my blog, so here we are.

"Bob,
Your work on e-prescribing is an important source of prior-art!! Please consider detailing exactly what you did, and how you did and even source code as well as the dates that you did this on your blog!!" - Fred Trotter

In the years 1977 to 1982 I was working my way through school by writing medical software for the office of Dr. Armour and Dr. McDowell, in Farrell PA.

Keep in mind the time frame: in 1977 the IBM PC had not yet been invented, and the Internet effectively did not exist outside of academia and the military. The top-end, off-the-shelf computers of the day were the Apple II and the TRS-80 Model I. Very few people knew what a personal computer was; I'm not sure the term "personal computer" had even been coined at that point in time.

I was a classic Nerd; the movie Revenge of the Nerds was a documentary of my life. [We Nerds did win, by the way, or you would not be reading this right now, would you?] My father knew Dr. Armour through their shared interest in Amateur Radio.

Dr. Armour was interested in the new area of computers and how they might help his medical practice become more efficient. Mostly Drs. Armour and McDowell practiced obstetrics; that is, they delivered babies and provided newborn follow-up care. As giving birth, and what most newborns do, have not changed in many millennia, Dr. A. wanted a standardized menu where you could enter the common things that would happen, so that the printed notes could be put in the charts, and the common prescriptions for the new mothers and children could be printed. I'm sure we all know how bad the handwriting of most doctors is. That is because they have to write a lot, and it gets tiring.

Dr. A. set up his personal TRS-80 Model I (the Model III did not yet exist; it would be out soon) in the back office of his practice, gave me a key to the back door, had me sit in on a few exams of the mothers-to-be, with their permission, and of the newborns, and gave me a few notes and long discussions on what he wanted, which I then coded up in BASIC; C compilers were rare and expensive then. After a few back-and-forth sessions a basic system was set up to try out. Today this setup would be termed an Expert System, but I did not know that at the time.

I don't recall for sure if we moved the Model I to a cart, or if we had gotten a second Model I; either way, a cart with computer and *noisy* line printer was placed in one of the exam rooms. Eventually all of the exam rooms had TRS-80 Model IIIs in them, each with quieter printers (remember the frequently sleeping babies in the room?). Networking as we know it today did not exist; it was just starting to come out in its earliest forms.

The two things I remember most are spending time in the exam rooms (remember, as a Nerd, that always made me a bit queasy) and the day Dr. Armour came in and said he had just gotten a phone call from the druggist; I think it was the RiteAid at the Shenango Valley Mall, but I don't recall for sure.

The druggist had called us and asked if the printed prescription he was holding in his hand was for real. I remember distinctly asking, "Is there a problem? What is wrong with it?" To which Dr. A. replied, "No, they loved them, they could read them! They want more, and they hope other doctors in the area will be using these."

I know Dr. Armour did discuss doing this setup with other doctors in the area, but I don't recall anything really coming of it. Remember, small computers were still unknown to almost everyone at this point in time.

I know Dr. Armour and I never even considered patenting or copyrighting the system back then; that was the nature of the time. Not sure this would count as Open Source, as the term did not exist then, but we would have given the code to anyone who wanted it. Few doctors seemed to 'get it' then. I wonder if they even get it now at times?

I know I don't have any of that source code or notes any longer. This does give me a good reason to go visit Dr. Armour, to see how he is doing and whether he has anything left. He retired long ago.

One other thing worth mentioning is that Dr. A. and I attended the very first MUMPS conference in DC in 1981. I still have the DEC MUMPS badge around here some place, and the manuals on the language. Dr. A. and I thought it would be a good way to get things networked, but the technology required, DEC machines, was just a bit beyond what Dr. A. could afford at that point in time, so we never did a lot with it. To this day I still look in on what is happening with MUMPS once in a while; there are many places that still use it. Still have the books:

  • Computers in Ambulatory Medicine; Proceedings of the Joint Conference of the Society of Computer Medicine and the Society for Advanced Medical Systems. October 30-November 1, 1981 Sheraton Washington Hotel, Washington, D.C.
  • A Manual of COMPUTERS IN MEDICAL PRACTICE.
  • Computer Programming In ANS MUMPS. A self-instruction manual for non-programmers, by Arthur F. Krieg and Lucille K. Shearer.

The bottom line: can anyone point to an earlier date than the late 1970s for e-prescribing?

Use offsetof() to avoid structure alignment issues in C

Dan Saks recently wrote Padding and rearranging structure members (here's what C and C++ compilers must do to keep structure members aligned) at Embedded.com.

No discussion of structure alignment is complete without covering offsetof() from stddef.h.

When I mentioned this to Dan he pointed me to his article on Catching errors early with compile-time assertions, where he does mention offsetof().

offsetof() gives you the offset in bytes of a particular structure member from the start of the structure in the C language. This makes writing safe, portable code with structures much easier: when offsetof() is used properly, there is no longer any worry about how different compilers might align the structure members on machines of different word sizes.

Sunday, May 10, 2009

Should Developers Be Liable For Their Code?

ZDNet has an interesting article about whether developers should be liable for the code they write:

Software companies could be held responsible for the security and efficacy of their products, if a new European Commission consumer protection proposal becomes law.

Commissioners Viviane Reding and Meglena Kuneva have proposed that EU consumer protections for physical products be extended to software. The suggested change in the law is part of an EU action agenda put forward by the commissioners after identifying gaps in EU consumer protection rules.

The Linux Journal and Slashdot have follow-ups. If you are easily offended by vulgar language, best avoid the Slashdot link.

Most system faults are created before the first line of code is written or the first schematic is drawn. The errors are caused by not understanding the requirements of the system.

What do you think?

Sunday, April 19, 2009

The Power of Ten: 10 Rules for Writing Safety-Critical Code

I just came across a site in an ad on Embedded.com that every reader of this blog needs to check out:

The Power of Ten: 10 Rules for Writing Safety-Critical Code.

The comments on their rule #10 support our position that you want every useful compiler warning you can get.

Do you just ignore Compiler Warnings?

Something I just saw on the AVR-GCC list:

"I mean the compiler gives some of the most stupid warnings, such as, when a function that is declared but not used..."

or a past favorite of mine: "It is only a warning, just ignore it". Yellow traffic lights are "only warnings" too, ones that most people do seem to ignore, and that governments are gaming to enhance revenue; sorry, wrong blog...

I have always had a zero tolerance for warnings in code. If you have a warning in your code, your code is broken.
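Here is a small made-up example of why "only a warning" so often means broken code. With -Wall, GCC's -Wparentheses flags the assignment shown in the comment, because it silently turns the test into "always false":

```c
/* A made-up example of why "only a warning" usually means a real bug.
 * The commented-out line is what the programmer typed; with -Wall,
 * GCC's -Wparentheses flags it, because the assignment makes the test
 * silently always false. */
static int check_level(int level)
{
    /* if (level = 0)        <- assignment, not comparison: warned */
    if (level == 0) {        /* the intended test */
        return -1;           /* reject: level must be nonzero */
    }
    return level;
}
```

With the assignment version, check_level(0) would happily return 0 instead of the error code, and no test short of a code review or that "stupid warning" would tell you why.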

If you are using GCC here are the warnings you can enable, that I use in my own Makefiles:

Make sure you have -W and -Wall in your CFLAGS.

CFLAGS += -W -Wall -Wstrict-prototypes -Wchar-subscripts

I generally run with every warning/error message turned on, with the exception of pedantic and unreachable-code. The latter frequently gives bogus results and the former goes off on commonly accepted code.

# -Werror : Make all warnings into errors.
CFLAGS +=  -Werror

# -pedantic : Issue all the mandatory diagnostics listed in the C
# standard. Some of them are left out by default, since they trigger frequently
# on harmless code.
#
# -pedantic-errors : Issue all the mandatory diagnostics, and make all
# mandatory diagnostics into errors. This includes mandatory diagnostics that
# GCC issues without -pedantic but treats as warnings.
#CFLAGS +=  -pedantic

#-Wunreachable-code
#Warn if the compiler detects that code will never be executed. [Seems
#to give bogus results]
#CFLAGS += -Wunreachable-code

#Warn if an undefined identifier is evaluated in an `#if' directive.
CFLAGS += -Wundef

# Dump the address, size, and relative cost of each statement into comments in
# the generated assembler code. Used for debugging avr-gcc.
CFLAGS += -msize

# -Winline : Warn when a function marked inline could not be
#        substituted, and will give the reason for the failure.
CFLAGS +=  -Winline


Most of the following are turned on via -Wall:

# Functions prologues/epilogues expanded as call to appropriate
# subroutines. Code size will be smaller.  Use subroutines for function
# prologue/epilogue. For complex functions that use many registers (that needs
# to be saved/restored on function entry/exit), this saves some space at the
# cost of a slightly increased execution time.
CFLAGS += -mcall-prologues

# Use rjmp/rcall (limited range) on >8K devices. On avr2 and avr4 architectures
# (less than 8 KB or flash memory), this is always the case. On avr3 and avr5
# architectures, calls and jumps to targets outside the current function will
# by default use jmp/call instructions that can cover the entire address range,
# but that require more flash ROM and execution time.
#CFLAGS += -mshort-calls

# Do not generate tablejump instructions. By default, jump tables can be used
# to optimize switch statements. When turned off, sequences of compare
# statements are used instead. Jump tables are usually faster to execute on
# average, but in particular for switch statements where most of the jumps
# would go to the default label, they might waste a bit of flash memory.
# CFLAGS += -mno-tablejump

# Allocate to an enum type only as many bytes as it needs for the declared
# range of possible values. Specifically, the enum type will be equivalent to
# the smallest integer type which has enough room.
# CFLAGS += -fshort-enums

# Dump the internal compilation result called "RTL" into comments in the
# generated assembler code. Used for debugging avr-gcc.
# CFLAGS += -mrtl

# Generate lots of debugging information to stderr.
#CFLAGS += -mdeb

#-Wchar-subscripts
#Warn if an array subscript has type char. This is a common cause of
#error, as programmers often forget that this type is signed on some
#machines. This warning is enabled by -Wall.
#
#-Wcomment
#Warn whenever a comment-start sequence `/*' appears in a `/*'
#comment, or whenever a Backslash-Newline appears in a `//' comment.
#This warning is enabled by -Wall.
#
#-Wfatal-errors
#This option causes the compiler to abort compilation on the first
#error occurred rather than trying to keep going and printing further
#error messages.
#
#-Wformat
#Check calls to printf and scanf, etc., to make sure that the
#arguments supplied have types appropriate to the format string
#specified, and that the conversions specified in the format string
#make sense.
#
#-Winit-self (C, C++, Objective-C and Objective-C++ only)
#Warn about uninitialized variables which are initialized with
#themselves. Note this option can only be used with the -Wuninitialized
#option, which in turn only works with -O1 and above.
#
#-Wimplicit-int
#Warn when a declaration does not specify a type. This warning is
#enabled by -Wall.
#
#-Wimplicit-function-declaration
#-Werror-implicit-function-declaration
#Give a warning (or error) whenever a function is used before being
#declared. The form -Wno-error-implicit-function-declaration is not
#supported. This warning is enabled by -Wall (as a warning, not an
#error).
#
#-Wimplicit
#Same as -Wimplicit-int and -Wimplicit-function-declaration. This
#warning is enabled by -Wall.
#
#-Wmain
#Warn if the type of `main' is suspicious. `main' should be a function
#with external linkage, returning int, taking either zero arguments,
#two, or three arguments of appropriate types. This warning is enabled
#by -Wall.
#
#-Wmissing-braces
#Warn if an aggregate or union initializer is not fully bracketed. In
#the following example, the initializer for `a' is not fully bracketed,
#but that for `b' is fully bracketed.
#
#          int a[2][2] = { 0, 1, 2, 3 };
#          int b[2][2] = { { 0, 1 }, { 2, 3 } };
#
#This warning is enabled by -Wall.
#
#-Wmissing-include-dirs (C, C++, Objective-C and Objective-C++ only)
#Warn if a user-supplied include directory does not exist.
#
#-Wparentheses
#Warn if parentheses are omitted in certain contexts, such as when
#there is an assignment in a context where a truth value is expected,
#or when operators are nested whose precedence people often get
#confused about.
#
#This warning is enabled by -Wall.
#
#-Wsequence-point
#Warn about code that may have undefined semantics because of
#violations of sequence point rules in the C standard.
#
#This warning is enabled by -Wall.
#
#-Wreturn-type
#Warn whenever a function is defined with a return-type that defaults
#to int. Also warn about any return statement with no return-value in a
#function whose return-type is not void.
#
#This warning is enabled by -Wall.
#
#-Wswitch
#Warn whenever a switch statement has an index of enumerated type and
#lacks a case for one or more of the named codes of that enumeration.
#(The presence of a default label prevents this warning.) case labels
#outside the enumeration range also provoke warnings when this option
#is used. This warning is enabled by -Wall.
#
#-Wswitch-default
#Warn whenever a switch statement does not have a default case.
#
#-Wswitch-enum
#Warn whenever a switch statement has an index of enumerated type and
#lacks a case for one or more of the named codes of that enumeration.
#case labels outside the enumeration range also provoke warnings when
#this option is used.
#
#-Wtrigraphs
#Warn if any trigraphs are encountered that might change the meaning
#of the program (trigraphs within comments are not warned about). This
#warning is enabled by -Wall.
#
#-Wunused-function
#Warn whenever a static function is declared but not defined or a
#non-inline static function is unused. This warning is enabled by
#-Wall.
#
#-Wunused-label
#Warn whenever a label is declared but not used. This warning is
#enabled by -Wall.
#
#-Wunused-parameter
#Warn whenever a function parameter is unused aside from its declaration.
#
#-Wunused-variable
#Warn whenever a local variable or non-constant static variable is
#unused aside from its declaration. This warning is enabled by -Wall.
#
#-Wunused-value
#Warn whenever a statement computes a result that is explicitly not
#used. This warning is enabled by -Wall.
#
#To suppress this warning cast the expression to `void'.
#
#-Wunused
#All the above -Wunused options combined.
#
#-Wuninitialized
#Warn if an automatic variable is used without first being initialized
#or if a variable may be clobbered by a setjmp call.
#
#This warning is enabled by -Wall.
#
#-Wstring-literal-comparison
#Warn about suspicious comparisons to string literal constants. In C,
#direct comparisons against the memory address of a string literal,
#such as if (x == "abc"), typically indicate a programmer error, and
#even when intentional, result in unspecified behavior and are not
#portable.
#
#-Wall
# All of the above `-W' options combined. This enables all the warnings about
# constructions that some users consider questionable, and that are easy to
# avoid (or modify to prevent the warning), even in conjunction with macros.
# This also enables some language-specific warnings described in C++ Dialect
# Options and Objective-C and Objective-C++ Dialect Options.
#
#-Wextra
#-Wfloat-equal
#Warn if floating point values are used in equality comparisons.
#
#-Wtraditional (C only)
#Warn about certain constructs that behave differently in traditional
#and ISO C. Also warn about ISO C constructs that have no traditional C
#equivalent, and/or problematic constructs which should be avoided.
#
#-Wdeclaration-after-statement (C only)
#Warn when a declaration is found after a statement in a block.
#
#-Wshadow
#Warn whenever a local variable shadows another local variable,
#parameter or global variable or whenever a built-in function is
#shadowed.
#
#-Wunsafe-loop-optimizations
#Warn if the loop cannot be optimized because the compiler could not
#assume anything on the bounds of the loop indices.
#
#-Wpointer-arith
#Warn about anything that depends on the 'size of' a function type or
#of void. GNU C assigns these types a size of 1, for convenience in
#calculations with void * pointers and pointers to functions.
#
#-Wbad-function-cast (C only)
#Warn whenever a function call is cast to a non-matching type. For
#example, warn if int malloc() is cast to anything *.
#
#-Wcast-qual
#Warn whenever a pointer is cast so as to remove a type qualifier from
#the target type. For example, warn if a const char * is cast to an
#ordinary char *.
#
#-Wcast-align
#Warn whenever a pointer is cast such that the required alignment of
#the target is increased. For example, warn if a char * is cast to an
#int * on machines where integers can only be accessed at two- or
#four-byte boundaries.
#
#-Wwrite-strings
#When compiling C, give string constants the type const char[length]
#so that copying the address of one into a non-const char * pointer
#will get a warning; when compiling C++, warn about the deprecated
#conversion from string constants to char *. These warnings will help
#you find at compile time code that can try to write into a string
#constant, but only if you have been very careful about using const in
#declarations and prototypes. Otherwise, it will just be a nuisance;
#this is why we did not make -Wall request these warnings.
#
#-Wconversion
#Warn if a prototype causes a type conversion that is different from
#what would happen to the same argument in the absence of a prototype.
#This includes conversions of fixed point to floating and vice versa,
#and conversions changing the width or signedness of a fixed point
#argument except when the same as the default promotion.
#
#-Wsign-compare
#Warn when a comparison between signed and unsigned values could
#produce an incorrect result when the signed value is converted to
#unsigned. This warning is also enabled by -Wextra; to get the other
#warnings of -Wextra without this warning, use `-Wextra
#-Wno-sign-compare'.
#
#-Waggregate-return
#Warn if any functions that return structures or unions are defined or
#called. (In languages where you can return an array, this also elicits
#a warning.)
#
#-Wstrict-prototypes (C only)
#Warn if a function is declared or defined without specifying the
#argument types. (An old-style function definition is permitted without
#a warning if preceded by a declaration which specifies the argument
#types.)
#
#-Wold-style-definition (C only)
#Warn if an old-style function definition is used. A warning is given
#even if there is a previous prototype.
#
#-Wmissing-prototypes (C only)
#Warn if a global function is defined without a previous prototype
#declaration. This warning is issued even if the definition itself
#provides a prototype. The aim is to detect global functions that fail
#to be declared in header files.
#
#-Wmissing-declarations (C only)
#Warn if a global function is defined without a previous declaration.
#Do so even if the definition itself provides a prototype. Use this
#option to detect global functions that are not declared in header
#files.
#
#-Wmissing-field-initializers
#Warn if a structure's initializer has some fields missing.
#
#-Wmissing-noreturn
#Warn about functions which might be candidates for attribute
#noreturn. Note these are only possible candidates, not absolute ones.
#Care should be taken to manually verify functions actually do not ever
#return before adding the noreturn attribute, otherwise subtle code
#generation bugs could be introduced. You will not get a warning for
#main in hosted C environments.
#
#-Wmissing-format-attribute
#Warn about function pointers which might be candidates for format
#attributes. Note these are only possible candidates, not absolute
#ones.
#
#-Wpacked
#Warn if a structure is given the packed attribute, but the packed
#attribute has no effect on the layout or size of the structure.
#
#-Wpadded
#Warn if padding is included in a structure, either to align an
#element of the structure or to align the whole structure. Sometimes
#when this happens it is possible to rearrange the fields of the
#structure to reduce the padding and so make the structure smaller.
#
#-Wredundant-decls
#Warn if anything is declared more than once in the same scope, even
#in cases where multiple declaration is valid and changes nothing.
#
#-Wnested-externs (C only)
#Warn if an extern declaration is encountered within a function.
#
#-Wunreachable-code
#Warn if the compiler detects that code will never be executed.
#
#-Winline
#Warn if a function can not be inlined and it was declared as inline.
#Even with this option, the compiler will not warn about failures to
#inline functions declared in system headers.
#
#-Winvalid-pch
#Warn if a precompiled header (see Precompiled Headers) is found in
#the search path but can't be used.
#
#-Wvolatile-register-var
#Warn if a register variable is declared volatile. The volatile
#modifier does not inhibit all optimizations that may eliminate reads
#and/or writes to register variables.
#
#-Wdisabled-optimization
#Warn if a requested optimization pass is disabled. This warning does
#not generally indicate that there is anything wrong with your code; it
#merely indicates that GCC's optimizers were unable to handle the code
#effectively. Often, the problem is that your code is too big or too
#complex; GCC will refuse to optimize programs when the optimization
#itself is likely to take inordinate amounts of time.
#
#-Wstack-protector
#This option is only active when -fstack-protector is active. It warns
#about functions that will not be protected against stack smashing.

Sunday, March 29, 2009

IEC 60730 Power Up Self-Tests

[April 6th, 2010: Updated this entry to add Microchip. Do you know of any others that should be added?]

I was asked this week what I knew about "a self test at power up according standard IEC61508". The first thing I can tell you is that IEC 61508, Functional safety of electrical/electronic/programmable electronic safety-related systems, has a price tag of over $1200! I always find the high prices of these numerous standards extremely frustrating.

In the past I was involved with the creation of the reports Programmable Electronic Mining Systems: Best Practice Recommendations (In Nine Parts) for the Centers for Disease Control (CDC)/National Institute for Occupational Safety and Health (NIOSH) Mining Division. These reports draw heavily from International Electrotechnical Commission (IEC) standard IEC 61508 [IEC 1998a,b,c,d,e,f,g] and other standards. They are in the public domain, and can be found at my hardware site.

The newer standard, IEC 60730 is also mandating power up self-tests. You can preview what you are getting for your big bucks here.

The IEC 60730 safety standard for household appliances is designed for automatic electronic controls, to ensure safe and reliable operation of products. I always find it a bit ironic that things like our refrigerators and dishwashers now have more stringent standards than some of the devices that really can kill us.

IEC 60730 segments automatic control products into three different classifications:

  • Class A: Not intended to be relied upon for the safety of the equipment.
  • Class B: To prevent unsafe operation of the controlled equipment.
  • Class C: To prevent special hazards.

Hardware:

  • Independent clocked Watchdog Timer - this provides a safety mechanism to monitor:
    • The flow of the software
    • Interrupt handling & execution
    • CPU clock too fast, too slow and no clock
  • CRC Engine when available - this provides a fast mechanism for:
    • Testing the Flash memory.
    • Checking serial communication protocols such as UART, I2C, SPI.

Software:

  • CPU Register tests
  • Program Counter test
  • Flash CRC, using software and/or hardware CRC engines
  • RAM Tests
  • Independent Watchdog Timeout
Safety regulations and their impact on MCUs in home appliances has a short introduction to 60730.

Fortunately for us several companies have implemented IEC 60730 compliant libraries. Listed alphabetically:

What all of these tests fail to address in any meaningful way is what happens when a power-up test fails. The best you can hope for is that you have a beeper or LED hooked up directly to a micro pin that you can blink or beep. For example, if you find that your accumulator has a stuck bit, you are hosed at that point; you cannot guarantee that anything you do is going to be correct.

There is also the trade-off between being thorough with exhaustive tests and being fast. To complicate matters even further, some standards, such as NFPA's, mandate that the system be operational in under one second. I did have a micro one time that had a hardware failure: the XOR instruction was broken, but only on certain bit combinations. Every other aspect of the part worked just fine. It took days to debug that problem, as at the time the micro in question was hard to get and expensive, so swapping it first was not an option.

One closing thought is that you need to be very wary of simple RAM tests. Writing 0xAA/0x55 tells you almost nothing about open address lines, etc.
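To make that concrete, here is a simplified sketch (my own, not from any IEC 60730 library) of an address-line check. Unlike a flat 0xAA/0x55 fill, it writes to power-of-two offsets, so a stuck or open address line shows up as one cell aliasing another:

```c
/* A simplified address-line check; my own sketch, not from any
 * IEC 60730 library. A production test would also walk the disturbing
 * write through each power-of-two cell in turn, and must of course
 * run on the physical RAM under test. */
#include <stdint.h>
#include <stddef.h>

/* Returns 0 if no aliasing was detected over 'len' bytes ('len' a
 * power of two), nonzero on the first fault found. */
static int ram_address_line_test(volatile uint8_t *base, size_t len)
{
    size_t offset;

    /* Mark each power-of-two offset; each exercises one address line. */
    base[0] = 0x01;
    for (offset = 1; offset < len; offset <<= 1)
        base[offset] = 0xAA;

    /* If an address line is stuck or open, offset 0 and some
     * power-of-two offset decode to the same physical cell, so this
     * write clobbers one of the 0xAA marks. */
    base[0] = 0x55;
    for (offset = 1; offset < len; offset <<= 1)
        if (base[offset] != 0xAA)
            return 1;

    return 0;
}
```

A plain 0xAA/0x55 fill passes even with a dead address line, because every aliased cell still holds the same pattern as the cell it shadows; distinct per-line marks are what expose the fault.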

Saturday, March 28, 2009

Anyone want to do a term paper on CRCs?

Do you know any Math Majors who need a subject for a term paper?

The Embedded Systems community needs one written on CRCs that is practical rather than pedagogical, like the textbooks that address the subject. Here is a sample paper.

The resulting paper should have practical answers that people in embedded-systems land, like myself, can understand and use. Reading about polynomials over Galois Fields tends to make my eyes glaze over. My speed of mathematics is more that of the Trachtenberg Speed System of Basic Mathematics.

What brought this on today is the new Atmel XMega processor that I'm designing with, which uses the CRC polynomial x^24 + 4x^3 + 3x + 1. That polynomial does not seem to be any of the standard ones, so what are its error-detection properties?

CRC polynomials have to have certain properties. While they may all be primes, not all primes make good CRCs. For example, the properties that make a good CRC polynomial make a very bad random-number generator, and vice versa, even though both are built from multi-tap shift registers. "CRC generators do NOT generate maximal-length sequences. In fact, the polynomials are deliberately chosen to be reducible by the factor X + 1, because that happens to eliminate all odd-bit errors." -- Embedded Systems Programming, Jan/1992, Jack Crenshaw. I admittedly have never understood why the "good ones" are the good ones; more of the math-versus-get-the-work-done divide.

For some background take a look at these papers:

I know that CRC is good only over a certain block length, but what is that block length? The syndrome length? Syndrome length-1?

One article stated that "a 16 bit CRC is good for 4K bits minus one"; I have not figured out how that works out, so I question its accuracy.

I want to CRC my code in Flash; however, I don't want to use a 16-bit CRC if I really should be using a 32-bit CRC. I know the odds of this making any real difference are minuscule, but I never want to give those lawyers an opening.

Andrew Tanenbaum, in Computer Networks, is often quoted on a 16-bit CRC being "99.9998%" good at detecting errors... but how do you calculate these percentages for CRCs of various lengths and, more importantly, for the polynomial in use?

Since we are doing polynomial division, and the CRC is the residue of that division, many different messages will produce the same CRC, which is not what you want. This is why longer CRCs are better over longer bit runs.

From Tanenbaum, in Computer Networks:

  • Detect all single-bit errors.
  • Detect all occurrences of two single-bit errors, for frames less than 2^(n-1) - 1 bits in length.
  • Detect all errors affecting an odd number of bits.
  • Detect all burst errors of length n or less.
  • Detect all but 1/2^(n-1) of burst errors of length n + 1.
  • Detect all but 1/2^n of other errors.

Where n = number of bits in CRC.
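To answer at least part of my own percentage question: the last two rules pin down the undetected fractions exactly, and for n = 16 they work out to the familiar 99.9969% and 99.9985% figures. A quick sanity check (helper names are mine):

```c
/* Detection rates implied by the last two Tanenbaum rules for an
 * n-bit CRC: bursts of length n+1 slip through with probability
 * 1/2^(n-1), and longer error patterns with probability 1/2^n. */
double detect_pct_burst_n_plus_1(int n)
{
    return 100.0 * (1.0 - 1.0 / (double)(1UL << (n - 1)));
}

double detect_pct_longer(int n)
{
    return 100.0 * (1.0 - 1.0 / (double)(1UL << n));
}
```

For n = 16 these give 100 * (1 - 1/32768) = 99.9969...% for 17-bit bursts and 100 * (1 - 1/65536) = 99.9984...% for everything longer.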

See also Algebraic Codes for Data Transmission, Cambridge University Press, 2002.

I've actually spent several years looking for some of these CRC answers, even in real books such as Algebraic Codes for Data Transmission, Cambridge University Press, 2002. The books that I have found are written for people who already understand the math, rather than for people like me who just want to get the job done, and want to cite a reference in the source code.

My random CRC crib notes collected over many years:

  • "Cyclic Codes for Error Detection" by W. Peterson and D. Brown, Proc. IRE, Vol. 49, p. 228, Jan 1961. This is the oldest reference to CRC I could find, and the most obtuse as far as 'getting the work done vs. math'.
  • Error-Correcting Codes, W. Peterson. Cambridge, MA: MIT Press, 1961.
  • Tanenbaum, Andrew. Computer Networks, 128-32. Englewood Cliffs, NJ: Prentice-Hall, 1981.
  • Technical Aspects of Data Communications, by John E. McNamara. Digital Press, Bedford, Mass., 1982.
  • Ramabadran T.V., Gaitonde S.S., "A tutorial on CRC computations", IEEE Micro, Aug 1988.
  • Advanced Data Communication Control Procedure (ADCCP). Federal Register / Vol. 47, No. 105 / Tuesday, June 1, 1982 / Notices
  • CRC-32 (USA) IEEE-802: Polynomial $04C11DB7: x^32 + x^26 + x^23 + x^22 + x^16 + x^12 + x^11 + x^10 + x^8 + x^7 + x^5 + x^4 + x^2 + x + 1
  • $DEBB20E3: PKZIP
  • CRC-CCITT V.41: Polynomial $1021: x^16 + x^12 + x^5 + 1
  • "CRC generators do NOT generate maximal-length sequences. In fact, the polynomials are deliberately chosen to be reducible by the factor X + 1, because that happens to eliminate all odd-bit errors." -- Embedded Systems Programming Jan/1992 Jack Crenshaw
  • 16-Bit CRC can detect:
    • 100% of all single-bit errors
    • 100% of all two-bit errors
    • 100% of all odd numbers of errors
    • 100% of all burst errors less than 17 bits wide
    • 99.9969% of all bursts 17 bits wide
    • 99.9985% of all bursts wider than 17 bits (the same as the checksum)

All burst errors of 16 or fewer bits in length and all double-bit errors separated by fewer than 65,536 bits (or 8192 bytes). Dr. Dobb's Journal, May 1992 Fletcher's Checksum by John Kodis.

For $1021:
"T" $1B26
"THE" $7D8D
"THE,QUICK,BROWN,FOX,01234579" $7DC5

Byte-wise CRC Without a table, Crenshaw 1992 [This is the one I use the most, because it fits in 2K parts, where I rewrote it in C, and AVR ASM.]

B:Byte
CRC:16 Bit unsigned

B := B XOR LO(CRC);
B := B XOR (B SHL 4);
CRC := (CRC SHR 8) XOR (B SHL 8) XOR (B SHL 3) XOR (B SHR 4);

Build Table:

I:Index 0->255
Z:Byte

Z := I XOR (I SHL 4);
Table[I] := (Z SHL 8) XOR (Z SHL 3) XOR (Z SHR 4);

Update CRC using Table:
CRC := (CRC SHR 8) XOR Table[ Data XOR LO(CRC) ];
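Here is my reading of the pseudocode above rendered as C, for both the byte-wise and the table-driven forms. This is the bit-reversed (X.25/Kermit style) use of the $1021 polynomial with the register initialized to $FFFF; the function names are mine:

```c
#include <stddef.h>
#include <stdint.h>

/* Byte-wise CRC update, straight from the Crenshaw pseudocode:
 * reflected $1021 CCITT polynomial, register seeded with $FFFF. */
uint16_t crc_ccitt_update(uint16_t crc, uint8_t b)
{
    b ^= (uint8_t)(crc & 0xFF);
    b ^= (uint8_t)(b << 4);
    return (uint16_t)((crc >> 8) ^ ((uint16_t)b << 8)
                      ^ ((uint16_t)b << 3) ^ (b >> 4));
}

/* The same update via a 256-entry table, built as in the pseudocode:
 * Z := I XOR (I SHL 4); Table[I] := (Z SHL 8) XOR (Z SHL 3) XOR (Z SHR 4). */
static uint16_t crc_table[256];

void crc_build_table(void)
{
    for (unsigned i = 0; i < 256; i++) {
        uint8_t z = (uint8_t)(i ^ (i << 4));
        crc_table[i] = (uint16_t)(((uint16_t)z << 8)
                                  ^ ((uint16_t)z << 3) ^ (z >> 4));
    }
}

uint16_t crc_ccitt_update_tbl(uint16_t crc, uint8_t b)
{
    return (uint16_t)((crc >> 8) ^ crc_table[(uint8_t)(b ^ (crc & 0xFF))]);
}

/* Whole-buffer convenience wrapper. */
uint16_t crc_ccitt(const uint8_t *p, size_t n)
{
    uint16_t crc = 0xFFFF;
    while (n--)
        crc = crc_ccitt_update(crc, *p++);
    return crc;
}
```

Feeding the byte "T" ($54) into a register of $FFFF yields $1B26, matching the crib-note test vectors above, and running the one's-complemented CRC back through the update, low byte first, leaves the $F0B8 magic remainder discussed below.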

"Calculating CRCs by Bits and Bytes" by Greg Morse; Byte Magazine September 1986.

The CRC is one's complemented, then transmitted least-significant byte first. Via a quirk of polynomial syndromes, the resulting magic number will always be $F0B8 if there were no errors. [No math book I've read has even mentioned this, let alone explained it, but it is what I look for in all of my CRC code to tell "good" blocks from "bad".]

T = Dx XOR Rx
U = T7 T6 T5 T4 XOR T3 T2 T1 T0

CRChi = R15 R14 R13 R12 R11 R10 R9 R8
CRClo = R7  R6  R5  R4  R3  R2  R1 R0
Data  = D7  D6  D5  D4  D3  D2  D1 D0
T     = T7  T6  T5  T4  T3  T2  T1 T0
U     = U7  U6  U5  U4  0   0   0  0

Bit   15  14  13  12  11  10   9   8   7   6   5   4   3   2   1   0
#1                                    R15 R14 R13 R12 R11 R10 R9  R8
#2    U7  U6  U5  U4  T3  T2  T1  T0
#3                    U7  U6  U5  U4  T3  T2  T1  T0
#4                                                U7  U6  U5  U4

Line #1 is the CRC shifted right by 8 (CRChi moved into the low byte); line #2 is the byte formed from the high nibble of U and the low nibble of T, placed in bits 15-8; line #3 is that same byte shifted left by 3 bits, into bits 10-3; and line #4 is U shifted right by 4 bits, into bits 3-0. The new CRC is the XOR of all four lines.

If byte is "T" ($54), CRC = $FFFF, then answer should be $1B26.

Cyclic Redundancy Checks:

With a properly constructed 16-bit CRC, an average of one error pattern will not be detected for every 65,535 that would be detected. That is, with CRC-CCITT, we can detect 99.998 percent of all possible errors.

It is precisely this paragraph that led me to ask the original questions:

"It should be noted that CRC polynomials are designed and constructed for use over data blocks of limited size; larger amounts of data will invalidate some of the expected properties (such as the guarantee of detecting any 2-bit errors). For 16-bit polynomials, the maximum designed data length is generally 2^15 - 1 bits, which is just one bit less than 4K bytes. Consequently, a 16-bit polynomial is probably not the best choice to produce a single result representing an entire file, or even to verify a single EPROM device (which are now commonly 8K or more). For this reason, the OS9 polynomial is 24 bits long."

"By some quirk of the algebra, it turns out that if we transmit the complement of the CRC result and then CRC-process that as data upon reception, the CRC register will contain a unique nonzero value depending only upon the CRC polynomial (and the occurrence of no errors). This is the scheme now used by most CRC protocols, and the magic remainder for CRC-CCITT is $1D0F (hex)."

No reference has ever explained this "quirk". $1D0F is more commonly expressed as $F0B8, which is $1D0F with the bit order reversed.

Friday, March 27, 2009

Software Safety blog reader Michael Barr has a new article, Bug-killing standards for firmware coding, on Embedded.com, where he discusses "Ten bug-killing rules". Michael also has his own blog.

There is also an interesting discussion going on in the comments section related to the article.

I even added a comment of my own:

Dale Shpak wrote:

" I have debugged millions of lines of code and have encountered the following type of error many times:

while (condition);
{
    /* Execute conditional code */
}"

If you put this in your .emacs file:

(global-cwarn-mode 1)

Errors such as "if(condition);" and "while(condition);", as well as "if( x = 0 )" type errors are highlighted.

No need to use the One True Brace style when you are using the One True Editor... :-)

Also, MISRA 21.1(a)/2004 requires the use of static-analysis tools, which would never allow the passage of an always-executing "conditional".

MISRA doesn't say much about style. It does say braces will always be used. I say that they should clearly show the nesting. Path coverage testing is hard enough without playing "find the matching brace" (EMACS helps out here too).

Saturday, March 14, 2009

On a personal note, yesterday I received my re-certification papers from the American Society for Quality (ASQ). I am now officially certified for three more years as a Certified Software Quality Engineer (CSQE).

The CMMI Product Team has released Technical Report CMU/SEI-2009-TR-001:
"CMMI for Services (CMMI-SVC) is a model that provides guidance to service provider organizations for establishing, managing, and delivering services. The model focuses on service provider processes and integrates bodies of knowledge that are essential for successful service delivery."

Long hours link to dementia risk

Something I really like from the Agile Methodology is the 40 hour week. BBC News is reporting:
Long working hours may raise the risk of mental decline and possibly dementia, research suggests. The Finnish-led study was based on analysis of 2,214 middle-aged British civil servants. It found that those working more than 55 hours a week had poorer mental skills than those who worked a standard working week. The American Journal of Epidemiology study found hard workers had problems with short-term memory and word recall. "This should say to employers that insisting people work long hours is actually not good for your business" - Professor Cary Cooper, University of Lancaster.

Saturday, February 28, 2009

In C are you a Righty or Lefty?

Do you write your code, like almost everyone does, like this (Those are Zeros if you have a funky font):
if( x == 0 ){...}
or do you do it correctly and do it this way?:
if( 0 == x ){...}
Why is the latter the correct way? It prevents you from making this mistake:
if( x = 0 ){...}
"Unless Debugging is an Obsession", put the constants on the left in any conditional test. Also use a lot of parentheses; you can never have too many parentheses when there is more than one condition in the test. When you put the constant on the left, the compiler will refuse to compile the mistake at all, because you cannot assign a value to a constant. Putting the constant on the right may elicit a warning if you are using a good-quality compiler, and if you're lucky. I've been giving out this advice for years. The responses have been interesting:
"I've never made that mistake. I don't need such crutches." -- AVR GCC List

"It does not read right." -- Well-known compiler guru, in private email.
What is wrong with reading it as "if zero is equivalent to x"? Do you want to ship products on time and under budget, or do you want to write code the way everyone else does?
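A minimal demonstration of what is at stake (my own toy example, not from any shipped product): the mistyped assignment compiles, silently zeroes x, and never takes the branch, while the constant-on-the-left spelling turns the same typo into a hard compile error:

```c
/* The classic slip: "=" typed where "==" was meant. This compiles
 * (some compilers warn, if you're lucky), the branch never runs,
 * and x is quietly destroyed. */
int mistyped_test(int x)
{
    if (x = 0)          /* BUG: assignment, always false */
        return 1;
    return x;           /* x is now 0, whatever was passed in */
}

/* The "Lefty" spelling: mistyping this as "0 = x" will not
 * compile at all, because you cannot assign to a constant. */
int lefty_test(int x)
{
    if (0 == x)
        return 1;
    return 0;
}
```

Calling mistyped_test(5) returns 0, not 5 and not 1: the caller's value is gone and no diagnostic ever fired at run time.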

Wanted: Experienced Embedded System Developer with a Brain

"I am a consultant and I am frequently hired by CEO's and CFO's who are at their whits end with the 'kids' that got hired by the other kids that got the job then decided the lights were brighter and more sparkley someplace else..." --- by FlyingGuy (989135) on SlashDot.org.

That seemed like a good introduction to this real want ad I saw on Craigslist this week. I have all of the experience they are looking for; would you sign up based on this ad (not that I'm looking right now)?:

EE / Embedded Control Hardware / Software Robot Instrumentation

Needed: One damn hot engineer to finish a robotics project for a very established company in East Pittsburgh Area.

This is a full time position but if you are some hot talented Carnegie Mellon University Robotics student we'll consider part time, as long as you perform and deliver ( unlike the previous degreed graduated CMU student.)

This is a robotics project but the robotics are simple. The little robot is designed to carry instrumentation into a tight, hot crack where no instrumentation has gone before.

Personal Requirement:

1.A brain. 2. A watch 3. A cell phone that you answer 4. Ability to give up girl friend for being paid professionally. 5. Working with us professionally between the hours of 7am and 7pm, and not the reverse.

Professional Requirements

An excellent understanding and experience with digital circuit design, layout and interfacing. You had darn well better know how to lay out circuit boards and use a hot air rework station to put down SMD if you have to. You need a full understanding of VLSI circuits as well as discrete circuitry. Motor control and instrumentation associated with robotics. Servo Motors, DC Motors, Step motors Motors, Encoders etc..

A phenomenal understanding of the ATMEL AVR type of chipsets and supporting circuits and an excellent command of the C language used for writing code for those chips. You must have a complete mastery of all of the chips features, A/D, I/O, all TX/RX methods, Counter Timers etc..because they are all in use. Reading the articles in Make Magazine do NOT count. Read the first line again; Phenomenal Understanding.

I would hope that you also have a competent ability to write software in a windows environment for the display of the data the robot sends back. Even if its liberty basic / visual basic that is ok, but we'd prefer a full C++ development environment expertise.

You need to have enough understanding of analog electronics to digitize, transmit, store and display the information as well as the use of DC power supplies and supporting instrumentation such as digital storage oscilloscopes. Don't go getting a funny look on your digital experience face if someone asks you about the impedance of your connection.

You must be able to produce and provide documentation. Schematics, illustrations, photo documentation of progress, component lists etc... so they don't have to be extorted from you if you no longer work for us.

You will be signed up with a non-disclosure and confidentiality agreement. You will have a police & background check performed on you as well as drug testing. No criminal history and no history of drug use. Period.

I personally don't care if you are a student, have a BS, MS, or a Ph.D. What we need is ability and capability along with a high desire (even desperation) to work and finish the project. We are looking or talent and I personally was probably doing assembly language programming and building circuits by hand before you were even dumping in your diapers...and I'll be the chief person interviewing you. Come prepared IF you make it to the interview process.

As with any project, there is a point where it ends but of course...what project have you seen that ever ends. A success of a project always moves to improvements and expansion of that project so there is the very very real potential for this to be full time unending employment. Full professional pay, full benefits, vacation time, medical, house, picket fence, 2.5 kids etc.

The work environment is professional in every sense. Nice office, large lab and work area, new Dell Computers for everything, excellent people to work for and just good natured and nice all around. No jag offs trying to make a joke at your expense. We guarantee that.

So you have a choice. The red pill or the blue pill. If you decide the blue pill than please wake up tomorrow and forget all about this. If you decide the red pill then please send back an email with your interest as well as your resume, experience, links to your website with photos /video of accomplishments/project (not your cat) etc...

The USA is basically in a depression and there millions of people out there with extreme talent looking for professional positions so if you want this position you had better make your submittal good.

Thank you, Steve.

Article: http://pittsburgh.craigslist.org/egr/1049407629.html

Steve sums up my view, and the views of many of today's HR departments. Some of the HR blogs indicate that they have turned into babysitting services, trying to keep newly degreed young people from moving on when they are hit with the least bit of negativity.

Like FlyingGuy in the introduction, I do my own part time consulting gig. I get called in to clean up the mess left by people with lots of letters after their name.

I once went in to clean up a project that had been designed by a committee of people spread all over the world. The unit was large moving equipment; if something went wrong, people might die. It was composed of several different CPU modules communicating on a proprietary bus, and each module's software was written by a different group in a different part of the world.

The operator's requested speed was input in feet per minute. The output to a Variable Frequency Drive was in tenths of a Hertz. The tachometer feedback was in RPM, and to top it off all the internal calculations were done in radians per second.

The first thing I did to get the project back on track was to adopt a standardized variable-naming convention that included the units. For example, the operator request became operator_request_fpm_u16. You then knew immediately that you were dealing with feet per minute, and that it was a 16-bit unsigned variable. After the variable-name cleanup many of the bugs became self-documenting: when you saw something like "operator_request_fpm_u16 / vfd_hz_s32" in the code, you knew there was a problem that needed fixing...
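As an illustration only (the names and the 10:1 scale factor here are invented for the example, not taken from that project), the convention also forces unit crossings through a single named conversion point that a reviewer can check:

```c
#include <stdint.h>

/* Hypothetical example of the units-in-the-name convention. A bogus
 * expression like "operator_request_fpm_u16 / vfd_hz_s32" now reads
 * as nonsense on sight; the only sanctioned crossing between unit
 * systems is a named conversion function like this one. */
int32_t fpm_to_vfd_tenth_hz(uint16_t operator_request_fpm_u16,
                            int32_t tenth_hz_per_fpm_s32)
{
    /* One explicit, reviewable unit conversion. */
    return (int32_t)operator_request_fpm_u16 * tenth_hz_per_fpm_s32;
}
```

With an (assumed) drive gain of 5 tenth-Hz per fpm, a 120 fpm request becomes 600 tenth-Hz, and both the widths and the units are visible at the call site.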

What has been your experiences with hiring people? Do you turn away people with experience in favor of people with degrees?

Sunday, February 1, 2009

Embedded Systems A Volatile Business

At the end of Embedded systems - a volatile business Jack Ganssle says: Bob Paddock sent me the link [Embedded System Compilers generate dangerous code], and thereby wrecked my day.

Always happy to wreck a day, by pointing out Software Safety issues. :-)

I believe that Jack and I see eye-to-eye on most embedded system issues, but I have to disagree in one area. In his column Skip bugging to speed delivery Jack stated: We only inspect new code because there just isn't time to pour over a million lines of stuff inherited in an acquisition.

Development times are always shorter than we want, so not wanting to look at inherited code may seem like a good shortcut to save time. I do not see it that way: just because a bug is old does not mean it should be ignored. The Zune problem that we have already covered here is a perfect example; the Board Support Package was 'inherited', so it seems no one bothered to inspect the code.

Atmel's new TouchLib product shows us another example of code that can't be inspected at all. TouchLib is only available as a binary library file. The source code is not available, even under NDA (I asked), to allow for inspection. In Atmel's view this is their way of protecting their Intellectual Property.

If you take the trouble to actually read the three different TouchLib licenses, from registering, installing and in the TouchLib archive, all three more or less say: "If this product screws up it is not our problem".

Why would I want to use code that I cannot verify as safe and correct? My company is the one that would have to deal with potential warranty issues, calls from angry customers, etc. Should I just tell them, "Sorry, we used software that we got off the Internet for free, but have no idea how it works"? What would your reaction to that be as a customer? I don't think you'd be very happy, and unhappy customers don't come back as paying customers.

Saturday, January 3, 2009

Does Time Keeping become a safety issue during Leap Seconds in a Leap Year?

At the end of 2008 we had the unusual event of a double compound time leap: 2008 was both a Leap Year and ended with a Leap Second.

Wikipedia covers how to calculate Leap Years quite well, so I will not duplicate those very well-known rules here. Sadly, it seems programmers don't know them. After all, as far back as 1886 Christian Zeller came up with Zeller's Congruence to calculate the day of the week, and he got it right without even knowing what a computer was.
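For the record, the rules amount to a few lines of C. This is a sketch of the standard Gregorian rule, not code taken from any of the devices discussed below:

```c
#include <stdbool.h>

/* Gregorian leap-year rule: every 4th year is a leap year, except
 * century years, except every 400th year (so 2000 was, 1900 was not). */
bool is_leap_year(int year)
{
    return (year % 4 == 0 && year % 100 != 0) || (year % 400 == 0);
}
```

So 2008 and 2000 are leap years; 1900 and 2009 are not.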

The event that most people are talking about today is the locking up of Microsoft's Zune 30-gigabyte devices.

While most people are bashing Microsoft for the problem, the problem really lies in the i.MX31 Board Support Package from Freescale. After registering you can download the i.MX31: Multimedia Applications Processor Board Support Package (FSL-WCE500-14-BSP), with full source code.

The now infamous file rtc.c, as reported on many other Zune-related web sites, may be found in WINCE500\PLATFORM\COMMON\SRC\ARM\FREESCALE\PMIC\MC13783\RTC. The lockup comes down to the function ConvertDays: on the 366th day of a leap year neither branch fires, days is never decremented, and the loop spins forever:

 while (days > 365)
 {
     if (IsLeapYear(year))
     {
         if (days > 366)
         {
             days -= 366;
             year += 1;
         }
     }
     else
     {
         days -= 365;
         year += 1;
     }
 }

The inner test should have read "366 == days" or, my preference, "days >= 366". A simple code inspection should have caught a simple-minded bug like this during a design review. We can only assume, based on this code and other questionable constructions in the same file (I did not look at other files), that no such inspections or reviews were done. It is also clear that they didn't even run something as inexpensive as Lint on their code; doing so would have weeded out several of the non-executable paths that are present.
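A sketch of the loop with the one-character repair applied (my reconstruction for illustration, not Microsoft's or Freescale's actual fix; mapping the leftover day count back to a calendar date still needs its own care for the day-366 case):

```c
#include <stdbool.h>

static bool leap(int year)
{
    return (year % 4 == 0 && year % 100 != 0) || (year % 400 == 0);
}

/* ConvertDays with "days >= 366": on day 366 of a leap year the loop
 * now makes progress instead of spinning forever. Returns the year
 * the loop lands on; remainder-to-date handling is a separate issue. */
int convert_days_fixed(int days, int start_year)
{
    int year = start_year;
    while (days > 365)
    {
        if (leap(year))
        {
            if (days >= 366)    /* was "days > 366": hung at 366 */
            {
                days -= 366;
                year += 1;
            }
        }
        else
        {
            days -= 365;
            year += 1;
        }
    }
    return year;
}
```

With the original code, a count of 366 days starting in leap year 2008 (December 31, 2008, the day the Zunes froze) never returns; with the repaired test the loop terminates.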

There have been Leap Second related crashes of various devices beyond the Zune, such as some Linux kernel versions, so what I want to concentrate on is the Leap Second itself. A Leap Second may be inserted or removed to keep the common UTC time standard in sync with the Earth's rotation.

What is more insidious than the rtc code above is that any product based on the MC13783 Power Management and Audio Circuit chip, as used in the crashing Zunes, is doomed from the start: at the hardware level it is impossible to support Leap Seconds correctly. To be fair to Freescale, most second-counting clock chips have the same problem.

"4.1.2.2.1 Time and Day Counters

The real time clock runs from the 32 kHz clock. This clock is divided down to a 1 Hz time tick which drives a 17 bit time of day (TOD) counter. The TOD counter counts the seconds during a 24 hour period from 0 to 86,399 and will then roll over to 0. When the roll over occurs, it increments the 15-bit DAY counter. The DAY counter can count up to 32767 days..."

According to the National Institute of Standards and Technology's publication on Radio Controlled Clocks (page 27), a properly functioning clock would tick from 23:59:59 to 23:59:60, then to 00:00:00 during a Leap Second event. So a single day could contain 86,401 seconds, and a counter counting from zero would have to reach 86,400.

A clock that is not capable of displaying 23:59:60 would show two consecutive displays of 23:59:59.

Leap Seconds do not always add a second; they can also subtract one, so a 'day' could correctly have only 86,399 seconds.
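So the length of a UTC day is really a function of that day's leap-second adjustment, something a fixed 0-to-86,399 hardware counter like the one above has no way to express (a trivial sketch, with the adjustment range assumed from the UTC definition):

```c
/* Seconds in a UTC day given that day's leap-second adjustment:
 * -1 (second removed), 0 (normal day), or +1 (second inserted). */
long seconds_in_utc_day(int leap_adjustment)
{
    return 86400L + leap_adjustment;
}
```

A normal day has 86,400 seconds, an inserted-leap-second day 86,401, and a removed-leap-second day 86,399; only the first of the three fits the MC13783's counter.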

From a software perspective, something like libtai is worth considering. It supports two time scales: (1) TAI64, covering a few hundred billion years with 1-second precision, and (2) TAI64NA, covering the same period with 1-attosecond precision; both are defined in terms of TAI, the current international real-time standard. That is, as long as the Leap Second tables are properly kept up to date, which presents problems of its own.

If you really want to dig into Leap Second issues, check out the LEAPSECS Leap Second Discussion List. And if you are in any way interested in the preciseness of timekeeping, check out the Time Nuts Discussion List at the LeapSeconds site.

I am not aware of any loss of life, or loss of major income, due to any of the Leap Second problems that occurred this time. Will we be able to say the same for the next Leap Second that occurs?