Sunday, March 29, 2009

IEC 60730 Power Up Self-Tests

[April/6th/2010: Updated this entry to add Microchip. Do you know of any others to be added?]

I was asked this week what I knew about "a self test at power up according standard IEC61508". The first thing I can tell you is that Functional safety of electrical/electronic/programmable electronic safety-related systems has a price tag of over $1200! I always find the high prices of these numerous standards extremely frustrating.

In the past I was involved with the creation of the reports Programmable Electronic Mining Systems: Best Practice Recommendations (In Nine Parts) for the Centers for Disease Control (CDC) / National Institute for Occupational Safety and Health (NIOSH) Mining Division. These reports draw heavily from International Electrotechnical Commission (IEC) standard IEC 61508 [IEC 1998a,b,c,d,e,f,g] and other standards. They are in the public domain, and can be found at my hardware site.

The newer standard, IEC 60730, also mandates power up self-tests. You can preview what you are getting for your big bucks here.

The IEC 60730 safety standard for household appliances is designed for automatic electronic controls, to ensure safe and reliable operation of products. I always find it a bit ironic that things like our refrigerators and dishwashers now have more stringent standards than some of the devices that really can kill us.

IEC 60730 segments automatic control products into three different classifications:

  • Class A: Not intended to be relied upon for the safety of the equipment.
  • Class B: To prevent unsafe operation of the controlled equipment.
  • Class C: To prevent special hazards.

Hardware:

  • Independently clocked Watchdog Timer - this provides a safety mechanism to monitor:
    • The flow of the software
    • Interrupt handling & execution
    • CPU clock too fast, too slow and no clock
  • CRC Engine when available - this provides a fast mechanism for:
    • Testing the Flash memory.
    • Check on serial communication protocols such as UART, I2C, SPI.

Software:

  • CPU Register
  • Program Counter
  • Flash CRC using software and/or hardware CRC engines
  • RAM Tests
  • Independent Watchdog Timeout

Safety regulations and their impact on MCUs in home appliances has a short introduction to 60730.

Fortunately for us several companies have implemented IEC 60730 compliant libraries. Listed alphabetically:

What all of these tests fail to address in any meaningful way is what happens when a power up test fails. The best you can hope for is that you have a beeper or LED hooked up directly to a micro pin that you can blink or beep. For example, if you find that your accumulator has a stuck bit, you are hosed at that point. You cannot guarantee that anything you do is going to be correct.

There is also the trade-off between being thorough with exhaustive tests and being fast. To complicate matters even further, some standards such as NFPA mandate that the system must be operational in under one second. I did have a micro one time that had a hardware failure: the XOR instruction was broken, but only on certain bit combinations. Every other aspect of the part worked just fine. It took days to debug that problem. As the micro in question was hard to get and expensive at the time, swapping it first was not an option.

One closing thought is that you need to be very wary of simple RAM tests. Writing 0xAA/0x55 tells you almost nothing about open address lines etc.
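
To make the address-line point concrete, here is a minimal sketch in Python (picked only for readability; the loops port directly to C). Everything in it is made up for illustration: SimRam is a hypothetical RAM model, and an "open" address line is assumed to behave as a bit that always reads 0, which is only one of several ways a real fault can show up.

```python
class SimRam:
    """Hypothetical RAM model; `open_line` forces one address bit to 0."""

    def __init__(self, size, open_line=None):
        self.cells = [0] * size
        # With no fault the mask is all ones, so addresses pass through unchanged.
        self.mask = ~(1 << open_line) if open_line is not None else ~0

    def write(self, addr, value):
        self.cells[addr & self.mask] = value & 0xFF

    def read(self, addr):
        return self.cells[addr & self.mask]


def pattern_test(ram, size):
    """The simple test: write 0xAA, then 0x55, to each cell and read it back."""
    for pattern in (0xAA, 0x55):
        for addr in range(size):
            ram.write(addr, pattern)
            if ram.read(addr) != pattern:
                return False
    return True


def address_test(ram, size):
    """Fill every cell with a value derived from its own address, then verify.

    Assumes size <= 256 so that each 8-bit value is unique.
    """
    for addr in range(size):
        ram.write(addr, addr & 0xFF)
    for addr in range(size):
        if ram.read(addr) != (addr & 0xFF):
            return False
    return True


good = SimRam(256)
bad = SimRam(256, open_line=3)   # address bit 3 never reaches the part
print("pattern test, good RAM:", pattern_test(good, 256))
print("pattern test, bad RAM: ", pattern_test(bad, 256))
print("address test, bad RAM: ", address_test(bad, 256))
```

The pattern test passes on the faulty part because every cell it can reach still stores and returns the pattern; only a test where aliased addresses hold different values notices that two addresses land on the same cell. On a real part with a one second budget you would probably check only the power-of-two addresses rather than every cell.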

Saturday, March 28, 2009

Anyone want to do a term paper on CRCs?

Do you know any Math Majors that need a subject for a term paper?

The Embedded System community needs one written on CRCs that is practical rather than pedagogical like the textbooks that address the subject. Here is a sample paper.

The resulting paper should have practical answers that people in embedded systems land like myself can understand and use. Reading Polynomials over Galois Fields tends to make my eyes glaze over. My speed of Mathematics is more that of Trachtenberg Speed System of Basic Mathematics.

What brought this on today is the new Atmel XMega processor that I'm designing with, which uses the CRC polynomial: x^24 + 4x^3 + 3x +1. That polynomial does not seem to be any of the standard ones, so what are its error detection properties?

Polynomials have to have certain properties; while they may all be primes, not all primes make good CRCs. For example, the properties that make a good CRC polynomial make a very bad random number generator, and vice versa. Both are done with multi-tap shift registers. "CRC generators do NOT generate maximal-length sequences. In fact, the polynomials are deliberately chosen to be reducible by the factor X + 1, because that happens to eliminate all odd-bit errors." -- Embedded Systems Programming Jan/1992 Jack Crenshaw. I admittedly have never understood why the "good ones" are the good ones. More of the math vs. get the work done.

For some background take a look at these papers:

I know that CRC is good only over a certain block length, but what is that block length? The syndrome length? Syndrome length-1?
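
For the two-bit-error guarantee, at least, the block length can be computed rather than looked up. Two flipped bits d positions apart form the error pattern x^j(x^d + 1), and the CRC misses it only when the polynomial divides x^d + 1, so the smallest such d is the limit. The sketch below (Python for readability; order_of_x and has_factor_x_plus_1 are my own helper names, and the usual CRC-CCITT polynomial $1021 is assumed) finds that d by brute force.

```python
def order_of_x(poly):
    """Smallest e > 0 with x**e = 1 (mod poly) over GF(2).

    `poly` includes the top term, e.g. 0x11021 for X^16 + X^12 + X^5 + 1.
    The constant term must be 1, otherwise x never cycles back to 1.
    """
    if poly & 1 == 0:
        raise ValueError("polynomial needs a +1 term")
    top = 1 << (poly.bit_length() - 1)
    r, e = 1, 0
    while True:
        r <<= 1              # multiply by x
        if r & top:
            r ^= poly        # reduce modulo the polynomial
        e += 1
        if r == 1:
            return e


def has_factor_x_plus_1(poly):
    """A GF(2) polynomial is divisible by X + 1 when it has an even number of terms."""
    return bin(poly).count("1") % 2 == 0


CCITT = 0x11021              # $1021 with the implied X^16 term written out
print(order_of_x(CCITT))     # block length in bits, CRC bits included
print(has_factor_x_plus_1(CCITT))
```

For $1021 this comes out at 2^15 - 1 = 32,767 bits, counting the 16 CRC bits themselves, which is one bit short of 4K bytes. The X + 1 factor that catches all odd-bit errors is what costs the other half of the 2^16 - 1 range. This says nothing about the other properties (burst detection, for instance, does not depend on block length).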

One article stated "a 16 bit CRC is good for 4K bits minus one"; I have not figured out how that works out, so I question its accuracy.

I want to CRC my code in Flash, however I don't want to use a 16 bit CRC if I really should be using a 32 bit CRC. I know the odds of this making any real difference are minuscule, but I never want to give those Lawyers an opening.

Andrew Tanenbaum, in Computer Networks, is often quoted talking about 16 bit CRC being "99.9998%" good at detecting.... but how do you calculate these percentages for CRCs of various lengths and, more importantly, for the polynomial in use?

Since we are doing polynomial division and the CRC is the residue of that division, there will be many different messages that have the same CRC, which is not what you want. This is why longer CRCs are better over longer bit runs.

From Tanenbaum, in Computer Networks:

  • Detect all single bit errors.
  • Detect all occurrences of two single-bit errors for frames less than 2^n - 1 bits in length.
  • Detect all errors with an odd number of bits.
  • Detect all burst errors with a length of n or less.
  • Detect all but 1/2^(n-1) of burst errors of length n + 1.
  • Detect all but 1/2^n of other errors.

Where n = number of bits in CRC.
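
The fractions in this list turn into the often-quoted percentages with one line of arithmetic. A small sketch (Python; detected_percent is a name I made up) that just evaluates the list for a given CRC width, under the usual assumption that all error patterns are equally likely:

```python
def detected_percent(n, burst_len=None):
    """Percent of error patterns caught by an n-bit CRC, per the list above.

    burst_len is the burst length in bits; None means "any other error".
    """
    if burst_len is not None and burst_len <= n:
        missed = 0.0                     # every burst of n bits or less is caught
    elif burst_len == n + 1:
        missed = 1 / 2 ** (n - 1)
    else:
        missed = 1 / 2 ** n
    return 100.0 * (1.0 - missed)


for n in (16, 24, 32):
    print(f"{n}-bit CRC: bursts of {n + 1} bits {detected_percent(n, n + 1):.7f}%, "
          f"other errors {detected_percent(n):.7f}%")
```

For n = 16 this gives roughly 99.9969% and 99.9985%. Note that these figures depend only on the width of the CRC; the choice of polynomial shows up in the guaranteed cases (two-bit and odd-bit errors), not in the percentages.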

See also Algebraic Codes for Data Transmission, Cambridge University Press, 2002.

I've spent several years actually looking for some of these CRC answers, even in real books such as Algebraic Codes for Data Transmission, Cambridge University Press, 2002. The books that I have found are already written for people that understand the math, rather than people like me that just want to get the job done, and want to cite a reference in the source code.

My random CRC crib notes collected over many years:

  • "Cyclic code for error detection" by W. Peterson and D. Brown, Proc. IRE, Vol 49, P 228, Jan 1961. This is the oldest reference to CRC I could find, and the most obtuse as far as 'getting the work done vs math'.
  • "Error Correcting Codes" W. Peterson, Cambridge, MA MIT Press 1961.
  • Tanenbaum, Andrew. Computer Networks, 128-32. Englewood Cliffs, NJ Prentice-Hall 1981.
  • "Technical Aspects of Data Communications", by McNamara, John E. Digital Press. Bedford, Mass. 1982
  • Ramabadran T.V., Gaitonde S.S., A tutorial on CRC computations, IEEE Micro, Aug 1988.
  • Advanced Data Communication Control Procedure (ADCCP). Federal Register / Vol. 47, No. 105 / Tuesday, June 1, 1982 / Notices
  • CRC-32 (USA) IEEE-802: Polynomial $04C11DB7: X^32 + X^26 + X^23 + X^22 + X^16 + X^12 + X^11 + X^10 + X^8 + X^7 + X^5 + X^4 + X^2 + X + 1
  • $DEBB20E3: PKZIP
  • CRC-CCITT V.41 Polynomial $1021: X^16 + X^12 + X^5 + 1
  • "CRC generators do NOT generate maximal-length sequences. In fact, the polynomials are deliberately chosen to be reducible by the factor X + 1, because that happens to eliminate all odd-bit errors." -- Embedded Systems Programming Jan/1992 Jack Crenshaw
  • 16-Bit CRC can detect:
    • 100% of all single-bit errors
    • 100% of all two-bit errors
    • 100% of all odd numbers of errors
    • 100% of all burst errors less than 17 bits wide
    • 99.9969% of all bursts 17 bits wide
    • 99.9985% of all bursts wider than 17 bits (the same as the checksum)

Detects all burst errors of 16 or fewer bits in length and all double-bit errors separated by fewer than 65,536 bits (or 8192 bytes). -- "Fletcher's Checksum" by John Kodis, Dr. Dobb's Journal, May 1992.

For $1021:
"T" $1B26
"THE" $7D8D
"THE,QUICK,BROWN,FOX,01234579" $7DC5

Byte-wise CRC Without a table, Crenshaw 1992 [This is the one I use the most, because it fits in 2K parts, where I rewrote it in C, and AVR ASM.]

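B: Byte                 { the incoming data byte }
CRC: 16 Bit unsigned    { running CRC, starts at $FFFF }

B := B XOR LO(CRC);
B := B XOR (B SHL 4);   { B stays 8 bits wide; bits shifted out are dropped }
CRC := (CRC SHR 8) XOR (B SHL 8) XOR (B SHL 3) XOR (B SHR 4);   { B widened to 16 bits before shifting }

Build Table:

I: Index 0->255
Z: Byte

Z := I XOR (I SHL 4);   { kept to 8 bits, as above }
Table[I] := (Z SHL 8) XOR (Z SHL 3) XOR (Z SHR 4);

Update CRC using Table:
CRC := (CRC SHR 8) XOR Table[ Data XOR LO(CRC) ];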
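
As a cross-check of the pseudocode, here is a sketch of the same three routines in Python (used here only because it is easy to run on a desktop; the function names are mine). It adds a plain bit-at-a-time version for comparison, which assumes the update is the bit-reversed form of $1021, i.e. $8408 shifted LSB first.

```python
POLY_REFLECTED = 0x8408   # $1021 with the bit order reversed (assumption, see lead-in)


def crc_update(crc, byte):
    """One data byte through the table-free update."""
    b = (byte ^ crc) & 0xFF          # B := B XOR LO(CRC)
    b ^= (b << 4) & 0xFF             # B := B XOR (B SHL 4), kept to 8 bits
    return ((crc >> 8) ^ (b << 8) ^ (b << 3) ^ (b >> 4)) & 0xFFFF


def build_table():
    """The 256-entry table built from the same expression."""
    table = []
    for i in range(256):
        z = i ^ ((i << 4) & 0xFF)
        table.append(((z << 8) ^ (z << 3) ^ (z >> 4)) & 0xFFFF)
    return table


TABLE = build_table()


def crc_update_table(crc, byte):
    """Table-driven update: one lookup per byte."""
    return (crc >> 8) ^ TABLE[(byte ^ crc) & 0xFF]


def crc_update_bitwise(crc, byte):
    """Reference version: shift right one bit at a time."""
    crc ^= byte
    for _ in range(8):
        crc = (crc >> 1) ^ POLY_REFLECTED if crc & 1 else crc >> 1
    return crc


def crc16(data, update=crc_update, crc=0xFFFF):
    """Run `update` over every byte of `data`, starting from $FFFF."""
    for byte in data:
        crc = update(crc, byte)
    return crc


print(hex(crc16(b"T")))   # 0x1b26
```

All three updates give the same result for every input I tried, and "T" starting from $FFFF gives $1B26. The table version trades 512 bytes of Flash for speed, which is exactly why the table-free one is attractive on 2K parts.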

"Calculating CRCs by Bits and Bytes" by Greg Morse; Byte Magazine September 1986.

CRC is ones complemented, then transmitted least significant byte first. The resulting magic number via a quirk of polynomial syndromes will always be $F0B8 if there were no errors. [No math book I've read has even mentioned it, let alone explained it, but it is what I look for in all of my CRC code for "good" vs "bad" blocks.]

T = Dx XOR Rx
U = (T7 T6 T5 T4) XOR (T3 T2 T1 T0)

CRChi = R15 R14 R13 R12 R11 R10 R9 R8
CRClo = R7  R6  R5  R4  R3  R2  R1 R0
Data  = D7  D6  D5  D4  D3  D2  D1 D0
T     = T7  T6  T5  T4  T3  T2  T1 T0
U     = U7  U6  U5  U4  0   0   0  0

Bit *15  14  13  12  11 *10   9   8   7   6   5   4  *3   2   1   0
#1                                  R15 R14 R13 R12 R11 R10  R9  R8
#2   U7  U6  U5  U4  T3  T2  T1  T0
#3                       U7  U6  U5  U4  T3  T2  T1  T0
#4                                                   U7  U6  U5  U4

Line 1 is CRChi moved into CRClo; line 2 is the high nibble of U and the low nibble of T; line 3 is the line 2 byte shifted left by 3 bits; and line 4 is U shifted right by 4 bits.

If byte is "T" ($54), CRC = $FFFF, then answer should be $1B26.

Cyclic Redundancy Checks:

With a properly constructed 16-bit CRC, an average of one error pattern will not be detected for every 65,535 that would be detected. That is, with CRC-CCITT, we can detect 99.998 percent of all possible errors.

It is precisely this paragraph that led me to ask the original questions:

"It should be noted that CRC polynomials are designed and constructed for use over data blocks of limited size; larger amounts of data will invalidate some of the expected properties (such as the guarantee of detecting any 2-bit errors). For 16-bit polynomials, the maximum designed data length is generally 2^15 - 1 bits, which is just one bit less than 4K bytes. Consequently, a 16-bit polynomial is probably not the best choice to produce a single result representing an entire file, or even to verify a single EPROM device (which are now commonly 8K or more). For this reason, the OS9 polynomial is 24 bits long."

"By some quirk of the algebra, it turns out that if we transmit the complement of the CRC result and then CRC-process that as data upon reception, the CRC register will contain a unique nonzero value depending only upon the CRC polynomial (and the occurrence of no errors). This is the scheme now used by most CRC protocols, and the magic remainder for CRC-CCITT is $1D0F (hex)."

No reference has ever explained this "quirk". $1D0F is more commonly expressed as $F0B8, reverse bit order.
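
One way to look at the "quirk", offered as my own reading rather than anything from the references: the CRC update is linear, and running the uncomplemented CRC back through the register always leaves zero. Sending the complement instead is the same as XORing sixteen 1 bits into that stream, so what is left is whatever sixteen 1 bits produce from a zeroed register, a constant that depends only on the polynomial. A sketch in Python (helper names are mine; the byte-wise update is the table-free one from the crib notes):

```python
def crc16_update(crc, byte):
    """Table-free CRC-CCITT update, LSB first."""
    b = (byte ^ crc) & 0xFF
    b ^= (b << 4) & 0xFF
    return ((crc >> 8) ^ (b << 8) ^ (b << 3) ^ (b >> 4)) & 0xFFFF


def crc16(data, crc=0xFFFF):
    for byte in data:
        crc = crc16_update(crc, byte)
    return crc


GOOD_CRC = 0xF0B8


def add_crc(payload):
    """Append the ones-complemented CRC, least significant byte first."""
    fcs = crc16(payload) ^ 0xFFFF
    return bytes(payload) + bytes([fcs & 0xFF, fcs >> 8])


def block_ok(block):
    """Run the CRC over payload plus CRC bytes and look for the magic number."""
    return crc16(block) == GOOD_CRC


def bit_reverse16(value):
    """Reverse the order of the 16 bits."""
    return int(f"{value:016b}"[::-1], 2)


print(block_ok(add_crc(b"T")))                 # True
print(hex(crc16(b"\xff\xff", crc=0)))          # 0xf0b8
print(hex(bit_reverse16(0x1D0F)))              # 0xf0b8
```

The middle line is the whole trick: two $FF bytes through a zeroed register give $F0B8 no matter what the payload was.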

Friday, March 27, 2009

Software Safety blog reader Michael Barr has a new article, "Bug-killing standards for firmware coding", on Embedded.com, where he discusses "Ten bug-killing rules". Michael also has his own blog.

There is also an interesting discussion going on in the comments section related to the article.

I even added a comment of my own:

Dale Shpak wrote:

"I have debugged millions of lines of code and have encountered the following type of error many times:

while (condition);
{
    /* Execute conditional code */
}"

If you put this in your .emacs file:

(global-cwarn-mode 1)

Errors such as "if(condition);" and "while(condition);", as well as "if( x = 0 )" type errors are highlighted.

No need to use the One True Brace style when you are using the One True Editor... :-)

Also, MISRA 21.1(a)/2004 requires the use of static analysis tools, which would never allow an always-executing "conditional" to pass.

MISRA doesn't say much about style. It does say braces will always be used. I say that they should clearly show the nesting. Path coverage testing is hard enough without playing "find the matching brace" (EMACS helps out here too).

Saturday, March 14, 2009

On a personal note, yesterday I received my re-certification papers from the American Society for Quality (ASQ). I am now officially certified for three more years as a Certified Software Quality Engineer (CSQE).

The CMMI Product Team has released Technical Report CMU/SEI-2009-TR-001:

"CMMI for Services (CMMI-SVC) is a model that provides guidance to service provider organizations for establishing, managing, and delivering services. The model focuses on service provider processes and integrates bodies of knowledge that are essential for successful service delivery."

Long hours link to dementia risk

Something I really like from the Agile Methodology is the 40 hour week. BBC News is reporting:

"Long working hours may raise the risk of mental decline and possibly dementia, research suggests. The Finnish-led study was based on analysis of 2,214 middle-aged British civil servants. It found that those working more than 55 hours a week had poorer mental skills than those who worked a standard working week. The American Journal of Epidemiology study found hard workers had problems with short-term memory and word recall."

"This should say to employers that insisting people work long hours is actually not good for your business" - Professor Cary Cooper, University of Lancaster