Sunday, February 27, 2011

The Anatomy of a Race Condition: Toyota vs AVR XMega

NASA has released their report on Toyota's sudden acceleration problem. The report indicates that no problems were found with the electronics, hardware or software.
They blame the issue on user error and bad floor mats. As no problem was found, we can be 100% certain that no problems at all exist in the hundreds of thousands of lines of software code in the vehicle's electronics, right?
"Because proof that the ETCS-i caused the reported UAs [Unintended Accelerations] was not found does not mean it could not occur." - pg 17.
"Today's vehicles are sufficiently complex that no reasonable amount of analysis or testing can prove electronics and software have no errors. Therefore, absence of proof that the ETCS-i has caused a UA does not vindicate the system." - pg. 20.
What I find most annoying is that the sections where the embedded system hardware is discussed the most are also the sections with the most redaction (blacked-out text). Why?

If a problem were still lurking, unfound, it could be what is known as a software race condition. What does a software race condition actually look like?

We can find an easy example to pick apart in the AVR-LibC bug tracker, bug#29774: "Prologue/epilogue stack pointer manipulation not interrupt safe in [AVR] XMega".

To understand the problem here, you need a bit of historical background. In AVRs prior to the XMega, when an Enable Interrupt instruction was executed, the instruction following the Enable was guaranteed to execute with interrupts still turned off. In the mists of time someone thought it was a cool hack to save an instruction cycle by restoring half of the stack pointer, enabling interrupts, then restoring the other half of the stack pointer. The problem with such novel hacks is that they invariably come back to bite you in the future.
Like a bad soap opera story, you can probably already see where this is going. In the XMega, when interrupts are enabled, the following instruction is not guaranteed to execute before an interrupt occurs. Now the stage is set for the race condition.

The current generation of XMega parts can run code in a single cycle at up to 32 MHz. That means we have at minimum one window of 1/(32 MHz), or 31.25 nanoseconds, for the software race to happen. In a complex system there is probably more than one interrupt enable happening. To add more pain, the XMega can nest interrupts three levels deep.
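The window arithmetic is simple enough to check in a few lines. This is only a sketch; the function name cycle_time_ns is made up for illustration, and the one-cycle window is the figure assumed in the text above.

```python
def cycle_time_ns(clock_hz: float) -> float:
    """Duration of one CPU clock cycle, in nanoseconds."""
    return 1e9 / clock_hz

# One single-cycle instruction at the XMega's top speed of 32 MHz.
window_ns = cycle_time_ns(32_000_000)
print(f"{window_ns} ns")  # 31.25 ns
```

Running the part at a slower clock widens the window in proportion: at 16 MHz the same single cycle lasts 62.5 ns.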

You see that if an interrupt occurs exactly at the point where interrupts are enabled, only half of the stack pointer has been restored. So the new interrupt saves its registers someplace, and odds are high it is not the right place! The new interrupt eventually returns, tries to restore its registers from someplace that might have been read-only memory, and bang, we are off to the races with a crashed system doing who knows what. Maybe a full open throttle? No message shows up in any logs because no event was logged through a call to the event logging system, as this was never an anticipated event: "systematic software malfunction in the main central processor unit (CPU) that is not detected by the monitor system".
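To make the half-restored stack pointer concrete, here is a minimal Python sketch of what an interrupt would see if it fired after the high byte (SPH) was written but before the low byte (SPL). The function name torn_sp and the two addresses are hypothetical, picked only so that the frame being freed straddles a 256-byte boundary; this is a model of the idea, not the code GCC actually emits.

```python
def torn_sp(old_sp: int, new_sp: int) -> int:
    """Stack pointer seen by an interrupt that lands between the two
    8-bit writes: the high byte is already new, the low byte is still old."""
    return (new_sp & 0xFF00) | (old_sp & 0x00FF)

# Hypothetical addresses: the function is freeing a 0x20-byte frame.
old_sp = 0x20F0   # stack pointer while the frame is still allocated
new_sp = 0x2110   # stack pointer the epilogue is trying to restore

seen = torn_sp(old_sp, new_sp)
print(hex(seen))  # 0x21f0
```

The AVR stack grows downward, so everything above new_sp belongs to the callers. An interrupt that pushes its registers starting at 0x21F0 scribbles over that live data. Note that when the high byte does not change, the torn value is simply old_sp, which is still a valid stack location; in this model the bug only bites when the frame crosses a 256-byte boundary, which makes it even harder to reproduce.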

Because the 31.25 ns race window is so short, a crash may never happen, may happen every 18 hours and 22 minutes, or as often as I win the lottery [Give Wheeling Systems a try]. It could take some certain combination of options and user actions to cause the conditions of enabling interrupts while returning from an interrupt, while getting an interrupt. Turn the radio dial, press the brake pedal while the over-automated headlights turn themselves on, perhaps?
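For a feel of how rare such a hit can be, here is a back-of-the-envelope sketch. Everything in it except the 31.25 ns window is an assumption made up for illustration: interrupts are modeled as arriving at random (a Poisson process), and the two rates are invented numbers, not measurements from any real system.

```python
import math

def mean_seconds_between_hits(restores_per_s: float,
                              interrupts_per_s: float,
                              window_s: float) -> float:
    """Average time between an interrupt landing inside the race window,
    assuming interrupts arrive as a Poisson process (an assumption)."""
    # Probability that at least one interrupt arrives during one window.
    p_hit = -math.expm1(-interrupts_per_s * window_s)
    return 1.0 / (restores_per_s * p_hit)

WINDOW_S = 31.25e-9  # one cycle at 32 MHz

# Assumed rates: 100 vulnerable stack restores and 1000 interrupts per second.
mean_s = mean_seconds_between_hits(100.0, 1000.0, WINDOW_S)
print(f"about {mean_s:.0f} s between hits")
```

With these made-up rates an interrupt lands in the window roughly every five minutes, but not every hit has to end in a visible crash, and real interrupts are often periodic rather than random. Depending on how their timing lines up with the code, the hit rate could be far higher or exactly zero, which is what makes this class of bug so hard to catch in testing.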

While this bug was reported in the AVR-LibC bug tracker, the problem is actually within the AVR port of GCC, specifically in the file gcc-version/gcc/config/avr/libgcc.S.

I fixed my copy of WinAVR-GCC with a hex editor, so my projects would not suffer from this bug. Realistically, how many other people will have done that? Not many, I would guess. It is impossible to tell from the hideous Atmel website (all glitz, no useful information) what the state of the bug truly might be today.

For those who want to fix the problem, the solution is simply to write the lower half of the stack pointer first: "To prevent corruption when updating the Stack Pointer from software, a write to SPL will automatically disable interrupts for up to 4 instructions or until the next I/O memory write". As GCC does not yet natively support the XMega, the XMega features are maintained as a set of patches. Those patches have been updated to fix the problem.
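The effect of the write order can be sketched with a toy model in Python. It is a deliberate simplification and not a real XMega simulator: it assumes an interrupt may be taken before any instruction unless the SPL write lock is active, it treats the SPH write as the "next I/O memory write" that releases the lock, and it ignores the four-instruction limit because the sequence is only two writes long. The function name and addresses are hypothetical.

```python
def sp_values_seen(order, old_sp, new_sp):
    """Return every stack pointer value an interrupt could observe while
    the two 8-bit halves are written in the given order (toy model)."""
    sp = old_sp
    locked = False              # models the XMega SPL write lock
    seen = set()
    for reg in order:
        if not locked:
            seen.add(sp)        # an interrupt may be taken before this write
        if reg == "SPL":
            sp = (sp & 0xFF00) | (new_sp & 0x00FF)
            locked = True       # writing SPL masks interrupts...
        elif reg == "SPH":
            sp = (new_sp & 0xFF00) | (sp & 0x00FF)
            locked = False      # ...until the next I/O memory write
    seen.add(sp)                # the fully restored value
    return seen

old_sp, new_sp = 0x20F0, 0x2110  # hypothetical addresses

high_first = sp_values_seen(["SPH", "SPL"], old_sp, new_sp)
low_first = sp_values_seen(["SPL", "SPH"], old_sp, new_sp)

print(sorted(hex(v) for v in high_first))  # ['0x20f0', '0x2110', '0x21f0']
print(sorted(hex(v) for v in low_first))   # ['0x20f0', '0x2110']
```

In this model, writing the high byte first exposes the bogus intermediate value 0x21F0 to an interrupt, while writing the low byte first only ever exposes the old or the new stack pointer.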

It would be easy for some to say that one should not use Open Source compilers for real production projects, as I've seen a few prominent people state. I have a copy of IAR's AVR compiler, at no small price tag, that I've seen produce complete crap for output. So just because you paid, perhaps a lot of money, for it doesn't mean it is error free.

Some standards require that the code generated by the tool be inspected. At what level of detail is the question. I once actually ran into an assembler that produced correct listings, but the generated .HEX file was wrong. That problem took days to find. Disassembling the generated .HEX file with an independent tool from a different provider is one option; however, it is not always easy to figure out what optimized compiler code is doing in a reasonable amount of time.

What kind of tool problems have you run into?

Now, returning to Toyota. Section tells us that a Renesas (formerly NEC) V850E1 and the Green Hills ISO/ANSI compiler are used for the control software of interest to us. Alas, the section that might shed light on race conditions is completely redacted.

Starting on page 112, tin whiskers become a prominent failure mechanism. Keep in mind that according to page 19 of the report only six vehicles were analyzed. The whisker problem discussed comes only from the accelerator assembly of a seventh vehicle.

Tin whiskers (and other metal whiskers) are such a problem that NASA has given them their own homepage. Tin Whiskering on PCBA Capacitors in Storage by Terry Munson gives a different, but still depressing, view of the tin whisker problem.

The comments about the whiskers over at Circuit Assembly Magazine are also educational, and I recommend that you read them. For example:

Brett Emison pokes several holes in the NASA report in What NASA's Report Said About Toyota Sudden Acceleration, for example:
The NHTSA/NASA report did little to address issues documented by drivers who actually experienced an unexplained sudden unintended acceleration event.
NASA's findings do not solve the question of what caused Kevin Haggerty's well documented sudden acceleration event. Haggerty owned a 2007 Toyota Avalon that experienced at least 5 different sudden acceleration events. Haggerty did not have accessory floor mats and his OEM mats were secured in place. Sticky pedals couldn't have caused the problem because he didn't have his foot on the pedal. On Haggerty's final incident, he was actually able to drive the vehicle while the engine was racing out of control into his local Toyota dealership.
He got to the parking lot, shifted to neutral and stopped the car with its brake smoking and engine racing out of control. He got out of the car and the engine was still racing (no pedal misapplication). Service technicians were able to look at the car and confirm the unintended acceleration was not caused by floor mats, sticking pedals or driver error. They also confirmed no computer error codes (meaning the computer was not detecting whatever was causing the problem).
Bob Landman
LDF Coatings, LLC
I do wonder if NASA applied their Software Safety Guidebook to Toyota's source code? I assume they did not, as it is not listed among the techniques they did apply.

...Tune in tomorrow for another episode of As The Stack Churns...