Showing posts with label Software Race Condition. Show all posts

Sunday, February 27, 2011

The Anatomy of a Race Condition: Toyota vs AVR XMega

NASA has released its report on Toyota's sudden acceleration problem. The report indicates that no problems were found with the electronics, hardware or software.
They blame the issue on user error and bad floor mats. As no problem was found, we can be 100% certain that no problems at all exist in the hundreds of thousands of lines of software code in the vehicle's electronics, right?
"Because proof that the ETCS-i caused the reported UAs [Unintended Accelerations] was not found does not mean it could not occur." - pg 17.
"Today's vehicles are sufficiently complex that no reasonable amount of analysis or testing can prove electronics and software have no errors. Therefore, absence of proof that the ETCS-i has caused a UA does not vindicate the system." - pg. 20.
Something that I find most annoying is that the sections where the embedded system hardware is discussed the most are also the sections with the most redaction (blacked-out text). Why?

If a problem were still lurking, unfound, it could be what is known as a Software Race Condition. What does a software race condition actually look like?

We can find an easy example to pick apart in the AVR-LibC bug tracker, bug#29774: "Prologue/epilogue stack pointer manipulation not interrupt safe in [AVR] XMega".

To understand the problem here, you need a bit of historical background. In AVRs prior to the XMega, when an Enable Interrupt instruction was executed, the instruction following the Enable was guaranteed to execute with interrupts still turned off. In the mists of time someone thought it was a cool hack to save an instruction cycle by restoring half of the stack pointer, enabling interrupts, then restoring the other half of the stack pointer. The problem with such novel hacks is that they invariably come back to bite you in the future.
Like a bad soap-opera story, you can probably already see where this is going. In the XMega, when interrupts are enabled, the following instruction is not guaranteed to execute before an interrupt occurs. Now the stage is set for the race condition.

The current generation of XMega parts can run code in a single cycle at up to 32 MHz. That means we have at minimum one clock period, 1/(32 MHz) or 31.25 nanoseconds, as the window for the software race to happen. In a complex system there is probably more than one interrupt enable happening. To add more pain, the XMega can nest interrupts three levels deep.

You see that if an interrupt occurs exactly at the point where interrupts are enabled, only half of the stack pointer has been restored. So the new interrupt saves its registers someplace; odds are high it is not the right place! The new interrupt eventually returns, tries to restore its registers from someplace that might have been read-only memory, and bang, we are off to the races with a crashed system doing who knows what. Maybe a full open throttle? No message shows up in any logs because there was no event logged through a call to the event logging system, as this was never an anticipated event: "systematic software malfunction in the main central processor unit (CPU) that is not detected by the monitor system".
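
To make the failure concrete, here is a minimal Python sketch of the half-restored stack pointer. It is a model, not AVR code: the function name and the example addresses are made up for illustration, and the only behavior it assumes is the one described above, namely that the 16-bit stack pointer is written as two 8-bit halves with interrupts enabled in between.

```python
def restore_stack_pointer(old_sp, new_sp, interrupt_between_writes):
    """Model a 16-bit stack pointer restored as two 8-bit writes, high half first.

    Returns (final_sp, sp_seen_by_interrupt). The second value is None when
    no interrupt lands in the gap between the two writes.
    """
    sp = old_sp
    sp_seen_by_interrupt = None

    # First write: only the high byte changes.
    sp = (new_sp & 0xFF00) | (sp & 0x00FF)

    # Interrupts are enabled here. If one fires now, the handler pushes its
    # registers at whatever the half-updated stack pointer happens to be.
    if interrupt_between_writes:
        sp_seen_by_interrupt = sp

    # Second write: the low byte catches up.
    sp = (sp & 0xFF00) | (new_sp & 0x00FF)
    return sp, sp_seen_by_interrupt


# Hypothetical addresses: a frame is released and SP moves from 0x20F0 to 0x2110.
final_sp, seen = restore_stack_pointer(0x20F0, 0x2110, interrupt_between_writes=True)
print(hex(final_sp), hex(seen))
```

With these addresses the handler sees 0x21F0, which is neither the old nor the new stack pointer. It sits 0xE0 bytes above the intended 0x2110, so the handler's pushes land on stack contents that the interrupted code still considers live.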

Because the 31.25 ns race window is so short, a crash may never happen, may happen every 18 hours and 22 minutes, or may happen as often as I win the lottery [Give Wheeling Systems a try]. It could take some certain combination of options and user actions to cause the conditions of enabling interrupts while returning from an interrupt, while getting an interrupt. Turn the radio dial, press the brake pedal while the over-automated headlights turn themselves on perhaps?
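
A back-of-the-envelope estimate shows why the failure interval is so unpredictable. The sketch below computes the window from the 32 MHz clock and then, assuming interrupts arrive at random and independently of the code (an assumption of mine, not a measurement), estimates the mean time between hits. The two rates in the example are invented purely for illustration.

```python
def cycle_time_ns(clock_hz):
    """Length of one clock cycle in nanoseconds."""
    return 1e9 / clock_hz


def mean_seconds_between_hits(window_s, vulnerable_sequences_per_s, interrupts_per_s):
    """Rough mean time between an interrupt landing inside a race window.

    Assumes interrupts arrive at random, independently of the code, so the
    chance that one lands in a given window is about interrupts_per_s * window_s.
    """
    hits_per_second = vulnerable_sequences_per_s * interrupts_per_s * window_s
    return 1.0 / hits_per_second


window_ns = cycle_time_ns(32e6)  # one cycle at 32 MHz
# Hypothetical workload: 10 vulnerable sequences and 100 interrupts per second.
seconds = mean_seconds_between_hits(window_ns * 1e-9, 10, 100)
print(f"window = {window_ns} ns, mean time between hits = {seconds / 3600:.1f} hours")
```

With these made-up rates the estimate comes out at about 8.9 hours. Change either rate by a factor of ten and the answer moves from under an hour to several days, which is how such a bug can hide all the way through testing.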

While this bug was actually reported in the AVR-LibC bug tracker, the problem is actually within the AVR port of GCC, specifically the file gcc-version/gcc/config/avr/libgcc.S.

I fixed my copy of WinAVR-GCC with a hex editor, so my projects would not suffer from this bug. Realistically, how many other people will have done that? Not many, I would guess. It is impossible to tell from the hideous Atmel website (all glitz, no useful information) what the state of the bug truly might be today.

For those that want to fix the problem, the solution is to simply write the lower half of the stack pointer first: "To prevent corruption when updating the Stack Pointer from software, a write to SPL will automatically disable interrupts for up to 4 instructions or until the next I/O memory write". As GCC does not yet natively support the XMega, the XMega features are maintained as a set of patches. Those patches have been updated to fix the problem.
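
The effect of the write order can be sketched with a small Python model. It assumes only what the quoted sentence says: a write to SPL holds off interrupts until the next I/O write (here, the write to SPH). The function name, the addresses, and the simplification that interrupts are otherwise enabled throughout are mine, not taken from the GCC patches.

```python
def sp_values_an_interrupt_could_see(old_sp, new_sp, order, spl_write_holds_off=True):
    """List the stack pointer values exposed to an interrupt during an update.

    order is a tuple of "low" and "high" giving the sequence of 8-bit writes.
    Interrupts are modeled as enabled throughout, except while the hold-off
    started by a low-byte (SPL) write is active.
    """
    sp = old_sp
    held_off = False
    exposed = []
    for half in order:
        if half == "low":
            sp = (sp & 0xFF00) | (new_sp & 0x00FF)
            held_off = spl_write_holds_off  # SPL write starts the hold-off
        elif half == "high":
            sp = (new_sp & 0xFF00) | (sp & 0x00FF)
            held_off = False  # the next I/O write ends the hold-off
        else:
            raise ValueError(f"unknown half: {half!r}")
        if not held_off:
            exposed.append(sp)
    if held_off:
        exposed.append(sp)  # the hold-off eventually expires
    return exposed


old_sp, new_sp = 0x20F0, 0x2110  # hypothetical addresses
print([hex(v) for v in sp_values_an_interrupt_could_see(old_sp, new_sp, ("high", "low"))])
print([hex(v) for v in sp_values_an_interrupt_could_see(old_sp, new_sp, ("low", "high"))])
```

Writing the high half first exposes the torn value 0x21F0 before the update completes. Writing the low half first exposes only the fully updated 0x2110, because the hold-off covers the gap between the two writes.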

It would be easy for some to say that one should not use Open Source compilers for real production projects, as I've seen a few prominent people state. I have a copy of IAR's AVR compiler, at no small price tag, that I've seen produce complete crap for output. So just because you paid, perhaps a lot of money, for it doesn't mean it is error free.

Some standards require that the code generated by the tool be inspected. At what level of detail is the question? I once actually ran into an assembler that produced correct listings, but the generated .HEX file was wrong. That problem took days to find. Disassembling the generated .HEX file with an independent tool from a different provider is one option; however, it is not always easy to figure out what optimized compiler code is doing in a reasonable amount of time.

What kind of tool problems have you run into?

Now returning to Toyota. Section 6.7.1.2 tells us that a Renesas (formerly NEC) V850E1 and the Green Hills ISO/ANSI compiler are used for the control software of interest to us. Alas, the section that might shed light on race conditions is completely redacted.

Starting on page 112, tin whiskers become a prominent failure mechanism. Keep in mind that according to page 19 of the report only six vehicles were analyzed. The whisker problem discussed was found only in an accelerator assembly from a seventh vehicle.

Tin whiskers (and other metal whiskers) are such a problem that NASA has given them their own Homepage. Tin Whiskering on PCBA Capacitors in Storage by Terry Munson gives a different, but still depressing, view of the tin whisker problem.

The comments about the whiskers over at Circuit Assembly Magazine are also educational, and I recommend that you read them, for example:

Brett Emison pokes several holes in the NASA report in "What NASA's Report Said About Toyota Sudden Acceleration", for example:
The NHTSA/NASA report did little to address issues documented by drivers who actually experienced an unexplained sudden unintended acceleration event.
NASA's findings do not solve the question of what caused Kevin Haggerty's well documented sudden acceleration event. Haggerty owned a 2007 Toyota Avalon that experienced at least 5 different sudden acceleration events. Haggerty did not have accessory floor mats and his OEM mats were secured in place. Sticky pedals couldn't have caused the problem because he didn't have his foot on the pedal. On Haggerty's final incident, he was actually able to drive the vehicle while the engine was racing out of control into his local Toyota dealership.
He got to the parking lot, shifted to neutral and stopped the car with its brake smoking and engine racing out of control. He got out of the car and the engine was still racing (no pedal misapplication). Service technicians were able to look at the car and confirm the unintended acceleration was not caused by floor mats, sticking pedals or driver error. They also confirmed no computer error codes (meaning the computer was not detecting whatever was causing the problem).
Bob Landman
LDF Coatings, LLC
http://www.ldfcoatings.com
I do wonder if NASA applied their Software Safety Guidebook to Toyota's source code? I assume they did not, as it is not listed among the techniques they did apply.

...Tune in tomorrow for another episode of As The Stack Churns...

Sunday, August 15, 2010

Buggy Toyota Software. Don't they have hills in Japan?

After years of experience with "American" cars, for reasons of reliability and hidden rust (Did they design places for the rust to hide on purpose?), and not wanting to buy from a company that stopped honoring its warranties, my wife and I went for a used low mileage Toyota Van.


After the years of hype about Toyota Reliability I keep running into software bugs.  The annoying kind I could fix if they supplied source code with their vans.


I'm not talking about their well-known sudden acceleration issues, but more everyday issues that are clearly caused by software.


First of all if you put the windows down on the sliding doors, then the doors will not latch into an open position.  The manual says that this is a safety feature.  How is having to race a door to keep it from smashing your hand anytime you unload groceries or load up on spring water at the local spring (Neither being on a level surface) a safety feature?


To make matters worse, the amount that the windows must be down before the doors do not latch is different between the two sides of the van, and the grade that the van is parked upon impacts the latch point as well, so the only real choice is to always remember to put up the windows on beastly hot August days.


Then we have the headlights.  There is a very rigorous sequence of events that must be followed to get the headlights to turn themselves off automatically.  Deviate from that sequence in any way, and you end up with a dead battery.  No chime, or anything else, warns you that your lights are on when you open the door (there is a chime, as it goes off for reasons yet unknown while driving down the road at times, usually related to something about the passenger air bag).


I simply do not comprehend why simple software issues like this even have to exist in our vehicles.


Slashdot has a story on the New Jaguar XJ Suffers Blue Screen of Death as well.


Am I the only person left that wants my vehicle to be a tool to transport me and mine from point A to point B, and not be an Infotainment Center?


Not to leave sudden acceleration issues completely out of a Toyota related post, I noticed this in the 2008 Owner's Manual:





Installation of a mobile two-way radio system


As the installation of a mobile two-way radio system in your vehicle could affect electronic systems such as multiport fuel injection system/sequential multiport fuel injection system, electronic throttle control system, cruise control system, dynamic laser cruise control system, anti-lock brake system, traction control system, vehicle stability control system, SRS airbag system and seat belt pretensioner system, be sure to check with your Toyota dealer for precautionary measures or special instructions regarding installation.



So what do I do about the guy driving next to me with the two-way system, her cell phone in the passing car, the cell phone tower I drive by, or the transmitter from the traffic control system at the intersection?



If Toyota cannot get simple things like headlights and doors correct, what should we think about their ability to handle complex real-time code?

Sunday, May 23, 2010

Counterfeit ESD products. Guide on EMC for Functional Safety.

Interference Technology of ITEM Publications has released their 2010 EMC Directory and Design Guide - Digital Edition. [Alas, the Digital Edition is an annoying FlipBook that does not render correctly in Opera. Using Firefox you can find the link to the PDF version.]

Several interesting articles, but there are two in particular I wanted to bring to your attention:

  • The Dip Tube by Robert J. Vermillion.
  • The IET’s Guide on EMC for Functional Safety by Keith Armstrong.

As for the first one, you've got to be wondering: what's a Dip Tube? Properly rendered as DIP, Dual Inline Package, it makes more sense to you I'm sure: those long plastic rails that our ICs are supplied in from the factory. It seems that not only do we have to worry about counterfeit parts, we now also have to worry about real parts coming in counterfeit anti-static protection, making the real parts unusable junk. The article also offers some words of wisdom on reusing anti-static protection. The words are "don't do it".

The lengthy second article is an introduction to the Guide on EMC for Functional Safety, August 2008, ISBN 978-0-9555118-2-0. Available from The Institution of Engineering and Technology as a PDF, or as a real book (colored chemicals on dead trees). Checklists are found here.

The guide is a 9-step process to functional safety that takes EMC into account. It even includes useful checklists to aid project management, design and compliance assessment.

I'll quote Mr. Armstrong's introduction directly:

"Electronic complexity is increasing with no end in sight, increasing self-generated noise levels, while the feature sizes in silicon integrated circuits continue to shrink, making them emit more noise while at the same time more susceptible to noise. The use of electronics in safety related applications is growing very rapidly indeed, with (once again) no end in sight.

We have already reached the point where the normal testing-based approach to electromagnetic compatibility (EMC) is totally inadequate where safety is concerned, as current media interest in automobiles with malfunctioning 'electronic throttles' shows. [The Toyota problem seems like a classic case of Priority Inversion to me.]

...

It comprehensively describes practical and cost-effective procedures for both management and engineering, and can be used immediately to help to save lives and reduce injuries, whenever electronic technologies are used in safety-implicated products, systems or installations of any kind. It is so practical that it even includes useful checklists to aid project management, design and compliance assessment.

...

The IET Guide can also be used to improve reliability, for example in high-reliability, mission-critical, or legal metrology applications.

...

EMC immunity testing is never sufficient on its own for safety

I hope I have shown that EMC testing can never be sufficient - on its own - to demonstrate that functional safety risks are low-enough, or that risk-reduction will be high enough, over the life-cycle of an EFS, taking its physical and climatic environments (including wear and aging) into account. The number of variables is simply too large. Test plans could be drawn up which would provide the necessary design confidence, but no-one (even governments) could afford their cost, or the very long time they would take. But we’ve been here before! In the 1990s it was realised that testing was not sufficient to demonstrate that software programs were reliable enough for use in safety systems. After many hundreds of man-years of work by academia and industry, the result was Part 3 of IEC 61508."

While on the subject of EMC, Mr. Armstrong is a frequent contributor to the EMC Journal freely available from the Compliance Club. Do check out the past archives of the Journal.