Sunday, February 27, 2011

The Anatomy of a Race Condition: Toyota vs AVR XMega

NASA has released their report on Toyota's sudden acceleration problem. The report indicates that there were no problems found with the electronics, hardware or software.
They blame the issue on user error and bad floor mats. As no problem was found, we can be 100% certain that no problems at all exist in the hundreds of thousands of lines of software code in the vehicles' electronics, right?
"Because proof that the ETCS-i caused the reported UAs [Unintended Accelerations] was not found does not mean it could not occur." - pg 17.
"Today's vehicles are sufficiently complex that no reasonable amount of analysis or testing can prove electronics and software have no errors. Therefore, absence of proof that the ETCS-i has caused a UA does not vindicate the system." - pg. 20.
Something that I find most annoying is that the areas where the embedded system hardware is discussed the most are also the areas with the most redaction (blacked out sections). Why?

If a problem were still lurking, unfound, it could be what is known as a Software Race Condition. What does a software race condition actually look like?

We can find an easy example to pick apart in the AVR-LibC bug tracker, bug#29774: "Prologue/epilogue stack pointer manipulation not interrupt safe in [AVR] XMega".

To understand the problem here, you need a bit of historical background. In AVRs prior to the XMega, when an Enable Interrupt instruction was executed, the instruction following the Enable was guaranteed to execute with interrupts still turned off. In the mists of time someone thought it was a cool hack to save an instruction cycle by restoring half of the stack pointer, enabling interrupts, then restoring the other half of the stack pointer. The problem with such novel hacks is they invariably come back to bite you in the future.
Like a bad Soap-Opera story, you can probably already see where this is going. In the XMega, when interrupts are enabled, the following instruction is not guaranteed to execute before an interrupt occurs. Now the stage is set for the race condition.

The current generation of XMega parts can execute most instructions in a single cycle at up to 32 MHz. That means we have, at minimum, a window of one cycle, 1/(32 MHz) or 31.25 nanoseconds, for the software race to happen. In a complex system there is probably more than one interrupt enable happening. To add more pain, the XMega can nest interrupts three levels deep.
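As a quick check of the arithmetic above, here is a minimal C sketch. The function names and the figure of 1,000 interrupt enables per second are illustrative assumptions, not measurements from any real system.

```c
#include <assert.h>
#include <stdint.h>

/* Length of one clock cycle in picoseconds for a clock given in hertz. */
uint64_t cycle_time_ps(uint64_t clock_hz)
{
    assert(clock_hz > 0);
    return UINT64_C(1000000000000) / clock_hz;
}

/* Total time per second, in picoseconds, spent inside a one-cycle race
 * window, given how often the risky sequence runs (hypothetical rate). */
uint64_t exposure_ps_per_second(uint64_t clock_hz, uint64_t enables_per_second)
{
    return cycle_time_ps(clock_hz) * enables_per_second;
}
```

At 32 MHz, cycle_time_ps(32000000) gives 31250 ps, which is 31.25 ns. With the assumed 1,000 enables per second, the system would spend only 31.25 microseconds of every second inside the window, which is why a bug like this can hide for a very long time.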

You see that if an interrupt occurs exactly at the point where interrupts are enabled, only half of the stack pointer has been restored. So the new interrupt saves its registers someplace, and odds are high it is not the right place! The new interrupt eventually returns, tries to restore its registers from someplace that might have been read-only memory, and bang, we are off to the races with a crashed system doing who knows what. Maybe a full open throttle? No message shows up in any logs because there was no event logged through a call to the event logging system, as this was never an anticipated event: "systematic software malfunction in the main central processor unit (CPU) that is not detected by the monitor system".
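The half-restored stack pointer can be modeled on a desktop machine. The following is a simplified simulation in C, not the actual GCC prologue/epilogue code; the type sim_cpu, the function names, and the example addresses are all hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Simulated 16-bit stack pointer held in two 8-bit registers. */
typedef struct {
    uint8_t sph; /* high byte */
    uint8_t spl; /* low byte  */
} sim_cpu;

uint16_t sim_sp(const sim_cpu *cpu)
{
    assert(cpu != NULL);
    return (uint16_t)(((uint16_t)cpu->sph << 8) | cpu->spl);
}

/* Legacy ordering: write the high byte, re-enable interrupts, write the
 * low byte. If irq_in_window is nonzero, an interrupt is taken in the gap
 * between the two writes. Returns the stack pointer value that the
 * interrupt handler would start pushing registers to. */
uint16_t restore_sp_legacy(sim_cpu *cpu, uint16_t new_sp, int irq_in_window)
{
    uint16_t seen_by_isr;

    assert(cpu != NULL);
    cpu->sph = (uint8_t)(new_sp >> 8);
    /* Interrupts are enabled again at this point in the model. */
    seen_by_isr = sim_sp(cpu);          /* torn: new high, old low */
    cpu->spl = (uint8_t)(new_sp & 0xFF);
    if (!irq_in_window) {
        seen_by_isr = sim_sp(cpu);      /* whole: new high, new low */
    }
    return seen_by_isr;
}
```

In this model, moving the stack pointer from 0x2FF8 to 0x3008 with an interrupt in the window hands the handler 0x30F8, an address that belongs to neither the old nor the new stack frame. If the high byte does not change, the torn value equals the old stack pointer, which in this model makes the failure rarer still.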

Due to the short length of the 31.25 ns race window, a crash may never happen, may happen every 18 hours and 22 minutes, or as often as I win the lottery [Give Wheeling Systems a try]. It could take a certain combination of options and user actions to cause the conditions of enabling interrupts while returning from an interrupt, while getting an interrupt. Turn the radio dial, press the brake pedal while the over-automated headlights turn themselves on, perhaps?

While this bug was reported in the AVR-LibC bug tracker, the problem is actually within the AVR port of GCC, specifically in the file gcc-version/gcc/config/avr/libgcc.S.

I fixed my copy of WinAVR-GCC with a hex editor, so my projects would not suffer from this bug. Realistically, how many other people will have done that? Not many, I would guess. It is impossible to tell from the hideous Atmel website (all glitz, no useful information) what the state of the bug truly might be today.

For those that want to fix the problem, the solution is to simply write the lower half of the stack pointer first: "To prevent corruption when updating the Stack Pointer from software, a write to SPL will automatically disable interrupts for up to 4 instructions or until the next I/O memory write". As GCC does not yet natively support the XMega, the XMega features are maintained as a set of patches. Those patches have been updated to fix the problem.
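The effect of writing the low byte first can be sketched with a small simulation in C. This is a model of the quoted datasheet behavior only, not the patched libgcc.S; sim_xmega, sp_seen_low_first and the three-step numbering are hypothetical names and simplifications.

```c
#include <assert.h>
#include <stdint.h>

/* Simulated XMega stack pointer plus the hardware interrupt hold-off. */
typedef struct {
    uint8_t sph;
    uint8_t spl;
    int irq_masked; /* nonzero while the hardware holds off interrupts */
} sim_xmega;

static uint16_t sp_of(const sim_xmega *c)
{
    return (uint16_t)(((uint16_t)c->sph << 8) | c->spl);
}

/* Low-byte-first ordering. irq_step says when the interrupt request
 * arrives in this model:
 *   0 = before any write, 1 = after the SPL write, 2 = after the SPH write.
 * Returns the stack pointer value the handler sees once it actually runs. */
uint16_t sp_seen_low_first(uint16_t old_sp, uint16_t new_sp, int irq_step)
{
    assert(irq_step >= 0 && irq_step <= 2);
    sim_xmega c = { (uint8_t)(old_sp >> 8), (uint8_t)(old_sp & 0xFF), 0 };

    if (irq_step == 0) {
        return sp_of(&c);               /* taken before the sequence starts */
    }
    c.spl = (uint8_t)(new_sp & 0xFF);
    c.irq_masked = 1;                   /* SPL write starts the hold-off */
    /* A request arriving at step 1 stays pending while masked, so the
     * torn value (old high byte, new low byte) is never visible. */
    c.sph = (uint8_t)(new_sp >> 8);
    c.irq_masked = 0;                   /* the next I/O write ends it */
    return sp_of(&c);
}
```

For a move from 0x2FF8 to 0x3008, the handler in this model sees either 0x2FF8 or 0x3008 and never the torn 0x2F08, whichever step the request arrives at.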

It would be easy for some to say that one should not use Open Source compilers for real production projects, as I've seen a few prominent people state. I have a copy of IAR's AVR compiler, at no small price tag, that I've seen produce complete crap for output. So just because you paid, perhaps a lot of money, for it doesn't mean it is error free.

Some standards require that the code generated by the tool be inspected. The question is, at what level of detail? I once actually ran into an assembler that produced correct listings; however, the generated .HEX file was wrong. That problem took days to find. Disassembling the generated .HEX file with an independent tool from a different provider is one option; however, it is not always easy to figure out what optimized compiler code is doing in a reasonable amount of time.

What kind of tool problems have you run into?

Now, returning to Toyota. Section tells us that a Renesas, formerly NEC, V850E1 and the Green Hills ISO/ANSI Compiler are used for the control software of interest to us. Alas, the section that might shed light on Race Conditions is completely redacted.

Starting on page 112, Tin Whiskers become a prominent failure mechanism. Keep in mind that according to page 19 of the report only six vehicles were analyzed. The whisker problem discussed came from a seventh vehicle's accelerator assembly only.

Tin (and Other Metal) Whiskers are such a problem that NASA has given them their own Homepage. Tin Whiskering on PCBA Capacitors in Storage by Terry Munson gives a different, but still depressing, view of the Tin Whisker problem.

The comments about the whiskers over at Circuit Assembly Magazine are also educational, and I recommend that you read them. For example:

Brett Emison pokes several holes in the NASA report in What NASA's Report Said About Toyota Sudden Acceleration, for example:
The NHTSA/NASA report did little to address issues documented by drivers who actually experienced an unexplained sudden unintended acceleration event.
NASA's findings do not solve the question of what caused Kevin Haggerty's well documented sudden acceleration event. Haggerty owned a 2007 Toyota Avalon that experienced at least 5 different sudden acceleration events. Haggerty did not have accessory floor mats and his OEM mats were secured in place. Sticky pedals couldn't have caused the problem because he didn't have his foot on the pedal. On Haggerty's final incident, he was actually able to drive the vehicle while the engine was racing out of control into his local Toyota dealership.
He got to the parking lot, shifted to neutral and stopped the car with its brake smoking and engine racing out of control. He got out of the car and the engine was still racing (no pedal misapplication). Service technicians were able to look at the car and confirm the unintended acceleration was not caused by floor mats, sticking pedals or driver error. They also confirmed no computer error codes (meaning the computer was not detecting whatever was causing the problem).
Bob Landman
LDF Coatings, LLC
I do wonder if NASA applied their Software Safety Guidebook to Toyota's source code? I assume they did not, as it is not listed among the techniques they did apply.

...Tune in tomorrow for another episode of As The Stack Churns...

Tuesday, February 22, 2011

Test Driven Development with Embedded C - Discount through Friday Feb 28th

We need just three more people to bring James Grenning to Cleveland next month.

If you were on the fence about signing up, there is now a twenty percent discount if you sign up before 4PM EST on the 28th of February. Enter the code "FENEO", for Firmware Engineers of Northeast Ohio (FENEO).

See my last blog entry Test Driven Development Embedded C with James Grenning March 22nd to 24th in Cleveland for details.

Saturday, February 19, 2011

Test Driven Development Embedded C with James Grenning March 22nd to 24th in Cleveland. Be there!

Late last summer James Grenning was scheduled to give his class on Test Driven Development for Embedded C. Alas not enough people signed up to get him all the way to Cleveland. This year he is trying again.

His course is scheduled for March 22-24, 2011 at a cost of $1495. If you plan on attending, please register ASAP so we can make sure there are enough students to justify bringing James to Cleveland.

I covered James's new book Test Driven Development for Embedded C [Nov./2010] in my blog about Makefile tip #0 on automatic serial numbers to be embedded in C code.

Test Driven Development is a powerful technique for building embedded software. This hands-on course teaches the practice of Test Driven Development in the challenging environment of C. In this course you will learn how TDD helps overcome some of the challenges embedded developers face including: unpredictable schedules, poor quality, and the problems that follow. In addition, embedded software developers must conquer the realities of concurrent hardware/software development, scarce target hardware availability, long download times, high deployment costs, as well as the challenges of testing embedded C.

TDD leads to better designs, towards more object oriented approaches to C. In this class you will also learn some of the design principles that can help to guide engineers to better designs.

Most of you have existing legacy code. In this class you will learn valuable techniques for dealing with legacy code. You will see incremental approaches to getting control of the legacy code with tests, making improvements to the design less risky.

Test-Driven Development, a key agile practice, helps software developers improve schedule predictability and product quality and can do the same for embedded developers. TDD is valuable even outside of agile development methods.

This course describes the problems addressed by TDD, as well as the additional challenges and benefits of applying it to embedded software. You will learn the test driven techniques as well as specific design approaches to make your C code testable today, maintainable tomorrow, and ready for a long useful life.

This course will get you and your team well on the way to applying TDD for Embedded C in your development efforts.

Friday, February 11, 2011

Columbiana County 36 inch gas line rupture. Is three a trend?

Yesterday I reported on a gas explosion in Allentown, Pennsylvania. The total of damaged houses there is now up to 47.

Today a thirty-six inch gas line ruptured near Hanoverton, Ohio. The gas pipeline explosion rocked the entire county of Columbiana; reports say it was felt as far away as Pittsburgh, PA.

Fox8 News has video of the event.

We now have three major gas explosions in less than 30 days. I don't like this trend. Will my neighborhood be next? Maybe yours?

Are we looking at a systemic failure of technology?

Thursday, February 10, 2011

Allentown explosion and fire. Another frozen sensor?

Today in Allentown, Pennsylvania, two houses were leveled and six damaged beyond repair. Five people are known dead.

Allentown Police Captain George Medero has said that the fire looked to be the result of a gas explosion.

Let's pray that we are not seeing a trend of gas company sensors that are freezing up and causing pressure regulators to go bonkers.

Wednesday, February 9, 2011

Tantalum bust in Goma, Congo. Stock up on capacitors! Before the movie??

British Broadcasting Corporation correspondent Mark Doyle posted an entry on his blog that is relevant to anyone designing Embedded Systems power supplies. Mark reports that "The President of Congo, Joseph Kabila, recently ordered a ban on mining in the area". The area he is referring to contains the Gold and, more importantly to us, Tantalum mines.

Mark's online summary does not mention it; however, the actual BBC report I heard on the radio yesterday morning specifically mentioned Tantalum "used in mobile phones". It seems the Bad Guys were smuggling Tantalum, but as no one knows what that is, they call it Gold and "other minerals". The on-air report also said this whole episode had the makings of a good Spy Movie, complete with the takedown of the Bad Guys on the airport tarmac.

Tantalum Capacitors are used as part of the power supply regulators in many devices. They have several unique properties, such as a high capacitance-to-volume ratio and an Equivalent Series Resistance (ESR) that falls within the Goldilocks Zone of not too low and not too high, which keeps the regulators from oscillating.

Newer regulators are stable with Ceramic Capacitors. The problem is supporting the legacy designs that cannot be changed due to the acronym agency paperwork. Pick the one of your choice: FDA, MSHA, FCC, UL etc... :-(

Saturday, February 5, 2011

Want innovation? Then get out of the way!

Do you know what I find so irritating about today's economy? The White House, and Big Boys like Intel are playing with hundreds of millions of dollars to spur market based innovation.

'They' just don't seem to get that the current lack of innovation in the country is due to small business and entrepreneurs being crushed under burdensome paperwork, unfunded government mandates, environmental regulations that lack scientific foundation or common sense, the unknown costs of confusing health care regulations, and obscure IRS regulations.

Did you know that the very people being asked to be innovative are actually singled out in the Tax Code:

  • Engineer
  • Designer
  • Drafter
  • Computer programmer
  • Systems analyst
  • or other similarly skilled worker
for special punishment by the IRS? Whatever happened to "Fairness for all"? The actual suicide note of consultant and programmer Joe Stack introduces us to the issue created by the Tax Reform Act of 1986; see also the Small Business Job Protection Act of 1996, and the Pension Protection Act of 2006.

The tax code issue of who is an "Employee", in the purview of the IRS, is so confusing that the IRS has tried to clarify the issue multiple times, such as in PRESENT LAW AND BACKGROUND RELATING TO WORKER CLASSIFICATION FOR FEDERAL TAX PURPOSES; 2007 and SECTION 530: ITS HISTORY AND APPLICATION IN LIGHT OF THE FEDERAL DEFINITION OF THE EMPLOYER-EMPLOYEE RELATIONSHIP FOR FEDERAL TAX PURPOSES; National Association of Tax Reporting and Professional Management; 2009.

In my personal view, until the IRS is replaced with one of the many proposals for a Flat Tax (The harder you work, the more 'They' take is not much of a motivation to be innovative in the current system), and Congress and other politicians are forced to follow the laws that they create for the rest of us (We get their health care plan or they get ours), things are not going to get any better. No one wins in the Race to the Bottom...

To see what the roadblocks to being innovative are, I would like to see one of the president's young daughters (I do know they are too young to do this, but it makes the point) go to a random city in each state (to see the difference of regulations between states) and open a business that does something simple like sell Ice Cream Cones. Along the way they document each permit, each fee, and each regulation that they must comply with, be it local, state or Federal, before they sell even their first Cone. Then for the next year they document each interaction with the Government for taxes, permits, and fees, along with those costs. After that they try to open a more complex business, like a circuit board manufacturer, where lots of environmental rules come into play. Then maybe 'They' would understand the real world of normal business people, that is, those outside of political circles, who want to change the world but are too busy shuffling Government forms.

Strategy for American Innovation: Promote Market-Based Innovation, Startup America, Ice Cream, Joe Stack, Computer Consultant, Suicide Note, IRS Section 530 1706, Intel

Who controls us? Our technology?

In a Free Software Foundation appeal for donations, Technological power should be held by all users of a technology, Benjamin Mako Hill, who is on the board of directors of the FSF, states "Control over technology is power". Past events in Egypt are a good example of that.

Turning the control of technology issue around, on January 24th 2011, the community of Fairport Harbor in Lake County, located alongside Lake Erie northeast of Cleveland, Ohio, literally exploded.

3,200 people suddenly found that their lives were controlled by the technology of the Gas Company. Authorities say built-up pressure in natural gas lines led to a house explosion and then a series of fires.

Various reports say that Matt Butler, a spokesman with the Public Utilities Commission of Ohio, says ice that formed in a sensor line caused a gas pressure regulator to fail when the outside temperature was 7°F.

What I wonder is if this sensor system was designed by some well educated person sitting in a warm office in a warm climate, who has never seen a real gas well, while any country hick from the Oil Patch knows that the moisture in Natural Gas condenses when it is cold?

Environmental factors must always be considered in any infrastructure design. Maybe it has never been 7°F where a system is to be deployed, but can you say with certainty Mother Nature won't change her mind about that next year or in ten years? Redundancy is not always a bad thing when it comes to failure-prone sensors and technology.
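One common form of sensor redundancy is a median-of-three voter, sketched below in C. This is only an illustration of the idea; nothing here describes how the gas company's regulators actually work, and the function names and pressure values are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Median of three readings: one wildly wrong sensor cannot pull the
 * result outside the range of the two sensors that still agree. */
int32_t median3(int32_t a, int32_t b, int32_t c)
{
    int32_t t;

    if (a > b) { t = a; a = b; b = t; }
    if (b > c) { t = b; b = c; c = t; }
    if (a > b) { t = a; a = b; b = t; }
    return b;
}

/* Vote across three redundant pressure readings (hypothetical units). */
int32_t vote_pressure(const int32_t readings[3])
{
    assert(readings != NULL);
    return median3(readings[0], readings[1], readings[2]);
}
```

If one of three sensors reads 0 because its line has iced up while the other two read 50 and 52, the voter reports 50. The catch is that voting only helps when the sensors do not share a cause of failure; three sensor lines that all freeze on the same cold night still fail together.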