Saturday, February 27, 2010

Explaining Interrupts to Non-Programmers, or "I got go potty!"

I needed to explain interrupts to people who know nothing about how computers work. Can you think of a better example?

EMI (electromagnetic interference) has been getting a lot of the blame for the Toyota problems. I speculate that the problem is something even more insidious: an interrupt race condition.

Designing for EMI compliance is not new, as this simple and archaic introduction to the subject, from 1999, shows: http://www.zilog.com/docs/appnotes/an_noise_imm.pdf

What those of us in the embedded systems industry want to see is the Toyota source code. The standard corporate argument is that this would be a "trade secret" and would lead to a loss of revenue if made available. Perhaps. But in Toyota's case they have already lost trust and a substantial amount of revenue. The only way to regain trust is an independent analysis of their software.

There are experts in such systems, like myself, who know how to look for things like improperly handled timer overflow interrupts and other race conditions. Looking at the source code is what needs to be done to put the question of a software problem to rest. What we lack is the source code from Toyota to analyze.

Even without EMI causing problems for embedded systems (ECUs), there are other, more subtle ways that things can go wrong with software that is not correctly written.

Any modern system has things known as "interrupts" running. The timing of interrupts can be tricky for those without experience. You could unknowingly create a window, a single instruction wide and lasting only nanoseconds, in which an interrupt that falls in *exactly* the wrong place makes things go bad with no recovery. It may take exactly the right sequence of events to land in that aberrant timing window, which makes the failure *extremely* hard to reproduce.
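For readers who do program, here is a minimal sketch of one such window, using the timer overflow case mentioned earlier. It is a toy simulation in Python, not real ECU code and certainly not Toyota's: the names SimulatedTimer, read_unsafe, and read_safe are made up for illustration, and the "interrupt" is just a function call placed at the worst possible moment. The assumption is an 8-bit hardware counter whose wrap-around is counted by an overflow interrupt.

```python
class SimulatedTimer:
    """Toy model: an 8-bit hardware counter plus a software overflow count."""

    def __init__(self, low=0, overflows=0):
        self.low = low              # "hardware" counter, wraps at 256
        self.overflows = overflows  # incremented by the overflow "interrupt"

    def tick(self):
        """Advance one count; on wrap-around, run the overflow interrupt."""
        self.low = (self.low + 1) % 256
        if self.low == 0:
            self.overflows += 1     # the interrupt handler's only job


def read_unsafe(timer, interrupt_in_window=False):
    """Read the high part, then the low part, with no protection."""
    high = timer.overflows
    if interrupt_in_window:
        timer.tick()                # the timer advances between the two reads
    low = timer.low
    return high * 256 + low


def read_safe(timer, interrupt_in_window=False):
    """Re-read the high part until it is unchanged around the low read."""
    while True:
        high = timer.overflows
        if interrupt_in_window:
            timer.tick()
            interrupt_in_window = False  # the event happens only once
        low = timer.low
        if high == timer.overflows:
            return high * 256 + low


# One count before overflow, the true time is 255 and about to become 256.
print(read_unsafe(SimulatedTimer(low=255), interrupt_in_window=True))  # 0
print(read_safe(SimulatedTimer(low=255), interrupt_in_window=True))    # 256
# Anywhere else in the count, the same interruption is harmless.
print(read_unsafe(SimulatedTimer(low=10), interrupt_in_window=True))   # 11
```

The unprotected read is wrong only when the counter sits at 255 at the instant of the read, which is why a bug like this can survive a great deal of testing. Re-reading until the value is stable is one common remedy; briefly disabling interrupts around the read is another.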

Let's put this in a non-programming perspective:

Let's say you and a group of friends are all sitting around your living room talking and drinking some healthy green tea. We will call this a running program. Now your two-year-old daughter enters the room and forcefully announces for all to hear, "I got go potty!". Your conversation has been "interrupted", and unless you like it messy, you immediately stop what you are doing, put your tea down, and go service your daughter's request. Once her needs have been dealt with, you return to your friends, pick up your tea, and continue your conversation where you left off.

The moment at which your daughter makes her request is completely random in relation to the conversation taking place. There might be one particular spot in the conversation where handling the request has a much worse long-term outcome, such as when you have just dumped boiling hot tea on someone. Such can be the timing of interrupts.

Program Runs: Chat with friends while drinking tea.
Program is interrupted: "I got go potty!"
Save Program State: Pause conversation, put down tea.
Handle interrupt request: Take daughter to bathroom; need I say more here?
Interrupt ends.
Restore original program state: Pick up tea, continue conversation.
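
For the programmers in the room, the steps above can be sketched in a few lines of Python. This is only an illustration of the save, handle, restore sequence: run_program and potty_handler are hypothetical names, and a real processor saves registers and a program counter rather than a list position.

```python
def run_program(steps, interrupt_before_step, handler):
    """Run steps in order, servicing one interrupt before the given step."""
    log = []
    position = 0
    while position < len(steps):
        if position == interrupt_before_step:
            saved_position = position       # save program state
            log.append("save state")
            log.append(handler())           # handle interrupt request
            position = saved_position       # restore original program state
            log.append("restore state")
            interrupt_before_step = None    # the interrupt has ended
        log.append(steps[position])
        position += 1
    return log


def potty_handler():
    """The interrupt service routine from the story."""
    return "take daughter to bathroom"


conversation = ["chat", "sip tea", "chat some more"]
for entry in run_program(conversation, 1, potty_handler):
    print(entry)
```

The program itself never asks whether an interrupt is coming. It simply resumes at the saved position, which is why the point of interruption matters so much.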
http://www.softwaresafety.net