At the end of 2008 we had the unusual event of a double time leap: 2008 was both a Leap Year and ended with a Leap Second.
Wikipedia covers how to calculate Leap Years quite well, so I will not duplicate here what should be very well known rules. Sadly, it seems programmers don't know these rules. After all, as far back as 1886 Christian Zeller came up with Zeller's Congruence to calculate the day of the week, and he got it right without even knowing what a computer was.
The event that most people are talking about today is the locking up of Microsoft's Zune 30 gigabyte devices.
While most people are bashing Microsoft for the problem, the problem really lies in the i.MX31 Board Support Package from Freescale. After registering you can download the i.MX31: Multimedia Applications Processor Board Support Package (FSL-WCE500-14-BSP), with full source code.
The now infamous file rtc.c, as reported on many other Zune related web sites, may be found in WINCE500\PLATFORM\COMMON\SRC\ARM\FREESCALE\PMIC\MC13783\RTC. The lockup comes down to one error in the function ConvertDays: the code waits for a day count above 366, but there will never be 367 days in *any* year:
    while (days > 365) {
        if (IsLeapYear(year)) {
            if (days > 366) {   /* never true on day 366, so the loop never exits */
                days -= 366;
                year += 1;
            }
        } else {
            days -= 365;
            year += 1;
        }
    }
The line should have read "366 == days" or, my preference, "days >= 366". A simple Code Inspection should have caught a simple minded bug like this during a design review. We can only assume, based on this code and other questionable constructions in the same file (I did not look at other files), that no such inspections or reviews were done. It is also clear that they didn't even run something as inexpensive as Lint on their code. Doing so would have weeded out several of the non-executable paths that are present.
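To make the failure concrete: on day 366 of a leap year the outer test (days > 365) is true, the inner test (days > 366) is false, and nothing changes, so the loop spins forever. Below is a minimal sketch of a loop that always terminates. It is not Freescale's code; the names is_leap_year and year_from_days are hypothetical, and it assumes the day count is 1-based from 1 January of the starting year. It compares against the length of the current year rather than changing the comparison to ">=", because the one-character change appears to leave a day count of 0 on 31 December of a leap year.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Gregorian leap year rule. */
static bool is_leap_year(int year)
{
    return (year % 4 == 0 && year % 100 != 0) || (year % 400 == 0);
}

/*
 * Hypothetical stand-in for the year loop in ConvertDays.
 * Assumption: `days` is a 1-based day count starting at 1 January of
 * `origin_year`. Returns the year that contains that day and stores the
 * 1-based day of that year in *day_of_year.
 */
static int year_from_days(int days, int origin_year, int *day_of_year)
{
    int year = origin_year;

    assert(day_of_year != NULL);
    assert(days >= 1);

    for (;;) {
        int year_len = is_leap_year(year) ? 366 : 365;

        /* Day 366 of a leap year stays in that year and ends the loop. */
        if (days <= year_len)
            break;

        days -= year_len;
        year += 1;
    }

    *day_of_year = days;
    return year;
}
```

Every pass through the loop either exits or subtracts a full year, so there is no input that leaves it spinning.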
Leap Second related crashes have also been reported in systems other than the Zune, such as some Linux kernel versions. So what I want to concentrate on is the Leap Second. A Leap Second may be inserted or removed to keep the common UTC time standard in sync with the Earth's rotation.
What is more insidious than the rtc.c code above is that any product based on the MC13783 Power Management and Audio Circuit chip, as used in the crashing Zunes, is doomed from the start, because at the hardware level it is impossible to support Leap Seconds correctly. To be fair to Freescale, most second counting clock chips have the same problem.
"Time and Day Counters
The real time clock runs from the 32 kHz clock. This clock is divided down to a 1 Hz time tick which drives a 17 bit time of day (TOD) counter. The TOD counter counts the seconds during a 24 hour period from 0 to 86,399 and will then roll over to 0. When the roll over occurs, it increments the 15-bit DAY counter. The DAY counter can count up to 32767 days..."
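A small software model of the two counters described in that passage shows where the problem lies. This is only a sketch of the quoted behaviour, not register-accurate code for the chip; the struct and function names are hypothetical, and wrapping the DAY counter back to 0 after 32767 is my assumption, since the quoted text does not say what happens on overflow.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define TOD_ROLLOVER 86400u  /* TOD counts 0..86,399, then returns to 0 */
#define DAY_MASK     0x7FFFu /* 15-bit DAY counter, maximum 32767 */

/* Hypothetical model of the counters from the quoted description. */
struct rtc_counters {
    uint32_t tod; /* seconds since midnight (17 bits on the chip) */
    uint16_t day; /* elapsed days (15 bits on the chip) */
};

/* Advance the model by one tick of the 1 Hz clock. */
static void rtc_tick(struct rtc_counters *rtc)
{
    assert(rtc != NULL);

    rtc->tod += 1;
    if (rtc->tod == TOD_ROLLOVER) {
        rtc->tod = 0;
        /* Assumption: the DAY counter wraps to 0 after 32767. */
        rtc->day = (uint16_t)((rtc->day + 1u) & DAY_MASK);
    }
}
```

In this model 86,399 is always followed by 0 on the next day. There is no state in which the counter can hold 86,400, and none in which it rolls over one second early.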
According to the National Institute of Standards and Technology publication on radio controlled clocks (page 27), a properly functioning clock would tick from 23:59:59 to 23:59:60, then to 00:00:00, during a Leap Second event. So a single day can have 86,401 seconds, numbered from 0 through 86,400.
A clock that is not capable of displaying 23:59:60 would show 23:59:59 twice in a row.
Leap Seconds do not always add a second; they can also subtract one, so a 'day' could correctly have only 86,399 seconds, numbered from 0 through 86,398.
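As a rough illustration of what Leap Second aware display code has to handle, the sketch below formats a second-of-day value for a day that may be 86,399, 86,400 or 86,401 seconds long. The function name format_utc_tod and its interface are hypothetical; how the caller learns the length of the current day (for example from a Leap Second table) is outside the sketch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/*
 * Format second-of-day `sod` as HH:MM:SS for a UTC day that contains
 * `day_len` seconds. Returns 0 on success, or -1 if `day_len` is not a
 * possible UTC day length or `sod` does not fall inside that day.
 * `buf` should hold at least 9 bytes.
 */
static int format_utc_tod(long sod, long day_len, char *buf, size_t buf_len)
{
    assert(buf != NULL);

    if (day_len < 86399L || day_len > 86401L)
        return -1;
    if (sod < 0 || sod >= day_len)
        return -1;

    if (sod == 86400L) {
        /* Only reachable on a day with an inserted Leap Second. */
        snprintf(buf, buf_len, "23:59:60");
        return 0;
    }

    snprintf(buf, buf_len, "%02ld:%02ld:%02ld",
             sod / 3600, (sod % 3600) / 60, sod % 60);
    return 0;
}
```

On a day with a removed second, 86,398 (23:59:58) is the last valid value, and the function rejects 86,399 instead of printing a second that never happened.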
From a software perspective, something like libtai is worth considering. It supports two time scales: (1) TAI64, covering a few hundred billion years with 1-second precision; (2) TAI64NA, covering the same period with 1-attosecond precision. Both scales are defined in terms of TAI, the current international real time standard. This only works as long as the Leap Second tables are properly kept up to date, which presents problems of its own.
If you really want to dig into issues of Leap Seconds, then check out the LEAPSECS -- Leap Second Discussion List. Also, if you are in any way interested in the precision of timekeeping, then check out the Time Nuts Discussion List at the LeapSeconds site.
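For a feel of what a TAI64 label looks like, here is a minimal sketch that packs a TAI second count into the 8-byte big-endian external form, where the label for a second is 2^62 plus the number of TAI seconds since the start of 1970 TAI. It deliberately does not use libtai's own functions; tai64_pack is a hypothetical name, and converting UTC to TAI (which is where the Leap Second table comes in) is left to the caller.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* The TAI64 label of the first second of 1970 TAI is 2^62. */
#define TAI64_EPOCH_LABEL (UINT64_C(1) << 62)

/*
 * Pack `tai_seconds` (seconds relative to the start of 1970 TAI, which
 * may be negative) into an 8-byte big-endian TAI64 label.
 * Hypothetical helper for illustration, not part of libtai.
 */
static void tai64_pack(int64_t tai_seconds, uint8_t out[8])
{
    uint64_t label;
    int i;

    assert(out != NULL);
    /* Keep the label inside the defined range 0 <= label < 2^63. */
    assert(tai_seconds >= -(INT64_C(1) << 62));
    assert(tai_seconds < (INT64_C(1) << 62));

    /* Unsigned wraparound makes negative offsets land below 2^62. */
    label = TAI64_EPOCH_LABEL + (uint64_t)tai_seconds;

    for (i = 7; i >= 0; --i) {
        out[i] = (uint8_t)(label & 0xFFu);
        label >>= 8;
    }
}
```

Because TAI has no Leap Seconds, labels built this way increase by exactly one per elapsed second; all of the Leap Second handling moves into the UTC conversion tables.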
I am not aware of any loss of life, or loss of major income, due to any of the Leap Second problems that occurred this time. Will we be able to say the same for the next Leap Second that occurs?
I am glad there were no bad safety stories to go with this bug. Although a code inspection may find problems like this, the comprehensive test coverage achieved through Test Driven Development would almost certainly have found this bug. At least that is what I thought a few days ago when I read this article.
I dug a little deeper into the Zune 30G bug. Thanks for the pointers to the original source files. The complete context of the code snippet makes it possible to see this bug more clearly.
I test drove the isolation and repair of this bug, and then wrote a blog about preventing or fixing bugs like this with Test Driven Development. Please take a look at it. You will see there is more to the fix than the missing "=".
Zune Bug: Test Driven Bug Fix