Why recursion may be the death of you

Mark R

Diamond Member
Oct 9, 1999
This long legal transcript comes from the Bookout v Toyota case.

Recursion has long been known to be a potentially risky programming construct; not only can it be hard for programmers to reason about, it carries a number of risks, notably stack overflow. While less of a problem on modern desktop machines, stack space can be severely constrained in embedded CPU systems, which is why recursion has been banned for decades in automotive ECUs and similar safety-critical systems.
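To make the stack-overflow risk concrete, here's a hypothetical sketch (my own illustration, not anything from the Toyota code): a recursive routine whose stack use grows with its input, next to the iterative rewrite that uses constant stack.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical example -- NOT from the case. Each recursive call pushes a
 * new stack frame, so stack use grows linearly with n; on a small embedded
 * stack, a large enough n overflows it. */
uint32_t sum_recursive(uint32_t n)
{
    if (n == 0)
        return 0;
    return n + sum_recursive(n - 1); /* one stack frame per level */
}

/* Iterative equivalent: constant stack use regardless of n, which is why
 * coding standards like MISRA C simply ban recursion outright. */
uint32_t sum_iterative(uint32_t n)
{
    uint32_t total = 0;
    while (n > 0) {
        total += n;
        n--;
    }
    return total;
}
```

Both compute the same result; only the iterative one has a stack footprint you can bound at design time.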

In this case, an expert on embedded system design talks about how a lack of understanding of recursion may have allowed the Toyota Camry ECU to malfunction, causing the electronically controlled throttle to stay open.

Bookout vs Toyota

Cliffs
1. Toyota ECU runs a special real-time OS, with a number (exact number redacted) of processes that control various engine parameters.
2. There are specific processes for controlling parameters such as combustion and spark, and a general-purpose "everything including the kitchen-sink" process for a variety of miscellaneous tasks.
3. The "kitchen-sink" process has control of the electronic throttle angle computation, as well as managing cruise-control, various fail-safes, brake-pedal accelerator override, ECU DTC/error logging, failsafe mode enable, etc.
4. The CalculateRequiredThrottlePosition function is incredibly complex: roughly 1,500 lines long, with well over a hundred branches (a cyclomatic complexity of 146).
5. A large number of functions in the "kitchen-sink" process made heavy use of recursion, and Toyota had not taken account of this when calculating the amount of stack space required.
6. Stack usage by the "kitchen-sink" process averaged 94% when analysed; Toyota had designed for worst-case usage to be under 41%.
7. Stack overflow in the "kitchen-sink" process would cause the OS to kill the process, and as this process had the watchdog code and error logging code, it would not be restarted, nor would any error codes be logged. When the process was killed, the throttle motor would remain at the position it was in prior to the process crash. As the combustion/fuel injection and ignition processes (and throttle plate motor control process) were separate, they would continue to run as normal.

Edit:

8. There is a second "monitor" CPU that is intended to watch the main ECU CPU. One thing it does is query the main ECU's RAM for brake-pedal status and compare that against its own hardware connection to the brake-pedal switch. If there is a mismatch (indicating that the main ECU's "kitchen-sink" process has crashed and is no longer monitoring the brake pedal), it will override the main ECU's connection to the throttle motor and drive the throttle closed, causing the engine to stall. However, this failsafe would only work if the brake pedal was released for several seconds before being held for several seconds. If the pedal was already pressed when the "kitchen-sink" process crashed, the query (a read from memory) would return the stale "brake pressed" value, which would match the hardware "brake pressed" signal, and the failsafe would not engage. If the driver then released the pedal for several seconds, a mismatch would be detected and the engine stalled; if the driver released it for only a fraction of a second, the chance of the monitor CPU catching the mismatch was low. Dyno testing with a Camry connected to a debugger showed that after a debugger-induced "kitchen-sink" process crash, the engine would continue to accelerate for as long as the brake pedal was held down. Once the pedal was released for 5 seconds, the engine would stall.
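The monitor CPU's mismatch logic might look roughly like this (purely my own reconstruction; the function name, tick rate, and threshold are all invented, since the real ECU code was never published):

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical sketch of the monitor-CPU failsafe described in point 8.
 * Assume it is called once per tick (say, 100 ms) and returns true when
 * the failsafe should drive the throttle closed. */
#define MISMATCH_TICKS_REQUIRED 20 /* ~2 s of sustained mismatch */

int mismatch_ticks = 0;

bool monitor_brake_failsafe(bool ram_reports_brake, bool hw_brake_switch)
{
    if (ram_reports_brake != hw_brake_switch) {
        if (mismatch_ticks < MISMATCH_TICKS_REQUIRED)
            mismatch_ticks++;
    } else {
        /* Agreement resets the counter -- which is exactly the blind spot
         * described above: after a crash with the pedal already pressed,
         * the stale "pressed" value in RAM still matches the hardware
         * signal, so no mismatch ever accumulates. */
        mismatch_ticks = 0;
    }
    return mismatch_ticks >= MISMATCH_TICKS_REQUIRED;
}
```

The sketch shows why a brief pedal release defeats it: any tick where the two signals agree zeroes the counter, so only a sustained mismatch trips the failsafe.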
 

Graze

Senior member
Nov 27, 2012
This long legal transcript comes from the Bookout v Toyota case.

Recursion has long been known to be a potentially risky programming construct; not only can it be hard for programmers to reason about, it carries a number of risks, notably stack overflow. While less of a problem on modern desktop machines, stack space can be severely constrained in embedded CPU systems, which is why recursion has been banned for decades in automotive ECUs and similar safety-critical systems.

In this case, an expert on embedded system design talks about how a lack of understanding of recursion may have allowed the Toyota Camry ECU to malfunction, causing the electronically controlled throttle to stay open.

Bookout vs Toyota

Cliffs
1. Toyota ECU runs a special real-time OS, with a number (exact number redacted) of processes that control various engine parameters.
2. There are specific processes for controlling parameters such as combustion and spark, and a general-purpose "everything including the kitchen-sink" process for a variety of miscellaneous tasks.
3. The "kitchen-sink" process has control of the electronic throttle angle computation, as well as managing cruise-control, various fail-safes, brake-pedal accelerator override, ECU DTC/error logging, failsafe mode enable, etc.
4. The CalculateRequiredThrottlePosition function is incredibly complex: roughly 1,500 lines long, with well over a hundred branches (a cyclomatic complexity of 146).
5. A large number of functions in the "kitchen-sink" process made heavy use of recursion, and Toyota had not taken account of this when calculating the amount of stack space required.
6. Stack usage by the "kitchen-sink" process averaged 94% when analysed; Toyota had designed for worst-case usage to be under 41%.
7. Stack overflow in the "kitchen-sink" process would cause the OS to kill the process, and as this process had the watchdog code and error logging code, it would not be restarted, nor would any error codes be logged. When the process was killed, the throttle motor would remain at the position it was in prior to the process crash. As the combustion/fuel injection and ignition processes (and throttle plate motor control process) were separate, they would continue to run as normal.

I was going to ask why the brake process wasn't made to close the throttle, but then I realized the throttle itself was handled by that same kitchen-sink process.

This is interesting indeed.
 

BrightCandle

Diamond Member
Mar 15, 2007
It's not so much recursion that is the issue. They aren't doing safety-critical software development properly at all. If they didn't know before the car went out on the road that the software could fail there, then some aspect of their spec was wrong. The way you design safety-critical software to SIL 4 or better is always based on worst-case scenarios, fixed memory, and no extra allocation. In essence Toyota isn't following any of the principles of safety-critical software design, and that is why it failed.

In a nutshell you shouldn't buy Toyota cars with software in them as they aren't complying with the required standards for how to develop this software safely at all.
 

Markbnj

Elite Member, Moderator Emeritus
Moderator
Sep 16, 2005
www.markbetz.net
In a nutshell you shouldn't buy Toyota cars with software in them as they aren't complying with the required standards for how to develop this software safely at all.
You could probably find similar faults in every automobile's controlling software if you pulled it apart in a lawsuit. Analysis of a piece of software can reveal conditions that may potentially cause certain responses, but it doesn't say whether those conditions have actually arisen in operation. I've owned four Toyotas so far, have 120k miles on my latest one, and I have never had one decide to take off under its own command. As far as I know there is still no hardware interlock on the ignition or transmission, or apparently in the brains of these drivers.
 

BrightCandle

Diamond Member
Mar 15, 2007
You could probably find similar faults in every automobile's controlling software if you pulled it apart in a lawsuit. Analysis of a piece of software can reveal conditions that may potentially cause certain responses, but it doesn't say whether those conditions have actually arisen in operation. I've owned four Toyotas so far, have 120k miles on my latest one, and I have never had one decide to take off under its own command. As far as I know there is still no hardware interlock on the ignition or transmission, or apparently in the brains of these drivers.

Just because everyone else is breaking the law doesn't mean it's OK for Toyota to do it. When I was doing safety-critical software it would have been impossible for me to have written such a fault, by design of the development process that was put in place. It shouldn't be possible for them at all. What this lawsuit shows is that Toyota should be banned from selling cars until it can prove that it's complying with SIL 4 safety standards as required by the law of the land in which they sell these cars, not just the USA but most of the world.
 

Markbnj

Elite Member, Moderator Emeritus
Moderator
Sep 16, 2005
www.markbetz.net
Just because everyone else is breaking the law doesn't mean it's OK for Toyota to do it. When I was doing safety-critical software it would have been impossible for me to have written such a fault, by design of the development process that was put in place. It shouldn't be possible for them at all. What this lawsuit shows is that Toyota should be banned from selling cars until it can prove that it's complying with SIL 4 safety standards as required by the law of the land in which they sell these cars, not just the USA but most of the world.

So you're able to tell, from a paid outside consultant's analysis of their control system and software, that they are breaking the law?
 

DaveSimmons

Elite Member
Aug 12, 2001
> 6. Stack usage by the "kitchen-sink" process averaged 94% when analysed. Toyota had designed for worst-case usage to be under 41%

So they failed to test this properly as they kept adding everything into the kitchen sink, and apparently failed to analyze all possible consequences of overflow.

That's poor practice, but in fairness it's much easier to work backwards from a specific crash scenario (pun intended) than to predict it in advance.
 

BrightCandle

Diamond Member
Mar 15, 2007
So you're able to tell, from a paid outside consultant's analysis of their control system and software, that they are breaking the law?

Based on the description of the type of failure, I know they haven't complied with the safety requirements for safety-critical software, as this type of failure is impossible if you do. Anyone who has worked in safety-critical software for any length of time will know that they can't possibly have complied; that type of problem just can't happen if they had.
 

Mark R

Diamond Member
Oct 9, 1999
The legal testimony is long, but it is a good read if you have a couple of hours.

The key points are that:
1. Automotive industry standard practice is to use language subsets which are easier to validate. A common standard is MISRA C, which has about 150 rules about coding: e.g. no recursion, very careful control of signed/unsigned types, avoidance of side effects combined with && or ||, etc. Toyota did not follow the MISRA rules, and instead developed their own, which covered barely 30% of the MISRA rules and required many of the other MISRA rules to be broken. Static code analysis revealed nearly 100k MISRA violations and many thousands of violations of Toyota's own coding rules.

2. There was no control of code complexity or maintainability. The code was almost unintelligible even with comments, with numerous examples of multi-thousand-LOC functions. Many functions had code complexity scores in the hundreds (the throttle control code had a cyclomatic complexity of 146). Complexity scores over 20 indicate "high risk" of error and difficulty in testing, i.e. "untestable"; scores over 50 are regarded as "unmaintainable".

3. Over 11,000 global variables, not all of which were clearly documented or named.

4. Ignoring exceptions/events generated by the OS (e.g. task exit events, hardware errors, other exceptions)

5. Not double-checking critical system variables as a guard against memory corruption. Many variables were mirrored (2nd copy bit inverted) in different blocks of memory, and would be checked when read, to ensure that memory corruption had not occurred. This was not the case with the throttle control code, which stored only a single copy of the prescribed throttle angle without parity or ECC correction.

6. Hardware watchdog systems were extremely lax, possibly the CPU was too weak for the code, and would suffer excessive lag under certain conditions. In addition, the watchdog timer reset pulse generation had been removed from an earlier version of the "kitchen sink" process, and had been delegated to a hardware timer peripheral. (!?!)

7. Watchdog systems had not been fully tested or were non-functional. E.g. out of 24 processes on the main CPU, few process failures would reliably be detected, and many would require an additional event before detection would occur (e.g. failure of the "kitchen sink" process would not be detected until the driver operated or released the brake pedal).
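The mirroring technique from point 5 is simple to sketch (my own illustration of the general scheme, not code from the case): store the value alongside its bitwise complement, ideally in a different RAM block, and verify the pair still match on every read.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Sketch of the mirrored-variable defence described in point 5 -- names
 * and layout are my own invention. In a real ECU the two halves would sit
 * in separate memory blocks so one corruption event is unlikely to damage
 * both consistently. */
typedef struct {
    uint16_t value;
    uint16_t inverted; /* invariant: always equals ~value */
} mirrored_u16;

void mirrored_write(mirrored_u16 *m, uint16_t v)
{
    m->value = v;
    m->inverted = (uint16_t)~v;
}

/* Returns true and stores the value in *out if the pair is consistent;
 * false signals likely memory corruption, so the caller can fail safe. */
bool mirrored_read(const mirrored_u16 *m, uint16_t *out)
{
    if ((uint16_t)(m->value ^ m->inverted) != 0xFFFFu)
        return false; /* value and complement no longer agree */
    *out = m->value;
    return true;
}
```

Per the testimony, the throttle angle skipped exactly this kind of check, so a single corrupted word could silently become the commanded throttle position.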
 

Markbnj

Elite Member, Moderator Emeritus
Moderator
Sep 16, 2005
www.markbetz.net
So they failed to test this properly as they kept adding everything into the kitchen sink, and apparently failed to analyze all possible consequences of overflow.

For some reason I find the fact that 41% was their worst case assumption to be interesting. Why would you want 59% more stack space than you need in an embedded system? I've never done any embedded development, but that seems like a lot of headroom.
 

Mark R

Diamond Member
Oct 9, 1999
For some reason I find the fact that 41% was their worst case assumption to be interesting. Why would you want 59% more stack space than you need in an embedded system? I've never done any embedded development, but that seems like a lot of headroom.
The article said that they had calculated stack usage to be 82%, which was thought to be fine, but that they then doubled the stack space "just to be sure".

The problem was that the method of static code analysis they were using was broken: it didn't account for stack usage by the OS/context switching, by standard libraries, by calls through function pointers, or by recursion. Given that, it's amazing their software worked at all.
 

Markbnj

Elite Member, Moderator Emeritus
Moderator
Sep 16, 2005
www.markbetz.net
The legal testimony is long, but it is a good read if you have a couple of hours.

The key points are that:
1. Automotive industry standard practice is to use language subsets which are easier to validate. A common standard is MISRA C, which has about 150 rules about coding: e.g. no recursion, very careful control of signed/unsigned types, avoidance of side effects combined with && or ||, etc. Toyota did not follow the MISRA rules, and instead developed their own, which covered barely 30% of the MISRA rules and required many of the other MISRA rules to be broken. Static code analysis revealed nearly 100k MISRA violations and many thousands of violations of Toyota's own coding rules.

2. There was no control of code complexity or maintainability. The code was almost unintelligible even with comments, with numerous examples of multi-thousand-LOC functions. Many functions had code complexity scores in the hundreds (the throttle control code had a cyclomatic complexity of 146). Complexity scores over 20 indicate "high risk" of error and difficulty in testing, i.e. "untestable"; scores over 50 are regarded as "unmaintainable".

3. Over 11,000 global variables, not all of which were clearly documented or named.

4. Ignoring exceptions/events generated by the OS (e.g. task exit events, hardware errors, other exceptions)

5. Not double-checking critical system variables as a guard against memory corruption. Many variables were mirrored (2nd copy bit inverted) in different blocks of memory, and would be checked when read, to ensure that memory corruption had not occurred. This was not the case with the throttle control code, which stored only a single copy of the prescribed throttle angle without parity or ECC correction.

6. Hardware watchdog systems were extremely lax, possibly the CPU was too weak for the code, and would suffer excessive lag under certain conditions. In addition, the watchdog timer reset pulse generation had been removed from an earlier version of the "kitchen sink" process, and had been delegated to a hardware timer peripheral. (!?!)

7. Watchdog systems had not been fully tested or were non-functional. E.g. out of 24 processes on the main CPU, few process failures would reliably be detected, and many would require an additional event before detection would occur (e.g. failure of the "kitchen sink" process would not be detected until the driver operated or released the brake pedal).

That all sounds pretty damning, and I don't have the experience in this kind of software to really evaluate the expert's testimony. Was testimony from Toyota's development group included in any way? I have no doubt the code was spaghetti. I also have no doubt the expert was paid a lot to make it look as much like spaghetti as possible. I've written some good code and bad code in my life, and I am not sure I would want my best code subjected to the scrutiny of a paid expert whose job is to make it look bad.

I don't think the paid expert's opinion about the code is very relevant. Unless he can set it up in a lab environment and show that a certain set of inputs, plausibly present during the incident in question, reliably result in the specific outputs that could have caused the incident as described, opinion is all it is. It could be ugly as hell and still work right under all real world inputs.

For that matter, has anyone _ever_ reproduced "unintended acceleration" in a lab setting, let alone that accompanied by total brake failure? The first time I remember this coming up was with Audi back in the... 80's I think. If I recall the whole damn system was mechanical and nobody ever showed how it could have happened. Audi even demonstrated that the brakes could bring the car to a stop and hold it under full throttle.

There is no engineering or science going on in a courtroom. Everything that happens in there is posturing to try and influence the court in one direction or another. If you put five or ten actual working developers of automotive control systems in a conference room with this code for as long as it takes, and _they_ come out and tell me Toyota should be banned from selling cars, then I'll be all over it.
 

Mark R

Diamond Member
Oct 9, 1999
For that matter, has anyone _ever_ reproduced "unintended acceleration" in a lab setting, let alone that accompanied by total brake failure? The first time I remember this coming up was with Audi back in the... 80's I think. If I recall the whole damn system was mechanical and nobody ever showed how it could have happened. Audi even demonstrated that the brakes could bring the car to a stop and hold it under full throttle.

You're absolutely right about what happens in court. This is all about an expert pointing out every possible little flaw, to make Toyota look bad.

However, while no one had successfully replicated an engine runaway with total brake failure, this expert did reliably reproduce an engine-runaway condition in a Camry tested on a dyno, by connecting a debugger and terminating the relevant process. As the throttle was stuck open, brake boost would be lost if the pedal was pumped, and this could give the impression of brake failure.

So, while there is no proof that this was the cause of the reported incidents, it nevertheless directly contradicts Toyota's claim that such a fault could not occur: the fault had been demonstrated in a production car, and it was plausible, albeit very unlikely, for it to occur in real life (the ECU did not use ECC RAM, the particular process was overly complex and had very little margin for design error, the watchdog systems did not reliably detect failure of this process and restart it, etc.).
 

Markbnj

Elite Member, Moderator Emeritus
Moderator
Sep 16, 2005
www.markbetz.net
You're absolutely right about what happens in court. This is all about an expert pointing out every possible little flaw, to make Toyota look bad.

However, while no one had successfully replicated an engine runaway with total brake failure, this expert did reliably reproduce an engine-runaway condition in a Camry tested on a dyno, by connecting a debugger and terminating the relevant process. As the throttle was stuck open, brake boost would be lost if the pedal was pumped, and this could give the impression of brake failure.

So, while there is no proof that this was the cause of the reported incidents, it nevertheless directly contradicts Toyota's claim that such a fault could not occur: the fault had been demonstrated in a production car, and it was plausible, albeit very unlikely, for it to occur in real life (the ECU did not use ECC RAM, the particular process was overly complex and had very little margin for design error, the watchdog systems did not reliably detect failure of this process and restart it, etc.).

Connecting a debugger and terminating a running process doesn't strike me as a real-world scenario. It strikes me as the equivalent of taking out the corner of a building with a bulldozer, and then when it falls down proclaiming that its construction was flawed. If this can really happen then there should be some way to get a vehicle into that state without having to connect a magic sledgehammer and beat on it.

Nevertheless, I'm sure it is enough to get Toyota creamed in court, and enshrine in precedent the idea that automobiles can take off on their own and leave the driver with no means of recovery.
 

BrightCandle

Diamond Member
Mar 15, 2007
It's not a wholly unrealistic way to reproduce the problem, though. Having shown that the software can fail in that way due to how it's written, simulating that type of failure and seeing the same fault users have been reporting on the road seems like a pretty compelling argument. It's a relatively rare event, so there is no way that testing a car in the lab would reproduce it reliably; showing that it's possible for the car's software to crash from lack of stack space, and then having the simulated process failure produce the reported behaviour, is enough to take 0.5 and 0.5 and make 1. That shows how it happens.
 

Fox5

Diamond Member
Jan 31, 2005
Should have written it in Ada.

Or in Lisp if they wanted to use recursion.
 

BrightCandle

Diamond Member
Mar 15, 2007
Even in Ada there is a subset for safety-critical software, called SPARK. But yes, Ada does this better. What surprises me most is that guys like Lockheed Martin and Northrop Grumman use the MISRA standard for C instead of moving to GreenOS and Ada. MISRA can work, but a lot of the power of C is removed when you use it, whereas SPARK is a far simpler adjustment to its base language and fully checkable with the right compiler.

But most companies nowadays do MISRA C instead of SPARK Ada. I have done both and I would take SPARK Ada every time; it's simply a better language for safety-critical work.
 

Fox5

Diamond Member
Jan 31, 2005
Even in Ada there is a subset for safety-critical software, called SPARK. But yes, Ada does this better. What surprises me most is that guys like Lockheed Martin and Northrop Grumman use the MISRA standard for C instead of moving to GreenOS and Ada. MISRA can work, but a lot of the power of C is removed when you use it, whereas SPARK is a far simpler adjustment to its base language and fully checkable with the right compiler.

But most companies nowadays do MISRA C instead of SPARK Ada. I have done both and I would take SPARK Ada every time; it's simply a better language for safety-critical work.

Well, you've gotta use the right tool for the job. Companies use C because there are a lot of C developers, but it'd probably be worth the pain to learn new languages when they have an advantage.
 

Cerb

Elite Member
Aug 26, 2000
Should have written it in Ada.

Or in Lisp if they wanted to use recursion.
LISP would be pretty bad for anything that might need auditing. I don't know about safety-critical software, but other embedded software, at least, may use Smalltalk variants, in part to get TCO (not necessarily for recursion). It probably wouldn't have helped anyway, since this looks much more like negligence than a technical error that happened to slip by.

Recursion should only be used as a regular feature when it is known to behave as a while loop: it must not grow the stack, and it must not be able to hit a condition where it never terminates, unless the recursing function is the equivalent of a main() (the C standard doesn't guarantee TCO, so don't do it in C regardless). Any case where it can grow the stack needs a hard limit set, with defined behaviour on reaching it. And that's coming from a background of freaking web apps, much less something that could cause injury or death if it fails to work correctly.
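The "hard limiter" idea looks something like this (a hypothetical sketch of the pattern, not from any real system): a recursive tree walk that refuses to recurse past a fixed depth and reports failure instead of trusting its input.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical example of bounded recursion -- my own invention. The walk
 * bails out with -1 rather than blowing the stack if the structure is
 * deeper than expected, e.g. because corruption has produced a cycle or a
 * pathological chain. */
#define MAX_DEPTH 32

typedef struct node {
    struct node *left, *right;
} node;

/* Counts nodes in a binary tree; returns -1 if the depth limit is hit,
 * so the caller can fail safe instead of overflowing the stack. */
long count_nodes(const node *n, int depth)
{
    if (n == NULL)
        return 0;
    if (depth >= MAX_DEPTH)
        return -1; /* refuse to recurse further */
    long l = count_nodes(n->left, depth + 1);
    long r = count_nodes(n->right, depth + 1);
    if (l < 0 || r < 0)
        return -1; /* propagate the limit violation upward */
    return 1 + l + r;
}
```

The point is that the worst-case stack use is now MAX_DEPTH frames, a number you can actually budget for.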
 

BrightCandle

Diamond Member
Mar 15, 2007
When it comes to choosing languages suitable for safety-critical work, you have to consider that it's vital no objects ever get created: there is no dynamic memory allocation and no garbage collection. Dynamic memory allocation is a danger for causing crashes (as we can see in this case with Toyota's terrible implementation), and garbage collection affects the real-time nature of the software with its pauses. Both of these make Smalltalk and Lisp unsuitable for safety-critical work.
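In practice the no-dynamic-allocation rule means every buffer is reserved at build time, often via a fixed pool. A minimal sketch (names and sizes are invented for illustration):

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Minimal fixed-pool allocator illustrating the "no dynamic allocation"
 * rule -- my own sketch, not any real ECU's allocator. All storage is
 * reserved at build time, so exhaustion becomes a bounded, testable
 * condition instead of a runtime heap failure. */
#define POOL_SLOTS 8
#define SLOT_BYTES 64

static unsigned char pool[POOL_SLOTS][SLOT_BYTES];
static bool in_use[POOL_SLOTS];

void *pool_alloc(void)
{
    for (int i = 0; i < POOL_SLOTS; i++) {
        if (!in_use[i]) {
            in_use[i] = true;
            return pool[i];
        }
    }
    return NULL; /* pool exhausted: caller must handle this explicitly */
}

void pool_free(void *p)
{
    for (int i = 0; i < POOL_SLOTS; i++) {
        if (p == pool[i]) {
            in_use[i] = false;
            return;
        }
    }
}
```

Because the pool size is a compile-time constant, worst-case memory use is known before the code ever runs, which is the whole point of the rule.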
 

Cerb

Elite Member
Aug 26, 2000
If garbage collection were unusable in real time, why Java? That is one solved problem. Reference-counting GC can be done in real time, and there are JVMs capable of that, precisely for embedded real-time software. I don't know if it would be suitable for controlling a car or not (I wouldn't want to try, for your stated reasons: dynamic allocation allows more chances to use unpredictable amounts of RAM), but real-time GC exists. Pauses for GC can be controlled and their duration predicted. Tracing GC, like .NET's, OTOH, can't do that.
 

Markbnj

Elite Member, Moderator Emeritus
Moderator
Sep 16, 2005
www.markbetz.net
It's not a wholly unrealistic way to reproduce the problem, though. Having shown that the software can fail in that way due to how it's written, simulating that type of failure and seeing the same fault users have been reporting on the road seems like a pretty compelling argument. It's a relatively rare event, so there is no way that testing a car in the lab would reproduce it reliably; showing that it's possible for the car's software to crash from lack of stack space, and then having the simulated process failure produce the reported behaviour, is enough to take 0.5 and 0.5 and make 1. That shows how it happens.

I absolutely agree that it has significant diagnostic value, in a lab setting. In a courtroom it is just a show meant to sway opinions. The fact that it can happen under the described conditions is a red flag for sure, especially if it can be shown that the culprit process actually does get terminated under real world operating conditions. If that can't be established, then Toyota could still be 100% correct about their software, as ugly as it apparently is.
 

Merad

Platinum Member
May 31, 2010
If garbage collection were unusable in real time, why Java? That is one solved problem. Reference-counting GC can be done in real time, and there are JVMs capable of that, precisely for embedded real-time software. I don't know if it would be suitable for controlling a car or not (I wouldn't want to try, for your stated reasons: dynamic allocation allows more chances to use unpredictable amounts of RAM), but real-time GC exists. Pauses for GC can be controlled and their duration predicted. Tracing GC, like .NET's, OTOH, can't do that.

No way. You might get away with it on higher-end microcontrollers with more resources and less time critical applications. But on a lot of these devices your amount of RAM is measured in bytes. Not KB, bytes.

Scroll through some of the specs on this page: http://www.atmel.com/products/automotive/automotive_microcontrollers/avr-based_microcontroller.aspx

The most powerful one I saw was roughly on par with the Arduino, 16 MHz clock, 2 KB of RAM and 32 KB ROM for programs. Java ME Embedded OTOH requires a minimum of 130 KB RAM and 350 KB ROM. You're not even working in the same realm.
 

Cerb

Elite Member
Aug 26, 2000
No way. You might get away with it on higher-end microcontrollers with more resources and less time critical applications. But on a lot of these devices your amount of RAM is measured in bytes. Not KB, bytes.
If it's that small, there's no room for object/reference tables, much less any usable data. Are devices as weak as these Atmel parts really the high end of what they use?
 

Merad

Platinum Member
May 31, 2010
If it's that small, there's no room for object/reference tables, much less any usable data. Are devices as weak as these Atmel parts really the high end of what they use?

Why would they upgrade? Those processors have plenty of power to do the job. Globally, Toyota produces something like 10 million vehicles annually; even if it's only $50* per vehicle to upgrade to Java-capable processors, those added costs balloon pretty quickly.

Also, I highly doubt Java would be suitable even if they upped the processing power. Engine sensors surely have precise timing requirements, and these devices require low-level access to hardware: timers, interrupts, memory-mapped I/O registers, and more.

tl;dr: Use Java where it's appropriate.

*Just a guess, no idea what it would actually cost.