Anandtech reviews Denver, finds bugs

NTMBK

Lifer
Nov 14, 2011
10,479
5,895
136
Case in point, we have encountered a floating point bug in Denver that has been traced back to the DCO, which under exceptional workloads causes Denver to overflow an internal register and trigger an SoC reset.

Ultimately this kind of inconsistent performance is a risk and a challenge for Denver. While no single SoC tops every last CPU benchmark, we also don’t typically see the kind of large variations that are occurring with Denver. If Denver’s lows are too low, then it definitely impacts the suitability of the SoC for high-end devices, as users have come to expect peppy performance at all times.

In practice, I didn't really notice any issues with the Nexus 9's performance, although there were odd moments during intense multitasking where I experienced extended pauses/freezes, likely due to the DCO getting stuck somewhere in execution; the DCO can have unexpected bugs, such as repeated FP64 multiplication causing crashes. In general, the device also tended to get hot even on relatively simple tasks, which doesn't bode well for battery life. The heat is localized to the top of the tablet, which should help with user comfort, although it comes at the cost of worse sustained performance.

Otherwise the review is much like we've seen elsewhere: fantastic GPU performance, extremely inconsistent CPU performance (a "benchmark monster" on frequently repeated loops, not so hot in real-world apps), and it gets very hot after extended use.

Hopefully NVidia can ship an update to the code-morphing software that fixes or works around the bugs. And if/when Denver returns at 14nm, hopefully this experience will help them make it a little more polished!

(As an aside, it's a shame that the Denver server plans seem to have been scrapped; this chip would probably actually be quite good at HPC, ignoring the FP bug. Lots of tight, frequently repeated loops would be like catnip to the DCO.)
 

podspi

Golden Member
Jan 11, 2011
1,982
102
106
One thing I was really hoping to see in the article, which wasn't there, is how the benchmarks change (if they change at all) from run to run.

There is a page that addresses this at a theoretical level, but they never, say, run sunspider 10 times to see whether the speed changes over time.

Is this just silly or is this something that could happen? It seems to me that if I run sunspider, part of the CPU time is going to be spent analyzing that code. If I run it multiple times, performance should increase as the DCO realizes this is important code, and has time to mull it over.
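
Rough sketch of what I'm imagining, in C (the workload below is just a placeholder loop, not sunspider, and the ten-run count is arbitrary; none of this is from the article): time the same work repeatedly in one process and see whether later runs come out faster as the hot loop gets re-optimized.

```c
#include <stdio.h>
#include <time.h>

static volatile double sink;  /* keeps the compiler from deleting the work */

static void workload(void)
{
    double acc = 0.0;
    for (int i = 1; i < 1000000; i++)
        acc += 1.0 / i;       /* placeholder hot loop, not sunspider */
    sink = acc;
}

int main(void)
{
    for (int run = 0; run < 10; run++) {
        struct timespec t0, t1;
        clock_gettime(CLOCK_MONOTONIC, &t0);
        workload();
        clock_gettime(CLOCK_MONOTONIC, &t1);
        double ms = (t1.tv_sec - t0.tv_sec) * 1e3 +
                    (t1.tv_nsec - t0.tv_nsec) / 1e6;
        printf("run %d: %.2f ms\n", run, ms);  /* if warm-up matters, early runs are slower */
    }
    return 0;
}
```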

Incredibly interesting chip, but one has to wonder if future versions will have some sort of non-volatile memory for the cache, and maybe a companion core or two to 'run' the DCO.

Unfortunately, this sounds quite complicated and then the question is, why not just go OOO with a more traditional design?
 

Exophase

Diamond Member
Apr 19, 2012
4,439
9
81
I think people have been too quick to assume that the lower performance cases are due to spending too little time in native optimized code. That may be the case to an extent, but there are other possible explanations. All layers of the CPU really rely on temporal locality of reference for decent performance. If a significant amount of your runtime is spent executing unique code you'll have bad performance everywhere.

In my experience with dynamic code generation for emulators, the real slowdowns happen when code that was already translated has to be flushed, especially if this happens repeatedly due to techniques like self-modifying code. If the code never gets modified or flushed, the translation costs barely register, especially since the big bulk translations tend to happen at the same time as other heavy work, like loading data from disk. Now, translation for emulated code doesn't do anywhere near the analysis and optimization Denver's translator does, but I also don't have any cold-path/hot-spot triggering at all (everything is translated).
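
To make that concrete, here's a stripped-down sketch of the pattern in C (purely illustrative, not code from any real emulator and certainly not from Denver): a translation cache keyed by guest PC, where hits and first-time translations are cheap relative to the invalidation that self-modifying code forces.

```c
#include <stdint.h>
#include <stdio.h>

#define CACHE_SLOTS 4096

typedef struct {
    uint32_t guest_pc;    /* address of the original (guest) code block */
    void    *host_code;   /* translated block, NULL if the slot is empty */
} tcache_entry;

static tcache_entry tcache[CACHE_SLOTS];

static void *translate_block(uint32_t guest_pc)
{
    /* stand-in for the real translator; this cost is only paid the
       first time a block is seen, or after it gets flushed */
    printf("translating block at %08x\n", guest_pc);
    return (void *)(uintptr_t)guest_pc;
}

void *lookup_or_translate(uint32_t guest_pc)
{
    tcache_entry *e = &tcache[(guest_pc >> 2) % CACHE_SLOTS];
    if (e->host_code && e->guest_pc == guest_pc)
        return e->host_code;               /* hit: translation already paid for */
    e->guest_pc  = guest_pc;
    e->host_code = translate_block(guest_pc);
    return e->host_code;
}

void on_guest_write(uint32_t addr)
{
    /* self-modifying code: throw away every cached translation on the
       written page; doing this over and over is where the time goes */
    uint32_t page = addr & ~0xFFFu;
    for (int i = 0; i < CACHE_SLOTS; i++)
        if (tcache[i].host_code && (tcache[i].guest_pc & ~0xFFFu) == page)
            tcache[i].host_code = NULL;
}

int main(void)
{
    lookup_or_translate(0x1000);  /* miss: translate once */
    lookup_or_translate(0x1000);  /* hit: essentially free */
    on_guest_write(0x1004);       /* guest rewrote its own code page */
    lookup_or_translate(0x1000);  /* miss again: pay for translation again */
    return 0;
}
```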

When looking at the performance comparisons you first have to remember that Denver is dual-core versus a lot of quad-core competitors, and that does make a big difference sometimes. But even with all of the features it has, it'll sometimes still have weaknesses against traditional OoOE uarchs. Mainly, they can dynamically reorder over branches; doing the same at compile time results in a combinatorial explosion of different branch paths. And for Denver it's very important to extract as much parallelism as possible because it is so wide. So the result is some mix of a much larger code footprint (hence the big L1 icache), redundant instructions that are executed when they aren't needed, or subpar extraction of parallelism. The DCO helps them be more aggressive on the worse cases, and maybe even abandon strategies as they turn out to be failing, but you really can have code that fits okay in cache yet is uniformly very branchy.
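
A toy example of the trade-off (my own illustration in C, not anything from NVIDIA): the second version below is roughly what aggressive software scheduling gets you by computing both sides of the branch up front, so a wide in-order core always has independent work to issue, at the price of instructions whose results get thrown away.

```c
int branchy(const int *a, int n, int flag)
{
    int sum = 0;
    for (int i = 0; i < n; i++) {
        if (flag & 1)
            sum += a[i] * 3;   /* path A */
        else
            sum += a[i] + 7;   /* path B */
    }
    return sum;
}

/* Compute both paths unconditionally so there is always independent
   work to fill a wide machine, then select one result. No branch left
   inside the loop, but every iteration executes instructions whose
   result is discarded. */
int if_converted(const int *a, int n, int flag)
{
    int sum = 0;
    for (int i = 0; i < n; i++) {
        int pa = a[i] * 3;     /* path A, always executed */
        int pb = a[i] + 7;     /* path B, always executed */
        sum += (flag & 1) ? pa : pb;
    }
    return sum;
}

int main(void)
{
    int a[4] = {1, 2, 3, 4};
    /* both versions compute the same sum; they differ only in how the
       work is laid out for the scheduler */
    return branchy(a, 4, 1) == if_converted(a, 4, 1) ? 0 : 1;
}
```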

Is this just silly or is this something that could happen? It seems to me that if I run sunspider, part of the CPU time is going to be spent analyzing that code. If I run it multiple times, performance should increase as the DCO realizes this is important code, and has time to mull it over.

The problem with that is that, as far as the DCO is concerned, the code, which is dynamically generated by the browser JIT, looks new each time. Even code that's loaded from disk or moved around in memory will look like new code. It could hash new code to see if it looks like old code and reuse translations, but that has its own expense, and it means the new code won't be translated in a way that's globally optimized against other newly translated code.
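
Just to illustrate what "hash new code to see if it looks like old code" would mean (hypothetical C sketch, not how the DCO actually works): fingerprint the bytes of a newly seen region and probe a table of old translations. Even this cheap version costs a pass over every byte of each new region, on top of the global-optimization problem above.

```c
#include <stdint.h>
#include <stddef.h>

#define SLOTS 1024

static struct {
    uint64_t hash;         /* fingerprint of the code bytes */
    void    *translation;  /* previously generated translation, if any */
} seen[SLOTS];

/* FNV-1a, 64-bit: cheap, but still touches every byte of the region */
static uint64_t fnv1a(const uint8_t *p, size_t len)
{
    uint64_t h = 0xcbf29ce484222325ULL;
    for (size_t i = 0; i < len; i++) {
        h ^= p[i];
        h *= 0x100000001b3ULL;
    }
    return h;
}

void remember_translation(const uint8_t *code, size_t len, void *translation)
{
    uint64_t h = fnv1a(code, len);
    seen[h % SLOTS].hash = h;
    seen[h % SLOTS].translation = translation;
}

void *find_reusable_translation(const uint8_t *code, size_t len)
{
    uint64_t h = fnv1a(code, len);
    if (seen[h % SLOTS].translation && seen[h % SLOTS].hash == h)
        return seen[h % SLOTS].translation;  /* "new" code matches old code */
    return NULL;                             /* genuinely new: translate from scratch */
}

int main(void)
{
    uint8_t jit_region[32] = { 0x13, 0x37 };  /* stand-in for bytes a browser JIT emitted */
    int dummy;
    remember_translation(jit_region, sizeof jit_region, &dummy);
    /* if the JIT later emits byte-identical code, the fingerprint
       matches and the old translation could in principle be reused */
    return find_reusable_translation(jit_region, sizeof jit_region) ? 0 : 1;
}
```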