
Nvidia Denver... finally here... and it looks good

xpea

Senior member
Feb 14, 2014
458
156
116
From Hot Chips 2014: Nvidia's long-awaited custom 64-bit ARMv8 CPU, built on TSMC 28nm and running at 2.5GHz, is claimed to be faster on the SPECint benchmark than Intel's 22nm Haswell Celeron 2955U at its 1.4GHz base clock.

Tegra-K1-Chips.jpg


details on the uarch:
http://blogs.nvidia.com/blog/2014/08/11/tegra-k1-denver-64-bit-for-android/

whitepaper here: http://www.tiriasresearch.com/downloads/nvidia-charts-its-own-path-to-armv8/
(free registration)
 

Khato

Golden Member
Jul 15, 2001
1,294
375
136
Will be interesting to see if there was any more information provided than what's in the whitepaper. Since the whitepaper, well, it goes to the level of detail that I've come to expect from NVIDIA, which is to say not much.

As for the performance figures in the whitepaper... Depends entirely upon the power consumption necessary to reach that frequency. And given the fact that they went from 4x A15 to 2x Denver I'm not exactly optimistic with respect to the power consumption figures. A 2.5 GHz Denver trading blows with a 1.4 GHz Haswell Celeron isn't exactly a stunning success unless it does so at Silvermont power levels. (As a note, it looks like the single-threaded performance ends up around 50% higher than a 2.4 GHz Silvermont.)
 

xpea

Senior member
Feb 14, 2014
458
156
116
A 2.5 GHz Denver trading blows with a 1.4 GHz Haswell Celeron isn't exactly a stunning success unless it does so at Silvermont power levels. (As a note, it looks like the single-threaded performance ends up around 50% higher than a 2.4 GHz Silvermont.)
Well, it's the first time that an ARM CPU can touch Intel's big-core performance, and that's already an amazing achievement.
For power level, I don't see it being higher than the 4 A15s in K1-32.
 

Sweepr

Diamond Member
May 12, 2006
5,148
1,143
136
A8 vs Denver should be interesting, I really like the two fast cores approach. Note that a 1.4GHz capped Haswell (2MB L3) isn't the most accurate comparison. If leaked specs are correct then fastest 4.5W TDP Broadwell-Y runs @ up to 2.6GHz (1.1GHz base, 4 threads, 4MB L3) + 24 EUs @ 850MHz GPU. Should be out even sooner than Denver.
 
Mar 10, 2006
11,715
2,012
126
A8 vs Denver should be interesting, I really like the two fast cores approach. Note that a 1.4GHz capped Haswell (2MB L3) isn't the most accurate comparison. If leaked specs are accurate fastest 4.5W TDP Broadwell-Y runs @ up to 2.6GHz (1.1GHz base) + 24 EUs @ 850MHz.

A8, Denver, and Broadwell-Y...should be a nice year for cool CPUs :)
 

Khato

Golden Member
Jul 15, 2001
1,294
375
136
Well, it's the first time that an ARM CPU can touch Intel's big-core performance, and that's already an amazing achievement.
For power level, I don't see it being higher than the 4 A15s in K1-32.

Eh, I wouldn't call it that amazing given how far they have to lower the goal posts in order to achieve such. Granted they at least managed to trade blows with Haswell running at 56% the frequency instead of having to step down to the lowest 1.1 GHz model.

As for power, maybe 2x Denver is lower than 4x A15, or maybe not? I don't believe I've seen any claim from NVIDIA in that respect, which likely means the news isn't good.
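For reference, the 56% figure is just the ratio of the two clocks quoted in the thread. A quick sketch of that arithmetic (the constant and function names are only for illustration, and "equal scores" is an assumption, not a measured result):

```python
# Clock figures as quoted in the thread.
DENVER_GHZ = 2.5
HASWELL_2955U_GHZ = 1.4


def clock_ratio(slower_ghz: float, faster_ghz: float) -> float:
    """Return the slower clock as a fraction of the faster one."""
    return slower_ghz / faster_ghz


ratio = clock_ratio(HASWELL_2955U_GHZ, DENVER_GHZ)
print(f"Haswell runs at {ratio:.0%} of Denver's clock")

# If the two chips scored exactly the same (an assumption), the per-clock
# throughput gap would be the inverse of the clock ratio.
per_clock_gap = 1 / ratio
print(f"Implied per-clock advantage for Haswell: {per_clock_gap:.2f}x")
```

This says nothing about power, which is the open question here.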
 

Enigmoid

Platinum Member
Sep 27, 2012
2,907
31
91
Note that the celeron 2955U is locked to 1.4 ghz.

Also note that the Celeron 2955U gets around 2200-2400 in Geekbench 3 multi-core, well below what the S800 and the Tegra 4 would get (up to 3000 unthrottled). Beating the 2955U is therefore not terribly impressive.
 

xpea

Senior member
Feb 14, 2014
458
156
116
A8, Denver, and Broadwell-Y...should be a nice year for cool CPUs :)
Too bad they are not direct competitors, so it will be very difficult to evaluate their respective merits. But at least they are the best of each ecosystem (iOS / Android / Windows).
 

xpea

Senior member
Feb 14, 2014
458
156
116
Eh, I wouldn't call it that amazing given how far they have to lower the goal posts in order to achieve such. Granted they at least managed to trade blows with Haswell running at 56% the frequency instead of having to step down to the lowest 1.1 GHz model.
So why has nobody done it before? How would you have done it better?
And you forget that little huuuuuuuuuge Intel process advantage: 22nm FinFET vs. 28nm planar.
Do you really expect someone to beat Intel in CPUs, knowing their insane manufacturing advantage?
 

TuxDave

Lifer
Oct 8, 2002
10,571
3
71
So why has nobody done it before? How would you have done it better?
And you forget that little huuuuuuuuuge Intel process advantage: 22nm FinFET vs. 28nm planar.
Do you really expect someone to beat Intel in CPUs, knowing their insane manufacturing advantage?

Well.... AMD did a pretty good job back in the P4 days. :p
 

FatherMurphy

Senior member
Mar 27, 2014
229
18
81
I wonder if this Transmeta binary code translation is the real deal, the "secret sauce" that might be Nvidia's edge in a very competitive market, where its competitors in the CPU field are far more equipped/experienced to produce custom designs. If it works so well, why isn't anyone else doing it? Or is this a one-off, niche type design that won't serve Nvidia well in the long-term?

Is anyone capable of explaining binary code translation to a layman like me? Based upon my reading of the whitepaper and other sites, the Transmeta method failed because, years ago when Transmeta implemented its technology, hardware was not fast enough to take advantage of it. That is, the software overhead of translation resulted in a net efficiency loss. But now hardware is much more powerful and more power-efficient, and (according to Nvidia) the Transmeta method, refined, now results in a net efficiency gain.

Interesting stuff. I look forward to Anandtech's analysis.
 

Khato

Golden Member
Jul 15, 2001
1,294
375
136
So why has nobody done it before? How would you have done it better?
And you forget that little huuuuuuuuuge Intel process advantage: 22nm FinFET vs. 28nm planar.
Do you really expect someone to beat Intel in CPUs, knowing their insane manufacturing advantage?

Because it's well outside of the performance targets that ARM has been interested in.

And no, I don't expect anyone to beat Intel, quite the opposite actually. But declaring how awesome an architecture is merely because of it being capable of trading blows with Haswell running at a fraction of its design point without any indication of power consumption while doing so is questionable at best. If Denver is providing that level of performance on half a watt of power consumption then yes, it'd be an awesome architecture set for world domination. Whereas if it's burning through five watts it's a non-story and NVIDIA would've been better off sticking to the ARM designed cores.
 

xpea

Senior member
Feb 14, 2014
458
156
116
Because it's well outside of the performance targets that ARM has been interested in.

And no, I don't expect anyone to beat Intel, quite the opposite actually. But declaring how awesome an architecture is merely because of it being capable of trading blows with Haswell running at a fraction of its design point without any indication of power consumption while doing so is questionable at best. If Denver is providing that level of performance on half a watt of power consumption then yes, it'd be an awesome architecture set for world domination. Whereas if it's burning through five watts it's a non-story and NVIDIA would've been better off sticking to the ARM designed cores.
For power consumption, let's wait for the reviews, shall we...
Regarding the other things, I totally disagree with you. What's the point of Nvidia making another A57 implementation? Doing the same as Mediatek and Allwinner? Useless; they would be crushed on price with no differentiation whatsoever. And they would miss a critical window this fall.
May I remind you that Denver will be here at least 6 months before Qualcomm's own 64-bit uarch, with perfect timing to put them in the spotlight for this year's hot season and the Android L release? The new Nexus tablet will have TK1-64bit.
No, really, I disagree with you: despite all the delays, Nvidia is in a very good position with Denver to make a hit in Android TV / tablet / Chromebook devices this Xmas.
 

Madpacket

Platinum Member
Nov 15, 2005
2,068
326
126
Hmm, faster game emulation would be awesome. Hopefully nvidia releases a new Shield handheld with this processor, leave the 32 bit stop-gap one for the tablet.
 

ams23

Senior member
Feb 18, 2013
907
0
0
A8 vs Denver should be interesting, I really like the two fast cores approach.

The A8 CPU performance should get close to Denver but may not match Denver in some benchmarks. The Google Octane v2.0 score for Denver is reportedly more than 2x higher than A7-Cyclone's!

[Image: NVIDIA's benchmark graph from the whitepaper]
 

ams23

Senior member
Feb 18, 2013
907
0
0
Also note that the Celeron 2955U gets around 2200-2400 in Geekbench 3 multi-core, well below what the S800 and the Tegra 4 would get (up to 3000 unthrottled). Beating the 2955U is therefore not terribly impressive.

Look carefully at the graph above that NVIDIA presented. The Haswell [2955U] core that they benchmarked scores much higher than both Cortex A15 (in Tegra 4 and Tegra K1 32-bit variant) and Krait 400 (in S800) in Geekbench 3 Single-Core. In several of the benchmark tests, this Haswell [2955U] core is superior in performance to the A7-Cyclone core.

S800, Tegra 4, Tegra K1 32-bit variant all have twice as many cores as Haswell [2955U], A7-Cyclone, and Denver, so naturally they would look good in comparison using Multi-Core benchmarks. That said, even in Multi-Core benchmarks, dual-core Denver should come close to matching quad-core Cortex A15.
 

tviceman

Diamond Member
Mar 25, 2008
6,734
514
126
The A8 CPU performance should get close to Denver but will probably not match Denver in most benchmarks. The Geekbench 3 Single-Core and Google Octane v2.0 scores for Denver are reportedly more than 2x higher than A7-Cyclone!

[Image: NVIDIA's benchmark graph from the whitepaper]

Gotta wonder about that power consumption when under full load. It'd be nice if it was LESS than Tegra K1's 4 A15 cores.
 

jdubs03

Golden Member
Oct 1, 2013
1,294
904
136
Look carefully at the graph above that NVIDIA presented. The Haswell [2955U] core that they benchmarked scores much higher than both Cortex A15 (in Tegra 4 and Tegra K1 32-bit variant) and Krait 400 (in S800) in Geekbench 3 Single-Core. In several of the benchmark tests, this Haswell [2955U] core is superior in performance to the A7-Cyclone core.

S800, Tegra 4, Tegra K1 32-bit variant all have twice as many cores as Haswell [2955U], A7-Cyclone, and Denver, so naturally they would look good in comparison using Multi-Core benchmarks. That said, even in Multi-Core benchmarks, dual-core Denver should come close to matching quad-core Cortex A15.

In terms of single-thread performance (and multi), the 2955U gets a bit less than the A7 in 64-bit performance, though for 32-bit it does score considerably (~200 points single, ~500 points multi) higher. If Denver is only at ~2955U levels, I would be massively disappointed.

This information confuses me too, because the ST for TK1-32bit is ~1120, with multi at ~3450. How could Denver only score ~200 points higher than 32-bit? Doesn't make sense to me.

If Denver were to match the multi-thread score of K1-32bit (which it should, or else I consider that another disappointment), I would expect the single-thread to be significantly higher than 1300-1400. It looks like I may have been wrong with my prediction of a ST score of 2000+.
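To put rough numbers on that expectation, here is a back-of-the-envelope sketch using the TK1-32bit scores quoted above. It assumes multi-thread score is simply single-thread score times core count times a scaling efficiency, which is a simplification of how Geekbench behaves; the function names are hypothetical.

```python
# Geekbench 3 figures for TK1-32bit as quoted in this post (approximate).
TK1_32_SINGLE = 1120
TK1_32_MULTI = 3450
TK1_32_CORES = 4
DENVER_CORES = 2


def scaling_efficiency(single: float, multi: float, cores: int) -> float:
    """Fraction of ideal linear scaling actually achieved."""
    return multi / (single * cores)


def required_single(target_multi: float, cores: int, efficiency: float) -> float:
    """Single-thread score needed to hit target_multi with the given core count."""
    return target_multi / (cores * efficiency)


eff = scaling_efficiency(TK1_32_SINGLE, TK1_32_MULTI, TK1_32_CORES)

# Best case: the two Denver cores scale perfectly.
ideal = required_single(TK1_32_MULTI, DENVER_CORES, 1.0)
# Same scaling efficiency as the quad-core A15 part.
same_eff = required_single(TK1_32_MULTI, DENVER_CORES, eff)

print(f"TK1-32 scaling efficiency: {eff:.0%}")
print(f"Denver ST needed with perfect scaling: {ideal:.0f}")
print(f"Denver ST needed at TK1-32's efficiency: {same_eff:.0f}")
```

Under this simple model, matching the quad-core multi-thread score with two cores needs a single-thread score somewhere between those two figures, well above 1300-1400.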

The A8 CPU performance should get close to Denver but will probably not match Denver in most benchmarks. The Geekbench 3 Single-Core and Google Octane v2.0 scores for Denver are reportedly more than 2x higher than A7-Cyclone!

Source for this? I haven't fully looked at this stuff, but if the single-thread score was almost 2x the score of A7, the score would be around ~2700. And I don't think the A8 is going to get that close, though I think the A8 could get to around ~2100.
 

ams23

Senior member
Feb 18, 2013
907
0
0
Gotta wonder about that power consumption when under full load. It'd be nice if it was LESS than Tegra K1's 4 A15 cores.

Who knows, but it's pretty safe to say that CPU single-core perf. per watt is way better with Denver than R3 Cortex A15.

Here is some general info from the paper:

The most unique aspect of Denver is the dynamic code optimization. The core microarchitecture of the CPU is unique in that it has an in-order pipeline, but uses special software to reorder and optimize instruction traces. During repetitive code sequences, the Denver CPU collects dynamic runtime information during code execution and passes this information to the dynamic code optimizer; enabling the optimizer to assess more optimized ways for the code to be executed. The CPU uses hidden time slices to run the optimizer or can use the second core for optimizations for the active core.

The dynamic optimizer runs in its own private and protected state and is not visible to the operating system or any user code. The signed and encrypted dynamic optimizer code loads at boot into a protected part of main memory. By performing the reordering and register renaming in software, Denver eliminates the power hungry out-of-order control logic and yet it can achieve comparable results.

The profiler gathers info on program flow such as branch results (such as taken, not taken, strongly taken, and strongly not taken) and other hardware statistics tables and counters. The optimizer (Figure 1) recognizes opportunities to improve execution and then can rename registers, reorder loads and stores, improve control flow, remove redundant code, hoist redundant computations, perform loop unrolling, and other common optimizations. Because the run-time software performs optimization, the profiler can look over a much larger instruction window than is typically found in hardware out-of-order (OoO) designs. Denver could optimize over a 1,000 instruction window, while most OoO hardware is limited to a 192 instruction window or smaller. The dynamic code optimizer will continue to evaluate profile data and can perform additional optimizations on the fly.
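As a toy illustration of the idea in the paper (profile how often a code sequence runs, then swap in an optimized version once it is hot), here is a minimal sketch. Everything in it is made up for illustration: the `HOT_THRESHOLD` value, the tuple instruction format, and the `DynamicOptimizer` class. It only removes redundant computations in a straight-line trace and says nothing about how NVIDIA's actual optimizer is implemented.

```python
from collections import Counter

HOT_THRESHOLD = 3  # hypothetical: executions before a trace counts as "hot"


def run(trace, regs):
    """Interpret a straight-line trace of (dest, op, a, b) instructions."""
    def val(x):
        return regs[x] if isinstance(x, str) else x

    for dest, op, a, b in trace:
        if op == "add":
            regs[dest] = val(a) + val(b)
        elif op == "mul":
            regs[dest] = val(a) * val(b)
        elif op == "mov":
            regs[dest] = val(a)
        else:
            raise ValueError(f"unknown op {op!r}")
    return regs


def eliminate_redundant(trace):
    """Replace recomputed expressions with a copy of the earlier result."""
    available = {}  # (op, a, b) -> register currently holding that value
    out = []
    for dest, op, a, b in trace:
        key = (op, a, b)
        if op != "mov" and key in available:
            out.append((dest, "mov", available[key], None))
        else:
            out.append((dest, op, a, b))
        # Writing dest invalidates expressions that read it or were held in it.
        available = {k: r for k, r in available.items()
                     if r != dest and dest not in (k[1], k[2])}
        if op != "mov" and dest not in (a, b):
            available[key] = dest
    return out


class DynamicOptimizer:
    """Counts executions per trace and caches an optimized copy of hot ones."""

    def __init__(self):
        self.counts = Counter()
        self.cache = {}

    def execute(self, name, trace, regs):
        self.counts[name] += 1
        if name not in self.cache and self.counts[name] >= HOT_THRESHOLD:
            self.cache[name] = eliminate_redundant(trace)
        # Cold traces run as-is; hot traces run from the optimized cache.
        return run(self.cache.get(name, trace), regs)


trace = [
    ("t1", "mul", "x", "y"),
    ("t2", "add", "t1", 1),
    ("t3", "mul", "x", "y"),  # redundant: same value as t1
    ("out", "add", "t2", "t3"),
]
opt = DynamicOptimizer()
results = [opt.execute("loop_body", trace, {"x": 3, "y": 4})["out"]
           for _ in range(5)]
print(results)
print(opt.cache["loop_body"])
```

The first two runs execute the trace unchanged; from the third run on, the cached version replaces the second multiply with a register copy while producing the same result.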
 

ams23

Senior member
Feb 18, 2013
907
0
0
In terms of single-thread performance (and multi), the 2955U gets a bit less than the A7 in 64-bit performance, though for 32-bit it does score much higher. If Denver is only at ~2955U levels, I would be massively disappointed.
If Denver were to match the multi-thread score of K1-32bit (which it should, or else I consider that another disappointment), I would expect the single-thread to be considerably higher than 1300-1400. It looks like I may have been wrong with my prediction of a ST score of 2000+.

Source for this? I haven't fully looked at this stuff, but if the single-thread score was almost 2x the score of A7, the score would be around ~2700. And I don't think the A8 is going to get that close, though I think the A8 could get to around ~2100.

See the graph I linked to a few posts above from the paper: http://forums.anandtech.com/showpost.php?p=36609897&postcount=17

Denver should be significantly ahead of A7-Cyclone with most benchmarks. Looking at the Haswell variant that NVIDIA used, it is well ahead of A7-Cyclone in most of the benchmarks too other than Geekbench 3 Single-Core (where it is roughly equal, if perhaps just slightly behind), and DMIPS and Memset benchmarks (where it is behind).

That said, A8 CPU should have fairly similar performance to this Denver CPU.
 

tviceman

Diamond Member
Mar 25, 2008
6,734
514
126
Who knows, but it's pretty safe to say that CPU single-core perf. per watt is way better with Denver than R3 Cortex A15.

We'll see. The low-level software translation layer gives me worry. Transmeta did this, and to my limited knowledge, didn't have the best of success when it came to final product performance. Of course, we're talking ex Transmeta engineers creating Denver now, so who knows if they've been able to significantly improve it in the 7+ years since they last did it.

What is interesting to me is that it sounds like Denver is instruction set agnostic. If Nvidia were to ever acquire an x86 license, it'd be on. :wishful thinking:
 

ams23

Senior member
Feb 18, 2013
907
0
0
We'll see. The low-level software translation layer gives me worry. Transmeta did this, and to my limited knowledge, didn't have the best of success when it came to final product performance. Of course, we're talking ex Transmeta engineers creating Denver now, so who knows if they've been able to significantly improve it in the 7+ years since they last did it.

What is interesting to me is that it sounds like Denver is instruction set agnostic. If Nvidia were to ever acquire an x86 license, it'd be on. :wishful thinking:

Heh, yes, that is wishful thinking :)

According to NVIDIA:

The slight overhead of the dynamic optimization process is outweighed by the performance gains of already having optimized code ready to execute. In cases where code may not be frequently reused, Denver can process those ARM instructions directly without going through the dynamic optimization process, delivering the best of both worlds.

Dynamic Code Optimization works with all standard ARM-based applications, requiring no customization from developers, and without added power consumption versus other ARM mobile processors. That’s because the 7-wide superscalar design allows faster throughput than would otherwise be possible at the same clock speed.
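The "overhead is outweighed by the gains" claim is really a break-even argument: optimizing only pays off if the code runs often enough afterwards. A minimal sketch of that trade-off, with entirely hypothetical cost numbers in arbitrary units (none of these come from NVIDIA):

```python
import math


def break_even_runs(optimize_cost, direct_cost, optimized_cost):
    """Smallest number of executions at which optimizing first is cheaper overall.

    Returns None if the optimized code is not actually faster per run.
    """
    saving = direct_cost - optimized_cost
    if saving <= 0:
        return None
    # optimize_cost + n * optimized_cost < n * direct_cost
    #   =>  n > optimize_cost / saving
    return math.floor(optimize_cost / saving) + 1


# Hypothetical numbers: one-time optimization costs 1000 units, a direct run
# costs 10 units, and an optimized run costs 6 units.
n = break_even_runs(optimize_cost=1000, direct_cost=10, optimized_cost=6)
print(f"Optimization pays off from run {n} onward")

# Code that would run fewer times than this is better executed directly,
# which matches the "best of both worlds" description above.
```

With these made-up costs, a sequence has to repeat a few hundred times before the optimizer's work is repaid, which is why only frequently reused code goes through it.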
 

jdubs03

Golden Member
Oct 1, 2013
1,294
904
136
Heh, yes, that is wishful thinking :)

According to NVIDIA:

Alright, yeah, I was only going off the 2955U's ST perf; I didn't even glance at that chart (an abnormal occurrence there), and I now notice that it's quite a bit higher. My apologies.

Here is a bit of extrapolation data:
10505315_10152680450119136_9083153074905498410_n.jpg


I had to post this on Facebook to get this to work lol. The ST-perf is 33.33% higher than the A7. So ~1800-1950 (which makes sense as 1120*1.625 = 1820), not too far away from my 2000+ guess. It can give you an idea of the other metrics too.
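The arithmetic behind that range can be checked in a few lines. This is only a sketch of the extrapolation: the 1.625 multiplier and the 33.33% gain are the figures from this post, and the A7 baseline is back-computed from them rather than taken from a measurement.

```python
TK1_32_SINGLE = 1120        # TK1-32bit single-thread score quoted earlier
SCALE_VS_TK1_32 = 1.625     # multiplier used in this post
GAIN_VS_A7 = 1 / 3          # "33.33% higher than the A7"
ESTIMATE_RANGE = (1800, 1950)

# Route 1: scale the TK1-32bit score directly.
estimate_from_tk1 = TK1_32_SINGLE * SCALE_VS_TK1_32
print(f"From TK1-32bit: {estimate_from_tk1:.0f}")

# Route 2: work backwards to the A7 score that the 1800-1950 range implies.
implied_a7 = tuple(score / (1 + GAIN_VS_A7) for score in ESTIMATE_RANGE)
print(f"Implied A7 baseline: {implied_a7[0]:.0f} to {implied_a7[1]:.0f}")
```

The implied A7 baseline lines up with the earlier remark that 2x the A7 score would be around ~2700.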

To think that they can get this kind of performance at 28nm planar, while challenging Core M and its 14nm 2nd gen tri-gate (even with the impressive reduction from 11.5W TDP to 4.5W TDP, with ~+performance) is pretty outstanding. I would also assume the power consumption is on par with TK1-32bit.
 