Bay Trail benchmark appears online, crushes fastest Snapdragon ARM SoC


Khato

Golden Member
Jul 15, 2001
1,379
487
136
No, I recognized that you are quoting a valid equation but absolutely murdering the theory behind it. A die that is more than 10 times the size is not going to magically use the same amount of transistors as one that is a fraction of the size. You are right that P=F*C*V^2, you are absolutely wrong that Haswell's C = A15 or A57's C. Please, don't quote formulas that you do not understand.
I'm not claiming that Haswell's total capacitance is equal to that of any current ARM design. Did I ever state such? No, I was quite clear in stating "that Haswell (activity factor)*(capacitance) would have to be 2.38x that of the ARM SoC or greater in order for it to be less energy efficient." See that pesky little activity factor part? I even gave an example of such in one post just to be explicitly clear. Because yes Haswell has a lot more logic that allows it to execute a task faster, but just how many more transistors toggling does that actually translate into? I'll freely admit that I don't know, but it'd be quite surprising if it was over twice as many.
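
In equation form, the comparison I'm making has this general shape (just a sketch; the specific 2.38x figure depends on the voltages, clocks, and cycle counts assumed earlier in the thread). Dynamic energy for a fixed task, ignoring leakage, is roughly

\[
E_{\text{task}} \approx \alpha \, C \, V^{2} \, N_{\text{cyc}}
\]

where \(\alpha\) is the activity factor, \(C\) the switched capacitance, \(V\) the supply voltage, and \(N_{\text{cyc}}\) the cycles the task takes. Comparing Haswell (H) against an ARM SoC (A), Haswell spends more energy on the task only if

\[
\frac{\alpha_H C_H}{\alpha_A C_A} \;>\; \left(\frac{V_A}{V_H}\right)^{2}\frac{N_A}{N_H}
\]

which is exactly the "(activity factor)*(capacitance) would have to be 2.38x or greater" condition stated above.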

Oh, and I'd recommend against attempting to discredit my understanding of the subject. (In part for the fact that some might misconstrue such as a personal attack when it's not backed up by facts.) Sure I make mistakes from time to time, but what precisely have I said regarding dynamic power consumption that was incorrect?
 

sushiwarrior

Senior member
Mar 17, 2010
738
0
71
I'm not claiming that Haswell's total capacitance is equal to that of any current ARM design. Did I ever state such? No, I was quite clear in stating "that Haswell (activity factor)*(capacitance) would have to be 2.38x that of the ARM SoC or greater in order for it to be less energy efficient." See that pesky little activity factor part? I even gave an example of such in one post just to be explicitly clear. Because yes Haswell has a lot more logic that allows it to execute a task faster, but just how many more transistors toggling does that actually translate into? I'll freely admit that I don't know, but it'd be quite surprising if it was over twice as many.

Oh, and I'd recommend against attempting to discredit my understanding of the subject. (In part for the fact that some might misconstrue such as a personal attack when it's not backed up by facts.) Sure I make mistakes from time to time, but what precisely have I said regarding dynamic power consumption that was incorrect?

I think the major issue here is the whole "power efficient" part. Are you claiming Haswell requires less energy to complete the same task, or less energy to operate at the same speed, or what? What I would say is that given a task x to complete, Haswell will finish faster but use more energy the entire time. Now, the advantage of lower dynamic power consumption is probably kind of accurate (yes, I do think that ARM is capable of using 2.38x fewer transistors for the same job), but you're saying that Haswell is more energy efficient, as a blanket statement of sorts, when there are clearly other factors which could make the dynamic power consumption almost irrelevant.

My major issue is your blanket statement of "Haswell uses less volts, Haswell MUST be more energy efficient". There are so many factors that you don't know about, which makes that blanket statement essentially useless. I would understand "Haswell could potentially be more efficient for dynamic power" or something like that, but claiming that it is more efficient in GENERAL is very much unproven.
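
To illustrate the distinction with made-up numbers (purely hypothetical figures, not measurements of any chip):

Code:
# Energy to finish a fixed task is just average power x time to finish.
def energy_joules(avg_power_watts, seconds):
    return avg_power_watts * seconds

# Hypothetical chip A: draws more power but finishes sooner ("race to idle").
# Hypothetical chip B: draws less power but takes longer.
chip_a = energy_joules(avg_power_watts=8.0, seconds=1.0)   # 8 J for the task
chip_b = energy_joules(avg_power_watts=3.0, seconds=2.0)   # 6 J for the task
print(chip_a > chip_b)   # True: faster but less energy efficient here...

chip_a2 = energy_joules(avg_power_watts=4.0, seconds=1.0)  # 4 J for the task
print(chip_a2 > chip_b)  # False: ...and the opposite with different numbers

Which of those two cases Haswell vs. an ARM SoC actually falls into is exactly what we don't have the numbers for.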

EDIT: To be clear, my issue is
Today, the HTC One's Krait 300 cores need an operating voltage of 1.275 volts at full tilt to get to 1.7 GHz.

The Cortex-A15-based Samsung Exynos 5 Octa is at almost 1.3 volts when it's at full tilt.

Haswell at 3.8 GHz is at 1.05 volts, so it's much more power efficient. To put it bluntly: ARM's architecture as of now simply isn't designed for these high speeds at all, which is why battery life is so awful.

This is incorrect. Drawing conclusions across architectures and ISAs based on something as simple as voltage is wrong. If Haswell is more power efficient, it isn't JUST because the voltage is lower.
 
Last edited:

SiliconWars

Platinum Member
Dec 29, 2012
2,346
0
0
The footnote says it's based on Intel claims, which is an odd statement because just a few days before this ARM slide came out, Intel released a slide claiming the opposite: it showed Silvermont ahead of big.LITTLE in power and performance.

On what grounds did ARM feel justified in shifting the curves? Did they move their own chips to the right, or did they move Intel to the left? We don't know. Most likely ARM sampled Clover Trail and added 50% performance to the result. Is that guesstimate going to be representative of Bay Trail/Valley View? Probably not.



[Attached image: silvermont_dynamic_range1_large.jpg]

I'd hazard a guess that Intel used AnTuTu as part of their benchmarks and ARM didn't.
 

Khato

Golden Member
Jul 15, 2001
1,379
487
136
I think the major issue here is the whole "power efficient" part. Are you claiming Haswell requires less energy to complete the same task, or less energy to operate at the same speed, or what? What I would say is that given a task x to complete, Haswell will finish faster but use more energy the entire time. Now, the advantage of lower dynamic power consumption is probably kind of accurate (yes, I do think that ARM is capable of using 2.38x fewer transistors for the same job), but you're saying that Haswell is more energy efficient, as a blanket statement of sorts, when there are clearly other factors which could make the dynamic power consumption almost irrelevant.

My major issue is your blanket statement of "Haswell uses less volts, Haswell MUST be more energy efficient". There are so many factors that you don't know about, which makes that blanket statement essentially useless. I would understand "Haswell could potentially be more efficient for dynamic power" or something like that, but claiming that it is more efficient in GENERAL is very much unproven.
Where did I make such a blanket statement? My original response was simply to provide an explanation of how Haswell could have better efficiency, and then to inquire how ARM could be using less than 2.38x the number of transistors to perform the same workload as Haswell. (Yes, I made use of the term 'magical' because so many seem convinced that ARM has something extremely special about it.) Fact is, I can't really say anything about the specifics of the Haswell implementation, and I don't know the specifics of the ARM designs. But I do know that die size used for core logic doesn't necessarily have any correlation to activity factor, especially when talking about a fixed workload.

EDIT: To be clear, my issue is
Today, the HTC One's Krait 300 cores need an operating voltage of 1.275 volts at full tilt to get to 1.7 GHz.

The Cortex-A15-based Samsung Exynos 5 Octa is at almost 1.3 volts when it's at full tilt.

Haswell at 3.8 GHz is at 1.05 volts, so it's much more power efficient. To put it bluntly: ARM's architecture as of now simply isn't designed for these high speeds at all, which is why battery life is so awful.

This is incorrect. Drawing conclusions across architectures and ISAs based on something as simple as voltage is wrong. If Haswell is more power efficient, it isn't JUST because the voltage is lower.
Did I say otherwise?
Again, like was just discussed, power consumption is much less simple than "less volts = more efficient". There are many other factors to consider other than operating voltage, including leakage (die size), process differences, die capacitance, and other factors like that. Saying something as simple as "it uses less volts, it is more efficient" is hopelessly simplistic and absolutely false.
Absolutely false? Come now, earlier you fully recognized the fact that it can be true (edit: or at least you stopped attempting to say I was incorrect about the simple fundamental property of dynamic power consumption). Sure, the statement Mondozi made does qualify as a tad too simplistic, but voltage is definitely the major factor in dynamic power consumption... and it requires a marked difference in activity factor, capacitance, and frequency to make up for such large differences in voltage.
Hrmmmm, nope. In fact I agreed that Mondozi's statement was a bit too simplistic, but I took issue with calling it absolutely false when it could be correct. I'm not going to make a concrete claim about whether or not it actually is, though, until there are enough facts available to come to that conclusion.
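
For a rough sense of scale, using only the nominal full-tilt voltages quoted above (and ignoring everything else):

\[
\left(\frac{1.275}{1.05}\right)^{2} \approx 1.47,
\qquad
\left(\frac{1.3}{1.05}\right)^{2} \approx 1.53
\]

So at those voltages the ARM parts pay roughly 45-55% more dynamic energy per unit of switched capacitance per cycle, and the combined difference in activity factor, capacitance, and cycle count would have to at least cancel that out before the voltage advantage stops mattering.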
 

SiliconWars

Platinum Member
Dec 29, 2012
2,346
0
0
Intel could easily get an A15 core, but how would ARM test a Silvermont core?

They don't need to; Intel has already given numerous slides detailing exactly how much of a performance gain they are getting under various circumstances.

I do feel like I've made this point clear a few times already. So either you and the rest come out and say that Intel is lying or sandbagging or whatever, or you accept what they are saying.

PS, just an FYI: Intel *is* counting AnTuTu in their results, and if you paid more attention to their slides you'd know this already.
 

tential

Diamond Member
May 13, 2008
7,348
642
121
They don't need to; Intel has already given numerous slides detailing exactly how much of a performance gain they are getting under various circumstances.

I do feel like I've made this point clear a few times already. So either you and the rest come out and say that Intel is lying or sandbagging or whatever, or you accept what they are saying.

PS, just an FYI: Intel *is* counting AnTuTu in their results, and if you paid more attention to their slides you'd know this already.

I think it's a given, though, that you should never trust slides. For any slide issued by a company that turns out to be true, we could probably find 10 that weren't. Slides are basically just as good as speculation. Only when third parties do the testing is it really relevant.
 

SlimFan

Member
Jul 5, 2013
92
14
71
Intel's slides said which benchmarks were used. Yes, AnTuTu is on the list. But then again, so are SPEC, EEMBC, Dhrystone, Linpack, Quadrant, browser benchmarks, and others.

Which benchmarks, exactly, did ARM say they used when making the comparison slides? Somehow I didn't quite see that section. In fact, according to the ARM chart, an A7 is roughly the same performance at the high end as a Saltwell core. Given the fact that Saltwell can run at 2GHz, that seems to imply that the A7 has a 1.5x or so IPC advantage against Saltwell. Does anyone believe that? I don't think I've seen any data whatsoever that could possibly support that assertion. Could that be why ARM didn't happen to mention what workload(s) they were referring to?
 
Mar 10, 2006
11,715
2,012
126
Which benchmarks, exactly, did ARM say they used when making the comparison slides? Somehow I didn't quite see that section. In fact, according to the ARM chart, an A7 is roughly the same performance at the high end as a Saltwell core. Given the fact that Saltwell can run at 2GHz, that seems to imply that the A7 has a 1.5x or so IPC advantage against Saltwell. Does anyone believe that? I don't think I've seen any data whatsoever that could possibly support that assertion. Could that be why ARM didn't happen to mention what workload(s) they were referring to?

No, and that's why I think ARM is FOS.
 

blackened23

Diamond Member
Jul 26, 2011
8,548
2
0
I think it's a given, though, that you should never trust slides. For any slide issued by a company that turns out to be true, we could probably find 10 that weren't. Slides are basically just as good as speculation. Only when third parties do the testing is it really relevant.

Pretty much. I can't see Intel being outright deceptive with their marketing slides; nonetheless, you can't really base reality on what a marketing team spoon-feeds to the masses, be it ARM or Intel. Let's just wait for a product next month, and that's when the real fun will begin (or not) ;)
 

blackened23

Diamond Member
Jul 26, 2011
8,548
2
0
Pretty good overview of the new Bay Trail uarch here:

http://www.realworldtech.com/silvermont/

To cement this view, Intel compared Silvermont to Saltwell for tablets, showing performance data suggesting a 2× increase in CPU performance at constant power or a 5× reduction in power. Intel also claims a >60% performance and >3× power edge over projections for competing 28nm tablet SoCs expected later this year. In the more abstract realm of CPU to CPU comparisons, Intel indicated that Silvermont is a much faster design than the Cortex A7 or the A15 at any given power level; the goal is to avoid any complicated and high latency software-driven power management and rely on firmware-based management and power gating. As a design point, this is a good choice and follows in the steps of Apple’s Swift or Qualcomm’s Krait, which are undeniably successful CPU cores.

[Attached image: 900x900px-LL-b6c42921_silvermont-7.png]
 

Exophase

Diamond Member
Apr 19, 2012
4,439
9
81
Bay Trail has way better power optimizations, actually. It's a full SoC, and it offers asynchronous cores.

Most of the power benefit of asynchronous clocking comes from independent voltage rails, which AFAIK Silvermont doesn't support (unlike Krait).
 

Exophase

Diamond Member
Apr 19, 2012
4,439
9
81
Note that the benchmarks in which CT+ gets schooled are FPU-intensive benchmarks - and Saltwell was known for having a very seriously weak FPU.

Atom's FPU is not weak except for glaring problems with x87 and packed double-precision SSE, and most likely neither of those is used in any of those tests. Nominally it can dual-issue a 128-bit SP FADD + a 64-bit FMUL, which isn't bad - better than Bobcat. We've been through this before though, remember?

It's getting a beating on most of those benchmarks because most of what they're comparing it against has more cores (and a lot of the benches scale unnaturally well with core count), more perf/MHz, and in a lot of cases more MHz.
 
Mar 10, 2006
11,715
2,012
126
Most of the power benefit of asynchronous clocking comes from independent voltage rails, which AFAIK Silvermont doesn't support (unlike Krait).

Exo,

From the recent AT article on SLM,

In all Intel Core based microprocessors, all cores are tied to the same frequency - those that aren’t in use are simply shut off (power gated) to save power. Qualcomm’s multi-core architecture has always supported independent frequency planes for all CPUs in the SoC, something that Intel has always insisted was a bad idea. In a strange turn of events, Intel joins Qualcomm in offering the ability to run each core in a Silvermont module at its own independent frequency. You could have one Silvermont core running at 2.4GHz and another one running at 1.2GHz. Unlike Qualcomm’s implementation, Silvermont’s independent frequency planes are optional. In a split frequency case, the shared L2 cache always runs at the higher of the two frequencies. Intel believes the flexibility might be useful in some low cost Silvermont implementations where the OS actively uses core pinning to keep threads parked on specific cores. I doubt we’ll see this on most tablet or smartphone implementations of the design.

It's getting a beating on most of those benchmarks because most of what they're comparing it against has more cores (and a lot of the benches scale unnaturally well with core count), more perf/MHz, and in a lot of cases more MHz.

Makes sense.
 

Dresdenboy

Golden Member
Jul 28, 2003
1,730
554
136
citavia.blog.de
Atom's FPU is not weak except for glaring problems with x87 and packed double-precision SSE, and most likely neither of those is used in any of those tests. Nominally it can dual-issue a 128-bit SP FADD + a 64-bit FMUL, which isn't bad - better than Bobcat. We've been through this before though, remember?

It's getting a beating on most of those benchmarks because most of what they're comparing it against has more cores (and a lot of the benches scale unnaturally well with core count), more perf/MHz, and in a lot of cases more MHz.

If OoO execution brings about 30% on integer code, with its mostly single-cycle-latency ops, what effect would you expect when running FP code, with its multi-cycle latencies, out of order?

The use of SMT in Atom cores has shown huge jumps in FP performance, so the FPU isn't throughput bound.
 

Exophase

Diamond Member
Apr 19, 2012
4,439
9
81
Exo,

From the recent AT article on SLM.

Yeah, I read that, and even double-checked it before posting. It doesn't mention anything about independent voltage rails, and given how the article isn't very positive about the capability, it'd be very strange if Silvermont had it. Separate rails are a substantial design investment, both in the SoC itself and in the supporting PMIC; it's not something to be taken lightly.

But it's a well documented fact that Qualcomm's aSMP (in the original Scorpion as well as Krait) has separate voltage rails for each CPU core (and L2 cache, and GPU). I don't know why the AT article didn't mention this in that section, although Brian Klug does mention it in a tweet (https://twitter.com/nerdtalker/status/346097950327459840).

From Qualcomm:

Independent clock and voltage:
Each core in the aSMP system has a dedicated voltage and clock including the L2 cache. This enables each CPU core to run at the most efficient power point or voltage and frequency depending on the type of workload being executed.

http://www.qualcomm.com/media/docum...em-on-chip-solutions-for-a-new-mobile-age.pdf
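
As a rough illustration of why the rails matter, here's a toy model of dynamic power only, with a made-up V/f table (nothing vendor-specific):

Code:
# Toy model: per-core dynamic power ~ C * V^2 * f, with a hypothetical V/f table.
vf_table = {0.8: 0.85, 1.2: 0.95, 1.7: 1.10, 2.3: 1.25}   # GHz -> required volts

def dyn_power(freq_ghz, volts, cap=1.0):
    return cap * volts ** 2 * freq_ghz        # arbitrary units

core_clocks = [2.3, 0.8, 0.8, 0.8]            # one busy core, three lightly loaded cores

# Shared rail: every core sits at the voltage the fastest core needs.
shared_v = vf_table[max(core_clocks)]
p_shared = sum(dyn_power(f, shared_v) for f in core_clocks)

# Independent rails: each core drops to the minimum voltage for its own clock.
p_split = sum(dyn_power(f, vf_table[f]) for f in core_clocks)

print(round(p_shared, 2), round(p_split, 2))  # ~7.34 vs ~5.33 in these toy units

Per-core frequency scaling on a shared rail only gets you the frequency term; the V^2 term stays pinned to whatever the busiest core needs, which is why separate rails are the bigger win.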

Dresdenboy said:
If OoO execution brings about 30% on integer code using the mostly single cycle latency ops, what effect would you expect when running FP code OoO w/ multi cycle latencies?

The use of SMT in Atom cores has shown huge jumps in FP performance, so the FPU isn't throughput bound.

But Silvermont's FP/SIMD queues are in-order (even though they call them reservation stations). RWT says that it has register renaming for FP, but I think this is probably an error - it'd be pretty weird to have a ROB for the integer part but renaming via a PRF for the FP part. Intel's block diagrams don't hint at any renaming for the FP pipes. Not that it'd make a big difference for in-order execution.

The two pipes are decoupled so they do execute out of order with respect to each other. For FP code that could mean that a long chain of FADDs followed by a long chain of FMULs could execute in parallel. But I think this benefit is minor compared to full OoO.
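
A toy model of what that buys - this is not Silvermont's actual pipeline, just two in-order queues where each chain of ops depends on itself with a made-up 3-cycle latency:

Code:
# Two execution pipes, each strictly in order with respect to itself. Every op
# depends on the previous op of the same type (a chain), ops issue one per cycle
# in program order, and each op takes 3 cycles to produce its result.
# decoupled=False additionally forces strict program order across the two pipes.
LAT = 3

def finish_cycle(program, decoupled):
    chain_ready = {"fadd": 0, "fmul": 0}   # when the previous op in each chain finishes
    pipe_free = {"fadd": 0, "fmul": 0}     # each pipe starts at most one op per cycle
    oldest_start = 0                       # for the coupled case
    last = 0
    for i, op in enumerate(program):
        start = max(i, pipe_free[op], chain_ready[op])
        if not decoupled:
            start = max(start, oldest_start)      # can't start before older ops started
        finish = start + LAT
        pipe_free[op] = start + 1
        chain_ready[op] = finish
        oldest_start = max(oldest_start, start)
        last = max(last, finish)
    return last

prog = ["fadd"] * 6 + ["fmul"] * 6         # long FADD chain, then long FMUL chain
print(finish_cycle(prog, decoupled=False), finish_cycle(prog, decoupled=True))  # 33 vs 24

The decoupled case overlaps the two chains, but neither case can reorder within a chain, which is why the benefit is limited next to full OoO.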

Despite this, Intel says that it gets a bigger benefit in Silvermont from FP code than from integer code. If they're including any significant amount of x87 and packed DP code, that'd make a big difference. It's also possible that the FP code is more memory-sensitive and gets better results from the latency hiding and MLP the improved memory subsystem provides, but I'd expect integer code to be more sensitive to this; FP code tends to be more data-regular. In that regard it's possible that the improved data prefetch is helping more.
 

TuxDave

Lifer
Oct 8, 2002
10,571
3
71
a long chain of FADDs followed by a long chain of FMULs could execute in parallel. But I think this benefit is minor compared to full OoO.

An interesting thing to note is that if a typical FP-intensive workload is a long chain of dependent FP ops, OoO doesn't matter (as much).
 

Exophase

Diamond Member
Apr 19, 2012
4,439
9
81
An interesting thing to note is that if a typical FP-intensive workload is a long chain of dependent FP ops, OoO doesn't matter (as much).

Sure, for some value of "long" that exceeds the reordering capability of the CPU. Even a pretty modest execution window will capture a lot of FP loop kernels, at least the ones that have at least partially independent iterations.
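
Back-of-the-envelope example of the kind of kernel a small window still helps with (assuming a hypothetical 4-cycle FADD latency and a throughput of 1 add per cycle):

Code:
# A sum-reduction with one accumulator is bound by the dependent-add latency;
# splitting it across k independent accumulators is the "partially independent
# iterations" case, and a modest OoO window (or a compiler) can overlap them.
def reduction_cycles(n_adds, latency, accumulators):
    per_chain = (n_adds / accumulators) * latency   # each chain is still serial
    throughput_bound = n_adds                       # at best 1 add per cycle
    return max(per_chain, throughput_bound)

for k in (1, 2, 4, 8):
    print(k, reduction_cycles(1024, latency=4, accumulators=k))
# 1 -> 4096, 2 -> 2048, 4 -> 1024, 8 -> 1024 (throughput-bound once k >= latency)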
 

TuxDave

Lifer
Oct 8, 2002
10,571
3
71
Also one more quick comment.

RWT says that it has register renaming for FP but I think this is probably an error

The two pipes are decoupled so they do execute out of order with respect to each other. For FP code that could mean that a long chain of FADDs followed by a long chain of FMULs could execute in parallel.

Your second paragraph would imply the need for register renaming.
 

Exophase

Diamond Member
Apr 19, 2012
4,439
9
81
Also one more quick comment.

Your second paragraph would imply the need for register renaming.

How do you figure? Even full OoOE doesn't need register renaming. It's just there to alleviate false dependencies and increase available ILP.
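
Minimal sketch of what renaming is for (toy code, nothing to do with any real core): each architectural destination gets a fresh physical register, so reusing a register name stops creating ordering constraints, while true read-after-write dependencies survive.

Code:
import itertools

def rename(instructions):
    new_phys = itertools.count()
    rat = {}                                        # architectural reg -> latest physical reg
    renamed = []
    for dest, srcs in instructions:
        srcs = tuple(rat.get(s, s) for s in srcs)   # read the newest mapping of each source
        rat[dest] = "p%d" % next(new_phys)          # fresh destination breaks WAW/WAR deps
        renamed.append((rat[dest], srcs))
    return renamed

prog = [
    ("r1", ("r2", "r3")),   # r1 = r2 + r3
    ("r4", ("r1",)),        # r4 = f(r1)      true (RAW) dependency on the first op
    ("r1", ("r5", "r6")),   # r1 = r5 + r6    only reuses the *name* r1
]
print(rename(prog))
# [('p0', ('r2', 'r3')), ('p1', ('p0',)), ('p2', ('r5', 'r6'))]
# After renaming, the third op shares nothing with the first two and is free to run
# ahead of them; without renaming, a scheduler has to treat the WAW/WAR hazards on
# r1 as ordering constraints (or simply finds less ILP).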