Bay Trail benchmark appears online, crushes fastest Snapdragon ARM SoC


Khato

Golden Member
Jul 15, 2001
1,379
487
136
No, I recognized that you are quoting a valid equation but absolutely murdering the theory behind it. A die that is more than 10 times the size is not going to magically use the same amount of transistors as one that is a fraction of the size. You are right that P=F*C*V^2, you are absolutely wrong that Haswell's C = A15 or A57's C. Please, don't quote formulas that you do not understand.
I'm not claiming that Haswell's total capacitance is equal to that of any current ARM design. Did I ever state such? No, I was quite clear in stating "that Haswell (activity factor)*(capacitance) would have to be 2.38x that of the ARM SoC or greater in order for it to be less energy efficient." See that pesky little activity factor part? I even gave an example of such in one post just to be explicitly clear. Because yes Haswell has a lot more logic that allows it to execute a task faster, but just how many more transistors toggling does that actually translate into? I'll freely admit that I don't know, but it'd be quite surprising if it was over twice as many.
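
In equation form, the comparison I'm making has this general shape (just a sketch; the specific 2.38x figure depends on the voltages, clocks, and cycle counts assumed earlier in the thread). Dynamic energy for a fixed task, ignoring leakage, is roughly

\[
E_{\text{task}} \approx \alpha \, C \, V^{2} \, N_{\text{cyc}}
\]

where \(\alpha\) is the activity factor, \(C\) the switched capacitance, \(V\) the supply voltage, and \(N_{\text{cyc}}\) the cycles the task takes. Comparing Haswell (H) against an ARM SoC (A), Haswell spends more energy on the task only if

\[
\frac{\alpha_H C_H}{\alpha_A C_A} \;>\; \left(\frac{V_A}{V_H}\right)^{2}\frac{N_A}{N_H}
\]

which is exactly the "(activity factor)*(capacitance) would have to be 2.38x or greater" condition stated above.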

Oh, and I'd recommend against attempting to discredit my understanding of the subject. (In part for the fact that some might misconstrue such as a personal attack when it's not backed up by facts.) Sure I make mistakes from time to time, but what precisely have I said regarding dynamic power consumption that was incorrect?
 

sushiwarrior

Senior member
Mar 17, 2010
738
0
71
I'm not claiming that Haswell's total capacitance is equal to that of any current ARM design. Did I ever state such? No, I was quite clear in stating "that Haswell (activity factor)*(capacitance) would have to be 2.38x that of the ARM SoC or greater in order for it to be less energy efficient." See that pesky little activity factor part? I even gave an example of such in one post just to be explicitly clear. Because yes Haswell has a lot more logic that allows it to execute a task faster, but just how many more transistors toggling does that actually translate into? I'll freely admit that I don't know, but it'd be quite surprising if it was over twice as many.

Oh, and I'd recommend against attempting to discredit my understanding of the subject. (In part for the fact that some might misconstrue such as a personal attack when it's not backed up by facts.) Sure I make mistakes from time to time, but what precisely have I said regarding dynamic power consumption that was incorrect?

I think the major issue here is the whole "power efficient" part. Are you claiming Haswell requires less energy to complete the same task, or less energy to operate at the same speed, or what? What I would say is that given a task x to complete, Haswell will finish faster but use more energy the entire time. Now, the advantage of lower dynamic power consumption is probably kind of accurate (yes, I do think that ARM is capable of using 2.38x fewer transistors for the same job), but you're saying that Haswell is more energy efficient, as a blanket statement of sorts, when there are clearly other factors which could make the dynamic power consumption almost irrelevant.

My major issue is your blanket statement of "Haswell uses less volts, Haswell MUST be more energy efficient". There are so many factors that you don't know about, which makes that blanket statement essentially useless. I would understand "Haswell could potentially be more efficient for dynamic power" or something like that, but claiming that it is more efficient in GENERAL is very much unproven.
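
To illustrate the distinction with made-up numbers (purely hypothetical figures, not measurements of any chip):

Code:
# Energy to finish a fixed task is just average power x time to finish.
def energy_joules(avg_power_watts, seconds):
    return avg_power_watts * seconds

# Hypothetical chip A: draws more power but finishes sooner ("race to idle").
# Hypothetical chip B: draws less power but takes longer.
chip_a = energy_joules(avg_power_watts=8.0, seconds=1.0)   # 8 J for the task
chip_b = energy_joules(avg_power_watts=3.0, seconds=2.0)   # 6 J for the task
print(chip_a > chip_b)   # True: faster but less energy efficient here...

chip_a2 = energy_joules(avg_power_watts=4.0, seconds=1.0)  # 4 J for the task
print(chip_a2 > chip_b)  # False: ...and the opposite with different numbers

Which of those two cases Haswell vs. an ARM SoC actually falls into is exactly what we don't have the numbers for.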

EDIT: To be clear, my issue is
Today, the HTC One's Krait 300 cores need an operating voltage of 1.275 volts at full tilt to get to 1.7 GHz.

The Cortex-A15-based Samsung Exynos 5 Octa is at almost 1.3 volts when it's at full tilt.

Haswell at 3.8 GHz is at 1.05 volts, so it's much more power efficient. To put it bluntly: ARM's architecture as of now simply isn't designed for these high speeds at all, which is why battery life is so awful.

This is incorrect. Drawing conclusions across architectures and ISAs based on something as simple as voltage is wrong. If Haswell is more power efficient, it isn't JUST because the voltage is lower.
 
Last edited:

SiliconWars

Platinum Member
Dec 29, 2012
2,346
0
0
The footnote says it's based on Intel claims, which is an odd statement because just a few days before this ARM slide came out, Intel released a slide claiming the opposite: it showed Silvermont ahead of big.LITTLE in power and performance.

On what grounds did ARM feel justified in shifting the curves? Did they move their own chips to the right, or did they move Intel to the left? We don't know. Most likely ARM sampled Clover Trail and added 50% performance to the result. Is that guesstimate going to be representative of Bay Trail/Valley View? Probably not.



[Attached image: silvermont_dynamic_range1_large.jpg]

I'd hazard a guess that Intel used AnTuTu as part of their benchmarks and ARM didn't.
 

Khato

Golden Member
Jul 15, 2001
1,379
487
136
I think the major issue here is the whole "power efficient" part. Are you claiming Haswell requires less energy to complete the same task, or less energy to operate at the same speed, or what? What I would say is that given a task x to complete, Haswell will finish faster but use more energy the entire time. Now, the advantage of lower dynamic power consumption is probably kind of accurate (yes, I do think that ARM is capable of using 2.38x fewer transistors for the same job), but you're saying that Haswell is more energy efficient, as a blanket statement of sorts, when there are clearly other factors which could make the dynamic power consumption almost irrelevant.

My major issue is your blanket statement of "Haswell uses less volts, Haswell MUST be more energy efficient". There are so many factors that you don't know about, which makes that blanket statement essentially useless. I would understand "Haswell could potentially be more efficient for dynamic power" or something like that, but claiming that it is more efficient in GENERAL is very much unproven.
Where did I make such a blanket statement? My original response was simply to provide an explanation of how Haswell could have better efficiency, and then to inquire how ARM could be using less than 2.38x the number of transistors to perform the same workload as Haswell. (Yes, I made use of the term 'magical' because so many seem convinced that ARM has something extremely special about it.) Fact is, I can't really say anything about the specifics of the Haswell implementation, and I don't know the specifics of the ARM designs. But I do know that die size used for core logic doesn't necessarily have any correlation to activity factor, especially when talking about a fixed workload.

EDIT: To be clear, my issue is
Today, the HTC One's Krait 300 cores need an operating voltage of 1.275 volts at full tilt to get to 1.7 GHz.

The Cortex-A15-based Samsung Exynos 5 Octa is at almost 1.3 volts when it's at full tilt.

Haswell at 3.8 GHz is at 1.05 volts, so it's much more power efficient. To put it bluntly: ARM's architecture as of now simply isn't designed for these high speeds at all, which is why battery life is so awful.

This is incorrect. Drawing conclusions across architectures and ISAs based on something as simple as voltage is wrong. If Haswell is more power efficient, it isn't JUST because the voltage is lower.
Did I say otherwise?
Again, like was just discussed, power consumption is much less simple than "less volts = more efficient". There are many other factors to consider other than operating voltage, including leakage (die size), process differences, die capacitance, and other factors like that. Saying something as simple as "it uses less volts, it is more efficient" is hopelessly simplistic and absolutely false.
Absolutely false? Come now, earlier you fully recognized the fact that it can be true (edit: or at least you stopped attempting to say I was incorrect about the simple fundamental property of dynamic power consumption). Sure, the statement Mondozi made does qualify as a tad too simplistic, but voltage is definitely the major factor in dynamic power consumption... and it requires a marked difference in activity factor, capacitance, and frequency to make up for such large differences in voltage.
Hrmmmm, nope. In fact I agreed that Mondozi's statement was a bit too simplistic, but I took issue with calling it absolutely false when it could be correct. I'm not going to make a concrete claim about whether or not it actually is, though, until there are enough facts available to come to that conclusion.
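
For a rough sense of scale, using only the nominal full-tilt voltages quoted above (and ignoring everything else):

\[
\left(\frac{1.275}{1.05}\right)^{2} \approx 1.47,
\qquad
\left(\frac{1.3}{1.05}\right)^{2} \approx 1.53
\]

So at those voltages the ARM parts pay roughly 45-55% more dynamic energy per unit of switched capacitance per cycle, and the combined difference in activity factor, capacitance, and cycle count would have to at least cancel that out before the voltage advantage stops mattering.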
 

SiliconWars

Platinum Member
Dec 29, 2012
2,346
0
0
Intel could easily get an A15 core, but how would ARM test a Silvermont core?

They don't need to; Intel has already given numerous slides detailing exactly how much of a performance gain they are getting under various circumstances.

I do feel like I've made this point clear a few times already. So either you and the rest come out and say that Intel is lying or sandbagging or whatever, or you accept what they are saying.

PS, just an FYI: Intel *is* counting AnTuTu in their results, and if you paid more attention to their slides you'd know this already.
 

tential

Diamond Member
May 13, 2008
7,348
642
121
They don't need to; Intel has already given numerous slides detailing exactly how much of a performance gain they are getting under various circumstances.

I do feel like I've made this point clear a few times already. So either you and the rest come out and say that Intel is lying or sandbagging or whatever, or you accept what they are saying.

PS, just an FYI: Intel *is* counting AnTuTu in their results, and if you paid more attention to their slides you'd know this already.

I think it's a given, though, that you should never trust slides. For any slide issued by a company that turns out to be true, we could probably find 10 that weren't. Slides are basically just as good as speculation. Only when third parties do the testing is it really relevant.
 

SlimFan

Member
Jul 5, 2013
92
14
71
Intel's slides said which benchmarks were used. Yes, AnTuTu is on the list. But then again, so are SPEC, EEMBC, Dhrystone, Linpack, Quadrant, browser benchmarks, and others.

Which benchmarks, exactly, did ARM say they used when making the comparison slides? Somehow I didn't quite see that section. In fact, according to the ARM chart, an A7 is roughly the same performance at the high end as a Saltwell core. Given the fact that Saltwell can run at 2GHz, that seems to imply that the A7 has a 1.5x or so IPC advantage against Saltwell. Does anyone believe that? I don't think I've seen any data whatsoever that could possibly support that assertion. Could that be why ARM didn't happen to mention what workload(s) they were referring to?
 
Mar 10, 2006
11,715
2,012
126
Which benchmarks, exactly, did ARM say they used when making the comparison slides? Somehow I didn't quite see that section. In fact, according to the ARM chart, an A7 is roughly the same performance at the high end as a Saltwell core. Given the fact that Saltwell can run at 2GHz, that seems to imply that the A7 has a 1.5x or so IPC advantage against Saltwell. Does anyone believe that? I don't think I've seen any data whatsoever that could possibly support that assertion. Could that be why ARM didn't happen to mention what workload(s) they were referring to?

No, and that's why I think ARM is FOS.
 

blackened23

Diamond Member
Jul 26, 2011
8,548
2
0
I think it's a given, though, that you should never trust slides. For any slide issued by a company that turns out to be true, we could probably find 10 that weren't. Slides are basically just as good as speculation. Only when third parties do the testing is it really relevant.

Pretty much. I can't see Intel being outright deceptive with their marketing slides; nonetheless, you can't really base reality on what a marketing team spoon-feeds to the masses, be it ARM or Intel. Let's just wait for a product next month, and that's when the real fun will begin (or not) ;)
 

blackened23

Diamond Member
Jul 26, 2011
8,548
2
0
Pretty good overview of the new Bay Trail uarch here:

http://www.realworldtech.com/silvermont/

To cement this view, Intel compared Silvermont to Saltwell for tablets, showing performance data suggesting a 2× increase in CPU performance at constant power or a 5× reduction in power. Intel also claims a >60% performance and >3× power edge over projections for competing 28nm tablet SoCs expected later this year. In the more abstract realm of CPU to CPU comparisons, Intel indicated that Silvermont is a much faster design than the Cortex A7 or the A15 at any given power level; the goal is to avoid any complicated and high latency software-driven power management and rely on firmware-based management and power gating. As a design point, this is a good choice and follows in the steps of Apple’s Swift or Qualcomm’s Krait, which are undeniably successful CPU cores.

[Attached image: 900x900px-LL-b6c42921_silvermont-7.png]
 

Exophase

Diamond Member
Apr 19, 2012
4,439
9
81
Bay Trail has way better power optimizations, actually. It's a full SoC, and it offers asynchronous cores.

Most of the power benefit of asynchronous clocking comes from independent voltage rails, which AFAIK Silvermont doesn't support (unlike Krait).
 

Exophase

Diamond Member
Apr 19, 2012
4,439
9
81
Note that the benchmarks in which CT+ gets schooled are FPU-intensive benchmarks - and Saltwell was known for having a very seriously weak FPU.

Atom's FPU is not weak except for glaring problems with x87 and packed double-precision SSE, and most likely neither of those is used in any of those tests. Nominally it can dual-issue a 128-bit SP FADD + a 64-bit FMUL, which isn't bad - better than Bobcat. We've been through this before though, remember?

It's getting a beating on most of those benchmarks because most of what they're comparing it against has more cores (and a lot of the benches scale unnaturally well with core count), more perf/MHz, and in a lot of cases more MHz.
 
Mar 10, 2006
11,715
2,012
126
Most of the power benefit of asynchronous clocking comes from independent voltage rails, which AFAIK Silvermont doesn't support (unlike Krait).

Exo,

From the recent AT article on SLM,

In all Intel Core based microprocessors, all cores are tied to the same frequency - those that aren’t in use are simply shut off (power gated) to save power. Qualcomm’s multi-core architecture has always supported independent frequency planes for all CPUs in the SoC, something that Intel has always insisted was a bad idea. In a strange turn of events, Intel joins Qualcomm in offering the ability to run each core in a Silvermont module at its own independent frequency. You could have one Silvermont core running at 2.4GHz and another one running at 1.2GHz. Unlike Qualcomm’s implementation, Silvermont’s independent frequency planes are optional. In a split frequency case, the shared L2 cache always runs at the higher of the two frequencies. Intel believes the flexibility might be useful in some low cost Silvermont implementations where the OS actively uses core pinning to keep threads parked on specific cores. I doubt we’ll see this on most tablet or smartphone implementations of the design.

It's getting a beating on most of those benchmarks because most of what they're comparing it against has more cores (and a lot of the benches scale unnaturally well with core count), more perf/MHz, and in a lot of cases more MHz.

Makes sense.
 

Dresdenboy

Golden Member
Jul 28, 2003
1,730
554
136
citavia.blog.de
Atom's FPU is not weak except for glaring problems with x87 and packed double-precision SSE, and most likely neither of those is used in any of those tests. Nominally it can dual-issue a 128-bit SP FADD + a 64-bit FMUL, which isn't bad - better than Bobcat. We've been through this before though, remember?

It's getting a beating on most of those benchmarks because most of what they're comparing it against has more cores (and a lot of the benches scale unnaturally well with core count), more perf/MHz, and in a lot of cases more MHz.

If OoO execution brings about 30% on integer code, with its mostly single-cycle-latency ops, what effect would you expect when running FP code, with its multi-cycle latencies, out of order?

The use of SMT in Atom cores has shown huge jumps in FP performance, so the FPU isn't throughput bound.
 

Exophase

Diamond Member
Apr 19, 2012
4,439
9
81
Exo,

From the recent AT article on SLM.

Yeah, I read that, and even double-checked it before posting. It doesn't mention anything about independent voltage rails, and given how the article isn't very positive about the capability, it'd be very strange if Silvermont had it. Separate rails are a substantial design investment, both in the SoC itself and in the supporting PMIC; it's not something to be taken lightly.

But it's a well documented fact that Qualcomm's aSMP (in the original Scorpion as well as Krait) has separate voltage rails for each CPU core (and L2 cache, and GPU). I don't know why the AT article didn't mention this in that section, although Brian Klug does mention it in a tweet (https://twitter.com/nerdtalker/status/346097950327459840).

From Qualcomm:

Independent clock and voltage:
Each core in the aSMP system has a dedicated voltage and clock including the L2 cache. This enables each CPU core to run at the most efficient power point or voltage and frequency depending on the type of workload being executed.

http://www.qualcomm.com/media/docum...em-on-chip-solutions-for-a-new-mobile-age.pdf
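
As a rough illustration of why the rails matter, here's a toy model of dynamic power only, with a made-up V/f table (nothing vendor-specific):

Code:
# Toy model: per-core dynamic power ~ C * V^2 * f, with a hypothetical V/f table.
vf_table = {0.8: 0.85, 1.2: 0.95, 1.7: 1.10, 2.3: 1.25}   # GHz -> required volts

def dyn_power(freq_ghz, volts, cap=1.0):
    return cap * volts ** 2 * freq_ghz        # arbitrary units

core_clocks = [2.3, 0.8, 0.8, 0.8]            # one busy core, three lightly loaded cores

# Shared rail: every core sits at the voltage the fastest core needs.
shared_v = vf_table[max(core_clocks)]
p_shared = sum(dyn_power(f, shared_v) for f in core_clocks)

# Independent rails: each core drops to the minimum voltage for its own clock.
p_split = sum(dyn_power(f, vf_table[f]) for f in core_clocks)

print(round(p_shared, 2), round(p_split, 2))  # ~7.34 vs ~5.33 in these toy units

Per-core frequency scaling on a shared rail only gets you the frequency term; the V^2 term stays pinned to whatever the busiest core needs, which is why separate rails are the bigger win.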

Dresdenboy said:
If OoO execution brings about 30% on integer code using the mostly single cycle latency ops, what effect would you expect when running FP code OoO w/ multi cycle latencies?

The use of SMT in Atom cores has shown huge jumps in FP performance, so the FPU isn't throughput bound.

But Silvermont's FP/SIMD queues are in-order (even though they call them reservation stations). RWT says that it has register renaming for FP, but I think this is probably an error - it'd be pretty weird to have a ROB for the integer part but renaming via a PRF for the FP part. Intel's block diagrams don't hint at any renaming for the FP pipes. Not that it'd make a big difference for in-order execution.

The two pipes are decoupled so they do execute out of order with respect to each other. For FP code that could mean that a long chain of FADDs followed by a long chain of FMULs could execute in parallel. But I think this benefit is minor compared to full OoO.
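
A toy model of what that buys - this is not Silvermont's actual pipeline, just two in-order queues where each chain of ops depends on itself with a made-up 3-cycle latency:

Code:
# Two execution pipes, each strictly in order with respect to itself. Every op
# depends on the previous op of the same type (a chain), ops issue one per cycle
# in program order, and each op takes 3 cycles to produce its result.
# decoupled=False additionally forces strict program order across the two pipes.
LAT = 3

def finish_cycle(program, decoupled):
    chain_ready = {"fadd": 0, "fmul": 0}   # when the previous op in each chain finishes
    pipe_free = {"fadd": 0, "fmul": 0}     # each pipe starts at most one op per cycle
    oldest_start = 0                       # for the coupled case
    last = 0
    for i, op in enumerate(program):
        start = max(i, pipe_free[op], chain_ready[op])
        if not decoupled:
            start = max(start, oldest_start)      # can't start before older ops started
        finish = start + LAT
        pipe_free[op] = start + 1
        chain_ready[op] = finish
        oldest_start = max(oldest_start, start)
        last = max(last, finish)
    return last

prog = ["fadd"] * 6 + ["fmul"] * 6         # long FADD chain, then long FMUL chain
print(finish_cycle(prog, decoupled=False), finish_cycle(prog, decoupled=True))  # 33 vs 24

The decoupled case overlaps the two chains, but neither case can reorder within a chain, which is why the benefit is limited next to full OoO.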

Despite this, Intel says that it gets a bigger benefit in Silvermont from FP code than from integer code. If they're including any significant amount of x87 and packed DP code, that'd make a big difference. It's also possible that the FP code is more memory-sensitive and gets better results from the latency hiding and MLP the improved memory subsystem provides, but I'd expect integer code to be more sensitive to this; FP code tends to be more data-regular. In that regard it's possible that the improved data prefetch is helping more.
 

TuxDave

Lifer
Oct 8, 2002
10,571
3
71
a long chain of FADDs followed by a long chain of FMULs could execute in parallel. But I think this benefit is minor compared to full OoO.

An interesting thing to note is that if a typical FP-intensive workload is a long chain of dependent FP ops, OoO doesn't matter (as much).
 

Exophase

Diamond Member
Apr 19, 2012
4,439
9
81
An interesting thing to note is that if a typical FP-intensive workload is a long chain of dependent FP ops, OoO doesn't matter (as much).

Sure, for some value of "long" that exceeds the reordering capability of the CPU. Even a pretty modest execution window will capture a lot of FP loop kernels, at least the ones that have at least partially independent iterations.
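
Back-of-the-envelope example of the kind of kernel a small window still helps with (assuming a hypothetical 4-cycle FADD latency and a throughput of 1 add per cycle):

Code:
# A sum-reduction with one accumulator is bound by the dependent-add latency;
# splitting it across k independent accumulators is the "partially independent
# iterations" case, and a modest OoO window (or a compiler) can overlap them.
def reduction_cycles(n_adds, latency, accumulators):
    per_chain = (n_adds / accumulators) * latency   # each chain is still serial
    throughput_bound = n_adds                       # at best 1 add per cycle
    return max(per_chain, throughput_bound)

for k in (1, 2, 4, 8):
    print(k, reduction_cycles(1024, latency=4, accumulators=k))
# 1 -> 4096, 2 -> 2048, 4 -> 1024, 8 -> 1024 (throughput-bound once k >= latency)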
 

TuxDave

Lifer
Oct 8, 2002
10,571
3
71
Also one more quick comment.

RWT says that it has register renaming for FP but I think this is probably an error

The two pipes are decoupled so they do execute out of order with respect to each other. For FP code that could mean that a long chain of FADDs followed by a long chain of FMULs could execute in parallel.

Your second paragraph would imply the need for register renaming.
 

Exophase

Diamond Member
Apr 19, 2012
4,439
9
81
Also one more quick comment.

Your second paragraph would imply the need for register renaming.

How do you figure? Even full OoOE doesn't need register renaming. It's just there to alleviate false dependencies and increase available ILP.
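
Minimal sketch of what renaming is for (toy code, nothing to do with any real core): each architectural destination gets a fresh physical register, so reusing a register name stops creating ordering constraints, while true read-after-write dependencies survive.

Code:
import itertools

def rename(instructions):
    new_phys = itertools.count()
    rat = {}                                        # architectural reg -> latest physical reg
    renamed = []
    for dest, srcs in instructions:
        srcs = tuple(rat.get(s, s) for s in srcs)   # read the newest mapping of each source
        rat[dest] = "p%d" % next(new_phys)          # fresh destination breaks WAW/WAR deps
        renamed.append((rat[dest], srcs))
    return renamed

prog = [
    ("r1", ("r2", "r3")),   # r1 = r2 + r3
    ("r4", ("r1",)),        # r4 = f(r1)      true (RAW) dependency on the first op
    ("r1", ("r5", "r6")),   # r1 = r5 + r6    only reuses the *name* r1
]
print(rename(prog))
# [('p0', ('r2', 'r3')), ('p1', ('p0',)), ('p2', ('r5', 'r6'))]
# After renaming, the third op shares nothing with the first two and is free to run
# ahead of them; without renaming, a scheduler has to treat the WAW/WAR hazards on
# r1 as ordering constraints (or simply finds less ILP).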