Mediatek MOAR COARS!


imported_ats

Senior member
Mar 21, 2008
422
64
86
how does it deliver worse performance? especially in relation to the mobile market...

More cores require more overall power, resulting in lower overall clock speeds and therefore lower ST performance, the only performance that actually matters for 99.99% of actual consumer applications.
 
Apr 30, 2015
131
10
81
More cores require more overall power, resulting in lower overall clock speeds and therefore lower ST performance, the only performance that actually matters for 99.99% of actual consumer applications.

You may be thinking of PCs, whose CPUs were never conceived for real-time use with the internet; mobile phone SoCs were. The A53 is efficient because of its simplicity; if the real-time load can be met by A53s, that is sufficient. The A57 or A72 is there for larger, intermittent real-time loads; the result is a responsive system.
 

monstercameron

Diamond Member
Feb 12, 2013
3,818
1
0
More cores require more overall power, resulting in lower overall clock speeds and therefore lower ST performance, the only performance that actually matters for 99.99% of actual consumer applications.

I don't think it matters for mobile workloads.
 

imported_ats

Senior member
Mar 21, 2008
422
64
86
You may be thinking of PCs, whose CPUs were never conceived for real-time use with the internet; mobile phone SoCs were. The A53 is efficient because of its simplicity; if the real-time load can be met by A53s, that is sufficient. The A57 or A72 is there for larger, intermittent real-time loads; the result is a responsive system.

You may be confusing your fantasy land with reality. PCs have been doing real-time workloads for decades. Mobile phone SoCs do basically no real-time workloads. Nothing about mobile phones or their workloads makes them in any way more capable of dealing with real-time workloads, nor do mobile phones and their workloads make more use of multiprocessor systems than PCs.
 

Exophase

Diamond Member
Apr 19, 2012
4,439
9
81
More cores require more overall power, resulting in lower overall clock speeds and therefore lower ST performance, the only performance that actually matters for 99.99% of actual consumer applications.

Maybe back in the Core 2 era, before DVFS, turbo boost, and core power gating were things being done. While mobile cores may not call it turbo boost, they still either limit clock capability based on activity or start throttling clock speeds based on thermal or power limits. For any given power load, a dual-core and a quad-core variant are allowed to clock about the same if only one or two cores need to be active, as in your single-threaded scenario.
 

imported_ats

Senior member
Mar 21, 2008
422
64
86
Maybe back in the Core 2 era, before DVFS, turbo boost, and core power gating were things being done. While mobile cores may not call it turbo boost, they still either limit clock capability based on activity or start throttling clock speeds based on thermal or power limits. For any given power load, a dual-core and a quad-core variant are allowed to clock about the same if only one or two cores need to be active, as in your single-threaded scenario.

DVFS doesn't turn off power, it only reduces it. More cores, even with DVFS, will result in a lower top frequency point. If we go by your theory, the 8-core Haswell-E should have the same or higher top turbo frequency than the 6-core Haswell-E, and oh look, it doesn't... And why not? Because thermal limits matter, and more cores, even in low DVFS states, still take power, both active and passive. And just a reminder, this is a Haswell-E design with the best DVFS on the market AND FIVR on-die voltage regulation, giving it by far the best power management and response in the industry.
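To make the leakage argument concrete, here is a toy Python model of a TDP-limited chip in the scenario described above, where idle cores are not fully power gated: every core on the die leaks whether or not it is running, so the die with more cores has less budget left for the one active core. All the constants are made-up illustrative numbers, not Haswell-E measurements.

```python
def max_turbo_ghz(active_cores, total_cores, tdp_w=140.0,
                  dyn_w_per_ghz=35.0, leak_w_per_core=3.0):
    """Highest clock that keeps package power under the TDP.

    Toy model: dynamic power scales linearly with frequency per active
    core (voltage effects folded into the constant), and every core on
    the die leaks whether it is active or merely idle-but-not-gated.
    """
    budget_w = tdp_w - total_cores * leak_w_per_core   # leakage comes off the top
    return budget_w / (active_cores * dyn_w_per_ghz)

# One thread running: the 8-core die tops out lower than the 6-core die,
# purely because the two extra idle cores still leak.
print(round(max_turbo_ghz(1, 6), 2))   # -> 3.49
print(round(max_turbo_ghz(1, 8), 2))   # -> 3.31
```

Note that with perfect power gating the leakage term for idle cores would drop to roughly zero and the gap between the two parts would disappear, which is exactly the point of contention in this exchange.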
 

Exophase

Diamond Member
Apr 19, 2012
4,439
9
81
DVFS doesn't turn off power, it only reduces it. More cores, even with DVFS, will result in a lower top frequency point. If we go by your theory, the 8-core Haswell-E should have the same or higher top turbo frequency than the 6-core Haswell-E, and oh look, it doesn't... And why not? Because thermal limits matter, and more cores, even in low DVFS states, still take power, both active and passive. And just a reminder, this is a Haswell-E design with the best DVFS on the market AND FIVR on-die voltage regulation, giving it by far the best power management and response in the industry.

Cores that are power gated consume virtually zero power; this is totally separate from DVFS (which is why I listed it separately). It involves a big fat transistor acting as an analog switch for the power supply. Save for a tiny leakage current, it's like cutting the power line completely. All else being equal, there's no reason why simply having more cores (that can be power gated) would have a tangible impact on power consumption when they're off. But all else isn't always equal.

There are more factors than core count that could be affecting Haswell-E, for example an extra 5MB of L3 cache that probably isn't gated off and more PCI-E lanes that may not be dynamically power budgeted. It does break Intel's previous tradition of having E-series processors that turbo'd slightly higher than the non-E-series ones in addition to having more cores (but also having a higher TDP).
 
Last edited:

lopri

Elite Member
Jul 27, 2002
13,314
690
126
I don't see what this has to do with efficiency of 2+2 vs 4+4 cores, could you explain your line of reasoning to me? Or do you mean that 2+2 using global task switching can be more efficient than 4+4 using cluster migration?
I do not think this discussion is going anywhere, because I now understand that you are talking in the abstract (i.e., in the ideal case). I had based my assumptions on what has actually happened.

For example, in this experiment by AT:

http://www.anandtech.com/show/8800/the-meizu-mx4-pro-review/6

The normal performance state is the system running all eight cores at their designed frequency targets, so nothing is unusual there. The "Balance" and "Power Saving" states differ from what Samsung employs in its own devices in that instead of modifying the scaling logic of the SoC, they simply disable CPU cores entirely via hot-plugging. The "Balance" mode disabled three A15 cores effectively turning the system into a 5-core system with only one big CPU and four little ones. The "Power Saving" mode entirely shuts off the big cluster and runs the SoC as if it were a quad-core A7 system. To see how this affects performance and power, we turn to the PCMark power measurements.

[Chart: PCMark performance and power measurements]


The data obtained by Andrei suggests 1+4 was more efficient than 4+4 for the tasks at hand, even under GTS. The article does not say precisely why that is so, but there is an indication, "hot-plugging big cores", which sounds like a costly operation to me in terms of power consumption.
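For reference, the core hot-plugging the review describes is exposed through the standard Linux sysfs interface, so a userspace power-mode tool can take cores offline directly. A minimal Python sketch (writing requires root, and cpu0 usually cannot be unplugged; the cluster layout in the comment is an assumption for illustration):

```python
def parse_cpu_list(spec):
    """Expand a sysfs CPU list like '0-3,5' into [0, 1, 2, 3, 5]."""
    cores = []
    for part in spec.split(","):
        if "-" in part:
            lo, hi = part.split("-")
            cores.extend(range(int(lo), int(hi) + 1))
        elif part:
            cores.append(int(part))
    return cores

def online_cores():
    """Cores currently online, per /sys/devices/system/cpu/online."""
    try:
        with open("/sys/devices/system/cpu/online") as f:
            return parse_cpu_list(f.read().strip())
    except OSError:
        return []   # not on Linux, or sysfs unavailable

def set_core_online(cpu, online):
    """Hot-plug one core in or out (needs root)."""
    with open(f"/sys/devices/system/cpu/cpu{cpu}/online", "w") as f:
        f.write("1" if online else "0")

# A "power saving" mode shutting off the big cluster on a typical 4+4
# big.LITTLE SoC (assuming the big cores are cpu4-cpu7) would do:
#   for cpu in range(4, 8):
#       set_core_online(cpu, False)
```

The cost lopri alludes to is that hot-plugging is a heavyweight kernel operation (migrating tasks and interrupts off the core), which is why later kernels moved idle power management into cpuidle instead.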

If we look back at cluster migration, there were cases where all the big cores ran at the same frequency regardless of the load. I am not exactly sure which SoC behaved that way, but the Exynos 5410 might have been one of them. Since clock frequency has a direct correlation with power consumption, having more cores spinning unused is detrimental to power consumption.

I anticipate your response to be along the lines of "Those cases are imperfect examples," to which I would agree. There is no _logical_ or _inherent_ reason why more cores should equal more power consumption when they are power-gated. But those real-world imperfections were what I was talking about.
 
Apr 30, 2015
131
10
81
You may be confusing your fantasy land with reality. PCs have been doing real-time workloads for decades. Mobile phone SoCs do basically no real-time workloads. Nothing about mobile phones or their workloads makes them in any way more capable of dealing with real-time workloads, nor do mobile phones and their workloads make more use of multiprocessor systems than PCs.

Try reading:

big.LITTLE Processing with ARM Cortex-A15 & Cortex-A7

by Peter Greenhalgh of ARM. He states that, at 1GHz, handover between clusters of processors can be accomplished in 20 microseconds.

There is more detail in:

http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.dai0318e/index.html


I believe that PCs were introduced with the 8086 plus optional 8087 FPU. They were used for office work. It seems to me that the A7, the A15, and big.LITTLE were designed to be a very responsive real-time system.
 
Last edited:
Apr 30, 2015
131
10
81
DVFS doesn't turn off power, it only reduces it. More cores, even with DVFS, will result in a lower top frequency point. If we go by your theory, the 8-core Haswell-E should have the same or higher top turbo frequency than the 6-core Haswell-E, and oh look, it doesn't... And why not? Because thermal limits matter, and more cores, even in low DVFS states, still take power, both active and passive. And just a reminder, this is a Haswell-E design with the best DVFS on the market AND FIVR on-die voltage regulation, giving it by far the best power management and response in the industry.

ARM power management is described in:

http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.dai0318e/index.html

Section 3 refers. Cores are powered down.
 

imported_ats

Senior member
Mar 21, 2008
422
64
86
Cores that are power gated consume virtually zero power; this is totally separate from DVFS (which is why I listed it separately). It involves a big fat transistor acting as an analog switch for the power supply. Save for a tiny leakage current, it's like cutting the power line completely. All else being equal, there's no reason why simply having more cores (that can be power gated) would have a tangible impact on power consumption when they're off. But all else isn't always equal.

Power gating itself has lots of ancillary issues: di/dt issues, ripple issues, latency issues, etc. It can also have floorplan impacts. So it isn't exactly a panacea; it's mostly useful for full sleep states and little else.

There are more factors than core count that could be affecting Haswell-E, for example an extra 5MB of L3 cache that probably isn't gated off and more PCI-E lanes that may not be dynamically power budgeted. It does break Intel's previous tradition of having E-series processors that turbo'd slightly higher than the non-E-series ones in addition to having more cores (but also having a higher TDP).

Cache power is absolutely minimal. The 5930K and 5960X have the exact same number of PCI-E lanes. The frequency differential is really down to the additional power of the 2 additional cores in the design.
 

imported_ats

Senior member
Mar 21, 2008
422
64
86
Try reading:

big.LITTLE Processing with ARM Cortex-A15 & Cortex-A7

by Peter Greenhalgh of ARM. He states that, at 1GHz, handover between clusters of processors can be accomplished in 20 microseconds.

You realize that ARM has moved completely away from cluster-based switching because it's basically a complete failure, right? And even though GTS is pretty horrible, it's significantly better than CBS. And even GTS has enough overheads that it is generally quite suboptimal to use 99% of the time. I'm well aware of almost all the issues that ARM is having, as the concept isn't exactly new to me. We explored core-hopping technologies all the way back at the beginning of the millennium as a possible solution to thermal hot-spotting, including having auxiliary hardware to speed up the process. It's almost entirely a technical dead end. And GTS and CBS are much, much more problematic than core-hopping.


I'm well aware of the marketing and technical literature within the industry.


I believe that PCs were introduced with the 8086 plus optional 8087 FPU. They were used for office work. It seems to me that the A7, the A15, and big.LITTLE were designed to be a very responsive real-time system.

8086s and 8088s were used in networked systems, as were 286s and 386s. You seem to think that this concept of networked systems is new. Unfortunately for you, it's quite old. And no, the A7/A15 weren't designed for responsive real-time systems. They were designed to be compute cores for whatever ARM could sell them in. There's nothing about them that provides any advantage for any real-time workload. And big.LITTLE is actually a detriment to any real-time workload, as it adds significant latency, complexity, and massive variability.

If you post again, I hope it might contain a mildly interesting and intelligent argument, something your previous posts have certainly lacked.
 

Exophase

Diamond Member
Apr 19, 2012
4,439
9
81
Power gating itself has lots of ancillary issues: di/dt issues, ripple issues, latency issues, etc. It can also have floorplan impacts. So it isn't exactly a panacea; it's mostly useful for full sleep states and little else.

We can talk about complications that limit the scenarios in which you can enter power gating, but none of that changes the fact that when the cores are off, and have been off for any significant amount of time, they're not really using power. Every modern multicore design that I know of allows power gating individual cores (unless you count the Bulldozer line, where it's individual modules...), so whatever impact it had on the design is moot.

Cache power is absolutely minimal. The 5930K and 5960X have the exact same number of PCI-E lanes. The frequency differential is really down to the additional power of the 2 additional cores in the design.

You really don't know that.

Can you point to an example in the mobile SoC world (you know, what we're talking about) where an SoC with more of the same type of cores clocks lower? I can't.
 

imported_ats

Senior member
Mar 21, 2008
422
64
86
We can talk about complications that limit the scenarios in which you can enter power gating, but none of that changes the fact that when the cores are off, and have been off for any significant amount of time, they're not really using power. Every modern multicore design that I know of allows power gating individual cores (unless you count the Bulldozer line, where it's individual modules...), so whatever impact it had on the design is moot.

The reality is most designs can't power gate a core unless the device is in a suspend or sleep state.

You really don't know that.

Let's just say I'm pretty certain. Granted, it's been a while, but I've seen the numbers.

Can you point to an example in the mobile SoC world (you know, what we're talking about) where an SoC with more of the same type of cores clocks lower? I can't.

Can you point to an example where it doesn't or doesn't increase the power?
 

Exophase

Diamond Member
Apr 19, 2012
4,439
9
81
The reality is most designs can't power gate a core unless the device is in a suspend or sleep state.

That hasn't been true for a mobile SoC since Tegra 2 (and that was probably a big reason why it didn't get design wins in phones). If it worked like that, there'd be no reason for them to include per-core power gating in the first place, as opposed to just one big gate on every core. Power gating actually goes finer-grained than per-core too; for example, SIMD units can be power gated.

For a while Android used the hotplug governor to power gate individual cores that had been idle long enough (and bring them back up under sustained demand). There have also been apps that manually gate individual cores. More recently, power gating functionality has been moving into the cpuidle manager.
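As a concrete illustration of the cpuidle side, modern Linux kernels expose per-core idle-state residency in sysfs; the deeper states generally correspond to the power-gated condition (state names vary by vendor). A small Python sketch for inspecting it, using the standard cpuidle paths; it returns an empty dict where the interface is absent:

```python
import glob
import os

def idle_state_time_us(cpu=0):
    """Microseconds each cpuidle state has been resident, keyed by state name.

    Reads /sys/devices/system/cpu/cpuN/cpuidle/state*/{name,time}.
    Deeper states typically mean the core (or cluster) was power gated.
    """
    base = f"/sys/devices/system/cpu/cpu{cpu}/cpuidle"
    residency = {}
    for state_dir in sorted(glob.glob(os.path.join(base, "state*"))):
        with open(os.path.join(state_dir, "name")) as f:
            name = f.read().strip()
        with open(os.path.join(state_dir, "time")) as f:
            residency[name] = int(f.read())
    return residency
```

Unlike hot-plugging, entering and leaving these states is handled by the kernel's idle loop at microsecond-to-millisecond granularity, with no task migration needed.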

Let's just say I'm pretty certain. Granted, it's been a while, but I've seen the numbers.

You'll understand if that's not very meaningful to me...

Can you point to an example where it doesn't or doesn't increase the power?

There are several examples of mobile SoCs where new versions came out that included more of the same type of cores, made on the same process, and clock speeds stayed the same or increased. Some examples:

Snapdragon S3: single core Scorpion 1.5GHz -> S3: dual core Scorpion 1.7GHz
Snapdragon S4: dual core Krait 200 1.5GHz -> S4: quad core Krait 200 1.7GHz
Snapdragon S4: dual core Krait 300 1.7GHz -> S4: quad core Krait 300 1.9GHz
Snapdragon 618: 2x 1.8GHz Cortex-A72 + 4x 1.2GHz Cortex-A53 -> Snapdragon 620: 4x 1.8GHz Cortex-A72 + 4x 1.2GHz Cortex-A53

Tegra 2: dual core Cortex-A9 1.2GHz -> Tegra 3: quad core Cortex-A9 1.6GHz ("up to 1.7GHz in single core mode")

Exynos 4: dual core Cortex-A9 1.5GHz -> quad core Cortex-A9 1.6GHz
Exynos 5: dual core Cortex-A15 1.7GHz -> 4x Cortex-A15 2.0GHz + 4x Cortex-A7 1.3GHz

i.MX6 dual: dual core Cortex-A9 1.2GHz -> i.MX6 quad: quad core Cortex-A9 1.2GHz
(in this case the SoCs are the same other than the core count)

MT6571: dual core Cortex-A7 1.3GHz -> MT6589: quad core Cortex-A7 1.5GHz
MT6588: quad core Cortex-A7 1.7GHz -> MT6592: octa core Cortex-A7 2GHz
MT6732: quad core Cortex-A53 1.5GHz -> MT6752: octa core Cortex-A53 1.5GHz

Z2480: dual core Saltwell 2GHz -> Z2580: quad core Saltwell 2GHz
Z3480: dual core Silvermont 2.13GHz -> Z3580: quad core Silvermont 2.33GHz

I don't have any power consumption numbers for how one SoC with a higher core count and cores disabled compares to another with a lower count but an otherwise similar/same design, because I haven't seen anyone test for this. Not a lot of people really do good testing for power consumption to begin with. But suffice it to say that in the mobile SoC world, core count has not been a limiter of clock speed in any case I know of.

Maybe things are different in the server world. Maybe extra cores that aren't fused off really do cost thermal headroom, which could be due to different design priorities. Or maybe there are other reasons that make them not as aggressive as possible with throttling. The server world is very different: when you get a 12-core processor there, you're not expected to have only one or two cores active very often. But on mobile devices this is the case much of the time, and it's critical that power consumption is minimized.
 
Apr 30, 2015
131
10
81
Maybe things are different in the server world. Maybe extra cores that aren't fused off really do cost thermal headroom, which could be due to different design priorities. Or maybe there are other reasons that make them not as aggressive as possible with throttling. The server world is very different: when you get a 12-core processor there, you're not expected to have only one or two cores active very often. But on mobile devices this is the case much of the time, and it's critical that power consumption is minimized.

There is further description and benchmarking of big.LITTLE systems at:

http://www.arm.com/products/processors/technologies/biglittleprocessing.php

under the 'resources' tab.

The following PDF:
big_LITTLE_technology_moves_towards_fully_heterogeneous_Global_Task_Scheduling
describes the three different ways of using big.LITTLE on page 4.
On pages 8 and 9, they show a comparison of power usage for A15s only vs. A15/A7 big.LITTLE.
On page 11 they show a chart for 'Angry Birds' running on a big.LITTLE system, where the A7s are sufficient; all four A7s are used, at reduced frequencies, to save power.
These examples clearly show how big.LITTLE saves energy compared with using only the larger, more powerful A15 cores.

The PDF:
big.LITTLE Technology: The Future of Mobile
also shows a power comparison between A15s only and A7/A15 big.LITTLE, in figure 8.
Figure 4 shows the cluster migration and GTS models of big.LITTLE. Both of these involve modified kernels, which are available from Linaro, as stated on pages 5 and 12.
 

Exophase

Diamond Member
Apr 19, 2012
4,439
9
81
Power savings of using big.LITTLE vs. big-only don't really have anything to do with what we're talking about (the assertion that extra cores eat away at peak clocks in mobile SoCs).
 
Apr 30, 2015
131
10
81
Power savings of using big.LITTLE vs. big-only don't really have anything to do with what we're talking about (the assertion that extra cores eat away at peak clocks in mobile SoCs).

I think the whole point of this thread is that extra cores are used (moar coars) to reduce frequencies and save power; ARM's work shows that this is true, and also that consumer devices can make use of multiple cores. This is all in the context of standard ARM cores.
 

Qwertilot

Golden Member
Nov 28, 2013
1,604
257
126
Not convinced by that :) 4,2,4 would be an awfully odd setup! Might as well throw an extra 2 A53s in to make it twelve ;)
 

Andrei.

Senior member
Jan 26, 2015
316
386
136
The reality is most designs can't power gate a core unless the device is in a suspend or sleep state.
That's nonsense. Basically all SoCs beginning with the A15/A7 generation have low-latency, fine-grained per-core power gating that kicks in when a core idles for more than 1ms. Before that they still had it at a coarser granularity via hot-plugging (around 100-500ms periods) or when only 1 core was online (Samsung's AFTR powers down the whole CPU complex in running mode; this has been around since the Galaxy S2). And this is full-core power gating implemented by the vendor, not internal architectural power gating like gating off the FPUs, which happens at instruction-cycle latencies. How do you assume Apple handles leakage on those gigantic cores without burning through the battery?

People praise Intel's FIVR and their hardware DVFS, but without actually seeing how it behaves, I'm very sceptical of its advantages.
 
Last edited: