AT: The Mobile CPU Core-Count Debate: Analyzing The Real World

imported_ats · Sep 5, 2015

Exophase said:
But you're saying they would be better off in all ways with fewer cores, so the scheduler could just keep the other cores powered off.

But it doesn't.

All scheduling is based on analyzing repetitive behaviors. Always has been. Saying that it'd need an oracle to be effective is meaningless without actually data that shows that it's a loss with subpar scheduling.

No, most scheduling isn't based on analyzing repetitive behaviors. Certainly not the schedulers used in linux and esp wrt power.

Okay, so when you say Fmax @ Vmin you mean the maximum frequency that a particular voltage can support and Fmin as the lowest frequency that voltage can support (and in this case Fmin should be the same for all voltages) Usually when I've seen Vmin/Vmax it refers to global limits irrespective of frequency.

If you see Fmin/Fmax or Vmin/Vmax independently then they are usually talking about the absolute limits, things like Fmax@Vmin or Fmax@1V denote sub-sectioned data.

What in the article leads you to think that they are ever running at a higher voltage than the binning of the chip and power management dictates is allowed that frequency?

Because Vmin is a very real thing and it is highly unlikely that the core/process is designed to scale down Vdd down much below .7-.75 if it can even go that low. AKA there is a hard floor to Vmin and you tend to have DFVS frequency scaling well below Fmax@Vmin. This is useful for when a core is processing limited data but much be in a keep alive state, sure it uses less power than Fmax@Vmin, but from a perf/w perspective its worse than Fmax@Vmin. Its done primarily when perf doesn't matter.

Fmin for the big core is at 800MHz and if you look at frequency tables voltage keeps scaling all the way down until this point, which is why it switches to the little core below that, which have Vmin at a substantially lower frequency. Yes there's no point decreasing frequency below what you can scale voltage. What makes you think that's happening here?

What makes you think it isn't happening...

Again. The graph at the end. I don't know why you keep ignoring it. It shows perf/W increases with lower perf over a huge dynamic range.

Which tells us nothing about if b.L is good or bad. Nor if the workload being profiled has any relevance to the workloads being used.

You're making broad statements about costs of context switches and cache flushes but the fact is that this only applies for as much as you actually switch clusters. And a lot of loads hit steady states where they don't switch for a long time, if ever. This is evident in the data.

The data presented is completely insufficient to make an argument one way or another. Though the reason they don't switch is that they've had to heavily bias things because switching is so costly.

Fjodor2001 · Sep 5, 2015

imported_ats said:
b.L like all core hopping has a significant issue with scheduling to even break even. After all, 'its just a simple matter of software'

But do you have any data indicating that the cross CPU task scheduling penalty is greater than the gains of having two different uArch CPU cores perf/watt-optimized for different performance ranges?

imported_ats said:
So yes, it really is possible. After all, we have an actual example of it working in Apple's designs.

The fact that Apple uses a single uArch for their cores does not mean they could not do better with a b.L approach.

b.L is after all a new approach in actual products. Even ARM/Samsung/Qualcomm/Sony/[...] did not use it until quite recently. So with time, we might see even more adopters of b.L.

imported_ats said:
In theory they could combine Atom + Core for low power laptop/tablet markets, but the reality is it would present more problems than it would solve.

Yes, that could work in theory. But I'm not so sure those uArch:es are suited for a b.L approach. I think the uArch:es will have to be designed from the ground up with b.L in mind. Or you have to be lucky and already have two uArches in store that are suitable for b.L.

imported_ats said:
b.L is almost entirely a compromised solution do to the limitations that ARM has to work with. If it provides such an advantage you would see Apple/QCom doing it with their own designs, after all, they also have easy access to a53s, but instead they are going to a single core design that scales.

That is just speculation. We've not seen any data indicating this.

If anything we've seen the difference, as shown by the AT article.

Fjodor2001 · Sep 5, 2015

imported_ats said:
There is no viable phone workload that should in anyways possible be able to use 8 cores that isn't simply horrible software design.

I'm not sure why you say a phone workload that utilizes 8 cores should be horrible software design. I'd say it is just the opposite. The more parallelized you can write the SW the better, since it then is able to make use of more cores when available.

Also, from the AT article:

The fact that Chrome and to a lesser extent Samsung's stock browser were able to consistently load up to 6-8 concurrent processes while loading a page suddenly gives a lot of credence to these 8-core designs that we would have otherwise not thought of being able to fully use their designed CPU configurations.
[...]
What we see in the use-case analysis is that the amount of use-cases where an application is visibly limited due to single-threaded performance seems be very limited.
[...]
On the other hand, scenarios were we'd find 3-4 high load threads seem not to be that particularly hard to find, and actually appear to be an a pretty common occurence.

imported_ats said:
there is no need for more than a single core to handle sub 25 MB/s storage I/O.

This I do agree with. Typically the data to/from eMMC is transferred via DMA, not involving the CPU. The CPU just submits I/O requests to the eMMC hardware and is then notified upon completion via an interrupt, and then a new request can be issued to the eMMC command queue.

imported_ats · Sep 5, 2015

Fjodor2001 said:
But do you have any data indicating that the cross CPU task scheduling penalty is greater than the gains of having two different uArch CPU cores perf/watt-optimized for different performance ranges?

None that is publicaly available.

The fact that Apple uses a single uArch for their cores does not mean they could not do better with a b.L approach.

If they could do better with it, then they would be using it. The people working there are fairly smart. They could easily plot down a couple a53s if it was an advantage. The reality is with proper power optimization and control of the whole stack, you can do better than b.L.

b.L is after all a new approach in actual products. Even ARM/Samsung/Qualcomm/Sony/[...] did not use it until quite recently. So with time, we might see even more adopters of b.L.

Not really that new. It was being looked at for products well over a decade ago. The scheduling and context transition issues are actually pretty significant. ARM uses it because they don't really have another option.

Yes, that could work in theory. But I'm not so sure those uArch:es are suited for a b.L approach. I think the uArch:es will have to be designed from the ground up with b.L in mind. Or you have to be lucky and already have two uArches in store that are suitable for b.L.

They are as suitable for b.L as a53/57.

That is just speculation. We've not seen any data indicating this.

If anything we've seen the difference, as shown by the AT article.

The AT article really says nothing about b.L, I mean it barely says anything about 4c being useful.

imported_ats · Sep 5, 2015

Fjodor2001 said:
I'm not sure why you say a phone workload that utilizes 8 cores should be horrible software design. I'd say it is just the opposite. The more parallelized you can write the SW the better, since it then is able to make use of more cores when available.

There isn't a viable phone workload that needs that many cores. If you are paralyzing just to parallelize then you are doing it wrong. This has been proven time and again. Nor is it power efficient due to increased uncore activity.

Also, from the AT article:...

The data presented doesn't really collaborate that analysis. Having a bunch of thread waiting in spin loops doesn't mean you need multiple hardware contexts. I mean its not hard to generate a bunch of threads loading a webpage, just spin off a new thread for each get request (of which there are probably 10s to 100s for a website like AT), a render thread, a cache thread, etc. This is usually done so that you are non-blocking, but it doesn't mean you need that many nor will even benefit from that many underlying hardware contexts.

This I do agree with. Typically the data to/from eMMC is transferred via DMA, not involving the CPU. The CPU just submits I/O requests to the eMMC hardware and is then notified upon completion via an interrupt, and then a new request can be issued to the eMMC command queue.

Which makes sense. What doesn't make sense is the extreme usage during AT update test which is basically out of whack with any reasonable level of required performance. You can literally run a whole NAS RAID/ZFS box with less cycles and maintain 100+ MB/s of bandwidth.

Exophase · Sep 5, 2015

imported_ats said:
Because Vmin is a very real thing and it is highly unlikely that the core/process is designed to scale down Vdd down much below .7-.75 if it can even go that low. AKA there is a hard floor to Vmin and you tend to have DFVS frequency scaling well below Fmax@Vmin. This is useful for when a core is processing limited data but much be in a keep alive state, sure it uses less power than Fmax@Vmin, but from a perf/w perspective its worse than Fmax@Vmin. Its done primarily when perf doesn't matter.

The data is out there and you're completely ignoring it.

http://www.anandtech.com/show/9330/exynos-7420-deep-dive

A57 @ 800MHz: 625mV
A53 @ 400MHz: 606mV

Those are the Fmin and Vmin being used for the same chip tested in this article.

Everything you're saying about Fmin/Vmin here is completely wrong. The graph at the end of the article that you keep ignoring shows that perf/W continues to improve through the whole dynamic range tested except at the very lowest part!

imported_ats said:
What makes you think it isn't happening...

Nope, it doesn't work this way. Burden is on you to show evidence for your claim. The article already shows the perf/W vs perf curve.

imported_ats said:
The data presented is completely insufficient to make an argument one way or another. Though the reason they don't switch is that they've had to heavily bias things because switching is so costly.

It's funny, you say that the data is insufficient to say things one way or the other yet you're content saying big.LITTLE is completely negative with no data of your own...

The data presented does actually show workloads where the big cores have high residency on only one core and the small cores high residency on multiple cores for long periods of time. How does this not show that there isn't constant cluster migration? If the scheduler is "biased" to do this then again you seem to be accusing the scheduler the wrong thing just to try to justify the core design, yet the SoC has very competitive performance so that's pretty confusing...

Andrei. · Sep 5, 2015

imported_ats said:
What makes you think it isn't happening...

You should explain why you think it's happening as you brought it up. In any case I can say that you're wrong. For this particular SoC the frequencies stop scaling down when voltages stop scaling down. In modern SoCs that is the only point where race-to-sleep is actually valid, for anything else it's the standard power curve which dictates efficiency. This is how every single vendor out there operates no matter the architecture.

Your arguments about context-switching are overblown. Sure there is overhead, and that's why ARM is introducing the improved CCI-500 for example to try to alleviate this. However currently the scheduler is configured for this not to be a problem. The cluster switching hysteresis is 4 orders of magnitude higher than the perf penalty for a reason. Qualcomm actually has the power numbers on this and they use it in their energy-aware scheduler. And it really doesn't matter. It's still a drop in the ocean compared to the actual power used by the cores when they're doing something. And such a problem scenario will realistically simply never happen because nobody will set it up like that because you don't want to switch much faster than what the DVFS system works on, and currently that's still in the 1-20ms range.

Comparing 2 cores at Fmax/2@Vmin vs 1 core at Fmax@Vmin would result in the Fmax@Vmin core burning lower power.

Of course you can run Fmax@Vmin. You do understand that a each Vx there is a range from Fmax to Fmin, right? Have you never seen a Shmoo plot before? And where am I saying DVFS is a farce? It seems you just don't understand semiconductor parametrics.

....

Even the data presented points ot the the number of cores basically going completely unused, sure it running a lot of threads but its idle or powered off 50+% of the time or operating at horrible operating points for power/efficiency (aka 4 cores running at 400mhz, well below Fmax at Vmin).

I think you have a fundamental lack of understanding how these scaling mechanisms work otherwise you wouldn't be making these statements. There isn't a single SoC out there that runs anything less than Fmax at a given Vx. That last sentence makes absolutely no sense at all as 4-core spread at 400MHz is actually the single best efficiency point of the SoC, I even included the efficiency graph to try to avoid people bringing up this discussion.

krumme · Sep 6, 2015

You guys might want to reread this excellent piece
http://www.anandtech.com/show/9330/exynos-7420-deep-dive/6

Where we see this efficient migration between cores the 7420 shows is a new thing vs the 5433.
No wonder people have trouble accepting the bare facts as it is real news and surprising but Andrei last article is a milestone in reviewing imo.

Fjodor2001 · Nov 21, 2015

Samsung's custom ARM Core in Exynos 8 Octa 8890 will use big.LITTLE too:

http://www.extremetech.com/computin...ll-feature-the-companys-first-custom-cpu-core

Now, Samsung has announced the follow-up to its well-regarded FinFET debut the Exynos 8890. Where the Exynos 7420 combined ARMs standard Cortex-A57 and A-53 in an eight-core big.Little configuration, the Exynos 8890 will pair a set of standard Cortex-A53 cores with a brand-new custom architecture, codenamed Mongoose.

So Samsung decided to use a big.LITTLE solution despite having a custom designed ARM CPU core. This indicate those who thought b.L was just a stopgap solution until a custom CPU was ready were wrong.

Search

AT: The Mobile CPU Core-Count Debate: Analyzing The Real World

imported_ats

Senior member

Fjodor2001

Diamond Member

Fjodor2001

Diamond Member

imported_ats

Senior member

imported_ats

Senior member

Exophase

Diamond Member

Andrei.

Senior member

krumme

Diamond Member

Fjodor2001

Diamond Member

TRENDING THREADS