Question Bloomberg: Apple testing SoCs with 16 and 32 high performance cores

mikegg · Dec 10, 2020

The current M1 chip has four high-performance processing cores and four power-saving cores. For its next generation chip targeting MacBook Pro and iMac models, Apple is said to be working on designs with as many as 16 power cores and four efficiency cores.

Apple is also reportedly testing a chip design with as many as 32 high-performance cores for higher-end desktop computers planned for later in 2021, as well as a new half-sized Mac Pro planned to launch by 2022.

Bloomberg: Apple Working on Next-Gen Apple Silicon Chips for MacBook Pro, iMacs, and Mac Pro Due to Launch Next Year

Apple is working on a series of new custom Apple silicon processors to power upgraded versions of the MacBook Pro, new iMacs, and a new Mac Pro for...

www.macrumors.com

jeanlain · Dec 13, 2020

Carfax83 said:
See above. HEVC does have ARM64 hand tuned assembly optimizations.

It doesn't mean that the level of optimisation is comparable to the X86 version.
In addition, these optimisations may not have been included in Handbrake yet. I've read that's the case for some NEON optimisations contributed by Apple.

Bam360 · Dec 13, 2020

Carfax83 said:
The M1 has 4x 128 bit NEON, so isn't that similar to 2x 256 bit AVX in throughput?

Yes, but is that a good thing? Similar throughput is not a good thing in this case, when Apple's is a super-wide core with a much lower clock speed. To be equal to or faster than a narrower core with a higher clock, it needs not only more ALUs or a larger ROB window, but also more throughput in SIMD workloads.
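To put rough numbers on that point, here is a minimal sketch of the arithmetic. The pipe counts and widths (4x 128-bit NEON, 2x 256-bit AVX) are the ones quoted above; the two clock speeds are assumed round figures picked only for illustration, and the function names are made up for this example.

```python
# Back-of-the-envelope SIMD width comparison.
# Pipe counts and widths come from the posts above; the clock speeds are
# ASSUMED round numbers for illustration, not measured values.

def simd_bits_per_cycle(pipes: int, width_bits: int) -> int:
    """Peak SIMD bits processed per cycle if every pipe is busy."""
    return pipes * width_bits


def simd_bits_per_ns(pipes: int, width_bits: int, clock_ghz: float) -> float:
    """Peak SIMD bits per nanosecond (1 GHz is one cycle per nanosecond)."""
    return simd_bits_per_cycle(pipes, width_bits) * clock_ghz


# Per cycle, the two layouts are the same width.
neon_per_cycle = simd_bits_per_cycle(4, 128)
avx_per_cycle = simd_bits_per_cycle(2, 256)

# Hypothetical clocks: a lower-clocked wide core vs a higher-clocked narrower one.
ASSUMED_LOW_CLOCK_GHZ = 3.2
ASSUMED_HIGH_CLOCK_GHZ = 4.8

neon_per_ns = simd_bits_per_ns(4, 128, ASSUMED_LOW_CLOCK_GHZ)
avx_per_ns = simd_bits_per_ns(2, 256, ASSUMED_HIGH_CLOCK_GHZ)

print("bits/cycle:", neon_per_cycle, "vs", avx_per_cycle)
print("relative bits/ns:", neon_per_ns / avx_per_ns)
```

Under those assumed clocks, equal width per cycle turns into a per-second deficit proportional to the clock ratio. This is peak width only; it says nothing about how often real code keeps every pipe busy.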

Carfax83 · Dec 13, 2020

jeanlain said:
It doesn't mean that the level of optimisation is comparable to the X86 version.

I already acknowledged that. But it also doesn't mean any pending optimization could bridge that gap successfully.

In addition, these optimisations may not have been included in Handbrake yet. I've read that's the case for some NEON optimisations contributed by Apple.

Why would you think it wouldn't be in Handbrake? x265 version 3.4 was released back in June, so the latest versions of Handbrake should be using it.

Carfax83 · Dec 13, 2020

Bam360 said:
Yes, but is that a good thing? Similar throughput is not a good thing in this case, when Apple's is a super-wide core with a much lower clock speed. To be equal to or faster than a narrower core with a higher clock, it needs not only more ALUs or a larger ROB window, but also more throughput in SIMD workloads.

This is kind of making my argument for me. The lower clock speed will put it at a disadvantage in many workloads that don't have high IPC throughput, ie gaming being the most notorious one. There's a good reason why Intel was so dominant in gaming over the years; the very high clock speeds of their CPUs and the low memory latency.

I don't know if encoding corresponds well with high IPC. It may be throughput limited as you say, as SIMD is utilized heavily.

itsmydamnation · Dec 13, 2020

Carfax83 said:
The M1 has 4x 128 bit NEON, so isn't that similar to 2x 256 bit AVX in throughput?

who has 2x 256bit avx throughput ?

jeanlain · Dec 13, 2020

Carfax83 said:
I already acknowledged that. But it also doesn't mean any pending optimization could bridge that gap successfully.

Why would you think it wouldn't be in Handbrake? x265 version 3.4 was released back in June, so the latest versions of Handbrake should be using it.

What's so special about x265 that would make x86 inherently better?
As for handbrake, it was actually about x265 itself, which has not yet included Apple's optimisations in its main branch: Posted: 29 Nov 2020 08:39
I couldn't verify the source, but this poster appears to be contributing to handbrake or x265.

Thunder 57 · Dec 13, 2020

senttoschool said:
Can we have a conversation without you getting all butthurt that Apple is/will/could be faster than AMD for f sakes?

Yes, we get it. You love AMD and you want to own the fastest chips without having to use Apple products. Not being able to in the near future hurts your manlihood. Now let's put that aside and focus on actual CPU discussions.

The M1 is a very impressive chip, especially when it comes to efficiency. You have said more than once though that it is the best CPU out there, and that is not true.

jeanlain · Dec 13, 2020

Carfax83 said:
This is kind of making my argument for me. The lower clock speed will put it at a disadvantage in many workloads that don't have high IPC throughput, ie gaming being the most notorious one.

I don't get it. The M1 is very fast at single threaded tasks. That has been shown in many types of workloads. Why would games differ? Do you believe that the IPC of the M1 would somehow be lower for games?

Carfax83 · Dec 13, 2020

itsmydamnation said:
who has 2x 256bit avx throughput ?

Don't Zen 2 and 3 have 2x AVX2 units per core, plus 2 FMA?

Carfax83 · Dec 13, 2020

jeanlain said:
What's so special about x265 that would make x86 inherently better?

I didn't say x86 was inherently better. It has nothing to do with ISA, but microarchitecture. It's possible that heavy compute intensive encoding like HEVC is throughput limited, so low clock speed+high IPC won't necessarily help you there. What would help the most would be high clock speed, number of cores, SIMD performance.

As for handbrake, it was actually about x265 itself, which has not yet included Apple's optimisations in its main branch: Posted: 29 Nov 2020 08:39
I couldn't verify the source, but this poster appears to be contributing to handbrake or x265.

But handbrake 1.4 includes these optimizations? So Handbrake is using a custom x265 variant for version 1.4? That's interesting if true.

Carfax83 · Dec 13, 2020

jeanlain said:
I don't get it. The M1 is very fast at single threaded tasks. That has been shown in many types of workloads. Why would games differ? Do you believe that the IPC of the M1 would somehow be lower for games?

Because game code typically tends to be poorly optimized and very dependent on memory performance, which is why it's low in IPC. I've read comments from many programmers and software engineers that it's extremely difficult to get high IPC code from games.....even on this forum:

That's why microarchitectures like Skylake were so successful in gaming workloads, and also why Zen 3 had a massive performance gain over Zen 2. Low memory latency is a big deal in gaming, as is high clock speed.
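The link between memory dependence and low IPC can be sketched with the usual textbook stall model: effective CPI = base CPI + (misses per instruction x miss penalty in cycles). Everything in the snippet below is an assumption for illustration (the function name, the miss rates, the latencies); it also ignores out-of-order overlap and memory-level parallelism, so treat it as a rough picture rather than a model of any real CPU.

```python
# Simple stall model: memory misses add cycles on top of the core's base CPI.
# All numbers used with it are HYPOTHETICAL; real cores hide part of this
# latency through out-of-order execution and prefetching.

def effective_ipc(base_ipc: float,
                  misses_per_kilo_instr: float,
                  mem_latency_ns: float,
                  clock_ghz: float) -> float:
    """IPC after charging each memory miss its full latency in cycles."""
    base_cpi = 1.0 / base_ipc
    penalty_cycles = mem_latency_ns * clock_ghz  # ns * cycles/ns
    stall_cpi = (misses_per_kilo_instr / 1000.0) * penalty_cycles
    return 1.0 / (base_cpi + stall_cpi)


def instr_per_ns(base_ipc: float,
                 misses_per_kilo_instr: float,
                 mem_latency_ns: float,
                 clock_ghz: float) -> float:
    """Delivered performance: effective IPC times clock."""
    return effective_ipc(base_ipc, misses_per_kilo_instr,
                         mem_latency_ns, clock_ghz) * clock_ghz


# Same hypothetical core and code, two memory latencies.
slow_mem = instr_per_ns(4.0, 10.0, 90.0, 4.0)
fast_mem = instr_per_ns(4.0, 10.0, 60.0, 4.0)
print("gain from lower latency:", fast_mem / slow_mem)
```

In this model a miss-heavy workload ends up with an effective IPC far below what the core can sustain on cache-friendly code, and cutting latency helps more than adding execution width.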

itsmydamnation · Dec 13, 2020

Carfax83 said:
Don't Zen 2 and 3 have 2x AVX2 units per core, plus 2 FMA?

No,

Zen 2 has 4 x256bit pipes
two of those are FMAC/ FMA
two of those are FADD

Zen 3 has 6x 256bit pipes
two of those are FMAC/ FMA
two of those are FADD
two of those are int to fp conversion and Load/store

Now, both Zen 2 and 3 can issue 2x FMA + 2x FADD a cycle in the right conditions; more achievable is 2x FMA + 1x FADD.

Just like on both AMD and Intel, ARM NEON pipes don't have to be symmetric either, so doing pipeline counts is rather useless (see Zen 3's FP uplift over Zen 2 without any additional ADD/MUL capacity). You need to look at the issue rate per instruction.

jeanlain · Dec 13, 2020

Carfax83 said:
But handbrake 1.4 includes these optimizations? So Handbrake is using a custom x265 variant for version 1.4? That's interesting if true.

You're right. Handbrake uses these optimisations, which apparently reduces the encoding time by almost half. How much room there is for further improvements, I don't know.

Bitbucket

bitbucket.org

The official x265 doesn't include these optimisations, as x265 contributors don't appear interested.

Carfax83 · Dec 13, 2020

itsmydamnation said:
No,

Zen 2 has 4 x256bit pipes
two of those are FMAC/ FMA
two of those are FADD

Zen 3 has 6x 256bit pipes
two of those are FMAC/ FMA
two of those are FADD
two of those are int to fp conversion and Load/store

Now, both Zen 2 and 3 can issue 2x FMA + 2x FADD a cycle in the right conditions; more achievable is 2x FMA + 1x FADD.

Just like on both AMD and Intel, ARM NEON pipes don't have to be symmetric either, so doing pipeline counts is rather useless (see Zen 3's FP uplift over Zen 2 without any additional ADD/MUL capacity). You need to look at the issue rate per instruction.

This is all a bit too technical for me to completely understand. I'll ask you two simple questions:

1) How much greater is the SIMD throughput for Zen 3 over Zen 2 and Intel's Comet Lake?

2) How does Zen 3 compare with the M1 in that regard?

According to Andrei F.:

The four 128-bit NEON pipelines thus on paper match the current throughput capabilities of desktop cores from AMD and Intel, albeit with smaller vectors. Floating-point operations throughput here is 1:1 with the pipeline count, meaning Firestorm can do 4 FADDs and 4 FMULs per cycle with respectively 3 and 4 cycles latency. That’s quadruple the per-cycle throughput of Intel CPUs and previous AMD CPUs, and still double that of the recent Zen3, of course, still running at lower frequency. This might be one reason why Apples does so well in browser benchmarks (JavaScript numbers are floating-point doubles).

jeanlain · Dec 13, 2020

Carfax83 said:
Because game code typically tends to be poorly optimized and very dependent on memory performance, which is why it's low in IPC.

I suppose this can be evaluated by benchmarking games at very low res so that performance is not limited by the GPU. But this will be hard to achieve on the M1's integrated GPU. And very few games have universal macOS versions anyway. There's only WoW and some Minecraft rip-off, AFAIK.
Even then, comparisons are difficult because x86/ARM specific optimisations may come into play.

itsmydamnation · Dec 13, 2020

Carfax83 said:
This is all a bit too technical for me to completely understand. I'll ask you two simple questions:

1) How much greater is the SIMD throughput for Zen 3 over Zen 2 and Intel's Comet Lake?

2) How does Zen 3 compare with the M1 in that regard?

According to Andrei F.:

1. For Zen 3 over Zen 2 it's workload dependent: 0% to 50%.

2. Don't know. Too hard; code quality, instruction mix etc. all matter.
I can't find anything that says Firestorm supports FMA, or, if it does, on how many ports.
If it does FMA on all ports, then in terms of absolute width executed per cycle it's a very, very slight advantage to Zen 2/3 (512-bit add + mul, vs 512-bit mul and 768-bit add, which can be 1024-bit add).
If it only does FADD or FMUL, then it's 512 bits of add or mul vs 2x FMA (512-bit add + 512-bit mul) + up to 2x FADD (512-bit add).
If the workload was only FADD or only FMUL, then it is equal: 512-bit vs 512-bit.
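The width figures in that comparison can be reproduced with a small sketch. It assumes, as the post does hypothetically, that every Firestorm pipe can do FMA, and it counts an FMA pipe as contributing its full width to both add and multiply; the helper name is made up for this example. The 2x FMA + 1x FADD case works out to 768 bits of add per cycle.

```python
# Peak add/multiply width per cycle for a given mix of FP pipes.
# Counting rule (an assumption matching the post's arithmetic): an FMA pipe
# supplies width_bits of multiply AND width_bits of add; an FADD pipe
# supplies width_bits of add only.

def width_per_cycle(fma_pipes: int, fadd_pipes: int, width_bits: int):
    """Return (add_bits, mul_bits) per cycle with every listed pipe busy."""
    mul_bits = fma_pipes * width_bits
    add_bits = (fma_pipes + fadd_pipes) * width_bits
    return add_bits, mul_bits


# Hypothetical: Firestorm with FMA on all four 128-bit pipes.
firestorm = width_per_cycle(fma_pipes=4, fadd_pipes=0, width_bits=128)

# Zen 2/3 with 256-bit pipes: the "more achievable" and best cases.
zen_achievable = width_per_cycle(fma_pipes=2, fadd_pipes=1, width_bits=256)
zen_best_case = width_per_cycle(fma_pipes=2, fadd_pipes=2, width_bits=256)

for name, (add_bits, mul_bits) in [("Firestorm (assumed)", firestorm),
                                   ("Zen 2/3 achievable", zen_achievable),
                                   ("Zen 2/3 best case", zen_best_case)]:
    print(f"{name}: {add_bits}-bit add, {mul_bits}-bit mul per cycle")
```

These are per-cycle peak widths only. They leave out clock speed, load/store bandwidth and the instruction mix of real code, all of which matter.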

But application code and compiler optimisation can make a massive difference here, and in Apple's walled garden that can be an advantage if what is offered in that garden is what you need.

Also Andrei F is wrong on a few points in that review for the x86 side ( normally in a negative to x86 way) , was discussed at length on RWT.

Carfax83 · Dec 13, 2020

jeanlain said:
I suppose this can be evaluated by benchmarking games at very low res so that performance is not limited by the GPU. But this will be hard to achieve on the M1's integrated GPU. And very few games have universal macOS versions anyway. There's only WoW and some Minecraft rip-off, AFAIK.
Even then, comparisons are difficult because x86/ARM specific optimisations may come into play.

If you want to see some technical details as to why high IPC doesn't necessarily equal high performance, look at this thread:

(3) Discussion - [IPC] Instructions per cycle - How we measure, interpret and apply this metric for modern computing systems | AnandTech Forums: Technology, Hardware, Software, and Deals

Carfax83 · Dec 13, 2020

itsmydamnation said:
Also Andrei F is wrong on a few points in that review for the x86 side ( normally in a negative to x86 way) , was discussed at length on RWT.

I lurk over at RWT from time to time and I did see that thread. It was nice to see him get challenged like that, as there are plenty of experts on that forum who don't take kindly to bs.

TurtleCrusher · Dec 14, 2020

DrMrLordX said:
You think software encoding is niche?

For the vast majority of consumers, absolutely.

DrMrLordX · Dec 14, 2020

TurtleCrusher said:
For the vast majority of consumers, absolutely.

It's never been niche around here.

Gideon · Dec 14, 2020

Carfax83 said:
I lurk over at RWT from time to time and I did see that thread. It was nice to see him get challenged like that, as there are plenty of experts on that forum who don't take kindly to bs.

RWT looks to have a lot of really informed people but I can't stand the forum interface. Does anyone have a link to the thread? IMO it's impossible to find anything there ...

Carfax83 · Dec 14, 2020

Gideon said:
RWT looks to have a lot of really informed people but I can't stand the forum interface. Does anyone have a link to the thread? IMO it's impossible to find anything there ...

Yeah it has a really old school layout. Reminds me of Aceshardware forum back in the day. Here's the link though. It's a long thread:

Apple M1 discussion on RWT

Gideon · Dec 14, 2020

Carfax83 said:
Yeah it has a really old school layout. Reminds me of Aceshardware forum back in the day. Here's the link though. It's a long thread:

Apple M1 discussion on RWT

Thanks!

Entropyq3 · Dec 14, 2020

DrMrLordX said:
It's never been niche around here.

Let's not, for even a second, pretend that the denizens of these forums are typical consumers.

Bam360 · Dec 14, 2020

Carfax83 said:
This is kind of making my argument for me. The lower clock speed will put it at a disadvantage in many workloads that don't have high IPC throughput, ie gaming being the most notorious one. There's a good reason why Intel was so dominant in gaming over the years; the very high clock speeds of their CPUs and the low memory latency.

I don't know if encoding corresponds well with high IPC. It may be throughput limited as you say, as SIMD is utilized heavily.

The reason Intel was so dominant in gaming is not higher clocks; it was almost exclusively memory latency. Memory performance is part of the IPC equation, and Intel won against Zen 2 even at the same clocks (there were some tests using fixed clocks).
So what I was saying is simply that IPC will be lower in workloads that put 256-bit vectors to good use. That is not because of lower clocks; it is because the architecture doesn't feature modern 256-bit registers. When a future architecture does, it will have that IPC advantage like it does in other workloads.
