Question Bloomberg: Apple testing SoCs with 16 and 32 high performance cores

jeanlain

Member
Oct 26, 2020
84
50
51
See above. HEVC does have ARM64 hand tuned assembly optimizations.
It doesn't mean that the level of optimisation is comparable to the X86 version.
In addition, these optimisations may not have been included in handbrake yet. I've read that's the case for some NEON optimisations contributed by Apple.
 

Bam360

Member
Jan 10, 2019
30
58
61
The M1 has 4x 128 bit NEON, so isn't that similar to 2x 256 bit AVX in throughput?
Yes, but is that a good thing? Similar throughput is not a good thing in this case: Apple's core is super wide with a much lower clock speed, so to be equal to or faster than a higher-clocked, narrower core it needs not only more ALUs or a larger ROB window, but also more throughput in SIMD workloads.
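To put rough numbers on that, here is a minimal sketch of the peak-width arithmetic. The pipe counts and vector widths are the ones from the question above; the x86 clock is an illustrative assumption, not a measurement, and real throughput depends on instruction mix and port layout.

```python
# Peak SIMD width per cycle = number of pipes * vector width in bits.
def simd_bits_per_cycle(pipes: int, vector_bits: int) -> int:
    return pipes * vector_bits

# Scaling by clock (GHz) gives bits per nanosecond, i.e. Gbit/s of vector width.
def simd_gbits_per_second(pipes: int, vector_bits: int, clock_ghz: float) -> float:
    return simd_bits_per_cycle(pipes, vector_bits) * clock_ghz

m1_per_cycle = simd_bits_per_cycle(4, 128)   # 4x 128-bit NEON
avx_per_cycle = simd_bits_per_cycle(2, 256)  # 2x 256-bit AVX

# Clock speeds below are assumptions chosen for illustration only.
M1_CLOCK_GHZ = 3.2
X86_CLOCK_GHZ = 4.8

m1_rate = simd_gbits_per_second(4, 128, M1_CLOCK_GHZ)
x86_rate = simd_gbits_per_second(2, 256, X86_CLOCK_GHZ)

print(m1_per_cycle, avx_per_cycle)  # same width per cycle
print(x86_rate / m1_rate)           # the gap comes entirely from the clock ratio
```

With equal width per cycle, the higher-clocked core keeps an edge proportional to the clock ratio, which is the point being made here.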
 

Carfax83

Diamond Member
Nov 1, 2010
6,064
868
126
It doesn't mean that the level of optimisation is comparable to the X86 version.
I already acknowledged that. But it also doesn't mean any pending optimization could bridge that gap successfully.

In addition, these optimisations may not have been included in handbrake yet. I've read that's the case for some NEON optimisations contributed by Apple.
Why would you think it wouldn't be in Handbrake? x265 version 3.4 was released back in June, so the latest versions of Handbrake should be using it.
 

Carfax83

Diamond Member
Nov 1, 2010
6,064
868
126
Yes, but is that a good thing? Similar throughput is not a good thing in this case: Apple's core is super wide with a much lower clock speed, so to be equal to or faster than a higher-clocked, narrower core it needs not only more ALUs or a larger ROB window, but also more throughput in SIMD workloads.
This is kind of making my argument for me. The lower clock speed will put it at a disadvantage in many workloads that don't have high IPC throughput, gaming being the most notorious one. There's a good reason why Intel was so dominant in gaming over the years: the very high clock speeds of their CPUs and the low memory latency.

I don't know if encoding corresponds well with high IPC. It may be throughput limited as you say, as SIMD is utilized heavily.
 

jeanlain

Member
Oct 26, 2020
84
50
51
I already acknowledged that. But it also doesn't mean any pending optimization could bridge that gap successfully.



Why would you think it wouldn't be in Handbrake? x265 version 3.4 was released back in June, so the latest versions of Handbrake should be using it.
What's so special about x265 that would make x86 inherently better?
As for handbrake, it was actually about x265 itself, which has not yet included Apple's optimisations in its main branch: Posted: 29 Nov 2020 08:39
I couldn't verify the source, but this poster appears to be contributing to handbrake or x265.
 

Thunder 57

Golden Member
Aug 19, 2007
1,676
1,724
136
Can we have a conversation without you getting all butthurt that Apple is/will/could be faster than AMD for f sakes?

Yes, we get it. You love AMD and you want to own the fastest chips without having to use Apple products. Not being able to in the near future hurts your manlihood. Now let's put that aside and focus on actual CPU discussions.
The M1 is a very impressive chip, especially when it comes to efficiency. You have said more than once though that it is the best CPU out there, and that is not true.
 

jeanlain

Member
Oct 26, 2020
84
50
51
This is kind of making my argument for me. The lower clock speed will put it at a disadvantage in many workloads that don't have high IPC throughput, gaming being the most notorious one.
I don't get it. The M1 is very fast at single threaded tasks. That has been shown in many types of workloads. Why would games differ? Do you believe that the IPC of the M1 would somehow be lower for games?
 

Carfax83

Diamond Member
Nov 1, 2010
6,064
868
126
What's so special about x265 that would make x86 inherently better?
I didn't say x86 was inherently better. It has nothing to do with ISA, but microarchitecture. It's possible that heavy compute intensive encoding like HEVC is throughput limited, so low clock speed+high IPC won't necessarily help you there. What would help the most would be high clock speed, number of cores, SIMD performance.

As for handbrake, it was actually about x265 itself, which has not yet included Apple's optimisations in its main branch: Posted: 29 Nov 2020 08:39
I couldn't verify the source, but this poster appears to be contributing to handbrake or x265.
But handbrake 1.4 includes these optimizations? So Handbrake is using a custom x265 variant for version 1.4? That's interesting if true.
 

Carfax83

Diamond Member
Nov 1, 2010
6,064
868
126
I don't get it. The M1 is very fast at single threaded tasks. That has been shown in many types of workloads. Why would games differ? Do you believe that the IPC of the M1 would somehow be lower for games?
Because game code typically tends to be poorly optimized and very dependent on memory performance, which is why it's low in IPC. I've read comments from many programmers and software engineers that it's extremely difficult to get high-IPC code from games... even on this forum:

That's why microarchitectures like Skylake were so successful in gaming workloads, and also why Zen 3 had a massive performance gain over Zen 2. Low memory latency is a big deal in gaming, as is high clock speed.
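A toy model of why that happens, as a sketch only: time per instruction is treated as compute time plus average memory stall time. Every number below (IPC, clocks, miss rate, latencies) is hypothetical, and the model ignores out-of-order overlap and memory-level parallelism, so it overstates stalls on real cores.

```python
def instructions_per_ns(ipc, clock_ghz, misses_per_instr, mem_latency_ns):
    """Very rough throughput model: compute time plus exposed memory stalls."""
    compute_ns = 1.0 / (ipc * clock_ghz)          # ns per instruction with no misses
    stall_ns = misses_per_instr * mem_latency_ns  # average stall per instruction
    return 1.0 / (compute_ns + stall_ns)

MISSES = 0.01  # hypothetical: 1 miss to DRAM per 100 instructions

# Hypothetical wide, lower-clocked core vs. narrower, higher-clocked core.
wide_slow = instructions_per_ns(6.0, 3.0, MISSES, 80.0)
narrow_fast = instructions_per_ns(4.0, 5.0, MISSES, 80.0)
# Same narrow core, but with lower memory latency.
narrow_fast_low_latency = instructions_per_ns(4.0, 5.0, MISSES, 60.0)

print(wide_slow, narrow_fast, narrow_fast_low_latency)
```

In this sketch the two cores end up within about 1% of each other once misses dominate, while cutting latency from 80 ns to 60 ns buys roughly 30%. That is the sense in which memory latency can matter more than peak IPC for this kind of code.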
 

itsmydamnation

Platinum Member
Feb 6, 2011
2,234
1,875
136
Doesn't Zen 2 and 3 have 2x AVX2 units per core, plus 2 FMA?
No,

Zen 2 has 4x 256-bit pipes:
two of those are FMAC/FMA
two of those are FADD

Zen 3 has 6x 256-bit pipes:
two of those are FMAC/FMA
two of those are FADD
two of those are int-to-FP conversion and load/store

Now both Zen 2 and 3 can issue 2x FMA + 2x FADD a cycle in the right conditions; more achievable is 2x FMA + 1x FADD.

Just like on both AMD and Intel, ARM NEON pipes don't have to be symmetric either, so doing pipeline counts is rather useless (see Zen 3's FP uplift over Zen 2 without any additional ADD/MUL capacity); you need to look at issue rate per instruction.
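As a small sketch of that point: the pipe lists below just restate the layout described in this post (the labels are made up for illustration), and counting pipes per instruction type shows the total going up while the FMA and FADD issue rates stay put.

```python
from collections import Counter

# Pipe layouts as described in this post; labels are illustrative only.
ZEN2_PIPES = ["FMA", "FMA", "FADD", "FADD"]
ZEN3_PIPES = ["FMA", "FMA", "FADD", "FADD", "CVT/STORE", "CVT/STORE"]

def issue_rate(pipes):
    """Peak issues per cycle for each instruction type, not a total pipe count."""
    return Counter(pipes)

for name, pipes in (("Zen 2", ZEN2_PIPES), ("Zen 3", ZEN3_PIPES)):
    rates = issue_rate(pipes)
    print(name, "pipes:", len(pipes), "FMA/cycle:", rates["FMA"], "FADD/cycle:", rates["FADD"])
```

Six pipes versus four looks like a 50% increase, but the per-instruction rates for FMA and FADD are identical in this model.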
 

jeanlain

Member
Oct 26, 2020
84
50
51
But handbrake 1.4 includes these optimizations? So Handbrake is using a custom x265 variant for version 1.4? That's interesting if true.
You're right. Handbrake uses these optimisations, which apparently reduces the encoding time by almost half. How much room there is for further improvements, I don't know.
The official x265 doesn't include these optimisations, as x265 contributors don't appear interested.
 

Carfax83

Diamond Member
Nov 1, 2010
6,064
868
126
No,

Zen 2 has 4x 256-bit pipes:
two of those are FMAC/FMA
two of those are FADD

Zen 3 has 6x 256-bit pipes:
two of those are FMAC/FMA
two of those are FADD
two of those are int-to-FP conversion and load/store

Now both Zen 2 and 3 can issue 2x FMA + 2x FADD a cycle in the right conditions; more achievable is 2x FMA + 1x FADD.

Just like on both AMD and Intel, ARM NEON pipes don't have to be symmetric either, so doing pipeline counts is rather useless (see Zen 3's FP uplift over Zen 2 without any additional ADD/MUL capacity); you need to look at issue rate per instruction.
This is all a bit too technical for me to completely understand. I'll ask you two simple questions:

1) How much greater is the SIMD throughput for Zen 3 over Zen 2 and Intel's Comet Lake?

2) How does Zen 3 compare with the M1 in that regard?

According to Andrei F.:

The four 128-bit NEON pipelines thus on paper match the current throughput capabilities of desktop cores from AMD and Intel, albeit with smaller vectors. Floating-point operations throughput here is 1:1 with the pipeline count, meaning Firestorm can do 4 FADDs and 4 FMULs per cycle with respectively 3 and 4 cycles latency. That’s quadruple the per-cycle throughput of Intel CPUs and previous AMD CPUs, and still double that of the recent Zen3, of course, still running at lower frequency. This might be one reason why Apple does so well in browser benchmarks (JavaScript numbers are floating-point doubles).
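One way to read those throughput and latency figures, as a back-of-the-envelope sketch: by Little's law, the number of independent operations that must be in flight to keep the pipes full is throughput times latency. The inputs are the figures from the quote; the function name is made up for illustration.

```python
def ops_in_flight_to_saturate(ops_per_cycle: int, latency_cycles: int) -> int:
    """Little's law: independent ops needed in flight = throughput * latency."""
    return ops_per_cycle * latency_cycles

# Figures from the quoted passage: 4 FADD/cycle at 3 cycles, 4 FMUL/cycle at 4 cycles.
fadd_needed = ops_in_flight_to_saturate(4, 3)
fmul_needed = ops_in_flight_to_saturate(4, 4)

print(fadd_needed, fmul_needed)
```

So code with a single dependent chain of adds would use only a fraction of that peak; the quoted per-cycle numbers are reachable only when the workload has that much independent work available.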
 

jeanlain

Member
Oct 26, 2020
84
50
51
Because game code typically tends to be poorly optimized and very dependent on memory performance, which is why it's low in IPC.
I suppose this can be evaluated by benchmarking games at very low res so that performance is not limited by the GPU. But this will be hard to achieve on the M1's integrated GPU. And very few games have universal macOS versions anyway. There's only WoW and some Minecraft rip-off, AFAIK.
Even then, comparisons are difficult because x86/ARM specific optimisations may come into play.
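The low-res idea can be sketched with a simple frame-time model, assuming each frame costs the larger of the CPU time and the GPU time, and that GPU time scales with pixel count. All the numbers are hypothetical.

```python
def gpu_ms_at(pixels: int, ms_per_megapixel: float) -> float:
    """Hypothetical GPU cost that scales linearly with resolution."""
    return pixels / 1e6 * ms_per_megapixel

def fps(cpu_ms: float, gpu_ms: float) -> float:
    """Frame rate when the slower of CPU and GPU sets the frame time."""
    return 1000.0 / max(cpu_ms, gpu_ms)

CPU_MS = 8.0            # hypothetical CPU time per frame
MS_PER_MEGAPIXEL = 6.0  # hypothetical GPU cost

fps_1080p = fps(CPU_MS, gpu_ms_at(1920 * 1080, MS_PER_MEGAPIXEL))  # GPU-bound
fps_720p = fps(CPU_MS, gpu_ms_at(1280 * 720, MS_PER_MEGAPIXEL))    # CPU-bound

print(fps_1080p, fps_720p)
```

Once the resolution is low enough that the GPU term drops below the CPU term, the frame rate stops improving and what is left is a measure of the CPU, which is why the weak integrated GPU makes this hard to do in practice.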
 

itsmydamnation

Platinum Member
Feb 6, 2011
2,234
1,875
136
This is all a bit too technical for me to completely understand. I'll ask you two simple questions:

1) How much greater is the SIMD throughput for Zen 3 over Zen 2 and Intel's Comet Lake?

2) How does Zen 3 compare with the M1 in that regard?

According to Andrei F.:
1. For Zen 3 over Zen 2 it's workload dependent, 0% to 50% :p
2. Don't know, too hard; code quality, instruction mix, etc. all matter.
I can't find anything that says Firestorm supports FMA, and if it does, on how many ports, etc.
So if it does FMA on all ports, then in terms of absolute width executed per cycle it's a very, very slight advantage to Zen 2/3 (512-bit add + mul vs. 512-bit mul, 768-bit add, can be 1024-bit add).
If it only does FADD or FMUL, then it's 512 bits of add or mul vs. 2x FMA (512-bit add + 512-bit mul) + up to 2x FADD (512-bit add).
If the workload was only FADD or only FMUL then it is equal, 512-bit vs. 512-bit.
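The width arithmetic above, written out as a sketch. It assumes the pipe counts from this thread (4x 128-bit for Firestorm, 2x FMA + 2x FADD at 256-bit for Zen 2/3); whether Firestorm does FMA on every pipe is treated here as an open "if", not a confirmed fact.

```python
FIRESTORM_PIPES, FIRESTORM_BITS = 4, 128
ZEN_FMA_PIPES, ZEN_FADD_PIPES, ZEN_BITS = 2, 2, 256

def firestorm_if_fma_on_all_ports():
    """Hypothetical case: every NEON pipe can do a fused multiply-add."""
    width = FIRESTORM_PIPES * FIRESTORM_BITS
    return {"add": width, "mul": width}

def zen_peak(fadd_pipes_used: int):
    """2x FMA always counted, plus however many FADD pipes also issue that cycle."""
    assert 0 <= fadd_pipes_used <= ZEN_FADD_PIPES
    fma_width = ZEN_FMA_PIPES * ZEN_BITS
    return {"mul": fma_width, "add": fma_width + fadd_pipes_used * ZEN_BITS}

print(firestorm_if_fma_on_all_ports())  # add and mul bits per cycle
print(zen_peak(1))                      # the "more achievable" 2x FMA + 1x FADD case
print(zen_peak(2))                      # best case, 2x FMA + 2x FADD
```

The multiply width is the same in both cases; the slight Zen 2/3 advantage comes only from the extra FADD pipes on the add side.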

But application code and compiler optimisation can make a massive difference here, and in Apple's walled garden that can be an advantage if what is offered in that garden is what you need.

Also, Andrei F. is wrong on a few points in that review on the x86 side (normally in a way that is negative to x86); that was discussed at length on RWT.
 

Carfax83

Diamond Member
Nov 1, 2010
6,064
868
126
I suppose this can be evaluated by benchmarking games at very low res so that performance is not limited by the GPU. But this will be hard to achieve on the M1's integrated GPU. And very few games have universal macOS versions anyway. There's only WoW and some Minecraft rip-off, AFAIK.
Even then, comparisons are difficult because x86/ARM specific optimisations may come into play.
If you want to see some technical details as to why high IPC doesn't necessarily equal high performance, look at this thread:

Discussion - [IPC] Instructions per cycle - How we measure, interpret and apply this metric for modern computing systems | AnandTech Forums
 

Carfax83

Diamond Member
Nov 1, 2010
6,064
868
126
Also Andrei F is wrong on a few points in that review for the x86 side ( normally in a negative to x86 way) , was discussed at length on RWT.
I lurk over at RWT from time to time and I did see that thread. It was nice to see him get challenged like that, as there are plenty of experts on that forum who don't take kindly to bs. :D
 

Gideon

Golden Member
Nov 27, 2007
1,430
2,877
136
I lurk over at RWT from time to time and I did see that thread. It was nice to see him get challenged like that, as there are plenty of experts on that forum who don't take kindly to bs. :D
RWT looks to have a lot of really informed people but I can't stand the forum interface. Does anyone have a link to the thread? IMO it's impossible to find anything there ...
 

Bam360

Member
Jan 10, 2019
30
58
61
This is kind of making my argument for me. The lower clock speed will put it at a disadvantage in many workloads that don't have high IPC throughput, gaming being the most notorious one. There's a good reason why Intel was so dominant in gaming over the years: the very high clock speeds of their CPUs and the low memory latency.

I don't know if encoding corresponds well with high IPC. It may be throughput limited as you say, as SIMD is utilized heavily.
The reason Intel was so dominant in gaming is not higher clocks; it was almost exclusively memory latency. Memory performance is part of the IPC equation, and Intel won against Zen 2 even at the same clocks; there were some tests using fixed clocks.
So what I was saying is simply that IPC will be lower in workloads that put 256-bit vectors to good use. It is not because of lower clocks; it is because the architecture doesn't feature modern 256-bit registers. When a future architecture does, it will have that IPC advantage like it does in other workloads.
 
