Question: x86 and ARM architectures comparison thread.


poke01

Diamond Member
Mar 8, 2022
4,756
6,093
106
It is because AVX512 was the last bastion of performance superiority for x86 (AMD).

People who complain about SME boosting the M4's ST score never complain about Zen 4's AVX512 boosting Object Detection even more.
AVX512 is really good.
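For a sense of why: the hot loop in vision-style workloads is essentially a fused multiply-add reduction, which AVX512 chews through 16 floats at a time. A minimal sketch (hand-rolled; the function name is mine, and I'm assuming, not asserting, that GB6's Object Detection boils down to kernels like this):

```c
/* Dot-product kernel of the sort convolution/object-detection code spends
   its time in. Illustrative only: assumes an AVX-512F CPU and n % 16 == 0. */
#include <immintrin.h>
#include <stddef.h>

float dot_avx512(const float *a, const float *b, size_t n) {
    __m512 acc = _mm512_setzero_ps();            /* 16 float lanes */
    for (size_t i = 0; i < n; i += 16) {
        __m512 va = _mm512_loadu_ps(a + i);      /* 512-bit unaligned loads */
        __m512 vb = _mm512_loadu_ps(b + i);
        acc = _mm512_fmadd_ps(va, vb, acc);      /* acc += va * vb, per lane */
    }
    return _mm512_reduce_add_ps(acc);            /* horizontal sum of lanes */
}
```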

I wonder when ARM licensees will add proper SVE2 support in client. I give it 5-10 years.

Especially with Intel finally adopting AVX512 across the whole stack, server to mobile, in Nova Lake.
 

johnsonwax

Senior member
Jun 27, 2024
469
674
96
The problem for me is not SME, but SME being part of ST performance when it's not part of a single core.
But isn't the underlying problem here that the benchmarks are structured around antiquated architectural notions? Yes, single core is useful for evaluating the core of the chip, and somewhat useful for evaluating how categories of software that tend to be single-thread dominated will perform. So if you're looking from the hardware side outward, SME doesn't make sense as part of a single-thread benchmark. But if you look from the code side and you make an SME call from inside your main loop, it's perfectly single-threaded. You don't need to worry about coherency, etc., and as such incorporating SME into your code, even unknowingly, is quite safe.

That's not the general case when you jump to multicore. Even though there are libraries you can hand a task to that will utilize all cores and block until all results are returned, that's not really a thing you encounter in the wild. Consider how GPUs are utilized in the general case: normally you aren't looking for a return value, so you don't care how parallelized they are. You're sending off a bunch of work and don't care about the result, because the output goes to the display, not back to the CPU. CUDA of course changes that, as do all of the AI uses for GPU/NPU.

That's why AVX makes sense as part of ST performance: your parallelized computation is still a single atomic entity (the vector) that comes back as a complete unit, so there's no multithreaded overhead when doing compute with a vector unit. And it's relatively easy to incorporate into software, so it gets used more often, and more competently, than multithreading does.

Compute isn't as cleanly organized now as it was when the Pentium D first shipped, but the structure of the benchmarks hasn't really contended with that. SME shouldn't be included in multicore either: multicore is often evaluated as the drop-off from a linear multiple of single-core performance, and SME would mess that up too, so it's not appropriate to put it there either.
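A minimal sketch of the distinction, in plain C, with a hypothetical vector_dot standing in for whatever AVX/SME-backed routine a framework hands you:

```c
/* Sketch only: vector_dot and worker are hypothetical names. */
#include <pthread.h>
#include <stddef.h>

float vector_dot(const float *a, const float *b, size_t n); /* wide-unit kernel */

/* Using a vector/matrix unit from the main loop: the call is synchronous and
   the result comes back as one value. No coherency logic in sight. */
float score_st(const float *a, const float *b, size_t n) {
    return vector_dot(a, b, n);
}

/* Going multicore is a different kind of commitment: split the work, spawn,
   join, combine. All of that is the caller's problem. */
struct chunk { const float *a, *b; size_t n; float out; };

static void *worker(void *p) {
    struct chunk *c = p;
    c->out = vector_dot(c->a, c->b, c->n);
    return NULL;
}

float score_mt(const float *a, const float *b, size_t n) {
    pthread_t t;
    struct chunk hi = { a + n / 2, b + n / 2, n - n / 2, 0.0f };
    pthread_create(&t, NULL, worker, &hi);  /* hand half to another core */
    float lo = vector_dot(a, b, n / 2);     /* do our half meanwhile */
    pthread_join(&t, NULL);                 /* block until the other half lands */
    return lo + hi.out;
}
```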
 
  • Like
Reactions: Joe NYC

Doug S

Diamond Member
Feb 8, 2020
3,795
6,725
136
The fact of the matter is that Apple waited on SVE to be matured to fully make the jump to adopting it, and AMX was simply a stopgap solution to address the shortcomings of Arm NEON.

Not sure what they meant with that quote. Apple does not support SVE, and if they meant SME (based on the link in the article, I think that's very likely), they didn't "wait on SME to be matured"; they waited on ARM to standardize AMX and give it a new name.
 

LightningDust

Member
Sep 3, 2024
82
184
66
The problem for me is not SME, but SME being part of ST performance when it's not part of a single core.

That's true of a lot of things, though.

The LLC is usually shared between cores, for instance, and the L2 often is. Does that make single-thread benchmarks invalid for, say, a core that shares an L2 cache with another one? After all, its L2 hit behavior will be very different when the second core is loaded versus when it isn't.
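You can watch this happen with a toy experiment; a sketch, with sizes that are assumptions to tune for the machine at hand rather than anything canonical:

```c
/* Time the same single-threaded traversal alone, then with a sibling thread
   streaming memory to pressure the shared L2/LLC. Sketch only. */
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define WS   (4u << 20)    /* 4 MiB "benchmark" working set (assumed) */
#define POLL (64u << 20)   /* 64 MiB streamed by the polluter (assumed) */

static volatile int keep_polluting = 1;
static unsigned char pol[POLL];

static void *polluter(void *arg) {
    (void)arg;
    unsigned long sink = 0;
    while (keep_polluting)                       /* evict shared-cache lines */
        for (size_t i = 0; i < POLL; i += 64)
            sink += pol[i];
    return (void *)sink;
}

static double traverse(const unsigned char *buf, int reps) {
    struct timespec t0, t1;
    unsigned long sink = 0;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (int r = 0; r < reps; r++)
        for (size_t i = 0; i < WS; i += 64)      /* one cache line per step */
            sink += buf[i];
    clock_gettime(CLOCK_MONOTONIC, &t1);
    if (sink == 42) puts("");                    /* keep the loop alive */
    return (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;
}

int main(void) {
    unsigned char *buf = malloc(WS);
    for (size_t i = 0; i < WS; i++) buf[i] = (unsigned char)i;
    printf("alone:        %.3f s\n", traverse(buf, 500));

    pthread_t t;
    pthread_create(&t, NULL, polluter, NULL);
    printf("with sibling: %.3f s\n", traverse(buf, 500)); /* same code, now contended */
    keep_polluting = 0;
    pthread_join(&t, NULL);
    free(buf);
    return 0;
}
```

Same "single thread" code, two different numbers, and nobody calls that cheating.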
 

mikegg

Platinum Member
Jan 30, 2010
2,091
633
136
The problem for me is that it's skewing the Geekbench average for a category it's not part of. IMO, I don't have any issue with its existence.
So the real problem is that it makes Apple CPUs look as good as they truly are. Got it.
 

mikegg

Platinum Member
Jan 30, 2010
2,091
633
136
🔔🔔🔔

Everything else is just working backwards from that.
People just can't fathom that a fanless phone SoC is faster in most tasks than their overclocked DIY machine with 12 fans and liquid cooling.

What are the excuses we've heard?
  • Only AVX512 benchmarks matter for CPUs
  • SME CPU instruction support is cheating for Apple
  • Let's only use non-ARM-optimized Cinebench R23 and Passmark to make comparisons
  • Only MT performance matters and AMD wins there because they have Epyc and Apple doesn't
  • Apple cores are bigger so it isn't fair (they're the same size as AMD cores)
  • Apple uses a more advanced node so it isn't fair (old Apple SoCs on the same node are still way more efficient)
  • macOS is highly optimized for Apple Silicon (it is, but an M4 Max running Windows via Parallels still has higher ST and MT than any AMD/Intel laptop)
 

511

Diamond Member
Jul 12, 2024
5,340
4,753
106
You people are taking my comments the wrong way. All I said was: remove SME from the ST average for GB comparisons, because it's not part of the ST core. You can add it when it's part of the core. That's it. Apple has had better int/FP cores for a while.
 

johnsonwax

Senior member
Jun 27, 2024
469
674
96
You people are taking my comments the wrong way. All I said was: remove SME from the ST average for GB comparisons, because it's not part of the ST core. You can add it when it's part of the core. That's it. Apple has had better int/FP cores for a while.
Honest question: I had been coding for some time before FPUs were introduced. In that era, would a benchmark that used the FPU have been appropriate as a single-core benchmark? Any modern CPU will have multiple ALUs in multiple pipelines. Does that variation impact single-thread benchmarking? The 68881 was the FPU for the 68020, but you were still writing single-threaded code for that asymmetric "dual core" system.

What I'm getting at is that I suspect you don't have a coherent concept of "single thread". Instead you have a concept of a single core that is a product of a particular implementation (specifically the Pentium) rather than a clear concept. What does it matter if the SME unit is wired to be accessible to all cores rather than integrated into each core? It's not like we treat memory benchmarks differently because memory, too, is accessible to all cores, or even cache, as noted above.

If you approach it from the perspective of the code, ST vs. MT carries very clear costs in terms of maintaining coherency between separate threads, something you don't need to do with a memcpy, despite the fact that it's interfacing with a part of the system that is accessible to all cores, and despite the fact that the speed of a memcpy will vary depending on how many cores are hitting the memory controller simultaneously. And how could you even write a benchmark that doesn't allocate memory? You simply have to live with it because you can't debate it. But you can debate SME, so you do.
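For what it's worth, the memcpy case is trivial to write down; a sketch (the 256 MiB size is an arbitrary assumption):

```c
/* Unambiguously single-threaded code whose score still depends on the
   memory controller, a resource shared by every core. Sketch only. */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>

int main(void) {
    size_t n = 256u << 20;                     /* 256 MiB (assumed size) */
    char *src = malloc(n), *dst = malloc(n);
    memset(src, 1, n);                         /* touch pages before timing */

    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    memcpy(dst, src, n);                       /* one thread, shared machinery */
    clock_gettime(CLOCK_MONOTONIC, &t1);

    double s = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;
    printf("%.2f GB/s\n", n / s / 1e9);        /* moves with other cores' traffic */
    free(src); free(dst);
    return 0;
}
```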

One reason ST vs. MT is differentiated is that an awful lot of code doesn't benefit from multithreading; at the very least, it doesn't scale anywhere close to linearly. More than that, most code that gets written isn't threaded even when it would benefit, because threading your code adds substantial complexity. Most of the data science code I wrote would have benefited from threading, but I did it maybe 5 times, because the maintenance overhead for a model I'd run a couple of times a year just isn't worth it when my time was very valuable and buying compute was very cheap. And that's pretty much true for about 90% of the software out there. And while most apps won't get the benefits of SME unless they're calling a framework or API that uses it, that's also true of the FPU. But the bottom line is that the M4 has only two SME units, one for the P cores and one for the E cores, so if you're running a ST benchmark you can only be using one SME unit. How is that different from the 68881?
 

DavidC1

Platinum Member
Dec 29, 2023
2,106
3,224
106
Apple's core could be 2x the size of Lion Cove, but that doesn't change the fact that it still goes in a phone. Hating Apple is fine, but objective analysis should be kept separate. They have a fantastic core.

There was some merit to saying that they were only competitive in Geekbench, but with the M5 they win in Cinebench too. No more excuses exist. Back in the day the line was that ARM was only for low power and couldn't compete. Well, they are low power and they compete now. All the AMD vs. Intel arguments, or even the P-core vs. E-core team arguments, are like sheep in a pen butting heads while Apple stands outside laughing at the bubble the x86 vendors are stuck in.
Only MT performance matters and AMD wins there because they have Epyc and Apple doesn't
Apple is too consumer-focused to go after the datacenter, but it doesn't matter, because they make money hand over fist. It's actually an advantage for them to ignore servers, because it lets them be better at what they do.

But as Intel demonstrated with the Core architecture and Nehalem, making a good core is the hard part. It was questionable whether Intel could take Core's benefits and bring them to servers, but Nehalem's changes uncorked the potential of the CPU, and the gains were massive. Should Apple ever make a server chip, I'd expect similarly gigantic advantages on the server side. Not that they need to do so.
 

Geddagod

Golden Member
Dec 28, 2021
1,654
1,686
136
Apple's core could be 2x the size of Lion Cove, but that doesn't change the fact that it still goes in a phone. Hating Apple is fine, but objective analysis should be kept separate. They have a fantastic core.
Apple cores are really big. Honestly, it seems like Qcomm is catching up to them in PPA, though. They are like a gen behind in perf, matching Apple's last-gen perf with a core that is only ~70% of the area. Since both Qcomm and Apple do the shared SL2 thing, no one should have any qualms about these sorts of area comparisons.
Power is a bit trickier, since Geekerwan's comparisons are scuffed. From the GB6 perf/watt curve in Xiaobai's tech review, they have the same perf/watt as Apple at ~1 watt per core, and are a gen behind higher up on the curve.
 

DavidC1

Platinum Member
Dec 29, 2023
2,106
3,224
106
Apple cores are really big. Honestly, it seems like Qcomm is catching up to them in PPA, though. They are like a gen behind in perf, matching Apple's last-gen perf with a core that is only ~70% of the area. Since both Qcomm and Apple do the shared SL2 thing, no one should have any qualms about these sorts of area comparisons.
Power is a bit trickier, since Geekerwan's comparisons are scuffed. From the GB6 perf/watt curve in Xiaobai's tech review, they have the same perf/watt as Apple at ~1 watt per core, and are a gen behind higher up on the curve.
Die area matters less for Apple than for merchant vendors like Qualcomm, because every component in an Apple product serves the big picture, which is the product itself. If they lose a few % on a bigger die to make their product more attractive, it's worth it for them, where it might not be for a merchant vendor. They still hold the majority of phone revenue share, and not by a small margin either; Samsung in 2nd place is taking scraps in comparison.

Also, it doesn't change the fact that x86 vendors still can't make chips that fit properly in a tablet, never mind a phone. It's a proper embarrassment. Apple should have abandoned Intel in 2015.
 

Joe NYC

Diamond Member
Jun 26, 2021
4,122
5,664
136
People just can't fathom that a fanless phone SoC is faster in most tasks than their overclocked DIY machine with 12 fans and liquid cooling.

What are the excuses we've heard?
  • Only AVX512 benchmarks matter for CPUs
  • SME CPU instruction support is cheating for Apple
  • Let's only use non-ARM-optimized Cinebench R23 and Passmark to make comparisons
  • Only MT performance matters and AMD wins there because they have Epyc and Apple doesn't
  • Apple cores are bigger so it isn't fair (they're the same size as AMD cores)
  • Apple uses a more advanced node so it isn't fair (old Apple SoCs on the same node are still way more efficient)
  • macOS is highly optimized for Apple Silicon (it is, but an M4 Max running Windows via Parallels still has higher ST and MT than any AMD/Intel laptop)

When I posted in the Zen 7 thread that it seemed to me that AMD is determined to take the performance crown from Apple, it was an acknowledgement that Apple holds the performance crown in many tasks/benches.

You followed that with "AMD can't, because of" a number of exaggerated claims, such as having to make up a 720% efficiency gap (which only exists in your head).

The real distance, the real lead of the M series, is nowhere near that. Trying to break down the components of the real lead does not equal "excuses". It is an exploration of:
- contribution from the process node
- contribution from instruction sets
- contribution from accelerators
- contribution from the core itself
- contribution from the server vs. client orientation of the core
 
  • Like
Reactions: Jan Olšan

DavidC1

Platinum Member
Dec 29, 2023
2,106
3,224
106
Let's see how QC v3 cores handle proper tasks like Blender CPU and Handbrake.
The advantage that ARM has is biggest in scalar integer workloads. The farther a workload veers from that, the smaller the advantage gets, because it becomes more about how much you can fit in on the latest process node and less about the really hard stuff: the smarts that come from brilliant engineering and management. So vector/FP workloads are easier to catch up in, MT is easier, and GPUs are easier still.

This is why Apple's GPUs are their least advantaged part: at heart, a faster GPU is mostly about having more parallel execution units. If scalar integer were like that, they'd have gone to Atom-level narrow cores, 512 of them, on enthusiast PCs years ago, and the CMT 8-core Bulldozer would have been more competitive against Sandy Bridge. A 10% advantage in scalar performance is probably worth a 40-50% multithread and/or vector advantage, or even more. A big dGPU from 8 years ago can brute-force beat modern iGPUs in any performance metric, whereas a top-of-the-line 125W CPU from a few years ago will hardly beat a 10W E-core in some applications.
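A rough Amdahl's-law sanity check of that ratio, with an assumed 90% scalar / 10% vectorizable runtime split (the split is my assumption, purely illustrative):

```c
/* speedup = 1 / ((1 - p) + p / s), where p is the accelerated fraction. */
#include <stdio.h>

static double amdahl(double p, double s) {
    return 1.0 / ((1.0 - p) + p / s);
}

int main(void) {
    /* +10% scalar helps 90% of the runtime... */
    printf("+10%% scalar: %.3fx\n", amdahl(0.90, 1.10));  /* ~1.089x */
    /* ...while +50% vector only helps the 10% that vectorizes. */
    printf("+50%% vector: %.3fx\n", amdahl(0.10, 1.50));  /* ~1.034x */
    return 0;
}
```

On a scalar-dominated workload the small scalar gain wins; flip the split and the vector gain wins, which is exactly why the advantage shrinks as workloads get more parallel.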
When I posted in the Zen 7 thread that it seemed to me that AMD is determined to take the performance crown from Apple, it was an acknowledgement that Apple holds the performance crown in many tasks/benches.
I wish they would, because the x86 world is just sad.

But it seems the general public's expectation that the ARM ISA has an insurmountable advantage over x86 is slowly and steadily becoming true, even though, if everything were ISO with all teams executing, I don't think the differences are anywhere near that big. A Micron presentation said DRAM is king because it has all the manpower and man-hours, all the research going into it like a freight train, and that's why all the so-called "DRAM killers" die. ARM research, development, and aspiring new engineers far outnumber the x86 world's. It's a self-fulfilling prophecy: you think you suck, so you become the suck, not because you actually suck. What the x86 world needed was a truly open system, so a 3rd or 4th player could come in and upend the imaginary AMD/Intel competition.
 

AMDK11

Senior member
Jul 15, 2019
489
435
136
[two attached images]
 

poke01

Diamond Member
Mar 8, 2022
4,756
6,093
106
I wish they would, because the x86 world is just sad.
No, x86 Intel is bad; x86 AMD is great.

The Zen core is small, and it excels at the purpose it was made for. Meanwhile, Intel's P-core fails at both client and server.
 

AMDK11

Senior member
Jul 15, 2019
489
435
136
M5 P-core
Decode: 10-wide (10 uops)
Rename: 10 uops
Integer PRF: 423
Vector/FP PRF: 447
ROB: 430
PRRT: 1038 (Physical Register Reclaim Table)!!!
AGU: 4 (2x load, 1x load/store, 1x store)
Load queue: 213
Store queue: 76

Zen 5
Decode: 8-wide (2x 4-wide, 16 uops (2x 8 uops))
Uop cache: 6144 uops, 12-wide (12 uops)
Multiplex: 8 uops
Rename: 8 uops
Integer PRF: 240
Vector/FP PRF: 384
ROB: 448
AGU: 4 (4x load/store; up to 4 loads or 2 loads + 2 stores)
Load queue: 202
Store queue: 104
 

Schmide

Diamond Member
Mar 7, 2002
5,786
1,085
126
The advantage that ARM has is biggest in scalar integer workloads. The farther a workload veers from that, the smaller the advantage gets, because it becomes more about how much you can fit in on the latest process node and less about the really hard stuff: the smarts that come from brilliant engineering and management. So vector/FP workloads are easier to catch up in, MT is easier, and GPUs are easier still.

This is why Apple's GPUs are their least advantaged part: at heart, a faster GPU is mostly about having more parallel execution units. If scalar integer were like that, they'd have gone to Atom-level narrow cores, 512 of them, on enthusiast PCs years ago, and the CMT 8-core Bulldozer would have been more competitive against Sandy Bridge. A 10% advantage in scalar performance is probably worth a 40-50% multithread and/or vector advantage, or even more. A big dGPU from 8 years ago can brute-force beat modern iGPUs in any performance metric, whereas a top-of-the-line 125W CPU from a few years ago will hardly beat a 10W E-core in some applications.

I wish they would, because the x86 world is just sad.

But it seems the general public's expectation that the ARM ISA has an insurmountable advantage over x86 is slowly and steadily becoming true, even though, if everything were ISO with all teams executing, I don't think the differences are anywhere near that big. A Micron presentation said DRAM is king because it has all the manpower and man-hours, all the research going into it like a freight train, and that's why all the so-called "DRAM killers" die. ARM research, development, and aspiring new engineers far outnumber the x86 world's. It's a self-fulfilling prophecy: you think you suck, so you become the suck, not because you actually suck. What the x86 world needed was a truly open system, so a 3rd or 4th player could come in and upend the imaginary AMD/Intel competition.

Did you see the latest Level 2 Jeff? If ARM had all these advantages, it wouldn't perform that badly. Or sadly. Strong memory ordering and efficient cache snooping are things ARM is never going to have, so you either have to rewrite all your code or gimp it down. The general public is skeptical.

Sidenote: Bulldozer actually aged well and closed the gap to Sandy Bridge. Too bad it lasted 10 years without improving.
 

poke01

Diamond Member
Mar 8, 2022
4,756
6,093
106
Did you see the latest Level 2 Jeff? If ARM had all these advantages, it wouldn't perform that badly. Or sadly. Strong memory ordering and efficient cache snooping are things ARM is never going to have, so you either have to rewrite all your code or gimp it down. The general public is skeptical.

Sidenote: Bulldozer actually aged well and closed the gap to Sandy Bridge. Too bad it lasted 10 years without improving.
That ARM chip Jeff tested doesn't represent the mainstream ARM cores made by ARM, Qualcomm, or Apple.

It's like comparing some Zhaoxin x86 chip to AMD's Zen 5 and saying x86 as a whole is bad because the Zhaoxin performed poorly.
 

Schmide

Diamond Member
Mar 7, 2002
5,786
1,085
126
That ARM chip Jeff tested doesn't represent the mainstream ARM cores made by ARM, Qualcomm, or Apple.

It's like comparing some Zhaoxin x86 chip to AMD's Zen 5 and saying x86 as a whole is bad because the Zhaoxin performed poorly.
It's not just that chip. We've been fighting to get a GPU working on ARM for, I would say, 7 years. We barely have it working and it still sucks.

Show me an ARM platform that runs any GPU well, other than Apple and Imagination derivatives.
 

poke01

Diamond Member
Mar 8, 2022
4,756
6,093
106
Show me an ARM platform that runs any GPU well, other than Apple and Imagination derivatives.
Well, I would've said the Nvidia N1X, but Nvidia isn't capable enough to ship on time.

So yeah, you are right: other than Apple, there is no competent GPU on ARM.

(Some will say Qualcomm has good GPUs, and to that I say: uh, no. Qualcomm doesn't have good GPUs for compute or AAA gaming.)
 
  • Like
Reactions: perry mason

johnsonwax

Senior member
Jun 27, 2024
469
674
96
It's not just that chip. We've been fighting to get a GPU working on ARM for, I would say, 7 years. We barely have it working and it still sucks.

Show me an ARM platform that runs any GPU well, other than Apple and Imagination derivatives.
I mean, technically Nvidia Grace, but not in the way you mean. There's no reason why Nvidia can't make that product.