Question x86 and ARM architectures comparison thread.

poke01 · Nov 16, 2025

mikegg said:
It is because AVX512 was the last bastion of performance superiority for x86 (AMD).

People who complain about SME boosting M4's ST never complain about Zen4 AVX512 boosting Object Detection by even more.

AVX512 is really good.

I wonder when a ARM licensees will add proper SVE2 support in client. I give it 5-10 years.

especially with Intel finally adopting AVX512 across the whole stack, server to mobile in Nova lake.

johnsonwax · Nov 16, 2025

511 said:
The problem for me is not Sme But SME being part of ST performance when it's not part of a single Core..

But isn't the underlying problem here the benchmarks being structured around antiquated architectural notions? Yes, single core is useful for evaluating the core of the chip, and somewhat useful for evaluating how categories of software that tends to be single thread dominated will perform. So if you're looking from the hardware side outward, SME doesn't make sense as part of a single thread benchmark, but if you look from the code side and you make a SME call from inside your main loop, it's perfectly single thread. You don't need to worry about coherency, etc. and as such incorporating SME into your code, even unknowingly is quite safe. That's not the general case when you jump to multicore - even though there are libraries that you can hand a task to that will utilize all cores and block until all results are returned, that's not really a thing you encounter in the wild. Consider how GPUs are utilized in the general case - normally you aren't looking for a return so you don't care how parallelized they are, you're sending off a bunch of work to be done and don't care about the result because the output is to the display, not back to the CPU. CUDA of course changes that as does all of the AI uses for GPU/NPU. That's why AVX makes sense as part of ST performance because your parallelized performance is still in a single atomic entity (the vector) that comes back as a complete unit so there's no multithreaded overhead when doing compute using a vector unit, and it's relatively easy to incorporate into software so it gets used more often, and more competently than multithreaded does.

Compute isn't as cleanly organized now as it was when the Pentium D first shipped, but the structure of the benchmarks haven't really contended with that as SME shouldn't be included in multicore either which is often evaluated as drop-off from the linear multiple of single core performance, which SME would also mess up, so it's not appropriate to put there either.

Doug S · Nov 16, 2025

mikegg said:
The fact of the matter is that Apple waited on SVE to be matured to fully make the jump to adopting it, and AMX was simply a stopgap solution to address the shortcomings of Arm NEON.

Not sure what they meant with that quote. Apple does not support SVE, and if they meant SME (based on the link in the article I think that's very likely) they didn't "wait on SME to be matured" they waited on ARM to standardize AMX and give it a new name.

LightningDust · Nov 16, 2025

511 said:
The problem for me is not Sme But SME being part of ST performance when it's not part of a single Core..

That's true of a lot of things, though.

LLC is usually shared between cores, for instance. L2 often is. Does that make single-thread benchmarks invalid for, say, a core that shares an L2 cache with another one? After all, its L2 hit behavior will be very different if the second core is loaded vs if it isn't.

mikegg · Nov 16, 2025

511 said:
The problem for me is that it's skewing Geekbench average for a category which it is not part of imo I don't have any issue with it's existence.

So the real problem is that it makes Apple CPUs look as good as they truly are. Got it.

okoroezenwa · Nov 16, 2025

mikegg said:
So the real problem is that it makes Apple CPUs look as good as they truly are. Got it.

🔔🔔🔔

Everything else is just working backwards from that.

mikegg · Nov 16, 2025

okoroezenwa said:
🔔🔔🔔

Everything else is just working backwards from that.

People just can't fathom that a fanless phone SoC is faster in most tasks than their overclocked DIY machine with 12 fans inside and liquid cooled.

What are excuses we heard?

Only AVX 512 benchmarks matter for CPUs
SME CPU instruction support is cheating for Apple
Let's only use non-ARM optimized Cinebench R23 and Passmark to making comparisons
Only MT performance matters and AMD wins there because they have Epyc and Apple doesn't
Apple cores are bigger so it isn't fair (they're same size as AMD cores)
Apple uses a more advanced node so it isn't fair (Old Apple SoCs on the same node are still way more efficient)
macOS is highly optimized for Apple Silicon (it is but M4 Max running Parallels running Windows still has higher ST and MT than any AMD/Intel laptop)

511 · Nov 17, 2025

you people are taking my comments the wrong way all i said was remove the SME from ST average for GB comparison cause it's not part of ST Core you can add it when it's part of the core that's it Apple has better Int/fp Cores for a while

511 · Nov 17, 2025

mikegg said:
Apple cores are bigger so it isn't fair (they're same size as AMD cores)

This is not true N4P Cores are not equal to N3E Ones Glymer is better core though

johnsonwax · Nov 17, 2025

511 said:
you people are taking my comments the wrong way all i said was remove the SME from ST average for GB comparison cause it's not part of ST Core you can add it when it's part of the core that's it Apple has better Int/fp Cores for a while

Honest question: I had been coding for some time before FPUs were introduced. In that era, would a benchmark that used the FPU be appropriate for a single core benchmark? Any modern CPU will have multiple ALUs in multiple pipelines. Does that variation impact single thread benchmarking? The 68881 was the FPU for the 68020, but you were still writing single threaded code for this asymmetric dual core system.

What I'm getting at is that I suspect you don't have a coherent concept of 'single thread'. Instead you have a concept of a single core that is a product of implementation (specifically the Pentium) rather than a clear concept. What does it matter if the SME unit is wired in a way to be accessible to all cores rather than integrated into each core. It's not like we treat memory benchmarks differently because it too is accessible to all cores, or even cache as noted above. If you approach it from the perspective of the code, ST vs MT carries very clear costs in terms of maintaining coherency between those separate threads, something you don't need to do with a memcpy despite the fact that its interfacing with a part of the system that is accessible to all cores, despite the fact that the speed of a memcpy will vary depending on how many cores are hitting the memory controller simultaneously. And how could you even write a benchmark that doesn't allocate memory. You simply have to live with it because you can't debate it. But you can debate SME, so you do.

One reason why ST vs MT is differentiated is because an awful lot of code doesn't benefit from multithreading. At the very least, it doesn't scale even close to linearly. And more than that, most code that is written isn't threaded, even if it would benefit from that because threading your code adds substantial complexity to it. Most of the data science code I wrote would have benefited from threading, but I did it maybe 5 times because the maintenance overhead for a model I would run a couple times each year just isn't worth the work when my time was very valuable and buying compute was very cheap. And that is pretty much true for about 90% of the software out there. And while most apps won't get the benefits of SME unless they're calling a framework or API that uses it, that's also true for the FPU. But the bottom line is that the M4 only has two SME units - one for the P cores and one for the E cores, and so if you're running a ST benchmark, you can only be using one SME unit. How is that different from the 68881?

DavidC1 · Nov 17, 2025

Apple core could be 2x the size of Lion Cove, but that doesn't change the fact that it still can go in a phone. Hating Apple is fine, but objective analysis should be separated. They have a fantastic core.

There was some merit saying that they are only competitive on Geekbench, but with the M5 it beats it on Cinebench. No more excuses exist. Back in the day it was that ARM was only for low power and it couldn't compete. Well, they are low power and competes now. All the AMD vs Intel arguments or even the P core vs E core team arguments are like sheep's in a pen getting mad while Apple is outside laughing at the bubble the x86 vendors are stuck in.

mikegg said:
Only MT performance matters and AMD wins there because they have Epyc and Apple doesn't

Apple is too consumer focused to get datacenter but it doesn't matter. Because they make money hands over first. It's actually an advantage for them to ignore server, because it allows them to be better at what they do.

But as Intel demonstrated with Core architecture and Nehalem, making a good core is the hard part. It was questionable whether Intel could take Core's benefits and bringing it to servers. But Nehalem's changes uncorked the potential of the CPU, and the gains were massive. Should Apple ever make a server chip, I expect similar gigantic advantages on the server side. Not that they need to do so.

Geddagod · Nov 17, 2025

DavidC1 said:
Apple core could be 2x the size of Lion Cove, but that doesn't change the fact that it still can go in a phone. Hating Apple is fine, but objective analysis should be separated. They have a fantastic core.

Apple cores are really big. Honestly, it seems like Qcomm is catching up to them though in PPA. They are like a gen behind in perf, and matching Apple's last gen perf with a core that is only ~70% the area. Since both Qcomm and Apple do the shared SL2 thing, no one should have any qualms about these sorts of area comparisons.
Power is a bit trickier, since Geekerwan's comparisons are scuffed. From GB6's perf/watt curve from Xiaobai's tech review, they have the same perf/watt at ~1 watt per core as Apple, and is a gen behind higher up on the curve.

DavidC1 · Nov 17, 2025

Geddagod said:
Apple cores are really big. Honestly, it seems like Qcomm is catching up to them though in PPA. They are like a gen behind in perf, and matching Apple's last gen perf with a core that is only ~70% the area. Since both Qcomm and Apple do the shared SL2 thing, no one should have any qualms about these sorts of area comparisons.
Power is a bit trickier, since Geekerwan's comparisons are scuffed. From GB6's perf/watt curve from Xiaobai's tech review, they have the same perf/watt at ~1 watt per core as Apple, and is a gen behind higher up on the curve.

Die area for Apple matters less than merchant vendors like Qualcomm, because every component in Apple products serves to better the big picture, which is their product. So if they lose few % for a bigger die to get their product more attractive, it's worth it for them, whereas it might not be for merchant vendors. They are still majority revenue share for phones, and not by a small margin either. Samsung which is 2nd is taking scraps in comparison.

Also, it doesn't change that x86 vendors still can't make chips that fit in a Tablet properly, nevermind phones. It's a proper embarassment. Apple should have abandoned Intel in 2015.

poke01 · Nov 17, 2025

Geddagod said:
and is a gen behind higher up on the curve.

let’s see how QC v3 cores handles proper tasks like Blender CPU and Handbrake.

The cpu package power will be interesting to see.

Joe NYC · Nov 17, 2025

mikegg said:
People just can't fathom that a fanless phone SoC is faster in most tasks than their overclocked DIY machine with 12 fans inside and liquid cooled.

What are excuses we heard?

Only AVX 512 benchmarks matter for CPUs

SME CPU instruction support is cheating for Apple

Let's only use non-ARM optimized Cinebench R23 and Passmark to making comparisons

Only MT performance matters and AMD wins there because they have Epyc and Apple doesn't

Apple cores are bigger so it isn't fair (they're same size as AMD cores)

Apple uses a more advanced node so it isn't fair (Old Apple SoCs on the same node are still way more efficient)

macOS is highly optimized for Apple Silicon (it is but M4 Max running Parallels running Windows still has higher ST and MT than any AMD/Intel laptop)

When I posted in Zen 7 thread that it seemed to me that AMD seems determined to take performance crown from Apple, it was an acknowledgement that Apple has the performance crown in many tasks / benches.

You followed that with "AMD can't because of" a number of exaggerated claims, such as having to make up 720% efficiency gap (that only exists in your head).

The real distance, the real lead of M series is nowhere near that. Trying to break down the components of the real lead does not equal "excuses". It is exploration of:
- contribution from process node
- contribution from instruction sets
- contribution from accelerators
- contribution from the core itself
- contribution from server vs. client orientation of the core

DavidC1 · Nov 17, 2025

poke01 said:
let’s see how QC v3 cores handles proper tasks like Blender CPU and Handbrake.

The advantages that ARM has is biggest in the scalar Integer workload. Every workload that veers farther from it, the advantage gets less and less, because it's more of how much can you fit in at the latest process versus the real hard stuff, the smarts required that come from brilliant engineering and managers. So vector/FP workloads are easier to catch up, MT is easier, and GPUs are even easier.

This is why Apple's GPUs are the least advantageous, because at the heart, a faster GPU is mostly about having more parallel execution units. If scalar Integer was like that, then they'd have went to Atom-level narrow core with 512 cores on a enthusiast PC years ago. And CMT 8 core Bulldozer would have been more competitive against Sandy Bridge. A 10% advantage in scalar advantage is probably equal to 40-50% multi-thread and/or vector advantage, or even more. A big dGPU from 8 years ago can brute force beat modern iGPUs in any performance metric, whereas a top of the line 125W CPU from few years ago will hardly beat a 10W E core in some applications.

Joe NYC said:
When I posted in Zen 7 thread that it seemed to me that AMD seems determined to take performance crown from Apple, it was an acknowledgement that Apple has the performance crown in many tasks / benches.

I wish they would, because the x86 world is just sad.

But it seems the expectation by the general public that ARM ISA has insurmountable advantage of x86 is slowly and steadily becoming true, even though if everything was ISO with all teams executing, I don't think the differences are anywhere that close. A Micron presentation said DRAM is king because it has all the manpower and manhours, all the research going into it like a freight train and that's why all so-called "DRAM-killers" die. ARM research, development, and the aspiring new engineers are far more than the x86 world. It's a self-fulfilling prophecy. You think you suck, so you indeed become the suck, not because you actually suck. What the x86 world needed was a truly open system, so a 3rd, or a 4th player can come in an upend imaginary AMD/Intel competition.

AMDK11 · Nov 17, 2025

wRzTwlsaEkFKOx0sVYulGvVzVkEpy-6r5smW8RAZ0riKcajft28hZ6nonSEJYj2Dm7KqR-K0FtEMw_sO8W6t1EiIchtk3MR483wqZ_pn0NXhcCbsMMbLmglrSnnb2HxrmC__uWuvX-AdJnTvHRpHwSetToSviHAF81Qz8vavgDg.webp

Loading…

en.namu.wiki

wRzTwlsaEkFKOx0sVYulGqSr4-14tkOzw2wksvXvWmBrYLtP3rq6sPJd3_QMqTz2VtJno_tC9_o9jGus3AsgCaxFo6Mi5s-PA52Flrs6l9groDZYzW2yh-Ya8Ed75IUu9h1u2v7S-9W_XvalOWb_ueL0EYeIWhTJai9owj3mo50.webp

adroc_thurston · Nov 17, 2025

DavidC1 said:
because at the heart, a faster GPU is mostly about having more parallel execution units

Well that's just 2/10 bait

poke01 · Nov 17, 2025

.

DavidC1 said:
wish they would, because the x86 world is just sad.

no x86 Intel is bad. x86 AMD is great.

The zen core is small and for the purpose it is made for it excels there. meanwhile Intels P core fails at both client and server.

AMDK11 · Nov 17, 2025

P M5
Decode 10-Wide(10 uops)
Rename 10 uops
423 Integer PRF
447 Vector/FP PRF
ROB 430
PRRT 1038 (Physical Register Reclaim Table)!!!
4xAGU(2xLoad, 1xLoad/Store, 1xStore)
Load Queue 213
Store Queue 76

Zen5
Decode 8-Wide(2x 4-Wide)(16 uops(2x 8 uops))
6144 uops Cache 12-Wide(12 uops)
Multiplex 8 uops
Rename 8 uops
240 Integer PRF
384 Vector/FP PRF
ROB 448
4xAGU(4xLoad/Store)(4xLoad/2xLoad +2xStore)
Load Queue 202
Store Queue 104

Schmide · Nov 17, 2025

DavidC1 said:
The advantages that ARM has is biggest in the scalar Integer workload. Every workload that veers farther from it, the advantage gets less and less, because it's more of how much can you fit in at the latest process versus the real hard stuff, the smarts required that come from brilliant engineering and managers. So vector/FP workloads are easier to catch up, MT is easier, and GPUs are even easier.

This is why Apple's GPUs are the least advantageous, because at the heart, a faster GPU is mostly about having more parallel execution units. If scalar Integer was like that, then they'd have went to Atom-level narrow core with 512 cores on a enthusiast PC years ago. And CMT 8 core Bulldozer would have been more competitive against Sandy Bridge. A 10% advantage in scalar advantage is probably equal to 40-50% multi-thread and/or vector advantage, or even more. A big dGPU from 8 years ago can brute force beat modern iGPUs in any performance metric, whereas a top of the line 125W CPU from few years ago will hardly beat a 10W E core in some applications.

I wish they would, because the x86 world is just sad.

But it seems the expectation by the general public that ARM ISA has insurmountable advantage of x86 is slowly and steadily becoming true, even though if everything was ISO with all teams executing, I don't think the differences are anywhere that close. A Micron presentation said DRAM is king because it has all the manpower and manhours, all the research going into it like a freight train and that's why all so-called "DRAM-killers" die. ARM research, development, and the aspiring new engineers are far more than the x86 world. It's a self-fulfilling prophecy. You think you suck, so you indeed become the suck, not because you actually suck. What the x86 world needed was a truly open system, so a 3rd, or a 4th player can come in an upend imaginary AMD/Intel competition.

Did you see the latest Level 2 Jeff? If arm has all these advantages it wouldn't preform that bad. Or sad. Strong memory order and efficient cache snooping is something arm is never going to have. So you have to either rewrite all code or gimp it down. This general public is skeptical.

Sidenote: Bulldozer actually aged well and closed the gap to sandy bridge. Too bad it lasted for 10 years without improving.

poke01 · Nov 17, 2025

Schmide said:
Did you see the latest Level 2 Jeff? If arm has all these advantages it wouldn't preform that bad. Or sad. Strong memory order and efficient cache snooping is something arm is never going to have. So you have to either rewrite all code or gimp it down. This general public is skeptical.

Sidenote: Bulldozer actually aged well and closed the gap to sandy bridge. Too bad it lasted for 10 years without improving.

That arm chip tested by Jeff doesn’t represent mainstream arm cores made by ARM, Qualcomm or Apple.

It’s like comparing some Zhaoxin x86 chip to AMDs Zen5 and saying x86 as a whole is bad because the Zhaozin performed poorly.

Schmide · Nov 17, 2025

poke01 said:
That arm chip tested by Jeff doesn’t represent mainstream arm cores made by ARM, Qualcomm or Apple.

It’s like comparing some Zhaoxin x86 chip to AMDs Zen5 and saying x86 as a whole is bad because the Zhaozin performed poorly.

It's not just that chip. We've been fighting to get a GPU working on arm for I would say 7 years. We barely have it working and it still sucks.

Show me an arm platform that runs any GPU well other than apple and an imagination derivatives?

poke01 · Nov 17, 2025

Schmide said:
Show me an arm platform that runs any GPU well other than apple and an imagination derivatives?

well I would’ve said Nvidia N1X but Nvidia isn’t capable enough to ship on time.

So yeah you are right, other than Apple there is no competent GPU on ARM.

( some will say Qualcomm has good GPUs and to that I say uh no. Qualcomm doesn’t have good GPUs for compute or AAA gaming)

johnsonwax · Nov 17, 2025

Schmide said:
It's not just that chip. We've been fighting to get a GPU working on arm for I would say 7 years. We barely have it working and it still sucks.

Show me an arm platform that runs any GPU well other than apple and an imagination derivatives?

I mean, technically Nvidia Grace, but not in the way you mean. There's no reason why Nvidia can't make that product.

Question x86 and ARM architectures comparison thread.

Diamond Member

Senior member

Diamond Member

Member

Platinum Member

Member

Platinum Member

Diamond Member

Diamond Member

Senior member

Platinum Member

Golden Member

Platinum Member

Diamond Member

Diamond Member

Platinum Member

Senior member

Diamond Member

Diamond Member

Senior member

Diamond Member

Diamond Member

Diamond Member

Diamond Member

Senior member