Question x86 and ARM architectures comparison thread.

poke01 · Aug 11, 2025

Jan Olšan said:
Is that so?

Metal is quite good, especially Metal 3.0/4.0, and less complex than Vulkan. It’s just disappointing that it’s stuck on Apple platforms.

Geddagod · Aug 13, 2025

Correct me if I'm wrong, but the x925 has a 6 x 128 bit FPU, Oryon and M4 have a 4 x 128 bit FPU, LNC has a 4 x 256 bit FPU, Zen 5 a 4 x 512 bit FPU?

Jan Olšan · Aug 13, 2025

Note that these numbers don't nearly tell the whole story.
Usually the pipes or units are differentiated. For example, all can do 1 SIMD integer add per cycle, but only some can do integer multiplication. Floating-point ops quite often aren't available on all pipes. Shuffle ops generally have lower than maximum throughput. Good CPUs can do them every cycle, but there may be only two pipes capable of them.

Sometimes it gets complicated, like when Intel AVX-512 client cores had "half speed AVX-512". It was way more complex than that, since the integer ops were actually full-speed, just the floating-point ones were not, and the unit layout was quite complex, IIRC with three 256bit floating-point units (that were 512 bit for integer!). But on the server version of the core, it was not as simple as the units being extended to 512bit for floating point. No, it worked differently: two of the 256bit pipes were coupled into one 512bit FMA pipe, and the third pipe received an extra dedicated fully-512bit FMA unit (that was tacked on as an additional block on the side of the original floorplan) to reach 2x 512bit FMA pipes.

Basically, ideally you want an instruction table that tells you how many ops of each type can be executed per cycle (throughput) as well as latency (the delay before the result is available, due to pipelining - can also be important!).
You can have cases like a core having seemingly beefy SIMD units, but then you find that shuffle ops have poor throughput with multi-cycle latency, for example. Some code would not mind, but some would.
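
A minimal sketch of the kind of instruction table described above, and of how throughput and latency give two different lower bounds on cycle count. The op names and every number here are made up for illustration and are not measurements of any real core. It also assumes, for simplicity, that each op type has its own pipes, which real cores often do not.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class OpInfo:
    per_cycle: float  # throughput: how many ops of this type can start per cycle
    latency: int      # cycles until the result is available


# Hypothetical core: plenty of FMA pipes, but a single shuffle pipe.
TABLE = {
    "fma": OpInfo(per_cycle=4, latency=4),
    "shuffle": OpInfo(per_cycle=1, latency=3),
}


def throughput_bound(op_counts, table):
    """Lower bound on cycles when all ops are independent of each other.

    Assumes op types do not compete for the same pipes (a simplification).
    """
    return max(count / table[op].per_cycle for op, count in op_counts.items())


def latency_bound(chain, table):
    """Lower bound on cycles when each op must wait for the previous result."""
    return sum(table[op].latency for op in chain)


# Eight independent FMAs and eight independent shuffles:
# the single shuffle pipe is the limit, not the four FMA pipes.
print(throughput_bound({"fma": 8, "shuffle": 8}, TABLE))  # 8.0

# A dependent chain shuffle -> fma, repeated four times:
# throughput is irrelevant here, latency adds up.
print(latency_bound(["shuffle", "fma"] * 4, TABLE))  # 28
```

In this toy model, code dominated by independent FMAs would not notice the weak shuffle pipe, while shuffle-heavy code would be limited by it entirely.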

DavidC1 · Aug 13, 2025

Geddagod said:
Correct me if I'm wrong, but the x925 has a 6 x 128 bit FPU, Oryon and M4 have a 4 x 128 bit FPU, LNC has a 4 x 256 bit FPU, Zen 5 a 4 x 512 bit FPU?

From a user perspective ARM's way is superior because you don't need recompiling of any kind, and thankfully that's what was done with Skymont, hence the impressive 70% gain. I hope the E core team sticks with 256-bit double-pumped for AVX10.2. By adding more units you benefit every application from 1990's to now, whereas recompiling results in substantial diminishing returns regarding adoption. Even original AVX, never mind AVX2 and AVX-512, isn't universally adopted.

LNC doesn't really have a quad FPU similar in fashion to Skymont and ARM chips either. Skymont is 4x 128-bit FMA, while LNC is 2x 256-bit FADD + 2x 256-bit FMA. A true 4x 256 would perform much better. That is why Skymont gets a 30% FP gain on top of a 30% Integer gain, while on Lion Cove we remain at a miserable roughly 10% for both Int and FP.

You can see from Turin tests that most of the AVX-512 gain is from the ISA not the move from 256 to 512 bits.

AVX-512 Performance With 256-bit vs. 512-bit Data Path For AMD EPYC 9005 CPUs Review - Phoronix

www.phoronix.com

When taking the geometric mean of all the raw AVX-512 performance benchmarks, AVX-512 in the default FP512 configuration yielded 1.45x the performance compared to disabling AVX-512 outright. Having the 512-bit data path allowed for 1.12x the performance compared to running the EPYC 9755 processor in the 256-bit data path mode, similar to how AVX-512 operates with Zen 4.
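
A quick sketch of the arithmetic behind that reading of the Phoronix numbers. It assumes both geometric means were taken over the same set of benchmarks, so the two ratios can simply be divided; the variable names are mine.

```python
# Figures quoted from the Phoronix geometric means above.
avx512_512b_vs_off = 1.45    # AVX-512 with 512-bit data path vs. AVX-512 disabled
datapath_512_vs_256 = 1.12   # 512-bit data path vs. 256-bit data path mode

# Implied gain of AVX-512 on a 256-bit data path vs. AVX-512 disabled,
# i.e. the part attributable to the ISA rather than to the wider units.
avx512_256b_vs_off = avx512_512b_vs_off / datapath_512_vs_256

print(round(avx512_256b_vs_off, 2))  # 1.29
```

So under that assumption roughly 1.29x comes from the ISA alone and a further 1.12x from the 512-bit data path.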

adroc_thurston · Aug 13, 2025

DavidC1 said:
By adding more units you benefit every application from 1990's to now

Did you actually try talking to SIMD people?

Quintessa · Aug 14, 2025

Geddagod said:
Correct me if I'm wrong, but the x925 has a 6 x 128 bit FPU, Oryon and M4 have a 4 x 128 bit FPU, LNC has a 4 x 256 bit FPU, Zen 5 a 4 x 512 bit FPU?

Not quite right on a couple of points.

Snapdragon X Elite w/ Oryon: 4×128-bit NEON/SVE-like FP pipes per core. The x925 is in that family, not 6×128.
Cortex-X4 / M4: also 4×128-bit FP units per core.
Lunar Lake (LNC): 4×256-bit AVX2/AVX-VNNI FP pipes. No AVX-512.
Zen 5: 4×256-bit SIMD pipes per core, but supports AVX-512 by fusing 2×256 per instruction (so effectively 2×512 at half rate).

If you literally mean physical FPU width, none of these ARM cores hit 256-bit yet; they just duplicate 128-bit lanes. Zen 5 and Sapphire Rapids-class Xeons are the only ones here with 512-bit ISA exposure.

Thunder 57 · Aug 14, 2025

adroc_thurston said:
Did you actually try talking to SIMD people?

OMG. What this time? You really need to shut up.

Jan Olšan · Aug 14, 2025

Quintessa said:
Not quite right on a couple of points.

Snapdragon X Elite w/ Oryon: 4×128-bit NEON/SVE-like FP pipes per core. The x925 is in that family, not 6×128.

Cortex-X4 / M4: also 4×128-bit FP units per core.

Lunar Lake (LNC): 4×256-bit AVX2/AVX-VNNI FP pipes. No AVX-512.

Zen 5: 4×256-bit SIMD pipes per core, but supports AVX-512 by fusing 2×256 per instruction (so effectively 2×512 at half rate).

If you literally mean physical FPU width, none of these ARM cores hit 256-bit yet; they just duplicate 128-bit lanes. Zen 5 and Sapphire Rapids-class Xeons are the only ones here with 512-bit ISA exposure.

You should recheck, X925 was definitely advertised as 6x128 at launch, other info (Zen 5) also looks wrong. Oryon doesn't support SVE.

igor_kavinski · Aug 14, 2025

Thunder 57 said:
OMG. What this time? You really need to shut up.

He will suggest chemically induced meditation

MS_AT · Aug 14, 2025

Quintessa said:
The x925 is in that family, not 6×128

That is wrong. Its oversized SIMD unit is one of its selling points. https://developer.arm.com/documentation/109842/500?lang=en Hopefully that's sufficient as proof, unless you doubt ARM itself.

Quintessa said:
Zen 5: 4×256-bit SIMD pipes per core, but supports AVX-512 by fusing 2×256 per instruction (so effectively 2×512 at half rate).

That is half wrong. The Zen4 layout applies only to the Strix Point and Krackan families. Everything else is using 512b execution units. Proof of that is so widely available that I leave it to the reader to confirm.

DavidC1 said:
Skymont, hence the impressive 70% gain.

DavidC1 said:
Skymont is 4x 128-bit FMA

Are you able to prove that with existing code? I mean that the 4 FMA units give Skymont any advantage over Zen4 or Zen5 when running 128b code. In C&C SPEC FP testing that does not use afaik anything above SSE, Skymont is losing. I mean I would expect that if symmetrical FPU pipes were such a big benefit then it would show somehow, as improving FP performance over Gracemont is not a noteworthy achievement seeing how much the core grew (60% ROB, 30% FP reg file, lower FP ops latencies) and how bad Gracemont was in that regard in the first place. I mean sure, if I write dedicated code for Skymont I can give it the advantage, but I wonder if there is code out there where you can pinpoint this to be an advantage.

511 · Aug 14, 2025

MS_AT said:
Are you able to prove that with existing code? I mean that the 4 FMA units give Skymont any advantage over Zen4 or Zen5 when running 128b code. In C&C SPEC testing that does not use afaik anything above SSE, Skymont is losing. I mean I would expect that if symmetrical FPU pipes were such a big benefit then it would show somehow, as improving FP performance over Gracemont is not a noteworthy achievement seeing how much the core grew (60% ROB, 30% FP reg file, lower FP ops latencies) and how bad Gracemont was in that regard in the first place. I mean sure, if I write dedicated code for Skymont I can give it the advantage, but I wonder if there is code out there where you can pinpoint this to be an advantage.

what about SpecFP?

MS_AT · Aug 14, 2025

511 said:
what about SpecFP?

Let me quote myself

MS_AT said:
In C&C SPEC testing that does not use afaik anything above SSE, Skymont is losing.

I have not specified FP, but I thought FP was given from the context. I will correct the original message for clarity.

511 · Aug 14, 2025

FP and SIMD are two different things 😅

MS_AT · Aug 14, 2025

511 said:
FP and SIMD are two different things 😅

True. But what about it? SIMD units on x64 are executing: INT SIMD, FP Scalar and FP SIMD. We have left out INT from the discussion. But if FP Scalar showed improvements, then FP SIMD at 128b would show improvements too, and vice versa. Latencies and throughput for math operations are the same regardless of register width (unless you are doing the "double pumping scheme", which does not apply for <=128b).
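
A toy model of the "double pumping" point, as a sketch only: it assumes an instruction wider than the execution unit is split into equal unit-width micro-ops, and that nothing else limits issue. Real cores can differ. The function names are hypothetical.

```python
import math


def uops_per_instruction(vector_bits, unit_bits):
    """Micro-ops needed to execute one vector instruction on a unit of the given width."""
    return max(1, math.ceil(vector_bits / unit_bits))


def instructions_per_cycle(pipes, vector_bits, unit_bits):
    """Peak vector instructions per cycle under this simplified model."""
    return pipes / uops_per_instruction(vector_bits, unit_bits)


# With 4 pipes of 128 bits each (the Skymont layout as described in this thread):
print(instructions_per_cycle(4, 64, 128))   # 4.0  scalar FP: one uop
print(instructions_per_cycle(4, 128, 128))  # 4.0  128b SIMD: one uop, same rate as scalar
print(instructions_per_cycle(4, 256, 128))  # 2.0  256b SIMD: double pumped
```

In this model scalar and 128b ops behave identically, and only ops wider than the unit pay the double-pumping cost.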

DavidC1 · Aug 14, 2025

MS_AT said:
Are you able to prove that with existing code? I mean that the 4 FMA units give Skymont any advantage over Zen4 or Zen5 when running 128b code. In C&C SPEC FP testing that does not use afaik anything above SSE, Skymont is losing. I mean I would expect that if symmetrical FPU pipes were such a big benefit then it would show somehow, as improving FP performance over Gracemont is not a noteworthy achievement seeing how much the core grew (60% ROB, 30% FP reg file, lower FP ops latencies) and how bad Gracemont was in that regard in the first place. I mean sure, if I write dedicated code for Skymont I can give it the advantage, but I wonder if there is code out there where you can pinpoint this to be an advantage.

You know Int gains are 30% and FP is 60-70%, right? Greater FP gains to that degree don't happen without the extra units.

I'd like to know which C&C article you are talking about.

mikegg · Aug 15, 2025

AMD EPYC 9965 "Turin Dense" Delivers Better Performance/Power Efficiency vs. AmpereOne 192-Core ARM CPU Review - Phoronix

www.phoronix.com

Ampere MSRP $5.5K vs $15K for the 192 core EPYC. With 1.6x worse performance at 1.2x better energy consumption. Looks like a reasonable option.

In terms of actual $/perf, Ampere 192 core is 1.7x better than Turin Dense 192 core based on Phoronix's review.
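
For reference, a sketch of how the 1.7x figure falls out of the numbers above. It takes the list prices and the 1.6x performance gap at face value; the variable names are mine.

```python
ampere_price = 5_500        # AmpereOne 192 core, list price in USD
epyc_price = 15_000         # 192 core EPYC, list price in USD
epyc_perf_advantage = 1.6   # EPYC performance relative to AmpereOne

# Performance per dollar, normalising AmpereOne performance to 1.0.
ampere_perf_per_dollar = 1.0 / ampere_price
epyc_perf_per_dollar = epyc_perf_advantage / epyc_price

ratio = ampere_perf_per_dollar / epyc_perf_per_dollar
print(round(ratio, 2))  # 1.7
```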

So for $5.5k, you can either buy an AmpereOne 192 core CPU (274w) or a Turin Dense 48 core CPU (300w).

AmpereOne was severely delayed. It's an N5 part that really should have been competing against Zen4 only. They have a 256 core, 3nm, 12 memory channel part shipping next year that is likely to better challenge Turin Dense and Sierra Forest in terms of raw performance.

adroc_thurston · Aug 15, 2025

mikegg said:
Ampere MSRP $5.5K vs $15K for the 192 core EPYC.

are you seriously going off list prices for server CPUs?

511 · Aug 15, 2025

Server providers buy at special discounted prices.

mikegg · Aug 15, 2025

adroc_thurston said:
are you seriously going off list prices for server CPUs?

Yes. Do you have private prices for Ampere and Epyc?

igor_kavinski · Aug 15, 2025

mikegg said:
So for $5.5k, you can either buy an AmpereOne 192 core CPU (274w) or a Turin Dense 48 core CPU (300w).

Another initial challenge with AmpereOne either due to the CPU itself or the Supermicro server is never being able to idle under 100 Watt power draw for the CPU.

And lose those savings over the life of the server in idle power costs.

mikegg · Aug 15, 2025

igor_kavinski said:
And lose those savings over the life of the server in idle power costs.

Won't matter since server CPUs should always be working.

511 · Aug 15, 2025

igor_kavinski said:
And lose those savings over the life of the server in idle power costs.

if server cpus are sitting idle they are losing money

MS_AT · Aug 15, 2025

DavidC1 said:
Greater FP gains to that degree don't happen without the extra units.

You misread my comment. I have not claimed extra units are a waste. I asked whether you can observe from existing benchmarks that a homogeneous pipe setup (all pipes capable of multiplication) gives you any advantage over a heterogeneous pipe setup (only some pipes capable of doing some ops) in existing code bases, as that was your claim if I understood correctly.

I guess the main driver for having all pipes capable of basic FP arithmetic was Skymont performance with AVX2, as this is the baseline the compiler will optimise for. Lion Cove, having native 256b units, did not need this.

DavidC1 said:
I'd like to know which C&C article you are talking about.

Both Skymont articles, https://chipsandcheese.com/p/skymont-in-desktop-form-atom-unleashed and https://chipsandcheese.com/p/skymont-intels-e-cores-reach-for-the-sky

mvprod123 · Aug 15, 2025

511 · Aug 15, 2025

Lol the contamination zones in LNL hampered the single core test
