Question: x86 and ARM architectures comparison thread

Page 16

Geddagod

Golden Member
Dec 28, 2021
1,491
1,586
106
Correct me if I'm wrong, but the x925 has a 6 x 128 bit FPU, Oryon and M4 have a 4 x 128 bit FPU, LNC has a 4 x 256 bit FPU, Zen 5 a 4 x 512 bit FPU?
 

Jan Olšan

Senior member
Jan 12, 2017
558
1,105
136
Note that these numbers don't tell nearly the whole story.
The pipes or units are usually differentiated. For example, all of them might do one SIMD integer add per cycle, but only some can do integer multiplication, and floating-point ops quite often aren't available on all pipes either. Shuffle ops generally have less than maximum throughput: good CPUs can issue one every cycle, but there may be only two pipes capable of them.

Sometimes it gets complicated, like when Intel's AVX-512 client cores had "half-speed AVX-512". It was way more complex than that, since the integer ops were actually full speed; only the floating-point ones were not, and the unit layout was quite complex, IIRC with three 256-bit floating-point units (that were 512-bit for integer!). On the server version of the core, it was not as simple as extending the units to 512-bit for floating point either. It worked differently: two of the 256-bit pipes coupled into one 512-bit FMA pipe, and the third pipe received an extra dedicated, fully 512-bit FMA unit (tacked on as an additional block on the side of the original floorplan) to reach 2x 512-bit FMA pipes.

Basically, what you ideally want is an instruction table that tells you how many ops of each type can be executed per cycle (throughput) as well as the latency (the delay before the result is available, due to pipelining - this can be important too!).
You can have cases where a core has seemingly beefy SIMD units, but then you find that shuffle ops have poor throughput and multi-cycle latency, for example. Some code wouldn't mind, but some would.
 

DavidC1

Golden Member
Dec 29, 2023
1,733
2,811
96
Correct me if I'm wrong, but the x925 has a 6 x 128 bit FPU, Oryon and M4 have a 4 x 128 bit FPU, LNC has a 4 x 256 bit FPU, Zen 5 a 4 x 512 bit FPU?
From a user perspective ARM's way is superior because you don't need recompiling of any kind, and thankfully that's what was done with Skymont, hence the impressive 70% gain. I hope the E-core team sticks with 256-bit double-pumped for AVX10.2. By adding more units you benefit every application from the 1990s to now, whereas relying on recompiling runs into substantial diminishing returns in adoption. Not even the original AVX, never mind AVX2 and AVX-512, is universally adopted.

LNC doesn't really have a quad FPU in the same fashion as Skymont and the ARM chips either. Skymont is 4x 128-bit FMA, while LNC is 2x 256-bit FADD + 2x 256-bit FMA. A true 4x 256-bit would perform much better. That's why Skymont gets a 30% FP gain on top of a 30% integer gain, while Lion Cove manages a miserable ~10% for both Int and FP.

You can see from the Turin tests that most of the AVX-512 gain comes from the ISA, not from the move from 256 to 512 bits.
When taking the geometric mean of all the raw AVX-512 performance benchmarks, AVX-512 in the default FP512 configuration yielded 1.45x the performance compared to disabling AVX-512 outright. Having the 512-bit data path allowed for 1.12x the performance compared to running the EPYC 9755 processor in the 256-bit data path mode, similar to how AVX-512 operates with Zen 4.
 

Quintessa

Member
Jun 23, 2025
57
34
46
Correct me if I'm wrong, but the x925 has a 6 x 128 bit FPU, Oryon and M4 have a 4 x 128 bit FPU, LNC has a 4 x 256 bit FPU, Zen 5 a 4 x 512 bit FPU?
Mostly right, with a couple of refinements.
  • Cortex-X925: 6×128-bit NEON/SVE2 pipes, so the 6×128 figure checks out.
  • Oryon (Snapdragon X Elite) and Apple M4: 4×128-bit FP/SIMD units per core.
  • Lion Cove (LNC): 4×256-bit pipes, but split 2× FADD + 2× FMA as noted above. No AVX-512 on client.
  • Zen 5: desktop parts have a full 512-bit datapath (4 pipes); the mobile variant runs a 256-bit datapath and double-pumps AVX-512 instructions, similar to how Zen 4 handles them.
If you literally mean physical FPU width, none of these ARM cores reaches 256-bit yet; they just duplicate 128-bit lanes. Zen 5 and the AVX-512-capable Xeons are the only ones here with 512-bit ISA exposure.
 