Question x86 and ARM architectures comparison thread.

Geddagod

Golden Member
Dec 28, 2021
1,491
1,579
106
Correct me if I'm wrong, but the X925 has a 6 x 128-bit FPU, Oryon and M4 have 4 x 128-bit FPUs, LNC has a 4 x 256-bit FPU, and Zen 5 has a 4 x 512-bit FPU?
 

Jan Olšan

Senior member
Jan 12, 2017
558
1,101
136
Note that these numbers don't nearly tell the whole story.
Usually the pipes or units are differentiated. For example, all pipes can do 1 SIMD integer add per cycle, but only some can do integer multiplication. Floating-point ops quite often aren't available on all pipes. Shuffle ops generally have lower than maximum throughput: good CPUs can do them every cycle, but there may only be two pipes capable of them.
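
To make the point about differentiated pipes concrete, here is a minimal sketch in Python. The core it describes is entirely hypothetical: the pipe names (`V0` to `V3`) and their capability sets are invented for illustration and do not describe any of the CPUs mentioned in this thread. It also assumes every pipe can start one op per cycle.

```python
# Hypothetical 4-pipe SIMD core. Pipe names and capability sets are
# made up for illustration; they do not describe any real CPU.
PIPES = {
    "V0": {"int_add", "int_mul", "fp_fma", "shuffle"},
    "V1": {"int_add", "fp_fma", "shuffle"},
    "V2": {"int_add", "fp_fma"},
    "V3": {"int_add"},
}


def peak_throughput(op: str) -> int:
    """Upper bound on ops of one type started per cycle.

    Assumes each pipe can start one op per cycle, so the bound is
    simply the number of pipes that support the op.
    """
    return sum(op in caps for caps in PIPES.values())


if __name__ == "__main__":
    for op in ("int_add", "int_mul", "fp_fma", "shuffle"):
        print(f"{op:8s} {peak_throughput(op)} per cycle")
```

On this made-up layout a "4 x N-bit" headline figure would only be reached by integer adds; multiplies, FMAs and shuffles would each top out lower.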

Sometimes it gets complicated, like when Intel's AVX-512 client cores had "half-speed AVX-512". It was way more complex than that, since the integer ops were actually full-speed and only the floating-point ones were not, and the unit layout was quite complex, IIRC with three 256-bit floating-point units (that were 512-bit for integer!). But on the server version of the core, it was not as simple as the units being extended to 512 bits for floating point. No, it worked differently: two of the 256-bit pipes coupled into one 512-bit FMA pipe, and the third pipe received an extra dedicated, fully 512-bit FMA unit (tacked on as an additional block on the side of the original floorplan) to reach 2x 512-bit FMA pipes.

Ideally, you want an instruction table that tells you how many ops of each type can be executed per cycle (throughput) as well as the latency (the delay before the result is available, due to pipelining, which can also be important!).
You can have cases like a core having seemingly beefy SIMD units, but then you find that shuffle ops have poor throughput with multi-cycle latency, for example. Some code would not mind, but some would.
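
A rough way to see why both columns of such a table matter is the sketch below. The table entries are hypothetical numbers chosen for illustration, not measurements of any real core, and the model is deliberately crude: it ignores the front end, register pressure, memory and everything else outside the execution pipes.

```python
import math

# Hypothetical instruction table: op -> (ops started per cycle, latency in cycles).
# The numbers are invented for illustration, not taken from any real core.
TABLE = {
    "fp_fma": (4, 4),
    "shuffle": (1, 3),
}


def cycles_independent(op: str, n: int) -> int:
    """Cycles for n >= 1 independent ops: throughput-bound.

    The last op starts in cycle ceil(n / per_cycle) - 1 and its result
    is available `latency` cycles later.
    """
    per_cycle, latency = TABLE[op]
    return math.ceil(n / per_cycle) - 1 + latency


def cycles_dependent(op: str, n: int) -> int:
    """Cycles for a chain of n ops, each consuming the previous result: latency-bound."""
    _, latency = TABLE[op]
    return n * latency


if __name__ == "__main__":
    for op in TABLE:
        print(op, cycles_independent(op, 64), cycles_dependent(op, 64))
```

With these assumed numbers, 64 independent FMAs finish in 19 cycles while 64 independent shuffles need 66, and a dependent chain ignores throughput entirely and pays the full latency on every op. That is the sense in which some code would not mind a weak shuffle unit and some would.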