Metal is quite good, especially Metal 3.0/4.0, and less complex than Vulkan. It's just disappointing that it's stuck on Apple platforms.
Is that so?
"Correct me if I'm wrong, but the x925 has a 6 x 128 bit FPU, Oryon and M4 have a 4 x 128 bit FPU, LNC has a 4 x 256 bit FPU, Zen 5 a 4 x 512 bit FPU?"
From a user perspective, ARM's way is superior because you don't need to recompile anything, and thankfully that's what was done with Skymont, hence the impressive 70% gain. I hope the E-core team sticks with double-pumped 256-bit for AVX10.2. By adding more units you benefit every application from the 1990s to now, whereas recompiling runs into substantial diminishing returns on adoption. Not even the original AVX, never mind AVX2 and AVX-512, is universally adopted.
When taking the geometric mean of all the raw AVX-512 performance benchmarks, AVX-512 in the default FP512 configuration yielded 1.45x the performance compared to disabling AVX-512 outright. Having the 512-bit data path allowed for 1.12x the performance compared to running the EPYC 9755 processor in the 256-bit data path mode, similar to how AVX-512 operates with Zen 4.
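For anyone wondering how a figure like that 1.45x gets aggregated, here is a minimal sketch of a geometric mean over per-benchmark speedups. The ratios in it are made-up placeholders, not the data behind the quoted numbers.

```python
import math

def geometric_mean(ratios):
    """Geometric mean of positive ratios, computed in log space."""
    if not ratios:
        raise ValueError("need at least one ratio")
    if any(r <= 0 for r in ratios):
        raise ValueError("ratios must be positive")
    return math.exp(sum(math.log(r) for r in ratios) / len(ratios))

# Hypothetical per-benchmark speedups (AVX-512 on vs. off); illustrative only.
speedups = [1.0, 2.0, 4.0]
print(round(geometric_mean(speedups), 3))      # geometric mean
print(round(sum(speedups) / len(speedups), 3)) # arithmetic mean, for contrast
```

The geometric mean is the usual choice for ratios because a single outlier benchmark pulls it up less than it would an arithmetic mean.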
"By adding more units you benefit every application from 1990's to now"
Did you actually try talking to SIMD people?
"Correct me if I'm wrong, but the x925 has a 6 x 128 bit FPU, Oryon and M4 have a 4 x 128 bit FPU, LNC has a 4 x 256 bit FPU, Zen 5 a 4 x 512 bit FPU?"
Not quite right on a couple of points.
Did you actually try talking to SIMD people?
"Not quite right on a couple of points."
You should recheck: X925 was definitely advertised as 6x128 at launch, and the other info (Zen 5) also looks wrong. Oryon doesn't support SVE.
If you literally mean physical FPU width, none of these ARM cores hit 256-bit yet; they just duplicate 128-bit lanes. Zen 5 and Sapphire Rapids-class Xeons are the only ones here with 512-bit ISA exposure.
- Snapdragon X Elite w/ Oryon: 4×128-bit NEON/SVE-like FP pipes per core. The x925 is in that family, not 6×128.
- Cortex-X4 / M4: also 4×128-bit FP units per core.
- Lunar Lake (LNC): 4×256-bit AVX2/AVX-VNNI FP pipes. No AVX-512.
- Zen 5: 4×256-bit SIMD pipes per core, but supports AVX-512 by fusing 2×256 per instruction (so effectively 2×512 at half rate).
"OMG. What this time? You really need to shut up."
He will suggest chemically induced meditation
"The x925 is in that family, not 6×128"
That is wrong. Its oversized SIMD unit is one of its selling points. https://developer.arm.com/documentation/109842/500?lang=en Hopefully that's sufficient proof, unless you doubt ARM itself.
"Zen 5: 4×256-bit SIMD pipes per core, but supports AVX-512 by fusing 2×256 per instruction (so effectively 2×512 at half rate)."
That is half wrong. The Zen 4 layout applies only to the Strix Point and Krackan families. Everything else uses 512-bit execution units. Proof of that is so widely available that I leave it to the reader to confirm.
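To put the disputed pipe counts side by side: peak FP32 lanes per cycle is just pipes × width / 32. A rough sketch follows. The configurations are the claims made in this thread, not verified specs, and the labels are only for illustration. It also ignores which pipes can actually do FMA, so treat the result as an upper bound on lanes, not FLOPs.

```python
def fp32_lanes_per_cycle(pipes, width_bits):
    """Upper bound on 32-bit lanes processed per cycle across all SIMD pipes."""
    return pipes * width_bits // 32

# (pipes, width in bits) as claimed by posters in this thread; unverified.
claims = {
    "X925 (6x128)": (6, 128),
    "Oryon / M4 (4x128)": (4, 128),
    "LNC (4x256)": (4, 256),
    "Zen 5, full-width claim (4x512)": (4, 512),
    "Zen 5, Zen 4-style layout claim (4x256)": (4, 256),
}

for name, (pipes, width) in claims.items():
    print(f"{name}: {fp32_lanes_per_cycle(pipes, width)} FP32 lanes/cycle")
```

Note that 6x128 and 4x256 differ by only 24 vs. 32 lanes, but the narrower-and-more layout reaches its peak with plain 128-bit code, which is the point being argued above.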
Skymont, hence the impressive 70% gain.
"Skymont is 4x 128-bit FMA"
Are you able to prove that with existing code? I mean that the 4 FMA units give Skymont any advantage over Zen 4 or Zen 5 when running 128-bit code. In C&C's SPEC FP testing, which AFAIK does not use anything above SSE, Skymont is losing. I would expect that if symmetrical FPU pipes were such a big benefit, it would show somehow, as improving FP performance over Gracemont is not a noteworthy achievement given how much the core grew (60% ROB, 30% FP register file, lower FP op latencies) and how bad Gracemont was in that regard in the first place. Sure, if I write dedicated code for Skymont I can give it the advantage, but I wonder if there is code out there where you can pinpoint this as an advantage.
"Are you able to prove that with existing code? I mean that the 4 FMA units give Skymont any advantage over Zen 4 or Zen 5 when running 128-bit code. In C&C's SPEC testing, which AFAIK does not use anything above SSE, Skymont is losing. I would expect that if symmetrical FPU pipes were such a big benefit, it would show somehow, as improving FP performance over Gracemont is not a noteworthy achievement given how much the core grew (60% ROB, 30% FP register file, lower FP op latencies) and how bad Gracemont was in that regard in the first place. Sure, if I write dedicated code for Skymont I can give it the advantage, but I wonder if there is code out there where you can pinpoint this as an advantage."
What about SpecFP?
"what about SpecFP?"
Let me quote myself
"In C&C SPEC testing that does not use afaik anything above SSE, Skymont is losing."
I did not specify FP, but I thought FP was a given from the context. I will correct the original message for clarity.
"FP and SIMD are two different things 😅"
True. But what about it? SIMD units on x64 execute INT SIMD, FP scalar, and FP SIMD. We have left INT out of the discussion. But if FP scalar showed improvements, then FP SIMD at 128-bit would show improvements too, and vice versa. Latencies and throughput for math operations are the same regardless of register width (unless you are doing the "double pumping" scheme, which does not apply for <=128-bit).
"Are you able to prove that with existing code? I mean that the 4 FMA units give Skymont any advantage over Zen 4 or Zen 5 when running 128-bit code. In C&C's SPEC FP testing, which AFAIK does not use anything above SSE, Skymont is losing. I would expect that if symmetrical FPU pipes were such a big benefit, it would show somehow, as improving FP performance over Gracemont is not a noteworthy achievement given how much the core grew (60% ROB, 30% FP register file, lower FP op latencies) and how bad Gracemont was in that regard in the first place. Sure, if I write dedicated code for Skymont I can give it the advantage, but I wonder if there is code out there where you can pinpoint this as an advantage."
You know INT gains are 30% and FP is 60-70%, right? FP gains that much greater don't happen without the extra units.
"Ampere MSRP $5.5K vs $15K for the 192 core EPYC."
Are you seriously going off list prices for server CPUs?
"are you seriously going off list prices for server CPUs?"
Yes. Do you have private prices for Ampere and Epyc?
So for $5.5K, you can either buy an AmpereOne 192-core CPU (274 W) or a Turin Dense 48-core CPU (300 W).
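The list-price argument boils down to dollars and watts per core, which can be sketched as below. The prices, core counts, and wattages are the ones quoted in this thread (list prices, as already pointed out), and the dictionary labels are just illustrative names. No wattage was given for the 192-core EPYC, so it is left out.

```python
def per_core(total, cores):
    """Divide a per-socket figure (price or power) evenly across cores."""
    return total / cores

# Figures as quoted in the thread; "watts" is None where none was given.
cpus = {
    "AmpereOne 192c": {"price_usd": 5500, "cores": 192, "watts": 274},
    "EPYC 192c": {"price_usd": 15000, "cores": 192, "watts": None},
    "Turin Dense 48c": {"price_usd": 5500, "cores": 48, "watts": 300},
}

for name, c in cpus.items():
    line = f"{name}: ${per_core(c['price_usd'], c['cores']):.2f}/core"
    if c["watts"] is not None:
        line += f", {per_core(c['watts'], c['cores']):.2f} W/core"
    print(line)
```

This says nothing about per-core performance, which is the obvious missing piece when comparing these parts.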
"Another initial challenge with AmpereOne either due to the CPU itself or the Supermicro server is never being able to idle under 100 Watt power draw for the CPU."
And lose those savings over the life of the server in idle power costs.
"And lose those savings over the life of the server in idle power costs."
Won't matter, since server CPUs should always be working.
"And lose those savings over the life of the server in idle power costs."
If server CPUs are sitting idle, they are losing money.
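Whether idle power eats the purchase savings is easy to ballpark. A rough sketch follows, where the 100 W floor comes from the quote above but the electricity price, lifetime, and idle fraction are pure assumptions, and cooling overhead is ignored.

```python
def idle_energy_cost(idle_watts, years, usd_per_kwh, idle_fraction=1.0):
    """Cost of power drawn while idle over the given number of years."""
    hours_idle = 24 * 365 * years * idle_fraction
    return idle_watts / 1000 * hours_idle * usd_per_kwh

# 100 W idle floor (from the thread); $0.10/kWh and 5 years are assumptions.
always_idle = idle_energy_cost(100, years=5, usd_per_kwh=0.10)
half_idle = idle_energy_cost(100, years=5, usd_per_kwh=0.10, idle_fraction=0.5)
print(f"idle 100% of the time: ${always_idle:.0f}")
print(f"idle 50% of the time:  ${half_idle:.0f}")
```

Plug in your own rates and the idle draw of whatever you are comparing against. Only the difference in idle power between the two systems matters for the argument.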
"Greater FP gains to that degree doesn't happen without the extra units."
You misread my comment. I have not claimed the extra units are a waste. I asked whether you can observe from existing benchmarks that a homogeneous pipe setup (all pipes capable of multiplication) gives any advantage over a heterogeneous pipe setup (only some pipes capable of some ops) in existing code bases, as that was your claim if I understood correctly.
"I'd like to know which C&C article you are talking about."
Both Skymont articles: https://chipsandcheese.com/p/skymont-in-desktop-form-atom-unleashed and https://chipsandcheese.com/p/skymont-intels-e-cores-reach-for-the-sky