Discussion Zen 5 Architecture & Technical discussion


gdansk

Platinum Member
Feb 8, 2011
2,833
4,210
136
The latency has increased for SIMD instructions from 1 to 2 cycles. Because of this, SSE instructions seem to suffer. So any workload using these instructions might see a slight regression from Zen4.
Wouldn't that impact most x86 workloads...? Programs basically don't use x87 nor AVX512 as I understand.
 

naukkis

Senior member
Jun 5, 2002
869
733
136
Wouldn't that impact most x86 workloads...? Programs basically don't use x87 nor AVX512 as I understand.

The basic x86-64 FP instruction set is SSE2. x87 can be used from x64, but it isn't recommended and is normally not used at all. AVX/AVX2 has some support, but since not all CPUs sold even today support it, software support is quite minimal. AVX512 is supported on pretty much nothing. AMD probably didn't know the SIMD workload distribution when they started the Zen5 design; Intel was backing AVX512 pretty strongly back then. But even with AVX512, the main desktop performance priority is 128-bit SIMD, and giving up 128-bit performance for wider vectors is just the wrong bet from AMD. Intel is going in the opposite direction: their E-core straight up doubled its 128-bit FP resources and Lion Cove increased its 256-bit FP units. Zen5 seems to face quite tough competition from Intel.
 

JustViewing

Senior member
Aug 17, 2022
216
381
106
Wouldn't that impact most x86 workloads...? Programs basically don't use x87 nor AVX512 as I understand.
I guess it will have an impact, as most executables are generic ones compiled for the lowest common denominator. The Win64 baseline is SSE2, so most generic applications may not see a significant improvement. That could change with Zen 6.
 

MS_AT

Member
Jul 15, 2024
199
456
96
Wouldn't that impact most x86 workloads...? Programs basically don't use x87 nor AVX512 as I understand.
That depends. SSE is the default processing mode for floating-point values [for scalars too] on the x64 architecture. The thing is, it's the 1-cycle-latency instructions that got worse. Those are usually shuffles, and you don't need to shuffle scalar values within a register. Add and multiply were already 3 cycles each and were not affected. Actually, SIMD int might be affected, as vector int add was probably one cycle.
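To put rough numbers on that, here is a minimal Python sketch of an idealized pipeline. It assumes the 1-to-2-cycle latency change quoted above and an assumed issue width of 4 ops per cycle. The function names are made up for illustration, and a real core has many more constraints (ports, scheduler size, forwarding), so treat it as a back-of-the-envelope model only.

```python
def dependent_chain_cycles(num_ops: int, latency: int) -> int:
    """Each op needs the previous result, so the latencies simply add up."""
    return num_ops * latency


def independent_ops_cycles(num_ops: int, latency: int, ops_per_cycle: int) -> int:
    """Independent ops overlap: the issue rate dominates, latency is paid once."""
    issue_cycles = -(-num_ops // ops_per_cycle)  # ceiling division
    return issue_cycles + latency - 1


if __name__ == "__main__":
    # Latencies of 1 and 2 cycles are the before/after values from the thread.
    for latency in (1, 2):
        chain = dependent_chain_cycles(1000, latency)
        parallel = independent_ops_cycles(1000, latency, ops_per_cycle=4)
        print(f"latency={latency}: chain={chain} cycles, independent={parallel} cycles")
```

In this toy model a dependent chain of formerly 1-cycle ops doubles in cost, while independent work barely notices, so only latency-bound code would be expected to regress.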
The basic x86-64 FP instruction set is SSE2. x87 can be used from x64, but it isn't recommended and is normally not used at all. AVX/AVX2 has some support, but since not all CPUs sold even today support it, software support is quite minimal. AVX512 is supported on pretty much nothing. AMD probably didn't know the SIMD workload distribution when they started the Zen5 design; Intel was backing AVX512 pretty strongly back then. But even with AVX512, the main desktop performance priority is 128-bit SIMD, and giving up 128-bit performance for wider vectors is just the wrong bet from AMD. Intel is going in the opposite direction: their E-core straight up doubled its 128-bit FP resources and Lion Cove increased its 256-bit FP units. Zen5 seems to face quite tough competition from Intel.
It wasn't hard for Skymont to double its 128b execution units, they had so few of them before ;) It would be much more impressive if Lion Cove doubled the number of 256b pipes, but they are probably facing the same limitations AMD is. Lion Cove will match Zen5 in AVX2 capabilities, as it is actually playing catch-up to Zen4.
 

Mahboi

Senior member
Apr 4, 2024
976
1,761
96
The 40% IPC improvement in SpecInt (an early leak) is consistent with my tests showing 30-35% improvement in raw scalar integer that isn't memory-bound.
We get essentially 10% general improvement in INT, if not less.
How the heck can Zen 5 be somehow entirely memory-bound on scalar????
 

MS_AT

Member
Jul 15, 2024
199
456
96

We get essentially 10% general improvement in INT, if not less.
How the heck can Zen 5 be somehow entirely memory-bound on scalar????
Latency. What good are all those execution resources if you are waiting for either data or code to run? Games must be notorious for this, seeing how many of them are helped by X3D cache. Since Zen5 and Zen4 share the same connection characteristics from core to L3 and from CCD to IOD, afaik, you won't see much improvement between Zen4 and Zen5 when that happens.
I guess that with synthetic benchmarks running completely from L1 cache, you would see noticeable improvements in int scalar execution from Zen4 to Zen5. Therefore the uncore changes rumored for Zen6 might be more meaningful than the IPC gain of the core, if the current potential is not fully tapped. But to know that, we would need someone to hook up a profiler and see where the problem lies. Maybe C&C will do that.
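As a rough illustration of why a faster core can show up as only a small overall gain, here is an Amdahl's-law-style sketch in Python. The 1.35x core speedup is the upper end of the "30-35%" figure quoted earlier in the thread; the core-bound fractions are purely hypothetical values picked for illustration, not measurements, and `overall_speedup` is a made-up helper name.

```python
def overall_speedup(core_bound_fraction: float, core_speedup: float) -> float:
    """Amdahl's law: only the core-bound share of run time gets faster.

    The remaining share (waiting on L3, memory, or code fetch) is assumed
    to take exactly as long as before.
    """
    stalled_fraction = 1.0 - core_bound_fraction
    return 1.0 / (stalled_fraction + core_bound_fraction / core_speedup)


if __name__ == "__main__":
    # Hypothetical fractions of run time that are actually core-bound.
    for fraction in (1.0, 0.5, 0.35):
        gain = (overall_speedup(fraction, 1.35) - 1.0) * 100.0
        print(f"core-bound {fraction:.0%}: about {gain:.0f}% faster overall")
```

Under these assumptions, a workload that spends only about a third of its time core-bound ends up near a 10% gain, even though the core itself is 35% faster on work that fits in L1.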
 

soresu

Diamond Member
Dec 19, 2014
3,188
2,463
136
But even with AVX512, the main desktop performance priority is 128-bit SIMD, and giving up 128-bit performance for wider vectors is just the wrong bet from AMD
Are you implying that AMD's 128 bit vector perf has actually regressed?

AFAIK unless I have read things completely wrong the larger units should just subdivide for smaller vectors allowing 4x 512 to become 8x 256, or 16x 128.
 

gdansk

Platinum Member
Feb 8, 2011
2,833
4,210
136
AFAIK unless I have read things completely wrong the larger units should just subdivide for smaller vectors allowing 4x 512 to become 8x 256, or 16x 128.
No, the FPU can only do 4 operations per cycle (of any size), and up to 2 loads (of any length). The stores can be split, however: 1 x 512-bit or 2 x 128/256-bit.
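A small Python sketch of what those per-cycle limits mean in bits, using only the figures from this post (4 ops, 2 loads, and 1 x 512-bit or 2 x 128/256-bit stores per cycle). The constant and function names are made up for illustration, and the numbers are peak values that ignore the per-operation-type limits.

```python
OPS_PER_CYCLE = 4    # FPU operations per cycle, regardless of vector width
LOADS_PER_CYCLE = 2  # loads per cycle, regardless of length

# Stores per cycle by vector width in bits: 1 x 512-bit or 2 x 128/256-bit.
STORES_PER_CYCLE = {128: 2, 256: 2, 512: 1}


def peak_bits_per_cycle(vector_bits: int) -> dict:
    """Peak bits processed, loaded and stored per cycle at a given width."""
    return {
        "compute": OPS_PER_CYCLE * vector_bits,
        "load": LOADS_PER_CYCLE * vector_bits,
        "store": STORES_PER_CYCLE[vector_bits] * vector_bits,
    }


if __name__ == "__main__":
    for width in (128, 256, 512):
        print(width, peak_bits_per_cycle(width))
```

Because the operation count per cycle is fixed, 128-bit code tops out at a quarter of the 512-bit compute throughput in this model rather than getting 16 narrow operations per cycle.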
 

Ken g6

Programming Moderator, Elite Member
Moderator
Dec 11, 1999
16,331
4,005
75
Wouldn't that impact most x86 workloads...? Programs basically don't use x87 nor AVX512 as I understand.
SSE replacing x87 basically started in 2000 with the Pentium 4. But with SSE came the option of SIMD, doing 2-4 tasks at a time. So any floating-point code requiring performance between 2000 and the early 2010s should have used SSE with SIMD. When AVX came along in the early 2010s it was a drop-in upgrade for most SIMD code.

By this logic, most performant floating-point code updated in the past decade should be using AVX and thus should not be affected. Also realize that many applications don't use floating-point at all.

Of course there are always exceptions. A program I worked on used SSE and 80-bit x87 in a weird way that wouldn't translate well to AVX because there weren't enough x87 registers. Fortunately it's obsolete now, but it would have required a good deal of work to use AVX.
 

inquiss

Member
Oct 13, 2010
179
261
136
Are you implying that AMD's 128 bit vector perf has actually regressed?

AFAIK unless I have read things completely wrong the larger units should just subdivide for smaller vectors allowing 4x 512 to become 8x 256, or 16x 128.
Yes there is extra latency there now
 

soresu

Diamond Member
Dec 19, 2014
3,188
2,463
136
Also, the 3DNow! instruction set came out the year before SSE.

Edit: Oh interesting, 3DNow! actually started offering FP32 add/subtract/multiply operations before Intel had them with SSE.
 

Thunder 57

Platinum Member
Aug 19, 2007
2,948
4,474
136
SSE replacing x87 basically started in 2000 with the Pentium 4. But with SSE came the option of SIMD, doing 2-4 tasks at a time. So any floating-point code requiring performance between 2000 and the early 2010s should have used SSE with SIMD. When AVX came along in the early 2010s it was a drop-in upgrade for most SIMD code.

By this logic, most performant floating-point code updated in the past decade should be using AVX and thus should not be affected. Also realize that many applications don't use floating-point at all.

Of course there are always exceptions. A program I worked on used SSE and 80-bit x87 in a weird way that wouldn't translate well to AVX because there weren't enough x87 registers. Fortunately it's obsolete now, but it would have required a good deal of work to use AVX.

I think you meant SSE2 with the P4. SSE was on the P3 before that. When AVX came along in the early 2010s, Intel used it for market segmentation. The lower-end chips didn't include it. If not for that, maybe AVX(2) would be more prevalent today. But that was par for the course for a long time with Intel.

Also, the 3DNow! instruction set came out the year before SSE.

Edit: Oh interesting, 3DNow! actually started offering FP32 add/subtract/multiply operations before Intel had them with SSE.

3DNow! was implemented to make up for AMD's less than stellar x87 performance at the time. If AMD had had more market share, maybe it would've made more of a difference.
 

MS_AT

Member
Jul 15, 2024
199
456
96
No, the FPU can only do 4 operations per cycle (of any size), and up to 2 loads (of any length). The stores can be split, however: 1 x 512-bit or 2 x 128/256-bit.
There are further limits by operation type [add, mul, complex permute, simple permute], which may or may not be important.
Yes there is extra latency there now
Only for a subset of instructions. The most basic ones already had more than 1 cycle of latency, so they are not affected.
I think you meant SSE2 with the P4. SSE was on the P3 before that. When AVX came along in the early 2010s, Intel used it for market segmentation. The lower-end chips didn't include it. If not for that, maybe AVX(2) would be more prevalent today. But that was par for the course for a long time with Intel.
If not for this stupid segmentation policy, AVX512 could be more popular now. I mean, Intel could fit a somewhat limited implementation into Tiger Lake, and I remember that part of the hardware for this limited implementation was already present on Skylake non-X but fused off. But I would need to dig for a source, since I might be remembering wrongly.
 

Ken g6

Programming Moderator, Elite Member
Moderator
Dec 11, 1999
16,331
4,005
75
I think you meant SSE2 with the P4. SSE was on the P3 before that.
I know there was SSE on P3, but I don't think they really pushed moving off the x87 FPU for non-SIMD work until the P4.
When AVX came along in the early 2010s, Intel used it for market segmentation. The lower-end chips didn't include it. If not for that, maybe AVX(2) would be more prevalent today. But that was par for the course for a long time with Intel.
Oh, yeah, I forgot about the broken Celery. (Celerons.)
 

soresu

Diamond Member
Dec 19, 2014
3,188
2,463
136
3DNow! was implemented to make up for AMD's less than stellar x87 performance at the time. If AMD had had more market share, maybe it would've made more of a difference.
Story of AMD's life really.

Exact same thing happened with SSE5 and AVX.
 

Mahboi

Senior member
Apr 4, 2024
976
1,761
96
Latency. What good are all those execution resources if you are waiting for either data or code to run? Games must be notorious for this, seeing how many of them are helped by X3D cache. Since Zen5 and Zen4 share the same connection characteristics from core to L3 and from CCD to IOD, afaik, you won't see much improvement between Zen4 and Zen5 when that happens.
I guess that with synthetic benchmarks running completely from L1 cache, you would see noticeable improvements in int scalar execution from Zen4 to Zen5. Therefore the uncore changes rumored for Zen6 might be more meaningful than the IPC gain of the core, if the current potential is not fully tapped. But to know that, we would need someone to hook up a profiler and see where the problem lies. Maybe C&C will do that.
Fascinating...
One thing I read somewhere is that Apple's performance success comes also from a fairly fat L2 rather than the more server-typical L1/2/3 AMD uses.
Now granted, they also do it in GPUs while Nvidia stays with only L1/L2, so maybe it's just a kink AMD will keep. But could it be that with Zen 6 we start seeing the latency bottleneck cured by a smaller or non-existent L3 on client, with a really fat L2 replacing it? Server apps very clearly gain a lot from Zen 5. The problem seems to be more that what we have here is a fully primed server chip that isn't really any kind of improvement in client.
 

naukkis

Senior member
Jun 5, 2002
869
733
136
3DNow! was implemented to make up for AMD's less than stellar x87 performance at the time. If AMD had had more market share, maybe it would've made more of a difference.

3DNow! was basically meant only for 3D gaming. Its FP calculations weren't IEEE-compatible, so for most programs they lacked the needed accuracy, which made it useless for general use.
 

naukkis

Senior member
Jun 5, 2002
869
733
136
Are you implying that AMD's 128 bit vector perf has actually regressed?

AFAIK unless I have read things completely wrong the larger units should just subdivide for smaller vectors allowing 4x 512 to become 8x 256, or 16x 128.

Not the point. AMD has only 2 load ports to the FP registers where everybody else has more. Even Intel's E-cores will have 3 load ports, so they will probably be able to achieve better IPC for some scalar and 128-bit workloads. AMD has the biggest FP unit of all in their CPUs, and yet is nearly in a situation where it will have the worst IPC for desktop and mobile workloads.
 

gdansk

Platinum Member
Feb 8, 2011
2,833
4,210
136
AMD has the biggest FP unit of all in their CPUs
Biggest how? Area? I doubt it's even half the size of GC's. Transistor count? I doubt that it'll approach any N3 core.
I think people keep overestimating the area needed to do the changes they did. Plus in mobile the FPU isn't wider, it's still 256-bit.
And yet it has the 1 cycle penalty (i.e. it's not a consequence of being 512 bit but of some other design hazard).
 

naukkis

Senior member
Jun 5, 2002
869
733
136
Biggest how? Area? I doubt it's even half the size of GC's. Transistor count? I doubt that it'll approach any N3 core.
I think people keep overestimating the area needed to do the changes they did. Plus in mobile the FPU isn't wider, it's still 256-bit.
And yet it has the 1 cycle penalty (i.e. it's not a consequence of being 512 bit but of some other design hazard).

Theoretically the most powerful, but in real use cases it might actually be the worst performer. That's a pretty imbalanced situation.
 

soresu

Diamond Member
Dec 19, 2014
3,188
2,463
136
Theoretically the most powerful, but in real use cases it might actually be the worst performer. That's a pretty imbalanced situation.
And yet a rather obvious piece of low-hanging fruit for future cores to improve upon.