Discussion Zen 5 Architecture & Technical discussion


JustViewing

Senior member
Aug 17, 2022
216
382
106
@Abwx Yeah it's hard to know what's going on. There could really be a bug here when using the 2 groups of decoders in single thread. As far as I know no one has demonstrated the decoding of more than 4 instructions per cycle. But this might be due to microbenchmarks being unable to demonstrate that.
It could be that you have to disable SMT to get both the dual decoders and the dual OP cache to work in single thread. Could be a limitation in the current Zen5 design. Or it is disabled due to some bug.
 

Nothingness

Diamond Member
Jul 3, 2013
3,029
1,971
136
It could be that you have to disable SMT to get both the dual decoders and the dual OP cache to work in single thread. Could be a limitation in the current Zen5 design. Or it is disabled due to some bug.
According to the tests done so far, disabling SMT leads to strange perf behaviors, which is why a bug is quite possible.
 

JustViewing

How many cycles before the decoder is put to use?
Instructions are flowing permanently from the instruction cache, so how many instructions can be picked from the uop cache before new instructions are required from the cache?
Even if the op cache can provide some instructions you'll still have to decode what follows next; it's not like the full set of an app's instructions can be provided by the uop cache.
The most critical parts of the code, like inner loops, will be fully served by the OP cache. In Zen4 I think each item in the op cache is 8 bytes, so 6000+ entries is equal to about 48 KB. 6000 instructions is a lot for an inner loop.
 

naukkis

Senior member
Jun 5, 2002
871
737
136
How many cycles before the decoder is put to use?
Instructions are flowing permanently from the instruction cache, so how many instructions can be picked from the uop cache before new instructions are required from the cache?
Even if the uop cache can provide some instructions you'll still have to decode what follows next; it's not like the full set of an app's instructions can be provided by the uop cache.

Programs are made of loops over the same instructions. If that weren't the case, all CPU caches would be useless. For instructions, the uop cache is the first level, an L0 cache. Instructions only need to be decoded from the L1 cache when the L0 cache misses. Program loops can be fetched from L0 completely, and when that goes on for quite some time, say a million instructions coming from L0 without any instruction being decoded, today's CPUs will power gate the whole decoder to save power. And that will happen continuously when executing optimized programs.
 

Tuna-Fish

Golden Member
Mar 4, 2011
1,474
1,964
136
The most critical parts of the code, like inner loops, will be fully served by the OP cache. In Zen4 I think each item in the op cache is 8 bytes, so 6000+ entries is equal to about 48 KB. 6000 instructions is a lot for an inner loop.

The µop cache has fixed-length lines, with 6 ops per line, so 1024 lines. There are alignment restrictions: iirc every x86 instruction that gets put on a line must begin in the same 64B cache line. So the worst case is 7 µops worth of instructions per cache line, because that means one full line and one line with just one µop. But yeah, ~<6000 ops is good enough for small enough inner loops, but that's it.
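As a rough check of the numbers above, here is a minimal sketch of the capacity arithmetic. It takes the 6 ops per line and 1024 lines from the post (recalled figures, not an official spec) and assumes the worst case described there, where every 64B block of code produces 7 µops and therefore occupies two op cache lines. The function names are made up for illustration.

```python
# Figures recalled in the post above; treat them as assumptions, not official specs.
OPS_PER_LINE = 6     # ops held by one op cache line
NUM_LINES = 1024     # number of op cache lines


def nominal_capacity(lines: int = NUM_LINES, ops_per_line: int = OPS_PER_LINE) -> int:
    """Capacity if every op cache line is completely full."""
    return lines * ops_per_line


def lines_needed(uops: int, ops_per_line: int = OPS_PER_LINE) -> int:
    """Op cache lines needed for the µops of one 64B code block (ceiling division)."""
    return -(-uops // ops_per_line)


def worst_case_capacity(lines: int = NUM_LINES, ops_per_line: int = OPS_PER_LINE) -> int:
    """Usable µops when each 64B block yields one µop more than fits in a line."""
    uops_per_block = ops_per_line + 1                 # 7 µops in the post's example
    lines_per_block = lines_needed(uops_per_block)    # one full line + one nearly empty line
    return (lines // lines_per_block) * uops_per_block


print(nominal_capacity())      # 6144
print(worst_case_capacity())   # 3584
```

Under these assumptions the usable capacity ranges from roughly 3.5K to 6K µops depending on how the code happens to pack, which is consistent with the "~<6000 ops" remark.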

Looking at the C&C numbers, it really looks to me like Zen5 is extremely front-end bound. The backend can handle a lot more ops than the frontend can, on average, sling at it. I hope they can beef that up in future designs.
 

naukkis

But yeah, ~<6000 ops is good enough for small enough inner loops, but that's it.

If I remember old things right, you can expect most loops to be less than 32 instructions. Today's CPUs will cache many levels of nested loops, which makes it practical to implement things like power gating the decoders entirely, since in many cases the looping code is cached entirely in the mop cache but the data is sitting far away, maybe in DRAM. Power gating the decoders takes some time, as does bringing them back to life, so the predictor has to know that the looping will continue for a pretty long time before it gates them down. BTW, as decoders consume up to 50% of CPU execution power, that should be seen in CPU power figures: 1T workloads will bring power down whereas 2T usually won't. It seems like only Intel CPUs power decoding down; AMD CPUs keep the decoders on even with 1T workloads.

A typical example of a tight inner loop is memory moving or table clearing: those are just a couple of instructions in the loop body, but they execute a massive number of instructions if the memory footprint operated on is large. And a blast from the past: when a loop buffer was first used, in the MC68010, it was exactly as small as it could be, two instructions, but even that achieved quite good hit ratios.
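To put a number on "a couple of instructions but a massive amount of instructions executed", here is a small sketch. The 3 instructions per iteration (store, pointer increment, compare-and-branch) and the 8-byte store width are illustrative assumptions, not measurements of any real compiler output or CPU.

```python
def dynamic_instructions(bytes_to_clear: int,
                         bytes_per_store: int = 8,
                         instructions_per_iteration: int = 3) -> int:
    """Instructions executed by a simple clearing loop over a buffer.

    The static loop body is only `instructions_per_iteration` long (an assumed
    value), but it runs once per store.
    """
    iterations = -(-bytes_to_clear // bytes_per_store)  # ceiling division
    return iterations * instructions_per_iteration


one_gib = 1 << 30
print(dynamic_instructions(one_gib))  # 402653184 executed instructions from a 3-instruction loop
```

With these assumed values, clearing 1 GiB executes about 400 million instructions without ever needing anything beyond the same three cached ops.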
 

Hitman928

Diamond Member
Apr 15, 2012
6,025
10,353
136
Skymont is closer to 50% the area of Lion Cove.



He included L3 as the cache for Lion Cove which is not how cores are typically compared.

Here's a core size comparison chart. With the Lion Cove die shots we have, the L2 area is unclear to me, so I am leaving that out for now. I am only including 1/4 of the Skymont L2 because it is shared between all the cores in the 4-core cluster. If you want to include the full L2 size, that would obviously make Skymont + L2 a little bigger.

Additionally, the Zen 5c core is about 0.64 times the size of Zen 5, but since they both have the same amount of L2 cache, the difference becomes smaller at 0.75 when L2 cache is included.

For process comparison, N3B has ~1.6x the logic density of N4P and, I believe, ~1.3x the chip density. So, if we assume a direct port would achieve chip density level of improvements for Zen 5, then Zen 5 would be at least 30% smaller than Lion Cove (L2 included with both). Zen 5c would be around 7% smaller than Skymont or 12% bigger if including L2 and 1/4 L2 for Zen5c and Skymont respectively.

Core type                     | Process  | Area (mm^2) | Relative area to Lion Cove + L2
(LNL) Lion Cove + L2          | TSMC N3B | 4.25        | 1
(LNL) Skymont + (1/4) L2      | TSMC N3B | 1.93        | 0.454
Mobile Zen 5 + L2             | TSMC N4P | 4.15        | 0.975
Mobile Zen 5c + L2            | TSMC N4P | 3.09        | 0.727
(LNL) Lion Cove               | TSMC N3B | 3.23?       | 0.76?
(LNL) Skymont                 | TSMC N3B | 1.5         | 0.353
Mobile Zen 5                  | TSMC N4P | 3.09        | 0.727
Mobile Zen 5c                 | TSMC N4P | 1.99        | 0.468


Edit: I added a guess for Lion Cove core size without L2. I left the question marks though as I am less confident in it versus the other size estimates.
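The process-normalized comparison in this post can be re-derived with a few lines of arithmetic. The sketch below uses the area estimates from the table; the `area_scale` parameter and the function names are illustrative assumptions. The percentages quoted in the post (roughly 30% smaller, 7% smaller, 12% bigger) appear to correspond to scaling N4P area by 0.7. Reading "~1.3x density" strictly as area divided by 1.3 (about 0.77x) gives somewhat less favorable numbers, so both are shown.

```python
# Area estimates in mm^2, copied from the table in the post (die-shot estimates).
AREAS_MM2 = {
    "Lion Cove + L2": 4.25,
    "Skymont + 1/4 L2": 1.93,
    "Skymont": 1.5,
    "Zen 5 + L2": 4.15,
    "Zen 5c + L2": 3.09,
    "Zen 5c": 1.99,
}


def relative_area(name: str, reference: str = "Lion Cove + L2") -> float:
    """Area of `name` as a fraction of the reference core."""
    return AREAS_MM2[name] / AREAS_MM2[reference]


def ported_ratio(amd_core: str, intel_core: str, area_scale: float) -> float:
    """Ratio of a hypothetically N3B-ported AMD core to an Intel core.

    `area_scale` is the assumed N4P -> N3B area multiplier (an assumption,
    not a foundry figure): 0.7 reproduces the post's percentages, 1 / 1.3 is
    the strict reading of a 1.3x density gain.
    """
    return AREAS_MM2[amd_core] * area_scale / AREAS_MM2[intel_core]


pairs = [
    ("Zen 5 + L2", "Lion Cove + L2"),
    ("Zen 5c", "Skymont"),
    ("Zen 5c + L2", "Skymont + 1/4 L2"),
]
for scale in (0.7, 1 / 1.3):
    for amd, intel in pairs:
        print(f"scale {scale:.2f}: {amd} vs {intel} = {ported_ratio(amd, intel, scale):.2f}")
```

With a scale of 0.7 the ratios come out at about 0.68, 0.93 and 1.12; with 1/1.3 they are about 0.75, 1.02 and 1.23. Either way the estimate is only as good as the assumption that a direct port would scale at chip-level density.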
 

CouncilorIrissa

Senior member
Jul 28, 2023
520
1,991
96
Here's a core size comparison chart. With the Lion Cove die shots we have, the L2 area is unclear to me, so I am leaving that out for now. I am only including 1/4 of the Skymont L2 because it is shared between all the cores in the 4-core cluster. If you want to include the full L2 size, that would obviously make Skymont + L2 a little bigger.

Additionally, the Zen 5c core is about 0.64 times the size of Zen 5, but since they both have the same amount of L2 cache, the difference becomes smaller at 0.75 when L2 cache is included.

For process comparison, N3B has ~1.6x the logic density of N4P and, I believe, ~1.3x the chip density. So, if we assume a direct port would achieve chip density level of improvements for Zen 5, then Zen 5 would be at least 30% smaller than Lion Cove (L2 included with both). Zen 5c would be around 7% smaller than Skymont or 12% bigger if including L2 and 1/4 L2 for Zen5c and Skymont respectively.

Core type                     | Process  | Area (mm^2) | Relative area to Lion Cove + L2
Lion Cove + L2                | TSMC N3B | 4.25        | 1
Skymont + (1/4) L2            | TSMC N3B | 1.93        | 0.454
Zen 5 + L2                    | TSMC N4P | 4.15        | 0.975
Zen 5c + L2                   | TSMC N4P | 3.09        | 0.727
Lion Cove                     | TSMC N3B | 3.23?       | 0.76?
Skymont                       | TSMC N3B | 1.5         | 0.353
Zen 5                         | TSMC N4P | 3.09        | 0.727
Zen 5c                        | TSMC N4P | 1.99        | 0.468


Edit: I added a guess for Lion Cove core size without L2. I left the question marks though as I am less confident in it versus the other size estimates.
What's the source on Zen 5 core area? I thought it was 3.46 mm^2 without L2?
 

Hitman928

What's the source on Zen 5 core area? I thought it was 3.46 mm^2 without L2?

It is my own estimate from the high res STX die shot. The 3.46 mm^2 estimate was for Zen 5 on Granite Ridge with the full 512-bit paths. It was also based on a low res photo from a distance and should have a high error margin to it.

Edit: Here is the source of the GNR core area estimate. I've added that the Zen 5 in the table is the mobile core.

Edit 2: Based on this AMD provided image, I estimate Zen 5 core area in GNR to be 3.5 mm^2, so it's in good agreement with the previous estimate. I will say, that even with the AMD provided image, it's not as good as the STX die shot and the edit/highlighting done on it will make the estimate less accurate, but ~3.5 mm^2 should be pretty close. That means that adding the extra width for the 512-bit paths to Zen 5 for GNR increased the core size (without L2) by ~13%. With L2 I get Zen 5 core in GNR as ~4.56 mm^2 or roughly 10% larger than Zen 5 in STX with L2.
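The percentages in Edit 2 follow directly from the estimates. The sketch below re-derives them, using the mobile (STX) Zen 5 figures from the table earlier in the thread (3.09 and 4.15 mm^2) and the GNR estimates from this post. All inputs are die-shot estimates; the variable and function names are illustrative.

```python
# Estimated areas in mm^2 (from the posts; not official figures).
STX_CORE, STX_CORE_L2 = 3.09, 4.15   # mobile Zen 5, without and with L2
GNR_CORE, GNR_CORE_L2 = 3.5, 4.56    # desktop Zen 5, without and with L2


def pct_increase(new: float, old: float) -> float:
    """Percentage by which `new` exceeds `old`."""
    return (new / old - 1) * 100


print(round(pct_increase(GNR_CORE, STX_CORE)))        # 13 (core only)
print(round(pct_increase(GNR_CORE_L2, STX_CORE_L2)))  # 10 (core + L2)

# Implied L2 area is the same for both estimates, which is a useful sanity check.
print(round(STX_CORE_L2 - STX_CORE, 2), round(GNR_CORE_L2 - GNR_CORE, 2))  # 1.06 1.06
```

The implied L2 area of about 1.06 mm^2 in both cases suggests the two sets of estimates are at least internally consistent.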
 

Hitman928

It is my own estimate from the high res STX die shot. The 3.46 mm^2 estimate was for Zen 5 on Granite Ridge with the full 512-bit paths. It was also based on a low res photo from a distance and should have a high error margin to it.

Edit: Here is the source of the GNR core area estimate. I've added that the Zen 5 in the table is the mobile core.

Edit 2: Based on this AMD provided image, I estimate Zen 5 core area in GNR to be 3.5 mm^2, so it's in good agreement with the previous estimate. I will say, that even with the AMD provided image, it's not as good as the STX die shot and the edit/highlighting done on it will make the estimate less accurate, but ~3.5 mm^2 should be pretty close. That means that adding the extra width for the 512-bit paths to Zen 5 for GNR increased the core size (without L2) by ~13%. With L2 I get Zen 5 core in GNR as ~4.56 mm^2 or roughly 10% larger than Zen 5 in STX with L2.

I should also add that mobile Zen 5 and desktop have a different target Fmax, so that may come into play as well in the die area differences.
 

gdansk

Platinum Member
Feb 8, 2011
2,836
4,218
136
So after actually counting the ducks, Skymont went from much smaller than Zen 5C to roughly equal.
Only an estimate. It is still my guess N3B Skymont would be smaller than N3E Zen 5C for two reasons: N3E is a bit less dense than N3B and this Zen 5C has the full width AVX-512 unit unlike Strix Point Zen 5C.

But I was hoping to see if anyone had measurements.
 

Hitman928

Only an estimate. It is still my guess N3B Skymont would be smaller than N3E Zen 5C for two reasons: N3E is a bit less dense than N3B and this Zen 5C has the full width AVX-512 unit unlike Strix Point Zen 5C.

But I was hoping to see if anyone had measurements.

This is STX Zen 5c. . .

So after actually counting the ducks, Skymont went from much smaller than Zen 5C to roughly equal.

If you mean on an equivalent process, then yes.
 

MS_AT

Senior member
Jul 15, 2024
202
475
96
According to tests done disabling SMT leads to strange perf behaviors, which is why a bug is quite possible.
Could you point to the specific tests where you find the performance strange? From what I recall from the TPU tests, the Zen4 CPU showed generally the same behaviour, just less pronounced. Most games benefited from SMT off, while tasks like Cinebench or code compilation were worse, which I think looks fine.
 

Abwx

Lifer
Apr 2, 2011
11,516
4,302
136
Programs are made of loops over the same instructions. If that weren't the case, all CPU caches would be useless. For instructions, the uop cache is the first level, an L0 cache. Instructions only need to be decoded from the L1 cache when the L0 cache misses. Program loops can be fetched from L0 completely, and when that goes on for quite some time, say a million instructions coming from L0 without any instruction being decoded, today's CPUs will power gate the whole decoder to save power. And that will happen continuously when executing optimized programs.
That must be in the most favourable cases, because Chips and Cheese measured the op cache hit rate of Zen 4 at 70-80% for some games. That leaves 20-30% of the instruction flow still relying on the decoder, and hence on the instruction cache, which is hardly compatible with one million instructions extracted from the op cache in a row, or even much less than one million.
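The intuition here can be sketched with a deliberately simple model. It assumes every op is an independent hit with the same probability (a geometric model), which is an assumption for illustration only: real code is bursty, so long uninterrupted runs inside hot loops remain possible even when the average hit rate is 70-80%.

```python
def expected_hit_run(hit_rate: float) -> float:
    """Mean number of consecutive op cache hits before a miss (independent-hit model)."""
    return hit_rate / (1.0 - hit_rate)


def prob_run_at_least(hit_rate: float, length: int) -> float:
    """Probability that the next `length` ops all hit, under the same model."""
    return hit_rate ** length


for rate in (0.70, 0.80):
    print(f"hit rate {rate:.0%}: mean run {expected_hit_run(rate):.1f} ops, "
          f"P(100 hits in a row) = {prob_run_at_least(rate, 100):.2e}")
```

Under this model the mean run between misses is only a handful of ops, so a 70-80% average implies that million-op runs could only come from phases where the hit rate is far above that average.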
 

DisEnchantment

Golden Member
Mar 3, 2017
1,747
6,598
136
It could be that you have to disable SMT to get both the dual decoders and the dual OP cache to work in single thread. Could be a limitation in the current Zen5 design. Or it is disabled due to some bug.
From David's test as well as from Cheese, it seems the front end is still 4 wide in ST mode (HX 370)
This is in contrast to what Mike Clark said.

George Cozma: You know, for a single thread of it, let’s say you’re running a workload that only uses one thread on a given core. Can a single thread take advantage of all of the front-end resources and can it take advantage of both decode clusters and the entirety of the dual ported OP cache?

Mike Clark: The answer is yes, and it’s a great question to ask because I explain SMT to a lot of people, they come in with the notion that we don’t [and] they aren’t able to use all these resources when we’re in single threaded mode, but our design philosophy is that barring a few, very rare microarchitectural exceptions, everything that matters is available in one thread mode. If we imagine we are removing [SMT] it’s not like we’d go shrink anything. There’s nothing to shrink. This is what we need for good, strong single threaded performance. And we’ve already built that.
The dual ported op cache seems to be working as mentioned. But the dual decode cluster seems to be not working as stated.
So it would seem that there is some bug or security vulnerability that would inhibit its full potential.

From the test, Z5 is quite front end bound.

Then they tossed out the loop buffer, doubled inst fetch, changed a whole bunch of things. Some worked well, some didn't.

I guess Mike was excited from the architecture point of view; maybe the implementation was not that great.

It is my own estimate from the high res STX die shot. The 3.46 mm^2 estimate was for Zen 5 on Granite Ridge with the full 512-bit paths. It was also based on a low res photo from a distance and should have a high error margin to it.

Edit: Here is the source of the GNR core area estimate. I've added that the Zen 5 in the table is the mobile core.

Edit 2: Based on this AMD provided image, I estimate Zen 5 core area in GNR to be 3.5 mm^2, so it's in good agreement with the previous estimate. I will say, that even with the AMD provided image, it's not as good as the STX die shot and the edit/highlighting done on it will make the estimate less accurate, but ~3.5 mm^2 should be pretty close. That means that adding the extra width for the 512-bit paths to Zen 5 for GNR increased the core size (without L2) by ~13%. With L2 I get Zen 5 core in GNR as ~4.56 mm^2 or roughly 10% larger than Zen 5 in STX with L2.
The Z5 die area would also include the TSVs and power delivery for the V Cache which got extended into the L2 area, so I guess it has to be taken out too.
 

MS_AT

That must be in the most favourable cases, because Chips and Cheese measured the op cache hit rate of Zen 4 at 70-80% for some games. That leaves 20-30% of the instruction flow still relying on the decoder, and hence on the instruction cache, which is hardly compatible with one million instructions extracted from the op cache in a row, or even much less than one million.
Games are branchy. A well-optimized compute loop, think HPC, will have a much smaller branch count per thousand instructions than any game. So it's not out of the realm of possibility that you could fit a whole computation kernel [not only the hot loops] in the uop cache.
From David's test as well as from Cheese, it seems the front end is still 4 wide in ST mode (HX 370)
This is in contrast to what Mike Clark said.
Until Cheese or David confirm that they tried to measure this with SMT=Off in the BIOS, there is a slight chance that the message conveyed on the slides, that the decoders are statically partitioned, might turn out to be true. So far, when they used "statically partitioned", it meant that when SMT is active each thread gets half of the resource, but when it's off, the single active hw thread can use the full core resources [according to the Zen4 Software Optimization Guide]. This is also what Clark could have meant indirectly.

So in other words the possibilities are: Clark and the slides are wrong: there is no way a single thread can use 2 decoders at all. Another possibility is: the slides are right and Clark was partially right: in SMT=Off mode the remaining HW thread can use all core resources, including the 2 decoders. The last option is that the trigger for activating the second decode cluster is not exactly known, therefore the tests cannot account for it. I would guess, similar to Goldmont, a branch could be the trigger, but it's just a guess.
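The three possibilities can be written down as a toy model, which makes it clearer what a microbenchmark would have to vary to tell them apart. Everything here is hypothetical: the enum names, the `trigger_hit` flag, and the assumption of two decode clusters of 4 instructions per cycle each (chosen to match the "4 wide" single-thread measurements mentioned in the thread). It does not describe confirmed Zen 5 behavior.

```python
from enum import Enum, auto


class Hypothesis(Enum):
    NEVER_SHARED = auto()      # a single thread can never use both clusters
    SMT_OFF_ONLY = auto()      # both clusters usable only with SMT disabled in firmware
    UNKNOWN_TRIGGER = auto()   # second cluster engages on some unknown condition


CLUSTERS = 2            # assumed number of decode clusters
WIDTH_PER_CLUSTER = 4   # assumed instructions per cycle per cluster


def single_thread_decode_width(hypothesis: Hypothesis,
                               smt_enabled: bool,
                               trigger_hit: bool = False) -> int:
    """Peak decode width one thread could see under each hypothesis (toy model)."""
    one = WIDTH_PER_CLUSTER
    both = CLUSTERS * WIDTH_PER_CLUSTER
    if hypothesis is Hypothesis.NEVER_SHARED:
        return one
    if hypothesis is Hypothesis.SMT_OFF_ONLY:
        return one if smt_enabled else both
    # UNKNOWN_TRIGGER: width depends on whether the test code hits the trigger.
    return both if trigger_hit else one


for h in Hypothesis:
    for smt in (True, False):
        print(h.name, "SMT on" if smt else "SMT off",
              single_thread_decode_width(h, smt))
```

In this model, a test run with SMT enabled and no trigger returns 4 under all three hypotheses, which is why the existing measurements alone cannot distinguish between them.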

It's a pity AMD releases Software Optimization Guides for new cores only a few months after the release date... [at least to the general public]. Actually AMD is probably the worst when it comes to software guides and compiler support. IIRC the X925 got its resources posted during the announcement, and Intel usually does that ahead of time too, not to mention compiler support, where AMD is supplying patches with misinformation [GCC] or not supplying them at all [CLANG] ahead of launch. And they want to be a software company...;)
 

StefanR5R

Elite Member
Dec 10, 2016
5,889
8,757
136
So in other words the possibilities are: Clark and the slides are wrong: there is no way a single thread can use 2 decoders at all. Another possibility is: the slides are right and Clark was partially right: in SMT=Off mode the remaining HW thread can use all core resources, including the 2 decoders. The last option is that the trigger for activating the second decode cluster is not exactly known, therefore the tests cannot account for it.
About the first possibility: Strix Point and Granite Ridge, as released, are one thing. (Alright, two things.) There are more Zen 5 products in the making. I know, this is not the speculation thread, but recall that the last we heard about Turin's launch date was no more precise than "2H 2024".

About the second possibility: It would be rather brutal to have to disable SMT entirely through firmware in order to make the dual decoder useful to a single thread. Obviously, AMD should make this work dynamically (if they haven't already unbeknownst to everyone outside).
 

KompuKare

Golden Member
Jul 28, 2009
1,163
1,426
136
Here's a core size comparison chart. With the Lion Cove die shots we have, the L2 area is unclear to me, so I am leaving that out for now. I am only including 1/4 of the Skymont L2 because it is shared between all the cores in the 4-core cluster. If you want to include the full L2 size, that would obviously make Skymont + L2 a little bigger.

Additionally, the Zen 5c core is about 0.64 times the size of Zen 5, but since they both have the same amount of L2 cache, the difference becomes smaller at 0.75 when L2 cache is included.

For process comparison, N3B has ~1.6x the logic density of N4P and, I believe, ~1.3x the chip density. So, if we assume a direct port would achieve chip density level of improvements for Zen 5, then Zen 5 would be at least 30% smaller than Lion Cove (L2 included with both). Zen 5c would be around 7% smaller than Skymont or 12% bigger if including L2 and 1/4 L2 for Zen5c and Skymont respectively.

Core type                     | Process  | Area (mm^2) | Relative area to Lion Cove + L2
Lion Cove + L2                | TSMC N3B | 4.25        | 1
Skymont + (1/4) L2            | TSMC N3B | 1.93        | 0.454
Mobile Zen 5 + L2             | TSMC N4P | 4.15        | 0.975
Mobile Zen 5c + L2            | TSMC N4P | 3.09        | 0.727
Lion Cove                     | TSMC N3B | 3.23?       | 0.76?
Skymont                       | TSMC N3B | 1.5         | 0.353
Mobile Zen 5                  | TSMC N4P | 3.09        | 0.727
Mobile Zen 5c                 | TSMC N4P | 1.99        | 0.468


Edit: I added a guess for Lion Cove core size without L2. I left the question marks though as I am less confident in it versus the other size estimates.
Great work and something nobody AFAIK attempted on the Skymont thread.

Edit: once all the figures are in, Skymont is still a huge leap in efficiency and PPA for Intel, but not quite class leading. The irony is that in servers, Skymont and Zen 4c/5c are both going for the one market where ARM does really well.
 

MS_AT

About the first possibility: Strix Point and Granite Ridge, as released, are one thing. (Alright, two things.) There are more Zen 5 products in the making. I know, this is not the speculation thread, but recall that the last we heard about Turin's launch date was no more precise than "2H 2024".
Was this more a comment about the software optimization guides availability than the decoder?

Generally, I would expect them to release the documents, at the latest, alongside the first product, regardless of whether it is server or client. The reason is that if you release client first, developers can already get to work and be ready for the server parts. If you release server parts first, developers can start immediately to make the most of the platform. This makes sense especially for Zen, where the same core targets both desktop and server parts.

While nobody will bother with recompiling/rewriting games for new CPUs, some server-specific software probably will be updated, so by releasing late they only harm themselves. [I guess those documents are already released behind NDA, but it's not like releasing them publicly on release day would hurt their "security by obscurity" thing...]
 

MS_AT

I had the decoder capability in mind in #219.
In other words, you meant that Turin could show different decoder behavior compared to Strix Point and Granite Ridge? Otherwise I miss the connection and would be grateful if you could explain it.