Discussion Zen 4 Core Specifications Discussion

DisEnchantment

Golden Member
Mar 3, 2017
1,590
5,722
136

Some tidbits
  • A 15 layer Telescoping Metal stack has been co-optimized to deliver both high frequency and high density routing capability
This bodes well for density going forward, since they managed to increase frequency greatly without adding additional metal layers. RDNA3 will probably land in the same density range (~90 MTr/mm²), with blazing frequency if thermal hotspots can be taken care of.
They did add a lot more transistors to support AVX-512 and the larger ROB, L2, uop cache and BTB.
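
For anyone wanting to sanity-check the ~90 MTr/mm² figure, here is a rough sketch of the arithmetic. The transistor counts and die areas below are assumptions on my part (commonly quoted CCD figures, not taken from this thread, and unverified), so treat the output as illustrative only.

```python
def density_mtr_per_mm2(transistors: float, area_mm2: float) -> float:
    """Return transistor density in millions of transistors per mm^2."""
    return transistors / 1e6 / area_mm2


# ASSUMED, unverified inputs: per-CCD transistor count and die area.
zen3_ccd = {"transistors": 4.15e9, "area_mm2": 80.7}
zen4_ccd = {"transistors": 6.57e9, "area_mm2": 71.0}

zen3_density = density_mtr_per_mm2(**zen3_ccd)
zen4_density = density_mtr_per_mm2(**zen4_ccd)

# Relative growth in transistor count from one CCD to the next, in percent.
transistor_growth_pct = (zen4_ccd["transistors"] / zen3_ccd["transistors"] - 1) * 100

print(f"Zen 3 CCD: {zen3_density:.1f} MTr/mm2")
print(f"Zen 4 CCD: {zen4_density:.1f} MTr/mm2")
print(f"Transistor growth: {transistor_growth_pct:.0f}%")
```

With these assumed inputs the Zen 4 CCD lands a little above 90 MTr/mm²; swap in better numbers if they turn up.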

I bet the second GMI burnt a lot of space, albeit it is probably a necessary forward-looking block.

Zen 5 will be a reset and will optimize the core again, à la Zen 3.

Will be updated if more specs show up. This time I doubt AMD will be more open.
 

DisEnchantment

Golden Member
Mar 3, 2017
1,590
5,722
136
I think Retired Engineer was the source for most of these?
Him, the Chips and Cheese article on the Gigabyte leak, publicly released AMD material for Zen 4, and one Twitter guy, xinoassasin, who posts fragments of the manual and deletes them ASAP. I didn't want to repost what he deleted.
Unverified MTr values from Skyjuice.
Zen1/2/3 from manuals
 

naad

Member
May 31, 2022
63
176
66
The 2nd GMI link looks like a power compromise while they're still using organic substrates. It can't be power efficient pumping that much bandwidth from one SerDes to another; running two at half the frequency is much more efficient. Maybe the 2nd link does double duty for cache coherency and telemetry as well?
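
A back-of-the-envelope sketch of the "two links at half the frequency" argument, using the textbook CMOS switching-power approximation P = C·V²·f. This is only a toy model: the normalized capacitance and the 0.8 voltage ratio are made-up illustrative values, and a real SerDes has analog and static power that this ignores.

```python
def dynamic_power(cap: float, voltage: float, freq: float) -> float:
    """Textbook CMOS switching-power approximation: P = C * V^2 * f."""
    return cap * voltage**2 * freq


C = 1.0  # normalized switched capacitance per link (illustrative)

# One link at full frequency and full voltage.
one_fast = dynamic_power(C, voltage=1.0, freq=1.0)

# Two links at half frequency, same voltage: same total bandwidth, same power.
two_slow_same_v = 2 * dynamic_power(C, voltage=1.0, freq=0.5)

# Two links at half frequency, ASSUMING the lower clock allows 0.8x voltage.
two_slow_low_v = 2 * dynamic_power(C, voltage=0.8, freq=0.5)

print(one_fast, two_slow_same_v, two_slow_low_v)
```

In this model the saving comes entirely from the voltage reduction that the lower clock permits; at equal voltage the two configurations draw the same switching power.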
 

itsmydamnation

Platinum Member
Feb 6, 2011
2,743
3,069
136
The front end is interesting. If I remember right, Zen can pull 32 bytes of instructions a cycle but only hits 18-19 bytes of decode a cycle. I wonder if they have improvements there, allowing more decoded instructions over a period of time, or if it just comes straight from the big uop cache.

Also, I love how L2 is sub 1%. Intel fans have been throwing around numbers like a 10% IPC improvement over ADL because of L2 cache improvements. I wonder if the last 48 hours were any kind of wake-up call for them.
 

moinmoin

Diamond Member
Jun 1, 2017
4,933
7,618
136
The 2nd GMI link looks like a power compromise while they're still using organic substrates. It can't be power efficient pumping that much bandwidth from one SerDes to another; running two at half the frequency is much more efficient. Maybe the 2nd link does double duty for cache coherency and telemetry as well?
It is. I think we discussed a patent (I think it was) before about using two links for maximum bandwidth and power gating one for better power efficiency and latency.

The size of the SerDes/GMI is likely due to physical necessity; you can't shrink the physical interface. With that doubled and everything else shrinking, it's bound to look really wasteful.

What's interesting is how much less space the TSVs for V-Cache appear to take up. There AMD and TSMC appear to have managed to significantly shrink the bonding interface. On Zen 3 that was quite wasteful just to enable some late-gen products (likely balanced out by enabling the development of the whole X3D tech to begin with, with Zen 4 and onward profiting from all the experience garnered).

Also, I love how L2 is sub 1%
The granularity of that bar seems to be rounded to whole percentages. So 1% L2 Cache, 1% Execution Engine, 3% Branch Prediction, 3% Load/Store, 5% Front End.
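
A quick sketch of what whole-percent rounding does to that bar. The rounding mode (round half up) is my assumption, and the 0.6 and 0.4 inputs are hypothetical; only the five displayed values come from the post above.

```python
import math


def round_half_up(x: float) -> int:
    """Round to the nearest whole percent, with .5 going up (assumed mode)."""
    return math.floor(x + 0.5)


# Displayed contributions in percent, as read off the bar.
shown = {
    "L2 Cache": 1,
    "Execution Engine": 1,
    "Branch Prediction": 3,
    "Load/Store": 3,
    "Front End": 5,
}

shown_total = sum(shown.values())

# Each true value could sit anywhere in [shown - 0.5, shown + 0.5).
low_total = sum(v - 0.5 for v in shown.values())
high_total = sum(v + 0.5 for v in shown.values())

print(shown_total, low_total, high_total)
print(round_half_up(0.6), round_half_up(0.4))  # hypothetical sub-1% inputs
```

So a true L2 contribution of, say, 0.6% would still display as 1%, and the displayed parts only pin the real total down to within a couple of points either way.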

Overall Zen 4 is a mixed bag. It's the first Zen gen where the rise in transistors used is disproportionate to the performance improvement (remember Zen 3 excelled at increasing performance more than it increased the amount of transistors, guess that's where Zen 5 will shine again). Though that balance may favor Zen 4 some more when also considering the higher frequency it allows. Zen 4c may significantly cut back on the seeming transistor waste. Will be very interesting to see what will be the trade offs.
 

nicalandia

Diamond Member
Jan 10, 2019
3,330
5,281
136
I love how L2 is sub 1%. Intel fans have been throwing around numbers like a 10% IPC improvement over ADL because of L2 cache improvements. I wonder if the last 48 hours were any kind of wake-up call for them.
Intel's memory subsystem is not as optimized as AMD's, so any improvement Intel can make on Raptor Lake will translate to improved IPC, as shown by Chips and Cheese.
 

Abwx

Lifer
Apr 2, 2011
10,847
3,296
136
Overall Zen 4 is a mixed bag. It's the first Zen gen where the rise in transistors used is disproportionate to the performance improvement (remember Zen 3 excelled at increasing performance more than it increased the amount of transistors, guess that's where Zen 5 will shine again). Though that balance may favor Zen 4 some more when also considering the higher frequency it allows. Zen 4c may significantly cut back on the seeming transistor waste. Will be very interesting to see what will be the trade offs.

If a transistor doesn't switch fast enough, you put two in series in an arrangement called a cascode. This allows shorting the input/output capacitance to ground; the gain in frequency is about 20-30% without any increase in power consumption at the same frequency.

Of course it won't be done everywhere, since not all the circuitry switches at max frequency, but it still hugely inflates the transistor count. For instance, a logic inverter needs only a pair of transistors, but increasing its speed requires twice that amount, and there are inverters everywhere since they are the basic building block of all other gates and logic functions.
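
To put a rough number on that inflation, here is a toy calculation. The base count and the fraction of speed-critical logic are hypothetical inputs chosen for illustration, not figures from any AMD material.

```python
def inflated_count(base: float, critical_fraction: float, factor: float = 2.0) -> float:
    """Transistor count after multiplying only the speed-critical share by `factor`.

    base              -- transistor count of the unmodified design
    critical_fraction -- share of transistors on speed-critical paths (0..1)
    factor            -- growth of the critical share (2.0 = doubled)
    """
    return base * ((1.0 - critical_fraction) + critical_fraction * factor)


# HYPOTHETICAL example: 1M transistors, a quarter of them speed-critical.
base = 1_000_000
total = inflated_count(base, critical_fraction=0.25)
growth_pct = (total / base - 1) * 100

print(f"{total:.0f} transistors, +{growth_pct:.0f}%")
```

Even when only a quarter of the logic gets the doubled structures, the total grows by a quarter, which is the sense in which a partial rollout still inflates the count.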
 

moinmoin

Diamond Member
Jun 1, 2017
4,933
7,618
136
Well this came entirely out of the blue but I guess that answers that question:

View attachment 66908
Sounds like the desktop/server Zen 4 IOD did take over parts of the async IMC/IF available in the mobile APUs since Renoir, very nice.

If a transistor doesn't switch fast enough, you put two in series in an arrangement called a cascode. This allows shorting the input/output capacitance to ground; the gain in frequency is about 20-30% without any increase in power consumption at the same frequency.

Of course it won't be done everywhere, since not all the circuitry switches at max frequency, but it still hugely inflates the transistor count. For instance, a logic inverter needs only a pair of transistors, but increasing its speed requires twice that amount, and there are inverters everywhere since they are the basic building block of all other gates and logic functions.
Thanks, so AMD likely went just by area, not worrying about the transistor count for these Zen 4 cores. It will be all the more interesting to see how (half-size) Zen 4c turns out, especially frequency-wise.
 

arcsign

Junior Member
Jul 26, 2009
8
26
91
Complete speculation on my part… but is there a chance some of those extra 2 billion transistors are plumbing for v-cache? Maybe something more than just L3?
 

Markfw

Moderator Emeritus, Elite Member
May 16, 2002
25,478
14,434
136
Complete speculation on my part… but is there a chance some of those extra 2 billion transistors are plumbing for v-cache? Maybe something more than just L3?
I have not seen any mention of the IGP cost in transistors. Or am I missing something? That's supposed to be a pretty powerful IGP.
 

gdansk

Golden Member
Feb 8, 2011
1,973
2,346
136
Complete speculation on my part… but is there a chance some of those extra 2 billion transistors are plumbing for v-cache? Maybe something more than just L3?
But even stock Zen 3 has all the plumbing for L3 V-Cache? It just wastes some space, afaik.
 

BorisTheBlade82

Senior member
May 1, 2020
660
1,003
106
I have not seen any mention of the IGP cost in transistors. Or am I missing something? That's supposed to be a pretty powerful IGP.
The IGP sits on the IOD - up until now only the core and CCD have been discussed.
I think it is rather small on 6nm with only 2CU.
Is there any information on how many IFOP links the client IOD has? 2 with 2 ports each?
 

arcsign

Junior Member
Jul 26, 2009
8
26
91
Was basing the thought on Angstronomics talking about a 58% increase in transistor count per CCD, so that ignores the IGP stuff (think that’s on the IO die, as mentioned by someone above?).

By plumbing, I should clarify that I mean tighter integration for stuff that one might conceivably put in a stacked die… I’m not super knowledgeable about any of this, so I don’t know what exactly that might entail, but thinking something like additional L2, some kind of extra bus between different cores bits and memory. I dunno really, just seems weird to have such a large increase in transistor count relative to what they added performance wise, especially if they mentioned trying to conserve area/trans count in order to help with energy and clocks and stuff?
 

BorisTheBlade82

Senior member
May 1, 2020
660
1,003
106
Looks like doubled the IF links per CCD. If it uses the same IOD as Genoa then Bergamo should have 3 IF links. View attachment 66994
Such an uneven arrangement would be quite interesting. I have no clue if this is possible.
Until now I had assumed that an 8c CCX of Bergamo will only have one IF link. But there will be two CCX on a CCD, so they will use 16 out of 24 ports.
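
The port arithmetic behind that, spelled out. The CCD count of 8 is my assumption (128 cores at 16 per CCD) and is not stated in the post; the 24-port IOD figure and the one-link-per-CCX guess are taken from the post as given.

```python
CCDS = 8            # ASSUMPTION: 128 cores / 16 cores per CCD
CCX_PER_CCD = 2     # two 8c CCX per CCD, per the post
LINKS_PER_CCX = 1   # the post's guess: one IF link per CCX
IOD_PORTS = 24      # port count on the IOD, as used in the post

used_ports = CCDS * CCX_PER_CCD * LINKS_PER_CCX
free_ports = IOD_PORTS - used_ports

print(f"{used_ports} of {IOD_PORTS} ports used, {free_ports} unused")
```

Under those assumptions a third of the ports would go unused, while 3 links per CCD would fill all 24 exactly but could not be split evenly across two CCX.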
 

uzzi38

Platinum Member
Oct 16, 2019
2,565
5,568
146
Was basing the thought on Angstronomics talking about a 58% increase in transistor count per CCD, so that ignores the IGP stuff (think that’s on the IO die, as mentioned by someone above?).

By plumbing, I should clarify that I mean tighter integration for stuff that one might conceivably put in a stacked die… I’m not super knowledgeable about any of this, so I don’t know what exactly that might entail, but thinking something like additional L2, some kind of extra bus between different cores bits and memory. I dunno really, just seems weird to have such a large increase in transistor count relative to what they added performance wise, especially if they mentioned trying to conserve area/trans count in order to help with energy and clocks and stuff?
Sorry, I wasn't telling you to wait for no reason. I know already where some of the die area went because of the testing done by a developer of a certain application, and I'm saying you'll hear about it soon. Probably. Depends on when AMD let them release their findings.
 

DisEnchantment

Golden Member
Mar 3, 2017
1,590
5,722
136
This suggests to me that they are using a lot of automatic place and route, in contrast to Intel, which has kept a fairly stable floor plan for quite a while.
Mike Clark's subtle comment on AMD using automation and AI seems to show its telltale signs here.
No way you can change so much of the floor plan every generation by hand without spending a ton of time and man-hours.

With all these supercomputers behind them, I am wondering if the engineers doing the physical implementation can shortcut a lot of the process and instead spend more time simulating the RTL design, instead of waiting six months to a year to find out if the design is performant enough or to address late-game competitive targets.
 

Tuna-Fish

Golden Member
Mar 4, 2011
1,324
1,461
136
The FPU redesign really surprised me too.

Basically everything I know about physical design is screaming at me that the longest latency path in the FPU would be shorter if the execution units were split on both sides of the reg file, like they are on Zen 3. There has to be some constraint I don't understand to make this design make sense.

Looking at that photo, I would expect some of the less commonly used FP/SIMD instructions to get a few extra cycles of latency on Zen4, simply because they are further away from their inputs.