Discussion Zen 4 Core Specifications Discussion

DisEnchantment · Aug 31, 2022

Some tidbits

A 15 layer Telescoping Metal stack has been co-optimized to deliver both high frequency and high density routing capability

This bodes well for density going forward, since they managed to increase frequency greatly ~~without adding additional metal layers~~. Probably RDNA3 will hit in the same range for density ~90MTr/mm2 and probably blazing frequency if thermal hotspots can be taken care of.
They did add a lot more transistors to support AVX512 and to increase the ROB/L2/uop cache/BTB.

I bet the second GMI burnt a lot of space albeit probably a necessary forward looking block.

Zen 5 will be a reset and will optimize the core again, à la Zen 3.

Will update if more specs show up. This time I doubt AMD will be more open.

moinmoin · Aug 31, 2022

I think Retired Engineer was the source for most of these?

https://twitter.com/x/status/1564413952108335105

Also noteworthy:

https://twitter.com/x/status/1564579004044103680

Would be nice if all the gaps would be filled officially.

DisEnchantment · Aug 31, 2022

moinmoin said:
I think Retired Engineer was the source for most of these?

Him, the Chips and Cheese article on the Gigabyte leak, publicly released material from AMD for Zen 4, and one Twitter guy, xinoassasin, who posts fragments of the manual and deletes them ASAP. I didn't want to repost what he deleted.
Unverified MTr values are from Skyjuice.
Zen 1/2/3 values are from the manuals.

naad · Aug 31, 2022

The 2nd GMI link looks like a power compromise while they're still using organic substrates. It can't be power efficient pumping that much bandwidth from one SerDes to another; it is much more efficient running two at half the frequency. Maybe they put the 2nd link to double duty for cache coherency and telemetry as well?
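A rough way to see the argument for two slower links is the textbook dynamic-power relation P ∝ C·V²·f. The sketch below is only a toy model: the function name, the `voltage_sensitivity` knob, and the idea that SerDes power follows this relation at all are assumptions for illustration, not anything AMD has published.

```python
def relative_link_power(num_links, freq_scale, voltage_sensitivity=1.0):
    """Power relative to ONE link running at full frequency.

    Toy model (assumption): per-link power scales as f * V^2, and the
    supply voltage can scale as f ** voltage_sensitivity.
      voltage_sensitivity = 0.0 -> voltage stays fixed when frequency drops
      voltage_sensitivity = 1.0 -> voltage drops linearly with frequency
    """
    voltage_scale = freq_scale ** voltage_sensitivity
    per_link = freq_scale * voltage_scale ** 2
    return num_links * per_link


# Same total bandwidth in every case: 1 link at f, or 2 links at f/2.
one_fast = relative_link_power(1, 1.0)
two_slow_fixed_v = relative_link_power(2, 0.5, voltage_sensitivity=0.0)
two_slow_scaled_v = relative_link_power(2, 0.5, voltage_sensitivity=1.0)

print(one_fast, two_slow_fixed_v, two_slow_scaled_v)
```

In this model, two half-rate links only save power if the lower frequency also allows a lower voltage; with the voltage held fixed, the two options cost the same.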

itsmydamnation · Aug 31, 2022

The front end is interesting. If I remember right, Zen can pull 32 bytes of instructions a cycle but only hits 18-19 bytes of decode a cycle. I wonder if they have improvements there, allowing more decoded instructions over a period of time, or if it just comes straight from the big uop cache.

Also, I love how L2 is sub 1%. Intel fans have been throwing around numbers like a 10% IPC improvement over ADL because of L2 cache improvements. I wonder if the last 48 hours were any kind of wake-up call for them.

moinmoin · Aug 31, 2022

naad said:
The 2nd GMI link looks like a power compromise while they're still using organic substrates. It can't be power efficient pumping that much bandwidth from one SerDes to another; it is much more efficient running two at half the frequency. Maybe they put the 2nd link to double duty for cache coherency and telemetry as well?

It is. I think we discussed a patent before (I think that's what it was) about using two links for maximum bandwidth and power gating one of them for better power efficiency and latency.

The size of the SerDes/GMI is likely due to physical necessity; you can't shrink the physical interface. With that doubled and everything else shrinking, that's bound to look really wasteful.

What's interesting is how much less space the TSVs for V-Cache appear to take up. There AMD and TSMC appear to have managed to significantly shrink the bonding interface. On Zen 3 that was quite wasteful just to enable some late-gen products (likely balanced out by enabling the development of the whole X3D tech to begin with, with Zen 4 and onward profiting from all the experience garnered).

itsmydamnation said:
Also, I love how L2 is sub 1%

The granularity of that bar seems to be rounded to whole percentages. So 1% L2 Cache, 1% Execution Engine, 3% Branch Prediction, 3% Load/Store, 5% Front End.
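To make the rounding point concrete, here is a small sketch that takes the whole-percentage values listed above and works out the range of true values each one could stand for. The helper name and the assumption of plain round-to-nearest are mine; the slide itself doesn't say how the numbers were rounded.

```python
# Whole-percentage values as listed above.
rounded = {
    "L2 Cache": 1,
    "Execution Engine": 1,
    "Branch Prediction": 3,
    "Load/Store": 3,
    "Front End": 5,
}


def rounding_bounds(value, step=1.0):
    """Range of true values that round to `value` (assumes round-to-nearest)."""
    half = step / 2
    return (max(0.0, value - half), value + half)


bounds = {name: rounding_bounds(v) for name, v in rounded.items()}
total = sum(rounded.values())
low_total = sum(low for low, _ in bounds.values())
high_total = sum(high for _, high in bounds.values())

for name, (low, high) in bounds.items():
    print(f"{name}: {low}% to {high}%")
print(f"sum of rounded values: {total}% (true sum between {low_total}% and {high_total}%)")
```

Under that assumption a bar labelled 1% could be anything from 0.5% to 1.5%, so "sub 1%" and "1%" are both consistent with the same slide.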

Overall Zen 4 is a mixed bag. It's the first Zen gen where the rise in transistors used is disproportionate to the performance improvement (remember Zen 3 excelled at increasing performance more than it increased the amount of transistors, guess that's where Zen 5 will shine again). Though that balance may favor Zen 4 some more when also considering the higher frequency it allows. Zen 4c may significantly cut back on the seeming transistor waste. Will be very interesting to see what will be the trade offs.

nicalandia · Aug 31, 2022

itsmydamnation said:
I love how L2 is sub 1%. Intel fans have been throwing around numbers like a 10% IPC improvement over ADL because of L2 cache improvements. I wonder if the last 48 hours were any kind of wake-up call for them.

Intel's memory subsystem is not as optimized as AMD's, so any improvement Intel can make there on Raptor Lake will translate to improved IPC, as shown by Chips and Cheese.

uzzi38 · Aug 31, 2022

Well this came entirely out of the blue but I guess that answers that question:

Abwx · Aug 31, 2022

moinmoin said:
Overall Zen 4 is a mixed bag. It's the first Zen gen where the rise in transistors used is disproportionate to the performance improvement (remember Zen 3 excelled at increasing performance more than it increased the amount of transistors, guess that's where Zen 5 will shine again). Though that balance may favor Zen 4 some more when also considering the higher frequency it allows. Zen 4c may significantly cut back on the seeming transistor waste. Will be very interesting to see what will be the trade offs.

If a transistor doesn't switch fast enough, you put two in series in an arrangement called a cascode. This allows you to short the input/output capacitance to ground; the gain in frequency is about 20-30% without any increase in power consumption at the same frequency.

Of course it won't be done everywhere, since not all the circuitry switches at max frequency, but it still hugely inflates the transistor count. For instance, a logic inverter needs only a pair of transistors, but increasing speed will require twice that amount, and there are inverters everywhere since they are the basic block of all other gates/logic functions.
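To put a number on that, here is a toy count. It assumes, as described above, that a speed-critical inverter costs four transistors instead of two; the 1000-inverter block and the 30% speed-critical share are made-up illustration values, not measurements of Zen 4.

```python
BASE_TRANSISTORS_PER_INVERTER = 2   # plain inverter: one pair of transistors
FAST_TRANSISTORS_PER_INVERTER = 4   # speed-optimized variant: twice the amount


def transistor_count(total_inverters, fast_inverters):
    """Total transistors when `fast_inverters` use the doubled variant."""
    slow_inverters = total_inverters - fast_inverters
    return (fast_inverters * FAST_TRANSISTORS_PER_INVERTER
            + slow_inverters * BASE_TRANSISTORS_PER_INVERTER)


# Hypothetical block: 1000 inverters, 300 of them on speed-critical paths.
baseline = transistor_count(1000, 0)
optimized = transistor_count(1000, 300)
inflation = optimized / baseline

print(baseline, optimized, inflation)
```

In this simplified picture the transistor count grows by exactly the share of gates that get the doubled variant, so even a moderate share of speed-critical logic adds up quickly.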

moinmoin · Aug 31, 2022

uzzi38 said:
Well this came entirely out of the blue but I guess that answers that question:

View attachment 66908

Sounds like the desktop/server Zen 4 IOD did take over parts of the async IMC/IF available in the mobile APUs since Renoir, very nice.

Abwx said:
If a transistor doesn't switch fast enough, you put two in series in an arrangement called a cascode. This allows you to short the input/output capacitance to ground; the gain in frequency is about 20-30% without any increase in power consumption at the same frequency.

Of course it won't be done everywhere, since not all the circuitry switches at max frequency, but it still hugely inflates the transistor count. For instance, a logic inverter needs only a pair of transistors, but increasing speed will require twice that amount, and there are inverters everywhere since they are the basic block of all other gates/logic functions.

Thanks, so AMD likely went just by area, not worrying about the transistor count for these Zen 4 cores. It will be all the more interesting to see how (half-size) Zen 4c turns out, especially frequency-wise.

arcsign · Aug 31, 2022

Complete speculation on my part… but is there a chance some of those extra 2 billion transistors are plumbing for v-cache? Maybe something more than just L3?

Markfw · Aug 31, 2022

arcsign said:
Complete speculation on my part… but is there a chance some of those extra 2 billion transistors are plumbing for v-cache? Maybe something more than just L3?

I have not seen any mention of the IGP cost in transistors. Or am I missing something? That's supposed to be a pretty powerful IGP.

gdansk · Aug 31, 2022

arcsign said:
Complete speculation on my part… but is there a chance some of those extra 2 billion transistors are plumbing for v-cache? Maybe something more than just L3?

But even stock Zen 3 has all the plumbing for L3 V-Cache? It just wastes some space, AFAIK.

BorisTheBlade82 · Sep 1, 2022

Markfw said:
I have not seen any mention of the IGP cost in transistors. Or am I missing something? That's supposed to be a pretty powerful IGP.

The IGP sits on the IOD - up until now only the core and CCD have been discussed.
I think it is rather small on 6nm with only 2CU.
Is there any information on how many IFOP links the client IOD has? 2 with 2 ports each?

uzzi38 · Sep 1, 2022

arcsign said:
Complete speculation on my part… but is there a chance some of those extra 2 billion transistors are plumbing for v-cache? Maybe something more than just L3?

Wait for more in-depth analysis on the Zen 4 core. Some interesting things will come to light.

uzzi38 · Sep 1, 2022

BorisTheBlade82 said:
The IGP sits on the IOD - up until now only the core and CCD have been discussed.
I think it is rather small on 6nm with only 2CU.

The display and media engines are going to be the main bits that eat up die area of the iGPU, a single WGP is like a couple of mm^2 iirc - practically negligible.

arcsign · Sep 1, 2022

Was basing the thought on Angstromomics talking about a 58% increase in transistor count per CCD, so that ignores the IGP stuff (think that’s on io die as mentioned by someone above?).

By plumbing, I should clarify that I mean tighter integration for stuff that one might conceivably put in a stacked die… I’m not super knowledgeable about any of this, so I don’t know what exactly that might entail, but thinking something like additional L2, some kind of extra bus between different cores bits and memory. I dunno really, just seems weird to have such a large increase in transistor count relative to what they added performance wise, especially if they mentioned trying to conserve area/trans count in order to help with energy and clocks and stuff?

thigobr · Sep 1, 2022

I just hope the links between CCD/IOD are fast enough to handle all that DDR5 bandwidth.

scineram · Sep 2, 2022

Looks like they doubled the IF links per CCD. If it uses the same IOD as Genoa then Bergamo should have 3 IF links.

naad · Sep 2, 2022

scineram said:
Looks like they doubled the IF links per CCD. If it uses the same IOD as Genoa then Bergamo should have 3 IF links.

Yup, 2 smaller ones instead of one big one, wonder how many links the I/O die has?

arcsign · Sep 2, 2022

Any idea how they identify TSVs on the die images?

BorisTheBlade82 · Sep 3, 2022

scineram said:
Looks like doubled the IF links per CCD. If it uses the same IOD as Genoa then Bergamo should have 3 IF links. View attachment 66994

Such an uneven arrangement would be quite interesting. I have no clue if this is possible.
Up until now I would think that an 8c CCX of Bergamo will only have one IF link. But there will be two CCX on a CCD. So they will use 16 out of 24 ports.
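The port arithmetic behind both guesses can be written out as below. The 24 IOD ports and the two CCX per CCD come from the posts above; the count of 8 CCDs for Bergamo is an assumption used here only to make the numbers line up, and the helper name is made up.

```python
IOD_PORTS = 24        # port count used in the post above
BERGAMO_CCDS = 8      # assumption for illustration, not stated in the thread
CCX_PER_CCD = 2       # two 8-core CCX per CCD, as suggested above


def ports_used(ccds, links_per_ccd):
    """IOD ports consumed when every CCD gets `links_per_ccd` IF links."""
    return ccds * links_per_ccd


# One IF link per CCX, two CCX per CCD.
one_link_per_ccx = ports_used(BERGAMO_CCDS, CCX_PER_CCD * 1)
# Three IF links per CCD (the uneven arrangement).
three_links_per_ccd = ports_used(BERGAMO_CCDS, 3)

print(f"1 link per CCX: {one_link_per_ccx} of {IOD_PORTS} ports")
print(f"3 links per CCD: {three_links_per_ccd} of {IOD_PORTS} ports")
```

With those inputs, one link per CCX leaves a third of the ports idle, while three links per CCD would use all of them.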

uzzi38 · Sep 3, 2022

arcsign said:
Was basing the thought on Angstromomics talking about a 58% increase in transistor count per CCD, so that ignores the IGP stuff (think that’s on io die as mentioned by someone above?).

By plumbing, I should clarify that I mean tighter integration for stuff that one might conceivably put in a stacked die… I’m not super knowledgeable about any of this, so I don’t know what exactly that might entail, but thinking something like additional L2, some kind of extra bus between different cores bits and memory. I dunno really, just seems weird to have such a large increase in transistor count relative to what they added performance wise, especially if they mentioned trying to conserve area/trans count in order to help with energy and clocks and stuff?

Sorry, I wasn't telling you to wait for no reason. I know already where some of the die area went because of the testing done by a developer of a certain application, and I'm saying you'll hear about it soon. Probably. Depends on when AMD let them release their findings.

DisEnchantment · Sep 3, 2022

https://twitter.com/x/status/1565846947918905347

This suggests to me that they are using a lot of automatic place and route, in contrast to Intel, which has had a fairly stable floor plan for quite a while.
It seems like Mike Clark's subtle comment on AMD using automation and AI shows their telltale signs here.
No way you can change so much of the floor plan every generation if done by hand without spending a ton of time and man hours.

With all these supercomputers behind them, I am wondering if the engineers doing the physical implementation can shortcut a lot of the process and instead spend more time simulating the RTL design,
instead of waiting six months to a year to find out whether the design is performant enough or to address late-game competitive targets.

Tuna-Fish · Sep 3, 2022

The FPU redesign really surprised me too.

Basically everything I know about physical design is screaming at me that the longest latency path in the FPU would be shorter if the execution units were split on both sides of the reg file, like they are on Zen 3. There has to be some constraint I don't understand to make this design make sense.

Looking at that photo, I would expect some of the less commonly used FP/SIMD instructions to get a few extra cycles of latency on Zen4, simply because they are further away from their inputs.
