Discussion Speculation: Zen 4 (EPYC 4 "Genoa", Ryzen 7000, etc.)

Vattila · Oct 6, 2019

Except for the details about the improvements in the microarchitecture, we now know pretty well what to expect with Zen 3.

The leaked presentation by AMD Senior Manager Martin Hilgeman shows that EPYC 3 "Milan" will, as promised and expected, reuse the current platform (SP3), and the system architecture and packaging look to be the same, with the same 9-die chiplet design and the same maximum core and thread count (no SMT-4, contrary to rumour). The biggest change revealed so far is the enlargement of the compute complex from 4 cores to 8 cores, all sharing a larger L3 cache ("32+ MB", likely to double to 64 MB, I think).

Hilgeman's slides did also show that EPYC 4 "Genoa" is in the definition phase (or was at the time of the presentation in September, at least), and will come with a new platform (SP5), with new memory support (likely DDR5).

What else do you think we will see with Zen 4? PCI-Express 5 support? Increased core-count? 4-way SMT? New packaging (interposer, 2.5D, 3D)? Integrated memory on package (HBM)?

Vote in the poll and share your thoughts!

jpiniero · Oct 26, 2021

Ajay said:
There is certainly the potential to feed more bandwidth into the core with DDR5 and AVX-512 execution units would definitely eat that up. It will be interesting to see how DDR5 will affect various applications and games, in light of the longer latency.

JEDEC vs JEDEC, the latency increase isn't that much. The people complaining about the DDR5 latency are comparing JEDEC to XMP (which I doubt is used in servers).

Thibsie · Oct 26, 2021

Rumours pointed to Zen5 being (with many other things) a hybrid core, one Zen4 core and one Zen5 core with shared registers.
I dunno anything about the possibility of this but I'm very curious though.

Tuna-Fish · Oct 26, 2021

Thibsie said:
Rumours pointed to Zen5 being (with many other things) a hybrid core, one Zen4 core and one Zen5 core with shared registers.
I dunno anything about the possibility of this but I'm very curious though.

This makes literally zero sense.

DisEnchantment · Oct 26, 2021

Regardless of whether Zen4 is just a redesigned core or not, with L3 being same and a confirmed die size of 72.226mm2 on N5P, Zen4 core will pack anything between 25-40% more transistors
[If we assume horrible scaling, e.g. 70% of the Zen3 core will scale a measly 1.1x, while the 30% scale at 1.5x (vs TSMC's advertised 1.8x)]
All the new interconnect logic is in the cIOD (CXL, GenZ, MPDMA, NVDIMM/SCM etc.), so it doesn't contribute to the CCD die area.
Zen4 is far from being an optical shrink for sure.

Zen3 core over Zen2 core is just ~9% increase within similar power envelope.
Looking at the chart below, N7 -->N5P [23%perf/-49%power reduction].
If AMD keeps same clocks, the efficiency gain is enormous.
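
The "horrible scaling" assumption above can be turned into a single blended density factor. The sketch below is only that arithmetic, using the 70%/1.1x and 30%/1.5x split from the post; the function name is made up for illustration, and it does not by itself reproduce the 25-40% transistor estimate, which also depends on how much of the die the core occupies.

```python
# Blended density scaling for logic whose parts shrink by different factors.
# The fractions and scaling factors are the ones quoted in the post above;
# blended_density_gain is a hypothetical helper name.

def blended_density_gain(parts):
    """parts: list of (area_fraction, density_scaling) pairs.

    Returns old area / new area for the same logic, i.e. the overall
    density gain when each part shrinks by its own factor.
    """
    total_fraction = sum(fraction for fraction, _ in parts)
    if abs(total_fraction - 1.0) > 1e-9:
        raise ValueError("area fractions must sum to 1")
    # Each part's new area is its old area divided by its scaling factor.
    new_area = sum(fraction / scaling for fraction, scaling in parts)
    return 1.0 / new_area

pessimistic = blended_density_gain([(0.70, 1.1), (0.30, 1.5)])
advertised = blended_density_gain([(1.00, 1.8)])
print(f"pessimistic blend: {pessimistic:.2f}x, advertised: {advertised:.2f}x")
```

Under those assumptions the blend works out to roughly 1.2x, well short of the advertised 1.8x, so any transistor growth beyond that has to come from the area freed up elsewhere on the die.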

maddie · Oct 26, 2021

DisEnchantment said:
Regardless of whether Zen4 is just a redesigned core or not, with L3 being same and a confirmed die size of 72.226mm2 on N5P, Zen4 core will pack anything between 25-40% more transistors
[If we assume horrible scaling, e.g. 70% of the Zen3 core will scale a measly 1.1x, while the 30% scale at 1.5x (vs TSMC's advertised 1.8x)]
All the new interconnect logic is in the cIOD (CXL, GenZ, MPDMA, NVDIMM/SCM etc.), so it doesn't contribute to the CCD die area.
Zen4 is far from being an optical shrink for sure.

Zen3 core over Zen2 core is just ~9% increase within similar power envelope.
Looking at the chart below, N7 -->N5P [23%perf/-49%power reduction].
If AMD keeps same clocks, the efficiency gain is enormous.

View attachment 51936

Minor correction. I think it's really a 40% power reduction in total if following the specs.
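
For a sense of what the two figures mean at unchanged clocks, here is a minimal sketch that converts a fractional power reduction at the same performance into a perf/W ratio. It uses only the 49% and 40% numbers from the posts above; the helper name is made up, and real silicon will not follow the foundry's headline numbers exactly.

```python
# Perf/W gain when performance is held constant and power drops by a fraction.
# perf_per_watt_gain is a hypothetical helper; the inputs are the two
# power-reduction figures discussed in the thread.

def perf_per_watt_gain(power_reduction):
    """Efficiency gain at unchanged performance for a fractional power cut."""
    if not 0.0 <= power_reduction < 1.0:
        raise ValueError("power_reduction must be in [0, 1)")
    return 1.0 / (1.0 - power_reduction)

for label, reduction in (("49% (original post)", 0.49), ("40% (correction)", 0.40)):
    print(f"{label}: {perf_per_watt_gain(reduction):.2f}x perf/W")
```

Either way the gain at iso-clock is large: about 1.96x with the 49% figure and about 1.67x with the 40% figure.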

Gideon · Oct 26, 2021

DisEnchantment said:
Regardless of whether Zen4 is just a redesigned core or not, with L3 being same and a confirmed die size of 72.226mm2 on N5P,

I must have missed something. It's confirmed to be N5P not vanilla N5?

Saylick · Oct 26, 2021

Gideon said:
I must have missed something. It's confirmed to be N5P not vanilla N5?

AMD haven't come out to say what version of N5 they will use, but the rumors have said that they will be using an enhanced version or something along those lines. It could be N5P or something more specific for AMD.

AMD best-buds, TSMC, designed an 'enhanced' 5nm node for its future Ryzen chips

And potentially for its RDNA 3 graphics cards too.

www.pcgamer.com

https://twitter.com/x/status/1249925996209266688

https://twitter.com/x/status/1249925998256115712

Saylick · Oct 26, 2021

Thibsie said:
Rumours pointed to Zen5 being (with many other things) a hybrid core, one Zen4 core and one Zen5 core with shared registers.
I dunno anything about the possibility of this but I'm very curious though.

Tuna-Fish said:
This makes literally zero sense.

Make of the rumor what you will: https://videocardz.com/newz/amd-patents-a-task-transition-method-between-big-and-little-processors

This is all based on patents AMD submitted.

Thibsie · Oct 26, 2021

That was it, thanks

NostaSeronx · Oct 26, 2021

Panino Manino said:
Now I know who is the right person to ask about K9 (it wasn't Jim!).

K9 was Mitch Alsup -> 65nm 5 GHz Opteron
K10 was Charles R. Moore -> 45nm Bulldozer
//Specifically, the two closest to release versions. With Alsup's K9 taping out then being canned, and Moore's K10 being shown on roadmaps then canned.

Greyhound = 10h, since at that point they stopped publicizing Kx names.

2005/2007 being two key points of K9's development, as guesstimated by AMD in 2003:

-> The Sunnyvale, Calif.-based company is "working like crazy" on the K9, an underlying architecture, or blueprint, for a new generation of chips, said Fred Weber, chief technology officer of AMD's computational products group, during an interview at the Microprocessor Forum here Wednesday.
-> Chips based on the K9 architecture will likely be released--at least in sample quantities--by the second half of 2005, Weber said.
-> "We will have a multicore product," Weber said.
Which coincides with the above from AMD's Fred Weber.

K9's Trace Cache location:

jamescox · Oct 27, 2021

BorisTheBlade82 said:
@moinmoin
I think we can all agree that going back monolith is not THE solution. Chiplets have clear benefits and are the way to go. Now there are taxes because of the Interconnect. The IOD needs that much power because it needs to drive all those bits via the interconnect.
With the current interconnect via organic package you need around 15 pJ/bit of energy. With something like EMIB or InFO-LSI you only need 1-2 pJ/bit. So this way of packaging is clearly the way to go. And the competitor we dare not name will clearly use such a solution in order to scale its newly announced SoC to 2x and 4x.

I believe the original ISSCC paper on the Zeppelin die from 2018 said 11 pJ/bit for IFIS and 2 pJ/bit for IFOP. This doesn’t seem to jibe with your numbers; where are they from? The IFOP can be highly optimized since the links run on package with a maximum distance of 1 to 2 cm. I would expect the power to have increased, but the speed has also increased significantly, so it is unclear where the current power per bit will be. If the original IFOP was 2 pJ/bit, I would expect a silicon bridge to be significantly lower than that.

Connecting the cpu die with silicon bridges is problematic. They can’t do long runs so the die have to be placed directly adjacent. This might work for 4 or 6 dies but would be difficult for 8 or more. The current packages route the serdes links for the outer chips under the inner chips. To use embedded silicon bridges, it seems like they would need to daisy chain them. That isn’t necessarily a bad solution. It would just be an extra hop across a silicon bridge, but you would need to route across an entire die. It seems like it would be better to stack die in that case. Extreme core count processors are generally lower clock anyway.

I have wondered if they would make a modular IO die such that multiple smaller IO die could be used for Epyc. That might allow some other options like mounting the cpu die close to the IO die with silicon bridges and distributing the IO die with serdes connections. Later they could move to a stacked solution with embedded silicon interconnect between IO die. It seems more like the standard Zen4 Epyc might be very conservative such that it is very similar to current Epyc processors. We might get a less conservative (more stacking) version of Zen 4 later leading to a zen 5 stacked version.

BorisTheBlade82 · Oct 27, 2021

@jamescox
As it happens, some months ago I made a mockup of what EPYC Genoa could look like with silicon die interconnects. I know that this is more wishful thinking than anything else.

To your question about the numbers: these are really hard to come by, and I just realized that I remembered them totally wrong. What I gathered some time ago was close to your numbers: 1-2 pJ/bit for IFOP, 0.1-0.2 for EMIB/CoWoS etc. and, as a really rough estimate, around 0.05-0.1 for 5 mm on-die.
The point still is that advanced packaging cuts interconnect energy by around 10x and diminishes the advantage of a monolith by a huge amount.

A Look at Intel Lakefield: A 3D-Stacked Single-ISA Heterogeneous Penta-Core SoC

A look at Lakefield, Intel's new mobile-class heterogeneous penta-core SoC built using two dies 3D-stacked face-to-face using the company Foveros packaging technology.

fuse.wikichip.org

Figure 5. Data movement is overtaking computation as the most dominant...

Download scientific diagram | Data movement is overtaking computation as the most dominant cost of a system both in terms of dollars and in terms of energy consumption. Consequently, we should be more explicit about reasoning about data movement. This diagram shows the cost of operations or data...

www.researchgate.net

As to your suggestion with modular IODs. That sounds quite interesting as well. This is a topic where a lot of developments can be imagined - especially with the lack of expert knowledge I have 😉
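
To put the pJ/bit figures traded above into watts, here is a minimal sketch. The energy values are the ones quoted in this exchange (upper ends of the ranges); the 100 GB/s per-link bandwidth is purely a made-up round number for illustration, not a figure for any real EPYC link, and the function name is hypothetical.

```python
# Link power = energy per bit * bits per second.
# Energy figures come from the posts above; the bandwidth is an ASSUMPTION
# chosen only to make the comparison concrete.

PICOJOULE = 1e-12
ASSUMED_BANDWIDTH_GBS = 100.0  # hypothetical GB/s per link, not a real spec

def link_power_watts(energy_pj_per_bit, bandwidth_gbytes_per_s):
    """Power in watts needed to move data at the given rate and energy cost."""
    bits_per_second = bandwidth_gbytes_per_s * 1e9 * 8
    return energy_pj_per_bit * PICOJOULE * bits_per_second

quoted_pj_per_bit = {
    "IFIS (2018 ISSCC figure)": 11.0,
    "IFOP (2018 ISSCC figure)": 2.0,
    "EMIB/CoWoS (rough estimate)": 0.2,
    "5 mm on-die (rough estimate)": 0.1,
}

for name, energy in quoted_pj_per_bit.items():
    watts = link_power_watts(energy, ASSUMED_BANDWIDTH_GBS)
    print(f"{name}: {watts:.2f} W at {ASSUMED_BANDWIDTH_GBS:.0f} GB/s")
```

The absolute watts scale linearly with whatever bandwidth is plugged in; the ratio between packaging options is what carries over, and that ratio is the roughly 10x mentioned above.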

Ajay · Oct 27, 2021

DisEnchantment said:
Regardless of whether Zen4 is just a redesigned core or not, with L3 being same and a confirmed die size of 72.226mm2 on N5P, Zen4 core will pack anything between 25-40% more transistors

That's a lot more xtors than needed just for the AVX-512 registers, pipelines, etc. Now I'm really curious what's going on. I think those who said this will be like the Zen1 to Zen2 improvement may be correct. There's the usual suspects like op cache, retire buffer, TLBs, etc. But, how about a larger L2? I wish the Zen3 Wikichip page had as much detail as he had for Zen1.

DisEnchantment said:
Zen4 is far from being an optical shrink for sure.

Did someone think otherwise?

DisEnchantment said:
If AMD keeps same clocks, the efficiency gain is enormous.

Heck, AMD could pip the top clocks by 5% for improved ST and still beat Zen3 on power usage by a good margin (for Raphael at least).

Eh, I'm getting over excited based on a short interview with Mike Clark. Mike deserves a gold star for that interview (and Ian too). Lisa Su must love this guy.

moinmoin · Oct 27, 2021

BorisTheBlade82 said:
As it happens, some months ago I made a mockup of what EPYC Genoa could look like with silicon die interconnects. I know that this is more wishful thinking than anything else.

The problem with this approach is that in the current CCDs the IFOP links are in the center of the die. Will be interesting how links at the edges will behave latency wise with distance being more different between near and far cores. Also if AMD were planning to just place the CCDs along the sides of an IOD they could have chosen to create a far more rectangular aspect ratio for the package to facilitate this. So I expect them to choose some different approaches we may not be thinking of yet.

jamescox · Oct 27, 2021

moinmoin said:
The problem with this approach is that in the current CCDs the IFOP links are in the center of the die. Will be interesting how links at the edges will behave latency wise with distance being more different between near and far cores. Also if AMD were planning to just place the CCDs along the sides of an IOD they could have chosen to create a far more rectangular aspect ratio for the package to facilitate this. So I expect them to choose some different approaches we may not be thinking of yet.

The IFOP die area was in the middle, between the 2 CCX on Zen 2 but it moved to the edge of the die on Zen 3.

moinmoin · Oct 27, 2021

jamescox said:
The IFOP die area was in the middle, between the 2 CCX on Zen 2 but it moved to the edge of the die on Zen 3.

Duh, you're right indeed (they moved it there to make room for the 3D V-Cache).

Though it's on a long edge not a short one (where it couldn't link directly to the L3$ with the current layout).

jamescox · Oct 27, 2021

Ajay said:
That's a lot more xtors than needed just for the AVX-512 registers, pipelines, etc. Now I'm really curious what's going on. I think those who said this will be like the Zen1 to Zen2 improvement may be correct. There's the usual suspects like op cache, retire buffer, TLBs, etc. But, how about a larger L2? I wish the Zen3 Wikichip page had as much detail as he had for Zen1.

Did someone think otherwise?

Heck, AMD could pip the top clocks by 5% for improved ST and still beat Zen3 on power usage by a good margin (for Raphael at least).

Eh, I'm getting over excited based on a short interview with Mike Clark. Mike deserves a gold star for that interview (and Ian too). Lisa Su must love this guy.

I don’t know how reliable the rumors are about L2 cache size increases. The larger vector units might take quite a lot of die area, but I have wondered if they might move to a large, shared L2 similar to Apple designs. A lot of applications really like large, fast L2 cache. That would also allow disabling cores for maximum single core performance. Large L3 cache could be stacked so spending more die area on fast L2 could be a good way to go.

leoneazzurro · Oct 27, 2021

Ajay said:
That's a lot more xtors than needed just for the AVX-512 registers, pipelines, etc. Now I'm really curious what's going on. I think those who said this will be like the Zen1 to Zen2 improvement may be correct. There's the usual suspects like op cache, retire buffer, TLBs, etc. But, how about a larger L2? I wish the Zen3 Wikichip page had as much detail as he had for Zen1.

I think the Gigabyte leaks on AM5 mainboards already revealed that Zen4 will have 1Mbyte of L2 cache.

Details on the Gigabyte Leak

Recently, a ransomware group leaked data from Gigabyte in an attempt to extort payment. That’s been well covered by other outlets (please everyone, secure your networks), so here we’re …

chipsandcheese.com

moinmoin · Oct 27, 2021

If one looks at the changes in the cores from Zen to Zen 2 and the ones from Zen 2 to 3 one can notice that the latter makes mostly architectural changes while the former does mostly size changes (wider, larger, more, needing more die area). I'm expecting Zen 4 to follow the pattern of the former.

The rhythm seems to be:
- Ground up re-design, same node optimization. (~Zen, Zen 3)
- Same design optimized and extended to make good use of the additional area afforded by new smaller node. (Zen 2, Zen 4?)

That'd make Mike Clark's excitement about Zen 5 understandable as well considering that's the next ground up re-design in the queue, the first with AMD being the healthy company it is nowadays.

Btw.

Mike Clark said:
So every three years, we're pretty much redesigning it all.

New Zen gen only every 18 months confirmed. @DrMrLordX vindicated

(The interview is actually a little fuzzy on that since later on they talk about another three years later being Zen 8, not 7. But that's by Ian and Clark just seems to play along without really confirming or denying it.)

jamescox · Oct 27, 2021

BorisTheBlade82 said:
@jamescox
As it happens, some months ago I made a mockup of what EPYC Genoa could look like with silicon die interconnects. I know that this is more wishful thinking than anything else.

To your question about the numbers: these are really hard to come by, and I just realized that I remembered them totally wrong. What I gathered some time ago was close to your numbers: 1-2 pJ/bit for IFOP, 0.1-0.2 for EMIB/CoWoS etc. and, as a really rough estimate, around 0.05-0.1 for 5 mm on-die.
The point still is that advanced packaging cuts interconnect energy by around 10x and diminishes the advantage of a monolith by a huge amount.

A Look at Intel Lakefield: A 3D-Stacked Single-ISA Heterogeneous Penta-Core SoC

A look at Lakefield, Intel's new mobile-class heterogeneous penta-core SoC built using two dies 3D-stacked face-to-face using the company Foveros packaging technology.

fuse.wikichip.org

Figure 5. Data movement is overtaking computation as the most dominant...

Download scientific diagram | Data movement is overtaking computation as the most dominant cost of a system both in terms of dollars and in terms of energy consumption. Consequently, we should be more explicit about reasoning about data movement. This diagram shows the cost of operations or data...

www.researchgate.net

As to your suggestion with modular IODs. That sounds quite interesting as well. This is a topic where a lot of developments can be imagined - especially with the lack of expert knowledge I have 😉

The mock-up looks like it would fit better rotated 90 degrees. I am still expecting serdes in the Genoa implementation. I suspect there will be a higher end device that comes a bit later that makes more use of stacking; that might be a 128 core variant. If they do that, it could be essentially a test run for future Zen 5 Epyc. Might only be low volume, very high price, HPC though. There are so many possibilities with stacking that it is very difficult to predict. It sounds like Intel will have an HPC cpu with HBM eventually, so they likely need to use some embedded silicon bridges or interposers with HBM to compete with that. I don’t know if massive, stacked L3 will be sufficient.

Ajay · Oct 27, 2021

leoneazzurro said:
I think the Gigabyte leaks on AM5 mainboards already revealed that Zen4 will have 1Mbyte of L2 cache.

Details on the Gigabyte Leak

Recently, a ransomware group leaked data from Gigabyte in an attempt to extort payment. That’s been well covered by other outlets (please everyone, secure your networks), so here we’re …

chipsandcheese.com

Ah, good memory. I forgot about that. I was thinking at least double that - though slower, the hit rate would be very high in many workloads because it is inclusive. And it would still be backed by the even larger L3$ victim cache.

I have developed too many other interests to follow CPU and process developments in detail anymore. Still enough interest though to hang around here and annoy people 😈

BorisTheBlade82 · Oct 28, 2021

@jamescox
I am with you. As I said, the mockup is more wishful thinking, as coincidentally the geometry would allow it. But yes, AMD will stick to IFOP with Genoa. The trouble is this: I do not think that it is technically possible to use IFOP on one SKU and CoWoS etc. on another SKU with the same CCD. So I guess IFOP will stay with us for another full product stack. So it might very well be that Apple will be first in this area as well.

DisEnchantment · Oct 28, 2021

BorisTheBlade82 said:
But yes, AMD will stick to IFOP with Genoa. The trouble is this: I do not think that it is technically possible to use IFOP on one SKU and CoWoS etc. on another SKU with the same CCD. So I guess IFOP will stay with us for another full product stack.

Hmmm ... I don't think that is the route AMD will take with Genoa.

Zen4 CCD from the Gigabyte leak likely has two SDP/IF links.
On top of that to support 96 or even 128 cores would mean they need to support up to 512 SerDes links.
That is way too much power wasted, and the routing for Rome (above) is already very complicated.
On Rome they had to route the links underneath the CCD.

And at ISSCC 2021, Sam Naffziger already alluded to interposers/higher density interconnects (highlighting by me). This was before Lisa announced 3D V-Cache.

In fact from this slide we knew the second item already is coming to Zen3. (Cache while not exactly memory is backed by SRAM which is memory)

From TSMC's official data, CoWoS-L with LSI/Si bridges is proven and it reaches 3x reticle size, which can cover all chiplets for a hypothetical 16 CCD EPYC.

Anyway, I think AMD will most likely go with some sort of interposer, probably CoWoS-R rather than CoWoS-L if there is really no need for super high density interconnects, i.e. if a 4um contact pitch is enough (CoWoS-R) instead of the high density CoWoS-L (<1um pitch).
If not, they will burn power linking those 96/128 cores, which is not sustainable.
You can read Naffziger's paper yourself:

https://d3smihljt9218e.cloudfront.net/lecture/13766/slideshow/a8919637b2ff693a934db77ff29044fd.pdf
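
A rough sanity check of the "3x reticle covers 16 CCDs" point: the sketch below multiplies the 72.226 mm² CCD figure used in this thread by 16 and compares it with three times the commonly quoted 26 mm x 33 mm reticle field. The reticle dimensions are an assumption brought in from outside the thread, the function name is made up, and the IO die, bridges and die-to-die spacing are deliberately left out because their sizes are not known here.

```python
# Silicon area budget for a hypothetical 16-CCD package on a 3x-reticle interposer.
# CCD_MM2 is the die size quoted in this thread; the reticle field size is an
# ASSUMPTION (the commonly cited 26 mm x 33 mm lithography field).

RETICLE_MM2 = 26.0 * 33.0   # assumed single-reticle field, in mm^2
CCD_MM2 = 72.226            # Zen 4 CCD die size as quoted in the thread

def silicon_budget(num_ccds, reticle_multiple=3.0):
    """Return (total CCD area, interposer limit, headroom), all in mm^2."""
    ccd_total = num_ccds * CCD_MM2
    limit = reticle_multiple * RETICLE_MM2
    return ccd_total, limit, limit - ccd_total

ccd_total, limit, headroom = silicon_budget(16)
print(f"16 CCDs: {ccd_total:.0f} mm^2 of {limit:.0f} mm^2, "
      f"{headroom:.0f} mm^2 left for IOD, bridges and spacing")
```

Under these assumptions the CCDs alone take a bit under half of the 3x-reticle area, so the claim looks plausible as long as the IO die and keep-out spacing fit in the remainder.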

BorisTheBlade82 · Oct 28, 2021

@DisEnchantment
Thanks for the enlightenment. Although I had seen all that info before, I was not able to put the puzzle together.

xilli_fiberbit · Oct 28, 2021

https://twitter.com/x/status/1453707924338089990

Turin has a max cTDP of 600W

Should we have a ZEN5 thread?

https://twitter.com/x/status/1453747256046219269

ZEN5 EPYC should also have two configurations: 192C/384T and 256C/512T.

Interesting. This particular "group" also has a max VID of 1.8V. Zen 5 has some very interesting traits it seems
