Discussion Speculation: Zen 4 (EPYC 4 "Genoa", Ryzen 7000, etc.)


Vattila

Senior member
Oct 22, 2004
799
1,351
136
Aside from the details of the microarchitectural improvements, we now have a pretty good idea of what to expect from Zen 3.

The leaked presentation by AMD Senior Manager Martin Hilgeman shows that EPYC 3 "Milan" will, as promised and expected, reuse the current platform (SP3), and the system architecture and packaging look to be the same, with the same 9-die chiplet design and the same maximum core and thread count (no SMT-4, contrary to rumour). The biggest change revealed so far is the enlargement of the compute complex from 4 cores to 8 cores, all sharing a larger L3 cache ("32+ MB", likely to double to 64 MB, I think).

Hilgeman's slides also showed that EPYC 4 "Genoa" is in the definition phase (or was at the time of the presentation in September, at least) and will come with a new platform (SP5) with new memory support (likely DDR5).

[Image: Untitled2.png]


What else do you think we will see with Zen 4? PCI-Express 5 support? Increased core-count? 4-way SMT? New packaging (interposer, 2.5D, 3D)? Integrated memory on package (HBM)?

Vote in the poll and share your thoughts! :)
 
Last edited:
  • Like
Reactions: richardllewis_01

tomatosummit

Member
Mar 21, 2019
184
177
116
~multi interposer stuff
A lot of the interposer talk comes down to "too big to be affordable", and that AMD paper adds that it doesn't handle distance that well.
I think the logical direction for multiple interposers is an EMIB-style implementation; it would be a similar level of complexity for a fraction of the silicon requirements.
Also, with Zen 4 they're going to need package and SerDes improvements for DDR5 and PCIe 5, so perhaps there's the ability to just increase the IF3 bandwidth using IFOP.
My bet would be on the organic package staying until active interposers can be put into more mainstream solutions, unless TSMC's EMIB equivalent is closer than I think.
 
  • Like
Reactions: Tlh97 and Vattila

jamescox

Senior member
Nov 11, 2009
637
1,103
136
A lot of the interposer talk comes down to "too big to be affordable", and that AMD paper adds that it doesn't handle distance that well.
I think the logical direction for multiple interposers is an EMIB-style implementation; it would be a similar level of complexity for a fraction of the silicon requirements.
Also, with Zen 4 they're going to need package and SerDes improvements for DDR5 and PCIe 5, so perhaps there's the ability to just increase the IF3 bandwidth using IFOP.
My bet would be on the organic package staying until active interposers can be put into more mainstream solutions, unless TSMC's EMIB equivalent is closer than I think.
The paper indicates that very long runs using local silicon interconnect (LSI; TSMC's version of Intel's EMIB) or large interposers do not work, at least not with passive interposers. To use LSI for interconnect between CPU die, it would likely need to be implemented as a daisy chain, although such tech could probably support multiple channels such that each die could have a separate path. There would be added latency due to buffering and needing to route across each CPU die. It would need to be one very short bridge between the IO die and each compute die. You would have up to 3 CPU die with 3 LSI die embedded in the package as the interconnect. That would be 3 hops at a relatively low clock to get to the last die in the chain. It would be a wide interface though, so it may not be a latency issue. While that is plausible technology, it probably doesn't compete well with just using IFOP. The latency for IFOP should be going down as they raise the clock speed. A PCI Express-level IFOP link will run at ridiculous clock speeds. LSI would be lower power, but more expensive, more likely to have defects, etc. The IFOP approach has some advantages: there is more freedom in where the chips can be placed, and they can be spread out for better thermal density and such.

I am currently thinking that the Epyc IO die might be split into 4 chips even if it is not stacked. Due to the lack of scaling, it would be large even if made on 6 nm. The large transistors required for driving external interfaces are unlikely to have defects, though. Stacking may not make sense: while it has a lot of stuff that does not scale, there isn't that much that does. Making it all on a more advanced process should help with the higher clock speeds required to handle DDR5 and PCI Express 5. They could make one IO die design for the whole product line, with it split into 4 chips. Ryzen could get 1 of them, Threadripper 2 or 4 (possibly using salvage IO die with some parts disabled), and Epyc always 4. They could even use it as a chipset, the way they use the current Ryzen IO die as a chipset, just with no CPU or memory attached. They could use IFOP-style links or LSI to connect the 4 separate chips together, since they would be mounted right next to each other, with the CPU die still connected by IFOP. LSI between the IO die chips would avoid the latency of a hop over a serialized Infinity Fabric link, although it is unclear which is actually lower latency. It would be more die-area efficient than having wide IFOP links. They could put one LSI die along each adjacent edge and then perhaps one square LSI die right in the middle to allow the separate chips to be fully connected. It seems like each chip would need to be a square die to allow it to be implemented as one identical design; each die would be rotated 90 degrees from the adjacent die.
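One way to picture that fully connected four-quadrant arrangement is as a simple connectivity check. This is only a sketch of the idea described above; the quadrant and bridge names are invented for illustration, not AMD terminology:

```python
# Sketch of the hypothetical 4-quadrant IO die interconnect described above.
# Quadrant and bridge names are invented for illustration only.
from itertools import combinations

quadrants = ["Q0", "Q1", "Q2", "Q3"]  # four identical IO-die chiplets

# One LSI bridge along each adjacent edge, plus one square bridge in the
# middle that spans all four corners (closing the two diagonal pairs).
bridges = {
    "edge_01": {"Q0", "Q1"},
    "edge_12": {"Q1", "Q2"},
    "edge_23": {"Q2", "Q3"},
    "edge_30": {"Q3", "Q0"},
    "center":  {"Q0", "Q1", "Q2", "Q3"},
}

# Check that every pair of quadrants shares at least one bridge,
# i.e. the four chips are fully connected without multi-hop routing.
for a, b in combinations(quadrants, 2):
    direct = [name for name, ends in bridges.items() if {a, b} <= ends]
    print(f"{a}-{b}: via {direct}")
```

With these assumptions the adjacent pairs get an edge bridge plus the center die, and the two diagonal pairs are covered only by the center die, which is why it would need to span all four chips.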
 
  • Like
Reactions: Tlh97 and Vattila

Joe NYC

Golden Member
Jun 26, 2021
1,970
2,349
106
I wonder if there is such a thing as a combination of routing through the substrate and an interposer, where a chip can sit partially on top of one and partially on the other.

The problem with the "reach" of an interposer in a Rome/Milan-class package only arises if you try to fit everything on one interposer. But what if there were two smaller interposers strictly providing a fast bridge from CCD to I/O chip, one on each side of the I/O chip? The rest of the routing from the I/O chip (PCIe, DDR) and the power into the CCDs would still go through the substrate...

And in theory the entire CCD or IOD would NOT have to be on top of the interposer, only the edge, which would overlap the IOD on one side and the CCD on the other, replacing the serial Infinity Fabric with a wide parallel link.

Perhaps AMD does not have to make wholesale changes to the package configuration with Zen 4, but moving to Zen 5, I think the bandwidth limitations and power consumption of Infinity Fabric may need a more drastic change.

The I/O die could accommodate 1 or 2 DRAM stacks, which could work as L4 or main memory.

Here is how I envision it:

[Image: 1624749495733.png]

Alternatively, some of the CCD/SRAM stacks could be replaced with DRAM stacks, to offer different configurations of HBM and cores.

Say 4 CCDs x 8 cores + 4 DDR stacks x 16GB DRAM = 32 cores + 64 GB DRAM
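As a quick tally of that hypothetical configuration (a sketch only; the counts are the ones in the post, not a confirmed SKU):

```python
# Tally of the hypothetical package above (counts from the post, not a real SKU).
ccds = 4
cores_per_ccd = 8
dram_stacks = 4
gb_per_stack = 16

total_cores = ccds * cores_per_ccd          # 4 x 8  = 32 cores
total_dram_gb = dram_stacks * gb_per_stack  # 4 x 16 = 64 GB on-package DRAM
print(f"{total_cores} cores + {total_dram_gb} GB on-package DRAM")
```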
 

moinmoin

Diamond Member
Jun 1, 2017
4,956
7,676
136
I wonder if there is such a thing as a combination of routing through the substrate and an interposer, where a chip can sit partially on top of one and partially on the other.

The problem with the "reach" of an interposer in a Rome/Milan-class package only arises if you try to fit everything on one interposer. But what if there were two smaller interposers strictly providing a fast bridge from CCD to I/O chip, one on each side of the I/O chip? The rest of the routing from the I/O chip (PCIe, DDR) and the power into the CCDs would still go through the substrate...

And in theory the entire CCD or IOD would NOT have to be on top of the interposer, only the edge, which would overlap the IOD on one side and the CCD on the other, replacing the serial Infinity Fabric with a wide parallel link.

Perhaps AMD does not have to make wholesale changes to the package configuration with Zen 4, but moving to Zen 5, I think the bandwidth limitations and power consumption of Infinity Fabric may need a more drastic change.

The I/O die could accommodate 1 or 2 DRAM stacks, which could work as L4 or main memory.

Here is how I envision it:

[Attachment 46293]

Alternatively, some of the CCD/SRAM stacks could be replaced with DRAM stacks, to offer different configurations of HBM and cores.

Say 4 CCDs x 8 cores + 4 DDR stacks x 16GB DRAM = 32 cores + 64 GB DRAM
That's essentially Intel's EMIB. I'm not sure whether it's patented, or whether TSMC and/or AMD could replicate something close to it without running afoul of Intel IP.
 
  • Like
Reactions: Tlh97 and Joe NYC

Vattila

Senior member
Oct 22, 2004
799
1,351
136
I'm not sure whether it's patented, or whether TSMC and/or AMD could replicate something close to it without running afoul of Intel IP.

TSMC has CoWoS-L with LSI:

"CoWoS®-L, as one of the chip-last packages in CoWoS® platform, combining the merits of CoWoS®-S and InFO technologies to provide the most flexible integration using interposer with LSI (Local Silicon Interconnect) chip for die-to-die interconnect and RDL layers for power and signal delivery. The offering starts from 1.5X-reticle interposer size with 1x SoC + 4x HBM cubes and will move forward to expand the envelope to larger sizes for integrating more chips."

CoWoS® - Taiwan Semiconductor Manufacturing Company Limited (tsmc.com)

[Image: CoWoS-L_01.png]
 

DisEnchantment

Golden Member
Mar 3, 2017
1,608
5,816
136
Alternatively, some of the CCD/SRAM stacks could be replaced with DRAM stacks, to offer different configurations of HBM and cores.
It is currently not possible to stack DRAM chips on top of logic using SoIC. They would need InFO for that, but that means losing all the key benefits of SoIC that AMD just talked about: interconnect density, thermal characteristics, etc.
 

Joe NYC

Golden Member
Jun 26, 2021
1,970
2,349
106
It is currently not possible to stack DRAM chips on top of logic using SoIC. They would need InFO for that, but that means losing all the key benefits of SoIC that AMD just talked about: interconnect density, thermal characteristics, etc.

It might not be readily available on the market yet, but I am thinking that a more radical change to the architecture is more likely to happen in the Zen 5 time frame, and by then there may be SoIC stacking of DRAM.
 

jamescox

Senior member
Nov 11, 2009
637
1,103
136
I wonder if there is such a thing as a combination of routing through the substrate and an interposer, where a chip can sit partially on top of one and partially on the other.

The problem with the "reach" of an interposer in a Rome/Milan-class package only arises if you try to fit everything on one interposer. But what if there were two smaller interposers strictly providing a fast bridge from CCD to I/O chip, one on each side of the I/O chip? The rest of the routing from the I/O chip (PCIe, DDR) and the power into the CCDs would still go through the substrate...

And in theory the entire CCD or IOD would NOT have to be on top of the interposer, only the edge, which would overlap the IOD on one side and the CCD on the other, replacing the serial Infinity Fabric with a wide parallel link.

Perhaps AMD does not have to make wholesale changes to the package configuration with Zen 4, but moving to Zen 5, I think the bandwidth limitations and power consumption of Infinity Fabric may need a more drastic change.

The I/O die could accommodate 1 or 2 DRAM stacks, which could work as L4 or main memory.

Here is how I envision it:

[Attachment 46293]

Alternatively, some of the CCD/SRAM stacks could be replaced with DRAM stacks, to offer different configurations of HBM and cores.

Say 4 CCDs x 8 cores + 4 DDR stacks x 16GB DRAM = 32 cores + 64 GB DRAM
This is basically what we were discussing. Connecting the CCDs using LSI is problematic due to the distances involved. They could connect some chips in that manner with the CCD directly next to the IO die, but you can't really do something that looks like current Epyc packages. LSI would require placing the IO die and CCD very close, and probably daisy chaining the CCD with separate LSI die to achieve the CCD count. You aren't going to get 12 CCD adjacent to a single IO die very easily.

Intel claims to be using HBM2e in HPC processors in late 2022, so I would expect AMD to have an HBM device also. I don't know how competitive they will be if they just have massive SRAM caches. It allowed them to use a narrower memory interface on their GPUs though. It is unclear how they would add HBM. If the CCD are still connected by serialized IFOP links, then the HBM may be on top of the IO die or next to it using LSI. It might be interesting if they have split the IO die into 4 separate but identical chips. HBM2 is about 92 square mm, and a 6 nm "single quadrant" IO die might be about that size, so perhaps they just place one HBM stack on top of each chip. They could place 4 of these die stacks for an Epyc processor. With HBM2 die stacks being larger than CCDs, I don't think they would want to place HBM2 die stacks next to the IO die. It would probably take a lot of space and limit routing under the stacks to the CCDs. Placing the HBM on top of the IO die(s) makes quite a bit of sense. I was wondering if they were going to use some number of smaller interposers, but that is seeming less likely.
 

jamescox

Senior member
Nov 11, 2009
637
1,103
136
It might not be readily available on the market, but I am thinking that a more radical change to the architecture will more likely happen in Zen 5 time frame, and by then, there may be SoIC stacking of DRAM.
I don't remember all of the different stacking technologies TSMC has available, but it seems Intel may have HBM HPC processors by the end of 2022, so it seems like AMD needs HBM with Zen 4 unless massive SRAM caches make up for it. I guess they can stack it on top of the IO die, next to it on an interposer, or using LSI. I assume all of those use the same micro-solder bumps that they have been using for years. HBM is only a 1024-bit interface, so it doesn't need the massive level of connectivity allowed by TSMC's SoIC. SoIC offers more than an order of magnitude higher interconnect density than what is required for HBM.
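To get a feel for that density gap, here is a rough pitch-based comparison. The pitch values below are ballpark public figures I am assuming for illustration, not numbers tied to any specific AMD product:

```python
# Rough comparison of connection density: conventional micro-bumps (as used
# for HBM on an interposer) vs. hybrid bonding (TSMC SoIC). Pitches are
# assumed ballpark values, used only to illustrate the scale of the gap.
microbump_pitch_um = 36.0    # assumed typical micro-bump pitch
hybrid_bond_pitch_um = 9.0   # assumed SoIC hybrid-bond pitch

def density_per_mm2(pitch_um: float) -> float:
    # Connections per mm^2 scale with 1 / pitch^2.
    return (1000.0 / pitch_um) ** 2

ratio = density_per_mm2(hybrid_bond_pitch_um) / density_per_mm2(microbump_pitch_um)
print(f"micro-bump:  {density_per_mm2(microbump_pitch_um):,.0f} pads/mm^2")
print(f"hybrid bond: {density_per_mm2(hybrid_bond_pitch_um):,.0f} pads/mm^2")
print(f"density ratio: ~{ratio:.0f}x")  # ~16x with these assumed pitches

# An HBM2 stack only needs a 1024-bit data interface, far below what SoIC offers.
print("HBM2 data width: 1024 bits per stack")
```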
 
  • Like
Reactions: Tlh97 and Vattila

Gideon

Golden Member
Nov 27, 2007
1,646
3,712
136
I totally missed that AnandTech retested Milan with a newer platform. The I/O power penalty is completely gone and it's a straight upgrade from Rome now:


[Image: per-thread_575px.png]


The most interesting comparisons today were pitting the 24- and 16-core Milan parts against Intel's newest 28-core Xeon 6330, based on the new Ice Lake SP microarchitecture. The AMD parts are also in the same price range as Intel's chip, at $2010 and $1565 versus $1894. The 16-core chip actually mostly matches the performance of the 28-core competitor in many workloads while still showcasing a large per-thread performance advantage, while the 24-core part, being 6% more expensive, more notably showcases both a large +26% total throughput lead and a large +47% per-thread performance lead. Database workloads are admittedly still AMD's weakness here, but in every other scenario, it's clear which is the better value proposition.
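A quick back-of-envelope check of the quoted price comparison (prices as quoted above; this is just arithmetic on those figures, not new data):

```python
# Sanity check of the price comparison quoted above (list prices as quoted).
milan_24c_price = 2010   # 24-core Milan part, USD
milan_16c_price = 1565   # 16-core Milan part, USD
xeon_6330_price = 1894   # 28-core Xeon 6330, USD

premium = (milan_24c_price / xeon_6330_price - 1) * 100
discount = (1 - milan_16c_price / xeon_6330_price) * 100
print(f"24-core Milan is ~{premium:.0f}% more expensive than the Xeon 6330")  # ~6%
print(f"16-core Milan is ~{discount:.0f}% cheaper than the Xeon 6330")        # ~17%
```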
 

DisEnchantment

Golden Member
Mar 3, 2017
1,608
5,816
136
It might not be readily available on the market, but I am thinking that a more radical change to the architecture will more likely happen in Zen 5 time frame, and by then, there may be SoIC stacking of DRAM.
SoIC needs co-design, meaning the DRAM (or whatever else) and the logic are designed using the same process technologies for bumpless hybrid bonding in the front end (FE). Unless TSMC starts making DRAM on N7, it's not possible. And TSMC has no DRAM/eDRAM IP on N7/N5, only SRAM.
See this illustration from TSMC:

[Image: 1625053180873.png]

However, it is possible to stack DRAM on top as in InFO, but there is an issue with this on highly clocked HPC devices; it is fine for mobile processors and lower-clocked Arm processors.
SoIC caters to a different paradigm: massive interconnect density, low pJ/bit, and improved thermal characteristics.
SoIC + CoWoS as shown above is the approach proposed by TSMC for 3D integration with DRAM for HPC devices, also described in the post quoted below:

TSMC has CoWoS-L with LSI:

"CoWoS®-L, as one of the chip-last packages in CoWoS® platform, combining the merits of CoWoS®-S and InFO technologies to provide the most flexible integration using interposer with LSI (Local Silicon Interconnect) chip for die-to-die interconnect and RDL layers for power and signal delivery. The offering starts from 1.5X-reticle interposer size with 1x SoC + 4x HBM cubes and will move forward to expand the envelope to larger sizes for integrating more chips."

CoWoS® - Taiwan Semiconductor Manufacturing Company Limited (tsmc.com)

[Image: CoWoS-L_01.png]
For stacking logic, SoIC is currently the best option. (They still need to solve power routing to the topmost die; there are patents on how this is being solved. That is probably one of many reasons why they started with cache first.)
 

DisEnchantment

Golden Member
Mar 3, 2017
1,608
5,816
136
TSMC has CoWoS-L with LSI:

"CoWoS®-L, as one of the chip-last packages in CoWoS® platform, combining the merits of CoWoS®-S and InFO technologies to provide the most flexible integration using interposer with LSI (Local Silicon Interconnect) chip for die-to-die interconnect and RDL layers for power and signal delivery. The offering starts from 1.5X-reticle interposer size with 1x SoC + 4x HBM cubes and will move forward to expand the envelope to larger sizes for integrating more chips."

CoWoS® - Taiwan Semiconductor Manufacturing Company Limited (tsmc.com)

[Image: CoWoS-L_01.png]
This looks very interesting but complicated. Will the typical OSAT go out of business?
Good read here.

The tools are there and the risks and complexities are understood, but as has been the case, expect AMD to take calculated and measured risks while trying to push innovation.
 

jamescox

Senior member
Nov 11, 2009
637
1,103
136
SoIC needs co-design, meaning the DRAM (or whatever else) and the logic are designed using the same process technologies for bumpless hybrid bonding in the front end (FE). Unless TSMC starts making DRAM on N7, it's not possible. And TSMC has no DRAM/eDRAM IP on N7/N5, only SRAM.
See this illustration from TSMC:

[Attachment 46505]

However, it is possible to stack DRAM on top as in InFO, but there is an issue with this on highly clocked HPC devices; it is fine for mobile processors and lower-clocked Arm processors.
SoIC caters to a different paradigm: massive interconnect density, low pJ/bit, and improved thermal characteristics.
SoIC + CoWoS as shown above is the approach proposed by TSMC for 3D integration with DRAM for HPC devices, also described in the post quoted below:


For stacking logic, SoIC is currently the best option. (They still need to solve power routing to the topmost die; there are patents on how this is being solved. That is probably one of many reasons why they started with cache first.)

I guess "CoWoS-L with LSI" looks like a good candidate for Epyc with HBM. The IO die (or dies) combined with 4 stacks of HBM would almost certainly be less than the reticle size, so it would be cost effective. I don't know what they would use if they were going to stack HBM on top of the IO die. I guess I need to read more about TSMC's stacking tech, but I haven't had time. If they split the Epyc IO die into 4 separate but identical die, then each would probably be close to the size of an HBM2 die, which is about 92 mm2. The current Epyc IO die is 416 mm2 (I have also seen 435). The Ryzen IO die is 125 mm2; a split die might be closer to the Ryzen IO die. It would shrink a bit when going to 6 nm, but the PCI Express 5 links and possibly an extra CPU die connection would take some more space. It might be close in size to an HBM2 die though, so about 736 mm2 total if done 2.5D. They may need the space for all of the IO pins, so 2.5D stacking with the HBM next to the IO die probably makes the most sense. I don't know how they will fit all of that on the package, since the IO die with HBM would take quite a bit more area.
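Roughly, the area math in that paragraph works out as follows (all die sizes are the approximate figures used in the post, not measurements):

```python
# Rough area tally for the hypothetical split IO die + HBM2 layout above.
# Die sizes are the approximate figures quoted in the post.
hbm2_stack_mm2 = 92          # footprint of one HBM2 stack
current_epyc_io_mm2 = 416    # Rome/Milan IO die, as quoted
quadrant_raw_mm2 = current_epyc_io_mm2 / 4   # ~104 mm^2 before any 6 nm shrink

# The post assumes a split/shrunk quadrant ends up about HBM2-sized (~92 mm^2).
quadrant_assumed_mm2 = 92
total_2_5d_mm2 = 4 * quadrant_assumed_mm2 + 4 * hbm2_stack_mm2

print(f"naive quarter of the current IO die: ~{quadrant_raw_mm2:.0f} mm^2")
print(f"4 quadrants + 4 HBM2 stacks side by side (2.5D): ~{total_2_5d_mm2:.0f} mm^2")
```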
 

jamescox

Senior member
Nov 11, 2009
637
1,103
136
OK, Bergamo: if 128 Zen 4 CPU cores physically fit in the SP5 CPU socket (limited to 128C), there is no problem.

There is a possibility that the 128-core version is a different configuration coming later that uses two layers of cpu die. With relatively low clocks, stacking 2 layers of cpu die might be doable from a thermal perspective, so it would only be 8 CCD die stacks rather than 16.

The 96-core version is still a bit weird. With the 4-CPU-die variant (up to 32 cores) probably being the most common for Rome and Milan, I was wondering if they were going to make some kind of interposer-based or other stacked solution with 4 CPU die stacked with the IO die somehow. That would allow room to mount HBM-based GPU die or up to 8 more CCD on either side of the package for HPC. That hits the issue that big interposers and other stacking tech are more expensive than serialized IFOP connections. It would also possibly be asymmetric, although some Epyc solutions are already asymmetric; the 48-core version, for example, has some quadrants with 2 CCD and some with 1.

It also might be that SP5 is just a significantly larger package, but going to 128 cores with 16 8-core CCDs seems unlikely. It would need 4 links per quadrant rather than the current 2, unless they are shared between separate CCD somehow. I guess they could daisy chain CCD. Making a 16-core CCD later on would also be a possible solution. It would probably be 2 CCX on one die, like Zen 2, if that is an option. I thought that the IFOP connection moved from the middle of the die, between the 2 CCX, on Rome to the side of the die on Milan. So perhaps they add a second CCX and the IFOP die area is in the middle again.
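The link-count concern can be put in simple numbers (a sketch under the assumptions in the post: 8-core CCDs, 4 IO-die quadrants, 2 IFOP links per quadrant today):

```python
# Link-count arithmetic for a 128-core part, using the assumptions in the post.
cores_target = 128
quadrants = 4
links_per_quadrant_today = 2   # as on Rome/Milan

for cores_per_ccd in (8, 16):
    ccds = cores_target // cores_per_ccd
    links_per_quadrant = ccds // quadrants
    print(f"{cores_per_ccd}-core CCDs: {ccds} CCDs -> "
          f"{links_per_quadrant} IFOP links per quadrant "
          f"(today: {links_per_quadrant_today})")
```

With 8-core CCDs the quadrants would need 4 links each; a 16-core CCD keeps the count at today's 2, which is the appeal of that option.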
 

LightningZ71

Golden Member
Mar 10, 2017
1,628
1,898
136
Just a guess, but perhaps they are willing to risk yields on larger 16-core CCDs for the 128-core EPYC because they aren't an early adopter of N5. Yields are reported to be quite good on N5, and I suspect that the big CCDs are only going to be in premium EPYC products that pull very high ASPs. That would make the expense of a second CCD design and package changes worth it.
 

randomhero

Member
Apr 28, 2020
181
247
86
Just a guess, but perhaps they are willing to risk yields on larger 16-core CCDs for the 128-core EPYC because they aren't an early adopter of N5. Yields are reported to be quite good on N5, and I suspect that the big CCDs are only going to be in premium EPYC products that pull very high ASPs. That would make the expense of a second CCD design and package changes worth it.
What if two CCDs shared an IFOP link and V-Cache?
Plausible?
 

jamescox

Senior member
Nov 11, 2009
637
1,103
136
What if two CCDs shared an IFOP link and V-Cache?
Plausible?
It seems more likely that they would make a different variant with 2 CCX on one die. As far as I know, the two CCX on a Zen 2 die share a single IFOP link. It would look the same as a Zen 2 die except with 8 cores in each CCX. The IFOP link is in the center, between the 2 CCX, on a Zen 2 die, but it is along one edge on Zen 3. Sharing V-Cache is also unlikely. To use the same V-Cache die, it would need to mount over each of the existing CCX L3 caches. They would probably need 2 separate cache die and 3 silicon spacers.
 
  • Like
Reactions: Joe NYC

jamescox

Senior member
Nov 11, 2009
637
1,103
136
Just a guess, but perhaps they are willing to risk yields on larger 16-core CCDs for the 128-core EPYC because they aren't an early adopter of N5. Yields are reported to be quite good on N5, and I suspect that the big CCDs are only going to be in premium EPYC products that pull very high ASPs. That would make the expense of a second CCD design and package changes worth it.
Mask sets are very expensive, but AMD has a lot more money now. They have quite a few different variants of CPUs, APUs, GPUs, and IO chips in the pipeline. The initial Zen 1 was a single chip that did everything, which was very cheap to tape out. They can specialize a bit more now. I don't know how likely they are to make a different variant vs. using some form of stacking.
 

jamescox

Senior member
Nov 11, 2009
637
1,103
136
As in a second CCD being daisy chained through the first? I would imagine that it's unlikely due to the inconsistent latency in core communications that would result.
I don't think they would use daisy chaining with IFOP connections. If they were going to use LSI, then perhaps. They can't do LSI for long runs, so all of the CCD would need to be adjacent to the IO die, which doesn't work well for more than 4 or maybe 6 die. I guess they could use what is essentially a switch chip that would add very little latency to expand the number of connections, and use massive L3 to make up for it. That seems a bit unlikely though.
 

LightningZ71

Golden Member
Mar 10, 2017
1,628
1,898
136
It seems more likely that they would make a different variant with 2 CCX on one die. As far as I know, the two CCX on a Zen 2 die share a single IFOP link. It would look the same as a Zen 2 die except with 8 cores in each CCX. The IFOP link is in the center, between the 2 CCX, on a Zen 2 die, but it is along one edge on Zen 3. Sharing V-Cache is also unlikely. To use the same V-Cache die, it would need to mount over each of the existing CCX L3 caches. They would probably need 2 separate cache die and 3 silicon spacers.

Or...

They make one larger CCD that has two eight-core CCXs with their L3 caches aligned along a common, long central axis. The vias can be placed in the middle as with Zen 3, and a single V-Cache die can be constructed to align with that axis. That would allow a single cache die to stack on the CCD and connect to both CCX units.

I would actually suggest that they could design these high-density CCDs with half the L3 per CCX, at 16 MB. Then a four-high stack of V-Cache can be placed on top, with four layers of 32 MB of cache over each CCX, giving 144 MB of cache for each eight-core CCX. That's still plenty.
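Spelled out, the cache arithmetic there looks like this (the 16 MB base and 32 MB-per-layer figures are the post's own assumptions):

```python
# Cache tally for the hypothetical high-density CCD above (figures from the post).
base_l3_per_ccx_mb = 16      # halved on-die L3 per CCX, as proposed
vcache_layers = 4
vcache_per_layer_mb = 32     # one 32 MB V-Cache slice per CCX per layer

per_ccx_mb = base_l3_per_ccx_mb + vcache_layers * vcache_per_layer_mb
print(f"L3 per 8-core CCX: {per_ccx_mb} MB")        # 16 + 4*32 = 144 MB
print(f"L3 per 16-core CCD: {2 * per_ccx_mb} MB")   # 288 MB with two CCXs
```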
 
  • Like
Reactions: Joe NYC and Tlh97

randomhero

Member
Apr 28, 2020
181
247
86
Or...

They make one larger CCD that has two eight-core CCXs with their L3 caches aligned along a common, long central axis. The vias can be placed in the middle as with Zen 3, and a single V-Cache die can be constructed to align with that axis. That would allow a single cache die to stack on the CCD and connect to both CCX units.

I would actually suggest that they could design these high-density CCDs with half the L3 per CCX, at 16 MB. Then a four-high stack of V-Cache can be placed on top, with four layers of 32 MB of cache over each CCX, giving 144 MB of cache for each eight-core CCX. That's still plenty.
How about LSI bridge(s) doing the same across multiple dies?

Edit: I am of the opinion that AMD is pushing for advanced packaging because it saves silicon production time, so I am throwing things at the wall and trying to guess what sticks.
 
Last edited:

jamescox

Senior member
Nov 11, 2009
637
1,103
136
How about LSI bridge(s) doing the same across multiple dies?

Edit: I am of the opinion that AMD is pushing for advanced packaging because it saves silicon production time, so I am throwing things at the wall and trying to guess what sticks.
LSI offers significantly lower connectivity than the stacking they are using for the stacked L3 cache; it is probably an order of magnitude lower in the number of connections.
 
  • Like
Reactions: Joe NYC