Discussion Speculation: Zen 4 (EPYC 4 "Genoa", Ryzen 7000, etc.)


Vattila

Senior member
Oct 22, 2004
799
1,351
136
Except for the details of the microarchitectural improvements, we now know pretty well what to expect from Zen 3.

The leaked presentation by AMD Senior Manager Martin Hilgeman shows that EPYC 3 "Milan" will, as promised and expected, reuse the current platform (SP3), and the system architecture and packaging look to be the same, with the same 9-die chiplet design and the same maximum core and thread count (no SMT-4, contrary to rumour). The biggest change revealed so far is the enlargement of the compute complex from 4 cores to 8 cores, all sharing a larger L3 cache ("32+ MB", likely to double to 64 MB, I think).

Hilgeman's slides also showed that EPYC 4 "Genoa" is in the definition phase (or was at the time of the presentation in September, at least) and will come with a new platform (SP5) and new memory support (likely DDR5).



What else do you think we will see with Zen 4? PCI-Express 5 support? Increased core-count? 4-way SMT? New packaging (interposer, 2.5D, 3D)? Integrated memory on package (HBM)?

Vote in the poll and share your thoughts! :)
 
Last edited:
  • Like
Reactions: richardllewis_01

LightningZ71

Golden Member
Mar 10, 2017
1,627
1,898
136
It is not a straight answer; there are pros and cons to increasing cache.
For applications that have huge datasets it is a big plus, but not for all.
Therefore V-Cache is the best solution: stacking additional dies only on specific SKUs.

Good writeup here on why increased cache at the cost of latency is not good for most use cases

L3 design is also a tradeoff: you can make the L3 faster, but the cost is density and power.
There are so many dials and levers at play.

This seems to blow past one of the key features of using die stacking. You can (and AMD has specifically stated they are) use a variant of a given process that is FAR better suited to L3 cache when you use die stacking for the L3. When you put a large L3 on a die with the CPU cores, you have to balance the process between achieving the desired density in the L3 and the desired performance of the CPU core transistors. If AMD decides, for example, to swing the process for the CPU die over to one that favors the cores + L2 over L3 density, and reduces the L3 space on the CPU die to half or even nothing, they can achieve far higher transistor density and better performance for that part, while reaping the benefits of a more L3-cache-friendly process on the stacked cache die. You can hide the latency hit for the L3 by expanding the L2, which we know AMD is doing with Zen 4.
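To put rough numbers on that last point, here is a quick back-of-envelope average-memory-access-time sketch. All hit rates and latencies below are made-up placeholders for illustration, not measured Zen figures:

Code:
# Back-of-envelope average memory access time (AMAT) model.
# All hit rates and latencies (in cycles) are hypothetical placeholders,
# NOT measured Zen figures.

def amat(l1_hit, l1_lat, l2_hit, l2_lat, l3_hit, l3_lat, mem_lat):
    """L1 latency plus the weighted miss path through L2, L3 and DRAM."""
    return (l1_lat
            + (1 - l1_hit) * (l2_lat
            + (1 - l2_hit) * (l3_lat
            + (1 - l3_hit) * mem_lat)))

# Small L2 with a fast on-die L3 (placeholder numbers)
base = amat(0.95, 4, 0.80, 12, 0.60, 40, 300)
# Bigger L2 (higher hit rate, slightly slower) with a slower stacked L3
big_l2 = amat(0.95, 4, 0.88, 14, 0.60, 50, 300)

print(f"baseline AMAT     : {base:.2f} cycles")
print(f"big L2 + slower L3: {big_l2:.2f} cycles")

With these (invented) numbers the larger L2 more than pays for a slower stacked L3, which is the general idea.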
 

moinmoin

Diamond Member
Jun 1, 2017
4,944
7,656
136
AMD went with the CCD strategy in the beginning to scale out the number of cores where applicable, to use commodity dice between desktop, workstation, and server, and to improve yields. If AMD can get good yields on 16c parts and below on N5 then they could just go monolithic on their entire desktop and laptop lineup. The I/O die would be exclusive to EPYC and Threadripper of that generation. Not saying that's what they'll do, since it would violate the pattern from Zen -> Zen3 and force them to produce multiple monolithic dice for desktop/laptop (more masks = more time, more money). They would need an 8c monolithic and a 16c monolithic at least, and then fuse off cores for 12c and 6c parts.
Indeed. Just for clarification, I wrote that to make clear that it's all a give-and-take balancing act (whereas the discussion at that point was pushing for some specialized, expensive niche with no guaranteed audience). The monolithic approach is the one Apple chose because it can (it has both the financial muscle to push through these huge dies on an expensive node and the target audience guaranteed to be willing to pay the resulting prices). AMD is not in the same position, and the approach with essentially flexible modules of all kinds (be it plain MCM, chiplets and all the possible more advanced packaging tech, but also all the different IPs on the dies) also gives AMD much more room to maneuver so it wouldn't be wise for AMD to move away from that.
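To put a rough number on the yield side of that give-and-take, here is a simple Poisson defect-model sketch. The defect density and die areas are invented for illustration, not foundry or AMD data:

Code:
# Rough Poisson die-yield model: yield = exp(-defect_density * area).
# Defect density and die areas are illustrative guesses, not foundry data.
import math

def die_yield(area_mm2, d0_per_cm2=0.1):
    """Fraction of dies with zero defects for a given area."""
    return math.exp(-d0_per_cm2 * area_mm2 / 100.0)

ccd_8c = 80.0     # hypothetical 8-core chiplet, mm^2
mono_16c = 280.0  # hypothetical monolithic 16-core die incl. uncore, mm^2

print(f"8-core chiplet yield    : {die_yield(ccd_8c):.1%}")
print(f"two chiplets, both good : {die_yield(ccd_8c)**2:.1%}")
print(f"monolithic 16-core yield: {die_yield(mono_16c):.1%}")

Even before you account for being able to bin out the bad chiplets individually, two small dies beat one big one, which is the yield argument for staying modular.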
 

yuri69

Senior member
Jul 16, 2013
387
617
136
The 3D stacking future is cool but gosh one must think about the logistics involved.

Even the upcoming V-Cache Zen 3 requires:
* 2+ CCDs manufactured at TSMC's 7nm line
* 1 IOD manufactured at GF
* organic substrate to hold the IOD and CCDs
* V-Cache die manufactured at TSMC's 7nm line
* 3D integration of V-Cache at CCDs using a special TSMC fab

AMD's manufacturing lead time suffers a lot, doesn't it?
 

maddie

Diamond Member
Jul 18, 2010
4,738
4,667
136
The 3D stacking future is cool but gosh one must think about the logistics involved.

Even the upcoming V-Cache Zen 3 requires:
* 2+ CCDs manufactured at TSMC's 7nm line
* 1 IOD manufactured at GF
* organic substrate to hold the IOD and CCDs
* V-Cache die manufactured at TSMC's 7nm line
* 3D integration of V-Cache at CCDs using a special TSMC fab

AMD's manufacturing lead time suffers a lot, doesn't it?
Only your last 2 points matter.
Points 1-3 are the existing reality.
All but points 1 & 4 can be done in parallel.
V-cache takes less time to fab relative to CCD.

Overall for a generational jump, not bad at all.
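A crude critical-path view of why only the bonding step really stretches the schedule (every duration below is invented purely for illustration, not actual fab data):

Code:
# Crude critical-path view of the V-Cache lead time.
# Every duration below is invented purely for illustration.
ccd_fab_days    = 90   # hypothetical N7 CCD wafer cycle time
vcache_fab_days = 70   # hypothetical V-Cache die cycle time (smaller, SRAM-heavy)
iod_fab_days    = 60   # hypothetical GF IOD cycle time
bonding_days    = 10   # hypothetical SoIC stacking + test
packaging_days  = 14   # hypothetical substrate assembly + final test

# CCD, V-Cache and IOD wafers all start in parallel; stacking waits on the
# CCD and V-Cache, final packaging waits on the stacked CCD and the IOD.
stacked_ccd_ready = max(ccd_fab_days, vcache_fab_days) + bonding_days
vcache_lead_time  = max(stacked_ccd_ready, iod_fab_days) + packaging_days
plain_lead_time   = max(ccd_fab_days, iod_fab_days) + packaging_days

print(f"non-V-Cache part : {plain_lead_time} days")
print(f"V-Cache part     : {vcache_lead_time} days "
      f"(+{vcache_lead_time - plain_lead_time})")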
 

Mopetar

Diamond Member
Jan 31, 2011
7,835
5,982
136
Was your numbering off? The V-cache could be manufactured in parallel; it's just the last step of bonding the V-cache to a Zen chiplet that adds extra time.

From some analysis of Apple parts, the move from 7 nm to 5 nm didn't result in an SRAM shrink, though whether that was Apple simply opting not to shrink it, as opposed to an impossibility, is unknown. It is known that SRAM scaling isn't nearly as good as logic scaling, so we may have reached a wall of sorts.

Given this, there's no reason AMD couldn't fab chiplets on N5 while using N6 or some other process for the V-cache. Since they aren't splitting wafers, manufacturing throughput doesn't decrease as much as it would if everything had to use the same node, or possibly not at all, assuming there's no bottleneck on either V-cache manufacturing or assembly.

I guess it matters for current Zen 3D products since everything is done using 7 nm, but this seems more like a pipe cleaner project to ensure that the kinks are worked out for future products.
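For a feel of how little SRAM gains from the node move (the scaling point above), here is a quick sketch using often-quoted approximate TSMC high-density bitcell sizes; treat every number as a rough assumption rather than anything AMD or TSMC has confirmed:

Code:
# Rough comparison of SRAM vs logic scaling from N7 to N5.
# Bitcell sizes are often-quoted approximate figures; treat them as rough.
N7_BITCELL_UM2 = 0.027    # high-density 6T SRAM bitcell on N7 (approx.)
N5_BITCELL_UM2 = 0.021    # high-density 6T SRAM bitcell on N5 (approx.)
LOGIC_DENSITY_GAIN = 1.8  # commonly cited N7 -> N5 logic density improvement

bits_32mb = 32 * 1024 * 1024 * 8
raw_n7_mm2 = bits_32mb * N7_BITCELL_UM2 / 1e6  # cells only, no tags/periphery
raw_n5_mm2 = bits_32mb * N5_BITCELL_UM2 / 1e6

print(f"32 MB of raw bitcells on N7: {raw_n7_mm2:.1f} mm^2")
print(f"32 MB of raw bitcells on N5: {raw_n5_mm2:.1f} mm^2")
print(f"SRAM density gain ~{N7_BITCELL_UM2 / N5_BITCELL_UM2:.2f}x "
      f"vs ~{LOGIC_DENSITY_GAIN}x for logic")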
 
  • Like
Reactions: Tlh97 and Joe NYC

maddie

Diamond Member
Jul 18, 2010
4,738
4,667
136
Was your numbering off? The V-cache could be manufactured in parallel; it's just the last step of bonding the V-cache to a Zen chiplet that adds extra time.

From some analysis of Apple parts, the move from 7 nm to 5 nm didn't result in an SRAM shrink, though whether that was Apple simply opting not to shrink it, as opposed to an impossibility, is unknown. It is known that SRAM scaling isn't nearly as good as logic scaling, so we may have reached a wall of sorts.

Given this, there's no reason AMD couldn't fab chiplets on N5 while using N6 or some other process for the V-cache. Since they aren't splitting wafers, manufacturing throughput doesn't decrease as much as it would if everything had to use the same node, or possibly not at all, assuming there's no bottleneck on either V-cache manufacturing or assembly.

I guess it matters for current Zen 3D products since everything is done using 7 nm, but this seems more like a pipe cleaner project to ensure that the kinks are worked out for future products.
I assumed the V-cache is using the same lines as the CCD when fabbing. Can you batch fab both simultaneously?

Future wise, no problem once they have the ability to assemble different nodes together.
 

DrMrLordX

Lifer
Apr 27, 2000
21,620
10,830
136
the approach with essentially flexible modules of all kinds (be it plain MCM, chiplets and all the possible more advanced packaging tech, but also all the different IPs on the dies) also gives AMD much more room to maneuver so it wouldn't be wise for AMD to move away from that.

Exactly. That being said, AMD still has to go back and produce monolithic variants of their latest IP for the mobile sector (and those monolithic chips usually lag behind production of the CCDs that go into EPYC and, at least through Zen3, Ryzen). If they pushed the monolithic dice forward they could just move Ryzen to monolithic. Which might be worth it if they're planning on breaking the 8c barrier in mobile anyway.
 

moinmoin

Diamond Member
Jun 1, 2017
4,944
7,656
136
If they pushed the monolithic dice forward they could just move Ryzen to monolithic. Which might be worth it if they're planning on breaking the 8c barrier in mobile anyway.
I think with AM5 we'll see what was already happening more and more from Zen gen to Zen gen during AM4: the monolithic APUs replacing the lower core-count chips in the consumer product range. I expect Rembrandt to cover all chips up to 8 cores and Raphael to initially start above that.

Regarding moving monoliths forward, that's an advantage more modularized packages have: shorter time to market. I'm confident that creating CCDs takes less time than mobile APUs, even when pulling a Cezanne and just replacing the cores. So you can't really move the latter forward; a new core needs to appear somewhere for testing first, and the former is much better suited to that, both time- and cost-wise.
 

DrMrLordX

Lifer
Apr 27, 2000
21,620
10,830
136
Regarding moving monoliths forward, that's an advantage more modularized packages have: shorter time to market. I'm confident that creating CCDs takes less time than mobile APUs, even when pulling a Cezanne and just replacing the cores. So you can't really move the latter forward; a new core needs to appear somewhere for testing first, and the former is much better suited to that, both time- and cost-wise.

True, though if yields are good enough then AMD may be able to sell all the CCDs in Genoa and Bergamo. If not then yes they will have CCDs left over for Ryzen products, and time-to-market would be swiftest.
 

moinmoin

Diamond Member
Jun 1, 2017
4,944
7,656
136
True, though if yields are good enough then AMD may be able to sell all the CCDs in Genoa and Bergamo. If not then yes they will have CCDs left over for Ryzen products, and time-to-market would be swiftest.
Well, if all CCDs are sold to data centers and nothing is left for the consumer market then you don't really move monoliths forward but rather delay the introduction of new cores to the consumer market. I don't think that's what AMD would want. :p
 
  • Like
Reactions: Tlh97 and Joe NYC

DrMrLordX

Lifer
Apr 27, 2000
21,620
10,830
136
Well, if all CCDs are sold to data centers and nothing is left for the consumer market then you don't really move monoliths forward but rather delay the introduction of new cores to the consumer market. I don't think that's what AMD would want. :p

Arguably, Raphael/Zen 4 desktop is already delayed.
 
  • Like
Reactions: lobz

Kepler_L2

Senior member
Sep 6, 2020
331
1,162
106
Exactly. That being said, AMD still has to go back and produce monolithic variants of their latest IP for the mobile sector (and those monolithic chips usually lag behind production of the CCDs that go into EPYC and, at least through Zen3, Ryzen). If they pushed the monolithic dice forward they could just move Ryzen to monolithic. Which might be worth it if they're planning on breaking the 8c barrier in mobile anyway.
Au contraire, the future is MCM and 3D stacked everything. AMD will not have any monolithic products after 2023.
 
  • Like
Reactions: Tlh97 and Joe NYC

DrMrLordX

Lifer
Apr 27, 2000
21,620
10,830
136
AMD will not have any monolithic products after 2023.

So how are they going to keep interconnect power usage down on their mobile products? That's one of the main reasons why they aren't already using the CCD strategy there. And even if you do start stacking dice, there would still be some advantage to producing APUs in the future similar to the ones they do now.
 

Joe NYC

Golden Member
Jun 26, 2021
1,935
2,272
106
The 3D stacking future is cool but gosh one must think about the logistics involved.

Even the upcoming V-Cache Zen 3 requires:
* 2+ CCDs manufactured at TSMC's 7nm line
* 1 IOD manufactured at GF
* organic substrate to hold the IOD and CCDs
* V-Cache die manufactured at TSMC's 7nm line
* 3D integration of V-Cache at CCDs using a special TSMC fab

AMD's manufacturing lead time suffers a lot, doesn't it?

It is only the last step that is added, which adds some days to the cycle.

V-Cache SRAM is manufactured in parallel
 
Last edited:

Joe NYC

Golden Member
Jun 26, 2021
1,935
2,272
106
You know what else would "solve" the "SerDes problem"? Going monolith...

A lot of the success AMD has had is due to modularity, high yields, and the lower complexity of chiplets, which allowed AMD to address more markets than would be possible with monolithic dies.

The correct answer, IMO, is to fix the interconnect and go for even more modularity, to be able to deploy an even wider portfolio of products cost-efficiently.

EMIB is the simple and straightforward approach that Intel is taking in SPR, which may have some issues and overhead, according to some comments from Ian of AnandTech and Charlie of SemiAccurate.

There are rumors that AMD is looking at stacked active silicon bridges for RDNA3, which would be close to Nirvana. Who knows if and when AMD can get there.

The point about the IOD being essentially unchanged is not one of nodes, but a comparison of the uncore with, e.g., the known improvements in the APU dies since then. Already a couple of times in the past on these boards we have talked about the heavy toll that power consumption through an ever more featured uncore takes. The biggest power eater in the uncore is by some distance the IMC, depending on memory frequency, which is why the APUs introduced dynamic memory frequency scaling depending on CPU load and the demanded latency and bandwidth. Other areas which with the current IOD can't be completely power gated get some dynamic regulation as well, to essentially offer both a power-saving state and a high-performance state. @BorisTheBlade82 already pointed to the "narrow mode" mentioned to only use the links required for the overall bandwidth. All this further increases complexity in an MCM setup though, so it will be very interesting to see the design decisions AMD made in this balancing act. Considering that with Zen 4 AMD also introduces new platforms in SP5 and AM5, the MCM hierarchy and layout are no longer bound to be compatible with SP3 and AM4, so they can and will be changed to adapt to all the modern requirements.

Yeah, I think that was something I was thinking as well. The new IOD is brand new; AMD had some years to work on it, so many of its components were fine-tuned and optimized.

I have a slight hope that maybe AMD managed to do something about SerDes, the high cost of which is a roadblock to further modularity.
 
  • Like
Reactions: Tlh97

BorisTheBlade82

Senior member
May 1, 2020
663
1,014
106
500% 🤣🤣🤣🤣
I do not want to open this up again. So please, just take a look at the numbers:

 

BorisTheBlade82

Senior member
May 1, 2020
663
1,014
106
@moinmoin
I think we can all agree that going back to monolithic is not THE solution. Chiplets have clear benefits and are the way to go. But there are taxes because of the interconnect. The IOD needs that much power because it needs to drive all those bits over the interconnect.
With the current interconnect via the organic package you need around 15 pJ/bit of energy. With something like EMIB or InFO-LSI you only need 1-2 pJ/bit. So this kind of packaging is clearly the way to go. And the competitor we dare not name will clearly use such a solution in order to scale its newly announced SoC to 2x and 4x.
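For a sense of scale, taking the pJ/bit figures above at face value and assuming a hypothetical 100 GB/s of aggregate CCD-to-IOD traffic (the bandwidth is a made-up example, not a measured number):

Code:
# Interconnect power = energy per bit * bit rate.
# The pJ/bit figures are the ones quoted above; the bandwidth is a made-up example.

def link_power_watts(bandwidth_gb_s, pj_per_bit):
    """Power in watts for a link moving bandwidth_gb_s gigabytes per second."""
    bits_per_s = bandwidth_gb_s * 1e9 * 8
    return bits_per_s * pj_per_bit * 1e-12

bw = 100  # hypothetical GB/s of aggregate CCD<->IOD traffic
print(f"organic package @ 15 pJ/bit : {link_power_watts(bw, 15):.1f} W")
print(f"EMIB / InFO-LSI @ 2 pJ/bit  : {link_power_watts(bw, 2):.1f} W")

So at that (made-up) bandwidth the organic package burns on the order of 12 W in the links alone versus under 2 W for a silicon bridge.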
 

Timorous

Golden Member
Oct 27, 2008
1,608
2,753
136
I do not want to open this up again. So please, just take a look at the numbers:


Hitman928 very clearly explained to you in that very thread why your numbers are bogus.
 
  • Like
Reactions: Tlh97 and Mopetar

DisEnchantment

Golden Member
Mar 3, 2017
1,601
5,780
136
@moinmoin
I think we can all agree that going back to monolithic is not THE solution. Chiplets have clear benefits and are the way to go. But there are taxes because of the interconnect. The IOD needs that much power because it needs to drive all those bits over the interconnect.
With the current interconnect via the organic package you need around 15 pJ/bit of energy. With something like EMIB or InFO-LSI you only need 1-2 pJ/bit. So this kind of packaging is clearly the way to go. And the competitor we dare not name will clearly use such a solution in order to scale its newly announced SoC to 2x and 4x.
AMD is not using InFO.
They use SoIC for FEOL (e.g. V-Cache) and traditionally CoWoS (now known as CoWoS-S, previously plainly known as Si interposer) for BEOL (e.g. MI100).
But they need to replace the interconnect via the organic package indeed.

For next-gen (not necessarily Zen 4) chip-level integration, LSI/CoWoS-L would be the most likely candidate in my opinion, based on AMD's patents.
It has the biggest reach to cover a lot of chiplets, and optimal cost (vs CoWoS-S).
AMD was a key partner for TSMC's SoIC, and I think the same goes for CoWoS-L.
Way too many patents indicate CoWoS-L usage.

This patent all the way from 2016 talks about multiple Si bridges to connect chiplets.


[0024] As described in more detail below, the circuit board 15 may include one or more embedded chiplets 50 and 55 to provide chip-to-chip interconnects and also electrical pathways through the circuit board 15. The circuit structures of the chiplets 50 and 55 may be constructed using one or more design rules for higher density circuit structures while the circuit structures of the remainder of the circuit board 15 may be constructed using one or more design rules for lower density circuit structures. The high density design rules are used to create in the chiplets 50 and 55 larger numbers of electrical pathways than would ordinarily be possible using a lower density design rule for the remainder of the circuit board 15. The chiplets 50 and 55 may be used for a variety of purposes. For example, the chiplet 50 may be used to provide large numbers of electrical pathways between the semiconductor chips 20 and 25 as well as electrical pathways to and from the semiconductor chips 20 and 25, through the circuit board 15 and out to the I/O's 45 if desired. The chiplet 55 may be used to provide large numbers of electrical pathways between the semiconductor chip 25 and the semiconductor chip 30 as well as electrical pathways to and from the semiconductor chips 25 and 30 through the circuit board 15 and out to the I/O's 45 if desired. It should be understood that the chiplets 50 and 55 may number other than two, be of various footprints and be spatially arranged in a huge variety of ways on the circuit board 15 depending upon the electronic requirements of the circuit board 15, the number of semiconductor chips mounted thereon and other design considerations.

This patent basically describes CoWoS-L
 
Last edited:

Ajay

Lifer
Jan 8, 2001
15,431
7,849
136
AMD is not using InFO.
They use SoIC for FEOL (e.g. V-Cache) and traditionally CoWoS (now known as CoWoS-S, previously plainly known as Si interposer) for BEOL (e.g. MI100).
But they need to replace the interconnect via the organic package indeed.

For next-gen (not necessarily Zen 4) chip-level integration, LSI/CoWoS-L would be the most likely candidate in my opinion, based on AMD's patents.
It has the biggest reach to cover a lot of chiplets, and optimal cost (vs CoWoS-S).
AMD was a key partner for TSMC's SoIC, and I think the same goes for CoWoS-L.
Way too many patents indicate CoWoS-L usage.

This patent all the way from 2016 talks about multiple Si bridges to connect chiplets.


[0024] As described in more detail below, the circuit board 15 may include one or more embedded chiplets 50 and 55 to provide chip-to-chip interconnects and also electrical pathways through the circuit board 15. The circuit structures of the chiplets 50 and 55 may be constructed using one or more design rules for higher density circuit structures while the circuit structures of the remainder of the circuit board 15 may be constructed using one or more design rules for lower density circuit structures. The high density design rules are used to create in the chiplets 50 and 55 larger numbers of electrical pathways than would ordinarily be possible using a lower density design rule for the remainder of the circuit board 15. The chiplets 50 and 55 may be used for a variety of purposes. For example, the chiplet 50 may be used to provide large numbers of electrical pathways between the semiconductor chips 20 and 25 as well as electrical pathways to and from the semiconductor chips 20 and 25, through the circuit board 15 and out to the I/O's 45 if desired. The chiplet 55 may be used to provide large numbers of electrical pathways between the semiconductor chip 25 and the semiconductor chip 30 as well as electrical pathways to and from the semiconductor chips 25 and 30 through the circuit board 15 and out to the I/O's 45 if desired. It should be understood that the chiplets 50 and 55 may number other than two, be of various footprints and be spatially arranged in a huge variety of ways on the circuit board 15 depending upon the electronic requirements of the circuit board 15, the number of semiconductor chips mounted thereon and other design considerations.

This patent basically describes CoWoS-L
Curious why the patent shows the Si under most of the CCD. I would think a narrow Si bridge from the CCD to the IOD would provide all that's needed for higher-speed (frequency) IF interconnects. Si is only needed for signalling. Anyway, this would explain how Genoa will increase IF speeds at the same or lower power. I suppose it could be easier to manufacture, or the patent is just providing more coverage for legal reasons.