Discussion Speculation: Zen 4 (EPYC 4 "Genoa", Ryzen 7000, etc.)


Vattila

Senior member
Oct 22, 2004
799
1,351
136
Except for the details of the microarchitectural improvements, we now know pretty well what to expect from Zen 3.

The leaked presentation by AMD Senior Manager Martin Hilgeman shows that EPYC 3 "Milan" will, as promised and expected, reuse the current platform (SP3), and the system architecture and packaging look to be the same, with the same 9-die chiplet design and the same maximum core and thread count (no SMT-4, contrary to rumour). The biggest change revealed so far is the enlargement of the compute complex from 4 cores to 8 cores, all sharing a larger L3 cache ("32+ MB", which I think will likely double to 64 MB).

Hilgeman's slides also showed that EPYC 4 "Genoa" is in the definition phase (or was at the time of the presentation in September, at least), and will come with a new platform (SP5) and new memory support (likely DDR5).



What else do you think we will see with Zen 4? PCI-Express 5 support? Increased core-count? 4-way SMT? New packaging (interposer, 2.5D, 3D)? Integrated memory on package (HBM)?

Vote in the poll and share your thoughts! :)
 

eek2121

Platinum Member
Aug 2, 2005
2,904
3,906
136
If HBM is coming to the CPU, I'd expect it on a standalone interposer, sized disproportionately larger than anything else on the package: it should be in GB rather than MB at that point. Anything less becomes an overly complicated buffer, not worth the cost. However much bandwidth the attached HBM brings to the chip, it will only ever be a subset of the CPU's bandwidth, so you can afford to create rows of HBM. This suggests HBM wouldn't come anytime soon as a next step for consumer chips, but rather be squarely aimed at EPYC customers, where the rectangular package shape is a necessity anyway.

With the move to 6nm I think you're going to see a slight increase in core count per chip. They could move to 10, 12, 14, or 16 cores per die with the scaling work they've already done. They've created an architecture that can be scaled as needed. I know everyone likes iterations in powers of two, but on AMD's architecture you are no longer bound to filling out groups in powers of 2: they can turn cores on and off at will, and with Infinity Fabric it's simple to accomplish these incremental increases going forward.

Knowing the two factors above, the layout of the mockups would probably look a bit different IMHO. I could be wrong, but it seems to be evolving in that direction.

Not on 6nm you aren’t.

Also, AMD will likely stick with 8-core CCDs for the immediate future. Why? Because the cores actually grow in size every generation. AMD has margin targets they want to hit, and adding more cores while simultaneously growing said cores will make the chips more expensive to produce, decreasing margins.
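
To put rough numbers on that cost pressure, here's a minimal sketch using the standard dies-per-wafer approximation; the die sizes are hypothetical, purely to show the direction of the effect, not actual AMD figures:

```python
import math

def dies_per_wafer(die_area_mm2: float, wafer_diameter_mm: float = 300) -> int:
    """First-order approximation for usable dies on a round wafer."""
    r = wafer_diameter_mm / 2
    return int(math.pi * r ** 2 / die_area_mm2
               - math.pi * wafer_diameter_mm / math.sqrt(2 * die_area_mm2))

# Hypothetical CCD sizes: a ~80 mm^2 8-core CCD vs. a fatter ~110 mm^2
# 12-core CCD (illustrative numbers only).
for area in (80, 110):
    print(f"{area} mm^2 -> ~{dies_per_wafer(area)} dies per wafer")
# 80 mm^2 -> ~809 dies per wafer
# 110 mm^2 -> ~579 dies per wafer
```

Fewer dies per wafer at a roughly fixed wafer price is exactly the margin squeeze described above.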

They will likely try to fit additional CCDs in the package in the future, however.

AMD is excelling by building simple, scalable, flexible designs. Contrast that to Intel, who still hasn’t rolled out chiplets. Compare the number of Ryzen SKUs to Intel SKUs.
 

Doug S

Platinum Member
Feb 8, 2020
2,202
3,405
136
So word on the street is that TSMC is pushing everyone to 6nm due to the not-insignificant savings in die space/production output. 7nm and 6nm were "mostly" compatible. I bring this up because I suspect that Warhol is essentially Zen 3 on a smaller chip + vcache. AMD is prototyping vcache stacking on Zen 3, but the final product will ship on 6nm...and yes, it will be on AM4.

Why do you say "mostly" compatible? 6nm is an optical shrink of 7nm, so it should be 100% compatible with all 7nm designs.
 

eek2121

Platinum Member
Aug 2, 2005
2,904
3,906
136
I'd love for more cores to be added to the CCDs, but if you're talking about 6nm, that's not Zen 4, which is 5nm, right? So that would be a still-unknown Zen 3 refresh, but we all assume that to be the one with more cache added, so... would there be two Zen 3 refreshes, one with more cache and one with more cores? Or would the one with more cache also be 6nm (seems unlikely, since they're already talking about measurements being similar to Zen 3)?
Correct. Zen 4, and likely Zen 5, will be 5nm.
Why do you say "mostly" compatible? 6nm is an optical shrink of 7nm, so it should be 100% compatible with all 7nm designs.
6nm is not 100% compatible with 7nm according to what I have read. It is "easy" to port a design from 7nm to 6nm, but they aren't swappable without modification. If they were, AMD would have done it already, because the increased use of EUV and the slightly smaller design mean they could churn out chips faster.

That is why I think Warhol is still a thing. Warhol will probably be Zen 3 + vcache with possibly slightly faster clocks. AMD will pull another 20% gain out of thin air (thanks mostly to the vcache), and head off Alder Lake at the same time. All without jumping to 5nm.

I do hope AMD uses the extra time to build up a large supply of 5nm chips. I know warehousing is expensive, but launching Ryzen and RDNA with ample supply will help AMD take that next step.
 

Makaveli

Diamond Member
Feb 8, 2002
4,715
1,049
136
I do hope AMD uses the extra time to build up a large supply of 5nm chips. I know warehousing is expensive, but launching Ryzen and RDNA with ample supply will help AMD take that next step.

Wouldn't Apple factor into this, since they have been using up most of TSMC's 5nm capacity? Apple will have to move forward to free up 5nm for AMD's use.
 

Doug S

Platinum Member
Feb 8, 2020
2,202
3,405
136
I'm very skeptical you can stack L1. The length of the wires between dies is pretty significant even with thinned dies; the RC delay will kill your latency on the scale of an L1. There's no way you don't add at least one cycle of latency, and if you're willing to add a cycle of latency, you can just go bigger in the main die.

Stacking might benefit L2, but I don't believe it would be at all viable for L1.


Going back to this idea, there was an anonymous article written in 2007 (which gives an idea of how long such techniques take to go from research to practice) about 3D stacking technology, TSVs, etc. that mentioned an experiment Intel did in the P4 days. They found some cycle-latency benefit in face-to-face stacking (i.e. something that can only be done for two layers, but gets you minimal wire distance), where the L1D and ALUs faced each other, and the FP registers and FP units faced each other.

That's likely to be a more promising route than hoping to use this technology to make L1 bigger. It could be an even bigger win for registers given how many physical registers a modern core requires.

In the case of Intel's test-vehicle CPU, they actually had to reduce the operating frequency to avoid the hot spots that stacking created. But thanks to dropping the pipeline stages needed to handle worst-case routing from the edge of an ALU to the edge of L1, etc., overall power draw was reduced and performance increased. Such a technique may not have quite so dramatic an effect on modern CPUs, as they have much shorter pipelines than the P4.

https://www.realworldtech.com/3d-integration/ (see pages 7 & 8 for the discussion of Intel's P4 experiment)
 

Doug S

Platinum Member
Feb 8, 2020
2,202
3,405
136
6nm is not 100% compatible with 7nm according to what I have read. It is "easy" to port a design from 7nm to 6nm, but they aren't swappable without modification. If they were, AMD would have done it already, because the increased use of EUV and the slightly smaller design mean they could churn out chips faster.

Did some checking: N6 is an optical shrink of N7+, i.e. the N7 variant that has a few EUV layers. Thus a mask set built for the DUV-only N7 that AMD is using cannot be shrunk in that way, and would require more rework.

I'm not sure how well the DUV layers can be optically shrunk at these sizes either, because massive amounts of aberration have to be introduced into the mask to get the desired pattern from 193nm sources. So it is probably not as simple as a traditional optical shrink used to be, though once EUV is used through the full stack (maybe with N3?) that problem will go away for future optical shrinks.
 

Doug S

Platinum Member
Feb 8, 2020
2,202
3,405
136
Wouldn't Apple factor into this, since they have been using up most of TSMC's 5nm capacity? Apple will have to move forward to free up 5nm for AMD's use.

TSMC has to increase capacity, which they are doing, of course. Apple isn't "moving forward"; they will still be on N5 (N5P) for the next iPhone, so their use of N5 will actually increase as they stop selling the 2019 iPhones that use N7P (maybe they keep one around, but their sales mix will be overwhelmingly N5/N5P).

They won't move forward and start reducing their usage of N5 until next year when the phones/Macs using N3 arrive.
 

eek2121

Platinum Member
Aug 2, 2005
2,904
3,906
136
Wouldn't Apple factor into this, since they have been using up most of TSMC's 5nm capacity? Apple will have to move forward to free up 5nm for AMD's use.
TSMC has to increase capacity, which they are doing, of course. Apple isn't "moving forward"; they will still be on N5 (N5P) for the next iPhone, so their use of N5 will actually increase as they stop selling the 2019 iPhones that use N7P (maybe they keep one around, but their sales mix will be overwhelmingly N5/N5P).

They won't move forward and start reducing their usage of N5 until next year when the phones/Macs using N3 arrive.

Apple is sticking with N5 (or maybe N5P) this year, but next year they are reportedly moving to N3. AMD would be the biggest N5/N5P customer at that point, and I strongly suspect the supply situation will improve.
 

MadRat

Lifer
Oct 14, 1999
11,909
229
106
eek2121-

6nm is 85% of 7nm
5nm is 71% of 7nm

AMD built 8 cores per CCD long enough that they can afford to move to 12-16 cores with either step. More cores means fewer CCDs per package. If AMD truly intends to scale above 64 cores per package, then IMHO they need to make this move. Sticking with 8 cores will mean a shrink of the die without improved IFOP/IFIS. First-generation Infinity Fabric was meant to scale. The 64-core part is already sharing 42.6 GB/s between cores and 31 GB/s between packages. Throw in hyperthreading and you're guaranteed to be bandwidth-starving any cache architecture.
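
Taking those figures at face value, the per-core share is easy to work out (a quick sketch; it assumes the 42.6 GB/s is one CCD's IFOP link shared evenly by its 8 cores):

```python
# Per-core / per-thread share of one CCD's Infinity Fabric link,
# using the figures quoted above (42.6 GB/s per link, 8 cores, SMT-2).
ifop_bw_gbs = 42.6
cores_per_ccd = 8
threads_per_core = 2

per_core = ifop_bw_gbs / cores_per_ccd
per_thread = per_core / threads_per_core
print(f"per core:   {per_core:.2f} GB/s")    # ~5.3 GB/s
print(f"per thread: {per_thread:.2f} GB/s")  # ~2.7 GB/s
```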
 

maddie

Diamond Member
Jul 18, 2010
4,723
4,628
136
5 nm's shrinkage is far more than that. Genoa is max 12 dies with 8 cores each.
He did a straight % calculation.

In reality, even assuming for a second that the nm ratings were accurate, which they are not, especially for different circuitry, you would still need to do an area calculation:
6nm is 73% of 7nm
5nm is 51% of 7nm

I know you know this.
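
Spelled out, for anyone following along (a quick back-of-the-envelope that takes the marketing "nm" figures at face value, which, again, they are not):

```python
# Linear shrink has to be squared to get the area ratio.
for node in (6, 5):
    linear = node / 7           # e.g. 6/7 ~ 0.86, the "85%" figure above
    print(f"{node}nm area: {linear ** 2:.0%} of 7nm")
# 6nm area: 73% of 7nm
# 5nm area: 51% of 7nm
```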
 

MadRat

Lifer
Oct 14, 1999
11,909
229
106
Yep, I did not square the reduction.

But also figure that a die shrink bumps the potential of the IB by a relatively similar increase per core. So as the die shrinks, they need leaps in IB data rates to compensate.
 

Doug S

Platinum Member
Feb 8, 2020
2,202
3,405
136
Yep, I did not square the reduction.

But also figure that a die shrink bumps the potential of the IB by a relatively similar increase per core. So as the die shrinks, they need leaps in IB data rates to compensate.

Well, there's no need to do math; TSMC tells us exactly how much of an increase you get going from N7 to N5, and from N5 to N3. I don't recall the N7-to-N5 numbers off the top of my head, but N5 to N3 is 70% more transistors per area for logic and 20% for cache. I think logic scaling was a bit better from N7 to N5, and while I don't think they released the cache scaling as a separate number for that jump, we know it is around 30% based on A12 -> A14.
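
To see why the cache number matters so much for a cache-heavy die like a Zen CCD, here's a rough sketch; the 55/45 logic/SRAM split is made up purely for illustration:

```python
# Effective area of the same design after a shrink, when logic and
# SRAM scale differently (N5 -> N3 density figures cited above).
logic_gain = 1.70   # 70% more transistors per area for logic
sram_gain = 1.20    # 20% more for cache

logic_frac, sram_frac = 0.55, 0.45  # hypothetical die-area split

new_area = logic_frac / logic_gain + sram_frac / sram_gain
print(f"N3 area: {new_area:.0%} of the N5 die")  # ~70%, not 1/1.7 ~ 59%
```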
 

andermans

Member
Sep 11, 2020
151
153
76
Well, there's no need to do math; TSMC tells us exactly how much of an increase you get going from N7 to N5, and from N5 to N3. I don't recall the N7-to-N5 numbers off the top of my head, but N5 to N3 is 70% more transistors per area for logic and 20% for cache. I think logic scaling was a bit better from N7 to N5, and while I don't think they released the cache scaling as a separate number for that jump, we know it is around 30% based on A12 -> A14.

AFAICT TSMC claims 1.84x for logic. WikiChip estimates 1.3x for cache (https://fuse.wikichip.org/news/3398/tsmc-details-5-nm/, which says their own estimate for logic was 1.87x).
 

LightningZ71

Golden Member
Mar 10, 2017
1,627
1,898
136
It would seem to me that the only changes in an N6 shrink of Zen 3 would be to expand the L2 or increase some buffers. Once you start making big changes to logic, it complicates everything greatly. The vcache already makes a big difference in many loads. Expanding the L2 would make them more comparable to their competition in that regard.
 

jpiniero

Lifer
Oct 1, 2010
14,510
5,159
136
It would seem to me that the only changes in an N6 shrink of Zen 3 would be to expand the L2 or increase some buffers. Once you start making big changes to logic, it complicates everything greatly. The vcache already makes a big difference in many loads. Expanding the L2 would make them more comparable to their competition in that regard.

I would expect a straight shrink with no other changes if it ends up being on N6, if only due to Epyc.
 

lobz

Platinum Member
Feb 10, 2017
2,057
2,856
136
eek2121-

6nm is 85% of 7nm
5nm is 71% of 7nm

AMD built 8 cores per CCD long enough that they can afford to move to 12-16 cores with either step. More cores means fewer CCDs per package. If AMD truly intends to scale above 64 cores per package, then IMHO they need to make this move. Sticking with 8 cores will mean a shrink of the die without improved IFOP/IFIS. First-generation Infinity Fabric was meant to scale. The 64-core part is already sharing 42.6 GB/s between cores and 31 GB/s between packages. Throw in hyperthreading and you're guaranteed to be bandwidth-starving any cache architecture.
I don't agree with your reasoning at all, but that's no reason to downvote your comment.

Come on guys @SK10H and @scineram ... he's arguing in a perfectly civilized manner.
 

jamescox

Senior member
Nov 11, 2009
637
1,103
136
eek2121-

6nm is 85% of 7nm
5nm is 71% of 7nm

AMD built 8 cores per CCD long enough that they can afford to move to 12-16 cores with either step. More cores means fewer CCDs per package. If AMD truly intends to scale above 64 cores per package, then IMHO they need to make this move. Sticking with 8 cores will mean a shrink of the die without improved IFOP/IFIS. First-generation Infinity Fabric was meant to scale. The 64-core part is already sharing 42.6 GB/s between cores and 31 GB/s between packages. Throw in hyperthreading and you're guaranteed to be bandwidth-starving any cache architecture.

This one got too long again.

I am still thinking 8 cores for the CCX on Zen 4. They are probably going to increase the L2 cache and probably massively increase the floating-point units. Floating-point units, especially wide vector units supporting a lot of different data types, take up a lot of die area. I didn't really expect an increase in L3 on die; 32 MB is probably still plenty for the base chiplet. The data paths need to be significantly wider to support the increased floating-point width. All of that is going to eat into the die area savings when going to 5 nm, so it may be similar in size to 7 nm with no increase in core count.

If we don’t get some kind of stacked IO die / interposer / whatever, then I guess we may just get pci-express 5 speeds for infinity fabric connections between die with a lot of stacked cache.

For a die-stacked solution, I have been thinking that they might have a base die with the IO for just 1/4 of an Epyc processor, so 2 or 3 memory channels and 2 x16 pci-express 5 links. One fourth of an Epyc processor is basically a desktop Ryzen processor. It might have one chiplet on top that actually contains logic for IO; basically anything that would benefit from being made on 5 or 7 nm. The base die would contain a lot of the IO stuff that does not scale and would be made on an older TSMC process. It could then have 3 spaces for cpu chiplets, perhaps with some models containing gpu chiplets rather than cpu chiplets. Perhaps the gpu chiplet is 2x the size of the cpu chiplet, or some models just use 2 gpu chiplets to one cpu chiplet. I don't know if HBM makes sense here with all of the SRAM cache available.

To support 4 such devices for Epyc (2 for Threadripper, 1 for Ryzen), they would need either pci-express-style infinity fabric links or some TSMC 2.5D solution. They would need at least 3 links each for a fully connected topology, but possibly 4 for routing reasons; the original Zen 1 die had 4, with only 3 used. It would kind of look like Naples again, except with 4 separate interposers rather than 4 separate die. With such massive stacked caches, the bandwidth required of the infinity fabric network may not be that high, comparatively speaking, so staying with serial infinity fabric links might still make sense.
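
For reference, the "at least 3 links" figure is just full-mesh combinatorics (a minimal sketch):

```python
# Links for a fully connected mesh of n packages/interposers:
# each die talks directly to every other die.
def mesh_links(n: int) -> tuple[int, int]:
    per_die = n - 1
    total = n * (n - 1) // 2  # each link is shared by two endpoints
    return per_die, total

per_die, total = mesh_links(4)
print(f"4-way: {per_die} links per die, {total} links total")  # 3 per die, 6 total
```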

Another possibility is to use TSMC's local silicon interconnect (LSI). This is a (probably passive) die embedded in the package, partially under the other chips or interposers. They wouldn't even need TSVs in the LSI die if there is no off-package routing; in that case it is just a chip mounted upside down. This is similar to Intel's EMIB. It would possibly allow significantly lower power and higher bandwidth without the cost of a giant interposer.

They might also just have multiple interposer sizes, with all Epyc processors requiring a 1.5 to 2x reticle-sized interposer. They might be able to make a smaller one for 32 or maybe 64 cores, with only the 96-core part requiring more than 1x reticle size. A lot of the Epyc processors sold are 32-core or less, but they all have the same IO. It seems like it gets complicated and expensive to pull that off with large or different-sized interposers. The modularity of using a single type of smaller interposer seems like it makes sense, though.
 

eek2121

Platinum Member
Aug 2, 2005
2,904
3,906
136
eek2121-

6nm is 85% of 7nm
5nm is 71% of 7nm

AMD built 8 cores per CCD long enough that they can afford to move to 12-16 cores with either step. More cores means fewer CCDs per package. If AMD truly intends to scale above 64 cores per package, then IMHO they need to make this move. Sticking with 8 cores will mean a shrink of the die without improved IFOP/IFIS. First-generation Infinity Fabric was meant to scale. The 64-core part is already sharing 42.6 GB/s between cores and 31 GB/s between packages. Throw in hyperthreading and you're guaranteed to be bandwidth-starving any cache architecture.
5 nm's shrinkage is far more than that. Genoa is max 12 dies with 8 cores each.

As others have said, N5 is actually quite a bit denser, and N6 is supposed to be 18% denser. Node density wasn't what I was pointing out at all, however: AMD will need to make new chips bigger, which means we won't see die-area improvements.

Zen 4, for example, has a GPU included.
 

scineram

Senior member
Nov 1, 2020
361
283
106
This one got too long again.

I am still thinking 8 cores for the CCX on Zen 4. They are probably going to increase the L2 cache and probably massively increase the floating-point units. Floating-point units, especially wide vector units supporting a lot of different data types, take up a lot of die area. I didn't really expect an increase in L3 on die; 32 MB is probably still plenty for the base chiplet. The data paths need to be significantly wider to support the increased floating-point width. All of that is going to eat into the die area savings when going to 5 nm, so it may be similar in size to 7 nm with no increase in core count.

If we don’t get some kind of stacked IO die / interposer / whatever, then I guess we may just get pci-express 5 speeds for infinity fabric connections between die with a lot of stacked cache.

For a die-stacked solution, I have been thinking that they might have a base die with the IO for just 1/4 of an Epyc processor, so 2 or 3 memory channels and 2 x16 pci-express 5 links. One fourth of an Epyc processor is basically a desktop Ryzen processor. It might have one chiplet on top that actually contains logic for IO; basically anything that would benefit from being made on 5 or 7 nm. The base die would contain a lot of the IO stuff that does not scale and would be made on an older TSMC process. It could then have 3 spaces for cpu chiplets, perhaps with some models containing gpu chiplets rather than cpu chiplets. Perhaps the gpu chiplet is 2x the size of the cpu chiplet, or some models just use 2 gpu chiplets to one cpu chiplet. I don't know if HBM makes sense here with all of the SRAM cache available.

To support 4 such devices for Epyc (2 for Threadripper, 1 for Ryzen), they would need either pci-express-style infinity fabric links or some TSMC 2.5D solution. They would need at least 3 links each for a fully connected topology, but possibly 4 for routing reasons; the original Zen 1 die had 4, with only 3 used. It would kind of look like Naples again, except with 4 separate interposers rather than 4 separate die. With such massive stacked caches, the bandwidth required of the infinity fabric network may not be that high, comparatively speaking, so staying with serial infinity fabric links might still make sense.

Another possibility is to use TSMC's local silicon interconnect (LSI). This is a (probably passive) die embedded in the package, partially under the other chips or interposers. They wouldn't even need TSVs in the LSI die if there is no off-package routing; in that case it is just a chip mounted upside down. This is similar to Intel's EMIB. It would possibly allow significantly lower power and higher bandwidth without the cost of a giant interposer.

They might also just have multiple interposer sizes, with all Epyc processors requiring a 1.5 to 2x reticle-sized interposer. They might be able to make a smaller one for 32 or maybe 64 cores, with only the 96-core part requiring more than 1x reticle size. A lot of the Epyc processors sold are 32-core or less, but they all have the same IO. It seems like it gets complicated and expensive to pull that off with large or different-sized interposers. The modularity of using a single type of smaller interposer seems like it makes sense, though.
They need to drastically increase L1 as well to get that IPC uplift.
Also, my impression is that the GPU is integrated into the IOD; that is what the old image GN shared showed as well.