Speculation: Zen 4 (EPYC 4 "Genoa", Ryzen 7000, etc.)

Vattila

Senior member
Oct 22, 2004
Except for the details of the microarchitectural improvements, we now know pretty well what to expect from Zen 3.

The leaked presentation by AMD Senior Manager Martin Hilgeman shows that EPYC 3 "Milan" will, as promised and expected, reuse the current platform (SP3), and the system architecture and packaging look to be the same, with the same 9-die chiplet design and the same maximum core and thread count (no SMT-4, contrary to rumour). The biggest change revealed so far is the enlargement of the compute complex from 4 cores to 8 cores, all sharing a larger L3 cache ("32+ MB", likely to double to 64 MB, I think).

Hilgeman's slides also showed that EPYC 4 "Genoa" is in the definition phase (or was at the time of the presentation in September, at least), and will come with a new platform (SP5) and new memory support (likely DDR5).

What else do you think we will see with Zen 4? PCI-Express 5 support? Increased core-count? 4-way SMT? New packaging (interposer, 2.5D, 3D)? Integrated memory on package (HBM)?

Vote in the poll and share your thoughts! :)
 

jamescox

Senior member
Nov 11, 2009
eek2121-

6nm is 85% of 7nm
5nm is 71% of 7nm

AMD built 8 cores per CCD long enough that they can afford to move to 12-16 cores by either step. More cores means fewer CCDs per package. If AMD truly intends to scale above 64 cores per package then IMHO they need to make this move. Sticking to 8 cores will mean a shrink of the die without improved IFOP/IFIS. First-generation Infinity Fabric was meant to scale. The 64-core part is already sharing 42.6 GB/s between cores and 31 GB/s between packages. Throw in hyperthreading and you're guaranteed to be bandwidth-starving any cache architecture.
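As a rough back-of-envelope reading of those link numbers (the 42.6 GB/s per-CCD figure is taken from the quote above; the 8-core CCD and 2-way SMT split are my own assumptions, just to illustrate the per-core share):

# Per-core / per-thread share of one CCD's Infinity Fabric link.
# 42.6 GB/s comes from the quote; core and SMT counts are assumptions.
ifop_bw_gbs = 42.6
cores_per_ccd = 8
threads_per_core = 2
per_core = ifop_bw_gbs / cores_per_ccd
per_thread = per_core / threads_per_core
print(f"~{per_core:.1f} GB/s per core, ~{per_thread:.1f} GB/s per thread")
# prints: ~5.3 GB/s per core, ~2.7 GB/s per thread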

This one got too long again.

I am still thinking 8 cores for the CCX on Zen 4. They are probably going to increase the L2 cache, and there will probably be a massive increase in the floating point units. Floating point units, especially wide vector units supporting a lot of different data types, take up a lot of die area. I didn't really expect an increase in L3 on die; 32 MB is probably still plenty for the base chiplet. The data paths need to be significantly wider to support the increased floating point width. All of that is going to eat into the die area savings when going to 5 nm, so it may be a similar size to the 7 nm die with no increase in core count.
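To put very rough numbers on that area trade-off (everything here is an assumption for illustration: an ~80 mm^2 7 nm CCD, the 0.71 area scaling quoted above, and a ~30% guess for the extra FP units, wider data paths and bigger L2):

# Back-of-envelope CCD area: plain 5 nm shrink vs. shrink plus wider FP/L2.
# All inputs are assumed figures, not known specifications.
ccd_area_7nm_mm2 = 80.0      # assumed current 8-core CCD area
area_scale_5nm = 0.71        # the "5nm is 71% of 7nm" figure from the quote
extra_fp_l2_fraction = 0.30  # guess for wider vector units, data paths, L2
shrunk = ccd_area_7nm_mm2 * area_scale_5nm
grown = shrunk * (1 + extra_fp_l2_fraction)
print(f"plain shrink: ~{shrunk:.0f} mm^2, with wider FP/L2: ~{grown:.0f} mm^2")
# prints: plain shrink: ~57 mm^2, with wider FP/L2: ~74 mm^2 (back near 7 nm size)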

If we don't get some kind of stacked IO die / interposer / whatever, then I guess we may just get PCI-Express 5 speeds for the Infinity Fabric connections between dies, with a lot of stacked cache.

For a die-stacked solution, I have been thinking that they might have a base die with the IO for just 1/4 of an Epyc processor, so 2 or 3 memory channels and two x16 PCI-Express 5 links. One fourth of an Epyc processor is basically a desktop Ryzen processor. It might have one chiplet on top that actually contains logic for IO; basically anything that would benefit from being made on 5 or 7 nm. The base die would contain a lot of the IO stuff that does not scale and would be made on an older TSMC process. It could then have 3 spaces for CPU chiplets, perhaps with some models containing some GPU chiplets rather than CPU chiplets. Perhaps the GPU chiplet is 2x the size of the CPU chiplet, or it just uses 2 GPU chiplets to one CPU chiplet for some models. I don't know if HBM makes sense here with all of the SRAM cache available.
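For a feel of what one such quadrant's IO would add up to (the 3-channel and two-x16 split is from the idea above; the DDR5-4800 and PCI-Express 5.0 per-lane rates are my own assumed numbers):

# Rough IO bandwidth of one hypothetical EPYC "quadrant" base die.
ddr5_channels = 3
ddr5_gbs_per_channel = 4.8 * 8            # DDR5-4800: 4800 MT/s * 8 bytes
pcie5_x16_links = 2
pcie5_gbs_per_lane = 32 / 8 * (128 / 130) # 32 GT/s, 128b/130b encoding
mem_bw = ddr5_channels * ddr5_gbs_per_channel
io_bw = pcie5_x16_links * 16 * pcie5_gbs_per_lane
print(f"memory: ~{mem_bw:.0f} GB/s, PCIe 5.0: ~{io_bw:.0f} GB/s per direction")
# prints: memory: ~115 GB/s, PCIe 5.0: ~126 GB/s per direction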

To support 4 such devices for Epyc (2 for Threadripper, 1 for Ryzen), they would need either PCI-Express-style Infinity Fabric links or some TSMC 2.5D solution. They would need at least 3 links for a fully connected topology, but possibly 4 for routing reasons. The original Zen 1 die had 4, with only 3 used. It would kind of look like Naples again, except it would be 4 separate interposers rather than just 4 separate dies. With such massive stacked caches, the bandwidth required for the Infinity Fabric network may not be that high, comparatively speaking, so staying with serial Infinity Fabric links might still make sense.

Another possibility is to use TSMC's Local Silicon Interconnect (LSI). This is a (probably) passive die embedded in the package, partially under the other chips or interposers. They wouldn't even need TSVs in the LSI die if they don't need any off-package routing; it is just a chip mounted upside down in that case. This is similar to Intel's EMIB. It could allow significantly lower power and higher bandwidth without the cost of a giant interposer.

They might also just have multiple interposer sizes, rather than all Epyc processors requiring a 1.5 to 2x reticle-sized interposer. They might be able to make a smaller one for 32 or maybe 64 cores, with only the 96-core part requiring more than 1x reticle size. A lot of Epyc processors sold are 32 cores or fewer, but they all have the same IO. It seems like it gets complicated and expensive to pull that off with large or different-sized interposers. The modularity of using a single type of smaller interposer seems like it makes sense, though.
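Putting rough numbers on the interposer options (the ~26 mm x 33 mm reticle field is the standard limit; the per-quadrant interposer size is a guess purely for illustration):

# One big stitched interposer vs. four smaller per-quadrant interposers.
reticle_mm2 = 26 * 33                        # ~858 mm^2 single-exposure field
big_interposer_mm2 = 1.75 * reticle_mm2      # "1.5 to 2x reticle" from above
quadrant_interposer_mm2 = 0.6 * reticle_mm2  # assumed sub-reticle quadrant
print(f"one big interposer: ~{big_interposer_mm2:.0f} mm^2 (reticle stitching needed)")
print(f"four quadrant interposers: ~{4 * quadrant_interposer_mm2:.0f} mm^2 total, "
      f"each ~{quadrant_interposer_mm2:.0f} mm^2 (fits in one reticle)")
# prints roughly 1500 mm^2 vs. four ~515 mm^2 pieces: the modular option spends
# more total interposer area, but each piece stays under the reticle limit and
# can be reused for Threadripper/Ryzen.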
 

eek2121

Diamond Member
Aug 2, 2005
eek2121-

6nm is 85% of 7nm
5nm is 71% of 7nm

AMD built 8 cores per CCD long enough that they can afford to move to 12-16 cores by either step. More cores means fewer CCDs per package. If AMD truly intends to scale above 64 cores per package then IMHO they need to make this move. Sticking to 8 cores will mean a shrink of the die without improved IFOP/IFIS. First-generation Infinity Fabric was meant to scale. The 64-core part is already sharing 42.6 GB/s between cores and 31 GB/s between packages. Throw in hyperthreading and you're guaranteed to be bandwidth-starving any cache architecture.
5 nm's shrinkage is far more than that. Genoa is max 12 dies with 8 cores each.

As others have said, N5 is actually quite a bit more dense. N6 is supposed to be 18% more dense. Node density wasn’t what I was pointing out at all, however. AMD will need to make new chips bigger, which means we won’t see die area improvements.
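Those two ways of quoting scaling are actually consistent if the 85% / 71% figures above are read as area ratios; inverting them gives the density multiplier (the ~1.8x N5 logic-density number is TSMC's headline claim as I recall it, included only for comparison):

# Convert the quoted area ratios into density multipliers.
area_ratio_n6 = 0.85   # "6nm is 85% of 7nm" (read as area)
area_ratio_n5 = 0.71   # "5nm is 71% of 7nm" (read as area)
print(f"N6: {1 / area_ratio_n6:.2f}x density (~18% denser, as stated)")
print(f"N5: {1 / area_ratio_n5:.2f}x density from the 0.71 figure, "
      f"vs. ~1.8x in TSMC's logic-only headline claim")
# Real chips land between logic-only and SRAM/analog scaling, which is why
# the quoted 0.71 area figure is the more conservative one.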

Zen 4, for example, has a GPU included.
 

scineram

Senior member
Nov 1, 2020
This one got too long again.

I am still thinking 8 cores for the CCX on Zen 4. They are probably going to increase the L2 cache, and there will probably be a massive increase in the floating point units. Floating point units, especially wide vector units supporting a lot of different data types, take up a lot of die area. I didn't really expect an increase in L3 on die; 32 MB is probably still plenty for the base chiplet. The data paths need to be significantly wider to support the increased floating point width. All of that is going to eat into the die area savings when going to 5 nm, so it may be a similar size to the 7 nm die with no increase in core count.

If we don't get some kind of stacked IO die / interposer / whatever, then I guess we may just get PCI-Express 5 speeds for the Infinity Fabric connections between dies, with a lot of stacked cache.

For a die-stacked solution, I have been thinking that they might have a base die with the IO for just 1/4 of an Epyc processor, so 2 or 3 memory channels and two x16 PCI-Express 5 links. One fourth of an Epyc processor is basically a desktop Ryzen processor. It might have one chiplet on top that actually contains logic for IO; basically anything that would benefit from being made on 5 or 7 nm. The base die would contain a lot of the IO stuff that does not scale and would be made on an older TSMC process. It could then have 3 spaces for CPU chiplets, perhaps with some models containing some GPU chiplets rather than CPU chiplets. Perhaps the GPU chiplet is 2x the size of the CPU chiplet, or it just uses 2 GPU chiplets to one CPU chiplet for some models. I don't know if HBM makes sense here with all of the SRAM cache available.

To support 4 such devices for Epyc (2 for Threadripper, 1 for Ryzen), they would need either PCI-Express-style Infinity Fabric links or some TSMC 2.5D solution. They would need at least 3 links for a fully connected topology, but possibly 4 for routing reasons. The original Zen 1 die had 4, with only 3 used. It would kind of look like Naples again, except it would be 4 separate interposers rather than just 4 separate dies. With such massive stacked caches, the bandwidth required for the Infinity Fabric network may not be that high, comparatively speaking, so staying with serial Infinity Fabric links might still make sense.

Another possibility is to use TSMC's Local Silicon Interconnect (LSI). This is a (probably) passive die embedded in the package, partially under the other chips or interposers. They wouldn't even need TSVs in the LSI die if they don't need any off-package routing; it is just a chip mounted upside down in that case. This is similar to Intel's EMIB. It could allow significantly lower power and higher bandwidth without the cost of a giant interposer.

They might also just have multiple interposer sizes, rather than all Epyc processors requiring a 1.5 to 2x reticle-sized interposer. They might be able to make a smaller one for 32 or maybe 64 cores, with only the 96-core part requiring more than 1x reticle size. A lot of Epyc processors sold are 32 cores or fewer, but they all have the same IO. It seems like it gets complicated and expensive to pull that off with large or different-sized interposers. The modularity of using a single type of smaller interposer seems like it makes sense, though.
They need to drastically increase L1 as well to get that IPC uplift.
Also, my impression is that the GPU is integrated into the IOD. That is what the old image GN shared showed as well.
 

Kepler_L2

Senior member
Sep 6, 2020
It could be just that the Epyc/MI products are coming first and there isn't room to do the client products until later.
MI200 is coming this year; MI300 should be after RDNA3.

EPYC does release early for the big boys (Google, Facebook, Tencent, etc.), but general availability is usually after the desktop release. Lisa Su confirmed recently that Genoa will launch in 2022, so IMO we should expect Raphael in Q3 and Genoa in Q4.
 

exquisitechar

Senior member
Apr 18, 2017
MI200 is coming this year; MI300 should be after RDNA3.

EPYC does release early for the big boys (Google, Facebook, Tencent, etc.), but general availability is usually after the desktop release. Lisa Su confirmed recently that Genoa will launch in 2022, so IMO we should expect Raphael in Q3 and Genoa in Q4.
It has been rumored that Genoa will launch before Raphael, I think.
I think he's wrong but we'll see. This would mean over 3 years of development for Zen 4 and 2.5 years for RDNA3, much longer than usual for AMD.
I believe it's true that RDNA3 hasn't been taped out yet, at least. Unfortunately, the rest might be true too.
 

CakeMonster

Golden Member
Nov 22, 2012

This guy seems to think Zen 4 (desktop?) will be Q4 2022.

My bet would also be Q4 '22, since we have confirmation of a Zen 3 with more cache. That gives them time to get everything right, build up some stock, and account for unforeseen problems.

I'm not making shit up for Twitter views, though; I'm just a clueless hardware fan who goes with what makes the most sense given my very limited knowledge.
 

maddie

Diamond Member
Jul 18, 2010
I think he's wrong but we'll see. This would mean over 3 years of development for Zen 4 and 2.5 years for RDNA3, much longer than usual for AMD.
TSMC 5nm-on-5nm die stacking won't be available until Q3 2022. The consensus is that this is the future, so this is expected. In fact, this is the earliest timeframe possible if Zen 4 has die stacking as standard and not as an add-on.

Production timeframes are limited not only by the design, but also by whether the product can actually be made.
 

DisEnchantment

Golden Member
Mar 3, 2017
It is theoretically possible that there are Zen4 versions of EPYC without v-cache.
I would say this is almost guaranteed.
Lambda services, load balancers, proxy servers, REST/gRPC API gateways, etc. scale well with cores. I'm not sure the folks hosting such services want to pay a premium for cache-heavy SKUs which won't bring any gain over regular EPYC 7002/7003-type cache SKUs.
I would even note that Altra cut the cache for their 128-core chip, which is squarely aimed at nginx-type loads.
The weaker chip, with more cores and less cache, did well in such tests on Phoronix.
 

Doug S

Platinum Member
Feb 8, 2020
TSMC 5nm-on-5nm die stacking won't be available until Q3 2022. The consensus is that this is the future, so this is expected. In fact, this is the earliest timeframe possible if Zen 4 has die stacking as standard and not as an add-on.

Production timeframes are limited not only by the design, but also by whether the product can actually be made.


If they are waiting a full two years after the initial 5nm ramp to make 5nm die stacking available, that would 100% be because of customer scheduling. If Apple was going to use it, they would have had it available much earlier, so you can probably use it to tell when those AMD products will ship (though it is possible they have other customers wanting to use it that we don't know about).
 

maddie

Diamond Member
Jul 18, 2010
If they are waiting a full two years after the initial 5nm ramp to make 5nm die stacking available, that would 100% be because of customer scheduling. If Apple was going to use it, they would have had it available much earlier, so you can probably use it to tell when those AMD products will ship (though it is possible they have other customers wanting to use it that we don't know about).
I actually don't understand your points. What was the time lag for 7nm? AMD designed the CCDs for it knowing that it would only be available for use in Q4 2021. It takes time to R&D new techniques. Why would you think that they could have done it earlier but held back? Someone could use that argument for almost everything. Why 3D stacking only now? Why chiplets only recently? Why so many other things?
 

DrMrLordX

Lifer
Apr 27, 2000
I would say this is almost guaranteed.
Lambda services, load balancers, proxy servers, REST/gRPC API gateways, etc. scale well with cores. I'm not sure the folks hosting such services want to pay a premium for cache-heavy SKUs which won't bring any gain over regular EPYC 7002/7003-type cache SKUs.
I would even note that Altra cut the cache for their 128-core chip, which is squarely aimed at nginx-type loads.
The weaker chip, with more cores and less cache, did well in such tests on Phoronix.

Bear in mind that, in the case of v-cache, it isn't a matter of sacrificing area that could be used for cores in favor of cache (as was the case with the Altra). It's more as @moinmoin indicated - waiting for validation. AMD can get you Genoa today without v-cache, or you can wait a year and get it with v-cache if your workload would actually benefit from the extra L3.

Not all workloads benefit from L3.

Genoa DOES offer an increase in core count vs. Milan, so it isn't necessarily a choice between Genoa w/out v-cache vs. Genoa with v-cache. It's a matter of choosing between Milan-X and Genoa-not-X.
 

DisEnchantment

Golden Member
Mar 3, 2017
1,747
6,598
136
Bear in mind that, in the case of v-cache, it isn't a matter of sacrificing area that could be used for cores in favor of cache (as was the case with the Altra). It's more as @moinmoin indicated - waiting for validation. AMD can get you Genoa today without v-cache, or you can wait a year and get it with v-cache if your workload would actually benefit from the extra L3.

Not all workloads benefit from L3.

Genoa DOES offer an increase in core count vs. Milan, so it isn't necessarily a choice between Genoa w/out v-cache vs. Genoa with v-cache. It's a matter of choosing between Milan-X and Genoa-not-X.
Mmmm ... I am not sure I understood the relation between Milan-X and Genoa, but what I am trying to say is that there are lots of customers who would be interested in a plain Genoa without the V-Cache, especially if it comes at a lower cost. The whole point of V-Cache is to have another tool (like IF for scaling cores) to scale the end product, be it cache, core count, and so on.

Therefore I believe AMD will definitely offer such a high-core-count Genoa SKU without V-Cache, because it is suitable for many common loads.
We have a whole bunch of services on Azure/AKS that do nothing but authenticate and process requests/responses to/from the worker nodes within our DMZ, performing only the bare minimum of operations and processing at most a dozen bytes of data. Changing the instance type does nothing for us; changing the vCPU count makes a difference. For such a service I would select a SKU for my Azure subscription that allows me the highest vCPU count possible, which is what we did.