Discussion Speculation: Zen 4 (EPYC 4 "Genoa", Ryzen 7000, etc.)


Vattila

Senior member
Oct 22, 2004
799
1,351
136
Apart from the details of the microarchitectural improvements, we now have a pretty good idea of what to expect from Zen 3.

The leaked presentation by AMD Senior Manager Martin Hilgeman shows that EPYC 3 "Milan" will, as promised and expected, reuse the current platform (SP3), and the system architecture and packaging look to be the same, with the same 9-die chiplet design and the same maximum core and thread count (no SMT-4, contrary to rumour). The biggest change revealed so far is the enlargement of the compute complex from 4 cores to 8 cores, all sharing a larger L3 cache ("32+ MB", likely to double to 64 MB, I think).

Hilgeman's slides also showed that EPYC 4 "Genoa" is in the definition phase (or was at the time of the presentation in September, at least), and will come with a new platform (SP5) and new memory support (likely DDR5).



What else do you think we will see with Zen 4? PCI-Express 5 support? Increased core-count? 4-way SMT? New packaging (interposer, 2.5D, 3D)? Integrated memory on package (HBM)?

Vote in the poll and share your thoughts! :)
 
Last edited:
  • Like
Reactions: richardllewis_01

jpiniero

Lifer
Oct 1, 2010
14,510
5,159
136
Also, this is not an either/or: AMD can still (continue to) offer packages with fewer cores, which as a result have more TDP headroom for higher frequencies. All of its top-end products are "hampered" by the TDP limit (even consumer chips like the 3950X and 5950X), but as a result those are also more energy efficient; that's part of the balance customers can choose between.

AMD could keep the AM5 core count static while increasing core counts on Threadripper and Epyc. Because of Intel's big.LITTLE, I could see AMD wanting to go beyond 8 cores on mobile, but 16 cores on desktop will be good for a while.
 

soresu

Platinum Member
Dec 19, 2014
2,617
1,812
136
AMD could keep the AM5 core count static while increasing core counts on Threadripper and Epyc. Because of Intel's big.LITTLE, I could see AMD wanting to go beyond 8 cores on mobile, but 16 cores on desktop will be good for a while.
I personally think this is the way they will go, keeping AMx for low- to high-end APUs and TRx for enthusiast/creator core counts.

Having said that, they will need a chunky GPU in AMx to justify that: at least 32 CUs, preferably with HBM2/3.
 

eek2121

Platinum Member
Aug 2, 2005
2,904
3,906
136
AMD could have a hybrid chip with a smaller 5nm die (or 2) and a larger 7nm die. It's curious that their previous roadmaps said nothing about desktop 5nm...
 

inf64

Diamond Member
Mar 11, 2011
3,685
3,957
136
I expect Zen 4 to be a very similar project to Zen 2. It will widen the FP and load/store units by 2x and (most likely) double the cores per chiplet. I hope AMD will go the Zen 3 route with regard to the CCX and share a huge pool of (64 MB?) L3 cache among 16 Zen 4 cores. Similarly to Zen 2, I think they will aim at around a 15% IPC jump versus Zen 3; this would leave Zen 5 with the very optimistic (but obviously achievable) target of ~21% IPC improvement over Zen 4, if they were to keep the ~40% increases between their "tocks" (EX->Zen 1; Zen 1->Zen 3; Zen 3->Zen 5?).
There are some rumors of further chiplet-design evolution and possibly massive (L4?) caches, a new memory controller with DDR5 support, a shrink of the IOD, etc. Zen 4 definitely looks like the next big core-count increase and a major platform update.
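To make the arithmetic explicit, here is a back-of-envelope sketch; the per-generation percentages are just the speculated figures above, not known numbers:

```python
# Back-of-envelope only: compounding the speculated per-generation IPC gains.
zen3_to_zen4 = 1.15   # ~15% IPC over Zen 3 (speculation)
zen4_to_zen5 = 1.21   # ~21% IPC over Zen 4 (speculation)

tock_to_tock = zen3_to_zen4 * zen4_to_zen5
print(f"Zen 3 -> Zen 5: {tock_to_tock:.2f}x")   # ~1.39x, i.e. roughly the ~40% "tock" cadence
```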
 

DisEnchantment

Golden Member
Mar 3, 2017
1,590
5,722
136
AMD could have a hybrid chip with a smaller 5nm die (or 2) and a larger 7nm die. It's curious that their previous roadmaps said nothing about desktop 5nm...
AMD's patent for their hybrid chip is quite interesting. They rely on an illegal-instruction exception to wake up the big cores and transfer the register state to them, avoiding dependence on the OS scheduler like current hybrid designs. It sounds like a bad idea from a security perspective on server chips, though, so it will probably be client-only.
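Purely as an illustration of that flow (every name below is made up; this is a conceptual sketch, not the patent's actual mechanism):

```python
# Hypothetical sketch of exception-driven core migration, as described above.
LITTLE_ISA = {"add", "load", "store"}        # ops the small core implements
BIG_ISA = LITTLE_ISA | {"avx512_fma"}        # the big core implements everything

class IllegalInstruction(Exception):
    def __init__(self, insn, state):
        self.insn, self.state = insn, state

def run_little(program, state):
    """Run on the small core until it hits an op it doesn't implement."""
    for pc, insn in enumerate(program):
        if insn not in LITTLE_ISA:
            raise IllegalInstruction(insn, dict(state, pc=pc))   # trap, carrying register state
        state["retired"] += 1

def migrate_to_big(exc):
    """Trap handler: wake a big core and hand it the architectural state; no OS scheduler involved."""
    assert exc.insn in BIG_ISA
    state = exc.state            # register state travels with the exception
    state["retired"] += 1        # big core re-executes the faulting instruction and continues
    return state

state = {"retired": 0}
try:
    run_little(["add", "load", "avx512_fma", "store"], state)
except IllegalInstruction as e:
    state = migrate_to_big(e)
print(state)
```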
 

jamescox

Senior member
Nov 11, 2009
637
1,103
136
Yup. With AM5, I imagine they go larger on the socket size. Combined with the process shrink to 5 nm, we can still keep a lower transistor density while cramming a lot more onto the chip. For the other sockets I don't expect a significant change.

32c/64t 6950X
128c/256t 6990X
128c/256t or 128c/512t Epyc 7xx3

Fun times ahead when you consider the IPC/power consumption improvements expected.
I don't see why the package would need to be larger. I also doubt that they will scale the core counts that much for desktop parts. They might be able to double it again, but I don't think it is necessary for desktop; not sure if you are being serious. It will not have enough memory channels to support huge core counts, although with enough cache that may not be an issue. Genoa will probably use chip stacking of some kind, so there may not even be a discrete IO die for Genoa. It could be an active interposer with CPU die stacked on top, or some other form of chip stacking. TSMC has many different types of chip stacking available now or coming soon (posted several times somewhere here).


For Epyc, they may have a variant that actually stacks more than one CPU die on top of another, or stacks cache with CPU die. TSMC has chip-stacking tech that does not use micro solder balls, which has better thermal characteristics than stacking that does. They could easily make Epyc processors optimized for frequency with one or a small number of layers, and other processors optimized for core count with multiple layers of CPU die.

I don't know if they would have different stacked and non-stacked variants. Non-stacked variants would be cheaper, with better thermal characteristics, for low-core-count devices. If they use chip stacking for desktop parts, then the footprint may actually be significantly smaller than the current footprint for 2 CPU die and an IO die, so the socket wouldn't need to be larger. They could also use LSI tech, which uses a smaller piece of silicon embedded in the package (with multiple die overlapping it) rather than a full interposer. In that case, the chips would be smaller and closer together. The smallest form factor (and most expensive) would be a full active interposer; circuitry to drive external interfaces, which requires larger transistors, would be in the active interposer. It may still have some IO die stacked on top. You could perhaps have a memory-controller chiplet with cache at 7 nm while the physical interface is in the active interposer.

If Genoa actually does use stacked CPU die, then I would expect a significant core-count increase would be possible. The IFOP (on-package, 32-bit wide SerDes running at 4x the IO-die clock to roughly match the 256-bit internal IO-die pathways) would probably be replaced with vertical connections that could be ~1024 bits wide. A single HBM stack uses a 1024-bit interface. That would reduce latency and increase bandwidth significantly. It would also reduce power consumption significantly vs. what would be needed to achieve similar speeds with PCIe 5-clocked SerDes links.
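As a rough sketch of the bandwidth side of that argument (illustrative numbers only, taking the widths above at face value):

```python
# Illustrative only: per-fabric-clock link widths, using the figures quoted above.
ifop_effective_bits = 256     # the IFOP SerDes roughly matches the 256-bit internal fabric path
vertical_link_bits  = 1024    # an HBM-style stacked interface is 1024 bits wide

print(vertical_link_bits / ifop_effective_bits)   # => 4.0: ~4x the per-clock bandwidth,
                                                  # before counting the latency and pJ/bit savings
```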

Speculating about what route they will go when chip stacking is used is very difficult. They could have a wide range of Epyc products with many different types of chiplets, possibly HBM or even full HBM gpus integrated into the package. The area needed for the cpus could be quite small with different forms of chip stacking, so there could possibly be room for an HBM gpu on either side. Package power consumption would be very large and it may limit the size of gpu that could be used.
 

jamescox

Senior member
Nov 11, 2009
637
1,103
136
I expect Zen 4 to be a very similar project to Zen 2. It will widen the FP and load/store units by 2x and (most likely) double the cores per chiplet. I hope AMD will go the Zen 3 route with regard to the CCX and share a huge pool of (64 MB?) L3 cache among 16 Zen 4 cores. Similarly to Zen 2, I think they will aim at around a 15% IPC jump versus Zen 3; this would leave Zen 5 with the very optimistic (but obviously achievable) target of ~21% IPC improvement over Zen 4, if they were to keep the ~40% increases between their "tocks" (EX->Zen 1; Zen 1->Zen 3; Zen 3->Zen 5?).
There are some rumors of further chiplet-design evolution and possibly massive (L4?) caches, a new memory controller with DDR5 support, a shrink of the IOD, etc. Zen 4 definitely looks like the next big core-count increase and a major platform update.
It will get difficult to scale the cache larger without increasing latency, so some form of L4 may be more likely. I don't think they are going to jump to a 16-core CCX right after going to an 8-core one. It may be possible that they would make a 16-core chiplet with 2 CCXs on one die. I expect Zen 4 to be very similar to Zen 3. Zen 3 is a new architecture, so I don't think we will see huge changes to most of the functionality. Using stacked chips allows for much higher bandwidth, so I wouldn't be surprised to see internal pathways widened significantly and much increased FP performance. Stacked chips can easily use 1024-bit links; a single HBM stack is 1024 bits, so I am wondering if internal paths will actually go up to 1024 bits to match.
 

DisEnchantment

Golden Member
Mar 3, 2017
1,590
5,722
136
There is a list of changes to be upstreamed to enable LLVM to emit SPIRV IR code.

The bigger question is how to integrate these changes from Intel if another IHV is doing the work in parallel.
I checked the diffs; not that big to me (granted, we work with several codebases of more than 30,000 kloc each). Even AMD's downstream ROCm LLVM fork has a 35K+ line diff from upstream, and they are constantly issuing PRs, almost 5 a day, to get it all in.
Consuming it with an OpenCL runtime is the easier part.

These days almost everything uses the LLVM infrastructure.
ROCm does too. For AMD's part, they are also making a lot of new proposals for ELF/DWARF and new tooling for debugging heterogeneous systems.

Update:
Looks like Intel has been upstreaming these LLVM changes indeed. Kudos to Intel.
You can check the meeting notes; the last one is from two weeks ago.
Well, things went faster than I imagined.
In today's Q3 call, Hans Mosesman specifically asked about the SW stack for HPC, and Lisa and Victor Peng (Xilinx CEO) said Xilinx has a great SW stack for compute (for FPGAs) and that they will bring it together with ROCm at some point.
And guess what: the committee for SYCL is actually Intel, Xilinx, Codeplay and Argonne. One of the major contributors to LLVM for SYCL besides Intel is Xilinx (Keyrell).
So in a quick turn of events, it seems it will basically be AMD contributing to SYCL for FPGAs via Xilinx.
 

randomhero

Member
Apr 28, 2020
180
247
86
Here is some speculation. Given AMD's recent graphics reveal and Infinity Cache, and that the Zen architecture was the inspiration for it, I think Zen 4 could have an L4 cache.
The CCD would have two octa-core CCXs, each with its own 32 MB L3 cache, both connected to a CCD-wide 64 MB L4 cache.
This way you get to reuse the Zen 3 topology but also save on latency. Also, cache, or to be more precise SRAM, is getting "cheaper" on smaller nodes.
I am not an EE, so this could be nonsense, but I thought it could spark some discussion.
 
  • Like
Reactions: Vattila

Kryohi

Junior Member
Nov 12, 2019
16
17
81
Here is some speculation. Given AMD's recent graphics reveal and Infinity Cache, and that the Zen architecture was the inspiration for it, I think Zen 4 could have an L4 cache.
The CCD would have two octa-core CCXs, each with its own 32 MB L3 cache, both connected to a CCD-wide 64 MB L4 cache.
This way you get to reuse the Zen 3 topology but also save on latency. Also, cache, or to be more precise SRAM, is getting "cheaper" on smaller nodes.
I am not an EE, so this could be nonsense, but I thought it could spark some discussion.
Navi's 128 MB cache is only ~86 mm², so a cache chiplet that size would be feasible, even more so if shrunk to the 6N node.
I wonder if the performance improvements in consumer workloads would make it worthwhile, though.
 

Ajay

Lifer
Jan 8, 2001
15,332
7,792
136
Here is some speculation. Given AMD's recent graphics reveal and Infinity Cache, and that the Zen architecture was the inspiration for it, I think Zen 4 could have an L4 cache.
The CCD would have two octa-core CCXs, each with its own 32 MB L3 cache, both connected to a CCD-wide 64 MB L4 cache.
This way you get to reuse the Zen 3 topology but also save on latency. Also, cache, or to be more precise SRAM, is getting "cheaper" on smaller nodes.
I am not an EE, so this could be nonsense, but I thought it could spark some discussion.
Well, the Infinity Cache on Navi 2 uses 6T memory, vs. the 8T memory used in CPUs. So it seems 'cheaper' on an area basis, but it's not.
 

randomhero

Member
Apr 28, 2020
180
247
86
Well, the Infinity Cache on Navi 2 uses 6T memory, vs. the 8T memory used in CPUs. So it seems 'cheaper' on an area basis, but it's not.
But it should be "cheaper" on 5 nm, shouldn't it?
Also, after seeing that Zen 3 is bandwidth-starved in quite a few MT workloads, I am even more sure that the amount of cache will go up despite DDR5 being introduced with Zen 4.
I don't know; is 6T SRAM performant enough for an L4 in a CPU?
 

Ajay

Lifer
Jan 8, 2001
15,332
7,792
136
But it should be "cheaper" on 5 nm, shouldn't it?
Also, after seeing that Zen 3 is bandwidth-starved in quite a few MT workloads, I am even more sure that the amount of cache will go up despite DDR5 being introduced with Zen 4.
I don't know; is 6T SRAM performant enough for an L4 in a CPU?
Not necessarily cheaper on 5 nm, since the wafers will cost more. I'd be shocked if Zen 4 doesn't use a larger L3 cache to keep pushing performance forward, along with contributions from DDR5.
I think an added L4 wouldn't be worth its cost in transistors, aside from some specific loads. eDRAM offers higher density (IIRC), but given that Intel customers didn't buy many eDRAM-enabled CPUs, I think the poor perf/$ rules out its use on the desktop. Adding more/faster DRAM channels for server CPUs seems to make more sense, because the cost of all that ECC buffered DRAM dwarfs the cost of adding more channels. This is just my opinion; I don't have any maths to back up my claim, just historical trends.
 

jamescox

Senior member
Nov 11, 2009
637
1,103
136
Navi's 128 MB cache is only ~86 mm², so a cache chiplet that size would be feasible, even more so if shrunk to the 6N node.
I wonder if the performance improvements in consumer workloads would make it worthwhile, though.

At this point, I am wondering if Milan will actually have a new IO die with Infinity Cache. It would certainly obsolete most of the Intel competition. There is a good chance that Infinity Cache is a very modular Infinity Fabric interface, so there is some possibility that they can reuse most of the design in other products. I don't know if they can use that on a GlobalFoundries process without making the die very large. Perhaps not an issue, though. The current IO die is 435 mm².

The slides showing the Infinity Cache do not show a nice discrete square that could be put on a separate chip, so I assume that it is tied in with the memory controllers / Infinity Fabric. The Infinity Cache slide appears to show it sitting between the memory controllers along both long edges and the GPU in the middle. With the GPUs supporting full CPU-like virtual memory and possibly cache coherency, is it plausible that almost the same unified memory controller and Infinity Fabric interface could be used in the GPU and the IO die? The physical interfaces and such are obviously different, but the UMC and Infinity Fabric should be independent of memory type. The Epyc memory system is actually 512 bits wide, with 8x 64-bit channels. Big Navi is 8x 32-bit channels for a 256-bit width, but at a higher transfer rate with GDDR6. I saw some Navi slides showing 4x 64-bit memory controllers, so they may group them into 64-bit channels internally. They need to upgrade the design to support DDR5 anyway, so they may need to go wider or faster internally to compensate. I don't know if it is plausible that they could use the same design, but with how big AMD has been on design reuse, modularity, etc., I have to wonder. Calling it "Infinity Cache" seems to indicate that it may be used for more than just this one GPU.
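For a rough sense of the scale difference (example configurations, not confirmed specs: DDR4-3200 on the EPYC side and 16 Gbps GDDR6 on Big Navi):

```python
# Back-of-envelope peak bandwidth, with assumed (not confirmed) transfer rates.
epyc_bw = 8 * 64 * 3.2e9 / 8 / 1e9    # 8x 64-bit channels of DDR4-3200 -> ~205 GB/s
navi_bw = 256 * 16e9 / 8 / 1e9        # 256-bit bus of 16 Gbps GDDR6    -> ~512 GB/s

print(epyc_bw, navi_bw)   # a ~2.5x gap, hence a shared controller design would need to be
                          # wider internally or clocked higher on the GDDR6 side
```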

If it isn't used in Milan, then perhaps it will show up in Genoa. It is very difficult to speculate about Zen 4 since they will probably be using chip stacking. With chip stacking, they could do something like make a 16-core chiplet (likely 2x 8-core CCX) with no L3 and then stack an L3 cache chip on top. That allows an optimal process to be used for the cache and the CPU logic. They may just make a 16-core chiplet with L3 but have L4 Infinity Cache in an active interposer. I don't think they will go to a 16-core CCX; it is just difficult to scale the access latency across a monolithic cache with that many cores, slices, and that size. Intel's mesh network attempts to do that, but it does not achieve latency that low and seems to take a lot of power to send data long distances at high speed.
 

jamescox

Senior member
Nov 11, 2009
637
1,103
136
RDNA3 should be chiplet-based, so AMD could produce one Infinity Cache chip and use it in both EPYC and RDNA3; the Infinity Cache could be produced on 12nm/12nm+ at GF?
This doesn't look like something that would end up on a separate chiplet by itself. If they use it I would expect it to be integrated into the memory controller or infinity fabric, whether that is a single IO die or some kind of stacked thing. They could do all kinds of things with chip stacking though, so who knows.

The images in the GPU slides show the memory controllers down both long edges of the gpu die with the infinity cache on both sides. I am assuming that it integrates into the infinity fabric or just kind of sits in front of the memory controllers. Perhaps it has slices that handle caching for each memory channel separately. If they use an active interposer for Genoa then it could be in the interposer. If they use chiplets stacked on a passive interposer, then it could be in the chiplet(s) containing the unified memory controllers (UMC) and/or infinity fabric.

The stacked solution with an active interposer on a larger process tech could use an interposer with only the PHY (physical layer) interfaces, since these require larger transistors anyway to drive external interfaces, and they do not scale much with process shrinks. Eight DDR4/5 64-bit channels and eight PCI-Express 4/5 x16 links is a lot of physical interfaces. The actual Infinity Fabric crossbar switch, UMC, and cache could then be done as smaller 5 or 7 nm chiplets stacked on top of, or even under, CPU chiplets. It could also be in an embedded piece of silicon under some of the chips. TSMC has a lot of different stacking technologies. Some of them allow HBM to be done with a smaller embedded piece of silicon that sits under the HBM die and only under a small portion of the GPU or other die; this is much cheaper than a giant interposer that must fit everything. Some of AMD's X3D slides have shown what look like 4 CPU chiplets stacked on an interposer with GPUs on either side. I am thinking that they will have a 16-core die, with some products stacking multiple CPU chiplets on top of each other for very high core-count devices.

RDNA3 may be chiplet-based, so it may have memory/IO designs similar to EPYC; it just needs GDDR6 or HBM PHYs rather than DDR4/5. Some AMD Infinity Architecture slides have shown CDNA GPUs with a large number of Infinity Fabric links. In the one slide I have seen, it looked like 8 GPUs connected in a ring (perhaps using 2 links) and then something like 6 other links to the other GPUs; it wasn't fully connected. The EPYC IO die does technically have a 512-bit wide memory interface. GDDR6 is something like 4x the speed of DDR4, but at half the width in 6000-series GPUs (256-bit), so to use the same or a similar design it would need to be 2x wider internally in the GPU, or run at 2x the clock, compared to an EPYC IO die made for DDR4. For DDR5, the existing 2x width may be sufficient. I guess we could see divergence of chiplets (CPU chiplets, RDNA chiplets, CDNA chiplets, and maybe even FPGA chiplets) but convergence of the memory/IO system.

I don’t know if they can easily mix devices from GF in TSMC chip stacked devices. A lot of that will probably need to be done completely in house at TSMC. Perhaps the desktop cpus will continue to use a GF IO die and no chip stacking. It isn’t really required for desktop cpus. They don’t need huge core counts or huge amounts of IO. I kind of hope that we see infinity cache with Milan. It could increase bandwidth and reduce power consumption significantly due to much lower number of memory accesses.
 
  • Like
Reactions: lightmanek

DisEnchantment

Golden Member
Mar 3, 2017
1,590
5,722
136
I don’t know if they can easily mix devices from GF in TSMC chip stacked devices. A lot of that will probably need to be done completely in house at TSMC. Perhaps the desktop cpus will continue to use a GF IO die and no chip stacking. It isn’t really required for desktop cpus. They don’t need huge core counts or huge amounts of IO. I kind of hope that we see infinity cache with Milan. It could increase bandwidth and reduce power consumption significantly due to much lower number of memory accesses.
Milan-based EPYC "Trento" for Frontier will get a special IO die with stacked memory. It is not far away (3 quarters out, or even earlier); it is going to be the first application of X3D.
Charlie reported on it some time ago, about the silicon coming back from the fab/assembly (paywalled).

We should expect new sockets for server and desktop to have some adjustments in area and height to accommodate these stacked chips from Zen 4 onwards.
I think FP6 laptop parts will still be 8-core and monolithic, but desktop parts might see some of the stacked silicon, according to Rick Bergman.

“It's certainly for our highest-end server type parts and so on -- we're looking at that -- but we're also looking at it for PC products as well, for the reasons that I cited.”

That said, I still wonder how they will achieve this, because the parts using an interposer will drop the SerDes and route directly via the microbumps, whereas the parts without the interposer will need to keep the SerDes blocks.
Not to mention they will need an RDL to lay out the bumps for the ones that will be stacked.
Which basically means they will need different IODs and CCDs for stacked and non-stacked parts.
Also, this time I hope they have different CCDs for server and desktop. Those server dies have way too much extra stuff that sees no use in the desktop space.
 

dr1337

Senior member
May 25, 2020
309
503
106
Also, this time I hope they have different CCDs for server and desktop. Those server dies have way too much extra stuff that sees no use in the desktop space.
Eh, doesn't seem all that likely to me, as their margins come from using the same chiplet across multiple product ranges. Also, idk what makes you think the CCD has too much extra stuff on it. Frankly, I love Ryzen for the extra IO and broad feature support.
 

IntelUser2000

Elite Member
Oct 14, 2003
8,686
3,785
136
Well, the Infinity Cache on Navi 2 uses 6T memory, vs. the 8T memory used in CPUs. So it seems 'cheaper' on an area basis, but it's not.

It's a bit late, but L3 SRAM likely still uses 6T. Intel transitioned to 8T in L1 and L2 caches during Nehalem. There's zero indication they moved to 8T SRAM for the L3 caches.

I suspect the same is true for AMD.
 
  • Like
Reactions: randomhero

randomhero

Member
Apr 28, 2020
180
247
86
It's a bit late, but L3 SRAM likely still uses 6T. Intel transitioned to 8T in L1 and L2 caches during Nehalem. There's zero indication they moved to 8T SRAM for the L3 caches.

I suspect the same is true for AMD.
Hm... So... OK, bear with me.
Zen 4 will probably be an evolution of Zen 3 on the 5 nm node, so we could expect AMD to reuse the Zen 3 topology. They also have experience with Zen 2's dual-CCX CCD. Bandwidth between the IOD and CCD has to go up, but they need to preserve low latency, both IOD to CCD and CCX to CCX, hence the L4. I believe it will be 64 MB, basically double the L3s in a CCD with 2 CCXs.
I expect that clocks will be lower, but that will be more than compensated for by better IPC and the additional bandwidth from the caches and DDR5. So basically the same single-thread performance but much better multithreaded performance.
 

DisEnchantment

Golden Member
Mar 3, 2017
1,590
5,722
136
Eh, doesn't seem all that likely to me as their margins come from using the same chiplet on multiple product ranges. Also idk what makes you think the CCD has too much extra stuffs on it. Frankly I love ryzen for the extra IO and broad feature support.
Zen 4 server chips (IOD + CCD) will have a lot of things that will not see any use on desktop (HEDT maybe): GenZ/CCIX/CXL support, the CCP (Crypto Co-Processor), IFIS (IF inter-socket), SEV stuff...
From a die-area and efficiency perspective it is redundant, and rather detrimental I would say. Of course it is a trade-off vs. reusability, but for power efficiency every little pJ saved counts.
They did not reuse CCDs for mobile for this reason. But the more they can strip off irrelevant blocks, the better it is for space and energy efficiency.
They always scale down their server cores instead of building a purpose-built mobile core. This hurts efficiency, and therefore market perception.
They will have N5P on their Zen 4 products; they might as well extract the maximum energy efficiency of the node with a purpose-built SoC for mobile/DT.
 
Last edited: