Discussion Speculation: Zen 4 (EPYC 4 "Genoa", Ryzen 7000, etc.)


Vattila

Senior member
Oct 22, 2004
799
1,351
136
Apart from the details of the microarchitectural improvements, we now know pretty well what to expect from Zen 3.

The leaked presentation by AMD Senior Manager Martin Hilgeman shows that EPYC 3 "Milan" will, as promised and expected, reuse the current platform (SP3), and the system architecture and packaging look to be the same, with the same 9-die chiplet design and the same maximum core and thread count (no SMT-4, contrary to rumour). The biggest change revealed so far is the enlargement of the compute complex from 4 cores to 8 cores, all sharing a larger L3 cache ("32+ MB", likely to double to 64 MB, I think).

Hilgeman's slides also showed that EPYC 4 "Genoa" is in the definition phase (or was at the time of the presentation in September, at least) and will come with a new platform (SP5) with new memory support (likely DDR5).

What else do you think we will see with Zen 4? PCI-Express 5 support? Increased core-count? 4-way SMT? New packaging (interposer, 2.5D, 3D)? Integrated memory on package (HBM)?

Vote in the poll and share your thoughts! :)
 

jamescox

Senior member
Nov 11, 2009
637
1,103
136
A TR has 4x the number of chiplets of a Ryzen, active or not. The whole platform is just expensive to produce.
They could stand to have something in between the full Epyc SP5 (768-bit memory interface, etc.) and Ryzen AM5 (128-bit memory, etc.). With Genoa, Epyc is now 6x the desktop part rather than just 4x. This isn't just for HEDT; there are also workstation and lower-end server parts that don't need a full-size SP5 socket.

I have thought that the best way to go would be what Apple did with the M1 Ultra. Make a large APU (16-core) and put two of them together for 2x everything. Either that or just make a desktop sized IO die that can connect to a second IO die for 256-bit memory and up to 4 cpu chiplets. That would allow for very cheap HEDT, workstation, and low end server implementations without using an expensive Genoa IO die and package. Genoa tends more towards HPC than generic server.
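
Purely as a back-of-envelope check on the memory-interface gap described above, here is a minimal Python sketch comparing peak DRAM bandwidth for SP5, AM5, and the kind of 256-bit in-between socket I'm wishing for. The helper and the DDR5 data rates are my own assumptions, for illustration only.

```python
# Back-of-envelope peak DRAM bandwidth for the sockets mentioned above.
# Assumed data rates (DDR5-4800 on SP5, DDR5-5200 on AM5) are for
# illustration only; real supported speeds may differ.

def peak_bw_gbs(bus_width_bits: int, data_rate_mtps: int) -> float:
    """Peak bandwidth in GB/s = bus width in bytes * transfers per second."""
    return bus_width_bits / 8 * data_rate_mtps * 1e6 / 1e9

sockets = {
    "SP5 (Genoa, 12ch / 768-bit)":           (768, 4800),
    "AM5 (Ryzen 7000, 2ch / 128-bit)":       (128, 5200),
    "hypothetical 4ch / 256-bit in-between": (256, 5200),
}

for name, (width_bits, rate) in sockets.items():
    print(f"{name}: ~{peak_bw_gbs(width_bits, rate):.0f} GB/s")
```

That works out to roughly 460 GB/s vs 83 GB/s, with the hypothetical middle socket landing around 166 GB/s.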
 

DrMrLordX

Lifer
Apr 27, 2000
21,582
10,785
136
How many are willing to drop $700 on a motherboard and another 4 grand on a CPU for DIY? Not many.

I think a lot of Threadripper products wound up in prosumer systems/workstations, which is one of the reasons AMD has focused more on Threadripper Pro going forward.
 

LightningZ71

Golden Member
Mar 10, 2017
1,627
1,898
136
R9-7950X with 24/48 cores/threads...

How? 1x Genoa 8-core CCD + 1x Bergamo 16-core CCD? 3x Genoa 8-core CCDs? 2x Bergamo 16-core CCDs with 12 cores active each?

With a switch to N6 for the IOD, there is physically enough space on an AM5 package for 3x CCD + 1x IOD...
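
For what it's worth, here is a small Python sketch enumerating the CCD mixes floated above, assuming (as the rumours go) 8-core Zen 4 CCDs and 16-core Zen 4c "Bergamo-style" CCDs, with salvage by fusing cores off. The CCD count limits and salvage rule are hypothetical.

```python
# Enumerate CCD mixes that could yield a 24-core part, assuming 8-core
# Zen 4 CCDs and 16-core Zen 4c (Bergamo-style) CCDs, with salvage by
# fusing cores off. Purely illustrative; AMD's real SKU rules are unknown.
from itertools import product

TARGET = 24
for n8, n16 in product(range(4), range(3)):      # up to 3 small + 2 big CCDs
    physical = n8 * 8 + n16 * 16
    fused_off = physical - TARGET
    if (n8 + n16) > 0 and 0 <= fused_off <= 8:   # allow at most 8 disabled cores
        print(f"{n8} x 8-core CCD + {n16} x 16-core CCD "
              f"= {physical} cores, {fused_off} fused off")
```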
 

CakeMonster

Golden Member
Nov 22, 2012
1,384
482
136
When a Twitter person says 'take with a grain of salt'.................

If Z5 is expected at most 18 months after Z4, as I had the impression, meaning a shorter interval than Z3->Z4, then 24 cores is quite surprising now. I suspect that 16c/32t would be just fine, matching the 32 threads of Intel's RL.

My money is still on 16c/32t but I'd be happy to be wrong. Unless being wrong means Z5 is delayed....
 

soresu

Platinum Member
Dec 19, 2014
2,617
1,812
136
R9-7950X with 24/48 cores/threads...

How? 1x Genoa 8-core CCD + 1x Bergamo 16-core CCD? 3x Genoa 8-core CCDs? 2x Bergamo 16-core CCDs with 12 cores active each?

With a switch to N6 for the IOD, there is physically enough space on an AM5 package for 3x CCD + 1x IOD...
V-Cache also upends the equation for determining die size, which is largely dominated by on-die L3$ area.

If they opted for a smaller on-die L3$ and used V-Cache for higher-performance SKUs, then even without a 6nm IOD it should be possible to fit 3, or perhaps even 4, CCDs on AM5.
 

Mopetar

Diamond Member
Jan 31, 2011
7,797
5,899
136
That's why I think that more than 2 CCDs is unlikely. It would mean making a beefier IO die that can connect at least that many when most chips won't connect to more than a single chiplet. DDR5 bandwidth probably alleviates this to some degree, but there's a point where it hits the wall.

At that point just make a mobile Threadripper for the desktop replacement crowd and sell them 32+ cores instead of stopping at 24. The market for a 16 core laptop is already pretty small and they'd probably pay for even more. Anyone who just wants a gaming laptop is probably fine with 8 cores and will be for quite a while.
 

LightningZ71

Golden Member
Mar 10, 2017
1,627
1,898
136
Hmm, with stacked cache you could certainly have a small N5 Zen 4 CCD if you displace a chunk of the L3 off-die. That could give tons of space for three CCDs. In addition, potentially larger L3 caches combined with as much DDR bandwidth as the 32-core Zen 3 Threadrippers had available would greatly reduce memory throughput limitations.

Going much beyond that, though, heat dissipation seems like a huge obstacle.
 

tomatosummit

Member
Mar 21, 2019
184
177
116
That's why I think that more than 2 CCDs is unlikely. It would mean making a beefier IO die that can connect at least that many when most chips won't connect to more than a single chiplet. DDR5 bandwidth probably alleviates this to some degree, but there's a point where it hits the wall.

At that point just make a mobile Threadripper for the desktop replacement crowd and sell them 32+ cores instead of stopping at 24. The market for a 16 core laptop is already pretty small and they'd probably pay for even more. Anyone who just wants a gaming laptop is probably fine with 8 cores and will be for quite a while.
The lack of a consumer Threadripper is why a 24-core desktop CPU is a good idea. It would also line up with Dragon Range being "the most CPU cores in a laptop" if it has the potential for 24.
There's already a bunch of wasted IO on the IO die, and the IF links are pretty small, all things considered; adding a third one probably costs less than producing a whole new consumer-focused HEDT platform with its own new IO die and product lines (ignoring the rumours of the small Epyc socket, of course).
The halo effect is important, and 24 Zen 4 cores would very likely trounce an 8+16 Raptor Lake CPU instead of leaving anything up in the air like it is now.

Of course, I do wish there'd be a third memory channel and a handful more PCIe lanes to go along with it on the highest-end motherboards, but that's just dreaming.
 

jamescox

Senior member
Nov 11, 2009
637
1,103
136
The lack of a consumer Threadripper is why a 24-core desktop CPU is a good idea. It would also line up with Dragon Range being "the most CPU cores in a laptop" if it has the potential for 24.
There's already a bunch of wasted IO on the IO die, and the IF links are pretty small, all things considered; adding a third one probably costs less than producing a whole new consumer-focused HEDT platform with its own new IO die and product lines (ignoring the rumours of the small Epyc socket, of course).
The halo effect is important, and 24 Zen 4 cores would very likely trounce an 8+16 Raptor Lake CPU instead of leaving anything up in the air like it is now.

Of course, I do wish there'd be a third memory channel and a handful more PCIe lanes to go along with it on the highest-end motherboards, but that's just dreaming.
When rumors were talking about AMD using more advanced processes for the IO die, I wondered whether the IO die would be made modular, so that the desktop version would use 1, Epyc would use 4, and a hypothetical intermediate would use 2 modular IO dies. The Epyc IO die has 3 IFOP links, 3 64-bit memory channels (really 6 32-bit), and 2 x16 blocks per quadrant.

With the smaller processes, smaller dies make more sense, but it gets difficult to connect 4 of them together with EFB or serdes and route everything else. It may also be wasteful, since the desktop version presumably has a small GPU that would be unneeded in Epyc. The desktop part may not need all of the memory channels or all of the IFOP links that Epyc has in each quadrant; often only 1 of 2, or 1 of 3 if the desktop part actually has 3. This would give a lot of options for salvage parts, though: a die with a bad GPU could go to Epyc or Threadripper, and one with bad IFOP links or a bad memory channel could go to Ryzen parts.

Everything seems to say that the Epyc IO die is monolithic, though. It might still be a possibility for an intermediate part, since it is much simpler to connect just 2 chips. The GPU rumors look like they are built this way, with blocks of 2 tightly coupled GPU chiplets, each with an HBM stack; large GPUs are built out of 2 or 4 of these units, if the rumors are correct. It would make some sense for Epyc to be monolithic, since it can absorb the cost of a larger, lower-yield IO die. Hopefully we get an intermediate socket with at least 2x AM5 specs somehow.
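
To make that per-quadrant accounting concrete, here is a small Python sketch that scales the quadrant resources listed above (3 IFOP links, 3x 64-bit memory channels, 2x x16 blocks) to 1-, 2-, and 4-quadrant configurations. The modular IO die itself is hypothetical, as noted; the real Genoa IO die is monolithic.

```python
# Hypothetical modular-IOD thought experiment: scale the per-quadrant
# resources described above (3 IFOP links, 3 x 64-bit DDR channels,
# 2 x16 PCIe blocks) to 1, 2 and 4 quadrants. This only illustrates the
# reasoning about a possible intermediate part.

QUADRANT = {"IFOP links": 3, "64-bit DDR channels": 3, "x16 PCIe blocks": 2}

for n, label in [(1, "desktop-like"), (2, "intermediate"), (4, "full Epyc")]:
    totals = {k: v * n for k, v in QUADRANT.items()}
    print(f"{n} quadrant(s) ({label}): {totals}")
```

The 4-quadrant total (12 IFOP links, 12 channels, 128 PCIe lanes) lines up with what Genoa actually offers, which is what makes the thought experiment tempting.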
 

tomatosummit

Member
Mar 21, 2019
184
177
116
When rumors were talking about AMD using more advanced processes for the IO die, I wondered whether the IO die would be made modular, so that the desktop version would use 1, Epyc would use 4, and a hypothetical intermediate would use 2 modular IO dies. The Epyc IO die has 3 IFOP links, 3 64-bit memory channels (really 6 32-bit), and 2 x16 blocks per quadrant.

With the smaller processes, smaller dies make more sense, but it gets difficult to connect 4 of them together with EFB or serdes and route everything else. It may also be wasteful, since the desktop version presumably has a small GPU that would be unneeded in Epyc. The desktop part may not need all of the memory channels or all of the IFOP links that Epyc has in each quadrant; often only 1 of 2, or 1 of 3 if the desktop part actually has 3. This would give a lot of options for salvage parts, though: a die with a bad GPU could go to Epyc or Threadripper, and one with bad IFOP links or a bad memory channel could go to Ryzen parts.

Everything seems to say that the Epyc IO die is monolithic, though. It might still be a possibility for an intermediate part, since it is much simpler to connect just 2 chips. The GPU rumors look like they are built this way, with blocks of 2 tightly coupled GPU chiplets, each with an HBM stack; large GPUs are built out of 2 or 4 of these units, if the rumors are correct. It would make some sense for Epyc to be monolithic, since it can absorb the cost of a larger, lower-yield IO die. Hopefully we get an intermediate socket with at least 2x AM5 specs somehow.
I think the more important limit is power for data transfer, especially today, for the Epyc IO die. There's so much going on in there that, without incredibly cheap and low-power silicon bridges, I don't think it'd be worth it. Remember they're also on trailing nodes, and while they're big compared to CCDs, a 300-400 mm^2 die is nothing in comparison to the big GPGPUs and FPGAs that are a common occurrence these days.
But cutting an Epyc-sized IO die into quarters would never really work; there'd be so much silicon assigned to the extra interconnect between IO chiplets that it probably wouldn't be worth it just for desktop off-cuts. You can see in the M1 Ultra, SPR and MI250 how much space these high-bandwidth silicon-bridge interconnects take up. There's a good cost saving in reusing all the design blocks and CCDs already; penny-pinching further than that would probably start to hurt server CPU performance by increasing the power draw of the already sizable IO die if it were chiplet-based.
Server CPU design times are quite long; Zen 5 stuff is probably open to more exotic designs. Zen 4 is still very similar to Zen 3 with its CCXs and IO die.

You bring up an interesting point with the dual-IO-die desktop situation. I think the capability for two IO dies might have been on the drawing board at one point. The X570 chipset is already a repurposed IO die. The logic to connect the two IO dies coherently might already be in place but never implemented. Just use one full 16-lane block of the MMIO/PCIe for more bandwidth, leaving 32 lanes outward, the same way Epyc 2-socket systems work, but probably possible on the same substrate. In the end it all boils down to data routing, if the protocols are implemented.
 

jamescox

Senior member
Nov 11, 2009
637
1,103
136
I think the more important limit is power for data transfer, especially today, for the Epyc IO die. There's so much going on in there that, without incredibly cheap and low-power silicon bridges, I don't think it'd be worth it. Remember they're also on trailing nodes, and while they're big compared to CCDs, a 300-400 mm^2 die is nothing in comparison to the big GPGPUs and FPGAs that are a common occurrence these days.
But cutting an Epyc-sized IO die into quarters would never really work; there'd be so much silicon assigned to the extra interconnect between IO chiplets that it probably wouldn't be worth it just for desktop off-cuts. You can see in the M1 Ultra, SPR and MI250 how much space these high-bandwidth silicon-bridge interconnects take up. There's a good cost saving in reusing all the design blocks and CCDs already; penny-pinching further than that would probably start to hurt server CPU performance by increasing the power draw of the already sizable IO die if it were chiplet-based.
Server CPU design times are quite long; Zen 5 stuff is probably open to more exotic designs. Zen 4 is still very similar to Zen 3 with its CCXs and IO die.

You bring up an interesting point with the dual-IO-die desktop situation. I think the capability for two IO dies might have been on the drawing board at one point. The X570 chipset is already a repurposed IO die. The logic to connect the two IO dies coherently might already be in place but never implemented. Just use one full 16-lane block of the MMIO/PCIe for more bandwidth, leaving 32 lanes outward, the same way Epyc 2-socket systems work, but probably possible on the same substrate. In the end it all boils down to data routing, if the protocols are implemented.
Yeah, I mentioned that it would get a bit ugly for 4 dies with bridges, and Epyc can absorb the cost of a large IO die anyway. They could still use an intermediate-sized solution and, in that case, using two desktop-type dies seems to make a lot of sense, even if they are connected by serdes Infinity Fabric rather than silicon bridges. The GPU rumors seem to show two tightly coupled dies plus HBM in one unit. That would seem to be a silicon bridge, so it may not be that expensive.

I don't think the bridges take that much silicon or power either. A passive bridge chip would be very small, and it should be lower power than Infinity Fabric links based on high-speed serdes; I would guess only slightly higher than on-die connections. It would be the same type of physical interface used to connect HBM2e. This could be used to make a 4 (or 6) chiplet device for Threadripper, workstation, and low-end servers. They could also use the same tech to connect two APUs for a high-end laptop chip rather than making two different dies. Apple's M1 Ultra might not be representative, since AMD may not require as much interconnect bandwidth; AMD seems more likely to depend on large caches on both dies. Also, we are talking about a 2023 product here, so it may be kind of next-gen compared to Apple's current M1 Ultra.
 

eek2121

Platinum Member
Aug 2, 2005
2,904
3,906
136
Raphael being produced in small quantities in May?! BS, unless Zen 4 desktop isn't going to hit retail until ~November. Manufacturing-to-retail is still taking longer than usual because of various supply-chain and logistical problems. That, or $1500 7950Xs from scalpers because of limited supply.

We have no idea what 'small quantities' means. Also, I'm going to disagree with your manufacturing-to-retail comment. Supply-chain issues have largely been cleaned up, and they can also be avoided entirely by shipping via air rather than ground. I had some custom parts overnighted to me from an Asian country; no issues at all, and the costs were reasonable. I'm not saying there still aren't issues in certain areas; however, it is much better than it was before, when I tried to order parts and they took months to get here.
 

desrever

Member
Nov 6, 2021
108
262
106
Wouldn't a ddr4 3600 16c system have same bandwidth /core as a DDR5 5400 24c? And zen4 has double L2 cache.
There would be a larger demand for bandwidth if they want to improve IPC as well as clock speed. If we think IPC + clocks are going up 25%, then without considering the caches there would need to be 25% more bandwidth to feed the cores.

24 cores of Zen 3 at Zen 3 clocks would probably be fine just moving to DDR5, but it's hard to say for Zen 4. Considering that in EPYC, 96 Zen 4 cores get 12 memory controllers compared to the 8 that 64 Zen 3 cores have, there is probably more demand for bandwidth.
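
The per-core arithmetic in the quoted question does check out, for what it's worth; a quick Python sketch, assuming dual-channel (128-bit) configurations in both cases:

```python
# Per-core bandwidth for the quoted comparison: dual-channel (128-bit)
# DDR4-3600 feeding 16 cores vs dual-channel DDR5-5400 feeding 24 cores.

def total_gbs(data_rate_mtps: int, bus_bits: int = 128) -> float:
    return bus_bits / 8 * data_rate_mtps * 1e6 / 1e9

for label, rate, cores in [("DDR4-3600, 16 cores", 3600, 16),
                           ("DDR5-5400, 24 cores", 5400, 24)]:
    bw = total_gbs(rate)
    print(f"{label}: {bw:.1f} GB/s total, {bw / cores:.2f} GB/s per core")
```

Both land at about 3.6 GB/s per core, so the raw bandwidth per core is the same; whether Zen 4 cores demand more than that is the open question.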
 

biostud

Lifer
Feb 27, 2003
18,193
4,674
136
There would be a larger demand for bandwidth if they want to improve IPC as well as clock speed. If we think IPC + clocks are going up 25%, then without considering the caches there would need to be 25% more bandwidth to feed the cores.

24 cores of Zen 3 at Zen 3 clocks would probably be fine just moving to DDR5, but it's hard to say for Zen 4. Considering that in EPYC, 96 Zen 4 cores get 12 memory controllers compared to the 8 that 64 Zen 3 cores have, there is probably more demand for bandwidth.
But the software running on a server is bound to have different limitations, compared to a desktop. I don’t know if something like video editing is specifically bandwidth hungry, or if any other traditional desktop software is known for being very memory intensive.
 

itsmydamnation

Platinum Member
Feb 6, 2011
2,744
3,078
136
There would be a larger demand for bandwidth if they want to improve IPC as well as clock speed. If we think IPC + clocks are going up 25%, then without considering the caches there would need to be 25% more bandwidth to feed the cores.

24 cores of Zen 3 at Zen 3 clocks would probably be fine just moving to DDR5, but it's hard to say for Zen 4. Considering that in EPYC, 96 Zen 4 cores get 12 memory controllers compared to the 8 that 64 Zen 3 cores have, there is probably more demand for bandwidth.
Wrong. Memory is slow; higher IPC generally needs less bandwidth per core, because the fewer times you hit DDR, the higher your IPC will be. So: bigger/better caching, prefetch, prediction, decode, etc.
 

itsmydamnation

Platinum Member
Feb 6, 2011
2,744
3,078
136
I guess developing faster memory is pointless; with all the IPC we've gained since the Pentium days, we shouldn't need memory faster than a hard drive at this point.
Our memory isn't faster; access times in ns haven't changed in like 20 years. Throughput has, and throughput is different to IPC. For example, if Zen 4 had 4x 512-bit vector units and the L/S and cache read/write bandwidth to sustain them, then yes, 24 cores with DDR5 would be more memory-limited than Zen 3 with DDR4. But in that case, at peak, Zen 4 would have twice the throughput of Zen 3 at the same IPC.

The other big advantage of DDR5 is concurrency.
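
To put a rough number on that throughput argument, here is a minimal Python sketch. The configuration (four 512-bit vector pipes per core, 5 GHz, 24 cores, dual-channel DDR5-5200) is purely hypothetical, following the example above.

```python
# Rough throughput check for the hypothetical above: four 512-bit vector
# pipes per core streaming one operand each per cycle, versus dual-channel
# DDR5. Clock speed, core count and memory speed are assumptions.

VEC_PIPES = 4
VEC_BYTES = 512 // 8        # bytes per 512-bit operand
CLOCK_GHZ = 5.0
CORES = 24

demand_per_core = VEC_PIPES * VEC_BYTES * CLOCK_GHZ      # GB/s
dram_supply = 128 / 8 * 5200e6 / 1e9                     # dual-channel DDR5-5200

print(f"Streaming demand per core: ~{demand_per_core:.0f} GB/s")
print(f"Whole chip ({CORES} cores): ~{demand_per_core * CORES / 1000:.1f} TB/s")
print(f"Dual-channel DDR5-5200 supply: ~{dram_supply:.0f} GB/s")
print("=> registers and caches have to supply nearly all operands.")
```

Roughly 1.3 TB/s of demand per core against ~83 GB/s of DRAM, which is why the caching, prefetch and concurrency side is what actually matters.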
 

Saylick

Diamond Member
Sep 10, 2012
3,084
6,184
136
I wonder if that is an all core turbo. Also, if 8 cores can do 5.2 GHz, perhaps higher core count SKUs can turbo higher a la 5900X vs 5800X.

 

exquisitechar

Senior member
Apr 18, 2017
655
862
136
I wonder if that is an all core turbo. Also, if 8 cores can do 5.2 GHz, perhaps higher core count SKUs can turbo higher a la 5900X vs 5800X.

Probably single core. 5.4 GHz for the 16 core one, perhaps? Many 5950Xs can already hit 5.15 GHz, so I don't see why not. :)