Discussion Speculation: Zen 4 (EPYC 4 "Genoa", Ryzen 7000, etc.)

Vattila · Oct 6, 2019

Except for the details about the improvements in the microarchitecture, we now know pretty well what to expect with Zen 3.

The leaked presentation by AMD Senior Manager Martin Hilgeman shows that EPYC 3 "Milan" will, as promised and expected, reuse the current platform (SP3), and the system architecture and packaging looks to be the same, with the same 9-die chiplet design and the same maximum core and thread-count (no SMT-4, contrary to rumour). The biggest change revealed so far is the enlargement of the compute complex from 4 cores to 8 cores, all sharing a larger L3 cache ("32+ MB", likely to double to 64 MB, I think).

Hilgeman's slides did also show that EPYC 4 "Genoa" is in the definition phase (or was at the time of the presentation in September, at least), and will come with a new platform (SP5), with new memory support (likely DDR5).

What else do you think we will see with Zen 4? PCI-Express 5 support? Increased core-count? 4-way SMT? New packaging (interposer, 2.5D, 3D)? Integrated memory on package (HBM)?

Vote in the poll and share your thoughts!

Timmah! · Jun 1, 2021

Just read elsewhere thar Ryzen 4 might rise the core count up to 24? If yes, and kept it at the current 16-core price, consider me interested. I was thinking about waiting for next Intel HEDT/nextgen Threadripper, but at 24 cores i suppose i would be content even with "mainstream" platform. Sounds almost too good to be true, though, especially since Intel wont come even with proper 16-core themselves.

IntelUser2000 · Jun 1, 2021

Kepler_L2 said:
Rembrandt will be 2x faster without V-Cache

Doesn't matter at this point. Since Alderlake is sticking to the 96EU configuration.

AMD showed their card and it's a formidable one.

Asterox · Jun 1, 2021

IntelUser2000 said:
Doesn't matter at this point. Since Alderlake is sticking to the 96EU configuration.

AMD showed their card and it's a formidable one.

Intel would very much like to put more EU-s, but it cant do that.More EU-s that would mean even bigger 10nm Alder Lake die, and bigger die is simple more expensive product.

CakeMonster · Jun 1, 2021

Timmah! said:
Just read elsewhere thar Ryzen 4 might rise the core count up to 24? If yes, and kept it at the current 16-core price, consider me interested. I was thinking about waiting for next Intel HEDT/nextgen Threadripper, but at 24 cores i suppose i would be content even with "mainstream" platform. Sounds almost too good to be true, though, especially since Intel wont come even with proper 16-core themselves.

That's the kind of thing I'm looking for. Not for HEDT uses, but to have a core 'buffer'. I wonder if the *900x version would be 18 or 20 cores, that would probably be the sweet spot price wise.

Hail The Brain Slug · Jun 1, 2021

CakeMonster said:
That's the kind of thing I'm looking for. Not for HEDT uses, but to have a core 'buffer'. I wonder if the *900x version would be 18 or 20 cores, that would probably be the sweet spot price wise.

Not really sure how you'd divide 20 by 3. It would be 18.

It seems like they still have to disable 2 cores per CCD, which multiplied across 3 CCD's would mean there would only be an 18 and 24 core part on the 3 chiplet SKU's.

Unless they designed around that on Zen 4 for per-core disableability.

A/// · Jun 1, 2021

I'd imagine it would have to work on the 8 core spectrum unless they had enough where they could fuse off perfectly good cores.

CakeMonster · Jun 1, 2021

Justinus said:
Not really sure how you'd divide 20 by 3. It would be 18.

It seems like they still have to disable 2 cores per CCD, which multiplied across 3 CCD's would mean there would only be an 18 and 24 core part on the 3 chiplet SKU's.

Unless they designed around that on Zen 4 for per-core disableability.

Oh, I was thinking 12 core CCD's. 3x8 sounds doable although definitely more complex.

A/// · Jun 1, 2021

Earlier ideas a year or two ago when their patent was discovered suggested layered cores and caches. That would be a remarkable engineering feat.

Hail The Brain Slug · Jun 1, 2021

A/// said:
Earlier ideas a year or two ago when their patent was discovered suggested layered cores and caches. That would be a remarkable engineering feat.

Maybe after an interation or two layered cores, but I assume cache is the first market-ready step because of the relative simplicity of fabbing, stacking, and cooling an extra layer of cache.

CakeMonster said:
Oh, I was thinking 12 core CCD's. 3x8 sounds doable although definitely more complex.

The latest rumors I saw were 3x8 core CCD's. I haven't seen any credible rumors of another increase in CCD size, but I'd welcome it. I was wrong about 6 cores per CCD in zen 2/3, so I could be wrong again.

lightmanek · Jun 1, 2021

Just wanted to point out we recently had news regarding new Zen 3 CCD stepping showing on the market. AMD claimed it doesn't bring any performance or clock improvements, but one has to wonder if this stepping fixed / improved manufacturability for stacking cache on top of it.
I think it's too much of a coincidence and this stepping will be used in future, cache heavy Ryzens

Hail The Brain Slug · Jun 1, 2021

lightmanek said:
Just wanted to point out we recently had news regarding new Zen 3 CCD stepping showing on the market. AMD claimed it doesn't bring any performance or clock improvements, but one has to wonder if this stepping fixed / improved manufacturability for stacking cache on top of it.
I think it's too much of a coincidence and this stepping will be used in future, cache heavy Ryzens

One source cites that AMD claims they had to make the CCD dies thinner to accommodate the stacked cache under the AM4 heatspreader. This could explain the new stepping.

jamescox · Jun 1, 2021

Justinus said:
One source cites that AMD claims they had to make the CCD dies thinner to accommodate the stacked cache under the AM4 heatspreader. This could explain the new stepping.

There had to be some elements present in Zen 3 from the start, but it may not have been fully functional before the new stepping. I don't think changes would need to be made just for the thickness of the die. That just depends on polishing the wafer down to the proper thickness to expose the through silicon vias (TSVs). It is possible that the original die did not have the TSVs in place. That would add a bit of processing that would be unnecessary in the earlier revisions that were not going to be used as stacked packages. It may have had the circutry already present for connecting multiple layers.

This is still subject to binning. I don't know if they can test the TSVs before attempting bonding with the cache chip. If they can, they might be able to reject a certain number of them to be sold as non-stacked parts. Really, if it isn't perfect, then they should just sell it as a non-stacked part to reduce wasted cache die. Also, If the cache bonding fails in some manner, the die might still be usable as a regular processor without the extra cache.

jamescox · Jun 1, 2021

Justinus said:
Maybe after an interation or two layered cores, but I assume cache is the first market-ready step because of the relative simplicity of fabbing, stacking, and cooling an extra layer of cache.

The latest rumors I saw were 3x8 core CCD's. I haven't seen any credible rumors of another increase in CCD size, but I'd welcome it. I was wrong about 6 cores per CCD in zen 2/3, so I could be wrong again.

I doubt they would increase the CCD beyond 8 cores at this point. The rumors have been saying that they are going to increase the number of cpu die connections and memory channels to 3 for each quadrant while the IO stays at 2 x16 with an upgrade to pci-express 5 for Epyc though. That is a lot of extra interconnect on the Epyc package (12 cpu connections, 12 memory channels and 8 x16 pci-express). The ryzen IO die is basically 1/4 the Epyc IO die, so if Epyc goes up to 3 DDR5 and 3 cpu interfaces per quadrant, then it wouldn't be unreasonable for the Ryzen IO die to have the same. The Epyc IO die for Genoa might be a stacked device though. I had expected them to add a large L4 "infinity cache" on to the IO die to provide a somewhat monolithic last level cache before going out to memory. I guess they may still do that, but it seems less necessary with massive caches on the cpu die itself. If it is an active interposer, then they could put all of the physical layer portions of the interfaces in the interposer (these require larger transistors that do not scale well) and add chips produced on a more dense process stacked on top. The IO die will still have a lot of logic and SRAM for buffers independent of an unlikely L4 cache.

I suspect we will see some big.LITTLE before we get an increase beyond 8 big core CCX.

uzzi38 · Jun 1, 2021

Justinus said:
One source cites that AMD claims they had to make the CCD dies thinner to accommodate the stacked cache under the AM4 heatspreader. This could explain the new stepping.

The TSVs are visible in the original Zen 3 die shots from months ago. No, even the B0 stepping could theoretically be used here, but I reckon they might've had some common hardware defect with the TSVs or some of the related logic that they fixed with the B2 stepping in order to get the 3D stacking to work.

MadRat · Jun 2, 2021

Does anyone else feel like pin counts are getting ridiculously high? I miss the days where you couldn't hardly screw up a CPU install because they were big enough to be easy to align. Now everything has shrunk to such tiny levels it's getting almost pointless to still use sockets. Great that consumers can still build their own systems. The vast majority of purchases get used as they were built and remain mostly static throughout the life of the system.

Apple is doing it right by specifications that optimize the base build. They are selling static designs for top profits. But by doing so they've eliminated bottlenecks for certain design choices. I'm not a fan of their 8GB and 16GB M1 designs. But I do admit an admiration they've kept wire traces extremely manageable by integration of memory into the main board. It reminded me of the first system I owned, a TANDY/RADIO SHACK with 2MB integrated on the main board. They had aggressive timing on that 2MB. Adding SDRAM really only meant slower memory timings overall. Tandy could control specs on that memory on the main board but not the additional consumer-added memory.

Maybe AMD should spec 8-16 memory modules directly on the main board, too. They can always add more memory controllers for DDR4/5 memory expansion. But 8-16GB of memory at crazy aggressive timing only bolsters their products. Consumers can order main boards and a CPU without needing extra memory to get the system up and running. Aim that integrated memory at video processing, like Apple does to tout the M1. Not enough for many games these days, but absolute monsters in cute benchmarks like that. And strategies like this bolster integrated GPUs.

AMD should probably push for less pins on consumer CPUs to keep the space very tight. Keep the ceramic package small by default. Put two sockets on premium consumer boards rather than constantly scaling the package. Maybe multiple consumer CPUs attach to the packaging of commercial ceramic packages. Add in some kind of ribbon cable to connect the CPU to extra goodies. With 32 twisted pair in flat ribbon cable they already push 1.5 amps per line and transmit data up to several hundred MT/s. So let's connect memory expansion using a ribbon cable off one side. Connect your graphics card on another side. You can still run your main bus architecture through the socket. But the shielded ribbon cable is a simple interface that could really add some straight up powerful links to other devices plugged into your main board, or to bridge communication between processors. And by keeping it robust it's consumer friendly. I can't imagine 1,200 pins in a CPU will ever be considered friendly.

I could imagine the day a main board comes with a soldered in CPU, with built-in GPU, that has an expansion slot for a second CPU. A ribbon interface would be connected to at least two sides of the integrated CPU. And the main board would have 8GB to 32GB of relatively low latency memory. And the board would have additional memory slots that run off a less aggressive controller due to the longer traces. One ribbon interface would align with the empty socket, but would be optional as a second CPU could operate off socket connections independent of the ribbon interface. (This is to offer a low latency connection between CPUs, and would be optional not mandatory.) If the second CPU socket is empty then use a longer standard ribbon to reach memory expansion. (These are additional paths between CPU and memory expansion, not the sole connect.). If a second CPU is installed then you can connect CPUs with standard short cable and memory expansion with a second standard short ribbon. The second ribbon connector off the integrated CPU would face an auxiliary channel for main board bus devices like the wireless NIC, audio, modem-fax, etc. Or maybe it faces the 16x PCIe slot. Ribbon connectors offer a premium boost to a consumer-grade product and offers you a chance to sell more accessories. But of course if you want 1,200 pin processors, you can sell them, too.

A/// · Jun 2, 2021

Justinus said:
Maybe after an interation or two layered cores, but I assume cache is the first market-ready step because of the relative simplicity of fabbing, stacking, and cooling an extra layer of cache.

Makes a lot of sense now that I think about it; baby steps.

jamescox · Jun 2, 2021

MadRat said:
Does anyone else feel like pin counts are getting ridiculously high? I miss the days where you couldn't hardly screw up a CPU install because they were big enough to be easy to align. Now everything has shrunk to such tiny levels it's getting almost pointless to still use sockets. Great that consumers can still build their own systems. The vast majority of purchases get used as they were built and remain mostly static throughout the life of the system.

Apple is doing it right by specifications that optimize the base build. They are selling static designs for top profits. But by doing so they've eliminated bottlenecks for certain design choices. I'm not a fan of their 8GB and 16GB M1 designs. But I do admit an admiration they've kept wire traces extremely manageable by integration of memory into the main board. It reminded me of the first system I owned, a TANDY/RADIO SHACK with 2MB integrated on the main board. They had aggressive timing on that 2MB. Adding SDRAM really only meant slower memory timings overall. Tandy could control specs on that memory on the main board but not the additional consumer-added memory.

Maybe AMD should spec 8-16 memory modules directly on the main board, too. They can always add more memory controllers for DDR4/5 memory expansion. But 8-16GB of memory at crazy aggressive timing only bolsters their products. Consumers can order main boards and a CPU without needing extra memory to get the system up and running. Aim that integrated memory at video processing, like Apple does to tout the M1. Not enough for many games these days, but absolute monsters in cute benchmarks like that. And strategies like this bolster integrated GPUs.

AMD should probably push for less pins on consumer CPUs to keep the space very tight. Keep the ceramic package small by default. Put two sockets on premium consumer boards rather than constantly scaling the package. Maybe multiple consumer CPUs attach to the packaging of commercial ceramic packages. Add in some kind of ribbon cable to connect the CPU to extra goodies. With 32 twisted pair in flat ribbon cable they already push 1.5 amps per line and transmit data up to several hundred MT/s. So let's connect memory expansion using a ribbon cable off one side. Connect your graphics card on another side. You can still run your main bus architecture through the socket. But the shielded ribbon cable is a simple interface that could really add some straight up powerful links to other devices plugged into your main board, or to bridge communication between processors. And by keeping it robust it's consumer friendly. I can't imagine 1,200 pins in a CPU will ever be considered friendly.

I could imagine the day a main board comes with a soldered in CPU, with built-in GPU, that has an expansion slot for a second CPU. A ribbon interface would be connected to at least two sides of the integrated CPU. And the main board would have 8GB to 32GB of relatively low latency memory. And the board would have additional memory slots that run off a less aggressive controller due to the longer traces. One ribbon interface would align with the empty socket, but would be optional as a second CPU could operate off socket connections independent of the ribbon interface. (This is to offer a low latency connection between CPUs, and would be optional not mandatory.) If the second CPU socket is empty then use a longer standard ribbon to reach memory expansion. (These are additional paths between CPU and memory expansion, not the sole connect.). If a second CPU is installed then you can connect CPUs with standard short cable and memory expansion with a second standard short ribbon. The second ribbon connector off the integrated CPU would face an auxiliary channel for main board bus devices like the wireless NIC, audio, modem-fax, etc. Or maybe it faces the 16x PCIe slot. Ribbon connectors offer a premium boost to a consumer-grade product and offers you a chance to sell more accessories. But of course if you want 1,200 pin processors, you can sell them, too.

So you want things to be built like apple does where if your SSD goes bad, you have to replace the whole system board because it is soldered on? Apple does this because pick and place machines are cheaper than having workers install components and put screws in. It is more reliable in some respects, but it has its down sides also. You can get soldered on memory and such in laptops and an xbox or playstation. The whole point of a PC is that you can mix and match components.

They may eventually integrate enough memory (probably HBM or other stacked DRAM) that you wouldn't really need to add any more DRAM. Just plug in an optane or flash drive. Most off chip IO seems to be moving toward using pci-express physical layer signalling which is low pin count for the bandwidth. IT would be higher latency to use that for memory, but with massive caches, it may make sense eventually.

beginner99 · Jun 2, 2021

DisEnchantment said:
Which makes me wonder why they keep throwing so much cache at it.

I suspect they simply use it as a test vehicle. Much less validation needed than for server products. If there is some unforseen fatal flaw (unlikley but possible) it will show in these consumer Ryzen products before hurting and server costumers. Because server is where this cache will be a tremendous advanatge. If I read and remember correctly a 64-core epyc could have up to 2GB of L3 cache. A tremendous advantage for server loads.

misuspita · Jun 2, 2021

I like upgradability thank you very much. Had a Ryzen 1600 which I could upgrade to the last 5xxx CPU because motherboard supported it. Started with 8Gb 3000Mhz of ram cause it was when Ram prices were to the roof, got 8 more when it returned to normal, sold, got 32 now when I got the 4650G. Got a 256GB NVMe initially, when prices stabilised got a TB one and a 512GB for boot. I wanted a smaller system so bought a ITX board together with the 4650G, but if it wasn't for my need to become mobile, i would have kept the old system intact, and with a 5600 would have been absolutely fine.

If all would have been soldered, none of that would have been possible.

Timorous · Jun 2, 2021

jamescox said:
I suspect we will see some big.LITTLE before we get an increase beyond 8 big core CCX.

On the big.LITTLE front what if that is just stacked dies. Little on the bottom with low power cores and big on the top with the usual cores.

This would give great flexibility because you could have non stacked versions in markets where big little makes no sense but in laptops you could go stacked with various configs.

eek2121 · Jun 2, 2021

Timorous said:
On the big.LITTLE front what if that is just stacked dies. Little on the bottom with low power cores and big on the top with the usual cores.

This would give great flexibility because you could have non stacked versions in markets where big little makes no sense but in laptops you could go stacked with various configs.

Not unless we have some innovations around cooling. Stacking cache is easy. Cores? Not so much. Even little cores need to be cooled, and pushing that heat up through the big cores will cause problems there as well.

I think AMD is currently taking the best approach.

EDIT: I would love to see AMD take this to an obscene level with a halo Threadripper product.

carrotmania · Jun 2, 2021

Thinking of Zen4 WRT the stacked cache, and the expectation / rumour of more cores, up to 12C per CCD. The expectation, and therefore the refutation, being that 6C either side of the cache would be a weird layout.

But what if this now means that there is no cache at all on the compute silicon and rather than 4C/cache/4C, could it be 4C/4C/4C with two blocks (side by side, not stacked higher) of 64MB on top. 12C with 128MB of cache per CCD. That would be a nice step up over Zen3+ in and of itself.

The silicon would be roughly the same physical size and you'd still only need two chiplets, rather than an odd number of 3 (if they didn't go 6C/cache/6C). Possible?

LightningZ71 · Jun 2, 2021

That would require additional work in thermal management. The demonstrated setup showed the L3 stack being directly over the L3 section of the CCD, and it's likely there for more than distance reasons. Getting heat pumped out of the CPU core hotspots is a challenge now. Doing it with a cache die in the way? Likely much harder.

I suspect that, in the next iteration, the L3 cache die will be below the CCD, which will allow better heat dissipation. Why wasn't it done this time? In my opinion, it's because the tech wasn't ready for mass production when Zen3 went final.

carrotmania · Jun 2, 2021

Well, OK, cache below, but I was more thinking of the core layout... is 2x 4C/4C/4C with cache only stacked more likely than 2x 6C/cache/6C, or 3x 4C/cache/4C?

LightningZ71 · Jun 2, 2021

A lot depends on how the actual cores change. With the talk of the parents for a virtual opcode cache, it's possible that the one die cache may change it's nature going forward.

Discussion Speculation: Zen 4 (EPYC 4 "Genoa", Ryzen 7000, etc.)

Senior member

Golden Member

Elite Member

Golden Member

Golden Member

Diamond Member

Diamond Member

Golden Member

Diamond Member

Diamond Member

Senior member

Diamond Member

Senior member

Senior member

Platinum Member

Lifer

Diamond Member

Senior member

Diamond Member

Senior member

Golden Member

Diamond Member

Member

Platinum Member

Member

Platinum Member