Discussion Speculation: Zen 4 (EPYC 4 "Genoa", Ryzen 7000, etc.)


Vattila

Senior member
Oct 22, 2004
799
1,351
136
Except for the details about the improvements in the microarchitecture, we now know pretty well what to expect with Zen 3.

The leaked presentation by AMD Senior Manager Martin Hilgeman shows that EPYC 3 "Milan" will, as promised and expected, reuse the current platform (SP3), and the system architecture and packaging looks to be the same, with the same 9-die chiplet design and the same maximum core and thread-count (no SMT-4, contrary to rumour). The biggest change revealed so far is the enlargement of the compute complex from 4 cores to 8 cores, all sharing a larger L3 cache ("32+ MB", likely to double to 64 MB, I think).

Hilgeman's slides did also show that EPYC 4 "Genoa" is in the definition phase (or was at the time of the presentation in September, at least), and will come with a new platform (SP5), with new memory support (likely DDR5).

Untitled2.png


What else do you think we will see with Zen 4? PCI-Express 5 support? Increased core-count? 4-way SMT? New packaging (interposer, 2.5D, 3D)? Integrated memory on package (HBM)?

Vote in the poll and share your thoughts! :)
 
  • Like
Reactions: richardllewis_01

NostaSeronx

Diamond Member
Sep 18, 2011
3,683
1,218
136
I wouldn't be shocked if it wasn't some sort of hybrid between a jaguar derived core and the construction core shared FPU layout.
Nah, they already technically can implement what they did with Zen2 on PS5.

The current rumored specs for Big.Little appear more or less like this in my opinion:
Small Zen4 cores with 128-bit SIMD and big Zen5 cores with 512-bit SIMD.

Zen4 4-track on 3nm => lower leakage, same frequency capability (smaller FPU requires less current)
Zen5 5-track on 3nm => higher leakage, higher current capability (to feed larger FPU), thus higher frequency support at low/mid SIMD capability.

8 Zen5 cores(Big core CCX), 4 Zen4 cores(Small core CCX) => similar strategy as Apple.
 

Thala

Golden Member
Nov 12, 2014
1,355
653
136
It should also be noted that the ARM world is moving in a direction that does all that (and more!) when it comes to giving schedulers the option to assign threads to any available core regardless of which SIMD instructions are in use. Because SVE2.

Note that although SVE2 is binary compatible across different vector sizes, you still cannot mix cores with different vector sizes in the same SMP system.
 

jamescox

Senior member
Nov 11, 2009
637
1,103
136
Lol, damn. Jim sounds pissed that everyone else was doubting him on the rumor and when others corroborated it, the tech press didn't even give him credit.
I always expected 128 cores, so the 96 core thing was odd. They might need more than 96 cores to stay competitive with ARM solutions. Intel seems like they aren’t going to even be close. With the massive L3 caches, Milan-x and Genoa might be serious competition in HPC workloads. I wonder if Intel had no clue about the stacked L3. AMD seems to have kept that from leaking.

With the 96-core rumors and the insistence by some leakers that the 128-core chip isn’t Genoa, what is special about it? Does it have heterogeneous cores? Perhaps a different stacking solution? It would be very power limited. I have still been leaning towards a solution with multiple interposers. They could split up the IO die into 4 chips and perhaps stack up to 3 CCDs on top of each, for 96 cores in total. This allows devices with 1, 2, or 4 die stacks for 24, 48, or 96 cores. Threadripper with 2 stacks would be limited to 48 cores unless they make a version with 4 stacks and half of the memory controller disabled. If this is the route that they are going, then Genoa may look more like the Intel stacked solution in the video than the AMD mock-up. I doubt that it will look anything like the mock-ups we have seen so far.
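To make the arithmetic behind those stack configurations explicit, here is a minimal sketch. It assumes 8 cores per CCD (carried over from Zen 3, not confirmed for Genoa) and 3 CCDs per stack as speculated above; the names are made up for illustration.

```python
CORES_PER_CCD = 8    # assumption: same CCD size as Zen 3
CCDS_PER_STACK = 3   # from the speculation above: up to 3 CCDs per IO-die quarter


def cores_for_stacks(stacks: int) -> int:
    """Total cores for a package built from the given number of die stacks."""
    return stacks * CCDS_PER_STACK * CORES_PER_CCD


# Core counts for the 1-, 2- and 4-stack devices mentioned in the post.
core_counts = {stacks: cores_for_stacks(stacks) for stacks in (1, 2, 4)}
print(core_counts)  # {1: 24, 2: 48, 4: 96}
```

Under these assumptions a 128-core part does not fall out of this layout without either a fourth CCD per stack or a different arrangement altogether.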

Perhaps the 128 core capable version is a full interposer solution with a massive IO die / active interposer. This could be the high end HPC product with extra space for more cores and HBM memory. It would reduce package power consumption to have it all in one giant interposer, which would be required for 128 cores, but it would be very expensive. It may come out later than the initial Genoa, which might be why we are hearing about it this late. It would likely have a different core name from Genoa since it would be a different IO die / interposer. The interposer would need to be massive though if it was all one piece. Perhaps it is still distributed into multiple chips except with extra space for more cores and HBM memory.
 
  • Like
Reactions: Tlh97

DrMrLordX

Lifer
Apr 27, 2000
21,582
10,785
136
Note that although SVE2 is binary compatible across different vector sizes, you still cannot mix cores with different vector sizes in the same SMP system.

Yeah you've explained that in other ARM-based threads. It still should be possible to have very small cores area-wise in heterogeneous core arrangements featuring "big" cores with relatively wide vector sizes. It will be interesting to see how AMD handles heterogeneous core arrangements in Zen5 or whenever they actually get around to it.
 

MadRat

Lifer
Oct 14, 1999
11,909
229
106
I'd think with the goals of AMD that the big cores and little cores will be chosen less by physical location and more by opportunity. By making the core virtualized at the hardware level, the software should be ignorant of individual physical cores. This will probably help with load balancing and thermal distribution.
 

CakeMonster

Golden Member
Nov 22, 2012
1,384
482
136
Reading all of this I kind of feel like securing myself a high core non-bigLittle CPU just in case it won't work all that well initially. Seems like there's no hurry with regards to AMD though, since Z4 and a possible Z3 update will be a regular core setup.

So if the scheduler turns out funky, I'd guess that a regular high core N5 or N4 Zen* would hold me over for a good while until it gets sorted out.
 

andermans

Member
Sep 11, 2020
151
153
76
Google Announces AMD Milan-based Cloud Instances - Out with SMT vCPUs?

In light of this development, does anyone want to reconsider SMT4 speculation of future AMD micro-architecture? I have a feeling that, with the numbers of cores increasing exponentially and security issues related to SMT, AMD will start de-emphasizing SMT going forward.

I think one could see this coming from miles away. With all the isolation and security issues and the lack of SMT on ARM often being seen as a significant advantage, going less SMT for cloud instead of more would make sense. If you combine that with higher levels of SMT bringing to some extent diminishing returns SMT4 was always very unlikely.

Hard to say if SMT will be completely de-emphasized though. There are usecases which have less issues with it so it might be kept for that reason (remember a lot of the value is shared R&D costs for a general solution instead of a perfectly specialized one), but maybe that will all be replaced by small core designs over the next 5 years depending on how that works out.
 

jpiniero

Lifer
Oct 1, 2010
14,510
5,159
136
I think one could see this coming from miles away. With all the isolation and security issues and the lack of SMT on ARM often being seen as a significant advantage, going less SMT for cloud instead of more would make sense. If you combine that with higher levels of SMT bringing to some extent diminishing returns SMT4 was always very unlikely.

IIRC Amazon gives you the SMT thread "for free" now (or you could say they charge by core now). So SMT security issues are not a big deal.
 

Doug S

Platinum Member
Feb 8, 2020
2,202
3,405
136
I think one could see this coming from miles away. With all the isolation and security issues and the lack of SMT on ARM often being seen as a significant advantage, going less SMT for cloud instead of more would make sense. If you combine that with higher levels of SMT bringing to some extent diminishing returns SMT4 was always very unlikely.

Hard to say if SMT will be completely de-emphasized though. There are usecases which have less issues with it so it might be kept for that reason (remember a lot of the value is shared R&D costs for a general solution instead of a perfectly specialized one), but maybe that will all be replaced by small core designs over the next 5 years depending on how that works out.


SMT is a bad idea for cloud because no one wants someone else's unknown software running on another thread on the same core their stuff is, given all the security issues.

For on-premise computing it still makes sense, as you aren't worried about those types of exploits - if you have code trying to exploit that running on your internal servers, you are already screwed!
 

tomatosummit

Member
Mar 21, 2019
184
177
116
Q2 would be consistent with 18 month cadence. Q4 2022 seemed . . . wrong.
I want to be wrong, but I was previously thinking AM5 is Q2, to go with Rembrandt desktop for OEMs. Raphael could unfortunately still be up in the air.
The leak not mentioning the chipset is why I think it's more of an OEM thing, and AMD will be able to ride their Zen 3 3D-cache CPUs in the retail market for a while.
 

eek2121

Platinum Member
Aug 2, 2005
2,904
3,906
136
I want to be wrong, but I was previously thinking AM5 is Q2, to go with Rembrandt desktop for OEMs. Raphael could unfortunately still be up in the air.
The leak not mentioning the chipset is why I think it's more of an OEM thing, and AMD will be able to ride their Zen 3 3D-cache CPUs in the retail market for a while.

Rembrandt and Barcelo are laptop chips. I doubt we will see them on the desktop, since Zen 4 reportedly includes a GPU.
 

uzzi38

Platinum Member
Oct 16, 2019
2,565
5,575
146
Rembrandt and Barcelo are laptop chips. I doubt we will see them on the desktop, since Zen 4 reportedly includes a GPU.
Rembrandt's iGPU is too wide to be designed around the laptop market alone. Based off my 6700XT testing, I see absolutely no reason for AMD to go with 12CUs for mobile alone, as in the same 25W power budget for the GPU CUs alone, you could instead go with 8CUs (-33%) clocked 300MHz higher (+15%). At lower iGPU-only power (AKA in more CPU heavy games or lower power budgets), this gap would measurably shrink.
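For a feel of the numbers in that trade-off, here is a rough sketch. It assumes throughput scales perfectly with CUs x clock, which is an idealisation and not a measurement, and it derives the base clock from the quoted "+300MHz is +15%". The function and variable names are made up for illustration.

```python
def relative_throughput(cus: int, clock_mhz: float) -> float:
    """Idealised throughput proxy: CU count times clock (assumes perfect scaling)."""
    return cus * clock_mhz


# +300 MHz is quoted as +15%, which implies a base clock of about 2000 MHz.
base_clock_mhz = 300 / 0.15

wide = relative_throughput(12, base_clock_mhz)          # 12 CUs at the lower clock
narrow = relative_throughput(8, base_clock_mhz + 300)   # 8 CUs clocked 300 MHz higher

# Best-case advantage of the wide configuration at the same GPU power budget.
best_case_uplift = wide / narrow - 1
print(f"Best-case uplift of 12 CUs over 8 CUs: {best_case_uplift:.1%}")
```

Real workloads rarely scale perfectly with CU count, so this should be read as an upper bound on what the wider configuration buys at that power budget.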

Especially not as the iGPU will likely be enough to beat Intel ones out to 2023 or perhaps even later if MTL and LNL do actually come with Gen12 as previously rumoured.

Large APUs on the desktop are pretty much going to be necessary going forwards, as taping out small dGPUs becomes less and less cost effective from here on.

Without doing something awfully strange like backporting new IP to 12LP+ or something that is.
 

Vattila

Senior member
Oct 22, 2004
799
1,351
136
On the question of silicon interposer/bridges for the "Zen 4" generation, AMD's paper by Sam Naffziger et al goes into detail about their considerations of using silicon interposer for the "Zen 3" generation, explaining why they ended up using IFOP (Infinity Fabric On Package) in the organic package substrate instead. It was primarily down to high cost and lack of reach. Relying on IFOP was quite a feat, though, and comes with its own drawbacks. In particular, there is not much room in the package for links so they had to be very creative, including co-design in the chiplets themselves (compromises/innovation in power delivery). Obviously, IFOP has higher latency. Energy per bit for IFOP is over an order of magnitude greater than links on silicon. And, 14% of the CCD die size is devoted to IFOP due to the large interconnect bump pitch. Surprisingly, IFOP isn't a bandwidth limiter though — it has ample bandwidth for the "Zen 3" generation, at least.

"3) Packaging Technology Decisions: AMD was among the first companies to commercially introduce silicon interposer technologies starting with the AMD Radeon™ R9 “Fury”
GPUs with high-bandwidth memory (HBM) in 2015 [16]. A natural question for our chiplet-based products is why we chose to use package substrate routing rather than the higher density interconnects enabled by silicon interposers. There are several factors that drove the decision to not use silicon interposers for our chiplet-based processors. The first is the communication requirements of our chiplets. With eight CCDs and eight memory channels, on average each chiplet’s IFOP only needs to handle approximately one DDR4 channel’s worth of bandwidth. Using DDR4-2933 as an example, a single channel would correspond to ~23.5 GB/s of peak bandwidth. Even accounting for some load imbalance across the CCDs, a single CCD’s IFOP would still be expected to observe no more than a few tens of GB/s of traffic, and in fact each link can support approximately 55GB/s of effective bandwidth. Point-to-point links in the package substrate routing layers are more than sufficient to handle this modest level of bandwidth. In contrast, a single HBM stack can deliver hundreds of GB/s of memory bandwidth, which far exceeds the capabilities of the organic package substrate, and this is why HBM-enabled GPU products need a higher-bandwidth solution such as silicon interposers [2][16][17]. The second factor against silicon interposers for our chiplet-based processors is the reach of the interposer-based interconnects. While interposers can provide great signal density for very high bandwidths, the lengths of the signals are limited and as such constrain the connections to edge-to-edge links. The reach of interposer-based interconnects can in principle be extended using wider metal routes and greater spacing between routes, but this would decrease the effective bandwidth per interface because fewer total routes could be supported for a fixed width of routing tracks. This argument also applies to silicon bridge technologies [12]. 
The next subsection describes the challenges of providing sufficient IFOP bandwidth across the package substrate. Figure 10 illustrates a hypothetical interposer-based processor design. The edge connectivity constraint would limit the architecture to only four CCDs, which would render the product concept to be far less compelling. Even if interconnect reach was not a limiting factor, the IOD and the eight CCDs would require so much area that the underlying interposer would greatly exceed the reticle limit (while a passive interposer does not contain any transistors, the metal layers are still lithographically created and therefore must stay within the same reticle field constraints). Figure 10 shows the placement where an additional CCD would have to be, which is both outside the boundary of a maximum-sized interposer and too far for the unbuffered interposer routes to reach while supporting required bandwidths. Recent advancements in silicon interposer manufacturing have enabled reticle stitching to create very large interposers [11], but such an approach would have been cost prohibitive for this market segment. Last, the silicon interposer itself adds more cost to the overall solution. A CCD with the twice the core count could have been used, but that would have resulted in lower yield and decreased configurability. For all these reasons, routing IFOP directly across the package substrate was chosen for this product family. The total area consumed by multiple chiplets is typically greater than a monolithic chip with equivalent functionality. While this could theoretically cause a corresponding increase in the overall package size, the size of the SP3 processor package used by AMD EPYC™ processors is primarily determined by the large number of package pins required to support the eight DDR memory channels, 128 lanes of PCIe plus other miscellaneous I/O, and all the power and ground connections."

Pioneering Chiplet Technology and Design for the AMD EPYC™ and Ryzen™ Processor Families: Industrial Product (computer.org)
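The bandwidth argument in the quoted passage can be checked with a few lines. This is a minimal sketch, assuming a standard 64-bit DDR4 channel and taking the DDR4-2933 rate and the ~55 GB/s per-link figure from the quote; the function name is made up for illustration.

```python
def ddr_channel_peak_gbs(transfer_rate_mts: float, bus_width_bits: int = 64) -> float:
    """Peak bandwidth of one DDR channel in GB/s (10^9 bytes per second)."""
    bytes_per_transfer = bus_width_bits / 8
    return transfer_rate_mts * 1e6 * bytes_per_transfer / 1e9


ddr4_2933 = ddr_channel_peak_gbs(2933)   # one channel, as in the paper's example
IFOP_LINK_GBS = 55.0                     # effective per-link bandwidth quoted in the paper

# How many DDR4-2933 channels' worth of traffic a single IFOP link can carry.
headroom = IFOP_LINK_GBS / ddr4_2933
print(f"{ddr4_2933:.1f} GB/s per channel, {headroom:.1f}x headroom per IFOP link")
```

With eight CCDs sharing eight channels, each link has a bit more than twice the bandwidth of its average share of memory traffic, which matches the paper's point that the organic substrate is sufficient for this generation.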

I think the reasoning behind their decision strongly hints that silicon interposer/bridges will be coming as soon as reach and cost allow. However, there is a chance there might be low-risk and low-cost reasons to extend the current scheme, if they find it is sufficient for their performance targets. If they can elongate the package a little more, and there is room underneath the CCDs to route another IFOP link (or two), maybe they can fit another 4 (or 8) CCDs on the package, in a manner compatible with the ugly mock-ups shared by leakers. But the routing is already pretty cramped, as can be seen in this figure from the paper:

[Figure from the paper showing the IFOP routing in the package substrate; image not reproduced here]

PS. Could we perhaps see 96-core "Genoa" using IFOP, and a 128-core follow-up using silicon interposer/bridges? Could they do that in the same socket?
 

jpiniero

Lifer
Oct 1, 2010
14,510
5,159
136
Rembrandt's iGPU is too wide to be designed around the laptop market alone. Based off my 6700XT testing, I see absolutely no reason for AMD to go with 12CUs for mobile alone, as in the same 25W power budget for the GPU CUs alone, you could instead go with 8CUs (-33%) clocked 300MHz higher (+15%). At lower iGPU-only power (AKA in more CPU heavy games or lower power budgets), this gap would measurably shrink.

It's still more power efficient to go wide and lower clocked.

Large APUs on the desktop are pretty much going to be necessary going forwards, as taping out small dGPUs becomes less and less cost effective from here on.

OEMs don't work that way. They like the upsell to dGPU angle.

I would even argue the exact opposite is likely to happen. The desktop is destined to get the very low end IGP and nothing more.
 

LightningZ71

Golden Member
Mar 10, 2017
1,627
1,898
136
You are forgetting the market that the NUC, uSFF, and Mac Mini live in. There's a need for highly integrated but modestly upgradeable products on the desktop and entertainment center. AMD already does a lot of work in that market with the XBox and PS solutions that they provide. There is a need there for something that can push HQ 1080p and upscaled 4K to TVs and monitors that is tiny and mass producible.
 

jpiniero

Lifer
Oct 1, 2010
14,510
5,159
136
You are forgetting the market that the NUC, uSFF, and Mac Mini live in. There's a need for highly integrated but modestly upgradeable products on the desktop and entertainment center. AMD already does a lot of work in that market with the XBox and PS solutions that they provide. There is a need there for something that can push HQ 1080p and upscaled 4K to TVs and monitors that is tiny and mass producible.

And the very low end IGP would be more than capable of doing that. It's gaming where it would be lackluster. That's where the upsell comes in.
 

uzzi38

Platinum Member
Oct 16, 2019
2,565
5,575
146
It's still more power efficient to go wide and lower clocked.

I literally just gave you what the best case scenario performance uplift would be. It's not large at all.

OEMs don't work that way. They like the upsell to dGPU angle.

So? What does that matter if there's no dGPU to upsell with?

I don't think you understood my point at all. To put it differently, shipping Rembrandt would be cheaper than shipping Navi24 even on its own for the desktop whilst netting you most of the performance of the latter (and a CPU bundled in too). It's so much cheaper for the end consumer because you don't have to worry about PCB and other BOM costs of the dGPU itself. With the cost to manufacture lower end dGPUs increasing (R&D costs, tapeout costs, opportunity cost of not shipping ever larger higher end products), APUs are swiftly becoming much more viable.
 
  • Like
Reactions: Tlh97

jpiniero

Lifer
Oct 1, 2010
14,510
5,159
136
I don't think you understood my point at all. To put it differently, shipping Rembrandt would be cheaper than shipping Navi24 even on its own for the desktop whilst netting you most of the performance of the latter (and a CPU bundled in too). It's so much cheaper for the end consumer because you don't have to worry about PCB and other BOM costs of the dGPU itself. With the cost to manufacture lower end dGPUs increasing (R&D costs, tapeout costs, opportunity cost of not shipping ever larger higher end products), APUs are swiftly becoming much more viable.

If Navi 24 type products no longer make sense to design, that's fine with OEMs. They will just stick with 23 type products and better. Or buy nVidia.

OEMs really love having dGPUs on products marketed for gaming. Even if the dGPU is bad.
 

uzzi38

Platinum Member
Oct 16, 2019
2,565
5,575
146
If Navi 24 type products no longer make sense to design, that's fine with OEMs. They will just stick with 23 type products and better. Or buy nVidia.

OEMs really love having dGPUs on products marketed for gaming. Even if the dGPU is bad.

As if that matters. OEMs are more than happy to throw a 3400G in a box and slap a gaming sticker on it - hell, there are instances of some doing the same with the 3000G.
 
  • Like
Reactions: Tlh97

LightningZ71

Golden Member
Mar 10, 2017
1,627
1,898
136
[QUOTE="jpiniero, post: 40527363, member: 281730"]
And the very low end IGP would be more than capable of doing that. It's gaming where it would be lackluster. That's where the upsell comes in.
[/QUOTE]

But a tiny, low end iGPU doesn't do it for the whole market! Yes, for office machines, video players, and modest content creation, even the existing Vega iGPUs are fine. However, for anything north of 720p upscaled setups, they don't cut it for decent games at decent quality settings. The issue is that that end of the market doesn't want a dGPU. They want the smallest possible box that will do the job, preferably as quietly as possible. This is where a DDR5 fed iGPU could shine. The DDR5 spec currently tops out around the same memory bandwidth as an RX 560 in a "dual channel" setup at top specs. RDNA2 is more efficient with memory bandwidth than Polaris. Those RDNA "CU" equivalents will likely be more performant than the 560's as well. It is an easy stretch to believe that there won't be any real need for a dGPU in those systems for 1080p or 4K upscale situations unless the user wants to be competitive in an FPS, or they just want the highest possible quality. I know several people that do just fine gaming with 1050s and 560s, which I believe these iGPUs will be comparable to.
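As a rough sanity check on the bandwidth comparison, here is a minimal sketch. The specific figures are assumptions not stated in the post: DDR5-6400 as the top JEDEC speed grade at the time, a 128-bit total bus for a "dual channel" setup, and an RX 560 with 128-bit GDDR5 at 7 Gbps per pin.

```python
def peak_bandwidth_gbs(rate_mts: float, bus_width_bits: int) -> float:
    """Peak memory bandwidth in GB/s from transfer rate (MT/s) and total bus width."""
    return rate_mts * 1e6 * bus_width_bits / 8 / 1e9


# Assumed figures, see the note above; none of them come from the thread itself.
ddr5_dual_channel = peak_bandwidth_gbs(6400, 128)   # DDR5-6400, 2 x 64-bit
rx560_gddr5 = peak_bandwidth_gbs(7000, 128)         # 7 Gbps per pin, 128-bit bus

ratio = ddr5_dual_channel / rx560_gddr5
print(f"DDR5 dual channel: {ddr5_dual_channel:.1f} GB/s")
print(f"RX 560:            {rx560_gddr5:.1f} GB/s")
print(f"Ratio:             {ratio:.2f}")
```

Under those assumptions the iGPU would get within roughly 10% of the RX 560's peak bandwidth, though it would have to share that bandwidth with the CPU cores.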
 
  • Like
Reactions: Tlh97