Discussion Speculation: Zen 4 (EPYC 4 "Genoa", Ryzen 7000, etc.)


Vattila

Senior member
Oct 22, 2004
799
1,351
136
Except for the details about the improvements in the microarchitecture, we now know pretty well what to expect with Zen 3.

The leaked presentation by AMD Senior Manager Martin Hilgeman shows that EPYC 3 "Milan" will, as promised and expected, reuse the current platform (SP3), and the system architecture and packaging look to be the same, with the same 9-die chiplet design and the same maximum core and thread count (no SMT-4, contrary to rumour). The biggest change revealed so far is the enlargement of the compute complex from 4 cores to 8 cores, all sharing a larger L3 cache ("32+ MB", likely to double to 64 MB, I think).

Hilgeman's slides also showed that EPYC 4 "Genoa" is in the definition phase (or was at the time of the presentation in September, at least), and will come with a new platform (SP5), with new memory support (likely DDR5).

Untitled2.png


What else do you think we will see with Zen 4? PCI-Express 5 support? Increased core-count? 4-way SMT? New packaging (interposer, 2.5D, 3D)? Integrated memory on package (HBM)?

Vote in the poll and share your thoughts! :)
 

moinmoin

Diamond Member
Jun 1, 2017
4,926
7,609
136
IMHO that might be overly confident as TGL-8C is not that bad in comparison to Cézanne. While on the other hand Vermeer is really bad, efficiency wise wrt ST/light load.
You need to look at that DTR "laptop" market with unlocked CPUs. They are not about efficiency per se but desktop like performance (including desktop like power usage really) in a laptop form factor, never mind wattage. Power usage at idle won't be of concern in these products.
 

DisEnchantment

Golden Member
Mar 3, 2017
1,587
5,703
136
I doubt it needs to reconcile that. Even with the overhead a Raphael-H will already be more efficient than the direct competition. And the direct competition so far has been DTR chips of the likes of 10980HK, 11980HK and likely 12980HK and 13980HK unless Intel changes the model names.
Zen4 might have dual SDP/CCD links, with current SerDes that is going to be a problem.
Imagine this on 128 Core.
That would be 4x the interconnect energy consumption vs the 64 core part. Or 3x for the 96 core part.
Add to that additional 4 channels of DDR5.
There has to be something changed here.
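The 4x and 3x figures follow from simple link counting. A minimal sketch, assuming 8 cores per CCD, a single IOD-to-CCD link per CCD on the current 64-core part, and interconnect energy that scales with the number of links (all three are assumptions for illustration, and `link_count` is a made-up helper):

```python
# Toy model of the link-count arithmetic in the post above.
# Assumptions (not from AMD): 8 cores per CCD, one IOD-to-CCD link per CCD
# on the 64-core part, interconnect energy proportional to link count.
CORES_PER_CCD = 8

def link_count(cores: int, links_per_ccd: int) -> int:
    """Number of IOD-to-CCD links for a part with the given core count."""
    ccds = cores // CORES_PER_CCD
    return ccds * links_per_ccd

baseline = link_count(64, links_per_ccd=1)  # 64-core part, single link per CCD

for cores in (96, 128):
    dual = link_count(cores, links_per_ccd=2)
    print(f"{cores} cores: {dual} links, {dual / baseline:.0f}x the 64-core part")
```

Under those assumptions the 96-core part ends up with 24 links and the 128-core part with 32, against 8 on the 64-core part.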
 

moinmoin

Diamond Member
Jun 1, 2017
4,926
7,609
136
Zen4 might have dual SDP/CCD links, with current SerDes that is going to be a problem.

Imagine this on 128 Core.
That would be 4x the interconnect energy consumption vs the 64 core part. Or 3x for the 96 core part.
Add to that additional 4 channels of DDR5.
There has to be something changed here.
Considering the IOD barely changed at all between Zen 2 and 3, I do expect big changes for Zen 4. When AMD talked about going organic and MCM instead of an interposer, they showed how cramped the floor plan for all the necessary links already is, so roughly doubling that for DDR5, PCIe5 and SerDes links does seem impossible. We may actually see a more significant change to the current IOD to CCD hierarchy.
 

uzzi38

Platinum Member
Oct 16, 2019
2,556
5,531
146
Zen4 might have dual SDP/CCD links, with current SerDes that is going to be a problem.
Imagine this on 128 Core.
That would be 4x the interconnect energy consumption vs the 64 core part. Or 3x for the 96 core part.
Add to that additional 4 channels of DDR5.
There has to be something changed here.
Oh, Cheese forgot to include it I think, but there were a couple of tables on peak I/O die power draw iirc.

Genoa was at essentially the same level as Milan, +- 10W.

Not sure if I have a copy of those tables somewhere, but they were like 125W and 135W or something thereabouts.

Something is almost certainly different. The I/O is drastically improved - more PCIe lanes, Gen 5, DDR5, many more IFOPs and even still IOD power is essentially static.
 

itsmydamnation

Platinum Member
Feb 6, 2011
2,731
3,063
136
@eek2121
Should I maybe point that I have never ever mentioned M1Pro/Max specifically. I am talking about the Firestorm core of Apple Silicon in general. Because therefore I have hard numbers (AKA facts). Apple is 500% more efficient while 5nm brings smaller 150% - end of story, not?
No because i can make my Zen3 in CB23 ST hold ~3.0ghz while consuming 0.5 watts (on the core). I lose 33% performance at 12x reduced power. that's setting TDP to 9 watts, i have like 300 background processes including sql, web frameworks etc so in this situation i have more non Cb23 load on my device than not. If i really cared to game your metric i would do a clean install, turn off AV, indexing etc.
 

moinmoin

Diamond Member
Jun 1, 2017
4,926
7,609
136
@BorisTheBlade82
Since that appears to be an aspect of power consumption you seem to gloss over so far: Imo it's very important to separate the efficiency of cores from the idle power consumption floor of the uncore. Zen cores for instance already are plenty efficient both at idle and at the power efficiency inflection point. But the uncore is setting the floor. That's also why performance doesn't decrease linearly when reducing cTDP, at some point you starve the cores against the uncore floor. On the other hand this also means in MT efficiency you can recover a high-ish uncore floor by having a lot of efficient cores.
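The uncore-floor argument can be put into a toy model. This is only a sketch: the 4 W floor is a made-up illustrative number, not a measurement, and `core_budget` is a hypothetical helper.

```python
# Toy model: package budget = fixed uncore floor + whatever is left for cores.
# UNCORE_FLOOR_W is a made-up illustrative value, not a measured figure.
UNCORE_FLOOR_W = 4.0

def core_budget(ctdp_w: float, uncore_floor_w: float = UNCORE_FLOOR_W) -> float:
    """Watts left for the cores once the uncore floor is paid."""
    return max(0.0, ctdp_w - uncore_floor_w)

for ctdp in (15, 9, 6, 5):
    cores = core_budget(ctdp)
    print(f"cTDP {ctdp:>2} W -> {cores:.1f} W for cores ({cores / ctdp:.0%} of budget)")
```

The share of the budget that reaches the cores shrinks faster than the cTDP itself, which is the non-linear behaviour described above.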
 

uzzi38

Platinum Member
Oct 16, 2019
2,556
5,531
146
No because i can make my Zen3 in CB23 ST hold ~3.0ghz while consuming 0.5 watts (on the core). I lose 33% performance at 12x reduced power. that's setting TDP to 9 watts, i have like 300 background processes including sql, web frameworks etc so in this situation i have more non Cb23 load on my device than not. If i really cared to game your metric i would do a clean install, turn off AV, indexing etc.
Could you try running it with Process Lasso holding the workload to a single core in particular? I would but I'm away from my PC for a bit.

I noticed when I last tested this that CPPC2 and the whole switching between preferred cores thing actually did lead to much lower ST power recorded on average, so you would need to lock it to a single thread to get an accurate measurement.

(FYI when I did this on my 3800X about a year ago I got 3.6GHz sustained at 3.2W iirc).
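For anyone who wants to script the pinning rather than use Process Lasso: a minimal sketch using Python's `os.sched_setaffinity`, which only exists on Linux. On Windows you would stick with Process Lasso or Task Manager affinity as discussed here. The helper name `pin_to_single_cpu` is made up for this example.

```python
import os

def pin_to_single_cpu():
    """Pin the current process to one logical CPU; return the new affinity set.

    Returns None where os.sched_setaffinity is unavailable (e.g. Windows, macOS).
    """
    if not hasattr(os, "sched_setaffinity"):
        return None
    allowed = os.sched_getaffinity(0)  # CPUs this process may currently use
    target = min(allowed)              # arbitrary choice: lowest-numbered CPU
    os.sched_setaffinity(0, {target})
    return os.sched_getaffinity(0)

print(pin_to_single_cpu())
```

Child processes inherit the affinity mask on Linux, so a benchmark launched from this script after pinning should stay on the same logical CPU.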
 

itsmydamnation

Platinum Member
Feb 6, 2011
2,731
3,063
136
So just finished my 9 watt TDP ~0.5 watt core CB single core run, scored 996 compared to about 1300 in normal operation. so 76% of the performance for 1/12 of the power consumption on the core. let's see what 8 watt tdp looks like :p
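The arithmetic behind those numbers, as a quick sketch. It uses only the figures quoted in this post (996, about 1300, 1/12 of the core power); the perf/W figure is derived from them, not measured.

```python
# Figures from the post: CB23 ST score at 9 W TDP vs. normal operation,
# with core power reported as roughly 1/12 of normal.
score_limited = 996
score_normal = 1300    # "about 1300" per the post
power_ratio = 1 / 12   # limited core power / normal core power

perf_ratio = score_limited / score_normal
perf_per_watt_gain = perf_ratio / power_ratio

print(f"performance retained: {perf_ratio:.1%}")
print(f"perf/W vs. normal:    {perf_per_watt_gain:.1f}x")
```

So roughly three quarters of the score at about nine times the core-level perf/W, if the reported power figures are taken at face value.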

Could you try running it with Process Lasso holding the workload to a single core in particular? I would but I'm away from my PC for a bit.

I noticed when I last tested this that CPPC2 and the whole switching between preferred cores thing actually did lead to much lower ST power recorded on average, so you would need to lock it to a single thread to get an accurate measurement.

(FYI when I did this on my 3800X about a year ago I got 3.6GHz sustained at 3.2W iirc).
i am already doing that but just with core affinity, otherwise it moves around too fast to see anything :)

edit: interesting behaviour going from 9 to 8 watts in ryzen controller. The uncore power has dropped massively and my core is consuming ~ 2.7-3.0watts @ 3.6ghz. There is some black magic behind the SMU .....lol

edit2: it looks like it's actively parking more cores too

edit3: so 8watt score was just under 1200 , just tried 5 watt ..... appears to be the "limit" 0.25watt on the core , 400mhz clock.... lol
there are 7 of 16 threads at 100% load , the other cores are consuming ~ 0.1 watt

edit4: going from 5 to 6 watts sees the core jump to 2.2 to 3ghz, core power between 0.5 and 0.8 watts (bouncing quite a bit) and now i need to work, will play more later :p
 

Ajay

Lifer
Jan 8, 2001
15,332
7,792
136
Zen4 might have dual SDP/CCD links, with current SerDes that is going to be a problem.
Imagine this on 128 Core.
That would be 4x the interconnect energy consumption vs the 64 core part. Or 3x for the 96 core part.
Add to that additional 4 channels of DDR5.
There has to be something changed here.

Thankfully, the supported AVX-512 instructions will be using 64B data paths between cache and execution ports. It wasn't clear to me that that was going to be the case (it would have neutered performance otherwise). So, AMD is serious about high throughput AVX-512 compute.

For AM5, there is a processor configuration with 28 PCIe lanes, so 4 extra for another 4.0 M.2 slot. (or 5.0, eventually perhaps).
Speed freaks can put them in RAID 0 and get their coveted PCIe 5.0 data rates :p

Thanks for the link, interesting times :)
 

eek2121

Platinum Member
Aug 2, 2005
2,883
3,860
136
@eek2121
Should I maybe point that I have never ever mentioned M1Pro/Max specifically. I am talking about the Firestorm core of Apple Silicon in general. Because therefore I have hard numbers (AKA facts). Apple is 500% more efficient while 5nm brings smaller 150% - end of story, not?

@moinmoin
IMHO that might be overly confident as TGL-8C is not that bad in comparison to Cézanne. While on the other hand Vermeer is really bad, efficiency wise wrt ST/light load.

This isn't an Apple thread. That number "500% more efficient" is wrong. The CPU has a larger power budget than Cezanne. A 5950X core beats the Firestorm core in ST performance in GB5, for example, while consuming under 20W for most tasks (including that GB5 bench). Also, Apple's chips are optimized for power, while AMD's chips are optimized for performance and expandability. You can't buy a Mac with 256 GB of RAM. You can't buy a Mac with 12TB of storage. You can't upgrade an M1 based Mac, period. Apple has a slight efficiency lead, most or all of which is down to a more efficient process (which Zen 4 will launch on), and not having to support DDR4/PCIe 4. In addition, you are comparing an old architecture from AMD with a cutting edge architecture from Apple. If you want to continue this discussion about the M1, please take it to the appropriate thread. I am not going to respond further in this thread about it. We aren't interested in talking about Apple in every thread. This thread is about Zen 4.
 

eek2121

Platinum Member
Aug 2, 2005
2,883
3,860
136
Zen4 might have dual SDP/CCD links, with current SerDes that is going to be a problem.
Imagine this on 128 Core.
That would be 4x the interconnect energy consumption vs the 64 core part. Or 3x for the 96 core part.
Add to that additional 4 channels of DDR5.
There has to be something changed here.
The IO die is rumored to be on N6. Also, IF speeds are rumored to be dynamic. AMD was rumored to be playing around with dynamic DDR5 speeds at some point. The current IO die design is also almost 4 years old. Even with Cezanne, AMD has made it clear there is much room for improvement.

Rembrandt on the mobile side will apparently feature several improvements to power (though Cezanne honestly is pretty efficient), those improvements to Rembrandt will carry over to Zen 4 more than likely.

I'm looking forward to Zen 4. I strongly suspect that there will be massive improvements over Zen 3 in terms of perf/watt.
 

BorisTheBlade82

Senior member
May 1, 2020
660
1,003
106
@leoneazzurro
IMHO ST is the best way to analyse architecture differences. Because for MT it is quite obvious that more cores and threads typically wins (other factors being the same). And what do you mean with "Link these data, the conditions in which these data are taken"?
 

DrMrLordX

Lifer
Apr 27, 2000
21,571
10,764
136
The SerDes is what also keeps me thinking. So maybe Zen4 could finally be the time they, at least on mobile, could use something like InFO-LSI (TSMC's version of EMIB, but better). This would improve interconnect power efficiency by around 1000% and would make chiplets in the mobile space viable. But the leaks suggested that they seem to stay on the organic package interconnect, at least on the desktop. So for me it is rather improbable that it would be technically feasible with one and the same die. But I keep my fingers crossed.

See comments from @uzzi38 . It looks like they've done something. More throughput, similar power on Genoa vs. Milan. Maybe those changes will also appear in desktop products along with DTR products like Raphael-H.

If you mean that AMD could stick a large GPU into a SoC and feed that with high bandwidth memory, then clearly yes. If you mean them to reach the same power efficiency, then definitely no. The margin is much too big for them to catch up within one generation.

We're talking about competition that can already hit 60W on battery. Most DTRs downclock on battery anyway, so being only competitive when receiving wall power wouldn't be entirely bad.

Also why bring Apple PR in this thread, there are lots of such threads.

Forget the PR for a moment and consider what market changes AMD must react to. In a "worst case" scenario, their main x86 competition will flame out on their 10nm process (10ESF) and not release anything compelling in significant volume due to wafer supply problems, or will at least be out of the game until said competition moves to their 20A process. And by "worst case" I mean "possible, though perhaps not probable; we'll see". In that scenario, AMD would need to shoulder the burden of the entire x86 world looking to them for chips on cutting-edge processes. Who will be competing with them from outside the x86 world?

Apple isn't using lagging PowerPC hardware or also-ran Intel stuff anymore. AMD would be foolish not to steer their products accordingly.

But with Zen 4 AMD won't have a direct answer to any M1 variant, and also doesn't need one unlike Intel which wants to win back Apple as a customer.

I will concede that, today, AMD doesn't really have DTR products that would compete directly against the most expensive (and heavy!) Mac Pro units announced recently. There are some boutique laptop manufacturers that are in the same segment using what is essentially a mashup of desktop and laptop hardware, but AMD doesn't address that segment. In the future, they may wish to change that. And if they do, then whabam, they are in competition. You must concede that a 16c Raphael-H would be a significant departure from their current mobile offerings. We are getting into DTR territory with hardware like that.

There's also the possibility that stuff like the M1X will filter into other products eventually that will compete more directly with things AMD makes.

As a general aside, please let's not look at speculation like this as an excuse to steer the thread to a much-hyped quasi-competitor. Instead, the underlying question I had initially (which I still consider to be somewhat unanswered) is: why Raphael-H? The technical hurdles of that aside, it doesn't really look like anything Intel is doing would necessitate that change. Alder Lake-P will reach (at most) 6+8 (20t), and the MT performance of that chip will probably be worse than a hypothetical monolithic Zen4 8c/16t -H or -HS SoC. Raptor Lake will do little more than add -e cores which . . . well let's just say that the jury's still out on whether or not those will really improve application performance much in any meaningful way. Meteor Lake will face Zen5 and may well be rare as hen's teeth thanks to shortages of wafers (yes, including N3 wafers taken from TSMC).

16c Raphael-H really looks like AMD is going for a DTR chip. If they can somehow fit this beast into a 45W power envelope and put it in the premium laptop segment, then more power to them. I'm skeptical of that being the case.
 

BorisTheBlade82

Senior member
May 1, 2020
660
1,003
106
Zen4 might have dual SDP/CCD links, with current SerDes that is going to be a problem.
Imagine this on 128 Core.
That would be 4x the interconnect energy consumption vs the 64 core part. Or 3x for the 96 core part.
Add to that additional 4 channels of DDR5.
There has to be something changed here.
In the article you quoted there is also the following passage: " ‘Narrow mode’ may be a way to save even more power by disabling one of the IF links when high bandwidth isn’t required." So I guess this is how they try to tackle the consumption problem - being more dynamic depending on bandwidth demand.
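How such a policy might look in principle, purely as an illustration: bring up only as many links as the bandwidth demand needs. The per-link bandwidth below is a placeholder, `active_links` is a made-up helper, and nothing here reflects how AMD actually implements "narrow mode".

```python
import math

def active_links(demand_gbs: float, per_link_gbs: float, total_links: int = 2) -> int:
    """Fewest links that cover the demanded bandwidth, keeping at least one up."""
    needed = math.ceil(demand_gbs / per_link_gbs)
    return max(1, min(total_links, needed))

# PER_LINK_GBS is a placeholder, not a real Infinity Fabric figure.
PER_LINK_GBS = 32.0
for demand in (5.0, 32.0, 40.0, 100.0):
    print(f"{demand:>5.1f} GB/s demand -> {active_links(demand, PER_LINK_GBS)} link(s)")
```

With two links per CCD, the second one only comes up once demand exceeds what a single link can carry.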
 

Joe NYC

Golden Member
Jun 26, 2021
1,872
2,153
106
Considering the IOD barely changed at all between Zen 2 and 3, I do expect big changes for Zen 4. When AMD talked about going organic and MCM instead of an interposer, they showed how cramped the floor plan for all the necessary links already is, so roughly doubling that for DDR5, PCIe5 and SerDes links does seem impossible. We may actually see a more significant change to the current IOD to CCD hierarchy.

It's a brand new I/O die on 6nm. But the real bottleneck, and albatross, are the SerDes links.

Which, as you said, first double in order to double the bandwidth to each CCD to match DDR5, then go up from there by another 1.5x or 2x between Genoa and Bergamo.

And even if all of that is done, it would still prevent the CCDs from accessing each other's (huge) L3 any faster than a DDR5 channel.

It seems to me that simple EMIB would solve 90% of the problems SerDes causes...
 

lobz

Platinum Member
Feb 10, 2017
2,057
2,856
136
No because i can make my Zen3 in CB23 ST hold ~3.0ghz while consuming 0.5 watts (on the core). I lose 33% performance at 12x reduced power. that's setting TDP to 9 watts, i have like 300 background processes including sql, web frameworks etc so in this situation i have more non Cb23 load on my device than not. If i really cared to game your metric i would do a clean install, turn off AV, indexing etc.
But that's just wizardry. Hard facts please!! 🤣
 

moinmoin

Diamond Member
Jun 1, 2017
4,926
7,609
136
It seems to me that simple EMIB would solve 90% of the problems SerDes causes...
You know what else would "solve" the "SerDes problem"? Going monolith...

The point with IOD being essentially unchanged is not one of nodes, but the comparison of the uncore with e.g. known improvements in the APU dies since. We have already talked a couple of times on these boards about the heavy toll an ever more feature-rich uncore takes on power consumption. The biggest power eater in the uncore is by some distance the IMC, depending on memory frequency, which is why the APUs introduced dynamic memory frequency scaling depending on CPU load and the latency or bandwidth demanded. Other areas which can't be completely power gated with the current IOD get some dynamic regulation as well, essentially offering both a power saving state and a high performance state. @BorisTheBlade82 already pointed to the "narrow mode" mentioned, which uses only the links required for the overall bandwidth. All this further increases complexity in an MCM setup though, so it will be very interesting to see the design decisions AMD made in this balancing act. Considering that with Zen 4 AMD also introduces new platforms in SP5 and AM5, the MCM hierarchy and layout are no longer bound to be compatible with SP3 and AM4, so they can and will be changed to adapt to all the modern requirements.
 

DisEnchantment

Golden Member
Mar 3, 2017
1,587
5,703
136
If we go a step back, Ryzen 5000/Zen3 is a tradeoff across so many things

Zen 3 MTr/mm2 is around ~51, MI100 is around ~66, 30% higher. Zen 3 had to trade off 30% density to achieve the high clocks. (RDNA2 as well had similar MTr/mm2 to Zen3, because the GPU team learnt from the CPU team according to Suzanne Plummer :grinning: )
Why the high clocks, in my opinion:
Because the Core +L2 (around ~204MTr) is not that wide and much smaller than Intel's for example (Sunny Cove is ~283MTr) and Firestorm (~502MTr).
(But they needed to improve the efficiency by making it small to run at such high clocks, so it is kind of a vicious cycle)
Because original design of Zen (1,2,3 at least) is to make the die small for cost, defects, yield etc because AMD cannot charge whatever they want.
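A quick check of the density comparison quoted at the top of this post. The two MTr/mm2 values are taken as given from the post and not independently verified; the point is only that "X% higher" and "X% lower" are not the same number.

```python
# Density figures as quoted in the post (MTr/mm2); not independently verified.
zen3_density = 51.0
mi100_density = 66.0

higher = mi100_density / zen3_density - 1  # MI100 relative to Zen 3
lower = 1 - zen3_density / mi100_density   # Zen 3 relative to MI100

print(f"MI100 is {higher:.0%} denser than Zen 3")
print(f"Zen 3 is {lower:.0%} less dense than MI100")
```

With those inputs the gap is about 29% when measured from the Zen 3 side and about 23% when measured from the MI100 side.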

When Zen2 was introduced, they needed to add the GAMECACHE, because they are getting hammered by Intel in a key workload in the Windows World, Gaming, but in my opinion was an improvisation and not what was envisioned during the Architectural work 4+ years ago.
What is good though, is that there is not going to be an increase of L3 in Genoa
Increasing L3 size can cause regression in IPC if the increase comes with more cycles and of course there is power involved. V Cache comes with "minimal cost of latency" as per AMD, this means it will cause a minor regression in some workloads. But hitrate is massively increased for workloads like gaming. Thankfully the V Cache can be power gated.
In the end Zen3 Core + L2 + L3 turned out to be big, to address the gaming load. There are other benefits as well in the HPC space, but the effect is profoundly highlighted in the Windows world

Operating range at the very extreme of the Shmoo plot is not exactly going to make the chip efficient

For Zen4
I did mention before that I would prefer AMD don't scale up the frequencies again, otherwise again this same cycle would take effect, but in the PR some days ago Hallock alluded to increasing clocks again soooo :expressionless:

On N5P, there is a lot more room to maneuver if they don't go for the absolute frequency.
The process inherently offers a lot more speed (20% over N7 at same power). With HD cells they could make small adjustments to hit clock targets, assuming their frequency targets are not so high.
This can allow them to minimize the tradeoff of density for speed, which means they can pack more transistors per mm2. This means more logic.
It also means they are not operating at the very extreme of the Shmoo plot and can greatly control the efficiency.

If AMD only take minor speed improvements, say 5%, they can put all the gain into efficiency plus cram in more transistors, because there is no need to go for the absolute tradeoff for frequency.
Putting more logic, in the end, can increase "IPC" because you can have more logic blocks, register file, ROB, etc., improving the perf/watt

1634721415965.png
As per TSMC ~4.1GHz is the best range to run the CPU, and probably around 4.3GHz for N5P which AMD will use
So there is a lot of opportunities made available by the process, but it is very interesting indeed what choices AMD will make this time again.

What is known at this point is the die size, 72mm2, at this size, keeping L3 same, the Core+L2 for Zen4 is going to be quite small, slightly higher MTr than Sunny Cove at best.

When you think about this, Sony in PS5 SoC still want to remove blocks from the Core, smh.
 

leoneazzurro

Senior member
Jul 26, 2016
901
1,426
136
I agree with most of what you said, sadly the clocks are going high in Zen4 because the competition is raising them without regard for the power consumption... Also I think the cache was something planned, after all there is a lot of emphasis by all CPU makers on large caches for various reasons, not only for gaming.
 

DisEnchantment

Golden Member
Mar 3, 2017
1,587
5,703
136
Also I think the cache was something planned, after all there is a lot of emphasis by all CPU makers on large caches for various reasons, not only for gaming.
It is not a straight answer, there are pros and cons of increasing cache
For applications that have huge datasets it is a big plus but not for all.
Therefore V Cache is the best solution, stacking additional dies only on the specific SKUs.

Good writeup here on why increased cache at the cost of latency is not good for most use cases

L3 design is also a tradeoff, you can make L3 faster but the cost is density and power.
There are so many dials and levers at play.
 

DrMrLordX

Lifer
Apr 27, 2000
21,571
10,764
136
You know what else would "solve" the "SerDes problem"? Going monolith...

AMD went with the CCD strategy in the beginning to scale out # of cores where applicable, to use commodity dice between desktop, workstation, and server, and to improve yields. If AMD can get good yields on 16c parts and below on N5 then they could just go monolithic on their entire desktop and laptop lineup. The I/O die would be exclusive to EPYC and Threadripper of that generation. Not saying that's what they'll do, since it would violate the pattern from Zen -> Zen3 and force them to produce multiple monolithic dice for desktop/laptop (more masks = more time, more money). They would need an 8c monolithic and a 16c monolithic at least, and then fuse off cores for 12c and 6c parts.