Discussion Speculation: Zen 4 (EPYC 4 "Genoa", Ryzen 7000, etc.)


Vattila

Senior member
Oct 22, 2004
799
1,351
136
Except for the details about the improvements in the microarchitecture, we now know pretty well what to expect with Zen 3.

The leaked presentation by AMD Senior Manager Martin Hilgeman shows that EPYC 3 "Milan" will, as promised and expected, reuse the current platform (SP3), and the system architecture and packaging look to be the same, with the same 9-die chiplet design and the same maximum core and thread count (no SMT-4, contrary to rumour). The biggest change revealed so far is the enlargement of the compute complex from 4 cores to 8 cores, all sharing a larger L3 cache ("32+ MB", likely to double to 64 MB, I think).

Hilgeman's slides did also show that EPYC 4 "Genoa" is in the definition phase (or was at the time of the presentation in September, at least), and will come with a new platform (SP5), with new memory support (likely DDR5).

Untitled2.png


What else do you think we will see with Zen 4? PCI-Express 5 support? Increased core-count? 4-way SMT? New packaging (interposer, 2.5D, 3D)? Integrated memory on package (HBM)?

Vote in the poll and share your thoughts! :)
 

jamescox

Senior member
Nov 11, 2009
637
1,103
136
If they believe as I do that almost all Apple customers (whether for iPhone or Mac) first make the decision "iPhone or Android?" and "Mac or PC?", then PC OEMs should not be worried about competing against Apple because by the time someone looks at a PC laptop they've already decided against (or never considered) Apple. With x86 Macs it was possible those who wanted to run Windows could buy a Mac and ignore macOS. That's no longer an option, if you buy a Mac you are using a different OS so have to first make the decision to switch operating systems before you look at hardware.

A PC OEM's competition is other PC OEMs, not Apple. It is clearly FAR easier for a PC OEM to grow at the expense of other PC OEMs versus growing at the expense of Apple - and similarly easier to minimize losses in a declining market like the current one by offering better products than other PC OEMs versus offering better products than Apple.

If Intel or AMD offered solutions like Apple's with higher bandwidth DRAM it would be available to all PC OEMs, so they would gain no advantage relative to each other. They compete by offering better value for the money in terms of price for a given configuration, or better value in terms of build quality / support (the latter is what matters for repeat business)

PCs with memory bandwidth comparable to Apple's would make little difference as far as competing with Apple, except for the very small segment of people who know they are doing something that's highly dependent on memory bandwidth and it is SO important to them that something performs well they choose a Mac despite the hassle of having to switch operating systems.

If they wanted to compete with Apple hardware wise they need to look at the power consumption of a Mac versus that of a PC doing the exact same thing. That's why Apple is gaining customers, not because it has hundreds of GB/sec of memory bandwidth. I doubt more than a few percent of Mac or PC buyers could tell you within an order of magnitude the memory bandwidth of what they're buying.

It doesn’t matter that a lot of customers don’t know how much memory bandwidth their laptop has. This is the same type of thing as the “gaming performance crown”, which is often taken by a ridiculously expensive product. People see the leader at the top and then often buy a lower-end product, even if it isn’t actually the best performance per dollar in their price range. Apple has gotten a huge amount of positive press with the switch to its own ARM processors. There are a lot of professionals saying how great it is for this or that, massive battery life improvements over PC, etc. It is a massive marketing advantage, even if the product that is actually in a lot of customers’ price range is not particularly cost effective. Admittedly, a lot of sales are likely to people already in the Mac ecosystem, but I suspect a lot of people who aren’t are going to look at the new Macs. I also don’t think it is that big of a deal to switch OS. A large number of people don’t really use any software that isn’t available on both systems, except gaming and some niche things. I don’t have that much visibility into that, though, since I am not a typical consumer. I use Mac and Linux. I haven’t touched Windows since Win 98, so I can’t really argue about the software side of things.

Regardless of the competitive situation with Apple, there are a lot of rumors talking about really powerful AMD APUs, so it seems likely that there will be some mobile APUs with much better bandwidth than just the standard DDR memory. Given AMD’s current tech, this seems more likely to be a large-cache part rather than a high-bandwidth one. This isn’t like Zen 1, where it was initially one chip across almost their entire product lineup. They have been making more specialized chips to some extent, and monolithic dies have been the best for mobile, so it may be a monolithic die or perhaps something connected by EFB. A single stack of HBM “cache” would change the competitive landscape a bit. It may not be very expensive with EFB rather than large interposers. They may still have a modular solution if the Infinity Cache base die is a thing. SoIC stacking may change things as far as monolithic being the lowest power for mobile. The base die can be on a process that is optimized for it. The top die can also be optimized for whatever structures it contains. This may vary between a CPU die and a GPU die.
 

Abwx

Lifer
Apr 2, 2011
10,934
3,423
136
If AMD stated that Zen 4 will dominate in gaming, then it means that there's some healthy IPC improvement for INT-based code, though I'm not sure that will be the main factor.

Out of curiosity I checked the ST perf in 7-Zip, which is representative of INT IPC, for current CPUs: the 5950X is at 6850 MB/s and the 12900K at 6571 MB/s, so Intel's latest is not that strong in this area, contrary to FP-based code.

The 5800X3D is no better than the 5950X, which means that ADL's advantage vs. the 5950X in games lies in DDR5's better bandwidth.


 

Doug S

Platinum Member
Feb 8, 2020
2,243
3,460
136
It doesn’t matter that a lot of customers don’t know how much memory bandwidth their laptop has. This is the same type of thing as the “gaming performance crown” which is often taken by a ridiculously expensive product. People see the leader at the top and then often buy a lower end product, even if it isn’t actually the best performance per dollar in their price range. Apple has gotten a huge amount of positive press with their switch to their own ARM processors. There are a lot of professionals saying how great it is for this or that. Massive battery life improvements over PC, etc, etc. It is a massive marketing advantage, even if the product that is actually in a lot of customers price range is not particularly cost effective. Admittedly, a lot of sales are likely people already in the Mac ecosystem, but I suspect a lot of people who aren’t are going to look at the new macs. I also don’t think it is that big of a deal to switch OS. A large number of people don’t really use any software that isn’t available on both systems, except gaming and some niche things. I don’t have that much visibility into that though since I am not a typical consumer. I use Mac and Linux. I haven’t touched windows since win 98, so I can’t really argue about the software side of things.

Regardless of the competitive situation with Apple, there are a lot of rumors talking about really powerful AMD APU’s, so it seems likely that there will be some mobile APUs with much better bandwidth that just the standard DDR memory. Given AMD’s current tech, this seems to be more likely to be a large cache part rather than high bandwidth. This isn’t like Zen 1 where it was initially one chip across almost their entire product lineup. They have been making more specialized chips to some extent and monolithic die have been the best for mobile, so it may be a monolithic die or perhaps something connected by EFB. A single stack of HBM “cache” would change the competitive landscape a bit. It may not be very expensive with EFB rather than large interposers. They may still have a modular solution if the infinity cache base die is a thing. The SoIC stacking may change things as far as monolithic being the lowest power for mobile. The base die can be on an process that is optimized for it. The top die can also be optimized for whatever structures it contains. This may vary between a cpu die and a gpu die.


But the thing is, it is (or at least sure as hell should be) easy to beat Apple because they ONLY have power efficient stuff. Apple isn't going to be able to beat a DTR laptop with 3x the peak power draw of a Macbook Pro, so there is the benchmark win. More reasonable configurations might not be able to beat them on power efficiency but they can at least play in the same ballpark. People won't notice that you can't get that top performance and reasonable power efficiency at the same time, because almost everyone who REALLY cares about one of those things doesn't care a whole lot about the other.
 

JoeRambo

Golden Member
Jun 13, 2013
1,814
2,105
136
Out of curiosity i checked the ST perf in 7 ZIP, wich is representaive of INT IPC, for current CPUs, the 5950X is at 6850 MB/s and the 12900K at 6571 MB/s, so Intel s latest is not that strong in this register contrary to FP based code.

The 5800X3D is no better than the 5950X ,wich mean that ADL advantage vs 5950X for games lie in the DDR5 better bandwith.

7-Zip and compression stuff in general is bad for ST comparisons, as it is very sensitive to the memory subsystem, and various interactions with bandwidth/latency produce crazy scaling when memory latency improves or the workload hits some special L1 / L2 / L3 cache size threshold.
Case in point: in the same 18.3 7zr as linked by you, Zen 3 stuff scores ~6800 MIPS and the 12900K scores 6600 MIPS, but my 12900K with tuned memory scores 8600 MIPS at a fixed clock of 5 GHz.

It was the same with Skylake: I had some very good scores in WinRAR on the 10900K with tuned memory, and it was especially sensitive to secondary/tertiary memory timings.

Alder Lake, in my opinion, is an integer-processing monster driven by 5 ALUs and massive OoO machinery. Unfortunately it is fueled by mediocre caches and an abysmal memory subsystem that can't feed it properly once the workload spills out of the L2 caches.
I think the best showcase for raw integer IPC is web benchmarks like Speedometer 2.0, Octane, etc., where in my testing Alder Lake enjoys up to a 30% advantage clock for clock versus Zen 3 when both are running tuned memory.
 

eek2121

Platinum Member
Aug 2, 2005
2,929
4,000
136
If they believe as I do that almost all Apple customers (whether for iPhone or Mac) first make the decision "iPhone or Android?" and "Mac or PC?", then PC OEMs should not be worried about competing against Apple because by the time someone looks at a PC laptop they've already decided against (or never considered) Apple. With x86 Macs it was possible those who wanted to run Windows could buy a Mac and ignore macOS. That's no longer an option, if you buy a Mac you are using a different OS so have to first make the decision to switch operating systems before you look at hardware.

A PC OEM's competition is other PC OEMs, not Apple. It is clearly FAR easier for a PC OEM to grow at the expense of other PC OEMs versus growing at the expense of Apple - and similarly easier to minimize losses in a declining market like the current one by offering better products than other PC OEMs versus offering better products than Apple.

If Intel or AMD offered solutions like Apple's with higher bandwidth DRAM it would be available to all PC OEMs, so they would gain no advantage relative to each other. They compete by offering better value for the money in terms of price for a given configuration, or better value in terms of build quality / support (the latter is what matters for repeat business)

PCs with memory bandwidth comparable to Apple's would make little difference as far as competing with Apple, except for the very small segment of people who know they are doing something that's highly dependent on memory bandwidth and it is SO important to them that something performs well they choose a Mac despite the hassle of having to switch operating systems.

If they wanted to compete with Apple hardware wise they need to look at the power consumption of a Mac versus that of a PC doing the exact same thing. That's why Apple is gaining customers, not because it has hundreds of GB/sec of memory bandwidth. I doubt more than a few percent of Mac or PC buyers could tell you within an order of magnitude the memory bandwidth of what they're buying.

I have to disagree. Though the Mac is definitely a niche product in many areas, there are markets where Apple dominates the PC. One of those is software development. Windows/Linux have made inroads in these areas; however, Apple continues to provide the best software/hardware for the job. Apple’s only real weak point right now is the lack of an enterprise server offering.

In order for PC OEMs to compete, they need AMD and Intel to provide more competitive offerings. It is getting to the point where some AI vendors are looking at porting popular packages to Apple silicon. Some software can be run far cheaper on a Mac Studio than elsewhere.

I even know of a company working on custom racks for the Mac Studio.

Regarding my earlier comment about memory bandwidth, it isn’t just about the GPU, though that is important. As many are finding out with the 5800X3D, there are a variety of workloads that benefit from insanely fast memory, and the only reason more workloads don’t benefit is because they need to be optimized to take advantage.
 

gdansk

Platinum Member
Feb 8, 2011
2,078
2,559
136
7Zip and compression stuff in general is bad for ST comparisons as they are very sensitive to memory subsystem and various interactions with bandwidth/latency provide crazy scaling with improving memory latency or hitting some special L1 / L2 / L3 cache size threshold.
Case in point: in same 18.3 7zr as linked by You, Zen3 stuff scores ~6800 MIPS, 12900K scores 6600 MIPS, but my tuned memory 12900K scores 8600 MIPS @ fixed clock of 5Ghz.

It was the same with Skylake, i had some very good scores in Winrar on 10900K by having tuned memory and it was esp sensitive to secondary/tertiary memory timings.

Alder Lake in my opinion is integer processing monster driven by 5 ALUs and massive OoO machinery. Unfortunately it is fueled by mediocre caches and abysmal memory subsystem that can't feed it properly once workload is spilling out of L2 caches.
I think the best showcase for raw integer IPC is web benchmarks like Speedometer 2.0, Octane etc, where in my testing Alder Lake enjoys up to 30% advantage clock for clock versus Zen3 when both are running tuned memory.
Web benchmarks aren't great, as the default type for numbers in JavaScript is a double. I'm not sure if that's most of the work, but it definitely gives the FPUs something to chew on.
 

JoeRambo

Golden Member
Jun 13, 2013
1,814
2,105
136
Web benches aren't great as default type for numbers is a double. I'm not sure if that's most of the work but it definitely gives the FPUs something to chew.

Yeah, but current browser engines are ridiculous JS optimization machines combined with very optimized VMs to execute said code: basically compilers on steroids that apply a ton of optimizations. I would not be surprised if there is very little FPU code left to run once they are done.

And all that multi-tier optimization and compilation work results in a very classic integer "ALU" and "AGU" type of load, where ADL with its deep bench of OoO resources and 5 ALU ports shines.
 

Abwx

Lifer
Apr 2, 2011
10,934
3,423
136
7Zip and compression stuff in general is bad for ST comparisons as they are very sensitive to memory subsystem and various interactions with bandwidth/latency provide crazy scaling with improving memory latency or hitting some special L1 / L2 / L3 cache size threshold.
Case in point: in same 18.3 7zr as linked by You, Zen3 stuff scores ~6800 MIPS, 12900K scores 6600 MIPS, but my tuned memory 12900K scores 8600 MIPS @ fixed clock of 5Ghz.

It was the same with Skylake, i had some very good scores in Winrar on 10900K by having tuned memory and it was esp sensitive to secondary/tertiary memory timings.

Alder Lake in my opinion is integer processing monster driven by 5 ALUs and massive OoO machinery. Unfortunately it is fueled by mediocre caches and abysmal memory subsystem that can't feed it properly once workload is spilling out of L2 caches.
I think the best showcase for raw integer IPC is web benchmarks like Speedometer 2.0, Octane etc, where in my testing Alder Lake enjoys up to 30% advantage clock for clock versus Zen3 when both are running tuned memory.

The statement is from AT; they used 7-Zip as a metric for INT-based computations. Besides, we are talking about ST perf as a way to extract IPC without saturating the memory bandwidth and being limited too much by R/W latencies.

Tuned memory is certainly a great factor for improvement, but anything out of spec can't be considered guaranteed by the manufacturer to be error-free.

 

JoeRambo

Golden Member
Jun 13, 2013
1,814
2,105
136
Tuned memory is certainly a great factor for improvement but anything out of spec cant be considered as being guaranted from the manufacturer to be errors free.

The "errors free" part is irrelevant to this discussion. What is relevant is that by dropping clocks and tuning memory I was able to improve the score by 30%. So the test might be testing ALUs only when it is not bottlenecked by memory. Since we already know 96 MB of L3 is not helping much, it must be something about memory, right? I expect the same scaling to happen for Zen 3.
 

jamescox

Senior member
Nov 11, 2009
637
1,103
136
7Zip and compression stuff in general is bad for ST comparisons as they are very sensitive to memory subsystem and various interactions with bandwidth/latency provide crazy scaling with improving memory latency or hitting some special L1 / L2 / L3 cache size threshold.
Case in point: in same 18.3 7zr as linked by You, Zen3 stuff scores ~6800 MIPS, 12900K scores 6600 MIPS, but my tuned memory 12900K scores 8600 MIPS @ fixed clock of 5Ghz.

It was the same with Skylake, i had some very good scores in Winrar on 10900K by having tuned memory and it was esp sensitive to secondary/tertiary memory timings.

Alder Lake in my opinion is integer processing monster driven by 5 ALUs and massive OoO machinery. Unfortunately it is fueled by mediocre caches and abysmal memory subsystem that can't feed it properly once workload is spilling out of L2 caches.
I think the best showcase for raw integer IPC is web benchmarks like Speedometer 2.0, Octane etc, where in my testing Alder Lake enjoys up to 30% advantage clock for clock versus Zen3 when both are running tuned memory.
I doubt that has much of anything to do with the 5-ALU and OoO configuration. This comes up in many threads: why not add more ALUs? More cylinders in your engine is better, right? There are lots of tradeoffs. CPU designers have cycle-accurate register-transfer-level simulators to try out different things, but one configuration isn’t going to be the best for all applications. In my experience, even compute-heavy tasks do not reach very high actual IPC. I have done some testing with Intel Processor Counter Monitor, which can access the CPU's internal counters. Actual IPC is frequently 1 or less, even for supposedly compute-heavy applications. IPC is a massively overused term that doesn’t imply what most people seem to think it does. It is just instructions per clock, and it varies massively between applications and even between different segments of the same application. You also cannot separate it from the memory system. If Alder Lake is bottlenecked by the L3 or other memory system components, then it is what it is. Most modern applications are memory bound in some manner, because a superscalar OoO CPU at 4 or 5 GHz can execute ridiculous numbers of instructions if it has the data; it usually has to wait.

I would suspect the performance differences you see are almost entirely due to L2 cache size and other cache effects. Intel has always been good at making caches. Modern CPUs are generally more memory by area than anything else due to current limitations. There are a lot of applications that respond very well to larger L2 caches. That is why some of the old Core 2 Duo and Core 2 Quad processors remained usable for such a long time. They had 4 to 8 MB of L2 cache. If you have a resident set size that mostly fits in the L2, then it can perform extremely well for some applications. Balancing caches for a wide variety of applications is difficult, though. Most of the applications you are talking about likely have very small resident set sizes for some functions that fit in the large L2 caches of Alder Lake. AMD will have larger L2 and supposedly significantly improved L1 for Zen 4. Some very early rumors for Zen 5 are saying massive L2 caches. Perhaps L3 is all stacked with Zen 5.
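
To make the "it usually has to wait" point concrete, here is a minimal sketch in Python of how memory stalls drag average IPC down. It is a deliberately crude model, not a description of any real core: the function name, the 4-wide peak, the misses-per-thousand-instructions values and the 200-cycle miss penalty are all hypothetical, and it assumes every miss stalls the core completely (no overlap between misses).

```python
def effective_ipc(instructions: int, core_ipc: float,
                  mpki: float, miss_penalty_cycles: float) -> float:
    """Average IPC under a toy model where each cache miss fully stalls the core.

    instructions        -- number of instructions retired
    core_ipc            -- IPC the core would sustain with no misses (hypothetical)
    mpki                -- misses per 1000 instructions (hypothetical)
    miss_penalty_cycles -- cycles lost per miss (hypothetical)
    """
    compute_cycles = instructions / core_ipc
    stall_cycles = (instructions / 1000.0) * mpki * miss_penalty_cycles
    return instructions / (compute_cycles + stall_cycles)


# Same hypothetical core, three different miss rates.
for mpki in (0.0, 1.0, 5.0):
    ipc = effective_ipc(1_000_000, core_ipc=4.0, mpki=mpki, miss_penalty_cycles=200.0)
    print(f"MPKI {mpki}: average IPC {ipc:.2f}")
```

With these made-up inputs, a core that could sustain 4 instructions per clock drops below 1 once five instructions in a thousand miss all the way to memory, which is the kind of figure the counters tend to show.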
 

deasd

Senior member
Dec 31, 2013
515
736
136

JoeRambo

Golden Member
Jun 13, 2013
1,814
2,105
136
I doubt that has much of anything to do with 5 ALU and OoO configuration. This comes up in many threads; why not add more ALU units? More cylinders in your engine is better, right? There are lots of tradeoffs. Cpu designers have cycle accurate register transfer level simulators to try out different things, but one configuration isn’t going to be the best for all applications. In my experience, even compute heavy task do not reach very high actual IPC.

I think I had a few good laughs at a certain gentleman in this forum, who used to shout about the 6 ALUs that AMD/Intel need to add to match Apple, who is already 6-wide (at much lower clocks, with a much tighter memory subsystem and a different ABI).
So let's not pretend to be smarter than actual chip designers.

But I think some facts are true for all chips. Even if average IPC is 1, there are sections of code where suddenly a lot of ops become ready (maybe some blocking dependency arrived from DRAM), and then 5 ALUs and a wider machine can chew through those ready ops faster. Maybe the end difference in average IPC will be 0.97 versus 0.98, but the wider chip will still come out on top.
So armed with this information, we find that in web browser benchmarks, especially Speedometer 2.0, Apple and Alder Lake are especially strong. And I think most of that power comes from having massive OoO cores backed by 5-6 ALUs and other resources. How else would you explain a 250 vs. 325 score for 5 GHz Zen 3 vs. ADL?
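
The "ops suddenly become ready" argument can be sketched as a toy model in Python. Everything here is an assumption made for illustration: the function name, the 100-cycle stall, the 20-op burst and the issue widths are hypothetical, and real cores overlap stalls with other work. It only shows the direction of the effect, not its size.

```python
import math


def average_ipc(bursts, width):
    """Average IPC for a stream of (stall_cycles, ready_ops) bursts.

    Each burst models a wait on memory followed by ready_ops independent
    operations that drain at `width` ops per cycle (toy assumption).
    """
    total_ops = sum(ops for _, ops in bursts)
    total_cycles = sum(stall + math.ceil(ops / width) for stall, ops in bursts)
    return total_ops / total_cycles


# Hypothetical workload: 1000 bursts of 20 ready ops, each after a 100-cycle stall.
workload = [(100, 20)] * 1000

for width in (4, 5):
    print(f"{width}-wide: average IPC {average_ipc(workload, width):.4f}")
```

In this sketch the wider machine finishes each burst one cycle sooner, so its average IPC is slightly higher even though both averages sit far below 1.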

I would suspect the performance differences you see are almost entirely due to L2 cache size and other cache effects. Intel has always been good at making caches. Modern CPUs are generally more memory by area than anything else due to current limitations. There are a lot of applications that respond very well to larger L2 caches

I did a test on Zen 3 as well, and with tuned memory at 4.4 GHz it scores 7950 MIPS (vs. 6800 in the article, at 4.9 GHz I believe), so it is another confirmation that the 7zr compression algorithm scales too well with memory to be a proper measure of ALU performance. Who knows where Zen 3 or ADL peak?
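
As a quick check of the arithmetic, the scores quoted in this exchange can be normalized per clock. This is only a sketch in Python using the numbers as posted: the 4.9 GHz review clock for Zen 3 is the poster's own guess, and the review clock for the 12900K was not stated, so for Alder Lake only the raw score ratio is computed. The helper name is made up for this example.

```python
def mips_per_ghz(mips: float, clock_ghz: float) -> float:
    """Score normalized by core clock (helper name is illustrative)."""
    return mips / clock_ghz


# Zen 3: review score (clock assumed to be 4.9 GHz) vs. tuned memory at 4.4 GHz.
zen3_review = mips_per_ghz(6800, 4.9)
zen3_tuned = mips_per_ghz(7950, 4.4)
zen3_gain = zen3_tuned / zen3_review - 1

# Alder Lake: review score vs. tuned memory at a fixed 5 GHz (raw ratio only).
adl_raw_gain = 8600 / 6600 - 1

print(f"Zen 3 per-clock gain from memory tuning: {zen3_gain:.1%}")
print(f"12900K raw score gain from memory tuning: {adl_raw_gain:.1%}")
```

Both ratios land at roughly 30%, which is consistent with the claim that this benchmark tracks the memory setup at least as much as the integer core.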
 

Det0x

Golden Member
Sep 11, 2014
1,028
2,953
136
Out of curiosity i checked the ST perf in 7 ZIP, wich is representaive of INT IPC, for current CPUs, the 5950X is at 6850 MB/s and the 12900K at 6571 MB/s, so Intel s latest is not that strong in this register contrary to FP based code.

The 5800X3D is no better than the 5950X ,wich mean that ADL advantage vs 5950X for games lie in the DDR5 better bandwith.


7Zip and compression stuff in general is bad for ST comparisons as they are very sensitive to memory subsystem and various interactions with bandwidth/latency provide crazy scaling with improving memory latency or hitting some special L1 / L2 / L3 cache size threshold.
Case in point: in same 18.3 7zr as linked by You, Zen3 stuff scores ~6800 MIPS, 12900K scores 6600 MIPS, but my tuned memory 12900K scores 8600 MIPS @ fixed clock of 5Ghz.

It was the same with Skylake, i had some very good scores in Winrar on 10900K by having tuned memory and it was esp sensitive to secondary/tertiary memory timings.

Alder Lake in my opinion is integer processing monster driven by 5 ALUs and massive OoO machinery. Unfortunately it is fueled by mediocre caches and abysmal memory subsystem that can't feed it properly once workload is spilling out of L2 caches.
I think the best showcase for raw integer IPC is web benchmarks like Speedometer 2.0, Octane etc, where in my testing Alder Lake enjoys up to 30% advantage clock for clock versus Zen3 when both are running tuned memory.
How are you guys getting your MIPS numbers: running the benchmark on only 1 thread/core, or dividing the results by the number of cores/threads?
Here are numbers for my 5950X and 5800X3D.

5950X @ 4800 MHz = 253k MIPS / 32 threads = 7.9k MIPS per thread.

1651824255186.png

5800X3D @ 4450 MHz = 117k MIPS / 16 threads = 7.3k MIPS per thread.
1651824427791.png

If I ran the benchmark on only 1 core / 1 thread, I'm pretty sure I would get much higher numbers.
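
For reference, the division above can be written out as a small Python sketch, with a per-clock figure added. The totals are the rounded values from this post, and the helper name is invented for the example. Note that dividing a multi-threaded total by the thread count is not equivalent to a 1-thread run, because two SMT threads share one core's resources.

```python
def per_thread_mips(total_mips: float, threads: int) -> float:
    """Multi-threaded total divided evenly across threads (illustrative helper)."""
    return total_mips / threads


# (total MIPS, threads, clock in GHz) as quoted in the post.
runs = {
    "5950X": (253_000, 32, 4.80),
    "5800X3D": (117_000, 16, 4.45),
}

per_thread = {}
per_thread_per_ghz = {}
for name, (total, threads, ghz) in runs.items():
    per_thread[name] = per_thread_mips(total, threads)
    per_thread_per_ghz[name] = per_thread[name] / ghz
    print(f"{name}: {per_thread[name]:.0f} MIPS/thread, "
          f"{per_thread_per_ghz[name]:.0f} MIPS/thread/GHz")
```

Normalized by clock, the two chips come out within a fraction of a percent of each other in this multi-threaded view.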

*edit*
I had another version; I will rerun the 1-core/1-thread numbers on my 5800X3D with v18.3 when I get home from work later today :)
 

JoeRambo

Golden Member
Jun 13, 2013
1,814
2,105
136
How are you guys getting your MIPS numbers, running benchmark on only 1 thread/core or dividing results on numbers of cores/threads ?

The key is to use version 18.3 as in the article, as the new one gives better scores :) It shows results in MIPS too. The results come from selecting 1 CPU thread and reading the "total rating" box.
 

JoeRambo

Golden Member
Jun 13, 2013
1,814
2,105
136
Searching 18.3 on the benchmark results page doesn't return any results. Nor does it on the previous page.

We are discussing the results from reviews of X3D and 12900K


Version 18.3 is used in these benchmarks.
 

maddie

Diamond Member
Jul 18, 2010
4,738
4,667
136
It looks a bit jarring and kiddish with those bright colors. At least, he should tone down the color palette. He does want people to take him seriously, right?
It might be a strategy. Don't attract the wrong people.

I've often found that the sites with useful and detailed info and discussions have less visual appeal. The flashiest ones are trivial, with little info, and geared to those easily swayed by colorful graphics and animations. One measure I've found useful over the years to predict a fall in the quality of writing is to see "flashiness" increase on a website.
 

DrMrLordX

Lifer
Apr 27, 2000
21,610
10,804
136
It looks a bit jarring and kiddish with those bright colors.

Wellllll


Also this subject is only peripherally related to Zen4, and I'll ding myself for not being able to veer back onto the subject matter within the context of this subconversation! We're all awful people! Or something. Um, hmm. Yeah just waiting for September now folks.
 

Det0x

Golden Member
Sep 11, 2014
1,028
2,953
136
The statement is from AT, they used 7 ZIP as metric for INT based computations, beside we are talking of ST perf as a way to extract IPC without saturating the mem bandwith and being too much limited by R/W latencies.

Tuned memory is certainly a great factor for improvement but anything out of spec cant be considered as being guaranted from the manufacturer to be errors free.

I did a test on Zen3 as well, and with tuned memory @4.4ghz it is scoring 7950 MIPS ( vs 6800 as in article @4.9ghz i believe ), so it is another confirmation than 7Zr compression algorithm is scaling too well with memory to be proper measurement of ALU process. Who knows where Zen3 or ADL peak ?
9079 MIPS with a 5800X3D @ 4560 MHz and memory 1900:3800
1651849758560.png

8937 MIPS with a 5800X3D @ 4560 MHz and memory 1800:3600
1651848073511.png

8834 MIPS with a 5800X3D @ 4560 MHz and memory 1600:3200
1651847022241.png
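
The three runs above can be turned into scaling percentages with a short Python sketch. It takes the second number of each memory pair as the DDR4 transfer rate, which is an assumption about the notation used here, and the function name is made up for this example.

```python
# Memory transfer rate (assumed MT/s) -> 1-thread score in MIPS, as posted.
results = {3200: 8834, 3600: 8937, 3800: 9079}


def scaling_vs_base(scores: dict, base: int) -> dict:
    """Relative gain in memory speed and in score versus the base setting."""
    base_score = scores[base]
    return {
        speed: (speed / base - 1, score / base_score - 1)
        for speed, score in scores.items()
        if speed != base
    }


gains = scaling_vs_base(results, base=3200)
for speed, (mem_gain, score_gain) in sorted(gains.items()):
    print(f"{speed}: memory +{mem_gain:.1%}, score +{score_gain:.1%}")
```

On this 5800X3D at a fixed core clock, the score moves by under 3% across the three memory settings.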

*edit*
Cleaned up post and screenshots
 
