Question Zen 6 Speculation Thread


FlameTail

Diamond Member
Dec 15, 2021
4,384
2,754
106
So I feel that it does not make sense for AMD to keep chasing high clocking designs unless the engineers feel that it's the optimal way to extract performance at this point in time. The only reason I could think of that could make them wary of switching to lower freq, higher IPC design if it's indeed more performant is severe area penalty maybe? So severe that it would not work on server?
Here's the thing, if AMD drops their frequency to ARM levels, that's not going to suddenly give them a free double digit IPC gain. Increasing IPC is hard, and as things stand now, ARM vendors have a large IPC advantage over AMD.
 

poke01

Platinum Member
Mar 8, 2022
2,908
3,804
106
That’s because current AMD designs rely on high frequencies. I want to see AMD design an M1 clone for Zen: big L2 cache, low frequency and high IPC. It would sell so much. It would make the Steam Deck so good. I wonder if such a design is possible with x86, 'cause you have to remember ARM/Apple cores are also much wider than x86.
Here's the thing, if AMD drops their frequency to ARM levels, that's not going to suddenly give them a free double digit IPC gain. Increasing IPC is hard, and as things stand now, ARM vendors have a large IPC advantage over AMD.
 

Doug S

Platinum Member
Feb 8, 2020
2,976
5,091
136
That's for another thread but I suspect they'll be better off with a focus on clock rate for a bit - at least it worked for Apple.

Apple is coming from much lower clock rates, so it is much easier for them to meaningfully gain frequency without blowing up power too much. Further, it appears that the M3->M4 clock gain may be largely due to FinFlex allowing them to use HP cells for the P cores, which AMD is already using so they won't get any clock benefit there (their potential benefit would be in the opposite direction, using HD cells for the 'c' cores to save a bit of power)
 

gdansk

Diamond Member
Feb 8, 2011
3,640
5,714
136
Apple is coming from much lower clock rates, so it is much easier for them to meaningfully gain frequency without blowing up power too much. Further, it appears that the M3->M4 clock gain may be largely due to FinFlex allowing them to use HP cells for the P cores, which AMD is already using so they won't get any clock benefit there (their potential benefit would be in the opposite direction, using HD cells for the 'c' cores to save a bit of power)
That was about ARM.
 

DisEnchantment

Golden Member
Mar 3, 2017
1,753
6,627
136
Further, it appears that the M3->M4 clock gain may be largely due to FinFlex allowing them to use HP cells for the P cores, which AMD is already using so they won't get any clock benefit there (their potential benefit would be in the opposite direction, using HD cells for the 'c' cores to save a bit of power)
FinFlex is on N3E+, AMD don't have a product using it yet
Every AMD Zen product with TSMC uses HD cells, with device and backend optimizations mostly. You can check Hot Chips and ISSCC papers for Z2, Z3 and Z4 for that.

I am as convinced with the no clock increase with Z6 as I am with the 32% IPC Z5.

If, for instance, 6 GHz is right at the tip of the shmoo plot before the gains go flat, it would be a failure not to leverage that.
TSMC material for N3E indicated much better device performance, lower leakage, and improved switching capacitance compared with N4P, but folks are not going to leverage any of that?

My position is that RDNA clocks and Zen clocks will go up with the next node. Other designers are getting a lot of clock uplift going to N3B, so why not AMD with N3E or N3P, which is even better?
AMD's physical implementation teams seem good enough.
 

poke01

Platinum Member
Mar 8, 2022
2,908
3,804
106
AMD will likely use N3P for Zen 6 and FinFlex will be taken advantage of. Apple also never used HP cells till N3E so here’s hoping AMD does the same.
My position is that RDNA clocks and Zen clocks will go up with the next node. Other designers are getting a lot of clock uplift going to N3B, so why not AMD with N3E or N3P, which is even better?
AMD's physical implementation teams seem good enough.
 

DavidC1

Golden Member
Dec 29, 2023
1,319
2,150
96
Here's the thing, if AMD drops their frequency to ARM levels, that's not going to suddenly give them a free double digit IPC gain. Increasing IPC is hard, and as things stand now, ARM vendors have a large IPC advantage over AMD.
AMD is at 14 pipeline stages best case and 18 worst case. If they go down to 11 like X925, yes they will get double digit gains. They are infected with the subtler version of Bulldozer/Netburst ideology of chasing high clocks.... annnd Intel, their main competitor.

Gracemont has 14 stages, and it shows lower clocks benefit performance elsewhere such as caches, where the 3.9GHz L1 on GMT is lower absolute latency than 5.2GHz 12900K and 4.7GHz 3950X.
[Attached image: gmt_latency.png]
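To make the "lower clock, lower absolute latency" point concrete, here is a rough sketch that converts L1 latency from cycles to nanoseconds. The clocks are the ones quoted above; the cycle counts (3, 5 and 4) are my assumptions for illustration and are not taken from the post or the attached chart.

```python
# Convert a cache latency in core cycles to nanoseconds.
# NOTE: the cycle counts below are assumed for illustration only.

def latency_ns(cycles: int, clock_ghz: float) -> float:
    """Absolute latency in ns. 1 GHz means 1 cycle per ns, so ns = cycles / GHz."""
    return cycles / clock_ghz

# (label, assumed L1 latency in cycles, clock in GHz as quoted in the post)
cores = [
    ("Gracemont @ 3.9 GHz", 3, 3.9),
    ("12900K @ 5.2 GHz", 5, 5.2),
    ("3950X @ 4.7 GHz", 4, 4.7),
]

for label, cycles, ghz in cores:
    print(f"{label}: {cycles} cycles = {latency_ns(cycles, ghz):.2f} ns")
```

With those assumed cycle counts the lower-clocked core comes out with the lowest latency in nanoseconds, which is the shape of the argument above. Different cycle counts would of course shift the result.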


Also, those who doubt that the uop cache is basically of Trace Cache lineage, an attempt to run a high pipeline stage count without the performance cost, should look at why the E cores are at 14, Nehalem at 16*, and Sandy Bridge onwards at 14-18 (15-19 with GLC).

14 is base, then uop cache needs 2 extra cycles. When it's a miss, then you need 2 extra cycles, bringing it to 18. On a uop cache hit, it brings it back to 14. You'll notice Conroe/Merom is also 14 stages.

Intel claimed 85% hit rate with 1.5K uop cache on SNB, and ARM claimed the same recently. It's realistically 60% as C&C tests show. At 60% hit rate, Sandy Bridge's uop cache allows 19 stage clocks with Nehalem-level 16 stage performance. Hence, why Nehalem was "ehh" in many single threaded applications and the successor much better.
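The stage arithmetic above can be sketched as a simple weighted average. This is a toy model under my own assumption that the "effective" pipeline depth is just the hit-rate-weighted mean of the hit path (14 stages) and the miss path (18 stages) quoted in the post; real front ends are not this linear, and the function name is made up for the example.

```python
# Toy model: hit-rate-weighted average pipeline depth with a uop cache.
# Assumption: a uop cache hit sees 14 stages and a miss sees 18,
# using the figures quoted in the post.

def effective_stages(hit_rate: float,
                     hit_stages: int = 14,
                     miss_stages: int = 18) -> float:
    """Average pipeline depth seen by instructions for a given hit rate."""
    return hit_rate * hit_stages + (1.0 - hit_rate) * miss_stages

for rate in (0.85, 0.60):
    print(f"hit rate {rate:.0%}: ~{effective_stages(rate):.1f} effective stages")
```

Under this model a 60% hit rate gives roughly 15.6 effective stages, close to the 16 attributed to Nehalem, while the claimed 85% would give roughly 14.6.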

I think 14 stages, on pre-Nehalem and also on the E cores, is too high. They need to go back down to 12 or less.

*Nehalem introduced Turbo.
 

moinmoin

Diamond Member
Jun 1, 2017
5,151
8,251
136
Now answering the question of why the ARM and x86 vendors have arrived at such different conclusions regarding this trade-off requires someone who actually knows what he's talking about, I have no idea personally. Any guesses as to what might be going on here?
Different markets different targets.

ARM cores to this day are mobile first. That's slowly changing. Though the destination seems somewhat unclear. The mobile first approach should make them perfect for DC use as mass cores chips, but somehow this is very slow to speed up, with only AWS' in-house Graviton being in line with what could be expected.

Intel has been increasing the (P) core size a lot to enable high frequencies (to the point they stopped reporting the amount of transistors their chips use since as a result density was nothing to write home about anymore). This makes it a desktop first design. In DC they opted to double down on the area use, adding a lot of accelerators on top of that. Intel somehow completely lost sight of energy and area efficiency, looks like they only use E-cores to not stay stuck there.

AMD since its CPU revival has been server first, mass cores chips for DC first. The interesting situation here has been that this demands energy- and area-efficient cores, while the core also being used on desktop required the ability to run high frequencies on top of that. This balance they managed really well so far, having unrivalled mass core chips in DC and being at least competitive in both desktop and mobile.
 

Nothingness

Diamond Member
Jul 3, 2013
3,165
2,199
136
And Nothingness is on record saying that another 10% would be a good result which is in line with the roadmap.
Heh, I might regret having written that 😀

It's what I hope to see based on my past experience of a good redesign (which Zen5 is) and subsequent iterations.

That's also based on my belief (so not fully sustained by evidence) that everyone is slowly converging to similar performance levels. So unless the starting point is really poor, I don't expect huge increases from one generation to the next (on average). Again, this is a belief 😉
 

Nothingness

Diamond Member
Jul 3, 2013
3,165
2,199
136
Did they? Except for Apple, which only sells consumer hardware, aren't all other arm vendors at similar PPC* and area levels compared to Zen (C) cores?

*At least on average, considering both int and fp ppc
I’m not sure it makes sense to average INT and FP scores in general. Design choices are very different if you want to target high FP performance.

That being said, if we look at PPC of specint 2017, Arm PPC is significantly above AMD, but along with significantly lower clocks.

There’s a nice sheet on David Huang blog that summarizes that for many chips.
 

MS_AT

Senior member
Jul 15, 2024
406
898
96
I’m not sure it makes sense to average INT and FP scores in general. Design choices are very different if you want to target high FP performance.

That being said, if we look at PPC of specint 2017, Arm PPC is significantly above AMD, but along with significantly lower clocks.

There’s a nice sheet on David Huang blog that summarizes that for many chips.
Informative, slightly off topic but I wasn't aware of this:
Regarding the performance of macOS: Due to differences in operating environments (especially macOS libc/malloc), the performance of various processors including x86_64/ARM64 running 523.xalancbmk under macOS has a significant advantage over the default configuration of Linux/glibc, while the other sub-items trade wins. In the end, the total score on macOS will be about 3%-4% ahead of Linux.
For some reason I thought that SPEC would have its own libc/glibc equivalent just to provide a level playing field, but maybe that would be a nightmare to maintain for relatively little influence ;)
 

Nothingness

Diamond Member
Jul 3, 2013
3,165
2,199
136
Informative, slightly off topic but I wasn't aware of this:

For some reason I thought that SPEC would have its own libc/glibc equivalent just to provide a level playing field, but maybe that would be a nightmare to maintain for relatively little influence ;)
Even though SPEC tries to abstract as much as possible score results from platform specific things, there's not much you can do about some things. The effect of malloc implementation is notorious and that's why many official results use jemalloc instead of the system default libc allocation implementation.
 

Doug S

Platinum Member
Feb 8, 2020
2,976
5,091
136
Even though SPEC tries to abstract as much as possible score results from platform specific things, there's not much you can do about some things. The effect of malloc implementation is notorious and that's why many official results use jemalloc instead of the system default libc allocation implementation.

SPEC doesn't try to abstract platform specific things, that's the whole point. It has always been clear that it is a SYSTEM benchmark, not a CPU benchmark.

That's true for stuff like Geekbench, Cinebench, and everything else too, even though people want to pretend it isn't. They'll cherry pick results using some really fast RAM timings and infer that represents the performance of CPU x, even if almost no systems will use such expensive or aggressively timed DRAM.

The use of special malloc libraries in SPEC results is particularly annoying to me. Either the system malloc implementation is slow, if so fix the damn thing, or the replacement is fragile/limited and is really suitable only for SPEC runs in which case its use should be banned IMHO. Long ago I was involved with some software that supported 1000+ users on an entry level workstation. Its memory was maxed out, so reducing the size of each process was really important, and I went to some extraordinary lengths to make that happen. Not replacing malloc, but eliminating its use entirely. I replaced a few libc functions because linking them caused a ton of other stuff to be linked in and ballooned the data size of the process. That meant replacing printf/sprintf with a cut-down version that could only do the things that were needed by this particular software. I can't help but wonder if these SPEC special mallocs are similar to my printf.
 

Nothingness

Diamond Member
Jul 3, 2013
3,165
2,199
136
SPEC doesn't try to abstract platform specific things, that's the whole point. It has always been clear that it is a SYSTEM benchmark, not a CPU benchmark.
I meant SPEC can't come with its own libraries and even less rely on external libraries beyond the standard C/C++/FORTRAN (and OpenMP IIRC), as much as it can't use CPU specific intrinsics or assembly language inlines/routines.

I see it as a CPU/compiler benchmark with some sensitivity to memory characteristics.

The use of special malloc libraries in SPEC results is particularly annoying to me. Either the system malloc implementation is slow, if so fix the damn thing, or the replacement is fragile/limited and is really suitable only for SPEC runs in which case its use should be banned IMHO. Long ago I was involved with some software that supported 1000+ users on an entry level workstation. Its memory was maxed out, so reducing the size of each process was really important, and I went to some extraordinary lengths to make that happen. Not replacing malloc, but eliminating its use entirely. I replaced a few libc functions because linking them caused a ton of other stuff to be linked in and ballooned the data size of the process. That meant replacing printf/sprintf with a cut-down version that could only do the things that were needed by this particular software. I can't help but wonder if these SPEC special mallocs are similar to my printf.
jemalloc is better at memory fragmentation management and better too for multithreading. It seems to be used by default on FreeBSD and Firefox. All this from the jemalloc page, I never studied it thoroughly.

Having worked on embedded systems, I feel the pain you had with your app. printf is particularly nasty given the range of features it uses (FP printing is quite complex for instance). And don't get me started on C++ libraries...
 

Doug S

Platinum Member
Feb 8, 2020
2,976
5,091
136
Having worked on embedded systems, I feel the pain you had with your app. printf is particularly nasty given the range of features it uses (FP printing is quite complex for instance). And don't get me started on C++ libraries...

I actually got the printf code I used as a starting point from some freeware tiny libc intended for embedded systems, which had cut out unnecessary stuff like floating point and hex/octal support.
 

Gideon

Golden Member
Nov 27, 2007
1,908
4,612
136
I am as convinced with the no clock increase with Z6 as I am with the 32% IPC Z5.
The clocks will absolutely rise. Even if first and foremost for MT workloads and mobile chips. I also believe slightly on desktop, but we'll see.
 

adroc_thurston

Diamond Member
Jul 2, 2023
4,390
6,121
96
The clocks will absolutely rise. Even if first and foremost for MT workloads and mobile chips. I also believe slightly on desktop, but we'll see.
note that Vmax is going down on future nodes.
But Z5 parts dialed it down anyway.
 

FlameTail

Diamond Member
Dec 15, 2021
4,384
2,754
106
ZEN 6 Client (with RDNA5)
Ryzen AI 400 series

All Medusa parts use 12-core CCDs. The difference is the IOD, of which 3 unique ones exist, one each for Ridge/Halo/Point.

MEDUSA RIDGE (Desktop)
24C/4CU
20C/4CU
16C/4CU
12C/4CU
8C/4CU

192bit/LPDDR6-10667 LPCAMM
100 TOPS NPU

MEDUSA POINT
12C/24CU
10C/20CU
8C/16CU
6C/12CU

192bit/LPDDR6-10667 LPCAMM/Soldered
100 TOPS NPU
4 LP cores

MEDUSA HALO
24C/72CU
20C/60CU
16C/48CU

384bit/LPDDR6-10667 On-package
200 TOPS NPU
4 LP cores
___

*above is speculation
 

soresu

Diamond Member
Dec 19, 2014
3,444
2,734
136
MEDUSA POINT
12C/24CU

MEDUSA HALO
24C/72CU
No earthly way AMD will increase their APU CU count by 1.5x and 1.8x just one generation after increasing it 1.33x on the base APU and creating the big APU SKU.
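For reference, the scaling factors quoted above can be reproduced with a few divisions. The speculated Medusa numbers come from the list being quoted; the baseline CU counts (12 and 16 for the two previous base APUs, 40 for Strix Halo) are my assumptions from public spec sheets and are not stated in the thread.

```python
# Generation-over-generation CU scaling.
# Baselines (12, 16, 40 CU) are assumed, not taken from the thread.

def scale(new: int, old: int) -> float:
    """Ratio of new CU count to old CU count."""
    return new / old

steps = [
    ("previous base APU step (12 -> 16 CU)", 16, 12),
    ("Medusa Point, speculated (16 -> 24 CU)", 24, 16),
    ("Medusa Halo, speculated (40 -> 72 CU)", 72, 40),
]

for label, new, old in steps:
    print(f"{label}: {scale(new, old):.2f}x")
```

Under those baselines the three ratios come out at about 1.33x, 1.5x and 1.8x, matching the figures in the post.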

Not unless RDNA5 completely changes what a CU is to the point that the previous specs are meaningless as a guiderule.

72 CU would wipe out the point of all RDNA4 SKUs when they are still barely on the market.

IMHO RDNA5 won't have a mid range for quite a while after release to get some mileage out of RDNA4 before they retire it.

Likewise with the Halo SKU having 1.5x the core count of the previous gen.

Strix Halo having 16C sounds like a lot, but desktop mainstream had that since 3950X.

I doubt the APU side will increase CPU core counts over Strix Halo for quite a while.

On the other hand it's also been quite a while since 3950X, and 32C doesn't seem like a stretch for desktop mainstream/high-end in 2026+.
 

Tuna-Fish

Golden Member
Mar 4, 2011
1,545
2,190
136
72 CU would wipe out the point of all RDNA4 SKUs when they are still barely on the market.
That doesn't matter. Halo APUs are not available on the desktop, they only potentially compete on large laptops, and for AMD a large APU is preferable to a discrete part because it's an angle where AMD can compete against NV in an asymmetric way. If AMD has a customer willing to pay for such a part, they will make it, and what other GPU parts they have in their lineup is irrelevant.

The bigger issue I have with such a large CU count is memory. What memory interface would keep that fed? Would they go for soldered LPDDR6 with a very wide interface or what?

Strix Halo having 16C sounds like a lot, but desktop mainstream had that since 3950X.
Dragon Range was there first.

I doubt the APU side will increase CPU core counts over Strix Halo for quite a while.

On the other hand it's also been quite a while since 3950X, and 32C doesn't seem like a stretch for desktop mainstream/high-end in 2026+.
The Halo parts are supposed to use similar core chiplets as the desktop, so their core counts track the desktop products. If there are more cores on desktop, there are more cores on Medusa Halo.
 

FlameTail

Diamond Member
Dec 15, 2021
4,384
2,754
106
The bigger issue I have with such a large CU count is memory. What memory interface would keep that fed? Would they go for soldered LPDDR6 with a very wide interface?
As I mentioned, 384 bits of LPDDR6-10667 will be the dream for Medusa Halo. ~450 GB/s of bandwidth (66% higher than Strix Halo's 273 GB/s).

But now I am not entirely sure if Medusa would use LPDDR6. The timing and supply would be contentious.
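The bandwidth figures above can be sanity-checked with the usual bus width times transfer rate formula. This is only a sketch: the 256-bit, 8533 MT/s configuration used for the 273 GB/s Strix Halo figure and the 8/9 payload efficiency used to get from the raw LPDDR6 number to roughly 450 GB/s are my assumptions, not something stated in the post.

```python
# Peak DRAM bandwidth = (bus width in bytes) * (mega-transfers per second).
# The Strix Halo configuration and the 8/9 efficiency factor are assumptions.

def peak_bandwidth_gbs(bus_bits: int, mtps: int, efficiency: float = 1.0) -> float:
    """Peak bandwidth in GB/s (decimal), optionally scaled by a payload efficiency."""
    return bus_bits / 8 * mtps * efficiency / 1000

raw_lpddr6 = peak_bandwidth_gbs(384, 10667)          # raw 384-bit LPDDR6-10667
eff_lpddr6 = peak_bandwidth_gbs(384, 10667, 8 / 9)   # assumed 8/9 usable payload
strix_halo = peak_bandwidth_gbs(256, 8533)           # assumed 256-bit at 8533 MT/s

print(f"384-bit LPDDR6-10667, raw:       {raw_lpddr6:.0f} GB/s")
print(f"384-bit LPDDR6-10667, 8/9 usable: {eff_lpddr6:.0f} GB/s")
print(f"256-bit at 8533 MT/s:             {strix_halo:.0f} GB/s")
print(f"uplift: {eff_lpddr6 / strix_halo - 1:.0%}")
```

The raw formula gives about 512 GB/s for 384 bits at 10667 MT/s. The ~450 GB/s and 66% figures only fall out if some overhead like the assumed 8/9 factor is applied.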