Discussion Zen 5 Speculation (EPYC Turin and Strix Point/Granite Ridge - Ryzen 9000)


soresu

Diamond Member
Dec 19, 2014
3,230
2,515
136
how do you plan on cooling this monstrosity Tim?
Quantum wells, it's the answer to everything! 🤘

 

Ajay

Lifer
Jan 8, 2001
16,094
8,109
136
IPC has plenty to do with clockspeed. IPC will be higher the slower you clock, because DRAM is fewer cycles away. A design that targets a higher clock speed also has to increase cache latency (in terms of clock cycles) at every level; i.e. an L1 able to work at 1 cycle latency at clock x will require 2 cycles latency at clock 2x.

If they increase IPC by 19% it is very unlikely they will be able to maintain the same clock speed. I'm extremely skeptical of any claims that IPC can be increased by that much and clock rates can be increased as well. Sure, they are getting some "free" clock increase due to process, but there's less and less of that available with each process generation.
I'm a bit confused here, Doug. IPC is literally Instructions Per Clock. I used to do calculations on this when working in firmware development, looking at how long a given instruction took to execute. We had to stick with C/C++ code for portability, but I had no problem tweaking the code to get the compiler to use slightly faster instructions.

What we really have here, and this debate raged on ATF for a while, is Performance Per Clock - which is really an aggregate based on the execution of a large instruction stream (from whatever benchmark is being used). Ultimately, all I, and I would think most people, care about is the actual performance delta between a Ryzen 6000 series and an 8000 series APU. +20% is pretty good gen-to-gen nowadays.
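Just to make the "aggregate over a large instruction stream" point concrete, here's a trivial back-of-the-envelope sketch (every number below is made up, not a measurement of anything):

Code:
# Hypothetical counter totals from one long benchmark run (not real measurements).
instructions = 1_250_000_000_000   # instructions retired over the whole run
cycles       =   500_000_000_000   # core cycles over the same run
freq_ghz     = 5.0                 # average clock during the run

ipc = instructions / cycles              # aggregate per-clock figure, not a per-instruction property
runtime_s = cycles / (freq_ghz * 1e9)    # wall time follows from cycles and clock
print(f"aggregate IPC ~ {ipc:.2f}, runtime ~ {runtime_s:.0f} s")

# Gen-to-gen: +20% per-clock throughput at the same clock is just a ~17% runtime cut.
new_ipc = ipc * 1.20
print(f"new runtime ~ {instructions / (new_ipc * freq_ghz * 1e9):.1f} s")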
 

MrTeal

Diamond Member
Dec 7, 2003
3,614
1,816
136
Zen 5's arch is supposed to be the Zen 1 type of clean sheet performance and efficiency overhaul. Why was Mike Clark so excited about it if it's just 19% improved over Zen 4? Why was he so anxious to want to "buy" it? Something doesn't compute.
Even if Zen 5 is a clean sheet, Zen 4 is pretty good. There's no reason to believe we'll have a Zen 1 moment again if for no other reason than we're going to be comparing to something that isn't 15h.
 

A///

Diamond Member
Feb 24, 2017
4,351
3,158
136
That's a huge jump. It's the same jump that occurred from Zen 2 -> Zen 3.
I think people are too locked into the weird geomean AMD pushed with the Zen 4 preview last August. The performance leap should be interesting, but the price will be higher than what we've seen. The economy should be better by then, hopefully, but who knows.
 

Geddagod

Golden Member
Dec 28, 2021
1,296
1,368
106
Zen 5's arch is supposed to be the Zen 1 type of clean sheet performance and efficiency overhaul. Why was Mike Clark so excited about it if it's just 19% improved over Zen 4? Why was he so anxious to want to "buy" it? Something doesn't compute.
The words used to describe the new architecture for Zen 5 were the exact same words used to describe the new architecture for Zen 3; in both cases AMD said "ground-up".
 

DisEnchantment

Golden Member
Mar 3, 2017
1,747
6,598
136
It's not only about the improvements directly achieved but also about the new technologies introduced (which can then be refined) and the future improvements enabled by the changes (the usual pattern for the even Zen gens).

Also, the excitement may be not only about the Zen cores but also about the package layout, with the CCDs-plus-one-IOD arrangement that with Zen 4 was still essentially unchanged since Zen 2.
I expect some interesting things from Zen 5, considering AMD has not been developing cores on a shoestring budget for a couple of years now.
Zen 3 was developed pretty much during the years of austerity at AMD, Zen 4 slightly less so, and Zen 5 should see the first fruits of R&D done in better days.
But more interesting for me is indeed the packaging and SoC architecture. MI300 is almost here (next week?) to give us a glimpse of next-gen packaging.

Curious to see whether InFO-R will replace the substrate-based PHY for 2.5D packaging on the Zen 5 family. Bergamo seems to have demonstrated the limits of routing with the substrate-based interconnects, and a likely way forward is fanout-based RDLs at a minimum, if not active bridges.
Besides the issue of there being practically no more space for traces coming out from the IOD to the CCDs, there is also the problem of the next-gen IF, which, per an employee's LinkedIn, can hit up to 64 Gbps compared to the current 36 Gbps.

I think InFO-3D could be a wildcard to enable lower-cost 3D packaging. InFO-3D fits nicely here, offering lower interconnect density than, say, FE packaging like SoIC, but dense enough for SoC-level interconnects when stacking on top of the IOD. There is a big concern at the moment with F15 and F14 being underutilized, and TSMC is pushing customers from 16FF and older to the N7 family while ramping down those fabs (commodity process nodes, you might say). Having any customer generously making use of N7/6 besides the leading node would be a win-win.

Regarding the core perf gains, they have more transistors and a more efficient process to work with, so at the very least just throwing more transistors at the problem should bring decent gains if their ~6 years (2018-2023) of "ground-up design" of Zen 5 is to be worthwhile. Zen 4 is behind its key contemporaries in capacity for almost all key resources of a typical OoO machine. Pretty good (if not surprising given other factors) that it even keeps up.

Nevertheless, a few AMD patents regarding core architecture that I have been reading strike me as intriguing, and I wonder if they will make it into Zen 5 in some form.
Not coincidentally, all of these patents are about increasing resources without drastically increasing transistor usage.
  • Dual fetch/Decode and op-cache pipelines.
    • This seems like something that would be very interesting for mobile: power gate the second pipeline during less demanding loads
    • Remove the secondary decode pipeline for a Zen 5c variant? Let's say 2x 4-wide decode for Zen 5 and 4-wide for Zen 5c
  • Retire queue compression
  • op-cache compression
  • Cache compression
  • Master-Shadow PRF
 

Doug S

Platinum Member
Feb 8, 2020
2,784
4,746
136
I'm a bit confused here, Doug. IPC is literally Instructions Per Clock. I used to do calculations on this when working in firmware development, looking at how long a given instruction took to execute. We had to stick with C/C++ code for portability, but I had no problem tweaking the code to get the compiler to use slightly faster instructions.

What we really have here, and this debate raged on ATF for a while, is Performance Per Clock - which is really an aggregate based on the execution of a large instruction stream (from whatever benchmark is being used). Ultimately, all I, and I would think most people, care about is the actual performance delta between a Ryzen 6000 series and an 8000 series APU. +20% is pretty good gen-to-gen nowadays.


Instructions per clock can't be calculated by "looking at how long a given instruction takes to execute", at least not since the days of the 6502 (I remember doing what you are talking about, programming an Atari 800's 6502 when I was in junior high). That sort of cycle counting is fine for something like 'MOV R2,0' (assuming you want to deal with figuring out how many instructions can issue and retire in a cycle, which gets more and more complicated the wider CPUs get), but you can't do it for everything.

The reason (as I'm sure you are aware) is that the pipeline will stall on some instructions. For example, when a load cannot be satisfied from cache/DRAM in time. When that happens you are getting 0 instructions for however many cycles that delay lasts. The higher your clock rate, the more often the pipeline will stall - and for more cycles - thus executing fewer instructions per cycle on average.
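A rough back-of-the-envelope of that clock/IPC interaction (all numbers are made up for illustration, not modelled on any real core):

Code:
# Illustrative model: a fixed miss latency in nanoseconds costs more cycles at a higher clock.
dram_ns   = 80.0    # assumed effective DRAM latency the core cannot hide
miss_rate = 0.002   # assumed stalls per instruction that go all the way to DRAM
base_ipc  = 4.0     # assumed IPC if memory were free

for ghz in (4.0, 5.0, 6.0):
    stall_cycles_per_instr = miss_rate * dram_ns * ghz      # same ns, more cycles at higher clock
    cycles_per_instr = 1.0 / base_ipc + stall_cycles_per_instr
    ipc = 1.0 / cycles_per_instr
    perf = ipc * ghz                                         # instructions per nanosecond
    print(f"{ghz:.1f} GHz: IPC ~ {ipc:.2f}, throughput ~ {perf:.2f} inst/ns")

IPC falls as the clock rises even though nothing about the core changed, and the delivered throughput grows more slowly than the clock.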

Now, 'performance per clock' - sure, that's a more useful figure than IPC, though you have to decide what "performance" means. Is it Geekbench 6? Is it SPEC? Is it CBR23? What compiler are you using, with what settings? IPC is talked about often because it is far easier to measure. It may not be free of the issues "PPC" has, but at least most people can agree on the accuracy of the measurement, since you use the CPU's performance counters to do it.
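If anyone wants to try it themselves, here's roughly what I mean by using the counters - a minimal Linux-only sketch wrapping perf stat (the "./workload" path is just a placeholder, and the output parsing is simplified; on some systems the events are reported under different names):

Code:
import subprocess

# Count retired instructions and core cycles for a workload, then derive IPC.
cmd = ["perf", "stat", "-e", "instructions,cycles", "./workload"]
result = subprocess.run(cmd, capture_output=True, text=True)

counts = {}
for line in result.stderr.splitlines():      # perf stat prints its summary to stderr
    tokens = line.split()
    if len(tokens) >= 2 and tokens[1] in ("instructions", "cycles"):
        counts[tokens[1]] = int(tokens[0].replace(",", ""))

if "instructions" in counts and "cycles" in counts:
    print(f"IPC ~ {counts['instructions'] / counts['cycles']:.2f}")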

Since in this case we are hearing a claim that "IPC is increased by 19%", we can't talk about "PPC", and we have to take into account the effect of clock rate on it (and while IPC is kinda sorta sensitive to what code is being run and the compiler used, that's a fairly small effect unless you go out of your way to create a bad test, which we will assume AMD is not going to do). If instead we heard a claim that "performance increases by 19%" (i.e. what you will hear Apple talk about when they introduce a new Mac or whatever) then they are talking about "PPC", but generally you aren't going to know what they used to measure it unless they give you a nice graph saying they used SPECint2017 or whatever.

I remember when Apple announced A9 and claimed a 70% performance increase everyone thought that was crazy and they were cherry picking some corner case, then sure enough Geekbench showed a ~70% increase in ST thanks to the combination of IPC improvement along with a massive increase in clock rate. Now maybe they weren't using Geekbench specifically but it was interesting how that lined up so well with their claim in that instance. Obviously Apple was working from a much lower bar back then, everyone is subject to the law of diminishing returns after all.
 

A///

Diamond Member
Feb 24, 2017
4,351
3,158
136
I remember when Apple announced A9 and claimed a 70% performance increase everyone thought that was crazy and they were cherry picking some corner case, then sure enough Geekbench showed a ~70% increase in ST thanks to the combination of IPC improvement along with a massive increase in clock rate. Now maybe they weren't using Geekbench specifically but it was interesting how that lined up so well with their claim in that instance. Obviously Apple was working from a much lower bar back then, everyone is subject to the law of diminishing returns after all.
I recall this. I remember saying something like this on a now-defunct blog, but I was toasted in the comment replies. I had the same outlook when Ryzen launched; people presumed AMD were talking out of their ass when they claimed Ryzen would have a giant leap in performance over dozer. I don't remember the exact figure or whether it was overall performance or not. Sure enough, they met that goal. Though that being recent and thus fresh in memory has led to some of the wild claims about Zen 5 that have been circling like flies to a pile of horse crap. Doesn't help when you have morons like MLID talking out of their behinds. Or the chunky boy with greasy curly hair. It's difficult to say what Zen 5 will be like or what Arrow Lake will be like. I can take my best guess and post it here, but my words are as valid as the BS spewed by leakers.

6502... I knew you were older than I am, but I wasn't expecting that - or you're not as old and just had access to those earlier computers. I remember posting on here a while back about how much I disliked computers and technology in the late '70s and '80s until I learned to love it. Like being fed mushy peas as a child and not liking them. Still don't like them. Now, mushy peas made from frozen sweet peas are delightful, but not the authentic stuff. That is rank.
 
Jul 27, 2020
20,040
13,738
146
Has anyone here postulated that Zen 5 being on N3 and N4 could mean that the single CCD SKUs may use N4 and the dual CCD ones may use N3? It's also possible that the E-core CCD may use N3 for minimal energy usage while the P-core CCD will benefit from the maturity of the N4 node family?
 

Timorous

Golden Member
Oct 27, 2008
1,748
3,240
136
Has anyone here postulated that Zen 5 being on N3 and N4 could mean that the single CCD SKUs may use N4 and the dual CCD ones may use N3? It's also possible that the E-core CCD may use N3 for minimal energy usage while the P-core CCD will benefit from the maturity of the N4 node family?

If there is a node split, I expect it is more likely to be APUs vs CCDs.
 

yuri69

Senior member
Jul 16, 2013
541
975
136
IPC has plenty to do with clockspeed. IPC will be higher the slower you clock, because DRAM is fewer cycles away. A design that targets a higher clock speed also has to increase cache latency (in terms of clock cycles) at every level; i.e. an L1 able to work at 1 cycle latency at clock x will require 2 cycles latency at clock 2x.

If they increase IPC by 19% it is very unlikely they will be able to maintain the same clock speed. I'm extremely skeptical of any claims that IPC can be increased by that much and clock rates can be increased as well. Sure, they are getting some "free" clock increase due to process, but there's less and less of that available with each process generation.
Golden Cove achieved a 19% IPC increase & clocked 0.2GHz higher than Cypress Cove.
Zen 3 achieved a 19% IPC increase & clocked 0.2GHz higher than Zen 2.

What is the catch?
 
  • Like
Reactions: Tlh97 and coercitiv

Geddagod

Golden Member
Dec 28, 2021
1,296
1,368
106
Golden Cove achieved a 19% IPC increase & clocked 0.2GHz higher than Cypress Cove.
Zen 3 achieved a 19% IPC increase & clocked 0.2GHz higher than Zen 2.

What is the catch?
Frequency iso power. You might be able to hit a higher peak ST max, but usually larger architectures take more power to reach the same frequencies as the previous architecture, which is a much more important limiting factor in MT.
For example, CML clocked ~10% higher iso power vs RKL (10400 vs 11400 @ 65 watts).
IIRC Zen 3 was impressive in that it clocked similarly or maybe even slightly higher than Zen 2 iso power.
 
  • Like
Reactions: Tlh97 and moinmoin

Mopetar

Diamond Member
Jan 31, 2011
8,113
6,768
136
Instructions per clock can't be calculated by "looking at how long a given instruction takes to execute", at least not since the days of the 6502 (I remember doing what you are talking about, programming an Atari 800's 6502 when I was in junior high). That sort of cycle counting is fine for something like 'MOV R2,0' (assuming you want to deal with figuring out how many instructions can issue and retire in a cycle, which gets more and more complicated the wider CPUs get), but you can't do it for everything.

You can always find the ideal CPI for any instruction just by counting the number of cycles it would take to execute. That may or may not be particularly useful, but I'm not aware of any instruction that takes a variable number of cycles to execute given a perfect cache.

Obviously loads and stores can take longer to execute if they miss the cache, but the access times for the different levels of the memory system (L1, L2, RAM, etc.) are also known quantities, though at least with accesses to main memory you could introduce additional variability by saturating the memory controller with requests and forcing it to stall the pipeline for that reason, or even going a step further and intentionally generating page faults so it has to go to disk.

Theoretical IPC is still something you could calculate, but it's not all that meaningful to an end-user. The engineers might want to know what it is though as it would help inform them of where they should spend more of their time or what alleviating some bottleneck would afford in terms of potential performance gains. Software developers may also benefit if they really want to be able to extract maximum performance by hand-tuning program code, though almost no one bothers doing this since compilers tend to do a better job at it and it's incredibly time consuming.
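Something like the textbook estimate, for anyone curious - base CPI for a perfect cache plus memory stall cycles (every rate and penalty below is invented purely for illustration):

Code:
# Textbook-style CPI estimate: ideal CPI plus average memory stall cycles per instruction.
base_cpi = 0.25                 # ideal CPI, i.e. 4 instructions per clock sustained

mem_per_instr     = 0.30        # fraction of instructions that are loads/stores
l1_miss_rate      = 0.05        # misses per memory access
l2_miss_rate      = 0.20        # fraction of L1 misses that also miss L2
l2_penalty_cycles = 14          # extra cycles for an L1 miss serviced by L2
dram_penalty      = 300         # extra cycles for a miss that goes to DRAM

stalls = mem_per_instr * l1_miss_rate * (
    (1 - l2_miss_rate) * l2_penalty_cycles + l2_miss_rate * dram_penalty
)
cpi = base_cpi + stalls
print(f"estimated CPI ~ {cpi:.2f}, IPC ~ {1 / cpi:.2f}")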
 

Doug S

Platinum Member
Feb 8, 2020
2,784
4,746
136
You can always find the ideal CPI for any instruction just by counting the number of cycles it would take to execute. That may or may not be particularly useful, but I'm not aware of any instruction that takes a variable number of cycles to execute given a perfect cache.

The 6502 took a variable number of cycles to execute instructions. Take an instruction like 'AND $10C0,X', which would execute a logical AND on the accumulator using the byte found at address $10C0 + the X register. If the value of the X register was $40 or higher, that instruction took an additional cycle to execute due to the crossing of a 256-byte page boundary. There were some more complicated instructions that could take two additional cycles. Since the cycle timing depended on the value of a register you generally wouldn't know in advance, calculating exact cycle timing was impossible (there were ways around this involving self-modifying code if you REALLY needed cycle-accurate timing).
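For anyone who never touched a 6502, here's that quirk as a tiny model (the 4-cycle base for absolute,X addressing is from memory, so treat it as illustrative):

Code:
# Cycle count for a 6502 'AND abs,X'-style instruction: 4 cycles base (from memory,
# treat as illustrative), plus 1 if base + X crosses a 256-byte page boundary.
def and_abs_x_cycles(base_addr: int, x: int) -> int:
    effective = (base_addr + x) & 0xFFFF
    page_crossed = (base_addr & 0xFF00) != (effective & 0xFF00)
    return 4 + (1 if page_crossed else 0)

print(and_abs_x_cycles(0x10C0, 0x3F))  # 4 cycles: $10FF stays in page $10xx
print(and_abs_x_cycles(0x10C0, 0x40))  # 5 cycles: $1100 crosses into page $11xx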

Thankfully I left my assembler programming behind with the 6502, so I couldn't say whether current CPUs like Intel/AMD's x86 or Apple/ARM's AArch64 have any instructions with variable timing, but I wouldn't be surprised - I'd look at instructions doing stuff like multiplication and division first if I were trying to find such.
 
  • Like
Reactions: Mopetar

naukkis

Senior member
Jun 5, 2002
903
786
136
Thankfully I left my assembler programming behind with the 6502, so I couldn't say whether current CPUs like Intel/AMD's x86 or Apple/ARM's AArch64 have any instructions with variable timing, but I wouldn't be surprised - I'd look at instructions doing stuff like multiplication and division first if I were trying to find such.

Every modern CPU's instruction timing is wildly variable. Your 6502 example had memory running at zero latency, with an additional one-cycle latency for more-than-8-bit addressing. CPUs now usually have 3-level caches, separately for both instructions and data; memory is divided into many differently timed pages; and they usually operate on translated virtual memory, meaning that memory access speed at every cache level also varies depending on translation cache hits and misses with page walks - the result being that every instruction's execution can vary from one cycle to a thousand cycles. And CPUs can reorder instructions to hide that, with up to a thousand-instruction window. So studying how code executes needs special tools to diagnose it - and Intel, for example, provides great tools for that.
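Purely as an illustration of that range - ballpark figures, not any specific CPU:

Code:
# Illustrative only: how far a single load's latency can swing depending on where it hits.
# Cycle counts are rough ballpark figures for a modern core, not measurements of any real CPU.
load_latency_cycles = {
    "L1 hit":                        4,
    "L2 hit":                       14,
    "L3 hit":                       50,
    "DRAM":                        300,
    "DRAM + TLB miss (page walk)": 300 + 4 * 50,   # the walk itself can miss in the caches
}
for case, cycles in load_latency_cycles.items():
    print(f"{case:<30} ~{cycles} cycles")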
 
  • Like
Reactions: Tlh97 and moinmoin

Doug S

Platinum Member
Feb 8, 2020
2,784
4,746
136
Every modern CPU's instruction timing is wildly variable. Your 6502 example had memory running at zero latency, with an additional one-cycle latency for more-than-8-bit addressing. CPUs now usually have 3-level caches, separately for both instructions and data; memory is divided into many differently timed pages; and they usually operate on translated virtual memory, meaning that memory access speed at every cache level also varies depending on translation cache hits and misses with page walks - the result being that every instruction's execution can vary from one cycle to a thousand cycles. And CPUs can reorder instructions to hide that, with up to a thousand-instruction window. So studying how code executes needs special tools to diagnose it - and Intel, for example, provides great tools for that.

I was talking in the context of Mopetar's "perfect cache", i.e. some specialized thing for an embedded system where you could fix something in L1, avoid TLB misses, etc.
 

Joe NYC

Platinum Member
Jun 26, 2021
2,535
3,466
106
Has anyone here postulated that Zen 5 being on N3 and N4 could mean that the single CCD SKUs may use N4 and the dual CCD ones may use N3? It's also possible that the E-core CCD may use N3 for minimal energy usage while the P-core CCD will benefit from the maturity of the N4 node family?
I think that for the Zen 5 generation, AMD will have native 16-core chiplets, not dual CCDs. On N3.

And N4 for "classic" 8-core chiplets.