Discussion Zen 5 Speculation (EPYC Turin and Strix Point/Granite Ridge - Ryzen 9000)


HurleyBird

Platinum Member
Apr 22, 2003
2,690
1,278
136
Zen 5 has to redo a lot of the CCX layout, like Zen 3 did. Their Zen 5 lineup reportedly includes a 256c server SKU; that's why.

Bergamo is probably done the easy way, with two CCXes within a CCD. Still, this requires 8 of them. Reaching 256c would require 16 dies, each with 2 CCXes, leading to 32 separate CCXes. Merging CCXes into a 16c CCX sounds cool, but surely requires proper engineering.

My guess for Zen 5 is that, like MI300, there will be no non-3D version, except for perhaps some budget SOC. Cache on one die, cores on the other, and IF on whatever ends up as the base die (probably cache). This changes the layout and routing game entirely from 2D land.
 

naad

Member
May 31, 2022
63
176
66
Lack of R&D probably led to ignoring the 4S/8S server market Intel had on lockdown; if AMD gets to 50% server market share, they'll probably start targeting that as well.
Guess that would require a faster socket-to-socket interconnect?
 

Exist50

Platinum Member
Aug 18, 2016
2,445
3,043
136
Lack of R&D probably led to ignoring the 4S/8S server market Intel had on lockdown; if AMD gets to 50% server market share, they'll probably start targeting that as well.
Guess that would require a faster socket-to-socket interconnect?
Nah, it's not worth it. Or at the very least not 8S. With 12+ memory channels and 100+ cores per socket, there's no room (physically or metaphorically) for them going forward.
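For scale, the bandwidth side of that argument can be sketched with simple arithmetic. The channel counts and transfer rates below are illustrative picks (12-channel DDR5-4800 for a current-generation socket, 8-channel DDR4-3200 for the previous one), not figures from the post:

```python
# Rough peak-bandwidth estimate for a multi-channel memory socket.
# DDR5-4800: 4800 MT/s, 8 bytes per transfer per channel.
def peak_bw_gbs(channels: int, mts: int, bytes_per_transfer: int = 8) -> float:
    """Theoretical peak memory bandwidth in GB/s."""
    return channels * mts * 1e6 * bytes_per_transfer / 1e9

print(peak_bw_gbs(12, 4800))  # 12-channel DDR5-4800 -> 460.8 GB/s
print(peak_bw_gbs(8, 3200))   # 8-channel DDR4-3200  -> 204.8 GB/s
```

A single modern socket already delivers more than double the bandwidth of a previous-generation socket, which is part of why stacking 4 or 8 of the old ones together loses its appeal.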
 

Joe NYC

Platinum Member
Jun 26, 2021
2,037
2,496
106
Just some thoughts.

One thing I wonder about: if the L3 is fully 3D stacked underneath the cores, they can put a dense CCX and a high-frequency CCX on top of a unified L3, bypassing the IFOP/IOD for inter-CCX cache-coherence snooping.
Current dual-CCD chips already differ by 300+ MHz in clock speed between CCD0 and CCD1, and ACPI CPPC preferred-core handling is already a thing in Windows and Linux, moving tasks to the fastest cores possible.
Let's say a 4.5 GHz dense CCX and a 6 GHz fast CCX that are exactly instruction-compatible; that would make the CPU very transparent to software. During all-core loads the entire chip runs at 4.5 GHz anyway, and during bursty ST loads the preferred core takes over. Pretty much like any 5950X or 7950X when you enable CPPC preferred core in the BIOS.
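On Linux, the per-core CPPC capabilities are exposed under /sys/devices/system/cpu/cpuN/acpi_cppc/ (e.g. highest_perf), and preferred-core scheduling favors higher values. A minimal sketch of that ranking, with made-up sample values standing in for the sysfs reads:

```python
# Rank cores by CPPC highest_perf, the way preferred-core scheduling does:
# a higher value means the hardware says this core can run faster.
# Sample values are invented; on Linux they would come from
# /sys/devices/system/cpu/cpuN/acpi_cppc/highest_perf.
highest_perf = {0: 231, 1: 236, 2: 226, 3: 221,   # hypothetical fast CCX
                4: 186, 5: 181, 6: 191, 7: 176}   # hypothetical dense CCX

def preferred_order(perf: dict[int, int]) -> list[int]:
    """Cores sorted most-preferred first."""
    return sorted(perf, key=perf.get, reverse=True)

print(preferred_order(highest_perf))  # fast-CCX cores first, dense-CCX last
```

A bursty single-threaded task would land on the head of this list; an all-core load uses everything regardless, which is exactly the transparency being described.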

Also really curious whether the IFOP goes away or remains, or gets replaced by EFB or an interposer.
I have seen AMD folks working on N3 GMI PHYs with double the BW of current N5 PHYs.
GMI2 --> up to 25 Gbps SerDes, GMI3 --> up to 36 Gbps SerDes, GMI4 --> up to 64 Gbps SerDes. Not sure what to make of that.
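As a sanity check on what those SerDes rates mean for CCD bandwidth: per-direction link bandwidth is just lane rate times lane count divided by 8. The 16-lane width below is an assumption for illustration (actual GMI lane counts are not public), and GMI3 is taken at ~36 Gbps:

```python
# Per-direction link bandwidth in GB/s = lane rate (Gbps) * lanes / 8.
# The 16-lane link width is an assumed figure, not a known GMI spec.
def link_bw_gbs(gbps_per_lane: float, lanes: int) -> float:
    return gbps_per_lane * lanes / 8

for name, rate in [("GMI2", 25), ("GMI3", 36), ("GMI4", 64)]:
    print(name, link_bw_gbs(rate, 16), "GB/s per direction at 16 lanes")
```

Whatever the real lane count, doubling the per-lane rate doubles the link bandwidth in the same beachfront, which is the appeal of a new PHY generation.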

That is why I find the un-core really the most interesting thing to look forward to in Zen 5.
The core is like yeah 50% bigger PRF/decode/execute etc. etc. but the un-core is the total wild card.

Unfortunately, as usual for AMD, there will hardly be any leaks. We have to go by vague statements like these.

TSMC is talking about things to come, and one of the biggest may be the newly demonstrated SoIC-H, where H stands for Horizontal.

It is an interposer(ish) type of thing, where the dies are attached not by using bumps/micro-bumps but by "stacking" the dies on the interposer using a hybrid bond connection.

Using hybrid bonding, the number of interconnects can grow by an order of magnitude, most of the latency is squeezed out, clock speeds and bandwidth increase, and power overhead is drastically cut.
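The order-of-magnitude claim falls out of pad pitch, since connection count scales with the inverse square of pitch. The ~36 µm micro-bump and ~9 µm hybrid-bond pitches below are commonly cited ballpark figures, not TSMC specs:

```python
# Connections per mm^2 for a square grid of pads at a given pitch (in µm).
def pads_per_mm2(pitch_um: float) -> float:
    return (1000.0 / pitch_um) ** 2

microbump = pads_per_mm2(36)   # ~770 pads/mm^2 at a 36 µm micro-bump pitch
hybrid    = pads_per_mm2(9)    # ~12,350 pads/mm^2 at a 9 µm hybrid-bond pitch
print(hybrid / microbump)      # quartering the pitch gives 16x the density
```

More pads per mm² means each wire can run slower and wider (parallel instead of SerDes), which is where the latency and power savings come from.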

End of SerDes. EFB / EMIB, classic interposer leapfrogged.

While there is ~zero chance of this making it into RDNA3, imagine a quite small interposer covering the area of GCD+MCD, with both GCD and MCD stacked using hybrid bonding, and the GPU behaving indistinguishably from a monolithic die (while using different, and optimal, process technologies for the individual dies).

Or imagine a Zen5 chip with unlimited chiplet-to-I/O and chiplet-to-chiplet bandwidth, latency squeezed down to almost nothing, a possible mesh connection between chiplets, and access to another chiplet's L3 at nearly the same speed as its own L3.

Now add HBM that is no longer hampered by the power, latency, and bandwidth limits of HBM stack signaling, with the I/O-die memory controller able to directly and fully control each memory chip.

Almost as if the cores on a chiplet, the memory controller, and the individual DRAM dies were one gigantic monolithic die.

Some links:

TSMC Demos SoIC_H for High-Bandwidth HPC Applications – WikiChip Fuse
(paywalled)

SoIC_H Technology for Heterogeneous System Integration | IEEE Journals & Magazine | IEEE Xplore
(also paywalled but abstract is there)
 

RnR_au

Golden Member
Jun 6, 2021
1,722
4,223
106
What does it mean to 'squeeze out latency'? I would have thought that at the speeds that IC's are switching at nowadays, distance = latency.
 
  • Like
Reactions: maddie

Joe NYC

Platinum Member
Jun 26, 2021
2,037
2,496
106
What does it mean to 'squeeze out latency'? I would have thought that at the speeds that IC's are switching at nowadays, distance = latency.

I think (I am not sure) that multiple hops over a micro bump introduce latency in addition to power overhead.

While I am a little out of my depth, I imagine that a signal traveling inside a single die does not face the same impediments as one that has to drive a (micro)bump to go off-die and then back on-die. I don't know if this is electrical only or if it also costs a certain number of clocks. Crossing different clock domains is likely to introduce some latency, and maintaining signal integrity may add more.

This is not like comparing serial to parallel. It is comparing parallel (EMIB, EFB, Interposer) to a slightly better parallel (hybrid bond). Parallel is already low latency. Hybrid bond may offer some minor improvement, which is why I used the term "squeeze out". Because there is not a lot of latency there to begin with.
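To put a rough number on "distance = latency": assuming signals propagate at about half the speed of light (optimistic for RC-loaded on-chip wires), crossing a die-sized distance already eats a visible fraction of a clock cycle.

```python
# Signal flight time over a wire, compared with one clock period.
C = 3.0e8  # speed of light, m/s

def flight_time_ps(distance_mm: float, fraction_of_c: float = 0.5) -> float:
    """Propagation delay in picoseconds over distance_mm of wire."""
    return distance_mm * 1e-3 / (fraction_of_c * C) * 1e12

cycle_ps = 1e12 / 4.0e9               # one cycle at 4 GHz = 250 ps
print(flight_time_ps(10))             # ~67 ps to cross 10 mm at 0.5c
print(flight_time_ps(10) / cycle_ps)  # roughly a quarter of a cycle
```

So even with ideal bonding, physical distance alone costs time; what hybrid bonding removes is the extra overhead of driving bumps, SerDes, and clock-domain crossings on top of that floor.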
 
Last edited:

NostaSeronx

Diamond Member
Sep 18, 2011
3,687
1,222
136
AMD could bring back CMT, except modernized with way more cores, different types of cores, some type of scheduling optimizer, etc. 😉
They never launched the original cluster-based multithreading core in the first place. Bulldozer -> Excavator is a chip-level multithreading architecture, not a cluster-based multithreading one. AMD would not be bringing anything back, but rather launching it for the first time, since the first one was replaced by the chip-level multithreading architecture as the production product.

In fact, they technically already did this architecture style within Zen3. Zen3 has two FPUs which can execute in single-threaded mode [both schedulers get one thread] or simultaneous-threaded mode [each scheduler gets a different thread].

1st Cluster-based core(single-threaded only) = Clustered Integer&FPU, Monolithic Memory(LSU)
::https://patents.google.com/patent/US6256721B1
2nd Cluster-based core(adds multithreading and has a shared FPU instead of duplicated integrated FPU) = Clustered Integer + Monolithic Memory(LSU) + Monolithic FPU
:: patents are wonky because teams kept shifting. (MGB switched it towards multi-core rather than multi-cluster)
:: https://patents.google.com/patent/US7043626B1 starts from here and runs to slightly before 2007; if you go beyond mid-2007 it becomes https://patents.google.com/patent/US7877559B2, which is the chip-level multithreading architecture. From what I can gather, the switch away from cluster-based to chip-level was caused by Microsoft.
:: 2004-1H2007 => cluster-based multithreading
:: 2H2007+ => chip-level multithreading, with Microsoft being involved up to 2009, then whoever it was ditched the collaboration. (Microsoft wanted a cheap and fast x86-64 to swap out Xenon, and they wanted something for early cloud. 2001 - x86 -> 2005 - PowerPC -> 2009 - AMD64 (Bulldozer 2.0) -> 2013 - AMD64 (originally Bulldozer 2.5 (Steamroller)) -> 2017 - AMD64, then a 3-year cycle.)
:: The not-AMD timeline is: Chuck Moore announces cluster-based multithreading in 2005; an AMD + laptop-manufacturer meet in 2006, where Bulldozer and Bobcat were both single-core, announced both as early Fusion products for late 2008/early 2009. Bobcat first to launch as single-core + 1 GPU (40 SP? Radeon HD 3450-related?) (late 2008) -> Bulldozer second to launch as single-core + 3 GPU (120 SP? Radeon HD 3650-related?) (early 2009). This meet also described each core's target: 1+ GHz for Bobcat and 2+ GHz for Bulldozer. Later, at AMD + Laptop Meet 2.0 in 2007, they announced a third product: three Bulldozer cores + 1 GPU core for late 2009.
:: 2007 is off the rails... 1H2007 announced the delay that shifted all new products to late 2009. Then 2H2007 announced that, nope, it won't be coming until 2010+. They literally dumped the slides for the old architecture in mid-2007, when they knew they had changed gears to another architecture for the "Bulldozer" processor.
:: Pretty confident that Bulldozer 1.0 was 2x2 ALU + 3 AGU + 4 FPU and was referred to as a single-core with multithreading throughout. It is Bulldozer 2.0 (2x2 ALU, 2x2 AGLU, 1x4 FPU) which was referred to as a dual-core throughout and was called chip-level multithreading by the chief architect that launched it.
:: I also probably identified the original mm2 size for Bulldozer 1.0 as ~10.8 mm2 (11.3 mm2 with pervasive bits) for a single core on 45nm. It is also likely that the FPU was 2x64b for FMA (the penalty for MUL and ADD (different instructions) simultaneously was low) and 2x64b for MMX. I believe at the time AMD wasn't eyeballing 256-bit, but rather speeding up 64-bit (2x the throughput of K8) and 128-bit (same throughput in 2/3rd the area of GH45/Stars45). Bulldozer 1.0 and Bobcat were very close in design and target; however, Bulldozer 1.0 could scale out and up. Bobcat had a ~10W target for a single core (~5.4 mm2 on 45nm / 1+ GHz) and Bulldozer went for a ~100W target with eight cores (~10.8 mm2 * 8 / 2+ GHz).
:: Greyhound standard cells were same across => Agena, Deneb, Thuban, Llano
:: Bobcat and Bulldozer 1.0 standard cells were the same && Bulldozer 2.0 used its own grounds-up standard cells.
-- Chief Architect 0 (1997-2002): Low Power (1st Gen Clustered Core)
-- Chief Architect 1 (2002-2004): LP -> High Performance (Adopts K10 and is very close to Bulldozer that launched)
-- Chief Architect 2 (2004 to December 2007): HP -> Low Power (Adopts Bulldozer and is mobile-focused and is closer to DW/JK's K8-alternative architecture)
-- Chief Architect 3 (2008 to 2012) -> LP -> HP (Architecture that would launch, starting with Steamroller 1.0 in 2009)
-- (Acting) Chief Architect 4 (2012 to 2015?) -> Mobile-ify Steamroller/Excavator and work on 3rd Gen architecture(back to original design)
-- Chief Architect 5* (2016 to 2020) -> Ultra Low Power cluster-based Multithreading architecture (22FDX/12FDX)
-- Chief Architect 6* (2020 to present)-> Ultra Low Power grid-based multiclustered architecture
* => same person, maybe.
With the nodes of the new architecture being: 90CPP/80Mx FDX and 64CPP/56Mx FDX and 45CPP/40Mx FDX
New Malta Testsite/Shuttle September 2021-present for the above nodes. AMD is exclusively getting their own FDX nodes. AMD/GF - DTCO covers ultra-low-voltage custom digital, SerDes, memories, and etc.

Zen3 core = Monolithic Integer + Monolithic Memory(LSU) + Clustered FPU
::https://patents.google.com/patent/US11281466B2
::literal die shot + software optimization guide.

I believe you are wanting to refer to cluster-based multithreading and not chip-level multithreading.

Example: Zen3/4 w/ partial architecture cluster-based multithreading to Zen5 w/ full architecture cluster-based multithreading:
8-core => 8-core
Only FPU clusters have thread to cluster pairing => Both Int and FPU clusters have thread to cluster pairing.

The reason for clusters to become more prevalent is detailed all the way back from 2001:
"A comparison of the two microarchitectures, both optimized for energy efficiency, shows that the multicluster architecture is potentially up to twice as energy efficient for wide issue processors, with an advantage that grows with the issue width. Conversely, at the same power dissipation level, the multicluster architecture supports configurations with measurably higher performance than equivalent conventional designs."

If Zen5 uses a cluster-based front-end and integer-end to get 6-wide decode/ALU, then it would be more efficient than Apple's P-core (if it is still 6 ALU by N3) and Intel's P-core (if it is still somewhere around 6 ALU by Intel 3). Cluster-based architectures are always faster and more efficient than their conventional competitors. It is the main reason Power9/Power10 is cluster-based multithreaded; each thread gets its own cluster: t0->s0, t1->s1, t2->s2, t3->s3 when at full load.
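The t0->s0, t1->s1 pairing at full load is just a fixed one-to-one thread-to-cluster mapping. A toy sketch of how such assignment might behave at full versus light load (the light-load sharing policy below is my own invention for illustration, not Power10's actual policy):

```python
# Toy model of cluster-based multithreading dispatch: at full load each
# hardware thread is pinned to its own cluster (t0->s0, t1->s1, ...);
# with fewer active threads, the free clusters are spread across them.
def assign_clusters(num_threads: int, num_clusters: int) -> dict[int, list[int]]:
    if num_threads >= num_clusters:
        # full load: one cluster per thread
        return {t: [t % num_clusters] for t in range(num_threads)}
    # light load: divide all clusters among the active threads
    return {t: [s for s in range(num_clusters) if s % num_threads == t]
            for t in range(num_threads)}

print(assign_clusters(4, 4))  # {0: [0], 1: [1], 2: [2], 3: [3]}
print(assign_clusters(1, 4))  # {0: [0, 1, 2, 3]}
```

The point of the pairing is that a cluster's scheduler and bypass network only ever see one thread's instructions at full load, which keeps each cluster narrow and efficient.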

Also, to cover all bases just in case... AMD could delete the Integer Execution and Integer Scheduler. Re-purpose the repeated instances inside FPU clusters for general purpose. The area given with a shrink should allow for a second decode/fetch path and a second op-cache.

Decode 4-wide + >4K op-cache (lo-path) and Decode 4-wide + >4K op-cache (hi-path) (Decodes have two pipeline stages so 4+4 equals 8 macro-ops for each, total decode being 16 macro-ops.)
P0/P1/P2/P3 => 8 64-bit ALUs + 8 64-bit ALUs within two+two Vector Integer 256-bit vALUs. These units are already exposed to 64-bit operations. Allowing the other 3x64-bit internal paths to do 64-bit is a minimal area gain. There are also shift/rotate instructions that can shift/rotate hi-64 and lo-64 already by Zen4. This also has the side-effect of allowing EVEX/VEX/SSE 128-bit operations to use the upper half for a second execution. The vPRFs in Zen3/Zen4 have a 128-bit unit size, so the pairing only needs to get 2x64-bit ops into 1x128-bit op, which can then use the upper half when 2x128-bit is fused into 1x256-bit op.

Minus: Integer/GPR scheduler+execution (subtraction of power consumption (-20%))
Addition: second decode+op-cache (addition of power consumption (+10%))
Re-using: FPU scheduler/VI datapaths for grid-GPR Integer (similar power consumption as before (no change%))

They may skip straight to grid-based multicluster on Zen5. Reasons for this:
- Integer cluster-based and added cores falls under chip scaling.
- FPU being improved to support GPR-Int is under logic scaling, as the PRF's already allow manipulation of 64-bit(Lo-XMM) and 2x64-bit(Lo&Hi Swaps).
- Easiest way to improve superscalar performance on both Integer and FPU. The slowest code on any benchmark spends ~50-70% of the time running easily vectorized but superscalar int/fp operations on the FPU.
- It is also the easiest way to improve area/power efficiency with a huge IPC increase. Vectorized instructions avoid cache dependencies most of the time;
3x64b+2x64b = 320b (a lot of operations over time)
3x128b+2x128b = 640b
2x256b+1x256b = 768b (least amount of operations over time)
While the average load/store might be the same, the instantaneous load/store is much better (power & perf) for 4-wide 64-bit superscalar compartmentalized into 256-bit instructions post-decode.
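The width totals above are simple sums over issue width times operand size; checking them:

```python
# Aggregate datapath width per cycle for each issue pattern quoted above.
def total_bits(ops):
    """ops: list of (issue count, operand width in bits) pairs."""
    return sum(n * w for n, w in ops)

print(total_bits([(3, 64), (2, 64)]))    # 3 loads + 2 stores at 64b  -> 320
print(total_bits([(3, 128), (2, 128)]))  # same pattern at 128b       -> 640
print(total_bits([(2, 256), (1, 256)]))  # 2 loads + 1 store at 256b  -> 768
```

Wider operands move more bits per operation, so the 256-bit pattern needs the fewest individual load/store operations for the same data volume.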
 
Last edited:

randomhero

Member
Apr 28, 2020
181
247
86
Personally, I am not so much interested in the widening of core execution resources (it is important) as in how to feed the core(s) and the SoC. As @itsmydamnation always says, executing data is easy, moving data is hard.
Will AMD ditch on-CCD L3 to expand the number of cores? How many cores per CCD: 8, 12, or 16? Will they do a shared L2? What is the best topology to reduce the number of hops to fetch data? Will CCDs be able to communicate directly, reducing on-package traffic and energy costs? What is the best layout for HPC and what is the best layout for cloud (number of cores per CCD, layout of CCDs, etc.)?

Tough questions to answer engineering wise and this is just scratching the surface.
 
  • Like
Reactions: Tlh97 and Vattila

Joe NYC

Platinum Member
Jun 26, 2021
2,037
2,496
106
Personally, I am not so much interested in the widening of core execution resources (it is important) as in how to feed the core(s) and the SoC. As @itsmydamnation always says, executing data is easy, moving data is hard.
Will AMD ditch on-CCD L3 to expand the number of cores? How many cores per CCD: 8, 12, or 16? Will they do a shared L2? What is the best topology to reduce the number of hops to fetch data? Will CCDs be able to communicate directly, reducing on-package traffic and energy costs? What is the best layout for HPC and what is the best layout for cloud (number of cores per CCD, layout of CCDs, etc.)?

Tough questions to answer engineering wise and this is just scratching the surface.

From the direction Forrest Norrod outlined, everything going forward will be heterogeneous, including advanced packaging.

With the experience AMD is gaining from Zen 3 3D and Zen 4 3D with V-Cache, and from RDNA3 with off-chip Infinity Cache, I think it is safe to assume that AMD will move to fully off-chip L3 in Zen 5, with 3D cache as standard. It will relieve precious advanced-node die area, and for the cost of, say, N3 L3 SRAM, AMD can get 3x the L3 size on N6.
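That N6-vs-N3 SRAM trade can be framed as cost-per-MB arithmetic. Every number below is an invented placeholder (SRAM density barely improves past N5 while wafer cost climbs), used only to show the shape of the argument:

```python
# Cost per MB of SRAM = (wafer cost / usable wafer area) / SRAM density.
# All inputs are illustrative assumptions, not real foundry pricing.
WAFER_AREA_MM2 = 70_000  # rough usable area of a 300 mm wafer

def sram_cost_per_mb(wafer_cost_usd: float, mb_per_mm2: float) -> float:
    cost_per_mm2 = wafer_cost_usd / WAFER_AREA_MM2
    return cost_per_mm2 / mb_per_mm2

# Hypothetical: N3 wafers cost twice as much but SRAM is barely denser.
n6 = sram_cost_per_mb(wafer_cost_usd=10_000, mb_per_mm2=0.35)
n3 = sram_cost_per_mb(wafer_cost_usd=20_000, mb_per_mm2=0.40)
print(n3 / n6)  # N3 SRAM costs ~1.75x per MB under these assumptions
```

When SRAM density stops scaling, every MB of L3 left on the leading-edge die pays the new node's price for almost no area saving, which is the economic case for moving L3 onto a cheaper stacked die.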

GreyMon recently tweeted that 1- and 2-stack-high V-Cache is completing testing, so AMD may be able to get a decent amount of L3 connected with a hybrid-bond connection, lowering the cost of that data movement further.

With the possibility of using SoIC-H (which I outlined above), AMD might even include a separate system-level cache on its own die on top of the interposer, instead of it being stacked on top of the CCD.

Another approach is MI300, which is rumored to have a large base die that may include SRAM for L3, with compute elements stacked on top of the base die, again using hybrid-bond interconnect.

Edit: fixed some misspellings
 
Last edited:

Panino Manino

Senior member
Jan 28, 2017
830
1,033
136
After seeing how little Zen 4 really changed, with the hype people are creating about Zen 5 being the "real big thing" (probably with embedded Xilinx accelerators and "stacking") and its codename "Turin", it sometimes feels like it'll be the Second Coming of Christ.

Every Hype Train must be bigger than the last one.
 

Geddagod

Golden Member
Dec 28, 2021
1,159
1,033
106
From the direction Forrest Norrod outlined, everything going forward will be heterogeneous, including advanced packaging.

With the experience AMD is gaining from Zen 3 3D and Zen 4 3D with V-Cache, and from RDNA3 with off-chip Infinity Cache, I think it is safe to assume that AMD will move to fully off-chip L3 in Zen 5, with 3D cache as standard. It will relieve precious advanced-node die area, and for the cost of, say, N3 L3 SRAM, AMD can get 3x the L3 size on N6.

GreyMon recently tweeted that 1- and 2-stack-high V-Cache is completing testing, so AMD may be able to get a decent amount of L3 connected with a hybrid-bond connection, lowering the cost of that data movement further.

With possibility of using SoIC-H (I outlined above), AMD might even include a separate system level cache on a separate die on top of the interposer using SoIC-H, instead of being stacked on top of CCD.

Another approach is the rumored Mi300, that is rumored to have a large base die, which may include SRAM for L3, with compute elements stacked on top of base die, again using hybrid bond interconnect.
The fact that Zen 5 has a separate V-Cache variant (according to AMD roadmaps) makes me think that V-Cache still will not be standard for Zen 5.
 

exquisitechar

Senior member
Apr 18, 2017
657
872
136
After seeing how little Zen 4 really changed, the hype people are creating about Zen 5 being the "real big thing" (with probably Xilinx accelerators embedded and "stacking") and it's codename "Turin", sometimes it feels like it'll be the Second Coming of Christ.

Every Hype Train must be bigger than the last one.
It had been expected that Zen 5 would be the "real big thing" since long, long before Zen 4.
 

maddie

Diamond Member
Jul 18, 2010
4,762
4,728
136
After seeing how little Zen 4 really changed, the hype people are creating about Zen 5 being the "real big thing" (with probably Xilinx accelerators embedded and "stacking") and it's codename "Turin", sometimes it feels like it'll be the Second Coming of Christ.

Every Hype Train must be bigger than the last one.
And yet, the performance jump is the biggest since the jump from construction cores to Zen 1.

What a complete failure, correct?
 

Markfw

Moderator Emeritus, Elite Member
May 16, 2002
25,611
14,592
136
After seeing how little Zen 4 really changed, the hype people are creating about Zen 5 being the "real big thing" (with probably Xilinx accelerators embedded and "stacking") and it's codename "Turin", sometimes it feels like it'll be the Second Coming of Christ.

Every Hype Train must be bigger than the last one.
Well, I own both and that's just crap. Zen 3 to Zen 4 is light-years of difference. If you don't believe it, buy both, then talk to me.
 

exquisitechar

Senior member
Apr 18, 2017
657
872
136
You're kidding. Zen 4 rumors peaked at huge 20-40 percent IPC gains! Many people thought that Zen 4 was going to be a huge architectural uplift.
I think expecting some massive architectural rework never made sense for the follow-up to Zen 3, although I was still disappointed by Zen 4, partially due to the more optimistic rumors. I tempered my expectations as we got closer to the reveal, and I was still let down. Rumors of Zen 5 being huge and the largest change to the core, among other things, since Zen, have been around for a long time and I always expected it to be more than whatever Zen 4 ended up being (which is, admittedly, not much). I'm not the only one, either. The interviews with Mike Clark and such make me pretty hopeful.

Anyway, I hope that AMD picks up the pace and we get no more Zen 4s in the near future. It's certainly in their interest, because they could get lapped by Intel in 2025 and later if most of Intel's ambitious plans on both the process and CPU design side are realized and Royal Cove is even half as good as some of the rumors say.
 
  • Like
Reactions: Kaluan

Saylick

Diamond Member
Sep 10, 2012
3,209
6,553
136
I think expecting some massive architectural rework never made sense for the follow-up to Zen 3, although I was still disappointed by Zen 4, partially due to the more optimistic rumors. I tempered my expectations as we got closer to the reveal, and I was still let down. Rumors of Zen 5 being huge and the largest change to the core, among other things, since Zen, have been around for a long time and I always expected it to be more than whatever Zen 4 ended up being (which is, admittedly, not much). I'm not the only one, either. The interviews with Mike Clark and such make me pretty hopeful.

Anyway, I hope that AMD picks up the pace and we get no more Zen 4s in the near future. It's certainly in their interest, because they could get lapped by Intel in 2025 and later if most of their ambitious plans on both the process and CPU design side are realized and Royal Cove is even half as good as some of the rumors say.
I think we won't get another Zen 4-esque architecture unless a few things happen:
1) AMD's financial situation does not allow them to put enough engineering resources toward a ground-up redesign.
2) A completely new DT platform/socket is being designed such that it pulls engineering resources away from the core design.

I think the current Zen 4 launch is a consequence of the above conditions, considering that Zen 4 was likely designed in 2018, before they were financially on solid ground, and it coincided with them needing to design a new socket/platform from scratch. Zen 1+, Zen 2, and Zen 3 had the luxury of the AM4 socket already existing, so they could focus more effort on the core and core-package design. As a platform, AM5 should not have the weaknesses and shortcomings of AM4 from a longevity standpoint. It should have enough on-board BIOS memory to accommodate all future AM5 processors without needing to worry about compatibility or BIOS updates, and the socket power delivery is beefed up to handle future 24-core processors, should they exist. PCIe Gen 5 is supported out of the gate (at a cost premium), but the future-proofing is there if you want it. It would not surprise me if all of this effort meant that AMD did not have enough resources to do a ground-up redesign for Zen 4. Yes, we probably were all expecting some kind of crazy uplift given the new N5 node, but it was likely more economical to just make Zen 4 basically a die-shrunk Zen 3 with run-of-the-mill enlarged structures, extra transistors to enable the higher clocks, and QOL improvements, e.g. VNNI and AVX-512.
 

eek2121

Platinum Member
Aug 2, 2005
2,930
4,027
136
I think expecting some massive architectural rework never made sense for the follow-up to Zen 3, although I was still disappointed by Zen 4, partially due to the more optimistic rumors. I tempered my expectations as we got closer to the reveal, and I was still let down. Rumors of Zen 5 being huge and the largest change to the core, among other things, since Zen, have been around for a long time and I always expected it to be more than whatever Zen 4 ended up being (which is, admittedly, not much). I'm not the only one, either. The interviews with Mike Clark and such make me pretty hopeful.

Anyway, I hope that AMD picks up the pace and we get no more Zen 4s in the near future. It's certainly in their interest, because they could get lapped by Intel in 2025 and later if most of their ambitious plans on both the process and CPU design side are realized and Royal Cove is even half as good as some of the rumors say.

I'm not disappointed at all. Between the smaller process and architectural improvements, they basically halved the power consumption vs Zen 3 while improving IPC. They chose to use that budget to increase performance, but that doesn't mean the improvement isn't there.