Discussion Zen 5 Speculation (EPYC Turin and Strix Point/Granite Ridge - Ryzen 9000)

Page 857

DAPUNISHER

Super Moderator CPU Forum Mod and Elite Member
Aug 22, 2001
32,012
32,465
146
I'm happy to pay bro 😭
 

Gideon

Platinum Member
Nov 27, 2007
2,030
5,034
136
32 MB on-die L3$ is already quite a lot for 8c/16t for most applications. The next time they spend more CCD area on cache, they might perhaps spend it on L2$ rather than L3$.
I really wish the x86 world (for desktop and mobile) would also offer a tiled L2 similar to Qualcomm's and Apple's:

Snapdragon X has a 96KB L1 and 12MB L2 per core cluster (4 x 3MB slices, but a single core can use all 12MB), with nearly identical latency to AMD’s private 1MB L2.




Oryon uses an Apple-like caching strategy. A large 96 KB L1 and relatively fast L2 with 20 cycles of latency together mean Oryon doesn’t need a mid-level cache. Firestorm has a bigger 128 KB L1, but Oryon’s L1 is still much larger than the 32 or 48 KB L1 caches in Zen 4 or Redwood Cove.


AMD has a 1 MB L2 mid-level cache private to each core, then a 16 MB L3. That setup makes it easier to increase caching capacity, because the L2 cache can insulate the core from L3 latency. However, that advantage is minimal for mobile Zen 4 parts, which max out at 16 MB of L3. Oryon therefore provides competitive latency especially as accesses spill out of Zen 4’s L2. Meteor Lake follows a similar caching strategy to Zen 4, but has more caching capacity at the expense of higher latency.
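
(Side note: latency figures like the ones above are typically measured with a pointer-chasing loop over a buffer sized to the cache level of interest. Below is a minimal sketch of that technique in C, with illustrative buffer sizes and iteration counts; it is not the actual harness behind the numbers quoted here.)

```c
// Minimal pointer-chasing latency sketch: each load depends on the previous
// one, so the average time per iteration approximates load-to-use latency for
// whatever cache level the buffer fits into. Sizes are illustrative only.
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

int main(void) {
    // Size the working set to target a cache level, e.g. 8 MiB to land in L3
    // on a Zen 4 CCD, 512 KiB to stay in L2. Illustrative values.
    size_t bytes = 8u << 20;
    size_t n = bytes / sizeof(size_t);

    size_t *chain = malloc(n * sizeof(size_t));
    if (!chain) return 1;

    // Build a random cyclic permutation (Sattolo's algorithm) so hardware
    // prefetchers cannot predict the next address.
    for (size_t i = 0; i < n; i++) chain[i] = i;
    srand(42);
    for (size_t i = n - 1; i > 0; i--) {
        size_t j = (size_t)rand() % i;
        size_t tmp = chain[i];
        chain[i] = chain[j];
        chain[j] = tmp;
    }

    // Chase the chain; the data dependency serializes the loads.
    size_t idx = 0;
    const size_t iters = 100u * 1000u * 1000u;
    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (size_t i = 0; i < iters; i++) idx = chain[idx];
    clock_gettime(CLOCK_MONOTONIC, &t1);

    double ns = (t1.tv_sec - t0.tv_sec) * 1e9 + (t1.tv_nsec - t0.tv_nsec);
    // Print idx so the compiler can't optimize the loop away.
    printf("~%.2f ns per load (sink: %zu)\n", ns / iters, idx);
    free(chain);
    return 0;
}
```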

I'm not sure what the optimal cache hierarchy would be for desktop chips, but the current design sure seems nonoptimal for client workloads (it makes much more sense on server).

Could you imagine the gaming performance of an AMD chip that would have:
  • 24MB tiled L2 (12MB + 12MB shared by 4 cores each, if it's required to not regress in latency)
  • 96 - 128MB L3 (3D cache)
  • 2nd gen chiplets (similar to Strix Halo) with full-width memory bandwidth per CCD and faster FCLK support (2600+ MHz)

Ideally it should also have 3-4 memory channels using aggressive low-latency CAMM2 modules, but you can't have it all I guess (that would need a new socket) ...


TL;DR:

A 1MB - 3MB private L2 only makes sense if it provides 2-3x better latency or bandwidth than a 12MB shared, tiled L2. Otherwise, it’s a waste of SRAM potential, IMO.
 
Last edited:

Kepler_L2

Senior member
Sep 6, 2020
998
4,258
136

MS_AT

Senior member
Jul 15, 2024
868
1,762
96
I really wish the x86 world (for desktop and mobile) would also offer a tiled L2 similar to Qualcomm's and Apple's:

Snapdragon X has a 96KB L1 and 12MB shared L2 per 4 cores (4x 3MB slices, but a single core can use all 12MB), but it has the same latency as AMD's private 1MB L2.
It does have similar latency in cycles, but worse absolute latency [ns].
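
To put rough numbers on it: ns = cycles / GHz, so the same cycle count gets worse in absolute terms as the clock drops. The clocks below are purely illustrative assumptions, not measurements:

```c
// Latency in ns = cycles / clock (GHz): identical cycle counts translate to
// very different wall-clock latency at different frequencies.
// The clock values below are illustrative assumptions, not measured figures.
#include <stdio.h>

int main(void) {
    const double cycles = 20.0;                   /* e.g. the ~20-cycle L2 quoted above */
    const double clocks_ghz[] = {3.4, 4.3, 5.7};  /* assumed: laptop cluster clock,
                                                     boost clock, desktop Zen 4 boost */
    for (size_t i = 0; i < sizeof clocks_ghz / sizeof clocks_ghz[0]; i++)
        printf("%.0f cycles @ %.1f GHz -> %.2f ns\n",
               cycles, clocks_ghz[i], cycles / clocks_ghz[i]);
    return 0;
}
/* 20 cycles @ 3.4 GHz -> 5.88 ns
   20 cycles @ 4.3 GHz -> 4.65 ns
   20 cycles @ 5.7 GHz -> 3.51 ns */
```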
 

Tuna-Fish

Golden Member
Mar 4, 2011
1,665
2,530
136
I really wish the x86 world (for desktop and mobile) would also offer a tiled L2 similar to Qualcomm's and Apple's:

Snapdragon X has a 96KB L1 and 12MB L2 per core cluster (4 x 3MB slices, but a single core can use all 12MB), with nearly identical latency to AMD’s private 1MB L2.

At what clocks? Cache latency measured in clock cycles is identical, but they could only implement that because they had twice the wall-clock time per cycle to work with.

The AMD L2 is extremely tight; you are not increasing its size at all without a latency regression. You are absolutely not sharing it with anything without a latency regression.
 

StefanR5R

Elite Member
Dec 10, 2016
6,670
10,551
136
This is supposed to be an "official spec sheet". One thing that's missing from the previous "official spec sheet" is DDR5-6000 support:

Already posted in #21,400 and #21,401. And it hasn't become any more official than it was at the time of #21,405 and #21,411. :-) IOW, it's quite possibly not a leak, but likely just a rehash of previous leaks and wannabe-leaks.

It's an Austrian price comparison site. They are unlikely to receive 1st party spec sheets before product launch.
 
Last edited:

StefanR5R

Elite Member
Dec 10, 2016
6,670
10,551
136
V-Cache isn't visible on retail silicon. Those shots of Lisa holding X3D parts with visible dies are without the top supporting silicon.
To be fair, it's not just the very lowest end of online journalism who don't get it. ComputerBase.de have been trolled ;-) by AMD in the same way. "... the 3D cache, which can otherwise be seen with the naked eye, ..."

(Edit: CB were made aware of this and have now corrected their article.)
 
Last edited:

coercitiv

Diamond Member
Jan 24, 2014
7,355
17,424
136
This writing is either outsourced to Malaysia or fake.
or both
You can add LLM-generated as well:
Introducing the AMD Ryzen 7 9800X3D, the latest powerhouse in gaming and multitasking performance, featuring revolutionary 3D V-Cache technology. Elevate your computing experience with unbeatable speeds and unparalleled efficiency.
  • Next-Gen 3D V-Cache Technology: Enhanced performance with up to 96MB of L3 cache.
  • Unmatched Gaming Performance: Boosts gaming performance by up to 26% over the previous generation.
  • Higher Clock Speeds: Achieves up to 5.2GHz for lightning-fast processing.
  • Improved Thermal Performance: Better cooling efficiency for sustained high performance.
  • Zen 5 Architecture: Built on the latest Zen 5 core architecture for superior efficiency and power.
  • AM5 Compatibility: Fully compatible with the AM5 platform, supporting PCIe Gen 5 and DDR5 memory.
  • Multi-Threaded Excellence: Ideal for multitasking and content creation with 8 cores and 16 threads.
  • Future-Proof Design: Ready for upcoming technologies and applications.
 

Gideon

Platinum Member
Nov 27, 2007
2,030
5,034
136
It does have similar latency in cycles, but worse absolute latency [ns].
Yeah, that's true. I wish we had a (relatively) apples-to-apples latency comparison between the M4 @ 4.4 GHz and Zen 5 to see what the actual latencies are. The only info chipsandcheese has up is a rather out-of-date 7950X vs M1 comparison (from here):

[chipsandcheese chart: M1 vs 7950X cache latency]

Roughly 5.4 ns for the M1 vs 2.4 ns for the 7950X. So yeah, a very significant difference of over 2x.

But the M1 clocked only up to 3.2 GHz, while the M4 clocks up to 4.4 GHz. I'm more than certain Apple relaxed the L2 latency by a few cycles in doing that, but I'd still very much like to see where they ended up (and I hope reviewers measure it).
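
Working backwards from those two figures (implied cycles = ns x GHz), with the M1 at its ~3.2 GHz and the 7950X assumed to be running near its ~5.7 GHz boost:

```c
// Back out implied cycle counts from the measured ns latencies quoted above.
// Clock assumptions: M1 ~3.2 GHz (as stated), 7950X ~5.7 GHz boost (assumed).
#include <stdio.h>

int main(void) {
    printf("M1:    %.1f ns * 3.2 GHz = ~%.0f cycles\n", 5.4, 5.4 * 3.2);
    printf("7950X: %.1f ns * 5.7 GHz = ~%.0f cycles\n", 2.4, 2.4 * 5.7);
    return 0;
}
/* ~17 cycles vs ~14 cycles: the cycle counts are in the same ballpark,
   so most of the ~2x ns gap comes from the clock-speed difference. */
```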

The AMD L2 is extremely tight; you are not increasing its size at all without a latency regression. You are absolutely not sharing it with anything without a latency regression.

That's true, it's not possible without some latency regression. My whole point was that with the ever-more-prevalent 3D cache, there is a growing gap between the rather small 1MB L2 and the gigantic 96MB L3.

Take the TechPowerUp reviews as an example (as they use the same mobo and RAM configs):

  • For the Ryzen 9700X review they registered 7.7 ns L3 latency
  • For the Ryzen 7700X it was 9.9 ns L3 latency
  • For the Ryzen 7800X3D it was 12.7 ns L3 latency, for a 3x bigger cache

A <30% regression for 3x the size. Looks to be a pretty decent tradeoff (and I expect it to be less for the 9800X3D, as it clocks higher!).

But then again, the latency gap between L2 and L3 went from 3x to 4x.
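
Spelled out (the ~3 ns L2 latency here is an assumption, the same ballpark figure I use further down):

```c
// The L3 regression and the L2-to-L3 gap, computed from the TechPowerUp
// numbers above. The ~3 ns L2 latency is an assumption (see further down).
#include <stdio.h>

int main(void) {
    const double l2_ns = 3.0;                      /* assumed Zen 4/5 L2 latency */
    const double l3_7700x = 9.9, l3_7800x3d = 12.7;

    printf("X3D L3 regression: %.0f%% for 3x the capacity\n",
           (l3_7800x3d / l3_7700x - 1.0) * 100.0);
    printf("L2->L3 gap, 7700X:   %.1fx\n", l3_7700x / l2_ns);
    printf("L2->L3 gap, 7800X3D: %.1fx\n", l3_7800x3d / l2_ns);
    return 0;
}
/* ~28% regression; the L2->L3 latency gap grows from ~3.3x to ~4.2x. */
```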

As many consumer applications are heavily cache/memory bound, there seems to be performance there, waiting to be extracted.

What options are there to do that?

1. Adding extra cache layers - possible, but it comes with numerous other significant drawbacks
2. Upsizing the private L2 to 2MB or 3MB (as Intel did) - this is the easiest solution, but even with "just" 2MB of L2, we'd use 16MB of the CCD's SRAM budget on L2 while limiting the amount a single thread can use to 2MB. Going beyond that (3MB for 24MB total) seems insanely wasteful to me.
3. Sharing the L2 between cores - a much more complex solution, with obvious latency regressions as you stated

Extrapolating from what AMD did with the L3, it should be possible to go from 1MB to 3MB with a ~30% latency increase (3 ns -> 4 ns). Actually I think AMD would do better, as AFAIK going from 512KB to 1MB they managed to regress much less than that!
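
Roughly sketched, here's the per-CCD SRAM budget for option 2 plus that latency extrapolation (the ~3 ns starting point is an assumption):

```c
// Per-CCD L2 SRAM budget for option 2, plus the rough latency extrapolation
// used above (~30% more latency for 3x the capacity, starting from ~3 ns).
#include <stdio.h>

int main(void) {
    const int cores_per_ccd = 8;
    const int l2_mb[] = {1, 2, 3};

    for (size_t i = 0; i < sizeof l2_mb / sizeof l2_mb[0]; i++)
        printf("%d MB private L2 x %d cores = %2d MB of CCD SRAM on L2\n",
               l2_mb[i], cores_per_ccd, l2_mb[i] * cores_per_ccd);

    // Extrapolated private-L2 latency if 3x capacity costs ~30% latency.
    const double l2_now_ns = 3.0;          /* assumed current L2 latency */
    printf("1 MB -> 3 MB: ~%.1f ns -> ~%.1f ns\n", l2_now_ns, l2_now_ns * 1.3);
    return 0;
}
/* 8 / 16 / 24 MB of SRAM; ~3.0 ns -> ~3.9 ns, i.e. roughly the 3 -> 4 ns
   figure above. */
```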

TL;DR: So a private 2MB L2 is indeed the most obvious solution to address this.

It's just that in my La La land, I'd like to see a shared L2 solution where the banks next to the core have almost no latency regression and the ones further away have a 20-30% regression, but allow a core to use up to 8MB of L2 instead of "just" 2MB.

The intriguing alternative is to keep the L2 at 2MB on the base SKU and take the "2-3 cycle hit" on 3D cache parts by also doubling the private L2 on the V-cache die to 4MB (keeping the relative latency between L2 and L3 the same).
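
To illustrate why I think the shared-banks idea could pay off, here's a toy average-latency model; every hit fraction and latency in it is made up purely for illustration:

```c
// Toy model of a NUCA-style shared L2: near banks keep today's latency,
// far banks cost 20-30% more, and the average depends on where hits land.
// All fractions and latencies below are made-up, illustrative numbers.
#include <stdio.h>

int main(void) {
    const double near_ns = 3.0;        /* assumed: near bank, ~today's L2 */
    const double far_ns  = 3.0 * 1.25; /* assumed: far bank, +25% */
    const double l3_ns   = 12.7;       /* 7800X3D L3 figure from above */

    /* Assumed hit split for a cache-hungry thread: 60% near bank,
       25% far banks (capacity it couldn't have had with a 2 MB private L2),
       15% spills to L3. */
    double shared_avg = 0.60 * near_ns + 0.25 * far_ns + 0.15 * l3_ns;

    /* Same thread with a 2 MB private L2: the 25% that hit far banks above
       now miss to L3 instead. */
    double private_avg = 0.60 * near_ns + 0.40 * l3_ns;

    printf("shared tiled L2 : %.2f ns average\n", shared_avg);
    printf("2 MB private L2 : %.2f ns average\n", private_avg);
    return 0;
}
/* ~4.6 ns vs ~6.9 ns with these made-up numbers: the far banks only pay off
   if they actually convert L3 (or DRAM) accesses into L2 hits. */
```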
 
Last edited: