Discussion: Beyond Zen 6


Schmide

Diamond Member
Mar 7, 2002
5,788
1,093
126
Cache sharing? (And I'm not going out to find it.) IBM came out with a cache system where the L2 was shared. (Changed my mind, I found it: covered by friend of the show Dr. Ian Cutress and George?) In 2021, IBM's Telum processor increased the L2 size and did away with the lower-level caches. Well, actually, they were virtualized: instead of evicting data to a slower L3, a core will borrow space from another core's L2 that has free space, thus creating a virtual L3 by ring-busing the L2s. Further, the logic extends to the entire system, allowing a core to borrow from farther-away cores, creating a virtual L4.
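For illustration, here is a toy sketch of that Telum-style "virtual L3" idea: when a core's L2 is full, the evicted line is parked in a neighboring core's L2 that has spare room instead of spilling further out. The core count, capacities, and the `Core`/`evict` names are made up for the sketch, not IBM's actual mechanism.

```python
# Toy sketch of a Telum-style "virtual L3": a full L2 parks its evicted
# line in a neighboring core's L2 with spare capacity instead of spilling
# to a real L3. All capacities and names are illustrative assumptions.

class Core:
    def __init__(self, name, capacity):
        self.name = name
        self.capacity = capacity
        self.l2 = []          # resident lines, oldest first

    def has_space(self):
        return len(self.l2) < self.capacity

def evict(core, ring):
    """Evict the oldest line; park it in a neighbor's L2 if one has room."""
    victim = core.l2.pop(0)
    for neighbor in ring:
        if neighbor is not core and neighbor.has_space():
            neighbor.l2.append(victim)        # line lives on as "virtual L3"
            return f"{victim} -> virtual L3 in {neighbor.name}"
    return f"{victim} -> memory"              # no spare L2 space anywhere

ring = [Core("core0", 2), Core("core1", 2)]
for line in ("A", "B", "C"):                  # core0 touches three lines
    if not ring[0].has_space():
        print(evict(ring[0], ring))           # "A -> virtual L3 in core1"
    ring[0].l2.append(line)
```

The same loop, run system-wide across drawers, is what the post describes as the virtual L4.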

IBM and AMD have often shared, worked together on, or carved out areas of mutual benefit: from SOI during the GlobalFoundries days, to the Roadrunner supercomputer, to the Joint Development Agreement (JDA) in 2020.

I have no information to validate any transfer of technology, but I wouldn't put it past either company to at least give advice on certain technologies.

Could this be a blueprint for AMD's future cache technologies?
 
  • Like
Reactions: Joe NYC

Schmide

Diamond Member
Mar 7, 2002
5,788
1,093
126
lmao never.
You'll get a fat L2 at least. Maybe.

It's just moar L2 via SoIC-X.

That doesn't make physics sense, as far as I understand. The L2 is way, way too fast for 3D stacking, and the more ways you put in a cache, the slower it is. So you can't just make it bigger. But you can improve its communication with nearby assets, which is basically what Telum does.
 

inquiss

Senior member
Oct 13, 2010
626
889
136
That doesn't make physics sense, as far as I understand. The L2 is way, way too fast for 3D stacking, and the more ways you put in a cache, the slower it is. So you can't just make it bigger. But you can improve its communication with nearby assets, which is basically what Telum does.
Weird cos they're gonna 3D stack it.
 

adroc_thurston

Diamond Member
Jul 2, 2023
8,479
11,199
106
That doesn't make physics sense
yeah it does.
As far as I understand, the L2 is way, way too fast for 3D stacking
No?
Private L2's are like 2*64B usually.
So you can't just make it bigger
How does Intel's 3 MB nice and comfy L2 slab work then?
They just deadass made a Bigger Cache.
But you can improve its communication with nearby assets, which is basically what Telum does.
I suggest you forget all esoterica and always pick the simplest, most straightforward solution. That's how AMD does things.
 

Schmide

Diamond Member
Mar 7, 2002
5,788
1,093
126
Weird cos they're gonna 3D stack it.

I mean 3D cache stacking (the way the L3 cache works now). An L2 would never work with that level of latency. Seriously, it made the already slow L3 a few ticks slower, and I'm sure there is some secret logic sauce where the faster on-die L3 makes up for the slower stacked part.
 

inquiss

Senior member
Oct 13, 2010
626
889
136
I mean 3D cache stacking (the way the L3 cache works now). An L2 would never work with that level of latency. Seriously, it made the already slow L3 a few ticks slower, and I'm sure there is some secret logic sauce where the faster on-die L3 makes up for the slower stacked part.
Ah, that's the best part: this reduces latency (at the same capacity).
 

gdansk

Diamond Member
Feb 8, 2011
4,749
8,054
136
So your argument is that the patent is impossible to implement? That can happen, but this one included access-time estimates, which suggests it was at least simulated, even if it can't be manufactured yet.
 

Schmide

Diamond Member
Mar 7, 2002
5,788
1,093
126
I was just putting Telum out there as a point of discussion of what can be done at the L2 level. I'm willing to concede that Telum is not a viable option for Zen.
 

Schmide

Diamond Member
Mar 7, 2002
5,788
1,093
126
Some thoughts regardless of implementation (stacked, flat, angle of the tip squared): you're going to be limited by how you balance the associativity.

On Zen 5 we have a 48 KB 12-way L1 cache, up from 32 KB 8-way on Zen 4. Latency stayed the same, but extra logic and area were needed; doubling the bandwidth and dual-porting essentially hid the complexity of the new system.

For the L2, AMD basically split the difference. The L2 size stayed the same, but the ways and data paths doubled. This allowed them to maintain the 14-cycle latency while catering to the wider 512-bit data path. Note that the number of sets is halved, so you have to do more work to maintain the same level of functionality.

For Zen 6, moving to a bigger L2 is going to require either some regression in latency or an increase in complexity (area, wires, ways). If you double the size, you need to double your ways or increase your latency. If you increase your ways, you're going to use way, way more power. (See the way I phrased that.)
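The set/way arithmetic above can be sketched numerically: sets = size / (ways × line size), so at fixed size, doubling the ways halves the sets, and doubling the size at fixed ways doubles them back. The 64 B line size is an assumption; the sizes and associativities follow the post (1 MB L2, 8-way on Zen 4 vs. 16-way on Zen 5).

```python
# Back-of-the-envelope cache geometry for the trade-off described above.
# 64 B lines are assumed; sizes/ways follow the discussion.

LINE = 64  # bytes per cache line (assumed)

def geometry(size_bytes, ways):
    """Return (number of sets, index bits) for a power-of-two cache."""
    sets = size_bytes // (ways * LINE)
    index_bits = sets.bit_length() - 1   # log2(sets)
    return sets, index_bits

zen4    = geometry(1024 * 1024, 8)       # (2048 sets, 11 index bits)
zen5    = geometry(1024 * 1024, 16)      # (1024 sets, 10 index bits) - sets halved
doubled = geometry(2 * 1024 * 1024, 16)  # (2048 sets, 11 index bits) again

print(zen4, zen5, doubled)
```

This is just the bookkeeping, of course; it says nothing about the power cost of comparing twice as many ways per lookup, which is the crux of the post.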

Even if stacking did provide a pathway to a larger L2, the extra complexity would certainly force a trade-off somewhere. When you're adding 50% more cores, you can hardly provision an extra power-hungry cache in each unit.

At best, I think they beef up the L2 to 32 ways, optimize the L3 snoop functionality, and finish off with the Sea of Wires (InFO-oS) chiplet improvements.

This optimization of the L3 snoop is what I was hinting at in the Telum post: basically, providing a better fabric between cores so that you don't have to fully commit to the L3 and incur that ~100-cycle penalty.
 

Joe NYC

Diamond Member
Jun 26, 2021
4,200
5,779
136
well it ain't about latency reduction per se, but piling up like 5 or 6 megs of private L2 at an acceptable cycle count.

That would be the equivalent of the per-core L2+L3 in recent Zen processors: 1 MB L2 + 4 MB L3.

This means that a theoretical base model could have just the L2. There would be some tradeoffs (some beneficial, some detrimental).

As for die area: in Zen 5, excluding the SerDes, probably somewhere between 2/5 and 1/2 of the die is L2 + L3.

Removing both the L2 and L3 from the CCD and placing only an L2 on the base die, a hypothetical Zen 5 CCD would shrink to ~40 mm², and the base die could then accommodate 4-6 MB of L2 per core (32-48 MB total), which would be quite interesting cost-wise.
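A rough back-of-the-envelope for that area argument, assuming a ~71 mm² Zen 5 CCD (a ballpark assumption, not a measured figure) and the 2/5 to 1/2 cache fraction from the post:

```python
# Rough check of the die-area argument above. The ~71 mm^2 Zen 5 CCD area
# is an assumed ballpark; the 2/5..1/2 cache fraction comes from the post.

CCD_MM2 = 71.0  # assumed Zen 5 CCD area

def logic_only_mm2(cache_fraction):
    """CCD area left over if the L2+L3 fraction moves to a base die."""
    return CCD_MM2 * (1 - cache_fraction)

for frac in (2 / 5, 1 / 2):
    print(f"cache {frac:.0%} of die -> logic-only CCD "
          f"~{logic_only_mm2(frac):.0f} mm^2")
```

Both ends of the range land in the neighborhood of the ~40 mm² figure above, so the arithmetic at least hangs together.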

Even more interesting would be when the costs of the logic die and cache die diverge further, in N2 and beyond.

There is a considerable benefit to shared, very large L3. I am just not sure where the tradeoff would be.
 
  • Like
Reactions: Tlh97

Joe NYC

Diamond Member
Jun 26, 2021
4,200
5,779
136
I mean 3D cache stacking (the way the L3 cache works now). An L2 would never work with that level of latency. Seriously, it made the already slow L3 a few ticks slower, and I'm sure there is some secret logic sauce where the faster on-die L3 makes up for the slower stacked part.

Seems like you are suggesting that the patent posted is invalid, impossible to implement, or you are ignoring it.

The patent suggests 2 things:
- L2 can be moved to stacked die
- L2 latency, after it was moved to stacked die will go down (!!!).
 

Schmide

Diamond Member
Mar 7, 2002
5,788
1,093
126
Seems like you are suggesting that the patent posted is invalid, impossible to implement, or you are ignoring it.

The patent suggests 2 things:
- L2 can be moved to stacked die
- L2 latency, after it was moved to stacked die will go down (!!!).

When the L3 received a stacked cache, latency went up, and that's a victim cache.

The patent is valid. The point is that there are trade-offs for every operation: if you do more work, it costs power or latency.

There may be some process that makes the above puzzle work. I look forward to seeing the solution.
 

Joe NYC

Diamond Member
Jun 26, 2021
4,200
5,779
136
When the L3 received a stacked cache, latency went up, and that's a victim cache.

The patent is valid. The point is that there are trade-offs for every operation: if you do more work, it costs power or latency.

There may be some process that makes the above puzzle work. I look forward to seeing the solution.

I agree some tradeoffs will surface, and if the yield loss turns out to be minimal, then the cost may be in terms of power, which could then be further mitigated by less power burned on cache misses (DRAM accesses).

But as to why the patent can be extremely significant: it seems that AMD discovered certain methods or shortcuts to achieve something that seemed impossible, going beyond the previous L3 V-Cache.

Namely, instead of a slight latency penalty, there is a slight latency benefit (which can then be "spent" on a larger L2, on a cheap die).
 

StefanR5R

Elite Member
Dec 10, 2016
6,844
10,998
136
IBM came out with a cache system where the L2 was shared. (Changed my mind, I found it: covered by friend of the show Dr. Ian Cutress and George?)
For reference, here is AnandTech's article on the original Telum (z16), i.e. Dr. Ian Cutress reporting on IBM's Hot Chips 2021 presentation:
https://web.archive.org/web/2021090...924/did-ibm-just-preview-the-future-of-caches

PS, on the question of whether or not IBM previewed the future of caches:
Dr. Ian Cutress said:

How Is This Possible?

Magic. Honestly, the first time I saw this I was a bit astounded as to what was actually going on.

In the Q&A following the session, Dr. Christian Jacobi (Chief Architect of Z) said that the system is designed to keep track of data on a cache miss, uses broadcasts, and memory state bits are tracked for broadcasts to external chips. These go across the whole system, and when data arrives it makes sure it can be used and confirms that all other copies are invalidated before working on the data. In the slack channel as part of the event, he also stated that lots of cycle counting goes on!

I’m going to stick with magic.

Truth be told, a lot of work goes into something like this, and there are likely still a lot of considerations to put forward to IBM about its operation, such as active power, or whether caches can be powered down in idle or even be excluded from accepting evictions altogether to guarantee performance consistency of a single core. It makes me think what might be relevant and possible in x86 land, or even with consumer devices.

I’d be remiss in talking caches if I didn’t mention AMD’s upcoming V-cache technology, which is set to enable 96 MB of L3 cache per chiplet rather than 32 MB by adding a vertically stacked 64 MB L3 chiplet on top. But what would it mean to performance if that chiplet wasn’t L3, but considered an extra 8 MB of L2 per core instead, with the ability to accept virtual L3 cache lines?

Ultimately I spoke with some industry peers about IBM’s virtual caching idea, with comments ranging from ‘it shouldn’t work well’ to ‘it’s complex’ and ‘if they can do it as stated, that’s kinda cool’.

Back to the topic of stacked cache:
The patent suggests 2 things:
- L2 can be moved to stacked die
- L2 latency, after it was moved to stacked die will go down (!!!).
Go down compared to an on-die cache of same size, associativity, etc., right? (I haven't read the patent.)

When the L3 received a stacked cache latency went up and that's a victim cache.
At the same time, the L3$ size was tripled.
 
  • Like
Reactions: Tlh97 and Joe NYC

Joe NYC

Diamond Member
Jun 26, 2021
4,200
5,779
136
Go down compared to an on-die cache of same size, associativity, etc., right? (I haven't read the patent.)

I have not read (or searched for the link to) the full patent either. If anybody has the link handy, it would be appreciated.

But my assumption is that it would be identical L2, same size, same associativity.

I am guessing that the reduction in latency comes from reduced distances: the L2 sits exactly under the die area that data would normally be transferred to from a greater distance.
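As a toy illustration of that distance intuition: modeling delay naively as distance over an assumed effective wire speed, a millimeter-scale planar route is orders of magnitude longer than a ~50 µm vertical hop to a stacked die. All numbers here are illustrative assumptions, not process data.

```python
# Toy distance-to-delay comparison for the "shorter vertical path" intuition.
# The wire speed and distances are illustrative assumptions, not process data.

SPEED_MM_PER_NS = 1.0  # assumed effective speed of a repeated on-die wire

def wire_delay_ns(distance_mm):
    """Naive propagation delay: distance / assumed effective speed."""
    return distance_mm / SPEED_MM_PER_NS

planar_ns = wire_delay_ns(2.0)    # ~2 mm route across a planar cache slab
stacked_ns = wire_delay_ns(0.05)  # ~50 um vertical hop through a bond via
print(f"planar ~{planar_ns:.2f} ns vs stacked ~{stacked_ns:.3f} ns")
```

Real wire delay is RC-dominated and repeater-limited rather than linear in distance, so this only captures the direction of the effect, not its magnitude.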
 

marees

Platinum Member
Apr 28, 2024
2,235
2,871
96
I have not read (or searched for the link) to full patent either. If anybody has the link handy, it would be appreciated.
RGT has got the links

RGT thinks this is Zen 8 or later.

AMD's SECRET WEAPON For Zen: Stacked L2 Cache Patent Analysis

a new secret weapon for future Zen CPUs has been revealed! In this video, we dive into everything revealed in the newly discovered/leaked AMD patent for stacked L2 cache for Ryzen processors. There is a lot to dive into today, but this is VERY exciting for the future generations of Ryzen CPUs, and could have huge impacts on not only the hardware specs and architecture, but also unlocking new levels of performance. But just what kind of specifications and performance improvements can we expect from AMD's new Ryzen tech for stacked L2 cache? And just when will we see this new technology implemented into AMD's roadmap for Ryzen gaming CPUs?

SOURCES
https://redgamingtech.com/amds-nightmare-intels-razor-hammer-lake-serpent-lake-leaks/
https://patents.google.com/patent/US20260003794A1/en?oq=US20260003794A1
https://www.latitudeds.com/post/amd-development-of-cache-architecture-from-planar-to-3d-integration
https://globaldossier.uspto.gov/details/US/18758517/A/125173
 

Thibsie

Golden Member
Apr 25, 2017
1,178
1,389
136
I'm not sure a stacked L2 would increase latency.
It would if it were a supplement to the L2 already on the die.
But if they strip the L2 out of the die and place it just on top of the core, logic dictates the latency would not go up; it should actually go down, since there's less distance for the signal to travel to move the data (IMHO).
 

marees

Platinum Member
Apr 28, 2024
2,235
2,871
96
implementation of a larger stacked cache than a typical planar cache, yet achieves the same or better cycle latency.

For example, a conventional planar 1 MB L2M cache has a 14 cycle latency, while a stacked 1 MB L2M cache implemented using the described techniques has only a 12 cycle latency.

It's not just better latency: AMD also discloses that the stacked L2 cache provides power savings too.

via AMD Research Paper (Google Patents)
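For scale, here is what the quoted 14 → 12 cycle change amounts to in wall-clock time, at an assumed example clock of 5 GHz (the frequency is an assumption, not from the patent):

```python
# Convert the quoted 14 -> 12 cycle L2 latencies to wall-clock time.
# The 5 GHz clock is an assumed example frequency.

def cycles_to_ns(cycles, ghz):
    """Latency in nanoseconds for a given cycle count and clock in GHz."""
    return cycles / ghz

planar_ns = cycles_to_ns(14, 5.0)   # conventional planar 1 MB L2
stacked_ns = cycles_to_ns(12, 5.0)  # stacked 1 MB L2 per the patent
saving = 1 - 12 / 14                # fractional cycle-count reduction
print(f"planar {planar_ns:.1f} ns, stacked {stacked_ns:.1f} ns, "
      f"{saving:.0%} fewer cycles")
```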

 

Joe NYC

Diamond Member
Jun 26, 2021
4,200
5,779
136
I think a bigger deal for stacked L2 cache would be its use in AI GPUs, given the amount of money behind them and the almost "money is no object" race to achieve max performance.

Most people think the implementation will come much later, but, curiously, AMD is using a quite expensive N3 base die for MI400. Something interesting has to be going on in that die if AMD is spending big bucks on it.

Since AMD is getting better at using L2 to derive better performance and to optimize external bandwidth utilization, AI GPUs need a lot more of that and have large budgets to make it happen.

Another tidbit: NVidia is trying to maximize memory bandwidth by twisting the arms of HBM suppliers to increase the clocks. In the meantime, AMD says: we will take the cheap, low speed / lower bandwidth HBMs
 

adroc_thurston

Diamond Member
Jul 2, 2023
8,479
11,199
106
I think a bigger deal for stacked L2 cache would be its use in AI GPUs, given the amount of money behind them and the almost "money is no object" race to achieve max performance.
you do understand that GPU caches are very different things built for very different reasons?
They have latency in tens or hundreds of cycles.
Something interesting has to be going on in that die if AMD is spending big bucks on it.
It's cheap given what those things will retail for. That's it.
Since AMD is getting better at using L2 to derive better performance and to optimize external bandwidth utilization, AI GPUs need a lot more of that and have large budgets to make it happen.
you should just stop.
Another tidbit: NVidia is trying to maximize memory bandwidth by twisting the arms of HBM suppliers to increase the clocks. In the meantime, AMD says: we will take the cheap, low speed / lower bandwidth HBMs
AMD ain't paying any less for HBM4, they're just not willing to deal with the thermal nightmare of 11Gbps HBM4 at the current DRAM nodes.