Info: 64MB V-Cache on 5XXX Zen3, Average +15% in Games


Kedas

Senior member
Dec 6, 2018
Well, we now know how they will bridge the long wait to Zen4 on AM5 in Q4 2022.
Production of V-Cache starts at the end of this year, which is too early for Zen4, so this is almost certainly coming to AM4.
The +15%, Lisa said, is "like an entire architectural generation".
 
  • Like
Reactions: Tlh97 and Gideon

LightningZ71

Platinum Member
Mar 10, 2017
While you can certainly increase the L2 capacity, the cost is:

a TON more connections between the CPU core and the stacked die. There's a reason the L2 cache sits tightly pressed against the core in every modern processor: there are a lot of connections between the two, and the farther apart they are, even by a few percent, the higher the latency and the more power each L2 access takes, and the L2 is accessed a LOT. L3 is accessed much less often, and even less so if the L2 size increases. When money and power are relatively low priorities, a giant L2 with a virtual L3 makes sense. When both are big issues, the balance currently swings the other way (a rough average-access-time sketch follows below).

If AMD eventually gets to the point where their CPU cores don't have to be all things to all people, they may go the same way. It just doesn't make sense with where they are.
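To put rough numbers on the L2-latency point above, here is a minimal average-memory-access-time (AMAT) sketch. The hit rates and cycle counts are illustrative assumptions, not measured Zen figures; the point is only that a small latency penalty on a frequently hit L2 costs far more than the same penalty on a rarely reached L3.

```python
# Toy AMAT model: illustrative hit rates and latencies (cycles), not real Zen numbers.
def amat(l1_hit, l2_hit, l3_hit, l1_lat=4, l2_lat=12, l3_lat=46, mem_lat=200):
    """Average memory access time for a 3-level hierarchy plus DRAM."""
    miss_l1 = 1.0 - l1_hit
    miss_l2 = 1.0 - l2_hit
    miss_l3 = 1.0 - l3_hit
    return (l1_lat
            + miss_l1 * (l2_lat
            + miss_l2 * (l3_lat
            + miss_l3 * mem_lat)))

base    = amat(0.90, 0.80, 0.60)                    # everything on-die
slow_l2 = amat(0.90, 0.80, 0.60, l2_lat=12 + 4)     # +4 cycles if L2 moved to a stacked die
slow_l3 = amat(0.90, 0.80, 0.60, l3_lat=46 + 4)     # +4 cycles on the (rarely reached) L3

print(f"baseline      : {base:.2f} cycles")
print(f"L2 +4 cycles  : {slow_l2:.2f} cycles")      # hurts every L1 miss (10% of accesses)
print(f"L3 +4 cycles  : {slow_l3:.2f} cycles")      # only hurts L1+L2 misses (2% of accesses)
```

With these made-up numbers, the same 4-cycle penalty costs roughly five times as much when applied to the L2 as to the L3, which is why stacking extra capacity on the L3 is the easier win.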
 
Jul 27, 2020
If AMD eventually gets to the point where their CPU cores don't have to be all things to all people, they may go the same way. It just doesn't make sense with where they are.
They really should release a gaming-optimized Ryzen. Take away all the trade-offs necessary to balance ST and MT performance and just turn the CPU into an ST beast. They could release a separate gaming-optimized Ryzen 5 SKU, since that's their most popular CPU among gamers anyway.
 
  • Like
Reactions: Joe NYC

gorobei

Diamond Member
Jan 7, 2007
That's a lot of distance to cover for an L2 cache. That's also a lot more connections that will need to be driven between the dies.

I don't see any sort of advantage to having L2 on the stack.
See smack middle of slide:

[attached slide from AMD's Hot Chips 33 packaging presentation]


AMD details its 3D packaging technology at Hot Chips 33 | OC3D News (overclock3d.net)

AMD has ambitious packaging ideas, it seems.
Dr. Ian covered the IBM presentation on their upcoming Z16/Telum platform.
IBM is ditching the L3 and L4 caches: all the cores will have a 32MB private L2 that can move data out to other cores' L2s (a virtualized L3), or even to other sockets/chips as a virtual L4.
Intel has cross-licensing with IBM, so AMD may be prepping for the future if this works out.
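For what it's worth, here is a toy sketch of that virtual-L3 idea: lines evicted from one core's private L2 get parked in a peer core's L2 instead of being dropped, and a local miss probes the peers before going to memory. The capacities and policies (pick the emptiest peer, plain LRU, etc.) are invented for illustration and are not IBM's actual algorithm.

```python
# Toy model of a "virtual L3": private L2s that absorb each other's evictions.
# Capacities, placement policy and eviction policy are illustrative only.
from collections import OrderedDict

L2_LINES = 4  # tiny per-core L2 capacity (in cache lines) to keep the example readable

class Core:
    def __init__(self, name):
        self.name = name
        self.l2 = OrderedDict()  # address -> data, ordered by recency (LRU at the front)

    def install(self, addr, data):
        if addr in self.l2:
            self.l2.move_to_end(addr)
            self.l2[addr] = data
            return None
        if len(self.l2) >= L2_LINES:
            victim = self.l2.popitem(last=False)   # evict the LRU line
            self.l2[addr] = data
            return victim                          # hand the victim to the virtual L3
        self.l2[addr] = data
        return None

def spill(victim, peers):
    """Park an evicted line in the peer with the most free L2 space, if any."""
    if victim is None:
        return
    addr, data = victim
    target = min(peers, key=lambda p: len(p.l2))
    if len(target.l2) < L2_LINES:
        target.l2[addr] = data                     # the line survives as "virtual L3"

def access(core, peers, memory, addr):
    """Read addr: local L2, then peers' L2s (virtual L3), then memory."""
    if addr in core.l2:
        core.l2.move_to_end(addr)
        return core.l2[addr], "local L2 hit"
    for peer in peers:
        if addr in peer.l2:
            data = peer.l2.pop(addr)               # migrate the line back to the requester
            spill(core.install(addr, data), peers)
            return data, f"virtual-L3 hit in {peer.name}"
    data = memory[addr]
    spill(core.install(addr, data), peers)
    return data, "memory"

memory = {a: f"data{a}" for a in range(16)}
c0, c1 = Core("core0"), Core("core1")
for a in range(6):                                 # overflow core0's L2; spills land in core1
    access(c0, [c1], memory, a)
print(access(c0, [c1], memory, 0))                 # reports a virtual-L3 hit in core1
```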
 

Joe NYC

Diamond Member
Jun 26, 2021
They really should release a gaming-optimized Ryzen. Take away all the trade-offs necessary to balance ST and MT performance and just turn the CPU into an ST beast. They could release a separate gaming-optimized Ryzen 5 SKU, since that's their most popular CPU among gamers anyway.

Yes!

I have been saying this in this thread, but most people seem to think that AMD should follow the segmentation games Intel used to play back when AMD was not competitive.

And one of the main reasons, I was told, that AMD should keep its hands tied with these segmentation games is that if AMD released a gaming-optimized CPU, it could possibly hurt Threadripper sales. Like, perhaps, Threadripper would go from a single asterisk to a double asterisk. As in:
* less than 0.5%
** not measurable
Or, 10 CPUs to 5 CPUs.


 

Asterox

Golden Member
May 15, 2012
Yes!

I have been saying this in this thread, but most people seem to think that AMD should follow the segmentation games Intel used to play back when AMD was not competitive.

And one of the main reasons, I was told, that AMD should keep its hands tied with these segmentation games is that if AMD released a gaming-optimized CPU, it could possibly hurt Threadripper sales. Like, perhaps, Threadripper would go from a single asterisk to a double asterisk. As in:
* less than 0.5%
** not measurable
Or, 10 CPUs to 5 CPUs.




Well, the R5 5600X is already a gaming-optimized CPU that is very efficient in terms of power consumption versus performance. :grinning:

As usual, the Mindfactory CPU sales numbers: the R5 5600X is selling like crazy, 2,800+ units sold in the past week.

Back-to-school CPU sales numbers, hm, even though we have a big GPU shortage and high GPU prices.
 

Joe NYC

Diamond Member
Jun 26, 2021
Dr. Ian covered the IBM presentation on their upcoming Z16/Telum platform.
IBM is ditching the L3 and L4 caches: all the cores will have a 32MB private L2 that can move data out to other cores' L2s (a virtualized L3), or even to other sockets/chips as a virtual L4.
Intel has cross-licensing with IBM, so AMD may be prepping for the future if this works out.

That virtual cache does sound interesting, but it is one of many approaches to take.

It is quite possible that a brute-force approach of just slapping on more cache can achieve some of the latency gains.

With all that L3, I am going to make a guess that Zen 4, with the beefy new IOD of Genoa, will have a mechanism that allows all of the L3s of all the chiplets (8-16 chiplets) to act as one large, shared L3/L4.
 

Timorous

Golden Member
Oct 27, 2008
That virtual cache does sound interesting, but it is one of many approaches to take.

It is quite possible that a brute-force approach of just slapping on more cache can achieve some of the latency gains.

With all that L3, I am going to make a guess that Zen 4, with the beefy new IOD of Genoa, will have a mechanism that allows all of the L3s of all the chiplets (8-16 chiplets) to act as one large, shared L3/L4.

Or just stack L4 on the IO die since that is meant to be built on N6.
 

Joe NYC

Diamond Member
Jun 26, 2021
Or just stack L4 on the IO die since that is meant to be built on N6.

That could be a possibility, but L3 stacking on CCDs is probably more of a sure thing. A baseline.

BTW, Sapphire Rapids already has a shared L3, spread across the 4 tiles.
 
  • Like
Reactions: Tlh97

Timorous

Golden Member
Oct 27, 2008
That could be a possibility, but L3 stacking on CCDs is probably more of a sure thing. A baseline.

BTW, Sapphire Rapids already has a shared L3, spread across the 4 tiles.

I am sure both are possible, maybe not with Zen 4 but I don't see why L4 can't sit on the IO die and L3 on the core dies. It would make a pretty good server SKU.
 

LightningZ71

Platinum Member
Mar 10, 2017
For Zen4 desktop and above, it probably makes more sense to just increase the bandwidth between the CCDs and the I/O die as main memory performance increases with DDR5. That additional bandwidth can then be used to pool the spare capacity in each CCD's L3, which could be huge with stacking, as a virtual L4. It will respond much faster than main memory and likely transfer more quickly as well.

It'll certainly make for a higher package power, but it can save memory-controller power by needing the controller less as well.
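As a rough sanity check on that idea, here is a back-of-the-envelope comparison of a remote-CCD L3 hit over the fabric versus going straight to DRAM. All of the latencies are assumed round numbers for illustration, not measured Infinity Fabric or DDR5 figures.

```python
# Back-of-the-envelope: is a "virtual L4" in a remote CCD's L3 worth the fabric hop?
# All latencies in nanoseconds are assumptions for illustration, not measured values.
local_l3_ns   = 10    # hit in the local CCD's (stacked) L3
fabric_hop_ns = 25    # assumed penalty to reach another CCD through the IOD, each way
remote_l3_ns  = local_l3_ns + 2 * fabric_hop_ns   # request + response across the fabric
dram_ns       = 80    # assumed DDR5 access latency

for virtual_l4_hit in (0.25, 0.50, 0.75):
    # Average latency of an access that has already missed the local L3:
    avg = virtual_l4_hit * remote_l3_ns + (1 - virtual_l4_hit) * dram_ns
    print(f"virtual-L4 hit rate {virtual_l4_hit:.0%}: {avg:.0f} ns vs {dram_ns} ns straight to DRAM")
```

With these assumptions the remote L3 only pays off when its hit rate is reasonably high, which is consistent with the post's point that more (and cheaper) fabric bandwidth is what would make the scheme attractive.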
 
  • Like
Reactions: Tlh97 and Joe NYC

eek2121

Diamond Member
Aug 2, 2005
AMD should have released the Zen 3 Threadripper 6 months ago, even with very limited availability. That would have better positioned them against Intel: when Intel says "here is our new stuff vs. AMD," they would run into a wall of shit against Threadripper.
If AMD is delaying Threadripper so much, I wonder if it is going to be released with V-Cache. What sense would it make to release it in November without V-Cache and then, a few months later, another set of SKUs with V-Cache?
Threadripper may have V-Cache in the future, but the current SKUs planned for release do not have V-Cache, unless AMD has managed to keep a couple of SKUs hidden from leakers (some of whom work in the fabs that produce these products).
In my opinion, your expectations are too high. According to the Chinese forum posts about Golden Cove and Cortex-X2, Zen 4 is a 20-25% increase in IPC.
I doubt that ST performance will rise by more than 15% for Zen4 on 5nm, as N5 doesn't really clock that much higher than N7/N6. Any improvement in performance would have to come from major changes in architecture and/or improved caching. I can see a doubling of L2 being an obvious place for improvement. The enlarged FPU and its improved capabilities can easily account for a big situational improvement in some benchmarks, skewing the "average" improvement heavily.

Where I expect to see a bigger uplift is in MT scores. N5 does do better with power by a notable degree and will allow usefully increased all-core boost performance. One thing we didn't see as big a boost in from Zen 2 to Zen 3 on the desktop was MT performance. Part of this was due to the fact that both were produced on TSMC N7, with very little improvement in power draw per core. The improvement in memory throughput from DDR5 will also be a nice uplift in memory-bound applications.
Golden Cove is significantly faster than Zen 3 -- at least 15%. If you add another 20-25%, you end up with 35-40%. Not far off from my guess. However, anyone claiming to know final performance numbers likely does not know what they are talking about, as Zen 4, to my knowledge, has not been taped out. That said, a redesigned FPU that supports AVX-512 and a number of new features will likely provide a significant uplift.
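Taking the percentages in the post at face value (they are the poster's estimates, not confirmed figures), the two gains stack multiplicatively rather than add:

```python
# The post's figures, compounded multiplicatively rather than simply added.
golden_cove_vs_zen3 = 1.15          # "at least 15%" (poster's estimate)
for zen4_ipc_gain in (1.20, 1.25):  # "20-25% increase in IPC" (rumored)
    print(f"{(golden_cove_vs_zen3 * zen4_ipc_gain - 1):.0%} over Zen 3")
# prints 38% and 44%, i.e. a bit above the 35-40% quoted in the post
```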

Also note that, regardless, I was specifically talking about multi-core performance (not single-core, though RPL-S will have gains over ADL-S in that area). Golden Cove will actually have a tiny regression in single-core performance due to the lack of AVX-512 (it is disabled in the BIOS). Note that none of that applies to Raptor Cove (as of yet). AMD will either need to bump up the core count or provide higher IPC.

I saw the concern about hot spots, but could some of those stacking ideas, depending on the power envelope they're looking at, bring power savings for products meant to be mobile?
I suspect we won't see mobile products with V-Cache for a while unless AMD finds a way to make products with V-Cache cheaper than similarly performing products without it. People often forget AMD is a business, a publicly traded business, and as such they have a desire to grow margins. Product X may be faster than Product Y, but that does not matter if Product X costs even a dollar more than Product Y, unless the competition is delivering something more competitive or AMD can upsell Product X at a premium (think 5950X vs 5900X).
 

LightningZ71

Platinum Member
Mar 10, 2017
Alder Lake, especially the 6+8 in the "H" chips, should be a significant pain for AMD. The most threads that AMD can currently put in a true mobile product is 16, with only 8 hardware cores. That 6+8 mobile SKU should be able to put down 20 threads, with 14 physical cores, in a similar power envelope. If Windows 11 scheduling is even marginally better than Windows 10 scheduling, that 6+8 H chip should wipe the floor with Cezanne and Barcelo, and will probably do just as well against Rembrandt and Van Gogh. From what we think we know, even the Zen4 APU is expected to be just 8 cores again, and not hybrid at all. Even giving Zen4 its full expected improvements, such a chip should still struggle against a 6+8 Alder Lake.

As for the 2+8 mobile-specific part, that will be interesting to watch. 12 threads will compete well with the 6-core APUs. I feel that the 8-core APUs can still outperform it, especially an 8-core Cezanne on N6 with its power improvements, but it'll definitely munch on power to do so.
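The thread math in the post follows from the P-cores supporting SMT (two threads each) while the E-cores do not, which is how Alder Lake's Golden Cove/Gracemont cores are specified:

```python
# Alder Lake thread counts: P-cores are 2-way SMT, E-cores are single-threaded.
def adl_threads(p_cores, e_cores):
    return p_cores * 2 + e_cores

for p, e in ((6, 8), (2, 8)):
    print(f"{p}+{e}: {p + e} physical cores, {adl_threads(p, e)} threads")
# 6+8: 14 physical cores, 20 threads
# 2+8: 10 physical cores, 12 threads
```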
 

Joe NYC

Diamond Member
Jun 26, 2021
Threadripper may have V-Cache in the future, but the current SKUs planned for release do not have V-Cache, unless AMD has managed to keep a couple of SKUs hidden from leakers (some of whom work in the fabs that produce these products).

I have not seen any leaks showing Threadripper 5000x without V-Cache (or with V-Cache).

Can you point me to any leaks you came across showing this?
 

yuri69

Senior member
Jul 16, 2013
Alder Lake, especially the 6+8 in the "H" chips, should be a significant pain for AMD. The most threads that AMD can currently put in a true mobile product is 16, with only 8 hardware cores. That 6+8 mobile SKU should be able to put down 20 threads, with 14 physical cores, in a similar power envelope. If Windows 11 scheduling is even marginally better than Windows 10 scheduling, that 6+8 H chip should wipe the floor with Cezanne and Barcelo, and will probably do just as well against Rembrandt and Van Gogh. From what we think we know, even the Zen4 APU is expected to be just 8 cores again, and not hybrid at all. Even giving Zen4 its full expected improvements, such a chip should still struggle against a 6+8 Alder Lake.

As for the 2+8 mobile-specific part, that will be interesting to watch. 12 threads will compete well with the 6-core APUs. I feel that the 8-core APUs can still outperform it, especially an 8-core Cezanne on N6 with its power improvements, but it'll definitely munch on power to do so.
This moar-cores approach in a highly power-constrained environment is obvious. The small cores should do well in throughput benchmarks.
However, does it match reality? Average notebook workloads rarely benefit from 8+ threads. Even developer-class workloads mainly require heaps of RAM.
 

Joe NYC

Diamond Member
Jun 26, 2021
For Zen4 desktop and above, it probably makes more sense to just increase the bandwidth between the CCDs and the I/O die as main memory performance increases with DDR5. That additional bandwidth can then be used to pool the spare capacity in each CCD's L3, which could be huge with stacking, as a virtual L4. It will respond much faster than main memory and likely transfer more quickly as well.

It'll certainly make for a higher package power, but it can save memory-controller power by needing the controller less as well.

The link between the CCD and the I/O die is a bottleneck that is holding up many advances for AMD. As much as Infinity Fabric On Package facilitated the chiplet breakthrough, in a few short years it is going from a breakthrough to an albatross...

There is some interesting speculation in the latest AnandTech article, but it doesn't exactly fix the problem at hand, unless there are 4 interposers, each connecting a group of chiplets to the IOD (at the 4 corners of the IOD).
 

LightningZ71

Platinum Member
Mar 10, 2017
I suspect that the current package will have to modernize a bit. It is still, essentially, just an MCM with nothing special about the substrate. For Zen4 on AM5, it seems to me that they will need some sort of more advanced package, maybe with something like EMIB, or what they used on the Radeon VII for the HBM modules. That's sort of a given. The other possibility is just stacking the CCDs on the IOD. That could be the delay in getting Zen4 out the door on desktop.
 

yuri69

Senior member
Jul 16, 2013
I suspect that the current package will have to modernize a bit. It is still, essentially, just an MCM with nothing special about the substrate. For Zen4 on AM5, it seems to me that they will need some sort of more advanced package, maybe with something like EMIB, or what they used on the Radeon VII for the HBM modules. That's sort of a given. The other possibility is just stacking the CCDs on the IOD. That could be the delay in getting Zen4 out the door on desktop.
There is a nice paper by AMD covering the packaging topic.

Even for Rome/Milan they picked the classic MCM substrate. The reasoning behind this decision is:
* 1 CCD to IOD has to provide "only" up to 2xDDR4 bandwidth, and the server average is even lower. Interposer tech provides superior bandwidth, which was not really required for the classic 2xDDR4 figure. Radeon VII packaging provides 1TBps; 1 CCD to IOD should be more like 0.06TBps (rough math below).
* Both EMIB and interposer solutions provide great bandwidth but require quite short signal routes => the dies have to sit nearly next to each other. Placing Rome/Milan's 8 CCDs around the IOD would be impractical.
* Even when aiming for longer signal routes at the expense of bandwidth, there is still the reticle limit. Nvidia hit that expensive limit with the V100 or so.
* $$$ for interposer/EMIB.
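The rough math behind that figure, assuming DDR4-3200 as the speed grade (the exact speed and per-channel efficiency are assumptions, hence ~0.05-0.06 TB/s):

```python
# Rough check of the "2x DDR4 per CCD" bandwidth claim, assuming DDR4-3200.
channels  = 2
mt_per_s  = 3200             # DDR4-3200: 3200 MT/s (assumed speed grade)
bus_bytes = 8                # 64-bit channel = 8 bytes per transfer
ddr4_gbps = channels * mt_per_s * bus_bytes / 1000    # GB/s
print(f"2x DDR4-3200  : {ddr4_gbps:.1f} GB/s (~{ddr4_gbps / 1000:.2f} TB/s)")
print("Radeon VII HBM: 1024.0 GB/s (~1.00 TB/s)")     # 4 stacks of HBM2 at 2 Gbps/pin
```

So an IFoP link only needs a few percent of what an interposer can deliver, which is the post's point.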

So far Genoa seems to follow Rome/Milan. However, there is still Trento which might use some crazy stuff.
 

Joe NYC

Diamond Member
Jun 26, 2021
If anyone besides you had asked, I would have provided it. However, as you like to argue and post hundreds of forum posts, I will not be engaging in this discussion with you. Sorry.

That's an awfully convenient answer when caught BSing.

I spent a considerable amount of time searching for the answer and came up with nothing, so it was a genuine inquiry; perhaps I missed something.

Here is what I did find - 2 unique tidbits of info on the upcoming Threadripper, and neither of them said anything about cache.

1. Puget bench leak
2. MilkyWay@Home leak

So do you have something, or are you just BSing?
 

JoeRambo

Golden Member
Jun 13, 2013
I am not even sure Threadripper is the right product for 3D cache in a world of limited wafers. HEDT is a niche within a niche already, and I think there are two main "volume" uses for such chips:

1) Rendering - at least Cinebench does not scale with L3 cache size once it has enough of it, and we are long past that point. Probably a similar story with other renderers that are still stuck on the CPU and not using GPUs for whatever reasons.
2) IO-heavy setups with GPUs or storage; these don't care about caches either.

Other than that - for example, compiler-style usage - it remains to be seen how much it is helped by increased L3; it is usually a thread-count affair as long as the memory subsystem is competent, and Zen 3 already has one in place.
The generic workstation for modelling, CAD and so on might benefit somewhat, as it's usually all about a few strong threads plus GPU acceleration there, but 32MB of L3 is already plenty good.

Desktop and servers are where 3D cache can help the most, and even on desktop AMD is touting gaming as the area where the benefits are largest: it is hard to escape the fact that to render each frame a game has to go over a huge working set of data and communicate with the DirectX or Vulkan runtime, which in turn communicates with GPU drivers that chew on that freshly generated data and render it on screen. So a ton of benefit can be had here by using the enlarged L3 for that communication and avoiding going to memory.

But even in gaming there are limits. Just like AMD during the early Zen days was fond of GPU-bottlenecked games, 3D cache will take "competitive" games with absurd FPS rates to really shine, while giving little benefit to the "averages" when GPU bottlenecked. Frankly, it will help the most in minimums and consistency, an area that AMD ironically overlooked over the Zen/Zen 2 era and will now need to tout hard.
 

Joe NYC

Diamond Member
Jun 26, 2021
I suspect that the current package will have to modernize a bit. It is still, essentially, just an MCM with nothing special about the substrate. For Zen4 on AM5, it seems to me that they will need some sort of more advanced package, maybe with something like EMIB, or what they used on the Radeon VII for the HBM modules. That's sort of a given.

At this point, EMIB would be superior to IFoP from a bandwidth, power and latency perspective, but EMIB has some adjacency limitations.

But there are also some leaks coming from the RDNA3 side about an MCM approach that links adjacent chips using 3D stacking (allegedly). This solution would be significantly better than EMIB as far as bandwidth and power go, with maybe a tiny latency improvement as well. But it also comes with significant adjacency limitations.

The other possibility is just stacking the CCDs on the IOD. That could be the delay in getting Zen4 out the door on desktop.

If they could figure out perhaps a partial stacking, where the edge of the CCD would be stacked on top of the IOD, that would be one way to do it.

But I am afraid that this ship (Zen 4) has probably sailed, and AMD will likely be stuck with the inferior MCM interconnect technology for this upcoming cycle... unless there is some surprise that has not leaked.
 

DrMrLordX

Lifer
Apr 27, 2000
I am not even sure Threadripper is the right product for 3D cache in a world of limited wafers.

Not sure if salvaging of Milan-X CCDs is a factor here? Do they test the CCDs before bonding the cache die to the CCD?
 

Joe NYC

Diamond Member
Jun 26, 2021
I am not even sure Threadripper is the right product for 3D cache in a world of limited wafers.

This might be something to consider if Threadripper had any volume, like, say, the old Intel HEDT platform. But since Threadripper has minuscule volume, the wafers needed to add V-Cache for Threadripper are likewise minuscule.

HEDT is a niche within a niche already, and I think there are two main "volume" uses for such chips:
1) Rendering - at least Cinebench does not scale with L3 cache size once it has enough of it, and we are long past that point. Probably a similar story with other renderers that are still stuck on the CPU and not using GPUs for whatever reasons.
2) IO-heavy setups with GPUs or storage; these don't care about caches either.

Other than that - for example, compiler-style usage - it remains to be seen how much it is helped by increased L3; it is usually a thread-count affair as long as the memory subsystem is competent, and Zen 3 already has one in place.
The generic workstation for modelling, CAD and so on might benefit somewhat, as it's usually all about a few strong threads plus GPU acceleration there, but 32MB of L3 is already plenty good.

I think AMD's product positioning misfired with Threadripper. They had a perfect opportunity to claim 2 different market niches with 2 different products:
- workstation
- HEDT

AMD already had 2 platforms perfectly positioned for these 2 niches:
- 8 memory channels for workstation
- 4 memory channels for HEDT

And they also have two brand names that would seemingly fit these 2 market segments perfectly:
- Threadripper
- Threadripper Pro

But where AMD dropped the ball is the SKUs that target these 2 segments. The Threadripper SKU targeting is not only incomprehensible, it is counterintuitive.

For example, Threadripper Pro, which would be more suited to the workstation (8-channel) segment, starts from 12 and 16 cores. But the vanilla Threadripper, which would be well suited to HEDT, starts at 24 cores and up.

Desktop and servers are where 3D cache can help the most, and even on desktop AMD is touting gaming as the area where the benefits are largest: it is hard to escape the fact that to render each frame a game has to go over a huge working set of data and communicate with the DirectX or Vulkan runtime, which in turn communicates with GPU drivers that chew on that freshly generated data and render it on screen. So a ton of benefit can be had here by using the enlarged L3 for that communication and avoiding going to memory.

But even in gaming there are limits. Just like AMD during the early Zen days was fond of GPU-bottlenecked games, 3D cache will take "competitive" games with absurd FPS rates to really shine, while giving little benefit to the "averages" when GPU bottlenecked. Frankly, it will help the most in minimums and consistency, an area that AMD ironically overlooked over the Zen/Zen 2 era and will now need to tout hard.

There are some nice opportunities on desktop, and AMD could also play the high end of this segment with a 3D-cache Threadripper on the non-Pro, 4-channel platform. Perhaps starting from as little as a single CCD with more layers of V-Cache at ~$500 and going up from there. What made Intel HEDT successful, and what generated most of the volume for the platform, were not the $1,000 CPUs but the $500 CPUs.

Just from knowing what I would ideally want in my upgrade from Intel HEDT to AMD HEDT - the chances are that AMD will not sell it and will miss the boat again.

If the Threadripper present is a good predictor of the Threadripper future, chances are that AMD will have all the components I want, but will misconfigure them into a Frankenstein configuration I don't want.