Info 64MB V-Cache on 5XXX Zen3 Average +15% in Games


Kedas

Senior member
Dec 6, 2018
355
339
136
Well, we now know how they will bridge the long wait for Zen 4 on AM5 in Q4 2022.
Production start for V-Cache is the end of this year, which is too early for Zen 4, so this is certainly coming to AM4.
The +15% that Lisa cited is "like an entire architectural generation".
 
  • Like
Reactions: Tlh97 and Gideon

eek2121

Platinum Member
Aug 2, 2005
2,930
4,026
136
FYI, according to Ian:

"Confirmed with AMD that V-Cache will be coming to Ryzen Zen 3 products, with production at end of year. "

Zen 3 it is. AMD gets to skip a generation. Let's go!
 
  • Like
Reactions: cytg111

Kedas

Senior member
Dec 6, 2018
355
339
136
About that YouTube video: there is some wrong info in there. The CPU Lisa showed had 64MB on both dies; they just removed the top layer of one die so you could see the V-cache.

The fact that it can be switched off is a nice feature for low-power usage.
Zen 3 was built with this add-on in mind, so that info took a long time to get out...
 

Gideon

Golden Member
Nov 27, 2007
1,633
3,663
136
I'm a bit confused about performance expectations. Someone cited Broadwell with its extra cache, noting that it only performed better in games.
Broadwell's L4 was a different beast though, since it was an MCM solution.
  1. It had poor bandwidth for a cache (50 GiB/s, vs 25.6 GiB/s for DDR3-1600)
  2. It also had latency that was poor for a cache, around 150 cycles (vs 200+ cycles for DDR3)

AMD's solution, meanwhile, has:
  1. 2 TB/s of bandwidth
  2. According to this, latency indiscernible from the rest of the L3, which would be around 50 cycles.
So despite the overall cache sizes being similar (128MB vs 96MB), V-cache can transfer 40x more data at once while doing it over 3x faster.
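A quick back-of-the-envelope check of those two ratios (units treated loosely, GB/s vs GiB/s; the point is the order of magnitude):

```python
# Rough check of the Broadwell eDRAM vs V-cache comparison above.
edram_bw = 50      # Broadwell L4 eDRAM bandwidth, ~50 GiB/s
vcache_bw = 2000   # AMD's quoted 2 TB/s for the stacked L3

edram_lat = 150    # Broadwell L4 latency, ~150 cycles
vcache_lat = 50    # V-cache, roughly on-die L3 latency, ~50 cycles

print(f"bandwidth ratio: {vcache_bw / edram_bw:.0f}x")   # -> 40x
print(f"latency ratio:   {edram_lat / vcache_lat:.0f}x") # -> 3x
```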

There will still be plenty of use-cases that won't benefit much from the extra cache, but it should be much more capable than Broadwell.
I'm particularly interested in software compiling benchmarks as these tend to scale very well with extra cache.
 
Jul 27, 2020
16,280
10,318
106
A large L4 cache might be required for most enthusiast-level future CPUs paired with DDR5, to hide the rumored higher DRAM latencies, the one critical thing that DDR5 doesn't improve upon.
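A minimal sketch of why a big cache hides DRAM latency, using the usual average-memory-access-time formula. All the numbers below are made-up assumptions for illustration, not measurements:

```python
# AMAT = hit time + miss rate * miss penalty. Illustrative numbers only.
def amat(hit_ns, hit_rate, miss_penalty_ns):
    return hit_ns + (1.0 - hit_rate) * miss_penalty_ns

dram_ns = 80.0       # assumed DDR5 load-to-use latency
l4_ns = 25.0         # assumed latency of a large L4
l4_hit_rate = 0.6    # assumed fraction of L3 misses the L4 catches

without_l4 = amat(10.0, 0.90, dram_ns)
with_l4 = amat(10.0, 0.90, l4_hit_rate * l4_ns + (1 - l4_hit_rate) * dram_ns)
print(f"AMAT: {without_l4:.1f} ns without L4, {with_l4:.1f} ns with L4")
```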
 

Gideon

Golden Member
Nov 27, 2007
1,633
3,663
136
A large L4 cache might be required for most enthusiast-level future CPUs paired with DDR5, to hide the rumored higher DRAM latencies, the one critical thing that DDR5 doesn't improve upon.
Are you sure? GeIL promised 10ns true-latency modules (7200 MT/s @ CL36 and 6400 MT/s @ CL32) at launch, which is exactly the same as the vast majority of overclocked DDR4 modules. XPG even promises 7400 MT/s modules at unknown latencies.
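For reference, "true latency" is just the CAS latency times the clock period, and DDR transfers twice per clock, so the check is a one-liner:

```python
# True latency in ns = CL (cycles) * clock period; period_ns = 2000 / (MT/s).
def true_latency_ns(mt_s: int, cl: int) -> float:
    return cl * 2000.0 / mt_s

print(true_latency_ns(7200, 36))  # DDR5-7200 CL36 -> 10.0 ns
print(true_latency_ns(6400, 32))  # DDR5-6400 CL32 -> 10.0 ns
print(true_latency_ns(3200, 16))  # typical tuned DDR4-3200 CL16 -> 10.0 ns
```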

I know that Ian was worried about latencies in this article, but that does not seem to have materialized.
 

JoeRambo

Golden Member
Jun 13, 2013
1,814
2,105
136
This is on a completely different level than eDRAM or HBM, since it is good old L3 cache, made huge by AMD.

eDRAM or HBM are "L4" solutions, so they either need tags (which take space that could be used by L3 cache) or they serve as a so-called "system cache" on the memory side of things, acting as a huge buffer. (There is also the possibility of outright replacing some DRAM with, say, HBM, so that the first 16GB of address space are served by HBM and so on, but that is a different solution.)

All of the above sounds complex, and it is not without drawbacks, obvious and hidden. For example, tag checking is not free: on every L3 miss you need time and energy to check whether your cache line is in L4, so your average memory latency has grown. Memory-side system caches add complexity and also use energy, while not being that effective.
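A minimal sketch of that tag-check cost, with made-up numbers purely for illustration:

```python
# Every L3 miss must probe the L4 tags before going to DRAM, so an L4 miss
# pays the tag lookup on top of the full DRAM trip. Illustrative numbers only.
tag_ns = 8.0        # assumed L4 tag-lookup time
l4_ns = 30.0        # assumed L4 data latency (includes its own tag check)
dram_ns = 80.0      # assumed DRAM latency
l4_hit_rate = 0.5   # assumed fraction of L3 misses that hit in L4

penalty = l4_hit_rate * l4_ns + (1 - l4_hit_rate) * (tag_ns + dram_ns)
print(f"avg L3-miss penalty: {penalty:.1f} ns with L4 vs {dram_ns:.1f} ns without")
```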

AMD's solution, meanwhile, is good old L3, just made huge. They have 8x4MB slices now, and most likely 16 more will be added, with a resulting increase in cumulative bandwidth and a reduction in the pending-request queues that build up today due to address-bit collisions. AMD is citing 2TB/s, which is pretty much the bandwidth of their L2 cache at the clocks of a 5900X. It is an INCREDIBLE achievement to have an L3 cache with bandwidth near L2 bandwidth, and it really opens things up for FP performance, and MT performance in general, thanks to prefetching.
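To illustrate the slice/hashing point (AMD's actual hash is undocumented, so the function below is a hypothetical stand-in): each line address hashes to one slice, and more slices spread the same access stream across more independent banks, which is where the cumulative bandwidth and the shorter queues come from.

```python
# Hypothetical slice-selection hash; AMD's real hash is not public.
def slice_for(addr: int, num_slices: int) -> int:
    line = addr >> 6  # 64-byte cache lines
    # fold upper bits in so strided streams don't pile onto one slice
    return (line ^ (line >> 7) ^ (line >> 14)) % num_slices

addrs = [0x10000 + 64 * i for i in range(16)]
print([slice_for(a, 8) for a in addrs])   # 8 slices today
print([slice_for(a, 24) for a in addrs])  # if 16 stacked slices are added
```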

AMD somehow went from an also-ran in cache design, with caches that were either questionable or behind Intel's, to beating the hell out of everyone in latency and capacity.
 

leoneazzurro

Senior member
Jul 26, 2016
927
1,452
136
Not to mention that if the latencies of the off-CCD L3 are very similar to the standard L3, this would point to multiple CCDs without L3 being connected to a large off-die L3 cache, improving the communication and coherency among CCDs by orders of magnitude.
 

JoeRambo

Golden Member
Jun 13, 2013
1,814
2,105
136
Not to mention that if the latencies of the off-CCD L3 are very similar to the standard L3, this would point to multiple CCDs without L3 being connected to a large off-die L3 cache, improving the communication and coherency among CCDs by orders of magnitude.

No, inter-CCD latencies remain the same and there are no interconnections anywhere. They are stacking (on each CCD) an additional layer of silicon, full of SRAM, on top of the existing L3 area and using vias to connect it, achieving huge bandwidth and all the signalling required for the L3 to function. Latency is great because it is basically the same physical distance to the consumers, and it uses the same address hashing to put lines into slices (just with a greater number of slices).
No need to invent complicated schemes when 3D stacking has solved the problems.
 

leoneazzurro

Senior member
Jul 26, 2016
927
1,452
136
No, inter-CCD latencies remain the same and there are no interconnections anywhere. They are stacking an additional layer of silicon, full of SRAM, on the existing L3 area and using vias to connect it to achieve bandwidth. Latency is great because it is basically the same physical distance to the consumers, and it uses the same address hashing to put lines into slices (just with a greater number of slices).
No need to invent complicated schemes when 3D stacking has solved the problems.

I mean in future CPUs, not the one demoed there.
 

JoeRambo

Golden Member
Jun 13, 2013
1,814
2,105
136
I mean in future CPUs, not the one demoed there.

Again, why complicate things? If the current L3 is not connected, why should an extension of the current L3 be connected? CCDs are physically separate and already connected via the IOD.
It would be way more interesting to use 3D-stacked SRAM slices on top of the IOD as a sort of L4 cache, throwing silicon at improving inter-CCD and inter-socket scalability.
 

eek2121

Platinum Member
Aug 2, 2005
2,930
4,026
136
A large L4 cache might be required for most enthusiast-level future CPUs paired with DDR5, to hide the rumored higher DRAM latencies, the one critical thing that DDR5 doesn't improve upon.
Are you sure? GeIL promised 10ns true-latency modules (7200 MT/s @ CL36 and 6400 MT/s @ CL32) at launch, which is exactly the same as the vast majority of overclocked DDR4 modules. XPG even promises 7400 MT/s modules at unknown latencies.

I know that Ian was worried about latencies in this article, but that does not seem to have materialized.

Yep, DDR5 latency will be similar to DDR4.

The extra cache should actually help most workloads, not just gaming.
 
  • Like
Reactions: Tlh97 and Makaveli

leoneazzurro

Senior member
Jul 26, 2016
927
1,452
136
I was not speaking about the current architecture, but about future ones (not the immediate future). That is why I spoke about moving the L3 cache away from the CCDs. An L4 is also viable, and it would probably be the first step, but in the long run it adds another level to the memory hierarchy and thus adds latency.
 

IntelUser2000

Elite Member
Oct 14, 2003
8,686
3,785
136
Not to mention that if the latencies of the off-CCD L3 are very similar to the standard L3, this would point to multiple CCDs without L3 being connected to a large off-die L3 cache, improving the communication and coherency among CCDs by orders of magnitude.

This won't happen. Off-die will increase latency. On-die is always faster, and at the same time uses less power.

JoeRambo has a good point. As long as you solve, or accept, the technical and economic limitations of stacking, it does the job far better.

eDRAM or HBM are "L4" solutions, so they either need tags (which take space that could be used by L3 cache) or they serve as a so-called "system cache" on the memory side of things, acting as a huge buffer. (There is also the possibility of outright replacing some DRAM with, say, HBM, so that the first 16GB of address space are served by HBM and so on, but that is a different solution.)

And why would V-cache not require tags? I assume the stacks will have tags in them. It sounds like it's just an extension of the existing L3.
 
  • Like
Reactions: Tlh97

IntelUser2000

Elite Member
Oct 14, 2003
8,686
3,785
136
AMD affirmed that the additional stack of L3 has practically the same latency as the on-die L3.

You are not listening to either me or JoeRambo.

AMD's V-cache is not, I repeat, NOT off-die.

Unless you are misunderstanding what "off-die" means. In this case it means something like a separate CCX with caches only.
 
  • Like
Reactions: Tlh97

leoneazzurro

Senior member
Jul 26, 2016
927
1,452
136
You are not listening to either me or JoeRambo.

AMD's V-cache is not, I repeat, NOT off-die.

Unless you are misunderstanding what "off-die" means. In this case it means something like a separate CCX with caches only.

Then even what I am speaking about is not off-die L3, and you too are not trying to understand what I meant. For me a stack IS off-die, as it requires another die of SRAM to be assembled on top of the CCD. That die being seen as "logically not off-die", or "behaving like an on-die cache", does not mean that it is not a separate die.
 

IntelUser2000

Elite Member
Oct 14, 2003
8,686
3,785
136
Then even what I am speaking about is not off-die L3, and you too are not trying to understand what I meant.

Let me quote what you said originally.

Not to mention that if the latencies of the off-CCD L3 are very similar to the standard L3, this would point to multiple CCDs without L3 being connected to a large off-die L3 cache, improving the communication and coherency among CCDs by orders of magnitude.
 

Makaveli

Diamond Member
Feb 8, 2002
4,718
1,054
136
Are you sure? GeIL promised 10ns true-latency modules (7200 MT/s @ CL36 and 6400 MT/s @ CL32) at launch, which is exactly the same as the vast majority of overclocked DDR4 modules. XPG even promises 7400 MT/s modules at unknown latencies.

I know that Ian was worried about latencies in this article, but that does not seem to have materialized.

This is what I've read also. DDR5 is supposed to offer the same true latency. People are getting confused because they are only focusing on the CL numbers on the memory sticks.
 

leoneazzurro

Senior member
Jul 26, 2016
927
1,452
136
Fine.

But you should use common terminology. AMD's V-cache is vertical stacking. Off-die means off-die.

It's DIE stacking, so it implicitly affirms that there are multiple dies. It's you who is using incorrect terminology, as you are implying that a CCD plus its stack are the same die. They are not; they are of course separate dies, stacked one (or more) on top of another.
 

Hougy

Member
Jan 13, 2021
77
60
61
Caches are different, because they are easier to manufacture and have higher yields thanks to their repetitive structure. They are also quite power efficient.

You are applying the square root law in the wrong way. The square root law is a big penalty because power consumption in the core also increases just as much.

This cache is going to add at best 3-4W.
Thanks for the reply, I learned a lot.
I was wrong about it being two stacks; it's only one. So the cache dies are much cheaper than I expected, but is joining them to the compute die simple, with high yields?
I think we're talking about different square root laws.
 

IntelUser2000

Elite Member
Oct 14, 2003
8,686
3,785
136
Thanks for the reply, I learned a lot.
I was wrong about it being two stacks; it's only one. So the cache dies are much cheaper than I expected, but is joining them to the compute die simple, with high yields?
I think we're talking about different square root laws.

I know what you mean.

The square root law in microprocessors says that if you increase the number of transistors by a factor of X, the performance you actually get grows by the square root of X. So if you quadruple the number of transistors, you get twice the performance. And it's not just die area that quadruples: power use quadruples as well, since there are 4x the transistors.

You can use clever engineering and ideas to overcome that somewhat, but new ideas are much harder to come by.

Intel's Cypress Cove (the 14nm version of Ice Lake's core) uses 37% more transistors for 18% more performance. It follows the law almost exactly. :)
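A quick check of that example against the rule:

```python
import math

# Square root law (Pollack's rule): perf ~ sqrt(transistor count),
# while power scales roughly linearly with transistor count.
transistor_ratio = 1.37  # Cypress Cove: ~37% more transistors
print(f"predicted speedup: {math.sqrt(transistor_ratio):.2f}x")  # ~1.17x vs ~1.18x observed
```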

But this is a generalized statement. Logic transistors, such as ALUs, FPUs, branch predictors, and decoders, use a lot of power per transistor, while caches are very power efficient. Power is pretty much the biggest limiter of performance nowadays.

Caches, and the way AMD stacks them, add die area and cost, but they are a very power-efficient way to increase performance.