Broadwell's L4 was a different beast, though, since it was an MCM solution.
I'm a bit confused about performance expectations. Someone cited Broadwell with the extra cache and said it only performed better in games.
Are you sure? Geil promised 10ns true latency modules (7200 MT/s @ CL36 and 6400 MT/s @ CL32) at launch, which is exactly the same as the vast majority of overclocked DDR4 modules. XPG even promises 7400 MT/s modules at unknown latencies.
A large L4 cache might be required for most enthusiast-level future CPUs paired with DDR5, to hide the rumored higher DRAM latencies, the one critical thing that DDR5 doesn't improve upon.
Not to mention that if the latencies of the off-CCD L3 are very similar to the standard L3, this would point to multiple CCDs without L3 being interconnected to a large off-die L3 cache, thus improving communication and coherency among CCDs by orders of magnitude.
No, inter-CCD latencies remain the same and there are no interconnections anywhere. They are stacking an additional layer of silicon, full of SRAM, on the existing L3 area and using vias to connect it to achieve bandwidth. Latency is great because it is basically the same physical distance to the consumers, and it uses the same address hashing to put lines into slices (just the number of slices is greater; see the toy sketch below).
No need to invent complicated schemes when 3D stacking has solved the problems.
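To picture the slice scheme described above, here is a toy model (the actual hash in Zen parts is undocumented, so the function below is purely illustrative): lines are spread across L3 slices by hashing the physical address, so adding stacked SRAM grows slice count/capacity rather than adding a new lookup level.

```python
def l3_slice(phys_addr: int, num_slices: int) -> int:
    """Toy slice selector: XOR-fold the line address across the slice bits.

    Illustrative only -- not AMD's real hash. Assumes 64-byte cache lines
    and a power-of-two slice count.
    """
    bits = num_slices.bit_length() - 1        # log2(num_slices)
    line = phys_addr >> 6                     # drop the 64 B line offset
    h = 0
    while line:
        h ^= line & (num_slices - 1)          # fold in the low slice bits
        line >>= bits
    return h

# Consecutive lines scatter across slices; the same hash scheme keeps
# working after stacking, just over more/larger slices.
print([l3_slice(addr << 6, 8) for addr in range(16)])
```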
I mean in future CPUs, not the one demoed there.
A large L4 cache might be required for most enthusiast-level future CPUs paired with DDR5, to hide the rumored higher DRAM latencies, the one critical thing that DDR5 doesn't improve upon.
Are you sure? Geil promised 10ns true latency modules (7200 MT/s @ CL36 and 6400 MT/s @ CL32) at launch, which is exactly the same as the vast majority of overclocked DDR4 modules. XPG even promises 7400 MT/s modules at unknown latencies.
I know that Ian was worried about latencies in this article, but that does not seem to have materialized.
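For anyone who wants to check the 10ns arithmetic above: true latency in ns is just CAS cycles divided by the memory clock, and the DDR clock is half the transfer rate. A quick sketch (the DDR4-3200 CL16 entry is my own comparison point, not from the Geil announcement):

```python
def true_latency_ns(mt_per_s: int, cas: int) -> float:
    """CAS latency in ns: CL cycles / memory clock (DDR clock = MT/s / 2)."""
    return cas * 2000 / mt_per_s  # 2000 = 2 transfers per clock * 1000 ns

for name, mts, cl in [("DDR5-7200 CL36", 7200, 36),
                      ("DDR5-6400 CL32", 6400, 32),
                      ("DDR4-3200 CL16", 3200, 16)]:
    print(f"{name}: {true_latency_ns(mts, cl):.1f} ns")
# All three come out to exactly 10.0 ns
```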
Not to mention that if the latencies of the off-CCD L3 are very similar to the standard L3, this would point to multiple CCDs without L3 being interconnected to a large off-die L3 cache, thus improving communication and coherency among CCDs by orders of magnitude.
eDRAM or HBM are "L4" solutions, so they either need tags (which take space that could otherwise go to L3 cache) or they serve as a so-called "system cache" on the memory side of things, acting as a huge buffer. (There is the possibility of outright replacing some DRAM with, say, HBM, so that the first 16GB of address space is served by HBM and so on, but that is a different solution.)
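The tag-space point is easy to quantify. A back-of-the-envelope for a hypothetical 128 MB L4 with assumed but plausible parameters (64 B lines, 16-way, 48-bit physical addresses; none of these are any real part's specs):

```python
# Tag overhead estimate for a hypothetical 128 MB L4.
# All parameters are illustrative assumptions.
CACHE_BYTES = 128 * 2**20   # 128 MB of eDRAM/HBM used as an L4
LINE_BYTES  = 64
WAYS        = 16
PHYS_BITS   = 48
STATE_BITS  = 3             # e.g. valid/dirty/coherence state

lines  = CACHE_BYTES // LINE_BYTES           # 2,097,152 lines
sets   = lines // WAYS
idx    = sets.bit_length() - 1               # log2(sets) = 17 index bits
offset = LINE_BYTES.bit_length() - 1         # 6 offset bits
tag    = PHYS_BITS - idx - offset            # 25 tag bits per line

tag_mb = lines * (tag + STATE_BITS) / 8 / 2**20
print(f"{tag_mb:.1f} MB of SRAM just for tags")  # ~7.0 MB
```

That is on the order of an entire on-die L3's worth of SRAM spent purely on bookkeeping, which is exactly why the tags-vs-L3 trade-off matters.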
This won't happen. Off-die will increase latency. On-die is always faster, and it uses less power doing it.
AMD affirmed that the additional stack of L3 has practically the same latency as on-die L3.
You are not listening to either me or JoeRambo.
AMD's V-cache is not, I repeat NOT off-die.
Unless you are misunderstanding what "off-die" means. In this case it means something like a separate CCX with caches only.
Then even what I am speaking about is not off-die L3; you, too, are not trying to understand what I meant.
Not to mention that if the latencies of the off-CCD L3 are very similar to the standard L3, this would point to multiple CCDs without L3 being interconnected to a large off-die L3 cache, thus improving communication and coherency among CCDs by orders of magnitude.
Are you sure? Geil promised 10ns true latency modules (7200 MT/s @ CL36 and 6400 MT/s @ CL32) at launch, which is exactly the same as the vast majority of overclocked DDR4 modules. XPG even promises 7400 MT/s modules at unknown latencies.
I know that Ian was worried about latencies in this article, but that does not seem to have materialized.
Let me quote what you said originally.
And? I clarified that a stack is OFF-die in my view, because it needs a separate die and an assembly process to be connected to the CCD.
Fine.
But you should use common terminology. AMD's V-cache is vertical stacking. Off-die means off-die.
Thanks for the reply, I learned a lot.
Caches are different, because they are easier to manufacture and have higher yields thanks to their repetitive structure (rough yield sketch after this post). They are also quite power efficient.
You are applying the square root law in the wrong way. The square root law is a big penalty there because the power consumption in the core increases just as much.
This cache is going to add at most 3-4W.
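On the yield point raised above, a simple Poisson defect model shows why a small, repetitive SRAM die is cheap: yield falls exponentially with area, and SRAM can additionally repair defects with spare rows/columns. The die areas and defect density below are illustrative assumptions, not published figures:

```python
from math import exp

def poisson_yield(area_mm2: float, d0_per_cm2: float) -> float:
    """Classic Poisson yield model: Y = exp(-area * D0)."""
    return exp(-(area_mm2 / 100.0) * d0_per_cm2)

D0 = 0.1  # defects/cm^2 -- assumed defect density for a mature node
for name, area in [("~80 mm^2 compute die", 80.0),
                   ("~36 mm^2 cache die", 36.0)]:
    print(f"{name}: {poisson_yield(area, D0):.1%} yield")
# The small die yields better per attempt, and because SRAM is repetitive,
# spare rows/columns can repair many defects, raising effective yield further.
# Bonding yield for the stack itself is a separate, additional factor.
```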
Thanks for the reply, I learned a lot.
I was wrong about it being two stacks; it's only one. So the cache dies are much cheaper than I expected, but is joining them to the compute die simple and high-yield?
I think we're talking about different square root laws.