Info 64MB V-Cache on 5XXX Zen3 Average +15% in Games


Kedas

Senior member
Dec 6, 2018
355
339
136
Well, we now know how they will bridge the long wait until Zen 4 on AM5 in Q4 2022.
Production of V-Cache starts at the end of this year, which is too early for Zen 4, so this is certainly coming to AM4.
The +15%, Lisa said, is "like an entire architectural generation".
 
Last edited:
  • Like
Reactions: Tlh97 and Gideon

jamescox

Senior member
Nov 11, 2009
637
1,103
136
Talking about gigabytes of SRAM cache is getting a bit ahead of ourselves, methinks.
Isn't there already a Zen 3 BIOS showing settings for up to 4 stacked cache (X3D) dies? 8 chiplets * (32 MB + (64 MB * 4)) = 2304 MB. Over 2 GB seems possible, but it may pull too much power for the general case. It might be limited to specialized enterprise applications (some types of HPC, high-end database servers, etc.).
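A quick sketch of that arithmetic (the 4-layer count is only the rumored BIOS maximum; the 64 MB per layer is from AMD's announcement):

```python
# Rumored maximum stacked-cache config on an 8-chiplet Epyc package.
base_l3_mb = 32     # native L3 per Zen 3 CCD
layer_mb = 64       # announced capacity of one V-Cache die
layers = 4          # rumored BIOS maximum, unconfirmed
chiplets = 8        # CCDs on a full Epyc package

total_mb = chiplets * (base_l3_mb + layer_mb * layers)
print(total_mb, "MB")   # -> 2304 MB, a bit over 2 GB
```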

edit: I guess that could be a fake. I don't actually know the origin of that BIOS settings image, but a lot of people seem to accept it.
 
Last edited:

Topweasel

Diamond Member
Oct 19, 2000
5,436
1,654
136
It will be interesting to see what massive SRAM caches can do for GPUs. We already have 128 MB on AMD GPUs, but I doubt that is being taken full advantage of yet. Probably not until we get a chiplet based GPU.
Isn't that basically the first step toward this? I wouldn't be surprised if that was step one in aligning this tech to be used on every product line. Someone can tell me if I am wrong, but this isn't like NAND, where every level is another full plane of NAND cells. They are stacking cache on top of cache, so, for example, they can't put the stacked cache on top of the cores. It makes sense why AMD went with the 8 matrixed cores per CCD in Zen 3 over the old CCXs: with the older CCXs they would have had to leave two unconnected areas on the layer, whereas now it's just one area.

So by building an SRAM cache into Navi 2x, they have the building blocks to start adding more cache. When they do it on Navi, they could do 3 layers and have half a gig of cache. Even without direct software development, that would be a pretty big boost; that's more than all the shared memory the PS3 or Xbox had. Navi, from the looks of it, has a bit of a Zen 1/2 look to it, with its split compute complexes. It's entirely possible that Navi, following a step behind Zen development, ends up with a single compute complex and a single shared L3 cache location next generation. Navi 3x was all about tripling performance efficiency over Navi 1. Maybe that's the final step: using this V-Cache to double or quadruple the L3 (on top of whatever the L3 is next gen). Then you could have an easier time segmenting performance: one layer of V-Cache is bad, cut it off and it's a 7800 XT; two, and it's a 7800. You could cut out compute cores on top of that for overall yields.
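Rough numbers for that, assuming (purely speculatively) that each stacked layer matches the 128 MB Infinity Cache already on Navi 2x:

```python
# Hypothetical stacked Infinity Cache; none of this is an announced product.
base_mb = 128                    # on-die Infinity Cache in Navi 21
layers = 3                       # speculative stacked layers
total_mb = base_mb + layers * base_mb
print(total_mb, "MB")            # -> 512 MB, "half a gig"
```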

But that assumes I am right. I do think they will eventually go to multichip designs. But AMD has been as much about lowering waste and maximizing wafer returns for a long while. Hell, all of Zen might have been an attempt to stay profitable while having to play nice with GloFo's WSA, which allowed for low yields and maybe low clocks. Variable cache layers, and the large impact they would have on big monolithic dies, give AMD another avenue for binning to reduce waste, especially on a video card. AMD is launching this on Zen 3 because it's already on the market, and with a single layer, any non-qualifying part can just be sold as a standard Zen 3 chip and the day is done. It gives a mild refresh while they prep Zen 4, and the process can be used as a pipe cleaner / process tester. But Ryzen isn't where the real money is. That is in Epyc. After that, the best margins and sales will be in the GPU market, with desktop/workstation/server all being high-margin chips.
 
  • Like
Reactions: Tlh97

Schmide

Diamond Member
Mar 7, 2002
5,587
719
126
Ehh, cache will grow by what the hierarchy demands and as the number of ways allows. Having more cache than you can search within the allotted clock divider will sooner or later bump up against the memory below it.

Just because you can put it there does not necessarily make it useful. At some point the CPUs above will have to double their cache line size and push the bottleneck down the tree.

Going from 32 MB to 96 MB is more than a normal cache-cycle increment. I wouldn't be surprised if there is some blending of the speed of the normal 32 MB and the width of the extra 64 MB (96 MB total) to give the illusion of uniformity.
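For the "number of ways" point, a rough sketch with 64-byte lines and Zen 3's 16-way L3 (how the 96 MB part actually arranges its sets and ways is an assumption here):

```python
# Capacity = sets * ways * line size; growing capacity 3x means growing
# either the ways (wider, slower compares) or the sets (more index bits).
LINE_BYTES = 64

def num_sets(capacity_mb: int, ways: int) -> int:
    return capacity_mb * 1024 * 1024 // (LINE_BYTES * ways)

print(num_sets(32, 16))   # -> 32768 sets in the base 32 MB, 16-way L3
print(num_sets(96, 48))   # -> 32768 sets if the ways triple to 48
print(num_sets(96, 16))   # -> 98304 sets if 16 ways are kept instead
```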
 
Last edited:

JoeRambo

Golden Member
Jun 13, 2013
1,814
2,105
136
Going from 32 MB to 96 MB is more than a normal cache-cycle increment. I wouldn't be surprised if there is some blending of the speed of the normal 32 MB and the width of the extra 64 MB (96 MB total) to give the illusion of uniformity.

They have pretty much revealed all the info, and then some, about how it works. They are adding more L3 cache slices and letting their usual partial-address-bit hash select the slice where a given cache line is placed (and which slice is queried when someone comes looking for an address with those bits).
Currently 8 cores compete for 8 slices, which limits cumulative bandwidth, and there are second-order effects like address conflicts, uneven loading of slices, and cache way limits and conflicts.
They are adding either 8x8 MB or 16x4 MB slices. As long as the chip is designed for it, average L3 latency does not have to increase much, since no extra processing is done to select slices beyond what is done already. It's hard to comment on the extra "physical" latency, but L3 caches are large and comparatively slow, so several extra cycles won't be noticeable.

"Blending" is already in effect for both Intel and AMD: final latency depends on what address a cache line has and how far away, on the ring, crossbar, mesh or whatever, the slice for those hash bits is.
 

Topweasel

Diamond Member
Oct 19, 2000
5,436
1,654
136
The obvious answer to this is Epyc. Plenty of datacenter DB and HPC setups see near-endless improvement from more cache and memory.

But you're not wrong. It also seems like a fatalistic take, though: that we shouldn't contemplate cache-size improvements and possibilities because there is some limit or diminishing-returns point out there.

This is where multilayer cache would come in handy, especially with semi-custom. You want 1 GB of L3? Sure, there you go, that's $10k. And even with diminishing returns, customers are often willing to pay big to push through them. It's why companies like Amazon are making their own CPUs: they can scale things like cache and core counts to their requirements.
 

gdansk

Platinum Member
Feb 8, 2011
2,107
2,605
136
Some of this cache talk reminds me of the late PA-RISC chips. All PA-8x00 cores were similar except for the cache arrangements. Later models had a massive 2.25 MB on-die L1 and 32 or 64 MB of external L2. Those were much slower than what AMD is doing, though, and I think the L2 was eDRAM of some sort, not SRAM. But it's interesting to see how much you can get out of a design just by changing the cache configuration.
 

jamescox

Senior member
Nov 11, 2009
637
1,103
136
Some of this cache talk reminds me of the late PA-RISC chips. All PA-8x00 cores were similar except for the cache arrangements. Later models had a massive 2.25 MB on-die L1 and 32 or 64 MB of external L2. Those were much slower than what AMD is doing, though, and I think the L2 was eDRAM of some sort, not SRAM. But it's interesting to see how much you can get out of a design just by changing the cache configuration.
I think I used one of those. It was a lot faster than the x86 and SPARC chips of the time, if I remember correctly. Given the die area taken up, you are buying more of a memory chip than a processing chip these days. It will be interesting to see the compile benchmarks. The Linux kernel compile is already down to 20 seconds on Epyc. We're going to need a bigger benchmark.
 

jamescox

Senior member
Nov 11, 2009
637
1,103
136
The obvious answer to this is Epyc. Plenty of datacenter DB and HPC setups see near-endless improvement from more cache and memory.

But you're not wrong. It also seems like a fatalistic take, though: that we shouldn't contemplate cache-size improvements and possibilities because there is some limit or diminishing-returns point out there.

This is where multilayer cache would come in handy, especially with semi-custom. You want 1 GB of L3? Sure, there you go, that's $10k. And even with diminishing returns, customers are often willing to pay big to push through them. It's why companies like Amazon are making their own CPUs: they can scale things like cache and core counts to their requirements.
Considering per-core licensing fees and such, $10k might be the lower-end model. Some applications can get a massive boost from this much cache, so the prices might be reasonable in that case. I am curious how much power that much SRAM will draw, though.

LTT would probably get one and play video games on it, which would be amusing. They previously played a game on a 64-core Epyc using a software renderer, no GPU. If we get a Zen 4 Epyc with massively improved FP and massive caches, it might actually be playable.
 
  • Like
Reactions: Tlh97 and Topweasel

Topweasel

Diamond Member
Oct 19, 2000
5,436
1,654
136
Considering per-core licensing fees and such, $10k might be the lower-end model. Some applications can get a massive boost from this much cache, so the prices might be reasonable in that case. I am curious how much power that much SRAM will draw, though.

LTT would probably get one and play video games on it, which would be amusing. They previously played a game on a 64-core Epyc using a software renderer, no GPU. If we get a Zen 4 Epyc with massively improved FP and massive caches, it might actually be playable.

Possibly, but they have to offer insane value for companies to make the transition from Intel. Some of that is performance and features; some of it is cost. So while I am sure they could charge more for these than either of us would guess, it would be smart for AMD to still try to be price competitive.
 

Doug S

Platinum Member
Feb 8, 2020
2,263
3,515
136
Some of this cache talk reminds me of the late PA-RISC chips. All PA-8x00 cores were similar except for the cache arrangements. Later models had a massive 2.25 MB on-die L1 and 32 or 64 MB of external L2. Those were much slower than what AMD is doing, though, and I think the L2 was eDRAM of some sort, not SRAM. But it's interesting to see how much you can get out of a design just by changing the cache configuration.

PA-RISC was always doing something different from everyone else with caches. Back when 8K to 16K was state of the art for on-chip L1, HP was doing wave-pipelined off-chip L1 caches of 256K to several MB. That was feasible since those CPUs' clock rates were in the 50-to-low-hundreds-of-MHz range. They continued with huge off-chip L1s until, I think, the PA-8500 finally allowed them to integrate the L1 on chip.

Given that PA-RISC's market was servers running massive Oracle databases and workstations running eCAD and mCAD software that cost more to license than the very expensive hardware it ran on, that sort of thing made sense.
 

jamescox

Senior member
Nov 11, 2009
637
1,103
136
Possibly, but they have to offer insane value for companies to make the transition from Intel. Some of that is performance and features; some of it is cost. So while I am sure they could charge more for these than either of us would guess, it would be smart for AMD to still try to be price competitive.
They already offer CPUs with 32 MB of L3 per core, 256 MB total. That is the 72F3, with 8 cores at a 3.7 GHz base clock and 4.1 GHz boost for maximum per-core performance. These are ~$2500 list, but if your software is licensed per core, that is probably a good deal. Intel has 38.5 MB on one chip, but that is a 28-core CPU, so only 1.375 MB per core. I don't know what their current maximum cache-per-core product is.
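The per-core arithmetic, with a hypothetical license fee to show why the 8-core part can be the better deal (the Intel part here is the 28-core, 38.5 MB Xeon; the $5000/core fee is made up for illustration):

```python
# L3 per core for the parts discussed above.
parts = {
    "Epyc 72F3": {"cores": 8, "l3_mb": 256.0},
    "28-core Xeon": {"cores": 28, "l3_mb": 38.5},
}
for name, p in parts.items():
    print(f"{name}: {p['l3_mb'] / p['cores']:.3f} MB L3 per core")

# With per-core licensing, core count dominates total cost of ownership.
license_per_core = 5000   # hypothetical $/core/year, for illustration only
for name, p in parts.items():
    print(f"{name}: ${p['cores'] * license_per_core:,} per year in licenses")
```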

AMD with Rome kind of already offers insane value before we even get into stacked caches; it is up to 256 MB on package already. If Milan-X actually goes up to 4 layers of cache die, then it will absolutely destroy everything else for certain applications. They will be able to sell those at really high prices. Milan-X with even 1 layer of cache die would probably dominate many benchmarks, even more than they already do. Intel hasn't had comparable products for a while.

If they can pull off gigabyte(s) of SRAM in package, that will massively accelerate some HPC applications, high-end database servers, and probably ray-tracing applications that still run on the CPU. The CPU-based render workstation or server may be quickly obsoleted by GPUs, if it isn't already. We are going to get chiplet-based GPUs, possibly with massive amounts of on-die SRAM and DRAM. I am also looking forward to compile benchmarks on the CPU.

I am a little suspicious of the X3D settings in the BIOS, though. If you can enable or disable the cache die, that might mean there is a trade-off somewhere, perhaps higher latency as more cache dies are enabled. That would mean some applications may perform worse with the larger cache due to higher latency and little benefit from the higher hit rate and/or bandwidth.
 
  • Like
Reactions: Tlh97 and moinmoin

LightningZ71

Golden Member
Mar 10, 2017
1,627
1,898
136
I'm sure there is at least some trade-off with respect to total package power and peak clocks due to thermals. While the L3 dies aren't big power hogs, they are also certainly not free.
 

eek2121

Platinum Member
Aug 2, 2005
2,930
4,026
136
They already offer CPUs with 32 MB of L3 per core, 256 MB total. That is the 72F3, with 8 cores at a 3.7 GHz base clock and 4.1 GHz boost for maximum per-core performance. These are ~$2500 list, but if your software is licensed per core, that is probably a good deal. Intel has 38.5 MB on one chip, but that is a 28-core CPU, so only 1.375 MB per core. I don't know what their current maximum cache-per-core product is.

AMD with Rome kind of already offers insane value before we even get into stacked caches; it is up to 256 MB on package already. If Milan-X actually goes up to 4 layers of cache die, then it will absolutely destroy everything else for certain applications. They will be able to sell those at really high prices. Milan-X with even 1 layer of cache die would probably dominate many benchmarks, even more than they already do. Intel hasn't had comparable products for a while.

If they can pull off gigabyte(s) of SRAM in package, that will massively accelerate some HPC applications, high-end database servers, and probably ray-tracing applications that still run on the CPU. The CPU-based render workstation or server may be quickly obsoleted by GPUs, if it isn't already. We are going to get chiplet-based GPUs, possibly with massive amounts of on-die SRAM and DRAM. I am also looking forward to compile benchmarks on the CPU.

I am a little suspicious of the X3D settings in the BIOS, though. If you can enable or disable the cache die, that might mean there is a trade-off somewhere, perhaps higher latency as more cache dies are enabled. That would mean some applications may perform worse with the larger cache due to higher latency and little benefit from the higher hit rate and/or bandwidth.

I suspect it is for compatibility. Some applications may not particularly like large caches.

Also, being able to easily turn it off makes it easier to debug and benchmark.
 

NTMBK

Lifer
Nov 14, 2011
10,237
5,020
136
I am a little suspicious of the X3D settings in the BIOS, though. If you can enable or disable the cache die, that might mean there is a trade-off somewhere, perhaps higher latency as more cache dies are enabled. That would mean some applications may perform worse with the larger cache due to higher latency and little benefit from the higher hit rate and/or bandwidth.

Makes for a good review, doesn't it? If the reviewer can turn the new feature on and off and easily show a big improvement, that helps a lot with marketing.
 
  • Like
Reactions: Joe NYC

Topweasel

Diamond Member
Oct 19, 2000
5,436
1,654
136
They already offer CPUs with 32 MB of L3 per core, 256 MB total. That is the 72F3, with 8 cores at a 3.7 GHz base clock and 4.1 GHz boost for maximum per-core performance. These are ~$2500 list, but if your software is licensed per core, that is probably a good deal. Intel has 38.5 MB on one chip, but that is a 28-core CPU, so only 1.375 MB per core. I don't know what their current maximum cache-per-core product is.

AMD with Rome kind of already offers insane value before we even get into stacked caches; it is up to 256 MB on package already. If Milan-X actually goes up to 4 layers of cache die, then it will absolutely destroy everything else for certain applications. They will be able to sell those at really high prices. Milan-X with even 1 layer of cache die would probably dominate many benchmarks, even more than they already do. Intel hasn't had comparable products for a while.

If they can pull off gigabyte(s) of SRAM in package, that will massively accelerate some HPC applications, high-end database servers, and probably ray-tracing applications that still run on the CPU. The CPU-based render workstation or server may be quickly obsoleted by GPUs, if it isn't already. We are going to get chiplet-based GPUs, possibly with massive amounts of on-die SRAM and DRAM. I am also looking forward to compile benchmarks on the CPU.

I am a little suspicious of the X3D settings in the BIOS, though. If you can enable or disable the cache die, that might mean there is a trade-off somewhere, perhaps higher latency as more cache dies are enabled. That would mean some applications may perform worse with the larger cache due to higher latency and little benefit from the higher hit rate and/or bandwidth.
My point was that they might still want to be the value option here for market penetration. I get what Rome and Milan brought compared to the competition, and even the worst-case versions (super high clocks, requiring water cooling) still don't quite reach Intel's high end in cost. Intel is also a moving target: their future packaging moves will allow them to toss on SRAM modules almost on demand. They will have packaging limits AMD won't have, but AMD also has to make sure it uses enough layers to keep up, since it will be harder for them to raise the layer count on a reasonable timetable. And considering the sheer mass of requirements it takes to get datacenter customers to switch, I could see AMD still needing to keep value up across cost, features, and performance, at least for DC.
 

Ajay

Lifer
Jan 8, 2001
15,454
7,862
136
Had the same feeling back when I got my Ryzen 3600. Back in 1999, 32 MB of RAM was pretty common for mid-range systems. 20 years later, that's the CPU's L3 cache. Now that's progress. :D

Got 32 GB of RAM to go along with it. The symmetry is beautiful, isn't it?
32 MB?! In '98 I was running WinNT 4.0 with 256 MB of RAM. I think I doubled that (on different machines/OSes) every year for ~4 years. I think I was spending more on RAM than on my CPU and motherboard.
 

Insert_Nickname

Diamond Member
May 6, 2012
4,971
1,691
136
32 MB?! In '98 I was running WinNT 4.0 with 256 MB of RAM. I think I doubled that (on different machines/OSes) every year for ~4 years.

You were lucky. Not all of us could afford that at the time. I only got to 96 MB in my personal machine in '99, and later 160 MB in '00 (64+64+32 MB PC100 SDRAM). Had an MVP3-G2 board in it. Never did manage to get a K6-III for myself before I got an Athlon (600 MHz, I think).

I think I was spending more on RAM than on my CPU and motherboard.

Now, that I can believe. Memory was expensive.
 
  • Like
Reactions: lobz

Joe NYC

Golden Member
Jun 26, 2021
1,948
2,289
106
Isn't there already a Zen 3 BIOS showing settings for up to 4 stacked cache (X3D) dies? 8 chiplets * (32 MB + (64 MB * 4)) = 2304 MB. Over 2 GB seems possible, but it may pull too much power for the general case. It might be limited to specialized enterprise applications (some types of HPC, high-end database servers, etc.).

It could also be a way of recreating the 8-chiplet 72F3 (8 cores, 256 MB L3) with 1 chiplet (8 cores, 288 MB L3), with a slightly different power and performance profile.

If one layer of extra L3 is $6 in die cost and another $6 in assembly, AMD could sell these at $50 to $100 per layer with great margins.
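The margin math under those assumed costs:

```python
# Gross margin per stacked layer, using the $6 + $6 cost assumption above.
die_cost, assembly_cost = 6, 6        # $ per layer, assumed
for price in (50, 100):               # $ charged per layer
    margin = (price - die_cost - assembly_cost) / price
    print(f"${price}/layer -> {margin:.0%} gross margin")
# -> 76% at $50/layer, 88% at $100/layer
```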

It would be the Intel way of thinking, back when Intel had the performance crown, to limit accessible technology to the high priests of some ivory tower; Intel would hold back the technology and play various marketing and segmentation games.

AMD does not owe anything to anybody and does not need to hold back.

As far as power, there are several dimensions to that question:
- If the SRAM is busy serving data, it will use power, but at a fraction of what it would take to send the request to, and receive the response from, main RAM.
- If the cores are kept fed with data faster (from L3 rather than RAM), they will use more power, but they will also do more work.
- The idle power of the extra L3 should be quite low.

So I doubt a lot of power is going to be wasted.
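A back-of-the-envelope for the first point; the energy figures are order-of-magnitude ballparks for modern processes, not AMD measurements:

```python
# Energy saved when a memory access hits in L3 instead of going to DRAM.
PJ_PER_LINE_L3 = 100      # assumed ~100 pJ to serve a 64 B line from L3
PJ_PER_LINE_DRAM = 2000   # assumed ~2 nJ to fetch the same line from DRAM

converted_misses = 1_000_000_000   # misses turned into hits by extra L3
saved_j = converted_misses * (PJ_PER_LINE_DRAM - PJ_PER_LINE_L3) * 1e-12
print(f"~{saved_j:.1f} J saved per billion converted misses")
# The extra SRAM adds some static power, but each avoided DRAM trip is
# roughly an order of magnitude cheaper in energy.
```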

edit: I guess that could be a fake. I don't actually know the origin of that BIOS settings image, but a lot of people seem to accept it.

That was actually the BIOS of an AMD Milan test platform called Daytona, which was provided to reviewers during the Milan launch.
 
  • Like
Reactions: Vattila and bsp2020