Info 64MB V-Cache on 5XXX Zen3 Average +15% in Games


Kedas

Senior member
Dec 6, 2018
355
339
136
Well, we know now how they will bridge the long wait to Zen 4 on AM5 in Q4 2022.
Production start for V-Cache is the end of this year, too early for Zen 4, so this is certainly coming to AM4.
The +15%, Lisa said, is "like an entire architectural generation"
 
Last edited:
  • Like
Reactions: Tlh97 and Gideon

jpiniero

Lifer
Oct 1, 2010
14,585
5,208
136
For example, Threadripper Pro, which would be more suited for the workstation (8 channel) segment, starts from 12 and 16 cores. But the vanilla Threadripper, which would be well suited for HEDT, starts from 24 cores and up.

That's because Threadripper Pro is for OEMs, and OEMs will OEM. The people who would buy those are looking for workstations that have validated ECC at a bare minimum. You don't get that on HEDT Threadripper.
 
  • Like
Reactions: lightmanek

LightningZ71

Golden Member
Mar 10, 2017
1,627
1,898
136
Where AMD went wrong wasn't with Threadripper. If you need those threads, you know what to look for. What they did wrong was not coming up with a chipset to compete with Intel's C series. Those "workstation and low end server" chipsets really set apart a lot of Intel's workstation offerings. For example, let's say that AMD commissioned a "D" series chipset, like a D580 above the X570. Let's also say that using said chipset meant that a motherboard mfr had to fully qualify ECC memory. The only real deficiency that AMD's processors had was PCIe lanes. A minor tweak allows an x16 PCIe connector to interface with a D-class chipset, which has in it a PCIe switch that allows that x16 4.0 downlink to become four x8 4.0 slots that are oversubscribed 2:1. The old downlink and the existing x4 can drive a pair of M.2 ports. That's perfectly fine for 90% of use cases. Couple that with a pair of x1 3.0 lanes, a bunch of SATA ports, sacrifice one x8 slot for an extra pair of M.2 slots, add a pair of 10Gbps ports and lots of USB, and you have a very competent low to mid range workstation with anything from the 3900X up to the 5950X. The DRAM bandwidth isn't that big of a deal in many cases, and the 5950X already has a lot of L3 cache to begin with.
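To put rough numbers on that hypothetical lane budget: the "D580" chipset, the 2:1 oversubscription, and the slot layout above are the post's hypothetical, not a real product; the per-lane throughput figure is the standard PCIe 4.0 rate.

```python
# Rough bandwidth budget for the hypothetical "D580" chipset described above.
# ~1.97 GB/s is the usual per-lane, per-direction PCIe 4.0 figure after encoding.
PCIE4_GBPS_PER_LANE = 1.969

uplink_lanes = 16                    # the x16 4.0 downlink from the CPU
uplink_bw = uplink_lanes * PCIE4_GBPS_PER_LANE

slots = 4                            # four x8 4.0 slots behind the chipset switch
slot_lanes = 8
demand_bw = slots * slot_lanes * PCIE4_GBPS_PER_LANE

print(f"uplink:  {uplink_bw:.1f} GB/s")
print(f"demand:  {demand_bw:.1f} GB/s")
print(f"oversubscription: {demand_bw / uplink_bw:.0f}:1")
```

Oversubscribing a switch 2:1 is only a problem when multiple slots saturate simultaneously, which is rare in the workstation use cases the post describes.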

Threadripper was, and is, a vanity project. It shows that AMD can completely outclass anything that Intel puts out for HEDT and can function just fine as a low end server.
 
  • Like
Reactions: Tlh97 and Joe NYC

Joe NYC

Golden Member
Jun 26, 2021
1,934
2,272
106
That's because Threadripper Pro is for OEMs, and OEMs will OEM. The people who would buy those are looking for workstations that have validated ECC at a bare minimum. You don't get that on HEDT Threadripper.

There is now a Threadripper Pro Retail. Anandtech has an article, and I have seen them on Newegg:

The parts that don't make sense are for example:
- if I want only a 12-16 core Threadripper, I have to buy an 8-memory-channel mobo for ~$700
- if I want a more affordable 4-memory-channel platform for ~$400, I have to buy, at minimum, a 24-core CPU for ~$1500

It's as if Intel threw the HEDT market in AMD's lap and AMD said: "No, we don't want it."

But there is a chance for AMD to bring some sanity into this insanity with Threadripper 5000x
 

Joe NYC

Golden Member
Jun 26, 2021
1,934
2,272
106
Where AMD went wrong wasn't with Threadripper. If you need those threads, you know what to look for. What they did wrong was not coming up with a chipset to compete with Intel's C series. Those "workstation and low end server" chipsets really set apart a lot of Intel's workstation offerings. For example, let's say that AMD commissioned a "D" series chipset, like a D580 above the X570. Let's also say that using said chipset meant that a motherboard mfr had to fully qualify ECC memory. The only real deficiency that AMD's processors had was PCIe lanes. A minor tweak allows an x16 PCIe connector to interface with a D-class chipset, which has in it a PCIe switch that allows that x16 4.0 downlink to become four x8 4.0 slots that are oversubscribed 2:1. The old downlink and the existing x4 can drive a pair of M.2 ports. That's perfectly fine for 90% of use cases. Couple that with a pair of x1 3.0 lanes, a bunch of SATA ports, sacrifice one x8 slot for an extra pair of M.2 slots, add a pair of 10Gbps ports and lots of USB, and you have a very competent low to mid range workstation with anything from the 3900X up to the 5950X. The DRAM bandwidth isn't that big of a deal in many cases, and the 5950X already has a lot of L3 cache to begin with.

Threadripper was, and is, a vanity project. It shows that AMD can completely outclass anything that Intel puts out for HEDT and can function just fine as a low end server.

But isn't the TRx40 chipset very well positioned for this already? If there were some competitive CPUs for it, there would have been volume to drive these mobo prices to more competitive levels.

4 memory channels and extra PCIe4 lanes for a whole bunch of M.2s are quite attractive features, and the platform cost is not as astronomical as the 8 channel one, even with the low volume.
 
  • Like
Reactions: lightmanek

zir_blazer

Golden Member
Jun 6, 2013
1,164
406
136
For example, let's say that AMD commissioned a "D" series chipset, like a D580 above the X570. Let's also say that, using said chipset meant that a motherboard mfr had to fully qualify ECC RAM memory. The only real deficiency that AMD's processors had was PCIe lanes.
You don't even NEED a new chipset for that, since AMD doesn't segment ECC support depending on Processor + Chipset like Intel does, just Processor. But yeah, AMD could have sold a new SKU as part of an entry-level workstation platform. AMD has nothing that can directly compete with the Xeon E3 platform, and what's missing is merely validated ECC support and BMCs for remote management. A few vendors included that on their own, like ASRock, but compare that with the Xeon E3 ecosystem and it obviously pales in comparison. You have almost nothing to choose from on the AM4 side, same with Threadripper.
 

JoeRambo

Golden Member
Jun 13, 2013
1,814
2,105
136
There are some nice opportunities on desktop, and AMD could also play the high end of this segment with a Threadripper with 3D cache on the non-Pro, 4-channel platform. Perhaps starting from as little as a single CCD with more layers of V-Cache at ~$500 and up from there. What made Intel HEDT successful, and what generated most of the volume for the platform, were not the $1,000 CPUs but the $500 CPUs.

This is a good point, but Intel HEDT thrived in the decade we were stuck with 4C on desktop, the C2Q-to-6700K era. Frankly, only a tiny fraction of people now are not better served by something like a 5950X and actually need low core count HEDT for IO or memory channels. The people that need to combine that with 3D cache?
 
  • Like
Reactions: Tlh97

jamescox

Senior member
Nov 11, 2009
637
1,103
136
The Zen 2 in consoles and APUs has 1/4 of the L3 cache. It doesn't seem to make a big difference. Why would doubling the Zen 3 L3 cache give +15% gaming performance?

Really? No one seems to have answered this properly, unless I missed it. I did mostly skim this since I have been away a while. Although, I read it closely enough to spot a few more straw man arguments.

Anyway, while the higher levels of optimization allowed by fixed hardware make a difference, that is probably not the main factor. The main thing is that the console APUs are directly attached to around 16 GB of GDDR6 at about 500 GB/s of bandwidth vs. a desktop CPU or APU with DDR4 at about 50 GB/s. That kind of bandwidth will make up for less cache in many cases, even though the GPU will take a lot of it. AMD went the other direction with Infinity Cache on their GPUs, using large caches in place of higher bandwidth.
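The roughly 10x gap falls straight out of peak-bandwidth arithmetic. The bus widths and data rates below are typical console-class GDDR6 and desktop dual-channel DDR4-3200 figures, used purely for illustration:

```python
# Peak theoretical memory bandwidth: console GDDR6 vs. desktop dual-channel DDR4.
gddr6_bus_bits = 256         # console-class 256-bit GDDR6 bus (assumed)
gddr6_gbps_per_pin = 14      # 14 Gbps per pin
gddr6_bw = gddr6_bus_bits * gddr6_gbps_per_pin / 8        # GB/s

ddr4_channels = 2            # desktop dual channel
ddr4_mts = 3200              # DDR4-3200
ddr4_bytes_per_transfer = 8  # 64-bit channel = 8 bytes per transfer
ddr4_bw = ddr4_channels * ddr4_mts * ddr4_bytes_per_transfer / 1000  # GB/s

print(f"GDDR6: {gddr6_bw:.0f} GB/s, DDR4: {ddr4_bw:.1f} GB/s, "
      f"ratio ~{gddr6_bw / ddr4_bw:.0f}x")
```

The console CPU only sees part of that GDDR6 bandwidth in practice (the GPU consumes most of it, and GDDR6 latency is worse), but the headroom is still far beyond desktop DDR4.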
 

jamescox

Senior member
Nov 11, 2009
637
1,103
136
There is now a Threadripper Pro Retail. Anandtech has an article, and I have seen them on Newegg:

The parts that don't make sense are for example:
- if I want only a 12-16 core Threadripper, I have to buy an 8-memory-channel mobo for ~$700
- if I want a more affordable 4-memory-channel platform for ~$400, I have to buy, at minimum, a 24-core CPU for ~$1500

It's as if Intel threw the HEDT market in AMD's lap and AMD said: "No, we don't want it."

But there is a chance for AMD to bring some sanity into this insanity with Threadripper 5000x
It almost seems like market segmentation, doesn’t it? I didn’t think AMD did such things.
 

jamescox

Senior member
Nov 11, 2009
637
1,103
136
It is actually less distance to cover, not more.

Suppose AMD did in fact go from 512k to 2 MB. It means doubling both dimensions of the L2 cache rectangle. Could be a millimeter or more

Distance up is only 50 microns



- Increasing the total L2 capacity without increasing the base CCD size
- Shorter distances
- Increasing area eligible for stacking by another 10-20%

The pitch for the TSVs is likely too large to make vertical stacking at the L2 level a reasonable thing to do. TSMC claimed to have demonstrated pitch down to 0.9 microns, but even that is quite large when you start to get to tens of thousands of TSVs and the circuitry required to support them. I don't know what pitch is actually in volume production vs. demonstrated. I have seen a claim on Twitter that the pitch in use for Zen 3 is actually 17.3 microns:

“A 4MB L3 partition = 3000 TSVs. There are also 56 TSVs in the SMU area and an extra 14 TSVs in the self-test area. Total 24070 TSVs. Size: ~6.1μm, ~17.3μm pitch.”

It isn’t just the signal pins, it also has power/ground TSVs all over:


This also indicates that each TSV has a little circuit associated with it.
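Taking the tweeted figures at face value, the total checks out: a Zen 3 CCD has 32 MB of L3 in 4 MB partitions, each claimed to carry 3000 TSVs, plus the SMU and self-test extras.

```python
# Reconstructing the quoted TSV count for one Zen 3 CCD's 32 MB L3.
tsvs_per_partition = 3000   # claimed TSVs per 4 MB L3 partition
partitions = 32 // 4        # 32 MB of L3 in 4 MB partitions
smu_tsvs = 56               # claimed TSVs in the SMU area
selftest_tsvs = 14          # claimed TSVs in the self-test area

total = tsvs_per_partition * partitions + smu_tsvs + selftest_tsvs
print(total)  # matches the quoted 24070
```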
 

Topweasel

Diamond Member
Oct 19, 2000
5,436
1,654
136
It almost seems like market segmentation, doesn’t it? I didn’t think AMD did such things.
Not to apologize for AMD, but you have to have some market segmentation, and I am not sure AMD is really doing that. They offer 6-16 cores on their desktop parts. They have a more-cores solution with TRX40, and the more-memory-bandwidth solution in TRX80; for buyers whose requirements are maybe fewer but faster cores, they offer it down the stack (and at a bit of an ASP boost, but considering the market, the increased production costs, plus the validation and low volume, it makes sense).

What they do not offer is a low core count, 4-DIMM solution.

I might be wrong, but have we actually seen a TR or Epyc with 4 CCDs? I can't think of one. That's a lot of good chips to disable to fill a low cost, low volume solution covering what is mostly already covered by Ryzen with the 6, 8, 12, and 16 core options they have there. Especially with the death of CrossFire and SLI removing the need for as many PCIe lanes. It's a really small market that wants 16 cores or fewer, needs more memory bandwidth, but doesn't want 8 channels.

All the while they are miles away from keeping up with orders on any of their chips. Especially server, where all these Threadripper dies go.
 
  • Like
Reactions: Tlh97

cortexa99

Senior member
Jul 2, 2018
319
505
136
So the base frequency on the 64 core had to be cut from 2.45 to 2.2 to maintain the 280 W TDP.
I suspect not only power but also heat spreading/temperature problems with the stacked L3 play a role here in forcing the cores to run lower clocks.

(But it seems the IPC/bandwidth uplift brought by the stacked L3 is likely more than the 2.45/2.2 ≈ 11% though.)
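For reference, the break-even math behind that 11% figure works out as follows (clock figures are from the post above):

```python
# Break-even math for the 64-core V-Cache part's base-clock cut.
base_plain = 2.45    # GHz, base clock without stacked L3
base_vcache = 2.2    # GHz, base clock with stacked L3 at the same 280 W TDP

deficit = 1 - base_vcache / base_plain         # fraction of clock given up
needed_uplift = base_plain / base_vcache - 1   # per-clock gain needed to break even

print(f"clock given up: {deficit:.1%}, break-even per-clock uplift: {needed_uplift:.1%}")
```

So any workload where the extra L3 delivers more than ~11% per-clock throughput comes out ahead despite the lower base clock.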
 

zir_blazer

Golden Member
Jun 6, 2013
1,164
406
136
I might be wrong but have we actually seen a TR or Epyc with 4 CCD's? I can't think of one. That's a lot of good chips to disable to fill a low cost, low volume, solution to cover what is mostly already covered by Ryzen with the 6,8,12,16 core options they have there. Specially with the death of Xfire and SLI, removing the need for as many PCIe lanes. It's a really small market that wants a 16c or lower, but needs more memory bandwidth, but doesn't want 8 channels.

You will notice that there is a large central I/O die and eight CCD’s or the 7nm chiplets with up to 8 CPU cores each. Eight chiplets with 8 CPU cores each and we get to the maximum of 64 cores. If one had two cores per CCD inactive, one would then see 6 cores per chiplet or 48 cores.

When one gets to the lower-end 8, 12, and 16 core parts, this presents a challenge. AMD would need to populate all eight chiplets with dies that only have 1-2 cores active. Given the small size and relatively good yields of the 7nm chiplets, that is a challenge. The company would also have to go through the process of packaging nine dies for an 8 core CPU.

Instead of doing that, AMD essentially populates two active dies per Rome package on some of these lower-end SKUs. That helps keep costs down.





As per the CPU forum rules.
Your post needs to have personal comments made by you. No dropping of links or images only without comment.


esquared
Anandtech Forum Director
 
Last edited by a moderator:

Joe NYC

Golden Member
Jun 26, 2021
1,934
2,272
106
I suspect not only power but also heat spreading/temperature problems with the stacked L3 play a role here in forcing the cores to run lower clocks.

(But it seems the IPC/bandwidth uplift brought by the stacked L3 is likely more than the 2.45/2.2 ≈ 11% though.)

The cores could be doing more work by being fed better (spending less time sitting idle waiting for memory accesses).

As a result, the all-core clock speed could not be kept at 2.45 GHz with all cores busy.

It will be more interesting to see some performance results of the 7773X vs. the 7763 than to compare all-core guaranteed minimum clock speeds.
 

jpiniero

Lifer
Oct 1, 2010
14,585
5,208
136
The cores could be doing more work by being fed better (spending less time sitting idle waiting for memory accesses).

As a result, the all-core clock speed could not be kept at 2.45 GHz with all cores busy.

Don't think it works that way. Obviously the L3 chiplet burns some power, so that reduces the power available to the cores.
 
  • Like
Reactions: lightmanek

Joe NYC

Golden Member
Jun 26, 2021
1,934
2,272
106
Don't think it works that way. Obviously the L3 chiplet burns some power, so that reduces the power available to the cores.

When you mention power used by the L3, the first comparison to make would not be vs. the power available to the cores.

The first comparison would be vs. the power savings on the SerDes and on the IO die memory controller.

Because that is the role of V-Cache: lowering the traffic out of the CCD, which can be power hungry.
 
  • Like
Reactions: Tlh97

jamescox

Senior member
Nov 11, 2009
637
1,103
136
Not to apologize for AMD. But you have to have some market segmentation. But I am not sure AMD is completely doing that. They offer 6-16 cores on their desktops. They have a more cores solution with TRX40, And the more memory bandwidth solution in TRX80, that one where the requirements maybe less but faster cores, they offer it down (and at a bit ASP boost, but considering the market, and the increased production costs, plus the validation, and low volume it makes sense).

What they do not offer is a low core count 4 dimm solution.

I might be wrong but have we actually seen a TR or Epyc with 4 CCD's? I can't think of one. That's a lot of good chips to disable to fill a low cost, low volume, solution to cover what is mostly already covered by Ryzen with the 6,8,12,16 core options they have there. Specially with the death of Xfire and SLI, removing the need for as many PCIe lanes. It's a really small market that wants a 16c or lower, but needs more memory bandwidth, but doesn't want 8 channels.

All the while they are miles away from keep up with orders on any of their chips. Specially server, where all these Threadrippers go.
What are you talking about? Zen 2 Epyc has 2, 4, 6, or 8 CCD. The 2 or 6 are asymmetric and only a couple products are made with that number. Milan is either 4 or 8 CCDs only. I believe Threadripper has been 2, 4, and 8 CCD. The most common Epyc or Threadripper products are 4 CCD (32 core or less). Threadripper can’t be that cheap due to the giant IO die and expensive package so they have certain constraints on price in addition to market segmentation.

This gets worse with Genoa if it has 12 channel memory and 12 CCD connections. This would make a Genoa based Threadripper a much more expensive package than a Milan based Threadripper. Also, if the IO die is made in 6 nm rather than cheap GF 14 nm silicon, then that makes it even more expensive. This is part of why I have been wondering if the IO die will be modular with local silicon interconnect bridges or just wide IF links. Each one could have 3 memory channels, 3 CCD links, and probably 2 pci-express 5. They already have sufficient IO. The Ryzen IO die has always been basically 1/4 of an Epyc IO die. If they made a modular IO die, they could possibly cover a huge part of their product stack with 2 different chips. Desktop Ryzen could use 1 modular IO die and 1 to 3 CCD. Threadripper could use 2 modular IO die and up to 6 CCD which would be a cheaper, smaller socket compared to Epyc. Epyc could then be 4 IO die package with up to 12 CCD. That would be more manufacturable on 6 nm and would allow more efficient use of silicon. I don’t think we have seen any rumors to support this, but it seems like it makes a lot of sense. It may help with routing to be able to move the IO die components apart a bit. Routing connections for 12 CCDs will not be simple.
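Under that modular-IO-die speculation (entirely hypothetical, the poster's own idea rather than a rumored AMD design), the three tiers would scale like this:

```python
# Hypothetical modular IO die: each die carries 3 memory channels and 3 CCD links,
# per the speculation above. Tier names and die counts follow the post.
CHANNELS_PER_IOD = 3
CCDS_PER_IOD = 3

tiers = {"Ryzen": 1, "Threadripper": 2, "Epyc (Genoa-class)": 4}
for name, io_dies in tiers.items():
    print(f"{name}: {io_dies} IOD -> {io_dies * CHANNELS_PER_IOD} memory channels, "
          f"up to {io_dies * CCDS_PER_IOD} CCDs")
```

That reproduces the numbers in the post: 1 die for desktop (up to 3 CCDs), 2 for a smaller Threadripper socket (6 channels, 6 CCDs), and 4 for a 12-channel, 12-CCD Epyc.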

AMD does market segmentation, just not as much as Intel and they segment the market on different things. The cache size is likely to be a big market segmentation feature. The Milan-X3D is rumored to be a significant price increase over Milan with no stacked cache, and that is just with a single layer.
 

Topweasel

Diamond Member
Oct 19, 2000
5,436
1,654
136
The lower end of Milan Epyc is 4 dies.
What are you talking about? Zen 2 Epyc has 2, 4, 6, or 8 CCD. The 2 or 6 are asymmetric and only a couple products are made with that number. Milan is either 4 or 8 CCDs only. I believe Threadripper has been 2, 4, and 8 CCD. The most common Epyc or Threadripper products are 4 CCD (32 core or less). Threadripper can’t be that cheap due to the giant IO die and expensive package so they have certain constraints on price in addition to market segmentation.
I feel that more of this is a given assumption. Have we seen an X-ray or delid of a 4-CCD Epyc (and I don't believe there is a 2 or 6 CCD physical Epyc at all)? I know they are creating substrates for 1 and 2 CCDs on Ryzen. But considering what they did with the 1000-series Threadripper and Naples, I wouldn't assume they aren't populating all 8 spots and then disabling the weaker (but probably still working) dies. I know the IO dies are substantially larger, but on a much cheaper process. An extra 4 dies, even if they aren't great, would be nearly IO-die size, and wafer-wise would cost 2-3x more.

AMD does market segmentation, just not as much as Intel and they segment the market on different things. The cache size is likely to be a big market segmentation feature. The Milan-X3D is rumored to be a significant price increase over Milan with no stacked cache, and that is just with a single layer.
They create products for markets. It is market segmentation. But when people mention Intel and market segmentation, it's always about holding back: disabling functionality to artificially create a difference, forcing certain buyers into another product segment. Take tightly limiting core counts; you could say AMD is doing the same because Ryzen doesn't support more than 16 cores, but they aren't, because when the die size was too small for their pinout they have multiple times chosen to increase GPU core count instead of CPU cores. There is a big difference between deciding not to manufacture a product, even though they have the tools for it, because of where they want to keep the lineup, and chopping off functionality in the products in that segment to force people over to a more expensive one.

A lot of this is also due to manufacturing limitations. Right now it doesn't make sense to me for AMD, regardless of production costs (what I was referring to above), to make cheaper Threadrippers (8/12/16) for a limited market. I never understood the 8-core 1000-series TR in the first place, other than to suck people into the platform. But now, when every single die is selling, they can't keep up with server chip requests, and they can't give up the juicy mindshare their normal desktop selection gets them (plus the extra wafers they have to set aside for Ryzen mobile, which is selling like gangbusters), selling a server chip as a desktop chip, when there is already a desktop chip alternative, doesn't make sense. Sure, it means someone who decides they must have more lanes, or doesn't need more cores but more memory bandwidth, has no real option. They either lose those and go 5950X, go overboard and get a TRX80/Epyc solution, or use the difference in board costs to jump up to a 3960X. Heck, the fact that the options for TR are still Rome-based proves the point. They are so far ahead in HEDT that it doesn't make sense for them to use Zen 3 CCDs for that market, and they continue to offer those as Epycs. We look at the GPU apocalypse and get angry that AMD and Nvidia continue to offer extra SKUs even though we can't get hold of the previous ones, but then give them a hard time when they shy away from doing that on CPUs because they aren't releasing the exact configuration we are looking for.
 
  • Like
Reactions: Tlh97 and moinmoin

Joe NYC

Golden Member
Jun 26, 2021
1,934
2,272
106
What are you talking about? Zen 2 Epyc has 2, 4, 6, or 8 CCD. The 2 or 6 are asymmetric and only a couple products are made with that number. Milan is either 4 or 8 CCDs only. I believe Threadripper has been 2, 4, and 8 CCD. The most common Epyc or Threadripper products are 4 CCD (32 core or less). Threadripper can’t be that cheap due to the giant IO die and expensive package so they have certain constraints on price in addition to market segmentation.

One thing I was wondering: does TRX40 Threadripper use the same IO die as full Epyc?

Because if it is, that may explain why AMD has not pushed it more aggressively to the original HEDT space intel used to occupy. Because the floor (cost-wise) would be higher.

I am guessing it is, since the core count goes all the way to 64...

This gets worse with Genoa if it has 12 channel memory and 12 CCD connections. This would make a Genoa based Threadripper a much more expensive package than a Milan based Threadripper. Also, if the IO die is made in 6 nm rather than cheap GF 14 nm silicon, then that makes it even more expensive. This is part of why I have been wondering if the IO die will be modular with local silicon interconnect bridges or just wide IF links.

It seems that way. At least the major functional parts (2 IF links + 2 memory channels + PCIe links) are part of a block.

A recent AnandTech article described a ring bus connecting the elements.

It would be nice if AMD could spin a half IO die, with just 2 of the 4 elements, hopefully at a more competitive cost, in order for AMD to be more of a force in the lower end of HEDT.

Each one could have 3 memory channels, 3 CCD links, and probably 2 pci-express 5. They already have sufficient IO. The Ryzen IO die has always been basically 1/4 of an Epyc IO die. If they made a modular IO die, they could possibly cover a huge part of their product stack with 2 different chips. Desktop Ryzen could use 1 modular IO die and 1 to 3 CCD. Threadripper could use 2 modular IO die and up to 6 CCD which would be a cheaper, smaller socket compared to Epyc. Epyc could then be 4 IO die package with up to 12 CCD. That would be more manufacturable on 6 nm and would allow more efficient use of silicon. I don’t think we have seen any rumors to support this, but it seems like it makes a lot of sense. It may help with routing to be able to move the IO die components apart a bit. Routing connections for 12 CCDs will not be simple.

AnandTech had an interesting speculation about the 3 CCDs sitting on top of either an interposer or stacked on some base die, which could then have a single wider IF link to the IO die rather than 3 separate ones.

AMD does market segmentation, just not as much as Intel and they segment the market on different things. The cache size is likely to be a big market segmentation feature. The Milan-X3D is rumored to be a significant price increase over Milan with no stacked cache, and that is just with a single layer.

Intel's market segmentation, as revealed by Hardware Unboxed, is quite anti-consumer. Intel disables large portions of the L3, and only enables them on higher end, higher core count models.

That way, it hides that the gains (for gaming) come from the enabled L3 rather than from higher clock speeds or core counts.

 
Last edited:

JoeRambo

Golden Member
Jun 13, 2013
1,814
2,105
136

Chips and Cheese did some heavy work testing "application" traces against varying sizes and latencies of L3 cache.
Both results from sim runs and real-world L3 cache hit rates for Zen 2 are there. A treat for the curious mind.
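In the same spirit, a toy simulation shows the basic shape of those hit-rate curves: the hit rate climbs as the cache covers more of the working set. The access pattern and sizes here are made up for illustration, not Chips and Cheese's actual traces.

```python
import random

def hit_rate(cache_lines, working_set_lines, accesses=10_000, seed=1):
    """Toy LRU cache: uniform random accesses over a fixed working set."""
    random.seed(seed)
    cache = []          # least-recently-used line at the front
    hits = 0
    for _ in range(accesses):
        line = random.randrange(working_set_lines)
        if line in cache:
            hits += 1
            cache.remove(line)      # move to most-recent position
        elif len(cache) >= cache_lines:
            cache.pop(0)            # evict the LRU line
        cache.append(line)
    return hits / accesses

# Hit rate climbs roughly with cache size until it covers the working set.
working_set = 1024  # lines touched by the "application"
for size in (128, 256, 512, 1024):
    print(f"{size:5d}-line cache: hit rate {hit_rate(size, working_set):.2f}")
```

Real traces have locality, so the curves bend rather than scale linearly, which is exactly what makes a big jump in L3 capacity pay off for some workloads and not others.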