Speculation: Ryzen 4000 series/Zen 3


amd6502

Senior member
Apr 21, 2017
796
255
106
That's not how it works.
I'm considering the simpler case here of a single 8c/16t CCD AM4 chip. Consider a single thread running on this.

In Zen2 case we have a 2x16MB L3, in Zen3 we have a unified 1x32MB L3.

Assume the thread does not jump between CCX's (if applicable).

Now in Zen2 case the thread is limited to filling up to one of the 16MB L3 units.

In Zen3 case we have the thread limited to filling a 32MB L3 cache.

This means potentially significantly greater hit rate (though at the supposed cost of almost 20% latency hit).
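This tradeoff is just expected access cost after an L2 miss. A back-of-the-envelope sketch (all numbers invented for illustration, not measured Zen figures):

```python
# Expected cost (in cycles) of a request that missed L2, for a given L3.
def avg_l2_miss_cost(l3_hit_rate, l3_latency, dram_latency):
    return l3_hit_rate * l3_latency + (1 - l3_hit_rate) * dram_latency

# Hypothetical figures: a 16MB slice at 39 cycles vs. a unified 32MB at
# 46 cycles (~20% worse), with DRAM at 250 cycles after an L3 miss.
split_16mb = avg_l2_miss_cost(0.60, 39, 250)    # smaller cache, lower hit rate
unified_32mb = avg_l2_miss_cost(0.75, 46, 250)  # bigger cache, more latency
```

With these made-up numbers the hit-rate gain more than pays for the latency hit; a workload that already fits in 16MB gets no extra hits and pays only the latency.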
 

HurleyBird

Platinum Member
Apr 22, 2003
2,097
481
136
Now in Zen2 case the thread is limited to filling up to one of the 16MB L3 units.

In Zen3 case we have the thread limited to filling a 32MB L3 cache.

This means potentially significantly greater hit rate (though at the supposed cost of almost 20% latency hit).
Point is, you can replace "a thread" with "two threads," "three threads," or however many threads. The number of threads by itself doesn't make a difference. What does is the extent to which datasets fit into a 16 MB L3, and the extent to which data is specific to individual threads vs. shared between them.

A hypothetical single threaded task that can consume the entire 32 MB L3 will benefit.

A hypothetical task that consumes all 16 threads in a chiplet and fills the entire 32MB L3 with shared data will benefit extremely, even more than the prior example.

A hypothetical single threaded task that fits entirely inside a 16 MB L3 will regress.

A program that creates two processes that each consume 8 threads and 16 MB (e.g. perfectly in line with the Zen 2 CCX structure) will regress extremely, even more than the former example.

A significant majority of tasks will benefit, both single and multi-threaded. But some minority of both single threaded and multi-threaded tasks will regress. To say that "Such a tradeoff would have an advantage pretty much only for single threaded loads" is entirely misleading and seems to misunderstand how things work, not to mention that there are plenty of database workloads that would profusely disagree with that statement.
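Those four cases reduce to a toy rule (the thresholds and the sharing model here are my own simplifications, not anything AMD has stated): what matters is whether the working set fits the old split arrangement, the new unified one, or neither.

```python
def l3_behavior(working_set_mb, shared):
    """Toy comparison of a 2x16MB split L3 vs. a 1x32MB unified L3.

    working_set_mb: combined hot data on the chiplet.
    shared: True if threads share the data (or there is only one thread,
    which likewise cannot split its data across two CCXs).
    """
    if shared:
        fits_split = working_set_mb <= 16   # must live inside one 16MB CCX
    else:
        fits_split = working_set_mb <= 32   # private halves can use both CCXs
    fits_unified = working_set_mb <= 32
    if fits_split:
        return "regress"   # already fit before: extra latency, no extra hits
    if fits_unified:
        return "benefit"   # unified capacity captures what the split missed
    return "neutral"       # spills past 32MB either way

cases = [
    l3_behavior(32, True),    # 1 thread, 32MB working set
    l3_behavior(32, True),    # 16 threads, 32MB shared
    l3_behavior(12, True),    # 1 thread, fits in 16MB
    l3_behavior(32, False),   # 2 processes x (8 threads, 16MB private)
]
```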
 
Last edited:

DisEnchantment

Senior member
Mar 3, 2017
578
1,077
106
Zen3 pushed out late has something to do with wafer availability rather than silicon readiness, in my opinion.
Also in the recent web conference with Papermaster, they said that they are working with both Zen4 and Zen3 silicon. So they might be having early Zen4 silicon already.

Sony is reportedly planning to assemble 10m consoles by the end of the year to cover the launch and early 2021.

Even if we assume 6m consoles for the PS5, that would be 42k+ wafers which AMD has to deliver by Q3 to early Q4, assuming a conservative die size of 295mm2 for the PS5 SoC.
If MS were to assemble 4m consoles, that would be another 38k wafers. That's 80k+ wafers for the consoles starting from June (AMD's statement of console ramp-up).
That is like 60% of AMD's wafer allocation for the entire H2, and 100% through the end of Q3. They took over some Huawei wafers, but those are not available until the last quarter.
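The 42k+ figure is easy to reproduce with the usual gross-die estimate (the 295mm2 die size is from the post; the 70% yield is my own guess):

```python
import math

# Standard gross-die-per-wafer approximation for a 300mm wafer:
# usable area over die area, minus an edge-loss term.
def dies_per_wafer(die_area_mm2, wafer_diameter_mm=300):
    r = wafer_diameter_mm / 2
    gross = (math.pi * r * r / die_area_mm2
             - math.pi * wafer_diameter_mm / math.sqrt(2 * die_area_mm2))
    return math.floor(gross)

gross = dies_per_wafer(295)                 # ~200 candidate dies per wafer
good = int(gross * 0.70)                    # assumed yield, not a known number
wafers_for_6m = math.ceil(6_000_000 / good)
```

That lands at roughly 43k wafers for 6m SoCs, consistent with the 42k+ above.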

If anything, trends show people buy more gaming equipment during the epidemic.

If Zen3 uses a TSMC IOD, then they would probably have preferred to prolong Zen2, given its GF IOD and high-yielding Zen2 chiplets.
But whatever it may be, I could surmise Zen3 CPUs will be scarce at launch, so be prepared to pay top dollar for them.
 
Last edited:

JoeRambo

Senior member
Jun 13, 2013
809
470
136
A hypothetical task that consumes all 16 threads in a chiplet and fills the entire 32MB L3 with shared data will benefit extremely, even more than the prior example.
The real answer is: it depends.

There are a lot of moving parts when it comes to shared L3 caches, for example:
Let's assume AMD sticks with an eviction (victim) cache. Two cache domains of 16MB each might have advantages over a 32MB unified one:

1) Cumulative bandwidth: moving to a cache shared by all cores might reduce the total bandwidth available to cores, both directly by having fewer ports and indirectly by moving from a crossbar to whatever they will use now.
2) The chance of way/address conflicts increases. Even if a given L3 slice is "larger", way and address conflicts can now come from more cores; with two domains that chance was cut in half.
3) While an eviction cache somewhat mitigates it, cores can still fight for cache, like 6 cores working on some read-only structure that overflows L2 getting hurt by two cores that stream to memory. It is not as bad as on client Intel, where the L3 is inclusive and so sees pretty much everything, but it takes algorithms and policies to stop those two cores from trashing performance.

These problems are "artificial", but if an MT load is sensitive to L3 size, they will happen for some workloads.
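Point 3 is easy to demonstrate with a toy LRU model (sizes and the phased access pattern are invented for illustration): a streaming neighbor can wipe out another core's reuse set in a shared cache.

```python
from collections import OrderedDict

def reuse_hit_rate(cache_lines, reuse_lines, stream_lines, rounds=20):
    """Share one LRU cache between a data-reuse loop and a streaming loop.

    Each round the reuse loop touches the same `reuse_lines` addresses,
    then the streamer touches `stream_lines` never-repeated addresses.
    Returns the reuse loop's hit rate.
    """
    cache = OrderedDict()

    def touch(addr):
        hit = addr in cache
        if hit:
            cache.move_to_end(addr)
        else:
            cache[addr] = None
            if len(cache) > cache_lines:
                cache.popitem(last=False)   # evict least-recently-used line
        return hit

    hits = refs = 0
    stream_addr = 10**9                     # disjoint address range
    for _ in range(rounds):
        for a in range(reuse_lines):
            hits += touch(a)
            refs += 1
        for _ in range(stream_lines):
            touch(stream_addr)
            stream_addr += 1
    return hits / refs

alone = reuse_hit_rate(1024, 512, 0)             # reuse set fits: mostly hits
with_streamer = reuse_hit_rate(1024, 512, 1024)  # streamer flushes it each round
```

Splitting the streamer into its own 16MB domain would leave the reuse workload's hit rate intact, which is exactly the protection the two-domain layout gave for free.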
 
  • Like
Reactions: HurleyBird

Gideon

Senior member
Nov 27, 2007
833
1,259
136
Let's assume AMD sticks with an eviction (victim) cache. Two cache domains of 16MB each might have advantages over a 32MB unified one:
I'll be very disappointed if AMD goes through all this trouble with the total redesign of the cache hierarchy and CCXs and continues to use the L3 just as an eviction cache.
 
  • Like
Reactions: Elfear

LightningZ71

Senior member
Mar 10, 2017
417
314
106
Since it seems, from the way I read it, that the performance impact is most felt in the highest core count scenario, could the issue simply be one of data throughput? Could 64 cores with SMT2 at Milan's core IPC be hitting the limits of 8-channel DDR4 bandwidth and be suffering from a degree of data starvation?

Another thought: Zen3 was still in the design phase when all the SMT-based side-channel vulnerabilities came out. Could it be that a deliberate decision was made to emphasize single-thread throughput over SMT efficiency, in case it was determined that SMT represented too much of a security vulnerability?

The same is possible with respect to software licensing. Some vendors have moved to licensing on a per-thread basis, and some legacy software has no idea about SMT and assumes that each thread IS its own CPU. Maximizing single-thread throughput has big benefits there.
 
Mar 11, 2004
20,202
2,364
126
Zen3 pushed out late has something to do with wafer availability rather than silicon readiness, in my opinion.
Also in the recent web conference with Papermaster, they said that they are working with both Zen4 and Zen3 silicon. So they might be having early Zen4 silicon already.

Sony is reportedly planning to assemble 10m consoles by the end of the year to cover the launch and early 2021.

Even if we assume 6m consoles for the PS5, that would be 42k+ wafers which AMD has to deliver by Q3 to early Q4, assuming a conservative die size of 295mm2 for the PS5 SoC.
If MS were to assemble 4m consoles, that would be another 38k wafers. That's 80k+ wafers for the consoles starting from June (AMD's statement of console ramp-up).
That is like 60% of AMD's wafer allocation for the entire H2, and 100% through the end of Q3. They took over some Huawei wafers, but those are not available until the last quarter.

If anything, trends show people buy more gaming equipment during the epidemic.

If Zen3 uses a TSMC IOD, then they would probably have preferred to prolong Zen2, given its GF IOD and high-yielding Zen2 chiplets.
But whatever it may be, I could surmise Zen3 CPUs will be scarce at launch, so be prepared to pay top dollar for them.
Weird, as just a few months ago Sony was supposedly cutting back their orders on PS5 (and I think rumors suggested only 1-2 million PS5s in 2020). Wonder what happened. Maybe TSMC dropped prices due to other companies scaling back orders? Or maybe Sony is trying to maximize economies of scale to offer a lower price, hoping to edge out Microsoft's more powerful console with a flood of PS5s, gaining install base simply because they had availability?
 

blckgrffn

Diamond Member
May 1, 2003
6,859
158
106
www.teamjuchems.com
Weird, as just a few months ago Sony was supposedly cutting back their orders on PS5 (and I think rumors suggested only 1-2 million PS5s in 2020). Wonder what happened. Maybe TSMC dropped prices due to other companies scaling back orders? Or maybe Sony is trying to maximize economies of scale to offer a lower price, hoping to edge out Microsoft's more powerful console with a flood of PS5s, gaining install base simply because they had availability?
Man, trying to buy a console right now is super frustrating. I am not sure where supply is at, but tracking sales, it's been a super solid quarter for tech. A friend of mine @ Target corp has been saying their consumer tech sales have been like Cyber Week/BF continuously since the start of the lockdown-type measures, and they have been selling every TV, console, and other device they have been able to stock.

If Sony is serious and can have stock AND pricing (seriously, Xboxes and Switches are going for $100-$200 more than they were going for last fall, if you can even find them) they could really sweep this holiday season. That would be quite the achievement: starting out with 2x the install base heading into 2021 just because you could put them on the shelves.
 

DisEnchantment

Senior member
Mar 3, 2017
578
1,077
106
Weird, as just a few months ago Sony was supposedly cutting back their orders on PS5 (and I think rumors suggested only 1-2 million PS5s in 2020). Wonder what happened. Maybe TSMC dropped prices due to other companies scaling back orders? Or maybe Sony is trying to maximize economies of scale to offer a lower price, hoping to edge out Microsoft's more powerful console with a flood of PS5s, gaining install base simply because they had availability?
I think there is uncertainty in the market. I suppose Sony was initially worried about the pandemic and underestimating demand.
But then most likely marketing came back and said: guess what, people are buying gaming stuff.
The same trend holds on PC.
Still, the report says the numbers include stock for early 2021, because Sony wanted to avoid the PS4 launch issues, when shipments had to be sent by air cargo.

Nonetheless, for PS5 I think 10m is on the high side, but 6m is reasonable for this year and for stockpiling for early 2021 as well.
 

blckgrffn

Diamond Member
May 1, 2003
6,859
158
106
www.teamjuchems.com
Air shipments are crazy right now. I am bringing in a pallet from India: by sea it is ~$800 (59-62 days on the water, plus customs time on both ends, plus other random stuff) and by air ~$1,200 for a 2-3 day shipment time plus all the other shenanigans.

Buuuuuut there is currently a surcharge to the tune of $2,400 on my air shipment, on a UPS rate of $6.70 per kg by air.


"Most trades lanes around the world are still showing strong double-digit capacity declines. Global capacity for the most recent week is now at -26%. Belly capacity on passenger aircraft is down -74% compared to this time last year."

If we think a boxed PS5 is ~5 kilos, we are talking a $10-$15 surcharge per unit, which I doubt Sony is very enthusiastic about, especially if they are trying not to bleed too much on the first run of these things.
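The per-unit arithmetic, spelled out (the ~5 kg boxed weight is a guess from the post, and the $2-3/kg surcharge share is my own rough assumption backed out of the $10-$15 estimate, not a quoted rate):

```python
# Per-console share of an air-freight surcharge (illustrative arithmetic only).
def per_unit_surcharge(surcharge_per_kg, boxed_weight_kg):
    return surcharge_per_kg * boxed_weight_kg

low = per_unit_surcharge(2.0, 5)    # lower-end surcharge share per console
high = per_unit_surcharge(3.0, 5)   # upper-end surcharge share per console
```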

If Sony is cranking production right now to load boats with cargo instead of using planes, that makes 100% sense to me. Expect Q4 production to arrive and be ready for sale in Q1 2021, but everything that they can make and put on boats right now can be sold in Q4, imo. I have had ocean shipments arrive in Georgia in less than 30 days, and I am guessing the shortest routes between China and the US can take less than that. Now, the West Coast port situation is probably pretty terrible, but I don't think that's new.
 

blckgrffn

Diamond Member
May 1, 2003
6,859
158
106
www.teamjuchems.com
Indeed I think they want to avoid Air Cargo.


PS5 production is in Japan though afaik.
Sorry for the OT.
Jeeze, of course. Wow, I need more coffee.

My chart has two charges listed for Vietnam, Singapore, Taiwan etc. but no listed surcharge for Japan.

*Googles*

Ah, no. It appears Foxconn in China made the PS4 for worldwide export. Based on the current scale of PS5 production, I would assume the same.


If you live in Japan, it sounds like you get a Japan-made console. Not sure if that just means some final integration or what; probably done for PR and import-duty reasons.

Also sorry for the OT. I think it further validates how much silicon is likely being dedicated to next-gen console production, and how/why that is likely the case.

I am further assuming this will impact when we get RDNA2-based PC GPU options.

****Last edit, Bloomberg Article link*****

This explains a lot, including the 10M target number and that they are being made in China for US consumption:

 
Last edited:
  • Haha
Reactions: DisEnchantment

JoeRambo

Senior member
Jun 13, 2013
809
470
136
I'll be very disappointed if AMD goes through all this trouble with the total redesign of the cache hierarchy and CCXs and continues to use the L3 just as an eviction cache.
Actually, you never know. In fact, in the long history of AMD they were downright incompetent at cache design in all but a few designs. With the way AMD is executing lately the chance is low, but the same "designers" brought us gems like the 16KB L1D in Bulldozer and so on.

But an eviction cache is not a bad idea for these designs; Intel's workstation/server chips switched to an anemic L3 eviction cache (even if they claim the smarts of prefetching into L3).
What else can they choose? An inclusive cache would let them "save" on the tags they have to have anyway to know what is inside L2, by storing tags + data in L3, at a cost of 4MB of space.
I guess 2/8MB in a Zen1 CCX was a no-no, 2/16 in a Zen2 CCX sounds better, and 4/32MB in Zen3 could work. But inclusive has its own drawbacks.
 
Mar 11, 2004
20,202
2,364
126
Man, trying to buy a console right now is super frustrating. I am not sure where supply is at, but tracking sales, it's been a super solid quarter for tech. A friend of mine @ Target corp has been saying their consumer tech sales have been like Cyber Week/BF continuously since the start of the lockdown-type measures, and they have been selling every TV, console, and other device they have been able to stock.

If Sony is serious and can have stock AND pricing (seriously, Xboxes and Switches are going for $100-$200 more than they were going for last fall, if you can even find them) they could really sweep this holiday season. That would be quite the achievement: starting out with 2x the install base heading into 2021 just because you could put them on the shelves.
Yeah. The thing is, though, what will the market be like in 4-6 months, especially if it's for a $600 system.

I do think that it could be a smart move by Sony if they can have plenty of availability. If the market is good enough, then sell at the higher prices ($549/599). If not, eat some cost and sell at $449/499 and try to make it up in games.

I think there is uncertainty in the market. I suppose Sony was initially worried about the pandemic and underestimating demand.
But then most likely marketing came back and said: guess what, people are buying gaming stuff.
The same trend holds on PC.
Still, the report says the numbers include stock for early 2021, because Sony wanted to avoid the PS4 launch issues, when shipments had to be sent by air cargo.

Nonetheless, for PS5 I think 10m is on the high side, but 6m is reasonable for this year and for stockpiling for early 2021 as well.
I think some of the reports about low early shipments predated the pandemic, but those just kinda talked about there being quite a bit of internal strife at Sony over the PS5. So yeah, I could see things changing over the months since, especially with the situation at TSMC opening up more.
 

blckgrffn

Diamond Member
May 1, 2003
6,859
158
106
www.teamjuchems.com
Yeah. The thing is, though, what will the market be like in 4-6 months, especially if it's for a $600 system.

I do think that it could be a smart move by Sony if they can have plenty of availability. If the market is good enough, then sell at the higher prices ($549/599). If not, eat some cost and sell at $449/499 and try to make it up in games.

I think some of the reports about low early shipments predated the pandemic, but those just kinda talked about there being quite a bit of internal strife at Sony over the PS5. So yeah, I could see things changing over the months since, especially with the situation at TSMC opening up more.
Right. The other thing is that winter might be even more claustrophobic than this summer in terms of what people can do... these consoles are like the perfect escape medium, especially as new movies and many shows run out of material. The utility of a game console goes way up, and perhaps more families will consider having more than one console that is TV-dependent.

My Dad just paid $450 with tax and shipping for a "new" One X because he uses it as his media hub - live TV, bluray and all streaming services. It turns his TV on and off, controls the receiver volume and he never has to change inputs on anything. His OG xbox died and with new One S prices at ~$350 plus I had him just buy the beefier SKU. It was from a third party seller on Walmart.com and of course came opened and with a few months shaved off the warranty, but my Dad was sick of shopping and it looked brand new enough he's just decided to use it. My kids play games on it when they spend time with them so everyone wins, I guess.

I plan on doing what I can to pre-order a PS5. Based on how hard it has been to buy a Switch or Xbox recently, I think it's pre-order or bust because of the profiteering that is bound to happen.

I also think that scalpers are going to buy every one of them that hits the shelves to try to resell. Based on that, I think Sony could sell 10M of those things at MSRP this year. Now, how many of them make it into real customers' hands?

Console demand still looks poised - if not necessarily locked in, due to potential trade restrictions, economy implosions, etc. as you say - to really consume a lot of AMD's wafer supply for the foreseeable future.
 

DisEnchantment

Senior member
Mar 3, 2017
578
1,077
106
I'll be very disappointed if AMD goes through all this trouble with the total redesign of the cache hierarchy and CCXs and continues to use the L3 just as an eviction cache.
Victim cache is kind of confirmed by a kernel patch.
+ "An ECC error or poison bit mismatch was detected on a tag read by a probe or victimization",
But we can hope it's not just a plain unification of the CCXs' caches. There is a big patent trail around caches: compression, bypass, memory prefetching, load/store combining, etc.
But you never know; it could turn out to be crappy as well.
 

amd6502

Senior member
Apr 21, 2017
796
255
106
A victim cache seems like a very effective strategy. I could imagine added behavior though; I'd have thought the trend would be toward a more complex and intelligent L3/LLC. Or are there good reasons for keeping victim caches strictly vanilla victim caches?
 

moinmoin

Golden Member
Jun 1, 2017
1,657
1,594
106
L3$ being "only" a victim cache is essentially meaningless imo. In an ideal world you would want everything, code- and data-wise, as close to the core needing it as possible, so in L1$. That's unfeasible for obvious reasons, at which point first L2$ and then L3$ come into play. As such, all the heuristics like prefetching that benefit L1/2$ will benefit L3$ as well.

What makes L3$ as an LLC more interesting is the introduction of additional logic for sharing and managing data between cores (within a CCX) as well as between dies; it looks beyond the single core to which L1$ and L2$ are bound. My understanding of the unification of the L3$ between two CCXs on one die is that it strives to make the results of a specific core's heuristics available to other cores, not only within a CCX but also within a die (and potentially beyond, latency permitting), thus ideally avoiding repeated processing of the same data by several cores' heuristics.
 

DisEnchantment

Senior member
Mar 3, 2017
578
1,077
106
L3$ being "only" a victim cache is essentially meaningless imo. In an ideal world you would want everything, code- and data-wise, as close to the core needing it as possible, so in L1$. That's unfeasible for obvious reasons, at which point first L2$ and then L3$ come into play. As such, all the heuristics like prefetching that benefit L1/2$ will benefit L3$ as well.

What makes L3$ as an LLC more interesting is the introduction of additional logic for sharing and managing data between cores (within a CCX) as well as between dies; it looks beyond the single core to which L1$ and L2$ are bound. My understanding of the unification of the L3$ between two CCXs on one die is that it strives to make the results of a specific core's heuristics available to other cores, not only within a CCX but also within a die (and potentially beyond, latency permitting), thus ideally avoiding repeated processing of the same data by several cores' heuristics.
I am not sure I understood properly, but I'll try to share my point of view on this. I am happy to be corrected.

When a thread is executing and data/an instruction is not found in L2, the L3 is probed; on a hit, a cache line is evicted from L2 and replaced with the one from L3. On a miss, we go to main memory.
If multiple CCXs are present, then on an L3 miss the other L3s are probed (if there is an entry in the coherency directory; otherwise go to main memory).

I understood there are proactive coherency probes, where directory entries are maintained across the L3s and used to invalidate data on other CCXs when it has been modified by another CCX. Additionally, the L3 directory contains entries for the private L2 caches.
Probing across L3s goes over the fabric, hence some additional latency. But (at least to some extent) it does not matter how many L3s there are, so it scales reasonably across a lot of CCXs.
The directory entries across L3s refer only to regions, not specific cache lines; this way they can deal with big sizes. Probing the L2s involves cache lines, but is much faster as well.
I don't know if an inclusive cache or something else would be more suited in this case.
All of this is already in Zen2. So for Zen3:
  • For a single core on the 8-core CCX, the bigger L3 means it has a lot of space to swap with until eviction to main memory finally happens. Additionally, due to memory prefetching at L3, the chances of a hit when data is not found in L2 are higher with a bigger L3 (possibly avoiding the probes across to the other L3s, which incur a penalty).
  • For a very high core count part, the traffic on the Infinity Fabric for highly threaded loads is probably going to be much lower than with a 4-core CCX, simply because there are fewer CCXs to talk to.
  • On the other hand, the 8-core CCX would be more complex due to a lot of routing between the cores/L2 and the L3, but that is counterbalanced by a single fabric connection instead of the previous two for two 4-core CCXs. Fewer SerDes blocks.
  • Requests to the IMC are also streamlined, because for an 8-core part, for example, the IMC (coherency slave) is handling only one master. This is very important because DRAM has big recovery times and this could degrade memory performance.
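The L2↔L3 swap flow in the first bullet can be sketched as a toy victim cache (a fully associative LRU stand-in of my own; real Zen L3 is set-associative with far more policy):

```python
from collections import OrderedDict

class VictimL3:
    """Toy model of the flow above: L3 is filled only by L2 evictions,
    and an L3 hit moves the line back into L2 (swapping a victim out)."""

    def __init__(self, l2_lines, l3_lines):
        self.l2 = OrderedDict()
        self.l3 = OrderedDict()
        self.l2_lines = l2_lines
        self.l3_lines = l3_lines

    def access(self, addr):
        if addr in self.l2:
            self.l2.move_to_end(addr)
            return "L2 hit"
        if addr in self.l3:
            del self.l3[addr]           # promote back into L2
            self._fill_l2(addr)
            return "L3 hit"
        self._fill_l2(addr)             # fetch from memory straight into L2
        return "memory"

    def _fill_l2(self, addr):
        self.l2[addr] = None
        if len(self.l2) > self.l2_lines:
            victim, _ = self.l2.popitem(last=False)
            self.l3[victim] = None      # eviction is the only way into L3
            if len(self.l3) > self.l3_lines:
                self.l3.popitem(last=False)

c = VictimL3(l2_lines=2, l3_lines=4)
for a in (1, 2, 3):
    c.access(a)          # line 1 is evicted from L2 into the victim L3
hit = c.access(1)        # found in L3, swapped back into L2
```

A bigger `l3_lines` is exactly the first bullet: more evicted lines stay reachable before finally spilling to main memory.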
 
Last edited:

LightningZ71

Senior member
Mar 10, 2017
417
314
106
From a higher-level point of view, making the L3 cache inclusive would also reduce the total effective amount of cache in the processor. If everything from each 256K L2 were duplicated in the L3, that would be a loss of 1MB per 4-core CCX, and 2MB in an 8-core configuration. While that's not a lot, it's still a net loss.
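That arithmetic, parameterized so the per-core L2 size can be swapped (the 256K figure is the one used above):

```python
# L3 capacity effectively lost to duplication if the L3 is made inclusive.
def inclusive_loss_mb(cores, l2_kb_per_core, ccx_count=1):
    return cores * l2_kb_per_core * ccx_count / 1024

per_ccx = inclusive_loss_mb(4, 256)       # loss per 4-core CCX, in MB
per_die = inclusive_loss_mb(4, 256, 2)    # loss across both CCXs on a die
```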
Expanding the size of the CCX to 8 cores was bound to incur an L3 access latency penalty, if only to deal with the overhead of managing accesses by twice as many cores without adverse effects. I think benchmarks will eventually show that the cost was worth it, as 8-CCD EPYC parts will only have to deal with 8 CCX units instead of 16. The added benefit is that, if they keep a reasonably similar wiring layout in the packages and I/O die, each CCX will have twice the bandwidth between itself and the IO die (remember, each CCX had its own data connection to the IO die). With the CCX units fused, they have a channel to the IO die that's double what it was, and less need to actually cross it to talk to other cores. This also creates an advantage in low-CCD-count chips. For the SKUs that have 8 memory channels but only a pair of CCDs, this doubles the effective bandwidth per CCX to RAM, increasing the effective memory performance for each active CCX. There will still be region issues with the IO die, unless there's another revision that addresses this.
 

naukkis

Senior member
Jun 5, 2002
344
171
116
From a higher-level point of view, making the L3 cache inclusive would also reduce the total effective amount of cache in the processor. If everything from each 256K L2 were duplicated in the L3, that would be a loss of 1MB per 4-core CCX, and 2MB in an 8-core configuration. While that's not a lot, it's still a net loss.
And even a little bit more loss, as Zen cores have 512KB of L2 each.

Do we know how much L2 those Zen3 cores have? It would be logical to increase the L2 size if L3 latency increases.
 

moinmoin

Golden Member
Jun 1, 2017
1,657
1,594
106
On the other hand, the 8 Core CCX would be more complex due to a lot of routing between the cores/L2 to the L3
I'm not sure whether they'll still go down that path in Zen 3. You listed some patents before which show that AMD has plans to move IF from being point-to-point to actually being network-based, with switches etc. My expectation would be that by unifying the L3$, Zen 3 is already realizing some aspects of that: virtualizing the slices each core owns, allowing the L3$ management to deduplicate and move data around on its own, optimizing data usage and transfer that way, etc.
 
  • Like
Reactions: Vattila

DisEnchantment

Senior member
Mar 3, 2017
578
1,077
106
I'm not sure whether they'll still go down that path in Zen 3. You listed some patents before which show that AMD has plans to move IF from being point-to-point to actually being network-based, with switches etc. My expectation would be that by unifying the L3$, Zen 3 is already realizing some aspects of that: virtualizing the slices each core owns, allowing the L3$ management to deduplicate and move data around on its own, optimizing data usage and transfer that way, etc.
I think the Zen3 CCD would be an intermediate step in that direction.
The CCD could be a basic building block, but unlike a CCX it would be a single manufacturable block/die. It would have the cache coherency agent, and the tag directories could be attached to the L3/within the CCD, including the SerDes and so on.
Missing would be the routing blocks.
Another overlooked point: if you recollect, there are tag directories on the IOD as well for Zen2. How this will look in Zen3 would be interesting too. I suppose they could disappear, with the directories all on the CCDs and coherency probes running across.
Too many unknowns at this point.
 

naukkis

Senior member
Jun 5, 2002
344
171
116
I think the Zen3 CCD would be an intermediate step in that direction.
The CCD could be a basic building block, but unlike a CCX it would be a single manufacturable block/die. It would have the cache coherency agent, and the tag directories could be attached to the L3/within the CCD, including the SerDes and so on.
But wouldn't a tag directory in the L3 be pretty useless? With the directory in the memory controller, the check can be done alongside the memory access; how would that be practical when the directory and memory controller are in different dies? And instead of one directory there would be as many as there are chiplets, so coherence traffic between chiplets would be many times higher than with the IOD.
 
