Speculation: Ryzen 4000 series/Zen 3


Antey

Member
Jul 4, 2019
105
153
116
Considering Intel has stated that Tiger Lake is a 10-65W design, you will likely never see parts that are <10W. Both Intel and AMD rarely release 6W parts. AMD's 6W parts are still 14nm, as an example.

If their plan is to replace the Kaby Lake Y series with a Lakefield successor in late 2021 or early 2022, I think Van Gogh has nothing to fear from Intel.
 

NostaSeronx

Diamond Member
Sep 18, 2011
3,686
1,221
136
Acktually Cezanne's Vega isn't like Renoir's Vega.
Renoir's Vega re-uses Vega2x/MI50/MI60/RVII w/ RDNA1 blocks.
Cezanne's Vega re-uses Vega-H(Non-numbered name of Vega3x/Arct1x)/MI100 w/ RDNA2 blocks.

CPUs and GPUs were leaked to be moving to 5nm in 1H2020. However, the leaks don't specify when the decision was made, which would have been shortly after trial production in 2018 and before risk production in 2019.

5nm didn't just miraculously appear in 2019...
[attached image: TSMCF12B.png]

N5P isn't as far off either:
"Design kits of N5P technology will be available in the next N5 revision in the second quarter of 2020."

And if you go all the way back to 2016:
[attached image: 5nm.png]
 

soresu

Platinum Member
Dec 19, 2014
2,657
1,858
136
Acktually Cezanne's Vega isn't like Renoir's Vega. [...] 5nm didn't just miraculously appear in 2019...
Not sure how you brought 5nm into this.

Both Arcturus/CDNA1 and RDNA2 are still 7nm, as is Zen3.

Although current Rembrandt information implies that both RDNA2 and Zen3 also have 5nm variants in the oven.
 

NostaSeronx

Diamond Member
Sep 18, 2011
3,686
1,221
136
MI60 world's first 7nm GPU. ( https://www.anandtech.com/show/12677/tsmc-kicks-off-volume-production-of-7nm-chips / https://community.amd.com/community...nm-gpu-and-fastest-double-precision-pcie-card )
V100 => March 27, 2018
MI100 world's first not 5nm GPU. ( https://www.tsmc.com/english/dedicatedFoundry/technology/5nm.htm "The momentum at 5nm node was carried on well into volume production which started in the first half of 2020." / ??? )
A100 => May 14, 2020

Very sus, imho.

We know RDNA2 didn't make the move completely, because of N7E being 7nm DUV on the consoles.

AMD went all-in at TSMC not for 7nm+ but for 5nm. They are a leading member of N5, along with at least Apple and HiSilicon. All the core/CU, L2, L3 and I/O test-chip macros for 5nm at AMD were finished in 2018. Any new preliminary chip in 2019 is the initial IP combination for 5nm, i.e. combining all the pieces into one pre-finished die.

Pilot 5nm => 2017 (TSMC SRAM/FinFET tests)
Trial 5nm => 2018 (test chips from customers)
Risk volume 5nm => 2019 (preliminary finished chips)
Ramp volume 5nm => 2020 (finished chips go out)

I found some indications of 7nm+ and 5nm overlapping:
With it being updated to the latest PDK here:
With another update to the latest PDK here:

Zen3 on 7nm EUV is canned (the one with AVX512).
Zen3 on 7nm DUV is just Zen2 with extended optimizations (10 to 20% perf boost: BD->PD, SR->XV, JG->PM, ZN->ZN+). (Not sure if we are about to see this one.)
Zen3 on 5nm EUV is a new architecture built for security, speed and power efficiency. (This one was SMT4, but from the original ARM target, not the recent AMD64 target.)

Core A = Person A, Team A
Core B = Person B, Team B
Core C&D = Person C, Team C <== Revolution in power.
Core E = Person A&B, Team A&B <== Revolution in performance.
 
  • Haha
Reactions: spursindonesia

Vattila

Senior member
Oct 22, 2004
799
1,351
136
Regarding L3 cache topology in the "Zen" CCX:

We've been over this before.

We have. But I cannot remember any consensus building around your interpretation. Actually, I'm surprised that you maintain this interpretation and present it as fact. For those interested, see discussion earlier in this thread:


Yes, very sure, because otherwise one of the L3 slices would be much faster than the other 3, due to only having to do one hop. Their speeds are too similar for this.

As pointed out in previous discussion, L3 access latency uniformity is achieved by address interleaving.

I think the conventional interpretation, as described in AMD's presentations and slides, is the correct one: the L3 cache controller acts as a crossbar between the 4 cache slices in a CCX, requiring 6 links for a fully connected topology.

For "Zen 3", I suspect they have added another four links between CCXs to create the 8-core unified L3 cache (for a maximum 2 hops between any two slices (a fully connected topology between 8 slices would require 28 links, which presumably is excessive). In short: a bigger crossbar.


 

Vattila

Senior member
Oct 22, 2004
799
1,351
136
No, Tuna is right; it's covered in the Zen 1 Hot Chips Q&A, when the Intel guys keep asking him cache questions. The way Michael Clark describes how each single-ported cache slice handles requests.

Do you have a link or quotation?

I suspect there is confusion. The idea that each L2 controller is a 4-way switch to the L3 slices seems wasteful (complexity, area, power). It makes address interleaving pointless also, as all L3 cache slices would have equal distance (1 hop). Address interleaving is pointed out on the AMD slides referred to earlier.

PS. Here is Michael Clark's presentation at Hot Chips 2016, discussing the L3 (note what he says about address interleaving and average latency):

 

Gideon

Golden Member
Nov 27, 2007
1,625
3,650
136
PS. Here is Michael Clark's presentation at Hot Chips 2016, discussing the L3 (note what he says about address interleaving and average latency):
He says "Every core sees the same average latency out of the L3" which seems to support @Tuna-Fish 's point.
Memory interleaving still makes plenty of sense (and is used on Zen, see link) even when not helpful for L3 access.
 
  • Like
Reactions: Tlh97

Vattila

Senior member
Oct 22, 2004
799
1,351
136
He says "Every core sees the same average latency out of the L3" which seems to support @Tuna-Fish 's point.

I'm surprised that you come to that conclusion. The way it sounds to me, Clark makes the point that memory interleaving is used to achieve uniform latency on average, thus weakening, not supporting, the slice-aware L2 interpretation.

Anyway, my point is that uniform latency is not an argument for Tuna-Fish's interpretation. Since he has pointed out that his interpretation rests on this fallacy, I suspect confusion and misinterpretation.
 

itsmydamnation

Platinum Member
Feb 6, 2011
2,764
3,131
136
Do you have a link or quotation?

I suspect there is confusion. The idea that each L2 controller is a 4-link crossbar to the L3 slices seems wasteful (complexity, area, power). It makes address interleaving pointless also, as all L3 cache slices would have equal distance. Address interleaving is pointed out on the AMD slides referred to earlier.

PS. Here is Michael Clark's presentation at Hot Chips 2016, discussing the L3 (note what he says about address interleaving and average latency):

I was pretty specific in stating where it was.


The buffers have to be at the destination slice, otherwise you would get even less uniform latency.

Anyway, my point is that uniform latency is not an argument for Tuna-Fish's interpretation. Since he has pointed out that his interpretation rests on this fallacy, I suspect confusion and misinterpretation.

It totally is.

How about this: every core is currently flushing ~16 cache lines from its L2 to the L3. Detail how each evicted cache line is tracked, hashed, transferred, acknowledged and written. With direct links from each core to each L3 slice, that is easy. With your topology, it is super hard.
 

NostaSeronx

Diamond Member
Sep 18, 2011
3,686
1,221
136
Isn't the L3 itself a crossbar?

Core X <-> Nearest L3 => 1 Link

L3 Nearest <-> All other L3 caches => 3 Links

Every L3 <-> Fabric => 4 links, but only one can be used at any time to receive from or send to the Cache-Coherent Master.

Each slice of L3 => 5+5+5+5 => 20 Links
L2 tags are duplicated in the L3 for fast cache transfers within a CCX. <== Which enables this.

In the case of the L2 as the crossbar, wouldn't the L3 tags need to be in the L2?

Zen2
2x16B/c IOD => CCX0-16B/c + CCX1-16B/c

Zen3
1x32B/c IOD => CCX0-32B/c

In this case there is no reason to have more fabric links. Four slices of L3 still.

Changes:
- 2 cores share a slice rather than one core.
- Local L3 slice still connects to 3 other slices.
- Four CCX L3 slices still interconnect to the Cache-Coherent Master.
6+6+6+6 = 24 links.

8 Single-core L3 CTLs with 2x20 Links => 4 Dual-core L3 CTLs with 1x24 Links
 
  • Like
Reactions: Vattila

Vattila

Senior member
Oct 22, 2004
799
1,351
136
I was pretty specific in stating where it was.

Sorry, but to me this doesn't clarify anything about the topology issue. They are discussing single vs multiple transfers at a single point in time (single-ported vs multi-ported), and Clark is reluctant to go into detail, only saying "we have buffering around it to handle that".
 

Tuna-Fish

Golden Member
Mar 4, 2011
1,346
1,525
136
We have. But I cannot remember any consensus building around your interpretation. Actually, I'm surprised that you maintain this interpretation and present it as fact. For those interested, see discussion earlier in this thread:

I have seen and read that; at some point I just stopped bothering to respond.

I think the conventional interpretation, as described in AMD's presentations and slides, is the correct one: the L3 cache controller acts as a crossbar between the 4 cache slices in a CCX, requiring 6 links for a fully connected topology.

That is not the conventional interpretation, and it is not backed by AMD's slides.

You are basing too much of your interpretation on a few arrows on a slide with no attempt to accurately portray the situation. The "L3 is crossbar"-interpretation requires substantial evidence for it, because it makes no sense whatsoever from an engineering standpoint, and since extraordinary claims require extraordinary evidence, it should be discarded. In contrast, there is substantial evidence for a fully connected topology from examining how code actually runs on the chip.

I'm surprised that you come to that conclusion. The way it sounds to me, Clark makes the point that memory interleaving is used to achieve uniform latency on average, thus weakening, not supporting, the L2 crossbar interpretation.

How exactly would address interleaving produce uniform latency in your topology? Just to make sure you get the basics right: every cache line lives in only one of the slices. Interleaving is done between adjacent cache lines. That is, cache line 0 (addresses 0x0..0x3f) is in slice 0, cache line 1 (0x40..0x7f) is in slice 1, and so on until cache line 4 (0x100..0x13f) is again in slice 0. It is easy to confirm this by allocating an array that fits into the L3, then accessing only every fourth cache line, measuring the throughput, and comparing that to accessing all of the array linearly. Accessing all of it gets substantially higher throughput.
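Something like this C sketch would do it (my own rough illustration of the experiment, not from any real test suite; the 16 MB buffer, 64-byte lines, the 4-way line interleave and the stride-4 trick are assumptions taken from this discussion, and you would want to pin the thread to a single core and average several runs):

#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define BUF_BYTES  (16u * 1024u * 1024u)   /* roughly one CCX worth of L3 */
#define LINE_BYTES 64u
#define LINES      (BUF_BYTES / LINE_BYTES)

static double now_sec(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return ts.tv_sec + ts.tv_nsec * 1e-9;
}

/* Touch one word in every 'stride'-th cache line of the buffer. */
static uint64_t sweep(const uint8_t *buf, size_t stride)
{
    uint64_t sum = 0;
    for (size_t i = 0; i < LINES; i += stride)
        sum += *(const volatile uint64_t *)(buf + i * LINE_BYTES);
    return sum;
}

int main(void)
{
    uint8_t *buf = malloc(BUF_BYTES);
    if (!buf) return 1;
    for (size_t i = 0; i < BUF_BYTES; i++) buf[i] = (uint8_t)i;  /* warm the cache */

    for (int rep = 0; rep < 5; rep++) {
        double t0 = now_sec();
        uint64_t a = sweep(buf, 1);   /* every line: spread across all slices (if interleaved) */
        double t1 = now_sec();
        uint64_t b = sweep(buf, 4);   /* every 4th line: mostly one slice (if interleaved) */
        double t2 = now_sec();
        printf("all lines: %6.1f Mlines/s   every 4th: %6.1f Mlines/s   (%llx %llx)\n",
               LINES / (t1 - t0) / 1e6, (LINES / 4.0) / (t2 - t1) / 1e6,
               (unsigned long long)a, (unsigned long long)b);
    }
    free(buf);
    return 0;
}

If the four slices really serve requests independently, the per-line rate of the full sweep should come out clearly higher than the stride-4 sweep, which is the effect described above.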

If access to any of the L3 slices went through the closest one, you would expect a substantially different latency to that one, yet you cannot see that. You can also measure that core 1 accessing L3 slice 4 does not impact the throughput of core 4 accessing L3 slice 1, which you would expect it to if they were sharing a link.

You keep repeating that 6 links are fewer than, and therefore better than, 16 (while ignoring that your interpretation actually includes 10 links, and that those extra 4 links would have to be different and substantially beefier, because you can see a throughput difference between accessing all of the cache versus just any one slice), without considering what, exactly, it is that you are saving.

Your method would cost more power, it would potentially require buffering at two places instead of one, and it would increase latencies because of the longer total distance traveled and because it would force a long-distance signal to be brought back down from the uppermost metal layers without purpose. And it would save nothing of consequence.

The physical links themselves are free, because they occupy an area of the die which would be just plain blank without them. As for routing logic, your interpretation has a 1x4 crossbar at every L3 (just because one of those links goes to the L3 slice itself doesn't mean you can leave it out), mine has a 1x4 crossbar at every core. Since there are as many cores as there are L3 slices, that comes out to the same amount of logic. Except your method would result in multiple routing hops per transmitted line, as opposed to the one I have, so it would actually require more logic for the same total throughput.

Consider the case of core 0 accessing L3 slice 4, and how it would work under heavy contention. Note that the distance between the core and the cache is multiple cycles, and so the core doing the request cannot know the readiness of the cache when it makes the request.

With a fully connected topology with no shared links:
Immediately after confirming the L2 miss, the core knows which slice the line is potentially in. Each core has a 1x4 crossbar, and can immediately send a request on it to the appropriate slice. Each slice has 4 separate reservation stations which store requests, one for each core. Every time a core sends a request, it consumes a slot, and every time it gets a response, it considers one slot freed. The core keeps track of the occupancy of its own reservation station at each slice, and is only allowed to send a request if there is a slot free. This way, you only need one layer of buffering to get full throughput out of any link.
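A toy model of that bookkeeping, just to make the credit scheme concrete (the 4-slice and 8-slots-per-core numbers and the simple line-interleave hash are illustrative assumptions, not anything AMD has documented):

#include <stdbool.h>
#include <stdio.h>

#define NUM_SLICES      4
#define SLOTS_PER_SLICE 8   /* per-core slots at each slice (made-up figure) */

typedef struct {
    int credits[NUM_SLICES];   /* free slots this core believes it still has */
} core_credit_t;

static void core_init(core_credit_t *c)
{
    for (int s = 0; s < NUM_SLICES; s++)
        c->credits[s] = SLOTS_PER_SLICE;
}

/* Which slice owns this physical address, assuming simple 64-byte line interleaving. */
static int slice_of(unsigned long paddr)
{
    return (int)((paddr >> 6) % NUM_SLICES);
}

/* Called after an L2 miss: returns true if the request may be sent right now. */
static bool try_send_request(core_credit_t *c, unsigned long paddr)
{
    int s = slice_of(paddr);
    if (c->credits[s] == 0)
        return false;          /* all of our slots at that slice are occupied */
    c->credits[s]--;           /* sending the request consumes one slot */
    return true;
}

/* Called when the response for an earlier request comes back. */
static void on_response(core_credit_t *c, unsigned long paddr)
{
    c->credits[slice_of(paddr)]++;   /* that slot is free again */
}

int main(void)
{
    core_credit_t core0;
    core_init(&core0);

    unsigned long addr = 0x1000;                 /* maps to slice 0 */
    for (int i = 0; i < 10; i++) {               /* burst of misses to one slice */
        bool ok = try_send_request(&core0, addr);
        printf("request %d to slice %d: %s\n", i, slice_of(addr),
               ok ? "sent" : "stalled (no credit)");
    }
    on_response(&core0, addr);                   /* one response frees one slot */
    printf("after a response: %s\n",
           try_send_request(&core0, addr) ? "sent" : "stalled");
    return 0;
}

The point of the per-core, per-slice slots is exactly what is described above: the sender never has to guess the receiver's state, so one layer of buffering at the slice is enough for full link throughput.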

With your topology:
After confirming L2 miss, the core knows which slice the line is in, but for some reason sends the request to the closest L3 slice instead. At this L3 there is a 1x4 crossbar which picks another slice as the target. Since the link between that L3 slice and the one you actually want to talk to might be saturated for reasons that core 0 cannot predict, there needs to be a buffer both at this point, and at the final L3 slice.
 

NostaSeronx

Diamond Member
Sep 18, 2011
3,686
1,221
136
@Tuna-Fish

[attached diagram: zenl3.png]

<-> (horizontal): 4 cycles
^|v (vertical): 2 cycles
^\v and v/^ (diagonal): 6 cycles

"A core complex (CCX) is composed of four Zen 2 cores and a shared level-3 (L3) cache. The L3 cache has four slices connected with a highly tuned fabric/network. Each L3 slice consists of an L3 controller, which reads and writes the L3 cache macro, and a cluster core interface that communicates with a core. The four slices of L3 are accessible by any core within the CCX. The distributed L3 cache control provides the design with improved control granularity. Each slice of L3 contains 4 MB of data for a total of 16 MB of L3 per CCX. The L3 cache is 16-way set-associative and is populated from L2 cache victims. The L3 is protected by DECTED ECC for reliability."

Each slice contains a controller and a CCI (cluster core interface) that communicates with a core (singular).
 
  • Like
Reactions: Vattila

Tuna-Fish

Golden Member
Mar 4, 2011
1,346
1,525
136
@Tuna-Fish


<-> (horizontal): 4 cycles
^|v (vertical): 2 cycles
^\v and v/^ (diagonal): 6 cycles

Yes. These latency differences correspond well to the actual distance differences for a fully connected topology. If there was an actual step down from the metal layers and processing, I would expect ~10 cycles minimum for one additional hop (there and back).

Or, to put it in other words, according to that diagram, the difference between one-directional access from core 0 to slice 0 or slice 1 is one clock cycle. Does anyone actually believe they can do routing, boost the signal to the uppermost metal, travel half the height of the CCX and then get the signal back down in one cycle?
 

JoeRambo

Golden Member
Jun 13, 2013
1,814
2,105
136
Yes. These latency differences correspond well to the actual distance differences for a fully connected topology. If there was an actual step down from the metal layers and processing, I would expect ~10 cycles minimum for one additional hop (there and back).

Agreed. AMD has a fully connected topology, without any doubt:

1) Latency is too uniform (±3 cycles don't count).
2) Memory-level parallelism is exceptional, a result of the 4 slices of L3 having independent resources without a common bottleneck like the ring is on Intel.

An important thing to note is that AMD is free to play around with the L3 cache architecture; there is nothing forcing them to have 8 slices of L3. They can go with 4, 6, 8, 12, whatever suits them.

For example, a 32MB L3 "client" configuration could have 4 slices of 8MB each: roughly the same per-core BW resources as current Zen 2, but more cache overall. Easy to keep a fully connected topology too.
A "server" or Zen3+ (or whatever next year's design is) could go with 48MB of L3 in 6 slices of 8MB: 50% more per-core and per-CCX BW to serve those BW-hungry tasks.

Right now the L3 is married to the core, as it contains shadow tags/status of the L2, but I guess there are ways around it.
 

Thala

Golden Member
Nov 12, 2014
1,355
653
136
The physical links themselves are free, because they occupy an area of the die which would be just plain blank without them. As for routing logic, your interpretation has 1x4 crossbar at every L3 (just because one of those links takes to the L3 slice itself doesn't mean you can leave it out), mine has a 1x4 crossbar at every core.

This! It essentially implies that both your and Vattila's proposals are topologically the same as far as the 4x L3$ is concerned, just drawn differently. You are arguing about nothing. It's just that Vattila has the 1x4 splitter implicitly. Also, there needs to be a 4x1 merger in addition, making the whole thing a 4x4 crossbar.
 
  • Like
Reactions: Vattila

TESKATLIPOKA

Platinum Member
May 1, 2020
2,355
2,848
106
The RX 570 and RX 5500 XT have similar FP32 performance, yet their real-world performance differs by 20%. And Vega/Polaris were manufactured on GlobalFoundries' 14nm process; TSMC 7nm is vastly better than that.
The 7nm process allowed the RX 5500 XT to reach comparable FP32 performance to the RX 570 even though it has a lot fewer CUs, because it clocks a lot higher, but this is also thanks to a better architecture.
BTW, I think it was Glo. who linked to a comparison between Navi and Polaris with comparable parameters and the same clocks. The result was 40% better IPC for Navi.
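As a rough sanity check of the "similar FP32" point, using the commonly listed shader counts and boost clocks (approximate figures, and actual sustained clocks differ between the two cards):

$$\text{RX 570: } 2048 \times 2 \times 1.244\ \text{GHz} \approx 5.1\ \text{TFLOPS} \qquad \text{RX 5500 XT: } 1408 \times 2 \times 1.845\ \text{GHz} \approx 5.2\ \text{TFLOPS}$$

Near-identical peak FP32, which is why the ~20% real-world gap points at IPC/architecture rather than raw throughput.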

Back to Van Gogh. If Van Gogh has 8 CUs, then it won't be much better than Tiger Lake with 96 EUs, but that Tiger Lake won't have such a low TDP.
I think Van Gogh will have only 4 cores, and that's the reason why they used Zen 2, which has 4 cores per CCX, while Zen 3 is supposed to have 8 cores per CCX.
 
  • Like
Reactions: Tlh97

eek2121

Platinum Member
Aug 2, 2005
2,930
4,026
136
The only thing I'll say for a long time in regard to AM5 is that a lot of possibilities are opened up with the new socket.


It's literally Renoir but with Zen 3 cores.

The GPU may also get a speed bump. It may be Vega by name, but Vega in the APU is a totally different beast from Vega desktop.
 

Ajay

Lifer
Jan 8, 2001
15,431
7,849
136
So every core has a link to every L3 slice, or 4*4 = 16. With an 8-core CCX with 8 L3 slices (and I keep pointing this out, there is no fundamental reason why L3 slice count must be equal to core count!), there will be 8*8 = 64 links.
So, basically, a fully meshed design. I get how that helps maintain a more even latency, so I suppose it is better (from an area perspective as well). Do we have any idea how many slices there will be per L3$?
 

Ajay

Lifer
Jan 8, 2001
15,431
7,849
136
The L2$ controllers aren't a crossbar; they are simple p2p switches. A crossbar would be all 8 L2 controllers being connected to a fully meshed switch, which is then connected to each L3 slice. That, or I've completely forgotten what I learned from the EEs while working at an enterprise network hardware company.
 
  • Like
Reactions: Tlh97 and Vattila

A///

Diamond Member
Feb 24, 2017
4,352
3,154
136
At some point you figure AMD will start doing IGP chiplets.

Might take longer than on desktop? Baseless rumors have it that Nvidia's Hopper will go chiplet, as will RDNA3, whether that be 12-24 months from now.