Speculation: Ryzen 4000 series/Zen 3


Antey

Member
Jul 4, 2019
105
153
116
Considering Intel has stated that Tiger Lake is a 10-65W design, you will likely never see parts that are <10W. Both Intel and AMD rarely release 6W parts. AMD's 6W parts are still 14nm, as an example.

If their plan is to replace the Kaby Lake Y series with a Lakefield successor in late 2021 or early 2022, I think Van Gogh has nothing to fear from Intel.
 

NostaSeronx

Diamond Member
Sep 18, 2011
3,686
1,221
136
Acktually Cezanne's Vega isn't like Renoir's Vega.
Renoir's Vega re-uses Vega2x/MI50/MI60/RVII w/ RDNA1 blocks.
Cezanne's Vega re-uses Vega-H(Non-numbered name of Vega3x/Arct1x)/MI100 w/ RDNA2 blocks.

CPUs and GPUs were leaked to be moving to 5nm in 1H2020. However, the leaks don't specify when the decision was made, which would have been shortly after trial production in 2018 and before risk production in 2019.

5nm didn't just miraculously appear in 2019...
[attached image: TSMCF12B.png]

N5P isn't as far off either:
"Design kits of N5P technology will be available in the next N5 revision in the second quarter of 2020."

And if you go all the way back to 2016:
[attached image: 5nm.png]
 

soresu

Platinum Member
Dec 19, 2014
2,657
1,858
136
Acktually Cezanne's Vega isn't like Renoir's Vega. [...] 5nm didn't just miraculously appear in 2019...
Not sure how you brought 5nm into this.

Both Arcturus/CDNA1 and RDNA2 are still 7nm, as is Zen3.

Although current Rembrandt information implies that both RDNA2 and Zen3 also have 5nm variants in the oven.
 

NostaSeronx

Diamond Member
Sep 18, 2011
3,686
1,221
136
MI60 world's first 7nm GPU. ( https://www.anandtech.com/show/12677/tsmc-kicks-off-volume-production-of-7nm-chips / https://community.amd.com/community...nm-gpu-and-fastest-double-precision-pcie-card )
V100 => March 27, 2018
MI100 world's first not 5nm GPU. ( https://www.tsmc.com/english/dedicatedFoundry/technology/5nm.htm "The momentum at 5nm node was carried on well into volume production which started in the first half of 2020." / ??? )
A100 => May 14, 2020

Very sus, imho.

We know RDNA2 didn't make the move completely, because of N7E being 7nm DUV on the consoles.

AMD went all-in at TSMC not for 7nm+ but for 5nm. They are a leading member of N5, along with at least Apple and HiSilicon. All the core/CU, L2, L3 and I/O test-chip macros for 5nm at AMD were finished in 2018. Any new preliminary chip in 2019 is the initial IP combination for 5nm, i.e. combining all the pieces into one pre-finished die.

Pilot 5nm => 2017 (TSMC SRAM/FinFET tests)
Trial 5nm => 2018 (test chips from customers)
Risk volume 5nm => 2019 (preliminary finished chips)
Ramp volume 5nm => 2020 (finished chips go out)

I found some indications of 7nm+ and 5nm overlapping:
With it being updated to the latest PDK here:
With another update to the latest PDK here:

Zen3 on 7nm EUV is canned (the one with AVX512).
Zen3 on 7nm DUV is just Zen2 with extended optimizations (10 to 20% perf boost: BD->PD, SR->XV, JG->PM, ZN->ZN+). (Not sure if we are about to see this one.)
Zen3 on 5nm EUV is a new architecture built for security, speed and power efficiency. (This one was SMT4, but from the original ARM target, not the recent AMD64 target.)

Core A = Person A, Team A
Core B = Person B, Team B
Core C&D = Person C, Team C <== Revolution in power.
Core E = Person A&B, Team A&B <== Revolution in performance.
 
  • Haha
Reactions: spursindonesia

Vattila

Senior member
Oct 22, 2004
799
1,351
136
Regarding L3 cache topology in the "Zen" CCX:

We've been over this before.

We have. But I cannot remember any consensus building around your interpretation. Actually, I'm surprised that you maintain this interpretation and present it as fact. For those interested, see discussion earlier in this thread:


Yes, very sure, because otherwise one of the L3 slices would be much faster than the other 3, due to only having to do one hop. Their speeds are too similar for this.

As pointed out in previous discussion, L3 access latency uniformity is achieved by address interleaving.

I think the conventional interpretation, as described in AMD's presentations and slides, is the correct one: the L3 cache controller acts as a crossbar between the 4 cache slices in a CCX, requiring 6 links for a fully connected topology.

For "Zen 3", I suspect they have added another four links between CCXs to create the 8-core unified L3 cache (for a maximum 2 hops between any two slices (a fully connected topology between 8 slices would require 28 links, which presumably is excessive). In short: a bigger crossbar.


 

Vattila

Senior member
Oct 22, 2004
799
1,351
136
No, Tuna is right; it's covered in the Zen 1 Hot Chips Q&A, when the Intel guys keep asking him cache questions. The way Michael Clark describes how each single-ported cache slice handles requests.

Do you have a link or quotation?

I suspect there is confusion. The idea that each L2 controller is a 4-way switch to the L3 slices seems wasteful (complexity, area, power). It makes address interleaving pointless also, as all L3 cache slices would have equal distance (1 hop). Address interleaving is pointed out on the AMD slides referred to earlier.

PS. Here is Michael Clark's presentation at Hot Chips 2016, discussing the L3 (note what he says about address interleaving and average latency):

 

Gideon

Golden Member
Nov 27, 2007
1,625
3,650
136
PS. Here is Michael Clark's presentation at Hot Chips 2016, discussing the L3 (note what he says about address interleaving and average latency):
He says "Every core sees the same average latency out of the L3" which seems to support @Tuna-Fish 's point.
Memory interleaving still makes plenty of sense (and is used on Zen, see link) even when not helpful for L3 access.
 
  • Like
Reactions: Tlh97

Vattila

Senior member
Oct 22, 2004
799
1,351
136
He says "Every core sees the same average latency out of the L3" which seems to support @Tuna-Fish 's point.

I'm surprised that you come to that conclusion. The way it sounds to me, Clark makes the point that memory interleaving is used to achieve uniform latency on average, thus weakening, not supporting, the slice-aware L2 interpretation.

Anyway, my point is that uniform latency is not an argument for Tuna-Fish's interpretation. Since he has pointed out that his interpretation rests on this fallacy, I suspect confusion and misinterpretation.
 

itsmydamnation

Platinum Member
Feb 6, 2011
2,764
3,131
136
Do you have a link or quotation?

I suspect there is confusion. The idea that each L2 controller is a 4-link crossbar to the L3 slices seems wasteful (complexity, area, power). It makes address interleaving pointless also, as all L3 cache slices would have equal distance. Address interleaving is pointed out on the AMD slides referred to earlier.

PS. Here is Michael Clark's presentation at Hot Chips 2016, discussing the L3 (note what he says about address interleaving and average latency):

I was pretty specific in stating where it was.


The buffers have to be at the destination slice, otherwise you would get even less uniform latency.

Anyway, my point is that uniform latency is not an argument for Tuna-Fish's interpretation. Since he has pointed out that his interpretation rests on this fallacy, I suspect confusion and misinterpretation.

It totally is.

How about this: every core is currently flushing ~16 cache lines from its L2 to the L3. Detail how each evicted cache line is tracked, hashed, transferred, acknowledged and written. With direct links from each core to each L3 slice, that is easy. With your topology, it is super hard.
 

NostaSeronx

Diamond Member
Sep 18, 2011
3,686
1,221
136
Isn't the L3 itself a crossbar?

Core X <-> Nearest L3 => 1 Link

L3 Nearest <-> All other L3 caches => 3 Links

Every L3 <-> Fabric => 4 links, but only one can be used at any time to receive from or send to the Cache-Coherent Master.

Each slice of L3 => 5+5+5+5 => 20 Links
L2 tags are duplicated in the L3 for fast cache transfers within a CCX. <== Which enables this.

In the case of the L2 as the crossbar, wouldn't the L3 tags need to be in the L2?

Zen2
2x16B/c IOD => CCX0-16B/c + CCX1-16B/c

Zen3
1x32B/c IOD => CCX0-32B/c

In this case there is no reason to have more fabric links. Four slices of L3 still.

Changes:
- 2 cores share a slice rather than one core.
- Local L3 slice still connects to 3 other slices.
- Four CCX L3 slices still interconnect to the Cache-Coherent Master.
6+6+6+6 = 24 links.

8 Single-core L3 CTLs with 2x20 Links => 4 Dual-core L3 CTLs with 1x24 Links
 
  • Like
Reactions: Vattila

Vattila

Senior member
Oct 22, 2004
799
1,351
136
I was pretty specific in stating where it was.

Sorry, but to me this doesn't clarify anything about the topology issue. They are discussing single vs multiple transfers at a single point in time (single-ported vs multi-ported), and Clark is reluctant to go into detail, only saying "we have buffering around it to handle that".
 

Tuna-Fish

Golden Member
Mar 4, 2011
1,346
1,525
136
We have. But I cannot remember any consensus building around your interpretation. Actually, I'm surprised that you maintain this interpretation and present it as fact. For those interested, see discussion earlier in this thread:

I have seen and read that; at some point I just stopped bothering to respond.

I think the conventional interpretation, as described in AMD's presentations and slides, is the correct one: the L3 cache controller acts as a crossbar between the 4 cache slices in a CCX, requiring 6 links for a fully connected topology.

That is not the conventional interpretation, and it is not backed by AMD's slides.

You are basing too much of your interpretation on a few arrows on a slide with no attempt to accurately portray the situation. The "L3 is crossbar"-interpretation requires substantial evidence for it, because it makes no sense whatsoever from an engineering standpoint, and since extraordinary claims require extraordinary evidence, it should be discarded. In contrast, there is substantial evidence for a fully connected topology from examining how code actually runs on the chip.

I'm surprised that you come to that conclusion. The way it sounds to me, Clark makes the point that memory interleaving is used to achieve uniform latency on average, thus weakening, not supporting, the L2 crossbar interpretation.

How exactly would address interleaving produce uniform latency in your topology? Just to make sure you get the basics right: every cache line lives in only one of the slices. Interleaving is done between adjacent cache lines. That is, cache line 0 (addresses 0x0..0x3f) is in slice 0, cache line 1 (0x40..0x7f) is in slice 1, and so on until cache line 4 (0x100..0x13f) is again in slice 0. It is easy to confirm this by allocating an array that fits into the L3, then accessing only every fourth cache line, measuring the throughput, and comparing that to accessing all of the array linearly. Accessing all of it gets substantially higher throughput.
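Something like this C sketch would do it (my own rough illustration of the experiment, not from any real test suite; the 16 MB buffer, 64-byte lines, the 4-way line interleave and the stride-4 trick are assumptions taken from this discussion, and you would want to pin the thread to a single core and average several runs):

#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define BUF_BYTES  (16u * 1024u * 1024u)   /* roughly one CCX worth of L3 */
#define LINE_BYTES 64u
#define LINES      (BUF_BYTES / LINE_BYTES)

static double now_sec(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return ts.tv_sec + ts.tv_nsec * 1e-9;
}

/* Touch one word in every 'stride'-th cache line of the buffer. */
static uint64_t sweep(const uint8_t *buf, size_t stride)
{
    uint64_t sum = 0;
    for (size_t i = 0; i < LINES; i += stride)
        sum += *(const volatile uint64_t *)(buf + i * LINE_BYTES);
    return sum;
}

int main(void)
{
    uint8_t *buf = malloc(BUF_BYTES);
    if (!buf) return 1;
    for (size_t i = 0; i < BUF_BYTES; i++) buf[i] = (uint8_t)i;  /* warm the cache */

    for (int rep = 0; rep < 5; rep++) {
        double t0 = now_sec();
        uint64_t a = sweep(buf, 1);   /* every line: spread across all slices (if interleaved) */
        double t1 = now_sec();
        uint64_t b = sweep(buf, 4);   /* every 4th line: mostly one slice (if interleaved) */
        double t2 = now_sec();
        printf("all lines: %6.1f Mlines/s   every 4th: %6.1f Mlines/s   (%llx %llx)\n",
               LINES / (t1 - t0) / 1e6, (LINES / 4.0) / (t2 - t1) / 1e6,
               (unsigned long long)a, (unsigned long long)b);
    }
    free(buf);
    return 0;
}

If the four slices really serve requests independently, the per-line rate of the full sweep should come out clearly higher than the stride-4 sweep, which is the effect described above.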

If access to any of the L3 slices went through the closest one, you would expect a substantially different latency to that one, yet you cannot see that. You can also measure that core 1 accessing L3 slice 4 does not impact the throughput of core 4 accessing L3 slice 1, which you would expect it to if they were sharing a link.

You keep repeating that 6 links are fewer than, and therefore better than, 16 (while ignoring that your interpretation actually includes 10 links, and that those extra 4 links would have to be different and substantially beefier, because you can see a throughput difference between accessing all of the cache versus just any one slice), without considering what, exactly, it is that you are saving.

Your method would cost more power, it would potentially require buffering at two places instead of one, and it would increase latencies because of the longer total distance traveled and because it would force a long-distance signal to be brought back down from the uppermost metal layers without purpose. And it would save nothing of consequence.

The physical links themselves are free, because they occupy an area of the die which would be just plain blank without them. As for routing logic, your interpretation has a 1x4 crossbar at every L3 (just because one of those links goes to the L3 slice itself doesn't mean you can leave it out), mine has a 1x4 crossbar at every core. Since there are as many cores as there are L3 slices, that comes out to the same amount of logic. Except your method would result in multiple routing hops per transmitted line, as opposed to the one I have, so it would actually require more logic for the same total throughput.

Consider the case of core 0 accessing L3 slice 4, and how it would work under heavy contention. Note that the distance between the core and the cache is multiple cycles, and so the core doing the request cannot know the readiness of the cache when it makes the request.

With a fully connected topology with no shared links:
Immediately after confirming the L2 miss, the core knows which slice the line is potentially in. Each core has a 1x4 crossbar, and can immediately send a request on it to the appropriate slice. Each slice has 4 separate reservation stations which store requests, one for each core. Every time a core sends a request, it consumes a slot, and every time it gets a response, it considers one slot freed. The core keeps track of the occupancy of its own reservation station at each slice, and is only allowed to send a request if there is a slot free. This way, you only need one layer of buffering to get full throughput out of any link.
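A toy model of that bookkeeping, just to make the credit scheme concrete (the 4-slice and 8-slots-per-core numbers and the simple line-interleave hash are illustrative assumptions, not anything AMD has documented):

#include <stdbool.h>
#include <stdio.h>

#define NUM_SLICES      4
#define SLOTS_PER_SLICE 8   /* per-core slots at each slice (made-up figure) */

typedef struct {
    int credits[NUM_SLICES];   /* free slots this core believes it still has */
} core_credit_t;

static void core_init(core_credit_t *c)
{
    for (int s = 0; s < NUM_SLICES; s++)
        c->credits[s] = SLOTS_PER_SLICE;
}

/* Which slice owns this physical address, assuming simple 64-byte line interleaving. */
static int slice_of(unsigned long paddr)
{
    return (int)((paddr >> 6) % NUM_SLICES);
}

/* Called after an L2 miss: returns true if the request may be sent right now. */
static bool try_send_request(core_credit_t *c, unsigned long paddr)
{
    int s = slice_of(paddr);
    if (c->credits[s] == 0)
        return false;          /* all of our slots at that slice are occupied */
    c->credits[s]--;           /* sending the request consumes one slot */
    return true;
}

/* Called when the response for an earlier request comes back. */
static void on_response(core_credit_t *c, unsigned long paddr)
{
    c->credits[slice_of(paddr)]++;   /* that slot is free again */
}

int main(void)
{
    core_credit_t core0;
    core_init(&core0);

    unsigned long addr = 0x1000;                 /* maps to slice 0 */
    for (int i = 0; i < 10; i++) {               /* burst of misses to one slice */
        bool ok = try_send_request(&core0, addr);
        printf("request %d to slice %d: %s\n", i, slice_of(addr),
               ok ? "sent" : "stalled (no credit)");
    }
    on_response(&core0, addr);                   /* one response frees one slot */
    printf("after a response: %s\n",
           try_send_request(&core0, addr) ? "sent" : "stalled");
    return 0;
}

The point of the per-core, per-slice slots is exactly what is described above: the sender never has to guess the receiver's state, so one layer of buffering at the slice is enough for full link throughput.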

With your topology:
After confirming L2 miss, the core knows which slice the line is in, but for some reason sends the request to the closest L3 slice instead. At this L3 there is a 1x4 crossbar which picks another slice as the target. Since the link between that L3 slice and the one you actually want to talk to might be saturated for reasons that core 0 cannot predict, there needs to be a buffer both at this point, and at the final L3 slice.
 

NostaSeronx

Diamond Member
Sep 18, 2011
3,686
1,221
136
@Tuna-Fish

[attached diagram: zenl3.png]

<-> (horizontal): 4 cycles
^|v (vertical): 2 cycles
^\v and v/^ (diagonal): 6 cycles

"A core complex (CCX) is composed of four Zen 2 cores and a shared level-3 (L3) cache. The L3 cache has four slices connected with a highly tuned fabric/network. Each L3 slice consists of an L3 controller, which reads and writes the L3 cache macro, and a cluster core interface that communicates with a core. The four slices of L3 are accessible by any core within the CCX. The distributed L3 cache control provides the design with improved control granularity. Each slice of L3 contains 4 MB of data for a total of 16 MB of L3 per CCX. The L3 cache is 16-way set-associative and is populated from L2 cache victims. The L3 is protected by DECTED ECC for reliability."

Each slice contains a controller and a CCI (cluster core interface) that communicates with a core (singular).
 
  • Like
Reactions: Vattila

Tuna-Fish

Golden Member
Mar 4, 2011
1,346
1,525
136
@Tuna-Fish


<-> (horizontal): 4 cycles
^|v (vertical): 2 cycles
^\v and v/^ (diagonal): 6 cycles

Yes. These latency differences correspond well to the actual distance differences for a fully connected topology. If there was an actual step down from the metal layers and processing, I would expect ~10 cycles minimum for one additional hop (there and back).

Or, to put it in other words, according to that diagram, the difference between one-directional access from core 0 to slice 0 or slice 1 is one clock cycle. Does anyone actually believe they can do routing, boost the signal to the uppermost metal, travel half the height of the CCX and then get the signal back down in one cycle?
 

JoeRambo

Golden Member
Jun 13, 2013
1,814
2,105
136
Yes. These latency differences correspond well to the actual distance differences for a fully connected topology. If there was an actual step down from the metal layers and processing, I would expect ~10 cycles minimum for one additional hop (there and back).

Agreed. AMD has a fully connected topology, without any doubt:

1) Latency is too uniform (±3 cycles don't count).
2) Memory-level parallelism is exceptional, a result of the 4 slices of L3 having independent resources without a common bottleneck like the ring is on Intel.

An important thing to note is that AMD is free to play around with the L3 cache architecture; there is nothing forcing them to have 8 slices of L3. They can go with 4, 6, 8, 12, whatever suits them.

For example, a 32MB L3 "client" configuration could have 4 slices of 8MB each: roughly the same per-core BW resources as current Zen 2, but more cache overall. Easy to keep a fully connected topology too.
A "server" or Zen3+ (or whatever next year's design is) could go with 48MB of L3 in 6 slices of 8MB: 50% more per-core and per-CCX BW to serve those BW-hungry tasks.

Right now the L3 is married to the core, as it contains shadow tags/status of the L2, but I guess there are ways around it.
 

Thala

Golden Member
Nov 12, 2014
1,355
653
136
The physical links themselves are free, because they occupy an area of the die which would be just plain blank without them. As for routing logic, your interpretation has 1x4 crossbar at every L3 (just because one of those links takes to the L3 slice itself doesn't mean you can leave it out), mine has a 1x4 crossbar at every core.

This! It essentially implies that both your and Vattila's proposals are topologically the same as far as the 4x L3$ is concerned, just drawn differently. You are arguing about nothing. It's just that Vattila has the 1x4 splitter implicitly. Also, there needs to be a 4x1 merger in addition, making the whole thing a 4x4 crossbar.
 
  • Like
Reactions: Vattila

TESKATLIPOKA

Platinum Member
May 1, 2020
2,355
2,848
106
The RX 570 and RX 5500 XT have similar FP32 performance, yet their real-world performance differs by 20%. And Vega/Polaris were manufactured on GlobalFoundries' 14nm process; TSMC 7nm is vastly better than that.
The 7nm process allowed the RX 5500 XT to reach comparable FP32 performance to the RX 570 even though it has a lot fewer CUs, because it clocks a lot higher, but this is also thanks to a better architecture.
BTW, I think it was Glo. who linked to a comparison between Navi and Polaris with comparable parameters and the same clocks. The result was 40% better IPC for Navi.
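As a rough sanity check of the "similar FP32" point, using the commonly listed shader counts and boost clocks (approximate figures, and actual sustained clocks differ between the two cards):

$$\text{RX 570: } 2048 \times 2 \times 1.244\ \text{GHz} \approx 5.1\ \text{TFLOPS} \qquad \text{RX 5500 XT: } 1408 \times 2 \times 1.845\ \text{GHz} \approx 5.2\ \text{TFLOPS}$$

Near-identical peak FP32, which is why the ~20% real-world gap points at IPC/architecture rather than raw throughput.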

Back to Van Gogh. If Van Gogh has 8 CUs, then it won't be much better than Tiger Lake with 96 EUs, but that Tiger Lake won't have such a low TDP.
I think Van Gogh will have only 4 cores, and that's the reason why they used Zen 2, which has 4 cores per CCX, while Zen 3 is supposed to have 8 cores per CCX.
 
  • Like
Reactions: Tlh97

eek2121

Platinum Member
Aug 2, 2005
2,930
4,026
136
The only thing I'll say for a long time in regard to AM5 is that a lot of possibilities are opened up with the new socket.


It's literally Renoir but with Zen 3 cores.

The GPU may also get a speed bump. It may be Vega by name, but Vega in the APU is a totally different beast from Vega desktop.
 

Ajay

Lifer
Jan 8, 2001
15,431
7,849
136
So every core has a link to every L3 slice, or 4*4 = 16. With an 8-core CCX with 8 L3 slices (and I keep pointing this out, there is no fundamental reason why L3 slice count must be equal to core count!), there will be 8*8 = 64 links.
So, basically, a fully meshed design. I get how that helps maintain a more even latency, so I suppose it is better (from an area perspective as well). Do we have any idea how many slices there will be per L3$?
 

Ajay

Lifer
Jan 8, 2001
15,431
7,849
136
The L2$ controllers aren't a crossbar; they are simple p2p switches. A crossbar would be all 8 L2 controllers being connected to a fully meshed switch, which is then connected to each L3 slice. That, or I've completely forgotten what I learned from the EEs while working at an enterprise network hardware company.
 
  • Like
Reactions: Tlh97 and Vattila

A///

Diamond Member
Feb 24, 2017
4,352
3,154
136
At some point you figure AMD will start doing IGP chiplets.

Might take longer than on desktop? Baseless rumors have it that Nvidia's Hopper will go chiplet, as will RDNA3, whether that be 12-24 months from now.