Speculation: Ryzen 4000 series/Zen 3


moonbogg

Lifer
Jan 8, 2011
10,731
3,440
136
I'm thinking we'll hear something this month? Outing Zen 3 before Big Navi would be AMD's best option. Testing the new GPUs should be done on PCIe 4.0 rigs. Why not seed some Zen 3s to a handful of reputable reviewers for the task?

That's a good idea, although it could backfire if the i9s are still faster at gaming. The few percent performance difference between PCIe generations would get smashed by even moderate CPU performance differences. I expect Ryzen 4000 to be better than all the Ryzen generations before it, and therefore it will be more than good enough for me, but you know what people will say if the clocks aren't high enough to overtake the i9s. They will say i9s should be used to test the new GPUs, but the i9s have old PCIe slots while Ryzen 4000 has new PCIe slots, yet the i9 is 8% faster at games while PCIe 4.0 gives 5% more GPU performance, and then everyone's head just explodes all over the place trying to decide what to do.
 
  • Like
Reactions: french toast

Vattila

Senior member
Oct 22, 2004
820
1,456
136
but you know what people will say if the clocks aren't high enough to overtake the i9s.

The sentiment is changing — at least in some communities. Hardware Unboxed polled their audience, which voted overwhelmingly (83% to 17%) to replace their i9-10900K-based GPU test platform with a Ryzen 9 3950X-based platform with PCIe 4.0 support. So they did.

 
Last edited:

Kenmitch

Diamond Member
Oct 10, 1999
8,505
2,250
136
That's a good idea, although it could backfire if the i9s are still faster at gaming. The few percent performance difference between PCIe generations would get smashed by even moderate CPU performance differences.

Well, there's no such thing as too much information when making a buying decision.

When viewing the fancy FPS graphs, it's best to ignore the maximum FPS and look at the minimums or averages. Who cares if there was a spike of 20-30 fps for a split second or two?

Have you watched this video yet?


You can already see the effects of PCIe bandwidth in certain games and on GPUs that have nowhere near the grunt of the 3090 or 3080. Watch the minimums when they get to the comparisons.
 

Mopetar

Diamond Member
Jan 31, 2011
8,451
7,664
136
The video is interesting for the results alone, but I think the part about their user poll overwhelmingly favoring the AMD CPU for their benchmarks is far more intriguing. It was hard enough to imagine AMD being able to have a competitive product only a few years ago, but the fact that they've gained so much mindshare in the enthusiast market is astonishing. I suspect a lot of people just like the underdog success story, but Intel got blown out in that poll.
 

A///

Diamond Member
Feb 24, 2017
4,351
3,160
136
WOW, between RTX 3000 cards and AMD 4000 CPUs, the gaming world could change drastically in the next 6 months or less (once both are available)
Indeed! I think we've still got six months of this mess of a pandemic, so a good time for gaming? A few of the games I was waiting on turned out to be polished turds, such as PC3. Good grief!
 

LightningZ71

Platinum Member
Mar 10, 2017
2,413
3,075
136
I suspect that PCIe 4 will really start to stretch its legs when games start to implement DirectStorage from DirectX. Being able to transfer textures and other data directly from high-speed NVMe drives over PCIe 4 will let the GPU keep operating at maximum performance without interruption, and it bypasses the CPU so that it doesn't have to handle the memory transfers itself.
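To put rough numbers on the bandwidth side of that (a back-of-the-envelope sketch; the per-lane rates and 128b/130b line encoding are the standard spec values, though real-world throughput is somewhat lower due to packet overhead):

```python
# Back-of-the-envelope PCIe throughput per generation.
GT_PER_S = {"PCIe 3.0": 8.0, "PCIe 4.0": 16.0}  # giga-transfers/s per lane
ENCODING = 128 / 130                             # usable payload fraction

for gen, gt in GT_PER_S.items():
    gb_per_lane = gt * ENCODING / 8              # GB/s per lane (8 bits/byte)
    print(f"{gen}: x4 NVMe ~{4 * gb_per_lane:.1f} GB/s, "
          f"x16 GPU ~{16 * gb_per_lane:.1f} GB/s")
```

So a PCIe 4.0 x4 NVMe drive can in principle stream data at roughly 7.9 GB/s, double what the same drive gets from a PCIe 3.0 slot.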
 
  • Like
Reactions: lightmanek

jamescox

Senior member
Nov 11, 2009
644
1,105
136
Although this is not my field of expertise, I very much doubt these differences are solely due to interconnect wire delay, as you suggest. Intuitively, from the little I know about this, it seems to me that a wire delay of 6 clock cycles is excessive. Assuming that "wires have an approximate propagation delay of 1 ns for every 6 inches (15 cm) of length" (Wikipedia), which equates to about half the speed of light, then signals travel 30 mm/cycle at 5 GHz. Hence, a signal can travel more than 180 mm in 6 cycles! The whole Zen L3 is 16 mm², about 2 to 1 rectangular; so around 6.3 mm diagonal, and 8.5 mm between opposite corners along the periphery.

Even if the actual L3 interconnect wire delay is twice that Wikipedia quote, the difference in wire lengths from any L2 to any L3 controller should not require extra cycles at all — the worst-case wire delay within the L3 should still be well within a single clock cycle (a signal would still cover more than 15 mm per cycle, comfortably beyond the 8.5 mm worst-case path).

PS. By the way, for those interested, here is a PhD thesis I found while reading up on interconnect delay: "Efficient High-Speed On-Chip Global Interconnects", Peter Caputa, 2006. It has a nice introduction to high-speed on-chip interconnect, and proposes an upper metal layer interconnect approaching lightspeed.
I don't know that much about the quoted measurements. I would expect the wire delay to be a fraction of some pipeline stages. At 4 GHz, a clock cycle is only 0.25 nanoseconds, so I suspect that interconnect adds some number of cycles; it just may not all be in one cycle. It is actually quite amazing that they have managed to keep the pipeline stages as low as they are in current designs. The Pentium 4 actually had a pipeline stage just labeled "drive" to send signals a long distance. That was maybe 130 nm tech though, and they originally thought they were going to scale up to much higher clock speeds than they actually achieved.

Any modern, low-power design is going to aggressively optimize interconnect lengths. Long interconnect is terrible for power consumption. If you double the length of a wire, you double the resistance, which means more lost power. You also get a bigger voltage drop the longer the wire is, so it may need to be driven at a higher voltage, which hits power consumption even more. I don't know if the interconnect is long enough for that to be an issue here, though.

The wire delay is not a simple thing. I worked at a company doing parasitic extraction for circuit simulation, although I wasn't actually involved in that software. An extractor takes the geometry (12 to 13 metal layers of interconnect in modern CPUs) and converts it to resistor and capacitor networks for circuit simulation. Any wire would be modeled by possibly multiple resistors and also multiple parasitic capacitors. The voltage at the other end of the wire can't change instantly, or even at the speed of light, since it has to charge up a certain amount of parasitic capacitance before the voltage is high enough to change the state of the transistor on the receiving end. The wires are wider traces (lower resistance) on the upper layers to carry high-speed signals longer distances, but that just makes them bigger capacitors.
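To illustrate the distinction, here is a toy comparison of the pure propagation bound from the earlier quote against a crude lumped-RC estimate. The per-mm resistance and capacitance are made-up placeholder values, not real process parameters; the point is only that RC delay grows with the square of the wire length, while the lightspeed bound grows linearly:

```python
# Toy wire-delay comparison (illustrative numbers, not process data).

# 1) Pure propagation bound, per the quoted rule of thumb:
#    ~1 ns per 15 cm of wire, i.e. roughly half the speed of light.
ns_per_mm = 1.0 / 150
cycle_ns = 1.0 / 4.0            # one clock cycle at 4 GHz = 0.25 ns
print(f"propagation only: ~{cycle_ns / ns_per_mm:.0f} mm per cycle")

# 2) Crude RC estimate for a long on-chip wire, using assumed
#    per-mm parasitics (Elmore delay of a distributed line ~ RC/2):
r_per_mm = 200.0                # ohms/mm  -- assumed
c_per_mm = 0.2e-12              # farads/mm -- assumed
for length_mm in (2, 4, 8):
    rc_ns = 0.5 * (r_per_mm * length_mm) * (c_per_mm * length_mm) * 1e9
    print(f"{length_mm} mm wire: ~{rc_ns:.2f} ns "
          f"({rc_ns / cycle_ns:.1f} cycles at 4 GHz)")
```

With those invented numbers, the 8 mm wire is RC-limited to several cycles even though light could cross it in a small fraction of one, which is exactly why long runs get repeaters and wide top-metal traces.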

I still suspect most of this cache architecture confusion is due to a bad drawing on a marketing slide. AMD stated that all slices have the same average latency, which means that there cannot be any paths with multiple hops, or a differing number of hops in some manner. I don't think that there is any slice owned by any core; they are all equally accessible to any core. Actually getting a measurement that is exactly the same would probably be difficult, though. It has to have a buffer or queue to handle multiple requests to the same cache slice in the same clock. A cache slice would be able to fill only one request per clock. It probably has to buffer both sending the address and sending the data, since it is probably possible to have multiple outstanding requests from a single core.
 

Vattila

Senior member
Oct 22, 2004
820
1,456
136
The wire delay is not a simple thing.

It is an interesting topic. The PhD thesis I linked describes the issues you mention and discusses how to deal with them.

My point though was that it is implausible that wire delay accounts for multiple cycles of latency difference between near and far slice in the L3 cache.

Wiring issues and complexity also argue for a simpler topology with fewer and shorter links — which is my main point.

I still suspect most of this cache architecture confusion is due to a bad drawing on a marketing slide.

Actually, as far as I recall, AMD's slides have been quite accurate when it comes to topology, i.e. how things are connected — for example, how CCXs are connected in package and across packages. I don't see any reason to distrust the way they draw the interconnections in the L3 structure.

That said, the slides are not my main argument; they just corroborate my understanding of the topology. Conversely, they do not corroborate the slice-aware L2 hypothesis (4 x 4 = 16 links).

AMD stated that all slices have the same average latency which means that there cannot be any paths with multiple hops or a differing number of hops in some manner.

Here I think you are mistaken.

We have to be careful here, since it is easy to create confusion. Each core sees the same average L3 latency, and this is explained by the use of memory interleaving. When reading contiguous memory out of the L3, every other cache line will come from every other slice, which evens out latency and maximises utilisation.
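As a toy illustration of that evening-out effect (the cycle counts below are invented for illustration, not measured Zen figures):

```python
# Why interleaving gives every core the same *average* L3 latency.
# Invented per-slice latencies; each row is one core's view of slices 0..3.
latency = [[38, 40, 42, 44],   # core 0
           [40, 38, 44, 42],   # core 1
           [42, 44, 38, 40],   # core 2
           [44, 42, 40, 38]]   # core 3

# Streaming contiguous memory touches slices 0,1,2,3,0,1,... in turn,
# so each core averages one access to every slice:
for core, row in enumerate(latency):
    print(f"core {core}: average = {sum(row) / len(row)} cycles")
# Every core prints 41.0, even though near/far slice latencies differ.
```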

However, L3 access does see a difference in latency between near and far slice. This is not in dispute.

It was claimed the difference is explained by wire delay only. I investigated this claim, and I find it implausible.

I don’t think that there is any slice owned by any core

There is no ownership per se. Think of it as 4 connection ports to the L3 structure.

View attachment 29263
 
Last edited:
  • Like
Reactions: maddie

Vattila

Senior member
Oct 22, 2004
820
1,456
136
Free program to make your drawings 'prettier' ;)

Actually, Paint.NET is what I used to create those charts with personal charm — highly recommended! But perhaps my child-like scribblings undermine the seriousness of my argument? :)
 
Last edited:

Ajay

Lifer
Jan 8, 2001
16,094
8,113
136
Actually, Paint.NET is what I used to create those charts with personal charm — highly recommended! But perhaps my child-like scribblings undermine the seriousness of my argument? :)

Maybe a little bit :p.

I just want to bring a couple of things up. One is that all this traffic is moving between cache controllers - that's where the logic and switches are.
Second, I want to reiterate Tuna-Fish's excellent description of the most likely cache topology (AMD needs to reduce latency, not save power - well, at least not in the cache data channels).

Each core needs a link to each L3 slice. The controller on the closest L3 slice is still quite far from the core. The L3 slice closest to a core is not associated with it in any special way, and requests going to other slices do not travel through it. The links are only bidirectional in the sense that a core can both read and write through the link; core 1 communicating with L3 slice 4 does not use the same link as core 4 communicating with L3 slice 1. The links do not even terminate near each other; there is no reason why you'd need to take that huge detour.

So every core has a link to every L3 slice, or 4*4 = 16. With an 8-core CCX with 8 L3 slices (and I keep pointing this out: there is no fundamental reason why the L3 slice count must equal the core count!), there will be 8*8 = 64 links.

Or, with my terrible paint skills:
View attachment 28498
Note that the orange link is still a massive distance on die that requires significant infrastructure to cross; you can't magic it away just because the two blocks are close to each other in the block diagram.
 

Kenmitch

Diamond Member
Oct 10, 1999
8,505
2,250
136
Actually, Paint.NET is what I used to create those charts with personal charm — highly recommended! But perhaps my child-like scribblings undermine the seriousness of my argument? :)

Have you ever considered being an abstract artist? Not sarcasm.
 

jamescox

Senior member
Nov 11, 2009
644
1,105
136
It is an interesting topic. The PhD thesis I linked describes the issues you mention and discusses how to deal with them.

My point though was that it is implausible that wire delay accounts for multiple cycles of latency difference between near and far slice in the L3 cache.

Wiring issues and complexity also argue for a simpler topology with fewer and shorter links — which is my main point.



Actually, as far as I recall, AMD's slides have traditionally been quite accurate when it comes to topology, i.e. how things are connected — for example, how CCXs are connected in package and across packages. I don't see any reason to distrust the way they draw the interconnections in the L3 structure.

That said, the slides are not my main argument; they just corroborate my understanding of the topology. Conversely, they do not corroborate the slice-aware L2 hypothesis (4 x 4 = 16 links).



Here I think you are mistaken.

We have to be careful here, since it is easy to create confusion. Each core sees the same average L3 latency, and this is explained by the use of memory interleaving. When reading contiguous memory out of the L3, every other cache line will come from every other slice, which evens out latency and maximises utilisation.

However, L3 access does see a difference in latency between near and far slice. This is not in dispute.

It was claimed the difference is explained by wire delay only. I investigated this claim, and I find it implausible.



There is no ownership per se. Think of it as 4 connection ports to the L3 structure.

View attachment 29263

"L2 slice awareness" seems to be a strange way to talk about a simple address interleave. When the memory request gets to the point where it needs to be sent to the L3 cache, 2 bits are used to address the proper slice: [0, 1, 2, 3]. That is super simple and could be described by a 4-to-1 multiplexer. It wouldn't be that simple in practice due to handling contention from multiple cores trying to access the same slice. If it was some kind of multi-hop system, then I doubt it would be described as having the same average latency, but that is semantics. I would expect average latency to describe the situation where it will not be the same every time due to contention, rather than differences in structural latency.
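A minimal sketch of that kind of interleave, assuming 64-byte cache lines and that the two slice-select bits sit just above the line offset (the real bit positions, and whether any hashing is applied, are not public; this is purely illustrative):

```python
# Hypothetical 4-way L3 slice interleave at cache-line granularity.
LINE_BITS = 6    # log2(64-byte line): bits 0-5 are the byte offset
SLICE_BITS = 2   # log2(4 slices): the "4-to-1 mux" select in hardware

def slice_of(addr: int) -> int:
    """Pick the slice from the 2 bits just above the line offset."""
    return (addr >> LINE_BITS) & ((1 << SLICE_BITS) - 1)

# Contiguous cache lines rotate through slices 0,1,2,3,0,1,...
for line in range(8):
    addr = line * 64
    print(f"addr {addr:#06x} -> slice {slice_of(addr)}")
```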

As I said, I don't know much about the measurements. I wouldn't expect much difference in wire delay, since the L3 cache controllers do not actually seem to be that close to the cores; they look like they would all have similar latency. I could see it adding a clock cycle or two to the latency, but they would probably not actually take advantage of lower latency to a "near" slice in that case, since it would add complexity. I would actually expect that they all see the same structural latency in clock cycles. The measured differences appear to be too low to account for some kind of multi-hop system, though; that would almost certainly be more than a few clock cycles. I wouldn't expect latency measurements to be perfect here; you have a lot of interactions that could change the latency in access-specific ways.

From a chip-fabrication perspective, having all cores connected to all slices is not an issue. An HBM memory interface is 1024 bits per stack, and GPUs have used 4 stacks for 4096-bit buses. This only requires 4 x 256-bit links, i.e. 1024 bits per core, or 2048 if it supports read and write at the same time. It is also much simpler than an HBM bus, since it doesn't go off die. The many wide connections should not be an issue on die. Looking at die photos, Intel does appear to associate an L3 cache slice closely with each core, but that does not appear to be the case with AMD's Zen 2.
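For scale, here is the wire-count arithmetic behind that comparison (assuming the fully connected 4 x 4 topology with 256-bit data paths per link, as discussed above; the link width is an assumption, not a published AMD figure):

```python
# Wire counts for a fully connected core-to-slice crossbar.
cores, slices, link_bits = 4, 4, 256   # link width assumed, not published

one_way = cores * slices * link_bits   # shared read/write data path
duplex = 2 * one_way                   # separate read and write paths
print(f"4x4 CCX: {one_way} wires one-way, {duplex} duplex")  # 4096 / 8192

# Comparison points from the post: one HBM stack is a 1024-bit bus and
# four stacks are 4096 bits -- and those must be routed *off* the die.
print(f"8x8 CCX would need {8 * 8 * link_bits} wires one-way")  # 16384
```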
 
  • Like
Reactions: Tlh97 and Vattila

jamescox

Senior member
Nov 11, 2009
644
1,105
136
OK, I'll give it a go. Which one do you like best?

View attachment 29312



I see. Let me try some sarcasm then: Intel engineers have picked up on AMD's magic "free links" and are now dropping all their interconnect technologies for free direct connections everywhere. It is a scaling breakthrough across the industry. Technology enthusiasts allergic to the word "latency" can now substitute "wire delay" for it.

AMD kept the core cluster small, at 4 cores, to allow for things like directly connected cache slices. Zen 2 has considerably lower L3 latency per CCX than what Intel gets with their scalable architecture, because of the small, local core cluster. Also, I would not make any claims about how the Zen 3 cache is connected. AMD hasn't said anything about it other than that they redesigned the caches. It is likely completely different from Zen 2, so I would not assume that it is 8 directly connected slices.
 

A///

Diamond Member
Feb 24, 2017
4,351
3,160
136
AMD kept the core cluster small, at 4 cores, to allow for things like directly connected cache slices. Zen 2 has considerably lower L3 latency per CCX than what Intel gets with their scalable architecture, because of the small, local core cluster. Also, I would not make any claims about how the Zen 3 cache is connected. AMD hasn't said anything about it other than that they redesigned the caches. It is likely completely different from Zen 2, so I would not assume that it is 8 directly connected slices.
At the very least the 3300X shows us what baseline performance may be like with a unified cache and 0% IPC improvement, but as you point out, given how vague AMD is and has been, they could have done more. You don't exactly spill the beans before you make your soup and serve it.
 

A///

Diamond Member
Feb 24, 2017
4,351
3,160
136
Didn't this guy throw a tantrum a year ago because AMD didn't credit him with something?
 

Gideon

Platinum Member
Nov 27, 2007
2,023
5,026
136
Didn't this guy throw a tantrum a year ago because AMD didn't credit him with something?
AMD didn't provide samples of Zen 2, if I remember correctly. I can somewhat understand his point (though not the tantrums); considering how many people use his software for Ryzen RAM tuning, providing a few samples shouldn't be ultra-hard for AMD. His Windows power plan was also superior to AMD's on release (not so much now, after AMD's updates).

He also seems to confirm the per-core voltage regulation Igor mentioned earlier:
1usmus said:
About Zen 3. Part 1. One of the key features of Zen 3 will be the "Curve Optimizer", which allows you to configure the boost of the Ryzen processor. In addition, you will be able to customize the frequency for each core without any restrictions.
 

Kenmitch

Diamond Member
Oct 10, 1999
8,505
2,250
136
The no-compromise per-core clocking with voltage control is going to be game changing. This is probably the most interesting new feature, other than the alleged higher IF clocks.

I wonder if they'll just add another sub-tab in the UEFI, or if it'll be done in Ryzen Master?
 

Asterox

Golden Member
May 15, 2012
1,052
1,850
136
The no-compromise per-core clocking with voltage control is going to be game changing. This is probably the most interesting new feature, other than the alleged higher IF clocks.

I wonder if they'll just add another sub-tab in the UEFI, or if it'll be done in Ryzen Master?

On Twitter, he said (or replied) this:

- "AMD Overclocking menu in bios "

- and as expected only on "B550 , X570 , sTRX40 , sTRX80" motherboards