Speculation: Ryzen 4000 series/Zen 3


DrMrLordX

Lifer
Apr 27, 2000
21,582
10,785
136
Did God prohibit AMD and Intel from using the same size of L2 cache as Apple?

Yes? Physics is a bitch. L2 takes more die space than L3, and as you may have noticed, having a lot of L3 with good prefetch units can do a lot to improve the performance of multicore CPUs in parallel workloads with lots of intercore communication. Which is one sort of workload for which Intel and AMD have optimized their CPUs. Compare that situation to Apple, who exclusively use their A-series SoCs in phones and tablets where bursty, single-threaded (or sparsely-threaded) applications predominate. There you have less likelihood of core->core writes, meaning maintaining cache coherency is less important (and therefore, shared L3 is less important). So Apple chose to spend a lot of die area on L2 that could have been spent elsewhere, or saved entirely (driving higher yields and/or lower cost per die). Apple has the freedom to charge insane amounts of money for their hardware, and they don't have any OEMs telling them to trim costs, since they provide all their own SoCs for their own designs from top to bottom.

Core 2 Duo was using a big shared L2 cache 10 years ago.

Conroe only had two cores. Cache coherency on that generation of CPU wasn't that big of a deal. With a shared L2, you didn't even have to think about which core had which data in its cache, since Intel was using an inclusive cache hierarchy: if your CPU couldn't find the data in the L1d of Core 0 but it was in the L1d of Core 1, it was guaranteed to be in the L2, so you wouldn't have to do any core->core communication to read that data into the L1d of Core 0.
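To make that lookup order concrete, here's a toy C sketch of an inclusive, shared-L2 hierarchy. The Cache struct and the lookup/fill helpers are invented purely for illustration (a real cache controller is nothing this simple):

/* Toy model of a Conroe-style inclusive, shared L2. Not a simulator -
   it only illustrates the lookup order described above: an L1d miss
   goes straight to the shared L2, and inclusion guarantees the line is
   there if any core has it, so no cross-core probe is needed. */
#include <stdbool.h>
#include <stdio.h>

#define SETS 16                     /* tiny, direct-mapped, illustrative */

typedef struct { long tag[SETS]; bool valid[SETS]; } Cache;

static bool lookup(const Cache *c, long line) {
    int s = (int)(line % SETS);
    return c->valid[s] && c->tag[s] == line;
}

static void fill(Cache *c, long line) {
    int s = (int)(line % SETS);
    c->valid[s] = true;
    c->tag[s] = line;
}

int main(void) {
    static Cache l1d[2];            /* private L1d per core */
    static Cache l2;                /* shared, inclusive L2  */

    /* Core 1 reads line 42; inclusion means it lands in L2 as well. */
    fill(&l2, 42);
    fill(&l1d[1], 42);

    /* Core 0 reads line 42: L1d miss, shared L2 hit - no need to ask
       Core 1 whether its L1d holds the line. */
    if (!lookup(&l1d[0], 42) && lookup(&l2, 42)) {
        fill(&l1d[0], 42);
        printf("Core 0: L1d miss, shared inclusive L2 hit - no core->core traffic\n");
    }
    return 0;
}

With private, non-inclusive L2s (as in later designs), that same read can require snooping the other core's caches, which is where a big shared L3 and the coherency machinery come in.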
 

naukkis

Senior member
Jun 5, 2002
702
571
136
Yes? Physics is a bitch. L2 takes more die space than L3, and as you may have noticed, having a lot of L3 with good prefetch units can do a lot to improve the performance of multicore CPUs in parallel workloads with lots of intercore communication. Which is one sort of workload for which Intel and AMD have optimized their CPUs. Compare that situation to Apple, who exclusively use their A-series SoCs in phones and tablets where bursty, single-threaded (or sparsely-threaded) applications predominate. There you have less likelihood of core->core writes, meaning maintaining cache coherency is less important (and therefore, shared L3 is less important). So Apple chose to spend a lot of die area on L2 that could have been spent elsewhere, or saved entirely (driving higher yields and/or lower cost per die). Apple has the freedom to charge insane amounts of money for their hardware, and they don't have any OEMs telling them to trim costs, since they provide all their own SoCs for their own designs from top to bottom.

This is redacted. Apple's L2 is shared, so it works just like the L3 in today's x86 designs. AMD does use L2 as the last-level cache in Jaguar; a private L2 is just an intermediate cache level to increase performance, which Apple doesn't use yet. But that's an easy few percent of extra performance for Apple if they implement it as well. For low-clock-targeted devices like Apple's SoCs and AMD's Jaguar, that middle cache level doesn't bring enough of a performance advantage, so those designs prefer to save die space and skip it, but if Apple scales their clocks up they will probably also implement a 3-level cache system.

Profanity is not allowed in tech areas.

AT Mod Usandthem
 
Last edited by a moderator:

DrMrLordX

Lifer
Apr 27, 2000
21,582
10,785
136
This is redacted. Apple's L2 is shared, so it works just like the L3 in today's x86 designs.

In A13, Lightning and Thunder do NOT share L2. The last-level cache between those core blocks is the 16 MB SLC. It's not much different than Conroe (see my previous post) since presently there are only two Lightning cores on A13.

In any case, Apple made a conscious decision to spend transistors on L2 instead of SLC (which has to be shared with other components of the SoC, so it isn't exactly an L3 exclusive to the CPU). As I stated, this is NOT a design intended to maximize cache coherency between all the cores of the SoC. It is a design to emphasize consumer workload performance, e.g. bursty workloads with limited thread parallelism and limited intercore communication.

AMD does use L2 as the last-level cache in Jaguar

1). Jaguar is old
2). Jaguar didn't perform all that well in PC workloads, even for its day
3). Jaguar wasn't designed for the server/workstation sector. It was designed for consumer devices. Just like Apple's chips.

If Apple wants to move to an 8+4 DynamIQ configuration in 2021, they will have to rethink their SLC. Bottom line: L2 takes more transistors than L3 (or presumably SLC). When you increase L2 size, you make sacrifices elsewhere. And not all cache is made the same - look at the L2 and L3 performance of AMD chips versus Intel chips over the last ~15 years. Intel often had/has better cache performance.
 

naukkis

Senior member
Jun 5, 2002
702
571
136
If Apple wants to move to an 8+4 DynamIQ configuration in 2021, they will have to rethink their SLC. Bottom line: L2 takes more transistors than L3 (or presumably SLC). When you increase L2 size, you make sacrifices elsewhere. And not all cache is made the same - look at the L2 and L3 performance of AMD chips versus Intel chips over the last ~15 years. Intel often had/has better cache performance.

What makes you think that L2 SRAM takes more transistors than L3? They use exactly the same SRAM arrays.

Intel and AMD/Zen L3 is actually part of the core. Every core has a slice of L3, with addresses interleaved evenly across the slices. Pre-Zen AMD chips had a monolithic L3 which didn't scale with core count, similar to Apple's SLC.
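As a rough illustration of that sliced arrangement (the actual slice hash AMD and Intel use is undisclosed; a plain modulo over the cache-line address stands in for it here):

/* Illustrative only: spread cache-line addresses across per-core L3
   slices. Real designs use a proprietary hash; simple modulo is enough
   to show how an address maps to exactly one slice, so aggregate L3
   capacity and bandwidth scale with the number of cores/slices. */
#include <stdint.h>
#include <stdio.h>

#define LINE_SHIFT 6                /* 64-byte cache lines              */
#define NUM_SLICES 4                /* e.g. one slice per core in a CCX */

static unsigned slice_for(uint64_t phys_addr) {
    return (unsigned)((phys_addr >> LINE_SHIFT) % NUM_SLICES);
}

int main(void) {
    for (uint64_t addr = 0; addr < 8 * 64; addr += 64)
        printf("address 0x%03llx -> L3 slice %u\n",
               (unsigned long long)addr, slice_for(addr));
    return 0;
}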

Apple has a lot of easy cache performance left on the table; a design optimized for Macs would bring much more performance than their phone SoC designs. For a phone, the only thing that matters is power efficiency; with a relaxed power budget they could implement a much better-performing cache subsystem.
 

Geranium

Member
Apr 22, 2020
83
101
61
Did God prohibit AMD and Intel from using the same size of L2 cache as Apple? NO. Core 2 Duo was using a big shared L2 cache 10 years ago. So you cry well, but on the wrong shoulder here. You should write a complaint email to Apple headquarters to stop developing such powerful cores, because your ego cannot digest that your brand new x86 looks like garbage compared to Apple's uarch. Well, the problem is that you should have complained 5 years ago, because the very old Apple A9 Twister core from 2015 already had 7% higher IPC than today's 9900K Coffee Lake and Zen 2 :cool:


Stop with the insults/confrontational postings.
Take some more time off for reflection.

AT Mod Usandthem
None but latency. A big L2 will be slower than a small L2. AMD reduced the L1 (either data or instruction) cache to 32KB just to make it faster. And besides, an AMD64 processor has to cache a lot more different instructions than an ARM64 processor.

Core 2 Duo had 3MB of L2 per core, but it was also a 10-year-old architecture.

So you cry well, but on the wrong shoulder here. You should write a complaint email to Apple headquarters to stop developing such powerful cores, because your ego cannot digest that your brand new x86 looks like garbage compared to Apple's uarch.
I don't have Tim Cook's email. And I have no interest in emailing small flies.
One correction: it is not Apple's uarch, it is ARM's uarch, which Apple customized to fit their needs. Apple, Samsung, Qualcomm, MediaTek, Huawei and many other ARM vendors are nothing without ARM.

If ARM processors are so great, then why do all SoCs need fixed-function units to process video? x86 processors can do that without needing a special fixed-function unit.

And also, if Apple's ARM CPU is so fast, then why do iPhones have problems running new versions of iOS? My 2014 Haswell CPU has no problem running Win10. Don't forget iOS is a lot lighter than Windows 10; iOS is comparable to Nokia's S40 OS.

Well, the problem is that you should have complained 5 years ago, because the very old Apple A9 Twister core from 2015 already had 7% higher IPC than today's 9900K Coffee Lake and Zen 2 :cool:

Maybe you have an ego problem. That is why you compare invalid benchmark scores over and over again. Try reading old AnandTech articles about CPUs and GPUs.
Also, it looks like you don't know anything about benchmarks and how to conduct them. Maybe check how Dr. Ian or Ryan do their CPU and GPU benchmarks.

Here are three limitations of the MIPS metric (which can also be applied to other benchmarks); a small worked example follows below:
1. MIPS specifies the instruction execution rate but does not specify the capabilities of the instructions.
2. MIPS varies between programs on the same computer. Thus, a machine should not have a single MIPS rating.
3. MIPS can vary inversely with performance.
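To make limitations 2 and 3 concrete, here is a small worked example with invented numbers: the machine with the higher MIPS rating still finishes the same program later, because MIPS says nothing about how much work each instruction does.

/* Invented numbers only. Machine A uses many simple instructions,
   Machine B uses fewer, richer ones; B has the lower MIPS rating but
   the shorter execution time. */
#include <stdio.h>

int main(void) {
    double insn_a = 8.0e9, clock_a = 4.0e9, cpi_a = 1.0;   /* Machine A */
    double insn_b = 2.0e9, clock_b = 3.0e9, cpi_b = 1.5;   /* Machine B */

    double time_a = insn_a * cpi_a / clock_a;   /* execution time, seconds     */
    double time_b = insn_b * cpi_b / clock_b;
    double mips_a = insn_a / (time_a * 1.0e6);  /* MIPS = insns / (time * 1e6) */
    double mips_b = insn_b / (time_b * 1.0e6);

    printf("Machine A: %5.0f MIPS, %.2f s\n", mips_a, time_a);  /* 4000 MIPS, 2.00 s */
    printf("Machine B: %5.0f MIPS, %.2f s\n", mips_b, time_b);  /* 2000 MIPS, 1.00 s */
    return 0;
}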
 
Last edited:
  • Like
Reactions: pcp7 and Tlh97

DrMrLordX

Lifer
Apr 27, 2000
21,582
10,785
136
What makes you think that L2 SRAM takes more transistors than L3? They use exactly the same SRAM arrays.

I had thought that was not the case at all, with L1 being the least dense. Though looking at AT's breakdown of A13, it doesn't look like the SLC on A13 is much more dense than the L2.

Intel and AMD/Zen L3 is actually part of the core. Every core has a slice of L3, with addresses interleaved evenly across the slices. Pre-Zen AMD chips had a monolithic L3 which didn't scale with core count, similar to Apple's SLC.

Right, it's inclusive. AMD previously had an exclusive cache design but changed it. I wasn't sure about Apple's SLC inclusivity but it looks like it's exclusive as well. They may need to change that. In fact I'd say that it's certain that they will. They may even need to opt for smaller L2 and to stop sharing L2 to speed it up. It'll be interesting to see if they try that.

For a phone, the only thing that matters is power efficiency

And performance in short, sparsely-threaded workloads, which plays into efficiency.
 

amrnuke

Golden Member
Apr 24, 2019
1,181
1,772
136
In A13, Lightning and Thunder do NOT share L2. The last-level cache between those core blocks is the 16 MB SLC. It's not much different than Conroe (see my previous post) since presently there are only two Lightning cores on A13.
Based on Andrei's breakdown, it looks like the L2 is partially shared:

Last year we determined that the L2 cache structure physically must be around 8MB in size, however we saw that it only looks as if the big cores only have access to around 6MB. Apple employs an “L2E” cache – this is seemingly a region of the big core L2 cache that serves as an L3 to the smaller efficiency cores (which themselves have their own shared L2 underneath in their CPU group).

In this region the new A13 behaves slightly different as there’s an additional “step” in the latency ladder till about 6MB. Frankly I don’t have any proper explanation as to what the microarchitecture is doing here till the 8MB mark. It does look however that the physical structure has remained at 8MB.

While it's not technically a shared L2$ between the big and little cores, it does mean that the L2 attached to the big cores is indeed used as an L3 (L2E) for the efficiency cores. So rather than 8MB of L2$ on the big cores, per the latency analysis it behaves like 6MB of L2$, then 2MB of somewhat slower L2$.
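For anyone curious how latency "ladders" like the one Andrei describes are produced, the standard technique is a pointer chase through a randomly shuffled buffer whose size is swept past each cache level. A minimal C sketch of the idea (my own rough version, not AnandTech's actual harness; exact numbers will be noisy):

/* Pointer-chase latency sketch: each load depends on the previous one,
   and the random permutation defeats the prefetchers, so the average
   time per hop approximates the load-to-use latency of whichever cache
   level the working set currently fits in. */
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

static double ns_per_load(size_t bytes) {
    size_t n = bytes / sizeof(size_t);
    size_t *next = malloc(n * sizeof(size_t));
    if (!next) return -1.0;

    /* Sattolo's algorithm: one random cycle covering all entries. */
    for (size_t i = 0; i < n; i++) next[i] = i;
    for (size_t i = n - 1; i > 0; i--) {
        size_t j = (size_t)rand() % i;
        size_t t = next[i]; next[i] = next[j]; next[j] = t;
    }

    const size_t hops = 5 * 1000 * 1000;
    volatile size_t idx = 0;                    /* keeps the chase alive */
    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (size_t h = 0; h < hops; h++) idx = next[idx];
    clock_gettime(CLOCK_MONOTONIC, &t1);
    free(next);

    double ns = (t1.tv_sec - t0.tv_sec) * 1e9 + (t1.tv_nsec - t0.tv_nsec);
    return ns / (double)hops;
}

int main(void) {
    /* Sweep 16 KB .. 32 MB: latency steps up as the footprint spills out
       of L1, then L2/L2E, then the SLC, and finally into DRAM. */
    for (size_t kb = 16; kb <= 32 * 1024; kb *= 2)
        printf("%7zu KB : %6.2f ns per load\n", kb, ns_per_load(kb * 1024));
    return 0;
}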
 
  • Like
Reactions: lightmanek

DisEnchantment

Golden Member
Mar 3, 2017
1,590
5,722
136
I don't know why we have to keep bringing Apple into every thread on AT. Basically all the active threads.

Lisa keeps repeating that AMD's core focus is high-performance computing. Whether they are succeeding or not is a different matter. They are not chasing the cell phone market, at least on the CPU side for now.
Most of their architecture and patents have been targeted towards their EHP. At this point the major work around the architecture is the interconnect, not the core.
You can bet the core will have, at best, minor changes. But expect radical changes in interconnects, packaging, scalability, and coherency between CPUs, GPUs, FPGAs, etc.
Different goals, different designs. Flip the purpose and you will see a different outcome.
 

Markfw

Moderator Emeritus, Elite Member
May 16, 2002
25,483
14,434
136
I don't know why we have to keep bringing Apple into every thread on AT. Basically all the active threads.

Lisa keeps repeating that AMD's core focus is high-performance computing. Whether they are succeeding or not is a different matter. They are not chasing the cell phone market, at least on the CPU side for now.
Most of their architecture and patents have been targeted towards their EHP. At this point the major work around the architecture is the interconnect, not the core.
You can bet the core will have, at best, minor changes. But expect radical changes in interconnects, packaging, scalability, and coherency between CPUs, GPUs, FPGAs, etc.
Different goals, different designs. Flip the purpose and you will see a different outcome.
Yes, I for one am sick of trying to compare desktop/server/laptop cores to PHONE scores. If it's the same OS (and I don't care what OS), running more than one benchmark, probably at least 5 different ones, then it's a valid comparison. Otherwise it's crap. You can't compare a PHONE CPU to a desktop, server or even a real laptop. Now tablets and the like are a different story, kind of on their own also.
 

NTMBK

Lifer
Nov 14, 2011
10,208
4,940
136
Yes, I for one am sick of trying to compare desktop/server/laptop cores to PHONE scores. If it's the same OS (and I don't care what OS), running more than one benchmark, probably at least 5 different ones, then it's a valid comparison. Otherwise it's crap. You can't compare a PHONE CPU to a desktop, server or even a real laptop. Now tablets and the like are a different story, kind of on their own also.

How about comparing the Intel CPU in a Surface Pro with the Apple CPU in an iPad Pro? They're in the same form factor, targeting the same market.
 

randomhero

Member
Apr 28, 2020
180
247
86
First, hi all!

I have not seen anyone post anything about the implications of the new CCX arrangement for the product lineup.
AMD could have products from 1 (yeah, I know) to 16 cores for desktop in this arrangement:
1
2
4(4100)
6(4300)
8(4600)
10(4700)
12(4800)
14(4900)
16(4950).
 

Veradun

Senior member
Jul 29, 2016
564
780
136
First, hi all!

I have not seen anyone post anything about the implications of the new CCX arrangement for the product lineup.
AMD could have products from 1 (yeah, I know) to 16 cores for desktop in this arrangement:
1
2
4(4100)
6(4300)
8(4600)
10(4700)
12(4800)
14(4900)
16(4950).
The lineup could also have odd core counts for single-CCD SKUs :>
 

Valantar

Golden Member
Aug 26, 2014
1,792
508
136
First, hi all!

I have not seen anyone post anything about the implications of the new CCX arrangement for the product lineup.
AMD could have products from 1 (yeah, I know) to 16 cores for desktop in this arrangement:
1
2
4(4100)
6(4300)
8(4600)
10(4700)
12(4800)
14(4900)
16(4950).
Sure, they could, but the question then becomes: is there a point to that much segmentation? They are currently selling a 12-core at $500 MSRP and a 16-core at $750. Is there really room for a 14-core in between, and is there a significant market consisting of "people who want more than 12 cores but don't need 16 or are willing to pay more than $500 but less than $750"? Because that sounds unlikely to me. I know Intel did this, but they also had astronomically inflated prices which made the artificial segmentation seem to make much more sense. That question becomes even more precarious between 12 and 8, as the price difference shrinks to half as much even if the relative increase in core count is higher.

Rather than excessively binning for functional cores like this (separate bins for every possible number of working cores per CCD) it seems much more sensible to stick with broader core count categories (8 working, >=6 working, etc.) as this would leave much more room to further bin for clock scaling, efficiency, etc. A chip with 7 working cores might kick butt as a high speed and efficient 12-core or 6-core CPU even if one of its working cores is a "dud" (that scales poorly/has lots of leakage etc.), while the same piece of silicon would fall into a much worse 7-core (14-core CPU) bin due to that single dud core - it might even be rejected and have to go through a second round of binning to see if it qualifies for a lower core count bin, making binning more complex, time consuming, costly and wasteful. Not to mention that half the bins (every odd number) would be for only one CPU (or a couple in tiers with multiple of the same core count) rather than several as in the current implementation.
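Purely as a hypothetical sketch of that "broad bins" idea (the tiers and thresholds here are invented for the example, not AMD's actual binning rules):

/* With broad bins, a CCD with 7 working cores simply ships with 6
   enabled, instead of needing its own 7-core-per-CCD (14-core CPU)
   product tier. */
#include <stdio.h>

static int cores_enabled_per_ccd(int working) {
    if (working >= 8) return 8;     /* full CCD: 8- and 16-core SKUs  */
    if (working >= 6) return 6;     /* salvage:  6- and 12-core SKUs  */
    return 0;                       /* below 6: unused in this sketch */
}

int main(void) {
    for (int working = 8; working >= 5; working--)
        printf("%d working cores per CCD -> %d enabled\n",
               working, cores_enabled_per_ccd(working));
    return 0;
}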
 

randomhero

Member
Apr 28, 2020
180
247
86
Sure, they could, but the question then becomes: is there a point to that much segmentation? They are currently selling a 12-core at $500 MSRP and a 16-core at $750. Is there really room for a 14-core in between, and is there a significant market consisting of "people who want more than 12 cores but don't need 16 or are willing to pay more than $500 but less than $750"? Because that sounds unlikely to me. I know Intel did this, but they also had astronomically inflated prices which made the artificial segmentation seem to make much more sense. That question becomes even more precarious between 12 and 8, as the price difference shrinks to half as much even if the relative increase in core count is higher.

Rather than excessively binning for functional cores like this (separate bins for every possible number of working cores per CCD) it seems much more sensible to stick with broader core count categories (8 working, >=6 working, etc.) as this would leave much more room to further bin for clock scaling, efficiency, etc. A chip with 7 working cores might kick butt as a high speed and efficient 12-core or 6-core CPU even if one of its working cores is a "dud" (that scales poorly/has lots of leakage etc.), while the same piece of silicon would fall into a much worse 7-core (14-core CPU) bin due to that single dud core - it might even be rejected and have to go through a second round of binning to see if it qualifies for a lower core count bin, making binning more complex, time consuming, costly and wasteful. Not to mention that half the bins (every odd number) would be for only one CPU (or a couple in tiers with multiple of the same core count) rather than several as in the current implementation.

Good arguments, and I do agree with your keep-it-simple, keep-it-easy philosophy.
But something bugs me. With heavy segmentation AMD can raise prices, say $20-50 per SKU, and no one would bat an eye at it. You get 2 more cores and a 20% increase in per-core performance over the previous gen. And I do believe that would not cost AMD much more than it does now.
That tactic works.
 

Veradun

Senior member
Jul 29, 2016
564
780
136
They could still choose to get rid of the non-X variants and diversify only on core count and -X or -G, keeping segmentation at basically the same level.