Speculation: Ryzen 4000 series/Zen 3


DrMrLordX

Lifer
Apr 27, 2000
21,582
10,785
136
Did God prohibit AMD and Intel from using the same size of L2 cache as Apple?

Yes? Physics is a bitch. L2 takes more die space than L3, and as you may have noticed, having a lot of L3 with good prefetch units can do a lot to improve the performance of multicore CPUs in parallel workloads with lots of intercore communication. Which is one sort of workload for which Intel and AMD have optimized their CPUs. Compare that situation to Apple, who exclusively use their A-series SoCs in phones and tablets where bursty, single-threaded (or sparsely-threaded) applications predominate. There you have less likelihood of core->core writes, meaning maintaining cache coherency is less important (and therefore, shared L3 is less important). So Apple chose to spend a lot of die area on L2 that could have been spent elsewhere, or saved entirely (driving higher yields and/or lower cost per die). Apple has the freedom to charge insane amounts of money for their hardware, and they don't have any OEMs telling them to trim costs, since they provide all their own SoCs for their own designs from top to bottom.

Core 2 Duo was using a big shared L2 cache 10 years ago.

Conroe only had two cores. Cache coherency on that generation of CPU wasn't that big of a deal. With a shared L2, you didn't even have to think about which core had which data in its cache, since Intel was using an inclusive cache hierarchy: if your CPU couldn't find the data in the L1d of Core 0 but it was in the L1d of Core 1, it was guaranteed to be in the L2, so you wouldn't have to do any core->core communication to read that data into the L1d of Core 0.
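To make that lookup order concrete, here's a toy C sketch of an inclusive, shared-L2 hierarchy. The Cache struct and the lookup/fill helpers are invented purely for illustration (a real cache controller is nothing this simple):

/* Toy model of a Conroe-style inclusive, shared L2. Not a simulator -
   it only illustrates the lookup order described above: an L1d miss
   goes straight to the shared L2, and inclusion guarantees the line is
   there if any core has it, so no cross-core probe is needed. */
#include <stdbool.h>
#include <stdio.h>

#define SETS 16                     /* tiny, direct-mapped, illustrative */

typedef struct { long tag[SETS]; bool valid[SETS]; } Cache;

static bool lookup(const Cache *c, long line) {
    int s = (int)(line % SETS);
    return c->valid[s] && c->tag[s] == line;
}

static void fill(Cache *c, long line) {
    int s = (int)(line % SETS);
    c->valid[s] = true;
    c->tag[s] = line;
}

int main(void) {
    static Cache l1d[2];            /* private L1d per core */
    static Cache l2;                /* shared, inclusive L2  */

    /* Core 1 reads line 42; inclusion means it lands in L2 as well. */
    fill(&l2, 42);
    fill(&l1d[1], 42);

    /* Core 0 reads line 42: L1d miss, shared L2 hit - no need to ask
       Core 1 whether its L1d holds the line. */
    if (!lookup(&l1d[0], 42) && lookup(&l2, 42)) {
        fill(&l1d[0], 42);
        printf("Core 0: L1d miss, shared inclusive L2 hit - no core->core traffic\n");
    }
    return 0;
}

With private, non-inclusive L2s (as in later designs), that same read can require snooping the other core's caches, which is where a big shared L3 and the coherency machinery come in.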
 

naukkis

Senior member
Jun 5, 2002
702
571
136
Yes? Physics is a bitch. L2 takes more die space than L3, and as you may have noticed, having a lot of L3 with good prefetch units can do a lot to improve the performance of multicore CPUs in parallel workloads with lots of intercore communication. Which is one sort of workload for which Intel and AMD have optimized their CPUs. Compare that situation to Apple, who exclusively use their A-series SoCs in phones and tablets where bursty, single-threaded (or sparsely-threaded) applications predominate. There you have less likelihood of core->core writes, meaning maintaining cache coherency is less important (and therefore, shared L3 is less important). So Apple chose to spend a lot of die area on L2 that could have been spent elsewhere, or saved entirely (driving higher yields and/or lower cost per die). Apple has the freedom to charge insane amounts of money for their hardware, and they don't have any OEMs telling them to trim costs, since they provide all their own SoCs for their own designs from top to bottom.

This is redacted. Apple's L2 is shared, so it works just like the L3 in today's x86 designs. AMD does use L2 as the last-level cache in Jaguar; a private L2 is just an intermediate cache level to increase performance, which Apple doesn't use yet. But that's an easy few percent of extra performance for Apple if they implement it as well. For low-clock-targeted devices like Apple's SoCs and AMD's Jaguar, that middle cache level doesn't bring enough of a performance advantage, so those designs prefer to save die space and skip it, but if Apple scales their clocks up they will probably also implement a 3-level cache system.

Profanity is not allowed in tech areas.

AT Mod Usandthem
 
Last edited by a moderator:

DrMrLordX

Lifer
Apr 27, 2000
21,582
10,785
136
This is redacted. Apple's L2 is shared, so it works just like the L3 in today's x86 designs.

In A13, Lightning and Thunder do NOT share L2. The last-level cache between those core blocks is the 16 MB SLC. It's not much different than Conroe (see my previous post) since presently there are only two Lightning cores on A13.

In any case, Apple made a conscious decision to spend transistors on L2 instead of SLC (which has to be shared with other components of the SoC, so it isn't exactly an L3 exclusive to the CPU). As I stated, this is NOT a design intended to maximize cache coherency between all the cores of the SoC. It is a design to emphasize consumer workload performance, e.g. bursty workloads with limited thread parallelism and limited intercore communication.

AMD does use L2 as the last-level cache in Jaguar

1). Jaguar is old
2). Jaguar didn't perform all that well in PC workloads, even for its day
3). Jaguar wasn't designed for the server/workstation sector. It was designed for consumer devices. Just like Apple's chips.

If Apple wants to move to an 8+4 DynamIQ configuration in 2021, they will have to rethink their SLC. Bottom line: L2 takes more transistors than L3 (or presumably SLC). When you increase L2 size, you make sacrifices elsewhere. And not all cache is made the same - look at the L2 and L3 performance of AMD chips versus Intel chips over the last ~15 years. Intel often had/has better cache performance.
 

naukkis

Senior member
Jun 5, 2002
702
571
136
If Apple wants to move to an 8+4 DynamIQ configuration in 2021, they will have to rethink their SLC. Bottom line: L2 takes more transistors than L3 (or presumably SLC). When you increase L2 size, you make sacrifices elsewhere. And not all cache is made the same - look at the L2 and L3 performance of AMD chips versus Intel chips over the last ~15 years. Intel often had/has better cache performance.

What makes you think that L2 SRAM takes more transistors than L3? They use exactly the same SRAM arrays.

Intel and AMD/Zen L3 is actually part of the core. Every core has a slice of L3, with addresses interleaved evenly across the slices. Pre-Zen AMD chips had a monolithic L3 which didn't scale with core count, similar to Apple's SLC.
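As a rough illustration of that sliced arrangement (the actual slice hash AMD and Intel use is undisclosed; a plain modulo over the cache-line address stands in for it here):

/* Illustrative only: spread cache-line addresses across per-core L3
   slices. Real designs use a proprietary hash; simple modulo is enough
   to show how an address maps to exactly one slice, so aggregate L3
   capacity and bandwidth scale with the number of cores/slices. */
#include <stdint.h>
#include <stdio.h>

#define LINE_SHIFT 6                /* 64-byte cache lines              */
#define NUM_SLICES 4                /* e.g. one slice per core in a CCX */

static unsigned slice_for(uint64_t phys_addr) {
    return (unsigned)((phys_addr >> LINE_SHIFT) % NUM_SLICES);
}

int main(void) {
    for (uint64_t addr = 0; addr < 8 * 64; addr += 64)
        printf("address 0x%03llx -> L3 slice %u\n",
               (unsigned long long)addr, slice_for(addr));
    return 0;
}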

Apple has a lot of easy cache performance left on the table; a design optimized for Macs would bring much more performance than their phone SoC designs. For a phone, the only thing that matters is power efficiency; with a relaxed power budget they could implement a much better-performing cache subsystem.
 

Geranium

Member
Apr 22, 2020
83
101
61
Did God prohibit AMD and Intel from using the same size of L2 cache as Apple? NO. Core 2 Duo was using a big shared L2 cache 10 years ago. So you cry well, but on the wrong shoulder here. You should write a complaint email to Apple headquarters to stop developing such powerful cores, because your ego cannot digest that your brand new x86 looks like garbage compared to Apple's uarch. Well, the problem is that you should have complained 5 years ago, because the very old Apple A9 Twister core from 2015 already had 7% higher IPC than today's 9900K Coffee Lake and Zen 2 :cool:


Stop with the insults/confrontational postings.
Take some more time off for reflection.

AT Mod Usandthem
None but latency. A big L2 will be slower than a small L2. AMD reduced the L1 (either data or instruction) cache to 32KB just to make it faster. And besides, an AMD64 processor has to cache a lot more different instructions than an ARM64 processor.

Core 2 Duo had 3MB of L2 per core, but it was also a 10-year-old architecture.

So you cry well, but on the wrong shoulder here. You should write a complaint email to Apple headquarters to stop developing such powerful cores, because your ego cannot digest that your brand new x86 looks like garbage compared to Apple's uarch.
I don't have Tim Cook's email. And I have no interest in emailing small flies.
One correction: it is not Apple's uarch, it is ARM's uarch, which Apple customized to fit their needs. Apple, Samsung, Qualcomm, MediaTek, Huawei and many other ARM vendors are nothing without ARM.

If ARM processors are so great, then why do all SoCs need fixed-function units to process video? x86 processors can do that without needing a special fixed-function unit.

And also, if Apple's ARM CPU is so fast, then why do iPhones have problems running new versions of iOS? My 2014 Haswell CPU has no problem running Win10. Don't forget iOS is a lot lighter than Windows 10; iOS is comparable to Nokia's S40 OS.

Well, the problem is that you should have complained 5 years ago, because the very old Apple A9 Twister core from 2015 already had 7% higher IPC than today's 9900K Coffee Lake and Zen 2 :cool:

Maybe you have an ego problem. That is why you compare invalid benchmark scores over and over again. Try reading old AnandTech articles about CPUs and GPUs.
Also, it looks like you don't know anything about benchmarks and how to conduct them. Maybe check how Dr. Ian or Ryan do their CPU and GPU benchmarks.

Here are three limitations of the MIPS metric (which can also be applied to other benchmarks); a small worked example follows below:
1. MIPS specifies the instruction execution rate but does not specify the capabilities of the instructions.
2. MIPS varies between programs on the same computer. Thus, a machine should not have a single MIPS rating.
3. MIPS can vary inversely with performance.
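To make limitations 2 and 3 concrete, here is a small worked example with invented numbers: the machine with the higher MIPS rating still finishes the same program later, because MIPS says nothing about how much work each instruction does.

/* Invented numbers only. Machine A uses many simple instructions,
   Machine B uses fewer, richer ones; B has the lower MIPS rating but
   the shorter execution time. */
#include <stdio.h>

int main(void) {
    double insn_a = 8.0e9, clock_a = 4.0e9, cpi_a = 1.0;   /* Machine A */
    double insn_b = 2.0e9, clock_b = 3.0e9, cpi_b = 1.5;   /* Machine B */

    double time_a = insn_a * cpi_a / clock_a;   /* execution time, seconds     */
    double time_b = insn_b * cpi_b / clock_b;
    double mips_a = insn_a / (time_a * 1.0e6);  /* MIPS = insns / (time * 1e6) */
    double mips_b = insn_b / (time_b * 1.0e6);

    printf("Machine A: %5.0f MIPS, %.2f s\n", mips_a, time_a);  /* 4000 MIPS, 2.00 s */
    printf("Machine B: %5.0f MIPS, %.2f s\n", mips_b, time_b);  /* 2000 MIPS, 1.00 s */
    return 0;
}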
 
Last edited:
  • Like
Reactions: pcp7 and Tlh97

DrMrLordX

Lifer
Apr 27, 2000
21,582
10,785
136
What makes you think that L2 SRAM takes more transistors than L3? They use exactly the same SRAM arrays.

I had thought that was not the case at all, with L1 being the least dense. Though looking at AT's breakdown of A13, it doesn't look like the SLC on A13 is much more dense than the L2.

Intel and AMD/Zen L3 is actually part of the core. Every core has a slice of L3, with addresses interleaved evenly across the slices. Pre-Zen AMD chips had a monolithic L3 which didn't scale with core count, similar to Apple's SLC.

Right, it's inclusive. AMD previously had an exclusive cache design but changed it. I wasn't sure about Apple's SLC inclusivity but it looks like it's exclusive as well. They may need to change that. In fact I'd say that it's certain that they will. They may even need to opt for smaller L2 and to stop sharing L2 to speed it up. It'll be interesting to see if they try that.

For a phone, the only thing that matters is power efficiency

And performance in short, sparsely-threaded workloads, which plays into efficiency.
 

amrnuke

Golden Member
Apr 24, 2019
1,181
1,772
136
In A13, Lightning and Thunder do NOT share L2. The last-level cache between those core blocks is the 16 MB SLC. It's not much different than Conroe (see my previous post) since presently there are only two Lightning cores on A13.
Based on Andrei's breakdown, it looks like the L2 is partially shared:

Last year we determined that the L2 cache structure physically must be around 8MB in size, however we saw that it only looks as if the big cores only have access to around 6MB. Apple employs an “L2E” cache – this is seemingly a region of the big core L2 cache that serves as an L3 to the smaller efficiency cores (which themselves have their own shared L2 underneath in their CPU group).

In this region the new A13 behaves slightly different as there’s an additional “step” in the latency ladder till about 6MB. Frankly I don’t have any proper explanation as to what the microarchitecture is doing here till the 8MB mark. It does look however that the physical structure has remained at 8MB.

While it's not technically a shared L2$ between the big and little cores, it does mean that the L2 attached to the big cores is indeed used as an L3 (L2E) for the efficiency cores. So rather than 8MB of L2$ on the big cores, per the latency analysis it behaves like 6MB of L2$, then 2MB of somewhat slower L2$.
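For anyone curious how latency "ladders" like the one Andrei describes are produced, the standard technique is a pointer chase through a randomly shuffled buffer whose size is swept past each cache level. A minimal C sketch of the idea (my own rough version, not AnandTech's actual harness; exact numbers will be noisy):

/* Pointer-chase latency sketch: each load depends on the previous one,
   and the random permutation defeats the prefetchers, so the average
   time per hop approximates the load-to-use latency of whichever cache
   level the working set currently fits in. */
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

static double ns_per_load(size_t bytes) {
    size_t n = bytes / sizeof(size_t);
    size_t *next = malloc(n * sizeof(size_t));
    if (!next) return -1.0;

    /* Sattolo's algorithm: one random cycle covering all entries. */
    for (size_t i = 0; i < n; i++) next[i] = i;
    for (size_t i = n - 1; i > 0; i--) {
        size_t j = (size_t)rand() % i;
        size_t t = next[i]; next[i] = next[j]; next[j] = t;
    }

    const size_t hops = 5 * 1000 * 1000;
    volatile size_t idx = 0;                    /* keeps the chase alive */
    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (size_t h = 0; h < hops; h++) idx = next[idx];
    clock_gettime(CLOCK_MONOTONIC, &t1);
    free(next);

    double ns = (t1.tv_sec - t0.tv_sec) * 1e9 + (t1.tv_nsec - t0.tv_nsec);
    return ns / (double)hops;
}

int main(void) {
    /* Sweep 16 KB .. 32 MB: latency steps up as the footprint spills out
       of L1, then L2/L2E, then the SLC, and finally into DRAM. */
    for (size_t kb = 16; kb <= 32 * 1024; kb *= 2)
        printf("%7zu KB : %6.2f ns per load\n", kb, ns_per_load(kb * 1024));
    return 0;
}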
 
  • Like
Reactions: lightmanek

DisEnchantment

Golden Member
Mar 3, 2017
1,590
5,722
136
I don't know why we have to keep bringing Apple into every thread on AT. Basically all the active threads.

Lisa keeps repeating that AMD's core focus is high-performance computing. Whether they are succeeding or not is a different matter. They are not chasing the cell phone market, at least on the CPU side for now.
Most of their architecture and patents have been targeted towards their EHP. At this point the major work around the architecture is the interconnect, not the core.
You can bet the core will have, at best, minor changes. But expect radical changes in interconnects, packaging, scalability, and coherency between CPUs, GPUs, FPGAs, etc.
Different goals, different designs. Flip the purpose and you will see a different outcome.
 

Markfw

Moderator Emeritus, Elite Member
May 16, 2002
25,483
14,434
136
I don't know why we have to keep bringing Apple into every thread on AT. Basically all the active threads.

Lisa keeps repeating that AMD's core focus is high-performance computing. Whether they are succeeding or not is a different matter. They are not chasing the cell phone market, at least on the CPU side for now.
Most of their architecture and patents have been targeted towards their EHP. At this point the major work around the architecture is the interconnect, not the core.
You can bet the core will have, at best, minor changes. But expect radical changes in interconnects, packaging, scalability, and coherency between CPUs, GPUs, FPGAs, etc.
Different goals, different designs. Flip the purpose and you will see a different outcome.
Yes, I for one am sick of trying to compare desktop/server/laptop cores to PHONE scores. If it's the same OS (and I don't care what OS), running more than one benchmark, probably at least 5 different ones, then it's a valid comparison. Otherwise it's crap. You can't compare a PHONE CPU to a desktop, server or even a real laptop. Now tablets and the like are a different story, kind of on their own also.
 

NTMBK

Lifer
Nov 14, 2011
10,208
4,940
136
Yes, I for one am sick of trying to compare desktop/server/laptop cores to PHONE scores. If it's the same OS (and I don't care what OS), running more than one benchmark, probably at least 5 different ones, then it's a valid comparison. Otherwise it's crap. You can't compare a PHONE CPU to a desktop, server or even a real laptop. Now tablets and the like are a different story, kind of on their own also.

How about comparing the Intel CPU in a Surface Pro with the Apple CPU in an iPad Pro? They're in the same form factor, targeting the same market.
 

randomhero

Member
Apr 28, 2020
180
247
86
First, hi all!

I have not seen anyone post anything about the implications of the new CCX arrangement for the product lineup.
AMD could have products from 1 (yeah, I know) to 16 cores for desktop in this arrangement:
1
2
4(4100)
6(4300)
8(4600)
10(4700)
12(4800)
14(4900)
16(4950).
 

Veradun

Senior member
Jul 29, 2016
564
780
136
First, hi all!

I have not seen anyone post anything about the implications of the new CCX arrangement for the product lineup.
AMD could have products from 1 (yeah, I know) to 16 cores for desktop in this arrangement:
1
2
4(4100)
6(4300)
8(4600)
10(4700)
12(4800)
14(4900)
16(4950).
The lineup could also have odd core counts for single-CCD SKUs :>
 

Valantar

Golden Member
Aug 26, 2014
1,792
508
136
First, hi all!

I have not seen anyone post anything about the implications of the new CCX arrangement for the product lineup.
AMD could have products from 1 (yeah, I know) to 16 cores for desktop in this arrangement:
1
2
4(4100)
6(4300)
8(4600)
10(4700)
12(4800)
14(4900)
16(4950).
Sure, they could, but the question then becomes: is there a point to that much segmentation? They are currently selling a 12-core at $500 MSRP and a 16-core at $750. Is there really room for a 14-core in between, and is there a significant market consisting of "people who want more than 12 cores but don't need 16 or are willing to pay more than $500 but less than $750"? Because that sounds unlikely to me. I know Intel did this, but they also had astronomically inflated prices which made the artificial segmentation seem to make much more sense. That question becomes even more precarious between 12 and 8, as the price difference shrinks to half as much even if the relative increase in core count is higher.

Rather than excessively binning for functional cores like this (separate bins for every possible number of working cores per CCD) it seems much more sensible to stick with broader core count categories (8 working, >=6 working, etc.) as this would leave much more room to further bin for clock scaling, efficiency, etc. A chip with 7 working cores might kick butt as a high speed and efficient 12-core or 6-core CPU even if one of its working cores is a "dud" (that scales poorly/has lots of leakage etc.), while the same piece of silicon would fall into a much worse 7-core (14-core CPU) bin due to that single dud core - it might even be rejected and have to go through a second round of binning to see if it qualifies for a lower core count bin, making binning more complex, time consuming, costly and wasteful. Not to mention that half the bins (every odd number) would be for only one CPU (or a couple in tiers with multiple of the same core count) rather than several as in the current implementation.
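Purely as a hypothetical sketch of that "broad bins" idea (the tiers and thresholds here are invented for the example, not AMD's actual binning rules):

/* With broad bins, a CCD with 7 working cores simply ships with 6
   enabled, instead of needing its own 7-core-per-CCD (14-core CPU)
   product tier. */
#include <stdio.h>

static int cores_enabled_per_ccd(int working) {
    if (working >= 8) return 8;     /* full CCD: 8- and 16-core SKUs  */
    if (working >= 6) return 6;     /* salvage:  6- and 12-core SKUs  */
    return 0;                       /* below 6: unused in this sketch */
}

int main(void) {
    for (int working = 8; working >= 5; working--)
        printf("%d working cores per CCD -> %d enabled\n",
               working, cores_enabled_per_ccd(working));
    return 0;
}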
 

randomhero

Member
Apr 28, 2020
180
247
86
Sure, they could, but the question then becomes: is there a point to that much segmentation? They are currently selling a 12-core at $500 MSRP and a 16-core at $750. Is there really room for a 14-core in between, and is there a significant market consisting of "people who want more than 12 cores but don't need 16 or are willing to pay more than $500 but less than $750"? Because that sounds unlikely to me. I know Intel did this, but they also had astronomically inflated prices which made the artificial segmentation seem to make much more sense. That question becomes even more precarious between 12 and 8, as the price difference shrinks to half as much even if the relative increase in core count is higher.

Rather than excessively binning for functional cores like this (separate bins for every possible number of working cores per CCD) it seems much more sensible to stick with broader core count categories (8 working, >=6 working, etc.) as this would leave much more room to further bin for clock scaling, efficiency, etc. A chip with 7 working cores might kick butt as a high speed and efficient 12-core or 6-core CPU even if one of its working cores is a "dud" (that scales poorly/has lots of leakage etc.), while the same piece of silicon would fall into a much worse 7-core (14-core CPU) bin due to that single dud core - it might even be rejected and have to go through a second round of binning to see if it qualifies for a lower core count bin, making binning more complex, time consuming, costly and wasteful. Not to mention that half the bins (every odd number) would be for only one CPU (or a couple in tiers with multiple of the same core count) rather than several as in the current implementation.

Good arguments, and I do agree with your keep-it-simple, keep-it-easy philosophy.
But something bugs me. With heavy segmentation AMD can raise prices, say $20-50 per SKU, and no one would bat an eye at it. You get 2 more cores and a 20% increase in per-core performance over the previous gen. And I do believe that would not cost AMD much more than it does now.
That tactic works.
 

Veradun

Senior member
Jul 29, 2016
564
780
136
They could still choose to get rid of the non-X variants and diversify only on core count and -X or -G, keeping segmentation at basically the same level.