Discussion Apple Silicon SoC thread


Eug

Lifer
Mar 11, 2000
23,587
1,001
126
M1
5 nm
Unified memory architecture - LPDDR4X
16 billion transistors

8-core CPU

4 high-performance cores
192 KB instruction cache
128 KB data cache
Shared 12 MB L2 cache

4 high-efficiency cores
128 KB instruction cache
64 KB data cache
Shared 4 MB L2 cache
(Apple claims the 4 high-efficiency cores alone perform like a dual-core Intel MacBook Air)

8-core iGPU (there is also a 7-core variant, likely with one core disabled)
128 execution units
Up to 24,576 concurrent threads
2.6 teraflops
82 gigatexels/s
41 gigapixels/s
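For the curious, the 2.6-teraflop figure is consistent with simple ALU math. A back-of-envelope sketch (the ALUs-per-EU count and the GPU clock are assumptions drawn from third-party analyses, not Apple-published numbers):

```python
# Back-of-envelope check of the 2.6 TFLOPS figure.
# Assumptions (not Apple-published): 8 FP32 ALUs per execution unit,
# ~1.28 GHz GPU clock, and 2 ops per ALU per clock (fused multiply-add).
eus = 128            # Apple's stated execution-unit count
alus_per_eu = 8      # assumed
ops_per_clock = 2    # an FMA counts as a multiply plus an add
clock_ghz = 1.28     # assumed

tflops = eus * alus_per_eu * ops_per_clock * clock_ghz / 1000
print(f"{tflops:.2f} TFLOPS")  # ~2.62, consistent with the 2.6 TFLOPS claim
```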

16-core neural engine
Secure Enclave
USB 4

Products:
$999 ($899 edu) 13" MacBook Air (fanless) - 18-hour video playback battery life
$699 Mac mini (with fan)
$1299 ($1199 edu) 13" MacBook Pro (with fan) - 20-hour video playback battery life

Memory options 8 GB and 16 GB. No 32 GB option (unless you go Intel).

It should be noted that the M1 chip in these three Macs is the same (aside from GPU core count). Basically, Apple is taking the same approach with these chips as it does with the iPhones and iPads: just one SKU (excluding the X variants), the same across all iDevices (aside from occasional slight clock speed differences).

EDIT:


M1 Pro 8-core CPU (6+2), 14-core GPU
M1 Pro 10-core CPU (8+2), 14-core GPU
M1 Pro 10-core CPU (8+2), 16-core GPU
M1 Max 10-core CPU (8+2), 24-core GPU
M1 Max 10-core CPU (8+2), 32-core GPU

M1 Pro and M1 Max discussion here:


M1 Ultra discussion here:


M2 discussion here:


M2
Second-generation 5 nm
Unified memory architecture - LPDDR5, up to 24 GB and 100 GB/s
20 billion transistors

8-core CPU

4 high-performance cores
192 KB instruction cache
128 KB data cache
Shared 16 MB L2 cache

4 high-efficiency cores
128 KB instruction cache
64 KB data cache
Shared 4 MB L2 cache

10-core iGPU (but there is an 8-core variant)
3.6 Teraflops

16-core neural engine
Secure Enclave
USB 4

Hardware acceleration for 8K H.264, HEVC (H.265), and ProRes

M3 Family discussion here:

 
Last edited:

name99

Senior member
Sep 11, 2010
404
303
136
The funny thing is why they are limiting their devices on the RAM side now. 16 GB is of course plenty for office work and browsing, but the default is 8 GB. And why would I need 4 such beefy cores just for browsing? Heck, I had no issue browsing on my 4-year-old smartphone, which is probably 10x slower than this M1 in these micro-benchmarks.

OK, maybe it helps developers with compiling, but who will buy a Mac now if they develop x86 software? Can it even be used to build "general" ARM software, or will everything get compiled for Apple cores?

"Limiting their devices" is not ideal phrasing.

They made the decision to construct the package in a certain way, with the two DRAM chips side by side and with LPDDR4. That means they're limited to the max that 2 LPDDR4 chips can cover, i.e. 16GB.
Of course any of those decisions could in theory be changed: they could use different packaging, they could put two chips on one side and two on the other, they could use LPDDR5. But all of those are decisions involving a lot of expensive additional design, to support a market that, honestly, I am not convinced even exists outside the whining class. It's not like this is an arbitrary decision that could trivially be modified if someone wanted to modify it.

What is so hard to understand about
- first devices
- low-end devices?

Are people truly so thick that they believe these constraints will propagate up to the mid-range and high-end? And if they don't propagate up, then WTF is the problem? Low-end is low-end!
 
  • Like
Reactions: Mopetar and Eug

amrnuke

Golden Member
Apr 24, 2019
1,181
1,772
136
"Limiting their devices" is not ideal phrasing.

They made the decision to construct the package in a certain way, with the two DRAM chips side by side and with LPDDR4. That means they're limited to the max that 2 LPDDR4 chips can cover, i.e. 16GB.
Of course any of those decisions could in theory be changed: they could use different packaging, they could put two chips on one side and two on the other, they could use LPDDR5. But all of those are decisions involving a lot of expensive additional design, to support a market that, honestly, I am not convinced even exists outside the whining class. It's not like this is an arbitrary decision that could trivially be modified if someone wanted to modify it.

What is so hard to understand about
- first devices
- low-end devices?

Are people truly so thick that they believe these constraints will propagate up to the mid-range and high-end? And if they don't propagate up, then WTF is the problem? Low-end is low-end!
For me it's apprehension about the device, this being basically an A14 Bionic. It still doesn't explore the mid and high end of the range that we're so sorely lacking with Arm. From a pure interest standpoint, the arrival of V1 and Apple's analogs can't come soon enough.

This chip will do perfectly OK in these $700-$1300 "low-end" devices.
 
  • Like
Reactions: Tlh97 and kurosaki

name99

Senior member
Sep 11, 2010
404
303
136
I doubt it's there for latency. It's similar to PoP DRAM configs on mobile chips. It's for saving board space.

The A14 gets stellar performance using PoP DRAM. The way M1 is packaged might allow larger DRAM configurations to be available than on typical PoP packages.

Actually, the DRAM packaging does give performance advantages. The short and extremely well-characterized traces from the CPU to the LPDDR chips require many fewer electrons to move over them to generate a signal (i.e. they have lower capacitance), and this allows not just lower power but faster switching.

That's why DDR4 maxes out at 3200 MT/s while LPDDR4 reaches 4267 MT/s.

The price you pay for this performance is flexibility: sockets give you upgradeability, but at much higher capacitance (physical size, distance from the CPU).
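To put a formula behind the capacitance point: dynamic switching power on a signal trace goes as P = α·C·V²·f, so lower trace capacitance means less power at the same data rate, or headroom to switch faster at the same power. A toy sketch, with illustrative made-up capacitance values (the scaling is the point, not the absolute numbers):

```python
# Dynamic switching power on a trace: P = alpha * C * V^2 * f.
# Capacitance values below are illustrative, not measured figures.
def switching_power_mw(cap_pf: float, volts: float, freq_mhz: float,
                       activity: float = 0.5) -> float:
    """Return power in milliwatts for one signal line."""
    return activity * (cap_pf * 1e-12) * volts**2 * (freq_mhz * 1e6) * 1e3

# Same voltage and toggle rate; only the trace capacitance differs.
for label, cap in [("on-package trace", 2.0), ("motherboard trace", 8.0)]:
    print(label, f"{switching_power_mw(cap, 1.1, 2133):.2f} mW")
# Lower C means less charge moved per transition: lower power at the
# same frequency, or a higher feasible frequency at the same power.
```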

Technically the M1 (and A12X) aren't exactly PoP; if anything they're like a poor man's version of an interposer:


Could that design grow to 4 DRAM chips? I don't see why not in principle, with two on each side. That's one way Apple might give us the mythical M1X, if the plan for mid-range machines (mini Pro, iMac, larger MacBook Pro) early next year is based on an "easy" extension of the M1 -- make the SoC larger with 8+8 (or 8+4) cores, maybe 16 GPU cores, and double the memory controller hardware/PHYs; that should fit within 200mm^2, so still feasible. Maybe sell the lower-end versions with one dead CPU core or 14 GPU cores in the mini Pro model?

Another way they could get the mythical M1X to 32GiB is the same memory controllers and packaging design, but with LPDDR5. This seems less disruptive IF the A14/M1 design includes an LPDDR5-capable memory controller. You'd figure that's the case, but who knows for sure.
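The capacity arithmetic behind both options, as a quick sketch (the per-package capacities are assumptions based on typical LPDDR4X/LPDDR5 parts of this era, not confirmed Apple configurations):

```python
# Capacity options for an on-package DRAM layout.
# Per-package capacities are assumptions (typical parts), not
# confirmed Apple configurations.
def max_capacity_gb(packages: int, gb_per_package: int) -> int:
    return packages * gb_per_package

print(max_capacity_gb(2, 8))   # 16 GB: the shipping M1 (2x LPDDR4X)
print(max_capacity_gb(4, 8))   # 32 GB: four LPDDR4X packages, two per side
print(max_capacity_gb(2, 16))  # 32 GB: same 2-package layout, denser LPDDR5
print(max_capacity_gb(4, 16))  # 64 GB: both changes combined
```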
 

name99

Senior member
Sep 11, 2010
404
303
136
I don't think there will be a core truly that wide from x86 even then. Nor should there necessarily be; maintaining higher clocks might well be a better way to provide the performance.

The one big advantage ARM, especially 64-bit ARM, has over x86 is that growing decode width grows its power and complexity linearly. For ARM, an 8-wide decode is ~twice as power-hungry and large as a 4-wide decode. For x86, increasing decode width grows the complexity and power use of the decoders much faster than linearly, mainly because they have to build in massive muxes to line up instructions, since instructions are wildly variable-width. This means the ideal width of the machine from an engineering standpoint is much wider for an ARM machine than for x86. The width of the M1 isn't free; they pay for it in many ways, including clock speed.

True, but misleading.
The real issue in any design is what's the first pain point, because that determines everything; by definition you can't get past it.

For CISC the pain point is decode, apparently at around 4-5 instructions.
That doesn't mean going to 8 is easy for ARM, because the next pain point is register rename, and that is (unless you start doing very clever things) a quadratic problem; going from 4 to 8 on ARM makes the rename problem 4x as difficult.

Or to put it differently, even when x86 can solve much of their fetch/decode width difficulty via larger opcode caches, they still have to confront the difficulty of running rename 6, 7, 8 wide as compared to 4 or 5 wide.
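To make the quadratic pain concrete: each instruction renamed in a cycle must compare its source registers against the destinations of every older instruction in the same rename group. A minimal counting sketch (two source operands per instruction is an illustrative assumption):

```python
# Count intra-group dependency comparators in register rename.
# Instruction i (0-based) must check each of its source registers
# against the destination of every older instruction in its group.
def rename_comparators(width: int, srcs_per_inst: int = 2) -> int:
    return sum(srcs_per_inst * i for i in range(width))

for w in (4, 5, 8):
    print(f"{w}-wide: {rename_comparators(w)} comparators")
# 4-wide: 12, 5-wide: 20, 8-wide: 56 -- roughly quadratic growth,
# i.e. the ~4x difficulty jump from 4-wide to 8-wide described above.
```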

There are no free lunches, no. Running wider means running slower. BUT transistors are getting smaller every year, while they are not getting much faster. That's why it's smarter to bet on the technology that fully exploits small transistors (width) than the technology that fully exploits fast transistors.
 
  • Like
Reactions: scannall

name99

Senior member
Sep 11, 2010
404
303
136
Fully agree, and I have written the same thing in the past. Apple also gets its efficiency at the cost of die space, which as you say is fine if the cheapest product it will go into is the $999 MacBook Air.

On top of that, I repeat my previous statement that Apple comes from the consumer world, where a few wide cores have obvious benefits. Now they are scaling up. x86 comes from the server world, in which it is far easier to make use of many cores, and they are scaling their cores down. It's clear Apple will have a significant advantage in consumer usage with their SoCs.

But again, it's irrelevant for me as it's a completely locked platform.

The "cheapest product" actually goes into the HomePod mini...
The "cheapest product with a big core" goes into the Apple TV (right now an A10X at $179; one expects this will soon change, same price point but A12X? A13? who knows.)

My point is not snark; it's that Apple's basic chips, even after packaging with the DRAM, are just not that expensive. The estimate for the A13 SoC was $64 (slightly cheaper than the cost of the screen).
Of course this is pure "BOM" cost -- the SoCs require a massive up-front cost to design. But it's misleading to imagine that ~100mm^2 of silicon is as expensive as you are suggesting.
The MBA costs what it costs not because the SoC is much more expensive than what Intel charges for the SoC in a $400 laptop, but because the rest of the package is so much nicer. The flash is faster (will $400 even get you an SSD today? If so, I expect it will be a lousy SSD with 3-year-old specs), the battery larger and more reliable, the ports higher speed, the screen nicer, the trackpad nicer. Light milled aluminum rather than heavier steel and plastic.

Maybe you don't care about any of these, but lots of people do. And THEY, rather than the SoC per se, are determining the MBA cost.
 
  • Like
Reactions: scannall
Apr 30, 2020
68
170
76
Why would you compare M1 transistor count to a single-chiplet 3000 series unless, as I suspect given your recent ranting about the lack of Renoir comparisons, your motivation is primarily AMD advocacy?

You need to factor in that the M1 is a full SoC, with a very large percentage of its transistors devoted to functions beyond CPU cores, functions that will obviously be lacking in a single-chiplet AMD Ryzen 3000, which doesn't even have GPU cores.

A large portion of the die is the GPU, claimed to be the most powerful iGPU in a PC part, and another large portion is the Neural Engine. Then there are other functions pulled in, like security and SSD controllers.

While the cores are almost certainly the biggest in class, simply comparing total transistor counts and making pronouncements without factoring in the extra SoC functionality seems disingenuous or ignorant.
Ryzen CPUs do not need a chipset to run - all Ryzen CPUs are "SoC" designs. Standard desktop parts obviously don't have an iGPU, but they are still capable of "SoC" operation as long as you have a dGPU. Obviously all Ryzen APUs are true "full SoCs". That's what AMD's A300 "chipset" is: literally just a motherboard without a chipset, with all I/O going directly to the CPU. Renoir laptops don't have a chipset either.

Even comparing it to Renoir, the M1 still has a massive number of additional transistors. Renoir packs 9.8 billion transistors; the M1 packs ~63% more. This is enabled in large part by the 45% density increase of the TSMC 5nm manufacturing process. Those ~6.2 billion extra transistors compared to Renoir are not all going into the iGPU - that'd literally be a whole Radeon 5500 XT worth of transistors ON TOP of the transistors already used in Renoir's Vega 8. A Renoir with that many transistors, and that large a die, would likely cost too much for low-end to mid-range consumer products. As it is, AMD worked very hard to get Renoir's die size as small as possible - it's far more dense than desktop Zen 2 chips or even Navi 1 or Navi 2. Every mm^2 counts, especially on the already very capacity-constrained 7nm node.
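Checking that arithmetic, a quick sketch using the counts quoted above:

```python
# Transistor-count comparison from the post above.
m1 = 16.0       # billion transistors (Apple's figure)
renoir = 9.8    # billion transistors (AMD's figure)

print(f"extra: {m1 - renoir:.1f}B")   # ~6.2B additional transistors
print(f"ratio: {m1 / renoir:.2f}x")   # ~1.63x, i.e. ~63% more
```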

Apple having first dibs and near wide open access to a brand new node gives them a lot of manufacturing flexibility that companies like AMD just don't have at the moment. I'm sure AMD would have loved to build a 16 billion transistor APU, but it's just not going to happen on the 7nm node.
 
Last edited:
  • Like
Reactions: Tlh97

name99

Senior member
Sep 11, 2010
404
303
136
Calling a $999 device low-end - that is the problem.

If you want a cheap APPLE COMPUTER buy an iPad.
MBA is the low end of the MAC range.

It's not Apple's fault that you insist on interpreting their computer lineup through the eyes of 2005 rather than the eyes of 2025...
(Again, this is not snark. I am trying to get you to see that Apple is skating to where the puck will be, is reinterpreting personal computing as a whole. If you insist that "they're doing it wrong" because what they're doing doesn't match what you're used to, well, complain all you like but remember the lessons of 2007.
Personal computing was reconfigured once before in the recent past -- to widespread mocking by the old guard who insisted nothing could or would change.)
 

IntelUser2000

Elite Member
Oct 14, 2003
8,686
3,785
136
Actually, the DRAM packaging does give performance advantages. The short and extremely well-characterized traces from the CPU to the LPDDR chips require many fewer electrons to move over them to generate a signal (i.e. they have lower capacitance), and this allows not just lower power but faster switching.

Nothing new for the MacBook Air (or for most premium ultrabooks). Ice Lake supports LPDDR4X.
 

name99

Senior member
Sep 11, 2010
404
303
136
Ryzen CPUs do not need a chipset to run - all Ryzen CPUs are "SoC" designs. Standard desktop parts obviously don't have an iGPU, but they are still capable of "SoC" operation as long as you have a dGPU. Obviously all Ryzen APUs are true "full SoCs". That's what AMD's A300 "chipset" is: literally just a motherboard without a chipset, with all I/O going directly to the CPU. Renoir laptops don't have a chipset either.

Even comparing it to Renoir, the M1 still has a massive number of additional transistors. Renoir packs 9.8 billion transistors; the M1 packs ~63% more. This is enabled in large part by the 45% density increase of the TSMC 5nm manufacturing process. Those ~6.2 billion extra transistors compared to Renoir are not all going into the iGPU - that'd literally be a whole Radeon 5500 XT worth of transistors ON TOP of the transistors already used in Renoir's Vega 8. A Renoir with that many transistors, and that large a die, would likely cost too much for low-end to mid-range consumer products. As it is, AMD worked very hard to get Renoir's die size as small as possible - it's far more dense than desktop Zen 2 chips or even Navi 1 or Navi 2. Every mm^2 counts, especially on the already very capacity-constrained 7nm node.

You are missing the point.
Renoir density is 9.8B transistors in 156mm^2
A13 density is 8.5B in 98mm^2
You don't need to break out the calculator to see that one of these is going to be ~1.4x the other.
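For anyone who does want the calculator, a quick sketch of that ratio from the figures above:

```python
# Transistor density from the die sizes and counts quoted above.
renoir_density = 9.8e3 / 156   # million transistors per mm^2
a13_density = 8.5e3 / 98

print(f"Renoir: {renoir_density:.0f} MTr/mm^2")        # ~63
print(f"A13:    {a13_density:.0f} MTr/mm^2")           # ~87
print(f"ratio:  {a13_density / renoir_density:.2f}x")  # ~1.38x
```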

Why the difference? Most important for the issue at hand is that AMD prioritized speed over smarts. Sure, they have some smarts in the CPUs, but the priority is hitting 5GHz.
That requires larger transistors, which means lower density.

Now AMD can make whatever choices they like; it's up to them. And given the nature of the average x86 fan, I can't blame them for prioritizing GHz as a marketing number.
But it's simply incorrect to insist that Apple gets to use more transistors because they are on 5nm; they get to use more transistors because they have prioritized the use of small transistors. Even on exactly the same process, Apple was running at substantially higher transistor density.
 
  • Like
Reactions: insertcarehere

name99

Senior member
Sep 11, 2010
404
303
136
Nothing new on the Macbook Air(or for most premium ultrabooks). Icelake supports LPDDR4x.

The claim was not about who was first to use LPDDR4 on a laptop; the claim was that

"I doubt it's there for latency. It's similar to PoP DRAM configs on mobile chips. It's for saving board space."

My point is that, while board space is nice, LPDDR (and its packaging) gets you both power and performance advantages, and those matter even more than board space.
 

IntelUser2000

Elite Member
Oct 14, 2003
8,686
3,785
136
My point is that, while board space is nice, LPDDR (and its packaging) gets you both power and performance advantages, and those matter even more than board space.

That's why it sounded strange to me. LPDDR4X isn't new even for x86. It sounded like people were talking about some inherent advantage the M1 CPU has in that respect.
 

Doug S

Platinum Member
Feb 8, 2020
2,263
3,514
136
calling a $999 device low-end. that is the problem.

Low-end is relative. Find a higher-performing laptop that costs less. Or even a higher-performing laptop that costs MORE.

The knocks on Apple for pricing too high relative to what you get worked a little better when they were using the same x86 chips as other laptops, and the $999 you spent on an Air could be matched performance-wise by laptops costing less than half as much. That's definitely no longer true.

Now you can of course move the goalposts and talk about max memory config or whatever, but these are just the initial models. I'm sure they'll fill things out more next year and that 16GB ceiling won't last long, but there will always be something for Apple haters to pick on, I'm sure.

If by "low end" you mean laptops costing $250 like the trash you can get at Best Buy, well Apple never has and never will fight it out in that crap end of the market. Just like while you can buy $50 Androids, you will never see a $50 iPhone.
 

moinmoin

Diamond Member
Jun 1, 2017
4,952
7,661
136
Of course not; they are also in the Neural Engine, the security engine, the SSD controller, and probably a few other functions not in Renoir, which considerably shrinks the gap.

All things I mentioned and you ignored.
Not sure what Apple's "security engine" entails, but all Zen chips do contain an ARM-based Platform Security Processor that implements ARM's TrustZone and further security concepts like extensive memory encryption through SME, SEV, etc.

Apple did mention the "Secure Enclave" as a feature, which reminds me of Intel's now-infamous, often-broken SGX; it has existed in Apple's A-series since the iPhone 5S according to its support page. Is that the extent of the "security engine"?
 
  • Like
Reactions: Tlh97

name99

Senior member
Sep 11, 2010
404
303
136
That's why it sounded strange to me. LPDDR4x isn't new even for x86. Sounded like people were talking about some inherent advantage M1 CPU has in that aspect.

What IS new is how tightly the DRAM is packaged with the M1.

On older Macs (and every PC) the LPDDR4 is soldered to the motherboard. Better capacitance than sockets, but still not as low as mounting it on the same package as the SoC. This has power consequences, and may affect the max frequency you can hit.
For example, I see that the most recent Intel-based MBA and MBP run their LPDDR4X at 3733, not 4266.
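That data-rate gap translates directly into peak bandwidth. A quick sketch, assuming a 128-bit LPDDR4X interface (the bus width here is an assumption for illustration, in line with what teardowns reported for these machines):

```python
# Peak DRAM bandwidth = bus width (bytes) * transfer rate.
# The 128-bit bus width is an assumption for illustration.
def peak_bw_gbs(bus_bits: int, mts: int) -> float:
    return (bus_bits / 8) * mts / 1000  # GB/s

print(f"{peak_bw_gbs(128, 3733):.1f} GB/s")  # ~59.7 at LPDDR4X-3733
print(f"{peak_bw_gbs(128, 4266):.1f} GB/s")  # ~68.3 at LPDDR4X-4266
```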
 

amrnuke

Golden Member
Apr 24, 2019
1,181
1,772
136
Why the difference? Most important for the issue at hand is that AMD prioritized speed over smarts. Sure, they have some smarts in the CPUs, but the priority is hitting 5GHz.
That requires larger transistors, which means lower density.

Now AMD can make whatever choices they like, it's up to them. And given the nature of the average x86 fan, I can't blame them for prioritizing GHz, as a marketing number.
But it's simply incorrect to insist that Apple gets to use more transistors because they are on 5nm; they get to use more transistors because they have prioritized the use of small transistors. Even on exactly the same process, Apple was running at substantially higher transistor density.
I agree with most of what you're saying, except the part where you imply that Apple's chip design is somehow smarter than AMD's.

Apple, AMD, and Intel design their architectures to fit within certain design and market parameters. It's Apple fitting their chips into those parameters that produces a (relatively) wide, slow, high-IPC core. It's AMD fitting their chips into their own design parameters that produces a (relatively) narrow, fast, low-IPC core. Neither is child's play, but it's also not 17-dimensional rocket science for AMD's, Intel's, or Apple's teams to fit a chip into a certain set of parameters. And both Apple and AMD are succeeding in what they're doing.

You mention that Apple likes small transistors -- and a LOT of them. Yes. Because (and I'm going to take a total devil's-advocate stance here) they are brute-forcing a bunch of transistors into a wide core and a huge L2$, and leveraging TSMC's node advancements to ensure that this ever-widening core fits within a certain envelope. How is brute-forcing a ton of transistors to create a wide, slow, high-IPC core any smarter than AMD leveraging design changes that produce IPC uplifts exceeding the raw transistor-count increases?

After all, within each of their sets of parameters, and playing in Apple's seemingly dominant area of pure single-threaded performance, AMD went from lagging the A13 with Zen 2 to outperforming the A14 with Zen 3 in SPEC2006 INT and FP. So being dismissive of AMD's work is just ridiculous coming from someone as clearly intelligent about these matters as you are, and it seems really biased.

And to be clear, Apple are doing a STELLAR job of fitting all that single-threaded performance into a small power envelope. But Apple's parameters aren't really AMD's parameters right now, are they? And to imply AMD's designs are less smart because they don't fit within Apple's parameters is just silly. Are Apple stupid because they can't produce a highly scalable MT CPU?
 

name99

Senior member
Sep 11, 2010
404
303
136
Not sure what Apple's "security engine" entails, but all Zen chips do contain an ARM-based Platform Security Processor that implements ARM's TrustZone and further security concepts like extensive memory encryption through SME, SEV, etc.

Apple did mention the "Secure Enclave" as a feature, which reminds me of Intel's now-infamous, often-broken SGX; it has existed in Apple's A-series since the iPhone 5S according to its support page. Is that the extent of the "security engine"?

Until you have read Apple's Security White Paper you don't get to comment on the quality of their work:
 

moinmoin

Diamond Member
Jun 1, 2017
4,952
7,661
136
Thanks for the link, was nice talking with you.

Edit: The white paper talks extensively about an expanded approach to trusted platform that extends to the app and accessory level; as such it's definitely more thorough than what's available on PC. As of the April 2020 version of the paper, though, memory (as in RAM) encryption is not part of that, so that's an advantage of the Zen platform.
 
Last edited:
Apr 30, 2020
68
170
76
Of course not; they are also in the Neural Engine, the security engine, the SSD controller, and probably a few other functions not in Renoir, which considerably shrinks the gap.

All things I mentioned and you ignored.
You are missing the point.
Renoir density is 9.8B transistors in 156mm^2
A13 density is 8.5B in 98mm^2
You don't need to break out the calculator to see that one of these is going to be ~1.4x the other.

Why the difference? Most important for the issue at hand is that AMD prioritized speed over smarts. Sure, they have some smarts in the CPUs, but the priority is hitting 5GHz.
That requires larger transistors, which means lower density.
I didn't ignore them, but it's really impossible to ascertain the transistor counts of those items without more information from Apple. You need to account for the fact that Renoir also has a security processor, 20 PCIe PHYs, two NVMe/SATA PHYs, 5 display PHYs, 4 display controllers, 10 USB PHYs, the massive DDR4 PHY, and more. That monstrous amount of I/O takes up a massive amount of die space and requires tons of transistors. The M1 doesn't have 20 PCIe lanes, it doesn't have NVMe or SATA connections, it doesn't have 10 USB ports, and it doesn't have 5 display PHYs. It doesn't need a giant DDR4 PHY with big transistors to drive high-speed signals all the way across a motherboard.

The same goes for other Apple chips. For AMD to pack in the same number of transistors as the M1 while still retaining the I/O necessary for a modern non-Apple PC, the die sizes would be unreasonably large. AMD wouldn't be able to sell these chips affordably or manufacture enough of them. AMD has to make a lot of compromises to accommodate the demands of PC notebooks and desktops, things Apple, in their non-upgradeable, tightly integrated walled garden, don't have to deal with or accommodate.
 
  • Like
Reactions: Tlh97 and kurosaki

gdansk

Platinum Member
Feb 8, 2011
2,107
2,603
136
Apple has some great security SNAFUs:
And, unfortunately, they give people a lot of reason to look for these exploits (in order to jailbreak iPhones).

Not exactly on topic, but if you use macOS today you'll notice applications taking longer to launch. That's because Apple's code-signing (OCSP) server (http://ocsp.apple.com) is not responding.
 
  • Haha
Reactions: Tlh97 and kurosaki

Eug

Lifer
Mar 11, 2000
23,587
1,001
126
What IS new is how tightly the DRAM is packaged with the M1.

On older Macs (and every PC) the LPDDR4 is soldered to the motherboard. Better capacitance than sockets, but still not as low as mounting it on the same package as the SoC. This has power consequences, and may affect the max frequency you can hit.
For example, I see that the most recent Intel-based MBA and MBP run their LPDDR4X at 3733, not 4266.
If they go with 2x2 LPDDR5 in the package for the M1X (or whatever it's called), doesn't that mean support for up to 64 GB?

32 GB seems like too low a target to aim for, for the Mac mini and iMac.
 
Last edited:

name99

Senior member
Sep 11, 2010
404
303
136
I agree with most of what you're saying, except the part where you imply that Apple's chip design is somehow smarter than AMD's.

Apple, AMD, and Intel design their architectures to fit within certain design and market parameters. It's Apple fitting their chips into those parameters that produces a (relatively) wide, slow, high-IPC core. It's AMD fitting their chips into their own design parameters that produces a (relatively) narrow, fast, low-IPC core. Neither is child's play, but it's also not 17-dimensional rocket science for AMD's, Intel's, or Apple's teams to fit a chip into a certain set of parameters. And both Apple and AMD are succeeding in what they're doing.

You mention that Apple likes small transistors -- and a LOT of them. Yes. Because (and I'm going to take a total devil's-advocate stance here) they are brute-forcing a bunch of transistors into a wide core and a huge L2$, and leveraging TSMC's node advancements to ensure that this ever-widening core fits within a certain envelope. How is brute-forcing a ton of transistors to create a wide, slow, high-IPC core any smarter than AMD leveraging design changes that produce IPC uplifts exceeding the raw transistor-count increases?

After all, within each of their sets of parameters, and playing in Apple's seemingly dominant area of pure single-threaded performance, AMD went from lagging the A13 with Zen 2 to outperforming the A14 with Zen 3 in SPEC2006 INT and FP. So being dismissive of AMD's work is just ridiculous coming from someone as clearly intelligent about these matters as you are, and it seems really biased.

And to be clear, Apple are doing a STELLAR job of fitting all that single-threaded performance into a small power envelope. But Apple's parameters aren't really AMD's parameters right now, are they? And to imply AMD's designs are less smart because they don't fit within Apple's parameters is just silly. Are Apple stupid because they can't produce a highly scalable MT CPU?

(a) Apple's choice, TODAY, gives you single-threaded performance equal to the best achievable on the x86 side, at substantially lower power.
How is this not a superior result?

(b) AMD are doing OK with the hand they are dealt (which includes their customer base). That's not the same thing as saying that they are on the right side of design history.

(c) The issue is not "brute-forcing a bunch of transistors into a wide core and huge L2$". Saying that demonstrates massive ignorance.

Growing resources blindly buys you very little.
Look at the graphs (esp the first) in this paper:
Note how blindly quadrupling resources gets you no more than 1.5x IPC; yet Apple are already higher than that relative to AMD or Intel.
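That flat curve is the familiar diminishing-returns story: Pollack's rule puts single-thread performance at roughly the square root of resources, and the measured curves in the paper are flatter still. A toy illustration:

```python
# Diminishing returns from blindly scaling out-of-order resources.
# Pollack's rule approximates performance ~ sqrt(resources); the paper's
# measured curves are flatter still (~1.5x IPC for 4x resources).
import math

for scale in (1, 2, 4):
    print(f"{scale}x resources -> ~{math.sqrt(scale):.2f}x perf (Pollack)")
# 4x resources -> ~2x under Pollack's rule, and only ~1.5x in the
# measurements cited above: width alone doesn't buy IPC.
```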

It's all about how you use your transistors: better algorithms, not just larger storage. This is the part that x86 people consistently refuse even to acknowledge, let alone believe: the massive impact of things like better prefetching algorithms, better cache placement and replacement algorithms, and better branch prediction algorithms (not just TAGE for most branches, but special-case handlers for the various difficult branches). x86 folks are happy to talk about the value of fusion on their side, because they know Intel and AMD engage in it; but they cannot see the extent to which ARMv8 provides a very rich pre-fused instruction set, along with Apple adding a very rich set of dynamic fusions. They can't see the value in Load/Store Pair. They refuse to concede the value of CSEL and other ARMv8 predication (presumably because Intel has made a hash of this every time they've implemented it, so they can't imagine anyone else could do it correctly). And so on and so on.

For example:
look at what IBM do with single instruction branches, from POWER8 on:
https://pdfs.semanticscholar.org/a1e4/f4ae16c5a18896fe1718acfe56a26aeca620.pdf
(figure 4, c)
Until you understand why this transformation is worth doing, you're not understanding what matters in a modern CPU...
And then remember that it's almost certain that pretty much every good idea you see in any other CPU is also implemented by Apple. I say that not as a fanboy, but as someone who understands what it takes to hit the IPC numbers they achieve across such a wide range of code.

And so on, and so on.