Discussion Apple Silicon SoC thread


Eug

Lifer
Mar 11, 2000
23,790
1,361
126
M1
5 nm
Unified memory architecture - LPDDR4X
16 billion transistors

8-core CPU

4 high-performance cores
192 KB instruction cache
128 KB data cache
Shared 12 MB L2 cache

4 high-efficiency cores
128 KB instruction cache
64 KB data cache
Shared 4 MB L2 cache
(Apple claims the 4 high-efficiency cores alone perform like a dual-core Intel MacBook Air)

8-core iGPU (but there is a 7-core variant, likely with one inactive core)
128 execution units
Up to 24576 concurrent threads
2.6 Teraflops
82 Gigatexels/s
41 gigapixels/s

16-core neural engine
Secure Enclave
USB 4

Products:
$999 ($899 edu) 13" MacBook Air (fanless) - 18 hour video playback battery life
$699 Mac mini (with fan)
$1299 ($1199 edu) 13" MacBook Pro (with fan) - 20 hour video playback battery life

Memory options 8 GB and 16 GB. No 32 GB option (unless you go Intel).

It should be noted that the M1 chip in these three Macs is the same (aside from GPU core count). Basically, Apple is taking the same approach with these chips as they do with the iPhones and iPads: just one SKU (excluding the X variants), which is the same across all iDevices (aside from maybe slight clock speed differences occasionally).

EDIT:


M1 Pro 8-core CPU (6+2), 14-core GPU
M1 Pro 10-core CPU (8+2), 14-core GPU
M1 Pro 10-core CPU (8+2), 16-core GPU
M1 Max 10-core CPU (8+2), 24-core GPU
M1 Max 10-core CPU (8+2), 32-core GPU

M1 Pro and M1 Max discussion here:


M1 Ultra discussion here:


M2 discussion here:


Second Generation 5 nm
Unified memory architecture - LPDDR5, up to 24 GB and 100 GB/s
20 billion transistors

8-core CPU

4 high-performance cores
192 KB instruction cache
128 KB data cache
Shared 16 MB L2 cache

4 high-efficiency cores
128 KB instruction cache
64 KB data cache
Shared 4 MB L2 cache

10-core iGPU (but there is an 8-core variant)
3.6 Teraflops

16-core neural engine
Secure Enclave
USB 4

Hardware acceleration for 8K H.264, HEVC (H.265), ProRes

M3 Family discussion here:


M4 Family discussion here:

 

Doug S

Platinum Member
Feb 8, 2020
2,700
4,581
136
Even if the 12P+4E core count is unchanged in M4 Max, I wonder if the amount of L2 cache per cluster will be increased.

That's hard to do because cache isn't shrinking. So if you bump it to 24 MB you're devoting 50% more area to the cache. They've moved to LPDDR5X so it'll have more memory bandwidth to compensate for the slightly lower L2 hit rate. Then they'll move to LPDDR6 if the per core L2 shrinks further with M5.
 

FlameTail

Diamond Member
Dec 15, 2021
3,757
2,203
106
That's hard to do because cache isn't shrinking. So if you bump it to 24 MB you're devoting 50% more area to the cache. They've moved to LPDDR5X so it'll have more memory bandwidth to compensate for the slightly lower L2 hit rate. Then they'll move to LPDDR6 if the per core L2 shrinks further with M5.
I find it hard to believe that the cache can be reduced without consequences.

The whole reason for having a cache is to reduce trips out to DRAM, because DRAM accesses are costly in terms of power, and having less cache will reduce performance.
 

Doug S

Platinum Member
Feb 8, 2020
2,700
4,581
136
I find it hard to believe that the cache can be reduced without consequences.

The whole reason for having a cache is to reduce trips out to DRAM, because DRAM accesses are costly in terms of power, and having less cache will reduce performance.

Cache size has a long tail like most things, that's why multiple levels of cache work so well. So cutting the cache size in half isn't halving its impact, it is probably reducing its impact by a single digit percent. And those single digits are probably pretty small at a 16MB L2 size!
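To put rough numbers on the long-tail point, here is a minimal Python sketch. It assumes the common square-root rule of thumb (miss rate scales roughly as 1/sqrt(cache size)), and every figure in it (1% miss rate at 16 MB, 5 ns hit, 100 ns DRAM penalty) is an illustrative assumption, not a measurement of any Apple chip.

```python
import math

def l2_miss_rate(size_mb, ref_size_mb=16.0, ref_miss_rate=0.01):
    """Square-root rule of thumb: miss rate ~ 1/sqrt(size). Assumed, not measured."""
    return ref_miss_rate * math.sqrt(ref_size_mb / size_mb)

def avg_l2_access_ns(size_mb, hit_ns=5.0, dram_penalty_ns=100.0):
    """Average access time = hit time + miss rate * DRAM penalty (all values assumed)."""
    return hit_ns + l2_miss_rate(size_mb) * dram_penalty_ns

full = avg_l2_access_ns(16.0)   # hypothetical 16 MB L2
half = avg_l2_access_ns(8.0)    # same workload with the L2 cut in half
penalty_pct = (half / full - 1.0) * 100.0

print(f"16 MB: {full:.2f} ns, 8 MB: {half:.2f} ns, penalty: {penalty_pct:.1f}%")
```

With these made-up inputs, halving the cache raises the miss rate by about 41% but the average access time by only a single-digit percent, because hits still dominate. Different workloads will land elsewhere on the curve.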

For stuff with large working sets of randomly accessed data (Oracle and other RDBMS being the classic example) it can hurt, but for the type of MT stuff that's blocking or streaming (more Cinebench type of workloads) it would have little or no impact. In fact, if you get more memory bandwidth as part of the deal it may increase performance (if it was limited by memory bandwidth and not something like FP/NEON/SSVE/SME throughput).

Let's say they went to 8 or even 10 P cores per cluster in the future without changing the L2 size, thus shrinking the amount of cache per core. That only matters when you are using ALL those cores at once. Nobody is running Oracle or other large scale databases on a Mac, but the loads they do run that might consume all available cores are likely to be video or scientific in nature - i.e. more like Cinebench than Oracle. If your cores are doing different things, they won't all have the same working set size, so you could have one core use 8MB of L2 and a few other cores using only 1MB collectively, etc.
 

LightningZ71

Golden Member
Mar 10, 2017
1,782
2,135
136
With finflex, is it possible that, in the larger Max and Ultra chips that have their own die, Apple may choose to make three different types of clusters instead of two? My idea is that they make one of the P core clusters finflexed to a high performance transistor layout, the remaining ones flexed to a high efficiency layout, and leave the E cores in their high efficiency layout.

This would allow Apple to optimize single thread and lightly threaded responsiveness and performance while retaining high MT throughput at more manageable power/thermal levels overall. Preferred core/cluster is well understood in the industry and finflex makes this more doable.
 

Hitman928

Diamond Member
Apr 15, 2012
6,025
10,352
136
With finflex, is it possible that, in the larger Max and Ultra chips that have their own die, Apple may choose to make three different types of clusters instead of two? My idea is that they make one of the P core clusters finflexed to a high performance transistor layout, the remaining ones flexed to a high efficiency layout, and leave the E cores in their high efficiency layout.

This would allow Apple to optimize single thread and lightly threaded responsiveness and performance while retaining high MT throughput at more manageable power/thermal levels overall. Preferred core/cluster is well understood in the industry and finflex makes this more doable.

Yes. Some ARM chips already do this type of thing by having different P-cores with different cache sizes and target frequencies.
 

name99

Senior member
Sep 11, 2010
489
379
136
Even if the 12P+4E core count is unchanged in M4 Max, I wonder if the amount of L2 cache per cluster will be increased.

Although M3 Pro/Max increase the number of P-cores in a cluster to 6, the L2 cache size is unchanged!
              P cores per cluster    L2 cache per cluster    L2 cache per core
M1 series     4                      12 MB                   3 MB
M2 series     4                      16 MB                   4 MB
M3 Pro/Max    6                      16 MB                   2.66 MB

This means the L2 cache per core in M3 Pro/Max is less than that of even the M1!
There are many ways to make a cache "effectively" larger without increasing the nominal cache size.
There are the obvious elements like better cache placement and replacement algorithms. For example, at some point recently (and it may well have been with the M3) Apple added support for tracking critical lines in the L2. Critical lines are lines that make much more of a performance difference than most lines - obvious examples (for obvious reasons) are I-lines as opposed to D-lines, or lines that hold page walk info.

It's well known that most of the lines (ie more than 50%) in large caches are dead, ie will effectively never be touched again in the significant future. You can go a long way to making your cache effectively larger just by reducing the number of dead lines.

A second way to make your cache effectively larger is by compression. The most obvious compression that would work for an L2 is an auxiliary zero-content cache, one that holds the addresses of lines that are fully zero (of which there are a surprisingly large number). You can do the same sort of thing for pages with one extra bit in the TLB.
There are fancier compression schemes, and Apple has patents on a scheme that's so vaguely described it could mean anything from full cache compression to an auxiliary zero-content cache. The parts that are implemented already seem to be the ability to move around zero lines with just a one-bit indicator, not the entire 128B line. But presumably at some point the rest will be added, and that point may even have been the M3.
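To make the zero-content idea concrete, here is a toy Python sketch. It is not Apple's design (the patents are too vague to reconstruct), and the class and method names are hypothetical. The only detail taken from the post is the 128B line size. All-zero lines are tracked by address alone, so they cost no data storage.

```python
LINE_BYTES = 128  # line size mentioned in the post


class ZeroAwareCache:
    """Toy model: all-zero lines are recorded by address only (hypothetical design)."""

    def __init__(self):
        self.data_lines = {}     # line address -> 128-byte payload
        self.zero_lines = set()  # addresses of lines known to be all zero

    def write_line(self, addr, payload):
        if len(payload) != LINE_BYTES:
            raise ValueError("payload must be exactly one line")
        if not any(payload):
            # Fully zero: remember the address, store no data.
            self.data_lines.pop(addr, None)
            self.zero_lines.add(addr)
        else:
            self.zero_lines.discard(addr)
            self.data_lines[addr] = bytes(payload)

    def read_line(self, addr):
        """Return the line contents, or None on a miss."""
        if addr in self.zero_lines:
            return bytes(LINE_BYTES)  # synthesize zeros without touching data storage
        return self.data_lines.get(addr)

    def data_bytes_used(self):
        return len(self.data_lines) * LINE_BYTES


cache = ZeroAwareCache()
cache.write_line(0x1000, bytes(LINE_BYTES))            # zero line
cache.write_line(0x1080, b"\x01" + bytes(LINE_BYTES - 1))  # ordinary line
print(cache.data_bytes_used())
```

Two lines are resident here, but only one consumes data storage. Real hardware would track the zero lines in a small tag-only structure rather than a Python set.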
 

name99

Senior member
Sep 11, 2010
489
379
136
I find it hard to believe that the cache can be reduced without consequences.

The whole reason for having a cache is to reduce trips out to DRAM, because DRAM accesses are costly in terms of power, and having less cache will reduce performance.
Define consequences...

Compare the size of Apple's caches to the competition. That extra size is not PRIMARILY for performance - you can run the simulations, many have, and increasing L2 from say 2 to 4MB per core buys you a few percent, nice but not life changing.
Apple's large L2s are PRIMARILY about lower energy (as is practically every SoC choice Apple makes).
If they found an alternative way to get the same energy savings (ie less energy spent in communicating with DRAM [imagine, for example that DRAM and SLC are now transparently HW compressed, so that most line transactions actually move two lines worth of data]) then L2 size can go down or at least remain stationary.
 

FlameTail

Diamond Member
Dec 15, 2021
3,757
2,203
106
I believe they can, if they move away from the clockspeed ideology.

In the golden days of scaling, you were uarch-limited in terms of clocks, so deeper pipelines got you a lot more. A 40% increase in pipeline depth might have resulted in, say, a 25-30% increase in clocks. 10 vs 20 stages might be a 60-70% difference in clocks. 3GHz vs 5.xGHz is a lot to overcome.

Now you have 9-10 stage pipeline CPUs reaching 4.4GHz, and above 5.x GHz you run into thermal density issues, so you need to do stupid things like widening the space between transistors to reduce that, which makes the die larger too. And you are doing that even though the 5.x GHz CPU has a near 20-stage pipeline. You have chips like Raptor Lake literally frying themselves with extra voltage to get to 6GHz.

And uop caches are better avoided. The reason? The more the cores are limited by power, die size, and slower scaling, the less speculative gains are worth it. A uop cache only pays off on a hit, which is a matter of chance, while skipping it and spending that budget elsewhere is a guarantee. Branch predictors will never hit 100% accuracy, so there's always room for uncertainty, so those extra stages make it worth. Remember that the uop cache itself adds 2 extra stages on a miss, which is why we went from 14 stages on Core to 14-18 on Sandy Bridge.

The OC headroom for modern CPUs is zero for this reason as well. While it has been painfully slowly creeping up, above 5GHz has always been the domain of exotic cooling, regardless of pipeline stages. What happened is that cooling has not only advanced, but become significantly larger too. You should see how small "power hungry" Prescott heatsinks are compared to the modern literal aluminum bricks. Or how water cooling has become common, when it used to be in the exotic cooling domain too.
I find this topic very interesting. Historically ARM cores used to clock very low (<3GHz). But that has been changing in the past 5 years.

M1 -> M4, 3.2 GHz -> 4.4 GHz. More than a 30% gain in clock speeds in just 4 years.

For comparison on the x86 side, Zen3 -> Zen5, 4.9 GHz -> 5.7 GHz, only a 15% gain in the same time period.

If Apple continues on this trajectory, they will hit 5 GHz in the M6.

And it's not only Apple. Even Qualcomm is doing it. Snapdragon 8G4/XElite's Oryon CPU clocks at 4.2 GHz, and there are rumours that the next generation will hit 5 GHz.

What has changed in the ARM camp, and how is it different from Intel/AMD's high clock speed philosophy?
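For anyone who wants to check the arithmetic behind those percentages and the M6 guess, here is a small Python sketch. It uses only the clock figures quoted above and assumes a constant compound gain per generation, which is purely a curve-fitting assumption and says nothing about whether the trend is physically sustainable.

```python
def pct_gain(old_ghz, new_ghz):
    """Percentage increase from old_ghz to new_ghz."""
    return (new_ghz / old_ghz - 1.0) * 100.0

apple_gain = pct_gain(3.2, 4.4)  # M1 -> M4
amd_gain = pct_gain(4.9, 5.7)    # Zen 3 -> Zen 5

# Three generational steps from M1 to M4, so take the cube root
# to get the assumed per-generation compound factor.
per_gen = (4.4 / 3.2) ** (1.0 / 3.0)

def extrapolate(start_ghz, factor, generations):
    """Naive compound extrapolation; an assumption, not a forecast."""
    return start_ghz * factor ** generations

m5 = extrapolate(4.4, per_gen, 1)
m6 = extrapolate(4.4, per_gen, 2)

print(f"Apple: {apple_gain:.1f}%  AMD: {amd_gain:.1f}%")
print(f"Per-generation factor: {per_gen:.3f}  M5: {m5:.2f} GHz  M6: {m6:.2f} GHz")
```

Under this naive fit the M5 would land just under 5 GHz and the M6 above it.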
 

johnsonwax

Member
Jun 27, 2024
77
146
66
I find this topic very interesting. Historically ARM cores used to clock very low (<3GHz). But that has been changing in the past 5 years.

M1 -> M4, 3.2 GHz -> 4.4 GHz. More than a 30% gain in clock speeds in just 4 years.

For comparison on the x86 side, Zen3 -> Zen5, 4.9 GHz -> 5.7 GHz, only a 15% gain in the same time period.

If Apple continues on this trajectory, they will hit 5 GHz in the M6.

And it's not only Apple. Even Qualcomm is doing it. Snapdragon 8G4/XElite's Oryon CPU clocks at 4.2 GHz, and there are rumours that the next generation will hit 5 GHz.

What has changed in the ARM camp, and how is it different from Intel/AMD's high clock speed philosophy?
I don't see why you would think they would extrapolate in that way. Apple's entire history with Apple Silicon has been about catching ARM design and process up to x86 a little bit each year. They're now caught up. They're going to face the same kind of scaling issues that x86 has been dealing with for ages.

Their design approach may make that process a little easier, but we're trading dollars for physics here, and it's a diminishing relationship at this point. Their opportunity is that, by controlling the whole stack, they can take advantage of asymmetric cores faster than anyone else. Closing up the design-to-ship window would also favor them, if there are any meaningful opportunities there.
 

DavidC1

Senior member
Dec 29, 2023
778
1,236
96
I find this topic very interesting. Historically ARM cores used to clock very low (<3GHz). But that has been changing in the past 5 years.

M1 -> M4, 3.2 GHz -> 4.4 GHz. More than a 30% gain in clock speeds in just 4 years.

For comparison on the x86 side, Zen3 -> Zen5, 4.9 GHz -> 5.7 GHz, only a 15% gain in the same time period.

If Apple continues on this trajectory, they will hit 5 GHz in the M6.
They aren't going to reach 5GHz without similar trade-offs.

Moore's Law always favored somewhat less absolute performance for much better size and energy efficiency. So Intel/AMD have been pushing clocks beyond sanity while Apple/ARM just take advantage of natural gains (design/process) to reduce the gap.

Even Intel's 22nm design benefitted their Atoms primarily, which is where the 37% gain went.

Look inside your typical desktop. It is mostly empty space. It used to take more space, but everything is just smaller, including the M.2 drives. Since signal integrity and power consumption both improve as things get smaller, again it shows you development favors smaller sizes and lower power.

What do you think the new CAMM memory form factor will do? Smaller!
 

gai

Junior Member
Nov 17, 2020
5
24
51
I find this topic very interesting. Historically ARM cores used to clock very low (<3GHz). But that has been changing in the past 5 years.

M1 -> M4, 3.2 GHz -> 4.4 GHz. More than a 30% gain in clock speeds in just 4 years.

For comparison on the x86 side, Zen3 -> Zen5, 4.9 GHz -> 5.7 GHz, only a 15% gain in the same time period.

If Apple continues on this trajectory, they will hit 5 GHz in the M6.

And it's not only Apple. Even Qualcomm is doing it. Snapdragon 8G4/XElite's Oryon CPU clocks at 4.2 GHz, and there are rumours that the next generation will hit 5 GHz.

What has changed in the ARM camp, and how is it different from Intel/AMD's high clock speed philosophy?
Each company will produce the fastest design that they can make within their area and power budgets. Neither clock frequency nor work-per-clock alone can predict the final performance, so they are only intermediate indicators.

Here is a basic formula that is taught in any computer architecture classroom:
Time to Completion = Cycle Time * Cycles-per-Instruction * (Dynamic) Instruction Count

Other than changes to the ISA, which would cause the compiled code to be different on a new processor, not much can be done about the instruction count. So, more or less, the performance optimization target is the product of frequency and IPC (across various programs with different characteristics). If you can get 15% frequency at the same per-clock performance as a previous generation, or you can get 5% frequency and 5% average IPC gain, then the higher frequency option has better performance.
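The comparison in the last sentence can be worked through in a few lines of Python. This is a minimal sketch of the formula above with instruction count held fixed; the function names and the example workload are hypothetical.

```python
def time_to_completion(freq_ghz, ipc, instructions):
    """Time = cycle time * cycles-per-instruction * instruction count."""
    cycle_time_s = 1.0 / (freq_ghz * 1e9)
    cpi = 1.0 / ipc
    return cycle_time_s * cpi * instructions

def relative_performance(freq_gain, ipc_gain):
    """Speedup at a fixed instruction count: performance ~ frequency * IPC."""
    return (1.0 + freq_gain) * (1.0 + ipc_gain)

# Option A: +15% frequency, same IPC. Option B: +5% frequency, +5% IPC.
option_a = relative_performance(0.15, 0.00)
option_b = relative_performance(0.05, 0.05)

print(f"Option A speedup: {option_a:.4f}")
print(f"Option B speedup: {option_b:.4f}")

# Hypothetical workload: 1e9 instructions at 4.0 GHz and IPC 2.0.
print(f"Example runtime: {time_to_completion(4.0, 2.0, 1e9):.4f} s")
```

The gains multiply rather than add, so 5% and 5% give about 10.25%, which is still short of a straight 15% frequency bump.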

Note that IPC fundamentally degrades with increase in clock frequency, because the latency difference between cache and DRAM becomes larger, which will add more stall cycles. Increased frequency with "no change to IPC" doesn't mean that a microarchitecture is unchanged.
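Here is a rough Python sketch of why that happens. It assumes DRAM latency is fixed in nanoseconds, so it costs more cycles at a higher clock. The core CPI, miss rate, and 100 ns latency are made-up illustrative values, not figures for any real chip.

```python
def effective_ipc(freq_ghz, core_cpi=0.25, misses_per_instr=0.0005, dram_ns=100.0):
    """IPC once DRAM stalls are included; all default parameters are assumptions."""
    dram_cycles = dram_ns * freq_ghz          # ns * (cycles per ns)
    cpi = core_cpi + misses_per_instr * dram_cycles
    return 1.0 / cpi

ipc_slow = effective_ipc(3.2)
ipc_fast = effective_ipc(4.4)

# Performance ~ frequency * IPC, so compare the products.
speedup = (4.4 * ipc_fast) / (3.2 * ipc_slow)
freq_ratio = 4.4 / 3.2

print(f"IPC at 3.2 GHz: {ipc_slow:.3f}, at 4.4 GHz: {ipc_fast:.3f}")
print(f"Frequency ratio: {freq_ratio:.3f}, actual speedup: {speedup:.3f}")
```

With an otherwise unchanged core, the speedup falls well short of the frequency ratio. Holding IPC flat at the higher clock therefore requires real microarchitectural work, such as better caches or prefetching.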

If both microarchitectural factors can increase by the same amount (rare!), then the wider, lower frequency option usually has superior energy efficiency, but it depends.
 

okoroezenwa

Member
Dec 22, 2020
93
99
61
Now that WOA is becoming a thing, Apple should add bootcamp support and maybe invest in some Windows drivers. They won't, but they should! :D
Why should they though? Bootcamp (at least via running Windows on bare metal) isn't as important as it was when it was first released. I think a more meaningful thing they could do with Bootcamp is evolve it into a competent first-party VM app for running Windows and Linux.

But maybe this is just my strong desire to see Parallels sherlocked showing.
 

name99

Senior member
Sep 11, 2010
489
379
136
I find this topic very interesting. Historically ARM cores used to clock very low (<3GHz). But that has been changing in the past 5 years.

M1 -> M4, 3.2 GHz -> 4.4 GHz. More than a 30% gain in clock speeds in just 4 years.

For comparison on the x86 side, Zen3 -> Zen5, 4.9 GHz -> 5.7 GHz, only a 15% gain in the same time period.

If Apple continues on this trajectory, they will hit 5 GHz in the M6.

And it's not only Apple. Even Qualcomm is doing it. Snapdragon 8G4/XElite's Oryon CPU clocks at 4.2 GHz, and there are rumours that the next generation will hit 5 GHz.

What has changed in the ARM camp, and how is it different from Intel/AMD's high clock speed philosophy?
Extrapolation is dumb because the OPTIMAL SoC design is tailored to the specifics of each process.
Apple has a primary goal of reduced energy usage, with a secondary goal of allowing the SoC to achieve more each generation (roughly summarized as reduce the EDP - energy delay product - each generation). So the question is HOW do you do that?
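As a quick illustration of the EDP metric, here is a minimal Python sketch. The three design points and their energy and runtime figures are invented for the example and do not describe any real chip.

```python
def edp(energy_joules, delay_seconds):
    """Energy-delay product: lower is better."""
    return energy_joules * delay_seconds

# Hypothetical design points for the same task (made-up numbers).
baseline = edp(10.0, 2.0)
faster_hotter = edp(12.0, 1.5)   # 20% more energy, 25% less time
slower_cooler = edp(8.0, 2.6)    # 20% less energy, 30% more time

print(f"baseline: {baseline:.1f} J*s")
print(f"faster but hotter: {faster_hotter:.1f} J*s")
print(f"slower but cooler: {slower_cooler:.1f} J*s")
```

The metric rewards a design that spends somewhat more energy if it finishes enough sooner, and penalizes one that saves energy by being much slower.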

When your process allows for dramatically more transistors (and wires) than were taken advantage of in existing designs, the obvious thing to do is restructure the chip to exploit all those transistors and wires. And that was the story of Apple Silicon for the first few years.
We have now hit a (probably temporary) period where density (and wire density) are not moving much even as transistor "performance" (ie the frequency available at a particular power budget) keeps improving. Given this new reality, you can either continue doing what you have always done (aka "The Intel Strategy") even when it's far from optimal, or you can design a new optimal strategy. The new strategy is not JUST to use higher frequency (though why not, to the extent that it fits within the energy budget); it's also that the balance between SRAM and logic has shifted. Which means that it makes sense to boost performance via "smarter" SRAM rather than just more of it. Apple has always used more smarts in their caches (and cache-like entities like branch predictors) than the competition, but the current process environment encourages even more of that, more "decisions" associated with what to store in cache, how to store it, when to stop storing it.

Of course things will change. BSPD will open up wiring (which will also open up SRAM density for a process generation or two) and for those designs it will make sense to bump up the SRAMs. Once BSPD (and associated wiring, first clocks then some signals) is mined out we'll start seeing serious thinking about stacked designs.

None of this shows catastrophe. Instead it shows that, continuing the tradition since the A6, Apple plans their designs based on where process is headed, not where it is right now. They're not desperately scrambling in surprise when the balance changes between logic, SRAM, wires, and frequency; that was all already baked into the design four years before the masks were shipped to TSMC.

Note what these hysterical click-bait stories look like. They are NEVER about "here's how the latest Apple chip sucks", they always choose one narrow metric and then obsess over how that metric is now "failing". It was frequency, then it was IPC, then it was power. The common thread is that the metric is always considered in isolation, by people who are not engineers and do not understand either the gestalt of the system or the tradeoffs.
It sounds much more exciting to say "Apple IPC pattern shows their innovation is over" than to say "M4 continues same pattern of 20% annual performance increase as M3 and M2"...
 

FlameTail

Diamond Member
Dec 15, 2021
3,757
2,203
106
      Base       Pro        Max
M1    4P + 4E    8P + 2E    8P + 2E
M2    4P + 4E    8P + 4E    8P + 4E
M3    4P + 4E    6P + 6E    12P + 4E
M4    4P + 6E    ?          ?
Any guesses as to the CPU core configuration of M4 Pro and M4 Max?
 

johnsonwax

Member
Jun 27, 2024
77
146
66
When does Apple start scaling NPU across the line? The point of Pro/Max is mostly GPU perf, not CPU. At some point they'll start looking at NPU scaling across the line. Not sure if M4 got designed in time to care about that for AI.