
Discussion Apple Silicon SoC thread

Page 478 - Seeking answers? Join the AnandTech community: where nearly half-a-million members share solutions and discuss the latest tech.

Eug

Lifer
M1
5 nm
Unified memory architecture - LPDDR4X
16 billion transistors

8-core CPU

4 high-performance cores
192 KB instruction cache
128 KB data cache
Shared 12 MB L2 cache

4 high-efficiency cores
128 KB instruction cache
64 KB data cache
Shared 4 MB L2 cache
(Apple claims the 4 high-efficiency cores alone perform like a dual-core Intel MacBook Air)

8-core iGPU (but there is a 7-core variant, likely with one inactive core)
128 execution units
Up to 24576 concurrent threads
2.6 Teraflops
82 Gigatexels/s
41 gigapixels/s

16-core neural engine
Secure Enclave
USB 4
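A quick sanity check on the GPU figures above: assuming 8 FP32 ALUs per execution unit (a widely reported figure for Apple GPUs, not stated in this post) and a fused multiply-add counted as two FLOPs, the quoted 2.6 TFLOPS implies the commonly cited ~1.28 GHz M1 GPU clock.

```python
# Back out the implied M1 GPU clock from the figures above.
# Assumption: 8 FP32 ALUs per execution unit (not stated in the post).
EUS = 128
ALUS_PER_EU = 8
FLOPS_PER_ALU_PER_CLOCK = 2  # one fused multiply-add = 2 FLOPs

alus = EUS * ALUS_PER_EU  # 1024 FP32 lanes
implied_clock_ghz = 2.6e12 / (alus * FLOPS_PER_ALU_PER_CLOCK) / 1e9
print(f"{alus} ALUs -> implied clock ~{implied_clock_ghz:.2f} GHz")
```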

Products:
$999 ($899 edu) 13" MacBook Air (fanless) - 18 hour video playback battery life
$699 Mac mini (with fan)
$1299 ($1199 edu) 13" MacBook Pro (with fan) - 20 hour video playback battery life

Memory options 8 GB and 16 GB. No 32 GB option (unless you go Intel).

It should be noted that the M1 chip in these three Macs is the same (aside from GPU core count). Basically, Apple is taking the same approach with these chips as it does with the iPhones and iPads: just one SKU (excluding the X variants), which is the same across all iDevices (aside from occasional slight clock speed differences).

EDIT:


M1 Pro 8-core CPU (6+2), 14-core GPU
M1 Pro 10-core CPU (8+2), 14-core GPU
M1 Pro 10-core CPU (8+2), 16-core GPU
M1 Max 10-core CPU (8+2), 24-core GPU
M1 Max 10-core CPU (8+2), 32-core GPU

M1 Pro and M1 Max discussion here:


M1 Ultra discussion here:


M2 discussion here:


Second Generation 5 nm
Unified memory architecture - LPDDR5, up to 24 GB and 100 GB/s
20 billion transistors

8-core CPU

4 high-performance cores
192 KB instruction cache
128 KB data cache
Shared 16 MB L2 cache

4 high-efficiency cores
128 KB instruction cache
64 KB data cache
Shared 4 MB L2 cache

10-core iGPU (but there is an 8-core variant)
3.6 Teraflops

16-core neural engine
Secure Enclave
USB 4

Hardware acceleration for 8K H.264, H.265 (HEVC), and ProRes
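Where the "100 GB/s" M2 bandwidth figure plausibly comes from: LPDDR5-6400 on a 128-bit bus. The bus width is an assumption carried over from the M1; the post itself only states LPDDR5 and 100 GB/s.

```python
# LPDDR5-6400 on an assumed 128-bit bus -> the marketed "100 GB/s".
transfers_per_sec = 6400e6  # LPDDR5-6400: 6400 MT/s per pin
bus_width_bits = 128
bytes_per_transfer = bus_width_bits // 8  # 16 bytes per transfer

bandwidth_gb_s = transfers_per_sec * bytes_per_transfer / 1e9
print(f"~{bandwidth_gb_s:.1f} GB/s")  # 102.4 GB/s, rounded down in marketing
```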

M3 Family discussion here:


M4 Family discussion here:


M5 Family discussion here:

 
Their P cores are nowhere near 15 watts. I can run my 16 Pro Max flat out and it gets a bit warm. Most phones get a lot warmer or even downright hot, especially ones using Qualcomm's latest that are pushing the clock rates WAY more than Apple.

Why are people questioning Apple's strategy when they sell a quarter billion iPhones a year, and it is easily the most profitable product in the history of mankind?

But no, random forum posters think they know better and Apple should copy all the failed strategies of past Android OEMs.
The consensus seems to be that, in rough terms, while running SPEC, and for M5:
- an S core uses about 8 W
- an M core uses about 2.5 W
- an E core uses about 1 W

with an E core at about 35–50% of S core performance (quite a wide spread depending on the exact type of code) and an M core at the rumored ~70% performance.
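Turning those consensus numbers into rough perf/W (S core performance normalized to 1.0, E core shown at both ends of the quoted spread):

```python
# Rough perf/W from the consensus figures above.
cores = {
    "S":       (1.00, 8.0),  # (relative perf, watts)
    "M":       (0.70, 2.5),
    "E (35%)": (0.35, 1.0),
    "E (50%)": (0.50, 1.0),
}

effs = {name: perf / watts for name, (perf, watts) in cores.items()}
for name, eff in effs.items():
    print(f"{name}: {eff:.3f} perf/W ({eff / effs['S']:.1f}x the S core)")
```

The M core lands at about 2.2x the S core's efficiency, and the E core at roughly 2.8–4x depending on which end of the 35–50% spread you take.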

The details (such as we have) *seem* to suggest that a current M core is "just" an E-core running 1.5x faster. When I scanned the SPEC numbers this seemed like a plausible interpretation; there was nothing that stood out as an M core behaving substantially differently from an upclocked E-core.

For the M5 gen however there are still two interesting questions:
- are the M cores physically larger than the E cores? It's possible that Apple relied on TSMC's FinFlex to bump up the physical size of some of the M transistors to allow it to reach ~4.4 GHz. Meaning the M core is not just a naively overclocked E core: it has a different physical layout targeting a different frequency range.
- was the SME unit modified? The E-core SME unit is a quarter the size of the S-core unit, so it takes four times as long for matrix operations. (This is not 100% true, but is close enough as a rule of thumb.)
Given the different role of the M core, did it get a full sized, not a quarter sized SME unit?
Or even something new and in-between, like a half-sized SME unit?
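The quarter-size rule of thumb above, as a toy throughput model: for a throughput-bound matrix operation, time scales inversely with the unit's compute width, with fixed overheads ignored (which is why it's only "close enough" rather than exact). The half-size option is the post's hypothetical in-between.

```python
# Toy model: time for a throughput-bound matrix op ~ 1 / unit width.
def relative_time(unit_fraction: float) -> float:
    """Time to finish a matrix op, relative to a full-size S-core SME unit."""
    return 1.0 / unit_fraction

print(relative_time(1.0))   # full-size S-core unit: 1.0x
print(relative_time(0.25))  # quarter-size E-core unit: 4.0x
print(relative_time(0.5))   # hypothetical half-size in-between unit: 2.0x
```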
 
Nice find!

Those FP and INT scores for the S/P core are ridiculous. They're quite a bit higher than Geekerwan's M5 figures. Do we know the chosen variables and the compiler?
It's possible that he (guy with the 16/22 SPEC scores) is trying to investigate the underlying design choices and the absolute best possible performance, so he's testing while the chips are placed in a fridge or forced cooler or whatever.
While Geekerwan is trying to investigate real world performance, with realistic cooling as provided (or not) by the platform.
Both are legitimate investigations, it's only idiotic partisanship that renders either methodology problematic.

The guy with the higher scores may also be using a newer version of LLVM/XCode with tweaks in it that result in better code generation, and that's worth at least a few percent (maybe more if the difference is more aggressive automatic use of SME and SSVE in more code?)
 
Note the following interesting elements in his diagrams:
- claims of a private per-core L2 (1 MB in size) along with a similarly private L2 TLB, 512 entries in size.
Unclear what he's basing this on. Maybe it will be more visible (or not...) when the usual suspects provide more technical reviews.

- claims of a trace cache (32 KB). This I believe. I've seen the Apple patents for a trace cache and the logic for why it's being added, and the whole system makes a lot of sense. But again I don't know where he got that size from (or even what it means, given that a trace cache does not consist of linear instructions, so the most useful way to describe its size is somewhat different from this sort of raw capacity).

- the usual misunderstanding of the ROB size (which is more like 7 or 8x the 480 he gives). I see this everywhere and have given up trying to correct it. *sigh*

Same sort of situation with his M vs E core. The changes he gives between the two are plausible, but there's no explanation for how he arrived at these conclusions. It seems like there hasn't been enough time to carefully execute and analyze appropriate microbenchmarks, so did he just run a speed run of a bunch of these and unthinkingly write up the results, without looking at anomalies? (It's the anomalies, which require investigation and thought, that take all the time, but that also result in genuine understanding.)
 
I'll never get over the iPhone processor outpacing a desktop x86 by that much. I remember the A7 launch well: after everyone had been screaming that Apple was incapable of designing their own cores, it became clear they could, so they screamed that there was no point to a 64-bit SoC in a phone. If only they knew where that would lead.
Don't worry.
After the subsequent rounds of "Apple can't design a GPU" and "Apple can't design a modem" get ready soon for "Apple can't design a data center rack" (something something optical interconnects, something something high speed SerDes, blah blah blah).
 
It's possible that he (guy with the 16/22 SPEC scores) is trying to investigate the underlying design choices and the absolute best possible performance, so he's testing while the chips are placed in a fridge or forced cooler or whatever.
The results are about in line with Dave Huang's scores for SPECint. I doubt there's any search for absolute best performance as you describe. It's Geekerwan's scores that look lower than expected.

Both are legitimate investigations, it's only idiotic partisanship that renders either methodology problematic.
Both are legitimate as long as they don't change methodology, including using the same binaries; any change would require re-benchmarking all compared platforms.

The guy with the higher scores may also be using a newer version of LLVM/XCode with tweaks in it that result in better code generation, and that's worth at least a few percent
Compiler improvement and perhaps simply better flags.
 
Note the following interesting elements in his diagrams:
- claims of a private per-core L2 (1 MB in size) along with a similarly private L2 TLB, 512 entries in size.
Unclear what he's basing this on. Maybe it will be more visible (or not...) when the usual suspects provide more technical reviews.
Aren't you the last one not believing in that private L2 cache? At least you don't seem to dismiss the idea as trash anymore (as you laughed at me when I mentioned SME and SSVE use in Apple chips).
 
So, if accurate - vs the e-core, they went from 2LS to 2L+2S, kept the branch predictors and caches mostly in place, went for wider decode, removed a simple ALU, and bumped some structure sizes?

Perf/clock is pretty similar between the M5 e-core and the m-core - if these diagrams and SPEC runs are accurate, are we sure the M-core isn't just a new-gen e-core microarchitecture with a more aggressive physical design and a higher target clock (and resulting higher core power)?

Don't trust the missing ALU that much. It doesn't make any sense, and taken literally we're now missing IDIV...
I think it's more like he wasn't sure if he should have 5 vs 6 ALUs so he drew 4 he was confident of, then got distracted or didn't have time to run whatever tests he is using.

We also don't know the logic and timing within Apple. While they have been remarkably good at thinking up new ideas to improve the CPU and the SoC, they're not perfect. The Intel idea of an area-optimized (rather than energy-optimized) core might not have occurred to them. But once they saw it, and appreciated its significance, they immediately considered "how can we replicate this?" The fastest solution may have been either an overclocked (and no other real changes) E-core, or minor rule-of-thumb tweaks to an E-core. Point is, not enough time to think of the optimal way to design this sort of solution, just get something out in the most convenient way possible. But with more time available (presumably the M6 generation) they'll have been able to run more simulations and ablations, and to gather more data on what gives the highest reward for the lowest area among various alternatives (e.g. is adding more load/store a better deal than increasing the size of the branch prediction tables?)

It's also possible (we never know these things!) that they had a better-balanced M core lined up, say, around 2024, but targeting TSMC N2, and N2 was delayed relative to expectations.
So for whatever reason the business side felt it made more sense to roll out the M-core strategy now (using a sub-optimal, quickly faked-up M core) rather than go with an M4-style Max and Pro setup.

For example (and again we never know these things) maybe M-core will be part of the A20?
And given how much money the A series generate, Apple wanted to soften up the ground before the A20 design, get the tech press comfortable with the idea of M cores and their strengths. So (for example) if the A20 looks like 1S+3M+2E, it won't look like a rollback (only 1 fast core compared to the A19!) but like a natural extension of a design that we know, from 6 months of M5 Pro/Max experience works well, and that M cores are pretty damn good, not a consolation prize?
 
It may well be that the clock rate disparity between the E and M cores, which accounts for most of the performance (and power) increase, is due to different FinFlex transistors used in those blocks. The clock gain is bang on what TSMC claims as the difference between the highest and lowest FinFlex level.

The one thing that doesn't really fit this narrative is the curves posted before, which seem to show the M cores being as fast or slightly faster than the E cores at most power levels, with the added band once you move beyond the E core's power (and clock) ceiling. Maybe that's just a newer/better E core, but they only just got a newer/better E core that was ~28% faster at the same power, and pulling that off again despite using higher-power/leakier transistors would be quite a feat.

It would mean there was not one but two fairly major upgrades to the "E core" in a very short time. If true that would be surprising but I suppose not impossible, if that core has had Apple's best architects focusing on it to leverage it into the M core. In that case we should expect the E core for A20/A20P will get another massive bump in performance at the same power level, between the "further tweaks" and what it gains from N2. If so I don't think anyone will complain that A20 doesn't have M cores, or still has "only" four E cores.
 

Apple's been executing pretty well on the e-core side for a few generations now. Wouldn't be too shocking if they made another significant leap, especially if they were willing to trade some area efficiency for a higher clock.
 
That would be some progress if there were two generations of e-cores released so quickly; but we'd also be able to discern whether that were true, since the base M5 has e-cores and was released one month after the A19 series.

Would just have to determine the relative incremental performance difference of the M-core to the expected E-core and see if it matches.
 

If we're to believe this, there's precedent - it claims M3 Max used a later e-core rev (what they call third-generation Sawtooth) than M3 and M3 Pro did (second-generation Sawtooth.) It seems entirely possible to me that the M-core is just a later e-core variant, microarchitecturally (plus whatever physical design changes are allowing it to run relatively efficiently at high clocks.)
 
It may well be that the clock rate disparity between the E and M cores, which accounts for most of the performance (and power) increase, is due to different FinFlex transistors used in those blocks. The clock gain is bang on what TSMC claims as the difference between the highest and lowest FinFlex level.
The M core has ~2% higher PPC than the E-core, but it's being tested at a significantly higher frequency, so it wouldn't surprise me if, iso-frequency, the M core had a high single-digit to low double-digit PPC advantage.
But yeah, M4 E-cores were 2-1, very possible for that to remain the same on the M5 while the M cores move to 2-2. P-cores are almost certainly 3-2, like last gen.
The one thing that doesn't really fit this narrative is the curves posted before, which seem to show the M cores being as fast or slightly faster than the E cores at most power levels, with the added band once you move beyond the E core's power (and clock) ceiling. Maybe that's just a newer/better E core, but they only just got a newer/better E core that was ~28% faster at the same power, and pulling that off again despite using higher-power/leakier transistors would be quite a feat.
From what I've seen in the past, there's no way to directly control what frequency the cores run at on Apple silicon. At least on iPhones; but even for Macs, other than the M1, Geekerwan didn't seem to be able to get power curves, only one specific point on the curve at Fmax. Other Chinese reviewers have iOS results with two points, one at Fmax and another at very low power, which I'm assuming forcing the phone into low power/battery mode might produce.
Could be completely speaking out of line here, so if anyone knows better, please do tell me.

I don't think the M-core's perf/watt curve is especially impressive either, though. The E-core has better perf/watt than the M-core for 50% of the M-core's curve, and the M-cores look like they get outclassed by the S-cores for an even greater share of their power curve. Though I suppose this is a better look than ARM's ultra cores, which get better perf/watt than the premium cores for pretty much all of the premium cores' power curve.
 
It may well be that the clock rate disparity between the E and M cores, which accounts for most of the performance (and power) increase, is due to different FinFlex transistors used in those blocks. The clock gain is bang on what TSMC claims as the difference between the highest and lowest FinFlex level.

The one thing that doesn't really fit this narrative is the curves posted before, which seem to show the M cores being as fast or slightly faster than the E cores at most power levels, with the added band once you move beyond the E core's power (and clock) ceiling. Maybe that's just a newer/better E core, but they only just got a newer/better E core that was ~28% faster at the same power, and pulling that off again despite using higher-power/leakier transistors would be quite a feat.

It would mean there was not one but two fairly major upgrades to the "E core" in a very short time. If true that would be surprising but I suppose not impossible, if that core has had Apple's best architects focusing on it to leverage it into the M core. In that case we should expect the E core for A20/A20P will get another massive bump in performance at the same power level, between the "further tweaks" and what it gains from N2. If so I don't think anyone will complain that A20 doesn't have M cores, or still has "only" four E cores.
I don't know how much we can trust the curve. Was it actually generated from multiple points?
Or from two points (highest and lowest frequency, then fit a cube root)?
 
Was just looking a bit further into Notebookcheck's info. It looks like they have provided a Cinebench 2024 multi-core result for that 16-inch M5 Max: 2437. Aligns with PCMag's result.
They said they’re formulating the review at the moment, so I assume that is where this data point comes from. No power draw data yet, but should be around the 102W of the 18-core M5 Pro in the 16in MBP.
 

Honestly I think Apple will eventually overtake Nvidia simply because the 5090 is a static target and there's no way they're gonna waste valuable A16 wafers on desktop GPUs when they can make so so so much more using those wafers for AI. Or RTX 6000 series cards will be in such short supply at such high prices they might as well not exist lol
 
Finally tried the MacBook Neo in person. It’s going to do fine for most basic users. Trackpad feel is fine but different, very similar to the older mechanical Mac trackpads but not quite the same. Can’t really test speakers in-store, but the sound was a bit odd coming out of the sides. I prefer the smaller and curvier form factor over the Air, even though the weight is the same.

Speed is fine for basic mainstream use but I did notice occasional pauses on one but not all of the machines. I checked the memory usage of four different machines, and every single one of them had at least 1 GB of swap when sitting there idle. This persisted even when I killed all the active apps. With multiple apps open (but with very little content) they’d sometimes be over 2 GB swap.
 
macOS is very aggressive with swap. I've got 32 GB on this machine, 8 GB of swap, and 12 GB free.

Where I think macOS is most aggressive is with holding inactive Safari tabs in swap. Cache doesn't really do the job for JS-reliant pages, whereas swap can be reloaded quickly, and going back to the network for data would be really slow. So I think even when there's plenty of RAM, macOS just dumps inactive tabs into swap. I don't think Safari is the only place where this happens either; I suspect it does the same with Mail and some other apps. These are carryovers from iOS.

I think this is one reason why macOS performs better than Windows on low RAM (because who doesn't have a browser open?), and also why Safari runs so much better than Chrome: Apple can't do that with Chrome tabs.

So I wouldn't consider the presence of some swap to be indicative of memory pressure. I think it's just how macOS sort of 'cheats' inactive processes, and frankly it's a pretty good cheat: all modern OSes have put a lot of work into blasting swap in quickly and resuming processes, so why not just leverage that instead of a whole separate caching scheme?
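For anyone who wants to check actual swap usage rather than eyeball Activity Monitor: on macOS, `sysctl vm.swapusage` reports the swap totals. The parser below runs against a captured sample line (the numbers are illustrative) so the sketch is self-contained; on a real Mac you'd feed it the output of `subprocess.run(["sysctl", "vm.swapusage"], ...)`.

```python
import re

# Sample `sysctl vm.swapusage` output (values illustrative).
sample = "vm.swapusage: total = 2048.00M  used = 1181.25M  free = 866.75M  (encrypted)"

def parse_swapusage(line: str) -> dict:
    """Extract swap totals (in MB) from `sysctl vm.swapusage` output."""
    return {key: float(val)
            for key, val in re.findall(r"(\w+) = ([\d.]+)M", line)}

swap = parse_swapusage(sample)
print(swap)  # {'total': 2048.0, 'used': 1181.25, 'free': 866.75}
```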
 
My 24 GB M4 Mac never has swap. Literally zero. Safari is my main browser. My 16 GB M1 would get some swap over time and once it hit around 1 GB or so, I’d get occasional small pauses. However, the pauses I encountered on the Neo, although few, seemed a bit longer than the ones I got on the M1.

The pauses weren’t a deal killer by any means, but were noticeable for someone coming from a 24 GB M4.
 
My iMac Pro (32GB) has 12GB of swap right now. And I'm not doing anything especially aggressive.

My point is not some sort of pissing contest, it's that the standard methodology for this ("I'll look at apparent memory usage and make vague claims") is highly unsatisfactory! EVERY OS, for obvious reasons, maximizes its DRAM footprint. And EVERY OS, for obvious reasons, given the prevalence of SSDs, is more willing to slightly err on the side of over-enthusiastic allocation, just because a small amount of paging, given SSDs, is no big deal.
A large swap file MAY mean you have a large working set.

But it may ALSO mean that, for example, you've left an app in the background for three weeks, and the OS has sensibly concluded that all relevant modified pages should be paged out, so that the DRAM can be used by a more active app.
One difference between Eug and me may be that I open many apps, and leave them running perpetually – because why not, the system handles this appropriately. But other people either like to close apps after they use them, or simply don't open many apps.
Each of these parties will see a very different swap file size – with zero implications for actual usability and performance/stuttering.

I don't have a great solution (apart from, at the very least, every time this issue is raised, specifying memory PRESSURE rather than memory ALLOCATION or swap file size).
I just want to point out that the naive internet discussion I see (even by the well intentioned) is close to useless.
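The pressure-vs-allocation distinction, made concrete: a toy heuristic in which swap size is deliberately ignored and only sustained paging activity signals pressure. The thresholds are made up and this is not Apple's formula (which isn't public; macOS exposes a real measure via the `memory_pressure` tool and the kernel's compressor/pager statistics).

```python
# Toy heuristic: pressure is about paging *activity*, not swap *size*.
def looks_under_pressure(swap_used_gb: float,
                         pageins_per_sec: float,
                         pageouts_per_sec: float) -> bool:
    # swap_used_gb is intentionally unused: a large but idle swap file
    # says nothing about current memory pressure. Thresholds are made up.
    return pageins_per_sec > 100 and pageouts_per_sec > 100

# 12 GB of swap but an idle pager: fine.
print(looks_under_pressure(12.0, 2, 0))     # False
# 1 GB of swap with constant churn: this is where stutters come from.
print(looks_under_pressure(1.0, 500, 300))  # True
```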
 