
Discussion Apple Silicon SoC thread


Eug

Lifer
M1
5 nm
Unified memory architecture - LPDDR4X
16 billion transistors

8-core CPU

4 high-performance cores
192 KB instruction cache
128 KB data cache
Shared 12 MB L2 cache

4 high-efficiency cores
128 KB instruction cache
64 KB data cache
Shared 4 MB L2 cache
(Apple claims the 4 high-efficiency cores alone perform like a dual-core Intel MacBook Air)

8-core iGPU (but there is a 7-core variant, likely with one inactive core)
128 execution units
Up to 24576 concurrent threads
2.6 Teraflops
82 Gigatexels/s
41 gigapixels/s

16-core neural engine
Secure Enclave
USB 4

Products:
$999 ($899 edu) 13" MacBook Air (fanless) - 18 hour video playback battery life
$699 Mac mini (with fan)
$1299 ($1199 edu) 13" MacBook Pro (with fan) - 20 hour video playback battery life

Memory options 8 GB and 16 GB. No 32 GB option (unless you go Intel).

It should be noted that the M1 chip in these three Macs is the same (aside from GPU core count). Basically, Apple is taking the same approach with these chips as it does with the iPhones and iPads: just one SKU (excluding the X variants), which is the same across all iDevices (aside from occasional slight clock speed differences).

EDIT:


M1 Pro 8-core CPU (6+2), 14-core GPU
M1 Pro 10-core CPU (8+2), 14-core GPU
M1 Pro 10-core CPU (8+2), 16-core GPU
M1 Max 10-core CPU (8+2), 24-core GPU
M1 Max 10-core CPU (8+2), 32-core GPU

M1 Pro and M1 Max discussion here:


M1 Ultra discussion here:


M2 discussion here:


M2
Second-generation 5 nm
Unified memory architecture - LPDDR5, up to 24 GB and 100 GB/s
20 billion transistors

8-core CPU

4 high-performance cores
192 KB instruction cache
128 KB data cache
Shared 16 MB L2 cache

4 high-efficiency cores
128 KB instruction cache
64 KB data cache
Shared 4 MB L2 cache

10-core iGPU (but there is an 8-core variant)
3.6 Teraflops

16-core neural engine
Secure Enclave
USB 4

Hardware acceleration for 8K h.264, h.265 (HEVC), and ProRes

M3 Family discussion here:


M4 Family discussion here:


M5 Family discussion here:

 
basically if any single part screws up you are screwed
Yes, everything has to be balanced, and that's very difficult to achieve and can be difficult to measure in advance. Note this applies to all parts of a design, though mistakes in the memory hierarchy tend to strike harder.

I think he's the real GW (this perhaps means he's done painting his house 😀).
 
Yes, everything has to be balanced, and that's very difficult to achieve and can be difficult to measure in advance. Note this applies to all parts of a design, though mistakes in the memory hierarchy tend to strike harder.
we have seen a practical example of that for 2 generations in an HVM product 🤣 🤣
 
Yes, everything has to be balanced, and that's very difficult to achieve and can be difficult to measure in advance. Note this applies to all parts of a design, though mistakes in the memory hierarchy tend to strike harder.
This is what struck me when looking at uarch diagrams of Apple (and also Qualcomm Oryon) CPU designs. They appear to be very well 'balanced'.

This really shows in the PPA of those cores. Apple's P-cores have similar area to ARM's Cortex X, while having superior performance and efficiency.

Oryon Prime cores have similar peak performance to Apple's P-cores, while being about a gen behind in IPC/performance-per-watt, but the core is only 3/4 the size.
 


"Apple ANE Successfully Reverse-Engineered! Is the 38 TOPS Performance Just a Numbers Game?

I just came across a hardcore open-source project by the blogger maderix: he reverse-engineered Apple’s private APIs, bypassed CoreML, and managed to run neural network training directly on the Apple Neural Engine (ANE)!

Wait — what exactly is ANE?
The ANE is the neural network accelerator inside Apple silicon chips. On the M4, it currently features 16 compute cores, and Apple officially claims 38 TOPS of performance. However, it has always been a black box: you can only access it through the CoreML framework. There are no public APIs, no documentation, no ISA — nothing.

So this guy basically peeled away the CoreML layer. Using reverse-engineering techniques (such as dyld_info scanning and method swizzling to intercept CoreML calls), he reconstructed the entire compilation and execution pipeline. Most importantly, he figured out the in-memory compilation path, allowing MIL (similar to NVIDIA’s PTX) to be compiled directly into ANE binaries in memory. This potentially makes training large models on ANE much more feasible.

During the reverse-engineering process, several explosive findings emerged:

First, ANE is fundamentally a convolution engine, not a matrix multiplication engine. If you rewrite the same computation as a convolution, throughput can increase by up to 3×. Apple’s own ml-ane-transformers reference implementation hints at this pattern, but they’ve never stated it explicitly.

Second, ANE appears to contain roughly 32MB of on-chip SRAM. This was inferred from performance cliffs observed during matrix multiplication scaling tests.

Third, a single operator can only achieve about 30% of ANE’s peak performance. That’s because the 16 ANE cores are organized in a pipeline. If you submit only one operation, most cores remain idle. To fully utilize the hardware, you need to chain together 16–64 operations in a single computation graph submission. That way, different cores can process different pipeline stages simultaneously, pushing utilization up to around 94%.

Finally — and perhaps most controversially — the “38 TOPS” figure may be a numbers game. The author ran identical operations in FP16 and INT8 and observed identical throughput. The conclusion: when executing INT8 workloads, ANE likely dequantizes them to FP16 internally before computation. Apple’s “38 TOPS INT8” claim may simply be 19 TFLOPS FP16 multiplied by two — essentially a marketing figure. The real peak performance appears to be 19 TFLOPS FP16.

Another interesting detail: ANE features hardware-level power gating. When idle, its power consumption is truly 0 mW — not low-power standby, but completely powered off with zero leakage. That level of power management is seriously impressive and extremely mobile-friendly.

Of course, beyond the performance claims, the reverse-engineering process itself is highly educational. The two blog posts are packed with technical depth — far more than I can summarize here. If you’re interested, I highly recommend reading the original article, “inside-the-m4-apple-neural-engine.” This is just a brief introduction to spark your curiosity."
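
The "matmul as convolution" claim in the quote is easy to sanity-check on the CPU, since it's just a mathematical identity: a matrix multiply A (M×K) · B (K×N) is the same thing as a 1×1 convolution over M "pixels" with K input channels and N output channels. A minimal NumPy sketch of the identity (nothing ANE-specific here):

```python
import numpy as np

# A matmul A @ B, with A of shape (M, K) and B of shape (K, N),
# expressed as a 1x1 convolution: A's rows become "pixels",
# K the input channels, N the output channels.
M, K, N = 8, 16, 4
rng = np.random.default_rng(0)
A = rng.standard_normal((M, K))
B = rng.standard_normal((K, N))

# Reference result via plain matrix multiplication.
ref = A @ B

# 1x1 "convolution": input laid out as (H=M, W=1, C_in=K),
# weights as (C_out=N, C_in=K, kH=1, kW=1). With a 1x1 kernel the
# convolution reduces to a per-pixel dot product over channels.
x = A.reshape(M, 1, K)            # (H, W, C_in)
w = B.T.reshape(N, K, 1, 1)       # (C_out, C_in, kH, kW)
out = np.einsum('hwc,ocij->hwo', x, w).reshape(M, N)

assert np.allclose(ref, out)
```

The interesting part of the claim isn't the identity itself but that ANE's convolution datapath reportedly runs the conv formulation up to 3× faster than the naive matmul graph.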
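
The single-operator utilization point is essentially pipeline fill/drain overhead. A toy first-order model (illustrative only: the ~30% and ~94% figures above come from the author's measurements, which a simple fill/drain formula won't reproduce exactly, since each "op" itself streams many tiles through the pipe):

```python
# Toy pipeline-utilization model: with S pipeline stages and N ops
# submitted back-to-back, completing all ops takes N + S - 1 ticks,
# so average stage utilization is N / (N + S - 1).
def utilization(num_ops, stages=16):
    return num_ops / (num_ops + stages - 1)

for n in (1, 16, 64):
    print(f"{n:3d} ops -> {utilization(n):.0%} utilization")
```

The shape matches the quoted observation: a lone op leaves most of a deep pipeline idle, and chaining tens of ops per graph submission amortizes the fill/drain cost away.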
 

iPhone 17e
A19 (4-core GPU)
N1+C1X
Starting storage at 256GB
MagSafe
Ceramic Shield 2 with improved anti-reflection

iPad Air
M4 (8-core CPU with 3 P-cores + 5 E-cores and a 9-core GPU)
N1+C1X
 
The M4 iPad Air got 12 GB RAM. That surprised me, especially since my M4 iPad Pro only has 8 GB. This means that M4 has variants with 8 GB, 12 GB, 16 GB, 24 GB, and 32 GB. I wonder just how many of those base RAM M4 iPad Pros actually have 12 GB RAM (with 4 GB inactive). All of them, or just some?

The base 256 GB storage in the 17e also surprised me. The 17e getting MagSafe was no surprise though, but then again that didn't really affect my kid with the 16e, since a $10 MagSafe case was all that was necessary to get the magnetic mount. (The charging speed is slower on the 16e, but that hasn't been a real world issue since my kid almost always just charges overnight anyways.)
 


"Apple ANE Successfully Reverse-Engineered! Is the 38 TOPS Performance Just a Numbers Game? ..."

"Explosive findings" only if you never bothered to read my PDFs, where I explained the convolution engine in greater detail, along with the pipelining.
I also explain how the doubled INT8 fits with the FP16 MACs, and if the authors read that, they might have a better idea of how to tap into that doubled INT8 performance. (I suspect it requires using the FP16 datapath to load both INT8s side by side then execute what looks like a SIMD INT8[2] operation.)
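
The packing idea suggested here is the classic SWAR trick: two INT8 values sit side by side in each 16-bit lane, and one 16-bit operation acts on both sub-lanes at once, with a mask step suppressing the carry between them. A toy NumPy sketch of that idea (purely illustrative, not Apple's actual ANE datapath):

```python
import numpy as np

def pack2(lo, hi):
    """Pack two int8 arrays into one uint16 lane each (hi byte | lo byte)."""
    return (hi.astype(np.uint8).astype(np.uint16) << 8) | lo.astype(np.uint8)

def unpack2(p):
    """Split uint16 lanes back into (lo, hi) int8 arrays."""
    lo = (p & 0xFF).astype(np.uint8).view(np.int8)
    hi = (p >> 8).astype(np.uint8).view(np.int8)
    return lo, hi

def add2(x, y):
    """Bytewise add on packed lanes, suppressing the carry between the
    two INT8 sub-lanes (classic SWAR carry-isolation trick)."""
    return ((x & 0x7F7F) + (y & 0x7F7F)) ^ ((x ^ y) & 0x8080)

lo1 = np.array([10, -50], dtype=np.int8); hi1 = np.array([3, 100], dtype=np.int8)
lo2 = np.array([5, 60], dtype=np.int8);   hi2 = np.array([-4, 27], dtype=np.int8)

lo_sum, hi_sum = unpack2(add2(pack2(lo1, hi1), pack2(lo2, hi2)))
assert np.array_equal(lo_sum, lo1 + lo2)   # [15, 10]
assert np.array_equal(hi_sum, hi1 + hi2)   # [-1, 127]
```

If the ANE works this way, the doubled INT8 rate would only show up when the compiler (or a reverse-engineered path) actually emits the packed SIMD INT8[2] form, which would explain why naive INT8 submissions measure at FP16 speed.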
 
The M4 iPad Air is an odd product. 3 P-cores, 5 E-cores, 9 GPU cores - a dog's breakfast of bins.

Wonder if that's it? Since Apple doesn't advertise specs like SLC size when they announce products, they could bin there as well and ship with 3/4 of the SLC of a "regular" M4, and no one would know unless someone benchmarks where the knees of the memory latency graph are located.
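
Finding those latency-curve knees is a standard pointer-chasing measurement: walk a randomly permuted cyclic list (to defeat prefetchers) at growing working-set sizes and watch per-access time step up at each cache boundary. A rough Python sketch of the method (interpreter overhead dominates here, so real probes such as lmbench's lat_mem_rd do this in C, but the shape of the technique is the same):

```python
import time
import numpy as np

def chase_ns(num_elems, iters=200_000):
    """Average ns per dependent load over a random single-cycle walk."""
    rng = np.random.default_rng(1)
    perm = rng.permutation(num_elems)
    next_idx = np.empty(num_elems, dtype=np.int64)
    next_idx[perm[:-1]] = perm[1:]   # one cycle visiting every element
    next_idx[perm[-1]] = perm[0]
    i = 0
    t0 = time.perf_counter()
    for _ in range(iters):
        i = next_idx[i]              # each load depends on the previous one
    return (time.perf_counter() - t0) / iters * 1e9

# Sweep working-set sizes; latency steps up as the set spills out of
# each cache level -- those steps are the "knees".
for kib in (32, 256, 2048, 16384, 65536):
    print(f"{kib:>6} KiB: {chase_ns(kib * 1024 // 8):6.1f} ns/access")
```

On an M4 with a binned SLC, the knee between the last cache plateau and DRAM latency would show up at roughly 3/4 of the working-set size seen on a "regular" M4.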
 
So even the iPhone 17e has 256 GB base storage.

So I doubt the low-cost MacBook will have a 128 GB option.

They should bump the MacBook Air's base up to 512 GB, and the MacBook Pro's to 1 TB.

The RAM/NAND specs for the products announced this week were probably set in stone before DRAM/NAND pricing started to get crazy last fall. I wouldn't expect them to increase base configs on products coming out this fall just because of what they did with the products announced this week. And not just this fall: I bet they hold the line on the current DRAM/NAND configs vs. the previous versions they're replacing in most products until the bubble bursts.

The fact that Apple customers have been so willing to pay for DRAM and NAND upgrades, and that Apple has charged far more than those upgrades actually cost, helps them out. It still hurts their margin (i.e., instead of making 90% profit when a user goes to a higher NAND tier of iPhone, maybe they'll only make 70%), but if holding the line on DRAM/NAND configs for the next couple of years causes more customers to get those upgrades over time, that will make up a lot of the margin they'd lose by holding firm on list prices for the base config.

It sounds like Apple plans to hold the line on pricing as much as possible. They can use the fact that everyone else will be under FAR greater margin pressure, which will force competitors to raise their prices (and drop entry-level products entirely), as a way for Apple's products to gain market share.

When they add expensive new technology, like OLED displays on the MacBook Pro, they might use that as an opportunity to increase prices, just as they did when they made that switch with the iPhone. But whatever roadmap they may have had for bringing OLED displays to less expensive MacBooks down the line is probably on hold until the memory market stabilizes.
 