Discussion Apple Silicon SoC thread


Eug

Lifer
Mar 11, 2000
23,587
1,001
126
M1
5 nm
Unified memory architecture - LPDDR4X
16 billion transistors

8-core CPU

4 high-performance cores
192 KB instruction cache
128 KB data cache
Shared 12 MB L2 cache

4 high-efficiency cores
128 KB instruction cache
64 KB data cache
Shared 4 MB L2 cache
(Apple claims the 4 high-efficiency cores alone perform like a dual-core Intel MacBook Air)

8-core iGPU (but there is a 7-core variant, likely with one inactive core)
128 execution units
Up to 24576 concurrent threads
2.6 Teraflops
82 Gigatexels/s
41 gigapixels/s

16-core neural engine
Secure Enclave
USB 4

Products:
$999 ($899 edu) 13" MacBook Air (fanless) - 18 hour video playback battery life
$699 Mac mini (with fan)
$1299 ($1199 edu) 13" MacBook Pro (with fan) - 20 hour video playback battery life

Memory options 8 GB and 16 GB. No 32 GB option (unless you go Intel).

It should be noted that the M1 chip in these three Macs is the same (aside from GPU core count). Basically, Apple is taking the same approach with these chips as it does with the iPhones and iPads: just one SKU (excluding the X variants), which is the same across all iDevices (aside from occasional slight clock-speed differences).

EDIT:


M1 Pro 8-core CPU (6+2), 14-core GPU
M1 Pro 10-core CPU (8+2), 14-core GPU
M1 Pro 10-core CPU (8+2), 16-core GPU
M1 Max 10-core CPU (8+2), 24-core GPU
M1 Max 10-core CPU (8+2), 32-core GPU

M1 Pro and M1 Max discussion here:


M1 Ultra discussion here:


M2 discussion here:


M2
Second-generation 5 nm
Unified memory architecture - LPDDR5, up to 24 GB and 100 GB/s
20 billion transistors

8-core CPU

4 high-performance cores
192 KB instruction cache
128 KB data cache
Shared 16 MB L2 cache

4 high-efficiency cores
128 KB instruction cache
64 KB data cache
Shared 4 MB L2 cache

10-core iGPU (but there is an 8-core variant)
3.6 Teraflops

16-core neural engine
Secure Enclave
USB 4

Hardware acceleration for 8K H.264, HEVC, and ProRes

M3 Family discussion here:

 

okoroezenwa

Junior Member
Dec 22, 2020
24
12
51
Sorry, I meant the M3 family. The M3 Ultra is the only one left (assuming no higher tier is surprisingly dropped) and rumours have pointed to a WWDC 24 reveal. If A20 is the basis of the next M line, that means there’ll be no new Macs until 2026, which seems absurd to me.
 

soresu

Platinum Member
Dec 19, 2014
2,665
1,865
136
Are IPC gains dead?
Short of a paradigm changing ISA and/or µArch I would say so.

Maybe not dead now, but give it 10-15 years for the current crop of ultra-wide CPU core µArchs to be played out, leaving no more than extremely high-hanging, small fruit for improvement.

The prevailing thought seems to be that a return to specialisation vs generalisation in compute is the only way to keep getting more perf by just throwing silicon at it.
 

FlameTail

Platinum Member
Dec 15, 2021
2,356
1,274
106
Is your speculation that A20 and M4 will be the next synced architecture? That makes no sense given the rumoured M3 timeline.
Sorry, my bad.

I meant to say A19, not A20.

I assure you, I was not high.

Of course, A18/M4 will be the next synced architecture.

I meant to say A19 because from early leaks, it seems A18 keeps the 2P+4E configuration, and thus it's useless to speculate for A18.
 

FlameTail

Platinum Member
Dec 15, 2021
2,356
1,274
106
Short of a paradigm changing ISA and/or µArch I would say so.

Maybe not dead now, but give it 10-15 years for the current crop of ultra-wide CPU core µArchs to be played out, leaving no more than extremely high-hanging, small fruit for improvement.

The prevailing thought seems to be that a return to specialisation vs generalisation in compute is the only way to keep getting more perf by just throwing silicon at it.
Zen 5 is said to be getting a quantum leap of 40% performance improvement (much of it from IPC). It seems it's the ARM players who are struggling?
 

soresu

Platinum Member
Dec 19, 2014
2,665
1,865
136
Zen 5 is said to be getting a quantum leap of 40% performance improvement (much of it from IPC). It seems it's the ARM players who are struggling?
Zen 5 is only now widening its core beyond the 4-wide design of Zen 1, though.

They still have a lot of open road to explore before they are tapped out.
 

naukkis

Senior member
Jun 5, 2002
706
578
136
Sounds intriguing. On that note, why don't compilers automatically write SIMD code for loops where it's "obvious" that the task can be parallelized? Or how about generating executable code with its own virtual machine that analyzes the code as it runs. There will be an overhead for small data inputs but if a large input is given, the VM will "see" that it's taking too long to execute and thus, it pauses the execution of the code, parallelizes the task so the loop runs in parallel as multiple threads and then resumes from where it left off.

Many parts of loops aren't vectorizable but can be unrolled. Compilers do unroll loops to extract parallelism, but compile-time unrolling is more limited than runtime unrolling. Hardware loop unrolling is a pretty complicated scheme, but it has been known for ages - and today's hardware already has loop caches, which are halfway to fully unrolling loops into independent execution domains. That Apple side-channel vulnerability tells us something about where hardware is today - there is a separate execution machine running the code and prefetching data from memory to cache based on the possible pointer locations it finds. This isn't multithreading like you suppose - if execution can be split into multiple threads, the programmer should do it, and multithreading isn't a trivial problem to solve even at programming time. It's a totally different approach than simple loop unrolling.
 

poke01

Senior member
Mar 8, 2022
741
725
106

It seems the M4 family will be based on islands around Ireland. With M3, Apple used islands around Spain.

If Apple doesn’t go the chop route and makes a unique chip for each M4 tier, I would expect a new code name for each chip.

As of this post we only know the code name of the M4 base chip. The Pro and Max are not yet known.

—————————
A18 is called Tahiti and of course I am obligated to say this quote:

“who lives in Tahiti?.. Tahiti-ins I guess”
 

Mopetar

Diamond Member
Jan 31, 2011
7,848
6,014
136
There's something that might give good results from very wide cores that isn't yet utilized - like hardware loop unrolling. Complex to do - but when done, it makes it possible to run every iteration of a loop in its own hardware, making good use of very wide execution hardware. Though proper ISA support would make implementing that kind of parallelism much easier.

Vector instructions are even better in those cases. Apple supports NEON instructions in their SoCs, but they don't talk about their vector processing capabilities all that much.

Loop unrolling needs a front end capable of issuing a larger number of instructions that come from executing multiple loop iterations at once. SIMD might be faster even with fewer execution ports because it doesn't get bound up by the front end and if you have a large enough vector size the memory accesses will be better as well.

I'm surprised that Apple hasn't added some kind of SMT yet. That's probably one of the best ways to keep a wider architecture busy without requiring too many additional transistors.
 

naukkis

Senior member
Jun 5, 2002
706
578
136
Vector instructions are even better in those cases. Apple supports NEON instructions in their SoCs, but they don't talk about their vector processing capabilities all that much.

Loop unrolling needs a front end capable of issuing a larger number of instructions that come from executing multiple loop iterations at once. SIMD might be faster even with fewer execution ports because it doesn't get bound up by the front end and if you have a large enough vector size the memory accesses will be better as well.

I'm surprised that Apple hasn't added some kind of SMT yet. That's probably one of the best ways to keep a wider architecture busy without requiring too many additional transistors.

This specifically was a solution for utilizing wider cores. Vectorization (SIMD) works only when there are no dependencies between data - basically, dependencies have to be resolved at compile time. With loop unrolling it's also possible to resolve dependencies (calculate or predict variables from runtime data) and execute multiple loop iterations in parallel.
 

FlameTail

Platinum Member
Dec 15, 2021
2,356
1,274
106
I'm surprised that Apple hasn't added some kind of SMT yet. That's probably one of the best ways to keep a wider architecture busy without requiring too many additional transistors.
Apple doesn't need to. Their cores are OoO execution monsters, and OoO ensures good utilisation of resources.
 

Mopetar

Diamond Member
Jan 31, 2011
7,848
6,014
136
Apple doesn't need to. Their cores are OoO execution monsters, and OoO ensures good utilisation of resources.

Every major CPU for general purpose use is OoO these days. Maybe Apple had a larger buffer than some other chips, but for some workloads the performance will be bound by whatever type of execution port is completely used. If it's something that's load/store heavy, that means ALUs that aren't getting any use. SMT makes it easier to avoid those type of situations because some other thread could use those ALUs in this hypothetical situation.

I think Apple could do a better job than AMD/Intel because they write the operating system as well. Maybe it's just not something they ever thought to add for their phones where it's mainly just a single app executing at a time and anything in the background being restricted in terms of what it can do, but now that they're designing desktop class chips as well, there's a lot more reason to consider an implementation.
 

Doug S

Platinum Member
Feb 8, 2020
2,269
3,521
136
Every major CPU for general purpose use is OoO these days. Maybe Apple had a larger buffer than some other chips, but for some workloads the performance will be bound by whatever type of execution port is completely used. If it's something that's load/store heavy, that means ALUs that aren't getting any use. SMT makes it easier to avoid those type of situations because some other thread could use those ALUs in this hypothetical situation.

I think Apple could do a better job than AMD/Intel because they write the operating system as well. Maybe it's just not something they ever thought to add for their phones where it's mainly just a single app executing at a time and anything in the background being restricted in terms of what it can do, but now that they're designing desktop class chips as well, there's a lot more reason to consider an implementation.

SMT is probably pointless in a world where you have big and little cores. How would it benefit Apple to add a second thread to its P cores when they have very capable E cores just sitting there, sipping power? That's where you want that thread to run. They aren't designing servers, they don't care as much about maximizing throughput for unlimited MT code as Intel/AMD do.

Isn't Intel dropping HT in its next gen CPUs? Right after they added their own capable E cores? I don't think that's a coincidence.
 

FlameTail

Platinum Member
Dec 15, 2021
2,356
1,274
106
2020: A14 = N5
2021: A15 = N5P
2022: A16 = N4
2023: A17 = N3B
2024: A18 = N3E
2025: A19 = N3P
2026: A20 = N2+BSPDN
2027: A21 = N2P
2028: A22 = A14
2029: A23 = A14P

[Speculation]
 

richardskrad

Member
Jun 28, 2022
52
47
51
Holy, the M3 is already scoring 1,000 points higher in GB6 single-thread than the M1. My M1 Air is just as snappy as the day I bought it, and it's crazy that the M3 is that much faster while retaining Apple's efficiency lead. People give Tim Cook crap for a lot of things, but you can't deny that under his watch Apple has completely changed the laptop game. AMD and Intel still haven't caught up 4+ years later.
 

Mopetar

Diamond Member
Jan 31, 2011
7,848
6,014
136
SMT doesn't use up nearly as much space as a separate e-core. Obviously having a dedicated core is better than having to share one, but there are some processes that could get by without their own dedicated e-core just fine.

You do make fair points regarding Apple not caring about the server market where SMT shines. There are other reasons that they might not want to include it as having multiple threads can pollute caches or create security concerns.

Personally I think Intel is acting foolishly if they really are abandoning Hyperthreading going forward.
 
Jul 27, 2020
16,339
10,351
106
Personally I think Intel is acting foolishly if they really are abandoning Hyperthreading going forward.
Yeah, if they are so concerned about security or performance degrading for certain applications, they should push Microsoft to let the user create affinity profiles so the affected applications don't get put on HT virtual cores. I mean, they already made Microsoft include support for their crappy Thread Director that can't handle AVX-512 code.
 

naukkis

Senior member
Jun 5, 2002
706
578
136
Yeah, if they are so concerned about security or performance degrading for certain applications, they should push Microsoft to let the user create affinity profiles so the affected applications don't get put on HT virtual cores.

You do know that what you proposed amounts to disabling HT. Splitting each core into two virtual cores = HT on; one thread per core = HT off. HT can of course be "disabled" by parking one core from each core pair, but to get 100% single-thread performance HT has to be disabled entirely, since some hardware resources are split in half between threads whenever HT is enabled. Gamers should disable HT for best performance when the CPU has enough threads for the current game without it.
 
Jul 27, 2020
16,339
10,351
106
I think if Intel were serious about it, they could devise some form of SMT that avoids sharing resources when the second HT thread isn't in use, or that can be disabled virtually. It shows Intel's hypocrisy that they went to the trouble of creating Thread Director for proper E-core utilization when they could've put the same effort into ensuring proper HT utilization (use it when it's beneficial, prevent it when it's not). It's like they just gave up and said, we're gonna keep on adding E-cores! Yeah, OK, fine. Give us 32 or 64 E-cores then!

Maybe that's too much? OK, how about giving us 16 Skymont E-cores plus 32 additional shrunken Tremont cores!
 

naukkis

Senior member
Jun 5, 2002
706
578
136
I think if Intel were serious about it, they could devise some form of SMT that avoids sharing resources when the second HT thread isn't in use, or that can be disabled virtually. It shows Intel's hypocrisy that they went to the trouble of creating Thread Director for proper E-core utilization when they could've put the same effort into ensuring proper HT utilization (use it when it's beneficial, prevent it when it's not). It's like they just gave up and said, we're gonna keep on adding E-cores! Yeah, OK, fine. Give us 32 or 64 E-cores then!

Maybe that's too much? OK, how about giving us 16 Skymont E-cores plus 32 additional shrunken Tremont cores!

In a hybrid CPU configuration, the big cores are there for the best per-thread performance. If they still want to utilize SMT, the right cores to put it on are the E-cores - splitting a slow core's performance in half for the best n-thread throughput while the big cores still maintain good 1-thread performance. Having SMT on their fast cores is just a stupid thing to do.