Discussion Apple Silicon SoC thread


Eug

Lifer
Mar 11, 2000
23,587
1,001
126
M1
5 nm
Unified memory architecture - LPDDR4X
16 billion transistors

8-core CPU

4 high-performance cores
192 KB instruction cache
128 KB data cache
Shared 12 MB L2 cache

4 high-efficiency cores
128 KB instruction cache
64 KB data cache
Shared 4 MB L2 cache
(Apple claims the 4 high-efficiency cores alone perform like a dual-core Intel MacBook Air)

8-core iGPU (but there is a 7-core variant, likely with one inactive core)
128 execution units
Up to 24576 concurrent threads
2.6 Teraflops
82 Gigatexels/s
41 gigapixels/s

16-core neural engine
Secure Enclave
USB 4

Products:
$999 ($899 edu) 13" MacBook Air (fanless) - 18 hour video playback battery life
$699 Mac mini (with fan)
$1299 ($1199 edu) 13" MacBook Pro (with fan) - 20 hour video playback battery life

Memory options 8 GB and 16 GB. No 32 GB option (unless you go Intel).

It should be noted that the M1 chip in these three Macs is the same (aside from GPU core count). Basically, Apple is taking the same approach with these chips as it does with the iPhones and iPads: just one SKU (excluding the X variants), which is the same across all iDevices (aside from occasional slight clock-speed differences).

EDIT:


M1 Pro 8-core CPU (6+2), 14-core GPU
M1 Pro 10-core CPU (8+2), 14-core GPU
M1 Pro 10-core CPU (8+2), 16-core GPU
M1 Max 10-core CPU (8+2), 24-core GPU
M1 Max 10-core CPU (8+2), 32-core GPU

M1 Pro and M1 Max discussion here:


M1 Ultra discussion here:


M2 discussion here:


M2
Second-generation 5 nm
Unified memory architecture - LPDDR5, up to 24 GB and 100 GB/s
20 billion transistors

8-core CPU

4 high-performance cores
192 KB instruction cache
128 KB data cache
Shared 16 MB L2 cache

4 high-efficiency cores
128 KB instruction cache
64 KB data cache
Shared 4 MB L2 cache

10-core iGPU (but there is an 8-core variant)
3.6 Teraflops

16-core neural engine
Secure Enclave
USB 4

Hardware acceleration for 8K H.264, HEVC, and ProRes

M3 Family discussion here:

 

okoroezenwa

Junior Member
Dec 22, 2020
24
12
51
Sorry, I meant the M3 family. The M3 Ultra is the only one left (assuming no higher tier is surprisingly dropped) and rumours have pointed to a WWDC 24 reveal. If A20 is the basis of the next M line, that means there’ll be no new Macs until 2026, which seems absurd to me.
 

soresu

Platinum Member
Dec 19, 2014
2,665
1,865
136
Are IPC gains dead?
Short of a paradigm changing ISA and/or µArch I would say so.

Maybe not dead now, but give it 10-15 years for the current crop of ultra-wide CPU core µArchs to be played out, leaving no more than extremely high-hanging, small fruit for improvement.

The prevailing thought seems to be that a return to specialisation vs generalisation in compute is the only way to keep getting more perf by just throwing silicon at it.
 

FlameTail

Platinum Member
Dec 15, 2021
2,356
1,274
106
Is your speculation that A20 and M4 will be the next synced architecture? That makes no sense given the rumoured M3 timeline.
Sorry, my bad.

I meant to say A19, not A20.

I assure you, I was not high.

Of course, A18/M4 will be the next synced architecture.

I meant to say A19 because from early leaks, it seems A18 keeps the 2P+4E configuration, and thus it's useless to speculate for A18.
 

FlameTail

Platinum Member
Dec 15, 2021
2,356
1,274
106
Short of a paradigm changing ISA and/or µArch I would say so.

Maybe not dead now, but give it 10-15 years for the current crop of ultra-wide CPU core µArchs to be played out, leaving no more than extremely high-hanging, small fruit for improvement.

The prevailing thought seems to be that a return to specialisation vs generalisation in compute is the only way to keep getting more perf by just throwing silicon at it.
Zen 5 is said to be getting a quantum leap of 40% performance improvement (much of it from IPC). It seems it's the ARM players who are struggling?
 

soresu

Platinum Member
Dec 19, 2014
2,665
1,865
136
Zen 5 is said to be getting a quantum leap of 40% performance improvement (much of it from IPC). It seems it's the ARM players who are struggling?
Zen 5 is only now widening its core beyond the 4-wide design of Zen 1, though.

They still have a lot of open road to explore before they are tapped out.
 

naukkis

Senior member
Jun 5, 2002
706
578
136
Sounds intriguing. On that note, why don't compilers automatically write SIMD code for loops where it's "obvious" that the task can be parallelized? Or how about generating executable code with its own virtual machine that analyzes the code as it runs. There will be an overhead for small data inputs but if a large input is given, the VM will "see" that it's taking too long to execute and thus, it pauses the execution of the code, parallelizes the task so the loop runs in parallel as multiple threads and then resumes from where it left off.

Many parts of loops aren't vectorizable but can be unrolled. Compilers do unroll loops to extract parallelism, but compile-time unrolling is more limited than runtime unrolling. Hardware loop unrolling is a pretty complicated scheme, but it has been known for ages - and today's hardware already has loop caches, which are halfway to fully unrolling loops into independent execution domains. That Apple side-channel vulnerability tells us something about where hardware is today - there is a separate execution machine running the code and prefetching data from memory to cache based on the possible pointer locations it finds. This isn't multithreading like you suppose - if execution can be split into multiple threads, the programmer should do it, and multithreading isn't a trivial problem to solve even at programming time. It's a totally different approach than simple loop unrolling.
 

poke01

Senior member
Mar 8, 2022
741
725
106

It seems the M4 family will be based on islands around Ireland. With M3, Apple used islands around Spain.

If Apple doesn’t go the chop route and makes a unique chip for each M4 tier, I would expect a new code name for each chip.

As of this post we only know the code name of the M4 base chip. The Pro and Max are not yet known.

—————————
A18 is called Tahiti and of course I am obligated to say this quote:

“who lives in Tahiti?.. Tahiti-ins I guess”
 

Mopetar

Diamond Member
Jan 31, 2011
7,848
6,014
136
There's something that might give good results from very wide cores that isn't yet utilized - like hardware loop unrolling. Complex to do - but when done, it makes it possible to run every iteration of a loop in its own hardware, making good use of very wide execution hardware. Though proper ISA support would make implementing that kind of parallelism much easier.

Vector instructions are even better in those cases. Apple supports NEON instructions in their SoCs, but they don't talk about their vector processing capabilities all that much.

Loop unrolling needs a front end capable of issuing a larger number of instructions that come from executing multiple loop iterations at once. SIMD might be faster even with fewer execution ports because it doesn't get bound up by the front end and if you have a large enough vector size the memory accesses will be better as well.

I'm surprised that Apple hasn't added some kind of SMT yet. That's probably one of the best ways to keep a wider architecture busy without requiring too many additional transistors.
 

naukkis

Senior member
Jun 5, 2002
706
578
136
Vector instructions are even better in those cases. Apple supports NEON instructions in their SoCs, but they don't talk about their vector processing capabilities all that much.

Loop unrolling needs a front end capable of issuing a larger number of instructions that come from executing multiple loop iterations at once. SIMD might be faster even with fewer execution ports because it doesn't get bound up by the front end and if you have a large enough vector size the memory accesses will be better as well.

I'm surprised that Apple hasn't added some kind of SMT yet. That's probably one of the best ways to keep a wider architecture busy without requiring too many additional transistors.

This specifically was a solution for utilizing wider cores. Vectorization (SIMD) works only when there are no dependencies between data - basically, dependencies have to be resolved at compile time. With loop unrolling it's also possible to resolve dependencies (calculate or predict variables from runtime data) and execute multiple loop iterations in parallel.
 

FlameTail

Platinum Member
Dec 15, 2021
2,356
1,274
106
I'm surprised that Apple hasn't added some kind of SMT yet. That's probably one of the best ways to keep a wider architecture busy without requiring too many additional transistors.
Apple doesn't need to. Their cores are OoO execution monsters, and OoO ensures good utilisation of resources.
 

Mopetar

Diamond Member
Jan 31, 2011
7,848
6,014
136
Apple doesn't need to. Their cores are OoO execution monsters, and OoO ensures good utilisation of resources.

Every major CPU for general purpose use is OoO these days. Maybe Apple had a larger buffer than some other chips, but for some workloads the performance will be bound by whatever type of execution port is completely used. If it's something that's load/store heavy, that means ALUs that aren't getting any use. SMT makes it easier to avoid those type of situations because some other thread could use those ALUs in this hypothetical situation.

I think Apple could do a better job than AMD/Intel because they write the operating system as well. Maybe it's just not something they ever thought to add for their phones where it's mainly just a single app executing at a time and anything in the background being restricted in terms of what it can do, but now that they're designing desktop class chips as well, there's a lot more reason to consider an implementation.
 

Doug S

Platinum Member
Feb 8, 2020
2,269
3,521
136
Every major CPU for general purpose use is OoO these days. Maybe Apple had a larger buffer than some other chips, but for some workloads the performance will be bound by whatever type of execution port is completely used. If it's something that's load/store heavy, that means ALUs that aren't getting any use. SMT makes it easier to avoid those type of situations because some other thread could use those ALUs in this hypothetical situation.

I think Apple could do a better job than AMD/Intel because they write the operating system as well. Maybe it's just not something they ever thought to add for their phones where it's mainly just a single app executing at a time and anything in the background being restricted in terms of what it can do, but now that they're designing desktop class chips as well, there's a lot more reason to consider an implementation.

SMT is probably pointless in a world where you have big and little cores. How would it benefit Apple to add a second thread to its P cores when they have very capable E cores just sitting there, sipping power? That's where you want that thread to run. They aren't designing servers, they don't care as much about maximizing throughput for unlimited MT code as Intel/AMD do.

Isn't Intel dropping HT in its next gen CPUs? Right after they added their own capable E cores? I don't think that's a coincidence.
 

FlameTail

Platinum Member
Dec 15, 2021
2,356
1,274
106
2020: A14 = N5
2021: A15 = N5P
2022: A16 = N4
2023: A17 = N3B
2024: A18 = N3E
2025: A19 = N3P
2026: A20 = N2+BSPDN
2027: A21 = N2P
2028: A22 = A14
2029: A23 = A14P

[Speculation]
 

richardskrad

Member
Jun 28, 2022
52
47
51
Holy, the M3 is already scoring 1,000 points higher in GB6 single-thread than the M1. My M1 Air is just as snappy as the day I bought it, and it's crazy that the M3 is that much faster while retaining Apple's efficiency lead. People give Tim Cook crap for a lot of things, but you can't deny that under his watch Apple has completely changed the laptop game. AMD and Intel still haven't caught up 4+ years later.
 

Mopetar

Diamond Member
Jan 31, 2011
7,848
6,014
136
SMT doesn't use up nearly as much space as a separate e-core. Obviously having a dedicated core is better than having to share one, but there are some processes that could get by without their own dedicated e-core just fine.

You do make fair points regarding Apple not caring about the server market where SMT shines. There are other reasons that they might not want to include it as having multiple threads can pollute caches or create security concerns.

Personally I think Intel is acting foolishly if they really are abandoning Hyperthreading going forward.
 
Jul 27, 2020
16,339
10,351
106
Personally I think Intel is acting foolishly if they really are abandoning Hyperthreading going forward.
Yeah, if they are so concerned about security or performance degrading for certain applications, they should push Microsoft to let the user create affinity profiles so the affected applications don't get put on HT virtual cores. I mean, they already made Microsoft include support for their crappy Thread Director that can't handle AVX-512 code.
 

naukkis

Senior member
Jun 5, 2002
706
578
136
Yeah, if they are so concerned about security or performance degrading for certain applications, they should push Microsoft to let the user create affinity profiles so the affected applications don't get put on HT virtual cores.

You do know that what you proposed amounts to disabling HT. Splitting each core into two virtual cores = HT on; one thread per core = HT off. HT can of course be "disabled" by parking one core from each core pair, but to get 100% single-thread performance HT has to be disabled entirely, since some hardware resources are split in half between threads whenever HT is enabled. Gamers should disable HT for best performance when the CPU has enough threads for the current game without it.
 
Jul 27, 2020
16,339
10,351
106
I think if Intel were serious about it, they could devise some form of SMT that avoids sharing resources when the second HT thread isn't in use, or that can be disabled virtually. It shows Intel's hypocrisy that they went to the trouble of creating Thread Director for proper E-core utilization when they could've put the same effort into ensuring proper HT utilization (use it when it's beneficial, prevent it when it's not). It's like they just gave up and said, we're gonna keep on adding E-cores! Yeah, OK, fine. Give us 32 or 64 E-cores then!

Maybe that's too much? OK, how about giving us 16 Skymont E-cores plus 32 additional shrunken Tremont cores!
 

naukkis

Senior member
Jun 5, 2002
706
578
136
I think if Intel were serious about it, they could devise some form of SMT that avoids sharing resources when the second HT thread isn't in use, or that can be disabled virtually. It shows Intel's hypocrisy that they went to the trouble of creating Thread Director for proper E-core utilization when they could've put the same effort into ensuring proper HT utilization (use it when it's beneficial, prevent it when it's not). It's like they just gave up and said, we're gonna keep on adding E-cores! Yeah, OK, fine. Give us 32 or 64 E-cores then!

Maybe that's too much? OK, how about giving us 16 Skymont E-cores plus 32 additional shrunken Tremont cores!

In a hybrid CPU configuration, the big cores are there for the best per-thread performance. If they still want to utilize SMT, the right cores to put it on are the E-cores - splitting a slow core's performance in half for the best n-thread throughput while the big cores still maintain good 1-thread performance. Having SMT on their fast cores is just a stupid thing to do.