Discussion Apple Silicon SoC thread


Eug

Lifer
Mar 11, 2000
23,825
1,396
126
M1
5 nm
Unified memory architecture - LPDDR4X
16 billion transistors

8-core CPU

4 high-performance cores
192 KB instruction cache
128 KB data cache
Shared 12 MB L2 cache

4 high-efficiency cores
128 KB instruction cache
64 KB data cache
Shared 4 MB L2 cache
(Apple claims the 4 high-efficiency cores alone perform like a dual-core Intel MacBook Air)

8-core iGPU (but there is a 7-core variant, likely with one inactive core)
128 execution units
Up to 24,576 concurrent threads
2.6 teraflops
82 gigatexels/s
41 gigapixels/s

16-core neural engine
Secure Enclave
USB 4

Products:
$999 ($899 edu) 13" MacBook Air (fanless) - 18 hour video playback battery life
$699 Mac mini (with fan)
$1299 ($1199 edu) 13" MacBook Pro (with fan) - 20 hour video playback battery life

Memory options 8 GB and 16 GB. No 32 GB option (unless you go Intel).

It should be noted that the M1 chip in these three Macs is the same (aside from GPU core count). Basically, Apple is taking the same approach with these chips as it does with the iPhones and iPads: just one SKU (excluding the X variants), which is the same across all iDevices (aside from occasional slight clock speed differences).

EDIT:


M1 Pro 8-core CPU (6+2), 14-core GPU
M1 Pro 10-core CPU (8+2), 14-core GPU
M1 Pro 10-core CPU (8+2), 16-core GPU
M1 Max 10-core CPU (8+2), 24-core GPU
M1 Max 10-core CPU (8+2), 32-core GPU

M1 Pro and M1 Max discussion here:


M1 Ultra discussion here:


M2 discussion here:


M2
Second-generation 5 nm
Unified memory architecture - LPDDR5, up to 24 GB and 100 GB/s
20 billion transistors

8-core CPU

4 high-performance cores
192 KB instruction cache
128 KB data cache
Shared 16 MB L2 cache

4 high-efficiency cores
128 KB instruction cache
64 KB data cache
Shared 4 MB L2 cache

10-core iGPU (but there is an 8-core variant)
3.6 teraflops

16-core neural engine
Secure Enclave
USB 4

Hardware acceleration for 8K h.264, h.265 (HEVC), and ProRes

M3 Family discussion here:


M4 Family discussion here:

 

soresu

Diamond Member
Dec 19, 2014
3,230
2,515
136
Yeah, they are newish in that there are a few notable changes, but the bulk of the architecture is still the same. I hope we see a new ground-up architecture with A18, same as what AMD is doing with Zen 5.
They are already sporting a super-wide core - I don't think going significantly wider à la Zen 5 is going to gain them much at this point.

Beyond a certain width you hit diminishing returns.

Going from 4 -> 6 nets a bigger gain than 6 -> 8, let alone 8 -> 10, even though each step increases the width by the same amount.

Unless they can somehow architect a 13-16 wide µArch without an explosive increase in power draw and area, maybe - but that seems like a stretch.

Unless some big breakthrough in CPU design happens, I think we will see perf hit a hard wall without a drastic change to the underlying hardware devices and materials - perhaps something like antiferromagnetic or photonic logic with topological-insulator-based metal layers.
 
  • Like
Reactions: Apokalupt0

soresu

Diamond Member
Dec 19, 2014
3,230
2,515
136
Not always true. Esp. if they manage to enhance the accompanying blocks.

The point was purely about just throwing silicon at the problem.

The impact of diminishing returns is already starting to be a buzzkill for Apple I imagine.

Given that ARM Ltd's best 4-wide CPU core far outperforms Apple's initial 6-wide design, that point should be kinda obvious by now to anyone paying attention.

On that note, are the A7xx cores still 4 wide?

Anyone got the spec sheet on this? I seem to remember a Google Docs thing floating around some time ago.

If so I wonder if Chaberton/A730 will continue to be 4 wide.
 

Doug S

Platinum Member
Feb 8, 2020
2,784
4,746
136
Not always true. Esp. if they manage to enhance the accompanying blocks.

Doesn't matter; there are always diminishing returns from widening, because not all code has sufficient parallelism. It doesn't help as much to go from 8 to 10 wide if, even under ideal circumstances, the code you're running exceeds 8 instructions that can be issued/retired at once only 10 or 20 percent of the time - but maybe when you went from 6 to 8, it was 20 to 30 percent that could benefit.
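To make that concrete, here's a toy model (my own illustration; the ILP distribution is made up, not measured) of how average IPC grows with issue width when only some cycles expose lots of parallelism:

```c
/* Toy model of diminishing returns from widening (illustration only). */
#include <stdio.h>

int main(void) {
    /* pct[i] = hypothetical fraction of cycles where the running code
       exposes exactly i independent instructions. */
    double pct[13] = {0.00, 0.05, 0.10, 0.15, 0.20, 0.15, 0.12,
                      0.08, 0.06, 0.04, 0.03, 0.01, 0.01};

    for (int width = 4; width <= 12; width += 2) {
        double ipc = 0.0;
        for (int i = 1; i <= 12; i++)
            ipc += pct[i] * (i < width ? i : width); /* issue capped at width */
        printf("width %2d -> average IPC %.2f\n", width, ipc);
    }
    return 0;
}
```

With these numbers, each step from 4 to 12 wide adds less IPC than the last, because fewer and fewer cycles can actually use the extra slots.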
 
  • Like
Reactions: Mopetar and Ajay

naukkis

Senior member
Jun 5, 2002
903
786
136
Doesn't matter; there are always diminishing returns from widening, because not all code has sufficient parallelism. It doesn't help as much to go from 8 to 10 wide if, even under ideal circumstances, the code you're running exceeds 8 instructions that can be issued/retired at once only 10 or 20 percent of the time - but maybe when you went from 6 to 8, it was 20 to 30 percent that could benefit.

There's something that might get good results from very wide cores and isn't yet utilized - hardware loop unrolling. It's complex to do, but when done it makes it possible to run every iteration of a loop on its own hardware, making good use of very wide execution resources. Proper ISA support would make implementing that kind of parallelism much easier, though.
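As a rough software analogue of the idea (a sketch of the concept, not of any real implementation), compare a reduction written as one serial chain versus split into independent chains:

```c
/* Software analogue of what hardware loop unrolling would do on the fly:
   split a reduction into independent dependency chains so a wide core
   can keep several of them in flight at once. */
double sum_rolled(const double *a, int n) {
    double s = 0.0;
    for (int i = 0; i < n; i++)
        s += a[i];                      /* one serial dependency chain */
    return s;
}

double sum_unrolled(const double *a, int n) {
    double s0 = 0, s1 = 0, s2 = 0, s3 = 0;
    int i = 0;
    for (; i + 4 <= n; i += 4) {        /* four independent chains */
        s0 += a[i];     s1 += a[i + 1];
        s2 += a[i + 2]; s3 += a[i + 3];
    }
    for (; i < n; i++) s0 += a[i];      /* leftover elements */
    return (s0 + s1) + (s2 + s3);
}
```

A hardware unroller would in effect perform the second transformation transparently, without the compiler having to commit to a fixed unroll factor.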
 
  • Like
Reactions: soresu
Jul 27, 2020
20,040
13,738
146
There's something that might get good results from very wide cores and isn't yet utilized - hardware loop unrolling. It's complex to do, but when done it makes it possible to run every iteration of a loop on its own hardware, making good use of very wide execution resources. Proper ISA support would make implementing that kind of parallelism much easier, though.
Sounds intriguing. On that note, why don't compilers automatically emit SIMD code for loops where it's "obvious" that the task can be parallelized? Or how about generating executable code with its own virtual machine that analyzes the code as it runs? There would be overhead for small data inputs, but given a large input, the VM would "see" that execution is taking too long, pause it, parallelize the loop across multiple threads, and then resume from where it left off.
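Compilers do auto-vectorize the "obvious" cases; it's aliasing and cross-iteration dependences that usually stop them. A sketch (exact behaviour varies by compiler and flags, e.g. gcc -O3 -fopt-info-vec):

```c
/* Two loops that look similar but differ for the auto-vectorizer. */

/* Easily vectorized: restrict rules out aliasing, and no iteration
   depends on another. */
void scale(float *restrict dst, const float *restrict src, float k, int n) {
    for (int i = 0; i < n; i++)
        dst[i] = src[i] * k;
}

/* Usually not vectorized naively: each iteration depends on the
   previous one. */
void prefix_sum(float *a, int n) {
    for (int i = 1; i < n; i++)
        a[i] += a[i - 1];   /* loop-carried dependency */
}
```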
 

FlameTail

Diamond Member
Dec 15, 2021
3,951
2,376
106
Doesn't matter; there are always diminishing returns from widening, because not all code has sufficient parallelism. It doesn't help as much to go from 8 to 10 wide if, even under ideal circumstances, the code you're running exceeds 8 instructions that can be issued/retired at once only 10 or 20 percent of the time - but maybe when you went from 6 to 8, it was 20 to 30 percent that could benefit.
So if making the core wider will only bring diminishing returns, what are they gonna do!?

Are IPC gains dead?
 

Apokalupt0

Junior Member
Feb 14, 2024
11
10
41
The point was purely about just throwing silicon at the problem.

The impact of diminishing returns is already starting to be a buzzkill for Apple I imagine.

Given that ARM Ltd's best 4-wide CPU core far outperforms Apple's initial 6-wide design, that point should be kinda obvious by now to anyone paying attention.

On that note, are the A7xx cores still 4 wide?

Anyone got the spec sheet on this? I seem to remember a Google Docs thing floating around some time ago.

If so I wonder if Chaberton/A730 will continue to be 4 wide.
The A715 went from 4-wide to 5-wide.
 
  • Like
Reactions: soresu

okoroezenwa

Member
Dec 22, 2020
100
108
86
Sorry, I meant the M3 family. The M3 Ultra is the only one left (assuming no higher tier is surprisingly dropped) and rumours have pointed to a WWDC 24 reveal. If A20 is the basis of the next M line, that means there’ll be no new Macs until 2026, which seems absurd to me.
 

soresu

Diamond Member
Dec 19, 2014
3,230
2,515
136
Are IPC gains dead?
Short of a paradigm-changing ISA and/or µArch, I would say so.

Maybe not dead right now, but give it 10-15 years for the current crop of ultra-wide CPU core µArchs to play out, leaving no more than extremely high-hanging, small fruit for improvement.

The prevailing thought seems to be that a return to specialisation over generalisation in compute is the only way to keep getting more perf by just throwing silicon at it.
 

FlameTail

Diamond Member
Dec 15, 2021
3,951
2,376
106
Is your speculation that A20 and M4 will be the next synced architecture? That makes no sense given the rumoured M3 timeline.
Sorry, my bad.

I meant to say A19, not A20.

I assure you, I was not high.

Of course, A18/M4 will be the next synced architecture.

I meant to say A19 because, from early leaks, it seems A18 keeps the 2P+4E configuration, and thus it's useless to speculate about A18.
 

FlameTail

Diamond Member
Dec 15, 2021
3,951
2,376
106
Short of a paradigm-changing ISA and/or µArch, I would say so.

Maybe not dead right now, but give it 10-15 years for the current crop of ultra-wide CPU core µArchs to play out, leaving no more than extremely high-hanging, small fruit for improvement.

The prevailing thought seems to be that a return to specialisation over generalisation in compute is the only way to keep getting more perf by just throwing silicon at it.
Zen 5 is said to be getting a quantum leap of 40% performance improvement (much of it from IPC). It seems it's the ARM players who are struggling?
 

soresu

Diamond Member
Dec 19, 2014
3,230
2,515
136
Zen 5 is said to be getting a quantum leap of 40% performance improvement (much of it from IPC). It seems it's the ARM players who are struggling?
Zen 5 is only now widening its core from the 4-wide design of Zen 1, though.

They still have a lot of open road to explore before they're tapped out.
 

naukkis

Senior member
Jun 5, 2002
903
786
136
Sounds intriguing. On that note, why don't compilers automatically emit SIMD code for loops where it's "obvious" that the task can be parallelized? Or how about generating executable code with its own virtual machine that analyzes the code as it runs? There would be overhead for small data inputs, but given a large input, the VM would "see" that execution is taking too long, pause it, parallelize the loop across multiple threads, and then resume from where it left off.

Many parts of loops aren't vectorizable but can be unrolled. Compilers do unroll loops to extract parallelism, but compile-time unrolling is more limited than runtime unrolling. Hardware loop unrolling is a pretty complicated scheme, but it has been known for ages - and today's hardware already has loop caches, which are halfway to fully unrolling loops into independent execution domains. That Apple side-channel vulnerability tells us something about where hardware is today: there is a separate execution machine running ahead of the code and prefetching data from memory into cache based on possible pointer locations it finds. This isn't multithreading like you suppose - if execution can be split into multiple threads, the programmer should do it - and multithreading isn't a trivial problem to solve even at programming time. It's a totally different approach from simple loop unrolling.
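A minimal sketch of the kind of loop I mean (the struct and function here are made up for illustration): it can't be vectorized because each address depends on the previous load, yet a run-ahead engine like the one described above could still overlap the per-node work and memory latency:

```c
/* Not vectorizable: the next address comes from the current load, so
   SIMD can't touch the traversal. Run-ahead hardware can still walk
   the chain early and pull nodes into cache. */
struct node { struct node *next; double value; };

double sum_list(const struct node *p) {
    double s = 0.0;
    while (p) {
        s += p->value;   /* per-node work that could overlap across nodes */
        p = p->next;     /* serial pointer chase */
    }
    return s;
}
```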