Discussion Apple Silicon SoC thread


Eug

Lifer
Mar 11, 2000
23,725
1,263
126
M1
5 nm
Unified memory architecture - LPDDR4X
16 billion transistors

8-core CPU

4 high-performance cores
192 KB instruction cache
128 KB data cache
Shared 12 MB L2 cache

4 high-efficiency cores
128 KB instruction cache
64 KB data cache
Shared 4 MB L2 cache
(Apple claims the 4 high-efficiency cores alone perform like a dual-core Intel MacBook Air)

8-core iGPU (but there is a 7-core variant, likely with one inactive core)
128 execution units
Up to 24576 concurrent threads
2.6 Teraflops
82 Gigatexels/s
41 gigapixels/s

16-core neural engine
Secure Enclave
USB 4

Products:
$999 ($899 edu) 13" MacBook Air (fanless) - 18 hour video playback battery life
$699 Mac mini (with fan)
$1299 ($1199 edu) 13" MacBook Pro (with fan) - 20 hour video playback battery life

Memory options 8 GB and 16 GB. No 32 GB option (unless you go Intel).

It should be noted that the M1 chip in these three Macs is the same (aside from the number of GPU cores). Basically, Apple is taking the same approach with these chips as they do with the iPhones and iPads: just one SKU (excluding the X variants), which is the same across all iDevices (aside from maybe slight clock speed differences occasionally).

EDIT:


M1 Pro 8-core CPU (6+2), 14-core GPU
M1 Pro 10-core CPU (8+2), 14-core GPU
M1 Pro 10-core CPU (8+2), 16-core GPU
M1 Max 10-core CPU (8+2), 24-core GPU
M1 Max 10-core CPU (8+2), 32-core GPU

M1 Pro and M1 Max discussion here:


M1 Ultra discussion here:


M2 discussion here:


Second Generation 5 nm
Unified memory architecture - LPDDR5, up to 24 GB and 100 GB/s
20 billion transistors

8-core CPU

4 high-performance cores
192 KB instruction cache
128 KB data cache
Shared 16 MB L2 cache

4 high-efficiency cores
128 KB instruction cache
64 KB data cache
Shared 4 MB L2 cache

10-core iGPU (but there is an 8-core variant)
3.6 Teraflops

16-core neural engine
Secure Enclave
USB 4

Hardware acceleration for 8K h.264, h.265, ProRes

M3 Family discussion here:


M4 Family discussion here:

 

Nothingness

Platinum Member
Jul 3, 2013
2,660
1,209
136
@Doug S expressed skepticism that SPEC is, at this point, getting anything out of SME with any actually-existing compiler. I agree with him; I seriously doubt it is. (SPECfp might be a little bit, but probably not to any sort of breakthrough degree.)

Feel free to do a run yourself and check. That would be interesting and valuable information.
I did express my doubt too. I think some SPEC tests can benefit from SVE, but I'd be pleasantly surprised if any SME instruction was emitted (and no, SSVE doesn't count as SME).

All people I know who work on HPC code rely on intrinsics no matter the platform (or even generate specialized vector code at runtime). I know it's just a small sample, but still.
 

Doug S

Platinum Member
Feb 8, 2020
2,423
3,919
136
Mx chip platform power for ST has always been more than like 5W though. So you have to keep that in mind. More like 5-9W (idle normalized). The second thing is that Apple is getting more performance, around 25-30% more at 20-22% less power simultaneously. Or just like 30-35% more at the same power. (Minus SME stuff). Still more than an easy node change and frequency boost, but with some modest IPC gains, process gains alone they should be able to narrow the gap.

But yes dude, your freakouts on this were unnecessary. You should freak out if 8 Gen 4 sucks; that’s our tell if Oryon, where they had extra time to do phydes and use N3E, sucks in phones. That and if V2 is like a single-digit IPC improvement. Otherwise, time to chill out.


See I don't see it as a problem if a chip is able to use a lot of power for single thread. If I have a single thread load, and a CPU with a 100 watt TDP, I'd love it if it was able to usefully turbo up enough to run that one core at 100 watts to finish my single thread load as quickly as possible. That's not realistic, I know, but to the extent it is possible I'd like that to happen. When there is more than a single core load then that single core wouldn't be allowed to go as high and that's fine. We're already used to the situation where when you spin up more cores you take a frequency hit, especially in mobile where the total power budgets are much lower.

This sort of behavior wouldn't be useful in a server CPU because you aren't going to see single core only loads on it - if you do you bought the wrong thing. But in a PC and most definitely in a phone, sure single core loads are a real thing and to the extent they can be made faster I'm there for it.

If an Intel or AMD CPU uses a bunch of power in a single core load it's excused, because "that's turbo". Maybe M4 is running in a "turbo" mode of sorts when running GB6 ST. Apple doesn't publish frequency specs, and whether you call the frequency at which it starts running an ST load, until it has to slow down, "turbo" or "standard frequency" doesn't really matter. It amounts to the same thing either way.
 

SpudLobby

Senior member
May 18, 2022
913
617
106
See I don't see it as a problem if a chip is able to use a lot of power for single thread. If I have a single thread load, and a CPU with a 100 watt TDP, I'd love it if it was able to usefully turbo up enough to run that one core at 100 watts to finish my single thread load as quickly as possible. That's not realistic, I know, but to the extent it is possible I'd like that to happen. When there is more than a single core load then that single core wouldn't be allowed to go as high and that's fine. We're already used to the situation where when you spin up more cores you take a frequency hit, especially in mobile where the total power budgets are much lower.

This sort of behavior wouldn't be useful in a server CPU because you aren't going to see single core only loads on it - if you do you bought the wrong thing. But in a PC and most definitely in a phone, sure single core loads are a real thing and to the extent they can be made faster I'm there for it.

If an Intel or AMD CPU uses a bunch of power in a single core load it's excused, because "that's turbo". Maybe M4 is running in a "turbo" mode of sorts when running GB6 ST. Apple doesn't publish frequency specs, and whether you call the frequency at which it starts running an ST load, until it has to slow down, "turbo" or "standard frequency" doesn't really matter. It amounts to the same thing either way.
I am actually fine with where Apple and Qualcomm apparently have it, at 11-13W peak with very very steep slopes and low power floors. My problem is that you start to change phydes and bloat cores when you aim for what Intel/AMD do. I want a bunch of Zen 5Cs in a laptop. I agree though re turbo.
I did express my doubt too. I think some SPEC tests can benefit from SVE, but I'd be pleasantly surprised if any SME instruction was emitted (and no, SSVE doesn't count as SME).

All people I know who work on HPC code rely on intrinsics no matter the platform (or even generate specialized vector code at runtime). I know it's just a small sample, but still.
Agreed.
 

FlameTail

Platinum Member
Dec 15, 2021
2,916
1,652
106
Memory bandwidth of hypothetical Apple M5 Ultra with LPDDR6-10667 and 1536 bit bus.

(10.667 Gbps ÷ 8 bits per byte) × 1536 bits

= 2048.1 GB/s
= ~2 TB/s

That is an insane amount of memory bandwidth.
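
For anyone who wants to replay the arithmetic, here is a minimal Python sketch. The function name `peak_bandwidth_gb_s` is made up for illustration, and the LPDDR6-10667 / 1536-bit figures are the hypothetical ones from this post, not announced specs. It only computes theoretical peak bandwidth and ignores real-world efficiency.

```python
def peak_bandwidth_gb_s(pin_rate_gbps: float, bus_width_bits: int) -> float:
    """Theoretical peak bandwidth in GB/s.

    pin_rate_gbps: data rate per pin in Gb/s (e.g. 10.667 for a -10667 part)
    bus_width_bits: total memory bus width in bits
    Dividing by 8 converts bits to bytes.
    """
    return pin_rate_gbps * bus_width_bits / 8


# Hypothetical M5 Ultra figures from the post (assumption, not a real spec).
m5_ultra = peak_bandwidth_gb_s(10.667, 1536)
print(f"{m5_ultra:.1f} GB/s = {m5_ultra / 1000:.2f} TB/s")
```

With these inputs it lands at roughly 2 TB/s, which is the headline number either way.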
 

SpudLobby

Senior member
May 18, 2022
913
617
106
What a pity Geekerwan seems not to know the difference between power and energy. ;)
Lol.
Dissipating ~60% more power for a ~20% shorter period of time doesn't translate to less energy use.
I was about to say similar and just couldn’t help but leave it alone, people really think they’re clever by mentioning energy 😹. This place is amazing.
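
The power-versus-energy point is easy to check, since energy is just power multiplied by time. A minimal sketch in Python, using the rough ~60% and ~20% figures quoted above (the helper name `relative_energy` is hypothetical):

```python
def relative_energy(power_ratio: float, time_ratio: float) -> float:
    """Energy = power x time, so the two ratios simply multiply."""
    return power_ratio * time_ratio


# ~60% more power for a ~20% shorter run, as quoted above (approximate figures).
ratio = relative_energy(1.60, 0.80)
print(f"energy vs. baseline: {ratio:.2f}x ({(ratio - 1) * 100:+.0f}%)")
```

So under those rough numbers the faster run uses about 28% more energy, not less.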
 

Doug S

Platinum Member
Feb 8, 2020
2,423
3,919
136
Memory bandwidth of hypothetical Apple M5 Ultra with LPDDR6-10667 and 1536 bit bus.

(10.667 Gbps ÷ 8 bits per byte) × 1536 bits

= 2048.1 GB/s
= ~2 TB/s

That is an insane amount of memory bandwidth.

That also happens to be the data transfer rate between the two M1 Max dies in an M1 Ultra. They're gonna need more and/or faster I/Os between the dies when they go to LPDDR6.
 

Eug

Lifer
Mar 11, 2000
23,725
1,263
126
Memory bandwidth of hypothetical Apple M5 Ultra with LPDDR6-10667 and 1536 bit bus.

(10.667 Gbps ÷ 8 bits per byte) × 1536 bits

= 2048.1 GB/s
= ~2 TB/s

That is an insane amount of memory bandwidth.
That also happens to be the data transfer rate between the two M1 Max dies in an M1 Ultra. They're gonna need more and/or faster I/Os between the dies when they go to LPDDR6.
UltraFusion M1 was advertised to be 2.5 TB/s. Same goes for UltraFusion M2.
 

FlameTail

Platinum Member
Dec 15, 2021
2,916
1,652
106
There are 2 bizarre things that happened in the tech world recently, which I still haven't come to terms with:

1. Apple downgrading the memory bus of M3 Pro to 192 bit.

2. Qualcomm disabling Core Boost in the Snapdragon X Plus, and even some Elite SKUs.
 

SteinFG

Senior member
Dec 29, 2021
512
598
106
Apple downgrading the memory bus of M3 Pro to 192 bit.
Nvidia downgrading the memory bus of 4070 to 192 bit.
Corporate wants you to find the difference

The real answer is, they probably felt 150GB/s is enough, and with the introduction of 12GB ram packages, they switched from 32GB (4x8) to 36GB (3x12) on their top-end M3 Pro chip
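
A rough sketch of that trade-off in Python. The `MemoryConfig` class is hypothetical, and two inputs are assumptions that are not stated in this thread: a 64-bit interface per memory package (consistent with 192 bits across 3 packages) and an LPDDR5-6400 transfer rate.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MemoryConfig:
    name: str
    packages: int                # number of DRAM packages
    gb_per_package: int          # capacity of each package in GB
    bits_per_package: int = 64   # ASSUMPTION: 64-bit interface per package
    mt_per_s: int = 6400         # ASSUMPTION: LPDDR5-6400

    @property
    def capacity_gb(self) -> int:
        return self.packages * self.gb_per_package

    @property
    def bus_width_bits(self) -> int:
        return self.packages * self.bits_per_package

    @property
    def bandwidth_gb_s(self) -> float:
        # transfers/s x bus width in bits, divided by 8 bits per byte, MB/s -> GB/s
        return self.bus_width_bits * self.mt_per_s / 8 / 1000


configs = [
    MemoryConfig("4 x 8 GB", packages=4, gb_per_package=8),
    MemoryConfig("3 x 12 GB", packages=3, gb_per_package=12),
]
for c in configs:
    print(f"{c.name}: {c.capacity_gb} GB, {c.bus_width_bits}-bit, {c.bandwidth_gb_s:.1f} GB/s")
```

Under those assumptions, dropping one package trades a quarter of the bus width for a little more capacity, and the three-package layout lands right around the 150GB/s figure.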
 

Doug S

Platinum Member
Feb 8, 2020
2,423
3,919
136
UltraFusion M1 was advertised to be 2.5 TB/s. Same goes for UltraFusion M2.

I thought it was 2 TB/s, but either way they probably need somewhere between double and triple the memory bandwidth given all the transfers between each die's SLC, and perhaps even direct sharing between L2s in the CPU and GPU (I'm not sure how "fused" UltraFusion is).

Not that such an increase would be an issue, but I think M4 is probably the line in the sand for the hope of an "Apple Silicon Extreme". If there's an M4 Ultra but nothing further we should not expect to ever see more than two dies linked together, previous Apple patents showing four die connectivity notwithstanding.
 

Glo.

Diamond Member
Apr 25, 2015
5,753
4,659
136
There are 2 bizarre things that happened in the tech world recently, which I still haven't come to terms with:

1. Apple downgrading the memory bus of M3 Pro to 192 bit.

2. Qualcomm disabling Core Boost in the Snapdragon X Plus, and even some Elite SKUs.
Considering how short-lived the M3 series was, it is understandable why they cut the bus.
 

FlameTail

Platinum Member
Dec 15, 2021
2,916
1,652
106
M3, M3 Pro, M3 Max

All three M3 generation parts, from the lowest end to the highest end, have a 17 TOPS NPU. This is interesting. The NPU does not scale up in size/performance for the higher end parts, like CPU/GPU does. Why not?

Will it remain this way for future generations too?
 

Doug S

Platinum Member
Feb 8, 2020
2,423
3,919
136
It may in future M series now that NPU is an important factor

Yeah I think they didn't really have a whole lot for the NPU to do, particularly in Macs, so it wasn't worth scaling in Pro/Max. It was a solution looking for a problem. While they're still not sure what the problem is, judging from stock market price surges and Microsoft "AI PC" hype the solution is clearly "more TOPS!"

I still think over time we'll see the GPU and NPU merge. When the NPU was this tiny little corner it wasn't worth the bother, but if the NPU grows significantly while the GPU will of course continue to be very important, there is a lot to be gained from combining the two. Yes it means some work since there isn't a 100% overlap in their function, and there will need to be a way of dynamically partitioning so it can tilt from almost entirely GPU to almost entirely NPU depending on the load, but the gains from such a merger are too great to ignore.

We might see it as soon as next year, but probably 2026 unless they've already been planning it for a while.
 

roger_k

Member
Sep 23, 2021
102
215
86
See I don't see it as a problem if a chip is able to use a lot of power for single thread. If I have a single thread load, and a CPU with a 100 watt TDP, I'd love it if it was able to usefully turbo up enough to run that one core at 100 watts to finish my single thread load as quickly as possible. That's not realistic, I know, but to the extent it is possible I'd like that to happen. When there is more than a single core load then that single core wouldn't be allowed to go as high and that's fine. We're already used to the situation where when you spin up more cores you take a frequency hit, especially in mobile where the total power budgets are much lower.

I welcome it when a CPU uses up the entire available thermal range, but this has to stay within reasonable limits. I do not think that 50+ watts for single-threaded operation is reasonable. A desktop might get away with it (even though it's a massive waste), but it is simply unacceptable for laptops. I do not want my power to shoot up beyond the CPU TDP when opening a new browser tab.

I do not see any excuse for contemporary mobile CPUs drawing more power than an enthusiast-class desktop did ten years ago. That is not good engineering, and that is not honest advertising. I like Apple's hardware because their thermal design targets make sense to me. And they can still hit performance records despite using much less power than the competition. This is the path the industry should follow, not the massive power inflation we have witnessed in the last decade. And frankly, TDP should be recognized as a fraudulent advertising practice. The spec sheet should show CPU power consumption across the frequency range, not some number detached from reality that makes the CPU maker look good.
 

name99

Senior member
Sep 11, 2010
427
324
136
Nvidia Blackwell interconnect is 10 TB/s.

So there's definitely room for improvement.
That's to multiple devices.
I believe the Blackwell chip-to-chip link is 1.8 TB/s, so still slightly behind Apple.

(Of course, to be fair, we know NVLink scales, in a way that we believe is true for UltraFusion but have not actually seen; AND NVLink can cover longer distances.)