Discussion Apple Silicon SoC thread

Page 386

Eug

Lifer
Mar 11, 2000
M1
5 nm
Unified memory architecture - LPDDR4X
16 billion transistors

8-core CPU

4 high-performance cores
192 KB instruction cache
128 KB data cache
Shared 12 MB L2 cache

4 high-efficiency cores
128 KB instruction cache
64 KB data cache
Shared 4 MB L2 cache
(Apple claims the 4 high-efficiency cores alone perform like a dual-core Intel MacBook Air)

8-core iGPU (but there is a 7-core variant, likely with one inactive core)
128 execution units
Up to 24576 concurrent threads
2.6 Teraflops
82 Gigatexels/s
41 gigapixels/s
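Here is a quick sanity check of how the GPU figures above fit together. This is a back-of-envelope sketch, not Apple's published breakdown. It assumes each of the 128 execution units has 8 FP32 ALUs and that a fused multiply-add counts as 2 FLOPs. Neither assumption is stated in the spec list.

```python
# Back-of-envelope check of the M1 GPU numbers listed above.
# ASSUMPTIONS (not from the spec list): 8 FP32 ALUs per execution unit,
# and one fused multiply-add (FMA) per ALU per cycle = 2 FLOPs.
GPU_CORES = 8
EXECUTION_UNITS = 128
ALUS_PER_EU = 8            # assumption
FLOPS_PER_ALU_CYCLE = 2    # FMA counted as multiply + add
PEAK_TFLOPS = 2.6
MAX_THREADS = 24576

fp32_lanes = EXECUTION_UNITS * ALUS_PER_EU
# Clock that would be needed to reach the quoted peak with that many lanes.
implied_clock_ghz = PEAK_TFLOPS * 1e12 / (fp32_lanes * FLOPS_PER_ALU_CYCLE) / 1e9
eus_per_core = EXECUTION_UNITS // GPU_CORES
threads_per_core = MAX_THREADS // GPU_CORES

print(f"FP32 lanes:             {fp32_lanes}")
print(f"Implied GPU clock:      {implied_clock_ghz:.2f} GHz")
print(f"Execution units / core: {eus_per_core}")
print(f"Threads / core:         {threads_per_core}")
```

Under those assumptions, the quoted 2.6 TFLOPS implies a GPU clock of roughly 1.27 GHz. At the same clock, the 7-core variant would come in proportionally lower.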

16-core neural engine
Secure Enclave
USB 4

Products:
$999 ($899 edu) 13" MacBook Air (fanless) - 18 hour video playback battery life
$699 Mac mini (with fan)
$1299 ($1199 edu) 13" MacBook Pro (with fan) - 20 hour video playback battery life

Memory options 8 GB and 16 GB. No 32 GB option (unless you go Intel).

It should be noted that the M1 chip in these three Macs is the same (aside from the GPU core count). Basically, Apple is taking the same approach with these chips as it does with iPhones and iPads: just one SKU (excluding the X variants), which is the same across all iDevices (aside from occasional slight clock speed differences).

EDIT:


M1 Pro 8-core CPU (6+2), 14-core GPU
M1 Pro 10-core CPU (8+2), 14-core GPU
M1 Pro 10-core CPU (8+2), 16-core GPU
M1 Max 10-core CPU (8+2), 24-core GPU
M1 Max 10-core CPU (8+2), 32-core GPU

M1 Pro and M1 Max discussion here:


M1 Ultra discussion here:


M2 discussion here:


Second Generation 5 nm
Unified memory architecture - LPDDR5, up to 24 GB and 100 GB/s
20 billion transistors
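The 100 GB/s figure follows from the usual peak-bandwidth formula: transfer rate × bus width in bytes. The sketch below assumes LPDDR5-6400 on a 128-bit bus for M2 and LPDDR4X-4266 on a 128-bit bus for M1. Those memory configurations are not stated in the post.

```python
def peak_bandwidth_gbs(transfer_rate_mts: float, bus_width_bits: int) -> float:
    """Theoretical peak DRAM bandwidth in decimal GB/s.

    transfer_rate_mts: data rate in megatransfers per second (e.g. 6400 for LPDDR5-6400)
    bus_width_bits:    total width of the memory interface
    """
    bytes_per_transfer = bus_width_bits / 8
    return transfer_rate_mts * 1e6 * bytes_per_transfer / 1e9

# ASSUMED memory configurations (not given in the spec list above).
m1 = peak_bandwidth_gbs(4266, 128)   # LPDDR4X-4266
m2 = peak_bandwidth_gbs(6400, 128)   # LPDDR5-6400

print(f"M1: {m1:.1f} GB/s")
print(f"M2: {m2:.1f} GB/s  ({m2 / m1:.2f}x M1)")
```

These are theoretical ceilings. Sustained bandwidth in real workloads is lower.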

8-core CPU

4 high-performance cores
192 KB instruction cache
128 KB data cache
Shared 16 MB L2 cache

4 high-efficiency cores
128 KB instruction cache
64 KB data cache
Shared 4 MB L2 cache

10-core iGPU (but there is an 8-core variant)
3.6 Teraflops

16-core neural engine
Secure Enclave
USB 4

Hardware acceleration for 8K H.264, HEVC, ProRes

M3 Family discussion here:


M4 Family discussion here:

 

moinmoin

Diamond Member
Jun 1, 2017
Yeah there are diminishing returns with more cores in GB6;
There are essentially no returns past 6 cores except in a single subtest. Furthermore, the "MT" score rewards fast cache setups at low core counts, something Apple excels at and improves further with every generation.

GB6's "MT" is a waste of time outside of core limited single task consumer hardware.
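The diminishing returns described above are what Amdahl's law predicts for any workload with a serial portion. Below is a minimal sketch. The parallel fractions are hypothetical and were chosen only to show the shape of the curve. They were not measured from any GB6 subtest.

```python
def amdahl_speedup(parallel_fraction: float, cores: int) -> float:
    """Ideal speedup over one core when only `parallel_fraction` of the work parallelizes."""
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / cores)

CORE_COUNTS = (1, 2, 4, 6, 8, 12, 16, 24)

# Hypothetical parallel fractions: a poorly scaling task vs. a renderer-like one.
for p in (0.60, 0.95):
    speedups = ", ".join(f"{n}c={amdahl_speedup(p, n):.2f}" for n in CORE_COUNTS)
    print(f"p={p:.2f}: {speedups}  (limit {1 / (1 - p):.1f}x)")
```

With p = 0.6, going from 6 to 24 cores raises the speedup only from 2.0x to about 2.35x. A p = 0.95 task keeps gaining. Real subtests also pay synchronization costs, which this idealized formula ignores.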
 

mvprod123

Senior member
Jun 22, 2024
M3 Ultra doesn't do so hot in Geekbench 6 CPU, vs. M4 Max. I guess the real test will be for the GPU.

M3 Ultra:


[Screenshot: M3 Ultra Geekbench 6 CPU result]

M4 Max vs. M3 Ultra:


[Screenshot: M4 Max vs. M3 Ultra Geekbench 6 CPU comparison]
M3 Ultra Geekbench GPU: [screenshot]
 

naukkis

Senior member
Jun 5, 2002
No, it's mostly an uncooperative limited amount of threads single task test not worth the "multi" denomination it keeps getting.


Not "MT" in any case.

Do you see that the MT score is only a bit higher than ST? That's how those workloads scale to MT. For most use cases, MT performance measured by running multiple independent ST threads concurrently won't tell you anything about cooperative MT performance. And because of Amdahl's law, ST performance is a very important part of MT scaling. That cooperative MT scaling is what users actually need: that score represents a chip's MT performance on well-programmed MT workloads, including gaming. GB5 or Cinebench won't tell you anything about that kind of MT performance.
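The point that ST performance feeds into MT scaling can be made concrete with Amdahl's law by also letting per-core speed vary. This is a hypothetical model. The 70% parallel fraction and the 20% single-thread advantage are made-up values for illustration.

```python
def runtime(parallel_fraction: float, cores: int, st_speed: float = 1.0) -> float:
    """Normalized runtime of one cooperative task (1.0 = one baseline core).

    The serial part runs on one core; the parallel part is split evenly
    across `cores`. Both parts get faster with per-core speed `st_speed`.
    """
    serial = (1.0 - parallel_fraction) / st_speed
    parallel = parallel_fraction / (cores * st_speed)
    return serial + parallel

P = 0.70  # hypothetical parallel fraction

few_fast = runtime(P, cores=8, st_speed=1.2)    # 8 cores, each 20% faster
many_slow = runtime(P, cores=16, st_speed=1.0)  # twice the cores, baseline speed

print(f"8 fast cores:  {few_fast:.3f}")
print(f"16 base cores: {many_slow:.3f}")
```

In this model, the eight faster cores finish first. Once the parallel part is spread thin, the serial 30% dominates. An n-copy rate test would instead rank the 16-core part higher (16 vs. 9.6 core-equivalents).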
 

Nothingness

Diamond Member
Jul 3, 2013
No, it's mostly an uncooperative limited amount of threads single task test not worth the "multi" denomination it keeps getting.
The renderer test scales well. Others show limitations (either due to structurally not scaling, like most software, or poor choice/programming of workload).

I agree that using the global MT score doesn't give a good picture of MT scaling. But OTOH, using benchmarks that scale perfectly is just as stupid and misleading.
 

trivik12

Senior member
Jan 26, 2006
M3 Ultra is so mid for the price at which the Mac Studio is selling. Thankfully there are enough Apple loonies to buy this overpriced system, and I am happy as an AAPL stockholder :)
 

oak8292

Member
Sep 14, 2016
M3 Ultra is so mid for the price at which the Mac Studio is selling. Thankfully there are enough Apple loonies to buy this overpriced system, and I am happy as an AAPL stockholder :)
My speculation is that Apple is buying most of these for its server farms. The I/O was improved for DRAM and storage. I'd guess they ‘benchmark’ just fine on the loads Apple wants to run in the cloud.
 

Doug S

Diamond Member
Feb 8, 2020
No, it's mostly an uncooperative limited amount of threads single task test not worth the "multi" denomination it keeps getting.


Not "MT" in any case.

In the real world, loads that increase in performance with every core you throw at them are the exception, not the rule. Once there is any interdependence between threads, there is a point where more cores don't help at all, and in some cases can even hurt.

I think John ought to keep the current MT test, add back the GB5-style test, and label them "cooperative MT" and "parallel MT" respectively. Then people who want CB or SPECrate-type numbers that go up and up until you run out of memory bandwidth (or memory) can get those, while people who care about things like cross-core/cross-cluster/cross-chiplet synchronization overhead can make relevant comparisons.

That said, I'm not sure GB6 is really testing what I'm talking about. I haven't looked into it; I don't pay much attention to that OR to CB-like numbers, since for the most part everything has more than enough cores for the kind of stuff I do.
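One simple way to model "more cores can even hurt" is to add a coordination cost that grows with thread count to the Amdahl runtime. The linear overhead term and all constants below are hypothetical. Real synchronization costs depend on the workload and on core, cluster, and chiplet topology.

```python
def runtime_with_sync(cores: int, serial: float = 0.1, parallel: float = 0.9,
                      sync_cost: float = 0.02) -> float:
    """Normalized runtime: serial part + evenly split parallel part
    + a (hypothetical) coordination cost per additional thread."""
    return serial + parallel / cores + sync_cost * (cores - 1)

best = min(range(1, 65), key=runtime_with_sync)

for n in (1, 4, best, 16, 32, 48):
    print(f"{n:2d} cores: runtime {runtime_with_sync(n):.3f}")
print(f"fastest core count in this model: {best}")
```

With these numbers, the optimum is 7 cores, close to the continuous optimum sqrt(parallel / sync_cost) ≈ 6.7. At 48 cores, the task runs slower than it does on a single core.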
 

naukkis

Senior member
Jun 5, 2002
In the real world, loads that increase in performance with every core you throw at them are the exception, not the rule. Once there is any interdependence between threads, there is a point where more cores don't help at all, and in some cases can even hurt.

I think John ought to keep the current MT test, add back the GB5-style test, and label them "cooperative MT" and "parallel MT" respectively. Then people who want CB or SPECrate-type numbers that go up and up until you run out of memory bandwidth (or memory) can get those, while people who care about things like cross-core/cross-cluster/cross-chiplet synchronization overhead can make relevant comparisons.

That said, I'm not sure GB6 is really testing what I'm talking about. I haven't looked into it; I don't pay much attention to that OR to CB-like numbers, since for the most part everything has more than enough cores for the kind of stuff I do.

Multithreading is actually about getting the best performance out of a process. GB6 has an MT test. GB5 and SPEC don't even try to measure MT performance; they are just throughput tests that measure single-thread performance with n copies run concurrently. MT performance needs fast cores and a fast interconnect between cores, while a single-thread n-rate test only needs cores and memory bandwidth. N-rate tests are totally unrelated to any desktop or mobile workloads and should not be referred to as MT benchmarks. SPEC got it right, but GB5 just outright lies about being an MT benchmark.
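The claim that an n-copy rate test "only needs cores and memory bandwidth" corresponds to a simple throughput model. The score grows linearly with the number of copies until their combined DRAM demand hits the memory system's limit. The sketch below uses hypothetical numbers: 10 GB/s of traffic per copy and 120 GB/s of usable bandwidth.

```python
def rate_score(copies: int, per_copy_gbs: float, dram_gbs: float,
               per_copy_score: float = 1.0) -> float:
    """Aggregate score of `copies` independent single-thread copies.

    Each copy scores `per_copy_score` when it gets all the DRAM traffic it
    wants; once total demand exceeds `dram_gbs`, every copy slows down
    proportionally (a hypothetical, bandwidth-only model).
    """
    if copies <= 0:
        return 0.0
    demand_gbs = copies * per_copy_gbs
    efficiency = min(1.0, dram_gbs / demand_gbs)
    return copies * per_copy_score * efficiency

for n in (4, 8, 12, 16, 24):
    print(f"{n:2d} copies: score {rate_score(n, per_copy_gbs=10, dram_gbs=120):.1f}")
```

Nothing in this model depends on how quickly cores can exchange data. That is exactly the distinction being drawn here between rate tests and cooperative MT.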
 

moinmoin

Diamond Member
Jun 1, 2017
Sorry for going off topic with this line of discussion, I'll stop after this round of responses.

Do you see that the MT score is only a bit higher than ST? That's how those workloads scale to MT. For most use cases, MT performance measured by running multiple independent ST threads concurrently won't tell you anything about cooperative MT performance. And because of Amdahl's law, ST performance is a very important part of MT scaling. That cooperative MT scaling is what users actually need: that score represents a chip's MT performance on well-programmed MT workloads, including gaming. GB5 or Cinebench won't tell you anything about that kind of MT performance.
What you are talking about is the MT capability of software, not hardware. I would wager that the majority of people run benchmarks to see the highest possible performance of their system's hardware, not of some software (which in GB's case is completely opaque to boot). Most will try the software in question if they want to know how it performs; most won't even think of looking at GB6's "MT" for that purpose. And on your PC, do you kill all tasks to run exactly one task? Then GB6's "MT" is perfect for you indeed. For everyone else, traditional hardware MT is much more reflective of how the hardware will perform when many applications run at the same time, whether they are multithreaded or not.

The renderer test scales well.
I referred to that subtest before.
There are essentially no returns past 6 cores except in a single subtest.

Others show limitations (either due to structurally not scaling, like most software, or poor choice/programming of workload).

I agree that using the global MT score doesn't give a good picture of MT scaling. But OTOH, using benchmarks that scale perfectly is just as stupid and misleading.
I think most PC users don't worry about scaling per se; that's a topic for programmers. What PC users worry about is their PCs slowing down when they don't want them to. Now, if somebody buys a PC with many cores, it's to be able to throw more stuff at its CPU. Throwing more stuff at a CPU usually doesn't mean heavier applications (this would get into details for programmers again, many applications being constrained by ST or very limited MT, etc.) but exactly that: more applications at the same time without the PC bogging down.

In the real world, loads that increase in performance with every core you throw at them are the exception, not the rule. Once there is any interdependence between threads, there is a point where more cores don't help at all, and in some cases can even hurt.

I think John ought to keep the current MT test, add back the GB5-style test, and label them "cooperative MT" and "parallel MT" respectively. Then people who want CB or SPECrate-type numbers that go up and up until you run out of memory bandwidth (or memory) can get those, while people who care about things like cross-core/cross-cluster/cross-chiplet synchronization overhead can make relevant comparisons.

That said, I'm not sure GB6 is really testing what I'm talking about. I haven't looked into it; I don't pay much attention to that OR to CB-like numbers, since for the most part everything has more than enough cores for the kind of stuff I do.
Real-world loads are many loads running at the same time, sometimes interfering with each other. It's the exception that exactly one load runs completely uninterrupted for its whole run, but exactly this exception is what all kinds of benchmarks use, as it's the scientific, reproducible clean-room approach. But nobody uses PCs this way.

While all real MT benchmarks so far are bad for different reasons, they mostly do at least one job sufficiently well: using all available multicore performance and thus showing the user a theoretical upper limit for a given system.

GB6's "MT", though, combines the worst of both worlds. It's still the clean-room approach, and it benchmarks some opaque software's multithreading capability (which stops scaling beyond 6 cores except in one subtest) on given hardware, while using the term traditionally used for benchmarking the limits of that hardware. That kind of benchmark has its use (it's excellent for showing the quality of a CPU's cache setup, which, on topic, is what Apple excels at), but it would better be referred to as an extended ST benchmark, in the sense of "single task" + "limited multithreading".

Multithreading is actually about getting the best performance out of a process. GB6 has an MT test. GB5 and SPEC don't even try to measure MT performance; they are just throughput tests that measure single-thread performance with n copies run concurrently. MT performance needs fast cores and a fast interconnect between cores, while a single-thread n-rate test only needs cores and memory bandwidth. N-rate tests are totally unrelated to any desktop or mobile workloads and should not be referred to as MT benchmarks. SPEC got it right, but GB5 just outright lies about being an MT benchmark.
I see it exactly the opposite way. But you show it's mostly a matter of wording:

MT is traditionally used to refer to N-rate tests. And people buying PCs with many cores are usually more interested in the overall throughput possible than in a single task's cooperative performance. For the latter, most would look at ST instead, which is why I would propose treating "cooperative MT" as an extension of ST rather than as a replacement for how MT has commonly been understood up to now.
 

naukkis

Senior member
Jun 5, 2002
MT is traditionally used to refer to N-rate tests. And people buying PCs with many cores are usually more interested in the overall throughput possible than in a single task's cooperative performance. For the latter, most would look at ST instead, which is why I would propose treating "cooperative MT" as an extension of ST rather than as a replacement for how MT has commonly been understood up to now.

In that case, Intel got its hybrid scheme wrong: they would only need one big core for ST and as many small cores as possible for best throughput. But in reality there really aren't stressful plain ST workloads that matter; everything is at least somewhat multithreaded. And process multithreading in most cases only scales up to about 8 threads, because of Amdahl's law.

So the average Joe, when choosing and comparing CPUs, should focus on GB6 MT, not plain ST and certainly not those single-thread n-rate tests. Even with today's games, gaming performance is pretty comparable to GB6 MT results. Users who need n-rate performance basically don't even need benchmarks; they just need to compare core counts and memory bandwidth, the more the better. But 99% of desktop users and about 100% of mobile users do not get more performance from more cores. Showing them benchmarks that suggest otherwise only benefits hardware makers: they can sell more expensive hardware to people, and it won't perform any better in real use cases. In some cases it might actually be the worse option, and to overcome that fact, those hardware makers need to offer "performance utilities" that simply disable badly performing cores so they don't degrade the user experience.
 

Nothingness

Diamond Member
Jul 3, 2013
9800X3D begs to differ ;) unless you claim the 9950X or the newest Intel CPUs are better at gaming than it ;)
That's because some of the GB6 MT tests scale much better than game engines. Half kidding 😉

Anyway, game engines target a limited number of cores and have a potential hard bottleneck outside their control (the GPU and its drivers), so it's unsurprising they don't scale well beyond a point.
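The GPU-bottleneck point can be sketched as a frame-time model. A frame takes as long as the slower of the CPU and GPU work, and the CPU part itself follows Amdahl's law. All millisecond values here are hypothetical.

```python
def fps(cores: int, cpu_serial_ms: float = 4.0, cpu_parallel_ms: float = 12.0,
        gpu_ms: float = 7.0) -> float:
    """Frames per second when CPU and GPU work overlap (pipelined),
    so the slower of the two sets the frame time."""
    cpu_ms = cpu_serial_ms + cpu_parallel_ms / cores
    return 1000.0 / max(cpu_ms, gpu_ms)

for n in (1, 2, 4, 8, 16):
    print(f"{n:2d} cores: {fps(n):6.1f} fps")

# A faster core (smaller serial part) helps only while the CPU is the bottleneck.
print(f"2 cores, 25% less serial time: {fps(2, cpu_serial_ms=3.0):6.1f} fps")
```

Past four cores, this hypothetical game is GPU-bound, so extra cores add nothing. A faster core, for example one with a larger cache, still helps wherever the CPU is the limit.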
 

mikegg

Golden Member
Jan 30, 2010
M3 Ultra is so mid for the price at which the Mac Studio is selling. Thankfully there are enough Apple loonies to buy this overpriced system, and I am happy as an AAPL stockholder :)
M3 Ultra is a bargain for those wanting to run DeepSeek or other large LLMs at home.

Even compared to workstations such as the ones Puget Systems sells, it's a bargain.

A 32-core Threadripper + 512 GB of RAM + RTX 4060 Ti is already $12k.

Meanwhile, an M3 Ultra has a 32-core CPU and an 80-core GPU with 512 GB of 819 GB/s RAM for $9.5k.
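Memory size and bandwidth matter for running large LLMs locally because, during token generation, each step has to stream the active weights from memory. Bandwidth divided by bytes read per token therefore gives a rough upper bound on tokens per second. The sketch below rests on stated assumptions: a DeepSeek-V3/R1-style MoE model with 671B total and about 37B active parameters, 4-bit weights, and KV-cache and other overheads ignored.

```python
# Rough, bandwidth-bound estimate for local LLM decoding on the M3 Ultra config above.
# ASSUMPTIONS: 671e9 total / 37e9 active parameters (DeepSeek-V3/R1 MoE),
# 4-bit quantized weights, no KV-cache or activation traffic, perfect bandwidth use.
MEMORY_GB = 512
BANDWIDTH_GBS = 819
TOTAL_PARAMS = 671e9
ACTIVE_PARAMS = 37e9
BYTES_PER_PARAM = 0.5  # 4-bit weights

weights_gb = TOTAL_PARAMS * BYTES_PER_PARAM / 1e9
fits_in_memory = weights_gb < MEMORY_GB
bytes_per_token_gb = ACTIVE_PARAMS * BYTES_PER_PARAM / 1e9
max_tokens_per_s = BANDWIDTH_GBS / bytes_per_token_gb

print(f"Weights: {weights_gb:.1f} GB (fits in {MEMORY_GB} GB: {fits_in_memory})")
print(f"Upper bound: {max_tokens_per_s:.0f} tokens/s")
```

Real throughput will fall below this ceiling. The point is only that a large pool of fast unified memory is what makes such models practical at all.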

 