Discussion Apple Silicon SoC thread

Eug

Lifer
Mar 11, 2000
23,587
1,001
126
M1
5 nm
Unified memory architecture - LPDDR4X
16 billion transistors

8-core CPU

4 high-performance cores
192 KB instruction cache
128 KB data cache
Shared 12 MB L2 cache

4 high-efficiency cores
128 KB instruction cache
64 KB data cache
Shared 4 MB L2 cache
(Apple claims the 4 high-efficiency cores alone perform like a dual-core Intel MacBook Air)

8-core iGPU (but there is a 7-core variant, likely with one inactive core)
128 execution units
Up to 24576 concurrent threads
2.6 Teraflops
82 gigatexels/s
41 gigapixels/s

16-core neural engine
Secure Enclave
USB 4
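
For reference, the 2.6-teraflop figure above lines up with a simple FP32 back-of-the-envelope calculation from the execution-unit count. A minimal sketch, assuming 8 FP32 ALUs per EU and a ~1.28 GHz GPU clock (commonly reported third-party figures, not Apple-published specs):

```python
# Back-of-the-envelope FP32 throughput for the M1 GPU.
# Assumptions (not from Apple's published specs): 8 FP32 ALUs per EU
# and a ~1.278 GHz GPU clock, both commonly reported third-party figures.
execution_units = 128
alus_per_eu = 8                 # assumed
flops_per_alu_per_clock = 2     # one fused multiply-add counts as 2 FLOPs
clock_hz = 1.278e9              # assumed

tflops = execution_units * alus_per_eu * flops_per_alu_per_clock * clock_hz / 1e12
print(f"{tflops:.2f} TFLOPS")   # ~2.62, matching the ~2.6 TFLOPS claim above
```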

Products:
$999 ($899 edu) 13" MacBook Air (fanless) - 18 hour video playback battery life
$699 Mac mini (with fan)
$1299 ($1199 edu) 13" MacBook Pro (with fan) - 20 hour video playback battery life

Memory options 8 GB and 16 GB. No 32 GB option (unless you go Intel).

It should be noted that the M1 chip in these three Macs is the same (aside from GPU core count). Basically, Apple is taking the same approach with these chips as it does with the iPhones and iPads: just one SKU (excluding the X variants), which is the same across all iDevices (aside from occasional slight clock speed differences).

EDIT:


M1 Pro 8-core CPU (6+2), 14-core GPU
M1 Pro 10-core CPU (8+2), 14-core GPU
M1 Pro 10-core CPU (8+2), 16-core GPU
M1 Max 10-core CPU (8+2), 24-core GPU
M1 Max 10-core CPU (8+2), 32-core GPU

M1 Pro and M1 Max discussion here:


M1 Ultra discussion here:


M2 discussion here:


Second Generation 5 nm
Unified memory architecture - LPDDR5, up to 24 GB and 100 GB/s
20 billion transistors

8-core CPU

4 high-performance cores
192 KB instruction cache
128 KB data cache
Shared 16 MB L2 cache

4 high-efficiency cores
128 KB instruction cache
64 KB data cache
Shared 4 MB L2 cache

10-core iGPU (but there is an 8-core variant)
3.6 Teraflops

16-core neural engine
Secure Enclave
USB 4

Hardware acceleration for 8K H.264, H.265 (HEVC), and ProRes
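
The 100 GB/s figure likewise follows from the memory configuration, assuming (as widely reported, not confirmed by Apple's spec sheet) LPDDR5-6400 on a 128-bit bus:

```python
# Peak memory bandwidth for the M2's unified memory.
# Assumptions: LPDDR5-6400 (6400 MT/s) on a 128-bit bus, per third-party reporting.
transfers_per_second = 6400e6   # assumed
bus_width_bytes = 128 / 8       # assumed

bandwidth_gb_s = transfers_per_second * bus_width_bytes / 1e9
print(f"{bandwidth_gb_s:.1f} GB/s")  # 102.4 GB/s, i.e. the "100 GB/s" quoted above
```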

M3 Family discussion here:

 
Last edited:

Mopetar

Diamond Member
Jan 31, 2011
7,848
6,012
136
My comments were not meant to gaslight anyone. I am simply pointing out that Apple picked its OS to be optimized for its hardware, and vice versa. Neither Intel nor AMD enjoys that support from an OS. It's not a bad thing by any means. Add in that they are on smaller processes than Intel or AMD, and that widens the performance gap. But even Apple struggles on Windows when run in emulation, smaller relative process or not. So Apple's performance depends on that symbiotic OS-hardware relationship.

None of that matters for the SPEC results and those are quite good for Apple SoCs and have been for years. Apple is making an excellent CPU and any specialized hardware and OS optimizations that can add performance are just gravy on top.

If either AMD or Intel cared to they could release their own customized *nix distribution that does the same.
 

gdansk

Platinum Member
Feb 8, 2011
2,123
2,629
136
None of that matters for the SPEC results and those are quite good for Apple SoCs and have been for years. Apple is making an excellent CPU and any specialized hardware and OS optimizations that can add performance are just gravy on top.

If either AMD or Intel cared to they could release their own customized *nix distribution that does the same.
Intel does. But outside of the occasional Phoronix article you seldom see Clear Linux benchmarks.
It's like how my 1700X scores 10% better in GB5 under stock Debian than it does in Windows. It seems to me that the operating system can make a significant difference in tests like GB5, which try to measure 'real world' performance.

But macOS is no such operating system. My Intel Macs usually score lower in macOS than the same machine booted into Windows.
 
Last edited:

mikegg

Golden Member
Jan 30, 2010
1,756
411
136
Hardware H.265 acceleration, for example. The neural engine is another bit, yes, though at least for now, it seems like CPU/SoC reviewers point that out before benchmarks to differentiate between inference performance and "general CPU" performance.
The same H.265 acceleration that is available in every modern GPU?

Which benchmark uses the neural engine? I'll wait.
 

mikegg

Golden Member
Jan 30, 2010
1,756
411
136
You wouldn't want someone using that as the end-all, be-all of benchmarks, would you? No? Good. I wouldn't either. And frankly I don't care if Andrei agrees with you or not. Other people whom I respect will still die at the feet of SPEC when SPEC has its own problems. That's why you run a suite of as many applications as you can to gauge performance, keeping in mind what they're actually doing and why they're performing the way they do.
You're missing the point here.

We need to benchmark applications that most people actually use, such as Excel, Slack, web browsing, VSCode, Electron apps, etc. These are the apps people use the most. Geekbench correlates with these workloads better than Cinebench or whatever Phoronix uses.

Then we need application-specific benchmarks so people who actually use those can make an informed decision.

Using 100 different random, non-optimized, obscure benchmarks that 99.99% of the people buying these computers will never use is not helpful.
 

DrMrLordX

Lifer
Apr 27, 2000
21,637
10,856
136
The same H.265 acceleration that is available in every modern GPU?

Yes. And? Complete the thought.

We need to benchmark applications that most people actually use, such as Excel, Slack, web browsing, VSCode, Electron apps, etc. These are the apps people use the most. Geekbench correlates with these workloads better than Cinebench or whatever Phoronix uses.

One of the things that gets lost in those benchmarks is that sometimes the difference in performance between two CPUs/SoCs might amount to a tiny amount of time saved in an everyday workload. It's hard to get excited about the PDF rendering segment of Geekbench if a CPU that is 60% faster can render a PDF in .04 seconds instead of .064 seconds (assuming the storage media allows it). There are reasons why people enjoy MT rendering and CPU encode benchmarks since it not only shows you the fpu grunt available, but it also shows you workloads where significant time savings can be had for people that actually have large CPU render or encode jobs. Mind you, people are offloading more and more of that work to other hardware now, so perhaps those benchmarks are getting a little long in the tooth. But CPU rendering still has its place. Look at OBS for example.
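
To make that concrete, here is a toy calculation of how much wall-clock time a 60% speedup actually saves on a short task versus a long encode (the durations are illustrative examples, not benchmark results):

```python
# How much wall-clock time a 60% faster CPU actually saves,
# for a trivially short task vs. a long encode. Durations are illustrative only.
SPEEDUP = 1.6

for task, baseline_s in [("PDF render", 0.064), ("one-hour video encode", 3600.0)]:
    faster_s = baseline_s / SPEEDUP
    print(f"{task}: {baseline_s:g}s -> {faster_s:g}s "
          f"(saves {baseline_s - faster_s:.3f}s)")
# PDF render saves ~0.024 s; the long encode saves ~1350 s (22.5 minutes).
```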
 
  • Like
Reactions: Ajay

mikegg

Golden Member
Jan 30, 2010
1,756
411
136
Yes. And? Complete the thought.
This dispels the myth that you're trying to spread: that Apple Silicon wins in benchmarks because it has a lot of fixed-function hardware.


Nobody is trying to compare Apple Silicon's dedicated H.265 hardware acceleration to AMD or Intel's CPU-only H.265 encoding/decoding performance.


One of the things that gets lost in those benchmarks is that sometimes the difference in performance between two CPUs/SoCs might amount to a tiny amount of time saved in an everyday workload. It's hard to get excited about the PDF rendering segment of Geekbench if a CPU that is 60% faster can render a PDF in .04 seconds instead of .064 seconds (assuming the storage media allows it). There are reasons why people enjoy MT rendering and CPU encode benchmarks since it not only shows you the fpu grunt available, but it also shows you workloads where significant time savings can be had for people that actually have large CPU render or encode jobs. Mind you, people are offloading more and more of that work to other hardware now, so perhaps those benchmarks are getting a little long in the tooth. But CPU rendering still has its place. Look at OBS for example.
Again, you're missing the point.

First, Apple Silicon excels in the applications people use the most. This is by far the most important use case, and Apple made sure that its chips are optimized for it.

Second, Apple Silicon is beastly in heavy rendering and video editing tasks, as long as the software isn't just x86 code being translated into ARM instructions. AnandTech's SPEC benchmarks proved this: an M1 Pro/Max trades blows with a desktop 5950X.

Third, it's not hard for Apple to do what Intel did, which is to add an obscene number of little cores in order to win benchmarks that only 0.01% of buyers would ever benefit from. It's a waste of the transistor budget. Adding MT performance that almost no one can make use of is the easiest thing a CPU designer can do. Instead of bashing Apple, you should be bashing AMD and Intel for focusing too much on trying to win in Cinebench.

Lastly, we're in a geeky DIY computer forum, so I get it. People here get a hard-on for "raw" benchmarks. But the workloads that people get a hard-on for do not represent the real world. And even then, Apple Silicon wins its fair share of these workloads - as long as the app is actually optimized for it.


 
Last edited:

MadRat

Lifer
Oct 14, 1999
11,910
238
106
Apple was smart to accentuate their memory bandwidth advantage. The strategy obviously has advantages where bandwidth is king, and they look set to continue down this path in the future. Unless their volatile memory starts going tits up across a wide spread of users, they can afford the few replacements for people at the extreme end of usage.

Do the Max and Ultra expand memory bandwidth over the other Mx products?
 

DrMrLordX

Lifer
Apr 27, 2000
21,637
10,856
136
This dispels the myth that you're trying to spread: that Apple Silicon wins in benchmarks because it has a lot of fixed-function hardware.

Au contraire; I'm sure it outperforms dGPU encoding on a power and area basis. You're comparing a logic block in a (relatively) small SoC to a massive dGPU add-in card. And once again I'll point out that Apple has top-down control of the environment in which you would operate that SoC, unlike an AMD or NV dGPU where the vendor has to provide drivers and then hope that application developers will make use of the hardware. At least in NV's case, NV has enough clout to get by, but support for AMD's solutions has not always been great (it's better now). The only reason why anyone on the PC side has hardware that can accelerate video encoding at all is either:

Intel built it into their CPU (quicksync) or
They bought a video card for something else (gaming/rendering) and used the video encoding as a bonus feature.

Anyone who goes out to the market today with dedicated video encoding hardware as an add-in card will get at best a tiny audience using their hardware with a small set of specialized applications utilizing said hardware. Best case for them is that they staff some coders to update FOSS projects like handbrake to support their hardware.

Apple can ship any kind of ASIC they want on their platform with zero resistance and no special effort needed to convince developers to utilize it, so long as they can convince developers to cater to MacOS at all.
 

FlameTail

Platinum Member
Dec 15, 2021
2,356
1,273
106
Will Apple ever match Nvidia's leadership in GPU performance?

Let's look at the M2 series' theoretical GPU performance (measured in TFLOPS)

M2 -> 10 core GPU -> 3.6 TFLOPS
M2 Pro -> 20 core -> 7.2 TFLOPS
M2 Max -> 40 core -> 14.4 TFLOPS
M2 Ultra -> 80 core -> 28.8 TFLOPS

That's still not enough to match the RTX 4090, which has over 80 TFLOPS of GPU compute!

But Apple has an ace up its sleeve. By combining four M2 Max dies, you get the following:

M2 Extreme* -> 160 core -> 57.6 TFLOPS

Such a theoretical 160-core GPU is still far from the RTX 4090 in terms of raw TFLOP numbers, but as any techie knows, it is folly to compare TFLOP numbers between different GPU architectures. And this I know well.

That said, I believe Apple's GPUs provide more performance per TFLOP than Ada Lovelace. I think this is a safe assumption, and there are subtle indicators that this might indeed be the case.

If that is true, then the theoretical M2 Extreme would match or exceed the performance of the RTX 4090, particularly in creator workloads.

Exciting times ahead.


*We do not know what the 4× M2 Max chip will be called, but let's call it M2 Extreme for the present.
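
The figures above just scale the base M2's number linearly with GPU core count; a quick sketch of that arithmetic (the 160-core "M2 Extreme" is the hypothetical part from the footnote, not an announced product):

```python
# Linear scaling of the M2's ~0.36 TFLOPS per GPU core across the lineup.
# "M2 Extreme" (160 cores) is hypothetical, per the footnote above.
tflops_per_core = 3.6 / 10

for name, cores in [("M2", 10), ("M2 Pro", 20), ("M2 Max", 40),
                    ("M2 Ultra", 80), ("M2 Extreme (hypothetical)", 160)]:
    print(f"{name}: {cores * tflops_per_core:.1f} TFLOPS")
```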
 
Last edited:
  • Like
Reactions: scineram

Doug S

Platinum Member
Feb 8, 2020
2,269
3,520
136
Will Apple ever match Nvidia's leadership in GPU performance?

[...]

M2 Extreme* -> 160 core -> 57.6 TFLOPS

[...]

If that is true, then the theoretical M2 Extreme would match or exceed the performance of the RTX 4090, particularly in creator workloads.

Exciting times ahead.


Just based on silicon area (i.e. add up the mm^2 for the GPU blocks on Apple's SoCs and compare with Nvidia GPUs) it seems folly to think they could. Nor, at 450 watts for the GPU alone, would Apple even want to try to match the RTX 4090. They are not in the space heater market.
 

FlameTail

Platinum Member
Dec 15, 2021
2,356
1,273
106
Just based on silicon area (i.e. add up the mm^2 for the GPU blocks on Apple's SoCs and compare with Nvidia GPUs) it seems folly to think they could. Nor, at 450 watts for the GPU alone, would Apple even want to try to match the RTX 4090. They are not in the space heater market.

I doubt it. A single GPU core of the A15 Bionic measures roughly 2.34 mm²; you can work out the die area using the image posted below. 2.34 × 160 = 374 mm², so 160 GPU cores take up ~374 mm² of area. Note that 374 mm² is purely for the GPU cores alone. I don't know the size of a single SM, but the RTX 4090 has 128 SMs.

AD102 measures 608 mm². But of course a significant portion of that area is spent on things other than SMs, such as controllers, caches, encoders, decoders, etc. Furthermore, the RTX 4090 enables only 128 of the 144 SMs physically present on the AD102 die.

The M2 Extreme, on the other hand, would be a true monstrosity, measuring ~2000 mm² with memory bandwidth of up to 2133 GB/s (thanks to the LPDDR5X-8533 that the M2 Max is rumoured to use; the M2 Max would have a 512-bit bus with 533 GB/s of bandwidth, and the M2 Extreme would quadruple this to a 2048-bit bus with 2133 GB/s).
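
Spelling out that rough arithmetic (the per-core area comes from an A15 die shot and the 512-bit LPDDR5X-8533 configuration per M2 Max die is a rumour, so treat everything here as an estimate):

```python
# GPU-core area and memory bandwidth estimates for a hypothetical 4x M2 Max part.
# Inputs are die-shot measurements and rumoured specs, not confirmed figures.
core_area_mm2 = 2.34            # A15 GPU core, measured from a die shot
gpu_cores = 160
print(f"GPU cores only: {core_area_mm2 * gpu_cores:.0f} mm^2")   # ~374 mm^2

bus_width_bits = 512 * 4        # rumoured 512-bit bus per M2 Max die, four dies
transfer_rate_mts = 8533        # rumoured LPDDR5X-8533
bandwidth_gb_s = (bus_width_bits / 8) * transfer_rate_mts / 1000
print(f"Memory bandwidth: {bandwidth_gb_s:.0f} GB/s")
# ~2184 GB/s -- in the same ballpark as the ~2133 GB/s quoted above.
```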
 

Attachment: chrome_screenshot_1669313018783.png (the die shot referenced above)
Last edited:

FlameTail

Platinum Member
Dec 15, 2021
2,356
1,273
106
By the way, this is how I arrived at the M2 Extreme die size estimate of ~2000 mm².

The Apple M1 has a die size of 118 mm², and the M2 is ~140 mm². I remember reading a SemiAnalysis article that actually calculated the M2 die size to be around 150 mm²; I'll edit this comment and post the link if I can find it.

The M1 Max was ~400 mm². If we assume the M1 -> M2 increase carries over to the M2 Max, which is also rumoured to use the same TSMC N5P process, the M2 Max would easily be around ~500 mm², and 4 × 500 mm² = 2000 mm².
 

mikegg

Golden Member
Jan 30, 2010
1,756
411
136
Will Apple ever match Nvidia's leadership in GPU performance?

[...]

M2 Extreme* -> 160 core -> 57.6 TFLOPS

[...]

If that is true, then the theoretical M2 Extreme would match or exceed the performance of the RTX 4090, particularly in creator workloads.

Exciting times ahead.
TFLOPS isn't really a good measure of performance anymore.

Will the M2 "Extreme" match the RTX 4090?

In gaming? No.

In machine learning? Probably not unless the model is supremely RAM starved.

In applications that can utilize the up to 384 GB of unified memory (usable as VRAM) that the M2 "Extreme" could have? Yes. The RTX 4090 only has 24 GB.

In applications that can utilize unified memory between CPU and GPU? Yes.

Most importantly, I doubt Apple cares about matching the RTX 4090. What they do care about is matching laptop Nvidia GPUs.

 
  • Like
Reactions: Viknet

scineram

Senior member
Nov 1, 2020
361
283
106
That's not correct. This is Mac Pro and Studio territory. Apple very much wants those to be competitive with alternatives, otherwise they won't sell enough. And with the amount and bandwidth of memory Apple throws at them they definitely have a chance.
 

Ajay

Lifer
Jan 8, 2001
15,468
7,872
136
That's not correct. This is Mac Pro and Studio territory. Apple very much wants those to be competitive with alternatives, otherwise they won't sell enough. And with the amount and bandwidth of memory Apple throws at them they definitely have a chance.
Not for games though - but for pro video/photo/audio apps. I think that's why comparisons to the RTX 4090 are kind of off. It's a gaming card first, not a professional card like NV's A-series cards or AMD's Radeon Pro series.
 
  • Like
Reactions: scineram

Doug S

Platinum Member
Feb 8, 2020
2,269
3,520
136
Don't additional memory controllers imply added cycles to share access across cores? There has to be a point of diminishing returns.


Across cores on the same SoC? No. The memory controllers hang off the SLC; the cores don't access them directly. For that matter, the cores don't directly access the SLC either; it is the shared L2 in a core complex that does. The L2-to-SLC connection doesn't care about the number of memory controllers per se. Sure, it would be nice for that link to have enough bandwidth to handle the combined bandwidth of the memory controllers, but based on benchmarks of how much memory bandwidth the CPU cores can actually consume, it does not. Why not? Probably because being able to use ALL the memory bandwidth from the CPU cores alone is a niche need, while the GPU is likely to require/use more bandwidth than the CPU in most typical use cases.

Now, when a core in one SoC needs to access memory hanging off a memory controller on a separate SoC in an Ultra or "Extreme" MCM, sure, there's some additional delay. But that has nothing to do with the number of memory controllers, just the sheer wire distance (RC delay...) and the chip-crossing penalty. The latter is mitigated (perhaps virtually eliminated, I'm not sure) by TSMC's very closely coupled die-to-die connection, with Apple using ~10,000 I/Os to connect two dies (and patent filings show 3x as many will be used to connect four dies in the "Extreme" variant for the Mac Pro).

So really, all the number of memory controllers dictates is the number of SoC-to-SoC I/Os Apple is using. That would have been a HUGE problem not so long ago, but TSMC has given them a way to have a previously unheard-of number of cross-chip I/Os without breaking the bank.

No doubt the reason Apple used such a large number of I/Os, versus the strategy of others who use fewer, higher-bandwidth I/Os, is to allow dedicated I/Os for every SLC-to-memory-controller combination. There's also a power benefit: there's a price to be paid for clocking I/Os as fast as Intel and AMD do.
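
For a sense of scale, Apple's published M1 Ultra figures (roughly 2.5 TB/s of die-to-die bandwidth over about 10,000 UltraFusion signals) imply a modest per-signal data rate, which is exactly the many-slow-I/Os-instead-of-a-few-fast-ones trade-off described above:

```python
# Implied per-signal data rate of Apple's UltraFusion die-to-die interconnect,
# using Apple's published M1 Ultra figures (~2.5 TB/s over ~10,000 signals).
total_bandwidth_tb_s = 2.5
signal_count = 10_000

per_signal_gb_s = total_bandwidth_tb_s * 1000 / signal_count
print(f"~{per_signal_gb_s:.2f} GB/s (~{per_signal_gb_s * 8:.0f} Gbit/s) per signal")
# ~0.25 GB/s (~2 Gbit/s) each: far slower per pin than the narrow, fast SerDes
# links AMD and Intel use, trading pin count for lower clocking and power cost.
```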
 
  • Like
Reactions: Viknet and Ajay

Eug

Lifer
Mar 11, 2000
23,587
1,001
126
Bloomberg claims the "M2 Extreme" has been canceled.


Gurman states that it's gonna stop at M2 Ultra. So, there will be a Mac Pro, but with the same top end chip as the Mac Studio. Hmmm... That would be weird. I'm not sure how much he is guessing here though.
 
  • Like
Reactions: scineram

moinmoin

Diamond Member
Jun 1, 2017
4,954
7,671
136
I'm not sure how much he is guessing here though.
"Instead, the Mac Pro is expected to rely on a new-generation M2 Ultra chip (rather than the M1 Ultra) and will retain one of its hallmark features: easy expandability for additional memory, storage and other components."

Considering he appears to think expandability for additional memory is feasible with Apple Silicon I'd think he's guessing.
 

Doug S

Platinum Member
Feb 8, 2020
2,269
3,520
136
I'm skeptical about this. I could see Apple bypassing an "M2 Extreme" if they ran into some sort of issue, but they are hardly going to be worrying about the cost. There will be a market for "Extreme" Mac Pros, even if they start at $8000 or more. But maybe we have to wait for M3 to get the Extreme.

I'm even more skeptical about memory upgrades for Mac Pro. They'd need to add DDR5 controllers to the Max die, and the bandwidth would be far less than the LPDDR memory. The OS and/or developers would have to manage whether something belongs in "fast" memory or "slow" memory.

There's no reason they can't have Micron (or whoever is building their custom LPDDR stacks) build bigger stacks and offer BTO memory configs; the practical limit would be 3 TB with M2. Would that be more expensive than DDR sticks? Sure. Would it mean you can't upgrade after purchase? Sure. But it avoids some very difficult problems. The only feasible way I could see to manage DIMMs would be treating them as a backing store to the LPDDR and paging in/out - and then you have two-level paging if you also want to support paging in/out to storage.
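
Purely to illustrate that "DIMMs as a backing store" idea, here is a hypothetical two-tier paging sketch; it is not how macOS manages memory, just a toy LRU model of fast LPDDR spilling to slower DIMMs and then to storage:

```python
# Hypothetical two-level paging sketch: LPDDR as the fast tier, DIMMs as a
# slower backing tier, storage as the final spill target. Illustrative only.
from collections import OrderedDict

class TwoTierMemory:
    def __init__(self, fast_pages, slow_pages):
        self.fast = OrderedDict()        # LPDDR tier, LRU-ordered
        self.slow = OrderedDict()        # DIMM tier, LRU-ordered
        self.fast_pages = fast_pages
        self.slow_pages = slow_pages

    def touch(self, page):
        """Access a page, promoting it into the fast (LPDDR) tier."""
        self.slow.pop(page, None)
        self.fast[page] = True
        self.fast.move_to_end(page)
        while len(self.fast) > self.fast_pages:          # demote coldest page
            cold, _ = self.fast.popitem(last=False)
            self.slow[cold] = True
            while len(self.slow) > self.slow_pages:      # page out to storage
                self.slow.popitem(last=False)

mem = TwoTierMemory(fast_pages=2, slow_pages=2)
for p in ["a", "b", "c", "d", "a"]:
    mem.touch(p)
print("LPDDR:", list(mem.fast), "DIMM:", list(mem.slow))
# LPDDR: ['d', 'a'] DIMM: ['b', 'c'] -- hottest pages stay in fast memory
```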
 

ashFTW

Senior member
Sep 21, 2020
312
235
96
It seems that a solid strategy and architecture for the Mac Pro, both CPU and GPU, wasn't fully worked out before the switch to Apple Silicon. Or Apple miscalculated the future technology direction and competitive landscape. It's interesting that while the world is going fully disaggregated, Apple has doubled down on an everything-aggregated strategy.

I’m afraid that the “4x Mx Max chiplets” strategy won’t be adequate long term, given that the Mx Max chiplet, being tied to the laptop TDP, will likely grow by max 15% each year. In 2023 that gives me 32 big cores and perhaps 64 in 3-4 years, at which point double that will be common place in the x86 world. And the divide will only get worse with future x86 generations. If only Apple had a bigger TAM so they could design more targeted (and disaggregate) chip/chiplets instead of this Mx Max reuse model. I can see these sort of heated discussions happening at Apple pre “Nuvia exodus” …

Plan B: Maybe Apple should just continue with x86 for Mac Pro. They already address 95% of their existing market with current Apple silicon products.

Revised Plan A: Maybe Apple has to come up with an "Nx Mx Max" architecture, where N > 4 Max chiplets can be combined in some sort of cartridge/backplane model. This architecture should allow for disaggregated chiplets (e.g. CPU-only, GPU-only, etc.) in the future…
 

mikegg

Golden Member
Jan 30, 2010
1,756
411
136
Bloomberg claims the "M2 Extreme" has been canceled.


Gurman states that it's gonna stop at M2 Ultra. So, there will be a Mac Pro, but with the same top end chip as the Mac Studio. Hmmm... That would be weird. I'm not sure how much he is guessing here though.
I sort of predicted this a year ago.

Basically, I always believed that in order to profitably produce an "Extreme" version, Apple would have to create Apple Cloud. No way to recoup the costs from just selling the Mac Pro alone.

Mark Gurman is saying that Apple is working on a 40-core SoC for the Mac Pro for 2022.

You're Tim Cook, sitting in your nice office, looking at how much money you just spent to make this giant SoC for a relatively small market. In fact, you have to do this every year or two to keep the Mac Pro relevant. How do you recoup some of the money spent?

You create "Apple Cloud". No, not iCloud. Apple Cloud. Like AWS. Where anyone can come and rent a 40-core M3 SoC running on macCloudOS. You get into the cloud hosting business. You file this under the "Services" strategy that you keep pushing to make Wall Street happy.

Soon, you'll be releasing 64-core SoCs with 128-core GPUs, then 128-core SoCs with 256-core GPUs, and so on. Somehow, you're actually beating anything AWS, Azure, Google Cloud can offer... without really trying.

Apple Silicon Cloud.

It wouldn't surprise me if Apple is already testing its own SoCs to power its iCloud service, which currently depends on AWS. Apple was reportedly spending $30m/month on AWS in 2019. It might be $100m+ per month by now, given how fast services have grown.

 

Heartbreaker

Diamond Member
Apr 3, 2006
4,228
5,228
136
I sort of predicted this a year ago.

Basically, I always believed that in order to profitably produce an "Extreme" version, Apple would have to create Apple Cloud. No way to recoup the costs from just selling the Mac Pro alone.

If they are just making a 40-core part by attaching four copies of a 10-core part together, then there is nothing new being fabbed to worry about. An Apple cloud built on Apple SoCs doesn't seem to make much sense.
 
  • Like
Reactions: Mopetar