Discussion Apple Silicon SoC thread

Eug · Nov 10, 2020

M1
5 nm
Unified memory architecture - LPDDR4X
16 billion transistors

8-core CPU

4 high-performance cores
192 KB instruction cache
128 KB data cache
Shared 12 MB L2 cache

4 high-efficiency cores
128 KB instruction cache
64 KB data cache
Shared 4 MB L2 cache
(Apple claims the 4 high-efficiency cores alone perform like a dual-core Intel MacBook Air)

8-core iGPU (but there is a 7-core variant, likely with one inactive core)
128 execution units
Up to 24576 concurrent threads
2.6 Teraflops
82 Gigatexels/s
41 gigapixels/s

16-core neural engine
Secure Enclave
USB 4
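
A quick sanity check on the GPU figures listed above, as a rough sketch only. The 8 FP32 ALUs per execution unit and the 2 FLOPs per ALU per clock (one fused multiply-add) are assumptions that are not stated in this post; the other numbers are taken straight from the list.

```python
# Back-of-the-envelope check of the M1 GPU figures quoted in the spec list.
# ASSUMPTIONS (not from the post): 8 FP32 ALUs per execution unit and
# 2 FLOPs per ALU per clock (a fused multiply-add counted as two FLOPs).

EXECUTION_UNITS = 128        # from the post
MAX_THREADS = 24576          # from the post
PEAK_TFLOPS = 2.6            # from the post
GIGATEXELS_PER_S = 82        # from the post
GIGAPIXELS_PER_S = 41        # from the post

ALUS_PER_EU = 8              # assumption
FLOPS_PER_ALU_PER_CLOCK = 2  # assumption


def threads_per_eu() -> int:
    """Concurrent threads each execution unit would have to track."""
    return MAX_THREADS // EXECUTION_UNITS


def implied_clock_ghz() -> float:
    """GPU clock implied by the peak TFLOPS figure under the assumptions."""
    flops_per_clock = EXECUTION_UNITS * ALUS_PER_EU * FLOPS_PER_ALU_PER_CLOCK
    return PEAK_TFLOPS * 1e12 / flops_per_clock / 1e9


def texels_per_pixel() -> float:
    """Ratio of texture fill rate to pixel fill rate."""
    return GIGATEXELS_PER_S / GIGAPIXELS_PER_S


if __name__ == "__main__":
    print(f"threads per EU: {threads_per_eu()}")
    print(f"implied GPU clock: {implied_clock_ghz():.2f} GHz")
    print(f"texel to pixel ratio: {texels_per_pixel():.1f}")
```

Under those assumptions the listed numbers are at least self-consistent: 192 threads per execution unit and an implied clock of roughly 1.27 GHz.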

Products:
$999 ($899 edu) 13" MacBook Air (fanless) - 18 hour video playback battery life
$699 Mac mini (with fan)
$1299 ($1199 edu) 13" MacBook Pro (with fan) - 20 hour video playback battery life

Memory options 8 GB and 16 GB. No 32 GB option (unless you go Intel).

It should be noted that the M1 chip in these three Macs is the same (aside from GPU core count). Basically, Apple is taking the same approach with these chips as they do with the iPhones and iPads: just one SKU (excluding the X variants), which is the same across all iDevices (aside from maybe slight clock speed differences occasionally).

EDIT:

M1 Pro 8-core CPU (6+2), 14-core GPU
M1 Pro 10-core CPU (8+2), 14-core GPU
M1 Pro 10-core CPU (8+2), 16-core GPU
M1 Max 10-core CPU (8+2), 24-core GPU
M1 Max 10-core CPU (8+2), 32-core GPU

M1 Pro and M1 Max discussion here:

Page 78 - Discussion - Apple Silicon SoC thread

forums.anandtech.com

M1 Ultra discussion here:

Page 109 - Discussion - Apple Silicon SoC thread

forums.anandtech.com

M2 discussion here:

Page 127 - Discussion - Apple Silicon SoC thread

forums.anandtech.com

Second Generation 5 nm
Unified memory architecture - LPDDR5, up to 24 GB and 100 GB/s
20 billion transistors

8-core CPU

4 high-performance cores
192 KB instruction cache
128 KB data cache
Shared 16 MB L2 cache

4 high-efficiency cores
128 KB instruction cache
64 KB data cache
Shared 4 MB L2 cache

10-core iGPU (but there is an 8-core variant)
3.6 Teraflops

16-core neural engine
Secure Enclave
USB 4

Hardware acceleration for 8K H.264, HEVC (H.265), ProRes

M3 Family discussion here:

Page 215 - Discussion - Apple Silicon SoC thread

forums.anandtech.com

jeanlain · Dec 3, 2020

About SMT. I don't see why benchmark tools should aim at "saturating a CPU core".
These tools are designed to give a summary of the performance to expect while running different tasks.
ST benchmarks are useful to indicate how fast an architecture is at single-threaded tasks, which are very common.
Then there are tasks using several threads. You're free to test with 2 or more threads. But there's no point in constraining these threads to one particular core. None, unless you're interested in single-core performance from an academic point of view.
In the case of the M1, it beats the competition at most tasks that have 4 threads or fewer. Beyond that, the fact that it has only 4 high-performance cores will start to show. In 2-threaded tasks, it wins most of the time.
If you want to show that SMT has some benefits, then launch more threads than there are CPU cores, like everyone does. Constraining a task to just one core is not going to show anything relevant to real world use.

EDIT: the above was already pointed out by others on the previous page.

moinmoin · Dec 3, 2020

Carfax83 said:
Now here's an interesting thought. Despite the almost universal derision that many of us here have with WCCFTech, they posted an interesting article today.
(...)
So what do you guys think? Do you believe these assertions have any merit? To me it makes sense.

Did you skip reading the previous page when you posted that? Kind of irritating to see this discussion start over again since the claims fly in the face of any serious benchmarking.

biostud said:
Why m1 is beaten by x86 in single core benchmarks.

Exclusive: Why Apple M1 Single "Core" Comparisons Are Fundamentally Flawed (With Benchmarks)

I have something pretty exciting for our readers today; something that almost everyone appears to have missed in the clamor for Apple M1 benchmark comparisons. What if I told you that pretty much all of the single-core benchmark comparisons between the Apple M1 and modern x86 processors you see...

wccftech.com

dmens said:
This is one of the dumbest things I have ever read and demonstrates an absolute ignorance of how SMT is actually implemented, specifically, what is replicated and what is shared for SMT.

---

So no amount of SMT is ever going to replace the need for serious ST as that is boosting any non-parallel workload while SMT and multi-core needs embarrassingly parallelized software to work well.

jeanlain · Dec 3, 2020

amrnuke said:
Apple have chosen not to have their core do SMT, and instead used those transistors to focus on other areas.

How difficult would implementing SMT on a firestorm core be?
When Apple put their SoCs into more powerful desktops, the smaller cores will have little practical use and SMT will make better use of the silicon.

insertcarehere · Dec 3, 2020

amrnuke said:
It doesn't scale linearly.

That Xeon line looks very linear to me up to 24 threads, while first-gen threadripper suffers a bit from CCX/NUMA bottlenecks. Just looking at those charts it'd be tough to argue against a hypothetical 8+8 Apple Mx Chip doubling scores at the same clocks as the M1.

Bam360 said:
Yeah, obviously what they are trying to do is close the gap as much as possible in single thread performance while still tying or winning in multi threaded tasks (at least compared to A14), thanks to having more cores that, especially the A78 on a newer node, sip power. Besides, the key thing here is that it's much cheaper to cram 3 A78 cores instead of 3 X1 cores, because they need much less area, X1 may be more power hungry, like 50% more power at same GHz, but you would probably get lower power at the same performance by downclocking the X1. For laptops however, yeah, 4 X1 at the very least, and Cortex A55 just has to die, too slow and not really that efficient for the little performance it has.

S888 is certainly more conservative than it really should've been wrt layout and clocks, I am guessing the Samsung process does it no favors in this regard.
Agreed on the A55: at some point QC/Mediatek have got to be tired of licensing that core and just put a big core on a lower-voltage plane, right?

Qwertilot · Dec 3, 2020

jeanlain said:
How difficult would implementing SMT on a firestorm core be?
When Apple put their SoCs into more powerful desktops, the smaller cores will have little practical use and SMT will make better use of the silicon.

Not for quite some time at least - their next step up is still going to be going into a lot of laptops and the smaller cores are very useful then.

In fact Apple's smaller cores are so powerful that they're probably worth keeping about regardless.

uzzi38 · Dec 3, 2020

jeanlain said:
How difficult would implementing SMT on a firestorm core be?
When Apple put their SoCs into more powerful desktops, the smaller cores will have little practical use and SMT will make better use of the silicon.

SMT isn't particularly difficult to implement on a hardware level at all, it just requires a lot of validation

bigggggggg · Dec 3, 2020

jeanlain said:
In the case of the M1, it beats the competition at most tasks that have 4 threads or fewer. Beyond that, the fact that it has only 4 high-performance cores will start to show.

But how is it possible, in your opinion, that M1 performs so well in multi-core/multi-thread tasks against 8c/16t CPUs, looking at SPEC2017 tests (even those that do not rely much on cache)?

Carfax83 · Dec 3, 2020

jeanlain said:
Now, they should have said "at single threaded tasks" to account for SMT, but I suppose everyone understood that.
Note also that the tests Apple relies on were performed in October with commercially available CPUs. At that time, the fastest core, including desktop CPUs, was Intel's (the 10900K I suppose), which is beaten by the M1 at almost every single-threaded task. So Apple's claim appears quite conservative. Specifying "when it comes to low-power silicon" makes the claim valid today.
Still, the M1 trades blows with the current best desktop CPU core in ST SPEC tests.

This is the last time I'm going to address this. They changed it from World's fastest CPU core to World's fastest CPU core in low power silicon on the same day apparently. Linus called them out on it in his video review:

Carfax83 · Dec 3, 2020

moinmoin said:
Did you skip reading the previous page when you posted that? Kind of irritating to see this discussion start over again since the claims fly in the face of any serious benchmarking.

Yeah I never saw that discussion, my bad. I hadn't actively participated in this thread since last week so I was merely responding to replies and didn't read all of the latest posts.

moinmoin said:
So no amount of SMT is ever going to replace the need for serious ST as that is boosting any non-parallel workload while SMT and multi-core needs embarrassingly parallelized software to work well.

I don't think that was what the author was implying when he wrote that article, but it doesn't matter. I don't want to rehash that argument again.

teejee · Dec 3, 2020

uzzi38 said:
SMT isn't particularly difficult to implement on a hardware level at all, it just requires a lot of validation

I don't think this is correct. Adding SMT to Apple's core would probably be very difficult without significant ST performance regression.
M1 has the most advanced core on the market with extremely high IPC; suddenly making this work with two different threads would probably mess up the whole design.
Remember that Apple has never had to care about SMT in their development. I'm sure there are tons of big and small design decisions that have benefited from that.

So don't expect Apple to have SMT in their cores. They have chosen ultra-high IPC and efficiency cores instead.

moinmoin · Dec 3, 2020

jeanlain said:
How difficult would implementing SMT on a firestorm core be?
When Apple put their SoCs into more powerful desktops, the smaller cores will have little practical use and SMT will make better use of the silicon.

The question is: Why would Apple bother? SMT is a way of ensuring better utilization of all available CPU resources at the cost of overall higher power usage and lower per thread performance. This is exactly the opposite goal for mobile, and at least M1 is still decidedly mobile in its design. The very first question for Apple is whether the desktop market is big enough to warrant a dedicated chip design (M1 isn't one yet), but there the increase of I/O capabilities should be way higher on the list than a feature that requires changes to a common core shared across all markets, changes that then are only really usable in the desktop market.

Carfax83 · Dec 3, 2020

jeanlain said:
About SMT. I don't see why benchmark tools should aim at "saturating a CPU core".
These tools are designed to give a summary of the performance to expect while running different tasks.
ST benchmarks are useful to indicate how fast an architecture is at single-threaded tasks, which are very common.

I don't disagree with any of that, but in modern times, it seems that single threaded tasks are not really used for anything performance intensive. Performance critical tasks all seem to be multithreaded. I could be wrong though I freely admit.

To me, single threaded performance is only useful in the context of the overall throughput of an architecture.

nxre · Dec 3, 2020

SMT doesn't change the uarch of the core in any meaningful way. You still have the same front end, the same execution units and the same back end. The only difference is that you can run TWO threads on the same core so as to make better use of the execution units. If a single thread can saturate ALL execution units, SMT is insignificant: there is neither a performance penalty nor an advantage. If a single thread cannot saturate all units, SMT is significant: there is no advantage for single-threaded performance, but multithreaded performance gets a big ~20 to 30% boost.
A lot of people are acting as if having SMT on x86 designs somehow makes performance worse for single-threaded tasks, because SMT would be eating away resources from the main thread. Not the case.
Apple's massive ROB window and caches also probably eliminate the need for SMT. They can keep all execution units fed with one thread, so there would be NO benefit to running another thread on the core, as there are no underutilized parts.
The fact this discussion was only brought up after M1 just shows it is a nonsensical coping mechanism that tries once again to invalidate M1 performance. It's boring at this point. SMT is measured in the multithreaded benchmarks, the same way Apple's little cores are. Leave it that way.
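
To put numbers on the saturation argument above, here is a deliberately crude toy model, not a description of any real core. It assumes a thread running alone keeps a fixed fraction of the execution slots busy and that a second thread can only fill the slots the first leaves idle; the function names are made up for illustration, and contention for caches, ROB entries and the front end is ignored.

```python
# Toy model of SMT throughput on a single core.
# ASSUMPTION: each thread, run alone, keeps a fixed fraction of the core's
# execution slots busy, and extra threads can only use otherwise idle slots.
# Real cores also share caches, ROB entries and decode bandwidth, which this
# sketch ignores.


def core_throughput(per_thread_utilization: float, threads: int) -> float:
    """Fraction of execution slots in use, capped at 1.0 (saturated core)."""
    if not 0.0 <= per_thread_utilization <= 1.0:
        raise ValueError("utilization must be between 0 and 1")
    if threads < 1:
        raise ValueError("need at least one thread")
    return min(1.0, per_thread_utilization * threads)


def smt_gain(per_thread_utilization: float) -> float:
    """Relative throughput gain from running two threads instead of one."""
    one = core_throughput(per_thread_utilization, 1)
    if one == 0.0:
        return 0.0  # an idle thread gains nothing from a sibling
    two = core_throughput(per_thread_utilization, 2)
    return two / one - 1.0


if __name__ == "__main__":
    for utilization in (0.5, 0.8, 1.0):
        print(f"utilization {utilization:.0%}: SMT gain {smt_gain(utilization):.0%}")
```

In this model a thread that already fills 80% of the slots leaves room for a 25% gain, which is in the range quoted above, while a thread that saturates the core gains nothing from a sibling.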

nxre · Dec 3, 2020

Carfax83 said:
I don't disagree with any of that, but in modern times, it seems that single threaded tasks are not really used for anything performance intensive. Performance critical tasks all seem to be multithreaded. I could be wrong though I freely admit.

To me, single threaded performance is only useful in the context of the overall throughput of an architecture.

Web-browsing, the most common task in any computer, is single thread. Compilers are also single thread. CPU-bound games are also extremely dependent on single threaded performance, which is why AMD has only been able to regain the gaming crowd this year after also regaining the single thread crown.
A lot of things simply can't be made to run parallel, a lot of algorithms are inherently serial problems, so even today single threaded performance is very relevant.
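
The point about inherently serial work can be illustrated with Amdahl's law, speedup = 1 / ((1 - p) + p / n), where p is the fraction of the work that can run in parallel and n is the number of cores. This is only a sketch: the fractions in the loop below are illustrative values, not measurements of any real workload.

```python
# Amdahl's law: the serial part of a task caps the benefit of extra cores.
# The parallel fractions used in the demo loop are illustrative only.


def amdahl_speedup(parallel_fraction: float, workers: int) -> float:
    """Ideal speedup on `workers` cores when only part of the task scales."""
    if not 0.0 <= parallel_fraction <= 1.0:
        raise ValueError("parallel_fraction must be between 0 and 1")
    if workers < 1:
        raise ValueError("workers must be at least 1")
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / workers)


if __name__ == "__main__":
    for fraction in (0.5, 0.9, 0.99):
        for cores in (4, 8, 16):
            speedup = amdahl_speedup(fraction, cores)
            print(f"p={fraction:.2f}, cores={cores:2d}: {speedup:.2f}x")
```

With half the work serial, no core count can give more than a 2x speedup, so a faster single thread is the only way to go beyond that.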

jeanlain · Dec 3, 2020

moinmoin said:
The question is: Why would Apple bother? SMT is a way of ensuring better utilization of all available CPU resources at the cost of overall higher power usage and lower per thread performance. This is exactly the opposite goal for mobile, and at least M1 is still decidedly mobile in its design. The very first question for Apple is whether the desktop market is big enough to warrant a dedicated chip design (M1 isn't one yet), but there the increase of I/O capabilities should be way higher on the list than a feature that requires changes to a common core shared across all markets, changes that then are only really usable in the desktop market.

Apple also sells desktop computers, and I suppose they want to make clear their SoCs will be the best on that front as well. Here, power consumption won't enter much into consideration, and Apple will be competing against 32-thread CPUs.
I don't expect Apple to simply use higher-clocked iPhone cores on Mac Pros. If they implement SMT, this should pay off in the long term. Apple Silicon is here to stay.

jeanlain · Dec 3, 2020

teejee said:
I don't think this is correct. Adding SMT to Apple's core would probably be very difficult without significant ST performance regression.
M1 has the most advanced core on the market with extremely high IPC; suddenly making this work with two different threads would probably mess up the whole design.
Remember that Apple has never had to care about SMT in their development. I'm sure there are tons of big and small design decisions that have benefited from that.

So don't expect Apple to have SMT in their cores. They have chosen ultra-high IPC and efficiency cores instead.

I haven't seen hard evidence of a tradeoff between high IPC and SMT. That's why I'm wondering why people think SMT would be hard to implement on these cores. If anything, I would expect high IPC to naturally lend itself to SMT, since the core is wider.

jeanlain · Dec 3, 2020

Carfax83 said:
This is the last time I'm going to address this. They changed it from World's fastest CPU core to World's fastest CPU core in low power silicon on the same day apparently. Linus called them out on it in his video review:

"Apparently"?
I just showed you that Apple used the phrase "when it comes to low power silicon" the very first time they made that performance claim to the public. Have you watched the video I linked? Do you think they edited the video after the fact?
And even if they didn't initially include that mention on their webpage, (a claim which I have seen no evidence of), why would that not be just an oversight, since they make that point clear in their video?

Carfax83 · Dec 3, 2020

nxre said:
Web-browsing, the most common task in any computer, is single thread.

What browser are you using? All the major browsers nowadays are multithreaded, though they may go about it in different ways.

This is from several years ago:

Firefox 54 finally supports multithreading.

nxre said:
Compilers are also single thread.

If this is the case, how do you explain these benchmarks?

Zen 3 code compiling benchmarks

Code compiling might not be inherently multithreaded, but it seems to respond well to parallelization. You see the same thing in the Spec GCC sub test.

nxre said:
CPU-bound games are also extremely dependent on single threaded performance, which is why AMD has only been able to regain the gaming crowd this year after also regaining the single thread crown.

As a long time gamer, I can say that this is not correct at all. Games (especially big games) have been becoming increasingly parallelized over the years, abandoning the old programming models from years ago when rendering was done on one thread, physics on another, game logic on another, etcetera.... Also, newer APIs like DX12 and Vulkan also mesh much better with multithreaded programming than the legacy APIs. A lot of modern game engines use task based parallelism, and some of the most cutting edge 3D engines don't even have lead threads.

The main reason why Zen 3 is so dominant in gaming is because each core in the new CCX has access to twice as much cache as Zen 2, which reduces memory latency big time. Memory latency was the main advantage that Intel had over AMD throughout the years that accounted for the advantage in gaming.

nxre said:
A lot of things simply can't be made to run parallel, a lot of algorithms are inherently serial problems, so even today single threaded performance is very relevant.

Again, I'm not saying or implying that single threaded performance isn't relevant, because it definitely is. I'm just saying that its relevancy is tied to how it contributes to the overall throughput of a CPU, and this is because practically all the performance intensive algorithms are now either inherently multithreaded or respond well to parallelization with more cores increasing performance.

jeanlain · Dec 3, 2020

bigggggggg said:
But how is it possible, in your opinion, that M1 performs so well in multi-core/multi-thread tasks against 8c/16t CPUs, looking at SPEC2017 tests (even those that do not rely much on cache)?

I suppose that these tasks, although they may display 100% CPU usage, do not scale well with thread count (this is the case for geekbench). For these workloads, the benefits of having many cores is not as stark as for others like cinebench.

Carfax83 · Dec 3, 2020

jeanlain said:
"Apparently"?
I just showed you that Apple used the phrase "when it comes to low power silicon" the very first time they made that performance claim to the public. Have you watched the video I linked? Do you think they edited the video after the fact?
And even if they didn't initially include that mention on their webpage, (a claim which I have seen no evidence of), why would that not be just an oversight, since they make that point clear in their video?

You're assuming the video was the first bit of media that they used for the M1. If you look at the Apple M1 preview that Anandtech had which was done on Nov 10, there is an Apple marketing slide which claims "World's fastest CPU core."

Andrei F. stated in the preview:

The new CPU core is what Apple claims to be the world’s fastest. This is going to be a centre-point of today’s article as we dive deeper into the microarchitecture of the Firestorm cores, as well look at the performance figures of the very similar Apple A14 SoC.

So obviously they changed their tune at one point on that same day.

bigggggggg · Dec 3, 2020

jeanlain said:
I suppose that these tasks, although they may display 100% CPU usage, do not scale well with thread count (this is the case for geekbench). For these workloads, the benefits of having many cores is not as stark as for others like cinebench.

Maybe the massive L2 cache helps when some tests are performed, like libquantum, which seems to like cache. I think the massive arithmetic units also help in others. AFAIK Blender really likes multi-threading, but M1 > 4800U even there.
Strange.
I think large decoder + massive L2 cache + crazy single-thread performance help a lot, despite having a lower core count.

jeanlain · Dec 3, 2020

Carfax83 said:
You're assuming the video was the first bit of media that they used for the M1. If you look at the Apple M1 preview that Anandtech had which was done on Nov 10, there is an Apple marketing slide which claims "World's fastest CPU core."

That video was the reveal of the M1 to the public. I watched the live stream. It came out before Anandtech's preview. The text is part of a video and as such, it must be interpreted in that context. What Anandtech published is certainly not a marketing slide sent to journalists; it's a screenshot of the video shown without its proper context. There is no evidence that Apple sent a different piece of information about the M1 to Anandtech. For all we know, Anandtech was only relying on the video. If someone was misleading here, it's Anandtech, not Apple.

scannall · Dec 3, 2020

uzzi38 said:
SMT isn't particularly difficult to implement on a hardware level at all, it just requires a lot of validation

SMT also introduces possible attack vectors, so it can be a security risk as well. I'd also add that SMT was introduced as a way to deal with pipeline stalls. The better your core, and the fewer stalls you have then the less benefit you'll see from SMT.

Carfax83 · Dec 3, 2020

bigggggggg said:
Maybe the massive L2 cache helps when some tests are performed, like libquantum, which seems to like cache. I think the massive arithmetic units also help in others. AFAIK Blender really likes multi-threading, but M1 > 4800U even there.
Strange.

Funny you should mention that, because I was reading a thread over at realworldtech forums, and two guys were arguing over whether the M1's width is a big factor in its performance or not.

Then one of them brought up Spec2017 blender, and he mentioned:

I also looked up what SPECfp 2017 does with Blender. They render a 'reduced version' of a data set at 320x200. Perhaps that's why this paper gets next to zero L1D misses with that Blender test on Haswell. Last time I poked around with Blender (rendering some models from a video game for fun), L1D hitrate was around 95%.

Geez, what's next from SPEC? Rendering a 1x1 image?

Source

So it could be that Apple's massive caches are helping out a lot in the blender benchmark, because all or almost all of the code can execute from the cache due to the small footprint?
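
A rough way to see why the hit rate matters so much is the usual average memory access time formula, AMAT = hit latency + miss rate x miss penalty. The sketch below is purely illustrative: the 4-cycle hit latency and 20-cycle miss penalty are placeholder values chosen for the example, not measured figures for the M1, Haswell or any other chip.

```python
# Average memory access time (AMAT) for a single cache level.
# ASSUMPTION: the latencies below are placeholders for illustration, not
# measured values for any real CPU.

HIT_LATENCY_CYCLES = 4.0    # hypothetical L1D hit latency
MISS_PENALTY_CYCLES = 20.0  # hypothetical extra cost of going to the next level


def average_access_time(hit_rate: float,
                        hit_latency: float = HIT_LATENCY_CYCLES,
                        miss_penalty: float = MISS_PENALTY_CYCLES) -> float:
    """AMAT in cycles: every access pays the hit latency, misses pay extra."""
    if not 0.0 <= hit_rate <= 1.0:
        raise ValueError("hit_rate must be between 0 and 1")
    return hit_latency + (1.0 - hit_rate) * miss_penalty


if __name__ == "__main__":
    for rate in (0.95, 0.99, 1.0):
        cycles = average_access_time(rate)
        print(f"L1D hit rate {rate:.0%}: {cycles:.2f} cycles per access")
```

With these placeholder numbers, going from a 95% hit rate to essentially no misses cuts the average access cost from about 5 cycles to 4, so a reduced data set that fits in cache can flatter a core compared with a full-size render.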

On a side note, I've noticed that the Spec benchmark gets a lot of criticism on that forum, which is well known for having plenty of engineers, programmers and IT industry professionals. A common refrain is that it doesn't represent well the types of workloads that it claims to from a real world perspective.

Heartbreaker · Dec 3, 2020

amrnuke said:
It doesn't scale linearly.

That Xeon line is linear up to 24 threads. If the CPU can deliver, it will scale linearly.
