Discussion Apple Silicon SoC thread

Eug · Nov 10, 2020

M1
5 nm
Unified memory architecture - LPDDR4X
16 billion transistors

8-core CPU

4 high-performance cores
192 KB instruction cache
128 KB data cache
Shared 12 MB L2 cache

4 high-efficiency cores
128 KB instruction cache
64 KB data cache
Shared 4 MB L2 cache
(Apple claims the 4 high-efficiency cores alone perform like a dual-core Intel MacBook Air)

8-core iGPU (but there is a 7-core variant, likely with one inactive core)
128 execution units
Up to 24576 concurrent threads
2.6 Teraflops
82 Gigatexels/s
41 gigapixels/s

16-core neural engine
Secure Enclave
USB 4
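
As a rough sanity check on the GPU figures above, the quoted throughput numbers can be worked backwards into an implied clock. This is only a sketch: the 8 FP32 ALUs per execution unit and the 2 FLOPs per ALU per clock (one fused multiply-add) are assumptions, not something stated in the spec list.

```python
# Back-of-the-envelope check of the M1 GPU figures quoted above.
# ASSUMPTIONS (not from the spec list): 8 FP32 ALUs per execution unit,
# and 2 FLOPs per ALU per clock (one fused multiply-add).
EXECUTION_UNITS = 128
MAX_THREADS = 24576
TFLOPS = 2.6
GIGATEXELS_PER_S = 82
GIGAPIXELS_PER_S = 41

ALUS_PER_EU = 8          # assumption
FLOPS_PER_ALU_CLOCK = 2  # assumption (FMA)

def implied_clock_ghz():
    """Clock speed that would produce the quoted TFLOPS figure."""
    flops_per_clock = EXECUTION_UNITS * ALUS_PER_EU * FLOPS_PER_ALU_CLOCK
    return TFLOPS * 1e12 / flops_per_clock / 1e9

clock = implied_clock_ghz()
threads_per_eu = MAX_THREADS // EXECUTION_UNITS
texels_per_clock = GIGATEXELS_PER_S / clock
pixels_per_clock = GIGAPIXELS_PER_S / clock

print(f"implied clock: {clock:.2f} GHz")
print(f"threads in flight per EU: {threads_per_eu}")
print(f"texels per clock: {texels_per_clock:.1f}")
print(f"pixels per clock: {pixels_per_clock:.1f}")
```

Under those assumptions the figures hang together: a clock of roughly 1.27 GHz, 192 threads in flight per execution unit, and about 64 texels and 32 pixels per clock.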

Products:
$999 ($899 edu) 13" MacBook Air (fanless) - 18 hour video playback battery life
$699 Mac mini (with fan)
$1299 ($1199 edu) 13" MacBook Pro (with fan) - 20 hour video playback battery life

Memory options 8 GB and 16 GB. No 32 GB option (unless you go Intel).

It should be noted that the M1 chip in these three Macs is the same (aside from GPU core count). Basically, Apple is taking the same approach with these chips as it does with the iPhones and iPads: just one SKU (excluding the X variants), which is the same across all iDevices (aside from maybe slight clock speed differences occasionally).

EDIT:

M1 Pro 8-core CPU (6+2), 14-core GPU
M1 Pro 10-core CPU (8+2), 14-core GPU
M1 Pro 10-core CPU (8+2), 16-core GPU
M1 Max 10-core CPU (8+2), 24-core GPU
M1 Max 10-core CPU (8+2), 32-core GPU

M1 Pro and M1 Max discussion here:

Page 78 - Discussion - Apple Silicon SoC thread

forums.anandtech.com

M1 Ultra discussion here:

Page 109 - Discussion - Apple Silicon SoC thread

forums.anandtech.com

M2 discussion here:

Page 127 - Discussion - Apple Silicon SoC thread

forums.anandtech.com

M2
Second-generation 5 nm
Unified memory architecture - LPDDR5, up to 24 GB and 100 GB/s
20 billion transistors

8-core CPU

4 high-performance cores
192 KB instruction cache
128 KB data cache
Shared 16 MB L2 cache

4 high-efficiency cores
128 KB instruction cache
64 KB data cache
Shared 4 MB L2 cache

10-core iGPU (but there is an 8-core variant)
3.6 Teraflops

16-core neural engine
Secure Enclave
USB 4

Hardware acceleration for 8K H.264, HEVC (H.265), ProRes

M3 Family discussion here:

Page 215 - Discussion - Apple Silicon SoC thread

forums.anandtech.com

M4 Family discussion here:

Page 263 - Discussion - Apple Silicon SoC thread

forums.anandtech.com

M5 Family discussion here:

Page 431 - Discussion - Apple Silicon SoC thread

forums.anandtech.com

moinmoin · Dec 3, 2020

jeanlain said:
Apple also sells desktop computers, and I suppose they want to make clear their SoCs will be the best on that front as well. Here, power consumption won't enter much into consideration, and Apple will be competing against 32-thread CPUs.
I don't expect Apple to simply use higher-clocked iPhone cores on Mac Pros. If they implement SMT, this should pay off in the long term. Apple Silicon is here to stay.

I don't want to harp on you, but please try to read the texts you are replying to. As I wrote myself:

moinmoin said:
The question is: Why would Apple bother? SMT is a way of ensuring better utilization of all available CPU resources at the cost of overall higher power usage and lower per thread performance. This is exactly the opposite goal for mobile, and at least M1 is still decidedly mobile in its design. The very first question for Apple is whether the desktop market is big enough to warrant a dedicated chip design (M1 isn't one yet), but there the increase of I/O capabilities should be way higher on the list than a feature that requires changes to a common core shared across all markets, changes that then are only really usable in the desktop market.

Relative to all the other markets Apple is currently serving, the desktop market is not really noteworthy. Doing a dedicated chip design (not just a variation on one to two dies per year like currently), changing the common core design just for a minor market, either using it only there (which would mean big gaps between updates) or using the core in all markets (which would mean a worse design in the mobile markets), just doesn't look feasible to me in Apple's overall picture. The only way I see Apple specifically pursuing the desktop market not served already (so many cores + SMT) is if it has plans to revive its server business (whether for internal use as part of iCloud or publicly), and I consider that possibility far-fetched.

insertcarehere · Dec 3, 2020

bigggggggg said:
But how is it possible in your opinion that M1 performs so well in multi-core/multi-thread tasks against 8c/16t CPUS, looking at SPEC2017 tests (even those that do not rely much on cache)?

Said 8C/16T CPUs (4800U/4900HS in particular) cannot run at near their peak ST clocks with a full MT workload while fitting within TDP constraints. ST boost clock is 4.2/4.4 GHz for those processors, but in MT workloads frequency drops to around or under 3 GHz. The M1, by contrast, runs at around the same clock speed whether ST or MT (a bit over 3 GHz either way). Along with the efficiency cores pitching in, this means that the "4 core" M1 will punch well above its weight in MT scaling when compared with other mobile CPUs.
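
To put rough numbers on this, here is a minimal sketch of how much clock speed each chip keeps when going from a single-threaded to an all-core load. The clock figures are the approximate ones from this post (3.2 GHz is a stand-in for "a bit over 3 GHz"). Treating throughput as linear in clock and core count is a simplifying assumption that ignores SMT, memory bandwidth and the efficiency cores.

```python
def ideal_mt_scaling(cores, st_clock_ghz, mt_clock_ghz):
    """Ideal multi-thread speedup over the chip's own single-thread result,
    assuming throughput scales linearly with clock and core count."""
    return cores * mt_clock_ghz / st_clock_ghz

# (cores, ST boost clock, sustained all-core clock); approximate figures
# taken from the post, with 3.2 GHz assumed for the M1.
chips = {
    "4800U (8 cores)": (8, 4.2, 3.0),
    "4900HS (8 cores)": (8, 4.4, 3.0),
    "M1 (4 big cores only)": (4, 3.2, 3.2),
}

for name, (cores, st, mt) in chips.items():
    retention = mt / st
    print(f"{name}: keeps {retention:.0%} of ST clock, "
          f"ideal scaling {ideal_mt_scaling(cores, st, mt):.2f}x")
```

On these numbers the x86 parts give up roughly 30% of their per-core clock under full load, while the M1's big cores give up essentially none, so its MT results look better than a bare core count would suggest.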

Carfax83 · Dec 3, 2020

jeanlain said:
That video was the reveal of the M1 to the public. I watched the live stream. It came out before Anandtech's preview. The text is part of a video and as such, it must be interpreted in that context. What Anandtech published is certainly not a marketing slide sent to journalists, it's a screenshot of the video shown without its proper context. There is no evidence that Apple sent a different piece of information about the M1 to Anandtech. For all we know, Anandtech was only relying on the video. If someone was misleading here, it's Anandtech, not Apple.

If you say so. It doesn't really matter that much to me to be quite honest.

But a small correction, Anandtech uploaded their preview on Nov 10, the same day as that live stream. Also, it's doubtful that Andrei F. wrote that article on Nov 10. He probably wrote it well in advance.

Carfax83 · Dec 3, 2020

scannall said:
SMT also introduces possible attack vectors, so it can be a security risk as well. I'd also add that SMT was introduced as a way to deal with pipeline stalls. The better your core, and the fewer stalls you have then the less benefit you'll see from SMT.

The better your core? Or the shorter pipelined your core? Also, higher clock speeds increase memory latency (measured in cycles), which also increases the chance of pipeline stalls, right?

So it's probably correct that a higher clocked CPU with a longer pipeline would benefit more from SMT than a lower clocked CPU with a shorter pipeline.

But that doesn't mean the former is necessarily worse than the latter. It's just a different design.

coercitiv · Dec 3, 2020

nxre said:
Web-browsing, the most common task in any computer, is single thread.

Web-browsing, the most common task in any computer, has been multi-threaded for years.

Mobile browsers could take advantage of 4-6 cores while rendering pages even 5 years ago.

bigggggggg · Dec 3, 2020

insertcarehere said:
ST boost clock is 4.2/4.4 GHz for those processors, but in MT workloads frequency drops to around or under 3 GHz. The M1, by contrast, runs at around the same clock speed whether ST or MT (a bit over 3 GHz either way). Along with the efficiency cores pitching in, this means that the "4 core" M1 will punch well above its weight in MT scaling when compared with other mobile CPUs.

Yep, this should be taken into account. From what I can see in the 4900HS CB23 test, all cores run at 3.8-3.9 GHz for a while, consuming 54 W, but then slow down to 3.3 GHz at 35 W.
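
Those two operating points give a feel for how steep the power curve is near the top of the clock range. A quick sketch using only the figures above (3.85 GHz is taken as the midpoint of 3.8-3.9 GHz, and treating throughput as proportional to clock is an assumption):

```python
# Two operating points reported above for the 4900HS in Cinebench R23.
boost = {"ghz": 3.85, "watts": 54}      # 3.85 = midpoint of 3.8-3.9 GHz
sustained = {"ghz": 3.3, "watts": 35}

clock_loss = 1 - sustained["ghz"] / boost["ghz"]
power_saving = 1 - sustained["watts"] / boost["watts"]

# Assumption: throughput is proportional to clock speed.
boost_eff = boost["ghz"] / boost["watts"]
sustained_eff = sustained["ghz"] / sustained["watts"]
efficiency_gain = sustained_eff / boost_eff - 1

print(f"clock drops by {clock_loss:.0%}")
print(f"power drops by {power_saving:.0%}")
print(f"GHz per watt improves by {efficiency_gain:.0%}")
```

Giving up about 14% of the clock saves about 35% of the power, which is the kind of trade-off that favours a chip designed to sit lower on the frequency curve in the first place.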

bigggggggg · Dec 3, 2020

Carfax83 said:
So it could be that Apple's massive caches are helping out a lot in the blender benchmark, because all or almost all of the code can execute from the cache due to the small footprint?

On a side note, I've noticed that the Spec benchmark gets a lot of criticism on that forum, which is well known for having plenty of engineers, programmers and IT industry professionals. A common refrain is that it doesn't represent well the types of workloads that it claims to from a real world perspective.

I don't know whether that is a real problem or what the technical reasons behind that criticism are; this goes beyond my knowledge.
I only know that SPEC is widely used by organizations and industry as a "general" benchmark. A few days ago I was reading a Clang vs GCC comparison on the Alibaba Tech blog and they used SPEC. AFAIK SPEC should be sufficiently reliable, though it can be misleading in some ways, as every benchmark can.

A reason for that score in Blender could be the one insertcarehere gave: when multithreaded tasks are executed, the M1 can run all cores at full frequency, while most other CPUs lower theirs by a lot.

nxre · Dec 3, 2020

Carfax83 said:
What browser are you using? All the major browsers nowadays are multithreaded, though they may go about it in different ways.

coercitiv said:
Web-browsing, the most common task in any computer, has been multi-threaded for years.

Browsers have been multi-threaded, web browsing has not. JavaScript is still a fundamentally single-threaded language.

Carfax83 said:
Code compiling might not be inherently multithreaded, but it seems to respond well to parallelization. You see the same thing in the Spec GCC sub test.

Independent modules of the same code can be compiled in parallel, which does not mean that compiling is a multithreaded workload, even if it can benefit from it on certain occasions. You can google this in more detail, as I grossly simplified it.
On the topic of games, the bottleneck is still single-thread performance in most cases when it comes to the CPU.
I don't understand what is hard to get about the idea that single-thread code will never be obsolete: if your code requires the result of a previous operation to proceed, it can't be parallel.
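
A small illustration of that last point, as a sketch rather than a benchmark. The first function is a dependency chain (each step needs the previous result), so it cannot be split across workers. The second applies the same step to independent inputs, much like independent modules in a build, and can be handed to a pool. The function names and the arithmetic are made up for illustration.

```python
from concurrent.futures import ThreadPoolExecutor

def step(x):
    # Stand-in for some unit of work.
    return (x * 31 + 7) % 1000

def dependent_chain(seed, n):
    # Each iteration consumes the previous result: inherently serial.
    value = seed
    for _ in range(n):
        value = step(value)
    return value

def independent_batch(inputs, workers=4):
    # No item depends on another, so the work can be spread over a pool.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(step, inputs))

print(dependent_chain(1, 5))
print(independent_batch([1, 2, 3, 4]))
```

Note that in CPython a thread pool will not actually speed up CPU-bound work because of the global interpreter lock; the example is only meant to show the shape of the dependency, not a real speedup.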

scannall · Dec 3, 2020

Carfax83 said:
The better your core? Or the shorter pipelined your core? Also, higher clock speeds increase memory latency (measured in cycles), which also increases the chance of pipeline stalls, right?

So it's probably correct that a higher clocked CPU with a longer pipeline would benefit more from SMT than a lower clocked CPU with a shorter pipeline.

But that doesn't mean the former is necessarily worse than the latter. It's just a different design.

Well yeah, everything is a trade off. Apple has decided that high clocks are a dead end, and counter to their goals. So they are going for high throughput at lower clocks is all. Both are valid choices obviously, but there is more to look at than just clocks, or just SMT.

coercitiv · Dec 3, 2020

nxre said:
Browsers have been multi-threaded, web browsing has not. JavaScript is still a fundamentally single-threaded language.

The graph I posted with 4-8 simultaneous thread occupancy on an Android phone includes every workload needed to render a page, hence there's much more work to do on the CPU than just one JavaScript thread.

Web browsing is multi-threaded.

Bam360 · Dec 3, 2020

If all or most tasks could be hugely multithreaded, there would be no need to invest a huge amount of transistors on improving the core, because you would get much more performance and efficiency by cramming tons of cores and making each of them much simpler.
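
Amdahl's law puts numbers on this. As a minimal sketch (the 90% parallel fraction is an arbitrary example, not a measurement):

```python
def amdahl_speedup(parallel_fraction, cores):
    """Upper bound on speedup when only part of the work can be parallelized."""
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / cores)

for cores in (4, 8, 64, 1024):
    print(f"{cores:>4} cores: {amdahl_speedup(0.90, cores):.2f}x")
```

With 10% serial work the speedup can never reach 10x no matter how many simple cores are added, which is why making each core faster still pays off.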

bigggggggg · Dec 3, 2020

Bam360 said:
If all or most tasks could be hugely multithreaded, there would be no need to invest a huge amount of transistors on improving the core, because you would get much more performance and efficiency by cramming tons of cores and making each of them much simpler.

In fact, embarrassingly parallel tasks nowadays should be accelerated by massively multicore "processors" like GPUs (or some other xPU). But single-threaded or mainly single-threaded tasks continue to exist, so high-performance CPU cores are also important.

jeanlain · Dec 3, 2020

moinmoin said:
I don't want to harp on you, but please try to read the texts you are replying to. As I wrote myself:

I read what you wrote. My point is that even with the current base of Mac desktops (which may increase), Apple has the incentive to make competitive desktop SoCs. If that requires SMT, they'll use it. That hypothetical SMT core could be used in all desktops, including Mac minis. That's not a small market. And it's not as if Apple would be the only one making specialised silicon.
The mere existence of the Mac Pro challenges the assumption that Apple only targets large markets. They even designed an FPGA specifically for that machine.
I don't know if they'll use SMT, but I believe that Apple is ready to make modifications to the CPU core design if that is required to claim leadership in performance.

IvanKaramazov · Dec 3, 2020

Carfax83 said:
If you say so. It doesn't really matter that much to me to be quite honest.

But a small correction, Anandtech uploaded their preview on Nov 10, the same day as that live stream. Also, it's doubtful that Andrei F. wrote that article on Nov 10. He probably wrote it well in advance.

Andrei made clear on twitter that their preview was pre-written based on the A14 in the iPad Air, and the few details from the livestream (like that slide) were inserted right before publication. Honestly, people have been repeatedly saying that Apple "changed" their claim but there's never been any evidence they did. I watched the "livestream" live, and they were pretty explicit both there and in the website that went up right after it ended about what their precise claims were.

As far as I can tell, the idea that Apple changed their definition of that claim comes largely from LTT. I respect Linus, but he's hardly a good source for that.

awesomedeluxe · Dec 3, 2020

insertcarehere said:
Said 8C/16T CPUs (4800U/4900HS in particular) cannot run at near their peak ST clocks with a full MT workload while fitting within TDP constraints. ST boost clock is 4.2/4.4 GHz for those processors, but in MT workloads frequency drops to around or under 3 GHz. The M1, by contrast, runs at around the same clock speed whether ST or MT (a bit over 3 GHz either way). Along with the efficiency cores pitching in, this means that the "4 core" M1 will punch well above its weight in MT scaling when compared with other mobile CPUs.

Yeah, I feel like big.LITTLE already accomplishes SMT's goal of more threads in an elegant way. The Icestorm cores do take up a little die space, but it's not much. The upside is appreciable: the thermal impact of the Icestorm cores is low, while SMT necessarily piles more work onto the hottest part of the APU.

I understand it's not necessary to choose one or the other, but SMT does have a thermal cost, and adding more CPU threads is a game of diminishing returns. The presumption right now is that the M1X will be 8+4 or 8+8. Have workloads really evolved to the point where having more than 12 or 16 threads is so desirable that it would be worth letting the Firestorm cores throttle more frequently?

Heartbreaker · Dec 3, 2020

awesomedeluxe said:
Yeah, I feel like big.LITTLE already accomplishes SMT's goal of more threads in an elegant way. The Icestorm cores do take up a little die space, but it's not much. The upside is appreciable: the thermal impact of the Icestorm cores is low, while SMT necessarily piles more work onto the hottest part of the APU.

I understand it's not necessary to choose one or the other, but SMT does have a thermal cost, and adding more CPU threads is a game of diminishing returns. The presumption right now is that the M1X will be 8+4 or 8+8. Have workloads really evolved to the point where having more than 12 or 16 threads is so desirable that it would be worth letting the Firestorm cores throttle more frequently?

I think they have completely different goals.

"Little" cores are to run at very low power.

SMT is to better utilize idle functional units, to maximize the performance of area used. Though at the expense of more heat and power.

It does seem like Apple could reap significant SMT gains from its wide design packed with functional units, if it were feasible. I wouldn't be surprised if a future Apple high-performance core had SMT.

moinmoin · Dec 3, 2020

jeanlain said:
That's not a small market

Yes, the Mac desktop market for Apple is a small market totally dwarfed by the iPhone market, by the iPad market and by the Mac laptop market, all of which use and profit from mobile chips. Currently the same core is shared across all markets, with the largest seeing a new die every year, the others every two years. You seriously think Apple will add a non-mobile optimized SMT capable core that can't even be shared with any other market just for the Mac desktop market? Apple will keep the chips for all markets as close as possible, with only the configuration (number of cores/units) differing so that improvements can be directly shared across all possible products.

FPGAs are actually a good, cheaper way to add further value to higher-end desktop Macs; maybe Apple will double down on that.

bigggggggg · Dec 3, 2020

build2 compilation times:

Extrapolated means:
"Note that the results for the best mobile Intel (1185G) and AMD (4900HS) are unfortunately not yet available and the numbers above are extrapolated based on frequency and other benchmark results."

Source

jeanlain · Dec 3, 2020

bigggggggg said:
Well, x86 apps running through Rosetta could use specialized hardware from what I understand, but I don't know if this is the case, considering the test isn't doing any particular task on those images. And I found out that Lightroom can take advantage of CUDA to accelerate tasks, so if that were the case, the 4900HS + 2060 Super would certainly have won.

In the case of Lightroom, Adobe may use macOS APIs to render certain effects, decode raw files and/or convert to JPEG... Those APIs may automatically use the image signal processor or whatever specialised hardware is on the M1. That's the only explanation I have for the M1 beating the 4900HS + Nvidia 2060. This is not just a macOS thing, as the M1 using Rosetta also soundly beats the fastest Intel MacBook Pro.
I don't think Adobe makes heavy use of OS-specific APIs. They prefer using their own frameworks, but who knows?

Note that there were effects/corrections applied to the photos, in that test.

EDIT: macOS vs Windows could be a factor after all, as the 16" MacBook Pro is slightly faster than the Asus at that particular Lightroom task, despite having lower specs.

Bam360 · Dec 3, 2020

bigggggggg said:
build2 compilation times:
View attachment 35024
View attachment 35025

Extrapolated means:
"Note that the results for the best mobile Intel (1185G) and AMD (4900HS) are unfortunately not yet available and the numbers above are extrapolated based on frequency and other benchmark results."

Source

Well, this shows that the efficiency cores are fairly similar to an additional HT/SMT thread in terms of performance, though it obviously depends on the workload. So it makes sense that it loses in some multithreaded workloads against CPUs with 8 cores and 16 threads; the M1 is basically more similar to a 4/8 CPU being compared against an 8/16 CPU.

Doug S · Dec 3, 2020

Carfax83 said:
According to Apple themselves, it's 8. But I suppose in your "supreme knowledge" you have probably convinced yourself that you know more about the M1 than the people that actually designed it.

I think he's referring to all the other cores on it that aren't exposed to end users. The secure enclave has its own ARM core, does that make it a 9 core CPU? I imagine there are other ARM cores that handle other stuff but I have no idea what it all adds up to (and that doesn't count stuff like GPU, NPU, IPU etc. cores that don't execute ARM instructions).

Doug S · Dec 3, 2020

bigggggggg said:
But how is it possible in your opinion that M1 performs so well in multi-core/multi-thread tasks against 8c/16t CPUS, looking at SPEC2017 tests (even those that do not rely much on cache)?

The little cores are about 1/3 the performance of the big cores (at not much more than 1/10th the power) so the 4+4 M1 is roughly equivalent to 5 big cores.
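
Spelling out that arithmetic, using only the ratios given above (1/3 the performance and roughly 1/10 the power of a big core, both approximate estimates):

```python
from fractions import Fraction

BIG_CORES = 4
LITTLE_CORES = 4
LITTLE_PERF = Fraction(1, 3)    # relative to one big core (estimate from the post)
LITTLE_POWER = Fraction(1, 10)  # relative to one big core (estimate from the post)

big_core_equivalents = BIG_CORES + LITTLE_CORES * LITTLE_PERF
power_in_big_cores = BIG_CORES + LITTLE_CORES * LITTLE_POWER

extra_perf = LITTLE_CORES * LITTLE_PERF / BIG_CORES
extra_power = LITTLE_CORES * LITTLE_POWER / BIG_CORES

print(f"throughput: {float(big_core_equivalents):.2f} big-core equivalents")
print(f"power: {float(power_in_big_cores):.2f} big cores' worth")
print(f"little cores add {float(extra_perf):.0%} throughput "
      f"for {float(extra_power):.0%} more power")
```

That works out to about 5.3 big-core equivalents, with the four little cores adding roughly a third more throughput for about a tenth more power.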

As for why an 8c/16t CPU can't keep up, is it throttling? With Apple having fewer cores and using less power, a load that might throttle an 8c/16t would allow the M1 to run full speed and never throttle (other than maybe the fanless MacBook Air).

bigggggggg · Dec 3, 2020

Adobe may use macOS APIs to render certain effects, decode raw files and/or convert to jpeg...

OK, but the previous Photoshop version for macOS couldn't use such a DSP or whatever, because Intel processors didn't have one. Even if Rosetta can automatically use the GPU, it can't automatically map code onto an M1 DSP that never existed on Intel CPUs.
I don't know if any accelerator explains the results, because Adobe has probably had CUDA support for 10 years by now, and I think that if some sort of specialized accelerated function is used in Lightroom on macOS, it should also be used on Windows with an NVIDIA GPU.

Anyway, it is not possible to say how those results were obtained, since we don't know all the variables. The build2 benchmark I posted before is more interesting and replicable.

bigggggggg · Dec 3, 2020

Bam360 said:
M1 is basically more similar to a 4/8 CPU being compared against 8/16 CPU.

Agreed, though more tests are needed to analyze the behavior of 4+4 vs 4-only in many workloads.

Doug S · Dec 3, 2020

teejee said:
I don't think this is correct. Adding SMT to Apple's core would probably be very difficult without significant ST performance regression.
The M1 has the most advanced core on the market with extremely high IPC; suddenly making this work with two different threads would probably mess up the whole design.
Remember that Apple has never had to care about SMT in their development; I'm sure there are tons of big and small design decisions that have benefited from that.

So don't expect Apple to have SMT in their cores. They have chosen ultra-high IPC and efficiency cores instead.

Adding SMT does not affect the ST performance at all. Do you think Intel would have added SMT if it caused a significant ST performance regression? Heck, just look back at the performance of their CPUs before/after they added SMT, do you see a regression? How about for AMD? Or are you going to argue that somehow only Apple would see a performance regression, but Intel and AMD were somehow immune?

The question of adding SMT has nothing to do with "difficulty" (if Apple can design a CPU that's competitive with the best x86 in ST using a fraction of the power, this would present little challenge to them); it is a question of what your market is and whether it makes sense. Apple's market is overwhelmingly mobile, where there's no point to SMT since people don't run workstation/server levels of threading on a phone, and I would argue the little cores greatly reduce the utility of SMT.

Plus you introduce potential security headaches, as more attacks against CPUs with SMT enabled are discovered all the time.

SMT might make sense on say the Mac Pro, but that's too niche of a market for them to bother with all that effort when that's the only place it is used. If they start using their CPUs for their own servers, then I could see it being worth it - they'd just leave it permanently disabled on the phone/tablet and potential laptop cores.
