
Question New Apple SoC - M1 - For lower end Macs - Geekbench 5 single-core >1700


Thala

Golden Member
Nov 12, 2014
1,127
440
136
Apple chose ARM, fully knowing about the ecosystem surrounding it. I argue that the problem isn't with the phoronix test suite. Instead, I argue that it's a problem for Apple given that that's the state of software today. If they want to fix it, they can bloody well pay programmers to fix it for them.
This was not my point. Of course the larger ARM ecosystem will eventually make sure that these issues get fixed - whether it is Apple, or anyone else like Amazon, or even the open source community, does not really matter.
I am, for instance, looking into improving the Intel Embree library with respect to Arm NEON - it is used by Blender, Maxon Cinema and other 3D applications. If you compile Embree from the official sources, there is just a plain C++ code path available for ARM.
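To give a feel for what such a port involves, here is a minimal sketch of the kind of NEON kernel that gets added alongside a generic C/C++ fallback. The function and names are mine for illustration, not Embree's:

```c
/* Sketch only: a 4-wide multiply-add kernel using Arm NEON intrinsics.
 * Names are hypothetical, not taken from the Embree codebase. */
#include <arm_neon.h>

/* out[i] = a[i] * b[i] + c[i]; n must be a multiple of 4 */
void madd4(float *out, const float *a, const float *b,
           const float *c, int n)
{
    for (int i = 0; i < n; i += 4) {
        float32x4_t va = vld1q_f32(a + i);   /* load 4 lanes */
        float32x4_t vb = vld1q_f32(b + i);
        float32x4_t vc = vld1q_f32(c + i);
        /* vfmaq_f32(acc, x, y) computes acc + x*y, fused, per lane */
        vst1q_f32(out + i, vfmaq_f32(vc, va, vb));
    }
}
```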
 

ricebunny2020

Junior Member
Nov 19, 2020
2
4
36
The x86 build of FLAC seems to make use of x86 SIMD intrinsics in addition to x86 assembly. I don't see any evidence of any use of ARM SIMD intrinsics or of ARM assembly in the FLAC codebase, so that is likely to explain the performance disparity between the native ARM build of FLAC and the x86 build of FLAC run via Rosetta.
Rosetta does not support the translation of AVX, AVX2, and AVX512 instructions.
 

software_engineer

Junior Member
Jul 26, 2020
6
9
41
Rosetta does not support the translation of AVX, AVX2, and AVX512 instructions.
Is this also the case for the SSE family of SIMD instruction sets? The FLAC codebase also has code paths that make use of SIMD intrinsics for each version of the SSE instruction family. In theory, SSE instructions making use of the 128-bit XMM registers should be translatable to ARM NEON instructions, which also operate on 128-bit vectors. However, SSE instructions may not map well onto NEON instructions, as NEON is a 3-operand instruction set whilst SSE is a destructive 2-operand instruction set.
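For what it's worth, here is a small sketch of that operand-count difference (hypothetical function, compiled once per target). The SSE intrinsic lowers to a destructive 2-operand instruction, while the NEON one lowers to a 3-operand instruction that leaves its inputs intact:

```c
#if defined(__SSE__)
#include <xmmintrin.h>
__m128 add4(__m128 a, __m128 b)
{
    /* lowers to destructive 2-operand "addps xmm0, xmm1": the
     * destination register is overwritten, so the compiler must copy
     * a first whenever its old value is still live */
    return _mm_add_ps(a, b);
}
#elif defined(__ARM_NEON)
#include <arm_neon.h>
float32x4_t add4(float32x4_t a, float32x4_t b)
{
    /* lowers to 3-operand "fadd v0.4s, v1.4s, v2.4s": the result goes
     * to a separate destination register, so a and b survive untouched */
    return vaddq_f32(a, b);
}
#endif
```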
 

moinmoin

Platinum Member
Jun 1, 2017
2,069
2,474
106
That is unfortunately a bigger problem with Phoronix' test suite. Many packages are heavily hand optimized with x86 assembly and SIMD intrinsics.
Calling that "unfortunate" and a "bigger problem" sounds as if it would be a bad thing for code to be optimized; would that then also apply to any M1 optimizations?

I consider Phoronix a good representation of the open source userland that anyone familiar with Unix-like systems may be interested in using. As such, it does a good job of showing the current status quo of just that on an M1-based system. And as I wrote before, I personally consider the results positively impressive: even with Rosetta and the lack of native optimizations or ports altogether, performance is already competitive. And Apple Silicon Macs are very likely going to be popular enough to invite such ports and optimizations from the scene. So if one can live with a locked-down system while working with a Unix-like userland, those Macs are bound to be a great additional option.
 

JoeRambo

Senior member
Jun 13, 2013
912
645
136
Is this also the case for the SSE family of SIMD instruction sets? The FLAC codebase also has code paths that make use of SIMD intrinsics for each version of the SSE instruction family. In theory, SSE instructions making use of the 128-bit XMM registers should be translatable to ARM NEON instructions, which also operate on 128-bit vectors. However, SSE instructions may not map well onto NEON instructions, as NEON is a 3-operand instruction set whilst SSE is a destructive 2-operand instruction set.

SSE2 is baseline and is emulated by Rosetta 2. In fact, AVX and AVX2 were already "optional", and properly written software had CPU feature detection and code-path selection. So all the Rosetta 2 emulation layer has to do in those cases is report that it does not support AVX+, and the SSE code path is used instead.
The problem with the SSE family was always the low number of architectural registers and, as you mention, the destructive type of operations. Meaning a lot of algorithms have a ton of register spills, moves, saves, etc.
When doing static translation like Rosetta does, no one is forcing Apple to do the mapping 1:1; they can simply apply optimizations enabled by 3-operand ops and by having 32 architectural registers instead of the 8 SSE has.
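A minimal sketch of that detection-and-dispatch pattern, assuming GCC/Clang's x86-only __builtin_cpu_supports(); the path bodies here are scalar stand-ins, not real SIMD:

```c
#include <stddef.h>

/* Scalar stand-ins; the real paths would contain AVX2/SSE2 intrinsics. */
static void sum_avx2(float *dst, const float *src, size_t n)
{ for (size_t i = 0; i < n; i++) dst[i] += src[i]; }

static void sum_sse2(float *dst, const float *src, size_t n)
{ for (size_t i = 0; i < n; i++) dst[i] += src[i]; }

void sum(float *dst, const float *src, size_t n)
{
    /* GCC/Clang builtin, x86 only. Under Rosetta 2 the AVX2 query
     * reports false, so the translated binary takes the SSE2 path. */
    if (__builtin_cpu_supports("avx2"))
        sum_avx2(dst, src, n);
    else
        sum_sse2(dst, src, n);
}
```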
 

ThatBuzzkiller

Golden Member
Nov 14, 2014
1,002
153
106
That is nonsense, start to finish.
All 3 paragraphs are wrong. You could go research why, but I fear you won't.
It's pretty much the truth, sadly ...

They're touting mediocre emulation performance on 5nm chips, and they don't want to release optimization manuals like AMD or Intel would, because they'd prefer to keep changing architectures rather than have developers optimize their software for hardware designs that are inconsistent from generation to generation ...

Rosetta numbers will arguably be closer to reality than many people think, because why even bother optimizing for Apple platforms when they don't want to guarantee compatibility?
 

guidryp

Senior member
Apr 3, 2006
587
461
136
Apple chose ARM, fully knowing about the ecosystem surrounding it. I argue that the problem isn't with the phoronix test suite. Instead, I argue that it's a problem for Apple given that that's the state of software today. If they want to fix it, they can bloody well pay programmers to fix it for them.
There is nothing Apple needs to "fix" here. Phoronix is a poor choice for cross platform benchmarking, and it's not really representative of commercial software that most people will run.

People lambasted Geekbench for years, but the reality is they make an effort to actually have the benchmark fairly optimized for each platform.

The Phoronix benchmarks are the complete opposite: heavily hand-tuned x86 assembler, completely biased in favor of x86.

It's also not representative of what you will find in most commercial software. Optimizing assembler is a hobby activity these days, not how production software is built.
 

amrnuke

Senior member
Apr 24, 2019
998
1,496
96
It's pretty much the truth, sadly ...

They're touting mediocre emulation performance on 5nm chips, and they don't want to release optimization manuals like AMD or Intel would, because they'd prefer to keep changing architectures rather than have developers optimize their software for hardware designs that are inconsistent from generation to generation ...

Rosetta numbers will arguably be closer to reality than many people think, because why even bother optimizing for Apple platforms when they don't want to guarantee compatibility?
The M1 is as fast as the Mac Pro in WebKit compile time (according to one reviewer - even coming close is absurd). It compiles Xcode as fast as a 32-thread 3950X Hackintosh. It's within 20% of the i9+dGPU MBP 16" in Final Cut rendering. Apple gets the same or better x86 performance in their Mac mini, MBA, and MBP13 via emulation, much better native performance, for less cost and less power consumption.

Not releasing documentation for their CPU does not mean they don't like low-level programming or actively discourage it. Regardless, I'd like you to expand on what you mean by "low-level programming". I'm also curious what you mean by "keep changing architectures". I'd also like to hear what you mean by inconsistent hardware designs and how that affects software development. Provide examples so we can actually have a discussion.
 

DrMrLordX

Lifer
Apr 27, 2000
16,627
5,634
136
Same as it's always been, squeezing out every performance advantage.
... ah huh.

Look, based on what I've seen, unless you're dealing with highly-specialized-and-difficult-to-autovectorize code, hand-tuned asm doesn't really provide a performance advantage over what a standard C/C++ compiler can crank out on its own. Even stuff like OpenMP can work some of the time (though the limited exposure I had to it showed me it sucked, at least as of a few years ago). If you're dealing with the JVM (Java/Scala/Kotlin/etc.) then it can autovectorize extremely well via heuristics, and I've even gotten one of my own applications running quite well on my phone's Snapdragon 855+ in Java using that technique. Actually all I had to do was take the exact .class files I compiled for my x86-64 targets years ago and run them with a full JVM implementation via UserLAnD, and it worked quite well. That Snapdragon 855+ blew the doors off my old A10-7700k and A10-7870k. But I digress.
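As a minimal illustration of that point (my example, not from any of the packages discussed), a plain scalar loop like the following gets autovectorized by current GCC/Clang at -O3 into packed SSE/AVX on x86-64 and NEON on AArch64, with no hand-written assembly involved:

```c
/* saxpy: y[i] = a*x[i] + y[i]. Built with "gcc -O3" or "clang -O3",
 * this plain loop comes out as packed SIMD on both x86-64 and AArch64;
 * restrict tells the compiler the two arrays don't overlap. */
void saxpy(float *restrict y, const float *restrict x, float a, int n)
{
    for (int i = 0; i < n; i++)
        y[i] = a * x[i] + y[i];
}
```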

Outside of the obvious problems with the Phoronix test data when Rosetta2 was involved, you also have to consider that Phoronix was (probably) compiling "native" M1 binaries from source that was written with an x86-64 target. There probably wasn't a whole heck of a lot more "hand-tuned assembler" involved in any of those applications than there is in "commercial software packages". The issue is: what is the compiler going to do when it hits AVX or AVX2 code paths? Answer: puke all over the place, or revert to a non-SIMD code path since M1 doesn't support any of x86's SIMD at all. If I have an application with SSE2, 3, 4.x, AVX, and AVX2 code paths, the compiler can't use ANY of that on an M1 if I just try compiling straight up. It's not going to even try spitting out any NEON. The code would require a NEON code path for ARM targets. If that code path isn't there, too bad.

My guess is that Phoronix's FLAC results are affected by them compiling from source to a binary that has no SIMD support at all, which would explain the horrendous performance. You'd better believe that SIMD makes a big difference in performance under a benchmark like that.
 

wlee15

Senior member
Jan 7, 2009
302
6
81
Meaning a lot of algorithms have a ton of register spills, moves, saves, etc.
When doing static translation like Rosetta does, no one is forcing Apple to do the mapping 1:1; they can simply apply optimizations enabled by 3-operand ops and by having 32 architectural registers instead of the 8 SSE has.
x86-64 has 16 vector registers. The EVEX prefix, introduced alongside AVX-512, extends that to 32.
 

Thala

Golden Member
Nov 12, 2014
1,127
440
136
If that is the case, then what's the advantage to "hand tuned Assembler"?
Hand-tuned assembler is very rare. Compared to intrinsics, you can hope to do better register allocation, but even that would be rare these days. You also have more control over loop unrolling and software pipelining, which can help.
Speaking of assembler, I have written a SHA-512 crypto implementation for ARMv8 in pure assembler not long ago - where I unrolled the round-loop 80 times :)

But yes, in general you just want to use the provided SIMD intrinsics and leave the rest to the compiler.
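A rough sketch of that division of labor (my example): write the vector operations as intrinsics, request the unroll factor with a pragma if you care about it, and leave register allocation entirely to the compiler:

```c
#include <arm_neon.h>

/* x[i] *= s, with n a multiple of 4. The intrinsics pin down the
 * vector operations, the pragma requests the unroll factor explicitly
 * (GCC spelling: #pragma GCC unroll 4), and register allocation is
 * still left to the compiler. */
void scale(float *x, float s, int n)
{
    float32x4_t vs = vdupq_n_f32(s);
#pragma clang loop unroll_count(4)
    for (int i = 0; i < n; i += 4)
        vst1q_f32(x + i, vmulq_f32(vld1q_f32(x + i), vs));
}
```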
 

Thala

Golden Member
Nov 12, 2014
1,127
440
136
The issue is: what is the compiler going to do when it hits AVX or AVX2 code paths? Answer: puke all over the place, or revert to a non-SIMD code path since M1 doesn't support any of x86's SIMD at all.
Technically you have to explicitly make sure that the ARM compiler never gets to see the SSE/AVX code paths, either via preprocessor directives or by separating them into different files.
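A minimal sketch of that separation (hypothetical function; bigger projects often put these paths in separate source files selected by the build system): the preprocessor strips the SSE code before the AArch64 compiler ever parses it.

```c
#if defined(__x86_64__) || defined(__i386__)
#include <emmintrin.h>
void clear4(float *p) { _mm_storeu_ps(p, _mm_setzero_ps()); }   /* SSE path */
#elif defined(__ARM_NEON)
#include <arm_neon.h>
void clear4(float *p) { vst1q_f32(p, vdupq_n_f32(0.0f)); }      /* NEON path */
#else
void clear4(float *p) { for (int i = 0; i < 4; i++) p[i] = 0.0f; } /* scalar */
#endif
```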
 
