Discussion Apple Silicon SoC thread


Eug

Lifer
Mar 11, 2000
23,586
1,000
126
M1
5 nm
Unified memory architecture - LPDDR4X
16 billion transistors

8-core CPU

4 high-performance cores
192 KB instruction cache
128 KB data cache
Shared 12 MB L2 cache

4 high-efficiency cores
128 KB instruction cache
64 KB data cache
Shared 4 MB L2 cache
(Apple claims the 4 high-efficiency cores alone perform like a dual-core Intel MacBook Air)

8-core iGPU (but there is a 7-core variant, likely with one inactive core)
128 execution units
Up to 24576 concurrent threads
2.6 Teraflops
82 Gigatexels/s
41 gigapixels/s

16-core neural engine
Secure Enclave
USB 4

Products:
$999 ($899 edu) 13" MacBook Air (fanless) - 18 hour video playback battery life
$699 Mac mini (with fan)
$1299 ($1199 edu) 13" MacBook Pro (with fan) - 20 hour video playback battery life

Memory options 8 GB and 16 GB. No 32 GB option (unless you go Intel).

It should be noted that the M1 chip in these three Macs is the same (aside from GPU core count). Basically, Apple is taking the same approach with these chips as it does with the iPhones and iPads: just one SKU (excluding the X variants), which is the same across all iDevices (aside from maybe slight clock speed differences occasionally).

EDIT:


M1 Pro 8-core CPU (6+2), 14-core GPU
M1 Pro 10-core CPU (8+2), 14-core GPU
M1 Pro 10-core CPU (8+2), 16-core GPU
M1 Max 10-core CPU (8+2), 24-core GPU
M1 Max 10-core CPU (8+2), 32-core GPU

M1 Pro and M1 Max discussion here:


M1 Ultra discussion here:


M2 discussion here:


M2
Second-generation 5 nm
Unified memory architecture - LPDDR5, up to 24 GB and 100 GB/s
20 billion transistors

8-core CPU

4 high-performance cores
192 KB instruction cache
128 KB data cache
Shared 16 MB L2 cache

4 high-efficiency cores
128 KB instruction cache
64 KB data cache
Shared 4 MB L2 cache

10-core iGPU (but there is an 8-core variant)
3.6 Teraflops

16-core neural engine
Secure Enclave
USB 4

Hardware acceleration for 8K H.264, H.265 (HEVC), ProRes

M3 Family discussion here:

 

lopri

Elite Member
Jul 27, 2002
13,209
594
126
It's a giant problem for all the usual x86 players. I am a PC person, but if I were in a situation to spend $2K on a laptop, it would be difficult to justify getting anything other than the new MacBook.
 

software_engineer

Junior Member
Jul 26, 2020
8
11
41
Hmm... There are a few errors in there. For example, the Geekbench scores they provide as Mac mini M1 native are actually Rosetta scores.

This FLAC encoding one I find interesting though. The Rosetta score is way, way faster than the native M1 score. o_O

[Attachment: FLAC encoding benchmark chart (Rosetta vs. native M1)]

The x86 build of FLAC seems to make use of x86 SIMD intrinsics in addition to x86 assembly. I don't see any evidence of any use of ARM SIMD intrinsics or of ARM assembly in the FLAC codebase, so that is likely to explain the performance disparity between the native ARM build of FLAC and the x86 build of FLAC run via Rosetta.
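To illustrate the kind of gap being described, here is a hypothetical sketch (not code from the FLAC project): a plain scalar C loop next to the same computation written with x86 SSE intrinsics, the style of code Rosetta 2 can translate instruction-for-instruction:

```c
#include <stddef.h>
#ifdef __SSE2__
#include <emmintrin.h>
#endif

/* Scalar sum of squares -- what a plain ARM build gets if the
 * compiler does not auto-vectorize the loop. */
static float sum_squares_scalar(const float *x, size_t n) {
    float acc = 0.0f;
    for (size_t i = 0; i < n; i++)
        acc += x[i] * x[i];
    return acc;
}

#ifdef __SSE2__
/* Hand-vectorized x86 version: four floats per iteration.
 * Rosetta 2 statically translates SSE instructions like these
 * into NEON equivalents. */
static float sum_squares_sse(const float *x, size_t n) {
    __m128 vacc = _mm_setzero_ps();
    size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        __m128 v = _mm_loadu_ps(&x[i]);
        vacc = _mm_add_ps(vacc, _mm_mul_ps(v, v));
    }
    float tmp[4];
    _mm_storeu_ps(tmp, vacc);
    float acc = tmp[0] + tmp[1] + tmp[2] + tmp[3];
    for (; i < n; i++)          /* scalar tail */
        acc += x[i] * x[i];
    return acc;
}
#endif
```

The SSE version processes four samples per iteration; a native ARM build with neither NEON intrinsics nor auto-vectorization is stuck with the scalar loop, which is roughly the disparity the benchmark shows.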
 

jeanlain

Member
Oct 26, 2020
149
122
86
The x86 build of FLAC seems to make use of x86 SIMD intrinsics in addition to x86 assembly. I don't see any evidence of any use of ARM SIMD intrinsics or of ARM assembly in the FLAC codebase, so that is likely to explain the performance disparity between the native ARM build of FLAC and the x86 build of FLAC run via Rosetta.
So Rosetta generates better ARM code than humans?
 

moinmoin

Diamond Member
Jun 1, 2017
4,944
7,656
136
So Rosetta generates better ARM code than humans?
It's not far-fetched that a lot of existing ARM code is pretty bare-bones with regard to optimizations and that Rosetta is good at translating existing assembly and SIMD to ARM equivalents. To be honest, I'm positively impressed that there are so many cases where native is already clearly better than Rosetta for exactly this reason. Now imagine all code getting actually optimized for the M1's capabilities.
 

Heartbreaker

Diamond Member
Apr 3, 2006
4,226
5,227
136
The x86 build of FLAC seems to make use of x86 SIMD intrinsics in addition to x86 assembly. I don't see any evidence of any use of ARM SIMD intrinsics or of ARM assembly in the FLAC codebase, so that is likely to explain the performance disparity between the native ARM build of FLAC and the x86 build of FLAC run via Rosetta.

This is some indication of how these benchmarks might be skewed by heavy optimization for x86 versus unoptimized code for ARM.

I note that Kvazaar is also "written in the C programming language and optimized in Assembly". I would expect a lot of effort went into hand-tuned x86 assembly, versus none on the ARM side.
 

DrMrLordX

Lifer
Apr 27, 2000
21,612
10,816
136
The second sentence refers to the entire chip, aka SoC.

You actually bothered to differentiate between the two? Pff whatever. Next time say "SoC" if that's what you mean . . .

The tests by Phoronix are more representative.

Glad you pasted that! Though . . .

This FLAC encoding one I find interesting though. The Rosetta score is way, way faster than the native M1 score. o_O

Gonna have to take a minute to parse all that data, since Phoronix typically throws a lot of stuff at you and not necessarily in useful context, but it does look like some of Phoronix's attempts to natively compile FOSS for the M1 resulted in a lot of unoptimized code.

Mediocre showing overall ...

I don't necessarily agree. When running software that's been ready from day one (or nearly day one) from vendors optimizing specifically for M1, it looks really good. It's only going to lose some MT benchmarks to some higher-power CPUs that probably won't ever run Big Sur anyway. It has the usual Mac problems but it's hard to ding the M1 for that specifically.
 

ThatBuzzkiller

Golden Member
Nov 14, 2014
1,120
260
136
I don't necessarily agree. When running software that's been ready from day one (or nearly day one) from vendors optimizing specifically for M1, it looks really good. It's only going to lose some MT benchmarks to some higher-power CPUs that probably won't ever run Big Sur anyway. It has the usual Mac problems but it's hard to ding the M1 for that specifically.

The M1 basically boils down to high-quality Java/browser performance, but that's not a surprise since previous Apple-designed ICs were already good at those benchmarks and then some ...

If you look at the Rosetta numbers specifically, the M1 is mediocre given all its circumstances. Apple just wants to keep paying the emulation or high-level abstraction tax ...

Apple doesn't like low-level programming and in fact discourages it, since they don't want to release documentation for their CPUs like either AMD or Intel does. AMD and Intel will forever have the edge when they want developers to micro-optimize for their architectures ...
 

Heartbreaker

Diamond Member
Apr 3, 2006
4,226
5,227
136
The M1 basically boils down to high-quality Java/browser performance, but that's not a surprise since previous Apple-designed ICs were already good at those benchmarks and then some ...

If you look at the Rosetta numbers specifically, the M1 is mediocre given all its circumstances. Apple just wants to keep paying the emulation or high-level abstraction tax ...

Apple doesn't like low-level programming and in fact discourages it, since they don't want to release documentation for their CPUs like either AMD or Intel does. AMD and Intel will forever have the edge when they want developers to micro-optimize for their architectures ...

That is nonsense, start to finish.
 

amrnuke

Golden Member
Apr 24, 2019
1,181
1,772
136
The M1 basically boils down to high-quality Java/browser performance, but that's not a surprise since previous Apple-designed ICs were already good at those benchmarks and then some ...

If you look at the Rosetta numbers specifically, the M1 is mediocre given all its circumstances. Apple just wants to keep paying the emulation or high-level abstraction tax ...

Apple doesn't like low-level programming and in fact discourages it, since they don't want to release documentation for their CPUs like either AMD or Intel does. AMD and Intel will forever have the edge when they want developers to micro-optimize for their architectures ...
All 3 paragraphs are wrong. You could go research why, but I fear you won't.
 

DrMrLordX

Lifer
Apr 27, 2000
21,612
10,816
136
If you look at the Rosetta numbers specifically

You can safely ignore most of those. Rosetta 2 serves the same basic purpose that Rosetta did back in the day: a transition kludge to get M1 buyers through until software vendors compile and optimize with M1 as a target. Not all software will "make it", but you can already get a fair amount of native software, with more to come. Anyone who's serious about selling software on macOS needs to recompile. It's just that simple.

Try to look more at the M1 results that are native and (unlike the FLAC numbers) outperform the Rosetta 2 results from the same benchmark.
 

insertcarehere

Senior member
Jan 17, 2013
639
607
136
The M1 basically boils down to high-quality Java/browser performance, but that's not a surprise since previous Apple-designed ICs were already good at those benchmarks and then some ...

If you look at the Rosetta numbers specifically, the M1 is mediocre given all its circumstances. Apple just wants to keep paying the emulation or high-level abstraction tax ...
The M1 can't beat devices with discrete graphics when gaming under Rosetta, so it must be only good for JavaScript, am I right?
[Attachments: gaming benchmark charts]
 

Doug S

Platinum Member
Feb 8, 2020
2,248
3,476
136
It translates hand-tuned and hand-vectorized SIMD code into ARM SIMD instructions that are then run on powerful hardware at near-native speeds, while the actual ARM port is typical of current ARM ports: very little, if any, optimization.

Rosetta 2 doesn't handle AVX; it only goes up to SSE 4.2. If the native ARM code is just compiled, it may not be vectorized at all: often you need to arrange the source code in a certain way for the compiler to recognize it can be vectorized. So it is easy to see why static translation of the SSE 4.2 code path could be faster than native ARM code that doesn't use vectorization at all.

This isn't going to be a problem for long: stuff that is popular on the Mac will get optimized, vectorized ARM code (or maybe use the GPU, NPU, or ISP blocks to go even faster in certain cases).

Phoronix's tests were using various open source software packages popular on Linux that may not be used much at all on the Mac.
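The point about arranging source code so the compiler can vectorize it can be sketched with a generic example (not from any project discussed here): a loop the compiler must treat cautiously because the pointers might alias, and the same loop with `restrict`, which typically lets GCC/Clang auto-vectorize it with NEON or SSE:

```c
#include <stddef.h>

/* Possibly-aliasing version: the compiler must assume dst may overlap
 * src, which usually forces it to emit scalar code. */
void scale_may_alias(float *dst, const float *src, float k, size_t n) {
    for (size_t i = 0; i < n; i++)
        dst[i] = src[i] * k;
}

/* Promising the compiler the buffers do not overlap lets it vectorize
 * the very same loop automatically. */
void scale_no_alias(float *restrict dst, const float *restrict src,
                    float k, size_t n) {
    for (size_t i = 0; i < n; i++)
        dst[i] = src[i] * k;
}
```

Both functions compute the same result; only the promise made to the compiler differs, and in a build without hand-written intrinsics that promise is often the difference between scalar and vector code.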
 

dmens

Platinum Member
Mar 18, 2005
2,271
917
136
All AMD (or Intel) has to do is wait until they can transition to the latest process as well and they'll be able to automatically undo any of the gains that either Apple or any ARM vendor achieved in their designs ...

LOL why wait? Intel can use that vaunted war chest and pay TSMC for just a few 5nm wafer starts, then presto, problem solved, right?

Oh wait, Intel can fab their trash designs on TSMC and it would still be trash. Garbage in, garbage out. Sorry.
 

Thala

Golden Member
Nov 12, 2014
1,355
653
136
It's not farfetchted that a lot of existing ARM code is pretty barebone with regard to optimizations and that Rosetta is good at translating existing assembly and SIMD to ARM equivalents. To be honest I'm positively impressed that there are so many cases where native is already clearly better than Rosetta for exactly this reason. Now imagine all code getting actually optimized for M1's capability.

That is unfortunately a bigger problem with Phoronix's test suite. Many packages are heavily hand-optimized with x86 assembly and SIMD intrinsics.
 

LightningZ71

Golden Member
Mar 10, 2017
1,627
1,898
136
Apple chose ARM, fully knowing about the ecosystem surrounding it. I argue that the problem isn't with the Phoronix test suite. Instead, I argue that it's a problem for Apple, given that that's the state of software today. If they want to fix it, they can bloody well pay programmers to fix it for them.
 

Thala

Golden Member
Nov 12, 2014
1,355
653
136
Apple chose ARM, fully knowing about the ecosystem surrounding it. I argue that the problem isn't with the Phoronix test suite. Instead, I argue that it's a problem for Apple, given that that's the state of software today. If they want to fix it, they can bloody well pay programmers to fix it for them.

This was not my point. Of course the larger ARM ecosystem will eventually make sure that these issues get fixed; whether it is Apple, or anyone else like Amazon, or even the open source community does not really matter.
I am, for instance, looking into improving the Intel Embree library with respect to Arm NEON; it is used by Blender, Maxon Cinema 4D, and other 3D applications. If you compile Embree from the official sources, there is just a C++ code path for ARM available.
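For readers unfamiliar with what such a port involves: libraries like Embree are written against SSE intrinsics, and an ARM port typically maps each intrinsic to a NEON equivalent (the approach taken by shims such as sse2neon). A minimal hypothetical sketch, with a scalar fallback so it compiles anywhere; the `add4` helper is illustrative, not Embree code:

```c
#include <stddef.h>

#if defined(__ARM_NEON)
#include <arm_neon.h>
/* NEON path: vaddq_f32 is the direct equivalent of SSE's _mm_add_ps. */
static void add4(const float *a, const float *b, float *out) {
    vst1q_f32(out, vaddq_f32(vld1q_f32(a), vld1q_f32(b)));
}
#elif defined(__SSE__)
#include <xmmintrin.h>
/* SSE path, as the original x86-only code would have it. */
static void add4(const float *a, const float *b, float *out) {
    _mm_storeu_ps(out, _mm_add_ps(_mm_loadu_ps(a), _mm_loadu_ps(b)));
}
#else
/* Scalar fallback -- roughly what an unported C/C++ code path costs. */
static void add4(const float *a, const float *b, float *out) {
    for (size_t i = 0; i < 4; i++)
        out[i] = a[i] + b[i];
}
#endif
```

Porting a real library means doing this (or relying on a shim) for hundreds of intrinsics, which is why "just a C++ code path" is the common starting state for ARM builds.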
 

ricebunny2020

Junior Member
Nov 19, 2020
2
4
36
The x86 build of FLAC seems to make use of x86 SIMD intrinsics in addition to x86 assembly. I don't see any evidence of any use of ARM SIMD intrinsics or of ARM assembly in the FLAC codebase, so that is likely to explain the performance disparity between the native ARM build of FLAC and the x86 build of FLAC run via Rosetta.
Rosetta does not support the translation of AVX, AVX2, and AVX512 instructions.
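This matters in practice because x86 codebases typically pick their SIMD path at runtime via CPUID, so under Rosetta 2, which does not report AVX support, the same binary simply falls back to its SSE path. A generic sketch of such dispatch (not FLAC's actual code), using the GCC/Clang builtin `__builtin_cpu_supports`:

```c
#include <stddef.h>

static long sum_scalar(const int *x, size_t n) {
    long acc = 0;
    for (size_t i = 0; i < n; i++)
        acc += x[i];
    return acc;
}

/* In a real codebase each branch would call a differently-vectorized
 * kernel; here they share one body to keep the sketch short. */
static long sum_dispatch(const int *x, size_t n) {
#if defined(__x86_64__) || defined(__i386__)
    if (__builtin_cpu_supports("avx2"))
        return sum_scalar(x, n);   /* would be the AVX2 kernel */
    if (__builtin_cpu_supports("sse4.2"))
        return sum_scalar(x, n);   /* would be the SSE 4.2 kernel --
                                      the path Rosetta 2 lands on */
#endif
    return sum_scalar(x, n);       /* portable fallback */
}
```

Since every branch returns a correct result, an x86 binary degrades gracefully under Rosetta 2 rather than crashing; it just never reaches its fastest kernels.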