Discussion Apple Silicon SoC thread


Eug

Lifer
Mar 11, 2000
23,586
1,000
126
M1
5 nm
Unified memory architecture - LPDDR4X
16 billion transistors

8-core CPU

4 high-performance cores
192 KB instruction cache
128 KB data cache
Shared 12 MB L2 cache

4 high-efficiency cores
128 KB instruction cache
64 KB data cache
Shared 4 MB L2 cache
(Apple claims the 4 high-efficiency cores alone perform like a dual-core Intel MacBook Air)

8-core iGPU (but there is a 7-core variant, likely with one inactive core)
128 execution units
Up to 24576 concurrent threads
2.6 Teraflops
82 Gigatexels/s
41 gigapixels/s

16-core neural engine
Secure Enclave
USB 4

Products:
$999 ($899 edu) 13" MacBook Air (fanless) - 18 hour video playback battery life
$699 Mac mini (with fan)
$1299 ($1199 edu) 13" MacBook Pro (with fan) - 20 hour video playback battery life

Memory options 8 GB and 16 GB. No 32 GB option (unless you go Intel).

It should be noted that the M1 chip in these three Macs is the same (aside from GPU core count). Basically, Apple is taking the same approach with these chips as it does with the iPhones and iPads: just one SKU (excluding the X variants), which is the same across all iDevices (aside from maybe slight clock speed differences occasionally).

EDIT:


M1 Pro 8-core CPU (6+2), 14-core GPU
M1 Pro 10-core CPU (8+2), 14-core GPU
M1 Pro 10-core CPU (8+2), 16-core GPU
M1 Max 10-core CPU (8+2), 24-core GPU
M1 Max 10-core CPU (8+2), 32-core GPU

M1 Pro and M1 Max discussion here:


M1 Ultra discussion here:


M2 discussion here:


M2
Second-generation 5 nm
Unified memory architecture - LPDDR5, up to 24 GB and 100 GB/s
20 billion transistors

8-core CPU

4 high-performance cores
192 KB instruction cache
128 KB data cache
Shared 16 MB L2 cache

4 high-efficiency cores
128 KB instruction cache
64 KB data cache
Shared 4 MB L2 cache

10-core iGPU (but there is an 8-core variant)
3.6 Teraflops

16-core neural engine
Secure Enclave
USB 4

Hardware acceleration for 8K H.264, H.265 (HEVC), ProRes

M3 Family discussion here:

 

lopri

Elite Member
Jul 27, 2002
13,209
594
126
It's a giant problem for all the usual x86 players. I am a PC person, but if I were in a situation to spend $2K on a laptop, it would be difficult to justify getting anything other than the new MacBook.
 

software_engineer

Junior Member
Jul 26, 2020
8
11
41
Hmm... There are a few errors in there. For example, the Geekbench scores they provide as Mac mini M1 native are actually Rosetta scores.

This FLAC encoding one I find interesting though. The Rosetta score is way, way faster than the native M1 score. o_O

[Attachment: FLAC encoding benchmark chart (Rosetta vs. native M1)]

The x86 build of FLAC seems to make use of x86 SIMD intrinsics in addition to x86 assembly. I don't see any evidence of any use of ARM SIMD intrinsics or of ARM assembly in the FLAC codebase, so that is likely to explain the performance disparity between the native ARM build of FLAC and the x86 build of FLAC run via Rosetta.
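To illustrate the kind of gap being described, here is a hypothetical sketch (not code from the FLAC project): a plain scalar C loop next to the same computation written with x86 SSE intrinsics, the style of code Rosetta 2 can translate instruction-for-instruction:

```c
#include <stddef.h>
#ifdef __SSE2__
#include <emmintrin.h>
#endif

/* Scalar sum of squares -- what a plain ARM build gets if the
 * compiler does not auto-vectorize the loop. */
static float sum_squares_scalar(const float *x, size_t n) {
    float acc = 0.0f;
    for (size_t i = 0; i < n; i++)
        acc += x[i] * x[i];
    return acc;
}

#ifdef __SSE2__
/* Hand-vectorized x86 version: four floats per iteration.
 * Rosetta 2 statically translates SSE instructions like these
 * into NEON equivalents. */
static float sum_squares_sse(const float *x, size_t n) {
    __m128 vacc = _mm_setzero_ps();
    size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        __m128 v = _mm_loadu_ps(&x[i]);
        vacc = _mm_add_ps(vacc, _mm_mul_ps(v, v));
    }
    float tmp[4];
    _mm_storeu_ps(tmp, vacc);
    float acc = tmp[0] + tmp[1] + tmp[2] + tmp[3];
    for (; i < n; i++)          /* scalar tail */
        acc += x[i] * x[i];
    return acc;
}
#endif
```

The SSE version processes four samples per iteration; a native ARM build with neither NEON intrinsics nor auto-vectorization is stuck with the scalar loop, which is roughly the disparity the benchmark shows.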
 

jeanlain

Member
Oct 26, 2020
149
122
86
The x86 build of FLAC seems to make use of x86 SIMD intrinsics in addition to x86 assembly. I don't see any evidence of any use of ARM SIMD intrinsics or of ARM assembly in the FLAC codebase, so that is likely to explain the performance disparity between the native ARM build of FLAC and the x86 build of FLAC run via Rosetta.
So Rosetta generates better ARM code than humans?
 

moinmoin

Diamond Member
Jun 1, 2017
4,944
7,656
136
So Rosetta generates better ARM code than humans?
It's not far-fetched that a lot of existing ARM code is pretty bare-bones with regard to optimizations and that Rosetta is good at translating existing assembly and SIMD to ARM equivalents. To be honest, I'm positively impressed that there are so many cases where native is already clearly better than Rosetta for exactly this reason. Now imagine all code getting actually optimized for the M1's capabilities.
 

Heartbreaker

Diamond Member
Apr 3, 2006
4,226
5,227
136
The x86 build of FLAC seems to make use of x86 SIMD intrinsics in addition to x86 assembly. I don't see any evidence of any use of ARM SIMD intrinsics or of ARM assembly in the FLAC codebase, so that is likely to explain the performance disparity between the native ARM build of FLAC and the x86 build of FLAC run via Rosetta.

This is some indication of how these benchmarks might be skewed by heavy optimization for x86 versus unoptimized code for ARM.

I note that Kvazaar is also "written in the C programming language and optimized in Assembly". I would expect a lot of effort went into hand-tuned x86 assembly, versus none on the ARM side.
 

DrMrLordX

Lifer
Apr 27, 2000
21,612
10,816
136
The second sentence refers to the entire chip, aka SoC.

You actually bothered to differentiate between the two? Pff whatever. Next time say "SoC" if that's what you mean . . .

The tests by Phoronix are more representative.

Glad you pasted that! Though . . .

This FLAC encoding one I find interesting though. The Rosetta score is way, way faster than the native M1 score. o_O

Gonna have to take a minute to parse all that data, since Phoronix typically throws a lot of stuff at you and not necessarily in useful context, but it does look like some of Phoronix's attempts to natively compile FOSS for the M1 resulted in a lot of unoptimized code.

Mediocre showing overall ...

I don't necessarily agree. When running software that's been ready from day one (or nearly day one) from vendors optimizing specifically for M1, it looks really good. It's only going to lose some MT benchmarks to some higher-power CPUs that probably won't ever run Big Sur anyway. It has the usual Mac problems but it's hard to ding the M1 for that specifically.
 

ThatBuzzkiller

Golden Member
Nov 14, 2014
1,120
260
136
I don't necessarily agree. When running software that's been ready from day one (or nearly day one) from vendors optimizing specifically for M1, it looks really good. It's only going to lose some MT benchmarks to some higher-power CPUs that probably won't ever run Big Sur anyway. It has the usual Mac problems but it's hard to ding the M1 for that specifically.

The M1 basically boils down to high-quality Java/browser performance, but that's not a surprise since previous Apple-designed ICs were already good at those benchmarks and then some ...

If you look at the Rosetta numbers specifically, the M1 is mediocre given all its circumstances. Apple just wants to keep paying the emulation or high-level abstraction tax ...

Apple doesn't like low-level programming and in fact discourages it, since they don't want to release documentation for their CPUs like either AMD or Intel does. AMD and Intel will forever have the edge when they want developers to micro-optimize for their architectures ...
 

Heartbreaker

Diamond Member
Apr 3, 2006
4,226
5,227
136
The M1 basically boils down to high-quality Java/browser performance, but that's not a surprise since previous Apple-designed ICs were already good at those benchmarks and then some ...

If you look at the Rosetta numbers specifically, the M1 is mediocre given all its circumstances. Apple just wants to keep paying the emulation or high-level abstraction tax ...

Apple doesn't like low-level programming and in fact discourages it, since they don't want to release documentation for their CPUs like either AMD or Intel does. AMD and Intel will forever have the edge when they want developers to micro-optimize for their architectures ...

That is nonsense, start to finish.
 

amrnuke

Golden Member
Apr 24, 2019
1,181
1,772
136
The M1 basically boils down to high-quality Java/browser performance, but that's not a surprise since previous Apple-designed ICs were already good at those benchmarks and then some ...

If you look at the Rosetta numbers specifically, the M1 is mediocre given all its circumstances. Apple just wants to keep paying the emulation or high-level abstraction tax ...

Apple doesn't like low-level programming and in fact discourages it, since they don't want to release documentation for their CPUs like either AMD or Intel does. AMD and Intel will forever have the edge when they want developers to micro-optimize for their architectures ...
All 3 paragraphs are wrong. You could go research why, but I fear you won't.
 

DrMrLordX

Lifer
Apr 27, 2000
21,612
10,816
136
If you look at the Rosetta numbers specifically

You can safely ignore most of those. Rosetta 2 serves the same basic purpose that Rosetta did back in the day: a transition kludge to get M1 buyers through until software vendors compile and optimize with M1 as a target. Not all software will "make it", but you can already get a fair amount of native software, with more to come. Anyone who's serious about selling software on macOS needs to recompile. It's just that simple.

Try to look more at the M1 results that are native and (unlike the FLAC numbers) outperform the Rosetta 2 results from the same benchmark.
 

insertcarehere

Senior member
Jan 17, 2013
639
607
136
The M1 basically boils down to high-quality Java/browser performance, but that's not a surprise since previous Apple-designed ICs were already good at those benchmarks and then some ...

If you look at the Rosetta numbers specifically, the M1 is mediocre given all its circumstances. Apple just wants to keep paying the emulation or high-level abstraction tax ...
The M1 can't beat devices with discrete graphics when gaming under Rosetta, so it must be only good for JavaScript, am I right?
[Attachments: gaming benchmark charts]
 

Doug S

Platinum Member
Feb 8, 2020
2,248
3,476
136
It translates hand-tuned and hand-vectorized SIMD code into ARM SIMD instructions that are then run on powerful hardware at near-native speeds, while the actual ARM port is typical of current ARM ports: very little, if any, optimization.

Rosetta 2 doesn't handle AVX; it only goes up to SSE 4.2. If the native ARM code is just compiled, it may not be vectorized at all: often you need to arrange the source code in a certain way for the compiler to recognize it can be vectorized. So it is easy to see why static translation of the SSE 4.2 code path could be faster than native ARM code that doesn't use vectorization at all.

This isn't going to be a problem for long: stuff that is popular on the Mac will get optimized, vectorized ARM code (or maybe use the GPU, NPU, or ISP blocks to go even faster in certain cases).

Phoronix's tests were using various open source software packages popular on Linux that may not be used much at all on the Mac.
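The point about arranging source code so the compiler can vectorize it can be sketched with a generic example (not from any project discussed here): a loop the compiler must treat cautiously because the pointers might alias, and the same loop with `restrict`, which typically lets GCC/Clang auto-vectorize it with NEON or SSE:

```c
#include <stddef.h>

/* Possibly-aliasing version: the compiler must assume dst may overlap
 * src, which usually forces it to emit scalar code. */
void scale_may_alias(float *dst, const float *src, float k, size_t n) {
    for (size_t i = 0; i < n; i++)
        dst[i] = src[i] * k;
}

/* Promising the compiler the buffers do not overlap lets it vectorize
 * the very same loop automatically. */
void scale_no_alias(float *restrict dst, const float *restrict src,
                    float k, size_t n) {
    for (size_t i = 0; i < n; i++)
        dst[i] = src[i] * k;
}
```

Both functions compute the same result; only the promise made to the compiler differs, and in a build without hand-written intrinsics that promise is often the difference between scalar and vector code.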
 

dmens

Platinum Member
Mar 18, 2005
2,271
917
136
All AMD (or Intel) has to do is wait until they can transition to the latest process as well and they'll be able to automatically undo any of the gains that either Apple or any ARM vendor achieved in their designs ...

LOL why wait? Intel can use that vaunted war chest and pay TSMC for just a few 5nm wafer starts, then presto, problem solved, right?

Oh wait, Intel can fab their trash designs on TSMC and it would still be trash. Garbage in, garbage out. Sorry.
 

Thala

Golden Member
Nov 12, 2014
1,355
653
136
It's not farfetchted that a lot of existing ARM code is pretty barebone with regard to optimizations and that Rosetta is good at translating existing assembly and SIMD to ARM equivalents. To be honest I'm positively impressed that there are so many cases where native is already clearly better than Rosetta for exactly this reason. Now imagine all code getting actually optimized for M1's capability.

That is unfortunately a bigger problem with Phoronix's test suite. Many packages are heavily hand-optimized with x86 assembly and SIMD intrinsics.
 

LightningZ71

Golden Member
Mar 10, 2017
1,627
1,898
136
Apple chose ARM, fully knowing about the ecosystem surrounding it. I argue that the problem isn't with the Phoronix test suite. Instead, I argue that it's a problem for Apple, given that that's the state of software today. If they want to fix it, they can bloody well pay programmers to fix it for them.
 

Thala

Golden Member
Nov 12, 2014
1,355
653
136
Apple chose ARM, fully knowing about the ecosystem surrounding it. I argue that the problem isn't with the Phoronix test suite. Instead, I argue that it's a problem for Apple, given that that's the state of software today. If they want to fix it, they can bloody well pay programmers to fix it for them.

This was not my point. Of course the larger ARM ecosystem will eventually make sure that these issues get fixed; whether it is Apple, or anyone else like Amazon, or even the open source community does not really matter.
I am, for instance, looking into improving the Intel Embree library with respect to Arm NEON; it is used by Blender, Maxon Cinema 4D, and other 3D applications. If you compile Embree from the official sources, there is just a C++ code path for ARM available.
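For readers unfamiliar with what such a port involves: libraries like Embree are written against SSE intrinsics, and an ARM port typically maps each intrinsic to a NEON equivalent (the approach taken by shims such as sse2neon). A minimal hypothetical sketch, with a scalar fallback so it compiles anywhere; the `add4` helper is illustrative, not Embree code:

```c
#include <stddef.h>

#if defined(__ARM_NEON)
#include <arm_neon.h>
/* NEON path: vaddq_f32 is the direct equivalent of SSE's _mm_add_ps. */
static void add4(const float *a, const float *b, float *out) {
    vst1q_f32(out, vaddq_f32(vld1q_f32(a), vld1q_f32(b)));
}
#elif defined(__SSE__)
#include <xmmintrin.h>
/* SSE path, as the original x86-only code would have it. */
static void add4(const float *a, const float *b, float *out) {
    _mm_storeu_ps(out, _mm_add_ps(_mm_loadu_ps(a), _mm_loadu_ps(b)));
}
#else
/* Scalar fallback -- roughly what an unported C/C++ code path costs. */
static void add4(const float *a, const float *b, float *out) {
    for (size_t i = 0; i < 4; i++)
        out[i] = a[i] + b[i];
}
#endif
```

Porting a real library means doing this (or relying on a shim) for hundreds of intrinsics, which is why "just a C++ code path" is the common starting state for ARM builds.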
 

ricebunny2020

Junior Member
Nov 19, 2020
2
4
36
The x86 build of FLAC seems to make use of x86 SIMD intrinsics in addition to x86 assembly. I don't see any evidence of any use of ARM SIMD intrinsics or of ARM assembly in the FLAC codebase, so that is likely to explain the performance disparity between the native ARM build of FLAC and the x86 build of FLAC run via Rosetta.
Rosetta does not support the translation of AVX, AVX2, and AVX512 instructions.
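This matters in practice because x86 codebases typically pick their SIMD path at runtime via CPUID, so under Rosetta 2, which does not report AVX support, the same binary simply falls back to its SSE path. A generic sketch of such dispatch (not FLAC's actual code), using the GCC/Clang builtin `__builtin_cpu_supports`:

```c
#include <stddef.h>

static long sum_scalar(const int *x, size_t n) {
    long acc = 0;
    for (size_t i = 0; i < n; i++)
        acc += x[i];
    return acc;
}

/* In a real codebase each branch would call a differently-vectorized
 * kernel; here they share one body to keep the sketch short. */
static long sum_dispatch(const int *x, size_t n) {
#if defined(__x86_64__) || defined(__i386__)
    if (__builtin_cpu_supports("avx2"))
        return sum_scalar(x, n);   /* would be the AVX2 kernel */
    if (__builtin_cpu_supports("sse4.2"))
        return sum_scalar(x, n);   /* would be the SSE 4.2 kernel --
                                      the path Rosetta 2 lands on */
#endif
    return sum_scalar(x, n);       /* portable fallback */
}
```

Since every branch returns a correct result, an x86 binary degrades gracefully under Rosetta 2 rather than crashing; it just never reaches its fastest kernels.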