Discussion Apple Silicon M series thread


software_engineer

Junior Member
Jul 26, 2020
8
11
41
Rosetta does not support the translation of AVX, AVX2, and AVX512 instructions.
Is this also the case for the SSE family of SIMD instruction sets? The FLAC codebase also has code paths that use SIMD intrinsics for each version of the SSE instruction family. In theory, SSE instructions that use the 128-bit XMM registers should be translatable to ARM NEON instructions, which also operate on 128-bit vectors. However, SSE instructions may not map well onto NEON instructions, as NEON is a 3-operand instruction set whilst SSE is a destructive 2-operand instruction set.
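To make the operand-count point concrete, here is a minimal sketch in plain C that models the two instruction shapes. The vec128 type and the add2/add3/add3_via_add2 helpers are hypothetical names made up for illustration; nothing here reflects how Rosetta or any compiler actually represents these operations.

```c
/* Hypothetical model of a 128-bit register holding four floats. */
typedef struct { float lane[4]; } vec128;

/* Legacy SSE shape (e.g. ADDPS xmm0, xmm1): dst = dst + src, dst is overwritten. */
void add2(vec128 *dst, const vec128 *src)
{
    for (int i = 0; i < 4; i++)
        dst->lane[i] += src->lane[i];
}

/* NEON shape (e.g. FADD v2.4s, v0.4s, v1.4s): dst = a + b, inputs untouched. */
void add3(vec128 *dst, const vec128 *a, const vec128 *b)
{
    for (int i = 0; i < 4; i++)
        dst->lane[i] = a->lane[i] + b->lane[i];
}

/* Getting the 3-operand result out of the 2-operand form costs an extra
   copy (a register move such as MOVAPS) whenever `a` must stay live. */
void add3_via_add2(vec128 *dst, const vec128 *a, const vec128 *b)
{
    vec128 tmp = *a;   /* the extra move */
    add2(&tmp, b);
    *dst = tmp;
}
```

In the other direction, which is the one a translator targeting NEON would care about, add2(dst, src) is just add3(dst, dst, src), so in this simple model the destructive form does not look like an obstacle by itself.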
 

moinmoin

Diamond Member
Jun 1, 2017
3,595
5,063
136
That is unfortunately a bigger problem with Phoronix' test suite. Many packages are heavily hand optimized with x86 assembly and SIMD intrinsics.
Calling that "unfortunate" and a "bigger problem" makes it sound as if optimized code were a bad thing. Would that also apply to any M1 optimizations?

I consider Phoronix a good representation of the open source userland that someone familiar with Unix-like systems may be interested in using. As such, it does a good job of showing the current status quo of just that on an M1-based system. And as I wrote before, I personally consider the results impressive: even with Rosetta and the lack of native optimizations or ports altogether, performance is already competitive. Apple Silicon Macs are very likely going to be popular enough to invite such ports and optimizations from the scene. So if one can live with a locked-down system while working with a Unix-like userland, those Macs are bound to be a great additional option.
 

JoeRambo

Golden Member
Jun 13, 2013
1,603
1,712
136
Is this also the case for the SSE family of SIMD instruction sets? The FLAC codebase also has code paths that use SIMD intrinsics for each version of the SSE instruction family. In theory, SSE instructions that use the 128-bit XMM registers should be translatable to ARM NEON instructions, which also operate on 128-bit vectors. However, SSE instructions may not map well onto NEON instructions, as NEON is a 3-operand instruction set whilst SSE is a destructive 2-operand instruction set.

SSE2 is baseline and is emulated by Rosetta 2. In fact, AVX and AVX2 were already "optional", and properly written software had detection for CPU features and codepath selection. So all the Rosetta 2 emulation layer has to do in those cases is report that it does not support AVX and later, and the SSE codepath is used instead.
The problem with the SSE family was always the low number of architectural registers and, as you mention, the destructive type of operations. Meaning a lot of algorithms have a ton of register spills, moves, saves etc.
When doing static translation like Rosetta does, no one is forcing Apple to do the mapping 1:1; they can simply apply optimizations enabled by 3-operand ops and by having 32 architectural regs instead of the 8 SSE has.
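As a rough illustration of the feature detection and codepath selection described above, here is a sketch in C. It assumes GCC or Clang on x86, where __builtin_cpu_supports is available; simd_path, choose_path and detect_path are hypothetical names, and real projects such as FLAC structure their dispatch differently.

```c
/* Code paths an application might ship, from slowest to fastest. */
typedef enum { PATH_SCALAR, PATH_SSE2, PATH_AVX2 } simd_path;

/* Pure selection logic: pick the best path the CPU claims to support. */
simd_path choose_path(int has_sse2, int has_avx2)
{
    if (has_avx2) return PATH_AVX2;
    if (has_sse2) return PATH_SSE2;
    return PATH_SCALAR;
}

/* Ask the CPU (or a translation layer posing as one) what it supports. */
simd_path detect_path(void)
{
#if (defined(__x86_64__) || defined(__i386__)) && defined(__GNUC__)
    __builtin_cpu_init();
    return choose_path(__builtin_cpu_supports("sse2") != 0,
                       __builtin_cpu_supports("avx2") != 0);
#else
    /* Not an x86 build: none of these x86 paths apply. */
    return PATH_SCALAR;
#endif
}
```

With the inputs described for Rosetta 2 (SSE2 reported, AVX2 not), choose_path(1, 0) returns PATH_SSE2, so the application falls back to its SSE code without any changes.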
 

ThatBuzzkiller

Golden Member
Nov 14, 2014
1,101
224
116
That is nonsense, start to finish.
All 3 paragraphs are wrong. You could go research why, but I fear you won't.
It's pretty much the truth sadly ...

They're touting mediocre emulation performance on 5nm chips, and they don't want to release optimization manuals like AMD or Intel would, because they'd prefer to keep changing architectures rather than have developers optimize their software for hardware designs that are inconsistent from generation to generation ...

Rosetta numbers will arguably be closer to reality than many people think, because why even bother optimizing for Apple platforms when they don't want to guarantee compatibility?
 

guidryp

Platinum Member
Apr 3, 2006
2,220
2,497
136
Apple chose ARM, fully knowing about the ecosystem surrounding it. I argue that the problem isn't with the Phoronix test suite. Instead, I argue that it's a problem for Apple given that that's the state of software today. If they want to fix it, they can bloody well pay programmers to fix it for them.
There is nothing Apple needs to "fix" here. Phoronix is a poor choice for cross platform benchmarking, and it's not really representative of commercial software that most people will run.

People lambasted Geekbench for years, but the reality is they make an effort to actually have the benchmark fairly optimized for each platform.

The Phoronix benchmarks are the complete opposite. Heavily hand tuned x86 assembler, completely biased in favor of x86.

It's also not representative of what you will find in most commercial software. Optimizing assembler is a hobby activity these days, not how production software is built.
 
  • Like
Reactions: name99 and Viknet

senttoschool

Golden Member
Jan 30, 2010
1,552
230
106
Safari on Windows?
I thought Apple dropped Windows Safari nearly a decade ago?
Wikipedia says Safari was available for Windows from 2007 until 2012.
You're right. I was looking at the official Safari screen and it confused me when it compared Safari on MacOS to Windows 10 browsers.
 

amrnuke

Golden Member
Apr 24, 2019
1,173
1,754
106
It's pretty much the truth sadly ...

They're touting mediocre emulation performance on 5nm chips, and they don't want to release optimization manuals like AMD or Intel would, because they'd prefer to keep changing architectures rather than have developers optimize their software for hardware designs that are inconsistent from generation to generation ...

Rosetta numbers will arguably be closer to reality than many people think, because why even bother optimizing for Apple platforms when they don't want to guarantee compatibility?
The M1 is as fast as the Mac Pro in WebKit compile time (according to one reviewer; even coming close is absurd). It compiles Xcode as fast as a 32-thread 3950X Hackintosh. It's within 20% of the i9+dGPU MBP 16" in Final Cut rendering. Apple gets the same or better x86 performance in their Mac mini, MBA, and MBP13 via emulation, and much better native performance, for less cost and less power consumption.

Not releasing documentation behind their CPU does not mean they don't like low-level programming or actively discourage it. Regardless, I'd like you to expand on what you mean by "low level programming". I'm also curious what you mean by "keep changing architectures". Also, I'd like to hear what you mean by inconsistent hardware designs and how that affects software development. Provide examples so we can actually have a discussion.
 
  • Like
Reactions: Tlh97 and coercitiv

DrMrLordX

Lifer
Apr 27, 2000
19,654
8,495
136
Same as it's always been, squeezing out every performance advantage.
. . . ah huh.

Look, based on what I've seen, unless you're dealing with highly-specialized-and-difficult-to-autovectorize code, hand-tuned asm doesn't really provide a performance advantage over what a standard C/C++ compiler can crank out on its own. Even stuff like OpenMP can work some of the time (though the limited exposure I had to it showed me it sucked, at least as of a few years ago). If you're dealing with the JVM (Java/Scala/Kotlin/etc.) then it can autovectorize extremely well via heuristics, and I've even gotten one of my own applications running quite well on my phone's Snapdragon 855+ in Java using that technique. Actually all I had to do was take the exact .class files I compiled for my x86-64 targets years ago and run them with a full JVM implementation via UserLAnD, and it worked quite well. That Snapdragon 855+ blew the doors off my old A10-7700k and A10-7870k. But I digress.

Outside of the obvious problems with the Phoronix test data when Rosetta2 was involved, you also have to consider that Phoronix was (probably) compiling "native" M1 binaries from source that was written with an x86-64 target. There probably wasn't a whole heck of a lot more "hand-tuned assembler" involved in any of those applications than there is in "commercial software packages". The issue is: what is the compiler going to do when it hits AVX or AVX2 code paths? Answer: puke all over the place, or revert to a non-SIMD code path since M1 doesn't support any of x86's SIMD at all. If I have an application with SSE2, 3, 4.x, AVX, and AVX2 code paths, the compiler can't use ANY of that on an M1 if I just try compiling straight up. It's not going to even try spitting out any NEON. The code would require a NEON code path for ARM targets. If that code path isn't there, too bad.

My guess is that Phoronix's FLAC results are affected by them compiling from source to a binary that has no SIMD support at all, which would explain the horrendous performance. You'd better believe that SIMD makes a big difference in performance under a benchmark like that.
 
  • Like
Reactions: Carfax83 and Tlh97

wlee15

Senior member
Jan 7, 2009
312
26
91
Meaning a lot of algorithms have a ton of register spills, moves, saves etc.
When doing static translation like Rosetta does, no one is forcing Apple to do the mapping 1:1; they can simply apply optimizations enabled by 3-operand ops and by having 32 architectural regs instead of the 8 SSE has.
x86-64 has 16 vector registers. The EVEX prefix that was introduced alongside AVX-512 extends that to 32.
 
  • Like
Reactions: Carfax83

Thala

Golden Member
Nov 12, 2014
1,323
632
136
If that is the case, then what's the advantage to "hand tuned Assembler"?
Hand tuned assembler is very rare. Compared to intrinsics, you can hope to manage better register allocation, but even this would be rare these days. You also have more control over loop unrolling and SW pipelining, which can help.
Speaking of assembler, I wrote a SHA-512 crypto implementation for ARMv8 in pure assembler not long ago, where I unrolled the round loop 80 times :)

But yes, in general you just want to use the provided SIMD intrinsics and leave the rest to the compiler.
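For anyone wondering what manual loop unrolling buys you, here is a small sketch in C rather than assembler. The function name sum_unrolled4 is hypothetical, and this is a float reduction, not the SHA-512 code mentioned above.

```c
#include <stddef.h>

/* Sums n floats using four independent accumulators, so consecutive
   additions do not have to wait on one another. */
float sum_unrolled4(const float *x, size_t n)
{
    float s0 = 0.0f, s1 = 0.0f, s2 = 0.0f, s3 = 0.0f;
    size_t i = 0;

    /* Unrolled body: four elements per iteration. */
    for (; i + 4 <= n; i += 4) {
        s0 += x[i];
        s1 += x[i + 1];
        s2 += x[i + 2];
        s3 += x[i + 3];
    }

    float s = (s0 + s1) + (s2 + s3);

    /* Leftover elements when n is not a multiple of four. */
    for (; i < n; i++)
        s += x[i];

    return s;
}
```

Splitting the sum this way changes the order of the floating-point additions, which is why a compiler will normally not do it on its own unless told that reassociation is acceptable (for example with -ffast-math on GCC and Clang).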
 

Thala

Golden Member
Nov 12, 2014
1,323
632
136
The issue is: what is the compiler going to do when it hits AVX or AVX2 code paths? Answer: puke all over the place, or revert to a non-SIMD code path since M1 doesn't support any of x86's SIMD at all.
Technically, you have to explicitly make sure that the ARM compiler never gets to see the SSE/AVX code path, either via preprocessor directives or by separating it into different files.
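A minimal sketch of the preprocessor approach, assuming the usual GCC/Clang predefined macros __SSE2__ and __ARM_NEON. The names simd_backend and add_i32 are hypothetical and not taken from FLAC or any other project discussed here.

```c
#include <stddef.h>
#include <stdint.h>

#if defined(__SSE2__)
#include <emmintrin.h>      /* only ever seen by x86 compilers */
#define SIMD_BACKEND "sse2"
#elif defined(__ARM_NEON)
#include <arm_neon.h>       /* only ever seen by ARM compilers */
#define SIMD_BACKEND "neon"
#else
#define SIMD_BACKEND "scalar"
#endif

/* Reports which path this build was compiled with. */
const char *simd_backend(void) { return SIMD_BACKEND; }

/* out[i] = a[i] + b[i] for n 32-bit integers. */
void add_i32(const int32_t *a, const int32_t *b, int32_t *out, size_t n)
{
    size_t i = 0;
#if defined(__SSE2__)
    for (; i + 4 <= n; i += 4) {
        __m128i va = _mm_loadu_si128((const __m128i *)(a + i));
        __m128i vb = _mm_loadu_si128((const __m128i *)(b + i));
        _mm_storeu_si128((__m128i *)(out + i), _mm_add_epi32(va, vb));
    }
#elif defined(__ARM_NEON)
    for (; i + 4 <= n; i += 4)
        vst1q_s32(out + i, vaddq_s32(vld1q_s32(a + i), vld1q_s32(b + i)));
#endif
    /* Scalar tail, and the whole loop when no SIMD path was compiled in. */
    for (; i < n; i++)
        out[i] = a[i] + b[i];
}
```

If the NEON branch were missing, an ARM build would quietly end up with only the scalar loop, which is the situation being described for source that was written with only x86 in mind.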
 

guidryp

Platinum Member
Apr 3, 2006
2,220
2,497
136
There probably wasn't a whole heck of a lot more "hand-tuned assembler" involved in any of those applications than there is in "commercial software packages".
There is typically near ZERO in commercial software in my experience. It's rare in the extreme. That was the point. If there is any at all, it's kind of an anomaly.

The FLAC encoder was mentioned first, then I noticed Kvazaar: "... H.265 video encoder written in the C programming language and optimized in Assembly. Kvazaar is the winner of the 2016 ACM Open-Source Software Competition ..."

I pointed out the optimized x86 assembler because these benchmarks were first posted as "more representative" and someone else said this was a problem Apple needed to fix.

This is about as tilted as comparisons get, with hand tuned x86 assembler on one side, likely ignored source fed through the compiler on the other.

Can you imagine how bad source code has to be that it actually runs worse than Rosetta translation of the x86 version (FLAC encoder)?

The SW is also about as far as you can get from what buyers will use. Kvazaar for encoding? I am sure >99% are using something like Handbrake, which already has a beta Universal Binary out...

These benchmarks are about as useless for cross platform comparisons as they claimed Geekbench to be, except this time it's true...
 
  • Like
Reactions: name99 and Viknet

Mopetar

Diamond Member
Jan 31, 2011
6,781
3,878
136
I mean they're kind of useful if you want to know how poorly a rushed sloppy port will do on the M1, so I guess it has that going for it. :p
 
  • Haha
Reactions: amrnuke

biostud

Lifer
Feb 27, 2003
16,360
1,754
126
What Apple has done is move away from differentiating a computer by its processor, and instead differentiate only by memory and storage, just like phones. People are used to buying phones, and it is much easier to explain to the end user why they should buy more storage than why they should get a +200MHz processor. Apple doesn't sell CPUs, they sell a complete package, so they don't need a lot of different CPUs. They have one for each generation, the most powerful they can come up with. Intel's naming scheme and the different generations of Core processors on the mobile market were also getting absurd.
But just like phones, Apple's prices start at midrange and quickly escalate once you add storage.
 

Eug

Lifer
Mar 11, 2000
23,313
770
126
What Apple has done is move away from differentiating a computer by its processor, and instead differentiate only by memory and storage, just like phones. People are used to buying phones, and it is much easier to explain to the end user why they should buy more storage than why they should get a +200MHz processor. Apple doesn't sell CPUs, they sell a complete package, so they don't need a lot of different CPUs. They have one for each generation, the most powerful they can come up with. Intel's naming scheme and the different generations of Core processors on the mobile market were also getting absurd.
But just like phones, Apple's prices start at midrange and quickly escalate once you add storage.
I get what you’re saying, but the MacBook Pro and MacBook Air have the same memory options and the same storage options. Ironically, the MacBook Air has a slightly lower performing SoC in some configurations due to the lower end GPU.
 

biostud

Lifer
Feb 27, 2003
16,360
1,754
126
I get what you’re saying, but the MacBook Pro and MacBook Air have the same memory options and the same storage options. Ironically, the MacBook Air has a slightly lower performing SoC in some configurations due to the lower end GPU.
They had to find a place for the chips with defects :p
 

senttoschool

Golden Member
Jan 30, 2010
1,552
230
106
I mean they're kind of useful if you want to know how poorly a rushed sloppy port will do on the M1, so I guess it has that going for it. :p
To be honest, it's more of an ego thing on this forum. A lot of people here have Ryzen systems or are AMD fans. The minute their PC master race systems get blown away by a tiny MacBook Air in common applications that most people use, they start looking for ways to boost their ego.

For example, people here will point to the Cinebench numbers and say "see, I told you M1 isn't as fast as mobile Ryzen", despite knowing full well that 99.99% of people will never use Cinebench on the MacBook or Ryzen systems. Cinebench is just an AMD-friendly tool to boost the ego of AMD buyers.

Meanwhile, the M1 runs circles around Ryzen in the most commonly used applications, including web browsing, hardware-accelerated video editing, AI acceleration, etc. And somehow, saying that the M1 chip is the "fastest laptop CPU" or the "fastest overall laptop chip (SoC)" is controversial here.
 
  • Like
Reactions: scannall
