
Question New Apple SoC - M1 - For lower end Macs - Geekbench 5 single-core >1700


Thala

Golden Member
Nov 12, 2014
1,127
440
136
Apple chose ARM, fully knowing about the ecosystem surrounding it. I argue that the problem isn't with the phoronix test suite. Instead, I argue that it's a problem for Apple given that that's the state of software today. If they want to fix it, they can bloody well pay programmers to fix it for them.
This was not my point. Of course the larger ARM ecosystem will eventually make sure that these issues get fixed - whether it is Apple, or anyone else like Amazon, or even the open source community, does not really matter.
I am, for instance, looking into improving the Intel Embree library with respect to Arm NEON - it is used by Blender, Maxon Cinema and other 3D applications. If you compile Embree from the official sources, there is just a plain C++ code path available for ARM.
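To give a feel for what such a port involves, here is a minimal sketch of the kind of NEON kernel that gets added alongside a generic C/C++ fallback. The function and names are mine for illustration, not Embree's:

```c
/* Sketch only: a 4-wide multiply-add kernel using Arm NEON intrinsics.
 * Names are hypothetical, not taken from the Embree codebase. */
#include <arm_neon.h>

/* out[i] = a[i] * b[i] + c[i]; n must be a multiple of 4 */
void madd4(float *out, const float *a, const float *b,
           const float *c, int n)
{
    for (int i = 0; i < n; i += 4) {
        float32x4_t va = vld1q_f32(a + i);   /* load 4 lanes */
        float32x4_t vb = vld1q_f32(b + i);
        float32x4_t vc = vld1q_f32(c + i);
        /* vfmaq_f32(acc, x, y) computes acc + x*y, fused, per lane */
        vst1q_f32(out + i, vfmaq_f32(vc, va, vb));
    }
}
```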
 

ricebunny2020

Junior Member
Nov 19, 2020
2
4
36
The x86 build of FLAC seems to make use of x86 SIMD intrinsics in addition to x86 assembly. I don't see any evidence of any use of ARM SIMD intrinsics or of ARM assembly in the FLAC codebase, so that is likely to explain the performance disparity between the native ARM build of FLAC and the x86 build of FLAC run via Rosetta.
Rosetta does not support the translation of AVX, AVX2, and AVX512 instructions.
 

software_engineer

Junior Member
Jul 26, 2020
6
9
41
Rosetta does not support the translation of AVX, AVX2, and AVX512 instructions.
Is this also the case for the SSE family of SIMD instruction sets? The FLAC codebase also has code paths that make use of SIMD intrinsics for each version of the SSE instruction family. In theory, SSE instructions making use of the 128-bit XMM registers should be translatable to ARM NEON instructions, which also operate on 128-bit vectors. However, SSE instructions may not map well onto NEON instructions, as NEON is a 3-operand instruction set whilst SSE is a destructive 2-operand instruction set.
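For what it's worth, here is a small sketch of that operand-count difference (hypothetical function, compiled once per target). The SSE intrinsic lowers to a destructive 2-operand instruction, while the NEON one lowers to a 3-operand instruction that leaves its inputs intact:

```c
#if defined(__SSE__)
#include <xmmintrin.h>
__m128 add4(__m128 a, __m128 b)
{
    /* lowers to destructive 2-operand "addps xmm0, xmm1": the
     * destination register is overwritten, so the compiler must copy
     * a first whenever its old value is still live */
    return _mm_add_ps(a, b);
}
#elif defined(__ARM_NEON)
#include <arm_neon.h>
float32x4_t add4(float32x4_t a, float32x4_t b)
{
    /* lowers to 3-operand "fadd v0.4s, v1.4s, v2.4s": the result goes
     * to a separate destination register, so a and b survive untouched */
    return vaddq_f32(a, b);
}
#endif
```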
 

moinmoin

Platinum Member
Jun 1, 2017
2,069
2,474
106
That is unfortunately a bigger problem with Phoronix' test suite. Many packages are heavily hand optimized with x86 assembly and SIMD intrinsics.
Calling that "unfortunate" and a "bigger problem" sounds as if it would be a bad thing for code to be optimized; would that then also apply to any M1 optimizations?

I consider Phoronix a good representation of the open source userland that anyone familiar with Unix-like systems may be interested in using. As such, it does a good job of showing the current status quo of just that on an M1-based system. And as I wrote before, I personally consider the results positively impressive: even with Rosetta and the lack of native optimizations or ports altogether, performance is already competitive. And Apple Silicon Macs are very likely going to be popular enough to invite such ports and optimizations from the scene. So if one can live with a locked-down system while working with a Unix-like userland, those Macs are bound to be a great additional option.
 

JoeRambo

Senior member
Jun 13, 2013
912
645
136
Is this also the case for the SSE family of SIMD instruction sets? The FLAC codebase also has code paths that make use of SIMD intrinsics for each version of the SSE instruction family. In theory, SSE instructions making use of the 128-bit XMM registers should be translatable to ARM NEON instructions, which also operate on 128-bit vectors. However, SSE instructions may not map well onto NEON instructions, as NEON is a 3-operand instruction set whilst SSE is a destructive 2-operand instruction set.

SSE2 is baseline and is emulated by Rosetta 2. In fact, AVX and AVX2 were already "optional", and properly written software had CPU feature detection and code-path selection. So all the Rosetta 2 emulation layer has to do in those cases is report that it does not support AVX+, and the SSE code path is used instead.
The problem with the SSE family was always the low number of architectural registers and, as you mention, the destructive type of operations. Meaning a lot of algorithms have a ton of register spills, moves, saves, etc.
When doing static translation like Rosetta does, no one is forcing Apple to do the mapping 1:1; they can simply apply optimizations enabled by 3-operand ops and by having 32 architectural registers instead of the 8 SSE has.
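A minimal sketch of that detection-and-dispatch pattern, assuming GCC/Clang's x86-only __builtin_cpu_supports(); the path bodies here are scalar stand-ins, not real SIMD:

```c
#include <stddef.h>

/* Scalar stand-ins; the real paths would contain AVX2/SSE2 intrinsics. */
static void sum_avx2(float *dst, const float *src, size_t n)
{ for (size_t i = 0; i < n; i++) dst[i] += src[i]; }

static void sum_sse2(float *dst, const float *src, size_t n)
{ for (size_t i = 0; i < n; i++) dst[i] += src[i]; }

void sum(float *dst, const float *src, size_t n)
{
    /* GCC/Clang builtin, x86 only. Under Rosetta 2 the AVX2 query
     * reports false, so the translated binary takes the SSE2 path. */
    if (__builtin_cpu_supports("avx2"))
        sum_avx2(dst, src, n);
    else
        sum_sse2(dst, src, n);
}
```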
 

ThatBuzzkiller

Golden Member
Nov 14, 2014
1,002
153
106
That is nonsense, start to finish.
All 3 paragraphs are wrong. You could go research why, but I fear you won't.
It's pretty much the truth, sadly ...

They're touting mediocre emulation performance on 5nm chips, and they don't want to release optimization manuals like AMD or Intel would, because they'd prefer to keep changing architectures rather than have developers optimize their software for hardware designs that are inconsistent from generation to generation ...

Rosetta numbers will arguably be closer to reality than many people think, because why even bother optimizing for Apple platforms when they don't want to guarantee compatibility?
 

guidryp

Senior member
Apr 3, 2006
587
461
136
Apple chose ARM, fully knowing about the ecosystem surrounding it. I argue that the problem isn't with the phoronix test suite. Instead, I argue that it's a problem for Apple given that that's the state of software today. If they want to fix it, they can bloody well pay programmers to fix it for them.
There is nothing Apple needs to "fix" here. Phoronix is a poor choice for cross platform benchmarking, and it's not really representative of commercial software that most people will run.

People lambasted Geekbench for years, but the reality is they make an effort to actually have the benchmark fairly optimized for each platform.

The Phoronix benchmarks are the complete opposite: heavily hand-tuned x86 assembler, completely biased in favor of x86.

It's also not representative of what you will find in most commercial software. Optimizing assembler is a hobby activity these days, not how production software is built.
 

amrnuke

Senior member
Apr 24, 2019
998
1,496
96
It's pretty much the truth, sadly ...

They're touting mediocre emulation performance on 5nm chips, and they don't want to release optimization manuals like AMD or Intel would, because they'd prefer to keep changing architectures rather than have developers optimize their software for hardware designs that are inconsistent from generation to generation ...

Rosetta numbers will arguably be closer to reality than many people think, because why even bother optimizing for Apple platforms when they don't want to guarantee compatibility?
The M1 is as fast as the Mac Pro in WebKit compile time (according to one reviewer - even coming close is absurd). It compiles Xcode as fast as a 32-thread 3950X Hackintosh. It's within 20% of the i9+dGPU MBP 16" in Final Cut rendering. Apple gets the same or better x86 performance in their Mac mini, MBA, and MBP13 via emulation, much better native performance, for less cost and less power consumption.

Not releasing documentation for their CPU does not mean they don't like low-level programming or actively discourage it. Regardless, I'd like you to expand on what you mean by "low-level programming". I'm also curious what you mean by "keep changing architectures". I'd also like to hear what you mean by inconsistent hardware designs and how that affects software development. Provide examples so we can actually have a discussion.
 

DrMrLordX

Lifer
Apr 27, 2000
16,627
5,634
136
Same as it's always been, squeezing out every performance advantage.
... ah huh.

Look, based on what I've seen, unless you're dealing with highly-specialized-and-difficult-to-autovectorize code, hand-tuned asm doesn't really provide a performance advantage over what a standard C/C++ compiler can crank out on its own. Even stuff like OpenMP can work some of the time (though the limited exposure I had to it showed me it sucked, at least as of a few years ago). If you're dealing with the JVM (Java/Scala/Kotlin/etc.) then it can autovectorize extremely well via heuristics, and I've even gotten one of my own applications running quite well on my phone's Snapdragon 855+ in Java using that technique. Actually all I had to do was take the exact .class files I compiled for my x86-64 targets years ago and run them with a full JVM implementation via UserLAnD, and it worked quite well. That Snapdragon 855+ blew the doors off my old A10-7700k and A10-7870k. But I digress.
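As a minimal illustration of that point (my example, not from any of the packages discussed), a plain scalar loop like the following gets autovectorized by current GCC/Clang at -O3 into packed SSE/AVX on x86-64 and NEON on AArch64, with no hand-written assembly involved:

```c
/* saxpy: y[i] = a*x[i] + y[i]. Built with "gcc -O3" or "clang -O3",
 * this plain loop comes out as packed SIMD on both x86-64 and AArch64;
 * restrict tells the compiler the two arrays don't overlap. */
void saxpy(float *restrict y, const float *restrict x, float a, int n)
{
    for (int i = 0; i < n; i++)
        y[i] = a * x[i] + y[i];
}
```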

Outside of the obvious problems with the Phoronix test data when Rosetta2 was involved, you also have to consider that Phoronix was (probably) compiling "native" M1 binaries from source that was written with an x86-64 target. There probably wasn't a whole heck of a lot more "hand-tuned assembler" involved in any of those applications than there is in "commercial software packages". The issue is: what is the compiler going to do when it hits AVX or AVX2 code paths? Answer: puke all over the place, or revert to a non-SIMD code path since M1 doesn't support any of x86's SIMD at all. If I have an application with SSE2, 3, 4.x, AVX, and AVX2 code paths, the compiler can't use ANY of that on an M1 if I just try compiling straight up. It's not going to even try spitting out any NEON. The code would require a NEON code path for ARM targets. If that code path isn't there, too bad.

My guess is that Phoronix's FLAC results are affected by them compiling from source to a binary that has no SIMD support at all, which would explain the horrendous performance. You'd better believe that SIMD makes a big difference in performance under a benchmark like that.
 

wlee15

Senior member
Jan 7, 2009
302
6
81
Meaning a lot of algorithms have a ton of register spills, moves, saves, etc.
When doing static translation like Rosetta does, no one is forcing Apple to do the mapping 1:1; they can simply apply optimizations enabled by 3-operand ops and by having 32 architectural registers instead of the 8 SSE has.
x86-64 has 16 vector registers. The EVEX prefix, introduced alongside AVX-512, extends that to 32.
 

Thala

Golden Member
Nov 12, 2014
1,127
440
136
If that is the case, then what's the advantage to "hand tuned Assembler"?
Hand-tuned assembler is very rare. Compared to intrinsics, you can hope to do better register allocation, but even that would be rare these days. You also have more control over loop unrolling and software pipelining, which can help.
Speaking of assembler, I have written a SHA-512 crypto implementation for ARMv8 in pure assembler not long ago - where I unrolled the round-loop 80 times :)

But yes, in general you just want to use the provided SIMD intrinsics and leave the rest to the compiler.
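A rough sketch of that division of labor (my example): write the vector operations as intrinsics, request the unroll factor with a pragma if you care about it, and leave register allocation entirely to the compiler:

```c
#include <arm_neon.h>

/* x[i] *= s, with n a multiple of 4. The intrinsics pin down the
 * vector operations, the pragma requests the unroll factor explicitly
 * (GCC spelling: #pragma GCC unroll 4), and register allocation is
 * still left to the compiler. */
void scale(float *x, float s, int n)
{
    float32x4_t vs = vdupq_n_f32(s);
#pragma clang loop unroll_count(4)
    for (int i = 0; i < n; i += 4)
        vst1q_f32(x + i, vmulq_f32(vld1q_f32(x + i), vs));
}
```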
 

Thala

Golden Member
Nov 12, 2014
1,127
440
136
The issue is: what is the compiler going to do when it hits AVX or AVX2 code paths? Answer: puke all over the place, or revert to a non-SIMD code path since M1 doesn't support any of x86's SIMD at all.
Technically you have to explicitly make sure that the ARM compiler never gets to see the SSE/AVX code paths, either via preprocessor directives or by separating them into different files.
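A minimal sketch of that separation (hypothetical function; bigger projects often put these paths in separate source files selected by the build system): the preprocessor strips the SSE code before the AArch64 compiler ever parses it.

```c
#if defined(__x86_64__) || defined(__i386__)
#include <emmintrin.h>
void clear4(float *p) { _mm_storeu_ps(p, _mm_setzero_ps()); }   /* SSE path */
#elif defined(__ARM_NEON)
#include <arm_neon.h>
void clear4(float *p) { vst1q_f32(p, vdupq_n_f32(0.0f)); }      /* NEON path */
#else
void clear4(float *p) { for (int i = 0; i < 4; i++) p[i] = 0.0f; } /* scalar */
#endif
```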
 
