Discussion: NV Re-Enters ARM PC Market in 2025!


eek2121

Diamond Member
Aug 2, 2005
3,364
4,963
136
For multi-core performance, and for single core performance on anything involving recent Apple chips, GB6 is useless.

The multicore benchmark no longer measures peak performance.

Single core scores on Apple chips (M4 and later) gained an advantage from SME. I would not normally raise this particular issue, except that, last I checked, GB6.x was the only application to support SME.

To the GB6 authors: maybe don't put the cart before the horse. For GB7, maybe also measure peak “theoretical” performance and report that as a third number. And since we are adding more numbers, can we exclude SME, AVX2, and AVX-512, or at least get dedicated tests that measure the actual int/fp performance of these chips?
 

Jan Olšan

Senior member
Jan 12, 2017
530
1,050
136
For multi-core performance, and for single core performance on anything involving recent Apple chips, GB6 is useless.

The multicore benchmark no longer measures peak performance.

Single core scores on Apple chips (M4 and later) gained an advantage from SME. I would not normally raise this particular issue, except that, last I checked, GB6.x was the only application to support SME.

To the GB6 authors: maybe don't put the cart before the horse. For GB7, maybe also measure peak “theoretical” performance and report that as a third number. And since we are adding more numbers, can we exclude SME, AVX2, and AVX-512, or at least get dedicated tests that measure the actual int/fp performance of these chips?
AVX2, I wonder about; it's finally present in enough processors to be viable as an assumed baseline (though still with a runtime-detection fallback for older/Atom processors). As for AVX-512, it seems GB6 may use it, but it apparently adds close to no performance, so you don't need to worry about that, sadly. It likely didn't get the attention/effort that SME got from Primate Labs or whoever wrote the code, although I would argue it's more general and more widely used.
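
(To illustrate the kind of runtime-detection fallback I mean, a minimal sketch assuming GCC/Clang's CPU-feature builtins; the function names are made up and this is not how GB6 is actually built:)

#include <stdint.h>
#include <stddef.h>

/* hypothetical example: one loop built for the baseline, one for AVX2 */
static int64_t sum_scalar(const int32_t *a, size_t n) {
    int64_t s = 0;
    for (size_t i = 0; i < n; i++) s += a[i];
    return s;
}

__attribute__((target("avx2")))            /* this copy is compiled with AVX2 enabled */
static int64_t sum_avx2(const int32_t *a, size_t n) {
    int64_t s = 0;
    for (size_t i = 0; i < n; i++) s += a[i];   /* may auto-vectorize at -O3 */
    return s;
}

int64_t sum_dispatch(const int32_t *a, size_t n) {
    /* pick the AVX2 path at runtime, fall back on older/Atom cores */
    return __builtin_cpu_supports("avx2") ? sum_avx2(a, n) : sum_scalar(a, n);
}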
 

eek2121

Diamond Member
Aug 2, 2005
3,364
4,963
136
AVX2, I wonder about; it's finally present in enough processors to be viable as an assumed baseline (though still with a runtime-detection fallback for older/Atom processors). As for AVX-512, it seems GB6 may use it, but it apparently adds close to no performance, so you don't need to worry about that, sadly. It likely didn't get the attention/effort that SME got from Primate Labs or whoever wrote the code, although I would argue it's more general and more widely used.
AVX2 is really old. It is in most modern chips.

It came out in 2013. AVX-512 is a bit more complicated. You really need to hand-tailor the code to the task at hand in order to see a good performance uplift. There are projects out there that use it though, and it does heavy lifting in those cases.

Curious whether NVIDIA will include SME or not.

SME on Apple platforms wasn't adopted in ANY applications except GB6 last I checked.
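
(A toy example of the hand-tailoring point above, assuming the AVX-512F intrinsics exposed by GCC/Clang; the masked tail is the kind of thing you end up writing by hand, and this doesn't reflect any real project:)

#include <immintrin.h>
#include <stddef.h>
#include <stdint.h>

/* hypothetical kernel: sum 32-bit ints 16 at a time, masked tail instead of
   a scalar cleanup loop; build with something like -mavx512f */
int32_t sum_avx512(const int32_t *a, size_t n) {
    __m512i acc = _mm512_setzero_si512();
    size_t i = 0;
    for (; i + 16 <= n; i += 16)
        acc = _mm512_add_epi32(acc, _mm512_loadu_si512(a + i));
    if (i < n) {
        __mmask16 m = (__mmask16)((1u << (n - i)) - 1);      /* remaining lanes */
        acc = _mm512_add_epi32(acc, _mm512_maskz_loadu_epi32(m, a + i));
    }
    return _mm512_reduce_add_epi32(acc);   /* assumes the total fits in 32 bits */
}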
 

Doug S

Diamond Member
Feb 8, 2020
3,188
5,458
136
can we exclude SME, AVX2, and AVX-512, or at least get dedicated tests that measure the actual int/fp performance of these chips?

I'd like to see a single core "native int" result that doesn't leverage any SIMD (other than incidental use by system libraries, i.e. if a memcpy() call uses it; the benchmarks compiled for GB6 would have SSE/AVX/NEON/SVE/SME disabled in the compiler flags) and doesn't include any fp, just the regular integer instructions. It is pointless to include fp instructions in a single core result - I can't think of any real world tasks that are limited by single-thread fp. When you're doing fp (and 98% of the time when you're heavily using SIMD) you're running across multiple threads, and the ST number doesn't matter all that much to you.

Then you have three MT results: integer "max" MT, floating point "max" MT, and an integer "cooperative" MT. The "max" tests would be sort of like Cinebench-type stuff where the more cores the better, at least until you run out of memory bandwidth or other resources. The "cooperative" test would be something where all the threads have to talk to each other, so it would test the efficiency of the fabric, OS scheduling and locking efficiency, that sort of thing. I think that's what GB6's MT test was intended to do, but it doesn't do all that good a job of it.

Then I guess whatever AI/GPU type test(s) people feel are needed.
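
(For what it's worth, a toy sketch of the kind of "native int" kernel that gains next to nothing from SIMD, serial dependency chain and all; the build flags in the comment are my assumption of how you'd keep the vectorizer out with GCC, not anything Primate Labs does:)

#include <stdint.h>
#include <stddef.h>

/* serial integer mixing loop: each iteration depends on the previous one, so
   auto-vectorization has nothing to grab; hypothetical build:
   gcc -O2 -fno-tree-vectorize (or -mgeneral-regs-only to ban vector registers) */
uint64_t int_kernel(const uint64_t *a, size_t n) {
    uint64_t h = 0x9e3779b97f4a7c15ULL;       /* arbitrary seed */
    for (size_t i = 0; i < n; i++) {
        h ^= a[i];
        h *= 0xff51afd7ed558ccdULL;           /* integer multiply mix step */
        h ^= h >> 33;
    }
    return h;
}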
 

Jan Olšan

Senior member
Jan 12, 2017
530
1,050
136
I think pushing back that hard on SIMD (but why not on other ISA extensions?) probably makes no sense. I mean, ARMv8/9 has SIMD as a baseline feature. Likewise, SSE2 is pretty much guaranteed on x86-64.

As you yourself mention, there is going to be SIMD usage in the libraries already. And in the kernel. It's unfair to prohibit the application code from using the same.

At some point these extensions become a basic CPU feature that makes no sense to exclude. Clearly that point hasn't been reached for AVX-512, and it may not even have been reached for AVX2 yet. But deliberately avoiding all SIMD makes no sense IMHO.

What you want is probably better served by having tests with different workload characteristics. Multimedia and HPC for SIMD; compilation or (general-data, zip-like) decompression for "pure integer"? It's not SIMD's problem if it finds ways to be useful in more and more fields...
 
  • Like
Reactions: Tlh97 and MS_AT

Doug S

Diamond Member
Feb 8, 2020
3,188
5,458
136
As you yourself mention, there is going to be SIMD usage in the libraries already. And in the kernel. It's unfair to prohibit the application code from using the same.

Perhaps so, but that's primarily hand-crafted assembly - and in the case of stuff like memcpy() it is far from generally useful. That is, the code has to check whether calling the SIMD path is worth doing, which it isn't for short copies. Ditto for the kernel's use of SIMD.

Which is fine - choose your "native int" benchmarks to be ones that can't benefit too much from SIMD (there are plenty of those around) and don't use any assembly code in them. If the compiler can find a few places to use it, fine; if the right benchmarks are chosen, it won't affect things by more than 1 or 2%.
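
(A rough sketch of the "is the SIMD path worth it" check being described; real libc memcpy is hand-written assembly, and the 128-byte cutoff here is made up:)

#include <stddef.h>
#include <string.h>

void *copy_dispatch(void *dst, const void *src, size_t n) {
    if (n < 128) {                            /* short copy: SIMD setup not worth it */
        unsigned char *d = dst;
        const unsigned char *s = src;
        while (n--) *d++ = *s++;              /* plain scalar byte loop */
        return dst;
    }
    return memcpy(dst, src, n);               /* stand-in for the wide/SIMD path */
}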
 

MS_AT

Senior member
Jul 15, 2024
671
1,357
96
It is pointless to include fp instructions in a single core result
You browse the internet without JavaScript? ;) Unless you mean to say that browser performance is irrelevant to the market.
Perhaps so, but that's primarily hand-crafted assembly - and in the case of stuff like memcpy() it is far from generally useful. That is, the code has to check whether calling the SIMD path is worth doing, which it isn't for short copies. Ditto for the kernel's use of SIMD.
You have a pretty negative, and at least somewhat outdated, view of SIMD. To get what you want you would have to use unrealistic compiler settings, prohibit the use of existing libraries and disable the standard libraries of some languages. What you propose would best fit as a legacy score, for apps written years ago.
 
  • Like
Reactions: Tlh97 and Schmide

eek2121

Diamond Member
Aug 2, 2005
3,364
4,963
136
I'd like to see a single core "native int" result that doesn't leverage any SIMD (other than incidental use by system libraries, i.e. if a memcpy() call uses it; the benchmarks compiled for GB6 would have SSE/AVX/NEON/SVE/SME disabled in the compiler flags) and doesn't include any fp, just the regular integer instructions. It is pointless to include fp instructions in a single core result - I can't think of any real world tasks that are limited by single-thread fp. When you're doing fp (and 98% of the time when you're heavily using SIMD) you're running across multiple threads, and the ST number doesn't matter all that much to you.

Then you have three MT results: integer "max" MT, floating point "max" MT, and an integer "cooperative" MT. The "max" tests would be sort of like Cinebench-type stuff where the more cores the better, at least until you run out of memory bandwidth or other resources. The "cooperative" test would be something where all the threads have to talk to each other, so it would test the efficiency of the fabric, OS scheduling and locking efficiency, that sort of thing. I think that's what GB6's MT test was intended to do, but it doesn't do all that good a job of it.

Then I guess whatever AI/GPU type test(s) people feel are needed.
I think it's fine to use a well-supported instruction set. The thing that irked me was that Geekbench adopted it RIGHT AWAY for Apple, despite not a single app supporting SME.

They also changed what kind of benchmark their application is. In a controlled environment, GB5 and SPEC could both be used similarly to accurately measure both single core and multicore performance. With GB6 you can't do this. They also basically killed GB5; they go out of their way to hide it from everyone. Some of us paid for the software in some form or another, and many of us WANTED GB5's way of benchmarking. GB6 would've been fine if they had either maintained both versions or included both sets of scores.

I used GB5 quite a bit to determine whether a piece of hardware was performing as it should, and, to a lesser extent, for overclocking/undervolting. It ran fast, results were reproducible (again, in a controlled environment), and they were roughly comparable, even across chip architectures. They butchered it all, and as a result, I've not bothered to buy GB6, so congrats, I guess? 🤣

Well, now I'm just ranting! Anyway, I do hope we get some additional details on the NVIDIA platform pretty soon. It is actually the first ARM platform outside of the Raspberry Pi that has me interested. It'll probably cost 2 arms (heh) and a leg. Maybe not. It would be nice if the platform was DIY friendly.
 

poke01

Diamond Member
Mar 8, 2022
3,496
4,812
106
All arm64 SoCs will use SME this year, and Primate did say SME scores are comparable to those of other SME-enabled CPUs. So we can compare ARM chips from other manufacturers again, starting with this year's releases.


Apple has almost launched ARM features before ARM themselves; it's not Apple's fault that Apple is ahead.
 

poke01

Diamond Member
Mar 8, 2022
3,496
4,812
106
do hope we get some additional details on the NVIDIA platform pretty soon. It is actually the first ARM platform outside of the Raspberry Pi that has me interested
Why? Without a custom CPU it's meh. Only the GPU is interesting, but then you can just get an RTX 5090 or RTX 6000 Pro with an x86 CPU.

I guess CUDA with 128GB of RAM for $3000 is interesting, but other than that it's completely boring.
 
  • Like
Reactions: Io Magnesso

Doug S

Diamond Member
Feb 8, 2020
3,188
5,458
136
You browse the internet without JavaScript? ;) Unless you mean to say that browser performance is irrelevant to the market.

I'm talking about using fp-specific benchmarks designed to test fp, not general-purpose benchmarks that happen to include some fp.

The way JavaScript uses fp doesn't make those fp benchmarks, because using it for ALL numbers (which IMHO was the stupidest design decision in any language in the past 30 years) means almost all of it is simple stuff such as loop counters and the like. That's not the kind of work fp benchmarks do, with, say, unrolled loops of a million muladds, where it becomes more about how wide a CPU's fp scheduler is and how many loads/stores it can issue per cycle.
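
(A toy illustration of the distinction: the kind of kernel an fp benchmark actually runs uses independent accumulators, so fp scheduler width and load throughput matter, unlike loop-counter arithmetic; the name and the 4-way unroll are just for the example:)

#include <stddef.h>

double muladd_kernel(const double *a, const double *b, size_t n) {
    double s0 = 0, s1 = 0, s2 = 0, s3 = 0;    /* independent accumulators */
    size_t i;
    for (i = 0; i + 4 <= n; i += 4) {
        s0 += a[i]     * b[i];
        s1 += a[i + 1] * b[i + 1];
        s2 += a[i + 2] * b[i + 2];
        s3 += a[i + 3] * b[i + 3];
    }
    for (; i < n; i++) s0 += a[i] * b[i];     /* scalar tail */
    return s0 + s1 + s2 + s3;
}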
 

eek2121

Diamond Member
Aug 2, 2005
3,364
4,963
136
Why? Without a custom CPU it's meh. Only the GPU is interesting, but then you can just get an RTX 5090 or RTX 6000 Pro with an x86 CPU.

I guess CUDA with 128GB of RAM for $3000 is interesting, but other than that it's completely boring.
Because GeForce. We get ARM Windows drivers for GeForce, and GeForce GPUs with the platform.
 

DavidC1

Golden Member
Dec 29, 2023
1,493
2,441
96
I used GB5 quite a bit to determine whether a piece of hardware was performing as it should, and, to a lesser extent, for overclocking/undervolting. It ran fast, results were reproducible (again, in a controlled environment), and they were roughly comparable, even across chip architectures. They butchered it all, and as a result, I've not bothered to buy GB6, so congrats, I guess? 🤣
They had already been defeaturing it since GB5. You need to add specific commands just to sort. It used to be that you could sort by peak scores and also filter by OS (Linux/Android is 5-10% faster than Windows). And GB4 used to separate the results into ST Int, ST FP, MT Int, MT FP, and Cryptography. Now you need to look at them one by one and combine them yourself.

The defeaturing follows the mobile trend, where everything has to be simplified/cut down to fit a 5-inch touchscreen device.
 

Io Magnesso

Junior Member
Jun 12, 2025
8
2
36
To be honest, how about putting the performance of a coprocessor that is shared among the cores into a single-threaded score?
I wonder...