Discussion NV Re-Enter ARM PC market in 2025!

igor_kavinski · Tuesday at 12:54 PM

vs. M4: https://browser.geekbench.com/v6/cpu/compare/12348676?baseline=12372835

eek2121 · Tuesday at 10:17 PM

For multi-core performance and also single core performance with anything involving recent Apple chips, GB6 is useless.

Multicore benchmark no longer measures peak performance.

Single core with Apple chips (M4 and later) added an advantage for SME. I would not normally raise this particular issue except last I checked, GB6.x was the only application to support SME.

To GB6 authors: Maybe don’t put the cart before the horse. Also, for GB7, maybe also measure peak “theoretical“ performance and put that in a third number. Also, since we are adding more numbers, can we exclude SME, AVX2, AVX-512, or at least get dedicated tests that test the actual int/fp performance of these chips?)

eek2121 · Tuesday at 10:18 PM

(oh, not bashing NVIDIA in any way, I am rather curious as to what they will put out!)

Jan Olšan · Wednesday at 8:51 AM

eek2121 said:
For multi-core performance and also single core performance with anything involving recent Apple chips, GB6 is useless.

Multicore benchmark no longer measures peak performance.

Single core with Apple chips (M4 and later) added an advantage for SME. I would not normally raise this particular issue except last I checked, GB6.x was the only application to support SME.

To GB6 authors: Maybe don’t put the cart before the horse. Also, for GB7, maybe also measure peak “theoretical“ performance and put that in a third number. Also, since we are adding more numbers, can we exclude SME, AVX2, AVX-512, or at least get dedicated tests that test the actual int/fp performance of these chips?)

AVX2 I wonder, it's finally present in enough processors to be viable as an assumed baseline (though still with a runtime-detection fallback for older/Atom processors). As for AVX-512, it seems GB6 may use it but it apparently adds close to no performance so you don't need to worry about that, sadly. Likely didn't get the attention/effot SME got in Primate Labs or whoever wrote the code, although I would argue it's more general and more widely used.

eek2121 · Wednesday at 12:09 PM

Jan Olšan said:
AVX2 I wonder, it's finally present in enough processors to be viable as an assumed baseline (though still with a runtime-detection fallback for older/Atom processors). As for AVX-512, it seems GB6 may use it but it apparently adds close to no performance so you don't need to worry about that, sadly. Likely didn't get the attention/effot SME got in Primate Labs or whoever wrote the code, although I would argue it's more general and more widely used.

AVX2 is really old. It is in most modern chips.

It came out in 2013. AVX-512 is a bit more complicated. You really need to hand tailor the code to the task at hand in order to see a good performance uplift. There are projects out that that use it though, and it does heavy lifting in those cases.

Curious as to if NVIDIA will include SME or not.

SME on Apple platforms wasn't adopted in ANY applications except GB6 last I checked.

Doug S · Wednesday at 12:12 PM

eek2121 said:
can we exclude SME, AVX2, AVX-512, or at least get dedicated tests that test the actual int/fp performance of these chips?)

I'd like to see a single core "native int" result that doesn't leverage any SIMD (other than incidental use by system libraries i.e. if a memcpy() call uses it, but the benchmarks compiled for GB6 would have SSE/AVX/NEON/SVE/SME disabled in the compiler flags) and doesn't include any fp, just test the regular integer instructions. It is pointless to include fp instructions in a single core result - I can't think of any real world tasks that are limited by single thread fp, when you're doing fp (and 98% of the time when you're heavily using SIMD) you're running across multiple threads and the ST number doesn't matter all that much to you.

Then you have three MT results: integer "max" MT, floating point "max" MT, and an integer "cooperative" MT. The "max" tests would be sort of like Cinebench type stuff where the more cores the better, at least until you run out of memory bandwidth or other resources. The "cooperative" test would something where all the threads have to talk to each other so it would test the efficiency of the fabric, OS scheduling and locking efficiency, that sort of thing. I think that's what GB6's MT test was intended to do but it doesn't do all that good a job of it.

Then I guess whatever AI/GPU type test(s) people feel are needed.

Jan Olšan · Wednesday at 1:53 PM

I think going so hard back on SIMD (but why not other ISA extensions?) probably makes no sense. I mean, ARMv8/9 has SIMD as a baseline feature. Likewise, SSE2 is pretty much guaranteed on x86-64.

As you yourself mention, there is going to be SIMD usage in the libraries already. And in the kernel. Unfair to prohibit the application code to use the same.

At some point these extensions become a basic CPU feature that makes no sense to exclude. Clearly that point doesn't include AVX-512, it may not even include AVX2 yet. But trying to not use any SIMD on purpose makes no sense IMHO.

What you want is probably better served by having tests with different workload character. Multimedia, HPC for SIMD, compilation or (general data, zip-like) decompression for "pure integer"? It's not SIMD's problem if it finds ways to be useful in more and more fields...

Doug S · Wednesday at 2:12 PM

Jan Olšan said:
As you yourself mention, there is going to be SIMD usage in the libraries already. And in the kernel. Unfair to prohibit the application code to use the same.

Perhaps so, but that's primarily hand crafted assembly - and in the case of stuff like memcpy() it is far from being generally useful. That is the code has to check whether calling the SIMD code is worth doing which it isn't for short copies. Ditto for the kernel's use of SIMD.

Which is fine - choose your "native int" benchmarks to be ones that can't benefit too much from SIMD (there are plenty of those around) and don't use any assembly code in them. If the compiler can find a few places to use it fine, if the right benchmarks are chosen it won't affect things more than 1 or 2%.

MS_AT · Wednesday at 2:24 PM

Doug S said:
It is pointless to include fp instructions in a single core result

You browse internet without javascript?

unless you mean to say that browser performance is irrelevant to the market.

Doug S said:
Perhaps so, but that's primarily hand crafted assembly - and in the case of stuff like memcpy() it is far from being generally useful. That is the code has to check whether calling the SIMD code is worth doing which it isn't for short copies. Ditto for the kernel's use of SIMD.

You have pretty negative view on SIMD and at least somewhat outdated. To get what you want you would have to use unrealistic compiler settings, prohibit use of existing libriaries and disable standard libraries of some languages. What you propose would best fit as legacy score, for apps written years ago.

eek2121 · Wednesday at 2:40 PM

Doug S said:
I'd like to see a single core "native int" result that doesn't leverage any SIMD (other than incidental use by system libraries i.e. if a memcpy() call uses it, but the benchmarks compiled for GB6 would have SSE/AVX/NEON/SVE/SME disabled in the compiler flags) and doesn't include any fp, just test the regular integer instructions. It is pointless to include fp instructions in a single core result - I can't think of any real world tasks that are limited by single thread fp, when you're doing fp (and 98% of the time when you're heavily using SIMD) you're running across multiple threads and the ST number doesn't matter all that much to you.

Then you have three MT results: integer "max" MT, floating point "max" MT, and an integer "cooperative" MT. The "max" tests would be sort of like Cinebench type stuff where the more cores the better, at least until you run out of memory bandwidth or other resources. The "cooperative" test would something where all the threads have to talk to each other so it would test the efficiency of the fabric, OS scheduling and locking efficiency, that sort of thing. I think that's what GB6's MT test was intended to do but it doesn't do all that good a job of it.

Then I guess whatever AI/GPU type test(s) people feel are needed.

I think it's fine to use a well supported instruction set. The thing that irked me was that Geekbench did it RIGHT AWAY with Apple, despite not a single app supporting SME.

They also changed the type of benchmark their application was. In a controlled environment, GB5 and SPEC could both be used similarly to accurately measure both single core and multicore performance. With GB6 you can't do this. They also basically killed GB5, they go out of there way to hide it from everyone. Some of us paid for the software in some form or another. Many of us WANTED GB5's way of benchmarking. GB6 also would've been fine if they had either maintained both versions, or included both sets of scores.

I used GB5 quite a bit to determine whether a piece of hardware was performing as it should, and to a lesser extent, overclocking/undervolting. It ran fast, results were reproducible (again, in a controlled environment), and they were roughly comparable, even across chip architectures. They butchered it all, and as a result, I've not bothered to buy GB6, so congrats, I guess? 🤣

Well, now I'm just ranting! Anyway, I do hope we get some additional details on the NVIDIA platform pretty soon. It is actually the first ARM platform outside of the Raspberry Pi that has me interested. It'll probably cost 2 arms (heh) and a leg. Maybe not. It would be nice if the platform was DIY friendly.

igor_kavinski · Wednesday at 2:55 PM

eek2121 said:
The thing that irked me was that Geekbench did it RIGHT AWAY with Apple, despite not a single app supporting SME.

Apple Intelligence probably uses SME on M4.

poke01 · Wednesday at 6:36 PM

All arm64 SoCs will use SME this year and Primate did say SME scores are comparable to other SME enabled CPUs. So we can compare ARM chips from other manufacturers again starting with this years releases.

Apple has almost launched ARM features before ARM themselves, it’s not Apples fault that Apple is ahead.

poke01 · Wednesday at 6:38 PM

eek2121 said:
do hope we get some additional details on the NVIDIA platform pretty soon. It is actually the first ARM platform outside of the Raspberry Pi that has me interested

Why? Without a custom CPU it’s meh. Only the GPU is interesting but then you can just get a RTX 5090 or RTX 6000 Pro with an x86 CPU.

I guess the CUDA with 128GB of RAM for $3000 is interesting but other than that it’s completely boring.

adroc_thurston · Wednesday at 6:45 PM

igor_kavinski said:
Apple Intelligence probably uses SME on M4.

Pretty sure Apple ML stack is heavy on GPU for slow stuff, and ANE for anything real time.

gdansk · Wednesday at 6:45 PM

igor_kavinski said:
Apple Intelligence probably uses SME on M4.

And how many months after GB6,3 was Apple blessed with Intelligence? Investors wonder!

adroc_thurston · Wednesday at 6:50 PM

adroc_thurston said:
Pretty sure Apple ML stack is heavy on GPU for slow stuff, and ANE for anything real time.

https://twitter.com/x/status/1932920481746997502

Okay I was wrong it's all on ANE.
SME is there for Geekbench padding.

Doug S · 2025-06-12T13:59:17-0400

MS_AT said:
You browse internet without javascript? unless you mean to say that browser performance is irrelevant to the market.

I'm talking about using fp specific benchmarks designed to test the fp, not general purpose benchmarks that happen to include some fp.

The way Javascript uses fp doesn't make them fp benchmarks because using it for ALL numbers (which IMHO was the stupidest design decision in any language in the past 30 years) means almost all of it is simple stuff such as loop counters and the like, not the kind of tasks fp benchmarks do with say unrolled loops of a million muladds where it becomes more about how wide a CPU's fp scheduler is and how many loads/stores it can issue per cycle.

Doug S · 2025-06-12T14:58:21-0400

adroc_thurston said:
SME is there for Geekbench padding.

There are many uses for matrix multiplication that have nothing to do with AI you know.

adroc_thurston · 2025-06-12T15:03:07-0400

Doug S said:
There are many uses for matrix multiplication that have nothing to do with AI you know.

I know what GEMM is and no, you're not doing it on dinky client CPUs.

eek2121 · 2025-06-12T16:44:22-0400

poke01 said:
Why? Without a custom CPU it’s meh. Only the GPU is interesting but then you can just get a RTX 5090 or RTX 6000 Pro with an x86 CPU.

I guess the CUDA with 128GB of RAM for $3000 is interesting but other than that it’s completely boring.

because GeForce. We get ARM Windows drivers for GeForce, and Geforce GPUs with the platform.

DavidC1 · 2025-06-12T17:39:25-0400

eek2121 said:
I used GB5 quite a bit to determine whether a piece of hardware was performing as it should, and to a lesser extent, overclocking/undervolting. It ran fast, results were reproducible (again, in a controlled environment), and they were roughly comparable, even across chip architectures. They butchered it all, and as a result, I've not bothered to buy GB6, so congrats, I guess? 🤣

They were defeaturing it since GB5? You need to add some specific commands to sort. It used to be you can sort peak scores, and also filter out using OS(Linux/Android is 5-10% faster than Windows). And GB4 used to separate out the results into ST Int, ST FP, MT Int, MT FP, and Cryptography. Now you need to look at it one by one and combine them yourself.

The defeaturing is following the mobile trend, where everything has to be simpler/cut to fit for a 5-inch touchscreen device.

Io Magnesso · 2025-06-12T21:48:40-0400

To be honest, how about putting the coprocessor performance in a single-threaded score that is shared among the cores
I wonder...

Discussion NV Re-Enter ARM PC market in 2025!

Lifer

Diamond Member

Diamond Member

Senior member

Diamond Member

Diamond Member

Senior member

Diamond Member

Senior member

Diamond Member

Lifer

Diamond Member

Diamond Member

Diamond Member

Diamond Member

Diamond Member

Diamond Member

Diamond Member

Diamond Member

Diamond Member

Golden Member

Junior Member