Question Geekbench 6 released and calibrated against Core i7-12700


moinmoin

Diamond Member
Jun 1, 2017
5,204
8,366
136
Geekbench was never meant to be, nor ever has been, a good workload by which to judge servers.
There is a distinct lack of good cross-platform benchmark suites. While not optimal, GB4 and GB5 were rather serviceable. Now the MT score is pretty much useless for any kind of system, not just servers.
 
  • Like
Reactions: soresu

moinmoin

Diamond Member
Jun 1, 2017
5,204
8,366
136
On a less serious note, seems there are currently some really powerful ARM cores out there :D

[attached screenshot of Geekbench results]


 

roger_k

Member
Sep 23, 2021
102
219
86
GB6 also including trivially parallelizable tasks is meaningless if the overall MT score barely reflects the actual core count. Ideally GB would split up MT scores between workload tests using only a limited number of cores and parallelizable tests extending to all available threads.

It does reflect the actual core count and it does use all cores for every test. I think I’ve explained it in my post.

Of course, it is entirely possible that some of the algorithms used by GB6 for multi-core runs could be suboptimal.
 

Doug S

Diamond Member
Feb 8, 2020
3,125
5,372
136
I think the "unrealistic expectations" part is spot on here. With current set of tests, there are not that many where AVX512 could help much with direct recompile of same code. Maybe photo library stuff, photo filter.
The other tests could hardly use AVX512 without proper handwritten code paths and that would defeat multi platform and vendor "agnostic" part of the tests - ARM guys would ask for SVE3 with 666bit vectors and so on. In previous GB5 we had retarded outliers like say FFT or encryption that were directly calculating "throughput" and would probably make sense to double FLops just by using some vendor library or some short code that gets autovectorized and support AVX512.
Even then CPU vendors found that it is easier to pad the score by including some 256bit V_AES instruction with ridiculous throughput that made some new laptop beat a whole server in AES encryption throughput ( and of course utterly suck in real world, as to actually serve up content for encryption and deliver it/from requires actual server).

So GB6 in my opinion is a great desktop/workstation performance test that emphasizes the way people actually use CPUs in 2023, and is brave enough to shatter the illusions of people who disagree that 8 strong cores are plenty and that the rest (be it 8 more strong cores or 16 marketing cores) gives very diminishing gains. Kudos to them for not catering to the Cinebench/DC runner crowd.


Just playing Devil's Advocate here for a moment, the argument against your position would be that real-world applications aren't self-contained. They all make plenty of calls to system libraries. People don't re-invent the wheel and code e.g. their own sort routine in their application, unless their application's entire purpose is to provide a better/faster sort. 99.999% of applications will call a sort function provided by the system libraries. Over the years system-provided libraries have expanded, and today they might include stuff you could only dream about in the past, like matrix multiplication, functions that are part of a rendering pipeline, or functions that perform the entire rendering task themselves. An application whose primary purpose is rendering won't call those, because hopefully it will do a better job. But if the application has another purpose and rendering is incidental, it will call the system's routine even if it isn't the fastest one around or doesn't produce the highest quality results. Where it helps, these libraries may include some hand-tuned SIMD code, which may include an AVX2 code path for CPUs that support it.
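To make that concrete, here's a minimal sketch of the difference between rolling your own routine and calling whatever optimized library the platform ships. The naive loop and the BLAS call compute the same thing; which BLAS backs cblas_sgemm (Accelerate on macOS, OpenBLAS/MKL elsewhere) is an assumption about the system, not something the code controls.

#include <cblas.h>  // assumes some BLAS is installed; on macOS this comes via Accelerate

// Hand-rolled matrix multiply: portable C++, but the compiler rarely turns
// this into anything close to hand-tuned SIMD.
void naive_sgemm(int n, const float *a, const float *b, float *c) {
    for (int i = 0; i < n; ++i)
        for (int j = 0; j < n; ++j) {
            float sum = 0.0f;
            for (int k = 0; k < n; ++k)
                sum += a[i * n + k] * b[k * n + j];
            c[i * n + j] = sum;
        }
}

// The "call the system library" path: same result, but the BLAS the platform
// provides is typically hand-tuned per ISA (AVX2/AVX-512, NEON, ...).
void library_sgemm(int n, const float *a, const float *b, float *c) {
    cblas_sgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                n, n, n, 1.0f, a, n, b, n, 0.0f, c, n);
}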

So if Microsoft provides a library that most applications call for a certain task, should a benchmark like Geekbench use that library in Windows builds, or should it include C code that provides the same function but would almost certainly be slower, since compilers do a pretty poor job of converting C code into SIMD code? Sure, they CAN, but in almost all cases you have to hand-tune the source code and write it "just so" for the compiler to understand that it can convert it to SIMD instructions. Taking it further, you will with certainty run into a situation where you have to write the source code "just so" for Microsoft's compiler to turn it into some sort of AVX instructions, but the "just so" for Apple's compiler to turn it into NEON instructions is different, as is the source code for the Android compiler. Maybe Qualcomm releases their Nuvia cores along with their own tweaked Android compiler that will generate SVE2 instructions if you write it "just so" in yet another way.
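As a small illustration of the "just so" problem (function names invented; __restrict is a common GCC/Clang/MSVC extension, not standard C++), the same loop written two ways can vectorize very differently:

// May alias: the compiler can't prove x and y don't overlap, so it either
// keeps the loop scalar or adds a runtime overlap check before vectorizing.
void saxpy_plain(float a, const float *x, float *y, int n) {
    for (int i = 0; i < n; ++i)
        y[i] = a * x[i] + y[i];
}

// __restrict promises no overlap, which is usually enough for the auto-vectorizer
// to emit SSE/AVX (or NEON on ARM) without any hand-written intrinsics.
void saxpy_restrict(float a, const float *__restrict x, float *__restrict y, int n) {
    for (int i = 0; i < n; ++i)
        y[i] = a * x[i] + y[i];
}

And each compiler has its own set of such hints, pragmas and attributes, which is exactly the per-platform tuning problem.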

So do you have different hand tuned source code for the Windows, iOS and Android benchmarks for each to use SIMD? Do you use Google's or Qualcomm's compiler to build your Android version? Maybe you only use the Qualcomm compiler to build a couple subtests that can benefit from SVE2 and use Google's on the rest? What if you find out later that if you tweaked the code a bit you could get 50% more SIMD performance on iOS? Do you include that in the next point release update and leave everyone to wonder why the iPhone score in a few benchmarks skyrocketed? What if you could get even more performance on iPhone by using Apple's AMX instructions, do you figure out how to get the compiler to generate those from your source code, or decide it isn't worth the effort?

If you're calling libraries in your benchmarks then OS updates might raise the scores of the same benchmark. Imagine a Windows system that scored 2000 suddenly scoring higher after a Windows update that refreshed those libraries. That's going to cause all sorts of confusion and set the conspiracy theorists loose on forums like this one! What's more, the libraries could make hardware calls - Apple's libraries might call some sort of fixed-function hardware, or maybe the GPU, to help with certain tasks. In one sense that's totally fair, because that's how real-world applications that call those libraries would perform; the application writer doesn't care if the CPU is performing a function or some sort of black-box hardware Apple includes on the SoC. On the other hand, that's getting away from a 'CPU' benchmark if it isn't even the CPU performing the tasks. So then the question is, do you want a CPU benchmark or a system benchmark? If you want a CPU benchmark and restrict everything to the CPU, you are shortchanging platforms that have additional hardware, and producing mostly worthless results for tasks that would be better performed on a GPU. If you allow using other facilities, then people who want to compare CPUs are getting worthless results.
 

JoeRambo

Golden Member
Jun 13, 2013
1,814
2,105
136
@Doug S those are good points that I agree with in general, but we have to draw a line somewhere, as there are benchmarks from all camps:

1) Benchmarks where source code is available - for example SPEC, which follows what you wrote to the letter; they even allow overriding the malloc library and using a custom heap allocator. Then the compiler and vendor games start in earnest, and results between two vendors or even two systems are not really comparable - who knows what they mean? Vendors have to disclose setup and flags, but all sorts of crazy compiler and malloc tuning has happened in the past.
Even when testing is done by one outlet, like AnandTech used to do, it still leaves a lot of questions: should we use the latest compiler, what do we do when vendor A's compiler produces code that actually scores better on vendor B's system but nobody will use it in the real world, etc.

2) Black-box benchmarks that focus on some narrow part of the performance spectrum; the usual suspects are well known. At some point they face the same conundrum of benchmark evolution and "optimization" as (1): say some sort of hardware video encoding acceleration is now available, should it be represented in their benchmark? What if the whole field has changed and nobody is using the CPU to render images anymore? What about Apple's AMX acceleration if they test some CPU AI/DL workload? Lots of questions here.

3) The effort from the company that produces the GB suite, which is closed source with binaries provided, but which seems to get better with its benchmarks over the years. They are not really calling any libraries and target the CPU itself, which I believe is important on mobile and desktop. I have already posted above that I like their choice of benchmarks: gone are the AES tests, gone is FFT. Heck, even things like "JPEG encode speed" are no longer tested -> mobile phones have dedicated hardware, and on desktop, if anyone needs a mass JPEG processing pipeline for professional work they need their head checked if they are not using specialized libs and/or CUDA. In its place is a workload that uses DL to classify images and deal with tags -> a much wider task than just encoding/decoding JPEG.


See for yourself what they test; it makes a lot of sense to me.
 

lightmanek

Senior member
Feb 19, 2017
508
1,245
136
I don't think it's that much off for code compile. Compilation is known to show diminishing returns with increasing core count.


Here you can see how doubling the amount of cores (with the same power/frequency per core) doesn't reduce the time in half. The scaling observed there isn't that different from what we see in GB6.


As always, there is truth to that, but the fact is that a lot depends on which compiler you're using, the type and size of the project, and your target language, so your mileage may vary a lot.
My brother does reasonably large C/C++ projects on his laptop, and after increasing the core count by 50% he got almost exactly a 50% speedup in the projects he works on.
Besides, game developers often offload compile tasks to a dedicated box that can run multiple compile jobs in parallel.
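To put rough numbers on the diminishing-returns point quoted above, here's a minimal Amdahl's-law sketch (the 20% serial fraction is invented purely for illustration):

#include <cstdio>

int main() {
    // Invented assumption: 20% of a build (linking, serial codegen, I/O) doesn't parallelize.
    const double serial_fraction = 0.20;
    const int core_counts[] = {4, 8, 16, 32};
    for (int cores : core_counts) {
        double speedup = 1.0 / (serial_fraction + (1.0 - serial_fraction) / cores);
        std::printf("%2d cores -> %.2fx speedup\n", cores, speedup);
    }
    return 0;
}
// Prints 2.50x, 3.33x, 4.00x, 4.44x: doubling the cores never halves the
// build time once the serial part starts to dominate.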
 
  • Like
Reactions: Vattila

Exist50

Platinum Member
Aug 18, 2016
2,452
3,105
136
No, Xiaomi is known to boost scores - and why would the bug affect only Xiaomi devices? Other Android devices are fine.
Such inflated numbers are clearly a bug. When companies game benchmarks, historically that's been accomplished with whitelists and higher power limits/overclocks. That nets you a couple more percent, not a 3x jump.

And Xiaomi is far from the only company that's played benchmark shenanigans either.
 

roger_k

Member
Sep 23, 2021
102
219
86
As always, there is truth to that, but the fact is that a lot depends on which compiler you're using, the type and size of the project, and your target language, so your mileage may vary a lot.

This is undeniably true. I just wanted to point out that sometimes this stuff is not trivial.

My brother does reasonably large C/C++ projects on his laptop, and after increasing the core count by 50% he got almost exactly a 50% speedup in the projects he works on.

Did he get a new laptop with exactly the same specs and 50% more cores, or was it maybe a new generation with more cache, architectural improvements, a faster SSD etc.? It's very easy to draw spurious generalisations with these things.

Besides, game developers often offload compile tasks to a dedicated box that can run multiple compile jobs in parallel.

That's very true, but here we are again at a server use case, which we are explicitly excluding.
 
  • Like
Reactions: lightmanek

moinmoin

Diamond Member
Jun 1, 2017
5,204
8,366
136
It does reflect the actual core count and it does use all cores for every test.
You are plain misleading people with statements like this. There is a huge difference between having all cores available and effectively scaling across all cores available. For a single workload you can only measure a chip's MT performance up to the point where that workload stops scaling with more threads. Below that point the chip is the bottleneck; above it, the workload is. The more cores a chip has, the more workloads become incapable of effectively scaling across all of them, and the more GB6's MT score is hampered and made misleading by the ever larger share of benchmarks bottlenecking themselves. With GB6's MT score on bigger chips you are not measuring the overall MT performance of the chip, but a mix of ST performance limited to the cores effectively used and of workloads incapable of scaling across the remaining idle cores.
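A toy calculation (all numbers invented) illustrates the mix: once a workload stops scaling at, say, 12 threads, every chip with 12 or more cores posts the same number, no matter how many cores sit idle.

#include <algorithm>
#include <cstdio>

int main() {
    // Invented numbers: 100 points per effectively used core,
    // and a workload that stops scaling past 12 threads.
    const double per_core_score = 100.0;
    const int workload_scaling_limit = 12;
    const int chip_cores[] = {8, 16, 32, 64};
    for (int cores : chip_cores) {
        int effective = std::min(cores, workload_scaling_limit);
        std::printf("%2d-core chip -> \"MT\" score %.0f\n", cores, effective * per_core_score);
    }
    return 0;
}
// 8 -> 800, then 16/32/64 all -> 1200: past the workload's limit the score
// reflects the workload's bottleneck, not the chip's MT throughput.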

Nobody is interested in a score representing workloads bottlenecking themselves.
 

JoeRambo

Golden Member
Jun 13, 2013
1,814
2,105
136

Some background information about the choices, straight from the creator.

In Geekbench 6, the biggest change is probably the way multi-core scores are calculated, measuring "how cores cooperate to complete a shared task" rather than assigning different tasks to each core. This is meant to better reflect how actual multi-core workloads operate, especially for hybrid CPU architectures that mix big, fast cores and small, power-efficient ones, an ever-growing category of chips that includes most modern ARM processors and Intel's 12th- and 13th-generation CPUs.
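A minimal sketch of the difference (the work function and the splitting are invented, not Geekbench's actual code): the old model hands each thread its own complete copy of a task, while the new one has all threads cooperate on one shared dataset.

#include <cstddef>
#include <thread>
#include <vector>

// Hypothetical stand-in for a benchmark kernel.
void process(std::vector<float>& data, std::size_t begin, std::size_t end) {
    for (std::size_t i = begin; i < end; ++i)
        data[i] = data[i] * 2.0f + 1.0f;
}

// GB5-style MT: every thread runs an independent, identical task.
void separate_tasks(unsigned threads, std::size_t task_size) {
    std::vector<std::vector<float>> copies(threads, std::vector<float>(task_size, 1.0f));
    std::vector<std::thread> pool;
    for (unsigned t = 0; t < threads; ++t)
        pool.emplace_back([&, t] { process(copies[t], 0, task_size); });
    for (auto& th : pool) th.join();
}

// GB6-style MT: all threads cooperate on one shared task, splitting the data.
void shared_task(unsigned threads, std::size_t task_size) {
    std::vector<float> data(task_size, 1.0f);
    std::vector<std::thread> pool;
    std::size_t chunk = task_size / threads;
    for (unsigned t = 0; t < threads; ++t) {
        std::size_t begin = t * chunk;
        std::size_t end = (t + 1 == threads) ? task_size : begin + chunk;
        pool.emplace_back([&, begin, end] { process(data, begin, end); });
    }
    for (auto& th : pool) th.join();
}

In the shared-task model, coordination overhead and any load imbalance between big and little cores show up directly in the score, which is the "cooperation" the quote is describing.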
 

moinmoin

Diamond Member
Jun 1, 2017
5,204
8,366
136
Some background information about the choices, straight from the creator.
That sounds like the "MT" score is mainly for comparing hybrid processors to otherwise similar non-hybrid processors.

Not that I mind (aside from the misleading title for the score). But that's a huge scheduler can of worms they get into right there. For hybrid processors it's essentially benchmarking how well the scheduler works.
 

Abwx

Lifer
Apr 2, 2011
11,783
4,692
136

Some background information about the choices, straight from the creator.

They are explicitly stating that GB6 is relevant for DT x86 CPUs, yet their MT scores are totally irrelevant for a lot of DT usages.

Funniest of all is that they removed XTS/AES, so can we conclude that such instructions are no longer used in phones, for instance?

Their "explanations" are actually total BS.
 

JoeRambo

Golden Member
Jun 13, 2013
1,814
2,105
136
Funniest of all is that they removed XTS/AES, so can we conclude that such instructions are no longer used in phones, for instance?

Pure AES was removed because:

1) It is unrealistic for a phone or desktop to need to encrypt/decrypt as much data per second as even the lowliest ARM phone SoC was capable of.
2) Real-world usage like VPNs doesn't even stress those paths.
3) CPU vendors started to abuse the AES tests by providing vector SIMD AES instructions that are not needed on phone/desktop but were good at inflating the score.
4) AES probably lives on in their browser subtest, but as part of a workload now.

It's their benchmark and they are free to evolve it; for some vendor fans everything will be a conspiracy to show their hybrid, multi-die or NUMA-node-without-memory architecture in a bad light.
Your opinion is sadly worth nothing without proper arguments; go create your own benchmark - the guy in the article was unhappy with the status quo and did just that.
 

Tup3x

Golden Member
Dec 31, 2016
1,231
1,343
136
I kinda understand the reasoning behind the MT score (or how that test works). Currently it seems to measure lightly threaded performance. In most cases that's how applications scale - not that well past a certain thread count. In almost all cases normal consumer CPUs would wipe the floor with server CPUs in normal tasks (gaming, word processing, browsing...). That being said, it would be nice if they added a third score that makes maximum use of the CPU.
 

Doug S

Diamond Member
Feb 8, 2020
3,125
5,372
136
@Doug S those are good points that I agree with in general, but we have to draw a line somewhere, as there are benchmarks from all camps:

1) Benchmarks where source code is available - for example SPEC, which follows what you wrote to the letter; they even allow overriding the malloc library and using a custom heap allocator. Then the compiler and vendor games start in earnest, and results between two vendors or even two systems are not really comparable - who knows what they mean?


I think the SPEC benchmarks are very valuable since the source code is fixed, but updating them so infrequently leads to the compiler games you mention. Instead of tweaking the source "just so" so the compiler generates SIMD instructions they are basically tweaking the compiler (or at least they did in the workstation RISC days, since every vendor had their own compiler) for the same end result.

My big issue with SPEC isn't that, though (it is well known which benchmarks are "broken" by compiler tricks, so you can simply ignore those results and adjust for their contribution when looking at the geomean scores); it is that they allow PGO feedback in the base scores. I don't agree at all with their logic for doing so, and think it renders them useless. Instead of having a base and a peak score, you have two different flavors of peak score.

I get the argument that it takes a long time to put together benchmarks like SPEC and the people involved aren't doing it as their full-time job, so no one is expecting them to produce a new suite every year. But they could either tweak benchmarks that have been "broken" to undo that, or at the very least remove the ones that have been broken. Simply knowing that would happen would be enough to remove the incentive to invest the time required to break them in the first place. I don't follow SPEC as closely as I used to, so I'm not sure if any of the SPEC2017 suite has been broken yet, but if vendors knew a broken benchmark would trigger a SPEC2018 or whatever, hopefully they wouldn't bother.
 

Exist50

Platinum Member
Aug 18, 2016
2,452
3,105
136
That sounds like the "MT" score is mainly for comparing hybrid processors to otherwise similar non-hybrid processors.

Not that I mind (aside from the misleading title for the score). But that's a huge scheduler can of worms they get into right there. For hybrid processors it's essentially benchmarking how well the scheduler works.
He's saying the change is to better reflect real workloads, which just happens to include how they utilize hybrid core schemes. Are you seeing some similar, comparable workload with radically different characteristics?

Though I do agree that a tweak in naming wouldn't be a bad idea. Something like "mixed parallel", maybe? Or just "parallel". I think that would be decent.
 
  • Like
Reactions: Lodix

moinmoin

Diamond Member
Jun 1, 2017
5,204
8,366
136
He's saying the change is to better reflect real workloads, which just happens to include how they utilize hybrid core schemes. Are you seeing some similar, comparable workload with radically different characteristics?
Well, the quote is:
"In Geekbench 6, the biggest change is probably the way multi-core scores are calculated, measuring "how cores cooperate to complete a shared task" rather than assigning different tasks to each core. This is meant to better reflect how actual multi-core workloads operate, especially for hybrid CPU architectures"

The "especially" in there carries different weight for me than your "just happens". Maybe they tried to avoid the "Cinebench accelerator" trap that way? If so, the exact opposite is actually happening right now, with the 13900K/F/S filling pages of "MT" results before chips with more cores appear.

How cores cooperate to complete a shared task is a behaviour that's mainly enforced by schedulers, unless the software itself specifies which cores to use and how. Whether the "MT" workloads in GB6 do that, I have no idea.

What I know is that so far I see no use for GB6's "MT" scores as a comparative metric, since it's completely unclear what quality they are supposed to represent. Aside from "best performance for an isolated, lightly multi-threaded workload", maybe.
 

Exist50

Platinum Member
Aug 18, 2016
2,452
3,105
136
The "especially" in there carries different weight for me than your "just happens".
I think you're reading too much into it. It's just an observation about how hybrid CPUs are utilized in the real world. Most workloads don't spawn exactly N identical tasks for each core. They have a few ST-sensitive threads, and some more ST-insensitive ones.
If that was it the exakt opposite is actually happening right now, with 13900K/F/S filling pages of "MT" results before chips with more cores appear.
That's not because of all the E-cores, but rather primarily the 8 P-cores yielding strong lightly threaded performance. If someone has 8+0 vs 8+8 vs 8+16 numbers, that would probably help illustrate it.
How cores cooperate to complete a shared task is a behaviour that's mainly enforced by schedulers, unless the software applies more specific definitions what cores to use how.
I guess, but how's that an advantage for hybrid designs? If anything, it makes the benchmark harder than blindly giving each thread an identical, isolated task.
What I know is that so far I see no use for GB6's "MT" scores as a comparative metric, since it's completely unclear what quality they are supposed to represent. Aside from "best performance for an isolated, lightly multi-threaded workload", maybe.
Well that does appear to be a succinct summary of (primarily) client MT workloads, so yeah, I think that was their intention. Other, better benchmarks exist for specifically workstation/server usage.
 

moinmoin

Diamond Member
Jun 1, 2017
5,204
8,366
136
That's not because of all the E-cores, but rather primarily the 8 P-cores yielding strong lightly threaded performance. If someone has 8+0 vs 8+8 vs 8+16 numbers, that would probably help illustrate it.
I actually agree. I was already complaining earlier that the correlation of the MT score to the ST score is way too strong now.

I guess, but how's that an advantage for hybrid designs? If anything, it makes the benchmark harder than blindly giving each thread an identical, isolated task.
It's not. I was previously complaining that the new MT score moves the bottleneck from the chip to the workload. And if this MT test suite is (as I interpreted the quote) indeed targeted at benching the particular difference between hybrid and non-hybrid designs, the bottleneck on hybrid designs possibly moves onward to the scheduler.

Anyway I've been way too active in this thread. GB6's MT score is useless. That's all there is for me to say in the end, and with that I'll see myself out.
 

dark zero

Platinum Member
Jun 2, 2015
2,655
140
106
No, Xiaomi is known to boost scores - and why would the bug affect only Xiaomi devices? Other Android devices are fine.
Well, I managed to test the Poco X4 GT and the scores are the following:
ST: 1138
MT: 3406
Open CL: 3680
Vulkan: 3911

I would love to have those insane results.
That leads me to think those Xiaomi users are blatantly cheating. It also makes me wonder how Geekbench allowed such things. I hope only Xiaomi is affected, and not other brands as well.
 
Jul 27, 2020
24,226
16,887
146
M1 MacBook Air


OpenCL compute is just broken, at least on M1.

Anyone know why the macOS GB6 download is over 700 MB?