Truth or Myth?: Is SYSmark a Reliable Benchmark?

MrTeal · Jan 20, 2016

TheELF said:
You are making it more complicated than it is but yes.
Baseline of sysmark is 1000,in the video the intel is 98.7% of that and the amd is 65.9% of that,that's a ~33% difference and not the claimed 50% .
It's in line of the single speed performance advantage that intel has.

So, if someone were to say Intel has a 200% single speed performance advantage vs an E-350, would you think that Intel scores 1000 sysmarks and the AMD netbook scores -1000 sysmarks?

TheELF · Jan 20, 2016

Half of a 100 ( % ) (1000 points) is 50 ( % ) (500 points) and half of that is 25 ( % ) (250 points)

200% lower speed does not mean 200% lower score in a bench.

Markfw · Jan 20, 2016

TheELF said:
You are making it more complicated than it is but yes.
Baseline of sysmark is 1000,in the video the intel is 98.7% of that and the amd is 65.9% of that,that's a ~33% difference and not the claimed 50% .
It's in line of the single speed performance advantage that intel has.

Depends on the wording. 98.7 is roughly 50% more than 65.9, but the inverse that 65.9 is roughly 33% lower than 98.7. You can really play tricks with statistics.

TheELF · Jan 20, 2016

Markfw900 said:
Depends on the wording. 98.7 is roughly 50% more than 65.9

But it's a freaking comparison towards the baseline and not between the two numbers themselves,it's "each out of 100" and not "a out of b" .
It's like you say wording ,it's indeed a 50% delta like they say in the video which does not mean 50% difference in performance like they try to make people believe to discredit sysmark.

AtenRa · Jan 20, 2016

Intel's score of 987 is almost 50% faster than AMD's score of 659

That means that Intel CPU that scored 987 is almost 50% faster than the AMD CPU that scored 659.

OK now ??

TheELF · Jan 20, 2016

AtenRa said:
Intel's score of 987 is almost 50% faster than AMD's score of 659

No, it's not.

AtenRa said:
That means that Intel CPU that scored 987 is almost 50% faster than the AMD CPU that scored 659.

No, it's not.

Pay more attention.

MrTeal · Jan 20, 2016

TheELF said:
No, it's not.

No, it's not.

Pay more attention.

Is 90mph 50% faster than 60mph?

AtenRa · Jan 20, 2016

TheELF said:
No, it's not.

No, it's not.

Pay more attention.

You really have to understand the concept of More/Faster vs Less/Slower

Intel has 9 goats
AMD has 6 goats

Intel has 50% more(faster) goats than AMD

or

AMD has 33% less(Slower) goats than Intel

DrMrLordX · Jan 20, 2016

The real question is: what are they doing with all those goats?!?!

MrTeal · Jan 20, 2016

DrMrLordX said:
The real question is: what are they doing with all those goats?!?!

It's Texas, what do you think?
http://www.thegoatrun.com/

TheELF · Jan 20, 2016

AtenRa said:
You really have to understand the concept of More/Faster vs Less/Slower

Intel has 9 goats
AMD has 6 goats

Intel has 50% more(faster) goats than AMD

or

AMD has 33% less(Slower) goats than Intel

And yet AMD focusses on the 50 and not on the 33 just to bad talk bepco.
The average workload of typical offices is mainly single threaded with a bit of multithreaded,that's what sysmarks tests and that's why you only get a ~30% difference instead of the pure difference in single threaded.

But 50% sounds more like there is something fishy going on.

DrMrLordX · Jan 21, 2016

MrTeal said:
It's Texas, what do you think?
http://www.thegoatrun.com/

At least the Sudanese are having a good time. Try lecturing them about SYSmark sometime, they'll probably get pissed off and leave.

Dresdenboy · Jan 21, 2016

ShintaiDK said:
But, but, but...RTG favours Intel CPUs!

That remembers me of the Quantum and Polaris discussions. As they pair their GPUs with good CPUs available on the market, they probably improve their drivers for the current majority of CPUs paired with the AMD cards by using this compiler. But as with SPEC, it might just the, that other compilers are simply worse.

Dresdenboy · Jan 21, 2016

In general, while some here circle around some red meat like hungry dogs, I'd ask, what would be a good way to measure user experience.

Example 1:
One video editing friend didn't get the expected perf. boost while going from some i5 IB to a i7-4790K with 2x RAM, SSD and other improvements. While the benchmark-visible encoding speed improved significantly (done over night), the editing with multiple streams itself is as stuttery and laggy as before. So how should we measure this?

Example 2:
There was this G-Sync/FreeSync comparison video done by some "techperts" on YT. They discovered, that while FPS, frame time distribution, etc. looked fine and about the same for both systems (IIRC), the actual lag (mouse click -> visible reaction on screen) was very different.

So I want a snappy system. How does that match with different Benchmarks? One could also bring the argument, that the skin-temp controlled turbo peaks, which surely improve user experience, get lost while running lengthy benchmarks.

Dresdenboy · Jan 21, 2016

The Stilt said:
I think AMD is pretty comfortable with Intel compilers, since they compile their display drivers and ACML library with Intel compilers. Surely they wouldn´t do that if it produced code which is crippled on their CPUs :sneaky:

I have two hypotheses: 1) Focus on the typical CPU matched with an AMD dGPU. 2) Performance of other compilers.

coercitiv · Jan 21, 2016

Dresdenboy said:
Example 1:
One video editing friend didn't get the expected perf. boost while going from some i5 IB to a i7-4790K with 2x RAM, SSD and other improvements. While the benchmark-visible encoding speed improved significantly (done over night), the editing with multiple streams itself is as stuttery and laggy as before. So how should we measure this?

Bottleneck was somewhere else, maybe even the software itself. For example I can tell you from experience that some (media related) tasks are single-threaded, resulting in near-immunity to hardware upgrades. (10-15% ST performance upgrade is equal to zero when system is way behind user needs)

Dresdenboy said:
Example 2:
There was this G-Sync/FreeSync comparison video done by some "techperts" on YT. They discovered, that while FPS, frame time distribution, etc. looked fine and about the same for both systems (IIRC), the actual lag (mouse click -> visible reaction on screen) was very different.

Audiophiles solved this problem long time ago, through the simple use of blinded testing. Some results are downright hilarious.

Dresdenboy said:
So I want a snappy system. How does that match with different Benchmarks?

You ask the Anandtech forum. We are the go to collective benchmark

On a more serious note, enthusiasts/professionals corroborate experience with benchmark results in order to make a more informed decision, normal users are simply blindsided and their best long-term choice is to always buy mid-range components. It's like playing poker with slightly better odds. (joking again, odds are worse)

MrTeal · Jan 21, 2016

coercitiv said:
Audiophiles solved this problem long time ago, through the simple use of blinded testing. Some results are downright hilarious.

Doing blind or double blind testing is a lot easier in audiophile land. It'd be interesting to see how you could implement a good testing regime for G-Sync/Freesync as a consumer. It'd be really interesting to see internal testing from a monitor manufacturer that could implement GSync on one port and Freesync on another.

coercitiv · Jan 21, 2016

MrTeal said:
Doing blind or double blind testing is a lot easier in audiophile land.

Why would it be harder in computer land? Even in audiophile land you test products, not isolated technologies.

MrTeal · Jan 21, 2016

coercitiv said:
Why would it be harder in computer land? Even in audiophile land you test products, not isolated technologies.

Yeah, but it's relatively easy to construct an ABX box for an amplifier that switches the speaker and input leads, preferably double blind without the tester even knowing what the DUT is. Even testing speakers, swapping out the driven signal with an ABX box doesn't provide a visual clue as to which speaker is being driven. Obviously there's an acoustic difference, but since you're testing that it's the desired outcome.

There's no dual G-Sync and Freesync panels, nor are their GPUs that can drive both Freesync and G-Sync. To do subjective testing of the system as a consumer with off-the-shelf parts, you would need both monitors, ideally from the same manufacturer, definitely with the same panel. You'd need to visually obscure which monitor is being used, so you'd need to mod the case so they both look the same to the user. Double blind would be almost impossible, but you could isolate the tester (or the tech doing the monitor swap) from the subject behind a screen so the subject can't see the tester placing the monitor in front of them.
You'd also need to deal with the video cards, probably the easiest way would be to build two identical systems with a fresh Windows install on each. A KVM could be used to swap inputs without the subject knowing which system was in use. The subject would need to be constrained on using the system though, the remote testing would have to load up the game so the subject doesn't get a glimpse of the CCC icon or something that would indicate which system is in use.

Even then, just as a test of Freesync vs G-Sync, you still have a huge uncontrolled variable in that one needs an AMD GPU and one needs an nVidia GPU, so even your results can be compromised by inherent advantages/disadvantages in the GPUs themselves vs the display tech.

Double blind testing in audio land is much simpler than this.

Cerb · Jan 21, 2016

Dresdenboy said:
One video editing friend didn't get the expected perf. boost while going from some i5 IB to a i7-4790K with 2x RAM, SSD and other improvements. While the benchmark-visible encoding speed improved significantly (done over night), the editing with multiple streams itself is as stuttery and laggy as before. So how should we measure this?

Service deadlines met, how far off some high percentile of missed service times were, like 99-99.5 (just to remove occasional spikes, that may not be indicative of repeatable problems). A perfect score would be possible, with some given workload. Not easy to benchmark, without cooperation from the software company in question, and assuming they implemented means in the code to log performance metrics.

There was this G-Sync/FreeSync comparison video done by some "techperts" on YT. They discovered, that while FPS, frame time distribution, etc. looked fine and about the same for both systems (IIRC), the actual lag (mouse click -> visible reaction on screen) was very different.

That one may be quite interesting, since you'd need a way to match up particular frames displaying on the monitor with input actions, using an external detection/recording device. However, if you could track the mouse data to a given frame's rendering, then it wouldn't be too difficult, after that, just a lot of work to implement.

No-Vsync results could be used to get a baseline.

Overall, the nature of individual programs going multithreaded for performance, and getting various bits of GPU-acceleration, has made composite benchmarks pretty much useless. It's good to know what will speed up Photoshop, or Premier. But that won't necessarily apply to other programs the same way. Even if it were completely fair, a composite benchmark today, like Sysmark, would mostly be good to show how bad that A4 is, compared to a Pentium, and not much anything finer-grained.

coercitiv · Jan 22, 2016

MrTeal said:
There's no dual G-Sync and Freesync panels, nor are their GPUs that can drive both Freesync and G-Sync.

Again, you are testing products, not technologies. You pick a benchmark criteria, "same budget" / "best of the best" etc, and run with that: GPU + display combos (AMD+FreeSync monitor, Nvidia+G-sync). The only difference is that in this case, your users report subjective experience without knowing what brands are being tested. They don't even have to know you're testing monitor sync tech. Nobody needs FreeSync or G-sync in a vacuum, and although it would be nice to have proper data on that, a good subjective test may actually point you towards appropriate benchmark criteria for a more classic evaluation.

Keep in mind these technologies were built especially for that, improve subjective experience. Fail to do that, either through tech or product ecosystem weakness, and it doesn't matter anymore. At least not for the consumer.

PS: even in scenarios with very limited resources, in which users are likely to know what techs/brands are being tested, one can still ask the user to identify the brand. It's amazing what data you can extract even when the user (unconsciously) wants to guide you on a certain path.

mysticjbyrd · Jan 23, 2016

Burpo said:
.. "doesn't absolve intel of potential wrongdoing"?

That's funny..
Maybe I shouldn't be absolved of potentially stealing a car.. lol

If a car was missing, and we a known car thief, and the prime suspect, then I am sure you would love to be absolved of guilt.

Truth or Myth?: Is SYSmark a Reliable Benchmark?

Diamond Member

Diamond Member

Moderator Emeritus, Elite Member

Diamond Member

Lifer

Diamond Member

Diamond Member

Lifer

Lifer

Diamond Member

Diamond Member

Lifer

Golden Member

Golden Member

Golden Member

Diamond Member

Diamond Member

Diamond Member

Diamond Member

Elite Member

Diamond Member

Golden Member