They're (Almost) All Dirty: The State of Cheating in Android Benchmarks

Roland00Address · Oct 2, 2013

ponyo said:
Say what?

Yes, some people are still using Pentium 4 desktops and single-core AMD chips. If your desktop is 7 years old or more, or your laptop is 5 years old, there is a good chance this phone is faster (unless you got the best of the best in 2006 or 2008).

Crono · Oct 2, 2013

Eug said:
Both the 5S and the 5 feel very fast for basic OS navigation and app usage. However, video export on the iPhone 5S is twice as fast as on the 5.

So yeah, even today, having more performance helps.

That's more than a year old phone versus a current phone, and I'm not sure how exporting works on the iPhone, but I don't know if it's running at 100% CPU for it or for how long. So maybe it does help, but for the most part most Android and iOS users aren't able to use all the speed of their phones right now. I'm sure most of us who even overclock our phones know how to squeeze that kind of use out of our phones, but the vast majority of users aren't complaining about clock speed, which is my point.

I'm all for more power, but where benchmarks can show a big difference between a midrange phone and the very latest and most expensive SoC from whoever, you might not necessarily see a difference in practical terms.

Roland00Address · Oct 2, 2013

Not trying to excuse the bad behavior of the phone designers.

But would it be cheating if they had an option in the settings (since they do their own version of Android this is trivial for them) where you can turn these cheats "on" or "off", yet they keep them "on" by default since everybody else does it?

You know just misleading people by not lying to them.

Eug · Oct 2, 2013

Crono said:
That's more than a year old phone versus a current phone, and I'm not sure how exporting works on the iPhone, but I don't know if it's running at 100% CPU for it or for how long. So maybe it does help, but for the most part most Android and iOS users aren't able to use all the speed of their phones right now. I'm sure most of us who even overclock our phones know how to squeeze that kind of use out of our phones, but the vast majority of users aren't complaining about clock speed, which is my point.

Well, I suspect once people get used to faster speed, they don't want to downgrade, if they're sharing a lot of video for example.

I'm just using the 5 vs 5S comparison here because they're both fast and recent phones. The 5 in benches is slower than the latest 2013 Android phones, but it's faster than many 2012 Android phones, so if we're talking speed vs. Android, the iPhone 5 is only a half-generation behind.

I'm all for more power, but where benchmarks can show a big difference between a midrange phone and the very latest and most expensive SoC from whoever, you might not necessarily see a difference in practical terms.

The other thing that isn't well demonstrated in the benchmarks is the speed in other aspects of the SoC, and the implementation. For example, the ISP speed and implementation in smartphones' cameras have gotten a nice boost in the last year. Again if we take the 5S vs. the 5, the specs sound very similar (both 8 MP cameras, with "just" a 15% image sensor size increase and an aperture increase from f/2.4 to f/2.2). However, IMO the 5S camera overall is a massive improvement over the iPhone 5, and of course the 5S gets 10 fps shooting at 8 MP and other cool features that probably would be much harder to implement well with the older ISP and CPU. For example, have a look at the video below:

http://vimeo.com/75664844

thedosbox · Oct 2, 2013

Ns1 said:
Note: I realize I'm eating foot in mouth and I don't care, educate me ATMD&G. Help me understand the difference in computing power. Preferably in layman's terms.

It's no different than cars - not all 2.0L 4-cylinders perform the same. One might have a Turbo, another might be 50% heavier, gearing could be substantially different.

Anyhow, I've always laughed at those who threw around mobile benchmark results as if they mean something. Happy to see my Moto X doesn't participate in these shenanigans though.

MrX8503 · Oct 2, 2013

Ns1 said:
I know it's shocking, but there are users with less than 2.3ghz/3gb ram under the hood.

See: Netbook/MBA users

Lol. It's not faster than a MBA. The A7 is faster than snapdragon 800 and there's no way that Apple would put an A7 in a MBA. Maybe in the future, SOCs will be in Ultrabooks but that day is not today.

I bet you also think consoles are more powerful than high end PCs.

Btw, 2.3ghz means jack shit.

grkM3 · Oct 2, 2013

How exactly is setting a performance governor in the kernel when running a benchmark app cheating?

So if the 5s kernel runs the a7 at max MHz during a bench is that not the same thing?

Mopetar · Oct 2, 2013

grkM3 said:
How exactly is setting a performance governor in the kernel when running a benchmark app cheating?

Because it's not indicative of actual, real-world behavior. The Ars review found that if they just changed the name of the benchmark that the phone wouldn't cheat and the results would be lower. Clearly the default behavior for the phone isn't to run full-throttle at all times.

So if the 5s kernel runs the a7 at max MHz during a bench is that not the same thing?

It definitely would be if it only did that when running benchmark programs. If the iPhone doesn't treat the benchmark as special (which according to Anand it doesn't) compared to other programs, then it's not cheating. That's the expected, real-world behavior of the device.
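A rough way to put a number on the renaming test: run the benchmark under its normal name and again under a changed name, then compare the two scores. The Python sketch below is only an illustration. The function names, the example scores, and the 3% noise threshold are my own assumptions, not figures from Ars or Anand.

```python
def cheat_boost_percent(score_original_name: float, score_renamed: float) -> float:
    """Percent by which the score under the original app name exceeds the renamed run."""
    if score_renamed <= 0:
        raise ValueError("renamed score must be positive")
    return (score_original_name - score_renamed) * 100.0 / score_renamed


def looks_special_cased(score_original_name: float, score_renamed: float,
                        noise_percent: float = 3.0) -> bool:
    """True if the gap is bigger than the assumed run-to-run noise."""
    return cheat_boost_percent(score_original_name, score_renamed) > noise_percent


# Hypothetical scores, purely for illustration.
boost = cheat_boost_percent(1100, 1000)   # 10.0
flagged = looks_special_cased(1100, 1000)  # True
```

If the phone treats every app the same, the two runs should land within normal noise of each other, and nothing gets flagged.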

grkM3 · Oct 2, 2013

Mopetar said:
Because it's not indicative of actual, real-world behavior. The Ars review found that if they just changed the name of the benchmark that the phone wouldn't cheat and the results would be lower. Clearly the default behavior for the phone isn't to run full-throttle at all times.

It definitely would be if it only did that when running benchmark programs. If the iPhone doesn't treat the benchmark as special (which according to Anand it doesn't) compared to other programs, then it's not cheating. That's the expected, real-world behavior of the device.

Android has governor steps in the kernel depending on load and how aggressive the maker wants to set the phone up.

If a benchmark app is supposed to bench a phone, you want to max out clocks to keep it repeatable, not have the SoC running different cores at different clock speeds and in idle states.

So if Apple sets an aggressive governor and maxes CPU speeds with light use, and some Android benchmarks are not telling the kernel to run max clocks, it's cheating?

You can change performance governors in the kernel anyways so if I go and turn max performance on my gs4 is that cheating when I go for record bench runs?

dawheat · Oct 2, 2013

Mopetar said:
Because it's not indicative of actual, real-world behavior. The Ars review found that if they just changed the name of the benchmark that the phone wouldn't cheat and the results would be lower. Clearly the default behavior for the phone isn't to run full-throttle at all times.

This is what I hope some folks who actually can explain how these benchmarking programs work can help clarify. I get that running a game on different platforms can result in varied levels of load due to the game code being optimized for a certain platform - that's how business works when you have limited resources. But why would this apply to a purely synthetic benchmarking program that's supposed to show the spec capabilities of a platform? Sure, not all everyday programs will take advantage of it all, but that depends on, and varies with, each program.

Shouldn't a benchmarking program like Geekbench run the CPU at 100%? If not, then it seems that you're at the mercy of how well that particular benchmark stresses each different platform.

Say the benchmark causes the iPhone to run both cores at 99%, the 4 cores in the S800 at 90% and the cores in the S4 Pro at 85%. It seems then you're testing how well the benchmarking program stresses each platform as much as the platform itself. Clearly the results are better when all cores are fully loaded, so why isn't the benchmarking program doing so naturally?

Someone help me understand why making a benchmark run your hardware at 100% is disingenuous - it seems to actually remove a bias that could exist in the benchmarking code.

Mopetar · Oct 2, 2013

grkM3 said:
Android has governor steps in the kernel depending on load and how aggressive the maker wants to set the phone up.

If a benchmark app is supposed to bench a phone, you want to max out clocks to keep it repeatable, not have the SoC running different cores at different clock speeds and in idle states.

But the phones don't run that way any other time, so what you get is a false sense of performance.

So if Apple sets an aggressive governor and maxes CPU speeds with light use, and some Android benchmarks are not telling the kernel to run max clocks, it's cheating?

What many Android manufacturers have done is to only adjust the SoC's clock speeds when certain benchmark apps are running. Apple and Motorola do not do that. As no one has reported inconsistent benchmark results for the phones from those manufacturers, it wouldn't appear as though they're treating them any different from any other application.

You can change performance governors in the kernel anyways so if I go and turn max performance on my gs4 is that cheating when I go for record bench runs?

If you want to manually change the way your phone's SoC operates that's your own business. However, you wouldn't try to claim that the performance that you get is the same as a stock phone.

Think of it this way: imagine that you go to a car dealership and are looking at cars. You see one that people have said has 350 HP. You take the car out on their test track and test drive it, like it, and decide to purchase it.

However, when you get it home, you notice that you're not getting the advertised level of power. When you ask the dealer about it, he calmly explains that the only time you actually get that much power from the car is when you're test-driving it (because otherwise the fuel efficiency would be terrible), and now that you've bought it, it won't operate in that mode unless you're on their test track.

If you're savvy enough, you can go online and find a hack that someone else made that will allow you to program the ECU to always run at 350 HP, but then you're getting fewer than the advertised MPG.

I'd be interested in seeing someone try to trick the phone to running in its benchmark mode for the battery life tests to see how much of a difference it makes. If they want to rig the phone to report false performance figures, then they have to accept what it will do to the battery life.

Eug · Oct 2, 2013

dawheat said:
Shouldn't a benchmarking program like Geekbench run the CPU at 100%? If not, then it seems that you're at the mercy of how well that particular benchmark stresses each different platform.

It should not. It should run how the CPU design intends it to run.

Say the benchmark causes the iPhone to run both cores at 99%, the 4 cores in the S800 at 90% and the cores in the S4 Pro at 85%. It seems then you're testing how well the benchmarking program stresses each platform as much as the platform itself. Clearly the results are better when all cores are fully loaded, so why isn't the benchmarking program doing so naturally?

These chips are NOT designed to run all cores simultaneously at 100% for extended periods. That is the point.

Samsung and friends are cheating by giving a totally false set of results that don't reflect how the chip was intended to run.

dawheat · Oct 2, 2013

Eug said:
It should not. It should run how the CPU design intends it to run.

These chips are NOT designed to run all cores simultaneously at 100% for extended periods. That is the point.

Samsung and friends are cheating by giving a totally false set of results that don't reflect how the chip was intended to run.

Are you assuming or do you actually have insight into how these programs are designed? The chips are clearly designed to run at 100% for at least some period of time before throttling. I think what you're conflating and what is my question is - if a software benchmark doesn't stress a device at 100%, then is that due to the software or hardware?

I still don't understand how a software benchmarking program doesn't result in 100% load as long as the platform can support it. Now if Samsung was exceeding their designed thermal windows only for these programs, then yes I understand the outrage, but from what I've seen on ars and other sites - that is not the case. The hardware is able to run at 100% without issues for at least as long as it takes to run the benchmark.

<edited for clarity on what my question is>

MrX8503 · Oct 2, 2013

grkM3 said:
How exactly is setting a performance governor in the kernel when running a benchmark app cheating?

Because it can't run at those speeds all the time.

grkM3 said:
So if the 5s kernel runs the a7 at max MHz during a bench is that not the same thing?

That would be cheating as well, but the 5S doesn't do this.

Ns1 · Oct 2, 2013

MrX8503 said:
Lol. It's not faster than a MBA. The A7 is faster than snapdragon 800 and there's no way that Apple would put an A7 in a MBA. Maybe in the future, SOCs will be in Ultrabooks but that day is not today.

I bet you also think consoles are more powerful than high end PCs.

Btw, 2.3ghz means jack shit.

see:

http://forums.anandtech.com/showpost.php?p=35561394&postcount=22

MrX8503 · Oct 2, 2013

Ns1 said:
see:

http://forums.anandtech.com/showpost.php?p=35561394&postcount=22

An A7 dual core at 1.3 GHz outperforming a 2.3 GHz quad Snapdragon should be your first clue.

You can never ever compare CPUs by just clock speed and cores alone. A Core i series would mop the floor with the Snapdragon.

Joe1987 · Oct 2, 2013

I always just assumed everything Samsung made cheated after the S4 was found out, am a little surprised how rampant it is.

ControlD · Oct 2, 2013

dawheat said:
Are you assuming or do you actually have insight into how these programs are designed? The chips are clearly designed to run at 100% for at least some period of time before throttling. I think what you're conflating and what is my question is - if a software benchmark doesn't stress a device at 100%, then is that due to the software or hardware?

I still don't understand how a software benchmarking program doesn't result in 100% load as long as the platform can support it. Now if Samsung was exceeding their designed thermal windows only for these programs, then yes I understand the outrage, but from what I've seen on ars and other sites - that is not the case. The hardware is able to run at 100% without issues for at least as long as it takes to run the benchmark.

<edited for clarity on what my question is>

The benchmark is simply a piece of code running on a given platform. It is up to the system as a whole to allocate the resources required to run that piece of code. If the CPU is not running up to 100% it is because the system (CPU is only one part of that) is not telling the CPU to ramp up to 100%. The program is not required to do that. That is what the benchmark is trying to show: how well does the system handle the task at hand? By pegging the CPU at 100% simply to run the benchmark you are not seeing how a real world problem similar to the benchmark might be solved.
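As a toy illustration of the system, not the benchmark, deciding how far the CPU ramps up, here is a minimal Python sketch of a load-based governor next to an always-max one. The frequency steps and the 80% threshold are made-up values, and real kernel governors are more involved than this.

```python
from typing import Sequence

# Hypothetical frequency steps in MHz; not taken from any real SoC table.
HYPOTHETICAL_STEPS_MHZ = (300, 960, 1500, 2300)


def load_based_governor(load_percent: float, steps_mhz: Sequence[int],
                        up_threshold: float = 80.0) -> int:
    """Pick the lowest step that covers the current load; jump to max when load is high."""
    steps = sorted(steps_mhz)
    if not steps:
        raise ValueError("need at least one frequency step")
    if load_percent >= up_threshold:
        return steps[-1]
    target = steps[-1] * load_percent / 100.0
    for freq in steps:
        if freq >= target:
            return freq
    return steps[-1]


def performance_governor(steps_mhz: Sequence[int]) -> int:
    """Always run at the highest step, regardless of load."""
    return max(steps_mhz)
```

With these made-up steps, a 50% load gets the 1500 MHz step from the load-based governor, while the performance governor sits at 2300 MHz no matter what. Forcing the second behaviour only for benchmark apps is what is being called cheating here.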

It would be perfectly acceptable to have a benchmark that tests platforms at full 100% utilization but that is not the case here so there is some "cheating" going on.

In the end I'm not sure how this affects much of anything. A fast phone is a fast phone, but still this kind of stuff should not be going on.

ControlD · Oct 2, 2013

MrX8503 said:
An A7 dual core at 1.3 GHz outperforming a 2.3 GHz quad Snapdragon should be your first clue.

You can never ever compare CPUs by just clock speed and cores alone. A Core i series would mop the floor with the Snapdragon.

A core series CPU would mop the floor with any phone processor be it Snapdragon or A7. They aren't even in the same league.

Eug · Oct 2, 2013

DaveStall said:
A core series CPU would mop the floor with any phone processor be it Snapdragon or A7. They aren't even in the same league.

Depends on what you're doing with it actually.

Generally what you say is true, but for specific usage, it may be false. The A7 has a bunch of other things going for it including hardware cryptography acceleration and a built-in image signal processor. It also has hardware H.264 decoding built into the SoC. For example, my Core 2 Duo MacBook does not. It can keep up with 1080p H.264 decoding but just barely, and the fans go into vacuum cleaner mode and the power usage is high since both cores are very active as the CPU struggles to accomplish this task. In contrast, even last year's Apple A6 doesn't even break a sweat for this task.

BTW, speaking of my old 2.4 GHz Core 2 Duo MacBook, it gets about 340 ms in SunSpider 1.0.1 with Safari. The iPhone 5S gets about 420 ms with Safari. I know it's just one limited benchmark, but still, it does illustrate just how fast the A7 really is for some tasks.

dawheat · Oct 2, 2013

DaveStall said:
The benchmark is simply a piece of code running on a given platform. It is up to the system as a whole to allocate the resources required to run that piece of code. If the CPU is not running up to 100% it is because the system (CPU is only one part of that) is not telling the CPU to ramp up to 100%. The program is not required to do that. That is what the benchmark is trying to show: how well does the system handle the task at hand? By pegging the CPU at 100% simply to run the benchmark you are not seeing how a real world problem similar to the benchmark might be solved.

It would be perfectly acceptable to have a benchmark that tests platforms at full 100% utilization but that is not the case here so there is some "cheating" going on.

In the end I'm not sure how this affects much of anything. A fast phone is a fast phone, but still this kind of stuff should not be going on.

Hmm I do somewhat get that - but then how do you take bias out of a benchmark b/c it doesn't run equally well on all devices and certainly not across platforms.

Frankly, it seems like the only true real life "benchmarks" would be actual real life tests - not synthetic tests.
- production games available across platforms
- heavy duty video encoding using the same codec
- some super heavy Excel number crunching
- others like this

Eug · Oct 2, 2013

dawheat said:
Hmm I do somewhat get that - but then how do you take bias out of a benchmark b/c it doesn't run equally well on all devices and certainly not across platforms.

For example, the Snapdragon 800 is the same across different brands of phone models. The chip manufacturer (Qualcomm) has designed the chip to work in a specific way. Those who don't game the system allow the chip to function as designed. Those who do (esp. Samsung) specifically design the OS to bypass this normal behaviour, but they do it ONLY for benchmarks and nothing else. For EVERYTHING else, it works the way the manufacturer intended. Furthermore, if you simply change the name of the benchmark, it works the way the manufacturer intended, because the OS doesn't know you're running that specific benchmark.

That is indefensible.

ControlD · Oct 2, 2013

One thing I am curious about is how can anyone say for certain who is and is not cheating? It seems like Samsung took about the laziest approach there is. They look at the name of the executable and if it matches the benchmark program then the "cheat" is applied (if I am reading the Ars information correctly that simply changing the file name results in different results).

If I wanted to cheat and get away with it I might bake something into the system itself that looks at the type of code being run and then adjust the clocks based on my best guess that a benchmark is being run. Who is to say that those companies that have been cleared of dirty play aren't simply better at it?
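For what it's worth, the lazy name-matching approach could be sketched in a few lines of Python. This is a guess at the general shape, not Samsung's actual code: the package names are invented, and the only real-world terms borrowed are the "performance" and "ondemand" governor names from Linux.

```python
# Hypothetical list of benchmark package names; invented for illustration.
BENCHMARK_PACKAGES = frozenset({
    "com.example.cpubench",
    "com.example.gpubench",
})


def choose_governor(package_name: str) -> str:
    """Return the CPU governor a name-matching cheat would select for an app."""
    if package_name in BENCHMARK_PACKAGES:
        return "performance"  # pin clocks at max only for known benchmarks
    return "ondemand"         # everything else gets normal load-based scaling


# Renaming the app is enough to dodge the check.
normal_run = choose_governor("com.example.cpubench")           # "performance"
renamed_run = choose_governor("com.example.cpubench.renamed")  # "ondemand"
```

A check like this is trivially defeated by renaming the package, which is consistent with the different results reported for renamed benchmarks. Anything keyed on runtime behaviour rather than the name would be much harder to catch from the outside.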

ControlD · Oct 2, 2013

Eug said:
Depends on what you're doing with it actually.

Generally what you say is true, but for specific usage, it may be false. The A7 has a bunch of other things going for it including hardware cryptography acceleration and a built-in image signal processor. It also has hardware H.264 decoding built into the SoC. For example, my Core 2 Duo MacBook does not. It can keep up with 1080p H.264 decoding but just barely, and the fans go into vacuum cleaner mode and the power usage is high since both cores are very active as the CPU struggles to accomplish this task. In contrast, even last year's Apple A6 doesn't even break a sweat for this task.

BTW, speaking of my old 2.4 GHz Core 2 Duo MacBook, it gets about 340 ms in SunSpider 1.0.1 with Safari. The iPhone 5S gets about 420 ms with Safari. I know it's just one limited benchmark, but still, it does illustrate just how fast the A7 really is for some tasks.

I should have been clearer, I was talking about modern (Core i series chips) not ancient Core 2 systems. Still the point is fair, mobile CPUs are getting to be quite powerful in a short period of time.

Eug · Oct 2, 2013

DaveStall said:
One thing I am curious about is how can anyone say for certain who is and is not cheating? It seems like Samsung took about the laziest approach there is. They look at the name of the executable and if it matches the benchmark program then the "cheat" is applied (if I am reading the Ars information correctly that simply changing the file name results in different results).

If I wanted to cheat and get away with it I might bake something into the system itself that looks at the type of code being run and then adjust the clocks based on my best guess that a benchmark is being run. Who is to say that those companies that have been cleared of dirty play aren't simply better at it?

I dunno, but that is a really big waste of time IMO. What these guys really should be doing is optimizing the hell out of the OS. That's what Apple does. Apple also optimizes the compilers to make use of the chip it designs as well as possible. That does lead to faster benchmark results, but it also leads to faster software in general.

Samsung is looking bad x 2 here, because not only are they wasting time paying attention to benchmarks, they're doing so with a product that doesn't even run basic apps properly. The worst example of this was illustrated in Ars' review. It took 2.5 MINUTES to open the Gallery app just to look at pictures on the phone. What is this, 1995?
