HardOCP says anandtech has bad methods

PrincessFrosty · Feb 13, 2008

I've read HardOCP for as long as I can remember, but I've not really found their new "evaluation" methods very helpful at all. I do use lots of sources for my information before buying video cards and have slowly stopped bothering with H reviews. So I wondered why bother hanging around on their forums, so I've just registered here

Their "evaluations" are totally subjective, based on what settings they think are best to increase, and what sort of frame rate they consider acceptable to play with, and even down to how they create their own custom time demos. You can only take away something helpful from the review if you can relate to what they're doing, for example, I personally play FPS games with very high frame rates, closer to 100fps as it aids competative play, as such their reviews become completely useless to me.

As for their article more specifically, I think canned benchmarks being as close to real gameplay is only important to their "evaluatuion" method, if you're making a relative comparison between 2 cards to asertain which is faster then it's really not too important. Interestingly enough their results show the same kind of pecentage difference between the cards using both methods, proving this point and really just shooting themselves in the foot by pointing out a glaring weakness in their own "evaluation" methods.

As for them re-playing the game several times and trying to get the same walk through each time, thats laughable.

For some reason this gives me the mental image of me as a kid trying to record top of the pops (old UK music chart show) off the TV by putting a tape recorder next to it and recording using the mic, in the days of modern technology we'd just use cable from audio out, to mic in. In the context of "evaluating" the cable would of course be the time demos that ensure exact replication.

BadRobot · Feb 13, 2008

ZOMG kylez wtfpwntbbqed anandtech!

and awesomedude your wall of text crits my IQ for -100

dreddfunk · Feb 13, 2008

Originally posted by: awesomedude
For all the people claiming that crysis is one game, and that because HardOCP did not show the same results for other games that they are wrong. In scientific theories it only takes one case to prove an entire theory wrong, even if there are million cases that support the theory all you need is one case where the theory is wrong to invalidate the entire theory. Scientific method/theory doesn't work on a only when its convenient for us basis.

There is a big *if* that you really aren't mentioning here about scientific theory only requiring a single failure: that failure must be reproducible by anyone, given the exact same testing methods and materials. I think apoppin is rightly saying to Kyle: if you want us to believe your disproving test over numerous other, contradictory tests, we need to be able to reproduce it.

Originally posted by: awesomedude
Then there are the people who are claiming they need 100% repeatable proof before they are convinced of anything, but then they turn around and make absurd claims that Kyle probably picks the best run for card a and the worst run for card b, or picks a scenario to favor card a over card b without any sort of proof. If you guys are going to stick to your proof guns at the very least don't make up stuff about Kyle cause you don't like him calling AT out. I am not saying Kyle is right for calling AT out, but for you guys to say Kyle calling out AT was wrong and unprofessional and then to turn around and say that Kyle picks and chooses data to fit the results they want, without any sort of proof, is outright hypocritical.

I don't think many people here are claiming that they know with certainty that Kyle *is* manipulating benchmarks. Most posters on this thread have noted, however, that his method would allow for that because it is not at all transparent. Again, if you want credibility in the scientific world, your methods and materials must be transparent to everyone examining the data.

Originally posted by: awesomedude
To the guys discussing population and samples, the population would be all the frames one could create in the entire game, or at least that level, and Kyles run-through would be a sample of that. Now granted both samples are likely different and it is impossible to recreate the exact same sample by hand. But two random samples from the same population should give the same averages. Now I understand they aren't taking random samples, but we can conclude that two similar samples from the same population, will give similar results.

The key here is knowing that the samples do not suffer from selection bias. The only way to confirm the presence or absence of selection bias is to submit your selection process to independent observers for critical review. Again, this is apoppin's point.

Originally posted by: awesomedude
To the people saying that because HardOCP compare apple to oranges that it is impossible to distinguish performance differences because you could run card a at 600 x 800 and card b at 1600 x 1200 and get the same frame rate and determine that the cards are equally fast is again absurd. First HardOCP determines the "fastest" card as the one that can play at the highest settings IE: 1600 x 1200 would be pronounced a faster card then one run at 600 x 800. HardOCP doesn't rate the fastest card by the best average frame rates in their apple to oranges comparison like everyone seems to going off about. Since HardOCP gives its readers all the setting information readers can come to their own conclusions IE: card a has an average fps of 20 and card b has an average fps of 20 at the same resolution, but card b is running with 4xAA and 16x AF while card a is running no aa or af, anyone can easily see that card b is the faster card. There is usually enough information present that you can tell which card is faster even though they are using apple to orange comparisons.

I would agree in principle, but the real question is: has the test been conducted in an appropriate fashion. If so, then extrapolating results becomes possible. If not, then the initial results themselves, to say nothing of any extrapolations, are suspect.

Originally posted by: awesomedude
Also note on this page that they describe the exact map they choose to play, and some of the various effects, affecting the graphics card, and the length in which they played. This is for all the people claiming they chose specific effects for one video card over another, and for those saying they probably just played for 10 seconds. I understand that this isn't a save point or a video but it does give a lot of data about their run through the game. Someone could easily go to the selected map, and make a 10 or so minute run through containing the listed effects and if the data gotten was drastically different in that scenario then HardOCP's we could conclude that something was wrong with someones data.

Simply stating the map played, effects seen during play, and duration of play is wholly insufficient for reproducing a test, either to confirm or deny its results. I can do a run through Oblivion, describing the area I'm in (Kvatch), some of the effects I'm seeing (the Oblivion Gate), and how long my run is (10 minutes). If I run loops around the Oblivion Gate (which stresses the card heavily) for 10 minutes, that is a vastly different test than spending 9 minutes running around the refugee camp at the bottom of the hill (not taxing) and only 1 minute running around the Oblivion Gate. Again, to repeat the test, apoppin rightly calls for a video of the run, so we can attempt to repeat it.

Originally posted by: awesomedude
Also making demands that someone do something or all there data is false, does not make it so. And if all data was held to the meet my demands or your data is no good standard no data would be good, as everyone would have ever increasing demands. While I would love to see more openness from HardOCP reviews, including save points and videos of the run through, the fact that they don't doesn't make there data somehow false. Making demands on how things be done on someone else's forum, and making threats (if you don't do it, your a bunch of pussy's/liars/corporate whores/etc...) has never been a good way to get things done the way you want them.

...

I realize that this is biased in favor of HardOCP. Also I know I gave Kyle the benefit of the doubt in this thread, but it only seems fair to me to give someone the benefit of the doubt unless there is proof stating otherwise.

Forgive me, but you seem to be missing the point. While making unfounded accusations against someone doesn't make their data false, nor does it make it true. The entire point of the scientific 'method' is to attempt to reduce the necessity of simply 'trusting' one person and their results. There is no 'benefit of the doubt' in science. Unlike a criminal defendent in the law, where the burden is on the prosecution to prove guilt beyond reasonable doubt, the burden on the scientific community is to continue to doubt all results, until they have been confirmed over, over, and over again. Kyle doesn't get a 'benefit of the doubt' as a scientist, though he certainly deserves one as a human being (as do we all). We have to separate these two things.

The final 'fact' in this case is that, according to everyone's admission, Kyle's benchmarking method is not transparent; thus, all squabbles over minor variances of each run aside, his results can't even begin to be validated through additional testing. Thus, he is essential asking the community to take his results on faith--faith in his own integrity.

In a world (hardware review sites) driven by page-hits and the advertising revenue they generate, I am "disinclined to acquiesce to his request."

Cheers.

apoppin · Feb 13, 2008

Originally posted by: awesomedude

I realize that this is biased in favor of HardOCP. Also I know I gave Kyle the benefit of the doubt in this thread, but it only seems fair to me to give someone the benefit of the doubt unless there is proof stating otherwise.

see, the difference in the *scientific method* is to give NO ONE the "benefit of the doubt"

the BURDEN of PROOF is on KyleB ... not on the established benchmarking community

WHY is he HIDING his timedemo? ... *ALL* he is doing is creating a custom *canned* time demo - and then attempting to play it in "real time"

Let US determine if he is telling the truth ... *Why* should i take his word for ANYTHING? He is the one that claims to be a *prophet* ... i AM saying he is doing it all for profit and is not to be trusted - especially since his WILL NOT let us examine his HIDDEN methods

evidently he has something to HIDE ... something that will probably destroy his site and bring down his house of cards. It appears that he bit off more than he can chew and is choking on his own FUD right now - he HAD to lock his OWN thread.
... He is attempting to 'take on' a Forum ... our forum with FUD ... and we can see right thru him ... he IS transparent.
:roll:

Kyle must be desperate ... i almost feel bad for him in his failed bid for 'traffic'.

nullpointerus · Feb 13, 2008

OK, let me see if I understand this situation...

ATI and nVidia cheat on the benchmarks... to the point of absurdity...
... and... in response... the tech sites' communities start fighting each other. :Q

The debate about benchmarking methods can be reduced to two useful statements:

(1) Purchase decisions should be based on reproducible performance-test results.
(2) Subjective testing is needed to ensure these results represent real-world performance.

Everything else is politics...

hclarkjr · Feb 13, 2008

i never rely on only 1 websites advice for anything i buy. i read as many as i can then make my decision that way

Gary Key · Feb 13, 2008

Originally posted by: apoppin

Originally posted by: jaredpace
isn't the reason his "real time gameplay benchmarks" are lower than the "canned benches" because he is using Fraps? I remember using fraps and it knocking like 30% off my framerates while i had it running

Click to expand...

In testing Crysis on Vista-64 with the DX10 renderer (ASRock 650i, 8800GTS-512, Q9650, 4GB mem) FRAPS has an overhead of .07% at 1280x1024, going up to .13% at 1920x1200. Obviously each system will be a little different and CPU type does make a slight difference as will the video card, however, we have not seen FRAPS go above 1.7% overhead in this game.

I will not say anymore about this subject other than this, download FRAPS (2.9.4), set your system to 1280x1024,1600x120,1920x1200 medium settings (depends on your video card, but with a 8800 series this works every time) for each run, then run the GPU benchmark.

This benchmark will run four flybys on the island and then create a log file in the Island sub-directory. Write those numbers down or print them. Please note the first flyby will always have lower numbers as the data is then cached and the next three will be similar, but generally the last flyby has the best numbers. We average these numbers. Do this for each resolution if possible or whatever your monitor will support.

Open up FRAPS (be sure to set it to capture FPS & min/avg/max), run the GPU benchmark at each resolution, there is a slight pause right before the benchmark starts, you have to be quick on the trigger with the F11 key to catch it right at the beginning, a practice run helps. Let the GPU benchmark complete and hit the F11 key at the end of the fourth flyby. Go to the FRAPS directory, open up the text file for that test, write down the numbers.

Now compare the average of the GPU benchmark in the first test to the second test, that will give you an approximation of FRAPS overhead. The next (the fun one) item of business is to compare the average of the GPU benchmark (four runs totaled for each category then divided by four) and compare that number to what FRAPS reported. Depending on how fast you are with the F11 key, you might get very close to the 8000 frames generated by the GPU benchmark, the best I have done so far is 7989, the worst 7954. However, any difference this slight only benefits FRAPS results.

Post up what your results are...

trung1977 · Feb 13, 2008

I figure if you test a large enough population of games it doesn't matter if either Nvidia or ATI cheats since they can't possible optimize for every game. So in the end it should all even out anyways.

I never had high regards for Hardocp and the only reason I go there is for their frequently updated news. Other than that, I find their articles very unprofessional and this recent debacle just reinforces everything for me. It doesn't really matter who's right, it is all about how you go about it and Hardocp has failed miserably. Kudos for AT for not stooping as low as Hardocp.

Munky · Feb 13, 2008

Originally posted by: awesomedude
To the people saying that because HardOCP compare apple to oranges that it is impossible to distinguish performance differences because you could run card a at 600 x 800 and card b at 1600 x 1200 and get the same frame rate and determine that the cards are equally fast is again absurd. First HardOCP determines the "fastest" card as the one that can play at the highest settings IE: 1600 x 1200 would be pronounced a faster card then one run at 600 x 800. HardOCP doesn't rate the fastest card by the best average frame rates in their apple to oranges comparison like everyone seems to going off about. Since HardOCP gives its readers all the setting information readers can come to their own conclusions IE: card a has an average fps of 20 and card b has an average fps of 20 at the same resolution, but card b is running with 4xAA and 16x AF while card a is running no aa or af, anyone can easily see that card b is the faster card. There is usually enough information present that you can tell which card is faster even though they are using apple to orange comparisons.

http://www.hardocp.com/article...wzLCxoZW50aHVzaWFzdA==

We can see here that the geforce has an avg fps of 28 and the radeon has an avg fps of 26.2. We can also see that the shader quality on the geforce is set at high, while the 1.8 avg fps difference is negligible especially in RW testing, we can conclude that the geforce is a faster card cause it is running at a higher setting.

You are ignoring one important factor, and which is that the cards are not necessarily affected the same way by changing the same settings. Things like AA, HDR or the amount of foliage in a scene will play to the strengths and weaknesses of each architecture. You can use the "playable" settings judgment as a way of masking the truth by simply raising or lowering some settings in order to make the game "playable." For example, say there's a game where one card takes a nosedive with a lot of foliage in a scene. Instead of showing that fact, the reviewer can simply state that neither card offers "playable" performance with full foliage, and then reduce the amount of foliage it to show that the cards are basically equal.

maxster · Feb 13, 2008

Man what a long read. Quite interesting though.

Demoth · Feb 13, 2008

Reviewing video cards is not an exact (or even approximate) science. Just the slightest differences in view angles can really effect outcomes. Not only that, but every computer has it's own quirks and even identicle machines running the same game will likely give significant differences in results. There are simply too many uncontrolable variables to give a perfect comparision when it comes to vid cards.

The only way to mitigate this when benching for real world performance is to play for hours on each benchmark, multiple times on different machines. No one has the resources, manpower or could meet deadlines doing this.

For us end users, the best compromise is to check multiple reviews using both real world and controlled testing to get an idea as to where each card stands.

HardOCP contributes to the overall picture and I have no problems with their reviews. However, I do have a problem with their implications that Anandtech is somehow being dishonest, could be in some conspiratorial collusion with card makers and is somehow liable to buyers. Let's not forget, most of the controversy came because Anandtech fully disclosed their testing methods. Even though some of the methods were reported wrongly initially, the mods here corrected themselves publically, not trying to hide anything.

However, this flame fest is probably overall a good thing. It makes more people aware of the unexact science of this type of testing and starts discussion on how to improve testing. I think more end users would agree they want to see more info given in reviews. More want to see AA vs no AA at lower res, a stressing of min frame rates in the heaviest parts of games, IQ comparisons and maybe even consolidations of past reviews into a format like Tom's vid card chart.

IEC · Feb 13, 2008

Here's a suggestion to HardOCP...

Write custom scripts to make the game run through a specific sequence in exactly the same manner every time.
Write these scripts for multiple areas of the game (urban, outdoors, water, etc.) and then average it out
Run multiple (at least 3x) for verification

Post scripts/saves/results out in the open. Without complete transparency in "real world testing" no one can validate your results. If you're too lazy to make a script make a macro. Use some kind of action logger like the macro recorder on the Belkin n52. Then use the recorded macro every time you run through the sequence.

There are too many variables that affect a subjective real world test. Different input will yield different FPS. Different levels might even favor different cards. Are you cold booting or restarting, etc. AnandTech does not have bad methods - they try to be as replicable (read: scientific) as possible. However, for some games (such as Crysis) it is clear there is timedemo cheating. In that case a "real world" methodology might be a good approximation.

But before you attack AT's methodology, you should consider cleaning up yours, HardOCP...

ronnn · Feb 14, 2008

Originally posted by: BadRobot
ZOMG kylez wtfpwntbbqed anandtech!

and awesomedude your wall of text crits my IQ for -100

Nope is wrong.

Dadofamunky · Feb 14, 2008

Originally posted by: tuteja1986
Hardocp are fools :! they had war with many tech website :!

Firingsquad fought hard and beat the f out of kyle :! FS made kyle look like a fool :!

I have to say, based on what I've seen in his previous columns and his 'writings,' he does a good job making himself look like a fool on many occasions.

Sheesh. Who does he think he is, the John Rockefeller of the PC Tweaking community? Like no one else can comment on things in cyberspace?

ronnn · Feb 14, 2008

Originally posted by: munky

You are ignoring one important factor, and which is that the cards are not necessarily affected the same way by changing the same settings. Things like AA, HDR or the amount of foliage in a scene will play to the strengths and weaknesses of each architecture. You can use the "playable" settings judgment as a way of masking the truth by simply raising or lowering some settings in order to make the game "playable." For example, say there's a game where one card takes a nosedive with a lot of foliage in a scene. Instead of showing that fact, the reviewer can simply state that neither card offers "playable" performance with full foliage, and then reduce the amount of foliage it to show that the cards are basically equal.

Most sites have been accused of picking games or settings to support their favorite. And I am convinced that most sites do as unintentional bias is everpresent and omnipotent. This is a very murky field. Almost as bad as real estate agents or car sales people. Read many reviews and than buy something is my motto.

Hardin · Feb 14, 2008

I'm not a big fan of Hard OCP. I find their benchmarks to be unusual and confusing. Not just because they use their own "real gameplay" benchmarks but also because they use the highest playable setting. It's sometimes confusing to understand when I only care for the direct comparisons. I play on the highest settings and lower them if I have to, so I like to see what each card can do when the game is maxed out. Long live anandtech.

Finalnight · Feb 14, 2008

Begun, these forum wars have.

lopri · Feb 14, 2008

Originally posted by: Gary Key
In testing Crysis on Vista-64 with the DX10 renderer (ASRock 650i, 8800GTS-512, Q9650, 4GB mem) FRAPS has an overhead of .07% at 1280x1024, going up to .13% at 1920x1200. Obviously each system will be a little different and CPU type does make a slight difference as will the video card, however, we have not seen FRAPS go above 1.7% overhead in this game.

I will not say anymore about this subject other than this, download FRAPS (2.9.4), set your system to 1280x1024,1600x120,1920x1200 medium settings (depends on your video card, but with a 8800 series this works every time) for each run, then run the GPU benchmark.

This benchmark will run four flybys on the island and then create a log file in the Island sub-directory. Write those numbers down or print them. Please note the first flyby will always have lower numbers as the data is then cached and the next three will be similar, but generally the last flyby has the best numbers. We average these numbers. Do this for each resolution if possible or whatever your monitor will support.

Open up FRAPS (be sure to set it to capture FPS & min/avg/max), run the GPU benchmark at each resolution, there is a slight pause right before the benchmark starts, you have to be quick on the trigger with the F11 key to catch it right at the beginning, a practice run helps. Let the GPU benchmark complete and hit the F11 key at the end of the fourth flyby. Go to the FRAPS directory, open up the text file for that test, write down the numbers.

Now compare the average of the GPU benchmark in the first test to the second test, that will give you an approximation of FRAPS overhead. The next (the fun one) item of business is to compare the average of the GPU benchmark (four runs totaled for each category then divided by four) and compare that number to what FRAPS reported. Depending on how fast you are with the F11 key, you might get very close to the 8000 frames generated by the GPU benchmark, the best I have done so far is 7989, the worst 7954. However, any difference this slight only benefits FRAPS results.

Post up what your results are...

Did this but only at one setting.. kinda time consuming, I realized.. hehe

Anyway, the specs and results are as below.

CPU: E8400 @3.60GHz
Motherboard: 780i SLI
Memory: 4 x 2GB DDR2-667 @800MHz/4-4-4-12
GPU: 2 x 8800 GT (SLI)
Monitor: Dell 2405FPW
OS: Vista Home Premium 64-bit

Testing was conducted at the monitor's native resolution (1920x1200). Reason being to compare my result with that of AnandTech's found here. I didn't crop the screenshots so they are all at 1920x1200.

Crysis
Built-in Benchmark 64-bit
Default High Quality
ForceWare 169.25 Default Control Panel Setting

Result with Fraps NOT running: 33.79

http://img514.imageshack.us/im...8800gtslinofraput0.jpg

Result with Fraps running: 33.94

http://img177.imageshack.us/im...8800gtslifrapsnmz2.jpg

Result with Fraps running and manually benchmarked using F11: 33.14 / Result reported by Fraps: 33.80

http://img177.imageshack.us/im...8800gtslifrapsrei7.jpg

I am not sure what I'm supposed to look at here?

lopri · Feb 14, 2008

ImageShack is not behaving so I re-uploaded the first screenshot.

Result with Fraps NOT running: 33.79

http://img292.imageshack.us/im...8800gtslinofrapwg3.jpg

goinginstyle · Feb 15, 2008

Lopri,

Gary's post just made sense to me. This is what is he wants you to look for - HardOCP Results - as they have different numbers between the Crytek GPU benchmark and FRAPS on both video cards. That is the crux of his entire story. Your results and mine listed below show there is no real difference. The question becomes what the hell are these guys smoking? This might explain why AT did not respond, they knew Kyle was wrong all along, well besides the obvious reasons.

System Specs -
E8400
ASUS P5K Deluxe
ASUS 8800GTS-512
Vista 64 Home Premium plus GPU hot fixes /Crysis 64, DX10, 1.1 patch - all medium settings
4GB GSkill DDR2-800 4-4-3-15

Results (averaged the four scores on the GPU benchmark as suggested)-
1280x1024-
GPU.bat results - 35.2 min / 61.14 avg / 94.3 max
FRAPS results - 34 min / 60.45 avg / 94 max (captured 7967 frames)

Running the other resolutions now but this is amazing as there are no differences to speak of with fraps running.

Citizen86 · Feb 15, 2008

Running the other resolutions now but this is amazing as there are no differences to speak of with fraps running.

Yes I just registered today, but I think you guys are misunderstanding a little bit. The HardOCP results are not simply the difference between FRAPS and what the Benchmark says, it's the difference between running the demo in REALTIME and running it as a Timedemo. You guys should read, because it's explained on the same page you you link to:

The ?Real Time Timedemo FRAPS? data you see is gleaned from running the canned GPU timedemo in real time, and recording the framerate with FRAPS. The ?Traditional Timedemo Benchmark? results are as you might expect from running in timedemo mode where the recorded demo runs as fast as it can till completion then gives you your benchmark scores.

So to put it simply, one is the canned GPU demo run real time and the other is the demo run in timedemo benchmark mode.

Now what you will immediately notice is that the two sets of results using the Crysis canned GPU demo are not even close to the same. Simply running the timedemo as a traditional ?timedemo benchmark? gives us a 38% increase in average framerate over running the canned demo at real time speed using the 3870 X2. Average framerate increased 38% going from a real time canned demo to a traditional ?fast as it can draw it? timedemo benchmark. Same demo, same settings, same hardware, same driver.

Am I missing something? I could be completely wrong, but I believe it's all explained right there.

lopri · Feb 15, 2008

Just noticed my other thread got locked. I guess AnandTech wants to take the high road when faced with benign attack (even in the forums), and I respect that.

Citizen86: I stand corrected on generalizing in my other thread. I made a mistake saying they record time demos using Fraps, which is probably not the case. But the rest of my post still stands. In this specific Crysis demo, what they did is what I did, unless if you can tell me how to run the Crysis' built-in time demo in 'real time'. Does that mean that there is a way to 'slow down' that demo so it runs @24fps or @30fps? Or what other way do they use to 'slow down' the time demo? I would love to know.

lopri · Feb 15, 2008

In case anyone's interested, I'll provide the link to the locked thread.

http://forums.anandtech.com/me...=2154935&enterthread=y

IL2SturmovikPilot · Feb 15, 2008

I find that HardOCP video card reviews are biased towards NVIDIA,and they say that they're better than Anandtech,hypocrisy is bliss i guess.

GonePlaid · Feb 16, 2008

What I do when purchasing Hardware is read as many reviews from all over the internet about the product I've got my eye on and average out all the results between them. In other words, I've got no set source on who I use as a bible for the end all and be all of reviews. I'd like to thank all you hardware reviewers for making us consumers life a little bit easier when choosing computer components.

HardOCP says anandtech has bad methods

Platinum Member

Senior member

Senior member

Lifer

Golden Member

Lifer

Senior member

Junior Member

Diamond Member

Banned

Senior member

Elite Member

Diamond Member

Platinum Member

Diamond Member

Junior Member

Golden Member

Elite Member

Elite Member

Junior Member

Junior Member

Elite Member

Elite Member

Senior member

Member