Wow, that was quite the stretch of posts going back and forth.
These are just my musings on the subject as the threads were being posted.
In essence, can we believe these numbers? Are they credible? If they are not, there is no use beating a dead horse and pulling out conclusions that it is still alive.

In support of "they are not," we have to look at the patterns exhibited over the course of the test. Typically, a repeated series of events performed by the same device over an extended period shows a pattern; in other words, we should see a similar rate of increase or decrease from one performance checkpoint to the next at any given point in time. Thanks to the efforts of duvie, we have seen that this is not true here. One theory of how this happened is that THG was maximizing and minimizing windows to check that things were indeed going smoothly, and in doing so was corrupting the test at the same time.

Another aspect is the downtime of the Intel processor. The only real consistency in stability we've seen comes later in the test. What did those higher temperatures and instability issues do to the numbers earlier on? Not just the downtime itself, but the effect on performance as well? We don't know.

A further point is whether the test, in the end, is even valid for proving or disproving the ability of each processor. Is four threads the be-all, end-all of tests, or do we need to verify these results through another independent study? As I have mentioned before, if this test was not properly designed, why are some people supporting numbers that could have been altered or flawed? The test is invalid in terms of performance. The only conclusion I will make is that there is not enough data to support the abilities of one processor over the other.
Now, if the test is valid, there are some conclusions we can make. I have not looked at the most recent numbers, and I personally don't care to, because my position is that the test is invalid. But if the numbers are correct, we can conclude that there is a definite scheduling issue, and since Duvie was kind enough to quote straight from Microsoft's webpage, we can rightfully blame the OS here. We can further conclude that the tests are independent of one another and have no relation to each other; as was mentioned, what is a Far Cry score in relation to a WinRAR score? Looked at in that light there are winners and losers, but because the tests must be judged independently, the most we can claim is something like a 2-win, 2-loss record for Intel and a 2-win, 2-loss record for AMD, or whatever the numbers indicate. If it is 50/50, there is no real winner or loser. Frankly speaking, I have no reason to stand behind this paragraph: looking at the data, I can't justify saying "yes" or "no" to either processor, and even if the paragraph were true, no rightful conclusion could follow.
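(As an aside: if anyone does rerun this, one cheap control would be pinning each benchmark process to a fixed core so the OS scheduler can't shuffle it around mid-run. Here is a minimal Win32 sketch of the idea; the mask value is just an illustrative pick on my part and has nothing to do with THG's actual setup.)

    /* Minimal sketch: pin the current process to one logical CPU so the
       scheduler cannot migrate the benchmark mid-run. Win32 only; the
       mask (0x1 = CPU 0) is an illustrative assumption, not THG's setup. */
    #include <windows.h>
    #include <stdio.h>

    int main(void)
    {
        DWORD_PTR mask = 0x1;  /* bit 0 set = run only on logical CPU 0 */

        if (!SetProcessAffinityMask(GetCurrentProcess(), mask)) {
            fprintf(stderr, "SetProcessAffinityMask failed: %lu\n",
                    GetLastError());
            return 1;
        }
        printf("Pinned to CPU 0; scheduler variance is out of the picture.\n");
        /* ... launch the actual workload from here ... */
        return 0;
    }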
In conclusion, from what I have seen, the people in this thread fall into one of those two camps. Reason concludes the test invalid and forces the introduction of a new test with tighter controls and specs. When a test carries this much controversy, the experimenters must design one where we can be at least 99% sure the data is correct. For example, if we wanted to study abortion, gay rights, or the death penalty, we would have to make the experiment nearly failproof one way or the other; there can be no grey area. If, however, you wanted to determine which color, red or blue, is more popular, you could get away with a less strict methodology. This thread shows obvious controversy, and the only way to prove anything in either direction is a better, stricter test.