Keysplayr has already addressed the fact that this "Is Steam reliable/indicative or not" issue has been beaten to death on a previous thread.

I do not wish to quote anybody who participated in this current thread so that my post here does not come off as a personal attack or as a post that "jumped" on a mistake.

Now that the formalities are over, please understand that Steam's data collection is not indicative of anything because of its methodology. Somebody who has read a few things about statistics and/or probability, and who thinks of polls, surveys and "random samples" as simpler than they are, is prone to make a few wrong assumptions and thus conclude that a sample is "random" when it isn't.

Let's tackle "random sample". If you understand it merely as a lay person would and think of it as a "sample that is *random*" and think "random" here means anything as long as it was not pre-determined, or as long as it is unpredictable, then you are wrong.

You see, the thing about random sampling and a "random sample" is that the actual composition of the sample (for example, the thousands of Steam users that did participate) does not determine whether the sample is a random sample or not. Even if you review each element in the sample and think "Yeah, looks varied enough based on [criteria]", that won't tell you that the sample is a random sample.

What makes a sample a *random sample* is the __methodology involved in obtaining the sample__. If the methodology involved meets the criterion of randomness, then the sample obtained is a __random sample__ (it helps if you think of it as one word instead of two separate words). The criterion of randomness is rather straightforward: each element in the population you are targeting to derive a random sample from must have an equal chance to be picked in the sample.
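To make the "equal chance" criterion concrete, here is a minimal Python sketch. It assumes a tiny hypothetical population of ten gamers labelled 0 to 9 (my own illustration, not real data) and uses the standard library's `random.sample` to draw three of them over and over. If the method is fair, every gamer should end up in the sample about 3 times out of 10.

```python
import random
from collections import Counter

def simple_random_sample(population, n, rng):
    # random.sample draws n distinct elements with every subset equally likely,
    # so each element has the same chance n / len(population) of being included.
    return rng.sample(population, n)

rng = random.Random(1)
population = list(range(10))   # ten hypothetical gamers, labelled 0..9
trials = 20_000

counts = Counter()
for _ in range(trials):
    counts.update(simple_random_sample(population, 3, rng))

# Share of trials in which each gamer was picked; should hover around 3/10.
inclusion_rates = {who: counts[who] / trials for who in population}
print(inclusion_rates)
```

Notice that the check is on the drawing procedure, not on what any single sample happens to look like.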

If a sample is truly a random sample, we can then expect it to reasonably reflect what we would have found had we asked the entire population instead of just a sample. Of course, this is not magic, so there's still a margin of error that the statistician/pollster takes into account: the sampling error, or margin of sampling error. For lay people, it means "that's what you get for talking to a sample instead of the whole population". (If you have managed to get this far, I hope it is also clear that every time I say "population" here, I do not mean the human or world population.)
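For the curious, the margin of sampling error for a percentage can be roughly sketched with the usual textbook approximation. The inputs below (a 50% result from 1,000 respondents, at roughly 95% confidence) are illustrative assumptions of mine, not figures from any real poll.

```python
import math

def margin_of_sampling_error(p, n, z=1.96):
    """Approximate margin of error for a proportion p estimated from a
    simple random sample of size n (z=1.96 is roughly 95% confidence)."""
    return z * math.sqrt(p * (1 - p) / n)

moe = margin_of_sampling_error(0.5, 1000)
print(f"+/- {moe * 100:.1f} percentage points")  # prints "+/- 3.1 percentage points"
```

The catch is that this number only means something if the sample really is a random sample. Plugging a non-random sample into the formula gives you a figure, but not a meaningful one.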

Now to the meat. Is every poll valid or invalid? Naturally, saying "all are invalid" is nonsense. But using that to then say "then that means all of them are valid, and so is Steam!" is also a bit off.

It depends on the methodology. If a pollster for an election uses random digit dialling (and only random digit dialling) to obtain a sample, that's not a random sample. That's because each household may have several adults. If you phone a household only once, and you immediately pick as a sample the adult that answered it, then your sample is biased to "phone-answering adults".

"But that's random!" you might say. No, it's not. It's not about whether you can determine it or not. Here, randomness means everybody must have an equal chance to be picked. Those who loathe answering phones and let other members of the household pick it up are immediately disqualified.

So what do good pollsters do? They introduce another element of randomness. When they call up a household for a poll, they may, for example, ask for the adult who has had the most recent birthday, or perhaps the earliest upcoming birthday, or something else related to birthdays. Why? Because unlike "phone answerers" versus "phone let-it-ring-until-someone-else-picks-it-uppers", birthdays are actually randomly distributed. It's not perfect, mind you, but far better than the previous case. There's also the fact that households have different numbers of adults, which skews the probability for each element as well. Instead of using an additional "randomizer" like a randomly-distributed factor such as birthdays, I have heard of some polls that simply ask all adults in a household to participate, separately, so as to mitigate the difference in probability based on the number of adults in a household (which isn't randomly distributed). Anyway...
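Here is a rough Python sketch of the difference between the two approaches, for a single hypothetical three-adult household. The modelling choices are my own simplifications: the same adult always picks up the phone, and the "most recent birthday" rule is treated as a uniform pick among the adults.

```python
import random
from collections import Counter

def pick_answerer(household):
    # Whoever habitually picks up the phone: modelled here as the first adult.
    return household[0]

def pick_by_birthday(household, rng):
    # "Most recent birthday" modelled as a uniform pick within the household,
    # assuming birthdays are unrelated to anything the poll measures.
    return rng.choice(household)

rng = random.Random(42)
household = ["answerer", "avoider_1", "avoider_2"]  # hypothetical 3-adult household
calls = 30_000

by_answerer = Counter(pick_answerer(household) for _ in range(calls))
by_birthday = Counter(pick_by_birthday(household, rng) for _ in range(calls))

for adult in household:
    print(adult, by_answerer[adult] / calls, round(by_birthday[adult] / calls, 3))
```

Under the first rule the two phone-avoiders are never picked at all, while the birthday rule gives each adult roughly a one-in-three chance. The sketch does not cover the household-size problem: an adult living alone would still be picked every time their household is called.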

So which polls are valid and which are invalid? Check the methodology.

How about Steam?

It doesn't pass the criterion of randomness. Remember, we don't mean "random" here as you may have it in your head, such as "unpredictable" or "non-determined". The criterion of randomness means every element in the population (here, the population is gamers, and each element is each one of us) must have an equal chance to be picked.

If you are a gamer and never buy games online? Your chances of being included are now lower than those who do. You don't have internet connection or don't bother with it? You are certainly out. You don't buy stuff online, but have internet connection, but never really play Valve games or any game using the Steam platform? Never really encountered Steam before in your life? Out and out.

It doesn't pass the criterion of randomness. Note that if the population were to be reduced from "gamers" to "Steam gamers", it might pass as a random sample.
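To see why the choice of population matters, here is a toy Python simulation. Every number in it is made up purely for illustration (10,000 gamers, 60% of them on Steam, and different dedicated-GPU rates in each group); none of it is real Steam or market data.

```python
import random

rng = random.Random(7)

# Entirely made-up population of 10,000 gamers (NOT real figures).
# Each gamer is a pair: (uses_steam, has_dedicated_gpu).
steam_users = [(True, i < 3000) for i in range(6000)]  # 50% with a dedicated GPU
non_steam = [(False, i < 800) for i in range(4000)]    # 20% with a dedicated GPU
gamers = steam_users + non_steam

def gpu_share(people):
    # Fraction of the given people who have a dedicated GPU.
    return sum(has_gpu for _, has_gpu in people) / len(people)

true_share = gpu_share(gamers)                          # 0.38 by construction
steam_only = gpu_share(rng.sample(steam_users, 2000))   # only Steam users can be picked
all_gamers = gpu_share(rng.sample(gamers, 2000))        # every gamer has an equal chance

print(f"true share among all gamers:   {true_share:.2f}")
print(f"estimate from Steam-only draw: {steam_only:.2f}")
print(f"estimate from all-gamers draw: {all_gamers:.2f}")
```

In this toy setup the Steam-only draw is a perfectly good random sample of "Steam gamers" and lands near 50%, yet it misses the 38% figure for "gamers" as a whole. How far off the real survey is depends entirely on how different the non-Steam crowd actually is, which is exactly what we don't know.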

Please understand that, despite this, Steam is probably one of the best data sources available to game devs anyway. Yes, it is not a random sample, and yes, for all we know, people who don't buy stuff online and those who don't have a net connection may comprise a big chunk of the population (gamers) and may have hardware different enough to skew the results very much. But there is currently nothing else out there (to my knowledge) that is Steam-like. Steam, together with sales results for CPUs and GPUs (which are similarly unreliable, just like the Steam survey), helps game developers get a feel for where things are headed. How right or how accurate that feel is, is almost impossible to determine without a margin of sampling error (MOSE), and I doubt any MOSE is figured in here.

So while it might be the best out there to get a feel of the hardware in the wild, calling it "indicative" or putting it in the same league as valid polls is just not right, especially if you use it to characterize a population as broad and undefined as "gamers".