So are the test criteria oriented toward maximum production of high-speed parts, or are they oriented toward meeting projected demand for each speed rating?
To be frank, neither. They are optimized to find defects and test the limits of the processor at various "corners" of voltage/frequency and temperature.
The issue of whether or not parts are "down-binned" comes up frequently here at AT, and I can honestly say that I don't know whether or not this happens. If it does happen, it must be rarer than people seem to think. There is no business benefit to down-binning beyond the immediate short-term needs of the channel. The frequency distribution among parts is Gaussian/normal, and thus it makes sense to have pricing - and thus market demand - follow that same Gaussian curve. If you have an excess of high-speed parts, then it makes more sense to sock it to the competition by dropping prices on that excess high-speed capacity than it does to mark them and sell them at a lower speed - the lower-speed part will be less competitive in the marketplace, and yet you are not making any more money from it (since you marked it as a lower speed). While I can picture rare cases where this might occur in order to maintain supply in the channel and minimize disruption of product roadmaps, it only makes business sense to me for these events to occur infrequently. Otherwise you are minimizing profits by selling a less competitive product line than you otherwise could be. This subject comes up at least once every couple of months, but I have yet to see anyone come up with a sound business reason for long-term "down-binning" by a company.
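To make the Gaussian-distribution point concrete, here is a toy simulation of binning. Every number in it (mean frequency, sigma, the speed grades, the sample count) is made up purely for illustration and does not reflect any real product line; the only point is that when maximum stable frequencies are normally distributed, the middle speed grades naturally end up with the bulk of the supply:

```python
import random

# Hypothetical sketch: simulate a roughly Gaussian distribution of maximum
# stable frequencies across a production run, then "bin" each part into the
# highest speed grade it can pass. All constants are invented for illustration.

random.seed(42)
MEAN_MHZ = 1000                          # assumed mean of the distribution
SIGMA_MHZ = 50                           # assumed standard deviation
SPEED_GRADES = [900, 950, 1000, 1050]    # hypothetical rated speeds

def bin_part(fmax):
    """Assign a die to the fastest grade at or below its measured fmax."""
    passing = [g for g in SPEED_GRADES if g <= fmax]
    return max(passing) if passing else None   # None = reject/scrap

parts = [random.gauss(MEAN_MHZ, SIGMA_MHZ) for _ in range(10_000)]
counts = {}
for fmax in parts:
    grade = bin_part(fmax)
    counts[grade] = counts.get(grade, 0) + 1

for grade in SPEED_GRADES:
    print(grade, counts.get(grade, 0))
```

Run it and the two middle grades dominate the counts, which is exactly why it makes sense for pricing and demand to track the same curve rather than remarking fast parts downward.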
As far as why overclocking works, my explanation is simple: overclocking eats into engineering margin. Microprocessors are used in a variety of applications, in a variety of environments, under a variety of stresses. So in order to ensure that a product will meet operational specifications across a broad spectrum of operating conditions, there is margin built into the product. A given CPU might be tested at a voltage 10% below its minimum specification, at a temperature of, say, 80C. The way overclocking works is that you shrink the operational envelope that the CPU will actually see: for example, by spending more for a good power supply, using an expensive heatsink/fan, or increasing the cooling in the case. If the user can ensure that a CPU never experiences extreme conditions, then he is reducing the operational envelope and can thus increase the frequency. The most common overclocking technique, however, is to increase the voltage. This eats into the longevity of the CPU - there is an exponential decrease in operating lifetime with each increase in voltage. There is a sound reason why manufacturers aren't increasing the voltage on the CPUs that they sell - if they did, they'd be able to sell higher-performing parts... but statistically those parts would not last as long. The voltage that a CPU ships at has been carefully engineered to maximize performance within a given operating lifetime.
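The "exponential decrease in lifetime" claim can be sketched with a toy model. The functional form lifetime = L0 * exp(-k * dV) and both constants below are assumptions chosen only to show the shape of the curve - this is not any vendor's actual reliability model for any specific process:

```python
import math

# Toy illustration: operating lifetime falling off exponentially as core
# voltage rises above stock. L0_YEARS and K_PER_VOLT are invented constants,
# not real reliability data.

L0_YEARS = 10.0     # assumed lifetime at stock voltage
K_PER_VOLT = 20.0   # assumed voltage-acceleration constant

def lifetime_years(overvolt):
    """Estimated lifetime after raising Vcore by `overvolt` volts."""
    return L0_YEARS * math.exp(-K_PER_VOLT * overvolt)

for dv in (0.0, 0.05, 0.10, 0.15):
    print(f"+{dv:.2f} V -> {lifetime_years(dv):.1f} years")
```

With these made-up constants, each additional 50 mV cuts the remaining lifetime by the same factor (here e, about 2.7x) - which is the essential point: the penalty compounds, so a modest-looking voltage bump can cost a large fraction of the part's engineered lifetime.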
My reply to the question on hypothetical down-binning is this: how much success would someone have overclocking a CPU using a lousy power supply, at stock voltage, with typical RAM, on a poorly designed motherboard, in a case that doesn't have very good ventilation, on a hot summer's day, running the code that maximally stresses the CPU? Most people would agree that any potential overclocking gains would be pretty small. Which leads back to my point about eating into engineering margin.
Also, how good of a test is an overnight torture-testing program like Prime95, relative to the factory tester?
Prime95 is actually not a bad test, because it maintains a high level of CPU activity and thus increases the processor temperature. The factory testers, however, are capable of stressing the CPU much more, and their tests are optimized to run code that stresses the specific circuitry known during the design phase to be the chip's frequency limiter. Still, Prime95 and some of the video games out there (like the fly-by in Unreal Tournament, and 3DMark) are pretty good tests.
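The spirit of a torture test is simple to sketch: hammer the CPU with heavy arithmetic whose correct answer is known, and treat any mismatch as a sign of instability. This toy version (my own illustration, not how Prime95 itself works - real testers use far more targeted code patterns, as noted above) uses a repeated modular-exponentiation with a reference result computed once at the start:

```python
import time

# Minimal sketch of a torture test: repeat a deterministic, compute-heavy
# calculation and flag any result that disagrees with the first one.
# Caveat of this toy approach: if the machine is already unstable when the
# reference is computed, the reference itself could be wrong.

def torture_iteration():
    # Deterministic, compute-heavy known-answer calculation.
    return pow(3, 20_000, 2**61 - 1)

def torture_test(seconds=1.0):
    reference = torture_iteration()
    deadline = time.monotonic() + seconds
    iterations = 0
    while time.monotonic() < deadline:
        if torture_iteration() != reference:
            return False, iterations   # hardware produced a wrong result
        iterations += 1
    return True, iterations

ok, n = torture_test(0.2)
print("stable" if ok else "UNSTABLE", "after", n, "iterations")
```

An overnight run of something like this catches errors that only show up after hours at elevated temperature, which is why people run Prime95 overnight rather than for five minutes.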