GPU design and defects

giladr

Junior Member
Jul 27, 2009
Hi, I'm looking for information regarding the following issues; any advice will be appreciated!

1. GPUs are designed to operate over a wide temperature range, let's say 0 °C-110 °C. How does this impact their design? I've heard they (NV and ATI) must dedicate large sections of the GPU to counteract the effects of temperature changes (which alter electrical resistance). Any information on whether this is true, and specifically how much die area is actually used up to do it, would be great.

2. How much of the GPU's defect rate (or its opposite, the yield) is related to the need to operate over that temperature range? If one could eliminate the need for such a range and guarantee the GPU's operating temperature at, say, 40 °C, by how much would NV/ATI's yield improve?

Thanks in advance
 

alyarb

Platinum Member
Jan 25, 2009
#1 is completely untrue. No portion of the die is allocated to offset intense thermal output from another region; I'm not even sure how that would work. Allowing or not allowing current to flow through one region will not heat or cool a distant region. Imagine looking down at a chip with no heatsink through an infrared camera: can you imagine the die surface ever reaching a uniform temperature if only one corner is active? For almost ten years we have used heat spreaders to partially address this phenomenon with processors, controllers, and memory.
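If it helps to picture it, here is a toy finite-difference sketch in Python of what a single active corner does to an otherwise idle, bare die. Every number in it is made up for illustration; it is not a thermal model of any real chip, just a demonstration that you end up with a steep gradient across the surface rather than one uniform temperature.

```python
import numpy as np

N = 60
T = np.full((N, N), 25.0)              # bare die surface, start at 25 C ambient

for _ in range(20000):
    Tn = T.copy()
    # Jacobi relaxation: each interior point moves toward its neighbours' average
    Tn[1:-1, 1:-1] = 0.25 * (T[:-2, 1:-1] + T[2:, 1:-1] +
                             T[1:-1, :-2] + T[1:-1, 2:])
    Tn[1:11, 1:11] += 0.5              # constant heat injected in one active corner
    # crude boundary condition: edges pinned at ambient (assumed, not physical)
    Tn[0, :] = Tn[-1, :] = Tn[:, 0] = Tn[:, -1] = 25.0
    T = Tn

print(f"active corner: ~{T[1:11, 1:11].mean():.0f} C   "
      f"opposite corner: ~{T[-11:-1, -11:-1].mean():.0f} C")
```

The active corner ends up far above ambient while the opposite corner barely moves, which is exactly the uneven picture an infrared camera would show on a bare die.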

Architectural decisions actually have almost nothing to do with cooling the processor. Thermal limits are decided upon and complied with after the fact, when operating parameters (like frequencies and voltages) are married to an appropriate heatsink/fan solution.

You could say all defects have some thermal implications, but they do not determine the quality of product you get as a consumer, because in most cases there is zero flexibility here from the manufacturer. If a defective part can still be sold as a product, the troublesome area is disabled and the aggregate thermal output decreases. The product you get should do exactly what it says it will do, even if you bought a "scaled-down" version of a higher-end part. AMD briefly sold an RV770 with 160 shaders disabled, presumably because the chips were harvested from the best of those that did not satisfy the full quality criteria. NVIDIA has done the same with its GT200 generation.

GPUs do not pass or fail QC because of the temperature they run at, however. They dissipate thermal energy based on how much current is flowing through how many transistors at a given time. QC tests for the soundness of the functional units of the chip and its ability to take instructions and do its job. Anomalous heat output would certainly be cause for rejection, but I can't imagine one die on the wafer being normal while the adjacent die puts out a prohibitively larger amount of heat. I would say that essentially zero percent of the defect rate is related to satisfying the temperature range.
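To put rough numbers on "how much current through how many transistors," the usual back-of-envelope is the dynamic-power relation P ~ a * C * V^2 * f. The sketch below uses assumed, order-of-magnitude values (not measurements of any real GPU) and also shows why fusing off part of a harvested die lowers its aggregate heat output.

```python
# Back-of-envelope dynamic power: P ~ alpha * C * V^2 * f.
# All numbers are assumed, order-of-magnitude values for illustration only.
ALPHA = 0.15        # average switching activity per clock (assumed)
C_TOTAL = 1.2e-6    # total switchable capacitance of the die, farads (assumed)
V = 1.1             # core voltage, volts (assumed)
F = 700e6           # core clock, hertz (assumed)

def dynamic_power(fraction_enabled: float) -> float:
    """Dynamic power with some fraction of the functional units fused off."""
    return ALPHA * (C_TOTAL * fraction_enabled) * V**2 * F

full = dynamic_power(1.0)
cut = dynamic_power(0.8)    # e.g. one cluster of shaders disabled
print(f"full die:  ~{full:.0f} W")
print(f"cut-down:  ~{cut:.0f} W  ({full - cut:.0f} W less heat)")
```

The point is that heat output scales with how much of the chip is switching and how fast, not with whether the die happened to pass or fail any particular temperature spec.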

Disregarding the range and mandating that all operation occur below 40 °C would render most GPU architectures from the last ten years useless. The majority of GPUs are built for compute and/or gaming performance and are expected to perform trillions of math operations per second. Switching transistors billions of times per second and shuttling trillions of charge carriers around a square inch of silicon has enormous thermal implications; in this universe, you cannot get any work done without dissipating heat. If a modern one-billion-transistor GPU never put out enough energy to exceed 40 °C (on air cooling, of course), you can be sure it is not operating correctly. Now, there are architectures designed to be small and power efficient, but they will also approach or exceed 40 °C from time to time, and they are so slow we can't rely on them for much.
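A quick way to see why a "never above 40 °C" rule is hopeless on air: the die temperature is roughly ambient plus power times the cooler's junction-to-ambient thermal resistance. The resistance figures in the sketch below are assumed, ballpark values, not vendor specs.

```python
# Rough die temperature: T_die ~ T_ambient + P * R_theta (junction to ambient).
# The R_theta values are assumed ballpark figures, not measured cooler specs.
T_AMBIENT = 25.0    # room/case temperature, C (assumed)

def die_temp(power_w: float, r_theta_c_per_w: float) -> float:
    return T_AMBIENT + power_w * r_theta_c_per_w

for power in (10, 60, 150):             # idle-ish, mid-range, high-end load (assumed)
    for r_theta, cooler in ((0.5, "big air cooler"), (0.15, "strong water loop")):
        print(f"{power:3d} W with {cooler:17s} -> ~{die_temp(power, r_theta):.0f} C")
```

With 25 °C ambient, staying under 40 °C leaves only about 15 °C of headroom, which a 100+ W part blows through on any realistic cooler; only a chip doing very little work stays that cool.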