Originally posted by: thilan29
Originally posted by: Idontcare
My bachelor's degree is in materials science engineering,
Cool. Same here.
I would have thought that nVidia would have gone through enough testing to realize that it could be a problem...especially with cramped environments like laptops...and not just Apple laptops...it's the same type of failure as before isn't it?
Awesome, it's pretty rare to bump into a fellow MSE :beer:
Yeah, it appears to be the same manner of inherent failure. I view it as simply one of engineering margin: ball size for the bumps, bump pitch, current density, etc.
Go too aggressive on the layout and it won't matter how good or piss-poor your solder formulation is as you'll still manage to create a product that destroys itself in time.
Determining engineering margin with these types of mechanical/materials-related failures is challenging because of their exponential dependence on the environment variables (Arrhenius-equation-type kinetics). Run 10°C too hot and suddenly your chip dies in a fraction of the time, because lifetime falls off exponentially with absolute temperature and as a power law (a straight line on a log-log plot) with the peak-to-trough swing of the thermal cycle.
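For readers who want to play with the numbers, here is a rough sketch of the two textbook acceleration models usually applied to this kind of failure: Arrhenius for the absolute-temperature dependence and a Coffin-Manson-style power law for the thermal-cycle swing. The activation energy (0.7 eV), the power-law exponent (2.0), and the temperatures are illustrative assumptions, not values from this thread or from any datasheet, and the function names are made up for the example.

```python
import math

BOLTZMANN_EV_PER_K = 8.617333262e-5  # Boltzmann constant in eV/K


def arrhenius_acceleration(t_base_c, t_hot_c, activation_energy_ev):
    """How many times faster a thermally activated mechanism runs at t_hot_c
    than at t_base_c (both in deg C), per the Arrhenius equation."""
    t_base_k = t_base_c + 273.15
    t_hot_k = t_hot_c + 273.15
    exponent = (activation_energy_ev / BOLTZMANN_EV_PER_K) * (1.0 / t_base_k - 1.0 / t_hot_k)
    return math.exp(exponent)


def coffin_manson_acceleration(delta_t_base, delta_t_hot, exponent):
    """How many times fewer thermal cycles a joint survives when the
    peak-to-trough swing grows from delta_t_base to delta_t_hot.
    This is a power law, i.e. a straight line on a log-log plot."""
    return (delta_t_hot / delta_t_base) ** exponent


# Assumed, illustrative inputs (not measured values).
ASSUMED_EA_EV = 0.7       # hypothetical activation energy
ASSUMED_CM_EXPONENT = 2.0  # hypothetical fatigue exponent for solder

af_temp = arrhenius_acceleration(85.0, 95.0, ASSUMED_EA_EV)
af_cycle = coffin_manson_acceleration(60.0, 70.0, ASSUMED_CM_EXPONENT)

print(f"Arrhenius acceleration for +10 C:       {af_temp:.2f}x")
print(f"Coffin-Manson acceleration for +10 C dT: {af_cycle:.2f}x")
print(f"Combined (if both apply independently):  {af_temp * af_cycle:.2f}x")
```

The result is very sensitive to the assumed activation energy and exponent, so treat the output as a feel for the shape of the curves rather than a lifetime prediction.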
Not that I am saying anything new to you.

Just expounding on the topic for the benefit of the readers. At TI (Texas Instruments) we had our own packaging debacle on the 180nm node, massive in-field failures after about 6 months. We learned a lot about the things we hadn't been paying attention to, not because we didn't know we needed to but because management wasn't convinced they needed to resource the engineering with staff and equipment to characterize those aspects of our chips.
TSMC management is probably getting similar "learning curve" experience at the moment, but it takes a couple of years to turn a ship that size. It doesn't happen in 90 days, as we consumers and shareholders like to see things change.
Originally posted by: senseamp
What does this have to do with the fan speed not being set right? You can have whatever bumps you want, you need to keep a chip within its thermal envelope. Turn the fan on your CPU off, see how well it does with its bumps.
Yeah, no one is disagreeing with you. All systems will have a distribution to them. The fans will have their distribution, not all of them operate at 3000 rpm just because they are told to, and not all chips will tolerate the same level of thermal expansion for any given thermal envelope. So everything has to be guard-banded with engineering margin to protect against overlap in the tails of the distributions of the components of the system.
Guard-bands cost money and/or performance, so in a cost- or performance-sensitive market the guard-bands (engineering margin against the weak tails of the distribution) are the first to be challenged and the envelope is "pushed". We do this as consumers when we overclock our chips, or if they come factory overclocked then the guard-bands have been challenged by the factory engineers.
If the guard-bands were set too wide to begin with then challenging the guard-band does not result in a noticeable increase in the fail-rate. Think Intel CPUs for the past 2 yrs. If the guard-bands were set too tight then you have in-field fails even when the specs are adhered to.
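To put a picture on the tail-overlap argument, here is a minimal sketch of the classic stress-strength interference calculation. It assumes the operating temperature (set by fan and system variation) and the temperature each chip can tolerate are independent normal distributions, which is a simplification; every number and name below is hypothetical and chosen only to show the trend.

```python
import math


def normal_cdf(z):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def overlap_fail_fraction(mean_stress, sd_stress, mean_strength, sd_strength):
    """P(stress > strength) when both are independent normal variables.
    The difference (strength - stress) is itself normal, so the fail
    fraction is the area of that difference below zero."""
    margin = mean_strength - mean_stress
    combined_sd = math.hypot(sd_stress, sd_strength)
    return normal_cdf(-margin / combined_sd)


# Hypothetical populations (deg C), for illustration only.
MEAN_TOLERANCE_C = 105.0  # average temperature a chip survives long-term
SD_TOLERANCE_C = 4.0      # chip-to-chip spread
SD_OPERATING_C = 3.0      # system-to-system spread (fans, airflow, paste)

# Shrink the guard-band and watch the fail fraction climb.
for guard_band_c in (25.0, 20.0, 15.0, 10.0, 5.0):
    mean_operating_c = MEAN_TOLERANCE_C - guard_band_c
    fraction = overlap_fail_fraction(
        mean_operating_c, SD_OPERATING_C, MEAN_TOLERANCE_C, SD_TOLERANCE_C
    )
    print(f"guard-band {guard_band_c:4.1f} C -> fail fraction {fraction:.2e}")
```

With a wide guard-band, shaving a few degrees off barely moves the fail fraction; once the band is down to a sigma or two of the combined spread, every degree you give up costs a lot.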
The question here is two-fold IMO - is the Apple rig undercooling the chip (fan too slow) and/or is the guard-band set by NV on their chips too tight (not enough margin against the weak tail of the distribution of their own chips)?
Changing the fan speed can be a cure, but that isn't to say the problem was too little engineering margin in the thermal operating specs for the chip.