Epic Nvidia fail? Failing 9600M GT's

thilanliyan · Mar 8, 2009

Originally posted by: josh6079
I've never really seen SEM shots of inorganic material like you have. Would you say those pictures seem plausible?

Yes those seem plausible...for my samples I usually cut them where I want and polish them flat and they do look like that sometimes.

Here's a SEM image I've taken which was polished flat:
http://i53.photobucket.com/alb...51/thilan29/Site07.jpg

Looking at the Inq SEM pic, it looks like the 9600 bump is very porous (dark spots scattered throughout which are absent in the 9400 bump pic) and those could be crack initiation sites.

EDIT: As Idontcare mentioned, fatigue due to thermal cycling is most likely the cause, and the high-lead bump looks like it has a microstructure more prone to that type of failure.

MrSpadge · Mar 8, 2009

Wow, is the Inq finally growing up? Almost can't believe my eyes..

Would you say those pictures seem plausible?

Definitely! And an SEM is not that special.. we've got 2 in the small lab of my old Uni. One of them's an old one with a ~10" black-green CRT and even this machine could provide you with such information easily. And regarding the 3D-thing: SEM pictures of 3D structures appear 3-dimensional because the incoming electrons are partially reflected by bumps / 3d-structures, which is not the case if they hit a flat surface. That's why you get good contrast at bumps and it's easy for the brain to imagine the 3D-structure, although the image itself is strictly 2D. You could try to get real 3D-data by moving the focus.. but this is not common and the depth resolution would likely be quite bad as even an unfocussed beam still leads to a signal, it's just a blurred one.

And another cool thing about SEMs: you get the rough material analysis almost for free, just attach a proper x-ray detector and you're good to go.

MrS

Idontcare · Mar 8, 2009

The failure is mechanical in nature, this explains why the bumps fail:
http://en.wikipedia.org/wiki/Fatigue_(material)

And this explains where the stress on the bumps originates (and why it cycles, as is needed to induce fatigue failure):
http://en.wikipedia.org/wiki/C...t_of_thermal_expansion

TSMC and NVidia engineers know all this, of course, but that doesn't mean their employers want to publicly acknowledge they have a problem or that they understand the science behind the problem. From where we are standing, it's all about PR containment.

theeedude · Mar 9, 2009

http://www.slashgear.com/macbo...-nvidia-issue-0736776/

Idontcare · Mar 9, 2009

Originally posted by: senseamp
http://www.slashgear.com/macbo...-nvidia-issue-0736776/

Yeah anything that can be done to minimize the stress (minimize the temperature, minimizing the extent of the expansion from CTE) will keep the bumps from shearing.
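A rough way to see why a lower temperature means less shear on the bumps is the textbook distance-from-neutral-point estimate, sketched here in Python. The CTE values, die size, and bump height are illustrative assumptions, not measurements from the 9600M GT package, and the estimate ignores underfill, which carries part of the load in a real package.

```python
def bump_shear_strain(cte_die_ppm, cte_substrate_ppm, delta_t_c,
                      dist_from_center_mm, bump_height_mm):
    """First-order shear strain on a solder bump from CTE mismatch.

    strain ~= |alpha_substrate - alpha_die| * delta_T * distance / height
    """
    delta_alpha = abs(cte_substrate_ppm - cte_die_ppm) * 1e-6  # per degC
    return delta_alpha * delta_t_c * dist_from_center_mm / bump_height_mm


# Assumed, illustrative numbers (not from the thread):
SILICON_CTE_PPM = 2.6      # typical value for silicon
SUBSTRATE_CTE_PPM = 17.0   # assumed organic package substrate
CORNER_BUMP_MM = 5.0       # assumed distance of a corner bump from die center
BUMP_HEIGHT_MM = 0.1       # assumed bump standoff height

for swing in (30, 50, 70):  # peak-to-trough temperature swing in degC
    strain = bump_shear_strain(SILICON_CTE_PPM, SUBSTRATE_CTE_PPM, swing,
                               CORNER_BUMP_MM, BUMP_HEIGHT_MM)
    print(f"swing {swing} degC -> shear strain {strain:.2%}")
```

The strain scales linearly with the temperature swing, so in this simple model anything that trims the swing (a faster fan, a lower clock) trims the load on the outermost bumps in proportion.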

This is one of those situations where you've got one symptom (video artifacts) which is systematically controlled by the same medicine (minimize the heat) but the root-cause disease can be any number of things (bad bumps from packaging, overheated IC operating with data corruption, etc).

The bottom line is the consumer should never be placed in a position of purchasing an item which has the symptoms, regardless of which disease is causing them. Reminds me of the OCZ SSD "it's MS Windows that is the problem" debacle...no one cares what the disease is, we just want symptom-free (which means disease-free) products, period.

habbakuk87 · Mar 9, 2009

Hmmm, surprisingly a certain member still hasn't commented.

Let's not derail this technical discussion with flamebaiting, mmkay?

AmberClad
Video Moderator

theeedude · Mar 9, 2009

Originally posted by: Idontcare

Originally posted by: senseamp
http://www.slashgear.com/macbo...-nvidia-issue-0736776/

Yeah anything that can be done to minimize the stress (minimize the temperature, minimizing the extent of the expansion from CTE) will keep the bumps from shearing.

This is one of those situations where you've got one symptom (video artifacts) which is systematically controlled by the same medicine (minimize the heat) but the root-cause disease can be any number of things (bad bumps from packaging, overheated IC operating with data corruption, etc).

The bottom line is the consumer should never be placed in a position of purchasing an item which has the symptoms, regardless of which disease is causing them. Reminds me of the OCZ SSD "it's MS Windows that is the problem" debacle...no one cares what the disease is, we just want symptom-free (which means disease-free) products, period.

Seems to me like you'd want to know what the disease is that you need to cure if your goal is to be symptom free.

Idontcare · Mar 9, 2009

Originally posted by: senseamp
Seems to me like you'd want to know what the disease is that you need to cure if your goal is to be symptom free.

If by "you" you mean the NV and/or TSMC and/or Apple engineers then yes you are absolutely right.

If by "you" you mean us naive consumers then no, not really necessarily. We want confidence that the smart people know what they are doing (usually not in question) and that their management structure is motivated to not get in their way (usually is the problem). This confidence is generated/accumulated thru delivery of symptom-free products, proof of the pudding is in the eating.

My bachelor's degree is in materials science engineering. Trust me when I say there is nothing rocket science'y about what is happening with the bumps and the packaging; the engineers there know exactly what is going on and why certain corners were cut as a measure of calculated risk by the management of the projects involved.

The question is, will the companies be motivated to direct their management to direct their engineers to undo (cntrl-z ftw) the corner cutting that is leading to the disease?

We naive customers will never gain visibility to this aspect of any company's business model, even most non-management level employees don't get to sit-in on those meetings.

Regardless, where there's smoke it doesn't matter if there is fire; the smoke itself is undesirable and unacceptable. So buyer beware.

theeedude · Mar 9, 2009

Well, if the consumers don't care what is causing the problem, why are you singling out the chip maker?

josh6079 · Mar 9, 2009

Originally posted by: senseamp
Well, if the consumers don't care what is causing the problem, why are you singling out the chip maker?

I think Idontcare does care.

Idontcare · Mar 9, 2009

Originally posted by: senseamp
Well, if the consumers don't care what is causing the problem, why are you singling out the chip maker?

I'm an engineer who worked in the industry, specifically with TSMC (among others) and specifically with packaging (among others), and I rarely make the distinction between adding too much info to my posts versus too little.

Some consumers want to learn more about the background of the products which they purchase, others aren't consumers of the specific product but are individuals interested in knowing more about the industry which caters to their hobbies. (ever watch How it's Made on the Discovery Channel?)

I add information I deem relevant to my posts based on my education, experience, background, and the fact I am posting to a forum filled with plenty of eager-to-learn folks who would like to broaden their understanding beyond the mere naive consumer aspects of our adult lives.

If my mentioning TSMC appeared like I was singling them out then there may be a reason I am doing that. Having worked in the capacity I have precludes me from spilling everything I know into the public domain, but once it is in the public domain there is little preventing me from discussing it at that point.

konakona · Mar 9, 2009

I might be repeating what thilan said, but they have cut the chip in half and are looking at the cross-section, right? In that case, it will look anything but 3D. The pic, to my relatively inexperienced eyes, looks plausible. Then they did an EDX area scan to get the material composition, I guess.

OCGuy · Mar 9, 2009

Where have I seen something like this before from another GPU maker?

I guess I can't think of it. Anyways, bye guys, I'm going to play my XBOX360.

Idontcare · Mar 9, 2009

Originally posted by: konakona
I might be repeating what thilan said, but they have cut the chip in half and are looking at the cross-section, right? In that case, it will look anything but 3D. The pic, to my relatively inexperienced eyes, looks plausible. Then they did an EDX area scan to get the material composition, I guess.

The dimensionality of a SEM image depends entirely on whether the field of view contains any features that are closer to or farther from the detector relative to the other features.

A wall is a 3D entity, but if you take a picture of the wall from 6 inches away your photo will only be of a 2D surface.

SEM images can be 2D or 3D in perception, no different than photography. It's a matter of depth of focus and field of view combined with there actually being something three-dimensional in front of the e-beam.

Take a look at some of the 3D SEM images here: http://www.micromagazine.com/a...e/05/10/chipworks.html

And here are some 2D SEM images: http://www.micromagazine.com/a...e/06/06/chipworks.html

These guys offer commercial "teardown" services for cross-sectional analysis of just about every IC available at retail (similar in nature to what TheINQ did with TSMC's packaging).

http://www.semiconductor.com/

http://www.chipworks.com/default.aspx

theeedude · Mar 10, 2009

What does this have to do with the fan speed not being set right? You can have whatever bumps you want, you need to keep a chip within its thermal envelope. Turn the fan on your CPU off, see how well it does with its bumps.

thilanliyan · Mar 10, 2009

Originally posted by: Idontcare
My bachelor's degree is in materials science engineering,

Cool. Same here.

I would have thought that nVidia would have gone through enough testing to realize that it could be a problem...especially with cramped environments like laptops...and not just Apple laptops...it's the same type of failure as before isn't it?

exar333 · Mar 10, 2009

Originally posted by: thilan29

Originally posted by: Idontcare
My bachelor's degree is in materials science engineering,

Cool. Same here.

I would have thought that nVidia would have gone through enough testing to realize that it could be a problem...especially with cramped environments like laptops...and not just Apple laptops...it's the same type of failure as before isn't it?

How do we know Nvidia completely fixed the problem the first time around? Recalls and press releases are great, but they don't actually fix anything.

Idontcare · Mar 10, 2009

Originally posted by: thilan29

Originally posted by: Idontcare
My bachelor's degree is in materials science engineering,

Cool. Same here.

I would have thought that nVidia would have gone through enough testing to realize that it could be a problem...especially with cramped environments like laptops...and not just Apple laptops...it's the same type of failure as before isn't it?

Awesome, it's pretty rare to bump into a fellow MSE :beer:

Yeah it appears to be the same manner of inherent failure. I view it as simply one of engineering margin. Ball size for the bumps, bump pitch, current density, etc.

Go too aggressive on the layout and it won't matter how good or piss-poor your solder formulation is as you'll still manage to create a product that destroys itself in time.

Determining engineering margin with these types of mechanical/materials related failures is challenging because of their exponential dependence on the environment variables. (Arrhenius equation type kinetics) Run 10°C too hot and suddenly your chip dies in 1/10 the time because of the log-log nature of the correlation between stress, peak-to-trough in the thermal cycle, and the absolute temperature.
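To put rough numbers on that temperature sensitivity, here is a minimal Python sketch of the two standard reliability models this paragraph alludes to: an Arrhenius acceleration factor for the absolute temperature and a Coffin-Manson power law (a straight line on a log-log plot) for the peak-to-trough swing. The activation energy, exponent, and temperatures are assumptions picked for illustration, not values for these bumps.

```python
import math

BOLTZMANN_EV_PER_K = 8.617e-5  # Boltzmann constant in eV/K


def arrhenius_acceleration(ea_ev, t_cool_c, t_hot_c):
    """How much faster a thermally activated process runs at t_hot than at t_cool."""
    t_cool_k = t_cool_c + 273.15
    t_hot_k = t_hot_c + 273.15
    return math.exp(ea_ev / BOLTZMANN_EV_PER_K * (1.0 / t_cool_k - 1.0 / t_hot_k))


def coffin_manson_life_ratio(swing_a_c, swing_b_c, exponent):
    """Cycles to failure at swing B relative to swing A, with N ~ swing ** -exponent."""
    return (swing_a_c / swing_b_c) ** exponent


# Assumed values for illustration only:
ASSUMED_EA_EV = 0.7     # hypothetical activation energy
ASSUMED_EXPONENT = 2.0  # hypothetical Coffin-Manson exponent for solder

speedup = arrhenius_acceleration(ASSUMED_EA_EV, 85.0, 95.0)
life = coffin_manson_life_ratio(40.0, 50.0, ASSUMED_EXPONENT)
print(f"Running 10 degC hotter speeds up the failure process by about {speedup:.2f}x")
print(f"Widening the thermal swing from 40 to 50 degC leaves {life:.0%} of the cycle life")
```

With these assumed inputs the Arrhenius term alone gives roughly a factor of two for a 10°C rise; the multiplier grows quickly with a larger activation energy, and the two effects compound, which is why the margin is so hard to pin down.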

Not that I am saying anything new to you

Just expounding on the topic for the benefit of the readers. At TI (Texas Instruments) we had our own packaging debacle on the 180nm node, massive in-field failures after about 6 months. We learned a lot about the things we hadn't been paying attention to, not because we didn't know we needed to but because management wasn't convinced they needed to resource the engineering with staff and equipment to characterize those aspects of our chips.

TSMC management is probably getting similar "learning curve" experience at the moment, but it takes a couple years to turn a ship that size, doesn't happen in 90 days as we consumers and shareholders like to see things change.

Originally posted by: senseamp
What does this have to do with the fan speed not being set right? You can have whatever bumps you want, you need to keep a chip within its thermal envelope. Turn the fan on your CPU off, see how well it does with its bumps.

Yeah no one is disagreeing with you. All systems will have a distribution to them. The fans will have their distribution, not all of them operate at 3000rpm just because they are told to, and not all chips will tolerate the same level of thermal expansion for any given thermal envelope. So everything has to be guard-banded with engineering margin to protect against overlap in the tails of the distributions of the components of the system.

Guard-bands cost money and/or performance, so in a cost or performance sensitive market the guard-bands (engineering margin against the weak tails of the distribution) are the first to be challenged and the envelope is "pushed". We do this as consumers when we overclock our chips, or if they come factory overclocked then the guard-bands have been challenged by the factory engineers.

If the guard-bands were set too wide to begin with then challenging the guard-band does not result in a noticeable increase in the fail-rate. Think Intel CPUs for the past 2 yrs. If the guard-bands were set too tight then you have in-field fails even when the specs are adhered to.

The question here is two-fold IMO - is the Apple rig undercooling the chip (fan too slow) and/or is the guard-band set by NV on their chips too tight (not enough margin against the weak tail of the distribution of their own chips)?

Changing the fanspeed can be a cure but that isn't to say the problem was too little engineering margin in the thermal operating specs for the chip.
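The overlapping-tails argument can be sketched numerically. The Python below is a minimal sketch that assumes both the temperature a system actually reaches and the temperature a given chip tolerates are normally distributed and independent; every number in it is hypothetical, chosen only to show how the fail fraction reacts to the size of the guard-band.

```python
from math import hypot
from statistics import NormalDist


def field_fail_fraction(stress_mean, stress_sd, limit_mean, limit_sd):
    """Fraction of systems where the reached temperature exceeds the chip's limit.

    Margin = limit - stress. For independent normal variables the margin is
    normal too, and a system fails in the field when its margin is below zero.
    """
    margin = NormalDist(limit_mean - stress_mean, hypot(stress_sd, limit_sd))
    return margin.cdf(0.0)


# Hypothetical distributions (degC): spread in cooling vs. spread in chip tolerance.
REACHED_MEAN, REACHED_SD = 80.0, 5.0
LIMIT_SD = 5.0

for limit_mean in (95.0, 105.0, 115.0):  # progressively wider guard-band
    frac = field_fail_fraction(REACHED_MEAN, REACHED_SD, limit_mean, LIMIT_SD)
    print(f"guard-band {limit_mean - REACHED_MEAN:.0f} degC -> fail fraction {frac:.4%}")
```

Either lever moves the result: a faster fan lowers the reached-temperature mean, a more conservative chip spec raises the limit mean, and the fail fraction falls off steeply in both cases because only the tails overlap.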

thilanliyan · Mar 10, 2009

I really hope they learn a lesson from all this. Thankfully they've been hurt by it financially since sometimes that's the only way a company chooses to learn from mistakes. Maybe now they won't skimp out on testing or safety margin for future products.

I'd be interested to know what temperatures are actually reached in and around the bumps but I doubt I could find that out.

MrSpadge · Mar 10, 2009

That's some nice information, Idontcare. I'm sure not only I appreciate your posts here.

MrS

MarcVenice · Mar 10, 2009

Here's how I look at it, and I stole this from someone who recently posted it, but I can't recall who did and where. I do know it's from Fight Club.

It's basically what Idontcare is saying. It's all about statistics and bell-curves. Either way, I haven't heard much about this anymore. Going to have to do some digging soon.

"A new car built by my company leaves somewhere traveling at 60 mph. The rear differential locks up. The car crashes and burns with everyone trapped inside. Now, should we initiate a recall? Take the number of vehicles in the field, A, multiply by the probable rate of failure, B, multiply by the average out-of-court settlement, C. A times B times C equals X. If X is less than the cost of a recall, we don't do one."

Replace the car with the MacBook Pro, and the out-of-court settlement with the money it takes to replace/fix the MacBook Pro.
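The arithmetic in the quote is simple enough to write down. This Python sketch uses made-up unit counts, failure rates, and costs (none of them come from Apple or Nvidia) just to show how the decision flips once the failure rate climbs.

```python
def expected_repair_bill(units_in_field, failure_rate, avg_cost_per_failure):
    """X = A * B * C from the quote."""
    return units_in_field * failure_rate * avg_cost_per_failure


def recall_is_cheaper(units_in_field, failure_rate, avg_cost_per_failure, recall_cost):
    """True when a recall costs less than paying for failures one at a time."""
    return recall_cost < expected_repair_bill(
        units_in_field, failure_rate, avg_cost_per_failure
    )


# Hypothetical numbers, for illustration only:
UNITS = 1_000_000
COST_PER_REPAIR = 300.0
RECALL_COST = 50_000_000.0

for rate in (0.01, 0.10, 0.30):
    bill = expected_repair_bill(UNITS, rate, COST_PER_REPAIR)
    print(f"failure rate {rate:.0%}: expected bill ${bill:,.0f}, "
          f"recall cheaper: {recall_is_cheaper(UNITS, rate, COST_PER_REPAIR, RECALL_COST)}")
```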

MrSpadge · Mar 10, 2009

An idea just went through my head: the European RoHS directive was meant to ban lead from use in solder. So if the bad bumps are 95% lead, they should not have used the stuff in anything sold in Europe, should they? And the law was in place before the disaster struck last spring. Would that mean my notebook is safe.. ? (sorry, didn't keep up with the news on this)

MrS

ronnn · Mar 10, 2009

Originally posted by: thilan29

I would have thought that nVidia would have gone through enough testing to realize that it could be a problem ...

Most of these decisions were made in a fast and loose financial environment. Many companies are in trouble for attempting to obtain unrealistic growth - bonus structures for top execs almost assured this would happen. I think ATI avoided this because they were already a huge mess. As I have said many times, Jensen and his can of whoop ass and trash talk about Intel really signaled bad times ahead for Nvidia. The only surprise for me is no talk about creative accounting and market manipulation or whatever. So the fact that they remained honest suggests they should be able to rebound quickly.

bryanW1995 · Mar 10, 2009

Originally posted by: Ocguy31
Where have I seen something like this before from another GPU maker?

I guess I can't think of it. Anyways, bye guys, I'm going to play my XBOX360.

:laugh:

bryanW1995 · Mar 10, 2009

Originally posted by: MarcVenice
Here's how I look at it, and I stole this from someone who recently posted it, but I can't recall who did and where. I do know it's from Fight Club. It's basically what Idontcare is saying. It's all about statistics and bell-curves. Either way, I haven't heard much about this anymore. Going to have to do some digging soon.

"A new car built by my company leaves somewhere traveling at 60 mph. The rear differential locks up. The car crashes and burns with everyone trapped inside. Now, should we initiate a recall? Take the number of vehicles in the field, A, multiply by the probable rate of failure, B, multiply by the average out-of-court settlement, C. A times B times C equals X. If X is less than the cost of a recall, we don't do one."

Replace the car with the MacBook Pro, and the out-of-court settlement with the money it takes to replace/fix the MacBook Pro.

Maybe THAT's why Microsoft finally fixed their Xboxes: massive product failure plus red rings of DEATH must have exponentially increased their settlement costs...
