Can a single transistor failure render a processor useless?

Smartazz

Diamond Member
Dec 29, 2005
6,128
0
76
I'm not sure whether this is a highly technical question, but here goes. Given the vast number of transistors on today's CPUs and GPUs, is it possible that the failure of a single transistor could start producing errors or even render the chip unusable? It's hard to imagine that every single transistor has to work, so are some transistors redundant? And even if some are redundant, what happens when a transistor breaks in a non-redundant part of the die?
 

PsiStar

Golden Member
Dec 21, 2005
1,184
0
76
Yes. But that is usually not what fails. It is a wire bond between the chip & the package pin. In the end, a mechanical failure. But then I am an EE, so of course I would blame it on a mechanical issue. :$
 

Raghu

Senior member
Aug 28, 2004
397
1
81
Yes. Some transistors are critical. For instance, in a clock tree (which supplies the clock to flip-flops), a single transistor failure could break a whole group of flops, causing widespread failure.

Some other transistors are dispensable. Most chips have thousands of power gates to turn off sections of the chip. A single transistor failure in a power gate does not cause a chip failure; the current is simply carried by the other power gates.

Beyond cases like those, most chips do not have designated redundant/non-redundant parts.
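
As a toy illustration of why a single dead power gate doesn't matter (a sketch with made-up numbers, not real device parameters):

```python
# Sketch: a power-gated block keeps working as long as the surviving
# parallel gates can carry the load. All numbers are illustrative.

TOTAL_GATES = 2000          # power-gate transistors in parallel
MAX_CURRENT_PER_GATE = 0.5  # mA each gate can safely carry (assumed)
LOAD_CURRENT = 600.0        # mA the gated block draws (assumed)

def block_still_works(failed_gates: int) -> bool:
    """True if the remaining gates can still carry the load current."""
    survivors = TOTAL_GATES - failed_gates
    return survivors * MAX_CURRENT_PER_GATE >= LOAD_CURRENT

print(block_still_works(1))    # True: one dead gate is absorbed by the rest
print(block_still_works(700))  # True: still enough margin
print(block_still_works(900))  # False: only mass failure breaks the block
```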
 

exdeath

Lifer
Jan 29, 2004
13,679
10
81
Minor defects can be corrected with a focused ion beam.

Unfixable defects that render cores or sections of cache unusable can be disabled, and the chip binned as a lower-cost part with fewer cores, less cache, etc.

Many errors and defects aren't simply a case of "it works or it doesn't", but manifest as reliability problems at high clocks. Again, such chips are binned as lower-cost, lower-clocked versions.

The rate of defects resulting in completely useless chips is surprisingly low. Most problems on a new process are thoroughly tested and resolved with prototype samples before large-scale production of marketable chips is even attempted. Things like well size, minimum feature width, etc. are worked out well in advance through testing of the production process and incorporated into actual chip designs as tolerance factors to minimize the possibility of defects. A 22nm process was probably tested just fine across something like 20-24nm. It stands to reason that circuits that can make or break the entire chip, and are less redundant, would be designed with more "room" to maximize photolithographic/doping/contact exposure margins and minimize errors.
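
A rough sketch of the binning logic (the product tiers and thresholds here are hypothetical, not any vendor's real rules):

```python
# Sketch of how a die with defects can still ship as a cut-down part.
# The tiers and thresholds are hypothetical, not any vendor's real rules.

def bin_die(good_cores: int, good_cache_mb: int, max_stable_ghz: float) -> str:
    if good_cores >= 8 and good_cache_mb >= 16 and max_stable_ghz >= 3.5:
        return "flagship 8-core"
    if good_cores >= 6 and good_cache_mb >= 12 and max_stable_ghz >= 3.0:
        return "mid-range 6-core (bad cores/cache fused off)"
    if good_cores >= 4 and good_cache_mb >= 8:
        return "budget 4-core (lower clocks)"
    return "scrap"

# One dead core and a weak cache block still make a sellable chip:
print(bin_die(good_cores=7, good_cache_mb=14, max_stable_ghz=3.8))
```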
 

Smartazz

Diamond Member
Dec 29, 2005
6,128
0
76
Does something like electron migration have the potential to break a transistor, potentially after the chip has been packaged and sold?
 

exdeath

Lifer
Jan 29, 2004
13,679
10
81
I hope not. Electron migration is the whole reason transistors and electronics in general work in the first place ;)

If you meant metal migration, i.e. electromigration, that's a long-term issue affecting the metal layers of any semiconductor. Generally, if a chip works immediately after production, it will work for several years before experiencing premature failure, if it ever does. Either way, that's outside the intended life cycle of the product.
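
For the curious, electromigration lifetime is usually estimated with Black's equation, MTTF = A * J^-n * exp(Ea / kT). A quick sketch of how sensitive lifetime is to temperature and current density (the constants are typical textbook values, not any specific process):

```python
import math

# Black's equation: MTTF = A * J**(-n) * exp(Ea / (k * T))
# Constants below are typical textbook values, not a specific process.
K_BOLTZMANN = 8.617e-5   # eV/K
ACTIVATION_ENERGY = 0.7  # eV, typical for aluminum interconnect
CURRENT_EXPONENT = 2.0   # n ~ 2 is a common empirical fit

def relative_mttf(current_density: float, temp_kelvin: float) -> float:
    """Median time to failure, up to the unknown process constant A."""
    return current_density ** -CURRENT_EXPONENT * math.exp(
        ACTIVATION_ENERGY / (K_BOLTZMANN * temp_kelvin))

base = relative_mttf(1.0, 358.0)            # nominal current at 85 C
print(relative_mttf(1.0, 378.0) / base)     # ~0.30: 20 C hotter, ~1/3 the life
print(relative_mttf(1.5, 358.0) / base)     # ~0.44: 1.5x current density
```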
 

CanOWorms

Lifer
Jul 3, 2001
12,404
2
0
This happens all the time. Every single transistor has to work.

By the time it's sold to the consumer, most failures will have been screened out. Most failures in customers' hands tend to be due to electrical overstress damaging one or more transistors on an I/O.
 

FrankSchwab

Senior member
Nov 8, 2002
218
0
0
In my experience, the vast majority of the transistors are critical and would cause a fault of some sort if they were bad. The issue, of course, is finding the fault.

If the transistor were in the clock chain, or in the register that held the Program Counter, or was responsible for driving a memory address out of the CPU, the fault would be catastrophic and the chip would be dead - it would likely never run long enough to execute more than a few instructions.

If the transistor were in the data cache, you might have a one-bit failure out of the 6 MB (48,000,000 bits) in the cache; you would notice flaky behavior and your PC would probably never make it through a Windows boot.

If the transistor were in a lookup table in the floating-point processing unit, your PC would probably work just fine. 99.999% of users might never even notice the problem, and only someone who was carefully checking the results of scientific calculations would see it (see http://en.wikipedia.org/wiki/Pentium_FDIV_bug).
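
A toy analogue of that lookup-table case (nothing to do with the Pentium's actual SRT divider tables, just the principle):

```python
# One bad entry in a big lookup table: almost every input still computes
# correctly, so the defect can hide for a very long time.

TABLE = [x * x for x in range(1024)]  # intended: a table of squares
TABLE[613] = 0                        # a single defective entry

def square(x: int) -> int:
    return TABLE[x]

errors = sum(1 for x in range(1024) if square(x) != x * x)
print(f"{errors} wrong result out of 1024 inputs")  # 1
# A user who never happens to look up 613 never sees the bug.
```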
 

GammaLaser

Member
May 31, 2011
173
0
0
There are some CPUs (like IBM z-series, Itanium, and high-end Xeons) that offer enough RAS features to recover from soft and/or hard failures in the circuitry, using redundancy and error checking in both the core logic and the memory arrays. At the extreme, IBM mainframes support redundant CPUs that are automatically brought online if another processor fails. The system does not even have to be rebooted for this to happen!

The vast majority of client CPUs that we buy in desktops/laptops do not have this level of resiliency. They would be way too expensive then ;).
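
To show the flavor of the error checking involved, here is a toy Hamming(7,4) code. Real machines use wider SECDED (single-error-correct, double-error-detect) codes implemented in hardware; this sketch only demonstrates the principle of repairing a flipped bit on the fly:

```python
# Minimal Hamming(7,4) code: 4 data bits protected by 3 parity bits.
# A single flipped bit (a "failed transistor" in a storage cell) is
# located by the syndrome and corrected transparently.

def encode(d):  # d = list of 4 data bits
    p1 = d[0] ^ d[1] ^ d[3]
    p2 = d[0] ^ d[2] ^ d[3]
    p4 = d[1] ^ d[2] ^ d[3]
    return [p1, p2, d[0], p4, d[1], d[2], d[3]]  # codeword positions 1..7

def decode(c):  # c = 7-bit codeword, possibly with one flipped bit
    s1 = c[0] ^ c[2] ^ c[4] ^ c[6]   # parity over positions 1,3,5,7
    s2 = c[1] ^ c[2] ^ c[5] ^ c[6]   # parity over positions 2,3,6,7
    s4 = c[3] ^ c[4] ^ c[5] ^ c[6]   # parity over positions 4,5,6,7
    syndrome = s1 + 2 * s2 + 4 * s4  # equals the position of the bad bit
    if syndrome:
        c = c[:]
        c[syndrome - 1] ^= 1         # repair on the fly
    return [c[2], c[4], c[5], c[6]]  # recover the data bits

word = [1, 0, 1, 1]
stored = encode(word)
stored[5] ^= 1                       # a bit flips in storage
assert decode(stored) == word        # the data still reads back correct
print("corrected:", decode(stored))
```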
 

BrightCandle

Diamond Member
Mar 15, 2007
4,762
0
76
They are talking about making CPUs much more resilient, with ECC-like functionality all over the place, for future processes, because they won't be able to rely on every transistor being perfect anymore. But right now, every one of them running at the necessary speed is critical.

You could make a lot of money if you could come up with a processor that didn't require clock signals and could detect and repair errors rapidly. It would be a fair amount faster than current clock-based designs, and that resilience would move the industry on a notch.
 

Smartazz

Diamond Member
Dec 29, 2005
6,128
0
76
BrightCandle said:
They are talking about making CPUs much more resilient, with ECC-like functionality all over the place, for future processes, because they won't be able to rely on every transistor being perfect anymore. But right now, every one of them running at the necessary speed is critical.

You could make a lot of money if you could come up with a processor that didn't require clock signals and could detect and repair errors rapidly. It would be a fair amount faster than current clock-based designs, and that resilience would move the industry on a notch.

Is quantum tunneling one of the obstacles that could require the use of ECC functionality throughout the die? I'm sorry if my understanding of quantum mechanics and electrical engineering is lacking, but my understanding is that stopping the flow of current past the gate of a transistor gets very difficult when electrons can tunnel through the gate due to their wave-particle duality. Sorry if this understanding is way off, but can anyone clarify?
 

ArchAngel777

Diamond Member
Dec 24, 2000
5,223
61
91
I would imagine that a processor has built-in redundancy, much like a CD/DVD-ROM disc. Can an electronics engineer comment on this?

Oops, I missed the last part of the OP where he specifically mentioned a 'non-redundant part of the die'. My bad... How often is a crucial part not engineered with redundancy?
 

pm

Elite Member Mobile Devices
Jan 25, 2000
7,419
22
81
I work at Intel and have worked on reliability for server CPUs. (Since talk is cheap on the internet, my Intel email is in my profile; you can email me at Intel if you don't believe me.)

I agree with what FrankSchwab, GammaLaser, exdeath, Raghu, and CanOWorms all wrote.

So, yes, most of the time a single transistor failure anywhere on the CPU will result in the entire CPU not working, because the transistors that tend to fail are the ones that are used a lot, and thus a failure in one of these commonly used transistors (or wires) results in the whole thing not working. As FrankSchwab wrote, sometimes this failure will be immediate and catastrophic; other times it might be more of a flaky failure.

There is redundancy in server CPUs - register files use ECC and can fix failures on the fly, Intel uses a technology called "Cache Safe Technology" to detect cache regions with too many ECC errors, disable such a region, and swap in a redundant region on the fly, and there are other redundancies. There are also lots of transistors on a CPU that are used for test or debug and have no effect on the operation of the CPU. Or you could have a bit get stuck high (or low) in something like the LRU (least-recently-used) circuitry for a cache, or the branch prediction circuitry, such that the CPU would suffer a performance loss but otherwise continue to work fine.
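
To sketch the region-swap idea in a few lines (a simplified illustration only; the real hardware mechanism differs in detail, and the thresholds and names are invented):

```python
# Simplified sketch of retiring a cache region that accumulates ECC
# errors and swapping in a spare. Thresholds, names, and the mechanism
# are invented for illustration; the real hardware differs in detail.

ECC_ERROR_THRESHOLD = 3

class CacheRegions:
    def __init__(self, regions, spares):
        self.remap = {r: r for r in regions}    # logical -> physical region
        self.spares = list(spares)
        self.ecc_errors = {r: 0 for r in regions}

    def report_ecc_error(self, region):
        self.ecc_errors[region] += 1
        if self.ecc_errors[region] >= ECC_ERROR_THRESHOLD and self.spares:
            self.remap[region] = self.spares.pop(0)  # swap in a spare
            self.ecc_errors[region] = 0
            print(f"region {region} retired -> {self.remap[region]}")

cache = CacheRegions(regions=["R0", "R1"], spares=["SPARE0"])
for _ in range(3):
    cache.report_ecc_error("R1")   # a flaky cell keeps flipping bits
# -> region R1 retired -> SPARE0
```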

But most of the time, when I look at a CPU that doesn't pass the test screen (and one of these performance issues would be caught), I tend to see that it's just one transistor that failed. And generally it's a specific lithographic mask structure that fails... like you'll get something where there's all metal 1 in a square, with a via (a vertical wire) through the middle of it, and there will be a transcription error in the fab where the via mask and the wire mask didn't totally line up, resulting in a via that's much narrower than it's supposed to be. And you'll see this particular structure show up all the time in the failures list, and then you can go back, search through the whole design, and fix every single one of these errors on the whole CPU (for example, by making the via slightly bigger). And then you'll see a huge improvement in reliability and yield from fixing just this one issue (which happened thousands of times all over the whole die).

To be honest, though, my thinking that any one failure generally results in a CPU failure could be a form of sampling bias. Because the failures that I've looked at are like this, I then think that they are all like this. But if a transistor fails in an unused chunk of circuitry, would anyone ever know?

I will say, in response to ArchAngel777, that in my experience there's a surprising lack of redundancy in desktop and consumer CPUs. Modern CMOS process technologies produce transistors that generally work for a really long time (ignoring early failures caught in burn-in), so redundancy doesn't serve much function in real life, burns power, and takes up space. So, no, there's an amazing lack of redundancy in most consumer-oriented CPUs.
 

Smartazz

Diamond Member
Dec 29, 2005
6,128
0
76
Thanks for the informative post, pm! I assume redundancy is important for the Xeon and Itanium lines then. I assume AMD does the same with their Opteron line, correct?
 

pm

Elite Member Mobile Devices
Jan 25, 2000
7,419
22
81
Smartazz said:
Thanks for the informative post, pm! I assume redundancy is important for the Xeon and Itanium lines then. I assume AMD does the same with their Opteron line, correct?

I honestly don't know. Certainly the engineers at AMD are a smart, capable bunch, and they know exactly how to implement it... but it's a matter of cost vs. return. Redundancy burns extra power, takes up die area, and is only worth adding if your customers want/need it as a feature. I'm sure they add the level of redundancy needed to address the market they are targeting, so I agree with your assumption, but we'd need to look at the datasheets or get someone from AMD to confirm it.
 

TuxDave

Lifer
Oct 8, 2002
10,571
3
71
ArchAngel777 said:
I would imagine that a processor has built-in redundancy, much like a CD/DVD-ROM disc. Can an electronics engineer comment on this?

Oops, I missed the last part of the OP where he specifically mentioned a 'non-redundant part of the die'. My bad... How often is a crucial part not engineered with redundancy?

Just wanted to add my two cents. Besides redundancy, there are two other vectors to consider: you can put in logic to disable functions/regions that contain the faulty logic, and you can also over-design for failure.

The first one - at least, my team calls them "chicken bits". They're mostly there for new logic that is difficult to validate, so if silicon comes back and it's all screwed up, you start flipping these chicken bits on to disable new features one at a time and figure out which one is causing the problem (see the sketch below). Once you find it, you can still get the chip running, just not with 100% goodness. I don't think it's used for consumer parts, but it's one way our post-silicon team can get around a faulty transistor or a faulty design.

The second is that, while we don't put redundancy everywhere, we over-design for failure. So we test against worst-case scenarios for electromigration and may end up over-designing devices or vias or power to improve reliability.
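
Here's the chicken-bit sketch promised above, as a rough software analogue (the feature names and register layout are invented for illustration):

```python
# Rough software analogue of "chicken bits": defeature flags that turn
# off new, risky logic blocks one at a time on bad silicon. The feature
# names and register layout here are invented for illustration.

CHICKEN_BITS = {
    "new_prefetcher": 1 << 0,
    "loop_buffer":    1 << 1,
    "fast_divider":   1 << 2,
}

def find_bad_feature(chip_passes_tests):
    """Flip chicken bits on cumulatively until the chip passes tests."""
    defeature_reg = 0
    for name, bit in CHICKEN_BITS.items():
        defeature_reg |= bit              # disable one more new feature
        if chip_passes_tests(defeature_reg):
            return name                   # last feature disabled is the culprit
    return None

# Pretend silicon that only works once the (hypothetical) fast divider is off:
bad = CHICKEN_BITS["fast_divider"]
print(find_bad_feature(lambda reg: bool(reg & bad)))  # -> fast_divider
```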
 

DDR4

Junior Member
Feb 2, 2012
16
0
0
If a single transistor fails in a critical part of the processor, such as the ALUs, BU, or dispatcher, the processor will fail. It will also fail if a unit in another part of the processor fails, since neither Intel nor AMD is going to include needless circuits. You may also be thinking that if the cache fails, the processor will still work; but if a memory location is corrupt and a program or the processor refers to it, the program will fail to work. That being said, Intel and AMD have high-quality products that undergo rigorous testing.
 