Can a single transistor failure render a processor useless?

Smartazz

Diamond Member
Dec 29, 2005
6,128
0
76
I'm not sure whether this is a highly technical question, but here goes. Given the vast number of transistors on today's CPUs and GPUs, is it possible that the failure of a single transistor could start producing errors or even render the chip unusable? It's hard to imagine that every single transistor has to work, so are some transistors redundant? And even if some are redundant, what happens when a transistor breaks in a non-redundant part of the die?
 

PsiStar

Golden Member
Dec 21, 2005
1,184
0
76
Yes. But that is usually not what fails. It is a wire bond between the chip & the package pin. In the end, a mechanical failure. But then I am an EE, so of course I would blame it on a mechanical issue. :$
 

Raghu

Senior member
Aug 28, 2004
397
1
81
Yes. Some transistors are critical. For instance, in a clock tree (which supplies the clock to flip-flops), a single transistor failure could break a whole group of flops, causing widespread failure.

Some other transistors are dispensable. Most chips have thousands of power gates to turn off sections of the chip. A single transistor failure in a power gate does not cause a chip failure; the current is simply carried by the other power gates.

Beyond cases like those, most chips do not have designated redundant/non-redundant parts.
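
As a toy illustration of why a single dead power gate doesn't matter (a sketch with made-up numbers, not real device parameters):

```python
# Sketch: a power-gated block keeps working as long as the surviving
# parallel gates can carry the load. All numbers are illustrative.

TOTAL_GATES = 2000          # power-gate transistors in parallel
MAX_CURRENT_PER_GATE = 0.5  # mA each gate can safely carry (assumed)
LOAD_CURRENT = 600.0        # mA the gated block draws (assumed)

def block_still_works(failed_gates: int) -> bool:
    """True if the remaining gates can still carry the load current."""
    survivors = TOTAL_GATES - failed_gates
    return survivors * MAX_CURRENT_PER_GATE >= LOAD_CURRENT

print(block_still_works(1))    # True: one dead gate is absorbed by the rest
print(block_still_works(700))  # True: still enough margin
print(block_still_works(900))  # False: only mass failure breaks the block
```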
 

exdeath

Lifer
Jan 29, 2004
13,679
10
81
Minor defects can be corrected with a focused ion beam.

Unfixable defects that render cores or sections of cache unusable can be disabled, and the chip binned as a lower-cost part with fewer cores, less cache, etc.

Many errors and defects aren't simply a case of "it works or it doesn't", but manifest as reliability problems at high clocks. Again, such chips are binned as lower-cost, lower-clocked versions.

The rate of defects resulting in completely useless chips is surprisingly low. Most problems on a new process are thoroughly tested and resolved with prototype samples before large-scale production of marketable chips is even attempted. Things like well size, minimum feature width, etc. are worked out well in advance through testing of the production process and incorporated into actual chip designs as tolerance factors to minimize the possibility of defects. A 22nm process was probably tested just fine across something like 20-24nm. It stands to reason that circuits that can make or break the entire chip, and are less redundant, would be designed with more "room" to maximize photolithographic/doping/contact exposure margins and minimize errors.
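
A rough sketch of the binning logic (the product tiers and thresholds here are hypothetical, not any vendor's real rules):

```python
# Sketch of how a die with defects can still ship as a cut-down part.
# The tiers and thresholds are hypothetical, not any vendor's real rules.

def bin_die(good_cores: int, good_cache_mb: int, max_stable_ghz: float) -> str:
    if good_cores >= 8 and good_cache_mb >= 16 and max_stable_ghz >= 3.5:
        return "flagship 8-core"
    if good_cores >= 6 and good_cache_mb >= 12 and max_stable_ghz >= 3.0:
        return "mid-range 6-core (bad cores/cache fused off)"
    if good_cores >= 4 and good_cache_mb >= 8:
        return "budget 4-core (lower clocks)"
    return "scrap"

# One dead core and a weak cache block still make a sellable chip:
print(bin_die(good_cores=7, good_cache_mb=14, max_stable_ghz=3.8))
```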
 

Smartazz

Diamond Member
Dec 29, 2005
6,128
0
76
Does something like electron migration have the potential to break a transistor, potentially after the chip has been packaged and sold?
 

exdeath

Lifer
Jan 29, 2004
13,679
10
81
I hope not. Electron migration is the whole reason transistors and electronics in general work in the first place ;)

If you meant metal migration, i.e. electromigration, that's a long-term issue affecting the metal layers of any semiconductor. Generally, if a chip works immediately after production, it will work for several years before experiencing premature failure, if it ever does. Either way, that's outside the intended life cycle of the product.
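
For the curious, electromigration lifetime is usually estimated with Black's equation, MTTF = A * J^-n * exp(Ea / kT). A quick sketch of how sensitive lifetime is to temperature and current density (the constants are typical textbook values, not any specific process):

```python
import math

# Black's equation: MTTF = A * J**(-n) * exp(Ea / (k * T))
# Constants below are typical textbook values, not a specific process.
K_BOLTZMANN = 8.617e-5   # eV/K
ACTIVATION_ENERGY = 0.7  # eV, typical for aluminum interconnect
CURRENT_EXPONENT = 2.0   # n ~ 2 is a common empirical fit

def relative_mttf(current_density: float, temp_kelvin: float) -> float:
    """Median time to failure, up to the unknown process constant A."""
    return current_density ** -CURRENT_EXPONENT * math.exp(
        ACTIVATION_ENERGY / (K_BOLTZMANN * temp_kelvin))

base = relative_mttf(1.0, 358.0)            # nominal current at 85 C
print(relative_mttf(1.0, 378.0) / base)     # ~0.30: 20 C hotter, ~1/3 the life
print(relative_mttf(1.5, 358.0) / base)     # ~0.44: 1.5x current density
```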
 

CanOWorms

Lifer
Jul 3, 2001
12,404
2
0
This happens all the time. Every single transistor has to work.

By the time it's sold to the consumer, most failures will have been screened out. Most failures in customers' hands tend to be due to electrical overstress damaging one or more transistors on an I/O.
 

FrankSchwab

Senior member
Nov 8, 2002
218
0
0
In my experience, the vast majority of the transistors are critical and would cause a fault of some sort if they were bad. The issue, of course, is finding the fault.

If the transistor were in the clock chain, or in the register that held the Program Counter, or was responsible for driving a memory address out of the CPU, the fault would be catastrophic and the chip would be dead - it would likely never run long enough to execute more than a few instructions.

If the transistor were in the data cache, you might have a one-bit failure out of the 6 MB (48,000,000 bits) in the cache; you would notice flaky behavior and your PC would probably never make it through a Windows boot.

If the transistor were in a lookup table in the floating-point processing unit, your PC would probably work just fine. 99.999% of users might never even notice the problem, and only someone who was carefully checking the results of scientific calculations would see it (see http://en.wikipedia.org/wiki/Pentium_FDIV_bug).
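
A toy analogue of that lookup-table case (nothing to do with the Pentium's actual SRT divider tables, just the principle):

```python
# One bad entry in a big lookup table: almost every input still computes
# correctly, so the defect can hide for a very long time.

TABLE = [x * x for x in range(1024)]  # intended: a table of squares
TABLE[613] = 0                        # a single defective entry

def square(x: int) -> int:
    return TABLE[x]

errors = sum(1 for x in range(1024) if square(x) != x * x)
print(f"{errors} wrong result out of 1024 inputs")  # 1
# A user who never happens to look up 613 never sees the bug.
```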
 

GammaLaser

Member
May 31, 2011
173
0
0
There are some CPUs (like IBM z-series, Itanium, and high-end Xeons) that offer enough RAS features to recover from soft and/or hard failures in the circuitry, using redundancy and error checking in both the core logic and the memory arrays. At the extreme, IBM mainframes support redundant CPUs that are automatically brought online if another processor fails. The system does not even have to be rebooted for this to happen!

The vast majority of client CPUs that we buy in desktops/laptops do not have this level of resiliency. They would be way too expensive then ;).
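
To show the flavor of the error checking involved, here is a toy Hamming(7,4) code. Real machines use wider SECDED (single-error-correct, double-error-detect) codes implemented in hardware; this sketch only demonstrates the principle of repairing a flipped bit on the fly:

```python
# Minimal Hamming(7,4) code: 4 data bits protected by 3 parity bits.
# A single flipped bit (a "failed transistor" in a storage cell) is
# located by the syndrome and corrected transparently.

def encode(d):  # d = list of 4 data bits
    p1 = d[0] ^ d[1] ^ d[3]
    p2 = d[0] ^ d[2] ^ d[3]
    p4 = d[1] ^ d[2] ^ d[3]
    return [p1, p2, d[0], p4, d[1], d[2], d[3]]  # codeword positions 1..7

def decode(c):  # c = 7-bit codeword, possibly with one flipped bit
    s1 = c[0] ^ c[2] ^ c[4] ^ c[6]   # parity over positions 1,3,5,7
    s2 = c[1] ^ c[2] ^ c[5] ^ c[6]   # parity over positions 2,3,6,7
    s4 = c[3] ^ c[4] ^ c[5] ^ c[6]   # parity over positions 4,5,6,7
    syndrome = s1 + 2 * s2 + 4 * s4  # equals the position of the bad bit
    if syndrome:
        c = c[:]
        c[syndrome - 1] ^= 1         # repair on the fly
    return [c[2], c[4], c[5], c[6]]  # recover the data bits

word = [1, 0, 1, 1]
stored = encode(word)
stored[5] ^= 1                       # a bit flips in storage
assert decode(stored) == word        # the data still reads back correct
print("corrected:", decode(stored))
```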
 

BrightCandle

Diamond Member
Mar 15, 2007
4,762
0
76
They are talking about making CPUs much more resilient, with ECC-like functionality all over the place, for future processes, because they won't be able to rely on every transistor being perfect anymore. But right now, every one of them running at the necessary speed is critical.

You could make a lot of money if you could come up with a processor that didn't require clock signals and could detect and repair errors rapidly. It would be a fair amount faster than current clock-based designs, and that resilience would move the industry on a notch.
 

Smartazz

Diamond Member
Dec 29, 2005
6,128
0
76
BrightCandle said:
They are talking about making CPUs much more resilient, with ECC-like functionality all over the place, for future processes, because they won't be able to rely on every transistor being perfect anymore. But right now, every one of them running at the necessary speed is critical.

You could make a lot of money if you could come up with a processor that didn't require clock signals and could detect and repair errors rapidly. It would be a fair amount faster than current clock-based designs, and that resilience would move the industry on a notch.

Is quantum tunneling one of the obstacles that could require the use of ECC functionality throughout the die? I'm sorry if my understanding of quantum mechanics and electrical engineering is lacking, but my understanding is that stopping the flow of current past the gate of a transistor gets very difficult when electrons can tunnel through the gate due to their wave-particle duality. Sorry if this understanding is way off, but can anyone clarify?
 

ArchAngel777

Diamond Member
Dec 24, 2000
5,223
61
91
I would imagine that a processor has built-in redundancy, much like a CD/DVD-ROM disc. Can an electronics engineer comment on this?

Oops, I missed the last part of the OP where he specifically mentioned a 'non-redundant part of the die'. My bad... How often is a crucial part not engineered with redundancy?
 

pm

Elite Member Mobile Devices
Jan 25, 2000
7,419
22
81
I work at Intel and have worked on reliability for server CPUs. (Since talk is cheap on the internet, my Intel email is in my profile; you can email me at Intel if you don't believe me.)

I agree with what FrankSchwab, GammaLaser, exdeath, Raghu, and CanOWorms all wrote.

So, yes, most of the time a single transistor failure anywhere on the CPU will result in the entire CPU not working, because the transistors that tend to fail are the ones that are used a lot, and thus a failure in one of these commonly used transistors (or wires) results in the whole thing not working. As FrankSchwab wrote, sometimes this failure will be immediate and catastrophic; other times it might be more of a flaky failure.

There is redundancy in server CPUs - register files use ECC and can fix failures on the fly, Intel uses a technology called "Cache Safe Technology" to detect cache regions with too many ECC errors, disable such a region, and swap in a redundant region on the fly, and there are other redundancies. There are also lots of transistors on a CPU that are used for test or debug and have no effect on the operation of the CPU. Or you could have a bit get stuck high (or low) in something like the LRU (least-recently-used) circuitry for a cache, or the branch prediction circuitry, such that the CPU would suffer a performance loss but otherwise continue to work fine.
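
To sketch the region-swap idea in a few lines (a simplified illustration only; the real hardware mechanism differs in detail, and the thresholds and names are invented):

```python
# Simplified sketch of retiring a cache region that accumulates ECC
# errors and swapping in a spare. Thresholds, names, and the mechanism
# are invented for illustration; the real hardware differs in detail.

ECC_ERROR_THRESHOLD = 3

class CacheRegions:
    def __init__(self, regions, spares):
        self.remap = {r: r for r in regions}    # logical -> physical region
        self.spares = list(spares)
        self.ecc_errors = {r: 0 for r in regions}

    def report_ecc_error(self, region):
        self.ecc_errors[region] += 1
        if self.ecc_errors[region] >= ECC_ERROR_THRESHOLD and self.spares:
            self.remap[region] = self.spares.pop(0)  # swap in a spare
            self.ecc_errors[region] = 0
            print(f"region {region} retired -> {self.remap[region]}")

cache = CacheRegions(regions=["R0", "R1"], spares=["SPARE0"])
for _ in range(3):
    cache.report_ecc_error("R1")   # a flaky cell keeps flipping bits
# -> region R1 retired -> SPARE0
```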

But most of the time, when I look at a CPU that doesn't pass the test screen (and one of these performance issues would be caught), I tend to see that it's just one transistor that failed. And generally it's a specific lithographic mask structure that fails... like you'll get something where there's all metal 1 in a square, with a via (a vertical wire) through the middle of it, and there will be a transcription error in the fab where the via mask and the wire mask didn't totally line up, resulting in a via that's much narrower than it's supposed to be. And you'll see this particular structure show up all the time in the failures list, and then you can go back, search through the whole design, and fix every single one of these errors on the whole CPU (for example, by making the via slightly bigger). And then you'll see a huge improvement in reliability and yield from fixing just this one issue (which happened thousands of times all over the whole die).

To be honest, though, my thinking that any one failure generally results in a CPU failure could be a form of sampling bias. Because the failures that I've looked at are like this, I then think that they are all like this. But if a transistor fails in an unused chunk of circuitry, would anyone ever know?

I will say, in response to ArchAngel777, that in my experience there's a surprising lack of redundancy in desktop and consumer CPUs. Modern CMOS process technologies produce transistors that generally work for a really long time (ignoring early failures caught in burn-in), so redundancy doesn't serve much function in real life, burns power, and takes up space. So, no, there's an amazing lack of redundancy in most consumer-oriented CPUs.
 

Smartazz

Diamond Member
Dec 29, 2005
6,128
0
76
Thanks for the informative post, pm! I assume redundancy is important for the Xeon and Itanium lines then. I assume AMD does the same with their Opteron line, correct?
 

pm

Elite Member Mobile Devices
Jan 25, 2000
7,419
22
81
Smartazz said:
Thanks for the informative post, pm! I assume redundancy is important for the Xeon and Itanium lines then. I assume AMD does the same with their Opteron line, correct?

I honestly don't know. Certainly the engineers at AMD are a smart, capable bunch, and they know exactly how to implement it... but it's a matter of cost vs. return. Redundancy burns extra power, takes up die area, and is only worth adding if your customers want/need it as a feature. I'm sure they add the level of redundancy needed to address the market they are targeting, so I agree with your assumption, but we'd need to look at the datasheets or get someone from AMD to confirm it.
 

TuxDave

Lifer
Oct 8, 2002
10,571
3
71
ArchAngel777 said:
I would imagine that a processor has built-in redundancy, much like a CD/DVD-ROM disc. Can an electronics engineer comment on this?

Oops, I missed the last part of the OP where he specifically mentioned a 'non-redundant part of the die'. My bad... How often is a crucial part not engineered with redundancy?

Just wanted to add my two cents. Besides redundancy, there are two other vectors to consider: you can put in logic to disable functions/regions that contain the faulty logic, and you can also over-design for failure.

The first one - at least, my team calls them "chicken bits". They're mostly there for new logic that is difficult to validate, so if silicon comes back and it's all screwed up, you start flipping these chicken bits on to disable new features one at a time and figure out which one is causing the problem (see the sketch below). Once you find it, you can still get the chip running, just not with 100% goodness. I don't think it's used for consumer parts, but it's one way our post-silicon team can get around a faulty transistor or a faulty design.

The second is that, while we don't put redundancy everywhere, we over-design for failure. So we test against worst-case scenarios for electromigration and may end up over-designing devices or vias or power to improve reliability.
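
Here's the chicken-bit sketch promised above, as a rough software analogue (the feature names and register layout are invented for illustration):

```python
# Rough software analogue of "chicken bits": defeature flags that turn
# off new, risky logic blocks one at a time on bad silicon. The feature
# names and register layout here are invented for illustration.

CHICKEN_BITS = {
    "new_prefetcher": 1 << 0,
    "loop_buffer":    1 << 1,
    "fast_divider":   1 << 2,
}

def find_bad_feature(chip_passes_tests):
    """Flip chicken bits on cumulatively until the chip passes tests."""
    defeature_reg = 0
    for name, bit in CHICKEN_BITS.items():
        defeature_reg |= bit              # disable one more new feature
        if chip_passes_tests(defeature_reg):
            return name                   # last feature disabled is the culprit
    return None

# Pretend silicon that only works once the (hypothetical) fast divider is off:
bad = CHICKEN_BITS["fast_divider"]
print(find_bad_feature(lambda reg: bool(reg & bad)))  # -> fast_divider
```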
 

DDR4

Junior Member
Feb 2, 2012
16
0
0
If a single transistor fails in a critical part of the processor, such as the ALUs, BU, or dispatcher, the processor will fail. It will also fail if a unit in another part of the processor fails, since neither Intel nor AMD is going to include needless circuits. You may also be thinking that if the cache fails, the processor will still work; but if a memory location is corrupt and a program or the processor refers to it, the program will fail to work. That being said, Intel and AMD have high-quality products that undergo rigorous testing.
 