How does the debugging process for core logic chipsets work?

Sunner

Elite Member
Considering there's been quite a few articles all over the net lately, mostly about PCI issues, but also about other issues with various chipsets as well, I've started wondering about how the debugging process works.

Say stuff like the PCI writes/reads issues that seem to plague most chipsets these days, I can't quite fathom how some of it gets through the debug process.

Is the process entirely "synthetic", or do the companies (say VIA, since they've had numerous issues) actually have "beta testers" using reference boards based on the new chipsets?
Say taking a KT1000 chipset, and stuffing 1000 workstations based on a KT1000 mobo full of GeForce10's, SB Ultragy Plus's, and Ultra1280 SCSI cards to really torture it.
A test like that should reveal at least some of these issues, no?
 
Chipsets are usually made using a synthesis flow with some custom analog design at the pins. This is an overview of the flow with comments on your bug-catching ability at each stage.

First an architectural model is made that shows the logic flow and gives a high-level, preliminary view of performance. You write various simulation cases and you run them through this model to simulate events, like the CPU trying to grab a chunk of data from memory and then doing a page swap or something like this. You could catch the problem at this stage.
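To make the idea concrete, here's a toy sketch (in Python, purely for illustration — real architectural models are far more detailed, and the latencies here are made-up numbers) of the kind of transaction-level model described above: it tracks only logical outcomes and cycle counts, not signals or structure, and models the "page swap costs extra" behavior.

```python
# Toy transaction-level "architectural" model (hypothetical, greatly simplified).
# It models only latencies and logical outcomes, never individual signals.

class ArchModel:
    """Counts cycles for memory reads; switching DRAM pages costs a penalty."""
    def __init__(self, hit_latency=4, miss_penalty=50):
        self.hit_latency = hit_latency      # cycles for a read to an open page
        self.miss_penalty = miss_penalty    # extra cycles for a page swap
        self.open_page = None
        self.cycles = 0

    def read(self, page, offset):
        self.cycles += self.hit_latency
        if page != self.open_page:          # page swap: pay the penalty
            self.cycles += self.miss_penalty
            self.open_page = page

m = ArchModel()
for page, off in [(0, 0), (0, 8), (1, 0)]:  # two reads to page 0, then a swap
    m.read(page, off)
print(m.cycles)  # (4+50) + 4 + (4+50) = 112
```

Simulation cases at this level are cheap to write and run, which is why gross protocol and performance problems are easiest to catch here.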

When you have this architectural model ready, you usually then create a functional RTL model using Verilog or VHDL that works (ideally) exactly the same as the architectural model but will actually model the detailed inner workings of the design. So in the architectural model you might have a block that is labelled "64-bit register" and you put stuff into the register and take it back out, but in a functional model you actually show how this register would work - which signals would cause it to activate, which buses the data would use to move data in and out, what the hierarchy of the structure looks like. Then you take this model and try and find bugs. You verify it against the architectural model (by running chunks of code on both and debugging where they differ) and you also try to just slam the heck out of the functional model.
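A sketch of that verification step, again in Python as a stand-in for a real HDL simulator (the class names and signal names here are invented): the abstract "architectural" register just stores a value, while the "functional" one models the write-enable and clock edge explicitly, and the same stimulus is pushed through both and diffed.

```python
# Illustrative only: diffing an abstract model against a functional one
# by driving identical stimulus through both. Names are made up.

MASK64 = (1 << 64) - 1

class ArchReg:
    """Architectural view: a 64-bit register is just a value you put/get."""
    def __init__(self): self.value = 0
    def write(self, data): self.value = data & MASK64
    def read(self): return self.value

class FuncReg:
    """Functional view: models the write-enable and the clock edge."""
    def __init__(self):
        self.q = 0; self.d = 0; self.wen = 0
    def drive(self, d, wen): self.d, self.wen = d, wen
    def clock(self):                        # rising edge: latch d if enabled
        if self.wen: self.q = self.d & MASK64
    def read(self): return self.q

arch, func = ArchReg(), FuncReg()
for data, wen in [(0xDEAD, 1), (0xBEEF, 0), (0x1234, 1)]:
    if wen: arch.write(data)                # abstract model: just store it
    func.drive(data, wen); func.clock()     # detailed model: signals + clock
    assert arch.read() == func.read(), "models diverged - debug here"
print(hex(func.read()))  # 0x1234 (the 0xBEEF write had wen=0, so it was ignored)
```

Where the two models disagree on any cycle is exactly where you start debugging, which is the "running chunks of code on both" process described above.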

Synthesis doesn't tend to like functional RTLs - which look a lot more like programming language programs (say, C++ code) than hardware descriptions, so one step that is often added is converting the functional model (which emulates the function without detailing the structure of the model) into a structural model (which shows the actual structure of the hardware). Some companies skip the functional stage and just write structural RTL from the start, others will synthesize from functional code (but this is not smart). But most normal people will write functional code at first and then gradually swap in structural code over time. But I digress. Whether the code is structural or functional, the model is simulated using code that exercises the functions of the model. This is where you would catch the problem ideally.
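The functional-versus-structural distinction can be shown with a small stand-in example (Python here rather than real HDL, so the contrast is easy to run): the same 4-bit adder written once as a single arithmetic statement, and once out of the XOR/AND/OR gates a structural description would actually instantiate.

```python
# Illustrative contrast, not real HDL: "functional" vs "structural"
# descriptions of the same 4-bit adder.

def add_functional(a, b, width=4):
    # Functional style: one arithmetic statement, no hardware structure shown.
    return (a + b) & ((1 << width) - 1)

def add_structural(a, b, width=4):
    # Structural style: a ripple chain of full-adder cells, gate by gate.
    carry, total = 0, 0
    for i in range(width):
        ai, bi = (a >> i) & 1, (b >> i) & 1
        s = ai ^ bi ^ carry                       # sum bit: two XOR gates
        carry = (ai & bi) | (carry & (ai ^ bi))   # carry-out: majority logic
        total |= s << i
    return total

# Exhaustive equivalence check over all 4-bit inputs:
for a in range(16):
    for b in range(16):
        assert add_functional(a, b) == add_structural(a, b)
print("equivalent")
```

The structural version is what maps cleanly onto transistors and timing analysis; the functional version is what's pleasant to write and debug, which is why teams start with one and gradually swap in the other.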

Once you have an RTL model, you pull out your "make_chip" suite of programs from Synopsys or Mentor and synthesize and "place 'n route" it, which automatically takes RTL code and converts it into transistors. You check that what the program has produced makes sense and will work, but your model is still the RTL. This step will almost never catch a "bug" like you described, Sunner.

Then you fab the chip and you get silicon back. Once you have silicon you slap it in and fire it up and most of the stuff you simulated on the models above gets done in like the first 500ms of operation. So you have the ability to check a lot more, but another issue comes into play: schedule risk vs. impact of a problem. Everything you fix introduces potential (or real) delay to the shipping schedule. There's the risk that you mess it up, or that something that you do messes something else up. And you only get one or two snapshot points (steppings of the mask) where you can fix things. So if you only get two chances to fix things and if there's the possibility that a complex fix could hose a whole lot of other things, then it needs to go into the first stepping - the risk is too high to put it in the second one because if you mess it up then the chip doesn't ship.

Also mask sets are expensive although this is something you don't hear a lot about. I couldn't quote real figures but pulling numbers from the air I would guess that a 0.13um mask set must cost upwards of US$3m. For a company you might think that $3m is not that big a deal but in this case you are talking several mask sets to produce final ship-worthy silicon - maybe 3-4 - that's $9m-12m total... that's a fair amount of capital cost that you need to recover selling a chipset that doesn't have a lot of margin on it. So, it's not like you spin a rev of silicon when you find a bug. It needs to be a big bug to be fixed.

I would guess in this case it was something that slipped through to silicon and then was too expensive/complex/risky to fix. If it was caught in RTL they'd have fixed it, so it wasn't until silicon that they saw it. Then it was judged not to be a disastrous bug but one that would require substantial risk to fix. So they deferred the fix to a later product release.

Patrick Mahoney
Microprocessor Design Engineer
Intel Corp.
Fort Collins, CO
 
So, it's sorta like the car industry, they find a defect in a car that can cause it to burst into flames, but if the calculated cost of a bunch of lawsuits is lower than the cost of recalling the cars, they don't recall 🙂?
 
Interesting, pm. So do real designers actually use high-level languages like Verilog and VHDL to lay out their chips? I've used a bit of Verilog on a recent project, and it really made functional design easy. 🙂 So easy, in fact, that it's hard to believe that real designers would use it. Does the conversion from functional to structural code involve hand-optimization of the layout, or is this mostly automated? It's cool to see that tools like I used in class are actually used in industry; I used some Synopsys tools for the design.
 
Originally posted by: Sunner
So, it's sorta like the car industry, they find a defect in a car that can cause it to burst into flames, but if the calculated cost of a bunch of lawsuits is lower than the cost of recalling the cars, they don't recall 🙂?
In a way. But if a mask set costs $3m and your market window on a successful product is 6 months and you are making $10 per chip sold in profit, you need to sell a lot of chips in a short period to cover this defect that most people won't notice.
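The back-of-the-envelope arithmetic behind that trade-off, using only the thread's own guessed figures (not real Intel or VIA numbers):

```python
# Rough break-even calculation using pm's hypothetical figures.
mask_set_cost = 3_000_000    # ~$3m per 0.13um mask set (a guess from the thread)
profit_per_chip = 10         # $10 profit per chipset sold
window_months = 6            # market window for a successful product

chips_to_break_even = mask_set_cost // profit_per_chip
print(chips_to_break_even)                    # 300000 chips just to pay for one respin
print(chips_to_break_even // window_months)   # 50000 extra chips/month over the window
```

Which is why a silicon fix has to be worth hundreds of thousands of unit sales before it makes business sense.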

Interesting, pm. So do real designers actually use high-level languages like Verilog and VHDL to lay out their chips?
Actually most of Intel doesn't. Intel has its own internal HDL. Many of the bigger companies have their own internal HDLs that were created over the years within the company to solve various problems that weren't being fixed in the "real" version. But definitely Verilog and VHDL are used in real designs.
Does the conversion from functional to structural code involve hand-optimization of the layout, or is this mostly automated?
Functional code looks like C code, so there are things like "for" and "while" loops and other things that are pretty far removed from how you would do it in real transistors. What you tend to find is that even if you can generate working schematics from functional code, it is hard to verify equivalency (check that the schematics do the same thing as the RTL) and the schematics often are not as efficient. Functional code also makes it hard to estimate timing and there are numerous other problems. But most of the time the designer will write functional code because it's easier and will start to swap in structural code as needed to solve various problems as they are encountered. Usually this will be well before the "place 'n route" stage which generates the layout.
 
so pm can i ask you, if a company wanted to do a proprietary dsp type chip for a sub-10Gbps proprietary protocol router, how much would that process cost if costs were cut to the bone, do you think?
 
Originally posted by: pm
Also mask sets are expensive although this is something you don't hear a lot about. I couldn't quote real figures but pulling numbers from the air I would guess that a 0.13um mask set must cost upwards of US$3m. For a company you might think that $3m is not that big a deal but in this case you are talking several mask sets to produce final ship-worthy silicon - maybe 3-4 - that's $9m-12m total... that's a fair amount of capital cost that you need to recover selling a chipset that doesn't have a lot of margin on it. So, it's not like you spin a rev of silicon when you find a bug. It needs to be a big bug to be fixed.

Speaking from experience, mask sets for complex logic (6-7 metal layers) are just now approaching $1 million at .13. Costs at future nodes will depend on what exposure system a customer is using. If they are trying to push their 248nm scanners to .10, the costs for a mask set will be much higher, as only hard phase-shifting AAPSM masks can achieve that resolution. If they decide to transition to 193nm scanners at critical layers such as poly and drain/source masks, the cost will be reduced by the ability to use EAPSM material masks.

Also remember that revision changes a company makes will likely not impact every layer of a mask set, therefore only requiring reordering of a few masks.
 