There are two distinct (and very different) phases to silicon design: before first tape-out and after first tape-out. Different companies have different terms for these two phases: Intel calls them "front-end" and "back-end," other companies call them pre-silicon and post-silicon, and still others simply call them "phase 1" and "phase 2." Whatever you want to call it, one thing is very common: back-end development is generally underestimated by people - even engineers - who haven't actually done it.
The post-silicon phase is basically about two things: finding and fixing functional bugs, and finding and fixing electrical marginalities and manufacturability issues. A functional bug is very similar to a software bug. In fact, the logical design of a CPU has so many similarities to writing code that I think they are essentially the same thing. Finding bugs usually revolves around running code against the logical simulator and on the chip and seeing if the results differ, and running legacy code (that should work) and seeing if it works. A CPU can't ship with any major bugs (or else it risks a very expensive recall), so all of its functions need to be checked thoroughly. As the Pentium floating-point bug illustrated, it's not enough to verify that the CPU can add 1+1 and get 2. Ideally every single combination needs to be checked, or else you risk the possibility that some combination of random numbers could end up with the wrong value (as was the case with FDIV). So a large chunk of post-silicon engineering is devoted to trying to check all of the functionality of the CPU as completely as possible.
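To make the simulator-vs-silicon comparison concrete, here is a toy differential-testing sketch. Everything in it is made up for illustration: the "silicon" function is just a stand-in with a deliberately injected carry bug, not real hardware access, and the function names are my own.

```python
import random

def reference_model(a, b):
    """Golden model: what the logic simulator says a 32-bit add should produce."""
    return (a + b) & 0xFFFFFFFF

def silicon(a, b):
    """Stand-in for running the same operation on the actual part.
    An artificial bug is injected: whenever the low halfwords carry,
    bit 16 of the result is flipped."""
    result = (a + b) & 0xFFFFFFFF
    if (a & 0xFFFF) + (b & 0xFFFF) > 0xFFFF:  # carry out of the low 16 bits
        result ^= 0x10000                     # the injected "bug"
    return result

def differential_test(trials=1000, seed=1):
    """Throw random operands at both models and collect every divergence."""
    rng = random.Random(seed)
    failures = []
    for _ in range(trials):
        a, b = rng.getrandbits(32), rng.getrandbits(32)
        if reference_model(a, b) != silicon(a, b):
            failures.append((a, b))
    return failures
```

A bug this frequent falls out of random testing almost immediately; the FDIV-style bugs are the ones that trigger on only a handful of operand patterns, which is why random runs get supplemented with directed tests aimed at known-risky corners.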
Even on relatively simple microprocessors, it is usually mathematically impossible to check every possible combination of logical input and processor state within a reasonable amount of time. I'm sure that I could find the formula for it somewhere, but exhaustively testing every possible input of a CPU with several hundred pins and hundreds of millions of transistors is essentially impossible within any reasonable product shipment goal. So at some point you have to "work smart": test the combinations that are likely to be at risk, and then hope you have done enough to ensure that it is correct. It's not a process that lends itself to being rushed.
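A quick back-of-the-envelope calculation shows the scale of the problem. Even ignoring internal state entirely and considering just the operand inputs of a single 64-bit execution unit, exhaustive coverage is hopeless (the 10^9 tests-per-second rate below is an arbitrary round number, not anyone's real tester throughput):

```python
SECONDS_PER_YEAR = 365 * 24 * 3600

def years_to_exhaust(input_bits, tests_per_second=1e9):
    """Years needed to try every input combination at the given test rate."""
    combinations = 2 ** input_bits
    return combinations / tests_per_second / SECONDS_PER_YEAR

# A lone unit with two 64-bit operands has 2**128 input pairs:
print(years_to_exhaust(128))  # roughly 1.1e22 years
```

And that is before multiplying in processor state, instruction sequences, and the electrical dimensions (voltage, temperature, frequency), which is exactly why the testing has to be targeted rather than exhaustive.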
The other chunk of post-silicon work comes from trying to get all of the marginalities out. When a CPU comes back, you might find that it only works at a narrow range of frequencies, temperatures and voltages. You may find that the number of good parts yielding is a little less than one might hope. You may find that certain combinations of instructions or data cause unexpected problems. You may find that, say, 90% of all of the parts coming back from fab don't work because a certain combination of shapes in metal ends up coming back as a short due to a lithography problem. The list of problems one might see is extensive (endless?). In any case, you will almost certainly find that the chip doesn't behave or yield exactly as well as one might have hoped.
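The "narrow range of frequencies and voltages" is typically characterized by sweeping the two against each other and plotting pass/fail - a shmoo plot. Here is a toy sketch of one; the pass/fail model is completely invented (a made-up linear frequency-vs-voltage relationship), since the real data would come from running tests on actual parts:

```python
def passes(freq_mhz, voltage_v):
    """Toy pass/fail model: the hypothetical part only meets timing when
    voltage is high enough for the requested frequency. The scaling
    constants are made up for illustration."""
    max_freq = 2500 * voltage_v - 500   # MHz, invented relationship
    return freq_mhz <= max_freq

def shmoo(freqs, voltages):
    """Return a text shmoo: '*' = pass, '.' = fail; rows are voltage,
    columns are frequency."""
    rows = []
    for v in voltages:
        row = "".join("*" if passes(f, v) else "." for f in freqs)
        rows.append(f"{v:4.2f}V {row}")
    return "\n".join(rows)

print(shmoo(freqs=range(1000, 3001, 250), voltages=[1.0, 1.1, 1.2, 1.3, 1.4]))
```

On a real part, a ragged or disconnected edge on a plot like this is often the first visible symptom of a marginality worth chasing.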
Once you know that you have a problem (and this can be tricky in and of itself), then you need to debug the problem and find a fix for it. In modern times, high-speed testers and things like JTAG scan and TAP functionality enable a large degree of debug to occur with just a few Unix commands, but still there is nearly always some form of intuitive leap that an engineer (or team) needs to make to figure out exactly what the problem is. Scan may tell you that a latch is not getting the correct value at a certain time, but trying to figure out why the value is incorrect can be very hard. And once the mechanism for the failure is found, then you need to find a fix for it that minimizes the amount of rework that needs to be done, and be very sure that the fix doesn't inadvertently cause some other problem.
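The mechanical part of scan debug - lining a scan dump up against what the simulator predicted and finding where they first diverge - can be sketched in a few lines. The data layout and latch names here are hypothetical; real scan chains, dump formats, and tooling are far messier:

```python
def first_mismatch(expected, observed):
    """Compare simulator-predicted latch values against a scan dump from
    the part, cycle by cycle. Both arguments are lists of {latch_name: bit}
    dicts, one dict per captured cycle (an invented format for this sketch).
    Returns (cycle, latch_name) of the first divergence, or None."""
    for cycle, (exp, obs) in enumerate(zip(expected, observed)):
        for latch in sorted(exp):
            if exp[latch] != obs.get(latch):
                return cycle, latch
    return None

# Hypothetical two-cycle capture of two latches:
sim  = [{"fp_rs1": 1, "fp_rs2": 0}, {"fp_rs1": 1, "fp_rs2": 1}]
chip = [{"fp_rs1": 1, "fp_rs2": 0}, {"fp_rs1": 0, "fp_rs2": 1}]
print(first_mismatch(sim, chip))  # (1, 'fp_rs1')
```

That comparison only hands you the "a latch is wrong at cycle N" fact; the intuitive leap from there to the actual failure mechanism is the part no script does for you.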
Due to the cost of lithography masks, fixes normally pile up and then are all performed at the same time in what is called a "stepping". A stepping is essentially a revision of the masks, in much the same way as a software version. Once you have a fix in, you then need to wait a while for the fab to make it, and then the process starts all over again. Since you are finding problems, fixing them, and then having to wait for a while, it's hard to know whether the particular problem is definitely fixed, and whether it is masking some other problem. For example, a speedpath might prevent a latch from getting a correct value in time; you can fix that speedpath and wait for a stepping turn, only to find out that there was another speedpath just behind the first, so that you are only marginally further along towards your frequency goal than you were when you made the fix. There are a lot of things to check, it can be hard to get visibility into exactly what the root cause of a failure is, and the inability to instantly fix problems and move on slows the whole process.
Similar to the reason why it takes literally a decade to make a deep-space probe (like Cassini), or the reason why it takes a decade to get a drug through the FDA process, the reason why it takes a long time to get a chip from tape-out to customer shipment is that you can't fix problems later on.
It's a very complex process with a lot of variables, you have one chance to get it right, and the cost of failure is extremely high.
One might say that if you take a known good core "A" and step it twice on the same die, then it should be easy to know that it works fine. And it definitely helps a lot, but there is still plenty of work to be done. Interactions between the two cores need to be checked - and depending on the way that dual-core is implemented, this can actually be a fairly daunting task. Increasing the die size can cause litho structures that worked fine in a smaller core to run into problems. Doubling the cores can lead to power issues that need to be resolved, and can cause marginalities that previously could have been ignored to become unexpected problems. Thermal issues need to be investigated. Board-level power delivery needs to be able to meet power demands. Reliability issues need to be checked. It should be a faster process than normal, but it is still a lot of work.
The International Test Conference (ITC) is going on this week in Charlotte, NC. If you picked up the proceedings, you'd probably get everything that I mentioned above in mind-numbing detail.
Patrick Mahoney
Senior Design Engineer
Intel Corp.