I saw a really interesting poster from Intel at DAC comparing semicustom hand implementation to fully automated synthesis and place & route. They found that hand implementation didn't actually buy them anything - in fact, the semicustom design consumed dramatically more power and area while gaining only a trivial amount of performance (~1%?). If you think about it, there are a few reasons that place & route can beat a human:
1) Humans can design fantastic bit-slices, but bit-slices aren't always optimal. They're great sometimes, but hand design tends to leave a lot of empty space and waste a lot of power. For example, if you have a shifter feeding an adder (as some ugly instruction sets require), the adder needs the lower bits to be available before the upper bits. A human isn't going to be able to optimize the shifting logic separately at every bit; he's either going to plop down one high-speed shifter optimized for bit 0 everywhere, or, best case, break the datapath into a few chunks and use progressively smaller (lower-power, slower) shifters for each block of, say, 16 bits. A tool can optimize every bit differently.
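To make the timing argument concrete, here's a toy Python model of that shifter-feeds-adder case. All the numbers are made up for illustration - the point is just that if the adder's carry ripples upward from bit 0, the shifter driving bit i gets roughly i carry-delays of extra slack, which a tool can spend on smaller, slower cells:

```python
# Toy model: per-bit slack for a shifter feeding a ripple-style adder.
# CARRY_DELAY is a made-up unit delay, not from any real process.
CARRY_DELAY = 1.0
WIDTH = 64

# The adder consumes bit 0 first; the carry doesn't reach bit i until
# roughly i * CARRY_DELAY later, so the shifter output for bit i can
# arrive that much later than bit 0's without hurting the cycle time.
extra_slack = [i * CARRY_DELAY for i in range(WIDTH)]

# A tool can spend this slack bit by bit (smaller, lower-power cells
# at high bits); a human doing bit-slices typically picks one shifter
# implementation and stamps it down 64 times.
print(extra_slack[0], extra_slack[16], extra_slack[63])  # 0.0 16.0 63.0
```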
Some structures are really pathological for humans, like multipliers. The most straightforward way to place them is a giant parallelogram, which leaves two large unused triangles. You can get into some funky methods of folding multipliers to cut down on wasted space, but it gets complicated fast (worrying about routing tracks, making sure you're still keeping the important wires short, etc.). A place & route tool can create a big, dense blob of logic that uses area very efficiently.
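A rough back-of-the-envelope on how bad the parallelogram is, assuming the simplest array-multiplier placement where each partial-product row is shifted one column from the last:

```python
# Naive NxN array-multiplier placement: row i holds N partial-product
# cells shifted left by i columns, so the bounding box spans 2N-1
# columns over N rows, while only N*N cells are actually used.
N = 32  # operand width (example value)

used_cells = N * N
bounding_box = (2 * N - 1) * N
wasted = bounding_box - used_cells  # the two empty triangles

# Wasted fraction is (N-1)/(2N-1), i.e. just under half the bounding
# box is empty for any reasonable width.
print(used_cells, bounding_box, wasted / bounding_box)
```

For N=32 that's 992 of 2016 cell sites sitting empty (~49%), which is the dead space the folding tricks - or an autoplacer's dense blob - try to win back.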
2) Modern place & route tools have huge libraries of implementations for common structures that they can select from. For example, Synopsys has something called DesignWare, which provides an unbelievable selection of circuits for (random example) adders, targeting every possible combination of constraints (latency, power, area, probably tradeoffs of wire delay vs. gate delay, who knows what else). A human doing semicustom implementation doesn't actually have to beat a computer - he has to beat every other human who has attacked the problem before and had their solution incorporated into these libraries.
3) An automated design can adapt quickly to changes. You have to break a semicustom design up into pieces and create a floorplan, giving each piece an area budget and planning which directions its data comes from and goes to (e.g. "the multiplier's operands come from the left"). Once the pieces are done, you have to jiggle things around to handle parts that came in over or under budget, and you end up with a lot of whitespace. If, halfway through the project, you realize you want to make a large change, you may find that too much rework is required and you're stuck with a suboptimal design.
Plop a quarter-micron K7 on top of a 32nm Llano... is it really likely that the same floorplan has been optimal since the days when transistors were slow and wires were fast, through to the days when wires are slow and transistors are fast? Engineers always talk about logic and SRAM scaling differently, yet the L1 caches appear to take a pretty similar amount of area. Shouldn't 7 process generations have caused enough churn that a complete redesign would look pretty different, even from a very high level? With an autoplaced design, you can try all sorts of crazy large-scale floorplan changes with minimal effort. If you try a new floorplan with a hand-placed design, you won't know for sure that it works until you've redesigned every last piece. You could discover a nasty timing path pretty late and suddenly be in big trouble.

It's interesting to see how efficiently the area was used on that original K7 - pretty much every horizontal slice is the same width. The Llano image doesn't look quite as nice. For what it's worth, you can do similar comparisons with Pentium Pro/P2/P3/Banias/etc. On a related note, the AMD website used to have a bunch of great high-res photos of various processors. Anyone know where to find them now?
4) Not all engineers are the best engineers. You might be able to design the most amazing multiplier in the world, but a company might have a hard time finding 100 of you, and big custom designs require big teams.
If you look carefully at die photos of some mainstream Intel processors, it looks like they've actually been using a lot of automated place & route since at least as far back as Prescott.
This blurry photo of Prescott shows a mix of what appears to be custom or semi-custom logic at the bottom and top-right, as well as a lot of what appears to be auto-placed logic - note the curvy boundary between the logic and what looks like whitespace (darker) to the left of and above the center; humans just don't do that. I've also read a paper by a company involved in Cell (I think it was Toshiba) that found that an autoplaced version of Cell was faster and smaller than the original semicustom implementation.