Intel17 said:
What are the kinds of "problems" that CPU architects/design engineers have to face? What kinds of problem solving cleverness is there in the transition from saying, "oh, the CPU will have these execution units and this cache" to actually designing it, and then from there actually building it?
What kinds of things make one CPU design team produce a better CPU than another (say, Bulldozer vs. Sandy Bridge)? Is it at the high architectural level, or is the circuit design & such just better? Or a combination of both?
The problems cover a broad swath of issues. In the logic design stage, you develop the CPU using code that looks suspiciously like C or Pascal, and during this phase, coders can create bugs that are hard to find and hard to fix. For example, a portion of the design might have a counter and then a series of “if” statements (or a “case”), and the coder might forget to include a stage in the “if” flow for a scenario that they can’t imagine happening, but can. In the circuit design stage, you are getting input signals from other engineers’ blocks, and those signals might be under-driven and thus produce a really slow edge rate (it takes too long to get from 0->1 or vice versa), so then you have to get those fixed. Or you might find that you have a reliability problem in a large wire and you need to change the metal width, but this causes other things to move. Or you might find that someone goofed and the schematics don't do the same thing as the high-level RTL language. Or, and this might seem unbelievably odd, you might spend a lot of time trying to get the “tools” (the CPU CAD software) to understand whatever clever thing you happen to have put together, and thus waste a whole lot of time trying to get some program to agree with you that your circuit will work.
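To make the forgotten-branch bug concrete, here is a minimal sketch in Python (standing in for a real hardware description language). The function names, the 2-bit counter width, and the stage names are all invented for illustration and are not from any actual design. The buggy version quietly keeps the previous output for the counter value nobody expected, and an exhaustive sweep over the counter is one simple way such a hole might be caught.

```python
# Hypothetical model of a small control block: a 2-bit counter selects
# which stage is active. All names here are assumptions for illustration.

def select_stage_buggy(counter: int, prev: str) -> str:
    """Handles only the counter values the coder expected to see."""
    if counter == 0:
        return "idle"
    elif counter == 1:
        return "load"
    elif counter == 2:
        return "execute"
    # counter == 3 was assumed "impossible", so the output silently
    # keeps its previous value instead of moving to a defined stage.
    return prev


def select_stage_fixed(counter: int, prev: str) -> str:
    """Same logic with every counter value handled explicitly."""
    if counter == 0:
        return "idle"
    elif counter == 1:
        return "load"
    elif counter == 2:
        return "execute"
    elif counter == 3:
        return "writeback"
    raise ValueError(f"counter out of range: {counter}")


def unhandled_cases(select, width: int = 2) -> list:
    """Try every counter value and report the ones that fall through."""
    sentinel = "<stale>"
    return [c for c in range(2 ** width) if select(c, sentinel) == sentinel]


print(unhandled_cases(select_stage_buggy))
print(unhandled_cases(select_stage_fixed))
```

With only four counter values an exhaustive sweep is trivial; the hard part in a real design is that the state space is far too large to enumerate this way.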
There’s a huge amount of cleverness that goes into everything. Some of it is minor cleverness and some is major. I remember working on a CPU codenamed McKinley, and someone on my team came up with the idea of a “prevalidated cache” that would greatly speed up cache accesses: instead of a two or three cycle L1 cache, we had a single cycle cache. So that’s a cache with ½ to 1/3 the latency of a traditional design. There were some negatives to the idea too, but overall it was a huge performance improvement. That’s an idea that took the traditional design and changed it fairly dramatically to result in a much faster design. But beyond the big ideas, there are a lot of little ideas in the way things are assembled. I remember once I was working on a circuit that involved two 64-bit rotators, which allow you to shift a 64-bit value right or left (numbers wrap around, so if you rotated “0111” by two bits right, you’d get “1101”), and every byte could rotate to every other byte, and it supported switching “endianness” (the ordering of the bytes) on the fly. It was a sub-component of a register file. So, when you think about how to do this, it’s more than a bit confusing. You draw this diagram and any one bit can move to any one of 16 places. So, two of us were thinking about it and we realized that we had lots of time to do this, so rather than build two of these monstrosities for the two 64-bit values, we could build one, use the circuit on the first clock phase to do one 64-bit register, and then swap and use the second clock phase to do the other 64-bit register, thus saving a lot of space and a bit of power. I’m not sure where this falls on the cleverness spectrum, but I thought it was pretty cool. And this is for some circuit that normally no one would ever hear about, a precalculation for a “store bypass” from a cache to a register file.
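For anyone who wants to see the rotate and the sharing trick in code, here is a rough software sketch in Python. It only models the behavior, not the circuit: the function names are made up for this example, and the "two clock phases" are simply two passes through the same function, which is an assumption about how to picture it rather than a description of the actual hardware.

```python
def rotate_right(value: int, amount: int, width: int = 64) -> int:
    """Rotate `value` right by `amount` bits; bits falling off wrap around."""
    mask = (1 << width) - 1
    amount %= width
    value &= mask
    return ((value >> amount) | (value << (width - amount))) & mask


def swap_endianness(value: int, num_bytes: int = 8) -> int:
    """Reverse the byte order of a value that is `num_bytes` wide."""
    return int.from_bytes(value.to_bytes(num_bytes, "little"), "big")


def rotate_pair_shared(reg_a: int, reg_b: int, amount: int) -> tuple:
    """Model ONE rotator reused on two phases instead of two rotators."""
    results = []
    for operand in (reg_a, reg_b):  # phase 1 handles reg_a, phase 2 handles reg_b
        results.append(rotate_right(operand, amount))
    return tuple(results)


# The 4-bit example from the text: "0111" rotated right by two bits.
print(format(rotate_right(0b0111, 2, width=4), "04b"))
print(hex(swap_endianness(0x0102030405060708)))
```

The point of `rotate_pair_shared` is only that the same block does both jobs one after the other; in silicon that trades a little time (which we had plenty of) for roughly half the area.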
So what makes one team better is a super hard question, and I’m not sure that I know the answer. There’s a lot to a successful design. It’s very hard to get a huge group of people to work together as a cohesive team. For example, I remember one project that had the marketing department constantly changing the requirements. On another, I remember that practically the entire first and second levels of management were made up of engineers who had never managed anything before. I remember hearing about one that had a huge problem figuring out the timing of paths between units, so that when the chip pieces were all fitted together, there were huge timing paths. I remember teams that struggled with large team defections (a large number of influential engineers left as a group) or layoffs in the middle of the project. I remember hearing of problems with fabs where the circuit performance specs were supposed to be pretty good, but then the fab had yield problems, and to resolve them, they slowed the circuits, and a decent design was now much slower and less competitive in the marketplace. I remember one design that was spread across 5 time zones and had about a third of the team working 12.5 hours off from another third of the team, so that if you asked a question, it took a day to get an answer. I remember one design that didn’t have good access to the electrical characteristics of the circuits (the process file); a lot of the design was very aggressive, and when it was manufactured it had huge electrical design problems.
Beyond what can go wrong, another big lever towards a great design is how many engineers your team has and how much you rely on automated design tools. A great example of this was the DEC Alpha.
The main contribution of Alpha to the microprocessor industry, and the main reason for its performance, was not so much the architecture but rather its implementation. At that time (as it is now), the microchip industry was dominated by automated design and layout tools. The chip designers at Digital continued pursuing sophisticated manual circuit design in order to deal with the overly complex VAX architecture. The Alpha chips showed that manual circuit design applied to a simpler, cleaner architecture allowed for much higher operating frequencies than those that were possible with the more automated design systems.
So I have seen a lot of what can go wrong. I’m not sure what exactly makes projects go well, but it’s my opinion (which a lot of engineers disagree with) that a large portion of it comes down to who your “lead engineers” are and how good your management team is. I think good leadership is essential in a large project, and without good leadership, you end up with a muddled mess that turns out to be late. If you have literally a hundred people trying to work together, you really need a core team of experienced leaders who can get everyone to work together as a cohesive whole. But beyond just good leadership, you need all of the core design team sub-groups to know what they are doing: a validation team that doesn’t find a lot of bugs until after the design has gone off to manufacturing can totally crater a project. You also need a good manufacturing group to make the chip, and then you need a good marketing team to sell it and work with other companies to use the design. The classic example to me of what happens without good marketing is the DEC Alpha.