What were the high-level microarchitectural differences between P4 and Core2 that resulted in the substantial boost in IPC at the time?
Given that this happened nearly 5 years ago, I have surprisingly (or not) little recollection of "why" P4 was such a cluster-f whereas Core2 turned out to be the bee's knees.
And given that info, what would it take (microarchitecture-wise) to provide a similar boost in IPC over Westmere? 1-cycle L1$?
A lot.
Pipeline stages-
The original Willamette (and Northwood too) had a 20-stage pipeline when there was a hit in the Trace Cache. From what I read, there were 8 more stages on a miss. The number of pipeline stages alone didn't account for all of the difference in performance, but indirectly I guess you can say it did.*
Trace Cache-
Usually when a performance feature is added, it doesn't replace what already exists, but in the Pentium 4's case it did. When there's a miss in the Trace Cache, there is only 1 decoder to fall back on, and the chip essentially turns into a 1-issue CPU. I know there are jokes about how the ILP era ended and such, but 1-issue is a problem. The Trace Cache also took the place of the L1 instruction cache, because the Pentium 4 did not have one.
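To put a rough number on why the single fallback decoder hurts, here is a minimal back-of-the-envelope sketch. It assumes the Trace Cache delivers 3 uops per cycle on a hit (the "6 every 2 cycles" figure quoted in this post), that the fallback decoder delivers 1 per cycle, and that each decoded instruction is one uop. The hit fractions are made-up inputs, not measurements, and the function name is just for illustration.

```python
def effective_frontend_rate(hit_fraction, hit_rate=3.0, miss_rate=1.0):
    """Average uops delivered per cycle by the front end.

    hit_fraction: share of uops supplied from the Trace Cache (hypothetical).
    hit_rate:     uops/cycle on a Trace Cache hit (assumed 3, i.e. 6 per 2 cycles).
    miss_rate:    uops/cycle through the single decoder (assumed 1).

    The average is time-weighted: the slow path costs more cycles per uop,
    so it drags the overall rate down more than its share of uops suggests.
    """
    if not 0.0 <= hit_fraction <= 1.0:
        raise ValueError("hit_fraction must be between 0 and 1")
    cycles_per_uop = hit_fraction / hit_rate + (1.0 - hit_fraction) / miss_rate
    return 1.0 / cycles_per_uop


if __name__ == "__main__":
    # Hypothetical hit fractions, only to show the shape of the curve.
    for h in (1.0, 0.95, 0.9, 0.8):
        print(f"hit fraction {h:.2f}: {effective_frontend_rate(h):.2f} uops/cycle")
```

Under these assumptions, even a 90% hit fraction already pulls the front end down from 3 to 2.5 uops per cycle, and that is before counting any extra pipeline stages on the miss path.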
Execution Units-
Intel did a lot to save die space on the Pentium 4. There were 3 ALUs on the Pentium 4: 1 full "slow" ALU that ran at clock speed, and 2 simple "fast" ALUs that ran at 2x the clock speed. The Trace Cache can only feed 6 uops every 2 cycles (3 per cycle), while the two fast ALUs together can take up to 4 simple operations per cycle, which means it might not be able to keep both fed when they work at once. Although that should be a small thing.
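The arithmetic behind that mismatch, as a small sketch using only the numbers in this post (2 fast ALUs at 2x clock, Trace Cache feeding 6 every 2 cycles). It looks at peak rates only and ignores the slow ALU, loads, stores and everything else, so treat it as an illustration rather than a model of the real chip; the function names are made up.

```python
def fast_alu_peak(num_fast_alus=2, clock_multiplier=2):
    """Peak simple ALU operations per core clock for the double-pumped ALUs."""
    return num_fast_alus * clock_multiplier


def trace_cache_supply(uops=6, cycles=2):
    """Peak uops per core clock that the Trace Cache can deliver."""
    return uops / cycles


demand = fast_alu_peak()        # 2 ALUs * 2 ops per clock each
supply = trace_cache_supply()   # 6 uops / 2 cycles
shortfall = demand - supply     # ops per clock the fast ALUs could take but never receive

if __name__ == "__main__":
    print(f"fast ALU peak:      {demand} ops/clock")
    print(f"trace cache supply: {supply} uops/clock")
    print(f"shortfall at peak:  {shortfall} ops/clock")
```

So at best the front end can supply three quarters of what the two fast ALUs could consume at peak, which fits with this being a small effect in practice since real code rarely sustains that peak anyway.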
Misc-
*And then there's replay. Because the pipeline was so long, it needed very aggressive speculation to keep it fed with data. Replay is essentially a "clone" pipeline that was brought in to redo the work when that speculation failed. From what I read, although the idea itself was nice, it brought various problems; for example, it could effectively increase the number of pipeline stages, and thus the misprediction penalty.
Replay was also thought to be the reason SMT on Pentium 4 did not work as well as it should have.
Core 2 managed to increase performance by ~90% over Pentium 4 at the same clock. To do it again compared to Westmere would be quite a sight.
