Well, modern computing is all about memory. The CPU spends way too much time waiting for memory.
Prescott, believe it or not, actually has quite a few improvements over Northwood in terms of "IPC". It added a dedicated shifter, it issues more micro-ops per cycle, and its cache fetching scheme is slightly improved. All in all, it wasn't just about hiking clockspeed. I think the problem is that Intel wanted it both ways and ended up getting neither. If the caching system and pipeline had been left alone, Prescott would've outperformed Northwood quite a bit at the same clockspeed. Of course, they just had to chase more clockspeed, so they hiked the L1 cache latency by 4x and the L2 cache latency by 2x to allow it, and they lengthened the pipeline as well. Now, common stigma aside, I doubt the increase in pipeline length by itself caused much of a performance hazard. Prescott's branch prediction algorithms were improved, and with modern branch prediction being ~95% accurate or better, the longer flush penalty only bites on the rare mispredict; even an extra ten or so cycles per miss averages out to well under a cycle per branch. Of course, all of this added to transistor count and therefore heat, so without the added clock scalability it was a waste.
However, in terms of performance, it's all about the memory. 4x the L1 latency means integer instructions (the ones that run on the double-pumped ALUs at *twice* the core clock) now wait 4 cycles for data instead of 1 or 2. Modern code achieves an ILP of maybe 2 if you're lucky; most of the time it sustains less than 1. So out-of-order execution simply cannot find enough independent instructions to keep the execution core fed while those loads are outstanding.
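To make that concrete, here's a rough C sketch (my own illustration, not anything measured on Prescott). The linked-list walk is one long dependent load chain, so with a 4-cycle load-to-use latency the core can at best start one load every 4 cycles no matter how wide or out-of-order it is; the plain array sum exposes independent loads the scheduler can overlap:

#include <stddef.h>

struct node { struct node *next; int payload; };

/* Fully dependent chain: each load's address comes from the previous load,
   so effective ILP is ~1 and the loop is latency-bound. */
int walk_list(const struct node *n)
{
    int sum = 0;
    while (n) {
        sum += n->payload;
        n = n->next;      /* must wait the full load latency before the next load can start */
    }
    return sum;
}

/* Independent loads: the out-of-order core can keep several in flight at once,
   so the same latency is mostly hidden. */
int sum_array(const int *a, size_t len)
{
    int sum = 0;
    for (size_t i = 0; i < len; i++)
        sum += a[i];
    return sum;
}

Average code sits somewhere between these two, and usually closer to the first.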
The case is slightly less bad for FP processing, but the latency it has to hide is worse: all FP data is fetched from the L2 cache, and with a 20+ cycle latency, I have a hard time imagining any code that can provide enough ILP to mask that. The best scheduler in the world (which happens to be in Prescott) cannot extract that much ILP from your average code.
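As a rough illustration of what it would take (again my own sketch, not from any Prescott document): to hide a ~20-cycle latency you need on the order of latency-divided-by-throughput independent FP operations in flight, which means code deliberately written with several separate dependency chains. A single-accumulator loop is one chain and just stalls:

#include <stddef.h>

/* One dependency chain: every add waits on the previous add plus its load. */
double dot_naive(const double *a, const double *b, size_t n)
{
    double acc = 0.0;
    for (size_t i = 0; i < n; i++)
        acc += a[i] * b[i];
    return acc;
}

/* Four chains: up to four multiply-adds can overlap, hiding part of the latency. */
double dot_unrolled(const double *a, const double *b, size_t n)
{
    double acc0 = 0.0, acc1 = 0.0, acc2 = 0.0, acc3 = 0.0;
    size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        acc0 += a[i]     * b[i];
        acc1 += a[i + 1] * b[i + 1];
        acc2 += a[i + 2] * b[i + 2];
        acc3 += a[i + 3] * b[i + 3];
    }
    for (; i < n; i++)   /* remainder */
        acc0 += a[i] * b[i];
    return (acc0 + acc1) + (acc2 + acc3);
}

Most real-world FP code looks like the first version, which is exactly why the scheduler has so little to work with.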
So we have reached the point where not only is main memory too slow to feed the processor, but with Prescott, the *cache* itself is too slow to feed the execution core. With only 8 GPRs (of which only 6 are usable by the programmer, IIRC) and 8 SIMD/MMX/FP registers, trips to the cache happen constantly.
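Register pressure is easy to demonstrate. In a hypothetical function like the one below (my example, nothing more), more values are live at once than x86 has architectural GPRs, so the compiler has to spill some of them to the stack; every spill and reload is a store plus a load through that 4-cycle L1, even though the code never touches an actual data structure:

/* More live values than the ~6 freely usable GPRs can hold, so some get
   spilled to the stack and reloaded through the L1 cache. */
long mix(long a, long b, long c, long d,
         long e, long f, long g, long h)
{
    long t0 = a * b + c;
    long t1 = c * d + e;
    long t2 = e * f + g;
    long t3 = g * h + a;
    /* t0..t3 plus several of the inputs are all still live here. */
    return (t0 ^ t1) + (t2 ^ t3) - (b + d + f + h);
}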
How to improve performance? Well, there are quite a few ways. For one, since the longer pipeline isn't doing anything but adding heat, take it back down to around Northwood's length. Another is to return the caching structure to Northwood's: 1- or 2-cycle L1 latency would help dramatically, and lower L2 latency would be great too. If the larger L1 data cache really is what costs so much latency, then shrink it back to 8KB. Some of Prescott's additions should be kept as well: the dedicated shifter, the wider-issue trace cache, and the improved branch prediction algorithms.
Some more experimental ideas: an added stack cache, since a good portion of memory accesses go to the stack rather than the heap, and a stack-like structure is easier to design for high speed. A stack manager would help as well. Micro-op fusion (perhaps even better than on Dothan/Banias) and a more flexible FP execution unit would also help. Currently, Netburst is only good at FP when performing packed (vector) SSE operations; another FP issue port would help much more and let scalar code execute much faster, as the sketch below illustrates.
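For what I mean by "only good at vector SSE", here's a small sketch with SSE intrinsics (mine, purely illustrative). The packed version handles four floats per issued operation, which is where Netburst's FP throughput comes from; the scalar version pushes one float at a time through the lone FP port, and that's exactly the case a second FP issue port would speed up:

#include <xmmintrin.h>  /* SSE intrinsics */
#include <stddef.h>

/* Scalar: one float per add, bottlenecked on the single FP issue port. */
void add_scalar(float *dst, const float *a, const float *b, size_t n)
{
    for (size_t i = 0; i < n; i++)
        dst[i] = a[i] + b[i];
}

/* Packed SSE: four floats per add.  Assumes n is a multiple of 4 and the
   pointers are 16-byte aligned, since the aligned load/store forms are used. */
void add_packed(float *dst, const float *a, const float *b, size_t n)
{
    for (size_t i = 0; i < n; i += 4) {
        __m128 va = _mm_load_ps(&a[i]);
        __m128 vb = _mm_load_ps(&b[i]);
        _mm_store_ps(&dst[i], _mm_add_ps(va, vb));
    }
}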
These are only some ideas; I'm sure there are many more. Sadly, the future of MPU design seems to be moving away from making more elegant and efficient cores. Instead, the vendors seem content to be lazy, slap multiple cores together, and call that a performance increase.