I couldn't find one of the nice IPC-over-time diagrams I have somewhere on my HDD.
The best I could find off-hand was this one, but it averages over whole test runs, and only for Conroe
(BZip2 was quite a surprise, actually):
http://www.ece.lsu.edu/lpeng/papers/isast08.pdf
Do you have the module concept in mind here? Otherwise, a front end delivering more than X decoded ops per cycle, while execution only processes them at a maximum rate of X, would be no bottleneck in the first place.
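To put that in concrete terms, here's a toy steady-state model (all widths, the queue depth, and the cycle count are made-up numbers, and stalls, mispredicts, and dependency chains are ignored) showing that sustained IPC is capped by the narrower of the two stages:

```python
# Toy steady-state model: a decoder feeding execution through a queue.
# All numbers are hypothetical; no stalls, mispredicts, or dependency chains.
def steady_state_ipc(decode_width, exec_width, queue_depth=64, cycles=100_000):
    queue = 0       # decoded ops waiting to issue
    executed = 0
    for _ in range(cycles):
        # Front end can only deliver into free queue slots.
        queue += min(decode_width, queue_depth - queue)
        # Back end drains at most exec_width ops per cycle.
        issued = min(queue, exec_width)
        queue -= issued
        executed += issued
    return executed / cycles

for d in (2, 3, 4):
    print(f"decode={d}, exec=3 -> sustained IPC ~ {steady_state_ipc(d, 3):.2f}")
# Sustained IPC is min(decode_width, exec_width): widening decode past the
# execution rate buys nothing once the queue is full.
```

Once the queue fills, any decode slots beyond the execution rate are simply wasted, which is the "no bottleneck" case above.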
"That leaving execution starved, with code that offers high ILP, seems like a silly thing to do, compared to matching the capability end to end, to prevent inefficient execution of efficient code."
Again, I must refer to the Captain
(motto: never pass up a chance to reference that, the hard-boiled egg scenes, or plastic Jesus). How is that not basically the same thing? If most code only runs at <=1 IPC, but you can find some workloads--or have reasonably good simulations of workloads you want to be good for--that manage ~1.5 in some oft-run loops when not limited by the front end, would you make a front end as close to 1 instruction per cycle as possible--the common case--or try to make sure it can manage >=1.5 all day long?
Now, if >=1.5 basically means around 2, then adding a processor adds another 2 for that processor
(assume RISC, so it's not binary-dependent, and there's no duplicated complex decoder). With a shared setup, 3 would be enough to take care of peaks,
and the total width would be less than that of two single-threaded processors. It would only be slower in the very rare corner cases where both could exceed that 1.5 at the same time, for some length of time, but this would be both highly uncommon
(as defined by it being difficult to come up with programs that can do it while processing something akin to real data), and since higher-ILP code tends to be less sensitive to CPI, making up for it elsewhere shouldn't be hard
(such as with changes that increase the actual IPC of more common cases...OK, that is hard, but it's also far more important, as a 2% improvement across the board beats worrying about a 2% hit in 1/10,000 of performance-oriented programs).
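Just to make that arithmetic concrete, here's a quick Monte Carlo sketch. The demand distribution is entirely invented for illustration--real workload traces would be needed for a real answer--but it shows how rarely two cores would peak past a shared 3-wide front end at the same time:

```python
import random

# Monte Carlo sketch of the shared-front-end argument. The demand distribution
# is invented for illustration: 80% of cycles a core wants 1 decoded op
# (ordinary ~1 IPC code), 20% of cycles it's in a high-ILP loop wanting 2.
def core_demand():
    return 2 if random.random() < 0.2 else 1

def shortfall_rate(shared_width=3, cycles=1_000_000):
    short = 0
    for _ in range(cycles):
        total = core_demand() + core_demand()  # two cores sharing one front end
        if total > shared_width:               # both peak in the same cycle
            short += 1
    return short / cycles

random.seed(42)
print(f"cycles where a shared 3-wide front end falls short: {shortfall_rate():.1%}")
```

With those invented numbers the shared front end falls short in ~4% of cycles, and a per-core decode queue would hide most of that; two dedicated 2-wide front ends (total width 4) never fall short, but the extra width sits idle in the common case.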