Interestingly, I think AMD can get a 33% IPC improvement almost for free.
The module design is supposed to be about 85~90% efficient (each core performing at about 85~90% of its optimum). The cache system is horrendous and costs performance relative to Phenom II's cache system (about 5~6% it seems, load dependent). Bulldozer's integer cores are massively narrow, with only two ALUs and two AGUs.
The module design is 85-90% efficient when fully loaded. With only 1 core active, that core gets full access to the front end, its own integer execution units, and the FPU. Steamroller and Excavator have improved this to some extent, with each SR core getting its own 4-wide decoder (way more than it needs). Removing CMT will not improve single-thread efficiency (per module). If you took off the other integer core, L1 instruction cache, and decoder, you would see the same performance as running 1 thread in a module (your core is no longer CMT).
That is a 30~42% improvement right there, with no other changes. Of course, these aren't all fully cumulative, so AMD will have had to make more changes to fully flesh out a 40% improvement. I just think many people underestimate how much of a burden it is for the front end in a Bulldozer CPU to process the load for two cores. Sure, it can do it, which is amazing, but it can't do it without adding retirement latency, which reduces performance in MANY different types of workloads.
That is 2 core performance, NOT single thread IPC.
http://www.anandtech.com/show/5057/the-bulldozer-aftermath-delving-even-deeper
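To put rough numbers on the disagreement above, here is a minimal sketch that compounds only the two figures quoted in this thread: the 85~90% module efficiency and the 5~6% cache cost. The function names are made up for illustration, and it assumes (my assumption, not anything AMD has published) that the gains multiply independently and that the module penalty applies only when both cores in a module are loaded.

```python
def compound_gain(factors):
    """Multiply independent speedup factors and return the total gain as a fraction."""
    total = 1.0
    for factor in factors:
        total *= factor
    return total - 1.0


def estimate_gain(module_efficiency, cache_cost, both_cores_loaded):
    """Estimate the gain from dropping CMT sharing and recovering the cache cost.

    module_efficiency: per-core efficiency in a fully loaded module (0.85 to 0.90 in this thread)
    cache_cost: performance lost to the cache system (0.05 to 0.06 in this thread)
    both_cores_loaded: the CMT penalty is assumed to exist only when this is True
    """
    # Removing the sharing penalty recovers 1/efficiency, but only under full module load.
    cmt_factor = 1.0 / module_efficiency if both_cores_loaded else 1.0
    # Assumption: fixing the cache recovers the quoted cost as a simple multiplier.
    cache_factor = 1.0 + cache_cost
    return compound_gain([cmt_factor, cache_factor])


if __name__ == "__main__":
    low = estimate_gain(0.90, 0.05, both_cores_loaded=True)
    high = estimate_gain(0.85, 0.06, both_cores_loaded=True)
    single = estimate_gain(0.85, 0.06, both_cores_loaded=False)
    print(f"both cores loaded: {low:.1%} to {high:.1%}")
    print(f"single thread:     up to {single:.1%}")
```

Under these assumptions the two quantified items come to roughly 17% to 25% with both cores loaded, and only the cache share with a single thread per module. The rest of the 30~42% would have to come from the wider integer cores, which nobody in this thread has put a number on.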
Nice explanation.