The module design is 85-90% efficient when fully loaded; put another way, two threads sharing a module deliver roughly 1.7-1.8x the throughput of one thread running alone in it. With only one core active, that core gets full access to the front end, its own integer execution units, and the FPU. Steamroller and Excavator have improved this to some extent, with each Steamroller core getting its own 4-wide decoder (way more than it needs). Single-thread efficiency (per module) will not improve by removing CMT: if you took off the other integer core along with its share of the L1 instruction cache and decoder, you would see the same performance as running one thread in a module (your core simply is no longer CMT).
That is two-core performance, NOT single-thread IPC.
http://www.anandtech.com/show/5057/the-bulldozer-aftermath-delving-even-deeper
Nice explanation.
I didn't see any explanation in that link.
In any event, the module overhead is always there. The OS is always scheduling something (I've seen the code, no guesswork). That work includes halting the core, performing context and ring-mode switches, running the OS scheduler or kernel tasks, handling interrupts, and so on. Only "parked" cores get any reprieve from this background onslaught.
For the module, this means there is always something being addressed to each core, and there are extra pipeline stages to deal with it, so that added latency is always present. You can't get away from it no matter what, even if a core is parked and receiving nothing other than C6-state commands (which keep the core off, but ready).
This overhead is more meaningful than you might think. If one core is in a C-state, though, the overhead is only around 3%, IIRC. At best-idle it is closer to 5-6%, but then there's thread scheduling from the OS to consider, which moves the overhead right up to about 15% on the original Bulldozer - though I think the nominal front-end overhead was somewhere around 10% in this scenario, which is pretty low when you really think about what it takes to make that happen.
That means you should be able to set thread affinity and get about an 8-9% improvement on an otherwise unladen system by utilizing only one thread per module (roughly the gap between the ~15% scheduled case and the ~5-6% best-idle case). This overhead is smaller for SIMD-heavy workloads, though, as the front end is less stressed and the burden shifts more to the caches, execution units, and schedulers.
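For what it's worth, here is a minimal sketch (my own illustration, not from any of the linked articles) of pinning a process to one core per module on Linux via sched_setaffinity(), assuming the sibling cores of each module are enumerated as adjacent logical CPUs (0/1, 2/3, 4/5, 6/7), as on a stock FX-8150:

/* Sketch only: pin the calling process/thread to logical CPUs 0, 2, 4, 6,
 * i.e. one core per module, assuming sibling cores are enumerated in
 * adjacent pairs (0/1, 2/3, 4/5, 6/7). */
#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>

int main(void)
{
    cpu_set_t set;
    CPU_ZERO(&set);
    for (int cpu = 0; cpu < 8; cpu += 2)
        CPU_SET(cpu, &set);                 /* CPUs 0, 2, 4, 6 */

    /* pid 0 means "the calling thread" */
    if (sched_setaffinity(0, sizeof(set), &set) != 0) {
        perror("sched_setaffinity");
        return 1;
    }
    puts("affinity set: one core per module");
    return 0;
}

Compile with gcc and run it ahead of whatever work you want kept to one core per module; the same idea works per-thread with pthread_setaffinity_np().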
Of course, the more you have going on for that second thread, the worse the overhead gets: not only is the front end slowing things down now, but you also have the shared caches and the dispatch controller to consider at that point (mostly the slow-arse L2, even with the help of the WCC, the write-coalescing cache).
Not all of these costs come together to hurt performance at the same time, of course, but when they do, that's around a 20% drop in IPC on Bulldozer due to the module, with about half of that being a full-time expense.
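If it helps, the back-of-envelope version of those figures looks like this (my rough estimates from above, not measurements; the baseline is simply the same core with zero module overhead):

/* Back-of-envelope sketch using the rough figures above: roughly 10% of IPC
 * as a full-time module cost, plus roughly another 10% only when the sibling
 * core is also loaded. Numbers are estimates, not measured. */
#include <stdio.h>

int main(void)
{
    const double full_time_cost = 0.10;   /* always present            */
    const double sharing_cost   = 0.10;   /* only with both cores busy */

    double one_thread_per_module = 1.0 - full_time_cost;
    double both_cores_loaded     = 1.0 - full_time_cost - sharing_cost;

    printf("one thread per module: %.0f%% of zero-overhead IPC\n",
           one_thread_per_module * 100.0);
    printf("both cores loaded:     %.0f%% (the ~20%% worst case)\n",
           both_cores_loaded * 100.0);
    return 0;
}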
Please feel free to refute any of my math; this is all from memory of core documentation I read even before the first FX-8150 entered production.
--
EDIT:
I got to thinking about corroborating evidence for my post and did a quick search for some benchmarks to show the scaling costs:
http://techreport.com/review/21865/a-quick-look-at-bulldozer-thread-scheduling/2
Here, they ran two threads using affinity masking.
0x55 is 01010101 in binary, i.e. cores 0, 2, 4, and 6, so one core per module is scheduled for the task. The results seem to be pretty much in line with my statements above (seems my memory isn't as bad as I thought :-D).
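For anyone who wants to reproduce that kind of masking themselves, here is a small sketch of applying the same 0x55 mask programmatically on Windows (my own example; the TechReport testing may just as well have used Task Manager or start /affinity):

/* Sketch: restrict the current process to the 0x55 affinity mask on Windows.
 * 0x55 = binary 01010101 = logical CPUs 0, 2, 4 and 6, i.e. one core out of
 * each module on an FX-8150. */
#include <windows.h>
#include <stdio.h>

int main(void)
{
    DWORD_PTR mask = 0x55;   /* 01010101 -> one core per module */

    if (!SetProcessAffinityMask(GetCurrentProcess(), mask)) {
        printf("SetProcessAffinityMask failed: %lu\n", GetLastError());
        return 1;
    }
    printf("process limited to CPUs 0, 2, 4, 6\n");
    return 0;
}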
Of course, these numbers still can't tell us how much IPC is uniformly lost to the module in the first place; they only show the scaling difference.