First, BD is designed with enough resources not to bottleneck.
Second, you asked for a benchmark and I showed you the benchmark, so don't accuse me of cherry picking. Why did I choose SPEC int? Because that is a processor-only benchmark. If you want to argue any other benchmarks you start down the path of the platform and the processor only becomes a part of the equation.
"Not to bottleneck" is too strong a term for what Bulldozer was designed for. It sounds like "never, ever", which I doubt.
It was designed not to bottleneck in the vast majority of workloads; as far as I can see from the design, that is right. However, I could create workloads where Bulldozer would suffer from bottlenecks. I could get the decoder to bottleneck with plenty of SSE instructions that all use 128-bit constants (the worst-case scenario for the decoder). Possibly I could even make the load/store units a bottleneck by running integer loads and stores in parallel with SSE loads and stores on both cores of a module at the same time until the L/S buffer is full. Yes, those are rather artificial workloads, but the term you used is too strong. Okay, my claim that I could write code that exposes some bottleneck in Bulldozer still has to be proven on an actual part, but I think this "never, ever" claim is questionable.
Sandy Bridge, for example, has a general bottleneck in the predecode stage. Intel uses a decoded micro-op cache (similar in spirit to a trace cache) to compensate for this to some degree in loops, as long as the loop is small enough to fit into that cache, so optimizations such as loop unrolling can create severe problems for Sandy Bridge. Bulldozer may not have a general bottleneck like this, but for any processor you could write code snippets that make it stall on some bottleneck.
Using SPEC (CPU2006) is not cherry picking. SPEC is the only fair real-world CPU benchmark suite out there, and one that is fully transparent. All professionals use SPEC as a basis for decisions when it comes to general CPU benchmarks. I really don't know why Anandtech doesn't use SPEC results and prefers very specialized benchmarks instead.
Therefore, after the last few posts, it is possible to claim that in some cases, a hyperthreading CPU will only be able to run 1 thread, at any precise point in time, whereas a bulldozer module can run 2.
Sure it can switch between 2 threads quickly, but we are not discussing this.
Point made.
No, that is wrong.
In ALL cases, a hyperthreading CPU will only be able to run 1 thread, at any precise point in time, whereas a bulldozer module can run 2.
Then it is correct. So at any given clock cycle only one thread is running. And they split it into a priority thread (the fast one) and a "waiting" thread (the slow one).
But do not be fooled: because of processor stalls, this still gives a performance improvement in many cases. It is, however, the reason why Hyperthreading gives only a -5% to 30% performance boost, while Bulldozer's "module technology" gives an 80-100% boost.
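To put those percentages side by side, here is a minimal sketch that turns a second-thread boost into throughput relative to a single thread. The function name `two_thread_throughput` is made up for illustration, and the ranges are simply the ones claimed above, not measurements.

```python
def two_thread_throughput(boost_percent):
    """Throughput of two threads relative to one thread (1.0 = no gain)."""
    return 1.0 + boost_percent / 100.0


# Boost ranges as claimed in the post above (not measured here).
claimed = {
    "Hyperthreading": (-5, 30),
    "Bulldozer module": (80, 100),
}

for name, (low, high) in claimed.items():
    low_x = two_thread_throughput(low)
    high_x = two_thread_throughput(high)
    print(f"{name}: {low_x:.2f}x to {high_x:.2f}x of one thread")
```

A negative boost simply means that two threads together finish less work than one thread alone would.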
There are even possible scenarios where the waiting thread in Intel's HT will never be executed at all.
For chess programs, for example, performance will even degrade if you enable Hyperthreading, whereas Bulldozer's split-core approach would shine. So chess programs would be a benchmark that favors AMD Bulldozer and cannot profit from HT at all.
Generally speaking, HT shines if the workload is extremely memory-intensive and the priority thread often has to wait for memory results to come in. If this is the case, the waiting thread will execute. The waiting thread can also use other pipeline stalls of the priority thread (e.g. because of a branch misprediction) to execute.
This has the funny effect that if Intel improves branch prediction and the memory subsystem, the priority thread delivers more performance while the waiting thread delivers less. So overall there will be no performance improvement. For Bulldozer, if branch prediction and/or the memory subsystem is improved, both threads get faster, so the overall performance gain from such improvements is doubled.
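The argument can be sketched as a toy model. Everything in it is an assumption for illustration: the function names are made up, the "waiting" thread is assumed to run only in the stall cycles of the priority thread and never to stall itself, and the two cores of a module are treated as fully independent (the shared front end is ignored). It is not a description of how either chip actually schedules threads.

```python
def ht_throughput(stall):
    """Toy model of the priority/waiting scheme described above.

    stall: fraction of cycles in which the priority thread is stalled.
    Assumption: the waiting thread runs only in those stall cycles.
    Returns (priority, waiting, total) in useful cycles per cycle.
    """
    priority = 1.0 - stall
    waiting = stall
    return priority, waiting, priority + waiting


def module_throughput(stall):
    """Toy model of two independent cores, each with the same stall fraction."""
    per_core = 1.0 - stall
    return per_core, per_core, 2.0 * per_core


# Compare a higher and a lower stall fraction (hypothetical numbers).
for stall in (0.30, 0.20):
    ht = ht_throughput(stall)
    mod = module_throughput(stall)
    print(f"stall={stall:.2f}  HT total={ht[2]:.2f}  module total={mod[2]:.2f}")
```

In this model, reducing the stall fraction leaves the HT total unchanged, because whatever the priority thread gains the waiting thread loses, while the module total rises by twice the single-thread gain. A real SMT core is more complicated than this, so take the numbers only as an illustration of the reasoning.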
That was likely one reason why AMD made significant improvements in those areas of the Bulldozer design (prefetcher, predictor width/depth, doubled memory bandwidth to the caches, reduced memory controller latency, larger load/store queue).
On the other hand, these doubled half-cores put more stress on the memory subsystem (not on the predictors and L1, since those are doubled).