At some point the manual states a 5-cycle latency for FMA, FMUL and FADD (128-bit each). This could be the back-to-back latency when there is a dependent op, which would mean an increase of 1 cycle.
Next, mem ops get handled more efficiently. By the looks of it, the L/S unit manages loads independently of other ops and out of order. Accordingly, the manual states that memory access latencies may often be much less than the 4 cycles.
Furthermore, the SMT FPU means that even if there is not enough ILP in one thread's FP code to issue its instructions in a pipelined fashion (so latency wouldn't matter that often), there can still be FP code from the other thread to fill the empty issue slots -> converging towards maximum throughput.
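To illustrate that slot-filling effect with a toy model (the 5-cycle FMA latency is the figure quoted above; the single issue slot per cycle and the per-thread ILP values are simplifying assumptions, not real Bulldozer numbers):

```python
# Very simplified steady-state model of one shared FMA pipe: it can issue
# 1 op per cycle, every op has a 5-cycle latency, and each thread exposes
# a given number of independent dependency chains (made-up ILP values).
FMA_LATENCY = 5

def pipe_utilization(chains_per_thread: int, threads: int) -> float:
    # Each independent chain can supply one new op every FMA_LATENCY cycles,
    # and the pipe can accept at most 1 op per cycle.
    ready_per_cycle = threads * chains_per_thread / FMA_LATENCY
    return min(1.0, ready_per_cycle)

print(pipe_utilization(chains_per_thread=3, threads=1))  # 0.6 -> latency-bound
print(pipe_utilization(chains_per_thread=3, threads=2))  # 1.0 -> max throughput
```

With one thread the pipe sits idle waiting for dependent results; the second thread's independent ops fill those idle slots, which is exactly the converging-to-max-throughput argument.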
Regarding the faster L/S handling, that comes from the prefetcher, right? But the L1 latency increased by 1 cycle. And I also mean the several instructions which have increased latency.
Regarding FP I already said that a shared unit is more than two halves, for exactly the reason you stated. But regarding FP there is half the throughput; regarding SSE FP there is the same throughput, which means they can do more because of the sharing of units. But there it is also a bit different in Stars, because Stars has two pipelines as well (though somewhat restricted).
Aren't these two paragraphs in conflict with each other? You claim Bulldozer can do 4 MacroOps with the decoders, and in the second paragraph you criticize JFAMD for saying 4 MicroOps instead. If you know it's MacroOps and not MicroOps, what is the point of criticizing him? Or are you confused?
You have to remember that in a Bulldozer module there is only one decoder but two integer cores. That means you have a decoding bandwidth of 2 MacroOps per cycle per core.
These 2 MacroOps per core can be decoded into 2-4 MicroOps per core. These 2-4 MicroOps can be executed by the 4 MicroOp pipelines. That means these 4 pipelines and execution resources can do 2 MacroOps per cycle.
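A back-of-the-envelope sketch of that arithmetic (the even split of the shared decoder between the two cores and the 1-2 MicroOps per MacroOp are just the assumptions from above, nothing more):

```python
# Shared front end of one Bulldozer module, split evenly between its two
# integer cores (simplifying assumption); each MacroOp cracks into 1-2 MicroOps.
DECODE_MOPS_PER_MODULE = 4
CORES_PER_MODULE = 2

mops_per_core = DECODE_MOPS_PER_MODULE // CORES_PER_MODULE   # 2 MacroOps/cycle/core
uops_per_core = (mops_per_core * 1, mops_per_core * 2)       # 2-4 MicroOps/cycle/core

print(mops_per_core)   # 2 -> what one core's 4 pipelines actually get fed
print(uops_per_core)   # (2, 4)
```

So no matter that a core has 4 MicroOp pipelines, with both cores active the shared decoder caps each of them at 2 MacroOps per cycle on average.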
This... the problem with this statement is the word seldom... reverse it and you are getting closer to the truth... the OoO window is limited, the pipes have specific functions, code is execution-latency dependent... It is rarer to have all units at work than it is for a div stall to occur...
The point to understand is that in most cases you have either all 3 units at work or all 3 units stalling.
Whether or not it is useless, there are also other combinations that will result in 1 MacroOp becoming more than 1 MicroOp.
That is right, for mov it is important; I just wanted to point out that the LOAD-EXECUTE ops are unimportant.
But the Stars core can do this as well, because it handles MacroOps and has a corresponding AGU for each ALU. The Stars core executes each of those instructions 1 cycle faster than Bulldozer.
False. BD can decode up to 4 MOps and schedule 4 uOps on the FPU, ergo the maximum is 4.
BD can decode 4 MOps and schedule 4 uOps on the integer side.
BD can also do op fusion (which I didn't take into consideration).
Now, what are we talking about? About the module or the core? I talked about the core, you are talking about the module. I am talking about integer performance, and now you mix in FP.
Exactly, some bells should have rung with this one... If not: reread your own posts that try to use maximum IPC values in unrealistic scenarios for one operation as a measure of performance, or as a way to deduce performance in average applications. Fact is, no application reaches the absolute maximum IPC possible... ergo, maximum IPC is rather irrelevant for determining performance when the front end is completely different.
You have to understand the difference between IPC, which is instructions per cycle, and the execution capability of the core. First, many instructions are not finished within one cycle, so they occupy a pipeline/execution unit for more than one cycle. That means the absolute IPC for a certain program drops well below the number of pipelines/execution units, but hell, that doesn't mean they are doing nothing! But it is also clear that one unit less will drop this absolute IPC again, by exactly the relative reduction in pipeline/execution width. And heck, it does not matter whether the absolute IPC for a given application is 0.08 or 1.94. Again, the absolute number just does not tell you anything about pipeline/unit utilization. If a pipeline has to do its work over several cycles it is still used and useful.
Second, you have stalls, where no pipeline/execution unit executes anything. In those cases there is no difference in performance or relative IPC no matter how many units/pipelines you have; clearly, if all of them are waiting it does not matter how many there are. Again, this drops absolute IPC.
Now, looking at both, you see immediately why more units are used and are better, but the extra pipelines/units do not add exactly their proportional share to performance. The reality is again even more complex (e.g. scheduler depth), but a discussion at that level would burst the limits of a forum discussion.
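A small toy model of those two effects combined (the ILP and stall numbers below are invented purely for illustration; the point is only how width, latency-limited parallelism and stalls interact):

```python
# Toy model: in busy cycles the core issues min(width, ready instructions),
# in stalled cycles it issues nothing. All input values are made up.
def absolute_ipc(width: int, ready_per_cycle: float, stall_fraction: float) -> float:
    busy_ipc = min(width, ready_per_cycle)
    return (1.0 - stall_fraction) * busy_ipc

# Latency-limited code with frequent stalls: a 4th pipeline adds nothing.
print(absolute_ipc(width=3, ready_per_cycle=1.8, stall_fraction=0.4))  # ~1.08
print(absolute_ipc(width=4, ready_per_cycle=1.8, stall_fraction=0.4))  # ~1.08

# High-ILP code with few stalls: one pipeline less costs roughly its share.
print(absolute_ipc(width=3, ready_per_cycle=6.0, stall_fraction=0.1))  # 2.7
print(absolute_ipc(width=4, ready_per_cycle=6.0, stall_fraction=0.1))  # 3.6
```

Both cases produce an absolute IPC well below the pipeline count, and only in the second case does the extra width pay off, which is why the absolute number alone tells you so little about utilization.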
From that you also see 4 major ways in which you can improve CPU performance:
1.) Make the CPU wider, more Pipelines/execution resources
(e.g. the success of the Core architecture: 2-wide in the PIII, 3-wide in Core, 4-wide in Core2 (plus MacroOp fusion!), and another enhancement in Sandy Bridge, turning the dedicated L/S address generation units into general AGUs)
[Of course this must be backed by highly capable schedulers regarding depth and reorder capability; e.g. this is why the 4-wide Intels have deeper schedulers and features like speculative load/store reordering.]
2.) Reduce number of stalls
Better branch prediction, more/faster/wider caches, prefetchers
3.) Reduce instruction latency
See the Core2 incarnations, which improve latencies with nearly every new generation (esp. div: 1-bit -> 2-bit -> 4-bit divider), and the same with the divider from K7 to K10.5. (A rough sketch of the divider arithmetic follows after this list.)
4.) More clock rate
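To put rough numbers on the divider example from 3.) (purely illustrative: the model assumes the divider retires 1, 2 or 4 quotient bits per cycle and ignores any fixed setup overhead, so these are not exact instruction latencies):

```python
# Rough model of the divide loop: a divider that retires more quotient
# bits per cycle needs proportionally fewer iterations.
def div_loop_cycles(quotient_bits: int, bits_per_cycle: int) -> int:
    return -(-quotient_bits // bits_per_cycle)  # ceiling division

for bits_per_cycle in (1, 2, 4):
    cycles = div_loop_cycles(64, bits_per_cycle)
    print(f"{bits_per_cycle}-bit divider: ~{cycles} cycles for a 64-bit quotient")
```

Each step from a 1-bit to a 2-bit to a 4-bit divider roughly halves the latency of one of the slowest instructions, which is why it shows up so clearly from generation to generation.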
Now, what was done in Bulldozer regarding these items?
1.) The very important execution width was reduced by ~33%.
2.) Regarding stalls there are some improvements regarding prefetchers and branch prediction, but also steps back regarding the caches (size/latency) and a higher misprediction penalty. Those steps back are, however, related to 4.)
3.) Instruction latencies got worse, partly because of 4.)
4.) Clock rate improved greatly
So you have (3) slower cores with (1) less throughput, with fewer stalls [better!] (2), at a much higher clock (4).
Now 1+2+3 add up to a slightly slower core than Stars, but 4 pushes the performance above Stars. Or just see my previously calculated estimations, which put that into some numbers.
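As a toy version of that trade-off (the ratios below are made-up placeholders to show how the factors combine, not the estimations referred to above):

```python
# Hypothetical per-core ratios of Bulldozer vs. Stars; placeholder values
# chosen only to illustrate how per-cycle throughput and clock multiply.
ipc_ratio   = 0.90   # items 1-3: a bit less work per cycle per core
clock_ratio = 1.25   # item 4: much higher clock

perf_ratio = ipc_ratio * clock_ratio
print(f"relative per-core performance: {perf_ratio:.2f}x Stars")  # ~1.12x
```

The per-cycle loss from 1-3 is more than paid back by the clock gain from 4, which is the whole bet of the design.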
Basically Bulldozer has two major design items:
1.) CMT
CMT works great; the problem here is the die size consumption. Taking both together, it is a failure in the way it is currently implemented.
2.) High frequency design
This Netburst-like approach brings many of the cons* we know from Netburst, but it enables a very high frequency. Whether this is a success or not is mainly determined by TDP, as TDP was also the main problem of Intel's Netburst.
* Smaller caches [L1], higher instruction latencies, increased misprediction penalties (increased by more than the frequency bump can compensate for). But the cons are not as bad as in Netburst.
And, same as Netburst, it also has only 2 integer pipelines/execution units.