AMD lengthened the instruction pipeline of the NEW Bulldozer CPU much like Intel did with the Pentium 4.
I largely agree with you, but we have to consider what NetBurst is in comparison to Bulldozer. (or invert that...)
NetBurst had up to 31 stages in its pipeline. Bulldozer is believed to be about 18 stages. BIG difference. Still, compared to the 11/12-stage pipelines we have now, that is roughly 50% more...
Now, however, we have to read into the delays at each stage of those pipelines. Certain material I've read indicates that the per-stage integer FO4 delays were reduced by 20%, helping to negate the pipeline-length costs considerably; FO4 delay per stage is what sets the cycle time, so a 20% reduction lets the clock scale up to cover the extra stages. Not every stage of the pipeline is equal, and the 20% FO4 reduction is said to keep IPC constant.
So, the pipeline length issue may have been largely negated.
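To put rough numbers on that, here's a back-of-the-envelope sketch using the rumored figures above (12 stages for Phenom II, 18 for Bulldozer, 20% lower per-stage delay); every number here is an assumption, not a confirmed spec:

```c
/* Back-of-the-envelope: how a 20% FO4 (per-stage delay) reduction
 * offsets a longer pipeline.  Stage counts and the 20% figure are
 * the rumored numbers from this post, NOT confirmed specs. */
#include <stdio.h>

int main(void)
{
    double k10_stages = 12.0; /* Phenom II integer pipeline (approx.) */
    double bd_stages  = 18.0; /* Bulldozer, believed                  */
    double bd_fo4     = 0.80; /* rumored 20% lower per-stage delay    */

    /* Branch-miss penalty in *wall-clock* terms scales with
     * stages x per-stage delay (normalized to Phenom II = 1.0). */
    double k10_penalty = k10_stages * 1.0;
    double bd_penalty  = bd_stages * bd_fo4;

    printf("Phenom II miss penalty : %.1f (normalized)\n", k10_penalty);
    printf("Bulldozer miss penalty : %.1f (normalized)\n", bd_penalty);
    printf("Net cost of the longer pipe: %+.0f%%\n",
           (bd_penalty / k10_penalty - 1.0) * 100.0);
    return 0;
}
```

Under those assumptions, a 50% longer pipeline ends up costing only about 20% more wall-clock time per branch miss.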
Now, each integer core in Bulldozer can also do 4 uops/cycle, a 33% improvement over Phenom II, which could only do 3 uops/cycle.
From that, we also have to consider the highly aggressive front-end, which should help limit stalls by a good 20% more than Phenom II's and reduce the cost of mispredictions DRAMATICALLY.
Now, we have to look at the cache:
Starting from an improvement of 33%, mind you...
(module view)
Item: 1 thread / 2 threads maxed
Shared L1I: 0% / -2%
Small L1D: -1% / +2%
Write-Through L1D: -2% / -3%
Shared L2: +3% / +5%
  (note: higher latency costs in single-threaded execution,
  but 2MB means the data is most likely there)
Write Coalescing Cache: +1% / +2%
Shared L3: +2% / +3%
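For what it's worth, here's what those (admittedly speculative) deltas of mine add up to on top of the 33% uops/cycle gain, treating them as simply additive, which is itself a rough approximation:

```c
/* Summing the speculative cache deltas listed above on top of the
 * 33% uops/cycle improvement.  All numbers are this post's own
 * guesses, nothing measured; treating % deltas as additive is a
 * rough approximation. */
#include <stdio.h>

int main(void)
{
    /* {1-thread, 2-thread} deltas, in percent, same order as above */
    int d[6][2] = {
        { 0, -2},   /* shared L1I             */
        {-1,  2},   /* small L1D              */
        {-2, -3},   /* write-through L1D      */
        { 3,  5},   /* shared L2              */
        { 1,  2},   /* write coalescing cache */
        { 2,  3},   /* shared L3              */
    };
    int one = 33, two = 33; /* baseline: 33% uops/cycle gain */
    for (int i = 0; i < 6; i++) { one += d[i][0]; two += d[i][1]; }
    printf("rough net gain: 1T %+d%%, 2T %+d%%\n", one, two);
    return 0;
}
```

That lands at roughly +36% single-thread and +40% with both threads maxed, going by my own guesses.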
EVERYTHING about Bulldozer says.. I'm going FASTER.
However, there is ONE **MAJOR** issue I can see with the module design and the power-management features, one we've actually already seen in a past AMD design: the original Phenom.
The problem was that of differing clock speeds per core and Windows scheduling, and it was NEVER resolved. Even today, you can use k10stat to enable per-core CPU clocking and be amused as Windows shuffles a hot thread onto idle, but down-clocked, cores, then watch the system lag in bringing that core's clock back up just before Windows cycles the thread off once again.
Windows does this for a number of good reasons:
1. Load balancing
2. A next-available-core scheduling method is only O(1) in complexity (see the sketch below).
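A toy illustration of point 2 (my sketch, NOT Windows' actual scheduler code): keep the idle cores on a stack, and picking the next one is a single pop.

```c
/* Why next-available-core picking is O(1): idle cores sit on a
 * stack, so picking one is a single pop with no scan over cores.
 * Illustration only, not Windows' actual scheduler code. */
#include <stdio.h>

#define NCORES 4

static int idle_stack[NCORES] = {3, 2, 1, 0}; /* idle core IDs */
static int top = NCORES;                      /* stack depth   */

/* O(1): no scan over cores, no sorting by clock speed. */
int pick_next_core(void)
{
    return (top > 0) ? idle_stack[--top] : -1; /* -1: all busy */
}

void release_core(int core)
{
    idle_stack[top++] = core; /* O(1) as well */
}

int main(void)
{
    int c = pick_next_core();
    printf("scheduled thread on core %d\n", c);
    release_core(c);
    return 0;
}
```

Note what that pop does NOT look at: the core's current clock state. That is exactly where per-core clocking gets burned.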
Module overhead further complicates the matter: a performance policy wants to put one thread per module until every module is loaded, while an energy-saver policy tries to stay on as few modules as possible (see the affinity-mask sketch below).
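To make those two policies concrete, here's a hypothetical sketch for a 4-module/8-core part, expressed as CPU affinity masks; the core numbering (cores 2m and 2m+1 sharing module m) is my own assumption, not a documented layout:

```c
/* Hypothetical illustration of the two scheduling policies for a
 * 4-module / 8-core Bulldozer, expressed as CPU affinity masks.
 * Assumes cores 2m and 2m+1 share module m (my assumption). */
#include <stdio.h>

#define MODULES 4

/* Performance: one thread per module -> first core of each module. */
unsigned spread_mask(void)
{
    unsigned mask = 0;
    for (int m = 0; m < MODULES; m++)
        mask |= 1u << (2 * m);      /* cores 0, 2, 4, 6 */
    return mask;
}

/* Energy saver: pack threads onto as few modules as possible. */
unsigned pack_mask(int nthreads)
{
    unsigned mask = 0;
    for (int t = 0; t < nthreads; t++)
        mask |= 1u << t;            /* cores 0,1 then 2,3 ... */
    return mask;
}

int main(void)
{
    printf("performance mask: 0x%02x\n", spread_mask()); /* 0x55 */
    printf("energy mask (2T): 0x%02x\n", pack_mask(2));  /* 0x03 */
    return 0;
}
```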
Microsoft/AMD, I'm certain, will be releasing a patch/driver for the issue.
Disabling Cool'n'Quiet and overclocking should resolve the issue SOMEWHAT, but not entirely, as module contention is still not being considered (20% overhead is something you MUST consider!). You can also pin hot threads yourself, as in the sketch below.
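In the meantime, a minimal pinning sketch using the real Win32 call SetThreadAffinityMask (the choice of core 0 here is arbitrary):

```c
/* User-side workaround (sketch): pin a hot thread to one core so
 * the scheduler can't shuffle it onto a down-clocked one.
 * SetThreadAffinityMask is the real Win32 call; pinning to core 0
 * is an arbitrary choice for illustration. */
#include <windows.h>
#include <stdio.h>

int main(void)
{
    /* Bit 0 set -> this thread may only run on logical core 0. */
    DWORD_PTR old = SetThreadAffinityMask(GetCurrentThread(), 0x1);
    if (old == 0) {
        printf("SetThreadAffinityMask failed: %lu\n", GetLastError());
        return 1;
    }
    printf("pinned to core 0 (previous mask: 0x%llx)\n",
           (unsigned long long)old);
    /* ... run the latency-sensitive work here ... */
    return 0;
}
```

Pinned like that, the scheduler can't shuffle the thread onto a core that's still spinning its clock back up.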
That overhead, BTW, comes mostly from the front-end - something I'm sure will be improved upon with Piledriver.
Now, one final note:
The shared L1I design and the small (mostly inclusive) L1D indicate to me a desire to use BOTH integer cores to execute a single thread. If not now, then certainly in the future. There is NO other reason I can see for this L1 design except that... which doesn't mean there isn't another reason...
Just imagine: with one complete module put to one thread, you have eight integer pipelines, 4x ALUs, 4x AGUs, 32KB L1D, 64KB L1I, and a 256-bit FPU with FMACs capable of doing 8 ops concurrently!! Wait, suddenly we see a symmetry: 8 integer uops and 8 float uops per cycle. Halving the L1D from Phenom II and making it (mostly) inclusive means you can quickly spin up an extra thread when called upon to do so.
Damn, AMD, do you see in your hardware what I see?? I'm betting (and hoping) so!! Good times ahead, indeed!
--The loon