With the latest information from the optimization guide, my expectations for Bulldozer are back to where they were in the beginning, right after Hot Chips. No, they are even worse.
I thought about the decoders of Bulldozer. Let's look at the module. The decoders can decode 4 ops/cycle. Now AMD claims there are 12 pipelines that want to be fed by that. How can this work? The new and quite strange information in the optimization guide, plus other published information, gives the clue. Each of the 4 decoders decodes an instruction into one or two micro-ops. E.g. add rdx, rax is decoded into one µop, and add rdx, [qword ptr] z is decoded into two µops, one for the memory access and one for the add.
If we eliminate the micro-op intermezzo, the result is that each Bulldozer core can do 2 macro-ops/cycle (the 4 decoders of a module are shared by its two cores), plus the separate vector scheduler.
So we can safely assume that a Bulldozer core is slower than a PII/Llano core. By what margin exactly remains to be determined. To compensate for these slower cores we have a high-frequency design. However, such a design in turn lowers the per-clock performance of the Bulldozer core. On the other hand the cores can be clocked higher, e.g. at 4.5 GHz.
And we got CMT for Bulldozer which gives ~80% speedup.
So let's see what we have with an 8 core Bulldozer compared to a current Phenom II in integer:
BDPerf = PIIPerf * 0.8 // -Reduction in core capability, +Core Improvements
BDPerf = PIIPerf * 0.8 * 1.2 // + Higher clock (4.5 GHz), - cost of high freq. design
BDPerf = PIIPerf * 0.8 * 1.2 * 1.8 // CMT
results in:
BDPerf = PIIPerf * 1.728
which means a Bulldozer is 1.7 to 1.8 times as fast as a Phenom II.
Sounds good so far and able to beat a current Sandy Bridge.
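The chain of factors above is simple enough to check in a few lines. The following is only a sketch of that arithmetic: the three factors are the rough guesses from this post, not measurements, and the names are made up for illustration.

```python
# Rough guesses from the estimate above (not measured values).
CORE_FACTOR = 0.8   # reduced core capability, partly offset by core improvements
CLOCK_FACTOR = 1.2  # higher clock (4.5 GHz) minus the cost of a high-frequency design
CMT_FACTOR = 1.8    # ~80% speedup from CMT

def bd_relative_perf(core=CORE_FACTOR, clock=CLOCK_FACTOR, cmt=CMT_FACTOR):
    """Integer performance of Bulldozer relative to Phenom II (PIIPerf = 1.0)."""
    return core * clock * cmt

print(f"BDPerf = PIIPerf * {bd_relative_perf():.3f}")  # prints: BDPerf = PIIPerf * 1.728
```

Changing any one of the three guesses shows how sensitive the result is, e.g. bd_relative_perf(clock=1.1) for a lower clock gain.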
But I assume that Bulldozer is actually a failure.
Why?
Now let's take the performance in regard to die space utilization.
4 BD modules = 120 mm²
8 MB L3 cache = 60 mm²
Uncore = 100 mm²
~280 mm² in total
So let's see what Llano could do with that die space:
8 Llano cores = 80 mm²
12 MB L3 cache = 90 mm²
Uncore = 100 mm²
~270 mm²
So regarding die space consumption Llano has an advantage.
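For anyone who wants to play with the area budget, here is a small sketch that sums the two floor plans above. The mm² figures are the estimates from this post; the dictionary layout and the function name are just illustrative.

```python
# Estimated die area in mm² per block, taken from the two lists above.
BULLDOZER_8C = {"4 BD modules": 120, "8 MB L3 cache": 60, "uncore": 100}
LLANO_8C = {"8 Llano cores": 80, "12 MB L3 cache": 90, "uncore": 100}

def total_area(blocks):
    """Sum the area of all blocks of a floor plan, in mm²."""
    return sum(blocks.values())

bd_area = total_area(BULLDOZER_8C)
llano_area = total_area(LLANO_8C)
print(f"Bulldozer: ~{bd_area} mm², 8-core Llano: ~{llano_area} mm²")
print(f"Llano saves ~{bd_area - llano_area} mm² and still has 4 MB more L3")
```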
But how is it with performance?
Llano8CPerf = 1.05 * PII4CPerf // better clock due to 32 nm
Llano8CPerf = 1.9 * 1.05 * PII4CPerf // +double core count, -memory bandwidth
final result:
Llano8CPerf = 2 * PII4CPerf
This means a hypothetical 8-core Llano would offer more performance on less die space.
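Putting the two estimates side by side as performance per mm² makes the point more concrete. This is a sketch under the assumption that PIIPerf and PII4CPerf refer to the same Phenom II baseline; all factors and areas are the guesses from this post, and the names are only for illustration.

```python
# All numbers are the rough estimates from this post, relative to a Phenom II = 1.0.
# Assumption: PIIPerf and PII4CPerf denote the same baseline chip.
bd_perf = 0.8 * 1.2 * 1.8   # core factor * clock factor * CMT
llano_perf = 1.9 * 1.05     # doubled core count (minus bandwidth limits) * 32 nm clock gain

bd_area_mm2 = 120 + 60 + 100    # 4 modules + 8 MB L3 + uncore
llano_area_mm2 = 80 + 90 + 100  # 8 cores + 12 MB L3 + uncore

def perf_per_mm2(perf, area_mm2):
    """Relative performance delivered per mm² of die area."""
    return perf / area_mm2

bd_density = perf_per_mm2(bd_perf, bd_area_mm2)
llano_density = perf_per_mm2(llano_perf, llano_area_mm2)

print(f"Bulldozer: {bd_perf:.3f}x on {bd_area_mm2} mm²")
print(f"Llano 8C:  {llano_perf:.3f}x on {llano_area_mm2} mm²")
print(f"Llano perf/mm² advantage: {llano_density / bd_density:.2f}x")
```

With these guesses the 8-core Llano comes out ahead both in absolute performance and per mm², but the margin depends entirely on the assumed factors.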
Bulldozer features FMA, AVX, SSE4.1, SSE4.2 and AES. But Bulldozer's float performance is even worse than that of a Llano part; integer SSE is much slower and float SSE might be the same.
I am very afraid that Bulldozer is a step backwards. It just consumes way too much die space for what it offers. Sure, it is faster by delivering more cores with CMT, but you could have had all this without investing so much R&D into that design.
CMT was a design failure. If you look at core sizes, CMT is a great idea, but it obviously does not work out. SMT would be much better, and it would even work better on an AMD core because the decoder bandwidth is higher.
If you compare die space consumption of Sandy Bridge and Bulldozer then you can only cry. Sandy Bridge is so smart in this regard with even a GPU on it. If you drop that off you get 8C/16T Sandy Bridge on the same die space as a 4M/8C Bulldozer.
The real problem is that with CMT AMD can only provide few execution resources, because otherwise the cores immediately run into decoder stalls. So there is also a problem with fixing the performance. Because of the linearity of a program it would be more costly to further increase decoding power than to get away from CMT.
So what should a Bulldozer Version 2 look like to fix the upcoming problems of AMD?
1.) Consideration: how to fix CMT.
The only advantage I would see in CMT would be the use of one vector unit by two cores. So the options are either to get away from CMT or to extend it to two decoder units and 2 I-caches. But then adding another vector unit would cost little more and would remove all the special handling needed because of CMT.
2.) Implement SMT (costs little/gains a lot)
3.) Fix Integer SSE
4.) Add an ALU unit/pipe to the integer core (costs about nothing, gains like hell)
5.) Reconsider the high-frequency design: is it really worth it?
6.) Get the abnormally high uncore die space consumption fixed.