I've been wondering about that. The math simply doesn't add up well for BD.
For argument's sake, let's assume that a 12-core Magny Cours does 100 'X' ops at 1 GHz. 100/12 = 8.333 X ops per core at 1 GHz.
Then we assume that a 16-core Interlagos does 150 'X' ops at the same speed. 150/16 = 9.375 X ops per core at 1 GHz.
So right there we have a problem.
9.375/8.333 = 1.125 = 112.5%. Which means that clock for clock and core for core, Bulldozer is on average about 12% faster than K10.5/Stars.
Now I hope I'm wrong on this, because a 12% performance boost won't be enough to sway anyone to buy BD over Sandy or even the current K10 chips.
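The per-core arithmetic in the quoted post can be checked with a few lines of Python. This is only a sketch of that calculation: the 100 and 150 'X' ops figures are the hypothetical numbers from the post, and the function name is made up for illustration.

```python
def per_core_throughput(total_ops, cores):
    """Return ops per core, assuming all cores run at the same clock."""
    return total_ops / cores

# Hypothetical figures from the post: both chips assumed to run at 1 GHz.
magny_cours = per_core_throughput(100, 12)   # 12-core Magny Cours
interlagos = per_core_throughput(150, 16)    # 16-core Interlagos

# Per-core advantage of Interlagos if the clocks really were equal.
ratio = interlagos / magny_cours

print(f"Magny Cours: {magny_cours:.3f} X ops per core")
print(f"Interlagos:  {interlagos:.3f} X ops per core")
print(f"Per-core ratio: {ratio:.3f}")
```

Note that the result only means "clock for clock" if both chips really run at the same speed, which is exactly the assumption questioned in the reply.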
The problem here, and the reason you get to a 12% performance boost, is that the Bulldozer parts will be clocked higher than Magny Cours.
So core for core and clock for clock, BD is slower than Magny Cours. However, because Bulldozer will have a significantly higher clock than Magny Cours, it will be able to be 12% faster.
This is an AMD statement as well.
You have the ISSCC statement, and when you combine it with the latest information from AMD and the architecture information, there is an issue.
I will try again to explain exactly what the problem is. Basically it is a marketing bubble:
If you look at the picture, everything basically looks nice. The question that comes up, however, is how a single decoder is supposed to feed all those hungry pipelines.
And here AMD marketing comes into play. They draw 4 nice pipelines, which looks like more than the 3 pipelines of Magny Cours. Great marketing; they just don't tell you that the 4 pipelines only carry micro ops, and that two micro ops make up a macro op. That leaves only 2 pipelines. Now look at this picture again and compare it with the integer pipelines of Magny Cours. Then you know what the issue is. Instead of the 6 integer pipelines and 2 128-bit FPUs of two Magny Cours cores, in a Bulldozer module you have only 4 integer pipelines and 2 128-bit FPUs.
And now comes the part that I say is obvious from the decoders, so you could have seen this even earlier: from the optimization guide it turned out that these 4 pipelines are not what they appeared to be. In Magny Cours the decoders can do 3 MacroOPS/cycle. That is enough to feed all 3 integer pipes; the two decoders of two cores can feed 6 integer pipelines, everything is fine. In Bulldozer you have an enhanced decoder which can do 4 MacroOPS/cycle. This is enough to feed 2 pipelines on each of the two integer cores, so 4 pipelines in a module.
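To make the decoder argument concrete, here is a rough sketch of the per-core decode budget. It uses only the figures claimed above (3 MacroOPS/cycle per Stars core, 4 MacroOPS/cycle shared by the two integer cores of a Bulldozer module); the class and field names are hypothetical and this is a simplification, not a model of the real front end.

```python
from dataclasses import dataclass

@dataclass
class FrontEnd:
    name: str
    macro_ops_per_cycle: int     # decoder width as claimed in the post
    cores_sharing_decoder: int   # integer cores fed by that one decoder

    def macro_ops_per_core(self) -> float:
        """Decode bandwidth available to each integer core per cycle."""
        return self.macro_ops_per_cycle / self.cores_sharing_decoder

# Figures as stated in the post, not measured values.
stars = FrontEnd("Magny Cours (Stars) core", 3, 1)
bulldozer = FrontEnd("Bulldozer module", 4, 2)

for fe in (stars, bulldozer):
    print(f"{fe.name}: {fe.macro_ops_per_core():.1f} MacroOPS/cycle per core")

# Decode budget of a Bulldozer core relative to a Stars core.
relative = bulldozer.macro_ops_per_core() / stars.macro_ops_per_core()
print(f"Bulldozer core vs. Stars core: {relative:.2f}")
```

Under these assumptions each Bulldozer core gets 2 MacroOPS/cycle against 3 for a Stars core, which is the simplified "2 vs. 3" comparison.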
You also see that the issue is mostly integer related, since for FP/SSE it looks okay or even good, however you want to put it.
I fully agree that IPC is not the point. A Bulldozer core will be faster than a Stars core. But only because it is clocked MUCH higher.
You criticize my 0.8 per core statement? I will tell you something:
Interlagos is 50% faster than Magny Cours.
If you strip off the 33% more cores, you are at 1.12 times the performance per core. Now factor in the 30% higher clock of the BD design (22 FO4 vs. 17 FO4):
1.12 / 1.3 = 0.86
Okay, that is 0.06 more than I claimed; however, it might shrink to 0.8 if you consider another clock bump from 32 nm!
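The same estimate as a small Python sketch. The inputs are the figures used above (50% total speedup, 16 vs. 12 cores, about 30% higher clock). Treating the FO4 ratio 22/17 as a direct clock ratio is the assumption made in this post, not a measured value, and the function name is made up for illustration.

```python
def per_core_per_clock(total_speedup, core_ratio, clock_ratio):
    """Strip the core-count and clock advantages out of a total speedup."""
    return total_speedup / core_ratio / clock_ratio

total_speedup = 1.5        # Interlagos vs. Magny Cours
core_ratio = 16 / 12       # 33% more cores

# Rounded 30% clock advantage, as used in the calculation above.
rounded = per_core_per_clock(total_speedup, core_ratio, 1.3)

# Same estimate with the unrounded FO4 ratio (assumes clock scales with 1/FO4).
fo4_based = per_core_per_clock(total_speedup, core_ratio, 22 / 17)

print(f"Per core, per clock (30% clock):   {rounded:.3f}")
print(f"Per core, per clock (22/17 clock): {fo4_based:.3f}")
```

Depending on where you round, this lands at roughly 0.86 to 0.87 per core and per clock, so the conclusion does not change.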
All of this comes directly from AMD: AMD performance statements, AMD engineer statements, AMD presentations and AMD documents. And if you look at the design, it is totally clear why it is like that.
Now, coming back to explain why JFAMD is right as well:
He says "IPC is higher" and he is right, because he is talking about micro-op IPC. A BD core does 4 micro ops per cycle, which by sheer number is more than 3 MacroOPS per cycle. 4 is more than 3. However, if you do not compare apples and oranges, you get a lower IPC: 2 vs. 3 (simplified). That is the difference between a statement from an engineer and one from a marketing guy. The marketing guy just does not specify what is meant by "instruction". And since Stars has no micro ops, he is even right and no liar, though the statement is completely useless (yeah, a marketing statement).
On the other hand, yes, AMD Bulldozer has 8 cores and that will be enough to surpass the current Sandy Bridge. The problem is not this year. The problem will come next year, when Intel releases an 8-core Sandy Bridge. What makes this so much worse is that an 8-core / 16-thread Sandy Bridge will take up around the same die space as the 4-module / 8-core Bulldozer. And that is on 32 nm, and Intel's 22 nm is coming next year as well.
AMD has a brand new design and a brand new process, and both together are not sufficient to stay competitive. That means that one year from now AMD will be in exactly the same position as it is now, except that they will have shot their wad (new process, new design).
And heck, I am not telling anyone to buy or not buy something. I am just doing a technical analysis to get an educated guess at what to expect from the brand new Bulldozer architecture.
And yes, I am disappointed with what they achieved.