At some point the manual states a 5-cycle latency for FMA, FMUL and FADD (128-bit each). This could be the back-to-back latency when there is a dependent op, which would mean an increase of 1 cycle.
Next, mem ops get handled more efficiently. By the looks of it, the L/S unit manages loads independently of other ops and out of order. Accordingly, the manual states that memory access latencies may often be much less than the 4 cycles.
Furthermore, the SMT FPU means that even if there is not enough ILP in one thread's FP code to issue its instructions in a pipelined fashion (so latency wouldn't matter that often), there can still be FP code from the other thread to fill the empty issue slots -> converging towards maximum throughput.
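To illustrate that slot-filling effect with a toy model (the 5-cycle FMA latency is the figure quoted above; the single issue slot per cycle and the per-thread ILP values are simplifying assumptions, not real Bulldozer numbers):

```python
# Very simplified steady-state model of one shared FMA pipe: it can issue
# 1 op per cycle, every op has a 5-cycle latency, and each thread exposes
# a given number of independent dependency chains (made-up ILP values).
FMA_LATENCY = 5

def pipe_utilization(chains_per_thread: int, threads: int) -> float:
    # Each independent chain can supply one new op every FMA_LATENCY cycles,
    # and the pipe can accept at most 1 op per cycle.
    ready_per_cycle = threads * chains_per_thread / FMA_LATENCY
    return min(1.0, ready_per_cycle)

print(pipe_utilization(chains_per_thread=3, threads=1))  # 0.6 -> latency-bound
print(pipe_utilization(chains_per_thread=3, threads=2))  # 1.0 -> max throughput
```

With one thread the pipe sits idle waiting for dependent results; the second thread's independent ops fill those idle slots, which is exactly the converging-to-max-throughput argument.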
Regarding the faster L/S handling, that comes from the prefetcher, right? But the L1 latency increased by 1 cycle. And I also mean the several instructions which have increased latency.
Regarding FP I already said that a shared unit is more than two halves, for exactly the reason you stated. But regarding FP there is half the throughput; regarding SSE FP there is the same throughput, which means they can do more because of the sharing of units. But there it is also a bit different in Stars, because Stars has two pipelines as well (though somewhat restricted).
Aren't these two paragraphs in conflict with each other? You claim Bulldozer can do 4 MacroOps with the decoders, and in the second paragraph you criticize JFAMD for saying 4 MicroOps instead. If you know it's MacroOps and not MicroOps, what is the point of criticizing him? Or are you confused?
You have to remember that in a Bulldozer module there is only one decoder but two integer cores. That means you have a decoding bandwidth of 2 MacroOps per cycle per core.
These 2 MacroOps per core can be decoded into 2-4 MicroOps per core. These 2-4 MicroOps can be executed by the 4 MicroOp pipelines. That means these 4 pipelines and execution resources can do 2 MacroOps per cycle.
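A back-of-the-envelope sketch of that arithmetic (the even split of the shared decoder between the two cores and the 1-2 MicroOps per MacroOp are just the assumptions from above, nothing more):

```python
# Shared front end of one Bulldozer module, split evenly between its two
# integer cores (simplifying assumption); each MacroOp cracks into 1-2 MicroOps.
DECODE_MOPS_PER_MODULE = 4
CORES_PER_MODULE = 2

mops_per_core = DECODE_MOPS_PER_MODULE // CORES_PER_MODULE   # 2 MacroOps/cycle/core
uops_per_core = (mops_per_core * 1, mops_per_core * 2)       # 2-4 MicroOps/cycle/core

print(mops_per_core)   # 2 -> what one core's 4 pipelines actually get fed
print(uops_per_core)   # (2, 4)
```

So no matter that a core has 4 MicroOp pipelines, with both cores active the shared decoder caps each of them at 2 MacroOps per cycle on average.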
This... the problem with this statement is the word seldom... reverse it and you are getting closer to the truth... the OoO window is limited, the pipes have specific functions, code is execution-latency dependent... It is rarer to have all units at work than it is for a div stall to occur...
The point to understand is that in most cases you have either all 3 units at work or all 3 units stalling.
Whether or not it is useless, there are also other combinations that will result in 1 MacroOp becoming more than 1 MicroOp.
That is right, for mov it is important; I just wanted to point out that the LOAD-EXECUTE ops are unimportant.
But the Stars core can do this as well, because it handles MacroOps and has a corresponding AGU for each ALU. The Stars core executes each of those instructions 1 cycle faster than Bulldozer.
False. BD can decode up to 4 MOps and schedule 4 uOps on the FPU, ergo the maximum is 4.
BD can decode 4 MOps and schedule 4 uOps on the integer side.
BD can also do op fusion (which I didn't take into consideration).
Now, what are we talking about? About the module or the core? I talked about the core, you are talking about the module. I am talking about integer performance, and now you mix in FP.
Exactly, some bells should have rung with this one... If not: reread your own posts that try to use maximum IPC values in unrealistic scenarios for one operation as a measure of performance, or as a way to deduce performance in average applications. Fact is, no application reaches the absolute maximum IPC possible... ergo, maximum IPC is rather irrelevant for determining performance when the front end is completely different.
You have to understand the difference between IPC, which is instructions per cycle, and the execution capability of the core. First, many instructions are not finished within one cycle, so they occupy a pipeline/execution unit for more than one cycle. That means the absolute IPC for a certain program drops well below the number of pipelines/execution units, but hell, that doesn't mean they are doing nothing! But it is also clear that one unit less will drop this absolute IPC again, by exactly the relative reduction in pipeline/execution width. And heck, it does not matter whether the absolute IPC for a given application is 0.08 or 1.94. Again, the absolute number just does not tell you anything about pipeline/unit utilization. If a pipeline has to do its work over several cycles it is still used and useful.
Second, you have stalls, where no pipeline/execution unit executes anything. In those cases there is no difference in performance or relative IPC no matter how many units/pipelines you have; clearly, if all of them are waiting it does not matter how many there are. Again, this drops absolute IPC.
Now, looking at both, you see immediately why more units are used and are better, but the extra pipelines/units do not add exactly their proportional share to performance. The reality is again even more complex (e.g. scheduler depth), but a discussion at that level would burst the limits of a forum discussion.
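A small toy model of those two effects combined (the ILP and stall numbers below are invented purely for illustration; the point is only how width, latency-limited parallelism and stalls interact):

```python
# Toy model: in busy cycles the core issues min(width, ready instructions),
# in stalled cycles it issues nothing. All input values are made up.
def absolute_ipc(width: int, ready_per_cycle: float, stall_fraction: float) -> float:
    busy_ipc = min(width, ready_per_cycle)
    return (1.0 - stall_fraction) * busy_ipc

# Latency-limited code with frequent stalls: a 4th pipeline adds nothing.
print(absolute_ipc(width=3, ready_per_cycle=1.8, stall_fraction=0.4))  # ~1.08
print(absolute_ipc(width=4, ready_per_cycle=1.8, stall_fraction=0.4))  # ~1.08

# High-ILP code with few stalls: one pipeline less costs roughly its share.
print(absolute_ipc(width=3, ready_per_cycle=6.0, stall_fraction=0.1))  # 2.7
print(absolute_ipc(width=4, ready_per_cycle=6.0, stall_fraction=0.1))  # 3.6
```

Both cases produce an absolute IPC well below the pipeline count, and only in the second case does the extra width pay off, which is why the absolute number alone tells you so little about utilization.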
From that you also see 4 major ways in which you can improve CPU performance:
1.) Make the CPU wider, more Pipelines/execution resources
(e.g. the success of the Core architecture: 2-wide in the PIII, 3-wide in Core, 4-wide in Core2 (plus MacroOp fusion!), and another enhancement in Sandy Bridge, turning the dedicated L/S address generation units into general AGUs)
[Of course this must be backed by highly capable schedulers regarding depth and reorder capability; e.g. this is why the 4-wide Intels have deeper schedulers and features like speculative load/store reordering.]
2.) Reduce number of stalls
Better branch prediction, more/faster/wider caches, prefetchers
3.) Reduce instruction latency
See the Core2 incarnations, which improve latencies with nearly every new generation (esp. div: 1-bit -> 2-bit -> 4-bit divider), and the same with the divider from K7 to K10.5. (A rough sketch of the divider arithmetic follows after this list.)
4.) More clock rate
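To put rough numbers on the divider example from 3.) (purely illustrative: the model assumes the divider retires 1, 2 or 4 quotient bits per cycle and ignores any fixed setup overhead, so these are not exact instruction latencies):

```python
# Rough model of the divide loop: a divider that retires more quotient
# bits per cycle needs proportionally fewer iterations.
def div_loop_cycles(quotient_bits: int, bits_per_cycle: int) -> int:
    return -(-quotient_bits // bits_per_cycle)  # ceiling division

for bits_per_cycle in (1, 2, 4):
    cycles = div_loop_cycles(64, bits_per_cycle)
    print(f"{bits_per_cycle}-bit divider: ~{cycles} cycles for a 64-bit quotient")
```

Each step from a 1-bit to a 2-bit to a 4-bit divider roughly halves the latency of one of the slowest instructions, which is why it shows up so clearly from generation to generation.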
Now, what was done in Bulldozer regarding these items?
1.) The very important execution width was reduced by ~33%.
2.) Regarding stalls there are some improvements regarding prefetchers and branch prediction, but also steps back regarding the caches (size/latency) and a higher misprediction penalty. Those steps back are, however, related to 4.)
3.) Instruction latencies got worse, partly because of 4.)
4.) Clock rate improved greatly
So you have (3) slower cores with (1) less throughput, with fewer stalls [better!] (2), at a much higher clock (4).
Now 1+2+3 add up to a slightly slower core than Stars, but 4 pushes the performance above Stars. Or just see my previously calculated estimations, which put that into some numbers.
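As a toy version of that trade-off (the ratios below are made-up placeholders to show how the factors combine, not the estimations referred to above):

```python
# Hypothetical per-core ratios of Bulldozer vs. Stars; placeholder values
# chosen only to illustrate how per-cycle throughput and clock multiply.
ipc_ratio   = 0.90   # items 1-3: a bit less work per cycle per core
clock_ratio = 1.25   # item 4: much higher clock

perf_ratio = ipc_ratio * clock_ratio
print(f"relative per-core performance: {perf_ratio:.2f}x Stars")  # ~1.12x
```

The per-cycle loss from 1-3 is more than paid back by the clock gain from 4, which is the whole bet of the design.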
Basically Bulldozer has two major design items:
1.) CMT
CMT works great; the problem here is the die size consumption. Taking both together, it is a failure in the way it is currently implemented.
2.) High frequency design
This Netburst-like approach brings many of the cons* we know from Netburst, but it enables a very high frequency. Whether this is a success or not is mainly determined by TDP, as TDP was also the main problem of Intel's Netburst.
* Smaller caches [L1], higher instruction latencies, increased misprediction penalties (increased by more than the frequency bump can compensate for). But the cons are not as bad as in Netburst.
And, same as Netburst, it also has only 2 integer pipelines/execution units.