
Rumour: Bulldozer 50% Faster than Core i7 and Phenom II.

Status
Not open for further replies.
Since it's all over the net, but not here...

Attached images:

2jcv90z.jpg

9znd53.jpg

2j5nbc.jpg

471f9a3c28948b196905361.jpg

cliffnotes:

If true: up to 3.2 GHz base clock, with a max 4.1 GHz Turbo boost. The 3220 number is a bit of an oddity. Should it be 4.2 GHz? 😱

The CPU-Z shot shows... well, nothing really, apart from the fact that it can't determine the core count, but it's a start of some leaks at least.
 
You know, I wonder what conditions are needed to see this 4.1 GHz clock. If it only happens when one core is active, well, that's kinda meh. If it can do 4.1 with 4 cores active, that's nice and what I would call reasonable. I hope it doesn't take Bulldozer running at 4.1 to match a SB running at 3.4, though. That's just, well, kinda... off-putting. I'd hope it'd at least be able to edge it out a little. Yes, I know this would be benchmark-dependent; just talking in general.
 
If it can do 4.1 on one core, then it can do 4.1 on 4 cores (one per module) at the very least, assuming you have this level of granularity in the overclocking options.

The problem is going to be cooling and power at this point, and that will come down to your motherboard and heatsink.
 
Granularity should be at the module level, so it should be two cores on one module if 4.1 GHz really is the turbo. I don't think there is any granularity at the core level, because cores share resources, so you can't shut down much at the core level.
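A toy model may help frame the turbo question being debated here. Assuming (purely hypothetically) that each idle module frees power headroom the active modules can convert into clock speed, the leaked 3.2/4.1 GHz figures would fall out like this; the 0.3 GHz-per-idle-module step is invented for illustration:

```python
def turbo_clock_ghz(active_modules, total_modules=4,
                    base=3.2, max_turbo=4.1, step_per_idle=0.3):
    """Hypothetical module-level turbo: each idle module's power
    headroom buys `step_per_idle` GHz for the active ones, capped
    at the advertised max turbo. Base/turbo come from the leaked
    slides; the step size is invented."""
    idle = total_modules - active_modules
    return round(min(max_turbo, base + step_per_idle * idle), 2)

# One active module reaches the full turbo; all four stay at base.
print(turbo_clock_ghz(1))  # 4.1
print(turbo_clock_ghz(4))  # 3.2
```

Under this sketch, two active modules would land somewhere between base and max turbo, which is exactly the granularity question raised above.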
 
This assumption is not true, yet all your twisted logic is built upon this erroneous point...

BD, as stated by AMD, is four issues wide for each integer core.

How they manage to do it is still unknown, but as I already posted, the optimization manual says explicitly that the AGLUs perform not only address generation but also logical and arithmetic operations...
Yes, but only for address calculation, e.g. in CALL and LEA. And yes, that helps for LEA and CALL, as can be seen in the latency tables.

The BD architecture is 2-wide: 2 ALUs, 2 x86 ops/cycle. That is the information we have so far, and it comes from an official AMD document. As long as there is no revision to that document, those are the facts. Moreover, as I keep repeating: since the decoders cannot decode more than 4 instructions for 2 cores, it would make no sense to be able to process more than 2 instructions per core. Two Llano or Sandy Bridge cores, however, can decode much more than that, and Sandy Bridge additionally has a loop trace cache of already-decoded instructions.

I just found a lengthy article about Bulldozer from a CPU expert who comes to the exact same conclusions as I did.

http://www.h-online.com/newsticker/...rs-About-latencies-and-compilers-1232290.html

Conclusion from there:
So, to prevent Bulldozer from lagging behind, clock speed and Turbo Core will have to be cranked up considerably. As for the much-discussed topic of IPC (instructions per clock), Bulldozer will probably not be able to compete with its predecessor in spite of some architectural improvements, especially because each pair of cores has to share the frontend with the decoders while also using the same FPU.

Just read the article of the link.
 
Let me quote the Bulldozer SW optimization manual, which I had for a couple of months (NDA version) before it went public. So I've already seen some interesting things. First some code (now on p. 136)
Code:
add_loop:
   vmovsd xmm0, QWORD PTR [rax] ; Load double pointed to by RAX
   vaddsd xmm0, QWORD PTR [rbx] ; Add double pointed to by RBX
   vmovsd QWORD PTR [rax], xmm0 ; Store double result.
   add rax, 8            ; Point to next element of array a.
   add rbx, 8            ; Point to next element of array b
   dec rcx               ; Decrement counter.
   jnz add_loop          ; If elements remain, then jump.
Then the cycle counting:
Code:
cyc #instrs instrs                    note
 1   3       movsd, addsd, movsd      Only one load/store pair per dispatch
 2   4       add, add, dec, jnz       max of 4cyc
They finally count it as 7 instructions in 2 cycles. There is just no specific information on the type of counted cycles. OTOH this code would work since FPU and integer cores could issue in parallel. The new rax, rbx, rcx values could be calculated in the same cycle as the FPU instructions, since they would use renamed internal registers (OOO).
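As a sanity check on that count, a dispatch-width bound alone already explains the 7-instructions-in-2-cycles figure: with 4-wide dispatch, seven macro-ops can never retire in fewer than two cycles, regardless of how the load/store pairing works out. A minimal sketch, deliberately ignoring latencies and dependencies:

```python
import math

def min_cycles(n_macro_ops, dispatch_width=4):
    """Lower bound on cycles per loop iteration imposed by dispatch
    width alone; real latency and load/store constraints can only
    raise this number."""
    return math.ceil(n_macro_ops / dispatch_width)

# The 7-instruction add_loop body from the optimization manual:
print(min_cycles(7))      # 2 cycles
print(7 / min_cycles(7))  # 3.5 macro-ops/cycle best case
```

So the manual's counting is at least consistent with a 4-wide dispatch; whether the latencies and the load/store limit let real code reach that bound is the open question.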

Further according to some AMD papers:
The execution unit supports single-cycle operand bypass from an instruction to
a dependent instruction. Two ALU ops and two AGU ops can be executed in a
cycle. AGU ops include increment/decrement (INC), address generate, and x86-
64 LEA instructions.
http://www.planet3dnow.de/vbulletin/showthread.php?p=4416527#post4416527

I just found a lengthy article about Bulldozer from a CPU expert who comes to the exact same conclusions as I did.

http://www.h-online.com/newsticker/...rs-About-latencies-and-compilers-1232290.html
[...]
Just read the article of the link.
I've got the original paper mag containing this column on Saturday. It's actually the rumour column ("whispers"). Articles like the one about Llano contain much more detail and are located in the articles section of the c't mag.
 
I mean here we see another problem of Bulldozers CMT approach.
Okay 3 AVX instructions issued in two cycles, 4 integer instructions in 2 cycles. Appears to be possible. But in detail it is very strange. All three AVX instructions have strong dependency. So they not even cannot be issued in subsequent cycles, you have in addition to wait the full latency of all three instructions. Of course the scheduler can register rename and pull the vmovsd much in advance, so this sequence is fine for the scheduler, but that helps only if you get your scheduler filled with a lot of other stuff. A main issue and why latency is important even if you have great schedulers. Since Core2 Intel is just fantastic on latency.

Now integer. There is also strong dependancy. So no chance to do that in one cycle. Even Intel could not do it because of the dependancy. [Update] Yes Intel can do that because of MacroOp fusion and integer dependency is much less, there is only the status flag dependency (dec/jnz).

I am very afraid that they mean that the given cycles 1/2 means that in addition to the latency you have to wait an additional wait cycle to let the previous result propagate! As you might know, AMD achieved the small FO4 count especially by dropping the ability to propagate results at 0 cycle cost. An AMD engineer working on Bulldozer told this but I expected this in a different way. But now if you need an additional cycle for this propagation because of the high speed design this would mean, that performance is getting even much worse than I have expected so far.
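To see how much such an extra propagation cycle would hurt, consider a serial dependency chain: each dependent op waits the producer's full latency, plus (in the feared case) one extra cycle. A hedged back-of-the-envelope model, using the 3-op AVX chain from the example and an assumed, purely illustrative per-op latency:

```python
def chain_cycles(n_ops, latency, extra_bypass_cycles=0):
    """Cycles to complete `n_ops` serially dependent instructions
    when each must wait the producer's full `latency` plus any
    extra result-propagation cycles (the feared case)."""
    return n_ops * (latency + extra_bypass_cycles)

# 3 dependent ops at an assumed 5-cycle latency:
fast = chain_cycles(3, 5)     # 15 cycles with 0-cycle bypass
slow = chain_cycles(3, 5, 1)  # 18 cycles with a +1 wait cycle
print(fast, slow)
```

The shorter the base latency, the larger the relative cost of that one extra cycle, which is why it would matter most exactly in the tight integer chains.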

I've got the original paper mag containing this column on Saturday. It's actually the rumour column ("whispers"). Articles like the one about Llano contain much more detail and are located in the articles section of the c't mag.
Okay, this is new information about increment/decrement. But that is really not much. And again, how do you feed that through the limited decoders? Maybe the inc/dec capability is related to the pair:

dec ryx
jnz [address]

so that it can be fused. But I don't remember AMD saying anything about macro-op fusion. That would also be very difficult, especially in a high-speed design like Bulldozer.
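If the dec/jnz pair really were fused, the win would be one fewer macro-op through the narrow decoders. A sketch of the counting, treating a fusible adjacent pair as a single slot (whether Bulldozer actually fuses anything is exactly what is in question here):

```python
def macro_op_slots(instrs, fuse_dec_jnz=False):
    """Count decoder slots consumed by a list of mnemonics,
    optionally fusing an adjacent dec+jnz pair into one slot
    (hypothetical for Bulldozer)."""
    slots, i = 0, 0
    while i < len(instrs):
        if (fuse_dec_jnz and i + 1 < len(instrs)
                and instrs[i] == "dec" and instrs[i + 1] == "jnz"):
            i += 2  # fused pair occupies a single slot
        else:
            i += 1
        slots += 1
    return slots

loop = ["vmovsd", "vaddsd", "vmovsd", "add", "add", "dec", "jnz"]
print(macro_op_slots(loop))                     # 7 slots unfused
print(macro_op_slots(loop, fuse_dec_jnz=True))  # 6 slots fused
```

One slot saved per iteration is small in absolute terms, but on a 2-wide-per-core decode it is a full half-cycle of decode bandwidth.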

And to put more bad news on Bulldozer:
http://www.chiphell.com/thread-190177-1-1.html

I have absolutely no idea whether this is fake or not, but it would somehow fit the information we have (remember: slower cores with less throughput, but at double the core count and higher frequency).

Now, as this comparison is at the same frequency (which is stupid for Bulldozer, of course), it would fit the capabilities of the processors once you take away Bulldozer's high-frequency advantage.

However, I hope it is fake and Bulldozer performs somewhat better; in particular the 8-core Cinebench result is too low in my opinion (fdiv?).

Regarding SuperPI, I have to ask how it is compiled such that it always performs so terribly on AMD CPUs. I would throw SuperPI away as an indicator of CPU performance anyway.

Again I say: from an architectural standpoint, AMD failed with Bulldozer because of the heavy mismatch between gain and die space.

If it really turns out to be true that the removed register-result propagation ability means dependent instructions need an additional wait cycle, then Bulldozer's performance is irreparably doomed. I still hope that I am just misunderstanding this, because the performance outlook for Bulldozer is not great even without such an additional catastrophic issue.

If I have time I will reread that part of the manual. Anyway, I am not happy with the poor quality of the public version of this manual. When I read it the first time I was very happy, because I understood the AGLU as macro-op ALU functionality (for mov/add/sub), which would have been great, until another user here pointed me to the Appendix information.
 
Can anyone tell me for sure whether one thread can use both 128-bit FMACs per cycle per module?

I was under the impression that only one 128-bit FMAC could be used per core per cycle.
 
I'm trying to think of real code that would do this... two FMACs in a row without anything between them?
 
Dude, seriously, what is your deal? I remember when you first started posting in this thread: you were praising BD for its design, saying how great it was and how AMD actually stood a chance. I was actually looking up to you; you were pointing out its strong points when others said you were wrong, and bla bla bla...
Now you're the complete opposite. Others are trying to praise BD and you're ripping them a new one... seriously, WTF?! LOL
I am very confused.
 
I noticed the exact same thing. I just chalked it up to a game of reverse psychology.
 
I won't fault anybody for changing their mind due to new information. I think HW2050Plus is wrong about BD's IPC, but I think he has explained his position pretty well. I think AMD hasn't revealed all of their cards with Bulldozer yet. Since we actually don't have any proof of that (besides we want it to be true) I can't really find much fault in HW2050Plus' arguments.


I just think it is unlikely that AMD would make such a huge mistake. As others have said, an X8 shrink of Stars would be much easier, cheaper, and faster than a Bulldozer with lower IPC*.


*If Bulldozer is capable of clocking to truly stupid-high levels (7 GHz+), this argument falls apart. But I think overclockers will top out in the mid-5 GHz range.
 
Dude, seriously, what is your deal? I remember when you first started posting in this thread: you were praising BD for its design, saying how great it was and how AMD actually stood a chance. I was actually looking up to you; you were pointing out its strong points when others said you were wrong, and bla bla bla...
Now you're the complete opposite. Others are trying to praise BD and you're ripping them a new one... seriously, WTF?! LOL
I am very confused.
You are absolutely correct about this. What can I say?

First, I praised the design because if you can get an 80% advantage for 5-10% of added die space, that is just so much greater than getting only 30% from Intel's SMT. Also, IBM research identified 17 FO4 as the optimum-performing CPU layout. So in both major design decisions AMD apparently did exactly the right thing.
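The argument here is really about throughput gain per unit of added die area. Using the figures as stated in this post (CMT: ~80% gain for ~5-10% more area; SMT: ~30% gain), and an assumed ~5% area cost for SMT, which vendors quote only as "small":

```python
def gain_per_area(throughput_gain, area_cost):
    """Second-thread throughput gain per fraction of added die area."""
    return throughput_gain / area_cost

cmt = gain_per_area(0.80, 0.075)  # midpoint of the 5-10% CMT area figure
smt = gain_per_area(0.30, 0.05)   # assumed ~5% area for SMT
print(f"CMT {cmt:.1f}x vs SMT {smt:.1f}x gain per unit of added area")
```

On these (partly assumed) numbers CMT wins clearly, which is why the design looked so attractive on paper; the later complaint in this same post is that the *total* die space turned out much larger than the 5-10% increment suggests.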

What happens now is that more and more details emerge. First the heavy die-space consumption. Then the fact that Intel's Sandy Bridge is so small that Intel will bring an 8-core variant this year. And the last one was the really bad news from the optimization manual (latencies, etc.).

Maybe the design decisions weren't that bad, but rather how AMD executed them. And yes, I (and likely AMD too) overlooked the big issue of x86 instruction decoding.

The core and main problem is die size. Bulldozer in <200 mm² and everything would be okay. If Llano or Sandy Bridge cores simply give better performance per die area than Bulldozer, then something is wrong with Bulldozer.

I already wrote that AMD should fix these issues. However, the decoding limit will make this difficult, so really fixing it will basically mean a new processor design...

Not everything is bad. Let's assume my concern about the one extra wait cycle in dependency chains is wrong. Then you get a CPU which will be faster than current 4-core Sandy Bridge offerings. This statement is unchanged since the very beginning of my posts here.

But when Intel comes out with its 8-core Sandy Bridge, that will just wipe the floor with Bulldozer.

At the beginning I thought that would be no problem, because (if you reread my posts) I said AMD would then come out with a 16-core BD. But at ~600 mm² of die space that will just not happen. I assumed they could do it in less than 400 mm².

To sum it up: CMT is much more effective than SMT. But if it consumes so much more die space than SMT, it is just the wrong approach. Break CMT up (add another decoder unit), add SMT, widen the execution width significantly, go back to macro-op processing, give up the high-frequency design, and Bulldozer is fixed, but then it has lost its two major design components. The shared FlexFPU can remain, but the integer SSE latencies must be fixed.
 
I am pretty sure AMD is releasing a 16 core Bulldozer, and I am pretty sure we have seen leaks of performance from it. It has been stated over and over again that there will be a 16 core Bulldozer CPU, as that is where we got the "50% more performance from 33% more cores" line from in the first place (16 cores versus the current 12 cores).

I honestly think you are worrying way too much about something you have no control over. It will be released soon enough, and its release won't significantly change the world for either the better or worse. Just check it out in a couple months when we will have actual reviews and benchmarks on the processor. It isn't that long of a wait anymore.
 
I think you are overplaying the importance of decoding. The decoders don't need to be as wide as the execution units, because most real code starts with data stalls, during which you get to run a little forward with the decode. Intel is decode-limited without the uop cache mostly because they really suffer whenever crossing 16-byte alignment. AMD has somewhat faster decode simply because their window is 32 bytes. If you can decode full 4 instructions every second cycle without exceptions, I'm pretty sure you can extract most of the ILP in normal x86 code.

Decode bandwidth is also tightly dependent on branch prediction, as a mispredict causes you to waste decoding. Looking at all the added SRAM in the decode units, it's frankly possible that the real decode BW goes up in BD, because of reduced mispredicts.
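The mispredict effect on usable decode bandwidth can be put into a tiny model: between mispredicts you decode at full width, then pay a flush penalty of dead cycles. All numbers below are illustrative, not Bulldozer specifics:

```python
def effective_decode_ipc(width, instrs_per_mispredict, flush_penalty):
    """Average decoded instructions per cycle when, every
    `instrs_per_mispredict` instructions, a mispredict costs
    `flush_penalty` wasted cycles."""
    decode_cycles = instrs_per_mispredict / width
    return instrs_per_mispredict / (decode_cycles + flush_penalty)

# A 4-wide decoder with a 20-cycle flush: pushing mispredicts
# further apart recovers most of the lost bandwidth.
print(effective_decode_ipc(4, 100, 20))  # ~2.2
print(effective_decode_ipc(4, 400, 20))  # ~3.3
```

This is the quantitative form of the point above: a 4x reduction in mispredict rate buys back as much front-end throughput as a substantially wider decoder would.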

As for execution width, BD is 4-wide, making it the widest AMD CPU ever (still narrower for integer code than Intel's 5-wide, though). Sure, there is one less add pipe, but way more than 33% of x86 is movs (either as separate instructions or as memory operands in ALU instructions), making it a win versus Phenom.

If your fears are founded and there is always a wait cycle between dependent instructions, then yes, that would totally ruin performance. I'm still hoping that the rumors for that are unfounded, and it only happens in some (not very common) situations.
 
Can anyone tell me for sure whether one thread can use both 128-bit FMACs per cycle per module?

I was under the impression that only one 128-bit FMAC could be used per core per cycle.
From a scheduling and execution standpoint, for sure yes; that was part of a question clarification on AMD's Bulldozer preview website. Regarding sustained usage, it depends of course on what the other thread is doing, and throughput information was missing from the optimization manual.
 
I am pretty sure AMD is releasing a 16 core Bulldozer, and I am pretty sure we have seen leaks of performance from it. It has been stated over and over again that there will be a 16 core Bulldozer CPU, as that is where we got the "50% more performance from 33% more cores" line from in the first place (16 cores versus the current 12 cores).

Yes, I know Interlagos. But JF-AMD said it is only issued as a server part and that they will not release it as a desktop part.

Therefore in the server market everything seems to be okay, because CPU prices are not so much an issue and AMD uses an MCM. What I mean is that they cannot do a single-die 16-core part, whereas I think Sandy Bridge E will be a single-die part.

I honestly think you are worrying way too much about something you have no control over. It will be released soon enough, and its release won't significantly change the world for either the better or worse. Just check it out in a couple months when we will have actual reviews and benchmarks on the processor. It isn't that long of a wait anymore.
Oh, I am not worrying about the Bulldozer launch. As I said, it is enough to compete with current Intel offerings. I am worrying about the Sandy Bridge E launch! And then about the Ivy Bridge launch...
 
Okay, the 4-wide decoder and 32-byte fetch are great, but you forget that for one core that means only 2-wide decoding and a 16-byte fetch.

Decoding is an issue because AMD would need 6-wide decoders, which is nearly impossible with x86. The main reason IBM was obviously able to get this right with POWER is that they do not have the decoding issues. And that is what makes Bulldozer suffer.

You have to be careful with execution width. The width you use suggests you are talking about micro-ops, not macro-ops. Just consider that Phenom II was 6-wide!

But it is very misleading to discuss execution width in micro-ops, because what counts in the end is the macro-op execution width, and there you have, for each core:

Phenom II: decode 3-wide, execution 3-wide, max allowed address operands ~2.5
Sandy Bridge: decode 3-wide*, execution 3-wide + macro-op fusion (= 4 if fused), max allowed address ops 2
Nehalem: decode 3-wide, execution 3-wide + macro-op fusion (= 4 if fused), max allowed address ops ~1.5
Bulldozer: decode 2-wide, execution 2-wide, max allowed address ops 2

* Sandy Bridge has a loop trace cache

Now to come to Intel's fetch:
Intel fetches only 16 bytes, but has an extra buffer so it can average over fetches, which reduces the fetch-width limitation. But yes, they can hardly sustain 4 macro-ops per cycle.
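The fetch-width arithmetic behind the per-core comparison can be made explicit, assuming a typical x86 instruction length of about 4 bytes (a rough average for illustration, not a measured figure):

```python
def fetch_ipc_limit(fetch_bytes_per_cycle, cores_sharing=1,
                    avg_instr_bytes=4.0):
    """Instructions per cycle per core that the fetch stage can
    sustain, for an assumed average instruction length."""
    return fetch_bytes_per_cycle / avg_instr_bytes / cores_sharing

print(fetch_ipc_limit(16))                   # Intel core: 4.0
print(fetch_ipc_limit(32, cores_sharing=2))  # BD module: 4.0 per core
```

So on this rough average the 32-byte module fetch and the 16-byte Intel fetch come out the same per core; the real differences come from alignment restrictions and buffering, as discussed above.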
 
You forget that Sandy Bridge uses a ring bus, so adding more cores is relatively easy, but it also adds latency. Core-to-core communication will have double the current latency with 8 cores, and communication with the L3 cache also doubles in latency due to the way it was designed. There will definitely be areas where the E series is slower in IPC than current Socket 1155 processors for this reason.
 