Rumour: Bulldozer 50% Faster than Core i7 and Phenom II.

grimpr · Apr 6, 2011

Here are some fresh slides.

http://news.ati-forum.de/index.php/news/36-mainboards/1827-am3-auch-bei-msi-mit-bios-update

Martimus · Apr 6, 2011

I love the Engrish on those slides.

EDIT: And I just noticed that they give a launch date of June 7th, 2011 for the FX series processors. Even though I just bought a new 2600K, I am still thinking about buying a new Bulldozer CPU as well.

nonameo · Apr 6, 2011

Looks like Gigabyte is saying they will support Bulldozer in revision 3.1 motherboards, without a doubt.

Ajay · Apr 6, 2011

grimpr said:
Won't be long until the Chinese rumour tsunami, give it another 2-3 weeks.

I guess we can hope :whiste:

bryanW1995 · Apr 6, 2011

Ajay said:
Wow, 1500 posts in this thread! And we don't really know much more than when it started

Just be glad that this isn't a Duke Nukem Forever discussion thread...

Ajay · Apr 6, 2011

bryanW1995 said:
Just be glad that this isn't a Duke Nukem Forever discussion thread...

:biggrin: Good point!

drizek · Apr 6, 2011

Wow, I'm glad I didn't buy one of the Asus boards.

I might consider a Gigabyte now though, or just wait for a real AM3+ board.

ElFenix · Apr 6, 2011

grimpr said:
Here are some fresh slides.

http://news.ati-forum.de/index.php/news/36-mainboards/1827-am3-auch-bei-msi-mit-bios-update

so gigabyte is basically advertising the fact that their current boards aren't forward compatible?

classic marketing technique, make your weaknesses into strengths!

drizek · Apr 6, 2011

Except for this giant list:

http://gigabyte.com/products/list.aspx?s=42&jid=10&p=2&v=26

HW2050Plus · Apr 6, 2011

Abwx said:
They claim that each core is capable of 4 integer issues/cycle
compared to 3 for K10, but I still haven't seen any technical
explanation of this on any site.

This is complicated. Phenom, for example, could do 3 ALU or AGU operations per cycle. It had 3 AGUs and 3 ALUs, a total of 6 units!

So in a cycle Phenom might be able to do:
3 ALU, 2 ALU + 1 AGU, 1 ALU + 2 AGU, 3 AGU

A Bulldozer core has only 4 units: 2 AGUs and 2 ALUs. Normally you would say, hey, that is 33% less. But Bulldozer comes with a trick: the units all work in parallel, each with a dedicated pipeline. So despite having 2 fewer units, it can do more. However, the usage is somewhat restricted:

Bulldozer in a cycle might be able to do:
2 ALU + 2 AGU

So if the code being executed fits this 50/50 ALU/AGU scheme, everything is fine and Bulldozer can do 4 ops/cycle. If the mix is very uneven, this might drop to 2 in some cases. What really helps here is that the scheduler has long queues, so short stretches of uneven code can be compensated by the scheduler "backlog".

This 4 ops/cycle/core is therefore a bit less than you might expect. Intel CPUs are also restricted in their code mix; the only CPUs that could fully intermix AGU and ALU operations were the AMD K7 through K10, but at the expense of using 6 units and a lot of die space. The advantage is that Intel could not fully intermix either, so code optimized for Intel will run much better on AMD Bulldozer than on previous generations. And that without having to recompile it!
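The issue rules described above can be sketched as a toy model. This is only an illustration of the post's description, not of the real hardware: it assumes every op is independent and already waiting in the scheduler (an ideal "backlog"), ignores decode and retire limits, and the function names are made up for the example.

```python
from math import ceil


def cycles_k10(alu_ops: int, agu_ops: int) -> int:
    """K10 as described in the post: up to 3 ops per cycle in any ALU/AGU mix."""
    return ceil((alu_ops + agu_ops) / 3)


def cycles_bulldozer(alu_ops: int, agu_ops: int) -> int:
    """Bulldozer as described in the post: at most 2 ALU plus 2 AGU ops per cycle."""
    return max(ceil(alu_ops / 2), ceil(agu_ops / 2))


def ops_per_cycle(cycles_fn, alu_ops: int, agu_ops: int) -> float:
    """Average throughput for a batch of independent ops (assumed, idealized)."""
    cycles = cycles_fn(alu_ops, agu_ops)
    return (alu_ops + agu_ops) / cycles if cycles else 0.0


if __name__ == "__main__":
    # (ALU ops, AGU ops): balanced mix, ALU-only mix, and a 3:1 mix
    for alu, agu in [(6, 6), (12, 0), (9, 3)]:
        k10 = ops_per_cycle(cycles_k10, alu, agu)
        bd = ops_per_cycle(cycles_bulldozer, alu, agu)
        print(f"{alu} ALU + {agu} AGU: K10 {k10:.1f} ops/cycle, Bulldozer {bd:.1f} ops/cycle")
```

Under these assumptions the balanced mix gives Bulldozer 4 ops/cycle against 3 for K10, while the ALU-only mix drops Bulldozer to 2 ops/cycle and leaves K10 at 3.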

ElFenix · Apr 6, 2011

drizek said:
Except for this giant list:

http://gigabyte.com/products/list.aspx?s=42&jid=10&p=2&v=26

is that revision available right now or is newegg still clearing out rev 3.0?

Abwx · Apr 6, 2011

HW2050Plus said:
This is complicated. Phenom, for example, could do 3 ALU or AGU operations per cycle. It had 3 AGUs and 3 ALUs, a total of 6 units!

So in a cycle Phenom might be able to do:
3 ALU, 2 ALU + 1 AGU, 1 ALU + 2 AGU, 3 AGU

A Bulldozer core has only 4 units: 2 AGUs and 2 ALUs. Normally you would say, hey, that is 33% less. But Bulldozer comes with a trick: the units all work in parallel, each with a dedicated pipeline. So despite having 2 fewer units, it can do more. However, the usage is somewhat restricted:

Bulldozer in a cycle might be able to do:
2 ALU + 2 AGU

So if the code being executed fits this 50/50 ALU/AGU scheme, everything is fine and Bulldozer can do 4 ops/cycle. If the mix is very uneven, this might drop to 2 in some cases. What really helps here is that the scheduler has long queues, so short stretches of uneven code can be compensated by the scheduler "backlog".

This 4 ops/cycle/core is therefore a bit less than you might expect. Intel CPUs are also restricted in their code mix; the only CPUs that could fully intermix AGU and ALU operations were the AMD K7 through K10, but at the expense of using 6 units and a lot of die space. The advantage is that Intel could not fully intermix either, so code optimized for Intel will run much better on AMD Bulldozer than on previous generations. And that without having to recompile it!

Thank you, that sounds logical.

But then, in case a cycle requests 3 ALU or 3 AGU ops, the theoretical throughput of BD will take a 33% hit compared to K10.

Certainly such requests are the exception rather than the rule,
but I hardly understand why AMD didn't completely match the execution
resources of its preceding CPU, though overall they claim IPC will
improve in the 15-20% range for integer and by as much as 50% if the FPUs
are heavily involved.

Hope we'll soon get some info, since we are left wildly speculating
around vague slides and uncertain benchmarks.

podspi · Apr 6, 2011

Abwx said:
Certainly such requests are the exception rather than the rule,
but I hardly understand why AMD didn't completely match the execution
resources of its preceding CPU, though overall they claim IPC will
improve in the 15-20% range for integer and by as much as 50% if the FPUs
are heavily involved.

Probably because the expense of adding it in wasn't worth the edge cases where it would be useful. Consider this, theoretically it's "better" but it didn't help the Phenom II against Nehalem (or Sandy Bridge).

A lot of that might just be because of compiler optimizations (someone more knowledgeable in this area than me would have to comment on that), but for AMD it makes a lot of sense to design their processors so that optimizations for their CPUs are similar to Intel's. That way the ubiquitous Intel compiler optimizes for them, too.

itsmydamnation · Apr 6, 2011

several AMD engineers have said (quite a few times actually) that the 3rd AGU in STARS is only there for symmetry and hasn't been used for quite a few cores.

hamunaptra · Apr 6, 2011

itsmydamnation said:
several AMD engineers have said (quite a few times actually) that the 3rd AGU in STARS is only there for symmetry and hasn't been used for quite a few cores.

I was gonna mention that as well =)

wanderer27 · Apr 6, 2011

Martimus said:
I love the Engrish on those slides.

EDIT: And I just noticed that they give a launch date of June 7th, 2011 for the FX series processors. Even though I just bought a new 2600K, I am still thinking about buying a new Bulldozer CPU as well.

You should have to read the Technical Documentation I have to read.

podspi · Apr 6, 2011

itsmydamnation said:
several AMD engineers have said (quite a few times actually) that the 3rd AGU in STARS is only there for symmetry and hasn't been used for quite a few cores.

What does that mean, exactly? What advantage does symmetry have if it never fires up?

Tuna-Fish · Apr 6, 2011

podspi said:
What does that mean, exactly? What advantage does symmetry have if it never fires up?

All of the three AGUs can be used, but only two of them can be used concurrently. This makes scheduling easier, as for the schedulers the units are symmetric.

AtenRa · Apr 7, 2011

AMD "Bulldozer" Interactive Series - Introduction

http://www.youtube.com/watch?v=mr7kr4kimeM&feature=player_embedded

ehehe nice videos John

JFAMD · Apr 7, 2011

The Chicago accent really comes out.

Schmide · Apr 7, 2011

AtenRa said:
AMD "Bulldozer" Interactive Series - Introduction

http://www.youtube.com/watch?v=mr7kr4kimeM&feature=player_embedded

ehehe nice videos John

jmwebb5 said:
Yeah, what's up with the front end loader? Apparently identifying heavy equipment is not their forte.

Funny

itsmydamnation · Apr 8, 2011

as per http://support.amd.com/us/Processor_TechDocs/47414.pdf page 36.

there are 4 integer execution units per core, two of which can do all arithmetic/shift etc. there are also two AGLUs which do address gen and simple ALU operations. So it looks like the bulldozer core is wider than STARS :wub:.

Abwx · Apr 8, 2011

itsmydamnation said:
as per http://support.amd.com/us/Processor_TechDocs/47414.pdf page 36.

there are 4 integer execution units per core, two of which can do all arithmetic/shift etc. there are also two AGLUs which do address gen and simple ALU operations. So it looks like the bulldozer core is wider than STARS :wub:.

Right, it has also been pointed out by Hans de Vries on the Semiaccurate forum...
BD looks like a number-crunching machine...
Could be that AMD is going to strike very hard.

Idontcare · Apr 8, 2011

Schmide said:
Funny

LOL, I noticed this too...

Bulldozer:

Front-end loader:

HW2050Plus · Apr 8, 2011

itsmydamnation said:
as per http://support.amd.com/us/Processor_TechDocs/47414.pdf page 36.

there are 4 integer execution units per core, two of which can do all arithmetic/shift etc. there are also two AGLUs which do address gen and simple ALU operations. So it looks like the bulldozer core is wider than STARS :wub:.

Thank you for this link, so it is out now finally.

The AGLUs can do instructions like add/sub/or/and/xor/lea/mov, compared to an AGU, which can only do lea/mov.

That is really good news. So for some instruction types (see above), Bulldozer has even more execution resources than Sandy Bridge (4 vs. 3), for others it has fewer (2 vs. 3 general-purpose ALUs), and for others again it is the same (1 for div/mul).

However, there is also some really bad news in this document. The latency for SSE/AVX integer is still very bad, and the extremely bad latency of 42 cycles for fdiv is now confirmed. I really have no idea why that one is so slow if you compare it to divps, which is 24 (exactly the same operation, only in the normally slower SSE form), and even divpd (which does 2 divides in one instruction!) has a great value of only 27 cycles latency. Really very strange.
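To put those latency figures side by side, here is a small back-of-the-envelope sketch. It uses only the cycle counts quoted in this post (not independently verified) and assumes a chain of divides where each one needs the previous result, so latency rather than throughput is the limit. The helper names are made up for the example.

```python
# Latencies in cycles, exactly as quoted in the post (not independently verified).
LATENCY = {"fdiv": 42, "divps": 24, "divpd": 27}

# Divides per instruction: x87 fdiv is scalar, divps packs 4 single-precision
# floats and divpd packs 2 doubles into one 128-bit register.
LANES = {"fdiv": 1, "divps": 4, "divpd": 2}


def chain_cycles(instr: str, dependent_divides: int) -> int:
    """Cycles for a chain where every divide waits on the previous one."""
    return LATENCY[instr] * dependent_divides


def latency_ratio(slow: str, fast: str) -> float:
    """How many times longer `slow` takes than `fast` for a single instruction."""
    return LATENCY[slow] / LATENCY[fast]


if __name__ == "__main__":
    for instr in LATENCY:
        print(f"{instr}: {LANES[instr]} divide(s) per instruction, "
              f"10 dependent instructions take {chain_cycles(instr, 10)} cycles")
    print(f"fdiv vs divps latency ratio: {latency_ratio('fdiv', 'divps'):.2f}")
```

With the quoted numbers, a dependent chain of fdiv instructions takes 1.75 times as long as the same chain of divps instructions.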

Looks like AMD Bulldozer makes a large step in the direction of the Intel Core type of architecture and largely drops its "DEC Alpha"-like architecture. So it sits somewhere between Intel Core and DEC Alpha, taking some Core features and dropping some "Alpha" features it could not really get performance from.

Looks fine to me except for what are, in my opinion, two flaws: fdiv and integer SSE performance. In both, Bulldozer has only 50% of the performance of Intel Core, and I do not see any reason why they made it like that. Especially the integer SSE is really not good. And maybe the bad performance in C-Ray comes from exactly the fdiv latency. That one could be partly compensated by using divps, but that requires new compilers or recompilation.

But again, the brand-new information that the AGUs are not only AGUs but AGLUs, able to handle at least very simple (but also very frequent) ALU operations, is very good news.
