Rumour: Bulldozer 50% Faster than Core i7 and Phenom II.

Page 61
Status
Not open for further replies.

grimpr

Golden Member
Aug 21, 2007
1,095
7
81
Here are some fresh slides.

http://news.ati-forum.de/index.php/news/36-mainboards/1827-am3-auch-bei-msi-mit-bios-update

a0jdl4.jpg


66j97n.jpg
 

Martimus

Diamond Member
Apr 24, 2007
4,490
157
106
I love the Engrish on those slides.

EDIT: And I just noticed that they give a launch date of June 7th, 2011 for the FX series processors. Even though I just bought a new 2600K, I am still thinking about buying a new Bulldozer CPU as well.
 

nonameo

Diamond Member
Mar 13, 2006
5,902
2
76
Looks like Gigabyte is saying they will support Bulldozer on revision 3.1 motherboards, without a doubt.
 

drizek

Golden Member
Jul 7, 2005
1,410
0
71
Wow, I'm glad I didn't buy one of the Asus boards.

I might consider a Gigabyte now though, or just wait for a real AM3+ board.
 

HW2050Plus

Member
Jan 12, 2011
168
0
0
They claim that each core is capable of 4 integer issues/cycle, compared to 3 for K10, but I still haven't seen any technical explanation of this on any site.
This is complicated. Phenom, for example, could do 3 ALU or AGU operations per cycle. It had 3 AGUs and 3 ALUs, a total of 6 units!

So in a cycle Phenom might be able to do:
3 ALU, 2 ALU + 1 AGU, 1 ALU + 2 AGU, 3 AGU

A Bulldozer core has only 4 units: 2 AGUs and 2 ALUs. Normally you would say, hey, that is 33% fewer. But Bulldozer has a trick: all four units work in parallel, each with a dedicated pipeline. So despite having 2 fewer units, it can do more per cycle. However, the usage is somewhat restricted:

Bulldozer in a cycle might be able to do:
2 ALU + 2 AGU

So if the code being executed fits this 50/50 ALU/AGU scheme, everything is fine and Bulldozer can do 4 ops/cycle. If the mix is very uneven, this can drop to 2 in some cases. What really helps here is that the scheduler has long queues, so short stretches of uneven code can be compensated for from the scheduler "backlog".

So this 4 ops/cycle/core is a bit less than it sounds. Intel CPUs are also restricted in their code mix; the only CPUs that could fully intermix AGU and ALU ops were AMD's K7 through K10, at the expense of 6 units and a lot of die space. The upside is that Intel could not fully intermix either, so code optimized for Intel should run much better on Bulldozer than on previous AMD generations. And that without having to be recompiled!
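
To put numbers on the issue-mix argument, here is a minimal Python sketch of the simplified model described in this post. It assumes, as the post does and not as anything confirmed by AMD documentation, that K10 issues any mix of up to 3 ALU/AGU ops per cycle and that Bulldozer issues up to 2 ALU plus 2 AGU ops. The function names are made up for illustration, and the model ignores dependencies, decode limits, and memory stalls.

```python
def k10_ops_per_cycle(alu_ready, agu_ready):
    """Ops issued in one cycle under the post's K10 model: any mix, at most 3 total."""
    return min(alu_ready + agu_ready, 3)


def bulldozer_ops_per_cycle(alu_ready, agu_ready):
    """Ops issued in one cycle under the post's Bulldozer model: at most 2 ALU and 2 AGU."""
    return min(alu_ready, 2) + min(agu_ready, 2)


def sustained_throughput(alu_fraction, alu_ports, agu_ports, max_total):
    """Long-run ops/cycle for a stream with the given ALU share.

    Assumes a scheduler queue deep enough to smooth out short-term imbalance,
    so only the average mix matters.
    """
    agu_fraction = 1.0 - alu_fraction
    limits = [max_total]
    if alu_fraction > 0:
        limits.append(alu_ports / alu_fraction)  # ALU ports become the bottleneck
    if agu_fraction > 0:
        limits.append(agu_ports / agu_fraction)  # AGU ports become the bottleneck
    return min(limits)


for share in (0.5, 0.75, 1.0):
    k10 = sustained_throughput(share, alu_ports=3, agu_ports=3, max_total=3)
    bd = sustained_throughput(share, alu_ports=2, agu_ports=2, max_total=4)
    print(f"ALU share {share:.2f}: K10 {k10:.2f} ops/cycle, Bulldozer {bd:.2f} ops/cycle")
```

Under these assumptions the crossover is at a 2:1 mix: Bulldozer is ahead while neither op type exceeds two thirds of the stream, and falls behind K10 beyond that.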
 

Abwx

Lifer
Apr 2, 2011
11,885
4,873
136
Thank you, that sounds logical.

But then, if a cycle calls for 3 ALU or 3 AGU ops, BD's theoretical throughput takes a 33% hit compared to K10 (2 ops instead of 3).

Such cases are certainly the exception rather than the rule, but I don't really understand why AMD didn't fully match the execution resources of its preceding CPU, even though they claim IPC will improve overall by 15-20% for integer code and by as much as 50% when the FPUs are heavily involved.

Hope we'll get some info soon, since we are left wildly speculating around vague slides and uncertain benchmarks.
 

podspi

Golden Member
Jan 11, 2011
1,982
102
106
Such cases are certainly the exception rather than the rule, but I don't really understand why AMD didn't fully match the execution resources of its preceding CPU, even though they claim IPC will improve overall by 15-20% for integer code and by as much as 50% when the FPUs are heavily involved.

Probably because the expense of adding it wasn't worth the edge cases where it would be useful. Consider this: theoretically it's "better", but it didn't help the Phenom II against Nehalem (or Sandy Bridge).

A lot of that might just be down to compiler optimizations (someone more knowledgeable in this area than me would have to comment on that), but for AMD it makes a lot of sense to design their processors so that optimizations for their CPUs are similar to Intel's. That way the ubiquitous Intel compiler optimizes for them, too.
 

itsmydamnation

Diamond Member
Feb 6, 2011
3,076
3,908
136
Several AMD engineers have said (quite a few times, actually) that the 3rd AGU in STARS is only there for symmetry and hasn't been used for quite a few cores.
 

wanderer27

Platinum Member
Aug 6, 2005
2,173
15
81
I love the Engrish on those slides.

EDIT: And I just noticed that they give a launch date of June 7th, 2011 for the FX series processors. Even though I just bought a new 2600K, I am still thinking about buying a new Bulldozer CPU as well.

You should see the technical documentation I have to read o_O
 

podspi

Golden Member
Jan 11, 2011
1,982
102
106
Several AMD engineers have said (quite a few times, actually) that the 3rd AGU in STARS is only there for symmetry and hasn't been used for quite a few cores.

What does that mean, exactly? What advantage does symmetry have if it never fires up?
 

Tuna-Fish

Golden Member
Mar 4, 2011
1,667
2,537
136
What does that mean, exactly? What advantage does symmetry have if it never fires up?

All of the three AGUs can be used, but only two of them can be used concurrently. This makes scheduling easier, as for the schedulers the units are symmetric.
 

Abwx

Lifer
Apr 2, 2011
11,885
4,873
136
As per http://support.amd.com/us/Processor_TechDocs/47414.pdf page 36:

There are 4 integer execution units per core, two of which can do all arithmetic/shift etc. There are also two AGLUs, which do address generation and simple ALU operations. So it looks like the Bulldozer core is wider than STARS :wub:.

Right, this has also been pointed out by Hans de Vries on the SemiAccurate forum...
BD looks like a number-crunching machine...
Could be that AMD is going to strike very hard...
 

HW2050Plus

Member
Jan 12, 2011
168
0
0
As per http://support.amd.com/us/Processor_TechDocs/47414.pdf page 36:

There are 4 integer execution units per core, two of which can do all arithmetic/shift etc. There are also two AGLUs, which do address generation and simple ALU operations. So it looks like the Bulldozer core is wider than STARS :wub:.

Thank you for this link, so it is finally out.

The AGLUs can execute instructions like add/sub/or/and/xor/lea/mov, whereas a plain AGU can only do lea/mov.

That is really good news. So for some instruction types (see above), Bulldozer has even more execution resources than Sandy Bridge (4 vs. 3), for others it has fewer (2 vs. 3 general-purpose ALUs), and for others again it is the same (1 for div/mul).
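
The comparison above can be tabulated in a few lines of Python. This is only a sketch of the unit counts as stated in this post (my reading of the AMD guide, not independently verified figures), and the grouping into three instruction classes and the names used are illustrative.

```python
# Execution units per core that can handle each instruction class,
# using the counts claimed in this post (not independently verified).
UNITS_PER_CORE = {
    "add/sub/or/and/xor/lea/mov": {"Bulldozer": 4, "Sandy Bridge": 3},
    "other general-purpose ALU ops": {"Bulldozer": 2, "Sandy Bridge": 3},
    "div/mul": {"Bulldozer": 1, "Sandy Bridge": 1},
}


def relative_width(units):
    """Map each instruction class to Bulldozer's unit count relative to Sandy Bridge."""
    return {
        op_class: counts["Bulldozer"] / counts["Sandy Bridge"]
        for op_class, counts in units.items()
    }


for op_class, ratio in relative_width(UNITS_PER_CORE).items():
    print(f"{op_class}: {ratio:.2f}x the units of Sandy Bridge")
```

Unit counts are only an upper bound on issue width; what a real instruction stream achieves also depends on decode, scheduling, and dependencies.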

However, there is also some really bad news in this document. The latency for SSE/AVX integer is still very bad, and the extremely bad latency of 42 cycles for fdiv is now confirmed. I really have no idea why that one is so slow compared to divps at 24 (exactly the same operation, only in the normally slower SSE form), and even divpd (which does 2 fdiv in one instruction!) has a great value of only 27 cycles latency. Really very strange.
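
To see why the fdiv figure looks so odd, here is a small Python sketch that divides each quoted latency by the number of results the instruction produces. The latencies are the ones quoted in this post; the lane counts assume the 128-bit SSE forms (4 single-precision lanes for divps, 2 double-precision lanes for divpd). Latency is not the same as throughput, so treat this as a rough comparison only.

```python
# (latency in cycles as quoted in this post, results produced per instruction)
DIVIDE_INSTRUCTIONS = {
    "fdiv": (42, 1),   # x87 scalar divide
    "divps": (24, 4),  # 128-bit SSE: four single-precision lanes (assumed form)
    "divpd": (27, 2),  # 128-bit SSE: two double-precision lanes (assumed form)
}


def latency_per_result(table):
    """Cycles of latency per produced result for each instruction."""
    return {name: latency / lanes for name, (latency, lanes) in table.items()}


for name, cycles in latency_per_result(DIVIDE_INSTRUCTIONS).items():
    print(f"{name}: {cycles:.1f} cycles of latency per result")
```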

Looks like AMD Bulldozer takes a large step in the direction of an Intel Core-type architecture and completely drops its "DEC Alpha"-like architecture. So it is a bit in between Intel Core and DEC Alpha, taking some Core features and dropping some "Alpha" features it could not really get performance out of.

Looks fine to me except for two flaws, in my opinion: fdiv and integer SSE performance. In both, Bulldozer has only 50% of the performance of Intel Core, and I do not see any reason why they made it like that, hence: flaw. The integer SSE especially is really not good. And maybe the bad performance in C-Ray comes from exactly this fdiv latency. That could be partly compensated for by using divps, but that requires new compilers/recompilation.

But again, the brand-new news that the AGUs are not only AGUs but, as AGLUs, can handle at least very simple (but also very frequent) ALU operations is very good.
 