Rumour: Bulldozer 50% Faster than Core i7 and Phenom II.

Daedalus685 · Jan 15, 2011

BLaber said:
If you google it you will find the software that lets you run SLI on AMD mobos with the latest Nvidia drivers. Many people are already running SLI on AMD mobos.

Yes I know.

Those would be the enthusiast mods, but just like using a dedicated PhysX card with a Radeon, it isn't supported officially and likely never will be.

Stoneburner · Jan 15, 2011

bmadd89 said:
Intel's way has always been never to do a new process with a new arch.

Always? No. Prescott taught them this lesson.

bryanW1995 · Jan 15, 2011

JFAMD said:
Desktop wasn't pushed ahead of server. Every single release is different based on a set of variables that I won't get into. Desktop was first this round, but each round is a separate decision (like the number that landed on the last spin of the roulette wheel.)

Desktop has always been ahead of server on this architecture, from way back.

OK, thanks. Sorry, I thought that you told us last year that server was first again this time.

When I said "BF 2011" upgrade I meant "BF year 2011" upgrade; I wasn't necessarily referring to Intel there. Stupid pin counts...

hamunaptra · Jan 15, 2011

bmadd89 said:
Intel's way has always been never to do a new process with a new arch. As much as I want BD to be what we all want, what are the chances of a brand new process with a brand new arch (even if it is designed for frequency) getting close to 4GHz?

I mean, how long has AMD had 45nm and Phenom II out of the gate for? And they're doing 3.5GHz on them only now?

Sure, they could be clocked to over 4GHz long ago (hell, my 720 has been sitting on 3.8 core, 2.8 NB almost since I bought it at release), but they were not sold at that speed for a reason, just like SB can do over 4GHz easily, so why not release them at that?

I mean, I'm probably going to buy one anyway because I want an 8-core, but you gotta look at the facts.

New process + new design = oodles of things that can go wrong, just like Barcelona.

Well, first of all, AMD's current 45nm designs aren't designed to be a high-clocking architecture. If they can reach 4GHz now, that says something: AMD's new arch, which is aimed at high clocks, should be able to do that with no problem, and possibly launch at around 4GHz stock.
Rumours I've seen so far on various websites point to AMD's 32nm being very capable at this point in time. So, I'm hopeful.
I also have a feeling that AMD has been working on this architectural design for a very, very long time; maybe that's why we haven't seen much advancement recently.
So, I'm hoping putting these 2 together will yield results that will surprise all of us =)

Cogman · Jan 15, 2011

I really hope that AMD does pull out the performance crown. I'll always have a soft spot when it comes to AMD (Even though all my machines right now are Intel based.)

However, I'm skeptical of pre-release numbers from any company.

bryanW1995 · Jan 15, 2011

As long as they are able to execute even reasonably well at launch, they'll hit the target imho. Figure clock/clock +10% vs Ph2 and clocks +20%, and they'll be fighting toe to toe with Intel for a couple of years at least, instead of getting pasted like what we've seen since C2D launched 4 1/2 years ago. Heck, even my 1055 @ 3.35 is much stronger in DC than my 920 @ 4.0 with HT enabled. An 8-core BD even at 4.0 should blitz both of them.
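
As a rough check on the estimate above: a per-clock gain and a clock-speed gain multiply rather than add. A minimal Python sketch, assuming performance scales linearly with both (the function name is made up for illustration):

```python
def combined_speedup(per_clock_gain: float, clock_gain: float) -> float:
    """Overall speedup when per-clock and frequency gains compound.

    Assumes performance = work per clock * clock rate, which ignores
    memory bottlenecks and multi-core scaling.
    """
    return (1.0 + per_clock_gain) * (1.0 + clock_gain)


# +10% clock-for-clock and +20% clocks, the figures guessed in the post
speedup = combined_speedup(0.10, 0.20)
print(f"{speedup:.2f}x a Phenom II core")  # 1.32x a Phenom II core
```

So the two guesses together would mean roughly a third more per-core performance, before counting the extra cores.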

hamunaptra · Jan 15, 2011

bryanW1995 said:
As long as they are able to execute even reasonably well at launch, they'll hit the target imho. Figure clock/clock +10% vs Ph2 and clocks +20%, and they'll be fighting toe to toe with Intel for a couple of years at least, instead of getting pasted like what we've seen since C2D launched 4 1/2 years ago. Heck, even my 1055 @ 3.35 is much stronger in DC than my 920 @ 4.0 with HT enabled. An 8-core BD even at 4.0 should blitz both of them.

What's DC?

Ajay · Jan 15, 2011

hamunaptra said:
What's DC?

Distributed Computing

Cogman · Jan 15, 2011

Ajay said:
Distributed Computing

No... It's Direct Current, of course

(or DirectCompute, or District of Columbia, or digital circuit.....)

beginner99 · Jan 15, 2011

JFAMD said:
No, this is not true, they are all full cores. Every integer core has its own FP unit (actually a more powerful FMAC.)

It will be 8 cores, 8 threads.

So this is basically false? Or did I misunderstand it?

The basic building block is the Bulldozer module. AMD calls this a dual-core module because it has two independent integer cores and a single shared floating point core that can service instructions from two independent threads. The two thread machine is larger than a single core but smaller than two cores with straight duplication of resources.

http://www.anandtech.com/show/3863/amd-discloses-bobcat-bulldozer-architectures-at-hot-chips-2010/4

As far as I understood, only 1 FP instruction can be scheduled per module per clock. Meaning if the 2 threads running on 1 module both only run FP instructions, a module will be as fast (or slow) as a single core?
(this will surely be the case with AVX instructions)
Also, are OSes "module aware" so that the above case does not happen? (I think this is the case with HT)

AtenRa · Jan 15, 2011

JFAMD said:
No, this is not true, they are all full cores. Every integer core has its own FP unit (actually a more powerful FMAC.)

It will be 8 cores, 8 threads.

If I'm not mistaken, there is only ONE FP unit (shared resources) for every Bulldozer module. This FP unit is shared by the two INT units in the module and is divided into dual 128-bit FMACs. The FP unit can hyperthread (2x 128-bit FMACs) or be combined into ONE 256-bit FMAC (for AVX).

For any application needing integer calculations, the BD (4 modules, 8 cores) will execute 8 (max) threads in 8 INT execution units, but when the application needs FP calculations the BD will execute 8 (max) threads in 4 FP execution units (8x 128-bit FMACs (hyperthreading) or 4x 256-bit).

http://www.anandtech.com/show/3863/amd-discloses-bobcat-bulldozer-architectures-at-hot-chips-2010/5

Anandtech said:
While there are two integer schedulers in a single Bulldozer module (one for each thread), there’s only one FP scheduler. There’s some hardware duplication at the FP scheduler to allow two threads to share the execution resources behind it. While each integer core behaves like an independent core, the FP resources work as they would in a SMT (Hyper Threading) system.

The FP scheduler has four ports to its FPUs. There are two 128-bit FMAC pipes and two 128-bit packed integer pipes. Like Sandy Bridge, AMD’s Bulldozer will support SSE all the way up to 4.2 as well as Intel’s new AVX instructions. The 256-bit AVX ops will be handled by the two 128-bit FMAC units in each Bulldozer module.

Each Bulldozer module has its own private L2 cache shared by both integer cores and the FP execution hardware.

JFAMD · Jan 15, 2011

Cogman said:
I really hope that AMD does pull out the performance crown. I'll always have a soft spot when it comes to AMD (Even though all my machines right now are Intel based.)

However, I'm skeptical of pre-release numbers from any company.

Which is why we release benchmarks at launch.

beginner99 said:
So this is basically false? Or did I misunderstand it?

http://www.anandtech.com/show/3863/amd-discloses-bobcat-bulldozer-architectures-at-hot-chips-2010/4

As far as I understood, only 1 FP instruction can be scheduled per module per clock. Meaning if the 2 threads running on 1 module both only run FP instructions, a module will be as fast (or slow) as a single core?
(this will surely be the case with AVX instructions)
Also, are OSes "module aware" so that the above case does not happen? (I think this is the case with HT)

No, there are 2 FMACs, and on any cycle you could have a 128-bit FP instruction scheduled on each of the FMACs, or you could merge the FMACs together to execute a single 256-bit AVX instruction.

AtenRa said:
If I'm not mistaken, there is only ONE FP unit (shared resources) for every Bulldozer module. This FP unit is shared by the two INT units in the module and is divided into dual 128-bit FMACs. The FP unit can hyperthread (2x 128-bit FMACs) or be combined into ONE 256-bit FMAC (for AVX).

For any application needing integer calculations, the BD (4 modules, 8 cores) will execute 8 (max) threads in 8 INT execution units, but when the application needs FP calculations the BD will execute 8 (max) threads in 4 FP execution units (8x 128-bit FMACs (hyperthreading) or 4x 256-bit).

http://www.anandtech.com/show/3863/amd-discloses-bobcat-bulldozer-architectures-at-hot-chips-2010/5

No, you have it backwards. Each module has 2 128-bit FMACs that can merge into a single 256-bit AVX pipe.

So on any cycle each module could execute either 2 integer executions and 2 128-bit FMAC executions OR 2 integer executions and 1 256-bit AVX execution.

When you think about the FP capabilities, think of it as 2 units merging into one, not one unit splitting into two.
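
The two per-cycle options described above can be tallied in a short Python sketch. It only counts the issue slots as stated in this post, and the 4-module chip is the configuration mentioned earlier in the thread; the class and function names are hypothetical, and real throughput depends on scheduling, latency and the instruction mix.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModuleCycle:
    """What one Bulldozer module issues in one cycle, per the post above."""
    integer_ops: int
    fp128_ops: int   # 128-bit ops, one per FMAC
    avx256_ops: int  # 256-bit ops, using both FMACs merged

    def fp_bits(self) -> int:
        """Total FP data width issued in this cycle."""
        return self.fp128_ops * 128 + self.avx256_ops * 256


# The two alternatives: FMACs used separately, or merged into one pipe.
SPLIT = ModuleCycle(integer_ops=2, fp128_ops=2, avx256_ops=0)
MERGED = ModuleCycle(integer_ops=2, fp128_ops=0, avx256_ops=1)


def chip_fp_ops(mode: ModuleCycle, modules: int = 4) -> dict:
    """Peak FP issue per cycle for a chip with the given module count."""
    return {
        "fp128_ops": mode.fp128_ops * modules,
        "avx256_ops": mode.avx256_ops * modules,
        "fp_bits": mode.fp_bits() * modules,
    }


for name, mode in (("split", SPLIT), ("merged", MERGED)):
    print(name, chip_fp_ops(mode))
```

Either way a module issues 256 bits of FP work per cycle, so a 4-module part peaks at eight 128-bit or four 256-bit operations per cycle under this counting.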

hamunaptra · Jan 15, 2011

So, let me get this right... Sandy Bridge is capable of executing 3x 256-bit AVX instructions per clock cycle, per core, as long as they are of different types.
BD can execute only 1 256-bit instruction of any type per module per clock.

From what I think I understand, Sandy Bridge borrows execution resources from the INT ports to do 256-bit calculations. Does that mean it can't do INT calculations in the same clock while doing its AVX stuff?

AtenRa · Jan 15, 2011

JFAMD said:
No, you have it backwards. Each module has 2 128-bit FMACs that can merge into a single 256-bit AVX pipe.

So on any cycle each module could execute either 2 integer executions and 2 128-bit FMAC executions OR 2 integer executions and 1 256-bit AVX execution.

When you think about the FP capabilities, think of it as 2 units merging into one, not one unit splitting into two.

AMD's slides talk about a shared FP unit.

Anyway, let's say that we have 2x 128-bit FMACs but with a single 60-entry FP scheduler to feed 2 threads. I have to ask if this could be a bottleneck.

bryanW1995 · Jan 15, 2011

Ajay said:
Distributed Computing

Yes. I run SETI@home; Folding@home is another popular one. There are probably 30-40 mainstream ones now in addition to these two.

AtenRa · Jan 15, 2011

hamunaptra said:
So, let me get this right... Sandy Bridge is capable of executing 3x 256-bit AVX instructions per clock cycle, per core, as long as they are of different types.
BD can execute only 1 256-bit instruction of any type per module per clock.

From what I think I understand, Sandy Bridge borrows execution resources from the INT ports to do 256-bit calculations. Does that mean it can't do INT calculations in the same clock while doing its AVX stuff?

I believe the SB models with HT enabled could use inactive ports for INT execution.

JFAMD · Jan 15, 2011

AtenRa said:
AMD's slides talk about a shared FP unit.

Anyway, let's say that we have 2x 128-bit FMACs but with a single 60-entry FP scheduler to feed 2 threads. I have to ask if this could be a bottleneck.

Well, our 60-entry FP scheduler is bigger than the SB scheduler for both INT and FP (they do not have dedicated schedulers).

hamunaptra said:
So, let me get this right... Sandy Bridge is capable of executing 3x 256-bit AVX instructions per clock cycle, per core, as long as they are of different types.
BD can execute only 1 256-bit instruction of any type per module per clock.

From what I think I understand, Sandy Bridge borrows execution resources from the INT ports to do 256-bit calculations. Does that mean it can't do INT calculations in the same clock while doing its AVX stuff?

SB actually uses the SSE registers so if you want to use AVX you need to not only recompile your code to understand AVX-256 but you need to remove SSE and replace it with AVX-128.

We can do FMAC which is essentially an FMUL and FADD. If you have 2 FADD commands at the same time, for SB you would have to do that on 2 cycles. For BD you could do it in 1 cycle because each FMAC can do an FADD or FMUL.

In the end I am betting that 256-bit is probably going to be pretty close but we'll have a clear lead on 128-bit.
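
The FADD example above can be illustrated with a toy issue model in Python. It assumes one FP multiply pipe and one FP add pipe for SB, and two pipes that each accept either operation for BD, as this post describes. It ignores latency, dependencies and operand width, and the names are invented for illustration.

```python
def cycles_to_issue(ops, pipes):
    """Count cycles needed to issue `ops` when each pipe takes one op per cycle.

    ops   -- list of operation names, e.g. ["FADD", "FADD"]
    pipes -- list of sets; each set names the ops that pipe can execute
    Greedy toy model: no latency, no dependencies between ops.
    """
    supported = set().union(*pipes)
    unsupported = [op for op in ops if op not in supported]
    if unsupported:
        raise ValueError(f"no pipe can execute: {unsupported}")

    pending = list(ops)
    cycles = 0
    while pending:
        cycles += 1
        for accepted in pipes:
            # Each pipe picks the first pending op it is able to run.
            for i, op in enumerate(pending):
                if op in accepted:
                    del pending[i]
                    break
    return cycles


# Assumed pipe layouts, following the description in the post.
SB_PIPES = [{"FMUL"}, {"FADD"}]
BD_PIPES = [{"FADD", "FMUL"}, {"FADD", "FMUL"}]

two_adds = ["FADD", "FADD"]
print("SB:", cycles_to_issue(two_adds, SB_PIPES))  # 2
print("BD:", cycles_to_issue(two_adds, BD_PIPES))  # 1
```

With one FADD and one FMUL the model gives one cycle on both layouts, so the difference only shows up when the instruction mix is lopsided.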

piesquared · Jan 15, 2011

JFAMD said:
Well, our 60-entry FP scheduler is bigger than the SB scheduler for both INT and FP (they do not have dedicated schedulers).

SB actually uses the SSE registers so if you want to use AVX you need to not only recompile your code to understand AVX-256 but you need to remove SSE and replace it with AVX-128.

We can do FMAC which is essentially an FMUL and FADD. If you have 2 FADD commands at the same time, for SB you would have to do that on 2 cycles. For BD you could do it in 1 cycle because each FMAC can do an FADD or FMUL.

In the end I am betting that 256-bit is probably going to be pretty close but we'll have a clear lead on 128-bit.

It might depend on who reviews them too. I bet we'll see certain sites focus on whichever benchmarks make Intel look best. Sort of like how Walton chose to focus on benchmarks and comments which make Intel look as good as possible against Fusion (as hard as that is), in the Brazos review/preview/commentary on the front page.
For sure, the Bulldozer architecture is far superior on paper, and it seems Intel supporters are starting to show nerves lol.

Meph3961 · Jan 15, 2011

piesquared said:
It might depend on who reviews them too. I bet we'll see certain sites focus on whichever benchmarks make Intel look best. Sort of like how Walton chose to focus on benchmarks and comments which make Intel look as good as possible against Fusion (as hard as that is), in the Brazos review/preview/commentary on the front page.
For sure, the Bulldozer architecture is far superior on paper, and it seems Intel supporters are starting to show nerves lol.

Wow. How about we wait until Bulldozer is out before we decide which is better? Right now all we have about Bulldozer is paper. I really hope Bulldozer lives up to the recent hype. If it doesn't, AMD is going to be in a really bad position, with Ivy Bridge and 22nm right around the corner.

bmadd89 · Jan 15, 2011

Skurge said:
Wasnt Barcelona on the same process as the last Athlon X2s?

Yeah, it was, and they still had their issues. And Barcelona was too big for 65nm to start with, but it still shows what can happen.

Stoneburner said:
Always? No. Prescott taught them this lesson.

Sorry, I should have said tick-tock. But do you not agree that it just increases the potential risk? I'm no engineer, but if Intel has a hard time predicting what will happen, then a company with a smaller R&D budget is going to have a harder time. The last thing we need is 3/4 of a BD core being released because of these issues.

HW2050Plus · Jan 15, 2011

hamunaptra said:
Well, I'm hoping for mid-end parts that clock EXTREMELY well at really nice prices, bringing back amazing bang for buck. I'm also wondering if there will be any 4-core mid-end parts that are full-fledged 8-core parts disabled due to defects / demand, and if so, whether they will be unlockable like current X2s and so on....

This would be freakin awesome, and like I said, 5.5GHz on air OC!!! OOOO that would be so nice woot!

From all we know, we can reasonably guess that AMD will put Bulldozer 8-core CPUs up against Intel Sandy Bridge 4-core CPUs, so they will be priced similarly. I think AMD's strategy will be: buy from us and you get double the core count (and therefore better overall performance).

On the cost side it would be no problem either, since an 8-core AMD costs around the same to produce as a 4-core Intel.

However, given that AMD is reactivating the FX brand and Intel priced Sandy Bridge that low, I expect that AMD will want to earn real money and won't make it too cheap, at least at the beginning. AMD could possibly force Intel into a price war. If you also consider that AMD, or rather GlobalFoundries, is making record investments in new fabs, they seem very confident of selling a lot.

The situation will get far worse in the server market, where core count is mission critical and there is simply no important application that does not scale almost linearly.

Desktop market share will be limited anyway because AMD does not have enough fabs, but in the server market AMD could regain the 50% market share it once had with AMD64 and dual-core Opterons. And this time that could happen for dual- or single-socket servers as well.

We will have to see which price level they choose and how it develops. It may be cheap relative to performance but not really cheap overall. So let's hope they ramp up more fabs, and quickly.

200 USD and below for an eight-core is likely, but that would be a low-bin part (clock/cache); higher-clocked (FX) parts could reach 1000 USD.

HW2050Plus · Jan 15, 2011

bmadd89 said:
Yeah, it was, and they still had their issues. And Barcelona was too big for 65nm to start with, but it still shows what can happen.

Sorry, I should have said tick-tock. But do you not agree that it just increases the potential risk? I'm no engineer, but if Intel has a hard time predicting what will happen, then a company with a smaller R&D budget is going to have a harder time. The last thing we need is 3/4 of a BD core being released because of these issues.

Sure, there is a risk, and it has hit AMD already. They had Phenom, but their 65nm process ramped very slowly, so there was a huge delay for the new design.

That is the risk, and Intel avoids just that. Intel simply has the financial power to bring up a new process a year ahead.

On the other hand, AMD can optimize the design to the process or vice versa.

AtenRa · Jan 16, 2011

JFAMD said:
Well, our 60-entry FP scheduler is bigger than the SB scheduler for both INT and FP (they do not have dedicated schedulers).

I was comparing the FP schedulers of AMD's CPUs.

Barcelona has a 36-entry FP scheduler to feed an FP execution unit with 3 pipelines (FADD, FMUL, FMISC) plus SSE, while Istanbul has a 42-entry FP scheduler to feed the same FP execution unit.

Now in Bulldozer we have dual 128-bit FMACs, meaning we have double the execution units of Barcelona/Istanbul inside the BD module but with a single 60-entry FP scheduler.

My question was this: because we have doubled the FP execution units (2x 128-bit FMACs) but get a single 60-entry scheduler, narrower than a doubled one (36x2), could we have a bottleneck in the FP execution units? So even if we get a 128-bit FMAC for each core, the utilization might not be the same as it would be with two independent FP schedulers + two FP execution units.
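
The comparison in this post comes down to a little arithmetic, sketched here in Python using only the entry counts quoted above. The even per-thread split is an assumption made for illustration; a shared scheduler need not divide its entries evenly.

```python
# FP scheduler entries quoted in the post above.
BARCELONA_ENTRIES = 36   # feeds one core's FP unit
ISTANBUL_ENTRIES = 42    # same FP unit, larger scheduler
BULLDOZER_ENTRIES = 60   # shared by the two threads of a module

THREADS_PER_MODULE = 2

# Size of two fully duplicated Barcelona-style schedulers.
duplicated = BARCELONA_ENTRIES * THREADS_PER_MODULE

# Assumption: entries split evenly when both threads are FP-heavy.
per_thread = BULLDOZER_ENTRIES / THREADS_PER_MODULE

print(f"duplicated Barcelona schedulers: {duplicated} entries")
print(f"Bulldozer shared scheduler: {BULLDOZER_ENTRIES} entries "
      f"({BULLDOZER_ENTRIES / duplicated:.0%} of that)")
print(f"even split per thread: {per_thread:.0f} entries "
      f"vs {BARCELONA_ENTRIES} (Barcelona) and {ISTANBUL_ENTRIES} (Istanbul)")
```

If a single FP-heavy thread could use all 60 entries, it would have more than either older core had, so the concern applies mainly when both threads in a module are FP-heavy at once.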

hamunaptra · Jan 17, 2011

What's more beneficial per core: shared schedulers or dedicated schedulers? To me, it seems like shared is more flexible, as in the following case: if only one pipe is needed for some execution, it has most of the schedulers dedicated to it.
If both pipes are being used, the schedulers are working their hardest to distribute work appropriately, but hopefully they can mix and match / be more efficient in a shared setup?

If that's the case, couldn't the "not quite" doubling of schedulers be a non-issue, since going to shared is a more effective use of them?

Bateluer · Jan 17, 2011

I'm looking to rebuild my machine in a few months, and I'll wait until I can see SB against Bulldozer models to make an informed decision. My E8500 is getting long in the tooth, but I can stave off the upgrade bug for a time with my impending purchase of a Radeon 6970.
