Rumour: Bulldozer 50% Faster than Core i7 and Phenom II.

LoneNinja · Feb 17, 2011

ShadowVVL said:
I'm looking forward to

1. seeing Zambezi X8 vs Sandy Bridge.

2. the performance difference a Zambezi X4 will have over the Phenom II X4 970, hoping for a 30% increase.

I'm hoping to see a Zambezi X4 @ $139-$149 and an X8 at $249.

If the price is right I might jump on the 8 core train.

I expect the cheapest 8 core part to be priced about where the cheapest i7 from Intel is, or higher. AMD isn't a charity; they price their parts according to how they perform, and an 8 core Bulldozer had better outperform a 4 core Sandy Bridge. They've also shown a 6 core Bulldozer on some of the more recent roadmaps. Llano X2/X4 and Bulldozer X4/X6/X8 would crowd their pricing bracket even more than it is currently if everything is <$250.

Mopetar · Feb 17, 2011

LoneNinja said:
I expect the cheapest 8 core part to be priced about where the cheapest i7 from Intel is, or higher. AMD isn't a charity; they price their parts according to how they perform, and an 8 core Bulldozer had better outperform a 4 core Sandy Bridge. They've also shown a 6 core Bulldozer on some of the more recent roadmaps. Llano X2/X4 and Bulldozer X4/X6/X8 would crowd their pricing bracket even more than it is currently if everything is <$250.

I'm guessing that Llano will be aimed at the mainstream whereas the 8 core BD chips will be aimed at professionals and enthusiasts. If the performance is good compared to SB, AMD will probably price it higher. I'm pretty sure they'd love to be able to sell a $400 CPU.

nonameo · Feb 17, 2011

ShadowVVL said:
I'm looking forward to

1. seeing Zambezi X8 vs Sandy Bridge.

2. the performance difference a Zambezi X4 will have over the Phenom II X4 970, hoping for a 30% increase.

I'm hoping to see a Zambezi X4 @ $139-$149 and an X8 at $249.

If the price is right I might jump on the 8 core train.

I dunno, I'd think sub-$100 for a 4 thread Bulldozer sounds more right.

drizek · Feb 17, 2011

Ya, quads for less than $100 and octals for less than $200.

AMD is up against some pretty stiff competition. Unless Bulldozer performance is super amazing, they will need to get the prices down.

They can kinda get away with selling Thuban for $200 because it is compatible with everyone's old motherboard. But Bulldozer is a whole new platform, which means it actually needs to be competitively priced.

ElFenix · Feb 17, 2011

Mopetar said:
I'm guessing that Llano will be aimed at the mainstream whereas the 8 core BD chips will be aimed at professionals and enthusiasts. If the performance is good compared to SB, AMD will probably price it higher. I'm pretty sure they'd love to be able to sell a $400 CPU.

outside of the graphics, llano sounds like a decidedly low end chip. it's an Athlon II X4 (a2x4) with a die shrink. that was fast about 3 years ago. it's still serviceable for everything but high end gaming. that probably reflects more on the state of software than anything else.

it'll be a fantastic notebook chip, i bet. that's where the volume is in the x86 market right now, so not a bad design target.

ShadowVVL · Feb 17, 2011

Well, I think Llano will be around $99, but it might be $69 for Llano and $129 for a 4c BD. Yeah, an 8c Zambezi at $349 seems more accurate; $400 might be too high unless it can take out the i7 2600, otherwise SB would be a better deal.

Phynaz · Feb 17, 2011

Octacore for $200 isn't going to happen unless performance is horrible.

AMD will price these comparable to their Intel counterparts. AMD hasn't had pricing power in nearly five years.

If AMD hits it out of the park, then they will have the power to set prices, and they will not be cheap. Remember A64X2 pricing?

ShadowVVL · Feb 17, 2011

That's most likely true, since AMD needs money, and the more money they make the more they can spend on CPU development to better keep up with Intel.

mosox · Feb 17, 2011

More rumors:
http://i.imgur.com/JY3oC.jpg

http://i.imgur.com/oWbgN.png

http://scarletwhore.com/?p=3277

drizek · Feb 17, 2011

It's all BS.

There is no such thing as "bulldozer". That thing might as well be a dual socket 16-core Opteron "bulldozer", or it might be a quad core desktop "bulldozer" or anything in between. Until they actually tell us what kind of CPU that is, the entire benchmark, even if it is real, is completely pointless. They're comparing it to CPUs that cost less than $200 and more than $1000. There is no point of reference here whatsoever.

Also, his claim about Apple using AMD fusion products is still a big question mark, especially since the MBAirs are now rumored to be using Sandy Bridge and not Brazos.

hamunaptra · Feb 17, 2011

Consider that AMD is bringing back the FX branding, which they haven't used since they were on top of the world in performance. I take that to allude to AMD being quite confident in BD's performance vs anything Intel has out right now.
Also, I would guess AMD has a clever way of countering Intel's future releases this time around, as long as the FX brand is around.

AMD wouldn't be bringing back the FX if BD downright sucked.

drizek · Feb 17, 2011

Well the other rumor was that all bulldozer chips were going to be called FX.

Also, remember AMD's "fusion utility"? They don't seem to mind whoring out their branding too much.

HW2050Plus · Feb 17, 2011

AtenRa said:
ftp://download.intel.com/technology/itj/2002/volume06issue01/vol6iss1_hyper_threading_technology.pdf

page 10, figure 6: Out-Of-Order execution engine detailed pipeline

Carefully read this please. See e.g.:
If a logical processor has used its limit of a needed resource, such as store buffer entries, the allocator will signal stall for that logical processor and continue to assign resources for the other logical processor.

That is what I already said: if there is a stall, the other logical processor may execute.

You have to distinguish between just filling buffers (which does not mean execution, only getting the buffers full, which reduces switching time if a switch occurs) and the execution of instructions. As I have written above, the other thread can recover from a stall while waiting.

AtenRa said:
Page 11 ...

Intel's HT CAN do that

This one statement:

For example, the schedulers could

dispatch two uops from one logical processor and two
uops from the other logical processor in the same clock
cycle.

really does indicate that, because dispatch usually means readying for execution (assigning an Intel issue port), and normally there is no further buffer in the execution unit, so it cannot be just buffer filling.

However, I have now read multiple documents about HT, including Intel's Architecture Reference, excerpts from the book "Programming with Hyper-Threading Technology" from Intel Press (ISBN 0-9702846-9-1), and various (!) others.

After reading all that, I have come to the conclusion that the above statement relates to a special NetBurst feature. In NetBurst, Port 0 and Port 1 are "double pumped", meaning they can execute two instructions in one cycle, and the simple ALU is divided into a low cycle half and a high cycle half.

Therefore the scheduler may dispatch 2 uops of the first logical processor in the low cycle half and then 2 uops of the other logical processor in the high cycle half. Maybe that is why this sentence starts with the quite unique "For example", which isn't used elsewhere in this description. That could indicate that this happens only in an example case, namely when such half-cycle uops are dispatched.

That, and the document in general, means that the two threads are dispatched alternately, so I was wrong about this preferred and waiting thread. It works as I previously described for the symmetric threads of the Sun SPARC T1. I don't know why I had this so differently in mind; maybe I mixed up the way Intel's HT does it with the Sun SPARC T1, or it came from POWER6, I don't know.

So I still say that the above statement comes from this special NetBurst ability of having two units running on half cycles. Formally you would be correct with your cycle statement, but practically you are not, since it does not work in general but only for instructions that can be executed in half cycles, and then the half cycle is what a real cycle is. Moreover, that would only apply to NetBurst HT and not to HT used in Core2.

But having read all those many documents about HT, which differ from each other in the details of how Hyper-Threading works, it is no longer possible to make such definite statements. It's so weird: one document says that if one thread stalls on NetBurst it causes the other thread to stall as well, which is quite the opposite of what another document says, namely that in that case one thread gets all resources. But there are arguments for the former, since the document describes this as the reason two new CPU instructions, monitor and mwait, were added in Prescott to get rid of some nasty Hyper-Threading issues of Northwood (e.g. using mwait to reduce this stall causing the other unit to stall, in at least some cases).

I am afraid that no one really understands HT in depth and detail when even several Intel authors come to different statements.

And at least in NetBurst there are some flaws in Hyper-Threading which make it not work right in some cases. The additional Prescott HT instructions should reduce these issues, at least in some cases, if they are used. I do not know if Core CPUs still have these HT issues (at least they still have the mwait and monitor instructions). And I could hardly find any information about HT in Core2. So it is just an assumption that it works exactly the same way as in NetBurst.

maddie · Feb 17, 2011

Phynaz said:
Octacore for $200 isn't going to happen unless performance is horrible.

AMD will price these comparable to their Intel counterparts. AMD hasn't had pricing power in nearly five years.

If AMD hits it out of the park, then they will have the power to set prices, and they will not be cheap. Remember A64X2 pricing?

I knew we could agree on something.:biggrin:

HW2050Plus · Feb 17, 2011

jvroig said:
It isn't. This reply also goes for maddie, above.

We've all gone through this discussion before, several months ago. I am in no mood to do it all over again for people who still don't get it.

Here is a link to Anand clarifying this issue: http://www.anandtech.com/show/2881.
Scroll down to the very bottom:

I am not singling out maddie or HW2050 or anybody, but before anybody accuses anybody of being "flawed in reasoning" or "just your own speculation" in a technical discussion, it might be a good idea to please get the facts straight. I personally keep out of CPU discussions these days because I have gotten sick of teenagers (or adults maybe) who have never even read through the first few chapters of any computer microarchitecture / comp organization book, yet feel obliged to argue technical issues using "knowledge" from marketing materials against real professionals who either deal with these issues directly through hardware (design) or low-level software optimizations, or teach the damn comp organization / uarch course, or any or all of those combined. I can see the signal to noise ratio still hasn't improved. I should not have bothered in the first place.

All I can do now is say thank you to the participants I have interacted with, and bid everyone goodbye. I bow out of the thread already, there is nothing more I can do.

Regards.

Okay now I got the exact module information about AMD Bulldozer from AMD: A Module consists of 213 million transistors and has a die area of 30.9 mm².

Let's recap my calculation with that:

4 Modules would take exactly 123.6 mm²
8 MB of L3 cache takes ~36 mm²
Uncore takes ~43 mm²
That would result in 203 mm².
AMD Zambezi: 32 nm process, 203 mm² die area, 4M/8C, 16 MB cache

So it would take a little less die size than a Sandy Bridge and somewhat less than a current Thuban (in 45 nm).

For comparison:
Intel Sandy Bridge: 32 nm process, 216 mm² die area, 4C/8T, 8 MB cache
AMD Phenom II: 45 nm process, 258 mm² die area, 4C/4T, 8 MB cache
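
The recap above can be re-run as a quick sketch. The module area is the AMD figure quoted in this post; the L3 and uncore areas are only my estimates, and the names in the code are made up for illustration.

```python
# Figures quoted in the post. Only the module area comes from AMD;
# the L3 and uncore areas are estimates, not official numbers.
MODULE_AREA_MM2 = 30.9     # one Bulldozer module including 2 MB L2
MODULES = 4                # 4M/8C Zambezi
L3_AREA_MM2 = 36.0         # estimate for 8 MB L3
UNCORE_AREA_MM2 = 43.0     # estimate for the uncore

# Die sizes used for comparison in the post.
REFERENCE_DIES_MM2 = {
    "Intel Sandy Bridge (32 nm, 4C/8T)": 216.0,
    "AMD Phenom II (45 nm, 4C/4T)": 258.0,
}


def zambezi_die_area():
    """Sum of modules, L3 and uncore in mm²."""
    return MODULES * MODULE_AREA_MM2 + L3_AREA_MM2 + UNCORE_AREA_MM2


if __name__ == "__main__":
    total = zambezi_die_area()
    print(f"Estimated Zambezi die: {total:.1f} mm²")
    for name, area in REFERENCE_DIES_MM2.items():
        saving = 100.0 * (1.0 - total / area)
        print(f"{name}: {area:.0f} mm², estimate is {saving:.0f}% smaller")
```

Changing the two estimated constants shows how sensitive the total is to the uncore and L3 guesses.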

Let's also recap the statement regarding core size increase by BD Module as there were many numbers flying around (5%, 12%, 50% more for BD Module).

A shrunk Thuban core would be ~73 mm²/4 = ~18.25 mm² (w/o L2), whereas a Bulldozer module is 30.9 mm² (w/ 2 MB L2). 2 MB of L2 would be 9 mm², so without L2 the module would be 21.9 mm². That is 20% more than a Thuban core.
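
The same arithmetic as a small sketch, in case anyone wants to plug in other estimates. The 73 mm² and 9 mm² values are my estimates from above, not AMD figures, and the names are only illustrative.

```python
# Estimates from the post, except the module area, which is from AMD.
FOUR_SHRUNK_THUBAN_CORES_MM2 = 73.0   # four die-shrunk cores, without L2 (estimate)
BD_MODULE_WITH_L2_MM2 = 30.9          # Bulldozer module including 2 MB L2 (AMD)
L2_2MB_MM2 = 9.0                      # area of 2 MB L2 (estimate)


def percent_more(new, old):
    """How many percent larger 'new' is than 'old'."""
    return (new / old - 1.0) * 100.0


thuban_core = FOUR_SHRUNK_THUBAN_CORES_MM2 / 4
module_without_l2 = BD_MODULE_WITH_L2_MM2 - L2_2MB_MM2

if __name__ == "__main__":
    print(f"Shrunk Thuban core: {thuban_core:.2f} mm²")
    print(f"BD module w/o L2:   {module_without_l2:.1f} mm²")
    print(f"Increase: {percent_more(module_without_l2, thuban_core):.0f}%")
```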

And that, I think, is where these big differences come from. The low statements (5%/12%) were about a Bulldozer module w/o L2 compared to a die-shrunk Thuban core w/o L2.

Then they likely asked an engineer again and he said "no, it is 50%", but he was comparing a Bulldozer module w/ 2 MB L2 to a die-shrunk Thuban core w/o L2.

I have an older estimate based on analysis of a die picture, where a BD module was sized at 18 mm² for the module and 10 mm² for the L2. That was obviously wrong, as AMD has given out the figure of exactly 30.9 mm², which is 10% more than what they got from the die picture estimate.

PreferLinux · Feb 17, 2011

HW2050Plus said:
Carefully read this please. See e.g.:
If a logical processor has used its limit of a needed resource, such as store buffer entries, the allocator will signal stall for that logical processor and continue to assign resources for the other logical processor.

That is what I already said: if there is a stall, the other logical processor may execute.

You have to distinguish between just filling buffers (which does not mean execution, only getting the buffers full, which reduces switching time if a switch occurs) and the execution of instructions. As I have written above, the other thread can recover from a stall while waiting.

This one statement:

really does indicate that, because dispatch usually means readying for execution (assigning an Intel issue port), and normally there is no further buffer in the execution unit, so it cannot be just buffer filling.

However, I have now read multiple documents about HT, including Intel's Architecture Reference, excerpts from the book "Programming with Hyper-Threading Technology" from Intel Press (ISBN 0-9702846-9-1), and various (!) others.

After reading all that, I have come to the conclusion that the above statement relates to a special NetBurst feature. In NetBurst, Port 0 and Port 1 are "double pumped", meaning they can execute two instructions in one cycle, and the simple ALU is divided into a low cycle half and a high cycle half.

Therefore the scheduler may dispatch 2 uops of the first logical processor in the low cycle half and then 2 uops of the other logical processor in the high cycle half. Maybe that is why this sentence starts with the quite unique "For example", which isn't used elsewhere in this description. That could indicate that this happens only in an example case, namely when such half-cycle uops are dispatched.

That, and the document in general, means that the two threads are dispatched alternately, so I was wrong about this preferred and waiting thread. It works as I previously described for the symmetric threads of the Sun SPARC T1. I don't know why I had this so differently in mind; maybe I mixed up the way Intel's HT does it with the Sun SPARC T1, or it came from POWER6, I don't know.

So I still say that the above statement comes from this special NetBurst ability of having two units running on half cycles. Formally you would be correct with your cycle statement, but practically you are not, since it does not work in general but only for instructions that can be executed in half cycles, and then the half cycle is what a real cycle is. Moreover, that would only apply to NetBurst HT and not to HT used in Core2.

But having read all those many documents about HT, which differ from each other in the details of how Hyper-Threading works, it is no longer possible to make such definite statements. It's so weird: one document says that if one thread stalls on NetBurst it causes the other thread to stall as well, which is quite the opposite of what another document says, namely that in that case one thread gets all resources. But there are arguments for the former, since the document describes this as the reason two new CPU instructions, monitor and mwait, were added in Prescott to get rid of some nasty Hyper-Threading issues of Northwood (e.g. using mwait to reduce this stall causing the other unit to stall, in at least some cases).

I am afraid that no one really understands HT in depth and detail when even several Intel authors come to different statements.

And at least in NetBurst there are some flaws in Hyper-Threading which make it not work right in some cases. The additional Prescott HT instructions should reduce these issues, at least in some cases, if they are used. I do not know if Core CPUs still have these HT issues (at least they still have the mwait and monitor instructions). And I could hardly find any information about HT in Core2. So it is just an assumption that it works exactly the same way as in NetBurst.

You ought to be banned for correcting technical documentation. It will say exactly what it means. Also, in case you didn't notice, it wasn't saying the thread had to stall, it was saying it would signal a stall when there wasn't one.

AtenRa · Feb 17, 2011

If a logical processor has used its limit of a needed resource, such as store buffer entries, the allocator will signal stall for that logical processor and continue to assign resources for the other logical processor.

This is not a pipeline stall, say from a misprediction; this is a halt when we have come to a resource limit.
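
To make the distinction concrete, here is a toy model of the allocator behaviour described in that quote. It is only a sketch under loud assumptions: the per-thread limit, the starting occupancy, the strict alternation and every name in it are invented for illustration, and it is not how the real hardware is implemented.

```python
def simulate(requests, limit, in_use, cycles):
    """Toy allocator for two logical processors (LP0 and LP1).

    requests: store-buffer entries each LP still wants (hypothetical workload).
    limit:    hypothetical per-LP cap on entries.
    in_use:   entries each LP already holds at the start.
    Entries are never freed here, which keeps the model tiny.
    Returns the final occupancy and a log of (cycle, lp, event) tuples.
    """
    used = list(in_use)
    pending = list(requests)
    log = []
    for cycle in range(cycles):
        preferred = cycle % 2                    # alternate priority each cycle
        for lp in (preferred, 1 - preferred):
            if pending[lp] == 0:
                continue                         # this LP wants nothing
            if used[lp] >= limit:
                log.append((cycle, lp, "stall"))  # at its cap: stall signalled
                continue                         # keep serving the other LP
            used[lp] += 1
            pending[lp] -= 1
            log.append((cycle, lp, "alloc"))
            break                                # one allocation per cycle
    return used, log


if __name__ == "__main__":
    used, log = simulate(requests=[5, 5], limit=4, in_use=[3, 0], cycles=5)
    for entry in log:
        print(entry)
    print("final occupancy:", used)
```

In this run LP0 hits its cap after one allocation, gets a stall signalled, and LP1 keeps receiving entries in the same cycle, which is the resource-limit halt and not a pipeline flush.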

HW2050Plus · Feb 17, 2011

AtenRa said:
This is not a pipeline stall, say from a misprediction; this is a halt when we have come to a resource limit.

It is exactly a memory stall which is also explicitly written (memory buffers full, means memory stall is signaled). I did not write that it is a stall from misprediction.

Mopetar · Feb 17, 2011

We might also see some price stratification based on clock speed. The low end eight core chips might be more reasonably priced, but if they can outperform SB, expect the black edition parts to carry some extra premium.

I'm still fairly curious about how good single threaded performance will be. I'm not expecting an individual BD core to be as powerful as an individual SB core, but considering that the turbo core for all cores is 500 MHz, that would seem to indicate that if the chip were only using one core (or module, depending on how power gating is handled) it would be able to boost significantly higher.

Perhaps AMD's strategy for light workloads is to turbo the hell out of the system.

AtenRa · Feb 17, 2011

Well, Sandy Bridge is a superscalar, OoO (out-of-order), Hyper-Threading (SMT) processor that can simultaneously execute uops (μops) from two different threads in the processor's execution units (Ports 0 to 5).

Let's continue with the Bulldozer thread now

Arkadrel · Feb 17, 2011

Let's continue with the Bulldozer thread now

HyperTransport Technology 3.1
AES encryption acceleration
Flex FP (AVX 8x32-bit/4x64-bit commands per unit per cycle, x4 modules (8 cores)).
Turbo Core Boost
SSE4.1, SSE4.2, AES, CLMUL, (XOP, FMA4 and CVT16), etc.

L1 cache is 16K data per core, 64K instruction shared between 2 cores
L2 cache is 2MB shared between 2 cores
L3 cache is 8MB shared between 8 cores

The integer cores in Bulldozer will be faster than the integer cores in our current products - JF

We have already said one core per thread, period. But a single thread gets all of the front end, all of the FPU and all of the L2 cache if there is not a second thread on the module. -JF

With our 16-core Interlagos we expect ~50% greater performance than our 12-core AMD Opteron 6100 series processors. 33% more cores, 50% more performance, so the per core performance is actually higher. -JF

So we know the 16-core Interlagos is gonna have ~50% more performance than an Opteron 6176SE.... how does that compare to a Sandy Bridge? Hell if I know. I suspect the 8 core will be able to at least match an i7 2600K though ^-^
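
For what it's worth, JF's "33% more cores, 50% more performance" works out like this. A rough sketch only: it assumes throughput scales perfectly with core count, and the function name is just something I made up.

```python
def per_core_gain(old_cores, new_cores, total_gain):
    """Relative per-core throughput change.

    Assumes the workload scales perfectly with core count, which is
    an idealisation and not something JF stated.
    total_gain is a fraction, e.g. 0.50 for "50% more performance".
    """
    core_ratio = new_cores / old_cores
    return (1.0 + total_gain) / core_ratio - 1.0


if __name__ == "__main__":
    # 12-core Opteron 6100 series vs 16-core Interlagos, ~50% more performance
    gain = per_core_gain(old_cores=12, new_cores=16, total_gain=0.50)
    print(f"Per-core gain: {gain * 100:.1f}%")
```

So under that assumption each Interlagos core would deliver about 12.5% more than a 6100-series core on whatever workload JF was quoting.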

Ajay · Feb 17, 2011

Arkadrel said:
So we know the 16-core Interlagos is gonna have ~50% more performance than an Opteron 6176SE.... how does that compare to a Sandy Bridge? Hell if I know. I suspect the 8 core will be able to at least match an i7 2600K though ^-^

We don't know much until Intel and AMD release their next gen solutions. Comparing a desktop chip to a server based one is pointless - very different operational parameters, different workloads etc.

Ajay · Feb 17, 2011

Phynaz said:
Octacore for $200 isn't going to happen unless performance is horrible.

AMD will price these comparable to their Intel counterparts. AMD hasn't had pricing power in nearly five years.

If AMD hits it out of the park, then they will have the power to set prices, and they will not be cheap. Remember A64X2 pricing?

Yes. This is a problem that is both good for us (consumers) and bad for us. Pricing on low end parts will drop if BD is a hit, but then BD will be expensive as you point out (which gives AMD more $$s for R&D) and so on...

Ajay · Feb 17, 2011

Here's an article from RWT going over SMT (in regards to alpha EV8 ghost CPU) - but it's still a valuable resource.

Phynaz · Feb 18, 2011

Ajay said:
We don't know much until Intel and AMD release their next gen solutions. Comparing a desktop chip to a server based one is pointless - very different operational parameters, different workloads etc.

They come from the same die.
