Rumour: Bulldozer 50% Faster than Core i7 and Phenom II.

Status
Not open for further replies.

Topweasel

Diamond Member
Oct 19, 2000
5,437
1,659
136
Why would AMD bother with BD module/core stuff if one core was bigger than the old one?

I don't get it. I am reading about 35ish million transistors per core, not including the L2 cache, for Llano. Since Llano has 1MB of L2 per core and BD has 2MB per module, it doesn't make sense that Llano is as small as it is while a BD module sits at 213 million. Even if half of that is L2 cache, the per-core size would be 50% larger.

There is something missing in these estimates. Either Llano estimates are off (very possible), or the BD module size being reported includes the 2MB L3 that is sectioned off per module.

If that is the case, then one third of that would be 71 million for the module, or 35.5 million transistors per core. I haven't researched how to calculate cache sizes in transistors, so I don't know how that extra 2MB of cache would figure into the equation. But BD can't be two cores combined together "saving transistors" and "slower" yet 50% larger.
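To make the arithmetic above explicit, here is a rough Python sketch. It assumes, as the post does and without confirmation, that the 213 million figure covers one two-core module and that either one half or two thirds of it is cache. The function name and constants are illustrative only.

```python
# Rough split of the reported 213M-transistor Bulldozer module into per-core logic.
# Assumptions from the post, not confirmed figures: a module holds two cores,
# and either 1/2 or 2/3 of the module's transistors are cache.
MODULE_TRANSISTORS_M = 213.0      # millions, as reported in the thread
LLANO_CORE_TRANSISTORS_M = 35.0   # "35ish" million per Llano core, excluding L2
CORES_PER_MODULE = 2


def per_core_logic(module_m: float, cache_fraction: float) -> float:
    """Millions of non-cache transistors per core for a given cache share."""
    return module_m * (1.0 - cache_fraction) / CORES_PER_MODULE


half_cache = per_core_logic(MODULE_TRANSISTORS_M, 1 / 2)        # L2 only
two_thirds_cache = per_core_logic(MODULE_TRANSISTORS_M, 2 / 3)  # L2 plus an L3 slice

for label, value in (("1/2 cache", half_cache), ("2/3 cache", two_thirds_cache)):
    ratio = value / LLANO_CORE_TRANSISTORS_M
    print(f"{label}: {value:.2f}M per core, {ratio:.2f}x a Llano core")
```

With half of the module counted as cache this lands at about 53M per core, roughly 50% above the Llano estimate. With two thirds counted as cache it lands at 35.5M, about the same as Llano.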
 

IntelUser2000

Elite Member
Oct 14, 2003
8,686
3,787
136
Alright time for clarity.

Bulldozer: 30.9mm2 for a module with 2MB L2
Llano: 9.69mm2 for a core

If we take out the L2, the Bulldozer module is a little under 18mm2 (17.8-17.9).
 

ShadowVVL

Senior member
May 1, 2010
758
0
71
Well, how many transistors per core did the PII 1090T have?

I'm sure BD will be fine. It might be near the 2600K or it might not; we just have to wait and see.

I don't see why everyone is getting worked up over a CPU we don't have yet.
 

AtenRa

Lifer
Feb 2, 2009
14,003
3,362
136
Alright time for clarity.

Bulldozer: 30.9mm2 for a module with 2MB L2
Llano: 9.69mm2 for a core

If we take out the L2, the Bulldozer module is a little under 18mm2 (17.8-17.9).

Yep,

BD module + 2MB L2 = 213M transistors (30.9mm2)
Llano core + 1MB L2 = 110M transistors (17.7mm2)
 

Topweasel

Diamond Member
Oct 19, 2000
5,437
1,659
136
Alright time for clarity.

Bulldozer: 30.9mm2 for a module with 2MB L2
Llano: 9.69mm2 for a core

If we take out the L2, the Bulldozer module is a little under 18mm2 (17.8-17.9).

So that would put it nearly 10% smaller than Llano per core, not 50% larger.
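As a quick check of that comparison, a small Python sketch using only the areas quoted in this thread. It assumes the module area without L2 can be split evenly between the two cores, which ignores that parts of a module are shared; the variable names are made up for illustration.

```python
# Per-core die area comparison using the figures quoted in the thread.
# Assumption: a Bulldozer module without L2 (~17.8 mm2) is split evenly
# between its two cores, which ignores that some units are shared.
BD_MODULE_NO_L2_MM2 = 17.8
LLANO_CORE_MM2 = 9.69

bd_per_core_mm2 = BD_MODULE_NO_L2_MM2 / 2
relative_size = bd_per_core_mm2 / LLANO_CORE_MM2

print(f"BD per core: {bd_per_core_mm2:.2f} mm2")
print(f"Relative to a Llano core: {relative_size:.1%}")
```

On these numbers a BD "core" comes out at roughly 92% of a Llano core, so a bit under 10% smaller.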
 

Janooo

Golden Member
Aug 22, 2005
1,067
13
81
Some quotes from AMD

Nov 2009:

AMD has come back to us with a clarification: the 5% figure was incorrect. AMD is now stating that the additional core in Bulldozer requires approximately an additional 50% die area. That's less than a complete doubling of die size for two cores, but still much more than something like Hyper Threading.
http://www.anandtech.com/Show/Index/2881?cPage=4&all=False&sort=0&page=1

This does not tell us much. Die area of what? A 1-module, 2-module or 4-module chip? Or is it 50% on top of a "hypothetical 1-core module"?


Aug 2010:


bulldozerefficient.jpg


http://www.anandtech.com/show/3863/amd-discloses-bobcat-bulldozer-architectures-at-hot-chips-2010/4


This one is as good as it gets. Here we know what the 100% is, and it seems it adds 5% to a 4-module chip. It's also almost one year newer info.
 

JFAMD

Senior member
May 16, 2009
565
0
0
Why would AMD bother with BD module/core stuff if one core was bigger than the old one?

Where are you getting this idea? An 8-core Valencia is smaller than a 6-core Lisbon.

Don't forget slower as well.

Where are you getting this idea? A 16-core Interlagos has 50% more throughput than a 12-core Magny Cours, despite having only 33% more cores. That says that on a "per core" basis a BD core is faster.
 

Janooo

Golden Member
Aug 22, 2005
1,067
13
81
Where are you getting this idea? An 8-core Valencia is smaller than a 6-core Lisbon.


...
The comparison was Bulldozer and Llano. All I am saying is that 1 module is expected to be smaller than 2 full cores. Nothing else.
I understand that Bulldozer has some new stuff and Llano could lose some 'fat' but it seems it's still the case.
Let us know if we made a mistake.
Why are you comparing 45nm and 32nm CPUs?
 

podspi

Golden Member
Jan 11, 2011
1,982
102
106
There is no reason to think that a BD core should have fewer transistors than a Phenom II or Llano core. The module design allows AMD to save space compared to a hypothetical Bulldozer core without the module design, not compared to the older architecture's cores.

In order to get higher IPC, you need a more advanced design. In order to build a more advanced design, you need more transistors. Of course Llano's cores are super-tiny on 32nm, they were originally designed for much older nodes! But as comes up quite frequently, they are also slower (IPC) than another certain company's more modern designs ... The fact that AMD is still competing with what is essentially a heavily modified K8 is a testament to how awesome Hammer really was :biggrin:
 

RobertPters77

Senior member
Feb 11, 2011
480
0
0
Where are you getting this idea? A 16-core Interlagos has 50% more throughput than a 12-core Magny Cours, despite having only 33% more cores. That says that on a "per core" basis a BD core is faster.
I've been wondering about that. The math simply doesn't add up well for BD.

For argument's sake, let's assume that a 12-core Magny Cours does 100 'X' ops at 1GHz. 100/12 = 8.333 X ops per core at 1GHz.

Then we assume that a 16-core Interlagos does 150 'X' ops at the same speed. 150/16 = 9.375 X ops per core at 1GHz.

So right there we have a problem.
9.375/8.333 = 1.125 = 112.5%. Which means that clock for clock and core for core, Bulldozer is on average 12% faster than K10.5/Stars.

Now I hope I'm wrong on this, because a 12% performance boost won't be enough to sway anyone to buy BD over Sandy or even the current K10 chips.
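The same calculation as a short Python sketch. The 100 and 150 'X' ops are the made-up units from the post, and equal clocks are assumed, which the claim itself does not state; the function name is illustrative.

```python
# Per-core throughput ratio from the post's hypothetical numbers.
# Assumption (from the post, not from AMD): both chips run at the same clock,
# and "throughput" scales as a simple ops count.
def per_core_ratio(new_throughput: float, new_cores: int,
                   old_throughput: float, old_cores: int) -> float:
    """Ratio of per-core throughput, new chip versus old chip."""
    return (new_throughput / new_cores) / (old_throughput / old_cores)


interlagos_vs_magny_cours = per_core_ratio(150.0, 16, 100.0, 12)
print(f"Per-core throughput ratio: {interlagos_vs_magny_cours:.3f}")
```

This reproduces the 1.125 figure, i.e. about 12% more throughput per core under those assumptions.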
 

Mopetar

Diamond Member
Jan 31, 2011
8,490
7,739
136
I've been wondering about that. The math simply doesn't add up well for BD.

For argument's sake, let's assume that a 12-core Magny Cours does 100 'X' ops at 1GHz. 100/12 = 8.333 X ops per core at 1GHz.

Then we assume that a 16-core Interlagos does 150 'X' ops at the same speed. 150/16 = 9.375 X ops per core at 1GHz.

So right there we have a problem.
9.375/8.333 = 1.125 = 112.5%. Which means that clock for clock and core for core, Bulldozer is on average 12% faster than K10.5/Stars.

Now I hope I'm wrong on this, because a 12% performance boost won't be enough to sway anyone to buy BD over Sandy or even the current K10 chips.

Yeah, that math works out to 50% performance increase with 33% more resources. Of course JFAMD said throughput, which could mean who knows exactly what.

We don't know what kind of power budget we can get that performance for either. Is it the same amount of power, less power, or more? No one here knows for absolutely sure. The same goes for what kind of price we can expect to pay.

Too many unknowns to know for sure. Speculation and educated guesses are one thing. Official PR from AMD is another, but I hardly expect them to show all of their cards before they have to.
 

Soleron

Senior member
May 10, 2009
337
0
71
And besides that I don't care what AMD marketing says about AMD processors. I know this and I am writing this while knowing those statements. All that taking information from AMD engineers. Either the info from the optimization manual or the official statement of an AMD engineer at ISSCC stating that IPC goes down. Several pages before around ISSCC you can see a post where I confronted JFAMD with the statement of one of their engineers. And that was the engineer presenting the BD module at ISSCC.

Yeah, and that was shot down last time by this slide:

http://h-5.abload.de/img/47mob.jpg

Also, the only source for your claim is a badly written EEtimes article. The source is secondary; it could have quoted him out of context or he could have been referring to Bobcat (where AMD has said "90%" before). Is there a direct-from-AMD statement saying that IPC decreases?


As you see the OFFICIAL point is that IPC Deneb > BD!
As JFAMD posts all the time. His statements are his private view!

He says that as a legal precaution, like all corporate blog posts do. When he speaks about AMD's plans, he is stating AMD's actual position to the best of his ability. He writes a lot of the slides for conferences too so if he is mistaken then so are the audiences of the last few server briefings and analyst days.

JF has probably said it increases over 100 times now. He would have been corrected by now, because external statements about performance are the most risky and scrutinised.
 

JFAMD

Senior member
May 16, 2009
565
0
0
I've been wondering about that. The math simply doesn't add up well for BD.

For argument's sake, let's assume that a 12-core Magny Cours does 100 'X' ops at 1GHz. 100/12 = 8.333 X ops per core at 1GHz.

Then we assume that a 16-core Interlagos does 150 'X' ops at the same speed. 150/16 = 9.375 X ops per core at 1GHz.

So right there we have a problem.
9.375/8.333 = 1.125 = 112.5%. Which means that clock for clock and core for core, Bulldozer is on average 12% faster than K10.5/Stars.

Now I hope I'm wrong on this, because a 12% performance boost won't be enough to sway anyone to buy BD over Sandy or even the current K10 chips.

The 12% math is not right. The performance claim is about throughput; you are trying to nail down clock speed. There is a difference.
 

Idontcare

Elite Member
Oct 10, 1999
21,110
64
91
Of course JFAMD said throughput, which could mean who knows exactly what.

It is a rather specific term, it means something specific and it specifically does not mean other things.

http://en.wikipedia.org/wiki/High-throughput_computing

The problem here is ignorance on our part (speaking in general) of the vernacular that is relevant to the characterization of computing systems.

Rather than come to terms with the vernacular, many members opt instead to assume that JF is using these words as a way to effect marketing obfuscation and babble-speak.

But there is no other word for him to use, the chip is designed for computing throughput, it is what it is and he's using the correct terminology to describe it. A rose by any other name...

I don't have insider knowledge but my expectations of Bulldozer are along the lines of taking a SUN (Oracle now) Niagara processor, changing the cores to be much higher DEC-like alpha monsters instead of weak-sauce Sparc cores, and adding in a liberal amount of turbo-clocking to substantially boost performance of low thread-count apps.

So you get the throughput on your throughput-intensive apps, and you get the timeliness of a high-speed low-latency processor for your low-thread count apps.

Intel has gone after the same brass ring too; it only makes sense. Amazing how divergent Intel and AMD can be at times, and yet from the 50k-ft perspective the products are surprisingly similar for all the same reasons.
 

Soleron

Senior member
May 10, 2009
337
0
71
Now I hope I'm wrong on this, because a 12% performance boost won't be enough to sway anyone to buy BD over Sandy or even the current K10 chips.

The 50% claim is for a top-bin Interlagos and a top bin Magny Cours. That says nothing about desktop performance, because the workloads are different, clockspeeds are much higher, and Turbo is an unknown factor.

What if, hypothetically, BD could Turbo a single core to 5GHz when only one thread was running? That could be much faster than a single SB core. It's thermally possible.
 

Topweasel

Diamond Member
Oct 19, 2000
5,437
1,659
136
Where are you getting this idea? An 8-core Valencia is smaller than a 6-core Lisbon.

Where are you getting this idea? A 16-core Interlagos has 50% more throughput than a 12-core Magny Cours, despite having only 33% more cores. That says that on a "per core" basis a BD core is faster.

Sorry no sarcasm emote.

A user (not pointing fingers) had basically stated that BD ate more power and that per "core" (module divided by 2) it was larger than a Llano core. I was merely adding that he also assumed it was slower by his math, meaning that AMD was trying to go bankrupt by attaching themselves to a new, slower, and bigger CPU.
 

Topweasel

Diamond Member
Oct 19, 2000
5,437
1,659
136
The 50% claim is for a top-bin Interlagos and a top bin Magny Cours. That says nothing about desktop performance, because the workloads are different, clockspeeds are much higher, and Turbo is an unknown factor.

What if, hypothetically, BD could Turbo a single core to 5GHz when only one thread was running? That could be much faster than a single SB core. It's thermally possible.

Agreed, which is why this IPC-based focus is irritating. Without knowing several key pieces of information, most importantly shipping speeds and turbo-mode bins, it is going to be nearly impossible to predict BD performance. Even if BD is 50% slower per "core" than SB at a given speed, if it clocks twice as fast then it's all good. In fact, there have been many programs and single-threaded apps that run better with clock speed no matter the "IPC". The few wins the P4 could maintain are proof of that.

The fact is the barrier to being a faster CPU would be much lower for AMD if the "IPC" increases with BD are true (easier to believe coming from AMD than from one person who says his random math says it doesn't). Right now it's down 20% in performance per clock vs. SB. If it eats up another 5% in trying to catch up to SB per clock, then it only needs enough clock to eat up that last 15%. A 4GHz Turbo mode would compete very well, 4.5GHz would be a substantial lead, and a 5GHz turbo mode would make it a ridiculous CPU.
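A rough way to put numbers on that, as a Python sketch. It assumes the simplification that performance is per-clock throughput times clock speed, and it uses a 3.4GHz rival clock purely as an illustrative placeholder rather than a measured SB figure. The function name is hypothetical.

```python
# Clock needed to match a rival when per-clock performance is lower.
# Assumption: performance ~ (per-clock throughput) x (clock speed), which
# ignores memory, turbo behaviour and workload differences.
def required_clock_ghz(rival_clock_ghz: float, relative_per_clock: float) -> float:
    """Clock at which a chip with the given relative per-clock performance ties."""
    if relative_per_clock <= 0:
        raise ValueError("relative_per_clock must be positive")
    return rival_clock_ghz / relative_per_clock


RIVAL_CLOCK_GHZ = 3.4  # illustrative placeholder, not a benchmarked value

for deficit in (0.20, 0.15):
    needed = required_clock_ghz(RIVAL_CLOCK_GHZ, 1.0 - deficit)
    print(f"{deficit:.0%} per-clock deficit: needs about {needed:.2f} GHz to tie")
```

Under those assumptions a 15% per-clock deficit against a 3.4GHz part is cancelled out at about 4GHz, which is in line with the figures mentioned above.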

Many tests need to be run and much more information needs to be made available.
 

JFAMD

Senior member
May 16, 2009
565
0
0
Too many people are looking for a single silver bullet to definitively say one thing is better or worse than another.

But the things they are comparing are multifaceted and focused at a wide range of needs.

The reality is that after BD is launched both sides will have "definitive proof" that their opinion is right, and the arguments will continue to go on and on.

People should buy what is the best product for their needs and stop telling people that because it is right for me it has to be right for you. Everyone's needs are different.
 

Abwx

Lifer
Apr 2, 2011
11,885
4,873
136
The 50% claim is for a top-bin Interlagos and a top bin Magny Cours. That says nothing about desktop performance, because the workloads are different, clockspeeds are much higher, and Turbo is an unknown factor.

What if, hypothetically, BD could Turbo a single core to 5GHz when only one thread was running? That could be much faster than a single SB core. It's thermally possible.

To be more accurate, that is for integer performance. If JFAMD were in the HPC department rather than the server one, he would have quoted 80%+ better performance than MC (in FP).
 

HW2050Plus

Member
Jan 12, 2011
168
0
0
I've been wondering about that. The math simply doesn't add up well for BD.

For argument's sake, let's assume that a 12-core Magny Cours does 100 'X' ops at 1GHz. 100/12 = 8.333 X ops per core at 1GHz.

Then we assume that a 16-core Interlagos does 150 'X' ops at the same speed. 150/16 = 9.375 X ops per core at 1GHz.

So right there we have a problem.
9.375/8.333 = 1.125 = 112.5%. Which means that clock for clock and core for core, Bulldozer is on average 12% faster than K10.5/Stars.

Now I hope I'm wrong on this, because a 12% performance boost won't be enough to sway anyone to buy BD over Sandy or even the current K10 chips.
The problem here, and why you get a 12% performance boost, is that the Bulldozer parts will be clocked higher than Magny Cours.

So core for core and clock for clock, BD is slower than Magny Cours. However, because Bulldozer will have a significantly higher clock than Magny Cours, it will be able to be 12% faster.

This is an AMD statement as well. You have the ISSCC statement, and when you combine the latest information from AMD with the architecture information, there is an issue.

I will try again to explain exactly what the problem is. Basically it is a marketing bubble:


If you look at the picture, everything looks basically nice. The question that comes up, however, is how a single decoder could be able to feed all those hungry pipelines.

And here AMD marketing comes into play. They draw 4 nice pipelines, which look like more than the 3 pipelines of Magny Cours. Great marketing; they just don't tell you that the 4 pipelines only carry micro ops, and two micro ops make up a macro op. That gives only 2 pipelines. Now look again at this picture and at the integer pipelines of Magny Cours, and you will see what the issue is. Instead of the 6 integer pipelines and 2 128-bit FPUs of Magny Cours, in Bulldozer you have only 4 integer pipelines and 2 128-bit FPUs.

And now comes what I say is obvious from the decoders, so you could have seen this even before: from the optimization guide it turned out that these 4 pipelines are not what they appeared to be. In Magny Cours the decoders can do 3 MacroOPS/cycle. That is enough to feed all 3 integer pipes; the two decoders of two cores can feed 6 integer pipelines, and everything is fine. In Bulldozer you have an enhanced decoder which can do 4 MacroOPS/cycle. This is enough to feed 2 pipelines on each of two integer cores, i.e. 4 pipelines in a module.
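The decoder argument boils down to a simple division, sketched here in Python. The widths (3 and 4 MacroOPS/cycle) are the figures claimed in this post, not verified specifications, and the even split between two busy cores is an assumption that says nothing about the case where only one thread is running. The function name is illustrative.

```python
# Decode bandwidth available per core, using the widths claimed in the post.
# Assumption: a shared decoder is split evenly between cores that are both busy.
def decode_slots_per_core(decoder_width: int, cores_sharing_decoder: int) -> float:
    """MacroOPS per cycle each core receives from its (possibly shared) decoder."""
    return decoder_width / cores_sharing_decoder


magny_cours_per_core = decode_slots_per_core(3, 1)  # one decoder per core
bulldozer_per_core = decode_slots_per_core(4, 2)    # one decoder per two-core module

print(f"Magny Cours: {magny_cours_per_core:.1f} MacroOPS/cycle per core")
print(f"Bulldozer:   {bulldozer_per_core:.1f} MacroOPS/cycle per core")
```

That is the "2 vs. 3" comparison in its simplest form; it models peak decode width only and not how well each design keeps its pipelines fed.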

You also see that the issue is mostly integer related, since for FP/SSE it looks okay or even good, however you like to put it.

I fully agree that IPC is not the point. A Bulldozer core will be faster than a Stars core, but only because it is clocked MUCH higher.

You criticize my 0.8 per core statement? I will tell you something:
Interlagos is 50% faster than Magny Cours.
If you strip off the 33% more cores you are at 1.12 times faster. Now with the BD design's 30% higher clock (22 FO4 vs. 17 FO4):
1.12 / 1.3 = 0.86
Okay, that is 0.06 more than I claimed; however, it might shrink to 0.8 if you consider another clock bump from 32 nm!
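Spelled out as a Python sketch, with every input taken from this post: the 50% throughput claim, the 16 versus 12 core counts, and an assumed 30% clock advantage. The 30% is the poster's estimate from the FO4 figures and not a published clock speed; the constant names are made up.

```python
# Per-core, per-clock estimate from the post's own inputs.
# Assumptions: throughput scales linearly with core count and clock,
# and the clock advantage is the post's ~30% estimate (22 FO4 vs. 17 FO4).
THROUGHPUT_RATIO = 1.50       # Interlagos vs. Magny Cours, as claimed
CORE_COUNT_RATIO = 16 / 12    # 33% more cores
CLOCK_RATIO = 1.30            # assumed clock advantage, not a confirmed figure

per_core_ratio = THROUGHPUT_RATIO / CORE_COUNT_RATIO
per_core_per_clock_ratio = per_core_ratio / CLOCK_RATIO

print(f"Per core:           {per_core_ratio:.3f}")
print(f"Per core per clock: {per_core_per_clock_ratio:.3f}")
```

The result is about 1.125 per core and about 0.87 per core per clock, close to the 0.86 above; the small difference comes from rounding 1.125 down to 1.12. The whole estimate stands or falls with the assumed clock ratio.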

All of this comes directly from AMD: AMD performance statements, AMD engineer statements, AMD presentations and AMD documents. And if you look at the design it is totally clear why it is like that.

Now coming back to explain why JFAMD is right as well:
He says "IPC is higher" and he is right, because he is talking about micro-op IPC. A BD core does 4 micro ops/cycle, which is more than 3 MacroOPS per cycle by sheer number. 4 is more than 3. However, if you do not compare apples and oranges, then you get a lower IPC: 2 vs. 3 (simplified). Now that is the difference between a statement from an engineer and one from a marketing guy. The marketing guy just does not specify what is meant by "instruction". And since Stars doesn't know micro ops, he is even right and no liar, though the statement is completely useless (yeah, a marketing statement).

On the other hand, yes, AMD Bulldozer has 8 cores and that will be enough to surpass the current Sandy Bridge. The problem is not this year. The problem will come next year when Intel issues an 8-core Sandy Bridge. What makes this even worse is that an 8-core / 16-thread Sandy Bridge will consume around the same die space as the 4-module / 8-core Bulldozer. And that is on 32 nm, and Intel's 22 nm is coming next year as well.

AMD has a brand new design and a brand new process, and both together are not sufficient to stay competitive. That means that one year from now AMD will be in exactly the same position as it is now, but they will have shot their wad (new process, new design).

And heck, I am not telling anyone to buy or not buy something. I am just making a technical analysis to get an educated guess on what to expect from the brand new Bulldozer architecture.

And yes, I am disappointed with what they achieved.
 

Abwx

Lifer
Apr 2, 2011
11,885
4,873
136
In Magny Cours the decoders can do 3 MacroOPS/cycle. That is enough to feed all 3 integer pipes; the two decoders of two cores can feed 6 integer pipelines, and everything is fine. In Bulldozer you have an enhanced decoder which can do 4 MacroOPS/cycle. This is enough to feed 2 pipelines on each of two integer cores, i.e. 4 pipelines in a module.




Now coming back and to explain why JFAMD is right as well:
He says "IPC is higher" and he is right, because he is talking about micro-op IPC. A BD core does 4 micro ops/cycle, which is more than 3 MacroOPS per cycle by sheer number. 4 is more than 3. However, if you do not compare apples and oranges, then you get a lower IPC: 2 vs. 3 (simplified).


A truly apples-to-apples comparison would take account of the architectural efficiency of BD, which we are not aware of at this point.

In short, what about the number of cycles needed by K10 to execute those theoretical 3 macro ops, and the comparative speed of BD in executing those 3 macro ops, since they are broken into a number of micro ops by the scheduler before execution by the two ALUs and eventually the two AGLUs?

Unless you can answer this question, all speculation about BD's architectural performance is only wild guessing.
 

IntelUser2000

Elite Member
Oct 14, 2003
8,686
3,787
136
A truly apples-to-apples comparison would take account of the architectural efficiency of BD, which we are not aware of at this point.

Even if you could, unless you have a Bulldozer architecture simulator (or could do that in your head), you won't be able to know how it performs beyond general expectations.

So every time you try to go into more detail, you are just making up excuses for wasting time on something you'll never be able to figure out.
 