AMD sheds light on Bulldozer, Bobcat, desktop, laptop plans

Page 13 - AnandTech Forums

IntelUser2000

Elite Member
Oct 14, 2003
8,686
3,787
136
I don't get the first sentence, IDC. Stronger compared to what?

Update: I get what you mean. Yeah, that seems right.
 

GaiaHunter

Diamond Member
Jul 13, 2008
3,732
432
126
So with Bulldozer one thread is 25% stronger than the other, right?

But since the second thread is "close enough" in power we just call the Bulldozer module "dual core" instead of single core with super strong hyperthreading.

I'm not sure.

Both cores are the same. If only one thread is being processed, it will run at 100%, just like on Nehalem.

If both cores are being used, you lose 20% of the performance compared to 2 independent cores of that architecture due to the shared resources, so instead of 200% you get 180%.

Imagine your PC had to run this list of instructions, "A+B+C+D", and it took 2 seconds.

Now you have 2 of those lists, "A+B+C+D" and "E+F+G+H". A single core would take 4 seconds.

A dual core of the first CPU would do both in 2 seconds.

Bulldozer would take about 2.22 seconds to finish both lists.

A single core with HT (20% boost) would take about 3.33s; with a 30% boost, about 3.08s.
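That arithmetic can be reproduced in a few lines of Python (a back-of-the-envelope sketch, not from the original post; the scaling factors are the ones quoted above):

```python
# Two lists of instructions, each taking 2 seconds on one core,
# so 4 seconds of total single-core work to get through.
total_work = 2 * 2.0  # seconds

def time_to_finish(aggregate_throughput):
    """Wall-clock time to finish all the work, where throughput is
    expressed as a multiple of a single core's throughput."""
    return total_work / aggregate_throughput

cmp_dual = time_to_finish(2.0)   # two full CMP cores
bd_module = time_to_finish(1.8)  # one Bulldozer module at ~80% scaling
ht_20 = time_to_finish(1.2)      # single core + HT, 20% boost
ht_30 = time_to_finish(1.3)      # single core + HT, 30% boost

print(cmp_dual, bd_module, ht_20, ht_30)  # 2.0, ~2.22, ~3.33, ~3.08
```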
 

ilkhan

Golden Member
Jul 21, 2006
1,117
1
0
See, it's that semantics thing again.
Since they're positioning one module against Intel's dual core, and one module gives roughly 80% scaling, they're positioning 1.8 AMD cores against 2.0 Intel cores. The real question is: how's the $/mm^2 going to be?
 

Idontcare

Elite Member
Oct 10, 1999
21,110
64
91
-snip-
Imagine you had your PC and it had to do these instructions, "A+B+C+D", and would take 2 seconds.

Now you have 2 of those lists "A+B+C+D" and "E+F+G+H". A single core would take 4 seconds.

A dual core of the first CPU will do both in 2 seconds for a total CPU time of 4 seconds.
-snip-

See, it's that semantics thing again.
Since they're positioning one module against Intel's dual core, and one module gives roughly 80% scaling, they're positioning 1.8 AMD cores against 2.0 Intel cores. The real question is: how's the $/mm^2 going to be?

[Image: kaigai2.jpg - goto-san's graphic comparing throughput and cost across CMP, CMT, and SMT]


I think goto-san captures it neatly in this graphic. Look to the right and notice how far the "throughput" and "cost" bars extend.

Highest throughput means highest performance on multi-threaded apps.

Note how both CMT (as in Bulldozer) and SMT (as in Nehalem) have lower throughput, but with the compromise of an even lower cost per unit of throughput (i.e. higher performance/cost).

The whole idea here is to intentionally create lower-cost cores while minimizing the reduction in throughput. It is a tradeoff, and it's not a tradeoff made so that a dual-core Bulldozer (a single Bulldozer module) will suddenly have higher throughput (higher performance) than a CMP-based quad-core (or dual-core, for that matter).
 

piesquared

Golden Member
Oct 16, 2006
1,651
473
136
This architecture also looks like it would scale down very well to smaller process nodes.
I also bet they've tried this on a 45nm wafer, so they probably have a pretty good idea of its performance.
 

GaiaHunter

Diamond Member
Jul 13, 2008
3,732
432
126
The whole idea here is to intentionally create lower-cost cores while minimizing the reduction in throughput. It is a tradeoff, and it's not a tradeoff made so that a dual-core Bulldozer (a single Bulldozer module) will suddenly have higher throughput (higher performance) than a CMP-based quad-core (or dual-core, for that matter).

Indeed.

But it is also based on the architecture performance.

Take Nehalem, for example: it has SMT, but it is also faster clock for clock than the Core 2 architecture.

It is like i3 vs. i5 vs. Athlon II X3/X4 vs. Phenom II X3/X4.
 

Idontcare

Elite Member
Oct 10, 1999
21,110
64
91
20% higher IPC though?

Think about it, we are trying to compare a 4C/4T bulldozer to a 4C/4T deneb and we want to convince ourselves that the bulldozer will have higher performance at the same clock?

It has already been said that "at best" the four threads on Bulldozer will give a thread scaling of 1.8^2 = 3.24x owing to CMT-based limitations, whereas four threads on Deneb (being a CMP-based architecture) can still deliver 4x scaling.

So at the same clockspeed we want to believe that bulldozer architecture will be improved enough over that of Deneb that 3.24x thread scaling will still yield higher performance than 4x thread scaling...that requires the bulldozer cores (each core, not each module) to have 23.5% higher throughput above and beyond that of Deneb's architecture.

And that is a minimum throughput improvement criterion since we used the best-case scenario number of 1.8x thread scaling per module.

You guys have really really tall expectations for the prospects of IPC improvements in sequential microarchitectures. Not that it can't be done, but those kinds of IPC improvements just seem unrealistic.
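The break-even arithmetic above can be reproduced in a couple of lines (a hypothetical Python sketch of the numbers as stated, using IDC's 1.8^2 model):

```python
per_module_scaling = 1.8                 # AMD's quoted best-case two-thread scaling per module
bd_4t_scaling = per_module_scaling ** 2  # IDC's model for four threads on two modules: 3.24x
deneb_4t_scaling = 4.0                   # CMP: four independent cores

# Per-core IPC uplift Bulldozer needs just to break even with Deneb
required_ipc_uplift = deneb_4t_scaling / bd_4t_scaling - 1
print(f"{required_ipc_uplift:.1%}")  # 23.5%
```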
 

piesquared

Golden Member
Oct 16, 2006
1,651
473
136
20% higher IPC though?

Think about it, we are trying to compare a 4C/4T bulldozer to a 4C/4T deneb and we want to convince ourselves that the bulldozer will have higher performance at the same clock?

It has already been said that "at best" the four threads on Bulldozer will give a thread scaling of 1.8^2 = 3.24x owing to CMT-based limitations, whereas four threads on Deneb (being a CMP-based architecture) can still deliver 4x scaling.

So at the same clockspeed we want to believe that bulldozer architecture will be improved enough over that of Deneb that 3.24x thread scaling will still yield higher performance than 4x thread scaling...that requires the bulldozer cores (each core, not each module) to have 23.5% higher throughput above and beyond that of Deneb's architecture.

And that is a minimum throughput improvement criterion since we used the best-case scenario number of 1.8x thread scaling per module.

You guys have really really tall expectations for the prospects of IPC improvements in sequential microarchitectures. Not that it can't be done, but those kinds of IPC improvements just seem unrealistic.

Yeah, but since when do multicore CPUs offer perfect scaling, even with the best multithreaded software out there?

Although that point is probably moot since I believe AMD said that performance number is relative to existing cores.

Finally there is one very important thing missing from these equations: clock speed. This will be on the new 32nm SOI HiK gate first process, which appears to have very good characteristics. Should know more on that early next year. Maybe this thing really sings with a high clock.
 

nonameo

Diamond Member
Mar 13, 2006
5,902
2
76
Really, it sounds to me like what AMD is doing is less like hyperthreading and more like doing a tug-o-borrow between the cores. IOW, it's more like having 4 cores at 80% of the performance (but at about 50% of the die-space cost, because they share a few things) than like HT (which uses a little extra core space to make 2 cores look like 4, exploiting otherwise unused resources for a bump in performance).

Really, I don't know how you can really compare HT and the new bulldozer arch.

But what I want to know is how the new bulldozer design is going to affect single threaded performance. Will a module utilize only half of its resources with a single thread, or will it be able to take advantage of resources that might otherwise be idle?

I'm really confused.
 

GaiaHunter

Diamond Member
Jul 13, 2008
3,732
432
126
20% higher IPC though?

Think about it, we are trying to compare a 4C/4T bulldozer to a 4C/4T deneb and we want to convince ourselves that the bulldozer will have higher performance at the same clock?

It has already been said that "at best" the four threads on Bulldozer will give a thread scaling of 1.8^2 = 3.24x owing to CMT-based limitations, whereas four threads on Deneb (being a CMP-based architecture) can still deliver 4x scaling.

So at the same clockspeed we want to believe that bulldozer architecture will be improved enough over that of Deneb that 3.24x thread scaling will still yield higher performance than 4x thread scaling...that requires the bulldozer cores (each core, not each module) to have 23.5% higher throughput above and beyond that of Deneb's architecture.

And that is a minimum throughput improvement criterion since we used the best-case scenario number of 1.8x thread scaling per module.

You guys have really really tall expectations for the prospects of IPC improvements in sequential microarchitectures. Not that it can't be done, but those kinds of IPC improvements just seem unrealistic.

Probably they are unrealistic.

Sometimes we get surprises, though.

But do you reckon 4 Bulldozer cores (2 modules) will be slower than a Phenom II X4?

And so, 16 Bulldozer cores is 12.96 cores vs interlagos 12?
 

ilkhan

Golden Member
Jul 21, 2006
1,117
1
0
The whole idea here is to intentionally create lower-cost cores while minimizing the reduction in throughput. It is a tradeoff, and it's not a tradeoff made so that a dual-core Bulldozer (a single Bulldozer module) will suddenly have higher throughput (higher performance) than a CMP-based quad-core (or dual-core, for that matter).
I see that. But the problem I'm seeing is that AMD is reducing the performance of 2 cores ("we'll give you 2 cores with 1.8 cores' worth of performance"), while Intel is improving the performance of each core ("here's a quad core with 4.6x performance").

Unless the smallest Bulldozer they'll ever sell is a 2-module "quad" part, in which case I'd call it a dual core with 3.8x performance and AMD would do pretty well.

Again, semantics.
 

ShawnD1

Lifer
May 24, 2003
15,987
2
81
I see that. But the problem I'm seeing is that AMD is reducing the performance of 2 cores ("we'll give you 2 cores with 1.8 cores' worth of performance"), while Intel is improving the performance of each core ("here's a quad core with 4.6x performance").

Unless the smallest Bulldozer they'll ever sell is a 2-module "quad" part, in which case I'd call it a dual core with 3.8x performance and AMD would do pretty well.
All comes down to cost. AMD's cheap Athlon II X4 cores are significantly slower than Intel E8400 cores, but it's still a good processor because the price is right. If these newer and even more crippled AMD processors are anything like the Athlon II X4, I'll be first in line to get one.

Really, it sounds to me like what AMD is doing is less like hyperthreading and more like doing a tug-o-borrow between the cores
Which may or may not be a bad thing. This would be comparable to an Intel Q8200 and E8400 having a similar number of transistors but they have a different number of cores. Both have their advantages and disadvantages.
 

Idontcare

Elite Member
Oct 10, 1999
21,110
64
91
Yeah, but since when do multicore CPUs offer perfect scaling, even with the best multithreaded software out there?

Although that point is probably moot since I believe AMD said that performance number is relative to existing cores.

Finally there is one very important thing missing from these equations: clock speed. This will be on the new 32nm SOI HiK gate first process, which appears to have very good characteristics. Should know more on that early next year. Maybe this thing really sings with a high clock.

The reduction in thread scaling (2x -> 1.8x) is intended to reflect that thread scaling is that much less than it would have been were there no shared resources at the core/module level.

So an application like LinX, which scales to ~3.5x for four threads on four CMP cores, would be expected to scale to 3.5 x 0.9 ≈ 3.15x on a CMT-styled architecture like that of BD.

It's not that the BD architecture magically makes all apps thread to 90% while other CMP-based architectures do even worse; it's that it scales less well than CMP, but the idea is that you can potentially have more cores (because each is that much smaller) to throw at the problem.
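Applying that penalty to a measured CMP scaling figure is a one-liner (a hypothetical Python sketch; the LinX number is the illustrative one from the post):

```python
def cmt_scaling(cmp_scaling, penalty=0.9):
    """Discount a measured CMP thread-scaling figure by the per-thread
    CMT sharing penalty (0.9, from AMD's 1.8x-per-module claim)."""
    return cmp_scaling * penalty

linx_cmp = 3.5                    # ~3.5x scaling for four threads on four CMP cores
linx_cmt = cmt_scaling(linx_cmp)  # expected on a CMT design: ~3.15x
print(linx_cmt)
```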

Regarding clockspeed: we have been explicit that we are talking about IPC here. Yes, a clockspeed advantage would create an even larger performance advantage for BD, but we aren't talking about that yet. We are just trying to ascertain whether we should really expect a BD module (2 cores) to have greater dual-threaded throughput than an Athlon II X2 or Phenom II X2 (Deneb-based) dual-core chip.

I don't say that to dismiss your points about clockspeed, I think your points are spot-on.

But what I want to know is how the new bulldozer design is going to affect single threaded performance. Will a module utilize only half of its resources with a single thread, or will it be able to take advantage of resources that might otherwise be idle?

That is the advantage of CMT with Bulldozer...if a module gets loaded with only one thread then the aggregate of all the otherwise shared resources in the module are solely at the disposal of enhancing IPC of that one thread.

How much benefit? Could be 5-10%. Basically the opposite of the thread-scaling penalty experienced when the shared resources are actually being shared.

Probably they are unrealistic.

Sometimes we get surprises, though.

But do you reckon 4 Bulldozer cores (2 modules) will be slower than a Phenom II X4?

And so, 16 Bulldozer cores is 12.96 cores vs interlagos 12?

At the same clockspeed? Yes. But not by much, and we must consider that those 4 BD cores could be as much as 33% smaller in aggregate die area compared to the 4 Deneb cores we are comparing them to.

Remember with CMT your throughput (performance) goes down some (AMD says ~20% when the modules are loaded with two threads versus if the modules were CMP with no shared resources) but your cost (die area) goes down even more.

Also remember Interlagos is BD-based. Are you thinking of Magny-Cours? (12-core based on 45nm Lisbon cores, like Deneb) If that is the case then yes, I think at the same clockspeed 16-core Interlagos will be comparable to 12-core Magny-Cours (in 16+ threaded apps obviously) but will occupy vastly smaller die-area (even if Interlagos were 45nm, but it is 32nm so it will be even smaller still).

And because Interlagos will be 32nm I expect clockspeeds to be upwards of 30-40% higher than Magny-Cours, so for reasons piesquared mentioned the actual performance of Interlagos versus magny-cours could be quite substantial (30-40%).
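The "effective cores" arithmetic behind this exchange (GaiaHunter's 12.96 figure) can be sketched as follows (hypothetical Python, using the 1.8x-per-module best case and the squared per-core model from earlier in the thread):

```python
per_module_scaling = 1.8                         # best-case two-thread scaling per module
per_core_factor = (per_module_scaling / 2) ** 2  # 0.81 per core under the 1.8^2 model
interlagos_effective = 16 * per_core_factor      # ~12.96 CMP-equivalent cores
magny_cours_cores = 12

print(interlagos_effective)  # ~12.96, hence "comparable" to Magny-Cours at equal clocks
```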
 

GaiaHunter

Diamond Member
Jul 13, 2008
3,732
432
126
I don't understand why we are saying these new processors are crippled.

Bulldozer isn't a Deneb core.

The number of instructions for 1 integer core is going to be increased.

AMD already increased IPC from the Athlon X2 to the Phenom X2 and then again to the Athlon II X2, for example. The Athlon II X2 is something like 10-30% faster at the same clocks than the Athlon X2, depending on the application.

I don't see why a Bulldozer can't be 20% faster than a Deneb (although AMD didn't say the Magny-Cours vs. Interlagos comparison was at the same clock speed, so it could be at the same power instead).

It is a new architecture. For example each Bulldozer core is 4-way INT and 2 to 4-way FP (depending on sharing), up from the 3-way each in K8/K10 today.

So I don't really know that we can assume a module is worse than a dual-core Phenom II.

EDIT: Yes I meant Magny-Cours (12 cores) vs Interlagos (16 cores).
 

Idontcare

Elite Member
Oct 10, 1999
21,110
64
91
I don't understand why we are saying these new processors are crippled.

Bulldozer isn't a Deneb core.

The number of instructions for 1 integer core is going to be increased.

AMD already increased IPC from the Athlon X2 to the Phenom X2 and then again to the Athlon II X2, for example. The Athlon II X2 is something like 10-30% faster at the same clocks than the Athlon X2, depending on the application.

I don't see why a Bulldozer can't be 20% faster than a Deneb (although AMD didn't say the Magny-Cours vs. Interlagos comparison was at the same clock speed, so it could be at the same power instead).

EDIT: Yes I meant Magny-Cours (12 cores) vs Interlagos (16 cores).


We aren't saying they are crippled. We are recognizing that the cores share resources within the module and that resource sharing leads to resource contention, very much the same thing that reduces performance with SMT/hyperthreading just to a lesser extent with CMT in bulldozer.

Were the cores in Bulldozer CMP-based like Nehalem (HT disabled) or Deneb, then the resource contention would be reduced to that of the L3$ and memory controller, i.e. the shared aspects of the uncore.
 

Martimus

Diamond Member
Apr 24, 2007
4,490
157
106
20% higher IPC though?

Think about it, we are trying to compare a 4C/4T bulldozer to a 4C/4T deneb and we want to convince ourselves that the bulldozer will have higher performance at the same clock?

It has already been said that "at best" the four threads on Bulldozer will give a thread scaling of 1.8^2 = 3.24x owing to CMT-based limitations, whereas four threads on Deneb (being a CMP-based architecture) can still deliver 4x scaling.

So at the same clockspeed we want to believe that bulldozer architecture will be improved enough over that of Deneb that 3.24x thread scaling will still yield higher performance than 4x thread scaling...that requires the bulldozer cores (each core, not each module) to have 23.5% higher throughput above and beyond that of Deneb's architecture.

And that is a minimum throughput improvement criterion since we used the best-case scenario number of 1.8x thread scaling per module.

You guys have really really tall expectations for the prospects of IPC improvements in sequential microarchitectures. Not that it can't be done, but those kinds of IPC improvements just seem unrealistic.

I thought that benchmark was a purely integer-driven benchmark. Wouldn't that make the "remove 20% because the two cores are sharing a single FPU" arrangement less meaningful? You have to look at what is being measured here, not just blindly say that each additional core is at best 80% as efficient as the previous core. Plus, the 80% figure was never meant to be a best-case scenario; it was just a guess based on the above statement. The integer pipeline was improved immensely, as stated previously by a few posters. To say that it improved by 20% doesn't sound like a huge stretch to me, especially for a brand-new architecture with years of R&D focused on that area, and especially when Intel has already proven that this kind of integer performance can be had with their architecture.

I have a headache from the flu and some dental work, and it is difficult to think straight, so maybe I am not understanding your logic here. I think I will need to PM you to figure out why you feel that a 20% improvement in integer calculations is unreasonable, when the improvements seem to point to a much higher theoretical gain in that area. I probably just don't have a good understanding of that benchmark and what it is measuring. I also don't have a great understanding of the BD architecture, but I swear I remember reading that the integer units and FPUs both have double the bandwidth of the Shanghai architecture.
 

GaiaHunter

Diamond Member
Jul 13, 2008
3,732
432
126
I think I will need to PM you to figure out why you feel that a 20% improvement in integer calculations is unreasonable

Please do it in the thread - I think we all want any insight on this matter (and I'm not saying this because I think IDC is right or wrong or being silly - just for information).
 

piesquared

Golden Member
Oct 16, 2006
1,651
473
136
I don't understand why we are saying these new processors are crippled.

Bulldozer isn't a Deneb core.

The number of instructions for 1 integer core is going to be increased.

AMD already increased IPC from the Athlon X2 to the Phenom X2 and then again to the Athlon II X2, for example. The Athlon II X2 is something like 10-30% faster at the same clocks than the Athlon X2, depending on the application.

I don't see why a Bulldozer can't be 20% faster than a Deneb (although AMD didn't say the Magny-Cours vs. Interlagos comparison was at the same clock speed, so it could be at the same power instead).

It is a new architecture. For example each Bulldozer core is 4-way INT and 2 to 4-way FP (depending on sharing), up from the 3-way each in K8/K10 today.

So I don't really know that we can assume a module is worse than a dual-core Phenom II.

EDIT: Yes I meant Magny-Cours (12 cores) vs Interlagos (16 cores).

I don't think it's crippled at all. Just putting 2 of these together 'on die'* will be a threaded titan, and it seems to be implemented well so far from what little we can see. The int pipes are where the data spends most of its time, it seems. Doubling those up at the expense of a much-underutilized FP unit seems to make a lot of sense, while beefing up the one that remains. And if the design scales well up to 8 modules, it could be a pretty big leap in performance. Should be interesting.


* IIRC, JF has described Bulldozer variations as a number of Bulldozer modules on die.
 

Idontcare

Elite Member
Oct 10, 1999
21,110
64
91
I don't think it's crippled at all. Just putting 2 of these together 'on die'* will be a threaded titan, and it seems to be implemented well so far from what little we can see. The int pipes are where the data spends most of its time, it seems. Doubling those up at the expense of a much-underutilized FP unit seems to make a lot of sense, while beefing up the one that remains. And if the design scales well up to 8 modules, it could be a pretty big leap in performance. Should be interesting.


* IIRC, JF has described Bulldozer variations as a number of Bulldozer modules on die.

Yeah, I also realized I was double-counting the CMT penalizations in my Interlagos vs. Magny-Cours post above. A 16-core BD would be expected to perform about the same as a 12-core Magny-Cours at the same clock even before counting any integer improvements (which we know exist), so we should expect 16-core Interlagos to outperform 12-core Magny-Cours at the same clock by probably 25-30%, then add a 30-40% clockspeed boost on top of that to scale out your specINT_rate expectations.
 

Idontcare

Elite Member
Oct 10, 1999
21,110
64
91
I thought that benchmark was a purely integer-driven benchmark. Wouldn't that make the "remove 20% because the two cores are sharing a single FPU" arrangement less meaningful? You have to look at what is being measured here, not just blindly say that each additional core is at best 80% as efficient as the previous core. Plus, the 80% figure was never meant to be a best-case scenario; it was just a guess based on the above statement. The integer pipeline was improved immensely, as stated previously by a few posters. To say that it improved by 20% doesn't sound like a huge stretch to me, especially for a brand-new architecture with years of R&D focused on that area, and especially when Intel has already proven that this kind of integer performance can be had with their architecture.

I have a headache from the flu and some dental work, and it is difficult to think straight, so maybe I am not understanding your logic here. I think I will need to PM you to figure out why you feel that a 20% improvement in integer calculations is unreasonable, when the improvements seem to point to a much higher theoretical gain in that area. I probably just don't have a good understanding of that benchmark and what it is measuring. I also don't have a great understanding of the BD architecture, but I swear I remember reading that the integer units and FPUs both have double the bandwidth of the Shanghai architecture.

Martimus the bulldozer cores share more resources than just the FPU. The integer units share resources as well (within the same module).

At any rate, there is a distinction to be made between expecting 20% and expecting more than 23% as a minimum. 20% is reasonable as an upper limit of what we tend to expect from a new microarchitecture. Expecting more than that as the minimum starting value for our range is what seems unrealistic.

It is just numerology and opinion, no reason mine should be any more valid than yours. When was the last time we had a new architecture debut which improved IPC by 25% across the range of virtually every application category?

Also, do go ahead and read JFAMD's comments in this thread; he stated plainly (as far as I perceived it) that the 1.8x thread scaling was "best case". I'm not doing anything blindly here; I am taking AMD's marketing at face value and assuming it is 100% correct. From there I am simply walking out the implications and ramifications of those statements.
 

IntelUser2000

Elite Member
Oct 14, 2003
8,686
3,787
136
OK, I think I get where the confusion came from.

The sole reason for multi-threading technologies is to increase the utilization of the functional units.

A single-core Core 2 processor might be 20% faster per clock in single-threaded apps than a dual-core Celeron, but because the Celeron is a dual core, it'll be faster than the Core 2 in well-multithreaded apps.

Well, the Bulldozer module can switch between acting like a single-core Core 2 and a dual-core Celeron, depending on the app.
 

nonameo

Diamond Member
Mar 13, 2006
5,902
2
76
OK, I think I get where the confusion came from.

The sole reason for multi-threading technologies is to increase the utilization of the functional units.

A single-core Core 2 processor might be 20% faster per clock in single-threaded apps than a dual-core Celeron, but because the Celeron is a dual core, it'll be faster than the Core 2 in well-multithreaded apps.

Well, the Bulldozer module can switch between acting like a single-core Core 2 and a dual-core Celeron, depending on the app.

Except I don't think it can quite switch like that... it's just that if you're only running one thread on a module, that thread also gets access to the whole FPU, whereas if you're using both cores they still share the one FPU (because that's all there is to have).

AMD is betting that FP won't be as important.
 

cbn

Lifer
Mar 27, 2009
12,968
221
106
Well, the Bulldozer module can switch between acting like a single-core Core 2 and a dual-core Celeron, depending on the app.

You are saying this because the second smaller "core" in the Bulldozer module becomes the bottleneck, right?

If only they could "Turbo" that second core? Or "Turbo" the primary core when the second core isn't needed?
 

IntelUser2000

Elite Member
Oct 14, 2003
8,686
3,787
136
You are saying this because the second smaller "core" in the Bulldozer module becomes the bottleneck, right?

If only they could "Turbo" that second core? Or "Turbo" the primary core when the second core isn't needed?

You gotta drop that "second thread is slower than first thread" theory. They are both equal.

except I don't think it can quite switch like that... it's just that if you're only running one "core", you also get access to the whole FP, where as if you're using both cores you still only have access to 1 fp(because that's all there is to have)

No, it'll work exactly like that. Forget about single thread for now. Again, the sole purpose of multithreading technologies is to increase utilization of the CPU. There are always "bubbles" in the pipeline, points where some parts of the CPU sit idle. Those idle slots can be used to do additional work; for example, when the pipeline stalls and the CPU is doing nothing, the 2nd thread can fetch another instruction for the functional units to work on.

But rather than sharing almost everything but the bare minimum, as in Hyperthreading, the execution units are independent and the schedulers are independent. The resource contention that occurs in Hyperthreading is the reason the performance increase is only ~30%.

Look at the Power 5 multi-threading description: http://www.chip-architect.com/news/2003_08_22_hot_chips.html

It was estimated that the multithreading performance improvement would have been limited to 20% without these extensions.

In multithreading, Bulldozer would act like 0.9 x 2. Hyperthreading is 0.65 x 2; a dual core is 1 x 2. In single thread, both CPUs become 1x again, since the thread can use all the resources available. No, it's not clarified whether the two integer cores can combine for a single thread, but that seems to be the point of Bulldozer: the unified fetch and decode and the L2 cache are there to let a single thread use both units.
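Those multipliers can be laid out side by side (a small illustrative Python sketch, using the figures quoted in this thread):

```python
# Aggregate two-thread throughput, relative to one core at full speed,
# using the per-thread multipliers quoted above.
two_thread_throughput = {
    "CMP (two full cores)":   1.0 * 2,   # 2.0x
    "CMT (Bulldozer module)": 0.9 * 2,   # 1.8x
    "SMT (Hyperthreading)":   0.65 * 2,  # 1.3x, i.e. a ~30% boost over one core
}

for design, x in two_thread_throughput.items():
    print(f"{design}: {x:.2f}x")
```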

It's said that 16 Bulldozer cores equal 1.7x 12 Magny-Cours cores, right? Go back to my SpecCPU2006 comparison. If a Bulldozer CPU with 1.33x the cores can perform 1.7x faster, then, assuming similar scaling, it'll perform ~50% faster per clock in the single-threaded SpecCPU2006 benchmark, because performance doesn't scale linearly with cores.

1x Nehalem die area
1x single-threaded performance of Nehalem
1.4x multithreaded performance (owing entirely to the difference between CMT and SMT)
 