AMD sheds light on Bulldozer, Bobcat, desktop, laptop plans


GaiaHunter

Diamond Member
Jul 13, 2008
3,732
432
126
You gotta drop that "second thread is slower than first thread" theory. They are both equal.



No, it'll work exactly like that. Forget about single thread for now. Again, the sole purpose of multi-threading technologies is to increase utilization of the CPU. There are always "bubbles" in the pipeline, points where some parts of the CPU are idle. Those idle points can be used to do additional work: for example, when the pipeline stalls and the CPU is doing nothing, the 2nd thread can fetch another instruction for the functional units to work on.

But rather than sharing almost everything but the bare minimum, as in Hyperthreading, the execution units are independent and the schedulers are independent. The resource contention that occurs in Hyperthreading is the reason its performance increase is only 30%.

Look at the Power 5 multi-threading description: http://www.chip-architect.com/news/2003_08_22_hot_chips.html



In multi-threading, Bulldozer would act like 0.9 x 2. Hyperthreading is 0.65 x 2, a dual core is 1 x 2. In single thread, both CPUs become 1x again since it can use all the resources available. No, it's not clarified whether the two integer cores can combine in single thread, but it seems to be the point of Bulldozer. The unified fetch and decode, the L2 caches: they're there to use both units in single thread.

It's said that 1.7x 12 Magny Cours cores equal 16 Bulldozer cores, right? Go back to my SpecCPU2006 comparison. If a Bulldozer CPU with 1.33x the cores can perform 1.7x faster, then, assuming similar scaling, it'll perform 50% faster per clock in the single-threaded SpecCPU2006 benchmark, because performance doesn't scale linearly with cores.
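To make the arithmetic behind that claim explicit, here is a minimal sketch. It assumes perfectly linear scaling with core count, which the post itself says does not hold, so the result is only a floor on the implied per-core gain and not the 50% figure. The function name is made up for illustration.

```python
def per_core_ratio(speedup: float, new_cores: int, old_cores: int) -> float:
    """Per-core throughput of the new chip relative to the old one.

    Assumes (hypothetically) that throughput scales linearly with core
    count. With sub-linear scaling the real per-core gain must be higher.
    """
    return speedup / (new_cores / old_cores)


# Figures quoted in the thread: 16 Bulldozer cores ~ 1.7x 12 Magny-Cours cores.
ratio = per_core_ratio(1.7, 16, 12)
print(f"implied per-core gain under linear scaling: {ratio:.3f}x")
```

Under this assumption the implied per-core gain is about 1.275x. Anything above that has to come from the sub-linear scaling argument.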

1x Nehalem die
1x Single threaded performance as Nehalem
1.4x Multi threaded performance(Owing entirely to the difference between CMT and SMT)

Good post.

I'm not sure they can use both integer cores in the same thread - that would be like the holy grail and fuck multi-thread performance in that case!

That seems to only be possible if software is especially coded for that, although maybe with BD that coding will be simpler.
 

deputc26

Senior member
Nov 7, 2008
548
1
76
After watching this thread unfold and seeing the participants struggle to quantify Bulldozer performance in relation to available architectures, several issues have emerged that make me question AMD's decision to market a module as 2 cores. The first is the obvious performance advantage that Intel will now hold "per core" from the average consumer's perspective: "Should I get the Intel quad or the AMD quad?" The answer would obviously be Intel if you want the best. This makes AMD look like a Chinese "quantity over quality" no-name brand, which will do long-term damage to the brand. On the other hand, if AMD decides to call a module a core, they will have "core" superiority, and whenever processors of like core counts are discussed AMD will be the winner or a respectable runner-up. While this would cause AMD to lose the core-count war, that would be far less important because, as demonstrated by the shifts to duals and quads, it takes at least a year for the newest core count to become mainstream, allowing AMD to "ride the high-volume portion of the demand bell curve". This should increase the AMD brand value as well as total sales, at the cost of core count and a performance halo that they could not get anyway.

It seems like AMD is taking the short-term "low" road with their current interpretation of "core".

Sorry for typos, posted from phone.
 

cbn

Lifer
Mar 27, 2009
12,968
221
106
After watching this thread unfold and seeing the participants struggle to quantify Bulldozer performance in relation to available architectures, several issues have emerged that make me question AMD's decision to market a module as 2 cores. The first is the obvious performance advantage that Intel will now hold "per core" from the average consumer's perspective: "Should I get the Intel quad or the AMD quad?" The answer would obviously be Intel if you want the best. This makes AMD look like a Chinese "quantity over quality" no-name brand, which will do long-term damage to the brand. On the other hand, if AMD decides to call a module a core, they will have "core" superiority, and whenever processors of like core counts are discussed AMD will be the winner or a respectable runner-up. While this would cause AMD to lose the core-count war, that would be far less important because, as demonstrated by the shifts to duals and quads, it takes at least a year for the newest core count to become mainstream, allowing AMD to "ride the high-volume portion of the demand bell curve". This should increase the AMD brand value as well as total sales, at the cost of core count and a performance halo that they could not get anyway.

It seems like AMD is taking the short-term "low" road with their current interpretation of "core".

Sorry for typos, posted from phone.

I wish AMD had faster cores also, but I wonder if some of this power-efficiency strategy is due to Bulldozer products predictably being built on larger nodes (in comparison to Intel's usually faster node transitions).
 

GaiaHunter

Diamond Member
Jul 13, 2008
3,732
432
126
I wish AMD had faster cores also, but I wonder if some of this power-efficiency strategy is due to Bulldozer products predictably being built on larger nodes (in comparison to Intel's usually faster node transitions).

But since BD cores are smaller they can throw more cores at Intel for the same size.

HT is really nice, but if you are only using 4 cores, Nehalem's 4 extra "logical cores" do nothing.

Let's say a BD core is 20%-30% faster and a Nehalem core 30% faster than a Deneb core. In terms of size, Nehalem = Deneb = 1 and 2 BD cores = 1.5 Denebs, so "Dual-core BD" = 1.5 dies, Quad-core BD = 3 dies and Octo-core BD = 6 dies, while "Dual Nehalem" = 2 dies, Quad Nehalem = 4 dies and Octo Nehalem = 8 dies.

(I know Phenom II and i7 are 45nm and i3 is 32nm as will be i9, BD and Sandy, but disregard that atm)

Now let's play with numbers (all based on figures and suppositions that may or may not happen).

2-thread applications

Dual Phenom II - 1x2 = 2 cores

"Dual BD"-----0.9x2x1.2(or 1.3) = 2.16 (2.34) cores

Nehalem--- 1x2x1.3 = 2.6


4-thread applications

Quad Phenom II - 1x4 = 4 PH II cores

Quad BD -------0.9x4x1.2 (or 1.3) = 4.32 (4.68) cores

Octo BD ---------- 1x4x1.2 (or 1.3) = 4.8 (5.2) cores

Quad Nehalem---- 1x4x1.3 = 5.2 cores (i5 and i7)

"Octo Nehalem" -- 1x4x1.3 = 5.2 cores

Dual Nehalem ---- 0.65x4x1.3 = 3.38 cores (i3 with HT)


8-thread applications

"Octo Phenom II"- 1x8 = 8 cores

Octo BD --------0.9x8x1.2 (or 1.3) = 8.64 (9.36) cores

Quad Nehalem---- 0.65x8x1.3 = 6.76 cores (i7 with HT)

"Octo Nehalem" -- 1x8x1.3 = 10.4 cores
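The figures above all come from one product: threads x sharing efficiency x per-core speedup over a Phenom II (Deneb) core. A minimal sketch that recomputes a few rows, using the 0.9 (CMT) and 0.65 (SMT) factors and the 1.2x/1.3x speedups quoted in this thread. These are suppositions from the discussion, not measurements, and the function and variable names are made up for illustration.

```python
# Sharing-efficiency factors quoted in the thread (assumptions, not measurements).
CMT = 0.9    # Bulldozer: two threads sharing one module
SMT = 0.65   # Hyper-Threading: two threads sharing one core
FULL = 1.0   # each thread has a core (or a whole module) to itself


def deneb_equivalents(threads: int, efficiency: float, speedup: float) -> float:
    """Throughput expressed in Phenom II (Deneb) core equivalents."""
    return round(threads * efficiency * speedup, 2)


rows = [
    ("Quad BD, 4 threads (1.2x core)", 4, CMT, 1.2),
    ("Octo BD, 4 threads (1.2x core)", 4, FULL, 1.2),
    ("Dual Nehalem + HT, 4 threads", 4, SMT, 1.3),
    ("Octo BD, 8 threads (1.2x core)", 8, CMT, 1.2),
    ("Quad Nehalem + HT, 8 threads", 8, SMT, 1.3),
    ("Octo Nehalem, 8 threads", 8, FULL, 1.3),
]

table = {name: deneb_equivalents(t, e, s) for name, t, e, s in rows}
for name, value in table.items():
    print(f"{name:32s} {value:5.2f} cores")
```

Changing the two efficiency constants shows how sensitive the comparison is to those guesses.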

Smells like price war (if Intel can actually be dragged into that - I remember Pentium D EE priced @$1000 even though the much faster dual-core FX were the same price).

Depends on what Sandy Bridge/Ivy Bridge will bring to the table and if AMD can surprise us even more with Bulldozer.

Anyway, seems good to the consumer.
 

JFAMD

Senior member
May 16, 2009
565
0
0
After watching this thread unfold and seeing the participants struggle to quantify Bulldozer performance in relation to available architectures, several issues have emerged that make me question AMD's decision to market a module as 2 cores. The first is the obvious performance advantage that Intel will now hold "per core" from the average consumer's perspective: "Should I get the Intel quad or the AMD quad?" The answer would obviously be Intel if you want the best. This makes AMD look like a Chinese "quantity over quality" no-name brand, which will do long-term damage to the brand. On the other hand, if AMD decides to call a module a core, they will have "core" superiority, and whenever processors of like core counts are discussed AMD will be the winner or a respectable runner-up. While this would cause AMD to lose the core-count war, that would be far less important because, as demonstrated by the shifts to duals and quads, it takes at least a year for the newest core count to become mainstream, allowing AMD to "ride the high-volume portion of the demand bell curve". This should increase the AMD brand value as well as total sales, at the cost of core count and a performance halo that they could not get anyway.

It seems like AMD is taking the short-term "low" road with their current interpretation of "core".

Sorry for typos, posted from phone.


So, if you had an 8-core product that showed up to the hardware as 8 cores, showed up to the operating system as 8 cores and showed up to the application as 8 cores, you'd call it a 4-core?

The only 2 places that people will see the Bulldozer modules will be in the architectural layouts and the PowerPoint slides. Modules are invisible to the user and the platform; all they see is cores.
 

jvroig

Platinum Member
Nov 4, 2009
2,394
1
81
I'm not sure they can use both integer cores in the same thread - that would be like the holy grail and fuck multi-thread performance in that case!
Agreed. If that were possible (but it isn't), that would make for monster single-threaded performance. But then, if there were only two threads, why couldn't they be handled by 2 separate modules, so that 2 integer cores work together on the first thread and the other thread is handled by another pair of integer cores?

Nice to think about it, but it's not gonna happen. If that were an actual design target, we'd have heard about it already from AMD.
 

jones377

Senior member
May 2, 2004
467
70
91
So, if you had an 8-core product that showed up to the hardware as 8 cores, showed up to the operating system as 8 cores and showed up to the application as 8 cores, you'd call it a 4-core?

The only 2 places that people will see the Bulldozer modules will be in the architectural layouts and the PowerPoint slides. Modules are invisible to the user and the platform; all they see is cores.

So let's say you have a 4-module/8-core Bulldozer running an application that only uses 4 threads (with a mix of int and float, let's say a game). How would you like the OS scheduler to dispatch the threads for maximum performance? Is there a technique in place already that AMD can take advantage of instead of telling the OS vendors to implement something new?

The fact is, the cores in a Bulldozer with 2 or more modules aren't all symmetrical. Putting those 4 threads on a separate module each would give you higher performance than using only 2 modules, especially if there are floating point instructions involved. So I say let Bulldozer (4M/8C) pretend to the OS that it is an SMT processor with 8 logical cores. Intel has done that since they first implemented SMT (back in 2001?), and it took Microsoft quite some time to finally implement it properly, and it should work just as well for Bulldozer even though it's not SMT.
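The placement policy argued for here, one thread per module before any module gets a second thread, can be sketched as follows. This is a toy illustration only: the function name and the (module, core) slot representation are made up for the example and do not correspond to any real OS scheduler interface.

```python
def spread_across_modules(n_threads: int, n_modules: int,
                          cores_per_module: int = 2) -> list[tuple[int, int]]:
    """Assign threads to (module, core) slots, filling one core on every
    module before placing a second thread on any module."""
    capacity = n_modules * cores_per_module
    if n_threads > capacity:
        raise ValueError(f"{n_threads} threads exceed {capacity} hardware cores")
    # Enumerate slots core-index first, so core 0 of every module comes
    # before core 1 of any module.
    slots = [(module, core)
             for core in range(cores_per_module)
             for module in range(n_modules)]
    return slots[:n_threads]


# 4 threads on a 4-module/8-core part: every thread gets its own module.
placement = spread_across_modules(4, 4)
print(placement)
```

With 6 threads on the same part, only modules 0 and 1 end up shared, which is the same "avoid sibling logical cores first" ordering an SMT-aware scheduler would use.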

I really don't care what marketing calls the end product, I just hope you don't have that kind of power over engineering.
 

jvroig

Platinum Member
Nov 4, 2009
2,394
1
81
Gaia, that doesn't look like an AMD design target for their chip. That slide is informational in nature. It simply elaborated on the different types of parallelism.

Parallel applications have to be written that way. No ingenious chip (as of right now, even if just a proof-of-concept) will make a structured linear program behave as if multi-threaded. "Parallel applications" will therefore never be a target of AMD's design team unless they expand into software development.

It also certainly does not say that AMD is targeting our topic of 2 cores combining into one in order to handle a single thread.
 

JFAMD

Senior member
May 16, 2009
565
0
0
So let's say you have a 4-module/8-core Bulldozer running an application that only uses 4 threads (with a mix of int and float, let's say a game). How would you like the OS scheduler to dispatch the threads for maximum performance? Is there a technique in place already that AMD can take advantage of instead of telling the OS vendors to implement something new?

The fact is, the cores in a Bulldozer with 2 or more modules aren't all symmetrical. Putting those 4 threads on a separate module each would give you higher performance than using only 2 modules, especially if there are floating point instructions involved. So I say let Bulldozer (4M/8C) pretend to the OS that it is an SMT processor with 8 logical cores. Intel has done that since they first implemented SMT (back in 2001?), and it took Microsoft quite some time to finally implement it properly, and it should work just as well for Bulldozer even though it's not SMT.

I really don't care what marketing calls the end product, I just hope you don't have that kind of power over engineering.

Yes, you are right that in a perfect world, if you had 8 cores and 4 threads you would want to put one thread on each module, but the penalty for misjudging is nowhere near the penalty for putting 2 threads on a single core with SMT.

If you assume that 2 Bulldozer cores are 180% of the performance of 1 core, that means that there is a 90% throughput per core, or a 10% penalty for calling it wrong. But with SMT if you are getting ~20% bump, then that means 120% for 2 threads on one core, or 60% throughput per core. I'd rather take a 10% penalty for a missed route than a 40%.
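The comparison in that paragraph is simple enough to write out. A minimal sketch, taking the 180% and 120% combined-throughput figures above as given assumptions and using made-up function names:

```python
def per_thread_throughput(combined: float) -> float:
    """Throughput each of two co-scheduled threads gets, relative to a
    thread running alone (1.0), given their combined throughput."""
    return combined / 2


def misplacement_penalty(combined: float) -> float:
    """Throughput a thread loses when it is forced to share."""
    return 1.0 - per_thread_throughput(combined)


# Figures assumed in the post: 2 Bulldozer cores in a module reach 180%
# of one core; 2 SMT threads on one core reach about 120%.
cmt_penalty = misplacement_penalty(1.8)
smt_penalty = misplacement_penalty(1.2)
print(f"CMT penalty: {cmt_penalty:.0%}, SMT penalty: {smt_penalty:.0%}")
```

With those inputs the penalties come out at 10% and 40%, matching the numbers in the post. The result depends entirely on the 180% and 120% assumptions.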

And from an FP perspective, even if you did share 2 threads across one module, the FP unit will allocate 128 bits per thread. In an SMT system, unless they have a 256-bit FPU, both threads are sharing a single FPU so you have effectively half the width.
 

JFAMD

Senior member
May 16, 2009
565
0
0

jvroig

Platinum Member
Nov 4, 2009
2,394
1
81
JFAMD, still seems ok to me now, the link still works fine. The slide GaiaHunter refers to is just informational in nature, a slide about different types of parallelism, the sort of basic information a CS student gets in their required textbook.
 

Idontcare

Elite Member
Oct 10, 1999
21,110
64
91
So, if you had an 8-core product that showed up to the hardware as 8 cores, showed up to the operating system as 8 cores and showed up to the application as 8 cores, you'd call it a 4-core?

The only 2 places that people will see the Bulldozer modules will be in the architectural layouts and the PowerPoint slides. Modules are invisible to the user and the platform; all they see is cores.

If I run a multi-threaded app which executes 256-bit FPU instructions, it will become apparent how many modules are present in my Bulldozer-based CPU. Thread scaling will take a nosedive, worse than hyperthreading at that point, as the resource contention will be 100%.
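A toy model of that contention case, for illustration only. It assumes a module's shared FPU is 256 bits wide and can serve either one 256-bit operation or two 128-bit operations per cycle, as described elsewhere in this thread, and it ignores every other bottleneck. The function name is made up.

```python
def fp_share_per_thread(threads_on_module: int, op_width_bits: int,
                        fpu_width_bits: int = 256) -> float:
    """Fraction of its peak FP issue rate each thread keeps when every
    thread on the module issues one op of op_width_bits per cycle.

    Assumption: the shared FPU width divides evenly among the threads.
    """
    demand = threads_on_module * op_width_bits
    return min(1.0, fpu_width_bits / demand)


alone_256 = fp_share_per_thread(1, 256)    # one thread has the FPU to itself
shared_256 = fp_share_per_thread(2, 256)   # two threads contend for it
shared_128 = fp_share_per_thread(2, 128)   # two 128-bit streams both fit
print(alone_256, shared_256, shared_128)
```

In this model the second thread adds nothing only when both threads issue 256-bit operations on every cycle. Any mix with narrower operations or idle cycles leaves room for both.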
 

jones377

Senior member
May 2, 2004
467
70
91
Yes, you are right that in a perfect world, if you had 8 cores and 4 threads you would want to put one thread on each module, but the penalty for misjudging is nowhere near the penalty for putting 2 threads on a single core with SMT.

If you assume that 2 Bulldozer cores are 180% of the performance of 1 core, that means that there is a 90% throughput per core, or a 10% penalty for calling it wrong. But with SMT if you are getting ~20% bump, then that means 120% for 2 threads on one core, or 60% throughput per core. I'd rather take a 10% penalty for a missed route than a 40%.

And from an FP perspective, even if you did share 2 threads across one module, the FP unit will allocate 128 bits per thread. In an SMT system, unless they have a 256-bit FPU, both threads are sharing a single FPU so you have effectively half the width.

But why take the hit at all when you don't have to? From the discussions here and elsewhere it seems that one thread can max out the FPU on the module all by itself, especially when running something like Linpack which would be "worst case". In fact the FPU looks very much like SMT to me when it's running 2 threads at once. I'd be willing to bet that Bulldozer will take advantage of existing OS schedulers optimized for SMT.
 

GaiaHunter

Diamond Member
Jul 13, 2008
3,732
432
126
JFAMD, still seems ok to me now, the link still works fine. The slide GaiaHunter refers to is just informational in nature, a slide about different types of parallelism, the sort of basic information a CS student gets in their required textbook.

And it is from 2005.

Probably more like a wet dream of the writer.

Don't worry we won't start spreading Bulldozer can have multiple cores working on the same thread. :D
 

jvroig

Platinum Member
Nov 4, 2009
2,394
1
81
If I run a multi-threaded app which executes 256-bit FPU instructions, it will become apparent how many modules are present in my Bulldozer-based CPU. Thread scaling will take a nosedive, worse than hyperthreading at that point, as the resource contention will be 100%.
Evidently, they are betting that such loads don't occur, or occur only infrequently. Or, perhaps, on offloading such tasks to a discrete GPU in those scenarios.

Seems that Bulldozer looks good for the desktop, since Integer > Floating Point, which was the lesson they learned the hard way with the original Phenom, if I recall correctly.

Also, what's the timeline again for the on-die GPU? Will that be the norm for AMD by the time Bulldozer is released? That might be where they offload the heavy FP tasks?

I'm sure Mr. Fruehe will set us straight.
 

deputc26

Senior member
Nov 7, 2008
548
1
76
So, if you had an 8-core product that showed up to the hardware as 8 cores, showed up to the operating system as 8 cores and showed up to the application as 8 cores, you'd call it a 4-core?

The only 2 places that people will see the Bulldozer modules will be in the architectural layouts and the PowerPoint slides. Modules are invisible to the user and the platform; all they see is cores.

Intel's i7 quads "show up to the OS as 8 cores" and "show up to the application as 8 cores" as well. I am not sure if i7 shows up to the hardware as 8 cores.

The definition of core has been blurred, as previously noted. I am not saying that AMD is not justified in calling it an 8-core. I am saying that calling it a quad would also be justified (I am assuming that a module is capable of devoting all resources to a single thread, which I am not 100% sure of) and would have additional benefits for AMD.
 

jvroig

Platinum Member
Nov 4, 2009
2,394
1
81
OMG, I'm getting IDC confused with ComputerBottleneck,
Can't believe I missed this when I first got back to this thread. IDC must have been honored and maybe even giggled himself to sleep :D

As long as the AMD Marketing team wises up and actually names the end product "Bulldozer" instead of something awkward like "Phenom III X8", it'll sell like crazy.
I kinda liked "Phenom II"... but I agree, "Phenom III XY" is a bit less impressive than, say, a "Bulldozer".
 

JFAMD

Senior member
May 16, 2009
565
0
0
If I run a multi-threaded app which executes 256-bit FPU instructions, it will become apparent how many modules are present in my Bulldozer-based CPU. Thread scaling will take a nosedive, worse than hyperthreading at that point, as the resource contention will be 100%.

If you have a workload that requires 256-bit FPU on every single cycle, I would recommend a CPU/GPU solution instead.

I have not seen an application like this at all in the x86 server world. There are a lot of apps that do heavy floating point, but not one that has those requirements.

And how do you come to the conclusion that it would be worse than hyperthreading at that point? You believe that putting that same load on a single core will net better results? How?
 

ilkhan

Golden Member
Jul 21, 2006
1,117
1
0
So, if you had an 8-core product that showed up to the hardware as 8 cores, showed up to the operating system as 8 cores and showed up to the application as 8 cores, you'd call it a 4-core?

The only 2 places that people will see the Bulldozer modules will be in the architectural layouts and the PowerPoint slides. Modules are invisible to the user and the platform; all they see is cores.
The methods are different, but the results are the same with HT. Everything shows as 8 threads to the user, on a Ci7. But Intel calls it a quad and you're going to call it an octo. Both methods are sharing resources, the amount is all that varies.

Unless you can take the duplicated pieces apart and have both halves keep working, as far as I'm concerned it's a single core.
 

GaiaHunter

Diamond Member
Jul 13, 2008
3,732
432
126
The methods are different, but the results are the same with HT. Everything shows as 8 threads to the user, on a Ci7. But Intel calls it a quad and you're going to call it an octo. Both methods are sharing resources, the amount is all that varies.

Unless you can take the duplicated pieces apart and have both halves keep working, as far as I'm concerned it's a single core.

Yeah, but according to released data BD will have 2x Integer resources of the Nehalem.
 

ilkhan

Golden Member
Jul 21, 2006
1,117
1
0
Yeah, but according to released data BD will have 2x Integer resources of the Nehalem.
source?
Each module has 2 integer cores; that doesn't necessarily mean each int core has double the integer resources. (It could be that that's the case, I certainly don't know.)