Bulldozer has 33% fewer ALUs than K10

xfiver

Junior Member
May 14, 2007
1
0
0
Would someone (Anand ??) please comment about this article at:

http://www.pcgameshardware.de/aid,7...chnikdetails-zu-Bulldozer-und-Llano/CPU/News/

It's in German, and one paragraph translates roughly as:

"The news about AMD's upcoming Bulldozer architecture could be serious. Previous diagrams showed four unspecified pipelines per core, which was generally read as an extension of the current three-issue K10 architecture. The detailed version, however, shows two ALUs and two load/store units per core. Contrary to earlier expectations, a Bulldozer core would therefore not have 33% more integer resources than a K10 core (with its 3 ALUs and 3 load/store units), but 33% fewer. A Zambezi CPU with 8 cores would have 16 ALUs, while a current Thuban hexacore has 18. It also remains unclear what changes AMD is making inside the ALUs, and how the FPUs, which matter for gaming, will perform."

Is this a problem or not ?
 

Ancalagon44

Diamond Member
Feb 17, 2010
3,274
202
106
More detail, JFAMD?

Doesn't Bulldozer also implement a version of SMT by allowing two physical cores to share execution resources?
 

Lonyo

Lifer
Aug 10, 2002
21,938
6
81
More detail, JFAMD?

Doesn't Bulldozer also implement a version of SMT by allowing two physical cores to share execution resources?

I thought each "core" was a smaller version of a current core, and that there are 2 "cores" in a "module".
So it doesn't have SMT; it has more bits in what is currently a core, but each of those core bits is smaller.

So a current core is 1x3 ALUs (seen as one core in Windows).
Bulldozer would be 2x2 ALUs (seen as two cores in Windows).
So it's almost like SMT, but with more physical hardware per thread than Hyper-Threading, yet less than a full second core compared with current AMD processors, and with some of the other resources shared between the two "cores". Each module ends up faster than a current core, but each new "core" is slower than a current core.
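
A quick back-of-envelope in Python to put numbers on that layout (module and ALU counts as reported in the article, not confirmed specs):

# Rough ALU tally per chip, using the article's reported (unconfirmed) layout.
k10_thuban   = 6 * 3      # 6 cores x 3 ALUs each = 18 ALUs
bulldozer_8c = 4 * 2 * 2  # 4 modules x 2 "cores" x 2 ALUs each = 16 ALUs
print(k10_thuban, bulldozer_8c)  # 18 16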
 

Idontcare

Elite Member
Oct 10, 1999
21,110
64
91
Usually computing platforms are evaluated against the following metrics:
1. Performance
2. Performance/Dollar (time-zero expense)
3. Performance/Watt (TCO consideration)

Why would anyone care to break this down into "ALU/dollar", etc? To critique the design decisions that went into developing the architecture, I suppose, but to actually make any purchasing decisions? Doubtful.

ALU count is really not such an "all purpose" metric anyway... fewer ALUs paired with a lower-latency ISA, or more ALUs paired with a higher-latency ISA. Performance will depend entirely on the instruction mix of any given application.
 

IntelUser2000

Elite Member
Oct 14, 2003
8,686
3,787
136
The article supports my initial belief that the CPU design team at AMD is taking cues from the GPU design team.

The GPU design team plays with the idea of having a small "base" GPU core that can easily be put in a dual-chip configuration for performance, minimizing the yield issues associated with the large-die GPUs Nvidia is pushing.

Both the GPU and CPU design teams are facing competitors that are adept at creating large cores, and for a smaller company the new design direction is a way of adjusting to that. It might never take the absolute performance lead, but it'll be more profitable.
 

JFAMD

Senior member
May 16, 2009
565
0
0
More detail, JFAMD?

Doesn't Bulldozer also implement a version of SMT by allowing two physical cores to share execution resources?

No, we do have shared resources but the shared resources allow simultaneous processing.

SMT is actually NOT simultaneous because you have one set of pipelines and you can only schedule one thread at a time. Thread B takes over when thread A stalls.

While we have 2 cores in a module that share resources, they are truly processing simultaneously.

Technically, SMT is more like serial multithreading and not simultaneous multithreading.

We have no shared pipelines; they are all dedicated to the integer core. That is the difference.
 

Borealis7

Platinum Member
Oct 19, 2006
2,901
205
106
Will AMD have a fast dual/quad CPU for gaming?
If I choose AMD, I'd rather my next upgrade be a faster quad (with turbo, of course) than a slower X6, X8, whatever.
 

Idontcare

Elite Member
Oct 10, 1999
21,110
64
91
SMT is actually NOT simultaneous because you have one set of pipelines and you can only schedule one thread at a time. Thread B takes over when thread A stalls.

Interesting, I had not really thought much about it before, but in essence what you are saying is that one could view the productivity gained by hyperthreading as an indicator of just how much (frequency x duration) the architecture stalls with a given instruction mix.

I.e. to realize hyperthreading performance "benefits" one is basically relying on an architecture being inherently unoptimized when it comes to certain instruction mixes...any changes in the architecture resulting in less stalling (and thus higher thread-level IPC) would actually result in lower thread scaling performance and less advantages of implementing a hyperthreaded architecture.

I had always assumed (never really worried about challenging my assumption on this until now) that hyperthreading was simultaneous, not switched, pipeline with shared resource contention being the cause of lower(ed) thread scaling.

But you are saying if a program is well optimized to intentionally avoid inducing/incurring pipeline stalls then thread scaling on an HT architecture actually suffers (not absolute performance, you lose thread scaling performance but you gain per-core performance) for it.

So it could be said that a program which scales well in a hyperthreaded environment is just a really poorly coded/compiled program which is inducing/incurring a plethora of pipeline stalls.

Or it could be said that the nehalem/westmere architecture is so poorly designed that with some applications the pipeline stalls are excessive to the point that implementing hyperthreading is about the only way to redeem the architecture from what would otherwise be a significant performance impact.

Is this "view" correct?
 

VirtualLarry

No Lifer
Aug 25, 2001
56,587
10,225
126
So it could be said that a program which scales well in a hyperthreaded environment is just a really poorly coded/compiled program which is inducing/incurring a plethora of pipeline stalls.

Or it could be said that the nehalem/westmere architecture is so poorly designed that with some applications the pipeline stalls are excessive to the point that implementing hyperthreading is about the only way to redeem the architecture from what would otherwise be a significant performance impact.

Is this "view" correct?
Yes, I believe that is true.
 

Nemesis 1

Lifer
Dec 30, 2006
11,366
2
0
The article supports my initial belief that the CPU design team at AMD is taking cues from the GPU design team.

The GPU design team plays with the idea of having a small "base" GPU core that can easily be put in a dual-chip configuration for performance, minimizing the yield issues associated with the large-die GPUs Nvidia is pushing.

Both the GPU and CPU design teams are facing competitors that are adept at creating large cores, and for a smaller company the new design direction is a way of adjusting to that. It might never take the absolute performance lead, but it'll be more profitable.

Nice post. You do understand that Intel has a 48-core chip out in the wild. Then there is Atom. Saying Intel is adept at large cores, you're leaving out the fact that they are adept at small cores also. I like the direction AMD is moving.
 

Nemesis 1

Lifer
Dec 30, 2006
11,366
2
0
Interesting, I had not really thought much about it before, but in essence what you are saying is that one could view the productivity gained by hyperthreading as an indicator of just how much (frequency x duration) the architecture stalls with a given instruction mix.

I.e. to realize hyperthreading performance "benefits" one is basically relying on an architecture being inherently unoptimized when it comes to certain instruction mixes...any changes in the architecture resulting in less stalling (and thus higher thread-level IPC) would actually result in lower thread scaling performance and less advantages of implementing a hyperthreaded architecture.

I had always assumed (never really worried about challenging my assumption on this until now) that hyperthreading was simultaneous, not switched, pipeline with shared resource contention being the cause of lower(ed) thread scaling.

But you are saying if a program is well optimized to intentionally avoid inducing/incurring pipeline stalls then thread scaling on an HT architecture actually suffers (not absolute performance, you lose thread scaling performance but you gain per-core performance) for it.

So it could be said that a program which scales well in a hyperthreaded environment is just a really poorly coded/compiled program which is inducing/incurring a plethora of pipeline stalls.

Or it could be said that the nehalem/westmere architecture is so poorly designed that with some applications the pipeline stalls are excessive to the point that implementing hyperthreading is about the only way to redeem the architecture from what would otherwise be a significant performance impact.

Is this "view" correct?


You answered your own question. Nehalem has HT. Conroe didn't. Yet it's basically the same core.
 
Jan 27, 2009
182
0
0
I.e. to realize hyperthreading performance "benefits" one is basically relying on an architecture being inherently unoptimized when it comes to certain instruction mixes...any changes in the architecture resulting in less stalling (and thus higher thread-level IPC) would actually result in lower thread scaling performance and less advantages of implementing a hyperthreaded architecture.

Is this not more or less what we see with the Linpack benchmark? Highly optimised code that leaves very little stalling, and therefore little benefit, or even worse performance, with HT enabled. I think even Intel recommends disabling HT when running Linpack.
 

Nemesis 1

Lifer
Dec 30, 2006
11,366
2
0
I said over a year ago that I believe at some point Intel/AMD will be on the same chip, with Intel fabbing the chip. I always liked ATI mainly because of their front end. It's a good fit for Intel. The best part is it should end the fanboi wars.
 

iCyborg

Golden Member
Aug 8, 2008
1,352
62
91
Idontcare said:
So it could be said that a program which scales well in a hyperthreaded environment is just a really poorly coded/compiled program which is inducing/incurring a plethora of pipeline stalls.

Or it could be said that the nehalem/westmere architecture is so poorly designed that with some applications the pipeline stalls are excessive to the point that implementing hyperthreading is about the only way to redeem the architecture from what would otherwise be a significant performance impact.

Is this "view" correct?
Yes, I believe that is true.
I thought that the main reason for HT was the cost of context switches; on a typical x86/x64 system they are on the order of something like 10,000 cycles. And it's hard to optimize away some stalls, like cache misses or branch mispredictions, which will stall the running thread. Switching to another thread in a fraction of that time can make a difference. I don't think a program that can't keep everything in a few MB of L2/L3 cache is poorly coded, or that this is a poorly designed architecture thing.
It's not uncommon for code that needs to wait for a resource like a mutex lock to spin-wait for a little while, i.e. waste cycles, before going to a proper wait - often the lock is released quickly, so it's better to waste some cycles than to incur two context switches (out, and back in later) on a non-HT system.
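
A minimal sketch of that spin-then-block pattern, using Python's threading module purely for illustration (the spin count is arbitrary):

import threading

def acquire_spin_then_block(lock: threading.Lock, spins: int = 1000) -> None:
    # Poll the lock cheaply for a while: if the holder releases it quickly,
    # we avoid blocking and the two context switches that come with it.
    for _ in range(spins):
        if lock.acquire(blocking=False):
            return
    # Still contended: fall back to a proper blocking wait.
    lock.acquire()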
 

Nemesis 1

Lifer
Dec 30, 2006
11,366
2
0
^

They left you hanging. I think maybe you should expand on what you said. But don't take a big step; baby steps are better.
 

evolucion8

Platinum Member
Jun 17, 2005
2,867
3
81
So it could be said that a program which scales well in a hyperthreaded environment is just a really poorly coded/compiled program which is inducing/incurring a plethora of pipeline stalls.

Or it could be said that the nehalem/westmere architecture is so poorly designed that with some applications the pipeline stalls are excessive to the point that implementing hyperthreading is about the only way to redeem the architecture from what would otherwise be a significant performance impact.

Is this "view" correct?

I also heard that the Conroe/Penryn architecture's four front-end issue slots went underutilized most of the time, so Hyper-Threading helps maximize their utilization, hence the better performance, especially in multithreaded workloads; after all, Nehalem isn't any wider than Conroe/Penryn.
 

IntelUser2000

Elite Member
Oct 14, 2003
8,686
3,787
136
No, we do have shared resources but the shared resources allow simultaneous processing.

SMT is actually NOT simultaneous because you have one set of pipelines and you can only schedule one thread at a time. Thread B takes over when thread A stalls.

I do not know where you got this.

Look at the diagram I drew:



[diagram illustrating the multi-threading types summarized below; image hosted on ImageShack]

So again in summary:

Coarse Grain Multi-Threading: Switches to another thread ONLY with high latency events, like a cache miss

Fine Grain Multi-Threading: Switches to another thread with every cycle

Simultaneous Multi-Threading: Can execute instructions from more than one thread at the same time, in the same cycle. You cannot do this on a non-superscalar CPU.

The Pentium 4, Nehalem, and Atom cores all use simultaneous multi-threading. The only other Intel processors that use a different multi-threading technique are Montecito and Tukwila, which are multi-core Itanium CPUs.
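
To make the distinction concrete, here is a purely illustrative issue-slot picture in Python (no real hardware is modeled): two threads, A and B, on a hypothetical 2-wide pipeline, with thread A missing the cache around cycle 3.

# '-' marks a wasted issue slot.
issue_slots = {
    #           cyc1        cyc2        cyc3        cyc4        cyc5
    "coarse": [("A", "A"), ("A", "A"), ("-", "-"), ("B", "B"), ("B", "B")],  # switch only on the miss
    "fine":   [("A", "A"), ("B", "B"), ("-", "-"), ("B", "B"), ("A", "A")],  # alternate every cycle; A's turn is wasted
    "smt":    [("A", "B"), ("A", "B"), ("B", "B"), ("A", "B"), ("A", "B")],  # both threads can share one cycle
}
for name, slots in issue_slots.items():
    used = sum(slot != "-" for pair in slots for slot in pair)
    print(f"{name:6s} keeps {used}/10 issue slots busy")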
 

JFAMD

Senior member
May 16, 2009
565
0
0
Interesting, I had not really thought much about it before, but in essence what you are saying is that one could view the productivity gained by hyperthreading as an indicator of just how much (frequency x duration) the architecture stalls with a given instruction mix.

I.e. to realize hyperthreading performance "benefits" one is basically relying on an architecture being inherently unoptimized when it comes to certain instruction mixes...any changes in the architecture resulting in less stalling (and thus higher thread-level IPC) would actually result in lower thread scaling performance and less advantages of implementing a hyperthreaded architecture.

I had always assumed (never really worried about challenging my assumption on this until now) that hyperthreading was simultaneous, not switched, pipeline with shared resource contention being the cause of lower(ed) thread scaling.

But you are saying if a program is well optimized to intentionally avoid inducing/incurring pipeline stalls then thread scaling on an HT architecture actually suffers (not absolute performance, you lose thread scaling performance but you gain per-core performance) for it.

So it could be said that a program which scales well in a hyperthreaded environment is just a really poorly coded/compiled program which is inducing/incurring a plethora of pipeline stalls.

Or it could be said that the nehalem/westmere architecture is so poorly designed that with some applications the pipeline stalls are excessive to the point that implementing hyperthreading is about the only way to redeem the architecture from what would otherwise be a significant performance impact.

Is this "view" correct?

I am far from being a software engineer, but I would say that you are generally correct.

The biggest takeaway that I get from my engineers is that as you make your system more efficient (i.e. eliminate gaps in the processing chain), the less effective HT becomes.

One could argue (correctly) that we will probably never get to 0% cache misses, so, theoretically there will always be a place for an SMT-type architecture. But look at the differences between client and server. It gives bigger uplift in client because those apps are less threaded and *generally* less efficient. They were written for one big fast core instead of being written for multiple cores. Even in the world of single core processors, server apps were multithreaded, client apps were not.

If you could guarantee an environment where you had 0% pipeline stalls and 0% cache misses you'd kill SMT in a heartbeat. Why take the overhead if there is no benefit (because your pipelines would be full and you would never have a thread stop).

The only question to ask yourself is, do you believe people are working to make software more efficient or less efficient? If they are driving for higher efficiency, they are *probably* driving towards lower uplift on SMT over time.

Let's say a pipeline can handle 100 units of work at max. Because of things like cache misses and thread stalls you only get 80 units of work out of it. Now, with SMT you can load some additional work in during a stall, but there is a switching overhead (emptying/loading cache), so you get another 12 units of work out of it. 92 units out of a possible 100 units.

If you make your application and OS more efficient, you get an extra 6 units out of the pipeline. Then you make your cache larger and get another 4 units out. Now you are at 90 units. Because there is a physical limit of 100 units, and there is some SMT overhead, maybe you get 6 units out of SMT. The good news is that you are at 96 units of work, so you are getting more done and are more efficient, but guess what? Your SMT efficiency went from 15% (12 units on 80) to ~7% (6 units on 90).

Now, take that same 80-unit throughput and add a second core. 80 x 2 = 160 units. Even if you assume 90% scaling, you are at 152 units. THAT is how you increase throughput.

Take your 152-unit dual core, and add the 4 units of cache and 6 units of SW enhancement. That adds 19 units (if you assume 90% scaling on the second core), or roughly 12% uplift (to a total of 171). You are getting more benefit out of both of those increases because each has a full impact. In the SMT world, the benefit of those enhancements brings the SMT benefit down because you have the same number of pipes.

So the key is more cores will get you better increases as you make architectural changes towards better efficiency. In an SMT environment, you are banking on inefficiency in order to make 2 threads work on one set of pipelines.

Before anyone starts the flames, all of these numbers are made up and for comparison only. But you should get the idea.
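
For anyone following the arithmetic, here are the same made-up numbers in a few lines of Python (JFAMD's caveat stands: these are illustrative units, not benchmarks):

base = 80                          # one core after stalls and cache misses, out of a 100-unit pipeline ceiling
print(base + 12, 12 / base)        # 92 units, 15% SMT uplift

tuned = base + 6 + 4               # software tuning + bigger cache
print(tuned + 6, 6 / tuned)        # 96 units, ~7% SMT uplift: less slack left to fill

dual = base + 0.9 * base           # second core at 90% scaling
print(dual)                        # 152 units
dual_tuned = (base + 6 + 4) * 1.9  # both cores get the same 10-unit improvements
print(dual_tuned, (dual_tuned - dual) / dual)  # 171 units, ~12% on top of the dual core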
 

JFAMD

Senior member
May 16, 2009
565
0
0
I do not know where you got this.

Look at the diagram I drew:



[diagram illustrating the multi-threading types summarized below; image hosted on ImageShack]

So again in summary:

Coarse Grain Multi-Threading: Switches to another thread ONLY with high latency events, like a cache miss

Fine Grain Multi-Threading: Switches to another thread with every cycle

Simultaneous Multi-Threading: Can execute instructions from more than one thread at the same time, in the same cycle. You cannot do this on a non-superscalar CPU.

The Pentium 4, Nehalem, and Atom cores all use simultaneous multi-threading. The only other Intel processors that use a different multi-threading technique are Montecito and Tukwila, which are multi-core Itanium CPUs.

Can't see the picture, imageshack is blocked by our proxy for some reason.

You can't put 10 pounds of instructions in a 5 pound bag. Your limit is your pipelines. All you are doing in SMT is trading off the inefficiency in the pipeline, you are not creating extra pipelines.

Fine grain, coarse grain, it doesn't matter. What matters is the pipelines, and SMT doesn't allow 2 instructions at the same time, it allows you to fill in the gaps.
 

JFAMD

Senior member
May 16, 2009
565
0
0
Is this not more or less what we see with the Linpack benchmark test? Highly optimised code that leaves very little stalling, and therefore little/ worse performance with HT enabled. I think even Intel recommend disabling HT when using Linpack.

Exactly. The more efficient your software is, the less of a benefit you'll see from SMT. Because there is an overhead (which will vary by application and environment), there are actually times when you see negative scaling.
 

Idontcare

Elite Member
Oct 10, 1999
21,110
64
91
snip

The biggest takeaway that I get from my engineers is that as you make your system more efficient (i.e. eliminate gaps in the processing chain), the less effective HT becomes.

snip

Now, take that same 80 unit throughput and add a second core. 80 x 2 = 160 units. even if you assume 90% scaling, you are at 152 units. THAT is how you increase throughput.

snip

So the key is more cores will get you better increases as you make architectural changes towards better efficiency. In an SMT environment, you are banking on inefficiency in order to make 2 threads work on one set of pipelines.

snip

I think I am finally beginning to catch a glimpse as to why HT is met with the negative reaction it does within certain circles of computing aficionados.

It had completely escaped me until now that hyperthreading performance lives and dies within the envelope of pipeline stalls - be it from a less than optimally engineered architecture or less than optimally engineered software - so anything that decreases the stalls also reduces the efficacy of the extra thread.

Now I am finally beginning to understand why AMD never went there and why the preferred architecture path is towards more cores if you want throughput scaling with threads. (It probably also speaks a bit to why thread scaling is so poor on Nehalem vs. Opteron for some apps.)

[chart: Euler3D benchmark thread scaling]

[chart: Core i7 920 @ 4 GHz with HT]
 

IntelUser2000

Elite Member
Oct 14, 2003
8,686
3,787
136
Fine grain, coarse grain, it doesn't matter. What matters is the pipelines, and SMT doesn't allow 2 instructions at the same time, it allows you to fill in the gaps.

Try this
[diagram image]


I see what you are getting at. There are differences in implementation to get efficiency up, but the final goals are similar.

You can't rely on perfect code and a perfect environment. The whole world works on the fact that nothing is perfect. Ironically, if perfect code could be written, we might have seen very wide, highly clocked ILP processors instead.

Whether client or server benefits more from SMT is hard to say and very dependent on the app. Actually, from Nehalem on, SMT benefits the server crowd more than the PC crowd, because practical ILP is far from theoretical ILP, and apps that gain a lot from multiprocessor systems generally gain from SMT as well.

Why do just x cores when you can do x cores + x threads?
 

IntelUser2000

Elite Member
Oct 14, 2003
8,686
3,787
136
Now I am finally beginning to understand why AMD never went there and why the preferred architecture path is towards more cores if you want throughput scaling with threads. (and probably speaks a bit as to why thread scaling is so poor on nehalem vs. opteron for some apps).

[chart: Euler3D benchmark thread scaling]

I am not understanding you, IDC. That Euler3D example compares two quad-core Opterons with a single Core i7 CPU that supports SMT. Yes, scaling is lower, but you are getting lower prices and only have to buy one die.

Maybe if you approach it in terms of design it makes a bit of sense. TLP-optimized cores are likely to be simpler than ILP-optimized cores, while the throughput is likely higher on a TLP-optimized core.

Or perhaps you are talking about running software that has a maximum thread limit: http://anandtech.com/show/2774/7

In the MCS eFMS 9.2 website test, it does not scale beyond 8 cores (threads), and therefore the HT-enabled dual Xeon turns out slower than the HT-disabled dual Xeon and is only a few single-digit percent faster than an HT-enabled single Xeon.

In that specific case, it would be better to have x cores than x threads. But since JFAMD brought servers into the discussion: if the app supports nearly limitless threads, the only real limit here is how much of the theoretical ILP it can actually extract.