Interesting, I had not really thought much about it before, but in essence what you are saying is that one could view the throughput gained by hyperthreading as an indicator of just how much (frequency × duration) the architecture stalls on a given instruction mix.
I.e., to realize hyperthreading's performance "benefits," one is basically relying on the architecture being inherently unoptimized for certain instruction mixes; any architectural change that results in less stalling (and thus higher per-thread IPC) would actually result in lower thread-scaling performance and less advantage to implementing a hyperthreaded architecture.
I had always assumed (I never really worried about challenging that assumption until now) that hyperthreading was a simultaneous, not switched, sharing of the pipeline, with shared-resource contention being the cause of the lowered thread scaling.
But you are saying that if a program is well optimized to intentionally avoid incurring pipeline stalls, then thread scaling on an HT architecture actually suffers for it (not absolute performance: you lose thread-scaling performance but gain per-core performance).
So it could be said that a program which scales well in a hyperthreaded environment is just a really poorly coded/compiled program that incurs a plethora of pipeline stalls.
Or it could be said that the Nehalem/Westmere architecture is so poorly designed that, with some applications, the pipeline stalls are excessive to the point that implementing hyperthreading is about the only way to redeem the architecture from what would otherwise be a significant performance hit.
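To make my mental model concrete, here is a toy calculation (my own illustration, with entirely made-up numbers and a made-up `fill_efficiency` knob): the smaller the fraction of issue slots a thread loses to stalls, the less there is for a second hardware thread to reclaim.

```python
# Toy model, illustrative only: per-thread throughput as a fraction of
# peak issue capacity, with and without a second SMT thread.
def smt_uplift(stall_fraction, fill_efficiency=0.6):
    """stall_fraction: fraction of issue slots lost to stalls/misses.
    fill_efficiency: made-up knob for how many of those lost slots a
    second hardware thread can reclaim, net of SMT overhead."""
    single = 1.0 - stall_fraction                    # one thread alone
    smt = single + stall_fraction * fill_efficiency  # stalls partly filled
    return (smt - single) / single                   # relative uplift

for s in (0.4, 0.2, 0.05, 0.0):
    print(f"stall fraction {s:.0%}: SMT uplift {smt_uplift(s):+.1%}")
```

A stall-heavy mix (40% lost slots) shows a big uplift, a well-optimized one (5%) shows almost none, and at 0% stalls the uplift is exactly zero.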
Is this "view" correct?
I am far from being a software engineer, but I would say that you are generally correct.
The biggest takeaway that I get from my engineers is that the more efficient you make your system (i.e., the more you eliminate gaps in the processing chain), the less effective HT becomes.
One could argue (correctly) that we will probably never get to 0% cache misses, so theoretically there will always be a place for an SMT-type architecture. But look at the differences between client and server. SMT gives a bigger uplift on client because those apps are less threaded and *generally* less efficient; they were written for one big fast core instead of being written for multiple cores. Even in the world of single-core processors, server apps were multithreaded while client apps were not.
If you could guarantee an environment with 0% pipeline stalls and 0% cache misses, you'd kill SMT in a heartbeat. Why take the overhead when there is no benefit? Your pipelines would be full and a thread would never stall.
The only question to ask yourself is: do you believe people are working to make software more efficient or less efficient? If they are driving for higher efficiency, they are *probably* driving towards lower uplift from SMT over time.
Let's say a pipeline can handle 100 units of work at max. Because of things like cache misses and thread stalls you only get 80 units of work out of it. Now, with SMT you can load some additional work in during a stall, but there is a switching overhead (emptying/loading cache), so you get another 12 units of work out of it. 92 units out of a possible 100 units.
If you make your application and OS more efficient, you get an extra 6 units out of the pipeline. Then you make your cache larger and get another 4 units out. Now you are at 90 units. Because there is a physical limit of 100 units (only 10 units of headroom left) and there is some SMT overhead, maybe you get 6 units out of SMT. The good news is that you are at 96 units of work, so you are getting more done and are more efficient, but guess what? Your SMT uplift went from 15% (12 extra units on 80) to ~7% (6 extra units on 90).
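In case it helps, here is that arithmetic spelled out (Python used purely as a calculator; the numbers are the made-up ones above):

```python
# The made-up numbers from above, spelled out.
PEAK = 100               # physical limit of the pipeline, in "units"

base = 80                # what one thread gets after stalls and misses
smt_extra = 12           # extra units SMT squeezes in during stalls
print(f"before: {base} + {smt_extra} = {base + smt_extra} of {PEAK}, "
      f"SMT uplift {smt_extra / base:.0%}")

tuned = base + 6 + 4     # +6 from SW efficiency, +4 from a bigger cache
smt_extra_tuned = 6      # only 10 units of headroom left, minus overhead
print(f"after:  {tuned} + {smt_extra_tuned} = {tuned + smt_extra_tuned} of {PEAK}, "
      f"SMT uplift {smt_extra_tuned / tuned:.1%}")
```

which prints an uplift of 15% before and 6.7% after: the machine gets faster, and SMT matters less.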
Now, take that same 80-unit throughput and add a second core. 80 x 2 = 160 units. Even if you assume 90% scaling, you are at 152 units. THAT is how you increase throughput.
Take your 152-unit dual core and add the 4 units of cache and 6 units of SW enhancement. That adds 19 units (10 on the first core plus 9 on the second, assuming the same 90% scaling), or roughly 12% uplift, to a total of 171. You are getting more benefit out of both of those increases because each has its full impact. In the SMT world, those same enhancements drive the SMT benefit down, because you still have the same number of pipes.
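And the dual-core arithmetic the same way (same made-up units):

```python
# Dual-core version of the same made-up numbers.
base = 80                        # single-core throughput, in "units"
scaling = 0.90                   # assumed scaling on the second core

dual = base * (1 + scaling)      # 80 + 72 = 152 units
gain = 10 * (1 + scaling)        # +10 per core from cache/SW, 90% on core 2
print(f"dual core: {dual:.0f} units; enhancements add {gain:.0f} units "
      f"({gain / dual:.1%} uplift) -> {dual + gain:.0f} total")
```

Here both enhancements land at (nearly) full value on both cores, instead of eating into the SMT headroom.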
So the key is that more cores will give you bigger gains as you make architectural changes towards better efficiency. In an SMT environment, you are banking on inefficiency to make two threads work on one set of pipelines.
Before anyone starts the flames, all of these numbers are made up and for comparison only. But you should get the idea.