Bulldozer has 33% less ALUs than K10

Accord99 · May 6, 2010

IntelUser2000 said:
I am not understanding you IDC. On that Euler3D example its comparing 2 quad core Opterons with single Core i7 CPU that supports SMT. Yes, scaling is lower, but you are getting lower prices and having to buy only 1 die for it.

Plus, while scaling can be useful for future extrapolation and technical discussions, ultimately what matters is performance. And comparing results here:

http://www.techreport.com/articles.x/15905/9
http://www.techreport.com/articles.x/15818/13

While HT only increases throughput by about 13%, the raw power of a single Nehalem core in this particular application is such that a single quad-core Nehalem is significantly faster two quad-core Shanghais or Harpertown.

On that specific case, it would be better to have x cores than x threads instead. But since JFAMD brought server into the discussion, if the app supports nearly limitless threads, the only limit here really is how far from theoretical ILP it can really extract.

Also, we can see it would be better to have x powerful cores than x weaker cores as evidenced by 8 2.7GHz Shanghai cores outperforming 12 2.6GHz Istanbul cores and no doubt outperform Magny Cours as well in this particular test.

soccerballtux · May 6, 2010

Idontcare said:
Interesting, I had not really thought much about it before, but in essence what you are saying is that one could view the productivity gained by hyperthreading as an indicator of just how much (frequency x duration) the architecture stalls with a given instruction mix.

I.e. to realize hyperthreading performance "benefits" one is basically relying on an architecture being inherently unoptimized when it comes to certain instruction mixes...any changes in the architecture resulting in less stalling (and thus higher thread-level IPC) would actually result in lower thread scaling performance and less advantages of implementing a hyperthreaded architecture.

I had always assumed (never really worried about challenging my assumption on this until now) that hyperthreading was simultaneous, not switched, pipeline with shared resource contention being the cause of lower(ed) thread scaling.

But you are saying if a program is well optimized to intentionally avoid inducing/incurring pipeline stalls then thread scaling on an HT architecture actually suffers (not absolute performance, you lose thread scaling performance but you gain per-core performance) for it.

So it could be said that a program which scales well in a hyperthreaded environment is just a really poorly coded/compiled program which is inducing/incurring a plethora of pipeline stalls.

Or it could be said that the nehalem/westmere architecture is so poorly designed that with some applications the pipeline stalls are excessive to the point that implementing hyperthreading is about the only way to redeem the architecture from what would otherwise be a significant performance impact.

Is this "view" correct?

I said something like what you are theorizing about a few months ago-- we could code an application which would scale tremendously on the i7's Hyperthreading.
I think the benefits are greater on Nehalemn thanks to the limited cache. 256kb L2 and that's all.

This is what I think is going on with Farcry 2's benchmark results. It scales too well with hyperthreading. I prefer to see Crysis benchies for gaming CPU decisions.

soccerballtux · May 6, 2010

iCyborg said:
I thought that the main reason for HT was the cost of context switches, on a typical X86/X64 system they are in the order of something like 10,000 cycles. And it's hard to optimize some stuff, like cache misses or branch mispredictions which will stall the running thread. Switching to another thread in a fraction of a time can make a difference. I don't think a program who can't keep everything in a few MB of L2/L3 cache is poorly coded or that it's a poorly designed architecture thing.
It's not uncommon for code that needs to wait for a resource like a mutex lock to do spin waiting for a little while, i.e. wasting cycles, before going to a proper wait - often the locks are released quickly, so it's better to waste some cycles than incurring 2 context switches (out and back in later) on a non-HT system.

If we coded a bunch of nested while loops for mathematics calculations that we jump in and out of frequently; instead of using for loops, then one could call that "poor" coding. Poor because the compiler can minimize branch mispredictions if you're writing for loops, but it can't for while loops (since it has no knowledge at compile-time of how large the variable inside the while condition will be). Does it matter? Not really, computers are so fast these days. But for an exercise it could prove useful.

Kantastic · May 6, 2010

Borealis7 said:
will AMD have a fast dual/quad CPU for gaming?
if i choose AMD, then i'd rather my next upgrade be a faster quad (with turbo, of course) than a slower X6,8,whatever

The new X6's run cooler and clock higher than current X4's.

soccerballtux · May 6, 2010

soccerballtux said:
I said something like what you are theorizing about a few months ago-- we could code an application which would scale tremendously on the i7's Hyperthreading.
I think the benefits are greater on Nehalemn thanks to the limited cache. 256kb L2 and that's all.

This is what I think is going on with Farcry 2's benchmark results. It scales too well with hyperthreading. I prefer to see Crysis benchies for gaming CPU decisions.

it's why the Atom architecture scales up to 50% with the addition of Hyperthreading. No out of order execution, so lots of stalls. It was a great solution; OoOE hardware is very expensive (lots of transistors). IIRC the HT is something like 10% the number of transistors they would have needed for OoO.

JFAMD · May 6, 2010

IntelUser2000 said:
Try this

I see where you are getting at. Differences in implementation to get efficiency up, but the final goals are similar.

You can't rely on perfect code and perfect environment. The whole world works on the fact that nothing is perfect. Ironically, if perfect code could have been written we might have seen very wide high clocked ILP processors instead.

Whether client or server benefits more from SMT is too hard to say and very dependent on app. Actually from Nehalem, SMT benefits more relevant crowd for servers than on PC because practical ILP is far away from theoretical ILP and apps that gain a lot from multi processor systems generally gain on SMT as well.

Why do just x cores when you can do x cores + x threads?

It is a design philosophy.

Because the entire architecture needs to be designed from the ground up to be optimized for that. You don't just "bolt" SMT on to the architecture, it impacts how you design your data paths, how you design your pipelines and how you design your caches.

It's like baking. You start with dough, and what you put in it and how you handle it determines whether it rises to be bread or stays flat for a pizza crust. They're both just flour, egg and water, right? Well, once the bread has risen, it makes a really crappy pizza crust and if you don't put the yeast in to let it rise, good luck making a sandwich out of two slices of pizza crust.

soccerballtux · May 6, 2010

IntelUser2000 said:
I am not understanding you IDC. On that Euler3D example its comparing 2 quad core Opterons with single Core i7 CPU that supports SMT. Yes, scaling is lower, but you are getting lower prices and having to buy only 1 die for it.

Maybe if you want to approach in terms of design it makes a bit of sense. TLP optimized cores are likely to be simpler than ILP optimized cores, while the throughput is likely higher on a TLP optimized core.

Or perhaps you are talking about running software that has max thread limit: http://anandtech.com/show/2774/7

On the MCS eFMS 9.2 Website, it does not scale beyond 8 cores(threads) and therefore the HT-enabled Dual Xeon turns out slower than the HT-disabled Dual Xeon and is only few single digit % faster than a HT-enabled Single Xeon.

On that specific case, it would be better to have x cores than x threads instead. But since JFAMD brought server into the discussion, if the app supports nearly limitless threads, the only limit here really is how far from theoretical ILP it can really extract.

I believe he is showing the comparison so we can see that it can indeed scale to 8 threads-- if there is real hardware to be used. When there is not real extra hardware (as is the case with Nehalem, and especially so when you're running optimized code without many stalls), it cannot scale.

iCyborg · May 6, 2010

soccerballtux said:
If we coded a bunch of nested while loops for mathematics calculations that we jump in and out of frequently; instead of using for loops, then one could call that "poor" coding. Poor because the compiler can minimize branch mispredictions if you're writing for loops, but it can't for while loops (since it has no knowledge at compile-time of how large the variable inside the while condition will be). Does it matter? Not really, computers are so fast these days. But for an exercise it could prove useful.

Thanks for the example, I guess I missed the point here, it's not that HT can't be genuinely useful, but if it scales really well, there's a good chance the app isn't optimized well.

Phynaz · May 6, 2010

Idontcare said:
It had completely escaped me until now that hyperthreading performance lives and dies within the envelope of pipeline stalls

This isn't 100% true in my understanding.

Think of a wide execution block - something like Nahalem or even wider. Although the pipeline may be completely full there may not be enough ILP available in one thread to keep the back end fully occupied. By bringing in a second thread to execute you can fill up some of those idle units. A well implemented scheduler will track instruction mix and juggle the instruction order from both threads to try to keep the back end fully utilized.

Vesku · May 6, 2010

Intel has a lot more engineering resources than AMD. It makes sense for Intel to have a team that works on SMT to get the most out of their chips with current software. In fact, it probably helps alleviate the too many fingers in the pie syndrome. Seems that AMD made the decision to focus on just improving the chip.

Makes sense to me, take your smaller resources and focus them all on the chip. Also, as JFAMD pointed out Intel's SMT approach has a direct impact on chip design. Some decisions made for SMT might actually be counterproductive in the future. If AMD makes some good design choices they may actually see another A64 in the future. I definitely hope their sales team is ready for it.

dmens · May 6, 2010

Idontcare said:
Interesting, I had not really thought much about it before, but in essence what you are saying is that one could view the productivity gained by hyperthreading as an indicator of just how much (frequency x duration) the architecture stalls with a given instruction mix.

I.e. to realize hyperthreading performance "benefits" one is basically relying on an architecture being inherently unoptimized when it comes to certain instruction mixes...any changes in the architecture resulting in less stalling (and thus higher thread-level IPC) would actually result in lower thread scaling performance and less advantages of implementing a hyperthreaded architecture.

I had always assumed (never really worried about challenging my assumption on this until now) that hyperthreading was simultaneous, not switched, pipeline with shared resource contention being the cause of lower(ed) thread scaling.

But you are saying if a program is well optimized to intentionally avoid inducing/incurring pipeline stalls then thread scaling on an HT architecture actually suffers (not absolute performance, you lose thread scaling performance but you gain per-core performance) for it.

So it could be said that a program which scales well in a hyperthreaded environment is just a really poorly coded/compiled program which is inducing/incurring a plethora of pipeline stalls.

Or it could be said that the nehalem/westmere architecture is so poorly designed that with some applications the pipeline stalls are excessive to the point that implementing hyperthreading is about the only way to redeem the architecture from what would otherwise be a significant performance impact.

Is this "view" correct?

Few comments.

- SMT and scaling do not belong in the same sentence. SMT is opportunistic. Cores scale. The die size increase of SMT to a core is what, 5-10% vs 100%? Unfair comparison. The real question is, does SMT return a positive ratio on performance vs power and die size? Even with P4, it did. Even more so with nehalem onwards. That's a design win in my book.

- SMT is certainly simultaneous in parts of the pipeline, but not all, some resources are shared, some are duplicated. ALU's are simultaneous. Most codestreams cannot fill all the execution ports in a given cycle. SMT allows simultaneous dispatch and execution. No threads need to be stalled for that to occur.

- On the other hand, shared resources do depend more on stalling (and many, many other things) for SMT to "get a win". Those resources are usually enhanced to absorb the impact of SMT, but they are *NOT* serial. They all hold data from both threads at the same time. In any case, interruptions occur all the time in runtime. Give me a perfect app and an oracle core, I'll show you how to turn off SMT. It's easy.

Lastly, SMT is not a design philosophy... it's a relatively simple method to extract performance. SMT can be successfully implemented with many architectures.

Nemesis 1 · May 6, 2010

JFAMD said:
It is a design philosophy.

Because the entire architecture needs to be designed from the ground up to be optimized for that. You don't just "bolt" SMT on to the architecture, it impacts how you design your data paths, how you design your pipelines and how you design your caches.

It's like baking. You start with dough, and what you put in it and how you handle it determines whether it rises to be bread or stays flat for a pizza crust. They're both just flour, egg and water, right? Well, once the bread has risen, it makes a really crappy pizza crust and if you don't put the yeast in to let it rise, good luck making a sandwich out of two slices of pizza crust.

Thats how AMD makes bread with pizza dough. Intel uses yeast and it rises

veri745 · May 6, 2010

dmens said:
Few comments.

- SMT and scaling do not belong in the same sentence. SMT is opportunistic. Cores scale. The die size increase of SMT to a core is what, 5-10% vs 100%? Unfair comparison. The real question is, does SMT return a positive ratio on performance vs power and die size? Even with P4, it did. Even more so with nehalem onwards. That's a design win in my book.

I'm just going to address this point, and maybe JF can explain a little more.

With Bulldozer, hasn't it already been stated that the addition of an entire extra integer "core" takes only a minuscule amount of silicon real estate, comparable to what HyperThreading uses in Nehalem?

Only the integer pipelines are duplicated, not the FP unit or front/back ends.

In that case, it seems like the extra "core" is a comletely fair comparison to SMT.

JFAMD · May 6, 2010

Yes. Take a bulldozer die with 8 cores. Pull 4 integer cores out and that silicon represents ~5% of the total silicon of the die.

Idontcare · May 6, 2010

soccerballtux said:
I believe he is showing the comparison so we can see that it can indeed scale to 8 threads-- if there is real hardware to be used. When there is not real extra hardware (as is the case with Nehalem, and especially so when you're running optimized code without many stalls), it cannot scale.

^ This...was just showing proof that thread level parallelism exists to a degree that is substantially higher than that extracted with the Nehalem architecture as evidenced by the thread scaling on an opteron system.

The red line indicates where the max-scaling resides for the specific application (i.e. the max achievable scaling regardless the hardware architecture).

Fox5 · May 6, 2010

So how will the single thread performance of bulldozer be? Sounds like it's still not really a concern unless the rumored reverse hyper threading is finally going to show its face. I'm sure bulldozer will achieve very high resource utilization, but if the integer resources are only 2 wide, compared to 3 wide on phenom II, and 4 wide (sometimes 5) on intel's cores, it seems like it would still be a good percentage slower. Would it be safe to safe bulldozer single thread performance will be about the same per clock as phenom II? (less width, but other improvements to increase utilization?)

dmens · May 6, 2010

veri745 said:
I'm just going to address this point, and maybe JF can explain a little more.

With Bulldozer, hasn't it already been stated that the addition of an entire extra integer "core" takes only a minuscule amount of silicon real estate, comparable to what HyperThreading uses in Nehalem?

Only the integer pipelines are duplicated, not the FP unit or front/back ends.

In that case, it seems like the extra "core" is a comletely fair comparison to SMT.

If the frontend is not duplicated, it's not an extra core.

Maybe I'm missing something, but isn't it rather dangerous to cut the number of dedicated execution resources w.r.t. the last generation when you really want to be at least as good the last gen? On some workloads, this strategy may backfire.

In any case, sharing the integer execution resources with other cores would be a very neat trick, but I fail to see how this ALU-sharing idea has anything to do with SMT. Two different ideas with totally different engineering challenges, as far as I can tell.

JFAMD · May 6, 2010

Single thread performance should only be considered if you are going to also consider single thread price and single thread power.

Price/performance/watt will be excellent.

JFAMD · May 6, 2010

I'm sure bulldozer will achieve very high resource utilization, but if the integer resources are only 2 wide, compared to 3 wide on phenom II, and 4 wide (sometimes 5) on intel's cores, it seems like it would still be a good percentage slower.

Integer resources are not 2 wide.

Nemesis 1 · May 6, 2010

. Well this has been a great read . I don't want to Ruin this thread . So I would like to ask . On sandy Intel uses AVX with a Vexprefix . AMD is using and forgive I not taking time to look XOD or something like that. Could JFAMD if you would explain the differance to me in a meaningful way

Seferio · May 6, 2010

JFAMD said:
Single thread performance should only be considered if you are going to also consider single thread price and single thread power.

Price/performance/watt will be excellent.

Since not all computing applications scale well with multiple cores, I think just having an idea of how single threaded performance will compare with Intel's existing architecture in Nehalem would be interesting.

Accord99 · May 6, 2010

Idontcare said:
^ This...was just showing proof that thread level parallelism exists to a degree that is substantially higher than that extracted with the Nehalem architecture as evidenced by the thread scaling on an opteron system.

The red line indicates where the max-scaling resides for the specific application (i.e. the max achievable scaling regardless the hardware architecture).

But that's not the goal of Hyperthreading, to match the throughput increase of another core. The point of Hyperthreading is that using more threads than cores shows a performance increase. And using a relative scale ignores the significantly greater performance of Nehalem (its easier to scale when you're slow to begin with) and a relatively modest % increase from Hyperthreading 4 core Nehalems in absolute terms is almost as great as the increase from going from 4 cores to 6 on Thuban for Euler 3D.

soccerballtux · May 7, 2010

Accord99 said:
But that's not the goal of Hyperthreading, to match the throughput increase of another core. The point of Hyperthreading is that using more threads than cores shows a performance increase. And using a relative scale ignores the significantly greater performance of Nehalem (its easier to scale when you're slow to begin with) and a relatively modest % increase from Hyperthreading 4 core Nehalems in absolute terms is almost as great as the increase from going from 4 cores to 6 on Thuban for Euler 3D.

only with poorly optimized code.

soccerballtux · May 7, 2010

Seferio said:
Since not all computing applications scale well with multiple cores, I think just having an idea of how single threaded performance will compare with Intel's existing architecture in Nehalem would be interesting.

TBH, as long as we keep getting more cores, I'm not worried.
1). AMD wouldn't be developing this architecture if their PhDs hadn't found that it could increase throughput for cheaper than adding more cores
2). My single threaded performance is fast enough for everything I need. Firefox is dog slow, no matter how fast of a core you throw at it. I have about 20 tabs that load as my homepage and Firefox goes unresponsive for seconds at a time; it even freezes the Windows 7 UI and I can't drag the window from one monitor to the other. All while only using 25% of my available resources-- only using one core.

Alternatively, I open the same 20 tabs in Chrome, and all 4 of my cores peg to 100% for about 3 seconds until all the pages are finished rendering. Love it.

Accord99 · May 7, 2010

soccerballtux said:
only with poorly optimized code.

What makes Euler unoptimized? The only real multi-threaded applications that HT doesn't show improvements in are things like LINPACK, which are well optimized enough that they already maximize the floating point throughput of a core and stuff like encryption that are dependent on a few key instructions. Otherwise HT shows measurable gains in rendering, video encoding, computation, distributed computing, database and commercial server applications.

Bulldozer has 33% less ALUs than K10

Platinum Member

Lifer

Lifer

Platinum Member

Lifer

Senior member

Lifer

Golden Member

Lifer

Diamond Member

Platinum Member

Lifer

Golden Member

Senior member

Elite Member

Diamond Member

Platinum Member

Senior member

Senior member

Lifer

Member

Platinum Member

Lifer

Lifer

Platinum Member