Rumour: Bulldozer 50% Faster than Core i7 and Phenom II.

Accord99 · Feb 16, 2011

maddie said:
Is it truly possible for 2 threads to use the same part of the pipeline at once? Really asking.

Can anyone explain this?

Hyperthreading enabled CPUs can execute instructions from two threads simultaneously:

http://arstechnica.com/old/content/2002/10/hyperthreading.ars/3

Castiel · Feb 16, 2011

Have there been any real benchmarks yet for BD or is it still fanboy hype?

maddie · Feb 16, 2011

Accord99 said:
Hyperthreading enabled CPUs can execute instructions from two threads simultaneously:

http://arstechnica.com/old/content/2002/10/hyperthreading.ars/3

Maybe I'm not explaining myself properly.

Is it possible to have a thread, real world or contrived, able to fill the pipeline stages so that only one thread can be simultaneously executed on a hyperthreading CPU?

IntelUser2000 · Feb 16, 2011

Please delete post.

JFAMD · Feb 16, 2011

Phynaz said:
I notice that none of the BD advocates mention the Xeon 7560. I wonder why they are afraid to make comparisons to that CPU?

Xeon 7560
Highest 4P SPECint_rate2006: 780

Opteron 6180 SE
Highest 4P SPECint_rate2006: 825

AMD ahead by 5%+

Thanks for pointing that out.

Arkadrel said:
Only 3 things matter:

performance + price + power used.

*IF*

2 core Intel cpu costs same as 4 core AMD cpu
2 core Intel cpu performance same as 4 coro AMD cpu
2 core Intel cpu power same as amd cpu

why not compaire them? they cost the same, they give same performance, they use same power.

your the one thats being silly about core vs cores haveing to be equal to compair cpus.....

for me its price/performance.... thats what I look for in a cpu, I dont care if it has 1 core or 16 cores, if its price is cheap, and its performance is fast! Thats the cpu Im buying, its the same for the rest of the world, except for fanboys.

Thank you for making my point for me.

Phynaz said:
Incorrect.

The two threads execute simultaneously. SMT - Simultaneous MultiThreading

http://www.realworldtech.com/page.cfm?ArticleID=RWT040208182719

While HT allows 2 threads to execute, your pipelines only have up to 100% capacity.

So if you are at 80% efficiency you have essentially only 20% left to give for the second thread. And as software becomes more tuned and more efficient (as it seems to do every year), the amount of upside comes down.

In really efficient workloads the percentage from the second thread may be 0%. So, it can run 2 threads, but they are two even threads nor is it two optimized threads.

Accord99 · Feb 16, 2011

maddie said:
Maybe I'm not explaining myself properly.

Is it possible to have a thread, real world or contrived, able to fill the pipeline stages so that only one thread can be simultaneously executed on a hyperthreading CPU?

No, both threads will get time. However in rare, highly optimized code, like parts of Prime95, Distributed.net's RC5 encryption breaking challenge or LinPack the two threads end up providing no more throughput than just 1 thread.

Jovec · Feb 16, 2011

maddie said:
Maybe I'm not explaining myself properly.

Is it possible to have a thread, real world or contrived, able to fill the pipeline stages so that only one thread can be simultaneously executed on a hyperthreading CPU?

You could theoretically design an application thread that manages to fill the entire pipeline every cycle, thus preventing execution of another thread (outside of the normal OS thread scheduling). Such an app would be contrived (in your parlance) with numerous wasteful, sub-optimal, and even unnecessary instructions solely designed to achieve this goal and would require understanding and modifying the assembly instructions outputted by the compiler and exactly how a given CPU arch executes them.

IntelUser2000 · Feb 16, 2011

JFAMD said:
In really efficient workloads the percentage from the second thread may be 0%. So, it can run 2 threads, but they are two even threads nor is it two optimized threads.

How many % of practical programs are out there that has 100% utilization? Outside of AMD chips, all the high end chips use some form of SMT. They use it because it works with enough applications to make the effort worth it.. The programs that gain performance using SMT are nearly always the same ones that gain from having more cores.

If everything did work so ideally as you believe, then we would be having an architecture akin to GPUs with hundreds of decoders and ALUs/FPUs rather than doing all the painful work of going OoOE, frequency gains and multi-core programming.

dmens · Feb 16, 2011

Pages of back and forth on SMT vs cores and "die area" is not even mentioned? Fail.

maddie · Feb 16, 2011

Therefore, after the last few posts, it is possible to claim that in some cases, a hyperthreading CPU will only be able to run 1 thread, at any precise point in time, whereas a bulldozer module can run 2.

Sure it can switch between 2 threads quickly, but we are not discussing this.

Point made.

Phynaz · Feb 16, 2011

maddie said:
Therefore, after the last few posts, it is possible to claim that in some cases, a hyperthreading CPU will only be able to run 1 thread, at any precise point in time, whereas a bulldozer module can run 2.

Sure it can switch between 2 threads quickly, but we are not discussing this.

Point made.

No, you still have it wrong. HT is always executing 2 threads. Remember Bulldozer also shares part of it's front end amongst cores. It also executes 2 threads simultaneously.

Phynaz · Feb 16, 2011

dmens said:
Pages of back and forth on SMT vs cores and "die area" is not even mentioned? Fail.

Hey! I did...When I was mentioning transistor budget.

maddie · Feb 16, 2011

Phynaz said:
No, you still have it wrong. HT is always executing 2 threads. Remember Bulldozer also shares part of it's front end amongst cores. It also executes 2 threads simultaneously.

Have you read at all how hyperthreading works?

Do you understand that 2 threads cannot use the same stages at the exact same time?

What is the problem? Don't get stuck on the word simultaneous.

Understand the context in which it is used and for what cases.

A thread that is not using all stages will allow another one to utilize the unused resources.

A thread that is using all stages at once will prevent another from executing, thus having only 1 thread able to run.

Phynaz · Feb 16, 2011

JFAMD said:
While HT allows 2 threads to execute, your pipelines only have up to 100% capacity.

So if you are at 80% efficiency you have essentially only 20% left to give for the second thread. And as software becomes more tuned and more efficient (as it seems to do every year), the amount of upside comes down.

In really efficient workloads the percentage from the second thread may be 0%. So, it can run 2 threads, but they are two even threads nor is it two optimized threads.

You are being disingenuous here John. The same constraints apply to the shared hardware in a BD module. Unless you are saying these is never a case that Bulldozer could become hardware constrained on the front end, or have its pipeline filled 100%.

Also, as you state, it is about overall system thoughtput. You are choosing your words carefully to imply the second thread runs slower than the first or sometimes doesn't run at all, and somehow overall system throughput is affected. You know that is false (excluding corner cases).

Be proud of AMD, you should be. Work your butt off to market your products (I'm sure you are). But spin is unbecoming. Saying in one post that you don't like benchmarks, and then posting a cherry picked benchmark to prove a point raises questions, you know?

Phynaz · Feb 16, 2011

maddie said:
Have you read at all how hyperthreading works?

Do you understand that 2 threads cannot use the same stages at the exact same time?

What is the problem? Don't get stuck on the word simultaneous.

Understand the context in which it is used and for what cases.

A thread that is not using all stages will allow another one to utilize the unused resources.

A thread that is using all stages at once will prevent another from executing, thus having only 1 thread able to run.

So how does Bulldozer do it then? Since it also has shared hardware in the front end, it must only be able to execute one thread at a time. After all, two threads cannot use the exact same hardware at the same time.

Or it could be that the guys that design CPUs are really smart, and they know how to schedule ops to use available resources optimally. You know, something like out of order execution.

If you were to ask my opinion, I would say you should do some reading on SMT before you make yourself look sillier. I suggest Jon Stoke's articles over at Ars. I believe somebody already posted the link.

maddie · Feb 16, 2011

Phynaz said:
So how does Bulldozer do it then? Since it also has shared hardware in the front end, it must only be able to execute one thread at a time. After all, two threads cannot use the same stages at the same time.

Or it could be that the guys that design CPUs are really smart, and they know how to schedule ops to use available resources optimally. You know, something like out of order execution.

If you were to ask my opinion, I would say you should do some reading on SMT before you make yourself look sillier. I suggest Jon Stoke's articles over at Ars. I believe somebody already posted the link.

To quote Kanter's article.

"To accommodate both cores, Bulldozer’s decode stage has been widened. Bulldozer can decode up to 4 instructions per cycle"

I feel I'm debating with someone who's designed a perpetual motion machine.

Phynaz · Feb 16, 2011

maddie said:
To quote Kanter's article.

"To accommodate both cores, Bulldozers decode stage has been widened. Bulldozer can decode up to 4 instructions per cycle"

I feel I'm debating with someone who's designed a perpetual motion machine.

I give. Anymore and I'll be banging my head against the wall.

You win.

IntelUser2000 · Feb 16, 2011

maddie said:
To quote Kanter's article.

"To accommodate both cores, Bulldozer’s decode stage has been widened. Bulldozer can decode up to 4 instructions per cycle"

I feel I'm debating with someone who's designed a perpetual motion machine.

IT DOESN'T MATTER.

Both Bulldozer/Nehalem/Sandy Bridge has 4 decoders. If you can't understand this you should stop talking rather than blurting out nonsense. AMD splits the ALUs into two seperate groups, one for each thread because they believe that's a better way, while Intel keeps whatever they have, but if there's a idle instruction in the pipeline the 2nd thread comes in to fill it.

It's called simultaneous because there are multiple ALUs. Let's say there's 4 of them.

ALU0-Idle
ALU1-Running
ALU2-Running
ALU3-Running

ALU0 is idle, while ALU1-3 is filled up by the first thread and is working fine. Hyperthreading brings in a second thread to work on ALU0 so its not idle. Because two threads can work together like that its called Simultaneous Multi-Threading.

JFAMD · Feb 16, 2011

Phynaz said:
You are being disingenuous here John. The same constraints apply to the shared hardware in a BD module. Unless you are saying these is never a case that Bulldozer could become hardware constrained on the front end, or have its pipeline filled 100%.

Also, as you state, it is about overall system thoughtput. You are choosing your words carefully to imply the second thread runs slower than the first or sometimes doesn't run at all, and somehow overall system throughput is affected. You know that is false (excluding corner cases).

Be proud of AMD, you should be. Work your butt off to market your products (I'm sure you are). But spin is unbecoming. Saying in one post that you don't like benchmarks, and then posting a cherry picked benchmark to prove a point raises questions, you know?

First, BD is designed with enough resources not to bottleneck.

Second, you asked for a benchmark and I showed you the benchmark, so don't accuse me of cherry picking. Why did I choose SPEC int? Because that is a processor-only benchmark. If you want to argue any other benchmarks you start down the path of the platform and the processor only becomes a part of the equation.

dmens · Feb 16, 2011

JFAMD said:
First, BD is designed with enough resources not to bottleneck.

No bottlenecks? Ever? Awesome!

</sarc>

Mopetar · Feb 16, 2011

dmens said:
No bottlenecks? Ever? Awesome!

</sarc>

I'm guessing he means that the architecture is tuned to the point that any bottlenecking is at some minimal threshold; essentially, there's not more execution pipelines than can be possibly fed by the scheduler and the decoder isn't going to be able to pile up instructions that the scheduler can't handle.

For something as complicated as an x86 CPU, there's always going to be some underutilized hardware, but the idea is the minimize it. If you do a really good job at it you won't bottleneck any part of the system to a significant extent.

BD has been in development long enough that AMD should have this aspect of the chip really well ironed out. It also makes sense if your design philosophy is focused on performance per watt. With that end goal in mind you never want to waste transistors that can't be fully utilized.

jvroig · Feb 16, 2011

maddie said:
AMD has developed a module that allows 2 threads at all times. You cannot separate the cores in a module. This also has a transistor penalty, but a VERY small one compared to 2 distinct cores.

It would seem obvious to any neutral observer that this is a very big break from traditional design layouts.

Not entirely. It is marketing speak. Any engineer could disassemble it to see that this achievement is not as earth-shattering as it is made to be by marketing. It means nothing at all, actually. It's just useless marketing factoid that has no bearing on engineering or technical merit.

In fact, we have gone through all of that before, like so:
http://forums.anandtech.com/showpost.php?p=30359920&postcount=241

See? Now the entire 5% for a new core isn't quite as exciting, seeing the metrics of Intel's quad-to-hexa core achievement (6-7% for a new core), It is nothing but an interesting (but useless) number given by marketing. Intel marketing could have done the same to Gulftown (woot! only 6% more die area per core), but they didn't bother probably because it doesn't mean squat.

There are also a lot more information in that thread, but it is so long (and with lots of fighting) that I would rather not quote all of them.

HW2050Plus · Feb 16, 2011

Phynaz said:
I notice that none of the BD advocates mention the Xeon 7560. I wonder why they are afraid to make comparisons to that CPU?

If calling me a Intel fanboy helps you sleep better at night have at it. I'll introduce you to the folks in the video forum that call me an AMD fanboy and you can argue which one I am with them.

In the mean time I will continue to call out making comparisons between two CPU's, when one has twice the number of processing units than the other as stupid. Especially when you bring into play the extremely limited amount of software that can take advantage of eight cores.

For Xeon 7560 which is a 8C/16T SERVER CPU for 4/8 socket systems the right comparison partner would be Bulldozer SERVER part Interlagos with equal 8M/16T and also for 4/8 socket systems. But to be fair we should wait for the upcoming Xeon based on SandyBridge.

Phynaz said:
Read what JF has posted and quit spreading false information. 1 AMD module = 2 cores.

AMD is not going to market modules, they are going to market cores.
Cores = Cores, get it?

Yes that is obviously something you do not understand. As I said the different marketing naming obviously confuses you. Intel Sandybridge 2600 e.g. provides 8 cores to the operating system. AMD Phenom II 970 provides 4 cores to the operating system. AMD Zambezi provides 8 cores to the operting system. So cores = cores, but the amount of cores I have in the OS and I can really use and not the marketing naming thingy.

LoneNinja said:
I know that's how I look at it, I don't care if 1 processor takes more cores to beat another as long as power consumption and price are similar.

I wouldn't even care if they are not similar. And if you look at the benchmark lists e.g. Anandtech never cared about that. They do not care if they compare a 12T i7-980 or a 8T i7-2600 with a 4T AMD 970 or 6T AMD 1100 where e.g. the i7-980 costs 5 times more.

And to understand the marketing naming differences you should know why this is:
AMD Bulldozer provides a number of equally fast cores so they name them cores.
Intel provides a number of very fast cores + a number of very slow cores that is why they only call their fast cores as cores and the slow cores as (hyper) threads.
Technically AMD splits most but not all of the core resources into 2 while for Intel one core gets all and the other gets also all but only if the "preferred" stalls. It is just this asymetric/symetric stuff. And because AMD calls their half core as core they "invented" the new name of module for what is really a core. From a marketing perspective this is clever because they can show that they have just more cores. But that is okay since the way AMD does core doubling ("module technology") is way superior to Intel's way of doubling cores ("Hyper Threading").

Intel could as well change their hyper threading to be symetrically and then marketing their core doubling as cores. However this will not change overall performance.

dmens · Feb 16, 2011

Mopetar said:
I'm guessing he means that the architecture is tuned to the point that any bottlenecking is at some minimal threshold; essentially, there's not more execution pipelines than can be possibly fed by the scheduler and the decoder isn't going to be able to pile up instructions that the scheduler can't handle.

For something as complicated as an x86 CPU, there's always going to be some underutilized hardware, but the idea is the minimize it. If you do a really good job at it you won't bottleneck any part of the system to a significant extent.

BD has been in development long enough that AMD should have this aspect of the chip really well ironed out. It also makes sense if your design philosophy is focused on performance per watt. With that end goal in mind you never want to waste transistors that can't be fully utilized.

I get that, and what I'm saying is that no amount of tweaking will be able to completely break down hard restrictions like page faults, which means on average SMT will always yield a gain at a significantly less cost than an entire new core. The amount of gain and the design cost is the issue, but even with P4 the gain was tangible, and even more so with Nehalems and onwards.

As far as I'm concerned the argument over SMT is not whether it yields benefit or not, but whether it is worth the design cost. The former issue has been long settled and anything to the contrary is (imo) marketing fud.

Mopetar · Feb 16, 2011

dmens said:
I get that, and what I'm saying is that no amount of tweaking will be able to completely break down hard restrictions like page faults

Use a ridiculous amount of RAM and never page anything out?

Yeah, you'll always be able to squeeze a little out of HT, but considering it's an extra $100 for the 2600K I'm not sure if that little bit is worth the extra dosh.

Would be interesting to see AMD add it to their chips in some capacity. I'm guessing that it's not a patent issue as the two companies have some cross licensing deals in place as I recall. If it's so much extra possible performance for so little transistor cost why not toss it in?

Rumour: Bulldozer 50% Faster than Core i7 and Phenom II.

Platinum Member

Golden Member

Diamond Member

Elite Member

Senior member

Platinum Member

Senior member

Elite Member

Platinum Member

Diamond Member

Lifer

Lifer

Diamond Member

Lifer

Lifer

Diamond Member

Lifer

Elite Member

Senior member

Platinum Member

Diamond Member

Platinum Member

Member

Platinum Member

Diamond Member