Rumour: Bulldozer 50% Faster than Core i7 and Phenom II.


HW2050Plus

Member
Jan 12, 2011
168
0
0

It's especially funny, because the very passage you quoted proves I am correct:
A second goal was to ensure that when one logical processor is stalled the other logical processor could continue to make forward progress. A logical processor may be temporarily stalled for a variety of reasons, including servicing cache misses, handling branch mispredictions, or waiting for the results of previous instructions. Independent forward progress was ensured by managing buffering queues such that no logical processor can use all the entries when two active software threads[2] were executing. This is accomplished by either partitioning or limiting the number of active entries each thread can have.

[2] Active software threads include the operating system idle loop because it runs a sequence of code that continuously checks the work queue(s). The operating system idle loop can consume considerable execution resources.
You should also have had a look at figure 3.
Obviously you understand neither this text nor my posts, but at least look at the figures (3 and 4), which are more comprehensible.

To explain the meaning of what you highlighted:
Independent forward progress was ensured by managing buffering queues such that no logical processor can use all the entries when two active software threads[2] were executing.
This means that the current results of the pipeline stages are stored in the mentioned buffering queues, so that when a thread switch occurs those buffered results can be reused to minimize the switching penalty (otherwise the slow/waiting thread would have to run through the whole pipeline again to continue). See also figure 4 of this document to better understand the description you highlighted.

I'll try again, from the perspective of two programs running as two threads:

Intel Hyperthreading:

priority/fast thread execution - slow thread execution
(priority/fast because it is the one allowed to execute)
(slow because it is the one that has to wait)
add rdx,r9 - waiting
add rdx,rax - waiting
imul rdx,rax - waiting
mov rax,rcx - waiting
mov rdx,r10 - waiting
shl rcx,3 - waiting
add rdx,r9 - waiting
add rdx,rax - waiting
imul rdx,rax - waiting
mov rax,rcx - waiting
mov rdx,r10 - waiting
shl rcx,3 - waiting
add rdx,r9 - waiting
add rdx,rax - waiting
imul rdx,rax - waiting
mov rax,rcx - waiting
mov rdx,r10 - waiting
shl rcx,3 - waiting
;Theoretically you could continue this endlessly, with the result that the
;fast/priority thread always executes and the slow thread never executes
;(always waiting)
;but in real code something like the following occurs:
mov rax,qword ptr [xxxxxxxx] - waiting
* see below - now left is slow and right is priority
stall (L1 cache miss) - add rdx,r9
stall (L1 cache miss) - mov rdx,r10
stall (L1 cache miss) - add rdx,r9
stall (L1 cache miss) - add rdx,r9
stall (L1 cache miss) - mov rdx,r10
stall (L1 cache miss) - add rdx,r9
stall (L1 cache miss) - add rdx,r9
stall (L1 cache miss) - mov rax,qword ptr [xxxxxxxx]
stall (L1 cache miss) - stall (L1 cache miss)
stall (L1 cache miss) - stall (L1 cache miss)
* see below - now again left is fast and right is slow
add rdx,r9 - waiting (+ stalling)
add rdx,rax - waiting (+ stalling)
imul rdx,rax - waiting (+ stalling)
mov rax,rcx - waiting (+ stalling)
mov rdx,r10 - waiting (+ stalling)
shl rcx,3 - waiting (+ stalling)
add rdx,r9 - waiting (+ stalling)
add rdx,rax - waiting (+ stalling)
imul rdx,rax - waiting (+ stalling)
mov rax,rcx - waiting
mov rdx,r10 - waiting
shl rcx,3 - waiting
add rdx,r9 - waiting
add rdx,rax - waiting
imul rdx,rax - waiting
mov rax,rcx - waiting
mov rdx,r10 - waiting
shl rcx,3 - waiting
As you can see, the two threads NEVER EVER execute at the same time! That is simply not possible by design. Just count the instructions executed by the fast/priority thread versus the slow thread, and you will see why I call them fast and slow.

I think the above is quite good, as it also shows why, and from what, you get a performance benefit from Hyperthreading.

* However, as I already described, in current Intel HT implementations the priority is flipped after each switch, which is why for most workloads the two threads appear to run at roughly the same speed. But that is just a statistical averaging-out of this fast/slow thread behaviour.
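
If you want to play with this pattern, here is a toy Python sketch of the behaviour described above. To be clear: it models this post's description (one thread owns the pipeline until it stalls), not real SMT hardware, and the instruction streams are invented.

Code:
# Toy model of the stall-driven switching described above: the priority
# thread issues every cycle until it stalls; only during its stall
# cycles does the waiting thread get to issue.
fast = ["add rdx, r9", "imul rdx, rax", "mov rax, qword ptr [x]",
        "stall", "stall", "stall", "add rdx, r9", "shl rcx, 3"]
slow = ["add rdx, r9", "mov rdx, r10", "add rdx, r9"]

j = 0
for op in fast:
    if op != "stall":
        print(f"{op:<24} - waiting")
    else:
        # Priority thread stalls: now (and only now) the other issues.
        right = slow[j] if j < len(slow) else "stall (L1 cache miss)"
        j += 1
        print(f"{'stall (L1 cache miss)':<24} - {right}")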

In comparison, a real symmetric multithreading (as in the UltraSPARC T1):
thread 1 - thread 2
add rdx,r9 - waiting
waiting - add rdx,r9
add rdx,rax - waiting
waiting - mov rdx,r9
imul rdx,rax - waiting
waiting - shl rdx,3

There (e.g. in the UltraSPARC T1) you do not have a priority/fast thread and a slow thread; all threads are equal.

compared with
AMD Core Multithreading (CMT):

thread 1 - thread 2
add rdx,r9 - imul rcx,rax
add rdx,rax - add rbx,rcx
imul rdx,rax - xor rdx,rdx
mov rax,rcx - mov rax,rcx
mov rdx,r10 - mov rdx,r10
shl rcx,3 - shl rcx,3
add rdx,r9 - add rdx,r9
add rdx,rax - add rdx,rax
imul rdx,rax - imul rdx,rax
mov rax,rcx - mov rax,rcx
mov rdx,r10 - mov rdx,r10
shl rcx,3 - shl rcx,3
add rdx,r9 - add rdx,r9
add rdx,rax - add rdx,rax
imul rdx,rax - imul rdx,rax
mov rax,rcx - mov rax,rcx
mov rdx,r10 - mov rdx,r10
shl rcx,3 - shl rcx,3
mov eax,dword ptr [xxxxxxxx] - shl rbx,3
stall (L1 cache miss) - add rdx,r9
stall (L1 cache miss) - mov rdx,r10
stall (L1 cache miss) - add rdx,r9
stall (L1 cache miss) - add rdx,r9
stall (L1 cache miss) - mov rdx,r10
stall (L1 cache miss) - add rdx,r9
stall (L1 cache miss) - mov rax,qword ptr [xxxxxxxx]
stall (L1 cache miss) - stall (L1 cache miss)
stall (L1 cache miss) - stall (L1 cache miss)
add rdx,r9 - stall (L1 cache miss)
add rdx,rax - stall (L1 cache miss)
imul rdx,rax - stall (L1 cache miss)
mov rax,rcx - stall (L1 cache miss)
mov rdx,r10 - stall (L1 cache miss)
shl rcx,3 - shl rcx,3
add rdx,r9 - add rdx,r9
add rdx,rax - add rdx,rax
imul rdx,rax - imul rdx,rax
mov rax,rcx - mov rax,rcx
mov rdx,r10 - mov rdx,r10
shl rcx,3 - shl rcx,3
add rdx,r9 - add rdx,r9
add rdx,rax - add rdx,rax
imul rdx,rax - imul rdx,rax
mov rax,rcx - mov rax,rcx
mov rdx,r10 - mov rdx,r10
shl rcx,3 - shl rcx,3

As you can see, with CMT both threads can execute at the same time, and a thread is never left in a waiting state. That is why this is so fast, and why it is so near to a real core that AMD simply renamed the threads to cores.

Now you should know everything you need to know about HT and CMT (and even symmetric MT).
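
The same kind of toy sketch contrasts CMT and symmetric MT as described above (again a model of the descriptions in this post, not of real hardware; the instruction streams are invented):

Code:
# Toy contrast of the two other policies described above.
from itertools import zip_longest

t1 = ["add rdx, r9", "add rdx, rax", "imul rdx, rax"]
t2 = ["imul rcx, rax", "add rbx, rcx", "xor rdx, rdx"]

print("CMT: two independent integer pipes, both issue every cycle")
for a, b in zip_longest(t1, t2, fillvalue="idle"):
    print(f"  {a:<16} - {b}")

print("Symmetric MT (UltraSPARC T1 style): one pipe, strict alternation")
for cycle in range(2 * len(t1)):
    thread, idx = cycle % 2, cycle // 2
    op = (t1, t2)[thread][idx]
    left, right = (op, "waiting") if thread == 0 else ("waiting", op)
    print(f"  {left:<16} - {right}")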

It sounds like you've never used a Hyperthreading processor; regardless of time frame, each thread gets roughly half, and there is no fast thread or slow thread.
See above and/or my previous posts; there you also find the explanation of why, viewed from a macro level, the threads appear to get roughly half each. I also showed above a technique, symmetric MT, which ensures that they get exactly half (not merely roughly half, averaged over a long period).

The thing that some keep missing is that HT is -5 to 30%, averaging around 20% of "X"
This 20% is your estimate, but let's assume it.

while CMT is 80% of "Y".
This 80% is likewise an estimate (from AMD), but let's assume it.

What if 1.2X > 1.8Y?
Then the CPU with 1.2X would be faster.

1.2X Westmere already beats 2.0Y of Magny Cours.
You are cleverer than this. No mention that Magny-Cours is an especially handicapped part, low-clocked and flipped together from two dies; and your statement is wrong besides:

SPECint_rate2006:
Magny Cours 24 core: 392
Intel 24 core: 548
That is just 548/392 => 40% faster, and that includes the flipping disadvantage for AMD. Without the flipping it would be even less.
 
Last edited:

IntelUser2000

Elite Member
Oct 14, 2003
8,686
3,787
136
I'm not sure if I'm correct here, but isn't a Bulldozer module's front end bigger than an SB core's, because the FP unit has its own scheduler whereas SB's doesn't?

Traditionally, Intel focused more on Integer resources and AMD did on FP. I'd expect it to continue with Bulldozer.

Accord99 said:
That's debatable, to match a 3.4GHz Sandy Bridge's single-threaded performance, the K10 core would need to be clocked at somewhere near 5 GHz. That's a big jump to make.

Are you implying there's close to 50% advantage in single thread performance for Sandy Bridge compared to Deneb? Well that's not true.
http://www.anandtech.com/bench/Product/288?vs=102

SysMark 2007, Windows Media Encoder 9 x64, and 7-Zip real-world compression seem to be good indicators of light-thread-count performance.

I'd think SysMark and WME 9 would be typical of how Sandy Bridge fares thread vs. thread. Remember, Nehalem itself was a good advance over Penryn overall, but the data was skewed by the really high gains in multi-threaded apps. Hyperthreading, and an architecture generally better suited to multi-threaded performance, help with that.

Single thread IPC

Nehalem vs. Penryn: 5-10%
Penryn vs. Deneb: 5%
Sandy Bridge vs Lynnfield: 10-15%

Final tally comes out to be 30-35% advantage of Sandy Bridge over Deneb, which is roughly what SysMark/WME9 shows.
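
Compounding those three ranges is quick to check (the percentages are the ones listed above):

Code:
# Compound the per-generation single-thread IPC gains listed above.
low  = 1.05 * 1.05 * 1.10   # low end of each of the three ranges
high = 1.10 * 1.05 * 1.15   # high end of each of the three ranges
print(f"Sandy Bridge over Deneb: +{low - 1:.0%} to +{high - 1:.0%}")
# -> roughly +21% to +33%, in the same ballpark as the 30-35% tally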

When AMD claims an 80% gain, it's a maximum gain. It's incorrect to call Hyperthreading 5-30% and CMT 80%. Hyperthreading can gain up to 30%.

I think there's rather a good chance that Bulldozer will have higher performance with equal threads than Deneb. The architecture isn't focused as much on per thread performance as Nehalem or Sandy Bridge is, but it seems like an advancement nevertheless.
 

HW2050Plus

Member
Jan 12, 2011
168
0
0
Otherwise, it's hard to continue a healthy debate when you insist on your own terms and wonder why people don't see your point.
You should read my posts:
you may use either AMD terms, or Intel terms, or AMD terms for AMD only and Intel terms for Intel only. But what Phynaz did is something else: you may not use AMD terms for AMD only and Intel terms for Intel only and then compare the terms between them. That will lead to wrong conclusions.
Again, all this would be a superfluous discussion had Phynaz not brought up this:
Really, would you compare that 2600K to an Athlon X2??
But I think we can close this, as everything has been said already, and Phynaz also wrote:
I agree it's fair to compare similarly priced CPUs.

How much does Bulldozer cost?
So I think this is already settled. And this time Phynaz's argument is valid. Let's wait for the pricing.

Originally Posted by HW2050Plus
I mean AMD could even implement Intel's Hyperthreading Technology into Bulldozer resulting then in a 4C/16T CPU and squeeze out some more performance as Intel did it with their HT.
Congrats. You've just proven you have no idea what you are talking about.
No. As you do not mention a single argument for why anything I wrote is incorrect, I would say you have just proven that you have no idea, and that your post exists only to disqualify others without reason or argument.

CMT is basically SMT with fewer shared resources anyway when you get down to it.
I think that is the cause of the confusion. What you say is not incorrect if you generously call it an oversimplification: because "resources" is a very general term, the statement is correct; but since the two techniques involve completely different resources, they are very different when you look deeper.

So the technique by which this is achieved is very different. And that is why you can combine CMT and HT.

Hyperthreading relies on pipeline stalls, and NO pipeline stage is doubled. CMT means doubling pipeline stages, whereas HT means latching pipeline stages (see ftp://download.intel.com/technology/...technology.pdf, pages 7/8, especially figure 4 on page 8).

Now you can do both together: double the pipeline stages and latch them. Then you get CMT + HT, and therefore e.g. a 4M/16T processor. So if you want that, you can do it, and it will give more performance than implementing only one of the two techniques.
 

AtenRa

Lifer
Feb 2, 2009
14,003
3,362
136
The biggest fundamental difference in microarchitectural design between Intel's HT (SMT) and AMD's Bulldozer (cluster-based multithreading) is that Intel's execution units (Ports 0 to 5) are shared (one shared scheduler), whereas in AMD's Bulldozer the integer execution units are completely independent (two schedulers + two integer execution units) and only the FP execution units share resources via a single FP scheduler (shared across the two 128-bit FMACs) per module.
 
Last edited:

AtenRa

Lifer
Feb 2, 2009
14,003
3,362
136
In simple words,

In Intel's design, if the first thread uses 80% of the execution units (Ports 0 to 5), the second thread will use the remaining 20%. From my understanding, Intel's HT tries to fill the execution units, making more efficient use of them, because one thread will never use all of the execution ports.
The second use of HT is to minimize the effect of pipeline stalls by letting the second thread occupy the execution units while the first thread is stalling.

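A rough way to picture that port-filling idea (a toy sketch of the sharing described above; the six-port count follows the description, the per-cycle uop counts are invented):

Code:
# Toy sketch of the port-filling described above: each cycle, thread 0's
# ready uops take execution ports first, and thread 1 fills what is left.
PORTS = 6
ready = [(5, 3), (4, 1), (6, 2), (2, 4)]    # (t0, t1) ready uops per cycle

for cycle, (t0, t1) in enumerate(ready):
    used0 = min(t0, PORTS)
    used1 = min(t1, PORTS - used0)          # thread 1 gets the leftovers
    print(f"cycle {cycle}: thread0 -> {used0} ports, thread1 -> {used1}")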


In AMD's design, when both threads need integer execution units, each thread has its own scheduler and execution pipelines, and they execute simultaneously in the same cycle.

[Image: Bulldozer module slide from AMD's Hot Chips presentation]
 
Last edited:

jvroig

Platinum Member
Nov 4, 2009
2,394
1
81
You should read my posts:
That post you quoted is exactly the post that made me tell you not to make up your own terms and expect people to follow suit.

You were complaining that the comparison (made by Phynaz) is wrong (8-core BD vs. 4-core SB) because you insist on believing that it is "actually just a 4-core BD, because AMD just renamed it, and it rightfully should be a quad core with far better HT".

Nobody else insists on that. And with good reason: AMD has already decided it isn't a "quad core + super HT". It's an octo-core. And their quad-core Zambezi isn't a "dual core with superior HT"; it's just a quad core, period.

So yes, people can use AMD terms on an AMD proc and Intel terms on an Intel proc and compare them, and there won't be any wrong conclusions, so Phynaz's post about "duh, of course the 8-core is faster than a quad" is valid. Nobody has to adopt your own terminology just to be correct. There is no longer an issue of "what should be the proper term", because AMD and Intel have already settled it for all of us. Of course, this is all in the context of having no pricing information yet. When pricing information arrives, and one is cheaper than the other, our existing comparisons will be revised. Until then, a quad-core AMD processor is what AMD says it is, and a quad-core Intel processor is what Intel says it is, and comparing both using each vendor's respective terms is completely valid.
 

maddie

Diamond Member
Jul 18, 2010
5,205
5,618
136
In simple words,

In Intel's design, if the first thread uses 80% of the execution units (Ports 0 to 5), the second thread will use the remaining 20%. From my understanding, Intel's HT tries to fill the execution units, making more efficient use of them, because one thread will never use all of the execution ports.
The second use of HT is to minimize the effect of pipeline stalls by letting the second thread occupy the execution units while the first thread is stalling.

In AMD's design, when both threads need integer execution units, each thread has its own scheduler and execution pipelines, and they execute simultaneously in the same cycle.

[Image: Bulldozer module slide from AMD's Hot Chips presentation]
Is this 100% accurate or can situations exist where this is not true?

I'm not asking whether this is unlikely or improbable, but whether it is possible.

Some here are arguing that simultaneous means exactly that, end of argument. Some are saying it is equal sharing of resources. I can appreciate that in most cases there will be a gain from HT allowing 2 threads to execute.


Is it so impossible to some that AMD might have developed a technology superior to Intel's, that they reject reasoning?
 

jvroig

Platinum Member
Nov 4, 2009
2,394
1
81
Is this 100% accurate or can situations exist where this is not true?
You can create a program that does exactly that, so it is possible.

But for real-world applications, yes, that is accurate. Each core is simply over-provisioned (redundant functional units) since cores are general-purpose, and this has been the case ever since superscalar execution and pipelining were joined together in the architecture, which was a long, long time ago.

There are exceptions, of course, but they are very rare and mostly very niche cases, like chess engines. In those rare cases, it is advisable to turn Hyperthreading off. It is also advisable to turn it off in enterprise settings where HTT has not been validated by the vendor, even though chances are it is probably safe.
 

HW2050Plus

Member
Jan 12, 2011
168
0
0
When AMD claims an 80% gain, it's a maximum gain. It's incorrect to call Hyperthreading 5-30% and CMT 80%. Hyperthreading can gain up to 30%.

I think there's rather a good chance that Bulldozer will have higher performance with equal threads than Deneb. The architecture isn't focused as much on per thread performance as Nehalem or Sandy Bridge is, but it seems like an advancement nevertheless.
All okay, but there is one important difference which maybe was not made clear enough. The gain from Hyperthreading varies widely with the workload (I could even write artificial code that gains more than the 30% or less than the -5%; in fact the real-world lower bound for HT is about -5%, e.g. for chess engines the effect of HT is negative). Core Multithreading is much less sensitive to the workload. The 80% gain from CMT is the average, and the maximum gain from CMT is of course 100% (quite obvious). Maybe the minimum gain is 60%, but that is speculation, and such a low gain for CMT can perhaps only be reached with artificial code.

In simple words,

In Intel's design, if the first thread uses 80% of the execution units (Ports 0 to 5), the second thread will use the remaining 20%. From my understanding, Intel's HT tries to fill the execution units, making more efficient use of them, because one thread will never use all of the execution ports.
The second use of HT is to minimize the effect of pipeline stalls by letting the second thread occupy the execution units while the first thread is stalling.
No, don't mix this up. Splitting execution ports between threads would be what AMD Core Multithreading does. Intel's Hyperthreading cannot do this. Either one thread uses all the port resources or the other does. Only one thread may issue instructions to these 6 execution ports in any given cycle. Again, to say it clearly: with Intel's HT there is no sharing of either pipeline or execution resources, and no doubling of those resources.

It works via stalls: if one thread stalls, and therefore uses no pipeline or port, then and ONLY THEN is the other (waiting) thread executed.

See:
ftp://download.intel.com/technology/...technology.pdf
 

maddie

Diamond Member
Jul 18, 2010
5,205
5,618
136
All okay, but there is one important difference which maybe was not made clear enough. The gain from Hyperthreading varies widely with the workload (I could even write artificial code that gains more than the 30% or less than the -5%; in fact the real-world lower bound for HT is about -5%, e.g. for chess engines the effect of HT is negative). Core Multithreading is much less sensitive to the workload. The 80% gain from CMT is the average, and the maximum gain from CMT is of course 100% (quite obvious). Maybe the minimum gain is 60%, but that is speculation, and such a low gain for CMT can perhaps only be reached with artificial code.


No, don't mix this up. Splitting execution ports between threads would be what AMD Core Multithreading does. Intel's Hyperthreading cannot do this. Either one thread uses all the port resources or the other does. Only one thread may issue instructions to these 6 execution ports in any given cycle. Again, to say it clearly: with Intel's HT there is no sharing of either pipeline or execution resources, and no doubling of those resources.

It works via stalls: if one thread stalls, and therefore uses no pipeline or port, then and ONLY THEN is the other (waiting) thread executed.

See:
ftp://download.intel.com/technology/...technology.pdf
Good clarification, thanks.
 

HW2050Plus

Member
Jan 12, 2011
168
0
0
Until then, a quad-core AMD processor is what AMD says it is, and a quad-core Intel processor is what Intel says it is, and comparing both using each vendor's respective terms is completely valid.
No, that is naive and you know it. That is just the standpoint of "I swallow whatever marketing pours out, without reflection."

If marketing renames dual socket to single socket, or 4 cores to 32 cores, do you swallow that as well? Or if they define 1 GHz as 500,000,000/s and say the chip runs at 8 GHz?

And most importantly: what if Intel does the same redefinition for their Hyperthreading?

If you like you can do this, no question, but it will lead you to wrong conclusions. I am not of this type, and I think you are not either.

I do not argue against using each vendor's terms, but when you compare, you must look behind them to get a meaningful comparison.

Or we are talking at cross purposes, I don't know.
 

JFAMD

Senior member
May 16, 2009
565
0
0
It is just AMD which brings up this confusion.
Before, it was clear what a socket, a core and a thread mean.

Now AMD has renamed core to module and thread to core. They did this to put emphasis on their different technology for providing multiple threads within a core. And yes, that is great marketing: with a wording change they can sell double the core count. You can find my post in reply to JFAMD saying that this is very aggressive marketing.

Hate AMD for the renaming issue or not, you may use either AMD terms, or Intel terms, or AMD terms for AMD only and Intel terms for Intel only.

My posts were just to explain that issue. The issue itself was "invented" by AMD.

AMD hasn't changed anything. A core refers to an integer core. That is a scheduler and a set of integer pipelines that can execute integer instructions. When you boot the system you see cores, not modules. The OS sees cores, not modules. Applications see cores, not modules. Cores are integer cores, plain and simple.

We will not be marketing modules and you will never see them in any customer-facing collateral.

Look at it this way. Xeon and Atom both have cores in them. Both are very different. Xeon includes a lot of things that Atom does not. Does that make Atom not a core? Across the industry there is a lot of variation in what makes a core.

But the ONE THING that every core has, the lowest common denominator, is the integer execution pipelines. I am not aware of a single processor that has two sets of integer schedulers and two sets of integer pipelines in a single core.


I'm not sure if I'm correct here, but isn't a Bulldozer module's front end bigger than an SB core's, because the FP unit has its own scheduler whereas SB's doesn't?

Remember: IPC × clock speed. Both IPC and clock speed by themselves are meaningless.


Our front end is a lot larger; it is designed with enough bandwidth to handle two threads. From a scheduler standpoint, I think SB has 56 entries for integer, FP, and the second thread (HT); someone correct me if I am wrong.

BD has 40 entries for EACH integer scheduler and 60 entries for the floating point.
 

jvroig

Platinum Member
Nov 4, 2009
2,394
1
81
That is just the standpoint of "Without reflection I feed in everything what marketing pours out."
I think Intel was right not to call a 4c/8t processor an octo-core, because HTT is neither powerful enough nor consistent enough to be considered a separate core.

I think AMD was justified in calling their cores what they did. Not only did they duplicate enough int resources for it, it is also powerful enough (~80% of a real core) and consistent enough to be considered one. It may or may not have been my personal choice to call it an octo-core instead of a "quad-core with superior AMD SMT", but it is justifiable anyway.

As I've already said, you are just trying to impose your own view of things, even though it is a non-issue, and the proper authorities have resolved it. You are free to believe what you want, but imposing your point on others in a forum debate is not a good formula for a healthy debate.
 

jvroig

Platinum Member
Nov 4, 2009
2,394
1
81
@JFAMD
I think you pressed the wrong quote button. You are replying to HW2050Plus, and the quote you quoted belongs to him, but the name shown is "PreferLinux", a different member.
 

AtenRa

Lifer
Feb 2, 2009
14,003
3,362
136
Is it so impossible to some that AMD might have developed a technology superior to Intel's, that they reject reasoning?

Intel's HT design tries to make a more efficient processor by always trying to use all of the execution units (Ports 0 to 5) and by minimizing pipeline stalls, from the front end (prefetchers, predictors, etc.) all the way down to the execution and retire stages.

AMD's approach is a little different: they are trying to make a more efficient CMP (chip multiprocessor) by sharing parts of the processor. I would say AMD's approach uses more brute force (doubled execution) in a more efficient way, by sharing processor stages (the single, shared front end, etc.).

Both use cutting-edge technology to accomplish what they want with their designs.
 

Arkadrel

Diamond Member
Oct 19, 2010
3,681
2
0
HT takes up 5% extra space and gains ~20% extra performance.

The modular design takes up 12% extra space and gains ~80% extra performance.


I think the modular approach is more elegant (more performance gain vs. space taken up); it also doesn't depend on how badly coded the software is (like HT does).



At the end of the day what matters is how much performance the CPU has... that's what you pay for: the performance. Not the fancy names Intel or AMD make up for things, not how many cores are inside or how many threads it can handle... PERFORMANCE.

This entire discussion about "core vs core" is mindless nonsense, by Intel fanboys who spam this thread.
 
Last edited:

jvroig

Platinum Member
Nov 4, 2009
2,394
1
81
HT takes up 5% extra space and gains ~20% extra performance.
That figure was for the initial implementation of HTT. There is no published figure for current HTT implementations.

The modular design takes up 12% extra space and gains ~80% extra performance.
12% is only for the int core in a module, not taking into account the module design needed to fuse those two independent int cores into a "Bulldozer monolithic dual-core module". In the end, this isn't even revolutionary. If Intel marketing wanted to play this game, they could have said Gulftown was able to add 2 more cores (producing a hex-core) and each additional core was only 6% of the die area. We've worked all this out before. It is an old argument, often misunderstood by laymen. I linked to an old post in an old thread several posts back. That would be a good read for you.

I think the modular approach is more elegant (more performance gain vs. space taken up); it also doesn't depend on how badly coded the software is (like HT does).
They are different approaches; they aren't mutually exclusive, and they solve different problems.

At the end of the day what matters is how much performance the CPU has... that's what you pay for: the performance. Not the fancy names Intel or AMD make up for things, not how many cores are inside or how many threads it can handle... PERFORMANCE.
Nobody was disputing this; I wonder why you feel you have to hammer the point.

This entire discussion about "core vs core" is mindless nonsense, by Intel fanboys who spam this thread.
There was a question about the validity of a benchmark, which measures your precious PERFORMANCE (specifically, HW2050 vs. Phynaz, when HW2050's leak was questioned). That question of validity led to some people discussing why an AMD "core" should not be compared to an Intel "core". It is not mindless nonsense. And barging in as you did is rather rude, contributing nothing at all to the thread, and no better than the spam you decried.
 

maddie

Diamond Member
Jul 18, 2010
5,205
5,618
136
That figure was for the initial implementation of HTT. There is no published figure for current HTT implementations.


12% is only for the int core in a module, not taking into account the module design needed to fuse those two independent int cores into a "Bulldozer monolithic dual-core module". In the end, this isn't even revolutionary. If Intel marketing wanted to play this game, they could have said Gulftown was able to add 2 more cores (producing a hex-core) and each additional core was only 6% of the die area. We've worked all this out before. It is an old argument, often misunderstood by laymen. I linked to an old post in an old thread several posts back. That would be a good read for you.


They are different approaches; they aren't mutually exclusive, and they solve different problems.


Nobody was disputing this; I wonder why you feel you have to hammer the point.


There was a question about the validity of a benchmark, which measures your precious PERFORMANCE (specifically, HW2050 vs. Phynaz, when HW2050's leak was questioned). That question of validity led to some people discussing why an AMD "core" should not be compared to an Intel "core". It is not mindless nonsense. And barging in as you did is rather rude, contributing nothing at all to the thread, and no better than the spam you decried.
You are being very disingenuous here.

If you take 1 Intel core and decide to double up, I'm fairly certain it will be close to, if not exactly, a 100% increase in area when looking only at the cores.

I read the single post you linked earlier, and your logic is flawed in my opinion.

CPU = core + caches + uncore (everything else). I'm assuming here that cache scales linearly with cores.

Intel: using the Gulftown die shot (in your post) as an example:

Bare core area = 45%
L3 cache = 24%
Cores + all cache = 69%
Uncore = 31%

Doubling cores = (69 × 2 + 31)% at best = 169%

Assuming similar ratios for AMD:

Effectively doubling cores by forming modules = (45 × 1.12)% ≈ 50%
Doubling cache = (24 × 2)% = 48%
Uncore = 31%

Total ≈ 130%

And please don't argue that the extra core is only 80%; the two cores in a module are identical. The module's cores are now each roughly (180/2) = 90% of an Intel or old-style AMD core.

We have:

Intel: 200% computation for 169% of the original area.

AMD: 180% computation for 130% of the original area.

A roughly 17% increase in area efficiency.
That, in my opinion, is significant.
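
The arithmetic above in script form, for anyone who wants to rerun it (the 45/24/31 split and the 12% CMT overhead are this post's assumptions):

Code:
# Reproduce the area arithmetic above. All ratios are this post's
# assumptions (45% cores, 24% L3, 31% uncore, +12% for CMT).
core, cache, uncore = 0.45, 0.24, 0.31

intel = (core + cache) * 2 + uncore        # double cores and cache
amd   = core * 1.12 + cache * 2 + uncore   # CMT module, doubled cache

intel_eff = 2.00 / intel                   # 200% computation per area
amd_eff   = 1.80 / amd                     # 180% computation per area
print(f"Intel: {intel:.0%} area, AMD: {amd:.0%} area")
print(f"AMD area-efficiency advantage: {amd_eff / intel_eff - 1:.0%}")
# -> roughly 18%, in line with the ~17% estimate above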
 

HW2050Plus

Member
Jan 12, 2011
168
0
0
I felt I should give some more explanation of CMT and how and why it is implemented.

For a current Phenom core you have the following execution resources:

1 Integer scheduler
with 3 AGU units
and 3 ALU units

1 Floating Point scheduler
with 3 FPU/SSE units

(plus 2 load / 1 store units; only since Phenom II, before that it was 1 load / 1 store)

Now for Bulldozer they have split these resources. As 3 cannot be evenly divided by two, they added one more ALU and AGU and dropped one FPU/SSE unit.

So with Bulldozer you have:
2 Integer schedulers
each with 2 AGU units (4 in total)
and each with 2 ALU units (4 in total)

1 Floating Point scheduler
with 2 FPU/SSE/AVX units (the two can be "ganged" to work as one full 256-bit AVX unit or as two 128-bit AVX units).

As I previously said, this reduction in AGU/ALU/FPU/SSE units does not really hurt, since in Phenom they were simply unused nearly all of the time.

Of course this degrades the IPC of a half core (named "core") a little, but it increases the IPC of the whole (now called a "module") a lot.

You need a little more die area for one more scheduler, one more AGU and one more ALU, but you also save some die area because of the dropped FPU/SSE unit.

To be able to exploit the two integer schedulers you need another thread running, so the core was extended to do just that. Altogether, AMD named this design Core Multithreading (CMT).

As the two schedulers are fully independent and the FP scheduler was made a shared resource, they theoretically doubled the performance of their current architecture.

AMD claims an 80% gain from this (which is evidently the average gain they measured).

So previously an AMD core (K7-K10.5) was totally oversized, with 9-wide execution (counting the 3 load/stores it would even be 12-wide), a width that could never be used because no practical code exists for it.

Basically, the reason for this width was their unit design: if all paths in the CPU are three wide, many things are simplified, and you need neither shuffling around nor the decision logic for it. That way AMD saved a lot of development resources when they brought out K7. On the other hand, they couldn't simply remove the extra units without spending a lot of die area and R&D resources on compensation logic.

Now with Bulldozer they have cut that down to two times 5-wide execution, a total of 10-wide (1 more than K7-K10.5's 9-wide). This 10-wide can practically be reached most of the time, because two programs (threads) actually run on it. It is obvious that running two threads doubles the utilization.
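
Tallying the unit counts used in this post (the figures are the ones given above, not verified against AMD documentation):

Code:
# Execution-width tally using the unit counts from this post.
phenom_core      = {"ALU": 3, "AGU": 3, "FPU/SSE": 3}  # one core, 9-wide
bulldozer_module = {"ALU": 4, "AGU": 4, "FMAC": 2}     # 2 int cores + shared FP

print("Phenom core width:     ", sum(phenom_core.values()))       # 9
print("Bulldozer module width:", sum(bulldozer_module.values()))  # 10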

My statement that Intel might have difficulties implementing CMT should be clear as well: they do not have this large pool of mostly unused units that they could simply split. Intel has only 6 issue ports (4 execution ports plus 2 load/store ports), compared to AMD's 12-wide execution (9 units plus 3 load/store). So to follow CMT they would either need to add more units than AMD had to, or split their issue-port design into a unit design like AMD's. Note: some Intel issue ports host more than one execution unit, so you cannot directly compare AMD units with Intel issue ports (ports 0, 1 and 5 in fact host two execution units each, 1 ALU and 1 FPU/SSE, i.e. the same amount as AMD's K7-K10.5).

That is why it would be quite difficult for Intel to implement CMT: it requires far more design changes. Of course Intel has the resources to do it, but more work normally means more time, even when you have the resources.

But maybe Intel will invent something different, more suited to their architecture; we will have to wait and see.

12% is only for the int core in a module, not taking into account the module design needed to fuse those two independent int cores into a "Bulldozer monolithic dual-core module".
As you can see from what they did (see above), it is quite realistic that this 12% extra die area already includes everything needed. Your claim that "12% is only for the int core in a module" is just speculation on your part. The official statement is that the figure covers everything, and if you look at what was actually done, that sounds reasonable.

And regarding this HT/CMT discussion:
It is even possible to apply CMT again: split one of the half cores once more. That would of course bring much less, since the first CMT step already used up all the more or less free units, and you would now have to split rather busy units.

Again, this shows why CMT only works well when you have a lot of mostly unused execution units. Otherwise you would have to add so many units that it would be better to build a new full core rather than implement CMT.

So to improve this further, the better option would be to add HT on top of CMT. CMT is the technique for getting unused units used, and HT is the technique for getting the resources freed during a pipeline stall used.

And now regarding comparisons: as Intel's HT adds only 5% die area and AMD's CMT adds only 12% (in both cases excluding L2/L3 cache, which would lower the percentage for the whole chip even more), you get roughly the same die area and roughly the same manufacturing cost for HT and CMT. That is why it is appropriate to compare the two techniques.
 

AtenRa

Lifer
Feb 2, 2009
14,003
3,362
136
No, don't mix this up. Splitting execution ports between threads would be what AMD Core Multithreading does. Intel's Hyperthreading cannot do this. Either one thread uses all the port resources or the other does. Only one thread may issue instructions to these 6 execution ports in any given cycle. Again, to say it clearly: with Intel's HT there is no sharing of either pipeline or execution resources, and no doubling of those resources.

It works via stalls: if one thread stalls, and therefore uses no pipeline or port, then and ONLY THEN is the other (waiting) thread executed.

See:
ftp://download.intel.com/technology/...technology.pdf

ftp://download.intel.com/technology/itj/2002/volume06issue01/vol6iss1_hyper_threading_technology.pdf

page 10, figure 6: Out-Of-Order execution engine detailed pipeline
The out-of-order execution engine consists of the allocation, register renaming, scheduling, and execution functions, as shown in Figure 6. This part of the machine re-orders instructions and executes them as quickly as their inputs are ready, without regard to the original program order. [...] Specifically, each logical processor can use up to a maximum of 63 re-order buffer entries, 24 load buffers, and 12 store buffer entries.

If there are uops for both logical processors in the uop queue, the allocator will alternate selecting uops from the logical processors every clock cycle to assign resources. If a logical processor has used its limit of a needed resource, such as store buffer entries, the allocator will signal “stall” for that logical processor and continue to assign resources for the other logical processor. In addition, if the uop queue only contains uops for one logical processor, the allocator will try to assign resources for that logical processor every cycle to optimize allocation bandwidth, though the resource limits would still be enforced.

Page 11

Instruction Scheduling

The schedulers are at the heart of the out-of-order execution engine. Five uop schedulers are used to schedule different types of uops for the various execution units. Collectively, they can dispatch up to six uops each clock cycle. The schedulers determine when uops are ready to execute based on the readiness of their dependent input register operands and the availability of the execution unit resources.

The memory instruction queue and general instruction queues send uops to the five scheduler queues as fast as they can, alternating between uops for the two logical processors every clock cycle, as needed. Each scheduler has its own scheduler queue of eight to twelve entries from which it selects uops to send to the execution units. The schedulers choose uops regardless of whether they belong to one logical processor or the other. The schedulers are effectively oblivious to logical processor distinctions. The uops are simply evaluated based on dependent inputs and availability of execution resources. For example, the schedulers could dispatch two uops from one logical processor and two uops from the other logical processor in the same clock cycle. To avoid deadlock and ensure fairness, there is a limit on the number of active entries that a logical processor can have in each scheduler's queue. This limit is dependent on the size of the scheduler queue.

Intel's HT CAN do that ;)
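
The quoted allocator behaviour is easy to model in a few lines (the 24-entry load-buffer limit comes from the quoted paper; the queue contents are invented):

Code:
# Toy model of the quoted allocator: alternate between the two logical
# processors each cycle, and "stall" a thread once it holds its limit
# of a shared resource (load-buffer entries here). One-shot fill only;
# real hardware frees entries as loads retire.
LIMIT = 24                  # per-thread load-buffer limit from the paper
held = [0, 0]               # entries currently held by each thread
pending = [30, 10]          # load uops waiting per thread (invented)

for cycle in range(40):
    t = cycle % 2           # allocator alternates every clock cycle
    if not (pending[t] and held[t] < LIMIT):
        t = 1 - t           # this thread stalls; serve the other one
    if pending[t] and held[t] < LIMIT:
        held[t] += 1
        pending[t] -= 1

print(held)                 # -> [24, 10]: thread 0 capped, thread 1 unharmed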
 
Last edited:

jvroig

Platinum Member
Nov 4, 2009
2,394
1
81
I read the single post you linked earlier, and your logic is flawed in my opinion.

Your claim that "12% is only for the int core in a module" is just speculation on your part.
It isn't. This reply also goes for maddie, above.

We've all gone through this discussion before, several months ago. I am in no mood to do it all over again for people who still don't get it.

Here is a link to Anand clarifying this issue: http://www.anandtech.com/show/2881.
Scroll down to the very bottom:
From the Link said:
AMD has come back to us with a clarification: the 5% figure was incorrect. AMD is now stating that the additional core in Bulldozer requires approximately an additional 50% die area.


I am not singling out maddie or HW2050 or anybody, but before anybody accuses anybody of being "flawed in reasoning" or "just your own speculation" in a technical discussion, it might be a good idea to please get the facts straight. I personally keep out of CPU discussions these days because I have gotten sick of teenagers (or adults maybe) who have never even read through the first few chapters of any computer microarchitecture / comp organization book, yet feel obliged to argue technical issues using "knowledge" from marketing materials against real professionals who either deal with these issues directly through hardware (design) or low-level software optimizations, or teach the damn comp organization / uarch course, or any or all of those combined. I can see the signal to noise ratio still hasn't improved. I should not have bothered in the first place.

All I can do now is say thank you to the participants I have interacted with, and bid everyone goodbye. I bow out of the thread; there is nothing more I can do.

Regards.
 
Last edited:

HW2050Plus

Member
Jan 12, 2011
168
0
0
We have:

Intel: 200% computation for 169% of the original area.

AMD: 180% computation for 130% of the original area.

A roughly 17% increase in area efficiency.
That, in my opinion, is significant.
Why do you think that Intel scales linearly?

And please make your calculations for real CPUs, based on absolute die size: Sandy Bridge also gained a lot of additional transistors, so it does not get its performance for free.

Okay, let's see:
Intel Sandy Bridge: 32 nm process, 216 mm² die area, 4C/8T, 8 MB cache
AMD Phenom II: 45 nm process, 258 mm² die area, 4C/4T, 8 MB cache

So let's start from the Phenom II:
First, a die shrink to 32 nm: 151 mm² die area for a shrunk Phenom II.
Then 12% more for CMT, which applies to ~50% of that: +9 mm² for CMT (!)
Then doubling the cache: taking the 24% cache share once more: +36 mm² for the additional cache.

Then we have:
AMD Zambezi: 32 nm process, 196 mm² die area, 4M/8C, 16 MB cache
Yes, that is 30% more than a plain die shrink would be, but look at the absolute die size: it is still smaller than Sandy Bridge. Yes, Sandy Bridge is fat!

That is 20 mm² less than Sandy Bridge. Okay, you have to consider that Sandy Bridge also provides a small GPU on the die; on the other hand, Zambezi has 8 MB more cache.
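
The same estimate as a script (the 151 mm² shrink figure and the 50%/12%/24% ratios are this post's assumptions):

Code:
# Reproduce the die-size estimate above, step by step.
phenom_32nm = 151.0                      # post's shrink estimate, mm^2
cmt_add     = phenom_32nm * 0.50 * 0.12  # +12% applied to the ~50% core area
cache_add   = phenom_32nm * 0.24         # one more cache's worth (24% of die)

zambezi = phenom_32nm + cmt_add + cache_add
print(f"estimated Zambezi die: {zambezi:.0f} mm^2")   # ~196 mm^2
print("Sandy Bridge:          216 mm^2")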
 

Arkadrel

Diamond Member
Oct 19, 2010
3,681
2
0
@HW2050Plus

Now, if your hypothetical Zambezi (196 mm², 4M/8C, 16 MB cache) costs about the same as a Sandy Bridge... should be good times ahead for AMD :)

It'll be fun to see how close your guesstimate is to the actual size when it's out.


I'll take my stock 3.5 GHz + 500 MHz turbo x8 "core" Zambezi now please :)
 
Last edited:

ShadowVVL

Senior member
May 1, 2010
758
0
71
I'm looking forward to:

1. seeing the Zambezi X8 vs. Sandy Bridge.

2. the performance difference the Zambezi X4 will have over the Phenom II X4 970; I'm hoping for a 30% increase.

I'm hoping to see the Zambezi X4 at $139-$149 and an X8 at $249.

If the price is right I might jump on the 8-core train.
 