AMD vs Intel at the high end in the future


Idontcare

Elite Member
Oct 10, 1999
21,110
64
91
Originally posted by: Kuzi
Originally posted by: IntelUser2000
Originally posted by: Kuzi

Yep you are right about the L1 cache, I was mainly thinking of the extra Integer Unit in each core. You know K10 processors have only one Integer Unit, BD may have two.

"The CPU diagram is really interesting having two integer units, with each unit having two ALUs. The K8/K10 architectures have one integer unit with three ALUs."

What do you mean by "Integer Units"? Usually that means ALUs, but you make it sound like something else. The number of ports? Or the things called "Integer Clusters" in the pic? The definition seems very vague.

Let's call the part of the CPU that does integer calculations an Integer Execution Unit. This is how it looks in K10.

Notice at the lower left of the diagram there are three ALUs; these are part of the Integer Execution Unit in K10 CPUs. It's 3-way superscalar, able to issue three integer operations per clock cycle.

Notice on the Bulldozer diagram, there are "two" Integer Execution Units per core, called Clusters on the diagram. Now from the info IDC provided:

Here is the link to Dresdenboy's patent search results into AMD MPU:
- clustered multithreading with 2 INT clusters, each of them having:
  - 2 ALUs, 2 AGUs
  - one L1 data cache
  - scheduler, integer register file (IRF), ROB
(see 20080263373, 20080209173, 7315935)

Each of those INT clusters will have two ALUs. If you combine the two clusters into one unit (4 ALUs), you get 4-wide issue per clock (like Core 2/i7). Or run each cluster separately, and the CPU can run two threads at a time (SMT).

This is what we are assuming AMD might do to add SMT capability into Bulldozer.

Kuzi, this method of dynamically busting up the clusters to enable SMT "as needed" is intriguing when put into reverse. It seems like hardware mitosis of sorts... not to actually take single threads and make them multi-threaded, but rather to enable the option of making a multi-threaded core (or multi-cored CPU) function as a faster single-thread (or fewer-thread) processor when that is all that is needed.

(I know I am saying this rather poorly, I apologize for that, the proper words are escaping me at the moment)

If this clustered processing technique really works, we could imagine a 16-core, 32-thread-capable Interlagos chip that, when challenged with say only 8 threads, can suddenly and dynamically configure the clusters so as to become a seemingly more efficient 8-core, 8-thread processor just for the time that the 8 threads are processing.

Can it really operate like that? If we concede that clustered processing enables an SMT-like approach, then it would seem we have to concede that clustered processing enables reverse-SMT-like capabilities as well.
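The merge/split bookkeeping being described can be sketched as a toy model. This is purely illustrative; the `configure` function and its policy are an invention for this sketch, not anything AMD has disclosed:

```python
# Purely hypothetical policy (illustration only, not AMD's disclosed design):
# each core holds two 2-wide integer clusters that can either merge into
# one 4-wide unit for a single thread, or split to run two threads.
def configure(cores, threads):
    """Return (merged_cores, split_cores) for a given runnable-thread count."""
    if threads <= cores:
        return threads, 0                 # every thread gets a whole merged core
    split = min(threads - cores, cores)   # extra threads force some cores to split
    return cores - split, split

print(configure(16, 8))   # (8, 0): 8 cores merged 4-wide, 8 idle
print(configure(16, 17))  # (15, 1): one core splits to host 2 threads
print(configure(16, 32))  # (0, 16): fully split, 32 hardware threads
```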
 

Kuzi

Senior member
Sep 16, 2007
572
0
0
Originally posted by: Idontcare
Kuzi, this method of dynamically busting up the clusters to enable SMT "as needed" is intriguing when put into reverse. It seems like hardware mitosis of sorts... not to actually take single threads and make them multi-threaded, but rather to enable the option of making a multi-threaded core (or multi-cored CPU) function as a faster single-thread (or fewer-thread) processor when that is all that is needed.

(I know I am saying this rather poorly, I apologize for that, the proper words are escaping me at the moment)

It's exactly how you describe it here IDC. A 16-core Interlagos would already have 2 integer clusters per core, and thus the ability to run 32 threads simultaneously if needed. But can it operate in reverse? Combining two clusters to work as one larger (4-way) cluster is what I'm not sure of.
 

Kuzi

Senior member
Sep 16, 2007
572
0
0
Originally posted by: Idontcare
If this clustered processing technique really works, we could imagine a 16-core, 32-thread-capable Interlagos chip that, when challenged with say only 8 threads, can suddenly and dynamically configure the clusters so as to become a seemingly more efficient 8-core, 8-thread processor just for the time that the 8 threads are processing.

Yes that is it. It would be able to run more efficiently (4-way) up to 16 threads, but if we go over 16 threads some cores have to be "split" up (2-way) to support the extra threads. And from Hans de Vries's explanation there isn't much performance loss from going only 2-way:

Originally posted by: Hans de Vries
Bulldozer's clustered multiprocessor architecture
A 2-way superscalar processor can reach 80%-100% of the performance of a 3-way for lots of applications. Only a subset of programs really benefits from going to a 3-way. A still smaller subset benefits from going to a 4-way superscalar.

Let's make a hypothetical example. We have 16 threads running on a 16-core Interlagos at maximum efficiency (4-way), each core giving 100% performance gain. We get (16 threads)*(100% per core) = 1600% improved performance.

Now let's say we have 17 threads: 15 of the 16 cores can run at maximum efficiency (1500%), and one core has to be split up to run 2 threads. Let's assume a worst-case scenario, with the split core running the two threads at 80% efficiency each, so (2 threads)*(80%) = 160%. Now we add 1500% + 160% = 1660%.

It can lose some efficiency when running more than 16 threads but there is still a performance gain. If we take 32 threads at worst case efficiency (80%), (32 threads)*(80%) = 2560% performance gain. Still much better than running only 16 threads (1600%) :D
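The arithmetic above can be written out as a quick sketch (the 100%/80% figures are the hypothetical numbers from this example, not measured data):

```python
# Hypothetical throughput math from the example above: 100% per thread on a
# merged (4-way) core, 80% per thread on a split (2-way) core (worst case).
def total_perf(threads, cores=16):
    """Aggregate performance in 'percent'; valid for threads <= 2*cores."""
    if threads <= cores:
        return threads * 100              # all threads on merged cores
    split = threads - cores               # cores forced to split 2-way
    merged = cores - split
    return merged * 100 + split * 2 * 80

print(total_perf(16))  # 1600
print(total_perf(17))  # 1660
print(total_perf(32))  # 2560
```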

Of course in real life the performance gain wouldn't be that big for most apps. It will be interesting to see how this SMT method fares against Intel's HT.
 

Idontcare

Elite Member
Oct 10, 1999
21,110
64
91
I'd like to see this taken to the next level - a 16- or 32-cluster core that can run up to 32 threads but can also process one thread using nearly all the power of the "unified" cluster.

I'm pretty sure this has to be just stupid wrong of me to state for a myriad of reasons, but at the moment I'm at a loss to make a reasonable argument why it can't work if clustered computing itself can work.

Hopefully someone can add some thoughts here to this thread, or drop me a pm if they'd rather.
 

ilkhan

Golden Member
Jul 21, 2006
1,117
1
0
Isn't what you are talking about the reverse-HT that Intel was talking about a while back? (sorry for being fuzzy, but it sounds like the same concept. If we can split resources to work on two threads, maybe we can combine resources to work on 1 thread when 2 aren't needed).
 

Idontcare

Elite Member
Oct 10, 1999
21,110
64
91
Originally posted by: ilkhan
Isn't what you are talking about the reverse-HT that Intel was talking about a while back? (sorry for being fuzzy, but it sounds like the same concept. If we can split resources to work on two threads, maybe we can combine resources to work on 1 thread when 2 aren't needed).

Is it? Could be exactly the same then. I thought reverse-HT was to take a single-threaded application and forcibly do parallel-processing on it of sorts (in some manner of speculative multi-processing or some such).

At any rate if reverse-HT is as you describe then yes what I was thrashing around attempting to describe with clustered computing is reverse-HT.
 

jones377

Senior member
May 2, 2004
462
64
91
Originally posted by: Idontcare
Originally posted by: ilkhan
Isn't what you are talking about the reverse-HT that Intel was talking about a while back? (sorry for being fuzzy, but it sounds like the same concept. If we can split resources to work on two threads, maybe we can combine resources to work on 1 thread when 2 aren't needed).

Is it? Could be exactly the same then. I thought reverse-HT was to take a single-threaded application and forcibly do parallel-processing on it of sorts (in some manner of speculative multi-processing or some such).

At any rate if reverse-HT is as you describe then yes what I was thrashing around attempting to describe with clustered computing is reverse-HT.

I don't think that reverse-HT rumor, as it was described at the time, had any basis in reality whatsoever.
 

Idontcare

Elite Member
Oct 10, 1999
21,110
64
91
Originally posted by: jones377
Originally posted by: Idontcare
Originally posted by: ilkhan
Isn't what you are talking about the reverse-HT that Intel was talking about a while back? (sorry for being fuzzy, but it sounds like the same concept. If we can split resources to work on two threads, maybe we can combine resources to work on 1 thread when 2 aren't needed).

Is it? Could be exactly the same then. I thought reverse-HT was to take a single-threaded application and forcibly do parallel-processing on it of sorts (in some manner of speculative multi-processing or some such).

At any rate if reverse-HT is as you describe then yes what I was thrashing around attempting to describe with clustered computing is reverse-HT.

I don't think that reverse-HT rumor, as it was described at the time, had any basis in reality whatsoever.

Are you talking about mitosis?
 

jones377

Senior member
May 2, 2004
462
64
91
Originally posted by: Idontcare
Originally posted by: jones377
Originally posted by: Idontcare
Originally posted by: ilkhan
Isn't what you are talking about the reverse-HT that Intel was talking about a while back? (sorry for being fuzzy, but it sounds like the same concept. If we can split resources to work on two threads, maybe we can combine resources to work on 1 thread when 2 aren't needed).

Is it? Could be exactly the same then. I thought reverse-HT was to take a single-threaded application and forcibly do parallel-processing on it of sorts (in some manner of speculative multi-processing or some such).

At any rate if reverse-HT is as you describe then yes what I was thrashing around attempting to describe with clustered computing is reverse-HT.

I don't think that reverse-HT rumor, as it was described at the time, had any basis in reality whatsoever.

Are you talking about mitosis?

No, that was an Intel thing wasn't it? Reverse-HT (or -SMT) was supposed to allow a single thread to be split up among all the cores on the die automagically for an almost linear speedup (lulz) in performance. It was supposed to be AMD's answer to Hyper-Threading but work opposite to SMT, thus the name reverse-HT.
 

CTho9305

Elite Member
Jul 26, 2000
9,214
1
81
Originally posted by: Idontcare
I'd like to see this taken to the next level - a 16 or 32 "cluster" core that can operate up to 32 threads but can also process one thread using nearly all the power of the "unified" cluster.

I'm pretty sure this has to be just stupid wrong of me to state for a myriad of reasons, but at the moment I'm at a loss to make a reasonable argument why it can't work if clustered computing itself can work.

Hopefully someone can add some thoughts here to this thread, or drop me a pm if they'd rather.

You can't do that because you can't shuffle the data between execution units fast enough. The two problems are: 1) simple distance, and 2) bypassing. For performance reasons, it's extremely preferable to support back-to-back issue of dependent operations.

1) If you stick two pipes next to each other, you might be able to get data between them within a cycle, but at 16 there's no chance. Even at 4, if you have all the normal CPU core junk in between the execution units, you'll end up with too much distance between them to send data within a cycle.

2) You basically have a large crossbar in front of the execution units so that the result from any one unit can feed any other unit the next cycle. As you add restrictions on which units can bypass to others, scheduling complexity goes up and performance goes down. There's a practical limit to how many execution units you can bypass between without adding cycles.

The Alpha 21264 used two clusters of execution units (2 pipes each). Each cluster included a register file, and data was always written to both. It took a cycle for data to cross between clusters, so if you had instruction 1 in cluster 0 and instruction 2 in cluster 1, you could not execute them back to back. They claim "a few percent" performance impact, but obviously as you increase the number of clusters and the latency, the performance impact will quickly outweigh the potential performance improvements (diminishing returns in # of pipes, but increasing costs). See section 4 of this paper for details.
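That cross-cluster penalty can be illustrated with a toy model (my own sketch, not from the paper): a chain of dependent single-cycle ops pays an extra cycle whenever consecutive ops land in different clusters. Steering is random here, which is pessimistic; real hardware tries to keep dependency chains inside one cluster.

```python
import random

# Toy model (illustrative only): a chain of dependent 1-cycle ops, each
# randomly steered to one of `clusters` execution clusters. Crossing
# clusters between dependent ops costs `xfer_penalty` extra cycles.
def chain_cycles(length, clusters, xfer_penalty=1, seed=0):
    rng = random.Random(seed)
    prev = rng.randrange(clusters)
    cycles = 0
    for _ in range(length):
        cur = rng.randrange(clusters)
        cycles += 1 + (xfer_penalty if cur != prev else 0)
        prev = cur
    return cycles

print(chain_cycles(1000, 1))   # 1000 cycles: one cluster, no crossings
print(chain_cycles(1000, 2))   # more clusters -> more crossings -> more cycles
print(chain_cycles(1000, 16))  # nearly every dependent op pays the penalty
```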
 

JFAMD

Senior member
May 16, 2009
565
0
0
Originally posted by: Idontcare
Have you guys seen this already?

The specINT and specFP numbers were added by someone after AMD presented the slide; it has been "doctored," so to speak, and the numbers are not from AMD. But the rest of the graphic is, including the performance-normalized scale on the y-axis.

I addressed that in the other forum. That is a slide that was done in PowerPoint. The graphic was drawn in PowerPoint, then it was converted to PDF.

The overall performance increases are correct, but pay no attention to the numbers; they are not exact. The person that did the math is making some assumptions about the precision of the graphics.


 

Idontcare

Elite Member
Oct 10, 1999
21,110
64
91
Originally posted by: JFAMD
Originally posted by: Idontcare
Have you guys seen this already?

The specINT and specFP numbers were added by someone after AMD presented the slide; it has been "doctored," so to speak, and the numbers are not from AMD. But the rest of the graphic is, including the performance-normalized scale on the y-axis.

I addressed that in the other forum. That is a slide that was done in powerpoint. The graphic was drawn in powerpoint. Then it was converted to PDF.

The overall performance increases are correct but pay no attention to the numbers, they are not exact, the person that did the math is making some assumptions about precision of the graphics.

Yeah I have no idea where I came across it, just remembered the backstory and had saved the file for later viewing.

Wasn't really trying to imply the specint/fp stuff was relevant (but I don't have the original undoctored slide), just the performance scaling (arbitrary, since we don't know the actual application) that AMD was willing to claim for BD over PhII.
 

Idontcare

Elite Member
Oct 10, 1999
21,110
64
91
Originally posted by: CTho9305
Originally posted by: Idontcare
I'd like to see this taken to the next level - a 16 or 32 "cluster" core that can operate up to 32 threads but can also process one thread using nearly all the power of the "unified" cluster.

I'm pretty sure this has to be just stupid wrong of me to state for a myriad of reasons, but at the moment I'm at a loss to make a reasonable argument why it can't work if clustered computing itself can work.

Hopefully someone can add some thoughts here to this thread, or drop me a pm if they'd rather.

You can't do that because you can't shuffle the data between execution units fast enough. The two problems are: 1) simple distance, and 2) bypassing. For performance reasons, it's extremely preferable to support back-to-back issue of dependent operations.

1) If you stick two pipes next to each other, you might be able to get data between them within a cycle, but at 16 there's no chance. Even at 4, if you have all the normal CPU core junk in between the execution units, you'll end up with too much distance between them to send data within a cycle.

2) You basically have a large crossbar in front of the execution units so that the result from any one unit can feed any other unit the next cycle. As you add restrictions on which units can bypass to others, scheduling complexity goes up and performance goes down. There's a practical limit to how many execution units you can bypass between without adding cycles.

The Alpha 21264 used two clusters of execution units (2 pipes each). Each cluster included a register file, and data was always written to both. It took a cycle for data to cross between clusters, so if you had instruction 1 in cluster 0 and instruction 2 in cluster 1, you could not execute them back to back. They claim "a few percent" performance impact, but obviously as you increase the number of clusters and the latency, the performance impact will quickly outweigh the potential performance improvements (diminishing returns in # of pipes, but increasing costs). See section 4 of this paper for details.

Thanks :thumbsup:

Yep, your post convincingly explains the nuts and bolts (latency defeats the purpose in a rather steep point of diminishing returns sort of way) of why it wouldn't be scalable as I was trying to envision it.

Figure 3 of that paper, got it, nicely displays the challenge. I wasn't expecting there to be a free lunch from this, but I did have high hopes of there being a continental breakfast included :laugh: