Originally posted by: Kuzi
Originally posted by: IntelUser2000
Originally posted by: Kuzi
Yep, you are right about the L1 cache; I was mainly thinking of the extra Integer Unit in each core. K10 processors have only one Integer Unit, while BD may have two.
"The CPU diagram is really interesting having two integer units, with each unit having two ALUs. The K8/K10 architectures have one integer unit with three ALUs."
What do you mean by "Integer Units"? Usually that means the ALUs, but you're talking as if it's something else. The number of ports? Or the things called "Integer Clusters" in the pic? The definition seems very vague.
Let's call the part of the CPU that does integer calculations an Integer Execution Unit. In K10 this is what it looks like.
Notice the three ALUs at the lower left of the diagram; these are part of the Integer Execution Unit in K10 CPUs. It's 3-way superscalar, able to issue 3 integer operations per clock cycle.
Notice on the Bulldozer diagram, there are "two" Integer Execution Units per core, called Clusters on the diagram. Now from the info IDC provided:
Here is the link to Dresdenboy's patent search results into AMD MPU:
- clustered multithreading with 2 int clusters, each of them having:
- 2 ALUs, 2 AGUs
- one L1 data cache
- scheduler, integer register file (IRF), ROB
(see 20080263373*, 20080209173, 7315935)
Each of those INT clusters will have two ALUs. If you combine the two clusters into one unit (4 ALUs), you get the ability to issue 4 integer operations per clock (like Core 2/i7). Or run each cluster separately and the CPU can run two threads at a time (SMT).
This is what we are assuming AMD might do to add SMT capability into Bulldozer.
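To put numbers on the two modes described above, here is a minimal sketch. It is only a toy model of the figures quoted from the patent summary, not a description of how the real hardware works. The function name `core_modes` and the "fused"/"split" labels are made up for illustration.

```python
# Toy model of the numbers quoted above; not real Bulldozer behavior.
ALUS_PER_CLUSTER = 2    # from the patent summary: 2 ALUs per integer cluster
CLUSTERS_PER_CORE = 2   # from the patent summary: 2 integer clusters
K10_ALUS = 3            # K8/K10: one integer unit with three ALUs

def core_modes():
    """Return the two hypothetical operating modes of one core."""
    return {
        # Both clusters serve one thread: 2 x 2 = 4 ALUs for that thread.
        "fused": {"threads": 1,
                  "alus_per_thread": ALUS_PER_CLUSTER * CLUSTERS_PER_CORE},
        # Each cluster serves its own thread: 2 ALUs per thread.
        "split": {"threads": CLUSTERS_PER_CORE,
                  "alus_per_thread": ALUS_PER_CLUSTER},
    }

if __name__ == "__main__":
    for name, mode in core_modes().items():
        print(name, mode, "| K10 single thread ALUs:", K10_ALUS)
```

In this model a fused core gives one thread more ALUs than K10 (4 vs 3), while a split core gives each of two threads fewer (2 vs 3). Whether the real front end could keep 4 ALUs fed is a separate question the sketch does not address.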
Kuzi, this method of dynamically busting up the clusters to enable SMT "as needed" is intriguing when put into reverse. It seems like hardware mitosis of sorts... not to actually take single threads and make them multi-threaded, but rather to enable the option of making a multi-threaded core (or multi-core CPU) function as a faster single-thread (or fewer-thread) processor when that is all that is needed.
(I know I am saying this rather poorly, I apologize for that, the proper words are escaping me at the moment)
If this clustered processing technique really works, we could imagine a 16-core, 32-thread capable Interlagos chip that, when challenged with say only 8 threads, can suddenly and dynamically configure the clusters so as to become a seemingly more efficient 8-core, 8-thread processor just for the time that the 8 threads are processing.
Can it really operate like that? If we concede that clustered processing enables an SMT-like approach, then it would seem we have to concede that clustered processing enables reverse-SMT-like capabilities as well.
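The Interlagos scenario above can be sketched as a simple allocation policy. This is pure speculation in code form: the function `configure` and its policy (fuse a core's clusters whenever there are enough cores to give each thread its own) are assumptions for illustration, and nothing here is confirmed AMD behavior. It assumes 2 clusters per core, as in the patent summary.

```python
def configure(cores, runnable_threads):
    """Hypothetical policy: how many cores run fused (1 thread, both
    clusters), split (2 threads, one cluster each), or sit idle.
    Assumes 2 integer clusters per core."""
    max_threads = cores * 2
    threads = max(0, min(runnable_threads, max_threads))
    if threads <= cores:
        # Enough cores for every thread to get a whole fused core.
        fused, split = threads, 0
    else:
        # Each split core absorbs one thread beyond the core count.
        split = threads - cores
        fused = cores - split
    return {"fused": fused, "split": split, "idle": cores - fused - split}

if __name__ == "__main__":
    for t in (8, 24, 32):
        print(t, "threads ->", configure(16, t))
```

With 16 cores and 8 threads, this toy policy ends up with 8 fused cores and 8 idle ones, which is one way to read the "8-core 8-thread" configuration described above. At 32 threads every core runs split.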