Curious about AMD's SMT

Verndewdimus

Member
Nov 18, 2016
60
21
81
www.reverbnation.com
I was doing a bit of digging and found that AMD filed a patent in '96 that borrowed a little from Intel's HT. I then dug a bit further into IBM's SMT, and SMT in general, here: http://ibmsystemsmag.com/mainframe/trends/ibm-research/smt_mainframe/?page=2. I am wondering if anyone's done the backtracking to see how much Intel may or may not have borrowed from IBM, and the trickle-down to AMD. The above article clearly explains why neither would do more than two threads per core; it's a very good read.

I know, it's wiki, but IBM first researched SMT back in 1968: https://en.wikipedia.org/wiki/Simultaneous_multithreading#Historical_implementations
So Intel borrowed from Tullsen's work with IBM for HT. Maybe AMD did as well.
I wonder why AMD didn't utilize it earlier. I remember at the time Intel introduced Prescott, the chip was running hot with performance hits for SMT, and AMD's FX was killing it. Prescott was an immature move for Intel, but it all worked out. I am not sure if IBM suspended SMT work after Tullsen's design, but now there are several companies offering as many as 8 threads per core.
 

NostaSeronx

Diamond Member
Sep 18, 2011
3,811
1,290
136
AMD question:
Around that timeframe, Clustered Multithreading was considered a viable alternative. CMP was cheaper than SMT as well.

On the other hand, significant sharing of the front-end resources is the best approach. When compared against large monolithic SMT processors, a CMT processor provides very competitive IPC performance on average, 90-96% of that of partitioned SMT while being more scalable and much more power efficient. In a CMP organization, the gap between SMT and CMT processors shrinks further, making a CMP of CMT processors a highly viable alternative for the future.
- IEEE International Symposium on Performance Analysis of Systems and Software, 2005. ISPASS 2005, 20-22 March 2005.
www.cs.rochester.edu/u/sandhya/papers/ispass05.pdf

https://i.imgur.com/2tp1dep.png
Note the year.

Then, you have Intel dropping SMT for this: https://www.google.com/patents/US20120246657
(Pay attention to what the virtual cores mode implies)
[redacted] 2019.
 

NostaSeronx

Diamond Member
Sep 18, 2011
3,811
1,290
136
Am I looking at that patent right, in that Intel moved in a more CMT direction?
They planned to move to CMT around the same time as AMD, with CSMT, which is the successor to CMT and SMT: more efficiency and more IPC than both. The major reasons they didn't go the same route are the failures of P68 (Netburst) and P7 (Itanium). P8 was then delayed indefinitely, but has popped up in LinkedIn/patents (2014-2017). The issue is that the codenames are, of course, in code: the Pentium 8 and Pentium 8+ codes share names with other Intel products. The patents specifically mention Pentium 4 or Itanium, while also talking about clusters of cores, and cores with multiple independent FE (instruction bus)-EX (execution datapaths/data bus)-BE (control unit) units, aka "cores".

The original patent notes four cores within a core, while the more modern one notes eight cores within four cores, which makes it more confusing than anything.
 

Verndewdimus

Member
Nov 18, 2016
60
21
81
www.reverbnation.com
Well, I was reading that IBM observed that, in assessing how much a core is working, assuming 100% wasn't really rational: when the second thread was introduced and increased performance by 30 percent or so, your single-threaded core was probably at a 70 percent load. So, assuming IBM is right, how is CSMT going to add more IPC without taking a performance hit? (Assuming I read that correctly.)
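
A rough back-of-envelope check of that reasoning, as a sketch only. It assumes that two threads together fully saturate the core, which is a simplification and not something the IBM article is known to claim; the function name is made up for illustration.

```python
def implied_single_thread_utilization(smt_speedup: float) -> float:
    """Fraction of the core one thread keeps busy, ASSUMING (hypothetically)
    that two threads together reach 100% utilization."""
    if smt_speedup < 1.0:
        raise ValueError("speedup must be >= 1.0")
    return 1.0 / smt_speedup

# A 30% throughput gain from the second thread implies the single thread
# was using about 1 / 1.3 of the core under this assumption.
print(f"{implied_single_thread_utilization(1.30):.0%}")  # prints 77%
```

Under that assumption the single thread sits nearer 77% than 70%, but the point is the same: a lone thread leaves a sizeable share of the core idle.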

Now I am reading this https://books.google.com/books?id=ujQ9dt0DFhkC&pg=PA128&lpg=PA128&dq=csmt+threading+performance&source=bl&ots=8f3YtFejqV&sig=MRbJpD1aKi7uIFm3XLM8PxuBm34&hl=en&sa=X&ved=0ahUKEwi-tvHe_ZHYAhUNwWMKHfKYAJ0Q6AEIZzAJ#v=onepage&q=csmt threading performance&f=false
That's a far bigger number than IBM's article suggested, @ 225, and 85 percent over single thread and IMT.
 

NostaSeronx

Diamond Member
Sep 18, 2011
3,811
1,290
136
That's a far bigger number than IBM's article suggested, @ 225, and 85 percent over single thread and IMT.
Slightly different targets, but similar results...

"The compiler assigns always first clusters to all threads and tries to reduce the number of clusters used by
each thread in order to also reduce communication overhead among different clusters. As a consequence, the
assigned clusters collide when several threads are executed simultaneously. CSMT avoids this problem and
allows a more parallel execution of the threads by renaming the clusters previously assigned by the compiler.
The renaming mechanism is fast and has a very low hardware complexity.

Our results show that CSMT makes a better use of clusters than interleaved multithreading (IMT). In terms
of performance, CSMT clearly outperforms IMT. For instance, for a 4-cluster machine with 4 threads, CSMT
shows an average speedup of 75% over a single threaded machine and of 53% over IMT assuming no cache
misses, which shows the ability of CSMT to hide horizontal no-ops. When a realistic memory system is
assumed, the speedup over single thread increases to 118% while speedup over IMT is reduced to 41%. This is
because now most of the resources wasted are due to stalls caused by cache misses, and IMT already does a
very good job hiding memory latency. However, CSMT still has a very significant advantage over IMT due to
its ability to hide both vertical and horizontal no-ops."
https://www.ac.upc.edu/RR/2006/23.pdf
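
A toy sketch of the collision problem the abstract describes, and of how renaming clusters can avoid it. The rotate-by-thread-id scheme and every name below are assumptions for illustration; the paper's actual renaming mechanism may differ.

```python
from typing import List, Set

def rename(clusters: Set[int], thread_id: int, n_clusters: int) -> Set[int]:
    """Rotate compiler-assigned cluster ids by a per-thread offset
    (an ASSUMED renaming scheme, not taken from the paper)."""
    return {(c + thread_id) % n_clusters for c in clusters}

def issue_one_cycle(requests: List[Set[int]], n_clusters: int, renaming: bool) -> List[int]:
    """Return the ids of threads that can issue together in one cycle.

    requests[t] is the set of clusters that thread t's next instruction
    uses, as assigned by the compiler. A thread issues only if none of its
    clusters is already taken by a lower-numbered thread.
    """
    busy: Set[int] = set()
    issued: List[int] = []
    for t, wanted in enumerate(requests):
        physical = rename(wanted, t, n_clusters) if renaming else set(wanted)
        if busy.isdisjoint(physical):
            busy |= physical
            issued.append(t)
    return issued

# Four threads, each compiled to use only the first cluster (cluster 0).
requests = [{0}, {0}, {0}, {0}]
print(issue_one_cycle(requests, 4, renaming=False))  # [0]
print(issue_one_cycle(requests, 4, renaming=True))   # [0, 1, 2, 3]
```

Without renaming, every thread fights over cluster 0 and only one issues per cycle; with the rotation, all four fit side by side. Wider instructions can still collide after rotation, so this shows the idea only, not the reported speedups.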

Intel side:
https://www.slideshare.net/rusnano/soft-machinesvis-ctmarchitecturetechbriefingvf
https://www.anandtech.com/show/10025/examining-soft-machines-architecture-visc-ipc

Why Soft Machines had to be bought by Intel:
https://www.google.com/patents/US20110271056
^-- which is essentially an extremely vague grasp at CMT, CSMT, and the future of SMT. The dynamic back-end is an early grab at rSMT.

https://www.google.com/patents/US8595468
^- rSMT which is more core multiprocessing level. It can be implemented very well in CMT and CSMT to get that single threaded increase. The rSMT bus can be implemented in the front-end (CSMT), back-end (pSMT), or in the execution datapath (CMT) or all three (CSMT).
 

Verndewdimus

Member
Nov 18, 2016
60
21
81
www.reverbnation.com
Well, I've gotten way more information than I initially wanted. But Intel's and AMD's processes aren't entirely dissimilar in the sense of clustered multithreading; I even went back on the Opteron timeline regarding Interlagos. I wonder what AMD has for a VISC-type scaling solution, because that's long been an issue in CPUs. Thanks for the chat, man. So, in your opinion, is AMD still needing a better flush, renaming, and no-op design? The cycling that deals with cache misses.

I mean Epyc is kind of a wtf at this point. Some obligatory step forward for the server scene with an implied promise of evolution?
 

NostaSeronx

Diamond Member
Sep 18, 2011
3,811
1,290
136
Bulldozer patents had everything AMD needed. CSMT is hinted at, and so is a much larger core.

Bulldozer patents say:
(Quad AGLU)
- In one embodiment, the ALU 220 and the AGU 222 are implemented as the same unit.
(Multiple FPUs)
- By utilizing multiple integer execution units that share an FPU (or share multiple FPUs) and that share a single pre-processing front-end unit, increased processing bandwidth afforded by multiple execution units can be achieved while reducing or eliminating the design complexity and power consumption attendant with conventional designs that utilize a separate pre-processing front-end for each integer execution unit. Further, because in many instances it is the execution units that result in bottlenecks in processing pipelines, the use of a single shared front-end may introduce little, if any, delay in the processing bandwidth as the fetch, decode, and dispatch operations of the front-end unit often can be performed at a higher instruction-throughput than the instruction-throughput of two or more execution units combined.
(Variant rSMT)
- Eager execution is a technique frequently used to improve single threaded execution by concurrently pursuing both paths of possible execution following a conditional branch. Many branches are difficult to predict and it may be advantageous to fetch and execute down both branch paths rather than making a prediction and continuing with fetch and execution down only the predicted branch path. This mode of execution naturally creates two “streams” of integer operation execution that could each individually be directed to one of the clusters of execution. One path (e.g. the “not-taken” path) could continue to execute on the original cluster, while the “taken” path could begin execution on the other cluster. When the branch is resolved, one path is terminated while the other continues. The difficulty with this use of the previously independent clusters is that they now need to communicate architectural state in order to “fork” two streams from the initial single thread. It is also advantageous to have any cached microarchitectural state (L1 data caches, L1 translation lookaside buffers (TLBs), etc.) be present in both clusters for improved performance for both the taken and not-taken paths.
(CSMT/Virtual Core)
- Each pipeline stage can independently select between threads such that, at any given pipeline cycle, the pipeline stage can have instruction data from different threads distributed among its substages. This independent selection at each pipeline stage can facilitate more even progress between threads. In at least one embodiment, the first selected thread and the second selected thread can be the same thread or different threads. The selection of the first selected thread and the selection of the second selected thread can be performed based on thread priority, based on a comparative amount of instruction data buffered for one thread versus another (e.g., based on a ratio of the amount of buffered instruction data for one thread to the amount of buffered instruction data for another thread), based on a round-robin method, or a combination thereof.
(Individualistic Dispatch)
- A front-end unit coupled to the first execution unit via a first dispatch bus and coupled to the second execution unit via a second dispatch bus separate from the first dispatch bus, the first dispatch bus configured to concurrently transmit a first dispatch group of up to N instruction operations from the front-end unit to the first execution unit for a dispatch cycle and the second dispatch bus configured to concurrently transmit a second dispatch group of up to N instruction operations from the front-end unit to the second execution unit for the dispatch cycle.
(Dispatch 2.0 CSMT)
- Alternately, integer instruction operations can be dispatched to the integer execution units 212 and 214 opportunistically. To illustrate, assume again that two threads T0 and T1 are being processed by the processing pipeline 200. In this example, the instruction dispatch module 210 can dispatch integer instruction operations from the threads T0 and T1 to either of the integer execution units 212 and 214 depending on thread priority, loading, forward progress requirements, and the like.

All patents: 2007~2009.
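
For the per-stage thread selection excerpt (CSMT/Virtual Core) above, here is a minimal sketch of one way the three criteria could be combined. The ordering (priority first, then amount of buffered instruction data, then round-robin as a tie-breaker) and all names are assumptions for illustration; the patent text only says the criteria may be used alone or in combination.

```python
from dataclasses import dataclass
from typing import Dict, Optional

@dataclass
class StageSelector:
    """Hypothetical per-stage thread picker; each pipeline stage would own
    one instance, so stages can choose threads independently."""
    last: int = -1  # thread picked on the previous cycle (round-robin state)

    def pick(self, buffered: Dict[int, int],
             priority: Optional[Dict[int, int]] = None) -> Optional[int]:
        # Only threads with buffered instruction data can be selected.
        ready = [t for t, n in buffered.items() if n > 0]
        if not ready:
            return None
        prio = priority or {}
        n_threads = max(buffered) + 1

        def key(t: int):
            # Smaller tuple wins: higher priority, then more buffered data,
            # then the thread that comes soonest after the last pick.
            rr_distance = (t - self.last - 1) % n_threads
            return (-prio.get(t, 0), -buffered[t], rr_distance)

        choice = min(ready, key=key)
        self.last = choice
        return choice

stage = StageSelector()
print(stage.pick({0: 4, 1: 4}))  # 0 (tie, round-robin starts at thread 0)
print(stage.pick({0: 4, 1: 4}))  # 1 (tie, round-robin moves on)
```

Comparing buffered amounts directly stands in for the ratio test the excerpt mentions. Giving every stage its own selector is what lets different stages hold instruction data from different threads in the same cycle.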
 

Verndewdimus

Member
Nov 18, 2016
60
21
81
www.reverbnation.com
I looked into Bulldozer and it sort of interested me that they only went partial CMT. Given that Opteron fully went CMT, I would have expected better of Epyc, as most everyone else did.
 

NTMBK

Lifer
Nov 14, 2011
10,456
5,843
136
I looked into Bulldozer and it sort of interested me that they only went partial CMT. Given that Opteron fully went CMT, I would have expected better of Epyc, as most everyone else did.

Epyc's problems have nothing to do with CMT vs SMT. Its core is amazing, an enormous leap over anything AMD has made before. The weakness is in latency between CPU clusters: if your workload requires much intercore communication, that slow Fabric will hurt you.

Don't listen to Seronx about CMT, he has a weird hard on for it.
 

DrMrLordX

Lifer
Apr 27, 2000
22,948
13,038
136
NTMBK really isn't steering you wrong, though. SeronX is a bit loony, and he has knowingly posted misinformation/disinformation here before to suit his own oddball agenda.

If you've learned something then great. But CMT here is a red herring.
 

CatMerc

Golden Member
Jul 16, 2016
1,114
1,153
136
I wonder if 10 years from now he will be right and link these threads to make sure everyone knows he predicted this. :p
 

IntelUser2000

Elite Member
Oct 14, 2003
8,686
3,787
136
I wonder why AMD didn't utilize it earlier. I remember at the time Intel introduced Prescott, the chip was running hot with performance hits for SMT, and AMD's FX was killing it.

Intel's version of SMT wasn't introduced with 90nm Prescott in 2004, but with 0.13u Northwood in late 2002.

Despite Netburst being quite narrow when not hitting the Trace Cache, the gains from Hyperthreading were smaller than with Nehalem and its successors.

It was revealed that this was due to a peculiarity of how Netburst chips were designed. The chips would end up being essentially way too aggressive in speculation and lose performance, and that ended up sapping potential gains for SMT.
 

amd6502

Senior member
Apr 21, 2017
971
361
136
I remember at the time Intel introduced Prescott, the chip was running hot with performance hits for SMT, and AMD's FX was killing it. Prescott was an immature move for Intel, but it all worked out.

I'm not sure about this, but weren't Intel's earlier implementations of "hyperthreading" blocked multithreading rather than real SMT?
 

SarahKerrigan

Senior member
Oct 12, 2014
735
2,036
136
I'm not sure about this, but weren't Intel's earlier implementations of "hyperthreading" blocked multithreading rather than real SMT?

In mainline x86 implementations, it's always been SMT (cf. the Hot Chips presentation for Pentium 4 SMT). In Itanium, Intel did coarse-grained (switch-on-event) multithreading instead. Performance was underwhelming, as a thread switch required a pipeline flush.