Question On the virtues of SMT, or lack thereof


Carfax83

Diamond Member
Nov 1, 2010
6,841
1,536
136
It's been a while since we've had a dedicated SMT thread where we can debate the pros and cons and ask ourselves and each other whether Intel and AMD should ditch the technology or keep it......and also whether ARM CPUs should adopt it.

Having owned many SMT capable CPUs, I can definitely say one thing: the introduction of efficiency cores has lessened the impact of that technology. I ran a test in another thread where I transcoded a 4K60 FPS video to x265 and logged the difference between having HT on and off. If I recall correctly, HT on yielded an 8.37% performance increase over HT off. Power usage and temps were slightly higher with HT as well, but it wasn't a huge difference.
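For anyone who wants to repeat the HT on/off comparison, here is a minimal sketch of how the uplift figure can be computed from two runs. The function names and the sample numbers are made up for illustration (they are not the logged values from the test above); they are simply chosen so that the result comes out near 8.37%.

```python
def uplift_from_throughput(off: float, on: float) -> float:
    """Percent gain when the metric is higher-is-better (e.g. encode FPS)."""
    if off <= 0:
        raise ValueError("baseline throughput must be positive")
    return (on / off - 1.0) * 100.0


def uplift_from_time(off_seconds: float, on_seconds: float) -> float:
    """Percent gain when the metric is lower-is-better (e.g. total encode time)."""
    if on_seconds <= 0:
        raise ValueError("elapsed time must be positive")
    return (off_seconds / on_seconds - 1.0) * 100.0


if __name__ == "__main__":
    # Hypothetical numbers, NOT the measurements from the original test.
    print(f"FPS based:  {uplift_from_throughput(100.0, 108.37):.2f}%")
    print(f"Time based: {uplift_from_time(108.37, 100.0):.2f}%")
```

Note that the two forms only agree if the same amount of work is done in both runs, so it is safest to transcode the identical clip with identical settings and change nothing but the HT setting.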

At first I was a bit surprised, given the fact that on my previously owned SMT capable CPUs (ranging from Nehalem all the way to Broadwell-E), the HT advantage was much greater in encoding workloads. It was always double digits, as encoding typically has both high TLP and ILP. Raptor Lake was the first CPU I've ever tested in encoding that had a single digit performance increase for HT enabled. But obviously, those previous CPUs that I owned didn't have 16 efficiency cores either.

So the efficiency cores are definitely sucking up a lot of TLP in those workloads. Which begs the question, is SMT now worth keeping or should Intel (and AMD should they ever implement efficiency cores) ditch SMT completely in favor of these efficiency cores?

Honestly, I am leaning strongly towards keeping SMT, but not because I believe it necessarily increases multithreading performance significantly. I've been doing some research, and one interesting tidbit I came across, from a recently released Chips and Cheese article, convinced me of the virtues of SMT:

Golden Cove’s Lopsided Vector Register File – Chips and Cheese

Modern high performance CPUs from both Intel and AMD use SMT, where a core can run multiple threads to make more efficient use of various core resources. Ironically, a large motivation behind SMT is likely the need to improve single threaded performance. Doing so involves targeting higher performance per clock with wider and deeper cores. But scaling width and reordering capacity runs into increasingly diminishing returns. SMT is a way to counter those diminishing returns by giving each thread fewer resources that it can make better use of. SMT also introduces complexity because resources may have to be distributed between multiple active threads.

I definitely agree with the author's assessment here and it supports the performance characteristics I saw in my encoding test with HT on and off. HT/SMT is no longer just about increasing multithreaded performance. It's also about increasing single threaded performance. Case in point, my 13900KF saw an 8.37% gain in performance just by switching on HT. Does this mean that there was some TLP left that the 16 efficiency cores didn't tap into? Perhaps......but I doubt it. The task manager showed all 32 threads on my system at 100% capacity, as 4K transcoding is very compute intensive. After reading the Chips and Cheese article, what I now think happened is that HT let the performance cores increase throughput and make better use of their own resources. That's why the gain was much smaller than in the past: with the efficiency cores now eating up a lot of the TLP, SMT is primarily about increasing overall throughput in the core, irrespective of whether it's a single threaded or multithreaded application.

This is because of the lopsided vector register file structure. Apparently, this makes it easier for the cores to dynamically adapt to high TLP or low TLP workloads without negatively impacting performance. It seems it's kind of like having your cake and eating it too. Now if I had turned off the efficiency cores, I suspect the HT impact would have been much larger, since more TLP would have been available and the second thread would have been allowed more resources.

The author states that this approach is not only more performant, but more die space efficient as well. So with that said, I declare the SMT debate to be over with, in favor of SMT :D

OK I'm sure there will be plenty of dissent. But this to me is an indication that SMT is not what it used to be. It has evolved and is now much more adaptive to the workload.

This merits it being kept around in my opinion.
 

Nothingness

Diamond Member
Jul 3, 2013
3,292
2,360
136
Sarcasm aside, SPECrate is peculiar: it just launches n completely independent processes. Things might be a bit different with data sharing.

That said, I never bought that Intel slide with their claimed gains from SMT removal. It sounded too much like they removed the HT handling at design level, ran some limited simulations and published the claims. I bet the engineers went mad at the marketing droids.
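As a rough illustration of the SPECrate point above, here is a sketch of a rate-style run: it launches n copies of the same workload as completely independent processes, with no data shared between them, and reports copies finished per second. This is not SPEC's actual harness; the `run_rate` helper and the toy arithmetic loop are assumptions made up for the example.

```python
import subprocess
import sys
import time

# Hypothetical stand-in workload: a pure-Python arithmetic loop.
# Each copy runs in its own interpreter, so nothing is shared between copies.
WORKLOAD = "total = 0\nfor i in range({iters}):\n    total += i * i\n"


def run_rate(copies: int, iters: int = 1_000_000) -> float:
    """Launch `copies` independent processes and return copies finished per second."""
    code = WORKLOAD.format(iters=iters)
    start = time.perf_counter()
    procs = [subprocess.Popen([sys.executable, "-c", code]) for _ in range(copies)]
    for proc in procs:
        if proc.wait() != 0:
            raise RuntimeError("one of the copies failed")
    elapsed = time.perf_counter() - start
    return copies / elapsed


if __name__ == "__main__":
    for n in (1, 2, 4):
        print(f"{n} copies: {run_rate(n):.2f} copies/s")
```

Because the copies never touch each other's data, a run like this says nothing about how SMT behaves when threads share cache lines or contend on locks, which is the caveat being raised.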
 
Jul 27, 2020
26,024
17,952
146
That said, I never bought that Intel slide with their claimed gains from SMT removal. It sounded too much like they removed the HT handling at design level, ran some limited simulations and published the claims. I bet the engineers went mad at the marketing droids.
Me too. It just seems too suspicious, especially since we have Meteor Lake 185H HT on/off scores in the V-ray benchmark thread that show a minimum 15% uplift. That's too significant to just throw away, and Arrow Lake badly needs it, unless Lion Cove's SMT was too detrimental to most workloads due to some bug, they decided to ditch it for faster time to market, or the weird core arrangement in Arrow Lake led to significant latency issues.

I'm still hoping for a microcode/BIOS update reviving HT in Arrow Lake by end of 25H1 :p
 

Nothingness

Diamond Member
Jul 3, 2013
3,292
2,360
136
It just seems too suspicious, especially since we have Meteor Lake 185H HT on/off scores in the V-ray benchmark thread that show a minimum 15% uplift. That's too significant to just throw away, and Arrow Lake badly needs it, unless Lion Cove's SMT was too detrimental to most workloads due to some bug, they decided to ditch it for faster time to market, or the weird core arrangement in Arrow Lake led to significant latency issues.
I guess they had no choice if they wanted to keep single-thread PPA competitive. On paper.
 

NostaSeronx

Diamond Member
Sep 18, 2011
3,809
1,289
136
I know this quote is over 2 years old, but I can't let this go.

K10 was the µArch family that preceded Bulldozer, with its initial implementation being branded Phenom and codenamed Agena (the one with the TLB bug).

IIRC a combination of the TLB bug and less than impressive 65nm node stymied its adoption, with its TLB fixed 45nm successor Phenom II/Deneb/K10.5 getting much greater traction.

Then Phenom II X6/Thuban/K10.6 was its last desktop/server design, with AMD's initial 32nm APU Llano carrying its final implementation, which I suppose you could call K10.7.
K10 is Bulldozer
K8H, because of its 128-bit SIMD, was renamed to K9/Hounds, which is Agena/Deneb/Thuban.

AMD never called Agena K10, it was only ever K8H/K9/Hounds.
Llano = Husky-10
Thuban = Hounds-100
Regor = Hounds-60
Propus = Hounds-50
Deneb = Hounds-40
Agena = Hounds-20

Mitch Alsup: K10 is Bulldozer, K8 is Opteron and follow-ons.
AMD CVs: Verification of Bulldozer cores (microprocessor based on K10 micro architecture & M-SPACE design methodology)
Andy Glew = Chief Architect of K10 between 2002-2004, Chuck Moore = Chief Architect of Bulldozer between 2005-2007, Mike Butler = Chief Architect of K10 2.0 as Bulldozer 2008+. Before Andy Glew the clustered core project was for low-power. Which was actually tied to Alchemy/MIPS64/PCS Group.

K10 = Dual-cluster core, advanced threading not included.
Bulldozer = Dual-cluster core, advanced threading included. 2007-slide, 2008-samples, 2009-release.
K10 2.0/Bulldozer that released = Dual-core processor, advanced threading not included. 2009-slide, 2010-samples, 2011-release.
 

soresu

Diamond Member
Dec 19, 2014
3,895
3,331
136
Llano = Husky-10
Thuban = Hounds-100
Regor = Hounds-60
Propus = Hounds-50
Deneb = Hounds-40
Agena = Hounds-20
I can't find a single reference supporting any "hounds" name from your post besides Husky.

Upon digging deeper it seems that AMD gave up K nomenclature in all official documentation after K8.

It seems like Barcelona is the official name for the µArch.

barcelona.jpg

roadmap.jpg


Likewise Deneb for Shanghai, and Thuban for Istanbul.

While AMD changed up the nomenclature on the consumer side to keep things interesting, they left it city based for server, as it has remained up to Zen5 and beyond, presumably for continuity's sake with long term customer relations.
 

NostaSeronx

Diamond Member
Sep 18, 2011
3,809
1,289
136
I can't find a single reference supporting any "hounds" name from your post besides Husky.
Mike Clark: From there, I did the Greyhound (K9) core, I was the lead architect there, which was a derivative of K8.

- timestamp: 04/07/08
AMD K8 "Greyhound" ...

The core family, Family 10h, was called Hounds, as they were going to give a different dog name to each core generation.

K8L (K8 Lions) became Family 11h and K8H (K8 Hounds) became Family 10h.

AMD 65nm/45nm 'greyhound core' CPU designs.
- Managed Execution Unit team for 45nm CPU: Led team in designing the execution unit and reorder buffer blocks for the Greyhound family.
NPI Greyhound/Ridgeback/Pharoah products
Greyhound = dog => Agena, 65nm, Deerhound = dog => Barcelona, HT Assist
Ridgeback = dog => Deneb, 45nm, Bloodhound = dog => Propus (no L3), Dachshund = dog => Regor (dual-core)
Pharoah (Hound) = dog => Thuban, Turbo-core
Husky = dog => Llano, Dual Supply target 0.8V/1.3V, bigger OoO, 32nm.

K8 Hounds (2005-2008) K9 (2009-present) ... for the virtues of CMT, the critical 2005 slides:
amdcmt0.png
amdcmt.png
amdcmt2.png
amdcmt4.png
amdcmt5.png
Bulldozer (Multi-threaded Core) July 2007 != Bulldozer (Multi-core Module) November 2009

Dec 31, 2022 (post)
Sept 29, 2023 = Zen has a low-power core option (effectively, the pervasive core)
Sometime in 2024 = AMD Saxony Strained SOI engineers returned to Dresden.
AMD Zen3 = Floating Point Unit => Cluster-based Multithreading
AMD Zen5 = Front-end => Cluster-based Multithreading
Just the Integer core is left to switch to Cluster-based Multithreading.
 
Jul 27, 2020
26,024
17,952
146
1738931988961.png

That's CRAZY SMT uplift. I wonder if we get something similar on 9950X or if Epyc 9005 is able to perform better mainly due to the higher RAM bandwidth.
 

Nothingness

Diamond Member
Jul 3, 2013
3,292
2,360
136
Didn't think anyone would miss them. It's not removal per se. Openvino is still visible and anyone can tell that it's not seeing any improvement.
I'm talking about the vino outliers at the top. No one can see them in your picture 😉
 

Attachments

  • Screenshot_20250207_192141_Samsung Internet.jpg