Question On the virtues of SMT, or lack thereof

Carfax83

Diamond Member
Nov 1, 2010
6,841
1,536
136
It's been a while since we've had a dedicated SMT thread where we can debate the pros and cons, ask ourselves and each other whether Intel and AMD should ditch the technology or keep it, and also whether ARM CPUs should adopt it.

Having owned many SMT-capable CPUs, I can definitely say one thing: the introduction of efficiency cores has lessened the impact of that technology. I ran a test in another thread where I transcoded a 4K60 FPS video to x265 and logged the difference between having HT on and off. HT on yielded an 8.37% performance increase over HT off, if I recall correctly. Power usage and temperatures were slightly higher with HT as well, but it wasn't a huge difference.
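If anyone wants to reproduce the comparison, a harness can be as simple as something like this (a rough sketch, not my exact script; it assumes ffmpeg with libx265 on the PATH, the filenames are placeholders, and the HT toggle itself happens in the BIOS between runs):

```python
import subprocess
import time

def timed_encode(out_name: str) -> float:
    """Encode the source clip to x265 and return elapsed wall time."""
    start = time.perf_counter()
    subprocess.run(
        ["ffmpeg", "-y", "-i", "source_4k60.mkv",
         "-c:v", "libx265", "-preset", "medium", out_name],
        check=True, capture_output=True,
    )
    return time.perf_counter() - start

# Run once with HT on, reboot with HT off in the BIOS, run again,
# then compare the two times.
print(f"Encode took {timed_encode('out_x265.mkv'):.1f} s")
```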

At first I was a bit surprised, given that on my previously owned SMT-capable CPUs (ranging from Nehalem all the way to Broadwell-E), the HT advantage was much greater in encoding workloads. It was always double digits, as encoding typically has both high TLP and ILP. Raptor Lake is the first CPU I've ever tested where encoding showed only a single-digit performance increase with HT enabled. But obviously, those previous CPUs didn't have 16 efficiency cores either.

So the efficiency cores are definitely sucking up a lot of TLP in those workloads. Which begs the question: is SMT now worth keeping, or should Intel (and AMD, should they ever implement efficiency cores) ditch SMT completely in favor of these efficiency cores?

Honestly, I am leaning strongly towards keeping SMT, but not because I believe it necessarily increases multithreaded performance significantly. I've been doing some research, and one interesting tidbit from a recently released Chips and Cheese article convinced me of the virtues of SMT:

Golden Cove’s Lopsided Vector Register File – Chips and Cheese

Modern high performance CPUs from both Intel and AMD use SMT, where a core can run multiple threads to make more efficient use of various core resources. Ironically, a large motivation behind SMT is likely the need to improve single threaded performance. Doing so involves targeting higher performance per clock with wider and deeper cores. But scaling width and reordering capacity runs into increasingly diminishing returns. SMT is a way to counter those diminishing returns by giving each thread fewer resources that it can make better use of. SMT also introduces complexity because resources may have to be distributed between multiple active threads.

I definitely agree with the author's assessment here, and it supports the performance characteristics I saw in my encoding test with HT on and off. HT/SMT is no longer just about increasing multithreaded performance; it's also about increasing single-threaded performance. Case in point: my 13900KF saw an 8.37% gain just by switching on HT. Does this mean there was some TLP left that the 16 efficiency cores didn't tap into? Perhaps, but I doubt it. Task Manager showed all 32 threads on my system at 100% capacity, as 4K transcoding is very compute intensive.

After reading the Chips and Cheese article, what I think happened is that HT enabled the performance cores to increase throughput and efficiency, i.e. to better utilize the P-cores. That's why the gain was much smaller than in the past: with the efficiency cores now eating up a lot of the TLP, SMT is primarily about increasing overall throughput in the core, irrespective of whether the application is single-threaded or multithreaded.

This is because of the lopsided vector register file structure. Apparently, this makes it easier for the cores to dynamically adapt to high-TLP or low-TLP workloads without negatively impacting performance. It seems it's kind of like having your cake and eating it too. Now, if I had turned off the efficiency cores, I suspect the HT impact would have been much larger, since more TLP would have been available and the second thread would have been allowed more resources.

The author states that this approach is not only more performant, but more die-space efficient as well. So with that said, I declare the SMT debate to be over, in favor of SMT :D

OK I'm sure there will be plenty of dissent. But this to me is an indication that SMT is not what it used to be. It has evolved and is now much more adaptive to the workload.

This merits it being kept around in my opinion.
 
Last edited:

Exist50

Platinum Member
Aug 18, 2016
2,452
3,105
136
I definitely agree with the author's assessment here and it supports the performance characteristics I saw in my encoding test with HT on and off. HT/SMT is no longer just about increasing multithreaded performance. It's also about increasing single threaded performance.
I think you're misreading the quote there. He's saying that the pursuit of high ST performance has resulted in often-unused execution resources that SMT helps take advantage of, not that SMT itself increases ST performance. I'm not aware of data that shows a single thread benefits in any way from SMT.

But yes, there have been various changes to the way SMT is implemented over the years to help improve its weaknesses. But I think that's kind of the core problem with SMT. If it were only an on-off switch that cost 5-10% more transistors, it would be a no-brainer, but instead it's a feature that needs active development and maintenance, all at a time when its role is being taken over by hybrid and small-core server chips.
 
Last edited:

zir_blazer

Golden Member
Jun 6, 2013
1,219
508
136
Case in point, my 13900KF saw a 8.37% gain in performance just by switching on HT. Does this mean that there was some TLP left that the 16 efficiency cores didn't tap into?
You mean 8.37% total performance WITH the 16 E-Cores enabled? Disable the E-Cores and try again to see how SMT really performs, as 8P/8T + 16E/16T vs 8P/16T + 16E/16T is diluting the gains compared to 8P/8T vs 8P/16T.
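To put toy numbers on the dilution (purely illustrative, not measurements): if HT adds, say, 20% to the P-cores on their own, a large fixed chunk of E-core throughput shrinks the measured whole-chip gain into single digits:

```python
p_cores = 100.0      # 8P/8T throughput, arbitrary units
ht_gain = 0.20       # assume HT alone adds 20% to the P-cores
e_cores = 120.0      # fixed throughput contributed by the 16 E-cores

no_e = (p_cores * (1 + ht_gain)) / p_cores - 1
with_e = (p_cores * (1 + ht_gain) + e_cores) / (p_cores + e_cores) - 1
print(f"HT gain, E-cores off: {no_e:.1%}")    # 20.0%
print(f"HT gain, E-cores on:  {with_e:.1%}")  # ~9.1%
```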
 
  • Like
Reactions: Carfax83

Carfax83

Diamond Member
Nov 1, 2010
6,841
1,536
136
I'm not aware of data that shows a single thread benefits in any way from SMT.

Has any tech reviewer or outlet tested HT on Golden Cove/Raptor Cove other than Chips and Cheese? Not to my knowledge. People assume that SMT behaves much like it always has, and if it weren't for Chips and Cheese we would probably still have the same views on it. I know I did.

Also, SMT may not directly affect ST performance, but it apparently does so indirectly by lowering the amount of resources available to each thread. This seems contradictory to me, but I suppose it's similar to how having a smaller cache means having a faster cache.

If I had to guess, some machine learning is involved in how the core decides how many resources to apportion to the main thread and to the assist thread. But there's no doubt to me that HT increased performance by 8.37% in a workload whose TLP was already tapped out by the 16 efficiency cores.

In an older core like, say, Skylake, I think HT would have resulted in no performance increase, or more likely a decrease, in that particular workload, assuming hypothetical efficiency cores were assisting.

That said, this shows that Intel is not going to give up on SMT, as they have actively improved its performance and efficiency.
 

Carfax83

Diamond Member
Nov 1, 2010
6,841
1,536
136
You mean 8.37% total performance WITH the 16 E-Cores enabled? Disable the E-Cores and try again to see how SMT really performs, as 8P/8T + 16E/16T vs 8P/16T + 16E/16T is diluting the gains compared to 8P/8T vs 8P/16T.

Yes, I agree the gains were diluted, and I will run some tests again when I have the time. One thing, though: when I have turned off the E-cores in the past, I noticed some performance drops in the cache benchmarks, so I don't know if this would tarnish the tests. But pulling 8.4% out of the cores is nothing to be scoffed at when all the TLP had already been sucked up by the efficiency cores.

The lopsided registers just allocated a smaller amount of resources to the performance cores so that they could achieve higher performance and efficiency. With SMT off, the cores apparently had more resources than they knew what to do with, and thus performance and efficiency suffered.
 

Carfax83

Diamond Member
Nov 1, 2010
6,841
1,536
136
OK, I did a quick encoding test. This one was done with x264 to save time, with the E-cores disabled, comparing HT on and off. HT on was 17% faster than HT off. I used the fast 1080p 30fps preset with the same source video I used before.

In the future I will redo it with x265, and I'm thinking that will yield a performance increase of 20% or more, as x265 has more TLP than x264.
 
  • Like
Reactions: lightmanek

ElFenix

Elite Member
Super Moderator
Mar 20, 2000
102,389
8,547
126
So the efficiency cores are definitely sucking up a lot of TLP in those workloads. Which begs the question: is SMT now worth keeping, or should Intel (and AMD, should they ever implement efficiency cores) ditch SMT completely in favor of these efficiency cores?
this is not begging the question.
 

TheELF

Diamond Member
Dec 22, 2012
4,027
753
126
If I had to guess, some machine learning is involved in how the core decides how many resources to apportion to the main thread and to the assist thread.
No, it goes by thread priority: with equal priority, each thread gets the same resources; if one has a much higher priority than the other, it will get much more of the resources.
You can use Process Hacker to see what priority any thread runs at, and even change priorities on individual threads to watch it change in real time.

But there's no doubt to me that HT increased performance by 8.37% in a workload whose TLP was already tapped out by the 16 efficiency cores.
You keep using that word. I do not think it means what you think it means.
Software issues as many threads as there are available cores. If you enable HT, you expose 8 more logical cores, so the software launches 8 more threads, and so there is more TLP.
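As a sketch of the usual sizing logic (not what x265 literally does, just the common pattern of sizing the worker pool from the logical core count the OS reports):

```python
import os
from concurrent.futures import ThreadPoolExecutor

# 32 logical cores on a 13900KF with HT on (8P*2 + 16E), 24 with HT off.
logical = os.cpu_count()
with ThreadPoolExecutor(max_workers=logical) as pool:
    results = list(pool.map(lambda n: n * n, range(logical)))
print(f"launched {logical} workers")
```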
 

VirtualLarry

No Lifer
Aug 25, 2001
56,570
10,202
126
One thing that I've wondered about is the partitioning for SMT.

What if, instead of perhaps 8 cores with 6-wide execution pipelines and two SMT threads per core, they widened it out to 8x6=48 execution units in ONE "hugecore" CPU (not unlike the SM warp arrays in GPUs), and then ran a whole bunch of SMT threads against that ultra-wide core? (At current standards, that would be 16 SMT threads at a time, though it could be more in the future.)
 

NTMBK

Lifer
Nov 14, 2011
10,400
5,635
136
With the E cores enabled you won't just be running into TLP limits, you will be hitting bottlenecks in the memory subsystem. That is probably affecting your scaling.

If you want to test SMT scaling in isolation, I would suggest pinning your workload to a single core through affinity masks and seeing how it compares with 1 thread vs. 2 threads. That way almost the entire memory subsystem is feeding just 1 core, reducing that bottleneck.
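Something along these lines would do it (a sketch using the psutil package; whether logical CPUs 0 and 1 really are SMT siblings of the same physical core depends on your topology, so check that first):

```python
import subprocess
import psutil  # pip install psutil

def run_pinned(cpus, cmd):
    """Launch cmd, pin it to the given logical CPUs, and wait for it.
    Affinity is set right after launch, which is fine for a long encode."""
    proc = subprocess.Popen(cmd)
    psutil.Process(proc.pid).cpu_affinity(cpus)
    proc.wait()

encode = ["ffmpeg", "-y", "-i", "source.mkv", "-c:v", "libx265", "out.mkv"]
run_pinned([0], encode)     # one logical CPU: SMT can't help
run_pinned([0, 1], encode)  # both siblings of one core: pure SMT scaling
```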
 
  • Like
Reactions: Carfax83

Vattila

Senior member
Oct 22, 2004
817
1,450
136
Also, to test SMT in a best-case scenario you want to look at workloads that have a lot of stalls (memory, unpredictable branches), i.e. workloads that underutilise the execution resources of the core. I think databases, web apps, business applications and content serving are good examples. The point of SMT is that a second thread (or more) in such workloads can utilise the free execution slots. On the other hand, workloads such as media encoding and decoding are optimised to the hilt to utilise the core in an optimal manner and with as few stalls as possible, e.g. by filling all the SIMD units and ensuring they are fed with data every clock cycle, as far as possible. This leaves little opportunity for SMT to benefit.
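A back-of-the-envelope way to see it (a toy model with made-up utilisation figures, not measurements):

```python
def smt_speedup(util: float) -> float:
    """util = fraction of issue slots one thread fills on its own."""
    return min(1.0, 2 * util) / util  # two threads share the same slots

print(f"stall-heavy workload (45% of slots used): {smt_speedup(0.45):.2f}x")
print(f"well-tuned encoder   (85% of slots used): {smt_speedup(0.85):.2f}x")
```

The stall-heavy case approaches 2x, while the tuned encoder tops out around 1.18x, which is in the same ballpark as the gains reported earlier in this thread.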


 
Last edited:

TheELF

Diamond Member
Dec 22, 2012
4,027
753
126
One thing that I've wondered about is the partitioning for SMT.

What if, instead of perhaps 8 cores with 6-wide execution pipelines and two SMT threads per core, they widened it out to 8x6=48 execution units in ONE "hugecore" CPU (not unlike the SM warp arrays in GPUs), and then ran a whole bunch of SMT threads against that ultra-wide core? (At current standards, that would be 16 SMT threads at a time, though it could be more in the future.)
They could probably do that, but I think it would be horrible for yields. Also, you would have to feed this whole thing with power even when you're just looking at an empty desktop, so it would also be horrible for power consumption at idle/low utilization.
 

yuri69

Senior member
Jul 16, 2013
637
1,110
136
One thing that I've wondered about is the partitioning for SMT.

What if, instead of perhaps 8 cores with 6-wide execution pipelines and two SMT threads per core, they widened it out to 8x6=48 execution units in ONE "hugecore" CPU (not unlike the SM warp arrays in GPUs), and then ran a whole bunch of SMT threads against that ultra-wide core? (At current standards, that would be 16 SMT threads at a time, though it could be more in the future.)
Widening stuff is not easy. You'd need to effectively feed those 48 execution units. Each of them needs to be connected to a source (register file port), produce outputs somewhere, and feed other units via a forwarding network. The complexity rises quite fast: a multi-ported register file, storing and forwarding networks interconnecting the units each-to-each, etc.
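To put rough numbers on it (a toy count that ignores the clustering and banking tricks real designs use to dodge exactly this):

```python
# Each-to-each forwarding between N units needs ~N*(N-1) paths; register
# file ports grow linearly with N (2 reads + 1 write per unit here).
for n in (6, 12, 24, 48):
    print(f"{n:2d} units: ~{n * (n - 1):4d} forwarding paths, "
          f"~{3 * n:3d} register file ports")
```

Going from 6 to 48 units takes the forwarding network from ~30 paths to ~2256, which is why nobody builds it that way.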
 

Carfax83

Diamond Member
Nov 1, 2010
6,841
1,536
136
No, it goes by thread priority: with equal priority, each thread gets the same resources; if one has a much higher priority than the other, it will get much more of the resources.
You can use Process Hacker to see what priority any thread runs at, and even change priorities on individual threads to watch it change in real time.

If you read the article I linked in my OP, the author claims that Intel is using a watermarking scheme which determines how many resources each thread is supposed to get, and one thread can get a lot more than its sibling depending on the workload. This is different from how SMT used to behave in previous architectures, where there was a more balanced approach to how the registers were apportioned.

It seemingly has nothing to do with thread priority. I wouldn't be surprised if the thread manager plays a role, however, as the scheme seems hardware-based.
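My mental model of the watermark idea is something like this (a toy sketch of the concept as I read the article, definitely not Intel's actual logic; the numbers are made up):

```python
PHYS_REGS = 280  # physical register file size, made-up figure

def can_allocate(used_by_thread, used_total, active_threads):
    """A thread allocates freely until it hits a per-thread watermark."""
    watermark = PHYS_REGS if active_threads == 1 else int(PHYS_REGS * 0.75)
    return used_total < PHYS_REGS and used_by_thread < watermark

print(can_allocate(200, 230, 2))  # True: 200 is under the 210 watermark
print(can_allocate(215, 230, 2))  # False: the sibling keeps its share
```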

Software issues as many threads as there are available cores. If you enable HT, you expose 8 more logical cores, so the software launches 8 more threads, and so there is more TLP.

But if that's the case, why does scaling not continue on to infinity? Some applications can scale with far more threads than others. Heck, even within a single application, some parts may scale with more threads while others are the exact opposite. Codecs are often like that, which is why CPUs still offer the best combination of quality and file size: CPUs are very strong in single-threaded workloads.
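That intuition is basically Amdahl's law: even a small serial fraction caps the scaling no matter how many threads you throw at it. A quick illustration:

```python
def speedup(parallel_fraction: float, threads: int) -> float:
    """Amdahl's law: serial work limits the achievable speedup."""
    return 1.0 / ((1.0 - parallel_fraction) + parallel_fraction / threads)

for n in (8, 16, 32, 1000):
    print(f"{n:4d} threads: {speedup(0.95, n):5.2f}x (95% parallel work)")
```

Even at 95% parallel, a thousand threads only gets you to about 19.6x; the serial 5% caps it at 20x.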
 

Exist50

Platinum Member
Aug 18, 2016
2,452
3,105
136
Also, SMT may not directly affect ST performance, but it apparently does so indirectly by lowering the amount of resources available to each thread. This seems contradictory to me, but I suppose it's similar to how having a smaller cache means having a faster cache.
The sharing of resources will hurt each individual thread as compared to SMT-off mode. You don't get any of the kind of capacity-latency tradeoffs that you see with caches.
If I had to guess, some machine learning is involved in how the core decides how many resources to apportion to the main thread and to the assist thread.
Nah, it's probably something a lot more primitive than that.
 

Exist50

Platinum Member
Aug 18, 2016
2,452
3,105
136
One thing that I've wondered about is the partitioning for SMT.

What if, instead of perhaps 8 cores with 6-wide execution pipelines and two SMT threads per core, they widened it out to 8x6=48 execution units in ONE "hugecore" CPU (not unlike the SM warp arrays in GPUs), and then ran a whole bunch of SMT threads against that ultra-wide core? (At current standards, that would be 16 SMT threads at a time, though it could be more in the future.)
I have my doubts about the plausibility of such a scheme, but just rolling with the idea for a sec, I wonder what would happen if you took a clustered arch like Gracemont and alternated the decoding in such a way that you have one thread per cluster. I think that's kinda what IBM does for some of their Power CPUs? Been a while since I looked into those.
Also, you would have to feed this whole thing with power even when you're just looking at an empty desktop, so it would also be horrible for power consumption at idle/low utilization.
Playing devil's advocate: in low-ILP scenarios, perhaps you could dynamically adjust the width by power gating the unused portions. @yuri69's concerns are probably harder to solve, though.
 

Carfax83

Diamond Member
Nov 1, 2010
6,841
1,536
136
With the E cores enabled you won't just be running into TLP limits, you will be hitting bottlenecks in the memory subsystem. That is probably affecting your scaling.

If you want to test SMT scaling in isolation, I would suggest pinning your workload to a single core through affinity masks and seeing how it compares with 1 thread vs. 2 threads. That way almost the entire memory subsystem is feeding just 1 core, reducing that bottleneck.

Thanks for the suggestion. I don't know if I'll do it though just due to the amount of time it would take to run an encoding workload on a single core.

I also tested x265, and the performance increase was exactly the same as x264: 17%. I'm intrigued by this, because I figured x265 would yield a bigger boost, as x265 is more parallel than x264 in terms of ILP and TLP. This behavior with HT is completely different from my experience with previous Intel architectures.

Architectures like Sandy Bridge, Haswell, etc. saw larger boosts from HT, but could also see bigger losses due to the way the registers were apportioned between the threads. It seems that Golden/Raptor Cove moves away from that paradigm and prioritizes overall throughput and utilization of the core instead of attempting to balance thread load, which boosts performance in more scenarios while limiting performance losses. I surmise the trade-off is that HT doesn't increase performance as much as it once did, but it is much more consistent in its performance characteristics under more diverse conditions.
 

Carfax83

Diamond Member
Nov 1, 2010
6,841
1,536
136
The sharing of resources will hurt each individual thread as compared to SMT-off mode. You don't get any of the kind of capacity-latency tradeoffs that you see with caches.

It seems the author was alluding to the idea that the main thread would be limited, up to a certain point, in how many registers it can access. I suppose this prevents thread clashing, something that used to happen a lot with older CPUs.

Nah, it's probably something a lot more primitive than that.

I figured it might be machine-learning based, as Intel has already stated that they used machine learning for the L3 cache inclusive/non-inclusive algorithm.
 

JustViewing

Senior member
Aug 17, 2022
267
470
106
Continuing from the other thread: as I understood it, the argument against HT was that
1. It needs additional transistor budget to support HT
2. If not for HT, these additional transistors could be used to improve single thread performance by 5%
3. Any loss in multi thread performance can be compensated by adding more E-Cores

But I don't think you can simply increase performance by 5%; otherwise AMD/Intel would have already integrated these changes into their designs. Additional logic could also negatively impact performance through extra pipeline stages, additional heat, reduced clock speed, etc. The bottom line is that it is not easy to increase IPC.

Intel's P-core is relatively very large for its actual performance. AMD's version of an E-core would likely be only half the size of a Zen 4 core, so AMD won't have the same level of advantage with E-cores.

In my opinion, since the R&D on HT is already done and it is giving additional MT performance (0-85% in my personal experience), there is no need to remove it.

Regarding E-cores and AMD: AMD already tried the E-core concept and failed very badly with Bulldozer. But who knows how small an Excavator module on 5nm would be; they could add a 64-core Excavator chiplet to Zen 4 :p .
 
  • Like
Reactions: Carfax83

Mopetar

Diamond Member
Jan 31, 2011
8,349
7,418
136
Continuing from the other thread: as I understood it, the argument against HT was that

2. If not for HT, these additional transistors could be used to improve single thread performance by 5%

I've never heard the mechanism by which this supposed 5% uplift would occur if the relatively small number of transistors used to add SMT were applied to something else.

I think people are just assuming that taking those transistors and having them do something else would allow for an extra 5% performance. I strongly doubt this is the case; the hardware resources needed to enable SMT aren't that extensive.

Hyper-threading is relatively cheap compared to most other hardware. Really all that you need at a minimum level is a separate program counter, register file, and anything else that stores the execution state of the thread. You might also want to duplicate the TLB, but that's not strictly necessary.
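Sketching that split (illustrative lists only, not any specific core):

```python
from dataclasses import dataclass, field

@dataclass
class SmtThreadState:
    """Cheap state duplicated per SMT thread."""
    program_counter: int = 0
    arch_registers: list = field(default_factory=lambda: [0] * 16)
    # optionally a private TLB / rename-map snapshot, etc.

# Shared between both threads (the expensive machinery that out-of-order
# execution already required): execution units, schedulers, caches, ROB.
```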

Multiple execution units, register renaming, and the complex logic to enable out of order execution all existed prior to the addition of SMT. Designers realized it was a low cost method of adding a lot of extra performance in many cases without requiring a lot of added logic on top of what already exists.

Depending on how the core is designed internally, the SMT might effectively be free if a massive register file is allowed to be split across different threads and for resources to be provisioned and reclaimed as necessary.

If transistors were reclaimed from SMT, they would most likely be allocated to additional smaller, simpler cores which excel at a different type of workload. Otherwise, at best you might be able to use the transistors to get a small performance bump in niche workloads which were previously bottlenecked, but a 5% performance gain across the board is dubious. If that 5% could be achieved with so few transistors, it would be done without removing SMT; something else would get cut first.
 

Exist50

Platinum Member
Aug 18, 2016
2,452
3,105
136
But I don't think you can simply increase performance by 5%; otherwise AMD/Intel would have already integrated these changes into their designs.
Every single new architecture has a list of changes that someone wanted to make, but didn't have the time/resources to do.