Question On the virtues of SMT, or lack thereof

Carfax83 · Dec 29, 2022

It's been a while since we've had a dedicated SMT thread where we can debate the pros and cons and ask ourselves and each other whether Intel and AMD should ditch the technology or keep it......and also whether ARM CPUs should adopt it.

Having owned many SMT capable CPUs, I can definitely say one thing. The introduction of efficiency cores has definitely lessened the impact of that technology. I ran a test in another thread where I transcoded a 4K60 FPS video to x265, and logged the difference between having HT on and off. HT on yielded 8.37% performance increase if I recall correctly over not having HT. Power usage was slightly more with HT as well as temps, but it wasn't a huge difference.

At first I was a bit surprised, given the fact that on my previously owned SMT capable CPUs (ranging from Nehalem all the way to Broadwell-E), the HT advantage was much greater in encoding workloads. It was always double digits, as encoding typically has both high TLP and ILP. Raptor Lake was the first CPU I've ever tested in encoding that had a single digit performance increase for HT enabled. But obviously, those previous CPUs that I owned didn't have 16 efficiency cores either.

So the efficiency cores are definitely sucking up a lot of TLP in those workloads. Which begs the question, is SMT now worth keeping or should Intel (and AMD should they ever implement efficiency cores) ditch SMT completely in favor of these efficiency cores?

Honestly, I am leaning strongly towards having SMT, but not because I believe it necessarily increases multithreading performance significantly. I've been doing some research, and one interesting tidbit I came across was from a recently released Chips and Cheese article convinced me of the virtues of SMT:

Golden Cove’s Lopsided Vector Register File – Chips and Cheese

Modern high performance CPUs from both Intel and AMD use SMT, where a core can run multiple threads to make more efficient use of various core resources. Ironically, a large motivation behind SMT is likely the need to improve single threaded performance. Doing so involves targeting higher performance per clock with wider and deeper cores. But scaling width and reordering capacity runs into increasingly diminishing returns. SMT is a way to counter those diminishing returns by giving each thread fewer resources that it can make better use of. SMT also introduces complexity because resources may have to be distributed between multiple active threads.

I definitely agree with the author's assessment here and it supports the performance characteristics I saw in my encoding test with HT on and off. HT/SMT is no longer just about increasing multithreaded performance. It's also about increasing single threaded performance. Case in point, my 13900KF saw a 8.37% gain in performance just by switching on HT. Does this mean that there was some TLP left that the 16 efficiency cores didn't tap into? Perhaps......but I doubt it. The task manager showed all 32 threads on my system at 100% capacity, as 4K transcoding is very compute intensive. After reading the Chips and Cheese article, what I think happened now is that HT enabled the performance cores to increase throughput and efficiency and better utilize the P cores. That's why the gain was much smaller than in the past, because with the efficiency cores now eating up a lot of the TLP, SMT is now primarily about increasing overall throughput in the core irrespective of whether it's a single threaded or multithreaded application.

This is because of the lopsided vector register file structure. Apparently, this makes it easier for the cores to dynamically adapt to high TLP or low TLP workloads without negatively impacting performance. It seems it's kind of like having your cake and eating it too. Now if I had turned off the efficiency cores, the HT impact would have been much larger I suspect due to more TLP being available so the second thread would have been allowed more resources.

The author states that this approach is not only more performant, but more die space efficient as well. So with that said, I declare the SMT debate to be over with, in favor of SMT

OK I'm sure there will be plenty of dissent. But this to me is an indication that SMT is not what it used to be. It has evolved and is now much more adaptive to the workload.

This merits it being kept around in my opinion.

NostaSeronx · Dec 31, 2022

JustViewing said:
Regarding E-Core and AMD, AMD already tried E-Core concept and failed very badly with Bulldozer. But who knows how small a Excavator module on 5nm would be, they could add 64 core Excavator chiplet to Zen4 .

Bulldozer(K10.0) through Excavator(K10.7) were never E-cores. They were P-cores meaning HPC/Server optimized, for maximum frequency given that it had to scale to mobile. The E-core concept was Bobcat(+successor Jaguar) and early concept Bulldozer(before it was dropped to adopt the Glew(K10) -> Butler(K10) design).

The O.G. core for Bulldozer was Witt(Clustered-core Design w/ Duplicated FPU/Single-threaded only) -> Moore(Clustered-core Design-Shared FPU/Cluster-based Multithreading(ST or MT)) design.

Bobcat design => 1/3rd the K8 Hammer architecture on the same node.
Bulldozer (Original) design => 2/3rd the K8H Greyhound on the same node.

32nm
Bulldozer(Butler) => ~20 mm2 <-- P-core [Agroup: 10% of AMD's market late 2000s and early 2010s]
Greyhound[Husky] => ~9.7 mm2 <-- P-core [Agroup: 10% of AMD's market late 2000s and early 2010s]
Bulldozer(Moore) => ~6.4 mm2 <-- E-core Max [Bgroup: Geode&Sempron Pro and higher :: 70% of AMD's market in late 2000s and early 2010s]
Bobcat(Burgress) => less than 3.2 mm2 <-- E-core Min [Cgroup: Geode&Sempron :: 20% of AMD's market in late 2000s and early 2010s]

K10/Bulldozer (2010-32nm (Hi-Perf design)) => High-frequency in Server(Interlagos=8 modules) -> High-frequency in Mobile(Trinity2=1 module)
// 2+2*2 + 4 , High Max Freq
Bulldozer (2008-45nm (Low-Pow design)) => High-core count in Server(Sandtiger=16 cores) -> High-core count in Mobile(Falcon=3 cores)
// 2*2 + 3 + 4, Low Max Freq, 45nm GHz target was Agena's clocks => 1800 to 2600 MHz. This is paired with a low Vdd target range which didn't pop-up till Jaguar in 2013.

Bulldozer LP Team(the core we never saw) and the Superscalar<->SIMD Grid core team is suppose to be getting a revival and combining these techniques.
Gen1 ULP => Late 2018+(start of project+), LP 12FDX(84CPP/56Mx)
Gen2 ULP => Early 2020+(start of project+), HP 12FDX(64CPP/56Mx)

AMD's Grid is basically derived from this:

US6944744B2 - Apparatus and method for independently schedulable functional units with issue lock mechanism in a processor - Google Patents

A functional unit of a processor may be configured to operate on instructions as either a single, wide functional unit or as multiple, independent narrower units. For example, an execution unit may be scheduled to execute an instruction as a single double-wide execution unit or as two...

patents.google.com

Do to the SIMD aspect, might functionally look like this(rotated?):

SIMD Column 1/2 => One Cluster(fp:MUL+ADD, Int:Everything)
SIMD Column 3/4 => Another Cluster(fp:MUL+ADD, Int:Everything)
SIMD Column 5 => Memory Cluster(mem:Load/Store)

It isn't exactly the GRID(image above) or TRIPS(pdf below) implementation though:

https://www.cs.toronto.edu/~pekhimenko/courses/csc2224-f19/docs/TRIPS.pdf

GRID/TRIPS => Expects a large die to use, AMD's design => Expects a small die to use.

O.G. Bulldozer is pure SISD, while ULP-successor to BD is pure SIMD with grid-like scheduler; patent given: (The scheduler may be configured to independently schedule instructions (e.g. SISD instructions) to separate ones of the functional unit portions and atomically schedule other instructions (e.g. SIMD instructions) to the functional unit with the functional unit portions locked together.)

VALU0/1=c0 VALU2/3=c1 => 4x128-bit SIMD or 8x64-bit SISD FUs for General Purpose/Integer for one/two threads.

CMP scaling per thread => 1.7x // +70%
SMT scaling for one extra thread => 1.3x // +30%
Chip-level Multithreading => 1.55x // +55% (K10/Bulldozer // Glew-derived core)
Cluster-based Multithreading => 1.8x // +80% (Bulldozer // Witt-derived core)

2H2023+
Geode(Dual-core/two-processor FN1 socket) => 0.2V-0.7V Vdd Range // <1W-3W
Sempron(Quad-core/four-processor FT6 socket) => 0.4-0.9V Vdd Range // 2.5W-6W

DrMrLordX · Dec 31, 2022

Exist50 said:
Every single new architecture has a list of changes that someone wanted to make, but didn't have the time/resources to do.

There's no evidence that continuing to support SMT causes that potential problem.

Exist50 · Dec 31, 2022

DrMrLordX said:
There's no evidence that continuing to support SMT causes that potential problem.

Unless you assume that supporting SMT costs zero effort, then *something* is being traded off for it. How much, is the question.

DrMrLordX · Dec 31, 2022

Exist50 said:
Unless you assume that supporting SMT costs zero effort, then *something* is being traded off for it. How much, is the question.

Nah, there's no guarantee that anything is being traded off, at all.

Vattila · Dec 31, 2022

Interestingly, the RISC-V specification explicitly talks about hardware threads, or "harts" as they are called:

"The privileged instruction set specification explicitly defines hardware threads, or harts. Multiple hardware threads are a common practice in more-capable computers. When one thread is stalled, waiting for memory, others can often proceed. Hardware threads can help make better use of the large number of registers and execution units in fast out-of-order CPUs. Finally, hardware threads can be a simple, powerful way to handle interrupts: No saving or restoring of registers is required, simply executing a different hardware thread. However, the only hardware thread required in a RISC-V computer is thread zero."

Wikipedia

Carfax83 · Jan 1, 2023

Spider-man remastered on a 9900K based system with an RTX 3080 has terrible FPS drops with HT enabled. Golden/Raptor Cove CPUs however don't have this issue. I have been playing this game with HT enabled the entire time and I haven't seen any of these problems with hyperthreading. Hyperthreading on or off for me makes practically no difference.

Just goes to show what I said about earlier incarnations of HT is probably accurate in that equalizing the resources between both threads can yield higher performance in high TLP workloads, but suffers when there isn't as much TLP. Golden/Raptor Cove's SMT approach doesn't potentially have as high a performance increase as previous architectures but is much more capable of increasing performance in workloads with less TLP as well as preventing performance loss.

9900K system in Spider-Man Remastered:

13900K system (not mine, just pulled it off YouTube):

lopri · Jan 1, 2023

Why does 9900K drops frames with HT enabled? HT had been around way before 9900K's time and I see no reason why that has to be the case. Does 9900K NOT drop frames when HT is disabled?

DrMrLordX · Jan 1, 2023

It's possible that Spider Man: Remastered isn't placing any threads on logical cores on the 12900k/13900k.

Mopetar · Jan 1, 2023

Exist50 said:
Unless you assume that supporting SMT costs zero effort, then *something* is being traded off for it. How much, is the question.

Unless you're changing the design of your SMT (e.g. enabling more threads per core) or dynamic scheduling for micro-ops, not much will change generation to generation with SMT.

It's a bit like building a new type of more efficient hardware divider. You can just keep using the same design in subsequent generations. The only reason it might need to change or require support is if you want to overhaul pipeline stages.

It's certainly arguable that if a company were to spend a lot of time each generation tweaking or tuning it, that there's an opportunity cost incurred as time could be spent elsewhere, but SMT doesn't really need this.

Given that we don't see AMD or Intel touting improvements to their SMT in newer generations, I think it's pretty fair to assume that neither company is spending a lot of time with it.

Exist50 · Jan 1, 2023

Mopetar said:
Unless you're changing the design of your SMT (e.g. enabling more threads per core) or dynamic scheduling for micro-ops, not much will change generation to generation with SMT.

I wouldn't be so sure. SMT is a very complex feature that surely incurs maintenance costs and needs to be accounted for in other feature work. If nothing else, we know from Chips & Cheese's GLC article that Intel has been making changes to their implementation, and then there's the big issue of the side-channel wack-a-mole Intel's been dealing with for the past few years.

Mopetar said:
Given that we don't see AMD or Intel touting improvements to their SMT in newer generations, I think it's pretty fair to assume that neither company is spending a lot of time with it.

Tbh, even in their deep dives, Intel and AMD don't cover many of the key details. Like, all the widths and sizes they talk about gloss over so much else. If making a better CPU arch was as simple as turning some knobs, these companies wouldn't employ hundreds of people to do it.

Carfax83 · Jan 2, 2023

lopri said:
Why does 9900K drops frames with HT enabled? HT had been around way before 9900K's time and I see no reason why that has to be the case. Does 9900K NOT drop frames when HT is disabled?

The video clearly shows that when HT is disabled, the performance increases significantly. As to why, it's anyone's guess. I suspect it's because Skylake's older HT implementation is attempting to balance the resources between each thread and is somehow failing.

Golden/Raptor Cove's updated HT doesn't have this issue as it does load balancing internally and so the sibling thread probably isn't getting hardly any resources at all. When I tested HT on and off in this game on my system, there was practically no difference.

DrMrLordX · Jan 2, 2023

Carfax83 said:
When I tested HT on and off in this game on my system, there was practically no difference.

Did the game even move threads onto the logical cores?

Carfax83 · Jan 2, 2023

DrMrLordX said:
Did the game even move threads onto the logical cores?

Yes. This game is actually very CPU heavy because of the BVH building and maintenance, plus decompressing the assets while Spider-Man is moving around the city. This is the first game that I have tested on my system which activated all 32 threads on my CPU.

What I think is happening with the 9900K system is that it can't properly equalize the load balance between the hardware and logical thread, which is causing a performance drop off when HT is enabled. Seeing stuff like this is probably why Intel decided to change their HT implementation.

I really want to see a HT on and off comparison with Zen 4 in this game.

JustViewing · Jan 2, 2023

Carfax83 said:
What I think is happening with the 9900K system is that it can't properly equalize the load balance between the hardware and logical thread, which is causing a performance drop off when HT is enabled. Seeing stuff like this is probably why Intel decided to change their HT implementation.

Isn't that the Job of the OS scheduler? Far as I know CPU doesn't load balance on itself. Seems like high-load task and low-load tasks are allocated to single core, which causes the high load task to drop performance. Another explanation could be, cache thrashing with 16 threads all trying to fight for L1/L2/L3.

5800X vs 5800X3D vs 5950X should clear the cache thrashing theory

Mopetar · Jan 2, 2023

Exist50 said:
I wouldn't be so sure. SMT is a very complex feature that surely incurs maintenance costs and needs to be accounted for in other feature work. If nothing else, we know from Chips & Cheese's GLC article that Intel has been making changes to their implementation, and then there's the big issue of the side-channel wack-a-mole Intel's been dealing with for the past few years.

If your implementation of anything has security flaws or is otherwise broken, then yes it requires more effort to fix. This is true of anything, but SMT doesn't need a lot of changes to function. You can certainly design them to have some awareness of SMT to try to improve performance further, but it isn't strictly necessary.

Here's a diagram for SMT in the original AMD Zen core. Note that most resources (red) are shared between threads and have no additional awareness of SMT at all. TLBs or other caches (light blue) also require some extra tag bits to be able differentiate threads, but this is minimal and doesn't really necessitate a change in the logic of how they function. A few (blue) are aware of SMT and have some added logic to help with prioritization since the OS may have a larger understanding of the context in which the threads are used. It's only the statically partitioned (green) elements that are physically duplicated because the logic of managing a shared resource probably requires at least as many transistors as duplicating the resource.

Tbh, even in their deep dives, Intel and AMD don't cover many of the key details. Like, all the widths and sizes they talk about gloss over so much else. If making a better CPU arch was as simple as turning some knobs, these companies wouldn't employ hundreds of people to do it.

Designing some sub-component in isolation may be easy. You could probably design a very simple adder after an introductory course. Designing more complex hardware that uses optimized approaches is going to be a lot harder, and verifying that the design doesn't have edge cases where it fails or unwanted side effects harder still. However, it's designing everything to work together that's incredibly hard. A new sub-component might be really great, but does it fit within the existing pipeline? Are other components effected by changes to introduce changes to something else? Are the improvements largely pointless due to a bottleneck somewhere else?

I suppose if you're doing something like SMT8 as some big iron POWER or Sparc designs did, then there's obviously more design considerations and more time building a chip like that, but for consumer x86 it's largely a bolt-on feature. The number of additional transistors needed for the duplicated hardware components is minimal.

nicalandia · Jan 2, 2023

SMT performance gains will depend on how the CPU is design, SMT4 on Xeon Phi did provide impressive performance when compared to SMT Off.

There is a ARM Core that has SMT but I think it's a little core

Arm Announces Neoverse N1 & E1 Platforms & CPUs: Enabling A Huge Jump In Infrastructure Performance

www.anandtech.com

Exist50 · Jan 2, 2023

Mopetar said:
You could probably design a very simple adder after an introductory course. Designing more complex hardware that uses optimized approaches is going to be a lot harder, and verifying that the design doesn't have edge cases where it fails or unwanted side effects harder still.

There's a lot of academic literature on fancy adder designs from the early days of computing. And these days formally proving their correctness is very common. But SMT is more difficult in that it's a collection of features that all need to work together, and as the side channel attacks have shown, it's difficult to understand all the potential side effects.

Mopetar · Jan 3, 2023

Exist50 said:
There's a lot of academic literature on fancy adder designs from the early days of computing. And these days formally proving their correctness is very common. But SMT is more difficult in that it's a collection of features that all need to work together, and as the side channel attacks have shown, it's difficult to understand all the potential side effects.

A lot of the side channel attacks have more to do with speculation and out of order execution. I don't even think SMT was mentioned in the original publication of the exploit, but I may be wrong. If I recall, it was later discovered that Intel had some additional issues related to hyper threading being vulnerable to a specific type of these attacks which is why they disabled it in the 9000 series CPUs since they either couldn't fix it with firmware or the penalty was worse than just turning off SMT.

But your core premise is still wrong. SMT isn't really a collection of features that all need to work together. The simplest implementations are just brain dead simple and let a CPU execute instructions from another thread when the main thread is stalled due to a cache miss. I don't think the earliest implementations had much beyond that since the operating system wasn't aware of virtual cores or capable of managing.

But for the implementations that AMD and Intel are using, the design isn't that complicated. Again, refer to the graphic of Zen's SMT implementation. Most of the hardware doesn't even know SMT is there and wouldn't be any different if it weren't included. The duplicate hardware is minimal, even the added tag bits or logic for the hardware that is SMT aware is minimal.

You won't get a magical performance gain if you told your engineers to leave it out, and use the leftover time and transistors for something else.

Exist50 · Jan 3, 2023

Mopetar said:
But your core premise is still wrong. SMT isn't really a collection of features that all need to work together. The simplest implementations are just brain dead simple and let a CPU execute instructions from another thread when the main thread is stalled due to a cache miss.

"Just" is carrying a lot of weight there. It's a simple concept, harder to implement, and harder still to secure.

Mopetar said:
But for the implementations that AMD and Intel are using, the design isn't that complicated. Again, refer to the graphic of Zen's SMT implementation. Most of the hardware doesn't even know SMT is there and wouldn't be any different if it weren't included.

Just because most of the hardware doesn't care doesn't mean it's a simple feature. You could say the same about the branch predictor, but those are extremely complicated.

Mopetar · Jan 3, 2023

Exist50 said:
"Just" is carrying a lot of weight there. It's a simple concept, harder to implement, and harder still to secure.

AMD didn't have any problems with their SMT implementation that I'm aware of, or at least they didn't have a problem when it came to the exploit that targeted SMT on Intel chips.

It seems like the underlying issue with the Intel chips once again came down to speculative execution more than anything.

Just because most of the hardware doesn't care doesn't mean it's a simple feature. You could say the same about the branch predictor, but those are extremely complicated.

The most basic SMT is literally duplicating any hardware that stores the execution state of the thread and adding logic so that when the pipeline would normally have to stall, it just switches to the other thread.

Sure you can make some pretty simple branch prediction as well with some simple static techniques and a two-bit prediction mechanism, but modern branch predictors are far more complex than x86 SMT implementations. The added logic to other components that are SMT aware is largely about prioritization.

If you have some alternatives sources or information that shows that the Intel/AMD SMT implementations are far more complex than I'm assuming then by all means please feel free to share them, but once again I don't believe the core premise that it requires a lot of time or transistors and is somehow preventing Intel from making a faster CPU. The evidence just isn't there. It's like magic Vega drivers. Some people just want to believe.

Carfax83 · Jan 3, 2023

JustViewing said:
Isn't that the Job of the OS scheduler? Far as I know CPU doesn't load balance on itself. Seems like high-load task and low-load tasks are allocated to single core, which causes the high load task to drop performance. Another explanation could be, cache thrashing with 16 threads all trying to fight for L1/L2/L3.

I'm not a programmer or engineer so I honestly don't know, but going by what Chips and Cheese said about hyperthreading in their article, it seems to me that the CPU itself has some method of determining how many registers are allocated between threads. Skylake apparently uses the older method of halving the resources perfectly between threads:

Golden Cove is not the first Intel CPU to use such a watermarking scheme for the vector register file. Ice Lake SP shows signs of this too, so this feature was probably introduced in the previous generation. Skylake seems to cut the vector register file perfectly in two. Anyway, yet another optimization Intel has done to increase die area efficiency and SMT yield, while guaranteeing some level of fairness between two threads sharing a core.

As for Zen 4, this is what he said:

With regards to SMT register allocation, AMD appears to competitively share the vector register file. There is no cut-in-half partitioning or watermarking. One thread can use a very large portion of the vector register file even when the second thread is active. The difference in reordering capacity when a second thread comes active but doesn’t use FP registers, is accounted for by having to store architectural state for the second thread. That’s 32 AVX-512 registers and eight MMX/x87 registers. Other than that, the first thread is free to use whatever it needs from the vector register file. Zen 2 appears to use a similar strategy, though of course fewer registers are reserved to hold architectural state because AVX2 only has 16 vector registers.

Zen 2, Zen 3 and Zen 4 shouldn't have the same weakness as they don't halve the registers.

5800X vs 5800X3D vs 5950X should clear the cache thrashing theory

I wasn't able to find this particular comparison, but I did find a 5800x3D YouTube video and there doesn't appear to be any of the performance drops that the 9900K system had. However, I don't know if SMT was disabled for this test, but the uploader didn't say anything about it so I assume it was left on. I think this particular issue with HT/SMT is restricted to older Intel CPUs that used the perfectly halved register allocation;

Rocket Lake appears to have the same problem:

Question On the virtues of SMT, or lack thereof

Diamond Member

Diamond Member

Lifer

Platinum Member

Lifer

Senior member

Diamond Member

Elite Member

Lifer

Diamond Member

Platinum Member

Diamond Member

Lifer

Diamond Member

Member

Diamond Member

Diamond Member

Platinum Member

Diamond Member

Platinum Member

Diamond Member

Diamond Member