Is any other operating system's scheduler going to do better with an SMT CPU? Also, if you try moving threads onto occupied cores, you have the issue of what happens when the "big" thread runs a stretch of code that can utilize all available execution resources (AVX2 or what have you). Now you have the scheduler trying to put a second thread on the core when there are no pipeline stalls or other obvious "gaps" where the second thread can execute. Now the scheduler has to move that thread to another core entirely, which is probably why "mindless and braindead" schedulers pick physical cores over logical cores first. Or at least one reason why.
You have exactly the same issue with big.LITTLE. If a scheduler is theoretically capable of detecting high-utilization threads and moving them from little to big cores, it's also theoretically capable of moving them from SMT-shared to dedicated physical cores. It's all a software problem.
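The placement logic both posts describe can be sketched as a toy heuristic (the topology table and function names are made up for illustration, not any real scheduler's code): prefer an idle physical core whose SMT sibling is also idle, and only double threads up on a shared core when no fully idle core is left.

```python
# Toy sketch of "physical cores before logical cores" placement.
# Each entry is one hardware thread; "sibling" is its SMT partner.

def pick_core(cores):
    """Pick a hardware thread for a new runnable thread."""
    busy = {c["id"] for c in cores if c["busy"]}
    # 1) Prefer an idle thread whose SMT sibling is also idle,
    #    i.e. a fully idle physical core.
    for c in cores:
        if c["id"] not in busy and c["sibling"] not in busy:
            return c["id"]
    # 2) Otherwise accept an SMT slot on an occupied core.
    for c in cores:
        if c["id"] not in busy:
            return c["id"]
    return None  # everything is occupied

topology = [
    {"id": 0, "sibling": 4, "busy": True},   # busy "big" thread here
    {"id": 1, "sibling": 5, "busy": False},
    {"id": 4, "sibling": 0, "busy": False},  # idle, but shares core 0
    {"id": 5, "sibling": 1, "busy": False},
]
print(pick_core(topology))  # -> 1 (fully idle core), not 4 (SMT slot)
```

The big.LITTLE analogue would be the same function with "little core" in place of "SMT slot" as the fallback tier, which is why it's fair to call both a software problem.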
But we are also talking about AMD. Their server core design will be present in all of their products, at least until they grow to the point where they want to maintain separate core designs. I see no clear indicator that AMD will even consider such a strategy on any of their roadmaps.
Did you see AMD going with SMT2 before they announced it? Did anybody see that very first implementation beating Intel's HT?
Do we want SMT4 on the desktop?
That's completely beside the point. Does the majority of desktop users need AVX2? Most very likely do not.
In a server, it's realistic to believe that most of a CPU's resources will be committed most of the time (if not all of the time).
That's actually wrong unless you are talking specifically about HPC. Servers in general are all about over-provisioning all kinds of resources, being prepared for worst-case resource usage scenarios.
So we don't worry so much about when and how a scheduler wakes up a particular core.
Patently wrong. The more cores a chip contains in one shared envelope, the more the cores' activity will affect each other. The more cores can be put into a deep sleep state, the more headroom the remaining cores can make use of. And as we know, AMD built Zen's Precision Boost microcode to dynamically exploit whatever headroom is available, so it already profits from that.
Zen2 is heading for laptops in Renoir. Presumably, Zen3 will follow the same circuitous path. Does AMD want SMT4 in laptops? I don't think we should rationally consider it possible (or plausible) that AMD will emulate big.LITTLE or DynamIQ in their core designs, but you have to admit, if they did, it would ease the transition to low-end computing devices, far more so than adoption of SMT4 would. Realistically speaking, I think AMD will avoid any change away from SMT2 in the near future. They will keep selling more of the same since it works.
But in the last two years AMD did the opposite of "selling more of the same since it works". Zen to Zen 2 completely changed the MCM topology. SMT is still very new to AMD, having been introduced only two years ago. Software support didn't prevent AMD from launching either the Ryzen or the Threadripper chips. The Windows scheduler had serious issues with TR 1's NUMA, then again with TR 2 WX's unbalanced NUMA.
There's also the issue of SMT and VMs. A lot of cloud vendors just disable SMT/HT right out of the gate. AMD has every intention of selling hardware to them, and I do not think that SMT4 will be a big selling point for those buyers. I also question whether a DynamIQ-style asynchronous core arrangement would be useful, since it would complicate the allocation of bare metal assets during creation of a VM.
What is this "allocation of bare metal assets during creation of a VM" you are speaking of? Resource allocation can be changed even after the creation of a VM, just as you can change PC hardware after installing an OS. That again is purely a software issue.
And disabling SMT/HT for cloud providers comes from them offering resources per single vCPU: you don't want that vCPU's performance to be a variable that depends on how many concurrent threads share its core. But that doesn't prevent providers from offering computing resources per CCX (or comparable big.LITTLE blocks) instead, where SMT could be left enabled.
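A per-CCX offering like the one suggested above could be sketched roughly as follows (the CCX size, the contiguous thread numbering, and all function names are illustrative assumptions, not any provider's real API): because a whole CCX goes to one tenant, no two VMs ever share a physical core, so SMT can stay enabled without one tenant's load varying another tenant's vCPU performance.

```python
# Toy sketch: hand out whole CCX blocks (all their SMT threads)
# instead of individual vCPUs.

CCX_SIZE = 4          # physical cores per CCX on Zen/Zen 2
THREADS_PER_CORE = 2  # SMT2 left enabled

def ccx_threads(ccx):
    """Hardware threads of one CCX, assuming each CCX's threads are
    numbered contiguously (an illustrative assumption)."""
    n = CCX_SIZE * THREADS_PER_CORE
    return list(range(ccx * n, ccx * n + n))

def allocate_ccx(free_ccxs, vm_id, allocations):
    """Pin a whole CCX to one VM; SMT stays on because no two
    tenants ever share a physical core."""
    if not free_ccxs:
        raise RuntimeError("no free CCX available")
    ccx = free_ccxs.pop(0)
    allocations[vm_id] = ccx_threads(ccx)
    return allocations[vm_id]

free, allocs = [0, 1], {}
print(allocate_ccx(free, "vm-a", allocs))  # -> [0, 1, 2, 3, 4, 5, 6, 7]
```

The same shape would work for a big.LITTLE cluster: swap the CCX for a DSU cluster and the argument is unchanged.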
I think the answer is c). None of the above. AMD simply doesn't have little cores available to use, so being the frugal sorts that they are, they'll just punt on that question and add more of the same SMT2 cores they already have (with planned updates).
You yourself were arguing for the cat cores before.
Not entirely true. Some of those challenges are unique to Infinity Fabric.
...which is part of the uncore and offers intra chip connectivity that one always needs on any chip...
Others are unique to AMD's CCX design. The mobile SoCs can easily gate off lower-level caches since they are not shared (I think the standard DynamIQ design calls for a shared L3). So can pretty much anyone else.
And Zen cores can power gate everything except the shared L3$. (I think I remember the APUs can even power gate the L3$ itself, since it's not shared due to their single-CCX nature, but I'm not sure.)
ARM's DSU has some interesting additional features though, like being able to gate off part or all of a cluster's L3 cache depending on load:
That's indeed a good area for further improvement for AMD. (Also finding a way to make the shared L3$ globally writable instead of just local slices per core. Making better use of that massive L3$ should give a good performance boost.)
But that's again about the cores, which are already plenty optimized for power efficiency as is. The uncore is where most further power-efficiency optimizations can be made.
And putting other threads onto a core that already runs a high-priority thread, instead of onto idle cores, is just stupid, as it will slow down that high-priority thread.
Is that what the Windows scheduler does?