Interesting, I had not really thought much about it before, but in essence what you are saying is that one could view the throughput gained by hyperthreading as an indicator of just how much (frequency × duration) the architecture stalls on a given instruction mix.
I.e., to realize hyperthreading's performance "benefits," one is basically relying on the architecture being inherently unoptimized for certain instruction mixes; any architectural change that results in less stalling (and thus higher per-thread IPC) would actually result in lower thread-scaling performance and less advantage to implementing a hyperthreaded architecture.
I had always assumed (I never really worried about challenging that assumption until now) that hyperthreading was a simultaneous, not switched, sharing of the pipeline, with shared-resource contention being the cause of the lowered thread scaling.
But you are saying that if a program is well optimized to intentionally avoid incurring pipeline stalls, then thread scaling on an HT architecture actually suffers for it (not absolute performance: you lose thread-scaling performance but gain per-core performance).
So it could be said that a program which scales well in a hyperthreaded environment is just a really poorly coded/compiled program that incurs a plethora of pipeline stalls.
Or it could be said that the Nehalem/Westmere architecture is so poorly designed that, with some applications, the pipeline stalls are excessive to the point that implementing hyperthreading is about the only way to redeem the architecture from what would otherwise be a significant performance hit.
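To make my mental model concrete, here is a toy calculation (my own illustration, with entirely made-up numbers and a made-up `fill_efficiency` knob): the smaller the fraction of issue slots a thread loses to stalls, the less there is for a second hardware thread to reclaim.

```python
# Toy model, illustrative only: per-thread throughput as a fraction of
# peak issue capacity, with and without a second SMT thread.
def smt_uplift(stall_fraction, fill_efficiency=0.6):
    """stall_fraction: fraction of issue slots lost to stalls/misses.
    fill_efficiency: made-up knob for how many of those lost slots a
    second hardware thread can reclaim, net of SMT overhead."""
    single = 1.0 - stall_fraction                    # one thread alone
    smt = single + stall_fraction * fill_efficiency  # stalls partly filled
    return (smt - single) / single                   # relative uplift

for s in (0.4, 0.2, 0.05, 0.0):
    print(f"stall fraction {s:.0%}: SMT uplift {smt_uplift(s):+.1%}")
```

A stall-heavy mix (40% lost slots) shows a big uplift, a well-optimized one (5%) shows almost none, and at 0% stalls the uplift is exactly zero.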
Is this "view" correct?
I am far from being a software engineer, but I would say that you are generally correct.
The biggest takeaway that I get from my engineers is that the more efficient you make your system (i.e., the more you eliminate gaps in the processing chain), the less effective HT becomes.
One could argue (correctly) that we will probably never get to 0% cache misses, so theoretically there will always be a place for an SMT-type architecture. But look at the differences between client and server. SMT gives a bigger uplift on client because those apps are less threaded and *generally* less efficient; they were written for one big fast core instead of being written for multiple cores. Even in the world of single-core processors, server apps were multithreaded while client apps were not.
If you could guarantee an environment with 0% pipeline stalls and 0% cache misses, you'd kill SMT in a heartbeat. Why take the overhead when there is no benefit? Your pipelines would be full and a thread would never stall.
The only question to ask yourself is: do you believe people are working to make software more efficient or less efficient? If they are driving for higher efficiency, they are *probably* driving towards lower uplift from SMT over time.
Let's say a pipeline can handle 100 units of work at max. Because of things like cache misses and thread stalls you only get 80 units of work out of it. Now, with SMT you can load some additional work in during a stall, but there is a switching overhead (emptying/loading cache), so you get another 12 units of work out of it. 92 units out of a possible 100 units.
If you make your application and OS more efficient, you get an extra 6 units out of the pipeline. Then you make your cache larger and get another 4 units out. Now you are at 90 units. Because there is a physical limit of 100 units (only 10 units of headroom left) and there is some SMT overhead, maybe you get 6 units out of SMT. The good news is that you are at 96 units of work, so you are getting more done and are more efficient, but guess what? Your SMT uplift went from 15% (12 extra units on 80) to ~7% (6 extra units on 90).
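In case it helps, here is that arithmetic spelled out (Python used purely as a calculator; the numbers are the made-up ones above):

```python
# The made-up numbers from above, spelled out.
PEAK = 100               # physical limit of the pipeline, in "units"

base = 80                # what one thread gets after stalls and misses
smt_extra = 12           # extra units SMT squeezes in during stalls
print(f"before: {base} + {smt_extra} = {base + smt_extra} of {PEAK}, "
      f"SMT uplift {smt_extra / base:.0%}")

tuned = base + 6 + 4     # +6 from SW efficiency, +4 from a bigger cache
smt_extra_tuned = 6      # only 10 units of headroom left, minus overhead
print(f"after:  {tuned} + {smt_extra_tuned} = {tuned + smt_extra_tuned} of {PEAK}, "
      f"SMT uplift {smt_extra_tuned / tuned:.1%}")
```

which prints an uplift of 15% before and 6.7% after: the machine gets faster, and SMT matters less.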
Now, take that same 80-unit throughput and add a second core. 80 x 2 = 160 units. Even if you assume 90% scaling, you are at 152 units. THAT is how you increase throughput.
Take your 152-unit dual core and add the 4 units of cache and 6 units of SW enhancement. That adds 19 units (10 on the first core plus 9 on the second, assuming the same 90% scaling), or roughly 12% uplift, to a total of 171. You are getting more benefit out of both of those increases because each has its full impact. In the SMT world, those same enhancements drive the SMT benefit down, because you still have the same number of pipes.
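And the dual-core arithmetic the same way (same made-up units):

```python
# Dual-core version of the same made-up numbers.
base = 80                        # single-core throughput, in "units"
scaling = 0.90                   # assumed scaling on the second core

dual = base * (1 + scaling)      # 80 + 72 = 152 units
gain = 10 * (1 + scaling)        # +10 per core from cache/SW, 90% on core 2
print(f"dual core: {dual:.0f} units; enhancements add {gain:.0f} units "
      f"({gain / dual:.1%} uplift) -> {dual + gain:.0f} total")
```

Here both enhancements land at (nearly) full value on both cores, instead of eating into the SMT headroom.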
So the key is that more cores will give you bigger gains as you make architectural changes towards better efficiency. In an SMT environment, you are banking on inefficiency to make two threads work on one set of pipelines.
Before anyone starts the flames, all of these numbers are made up and for comparison only. But you should get the idea.