HT owes its existence, and its performance, to living in a bubble created by pipeline inefficiencies, which are themselves created by coding inefficiencies.
Those programmed inefficiencies are forever hard-coded into legacy programs unless the source code is revisited and recompiled, an unlikely outcome.
There are plenty of examples of software that has been multi-threaded for decades. POSIX threads and Win32 threads are both common examples that came about in the early-to-mid 90s.
It is an issue of optimization. You can't optimize for every case. A system optimized for a single application may not scale as you add other applications. And you can't benchmark your system for one model, and then claim the optimization (HT) doesn't work for other models. A lot of the HT detractors benchmark a single application, see that HT performs worse, and mistakenly apply that to a multi-app system.
Let's take a single program as an example. I have control of the source code and can modify the implementation as I see fit. Context switching is optional: I can implement it single-threaded or I can divide the work. The penalty is that a context switch costs something. If my code isn't constrained by memory or cache stalls (take a simple tight loop), it certainly won't perform better when broken into two pieces unless those pieces can truly run in parallel on physical cores. HT's false logical cores don't help here; they hurt.
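To put rough numbers on that trade-off, here is a toy cost model in Python. It is only a sketch: the function name `split_time`, the "waves" scheduling assumption, and the overhead value are all made up for illustration, and it ignores everything a real scheduler or pipeline does. It treats a tight loop on HT logical cores as having one effective execution unit, which is the assumption argued above, not a measured fact.

```python
import math


def split_time(work, pieces, parallel_units, switch_cost):
    """Toy estimate of run time for `work` units divided into `pieces`.

    Assumptions (illustrative only):
    - pieces are equal in size and run in waves of `parallel_units`;
    - every piece beyond what can run at once costs one context switch.
    """
    per_piece = work / pieces
    waves = math.ceil(pieces / parallel_units)
    switches = max(0, pieces - parallel_units)
    return waves * per_piece + switches * switch_cost


# Single-threaded baseline: no split, no switch.
baseline = split_time(100, pieces=1, parallel_units=1, switch_cost=1.5)

# Split in two, but only one effective execution unit: pure overhead.
split_no_parallel = split_time(100, pieces=2, parallel_units=1, switch_cost=1.5)

# Split in two on two physical cores: the pieces really overlap.
split_two_cores = split_time(100, pieces=2, parallel_units=2, switch_cost=1.5)

print(baseline, split_no_parallel, split_two_cores)
```

In this model the split version only wins when the second execution unit is real; otherwise it pays the switch and gains nothing.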
Now, take a second model, the one of multiple programs. The context switch isn't optional; the OS must timeshare between them. That same program may run _better_ alongside other programs with HT enabled, because HT optimizes away some of that non-optional context switch.
If I distribute one build of my software, I'm probably better off with a multi-threaded build, accepting the 3% penalty seen on single-core systems, knowing that on multi-core there will be a 100%-300% benefit. Ideally, my program can detect the number of cores and the existence of HT at startup and adjust accordingly, or take the route of much other software and let the end user change a config file (Oracle is one of the best examples: extremely tunable, and extremely threaded if you choose to configure it that way).
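A minimal sketch of that startup check in Python, under some loud assumptions: the sibling detection reads a Linux sysfs file and simply reports "unknown" elsewhere, the helper names are mine, the `MYAPP_THREADS` variable is a hypothetical stand-in for a config-file setting, and "one worker per physical core when HT is present" is just one possible policy for CPU-bound work, not a rule.

```python
import os

# Linux-specific sysfs file listing the logical CPUs that share cpu0's core.
SIBLINGS_PATH = "/sys/devices/system/cpu/cpu0/topology/thread_siblings_list"


def logical_cpu_count():
    """Logical CPUs as the OS reports them (HT siblings included)."""
    return os.cpu_count() or 1


def smt_sibling_count(path=SIBLINGS_PATH):
    """Hardware threads per physical core, or None if it can't be determined."""
    try:
        with open(path) as f:
            text = f.read().strip()
        count = 0
        for part in text.split(","):  # entries look like "0,4" or "0-1"
            if "-" in part:
                low, high = part.split("-")
                count += int(high) - int(low) + 1
            elif part:
                count += 1
        return count or None
    except (OSError, ValueError):
        return None


def choose_worker_count(logical, siblings=None, override=None):
    """Pick a thread count; an explicit user override always wins."""
    if override is not None:
        return max(1, override)
    if siblings and siblings > 1:
        # Assumed policy: one worker per physical core for CPU-bound work.
        return max(1, logical // siblings)
    return max(1, logical)


if __name__ == "__main__":
    raw = os.environ.get("MYAPP_THREADS")  # hypothetical user-tunable setting
    override = int(raw) if raw and raw.isdigit() else None
    workers = choose_worker_count(logical_cpu_count(), smt_sibling_count(), override)
    print("worker threads:", workers)
```

Keeping the override path is the point of the config-file approach: the end user knows whether the box is running one program or many, and the program does not.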
Too many artificial benchmarks that claim HT "suffers" focus on the wrong thing. They analyze performance in the single-program scenario and expect that to apply to the multi-program scenario, or at least the readers of the benchmark make that mistake. It won't ever work that way. You can't use a single application as a benchmark for a multi-app system, or vice versa. But this is true for any optimization, not just HT.
As it stands, HT is really an optimization that requires the cooperation of hardware and software to work best. It isn't an optimization that can simply be dismissed, as it gives a measurable performance boost in the general case. But the fact that we can't assume a core is a core will always be a problem, and it amounts to technical dishonesty for the sake of marketing.