It's not just utilizing time when the whole core is doing nothing, but utilizing time when even a single execution port would otherwise be doing nothing. Intel's current CPUs are not scheduling one thread when another is idle; they feed both threads to shared execution hardware at the same time (some parts, like the decoders, swap between threads, but they do that regardless of activity level). The parts that only service one thread per cycle tend not to be much of a throughput bottleneck (the decoders and renamers, for instance, are more of a latency and power bottleneck than a throughput one--they're made fast more so they can idle often than because lots of front-end bandwidth is needed all the time).
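If you want to see this sharing for yourself, here is a minimal sketch (Linux and g++ assumed, compile with -pthread) that pins two busy loops onto sibling hardware threads of one physical core; any speedup over running them back-to-back comes from SMT filling execution ports the other thread leaves idle. The logical CPU IDs 0 and 4 are hypothetical--check /sys/devices/system/cpu/cpu0/topology/thread_siblings_list for the real pair on your machine.

```cpp
#include <pthread.h>
#include <sched.h>
#include <thread>
#include <cstdio>

static void spin_alu() {                       // integer-heavy loop: keeps ALU ports busy
    volatile long x = 0;
    for (long i = 0; i < 200'000'000L; ++i) x += i;
}

static void pin(std::thread& t, int cpu) {     // bind a thread to one logical CPU
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(cpu, &set);
    pthread_setaffinity_np(t.native_handle(), sizeof(set), &set);
}

int main() {
    std::thread a(spin_alu), b(spin_alu);
    pin(a, 0);   // logical CPU 0 ...
    pin(b, 4);   // ... and its SMT sibling (hypothetical ID, check your topology)
    a.join();
    b.join();
    std::puts("done");
}
```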
As the CPU gets effectively wider and wider, SMT will get better and better, since a single thread can only use so much of it at any given time. And as the CPU gets wider, with larger buffers all around, the potential negative impact shrinks, because the core ends up with far more hardware than it needs to execute any one thread. At this point Intel has gotten to where, while HT itself may only use a very small part of each core, it is definitely what justifies making the other parts bigger and more complicated (dual-cores with HT are really popular), so it's become a self-reinforcing cycle (i.e., HT only takes up a few percent of the core, but you really need HT to make use of all the registers, execution ports, and instructions per port being added each generation).
If clock speeds had kept increasing, this would not be the case: each thread would simply run faster, so it would be worth chasing higher clocks instead of making the CPU much wider, so long as basic ALU and AGU operations could still be done in a single cycle.
As well, as we scale out further, you will not see most software pegging all your cores. You'll have one pegged, and others running the software but not at 100%. What more hardware threads can offer is to reduce the amount of work the main thread or threads have to do relative to the other threads. Outside of embarrassingly parallel problems, the low-hanging fruit has been picked (we're mostly waiting for the improvements that can be had to spread, since there are unnecessary bottlenecks all over existing code bases). Just as we went to dual-channel memory for 10-20% gains, we'll end up doing the same with core counts, and for the same reason: faster is either impossible or more expensive.
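A rough sketch of that "main thread plus helpers" shape, using std::async just for illustration (the function names and work sizes here are made up, not from anything above): the main thread keeps the serial part that's hard to split and stays near 100% busy, while the parallel-friendly chunks get handed off and only partially load the other hardware threads.

```cpp
#include <future>
#include <numeric>
#include <vector>
#include <cstdio>

static long crunch(const std::vector<long>& chunk) {     // the embarrassingly parallel piece
    return std::accumulate(chunk.begin(), chunk.end(), 0L);
}

int main() {
    std::vector<std::vector<long>> chunks(4, std::vector<long>(1'000'000, 1));

    // Hand the parallel-friendly pieces to worker threads...
    std::vector<std::future<long>> results;
    for (auto& c : chunks)
        results.push_back(std::async(std::launch::async, crunch, std::cref(c)));

    // ...while the main thread keeps the serial, hard-to-split work.
    long serial = 0;
    for (long i = 0; i < 10'000'000L; ++i) serial += i % 7;

    long total = serial;
    for (auto& r : results) total += r.get();
    std::printf("total = %ld\n", total);
}
```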
As of today, the biggest problem is that synchronizing across HT siblings costs as much as synchronizing across twice as many real cores, and any inter-thread communication at all takes just as long, if not longer, even when you don't strictly need to sync. Can software moving to work queues fix that (by trading fine-grained parallelism for an easier implementation of coarser-grained parallelism)? Can HLE help (by optimistically executing past locks anyway)? Can RTM get around it (by going lockless for anything but physical I/O)? Time will tell. But better software and scheduling are much more likely to improve the performance of multithreaded systems (not merely HT) than to make them obsolete, especially with power being such a big issue these days. Multithreading like SMT, or AMD's CMT+SMT, deals with inefficiencies on the scale of single clock cycles up to some tens of cycles, which are basically impossible for the OS to deal with.
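For anyone who hasn't seen what the RTM side of that looks like, here's a hedged sketch of the usual lock-elision pattern with Intel TSX intrinsics from <immintrin.h> (compile with -mrtm, and the shared counter and fallback spinlock are purely illustrative): try the critical section as a hardware transaction, and only take the real lock if the transaction aborts.

```cpp
#include <immintrin.h>
#include <atomic>
#include <cstdio>

static std::atomic<int> fallback_lock{0};   // plain spinlock used only when the transaction fails
static long counter = 0;

static void locked_increment() {
    unsigned status = _xbegin();            // start a hardware transaction
    if (status == _XBEGIN_STARTED) {
        // Reading the lock puts it in our read set; abort if someone already holds it,
        // and abort automatically if someone grabs it while we're in flight.
        if (fallback_lock.load(std::memory_order_relaxed))
            _xabort(0xff);
        ++counter;                          // speculative, lock-free update
        _xend();                            // commit
        return;
    }
    // Transaction aborted (or RTM unsupported): take the real lock instead.
    while (fallback_lock.exchange(1, std::memory_order_acquire)) { }
    ++counter;
    fallback_lock.store(0, std::memory_order_release);
}

int main() {
    for (int i = 0; i < 1000; ++i) locked_increment();
    std::printf("counter = %ld\n", counter);
}
```

The point of the pattern is that as long as transactions rarely abort, threads never actually bounce the lock's cache line between cores (or HT siblings), which is exactly the kind of tens-of-cycles cost the OS scheduler can't do anything about.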