HT attempts to fill unused cycles in the pipelines of the execution engines. When a thread runs, it's very hard to keep them busy every CPU cycle. HT adds extra registers and other bits of logic to allow a second thread to make use of those otherwise unused cycles.
True enough, HT does provide an extra set of registers (though they are unavailable to programmers). However, it isn't really about unused clock cycles.
For most general-purpose code, not much concern is given to optimization because it runs fast enough without it. For certain applications, though, proper optimization increases the number of cycles the ALU/FPU is busy doing work for the main/real thread (and thus decreases the number of "unused" ALU/FPU cycles), meaning there are fewer unused cycles for the HT/virtual thread. Of course, it's very hard, if not impossible, to make sure that the ALU/FPU is doing something for the main/real thread every cycle, but for apps that run 24/7 and/or do heavy, time-intensive crunching, you can be sure that time is spent on it because the gains are large (compared to, say, how much you'd gain by optimizing trivial threads or apps). The closer an app comes to 100% ALU/FPU utilization, the smaller the benefit of HT, but since 100% is impossible to reach, there will always be some benefit to having HT.
A well-optimized program doesn't try to utilize every part of the CPU. That's just plain dumb. A lot of instructions are worthless for many applications. Take the MMX instruction set, for example: there is pretty much no point in writing an optimized program around MMX instructions, so the MMX registers basically sit around unused. (HT doesn't make those registers/processing units get used either; you have to program to use them.) But there is more to it than that. What if you just use the ADD instruction? What about MUL and DIV, might they be used as well? The answer is yes, they could be. Or even further, what if your app is waiting on a cache miss or a branch misprediction? Could the CPU be doing something else? The answer is yes. None of those situations are the fault of a poorly written program, but each could benefit a threaded program.
If we assume a given thread is at 80% efficiency, then whatever the difference is from 100% (20%), plus whatever benefit comes from the extra registers and other HT logic (a few percent at best), is the gain you'll see from HT when running two of those threads simultaneously. If we can optimize this thread to 90%, then the HT gains become smaller. Conversely, an app that's 70% efficient sees a larger benefit from HT because there are more free cycles to run the HT thread.
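To put numbers on the back-of-the-envelope model in the paragraph above, here is a minimal Python sketch. The function name `ht_gain_estimate` and the 3% figure for the "extra registers and other HT logic" are my own assumptions for illustration (the post only says "a few percent at best"); this is the quoted poster's rough model, not a measurement of real hardware.

```python
def ht_gain_estimate(efficiency, ht_logic_bonus=0.03):
    """Rough HT gain under the model above: idle fraction plus a small bonus.

    efficiency     -- fraction of cycles the main thread keeps the core busy (0..1)
    ht_logic_bonus -- assumed gain from the extra HT logic (hypothetical 3%)
    """
    if not 0.0 <= efficiency <= 1.0:
        raise ValueError("efficiency must be between 0 and 1")
    return (1.0 - efficiency) + ht_logic_bonus


# The three cases from the paragraph: 70%, 80% and 90% efficient threads.
for eff in (0.70, 0.80, 0.90):
    gain = ht_gain_estimate(eff)
    print(f"{eff:.0%} efficient thread -> roughly {gain:.0%} potential HT gain")
```

The only point is the trend: the better optimized the thread, the smaller the slice left over for the second thread.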
Don't look at the CPU utilization Windows reports as a measurement of efficiency. It isn't.
While it's not directly comparable, check out AT Bench between the i5 750 and i7 920, especially the CPU heavy encoding results for examples of how much HT is, or isn't, a benefit depending on the code being run.
Great example, look at the x264 encoding review. x264 is HIGHLY optimized, yet look at how the i920 compares to the i870. Even though the i870 is clocked higher than the i920, the i920 is nearly the same in encoding speed. The big difference between the two is hyperthreading. A perfect example of a highly threaded application taking advantage of Intel's hyperthreading.
HT's effectiveness does have to do with the code it's running, because that code is the largest factor in determining how much CPU is left available for HT threads. And as for making code work faster: if it's multi-threaded code, then HT helps it run faster. If it's a single thread, then HT helps to run the hundreds of normal OS/app threads, which can free up CPU cycles for the main thread.
This is just babble. Unfortunately, I have to turn back to the P4 era, when HT was something you could turn off, but here:
http://www.neoseeker.com/Articles/Hardware/Reviews/intel306/6.html
Back in those days, pretty much all applications were single-threaded. If you look at all the benchmarks, with and without HT, you'll notice that at best they perform the same and at worst HT loses a little. Which is why I said hyperthreading only helps multithreaded apps.
Outside of the early XP days when the OS didn't know the difference between virtual and real cores and you could stall the pipeline (or trying for a max overclock I suppose),
Um, yeah, the OS has no idea what your program is doing. It doesn't scan it or anything; it only schedules it for a spot to run and runs it (this is true of pretty much every multithread-supporting OS from 95 to Win 7). The CPU is doing the grunt work and saying "Hey, I'm waiting here, I'll start working on that other thread's stuff". That is what HT is about: the CPU working on a different thread when one thread stalls.
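Here is a toy Python sketch of that "work on the other thread while this one stalls" idea. Everything in it is an assumption made for illustration: a core with a single issue slot, threads described as lists of stall lengths (cycles spent waiting after each instruction, e.g. on a cache miss), and a fixed priority for thread 0. Real HT hardware arbitrates between threads very differently; `run_serial` and `run_smt` are hypothetical names.

```python
def run_serial(threads):
    """Cycles used when threads run one after another: every stall cycle is wasted."""
    return sum(1 + stall for thread in threads for stall in thread)


def run_smt(threads):
    """Cycles used when threads share one issue slot and fill each other's stalls.

    Each cycle, the first thread that is not stalled issues one instruction.
    """
    next_instr = [0] * len(threads)  # index of the next instruction per thread
    ready_at = [0] * len(threads)    # first cycle each thread may issue again
    cycle = 0
    while any(next_instr[i] < len(t) for i, t in enumerate(threads)):
        for i, t in enumerate(threads):
            if next_instr[i] < len(t) and ready_at[i] <= cycle:
                # Issue takes one cycle, then the thread waits out its stall.
                ready_at[i] = cycle + 1 + t[next_instr[i]]
                next_instr[i] += 1
                break  # only one instruction issues per cycle
        cycle += 1
    return max(ready_at, default=0)


# Two made-up threads: each number is the stall (in cycles) after that instruction.
thread_a = [0, 0, 3, 0]
thread_b = [0, 2, 0, 0]

print("one after another:", run_serial([thread_a, thread_b]), "cycles")
print("sharing the core: ", run_smt([thread_a, thread_b]), "cycles")
print("single thread:    ", run_smt([thread_a]), "cycles, same as", run_serial([thread_a]))
```

Note that a lone thread gains nothing in this model, because there is nothing else to issue while it waits. Nowhere does the OS inspect what the threads are doing; the selection happens cycle by cycle inside the (simulated) core.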
HT is always good. It gives you more performance. Oftentimes, that performance can even be perceived by the user to rival real cores when you factor in user input, HDD access, etc. With a large enough IPC and clock speed gap, HT can rival real cores even with highly optimized code (bench the i3 530 and Athlon x4 620).
So, now you are saying that HT actually helps with optimized code (as I was saying).
Either way, as I posted above, HT is NOT always good; it does come at a small penalty. And in the day of multicore CPUs, its benefits are limited to highly threaded apps only.
I'm sure AMD wishes it had HT technology on its low/mid consumer chips. I'm equally sure they aren't too worried about HT in a server CPU if they can keep the IPC and clocks close while having more real cores.
True enough. I don't know how AMD's architecture would fare with HT. (From some of the Bulldozer slides, it's speculated that they might be sporting an HT clone, who knows.)
It should be noted, however, that IPC is a somewhat worthless measurement of performance. SIMD instructions are excellent examples proving that.
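A quick worked example of why, sketched in Python. All the numbers are made up for illustration (1024 floats to add, a scalar loop retiring 2 adds per cycle, a 4-wide SIMD loop retiring 1 packed add per cycle); they are not measurements of any real CPU, and `ipc_and_throughput` is a hypothetical helper.

```python
def ipc_and_throughput(instructions, cycles, elements):
    """Return (instructions per cycle, elements processed per cycle)."""
    return instructions / cycles, elements / cycles


ELEMENTS = 1024  # hypothetical workload: add 1024 floats

# Assumed scalar version: one add per element, 2 adds retired per cycle.
scalar_ipc, scalar_rate = ipc_and_throughput(
    instructions=ELEMENTS, cycles=ELEMENTS // 2, elements=ELEMENTS)

# Assumed 4-wide SIMD version: one packed add per 4 elements, 1 retired per cycle.
simd_ipc, simd_rate = ipc_and_throughput(
    instructions=ELEMENTS // 4, cycles=ELEMENTS // 4, elements=ELEMENTS)

print(f"scalar: IPC {scalar_ipc:.1f}, {scalar_rate:.1f} elements/cycle")
print(f"SIMD:   IPC {simd_ipc:.1f}, {simd_rate:.1f} elements/cycle")
```

Under those assumptions the SIMD version has half the IPC but gets twice as much work done per cycle, so comparing IPC alone would pick the slower code.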