Good day AT people. I have a question that I thought some people might like to discuss. Hyper-Threading is a proprietary technology developed by Intel where every physical core can be "split" into two logical cores that can each run a thread simultaneously, to speed up an application (that's a very watered-down explanation, but the point of this post is not to discuss what HT is, but how it works).
Hyperthreading adds the minimum amount of additional hardware to a CPU core needed for that core to execute multiple threads. HT as it is currently implemented is very close to a fully shared implementation. A few parts are still split, but the important bits are either added on (dedicated full-size resources for each thread) or shared (execution units, caches, etc.). It is much closer to fully shared SMT than it is to fully partitioned (i.e., split up) SMT.
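One easy way to see the "logical cores" half of that from software: the OS simply enumerates twice as many processors. A trivial check in standard C++ (nothing here is an HT-specific API, it just counts what the OS exposes):

```cpp
#include <iostream>
#include <thread>

int main() {
    // hardware_concurrency() counts logical processors, so a quad-core
    // with HT enabled will typically report 8 here.
    std::cout << std::thread::hardware_concurrency()
              << " hardware threads visible to the OS\n";
}
```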
But the program must be developed to utilize Hyper-Threading in order for it to work...
This is also wrong. The program only needs to be developed around Hyperthreading to extract the most performance from a Hyperthreading CPU: running low-IPC/high-CPI code (few instructions per cycle, many cycles per instruction) whose poor IPC is not caused by bandwidth limitations or execution resource stalls. Such optimizations are generally either leftovers from console development (the XB360 and PS3 CPUs have SMT that can be used like HT) or leftovers from the era of early P4 Xeons. Once the K8 came out, developers generally stopped caring (this is a good thing, mind you), and HT in the Core i series has far superior performance to HT in the P4.
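For the curious, here's a rough sketch (sizes, names, and iteration counts are all my own made-up choices) of the kind of low-IPC, latency-bound code HT is good at: each thread chases pointers through a big randomized ring, so it spends most of its time stalled on memory, leaving the core's execution units free for the sibling thread.

```cpp
#include <algorithm>
#include <atomic>
#include <cstddef>
#include <numeric>
#include <random>
#include <thread>
#include <vector>

// Build a single big random cycle of indices, so chasing it is one
// long chain of dependent, cache-missing loads.
static std::vector<std::size_t> make_ring(std::size_t n) {
    std::vector<std::size_t> order(n), next(n);
    std::iota(order.begin(), order.end(), std::size_t{0});
    std::shuffle(order.begin() + 1, order.end(), std::mt19937_64{42});
    for (std::size_t k = 0; k < n; ++k)
        next[order[k]] = order[(k + 1) % n];
    return next;
}

int main() {
    auto next = make_ring(std::size_t{1} << 24);  // 128 MB of indices, way past any cache
    std::atomic<std::size_t> sink{0};             // defeats dead-code elimination

    auto chase = [&] {
        std::size_t i = 0;
        for (std::size_t s = 0; s < 20'000'000; ++s)
            i = next[i];                          // each load depends on the last: IPC ~ 0
        sink += i;
    };
    std::thread a(chase), b(chase);               // two stalled threads share one core well
    a.join(); b.join();
}
```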
Ok, got that, that part makes sense now. But how or exactly why this is the case is still eluding me...
CPUs have gotten very fast very quickly,
but memory hasn't. Even as they stay <4GHz, they are doing more work per cycle, so a modern 3GHz CPU might be equivalent to what people in the early 90s would have expected from a 10GHz CPU.
A Pentium might have to take 20 cycles to get out to RAM (I made up the number, but it should be close). A Core ix-2xxx might take 100-200 cycles (or more, if the address needed is far enough away from any open DRAM row, on a RAM channel that's getting hammered by another thread). Ouch. Given that real IPC has been increasing, it would probably be equivalent to 300-500 cycles, if we were to normalize it to the older CPU's performance per clock.
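If you want to see those latencies on your own machine, a crude sketch like this works, reusing the pointer-chase trick (my sizes and step counts, tune to taste); each load depends on the previous one, so the time per step approximates raw load latency:

```cpp
#include <algorithm>
#include <chrono>
#include <cstddef>
#include <cstdio>
#include <numeric>
#include <random>
#include <vector>

// Time one dependent load chain; the ring is a single shuffled cycle,
// so every access is a surprise once the ring outgrows the caches.
static double ns_per_load(std::size_t n, std::size_t steps) {
    std::vector<std::size_t> order(n), next(n);
    std::iota(order.begin(), order.end(), std::size_t{0});
    std::shuffle(order.begin() + 1, order.end(), std::mt19937_64{1});
    for (std::size_t k = 0; k < n; ++k)
        next[order[k]] = order[(k + 1) % n];

    std::size_t i = 0;
    auto t0 = std::chrono::steady_clock::now();
    for (std::size_t s = 0; s < steps; ++s)
        i = next[i];                              // fully serialized loads
    auto t1 = std::chrono::steady_clock::now();
    volatile std::size_t keep = i; (void)keep;    // keep the loop alive
    return std::chrono::duration<double, std::nano>(t1 - t0).count() / steps;
}

int main() {
    std::printf("cache-resident: %5.1f ns/load\n",
                ns_per_load(std::size_t{1} << 10, 100'000'000));  // ~8 KB ring
    std::printf("RAM-resident:   %5.1f ns/load\n",
                ns_per_load(std::size_t{1} << 25, 10'000'000));   // ~256 MB ring
}
```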
This is why CPUs have been getting bigger and more complicated caches. However, as these caches get bigger, they also get slower, and they must also be made slower (in cycles) for the CPU to reach higher clock speeds. So, today, going out to L2 is as big of a deal as going out to main RAM was 15+ years ago.
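A quick sketch that makes the cache's influence visible (sizes are my own picks): summing the same matrix row-by-row uses every byte of each cache line fetched, while column-by-column strides 16 KB between loads and misses constantly.

```cpp
#include <chrono>
#include <cstddef>
#include <cstdio>
#include <vector>

int main() {
    const std::size_t N = 4096;                  // 4096x4096 ints = 64 MB, bigger than L3
    std::vector<int> m(N * N, 1);
    long long sum = 0;

    auto t0 = std::chrono::steady_clock::now();
    for (std::size_t r = 0; r < N; ++r)          // row-major: walks memory sequentially,
        for (std::size_t c = 0; c < N; ++c)      // every byte of each cache line gets used
            sum += m[r * N + c];
    auto t1 = std::chrono::steady_clock::now();
    for (std::size_t c = 0; c < N; ++c)          // column-major: 16 KB jumps between loads,
        for (std::size_t r = 0; r < N; ++r)      // one int used per cache line fetched
            sum += m[r * N + c];
    auto t2 = std::chrono::steady_clock::now();

    using ms = std::chrono::duration<double, std::milli>;
    std::printf("row-major:    %.0f ms\n", ms(t1 - t0).count());
    std::printf("column-major: %.0f ms (checksum %lld)\n", ms(t2 - t1).count(), sum);
}
```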
To mitigate these latencies, your CPU is constantly trying to determine the most likely instruction and data addresses it may need in the near future, and fetch them before you need them. Programming languages, compilers, preferred data structures, and preferred common algorithms have all been working in a complementary fashion with the advancement of such speculative hardware; so while it is hard to look at past use and determine future use, we try to use methods that make it less hard on the hardware, when we can.
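For regular strides the hardware prefetcher handles this on its own; for irregular but computable patterns you can hint ahead yourself. A hedged sketch using the GCC/Clang __builtin_prefetch intrinsic (the lookahead distance of 16 is a made-up starting point, not a recommendation):

```cpp
#include <cstddef>
#include <vector>

// Sums data[idx[i]] for every i, prefetching the element needed ~16
// iterations from now so it is (hopefully) in cache when we get there.
long long gather_sum(const std::vector<int>& data,
                     const std::vector<std::size_t>& idx) {
    const std::size_t ahead = 16;                       // lookahead: an assumption, tune it
    long long sum = 0;
    for (std::size_t i = 0; i < idx.size(); ++i) {
        if (i + ahead < idx.size())
            __builtin_prefetch(&data[idx[i + ahead]]);  // GCC/Clang builtin: start the load early
        sum += data[idx[i]];
    }
    return sum;
}
```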
When the CPU is correct, it can perform speculative execution, as well. The first form that came to commodity CPUs, to my knowledge, was speculative execution of a predicted branch. I.e., given "if A do B else do C," it figures it's probably going to need C, and it has the data on the CPU to execute C, so it does so. If it was right, the CPU never even had to wait on evaluating A. If it was wrong, it can throw away everything after A and jump to B (a performance hit is taken for this, but the prediction is correct often enough that it is worth occasionally being wrong). More recently, speculative re-ordering of loads and stores, and speculative execution by memory value (i.e., goto A+B, but B has not been evaluated yet) have been getting more popular (and value speculation is hard).
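The branch predictor part is easy to demo. This is the classic sorted-vs-unsorted experiment (my sizes; note that a clever compiler may turn the branch into a conditional move at high optimization levels and flatten the difference):

```cpp
#include <algorithm>
#include <chrono>
#include <cstdio>
#include <random>
#include <vector>

static double time_sum(const std::vector<int>& v) {
    long long sum = 0;
    auto t0 = std::chrono::steady_clock::now();
    for (int pass = 0; pass < 100; ++pass)
        for (int x : v)
            if (x >= 128)          // hard to predict on random data,
                sum += x;          // trivial to predict on sorted data
    auto t1 = std::chrono::steady_clock::now();
    if (sum == 1) std::puts("");   // keep sum live
    return std::chrono::duration<double, std::milli>(t1 - t0).count();
}

int main() {
    std::vector<int> v(1 << 20);
    std::mt19937 rng{7};
    for (int& x : v) x = rng() % 256;
    double shuffled = time_sum(v);
    std::sort(v.begin(), v.end());
    double sorted = time_sum(v);
    std::printf("random: %.0f ms, sorted: %.0f ms\n", shuffled, sorted);
}
```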
Then, modern systems use virtual memory. So, a memory address the program asks for is in its own little world, and must be translated to a physical memory address, the location of which is controlled by the hardware and operating system. CPUs have buffers of address translation data (translation lookaside buffers, or TLBs), but these will not always have the entry needed. Having to go look it up can take time, even going so far as multiple round trips out to main RAM.
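The translation itself is just arithmetic; what the TLB caches is the mapping for the page-number half. A toy sketch assuming 4 KB pages (real x86-64 resolves a miss with a multi-level page-table walk through RAM):

```cpp
#include <cstdint>
#include <cstdio>

int main() {
    const std::uint64_t PAGE_SIZE = 4096;            // 2^12; an assumed page size
    std::uint64_t vaddr = 0x00007f3a12345678ULL;     // some made-up virtual address
    std::uint64_t vpn    = vaddr / PAGE_SIZE;        // virtual page number -> the TLB's key
    std::uint64_t offset = vaddr % PAGE_SIZE;        // byte within the page, passes through
    // On a TLB hit, vpn maps straight to a physical frame number; on a
    // miss, the hardware has to go walk the page tables to find it.
    std::printf("vpn = 0x%llx, offset = 0x%llx\n",
                (unsigned long long)vpn, (unsigned long long)offset);
}
```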
Finally, instructions run in a CPU take time. Most basic instructions can complete very quickly (1 cycle, if everything goes well), but many memory operations, divides, and modulos (divide, but return the remainder) can take up to ~30 cycles, even after all the data is ready for them. When other instructions depend on one of those, you have a whole set of instructions twiddling their thumbs in a queue. Good out-of-order execution (OOOE) allows any instruction whose data is ready to execute, so that while one set of instructions is waiting on either a long instruction to complete or on a cache request (if it has to go out to memory, OOOE won't help much), others can run, reducing the penalty.
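Here's a small sketch of the dependency-chain point (names mine): the same additions written as one serial chain versus four independent chains. OOOE can keep the independent chains in flight simultaneously, so the second version typically runs severalfold faster:

```cpp
#include <cstddef>
#include <vector>

double sum_serial(const std::vector<double>& v) {
    double s = 0;
    for (double x : v) s += x;          // every add waits on the previous add
    return s;
}

double sum_ilp(const std::vector<double>& v) {
    double s0 = 0, s1 = 0, s2 = 0, s3 = 0;
    std::size_t i = 0;
    for (; i + 4 <= v.size(); i += 4) { // four independent dependency chains;
        s0 += v[i];     s1 += v[i + 1]; // OOOE keeps all of them in flight
        s2 += v[i + 2]; s3 += v[i + 3];
    }
    double s = s0 + s1 + s2 + s3;
    for (; i < v.size(); ++i) s += v[i];
    return s;
}
```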
Oh, wait, that's not the end. What we call operating systems today grew from what used to be called time-sharing systems, and the way threads are handled follows that history. When swapping from thread A to thread B, the CPU lets execution finish, then packs up the register state and sends it out to RAM. Once that is done, it then loads the state for thread B. This takes for freaking ever. It's like wrapping a gift in a box, packaging it, sending it by post, then waiting for the receiver to send back a thank-you letter before handling the next gift. There are good reasons for sticking with it, but you would generally imagine the process being quicker and simpler than it is. Some of this latency can be hidden by HT, which greatly helped the P4.
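A hedged way to feel this cost yourself (standard C++, my iteration counts): force two threads to take strict turns, so every step requires the OS to put one thread to sleep and wake the other, then divide total time by the number of handoffs. Numbers vary wildly by OS and CPU.

```cpp
#include <chrono>
#include <condition_variable>
#include <cstdio>
#include <mutex>
#include <thread>

int main() {
    const int rounds = 100000;
    std::mutex m;
    std::condition_variable cv;
    int turn = 0;

    auto player = [&](int me) {
        for (int i = 0; i < rounds; ++i) {
            std::unique_lock<std::mutex> lk(m);
            cv.wait(lk, [&] { return turn == me; }); // sleep until it's our turn
            turn = 1 - me;                           // hand the baton over
            cv.notify_one();                         // wake the other thread
        }
    };

    auto t0 = std::chrono::steady_clock::now();
    std::thread a(player, 0), b(player, 1);
    a.join(); b.join();
    auto t1 = std::chrono::steady_clock::now();

    double ns = std::chrono::duration<double, std::nano>(t1 - t0).count();
    std::printf("~%.0f ns per handoff\n", ns / (2.0 * rounds));
}
```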
Well, taken all together, you should be able to see how a CPU will often be left waiting on something to do, and thus how swapping between threads at the front end is not a big deal (the front end can decode and prepare instructions faster than they will ever be executed, outside of synthetic benchmarks).
The benefit of Hyperthreading, and any similar SMT implementation, is that it can get the CPU doing work while one thread is waiting around, making more efficient use of often-idle execution resources. The downsides with Hyperthreading, and any similar SMT implementation, are that the other thread is every bit as likely to stall as the first, and they both compete over the same memory and execution resources when they aren't stalled.
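If you want to experiment with that tradeoff, here's a Linux-specific sketch (pthread_setaffinity_np is a GNU extension; which logical CPU ids are HT siblings varies by machine, so check /sys/devices/system/cpu/cpu0/topology/thread_siblings_list): pin two busy threads to sibling logical CPUs, time it, then pin them to separate physical cores and compare.

```cpp
// build: g++ -O2 -pthread pin.cpp -o pin && time ./pin 0 4
#include <pthread.h>
#include <sched.h>
#include <cstdio>
#include <cstdlib>
#include <thread>

static void pin_to(std::thread& t, int cpu) {
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(cpu, &set);
    pthread_setaffinity_np(t.native_handle(), sizeof(set), &set);
}

static void work() {                                // any compute-bound loop will do
    volatile unsigned long long x = 1;
    for (int i = 0; i < 500000000; ++i)
        x = x * 6364136223846793005ULL + 1;
}

int main(int argc, char** argv) {
    int c0 = argc > 2 ? std::atoi(argv[1]) : 0;     // pick sibling ids to share a core,
    int c1 = argc > 2 ? std::atoi(argv[2]) : 1;     // or separate cores to compare
    std::thread a(work), b(work);
    pin_to(a, c0);                                  // pinning right after launch is fine
    pin_to(b, c1);                                  // for a long-running sketch like this
    a.join();
    b.join();
    std::printf("ran on logical CPUs %d and %d\n", c0, c1);
}
```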