Originally posted by: Hacp
Why is HT only good for processors with Deep pipelines? Northwoods had it and they only had 20 pipe stages.
The idea that Hyperthreading only benefits processors with longer pipelines is usually perpetuated by those who also equate pipeline depth with IPC.
Pipeline depth has absolutely nothing to do with determining whether a given processor architecture is suitable for SMT.
Hyperthreading allows the P4's schedulers to pick instructions from independent threads. If a thread exhibits very little ILP (often the result of a high degree of data dependency), the schedulers will struggle to find instructions that can be executed in parallel. Given a second, independent thread to draw from, the schedulers should be able to find non-dependent instructions to issue in parallel, since data dependencies do not exist across threads.
This is the whole point of SMT: to help superscalar architectures sustain an IPC greater than 1. Indeed, SMT would have been useless on pre-Pentium architectures, since superscalar execution didn't exist on the x86 platform before then.
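To make the data-dependency point concrete, here is a minimal C sketch; the function and numbers are purely illustrative, not taken from any real workload:

```c
/* Low ILP: each iteration depends on the result of the previous one, so a
   single thread gives the schedulers almost nothing to execute in parallel. */
long dependent_chain(long x, long n) {
    for (long i = 0; i < n; i++)
        x = x * 3 + 1;   /* the next multiply can't start until this one finishes */
    return x;
}

/* Run two copies of this chain as independent hardware threads (which is what
   Hyperthreading exposes to the core) and the out-of-order machinery sees two
   dependency chains with no dependencies between them, so their instructions
   can be interleaved in the same execution units. */
```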
There are two simple ways to increase the performance of an architecture:
1. An increase in frequency through an extended pipeline (accompanied by corresponding improvements to the branch prediction mechanism to offset the negative impact on IPC).
2. An increase in theoretical IPC.
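To put rough numbers on it (the figures are purely illustrative): performance is roughly frequency multiplied by average IPC, so a 3.6 GHz part averaging 1.0 instructions per clock and a 2.4 GHz part averaging 1.5 instructions per clock both retire about 3.6 billion instructions per second. The two methods are simply different routes to the same product.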
Of course, ideally we'd implement both, but since the high-IPC approach seems to have been given the spotlight of late, let's focus on that.
The maximum number of instructions a processor can execute per clock is determined by the number of execution units of each type. A thread composed of integer code, for example, can only be executed by the integer units, and the maximum IPC attainable when executing that code is determined by how many of them there are.
The average IPC, however, is determined by the issue rate, the accuracy of the branch predictor (and the number of clock cycles wasted flushing the pipeline after a misprediction), and the ability of the schedulers to find instructions to execute in parallel, as discussed above.
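As a back-of-the-envelope illustration of how predictor accuracy and flush cost drag average IPC down, here is a small C calculation; every number in it is an assumption, not a measurement of any real core:

```c
#include <stdio.h>

int main(void) {
    double ideal_ipc     = 3.0;   /* limited by issue width / execution units */
    double branch_freq   = 0.20;  /* assume ~1 in 5 instructions is a branch  */
    double miss_rate     = 0.05;  /* predictor misses 5% of branches          */
    double flush_penalty = 20.0;  /* cycles lost per flush, ~pipeline depth   */

    /* average cycles per instruction = ideal cycles + average flush cost */
    double cpi = 1.0 / ideal_ipc + branch_freq * miss_rate * flush_penalty;
    printf("average IPC ~= %.2f\n", 1.0 / cpi);   /* ~1.88 with these inputs */
    return 0;
}
```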
The Athlon can sustain a very high issue rate, thanks to its three general-purpose decoders, but if the schedulers can't find an equal number of instructions to execute in parallel each clock, that decode bandwidth goes to waste. One way to increase the Athlon's (admittedly already high) IPC would be to enlarge the instruction window and, correspondingly, the number of in-flight instructions being tracked. This would make life easier for the schedulers but would greatly increase the power draw.
An alternative would be to introduce a second, independent instruction stream, reducing the need for a larger instruction window.
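A rough illustration of this window-versus-second-thread trade-off; the loop below is hypothetical and not specific to any Athlon code:

```c
/* Each iteration's add extends a serial dependency chain through 'acc',
   while the load and multiply for the next element are independent of it.
   An out-of-order core can overlap iterations only if its instruction
   window is large enough to hold instructions from several of them at
   once -- or if a second hardware thread supplies independent work
   without the window having to grow at all. */
double sum_of_squares(const double *a, long n) {
    double acc = 0.0;
    for (long i = 0; i < n; i++)
        acc += a[i] * a[i];
    return acc;
}
```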
The importance of finding instructions that can be executed in parallel increases with the number of execution units.
So, if Conroe is going to be a wide-issue design, it will need either a very large instruction window or Hyperthreading.
Considering the impact of the former on power requirements, I would expect Intel to place more emphasis on the latter.