If Intel releases another Hyperthreading CPU, does that mean another inefficient CPU is coming our way?

Hulk

Diamond Member
Oct 9, 1999
5,138
3,726
136
I thought that Hyperthreading was a way to reclaim some of the lost cycles that occurred when the long P4 pipeline had a misprediction and needed to be flushed. Basically, it made up for some of the inefficiencies of the Netburst design.

If Intel (or AMD) releases a CPU with Hyperthreading, wouldn't that imply an inefficient design?

Or can Hyperthreading make a good design better?

If the answer to the previous question is yes, then why didn't Intel include Hyperthreading in the Pentium M or Conroe (Merom)?

I know there are people here that know a lot about Hyperthreading...
 

aka1nas

Diamond Member
Aug 30, 2001
4,335
1
0
If Intel releases a Core 2-type CPU with Hyperthreading, the specifics of the implementation would probably be quite different from what they were on Netburst. HT is mainly just a marketing term, and SMT isn't really tied to one particular type of CPU architecture.
 

mountcarlmore

Member
Jun 8, 2005
136
0
0
Nehalem, which will likely be a Core 2-based design, will have a new implementation of SMT. It will not be like Netburst's SMT, and might actually do the opposite of what Hyperthreading did, but Intel is bringing back SMT, make no mistake about it.
 

IntelUser2000

Elite Member
Oct 14, 2003
8,686
3,787
136
If Intel (or AMD) releases a CPU with Hyperthreading, wouldn't that imply an inefficient design?

No. That's a common misconception about CPUs. Unlike GPUs, which can utilize close to 100% of their massively parallel execution units thanks to the exceedingly parallel nature of graphics, CPUs can't. That's why we don't see CPU companies increasing issue width for more performance anymore (OK, Core 2 went from 3-wide to 4-wide, but that's a small step, since it was already 3 almost 10 years ago); there is a limit on how much parallelism we can extract out of the instruction stream.

Hence came thread-level parallelism. Increasing performance by adding more threads is much easier than increasing issue width. If HT were only for inefficient CPUs, we wouldn't see POWER5 CPUs (which have really high performance per clock) using similar tech.

SMT (Hyperthreading is Intel's marketing name for SMT, and since it sounded good, everyone thinks of HT before SMT) is one way of working around the lack of instruction-level parallelism. It makes each core more efficient. Even on Core 2, we are nowhere near as efficient as CPU manufacturers would like to be.
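To make that concrete, here is a minimal C sketch (a hypothetical illustration, not anything from Intel's documentation): within a single thread, a dependency chain caps how many instructions can be issued per cycle no matter how wide the core is, while a second thread hands the core ready-to-issue work essentially for free.

/* Hypothetical sketch: two POSIX threads, each running a serial dependency
 * chain.  Within either chain every step needs the previous result, so a
 * wider issue width can't speed it up; thread-level parallelism (two cores,
 * or two SMT contexts on one core) can.
 * Build with: cc -O2 smt_sketch.c -lpthread */
#include <pthread.h>
#include <stdio.h>

static void *serial_chain(void *arg)
{
    unsigned long x = (unsigned long)arg;
    for (long i = 0; i < 100000000L; i++)
        x = x * 6364136223846793005UL + 1442695040888963407UL; /* each step depends on the last */
    return (void *)x;
}

int main(void)
{
    pthread_t t1, t2;
    void *r1, *r2;

    pthread_create(&t1, NULL, serial_chain, (void *)1UL);
    pthread_create(&t2, NULL, serial_chain, (void *)2UL);
    pthread_join(t1, &r1);
    pthread_join(t2, &r2);

    printf("%lu %lu\n", (unsigned long)r1, (unsigned long)r2);
    return 0;
}

On a HT-enabled core, the two chains share one set of execution units, and the second context fills cycles that the first chain's dependencies would otherwise leave empty.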

(Having to manually add quote formatting on the AT forums really sucks; I thought the guys who built it had more sense than that. Don't even get me started on the comment system's lack of an edit feature.)

 

Banzai042

Senior member
Jul 25, 2005
489
0
0
My understanding of HT (as implemented on the Netburst architecture) is this: when the processor is running one task that has it at "100%" usage, it's not actually using all the stages/areas of the pipeline, because different processes do different things. HT created 2 virtual cores on one real core, with different areas of the pipeline split up, so that one task that used area A of the pipeline could run in parallel with a separate task that used area B of the pipeline. The idea itself is a reasonable one; it essentially allows for multicore-style parallel computing on one processor core, assuming that the 2 processes use different areas of the core.

The problem with HT comes mostly from the software implementation: when the Pentium D came out, I read multiple reviews that said that in some situations performance actually dropped when HT was enabled. The Pentium D with HT was seen by Windows as 4 cores (procs 1 and 2 each having 2 virtual cores), so in some situations, when Windows filled what it saw as core 1 (physical core 1, virtual core 1), it would start trying to fill what it saw as core 2 (physical core 1, virtual core 2) with the same type of tasks, leaving physical core 2 untouched.

I would guess that if Intel and AMD were to create a standardized system similar to HT, we would probably see an increase in parallel computing performance, because the parts of the operating system that handle assigning processes to different cores could determine which process to give to which logical core.
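A rough sketch of what HT-aware placement could look like from the application side (hypothetical code using the Win32 affinity API; the assumption that logical CPUs 0/1 share physical core 0 and logical CPUs 2/3 share physical core 1 is only for illustration, since the real mapping depends on the system):

/* Hypothetical sketch: pin two worker threads to different physical cores,
 * assuming logical CPUs 0/1 sit on physical core 0 and logical CPUs 2/3 on
 * physical core 1.  A scheduler that isn't HT-aware might instead put both
 * workers on logical CPUs 0 and 1, i.e. the same physical core. */
#include <windows.h>
#include <stdio.h>

static DWORD WINAPI worker(LPVOID arg)
{
    volatile unsigned long long x = 0;
    for (unsigned long long i = 0; i < 100000000ULL; i++)
        x += i;                          /* stand-in for real work */
    return 0;
}

int main(void)
{
    HANDLE t1 = CreateThread(NULL, 0, worker, NULL, CREATE_SUSPENDED, NULL);
    HANDLE t2 = CreateThread(NULL, 0, worker, NULL, CREATE_SUSPENDED, NULL);

    SetThreadAffinityMask(t1, 1 << 0);   /* logical CPU 0 -> physical core 0 (assumed) */
    SetThreadAffinityMask(t2, 1 << 2);   /* logical CPU 2 -> physical core 1 (assumed) */

    ResumeThread(t1);
    ResumeThread(t2);
    WaitForSingleObject(t1, INFINITE);
    WaitForSingleObject(t2, INFINITE);
    printf("done\n");
    return 0;
}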
 

IntelUser2000

Elite Member
Oct 14, 2003
8,686
3,787
136
First Hyperthreading enabled Pentium 4 CPU
http://www.anandtech.com/showdoc.aspx?i=1746

As you can see, the performance decreases were negligible.

The problem with HT comes mostly from the software implementation: when the Pentium D came out, I read multiple reviews that said that in some situations performance actually dropped when HT was enabled.

The Pentium D was THE processor that ruined HT's reputation for increasing performance. It was perfectly fine on the single-core chips....
 

Hard Ball

Senior member
Jul 3, 2005
594
0
0
Originally posted by: IntelUser2000
First Hyperthreading enabled Pentium 4 CPU
http://www.anandtech.com/showdoc.aspx?i=1746

As you can see, the performance decreases were negligible.

The problem with HT comes mostly from the software implementation: when the Pentium D came out, I read multiple reviews that said that in some situations performance actually dropped when HT was enabled.

The Pentium D was THE processor that ruined HT's reputation for increasing performance. It was perfectly fine on the single-core chips....

The Pentium D had no Hyperthreading enabled when it came out. The dual-core Netburst chips with HT were the Pentium Extreme Editions, produced in small numbers.
 

Brunnis

Senior member
Nov 15, 2004
506
71
91
As said, it's a common misconception that SMT is of no use for CPUs with a shorter pipeline. Utilizing all of the CPU's resources is a problem in all modern CPUs, due to limited instruction-level parallelism. Decreasing instruction dependencies in the pipeline by introducing a second thread can minimize this problem, and that's obviously beneficial whether the pipeline is long or short.
 

Hulk

Diamond Member
Oct 9, 1999
5,138
3,726
136
IntelUser2000,

Great explanation. Thanks.

From what you're saying, it seems like the better branch predictor of the C2D is more of a reason why Intel chose not to Hyperthread that CPU at this point than the fact that it has a much shorter pipeline than the P4, right?

So Hyperthreading was more a way to limit out-of-order instructions that would cause a pipeline stall, correct? Instead of trying to predict the order of instructions, it's easier to just run two different threads, where the branch prediction probability will be much higher.

Now, unless I'm totally off base (which I probably am), I understand why the P4 was Hyperthreaded. Not only was its branch predictor not as good as the C2D's, but a pipeline stall was much more costly than on the C2D: for Prescott, 31 cycles could be wasted vs. 14 for the C2D.

Finally, it seems you are saying that Hyperthreading becomes more important as the CPU design becomes wider, with pipeline length being a secondary concern.

Very interesting.

Thanks again and please feel free to correct me here, I'm kind of thinking out loud to try and work this out in my head!

I was just re-reading Anand's review of the P4 3.06 and am not totally following this paragraph.

"Another situation where execution units remain idle is when you're processing data streams using instructions that inherently take longer to execute than simpler ones. The problem with streaming situations is that there are usually very long dependency chains where you cannot execute multiple instructions in parallel because the outcome of one operation is necessary in order to process the next instruction. This is quite common with video encoding which is why we see such large performance increases with HT enabled in our DiVX tests. Remember that in order for us to see a performance gain while running a single application, the application must be multithreaded so it can dispatch more than one thread to the CPU at a time."


I don't understand the first sentence. How can one instruction take longer than another? I thought all instructions moved through the pipeline at the clock speed of the CPU.
 

pm

Elite Member Mobile Devices
Jan 25, 2000
7,419
22
81
One other reason to do multithreading is to switch threads during the relatively long latency of memory accesses. It's not only for branch mispredicts.

The Itanium 2 microprocessor has a short pipeline but implements a blended form of TMT and SMT (temporal and simultaneous multithreading, respectively).

I don't understand the first sentence. How can one instruction take longer than another? I thought all instructions moved through the pipeline at the clock speed of the CPU.
What you are saying is true of some instructions, and largely applies to RISC (reduced instruction set computer) designs. But IA32/x86 implements a CISC instruction set and thus has many instructions that take many cycles to complete; FSQRT is an often-cited example. It calculates the floating-point square root. There is no dedicated chunk of circuitry for calculating square roots, so the instruction instead relies on repeated passes through the multiplier and adder circuitry.
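A small C sketch of the same idea (the relative timings are assumptions; actual latencies vary by microarchitecture): a chain of dependent square roots leaves the long-latency hardware waiting on each result, while independent square roots can overlap, which is exactly the kind of idle time a second SMT thread could also soak up.

/* Hypothetical sketch: dependent vs. independent long-latency operations.
 * No specific cycle counts are claimed; the point is the dependency structure.
 * Build with: cc -O2 sqrt_sketch.c -lm */
#include <math.h>
#include <stdio.h>

int main(void)
{
    /* Dependent chain: each sqrt needs the previous result, so nothing
     * else from this thread can start until it finishes. */
    double chain = 2.0;
    for (int i = 0; i < 1000000; i++)
        chain = sqrt(chain + 2.0);

    /* Independent chains: four separate results can be in flight at once,
     * so the hardware (or a second thread) can overlap the latency. */
    double a = 2.0, b = 3.0, c = 5.0, d = 7.0;
    for (int i = 0; i < 250000; i++) {
        a = sqrt(a + 2.0);
        b = sqrt(b + 2.0);
        c = sqrt(c + 2.0);
        d = sqrt(d + 2.0);
    }

    printf("%f %f %f %f %f\n", chain, a, b, c, d);
    return 0;
}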
 

Noubourne

Senior member
Dec 15, 2003
751
0
76
Put it this way: the problem with Prescott wasn't Hyperthreading, and the Northwood HT-enabled chips were ahead of their time (well, ahead of the software of the time), so they didn't show much improvement from it.

Hyperthreading on its own isn't a bad thing. Of course, doing it in hardware should negate the need for a lot of wacky compiling, so my largely uninformed opinion is that I'd rather see this implemented in hardware than force everybody to recompile their apps to make it work right.
 

zephyrprime

Diamond Member
Feb 18, 2001
7,512
2
81
Originally posted by: Hulk
IntelUser2000,

Great explanation. Thanks.

From what you're saying, it seems like the better branch predictor of the C2D is more of a reason why Intel chose not to Hyperthread that CPU at this point than the fact that it has a much shorter pipeline than the P4, right?

So Hyperthreading was more a way to limit out-of-order instructions that would cause a pipeline stall, correct? Instead of trying to predict the order of instructions, it's easier to just run two different threads, where the branch prediction probability will be much higher.
No. It's just a way to utilize unused execution units and cycles, because programs do not offer enough extractable IPC and because pipeline bubbles will always happen. An inefficient processor design will have more pipeline bubbles, so SMT will be proportionally more useful in an inefficient design than in an efficient one, but even in an efficient design, resource utilization is nowhere near 100%.

I don't understand the first sentence. How can one instruction take longer than another? I thought all instructions moved through the pipeline at the clock speed of the CPU.
Some instructions take more than one cycle to execute. This also applies to RISC processors. What pm says about transcendental functions like FSQRT is true, but even simpler instructions can take several cycles to complete. For example, the integer multiply instruction takes about 4 cycles to execute on most processors. In the hardware, the multiply instruction occupies 4 sequential stages of the pipeline. If you issue a multiply instruction and then immediately execute an unrelated instruction, there will be no pipeline bubbles. For example, if you execute:
a = b*c
d = d*e
You won't have any bubbles assuming that everything is in registers.

But if you execute:
a = b*c + d
Then you will have a pipeline bubble of at least 3 cycles, because after you issue the "MUL B C" instruction, you cannot immediately execute the "ADD 'result of MUL B C' D" instruction; the result of "MUL B C" hasn't been determined yet. It will be 3 more cycles before that result is in. And with the time needed for instruction retirement, it might actually be a little longer than 3 cycles (I don't really know).
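In compilable form, the same idea looks roughly like this (a hypothetical sketch; the ~4-cycle multiply latency is the assumed figure from the post above): interleaving two independent multiply-add chains gives the scheduler something to issue while each multiply's result is still in flight, which is the same gap a second SMT thread can fill.

/* Hypothetical sketch: the ~4-cycle multiply latency is an assumed figure,
 * not a measured one. */
#include <stdio.h>

/* Dependent form: the add must wait for the multiply's result, leaving a
 * bubble of a few cycles that this thread cannot fill by itself. */
static int madd_dependent(int b, int c, int d)
{
    int a = b * c;          /* result in flight for ~4 cycles (assumed) */
    return a + d;           /* stalls until the multiply completes      */
}

/* Interleaved form: two independent multiply-add chains give the core
 * (or a second SMT thread) something to issue during those bubbles. */
static void madd_interleaved(int b, int c, int d,
                             int e, int f, int g,
                             int *out1, int *out2)
{
    int a1 = b * c;         /* both multiplies can be in flight at once */
    int a2 = e * f;
    *out1 = a1 + d;         /* first result is (likely) ready by now    */
    *out2 = a2 + g;
}

int main(void)
{
    int r1, r2;
    madd_interleaved(2, 3, 4, 5, 6, 7, &r1, &r2);
    printf("%d %d %d\n", madd_dependent(2, 3, 4), r1, r2);
    return 0;
}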

 

Hulk

Diamond Member
Oct 9, 1999
5,138
3,726
136
Now I'm understanding Anand's explanation of HT:

Branch mispredictions - if one executing thread stalls, then while its work is being flushed from the pipeline, the other thread, which did not stall, can continue to execute.

Thread switching during long-latency memory accesses - useful when multitasking.

A mix of instructions where some take longer than others - if the results of the "longer" instruction are needed by the shorter one, execution units remain idle. Anand says this is typical of video encoding, which is why HT (and, I assume, multiple cores) helps performance so much.


Now that I think about it I remember the biggest benefit I got from my P4 3.06 with HT on was that the system stayed quite responsive when I was running lots of apps.

So it seems as though just about any CPU can benefit from HT, but those with inefficient branch predictors, long-latency memory subsystems, and deep pipelines will benefit the most.

Perhaps AMD never implemented HT on their Athlon cores because of the on-board memory controller and short-ish pipeline? I guess it wasn't worth the silicon real estate?

It's times like this I wish I were an EE instead of an ME!

Really interesting info.

 

IntelUser2000

Elite Member
Oct 14, 2003
8,686
3,787
136
The Itanium 2 microprocessor has a short pipeline but implements a blended form of TMT and SMT (temporal and simultaneous multithreading, respectively).

Wow. I just took a look at the Itanium 2 9000 (Montecito) multithreading approach. Initially it was thought to use SoEMT exclusively, but it uses SoEMT for the core and SMT for memory. Interesting.

The Pentium D had no Hyperthreading enabled when it came out. The dual-core Netburst chips with HT were the Pentium Extreme Editions, produced in small numbers.

My mistake. But I hope the original poster I replied to can see past my error. Still, the dual-core versions were the ones that essentially ruined people's view of HT.

Perhaps AMD never implemented HT on their Athlon cores because of the on-board memory controller and short-ish pipeline? I guess it wasn't worth the silicon real estate?

Well, in terms of silicon real estate, adding HT costs almost nothing; Hyperthreading on the Pentium 4 took less than 5% of the total die area. I guess the Pentium 4 core was so under-utilized that even a bare-minimum form of HT would have made it significantly faster.
 

dmens

Platinum Member
Mar 18, 2005
2,275
965
136
Perhaps AMD never implemented HT on their Athlon cores because of the on-board memory controller and short-ish pipeline? I guess it wasn't worth the silicon real estate?

Probably the relatively low return on effort expended. Fine-grained SMT was notoriously difficult to validate on P4 for various reasons.
 

IntelUser2000

Elite Member
Oct 14, 2003
8,686
3,787
136
Fine-grained SMT was notoriously difficult to validate on P4 for various reasons.

What do you mean by that?? HT on Pentium 4 is just SMT.

If by fine-grained SMT you mean fine-grained MT, that approach degrades single-thread performance, which wouldn't have been acceptable in the P4's day. SMT on the Pentium 4 suffers only a minor (if any - go check out the 3.06GHz P4) single-thread performance hit. Fine-grained MT is better for multithreaded workloads, but people wouldn't have liked the single-thread sacrifice.

 

BitByBit

Senior member
Jan 2, 2005
474
2
81
Originally posted by: Banzai042
My understanding of HT (as implemented on the Netburst architecture) is this: when the processor is running one task that has it at "100%" usage, it's not actually using all the stages/areas of the pipeline, because different processes do different things. HT created 2 virtual cores on one real core, with different areas of the pipeline split up, so that one task that used area A of the pipeline could run in parallel with a separate task that used area B of the pipeline. The idea itself is a reasonable one; it essentially allows for multicore-style parallel computing on one processor core, assuming that the 2 processes use different areas of the core.

A processor does not assign different tasks to different parts of the pipeline; a single instruction must run through the entire pipeline regardless of the process - all 31 stages in the case of Prescott. It seems to be a common misconception that the goal of SMT was to improve the efficiency of very deep pipelines by filling 'empty slots' with instructions from different threads. All modern processors utilise superscalar execution, whereby multiple instructions are executed in parallel. As has already been mentioned, however, they cannot always find instructions from a particular thread to execute together each clock cycle, due to data dependencies. SMT allows the scheduler to pick instructions from two threads, which naturally do not share data dependencies and can therefore be executed in parallel. Netburst's 'narrow and deep' design is actually less suitable for SMT than K8 or Core because of its small data/trace caches, which must accommodate data from both threads, its narrower execution engine and its single decoder. An implementation of SMT on Core will yield greater boosts in performance than on Netburst due to its greater execution resources and larger caches, although Core already does a pretty good job of extracting ILP from threads thanks to its deep buffers and other technologies.

 

IntelUser2000

Elite Member
Oct 14, 2003
8,686
3,787
136
Netburst's 'narrow and deep' design is actually less suitable for SMT than K8 or Core because of its small data/trace caches, which must accommodate data from both threads, its narrower execution engine and its single decoder.

I disagree. While the Core microarchitecture is pretty wide, it is also quite a lot more efficient than the Pentium 4; real-world benchmarks put the per-clock performance difference at 2x. Then again, the future implementations of HT that will go on the Core derivatives are supposed to be much better than the one on the P4, so this discussion is somewhat moot.

It seems to be a common misconception that the goal of SMT was to improve the efficiency of very deep pipelines by filling 'empty slots' with instructions from different threads.

Perhaps it's better to say: "by filling 'different empty slots' with instructions from different threads"?? :D
 

BitByBit

Senior member
Jan 2, 2005
474
2
81
Originally posted by: IntelUser2000
Netburst's 'narrow and deep' design is actually less suitable for SMT than K8 or Core because of its small data/trace caches, which must accommodate data from both threads, its narrower execution engine and its single decoder.

I disagree. While the Core microarchitecture is pretty wide, it is also quite a lot more efficient than the Pentium 4; real-world benchmarks put the per-clock performance difference at 2x. Then again, the future implementations of HT that will go on the Core derivatives are supposed to be much better than the one on the P4, so this discussion is somewhat moot.

Of course Core is more efficient. It also has far more execution resources than Netburst and therefore has more potential performance to exploit. Netburst was never going to be a great performer in terms of IPC, which, incidentally, has far less to do with its pipeline depth and far more to do with the design features I discussed above.

Core is capable of executing 4 instructions per clock, but I suspect even its designers will concede that it will very seldom achieve an execution rate even close to this when executing single threaded code, despite its efficiency. Netburst could only decode one instruction per clock, and issue maybe two with its trace cache, although the average decode rate is probably less than that. This means that Netburst was severely restricted in terms of IPC - a limitation Core does not have with its four decoders.

Core achieved its massive increase in IPC over Netburst thanks to the increase in its execution resources, along with its greatly expanded L1 cache. Although it is more efficient than Netburst in terms of pipeline flushes and wasted cycles due to its shorter pipeline and superior branch prediction, it still cannot engage all of its impressive execution resources all of the time. This is not a design problem, but more a problem with the x86 ISA, which severely inhibits the number of instructions that may be executed in parallel.
In other words, Core is not 'capped' as Netburst was and will achieve an even greater boost in performance from HT. Bloomfield will indeed be a monster.

 

IntelUser2000

Elite Member
Oct 14, 2003
8,686
3,787
136
Netburst could only decode one instruction per clock, and issue maybe two with its trace cache,

The Trace Cache issue rate is THREE instructions/cycle; Core's maximum issue rate is only 33% higher. Once code is in the Trace Cache, the Pentium 4 is essentially a 3-issue-wide CPU. Hyperthreading also helps with latency hiding, and there the Core-microarchitecture-based CPUs are superior in every way: branch prediction, memory latency, cache, shorter pipeline. I would assume Netburst and Core can benefit equally, but if anything Netburst is still better suited to Intel's implementation of SMT.

incidentally, has far less to do with its pipeline depth and far more to do with the design features I discussed above.

Yeah, right. Intel said back when Willamette came out that a branch misprediction can cause a 40% performance penalty, and Hans de Vries from chip-architect says something similar. There are other parts of the P4 that are poorly designed, but misprediction was said to be the biggest problem, and the much-enhanced branch predictor isn't enough to overcome the extra pipeline stages.

This is not a design problem, but more a problem with the x86 ISA, which severely inhibits the number of instructions that may be executed in parallel.

Oh, right. And one of the fastest CPUs nowadays happens to be an x86 CPU, and that was true even before the Core 2 Duo. It was once thought by some that superscalar execution wasn't even possible on x86. A wider CPU with a supposedly better instruction set, like the G5, doesn't perform faster than the K7. Why do decoders exist? To get around the problems of the x86 instruction set. Nowadays, being x86 gives more room for performance increases rather than being a quirk (can you have a Trace Cache on a CPU with no decoders??).

Look at the super-wide CPUs like the Power 4/5: http://www.chip-architect.com/news/2003_08_22_hot_chips.html

A significant number of improvements had to be made to the core to take better advantage of SMT. IBM says the gain would have been limited to about 20% if it weren't for the extra improvements, which is similar to what Intel can achieve with Hyperthreading on the server side.
 

BitByBit

Senior member
Jan 2, 2005
474
2
81
Originally posted by: IntelUser2000

The Trace Cache issue rate is THREE instructions/cycle; Core's maximum issue rate is only 33% higher. Once code is in the Trace Cache, the Pentium 4 is essentially a 3-issue-wide CPU. Hyperthreading also helps with latency hiding, and there the Core-microarchitecture-based CPUs are superior in every way: branch prediction, memory latency, cache, shorter pipeline. I would assume Netburst and Core can benefit equally, but if anything Netburst is still better suited to Intel's implementation of SMT.

The trace cache was capable of issuing three instructions per clock, but obviously it could only issue instructions that had already been decoded. As such, the average issue rate of Netburst is going to be far less. Certainly not three per clock!

Yeah, right. Intel said back when Willamette came out that a branch misprediction can cause a 40% performance penalty, and Hans de Vries from chip-architect says something similar. There are other parts of the P4 that are poorly designed, but misprediction was said to be the biggest problem, and the much-enhanced branch predictor isn't enough to overcome the extra pipeline stages.

If pipeline depth has such a massive impact on performance, then explain why the performance difference between Northwood and Prescott was negligible. A 55% increase in pipeline depth, and Prescott still performed within a few percent of Northwood. By your logic, Northwood should have hammered Prescott. The reason it didn't is that there is no direct relationship between IPC and pipeline depth. The maximum theoretical performance of any processor is determined by its execution engine and its frequency. Increasing pipeline depth does worsen the impact of pipeline flushes, but Prescott's sophisticated branch predictor did a good job of mitigating this - hence, again, Prescott performing similarly to Northwood.
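A back-of-the-envelope way to see how both sides of this can be partly right (all of the numbers below are illustrative assumptions, not measurements from the thread): the average misprediction cost per instruction scales with branch frequency, mispredict rate, and flush depth, so a sufficiently better predictor can offset a deeper pipeline.

/* Illustrative sketch: every figure here is an assumption chosen to show
 * the trade-off, not a measured number for Northwood or Prescott. */
#include <stdio.h>

int main(void)
{
    double branch_freq = 0.20;                       /* assume 1 in 5 instructions is a branch */

    /* Shallower pipeline, weaker predictor (Northwood-like, hypothetical) */
    double shallow = branch_freq * 0.05 * 20.0;      /* 5% mispredicts, ~20-cycle flush */

    /* Deeper pipeline, better predictor (Prescott-like, hypothetical) */
    double deep    = branch_freq * 0.03 * 31.0;      /* 3% mispredicts, ~31-cycle flush */

    printf("misprediction cost, cycles per instruction:\n");
    printf("  shallow pipeline, weaker predictor: %.3f\n", shallow);
    printf("  deeper pipeline, better predictor:  %.3f\n", deep);
    return 0;
}

With those assumed numbers, the two designs end up within a few percent of each other, which is roughly the Northwood-vs-Prescott picture being argued about here.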


Oh, right. And one of the fastest CPUs nowadays happens to be an x86 CPU, and that was true even before the Core 2 Duo. It was once thought by some that superscalar execution wasn't even possible on x86. A wider CPU with a supposedly better instruction set, like the G5, doesn't perform faster than the K7.

Your first observation is obviously because the two main players in processor design back the x86 ISA; Intel has been trying to push EPIC for years precisely because of the limitations of x86. Had Intel and AMD gone down the RISC route, the performance of processors today could well be significantly higher. There was a time when RISC processors easily outperformed x86. Now that the emphasis has shifted from ILP to TLP, and from frequency to core count, however, there is little reason for Intel or AMD to invest the necessary R&D costs to pursue RISC on the desktop.

Why do decoders exist? To get around the problems of the x86 instruction set. Nowadays, being x86 gives more room for performance increases rather than being a quirk (can you have a Trace Cache on a CPU with no decoders??).

For your sake I will pretend I didn't read that.

A significant number of improvements had to be made to the core to take better advantage of SMT. IBM says the gain would have been limited to about 20% if it weren't for the extra improvements, which is similar to what Intel can achieve with Hyperthreading on the server side.

That probably has more to do with the design features of the Power architecture than with RISC.

Here is a good article from IBM discussing SMT that you should read. It also talks about the problem of cache hit rate in multithreading, which is what I was referring to when I mentioned that one of the reasons Core is more suitable for SMT is its larger L1.

 

Hulk

Diamond Member
Oct 9, 1999
5,138
3,726
136
I'm not an expert here, but Prescott:

had a larger L1 and L2 cache
better branch predictor
core execution improvements
13 new SSE instructions

Even with those improvements, the AnandTech article comparing the 3.2GHz Northwood vs. Prescott shows Northwood winning most of the tests by a very small margin.

So it would seem that IF EVERYTHING ELSE REMAINS THE SAME, a longer pipeline will result in decreased performance.

If one says a longer pipeline makes no IPC difference but doesn't mention the other improvements in the core, then it becomes impossible to compare any one architectural change in a CPU, because you could offset that change with another. When you isolate the pipeline as one factor in the design, it appears as though longer means less IPC.


On another topic, I wonder if HT is already built into the C2D? Perhaps Intel is holding a card up its sleeve, just in case...
 

BitByBit

Senior member
Jan 2, 2005
474
2
81
Where did I say it made no difference to IPC? I said there was no direct relationship! If there were, then no amount of architectural improvements could compensate for a 55% dip in theoretical IPC. Increasing pipeline depth does decrease performance, but not proportionally.
Prescott did have larger caches, but their latencies increased dramatically: the L1 data cache went from 1 cycle to 4, and the L2 went from something like 14 cycles to ~20. The official reason for the increase in cache size was to better accommodate instructions from two threads.
 

Hulk

Diamond Member
Oct 9, 1999
5,138
3,726
136
"The reason it didn't is that there is no direct relationship between IPC and pipeline depth."