Why is there no HT on Core 2 CPUs?

Mir96TA

Golden Member
Oct 21, 2002
1,950
37
91
It was a nice little trick, and it worked well.
Why did they take it out, or not add it to
the Core 2 CPUs?
 

nonameo

Diamond Member
Mar 13, 2006
5,902
2
76
Core 2 CPUs are already multithread capable. I doubt there would be much, if any, benefit from implementing HT.
 

The-Noid

Diamond Member
Nov 16, 2005
3,117
4
76
Hyperthreading was used on an older microarchitecture. Intel did not see much benefit on the Core microarchitecture.

Rumor has it coming back with the release of Nehalem. Only time will tell.
 

myocardia

Diamond Member
Jun 21, 2003
9,291
30
91
Because there's no reason to have HT when you have two actual cores. HyperThreading was nothing more than a way to trick SMP-enabled software into thinking there were two cores on a single-core processor. Oh, and the other reason is that it only helped with the Netburst architecture, because its pipeline was so long and convoluted. With the Core 2 Duo's shorter, wider (and more efficient) pipeline, supposedly it wouldn't provide any benefit even if they were to add it.
 

The-Noid

Diamond Member
Nov 16, 2005
3,117
4
76
It also has to do with the pipeline length and efficiency of the Core microarchitecture. Because it is significantly more efficient than Netburst, almost all of the cycles are already put to use. In the Netburst days you needed to almost oversaturate the CPU to fully load it.

 

SunnyD

Belgian Waffler
Jan 2, 2001
32,675
146
106
www.neftastic.com
Originally posted by: Yoxxy
It also has to do with the pipeline length and efficiency of the Core microarchitecture. Because it is significantly more efficient than Netburst, almost all of the cycles are already put to use. In the Netburst days you needed to almost oversaturate the CPU to fully load it.

Pretty much this is why. Netburst had a very... VERY deep pipeline, and a stall in that pipeline is where Netburst failed miserably. It had to be constantly fed with data; otherwise stalls introduced significant latencies. So Intel made HT as a way to "multitask" a deep pipeline and keep it doing work.

Core 2 has a much shorter pipeline and much better efficiency, with no need for HT.
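The stall-hiding idea can be sketched with a toy simulator (purely illustrative, not a model of any real pipeline): a single-issue core wastes every cycle its only thread is stalled, while a second SMT-style thread can issue work during those cycles.

```python
# Toy single-issue core. Each thread is a string of cycles:
# '.' = a ready instruction, 'S' = a stall cycle (e.g. a cache miss).
# Per cycle, every stalled thread's stall elapses, and at most one
# ready instruction from any thread is issued.
def run(threads):
    streams = [list(t) for t in threads]
    busy = total = 0
    while any(streams):
        total += 1
        issued = False
        for s in streams:
            if s and s[0] == 'S':
                s.pop(0)            # stall progresses in the background
            elif s and not issued:
                s.pop(0)            # issue one useful instruction
                issued = True
                busy += 1
    return busy, total

one = run(['..S..S..'])                 # alone: stalls waste cycles
two = run(['..S..S..', '..S..S..'])     # SMT-style: stalls overlap work
```

With these toy inputs, the lone thread keeps the core busy 6 of 8 cycles, while two interleaved copies finish 12 instructions in 13 cycles: the stall cycles get hidden behind the other thread's work, which is exactly the effect HT was after on Netburst.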
 

jones377

Senior member
May 2, 2004
463
64
91
Originally posted by: SunnyD
Originally posted by: Yoxxy
It also has to do with the pipeline length and efficiency of the Core microarchitecture. Because it is significantly more efficient than Netburst, almost all of the cycles are already put to use. In the Netburst days you needed to almost oversaturate the CPU to fully load it.

Pretty much this is why. Netburst had a very... VERY deep pipeline, and a stall in that pipeline is where Netburst failed miserably. It had to be constantly fed with data; otherwise stalls introduced significant latencies. So Intel made HT as a way to "multitask" a deep pipeline and keep it doing work.

Core 2 has a much shorter pipeline and much better efficiency, with no need for HT.

SMT (HT) can also schedule instructions from two threads in the same clock cycle, so that argument doesn't fly. SMT benefits wide architectures like Core. Why else would they add it back into Nehalem? Nehalem is going to keep the 4-issue width of the C2D.

 

Brunnis

Senior member
Nov 15, 2004
506
71
91
Originally posted by: jones377
SMT (HT) can also schedule instructions from two threads in the same clock cycle, so that argument doesn't fly. SMT benefits wide architectures like Core. Why else would they add it back into Nehalem? Nehalem is going to keep the 4-issue width of the C2D.
Exactly. SMT is a way of making sure that the execution units are utilized to their full extent in each cycle. The wider the architecture, the harder it is to extract enough independent instructions that can be scheduled and executed in the same clock cycle. SMT makes this easier by providing two threads for the schedulers to choose instructions from. The end result is higher IPC.

So SMT clearly has its benefits as designs keep getting wider. This is, as already pointed out, probably the reason why it seems to be returning in Nehalem.
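The width argument reduces to simple arithmetic (toy numbers, not measurements): on a 4-issue core, a thread that can only supply two or three independent instructions per cycle leaves slots empty, and a second thread can fill them.

```python
ISSUE_WIDTH = 4  # issue slots per cycle, as on a 4-wide core

def slots_used(independent_ops_per_thread, threads=1):
    """Toy model: each thread supplies that many independent
    instructions per cycle; the core issues at most ISSUE_WIDTH."""
    return min(ISSUE_WIDTH, independent_ops_per_thread * threads)

single = slots_used(2, threads=1)  # one thread fills 2 of 4 slots
smt    = slots_used(2, threads=2)  # two threads together fill all 4
```

This is why SMT pays off more on wider designs: the wider the machine, the more slots a single thread tends to leave empty, and the more a second thread has to fill.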
 

Mir96TA

Golden Member
Oct 21, 2002
1,950
37
91
To me, HT is like a hop-up mod for your hot rod or RC car.
I really think it would speed up the Core 2 processors.
 

evolucion8

Platinum Member
Jun 17, 2005
2,867
3
81
But then it would mean that Nehalem will be a completely different architecture, because currently the C2D would not see performance improvements from HT; its execution units are not left idle enough. HT was actually implemented in the Netbust (yeah, Netbust) architecture to help increase execution unit usage: the long pipelines on the P4 remained idle much of the time, and HT promoted higher usage of them. But the P4 was never created with SMT in mind, and that's why the performance increases were minimal.

In order to make the P4 a better performer with SMT, it would need bigger, better L1 caches, a higher number of internal registers, a cache coherency aware branch predictor... so many things at the architecture level that it simply wasn't worth the effort on such an inefficient, power-hungry architecture.

Since the Pentium M, Intel Core Duo, Core 2 Duo, etc., implementing HT has been possible, but it was never done because most of the time their execution units are already fully loaded with work, and there simply aren't enough idle slots in the core to make HT pay off; HT is handy when the pipelines are idle. HT also increases heat dissipation and power consumption.
 

dmens

Platinum Member
Mar 18, 2005
2,275
965
136
Originally posted by: evolucion8
But probably some architectural changes will happen in Nehalem, because currently the C2D would not see performance improvements from HT.

yes it would have. why else would nehalem have it?

HT was actually implemented in the Netbust (yeah, Netbust) architecture to help increase execution unit usage: the long pipelines on the P4 remained idle much of the time, and HT promoted higher usage of them. But the P4 was never created with SMT in mind, and that's why the performance increases were minimal.

- p4 had smt from the beginning
- the long pipelines aren't idle "most of the time". even without smt, those execution units were quite busy with work, possibly even more so than c2d. whether the work was useful, that's a different story.

In order to make the P4 a better performer with SMT, it would need bigger, better L1 caches, a higher number of internal registers, a cache coherency aware branch predictor... so many things at the architecture level that it simply wasn't worth the effort on such an inefficient, power-hungry architecture.

- with or without SMT, a bigger cache and more physical resources are always good for any uarch if latency and frequency are maintained, respectively. but that wouldn't make SMT any more effective or otherwise.
- cache coherency aware branch predictor what now?
- so many other things, such as?

Since the Pentium M, Intel Core Duo, Core 2 Duo, etc., implementing HT has been possible, but it was never done because most of the time their execution units are already fully loaded with work, and there simply aren't enough idle slots in the core to make HT pay off; HT is handy when the pipelines are idle. HT also increases heat dissipation and power consumption.

see above. it'll work, probably even better than on the p4. as for power consumption, on average the throughput gain is a higher percentage than the extra power it sucks up, so the change is a win.
 

sharad

Member
Apr 25, 2004
123
0
0
Intel's Nehalem will have HyperThreading.

http://www.techreport.com/discussions.x/13232

In its "largest configuration," Nehalem will pack eight CPU cores onto a single die. Each of those cores will present the system with two logical processors and be able to execute two threads via simultaneous multithreading (SMT), a la HyperThreading. So a single Nehalem chip will be able to execute 16 threads at once.
 

Zap

Elite Member
Oct 13, 1999
22,377
7
81
Things like Hyperthreading (virtual dual cores per real core) and virtual memory ("fake" RAM using HDD) are like sex. It's better if you don't have to fake it. :D
 

evolucion8

Platinum Member
Jun 17, 2005
2,867
3
81
- p4 had smt from the beginning
- the long pipelines aren't idle "most of the times". even without smt, those execution units were quite busy with work. possibly even more so than c2d. whether the work was useful, that's a different story.

Yeah, since Willamette. And yes, they're idle most of the time. Tell me, do you know how many pipeline stages the Pentium 4 has? The older generation had 20, and Prescott and above have 31. Do you think that if the pipelines weren't idle, HT would have been necessary? HT actually increased heat dissipation and power consumption because it made the CPU work harder. Filling those long pipelines is nearly impossible because they're so deep: when a branch misprediction occurs, the CPU has to flush the whole pipeline and refill it, and you know that most programs have a certain amount of branchy code, jumps, and subroutines, which shows how inefficient the P4 is in such scenarios. The only apps that can benefit from such long pipelines are things like media encoding, which are pretty much linear, and hence able to keep most of the pipeline filled.
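The flush cost can be put into rough numbers with the classic back-of-the-envelope formula (all figures below are illustrative assumptions, not measured rates): extra cycles per instruction lost to mispredictions is roughly branch frequency times misprediction rate times the refill penalty, and the penalty grows with pipeline depth.

```python
def mispredict_overhead(branch_freq, mispredict_rate, refill_penalty):
    """Approximate extra cycles per instruction lost to branch
    mispredictions: how often a branch occurs, times how often it is
    mispredicted, times the cycles needed to refill the pipeline."""
    return branch_freq * mispredict_rate * refill_penalty

# Hypothetical workload: 20% branches, 5% of them mispredicted,
# using stage counts as a stand-in for the real refill penalty.
p4_20_stage = mispredict_overhead(0.20, 0.05, 20)  # ~0.20 extra CPI
p4_31_stage = mispredict_overhead(0.20, 0.05, 31)  # ~0.31 extra CPI
```

Under these assumed rates, going from a 20-stage to a 31-stage pipe raises the misprediction tax proportionally, which is why branchy code hurt Prescott more than Northwood.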

- with or without SMT, bigger cache and more physical resources is always good for any uarch if latency and frequency are maintained respectively. but it wouldn't cause SMT to become any more effective or otherwise.
- cache coherency aware branch predictor what now?
- so many other things, such as?

As far as I know, a Trace Cache is not a good idea for implementing SMT because of its size and coherency issues. And what does latency have to do with SMT? Weird. The P4 just simply needed to be more scalar than it is now to show performance improvements with SMT.

see above. it'll work, probably even better than on the p4. as for power consumption, on average the throughput gain is a higher percentage than the extra power it sucks up, so the change is a win.

Since Nehalem is made with SMT in mind, the performance gains will probably be great, but it will never be able to outperform real cores: if it comes as a dual core acting as a virtual quad core, it will not outperform a real quad core. Remember that unless they modify HT in some way, HT is just register state duplication sharing the same execution units. It would be a good idea to make HT with duplicated execution units, hmm.

 

aigomorla

CPU, Cases&Cooling Mod PC Gaming Mod Elite Member
Super Moderator
Sep 28, 2005
21,081
3,583
126
Originally posted by: Yoxxy
It also has to do with the pipeline length and efficiency of the Core microarchitecture. Because it is significantly more efficient than Netburst, almost all of the cycles are already put to use. In the Netburst days you needed to almost oversaturate the CPU to fully load it.

+1
 

dmens

Platinum Member
Mar 18, 2005
2,275
965
136
Originally posted by: evolucion8
Yeah, since Willamette. And yes, they're idle most of the time. Tell me, do you know how many pipeline stages the Pentium 4 has? The older generation had 20, and Prescott and above have 31. Do you think that if the pipelines weren't idle, HT would have been necessary? HT actually increased heat dissipation and power consumption because it made the CPU work harder. Filling those long pipelines is nearly impossible because they're so deep: when a branch misprediction occurs, the CPU has to flush the whole pipeline and refill it, and you know that most programs have a certain amount of branchy code, jumps, and subroutines, which shows how inefficient the P4 is in such scenarios. The only apps that can benefit from such long pipelines are things like media encoding, which are pretty much linear, and hence able to keep most of the pipeline filled.

yeah, i know how many nominal pipestages are in the p4s, and they're not idle "most of the time"; that is ridiculous. it isn't "impossible" to fill the machine. if it were, it wouldn't be using so much power.

so SMT uses slightly more power, sure, but like i said, it returns more perf for the power it uses, so who cares?

saying the P4 is inefficient because it handles "branchy code" poorly is pure ignorance, sorry.

As far as I know, a Trace Cache is not a good idea for implementing SMT because of its size and coherency issues. And what does latency have to do with SMT? Weird. The P4 just simply needed to be more scalar than it is now to show performance improvements with SMT.

latency was referring to the cache, specifically the L1 cache. you originally referenced it as being important for SMT performance for some reason. weird. the trace cache was big, but none of the issues you're raising are relevant.

more scalar? you mean wider? easier said than done. but SMT still yields tangible returns on average, even on the P4, so what's the beef?

Since Nehalem is made with SMT in mind, the performance gains will probably be great, but it will never be able to outperform real cores: if it comes as a dual core acting as a virtual quad core, it will not outperform a real quad core. Remember that unless they modify HT in some way, HT is just register state duplication sharing the same execution units. It would be a good idea to make HT with duplicated execution units, hmm.

ah yes, "Real Quad Core". only someone who uses AMD market-fud-speak can have this kind of misconstrued interpretation of p4 smt.

the whole point of SMT is to duplicate only the logic that is either absolutely necessary for functional correctness or a critical bottleneck. duplicate the execution units? why not have two cores... oh yeah, that's right, double the power, unlike SMT. duh.
 

evolucion8

Platinum Member
Jun 17, 2005
2,867
3
81
Originally posted by: dmens
Originally posted by: evolucion8
yeah, i know how many nominal pipestages are in the p4s, and they're not idle "most of the time"; that is ridiculous. it isn't "impossible" to fill the machine. if it were, it wouldn't be using so much power.

so SMT uses slightly more power, sure, but like i said, it returns more perf for the power it uses, so who cares?

saying the P4 is inefficient because it handles "branchy code" poorly is pure ignorance, sorry.

Sorry, but it seems you're the one with more ignorance than me, if that's true. Why didn't you say something to prove I'm wrong? Just writing "is pure ignorance, sorry" is a noobie thing. It seems you don't have anything to say.

latency was referring to the cache, specifically the L1 cache. you originally referenced it as being important for SMT performance for some reason. weird. trace cache was big, but none of the issues you're raising are relevant.

It seems you forgot that the Pentium 4 doesn't improve its performance greatly with bigger caches. Extreme Editions, anyone? And if none of my stated issues are relevant, why don't you just refute them? It seems you don't have anything to say and are just ranting for no reason.


ah yes, "Real Quad Core". only someone who uses AMD market-fud-speak can have this kind of misconstrued interpretation of p4 smt.

Yeah, people like you, for example. A true real quad core has only a slight performance advantage over non-native quad cores, not something revolutionary like AMD states, so it seems you're the one who fell for AMD's market-FUD-speak, eh?

the whole point of SMT is to duplicate only the logic that are either absolutely necessary for functional correctness, or critical bottlenecks. duplicate the execution units? why not have two cores... oh yeah, that's right, double the power, unlike SMT. duh.

Duh, there are a lot more things in the P4 that cause bottlenecks, things that bigger caches and SMT simply cannot solve completely. Go do some research on the P4 architecture before you post senseless words here and call me ignorant, duh!
 

dmens

Platinum Member
Mar 18, 2005
2,275
965
136
nice attempt to change the topic. it doesn't change the fact that everything you said about P4 SMT is confused and/or dead wrong.

if you're going to run your mouth about the p4 (or any other design by anybody), at least spend some time researching its real weaknesses. unfortunately, the p4 is a complicated beast, so it might take you a while. just don't come back and say stuff like "p4 did SMT to make up for the branch misprediction penalty on long pipes".
 

evolucion8

Platinum Member
Jun 17, 2005
2,867
3
81
Originally posted by: dmens
nice attempt to change the topic. it doesn't change the fact that everything you said about P4 SMT is confused and/or dead wrong.

if you're going to run your mouth about the p4 (or any other design by anybody), at least spend some time researching its real weaknesses. unfortunately, the p4 is a complicated beast, so it might take you a while. just don't come back and say stuff like "p4 did SMT to make up for the branch misprediction penalty on long pipes".

But why can't you state the P4's weaknesses? I know it's a complicated beast. I've had three Pentium 4 CPUs and I liked them all; even though they got outperformed by the Athlon 64 in many scenarios, I found the P4 more appealing because I do a lot of multitasking and media encoding, and that's where the P4 shines. Although now I've switched from my P4 EE to this Pentium M to decrease power consumption and heat dissipation and increase performance in most scenarios, especially gaming, where the P4 is far behind. Just stop being so biased towards a company; after all, Intel was far from its 10 GHz target when the P4 was introduced. Luckily the P4 is not an ugly mistake like the GeForce FX :laugh: and is able to offer enough performance for any current application.

I don't think it would take that much to understand a CPU architecture. I'd prefer GPU architectures; they are more interesting and more challenging.
 

TuxDave

Lifer
Oct 8, 2002
10,571
3
71
Originally posted by: evolucion8

I don't think it would take that much to understand a CPU architecture

Damn, if that's the case you should send Intel your resume, because why pay the uArch experts so much money when you can easily take a look and have it all figured out? I bet validation would love to have you around too.

 

Phynaz

Lifer
Mar 13, 2006
10,140
819
126
Originally posted by: evolucion8
Originally posted by: dmens
nice attempt to change the topic. doesn't change the fact that everything you said about P4 SMT is confused and/or dead wrong.

if you're going to run your mouth on p4 (or any other design by anybody), at least spend some time to research the real weaknesses. unfortunately, the p4 is a complicated beast, might take you a while. just don't come back and say stuff like "p4 did SMT to make up for branch misprediction penalty on long pipes"

But why can't you state the P4's weaknesses? I know it's a complicated beast. I've had three Pentium 4 CPUs and I liked them all; even though they got outperformed by the Athlon 64 in many scenarios, I found the P4 more appealing because I do a lot of multitasking and media encoding, and that's where the P4 shines. Although now I've switched from my P4 EE to this Pentium M to decrease power consumption and heat dissipation and increase performance in most scenarios, especially gaming, where the P4 is far behind. Just stop being so biased towards a company; after all, Intel was far from its 10 GHz target when the P4 was introduced. Luckily the P4 is not an ugly mistake like the GeForce FX :laugh: and is able to offer enough performance for any current application.

I don't think it would take that much to understand a CPU architecture. I'd prefer GPU architectures; they are more interesting and more challenging.

Wow, just think of the companies like Intel, AMD, and IBM that spend billions of dollars developing CPU architectures. They must all be doing something wrong, since it's so simple!

 

dmens

Platinum Member
Mar 18, 2005
2,275
965
136
Originally posted by: evolucion8
But why can't you state the P4's weaknesses? I know it's a complicated beast. I've had three Pentium 4 CPUs and I liked them all; even though they got outperformed by the Athlon 64 in many scenarios, I found the P4 more appealing because I do a lot of multitasking and media encoding, and that's where the P4 shines. Although now I've switched from my P4 EE to this Pentium M to decrease power consumption and heat dissipation and increase performance in most scenarios, especially gaming, where the P4 is far behind. Just stop being so biased towards a company; after all, Intel was far from its 10 GHz target when the P4 was introduced. Luckily the P4 is not an ugly mistake like the GeForce FX :laugh: and is able to offer enough performance for any current application.

I don't think it would take that much to understand a CPU architecture. I'd prefer GPU architectures; they are more interesting and more challenging.

damn, i wasted all those years in school and at work learning something oh-so-simple. guess i'd better quit and become a sewer diver or something.
 

evolucion8

Platinum Member
Jun 17, 2005
2,867
3
81
I mean the basics. I don't mean creating a CPU architecture or something, just understanding how they work with code and such; diving down to the transistor level is just plain Japanese to me. Anyway, I don't care what Phynaz, dmens, and TuxDave say. Save the sarcasm for yourselves and keep moving; the party is over. :cool:
 

TuxDave

Lifer
Oct 8, 2002
10,571
3
71
Originally posted by: evolucion8
I mean the basics. I don't mean creating a CPU architecture or something, just understanding how they work with code and such; diving down to the transistor level is just plain Japanese to me. Anyway, I don't care what Phynaz, dmens, and TuxDave say. Save the sarcasm for yourselves and keep moving; the party is over. :cool:

uArch experts don't need to understand things down to the transistor level. I just found your last comment about understanding CPU architecture a little juvenile, and as an EE, it irked me enough to jab you about it. ;)

But on a more productive note, the Pentium 4 was an overly complicated beast on a much deeper level than anyone can grasp from just buying a chip. One uArch expert even commented that understanding the entire instruction flow and its various interactions was too much to fit in one brain.
 

Brunnis

Senior member
Nov 15, 2004
506
71
91
Originally posted by: evolucion8
I mean the basics, I don't mean like creating a CPU architecture or something, just to understand how they work with code and stuff, diving inside of the transistor level is just plain japanese to me, anyways I don't care about what Phynaz, dmens and TuxDave says, just save the sarcasm for yourself and keep moving, the party is over. :cool:
As an electronics engineer with a focus on digital systems design, I can tell you that the concepts behind modern high-performance CPUs are extremely advanced. Sure, the basics like pipelining and superscalar execution are pretty easily learned, but going from that to actually understanding a modern design is not so easy.