Originally posted by: Yoxxy
Also has to do with the pipeline length and efficiency of the Core microarchitecture. Because it is significantly more efficient than Netburst, almost all of the cycles are put to use. In the Netburst days you practically had to oversaturate the CPU to fully load it.
Originally posted by: SunnyD
Originally posted by: Yoxxy
Also has to do with the pipeline length and efficiency of the Core microarchitecture. Because it is significantly more efficient than Netburst, almost all of the cycles are put to use. In the Netburst days you practically had to oversaturate the CPU to fully load it.
Pretty much this is why - Netburst had a very... VERY deep pipeline. HT basically came about because a stall in the pipeline was where Netburst failed miserably: it had to be fed with data constantly, otherwise stalls introduced significant latencies. So Intel added HT as a way to "multitask" the deep pipeline and keep it doing work.
Core 2 has a much shorter pipeline and much better efficiency, so it has far less need for HT.
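The stall-filling argument above can be shown with a toy simulation. This is just a sketch with made-up numbers (a 30% per-cycle stall chance is purely illustrative, not a measured Netburst figure): each cycle a thread is either ready or stalled, and with SMT the core can issue from whichever thread is ready, so one thread's bubbles get filled by the other.

```python
import random

random.seed(0)

def utilization(n_threads, stall_prob, cycles=100_000):
    """Fraction of cycles in which at least one thread can issue work.

    Each cycle, every thread is independently stalled (e.g. waiting on
    a cache miss) with probability stall_prob. With SMT (n_threads=2),
    the core can pick any non-stalled thread, so bubbles from one
    thread are filled by the other.
    """
    busy = 0
    for _ in range(cycles):
        if any(random.random() >= stall_prob for _ in range(n_threads)):
            busy += 1
    return busy / cycles

print(f"1 thread : {utilization(1, 0.3):.2f}")  # ~0.70
print(f"2 threads: {utilization(2, 0.3):.2f}")  # ~0.91
```

The deeper the pipeline, the more cycles each stall costs, which is why this mattered so much more for Netburst than for Core.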
Exactly. SMT is a way of making sure that the execution units are utilized to their full extent in each cycle. The wider the architecture, the harder it is to extract enough independent instructions that can be scheduled and executed in the same clock cycle. SMT makes this easier by giving the schedulers two threads from which to choose instructions. The end result is higher IPC.
Originally posted by: jones377
SMT (HT) can also schedule instructions from two threads in the same clock cycle, so that argument doesn't fly. SMT benefits wide architectures, like Core. Why else would Intel add it back into Nehalem? Nehalem is going to keep the 4-issue width of C2D.
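The wide-issue argument can also be sketched numerically. This is a toy model, not a real scheduler: assume each thread exposes a random number of independent, ready instructions per cycle (uniformly 0-4, an arbitrary choice for illustration) and the core can issue up to 4 per cycle, like C2D/Nehalem's 4-wide issue. With SMT, the issue slots can be filled from both threads' pools.

```python
import random

random.seed(1)

ISSUE_WIDTH = 4  # 4-issue width, as with C2D / Nehalem

def avg_ipc(n_threads, cycles=100_000):
    """Average instructions issued per cycle on a 4-wide core.

    Each thread offers a random number of independent instructions
    (0-4) per cycle; the core issues as many as fit in the 4 slots,
    drawing from all threads' ready instructions.
    """
    total = 0
    for _ in range(cycles):
        ready = sum(random.randint(0, 4) for _ in range(n_threads))
        total += min(ready, ISSUE_WIDTH)
    return total / cycles

print(f"1 thread : {avg_ipc(1):.2f}")  # ~2.0
print(f"2 threads: {avg_ipc(2):.2f}")  # ~3.2
```

One thread often can't fill all 4 slots; a second thread's independent instructions soak up the leftover width, which is the "higher IPC" point made above.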
Originally posted by: evolucion8
But some architectural changes will probably be needed in Nehalem, because as it stands, C2D would not see performance improvements from HT.
Actually, HT was implemented in the Netbust (yeah, Netbust) architecture to help increase execution unit usage. The long pipelines of the P4 sat idle much of the time, and HT was meant to raise their utilization, but the P4 was never designed with SMT in mind, which is why the performance gains were minimal.
To make the P4 a better performer with SMT, it would have needed bigger and better L1 caches, more internal registers, a cache-coherency-aware branch predictor - so many changes at the architecture level that it simply wasn't worth the effort for such an inefficient, power-hungry architecture.
Since the Pentium M, Intel Core Duo, Core 2 Duo, etc., implementing HT has been possible, but it was never done because most of the time their execution units are already well fed with work; there simply aren't enough idle slots in the core for HT to exploit. HT is handy when the pipeline is sitting idle. HT also increases heat dissipation and power consumption.
In its "largest configuration," Nehalem will pack eight CPU cores onto a single die. Each of those cores will present the system with two logical processors and be able to execute two threads via simultaneous multithreading (SMT), a la HyperThreading. So a single Nehalem chip will be able to execute 16 threads at once.
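The thread count in that quote is just the core count times the SMT width; as a quick sanity check:

```python
cores = 8      # Nehalem's largest configuration, per the quote
smt_ways = 2   # two logical processors per core via SMT

# Enumerate every (core, thread) pair the OS would see as a logical CPU.
logical_cpus = [(core, thread) for core in range(cores)
                for thread in range(smt_ways)]
print(len(logical_cpus))  # 16 threads in flight on one chip
```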
Originally posted by: Yoxxy
Also has to do with the pipeline length and efficiency of the Core microarchitecture. Because it is significantly more efficient than Netburst, almost all of the cycles are put to use. In the Netburst days you practically had to oversaturate the CPU to fully load it.
Originally posted by: evolucion8
Yeah, since Willamette, and yes, they're idle most of the time. Tell me, do you know how many pipeline stages the Pentium 4 has? The older generation had 20, and Prescott and later have 31. Do you think HT would have been necessary if those pipelines weren't idle? Actually, HT increased heat dissipation and power consumption because it made the CPU work harder. Keeping such a long pipeline filled is nearly impossible: when a branch misprediction occurs, the CPU has to flush the whole pipeline and refill it, and most current programs have plenty of branchy code, jumps, and subroutines, which shows how inefficient the P4 is in those scenarios. The only apps that really benefit from such a long pipeline are things like media encoding, which are mostly linear and so can keep the pipeline full.
As far as I know, a trace cache is not a good fit for SMT because of its size and coherency issues, and what does latency have to do with SMT? Weird. The P4 would simply have needed to be wider than it is to show real performance improvements with SMT.
Since Nehalem is designed with SMT in mind, the performance gains will probably be good, but SMT will never outperform real cores: a dual core presenting itself as a virtual quad core will not outperform a real quad core. Remember that unless Intel modifies HT in some way, HT is just duplication of the register state, sharing the same execution units; it might be a good idea to build HT with duplicated execution units, hmm.
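The pipeline-depth point argued above can be put into a standard back-of-the-envelope CPI model. The workload numbers here are hypothetical (20% branches, 5% mispredict rate, base CPI of 0.5), and treating the flush penalty as roughly equal to the pipeline depth is a simplification, but it shows why going from ~20 to ~31 stages makes mispredictions hurt more:

```python
def effective_cpi(base_cpi, branch_freq, mispredict_rate, flush_penalty):
    """Average cycles per instruction once mispredict flushes are counted.

    flush_penalty is roughly the pipeline depth: a misprediction forces
    the front end to refill from the redirect point, wasting about that
    many cycles of work.
    """
    return base_cpi + branch_freq * mispredict_rate * flush_penalty

# Hypothetical workload: 20% branches, 5% of them mispredicted.
for name, depth in [("~20-stage pipeline", 20), ("~31-stage pipeline", 31)]:
    cpi = effective_cpi(0.5, 0.20, 0.05, depth)
    print(f"{name}: {cpi:.2f} CPI")  # 0.70 vs 0.81
```

The deeper pipeline pays a bigger flush tax on branchy code, which matches the complaint about the P4 above; linear workloads like media encoding mispredict far less, so they dodge most of the penalty.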
Originally posted by: dmens
Originally posted by: evolucion8
yeah, i know how many nominal pipestages are in the p4s, and they're not idle "most of the time", that is ridiculous. it isn't "impossible" to fill the machine. if it were, it wouldn't be using so much power.
so SMT uses slightly more power, sure, but like i said, it returns more perf for the power it uses, so who cares?
saying the P4 is inefficient because it handles "branchy code" poorly is pure ignorance, sorry.
Sorry, it seems you have more ignorance than me, if that's true. Why didn't you say something to prove I'm wrong? Just saying "is pure ignorance, sorry" is a noob move. Seems you don't have anything to say.
latency was referring to the cache, specifically the L1 cache. you originally referenced it as being important for SMT performance for some reason. weird. trace cache was big, but none of the issues you're raising are relevant.
You seem to have forgotten that the Pentium 4 doesn't improve its performance much with bigger caches. Extreme Editions, anyone? And if none of my stated issues are relevant, why don't you just address them? Seems you have nothing to say and are just ranting for no reason.
ah yes, "Real Quad Core". only someone who uses AMD market-fud-speak can have this kind of misconstrued interpretation of p4 smt.
Yeah, people like you, for example. A true "real quad core" has only a slight performance advantage over non-native quad cores, not something revolutionary like AMD claims, so it seems you're the one who fell for AMD's market-fud-speak, eh?
the whole point of SMT is to duplicate only the logic that is either absolutely necessary for functional correctness or a critical bottleneck. duplicate the execution units? why not just have two cores... oh yeah, that's right, double the power, unlike SMT. duh.
Duh, there are a lot of other things in the P4 that cause bottlenecks that bigger caches and SMT simply cannot solve completely. Go do some research on the P4 architecture before you post senseless words here and call me ignorant, duh!
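The cost/benefit trade-off behind that exchange is easy to lay out with numbers. These figures are entirely hypothetical (they only illustrate the shape of the argument, not measured data for any chip): SMT duplicates a small amount of state, so it adds a few percent of power for a moderate throughput gain, while a second full core roughly doubles power for a less-than-2x gain on typical workloads.

```python
# Hypothetical perf/power figures, normalized to one core without SMT.
options = {
    "1 core":       {"perf": 1.00, "power": 1.00},
    "1 core + SMT": {"perf": 1.25, "power": 1.07},  # small state duplicated
    "2 cores":      {"perf": 1.80, "power": 2.00},  # everything duplicated
}

for name, o in options.items():
    # Throughput per watt is the metric dmens's argument hinges on.
    print(f"{name:13s} perf/W = {o['perf'] / o['power']:.2f}")
```

Under numbers like these, SMT wins on performance per watt, which is the point of duplicating only the architectural state rather than the execution units.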
Originally posted by: dmens
nice attempt to change the topic. doesn't change the fact that everything you said about P4 SMT is confused and/or dead wrong.
if you're going to run your mouth on p4 (or any other design by anybody), at least spend some time to research the real weaknesses. unfortunately, the p4 is a complicated beast, might take you a while. just don't come back and say stuff like "p4 did SMT to make up for branch misprediction penalty on long pipes"
Originally posted by: evolucion8
I don't think that it would take that much to understand a CPU architecture
Originally posted by: evolucion8
Originally posted by: dmens
nice attempt to change the topic. doesn't change the fact that everything you said about P4 SMT is confused and/or dead wrong.
if you're going to run your mouth on p4 (or any other design by anybody), at least spend some time to research the real weaknesses. unfortunately, the p4 is a complicated beast, might take you a while. just don't come back and say stuff like "p4 did SMT to make up for branch misprediction penalty on long pipes"
But why can't you state the P4's weaknesses? I know it's a complicated beast. I've had three Pentium 4 CPUs and liked them all; even though the Athlon 64 outperformed them in many scenarios, I found the P4 more appealing because I do a lot of multitasking and media encoding, and that's where the P4 shines. Although now I've switched from my P4 EE to this Pentium M to cut power consumption and heat dissipation and to gain performance in most scenarios, especially gaming, where the P4 is far behind. Just stop being so biased towards one company; after all, Intel was far from its 10 GHz target when the P4 was introduced. Luckily, the P4 is not an ugly mistake like the GeForce FX :laugh: and offers enough performance for any current application.
I don't think it would take that much to understand a CPU architecture. I'd prefer GPU architectures; they are more interesting and more challenging.
Originally posted by: evolucion8
But why can't you state the P4's weaknesses? I know it's a complicated beast. I've had three Pentium 4 CPUs and liked them all; even though the Athlon 64 outperformed them in many scenarios, I found the P4 more appealing because I do a lot of multitasking and media encoding, and that's where the P4 shines. Although now I've switched from my P4 EE to this Pentium M to cut power consumption and heat dissipation and to gain performance in most scenarios, especially gaming, where the P4 is far behind. Just stop being so biased towards one company; after all, Intel was far from its 10 GHz target when the P4 was introduced. Luckily, the P4 is not an ugly mistake like the GeForce FX :laugh: and offers enough performance for any current application.
I don't think it would take that much to understand a CPU architecture. I'd prefer GPU architectures; they are more interesting and more challenging.
Originally posted by: evolucion8
I mean the basics. I don't mean designing a CPU architecture or anything, just understanding how it works with code and such; diving down to the transistor level is all Japanese to me. Anyway, I don't care what Phynaz, dmens, and TuxDave say; save the sarcasm for yourselves and keep moving, the party is over.
As an electronics engineer with a focus on digital systems design, I can tell you that the concepts behind modern high-performance CPUs are extremely advanced. Sure, the basics like pipelining and superscalar execution are easily learned, but going from that to actually understanding a modern design is not nearly as easy.
Originally posted by: evolucion8
I mean the basics. I don't mean designing a CPU architecture or anything, just understanding how it works with code and such; diving down to the transistor level is all Japanese to me. Anyway, I don't care what Phynaz, dmens, and TuxDave say; save the sarcasm for yourselves and keep moving, the party is over.
