Originally posted by: Yoxxy
Also has to do with the pipeline length and efficiency of the Core microarchitecture. Because it is significantly more efficient than Netburst, almost all of the cycles are put to use. In the Netburst days you practically had to oversaturate the CPU to fully load it.
Originally posted by: SunnyD
Originally posted by: Yoxxy
Also has to do with the pipeline length and efficiency of the Core microarchitecture. Because it is significantly more efficient than Netburst, almost all of the cycles are put to use. In the Netburst days you practically had to oversaturate the CPU to fully load it.
Pretty much this is why - Netburst had a very... VERY deep pipeline. HT basically came about because a stall in the pipeline was where Netburst failed miserably: it had to be fed with data constantly, otherwise stalls introduced significant latencies. So Intel added HT as a way to "multitask" the deep pipeline and keep it doing work.
Core 2 has a much shorter pipeline and much better efficiency, so it has far less need for HT.
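The stall-filling argument above can be shown with a toy simulation. This is just a sketch with made-up numbers (a 30% per-cycle stall chance is purely illustrative, not a measured Netburst figure): each cycle a thread is either ready or stalled, and with SMT the core can issue from whichever thread is ready, so one thread's bubbles get filled by the other.

```python
import random

random.seed(0)

def utilization(n_threads, stall_prob, cycles=100_000):
    """Fraction of cycles in which at least one thread can issue work.

    Each cycle, every thread is independently stalled (e.g. waiting on
    a cache miss) with probability stall_prob. With SMT (n_threads=2),
    the core can pick any non-stalled thread, so bubbles from one
    thread are filled by the other.
    """
    busy = 0
    for _ in range(cycles):
        if any(random.random() >= stall_prob for _ in range(n_threads)):
            busy += 1
    return busy / cycles

print(f"1 thread : {utilization(1, 0.3):.2f}")  # ~0.70
print(f"2 threads: {utilization(2, 0.3):.2f}")  # ~0.91
```

The deeper the pipeline, the more cycles each stall costs, which is why this mattered so much more for Netburst than for Core.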
Exactly. SMT is a way of making sure that the execution units are utilized to their full extent in each cycle. The wider the architecture, the harder it is to extract enough independent instructions that can be scheduled and executed in the same clock cycle. SMT makes this easier by giving the schedulers two threads from which to choose instructions. The end result is higher IPC.
Originally posted by: jones377
SMT (HT) can also schedule instructions from two threads in the same clock cycle, so that argument doesn't fly. SMT benefits wide architectures, like Core. Why else would Intel add it back into Nehalem? Nehalem is going to keep the 4-issue width of C2D.
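The wide-issue argument can also be sketched numerically. This is a toy model, not a real scheduler: assume each thread exposes a random number of independent, ready instructions per cycle (uniformly 0-4, an arbitrary choice for illustration) and the core can issue up to 4 per cycle, like C2D/Nehalem's 4-wide issue. With SMT, the issue slots can be filled from both threads' pools.

```python
import random

random.seed(1)

ISSUE_WIDTH = 4  # 4-issue width, as with C2D / Nehalem

def avg_ipc(n_threads, cycles=100_000):
    """Average instructions issued per cycle on a 4-wide core.

    Each thread offers a random number of independent instructions
    (0-4) per cycle; the core issues as many as fit in the 4 slots,
    drawing from all threads' ready instructions.
    """
    total = 0
    for _ in range(cycles):
        ready = sum(random.randint(0, 4) for _ in range(n_threads))
        total += min(ready, ISSUE_WIDTH)
    return total / cycles

print(f"1 thread : {avg_ipc(1):.2f}")  # ~2.0
print(f"2 threads: {avg_ipc(2):.2f}")  # ~3.2
```

One thread often can't fill all 4 slots; a second thread's independent instructions soak up the leftover width, which is the "higher IPC" point made above.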
Originally posted by: evolucion8
But some architectural changes will probably be needed in Nehalem, because as it stands, C2D would not see performance improvements from HT.
Actually, HT was implemented in the Netbust (yeah, Netbust) architecture to help increase execution unit usage. The long pipelines of the P4 sat idle much of the time, and HT was meant to raise their utilization, but the P4 was never designed with SMT in mind, which is why the performance gains were minimal.
To make the P4 a better performer with SMT, it would have needed bigger and better L1 caches, more internal registers, a cache-coherency-aware branch predictor - so many changes at the architecture level that it simply wasn't worth the effort for such an inefficient, power-hungry architecture.
Since the Pentium M, Intel Core Duo, Core 2 Duo, etc., implementing HT has been possible, but it was never done because most of the time their execution units are already well fed with work; there simply aren't enough idle slots in the core for HT to exploit. HT is handy when the pipeline is sitting idle. HT also increases heat dissipation and power consumption.
In its "largest configuration," Nehalem will pack eight CPU cores onto a single die. Each of those cores will present the system with two logical processors and be able to execute two threads via simultaneous multithreading (SMT), a la HyperThreading. So a single Nehalem chip will be able to execute 16 threads at once.
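The thread count in that quote is just the core count times the SMT width; as a quick sanity check:

```python
cores = 8      # Nehalem's largest configuration, per the quote
smt_ways = 2   # two logical processors per core via SMT

# Enumerate every (core, thread) pair the OS would see as a logical CPU.
logical_cpus = [(core, thread) for core in range(cores)
                for thread in range(smt_ways)]
print(len(logical_cpus))  # 16 threads in flight on one chip
```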
Originally posted by: Yoxxy
Also has to do with the pipeline length and efficiency of the Core microarchitecture. Because it is significantly more efficient than Netburst, almost all of the cycles are put to use. In the Netburst days you practically had to oversaturate the CPU to fully load it.
Originally posted by: evolucion8
Yeah, since Willamette, and yes, they're idle most of the time. Tell me, do you know how many pipeline stages the Pentium 4 has? The older generation had 20, and Prescott and later have 31. Do you think HT would have been necessary if those pipelines weren't idle? Actually, HT increased heat dissipation and power consumption because it made the CPU work harder. Keeping such a long pipeline filled is nearly impossible: when a branch misprediction occurs, the CPU has to flush the whole pipeline and refill it, and most current programs have plenty of branchy code, jumps, and subroutines, which shows how inefficient the P4 is in those scenarios. The only apps that really benefit from such a long pipeline are things like media encoding, which are mostly linear and so can keep the pipeline full.
As far as I know, a trace cache is not a good fit for SMT because of its size and coherency issues, and what does latency have to do with SMT? Weird. The P4 would simply have needed to be wider than it is to show real performance improvements with SMT.
Since Nehalem is designed with SMT in mind, the performance gains will probably be good, but SMT will never outperform real cores: a dual core presenting itself as a virtual quad core will not outperform a real quad core. Remember that unless Intel modifies HT in some way, HT is just duplication of the register state, sharing the same execution units; it might be a good idea to build HT with duplicated execution units, hmm.
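The pipeline-depth point argued above can be put into a standard back-of-the-envelope CPI model. The workload numbers here are hypothetical (20% branches, 5% mispredict rate, base CPI of 0.5), and treating the flush penalty as roughly equal to the pipeline depth is a simplification, but it shows why going from ~20 to ~31 stages makes mispredictions hurt more:

```python
def effective_cpi(base_cpi, branch_freq, mispredict_rate, flush_penalty):
    """Average cycles per instruction once mispredict flushes are counted.

    flush_penalty is roughly the pipeline depth: a misprediction forces
    the front end to refill from the redirect point, wasting about that
    many cycles of work.
    """
    return base_cpi + branch_freq * mispredict_rate * flush_penalty

# Hypothetical workload: 20% branches, 5% of them mispredicted.
for name, depth in [("~20-stage pipeline", 20), ("~31-stage pipeline", 31)]:
    cpi = effective_cpi(0.5, 0.20, 0.05, depth)
    print(f"{name}: {cpi:.2f} CPI")  # 0.70 vs 0.81
```

The deeper pipeline pays a bigger flush tax on branchy code, which matches the complaint about the P4 above; linear workloads like media encoding mispredict far less, so they dodge most of the penalty.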
Originally posted by: dmens
Originally posted by: evolucion8
yeah, i know how many nominal pipestages are in the p4s, and they're not idle "most of the time", that is ridiculous. it isn't "impossible" to fill the machine. if it were, it wouldn't be using so much power.
so SMT uses slightly more power, sure, but like i said, it returns more perf for the power it uses, so who cares?
saying the P4 is inefficient because it handles "branchy code" poorly is pure ignorance, sorry.
Sorry, it seems you have more ignorance than me, if that's true. Why didn't you say something to prove I'm wrong? Just saying "is pure ignorance, sorry" is a noob move. Seems you don't have anything to say.
latency was referring to the cache, specifically the L1 cache. you originally referenced it as being important for SMT performance for some reason. weird. trace cache was big, but none of the issues you're raising are relevant.
You seem to have forgotten that the Pentium 4 doesn't improve its performance much with bigger caches. Extreme Editions, anyone? And if none of my stated issues are relevant, why don't you just address them? Seems you have nothing to say and are just ranting for no reason.
ah yes, "Real Quad Core". only someone who uses AMD market-fud-speak can have this kind of misconstrued interpretation of p4 smt.
Yeah, people like you, for example. A true "real quad core" has only a slight performance advantage over non-native quad cores, not something revolutionary like AMD claims, so it seems you're the one who fell for AMD's market-fud-speak, eh?
the whole point of SMT is to duplicate only the logic that is either absolutely necessary for functional correctness or a critical bottleneck. duplicate the execution units? why not just have two cores... oh yeah, that's right, double the power, unlike SMT. duh.
Duh, there are a lot of other things in the P4 that cause bottlenecks that bigger caches and SMT simply cannot solve completely. Go do some research on the P4 architecture before you post senseless words here and call me ignorant, duh!
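The cost/benefit trade-off behind that exchange is easy to lay out with numbers. These figures are entirely hypothetical (they only illustrate the shape of the argument, not measured data for any chip): SMT duplicates a small amount of state, so it adds a few percent of power for a moderate throughput gain, while a second full core roughly doubles power for a less-than-2x gain on typical workloads.

```python
# Hypothetical perf/power figures, normalized to one core without SMT.
options = {
    "1 core":       {"perf": 1.00, "power": 1.00},
    "1 core + SMT": {"perf": 1.25, "power": 1.07},  # small state duplicated
    "2 cores":      {"perf": 1.80, "power": 2.00},  # everything duplicated
}

for name, o in options.items():
    # Throughput per watt is the metric dmens's argument hinges on.
    print(f"{name:13s} perf/W = {o['perf'] / o['power']:.2f}")
```

Under numbers like these, SMT wins on performance per watt, which is the point of duplicating only the architectural state rather than the execution units.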
Originally posted by: dmens
nice attempt to change the topic. doesn't change the fact that everything you said about P4 SMT is confused and/or dead wrong.
if you're going to run your mouth on p4 (or any other design by anybody), at least spend some time to research the real weaknesses. unfortunately, the p4 is a complicated beast, might take you a while. just don't come back and say stuff like "p4 did SMT to make up for branch misprediction penalty on long pipes"
Originally posted by: evolucion8
I don't think that it would take that much to understand a CPU architecture
Originally posted by: evolucion8
Originally posted by: dmens
nice attempt to change the topic. doesn't change the fact that everything you said about P4 SMT is confused and/or dead wrong.
if you're going to run your mouth on p4 (or any other design by anybody), at least spend some time to research the real weaknesses. unfortunately, the p4 is a complicated beast, might take you a while. just don't come back and say stuff like "p4 did SMT to make up for branch misprediction penalty on long pipes"
But why can't you state the P4's weaknesses? I know it's a complicated beast. I've had three Pentium 4 CPUs and liked them all; even though the Athlon 64 outperformed them in many scenarios, I found the P4 more appealing because I do a lot of multitasking and media encoding, and that's where the P4 shines. Although now I've switched from my P4 EE to this Pentium M to cut power consumption and heat dissipation and to gain performance in most scenarios, especially gaming, where the P4 is far behind. Just stop being so biased towards one company; after all, Intel was far from its 10 GHz target when the P4 was introduced. Luckily, the P4 is not an ugly mistake like the GeForce FX :laugh: and offers enough performance for any current application.
I don't think it would take that much to understand a CPU architecture. I'd prefer GPU architectures; they are more interesting and more challenging.
Originally posted by: evolucion8
But why can't you state the P4's weaknesses? I know it's a complicated beast. I've had three Pentium 4 CPUs and liked them all; even though the Athlon 64 outperformed them in many scenarios, I found the P4 more appealing because I do a lot of multitasking and media encoding, and that's where the P4 shines. Although now I've switched from my P4 EE to this Pentium M to cut power consumption and heat dissipation and to gain performance in most scenarios, especially gaming, where the P4 is far behind. Just stop being so biased towards one company; after all, Intel was far from its 10 GHz target when the P4 was introduced. Luckily, the P4 is not an ugly mistake like the GeForce FX :laugh: and offers enough performance for any current application.
I don't think it would take that much to understand a CPU architecture. I'd prefer GPU architectures; they are more interesting and more challenging.
Originally posted by: evolucion8
I mean the basics. I don't mean designing a CPU architecture or anything, just understanding how it works with code and such; diving down to the transistor level is all Japanese to me. Anyway, I don't care what Phynaz, dmens, and TuxDave say; save the sarcasm for yourselves and keep moving, the party is over.
As an electronics engineer with a focus on digital systems design, I can tell you that the concepts behind modern high-performance CPUs are extremely advanced. Sure, the basics like pipelining and superscalar execution are easily learned, but going from that to actually understanding a modern design is not nearly as easy.
Originally posted by: evolucion8
I mean the basics. I don't mean designing a CPU architecture or anything, just understanding how it works with code and such; diving down to the transistor level is all Japanese to me. Anyway, I don't care what Phynaz, dmens, and TuxDave say; save the sarcasm for yourselves and keep moving, the party is over.
