Originally posted by: zsdersw
Prescott's branch predictor isn't "better" than that of Conroe. It's been my understanding that Prescott's branch predictor was the basis for the branch predictor of Conroe. And I think it was also present in Yonah.
Trace cache was capable of issuing three instructions per clock, but obviously could only issue instructions that had already been decoded. As such, the average issue rate of Netburst is going to be far less. Certainly not three per clock!
A 55% increase in pipeline depth and Prescott still performed within a few percent of Northwood. By your logic, Northwood should have hammered Prescott. The reason it didn't is that there is no direct relationship between IPC and pipeline depth. The maximum theoretical performance of any processor is determined by its execution engine and its frequency.
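To see why, here's a rough back-of-the-envelope model (a sketch with made-up numbers, not measurements of any real core): the only place pipeline depth enters is the branch-mispredict flush penalty.

# Toy model: deeper pipelines hurt IPC only through the mispredict flush
# penalty, which scales roughly with depth. All numbers are hypothetical.

def effective_ipc(base_ipc, mispredicts_per_insn, pipeline_depth):
    base_cpi = 1.0 / base_ipc                          # cycles/insn with no stalls
    flush_cpi = mispredicts_per_insn * pipeline_depth  # cycles lost to flushes
    return 1.0 / (base_cpi + flush_cpi)

for depth, name in [(20, "Northwood-like"), (31, "Prescott-like")]:
    ipc = effective_ipc(base_ipc=1.5, mispredicts_per_insn=0.01,
                        pipeline_depth=depth)
    print(f"{name:14s} {depth} stages -> IPC ~ {ipc:.2f}")

With those made-up numbers, the 55% deeper pipeline costs around 10% IPC, not 35%, and a better branch predictor (a lower mispredict rate) shrinks the gap further.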
Had Intel and AMD gone down the RISC route, the performance of processors today could well be significantly higher.
For your sake I will pretend I didn't read that.
That probably has more to do with the design features of the Power architecture than with RISC.
Here is a good article from IBM discussing SMT that you should read. It also talks about the problem of cache hit rate in multithreading, which is what I was referring to when I said one of the reasons Core is more suitable for SMT is its larger L1.
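Rough intuition for the cache-hit-rate point (a toy sketch with hypothetical cache sizes and working sets, not data from any real chip): with SMT, two threads share one L1, so a bigger L1 loses less.

# Toy model: hit rate ~ fraction of the working set that fits in the cache.
# Cache sizes and the per-thread working set below are made up for illustration.

def hit_rate(cache_kb, working_set_kb):
    return min(1.0, cache_kb / working_set_kb)

PER_THREAD_WS = 24  # hypothetical working set per thread, in KB

for l1 in (16, 32, 64):
    one = hit_rate(l1, PER_THREAD_WS)         # cache holds one thread's data
    two = hit_rate(l1, 2 * PER_THREAD_WS)     # two SMT threads share the cache
    print(f"L1={l1:2d}KB  1 thread: {one:.2f}   2 threads: {two:.2f}")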
Originally posted by: IntelUser2000
Far less?? Like 2?? The performance penalty of 2-issue vs. 3-issue is minor when most code doesn't reach 3 IPC anyway, plus the Trace Cache can store already-decoded instructions. The 20-stage pipeline doesn't even include the decode stages. The main topic here is HT, which on the P4 works to hide latency.
Prescott had bigger enhancements to cancel out the effects of its 55% longer pipeline than Willamette had for its 100% longer pipeline.
Maximum theoretical performance is determined by the execution engine and frequency, but CPUs don't reach their maximum theoretical performance, and it tells you nothing about real-world performance. If it did, we would have seen yearly increases in issue width, just as we see graphics pipelines grow with every new generation.
Instead, CPUs are limited by things like the memory subsystem. Every generation that widened the issue rate also came with techniques to take advantage of it: the Pentium Pro got out-of-order execution along with its 3-issue width. Pure execution improvements without OOE would have severely limited the performance advantage. HT on the P4 works to hide latency, which was always a bigger limitation than issue rate.
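To make "HT hides latency" concrete, here's a toy model (hypothetical miss rate and memory latency, nothing measured on a real P4):

# Toy model: fraction of issue slots used with 1 vs 2 hardware threads.
# Assumes a 1-wide core for simplicity; each cache miss stalls its thread.

MISS_RATE = 0.05    # hypothetical fraction of instructions that miss the cache
MISS_CYCLES = 100   # hypothetical stall, in cycles, per miss

def slot_utilization(threads):
    cycles_per_insn = 1 + MISS_RATE * MISS_CYCLES   # average cost per instruction
    # While one thread waits on memory, another thread can use the issue slots.
    return min(1.0, threads / cycles_per_insn)

print("1 thread :", round(slot_utilization(1), 2))   # ~0.17
print("2 threads:", round(slot_utilization(2), 2))   # ~0.33, latency partly hidden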
We do not know that. The only RISC CPU that significantly outperformed x86 CPUs was the Alpha, and it was clocked over 2x higher than its x86 counterparts.
The Power 4/5 example didn't have anything to do with RISC; it was about a wide CPU, which would not have had a greater advantage from SMT than a much narrower P4 without the enhancements made to support SMT better.
Originally posted by: Hulk
Duvie,
I know that there are lots of apps out there right now that don't utilize 4 cores but I still want one!
I can be encoding video in one app while still efficiently editing in another. Or running a number of applications simultaneously. And I have a feeling now that the Quads have entered the marketplace that we'll be seeing more and more support.
Plus it's the whiners like us that keep pushing the software developers to optimize the code!
Originally posted by: Duvie
Originally posted by: Hulk
Duvie,
I know that there are lots of apps out there right now that don't utilize 4 cores but I still want one!
I can be encoding video in one app while still efficiently editing in another. Or running a number of applications simultaneously. And I have a feeling now that the Quads have entered the marketplace that we'll be seeing more and more support.
Plus it's the whiners like us that keep pushing the software developers to optimize the code!
Well then you need to whine louder, because you are not getting much done...
Most of the apps that are multithreaded today in the area of video encoding were already pretty SMP-aware back then... And what has whining accomplished in gaming?
Multitasking will be fine, but HT will not be the equivalent of another core anyway. I bet it won't even garner the ~20% performance increase that the rather inefficient P4 Netburst HT did.
The only way I can truly use 100% of 4 cores is by running 4 instances of F@H. No program (non-benchmark) has been able to tax all 4 cores at 100%. I/O limitations are getting extreme.
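For what it's worth, the simplest way to load all four cores is still independent processes, which is exactly what four F@H instances are. A minimal sketch (pure CPU busywork, no I/O):

# Minimal sketch: keep 4 cores busy with independent CPU-bound processes,
# analogous to running 4 separate F@H instances. The work is just busywork.
from multiprocessing import Pool

def spin(n):
    total = 0
    for i in range(n):
        total += i * i          # pointless arithmetic to occupy one core
    return total

if __name__ == "__main__":
    with Pool(processes=4) as pool:                  # one worker per core
        results = pool.map(spin, [10_000_000] * 4)
    print(len(results), "workers finished")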
The fact that the Pentium 4 has the biggest reorder buffer tells you it's probably not limited by decode rate as much as you think. None of the articles I have seen show the single decoder as a big limitation; this is the first time I'm hearing that it might be. The P4 has many other flaws that are bigger than the single decoder (poor FPU, fast ALUs that are limited by Trace Cache throughput and even then can't execute all ALU instructions, long pipeline stages).

So you're saying reducing the issue rate from 3 to 2 will cause a minor performance hit? The average execution rate may be 1 to 2 instructions per clock, but in order to achieve that, more instructions must be loaded into the reorder buffer to ensure there are instructions to execute each cycle. If the decode rate was of 'minor' importance, Core would not have 4 decoders.
First of all, your math is wrong: a 55% increase doesn't equal a 55% drop. (Perhaps you did it on purpose to be sarcastic, but I didn't get any hint that it was.) 20/31 is a 35% drop (OK, that is like trying to find out the fps you get in a game by dividing your video card's fillrate by your resolution, which is incorrect).

Assuming that you deem IPC and pipeline depth to be directly related, you're saying that Prescott's few enhancements compensated for a 55% dip in theoretical performance?
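For reference, the arithmetic behind the 55% and 35% figures above (easy to mix up because the baselines differ):

# Northwood ~20 pipeline stages, Prescott ~31.
northwood, prescott = 20, 31

increase = (prescott - northwood) / northwood   # 20 -> 31 stages
decrease = (prescott - northwood) / prescott    # 31 -> 20 stages

print(f"{increase:.0%} deeper going up")        # 55%
print(f"{decrease:.0%} shallower coming back")  # 35%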
If memory serves, Alpha also featured a significantly higher IPC than any of its contemporary processors.
I am not saying the P4 will gain more than Core/Power 5. But with the current implementation of SMT, i.e. HT, it will gain as much as (or more than) either of the two. The next "HT" is supposed to be more advanced, making this point moot.

Forgive me if I don't take your word for that. The general consensus among people who know what they're talking about on this subject is that processors with greater execution resources stand to gain more from SMT than those with fewer, your theory on the purpose of Hyperthreading notwithstanding.
1. Of course designers are going to ensure their processor can properly take advantage of wider execution resources. You've pretty much reiterated my own point about the limitations of the x86 ISA. Cheers.
Originally posted by: IntelUser2000
For your sake I will pretend I didn't read that.
Why?? Why would you say that?? The purpose of the decoder in current x86 CPUs is to break down complex x86 instructions into simpler, RISC-like instructions, which supposedly increases ILP and thus performance.
The P4's and the 970's fetch and decode pipeline phases are similar in one very important respect: both processors break down instructions in their native ISA's format into a smaller, simpler format for use inside the CPU. The P4 breaks down each x86 CISC instruction into smaller micro-ops (or "uops"), which more or less resemble the instructions on a RISC machine. Most x86 instructions decode into 2 or 3 uops, but some of the longer, more complex and rarely used instructions decode into many more uops. The 970 breaks its instructions down into what it calls "IOPs", presumably short for "internal operations". Like uops on the P4, it is these IOPs that are actually executed out-of-order by the 970's execution core. And also like uops, cracking instructions down into multiple, more atomic and more strictly defined IOPs can help the back end squeeze out some extra instruction-level parallelism (ILP) by giving it more freedom to schedule code.
The point is that RISC processors have no less of a need for instruction decoders. The idea of building a pipelined CPU without one is, um, hilarious. Unless you're playing with a toy ISA, it's not going to happen.
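As a concrete (and simplified, purely illustrative) example of the cracking described in the quote above, a single read-modify-write x86 instruction typically comes out of the decoder as a load, an ALU op, and a store:

# Toy decoder, not real P4 microcode: crack a CISC-style read-modify-write
# instruction into simpler, fixed-format micro-ops.

def crack(instruction):
    if instruction == "add [mem], eax":
        return [
            ("load",  "tmp", "[mem]"),       # read the memory operand
            ("add",   "tmp", "tmp", "eax"),  # do the arithmetic
            ("store", "[mem]", "tmp"),       # write the result back
        ]
    # A simple register-register op needs no cracking at all.
    return [tuple(instruction.replace(",", "").split())]

for insn in ("add [mem], eax", "add ebx, eax"):
    print(insn, "->", crack(insn))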
From Ars:
So in order for an x86 processor's instruction window to be able to rearrange the instruction stream for optimal execution, x86 instructions must first be converted into an instruction format that's uniform in size and atomic in function. This conversion process is called instruction set translation, and all modern x86 processors do some form of it.
x86 decoder: variable-length instructions, each carrying more information, are translated into fixed-length, simpler instructions.
RISC decoder: fixed-length instructions that are already much simpler than x86 instructions are translated into even simpler internal instructions.
It seems x86 decoders do more of the "decoding" than RISC decoders do.
This can be a nontrivial task even for so-called RISC ISAs such as POWER. I recall reading comments about how much harder it is to write an instruction disassembler for constant-length instruction RISC ISAs than variable-length ISAs like x86, because the POWER ISA, to work within the fixed-length format, has to change the encodings for register operands and execution flags for certain instruction types, while something like x86 has many fewer special cases, oddly enough. /Edit
Could be why there isn't a "miracle" solution to the problem of ILP. With x86 you get the disadvantages of variable-length instructions, but go to things like RISC or EPIC and you get problems like lower code density and greater reliance on the compiler. In the end, the instruction set differences all become a wash.
This is from "Intel Core versus AMD's K8 architecture":
Which does not make the P4 a worse candidate for the current version of HT than Merom, and the engineers are working on a version that is better optimized and suited for Merom.
Originally posted by: IntelUser2000
The fact that the Pentium 4 has the biggest reorder buffer tells you it's probably not limited by decode rate as much as you think. None of the articles I have seen show the single decoder as a big limitation; this is the first time I'm hearing that it might be. The P4 has many other flaws that are bigger than the single decoder (poor FPU, fast ALUs that are limited by Trace Cache throughput and even then can't execute all ALU instructions, long pipeline stages).
The Core microarchitecture with the Pentium 4's HT wouldn't benefit more from it than the Pentium 4 does (and vice versa); if anything, the Pentium 4 should benefit more.
Why does K8L still have a 3-issue core if it's rumored to perform similarly to Conroe??
Originally posted by: IntelUser2000
x86 decoder: variable-length instructions, each carrying more information, are translated into fixed-length, simpler instructions.
RISC decoder: fixed-length instructions that are already much simpler than x86 instructions are translated into even simpler internal instructions.
It seems x86 decoders do more of the "decoding" than RISC decoders do.
[RISC CPUs] added more instructions and more complexity to the point where they're every bit as complex as their CISC counterparts. Thus the "RISC vs. CISC" debate really exists only in the minds of marketing departments and platform advocates whose purpose in creating and perpetuating this fictitious conflict is to promote their pet product by means of name-calling and sloganeering.
At this point, I'd like to reference a statement made by David Ditzel, the chief architect of Sun's SPARC family and [CTO] of Transmeta.
"Today [in RISC] we have large design teams and long design cycles," he said. "The performance story is also much less clear now. The die sizes are no longer small. It just doesn't seem to make as much sense." The result is the current crop of complex RISC chips. "Superscalar and out-of-order execution are the biggest problem areas that have impeded performance [leaps]," Ditzel said. "The MIPS R10,000 and HP PA-8000 seem much more complex to me than today's standard CISC architecture, which is the Pentium II. So where is the advantage of RISC, if the chips aren't as simple anymore?"
You would probably like to know the above response stemmed from your quote: "The average execution rate may be 1 to 2 instructions per clock, but in order to achieve that, more instructions must be loaded into the reorder buffer to ensure there are instructions to execute each cycle."

The fact that the Pentium 4 has the biggest reorder buffer tells you it's probably not limited by decode rate as much as you think.
Like a processor, a car's theoretical potential is also limited. What's the point of a good engine when all you need is a 3-cylinder, 70 hp engine to maintain highway speeds??

A processor, no matter how advanced, cannot possibly have a higher retire rate than its issue rate. On average, if a processor issues two instructions per clock, then it is capable of retiring no more than two per clock. Consider this analogy: if you increase the fuel/air charge of an engine, what happens to its power? Conversely, what happens when you reduce it?
Netburst may be capable of retiring three instructions per clock in 'bursts', but it cannot sustain this, due to the limitation imposed by its issue rate.
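A trivial way to see why the sustained rate is capped by the issue rate (toy numbers, nothing P4-specific): the back end can only ever drain what the front end put into the window.

# Toy model: front end issues up to ISSUE ops/clock into the window,
# back end retires up to RETIRE ops/clock out of it.
ISSUE, RETIRE, CYCLES = 2, 3, 1000

window = 0
retired_total = 0
for _ in range(CYCLES):
    window += ISSUE                   # front end fills the window
    retired = min(RETIRE, window)     # back end drains only what is there
    window -= retired
    retired_total += retired

print("sustained retire rate:", retired_total / CYCLES)   # 2.0, not 3.0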
Maybe K8L is wasting its extra-wide execution units?? Why does K8L need extra fetch width, while Core needs to be fully 4-issue wide? It probably has to do with the peculiarities of each architecture. Or maybe you are just generalizing. Just as the K8 is different from the Core architecture, so is the P4.

If the decode rate was of 'minor' importance, Core would not have 4 decoders.
Again, the response was for Bitbybit: "Intel has been trying to push EPIC for years however because of the limitations of x86..."

intangir: Hm, wherever did you get the idea that RISC instructions are simpler than CISC? Certainly they were intended to be, but I don't think it ended up that way. As a reference, see Hannibal's article on The Post-RISC era:
It's from one of the articles at Ars Technica: http://arstechnica.com/articles/paedia/cpu/amd-hammer-1.ars/5

So I can't see how you can conclude x86 architecture's decoders are simpler than POWER's without knowing the relative complexities of x86 uops and POWER iops.