If Intel releases another Hyperthreading CPU, does that mean another inefficient CPU is coming our way?


zsdersw

Lifer
Oct 29, 2003
10,505
2
0
Prescott's branch predictor isn't "better" than that of Conroe. It's been my understanding that Prescott's branch predictor was the basis for the branch predictor of Conroe. And I think it was also present in Yonah.
 

Duvie

Elite Member
Feb 5, 2001
16,215
0
71
Enough of you ppl whine on this forum about there being no uses for 4 cores, so why do you care about more threads and more virtual cores? Apps don't effectively use multithreading now.

I would be more interested in the concept of reverse hyperthreading....Taking multicore chips and combining them into one fast core, which could give an advantage to apps that won't see beyond one or two cores....


This will do little for most. I see maybe the Extreme models or the business chips getting this. It has been 3 years since HT on the P4 desktop chips and games are still not using more than 1 core.....Now a majority of us have dual cores and software developers still have not caught up....
 

Hulk

Diamond Member
Oct 9, 1999
5,138
3,726
136
Duvie,

I know that there are lots of apps out there right now that don't utilize 4 cores but I still want one!

I can be encoding video in one app while still efficiently editing in another. Or running a number of applications simultaneously. And I have a feeling now that the Quads have entered the marketplace that we'll be seeing more and more support.

Plus it's the whiners like us that keep pushing the software developers to optimize the code!
 

Aluvus

Platinum Member
Apr 27, 2006
2,913
1
0
Originally posted by: zsdersw
Prescott's branch predictor isn't "better" than that of Conroe. It's been my understanding that Prescott's branch predictor was the basis for the branch predictor of Conroe. And I think it was also present in Yonah.

I took it that he was making a comparison to Northwood. In that case it would be a true statement.
 

IntelUser2000

Elite Member
Oct 14, 2003
8,686
3,787
136
Trace cache was capable of issuing three instructions per clock, but obviously could only issue instructions that had already been decoded. As such, the average issue rate of Netburst is going to be far less. Certainly not three per clock!

Far less?? Like 2?? The performance penalty of 2-issue vs 3-issue is minor when most code doesn't reach 3 IPC anyway, plus the Trace Cache can store already-decoded instructions. The 20-stage pipeline doesn't even include the decoding stages. The main topic here is HT, which on the P4 works to hide latency.

An HT-optimized program for the Pentium 4 would take the Trace Cache limitation into account and adjust accordingly.
A 55% increase in pipeline depth and Prescott still performed within a few percent of Northwood. By your logic, Northwood should have hammered Prescott. The reason it didn't is that there is no direct relationship between IPC and pipeline depth. The maximum theoretical performance of any processor is determined by its execution engine and its frequency.

Prescott had bigger enhancements to cancel out the effects of its 55% longer pipeline than Willamette had for its 100% longer pipeline. Maximum theoretical performance is determined by the execution engine and frequency, but CPUs don't reach their maximum theoretical performance, and it tells you nothing about real-world performance. If it did, we would have seen yearly increases in issue width, just as we see graphics pipelines grow with every new generation. Instead, CPUs are limited by things like the memory subsystem. Every generation that came with a widened issue rate also came with techniques to take advantage of it. The Pentium Pro got out-of-order execution along with its 3-issue width; pure execution improvements without OOE would have severely limited the performance advantages. HT on the P4 works to hide latency, which was always a bigger limitation than "issue rate".
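To put rough numbers on the latency point, here is a minimal back-of-envelope sketch (the miss rates and penalties below are assumed purely for illustration, not measured on any real chip):

```python
# Classic CPI decomposition: CPI_effective = CPI_base + misses_per_instr * miss_penalty.
# All numbers are hypothetical, chosen only to show the shape of the effect.
def effective_ipc(peak_ipc, misses_per_instr, miss_penalty_cycles):
    cpi = 1.0 / peak_ipc + misses_per_instr * miss_penalty_cycles
    return 1.0 / cpi

print(effective_ipc(3.0, 0.00, 200))  # ~3.00 - only issue width limits throughput
print(effective_ipc(3.0, 0.01, 200))  # ~0.43 - memory latency dominates
print(effective_ipc(2.0, 0.01, 200))  # ~0.40 - narrowing the issue width barely matters
```

Once misses dominate like that, hiding the latency (which is what HT tries to do) buys more than another issue slot would.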

Had Intel and AMD gone down the RISC route, the performance of processors today could well be significantly higher.

We do not know that. The only RISC CPU that significantly outperformed x86 CPUs was the Alpha, which was clocked over 2x higher than its x86 counterparts.

For your sake I will pretend I didn't read that.

Why?? Why would you say that?? The purpose of the decoder in current x86 CPUs is to break down complex x86 instructions into simpler RISC-like instructions, so it can supposedly increase ILP and increase performance.
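As a purely illustrative sketch of that breaking-down step (the micro-op names and the mapping below are invented; real uop encodings are internal to the chip and undocumented):

```python
# Toy "cracking" of x86-style instructions into RISC-like micro-ops.
def crack(instruction):
    table = {
        # a read-modify-write memory op becomes load / ALU / store uops
        "add [mem], eax": ["load  tmp, [mem]",
                           "add   tmp, tmp, eax",
                           "store [mem], tmp"],
        # a simple register-register op maps to a single uop
        "add ebx, eax":   ["add   ebx, ebx, eax"],
    }
    return table[instruction]

for insn in ("add [mem], eax", "add ebx, eax"):
    print(insn, "->", crack(insn))
```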

That probably has more to do with the design features of the Power architecture than with RISC.

Here is a good article from IBM discussing SMT you should read. It also talks about the problems of cache hitrate in multithreading, which is what I was talking about when I mentioned one of the reasons Core was more suitable for SMT was its larger L1.

The Power 4/5 example didn't have anything to do with RISC; it was about a wide CPU, which would not have had a greater advantage from SMT than a much narrower P4 without the enhancements made to support SMT better.
 

BitByBit

Senior member
Jan 2, 2005
474
2
81
Originally posted by: IntelUser2000

Far less?? Like 2?? The performance penalty of 2-issue vs 3-issue is minor when most code doesn't reach 3 IPC anyway, plus the Trace Cache can store already-decoded instructions. The 20-stage pipeline doesn't even include the decoding stages. The main topic here is HT, which on the P4 works to hide latency.

So you're saying reducing the issue rate from 3 to 2 will cause a minor performance hit? The average execution rate may be 1 to 2 instructions per clock, but in order to achieve that, more instructions must be loaded into the reorder buffer to ensure there are instructions to execute each cycle. If the decode rate was of 'minor' importance, Core would not have 4 decoders.

Prescott had bigger enhancements to cancel out the effects of its 55% longer pipeline than Willamette had for its 100% longer pipeline.

Assuming that you deem IPC and pipeline depth to be directly related, you're saying that Prescott's few enhancements compensated for a 55% dip in theoretical performance?

Maximum theoretical performance is determined by the execution engine and frequency, but CPUs don't reach their maximum theoretical performance, and it tells you nothing about real-world performance. If it did, we would have seen yearly increases in issue width, just as we see graphics pipelines grow with every new generation.

My use of the word 'theoretical' was intended to convey my acknowledgement that processors generally do not achieve their maximum throughput, on average. My point, however, was that a processor's execution resources and frequency have far more bearing on performance than pipeline depth.

Instead, CPUs are limited by things like the memory subsystem. Every generation that came with a widened issue rate also came with techniques to take advantage of it. The Pentium Pro got out-of-order execution along with its 3-issue width; pure execution improvements without OOE would have severely limited the performance advantages. HT on the P4 works to hide latency, which was always a bigger limitation than "issue rate".

1. Of course designers are going to ensure their processor can properly take advantage of wider execution resources. You've pretty much reiterated my own point about the limitations of the x86 ISA. Cheers.

2. That last statement is utter tosh. The main point of Hyperthreading was to provide a pool of non-dependent instructions for the scheduler to send for execution (hence Netburst's deep buffers), improving IPC. A side effect of this was that context switching was also improved, since threads from different processes running simultaneously under preemptive multitasking did not have to be continually fetched from memory. In this respect, latency is hidden, but I doubt this is what you were talking about. It certainly does not hide any latency when executing instructions from a particular thread at the core level. Instruction latency is the minimum number of clock cycles it takes to execute a single instruction from a particular thread, and is determined by the number of pipeline stages.

"Increasing the instruction transfer rate constitutes an acceleration factor in itself, but it also provides a wider instruction window to the OOO engine that will facilitate its management of dependencies and in consequence its efficiency. We remind you that this was the same objective of optimisation of OOO functioning that has been at the origin of Hyper-Threading integration in Netburst."

Source
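To make the slot-filling argument concrete, here is a toy simulation assuming a 3-wide issue stage and a made-up distribution of independent ops each thread can offer per cycle (averaging about 2.5, the figure usually quoted for typical code):

```python
import random

random.seed(0)

ISSUE_WIDTH = 3        # issue slots per cycle (Netburst-like, hypothetical)
CYCLES = 100_000

def ready_ops():
    # Independent ops one thread can offer in a cycle; averages 2.5 (made up).
    return random.choice([1, 2, 3, 4])

def slot_utilization(threads):
    used = 0
    for _ in range(CYCLES):
        slots = ISSUE_WIDTH
        for _ in range(threads):
            take = min(slots, ready_ops())
            used += take
            slots -= take
    return used / (CYCLES * ISSUE_WIDTH)

print(f"1 thread : {slot_utilization(1):.0%} of issue slots filled")
print(f"2 threads: {slot_utilization(2):.0%} of issue slots filled")
```

With one thread the leftover slots are simply wasted; with a second thread's independent ops available, most of them get filled, which is the IPC benefit being described.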

We do not know that. The only RISC CPU that significantly outperformed x86 CPUs was the Alpha, which was clocked over 2x higher than its x86 counterparts.

If memory serves, Alpha also featured a significantly higher IPC than any of its contemporary processors.

The Power 4/5 example didn't have anything to do with RISC; it was about a wide CPU, which would not have had a greater advantage from SMT than a much narrower P4 without the enhancements made to support SMT better.

Forgive me if I don't take your word for that. The general consensus among people who know what they're talking about in this subject is that processors with greater execution resources stand to gain more from SMT than those with fewer, your theory on the purpose of Hyperthreading notwithstanding.

 

Duvie

Elite Member
Feb 5, 2001
16,215
0
71
Originally posted by: Hulk
Duvie,

I know that there are lots of apps out there right now that don't utilize 4 cores but I still want one!

I can be encoding video in one app while still efficiently editing in another. Or running a number of applications simultaneously. And I have a feeling now that the Quads have entered the marketplace that we'll be seeing more and more support.

Plus it's the whiners like us that keep pushing the software developers to optimize the code!


Well then you need to whine louder, because you are not getting much done...

Most of the apps in the area of video encoding that are multithreaded today were already pretty well SMP-aware back then...Whining has done what in gaming?

Multitasking will be fine, but any HT will not be the equivalent of another core anyway. I bet it won't even garner the 20% performance increase the rather inefficient P4 Netburst HT did.

The only way I can truly get 100% out of 4 cores is running 4 instances of F@H. No single program (non-benchmark) has been able to tax 100% of the 4 cores. I/O limitations are getting so extreme.
 

Hulk

Diamond Member
Oct 9, 1999
5,138
3,726
136
Originally posted by: Duvie
Originally posted by: Hulk
Duvie,

I know that there are lots of apps out there right now that don't utilize 4 cores but I still want one!

I can be encoding video in one app while still efficiently editing in another. Or running a number of applications simultaneously. And I have a feeling now that the Quads have entered the marketplace that we'll be seeing more and more support.

Plus it's the whiners like us that keep pushing the software developers to optimize the code!


Well then you need to whine louder, because you are not getting much done...

Most of the apps in the area of video encoding that are multithreaded today were already pretty well SMP-aware back then...Whining has done what in gaming?

Multitasking will be fine, but any HT will not be the equivalent of another core anyway. I bet it won't even garner the 20% performance increase the rather inefficient P4 Netburst HT did.

The only way I can truly get 100% out of 4 cores is running 4 instances of F@H. No single program (non-benchmark) has been able to tax 100% of the 4 cores. I/O limitations are getting so extreme.


That post was meant to be humorous. Note the use of exclamation points!!!

Just trying to lighten things up a bit in here.
 

IntelUser2000

Elite Member
Oct 14, 2003
8,686
3,787
136
To quote from Ars Technica (link: http://arstechnica.com/articles/paedia/cpu/prescott.ars):
"In the Geek.com interview, the Intel representative offered that the Pentium M's shorter pipeline is supposedly "not suited" for hyperthreading. Since there was no elaboration on that point in the Geek.com article, I'll do that here."

"Sure, hyperthreading improves execution efficiency, or "throughput," by (ideally) keeping the processor's pipeline filled with instructions from multiple threads, but where it really shines is in situations where one thread is stalled and the other is still moving. In a normal processor, a pipeline stall would mean that multiple clock cycles could pass without doing useful work."

"In a processor that's as deeply pipelined as Prescott, waiting on a cache access means multiple dead cycles, especially considering the fact that Prescott's L1 and L2 cache latencies are worse than those of Northwood (significantly worse in the case of the L1); and waiting on a main memory access means lots of dead cycles for a processor that's running at upwards of 3GHz. So what the Intel rep was really saying was this: Because we got obsessed with MHz as a marketing number and made Prescott's pipeline so ridiculously long, Prescott benefits much more from a latency-hiding technique like hyperthreading than a saner design like the Pentium M."

Core is wider than the Pentium M, but it can perform 30% faster per clock.
So you're saying reducing the issue rate from 3 to 2 will cause a minor performance hit? The average execution rate may be 1 to 2 instructions per clock, but in order to achieve that, more instructions must be loaded into the reorder buffer to ensure there are instructions to execute each cycle. If the decode rate was of 'minor' importance, Core would not have 4 decoders.
The fact that the Pentium 4 has the biggest reorder buffer tells me it's probably not limited by decode rate as much as you think. None of the articles I have seen show the single decoder as a big limitation; this is the first time I'm hearing that it might be. The P4 has many other flaws that are bigger than the single decoder (poor FPU, fast ALUs limited by Trace Cache throughput, which even then can't execute all ALU instructions, long pipeline stages).

The Core microarchitecture with the Pentium 4's HT wouldn't benefit more than the Pentium 4 does (and vice versa); if anything, the Pentium 4 should benefit more.

It was said that the K7 can issue an average of 2.5 instructions per cycle. http://penstarsys.com/editor/company/intel/conroe/index.html

"Currently AMD is saying that with the latest Athlon 64 cores they are seeing a utilization of around .97 issues per cycle."

It doesn't even reach 40% utilization of its average issue rate.

There will probably be code that gets more than 1 issue per cycle, but nowhere near 3. That's why some wondered about the point of 4-issue on Conroe. Why does K8L still have a 3-issue core if it's rumored to perform similarly to Conroe??
Assuming that you deem IPC and pipeline depth to be directly related, you're saying that Prescott's few enhancements compensated for a 55% dip in theoretical performance?
First of all, your math is wrong. A 55% increase doesn't equal a 55% drop. (Perhaps you did it on purpose to be sarcastic, but I didn't get any hint that you did.) 20/31 is a 35% drop (ok, that is like trying to find the fps you get in a game by dividing your video card's fillrate by your resolution, which is incorrect).

A quote from Hans De Vries: "The 20-stage pipeline can contain 60 operations; over 100 can be in flight. A well-known rule states that about one in every six instructions is a branch instruction. The pipeline will contain 10-15 branches in this case. The new predictor selector hardware in the Willamette branch prediction should be able to reach 92-95% accuracy. This would mean that statistically the chance of a mispredicted branch in the pipeline is 80-90%, which would result in a performance degradation of 40%."

It quotes a 40% performance penalty for Willamette, considering nothing but branch prediction and pipeline depth. It's probably one of the bad cases, but still. Let's say 20% for Prescott.
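For what it's worth, the kind of estimate being quoted falls out of the standard 1 - accuracy^n calculation; the exact figure swings a lot depending on how many branches you count as in flight and the accuracy you assume, so treat these as ballpark only:

```python
# Chance that at least one branch currently in the pipeline was mispredicted,
# assuming independent predictions (a simplification).
def mispredict_in_flight(accuracy, branches_in_flight):
    return 1 - accuracy ** branches_in_flight

for acc in (0.92, 0.95):
    for n in (10, 15):
        print(f"accuracy {acc:.0%}, {n} branches in flight -> "
              f"{mispredict_in_flight(acc, n):.0%}")
```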
If memory serves, Alpha also featured a significantly higher IPC than any of its contemporary processors.

One of the news stories back in the Alpha days was how the Pentium Pro shocked the industry by briefly exceeding the SPECint95 score of the Alpha. Looking at the Alpha's clock speed advantage, I wouldn't say that about its IPC. x86 CPUs started catching up with (and exceeding) RISC CPUs when they obtained the features only RISC CPUs had: OOO, superscalar, etc.
Forgive me if I don't take your word for that. The general consensus among people who know what they're talking about in this subject is that processors with greater execution resources stand to gain more from SMT than those with fewer, your theory on the purpose of Hyperthreading notwithstanding.
I am not saying the P4 will gain more than Core/Power5. But with the current implementation of SMT, i.e. HT, it will gain as much as (or more than) either of the two. The next "HT" is supposed to be more advanced, making this point moot.
1. Of course designers are going to ensure their processor can properly take advantage of wider execution resources. You've pretty much reiterated my own point about the limitations of the x86 ISA. Cheers.

Again, regarding the so-called "limitations" of the x86 ISA, some even thought superpipelining the FPU was not possible: http://www.azillionmonkeys.com/qed/cpujihad.shtml
 

intangir

Member
Jun 13, 2005
113
0
76
Originally posted by: IntelUser2000
For your sake I will pretend I didn't read that.

Why?? Why would you say that?? The purpose of the decoder in current x86 CPUs is to break down complex x86 instructions into simpler RISC-like instructions, so it can supposedly increase ILP and increase performance.

Edit: oh, and one point I overlooked that perhaps should be made: CPU decode units don't necessarily do translation to another instruction format. Even processors that do "native" execution of the ISA instruction set need a decode unit to figure out what the heck the instruction does, which registers it works on, and so on. This can be a nontrivial task even for so-called RISC ISAs such as POWER. I recall reading comments about how much harder it is to write an instruction disassembler for constant-length instruction RISC ISAs than variable-length ISAs like x86, because the POWER ISA, to work within the fixed-length format, has to change the encodings for register operands and execution flags for certain instruction types, while something like x86 has many fewer special cases, oddly enough.
/Edit

The point is that RISC processors have no less of a need for instruction decoders. The idea of building a pipelined CPU without one is, um, hilarious. Unless you're playing with a toy ISA, it's not going to happen.

ArsTechnica PowerPC 970 article
The P4's and the 970's fetch and decode pipeline phases are similar in one very important respect: both processors break down instructions in their native ISA's format into a smaller, simpler format for use inside the CPU. The P4 breaks down each x86 CISC instruction into smaller micro-ops (or "uops"), which more or less resemble the instructions on a RISC machine. Most x86 instructions decode into 2 or 3 uops, but some of the longer, more complex and rarely used instructions decode into many more uops. The 970 breaks its instructions down into what it calls "IOPs", presumably short for "internal operations". Like uops on the P4, it is these IOPs that are actually executed out-of-order by the 970's execution core. And also like uops, cracking instructions down into multiple, more atomic and more strictly defined IOPs can help the back end squeeze out some extra instruction-level parallelism (ILP) by giving it more freedom to schedule code.

Anyway, the point of the above is providing more granularity in scheduling the execution units, and simplifying the handling of data hazards (dependencies). Which brings us back to why SMT benefits just about any high-performance processor these days.

No matter how good you make instruction scheduling, there will ALWAYS be data and control dependencies you have to wait for, and if one thread is stuck waiting for some loads to be fulfilled from cache, a divide instruction whose result it needs, or a new stream of instructions to be fetched and decoded because of a branch misprediction, the CPU might as well occupy otherwise idle execution units with useful work from another thread. Pipeline length only affects the control (branch) hazard latencies and has no impact on the amount of time spent waiting for data.

I think in practice (and disclaimer: I'm not really a microarchitect :) ), the main virtue of SMT is it allows you to average more simultaneous outstanding memory operations. At some point in an instruction stream, the CPU is going to be stuck waiting for loads/stores to finish before it can execute any more instructions, and the number of outstanding memory operations is going to be less than the CPU can theoretically handle. If you add another stream of independent instructions, you could probably hit that maximum number of in-flight memory operations much more of the time. This is a big win when one mem op takes, say, 200 cycles, but 10 only take 220, and 20 take 240 (disclaimer: numbers made up!).
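Using those (admittedly made-up) latencies, the effect on sustained memory throughput looks roughly like this:

```python
# Rough memory-level-parallelism arithmetic with the made-up numbers above:
# more in-flight misses amortize nearly the same base latency.
for in_flight, total_cycles in [(1, 200), (10, 220), (20, 240)]:
    print(f"{in_flight:2d} outstanding ops -> "
          f"{in_flight / total_cycles:.3f} memory ops per cycle")
```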
 

Hulk

Diamond Member
Oct 9, 1999
5,138
3,726
136
This from the "Intel Core versus AMD's K8 architecture"

"Core's impressive execution resources and massive shared cache seem to make it the ideal CPU design for SMT. However, there is no Simultaneous Multi Threading anywhere in the Core architecture. The reason is not that SMT can't give good results (See our elaborate discussion here), but that the engineers were given the task to develop a CPU with a great performance ratio that could be used for the Server, Desktop and Mobile markets. So the designers in Israel decided against using SMT (Hyper-Threading). While SMT can offer up to a 40% performance boost, these performance benefits will only be seen in server applications. SMT also makes the hotspots even hotter, so SMT didn't fit very well in Core's "One Micro-Architecture to Rule them All" design philosophy."

 

IntelUser2000

Elite Member
Oct 14, 2003
8,686
3,787
136
The point is that RISC processors have no less of a need for instruction decoders. The idea of building a pipelined CPU without one is, um, hilarious. Unless you're playing with a toy ISA, it's not going to happen.
From Ars:
So in order for an x86 processor's instruction window to be able to rearrange the instruction stream for optimal execution, x86 instructions must first be converted into an instruction format that's uniform in size and atomic in function. This conversion process is called instruction set translation, and all modern x86 processors do some form of it.

x86 decoder: variable-length instructions that pack more information per instruction, translated into fixed-length, simpler instructions.
RISC decoder: fixed-length instructions that are already much simpler than x86 instructions, translated into even simpler internal instructions.

Seems the x86 decoder does more "decoding" than a RISC decoder does.
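A toy way to see that difference (these are not real x86 or POWER encodings; the leading "length byte" below is invented purely to mimic variable-length decoding):

```python
# Fixed-length instructions can be split at known offsets; variable-length
# streams must be walked, because each boundary depends on the previous one.
def split_fixed(stream: bytes, width: int = 4):
    return [stream[i:i + width] for i in range(0, len(stream), width)]

def split_variable(stream: bytes):
    insns, i = [], 0
    while i < len(stream):
        length = stream[i]                 # hypothetical length prefix
        insns.append(stream[i:i + length])
        i += length                        # next boundary depends on this one
    return insns

print(split_fixed(bytes(range(12))))
print(split_variable(bytes([2, 0xAA, 3, 0xBB, 0xCC, 1])))
```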
This can be a nontrivial task even for so-called RISC ISAs such as POWER. I recall reading comments about how much harder it is to write an instruction disassembler for constant-length instruction RISC ISAs than variable-length ISAs like x86, because the POWER ISA, to work within the fixed-length format, has to change the encodings for register operands and execution flags for certain instruction types, while something like x86 has many fewer special cases, oddly enough. /Edit

Could be why there isn't a "miracle" solution to the problem of ILP. With x86 you get the disadvantages of variable-length instructions, but go to things like RISC or EPIC and you get problems like lower code density and greater reliance on the compiler. In the end, the differences between instruction sets all become a wash.

This from the "Intel Core versus AMD's K8 architecture"

Which does not make the P4 a worse candidate for the current version of HT than Merom; the engineers are working on a version that is more optimized and suited for Merom.
 

BitByBit

Senior member
Jan 2, 2005
474
2
81
Originally posted by: IntelUser2000


The fact that the Pentium 4 has the biggest reorder buffer tells me it's probably not limited by decode rate as much as you think. None of the articles I have seen show the single decoder as a big limitation; this is the first time I'm hearing that it might be. The P4 has many other flaws that are bigger than the single decoder (poor FPU, fast ALUs limited by Trace Cache throughput, which even then can't execute all ALU instructions, long pipeline stages).

Okay. Since you seem to be having difficulty grasping the relationship between decode rate and IPC, let's ignore the execution engine for now.
A processor, no matter how advanced, cannot possibly have a higher retire rate than its issue rate. On average, if a processor issues two instructions per clock, then it is capable of retiring no more than two per clock. Consider this analogy: if you increase the fuel/air charge of an engine, what happens to its power? Conversely, what happens when you reduce it?
Netburst may be capable of retiring three instructions per clock in 'bursts', but it cannot sustain this, due to the limitations imposed by its issue rate.
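A minimal sketch of that bound, treating sustained IPC as capped by the narrowest stage (real cores also stall on dependencies and cache misses, so this is only an upper bound; the widths are the commonly cited ones):

```python
# Sustained throughput cannot exceed the narrowest pipeline stage.
def sustained_ipc_bound(decode_per_clk, issue_per_clk, retire_per_clk):
    return min(decode_per_clk, issue_per_clk, retire_per_clk)

print(sustained_ipc_bound(1, 3, 3))  # Netburst fed only by its single decoder
print(sustained_ipc_bound(3, 3, 3))  # Netburst when the trace cache supplies uops
print(sustained_ipc_bound(4, 4, 4))  # a Core-like 4-wide machine
```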

The Core microarchitecture with the Pentium 4's HT wouldn't benefit more than the Pentium 4 does (and vice versa); if anything, the Pentium 4 should benefit more.

If Core had the same execution resources as Netburst, it would probably benefit less due to its ability to extract ILP from a particular thread. However, Core's execution engine is wider, meaning the likelihood of idle execution units is increased. While Netburst's execution engine is theoretically three-way, its instruction decoding ability severely limits the issue rate, resulting in an average IPC of far less than three. Core is not as limited as Netburst when it comes to issue rate, meaning it is capable of greater throughput when executing multithreaded code. Imagine two queues of people trying to squeeze through a single doorway. This is Netburst when executing multithreaded code. A crude analogy, I will concede, but now hopefully you understand why Core is able to gain more from HT.
I don't doubt that HT helps out with pipeline stalls, but accurate branch prediction is naturally going to limit the benefit here. Prescott benefits slightly more from HT, but that is due to its larger caches. Northwood's Data Cache was reduced to 4KB per thread when hyperthreading.


"Hyper-threading's strength is that it allows the scheduling logic maximum flexibility to fill execution slots, thereby making more efficient use of available execution resources by keeping the execution core busier. If you compare the SMP diagram with the hyper-threading diagram, you can see that the same amount of work gets done in both systems, but the hyper-threaded system uses a fraction of the resources and has a fraction of the waste of the SMP system; note the scarcity of empty execution slots in the hyper-threaded machine versus the SMP machine.

To get a better idea of how hyper-threading actually looks in practice, consider the following example: Let's say that the OOE logic in our diagram above has extracted all of the instruction-level parallelism (ILP) it can from the red thread, with the result that it will be able to issue two instructions in parallel from that thread in an upcoming cycle. Note that this is an exceedingly common scenario, since research has shown the average ILP that can be extracted from most code to be about 2.5 instructions per cycle. (Incidentally, this is why the Pentium 4, like many other processors, is equipped to issue at most 3 instructions per cycle to the execution core.) Since the OOE logic in our example processor knows that it can theoretically issue up to four instructions per cycle to the execution core, it would like to find two more instructions to fill those two empty slots so that none of the issue bandwidth is wasted. In either a single-threaded or multithreaded processor design, the two leftover slots would just have to go unused for the reasons outlined above. But in the hyper-threaded design, those two slots can be filled with instructions from another thread. Hyper-threading, then, removes the issue bottleneck that has plagued previous processor designs.
"

Source

Why does K8L still have a 3-issue core if it's rumored to perform similarly to Conroe??

If I remember correctly, K8L has the same number of execution units, but they are wider. It also has a 32-byte instruction fetch rate per cycle, improving the utilisation of its decoders.
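As a rough sense of what 32 bytes per cycle buys (the average x86 instruction lengths here are assumed values, not measurements):

```python
# Fetch-bandwidth arithmetic: bytes fetched per cycle / average instruction size.
fetch_bytes_per_cycle = 32
for avg_insn_bytes in (3.0, 3.5, 4.0):
    print(f"avg {avg_insn_bytes} B/insn -> "
          f"{fetch_bytes_per_cycle / avg_insn_bytes:.1f} instructions fetched per cycle")
```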





 

intangir

Member
Jun 13, 2005
113
0
76
Originally posted by: IntelUser2000
x86 decoder: variable-length instructions that pack more information per instruction, translated into fixed-length, simpler instructions.
RISC decoder: fixed-length instructions that are already much simpler than x86 instructions, translated into even simpler internal instructions.

Seems the x86 decoder does more "decoding" than a RISC decoder does.

Hm, wherever did you get the idea that RISC instructions are simpler than CISC? :) Certainly they were intended to be, but I don't think it ended up that way. As reference, see Hannibal's article on The Post-RISC era:

[RISC CPUs] added more instructions and more complexity to the point where they're every bit as complex as their CISC counterparts. Thus the "RISC vs. CISC" debate really exists only in the minds of marketing departments and platform advocates whose purpose in creating and perpetuating this fictitious conflict is to promote their pet product by means of name-calling and sloganeering.

At this point, I'd like to reference a statement made by David Ditzel, the chief architect of Sun's SPARC family and [CTO] of Transmeta.
"Today [in RISC] we have large design teams and long design cycles," he said. "The performance story is also much less clear now. The die sizes are no longer small. It just doesn't seem to make as much sense." The result is the current crop of complex RISC chips. "Superscalar and out-of-order execution are the biggest problem areas that have impeded performance [leaps]," Ditzel said. "The MIPS R10,000 and HP PA-8000 seem much more complex to me than today's standard CISC architecture, which is the Pentium II. So where is the advantage of RISC, if the chips aren't as simple anymore?"

Note that article was written in 1999.

I believe "RISC" today is more a historical term for a group of architectures than an accurate description of their complexity. Today's POWER instruction set certainly can't be called "reduced" in any way. So I can't see how you can conclude x86 architecture's decoders are simpler than POWER's without knowing the relative complexities of x86 uops and POWER iops.
 

IntelUser2000

Elite Member
Oct 14, 2003
8,686
3,787
136
The fact that the Pentium 4 has the biggest reorder buffer tells me it's probably not limited by decode rate as much as you think.
You would probably like to know the above response stemmed from your quote: "The average execution rate may be 1 to 2 instructions per clock, but in order to achieve that, more instructions must be loaded into the reorder buffer to ensure there are instructions to execute each cycle."

And that's the reason you said Core has 4 decoders.
A processor, no matter how advanced, cannot possibly have a higher retire rate than its issue rate. On average, if a processor issues two instructions per clock, then it is capable of retiring no more than two per clock. Consider this analogy: if you increase the fuel/air charge of an engine, what happens to its power? Conversely, what happens when you reduce it?
Like a processor, a car's theoretical potential is also limited. What's the point of a good engine when all you need is a 3-cylinder 70hp engine to maintain highway speeds??
Netburst may be capable of retiring three instructions per clock in 'bursts', but it cannot sustain this, due to the limitations imposed by its issue rate.

If you are saying Netburst is limited because of its single decoder, it sounds like the Trace Cache is unimportant to the performance equation for you, since it can only hold previously decoded instructions. Think of the other architectures: if the caching system weren't efficient enough and instructions were always waiting on memory, all the other, wider CPUs shouldn't be any faster either. Why have caches at all if the data in them is never retrieved for future execution (since new data comes into the cache every time) and everything waits on memory?? CPUs have caches because they do provide a big benefit and lots of code is re-executed.

Athlons are said to have an average issue rate of 2.5/cycle. The Trace Cache can probably issue similarly to the Athlon by itself, and once instructions are decoded and cached in the Trace Cache, its average is probably over 2. http://en.wikipedia.org/wiki/Cache
Caches have proven extremely effective in many areas of computing because access patterns in typical computer applications have locality of reference.
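A tiny simulation of that locality effect, with arbitrary cache and working-set sizes (FIFO eviction, purely for illustration):

```python
import random
random.seed(1)

CACHE_LINES = 256

def hit_rate(addresses):
    cache, order, hits = set(), [], 0
    for line in addresses:
        if line in cache:
            hits += 1
        else:
            cache.add(line)
            order.append(line)
            if len(cache) > CACHE_LINES:
                cache.discard(order.pop(0))
    return hits / len(addresses)

loop = [i % 64 for i in range(100_000)]                 # code re-executed in a loop
scattered = [random.randrange(1_000_000) for _ in range(100_000)]
print(f"looping accesses:   {hit_rate(loop):.0%} hits")
print(f"scattered accesses: {hit_rate(scattered):.0%} hits")
```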

Your quote:
If the decode rate was of 'minor' importance, Core would not have 4 decoders.
Maybe K8L is wasting its extra-wide execution units?? Why does K8L need extra fetch width, while Core needs to be fully 4-issue wide? It probably has to do with the peculiarities of each architecture. Or maybe you are just generalizing things. Just as the K8 is different from the Core architecture, so is the P4.
intangir: Hm, wherever did you get the idea that RISC instructions are simpler than CISC? Certainly they were intended to be, but I don't think it ended up that way. As reference, see Hannibal's article on The Post-RISC era:
Again, the response was for BitByBit: "Intel has been trying to push EPIC for years however because of the limitations of x86..."

Other instruction sets nowadays provide no tangible benefit over x86 in terms of performance.
So I can't see how you can conclude POWER's decoders are simpler than x86's without knowing the relative complexities of x86 uops and POWER iops.
It's from one of the articles at Ars Technica: http://arstechnica.com/articles/paedia/cpu/amd-hammer-1.ars/5

Theoretically, the pipeline stages are simpler; it does not need things like a microcode sequencer ROM.

The comparison is to the PowerPC 970, which has a fixed instruction length of 4 bytes, versus x86's variable 1 to 15 bytes.

Surely decoding a fixed-length instruction into a fixed-length instruction is simpler than decoding a variable-length instruction into a fixed-length one.

"A RISC instruction set's fixed-length instruction format does more than just simplify processor fetch and decode hardware; it also simplifies dynamic scheduling, making the instruction stream easier to reorder in the execution core.

In addition to being fixed-length, RISC instructions are also atomic in that each instruction tells the computer to perform one specific and carefully delimited task (e.g. multiply, divide, load, store, shift, rotate, etc.). A single x86 instruction, in contrast, can specify a whole series of tasks, e.g. a memory access followed by an arithmetic instruction, a multi-step BCD conversion, a multi-step string manipulation, etc.

This non-atomic aspect of x86 instructions renders them pretty well impossible for the execution core to reorder as-is. So in order for an x86 processor's instruction window to be able to rearrange the instruction stream for optimal execution, x86 instructions must first be converted into an instruction format that's uniform in size and atomic in function. This conversion process is called instruction set translation, and all modern x86 processors do some form of it."