Originally posted by: IntelUser2000
Trace Cache issue rate is THREE instructions/cycle. Core's maximum issue rate is only 33% higher. Once it's in the Trace Cache, the Pentium 4 is essentially a 3-issue-wide CPU. Hyperthreading also helps with latency hiding, an area where Core-microarchitecture-based CPUs are superior in every way: branch prediction, memory latency, cache, shorter pipeline. I would assume Netburst and Core can benefit equally, but if anything Netburst is still better suited to Intel's implementation of SMT.
The trace cache was capable of issuing three instructions per clock, but obviously it could only issue instructions that had already been decoded. As such, the average issue rate of Netburst is going to be far less. Certainly not three per clock!
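To put a rough number on that, here is a minimal back-of-the-envelope sketch in Python (all figures below are my own illustrative assumptions, not Intel's numbers): on a trace-cache hit the front end can deliver up to three uops per cycle, but on a miss the lone x86 decoder is assumed to manage roughly one instruction per cycle, so the sustained rate falls off quickly as the hit rate drops.

# Toy front-end throughput model for a trace-cache design (illustrative only).
# All inputs are assumptions for the sake of the example, not measured data.

ISSUE_ON_HIT = 3.0   # uops/cycle delivered from the trace cache on a hit
ISSUE_ON_MISS = 1.0  # assumed throughput of the single decoder on a miss

for hit_rate in (1.00, 0.95, 0.90, 0.80, 0.60):
    avg = hit_rate * ISSUE_ON_HIT + (1 - hit_rate) * ISSUE_ON_MISS
    print(f"trace-cache hit rate {hit_rate:.0%}: ~{avg:.2f} uops/cycle sustained")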
Yeah right. Intel said back when Willamette came out that branch misprediction could cause a 40% performance penalty, and Hans De Vries from chip-architect says a similar thing. There are other parts of the P4 that are poorly designed, but misprediction was said to be the biggest problem, and the much-enhanced branch predictor isn't enough to overcome the extra pipeline stages.
If pipeline depth has such a massive impact on performance, then explain why the performance difference between Northwood and Prescott was negligible. A 55% increase in pipeline depth, and Prescott still performed within a few percent of Northwood. By your logic, Northwood should have hammered Prescott. The reason it didn't is that there is no direct relationship between IPC and pipeline depth. The maximum theoretical performance of any processor is determined by its execution engine and its frequency. Increasing pipeline depth does worsen the impact of pipeline flushes, but Prescott's sophisticated branch predictor did a good job of preventing this. Hence, again, Prescott performing similarly to Northwood.
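To make that concrete, here is a minimal sketch of the usual back-of-the-envelope CPI model (the branch frequency, misprediction rates, and flush penalties below are illustrative assumptions, not measured Northwood or Prescott figures): the flush cost is roughly branch frequency times misprediction rate times flush penalty, so a roughly 55% deeper pipeline can be almost entirely offset by a modestly better predictor.

# Toy CPI model: base CPI plus the branch-flush penalty term (illustrative only).

def effective_cpi(base_cpi, branch_frac, mispredict_rate, flush_penalty_cycles):
    # CPI = base CPI + (branches per instruction) * (mispredict rate) * (flush cost)
    return base_cpi + branch_frac * mispredict_rate * flush_penalty_cycles

# Shorter pipeline with a weaker predictor vs. a deeper pipeline with a better one.
shallow = effective_cpi(base_cpi=1.0, branch_frac=0.20, mispredict_rate=0.06, flush_penalty_cycles=20)
deep = effective_cpi(base_cpi=1.0, branch_frac=0.20, mispredict_rate=0.04, flush_penalty_cycles=31)

print(f"shallower pipeline: CPI ~ {shallow:.2f}")  # ~1.24
print(f"deeper pipeline:    CPI ~ {deep:.2f}")     # ~1.25

With those made-up numbers the two pipelines come out within about a percent of each other, which is the same pattern the Northwood-versus-Prescott comparison showed.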
Oh right. And one of the fastest CPUs nowadays happens to be an x86 CPU, and that was true even before the Core 2 Duo. It was once thought by some that superscalar execution wasn't even possible on x86. A wider CPU with a supposedly better instruction set, like the G5, doesn't perform faster than the K7.
Your first observation is obviously because the two main players in processor design back the x86 ISA. Intel has, however, been trying to push EPIC for years because of the limitations of x86. Had Intel and AMD gone down the RISC route, the performance of processors today could well be significantly higher. There was a time when RISC processors easily outperformed x86. Now that the emphasis has shifted from ILP to TLP, and from frequency to core count, however, there is little reason for Intel or AMD to invest the necessary R&D to pursue RISC on the desktop.
Why does a decoder exist? To get around the problem of the x86 instruction set. Nowadays, being x86 gives more room for performance increases rather than being a quirk (can you have a Trace Cache on a CPU with no decoders?).
For your sake I will pretend I didn't read that.
A significant number of improvements had to be made to the core to take better advantage of SMT. IBM says the performance gain would have been limited to 20% if it weren't for those extra improvements, which is similar to what Intel can achieve with Hyperthreading on the server side.
That probably has more to do with the design features of the Power architecture than with RISC.
Here is a good article from IBM discussing SMT that you should read. It also talks about the problem of cache hit rate in multithreading, which is what I was talking about when I said that one of the reasons Core is more suitable for SMT is its larger L1 cache.
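As a rough illustration of the L1 point (my own toy numbers, not figures from the IBM article): when two threads share a cache, each effectively sees about half the capacity, and under the common square-root rule of thumb for miss rates that raises the per-thread miss rate by roughly 40%, so a larger L1 leaves more headroom for SMT.

from math import sqrt

# Toy model of per-thread L1 miss rate under SMT, using the rough
# "miss rate scales with 1/sqrt(capacity)" rule of thumb (illustrative only).

def miss_rate(capacity_kb, base_miss=0.05, base_capacity_kb=32.0):
    # Assumed baseline: 5% miss rate for a 32 KB L1 data cache.
    return base_miss * sqrt(base_capacity_kb / capacity_kb)

for l1_kb in (8, 16, 32, 64):
    single = miss_rate(l1_kb)        # one thread owns the whole L1
    smt = miss_rate(l1_kb / 2)       # two threads, roughly half the capacity each
    print(f"{l1_kb:>2} KB L1: single-thread miss ~{single:.1%}, per-thread under SMT ~{smt:.1%}")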