What is the IPC of the new PIV?

astroview · Nov 15, 2002

Just wanted to know the IPC, especially compared to other CPUs like top of the line G4, Athlons, and Alphas.

Anyone got this info somewhere?

Accord99 · Nov 15, 2002

Variable depending on application, memory, peripherals, compiler, FSB, speed of the P4, etc. In short, there is no single accurate number for the IPC of the P4 (or any other CPU) family.

UlricT · Nov 15, 2002

uh... WHAT????

IPC (Instructions Per Clock cycle) is variable?????

you sure about this man? coz i always thought it was kinda set according to the hardware, not the software... 😕

Markfw · Nov 15, 2002

Ok, it's late and I can't find the article, but it is something like this: The P4 has one IPC, and the Athlon has 2 (maybe 3, I know it has three FPU units). No it is not variable, and I don't have the numbers and web pages to back it up at the moment. I hope someone will find the real numbers, but I know FOR A FACT THAT THE ATHLON HAS MORE IPC THAN THE P4.

Accord99 · Nov 15, 2002

Originally posted by: Markfw900
Ok, it's late and I can't find the article, but it is something like this: The P4 has one IPC, and the Athlon has 2 (maybe 3, I know it has three FPU units). No it is not variable, and I don't have the numbers and web pages to back it up at the moment. I hope someone will find the real numbers, but I know FOR A FACT THAT THE ATHLON HAS MORE IPC THAN THE P4.

The Athlon and the P4 both have a higher theoretical IPC limit, but it is never reached in practice, and the IPC actually achieved changes based on the things I said above.

Markfw · Nov 15, 2002

I don't know where you get your information Accord99, here is just one of many articles I have read, a quote from the following URL:
http://www.a1-electronics.co.uk/Intel_Section/CPUs/Pentium4_2GHzReview.shtml

" As the AMD Athlon is capable of doing more instructions per clock cycle it is no good trying to make comparisons with these different processors. As our simple graphs below show where only now does the Pentium 4 at 2GHz manage to beat the AMD 1.4MHz Athlon. But one can say that this 2GHz Pentium 4 is faster than the earlier versions. "

I can find many more, but I really need to get some sleep now, and don't want to spend any more time to prove you are in error. Please stop spreading incorrect information.

Edit: Here is one more from "http://www.insideproject.com/showreview.cfm?reviewid=56"

" AMD's new performance rating is AMD's way of saying that their CPU's can compete with the best of Intel's, simply put their performance rating in this case 1900+, states that this CPU is the equivelant of a 1900MHz Intel CPU (the upcoming Northwood P4 is about same performance). Their processors are able to beat Intels CPU's clock for clock because they process more IPC's(Instructions Per clock Cycle)."

I can go on forever, so please don't make this a war.

Accord99 · Nov 15, 2002

Where am I wrong? The IPC is not a constant number for any processor. If the IPC is constant, then why is the P4 2400B faster than the P4 2400A? Or why is the Athlon significantly faster clock for clock than a P4 in RC5 but slower clock for clock in Prime95?

Sunner · Nov 15, 2002

Accord99 is absolutely correct.

Just cause a CPU can theoretically handle x number of instructions per clock doesn't mean it ever will in the real world.
I seem to remember someone stating that the AXP's average (don't remember what tests were used to determine this) IPC is slightly above 1, don't know where that would put the P4's though.

Rand · Nov 15, 2002

Originally posted by: Markfw900

I can find many more, but I really need to get some sleep now, and don't want to spend any more time to prove you are in error. Please stop spreading incorrect information.

He's not spouting inaccurate information. He is 100% correct.
If the IPC stayed the same then the performance would always be exactly the same regardless of memory, FSB etc.
Processor's performance = IPC * frequency

IPC will deviate drastically depending upon the software code being executed, FSB, memory bandwidth etc.

It's not at all difficult to find situations in which the P4's IPC is as low as ~0.03 instructions per clock cycle; similarly, I can find a few situations in which it may peak as high as 2.0.

One can make a rough estimate of the average IPC seen on most software, but it can vary considerably between code, and depending on hardware factors such as the memory in use.

HT will bring with it a higher average IPC for the Pentium 4 processor, but as clockspeed increases average IPC will decrease.
The peak IPC for the Pentium 4 3.06GHz remains exactly as it has been for all Pentium 4's from the first Willamette core on.
Under peak ideal conditions any Pentium 4 processor can peak at 4 IPC in integer or FPU code.

Both the AMD Athlon and the Pentium III can peak at 3 IPC, though both maintain a higher average IPC than the Pentium 4 in most cases.
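
To put numbers on the "performance = IPC * frequency" relation from this post, here is a minimal Python sketch. The 3.06GHz clock and the two IPC extremes (~0.03 and 2.0) are the figures quoted above; the function name is made up for illustration, and the result is only instructions completed per second, not a benchmark score.

```python
def instructions_per_second(ipc: float, clock_hz: float) -> float:
    """Throughput from the relation: performance = IPC * frequency."""
    return ipc * clock_hz


CLOCK_HZ = 3.06e9  # Pentium 4 3.06GHz, as mentioned in the post

# The two IPC extremes quoted in the post for the same processor.
worst_case = instructions_per_second(0.03, CLOCK_HZ)
best_case = instructions_per_second(2.0, CLOCK_HZ)

print(f"low-IPC code:  {worst_case:.3e} instructions/s")
print(f"high-IPC code: {best_case:.3e} instructions/s")
print(f"spread on identical hardware: {best_case / worst_case:.1f}x")
```

The clock never changes between the two calls, so the whole spread comes from the IPC the code happens to achieve.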

imgod2u · Nov 15, 2002

If the IPC was fixed, then a P4 at a certain frequency, say 2.4 GHz will always be x percent faster or slower (exactly) than an Athlon at a certain frequency, say 1.8 GHz. If it didn't vary per software, then how come the P4 can run faster than the Athlon in some applications and slower than the Athlon in other applications? The frequency stays the same. I mean, come on, this is just common sense. There's theoretical maximum IPC, and average achieved IPC. The former is fixed, the latter is not.
For reference, the P4 has a maximum instruction throughput of 1 x86 instruction per clock in non-repeatable code and 3 micro-ops per clock in repeatable code (assuming these micro-ops were distributed evenly among the execution units so that they're not a bottleneck). The Athlon has a theoretical maximum throughput of 3 x86 instructions per clock in all code.

thorin · Nov 15, 2002

Originally posted by: Accord99
Where am I wrong? The IPC is not a constant number for any processor. If the IPC is constant, then why is the P4 2400B faster than the P4 2400A? Or why is the Athlon significantly faster clock for clock than a P4 in RC5 but slower clock for clock in Prime95?

The RC5/Prime95 results are different due to SSE/3DNow etc. What one processor might be able to handle in one instruction, another may need 2 for if that particular operation hasn't been optimized as part of 3DNow or SSE, etc.....

Thorin

Bovinicus · Nov 15, 2002

In general, the P4 has a significantly lower IPC when compared to the Athlon XP. However, highly optimized applications can make more efficient use of the P4 (SSE2 optimized applications). So, the P4 does have a more competitive IPC in a number of applications. Programs containing lots of branches that are hard to predict seem to really hurt the P4's IPC rating.

Sohcan · Nov 15, 2002

Accord is absolutely correct, guys. Unfortunately many enthusiast sites confuse IPC with peak instruction dispatch rate (the maximum number of instructions a processor can dispatch from its reservation stations to the execution units). While the latter is fixed, the former is dependent on a large number of factors: peak fetch rate, peak issue rate, peak dispatch rate, reorder window size, peak retire rate, branch prediction (predictor layout and size, branch target buffer size and associativity, local vs. global predictor), memory hierarchy (multilevel cache size, latency, bandwidth, blocking vs. non-blocking caches, number of hits under misses for non-blocking caches, block sizes, set associativity, number of ports, main memory bandwidth and latency), in-order vs. out-of-order execution, speculative execution, number and organization of reservation stations, number of renaming registers, retirement policy (history file vs. future file vs. reorder buffer vs. register renamer), translation lookaside buffer size and associativity, clock rate, instruction set characteristics (number of logical registers, number of operands), the compiler, the software....the list goes on, and it can't possibly be quantified in a single number.

While the P4 has a peak dispatch rate of 6 uops/cycle and the Athlon 9 uops/cycle, they are both limited by a peak issue and retire rate of 3 uops/cycle. The peak dispatch rate is a rather poor indication of performance. As an example, with the P3's peak dispatch rate of 5 uops/cycle, the P4 and the Athlon respectively have a 20% and 80% higher dispatch rate, but this is by no means a good indication of the relative performance of these microprocessors.

The Athlon has a theoretical maximum throughput of 3 x86 instructions per clock in all code.

While the Athlon can fetch and decode 3 x86 instructions into 3 to 6 uops/cycle vs. the P4's peak fetch rate of 3 uops/cycle, both are limited by issuing (checking dependencies and sending the instructions to the reservation stations) and retiring 3 uops/cycle.
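
As a rough illustration of the point in this post, the Python sketch below treats the sustainable peak as the narrowest of the issue, dispatch, and retire stages, using only the uops/cycle figures quoted here. The `PeakRates` class and the function names are hypothetical, and since the P3's issue and retire rates are not given in the post, only its dispatch rate of 5 is used as the baseline.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PeakRates:
    """Peak per-cycle rates in uops/cycle, as quoted in the post."""
    issue: int
    dispatch: int
    retire: int

    def sustained_peak(self) -> int:
        # Over the long run a pipeline cannot complete more uops per cycle
        # than its narrowest stage lets through.
        return min(self.issue, self.dispatch, self.retire)


P4 = PeakRates(issue=3, dispatch=6, retire=3)
ATHLON = PeakRates(issue=3, dispatch=9, retire=3)
P3_DISPATCH = 5  # only the P3's dispatch rate is given in the post


def dispatch_advantage_pct(dispatch: int, baseline: int) -> float:
    """Percentage by which one peak dispatch rate exceeds a baseline."""
    return (dispatch / baseline - 1.0) * 100.0


for name, cpu in (("P4", P4), ("Athlon", ATHLON)):
    advantage = dispatch_advantage_pct(cpu.dispatch, P3_DISPATCH)
    print(f"{name}: dispatch advantage over P3 {advantage:.0f}%, "
          f"sustained peak {cpu.sustained_peak()} uops/cycle")
```

The dispatch figures differ a lot between the two chips, while the bound set by issue and retire comes out the same for both.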

DT4K · Nov 15, 2002

Sohcan,
Not disputing your info, I'm just a little confused.
I know the peak IPC is 6 for the P4 and 9 for the Athlon XP, as you stated. But what I don't understand is when you say they are both limited to a peak issue and dispatch rate of 3.

I assumed the main reason that the XP significantly outperforms the P4 at equal clock rates was the higher IPC. If this is not the case, why is there such a discrepancy in clock for clock performance? I have read lots of articles basically stating that performance can be roughly measured by clock speed X IPC. They point to the P4 and XP as an example of this. A rough generalization: 1400 MHz x 9 IPC = 12600 and 2000 MHz x 6 = 12000 would be a good reason that a 1400 MHz XP is close to the same performance as a 2000 MHz P4. This seems to be about right with the Willamette, but I guess it doesn't really fit with Northwood. A 1400 MHz XP would match up well with a 2000 MHz Willamette but not with the 2000 MHz Northwood.

Can you explain a little further why the clock for clock performance discrepancy if it's not because of a difference in IPC?

Thanks

Sohcan · Nov 15, 2002

Originally posted by: Shanti
Sohcan,
Not disputing your info, I'm just a little confused.
I know the peak IPC is 6 for the P4 and 9 for the Athlon XP, as you stated. But what I don't understand is when you say they are both limited to a peak issue and dispatch rate of 3.

6 and 9 uops/cycle is not their respective peak IPC, it's their peak dispatch/execution rate. Absolute peak IPC is determined by the peak fetch/issue and retire rate, which on any implemented microprocessor is always equal to or lower than the peak dispatch rate.

In a dynamically scheduled microprocessor (like any modern high-performance microprocessor except the Sun US-III and Itanium 1/2), instructions are fetched, decoded, and issued in program order. During the issue stage register dependencies are checked and register renaming is performed, in which the x86 logical registers used in the instructions are mapped to a larger set of physical registers. From there the instructions are sent to reservation stations, where they sit until their operand values are ready and the instruction can be sent to the execution units out of program order. It is out of the reservation stations to the execution units that the peak dispatch rate can be higher, i.e. 6 uops/cycle for the P4 and 9 uops/cycle for the Athlon. After execution, the instructions are buffered in a reorder buffer (a space is reserved for an instruction during issue), from which the results of the instructions are written to the register file or memory in program order. Thus the absolute peak performance is determined by the fetch/issue and retire rate, which for the Athlon and P4 are equal. The Athlon can decode more uops/cycle than the P4, but they both issue and retire 3 uops/cycle.

So using peak dispatch rate to judge IPC is rather fruitless, since the other microarchitectural parameters have a much greater effect. Having a higher peak dispatch rate can help dispatch more instructions/cycle out of the reservation stations after the reservation stations fill up due to a cache miss that goes to main memory, but in the grand scheme of things this buys very little extra performance since you are still limited by the lower retire rate.

I assumed the main reason that the XP significantly outperforms the P4 at equal clock rates was the higher IPC. If this is not the case, why is there such a discrepancy in clock for clock performance?

This is true, the XP tends to yield a higher IPC than the P4 on most workloads. What I was clarifying is that IPC is a function of a large number of parameters, and cannot be universally defined by a single number, i.e. the P4 has X IPC and the Athlon has Y IPC. Aside from the fact that the IPC can vary widely from one program to the next (between 0.5 and 1.5 on most common CPU-intensive desktop workloads) since software characteristics affect miss rates in caches, TLBs, branch prediction and the branch target buffer (the buffer that predicts the instruction address from which to fetch in the following clock cycle), the relative IPC between two microprocessors varies as well.

I have read lots of articles basically stating that performance can be roughly measured by clock speed X IPC. They point to the P4 and XP as an example of this. A rough generalization: 1400 MHz x 9 IPC = 12600 and 2000 MHz x 6 = 12000 would be a good reason that a 1400 MHz XP is close to the same performance as a 2000 MHz P4. This seems to be about right with the Willamette, but I guess it doesn't really fit with Northwood. A 1400 MHz XP would match up well with a 2000 MHz Willamette but not with the 2000 MHz Northwood.

Can you explain a little further why the clock for clock performance discrepancy if it's not because of a difference in IPC?

Well, the accurate equation for performance is execution time = # instructions * CPI (inverse of IPC) * clock cycle time. Assuming that two processors use the same compiler, compiler optimizations, and instruction set, you can ignore # of instructions. To compare performance between the P4 and XP using IPC and clock rate, you can't use peak dispatch rate...especially since their respective dispatch rates of 6 and 9 uops/cycle are FAR higher than the average IPC for most programs, which is between 0.5 and 1.5 x86 instructions/cycle on both the P4 and Athlon. I can go over more of this later (I have to head off now), but the generally lower IPC of the P4 is due to other factors, including a longer branch misprediction penalty, higher clock rate (because the P4 is clocked higher, memory latency with respect to the CPU clock cycle time is higher), peculiarities in how it handles FP instructions, and a number of other reasons.
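
The execution-time equation above can be sketched in a few lines of Python. The instruction count and both IPC values are hypothetical; the IPCs were simply picked from inside the 0.5 to 1.5 range mentioned in the post so that a 1.4 GHz part and a 2.0 GHz part come out even. They are not measurements of any real Athlon or P4.

```python
def execution_time_s(instructions: float, ipc: float, clock_hz: float) -> float:
    """execution time = # instructions * CPI * clock cycle time."""
    cpi = 1.0 / ipc                # cycles per instruction (inverse of IPC)
    cycle_time_s = 1.0 / clock_hz  # seconds per clock cycle
    return instructions * cpi * cycle_time_s


INSTRUCTIONS = 1e9  # hypothetical program length, same binary on both CPUs

# Hypothetical average IPCs, NOT measured values.
athlon_time = execution_time_s(INSTRUCTIONS, ipc=1.0, clock_hz=1.4e9)
p4_time = execution_time_s(INSTRUCTIONS, ipc=0.7, clock_hz=2.0e9)

print(f"1.4 GHz part at IPC 1.0: {athlon_time:.3f} s")
print(f"2.0 GHz part at IPC 0.7: {p4_time:.3f} s")
```

Under these assumptions, an achieved-IPC ratio of 1.0 to 0.7 is already enough to cancel a 2.0 to 1.4 clock advantage, with no need to bring the peak figures of 6 and 9 into it.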

DT4K · Nov 15, 2002

Thanks,
That makes a lot of sense and clarified some things I didn't understand.

Sohcan · Nov 15, 2002

Back to IPC and the P4 & Athlon...

Another reason that peak dispatch rate gives a poor indication of IPC on the Athlon and P4 is that the P4 shares dispatch ports between the integer and FP units whereas the Athlon does not. While this potentially gives the Athlon a dispatch bandwidth advantage in FP code, it gives a poor picture in integer code.

For integer code, the P4 can actually achieve a higher peak dispatch bandwidth than the Athlon. The P4 can dispatch up to 4 integer execution instructions per cycle, vs. the Athlon's 3. For integer loads and stores, the P4 can dispatch 2 address generation calculations per cycle vs. the Athlon's 3; both microprocessors then perform up to one load and one store memory operation per cycle.

There are a number of reasons that the P4 generally yields a lower IPC than the Athlon in integer code, but it's hard to gauge each parameter's effect on performance without detailed simulations. The largest source of IPC degradation is likely the P4's longer pipeline, and the reason is branch instructions. In microprocessors that perform dynamic speculation (as every high-performance MPU these days does), when a branch instruction is encountered, the processor must decide which path to follow. Branch prediction is performed to dynamically decide at run-time which outcome the branch takes, and the processor then speculates by executing down that path of the branch. The instructions that are executed after the speculation point cannot be retired until the branch condition is resolved, in order to be sure the program is executed correctly.

But when the speculation turns out to be incorrect, the microprocessor must squash the instructions that were executed after the speculation point by backing out their results using the reorder buffer, which holds the results of executed instructions before retirement. At a minimum, the penalty is proportional to the time between the speculation and when the branch is resolved; there may be an even greater penalty if the branch condition resolution is delayed due to a data dependency. While speculative execution has a large benefit for dynamic scheduling, its penalty can be equally large. The goal is to make sure that the branch prediction is accurate enough that in the common case, the speculation is successful. This effect makes branch prediction accuracy and misprediction penalty VERY important for performance; studies have shown that going from even 95% branch prediction accuracy to 100% increases IPC by 25-50% for moderate misprediction penalties.
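
To get a feel for how prediction accuracy and misprediction penalty feed into IPC, here is a simplified first-order stall model in Python. It is not the simulation methodology behind the studies mentioned above: it just adds the average misprediction stall to a base CPI. The base CPI, branch fraction, and penalty are all hypothetical inputs chosen for illustration.

```python
def ipc_with_branch_stalls(base_cpi: float, branch_fraction: float,
                           accuracy: float, penalty_cycles: float) -> float:
    """First-order model: CPI = base CPI + mispredicts/instruction * penalty."""
    mispredicts_per_instruction = branch_fraction * (1.0 - accuracy)
    return 1.0 / (base_cpi + mispredicts_per_instruction * penalty_cycles)


# Hypothetical inputs, not taken from any real P4 or Athlon measurement.
BASE_CPI = 0.5         # CPI if every branch were predicted correctly
BRANCH_FRACTION = 0.2  # one instruction in five is a branch
PENALTY = 20           # cycles lost per mispredicted branch

ipc_95 = ipc_with_branch_stalls(BASE_CPI, BRANCH_FRACTION, 0.95, PENALTY)
ipc_100 = ipc_with_branch_stalls(BASE_CPI, BRANCH_FRACTION, 1.00, PENALTY)
gain_pct = (ipc_100 / ipc_95 - 1.0) * 100.0

print(f"IPC at 95% accuracy:  {ipc_95:.2f}")
print(f"IPC at 100% accuracy: {ipc_100:.2f}")
print(f"gain from perfect prediction: {gain_pct:.0f}%")
```

With these made-up inputs the gain falls inside the 25-50% range quoted in the post, but other inputs would give other results. Doubling `PENALTY` while leaving everything else alone lowers the 95% case further, which is the longer-pipeline effect in miniature.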

Since the P4 has a relatively long pipeline, its average branch misprediction penalty is roughly twice as great as for the P3 and Athlon. Although the P4's more robust branch prediction offsets some of this disadvantage, it doesn't completely make up for it. Thus it is quite common that the P4's IPC may be up to 20-25% lower than that of the Athlon. It is also important to note that the very fact that the P4 is clocked higher also lowers its relative IPC; since it has a faster clock, memory with equal latency "appears" to have a higher latency relative to the CPU. This makes L2 cache misses more detrimental, although the P4's higher bandwidth and larger L2 alleviates some of this disadvantage.
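
The "memory appears slower at a higher clock" effect described above is plain unit conversion, sketched here in Python. The 100 ns latency is a hypothetical round number, not a measured figure for any chipset, and the two clock speeds are just the 1.4 GHz and 2.0 GHz examples used elsewhere in the thread.

```python
def latency_in_cycles(latency_ns: float, clock_ghz: float) -> float:
    """Convert a fixed memory latency into CPU clock cycles.

    A clock of N GHz ticks N times per nanosecond, so cycles = ns * GHz.
    """
    return latency_ns * clock_ghz


MEMORY_LATENCY_NS = 100.0  # hypothetical main-memory latency

for name, clock_ghz in (("1.4 GHz CPU", 1.4), ("2.0 GHz CPU", 2.0)):
    cycles = latency_in_cycles(MEMORY_LATENCY_NS, clock_ghz)
    print(f"{name}: a {MEMORY_LATENCY_NS:.0f} ns miss costs about {cycles:.0f} cycles")
```

The memory is identical in both cases, yet the faster-clocked CPU sits idle for more cycles per miss, which drags down instructions per cycle even if wall-clock time is no worse.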

As for FP code, it is important to realize that FP applications are not composed 100% of FP instructions; rather the mix is closer to around 25% FP instructions and 75% integer instructions on average. The P4's longer pipeline is not a disadvantage in this case, since FP code is more loop-intensive and generally has much more predictable branches than integer code. The Athlon's discrete issue ports may give it a small advantage; in FP code, the Athlon can dispatch one FP add/sub, one FP multiply, and one FP load/store in addition to its integer dispatch rates. While the P4's integer load/store dispatch rate remains the same, its FP add/multiply and FP load/store units share issue ports with the integer units. Combined with other factors (the P4's higher-latency FP instructions, lack of a fully pipelined FP multiply, greater sensitivity to non-aligned FP data in the caches, and limitation of issuing only one FXCH instruction per cycle, which swaps operands on the x87 FP stack), the P4 generally achieves a lower IPC in x87 FP code.
