IPC of different CPU's


cheeta05r
08-21-2002, 12:23 AM
I think there is some confusion about the IPC's of different processors. I think a simple article would be useful. Here is what I have come up with, but I think I may be wrong.

P4 IPC = 6
Itanium 2 IPC = 6
Athlon XP IPC = 9
K8 IPC = 10+

Sources: Athlon XP & P4

Still, even with that in mind, it's obvious that clock-frequency
isn't the sole deciding factor in system performance. If it was, the
P4 would have crushed the XP in the marketplace long ago. In reality,
the number of instructions a CPU can actually complete per cycle
(expressed as "IPC") is just as important as the number of cycles it
goes through in a second. The Pentium 4, with its ultra-long
hyperpipeline, is able to achieve astronomic clock frequencies, but
at the price of lower IPC performance. The Athlon XP, on the other
hand, goes through fewer cycles per second, but manages to get more
work done on each pass -- 9 instructions per cycle, as opposed to the
P4's 6 -- giving it a 150% advantage in IPC.
http://www.active-hardware.com/english/reviews/processor/xp-2100.htm

Source: Itanium 2

The Intel Itanium 2 processor is able to issue instructions at a
peak rate of six instructions per cycle, and is equipped with
hardware resources to ensure a sustained throughput that is closer to
this maximum. It has leading price performance, especially when
combined with the HP zx1 Chipset, which is a high bandwidth,
low-latency chipset designed to be cost-efficient.
http://www.hp.com/products1/itanium/performance/index.html

Source: K8

The upcoming AMD Hammer family promises a significant increase in
performance compared with the current generation of Athlon
processors. Some of the increase comes from the low latency on chip
memory controller, some of it comes from the extra architectural
registers of the x86-64 instruction set and some comes from a number
of new features that improve the ability of the Hammer
micro-architecture to recognize a higher IPC ( Instructions executed
Per Clock) than previous generation micro-architectures. Here we look
into some depth into the latter.
http://www.chip-architect.com/news/2002_06_24_Hammers_Two_Extra_PipelineStages.html

m0ti
08-21-2002, 02:54 AM
I'm not completely sure how accurate the numbers are, though they seem to be in line with others that I've heard. The thing to remember is that this is maximum IPC! In the case of Itanium, it is 100% dependent on static compilation to take advantage of its available 6 IPC. As for the others, this is all done at run-time, where instructions are dynamically checked for independence (of course, the compiler still sets the original order of the instructions, as well as which registers they access, etc., which definitely has a big effect on IPC). While this does allow more work to be done concurrently, it is costly in terms of chip real estate. VLIW processors throw out all the detection and renaming mechanisms, and all the various methods of allowing OOE (out-of-order execution). This tends to make them much simpler. Transmeta's Crusoe actually still does detection and register renaming, but it does so dynamically in software at run-time (since they wanted to make the lowest-power chip possible). However, VLIWs tend to suffer from the fact that their ISA is dependent on the number of functional units. Again, Crusoe gets around this by translating CISC instructions. Recently, in fact, they've doubled the data path from 128 bits to 256 bits, allowing them to (potentially) double their IPC.

If you look around, a lot of people are beginning to think that the end of super-scalar is upon us. IMO, there's still more that can be done there, but run-time software is starting to show its worth (see HP's Dynamo), and when put into hardware it could seriously improve performance. Of course, this is true of both super-scalar and VLIW designs.

I think that the results will be fairly dependent on the Itanium/Opteron battle. If Itanium wins, then the future will be VLIW (EPIC is more of an ISA designed to take advantage of VLIW, in my opinion). If Opteron wins, it's the continuation of the status quo. That doesn't mean the future won't eventually be VLIW, just that it'll continue to be super-scalar a while longer.

cheeta05r
08-21-2002, 11:22 AM
Here is something else I read.
[quote]
The maximum IPC of any Athlon processor is 3, and the same is true for the Pentium III and Pentium II.

The Pentium IV, on the other hand, is a little bit different. I don't have a lot of information to back me on the Pentium IV, but I believe it is capable of executing up to 5 instructions per clock cycle, 1 FPU instruction and 4 16-bit ALU instructions OR 1 FPU instruction and 2 32-bit ALU instructions. Correct me if I'm wrong.

The confusion comes from someone (who obviously doesn't know a thing about CPUs) who analyzed the Athlon's architecture and came to the conclusion that the Athlon is capable of executing 9 instructions per clock cycle.

3 instruction decoders + 6 instruction pipelines = 3 instructions per clock cycle. Yes, 3 + 6 = 9, but we're talking instructions per cycle, not operations per cycle.

Since the instruction stream passes through the instruction decoder at a maximum rate of 3 per cycle, the scheduler cannot issue more than 3 instructions to the pipelines per clock. Remember, the whole system runs only as fast as its slowest component--the bottleneck. It's that simple.[/quote]
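
The bottleneck argument in the quote can be sketched in a few lines of Python. This is only an illustration: the function name and the stage labels are made up here, and the widths are simply the ones the quote gives for the Athlon (3 decoders, 6 pipelines).

```python
def peak_sustained_rate(stage_widths):
    """Upper bound on sustained instructions per cycle.

    Every instruction has to pass through every stage, so the
    narrowest stage caps the whole pipeline.
    """
    return min(stage_widths.values())


# Widths taken from the quote above; the labels are hypothetical.
athlon_stages = {"decode": 3, "execute_pipelines": 6}

print(peak_sustained_rate(athlon_stages))
```

Adding the widths together (3 + 6) counts the same instruction twice as it moves through the stages, which is where the figure of 9 goes wrong in the quote's view.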

Sohcan
08-21-2002, 02:26 PM
[quote]
P4 IPC = 6
Itanium 2 IPC = 6
Athlon XP IPC = 9
K8 IPC = 10+
[/quote]
This is not accurate...first of all, IPC is dependent on numerous factors, including peak fetch rate, peak issue rate, reorder window size, peak retire rate, branch prediction, memory hierarchy (multilevel cache size, latency, bandwidth, blocking vs. non-blocking caches, block sizes, set associativity, main memory bandwidth and latency), in-order vs. out-of-order execution, translation lookaside buffer size, clock rate, instruction set characteristics, the software...it's a long list; the IPC is dependent on the microarchitecture of the microprocessor. :)

You've mixed two terms in the list above: peak fetch rate and peak issue rate. The fetch rate is the rate at which instructions are fetched, decoded, and queued into a reorder buffer (in the case of dynamically scheduled CPUs), after which instructions may be issued to the execution units at a higher rate than the fetch rate (the issue rate).

The P4, Athlon, and K8 are all fundamentally 3-way fetch superscalar CPUs; the Athlon and K8 have 3 parallel x86 decoders which can fetch and decode 3 x86 instructions/cycle into smaller RISC-like operations (uops). Most frequently, the x86 instructions get decoded to 1-2 uops, thus the max number of uops fetched/cycle is likely 3-6. In practice, due to x86 limitations, cache size and latency, and branch prediction and bandwidth limitations, the achieved throughput is around 1.5 to 2 uops/cycle. From their reorder buffers, the Athlon and K8 can issue 9 uops/cycle: 3 integer uops, 3 floating-point uops, and 3 address generation uops. The address generation execution units then feed into a load/store unit, which can perform two load/stores per cycle.
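
To make the fetch vs. issue arithmetic for the Athlon/K8 concrete, here is a small Python sketch. The constants are the figures from the paragraph above; the function name and the idea of plugging in an average uops-per-instruction value are just for illustration.

```python
DECODE_WIDTH = 3  # x86 instructions decoded per cycle (figure from the post)
ISSUE_WIDTH = 9   # uops issued per cycle from the reorder buffer (figure from the post)


def fetched_uops_per_cycle(avg_uops_per_instruction):
    """Peak uops entering the reorder buffer per cycle for a given decode mix."""
    return DECODE_WIDTH * avg_uops_per_instruction


low = fetched_uops_per_cycle(1)   # every instruction decodes to 1 uop
high = fetched_uops_per_cycle(2)  # every instruction decodes to 2 uops

print(low, high, ISSUE_WIDTH)
```

Even in the 2-uops-per-instruction case, the front end delivers fewer uops per cycle than the 9 the back end could issue, so the issue width cannot be sustained and is not a measure of IPC.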

The P4, on the other hand, predecodes uops and stores them in its trace cache, and fetches a maximum of 3 uops/cycle. The P4 has 7 execution units (two of which have a throughput of 2 instructions/cycle) compared to the Athlon and K8's 9, but some of the P4's execution units share issue ports. The P4 can issue a maximum of 4 integer uops, 2 floating-point uops, and 2 load/store uops, with a maximum of 6 uops/cycle.

The Itanium 2, on the other hand, is a true 6-way fetch core. It has far more execution resources than any x86 CPU, with 6 integer ALUs, 6 multimedia ALUs, 2 double-precision FPUs, 2 single-precision FPUs, 2 load units, 2 store units, and 3 branch units. It can issue 6 instructions/cycle from its in-order bundle buffer. With (among other things) its higher peak fetch rate, far more sophisticated cache and memory hierarchy, shorter pipeline, and nifty compiler techniques (loop unrolling and software pipelining), it achieves a much higher "IPC" than either the Athlon or P4. Despite its much lower clock rate, in SPECint2K (http://www.aceshardware.com/SPECmine/index.jsp?b=0&s=1&v=1&if=0&r1f=2&r2f=0&m1f=0&m2f=0&o=0&o=1) it scores similarly to the P4 and Athlon, and it far outscores any x86 CPU in SPECfp2K (http://www.aceshardware.com/SPECmine/index.jsp?b=2&s=1&v=1&if=0&r1f=2&r2f=0&m1f=0&m2f=0&o=0&o=1&start=20).

This is just a little hint of what's involved in MPU microarchitecture...trying to assign a particular number to the IPC of a microprocessor is fruitless, especially since it's variable depending on the ISA, clock rate, and software used. Hence, as you can see in your first quotation above, many online hardware enthusiast sites tend to confuse "IPC" with peak issue rate. The P3 actually has a peak issue rate of 5 uops/cycle, one fewer than the P4. Since the P3 performed comparably to the Athlon classic and the TBird, the difference in attained IPC between the Athlon and P4 can hardly be solely attributed to the Athlon's 50% higher peak issue rate.

cheeta05r
08-21-2002, 06:49 PM
Thanks. I thought something seemed wrong.

BurntKooshie
08-22-2002, 05:28 PM
Perhaps something worth emphasizing (which Sohcan did, albeit buried in his post) is that the IPC of a given processor will vary according to frequency, all else being equal. The "IPC" of an Athlon XP at 1.5 GHz will likely not be the same as that of an Athlon XP at 2.0 GHz for most applications. This is the whole reason why, for a given market, it is also important to look at how the processor's performance scales with clock rate, the headroom for frequency growth, and the time frame in which it will be able to achieve those greater frequencies.
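
One way to see why attained IPC would drop as the clock goes up is a toy model in Python. It assumes the main cause is that memory latency stays fixed in nanoseconds, so each cache miss costs more cycles at a higher clock; that is one plausible reading of "all else being equal", not something measured. Every number below (base CPI, miss rate, latency) is a made-up placeholder, not data for any real Athlon XP.

```python
def effective_ipc(freq_ghz, base_cpi=0.7, misses_per_instr=0.005,
                  mem_latency_ns=150.0):
    """Attained IPC under a simple 'core CPI + memory stall' model.

    All default parameter values are hypothetical.
    ns * GHz = cycles, so the same miss costs more cycles at a higher clock.
    """
    stall_cycles_per_instr = misses_per_instr * mem_latency_ns * freq_ghz
    return 1.0 / (base_cpi + stall_cycles_per_instr)


def instructions_per_ns(freq_ghz, **model_params):
    """Actual throughput: attained IPC times clock rate."""
    return effective_ipc(freq_ghz, **model_params) * freq_ghz


for f in (1.5, 2.0):
    print(f"{f:.1f} GHz: IPC = {effective_ipc(f):.3f}, "
          f"throughput = {instructions_per_ns(f):.3f} instr/ns")
```

In this model the 2.0 GHz part has a lower IPC than the 1.5 GHz part, and its throughput grows by less than the clock increase, so neither the clock rate nor a single "IPC" figure predicts performance by itself.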

The reason I think it is worth emphasizing that IPC is itself a function of the processor's clock rate is that it is a great example of how poor BOTH frequency and "IPC" are, on their own, for determining performance.
/me gets off his horse.