Average IPC

podspi

Golden Member
Jan 11, 2011
1,969
75
91
As Bulldozer's launch inches nearer, there has been a lot of talk about potential IPC. Depending on which reading of the optimization manual you believe, BD's theoretical max IPC is somewhere between 2 and 4.


From my (relatively layman's) understanding, K10's theoretical IPC is ~3, but it pretty much never reaches anything close to that. I was just wondering what average IPC modern CPUs actually achieve. I was able to find some IPC numbers for C2D:

http://www.ece.lsu.edu/lpeng/papers/isast08.pdf

C2D doesn't ever appear to break 2 IPC, at least in SPEC 2006. Does anybody have any more modern numbers for PhII and Sandy Bridge (or even Nehalem)? I've also noticed that some older CPU reviews mention actual IPC, but newer reviews don't seem to.


And before everybody jumps in telling me to stop worrying about IPC: yes, I know IPC is not the end-all of single-thread performance (can't forget about that clock part), nor is single-thread performance the end-all of throughput.



However, from an efficiency point of view, if average IPC for modern processors is still < 2, and Bulldozer's theoretical max IPC turns out to be 2 but it actually approaches that limit, I would call that an incredibly clever design.
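For anyone who wants to check their own chip: on Linux, perf stat will report instruction and cycle counts (and their ratio) for a whole run. The sketch below is a minimal, self-contained way to do the same thing through perf_event_open, with a made-up toy loop standing in for the real workload, so treat it as a starting point rather than a proper benchmark.

Code:
/* Minimal IPC measurement sketch (assumes Linux with perf events
 * enabled for unprivileged users). The loop below is a stand-in
 * workload; the counter setup is the interesting part. */
#include <linux/perf_event.h>
#include <sys/ioctl.h>
#include <sys/syscall.h>
#include <unistd.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

static int open_counter(uint64_t config)
{
    struct perf_event_attr attr;
    memset(&attr, 0, sizeof(attr));
    attr.type = PERF_TYPE_HARDWARE;
    attr.size = sizeof(attr);
    attr.config = config;          /* which hardware event to count */
    attr.disabled = 1;             /* start stopped, enable explicitly */
    attr.exclude_kernel = 1;
    attr.exclude_hv = 1;
    /* pid = 0 (this process), cpu = -1 (any), no group, no flags */
    return (int)syscall(__NR_perf_event_open, &attr, 0, -1, -1, 0);
}

int main(void)
{
    int insn_fd = open_counter(PERF_COUNT_HW_INSTRUCTIONS);
    int cyc_fd  = open_counter(PERF_COUNT_HW_CPU_CYCLES);
    if (insn_fd < 0 || cyc_fd < 0) {
        perror("perf_event_open");
        return 1;
    }

    ioctl(insn_fd, PERF_EVENT_IOC_ENABLE, 0);
    ioctl(cyc_fd,  PERF_EVENT_IOC_ENABLE, 0);

    /* Toy workload -- replace with the code you actually care about. */
    volatile uint64_t sum = 0;
    for (uint64_t i = 0; i < 100000000ULL; i++)
        sum += i;

    ioctl(insn_fd, PERF_EVENT_IOC_DISABLE, 0);
    ioctl(cyc_fd,  PERF_EVENT_IOC_DISABLE, 0);

    uint64_t insns = 0, cycles = 0;
    if (read(insn_fd, &insns, sizeof insns) < 0 ||
        read(cyc_fd, &cycles, sizeof cycles) < 0) {
        perror("read");
        return 1;
    }
    printf("instructions=%llu cycles=%llu IPC=%.2f\n",
           (unsigned long long)insns, (unsigned long long)cycles,
           cycles ? (double)insns / (double)cycles : 0.0);
    return 0;
}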
 

Tuna-Fish

Golden Member
Mar 4, 2011
1,415
1,734
136
IPC depends very much on the code you are running. I've seen hand-optimized crypto code that got ~4 IPC on Nehalem, while most compiled code will never break 2. Different optimizations can also dramatically affect IPC -- notably, merging ALU ops with memory loads will halve the number of instructions you execute, and thus the IPC, while usually making the program faster.
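To make that first point concrete, here's a toy C sketch of my own (not from any manual): the same array sum written two ways. Compiled without -ffast-math, the first version is stuck behind a single FP-add dependency chain, while the second keeps four independent chains in flight, retires more instructions per clock, and runs several times faster on any wide out-of-order core.

Code:
#include <stddef.h>

/* Single dependency chain: every add has to wait for the previous one,
 * so the core mostly sits on FP-add latency. */
double sum_chained(const double *a, size_t n)
{
    double s = 0.0;
    for (size_t i = 0; i < n; i++)
        s += a[i];
    return s;
}

/* Same work split over four independent accumulators: the adds can
 * overlap, so more instructions retire per clock. */
double sum_unrolled(const double *a, size_t n)
{
    double s0 = 0.0, s1 = 0.0, s2 = 0.0, s3 = 0.0;
    size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        s0 += a[i];
        s1 += a[i + 1];
        s2 += a[i + 2];
        s3 += a[i + 3];
    }
    for (; i < n; i++)          /* leftover elements */
        s0 += a[i];
    return (s0 + s1) + (s2 + s3);
}

Time both over the same large array and the IPC gap shows up immediately, even though they do exactly the same arithmetic.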

So what I mean is that you cannot take IPC numbers from here and there and expect them to be comparable in any meaningful way. At the very least you should be running the same program, compiled with the same compiler with the same settings. Which poses its own complications, because some compiler optimizations are better for some processors than for others.

However, from an efficiency point of view, if average IPC for modern processors is still < 2, and Bulldozer's theoretical max IPC turns out to be 2 but it actually approaches that limit, I would call that an incredibly clever design.

In practice, both the dependencies in the code you are currently running and the processor it's running on limit the maximum attainable IPC, and what is actually reached is the lower of the two. Code is also far from uniform -- in normal programs there are typically segments where IPC could reach >5 if the processor had the resources, right next to segments where a memory reference drops the maximum attainable IPC to <0.01. If BD can only ever reach 2 IPC, it has no hope of approaching that limit, because you raise the average IPC by running much faster when nothing is holding you back (and by reducing the number of times you hit the wall hard, through better branch prediction and a better memory unit).
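The memory-wall end of that range is easy to reproduce. A toy example (mine, not from any real codebase): chase pointers through a list that doesn't fit in the caches. Each hop is only a few instructions but waits out a full memory round-trip, so attainable IPC collapses to a few hundredths.

Code:
#include <stdint.h>

struct node {
    struct node *next;
    uint64_t pad[7];            /* pad each node out to a cache line */
};

uint64_t chase(struct node *p, uint64_t hops)
{
    uint64_t visited = 0;
    while (p && hops--) {       /* ~3-4 instructions per hop, but the  */
        p = p->next;            /* load of p->next can stall for       */
        visited++;              /* hundreds of cycles on a miss        */
    }
    return visited;
}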

Which is why it's good that a BD core has the most integer execution resources of any AMD core ever. :)
 

podspi

Golden Member
Jan 11, 2011
1,969
75
91
Which is why it's good that a BD core has the most integer execution resources of any AMD core ever. :)

Thanks for the insightful post :biggrin: Do you mean core or module? I was under the impression that BD barely had any integer execution resources... Is it the AGLUs?
 

Tuna-Fish

Golden Member
Mar 4, 2011
1,415
1,734
136
Thanks for the insightful post :biggrin: Do you mean core or module? I was under the impression that BD barely had any integer execution resources... Is it the AGLUs?

Well, they and the fact that nearly all real compiled x86 code has 1/3 or more memory ops, which no longer block the integer pipelines.
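Rough illustration of the memory-op fraction (hedged -- the exact instruction count depends on the compiler and options): even a trivial loop like the one below typically compiles, scalar at -O2, to something like two loads, an add, a store, an index update and a compare-and-branch, so roughly half the instructions touch memory.

Code:
#include <stddef.h>

void add_arrays(double *dst, const double *a, const double *b, size_t n)
{
    for (size_t i = 0; i < n; i++)
        dst[i] = a[i] + b[i];   /* 2 loads + 1 store per iteration */
}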
 

Idontcare

Elite Member
Oct 10, 1999
21,118
58
91
IPC depends very much on the code you are running.

Building on Tuna's excellent post, I'd just like to point out the obvious: IPC stands for Instructions Per Clock.

Naturally, the IPC for any given architecture depends strictly on the specific instructions you are observing.

Average IPC is, in effect, an average of the throughputs of the various instructions supported by the ISA. For a modern x86 processor that is over 700 instructions.

[Image: growth of the x86 instruction set over time]


Each of those 700 instructions has its own "IPC".

There are tools to measure IPC on a per-instruction basis. Everest includes a cool benchmark that determines the instruction latencies of your specific processor (they can depend on stepping, microcode, mobo BIOS, etc.).

http://www.behardware.com/articles/623-5/intel-core-2-duo-test.html
We felt that it was interesting to observe Core's behaviour on common x86 instructions such as arithmetic operations, shifts, and rotations. We used a tool integrated into Everest which provides the latency and transfer rates of several instructions chosen from among x86/x87, MMX, SSE 1, 2 and 3. This tool is included in the evaluation version; you just have to right-click in the status bar of Everest, select « CPU Debug » and then « Instructions latency dump » in the menu.
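In the same spirit, here's a rough sketch of how that kind of latency measurement works under the hood (my own toy, assuming x86-64 with GCC or Clang): time a long chain of dependent ADDs with the TSC and divide by the iteration count. Real tools pin the thread, fix the clock and repeat many times, and TSC ticks only equal core cycles if the clock isn't scaling, so take the output as a ballpark.

Code:
#include <stdio.h>
#include <stdint.h>
#include <x86intrin.h>          /* __rdtsc() */

int main(void)
{
    const uint64_t iters = 100000000ULL;
    uint64_t x = 1, y = 1;

    uint64_t start = __rdtsc();
    for (uint64_t i = 0; i < iters; i++) {
        /* One dependent ADD per iteration; the inline asm keeps the
         * compiler from collapsing the chain. */
        __asm__ volatile("add %1, %0" : "+r"(x) : "r"(y));
    }
    uint64_t end = __rdtsc();

    printf("x=%llu, ~%.2f TSC ticks per dependent add\n",
           (unsigned long long)x, (double)(end - start) / (double)iters);
    return 0;
}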

When you run a software program, the program itself will exercise a specific mix of instructions.

Depending on the individual IPC of those instructions, combined with any data-driven latencies (pipeline stalls, waiting for fetches, etc.), the aggregate - or average - IPC you observe when benchmarking any given app will vary.
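As a toy back-of-the-envelope example of how the mix and the stalls combine (all numbers invented for illustration, and the model is crudely serial, so it ignores the overlap a real out-of-order core gets):

Code:
#include <stdio.h>

int main(void)
{
    /* Hypothetical mix per 100 instructions -- made-up numbers. */
    double alu  = 65, alu_tput  = 3.0;   /* simple ALU ops, 3 per clock     */
    double mem  = 30, mem_tput  = 2.0;   /* loads/stores that hit cache     */
    double slow = 5,  slow_cost = 20.0;  /* misses/mispredicts, cycles each */

    /* Crude serial model: no overlap between the classes. */
    double cycles = alu / alu_tput + mem / mem_tput + slow * slow_cost;
    double ipc    = (alu + mem + slow) / cycles;
    printf("estimated cycles = %.1f, aggregate IPC = %.2f\n", cycles, ipc);
    return 0;
}

Run it and the "average IPC" lands well under 1, even though two of the three instruction classes could individually sustain 2-3 per clock. That is exactly why measured numbers sit so far below the theoretical width of the machine.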