There is no single number for "the" IPC of a CPU....to blatantly rip off one of my previous posts:
<<
IPC is a function of fetch and decode width, dispatch and retire width, reorder buffer size, # of functional units, instruction latency characteristics, # of issue ports, pipeline length, misprediction penalty, branch prediction rate, # of renaming registers, # of logical registers, instruction format (stack vs. 2 register vs. 3 register), L1 access time, L1 hit rate, L2 access time, L2 hit rate, main memory access time, out-of-order execution vs. in-order execution, out-of-order retirement vs. in-order retirement, out-of-order vs. in-order memory access, TLB access time and hit rate, scheduling policies, cache replacement policies, code characteristics.... >>
In other words, the entire microarchitecture.

One thing to note is that the memory subsystem is extremely important, and that the hit rates for the caches and the branch predictor vary depending on the code that is being executed.
The "upper limit" to the IPC is generally considered to be closely related to the fetch width, dispatch width, and retire width....typically, the fetch and retire widths are the same, but the dispatch width can be higher. In an out-of-order superscalar core, instructions are fetched into reorder buffers before they are dispatched out-of-order to the execution units. If a memory access stalls the instructions that depend on it (read-after-write dependencies), the buffers can fill up while fetch keeps going....when the memory stall is done, the instructions can "burst" out of the reorder buffers at a rate higher than the fetch width. But the average throughput is still limited by the fetch and retire width.
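To make the "burst" behavior concrete, here's a toy simulation. It's only a sketch: the function name, the buffer sizes, the stall length, and the stall period are all made-up numbers for illustration, not measurements of any real core. It just counts instructions going into and out of a buffer, with dispatch blocked during a periodic pretend memory stall.

```python
def simulate(fetch_width=3, dispatch_width=6, buffer_size=48,
             stall_cycles=10, stall_every=50, total_cycles=1000):
    """Toy model (hypothetical parameters): returns (average dispatch rate,
    peak dispatch rate) in instructions per cycle."""
    buffered = 0          # instructions waiting in the reorder buffer
    dispatched_total = 0
    peak_dispatch = 0
    for cycle in range(total_cycles):
        # Fetch as many instructions as the fetch width and free space allow.
        buffered += min(fetch_width, buffer_size - buffered)
        # Pretend a memory stall blocks dispatch at the start of each period.
        stalled = (cycle % stall_every) < stall_cycles
        if not stalled:
            n = min(dispatch_width, buffered)
            buffered -= n
            dispatched_total += n
            peak_dispatch = max(peak_dispatch, n)
    return dispatched_total / total_cycles, peak_dispatch


# Big buffer: the stall is absorbed, and the backlog drains in a burst.
avg_big, peak_big = simulate(buffer_size=48)
# Small buffer: it fills up during the stall, so fetch has to stop too.
avg_small, peak_small = simulate(buffer_size=24)
print(avg_big, peak_big)
print(avg_small, peak_small)
```

In both runs the peak dispatch rate reaches the dispatch width, but the average never exceeds the fetch width, and it drops below the fetch width once the buffer is too small to ride out the stall.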
The Athlon is a 3-way fetch / 9-way (I think) dispatch core, and the P4 is a 3-way fetch / 6-way dispatch core. In terms of x86 instructions, they both average between .8 and ~1.15 instructions/cycle (obviously, the Athlon often has a higher IPC given the same code). But the x86 instructions are decoded into shorter RISC-like micro-ops; the fetch and dispatch rates are for micro-ops, not x86 instructions. Typically, x86 instructions are decoded into 1 or 2 micro-ops, sometimes more for the rarely used archaic instructions. IIRC, for the "higher IPC" x86 code, the average is 1.4 micro-ops/x86 instruction, and for the "lower IPC" code, the average is 1.6 micro-ops/x86 instruction. That places the average IPC, in terms of the micro-ops, at around 1.28 - 1.6.
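The conversion above is just a multiplication, sketched here. The pairing is my reading of the numbers in the post: the "lower IPC" code goes with 0.8 x86 instructions/cycle and 1.6 micro-ops per instruction, and the "higher IPC" code goes with ~1.15 and 1.4. The function name is made up for the example.

```python
def uop_ipc(x86_ipc, uops_per_x86_instruction):
    """Micro-ops per cycle = (x86 instructions / cycle) * (micro-ops / x86 instruction)."""
    return x86_ipc * uops_per_x86_instruction


# Figures quoted in the post (recalled from memory there, so treat as rough).
low = uop_ipc(0.8, 1.6)     # "lower IPC" code
high = uop_ipc(1.15, 1.4)   # "higher IPC" code
print(f"{low:.2f} - {high:.2f} micro-ops/cycle")
```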
The G4e, IIRC, is a 3-way fetch / 6-way dispatch core. I don't know any performance numbers off-hand, but consider the Alpha EV6, which is an uber-expensive high-end RISC CPU. It is a 4-way fetch / 6-way dispatch core, but even with its large amounts of cache, the upper limit to its IPC (considered one of the best for a superscalar, out-of-order core) is around 2.0 (you can see why simultaneous multithreading is of such interest to CPU architects). So for comparison's sake, consider the G4e's average IPC to be slightly higher than the numbers for the Athlon/P4, but definitely below that of the EV6.
Also, be aware that performance is not IPC * clock rate....the Iron Law says that Execution time = (cycles/instruction) * (seconds/cycle) * (instructions/program). The last term is variable, depending on the compiler used and the instruction set. In theory, RISC ISAs have longer code path lengths than CISC ISAs, since a RISC instruction tends to do "less work" than a CISC instruction. I don't know how true this is anymore...modern x86 compilers don't use the more archaic "ultra CISCy" x86 instructions that often (if ever), and I think register - register arithmetic is used more often than register - memory. But this is definitely an issue to consider, depending on the state of the compilers used for x86 programs vs. PowerPC programs.
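Here's the Iron Law as a quick sketch. The two "CPUs" and all of their numbers are invented purely to show the effect of the instructions/program term; they are not figures for any real x86 or PowerPC chip.

```python
def execution_time(instructions, ipc, clock_hz):
    """Iron Law: time = (instructions/program) * (cycles/instruction) * (seconds/cycle)."""
    cycles_per_instruction = 1.0 / ipc
    seconds_per_cycle = 1.0 / clock_hz
    return instructions * cycles_per_instruction * seconds_per_cycle


# Hypothetical CPU A: shorter code path, lower IPC.
time_a = execution_time(instructions=1.0e9, ipc=1.0, clock_hz=1.5e9)
# Hypothetical CPU B: 20% more instructions for the same program, 20% higher IPC.
time_b = execution_time(instructions=1.2e9, ipc=1.2, clock_hz=1.5e9)
print(f"A: {time_a:.3f} s, B: {time_b:.3f} s")
```

Same clock rate, B has the higher IPC, and yet both finish in the same time because B has to execute more instructions to do the same work. That's why IPC * clock rate on its own doesn't tell you which one is faster.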