Originally posted by: drag
However the Pentium 4 has 128 GPR's, which is 4 times as much as PowerPC's ISA uses.
It can do this and still be x86 because it continuously renames the registers so that each time the software sees the hardware it's presented with a different group of 8 registers. Thus the Pentium 4 can process much more at once than would normally be possible.
Just a nit-pick, the P4's 128 rename registers cannot be called GPRs, they are not accessible to software. Your understanding is off on rename registers vs. logical (architected) registers.
Having a pool of physical renaming registers that is larger than the number of logical registers is necessary to support out-of-order execution. Take the following snippet of code: (the register to the left of "=" is the destination register, the two to the right are the source operands)
add r1 = r2,r3
add r4 = r2,r1
There is a RAW (read-after-write) dependency between the two instructions; the second add uses r1, a result of the first add, so the two instructions cannot be executed out-of-order. But say you had the following code:
add r1 = r2,r3
add r3 = r2,r4
There is a WAR (write-after-read) dependency...the second add writes to r3, which the first add uses as a source operand. While the second add does not use the result of the first, you still cannot execute these instructions out-of-order, because the second add would overwrite whatever value used to be in r3, changing the result for the first add. Say there are only 4 registers in the instruction set...the compiler, running out of logical registers, was forced to re-use register 3 as a destination operand for the second add...the ordering constraint does not reflect any data actually flowing between the two instructions.
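To make the hazard types concrete, here is a minimal Python sketch that classifies the dependency between an earlier and a later instruction. The `Instr` tuple and the `hazards` function are names made up for this illustration; they only model the "dest = src,src" notation used in this post, not any real tool.

```python
from typing import NamedTuple


class Instr(NamedTuple):
    """Hypothetical model of 'add dest = src1,src2' as written in this post."""
    dest: str
    srcs: tuple


def hazards(first, second):
    """Return the hazards that constrain `second` relative to the earlier `first`."""
    found = set()
    if first.dest in second.srcs:
        found.add("RAW")  # second reads the value first produces (true dependency)
    if second.dest in first.srcs:
        found.add("WAR")  # second overwrites a register first still has to read
    if second.dest == first.dest:
        found.add("WAW")  # both write the same register; final value must be second's
    return found


# The two snippets from the post.
raw_pair = (Instr("r1", ("r2", "r3")), Instr("r4", ("r2", "r1")))
war_pair = (Instr("r1", ("r2", "r3")), Instr("r3", ("r2", "r4")))

print(hazards(*raw_pair))  # {'RAW'}
print(hazards(*war_pair))  # {'WAR'}
```

Only RAW reflects data actually flowing from one instruction to the other; WAR and WAW come purely from register names being re-used.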
The solution to the WAR (and WAW, write-after-write) problem is register renaming. Instead of using the architected registers to feed data to the operands of instructions, we have a larger pool of physical registers. A structure maintains the current mapping of logical registers to physical registers...instructions are "renamed" in program order, and each "right hand side" (of the equals sign) source register uses the current logical-to-physical mapping, and the destination register receives a new logical-to-physical mapping. Any subsequent instruction that uses this logical register as a source uses this new mapping.
Reusing the WAR example, let's say (magically) that currently logical register r1 maps to physical register r1 (ditto for r2, r3, r4). Say there are 8 physical registers, so register r5 to register r8 are currently "free" (they have no mapping). The first instruction is renamed, and it just so happens that it uses the same physical register numbers. The second instruction is renamed, and the logical source registers r2 and r4 receive the current mappings to physical registers r2 and r4. The destination register, r3, gets a new mapping to physical register r5. So the code now looks like:
add r1 = r2,r3
add r5 = r2,r4
The instructions can now be executed out-of-order, and store their results without fear of violating some dependency between the instructions. If, for example, the first add depended on a memory access and was stalled, the second add could continue to execute. Any subsequent instruction that uses logical r3 as a source will use the mapping that the second add produced, so it will get mapped to r5. Thus, once the second add has executed, any subsequent instruction that used its destination register mapping can execute.
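The renaming step can be sketched in a few lines of Python. This is only an illustration under assumptions: the `rename` function, the `p1`..`p8` names for physical registers, and the simple first-free allocation order are all invented here. Unlike the hand-worked example, the sketch gives every destination a fresh physical register, so the first add gets a new one too.

```python
def rename(program, num_logical=4, num_physical=8):
    """Rename (dest, src1, src2) tuples of logical registers in program order.

    Assumed starting state: logical rN maps to physical pN, the rest are free.
    """
    mapping = {f"r{n}": f"p{n}" for n in range(1, num_logical + 1)}
    free = [f"p{n}" for n in range(num_logical + 1, num_physical + 1)]
    renamed = []
    for dest, src1, src2 in program:
        # Sources use the current mapping, looked up before the destination changes.
        phys_srcs = (mapping[src1], mapping[src2])
        if not free:
            raise RuntimeError("out of physical registers; a real core would stall")
        # The destination gets a fresh physical register and a new mapping.
        mapping[dest] = free.pop(0)
        renamed.append((mapping[dest], *phys_srcs))
    return renamed, mapping


war_example = [("r1", "r2", "r3"), ("r3", "r2", "r4")]
renamed, final_map = rename(war_example)
for dest, src1, src2 in renamed:
    print(f"add {dest} = {src1},{src2}")
# add p5 = p2,p3
# add p6 = p2,p4
```

The second add no longer writes anything the first add reads, so the WAR hazard is gone. Running the RAW example through the same function still leaves the second add reading the first add's new destination, because a true dependency cannot be renamed away.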
At the end of the pipeline, the instructions have to be "retired" in program order. When an instruction retires, the physical register that its logical destination was previously mapped to gets returned to the list of free physical registers, since no later instruction can still need that old value. This method I've described is an overview of how the MIPS R10000 does out-of-order and renaming...other out-of-order CPUs have used similar or different methods of achieving renaming.
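Here is a rough sketch of the free-list bookkeeping around retirement. It assumes the R10000-style detail that the register recycled at retirement is the one the logical destination was mapped to before the retiring instruction renamed it. The `RenameState` class and its method names are hypothetical, made up for this post.

```python
from collections import deque


class RenameState:
    """Hypothetical map table plus free list; logical rN starts mapped to pN."""

    def __init__(self, num_logical=4, num_physical=8):
        self.mapping = {f"r{n}": f"p{n}" for n in range(1, num_logical + 1)}
        self.free = deque(f"p{n}" for n in range(num_logical + 1, num_physical + 1))
        # (logical dest, previous physical register), kept in program order.
        self.in_flight = deque()

    def rename_dest(self, dest):
        if not self.free:
            raise RuntimeError("no free physical register; rename must stall")
        previous = self.mapping[dest]
        self.mapping[dest] = self.free.popleft()
        self.in_flight.append((dest, previous))
        return self.mapping[dest]

    def retire_oldest(self):
        # In-order retirement: the oldest instruction goes first, and the
        # register holding the overwritten (now dead) value is recycled.
        _dest, previous = self.in_flight.popleft()
        self.free.append(previous)
        return previous


state = RenameState()
state.rename_dest("r1")   # first add:  r1 now maps to p5, old mapping p1 remembered
state.rename_dest("r3")   # second add: r3 now maps to p6, old mapping p3 remembered
print(len(state.free))    # 2
state.retire_oldest()     # first add retires, p1 goes back on the free list
print(list(state.free))   # ['p7', 'p8', 'p1']
```

With only 8 physical registers, at most 4 renamed destinations can be in flight before `rename_dest` has to stall, which is the reason a bigger physical register pool supports a bigger instruction window.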
Anyway, what this gets down to is that the physical renaming registers are not software visible. Having more physical registers allows you to keep more instructions in-flight...the P4 can have 126 instructions in-flight, so it has 128 renaming registers. The POWER4/PPC 970 can have 200 instructions in-flight, so it has somewhere around 180-190 renaming registers as I recall.
But this doesn't get around the fundamental 8 architected register limitation of x86, which is all the software can see. Regardless of how many physical registers are present, if a compiler needs more than 8 registers for a particular procedure, it has to start spilling registers to memory, which is going to reduce the amount of instruction-level parallelism. Having fewer architected registers also limits compiler optimizations, such as loop-unrolling and software pipelining.
Hrrm, that ended up being a lot longer than I intended...hopefully it makes a little bit of sense.
