The P4, on the other hand, uses the normal FP register stack (each 80-bit) to store FP data
You're confusing SSE1/2 with MMX and 3DNow...the latter two SIMD sets alias their 64-bit vector datatypes onto the 80-bit registers of the x87 FP stack. SSE, as implemented on the P3, P4 (including SSE2), and the Athlon XP, has a completely independent register file of 8 x 128-bit registers...this is akin to Altivec, although Altivec has 32 architectural 128-bit SIMD registers instead of 8. Altivec, aside from any particular microarchitectural implementation, is also a bit more powerful than MMX/3DNow/SSE/SSE2 since its instructions can take four distinct operands: two sources, one destination, and one filter/modifier that is handy for permute operations. The x86 SIMD extensions unfortunately follow in the tradition of x86/x87 by only having two operands: one source, and another that doubles as both a second source and the destination.
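To make the operand difference concrete, here's a rough sketch in plain C (not real intrinsics). The vec128 type and the add2/perm4 names are made up for illustration; perm4 is modeled on how Altivec's permute picks bytes out of the two sources using the low 5 bits of each control byte.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical 128-bit vector type for illustration: 16 bytes. */
typedef struct { uint8_t b[16]; } vec128;

/* Two-operand form (x86 SIMD style): dst is both the second source and
   the destination, so its old value is overwritten. Byte-wise wrapping add. */
static void add2(vec128 *dst, const vec128 *src) {
    assert(dst && src);
    for (int i = 0; i < 16; i++)
        dst->b[i] = (uint8_t)(dst->b[i] + src->b[i]);
}

/* Four-operand form (Altivec permute style): the low 5 bits of each control
   byte select one byte from the 32-byte concatenation a||b. Both sources
   survive, since the result goes to a separate destination. */
static void perm4(vec128 *dst, const vec128 *a, const vec128 *b,
                  const vec128 *ctl) {
    assert(dst && a && b && ctl);
    vec128 out;
    for (int i = 0; i < 16; i++) {
        unsigned sel = ctl->b[i] & 0x1Fu;
        out.b[i] = (sel < 16) ? a->b[sel] : b->b[sel - 16];
    }
    *dst = out;  /* written last, so dst may alias a, b, or ctl */
}
```

With the two-operand form, keeping the original value of dst around means spending an extra register-to-register copy first; with the four-operand form you don't have to.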
One important distinction between the implementations of Altivec and SSE is that while the G4 has a full 128-bit datapath to and from its vector units, the P3, P4 and Athlon have a 64-bit datapath, probably because their vector units share execution resources with their FP units. Thus a 128-bit vector operation must be divided into two sequential 64-bit operations, halving the potential throughput. These are just microarchitectural features, not dependent on the SIMD instruction set specification; it is possible that a future x86 MPU could feature separate SIMD execution units with 128-bit datapaths.
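A toy model of the datapath point, in plain C. The packed_add function and its lanes_per_pass parameter are made up for illustration; it just counts how many passes a four-float add needs when the datapath is 4 floats (128 bits) wide versus 2 floats (64 bits) wide.

```c
#include <assert.h>

/* Adds four packed floats through a datapath that handles `lanes_per_pass`
   floats at a time (4 = 128-bit path, 2 = 64-bit path).
   Returns the number of passes the operation took. */
static int packed_add(float dst[4], const float a[4], const float b[4],
                      int lanes_per_pass) {
    assert(lanes_per_pass == 2 || lanes_per_pass == 4);
    int passes = 0;
    for (int base = 0; base < 4; base += lanes_per_pass) {
        for (int i = base; i < base + lanes_per_pass; i++)
            dst[i] = a[i] + b[i];
        passes++;  /* one trip through the execution unit */
    }
    return passes;
}
```

The result is identical either way; only the pass count changes (1 vs. 2), which is why this is a property of the implementation rather than of the instruction set.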
however, it cannot store as many SIMD instructions at once as the G4 can
I assume you're referring to the SIMD datatypes? Instruction words are not stored in the register file, but rather fetched directly from the memory address space.
If I'm not mistaken. Intel dumped most of the hardware execution features in favor of a more generalized method of doing everything
I definitely wouldn't say that at all...I should clarify that the shifts/rotates on the P4 aren't necessarily performed by any sort of "software emulation," but rather the P4 does not feature the fast barrel shifter that previous x86 MPUs have had. Shifts and rotates on the P4 are handled by the third ALU rather than by one of the two double-speed ALUs; their higher latency, 4 execution cycles vs. 1 on previous MPUs, is just an artifact of the circuit design decisions and higher clock speed employed by the P4 design. Despite the increase in execution latency, the throughput (cycles between issues of the same instruction) remains at 1, as in earlier designs. Likewise, some of the P4's x87 FP instructions increased in execution latency from the P3, but this is just an artifact of the circuit design required to achieve higher clock speeds. For the most part, the instruction execution throughput remains unchanged relative to the P3, except perhaps for the simpler arithmetic and logical instructions, which effectively halved in latency and doubled in throughput on the P4 with respect to the P3 due to the two double-speed ALUs. With out-of-order execution, throughput is generally more important to performance than individual instruction latency.
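A back-of-the-envelope model of why throughput tends to matter more than latency, in plain C. The function names are made up, and the model is deliberately simplified: it assumes a single fully pipelined unit and ignores every other limit in the machine.

```c
#include <assert.h>

/* Cycles to finish n independent instructions on a pipelined unit:
   the first result appears after `latency` cycles, and a new instruction
   can issue every `interval` cycles after that. */
static long cycles_independent(long n, long latency, long interval) {
    assert(n > 0 && latency > 0 && interval > 0);
    return latency + (n - 1) * interval;
}

/* Cycles to finish n instructions where each one needs the previous
   result, so nothing can overlap. */
static long cycles_dependent(long n, long latency) {
    assert(n > 0 && latency > 0);
    return n * latency;
}
```

Plugging in the shift figures above (latency 4, issue interval 1): 100 independent shifts take 103 cycles in this model versus 100 cycles with a 1-cycle shifter, so the extra latency is almost entirely hidden. Only a chain of 100 dependent shifts pays the full price (400 vs. 100 cycles), and out-of-order execution exists precisely to find independent work to fill those gaps.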
On most MPUs, only the simpler arithmetic, logical, and memory operations have an execution latency of 1 cycle. More often than not the hardware must be pipelined to achieve the desired clock cycle; even when the hardware for a complex arithmetic operation can be implemented in parallel, such as using a Wallace tree to implement multiply, it may need to be pipelined into 3 to 5 cycles or more. More complex (though less frequently used) operations, such as divides and transcendental functions, need to be executed in a more serial manner using a finite state machine (akin to, for example, long-hand division, but with techniques to reduce the number of steps) that may take 40 to over 200 cycles to execute.
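For a feel of what "serial, like long-hand division" means, here's a sketch of textbook restoring division in plain C. It is not the algorithm any particular MPU uses, and the div_result type and divide_restoring name are made up for illustration. It produces one quotient bit per step, so a 32-bit divide takes 32 steps; real dividers typically use techniques that produce more than one bit per step.

```c
#include <assert.h>
#include <stdint.h>

typedef struct {
    uint32_t quotient;
    uint32_t remainder;
    int steps;  /* number of sequential iterations performed */
} div_result;

/* Textbook restoring division: bring down one dividend bit per step,
   subtract the divisor if it fits, and record a quotient bit. Each step
   depends on the remainder left by the previous one, so the steps
   cannot be overlapped. */
static div_result divide_restoring(uint32_t dividend, uint32_t divisor) {
    assert(divisor != 0);
    div_result r = {0, 0, 0};
    uint64_t rem = 0;  /* wider than 32 bits so the shift cannot overflow */
    for (int bit = 31; bit >= 0; bit--) {
        rem = (rem << 1) | ((dividend >> bit) & 1u);
        if (rem >= divisor) {
            rem -= divisor;
            r.quotient |= (uint32_t)1 << bit;
        }
        r.steps++;
    }
    r.remainder = (uint32_t)rem;
    return r;
}
```

Unlike a multiply, where the partial products can all be generated at once and summed in a tree, each step here has to wait for the remainder from the previous step, which is why divide latencies run so much higher.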