Soccerman: The idea behind SIMD is to operate on vectors of data. Instead of having a 32-bit ALU and 32-bit registers that do a single 32-bit operation, you have a 128-bit ALU and 128-bit registers, allowing you to do, for example, one 128-bit add, two 64-bit adds, four 32-bit adds, eight 16-bit adds, or sixteen 8-bit adds at one time. The challenge in coding for SIMD is to find, for example, four sequential 32-bit adds with no data dependencies between them, so they can be combined into a single 128-bit SIMD add. Compilers are not that good at extracting SIMD parallelism, so the best results come from hand-coding. SIMD instruction sets include the normal integer and floating-point operations, as well as a bunch of permute, pack/unpack, merge, and alignment ops.
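To make the "four 32-bit adds in one 128-bit add" idea concrete, here is a rough sketch in plain, portable C. It uses no real SIMD intrinsics: the `vec4u32` type and the `simd_add_u32x4` function are names I made up to stand in for a 128-bit register and a single SIMD add instruction.

```c
#include <stdint.h>

/* Hypothetical stand-in for one 128-bit SIMD register viewed as
   four independent 32-bit lanes. Not a real intrinsic type. */
typedef struct { uint32_t lane[4]; } vec4u32;

/* Scalar code: four separate 32-bit adds. None of them reads a result
   produced by another, which is what makes them fusable. */
void add4_scalar(const uint32_t *a, const uint32_t *b, uint32_t *out)
{
    out[0] = a[0] + b[0];
    out[1] = a[1] + b[1];
    out[2] = a[2] + b[2];
    out[3] = a[3] + b[3];
}

/* What one SIMD add does conceptually: all four lanes are added in a
   single operation, and no carry crosses from one lane into the next. */
vec4u32 simd_add_u32x4(vec4u32 a, vec4u32 b)
{
    vec4u32 r;
    for (int i = 0; i < 4; i++) {
        r.lane[i] = a.lane[i] + b.lane[i];  /* wraps inside the lane */
    }
    return r;
}
```

On real hardware the loop body would be one instruction. If the second add needed the result of the first, the four adds could not be packed this way.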
Altivec is the most powerful of these SIMD sets because it can perform 128-bit integer and floating-point ops, and has its own set of 32 128-bit Altivec registers. It also has two dedicated Altivec execution units: one for ALU ops, the other for permute ops. The units are fast, with latencies of 1 cycle for simple ops and 3-4 cycles for more complex ops, and can operate in parallel with the normal integer and FP units.
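For a feel of what the permute unit does, here is a scalar sketch of a byte permute as I understand the Altivec one: each byte of a control vector picks one byte out of the 32 bytes formed by two source vectors. The `vec16u8` type and the `permute_sketch` name are made up for illustration. This is plain C, not the real Altivec intrinsic.

```c
#include <stdint.h>

/* Hypothetical stand-in for a 128-bit register viewed as 16 bytes. */
typedef struct { uint8_t byte[16]; } vec16u8;

/* Byte permute: for each result byte, the low 5 bits of the matching
   control byte select one of the 32 source bytes (0-15 come from a,
   16-31 come from b). */
vec16u8 permute_sketch(vec16u8 a, vec16u8 b, vec16u8 control)
{
    vec16u8 r;
    for (int i = 0; i < 16; i++) {
        uint8_t sel = control.byte[i] & 0x1F;   /* keep 5 index bits */
        r.byte[i] = (sel < 16) ? a.byte[sel] : b.byte[sel - 16];
    }
    return r;
}
```

With the right control vector, the same operation covers byte swaps, interleaves, and realigning data that straddles two registers.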
MMX has integer instructions only, and operates on 64-bit vectors. The MMX registers are aliased onto the x87 FP register file, so you can't freely mix MMX instructions with x87 FP instructions: you have to execute EMMS to clear the MMX state before going back to x87 code, and that switch costs cycles.
SSE adds 128-bit FP instructions (four single-precision floats per vector), along with its own set of eight 128-bit SSE registers. The problem is that the P3 only has 64-bit SIMD units (one add, one multiply) and datapaths, so it has to break most 128-bit SIMD ops down into two 64-bit ops. It can do a single 128-bit add-multiply (useful for matrix dot-products), or a 64-bit add plus a 64-bit multiply, in one cycle; otherwise it has to perform two 64-bit adds or multiplies to simulate a single 128-bit add or multiply. This compromise was made because full 128-bit execution units and datapaths would have cost too much die area. Also, the SSE instructions that work on MMX registers still touch the shared x87/MMX state, so mixing a lot of those with x87 ops can incur a costly delay; SSE ops that stay in the SSE registers don't need that switch.
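Here is a rough sketch of what "breaking a 128-bit op into two 64-bit ops" means for a packed single-precision add. This is plain C modelling the behaviour, not the actual hardware or real intrinsics. The `vec4f` type and both function names are invented for the example.

```c
/* Hypothetical stand-in for a 128-bit SSE register holding
   four single-precision floats. */
typedef struct { float lane[4]; } vec4f;

/* One pass through a 64-bit-wide adder: handles two of the four lanes,
   starting at first_lane (0 for the low half, 2 for the high half). */
void add_half(const vec4f *a, const vec4f *b, vec4f *out, int first_lane)
{
    out->lane[first_lane]     = a->lane[first_lane]     + b->lane[first_lane];
    out->lane[first_lane + 1] = a->lane[first_lane + 1] + b->lane[first_lane + 1];
}

/* The programmer sees one 128-bit add; the model executes it as
   two back-to-back 64-bit passes. */
vec4f add_ps_sketch(vec4f a, vec4f b)
{
    vec4f r;
    add_half(&a, &b, &r, 0);   /* low 64 bits: lanes 0 and 1 */
    add_half(&a, &b, &r, 2);   /* high 64 bits: lanes 2 and 3 */
    return r;
}
```

The result is identical to a true 128-bit add. Only the throughput differs, since the adder is busy for two passes instead of one.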
SSE2 extends the MMX integer instructions to 128 bits using the SSE register set, and adds double-precision FP instructions. I'm not too sure how SSE2 is implemented on the P4, but I believe it has the same 64-bit execution limitation as the P3. Since SSE/SSE2 are 128-bit instruction sets, future Intel or AMD implementations could add full 128-bit execution units and datapaths.
3DNow! operates on 64-bit vectors with both integer and FP instructions (two 32-bit floats per vector), and uses the same register space as x87/MMX. It can simulate 128-bit adds and multiplies the same way the P3 does, by issuing two 64-bit ops.