To expand a little more: for FP, in the beginning there was x87. It was bad because it had to use a pre-defined coprocessor instruction encoding with very few bits available, so it was designed around a stack register model (which was in vogue at the time, see SPARC and Am29000). Intel messed it up, however, because a stack register model needs automatic spill and restore to work in practice, and they never did that. So in reality, when people used x87 they minimized using registers for anything and usually spilled everything to the C stack on every function call boundary, if not more often, which made it much slower than it could have been. On the 8088 this didn't really matter that much because all FP operations were dog slow anyway, but the problem was that as the processors got better, x87 remained a dog.
Then in the late 90s Intel realized that they'd really like to have integer SIMD for image decompression, audio decoding, etc., and specifically wanted to do 4x16-bit and 8x8-bit operations fast. (This is relevant for JPEG and MP3.) For this, they needed registers that fit at least 64 bits, and the x87 registers were right there. By this point, decoding restrictions had been relaxed, so MMX could use normal register operands. The problem with this is that x87 also uses those same registers for its stack. MMX was fundamentally a pretty good integer SIMD extension, except that if you needed to do any FP work at all, you'd always have to spill your MMX state, stop using MMX, do the x87 work, and then restore. (Well, you didn't have to, but it was easy to mess up if you didn't.) So, you guessed it, in practice compilers always spilled all the registers on function call boundaries (if not more often), just to be sure, and we are back to a world where you'd better inline freaking everything if you want your code to be fast. (And also you get 16kB of icache, have fun.)
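To make the "4x16-bit operations" bit concrete, here is a rough portable C sketch of what one packed saturating add does: four signed 16-bit lanes living in a single 64-bit value, added lane by lane with clamping, similar in spirit to MMX's PADDSW. It uses no intrinsics at all, and the helper names (pack_lanes, get_lane, packed_add_sat_4x16) are made up for illustration; real MMX does this in one instruction on a 64-bit register.

```c
#include <assert.h>
#include <stdint.h>

/* Clamp a 32-bit intermediate to the signed 16-bit range (saturation). */
static int16_t saturate_i16(int32_t v) {
    if (v > INT16_MAX) return INT16_MAX;
    if (v < INT16_MIN) return INT16_MIN;
    return (int16_t)v;
}

/* Read lane 0..3 of a 64-bit word as a signed 16-bit value.
 * The uint16_t -> int16_t cast assumes the usual two's complement wrap. */
static int16_t get_lane(uint64_t packed, int lane) {
    assert(lane >= 0 && lane < 4);
    return (int16_t)(uint16_t)(packed >> (16 * lane));
}

/* Pack four signed 16-bit values into one 64-bit word, lane 0 lowest. */
static uint64_t pack_lanes(int16_t l0, int16_t l1, int16_t l2, int16_t l3) {
    return (uint64_t)(uint16_t)l0
         | ((uint64_t)(uint16_t)l1 << 16)
         | ((uint64_t)(uint16_t)l2 << 32)
         | ((uint64_t)(uint16_t)l3 << 48);
}

/* Lane-wise signed saturating add of two packed 4x16-bit words.
 * Hypothetical scalar emulation, not an actual MMX code path. */
static uint64_t packed_add_sat_4x16(uint64_t a, uint64_t b) {
    uint64_t out = 0;
    for (int lane = 0; lane < 4; lane++) {
        int32_t sum = (int32_t)get_lane(a, lane) + (int32_t)get_lane(b, lane);
        out |= (uint64_t)(uint16_t)saturate_i16(sum) << (16 * lane);
    }
    return out;
}
```

Saturation is the important part for pixel and audio work: a lane that overflows clamps to the min or max instead of wrapping around, and lanes never carry into each other.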
3DNow! was AMD basically extending MMX for 32-bit FP. This worked decently, and ironically the places where it helped the most were maybe things that mostly used MMX but needed to occasionally do some 32-bit FP, because you could mix 3DNow! with MMX at your leisure. It had no 64-bit float support, which greatly limited its applicability. Browsers especially could really have used it, because they both dealt with things that used MMX a lot and needed 64-bit floats because JS.
SSE1 was Intel finally realizing that the original sin was aliasing registers that are addressed in different ways, creating an entirely new set of 8 128-bit regs, and doing a fundamentally pretty good FP SIMD extension on them. The only major flaw, lack of integer operands, was fixed in SSE2. Once SSE2 was available, there has never been a good reason to emit any x87, MMX or 3DNow! ops. It just basically entirely supersedes them. Yes, there are no sin/cos in SSE, but that's strictly because the sin/cos of x87 were terrible and you can do better (in both accuracy and speed) using a library.