Question Greatest x86 innovations?

Tuna-Fish · Mar 3, 2024

To expand a little more: for FP, in the beginning there was x87, which was bad because it used a pre-defined co-processor instruction encoding that had very limited bits available, so it was designed to use a stack register model (which was in vogue at the time, see SPARC and Am29000). Intel however messed it up, because a stack register model requires automatic spill and restore to work in practice, and they never did that. So in reality, when people used x87 they minimized using registers for anything and usually spilled everything to the C stack on every function call boundary, if not more often, which made it much slower than it could be. For the 8088 this didn't really matter that much because all FP operations were dog slow anyway, but the problem was that as the processors got better, x87 remained a dog.

Then in the late '90s Intel realized that they'd really like to have integer SIMD for image decompression, audio decoding, etc., and specifically wanted to do 4x16-bit and 8x8-bit operations fast. (This is relevant for JPEG and MP3.) For this, they needed registers that fit at least 64 bits, and the x87 registers were right there. But at this point, decoding restrictions were relaxed, and so MMX could use normal register operands. The problem with this is that x87 also uses those same registers for its stack. MMX was fundamentally a pretty good integer SIMD extension, except if you needed to do any FP work at all, you'd always have to spill your MMX state, stop using MMX, do the x87 work, and then restore. (Well, you didn't have to, but it was easy to mess up if you didn't.) So, you guessed it, in practice compilers always spilled all the registers on function call boundaries (if not more often), just to be sure, and we are back to a world where you'd better inline freaking everything if you want your code to be fast. (And also you get 16 kB of icache, have fun.)
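To make "4x16-bit operations" concrete, here is a rough sketch in Python of what a packed wrapping add (the kind of thing MMX's PADDW does) computes: four independent 16-bit lanes living in one 64-bit value. This is not MMX code, just the lane arithmetic. The helper names `pack_u16x4`, `unpack_u16x4` and `add_u16x4` are made up for illustration.

```python
HIGH_BITS = 0x8000800080008000  # top bit of each 16-bit lane
LOW_BITS = 0x7FFF7FFF7FFF7FFF   # low 15 bits of each 16-bit lane


def pack_u16x4(lanes):
    """Pack four 16-bit values into one 64-bit integer, lane 0 in the low bits."""
    if len(lanes) != 4:
        raise ValueError("expected exactly four lanes")
    word = 0
    for i, lane in enumerate(lanes):
        word |= (lane & 0xFFFF) << (16 * i)
    return word


def unpack_u16x4(word):
    """Split a 64-bit integer back into its four 16-bit lanes."""
    return [(word >> (16 * i)) & 0xFFFF for i in range(4)]


def add_u16x4(a, b):
    """Lane-wise add of four 16-bit lanes; each lane wraps around on overflow."""
    # Add the low 15 bits of every lane; the result fits in 16 bits per lane,
    # so no carry can leak into the neighbouring lane.
    low_sum = (a & LOW_BITS) + (b & LOW_BITS)
    # Fold the top bit of each lane back in with XOR (an add without carry-out).
    return low_sum ^ ((a ^ b) & HIGH_BITS)


a = pack_u16x4([1, 0xFFFF, 100, 0x8000])
b = pack_u16x4([2, 1, 200, 0x8000])
print(unpack_u16x4(add_u16x4(a, b)))
```

The second and fourth lanes overflow and wrap to 0 without disturbing their neighbours, which is the whole point of doing it as a dedicated SIMD op instead of one plain 64-bit add.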

3DNow! was AMD basically extending MMX for 32-bit FP. This worked decently, and ironically the places where it helped the most were maybe things that just used MMX but needed to occasionally do some 32-bit FP, because you could mix 3DNow! with MMX at your leisure. It had no 64-bit float support, which greatly limited its applicability. Browsers especially could have used it (because they both dealt with things that used MMX a lot and needed 64-bit floats because of JS).

SSE1 was Intel finally realizing that the original sin was aliasing registers that are addressed in different ways, and creating an entirely new set of 8 128-bit regs, and doing a fundamentally pretty good FP SIMD extension for them. The only major flaw, lack of integer operands, was fixed in SSE2. Since SSE2 became available, there has never been a good reason to emit any x87, MMX or 3DNow! ops. It just basically entirely supersedes them. Yes, there are no sin/cos in SSE, but that's strictly because the sin/cos of x87 were terrible and you can do better (in both accuracy and speed) using a library.

NTMBK · Mar 4, 2024

Don't forget the fun detail that x87 had 80-bit precision internally, but this would be rounded to 64-bit whenever it spilled to the C stack... and you had no control over when the compiler would choose to do this, so the precision of the exact same C code would vary wildly between different compilers. (Or even the same compiler, if you made a seemingly unrelated change and triggered a spill.)

Fun times!
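A rough analogue of that spill effect, sketched in Python. This does not touch x87 at all: it treats a normal 64-bit Python float as the wide "register" and a round trip through a 32-bit float as the "spill", so the widths are stand-ins (64 to 32 bits instead of 80 to 64). The names `spill`, `sum_kept_in_register` and `sum_spilled_each_step` are made up for illustration.

```python
import struct


def spill(x):
    """Round a 64-bit float to 32 bits and back, standing in for a store to a narrower slot."""
    return struct.unpack("f", struct.pack("f", x))[0]


def sum_kept_in_register(values):
    """Accumulate at full width and round to the narrow format only once, at the end."""
    total = 0.0
    for v in values:
        total += v
    return spill(total)


def sum_spilled_each_step(values):
    """Same arithmetic, but the running total is 'spilled' after every addition."""
    total = 0.0
    for v in values:
        total = spill(total + v)
    return total


# Two tiny terms that are each too small to survive rounding on their own,
# but together add up to one step of the narrow format.
values = [1.0, 2.0 ** -24, 2.0 ** -24]

print(sum_kept_in_register(values))
print(sum_spilled_each_step(values))
print(sum_kept_in_register(values) == sum_spilled_each_step(values))
```

Same inputs, same additions, different answers: the spilled version loses both small terms and returns exactly 1.0, while the other keeps them. With x87 the equivalent of `spill` got inserted wherever the compiler felt like it.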

FelixDeCat · Mar 4, 2024

Thanks for the history lesson, it was interesting 👍

DrMrLordX · Mar 10, 2024

Schmide said:
(x86-64 EMT64)

As an aside, the Intel name for x86-64 was actually EM64T. EMT64 ("empty 64") was originally a jab at Intel for being forced to copy x86-64 (when Intel was clearly trying to push IA-64) and for some bugs that existed in early implementations of EM64T.

igor_kavinski · Mar 11, 2024

Schmide said:
AVX was lacking and thus it only lasted a couple of years. AVX2 (Haswell) was where things settled. Recently it was declared the next dividing line for modern OSes: x86-64-v3 (AVX2, FMA, MOVBE, bits).

Yes, the innovation list needs to be updated with what Haswell brought to the table.

Then Cannon Lake for bringing AVX-512.

AnandTech Forums: Technology, Hardware, Software, and Deals

Seeking answers? Join the AnandTech community: where nearly half-a-million members share solutions and discuss the latest tech.

www.anandtech.com

When we crank on the AVX2 and AVX512, there is no stopping the Cannon Lake chip here. At a score of 4519, it beats a full 18-core Core i9-7980XE processor running in non-AVX mode which scores 4185. That's insane. Truly a big plus in Cannon Lake's favor.

That's two cores in AVX-512 mode beating 18 cores in non-AVX mode!

igor_kavinski · Mar 14, 2024

One more x86 innovation: Highest ever 6.2 GHz ST frequency without overclocking

Core i9-14900KS reviewed but Intel ARK not updated yet with entry for the new CPU.

SolidQ · Oct 15, 2024

It will be interesting to see how this turns out:

https://videocardz.com/newz/intel-and-amd-want-to-make-x86-architecture-better-by-working-together

Nothingness · Oct 15, 2024

The best way to make it better is to trash it. Sorry, could not resist 😅

Markfw · Oct 15, 2024

igor_kavinski said:
Yes, the innovation list needs to be updated with what Haswell brought to the table.

Then Cannon Lake for bringing AVX-512.

That's two cores in AVX-512 mode beating 18 cores in non-AVX mode!

This is why Zen 5 with full AVX-512 support is important and a great help to the 9950X and Turin!
