Originally posted by: MadRat
I read from Intel's website that x87 was 80-bit precision.
Yes, this is true. Internally, the core of the original Pentium used 80-bit precision. However, when the data is written out to main memory, the floating-point number is normally reduced to 64 bits to maintain compatibility with the IEEE 754 standard. That's why increasing the precision to 128 bits does very little unless you're working on the same data numerous times.
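To put a rough number on that, here is a small Python sketch. It is only a software model, not how the x87 hardware is built: the helper names `round_to_bits` and `accumulate` are made up for this post, and I'm assuming round-to-nearest-even with unlimited exponent range. It adds the same value 1000 times, once rounding every step to a 53-bit significand (64-bit double) and once rounding every step to a 64-bit significand (80-bit extended) with a single rounding to 53 bits at the end.

```python
from fractions import Fraction


def round_to_bits(x: Fraction, bits: int) -> Fraction:
    """Round x to the nearest value with a `bits`-bit significand (ties to even)."""
    if x == 0:
        return x
    sign = -1 if x < 0 else 1
    x = abs(x)
    # Find e such that 2**e <= x < 2**(e + 1).
    e = x.numerator.bit_length() - x.denominator.bit_length()
    if Fraction(2) ** e > x:
        e -= 1
    # Scale so the significand becomes an integer with `bits` bits, then round.
    scale = Fraction(2) ** (bits - 1 - e)
    return sign * Fraction(round(x * scale)) / scale  # round() on Fraction ties to even


def accumulate(term: Fraction, count: int, working_bits: int) -> Fraction:
    """Add `term` to a running total `count` times, rounding after every add."""
    total = Fraction(0)
    for _ in range(count):
        total = round_to_bits(total + term, working_bits)
    return total


COUNT = 1000
term = round_to_bits(Fraction(1, 10), 53)   # 0.1 as stored in a 64-bit double
exact = term * COUNT                        # the sum with no rounding at all

# Every intermediate result rounded to 64-bit double precision.
double_only = accumulate(term, COUNT, 53)
# Intermediates kept at 80-bit extended precision, rounded to double once at the end.
extended_then_store = round_to_bits(accumulate(term, COUNT, 64), 53)

err_double = abs(double_only - exact)
err_extended = abs(extended_then_store - exact)

print("error, 53-bit at every step:", float(err_double))
print("error, 64-bit then one store:", float(err_extended))
```

The extra bits only pay off because the running total is reused a thousand times. For a single load, add, store sequence the result is rounded to 64 bits right away and the wider internal format buys you next to nothing.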
The FP is done in such a way that it cannot be done with renamed registers, which is its main drawback. SSE/SSE2 can both use the renamed registers to boost their throughput, which is the main advantage of FP through SIMD. The secondary advantage of SSE/SSE2 is that multiple FP operations can be done in parallel as long as they are loaded and scheduled around the limitations of their architecture. The SSE is limited to a maximum of 32-bit precision and SSE2 to a maximum of 64-bit precision according to their information.
Isn't a wider pathway (in this case 128-bit) necessary for single-cycle loads of FP operands beyond 64 bits? I wouldn't think there'd be any advantage to using FP over stacked SSE2 operations to compute beyond 80-bit if it weren't for lower latency.
Well, first off, instructions are limited to 32-bit, I believe. However, if you're extending the floating-point precision, it's obvious that having a data path as wide as or wider than the data you're working on would decrease latency. Single-cycle loads are faster than multiple-cycle loads given the same clock period.
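As back-of-the-envelope arithmetic, the load argument can be sketched like this. It is a deliberately simplified model that assumes one bus transfer per cycle and ignores caches, alignment and pipelining; `load_cycles` is a name invented here for illustration, not anything from Intel's documentation.

```python
def load_cycles(operand_bits: int, path_bits: int) -> int:
    """Transfers needed to move one operand over a data path of the given width."""
    if operand_bits <= 0 or path_bits <= 0:
        raise ValueError("widths must be positive")
    # Integer ceiling division: a partially filled transfer still costs a full cycle.
    return -(-operand_bits // path_bits)


for operand in (32, 64, 80, 128):
    for path in (64, 128):
        print(f"{operand:>3}-bit operand over {path:>3}-bit path:",
              load_cycles(operand, path), "cycle(s)")
```

Under these assumptions, anything wider than 64 bits needs a second transfer on a 64-bit path, while a 128-bit path moves an 80-bit or 128-bit operand in one.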
The question becomes "do we really need to have better precision?" The FP unit is limited to 64 bits by IEEE standards and, for all intents and purposes, that's good enough for most applications. Extending precision to 128 bits is quite possible and well within reach from an architectural standpoint. The problem would be whether or not you could get the 128-bit unit to work as fast as the 64-bit unit. Obviously, the wider your data, the harder it is to ramp up speed.
If only 1% of the world required 128-bit precision and none of them really "require" (but would "love") real-time performance, but the rest of the world can live with 64-bit or even 32-bit precision, then it's a bad idea to implement 128-bit pathways or have full dedication for 128-bit FP units because doing so would slow down your entire processor. You would, in fact, be designing a chip to cater to the minority and end up innovating your way into Apple's part of the world.