x87 beyond 80-bit precision in AMD64?

MadRat

Lifer
Oct 14, 1999
Is there a new x87 standard for FP with the AMD64 instruction set? It seems that everything else was expanded, so what about FP? If not, would it be possible to expand past 80-bit precision with the FPU? I read that SSE2 is good for 64-bit precision, whereas SSE was good for 32-bit precision. Would there be a way, or would it make more sense, to expand the precision to 128-bit with a new SIMD set? (It seems like Altivec uses 128-bit precision, which coincidentally matches its 128-bit memory pathway.) Would there need to be a 128-bit memory pathway to match a 128-bit precision?
 

Matthias99

Diamond Member
Oct 7, 2003
I can think of only a handful of scientific and graphics applications that really *need* even 64-bit precision, so I'm guessing that pushing it much past there is a game of diminishing returns. It might provide a speed boost for a few esoteric, specially-compiled apps, but you'd probably get more benefit using the die space for more cache, or (if you're making architectural changes) just adding more general-purpose 32- or 64-bit registers.
 

Sahakiel

Golden Member
Oct 19, 2001
x87 defines 64-bit precision, I believe. That means that any extra bits internally would simply mean a slightly more accurate result after many iterations of the same data. I am also at a loss to find any consumer application which would require 128-bit precision.
As for the memory pathway, the data width is irrelevant to the data precision. You could have a 2-bit pathway if you so desired. The obvious problem would be the multiple cycles required to send a single piece of data for computation.
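The point about pathway width can be put as simple arithmetic. The sketch below is an idealized model, not a description of any real bus: the function name `transfers_needed` is made up for illustration, and it ignores bursts, caches, and alignment. It only counts how many transfers of a given width are needed to move one operand.

```python
def transfers_needed(operand_bits: int, bus_bits: int) -> int:
    """Idealized count of bus transfers needed to move one operand.

    Assumes each transfer moves exactly bus_bits bits and nothing else
    shares the bus (a simplification for illustration only).
    """
    if operand_bits <= 0 or bus_bits <= 0:
        raise ValueError("widths must be positive")
    # Ceiling division using integers only.
    return -(-operand_bits // bus_bits)


if __name__ == "__main__":
    for operand, bus in [(64, 2), (80, 64), (128, 64), (128, 128)]:
        print(f"{operand}-bit operand over {bus}-bit path: "
              f"{transfers_needed(operand, bus)} transfer(s)")
```

The precision of the value is the same in every case; only the number of transfers changes.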
If I remember correctly, the L1 cache of the Pentium 4 provides a few GB/s worth of bandwidth dedicated to the FP units. Of course, Intel is trying to push the use of the SSE units, so that may have something to do with it. On the AMD side, I'm somewhat less familiar with the general architecture, so I can't say too much off the top of my head.
 

MadRat

Lifer
Oct 14, 1999
I read from Intel's website that x87 was 80-bit precision. The FP is done in such a way that it cannot be done with renamed registers, which is its main drawback. SSE/SSE2 can both use the renamed registers to boost their throughput, which is the main advantage of FP through SIMD. The secondary advantage of SSE/SSE2 is that multiple FP operations can be done in parallel as long as they are loaded and scheduled around the limitations of their architecture. The SSE is limited to a maximum of 32-bit precision and SSE2 to a maximum of 64-bit precision according to their information.

Is a wider pathway (in this case 128-bit) not necessary for single-cycle loads of FP operations beyond 64-bit? It wouldn't be any advantage I'd think to use FP over stacked SSE2 operations to compute beyond 80-bit if it wasn't for lower latency.
 

Sahakiel

Golden Member
Oct 19, 2001
Originally posted by: MadRat
I read from Intel's website that x87 was 80-bit precision.
Yes, this is true. Internally, the core of the original Pentium used 80-bit precision. However, in reading off the data to main memory, the floating point number is reduced to 64-bit to maintain compatibility with the IEEE standard. Unfortunately, I forgot the exact IEEE number and don't have a reference book handy. That's why increasing the precision to 128 does very little unless you're working with the same data numerous times.
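To illustrate why extra internal bits mostly matter over many operations on the same data, here is a minimal sketch that simulates the rounding in software. It is not x87 code: `round_to_precision` and `accumulate` are hypothetical helpers that use exact rational arithmetic and round to a chosen significand width (53 bits for a 64-bit double, 64 bits for the 80-bit extended format). Exponent range, overflow, and subnormals are ignored.

```python
from fractions import Fraction


def round_to_precision(x: Fraction, p: int) -> Fraction:
    """Round x to the nearest value with a p-bit significand (ties to even).

    Simulation only: the exponent range is treated as unlimited.
    """
    if x == 0:
        return Fraction(0)
    sign = -1 if x < 0 else 1
    x = abs(x)
    # Pick e so that x / 2**e lies in [2**(p-1), 2**p).
    e = x.numerator.bit_length() - x.denominator.bit_length() - p
    scaled = x / Fraction(2) ** e
    if scaled >= 2 ** p:
        e += 1
        scaled = x / Fraction(2) ** e
    m = round(scaled)  # round() on a Fraction rounds half to even
    return sign * m * Fraction(2) ** e


def accumulate(term: Fraction, count: int, p: int) -> Fraction:
    """Add term to a running total count times, rounding to p bits each step."""
    total = Fraction(0)
    for _ in range(count):
        total = round_to_precision(total + term, p)
    return total


if __name__ == "__main__":
    term = Fraction(0.1)   # the exact value of the double nearest to 0.1
    exact = term * 1000
    # Round to 53 bits after every addition (plain 64-bit doubles).
    err_double = abs(accumulate(term, 1000, 53) - exact)
    # Keep 64 bits while accumulating, round to 53 bits once at the end.
    err_extended = abs(round_to_precision(accumulate(term, 1000, 64), 53) - exact)
    print("error, 53-bit steps:", float(err_double))
    print("error, 64-bit steps, one final 53-bit store:", float(err_extended))
```

For a single operation the final store to 64 bits dominates, so the wider intermediate format only pays off when many results are chained before being written back.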

The FP is done in such a way that it cannot be done with renamed registers, which is its main drawback. SSE/SSE2 can both use the renamed registers to boost their throughput, which is the main advantage of FP through SIMD. The secondary advantage of SSE/SSE2 is that multiple FP operations can be done in parallel as long as they are loaded and scheduled around the limitations of their architecture. The SSE is limited to a maximum of 32-bit precision and SSE2 to a maximum of 64-bit precision according to their information.

Is a wider pathway (in this case 128-bit) not necessary for single-cycle loads of FP operations beyond 64-bit? It wouldn't be any advantage I'd think to use FP over stacked SSE2 operations to compute beyond 80-bit if it wasn't for lower latency.

Well, first off, instructions are limited to 32-bit, I believe. However, if you're extending the floating point precision, it's obvious that having a data path as wide or wider than the data you're working on would decrease latency. Single-cycle loads are faster than multiple-cycle loads given the same clock period.
The question becomes "do we really need to have better precision?" The FP unit is limited to 64 bits by IEEE standards and, for all intents and purposes, it's good enough for most applications. Extending precision to 128 is quite possible and well within reach from an architectural standpoint. The problem would be whether or not you could get the 128-bit unit to work as fast as the 64-bit unit. Obviously, the wider your data, the harder it is to ramp up speed.
If only 1% of the world required 128-bit precision and none of them really "require" (but would "love") real-time performance, but the rest of the world can live with 64-bit or even 32-bit precision, then it's a bad idea to implement 128-bit pathways or have full dedication for 128-bit FP units because doing so would slow down your entire processor. You would, in fact, be designing a chip to cater to the minority and end up innovating your way into Apple's part of the world.
 

uart

Member
May 26, 2000
Yes, this is true. Internally, the core of the original Pentium used 80-bit precision. However, in reading off the data to main memory, the floating point number is reduced to 64-bit to maintain compatibility with the IEEE standard.

Actually the full FPU 80 bits can be saved to memory if required (though there are obviously speed penalties involved in doing so). Most compilers even implement an "extended" data type for doing exactly this.
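As a rough picture of what those 80 bits look like in memory, the sketch below encodes and decodes the x87 double-extended layout: 1 sign bit, a 15-bit exponent with bias 16383, and a 64-bit significand with an explicit integer bit, stored little-endian in 10 bytes. The helper names `encode_x87` and `decode_x87` are invented for this example, and only zero and normal finite values are handled (no infinities, NaNs, or denormals).

```python
from fractions import Fraction

BIAS = 16383  # exponent bias of the x87 double-extended format


def encode_x87(x: Fraction) -> bytes:
    """Encode x into 10 little-endian bytes (zero and normal values only).

    Raises ValueError if x needs more than 64 significand bits.
    """
    if x == 0:
        return bytes(10)
    sign = 1 if x < 0 else 0
    x = abs(x)
    # Pick e so the integer significand m = x / 2**e lies in [2**63, 2**64).
    e = x.numerator.bit_length() - x.denominator.bit_length() - 64
    if x / Fraction(2) ** e >= 2 ** 64:
        e += 1
    m = x / Fraction(2) ** e
    if m.denominator != 1:
        raise ValueError("value needs more than 64 significand bits")
    biased = e + 63 + BIAS
    if not 0 < biased < 0x7FFF:
        raise ValueError("exponent outside the normal range")
    sign_and_exponent = (sign << 15) | biased
    return int(m).to_bytes(8, "little") + sign_and_exponent.to_bytes(2, "little")


def decode_x87(raw: bytes) -> Fraction:
    """Decode 10 little-endian bytes back into an exact Fraction."""
    m = int.from_bytes(raw[:8], "little")
    sign_and_exponent = int.from_bytes(raw[8:10], "little")
    sign = -1 if sign_and_exponent >> 15 else 1
    e = (sign_and_exponent & 0x7FFF) - BIAS - 63
    return sign * m * Fraction(2) ** e


if __name__ == "__main__":
    print(encode_x87(Fraction(1)).hex())
    # A value with a 64-bit significand that a 64-bit double cannot hold:
    x = Fraction(2 ** 63 + 1, 2 ** 63)
    print(decode_x87(encode_x87(x)) == x, float(x) == 1.0)
```

The last two lines show the trade-off discussed here: the 10-byte form keeps the extra low-order bit, while converting the same value to a 64-bit double loses it.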
 

andreasl

Senior member
Aug 25, 2000
A number of wrong premises in the original post here.

1) There is no x87 supported in AMD64 64-bit mode. It uses SSE2 instead, or more specifically, scalar SSE2.

2) x87 supports up to 80-bit precision. The other supported precisions are 32-bit and 64-bit.

3) Altivec only supports 32-bit precision.

The difference here, and what I think causes some confusion, is the width of the registers and the precision supported. SSE and Altivec registers are 128 bits wide. But instead of operating on one 128-bit value, they operate on four 32-bit values with a single instruction. That's the point behind Single Instruction Multiple Data. SSE2 operates on two 64-bit values instead. Altivec has no such mode.
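The register-width versus precision distinction can be sketched in a few lines. This is a software model only, not real SSE or Altivec code: the helper names `pack_lanes`, `unpack_lanes`, and `simd_add` are hypothetical, and a 16-byte string stands in for one 128-bit register holding either four 32-bit floats or two 64-bit doubles.

```python
import struct

REGISTER_BYTES = 16  # one 128-bit register


def pack_lanes(values, fmt):
    """Pack values into a 16-byte register image.

    fmt is 'f' (four 32-bit lanes) or 'd' (two 64-bit lanes).
    """
    lanes = REGISTER_BYTES // struct.calcsize("<" + fmt)
    if len(values) != lanes:
        raise ValueError(f"expected {lanes} values for format {fmt!r}")
    return struct.pack(f"<{lanes}{fmt}", *values)


def unpack_lanes(register, fmt):
    """Split a 16-byte register image back into its lanes."""
    lanes = REGISTER_BYTES // struct.calcsize("<" + fmt)
    return list(struct.unpack(f"<{lanes}{fmt}", register))


def simd_add(a, b, fmt):
    """Model one SIMD add: every lane is added independently."""
    sums = [x + y for x, y in zip(unpack_lanes(a, fmt), unpack_lanes(b, fmt))]
    return pack_lanes(sums, fmt)  # each result is rounded to the lane width


if __name__ == "__main__":
    a = pack_lanes([1.0, 2.0, 3.0, 4.0], "f")
    b = pack_lanes([10.0, 20.0, 30.0, 40.0], "f")
    print(unpack_lanes(simd_add(a, b, "f"), "f"))
    # Same 128 bits, but each lane only carries 32-bit precision:
    print(unpack_lanes(pack_lanes([0.1] * 4, "f"), "f")[0])
    print(unpack_lanes(pack_lanes([0.1, 0.2], "d"), "d")[0])
```

In both formats the register is 128 bits wide; what changes is how many values it holds and how precise each one is.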
 

Matthew Daws

Member
Oct 19, 1999
andreasl,

1) There is no x87 supported in AMD64 64-bit mode. It uses SSE2 instead, or more specific, scalar SSE2.

This is not really true. If you check out AMD64 Developer Resources and grab the "AMD64 Architecture Programmers' Manuals" then you'll see that x87 is fully supported in AMD64 64-bit mode. However, it is recommended that SSE2 be used for speed, as it supports 16 flat registers (vs 8 stack-based) and can execute instructions in parallel. But to do things like trig or taking logs etc. you have to use x87, as only square roots and simple arithmetic are supported in SSE(2). I *think* you also have to use x87 if you want full IEEE compatibility with regards to rounding etc.

But I fully agree with you on the rest. I think it's a shame that AMD didn't rewrite the floating point ISA in 64-bit mode, doing away with x87 altogether. I guess they felt it was easier to keep it for compatibility reasons, and because the overhead of the stack-based model is minimal once you start seriously doing trig etc.

I'd also like to see some benchmarks from software like media-encoders that have been optimised to use those 16 XMM registers: I'm guessing AMD could start to win back the ground they have lost to Intel in this area...

--Matt