what exactly is a floating point operation?

MobiusPizza

Platinum Member
Apr 23, 2004
2,001
0
0
Floating point is a scheme for representing numbers in a computer.
There's also a fixed point representation, which is similar.
Both are used to represent real (non-integer) numbers.
Let me first talk about fixed point:

The number 136.75 can be represented in fixed point like this:
10001000.1100


Place value:  128  64  32  16   8   4   2   1  .  1/2  1/4  1/8  1/16
Bit:            1   0   0   0   1   0   0   0  .    1    1    0     0


The "." is the binary point, which separates the whole-number part from the fraction, just like the decimal point in the denary (base-10) system.

128 + 8 + 0.5 + 0.25 = 136.75
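
The sum above can be checked with a few lines of Python. This is only a sketch: the helper name fixed_point_to_float is made up for illustration and is not part of any library.

```python
def fixed_point_to_float(bits: str) -> float:
    """Convert a binary string such as '10001000.1100' to its numeric value."""
    whole, _, frac = bits.partition(".")
    value = float(int(whole, 2)) if whole else 0.0
    # The i-th bit after the binary point (counting from 1) is worth 2**-i.
    for i, bit in enumerate(frac, start=1):
        if bit == "1":
            value += 2.0 ** -i
    return value


print(fixed_point_to_float("10001000.1100"))  # 136.75
```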


Fixed point representation assumes the binary point is in a set position, as there is no third symbol to store it explicitly.

The advantage of fixed point representation is simple arithmetic, because the binary point is fixed. The disadvantage is the limited range: increasing the number of bits after the binary point for precision decreases the range, and vice versa.

Even using 4 bytes (32 bits) to hold each number, with 8 bits for the fractional part after the binary point, the largest number that can be held is only about 8.4 million (just under 2^23 = 8,388,608, assuming one bit is reserved for the sign).
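
To see the range/precision trade-off in numbers, here is a rough sketch. The function fixed_point_range is a hypothetical helper, and it assumes a two's complement layout where one bit goes to the sign when signed=True.

```python
def fixed_point_range(total_bits: int, frac_bits: int, signed: bool = True):
    """Return (largest value, step size) for a fixed-point layout."""
    int_bits = total_bits - frac_bits - (1 if signed else 0)
    step = 2.0 ** -frac_bits          # smallest difference that can be stored
    largest = 2 ** int_bits - step    # all value bits set to 1
    return largest, step


# 32 bits with 8 fractional bits: a bit under 8.4 million, in steps of 1/256.
print(fixed_point_range(32, 8))
# Moving 8 more bits to the fraction gives finer steps but a much smaller range.
print(fixed_point_range(32, 16))
```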

The floating point representation was invented to address this problem.
It's called floating point because there's no fixed place for the binary point. Where the binary point sits depends on a property called the exponent.

Each floating point number is made of 2 parts, the mantissa and the exponent.
The mantissa stores the significant digits of the value; the exponent stores the magnitude of the number.
Say a 16-bit floating point number is made of a 10-bit mantissa and a 6-bit exponent,
e.g.

0110101011 | 111110

From the mantissa, the number has a value of 1+2+8+32+128+256 = 427
From the exponent (read as a 6-bit two's complement number), the value is -2

The value represents 427 * 2^-2 = 427*1/4 = 106.75
In fixed point representation it is
01101010.11 = 0.25+0.5+2+8+32+64 = 106.75 (same answer)

Can you spot the link?
The exponent's value is -2.
It simply means shifting the binary point to the right "-2" times, which means shifting it to the left 2 times.
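
The toy 16-bit format above can be decoded mechanically. This sketch assumes the mantissa is an unsigned integer and the exponent is a two's complement number, which is how the example reads; decode_toy_float is a made-up name, and real formats such as IEEE 754 lay out their bits differently.

```python
def decode_toy_float(mantissa_bits: str, exponent_bits: str) -> float:
    """Decode the toy format: value = mantissa * 2**exponent."""
    mantissa = int(mantissa_bits, 2)      # assumed unsigned
    exponent = int(exponent_bits, 2)
    if exponent_bits[0] == "1":           # top bit set: negative in two's complement
        exponent -= 1 << len(exponent_bits)
    return mantissa * 2.0 ** exponent


print(decode_toy_float("0110101011", "111110"))  # 427 * 2**-2 = 106.75
```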

That's similar to scientific notation in our denary system,
like
12835.24 = 1.283524 * 10^4

As you can see, in scientific notation the decimal point is not fixed either; that's why it's called floating point.
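
Python can show both views of the same idea. The sketch below uses the standard math.frexp function for the binary side; the wrapper names binary_scientific and decimal_scientific are invented here, and the binary one is only meant for nonzero inputs.

```python
import math


def decimal_scientific(x: float) -> str:
    """Format x in decimal scientific notation, e.g. 1.283524e+04."""
    return f"{x:e}"


def binary_scientific(x: float):
    """Return (m, e) with x == m * 2**e and 1 <= |m| < 2, for nonzero x."""
    m, e = math.frexp(x)   # frexp gives 0.5 <= |m| < 1
    return m * 2, e - 1


print(decimal_scientific(12835.24))
m, e = binary_scientific(106.75)
print(m, e, m * 2 ** e)
```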

A floating point operation basically means an arithmetic operation (addition, multiplication, etc.) on floating point numbers.
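
As a rough illustration of what such an operation involves, here is a sketch working on (mantissa, exponent) pairs like the toy format above. The names fp_mul, fp_add and fp_value are hypothetical, and the sketch ignores the rounding and renormalising that real hardware does to fit the result back into a fixed number of bits.

```python
def fp_value(x):
    """Numeric value of a (mantissa, exponent) pair."""
    m, e = x
    return m * 2.0 ** e


def fp_mul(a, b):
    """Multiply: multiply the mantissas, add the exponents."""
    (ma, ea), (mb, eb) = a, b
    return (ma * mb, ea + eb)


def fp_add(a, b):
    """Add: bring both numbers to the smaller exponent, then add mantissas."""
    (ma, ea), (mb, eb) = a, b
    e = min(ea, eb)
    return ((ma << (ea - e)) + (mb << (eb - e)), e)


a = (427, -2)   # 106.75
b = (3, 1)      # 6.0
print(fp_value(fp_add(a, b)))  # 112.75
print(fp_value(fp_mul(a, b)))  # 640.5
```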
 

Mday

Lifer
Oct 14, 1999
18,647
1
81
there is no such thing as a "normal" operation.

Floating point is a way to represent numbers, as AX mentioned above. A floating point number generally means a non-integer and is stored with a certain precision. By precision, I mean the digits after the decimal point. Take pi for example: we all use the approximation 3.14, but what if you want more precision? 3.14159 takes more bits to represent than 3.14. Ultimately, the more precision you want, the more space you need to store it. This is why desktop computing moved from 8 to 16 to 32, and now to 64 bit processors.
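
The "more precision needs more bits" point can be seen by storing pi in 32 and in 64 bits. This sketch uses the standard struct module to round-trip the value; round_trip is a made-up helper name.

```python
import math
import struct


def round_trip(x: float, fmt: str) -> float:
    """Store x in the given struct format and read it back."""
    return struct.unpack(fmt, struct.pack(fmt, x))[0]


pi32 = round_trip(math.pi, "<f")   # 32-bit float, about 7 significant digits
pi64 = round_trip(math.pi, "<d")   # 64-bit float, about 16 significant digits

print(pi32, abs(pi32 - math.pi))   # error is a little under 1e-7
print(pi64, abs(pi64 - math.pi))   # no extra error: Python floats are already 64-bit
```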

How efficiently a processor performs floating point calculations is one way to compare processors and see which is better. That's why you see it so often when people review processors.
 

CTho9305

Elite Member
Jul 26, 2000
9,214
1
81
there is no such thing as a "normal" operation.
Obviously the OP means integer operations.

This is why desktop computing moved from 8 to 16 to 32, and now to 64 bit processors.
That's not really true - the 32->64 bit change was related to the address space. Both 32-bit and 64-bit processors use up to 64-bit floating point values (or 80, for a special x86 mode). For the vast majority of cases, 32-bit processors could represent enough numbers.
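
A quick way to see that floating point width is separate from the processor's address width, as a sketch using only the standard struct module. It assumes a typical platform where a C double is 8 bytes and a C float is 4 bytes.

```python
import struct

pointer_bits = struct.calcsize("P") * 8   # address width of this Python build
double_bits = struct.calcsize("d") * 8    # C double, what a Python float uses
single_bits = struct.calcsize("f") * 8    # C float

# pointer_bits is 32 or 64 depending on the build; the float widths do not change.
print(pointer_bits, double_bits, single_bits)
```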
 

harrkev

Senior member
May 10, 2004
659
0
71
To summarize AnnihilatorX: "Floating point" is the same as "Scientific Notation", but in binary. That is as simple as you can get.

64-bit floating-point registers have been available for years in x86 architecture.

Also note that even integer math can be used to represent fractions (that is, fixed point). What you give up is dynamic range. It is easy to make a number so small that it is rounded to zero using fixed-point math. It is also easy to make a number so large that it overflows the available bits. Fixed-point math is used quite a lot when making IIR filters in DSPs.
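
Here is a sketch of both failure modes using a made-up 16-bit format with 8 fractional bits (often written Q8.8). The names to_q8_8 and from_q8_8 are invented for illustration, and the sketch assumes overflow saturates at the largest value; many processors wrap around instead unless told otherwise.

```python
FRAC_BITS = 8
RAW_MIN, RAW_MAX = -32768, 32767   # signed 16-bit storage


def to_q8_8(x: float) -> int:
    """Scale x by 2**8, round to an integer, and saturate to 16 bits."""
    raw = round(x * (1 << FRAC_BITS))
    return max(RAW_MIN, min(RAW_MAX, raw))


def from_q8_8(raw: int) -> float:
    """Convert the stored integer back to the value it stands for."""
    return raw / (1 << FRAC_BITS)


print(from_q8_8(to_q8_8(1.5)))     # 1.5, fits exactly
print(from_q8_8(to_q8_8(0.001)))   # 0.0, too small for steps of 1/256
print(from_q8_8(to_q8_8(1000.0)))  # 127.99609375, clipped at the top of the range
```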

It is also possible to make numbers that are too large and too small using floating-point. But you have to work at it a lot harder ;)
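
For comparison, this is roughly how far you have to go with 64-bit floating point before the same thing happens. The sketch uses only the standard sys and math modules and assumes the usual IEEE 754 double behind Python's float.

```python
import math
import sys

too_big = sys.float_info.max * 2   # beyond roughly 1.8e308 the result becomes inf
too_small = 5e-324 / 4             # below the smallest subnormal the result is 0.0

print(sys.float_info.max, too_big, math.isinf(too_big))
print(too_small == 0.0)
```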
 

Calin

Diamond Member
Apr 9, 2001
3,112
0
0
Originally posted by: CTho9305
there is no such thing as a "normal" operation.
Obviously the OP means integer operations.

This is why desktop computing moved from 8 to 16 to 32, and now to 64 bit processors.
That's not really true - the 32->64 bit change was related to the address space. Both 32-bit and 64-bit processors use up to 64-bit floating point values (or 80, for a special x86 mode). For the vast majority of cases, 32-bit processors could represent enough numbers.

However, financial applications are limited by 32-bit integers: first because they must account for hundredths of a monetary unit, and second because of the limit of about 4 billion. If you don't have 4 billion of anything, remember that banks can.
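
To put numbers on that limit, here is a sketch that assumes amounts are stored as a whole number of hundredths (cents) in a single integer. The helper max_amount is hypothetical.

```python
def max_amount(bits: int, signed: bool = True, subunits_per_unit: int = 100):
    """Largest amount, as (whole units, hundredths), that fits in the integer."""
    largest = (1 << (bits - 1)) - 1 if signed else (1 << bits) - 1
    return divmod(largest, subunits_per_unit)


print(max_amount(32))                 # (21474836, 47): about 21 million units
print(max_amount(32, signed=False))   # (42949672, 95): about 43 million units
print(max_amount(64))                 # (92233720368547758, 7)
```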