FMA3 & FMA4

uribag

Member
Nov 15, 2007
41
0
61
Can someone explain to me in a simple way what exactly FMA3 and FMA4 do (and the differences between them)?

How hard is the implementation in existing software?

What are the programs that can benefit the most?

What can we expect in terms of performance gains in the best-case scenario?
 

NostaSeronx

Diamond Member
Sep 18, 2011
3,688
1,222
136
Can someone explain to me in a simple way what exactly FMA3 and FMA4 do (and the differences between them)?
D = (A * B) + C, the differences between the two are irrelevant.
What are the programs that can benefit the most?
Programs that get the most benefit from FMA3/FMA4 ops are ones that multiply then add.
What can we expect in terms of performance gains in the best-case scenario?
2x speedup in the same power envelope of a multiply operation.
 

Tuna-Fish

Golden Member
Mar 4, 2011
1,415
1,734
136
When computing with floating point, numbers are stored as x * 2^y, sort of like how engineers write large numbers as x * 10^y. One effect of this is that results need to be normalized and rounded after every operation (so that, for example, the intermediate result 101.111 * 2^1 gets turned into 1.01111 * 2^11; everything here, exponents included, is written in binary). Also, before doing an addition, you need to match exponents (for example, 1.01 * 2^10 + 1.1 * 2^1 needs to be turned into 1.01 * 2^10 + 0.11 * 2^10). This is actually a harder operation than the addition itself.
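To make the normalize and match-exponents steps concrete, here is a minimal C sketch using the standard frexp and ldexp functions. The helper names (normalized_significand, align_to_exponent) are made up for illustration, and this only mimics in software what the hardware does internally.

```c
#include <assert.h>
#include <math.h>

/* Split x into a significand with magnitude in [1, 2) and a power-of-two
   exponent, so that x == significand * 2^exponent.
   x must be finite and nonzero. */
static double normalized_significand(double x, int *exponent)
{
    assert(x != 0.0 && isfinite(x));
    int e;
    double m = frexp(x, &e);   /* frexp gives |m| in [0.5, 1) with x == m * 2^e */
    *exponent = e - 1;         /* shift to the 1.xxx convention used above */
    return m * 2.0;
}

/* Re-express x against a chosen exponent, as is done before an addition:
   returns s such that x == s * 2^target_exponent. */
static double align_to_exponent(double x, int target_exponent)
{
    return ldexp(x, -target_exponent);
}
```

In decimal terms, 101.111 * 2^1 (binary) is 11.75, which normalizes to 1.46875 * 2^3, i.e. 1.01111 * 2^11 in binary. Likewise 1.1 * 2^1 (binary) is 3.0, and aligning it to exponent 2 gives 0.75, i.e. 0.11 * 2^10 in binary.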

So, when computing d = a * b + c, the operations done are multiply, normalize, round, normalize, add, normalize, round. Doing a * b + c is very common, especially when doing matrix computations (geometry, graphics), so this seems inefficient. FMA only does multiply, normalize, add, normalize, round. Since it skips the intermediate rounding, it's not quite the same operation as using separate instructions; it's more precise.

The difference between FMA3 and FMA4 is that in FMA4, all of {a, b, c, d} are separate, but in FMA3, d must be one of {a, c}. This means that you sometimes need to add an extra mov to save the present value of d if you need it again. FP movs are really, really cheap on modern CPUs, so this is not very bad. Space in instructions is at a premium, so having only 3 operands is better if you aren't gonna need the fourth.

AMD originally proposed FMA3 and Intel proposed FMA4. Since the companies don't actually directly talk to each other (can't have that now, that would be way too mature), they both changed what they implemented in the middle to match each other, leaving AMD with FMA4 and Intel with FMA3.

Since FMA3 will eventually be implemented by (way) more CPUs, it will likely be the instruction that people actually use.


2x speedup in the same power envelope of a multiply operation.

Not quite true. Because of the lack of intermediate rounding, there are some things that are much easier when working with FMA. For example, emulating full-precision divide with FMA is mask, shift, add, shift, mask, OR, and a bunch of FMAs where the longest dependent path is 8. Without FMA, you need tens.

Realistically, in most tasks you get somewhat less than 2x (adds and multiplies don't always pair up perfectly).
 

NostaSeronx

Diamond Member
Sep 18, 2011
3,688
1,222
136
FMA3 is non-destructive: it takes a or c from memory and doesn't use a register, making the comparison between FMA3 and FMA4 irrelevant.

In the best case scenario you will get 2x speed up.
 

Tuna-Fish

Golden Member
Mar 4, 2011
1,415
1,734
136
FMA3 is non-destructive: it takes a or c from memory and doesn't use a register, making the comparison between FMA3 and FMA4 irrelevant.

This is not true. According to Intel's reference, there is no version of FMA3 that does not overwrite one of its arguments. This makes sense: because of the way the ModR/M byte works, you cannot have a memory operand in addition to the normal register operands. You need to sacrifice one register operand to use a memory operand.

In the best case scenario you will get 2x speed up.

No, because eliminating the intervening rounding stage wins you precision. If you care about that precision, you need to spend dozens of cycles recovering it without a proper FMA.

For most real tasks, this is not relevant. If you are writing a game, you don't care if the digit in the seventh place is off by one or not. However, in the best-case scenario, FMA is known to give a lot better than 2x speedups.
 

uribag

Member
Nov 15, 2007
41
0
61
Thanks for the answers, guys (especially Tuna-Fish for the detailed explanations).