When is FMA3 better than FMA4?

Edrick

Golden Member
Feb 18, 2010
Neither one is "better".

Each has its own way of working. The difference comes into play when coding the application.

Both should offer about the same performance gain.

Also, AMD supports both now with Piledriver. I think Intel will support both eventually as well.
 

Exophase

Diamond Member
Apr 19, 2012
FMA4 allows for completely independent source and destination operands, while FMA3 requires that one of the source operands is overwritten with the destination. So FMA4 is more flexible, but since FMA3 offers all permutations for selecting which register you want overwritten it's only rarely that FMA4 is actually more useful.
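
A rough sketch of that difference, with a Python dict standing in for the register file. The function and register names are made up for illustration; the 132/213/231 suffixes follow the operand ordering of the FMA3 mnemonics (vfmadd132/213/231), and plain `a * b + c` is used here, so the single-rounding behaviour of a real FMA is not modelled.

```python
def fma4(regs, dst, a, b, c):
    """FMA4-style: dst = a * b + c, with dst independent of the sources."""
    regs[dst] = regs[a] * regs[b] + regs[c]

# FMA3-style: the first operand is both a source and the destination.
# The digits say which operands are multiplied and which one is added.
def fma3_132(regs, op1, op2, op3):
    regs[op1] = regs[op1] * regs[op3] + regs[op2]

def fma3_213(regs, op1, op2, op3):
    regs[op1] = regs[op2] * regs[op1] + regs[op3]

def fma3_231(regs, op1, op2, op3):
    regs[op1] = regs[op2] * regs[op3] + regs[op1]

# FMA4: all three inputs survive, result lands in a fourth register.
four = {"r0": 0.0, "r1": 2.0, "r2": 3.0, "r3": 4.0}
fma4(four, "r0", "r1", "r2", "r3")

# FMA3: same arithmetic, but the addend in r3 is overwritten by the result.
three = {"r1": 2.0, "r2": 3.0, "r3": 4.0}
fma3_231(three, "r3", "r1", "r2")

print(four)
print(three)
```

Picking 132, 213 or 231 is how FMA3 lets you choose which of the three inputs gets sacrificed.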

My guess is that Intel wanted to go with FMA3 because it was more efficient to implement in hardware. If their uop format only has room for three operands, then a single four-operand instruction presents a problem.

I think Intel will support both eventually as well.

I strongly doubt that.
 

jones377

Senior member
May 2, 2004
AMD's support for FMA4 comes from the fact that it was Intel's own specification for FMA before they changed their mind and the specs to FMA3. It's still amazing that AMD managed to get TWO chips with FMA out before Intel when Intel was the one controlling the instruction specification.
 

BenchPress

Senior member
Nov 8, 2011
Why did Intel decide to make AVX2 use FMA3 while AMD went with FMA4?
Intel's first AVX specification actually used the FMA4 instruction format. AMD's SSE5 specification used FMA3. Intel then decided FMA3 was the better idea and switched to it, while practically simultaneously AMD decided to drop SSE5 and implement the original AVX specification...
Is FMA3 better for general purpose while FMA4 is better for games?
No. They're both specifications for the FMA instruction and they have exactly the same uses. The difference is negligible to the end user. FMA3 is slightly more efficient to implement in hardware, but there's also a tiny chance that every now and then an extra instruction (a register copy) is required to work around the limitation it imposes (but on modern processors that copy can be handled at register renaming, so it takes no execution time).
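
To make the "extra instruction" case concrete, here is a small sketch that lowers `dst = a*b + c` to pseudo-assembly strings under each scheme. The helper names and the `mov`/`vfmadd...` strings are illustrative only, not real assembler output; the assumption is simply that FMA3 must write its result over one of its own inputs.

```python
def lower_fma4(dst, a, b, c):
    """FMA4 can name the destination separately: always one instruction."""
    return [f"vfmadd {dst}, {a}, {b}, {c}"]

def lower_fma3(dst, a, b, c):
    """FMA3 writes its result over its first operand."""
    if dst == a:
        return [f"vfmadd213 {dst}, {b}, {c}"]   # dst = b*dst + c
    if dst == b:
        return [f"vfmadd213 {dst}, {a}, {c}"]   # dst = a*dst + c
    if dst == c:
        return [f"vfmadd231 {dst}, {a}, {b}"]   # dst = a*b + dst
    # All three inputs must stay intact: copy one into dst first.
    return [f"mov {dst}, {c}", f"vfmadd231 {dst}, {a}, {b}"]

# Accumulating into c (the common case): no penalty for FMA3.
print(lower_fma3("r3", "r1", "r2", "r3"))
# Result goes to a fresh register and all inputs stay live: one extra copy.
print(lower_fma3("r0", "r1", "r2", "r3"))
print(lower_fma4("r0", "r1", "r2", "r3"))
```

Only the last FMA3 case needs the copy, and that is the instruction that costs no execution time when the core handles register copies at rename.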

For AMD the support of FMA4 is dead weight more than anything else, now that they also support FMA3. They might get rid of FMA4 at some point. It shouldn't be of any concern to consumers.
 

BenchPress

Senior member
Nov 8, 2011
It's still amazing that AMD managed to get TWO chips with FMA out before Intel when Intel was the one controlling the instruction specification.
There's nothing amazing about that. AMD's implementation is pretty horrible. They have two 128-bit vector units per module, while Intel has two 256-bit vector units per core. AMD compensated by making each unit capable of executing a multiplication, an addition, or a fused multiplication and addition per cycle. But they compromised on latency.

Intel hasn't added FMA before, simply because having two 256-bit vector units (one for multiplication and one for addition) is plenty to exhaust the available load/store and cache bandwidth. With Haswell, Intel will double the bandwidth so dual 256-bit FMA becomes useful. What's more, they're not worsening the latencies.
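
As back-of-the-envelope arithmetic for the unit counts above, here is a sketch of peak floating-point operations per cycle. It assumes single-precision (32-bit) elements, counts a fused multiply-add as two operations per lane, and takes the unit counts and widths straight from the description in this post; it says nothing about latency or whether the memory system can keep the units fed.

```python
def peak_flops_per_cycle(units, width_bits, fused, elem_bits=32):
    """Peak FP operations per cycle for `units` vector units of `width_bits`."""
    lanes = width_bits // elem_bits
    flops_per_lane = 2 if fused else 1   # FMA = multiply + add
    return units * lanes * flops_per_lane

# Two 128-bit FMA-capable units (per module, as described above).
amd_module = peak_flops_per_cycle(units=2, width_bits=128, fused=True)
# One 256-bit multiplier plus one 256-bit adder (per core, no FMA).
intel_mul_add = peak_flops_per_cycle(units=2, width_bits=256, fused=False)
# Two 256-bit FMA units (per core, the Haswell configuration described above).
intel_dual_fma = peak_flops_per_cycle(units=2, width_bits=256, fused=True)

print(amd_module, intel_mul_add, intel_dual_fma)
```

Under these assumptions the first two configurations come out equal per cycle (one per module, the other per core), and the dual 256-bit FMA configuration doubles that.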

AMD will have to double the width of its vector units to keep up. But none of their roadmaps make any mention of it. Nor have they announced AVX2 support yet. They're betting the farm on HSA, but it's in deep trouble.
 

lambchops511

Senior member
Apr 12, 2005
FMA3 /probably/ has a shorter instruction than FMA4, since FMA4 requires you to specify 4 registers while FMA3 requires you to specify 3; IF this is true, FMA3 MIGHT be better in that it's smaller and takes less space (e.g., more instructions fit in cache, etc.). I haven't read the specs so don't quote me.

Performance-wise, there probably isn't that big a difference otherwise, thanks to register aliasing. Actually, my guess is Intel resorted to FMA3 to make branch prediction easier (e.g., related to register aliasing and scoreboarding).

My guess is also that the other main reason for FMA3 vs. FMA4 is load capacitance: 256-bit registers are wide and carry a lot of capacitance, and with 3 vs. 4 operands that is 33% extra capacitance. My guess is this extra drive current might be better used to increase clock speed, or that it would increase power consumption too much (and everything is about power these days).
 

BenchPress

Senior member
Nov 8, 2011
FMA3 /probably/ has a shorter instruction than FMA4, since FMA4 requires you to specify 4 registers while FMA3 requires you to specify 3; IF this is true, FMA3 MIGHT be better in that it's smaller and takes less space (e.g., more instructions fit in cache, etc.). I haven't read the specs so don't quote me.
The length of the macro-instruction encoding is not so critical. It's the length of the micro-instruction encoding that's the issue. With FMA4, the uop cache would require extra bits for the fourth operand, while no other instruction would use it.
My guess is also that the other main reason for FMA3 vs. FMA4 is load capacitance: 256-bit registers are wide and carry a lot of capacitance, and with 3 vs. 4 operands that is 33% extra capacitance. My guess is this extra drive current might be better used to increase clock speed, or that it would increase power consumption too much (and everything is about power these days).
Regardless of the encoding format, three input operands have to be read and one result is written. Besides, 256-bit registers aren't wide at all. GPUs use registers of up to 4096 bits, on less advanced semiconductor process technology. That said, AVX can be extended to 1024 bits, and possibly beyond...