When is FMA3 better than FMA4?

Anarchist420 · Nov 9, 2012

Why did Intel decide to make AVX2 use FMA3 while AMD went with FMA4?

Is FMA3 better for general purpose while FMA4 is better for games?

Edrick · Nov 9, 2012

Neither one is "better".

Each has its own way of working. The difference comes into play when coding the application.

Both should offer about the same performance gain.

Also, AMD supports both now with Piledriver. I think Intel will support both eventually as well.

Exophase · Nov 9, 2012

FMA4 allows for completely independent source and destination operands, while FMA3 requires that one of the source operands be overwritten with the result. So FMA4 is more flexible, but since FMA3 offers every permutation for choosing which register gets overwritten, it's only rarely that FMA4 is actually more useful.
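To make the operand roles concrete, here is a minimal scalar sketch in C. The function names are made up for illustration, and plain `a * b + c` only models which operand is multiplied, added, and overwritten; it does not model the single rounding of a real fused multiply-add or the vector width. The 132/213/231 suffixes follow the naming of the FMA3 instruction forms.

```c
/* Scalar model of operand roles only; hypothetical helper names.
   Real instructions work on vector registers and round once. */

/* FMA4 form: separate destination, all three sources preserved. */
void fma4(double *dst, double a, double b, double c) {
    *dst = a * b + c;
}

/* FMA3 forms: op1 is both a source and the destination. The digits
   name the operands: the first two are multiplied, the third is added. */
void fma3_132(double *op1, double op2, double op3) {
    *op1 = *op1 * op3 + op2;   /* op1 * op3 + op2 */
}

void fma3_213(double *op1, double op2, double op3) {
    *op1 = op2 * *op1 + op3;   /* op2 * op1 + op3 */
}

void fma3_231(double *op1, double op2, double op3) {
    *op1 = op2 * op3 + *op1;   /* op2 * op3 + op1 */
}
```

With the three forms, the overwritten register can be either multiplicand or the addend, which is why the fourth operand is rarely missed.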

My guess is that Intel wanted to go with FMA3 because it was more efficient to implement in hardware. If their uop format is only three-operand, then a single four-operand instruction presents a problem.

Edrick said:
I think Intel will support both eventually as well.

I strongly doubt that.

jones377 · Nov 9, 2012

AMD's support for FMA4 comes from the fact that it was Intel's own specification for FMA before they changed their mind and the specs to FMA3. It's still amazing that AMD managed to get TWO chips with FMA out before Intel when Intel was the one controlling the instruction specification.

BenchPress · Nov 9, 2012

Anarchist420 said:
Why did Intel decide to make AVX2 use FMA3 while AMD went with FMA4?

Intel's first AVX specification actually used the FMA4 instruction format. AMD's SSE5 specification used FMA3. Intel thought FMA3 was a good idea, while practically simultaneously AMD decided to drop SSE5 and implement the original AVX specification...

Is FMA3 better for general purpose while FMA4 is better for games?

No. They're both specifications for the FMA instruction and they have the exact same uses. The difference is negligible to the end user. FMA3 is slightly more efficient to implement in hardware, but there's also a tiny chance that every now and then an extra instruction is required to work around the limitation it imposes (but on modern processors that instruction takes no execution time).
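The "extra instruction" case can be sketched with a toy register file in C. Everything here (the `Regs` struct, the helper names, the register numbering) is hypothetical and only counts instructions; it says nothing about timing. The scenario assumed is the one where all three inputs must stay live after the multiply-add, which is when FMA3 needs a register copy first.

```c
/* Hypothetical toy register file that counts issued instructions. */
typedef struct {
    double r[8];
    int issued;
} Regs;

/* Register-to-register copy. */
void mov(Regs *m, int dst, int src) {
    m->r[dst] = m->r[src];
    m->issued++;
}

/* FMA4-style: dst = a * b + c, no source is overwritten. */
void fma4_op(Regs *m, int dst, int a, int b, int c) {
    m->r[dst] = m->r[a] * m->r[b] + m->r[c];
    m->issued++;
}

/* FMA3-style (231 form): acc = a * b + acc, acc is overwritten. */
void fma3_231_op(Regs *m, int acc, int a, int b) {
    m->r[acc] = m->r[a] * m->r[b] + m->r[acc];
    m->issued++;
}

/* Goal: r3 = r0 * r1 + r2 while keeping r0, r1 and r2 intact. */
int with_fma4(Regs *m) {
    fma4_op(m, 3, 0, 1, 2);
    return m->issued;          /* one instruction */
}

int with_fma3(Regs *m) {
    mov(m, 3, 2);              /* copy the addend so r2 survives */
    fma3_231_op(m, 3, 0, 1);
    return m->issued;          /* two instructions */
}
```

If any one of the three inputs is dead after the operation, the copy disappears and both forms need a single instruction.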

For AMD the support of FMA4 is dead weight more than anything else, now that they also support FMA3. They might get rid of FMA4 at some point. It shouldn't be of any concern to consumers.

BenchPress · Nov 9, 2012

jones377 said:
It's still amazing that AMD managed to get TWO chips with FMA out before Intel when Intel was the one controlling the instruction specification.

There's nothing amazing about that. AMD's implementation is pretty horrible. They have two 128-bit vector units per module, while Intel has two 256-bit vector units per core. AMD compensated by making each unit capable of executing a multiplication, an addition, or a fused multiplication and addition per cycle. But they compromised on latency.

Intel hasn't added FMA before, simply because having two 256-bit vector units (one for multiplication and one for addition) is plenty to exhaust the available load/store and cache bandwidth. With Haswell, Intel will double the bandwidth so dual 256-bit FMA becomes useful. What's more, they're not worsening the latencies.

AMD will have to double the width of its vector units to keep up. But none of their roadmaps make any mention of it. Nor have they announced AVX2 support yet. They're betting the farm on HSA, but it's in deep trouble.
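The unit counts and widths quoted above translate into peak per-cycle numbers with some simple arithmetic. The sketch below is a rough calculation in C under stated assumptions: single-precision (32-bit) lanes, a fused multiply-add counted as two floating-point operations, and a hypothetical helper name. It ignores latency, clock speed, and the load/store bandwidth the post is actually concerned with.

```c
/* Peak single-precision FLOPs per cycle from unit count and width.
   Hypothetical helper; a fused multiply-add counts as two operations
   per lane, a plain multiply or add as one. */
int peak_sp_flops_per_cycle(int units, int unit_bits, int fused) {
    int lanes = unit_bits / 32;   /* 32-bit floats per vector unit */
    return units * lanes * (fused ? 2 : 1);
}
```

Plugging in the figures from the post: one 256-bit multiplier plus one 256-bit adder (`units = 2`, not fused) gives 16 per core, two 256-bit FMA units give 32 per core, and two 128-bit FMA units give 16 per module.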

lambchops511 · Nov 9, 2012

FMA3 /probably/ has a shorter instruction encoding than FMA4, since FMA4 requires you to specify 4 registers while FMA3 requires only 3. IF this is true, FMA3 MIGHT be better in that it's smaller and takes less space (e.g., more instructions fit in cache). I haven't read the specs, so don't quote me.

As for other performance aspects, there probably isn't much of a difference thanks to register aliasing. Actually, my guess is that Intel went with FMA3 to make branch prediction easier (e.g., for reasons related to register aliasing and scoreboarding).

My guess is that the other main reason for FMA3 over FMA4 is load capacitance: 256-bit registers are wide and hold a lot of capacitance, and going from 3 to 4 operands means 33% extra capacitance. The extra drive current might be better spent on increasing clock speed, or it might increase power consumption too much (and everything is about power these days).

BenchPress · Nov 10, 2012

lambchops511 said:
FMA3 /probably/ has a shorter instruction encoding than FMA4, since FMA4 requires you to specify 4 registers while FMA3 requires only 3. IF this is true, FMA3 MIGHT be better in that it's smaller and takes less space (e.g., more instructions fit in cache). I haven't read the specs, so don't quote me.

The length of the macro-instruction encoding is not so critical. It's the length of the micro-instruction encoding that's the issue. With FMA4, the uop cache would require extra bits for the fourth operand, while no other instruction would use it.

My guess is that the other main reason for FMA3 over FMA4 is load capacitance: 256-bit registers are wide and hold a lot of capacitance, and going from 3 to 4 operands means 33% extra capacitance. The extra drive current might be better spent on increasing clock speed, or it might increase power consumption too much (and everything is about power these days).

Regardless of the encoding format, three input operands have to be read and one result has to be written. Besides, 256-bit registers aren't wide at all. GPUs use registers of up to 4096 bits, on less advanced semiconductor process technology. That said, AVX can be extended to 1024 bits, and possibly beyond...
