Floating-point numbers are stored as x * 2^y, much like scientific notation writes large numbers as x * 10^y. One effect of this is that results need to be normalized and rounded after every operation (so that, for example, the intermediate result 101.111 * 2^1 gets turned into 1.01111 * 2^11; all digits here, exponents included, are binary). Also, before doing an addition, you need to match exponents (for example, 1.01 * 2^10 + 1.1 * 2^1 needs to be turned into 1.01 * 2^10 + 0.11 * 2^10). This is actually a harder operation than the addition itself.
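To make the normalize and match-exponents steps concrete, here is a toy sketch in Python. The 6-bit format, the (significand, exponent) pairs and the names `normalize`, `align` and `add` are all made up for illustration; real hardware works on wider significands, keeps guard bits when shifting, and handles signs and special values.

```python
FRAC_BITS = 5  # toy format: 1 bit before the binary point, 5 bits after it


def value(sig, exp):
    """Numeric value of a toy float: (sig / 2**FRAC_BITS) * 2**exp."""
    return sig / (1 << FRAC_BITS) * 2.0 ** exp


def normalize(sig, exp):
    """Shift so exactly one bit sits before the binary point, rounding half to even."""
    if sig == 0:
        return 0, 0
    extra = sig.bit_length() - (FRAC_BITS + 1)
    if extra > 0:
        rem = sig & ((1 << extra) - 1)   # bits about to be shifted out
        half = 1 << (extra - 1)
        sig >>= extra
        exp += extra
        if rem > half or (rem == half and sig & 1):
            sig += 1
            if sig.bit_length() > FRAC_BITS + 1:  # rounding carried into a new bit
                sig >>= 1
                exp += 1
    elif extra < 0:
        sig <<= -extra
        exp += extra
    return sig, exp


def align(x, y):
    """Match exponents by shifting the operand with the smaller exponent right.

    This plain shift truncates; real adders keep extra guard bits here.
    """
    (sx, ex), (sy, ey) = x, y
    if ex < ey:
        sx >>= ey - ex
        ex = ey
    else:
        sy >>= ex - ey
        ey = ex
    return (sx, ex), (sy, ey)


def add(x, y):
    """Align, add the significands, then normalize and round the sum."""
    (sx, e), (sy, _) = align(x, y)
    return normalize(sx + sy, e)


# 101.111b * 2^1 normalizes to 1.01111b * 2^3 (exponent 11 in binary).
print(normalize(0b10111100, 1))
# 1.01b * 2^2 + 1.1b * 2^1: the second operand is shifted to 0.11b * 2^2 first.
print(align((0b101000, 2), (0b110000, 1)))
print(add((0b101000, 2), (0b110000, 1)))
```

The last call adds 5 and 3: the raw sum 10.00000b * 2^2 is one bit too wide, so it is normalized again to 1.00000b * 2^3.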
So, when computing d = a * b + c with separate instructions, the operations done are multiply, normalize, round, normalize, add, normalize, round. Doing a * b + c is very common, especially in matrix computations (geometry, graphics), so this seems inefficient. FMA only does multiply, normalize, add, normalize, round. Because it skips the intermediate rounding, it is not quite the same operation as the separate instructions: the result is rounded only once, so it is more precise.
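The precision difference is easy to see with a small experiment. The sketch below does not use a hardware FMA; it emulates one by computing a * b + c exactly with `fractions.Fraction` and rounding once at the end. The helper name `fused_multiply_add` and the input values are chosen just for this illustration.

```python
from fractions import Fraction


def fused_multiply_add(a, b, c):
    """Emulated FMA: exact a * b + c, then a single rounding to double."""
    return float(Fraction(a) * Fraction(b) + Fraction(c))


a = 1.0 + 2.0**-27
b = 1.0 - 2.0**-27
c = -1.0

# Separate multiply and add: the exact product 1 - 2**-54 does not fit in a
# double and rounds to 1.0, so the small term is lost before the add.
unfused = a * b + c

# Fused: the product is kept exact until after the add, so the term survives.
fused = fused_multiply_add(a, b, c)

print(unfused)           # 0.0
print(fused == -2.0**-54)  # True
```

With separate instructions the answer is exactly zero; with a single rounding you get the true value of a * b + c.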
The difference between FMA3 and FMA4 is that in FMA4, all of {a, b, c, d} are separate registers, but in FMA3, d must overwrite one of the inputs (effectively one of {a, c}, since a and b are interchangeable). This means that you sometimes need to add an extra mov to save the present value of that input if you need it again. FP movs are really, really cheap on modern CPUs, so this is not very bad. Space in instruction encodings is at a premium, so having only 3 operands is better if you aren't going to need the fourth.
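As a rough picture of the operand difference, here is a toy register model. The function names and register names are invented for this sketch, and `fma3` models only the form where the destination doubles as the addend (the "231" ordering); it uses ordinary Python arithmetic rather than a true single-rounding FMA.

```python
def fma4(regs, d, a, b, c):
    """Four-operand form: d = a * b + c, and d can be any register."""
    regs[d] = regs[a] * regs[b] + regs[c]


def fma3(regs, d, a, b):
    """Three-operand form: d = a * b + d, so the old value of d is destroyed."""
    regs[d] = regs[a] * regs[b] + regs[d]


def mov(regs, d, s):
    """Register-to-register copy."""
    regs[d] = regs[s]


# FMA4: one instruction, and the addend in r2 is still there afterwards.
regs4 = {"r0": 2.0, "r1": 3.0, "r2": 4.0, "r3": 0.0}
fma4(regs4, "r3", "r0", "r1", "r2")

# FMA3: to keep the addend in r2, copy it first, then accumulate into the copy.
regs3 = {"r0": 2.0, "r1": 3.0, "r2": 4.0, "r3": 0.0}
mov(regs3, "r3", "r2")
fma3(regs3, "r3", "r0", "r1")

print(regs4["r3"], regs4["r2"])  # 10.0 4.0
print(regs3["r3"], regs3["r2"])  # 10.0 4.0
```

Both sequences end in the same state; the FMA3 version just needs the extra mov whenever the overwritten input is still live.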
AMD originally proposed FMA3 and Intel proposed FMA4. Since the companies don't actually talk to each other directly (can't have that now, that would be way too mature), they each switched to the other's proposal partway through, leaving AMD with FMA4 and Intel with FMA3.
Since FMA3 will eventually be implemented by (way) more CPUs, it will likely be the instruction set that people actually use.
2x speedup in the same power envelope of a multiply operation.
Not quite true. Because there is no intermediate rounding, some things are much easier when working with FMA. For example, emulating a full-precision divide with FMA takes mask, shift, add, shift, mask, OR, and a bunch of FMAs, where the longest dependent path is 8 operations. Without FMA, you need tens.
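To give a flavor of the "bunch of FMAs" part, here is a sketch of division by Newton-Raphson refinement of a reciprocal. It is not the exact sequence described above: the initial estimate uses `math.frexp` and a textbook linear approximation instead of bit masking, the FMA is emulated with exact rationals, the iteration count is chosen conservatively, and the function names are made up. It only handles finite b > 0 and makes no claim of correct rounding in every case.

```python
import math
from fractions import Fraction


def fma(a, b, c):
    """Emulated FMA: exact a * b + c, rounded once to double."""
    return float(Fraction(a) * Fraction(b) + Fraction(c))


def divide(a, b):
    """Approximate a / b using only multiplies and (emulated) FMAs."""
    if not (b > 0.0 and math.isfinite(b)):
        raise ValueError("sketch only handles finite b > 0")

    # Crude starting guess for 1/b: write b = m * 2**e with 0.5 <= m < 1 and
    # approximate 1/m by a straight line (relative error at most 1/17).
    m, e = math.frexp(b)
    y = math.ldexp(48.0 / 17.0 - 32.0 / 17.0 * m, -e)

    # Each step roughly squares the relative error of y.
    for _ in range(4):
        err = fma(-b, y, 1.0)   # 1 - b*y with a single rounding
        y = fma(y, err, y)      # y * (1 + err)

    # Form the quotient, then correct it once using the fused residual.
    q = a * y
    r = fma(-b, q, a)           # a - b*q with a single rounding
    return fma(r, y, q)


print(divide(1.0, 3.0))
print(divide(355.0, 113.0))
```

The residual steps `1 - b*y` and `a - b*q` are where the single rounding matters: computed with a separate multiply and subtract, the product would be rounded first and most of the small difference would be lost.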
Realistically, in most tasks you get somewhat less than 2x, because adds and multiplies don't always pair up perfectly.