Official: AMD re-introduces FX Brand for high-end Processors

Abwx

Lifer
Apr 2, 2011
12,012
4,973
136
Bulldozer's shared floating-point cluster with FMA support indeed offers some advantages with legacy 128-bit instructions.

But that advantage will be short-lived once Intel adds FMA support. It would give Intel the same flexibility, and double the peak throughput at the same time. And once applications use 256-bit instructions, it doubles again!

Even if AMD doubles the width of each ALU to 256-bit, it still wouldn't match the same peak throughput per core. Also note that if the IGP is removed from Sandy Bridge, it creates room for two more cores at no extra cost (to Intel). So they can launch such an enthusiast part soon after Bulldozer, and steal AMD's thunder.

BD has double the peak FP throughput of SB when legacy SSE and AVX128 code is used, and is as powerful with AVX256.

Once Intel adds FMA? On the fly?
That will only be the case with Haswell, which is two years away...
By then, AMD can double its FPU cluster, which will then be 2x256-bit, since it's already a 256-bit-per-cycle FPU.

Also, it's not possible to run code using only AVX256, whatever the compiler's efficiency, so I guess that with code being mixed and interleaved, BD will be a match for a 6C SBE even on FP.
 

CPUarchitect

Senior member
Jun 7, 2011
223
0
0
BD has double the FP peak throughput of SB when legacy SSE and AVX128 codes are used...
Only when FMA4 is used.
...and is as powerful with AVX256.
Again, only when FMA4 is used.

There will be a tug of war between Intel promoting AVX256 and AMD promoting FMA4, since each has the potential to double performance on its respective architecture. The harsh reality is that Intel will win this, because AVX256 is supported by both, and Bulldozer will have too little market share for developers to bother implementing and maintaining another code path. Even if some software supports FMA4, it is likely to also use AVX256, putting Bulldozer merely on par in peak performance per clock. Taking into account that it can't execute independent FADD and FMUL in the same clock, in practice Intel has the advantage.
That will only be the case with Haswell, which is two years away...
More like a year and a half, really. And Bulldozer isn't in the shops just yet. So AMD's potential lead for software using AVX128 and FMA4 will be pretty short-lived. Up against 6-core Intel chips, there won't be any lead at all.
By then, AMD can double its FPU cluster, which will then be 2x256-bit, since it's already a 256-bit-per-cycle FPU.
It's not that simple. They also need to increase the cache bandwidth to sustain such a high throughput. That's likely the primary reason Intel delayed FMA support till Haswell. But it means Intel already has solid plans for this. I seriously doubt AMD has much post-Bulldozer development going on, so Haswell will reign supreme for quite a while. And once AMD does catch up, Intel could trump them again with technology like gather/scatter...
Also, it's not possible to run code using only AVX256, whatever the compiler's efficiency, so I guess that with code being mixed and interleaved, BD will be a match for a 6C SBE even on FP.
Nor can every pair of floating-point operations be merged into one FMA operation. So you need to realize that with AVX256 code, Sandy Bridge has the upper hand. The only situation that puts Bulldozer at an advantage is AVX128 code with lots of FMA instructions. You have to stay realistic about that; it's going to be quite rare, and Intel will blow that marginal advantage to pieces with 6-core CPUs, and soon after that Haswell will go balls to the wall.
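
To illustrate why FMA does not simply halve the instruction count, here is a toy Python sketch. It represents an expression as nested tuples and counts floating-point instructions with and without fusing a multiply into the add that consumes it. The `count_ops` helper and the tuple encoding are invented for this example, and the model ignores scheduling, latency and rounding.

```python
def count_ops(expr, fma=False):
    """Count FP instructions for expr.

    expr is either a string (an input value) or a tuple (op, left, right)
    with op in {"add", "mul"}. With fma=True, an add that has a multiply
    as one of its operands is fused with it into a single instruction.
    """
    if isinstance(expr, str):
        return 0  # plain inputs cost nothing
    op, left, right = expr
    if fma and op == "add":
        for prod, other in ((left, right), (right, left)):
            if isinstance(prod, tuple) and prod[0] == "mul":
                # One FMA replaces this add and the multiply feeding it.
                return (1 + count_ops(prod[1], fma) + count_ops(prod[2], fma)
                        + count_ops(other, fma))
    return 1 + count_ops(left, fma) + count_ops(right, fma)

# a0*b0 + a1*b1: one multiply can be fused into the add.
dot2 = ("add", ("mul", "a0", "b0"), ("mul", "a1", "b1"))
# (a + b) * (c + d): the adds feed the multiply, so nothing can be fused.
prod_of_sums = ("mul", ("add", "a", "b"), ("add", "c", "d"))

print(count_ops(dot2), count_ops(dot2, fma=True))
print(count_ops(prod_of_sums), count_ops(prod_of_sums, fma=True))
```

In this model the short dot product drops from three instructions to two, while the product of sums stays at three either way, so the benefit depends heavily on the shape of the arithmetic.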

Don't get me wrong, Bulldozer is a brilliant response to Hyper-Threading technology (4-module/8-core Bulldozer has a shot at outperforming a 4-core/8-thread Sandy Bridge), but it's a late response and the competition already has new products lined up.

I just hope I'm wrong about Bulldozer 2 not being at an advanced stage of development yet. If they managed to double the floating-point execution width and cache bandwidth, and perhaps add in gather/scatter support, all in a year's time, that would put some serious heat on Intel.
 

Riek

Senior member
Dec 16, 2008
409
15
76
Actually, BD and SB have roughly the same sustainable AVX throughput without FMA.
Haswell isn't due for two years, so there is time enough to go with the flow. The 6-core is being countered by a 5-module/10-core BD+. Apart from the different process and power consumption, it would do fairly well, I assume.

Also, you assume Haswell can magically go wider with all the bells and whistles, but BD is made to be unchangeable for years? BD is designed with a certain idea in mind, not just for this year but for a certain future. That's why they have listed an updated revision almost every year from now on.
 

CPUarchitect

Senior member
Jun 7, 2011
223
0
0
Actually, BD and SB have roughly the same sustainable AVX throughput without FMA.
Sandy Bridge has a 256-bit FMUL and 256-bit FADD unit per core. Bulldozer only has a single 256-bit FMA unit, per module. So without FMA instructions Bulldozer's throughput per module would be only half the throughput of a Sandy Bridge core.
Haswell isn't due for two years, so there is time enough to go with the flow. The 6-core is being countered by a 5-module/10-core BD+. Apart from the different process and power consumption, it would do fairly well, I assume.
AMD can't sell a 10-core Bulldozer at the price of a 6-core Sandy Bridge. And Intel will have 10-core parts as well. Especially Ivy Bridge based 22 nm chips will be very tough competition for AMD. So they really can't just "go with the flow" between now and Haswell.
Also, you assume Haswell can magically go wider with all the bells and whistles, but BD is made to be unchangeable for years? BD is designed with a certain idea in mind, not just for this year but for a certain future. That's why they have listed an updated revision almost every year from now on.
Intel's R&D budget is way bigger than AMD's. They have overlapping product development cycles using multiple teams, allowing them to deliver a brand new architecture every two years. In contrast, three and a half years have passed since Phenom, and counting...

So I'm just trying to be realistic here. If it takes Intel a year and a half to add FMA and increase bandwidth, what makes you believe AMD can double the execution width and also increase the bandwidth, in less time? Even if they can somehow pull that off, Intel's 22 nm + FinFET technology still offers higher clock frequencies at lower power consumption.

That said, I'm very curious about AMD's long term plans. Intel is clearly aiming to achieve 1 TFLOP with an 8-core Haswell. By converging AVX and LRBni (by adding gather/scatter and supporting wider vectors executed in a sequenced fashion) they'd create an incredibly powerful homogeneous architecture. But AMD is placing its bets on a much less flexible heterogeneous Fusion architecture, unless those continue to be intended for the low-end market and they do intend on expanding AVX capabilities with Bulldozer 2...
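
As a back-of-the-envelope check on the 1 TFLOP figure, here is a small Python calculation. It assumes single precision and takes Sandy Bridge's 16 FLOPs per clock per core (8 lanes, one add plus one multiply each), doubled by FMA as argued earlier in the thread. The resulting clock speed is derived from those assumptions and is not a published specification.

```python
CORES = 8
SB_FLOPS_PER_CLOCK = 16  # 256-bit FADD + 256-bit FMUL with 32-bit lanes
# Assumption: adding FMA doubles the per-core peak.
HASWELL_FLOPS_PER_CLOCK = 2 * SB_FLOPS_PER_CLOCK

TARGET_FLOPS = 1e12  # 1 TFLOP

flops_per_clock_chip = CORES * HASWELL_FLOPS_PER_CLOCK
required_clock_ghz = TARGET_FLOPS / flops_per_clock_chip / 1e9

print(f"{flops_per_clock_chip} FLOPs/clock across {CORES} cores")
print(f"clock needed for 1 TFLOP: {required_clock_ghz:.2f} GHz")
```

Under these assumptions the chip would need to run at roughly 3.9 GHz, so the target is a peak single-precision number rather than a sustained one.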
 

Abwx

Lifer
Apr 2, 2011
12,012
4,973
136
Even if some software supports FMA4, it is likely to also use AVX256, putting Bulldozer merely on par in peak performance per clock. Taking into account that it can't execute independent FADD and FMUL in the same clock, in practice Intel has the advantage.

How can BD not execute two independent FADD and FMUL ops simultaneously?

Each FPU can execute either an FMUL or an FADD, while SB's FPUs are each stuck with a single type of op.
If two FADDs or two FMULs are scheduled, BD can execute the two ops in one cycle, while SB will need at least two cycles.

You should take a closer look at AMD's FlexFP.
 

CPUarchitect

Senior member
Jun 7, 2011
223
0
0
How can BD not execute two independent FADD and FMUL ops simultaneously?
It has two 128-bit FMA units. But for AVX-256 they have to work in tandem so it can only execute a single ADD, MUL, or FMA instruction. Sandy Bridge can simultaneously execute a 256-bit ADD and MUL.
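
The per-clock figures argued over in this thread can be tallied in a short Python sketch. It assumes 32-bit floats, counts an FMA as two floating-point operations, and models only what is stated here: one FADD port plus one FMUL port per Sandy Bridge core, and two 128-bit FMA units per Bulldozer module that pair up for a 256-bit instruction. The function names are made up for illustration, and real sustained throughput also depends on decode and cache bandwidth.

```python
LANE_BITS = 32  # single-precision lanes (assumption)

def sandy_bridge_core(vector_bits):
    """Peak FLOPs/clock: one FADD port plus one FMUL port, no FMA."""
    lanes = vector_bits // LANE_BITS
    return lanes * 2  # one add and one multiply per lane per clock

def bulldozer_module(vector_bits, fma):
    """Peak FLOPs/clock: two 128-bit FMA units, ganged for 256-bit ops."""
    instructions_per_clock = 2 if vector_bits == 128 else 1
    lanes = vector_bits // LANE_BITS
    flops_per_lane = 2 if fma else 1  # an FMA is a multiply plus an add
    return instructions_per_clock * lanes * flops_per_lane

for bits in (128, 256):
    print(f"{bits}-bit: SB core {sandy_bridge_core(bits)}, "
          f"BD module without FMA {bulldozer_module(bits, fma=False)}, "
          f"BD module with FMA {bulldozer_module(bits, fma=True)}")
```

Under these assumptions a module doubles a core only for 128-bit code with FMA, ties it for 128-bit code without FMA and for 256-bit code with FMA, and drops to half for 256-bit code without FMA.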

By the way, it has been confirmed that Haswell will support gather instructions: Haswell New Instructions (AVX2).
 

Abwx

Lifer
Apr 2, 2011
12,012
4,973
136
It has two 128-bit FMA units. But for AVX-256 they have to work in tandem so it can only execute a single ADD, MUL, or FMA instruction. Sandy Bridge can simultaneously execute a 256-bit ADD and MUL.

By the way, it has been confirmed that Haswell will support gather instructions: Haswell New Instructions (AVX2).

You pointed (unwittingly?) to a good concept: simultaneity...

Certainly SB can execute a 256-bit AVX FADD and FMUL simultaneously, BUT in doing so it uses two of its three execution ports.

So basically, it can do these two ops plus a third one on the remaining execution port, which can only handle ALU ops, shuffles and boolean AVX.

On the other hand, BD can execute both FP and integer ops thanks to its three schedulers per module.

So it can execute two integer ops per core simultaneously with:
-one AVX256 or
-two AVX128 or
-one SSE and one AVX128.

BD has far more flexibility and average throughput capability; it's just up to developers to extract the good juice.

You seem to assume that AVX256 will be the rule, while it will rather be the exception for a while yet.

Usual code is interleaved, and you can do nothing about it; otherwise old legacy instructions would have been abandoned. To this day, if you want good precision before rounding, x87 is still used in lieu of SSE2, and even AVX will change nothing, since it still computes at 64-bit precision.

Not to mention that even Intel recommends the use of AVX128 in many many cases, as AVX256 would be no more efficient.
 

Voo

Golden Member
Feb 27, 2009
1,684
0
76
There will be a tug of war between Intel promoting AVX256 and AMD promoting FMA4, since each has the potential to double performance on its respective architecture. The harsh reality is that Intel will win this, because AVX256 is supported by both, and Bulldozer will have too little market share for developers to bother implementing and maintaining another code path. Even if some software supports FMA4, it is likely to also use AVX256, putting Bulldozer merely on par in peak performance per clock. Taking into account that it can't execute independent FADD and FMUL in the same clock, in practice Intel has the advantage.
Well, most software isn't written explicitly to use SSEx or whatever either. Sure, there are some data structures that make vectorizing code easier for the compiler, but then I'd very much hope that the compiler is able to generate useful code for both extensions from the higher-level abstraction.
I don't see why most programmers would or should have to care about this, so it's more about what the compiler writers will do and how much they can reuse (and many optimizations are done on intermediate code anyway, so that doesn't look that bad).
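
The extra code path being debated might look something like this hypothetical Python sketch of runtime dispatch. The feature names and kernel labels are placeholders, and a real program or compiler runtime would read the feature flags from CPUID rather than take a set of strings.

```python
# Preference order, most specialised path first (an assumption for illustration).
KERNELS = [
    ("avx128+fma4", {"avx", "fma4"}),
    ("avx256", {"avx"}),
    ("sse2", {"sse2"}),
]

def select_kernel(cpu_features):
    """Return the first kernel whose required features are all present."""
    for name, required in KERNELS:
        if required <= cpu_features:  # subset test
            return name
    return "scalar"

print(select_kernel({"sse2", "avx", "fma4"}))  # a Bulldozer-like feature set
print(select_kernel({"sse2", "avx"}))          # a Sandy Bridge-like feature set
print(select_kernel({"sse2"}))                 # an older CPU
```

Each entry in `KERNELS` is a path that someone has to write, test and maintain, whether that is an application developer writing intrinsics by hand or a compiler vendor adding a back-end target.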
 

CPUarchitect

Senior member
Jun 7, 2011
223
0
0
BD can execute both FP and integer ops thanks to its three schedulers per module.

So it can execute two integer ops per core simultaneously with:
-one AVX256 or
-two AVX128 or
-one SSE and one AVX128.
Bulldozer can only sustain a maximum of four operations per clock, due to being limited to four decoders per module. So you need to realize that having additional execution ports hardly ever helps. What really counts is the average port utilization, and Intel has balanced things really well.

Bulldozer's architecture is basically a consequence of avoiding the use of Intel's Hyper-Threading IP. Integer ALUs are cheap, so they basically added two sets of them, each dedicated to a particular thread. This doesn't mean these ALUs will get high utilization.
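
The decode-limit argument can be put as a simple bound: sustained operations per clock cannot exceed the smaller of what the front end decodes and what the back end can issue. Here is a rough Python sketch using the figures quoted in this exchange (four decoders per module, two integer ops per core, two cores per module, up to two 128-bit FP ops). It is a simplification that ignores fused or multi-op instructions, and the function name is invented for the example.

```python
def sustained_ops_per_clock(decode_width, issue_width):
    """Upper bound on ops/clock: limited by the narrower pipeline stage."""
    return min(decode_width, issue_width)

DECODERS_PER_MODULE = 4
INT_OPS_PER_CORE = 2
CORES_PER_MODULE = 2
FP_OPS_PER_MODULE = 2  # two AVX128 ops, per the quoted post

issue_width = INT_OPS_PER_CORE * CORES_PER_MODULE + FP_OPS_PER_MODULE
bound = sustained_ops_per_clock(DECODERS_PER_MODULE, issue_width)

print(f"back end can issue {issue_width} ops/clock")
print(f"sustained bound is {bound} ops/clock")
```

With these numbers the back end could issue six operations per clock, but the four decoders cap the sustained rate at four.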
Not to mention that even Intel recommends the use of AVX128 in many many cases, as AVX256 would be no more efficient.
Could you point me to any document where Intel recommends AVX-128 in "many many" cases?

AVX2 will also extend integer operations to 256-bit. So clearly Intel wants us to use 256-bit.