Well, I think bandwidth isn't the only problem.
More data also means larger register files. Sandy Bridge uses a new register scheme to reduce this problem.
AMD Bulldozer also has a physical register file (PRF), if that's what you're referring to.
Vector performance per Watt is essential, but it's not the only factor. Scalar performance per Watt, which matters in high-IOPS applications, is also something we need to think about.
There are very few things left that can be done to improve scalar performance per Watt, without sacrificing single-threaded performance. One thing they could do is to extend macro-op fusion so that a scalar move instruction followed by a dependent arithmetic instruction is decoded as one non-destructive instruction.
It adds some complexity to the decoding stages, which consumes a bit of power, but fused instructions are executed in half the time and they also free up a slot in the decoded instruction cache and schedulers so other instructions benefit too. Although in theory it could double the execution rate, in practice an IPC improvement of 5-10% for scalar code would be more realistic. Still, if it only consumes 1% more power (all else being equal) then it's a worthwhile performance/Watt improvement.
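To make that concrete, here's a toy decoder pass in Python that fuses a scalar `mov` with a dependent two-operand arithmetic instruction into one non-destructive three-operand macro-op. The tuple encoding and the `add3`-style names are made up purely for illustration; a real decoder works on x86 instruction bytes, not tuples.

```python
# Toy model of mov+arith macro-op fusion in the decoder.
# Instructions are (opcode, dst, src) tuples -- an illustrative
# encoding, not real x86.

def fuse(instructions):
    """Fuse 'mov d, s' followed by a dependent two-operand 'op d, s2'
    (meaning d = d OP s2) into one non-destructive 'op3 d, s, s2'."""
    fused = []
    i = 0
    while i < len(instructions):
        ins = instructions[i]
        nxt = instructions[i + 1] if i + 1 < len(instructions) else None
        if (ins[0] == "mov" and nxt is not None
                and nxt[0] in ("add", "sub", "mul")
                and nxt[1] == ins[1]):          # arith reads/writes the mov's dst
            op, dst, src2 = nxt
            # One macro-op: dst = mov-source OP src2 (non-destructive form)
            fused.append((op + "3", dst, ins[2], src2))
            i += 2
        else:
            fused.append(ins)
            i += 1
    return fused

code = [
    ("mov", "eax", "ebx"),   # eax = ebx
    ("add", "eax", "ecx"),   # eax += ecx  -> fuses into add3 eax, ebx, ecx
    ("mov", "edx", "eax"),   # no dependent arith follows: left alone
]
print(fuse(code))
```

The fused pair occupies one slot instead of two all the way through the decoded instruction cache and schedulers, which is where the 5-10% scalar IPC estimate comes from.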
I don't know much about EE, so to me it just seems like an FMA unit is going to cost more die area, both for the unit itself and for the memory subsystem = =
FMA units are tiny. The RV770 has 800 of them, at just under a billion transistors on a 55 nm process. A quad-core Haswell only needs 64 of them, on a 22 nm process, five years later. So it's really not a lot to ask for. And they're surrounded by so much other logic that if you replaced them with MUL and ADD units instead, it would probably have less than 1% of impact on the transistor count.
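A quick back-of-envelope on those numbers (the RV770 transistor count is the approximate published figure; the Haswell lane count assumes 4 cores x 2 pipes x 8 single-precision lanes per 256-bit pipe):

```python
# Upper bound on transistors per FMA unit: even if we charge the
# whole RV770 die (~956M transistors, 55 nm) to its 800 FMA units,
# each one costs at most ~1.2M transistors, including everything
# around it. The FMA datapath itself is far smaller.
rv770_transistors = 956e6   # approximate published figure
rv770_fma_units = 800
upper_bound_per_fma = rv770_transistors / rv770_fma_units
print(upper_bound_per_fma / 1e6, "M transistors per unit, upper bound")

# A quad-core Haswell with two 256-bit FMA pipes per core:
haswell_fma_lanes = 4 * 2 * (256 // 32)
print(haswell_fma_lanes, "FP32 FMA lanes")
```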
As for the memory subsystem: note that even accessing a single byte of data requires fetching a 64-byte cache line. Sandy Bridge can read 2 x 16 bytes per cycle, but it would be straightforward to extend that to 2 x 32 bytes. Again, note that GPUs have had much higher aggregate L1 cache bandwidth for many years now, while still achieving good power efficiency. Likewise, the L2 cache bandwidth per core hasn't changed since 2006, so there's no need to worry about doubling it on a process with a three times smaller feature size.
It may seem like overkill for scalar code, but that actually doesn't matter. It would take only half the time to transfer a cache line, after which the bus can go back to sleep. So the average power consumption stays the same, and might even be lower thanks to a simpler design.
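The bandwidth arithmetic is simple enough to sketch (port widths as in the post above; this ignores banking and alignment details):

```python
# Cycles to move one 64-byte cache line over the L1 load ports.
LINE_BYTES = 64

def cycles_to_fill(bytes_per_port, ports=2):
    # Ceiling division: a partial transfer still takes a full cycle.
    per_cycle = bytes_per_port * ports
    return -(-LINE_BYTES // per_cycle)

print(cycles_to_fill(16))  # 2 x 16-byte ports (Sandy Bridge): 2 cycles
print(cycles_to_fill(32))  # 2 x 32-byte ports: 1 cycle, bus idles sooner
```

Halving the transfer time is exactly why the wider bus doesn't hurt scalar code: the same energy is spent in half the cycles, then the bus is idle.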
BTW, I once heard the claim that DLP designs are inefficient at handling branches. Is that correct?
Yes. Each of the SIMD components undergoes the same operation, so when you want other operations to be performed on some of them, the old results have to be thrown out and the new values have to be computed with different instructions, and then merged back in.
The problem gets worse with wider vectors. So for code that takes many different control paths it's a good thing that AVX supports vectors that are no wider than the execution units, and that it has multiple execution units.
Of course the opposite is true for less branchy code. With AVX-512 and AVX-1024 executed on 256-bit units you could get better latency hiding and lower power consumption, like on a GPU, but unlike a GPU you can still use AVX-256 to improve efficiency for more granular code.
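The execute-both-paths-and-merge behavior can be emulated in plain Python; each list element plays the role of one SIMD lane, and a per-lane mask decides which result survives the merge (a sketch of the principle, not how any real ISA spells it):

```python
# Emulate predicated SIMD execution: every lane runs BOTH sides of
# the branch, and a per-lane mask selects which result is kept.
def simd_select(xs):
    mask     = [x > 0 for x in xs]   # "branch condition", per lane
    if_true  = [x * 2 for x in xs]   # taken path, computed for ALL lanes
    if_false = [x - 1 for x in xs]   # not-taken path, also ALL lanes
    # Merge: the discarded results are wasted work -- the divergence cost.
    return [t if m else f for m, t, f in zip(mask, if_true, if_false)]

print(simd_select([3, -1, 0, 5]))  # [6, -2, -1, 10]
```

With wider vectors, more lanes end up on the losing side of the mask on average, which is why branchy code favors narrower vectors and multiple execution units.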
Can't they find a way to deal with it? I know there are some asymmetric designs..
Anyway, sounds like a single MUL instruction is better off going to the MUL unit.
Then you're not taking advantage of the ability to execute two MUL instructions simultaneously.
Note that a pair of FMA units can execute FMA+FMA, FMA+ADD, FMA+MUL, MUL+ADD, MUL+MUL, and ADD+ADD. The latter two also help legacy code. With an FMA+MUL configuration where a MUL instruction is always executed by the MUL unit, you can only execute FMA+MUL or ADD+MUL, and it doesn't help legacy code. So it's insane to go through the trouble of implementing FMA+MUL and get so little benefit, when with a tiny bit more effort you get a vastly superior solution.
Yes, it's possible to make an FMA+MUL configuration also handle MUL+MUL, but then you have to be careful not to block FMA instructions from executing. For instance, take the following code: MUL, MUL, FMA, FMA. If they're all independent and you start by executing MUL+MUL, the two FMAs have to share the single FMA unit and take two more cycles, for a total of three cycles. If instead you execute MUL+FMA twice, it completes in two cycles. Making such a decision is certainly possible in theory, but in practice it takes time during pipeline stages that are already critical for achieving good clock rates.
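That scheduling example can be played out in a few lines of Python. Each port is described by the set of ops it accepts, and the naive oldest-first picker below stands in for a scheduler that can't afford to look ahead; real out-of-order schedulers are of course far more involved.

```python
# Issue 4 independent ops on two port configurations, counting cycles
# with a naive oldest-first picker (one op per port per cycle).
def cycles(ops, ports):
    pending = list(ops)
    n = 0
    while pending:
        n += 1
        for capable in ports:
            for op in pending:
                if op in capable:
                    pending.remove(op)
                    break
    return n

ops = ["MUL", "MUL", "FMA", "FMA"]
dual_fma = [{"FMA", "MUL", "ADD"}, {"FMA", "MUL", "ADD"}]
fma_mul  = [{"FMA", "MUL", "ADD"}, {"MUL", "ADD"}]

print(cycles(ops, dual_fma))  # 2: MUL+MUL, then FMA+FMA
print(cycles(ops, fma_mul))   # 3: oldest-first issues MUL+MUL, then FMA, FMA
```

The asymmetric configuration only reaches two cycles if the picker is smart enough to issue MUL+FMA twice, which is exactly the lookahead that's hard to afford in those pipeline stages; dual FMA units hit two cycles even with the dumb picker.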
With dual FMA units you don't have that problem. So for all the above reasons it is most likely that Haswell will have dual FMA units.