> Okay, hey, the 4-wide decoder and 32-byte fetch is great, but you forget that for one core that means only 2-wide decoding and 16-byte fetch.

No, read again. I said that an average of 2 insns per clock would probably be enough, unless you waste the decode bandwidth.

I still maintain that 2-wide with SNB-class branch prediction is probably better in the real world than 3-wide with Phenom-class prediction.

> Decoding is an issue because AMD would need 6-wide decoders, which is nearly impossible with x86. The main reason IBM was obviously able to get it right with Power is that they do not have the decoding issues. And that is what makes Bulldozer suffer.

AMD has always done some predecode, including measuring instruction lengths, when code is loaded from L2 into L1i (which makes an L1i miss much worse than it looks if you only consider the latencies). With that predecode, I don't think 6 insns per clock is that close to impossible. Possibly still too expensive, though.
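To make the predecode point concrete, here is a toy C sketch (my own illustration with made-up numbers, not AMD's actual mechanism or storage format): assume the L2-to-L1i fill path has already recorded the length of the instruction starting at each byte of a fetch window. Picking several instruction start points per cycle then becomes a handful of trivial lookups instead of several dependent variable-length x86 parses, which is the part that makes a wide pick less far-fetched.

```c
#include <stdio.h>

#define FETCH_BYTES 16
#define PICK_WIDTH   6   /* the hypothetical 6-wide case discussed above */

int main(void)
{
    /* Predecoded length of the instruction starting at each byte offset;
     * 0 marks bytes that are not an instruction start. The values are
     * invented purely for illustration. */
    unsigned char insn_len[FETCH_BYTES] =
        { 3, 0, 0, 1, 2, 0, 5, 0, 0, 0, 0, 2, 0, 3, 0, 0 };

    /* The walk below is written sequentially, but each step is a single
     * table lookup; real hardware can turn the same boundary marks into
     * simple pick logic instead of re-parsing x86 bytes every cycle. */
    unsigned offset = 0, picked = 0;
    while (picked < PICK_WIDTH && offset < FETCH_BYTES && insn_len[offset]) {
        printf("slot %u: insn at byte %2u, length %u\n",
               picked, offset, insn_len[offset]);
        offset += insn_len[offset];
        picked++;
    }
    printf("picked %u insns out of a %u-byte fetch window\n",
           picked, FETCH_BYTES);
    return 0;
}
```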
> You have to be careful with execution width. The width you use tells me that you are talking about micro-ops, not macro-ops. So just consider that Phenom II was 6-wide!

No, Phenom II could execute at most 3 ops per clock, one for each of the pipes.
> But it is very misleading to discuss the execution width in micro-ops, because what counts in the end is the macro-op execution width, and there you have, for each core:
>
> Sandy Bridge: Decoding 3 wide*, Execution 3 wide + MacroOp fusion (= 4 if fused), Max allowed address ops 2
> Bulldozer: Decoding 2 wide, Execution 2 wide, Max allowed address ops 2

Can't SNB execute 5 in the situation of (all independent, and with spare decode bandwidth from elsewhere) add, add, add, mov reg mem, mov reg mem?

There are 4 result buses, so BD is Decode: 2 wide, Execution: 4 wide (2 of which can be ALU ops and 2 can be memory ops/inc/dec/lea).
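If anyone wants to poke at that five-instruction mix themselves, here is a rough micro-benchmark sketch (GNU C with x86-64 inline asm; the iteration count and register choices are arbitrary, and the loop's own compare/branch dilutes the result a bit). Each iteration issues three independent adds and two independent loads.

```c
#include <stdio.h>
#include <stdint.h>

int main(void)
{
    uint64_t a = 0, b = 0, c = 0;
    uint64_t buf[2] = { 1, 2 };

    for (uint64_t i = 0; i < 100000000ULL; i++) {
        /* Three independent ALU adds plus two independent loads:
         * the add, add, add, mov reg mem, mov reg mem case above. */
        __asm__ volatile(
            "add $1, %[a]      \n\t"
            "add $1, %[b]      \n\t"
            "add $1, %[c]      \n\t"
            "mov (%[p]),  %%r8 \n\t"
            "mov 8(%[p]), %%r9 \n\t"
            : [a] "+r"(a), [b] "+r"(b), [c] "+r"(c)
            : [p] "r"(buf)
            : "r8", "r9", "memory");
    }

    /* Print the counters so the values are visibly used. */
    printf("%llu %llu %llu\n", (unsigned long long)a,
           (unsigned long long)b, (unsigned long long)c);
    return 0;
}
```

Build with something like `gcc -O2` and run under `perf stat` to read the instruction and cycle counts; the measured IPC is only an approximation of the 5-per-clock question, not an exact answer.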
Shared instruction decoding does raise some interesting possibilities. Notably, during a code-cache miss, and during a branch mispredict when you know you missed but don't yet know the real target, the other thread can in theory use all the decode bandwidth to run ahead in decode a little, and then in turn let the thread that missed use more than its share when it does finally get data. As this would increase the effective decode bandwidth exactly in the situation where it matters most, it could in theory give quite an advantage. I have no idea if AMD does this, or if they just do the inflexible "if a thread isn't sleeping, it gets all the even/odd clocks".
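For what it's worth, here is a deliberately crude C model of that idea (the widths, cycle counts, and policy names are invented, and it says nothing about what Bulldozer actually does): two threads share a 4-wide decoder, thread 0 sits out a stretch of cycles on an instruction-cache miss, and strict even/odd alternation is compared against handing the stalled thread's cycles to its neighbour. It also assumes the borrowing thread always has fetched bytes ready and a queue to put decoded ops into, and it leaves out the "pay it back after the miss" half of the argument.

```c
#include <stdio.h>
#include <stdbool.h>

#define CYCLES        100
#define DECODE_WIDTH  4
#define MISS_START    20   /* thread 0 waits on an I-cache miss...     */
#define MISS_END      40   /* ...during cycles [MISS_START, MISS_END)  */

/* Count instructions decoded per thread under one arbitration policy.
 * flexible == false: strict alternation; a stalled owner's cycle is wasted.
 * flexible == true : a stalled owner's cycle goes to the other thread. */
static void run(bool flexible, unsigned decoded[2])
{
    decoded[0] = decoded[1] = 0;
    for (unsigned cycle = 0; cycle < CYCLES; cycle++) {
        unsigned owner   = cycle & 1;   /* even cycles belong to thread 0 */
        bool     stalled = (owner == 0 &&
                            cycle >= MISS_START && cycle < MISS_END);
        if (!stalled)
            decoded[owner] += DECODE_WIDTH;
        else if (flexible)
            decoded[1] += DECODE_WIDTH; /* neighbour borrows the slots */
        /* else: this cycle's decode slots are simply lost */
    }
}

int main(void)
{
    unsigned fixed[2], flex[2];
    run(false, fixed);
    run(true,  flex);
    printf("strict even/odd : t0=%3u  t1=%3u  total=%3u\n",
           fixed[0], fixed[1], fixed[0] + fixed[1]);
    printf("borrow on stall : t0=%3u  t1=%3u  total=%3u\n",
           flex[0],  flex[1],  flex[0] + flex[1]);
    return 0;
}
```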
