They claim that each core is capable of 4 integer issues/cycle
compared to 3 for K10, but I still haven't seen any technical
explanation of this on any site.
This is complicated. Phenom, for example, could do 3 ALU or AGU operations per cycle. It had 3 AGUs and 3 ALUs, a total of 6 units!
So in a cycle Phenom might be able to do:
3 ALU, 2 ALU + 1 AGU, 1 ALU + 2 AGU, 3 AGU
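The four combinations above are simply every way to fill 3 issue slots from 3 ALUs and 3 AGUs. Here is a minimal Python sketch that enumerates them. The helper name `full_width_mixes` and its parameters are made up for illustration, and it only counts slots (it ignores dependencies and everything else a real core has to deal with):

```python
def full_width_mixes(width, max_alu, max_agu):
    """List every (alu, agu) pair that fills exactly `width` issue slots
    without exceeding the number of ALUs or AGUs available."""
    mixes = []
    for alu in range(max_alu, -1, -1):
        agu = width - alu
        if 0 <= agu <= max_agu:
            mixes.append((alu, agu))
    return mixes


# Phenom as described above: 3 ops per cycle, 3 ALUs, 3 AGUs.
print(full_width_mixes(3, 3, 3))  # [(3, 0), (2, 1), (1, 2), (0, 3)]
```

Feeding in other unit counts shows how quickly the list of allowed mixes shrinks once the number of units equals the issue width.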
A Bulldozer core has only 4 units: 2 AGUs and 2 ALUs. Normally you would say, hey, that is 33% fewer. But Bulldozer has a trick: all four units work in parallel, each with its own dedicated pipeline. So despite having 2 fewer units, it can do more per cycle. However, the mix is somewhat restricted:
Bulldozer in a cycle might be able to do:
2 ALU + 2 AGU
So if the code being executed fits this 50/50 ALU/AGU split, everything is fine and Bulldozer can do 4 ops/cycle. If the mix is very uneven, this might drop to 2 in some cases. What really helps here is that the scheduler has long queues, so its "backlog" can compensate for short stretches of uneven code.
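To make the effect of the mix and of the scheduler backlog concrete, here is a toy Python model. Everything in it is an assumption for illustration: the function name `ops_per_cycle`, the `window` parameter (how many pending ops the scheduler can look at) and the op streams are invented, and it ignores dependencies, latencies and decode limits. It only counts how many ALU ('A') and AGU ('G') ops fit into each cycle.

```python
def ops_per_cycle(ops, max_alu, max_agu, max_total, window):
    """Toy model: average ops issued per cycle for a stream of 'A' (ALU)
    and 'G' (AGU) ops. Each cycle the scheduler scans the first `window`
    pending ops and issues any that still fit the per-cycle limits."""
    pending = list(ops)
    cycles = 0
    while pending:
        alu = agu = 0
        issued = []
        for i, op in enumerate(pending[:window]):
            if alu + agu == max_total:
                break
            if op == "A" and alu < max_alu:
                alu += 1
                issued.append(i)
            elif op == "G" and agu < max_agu:
                agu += 1
                issued.append(i)
        # Remove issued ops back to front so the indices stay valid.
        for i in reversed(issued):
            del pending[i]
        cycles += 1
    return len(ops) / cycles


balanced = "AG" * 8        # perfect 50/50 mix
alu_only = "A" * 16        # very uneven mix
bursty = "AAAAGGGG" * 2    # 50/50 overall, but uneven in short stretches

# Bulldozer-style limits as described above: 2 ALU + 2 AGU, 4 total.
print(ops_per_cycle(balanced, 2, 2, 4, window=8))  # 4.0
print(ops_per_cycle(alu_only, 2, 2, 4, window=8))  # 2.0
print(ops_per_cycle(bursty, 2, 2, 4, window=8))    # 4.0 (backlog hides the bursts)
print(ops_per_cycle(bursty, 2, 2, 4, window=2))    # 2.0 (tiny queue cannot)
```

With Phenom-style limits (`max_alu=3, max_agu=3, max_total=3`) the same model gives about 2.67 ops/cycle for all three streams, whatever the mix.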
This 4 ops/cycle/core is therefore a bit less than you might expect. Intel CPUs are also restricted in their code mix; the only CPUs that could fully intermix AGU and ALU operations were the AMD K7 through K10, but at the expense of using 6 units and therefore a lot of die space. The upside is that, since Intel could not fully intermix either, code optimized for Intel will run much better on AMD Bulldozer than on previous AMD generations. And that without having to recompile it!