I was just wondering if anyone could point out the flaws in this logic.
Depending on the actual workload, the design can incur an insignificant performance hit, or it can be much slower: integer-heavy code barely touches the FPU, while FP-dense code running on both cores of a module contends for it.
This decision in Bulldozer is not so much a "here's a great idea!" thing as a "hmmm... we need a balancing act here, and maybe this is a worthy trade-off" thing. So with that out of the way: the best/ideal design is still separate everything for every core, except for the components that need to be shared because that is their function (for example, shared L2/L3 cache).
In a perfect/ideal/mystical world, the CPU would not only have a separate FPU for each core, but would also have a hybrid branch predictor: two-level adaptive with both global and local history, plus a loop counter, subroutine-return prediction, and a BTB holding a hundred thousand entries. It would also have as much private L1 as possible, then 10x that in shared L2 for the entire chip, with no slower L3 at all.
We could continue this exercise in ideal fantasy and enumerate more parts of the CPU to "perfect", but the point is already obvious: we don't have them because they are impossible within the bounds we have to play in. Our chips need to stay within particular size, power, and thermal envelopes to remain economically feasible. The perfect, turbo-charged branch predictor would take up half the die area of a current chip. High-speed cache memory is extremely expensive. And we have a limited transistor budget for the thermals and die size we are shooting for, which are of course constrained by the process node we have.
I am saying this merely as a point of context: when AMD decided to share the FPU, it was not because of some insight nobody had thought of before, with AMD being the geniuses who finally figured it out. Rather, it's a design for efficiency, a way to balance all the constraints and hit their projected power/performance/thermal/size targets.
So the logic behind a shared FPU is more of
"we can save die-space / transistors by fusing this together, and the trade-off is acceptable because we think X and Y" (where X and Y are justifications for why the trade-off is acceptable), and less of
"discrete FPU units aren't really needed for each core in the real world".
Are the aforementioned X and Y really valid justifications? I don't know. AMD thinks so, and they even think so for server loads. I assume they know more than I do, since they have solid data from their partners/customers, so at the end of the day it may indeed turn out that yes, sharing the FPU is okay, because real-world data says a separate 256-bit FPU per core is hardly necessary today, and we can get away for now with a shared 256-bit FPU that can execute two 128-bit operations at the same time. But from the designers' point of view it is always just a balancing act, because ideally we would still want discrete components for everything, to better handle all situations.
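To make that last point concrete, here is a minimal sketch in C using x86 intrinsics (the function names are mine, purely illustrative). Each Bulldozer module pairs two integer cores with one shared FPU built from two 128-bit pipes: a 128-bit op like the one in add128 needs only one pipe, so both cores can issue them side by side, while a 256-bit AVX op like the one in add256 occupies both pipes at once.

```c
#include <immintrin.h>

/* 128-bit SSE add: uses one of the module's two 128-bit FPU pipes,
   leaving the other pipe free for the sibling core. */
__m128 add128(__m128 a, __m128 b) {
    return _mm_add_ps(a, b);
}

/* 256-bit AVX add: split across both 128-bit pipes, so it occupies
   the module's entire shared FPU for the duration of the operation. */
__m256 add256(__m256 a, __m256 b) {
    return _mm256_add_ps(a, b);
}
```

So a module running two threads of 128-bit (or integer-heavy) code barely notices the sharing, while two threads saturating it with 256-bit ops will contend for the single FPU.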