A Vishera die shot:
http://www.guru3d.com/articles-pages/amd-fx-8350-processor-review,2.html
Take a look at how much die area the L3 takes up . . . then take a look at how large the L2 cache blocks are. AMD clearly had problems with cache density, and I do not think the problem has simply vanished either.
Now check out Kaveri (and Trinity, and Llano):
http://www.anandtech.com/show/7677/amd-kaveri-review-a8-7600-a10-7850k/4
In the Trinity and Kaveri shots, the ratio of die space taken up by the modules vs. the accompanying L2 blocks certainly appears to have changed, with the L2 taking up a more modest amount of real estate relative to the modules themselves. So relative cache density improved a bit, maybe. The other thing to keep in mind is how much die space appears to be committed to the GPU. It certainly looks like AMD could have made a 4-module (4m) Steamroller (SR) chip sans L3 that would be smaller than the current Kaveri die. There are those in the know who insist such a beast could never exist within a reasonable power envelope, and they probably know exactly why . . .
I guess the point I want to make is that defining Instructions Per Clock from a run of only one software thread is completely useless, especially when dealing with AMD's CMT modules. If throughput is what you want to call it, then fine. In the end, that is the only number that really matters, so long as there is software that can scale to the full thread capacity of the chip.
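To put rough numbers on that, here is a back-of-the-envelope sketch in Python. Everything in it is an assumption for illustration only: the function names, the per-thread IPC of 1.0, the 4.0 GHz clock, and especially the `cmt_scaling` factor (the fraction of its single-thread IPC each thread keeps when both cores in a module are busy). None of these are measured figures for any real chip.

```python
def single_thread_rate(per_thread_ipc, clock_ghz):
    """Instructions per nanosecond with only one thread running."""
    return per_thread_ipc * clock_ghz


def full_chip_rate(per_thread_ipc, clock_ghz, modules,
                   threads_per_module=2, cmt_scaling=0.8):
    """Instructions per nanosecond with every thread loaded.

    cmt_scaling is a made-up placeholder for the penalty each thread
    pays for sharing module resources; it is not a measured value.
    """
    threads = modules * threads_per_module
    return per_thread_ipc * cmt_scaling * clock_ghz * threads


# Hypothetical 4-module chip at 4.0 GHz with a per-thread IPC of 1.0.
one = single_thread_rate(1.0, 4.0)
full = full_chip_rate(1.0, 4.0, modules=4)
print(f"one thread: {one:.1f} instr/ns, all threads: {full:.1f} instr/ns")
```

With these placeholder inputs the single-thread figure says nothing about how the chip behaves once all eight threads are loaded, and that is the case that matters if the software can actually scale that far.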