I don't think Bulldozer's L2 is bad at all, at least compared to AMD's own earlier CPUs. From Anandtech's review:
http://forums.anandtech.com/newreply.php?do=newreply&noquote=1&p=33513489
L1/L2 latency-
FX-8150: 4/21 cycles
Phenom II X6: 3/14 cycles
L1 cache latency increased for clock speed reasons. The higher L2 latency has two causes: first, the cache is much larger at 2MB per module, and second, it's shared between the two cores of a module, while in Phenom II each core has a dedicated L2. Despite that, it delivers more bandwidth than the Phenom II's L2, and I think that's quite respectable.
For comparison, moving to a shared L2 cache in Core Duo increased latency to 14 cycles, up from 10 cycles.
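To put the cycle counts quoted above on a common footing, here is a rough sketch that converts them to nanoseconds. The cycle counts come from the figures above; the clock speeds are my assumption (3.6GHz base for the FX-8150, and 3.3GHz assuming the Phenom II X6 in question is the 1100T), and turbo is ignored.

```python
def cycles_to_ns(cycles: int, clock_ghz: float) -> float:
    """Convert a latency in core cycles to nanoseconds at a given clock."""
    return cycles / clock_ghz


# Cycle counts are the ones quoted in the post. The clock speeds are
# ASSUMPTIONS: base clocks, with the Phenom II X6 assumed to be an 1100T.
chips = {
    "FX-8150": {"l1": 4, "l2": 21, "ghz": 3.6},
    "Phenom II X6": {"l1": 3, "l2": 14, "ghz": 3.3},
}


def latency_table(chips: dict) -> dict:
    """Return {chip name: (L1 latency in ns, L2 latency in ns)}, rounded to 2 places."""
    rows = {}
    for name, c in chips.items():
        l1_ns = round(cycles_to_ns(c["l1"], c["ghz"]), 2)
        l2_ns = round(cycles_to_ns(c["l2"], c["ghz"]), 2)
        rows[name] = (l1_ns, l2_ns)
    return rows


if __name__ == "__main__":
    for name, (l1_ns, l2_ns) in latency_table(chips).items():
        print(f"{name}: L1 {l1_ns} ns, L2 {l2_ns} ns")
```

Under those assumed clocks, the gap in absolute time is a bit smaller than the raw cycle counts suggest, since the FX-8150 runs at a higher clock.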
I think the problem is again the module concept. Even with Sandy Bridge-E, directly adding 2 more cores and scaling up the interconnect and memory bandwidth gave diminishing returns. Now add lower performance per clock, plus a module concept that delivers smaller gains than adding full cores.
- Going from 6 K10 cores to 8 Bulldozer cores is, in theory, a 33% increase.
- In reality it ends up being less than that. For 50% more cores, the 3960X mostly ends up only low-40% faster.
- Then add that modules don't deliver as much as full cores.
- And the rest of the applications don't benefit from more cores at all, so there you only see the lower performance per clock.
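The arithmetic in the list above can be sketched roughly like this. The 50% core increase and the ~40% observed gain are taken from the 3960X point (I'm using 0.40 to stand in for "low-40%"). The `module_factor` and `per_clock_factor` parameters are hypothetical knobs I made up to illustrate the last two points, and the 0.9 values are placeholders, not measurements.

```python
def core_increase(old_cores: int, new_cores: int) -> float:
    """Fractional increase in core count, e.g. 6 -> 8 gives about 0.33."""
    return new_cores / old_cores - 1.0


def scaling_efficiency(core_gain: float, observed_gain: float) -> float:
    """Fraction of the theoretical core-count gain that shows up as speedup."""
    return observed_gain / core_gain


def estimated_speedup(old_cores: int, new_cores: int, efficiency: float,
                      module_factor: float = 1.0,
                      per_clock_factor: float = 1.0) -> float:
    """Crude throughput ratio (new / old) for a well-threaded workload.

    module_factor and per_clock_factor are HYPOTHETICAL multipliers (1.0
    means no penalty) for shared-module cores and lower performance/clock.
    """
    gain = core_increase(old_cores, new_cores) * efficiency
    return (1.0 + gain) * module_factor * per_clock_factor


if __name__ == "__main__":
    # 3960X example from the list: 50% more cores, roughly 40% faster.
    eff = scaling_efficiency(0.50, 0.40)
    print(f"theoretical 6 -> 8 core gain: {core_increase(6, 8):.0%}")
    print(f"scaling efficiency from the 3960X example: {eff:.0%}")
    print(f"6 -> 8 with that efficiency: {estimated_speedup(6, 8, eff):.2f}x")
    # Placeholder penalties (NOT measured) for modules and performance/clock.
    print(f"with placeholder penalties: {estimated_speedup(6, 8, eff, 0.9, 0.9):.2f}x")
```

It's a toy model, but it shows how quickly a 33% paper advantage shrinks once each factor takes its cut.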
Trying to push the clocks even higher might work with Piledriver, but I think dropping to 28nm with Steamroller will increase heat density and leakage, and work against them vis-à-vis clock speed improvements.
Relying on further clock increases won't work with Steamroller, as 28nm might end up performing somewhat worse than 32nm. 28nm doesn't just forgo SOI (which is only responsible for a few percent, but still), it may also be a lower-power process. Fortunately, 28nm should improve leakage characteristics, as that's the benefit of a slower transistor.