I'd say both. It is pretty clear by now that AMD's vision for the APU, an anemic CPU plus a good GPU, didn't pan out. They get bashed for the poor performance of the CPU and can't charge a premium for their better graphics performance. But this doesn't change the fact that Bulldozer is a very inefficient architecture in performance per area and performance per watt.
It was never meant to be an anemic CPU (not that Piledriver really is).
The problem isn't the architecture per se. There is nothing wrong with the module design and nothing wrong with Bulldozer's pipeline; the issues are components within the pipeline. Each module is 214 million transistors (about the same size as an SB core) and has the same throughput despite all its issues.
Remember, IPC was meant to go up. It went down a fair way, and if anything the shortfalls cost them even more power, with things like L1I aliasing etc. Also, clock it at 3.0-3.5 GHz and all of a sudden its power consumption looks a lot better (A10-5700 etc.). Clocks have been pushed to the limit because per-clock performance wasn't adequate.
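To put a rough number on the clock/power point, here is a minimal sketch using the usual first-order rule that dynamic power scales with C * V^2 * f. The voltage ratio and the 4.0 GHz baseline are made-up illustrative values, not measured Bulldozer or Piledriver figures, and leakage is ignored.

```python
def relative_dynamic_power(freq_ratio, voltage_ratio):
    """Dynamic power relative to a baseline, from P ~ C * V^2 * f.

    The capacitance term C cancels out because the chip is the same.
    """
    return voltage_ratio ** 2 * freq_ratio


# Hypothetical operating points (assumptions for illustration only):
# baseline at 4.0 GHz, lower point at 3.4 GHz with 10% less voltage.
baseline = relative_dynamic_power(1.0, 1.0)
lower = relative_dynamic_power(3.4 / 4.0, 0.9)

print(f"baseline: {baseline:.2f}, lower clock: {lower:.2f}")
```

Under those assumed numbers a 15% clock drop gives roughly a 31% drop in dynamic power, because the lower clock also lets the voltage come down. That is the general reason a chip pushed to its clock limit looks so bad on perf per watt.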
Now here is the list of issues that, as far as I can see, played a major part in Bulldozer's problems. You will notice that pretty much none of them have anything to do with it being a module, or require radical changes to the pipeline:
L1I cache aliasing (fixed in SR)
L1D write bandwidth/latency: it is 6:1 R/W. I could understand 4:1 on both cores, given an aggregate to the L2 of 2:1. Most people blame the Write Coalescing Cache (an improved L1D was mentioned by Papermaster for SR)
Instruction decode: it's vertical multithreading, which hurts both threads, and it can't decode enough instructions to feed two cores (fixed in SR with dual decoders)
Branch mispredict latency: it's been noted in articles, even on AnandTech, that it hurts both power consumption and potentially throughput (loop buffer added for SR)
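For anyone unfamiliar with why aliasing in a low-associativity cache hurts so much, here is a generic sketch of set conflicts. It assumes a 64 KB, 2-way, 64-byte-line instruction cache with LRU replacement (figures commonly quoted for Bulldozer's L1I, not taken from this post), and it is not a model of the exact Bulldozer mechanism. All function names are hypothetical.

```python
LINE_BYTES = 64   # assumed cache line size
NUM_SETS = 512    # assumed: 64 KB / (2 ways * 64 B)
WAYS = 2          # assumed associativity


def set_index(addr):
    """Which set an address maps to."""
    return (addr // LINE_BYTES) % NUM_SETS


def count_misses(addresses):
    """Count misses for an access trace with per-set LRU replacement."""
    sets = {}  # set index -> list of tags, least recently used first
    misses = 0
    for addr in addresses:
        tag = addr // (LINE_BYTES * NUM_SETS)
        lru = sets.setdefault(set_index(addr), [])
        if tag in lru:
            lru.remove(tag)          # hit: refresh its LRU position below
        else:
            misses += 1
            if len(lru) == WAYS:
                lru.pop(0)           # evict the least recently used line
        lru.append(tag)
    return misses


# Lines 32 KB apart land in the same set under these assumptions.
stride = LINE_BYTES * NUM_SETS
hot_lines = [0x1000 + i * stride for i in range(3)]

print(count_misses(hot_lines[:2] * 100))  # two lines fit in a 2-way set
print(count_misses(hot_lines * 100))      # three lines keep evicting each other
```

With two hot lines only the first touch of each misses. With three lines cycling through a 2-way set, every single access misses, which is the kind of cliff you get when two threads share a cache and their code happens to collide.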
You gain on average 15% perf per core on BD/PD in threaded workloads by disabling one core in the module, and that only alleviates one of those issues. SR is supposed to fix all of them, plus additional benefits (L/S prefetchers/predictors etc.). Bulldozer's issues are largely in instruction throughput, which is sad for the rest of the module, which appears to behave as expected.
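The 15% figure can be turned into a module scaling number with a bit of arithmetic. This is just a back-of-the-envelope sketch: the function names are mine, and it assumes the 15% average applies uniformly to both cores.

```python
def sharing_penalty(single_core_gain):
    """Per-core slowdown when both cores in a module are active.

    single_core_gain is the speedup a core gets when its sibling is
    disabled (0.15 for the 15% figure quoted above).
    """
    return 1.0 - 1.0 / (1.0 + single_core_gain)


def module_scaling(single_core_gain):
    """Throughput of two active cores relative to one core running alone."""
    return 2.0 / (1.0 + single_core_gain)


gain = 0.15
print(f"per-core penalty from sharing: {sharing_penalty(gain):.1%}")
print(f"two cores vs one core alone:   {module_scaling(gain):.2f}x")
```

So each core gives up about 13% when its sibling is active, and the second core still adds roughly 74% more throughput to the module.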
There have been a bunch of things, like the 22-cycle L2 latency, that people have latched on to and blamed, but I have never seen anyone show that it is a performance bottleneck.