Branch prediction improves nearly every generation, doesn't it? And branch prediction changes usually don't add much IPC anyway, right?
On the contrary. Branch prediction is pretty much a necessity: branch mispredictions are one of the biggest bottlenecks to instruction-level parallelism, and better prediction is one of the only ways to keep increasing single-thread performance. Without consistent work on branch prediction over the past 20 years, the performance we get from CPUs today
wouldn't be possible, and most other microarchitectural features would be rendered moot. Good prediction is what makes modern 6+ wide decode setups usable: the wider the machine, the more in-flight work a single misprediction throws away.
Remember that much-criticized architectures like NetBurst (Pentium 4) and AMD's Bulldozer performed badly largely because their pipeline lengthening cost more than their branch predictor improvements could recover.
Increasing cache, by contrast, is comparatively easy, though inner caches like L1 are harder to grow than L3.
Also, branch predictor enhancements require significant work: real design thought is needed rather than just enlarging buffers, even though sizing is part of it. Prediction is so important to performance that any significant uarch change is accompanied by branch prediction improvements.
The biggest reason for Load/Store changes is the continual work on vector performance. Wider L/S benefits general-purpose code too, but past a point it's mostly there for wider AVX. You'll notice that every time SIMD width doubles, so does L/S width.