I wouldn't say that. The jump is significant when you take into account the code optimization and maintenance cost.
The PPC in-order CPU bottlenecks have been talked to death, but it's always good to look back to see how the modern CPUs (including Jaguar) make our life much easier.
The 3 biggest time sinks when programming older in-order PPC CPUs:
1. Lack of store forwarding hardware
Store buffer stalls (also known as load-hit-store or LHS stalls) are the biggest pain in the ass when programming a CPU without store forwarding hardware. The stall is caused by the fact that memory writes must be delayed by ~40 cycles (because of buffering & possible branch misses). If the CPU reads a memory location that was just written to, it stalls for up to 40 cycles. These stalls are everywhere. C/C++ compilers are notorious for pushing things to the stack and reading the data back from the stack right after that (register spilling, function calls pushing parameters to the stack, read & modify on class variables in loops, etc). Normally you just want to hand optimize the most time critical parts of your program (expert programmers are good at this), but LHS stalls affect every single piece of code, so you must teach every single junior programmer techniques to avoid them... or you are in trouble. LHS stalls are a huge time sink.
Modern CPUs have robust store forwarding hardware. You no longer need to worry about this at all. Result: Lots of saved programmer time.
2. Cache misses and lack of automatic data pre-fetching
The second largest time sink when writing code for older CPUs. Caches have been the most important thing for CPU performance for a long, long time. If the data you access is not in the L2 cache (a cache miss), you have to wait for up to 600 cycles. On old in-order CPUs the CPU does nothing during this time (you lose up to 600 instruction slots). Modern out-of-order CPUs can reorder some instructions to partially hide the memory stalls. Modern CPUs also have automatic data cache pre-fetchers that find patterns in your load addresses and automatically load the cache lines you will likely access before you need them. Unfortunately the old PPC cores didn't have automatic data pre-fetching hardware. You had to manually pre-fetch data, even for linear accesses (going through an array, for example). Again, every programmer must know this and add the manual cache pre-fetch instructions to their code to avoid the up-to-600-cycle stalls.
Modern CPUs have robust, fully automatic data prefetching hardware that almost always does a better job than a human, with no extra coding & maintenance cost. Modern CPUs also have larger caches (Jaguar has 2 MB per 4-core cluster) that are faster (lower load-to-use latency).
3. Lack of out-of-order execution, register renaming + long pipeline
A long pipeline means that instructions have long latencies. Without register renaming, a register cannot be reused while it is still in use by some instruction in the pipeline. Without out-of-order execution this results in lots of stalls. The main way to avoid these stalls is to unroll all tight loops. Unfortunately unrolling often needs to be done manually, and this takes time and leads to code that is hard to maintain and modify.
Modern Intel CPUs (and AMD Jaguar) have relatively short pipelines (and loop caches). All modern CPUs have out-of-order execution and register renaming. On these CPUs, loop unrolling often actually degrades performance instead of improving it (because of the extra instruction footprint). So, the choice is clear: keep the code clean and let the compiler write a proper loop. Save a lot of time now, and even more time when you need to modify the existing code.
Answer
Jaguar is a fully modern out-of-order CPU. It has good caches, good pre-fetchers and a fantastic branch predictor (which AMD later adopted for Steamroller, according to Real World Tech: http://www.realworldtech.com/jaguar/). With Jaguar, coders can focus on optimizing things that actually matter, instead of writing boilerplate "robot" optimizations across the vast code base.
Jaguar pushes through the huge majority of the old C/C++ code hand optimized for PPC without breaking a sweat. You can actually remove some of the old optimizations and make it even faster. Obviously in vector processing loops you need to port the VMX128 intrinsics to AVX (they wouldn't even compile otherwise), but that's less than 1% of the code base. It's not that hard to port really, since the AVX instruction set is more robust (mostly it's a 1:1 mapping, and sometimes a single AVX instruction replaces two VMX128 instructions).
You asked me about the FP execution units. All I can say is that I am very happy that the Jaguar FP/SIMD execution units have super low latency. Most of the important instructions have just one or two cycles of latency. That's awesome compared to those old CPUs that had 12+ cycle latencies for most of the SIMD ALU operations. If you are interested in Jaguar, the AMD Family 16h Optimization Guide is freely available (download it from the AMD website). It includes an Excel sheet that lists all instruction latencies/throughputs. It's a good read if you are interested in comparing Jaguar's low level SIMD performance to other architectures.