Well, that's right. But such applications will mostly be run on real CPUs or GPUs, not anemic APUs. Even if the APU itself were more efficient, overall system efficiency would probably be lower. Maybe useful in dense servers like SeaMicro's, but not widely useful in the PC market as such. And of course we have to wait for more info on Xeon Phi performance, which should be well suited to just such application profiles.
Separate CPUs and GPUs are a complete pain in the arse to develop for. Have you tried writing software for them? In any sufficiently complex algorithm, you will have sections that perform better on a parallel processor like a GPU, and sections that run better on a CPU with strong single-threaded performance and excellent branch prediction. And generally speaking, these sections are not neatly divided into one half you can run on the CPU and one half you can offload to the GPU. But because of the costs of passing tasks from one to the other (scheduler overheads, plus uploading and downloading data to and from the card), you have to draw the line somewhere. If you try passing work back and forth between them, you get absolutely killed by the PCIe bus.

Just look at PC gaming. Stages of the graphics pipeline like frustum culling and occlusion culling are inherently parallel problems which *should* map very well to a GPU. But because of the cost of running them there, passing the results back to the CPU, and then using them to schedule more work for the GPU, they are run on the CPU instead. (There's a sketch of that ping-pong pattern after the link below.)

By combining the two into one chip with a shared memory pool, with no need to pass data across external buses, you can take this kind of fine-grained algorithm and just run each task where it makes sense. If you want some reading on the occlusion culling and frustum culling stuff, go read this:
http://www.slideshare.net/zlatan4177/gpgpu-algorithms-in-games (Warning: contains AMD slides, and hence marketing jargon... but still contains useful information.)
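To make the ping-pong cost concrete, here's a minimal sketch in plain C against the CUDA runtime API (the buffer size, frame count, and timing scheme are all made up for illustration). All it does is bounce a small result buffer across the bus once per "frame", the pattern the culling example above forces on you; even with no kernel doing any work at all, the fixed latency of each crossing adds up.

```c
/* Minimal sketch (plain C + CUDA runtime API, made-up sizes): times the
   cost of bouncing a small per-frame payload between host and device,
   the pattern that makes fine-grained CPU/GPU ping-pong so expensive
   over PCIe. Compile with: nvcc pingpong.c -o pingpong */
#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#include <cuda_runtime.h>

int main(void)
{
    const size_t bytes  = 64 * 1024;  /* small payload, e.g. culling results */
    const int    frames = 1000;

    char *host = malloc(bytes);
    char *dev  = NULL;
    cudaMalloc((void **)&dev, bytes);

    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (int i = 0; i < frames; i++) {
        /* upload inputs, (culling kernel would run here), download
           results: two bus crossings per frame, even for tiny data */
        cudaMemcpy(dev, host, bytes, cudaMemcpyHostToDevice);
        cudaMemcpy(host, dev, bytes, cudaMemcpyDeviceToHost);
    }
    clock_gettime(CLOCK_MONOTONIC, &t1);

    double ms = (t1.tv_sec - t0.tv_sec) * 1e3
              + (t1.tv_nsec - t0.tv_nsec) / 1e6;
    printf("%d round trips: %.2f ms total, %.3f ms each\n",
           frames, ms, ms / frames);

    free(host);
    cudaFree(dev);
    return 0;
}
```

The point isn't bandwidth; it's that every synchronous crossing has a fixed latency, and a fine-grained algorithm pays it over and over. On a shared-memory APU both copies simply disappear, because both sides can work on the same buffer.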
As for Xeon Phi: it's basically a graphics card. It has some neat features, like the ability to run a Linux + MPI stack natively, but it's a graphics card. It has all the same drawbacks: a separate memory pool to manage, sucky single-threaded performance, and a requirement for massively parallel tasks to extract any performance out of it. However, it will get a lot more interesting if Intel integrate it into their CPUs (as SemiAccurate is suggesting they will at Skylake).
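For the "Linux + MPI natively" bit, the point is that the card behaves like just another Linux node, so a completely ordinary MPI program can have ranks launched directly on it, rather than having work offloaded to it like a GPU. A minimal sketch (standard MPI C, nothing Phi-specific in it; hostnames and launch mechanics are implementation details):

```c
/* Bog-standard MPI hello world. "Runs MPI natively" just means ranks
   of a program like this can be placed directly on the card, as if it
   were another node in the cluster.
   Compile: mpicc hello.c -o hello    Run: mpirun -n 4 ./hello */
#include <stdio.h>
#include <mpi.h>

int main(int argc, char **argv)
{
    int rank, size, len;
    char name[MPI_MAX_PROCESSOR_NAME];

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);  /* this rank's id */
    MPI_Comm_size(MPI_COMM_WORLD, &size);  /* total ranks */
    MPI_Get_processor_name(name, &len);    /* host (or card) we landed on */

    printf("rank %d of %d on %s\n", rank, size, name);

    MPI_Finalize();
    return 0;
}
```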