Let's see... hope I don't miss anything...
In the past, multi-core processor packages were reserved for the high end because of their poor cost/performance ratio. Thanks to Moore's Law, transistor budgets are now high enough to offset the cost of the extra logic. In fact, for the past several years, the growth of on-die caches has been driven by both increasing transistor budgets and the widening memory gap.
Nowadays, a lot of people view dual-core for the consumer as a band-aid. In a way, that's true because of the growing difficulties in clock scaling. It's not like nobody saw it coming; it's just that the problems showed up sooner than anyone predicted. No one's fault, really. CPU architects have to predict four to five years into the future, and process engineers look even further. Since clock scaling is hitting a brick wall, the easiest way to increase performance (and give consumers an incentive to buy) is to go dual-core. It's the same situation as the internet boom: whether or not people really need more performance, CPU companies have to make money.
Dual-cores are basically SMP systems on a shared bus. For threaded workloads, that translates to roughly a 40-80% performance increase on average. For single-threaded applications, expect something like a 1-2% decrease in performance.
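To make that concrete, here's a minimal sketch of why the second core only helps if the work is actually split into threads. The workload, sizes, and function names are made up for illustration; this isn't a benchmark.

#include <cstddef>
#include <cstdint>
#include <iostream>
#include <thread>
#include <vector>

// CPU-bound busy work over a slice of the data (placeholder workload).
static uint64_t sum_squares(const std::vector<uint64_t>& v,
                            std::size_t begin, std::size_t end) {
    uint64_t total = 0;
    for (std::size_t i = begin; i < end; ++i)
        total += v[i] * v[i];
    return total;
}

int main() {
    std::vector<uint64_t> data(10000000, 3);

    // Single-threaded: only one core ever does the work.
    uint64_t serial = sum_squares(data, 0, data.size());

    // Dual-core friendly: split the range in half, one thread per core.
    uint64_t lo = 0, hi = 0;
    std::thread t1([&] { lo = sum_squares(data, 0, data.size() / 2); });
    std::thread t2([&] { hi = sum_squares(data, data.size() / 2, data.size()); });
    t1.join();
    t2.join();

    std::cout << serial << " == " << (lo + hi) << "\n";
    return 0;
}

The two-thread version is the best case; real threaded workloads also fight over the shared bus and caches, which is part of why the average gain is well short of 2x.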
Simply widening a single-core processor will do little more than slow the whole processor down; a wider-issue core runs slower for various electrical and logical reasons. Studies on ILP put the sweet spot at around 3-4 issue slots. Few instruction streams contain runs of 4+ consecutive instructions that are independent. The majority of general purpose code averages 2, the next most common is 1, and there is still a significant presence of 3.
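A toy example of what those ILP numbers look like in code (hypothetical functions, just to show independent work versus a dependency chain):

// The compiler/CPU can overlap independent operations, but a dependency
// chain has to execute one step at a time no matter how wide the core is.
int ilp_friendly(int a, int b, int c, int d) {
    int x = a + b;   // independent of y
    int y = c + d;   // can issue in the same cycle as x on a 2-wide core
    return x * y;    // must wait for both
}

int dependency_chain(int a, int b, int c, int d) {
    int x = a + b;   // each step needs the previous result,
    int y = x + c;   // so extra issue width just sits idle
    int z = y + d;
    return z;
}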
CPU/GPU combo packages will not show up anytime soon. HT already discussed the topic to death.
The only possibility may be low-end packages with the dog-slow performance of five to ten years ago. The problem is both financial and technical. Technically, a super chip with high-end graphics and a high-end CPU is possible. However, it's going to take years to develop, around 5,000 of the best engineers in the world, and the Walton family's fortune. It would require close to 2000 or more contacts and 300-400 M transistors (NV40 is 222 M, I believe), and manufacturing yields would top out around 10% (fab yields top out at 90+% these days). Once the chip is ready, there are logistical problems with configurations and technical problems with the rest of the system, including power and memory (unless you're going with on-package memory, in which case everything gets harder by a factor of 4).
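To see why die size kills yield, here's a back-of-the-envelope sketch using the standard Poisson yield model, yield = exp(-defect density x die area). The defect density and die areas below are assumptions for illustration, not real process data.

#include <cmath>
#include <cstdio>

int main() {
    // Poisson yield model: yield = exp(-defect_density * die_area).
    // All numbers here are illustrative assumptions.
    const double defects_per_cm2 = 0.5;   // assumed defect density
    const double small_die_cm2   = 1.0;   // typical CPU-sized die (assumed)
    const double huge_die_cm2    = 4.0;   // hypothetical CPU+GPU monster die

    std::printf("small die yield: %.0f%%\n",
                100.0 * std::exp(-defects_per_cm2 * small_die_cm2)); // ~61%
    std::printf("huge die yield:  %.0f%%\n",
                100.0 * std::exp(-defects_per_cm2 * huge_die_cm2));  // ~14%
    return 0;
}

The point isn't the exact percentages; it's that yield falls off exponentially with die area, so a chip several times larger than a normal CPU gets clobbered.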
On the other hand, bus interconnects would be blazingly fast. On-die buses have much lower latency and can likely run faster than motherboard traces. This is assuming you can get around the noise.
That said, system-on-chip is coming in the future. It's just going to take a long time. Embedded systems are already there, but that's because power and simplicity are more important there than performance and flexibility. If everyone in the world used low-performance PCs with the exact same configuration, we'd have single-chip systems everywhere.
A while back, GPUs had very deep pipelines. They could afford them because they don't hit as many of the events that stall or flush a CPU's pipeline; the more familiar of those are branches and context switches. These days, I'm not sure how deep the average graphics pipeline extends. Either way, a 1 GHz single-pipeline GPU would probably still outperform a 3 GHz Pentium 4. The primary reason is customization. GPUs are designed from the ground up to render images; CPUs are designed for general-purpose tasks. What takes one pass on a GPU may require multiple, dependent instructions on a CPU. Factor in the massive parallelism within a single GPU, massive memory bandwidth, and the embarrassingly parallel nature of rendering, and you get a 300 MHz GPU that can outrender a 3.2 GHz Pentium 4.
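Here's a rough sketch of what "embarrassingly parallel" means for rendering. The shading function is a made-up placeholder, not anything a real GPU or driver does.

#include <cstdint>
#include <vector>

struct Pixel { uint8_t r, g, b; };

// Placeholder per-pixel shading; real work would sample textures, blend, etc.
static Pixel shade(int x, int y) {
    return Pixel{ static_cast<uint8_t>(x & 0xFF),
                  static_cast<uint8_t>(y & 0xFF),
                  128 };
}

int main() {
    const int width = 640, height = 480;
    std::vector<Pixel> frame(width * height);

    // Every pixel depends only on its own inputs, never on another pixel.
    // A CPU walks this loop a pixel (or a few) at a time; a GPU throws many
    // pipelines at it and shades whole batches of pixels per clock, which is
    // why a low-clocked GPU can outrender a much faster general-purpose CPU.
    for (int y = 0; y < height; ++y)
        for (int x = 0; x < width; ++x)
            frame[y * width + x] = shade(x, y);

    return 0;
}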
That said, GPUs run rather slow relative to CPUs on the same process. There are several reasons for this, but I'm only familiar with a few. First off, GPUs have shorter product cycles (thanks to nVidia), so they are usually designed using HDLs, whereas CPUs are largely designed by hand. That has traditionally been the case, although who knows what really goes on in companies' labs these days. I believe Intel has been experimenting with HDL on the Prescott design, and I'm guessing nVidia and ATI are looking to extend product cycles and, subsequently, design cycles.
Also, GPUs are highly parallel architectures with a massive number of transistors dedicated to logic, while CPUs are only somewhat parallel, with the majority of transistors dedicated to memory. Between noise, coupling, power, and other issues, ramping up a GPU's clock is more difficult. With memory, only certain sections are active at any given time, whereas logic is always on. Of course, this is assuming no power-saving logic.
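For the power angle, here's a quick sketch using the standard dynamic power relation P ~ alpha * C * V^2 * f. The activity factors, capacitance, and voltage below are made-up numbers, just to show why a chip that is mostly always-active logic is harder to clock up than one that is mostly cache.

#include <cstdio>

// Dynamic power: P = alpha * C * V^2 * f (alpha = fraction of capacitance
// switching each cycle). Units picked so the result comes out in watts.
static double dynamic_power(double activity, double capacitance_nF,
                            double voltage, double freq_GHz) {
    return activity * capacitance_nF * 1e-9 * voltage * voltage * freq_GHz * 1e9;
}

int main() {
    const double v = 1.3;    // assumed supply voltage
    const double c = 50.0;   // assumed total switched capacitance, nF

    // Cache-heavy CPU: only a small fraction of the chip switches each cycle.
    std::printf("CPU-like (alpha=0.1) at 3 GHz: %.0f W\n",
                dynamic_power(0.1, c, v, 3.0));
    // Logic-heavy GPU: most of the chip is busy every cycle.
    std::printf("GPU-like (alpha=0.4) at 3 GHz: %.0f W\n",
                dynamic_power(0.4, c, v, 3.0));
    return 0;
}

Same process, same clock, but the logic-heavy chip burns several times the power, which is a big part of why GPUs sit at lower clocks.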