There are two ways to do this. First, you can put two separate dies (CPUs) into the same package. IBM has been doing this for quite a while, mainly because it owns most of the major patents on using liquid cooling to cool these kinds of packages. As an extreme case, IBM's Z900 multi-chip module puts 20 CPUs and 32MB of L2 cache into a single package that dissipates a little over 1.4kW (that's just this one package, not the entire computer... just the 20-CPU multi-chip module). They run some form of liquid cooling through the package itself so that the dies see an average temperature of about 10C.
So that's one way - the expensive way.
The other way is the way the entire industry appears to be headed over the long term, and that's multiple CPUs on the same die. If you think about it, there's room on a Pentium 4 die for a whole lot of 486s. So one way to think about it is putting a whole lot of little CPUs onto the same chunk of silicon and hooking them up together. There are, of course, plenty of problems with this approach - or else everyone would already be doing it. The biggest is that you need software that is actually written to use multiple CPUs before the extra cores do you any good. Multiple cores make sense when you are IBM and you have one of their "eServers" running some transaction-processing function for a website like Ebay. It makes a lot less sense for the average computer user. How many users on AT have dual-CPU systems? And what benchmarks really show off the power of that dual-CPU system? Plus, modern compilers don't handle more than dual-CPU systems very well, and neither do typical modern OSes (neither Windows nor Linux).
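To make that software point a little more concrete, here is a rough sketch (my own illustration, not code from any real product) of what "written to use multiple CPUs" means in practice. The programmer has to split the work into threads explicitly - a second CPU or core does nothing at all for a single-threaded program. This example uses POSIX threads and just sums an array in two halves:

/* Illustration only: splitting a simple sum across two threads with
 * POSIX threads.  On a single-CPU system the two threads just time-slice;
 * on a dual-CPU (or dual-core) system the OS can run them in parallel. */
#include <pthread.h>
#include <stdio.h>

#define N 1000000
static double data[N];

struct chunk { int start, end; double sum; };

static void *partial_sum(void *arg)
{
    struct chunk *c = arg;
    c->sum = 0.0;
    for (int i = c->start; i < c->end; i++)
        c->sum += data[i];
    return NULL;
}

int main(void)
{
    for (int i = 0; i < N; i++)
        data[i] = 1.0;

    struct chunk lo = { 0, N / 2, 0.0 };
    struct chunk hi = { N / 2, N, 0.0 };
    pthread_t t1, t2;

    /* The work has to be split up by hand (or by the compiler) --
     * nothing about having a second core makes this happen automatically. */
    pthread_create(&t1, NULL, partial_sum, &lo);
    pthread_create(&t2, NULL, partial_sum, &hi);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);

    printf("sum = %f\n", lo.sum + hi.sum);
    return 0;
}

Most desktop software simply isn't structured this way, which is why the benefit for the average user is so hard to demonstrate.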
But multiple cores on the same die are most definitely in the cards for the future. Not the near future, but easily within the decade, I would think (no insider knowledge behind this guess, by the way). The long-term issue is that future process technologies make an awful lot of transistors available: the number you can fit in a given area roughly doubles with each process generation, about every two years. With hundreds of millions of transistors being used now, three process generations down the road we will have literally billions of transistors available. CPUs are already extremely complex - you have several hundred engineers working for several years (if not longer) to create one. If you increase the transistor count by 8-16x, there are only so many engineers you can throw at the problem (and a team of 1000 is vastly less efficient/productive than the typical 200-person teams of today, due to communication and logistical issues). So either you come up with an automated way to do CPU design (people have been trying to do this for practically as long as CPUs have existed, largely unsuccessfully), you spend the extra transistors on a really big cache (this is already being done in current CPUs - the HP PA-RISC 8700 and 8800, and Intel's Itanium 2), or you find a way to duplicate a core and thus hopefully nearly double performance while cutting the design complexity nearly in half (you only have to create one core and then duplicate it).
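Just to put rough numbers on the "billions of transistors" point, here is the back-of-the-envelope version (the 200-million starting figure is just an assumed round number for a current high-end die, not any specific product):

/* Back-of-the-envelope only: if the transistor budget doubles each
 * process generation (~2 years), three generations ahead gives 8x,
 * four gives 16x.  The starting count is an assumption for illustration. */
#include <stdio.h>

int main(void)
{
    long long transistors = 200000000LL;   /* assumed current budget */
    for (int gen = 0; gen <= 4; gen++) {
        printf("generation +%d (~%d years out): %lld transistors\n",
               gen, gen * 2, transistors);
        transistors *= 2;
    }
    return 0;
}

That 8-16x jump is exactly the range where "just make the one core bigger" stops being a practical design strategy.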
The speed boost will depend entirely on the software, the application, the compiler, the available external bandwidth, the internal bandwidth, the microarchitecture, the OS, etc. It could theoretically be almost exactly double the performance of an individual CPU - but that is obviously the best case. More realistically you could expect to see a 60-80% performance improvement in applications that are compiled to take advantage of this. In the worst case, you'd be barely better off than having only one core. The issues of power and cooling are really the exact same problem, and the solution is to make sure that each core doesn't use too much power: you set a limit for the entire die, and then you make sure that no core draws so much power that, all together, they exceed that limit.
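If you want a feel for where numbers like 60-80% come from, Amdahl's law is the usual back-of-the-envelope tool (my choice of illustration here, not something specific to any particular CPU): the speedup from a second core is limited by whatever fraction of the program can't run in parallel.

/* Rough illustration of why dual-core speedup depends on the software:
 * Amdahl's law with two cores.  'p' is the fraction of the program that
 * can actually run in parallel -- a property of the application and the
 * compiler, not of the hardware.  The fractions below are made up. */
#include <stdio.h>

int main(void)
{
    const int cores = 2;
    const double parallel_fraction[] = { 1.00, 0.90, 0.75, 0.50, 0.10 };

    for (int i = 0; i < 5; i++) {
        double p = parallel_fraction[i];
        double speedup = 1.0 / ((1.0 - p) + p / cores);
        printf("parallel fraction %3.0f%% -> speedup %.2fx\n",
               p * 100.0, speedup);
    }
    return 0;
}

With roughly 75-90% of the work parallelized you land right in that 60-80% range; with very little parallel work the second core barely helps, which is the worst case I mentioned above.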
As far as articles... there aren't that many. Most of them are fairly technical. I can suggest a couple of research papers, and some implementation papers. But for a basic overview... pretty much what I wrote above is a good start.
Patrick Mahoney
Microprocessor Design Engineer
Intel Corp.
Fort Collins, CO