After reconsidering the need for paired memory, these are the likely possibilities:
256bit = High end:
512MB/256bit (8x64 DDR/375-425MHz - most practical 512MB 'Extreme Edition' solution)
256MB/256bit (4x64 DDR2/400-500MHz - most practical 256MB/256bit solution)
128bit = Mid-Range:
256MB/128bit (4x64 DDR/375-425MHz)
128MB/128bit (2x64 DDR2/400-500MHz - most practical 128MB/128bit solution)
64bit = Low end (all solutions would still necessitate the use of on-mainboard expansion slots):
128MB/64bit (2x64 DDR/375-425MHz)
64MB/64bit (2x32 DDR/325-375MHz)
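For reference, here's a rough sketch in C of what those bus widths work out to in peak bandwidth. The clock values are just midpoints of the ranges above, and it assumes double data rate (two transfers per clock); treat it as back-of-the-envelope, not spec numbers.

#include <stdio.h>

/* Peak-bandwidth sketch for the configurations above. Assumes DDR
 * (two transfers per clock); clocks are midpoints of the listed ranges. */
static double peak_gb_per_s(int bus_width_bits, double clock_mhz)
{
    double bytes_per_transfer = bus_width_bits / 8.0;
    double transfers_per_sec  = clock_mhz * 1e6 * 2.0;
    return bytes_per_transfer * transfers_per_sec / 1e9;
}

int main(void)
{
    printf("256bit @ 400MHz DDR: %.1f GB/s\n", peak_gb_per_s(256, 400.0)); /* 25.6 */
    printf("128bit @ 400MHz DDR: %.1f GB/s\n", peak_gb_per_s(128, 400.0)); /* 12.8 */
    printf(" 64bit @ 400MHz DDR: %.1f GB/s\n", peak_gb_per_s(64, 400.0));  /*  6.4 */
    return 0;
}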
By placing the memory controller on-die and mounting the memory physically next to the processor, you minimize the pathway - the most direct path possible - from controller to memory bank. Don't forget that high-end 500MHz GDDR can be down to one-fourth the latency of current 166MHz desktop memory! The GPU functions require a lot of memory bandwidth, so what you said about latency being king is not entirely true in all cases, especially for a design with an integrated GPU. Plus, by sharing the memory across architectures we simplify the system, eliminating costly duplication of components - which in this case includes processor functions and physical memory.
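To put a rough number on the latency point, here's a quick sketch; the CAS figures are my own assumptions for illustration, not datasheet values. At the same cycle count, access time scales inversely with clock, which is where the several-fold gap between 500MHz GDDR and 166MHz desktop DDR comes from.

#include <stdio.h>

/* CAS latency converted to wall-clock time. The cycle counts are
 * illustrative assumptions, not datasheet figures. */
static double cas_ns(double cas_cycles, double clock_mhz)
{
    return cas_cycles / clock_mhz * 1000.0; /* cycles / MHz = microseconds; x1000 = ns */
}

int main(void)
{
    printf("166MHz desktop DDR, CL2.5: %.1f ns\n", cas_ns(2.5, 166.0)); /* ~15 ns */
    printf("500MHz GDDR,        CL2.5: %.1f ns\n", cas_ns(2.5, 500.0)); /* ~5 ns  */
    return 0;
}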
I like to think of this design as more of a brain than just an MPU, where the CPU is like a right brain and the GPU is like the left brain. Or do I have that backwards?

The memory controller would be the medulla oblongata, routing data traffic to the proper location. The raw memory bandwidth would be devoured by the GPU functions, and the high clock speeds would enable low latency for the CPU functionality. Common data no longer has to travel an independent bus or port between the components, because the two would be practically built on top of each other and the transfer time between MPU halves would be negligible. In the end, both sides of the equation benefit.
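To make that 'medulla' a bit more concrete, here's a toy arbiter sketch. It is purely conceptual - the policy, names, and threshold are invented for illustration, not any real controller design - but it shows the idea of one controller serving a latency-sensitive CPU client and a bandwidth-hungry GPU client from the same memory.

#include <stdbool.h>
#include <stdio.h>

/* Toy model of a shared on-die memory controller arbitrating between a
 * latency-sensitive CPU client and a bandwidth-hungry GPU client.
 * Conceptual only; the policy here is made up for illustration. */
typedef enum { GRANT_NONE, GRANT_CPU, GRANT_GPU } grant_t;

typedef struct {
    bool cpu_pending;  /* CPU has an outstanding request */
    bool gpu_pending;  /* GPU has an outstanding request */
    int  gpu_wait;     /* cycles the GPU request has been held off */
} arbiter_t;

static grant_t arbitrate(arbiter_t *a)
{
    /* If the GPU has been held off long enough to hurt its streaming
     * bandwidth, let it through; otherwise favour CPU latency. */
    if (a->gpu_pending && a->gpu_wait >= 4) {
        a->gpu_wait = 0;
        return GRANT_GPU;
    }
    if (a->cpu_pending) {
        if (a->gpu_pending)
            a->gpu_wait++;  /* GPU waits while the CPU goes first */
        return GRANT_CPU;
    }
    if (a->gpu_pending) {
        a->gpu_wait = 0;
        return GRANT_GPU;
    }
    return GRANT_NONE;
}

int main(void)
{
    arbiter_t a = { .cpu_pending = true, .gpu_pending = true, .gpu_wait = 0 };
    for (int cycle = 0; cycle < 8; cycle++) {
        grant_t g = arbitrate(&a);
        printf("cycle %d: %s\n", cycle,
               g == GRANT_CPU ? "CPU" : g == GRANT_GPU ? "GPU" : "idle");
    }
    return 0;
}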
The ALU and FPU in current processors are basically hamstrung by the front-side bus. They can never achieve broad parallelization because of the lack of raw bandwidth feeding them, meaning there is little point in further beefing up those units while the front-side buses are so hampered. In reality, both AMD and Intel have moved away from strengthening the raw FPU for this exact reason, instead moving as much FPU functionality as possible over to SIMD structures. Intel builds their ALU functions off their SSE2 units, which takes this evolution one step further toward replacing the traditional ALU in their design. (Amazingly, though, it's rumoured that Intel is actually moving back away from the SSE2 unit for Prescott's ALU units.) SSE3 may be yet another step for Intel toward making the traditional FPU a secondary function in future processors, but we don't really have enough information out there yet to know. ALUs are not nearly as complex as many of the other components of the processor, and they are one of the components with a lot of headroom for gains in raw clock speed without moving to longer pipelines. In the current Intel and AMD designs, the ALUs have a substantially higher IPC than the FPU units. Heck, the P4's ALUs are double-pumped even on the highest-end models and are still relatively weak in comparison to specialized processors that key in on these functions.
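As a concrete illustration of FPU work migrating to SIMD, here's a minimal example, assuming a compiler with SSE2 intrinsics (emmintrin.h); the arrays and loop are just made-up sample data. It does the same double-precision add the traditional one-at-a-time way and then as packed SSE2, two doubles per instruction.

#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stdio.h>

#define N 8

int main(void)
{
    double a[N], b[N], scalar[N], simd[N];
    for (int i = 0; i < N; i++) { a[i] = i; b[i] = 10.0 * i; }

    /* Traditional FPU path: one double-precision add at a time. */
    for (int i = 0; i < N; i++)
        scalar[i] = a[i] + b[i];

    /* SSE2 path: two packed double-precision adds per instruction. */
    for (int i = 0; i < N; i += 2) {
        __m128d va = _mm_loadu_pd(&a[i]);
        __m128d vb = _mm_loadu_pd(&b[i]);
        _mm_storeu_pd(&simd[i], _mm_add_pd(va, vb));
    }

    for (int i = 0; i < N; i++)
        printf("%g %g\n", scalar[i], simd[i]);
    return 0;
}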
I'm willing to bet that these design issues were all tackled years ago. They may not have been at the scale I've laid out, but the same issues you've brought up were likely addressed when Intel worked on SOC technology. In my opinion, memory speeds and raw processor clock speeds have made this technology very feasible now.