I still don't understand how an L3 cache at 3GHz isn't going to outperform 200MHz system memory. It just seems that people want to throw an FSB roadblock in the way when it's unnecessary to think in those terms. The trace length to main memory is what basically limits current memory to the sub-250MHz front-side bus, right? HT is able to traverse the same distance at 800MHz, though, which means the 250MHz limitation is more or less an interface-specific limitation.
Like someone said before, HT probably isn't the way to go. A 16-bit HT link provides 6.4GB/sec of theoretical bandwidth (counting both directions). A 32-bit HT link provides 12.8GB/sec of theoretical bandwidth. The problem is that the latency in translating the cache addresses and memory packets across a packet-based link is prohibitive. By the same token, we see $80 graphics cards with 256-bit memory buses in the same theoretical bandwidth ranges, but in their case using memory, not caches. So perhaps we don't want an L3, but rather NUMA - non-uniform memory access - on board the CPU sockets.
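For anyone who wants to check the HT numbers, here is a rough sketch of the arithmetic. It assumes an 800MHz link clock with two transfers per clock (DDR signaling) and counts both directions of the link, which is how the 6.4GB/sec figure is usually quoted. The function name and parameters are made up for illustration.

```python
def peak_bandwidth_gb_s(width_bits, clock_mhz, transfers_per_clock=2, directions=1):
    """Theoretical peak bandwidth in GB/sec (1 GB = 1e9 bytes).

    Assumption: the link moves width_bits on every transfer and never stalls,
    so this is an upper bound, not a measured figure.
    """
    bytes_per_transfer = width_bits / 8
    transfers_per_sec = clock_mhz * 1e6 * transfers_per_clock
    return bytes_per_transfer * transfers_per_sec * directions / 1e9


# Assumed: 800MHz HT clock, double-pumped, both directions counted.
ht16 = peak_bandwidth_gb_s(16, 800, directions=2)
ht32 = peak_bandwidth_gb_s(32, 800, directions=2)

print(f"16-bit HT: {ht16:.1f} GB/sec")  # 6.4
print(f"32-bit HT: {ht32:.1f} GB/sec")  # 12.8
```

Note that this says nothing about latency, which is the real objection to putting a cache on the far side of the link.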
So instead of an L3 cache, too bad they cannot make interchangeable CPU sockets with integrated stage-1 NUMA memory. The stage-2 NUMA memory (system memory) would still be there, running much slower. We already see memory speeds in the 500MHz+ range on the top-end cards, so mating the memory to a 256-bit controller should be a cinch. In previous threads, someone already said the OS needs to be NUMA-aware, too, else the raw horsepower goes largely wasted. I can just see it now: in a few years people will be slaving a mediocre 256MB of 600MHz DDR RAM to their stage-1 memory add-in card and putting another 2-3GB of plain ole PC2700 memory into their NUMA-aware Windows XP 7.x system... because you know by then they just might need it. And I might say, with this architecture the average customer probably wouldn't ever mind the performance of integrated graphics anymore.
CPU (internal)
...L1 cache <----Fastest
...L2 cache <----Current 1MB L2 caches would still suffice for most consumer apps
Socket (external)
...Stage-1 memory (integrated into a CPU socket) <----Also could be used to run integrated graphics
...Stage-2 memory (expansion slots on motherboard) <----Slowest
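
To put rough numbers on the stage-1 vs. stage-2 gap, here is a back-of-the-envelope sketch. The assumptions are mine: stage-1 memory sits on a 256-bit bus, and "500MHz" is read two ways (as the effective transfer rate, or as the clock of DDR memory, i.e. 1000 MT/sec); stage-2 is single-channel PC2700 on a 64-bit bus at about 333 MT/sec. These are theoretical peaks only and ignore latency.

```python
def peak_gb_s(width_bits, mega_transfers_per_sec):
    """Theoretical peak bandwidth in GB/sec (1 GB = 1e9 bytes)."""
    return (width_bits / 8) * mega_transfers_per_sec * 1e6 / 1e9


# Hypothetical stage-1 memory: 256-bit bus (assumed, as on graphics cards).
stage1_low = peak_gb_s(256, 500)     # "500MHz" taken as the effective rate
stage1_high = peak_gb_s(256, 1000)   # "500MHz" taken as the DDR clock

# Stage-2 memory: single-channel PC2700, 64-bit bus, ~333 MT/sec.
stage2 = peak_gb_s(64, 333)

print(f"stage-1: {stage1_low:.1f} to {stage1_high:.1f} GB/sec")
print(f"stage-2: {stage2:.2f} GB/sec")
print(f"ratio:   {stage1_low / stage2:.1f}x to {stage1_high / stage2:.1f}x")
```

Even on the conservative reading, the stage-1 pool would have several times the peak bandwidth of the system memory behind it, which is the whole point of splitting the two.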