For a given system that never evolves, there is a cache-size sweet spot that you don't want to stray too far from.
What we have in real life is ballooning hardware and a ballooning library of instructions to run on it. On top of that, the market has more or less decided it will be increasingly unkind to 150+ watt CPUs, which pins us to roughly the 2.0-4.0 GHz operating range for now. With frequency held roughly constant, RAM latency is going to get worse at a faster rate than cache latency.
Now the increasing size and usefulness of the IGP, as well as the increasing instruction-level intimacy between x64 and GPGPU, will place greater demands on the cache system. If you think 30 cycles at 3 GHz is bad, then what about 30 cycles at 700 MHz? I don't even know what speed the IGP or the cache actually runs at; I'm just spitballing to illustrate.
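To put rough numbers on that spitball, here's the cycles-to-nanoseconds conversion; neither clock below is a real core, IGP, or cache frequency, just the illustrative guesses from above:

```python
# Rough latency arithmetic: the same cycle count costs very different
# amounts of wall-clock time depending on which clock it is counted in.
# 3 GHz and 700 MHz are just the illustrative guesses from the post,
# not real core/IGP/cache clocks.

def cycles_to_ns(cycles, freq_ghz):
    """Convert a latency in clock cycles to nanoseconds."""
    return cycles / freq_ghz

for freq_ghz in (3.0, 0.7):
    print(f"30 cycles at {freq_ghz:.1f} GHz = {cycles_to_ns(30, freq_ghz):.1f} ns")

# Output:
# 30 cycles at 3.0 GHz = 10.0 ns
# 30 cycles at 0.7 GHz = 42.9 ns
```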
Intel hasn't had to make any production decisions on this yet, but IBM has. What would you do in 2005 if you had to share a nice CPU and GPU with a horrible memory system? Would you rather give the GPU a 64-bit GDDR3 sideport, or would you wedge in the largest eDRAM you could possibly fit?
Consumer Haswell systems will probably continue to use 128-bit DDR3-1333/1600 and so on, but some of the higher-end parts will be 8-threaded CPUs with a larger, more evolved version of the HD 4000. So Intel is definitely between a rock and a hard place here, and the only elbow room left is the presumed maturity and density of its 22 nm process, which could enable 24 MB or larger last-level caches. Not a bad stopgap until faster, wider, and cheaper memory systems come to town.
Can someone compute the Romley cache area and extrapolate it to 22 or 16 nm? What about higher-density 1T-SRAM-Q or eDRAM?
edit: Romley's 15 MB L3 cache needs about 108 mm². 1T-SRAM-Q could fit 15 MB into 17 mm² (on 45 nm SOI).
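Taking a stab at my own question with naive ideal scaling: this is just a sketch, assuming area shrinks with the square of the feature size (which real SRAM arrays never quite achieve) and that the 108 mm² figure above is for the 32 nm node:

```python
# Back-of-the-envelope cache area scaling. Assumes ideal scaling
# (area shrinks with the square of the feature size), which real SRAM
# arrays never quite hit, and takes the ~108 mm^2 / 15 MB figure above
# for the 32 nm Romley L3 as the starting point.

base_area_mm2 = 108.0   # ~15 MB L3 at 32 nm
base_node_nm = 32.0
base_mb = 15.0

for node_nm in (22.0, 16.0):
    area = base_area_mm2 * (node_nm / base_node_nm) ** 2
    mm2_per_mb = area / base_mb
    print(f"{node_nm:.0f} nm: ~{area:.0f} mm^2 for 15 MB "
          f"(~{mm2_per_mb:.1f} mm^2/MB, so 24 MB would be ~{24 * mm2_per_mb:.0f} mm^2)")

# Output:
# 22 nm: ~51 mm^2 for 15 MB (~3.4 mm^2/MB, so 24 MB would be ~82 mm^2)
# 16 nm: ~27 mm^2 for 15 MB (~1.8 mm^2/MB, so 24 MB would be ~43 mm^2)
```

Even if the real shrink falls well short of ideal, that makes a 24 MB last-level cache look affordable on a big 22 nm die, which is basically the stopgap argument above.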