Current i7s have 2MB L3 per core, right?
Yes, 8 MB of L3, shared among all cores. The private per-core cache is the L2, which is 256 KB per core.
As for performance, more cache is great, but sometimes a smaller cache that is faster can be better. There's a sweet spot somewhere. If I remember right, the L3 of the hex-core i7 is a little larger than that of the quads, but in some cases performance actually went down a bit due to increased latency (a bigger but slower cache).
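To put rough numbers on that trade-off, here is a minimal sketch using the textbook average memory access time formula (hit time plus miss rate times miss penalty). Every figure in it (the cycle counts, the miss rates, the `MEMORY_CYCLES` constant) is made up for illustration and is not a measurement of any real i7.

```python
def amat(hit_cycles, miss_rate, miss_penalty_cycles):
    """Average memory access time: hit time + miss rate * miss penalty."""
    return hit_cycles + miss_rate * miss_penalty_cycles


# Hypothetical cost of going all the way to main memory, in cycles.
MEMORY_CYCLES = 200

# Workload A: already fits fairly well, so a bigger cache barely helps.
# Assumed: the larger L3 costs 8 extra cycles per hit and trims misses
# only from 10% to 8%.
small_fast_a = amat(hit_cycles=36, miss_rate=0.10, miss_penalty_cycles=MEMORY_CYCLES)
big_slow_a = amat(hit_cycles=44, miss_rate=0.08, miss_penalty_cycles=MEMORY_CYCLES)

# Workload B: cache-hungry, so the bigger cache halves the miss rate.
small_fast_b = amat(hit_cycles=36, miss_rate=0.30, miss_penalty_cycles=MEMORY_CYCLES)
big_slow_b = amat(hit_cycles=44, miss_rate=0.15, miss_penalty_cycles=MEMORY_CYCLES)

print("Workload A:", small_fast_a, "vs", big_slow_a)
print("Workload B:", small_fast_b, "vs", big_slow_b)
```

With these invented numbers the bigger cache loses on workload A (about 60 cycles against 56) and wins clearly on workload B (about 74 against 96), which is the "sweet spot depends on the workload" point in miniature.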
Having no L3 cache at all shows even worse drops, as seen in Phenom II vs Athlon II comparisons. It depends on the workload, but I think AT found in their testing that, clock for clock, the Phenom IIs perform somewhere around 10% better in general. Sometimes it is less (3-4%), sometimes it is more significant (especially in games).
And the Celerons, if I understood them right, were crippled Pentiums, and this crippling was mostly a halving of the cache.
All of this points to more cache always being better (assuming latency isn't sacrificed disproportionately to get it), since it saves trips to main memory. I suppose that was the idea behind their server chips. Among other things, perhaps a full 24 MB of L3 was impossible to implement without latency going up enough to make the extra capacity useless (no real performance gain), so they settled on 18 MB.
I seem to remember Anand quoting an engineer at Intel (Ronak?) who was not actually 100% satisfied with Nehalem's cache and thought it should be larger. I am not sure, as this was just one detail from an article long ago, and I may be wrong. I also seem to remember from that same article that, according to the same engineer, Nehalem's cache is almost (or actually is) the bare minimum the chip needs to perform acceptably.