Anand was likely referring to Northwood, the 0.13um P4...it is rumored to have 512KB of L2 cache, and as production ramps up, it will become more mainstream than Willamette has been.
As far as performance improvements from P4-aware compilers go, don't expect miracles...a 10%, maybe up to 20%, improvement for general applications is likely, mostly from re-ordering code to suit the P4's pipeline. SIMD instructions (SSE2) can yield large performance increases when inner loops are hand-optimized, but no one has the time or money to do that for commercial applications. Compilers with SIMD support aren't very good at extracting vector parallelism from code, so most real-world apps gain 5-10% at best. The case of the K6-2 was a bit different...its FP unit wasn't pipelined, so it performed horribly in code that was optimized for the P3's pipelined FPU. The benchmarks that showed a huge boost from 3DNow were mostly hand-assembled, so they weren't a very fair indication of SIMD's usefulness for the K6-2.
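One rough way to see why a big SIMD speedup in a few loops turns into a small gain overall is Amdahl's law. The sketch below is only an illustration: the function name and every fraction and speedup in it are made-up assumptions, not measurements of any real compiler or application.

```python
def overall_speedup(vector_fraction, simd_speedup):
    """Amdahl's law: whole-program speedup when only part of the
    run time (vector_fraction) is accelerated by simd_speedup."""
    return 1.0 / ((1.0 - vector_fraction) + vector_fraction / simd_speedup)


def percent_gain(speedup):
    """Convert a speedup factor (e.g. 1.05) into a percent improvement."""
    return (speedup - 1.0) * 100.0


# Hypothetical case 1: the compiler only manages to vectorize loops that
# account for 10% of run time, and makes those loops 2x faster.
compiler_case = overall_speedup(0.10, 2.0)

# Hypothetical case 2: hand-tuned inner loops cover 60% of run time
# and run 4x faster.
hand_tuned_case = overall_speedup(0.60, 4.0)

print(f"compiler-vectorized: {percent_gain(compiler_case):.1f}% faster")
print(f"hand-tuned loops:    {percent_gain(hand_tuned_case):.1f}% faster")
```

With those assumed numbers, the compiler case lands in the single-digit range while the hand-tuned case is far larger, which is the same pattern described above.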
But if you're really concerned about gaming performance, remember that the bottleneck is the video card, not the CPU. Benchmarks may show a big difference between the fastest available CPU and one that is a year or more old, but those benchmarks are taken at 640x480 in 16-bit color. Bump the resolution and color depth up to 1024x768 in 32-bit color or beyond, and the video card's fillrate and memory bandwidth become the bottleneck, so a 1GHz P3 might only be 10-15% slower than a 1.8GHz P4.
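The bottleneck argument can be sketched with a very simple model in which the CPU and video card work in parallel and the slower of the two sets the frame rate. Everything here is an assumption for illustration: the function name, the perfect-overlap model, and all the per-frame times are invented, not benchmark results.

```python
def frames_per_second(cpu_ms, gpu_ms):
    """Frame rate when CPU and GPU work overlap completely:
    the slower stage determines the time per frame."""
    return 1000.0 / max(cpu_ms, gpu_ms)


# Hypothetical per-frame costs in milliseconds.
OLD_CPU_MS = 10.0   # stand-in for an older CPU
NEW_CPU_MS = 5.0    # stand-in for a CPU twice as fast
GPU_LOW_RES_MS = 3.0    # video card at 640x480, 16-bit color
GPU_HIGH_RES_MS = 9.0   # same card at 1024x768, 32-bit color

for label, gpu_ms in (("low res", GPU_LOW_RES_MS), ("high res", GPU_HIGH_RES_MS)):
    old_fps = frames_per_second(OLD_CPU_MS, gpu_ms)
    new_fps = frames_per_second(NEW_CPU_MS, gpu_ms)
    deficit = (new_fps - old_fps) / new_fps * 100.0
    print(f"{label}: old CPU {old_fps:.0f} fps, new CPU {new_fps:.0f} fps, "
          f"old CPU is {deficit:.0f}% slower")
```

With these made-up numbers the older CPU is 50% slower at low resolution but only about 10% slower at high resolution, because the video card's per-frame time has grown to nearly match the older CPU's.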