You may be right, but we're likely to hear rumors about a 32- or 64-bit RDRAM-supporting chipset at least 6-12 months before it's released.
Regarding your original question, I don't think PC1066 will be what makes or breaks Northwood's performance...RDRAM is said to be 85% bandwidth efficient, so the i850's 2.7GB/sec of effective bandwidth (85% of its 3.2GB/sec dual-channel peak) shouldn't be stressed for a while, at least until the P4 ramps up beyond 3GHz (unless there are undisclosed features in Northwood that use more memory bandwidth *cough* SMT *cough*).
As for access time, there won't be much of a difference as far as the CPU is concerned. RDRAM reads two packets of 8 words at a time, so assuming:
1. it takes two FSB cycles for the memory controller to latch the address and control signals, and decode the control information.
2. RDRAM has a 20ns page access latency (according to Rambus)
3. on a medium-loaded RDRAM channel, the target chip is 1 foot from the memory controller (4ns round-trip electrical signal propagation)
4. after the data is parallelized in the memory controller, it takes one FSB cycle for the data to reach the CPU.
Thus, for PC800 on a 100MHz FSB, an estimate of the access time is:
10ns * 2 + 1.25ns * 8 * 2 + 20ns + 1.25ns * 8 + 4ns + 10ns = 84ns
For PC1066 on a 100MHz FSB, the access time is:
10ns * 2 + .9375ns * 8 * 2 + 20ns + .9375ns * 8 + 4ns + 10ns = 76.5ns
For PC1066 on a 133MHz FSB, the access time is:
7.5ns * 2 + .9375ns * 8 * 2 + 20ns + .9375ns * 8 + 4ns + 7.5ns = 69ns
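Here's the same arithmetic as throwaway Python, if anyone wants to plug in their own numbers (the function and parameter names are just mine; all the figures come from the assumptions above):

```python
# Sketch of the access-time estimate above, from the numbered assumptions:
# 2 FSB cycles in, 20ns page access, 4ns round-trip flight, 1 FSB cycle out.

def rdram_access_ns(fsb_cycle_ns, transfer_ns):
    return (2 * fsb_cycle_ns        # latch address/control and decode
            + 2 * 8 * transfer_ns   # two 8-word packets on the channel
            + 20.0                  # RDRAM page access latency (per Rambus)
            + 8 * transfer_ns       # one more 8-word packet on the channel
            + 4.0                   # round-trip signal propagation
            + 1 * fsb_cycle_ns)     # parallelized data to the CPU

print(rdram_access_ns(10.0, 1.25))    # PC800  @ 100MHz FSB -> 84.0ns
print(rdram_access_ns(10.0, 0.9375))  # PC1066 @ 100MHz FSB -> 76.5ns
print(rdram_access_ns(7.5, 0.9375))   # PC1066 @ 133MHz FSB -> 69.0ns
```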
For the P4 (Willamette), the L1 cache has a 2-cycle latency and an average 90% hit rate, and the L2 cache has a 7-cycle latency and 90% hit rate (so 9% of all memory accesses go to L2). Assuming a 2GHz P4, the average memory access time (not including FP data, since it goes directly to L2) is:
For PC800 on a 100MHz FSB:
.9 * 2 + .09 * 7 + .01 * 168 = 4.11 cycles
For PC1066 on a 100MHz FSB:
.9 * 2 + .09 * 7 + .01 * 153 = 3.96 cycles (3.6% improvement)
For PC1066 on a 133MHz FSB:
.9 * 2 + .09 * 7 + .01 * 138 = 3.81 cycles (7.3% improvement)
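Same deal in code, assuming the 2GHz clock (0.5ns per cycle) and the hit rates above:

```python
# Average memory access time in CPU cycles: 90% L1 hits at 2 cycles,
# 9% L2 hits at 7 cycles, 1% going out to main memory.

def amat_cycles(mem_ns, clock_ghz=2.0):
    mem_cycles = mem_ns * clock_ghz    # e.g. 84ns -> 168 cycles at 2GHz
    return 0.90 * 2 + 0.09 * 7 + 0.01 * mem_cycles

base = amat_cycles(84.0)               # PC800  @ 100MHz FSB -> 4.11
for label, ns in [("PC1066 @ 100MHz", 76.5), ("PC1066 @ 133MHz", 69.0)]:
    c = amat_cycles(ns)
    print(f"{label}: {c:.2f} cycles ({(base - c) / base:.1%} better)")
```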
This doesn't necessarily translate to a 7.3% improvement in performance, since the cache is pipelined and dual-ported, and speculative pre-fetch plays an important role.
I guess I'm kind of cynical about memory technology...as long as you have enough bandwidth, it doesn't play much of a role, since 99% of accesses go to the cache. Intel and AMD are good at staying on top of the bandwidth their processors require...when they introduce a new step in memory technology, it's usually to ensure enough bandwidth for future processors, not to dramatically increase performance for what they currently have out. IMHO, the Holy Grail for x86 performance is branch prediction...mispredicted branches account for only around 1.5% of x86 instructions in programs, but contribute to around a 30% loss in performance. The P4 has a good branch predictor (94% accuracy, compared to the P3's 90%, the Athlon's 92%, and the K6's impressive 95%), but its 19-cycle branch misprediction penalty hurts performance.
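For a rough sanity check on those numbers: the ~20% dynamic branch frequency and base CPI of 1.0 below are assumptions of mine, not figures from anywhere authoritative, but the penalty and accuracies are the ones quoted above:

```python
# Back-of-the-envelope cost of branch mispredicts. Branch frequency (~20%)
# and base CPI (1.0) are assumed; penalty and accuracy are the P4 figures.

def fraction_lost(branch_freq, accuracy, penalty_cycles, base_cpi=1.0):
    miss_per_insn = branch_freq * (1 - accuracy)  # mispredicts per instruction
    extra_cpi = miss_per_insn * penalty_cycles    # stall cycles per instruction
    return extra_cpi / (base_cpi + extra_cpi)     # share of runtime wasted

# 20% branches * 6% miss rate = 1.2% of instructions mispredicted;
# 0.012 * 19 = 0.228 extra CPI -> ~19% of runtime lost to mispredicts.
print(fraction_lost(0.20, 0.94, 19))
```

Plugging in a mispredict on 1.5% of all instructions (the figure quoted above) gives 0.015 * 19 = 0.285 extra CPI, which is roughly the 30% hit relative to a base CPI of 1.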
If there are any mistakes in my math, point them out (there probably are).
BTW, can you guess who is at work and really bored?
