<< Where's pm or Sohcan when we need them? >>
Uh oh, I smell another monster post coming... 🙂
I always find it amusing when people get up in arms about RDRAM's "horrible" latency....coming from the perspective of computer architecture, all DRAM, whether it's 50ns, 100ns, or 150ns per read request, has "horrible" latency as seen by the CPU.
Comparing latencies of RDRAM to SDRAM gets sketchy, since there are so many variables....do you include the FSB latencies? Memory controller arbitration? Latency to the critical first word, or for the entire read request? While RDRAM's latency to the critical first word is significantly worse than SDRAM's (since RDRAM has to wait for both 16 byte packets to be parallelized before they can be sent across the FSB, whereas SDRAM can immediately send the first 8 byte word and continue bursting), the time to complete an entire 32 byte RDRAM read request is only about 10-20% longer than an eight word, 64 byte DDR SDRAM burst (the difference in read request sizes is not a big issue with pipelined memory requests). This is due to the large amount of time spent on memory controller arbitration and FSB latency. Before the read request is committed, the memory controller has to latch the address and control information (1 cycle) and spend at least 2 cycles arbitrating the request against DMA accesses and determining whether the request targets memory or IO. Then there's another cycle to cross the memory controller/FSB boundary, then any number of cycles to transfer the data.
It is true, however, that a decrease in memory bus cycle time has a much greater effect on RDRAM's latency than on DDR SDRAM's. This is simply because, due to its serial nature, memory bus cycle time makes up a larger portion of the entire read latency for RDRAM than for DDR SDRAM. Excluding bus time and memory controller arbitration, SDRAM access time is almost entirely dependent on page request times, i.e. the CAS/RAS/Precharge ratings for SDRAM. DRAM is divided into banks, and the memory controller can keep a certain number of banks "open" at a time. Within each bank is a sense amp that stores an open row, or "page" (not to be confused with virtual paging). A read request to an open bank and an open page is a page hit, requiring only the CAS latency. A read request to a closed bank requires the RAS + CAS latencies. A read request to a bank that is open but not on the currently open page requires the sense amps to be flushed, thus requiring the Precharge + RAS + CAS latencies. A 2/2/2 rating for PC266 SDRAM means that the respective latencies are 15ns, 30ns, and 45ns....I've heard that the respective frequencies of occurrence are around 50%, 40%, and 10% of the time.
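If you want to play with those numbers, here's a quick Python sketch of the expected page-access latency....keep in mind the 15/30/45ns latencies and the 50/40/10% mix are just the figures quoted above, not measurements:

```python
# Rough expected page-access latency for 2/2/2 PC266 SDRAM,
# using the hit-rate mix quoted above (an assumption, not data).
latencies = {"page hit (CAS)": 15.0,            # ns
             "closed bank (RAS+CAS)": 30.0,
             "page miss (Pre+RAS+CAS)": 45.0}
mix = {"page hit (CAS)": 0.5,
       "closed bank (RAS+CAS)": 0.4,
       "page miss (Pre+RAS+CAS)": 0.1}

expected = sum(latencies[k] * mix[k] for k in latencies)
print(f"expected page latency: {expected:.1f} ns")  # 24.0 ns
```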
RDRAM's read latency is not only dependent on the page latencies, but also on its serial nature and electrical "time-of-flight" (since the bus is physically long).
I'll take a quick stab at estimating the latencies of DDR SDRAM and RDRAM for their full read requests (8 x 8 bytes and 2 x 16 bytes respectively). I'll use a 133MHz P4 bus, and assume that the CAS/RAS/Precharge latencies are 15ns/15ns/15ns for both DDR SDRAM and RDRAM, and their frequencies are 50%, 40%, and 10%. Also, I'll assume that the bus and DRAM latencies are unchanged to specifically measure the impact of a reduction in DRAM cycle time.
PC266 DDR SDRAM: 7.5ns cycle time
Here the bus and DRAM cycle times are fortunately synchronous at 7.5ns. The 8 word burst takes 4 DRAM cycles to transfer. Given 2 cycles to latch and decode address and control, and 1 cycle delay to send data back across the FSB:
Latency (CAS) = 7.5ns * 2 + 15ns + 7.5ns + 7.5ns * 4 = 67.5ns
Latency (CAS + RAS) = 7.5ns * 2 + 30ns + 7.5ns + 7.5ns * 4 = 82.5ns
Latency (CAS + RAS + Precharge) = 7.5ns * 2 + 45ns + 7.5ns + 7.5ns * 4 = 97.5ns
Average latency = 76.5ns
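For anyone who wants to check my arithmetic, the PC266 numbers fall right out of a little Python sketch (the cycle counts are the assumptions from my model above, nothing more):

```python
def ddr_latency(page_ns, fsb_cycle=7.5, dram_cycle=7.5, burst_cycles=4):
    # 2 FSB cycles to latch/decode, the page access, 1 cycle back
    # across the FSB, then the 8 word burst (4 DRAM cycles at DDR).
    return 2 * fsb_cycle + page_ns + fsb_cycle + burst_cycles * dram_cycle

# (page latency in ns, assumed frequency of occurrence)
cases = [(15.0, 0.5), (30.0, 0.4), (45.0, 0.1)]
for page_ns, _ in cases:
    print(f"page latency {page_ns:4.1f} ns -> total {ddr_latency(page_ns):.1f} ns")
avg = sum(ddr_latency(p) * f for p, f in cases)
print(f"average: {avg:.1f} ns")  # 76.5 ns
```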
PC800 RDRAM: 2.5ns cycle time
Here it gets tricky. The 400MHz RDRAM bus is not synchronous with the FSB; at the start of the read request, the buses are on average 1/2 a cycle out of phase = 1.25ns. The memory controller then sends two 8 word command packets down the bus, each taking 4 cycles to transfer. One page read latency after the end of the second command packet, the second data packet is transferred and parallelized. In addition, there is the round trip time-of-flight latency of around 2.5ns each way. At this point the two data packets can be transferred back once the two buses have become synchronized again (this would make a lot more sense if I could show the diagram I just drew 🙂). It takes one FSB cycle (after the one cycle delay through the memory controller) to transfer the two 16 byte packets.
Latency (CAS) = 7.5ns * 2 + 1.25ns + 2.5ns + 4 * 2.5ns * 2 + 15ns + 2.5ns + 4 * 2.5ns = 66.25ns => 66.25ns / 7.5ns (round up) = 9 cycles => 9 cycles * 7.5 ns/cycle + 7.5ns + 7.5ns = 82.5ns
Latency (CAS + RAS) = 7.5ns * 2 + 1.25ns + 2.5ns + 4 * 2.5ns * 2 + 30ns + 2.5ns + 4 * 2.5ns = 81.25ns => 81.25ns / 7.5ns (round up) = 11 cycles => 11 cycles * 7.5 ns/cycle + 7.5ns + 7.5ns = 97.5ns
Latency (CAS + RAS + Precharge) = 7.5ns * 2 + 1.25ns + 2.5ns + 4 * 2.5ns * 2 + 45ns + 2.5ns + 4 * 2.5ns = 96.25ns => 96.25ns / 7.5ns (round up) = 13 cycles => 13 cycles * 7.5 ns/cycle + 7.5ns + 7.5ns = 112.5ns
Average latency = 91.5ns (20% worse than PC266 DDR SDRAM).
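Same deal for the RDRAM side....this sketch is just my model from above in Python form (phase offset, time-of-flight, two command packets, then resynchronizing to the FSB before the final transfer):

```python
import math

def rdram_latency(page_ns, dram_cycle=2.5, fsb_cycle=7.5):
    # Rough model: latch/decode, average phase offset between the buses,
    # time of flight out, two 4-cycle command packets, the page access,
    # time of flight back, and the 4-cycle data packet parallelizing.
    raw = (2 * fsb_cycle            # latch/decode in the memory controller
           + dram_cycle / 2         # buses on average 1/2 cycle out of phase
           + 2.5                    # time of flight out
           + 2 * 4 * dram_cycle     # two 8 word command packets
           + page_ns                # CAS / RAS+CAS / Pre+RAS+CAS
           + 2.5                    # time of flight back
           + 4 * dram_cycle)        # second data packet parallelizes
    cycles = math.ceil(raw / fsb_cycle)          # resync to the FSB
    return cycles * fsb_cycle + fsb_cycle + fsb_cycle  # controller + transfer

cases = [(15.0, 0.5), (30.0, 0.4), (45.0, 0.1)]
avg = sum(rdram_latency(p) * f for p, f in cases)
print(f"average: {avg:.1f} ns")  # 91.5 ns
```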
Now let's decrease both DRAM cycle times by 50%, and assume that the bus cycle time and CAS/RAS/Precharge latencies stay the same:
PC400 DDR SDRAM: 5ns cycle time
Here it gets hairy as well since the buses aren't synchronous. The wait at the start of the read request averages out to one 5ns DRAM cycle in my diagram, at which point the burst of the eighth word is synchronous with the FSB. Thus after 1 FSB cycle delay, the last word can be transferred 1/4 of a cycle later.
Latency (CAS) = 7.5ns * 2 + 5ns + 15ns + 4 * 5ns + 7.5ns + 7.5ns / 4 = 64.375ns
Latency (CAS + RAS) = 7.5ns * 2 + 5ns + 30ns + 4 * 5ns + 7.5ns + 7.5ns / 4 = 79.375ns
Latency (CAS + RAS + Precharge) = 7.5ns * 2 + 5ns + 45ns + 4 * 5ns + 7.5ns + 7.5ns / 4 = 94.375ns
Average latency = 73.375ns (an improvement of 4%). Note that this does not mean PC400 DDR SDRAM is useless; deeply buffered and pipelined memory controllers are capable of exploiting up to 90% of the potential bandwidth of DDR SDRAM.
PC1200 RDRAM: 1.67ns cycle time
Latency (CAS) = 7.5ns * 2 + .833ns + 2.5ns + 4 * 1.67ns * 2 + 15ns + 2.5ns + 4 * 1.67ns = 55.84ns => 55.84ns / 7.5ns (round up to the next half cycle) = 7.5 cycles => 7.5 cycles * 7.5 ns/cycle + 7.5ns + 7.5ns = 71.25ns
Latency (CAS + RAS) = 7.5ns * 2 + .833ns + 2.5ns + 4 * 1.67ns * 2 + 30ns + 2.5ns + 4 * 1.67ns = 70.84ns => 70.84ns / 7.5ns (round up to the next half cycle) = 9.5 cycles => 9.5 cycles * 7.5 ns/cycle + 7.5ns + 7.5ns = 86.25ns
Latency (CAS + RAS + Precharge) = 7.5ns * 2 + .833ns + 2.5ns + 4 * 1.67ns * 2 + 45ns + 2.5ns + 4 * 1.67ns = 85.84ns => 85.84ns / 7.5ns (round up to the next half cycle) = 11.5 cycles => 11.5 cycles * 7.5 ns/cycle + 7.5ns + 7.5ns = 101.25ns
Average latency = 80.25ns (an improvement of 12%, now only 9% worse than PC400 DDR SDRAM)
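And the halved-cycle-time cases, both in one sketch....same models as before, just with 5ns and 1.67ns DRAM cycles (note I round the PC1200 case up to the next half FSB cycle, per my diagram):

```python
import math

def pc400_ddr(page_ns, dram_cycle=5.0, fsb_cycle=7.5):
    # Model: latch/decode, a one-DRAM-cycle resync wait, the page access,
    # a 4-cycle burst, the controller delay, and the last word landing
    # a quarter FSB cycle late.
    return (2 * fsb_cycle + dram_cycle + page_ns
            + 4 * dram_cycle + fsb_cycle + fsb_cycle / 4)

def pc1200_rdram(page_ns, dram_cycle=5/3, fsb_cycle=7.5):
    # Same RDRAM model as the PC800 case, with a 1.67ns DRAM cycle.
    raw = (2 * fsb_cycle + dram_cycle / 2 + 2.5
           + 2 * 4 * dram_cycle + page_ns + 2.5 + 4 * dram_cycle)
    # Round up to the next half FSB cycle before the final transfer.
    half_cycles = math.ceil(raw / (fsb_cycle / 2)) * 0.5
    return half_cycles * fsb_cycle + 2 * fsb_cycle

cases = [(15.0, 0.5), (30.0, 0.4), (45.0, 0.1)]
ddr_avg = sum(pc400_ddr(p) * f for p, f in cases)
rd_avg = sum(pc1200_rdram(p) * f for p, f in cases)
print(f"PC400 DDR average:    {ddr_avg:.3f} ns")  # 73.375 ns
print(f"PC1200 RDRAM average: {rd_avg:.2f} ns")   # 80.25 ns
```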
These estimates are, of course, just estimates that I threw together....don't quote me during any religious wars. 😉 FSB cycle times and DRAM latencies decrease more slowly than DRAM cycle times (and much more slowly than CPU cycle times), so I've ignored them here...though any decrease in those two quantities helps both DDR SDRAM and RDRAM latencies significantly.
As for the "RDRAM is dead" comment, I think that, whatever its technological benefits and limitations, RDRAM's role in the desktop arena will remain small. RDRAM is by no means disappearing, though; it's used in the Playstation 2 and in the Alpha EV7 due this summer.