DRAM latency...

BumJCRules

Junior Member
Apr 5, 2002
22
0
0
I would like to talk about latency issues with current and some of the future DRAM and EDRAM architectures.



I hear the following here and on every BB/forum community I have visited: "What will changing my CAS from 2.5 to 2.0 do in terms of performance?" or "What performance gains can I expect if I set my timings to more aggressive levels?"

So, just to make it clear, I thought I should add this for those of you who wonder whether your memory can handle a CAS Latency 2 setting.

Background...

tCLK = System clock period (cycle time)

CL = The CAS Latency

tCAC = Column Access Time

The "rule" for determining CAS Latency timing is based on this equation: CL * tCLK >= tCAC

In English: "CAS Latency times the system clock cycle length must be greater than or equal to the column access time". In other words, if tCLK is 10ns (100 MHz system clock) and tCAC is 20ns, the CL can be 2. But if tCAC is 25ns, then CL must be 3. The SDRAM spec only allows for CAS Latency values of 1, 2 or 3.
This is taken from http://www.vml.co.uk/Support/Sdram Timing.htm and is based on 100 MHz PC100. PC2700 runs a 166.667 MHz clock, and most DDR is rated at CAS 2.5 clock cycles, so the switch to CAS 2 is only 0.5 cycles.

So... on a 266 FSB (133.33 MHz x 2, or 400/3 MHz) system clock, the tCLK should be around 7.5 ns. (1 second = 1,000,000,000 ns.)

tCLK in ns = 1000 / (clock speed in MHz)

(1 MHz = 1,000,000 cycles per 1 second.)

So for 400/3 MHz, or 133.33 MHz, the tCLK will be...

1000 ns / (400/3) = 7.5 ns

CL * tCLK >= tCAC

2 x 7.5 ns = 15 ns

So a CAS latency setting of 1.4 would be enough. You should be more than able to run PC2700 in this example at CL 2.
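If you want to play with the rule yourself, here is a minimal Python sketch. The tCAC values below are illustrative assumptions, not chip specs; check your module's datasheet for the real figure.

# Smallest CL that satisfies CL * tCLK >= tCAC
def min_cas_latency(bus_mhz, tcac_ns):
    tclk_ns = 1000.0 / bus_mhz    # clock period in ns
    return tcac_ns / tclk_ns      # round up to the next supported setting

print(min_cas_latency(133.33, 15.0))   # -> ~2.0, so CL 2 works at 133 MHz
print(min_cas_latency(166.67, 15.0))   # -> ~2.5, so CL 2.5 at 166 MHz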





So you can see, as clock speeds increase, the latency as a percentage decreases.

Just to prove that,

From CL 2.5 to 2 on DDR systems, the time improvement at 500/3 MHz (166.667 MHz, so tCLK = 6 ns) is only 0.5 x 6 ns = 3 ns, or 0.000000003 seconds. A half clock from 2.0 to 1.5 would be the same improvement.

That is about 3 billionths, ...of one second.

It is not worth the hassle to set the CAS latency at 1.5 cycles even if you could get a setting for it. Speed increases are much better overall. So until FSB clock speeds increase, we will have to take these minuscule improvements through lower latencies.

If you look at the number of bits that are delayed for 0.5 cycles on DDR, it is only 1 bit:

0.5 cycles x 2 bits per cycle = 1 bit.

Now that is 1 bit per second. The difference from increasing the speed by 0.5 MHz is 1,000,000 bits per second.

You should easily see that speed is more important than latency settings. I am not saying that lower latencies are not important at all, just not AS important as increased clock speeds.
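In Python terms, the per-pin comparison I am making looks like this. (This uses my per-second framing from above; as you will see below, Peter disagrees with counting it this way.)

# Half a cycle saved vs. 0.5 MHz more clock, per data pin, DDR (2 bits per cycle)
bits_saved = 0.5 * 2                  # = 1 bit
extra_bits_per_second = 0.5e6 * 2     # = 1,000,000 bits per second
print(bits_saved, extra_bits_per_second)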

Now, that was for SDRAM-based memories. RDRAM needs to decrease its latencies as well, but the same holds true as speeds increase.



To put it another way... increasing the speed of the memory is more beneficial than dropping the timing settings down to the most aggressive levels (under current conditions). Even if you could save 10 cycles of latency, that would be the difference between...

1 - tRC Timing: 3, 4, 5, 6, 7, 8, 9 cycles

2 - tRP Timing: 1, 2, 3, 4 cycles

3 - tRAS Timing: 2, 3, 4, 5, 6, 7, 8, 9 cycles

4 - CAS Latency: 2, 2.5, 3 cycles

5 - tRCD Timing: 1, 2, 3, 4 cycles


of a 9-3-5-3-3 setting and a 3-1-2-2-2 setting. Call that 10 cycles. For 10 cycles, that would be 20 bits delayed, for a total of around 0.0000000602 seconds. Over the course of a year, running full time 24-7-365, that would be a difference of 1.9 seconds. However, if you increase the speed to a 200 MHz clock with DDR400, those same 10 cycles work out to only 1.58 seconds per year. Again, it is minuscule, but the difference in speed is obvious. At 166 MHz, with zero latency, there would be 21,200,000,000 bits transferred per second. At 200 MHz, that would be 25,600,000,000 bits transferred. That is 4,400,000,000 bits gained via the speed increase and only 20 bits from the timing change. So you see, speed increases are more important than latency settings.
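The peak-transfer arithmetic, spelled out (assuming a 64-bit DDR bus, which is what we are all talking about here):

# Peak bits per second on a 64-bit DDR bus
def peak_bits(clock_mhz):
    return clock_mhz * 1e6 * 2 * 64      # 2 transfers per cycle, 64 bits wide

print(peak_bits(166))                    # ~21,200,000,000 (DDR333)
print(peak_bits(200))                    # 25,600,000,000 (DDR400)
print(peak_bits(200) - peak_bits(166))   # ~4,400,000,000 gained by the speed bump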


I know that was a lot, but it needed to be said. I had to get that out of the way before someone turns this into a "CL 2.5 to CL 2.0" or "using the most aggressive timing settings will improve your memory performance a lot" thread.



What are your thoughts about the newer forms of memory that are about to become reality in the mainstream?

Examples... DDR ESDRAM (formerly the old DDR II), QDR, and QDR II.
 

Peter

Elite Member
Oct 15, 1999
9,640
1
0
From my experience with a highly programmable embedded chipset, the order of "things that improve your DRAM performance", from highest benefit to lowest, is:

1. Clock speed
2. CAS latency
3. tRC
4. tRAS
5. tRP and tRRD

regards, Peter
 

BumJCRules

Junior Member
Apr 5, 2002
22
0
0
"1. Clock speed
2. CAS latency
3. tRC
4. tRAS
5. tRP and tRRC"

I agree wholeheartedly on the speed for #1.

CAS and all of the others are no big deal. No offense, but from the most aggressive to the least aggressive setting, that is only 18 clock cycles of delay/latency. At current DDR333 speeds of 500/3, or ~166.667 MHz, that is a difference of 18 cycles out of the 166,666,667 cycles. That is a slowdown of ~0.000000108 seconds, which would be a difference of ~3.406 seconds per calendar year. Or, at peak bandwidth, that would be ~18,165.333 bits or ~2,270.667 bytes per 365-day calendar year.
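Here is that arithmetic spelled out, using my cycles-per-second model (Peter takes issue with the model itself further down):

cycles_lost = 18                       # most aggressive vs. least aggressive settings
clock_hz = 166_666_667                 # DDR333 bus clock
loss_per_second = cycles_lost / clock_hz            # ~0.000000108 s
loss_per_year = loss_per_second * 365 * 24 * 3600   # ~3.406 s
print(loss_per_second, loss_per_year)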

That is not much at all.

Speed is much more beneficial.

I think bandwidth is also another factor that needs to be enhanced.


The new DDR II standard with SRAM, called DDR2_EMS, has row registers that will make the precharge penalty basically 0, so the array bandwidth is then fully utilized. The total latency for the first 64 bits will be 3.5 to a maximum of 9.5 cycles, which equates to 17.5 to 47.5 nanoseconds. Compare that to RDRAM, which has 14 to 32 cycles of latency, equating to 35 to 80 nanoseconds.
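To sanity-check those cycle-to-nanosecond conversions in Python (I am assuming a 200 MHz clock for the DDR2_EMS part and 400 MHz for RDRAM; those are my assumptions, picked because they reproduce the quoted figures):

def cycles_to_ns(cycles, clock_mhz):
    return cycles * 1000.0 / clock_mhz

print(cycles_to_ns(3.5, 200), cycles_to_ns(9.5, 200))   # 17.5 .. 47.5 ns
print(cycles_to_ns(14, 400), cycles_to_ns(32, 400))     # 35.0 .. 80.0 ns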

Even if they did drop the overall latency, the speed is the biggest improvement. Using the SRAMs (EMS) will help bandwidth. So those two together will do more than decreasing latencies.

Back to you...
 

BumJCRules

Junior Member
Apr 5, 2002
22
0
0
Good point.

One reason RDRAM has such high latencies: a lot of banks and a lot of reactivations of those banks, not to mention that the biggest delay for RDRAM is the fact that the sense amps are shared with adjacent banks. However, having tons of them helps improve that latency problem.


-Latency-
"The time between initiating a request for data and the beginning of the actual data transfer. On a disk, latency is the time it takes for the selected sector to come around and be positioned under the read/write head. Channel latency is the time it takes for a computer channel to become unoccupied in order to transfer data. Network latency is the delay introduced when a packet is momentarily stored, analyzed and then forwarded." - The Computer Language Company Inc.

"Main Entry: 1la·tent
Pronunciation: 'lA-t&nt
Function: adjective
Etymology: Middle English, from Latin latent-, latens, from present participle of latEre to lie hidden; akin to Greek lanthanein to escape notice
Date: 15th century
: present and capable of becoming though not now visible, obvious, or active <a latent infection>
- la·tent·ly adverb
synonyms LATENT, DORMANT, QUIESCENT, POTENTIAL mean not now showing signs of activity or existence. LATENT applies to a power or quality that has not yet come forth but may emerge and develop <a latent desire for success>. DORMANT suggests the inactivity of something (as a feeling or power) as though sleeping <their passion had lain dormant>. QUIESCENT suggests a usually temporary cessation of activity <the disease was quiescent>. POTENTIAL applies to what does not yet have existence or effect but is likely soon to have <a potential disaster>." - Merriam-Webster Collegiate Dictionary


However, the fact still remains that speed is the major driver in hiding latencies. Another is prefetching, and still another is eliminating them outright. The last is the hardest to do as bandwidth increases, unless we all go to SRAM, which probably won't happen as long as costs remain high.
 

Peter

Elite Member
Oct 15, 1999
9,640
1
0
Bum, you're completely off on your calculations.

From fastest to slowest flavor at a given clock speed (in terms of CL, tRC, tRRD, tRP, tRAS) you typically lose four to five clocks - not in a second, but on every block of eight quadwords (a "burst"), and on every single access that isn't a full burst. 15 or 20 cycles for such a burst, 8 or 13 for a single access - that DOES make a difference, a very noticeable one in applications that are bound by RAM performance ... and also at the other end of the performance spectrum, on shared-memory-VGA low-cost systems.
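If you want the cost model in code form, it is simply this (a sketch; the 8 and 13 cycle initial latencies are the illustrative best/worst cases from above):

# Cycles to complete one 8-quadword burst:
# initial access latency plus the burst tail
# (7 cycles on SDR at 1 qword/cycle, 3.5 on DDR at 2 qwords/cycle)
def burst_cycles(initial_latency, ddr=False):
    return initial_latency + (3.5 if ddr else 7)

print(burst_cycles(8), burst_cycles(13))   # 15, 20 cycles per burst on SDR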

regards, Peter

 

BumJCRules

Junior Member
Apr 5, 2002
22
0
0
"Bum, you're completely off on your calculations."

How so?

You said... "From fastest to slowest flavor on a given clockspeed (in terms of CL, tRC, tRRD, tRP, tRAS) you typically lose four to five clocks - not in a second, but on every block of eight quadwords (a "burst"), and on every single access that isn't a full burst."

(I am not talking down to you with the notation; I am just making it so others can read along. :) )

b = bits
B = BYTES

One quadword = 64b. 8 quadwords = 512b or 64B.

So what you are saying is that for every 512b there will be a delay of 4 to 5 clocks/cycles?

If this is so...think of this...

Take a memory system using DDR266. That has a memory bandwidth of ~2.1333GB/s.

So, using what you stated, in this scenario that would be 4,166,666 quadwords maximum under peak bandwidth.

So 4,166,666 x 4 cycles per quadword = 16,666,666 cycles lost to latency.

(400/3 = ~133.3333MHz)

16,666,666 divided by (400/3 MHz) = ~12.5% loss.

That is just not true.

Look at this article. Scroll down to the table.

After looking at the results from the link above: did you see the Tyan board with the DDR266? The average bandwidth was 1020 MB/s.

The peak bandwidth of the board is 1,066.6667 MB/s. That is not the memory bus but the FSB; the memory bus will be constricted to the max performance of the FSB.

1020/1066 = 0.9568, or on the low side a 4.5% loss.

There is a huge difference between 12.5% and 4.5%.
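Side by side, that is (my arithmetic, with the quadword size as I understand it so far):

model_loss = (4_166_666 * 4) / 133.333e6   # cycles lost / cycles per second -> ~0.125
measured_loss = 1 - (1020 / 1066.6667)     # Tyan board, average vs. peak -> ~0.044
print(model_loss, measured_loss)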

Now if we were talking about RDRAM that would be a higher percentage.

"15 or 20 cycles for such a burst, 8 or 13 for a single access - that DOES make a difference, a very noticeable one in applications that are bound on RAM performance ... and also at the other end of the performance spectrum, on shared-memory-VGA low cost systems."

I lost you here. Were you just totaling the number of cycles? Also do you mean on an EV6 bus?


Back to you...
 

glugglug

Diamond Member
Jun 9, 2002
5,340
1
81
Your example of only a 4% loss (1.02GB/s memory bandwidth vs. 1.066GB/s bus bandwidth) is using SDR SDRAM.

SDR supported full-page burst, where the 8-13 cycle penalty was suffered at the beginning of a burst (the start of a read at a new, non-sequential location), but not every 8 quadwords.

DDR does not support full page burst, and must have that delay every 8 quadwords (64 bytes), even if they are sequential.
 
Jun 26, 2002
185
0
0
Originally posted by: BumJCRules
I would like to talk about latency issues with current and some of the future DRAM and EDRAM architectures.



The "rule" for determining CAS Latency timing is based on this equation: CL * tCLK >= tCAC


1000ns /(400/3) = 7.5ns

CL * tCLK >= tCAC

2 x 7.5 ns = 15 ns

So a CAS latency setting of 1.4 would be enough. You should be more than able to run PC2700 in this example at CL 2.


Wait a second, I don't understand one thing. Isn't it more like CL >= tCAC/tCLK?

I didn't think the tCAC was variable. It is set by the design of the DRAM. Where did you get that tCAC was only 10.5 ns?

1.4 X 7.5 = 10.5ns

Or for PC2700, 2 x 6 ns = 12 ns tCAC for this?

In the example you stated, CAS 2 on a 133 MHz bus allows for a tCAC of up to 15 ns. Most specs want CAS 2.5, so 18.75 ns for tCAC.

So then, if you wanted to run PC2700 on a 167 MHz bus with a tCAC of 14 ns, you would need a CAS of 2.33, so 2.5.

I'm just trying to understand the math you used. I don't want to get into what is faster. But you are right about the % change as speeds increase.
 

Peter

Elite Member
Oct 15, 1999
9,640
1
0
Bumjc, the essentials ... one quadword is 64 BYTES (one byte is 8 bits, one word is 16 bits, a doubleword is 32, and a quadword ... you get it), an 8-quadword SDRAM burst is 512 bytes (same for DDR), which happens to be a cacheline size in Pentium/Pro/II/III/4 and K5/K6/K7 systems.

Do your calculation again, and you're suddenly in the same ballpark as your quoted real world results.

Regarding full page bursts, SDR SDRAM may have supported it, but actual x86 chipsets don't use that. It's either a single access or an 8-quad burst.

With DDR, the remaining 7 quadwords after the initial one pump in 3.5 clocks, not 7 as with SDRAM, magnifying the effect of initial access latency.

And yes, I was totalling the cycles for a burst, in case I lost you there, that's why it's 7 more than for a single access.

In the real world, a good SDRAM chipset like ServerWorks Champion LE with PC133-222-520 ECC SDRAM gets you 810/237 MB/s read/write throughput. With worst settings, it's down to 770/206. Best setting is CAS latency 2, RAS cycle time 7, act-to-deact 5, act-to-r/w 2, RAS precharge 2. Worst was 3-10-6-3-3.

With DDR, the difference is larger because the burst "tail" is only 3.5 cycles not 7. I have none at hand, but I can simulate with an SDRAM chipset that does only 4-quadword bursts (burst "tail" 3 cycles long):

NatSemi's Geode GX1 with SDRAM (at 111 MHz, it's an odd chipset), with a burst length of 4 quadwords, pumps 195 MB/s with 2-8-6-2-2 timing, while at the same clock speed and 3-9-6-3-3, we're down to 174, which is almost as slow as the best 2-8-5-2-2 SDRAM at 95 MHz that manages 173. 3-9-5-3-3 at 95 MHz is 158 MB/s, and it's all downstairs from there.

So from "normal" to "best" RAM, at the same clock speed, we have a burst throughput gain of five to 10+ percent, and much more in single accesses. Sure, thanks to optimized caching strategies in the CPUs, the latter hardly happen anymore - but five to ten percent is not something one would call unimportant. In other corners of the computer, lots of money is being spent for such a gain.

regards, Peter

All measurements taken with CACHEMEM.EXE 2.65MMX. "Clock speed" refers to the SDRAM bus. No changes in CPU clock were made during test. ServerWorks chipset w/ 800 MHz mobile Tualatin, NatSemi GX1-333 MHz.
 

BumJCRules

Junior Member
Apr 5, 2002
22
0
0
"Wait a second I don't understand one thing. Isn't more like tCAC/tCLK >=CL"

See here.


"I didn't think the tCAC was variable. It is set by the design of the DRAM. Where did you get that tCAC was only 10.5 ns?

1.4 X 7.5 = 10.5ns"


The cycle time for a 400/3 MHz signal, or 133.33 MHz, is 7.5 ns.



"or for PC2700 2 X 6ns = 12ns tCAC for this?"

If you have 6ns rated memory chips, yes.



"In the example you stated at min 14ns is needed for CAS 2 on a 133Mhz bus. Most specs want CAS 2.5 so 18.75ns for TCAC.

So then if you wanted to run PC2700 or a 167Mhz bus at a tCAC of 14ns you would need a CAS of 2.33 so 2.5.

I'm just trying to understand the math you used. I don't want to get into what is faster. Buy you are right in the % change as speeds increase."

Yes, "need" is the word. However, the clock generator does not do anything smaller than a half clock for timing settings, hence you get 2, 2.5, and 3 clocks as settings.
 

BumJCRules

Junior Member
Apr 5, 2002
22
0
0
I don't know where to go from here.....?

I was incorrect with my look at the P4X266 chipset on the Trinity 510 S2266 motherboard. That is a P4 board and has a 3.2 GB/s FSB: quad-pumped 100 MHz by 64 bits. So the percentage is not correct. The numbers will also be skewed by the P4's prefetch and some other variables.

So I will have to look at the AMD board and will look for other published results from credible sources, because those don't look correct either.


As for the timing/latency issue for 8 quad-word bursts and single accesses...

Word = 32bit

quad word = 128bit

8 quad word = 1024bits or 128Bytes

Thank you for pointing that out to me. This is not a 16-bit architecture. Just another senior moment. :D

"Timings have been improved, resulting in faster transfers between the synchronized Front Side Bus and Memory Bus. Also, the new memory controller with Performance Driven Design can burst up to eight Quad Words of data per clock, up from four in previous designs. Data queues have also been deepened, allowing faster and more efficient access to buffered data." - Via Technologies KT266A Whitepapers - 8-24-2001 Page 2


"include tightened timings on the S2K front side bus, deeper instruction and data queuing, and the unique ability to burst up to eight Quad Words per clock." - Via Technologies KT266A Whitepapers - 8-24-2001 Page 4

DDR266 with a bandwidth of ~2.1333GB/s.

So using 1024bits per quadword and 4 cycles lost per quadword...that would be 2,083,333 quadwords maximum under peak bandwidth.

So 2,083,333 x 4 cycles per quadword = 8,333,333 cycles lost to latency.

(400/3 = ~133.3333MHz)

8,333,333 divided by (400/3 MHz) = ~6.25% loss.

Now that makes sense. Thanks again for the correction.

So, based on the calculations above, you are stating that the lost performance from peak/zero latency would be ~6.25%. Am I hearing you correctly?

If not, please do your own calculation here to show me what you would like me to see.


Onto another part of this... The Ace's Hardware article and more specifically the numbers...

So why are the average bandwidth numbers so low on that Tyan with the P4 and the two AMD boards?

I get almost the same numbers with PC133 using STREAM v2.0 and CACHEMEM v2.6 on my laptop.

PIIIM @
647.2MHz

PC100 CL2 2-2-2
FSB 100MHz
Chipset is i440BX
MCH is 82443BX


STREAM v2.0 under Windows 98SE

C:\>STREAMD
DOS/4GW Protected Mode Run-time Version 1.97
Copyright (c) Rational Systems, Inc. 1990-1994

STREAM for DOS v2 by Dennis Lee
===============================
1 MB = 1000000 Bytes in the following measurements.

For accurate results, this benchmark should be executed
in a true DOS session, and not a DOS shell under another OS.

Time Operation Mem Speed Error
---- --------- --------- -----
1.92 sec COPY32 333.33 MB/s 3.2%
2.04 sec COPY64 313.73 MB/s 3.0%
2.03 sec SCALE 315.27 MB/s 3.0%
2.64 sec ADD 363.64 MB/s 2.3%
2.91 sec TRIAD 329.90 MB/s 2.1%

These results are comparable with those on the STREAM website.
See <http://www.cs.virginia.edu/stream> for info on STREAM.

Type 'streamd ?' for help

C:\>


STRAM v2.0 under "True" DOS...

C:\>STREAMD
DOS/4GW Protected Mode Run-time Version 1.97
Copyright (c) Rational Systems, Inc. 1990-1994

STREAM for DOS v2 by Dennis Lee
===============================
1 MB = 1000000 Bytes in the following measurements.

For accurate results, this benchmark should be executed
in a true DOS session, and not a DOS shell under another OS.

Time Operation Mem Speed Error
---- --------- --------- -----
1.98 sec COPY32 323.23 MB/s 3.1%
2.03 sec COPY64 315.27 MB/s 3.0%
2.09 sec SCALE 306.22 MB/s 3.0%
2.47 sec ADD 388.66 MB/s 2.5%
2.80 sec TRIAD 342.86 MB/s 2.2%

These results are comparable with those on the STREAM website.
See <http://www.cs.virginia.edu/stream> for info on STREAM.

Type 'streamd ?' for help

C:\>




CACHEMEM 2.6 under Windows 98SE - and don't listen to line 2 of the output; the numbers are only marginally skewed by Windows.

Cache size/Memory speed info tool 2.6MMX - (c) 1999-2001, LRMS - DJGPP compiled
** Warning! Results are unreliable under Windows! **
CPUID support detected... 'GenuineIntel' with FPU TSC MMX
Family=6 Model=8 Step=3 Type=0 Chipset (Vendor/Device ID(Rev)): Intel/7190(03)
CPU clock: 647.2 MHz
Using 32MB physical memory block (alignment = 32)
Bandwidth - MMX linear access test... Read/Write/Copy (MB/s)
Block of 1KB: 4209.1 / 3099.4 / 5759.6
Block of 2KB: 4419.9 / 3089.4 / 5908.1
Block of 4KB: 4391.9 / 3244.3 / 5990.8
Block of 8KB: 4563.2 / 3288.5 / 6022.5
Block of 16KB: 4425.7 / 3307.3 / 3048.9
Block of 32KB: 2544.8 / 2232.3 / 3006.3
Block of 64KB: 2557.2 / 188.5 / 1595.9
Block of 128KB: 1474.0 / 1336.9 / 998.7
Block of 256KB: 1043.6 / 707.8 / 368.3
Block of 512KB: 735.1 / 229.1 / 324.4
Block of 1024KB: 707.1 / 204.7 / 306.5
Block of 2048KB: 711.4 / 202.6 / 97.0
Block of 4096KB: 668.8 / 202.5 / 301.0
Block of 8192KB: 711.9 / 195.2 / 293.5
Block of 16384KB: 714.9 / 197.6 / 289.5
Block of 32768KB: 710.7 / 191.3
Latency - Memory walk tests... ("pointer chasing")

Null size: 3 cycles 1 cycles (overhead 126 cycles)
steps: 4 8 16 32 64 128 256 512 1k 2k 4k (bytes)
Block of 1KB: 3 3 3 3 3 3 3 3 - - - cycles
Block of 2KB: 3 3 3 3 3 3 3 3 3 - - cycles
Block of 4KB: 3 3 3 3 3 3 3 3 3 3 - cycles
Block of 8KB: 3 3 3 3 3 3 3 3 3 3 3 cycles
Block of 16KB: 3 3 3 3 3 3 3 3 3 3 3 cycles
Block of 32KB: 3 4 5 7 7 7 7 7 7 7 7 cycles
Block of 64KB: 3 4 5 7 7 7 7 7 7 7 7 cycles
Block of 128KB: 3 4 5 7 7 7 7 7 7 7 7 cycles
Block of 256KB: 4 5 23 47 50 50 50 51 65 57 62 cycles
Block of 512KB: 19 23 42 75 73 79 73 74 76 81 90 cycles
Block of 1024KB: 19 23 42 72 73 72 74 74 76 84 90 cycles
Block of 2048KB: 19 23 42 72 72 72 73 74 76 81 90 cycles
Block of 4096KB: 19 23 42 73 72 72 73 74 77 81 90 cycles
Block of 8192KB: 19 23 42 72 77 73 74 78 76 81 90 cycles
Block of 16384KB: 19 23 43 72 72 100 84 76 77 81 90 cycles
Block of 32768KB: 19 23 42 72 72 72 73 74 77 81 91 cycles
Done.
This system appears to have 3 cache levels (enabled).
L1 cache (16KB) speed (MB/s): Read=4609.0, Write=3322.7
L2 cache (128KB) speed (MB/s): Read=2552.6, Write=2178.2
L3 cache (256KB) speed (MB/s): Read=933.1, Write=565.0
Main memory speed (MB/s): Read=714.5, Write=185.3

I see the 19-cycle delay. However, wouldn't it be the 1k column? 3-77 cycles depending on the step size?


Back to you...
 

Peter

Elite Member
Oct 15, 1999
9,640
1
0
Still not right ... word 16, quadword 64 bits (which is exactly the width of both the CPU and RAM busses, this is why the burst length is always being stated in qwords).

P6 and K7 CPUs use a burst length of 8 qwords on the CPU bus side ("S2K" for the K7), and it's not a coincidence that the RAM controllers in the chipsets mirror that behavior.

Let's look at CACHEMEM results. STREAM results don't tell that much.

What you have there is ye olde Intel i440BX with a Coppermine CPU on.

The throughput is where it should be (you are running 100 MHz CPU and RAM busses there, regardless of your RAM being PC133 capable).

The latency is measured in CPU cycles, you got a 650 MHz there, so you need to divide by 6.5 and round for real RAM clock cycles.

The leftmost columns are affected by FIFOs and caches, forget those. In the 4th column, you see the actual initial access latency, mainly influenced by tCL and tRAS. You got 11 cycles initial latency there, meaning that an 8-quadword burst takes 18 cycles or 180 ns. If more bursts follow, there is a delay in between, influenced by the less important timing parameters. This can be seen in the latter three columns: there is a penalty of up to three extra cycles there.
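The conversion, if you want to do it yourself (your CPU at 647.2 MHz, RAM bus at 100 MHz):

cpu_mhz, ram_mhz = 647.2, 100.0
for cpu_cycles in (72, 77):            # values from the 32-byte-step column
    print(round(cpu_cycles / (cpu_mhz / ram_mhz)))   # -> 11, 12 RAM cycles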

The occasional slow value in between is from system management interrupts (handling USB and power management).

Questions?

regards, Peter
 

BumJCRules

Junior Member
Apr 5, 2002
22
0
0
Okay...

Now you've done it.

I had 16-bit words before, and I thought you corrected me. So what you are saying is that I was correct before, with a quad-word being 64b and 8 quad-words being 64B (512 bits).

Ugghhhh!!!!

I understand the physical structures. I just need to understand the delays, protocols, etc.

So please elaborate if you would...?