
question about cache latencies

eLiu

Diamond Member
Hi all,
Quick question... I often see cache latencies quoted something like (i7, for example):
L1: 4 cycles
L2: 11 cycles
L3: ~40 cycles
RAM: ~100 cycles

So I understand that to mean that if I want to read line X, and X is sitting in L1 cache, it takes me 4 cycles to retrieve it. And if X isn't in L1 but it is in L2, then did I:
1) incur 4 cycles (check L1 for X) + 11 cycles (check & get X from L2) + 4 cycles (put X in L1) before I can use the contents of X
2) incur 11 (get from L2) + 4 (put in L1) before I can use contents of X (i.e. asking whether something is/is not in cache is fast)
3) incur 11 (get from L2) + nothing for L1 to do before I can use X
4) incur 11 (includes getting from L2 & putting in L1)
?
Or is it like case 2), but X is available immediately so I don't have to wait for L1 to write in line X?

Also, does the quoted main memory latency include the time it takes for the RAM to act? Like if we have 533mhz (bus rate) DDR2 with 6-6-6 timings, it'll take something like 180ns (worst) to 60ns (best) for the memory to send the first bit of data back after the request goes through. But this could vary pretty widely depending on what kind of RAM you have.

On a 3ghz processor, 90ns is way more than 100 cycles... so my guess is that 100 cycles means it takes 100 cycles for the processor to ask the RAM to do something (so figuring out that X isn't in cache, looking up where to find X in TLB or doing addr translation, physically sending the signals around, etc). And then the RAM takes however long to respond. Is this correct?
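For concreteness, here's the same back-of-envelope conversion as a little script (the cycle counts are just the rough i7-class numbers quoted above, not measurements from any datasheet):

```python
# Back-of-envelope: convert the quoted cache latencies (in cycles)
# to nanoseconds at a 3 GHz core clock.
CLOCK_HZ = 3e9
CYCLE_NS = 1e9 / CLOCK_HZ          # ~0.333 ns per cycle

latencies_cycles = {"L1": 4, "L2": 11, "L3": 40, "RAM": 100}

for level, cycles in latencies_cycles.items():
    print(f"{level}: {cycles} cycles = {cycles * CYCLE_NS:.1f} ns")

# The ~100-cycle RAM figure is only ~33 ns at 3 GHz, which is why a
# 60-180 ns DRAM estimate can't fit inside it.
```

That gap between ~33 ns and a raw DRAM access time is exactly the puzzle in the question above.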
 
Cache cycles are average cycles to usage. To answer your question: it's (4). 11 cycles from the L2 to usage, and the L1 is loaded within that same 11-cycle window.

Caches have associativity. Each way of associativity is basically a window into main memory. In a 1-way (direct-mapped) cache, a given main-memory location can only reside in 1 place in the cache. In a 2-way associative cache it can reside in, you guessed it, 2 places; and so on up to a fully associative cache, where any memory location can map to any line in the cache. The implication is that for every way of associativity, the cache has to make up to that many tag checks to see whether the requested location is already loaded. More associativity means more checks and so potentially more latency, but also more cache hits, since there's a better chance the requested data wasn't previously evicted.

That's also where extra latencies on the way to main memory come in. Say you request a chunk of memory that isn't in any cache: there are only so many spots where that memory can reside, and it must be loaded into the L1. Caches usually evict the oldest line, and if that victim is dirty, it gets pushed down the hierarchy, possibly forcing a write to main memory. Lucky for us, many modern processors have read and write buffers that can soften this collision, plus deep L3 caches with many spots. However, the penalty is going to be incurred somewhere, and that's where the averages come in.

The variation in main-memory latency comes from whether the memory is recovering from a write, changing rows, and/or changing columns.
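To make the "up to N places to check" point concrete, here's a toy decomposition of an address into tag / set / offset for a set-associative cache (the 32 KB / 64 B-line / 8-way numbers are illustrative, not any specific part):

```python
# Toy set-associative address mapping. A cache of total_size bytes
# with line_size-byte lines and `ways`-way associativity has
# total_size / line_size / ways sets. A given address can only live
# in one set, in any of that set's `ways` slots -- so a lookup must
# compare up to `ways` tags.
def split_address(addr, total_size=32 * 1024, line_size=64, ways=8):
    num_sets = total_size // line_size // ways   # 64 sets here
    offset = addr % line_size                    # byte within the line
    set_index = (addr // line_size) % num_sets   # which set to probe
    tag = addr // line_size // num_sets          # what the tag compare uses
    return tag, set_index, offset

tag, s, off = split_address(0x12345678)
print(f"tag={tag:#x} set={s} offset={off}")
```

Note that two addresses exactly one "cache stride" apart (here 32 KB / 8 = 4 KB of sets times the line size) land in the same set and compete for the same 8 slots.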
 
Yeah, I know about set associativity. Though I wasn't sure how much it contributes to the latency b/c maybe they have some clever hash implemented to resolve "is this line present?" without checking all 8 possibilities (in an 8-way cache). But I wasn't sure about how to interpret the latency numbers that you usually see, so thanks.

For the RAM, I just realized I screwed up my estimate... extra factor of 6. I was trying to do... 533mhz i/o bus DDR2 RAM, 6-6-6 timings.

So the CAS latency is 6/533mhz = 11.3ns if I have the right row open & just have to read off some column. The worst (?) case is if I have to precharge and open a row and then read a column... so that's 18/533mhz = 33.9ns.

At 3ghz, 1 cycle is about 0.33ns. Most likely the 100 cycle estimate includes the 40 cycles for the L3... so that only leaves 60 cycles = 19.8ns of time remaining to poll the RAM. Looks like 19.8ns is about the avg of 11ns and 33ns, which answers my own question. Screwed up earlier b/c I thought the RAM took btwn 60ns and 180ns to respond, oops.
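Same arithmetic in script form, following the post's assumption that the 6-6-6 timings are counted against the 533 MHz figure (DDR2 timings are usually quoted against the 266 MHz memory clock, so real-world numbers would be about twice these):

```python
# Redo the DRAM arithmetic from the post: DDR2 with 6-6-6 timings,
# dividing cycle counts by 533 MHz as above.
BUS_MHZ = 533
ns = lambda cycles: cycles / BUS_MHZ * 1e3   # bus cycles -> nanoseconds

best = ns(6)             # CAS only: the right row is already open
worst = ns(6 + 6 + 6)    # precharge + row activate + CAS

core_cycle_ns = 1e3 / 3000               # 3 GHz core clock
budget_ns = (100 - 40) * core_cycle_ns   # ~100 total minus ~40 for L3

print(f"best {best:.1f} ns, worst {worst:.1f} ns, budget {budget_ns:.1f} ns")
```

The ~20 ns budget does land between the ~11 ns best case and ~34 ns worst case, consistent with the averaging argument above.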

One more question though: that ~100 cycle number would be a bit worse if you're jumping around a lot in memory (such that you have substantial DTLB misses), correct? i.e., I don't think the quoted latency numbers take TLB misses into account.
 
1. Associative checks are usually implemented in parallel. Associativity doesn't increase cache access time linearly; its bigger cost is cache power.

2. Times are approximate because they're unpredictable. At the first level of shared caching, requests from processor P0 have to share resources with P1, P2, etc. This leads to some delay just from queueing. It's fairly common to quote the uncontended latency when you want caches to seem fast, and the average when you want to be honest.

3. As you guys have pointed out, DRAM timing is fairly complicated. When you quote an average, it should account for DRAM timing operations, as well as open/close page policies and refresh timing. But without knowing the assumptions behind a quoted number, you can't know for sure.

4. Most parts assume an L1 hit, and then take corrective action if it's a miss. As in, after 4 cycles (miss detection), the L2 cache is accessed. If that misses (after 11 cycles, in your example), the request goes to L3. From the L3, the processor is usually no longer directly involved, until the request is eventually serviced, by "someone". In other words, in some cases, the request might be forwarded to other cores (i.e., coherence miss), or might eventually be serviced by the memory controller.
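Point 4 can be sketched as a cumulative-latency model, using the cycle counts quoted in this thread. The strictly serial "detect the miss, then probe the next level" assumption is a simplification; note it gives 15 cycles for an L2 hit, whereas the quoted 11-cycle figure may already fold the L1 probe in, as the earlier reply suggested:

```python
# Simplified serial model of point 4: each level is probed in turn,
# and a miss at one level is only detected after that level's full
# access latency, so miss costs accumulate down the hierarchy.
LEVELS = [("L1", 4), ("L2", 11), ("L3", 40), ("RAM", 100)]

def load_latency(hit_level):
    """Cycles until the data is usable if the line first hits at hit_level."""
    total = 0
    for name, cycles in LEVELS:
        total += cycles
        if name == hit_level:
            return total
    raise ValueError(f"unknown level: {hit_level}")

for name, _ in LEVELS:
    print(f"hit in {name}: {load_latency(name)} cycles")
```

Coherence misses (the request being forwarded to another core's cache) would slot in as yet another "level" with its own latency, which this sketch ignores.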
 
About associativity - yes, the checks are usually implemented in parallel, but every bit of complexity may force a unit to run at a slightly lower speed, thus more latency. It's really a trade-off.

Although AMD has let its L1/L2 cache structure stagnate for the past few generations, probably mostly because of the cost of redesigning it, they seem to trade larger sizes for lower associativity. Intel, meanwhile, has smaller, more associative caches that have evolved in many directions over the same generations.

Ironically when it comes to L3, Intel went with a faster 16 way cache, while AMD went with a slower 48 way.

To some degree cache speed usually works out to

size * associativity + exclusivity = relative latency
 