
question about cache writes

eLiu

Diamond Member
Hi all,
On 'current' processors (say core2, i7, amd phenom), when you write X to memory, what happens next?

I guess if X is register allocated, then you're done.
If X lives in memory somewhere, then you check the L1 cache (if X isn't there, bring it in). Update the value in L1.

Now, does updating the L1 value require you to also update the value in L2? Similarly, does updating the value in L2 also require updating the value in RAM? I imagine the answer here is "sometimes." lol

Or does the write only get propagated down the hierarchy when the line X is on gets evicted? i.e., do modern processors "gather" all writes to the same cache line & handle the actual writing as infrequently (lazily) as possible?

I think the phrases I'm looking for are 'write-back' and 'write-through'. But I'm having trouble googling for my answers b/c most/all results are from people asking configuration questions for their SSDs or whatever. I'm wondering how 'current' processors handle memory writes. (Also I'm assuming we aren't interacting w/SSE's non-temporal write instructions.)

Thanks,
-Eric
 
The biggest difference in L1/L2 arrangements is exclusivity. Generally AMD uses an exclusive cache and Intel a non-exclusive one. In an exclusive hierarchy, a segment of memory resides in either the L1 or the L2 but never both, while in a non-exclusive hierarchy whatever is in the L1 is always in the L2 as well. The advantage of exclusive is more total usable cache, but you usually pay a cycle penalty to maintain it. All shared L3+ caches that I know of are non-exclusive.

As for write back vs write through, it really has to do with when other memory/cache locations are updated. In a write back cache hierarchy, memory locations are not synchronized immediately but rather marked as dirty and only updated when used or cleared. A write through cache the update to other memory locations starts immediately.
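The write-through vs. write-back timing difference can be sketched in a few lines of toy Python -- real caches track whole lines and coherence states, but the point here is just *when* the next level gets updated:

```python
# Toy contrast of write-through vs. write-back (illustrative only).

class WriteThroughCache:
    def __init__(self, memory):
        self.memory = memory
        self.data = {}

    def write(self, addr, value):
        self.data[addr] = value
        self.memory[addr] = value          # propagates immediately

class WriteBackCache:
    def __init__(self, memory):
        self.memory = memory
        self.data = {}
        self.dirty = set()

    def write(self, addr, value):
        self.data[addr] = value
        self.dirty.add(addr)               # just mark dirty; memory is stale

    def evict(self, addr):
        if addr in self.dirty:
            self.memory[addr] = self.data[addr]   # lazy update at eviction
            self.dirty.discard(addr)
        del self.data[addr]

mem_wt, mem_wb = {0x10: 0}, {0x10: 0}
wt, wb = WriteThroughCache(mem_wt), WriteBackCache(mem_wb)
wt.write(0x10, 99)
wb.write(0x10, 99)
print(mem_wt[0x10])  # 99: write-through updated memory immediately
print(mem_wb[0x10])  # 0:  write-back left memory stale
wb.evict(0x10)
print(mem_wb[0x10])  # 99: updated only when the dirty line was evicted
```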

As far as I know, all modern processors follow the MESI protocol (or a variant of it) for cache coherence. Even main memory must maintain coherency at some point under its rules.
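A toy sketch of the basic MESI transitions for one line shared between two caches (illustrative only -- real protocols involve many more messages and corner cases):

```python
# Toy MESI walk-through for a single line in two caches.
# States: M(odified), E(xclusive), S(hared), I(nvalid).

class Line:
    def __init__(self):
        self.state = 'I'

def read(requester, other):
    if requester.state == 'I':
        if other.state in ('M', 'E', 'S'):
            other.state = 'S'       # other cache supplies/shares the data
            requester.state = 'S'
        else:
            requester.state = 'E'   # no other copy exists: Exclusive

def write(requester, other):
    if other.state != 'I':
        other.state = 'I'           # invalidate the other copy first
    requester.state = 'M'           # single writer, line now dirty

a, b = Line(), Line()
read(a, b);  print(a.state, b.state)  # E I  (a holds the only copy)
read(b, a);  print(a.state, b.state)  # S S  (both hold clean copies)
write(a, b); print(a.state, b.state)  # M I  (a owns it; b invalidated)
```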

I haven't really studied the eviction rules to main memory.
 
The following is specific to x86 chips. I assume those are the ones in which you are interested.

Modern intel parts use write-back L1-Ds. This discussion is specific to that model as well. In general, most products have a write-back level of caching wherever the coherence protocol interacts... or an included level.

Just FYI, in my parlance:
core == processor,
as distinct from die, or chip.

1. Stores retire into a per-processor store buffer. In general, there are many stores in the store buffer at a time. The local processor snoops (looks in) the store buffer when it accesses memory, so the local processor sees updates from the same core. Other processors cannot see each other's store buffers (this is where processor consistency comes up).

2. The head of the store buffer is outstanding to the memory system. In a write-back L1 cache, the head of the store buffer empties into the L1-D when the cache coherence protocol delivers a writable data allocation to the L1-D. It is also possible that the block was already in exclusive state -- in that case, the head of the store buffer can write into the cache without consulting other cores and caches.

2.a) The exact rules about how and when exclusive modification rights transfer between caches is dependent on the cache coherence protocol. People get PhDs discussing cache coherence protocols. To summarize in one sentence, the cache coherence protocol ensures that only one cache ever has permissions to change the contents of a given cache line.

3. In a write-back cache, the now-dirty (exclusive) line remains in the L1-D, until one of four events happens.

3.a) The line can be evicted if the local processor wants to access other lines that map to the same set. If that happens enough times, the (pseudo-)LRU bits on the dirty line indicate that it is in the least-recently-used position. The line moves into a write buffer, and the L1-D initiates a write-back action to the next level of cache -- usually the L2. The L2 then caches the block in modified state.

3.b) The line will be downgraded to owned/shared if another processor attempts to read the line. The cache holding the dirty copy sends the data to the requesting processor node, and retains a clean, non-writable version for itself.

3.c) The line will be downgraded to invalid if another processor attempts to write the line (i.e., another processor's store to the same line reaches the head of the other processor's store buffer). The cache holding the dirty copy sends the data to the requesting processor node, then marks the permissions on the line as invalid.

3.d) The local processor node could initiate an explicit cache flush, under software control (not particularly likely).
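The eviction path in 3.a can be sketched with a toy 2-way set model (illustrative Python; `CacheSet` is a simplification with true LRU, not any real chip's replacement policy):

```python
# Toy 2-way set: repeated accesses to other lines that map to the same
# set push a dirty line into LRU position and force its write-back.

from collections import OrderedDict

class CacheSet:
    def __init__(self, ways, l2):
        self.ways = ways
        self.lines = OrderedDict()   # tag -> value; insertion order = LRU order
        self.dirty = set()
        self.l2 = l2                 # next level receives write-backs

    def access(self, tag, value=None):
        if tag in self.lines:
            self.lines.move_to_end(tag)        # mark most recently used
        else:
            if len(self.lines) == self.ways:   # set full: evict LRU line
                victim, vdata = self.lines.popitem(last=False)
                if victim in self.dirty:
                    self.l2[victim] = vdata    # write-back to the L2
                    self.dirty.discard(victim)
            self.lines[tag] = self.l2.get(tag, 0)
        if value is not None:
            self.lines[tag] = value
            self.dirty.add(tag)

l2 = {}
s = CacheSet(ways=2, l2=l2)
s.access('A', value=99)    # store: line A is now dirty in "L1"
s.access('B')              # fills the other way
print(l2.get('A'))         # None: A is still dirty in L1, L2 knows nothing
s.access('C')              # conflict miss: A is LRU, gets written back
print(l2.get('A'))         # 99: the L2 now holds the modified data
```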

So, to answer your specific questions:
... does updating the L1 value require you to also update the value in L2?
Not for write-back L1D's. Intel chips pre-nehalem were write-through write-invalidate L1-Ds -- their store buffers drained into the private L2 caches.
Similarly, does updating the value in L2 also require updating the value in RAM?
Definitely not. That would eat a helluva lot of power, and be a serious DRAM bandwidth issue. But, last-level caches will evict dirty lines, from time to time, when they're not being used anymore. In other words, write-backs are lazy.

do modern processors "gather" all writes to the same cache line & handle the actual writing as infrequently (lazily) as possible?
Inasmuch as your question is concerned, modern systems write-coalesce committed data in caches. However, I want to say this fairly precisely. The memory order is strictly defined by the memory consistency model. The cache coherence protocol plays some role in this -- namely write exclusivity. But the combined effect of write-coalescing, in this sense, is the product of store buffers, caches, write-buffers, and the cache coherence protocol.

In short, it's pretty hard to get it right (hard enough that there's still a ton of debate over how to do it correctly -- e.g., Intel's errata documents are chock full of concurrency bugs). There are a lot of corner cases.
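The coalescing effect described above -- many committed stores to one line, one eventual write-back -- can be sketched like this (toy model, not real hardware):

```python
# Toy illustration of write coalescing: many committed stores to the
# same cache line produce a single write-back, not one DRAM write each.

class WriteBackLine:
    def __init__(self):
        self.data = [0] * 64       # one 64-byte line
        self.dirty = False
        self.writebacks = 0

    def store(self, offset, byte):
        self.data[offset] = byte   # coalesce into the cached line
        self.dirty = True

    def evict(self, dram, base):
        if self.dirty:
            dram[base] = list(self.data)   # one write-back for the whole line
            self.writebacks += 1
            self.dirty = False

dram = {}
line = WriteBackLine()
for i in range(64):
    line.store(i, i)           # 64 separate stores...
line.evict(dram, 0x1000)
print(line.writebacks)         # 1: ...coalesced into a single write-back
```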
 
dinkumthinkum: yeah, already tried to read some of those. But I'm pretty far from being a CS guy, so there were many remaining points of confusion.... actually I refer back to intel's billion page manuals in my response below, lol


Hm I'm confused about this store buffer business. So processor A is ticking along and it computes X and ST is issued on X. So X hops into A's store buffer (but say not at the head b/c there's a bunch of other stuff in there). 1 tick later, processor B, which is working with the same data set as A, reads the location X is going to. But procB does not see X... or does it??

Does hardware avoid this situation automatically? If the stuff in the store buffer falls under the same coherency protocol (MESI? That's the only one I know about) as the cache (I THINK it does), then it sounds like the hardware takes care of business. If not, I'm confused.

I think your phrase "processor consistency" addresses this issue, but I have no idea what that means. And my noob-ness is making it hard to search... actually this thread comes up as 1st or 2nd hit pretty often :/

That said, where does getting say a data lock (in a threading environment) in software come into play? Or a fence (i think that's the right word)/barrier (in MPI terminology)? I guess what I'm getting at is what does the CPU cache circuitry do for me, what do I do on my own, and when do (if they do at all) these overlap?

Intel chips pre-nehalem were write-through write-invalidate L1-Ds -- their store buffers drained into the private L2 caches.

Hm this document:
download.intel.com/design/intarch/datashts/320670.pdf
claims that the Core 2 duo (penryn i believe) has a write-back L1D. Additionally, I'm pretty sure Core 2 duos/quads don't have private L2 cache; aren't they all shared across pairs of cores?

Inasmuch as your question is concerned, modern systems write-coalesce committed data in caches.

So it seems like nehalem (i7 & descendants) does *not* have a write combining buffer. Any idea why? Is it due to coherency issues? B/c it seems WC buffers in all previous intel processors weren't "snooped." So they lived outside of the MESI protocol?

But if there isn't a write-combining buffer, how on earth can writes be coalesced? (Citing: http://www.intel.com/Assets/PDF/manual/253668.pdf)
Not that I doubt that write-combining is going on. As you said, it would be damn expensive to skip it. I think it's been around since the Pentium/II and K6-3 days. But I wanted to do some fact-checking b/c I can't find much information about this stuff online. The WC buffer seemed like a logical place for combining/coalescing writes. Since nehalem doesn't have one... how does that work?

Getting back to your post, what do you mean by "memory order"?

I feel like every month or two, I spawn one of these threads & you end up spending a bunch of time answering my questions, lol. So thanks again for all your help.
 
I feel like every month or two, I spawn one of these threads & you end up spending a bunch of time answering my questions, lol. So thanks again for all your help.
No problem. I like sharing what I know. This is what forums are for.

Getting back to your post, what do you mean by "memory order"?
This question underlies all the other questions.

The following is the most useful thing I will probably ever read on memory consistency models. Most people don't even know they're there... but read it. It's really interesting.
www.hpl.hp.com/techreports/Compaq-DEC/WRL-95-7.pdf

Teaser:
On x86 CPUs, the following is completely legal and can happen:
Assume all values in memory are zero unless otherwise written in this example.
Code:
PROC 0:                  PROC 1:
ST X <- 99               ST Y <- 99
LD R1 <- Y               LD R2 <- X
Both P0's R1 and P1's R2 can read the value zero.
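A toy simulation of two cores with store buffers (illustrative Python, not real hardware) reproduces exactly this outcome -- each store is still sitting in its own core's buffer when the other core's load executes:

```python
# Toy store-buffering litmus test: both loads can read the old value 0.

class Core:
    def __init__(self, memory):
        self.memory = memory
        self.buf = []                       # pending (addr, value) stores

    def store(self, addr, value):
        self.buf.append((addr, value))      # sits in the buffer for now

    def load(self, addr):
        for a, v in reversed(self.buf):     # forward from own buffer only
            if a == addr:
                return v
        return self.memory.get(addr, 0)

mem = {'X': 0, 'Y': 0}
p0, p1 = Core(mem), Core(mem)

p0.store('X', 99)   # P0: ST X <- 99   (still in P0's buffer)
p1.store('Y', 99)   # P1: ST Y <- 99   (still in P1's buffer)
r1 = p0.load('Y')   # P0: LD R1 <- Y   (cannot see P1's buffer: reads 0)
r2 = p1.load('X')   # P1: LD R2 <- X   (cannot see P0's buffer: reads 0)
print(r1, r2)       # 0 0 -- both loads see the old values
```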

Hm this document ... claims that the Core 2 duo (penryn i believe)
My bad -- it must have been penryn that started having WB L1s. But I know it's a recent thing for intel -- they used to just write-through to L2.

Hm I'm confused about this store buffer business. So processor A is ticking along and it computes X and ST is issued on X. So X hops into A's store buffer (but say not at the head b/c there's a bunch of other stuff in there). 1 tick later, processor B, which is working with the same data set as A, reads the location X is going to. But procB does not see X... or does it??
In your example, processor B reads the old value of X. This is called processor consistency, and it is the correct operation of x86 multithreading. I know it sounds difficult -- and it is.

Does hardware avoid this situation automatically? If the stuff in the store buffer falls under the same coherency protocol (MESI? That's the only one I know about) as the cache (I THINK it does), then it sounds like the hardware takes care of business. If not, I'm confused.
Some chips do, say the MIPS R10k -- which uses the sequential consistency model instead of processor consistency.
The thing to understand is that before stores reach the head of the store buffer, the coherence mechanism doesn't even know those stores exist. Store buffer entries aren't part of the protocol.
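As a toy illustration of that point -- a core forwards from its own store buffer, but buffered stores are invisible to everyone else until they commit:

```python
# Toy model of per-core store buffers: a core snoops its own buffer
# (so it sees its own stores), but other cores do not.

class Core:
    def __init__(self, memory):
        self.memory = memory
        self.store_buffer = []          # (addr, value) pairs, oldest first

    def store(self, addr, value):
        self.store_buffer.append((addr, value))

    def load(self, addr):
        # snoop own store buffer first; youngest matching entry wins
        for a, v in reversed(self.store_buffer):
            if a == addr:
                return v
        return self.memory.get(addr, 0)

    def drain(self):
        # commit the buffer to memory in FIFO order
        for a, v in self.store_buffer:
            self.memory[a] = v
        self.store_buffer.clear()

mem = {}
p0, p1 = Core(mem), Core(mem)
p0.store('X', 99)
print(p0.load('X'))  # 99: p0 sees its own buffered store
print(p1.load('X'))  # 0:  p1 cannot see p0's store buffer
p0.drain()
print(p1.load('X'))  # 99: visible only after the store commits
```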

That said, where does getting say a data lock (in a threading environment) in software come into play? Or a fence (i think that's the right word)/barrier (in MPI terminology)? I guess what I'm getting at is what does the CPU cache circuitry do for me, what do I do on my own, and when do (if they do at all) these overlap?

Great questions. First of all, locks on non-sequentially-consistent systems are implemented with hardware atomic operations. Hardware atomics usually force the store buffer to fully drain before they execute, so the ordering of atomics is consistent across all threads.

Fences exist in hardware as explicit instructions (not just as MPI concepts) -- essentially a fence forces the store buffer to drain, making all outstanding stores visible to other threads.
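A minimal sketch of that draining behavior (toy Python model; `fence` here just commits the buffer, standing in for something like x86 `MFENCE`):

```python
# Toy sketch of a fence: it drains the store buffer, so all earlier
# stores become globally visible before execution continues.

class Core:
    def __init__(self, memory):
        self.memory = memory
        self.buf = []                   # pending (addr, value) stores

    def store(self, addr, value):
        self.buf.append((addr, value))

    def fence(self):
        for a, v in self.buf:           # commit all pending stores in order
            self.memory[a] = v
        self.buf.clear()

mem = {'X': 0}
p0, p1 = Core(mem), Core(mem)
p0.store('X', 99)
print(p1.memory.get('X'))   # 0:  store still buffered inside p0
p0.fence()
print(p1.memory.get('X'))   # 99: the fence made the store visible to p1
```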

But if there isn't a write cache buffer, how on earth can writes be coalesced ... ?

You just have to be careful about terminology. Writes are coalesced, but only once they have been properly sequenced by the commit logic (i.e., the store buffer + coherence protocol). It is correct to coalesce writes in caches, in the order in which stores are satisfied from the store buffer. It's just not correct to coalesce multiple stores into one, nor to let stores violate program order (that's actually OK under even weaker consistency models, though).
 