Somewhat-technical question about Core 2 and P4

CTho9305

Elite Member
Jul 26, 2000
9,214
1
81
Does anyone happen to know whether Core 2 or P4 do write-combine stores out of order with respect to loads? If you do a bunch of write-combine stores (e.g. MOVNTQ?) and then a load that accesses data just written, can it ever return stale data (i.e. leaving out sfence could be bad), or does it behave functionally like it's in-order (i.e. sfence doesn't really matter; it'll return data from the WC buffers rather than the cache anyway)?
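In intrinsics form, the scenario I have in mind is roughly this (just a sketch to pin down the question, not a real test harness):

#include <xmmintrin.h>  /* _mm_stream_pi (MOVNTQ), _mm_sfence */
#include <mmintrin.h>   /* __m64, _mm_cvtsi32_si64, _mm_empty */

void wc_store_then_load(long long *p)
{
    __m64 v = _mm_cvtsi32_si64(0x12345678);
    _mm_stream_pi((__m64 *)p, v);  /* MOVNTQ: non-temporal store into a WC buffer */
    /* _mm_sfence(); */            /* <-- the question: is this needed here?      */
    long long x = *p;              /* plain load of the just-written address:     */
    (void)x;                       /* can this ever observe the old data?         */
    _mm_empty();                   /* EMMS: clear MMX state                       */
}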
 

Martimus

Diamond Member
Apr 24, 2007
4,490
157
106
Did you look at the stack and find something that you didn't expect? I can't imagine why the processor would pull data out of order, but then I have never written code for an x86 processor at that level (I have only worked with microcontrollers - Motorola and Hitachi). I just don't understand why they would pull data out of order, if it is indeed designed to do that.

edit: I am really interested in what you are doing that led you to ask that question. It has been close to 6 years since I have written anything at that low a level, so this really intrigues me. It makes me want to pull out my old FORTH and Microprocessors books.
 

CTho9305

Elite Member
Jul 26, 2000
9,214
1
81
Originally posted by: Martimus
Did you look at the stack and find something that you didn't expect?

Nope. I'm wondering what I should expect. I haven't figured out how to write something to test this reliably yet.

I can't imagine why the processor would pull data out of order, but then I have never written code for an x86 processor at that level (I have only worked with microcontrollers - Motorola and Hitachi). I just don't understand why they would pull data out of order, if it is indeed designed to do that.

There are certain cases where the programmer knows things about his program that could potentially allow the hardware to take shortcuts, improving performance. If a program is writing out a large amount of data and knows it isn't going to read that data for a while, there are two ways the hardware could handle it. The normal behavior would be:
1) The program writes the first 8 bytes
1A) The CPU misses in its cache and fetches the correct 64-byte line into the cache. There is bus traffic involved, sometimes even if the data was already in the L2 cache (due to coherency protocol requirements)
1B) The write occurs
2) The program writes the next 8 bytes
2A) The CPU hits in its cache (because of 1A) and writes
After the first 64 bytes are written, the cycle repeats (another miss, then a bunch of quick hits). As part of 1A, something else gets evicted from the cache.
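In code, that "normal behavior" is just what a plain store loop gets you; e.g. (a sketch):

/* Plain copy loop: each 64-byte line of dst is first read into the
   cache (step 1A) just so it can be overwritten (steps 1B, 2, 2A). */
void copy_plain(long long *dst, const long long *src, long n)
{
    for (long i = 0; i < n; i++)
        dst[i] = src[i];  /* 8-byte stores: one miss, then seven hits per line */
}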

This sucks for two major reasons:
1) If the program is writing all the data in a 64-byte line, reading the old data into the cache is a waste of bandwidth. If your goal is just to copy a lot of data (e.g. what memcpy does), you now read 2x as much as you really need to. memcpy is an example of where the bandwidth waste becomes really huge (an application where there are calculations going on might be using less bandwidth and more computation time).
2) Every time a line of data is brought into the cache, something else gets kicked out. Remember the programmer knows this data isn't going to be used for a while, so basically data in the cache that's useful is being replaced with data that doesn't need to be in the cache. Anything else that's happening, or whatever happens after this writing, will be slowed down since it will have to bring its data back into the cache.

You could solve #1 by telling the CPU, "Hey, don't bother fetching the old data, I'm going to write the whole cache line." You can solve #2 by having a dedicated buffer, separate from the cache, that is used for this type of operation. "Write combine" is the term for the solution that addresses both (dedicated buffers for writing entire lines of data at a time, which don't load the old data before the writes start).
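For example, a streaming copy using these special instructions might look like this (a sketch, assuming 8-byte-aligned buffers; MOVNTQ is the 8-bytes-at-a-time store):

#include <xmmintrin.h>
#include <mmintrin.h>

/* Streaming copy: the stores go straight to the WC buffers, filling
   whole 64-byte lines, without reading the old data in (fixes #1) and
   without evicting anything useful from the cache (fixes #2). */
void copy_stream(long long *dst, const long long *src, long n)
{
    for (long i = 0; i < n; i++)
        _mm_stream_pi((__m64 *)&dst[i], *(const __m64 *)&src[i]);
    _mm_sfence();  /* force the buffered stores to become globally visible */
    _mm_empty();   /* EMMS: clear MMX state */
}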

So, since the P3 (I think? I believe it was introduced with SSE1... not sure if 3DNow! had it first), x86 processors have supported this. The buffers are accessed with special instructions (or by setting some processor registers to certain values). You can access the same memory location with both regular instructions and special instructions. Now, here's the tricky part: what happens if a program reads address 0x1234ABCD (to bring it into the cache), and then writes 32 bytes (half of a cache line) to addresses 0x1234ABCD-0x1234ABEC with a special instruction (so the data goes to the write combine buffer)... then reads address 0x1234ABDD (an address which just had data written to it) with a regular instruction? Does the processor return the old data that's in the data cache, or the new data that's in the special buffers? What happens if it's a different CPU making the read request? The instruction set specification allows the CPU to return the old data that's in the cache, unless the program explicitly says, "I'm done writing to the special buffer - force what I just wrote to become globally visible" by using a "fence" instruction.
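As code, the tricky case looks something like this (a sketch; "line" stands in for the cache line at 0x1234ABCD in the example):

#include <xmmintrin.h>
#include <mmintrin.h>

long long probe(long long *line)
{
    long long old = line[0];  /* regular load: the line is now in the data cache */
    __m64 v = _mm_cvtsi32_si64(-1);
    _mm_stream_pi((__m64 *)&line[0], v);  /* 32 bytes (half the line) of NT  */
    _mm_stream_pi((__m64 *)&line[1], v);  /* stores: they go to a WC buffer, */
    _mm_stream_pi((__m64 *)&line[2], v);  /* not into the cache              */
    _mm_stream_pi((__m64 *)&line[3], v);
    /* deliberately no sfence here */
    long long now = line[2];  /* regular load of a just-written address (0x1234ABDD):
                                 old cached data, or new WC-buffer data? */
    _mm_empty();
    (void)old;
    return now;
}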

I'm asking whether these CPUs actually work that way, or whether they always return the most recent data (i.e. from the write combine buffers).

edit: I am really interested in what you are doing that led you to ask that question. It has been close to 6 years since I have written anything at that low a level, so this really intrigues me. It makes me want to pull out my old FORTH and Microprocessors books.

Talking to processor architects about older CPUs, and wondering about new ones.

Originally posted by: btcomm1
Sure, just make a right at the old cow barn.

Thanks for the thread crap.
 

Acanthus

Lifer
Aug 28, 2001
19,915
2
76
ostif.org
The guys over at Highly Technical eat this kind of question up, CTho. You'll more than likely be able to get an answer there.
 

Lord Banshee

Golden Member
Sep 8, 2004
1,495
0
0
To me it seems like it would be a flawed CPU design if it read the old values after you had already stored newer values. A couple of things should take care of this: hazard detection, forwarding data between different pipeline stages, and NOPs and stalls. I do not know the real answer (never looked at x86 architecture "yet") and my CPU design experience is purely on an older MIPS, but if I were writing special instructions like this I would surely make sure they hooked into the hazard detection unit somehow.
 

Martimus

Diamond Member
Apr 24, 2007
4,490
157
106
I am nearly positive that Intel chips do not let any other cores access data in the L2 cache that another core is processing, as this could cause errata if the two cores work on the same data and one uses an obsolete value that is still posted in the L2 cache because it hasn't been updated yet.

This is exactly what the AMD TLB erratum is, as far as I understand it. If the architecture worked the way it was originally designed to, the TLB erratum would not exist; but since the L3 cache runs at a lower frequency than the CPU, not higher as originally designed, it runs into this problem. Using the same example that you gave, what can happen is that the core that changes the value in the L3 cache sends a command to write to the address, and clears the bit protecting that address, saying that it has been changed to the new value. Another core tries to access that data on the next clock cycle, but the memory has not updated yet, because the L3 was running at a slightly lower clock speed, had finished a clock cycle just before the command to update the address arrived, and has not reached its next clock cycle to do the update before the second core's access.

That means this erratum is more likely the more the CPU clock speed differs from the L3 clock speed. Fixing it on current steppings would mean increasing the L3 cache speed or lowering the CPU clock speed. The fact that this made it into the design makes sense too, since AMD had planned to make the L3 clock speed much higher than the CPU clock speed, which would completely alleviate this problem. Also, this approach is faster than checking to make sure the address was changed before accessing it, since if the architecture worked as intended (higher L3 clock than CPU clock) the address would always have been changed first.
 

CTho9305

Elite Member
Jul 26, 2000
9,214
1
81
edit: It looks like the architecture may require that accesses to the same address read the write combine buffer; out-of-order writes could only occur to different addresses.

Original post:
Originally posted by: Lord Banshee
To me it seems like it would be a flawed CPU design if it read the old values after you had already stored newer values. A couple of things should take care of this: hazard detection, forwarding data between different pipeline stages, and NOPs and stalls. I do not know the real answer (never looked at x86 architecture "yet") and my CPU design experience is purely on an older MIPS, but if I were writing special instructions like this I would surely make sure they hooked into the hazard detection unit somehow.


So, what do you think the purpose of the "sfence" instruction is then? From here:
Moves the quadword in the source operand (second operand) to the destination operand (first operand) using a non-temporal hint to minimize cache pollution during the write to memory. The source operand is an MMX technology register, which is assumed to contain packed integer data (packed bytes, words, or doublewords). The destination operand is a 64-bit memory location.

The non-temporal hint is implemented by using a write combining (WC) memory type protocol when writing the data to memory. Using this protocol, the processor does not write the data into the cache hierarchy, nor does it fetch the corresponding cache line from memory into the cache hierarchy. The memory type of the region being written to can override the non-temporal hint, if the memory address specified for the non-temporal store is in an uncacheable (UC) or write protected (WP) memory region.

Because the WC protocol uses a weakly-ordered memory consistency model, a fencing operation such as SFENCE should be used in conjunction with MOVNTI instructions if multiple processors might use different memory types to read/write the memory location.
From here:
void _mm_sfence(void)

(uses SFENCE) Guarantees that every preceding store is globally visible before any subsequent store.
And the best, from here:
Write Combining (WC): System memory locations are not cached (as with uncacheable memory) and coherency is not enforced by the processor's bus coherency protocol. Speculative reads are allowed. Writes may be delayed and combined in the write buffer to reduce memory accesses. The writes may be delayed until the next occurrence of a buffer or processor serialization event, e.g., CPUID execution, a read or write to uncached memory, interrupt occurrence, LOCKed instruction execution, etc. if the WC buffer is partially filled. This type of cache-control is appropriate for video frame buffers, where the order of writes is unimportant as long as the writes update memory so they can be seen on the graphics display. See Section 9.3.1., "Buffering of Write Combining Memory Locations", for more information about caching the WC memory type. The preferred method is to use the new SFENCE (store fence) instruction introduced in the Pentium® III processor. The SFENCE instruction ensures weakly ordered writes are written to memory in order, i.e., it serializes only the store operations.
9.3.1. Buffering of Write Combining Memory Locations
Writes to WC memory are not cached in the typical sense of the word cached. They are retained in an internal buffer that is separate from the internal L1 and L2 caches. The buffer is not snooped and thus does not provide data coherency. The write buffering is done to allow software a small window of time to supply more modified data to the buffer while remaining as nonintrusive to software as possible.
Software developers creating software that is sensitive to data being delayed must deliberately empty the WC buffers and not assume the hardware will.
Once data in the WC buffer has started to be propagated to memory, the data is subject to the weak ordering semantics of its definition. Ordering is not maintained between the successive allocation/deallocation of WC buffers (for example, writes to WC buffer 1 followed by writes to WC buffer 2 may appear as buffer 2 followed by buffer 1 on the system bus). When a WC buffer is propagated to memory as partial writes there is no guaranteed ordering between successive partial writes (for example, a partial write for chunk 2 may appear on the bus before the partial write for chunk 1 or vice versa).

I'm asking whether any CPUs really take advantage of this, or whether they behave as if they don't. I wrote some code yesterday and ran it on a Core (Yonah) and it seemed to behave as if it was not taking advantage of this. Of course, since I'm not really familiar with this stuff, my code may not have done what I was expecting it to.
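Roughly the kind of test I mean (a sketch, not my exact code): NT-store a counter, immediately read it back with a plain load and no sfence, and check whether the load ever returns a stale value.

#include <xmmintrin.h>
#include <mmintrin.h>
#include <stdio.h>

int main(void)
{
    static long long buf[8];       /* one 64-byte line */
    volatile long long *vp = buf;  /* keep the compiler from eliding the load */
    long long stale = 0;
    for (int i = 1; i < 10000000; i++) {
        _mm_stream_pi((__m64 *)&buf[0], _mm_cvtsi32_si64(i));
        long long seen = vp[0];    /* deliberately no sfence in between */
        if ((int)seen != i)
            stale++;               /* did we ever observe old data? */
    }
    _mm_empty();
    printf("stale reads: %lld\n", stale);
    return 0;
}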
 

Lord Banshee

Golden Member
Sep 8, 2004
1,495
0
0
What I was saying has nothing to do with the cache. If this were my pipeline:
{fetch}-{decode}-{exec}-{mem wt/rd}-{wt bk}
fetch = fetch instruction from L1 instruction cache
decode = decode instruction
exec = do ALU stuff
mem wt/rd = store to / read from L1 data cache
wt bk = store to register

Say we had instructions set up like this in the pipeline:
{I5}-{I4}-{I3}-{I2}-{I1}
If I3 required the data from I2, which has not been written back to the cache or a register yet, the data would be forwarded back into the exec stage of the pipeline for I3, so I3 has the correct data to work with.
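For example, the classic forwarding case (in C, just to illustrate the hazard):

int example(int x, int y)
{
    int a = x + y;  /* "I2": produces a in the exec stage */
    int b = a + 1;  /* "I3": needs a before I2 has reached wt bk; the forwarding
                       path routes I2's result from the exec/mem latch straight
                       back into I3's exec stage input */
    return b;
}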

I have no real experience with x86 or real cache systems, so I have no idea how these buffers and the cache stuff all work together (only overview system level, and even that is confusing as hell). So I guess I have no idea how to answer your question, as I do not have a clue lol :) I really need to start looking into x86 stuff, but I have a test to study for so I just can't do it at the moment. If you find the answer please post back.