Originally posted by: Martimus
Did you look at the stack and find something that you didn't expect?
Nope. I'm wondering what I should expect. I haven't figured out how to write something to test this reliably yet.
I can't imagine why the processor would pull data out of order, but then I have never written code for an x86 processor at that level (I have only worked with microcontrollers - Motorola and Hitachi). I just don't understand why it would pull data out of order, if it is indeed designed to do that.
There are certain cases where the programmer knows things about their program that could potentially allow the hardware to take shortcuts, improving performance. Suppose a program is writing out a large amount of data and knows it isn't going to read that data back for a while. The normal behavior would be:
1) The program writes the first 8 bytes
1A) The CPU misses in its cache, and fetches the correct 64 byte line into its cache. There is bus traffic involved, sometimes even if the data was already in the L2 cache (due to coherency protocol requirements)
1B) The write occurs
2) The program writes the next 8 bytes
2A) The CPU hits in its cache (because of 1A) and writes
After the first 64 bytes are written, the cycle repeats (another miss, then a bunch of quick hits). As part of 1A, something else gets evicted from the cache.
This sucks for two major reasons:
1) If the program is writing all the data in a 64 byte line, reading the old data into the cache is a waste of bandwidth. If your goal is just to copy a lot of data (e.g. what memcpy does), you now read 2x as much as you really need to. memcpy is an example of where the bandwidth waste becomes really huge (an application where there's calculations going on might be using less bandwidth and more computation time).
2) Every time a line of data is brought into the cache, something else gets kicked out. Remember the programmer knows this data isn't going to be used for a while, so basically data in the cache that's useful is being replaced with data that doesn't need to be in the cache. Anything else that's happening, or whatever happens after this writing, will be slowed down since it will have to bring its data back into the cache.
You could solve #1 by telling the CPU, "Hey, don't bother fetching the old data, I'm going to write the whole cache line." You can solve #2 by having a dedicated buffer, separate from the cache, that is used for this type of operation. "Write combine" is the term used for the solution that addresses both (dedicated buffers for writing entire lines of data at a time, which don't load old data before the writes start).
So, since the P3 (I think? I think it was introduced with SSE1... not sure if 3DNow! had it first), x86 processors have supported this. The buffers are accessed with special instructions (or by setting some processor registers to certain values). You can access the same memory location with both regular instructions and special instructions. Now, here's the tricky part: what happens if a program reads address 0x1234ABCD (to bring it into the cache), then writes 32 bytes (half of a cache line) starting at address 0x1234ABCD with a special instruction (so the data goes to the write combine buffer)... and then reads address 0x1234ABDD (an address which just had data written to it) with a regular instruction? Does the processor return the old data that's in the data cache, or the new data that's in the special buffers? What happens if it's a different CPU making the read request? The instruction set specification allows the CPU to actually return the old data that's in the cache, unless the program explicitly says, "I'm done writing to the special buffer - force what I just wrote to become globally visible" by using a "fence" instruction.
I'm asking whether these CPUs actually work that way, or whether they always return the most recent data (i.e. from the write combine buffers).
edit: I am really interested in what you are doing that led you to ask that question. It has been close to 6 years since I have written anything at that low a level, so this really intrigues me. It makes me want to pull out my old FORTH and Microprocessors books.
Talking to processor architects about older CPUs, and wondering about new ones.
Originally posted by: btcomm1
Sure, just make a right at the old cow barn.
Thanks for the thread crap.