Determining memory-load delays when designing data prefetching features in hardware/compilers

compuser

Member
Feb 14, 2000
152
0
0
Simple question: given that architectures of the past decade have turned to data prefetching as a viable way to hide memory load delays (especially since CPU:FSB/memory-bus ratios keep increasing), how is prefetching accomplished when those ratios, and therefore the memory load delays, are constantly changing?

To clarify: be it the IPF, x86, PowerPC, or any other family of CPU/chipset hardware, the fact remains that speed bumps and newer, better system designs for a given architecture keep appearing. This means that the time taken to load data from memory, while generally decreasing, changes with practically every CPU speed bump. At the same time, my understanding of prefetching is that the hardware/compiler places prefetch-load instructions at strategic places in the compiled code so as to bring potentially needed data as close as possible to the execution units by the time it is needed.

How, then, can the same binary be just as efficient after, say, a CPU speed bump, when the CPU:FSB/memory-bus ratio has changed? This hardware change has increased the effective memory load delay, which in turn, as I interpret it, means the data may NOT be in the desired location/level of cache/registers by the time it's needed on this newer configuration (faster CPU on the same system board, etc.).

A side question: what's the difference between doing this prefetching in hardware vs. in software/compilers (I guess the latter option may only pertain to IPF/EPIC architectures)? Pros/cons?
 

Sohcan

Platinum Member
Oct 10, 1999
2,127
0
0
To answer your question: hardware prefetch mechanisms don't need to consider memory latency, while compiler support for software prefetch often does.

Take a simple hardware prefetch mechanism, the sequential stream buffer. This is a small buffer placed beside the cache that typically holds four cache lines (usually 64 to 128 bytes per line). When a miss occurs in the cache, the stream buffer fetches the next four sequential cache lines. If a later access misses in the cache but hits in the stream buffer, the line is moved into the cache, reducing the effective miss latency. If an access misses in both the cache and the stream buffer, the stream buffer is flushed and begins fetching again from the new miss address. A multiway stream buffer with, for example, four sets of four cache lines each can reduce instruction cache misses by 72% and data cache misses by 43%. Pros: does very well with sequential streams. Cons: extra hardware, it's a general one-size-fits-all solution, and it can increase bandwidth consumption.
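To make the mechanics concrete, here's a toy C simulation of a 4-entry sequential stream buffer. The line size, buffer depth, and the trivially simplified cache model are illustrative assumptions on my part, not a description of any shipping design:

```c
/* Toy model of a 4-entry sequential stream buffer (illustrative only). */
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

#define LINE_BYTES  64   /* assumed cache line size */
#define BUF_ENTRIES 4    /* assumed buffer depth    */

static uint64_t stream_buf[BUF_ENTRIES]; /* line numbers currently held */
static int      buf_valid = 0;

/* Refill the buffer with the BUF_ENTRIES lines following 'line'. */
static void refill(uint64_t line)
{
    for (int i = 0; i < BUF_ENTRIES; i++)
        stream_buf[i] = line + 1 + i;
    buf_valid = 1;
}

/* Returns true if the access was serviced by the stream buffer. */
static bool access_line(uint64_t addr, bool cache_hit)
{
    uint64_t line = addr / LINE_BYTES;
    if (cache_hit)
        return false;              /* cache serviced it; buffer idle */
    if (buf_valid) {
        for (int i = 0; i < BUF_ENTRIES; i++) {
            if (stream_buf[i] == line) {
                /* Hit: the line moves into the cache and the buffer
                 * keeps streaming ahead (simplified: refill from here). */
                refill(line);
                return true;
            }
        }
    }
    /* Miss in both cache and buffer: flush and start a new stream. */
    refill(line);
    return false;
}

int main(void)
{
    /* Sequential walk: after the first miss, every access should
     * hit in the stream buffer in this toy model. */
    for (uint64_t a = 0; a < 8 * LINE_BYTES; a += LINE_BYTES)
        printf("addr %4llu -> %s\n", (unsigned long long)a,
               access_line(a, false) ? "stream-buffer hit" : "miss");
    return 0;
}
```

Note how the whole thing is driven purely by miss addresses; nowhere does it need to know the memory latency, which is why the mechanism keeps working across speed bumps.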

Software techniques require instruction set and compiler support, and come in two forms: prefetch instructions and memory reference speculation. Prefetch instructions can fetch into either registers or the cache (usually the latter), but are most effective when they are semantically invisible. This means they can't change the contents of registers or memory, and must not cause virtual memory exceptions on a page miss; otherwise they would affect the execution of the program. Prefetch instructions are likely most useful in tight inner loops, so while memory latency can determine their effectiveness, you may be able to prefetch data from iterations far enough ahead to make it effective. Pros: you can hand-tune the occasions where prefetching is desired, and it can be made very effective. Cons: requires instruction set support, adds instructions, and can pollute the cache with undesired data.
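Here's a sketch of what that looks like in a tight inner loop, using GCC/Clang's `__builtin_prefetch` hint. The prefetch distance is exactly the latency-dependent tunable the original poster is asking about; the cycle counts and the distance of 16 below are illustrative assumptions, not measured values:

```c
/* Software prefetching in a tight loop: a minimal sketch.
 * With, say, ~200 cycles of load latency and a loop body of
 * ~10 cycles, you'd want to fetch on the order of 20 iterations
 * ahead; a new implementation with a different CPU:memory ratio
 * would want this retuned (and hence a recompile to be optimal). */
#include <stddef.h>

#define PF_DIST 16   /* iterations ahead; illustrative value */

double sum_array(const double *a, size_t n)
{
    double sum = 0.0;
    for (size_t i = 0; i < n; i++) {
        if (i + PF_DIST < n)
            /* Non-binding hint: read access (0), moderate temporal
             * locality (1). If the page isn't mapped, the hint is
             * simply dropped -- it stays semantically invisible. */
            __builtin_prefetch(&a[i + PF_DIST], 0, 1);
        sum += a[i];
    }
    return sum;
}
```

If the distance is too small for the new machine, the data just isn't there yet and you eat part of the miss; if it's too large, you risk evicting the line before use. Either way the binary is still correct, only less optimal, which answers the "same binary after a speed bump" question.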

The other method, memory reference speculation, is a bit different. While software prefetching inserts instructions to trigger cache misses early, load speculation attempts to "hoist" a load to a point earlier in the program. This causes a number of difficulties. If the compiler moves the load past a store, the memory location might hold a different value at the load's new position than the program intends. In addition, moving the load can change the address it loads from, since load instructions form their addresses from register contents, and those registers may be overwritten in between. Speculation becomes especially tricky if the load is moved above a branch. So even if a compiler moves a load because it thinks doing so is safe, there still needs to be hardware support to verify it at runtime.
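A small (hypothetical) C example shows why hoisting past a store is speculative rather than safe:

```c
/* Why hoisting a load above a store needs runtime verification. */
int original(int *p, int *q)
{
    *q = 42;          /* store */
    int x = *p;       /* load the compiler would like to hoist */
    return x + 1;
}

/* The hoisted version the compiler wants, to hide load latency.
 * It is only correct if p and q never alias: if p == q, x holds
 * the stale pre-store value and the program is wrong. The compiler
 * usually cannot prove non-aliasing, which is where the hardware
 * check described below comes in. */
int hoisted(int *p, int *q)
{
    int x = *p;       /* speculative load, moved above the store */
    *q = 42;
    return x + 1;
}
```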

IA-64 registers have a poison bit at the end, making them actually 65 bits wide. A speculative load that would fault sets this poison bit on its destination register instead of raising the fault, and if any non-speculative instruction then tries to use that register, it causes a program exception. Itanium implementations have what's called an advanced load address table (ALAT), which records the addresses of speculative loads. If a store changes the memory location before the check instruction that replaced the original load executes, the speculation has failed and the loaded result is tossed out; if the check succeeds, the loaded result can be used. And just like software prefetch instructions, speculative loads cannot cause virtual memory exceptions that would change the result of a correct program. Speculative loads are likely used more sparingly, and their pros and cons are much like those of prefetch instructions. Memory and cache latency certainly have a large effect, so recompiling for a new implementation will improve their effectiveness.
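To show the ALAT semantics without dropping into IA-64 assembly, here is a plain-C emulation of the advanced-load / store-snoop / check sequence. Everything here (the one-entry table, the function names) is my own illustrative construction; real code would use the ld.a and chk.a instructions and the hardware table:

```c
/* Conceptual emulation of IA-64's ALAT-style data speculation. */
#include <stdbool.h>
#include <stddef.h>

static const int *alat_addr;   /* address tracked for the advanced load */
static int        alat_value;  /* speculatively loaded value            */

static void advanced_load(const int *p)    /* roughly ld.a */
{
    alat_addr  = p;
    alat_value = *p;
}

static void tracked_store(int *p, int v)   /* stores snoop the table */
{
    if (p == alat_addr)
        alat_addr = NULL;      /* conflict: invalidate the entry */
    *p = v;
}

static int check_load(const int *p)        /* roughly chk.a + recovery */
{
    if (p == alat_addr)
        return alat_value;     /* speculation held: use loaded value */
    return *p;                 /* speculation failed: redo the load  */
}

int example(int *p, int *q)
{
    advanced_load(p);          /* load hoisted above the store */
    tracked_store(q, 42);      /* may or may not alias p       */
    return check_load(p) + 1;  /* correct either way           */
}
```

The payoff is that the hoisted load is correct whether or not p and q alias; the check either confirms the speculation or quietly redoes the load.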

* not speaking for Intel Corp. *