Back when the Brisbane cores came out, everybody was reporting drastic increases in L2 latency. I knew the real latency was not 20 cycles, so I wondered what was going on. (I also wondered whether the memory latency increase with the Phenom TLB patch was real; I don't know the answer.) I recently wrote my own simple measurement app, and it reported 20 cycles too. I looked at what CPU-Z's latency measurement program does (it also gives 20 cycles), and it effectively does the same thing as my program: an rdtsc, a loop with "mov eax, [eax]" unrolled a bunch of times, then another rdtsc.
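For anyone who hasn't seen this kind of test, here is a rough C sketch of the naive approach: link a buffer into a ring of pointers, then time a long run of dependent loads. This is not my actual program. The names build_ring(), chase() and ns_per_load() are made up for illustration, and it times with the standard clock() instead of rdtsc so that it stays portable, which means it reports nanoseconds rather than cycles.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <time.h>

#define STRIDE 64  /* bytes between elements, one per 64-byte cache line */

/* Link the elements of buf into a ring: element i holds the address of
   element i+1, and the last element points back to the first. */
void build_ring(char *buf, size_t count)
{
    assert(count > 0);
    for (size_t i = 0; i < count; i++) {
        void **slot = (void **)(buf + i * STRIDE);
        *slot = buf + ((i + 1) % count) * STRIDE;
    }
}

/* Follow the chain for `steps` loads and return the final pointer.
   Each load's address comes from the previous load, so the loads are
   serialized. This is the C equivalent of repeating "mov eax, [eax]";
   the volatile access keeps the compiler from removing the loads. */
void *chase(void *start, size_t steps)
{
    void *p = start;
    while (steps--)
        p = *(void *volatile *)p;
    return p;
}

/* Average nanoseconds per load, timed with clock() (an assumption made
   for portability; the real tests bracket the loop with rdtsc). */
double ns_per_load(void *start, size_t steps)
{
    clock_t t0 = clock();
    void *end = chase(start, steps);
    clock_t t1 = clock();
    (void)end;
    return (double)(t1 - t0) * 1e9 / CLOCKS_PER_SEC / (double)steps;
}
```

To use it, allocate count * STRIDE bytes with malloc(), call build_ring(), then call ns_per_load() with a step count in the millions. Multiplying the result by the clock frequency in GHz gives an approximate cycle count.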
I came across this (fairly old) analysis of the K7 and K8 L2 caches and wondered if I could use that information to get the right answer on Brisbane.
I couldn't come up with a clean way to run tests with a variable number of delay slots, so I ended up generating the code dynamically: I create a buffer, write x86 instructions into it, then jump into that buffer to run the test. With this setup, I got the correct result: 14 cycles on my CPU (with 6 delay slots). I don't think the results are as interesting on Intel CPUs, but I don't have anything more recent than a 400MHz Celeron handy 😉.
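To give a feel for the code generation, here is a minimal sketch that only writes the instruction bytes into a buffer. It is not the code from setupTest(): emit_load_chain() is a hypothetical name, and filling each delay slot with a one-byte nop is my assumption for the sake of the example. The encodings themselves are standard 32-bit x86 (8B 00 is "mov eax, [eax]", 90 is nop, C3 is ret).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical emitter: writes `loads` copies of "mov eax, [eax]", each
   followed by `delay_slots` nops, then a final ret.
   Returns the number of bytes written, or 0 if the buffer is too small. */
size_t emit_load_chain(uint8_t *out, size_t cap, size_t loads, size_t delay_slots)
{
    size_t need = loads * (2 + delay_slots) + 1;
    if (need > cap)
        return 0;

    size_t n = 0;
    for (size_t i = 0; i < loads; i++) {
        out[n++] = 0x8B;              /* mov eax, [eax] (opcode) */
        out[n++] = 0x00;              /* ModRM: reg = eax, r/m = [eax] */
        for (size_t d = 0; d < delay_slots; d++)
            out[n++] = 0x90;          /* nop (assumed delay-slot filler) */
    }
    out[n++] = 0xC3;                  /* ret */

    assert(n == need);
    return n;
}
```

This sketch stops short of running anything. To actually execute the bytes, the buffer has to live in executable memory, eax has to hold the starting pointer before the jump, and a 64-bit build would need the REX.W form (48 8B 00, "mov rax, [rax]") so the full pointer is loaded.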
My program walks the test buffer 3 different ways: sequentially, randomly, and as an interleaving of an upwards and a downwards sequential walk (e.g. 1, 10, 2, 9, 3, 8, 4, 7, etc). The elements are at 64-byte intervals (so if the code runs on a CPU with 128-byte lines, it would most likely give the wrong answer).
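The three walk orders can be sketched as index arrays, where each entry is an element number and the element's byte offset is that number times 64. This is an illustration, not the code from my program: the function names are invented, and using a Fisher-Yates shuffle with rand() for the random walk is an assumption.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Sequential walk: 0, 1, 2, ..., n-1. */
void order_sequential(size_t *order, size_t n)
{
    for (size_t i = 0; i < n; i++)
        order[i] = i;
}

/* Interleaved walk: alternate between an upward walk from the start and
   a downward walk from the end (0, n-1, 1, n-2, ...). */
void order_interleaved(size_t *order, size_t n)
{
    size_t lo = 0;   /* next element of the upward walk */
    size_t hi = n;   /* one past the next element of the downward walk */
    for (size_t i = 0; i < n; i++)
        order[i] = (i % 2 == 0) ? lo++ : --hi;
}

/* Random walk: a Fisher-Yates shuffle of the sequential order, so every
   element is still visited exactly once (assumed approach). */
void order_random(size_t *order, size_t n)
{
    order_sequential(order, n);
    for (size_t i = n; i > 1; i--) {
        size_t j = (size_t)rand() % i;
        size_t tmp = order[i - 1];
        order[i - 1] = order[j];
        order[j] = tmp;
    }
}
```

With n = 10, order_interleaved() produces 0, 9, 1, 8, 2, 7, 3, 6, 4, 5, which is the zero-based version of the 1, 10, 2, 9, ... pattern. To turn an order into a test, each element would store the address of the element that follows it in the order, with the last one pointing back to the first.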
If anyone is interested, the 64-bit version is here, and the 32-bit version is here. I compile it with "gcc -Wall -O0 -save-temps -olatency latency.c" (-O0 is probably not needed any more; a previous version of the code hit what I'd argue is a gcc bug, and -save-temps is not needed). It will not work with a Microsoft compiler because the assembly syntax is different (read: better... gcc's syntax sucks). It will compile with Cygwin gcc (probably any win32 gcc), though.
It takes about an hour to run to completion on my machine. If you reduce the iterations on line 51 it will finish faster. You can also decrease the range of the loop on line 55 (e.g. eliminate the larger sizes) to speed it up.
If it prints "crap, no go 🙁", comment out both mprotect() calls. If the 32-bit version crashes, uncomment both mprotect() calls. The dynamic code generation happens in the function setupTest().
Hopefully someone finds it interesting. I'd love to do some experiments on a Phenom or Barcelona, but don't have easy access to any.