
Measuring cache latency with dynamically generated code

CTho9305

Elite Member
Back when the Brisbane cores came out, everybody was reporting drastic increases in L2 latency, and since I knew the latency was not 20 cycles, I wondered what was going on. I was also wondering whether the memory latency increase with the Phenom TLB patch was real (I don't know the answer). I recently wrote my own simple measurement app, which reported 20 cycles. I looked at what cpu-z's latency measurement program was doing (it also gives 20 cycles), and it effectively does the same thing as my program: an rdtsc, a loop with "mov eax, [eax]" unrolled a bunch of times, then another rdtsc.

I came across this (fairly old) analysis of the K7 and K8 L2 caches and wondered if I could use that information to get the right answer on Brisbane.

I couldn't come up with a clean way to run tests with a variable number of delay slots, so I ended up dynamically generating the code: I create a buffer, write x86 instructions into it, then jump into that buffer to run the test. With this setup, I got the correct result: 14 cycles on my cpu (with 6 delay slots). I don't think the results are as interesting on Intel cpus, but I don't have anything more recent than a 400MHz Celeron handy 😉.

My program walks the test buffer 3 different ways: sequentially, randomly, and as an interleaving of an upward and a downward sequential walk (e.g. 1, 10, 2, 9, 3, 8, 4, 7, etc.). The elements are at 64-byte intervals (so if the code runs on a cpu with 128-byte lines, it would most likely give the wrong answer).

If anyone is interested, the 64-bit version is here, and the 32-bit version is here. I compile it with "gcc -Wall -O0 -save-temps -olatency latency.c" (-O0 is probably not needed any more; a previous version of the code hit what I'd argue is a gcc bug; -save-temps is not needed). It won't work with a Microsoft compiler because the assembly syntax is different (read: better... gcc's syntax sucks). It will compile with cygwin gcc (and probably any win32 gcc), though.

It takes about an hour to run to completion on my machine. If you reduce the iterations on line 51 it will finish faster. You can also decrease the range of the loop on line 55 (e.g. eliminate the larger sizes) to speed it up.

If it prints "crap, no go 🙁", comment out both mprotect() calls. If the 32-bit version crashes, uncomment both mprotect() calls. The dynamic code generation happens in the function setupTest().

Hopefully someone finds it interesting. I'd love to do some experiments on a Phenom or Barcelona, but don't have easy access to any.
 
I'm not quite sure what you're doing here, but what the heck do you use to sample the clock, and how the hell do you get that resolution? I assume you're taking a large (LARGE!) number of iterations of some loop and then dividing the total cycle time by that number. Do you poll the performance counters directly? If so, we should talk.

I use something called PAPI in my work. And, frankly, it sucks. I have no idea what type of resolution it's supposed to have, but I don't trust the numbers it gives very much. Granted, this is all done on an ancient MIPS R10k processor.

So I'm not really interested in the numbers you get (I'm not even sure what the point is!) but I'm very interested in how you got them. And, if you don't mind me asking, what prompted you to do this in the first place?
 
The RDTSC instruction reads the processor's time stamp counter, present on all x86 chips since the original Pentium, which increments once per processor cycle.

The other way to get high resolution is to read the countdown register of the 8254 timer. The first method gives you one "tick" per processor cycle; the second gives you ticks at approx. 1.2 MHz.
 
Originally posted by: scootermaster
I'm not quite sure what you're doing here, but what the heck do you use to sample the clock, and how the hell do you get that resolution? I assume you're taking a large (LARGE!) number of iterations of some loop and then dividing the total cycle time by that number. Do you poll the performance counters directly? If so, we should talk.

I use something called PAPI in my work. And, frankly, it sucks. I have no idea what type of resolution it's supposed to have, but I don't trust the numbers it gives very much. Granted, this is all done on an ancient MIPS R10k processor.

So I'm not really interested in the numbers you get (I'm not even sure what the point is!) but I'm very interested in how you got them. And, if you don't mind me asking, what prompted you to do this in the first place?

As Markbnj said, x86 processors since the Pentium have had a time stamp counter that increments by one every cycle. You can get the value with the "rdtsc" instruction. There are some potential "gotchas" when there are power-state transitions going on, but for something that keeps the system loaded enough to stay in the highest power state, it'll be pretty accurate. My code basically generates a block of code like this:

foo:
mov eax, [eax] ; p = *p in C/C++ (just chasing a pointer)
add eax, edx ; this takes 1 cycle
add eax, edx ; this takes 1 cycle
... 64 total copies ...
mov eax, [eax]
add eax, edx
add eax, edx
dec ecx ; decrement the loop counter
jnz foo ; if it's not yet 0, jump back to foo
ret ; return

This is unrolled so that the taken-branch overhead is minimal (a taken branch introduces a 1-cycle bubble on most processors), and the number of delay instructions ("add eax, edx") is different in each test.

And then I have a small block of code that jumps into this dynamically generated code with a "call" instruction (it also sets up edx=0, so the adds become functional no-ops, and sets up the loop counter). The loop runs many times, and I read the time stamp counter immediately before and after it.

What prompted me to do this was wanting to get the right answer for my CPU's L2 latency.
 