
Help with code profiling? (Understanding which CPU events I should measure)

eLiu

Diamond Member
Hi all,
I'm attempting to do some profiling on my group's code project (CFD stuff... maybe some ppl remember from my previous posts). I have access to Intel's VTune software (and I'm already using their compiler & the MKL), but I'm kind of confused on how to use their event counters. From what I understand, the performance counters are for Intel processors (I'm on a 45nm Core2Duo at 2.8GHz) and are not unique to VTune. Currently I'm trying to get a grasp of how our code performs in terms of L1 misses, L2 misses, and branch mispredictions. If there are other factors I should be measuring, please enlighten me!

In particular, with VTune there seems to always be at least 2 or 3 ways to measure a performance factor (like impact of branch mispredicts, L2 misses, L1 misses). I don't want to just run every possible combination b/c a single run in VTune seems to only allow 4 event counters, and I'd just as soon not have to run every test in my suite like 5 times!

The various options seem to generate similar-ish results, but I'm still confused as to why there are so many choices and which one is "best". For example, to see how L1(data) misses are affecting me, there are 2 suggested ratios: L1 Data Cache Miss Rate (L1D_REPL / INST_RETIRED.ANY) and L1 Data Cache Miss Performance Impact (8 * L1D_REPL / CPU_CLK_UNHALTED.CORE). These are based on L1D_REPL, the #lines brought into L1 cache. But there are other ways to figure that. Like L2_RQST (# of requests to the L2 cache) would be a different measure of what's entering L1. There's also MEM_LOAD_RETIRED.L1D_LINE_MISS which claims to precisely count the "number of load operations that miss the L1 data cache".

Similarly options for L2 seem to be L2_LINES_IN.ANY or MEM_LOAD_RETIRED.L2_LINE_MISS. So many of these sound like they do pretty much the same thing, at least to my untrained ear. Which are the most informative? Are there advantages or are they really pretty much the same?

For branch misprediction, a document on Intel's website suggests the ratio BR_INST_RETIRED.MISPRED / BR_INST_RETIRED.ANY. But within VTune itself, the ratio RESOURCE_STALLS.BR_MISS_CLEAR / CPU_CLK_UNHALTED.CORE is suggested. Clearly the first one measures what percentage of branches are mispredicted & the second estimates what percentage of execution time is spent waiting due to branchfail. Which is more meaningful? Do I really need both?

Also, as I mentioned VTune measures at most 4 events per run of the code. Suppose I have events A, B, C, D, E, F. That'd take 2 runs; the first handles ABCD and the second EF. Does anyone know if I can make it so that the first run does ABCD and the second does ABEF? (I'd like to see AB alongside EF without having to manually click around.) I'd be happy queueing up a bunch of runs too, but I can't figure out how to do that either.

Thanks,
-Eric
 
I don't know anything about VTune. But I do know a few things about the measurements you mentioned. So I'll try to answer at least that part. Unfortunately, without knowing VTune specifically, my answers apply more to the events themselves than the Intel names for the events.

* L1 Data Cache Miss Rate (L1D_REPL / INST_RETIRED.ANY): This is not what I would call the "cache miss rate." L1D_REPL / MEM_LOAD_RETIRED.ANY would be my definition, unless you're talking about a Nehalem, in which case I would use L1D_REPL / (MEM_LOAD_RETIRED + MEM_STORE_RETIRED).

*I have no idea what the rationale for "L1 Data Cache Miss Performance Impact" is. Obviously it is related to replacements/second, but that isn't really a great proxy for "performance impact". I would use the former, given the choice.

* L2_RQST is not necessarily the same as L1D_REPL, because the L2 serves both the L1I and the L1D, and non-allocating accesses may hit in a cache but not cause an allocation.

* L2_LINES_IN is also not the same as the number of load L2 misses, again because of instruction accesses. Furthermore, multiple accesses can be serviced by a single line filling the L2. Obviously, without reading the documentation very carefully, it's hard to know for sure what Intel means by the various events.

* Branch misprediction rate is commonly quoted as mispredictions per instruction, or per kiloinstruction. That's not very useful. I would prefer the resource stall rate for this metric, since time lost to branch mispredictions really is a good metric for figuring out whether that is a performance issue or not.

--> What other stuff is there in RESOURCE_STALL? If counters are provided that precisely count the cycles in which the processor is waiting for resource X (and only resource X), then:

Let f = RESOURCE_STALL.X / CPU_CLK_UNHALTED.CORE

According to Amdahl's Law, performance loss due to this stall condition is f * 100%. That is, if you could make this stall condition go away completely, the execution should speed up by a factor of 1/(1-f). Again, this assumes that the resource stall counter is precise, and only ticks when X is the only lacking resource -- it is possible to stall on multiple resources as well... but the analysis of that is less straightforward.
 
First, references for all of these codes can be found here:
http://www.jaist.ac.jp/iscenter-new...rojects/reference_olh/whgdata/whlstt53.htm#64

I have a 45nm Core2Duo (at 2.8GHz). If you type an event name into Google, that usually brings up the relevant doc page too. Unfortunately on some pages, the documentation is quite sparse.

I don't know anything about VTune. But I do know a few things about the measurements you mentioned. So I'll try to answer at least that part. Unfortunately, without knowing VTune specifically, my answers apply more to the events themselves than the Intel names for the events.
Oh I thought those codes were VTune specific. I was really more interested in getting help understanding the codes, so this will work well.

* L1 Data Cache Miss Rate (L1D_REPL / INST_RETIRED.ANY): This is not what I would call the "cache miss rate." L1D_REPL / MEM_LOAD_RETIRED.ANY would be my definition, unless you're talking about a Nehalem, in which case I would use L1D_REPL / (MEM_LOAD_RETIRED + MEM_STORE_RETIRED).

Hm, MEM_LOAD_RETIRED.ANY is not an option. There's .DTLB_MISS, .L1D_LINE_MISS, .L1D_MISS, .L2_LINE_MISS, and .L2_MISS. There is one called INST_RETIRED.LOAD (and INST_RETIRED.STORES); I'm guessing that is equivalent?

What about MEM_LOAD_RETIRED.L1D_LINE_MISS? What makes that better or worse than using L1D_REPL? (Or are they the same...)

*I have no idea what the rationale for "L1 Data Cache Miss Performance Impact" is. Obviously it is related to replacements/second, but that isn't really a great proxy for "performance impact". I would use the former, given the choice.

Me neither. But it's one of the default ratios that you can choose from, and I've been trying to understand why I'd want to look at it. Sounds like probably I can ignore this one.

* L2_RQST is not necessarily the same as L1D_REPL, because the L2 serves both the L1I and the L1D, and non-allocating accesses may hit in a cache but not cause an allocation.

* L2_LINES_IN is also not the same as the number of load L2 misses, again because of instruction accesses. Furthermore, multiple accesses can be serviced by a single line filling the L2. Obviously, without reading the documentation very carefully, it's hard to know for sure what Intel means by the various events.

Oops, good point; I was only thinking about data. Yeah, the problem is that the documentation often says very little. Well, more likely it's my lack of knowledge that's the real issue. One of Intel's default ratios is "L2 Cache Miss Rate" = L2_LINES_IN.SELF.ANY / INST_RETIRED.ANY. Is MEM_LOAD_RETIRED.L2_LINE_MISS any better? Do you know of better events to look at? I'm going through the list here and there isn't anything that's really standing out to me.

* Branch misprediction rate is commonly quoted as mispredictions per instruction, or per kiloinstruction. That's not very useful. I would prefer the resource stall rate for this metric, since time lost to branch mispredictions really is a good metric for figuring out whether that is a performance issue or not.

--> What other stuff is there in RESOURCE_STALL? If counters are provided that precisely count the cycles in which the processor is waiting for resource X (and only resource X), then:

Let f = RESOURCE_STALL.X / CPU_CLK_UNHALTED.CORE

According to Amdahl's Law, performance loss due to this stall condition is f * 100%. That is, if you could make this stall condition go away completely, the execution should speed up by a factor of 1/(1-f). Again, this assumes that the resource stall counter is precise, and only ticks when X is the only lacking resource -- it is possible to stall on multiple resources as well... but the analysis of that is less straightforward.

Other RESOURCE_STALL things are: ANY; BR_MISS_CLEAR (stall due to branch mispred); FPCW (stall due to FPU control word write); LD_ST (the pipeline has exceeded load or store limit or waiting to commit all stores; i.e. all store/load buffers in use); ROB_FULL (ROB is full: 96 instructions in reorder buffer); and RS_FULL (32 instructions waiting for inputs in reserve station).

Seems like the last 3 have to do with the pipeline being unable to continue due to... something or other. Actually it seems like these may be potential measures of cache misses, if I could figure out whether any one of them correlates specifically with the cache.

So to be clear, you are suggesting RESOURCE_STALL.BR_MISS_CLEAR / CPU_CLK_UNHALTED.CORE as the best way of seeing how important branch misprediction is? Are the other RESOURCE_STALLS going to be better indicators of L1/2 cache performance?

Lastly, another general question: are there other events/ratios that would be good for me to look at to get a better understanding of what does/doesn't work in our code? My task is basically just "make it faster", and I hardly even know where to start. I've always heard that L2 miss, branch mispred, and L1 miss are good places to start (hence why I'm asking about these), but any additional advice would be appreciated.

I've really only had experience speeding up small codes of a few hundred or maybe a thousand lines, where it's relatively easy to profile, then look around and see which bits are written poorly. This project is well more than a hundred thousand lines, so it's pretty daunting.
 
Oh one more question...

Is it surprising that event-based profiling and time-based profiling would give different results for what's slow/fast? (By slow/fast, I mean some functions that retire relatively few instructions in event-based seem to have comparatively large amounts of "self-time" in time-based.)

I can understand that small/simple functions that are called often may not generate many/any interesting events, but they'll flare up like crazy on time-based since the instrumenting code becomes a considerable part of the execution time.

For example, there's a function "PXShape". It's responsible for evaluating basis: given some data, it decides which type of basis is appropriate & evaluates it. It calls 3 helper functions to figure out which basis is appropriate (these are all big switch statements). Then it calls realloc() on the input to make sure there's enough space. Lastly there's a switch statement to call a function that will actually do the evaluation (the most expensive part).

The total time reported by time-based profiling is ~23sec. PXShape has total time 2.85sec (about 13% total) and self-time 1.25sec (about 5% of total). The self-time seems huge given how little actually happens within this function. Coupled w/the fact that it is called fairly often, (3 million calls compared to about 90mil calls total) is it safe to say that most of the time spent here is due to inserted instrumentation code?

In event-based profiling, PXShape retires 214,400,000 instructions, which is 0.83% of the total. (The fcn that retires the most instructions retires 3,342,400,000 of them, which is 13.14%)

So for examples like PXShape, should I be looking at trying to make PXShape faster or trying to decrease the number of calls? i.e. is PXShape actually a point of inefficiency?
 
I started this response with a fresh cup of coffee. It was gone by the time it was done. If you are considering reading this entire thing, I'm truly very sorry. To quote "The Incredibles": You got me monologuing!

I want to start this post with two high-level points.
1. Performance counters suck. They're never documented well enough. What I'm saying in this post about RESOURCE_STALL, etc., requires that those counters only tick when the given resource is the only bottleneck. Whether or not that is true is not clear to me from the documentation.
2. Performance counters frequently don't work. As in, they often report incorrect information. You would be wise to validate your counters with some known behaviors.

Hm, MEM_LOAD_RETIRED.ANY is not an option. There's .DTLB_MISS, .L1D_LINE_MISS, .L1D_MISS, .L2_LINE_MISS, and .L2_MISS. There is one called INST_RETIRED.LOAD (and INST_RETIRED.STORES); I'm guessing that is equivalent?

Retired loads would be what I was after with my made-up "MEM_LOAD_RETIRED.ANY".

What about MEM_LOAD_RETIRED.L1D_LINE_MISS? What makes that better or worse than using L1D_REPL? (Or are they the same...)

L1D_REPL = Number of lines that come into the cache.
MEM_LOAD_RETIRED.L1D_LINE_MISS = Number of misses.
These are not the same for one key reason. Two loads can miss to the same line simultaneously. Consider these uOps:
LD 0x1000 -> R1
LD 0x1010 -> R2
These loads access the same cache line (Line 0x1000), but they are two different misses if that line isn't already in the L1D.

The better indicator of whether or not a performance bottleneck exists is L1D_REPL. BUT, I think a solution using RESOURCE_STALL might be better (see below).

Oops, good point; I was only thinking about data. Yeah, the problem is that the documentation often says very little. Well, more likely it's my lack of knowledge that's the real issue. One of Intel's default ratios is "L2 Cache Miss Rate" = L2_LINES_IN.SELF.ANY / INST_RETIRED.ANY. Is MEM_LOAD_RETIRED.L2_LINE_MISS any better? Do you know of better events to look at? I'm going through the list here and there isn't anything that's really standing out to me.

Intel's ratio tells you how many cache misses are occurring as a percentage of all instructions. It doesn't tell you how much it affects performance. See my comments on RESOURCE_STALLS below.

Experiment: Can you check to see if L2_M_LINES_IN is > 0? I thought Core2's had WT/WI L1s, which would suggest this should be 0. I could very well be wrong. It would not be the first time.

Other RESOURCE_STALL things are...
Elided for brevity

So to be clear, you are suggesting RESOURCE_STALL.BR_MISS_CLEAR / CPU_CLK_UNHALTED.CORE as the best way of seeing how important branch misprediction is?

I really like the RESOURCE_STALL counters. The reason why you should too is that just counting events without measuring their impact doesn't give you the full picture of what is going on in the microarchitecture. RESOURCE_STALLs are the manifestation of all kinds of bad things in the underlying circuitry. E.g., RESOURCE_STALL.BR_MISS_CLEAR is the number of cycles stalled due to branch misprediction. Not all mispredictions are the same -- some are worse than others! Counting the cycles gives you an iron-law measure of how bad the problem is.

Aside: OK, branch mispredicts aren't the best example here, because there is also opportunity cost associated with the mispredict that isn't measured by a stall time... anyway, back to the main response.

Are the other RESOURCE_STALLS going to be better indicators of L1/2 cache performance?

Yes (if what I said at the start of this post is true). Let's assume you're not doing much FP work. If you are, that's OK, just more complicated.

Code is typically bound by one of two resources:
- Compute bound: These codes are hitting in the L1D a LOT, and the limiting factor on the performance is the degree of superscalar execution (i.e., the maximum number of uOps that can issue in a single cycle). This is where you want to be, if possible. A workload that is computation bound is just about as optimized as it can be for its target platform.
- Memory bound. These codes are missing in the L1D, and probably missing in the L2 as well.

Aside 1: Both of these bounding factors are usually lumped into the compute-bound category when talking about software. It would be more appropriate to call them I/O insensitive.

Aside 2: There are other kinds of boundedness, e.g., control-bound, but they're uncommon.

Both of these kinds of boundedness will have different effects on RESOURCE_STALLS. In particular, memory-bound workloads will lose performance because of RESOURCE_STALL.ROB_FULL or RESOURCE_STALL.RS_FULL conditions. So these measure actual performance lost due to long-latency events (e.g., cache misses), as opposed to simply measuring the number of long-latency events (e.g., L2_REPL).

If I were you, I would look at ALL the resource stall numbers. If my assumption about these counters is true -- namely that only one counter ticks in a cycle, if any -- then you should pick the one that represents the largest percentage of execution time and optimize that metric.

Lastly, another general question: are there other events/ratios that would be good for me to look at to get a better understanding of what does/doesn't work in our code? My task is basically just "make it faster", and I hardly even know where to start. I've always heard that L2 miss, branch mispred, and L1 miss are good places to start (hence why I'm asking about these), but any additional advice would be appreciated.
They're great places to start, but not the best place, actually. The best thing to do is to look at the overall application organization and figure out where the high-level inefficiencies are, if any. Look for redundant work, wasted time, useless I/O, etc. That's the low-hanging fruit.

I've really only had experience speeding up small codes of a few hundred or maybe a thousand lines, where it's relatively easy to profile, then look around and see which bits are written poorly. This project is well more than a hundred thousand lines, so it's pretty daunting.
Yea, not fun.

Oh one more question...

Is it surprising that event-based profiling and time-based profiling would give different results for what's slow/fast? (By slow/fast, I mean some functions that retire relatively few instructions in event-based seem to have comparatively large amounts of "self-time" in time-based.)

I can understand that small/simple functions that are called often may not generate many/any interesting events, but they'll flare up like crazy on time-based since the instrumenting code becomes a considerable part of the execution time.
I believe I've said before that 99% of code executed is about 1% of code written. This is another instance of that being true.

In general, event counting sucks. The idea that all instructions, or all branches, or all loads, etc., are the same is brain-dead (no offense intended to present company). Loads have bias towards misses or hits, branches have bias toward correct or incorrect predictions, etc. These things only come out at runtime, and are very dependent on the underlying core architecture.

For example, there's a function "PXShape". It's responsible for evaluating basis: given some data, it decides which type of basis is appropriate & evaluates it. It calls 3 helper functions to figure out which basis is appropriate (these are all big switch statements). Then it calls realloc() on the input to make sure there's enough space. Lastly there's a switch statement to call a function that will actually do the evaluation (the most expensive part).

The total time reported by time-based profiling is ~23sec. PXShape has total time 2.85sec (about 13% total) and self-time 1.25sec (about 5% of total). The self-time seems huge given how little actually happens within this function. Coupled w/the fact that it is called fairly often, (3 million calls compared to about 90mil calls total) is it safe to say that most of the time spent here is due to inserted instrumentation code?

Stand back everyone, I'm about to rail about indirect jumps!
For now, I'm going to blame your big switch statements. They're actually worse than your profiling would indicate. Here's why.

It's fairly common to implement a big switch as a register-indirect jump. Something like:
LD [SWITCH_BASE_ADDRESS + SWITCH_OFFSET] -> R1
JMP R1

If SWITCH_OFFSET tends to vary, this will mispredict every time. Register-indirect jumps are typically predicted from a structure called the Branch Target Buffer, or BTB. The BTB usually holds the previous branch target used by the same branch during the last execution. In other words, it holds the location belonging to the previous SWITCH_OFFSET.

There's a very small secondary issue at work, too. Space in the BTB is finite. The above JMP pollutes the BTB, potentially evicting useful information which will negatively affect other code somewhere else, totally invisibly.

Unfortunately, if your control has to fan out that widely, the switch statement may be unavoidable. A big if/else chain would be worse.

In event-based profiling, PXShape retires 214,400,000 instructions, which is 0.83% of the total. (The fcn that retires the most instructions retires 3,342,400,000 of them, which is 13.14%)

Again, not all instructions are equal. A core2 can retire 214,400,000 in about 80M cycles or so if they're all nops or simple arithmetic. It could take 85B cycles if they're all cache misses -- or worse -- TLB misses!

So for examples like PXShape, should I be looking at trying to make PXShape faster or trying to decrease the number of calls? i.e. is PXShape actually a point of inefficiency?

Lastly, I'm going to again cite Amdahl's Law.
http://en.wikipedia.org/wiki/Amdahl's_law

If you have a function that is taking up f% of the execution time, then the best performance you can possibly achieve by optimizing that function is an improvement of 1/(1-f). For f=0.13 (e.g., PXShape), best possible speedup is 15%. That only happens if you make PXShape take ZERO time.

Final point:
If you don't have a single clear piece of code that kills performance, you have to think at a higher level. What does your code do? Does it loop through memory in a funny order? Does it have a lousy data structure? Is it poorly abstracted? Etc. Some perf counters can tell you this, but only you can tell if you're using an O(N^2) algorithm instead of an O(N log N) one.
 
I started this response with a fresh cup of coffee. It was gone by the time it was done. If you are considering reading this entire thing, I'm truly very sorry. To quote "The Incredibles": You got me monologuing!

Haha, I appreciate all the time you're taking to help a noob. Also, man these posts are getting huge.

I want to start this post with two high-level points.
1. Performance counters suck. They're never documented well enough. What I'm saying in this post about RESOURCE_STALL, etc., requires that those counters only tick when the given resource is the only bottleneck. Whether or not that is true is not clear to me from the documentation.
2. Performance counters frequently don't work. As in, they often report incorrect information. You would be wise to validate your counters with some known behaviors.

Nuts. I thought performance counters were an excellent way of seeing how things are going in the code 🙁.

L1D_REPL = Number of lines that come into the cache.
MEM_LOAD_RETIRED.L1D_LINE_MISS = Number of misses.
These are not the same for one key reason. Two loads can miss to the same line simultaneously. Consider these uOps:
LD 0x1000 -> R1
LD 0x1010 -> R2
These loads access the same cache line (Line 0x1000), but they are two different misses if that line isn't already in the L1D.

The better indicator of whether or not a performance bottleneck exists is L1D_REPL. BUT, I think a solution using RESOURCE_STALL might be better (see below).

Have you seen this: http://people.redhat.com/drepper/cpumemory.pdf ?
On p78 he claims L1D_REPL, DTLB_MISS, and L2_LINES_IN are the basic factors to examine when profiling memory use. I mean I can kind of see this... if I have an "exact" count, and I know which functions are time consuming, I can go to those bits of code and examine what's problematic.

Experiment: Can you check to see if L2_M_LINES_IN is > 0? I thought Core2's had WT/WI L1s, which would suggest this should be 0. I could very well be wrong. It would not be the first time.

I ran it, and it's definitely nonzero. It's mostly larger than L2_LINES_IN.SELF.ANY.

I really like the RESOURCE_STALL counters. The reason why you should too is that just counting events without measuring their impact doesn't give you the full picture of what is going on in the microarchitecture. RESOURCE_STALLs are the manifestation of all kinds of bad things in the underlying circuitry. E.g., RESOURCE_STALL.BR_MISS_CLEAR is the number of cycles stalled due to branch misprediction. Not all mispredictions are the same -- some are worse than others! Counting the cycles gives you an iron-law measure of how bad the problem is.

Aside: OK, branch mispredicts aren't the best example here, because there is also opportunity cost associated with the mispredict that isn't measured by a stall time... anyway, back to the main response.

But the other RESOURCE_STALL.blah events don't seem to really correspond to "this is L2" or "this is L1" or "this is TLB" or whatever. Could you provide me w/some guidance on how to interpret the other measures? Like the docs for LD_ST, ROB_FULL, and RS_FULL seem to nearly say the same thing. I tried measuring them and they don't appear to be strongly correlated. So I don't really know what to make of this information.

I don't suppose there's any real way of measuring opportunity cost...


Yes (if what I said at the start of this post is true). Let's assume you're not doing much FP work. If you are, that's OK, just more complicated.

Haha, it's fluid dynamics--about the only calculation we do is floating point. Basically it boils down to assembling a huge matrix, and then solving a (huge) linear system over and over and over again.

Code is typically bound by one of two resources:
- Compute bound: These codes are hitting in the L1D a LOT, and the limiting factor on the performance is the degree of superscalar execution (i.e., the maximum number of uOps that can issue in a single cycle). This is where you want to be, if possible. A workload that is computation bound is just about as optimized as it can be for its target platform.
- Memory bound. These codes are missing in the L1D, and probably missing in the L2 as well.

Aside 1: Both of these bounding factors are usually lumped into the compute-bound category when talking about software. It would be more appropriate to call them I/O insensitive.

Aside 2: There are other kinds of boundedness, e.g., control-bound, but they're uncommon.

Both of these kinds of boundedness will have different effects on RESOURCE_STALLS. In particular, memory-bound workloads will lose performance because of RESOURCE_STALL.ROB_FULL or RESOURCE_STALL.RS_FULL conditions. So these measure actual performance lost due to long-latency events (e.g., cache misses), as opposed to simply measuring the number of long-latency events (e.g., L2_REPL).

I'd say we are computation bounded in a few places, but most of the code is memory bounded. Sometimes it's hard to avoid... like computing an outer product of 2 nx1 vectors. There are 3 memory accesses (read vec1, read vec2, write resultmatrix) PER floating point operation. I will be looking at trying to "lump" these operations so that I can call matrix-matrix multiplies instead. Not sure if this is possible, but it's one of the things I'm examining.

If I were you, I would look at ALL the resource stall numbers. If my assumption about these counters is true -- namely that only one counter ticks in a cycle, if any -- then you should pick the one that represents the largest percentage of execution time and optimize that metric.


They're great places to start, but not the best place, actually. The best thing to do is to look at the overall application organization and figure out where the high-level inefficiencies are, if any. Look for redundant work, wasted time, useless I/O, etc. That's the low-hanging fruit.

Yeah as I mentioned above, I'm pretty confused about how to interpret the RESOURCE_STALL numbers :/

There's a ton of low-hanging fruit. But a lot of it happens in functions that don't take up much time. So even if I fixed some of the most obvious dumb things, I probably couldn't even measure the speed-up. (Or so I think anyway, having not yet tried to fix all the silly things.) Well unless you're suggesting that I really should go through and pick off all the easy fixes first, even if they are very minor.


In general, event counting sucks. The idea that all instructions, or all branches, or all loads, etc., are the same is brain-dead (no offense intended to present company). Loads have bias towards misses or hits, branches have bias toward correct or incorrect predictions, etc. These things only come out at runtime, and are very dependent on the underlying core architecture.

Like I said before, "nuts." How are you supposed to use profiling then? The message I'm getting is that profiling is a finicky and unreliable source of performance information and that there are just too many factors at play to even try and measure it. Surely I'm interpreting things in a much too pessimistic fashion.

Stand back everyone, I'm about to rail about indirect jumps!
For now, I'm going to blame your big switch statements. They're actually worse than your profiling would indicate. Here's why.

It's fairly common to implement a big switch as a register-indirect jump. Something like:
LD [SWITCH_BASE_ADDRESS + SWITCH_OFFSET] -> R1
JMP R1

If SWITCH_OFFSET tends to vary, this will mispredict every time. Register-indirect jumps are typically predicted from a structure called the Branch Target Buffer, or BTB. The BTB usually holds the previous branch target used by the same branch during the last execution. In other words, it holds the location belonging to the previous SWITCH_OFFSET.

There's a very small secondary issue at work, too. Space in the BTB is finite. The above JMP pollutes the BTB, potentially evicting useful information which will negatively affect other code somewhere else, totally invisibly.

Unfortunately, if your control has to fan out that widely, the switch statement may be unavoidable. A big if/else chain would be worse.

Given what you said, I'd be inclined to say that the branch is predicted correctly most of the time. The switch array is something like... case A,B,C,D,E then call Func1. F,G,H,I,J then call Func2; there are like 8 such sets. In our solver, we divide a domain into triangles (in 2D anyway). The triangles are lumped into "element groups", each of which has a basis type that fits into one of the 8 categories. In fact typically, all element groups have the same basis type. But at worst there are 2 or 3 different basis types. Since we handle all things in elementgroup1 before group2, this still means that SWITCH_OFFSET changes very little.

This is why I'm thinking that a lot of the time reported (in the time-based profiling) for PXShape is due to the instrumentation. Additionally when I increase the problem size, the percent-runtime used by PXShape goes down. There's another more extreme example: PXErrorReport. It's just 1 if-statement that checks if the error code is not "NO_ERROR" and halts the program if there is an error. This check occurs with EVERY function call and the time-profiling claims it accounts for like 15% of the total runtime. I find that very hard to believe. I inlined it and the runtime of a test-case didn't change but the time-profiling results were different (namely PXErrorReport was gone).


Again, not all instructions are equal. A Core2 can retire 214,400,000 instructions in about 80M cycles or so if they're all NOPs or simple arithmetic. It could take 85B cycles if they're all cache misses -- or worse -- TLB misses!

Lastly, I'm going to again cite Amdahl's Law.
http://en.wikipedia.org/wiki/Amdahl's_law

If you have a function that is taking up f% of the execution time, then the best performance you can possibly achieve by optimizing that function is an improvement of 1/(1-f). For f=0.13 (e.g., PXShape), best possible speedup is 15%. That only happens if you make PXShape take ZERO time.

Well unfortunately the function that eats the most time is still only like 10-15% of the total time. So if I could take like the top 3 functions and remove them completely, that wouldn't even be a 2x speedup 🙁

Also I'm familiar with Amdahl's Law. I guess my question about PXShape was more pointed toward trying to understand the event-based & timing-based results I was seeing. Like trying to figure out which, if any, result(s) I should trust. And trying to figure out how to best use the profiling tools available to guide me in this optimization project.

Final point:
If you don't have a single clear piece of code that kills performance, you have to think at a higher level. What does your code do? Does it loop through memory in a funny order? Does it have a lousy data structure? Is it poorly abstracted? Etc. Some perf counters can tell you this, but only you can tell if you're using an O(N^2) algorithm instead of an O(N log N) one.

Nope, don't appear to have a single piece. I wish it were that easy. As far as I know, there also aren't meaningful places where we're choosing a silly algorithm. There definitely are places where we step through memory in a funny way, and places where we recompute things needlessly and/or allocate space inefficiently. I've been trying to use profiling to give me some idea of which of these factors actually matter. I don't think there is enough of me to comb through the code line by line and make everything "perfect."

I guess what I'm still a bit confused on is 1) what profiling measures should I use? Event-based (which ones) or time-based? How to reconcile the differences between the two?; and 2) how do I interpret the various profiling measures? I've been taught that profiling should guide the optimization process, but I'm now feeling more lost on how to do that since I'm not even certain how to figure out what parts of the program are slowest. Total time? Instructions retired? unhalted clock cycles? something else?

Ahhhhhh craziness.
-Eric
 
Nuts. I thought performance counters were an excellent way of seeing how things are going in the code 🙁.
As you're discovering, performance counters have their own set of issues. I would suggest performance counters when it's hard to tell why a particular function or subroutine is slow, not why the whole thing is slow.


Have you seen this: http://people.redhat.com/drepper/cpumemory.pdf ?
On p78 he claims L1D_REPL, DTLB_MISS, and L2_LINES_IN are the basic factors to examine when profiling memory use.

If you have a point of reference, they're fine. But suppose you ran these and saw 80M, 45, and 10M? Are those numbers high or low for your workload? How much does it hurt performance? Just the raw numbers don't answer these questions.

I ran it, and it's definitely nonzero. It's mostly larger than L2_LINES_IN.SELF.ANY.
Interesting. Thanks for the heads-up. It satisfies my curiosity, but isn't really related to your question.

Could you provide me w/some guidance on how to interpret the other measures? Like the docs for LD_ST, ROB_FULL, and RS_FULL seem to nearly say the same thing.

One thing to watch out for, that I didn't mention before, is the DTLB miss rate. I would say, DTLB_MISSES.ANY / InstructionsRetired. If this is greater than, say, 5%, I would consider that to be high. This is caused by walking through memory in big steps, or in a funny pattern, or by a slight breeze.

Now, here is my best hack at interpreting Intel-speak for the other resource stall conditions. Most of these can become 'f' in an Amdahl's-Law-style estimation.

RESOURCE_STALLS.BR_MISS_CLEAR = Number of cycles the pipeline spends recovering from mispredicted branches.

RESOURCE_STALLS.FPCW = No bloody clue (sorry). Based on my understanding of what the FP control word does, this shouldn't cause stalls.

RESOURCE_STALLS.ROB_FULL = Every microop in flight needs an entry in the ROB. The ROB keeps final instruction sequencing in program order, despite the fact that the execution itself is out-of-order. In a compute-bound workload, you usually will not run out of ROB space. In a memory-bound workload, you will.

RESOURCE_STALLS.LD_ST = Complicated. Long answer below. Short answer: Loads and stores need special resources. When you're executing lots of memory ops, regardless of whether they hit or miss, you can run out of these resources before you run out of others (e.g., before you run out of ROB space).

RESOURCE_STALLS.RS_FULL = RS stands either for reservation stations or register scheduler, based on my meager understanding of Intel-speak and the description in the documentation. In either case, RS_FULL means that new instructions cannot dispatch (enter the pipe) because there are too many earlier waiting instructions. Waiting instructions are instructions that are literally waiting for other instructions to complete, before they can execute. The scheduling logic ('RS') is usually smaller than the ROB, because it's not very scalable, but this is often OK because not all instructions have to wait for earlier instructions to complete -- it depends a LOT on the workload. This, too, is an indicator that you have a memory-bound workload.

RAT_STALLS.* = The RAT stands for Register Alias Table. The RAT is used by the decode logic for register renaming, to make the register writes appear in-order despite actually going on out-of-order. Stalls at the RAT mean that no instructions flow past it into the execution logic.
DELAYED_BYPASS = Time wasted because the compiler is bad at scheduling code for this particular microarchitecture, its pipeline depths, etc. This is notoriously hard to measure and understand in out-of-order cores. Ignore it.

Long answer regarding RESOURCE_STALLS.LD_ST: Modern pipelines execute many instructions at a time, and often out of order. The problem is, you can't execute memory operations out of order without making sure that loads see the values written by earlier stores. In other words, the programmer must not be able to tell that there is out-of-order execution going on under the hood.
This is a huge problem in microarchitecture, the topic of hundreds (maybe thousands) of research proposals. In many current products, this problem is handled with a combination of prediction and special in-core queues used to sequence memory operations. These queues (the Store Queue, SQ, and Load Queue, LQ, sometimes collectively called the LSQ) are expensive, so they're typically smaller than the ROB (which bounds the number of ops--of any kind--that can be in flight at a given time). A memory-intensive workload can exhaust the space in the LSQ, which means you cannot start any new memory operations until space clears up (by having other memory operations retire).

I don't suppose there's any real way of measuring opportunity cost...
I'm working on this for my thesis, in a way. Not so much measuring as just quantifying. It's pretty tricky.

Haha, it's fluid dynamics--about the only calculation we do is floating point. Basically it boils down to assembling a huge matrix, and then solving a (huge) linear system over and over and over again.

This is a hail-mary pass, but is your algorithm related to anything in the SPEC FP 2006 suite? There's a ton of fluid dynamic stuff going on there, and I'm fairly familiar with the behaviors in that suite.

I'd say we are compute-bound in a few places, but most of the code is memory-bound. Sometimes it's hard to avoid... like computing an outer product of 2 nx1 vectors. There are 3 memory accesses (read vec1, read vec2, write resultmatrix) PER floating point operation. I will be looking at trying to "lump" these operations so that I can call matrix-matrix multiplies instead. Not sure if this is possible but it's one of the things I'm examining.
Yes, memory-boundedness is hard to avoid. Here is a technique that might help:
1. Numerical codes like the ones you describe are often easily prefetchable. Try inserting software prefetches.
2. Big working sets typically overflow caches, especially when operations stream through memory (e.g., matrix multiply). If you know your entire working set doesn't fit in-cache, use prefetchnta (see http://groups.google.com/group/comp...hread/2ae6c66f8e69ae82/1950f09a79cd4056?pli=1) to avoid polluting the cache.

There's a ton of low-hanging fruit. But a lot of it happens in functions that don't take up much time... Well unless you're suggesting that I really should go through and pick off all the easy fixes first, even if they are very minor.
Nope. Don't do it if it won't help.

Like I said before, "nuts." How are you supposed to use profiling then? The message I'm getting is that profiling is a finicky and unreliable source of performance information and that there are just too many factors at play to even try and measure it. Surely I'm interpreting things in a much too pessimistic fashion.
I may have overdramatized a bit. Profiling has its place, and is useful. I prefer time-based over event-based to narrow the field -- event-based can tell you what is wrong with a particular (small) section of code.

Given what you said, I'd be inclined to say that the branch is predicted correctly most of the time. ... SWITCH_OFFSET changes very little.
OK, good.

This is why I'm thinking that a lot of the time reported (in the time-based profiling) for PXShape is due to the instrumentation.
Instrumentation overhead is something to watch for. As usual, it's hard to quantify. You already know what to do if you suspect you're measuring overhead: Remove the profiling of the function and see how much the result profile changes.

Well unfortunately the function that eats the most time is still only like 10-15% of the total time. So if I could take like the top 3 functions and remove them completely, that wouldn't even be a 2x speedup 🙁
Yes, I was afraid that might be the case.
Have you considered implementing a parallel version of the code? Might that be possible? Many numerical applications take well to parallelization.

I guess what I'm still a bit confused on is 1) what profiling measures should I use? Event-based (which ones) or time-based? How to reconcile the differences between the two?; and 2) how do I interpret the various profiling measures? I've been taught that profiling should guide the optimization process, but I'm now feeling more lost on how to do that since I'm not even certain how to figure out what parts of the program are slowest. Total time? Instructions retired? unhalted clock cycles? something else?

Look at total time. Total time T of an execution is:

T = Time processor does work (W) + Time processor is stalled (S)

First, try to get an idea of W and S. That's where RESOURCE_STALLS comes in:
S = SumOfAll(ResourceStalls).
W = T-S.

If W >> S, you have to look for algorithmic changes, or parallelize for performance. If S >> W, you might be able to reduce S with software changes... maybe. You have to break S into as many categories as you can. Hence why I suggested looking at resource stalls.
 
Just some thoughts from the random crazy guy (I've never used VTune, so I couldn't really be of much help there).

If you want to optimize code, you're really pretty much forced to go down to the assembly level. Some general optimization tips:

1. Don't forget ICache misses, they can really kill you. If you have a loop that is looping a lot, try to insert some NOPs in front of it to get it aligned.
2. Keep your arrays aligned. Aligned data loads faster than unaligned data.
3. I'm not sure on its usage, but x86 supports an L2 flushing/signaling instruction (it's clflush, or something along those lines); judicious use of this instruction can yield some good results.
4. If you are working with large amounts of data, you might consider looking at the SSE instructions. Every compiler out there that I know of has a tough time effectively using SSE instructions and registers. Handwriting SSE code could yield some good results. You might be able to get away with using intrinsics.
5. You probably already know this, but it bears mentioning, make sure when you access data, you do so from "left to right" one row at a time. As well, reading data in "blocks" can drastically improve speeds.
IE
[0, 1, 2, 3, 4, 5]
[2, 3, 4, 5, 6, 7]
[3, 4, 5, 6, 7, 3]
[3, 4, 5, 3, 2, 2]
Reading the first 4 values of each row, then starting the next row, and coming back for the "missed" data after processing x rows can increase performance. How many values and how many rows depends completely on the size of your L1/L2 cache. It should be benchmarked to see when performance is maximized.

If I can think of some other optimization tips I'll let you know.
 
Finally responding... sorry I've been out sick :/

As you're discovering, performance counters have their own set of issues. I would suggest performance counters when it's hard to tell why a particular function or subroutine is slow, not why the whole thing is slow.

If you have a point of reference, they're fine. But suppose you ran these and saw 80M, 45, and 10M? Are those numbers high or low for your workload? How much does it hurt performance? Just the raw numbers don't answer these questions.
Got it. I think after all this discussion I'm finally starting to understand how to apply all the profiling tools I have available.

Interesting. Thanks for the heads-up. It satisfies my curiosity, but isn't really related to your question.
Sure... it's the least I could do, heh.


One thing to watch out for, that I didn't mention before, is the DTLB miss rate. I would say, DTLB_MISSES.ANY / InstructionsRetired. If this is greater than, say, 5%, I would consider that to be high. This is caused by walking through memory in big steps, or in a funny pattern, or by a slight breeze.

Yeah I had been reading about the DTLB and added that to my list of watched events.

Now, here is my best hack at interpreting Intel-speak for the other resource stall conditions. Most of these can become 'f' in an Amdahl's-Law-style estimation.
snip
which means you cannot start any new memory operations until space clears up (by having other memory operations retire).
Thanks! Your descriptions were very helpful.

I'm working on this for my thesis, in a way. Not so much measuring as just quantifying. It's pretty tricky.
Good luck with that... maybe a CS student will tell me about it one day.


This is a hail-mary pass, but is your algorithm related to anything in the SPEC FP 2006 suite? There's a ton of fluid dynamic stuff going on there, and I'm fairly familiar with the behaviors in that suite.
What I'm doing does bear some similarity to the tests in that suite... or at least what I could glean of how the codes work by reading the summary pages. We're kind of a mix between 454.calculix (continuous Galerkin finite element, CG) and 410.bwaves (I think finite volumes). We're solving nonlinear PDEs with a discontinuous Galerkin finite element method. Right now we're mostly interested in steady-state solutions. But due to the difficulty of finding steady-state solutions in nonlinear problems, we solve a pseudo-unsteady problem... where we march the solution forward in time, but make no effort to be time accurate: we only want the steady state after all. So to decrease the time used, we try to take time steps as big as possible (i.e. implicit time stepping)... this puts a lot of attention on the efficiency of the linear solver. Like 410.bwaves, we use a Krylov method--we use GMRES instead of BiCG-stab.

Where things start to differ widely is the following: all the SPECFP06 codes seem to be mostly interested in simple static problems or time-accurate simulation of something. In time-accurate simulation, you might still use an implicit method (b/c the physical time scales are larger than the grid-implied time scales), but the time steps are small. If the time steps are small, then the linear system of equations arising in computing the next state in time is easy to solve. Like 410.bwaves doesn't mention any preconditioning... which pretty much precludes them from solving steady state problems efficiently. Anyway I'm starting to wander a lot now... might be easier if you have more specific questions about what we do in relation to the SPEC suite. In short there are similarities, but there are also substantial differences.

Oh, there's also substantial additional complication b/c the SPEC codes all sound like they work on structured meshes. We solve on unstructured meshes, which opens up a whole other can of worms... but it is substantially more versatile.

Yes, memory-boundedness is hard to avoid. Here is a technique that might help:
1. Numerical codes like the ones you describe are often easily prefetchable. Try inserting software prefetches.
2. Big working sets typically overflow caches, especially when operations stream through memory (e.g., matrix multiply). If you know your entire working set doesn't fit in-cache, use prefetchnta (see http://groups.google.com/group/comp...hread/2ae6c66f8e69ae82/1950f09a79cd4056?pli=1) to avoid polluting the cache.
Hmm could you provide me with a source on how to insert software prefetches? Also can prefetchnta be inserted in the C code and not in the assembly? (Similarly can I do prefetching w/o modifying assembly...) First I'm not very familiar with assembly so if I can avoid going that path I would like to. Second, our code is constantly being modified (it's like a 5 student + guys at Boeing group), so asm modifications might be hard if the code is being changed...?? I've heard of like assembly injection in C, but I don't know much about it.

Lastly if it matters, we need to support icc and gcc users, all on linux though.

Yes, I was afraid that might be the case.
Have you considered implementing a parallel version of the code? Might that be possible? Many numerical applications take well to parallelization.
Yes we do have a parallel version. There are some problems though. Standard procedure is to break the grid apart into chunks, and give each chunk to a processor. Unfortunately, when you do this you're also forced to break the preconditioner up into parts. As a result, the preconditioning gets worse with more processors. What this means is that while each iteration (of the linear solver) may be going faster, we ultimately need many more iterations. So our parallel efficiency is awful. Like in 3D, 2 processors is actually slower than 1! This is at least an algorithmic if not a mathematical shortcoming of the DG method. The continuous Galerkin community solved this issue using the BDDC preconditioner (we use ILU). Unfortunately no one (not us or other groups) has yet been successful in writing a BDDC-type preconditioner for DG codes.

This is of course only when the linear system is hard to solve. If we were using an explicit time integration or at least implicit with small time steps (i.e., from solving an unsteady problem), our parallel efficiency would be nearly perfect b/c the DG method parallelizes really well.

If W >> S, you have to look for algorithmic changes, or parallelize for performance. If S >> W, you might be able to reduce S with software changes... maybe. You have to break S into as many categories as you can. Hence why I suggested looking at resource stalls.

Ok.
 
Just some thoughts from the random crazy guy (I've never used VTune, so I couldn't really be of much help there).

If you want to optimize code, you're really pretty much forced to go down to the assembly level. Some general optimization tips:
Hi Cogman, thanks for joining the discussion 🙂 Well at the moment, I think there's plenty of work to be done above the assembly level, but I'll keep that in mind. I addressed this point (and my fear of assembly) a bit in my last reply to degibson. In particular I know little assembly so I was hoping to avoid it. Also our code is not "final" or "production" in any sense. Some of the core components are rarely modified but other pieces change constantly. I'm not sure how hard it will be to make assembly edits when people are actively changing the source?

1. Don't forget ICache misses, they can really kill you. If you have a loop that is looping a lot, try to insert some NOPs in front of it to get it aligned.
Could you provide a source or an example? I've never heard of doing what you're suggesting. Also how does one go about determining if a loop is not aligned?

2. Keep your arrays aligned. Aligned data loads faster than unaligned data.
I use the -align flag to the compiler... is there more I can do?

3. I'm not sure on its usage, but x86 supports an L2 flushing/signaling instruction (it's clflush, or something along those lines); judicious use of this instruction can yield some good results.
Do you have any rules of thumb? I guess this is along the same ideas as using prefetch.

4. If you are working with large amounts of data, you might consider looking at the SSE instructions. Every compiler out there that I know of has a tough time effectively using SSE instructions and registers. Handwriting SSE code could yield some good results. You might be able to get away with using intrinsics.
I'm letting icc handle that for me for now. I think I'm a very long way from being able to write SSE instructions on my own.

5. You probably already know this, but it bears mentioning, make sure when you access data, you do so from "left to right" one row at a time. As well, reading data in "blocks" can drastically improve speeds.
Yeah. That is something that we know about/do.
 
What I'm doing does bear some similarity to the tests in that suite... or at least what I could glean of how the codes work by reading the summary pages. We're kind of a mix btwn 454.calculix (continuous galerkin finite element, CG) and 410.bwaves (i THINK finite volumes). ... [shortened] ... might be easier if you have more specific questions about what we do in relation to the SPEC suite. In short there are similarities, but there are also substantial differences.

Oh, there's also substantial additional complication b/c the SPEC codes all sound like they work on structured meshes. We solve on unstructured meshes, which opens up a whole other can of worms... but it is substantially more versatile.
My hail mary was in the last post because I have a lot of data sitting around about how those codes respond to various events. I'll get back to you on it -- some of it might be useful.

Hmm could you provide me with a source on how to insert software prefetches? Also can prefetchnta be inserted in the C code and not in the assembly? (Similarly can I do prefetching w/o modifying assembly...) First I'm not very familiar with assembly so if I can avoid going that path I would like to. Second, our code is constantly being modified (it's like a 5 student + guys at Boeing group), so asm modifications might be hard if the code is being changed...?? I've heard of like assembly injection in C, but I don't know much about it.

Lastly if it matters, we need to support icc and gcc users, all on linux though.
Well, this is gonna get ugly. There's no C-language way to expose prefetching. Some compilers have builtins for it, some will require inlined assembly. Either way, it's probably going to mean conditional compilation to support two compilers.

GCC: Has some optimizations you can turn on for prefetching. Specifically, -fprefetch-loop-arrays. Icc probably does too.

GCC: Google for '__builtin_prefetch'
http://gcc.gnu.org/onlinedocs/gcc/Other-Builtins.html#Other-Builtins
The nature of this prefetch isn't specified. E.g., it is probably NOT prefetchnta. For that, you'll need ASM support.

GCC-Inlined assembly: http://gcc.gnu.org/onlinedocs/gcc/Extended-Asm.html#Extended-Asm

e.g.,
Code:
#define prefetch_loc(addr) \
__asm__ __volatile__ ("prefetchnta %0" \
                      : \
                      : \
                      "m" (*(((char*)(((uintptr_t)(addr))&~(uintptr_t)0x7f)))))
From: http://www.beowulf.org/archive/2002-April/006729.html

You can formulate other inlined prefetch commands in the same way, by changing the prefetchnta opcode to something else.

See also: http://gcc.gnu.org/projects/prefetch.html in the x86 section.

If you are using prefetchnta, do not attempt to place the prefetchnta instruction far ahead of the actual access -- place the prefetch immediately before the access. This is for prefetchnta ONLY. Put all other prefetches 'a few hundred cycles' before the actual access.

Now, for ICC:
Word on the street (aka, five minutes of googling) seems to suggest that icc supports GNU-style ASM dialect. You'll have to verify that. You may have to change __asm__ __volatile__ to asm volatile, etc.

Yes we do have a parallel version. There are some problems though. Standard procedure is to break the grid apart into chunks, and give each chunk to a processor. Unfortunately, when you do this you're also forced to break the preconditioner up into parts. As a result, the preconditioning gets worse with more processors. What this means is that while each iteration (of the linear solver) may be going faster, we ultimately need many more iterations. So our parallel efficiency is awful. Like in 3D, 2 processors is actually slower than 1! This is at least an algorithmic if not a mathematical shortcoming of the DG method. The continuous galerkin community solved this issue using the BDDC preconditioner (we use ILU). Unfortunately no one (not us or other groups) has yet been successful in writing a BDDC-type preconditioner for DG codes.
Yes, despite years of research, parallel programming is still in its infancy. As hard as it is to tune a single thread, it's exponentially harder to tune multiple threads.
 
1. Prefetching is only going to work well in places where you can predict what data you will want to access "a few hundred cycles" in the future.
2. It's important to understand how the non-temporal prefetching works. When a new miss is generated, a processor allocates something called a Miss Status Holding Register, or MSHR. The MSHR contains the allocation strategy for the miss. Most loads allocate in L1D. So does prefetcht0. prefetcht1 allocates in L2, but not L1D, and so on. If another memory access is going to hit that same line in close proximity to the original miss, the new access uses the same MSHR and inherits the allocation strategy. This allows you to use a prefetch operation "just before" a load operation to change whether that load operation will allocate in-cache or not. Hence, when you're trying to keep a particular valuable working set in your cache, you need to prefix other loads with non-temporal prefetches. Since the result will NOT be cached when a non-temporal prefetch is in play, loads to that line that are "far away" from the prefetch will miss again, and may even cause allocations.
3. Prefetches happen at cache line granularity. It doesn't do any good to prefetch the next byte.

Guidelines on prefetching:

-The whole point of prefetching is explicit cache management. You're going to treat your cache as something you can explicitly allocate. You have to know how big your CPU's caches ARE to do this effectively, so look those up. Most Intel products have a 32KB L1D... L2 size varies a lot. Discard any that end up being 1KB or smaller.

-Figure out the size of all input and output data structures in a given calculation. Bear in mind that if you're accessing structs of size 64 bytes, but only hitting eight bytes in each struct, you're 'accessing' the whole struct for the purpose of this argument. Write these numbers down, call them things like RSinput1 (Read Set: Input 1) or WSoutput1 (Write Set: Output 1).

-Figure out how much total data will be accessed by the calculation. This should be the sum of the sizes of the inputs and outputs. Call this total amount WS.

-One of your data sets will often be innermost, esp. for nested codes. That is, if you have nested FOR loops, the data the innermost loop iterates over is innermost. Rank your inputs and outputs, in order, starting with innermost (the last in the order should be outermost).

-Now begins the allocation. You will get the best returns on performance by allocating inner working sets before outer working sets. Among ties, you will probably get more performance by allocating read sets over write sets. For all innermost working sets that will fit in the L1D, allocate in the L1D with a prefetcht0. Place all prefetch instructions, except prefetchnta, "a few hundred cycles" before you actually need the data (ballpark 200 instructions, but err on the side of more instructions rather than less).

-For innermost working sets that fit in the L2 but not the L1D, use prefetcht1.

-For all other working sets -- the sets that will not fit in the L1D or L2, there are two approaches.
1. If you've already allocated some working sets into caches, NEVER issue a prefetch OR A DEMAND REQUEST that might displace those working sets. E.g., if the L1D is "full", immediately precede all other streaming accesses with either prefetcht1 or something even stronger, e.g., prefetchnta.
2. If none of your working sets fit in any caches at all, issue prefetcht0 "a few hundred cycles" before the expected access.
 
Here http://x264dev.multimedia.cx/?p=201 is a good document on ICACHE misses.
As for aligning code: short of hand-writing assembly and inserting NOPs in front of the loop you are trying to align, I'm not sure there is a way to ensure that a loop will be aligned.

GCC supports an __attribute__((aligned(x))) declaration; you might be able to put that in front of your loop to make it align (I don't know about ICC).

For the prefetching, I'm really not too experienced in it, other than knowing that it is something that can give some good benefits. degibson is probably the one you should rely on for specific usage.

Earlier I was confusing ICACHE misses with loop alignment. My mistake (though ICACHE misses can be a big deal).
 