Hi all,
I'm attempting to do some profiling on my group's code project (CFD stuff... maybe some people remember it from my previous posts). I have access to Intel's VTune software (and I'm already using their compiler and the MKL), but I'm confused about how to use its event counters. From what I understand, the performance counters belong to the Intel processor itself (I'm on a 45nm Core 2 Duo at 2.8 GHz) and are not unique to VTune. Currently I'm trying to get a grasp of how our code performs in terms of L1 misses, L2 misses, and branch mispredictions. If there are other factors I should be measuring, please enlighten me!
In particular, VTune seems to offer at least two or three ways to measure any given performance factor (the impact of branch mispredictions, L2 misses, L1 misses, and so on). I don't want to just run every possible combination, because a single VTune run seems to allow only four event counters, and I'd just as soon not run every test in my suite five times!
The various options generate broadly similar results, but I'm still confused as to why there are so many choices and which one is "best". For example, to see how L1 (data) misses are affecting me, there are two suggested ratios: L1 Data Cache Miss Rate (L1D_REPL / INST_RETIRED.ANY) and L1 Data Cache Miss Performance Impact (8 * L1D_REPL / CPU_CLK_UNHALTED.CORE). Both are based on L1D_REPL, the number of lines brought into the L1 cache. But there are other ways to get at that: L2_RQST (the number of requests to the L2 cache) would be a different measure of what's entering L1, and there's also MEM_LOAD_RETIRED.L1D_LINE_MISS, which claims to precisely count the "number of load operations that miss the L1 data cache".
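For reference, here's how I read those two formulas, with made-up counter values just to illustrate the arithmetic (nothing below is a real measurement from my runs):

```python
# Made-up counter values, purely to show how the two suggested L1D
# ratios are computed; these are not real measurements.
l1d_repl = 12_500_000        # L1D_REPL: lines brought into the L1 data cache
inst_retired = 950_000_000   # INST_RETIRED.ANY: instructions retired
cpu_clk = 1_400_000_000      # CPU_CLK_UNHALTED.CORE: unhalted core cycles

# L1 Data Cache Miss Rate: L1D replacements per retired instruction.
miss_rate = l1d_repl / inst_retired

# L1 Data Cache Miss Performance Impact: the 8 looks like an assumed
# penalty of 8 cycles per L1D replacement, giving a fraction of all cycles.
perf_impact = 8 * l1d_repl / cpu_clk

print(f"miss rate:   {miss_rate:.4f} per instruction")
print(f"miss impact: {perf_impact:.1%} of core cycles")
```

So the first ratio is per-instruction and the second tries to translate the same count into a fraction of runtime.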
Similarly, the options for L2 seem to be L2_LINES_IN.ANY or MEM_LOAD_RETIRED.L2_LINE_MISS. Many of these sound like they do pretty much the same thing, at least to my untrained ear. Which are the most informative? Are there real differences, or are they effectively interchangeable?
For branch misprediction, a document on Intel's website suggests the ratio BR_INST_RETIRED.MISPRED / BR_INST_RETIRED.ANY, but within VTune itself the ratio RESOURCE_STALLS.BR_MISS_CLEAR / CPU_CLK_UNHALTED.CORE is suggested. Clearly the first measures what percentage of branches are mispredicted, and the second estimates what percentage of execution time is spent stalled on mispredicted branches. Which is more meaningful? Do I really need both?
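The way I understand the difference between the two (placeholder numbers again, not real data):

```python
# Placeholder counts to contrast the two suggested branch ratios;
# none of these are measured values.
br_mispred = 3_000_000    # BR_INST_RETIRED.MISPRED: mispredicted branches
br_any = 150_000_000      # BR_INST_RETIRED.ANY: all retired branches
br_stall = 40_000_000     # RESOURCE_STALLS.BR_MISS_CLEAR: stall cycles
cpu_clk = 1_400_000_000   # CPU_CLK_UNHALTED.CORE: unhalted core cycles

# Ratio 1: how often the branch predictor is wrong, per branch.
mispredict_rate = br_mispred / br_any

# Ratio 2: how much of total execution time those misses actually cost.
stall_fraction = br_stall / cpu_clk

print(f"{mispredict_rate:.1%} of branches mispredicted")
print(f"{stall_fraction:.1%} of cycles stalled on mispredictions")
```

So a code path could have a high mispredict rate but a negligible stall fraction (or vice versa), which is presumably why both exist.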
Also, as I mentioned, VTune measures at most four events per run of the code. Suppose I have events A, B, C, D, E, F. That would take two runs: the first handles ABCD and the second EF. Does anyone know if I can make it so that the first run does ABCD and the second does ABEF? (I'd like to see A and B alongside E and F without having to manually click around.) I'd also be happy queueing up a bunch of runs, but I can't figure out how to do that either.
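To make the grouping concrete, here's roughly the scheduling I'm after, sketched in Python with placeholder event names (A and B are the pair I want repeated in every run):

```python
def schedule(events, anchors, width=4):
    """Split events into runs of at most `width` counters, repeating the
    `anchors` pair in every run so results can be compared across runs."""
    rest = [e for e in events if e not in anchors]
    per_run = width - len(anchors)      # slots left after the anchors
    runs = []
    for i in range(0, len(rest), per_run):
        runs.append(list(anchors) + rest[i:i + per_run])
    return runs

runs = schedule(["A", "B", "C", "D", "E", "F"], anchors=["A", "B"])
print(runs)  # [['A', 'B', 'C', 'D'], ['A', 'B', 'E', 'F']]
```

I can work out the groups by hand easily enough; the part I can't figure out is getting VTune to queue those two runs automatically.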
Thanks,
-Eric