Originally posted by: Sassy Rabbit
Originally posted by: CTho9305
Originally posted by: Sassy Rabbit
Originally posted by: CTho9305
I asked that question because many CPUs lack the ability to predict indirect branches. If the functions are very short, the call overhead is going to be significant, since an indirect function call will act like a branch that's mispredicted nearly 100% of the time.
AHH ok - the functions are fairly short, between 10 and 100 lines long.
But function pointers seem like a more efficient way of calling them than going through a potentially large chain of evaluations... especially since I need to iterate over an NxN matrix each time step, and each iteration involves several of these function calls.
That many lines of C++ is likely long enough that the overhead won't be bad.
Oh - the architecture I run on is almost strictly AMD 64-bit processors, if that makes a difference.
Barcelona/Phenom are the first AMD CPUs that can predict indirect branches.
RealWorldTech has some stuff on it that's worth reading.
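To make the dispatch pattern under discussion concrete, here's a minimal sketch of the function-pointer setup Sassy Rabbit described: each cell of the grid is tagged with a rule index, and the inner loop calls through a pointer table. The rule names and the per-cell update scheme are made up for illustration; the point is that the call on the last line is the indirect branch a pre-Barcelona AMD CPU can't predict when the target varies from cell to cell.

```cpp
#include <cstddef>

// Hypothetical per-cell update rules (names invented for illustration).
double rule_linear(double x)    { return 0.5 * x; }
double rule_quadratic(double x) { return x * x; }
double rule_decay(double x)     { return x * 0.9; }

using RuleFn = double (*)(double);

// Dispatch table: each cell stores an index into this array.
constexpr RuleFn kRules[] = { rule_linear, rule_quadratic, rule_decay };

// One time step over an n x n grid, flattened to a 1-D array.
void step(double* grid, const int* rule_of, std::size_t n) {
    for (std::size_t i = 0; i < n * n; ++i) {
        // Indirect call: the branch target depends on rule_of[i],
        // so a CPU without an indirect-branch predictor mispredicts
        // whenever the rule changes between consecutive cells.
        grid[i] = kRules[rule_of[i]](grid[i]);
    }
}
```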
Originally posted by: Markbnj
Originally posted by: Sassy Rabbit
Originally posted by: Markbnj
My CPU optimization chops are woefully ancient at this point, but since you're calling the same relatively small chunks of code over and over, won't most of them end up in the L1 cache anyway?
OK - I admit this is losing me a little. I've never had formal computer science training... I'm in chemical engineering and I do simulation work, so all I know about C++ I taught myself writing simulation software. Anyway, I'm not sure how the L1 cache would affect my program...
It's a matter of whether the CPU has to fetch the next few instructions from memory (slooow), the L2 cache (faster), or the L1 cache (fastest). Branch prediction is basically all about having the code that will be needed next already prefetched from memory into the cache. But if you are hitting the same small set of functions over and over, most of their instructions will likely get into the cache and stay there.
Branch prediction and caches are somewhat orthogonal. The cache issue is, "are the instructions accessible quickly", and the branch prediction issue is, "do I know which instruction is coming next?". A reasonably small program will probably fit completely in the L1 I-cache (64KB on AMD CPUs), but even if the program is entirely in cache, a K8 is still going to eat a ~12 cycle penalty (I'm not sure of the exact number) when it hits an indirect branch that goes to a different place each time.
Originally posted by: Sassy Rabbit
OK - that is really cool to know. Now, this is probably the tougher question: how could I determine whether a set of if..else if.. statements (some with complex logic) or an array of function pointers would be faster? On one hand, once the if statements are evaluated they go straight to the right function (though I'm not sure where it would be stored); on the other, I could go through many evaluations before I find the right subroutine.
The easiest way to find out would be to try them all. It depends on a lot of factors, so working it out would be non-trivial and you'd need to understand a lot of subtleties about your CPU and your program.
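"Try them all" can be as simple as writing both dispatch styles against the same tag data and timing them with std::chrono. The sketch below is a toy harness, not a rigorous benchmark - the rule functions and workload are invented, and real results will depend on your compiler flags, CPU, and how predictable your tag sequence is.

```cpp
#include <chrono>
#include <cstdio>
#include <vector>

// Two hypothetical update rules, just for illustration.
static double rule_a(double x) { return x + 1.0; }
static double rule_b(double x) { return x * 2.0; }

using Fn = double (*)(double);
static const Fn table[] = { rule_a, rule_b };

// Version 1: branch on a tag with if/else.
double run_branches(const std::vector<int>& tags, double x) {
    for (int t : tags) {
        if (t == 0) x = rule_a(x);
        else        x = rule_b(x);
    }
    return x;
}

// Version 2: indirect call through a function-pointer table.
double run_pointers(const std::vector<int>& tags, double x) {
    for (int t : tags) x = table[t](x);
    return x;
}

// Time one version; printing the result keeps the compiler
// from optimizing the whole loop away.
template <typename F>
double time_it(F f, const char* label) {
    auto t0 = std::chrono::steady_clock::now();
    double r = f();
    auto t1 = std::chrono::steady_clock::now();
    long long us = std::chrono::duration_cast<std::chrono::microseconds>(t1 - t0).count();
    std::printf("%s: %lld us (result %f)\n", label, us, r);
    return r;
}
```

Usage would be something like `time_it([&]{ return run_branches(tags, 1.0); }, "if/else");` versus the pointer version on the same `tags` vector. Run each variant several times and ignore the first pass, which pays the cache-warming cost the thread discussed above.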
...but before you do that, you should use a profiler to check if you're really spending much of your execution time on the branching part. If 90% of the time is spent elsewhere, you'll get at most a 10% speedup.