C & the inline keyword

eLiu

Diamond Member
Jun 4, 2001
6,407
1
0
Hi all,
Can someone help me understand when it is/isn't a good idea to use the "inline" keyword explicitly? Are there times when I can really make a difference using "inline"? Or is the compiler basically always smarter than me & so I should just trust it to do its thing? Currently I just use it to make code more readable. Like suppose I have confusing code to find the orientation of a face; instead of leaving it in the main body, I'll create a 'static inline' version & call out to it. So I'm hoping to get the readability of function calls without any of the overhead/slowdown (b/c this piece of code will run hundreds of millions of times).

If it matters, I'm coding numerical routines in C (solving the Navier-Stokes equations); so lots of computation & lots of memory. The pieces of code I think of inlining are generally short & have small data, so it'd be cool if these functions & their variables fit into the instruction cache & registers as much as possible. Similarly it'd be good to avoid lots of useless copying of variables from one place to another.
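
For concreteness, the kind of thing I mean (toy example, the names & math are made up):

#include <math.h>

/* confusing geometry code pulled out of the main loop; 'static inline'
   so (hopefully) it still gets pasted in at the call site */
static inline int face_orientation(const double n[3])
{
    /* sign of the dominant component of the face normal (made-up logic) */
    double ax = fabs(n[0]), ay = fabs(n[1]), az = fabs(n[2]);
    if (ax >= ay && ax >= az) return (n[0] >= 0.0) ? 1 : -1;
    if (ay >= az)             return (n[1] >= 0.0) ? 1 : -1;
    return (n[2] >= 0.0) ? 1 : -1;
}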

-Eric
 

tatteredpotato

Diamond Member
Jul 23, 2006
3,934
0
76
Originally posted by: eLiu
Hi all,
Can someone help me understand when it is/isn't a good idea to use the "inline" keyword explicitly? Are there times when I can really make a difference using "inline"? Or is the compiler basically always smarter than me & so I should just trust it to do its thing? Currently I just use it to make code more readable. Like suppose I have confusing code to find the orientation of a face; instead of leaving it in the main body, I'll create a 'static inline' version & call out to it. So I'm hoping to get the readability of function calls without any of the overhead/slowdown (b/c this piece of code will run hundreds of millions of times).

If it matters, I'm coding numerical routines in C (solving the Navier-Stokes equations); so lots of computation & lots of memory. The pieces of code I think of inlining are generally short & have small data, so it'd be cool if these functions & their variables fit into the instruction cache & registers as much as possible. Similarly it'd be good to avoid lots of useless copying of variables from one place to another.

-Eric

Small functions (like less than 10 lines max). Essentially the compiler pastes the inline function's body into the caller instead of making an all-out function call.
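
Roughly, given something like this (toy example):

static inline int sq(int x) { return x * x; }

int f(int a)
{
    return sq(a + 1); /* the compiler effectively compiles this as
                         return (a + 1) * (a + 1); -- no call, no new stack frame */
}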

There isn't any explicit definition of when to use inline though.
 

engineereeyore

Platinum Member
Jul 23, 2005
2,070
0
0
Sounds like you have the idea already. ObscureCaucasian is 100% correct in his advice. You want the function to be small. It's basically just a trade-off between code size and run-time overhead. As you mentioned, you eliminate unnecessary parameter copying at the expense of larger code.

So what you really have to consider is how much memory you have available for the code segment. This is often a much bigger consideration on smaller embedded systems where memory is at a premium and the use of in-line functions may not be feasible. If memory isn't really an issue, or the function is only rarely called, absolutely use in-line functions. But if memory is restricted, or the function is called a large number of times, you might want to consider not using an in-line. Just remember, the smaller the function, the more times you can use it without adversely affecting your memory.

Hope that helps.

(It should be noted that when I say "call" the function, I'm referring to the number of times it will be copied into the code. Having an in-line function called somefunc() inside a loop that repeats 20 times doesn't cause the code to be inserted 20 times, but only once with loop logic placed around it. The number of times you literally write somefunc() in your code is how many times it will be replicated.)
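
In other words (somefunc is just a stand-in name):

static inline void somefunc(int i) { (void)i; /* ... small body ... */ }

void caller(void)
{
    int i;
    for (i = 0; i < 20; i++)
        somefunc(i); /* ONE call site: the body is pasted in once, loop logic kept around it */

    somefunc(0);     /* each additional textual call site... */
    somefunc(1);     /* ...gets its own pasted copy */
}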
 

slugg

Diamond Member
Feb 17, 2002
4,723
80
91
Your algorithm will use more memory if you use inline functions, but that should be okay since the whole algorithm will be in one contiguous section of memory. Due to the principle of locality, this type of coding will most likely maximize CPU cache usage, pipelining, and instruction fetching. On newer CPUs, this type of code will help make branch prediction more accurate, which greatly increases pipeline performance.

What compiler are you using? Last I checked, GCC doesn't automagically make frequently called functions inline. If you really want to optimize your algorithms, try to get rid of as much functional recursion as you can and just manage your own stack. If you use recursive function calls, a LOT of data is pushed to your program stack - way, way more than what is important to you. Not only that, but the CPU fetch latency gets out of whack if you push a huge stack. Getting rid of the recursion will improve run time and memory usage. With these memory savings, you could totally afford to use many inline functions.
 
IHateMyJob2004

Sep 29, 2004
18,656
67
91
1) modern compilers optimize better than humans
2) inlining can slow your program down. (waiting for people to call me an idiot so I can show off my mad skillz)
3) modern compilers optimize better than humans
 

Cogman

Lifer
Sep 19, 2000
10,284
138
106
Originally posted by: IHateMyJob2004
1) modern compilers optimize better than humans
2) inlining can slow your program down. (waiting for people to call me an idiot so I can show off my mad skillz)
3) modern compilers optimize better than humans

1) Not always true.
2) Sometimes true, sometimes not true. It is something that should be determined after profiling on the targeted machine.
3) Again not always true.

Modern compilers do an AMAZING job at optimization. However, a human that knows what he is doing, knows the machine he is targeting, etc., can do a better job than the compiler. SSE instructions, which sqrt instruction to use, etc., are things where humans can do a better job than compilers.

I'm not saying optimization should be done by the average joe, or even the unaverage joe. I'm just saying that there are cases where humans are able to write code that runs faster on the machine than compilers are able to achieve.
 

eLiu

Diamond Member
Jun 4, 2001
6,407
1
0
Originally posted by: IHateMyJob2004
1) modern compilers optimize better than humans
2) inlining can slow your program down. (waiting for people to call me an idiot so I can show off my mad skillz)
3) modern compilers optimize better than humans

Would you care to clarify? Clearly you think some people will disagree strongly with 2), and I'd like to hear about it at least for my own knowledge.

engineereeyore: I think memory for the program isn't a big deal. This will never be embedded or anything like that; the matrices generated to solve the flow-field can span many gigabytes (i.e. requires parallel).

Why is it bad to inline a function that is called often? Is that b/c it'll grow the code potentially a lot?


slugg: In my group, most people are on gcc 3.4, but a few (like me) are on gcc 4.3.2. gcc's -O1 or -O2 optimizations (I forget which, or if it's both) at least claim to do inlining. There are also a couple of "-inline-blah-blah" type flags you can give.

What do you mean by "manage your own stack"? Links to stuff I can read on how to do this (and what it is) in C? Also, the only recursion (that I'm aware of) is through a tree data structure used to perform nearest neighbor searches. I don't see how to get rid of that recursion unless I back the tree w/an array or something.

-Eric
 

slugg

Diamond Member
Feb 17, 2002
4,723
80
91
Well when you have a recursive function, let's say f(a,b), you have 2 parameters. When function f calls itself, it pushes ALL state information to the call stack, which includes ALL of its local variables, the current state of its parameters, the program counter position, all relevant CPU registers, and so on. Really, the only truly important information is the data that you're working on, which is most likely the 2 parameters a and b. So essentially, C creates an implicit stack for you. Once the recursion ends, it pops the state of the previous calling instance of the function and continues to run until the stack is empty. As you can see, this process takes a lot of CPU time and wastes a lot of memory.

You could make your own stack ADT (abstract data type) for handling whatever data you're working on. Then, instead of recursively calling your algorithm, you just push data to the stack and let your algorithm iteratively solve the problem using popped units from the stack. Essentially, this is doing what C automatically does for you, except now the only info being moved around is essential info, as opposed to countless instances of CPU state and function state for EVERY recursive call. There's a lot less I/O being done this way. One thing that really slows down software is I/O operations... Not to mention, you'll be using up a lot less memory this way.

Simple Tree Example: depth-first traversal (pseudo code)

myDepthFirst(Tree myTree) //prints the value of each node in a depth-first traversal order
{
    Stack myStack;

    push(myStack, myTree); //push the root node onto the stack

    while (!isEmpty(myStack)) //while our stack is not empty
    {
        currentNode = pop(myStack); //get the last added node from the stack
        printf("%i\n", currentNode->value); //print its value

        if (currentNode->right != NULL)
        {
            push(myStack, currentNode->right);
        }
        if (currentNode->left != NULL)
        {
            push(myStack, currentNode->left);
        }
        //if both the left and right pointers are NULL, the node is a leaf, so do nothing
    }
}


edit: Nearest neighbor in what context? Shortest direct path, or shortest absolute path? Also, are we actually dealing with a tree, or are we dealing with a graph, or even a matrix? I ask because then I can get a good idea of what your algorithm is doing...

Or you could use Matlab and call it a day! ;)

edit 2: This style of coding will maximize CPU usage; however, your function block will take more memory than a recursive function block. I'd say that most likely, you'd actually get better performance out of this style. While a single function block would be smaller in the recursive version, keep in mind that you'll have multiple instances of it in the call stack. This causes many more memory reads/writes, which will lower the number of instructions the CPU will get, which will lower the amount of pipelining it can perform. Basically, the CPU will be stalled on memory left and right, so it'll just sit there idling while memory is being accessed.

This is all in theory, of course. The best way to tell is to benchmark your algorithms...
 

eLiu

Diamond Member
Jun 4, 2001
6,407
1
0
Originally posted by: slugg
Well when you have a recursive function, let's say f(a,b), you have 2 parameters. When function f calls itself, it pushes ALL state information to the call stack, which includes ALL of its local variables, the current state of its parameters, the program counter position, all relevant CPU registers, and so on. Really, the only truly important information is the data that you're working on, which is most likely the 2 parameters a and b. So essentially, C creates an implicit stack for you. Once the recursion ends, it pops the state of the previous calling instance of the function and continues to run until the stack is empty. As you can see, this process takes a lot of CPU time and wastes a lot of memory.

You could make your own stack ADT (abstract data type) for handling whatever data you're working on. Then, instead of recursively calling your algorithm, you just push data to the stack and let your algorithm iteratively solve the problem using popped units from the stack. Essentially, this is doing what C automatically does for you, except now the only info being moved around is essential info, as opposed to countless instances of CPU state and function state for EVERY recursive call. There's a lot less I/O being done this way. One thing that really slows down software is I/O operations... Not to mention, you'll be using up a lot less memory this way.

Simple Tree Example: depth-first traversal (pseudo code)

myDepthFirst(Tree myTree) //prints the value of each node in a depth-first traversal order
{
    Stack myStack;

    push(myStack, myTree); //push the root node onto the stack

    while (!isEmpty(myStack)) //while our stack is not empty
    {
        currentNode = pop(myStack); //get the last added node from the stack
        printf("%i\n", currentNode->value); //print its value

        if (currentNode->right != NULL)
        {
            push(myStack, currentNode->right);
        }
        if (currentNode->left != NULL)
        {
            push(myStack, currentNode->left);
        }
        //if both the left and right pointers are NULL, the node is a leaf, so do nothing
    }
}


edit: Nearest neighbor in what context? Shortest direct path, or shortest absolute path? Also, are we actually dealing with a tree, or are we dealing with a graph, or even a matrix? I ask because then I can get a good idea of what your algorithm is doing...

Or you could use Matlab and call it a day! ;)

edit 2: This style of coding will maximize CPU usage; however, your function block will take more memory than a recursive function block. I'd say that most likely, you'd actually get better performance out of this style. While a single function block would be smaller in the recursive version, keep in mind that you'll have multiple instances of it in the call stack. This causes many more memory reads/writes, which will lower the number of instructions the CPU will get, which will lower the amount of pipelining it can perform. Basically, the CPU will be stalled on memory left and right, so it'll just sit there idling while memory is being accessed.

This is all in theory, of course. The best way to tell is to benchmark your algorithms...

Wow, that's pretty cool. Actually it kind of blows my mind; I never thought to get around recursion in this way before. (To be honest I learned to program in Scheme/Lisp, so my first instinct is to recurse over everything!) Maybe I'm tired or just dumb, but when would this method fail? ...all I can think of is languages supporting something like 'continuations' in Scheme (sorta like persistent vars in matlab).

Another question: how important is the implementation of the stack? I have a linked list data structure with a fast node allocator (i.e. throws freed nodes onto a stack for future use), but it's built with an object-oriented theme. So for example: the user has no access to the nodes, but has to ask the master node for an iterator. Can I use this? Or should I build a more "barren" datatype?

MATLAB has its uses... this is not one of them ;) I'm solving systems of PDEs using the discontinuous Galerkin method. Having implemented this kind of algorithm in MATLAB before, I can tell you that even with sparse allocation & heavy vectorization, MATLAB is still about an order of magnitude slower than the equivalent MEX-ed code. Moreover MATLAB dies when you start working with a few hundred thousand unknowns, which is quite a small problem still. (And for this stuff, if you're MEX-ing & you have a C-based visualization tool, which I do, you might as well work in C.)

Oh and it's NearestNeighbor in the (x,y,z) closest Euclidean distance sense. I'm given a list of points (no particular order) and then for any given query point, I should be able to answer the question: 1) how many points are within a distance R of the query and/or 2) what is the nearest point to the query? Note: the 2nd case is much more common/important for me. (And for every data point that is found, I need to return some additional information about that point, but that's not really relevant.)

I'm using a KDTree to do this operation. Currently the data sets are not too huge, so I'm getting away with random insertion & no attention paid to balancing. In the future I'll sort the points & insert to guarantee balancing. Query points are set up so that each subsequent query will lie very close to the one preceding it. Thus I can initialize the KD-search (to better than blind) and hopefully reduce the number of nodes that I have to visit. (To promote this, I actually store the last couple of query results & initialize w/one of those.)

Edit: this isn't the end of the search either. What I'm really doing is trying to find the distance from any point in space to the nearest solid body (like an airplane). I have a higher order polynomial representation (e.g. spline) of the airplane geometry divided into many triangular surface elements. So the KD-search gets me the closest data point on the surface; then I use Newton optimization to compute the actual closest point in the locality of the closest data point.

LOL waaaaay off the original topic. Kind of hijacked my own thread.

I still want to hear about inlining if people have more to add!
 

slugg

Diamond Member
Feb 17, 2002
4,723
80
91
Well I have an exam in less than an hour. I'll come back and re-read your last post later...

As for your stack implementation, the abstraction you've done should be fine. You're basically just using a bunch of pointers, which the compiler will optimize pretty well, to the point that it's negligible. The only improvement I could think of is to NOT use a dynamically linked structure. Instead, use an array-based implementation. Why? Well, while you'll most likely be wasting some memory since you'll probably over-estimate the required size of the stack, you only need to allocate memory ONCE for the stack, so your allocation count goes down dramatically! Requesting memory from the OS is a pretty big slow-down. The downside to this implementation is that you need a semi-reliable way to predict the size of the stack. If your stack is full, reallocate it to twice the size. Usually with this method, you only allocate memory once, MAYBE twice...
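
Something like this (untested sketch; error checking omitted, names made up):

#include <stdlib.h>

typedef struct {
    void **items;
    size_t top, cap;
} Stack;

void stackInit(Stack *s, size_t guess)
{
    s->items = malloc(guess * sizeof *s->items); /* ONE up-front allocation */
    s->top = 0;
    s->cap = guess;
}

void stackPush(Stack *s, void *item)
{
    if (s->top == s->cap) { /* full: double it (rare if the size guess was decent) */
        s->cap *= 2;
        s->items = realloc(s->items, s->cap * sizeof *s->items);
    }
    s->items[s->top++] = item;
}

void *stackPop(Stack *s)
{
    return (s->top > 0) ? s->items[--s->top] : NULL;
}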

On my big projects, I use my own memory management. I allocate a giant block when the program starts, then use my own memory management techniques to use that huge block for my program. This way, very few "malloc" calls are made. The less you rely on the OS, the better off you are in terms of pure performance. Just like anything else though, there are pros and cons...
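
The crudest version of that idea looks like this (sketch only; a real one needs error handling and a freeing strategy):

#include <stdlib.h>

static char  *pool;       /* the one giant block grabbed at startup */
static size_t poolSize, poolUsed;

void poolInit(size_t bytes)
{
    pool = malloc(bytes);  /* ideally the only malloc in the whole program */
    poolSize = bytes;
    poolUsed = 0;
}

void *poolAlloc(size_t bytes)
{
    void *p;
    bytes = (bytes + 15) & ~(size_t)15;    /* keep 16-byte alignment */
    if (poolUsed + bytes > poolSize)
        return NULL;                       /* pool exhausted; handle for real in practice */
    p = pool + poolUsed;
    poolUsed += bytes;
    return p;
}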

Anyways, back to inlining... make 4 versions of your algorithm: As-is (recursive without inline), recursive with in-line functions, iterative (using stack) without inline, iterative WITH inline. Then just benchmark them. That's the only true answer, really. In theory, any of these could be argued to be "faster" in some way. I was just throwing out ideas...
 

degibson

Golden Member
Mar 21, 2008
1,389
0
0

Executive Summary 1
eLiu (OP): When to inline?
ObscureCaucasian: Small functions.
engineereeyore: Small functions. Watch your memory usage.
slugg: Inlining often speeds things up.
IHateMyJob2004: 2) inlining can slow your program down. (claims mad skillz)
Cogman: 2) Sometimes true, sometimes not true. Use profiling.

Ahh, inlining. That oft-misunderstood pseudo-feature of so many languages.
Common Misconceptions
Misconception: Putting inline on a function inlines it
Truth: inline is a hint. In some cases, compilers cannot inline code at all. For example: function class1::func1() in file file1.C cannot be inlined by most compilers if it is called from function class2::func2() in file file2.C.
Guidelines: Functions can be inlined if:
1) They are declared and implemented in source file myFile.C, and are only called from myFile.C. In this case, they should be 'static' in C++ or 'inline' in C.
2) If they are implemented at their declaration site in a class (C++ only)
3) If they are implemented as extern inline in a header file.
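
For example (sketch; myHeader.h/myFile.C are made-up names, and note that the exact 'extern inline' semantics differ between gnu89 and C99):

/* myHeader.h -- case 3: the definition is visible in every translation unit */
extern inline double dot3(const double a[3], const double b[3])
{
    return a[0]*b[0] + a[1]*b[1] + a[2]*b[2];
}

/* myFile.C -- case 1: a helper declared, implemented, and called only in this file */
static inline double normSquared(const double a[3])
{
    return dot3(a, a);
}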

Misconception: Only inline small functions.
Truth: It's OK to inline large functions if they're only called once or twice in static code. Doing so might just improve readability. It's not the size of the function that matters, it's the size of the total code added to the live static code footprint. It's just a tradeoff between instruction cache locality and branch target buffer affinity.

Misconception: Inline frequently-called functions.
Truth: The most frequently called functions are most likely to have a matching entry in the BTB, and are therefore most likely to be predicted correctly on modern CPUs. Inlining short yet less-frequently called functions will get you more bang for your buck.

Executive Summary 2
IHateMyJob2004: 1) modern compilers optimize better than humans
Cogman: 1) Not always true.
slugg: Great recursion-killing example.

I agree with IHateMyJob2004 about 80%, Cogman 99%, and I think slugg is just swell :).

First of all, modern compilers optimize better than 99% of humans 99% of the time for 99.99% of all high-performance code. I claim that ~1% of humans could sink effort into out-optimizing a compiler, that <1% actually do, and that the vast majority of all code is still machine-optimized *1*.

Second, compilers and humans optimize differently. Compilers transform code, while maintaining precisely the same operation at the language level. Compilers have a sadly limited view of the world -- they don't know the big picture of what the code is trying to do, just the small picture of how the code is trying to do it. Humans are good at things like slugg's manual stack management *2*, and actually so are some optimizing compilers -- provided they are allowed to exceed the strict definition of the language (e.g. parallelizing compilers).

*Footnote 1: Amdahl's law implies that not all code is equal: A human optimizing 1% of code can improve performance enormously, because 1% of code can account for an arbitrary portion of runtime. This is one reason why some people still do hand-optimizations.

*Footnote 2: Some humans, anyway.

OP: Another question: how important is the implementation of the stack? I have a linked list data structure with a fast node allocator (i.e. throws freed nodes onto a stack for future use), but it's built with an object-oriented theme. So for example: the user has no access to the nodes, but has to ask the master node for an iterator. Can I use this? Or should I build a more "barren" datatype?

There are few data structures worse for traversal performance than a linked list. The reasons are complicated, but for modern CPUs, they can find essentially zero parallelism in a linked list traversal because you have to find the next node before you can find the next next node, etc. Basically, their only utility should be as (de)queues.
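
To make the dependence concrete (illustrative sketch):

#include <stddef.h>

struct node { int value; struct node *next; };

long sumList(const struct node *n)
{
    long sum = 0;
    /* each load depends on the previous one: the CPU cannot start
       fetching n->next->next until n->next has arrived from memory */
    while (n != NULL) {
        sum += n->value;
        n = n->next;
    }
    return sum;
}

long sumArray(const int *a, size_t len)
{
    long sum = 0;
    size_t i;
    /* here the address of a[i+1] is known up front, so many loads
       can be in flight at once */
    for (i = 0; i < len; i++)
        sum += a[i];
    return sum;
}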
 

slugg

Diamond Member
Feb 17, 2002
4,723
80
91
Originally posted by: degibson
OP: Another question: how important is the implementation of the stack? I have a linked list data structure with a fast node allocator (i.e. throws freed nodes onto a stack for future use), but it's built with an object-oriented theme. So for example: the user has no access to the nodes, but has to ask the master node for an iterator. Can I use this? Or should I build a more "barren" datatype?

There are few data structures worse for traversal performance than a linked list. The reasons are complicated, but for modern CPUs, they can find essentially zero parallelism in a linked list traversal because you have to find the next node before you can find the next next node, etc. Basically, their only utility should be as (de)queues.

^^ For sure. But in a stack, you only traverse one node at a time, so that's not a big deal. The biggest hit comes from constantly allocating and freeing memory for every node :(
 

eLiu

Diamond Member
Jun 4, 2001
6,407
1
0
Originally posted by: degibson

Misconception: Only inline small functions.
Truth: It's OK to inline large functions if they're only called once or twice in static code. Doing so might just improve readability. It's not the size of the function that matters, it's the size of the total code added to the live static code footprint. It's just a tradeoff between instruction cache locality and branch target buffer affinity.

Misconception: Inline frequently-called functions.
Truth: The most frequently called functions are most likely to have a matching entry in the BTB, and are therefore most likely to be predicted correctly on modern CPUs. Inlining short yet less-frequently called functions will get you more bang for your buck.

There are few data structures worse for traversal performance than a linked list. The reasons are complicated, but for modern CPUs, they can find essentially zero parallelism in a linked list traversal because you have to find the next node before you can find the next next node, etc. Basically, their only utility should be as (de)queues.

Haha, the executive summaries are awesome.

Point 1: Yeah I think I mentioned in the OP that I have been using inline to break apart large functions & free up variable names. (e.g. I have a Newton optimization loop & then I may have to perform additional line searches. So I put the line search code, which is very similar to the Newton code, into a static inline function. It's long, but it's only called once so I figured it's no different than pasting it in myself.)

Point 2: Not sure what a BTB is, but I'm guessing it has to do with branching. My question would be: I've always heard that inlining reduces function call overhead b/c function calls involve a lot of 'behind the scenes' pushing/popping stuff onto/off the stack. How does this BTB thing save that overhead?

Point 3: I enqueue a list of elements attached to a particular node. Then I dequeue and operate extensively on each element. So I don't think any other data structures would be any faster/slower at this. I chose linked-list b/c I don't know beforehand how many elements will be attached.

slugg: So my linked-list object is implemented as follows: there's only one listnode struct. It has a void *data pointer for any data & a void (*destr)(void*) function pointer to the free() method for said data. I also maintain a list of unused listnodes. When a new listnode is alloc-ed, I first check to see if there are unused nodes available before alloc-ing a new one. Similarly when a node is destroyed, it will be added to the unused list for future usage (all in a thread-safe way w/posix library). So I'd have to allocate a new listnode all the way down to the max depth of the tree, but after that it's just pointer manipulations.
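
Roughly, it looks like this (sketch from memory; the real thing wraps these in posix mutexes):

#include <stdlib.h>

typedef struct listnode {
    void *data;               /* payload */
    void (*destr)(void *);    /* free() method for the payload */
    struct listnode *next;
} listnode;

static listnode *unused = NULL;  /* stack of recycled nodes */

listnode *nodeAlloc(void)
{
    /* reuse a recycled node if one is available; malloc only otherwise */
    if (unused != NULL) {
        listnode *n = unused;
        unused = n->next;
        return n;
    }
    return malloc(sizeof(listnode));
}

void nodeDestroy(listnode *n)
{
    if (n->destr != NULL)
        n->destr(n->data);    /* free the payload */
    n->next = unused;         /* recycle the node instead of free()-ing it */
    unused = n;
}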

Is this still bad?

Also: I don't have time to benchmark just now b/c my advisor wants other things done, but I'll add it to the list of todos.

Thanks for the tips/info so far everyone,
-Eric
 

slugg

Diamond Member
Feb 17, 2002
4,723
80
91
Well in theory, it's pretty dynamic, so it's not bad.

But also, in theory, you're making as many system calls as there are nodes in your stack. The whole idea of making an array-based implementation is that you call malloc() just ONCE. Relying on the operating system to allocate memory is pretty slow. It's more dependent on the number of allocations, not so much the size of the allocation.
 

JACKDRUID

Senior member
Nov 28, 2007
729
0
0
If I recall correctly, inline will actually create another copy of the function rather than a call to that function, therefore the compiled program will have a larger size.

Yet it will run faster since it doesn't have to look up/jump to the function.

The size/speed difference is negligible nowadays. The general rule is not to use it; let the compiler optimizer do its job.
 

Cogman

Lifer
Sep 19, 2000
10,284
138
106
Originally posted by: JACKDRUID
If I recall correctly, inline will actually create another copy of the function rather than a call to that function, therefore the compiled program will have a larger size.

Yet it will run faster since it doesn't have to look up/jump to the function.

The size/speed difference is negligible nowadays. The general rule is not to use it; let the compiler optimizer do its job.

Well, in theory the real speed up comes from not having to push/pop things onto the stack. However, there are times when inlining can make your code run slower. For example, if the function is too big, then the CPU instruction cache can be filled up, resulting in more trips out to main memory (slow).

As far as inlining goes, yeah, it is generally a good idea to let the compiler decide when to do it.
 

eLiu

Diamond Member
Jun 4, 2001
6,407
1
0
Originally posted by: Cogman
Originally posted by: JACKDRUID
If I recall correctly, inline will actually create another copy of the function rather than a call to that function, therefore the compiled program will have a larger size.

Yet it will run faster since it doesn't have to look up/jump to the function.

The size/speed difference is negligible nowadays. The general rule is not to use it; let the compiler optimizer do its job.

Well, in theory the real speed up comes from not having to push/pop things onto the stack. However, there are times when inlining can make your code run slower. For example, if the function is too big, then the CPU instruction cache can be filled up, resulting in more trips out to main memory (slow).

As far as inlining goes, yeah, it is generally a good idea to let the compiler decide when to do it.

How big is "too big"? Is that a matter of lines? How many continuous things there are uninterrupted by function calls? Something else entirely? I've heard this bit of advice before, but I've never understood what "too big" is.

-Eric
 

Cogman

Lifer
Sep 19, 2000
10,284
138
106
Originally posted by: eLiu
How big is "too big"? Is that a matter of lines? How many continuous things there are uninterrupted by function calls? Something else entirely? I've heard this bit of advice before, but I've never understood what "too big" is.

-Eric

It is system dependent. There is no way to say what too big is from one machine to the next. Number of lines is a worthless measurement, especially in higher-level languages (OK, C++ can still be considered fairly low level, but it is high level enough), as the compiler can do all sorts of things to make the code longer/shorter.

That's why inlining should generally be left up to the compiler as the compiler writers have done a pretty good job of determining how big too big is.

As I said before, the only way to know if inlining a function is good is to compare a program with the inlined function against one without it on the target machine. Anything else is just a crap shoot. (And really, this is generally the rule for all optimizations: write a good program, identify the major slowdowns via profiling, optimize, profile again, and see if there was any benefit from the optimization.)
 

slugg

Diamond Member
Feb 17, 2002
4,723
80
91
Originally posted by: Cogman
Originally posted by: JACKDRUID
If I recall correctly, inline will actually create another copy of the function rather than a call to that function, therefore the compiled program will have a larger size.

Yet it will run faster since it doesn't have to look up/jump to the function.

The size/speed difference is negligible nowadays. The general rule is not to use it; let the compiler optimizer do its job.

Well, in theory the real speed up comes from not having to push/pop things onto the stack. However, there are times when inlining can make your code run slower. For example, if the function is too big, then the CPU instruction cache can be filled up, resulting in more trips out to main memory (slow).

As far as inlining goes, yeah, it is generally a good idea to let the compiler decide when to do it.

A full instruction cache can be a good thing in some cases, like when I/O is kept to a minimum and CPU usage is extremely high, with a high level of pipelining. Unfortunately, there's no way to directly control what instructions hit the CPU unless you're programming in ASM...

That's why I think trial/error in the form of benchmarking is the way to go.
 

eLiu

Diamond Member
Jun 4, 2001
6,407
1
0
Ok, cool. I'll start the benchmarking process just as soon as everything works... haha :)

Somewhat unrelated: slugg, did you have any tips/hints for the Nearest Neighbor searching?
 

slugg

Diamond Member
Feb 17, 2002
4,723
80
91
Oh right. Let me re-read that post and see if I can make sense out of it this time around. LOL!

edit: okay so if I remember correctly, a nearest-neighbor search on a KD-Tree is pretty much just a fancy version of a depth-first traversal. While my previous example is only in a simple binary tree, the concept remains. Your algorithm is pretty trivial - you can't do much with it to optimize it in terms of algorithmic complexity. However, if you handle recursion with your own stack, you _could_ potentially see an improvement. But whether or not to inline appropriate functions? Meh, trial and error. Your code SHOULD be modular enough to have all sorts of different implementations at the same time anyways ;)

Just wondering... did you derive your nearest-neighbor run time in big-oh? If so, what is it? I could check some textbooks to see if I can find something better. I haven't worked with KD structures at any point in my life, LOL.

What is this software for, anyways?

edit 2: Let's say that for some reason, you actually have faster performance while using the recursive version of your algorithm. If you keep BOTH versions, your iterative version will have a much smaller memory overhead, so in the event of a FAILURE (exceeded program stack, for example), your program could catch the overflow and try the alternate, less memory intensive version of the algorithm. So really, you won't be wasting your time anyways. At the very least, you'd be making your software more tolerant and robust!
 

eLiu

Diamond Member
Jun 4, 2001
6,407
1
0
Yeah it's O(log(N)^d) to perform a search, where d is the # of space dimensions, and N is the number of data points. This is assuming that the tree is reasonably balanced. In the worst case, I'm looking at k*N^(1-1/k). But with my ad hoc reinitialization, I hit the optimal scaling consistently even without proper tree balancing. But yes, it's nothing more than a glorified binary tree. (Note: I've only had 1 undergrad class in algorithms, and my background is applied math/aerospace engineering so this is hardly my field. Be nice if I messed up :p)

But yes, the code is modular enough for me to switch up the way all the nontrivial tasks are performed.

The only data structure I know of that's asymptotically better is the Range Tree. It's O(log(N)) straight up. But it requires O(N log N) storage and seemed a lot harder to implement. A Range tree is slightly more than a glorified binary tree; it's more like a bunch of trees living at the nodes of the 'master' tree. If the bigger 3D cases get unwieldy, I was going to try Range Trees. There are also a number of methods that return good approximations, but that's unacceptable here; I need the -exact- right answer.

The KD search is part of a solver for systems of PDEs. The PDE code is a discontinuous Galerkin finite element method (if that means anything to you), which we can use to solve anything from convection-diffusion to Euler, Navier-Stokes, and beyond. The KD search is specifically meant for RANS: Reynolds-averaged Navier-Stokes. RANS is meant to model turbulence in a more economical way. The RANS equations require you to know the distance to the nearest solid wall from any point in the flow field. In 3D this is a fairly nontrivial problem b/c the walls are in general curved. The two dominant methods for finding the wall-distance are 1) intelligent searching (e.g. KD trees), or 2) solving a level-set problem (or a heavily modified Poisson problem) for the "distance field" all at once. The modified Poisson problem is painfully inaccurate. The level-set method is somewhat difficult to code optimally; its optimal case should be on par with KD search. So I chose method 1.

Edit: Huuuhhh? How would I catch & act on an error where I've exceeded the program stack?

Also, eventually the full 3D cases will involve billions of points and the RANS solver will run in parallel, so I'm sure I will have to move to managing my own stack (thanks so much for showing me that) & I honestly don't know how I'm even going to deal with the KD search in parallel... 0 experience there.
 

slugg

Diamond Member
Feb 17, 2002
4,723
80
91
Yea, O(log(N)^d) is optimal. I would have used K instead of D... lol

I mean the only other thing left I could think of is to break your data set up into discrete sets and be able to ignore irrelevant sets. I don't know if this is a possibility for your application, since most of it is over my head, but I've used this method in computer vision with good results.

What are you studying/researching, anyways? Seems really interesting.
 

Cogman

Lifer
Sep 19, 2000
10,284
138
106
Originally posted by: slugg
Originally posted by: Cogman
Originally posted by: JACKDRUID
If I recall correctly, inline will actually create another copy of the function rather than a call to that function, therefore the compiled program will have a larger size.

Yet it will run faster since it doesn't have to look up/jump to the function.

The size/speed difference is negligible nowadays. The general rule is not to use it; let the compiler optimizer do its job.

Well, in theory the real speed up comes from not having to push/pop things onto the stack. However, there are times when inlining can make your code run slower. For example, if the function is too big, then the CPU instruction cache can be filled up, resulting in more trips out to main memory (slow).

As far as inlining goes, yeah, it is generally a good idea to let the compiler decide when to do it.

A full instruction cache can be a good thing in some cases, like when I/O is kept to a minimum and CPU usage is extremely high, with a high level of pipelining. Unfortunately, there's no way to directly control what instructions hit the CPU unless you're programming in ASM...

That's why I think trial/error in the form of benchmarking is the way to go.

mmm, IDK. The CPU is going to try and keep the instruction cache full no matter what; the slow-down comes in when it has to evict part of the cache to make room for the new incoming instructions (think looping with a huge inlined function). Cache misses will kill any program; by inlining, you are essentially telling the CPU which code to keep in the cache. If it makes the wrong choice on which code to grab, then you suffer from the cache misses that follow each loop.

This is somewhat the situation that leads to inlining slowing down performance rather than speeding it up.

As for pipeline optimizations, well, I would leave those completely up to the compiler. Compilers really do a pretty good job of making code pipeline friendly (with the optimization flags of course).

The only places I would really recommend breaking out the good 'ole ASM are: using specialized instructions, optimizing register usage (compilers can sometimes be stupid in the way they use registers), or gaining minor speed increases, i.e. if data is being spit out in EAX, don't shove it into ESI only to later push it back into EAX.

Even then, I would use the compiler generated ASM as a basis for my modifications.
 

eLiu

Diamond Member
Jun 4, 2001
6,407
1
0
Cogman: so most of that last post went *whoosh* on over my head. If I wanted to take some classes or do some reading or whatnot to improve my understanding of how CPUs & compilers work... any suggestions?

In the meanwhile, sounds like I should just chill out as part of that 99% of people who don't know wtf they're doing & should let -O1, -O2, or -O3 handle it, lol.

slugg: haha it's true, when I worked it out, I used k. But I'm so used to using "d" (or rather, "dim") as my dimension variable that I kinda switched it up in my head.

I'm not sure how I'd break the data into discrete sets. I can't make any assumptions about the input geometry (could be an airplane, just a wing, maybe an engine, rocket, or helicopter or something else entirely). If I could make assumptions about the geometry, then yeah I see what you mean... like if I can determine I'm above the wing, then I should ignore all the points on the bottom surface. Tragically this isn't possible (yet).

I'm a grad student in aerospace engineering. In particular I work on problems of computational fluid mechanics. Getting more precise, I'm interested in higher order methods for solving [systems of] partial differential equations with some emphasis on shock problems. And by higher order, roughly I mean: so you know that (u2-u1)/dx is approximately the derivative at u1 as u2->u1 & dx->0. But the error in this approximation is O(dx). So what if instead of a line, I used a parabola? Well there are many choices, but a symmetric one is: (u2-u0)/(2*dx) is the deriv at u1 if u2 and u0 are equidistant from u1. The error here is O(dx^2). This is a rather simple example that, due to the problems with equidistant polynomial interpolation, does not extend very far. But there are shitloads of other methods/ideas out there for increasing accuracy by applying higher order approximations (polynomial and otherwise; I work mostly with polynomial). As an extension of this, I'm also interested in correct estimation of error & methods to reduce it.
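
If you want to see the orders in action, here's a toy check (d/dx of sin at x=1, exact answer cos(1)):

#include <math.h>
#include <stdio.h>

int main(void)
{
    double x = 1.0, exact = cos(x), dx;
    for (dx = 1e-1; dx > 1e-6; dx /= 10.0) {
        double fwd = (sin(x + dx) - sin(x)) / dx;            /* error ~ O(dx)   */
        double ctr = (sin(x + dx) - sin(x - dx)) / (2 * dx); /* error ~ O(dx^2) */
        printf("dx=%8.1e  fwd err=%9.2e  ctr err=%9.2e\n",
               dx, fabs(fwd - exact), fabs(ctr - exact));
    }
    return 0;
}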

Most commercial codes are only 2nd order accurate. Even with a crapload of points, you're still looking at something that isn't very good. And since we don't ever have an exact solution, it's actually quite hard to quantify who is "more right". A recent survey of a bunch of CFD codes simulating drag on a wing+tube airliner body indicated that there's actually a pretty large amount of variation between the various groups submitting results... and with such big scatter & no exact answer, who's to say who's more accurate? So the lab I work with concentrates on 1) getting provably higher order accurate schemes and 2) quantifying what the error is.

I mean I think this stuff is pretty exciting. It's not going to save lives or make the world a better place any time soon. In 20 years, the type of method I concentrate on might not even be used anymore (luckily for me you can't really just learn 1 method... gotta get something of everything)... but we think that discontinuous Galerkin finite elements is the future of high accuracy computer simulation :)