
assembly question

Cogman

Lifer
Just wondering, is it considered bad practice not to set up the stack frame pointer when you enter a function? I'm writing straight ASM (i.e. using MASM to assemble, not a C compiler). I will usually only set up the stack frame pointer if I actually use the stack (and really, only when I am loading variables off the stack after using it).

With GCC, MSVC, etc., the stack frame pointer is ALWAYS set up. For me, however, a function like

Code:
func:
   push ebp
   mov ebp, esp
   xor eax, [ebp + 8]
   pop ebp
   ret

doesn't really need the whole stack frame setup. It would do fine with

Code:
func:
   xor eax, [esp + 4]
   ret

I can see the use if a function is overly complex, with lots of stack references and multiple references to the passed-in variables; I just don't see the point when the stack is rarely or never used.
 
Well, you need to at the very least push the old base pointer onto the stack, so that when you return it is popped off the stack, telling you where to return. Outside of that, given that you are manually managing registers since this is assembly code, there wouldn't be anything that could be overwritten.

If you do not need the stack pointer or anything, you could consider not making this code a new function and, in essence, inlining it. Is there any reason why you have it as a separate function in the first place?

You can, however, get away with not setting up the frame pointer (ebp) - in fact, I believe gcc has a switch (-fomit-frame-pointer) you can use to do that very thing.

-Kevin
 
Just wondering, is it considered bad practice not to set up the stack frame pointer when you enter a function?

It's great practice. That is, it's purely a performance optimization (so make sure you're working in a domain where that matters, of course), but this kind of optimization is one of the reasons to know and love assembly.

Speaking of which, why am I saying this and not Ken?
 
It's great practice. That is, it's purely a performance optimization (so make sure you're working in a domain where that matters, of course), but this kind of optimization is one of the reasons to know and love assembly.

Speaking of which, why am I saying this and not Ken?

Is this not, essentially, an example of the benefits of in-lining the function?

Sure, there is a PC-relative jmp to a new memory location, but if you never allocate stack space and use only values already in registers, this code is, in effect, inlined (save for the jump).

Additionally, I know that the gcc switch '-fomit-frame-pointer' (when writing in C/C++ instead of assembly) tells the compiler to omit the base pointer and use the stack pointer for all addressing within that frame. I would imagine that omitting both is impossible for a compiler and only possible if you write in straight assembly code.

-Kevin
 
It's great practice. That is, it's purely a performance optimization (so make sure you're working in a domain where that matters, of course), but this kind of optimization is one of the reasons to know and love assembly.

Speaking of which, why am I saying this and not Ken?

OK, just making sure I'm not breaking the universe. I couldn't think of anything I was doing that would harm things; however, I always feel nervous when dealing with assembly. Even when my code runs perfectly I have this feeling of impending doom.

I saw it as a fairly good performance optimization, especially for smaller functions (though I haven't run a single benchmark; I'm just assuming it would be), and wondered if there was a good reason for one of my instructors to say that you should always set up the stack frame pointer.

Gamingphreek,
The return address is automatically pushed onto the stack when you hit the call instruction. No need to push ebp as you aren't changing it or using it.

I think I misspoke; I meant base pointer when I said stack frame pointer. I knew what I meant (ebp).

This was obviously just a simple example. However, there are places where functions are slightly more complex, yet not complex enough to warrant setting up a stack frame. Inlining is not always a good thing, as it can lead to slower code.
 
OK, just making sure I'm not breaking the universe. I couldn't think of anything I was doing that would harm things; however, I always feel nervous when dealing with assembly. Even when my code runs perfectly I have this feeling of impending doom.

I saw it as a fairly good performance optimization, especially for smaller functions (though I haven't run a single benchmark; I'm just assuming it would be), and wondered if there was a good reason for one of my instructors to say that you should always set up the stack frame pointer.

Gamingphreek,
The return address is automatically pushed onto the stack when you hit the call instruction. No need to push ebp as you aren't changing it or using it.

I think I misspoke; I meant base pointer when I said stack frame pointer. I knew what I meant (ebp).

This was obviously just a simple example. However, there are places where functions are slightly more complex, yet not complex enough to warrant setting up a stack frame. Inlining is not always a good thing, as it can lead to slower code.

By inlining, I was referring to the principle rather than the act.

While this is still a subroutine, given that you don't set up a new frame, it has essentially all the benefits of inlining with none of the disadvantages, correct? This, of course, assumes that you can manage to fit everything in registers and don't have to allocate stack space (since you have none).

Ah - I knew x86 automatically pushed the return address at some point 🙂 - I may have been thinking of MIPS instead. I'm curious as to what my Professor would have to say about the potential performance advantages on this as well - learning more asm techniques is fun 🙂

-Kevin
 
While this is still a subroutine, given that you don't set up a new frame, it has essentially all the benefits of inlining with none of the disadvantages, correct? This, of course, assumes that you can manage to fit everything in registers and don't have to allocate stack space (since you have none).

Well no, all he's doing is avoiding setting up ebp, which just saves a single push/mov/pop. He still has an argument on the stack (esp + 4) that must be pushed and popped by the caller. call/ret also push/pop the return address.

Not using ebp is pretty common practice in x86 assembly. The frame pointer exists mainly so you can avoid error-prone esp-based math, but if esp never changes there is little point in using ebp. Unless you're calling some code that requires a frame pointer (e.g. calls longjmp or alloca, or uses SEH on Windows), I would definitely just omit it.
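
To make the "error-prone esp-based math" concrete, here's a minimal sketch (offsets assume a single dword argument under the usual cdecl layout):

Code:
func:
   mov eax, [esp + 4]   ; on entry, the arg is at esp+4 (esp+0 holds the return address)
   push ebx             ; esp just moved down by 4...
   mov ecx, [esp + 8]   ; ...so the SAME arg is now at esp+8
   pop ebx
   ret

With ebp set up, the argument would stay at [ebp + 8] no matter what gets pushed in between; without it, every push/pop silently shifts all your esp-relative offsets.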
 
Next question, would something like this

Code:
   push retAddr
   jmp func
   retAddr:

be faster than this

Code:
   call func

due to pipelining? A call takes 2 clocks to complete (according to some Intel book I have, though it may not be accurate) while a push and a jmp take one clock each. However, the push and jmp can be issued into the pipeline in less than one clock each (I'm still not sure how this works), effectively giving you a call in less than 2 clock cycles.
 
I asked my professor and he agreed with you guys.

He actually also stated that compilers often do this as well (He gave an example using the command 'gcc -S -O3 -fomit-frame-pointer').

Furthermore, he said that the x86_64 ABI doesn't even require saving $rbp anymore, given that the base pointer is mainly used for debuggers. It does, however, keep an on-the-side table to keep track of what's on the stack.

As for the push/jmp vs. call question, I would be interested in this as well.

As for execution in less than one cycle, wouldn't it be because neither utilizes every stage in the pipeline? Thus, even though they can execute in less than one clock, that cycle will still take the same amount of time, since the cycle needs to stabilize. I could be very wrong though...

-Kevin
 
Speaking of timing of cycles, I just finished up coding an IR remote decoder for ARM. You only get 10 ms to receive and record some signals and then to break them down, and the timing has to be near perfect or you get all kinds of errors. Add one instruction and the whole program goes out of sync.

Here is what I usually use as my guide for timing on x86. This was current as of the Pentium 4, which was about the last x86 CPU that I used ASM on.

call  1-5 cycles
jmp   1-2 cycles
push  1-2 cycles
ret   1-4 cycles
 
Speaking of which, why am I saying this and not Ken?
Because you spoiled me when you introduced me to those compiler intrinsics. 😛

Also because I've been busy working on Greasemonkey scripts in the Technical Forum Issues forum.

Modelworks, I wouldn't ever use a Pentium 4 as a basis for timing anything that was to be run on any modern processor. (Notice I didn't say "other"? 😉) I generally use the tables at http://www.agner.org/optimize to find instruction timings.
 
With GCC, MSVC, etc., the stack frame pointer is ALWAYS set up.
Compilers fear code that isn't based on stacks. A HUGE amount of compiler theory is based on stack theory.

RE: Timing in cycles
Take every timing guide with a grain of salt on x86 parts... the out-of-order parts are VERY hard to time because they're... well... out of order! It's very hard to reason about some static instruction taking X cycles when the underlying hardware is reordering it, subject to unseen memory-ordering dependencies.
 
Next question, would something like this

Code:
   push retAddr
   jmp func
   retAddr:

be faster than this

Code:
   call func

due to pipelining? A call takes 2 clocks to complete (according to some Intel book I have, though it may not be accurate) while a push and a jmp take one clock each. However, the push and jmp can be issued into the pipeline in less than one clock each (I'm still not sure how this works), effectively giving you a call in less than 2 clock cycles.

Bad idea. Modern CPUs use a return address stack to predict the target of ret instructions. Mismatching calls and returns will break the predictor, leading to mispredictions and the associated performance degradation.

I also imagine that call and push/jmp would decode to similar or identical micro-ops, and if anything call would be faster. call is functionally identical to push/jmp, so I doubt that it would be slower on any microarchitecture that isn't fundamentally broken.
 
Bad idea. Modern CPUs use a return address stack to predict the target of ret instructions. Mismatching calls and returns will break the predictor, leading to mispredictions and the associated performance degradation.

I also imagine that call and push/jmp would decode to similar or identical micro-ops, and if anything call would be faster. call is functionally identical to push/jmp, so I doubt that it would be slower on any microarchitecture that isn't fundamentally broken.

I didn't know about the return address stack.

As for the "similar or identical" comment: there are a couple of instructions in the x86 architecture that do exactly the same thing (enter, leave); however, most will avoid using them because it is faster to just do the push and mov instead of the equivalent enter/leave instructions, due to pipelining.
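
For reference, the enter/leave equivalence mentioned above, sketched out (enter also takes frame-size and nesting-level operands, which are zero here):

Code:
; 'enter 0, 0' is roughly:
   push ebp
   mov ebp, esp

; 'leave' is roughly:
   mov esp, ebp
   pop ebp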
 
I didn't know about the return address stack.

As for the "similar or identical" comment: there are a couple of instructions in the x86 architecture that do exactly the same thing (enter, leave); however, most will avoid using them because it is faster to just do the push and mov instead of the equivalent enter/leave instructions, due to pipelining.

True, although not really applicable to call. Simple instructions like call are directly decoded into micro-ops, but complex "CISC-y" beasts like enter or loop have to be sent to microcode, which incurs a significant performance penalty. Leave is actually faster on some architectures (AMD).

My point was mainly that call could be made as fast as or faster than an explicit push/jmp at the microcode level, and given its ubiquity it would be odd to not do so. I should have stated it more clearly before.
 
I didn't know about the return address stack.

Rant warning

Some modern CPUs also have a branch target buffer (BTB), for predicting the target of register-indirect branches. Stuff of the following flavor (please forgive the SPARC -- I don't know any x86 black magic to conjure a register-indirect jump):

jmp %o7
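
(For what it's worth, no black magic is needed on x86: a register-indirect jump is just e.g. jmp eax. A switch-style dispatch might look like the following sketch, where table and CASE_COUNT are made-up names:)

Code:
   cmp eax, CASE_COUNT      ; bounds-check the switch value
   jae default_case         ; fall back if out of range
   jmp [table + eax*4]      ; register-indirect jump through a jump table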

However, both BTBs and RASs (return address stacks) work less well than a regular pc-relative branch predictor, because it turns out that call stacks are pretty hard to predict: register-indirect branches happen for things besides function calls (e.g., switch statements), exceptions happen, the RAS can overflow/underflow, the BTB can overflow, etc.

In short, these predictors exist, but function returns and register-indirect branches are more likely to be mispredicted than pc-relative branches. In my experience, branch prediction accuracy is pretty much the #1 limiting factor of processor performance from the architectural perspective -- even memory latency could be tolerated if we had better branch predictors.
[/rant]
 