C/C++ prefix vs postfix increment


engineereeyore

Platinum Member
Jul 23, 2005
2,070
0
0
Originally posted by: mlm
But it wouldn't be just a pointer. The compiler would invoke that object's copy constructor to create the temp. If you look at the code that I posted from the compiler, the postincrement takes the int by value, not by reference.

I'm not quite sure how you could do a return before finishing the increment. I've hardly worked with concurrency (or whatever it's called), but from what I've seen that doesn't seem possible.

Why would it create an entire copy? All that's being returned is a pointer to the location, whether you do a postfix or a prefix. If an entire copy is being made, then what you've got is one screwed-up compiler. I've never written a compiler that would do something that stupid. Doesn't mean they don't exist, 'cause I'm sure they do. But a compiler should be smart enough not to make a copy at all, and if it does, to only copy the pointer.

As for the concurrency, I could explain Tomasulo's algorithm if you want, but it's rather involved and somewhat off topic. I'd be happy to PM you if you'd like though.
 

engineereeyore

Platinum Member
Jul 23, 2005
2,070
0
0
Originally posted by: smack Down
mlm is correct with respect to objects because the standard uses a hack to make it faster.
With an int, int++ returns a copy of the initial value and ++int returns a copy of the final value; object++ returns a copy of the initial value, but ++object returns a reference to the final value.


The int in "_Myt operator++(int)" is another hack. The int isn't used for data; it's only there to make the function signature different between the pre and post forms.

Well, if that is the case, some software engineer foobar'd that problem big time. Fact still remains that even with the temp, the postfix is going to run faster on any modern microprocessor.
 

smack Down

Diamond Member
Sep 10, 2005
4,507
0
0
Originally posted by: engineereeyore
Originally posted by: smack Down
mlm is correct with respect to objects because the standard uses a hack to make it faster.
With an int, int++ returns a copy of the initial value and ++int returns a copy of the final value; object++ returns a copy of the initial value, but ++object returns a reference to the final value.


The int in "_Myt operator++(int)" is another hack. The int isn't used for data; it's only there to make the function signature different between the pre and post forms.

Well, if that is the case, some software engineer foobar'd that problem big time. Fact still remains that even with the temp, the postfix is going to run faster on any modern microprocessor.

No it isn't, because the act of copying the object is going to take so many instructions that it will saturate the instruction window.

It really is academic, because no one implements pre- and post-increment differently.
See http://www.sgi.com/tech/stl/InputIterator.html for an example of how ++ is implemented.
 

engineereeyore

Platinum Member
Jul 23, 2005
2,070
0
0
Originally posted by: smack Down
No it isn't, because the act of copying the object is going to take so many instructions that it will saturate the instruction window.

It really is academic, because no one implements pre- and post-increment differently.
See http://www.sgi.com/tech/stl/InputIterator.html for an example of how ++ is implemented.

Yes it is, because the copy instruction is independent of the other instructions, so they can execute in parallel. The prefix version is not. I can issue the copy instruction and continue right on trucking. Without it, I have to sit there and wait for the increment operation to complete before I can do anything else.
 

Markbnj

Elite Member, Moderator Emeritus
Moderator
Sep 16, 2005
15,682
14
81
www.markbetz.net
Originally posted by: engineereeyore
Originally posted by: dighn
The RAW/RAR issue is significantly overshadowed by the temporary creation. You are looking at this from a very low level, which doesn't really apply to the object case. The --/++ operations are proper, actual function calls, and the way the language is structured necessarily requires the construction of a temporary object for the post-operator. Object creation is always relatively expensive, certainly when compared to the minute delay caused by data hazards.

PrintElement(i++); --> the compiler calls the ++ function, which must increment i but also return the original value of i for passing into PrintElement() after the increment. A copy must be made.

PrintElement(--i); --> the -- function is called, which decrements i but can return the current state of the object.

But that is not at all true. When the processor creates a copy, it's not copying the entire data structure. i is nothing more than a pointer. Copying a pointer is a very simple process. You can't increment an object, only a pointer to an object. Therefore, what you are copying is a pointer and that operation, on current processors, is almost not even worth considering. Therefore, the RAR and RAW become very much a big deal as all that is copied is the pointer, not the structure.

I've asked this question of several other computer architects, many with Ph.D.s, and they all agree. Prefix does not function faster. To a software person it may appear so because of the use of a temporary, but to a hardware person who knows how these things are handled by the processor, this claim is absolutely false.

The thing you're missing, because you are indeed looking at it too far down the stack :), is that in C++ there are semantics around the lifecycle of objects that will be enforced by the compiler before the assembler gets a crack at the code. Various situations force the compiler to create temporaries. Sometimes that is the desired behavior, and at other times it is an unwitting side-effect.

I taught C++ for several years, and although things have gotten munged up in commentary since the OP's post, I can guarantee you that what was being referred to was the very common mistake of unwittingly creating a temporary copy of a user-defined class that has a copy constructor or default constructor, due to incautious use of operators or pass-by-value semantics.

In such cases the compiler will make calls to the constructors to create the temporary copy, and those calls can entail quite a bit of overhead. Needless to say none of this matters much if you do it once, but if you do it in a tight loop there can be significant performance impacts.
 

dighn

Lifer
Aug 12, 2001
22,820
4
81
Originally posted by: engineereeyore
When the processor creates a copy, it's not copying the entire data structure. i is nothing more than a pointer. Copying a pointer is a very simple process. You can't increment an object, only a pointer to an object. Therefore, what you are copying is a pointer and that operation, on current processors, is almost not even worth considering. Therefore, the RAR and RAW become very much a big deal as all that is copied is the pointer, not the structure.

I've asked this question of several other computer architects, many with Ph.D.s, and they all agree. Prefix does not function faster. To a software person it may appear so because of the use of a temporary, but to a hardware person who knows how these things are handled by the processor, this claim is absolutely false.

I am familiar with the hardware details. But you are missing the software perspective. Temporary creation is expensive. There is a stack allocation, there is a constructor call, there is the copy on return, and for any non-trivial object there is more than just a pointer copy going on. It is NOT just a couple of processor instructions.

 

mlm

Senior member
Feb 19, 2006
933
0
0
Originally posted by: engineereeyore
Originally posted by: mlm
But it wouldn't be just a pointer. The compiler would invoke that object's copy constructor to create the temp. If you look at the code that I posted from the compiler, the postincrement takes the int by value, not by reference.

I'm not quite sure how you could do a return before finishing the increment. I've hardly worked with concurrency (or whatever it's called), but from what I've seen that doesn't seem possible.

Why would it create an entire copy? All that's being returned is a pointer to the location, whether you do a postfix or a prefix. If an entire copy is being made, then what you've got is one screwed-up compiler. I've never written a compiler that would do something that stupid. Doesn't mean they don't exist, 'cause I'm sure they do. But a compiler should be smart enough not to make a copy at all, and if it does, to only copy the pointer.

You don't always want a pointer though. Take the following example:

#include <iostream>
using namespace std;

int main() {
    int i = 1;
    int j = i++;
    cout << i << " " << j << endl;  // output: 2 1
}

You could argue that the compiler would take the initiative to separate the increment and assignment into two lines, but that's the compiler's prerogative. As far as the code is concerned, the standard behavior is to return a copy.

As for the concurrency, I could explain Tomasulo's algorithm if you want, but it's rather involved and somewhat off topic. I'd be happy to PM you if you'd like though.

As far as I know from a coding perspective, it "just doesn't work that way." Unless you use threads, everything is expected to be executed in sequence, line by line, in the order that they're called. I don't doubt that it could get more complicated than that, but that really has nothing to do with ++/--.
 

engineereeyore

Platinum Member
Jul 23, 2005
2,070
0
0
Okay, it is obvious that a hardware lesson is needed because you guys just aren't getting it. First and foremost, looking at this at the hardware level is NOT too low level. How do you measure a software program? How do you say one software implementation is faster than another? By the size of the code? By the number of instructions? NO. You measure it according to how quickly it runs on the hardware. Therefore, if you want to know if ++i is really faster than i++, you have to look at it at the hardware level. Assembly code is not going to tell you anything.

As for the hardware, for simplicity's sake, let's use a dual-core processor, and then we'll move on to single core. For your ++i implementation, this is how it will work.

CPU1 | CPU2
increment value 1 |
increment value 2 |
increment value .... |
wait until all increments have completed |
.... |
.... |
return value |

Now, consider the use of a temp

CPU1 | CPU2
copy value 1 | increment value 1
copy value 2 | increment value 2
copy value .... | increment value ......
return value | wait for all increments to complete

(EDIT: Sorry, spaces didn't appear. The | marks operations on different CPUs.)

Now, which one completes first? The second implementation, used with i++. And contrary to popular belief, you don't need to thread the process for this to work. The CPU is capable of identifying independent operations and executing them concurrently. Don't have a dual core? A hyper-threaded computer will do the same thing. Thanks to Tomasulo's algorithm, which I believe every microprocessor architecture since the Pentium Pro has implemented in order to perform out-of-order execution and out-of-order completion, a hyper-threaded computer can take in more than one instruction at a time, and can therefore still execute the above code as shown.

Now if you want to tell me again how exactly it is that ++i executes faster, I'm all ears. As for the original statement, that prefix increment can be faster than postfix but never slower, this is completely wrong. Consider:

j = array[i++] vs. j = array[++i] (ignore that these are two different values)

add temp,i,0 ; store i in temp
inc i ; increment i
st j, array(temp) ; store result in j

vs.

inc i
st j, array(i)

Which one happens faster? Neither if your compiler sucks. If the compiler is doing its job, the first one will execute faster due to the RAW stall of the second. So the stalls are exceptionally important.

Point is, the only metric for measuring how quickly code executes is in clock cycles, and you can't determine clock cycles if you don't understand the hardware. Feel free to disagree if you like, but you'll be wrong if you do so. I have a Master's in Computer Engineering with a specialty in processor architecture. I have several other people with Ph.D.s in Computer Engineering who concur with what I have said. If you want to ask ANY other computer engineer with such specialties, feel free. They'll tell you the same thing. The claim that ++i may occasionally run faster than i++ may have been valid in older architectures, but like I said, this is no longer the case.
 

dighn

Lifer
Aug 12, 2001
22,820
4
81
You just ignored everything we said. I am not ignorant of the details of processor design, as I did graduate in CE. Looking at that fine a level does not apply when the variable is an object. Object instantiation is a complex process that weighs vastly more than instruction-level delays. Before you criticize compiler designs, remember there are reasons for this level of complexity; most software isn't written at the assembler level, after all.

Just to test this, I made a short program:

#include <windows.h>
#include <tchar.h>
#include <iostream>
#include <vector>
using namespace std;

int _tmain(int argc, _TCHAR* argv[])
{
    DWORD first, last;
    vector<int> temp(1000000);

    first = GetTickCount();
    for (vector<int>::iterator it = temp.begin(); it != temp.end(); it++)
    {
    }
    last = GetTickCount();
    cout << (last - first) << endl;

    return 0;
}

++i takes 735ms on my machine to complete, while i++ takes over 2000ms. This is with VC++7
 

engineereeyore

Platinum Member
Jul 23, 2005
2,070
0
0
Originally posted by: dighn
You just ignored everything we said. I am not ignorant of the details of processor design, as I did graduate in CE. Looking at that fine a level does not apply when the variable is an object. Object instantiation is a complex process that weighs vastly more than instruction-level delays. Before you criticize compiler designs, remember there are reasons for this level of complexity; most software isn't written at the assembler level, after all.

Just to test this, I made a short program:

int _tmain(int argc, _TCHAR* argv[])
{
DWORD first, last;
vector<int> temp(1000000);

first = GetTickCount();
for(vector<int>::iterator it = temp.begin(); it != temp.end(); it++)
{
}
last = GetTickCount();
cout << (last-first) << endl;

return 0;
}

++i takes 735ms on my machine to complete, while i++ takes over 2000ms. This is with VC++7

:laugh: ARE YOU SERIOUS!?!?! GET TICK COUNT????? Do you even know what that does? It returns the number of milliseconds that have passed. What in the hell does that have to do with how many clock cycles a portion of code has taken? Any number of things could affect that count. Context-switches, page faults, cache misses. Why don't you tell me exactly how much of that time was actually used to run the program, cause GetTickCount sure isn't going to tell you.

Thanks for playing though. Oh, and thanks for ignoring everything I stated as well.
 

Templeton

Senior member
Oct 9, 1999
467
0
0
Originally posted by: engineereeyore
Originally posted by: dighn
You just ignored everything we said. I am not ignorant of the details of processor design, as I did graduate in CE. Looking at that fine a level does not apply when the variable is an object. Object instantiation is a complex process that weighs vastly more than instruction-level delays. Before you criticize compiler designs, remember there are reasons for this level of complexity; most software isn't written at the assembler level, after all.

Just to test this, I made a short program:

int _tmain(int argc, _TCHAR* argv[])
{
DWORD first, last;
vector<int> temp(1000000);

first = GetTickCount();
for(vector<int>::iterator it = temp.begin(); it != temp.end(); it++)
{
}
last = GetTickCount();
cout << (last-first) << endl;

return 0;
}

++i takes 735ms on my machine to complete, while i++ takes over 2000ms. This is with VC++7

:laugh: ARE YOU SERIOUS!?!?! GET TICK COUNT????? Do you even know what that does? It returns the number of milliseconds that have passed. What in the hell does that have to do with how many clock cycles a portion of code has taken? Any number of things could affect that count. Context-switches, page faults, cache misses. Why don't you tell me exactly how much of that time was actually used to run the program, cause GetTickCount sure isn't going to tell you.

Thanks for playing though. Oh, and thanks for ignoring everything I stated as well.

Don't be so quick to brush him off, if the difference were < 100ms, I would agree with you; we're talking over a second here though. (assuming he didn't decide to do anything crazy at the same time as running the test the second time, this gives at least a rough picture of things) Do you have any numbers you'd like to share?

With that said, care to cite where in the C++ standard it mandates that vector<T>::iterator shall be implemented in terms of a pointer? It may be possible to implement it as a pointer for certain types (hint: not bool), but this is not the case across the board (checked iterators?). For other containers, it's not at all possible for an iterator to simply be a fancy typedef for a pointer (linked lists, maps...). Iterators are objects. When a postfix increment is done, a copy constructor needs to be called. If things are inlined, this may only be a few instructions. If you're not so lucky, it's a full function call with all that that entails.
 

dighn

Lifer
Aug 12, 2001
22,820
4
81
Originally posted by: engineereeyore
:laugh: ARE YOU SERIOUS!?!?! GET TICK COUNT????? Do you even know what that does? It returns the number of milliseconds that have passed. What in the hell does that have to do with how many clock cycles a portion of code has taken? Any number of things could affect that count. Context-switches, page faults, cache misses. Why don't you tell me exactly how much of that time was actually used to run the program, cause GetTickCount sure isn't going to tell you.

Thanks for playing though. Oh, and thanks for ignoring everything I stated as well.

Ok, so it's not exactly the most accurate way of measuring cycles, but it is a fairly quick way to get a rough estimate. I ran the test many times with no heavy load, so statistically it's very unlikely I'd get consistently heavily skewed results because of multitasking.

But just to satisfy you, I increased the size by a factor of 10 and measured the CPU time per Task Manager in Windows.

The results: 7 seconds for prefix, 21 seconds for postfix
 

Markbnj

Elite Member, Moderator Emeritus
Moderator
Sep 16, 2005
15,682
14
81
www.markbetz.net
Originally posted by: engineereeyore
Point is, the only metric for measuring how quickly code executes is in clock cycles, and you can't determine clock cycles if you don't understand the hardware.

I tried to make it as clear as I could, but I guess I missed the mark. Let me try one more time: in certain situations with objects of user-defined class type, _any_ creation of a temporary will cause the compiler to generate a function call to a constructor or copy constructor for the class. I'm not going to debate methods of profiling code with you; you seem to know that area quite well. I presume that you won't debate whether or not a function call is slower than no function call.
 

KCfromNC

Senior member
Mar 17, 2007
208
0
76
Originally posted by: engineereeyore

j = array[i++] vs. j = array[++i] (ignore that these are two different values)

add temp,i,0 ; store i in temp
inc i ; increment i
st j, array(temp) ; store result in j

vs.

inc i
st j, array(i)

Which one happens faster? Neither if your compiler sucks. If the compiler is doing its job, the first one will execute faster due to the RAW stall of the second. So the stalls are exceptionally important.

Maybe, maybe not. When you work with these sorts of things in the real world it's hard to predict, especially when you haven't specified which microarchitecture you're talking about. It might depend on what other instructions you have around the code. Could be that the add and inc are issued on one cycle and the store on the next, meaning you take an extra cycle hit in the first case. Could be that you have an architecture that includes an address generation unit in the store pipe that gives you increment for free so, for example, your second example would be one instruction. It could be that there are limited forwarding paths (it takes power to drive all those muxes, and too many forwarding paths can kill timing) so maybe it takes an extra cycle to forward temp from the ALU to the ld/st unit.

But since we're both making up hypothetical CPUs, it's tough to logic your way through this. If it matters in real production code (which is incredibly unlikely) you'd have to time it and see. In this particular discussion, the measurements show that pre is quicker.
 

dighn

Lifer
Aug 12, 2001
22,820
4
81
Originally posted by: KCfromNC
In this particular discussion, the measurements show that pre is quicker.

Just to clarify: the measurements show that the prefix ++ operator of std::vector's iterator is quicker than the postfix version. The integer case is an entirely different scenario, and frankly, for all I know/care, engineereeyore is probably right. The argument shifted away from that quite a way back, however.
 

alpha88

Senior member
Dec 29, 2000
877
0
76
For primitive types (integers, pointers, chars): i++ could be faster, but most likely will be the same.


For objects: ++i can be significantly faster, depending on the copy constructor.


When you're doing stuff like

for (int i = 0; i < MAX; i++) { ... }

it doesn't matter at all if you use i++ or ++i.
 

KCfromNC

Senior member
Mar 17, 2007
208
0
76
Originally posted by: dighn
Originally posted by: KCfromNC
In this particular discussion, the measurements show that pre is quicker.

Just to clarify: the measurements show that the prefix ++ operator of std::vector's iterator is quicker than the postfix version. The integer case is an entirely different scenario, and frankly, for all I know/care, engineereeyore is probably right. The argument shifted away from that quite a way back, however.

Fair enough. I probably should have said "the only measurements anyone's bothered to do". I was just trying to make the point that even in the integer case it's very unlikely that a stall penalty will be the important factor in this case, regardless of how many big words get thrown around.