Originally posted by: goku
If more programs were optimized and possibly written in assembly language, wouldn't games like Far Cry run on a PIII 600 MHz system? What makes C++ less efficient than assembly if it all gets converted into machine code?
Ah, my favorite topic
There are many things that prevent a compiler from taking C/C++ code and turning it into the most efficient set of asm instructions. But for the most part, it's not just that the translation isn't (and can't be) perfect; it's also that asm written for one machine can perform very differently on another, even if both are the same platform. Just one thing like the cache size can have a dramatic effect on how fast a program executes.
Last semester I actually took a course at my university dedicated to teaching assembly specifically for the purpose of optimizing your C/C++ programs. In that course, any time we had to optimize code, we were told the exact specs of the hardware it would be tested on. Also, after compiling the program, we would often disassemble it and then modify the actual assembly code by hand to take care of optimizations that a compiler just can't make.
One of the key things to remember about compilers is that while optimizations are allowed, they can't in any way alter the result (the output of the program). As a human, you can look at a piece of code and see that if you move some things around, the program will run faster. For a compiler, however, if moving things around makes one program run correctly and faster while in another it alters the outcome, that optimization isn't allowed.
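Here's a minimal sketch of that constraint (the function is made up for illustration). A human can see that uppercasing a string never changes its length, so the strlen() call could be hoisted out of the loop; the compiler generally can't prove that, because the loop writes through the very pointer strlen() reads:

#include <ctype.h>
#include <stdio.h>
#include <string.h>

void shout(char *s) {
    /* strlen(s) is re-evaluated on every pass: the loop body writes
       through s, so the compiler must assume the length could change.
       Hoisting the call out would risk altering the program's output. */
    for (size_t i = 0; i < strlen(s); i++)
        s[i] = (char) toupper((unsigned char) s[i]);
}

int main(void) {
    char msg[] = "optimize me";
    shout(msg);
    printf("%s\n", msg);
    return 0;
}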
Here are a few things to keep in mind:
Greater efficiency usually makes code more difficult to understand and maintain. High-level languages were created to make it easier to program machines, so there will always be a price to pay for that convenience. There are also various other reasons why C/C++ code is difficult for a compiler to optimize.
Take this for example:
int x = 0;
int y = 0;
for (int i = 0; i < 10000; i++) {
    x++;
    y++;
}
This code will run at a certain speed, but if you do the following, it will run quite a bit faster:
int x = 0;
int y = 0;
for (int i = 0; i < 10000; i += 4) {
    x++;
    y++;
    x++;
    y++;
    x++;
    y++;
    x++;
    y++;
}
And the code above will also run faster than this code:
int x = 0;
int y = 0;
for (int i = 0; i < 10000; i += 4) {
    x++;
    x++;
    x++;
    x++;
    y++;
    y++;
    y++;
    y++;
}
Each one does the same thing. In the second example, I used what is called "loop unrolling" to minimize the number of times the i < 10000 comparison is made. It also eliminates three out of every four jumps, which can stall the CPU pipeline (this is called a control hazard: the pipeline doesn't know whether the jump will be taken until it's resolved). The third example looks just like the second, except I first increment all the x's and then all the y's. So why would it run slower? Once again, because of the pipeline. In the second case, the CPU can begin executing a y++ even before the x++ before it has finished. In the third case, because all the x++'s are together, each one has to wait for the previous one to finish so it can get the new value of x (this is called a data hazard).
While it seems like things like this should be optimizable by the compiler, I could just as easily come up with a situation in which reordering would drastically change the outcome of the program (especially once pointers are involved), and so the compiler can't do it. When it comes to pipelining, it is often very difficult to predict exactly how the compiler will schedule instructions when writing in C/C++, especially with compiler optimizations turned on. Usually compiler optimizations help, but there were many times in my own experience when code I had optimized by hand actually ran much faster with the compiler's own optimizations turned off. Sometimes what looks like a good idea in C/C++ turns out pretty badly once it's converted into asm (data hazards, control hazards, etc.).
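Here's a minimal sketch of the pointer problem (the function is made up for illustration). The two increments look just as independent as the interleaved x++/y++ above, but the compiler can't schedule them that way, because both pointers might refer to the same variable:

#include <stdio.h>

void bump_both(int *a, int *b) {
    /* These two statements can't safely be reordered or overlapped:
       if a and b point to the same int, the second increment depends
       on the result of the first. */
    *a += 1;
    *b += 1;
}

int main(void) {
    int x = 5;
    bump_both(&x, &x);   /* a and b alias - both point at x */
    printf("%d\n", x);   /* prints 7; loading both values up front and
                            then storing both would have produced 6 */
    return 0;
}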
There's also the cache, which is pretty much transparent in both C/C++ and asm, but consider the following:
You have a matrix in which you have to process each element. The most obvious solution is a for loop inside another for loop, but do you process the elements row by row, or column by column? Same question, but now imagine you're coding in Fortran.
In C/C++, the row-by-row approach is MUCH faster than column by column. Also, if you process two rows at a time and increment the outer loop by 2, it runs faster still. This is because C/C++ stores matrices in row-major order. Any time you access a memory location, a block of memory around it (a cache line, typically 32-128 bytes) is transferred to the cache. If you then process the elements next to the first one, no further trip to memory is needed: the data is already cached and can be accessed in just a few clock cycles. But if you walk down a single column first, then each access has to grab data from memory and pull it into the cache, overwriting what was in there before.
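Here's a minimal sketch of the two traversal orders (the matrix size and names are made up for illustration):

#include <stdio.h>

#define N 1024

static double m[N][N];   /* C stores this row-major: m[row][col] */

/* Row by row: consecutive iterations touch adjacent addresses, so almost
   every access hits data that is already sitting in the cache. */
double sum_by_rows(void) {
    double sum = 0.0;
    for (int row = 0; row < N; row++)
        for (int col = 0; col < N; col++)
            sum += m[row][col];
    return sum;
}

/* Column by column: consecutive iterations jump N * sizeof(double) bytes,
   so each access can land outside the cached block and force another trip
   to memory. */
double sum_by_cols(void) {
    double sum = 0.0;
    for (int col = 0; col < N; col++)
        for (int row = 0; row < N; row++)
            sum += m[row][col];
    return sum;
}

int main(void) {
    printf("%f %f\n", sum_by_rows(), sum_by_cols());
    return 0;
}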
In Fortran, however, it's the complete opposite. Because matrices there are stored in column-major order, the row-by-row approach would be a lot slower. Now tell me where in the language itself you would have been able to figure that out (without doing a bit of tinkering).
For all of the reasons above (and many more), assembly will always be the faster approach, because it brings you closer to the hardware. As with the cache example, in assembly there's no question about the most optimal way of doing things, because you're the one who decides how the matrix is stored. A compiler has a predetermined way of doing things, and if the programmer isn't aware of such little details and goes the other route (which is perfectly legal, and logically shouldn't have any effect), the compiled program often ends up with problems that can only be resolved by hand.
Hope that helps
