The precision of a float is fixed. You will not gain more precision by multiplying it by 10,000; all that will do is move some of the lost precision to the left of the decimal point.
So if I had 1.23456789, obviously in a float only about 1.234567 would be accurate. So even if I multiplied by 100, getting a double holding 123.456789, and then cast it to a float, it would still not be accurate?
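Correct; scaling first doesn't bring the digits back. A minimal sketch demonstrating it (not from the thread; the printed digits are approximate):

```cpp
#include <cstdio>

// Minimal sketch of the point above: a float carries ~7 significant
// decimal digits no matter where the decimal point sits.
int main()
{
    double d  = 1.23456789;
    float  f1 = (float)d;           // digits past ~7 significant figures are lost
    float  f2 = (float)(d * 100.0); // scaling first doesn't recover them
    printf("%.9g\n", f1);           // roughly 1.23456788 -- not ...789
    printf("%.9g\n", f2);           // roughly 123.456787 -- same digits lost
    return 0;
}
```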
I have great difficulty believing that this function is the performance bottleneck of your application. Have you done any profiling yet, or are you just assuming that this needs to be aggressively optimized?
Well, in the program/project I am working on, it more than likely isn't the problem. Cross product, however, is one of the bottlenecks for the code that my boss and others use elsewhere. I haven't actually seen the code, as it is in the TS/SCI area (thus someone would have to bring it out to me).
As for the profiler, I actually haven't run the code through one yet, for the simple fact that in VS2008 the profiler is only available in two editions. My company owns VS2008 Professional, which does not include it.
It takes about 60-70% of the time of the original code, but it requires the vector to be 16-byte aligned (although that's the default for MSVC++'s "new") and its size to be a multiple of two. Meanwhile, the majority of the time in the function is eaten by the resize of the vectors, since vector default-initializes every element. You would see much greater performance gains by simply using a container that allows for uninitialized values; one possibility may be to write an allocator that does nothing in its "construct" function.
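For what it's worth, a minimal sketch of such an allocator for a C++03-era vector (the name and details are mine, untested against your code):

```cpp
#include <memory>
#include <vector>

// Sketch: an allocator whose construct() is a no-op, so resize() on a
// C++03-era vector stops value-initializing every new element.
template <typename T>
struct no_init_allocator : std::allocator<T>
{
    template <typename U>
    struct rebind { typedef no_init_allocator<U> other; };

    no_init_allocator() {}
    template <typename U>
    no_init_allocator(const no_init_allocator<U>&) {}

    // Hides std::allocator<T>::construct; leaves the memory uninitialized.
    void construct(T*, const T&) {}
};

// Usage sketch:
//   std::vector<float, no_init_allocator<float> > v;
//   v.resize(2000000);   // no longer zeroes two million floats
```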
Well, the code, running on multiple computers, normally takes a weekend to run through the datasets that they task it with. So cutting one part of the program that is called extensively down to 60-70% of its original time should be at least somewhat beneficial.
As for writing our own allocator - honestly, that is above me. If you point me to some guides that explain the concept, I would happily read over them, learn it, and try my hand at it.
But again, I imagine that there are other areas in the program that can be more easily optimized for much greater gains. For example, copying one of these humongous vector<vector<>>'s would be horribly slow, so ensure that you never return one by value or assign one to a new variable; see the sketch below.
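To make that concrete, a sketch of the kinds of signatures I mean (the Grid alias is purely illustrative):

```cpp
#include <vector>

typedef std::vector<std::vector<float> > Grid;  // illustrative name for the big array

Grid make_grid();           // returning by value may copy the whole structure
                            // (no move semantics in C++03 / VS2008)

void fill_grid(Grid& out);  // cheaper: the caller owns the storage, we fill it

float sum(const Grid& g);   // read through a const reference, never a copy
```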
Well, in my code, I am in control of the array in question; in others' code, however, it is not in my control and is used a little too extensively to just up and switch unless I can prove huge gains.
With 2,000,000 elements, this cross-product loop (excluding the vector resize) takes only ~65 ms (~41 ms with SSE2); unless you're calling it hundreds or thousands of times, that's pretty insignificant.
Array sizes, from what I am told, are in the 3x(x)00,000's. There are, again from what I understand, many of these per data set, so there should be at least a decent gain.
--------------------
Having said that, the only other information I know at this point is that trig functions are currently called for each element in the array. I assume that if I were to read the values into the __m128 data type, call the cross product, and loop that way, I would save time?
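Something like this is what I have in mind for the per-element step (a sketch, assuming one (x, y, z, padding) triple per __m128; the function name is mine):

```cpp
#include <xmmintrin.h>

// Sketch: cross product of two 3-vectors packed as (x, y, z, unused).
// Uses cross(a, b) = yzx(a * yzx(b) - yzx(a) * b), which needs only
// three shuffles instead of the naive four.
__m128 cross3(__m128 a, __m128 b)
{
    __m128 a_yzx = _mm_shuffle_ps(a, a, _MM_SHUFFLE(3, 0, 2, 1));
    __m128 b_yzx = _mm_shuffle_ps(b, b, _MM_SHUFFLE(3, 0, 2, 1));
    __m128 c     = _mm_sub_ps(_mm_mul_ps(a, b_yzx), _mm_mul_ps(a_yzx, b));
    return _mm_shuffle_ps(c, c, _MM_SHUFFLE(3, 0, 2, 1));
}
```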
--------------------
Well, I figured out why none of my ideas were working. I'm using gcc 4.3.3, and gcc is too smart for me! Once I compiled with gcc -S to look at the assembly source, I found that it (1) pre-calculates things like out[1], so out[1][i] is only two operations, and (2) found the possibility for SIMD all by itself and inserted it automatically!
The gcc compiler is so great.
I think they switched to the Intel compiler briefly and, while it was faster, it also broke some things. So, until someone rewrites that program, we are stuck with the MS (<- bleh) compiler.
Edit: OpenMP is just slightly faster, but uses much more CPU. To enable it, #include <omp.h>, link with its library (compile with -fopenmp on gcc, or /openmp on MSVC), and make the central loop look like the following:
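A minimal sketch of such a loop, assuming the [component][element] vector<vector<>> layout discussed above (names are illustrative):

```cpp
#include <omp.h>
#include <vector>

typedef std::vector<std::vector<float> > Vec3Array;  // [component][element]

void cross_all(const Vec3Array& a, const Vec3Array& b, Vec3Array& out)
{
    const int n = (int)a[0].size();
    #pragma omp parallel for   // split the iterations across threads
    for (int i = 0; i < n; ++i)
    {
        out[0][i] = a[1][i] * b[2][i] - a[2][i] * b[1][i];
        out[1][i] = a[2][i] * b[0][i] - a[0][i] * b[2][i];
        out[2][i] = a[0][i] * b[1][i] - a[1][i] * b[0][i];
    }
}
```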
Yea, I was looking at that as well. I'll have to investigate the results when paired with the SSE code.
Thanks for all the help, guys. As I said, I have learned a TON from this thread and am still eagerly burning through all the documentation on both concepts that I can.
-Kevin