Originally posted by: Gamingphreek
So if I had 1.23456789, obviously in a float only 1.234567 would be accurate. So even if I multiplied by 100, resulting in a double with 123.456789, and then cast it to a float, it would still not be accurate?
Correct. Floating point numbers are stored with a mantissa (the digits of precision) and an exponent (what to multiply the mantissa by to get the final number); it's basically scientific notation.
A theoretical base-10 floating point format would be something like [7-digit mantissa] * 10^[exponent]. Multiplying it by 100 just changes the exponent, but under no circumstances can you have more than 7 digits in the mantissa.
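A quick way to see it for yourself (a little standalone test, not anything from your code):
#include <cstdio>

int main()
{
    float  f = 1.23456789f;                     // only ~7 significant digits survive
    double d = static_cast<double>(f) * 100.0;  // exponent changes, mantissa doesn't grow
    float  g = static_cast<float>(d);           // still limited to the float's ~7 digits

    std::printf("%.9f\n", f);  // roughly 1.234567881 - the trailing digits are already wrong
    std::printf("%.9f\n", d);  // roughly 123.456788063 - the lost digits don't come back
    std::printf("%.9f\n", g);
    return 0;
}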
Well, the code, running on multiple computers, normally takes a weekend to run through the datasets it is tasked with. So saving 60-70% on one part of the program that is called extensively should be at least somewhat beneficial.
As for writing our own allocator - honestly, that is above me. If you point me to some code guides that explain the concept, I would happily read over them, learn it, and try my hand at it.
Forget the allocator suggestion. VC++'s STL optimizes construction of scalars, so it never even calls allocator::construct. It's a hacky suggestion, anyway.
A better option is to call "reserve" instead of "resize" and replace the end of that SSE code with:
__m128d diff = _mm_sub_pd(xmml0, xmmr0);   // name the result - you can't take the address of the temporary _mm_sub_pd returns
const double* d = reinterpret_cast<const double*>(&diff); // view the 128-bit register as two doubles
res[0].push_back(d[0]);
res[0].push_back(d[1]);
diff = _mm_sub_pd(xmmr1, xmml1);           // d still points at diff, so no recast is needed
res[1].push_back(d[0]);
res[1].push_back(d[1]);
diff = _mm_sub_pd(xmml2, xmmr2);
res[2].push_back(d[0]);
res[2].push_back(d[1]);
It knocks off another 30% or so since the vector elements are no longer being default initialized. It still appears to be slower than just copying into a raw array due to some overhead in push_back.
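For comparison, the raw-array version of those stores would look roughly like this (out0/out1/out2 and i are made-up names for pre-sized buffers and the enclosing loop index, which I don't have in front of me):
// out0/out1/out2 assumed to be double buffers at least 2*count elements long
_mm_storeu_pd(&out0[2 * i], _mm_sub_pd(xmml0, xmmr0));
_mm_storeu_pd(&out1[2 * i], _mm_sub_pd(xmmr1, xmml1));
_mm_storeu_pd(&out2[2 * i], _mm_sub_pd(xmml2, xmmr2));
There's no per-element bookkeeping there, which is where push_back loses its time.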
Having said that, the only other information I know at this point is that trig functions are currently called for each element in the array. I assume that if I were to read the values into the __m128d data type, call the cross product, and loop that way, I would save time?
You can still read into a vector of doubles and just cast to __m128d* at the point where you call the SSE intrinsic. reinterpret_cast has no performance impact, and the code would probably be clearer that way.
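Just to illustrate the shape of it (the function and variable names here are invented, not your actual code, and the cast assumes the vector's storage is 16-byte aligned - std::vector doesn't guarantee that, so _mm_loadu_pd is the safe fallback):
#include <cstddef>
#include <emmintrin.h>
#include <vector>

// Sketch: keep the data in plain vectors of doubles and only reinterpret
// it as packed doubles at the point where the intrinsic is called.
void sub_columns(const std::vector<double>& lhs,
                 const std::vector<double>& rhs,
                 std::vector<double>& out)   // assumed same, even length
{
    out.resize(lhs.size());
    const __m128d* l = reinterpret_cast<const __m128d*>(&lhs[0]);
    const __m128d* r = reinterpret_cast<const __m128d*>(&rhs[0]);
    __m128d* o = reinterpret_cast<__m128d*>(&out[0]);
    for (std::size_t i = 0; i < lhs.size() / 2; ++i)
        o[i] = _mm_sub_pd(l[i], r[i]);       // two doubles per iteration
}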
You should push very hard to get someone to give you access to a profiler. Even if it were possible to optimize this function down to nearly nothing, it may turn out to not affect overall application performance much.