Multithreading/SIMD C++


Venix

Golden Member
Aug 22, 2002
1,084
3
81
Originally posted by: Gamingphreek
So if I had 1.23456789, obviously in a float only 1.234567 would be accurate. So even if I multiplied by 100, resulting in a double with 123.456789, and then cast it to a float, it would still not be accurate?

Correct. Floating point numbers are stored as a mantissa (the digits of precision) and an exponent (what to multiply the mantissa by to get the final number); it's basically scientific notation.

A theoretical base 10 floating point format would be something like [7 digit mantissa] * 10^[exponent]. Multiplying it by 100 just changes the exponent, but under no circumstances can you have more than 7 digits in the mantissa.
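To see the limit concretely (a quick sketch, assuming the usual IEEE single/double sizes):

#include <cstdio>

int main()
{
    double d = 1.23456789;
    float f = static_cast<float>(d * 100.0); // 123.456789 squeezed into ~7 digits
    std::printf("%.9f\n", f);                // prints something like 123.456787109
    return 0;
}

The float simply has no room for the digits past the seventh or so, no matter how you scale the value.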

Well, the code, running on multiple computers, normally takes a weekend to run through the datasets it is tasked with. So saving 60-70% on one part of the program that is called extensively should be at least somewhat beneficial.

As for writing our own allocator - honestly, that is above me. If you point me to some guides that explain the concept, I would happily read them, learn it, and try my hand at it.

Forget the allocator suggestion. VC++'s STL optimizes construction of scalars, so it never even calls allocator::construct. It's a hacky suggestion, anyway.

A better option is to call "reserve" instead of "resize" and replace the end of that SSE code with:

__m128d r0 = _mm_sub_pd(xmml0, xmmr0); // name the result so taking its address is legal
const double* d = reinterpret_cast<const double*>(&r0);
res[0].push_back(d[0]);
res[0].push_back(d[1]);

__m128d r1 = _mm_sub_pd(xmmr1, xmml1);
d = reinterpret_cast<const double*>(&r1);
res[1].push_back(d[0]);
res[1].push_back(d[1]);

__m128d r2 = _mm_sub_pd(xmml2, xmmr2);
d = reinterpret_cast<const double*>(&r2);
res[2].push_back(d[0]);
res[2].push_back(d[1]);

It knocks off another 30% or so since the vector elements are no longer being default initialized. It still appears to be slower than just copying into a raw array due to some overhead in push_back.

Having said that, the only other information I know at this point is that trig functions are currently called for each element in the array. I assume if I were to read the values into the __m128d data type, call the cross product, and loop that way, I would save time?

You can still read into a vector of doubles and just cast to __m128d* at the point where you call the SSE intrinsic. reinterpret_cast has no performance impact, and the code would probably be clearer that way.
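Something like this, roughly (a sketch - mulArrays is just a stand-in name, and dereferencing a __m128d* assumes the data happens to be 16-byte aligned; use _mm_loadu_pd on the element addresses if it might not be):

#include <cstddef>
#include <emmintrin.h>
#include <vector>

// Pairwise multiply two equal-length arrays of doubles, two at a time.
// Assumes an even element count and 16-byte-aligned vector storage.
void mulArrays(const std::vector<double>& a, const std::vector<double>& b,
               std::vector<double>& out)
{
    const __m128d* pa = reinterpret_cast<const __m128d*>(&a[0]);
    const __m128d* pb = reinterpret_cast<const __m128d*>(&b[0]);
    __m128d* po = reinterpret_cast<__m128d*>(&out[0]);

    for (std::size_t i = 0; i < a.size() / 2; ++i)
        po[i] = _mm_mul_pd(pa[i], pb[i]);
}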

You should push very hard to get someone to give you access to a profiler. Even if it were possible to optimize this function down to nearly nothing, it may turn out to not affect overall application performance much.
 

Gamingphreek

Lifer
Mar 31, 2003
11,679
0
81
Ok, the SSE intrinsics that were given crash. I am pretty sure it is because nothing is being aligned to 16 bytes.

I have a large suite of tests that I made that I am going to run to see what the times are like. Without 16-byte alignment, some numbers work, and from what I have seen, the OpenMP threading doesn't provide any improvement (though my dataset may not be quite large enough).

Right now I am trying to figure out how to align the values and which values to align.

As for the profiler, it's not that we aren't given access to one; it's quite simply that the MS employees stopped using their brains again and took it out of VS2008 Professional. I'll see if I can get a free version and have it installed.

Edit: Venix - as for using the calls to push_back, there is much more overhead in them. I saw a performance drop when I moved the intrinsics back to using push_back calls. I already call reserve() before I call resize(), so I think it is as efficient in that sense as possible.
 

Ken g6

Programming Moderator, Elite Member
Moderator
Dec 11, 1999
16,328
4,003
75
I've looked into alignment issues. It looks like aligning the contents of a vector is rather hard. It requires creating a subclass with a different memory allocation system.

For what I did, I just replaced the inner vector with a pointer to a double array allocated with _mm_malloc(). To initialize out as a vector of arrays, for instance, I passed in vecSize separately, and then did this:

// Instantiate vector rows.
out.resize(vector1.size());

for (size_t i = 0; i < vector1.size(); ++i)
{
    // Free any previous allocation before replacing it.
    if (out[i] != NULL) _mm_free(out[i]);
    out[i] = (double*)_mm_malloc(vecSize * sizeof(double), 16); // 16-byte aligned
    if (out[i] == NULL) DIE("Couldn't create array 2\n");
}
 

Gamingphreek

Lifer
Mar 31, 2003
11,679
0
81
// Computes the cross product for 2 3xn vectors.
void crossProductRev3( vector<vector<double>> &vector1, vector<vector<double>> &vector2, vector<vector<double>> &tempVector )
{
    // Instantiate vector rows.
    tempVector.clear();
    tempVector.reserve( 3 );
    for ( int i = 0; i < 3; i++ )
    {
        tempVector.push_back( vector<double>() );
        tempVector[i].reserve( size );
        tempVector[i].resize( size );
    }

    const __m128d v10 = _mm_loadu_pd(&vector1[0][0]);
    const __m128d v11 = _mm_loadu_pd(&vector1[1][0]);
    const __m128d v12 = _mm_loadu_pd(&vector1[2][0]);
    const __m128d v20 = _mm_loadu_pd(&vector2[0][0]);
    const __m128d v21 = _mm_loadu_pd(&vector2[1][0]);
    const __m128d v22 = _mm_loadu_pd(&vector2[2][0]);

    __m128d crossArr0 = _mm_loadu_pd(&tempVector[0][0]);
    __m128d crossArr1 = _mm_loadu_pd(&tempVector[1][0]);
    __m128d crossArr2 = _mm_loadu_pd(&tempVector[2][0]);

    for (unsigned int i = 0; i < (size / 2); i++)
    {
        __m128d xmml0 = _mm_mul_pd(v11, v22);
        __m128d xmmr0 = _mm_mul_pd(v12, v21);

        __m128d xmml1 = _mm_mul_pd(v10, v22);
        __m128d xmmr1 = _mm_mul_pd(v12, v20);

        __m128d xmml2 = _mm_mul_pd(v10, v21);
        __m128d xmmr2 = _mm_mul_pd(v11, v20);

        crossArr0 = _mm_sub_pd(xmml0, xmmr0);
        crossArr1 = _mm_sub_pd(xmmr1, xmml1);
        crossArr2 = _mm_sub_pd(xmml2, xmmr2);
    }

    return;
}

I decided to take a shot at breaking down the intrinsic commands and doing the loads manually myself. This is working about halfway.

First off, the loads are never incremented - they always read from element 0.
Second off, the values are never stored back into the array.

I can't figure out how to do either of them. Any ideas, guys? Rewriting the vector allocator, converting to a vector<array>, and rewriting the vector class are things I would rather avoid ;)

-Kevin
 

Venix

Golden Member
Aug 22, 2002
1,084
3
81
Don't subclass vector, just write an allocator:

#include <malloc.h> // _aligned_malloc / _aligned_free (MSVC)

template <typename T>
class aligned_allocator
    : public std::allocator<T>
{
public:
    typedef typename std::allocator<T>::pointer   pointer;
    typedef typename std::allocator<T>::size_type size_type;

    // The nested templates need a fresh parameter name; reusing T here
    // shadows the class template parameter and won't compile.
    template <typename U>
    struct rebind
    {
        typedef aligned_allocator<U> other;
    };

    aligned_allocator() {}
    aligned_allocator(const aligned_allocator& a) : std::allocator<T>(a) {}
    template <typename U>
    aligned_allocator(const aligned_allocator<U>& a) : std::allocator<T>(a) {}

    pointer allocate(size_type count)
    {
        return (pointer)_aligned_malloc(count * sizeof(T), 16);
    }

    void deallocate(pointer ptr, size_type)
    {
        _aligned_free(ptr);
    }
};

Then use std::vector<double, aligned_allocator<double>> (preferably typedef-ed). Although I can't seem to get MSVC to ever return anything that isn't 16-byte aligned even with the default allocator.
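For example (the typedef name is just a suggestion):

#include <vector>

typedef std::vector<double, aligned_allocator<double> > aligned_vec;

// The outer vector can keep the default allocator; only the doubles need alignment.
std::vector<aligned_vec> data(3, aligned_vec(1000)); // 3 rows of 1000 aligned doubles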

Calling both reserve() and resize() isn't necessary. Your choices are to either call reserve() and use push_back(), or call resize() and use operator[].
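In other words, one of these two shapes (compute() here is just a hypothetical stand-in for whatever produces each value):

#include <cstddef>
#include <vector>

double compute(std::size_t i) { return static_cast<double>(i); } // placeholder

void fillWithPushBack(std::vector<double>& v, std::size_t n)
{
    v.reserve(n);                 // one allocation, no initialization pass
    for (std::size_t i = 0; i < n; ++i)
        v.push_back(compute(i));
}

void fillWithResize(std::vector<double>& v, std::size_t n)
{
    v.resize(n);                  // one allocation, elements zeroed first
    for (std::size_t i = 0; i < n; ++i)
        v[i] = compute(i);
}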
 

Gamingphreek

Lifer
Mar 31, 2003
11,679
0
81
Originally posted by: Venix
Don't subclass vector, just write an allocator:

[aligned_allocator code snipped]

Then use std::vector<double, aligned_allocator<double>> (preferably typedef-ed). Although I can't seem to get MSVC to ever return anything that isn't 16-byte aligned even with the default allocator.

Calling both reserve() and resize() isn't necessary. Your choices are to either call reserve() and use push_back(), or call resize() and use operator[].

I honestly still can't get it working, given that I am using a vector of vectors of doubles. It simply won't compile.

When I tried it with an array size of 10, the program crashed with an Access Violation right away.

As for reserve and resize - using them both is necessary in this case. The reserve() call allocates memory for the array but doesn't initialize it to anything. When resize() is used after the reserve, it doesn't have to keep calling the new operation; it merely goes and initializes the values to 0.

Using push_back, the call into the vector class is more costly, I have found in my testing.

I can avoid rewriting the vector class altogether if I just use lower-level intrinsics (i.e. _mm_loadu_pd()) or just write inline assembly outright.

Also, wouldn't a swap be more efficient than multiple loads?

-Kevin
 

Venix

Golden Member
Aug 22, 2002
1,084
3
81
Originally posted by: Gamingphreek
I honestly still can't get it working, given that I am using a vector of vectors of doubles. It simply won't compile.

std::vector<std::vector<double, aligned_allocator<double>>> should work fine. What is the compilation error?

As for reserve and resize - using them both is necessary in this case. The reserve() call allocates memory for the array but doesn't initialize it to anything. When resize() is used after the reserve, it doesn't have to keep calling the new operation; it merely goes and initializes the values to 0.

I think you're confusing resize() with push_back(). resize() doesn't "keep calling the new operation"; it calls its allocator's allocate() once to get a block of memory and then default initializes every element. There is no benefit to calling reserve() first. The code is available in the <vector> header file if you want to look.

I can avoid rewriting the vector class altogether if I just use lower-level intrinsics (i.e. _mm_loadu_pd()) or just write inline assembly outright.

Unaligned loads and stores are slower, but if you want to do it:

for (unsigned int i = 0; i < size; i += 2)
{
    __m128d v10 = _mm_loadu_pd(&vector1[0][i]);
    __m128d v11 = _mm_loadu_pd(&vector1[1][i]);
    __m128d v12 = _mm_loadu_pd(&vector1[2][i]);
    __m128d v20 = _mm_loadu_pd(&vector2[0][i]);
    __m128d v21 = _mm_loadu_pd(&vector2[1][i]);
    __m128d v22 = _mm_loadu_pd(&vector2[2][i]);

    __m128d xmml0 = _mm_mul_pd(v11, v22);
    __m128d xmmr0 = _mm_mul_pd(v12, v21);

    __m128d xmml1 = _mm_mul_pd(v10, v22);
    __m128d xmmr1 = _mm_mul_pd(v12, v20);

    __m128d xmml2 = _mm_mul_pd(v10, v21);
    __m128d xmmr2 = _mm_mul_pd(v11, v20);

    _mm_storeu_pd(&res[0][i], _mm_sub_pd(xmml0, xmmr0));
    _mm_storeu_pd(&res[1][i], _mm_sub_pd(xmmr1, xmml1));
    _mm_storeu_pd(&res[2][i], _mm_sub_pd(xmml2, xmmr2));
}

The Visual Studio 2008 performance profiler is free. I strongly suggest using it and only attempting to optimize areas that are obvious bottlenecks.

Muddling the code with SIMD garbage should be an absolute last resort. If this application takes an entire weekend to run, there are probably a ton of other I/O or algorithm related optimizations that will be clearer, easier to implement and lead to much larger performance gains. Programs that deal with large datasets are very rarely CPU-bound, although it's always a possibility; that's what the profiler is for.
 

Gamingphreek

Lifer
Mar 31, 2003
11,679
0
81
Venix - thank you so much for all the help. My code for loading unaligned values was exactly the same, EXCEPT I didn't know about the storeu command, so I was stuck there. Thank you.

I did get confused with resize. For whatever inane reason, I thought it called the push_back function continuously to populate the vector. I got rid of all the reserve calls and just used resize(). I did notice, however, that the reserve() + push_back() method actually is more efficient for small (i.e. 3-element) vectors, so I still use push_back when I instantiate each row.

The compilation error when rewriting the allocator dealt with it not knowing what to do with the values I was pushing back. Unless I am mistaken, though, I won't save anything remotely significant by moving to aligned loads and using the allocator. Obviously, correct me if I am wrong in that respect.

I got IT to install that profiler, but it seems to be different on my system... none of those command-line commands work. Instead, in the Build menu I have Profile Guided Optimization. I haven't turned anything up on Google about how to use it yet; I'll have to look into it.

NOW - I finished my test suite and finished optimizing the code. Here are my results. (Each revision contains the changes from the previous one. The data is 3 million random values in a 3x3M vector<vector<double>>. Runtime is measured with the QueryPerformanceCounter functions. Each revision runs in a loop of 10 iterations; each iteration is timed and added to an elapsed-time counter, which is divided by 10 at the end to get the average.)

Rev0 - 400.84ms
Rev1 - 183.57ms (Reserve space with resize() for the vector. Save values to doubles before assigning them to index)
Rev2 - 129.94ms (Return by reference instead of value)
Rev3 - 92.30ms (Using SSE2 Compiler Intrinsics)
Rev4 - 92.11ms (Multi-Threading with OpenMP)

My boss said that they use the dot product even more than the cross product. I don't see that taking me more than a few minutes now that I am familiar with compiler intrinsics, so I'm going to code that up next. Optimization is fun though :) - I like learning this stuff and then seeing my code morph and get progressively faster.

Quick question: there would be no benefit to passing a __m128d into a trig function to compute 2 values in the array in parallel, would there? I imagine I would have to overload the math library's trig functions with __m128d parameters at any rate.

-Kevin
 

Ken g6

Programming Moderator, Elite Member
Moderator
Dec 11, 1999
16,328
4,003
75
Uh, oh, looks like you're hooked. :Q Remember, Donald Knuth says, "We should forget about small efficiencies, say about 97% of the time: *premature optimization is the root of all evil*." (my highlighting)

Now, that said, it might interest you to know that SSE4 includes a single-instruction dot product! However, SSE4 is only available on 45nm (not 65nm) Core 2 and later processors. So you might want to learn about processor identification before trying to use it. While searching for that, I came across this page too.
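If you go that route, a minimal sketch of the check with MSVC's __cpuid intrinsic (SSE4.1, where the dot product instruction lives, is reported in bit 19 of ECX from CPUID function 1):

#include <intrin.h>

bool hasSSE41()
{
    int info[4];
    __cpuid(info, 1);                  // function 1: processor feature flags
    return (info[2] & (1 << 19)) != 0; // info[2] is ECX; bit 19 = SSE4.1
}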
 

Cogman

Lifer
Sep 19, 2000
10,280
131
106
Glad to see you got such a nice performance increase. SSE4 has a dot product instruction you might be interested in. DPPD.
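For reference, the intrinsic form looks something like this (a sketch; it needs an SSE4.1-capable CPU and compiler):

#include <smmintrin.h> // SSE4.1 intrinsics

// Dot product of two 2-element double vectors in one instruction (DPPD).
double dot2(__m128d a, __m128d b)
{
    // Immediate 0x31: multiply both lanes, put the sum in the low lane.
    return _mm_cvtsd_f64(_mm_dp_pd(a, b, 0x31));
}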
 

Gamingphreek

Lifer
Mar 31, 2003
11,679
0
81
Originally posted by: Cogman
Glad to see you got such a nice performance increase. SSE4 has a dot product instruction you might be interested in. DPPD.

Yea, I had seen a while back that it had a dot product instruction (back when I was programming in opcodes and couldn't believe we didn't have a basic subtract instruction on the LC-3).

Unfortunately, the computers in the back use Pentium 4 and later, so the highest SIMD level I can guarantee they all have is SSE2, I think.

I did read up on using the CPUID call, and I should probably add that check to my code as a precaution.

Yes, I am definitely hooked :) - it's fun stuff watching your code take progressively less time as you go.

Real quick - the multi-threaded number is not accurate. The multi-threaded #pragmas are not working (I watched Task Manager during the run). For now I just commented them out, but I'll have to read up more on OpenMP to see what is wrong.

Thanks for all the help guys,
-Kevin
 

Ken g6

Programming Moderator, Elite Member
Moderator
Dec 11, 1999
16,328
4,003
75
Here's the OpenMP guide I followed. I got all four of my cores working according to GKrellM System Monitor (a little like Task Manager, but in Linux); but didn't see much speed improvement overall.
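In case it helps, the general shape is something like this (a sketch; note that VC++ ignores the #pragma unless /openmp is enabled in the project properties, and the OpenMP 2.0 that VS2008 ships requires a signed loop index):

#include <vector>

void scaleAll(std::vector<double>& v)
{
    // Each iteration is independent, so the loop can be split across cores.
    #pragma omp parallel for
    for (int i = 0; i < static_cast<int>(v.size()); ++i)
        v[i] *= 2.0;
}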

Edit: P.S. In general, you should probably compile a separate version where you set the compiler to optimize for the P4s, as their architecture is very different.

P.P.S. The last link in my last post shows how to do a dot product without SSE4.
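In the meantime, an SSE2-only dot product has this general shape (a sketch, assuming an even element count):

#include <cstddef>
#include <emmintrin.h> // SSE2

double dotProduct(const double* a, const double* b, std::size_t n)
{
    __m128d acc = _mm_setzero_pd();
    for (std::size_t i = 0; i < n; i += 2)
        acc = _mm_add_pd(acc, _mm_mul_pd(_mm_loadu_pd(a + i),
                                         _mm_loadu_pd(b + i)));
    // Horizontal add: low lane + high lane.
    __m128d hi = _mm_unpackhi_pd(acc, acc);
    return _mm_cvtsd_f64(_mm_add_pd(acc, hi));
}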
 

Gamingphreek

Lifer
Mar 31, 2003
11,679
0
81
Originally posted by: Ken g6
Here's the OpenMP guide I followed. I got all four of my cores working according to GKrellM System Monitor (a little like Task Manager, but in Linux); but didn't see much speed improvement overall.

Edit: P.S. In general, you should probably compile a separate version where you set the compiler to optimize for the P4s, as their architecture is very different.

P.P.S. The last link in my last post shows how to do a dot product without SSE4.

Well - I just sent them the .cpp so they can drag and drop the function into their own code.

I am compiling on a Core 2 Duo machine right now, so giving them my compiled binaries probably wouldn't be very productive ;). Also, I believe most of them are running VS2005; from what I remember reading, it should still include the necessary header for SSE2. I'm not sure why the other systems are older, though...

I'll most definitely look into that guide - thanks for the Dot Product link as well :)

Thanks so much for all the help. I'll post back if I have any further questions about the topic!

-Kevin
 

Gamingphreek

Lifer
Mar 31, 2003
11,679
0
81
In further testing of the final cross product, I found that I had mistyped 2 statements and unnecessarily initialized the result vector a couple of times. After correcting that, the revision where all I change is returning by value versus writing into a variable passed by reference is coming out slower the more values I put into the array - the pass/return-by-reference version loses.

I cannot for the life of me figure out why a reference is slower than a value. (Note: this is the revision before the compiler intrinsics.)