Multithreading/SIMD C++


Venix

Golden Member
Aug 22, 2002
1,084
3
81
Originally posted by: Common Courtesy
Originally posted by: Venix
Originally posted by: Cogman
Well, as far as optimizations go, there are a couple of things that I noticed that you could improve on to see some speed ups.

...

This is incorrect. There is no performance benefit to taking a pointer to the beginning of a vector, nor is there any performance impact in using operator[]. Unless you're using debug builds or checked iterators, every compiler available will generate identical code for rawbob[i] and bob[i].

Similarly, every compiler I've ever used--MSVC, GCC, Comeau--will call size() once before the loop and cache the result, even if you put the call in the loop condition. Most "optimizations" like this just muddle the code; it's best to identify problem areas and only then attempt to manually optimize.


As for real optimizations, using a vector of vectors is rarely a good idea. Any time the outer vector re-allocates, all the inner vectors will be copied. Since you seem to always know the size of the result vector, one mitigating technique is to call reserve(size) before the loop that fills the vector. If you do this, push_back will be guaranteed to never cause a reallocation.

A better solution would be to not use vectors of vectors. A list of vectors might be a possibility, although a real multi-dimensional array library would probably be best.

The bolded area can cause a serious problem if the size of the array is modified within the loop. If the array is shrunk, you are reading "foreign" memory; if the array is expanded, you may not get to the full end of the array.

The compiler won't cache the result of size() if it's possible for it to change in the loop. If an optimization isn't applicable to a piece of code, I can't imagine any sane optimizing compiler attempting to apply it anyway.
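For illustration, a minimal sketch of both idioms (names are hypothetical; this assumes a plain vector<double>):

#include <vector>
using std::vector;

void example(vector<double>& bob)
{
    // If the loop body may change the size and you want the bound fixed
    // at entry, hoist it yourself; otherwise the compiler caches it for you.
    const size_t n = bob.size();
    for (size_t i = 0; i < n; ++i)
        bob[i] *= 2.0;

    // reserve() before a push_back loop guarantees no reallocation occurs.
    vector<double> result;
    result.reserve(n);
    for (size_t i = 0; i < n; ++i)
        result.push_back(bob[i] + 1.0);
}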
 

Cogman

Lifer
Sep 19, 2000
10,280
131
106
Originally posted by: JasonCoder
Originally posted by: Cogman
Originally posted by: JasonCoder
Originally posted by: Cogman
Read up on the Windows CreateThread function and the EnterCriticalSection/LeaveCriticalSection functions. Those should be what you need. Also make sure you understand race conditions.

IE

for(int i = 0; i < 3; ++i)
createthread(func, i);

will cause a race condition because the main thread could update i before the new thread is created. There are ways around this, but just be aware that it is an issue.

Wait, what? I'm pretty sure the for loop will wait for createthread to return before proceeding with the loop.

Races are definitely important to understand, though. For instance if you have lots of routines in different threads incrementing a counter, like the old interface reference counter in COM or something, you'll want to make sure even increment operators are mutex'd or somehow locked for one at a time.

It turns out even the increment operator (the ++ in c++) isn't actually one statement. It compiles down to two or three operations. If using .Net, use the Interlocked class.

Ummm. What are you smoking? The increment operator, especially how I have it written, is exactly one instruction. In fact, most processors have a dedicated increment instruction -- specifically, inc. Even if that weren't the case, it would just be something to the effect of addl $1, %eax.

And don't be so sure that won't cause a race condition. Give it a try if you don't believe me. It turns out that sometimes i will update before the thread has time to use it.

I quit smoking many years ago, but thanks for your concern. And I don't care if most processors do x... there's nothing in the standard about ++ being thread safe. I brought it up because it's a classic threading gotcha. Assume at your own peril.

Ownage link 1

Ownage link 2

Ownage link 3

I see what you are saying now about the CreateThread call. I misunderstood the first time -- I thought you were saying the call stack would pop the function's return before the function actually returned.

Who said anything about thread safety? You said the increment operator compiled down to more than one operation in C++; I said it didn't. incl is an x86 instruction, and all x86 processors have it.

The only way I could see the increment being more than one operation is if it has to go through memory rather than a register. Then I could see it being five ops, e.g.

pushl %eax
movl 4(%edx), %eax
incl %eax
movl %eax, 4(%edx)
popl %eax

However, if that's the case, then just about any operation will potentially take up to five ops.
 

Cogman

Lifer
Sep 19, 2000
10,280
131
106
Originally posted by: Venix
Originally posted by: Cogman
Well, as far as optimizations go, there are a couple of things that I noticed that you could improve on to see some speed ups.

...

This is incorrect. There is no performance benefit to taking a pointer to the beginning of a vector, nor is there any performance impact in using operator[]. Unless you're using debug builds or checked iterators, every compiler available will generate identical code for rawbob[i] and bob[i].

Similarly, every compiler I've ever used--MSVC, GCC, Comeau--will call size() once before the loop and cache the result, even if you put the call in the loop condition. Most "optimizations" like this just muddle the code; it's best to identify problem areas and only then attempt to manually optimize.


As for real optimizations, using a vector of vectors is rarely a good idea. Any time the outer vector re-allocates, all the inner vectors will be copied. Since you seem to always know the size of the result vector, one mitigating technique is to call reserve(size) before the loop that fills the vector. If you do this, push_back will be guaranteed to never cause a reallocation.

A better solution would be to not use vectors of vectors. A list of vectors might be a possibility, although a real multi-dimensional array library would probably be best.

Not true -- I just compiled a test. While there isn't a function call (it's inlined at -O2 with gcc), the size is being checked every time, even though it doesn't change.

However, I do stand corrected about the rawbob thing, I could have sworn I saw a speed boost by doing it, but it had no effect. Live and learn :)
 

nhocstrasser

Junior Member
Jun 19, 2009
22
0
0
Originally posted by: Cogman
Originally posted by: JasonCoder
Originally posted by: Cogman
Originally posted by: JasonCoder
Originally posted by: Cogman
Read up on the Windows CreateThread function and the EnterCriticalSection/LeaveCriticalSection functions. Those should be what you need. Also make sure you understand race conditions.

IE

for(int i = 0; i < 3; ++i)
createthread(func, i);

will cause a race condition because the main thread could update i before the new thread is created. There are ways around this, but just be aware that it is an issue.

Wait, what? I'm pretty sure the for loop will wait for createthread to return before proceeding with the loop.

Races are definitely important to understand, though. For instance if you have lots of routines in different threads incrementing a counter, like the old interface reference counter in COM or something, you'll want to make sure even increment operators are mutex'd or somehow locked for one at a time.

It turns out even the increment operator (the ++ in c++) isn't actually one statement. It compiles down to two or three operations. If using .Net, use the Interlocked class.

Ummm. What are you smoking? The increment operator, especially how I have it written, is exactly one instruction. In fact, most processors have a dedicated increment instruction -- specifically, inc. Even if that weren't the case, it would just be something to the effect of addl $1, %eax.

And don't be so sure that won't cause a race condition. Give it a try if you don't believe me. It turns out that sometimes i will update before the thread has time to use it.

I quit smoking many years ago, but thanks for your concern. And I don't care if most processors do x... there's nothing in the standard about ++ being thread safe. I brought it up because it's a classic threading gotcha. Assume at your own peril.

Ownage link 1

Ownage link 2

Ownage link 3

I see what you are saying now about the CreateThread call. I misunderstood the first time -- I thought you were saying the call stack would pop the function's return before the function actually returned.

Who said anything about thread safety? You said the increment operator compiled down to more than one operation in C++; I said it didn't. incl is an x86 instruction, and all x86 processors have it.

The only way I could see the increment being more than one operation is if it has to go through memory rather than a register. Then I could see it being five ops, e.g.

pushl %eax
movl 4(%edx), %eax
incl %eax
movl %eax, 4(%edx)
popl %eax

However, if that's the case, then just about any operation will potentially take up to five ops.


You are both correct. Processors follow a fetch-operate-store model of computation, and the true number of operations depends on how well the compiler optimizes register access. In the worst case, the ++ operator takes multiple operations to execute; doing an increment in a single operation is not guaranteed.

That said, I think there is a general misunderstanding about the difference between atomic and non-atomic operations and how that relates to thread safety. In a multi-threaded environment you can't guarantee that your fetch-operate-store sequence will not be interrupted by another thread. Say this happens (A is shared; each thread's B is its own register):

Thread 1: Copy A to B
Thread 1: Decr B
Thread 2: Copy A to B
Thread 2: Incr B
Thread 2: Save B to A
Thread 1: Save B to A

Thread 2's update is silently overwritten when Thread 1 writes back, so A ends up holding a stale value. This wouldn't happen if the operation were atomic: atomic operations are guaranteed by the processor to complete without interruption. So when he brought up the issue of thread safety, it was implicit that the problem was the operation taking multiple instructions.

This is an interesting thread. It's been ages since I studied computer architecture.
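To make the interleaving concrete, here is a minimal Win32 sketch (hypothetical, not from the thread): two threads bump one shared counter a million times each. With a plain ++ the final value comes up short; with InterlockedIncrement it is always 2,000,000.

#include <windows.h>
#include <iostream>

volatile LONG counter = 0;

DWORD WINAPI worker(LPVOID)
{
    for (int i = 0; i < 1000000; ++i)
        InterlockedIncrement(&counter); // atomic read-modify-write
    // Replacing that line with a plain "++counter;" makes lost updates likely.
    return 0;
}

int main()
{
    HANDLE h[2];
    for (int i = 0; i < 2; ++i)
        h[i] = CreateThread(NULL, 0, worker, NULL, 0, NULL);
    WaitForMultipleObjects(2, h, TRUE, INFINITE);
    std::cout << counter << std::endl; // always 2000000 with the atomic version
    for (int i = 0; i < 2; ++i)
        CloseHandle(h[i]);
    return 0;
}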

 

dinkumthinkum

Senior member
Jul 3, 2008
203
0
0
1. Increment is not an atomic operation unless specified (i.e., with the lock prefix on Intel archs). The Bakery Algorithm is a classic example of how to work around that without using atomic operations.

2. C is call-by-value, so it doesn't matter if the for loop increments the value of 'i' afterward: the integer value was already copied in the call to createthread().
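A small sketch of point 2 (hypothetical names, Win32 API): passing i by value is safe; the race only appears if you pass a pointer to i.

#include <windows.h>

DWORD WINAPI func(LPVOID arg)
{
    int id = (int)(INT_PTR)arg; // the value was fixed when CreateThread ran
    // ... work using id ...
    return 0;
}

int main()
{
    HANDLE threads[3];
    for (int i = 0; i < 3; ++i)
    {
        // Safe: i is copied into the thread argument by value.
        threads[i] = CreateThread(NULL, 0, func, (LPVOID)(INT_PTR)i, 0, NULL);
        // Racy alternative: passing &i would let the loop change i
        // before the new thread gets around to reading it.
    }
    WaitForMultipleObjects(3, threads, TRUE, INFINITE);
    for (int i = 0; i < 3; ++i)
        CloseHandle(threads[i]);
    return 0;
}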
 

Gamingphreek

Lifer
Mar 31, 2003
11,679
0
81
Real quick - I don't want to get too far away from SIMD/SMT discussion; however, Cogman, your second operation confuses me a little bit.
for ( unsigned int i = 0; i < readingsArr[0].size(); i++ )
{
double vector10i = vector1[0][i];
double vector20i = vector2[0][i];
double vector11i = vector1[1][i];
double vector22i = vector2[2][i];
double vector12i = vector1[2][i];
double vector21i = vector2[1][i];
tempVector[0].push_back( vector11i * vector22i - ( vector12i * vector21i ) );
tempVector[1].push_back( -( vector10i * vector22i ) + vector12i * vector20i );
tempVector[2].push_back( vector10i * vector21i - ( vector11i * vector20i ) );
}

Is calling cos() and the other vector methods more costly than allocating (and then deallocating) space on the run-time stack for all of those double values?

Also - others suggested something other than a vector<vector<double>> type. Does anyone have other suggestions? There are a ton of these arrays (both local, and global via a public class), and they all need to be saved to a memory address that is passed into this subroutine (multiple returned values).

I originally tried using vector<double> theVecArray[3] - an array of vectors - but it was too hard to pass and manage between functions.

Furthermore, since I am dealing with 2D vectors, I need to instantiate the rows of the vectors in advance. It seemed to me as I was doing this that I was doing it in, quite possibly, the least efficient way possible (this is one example):
for ( int i = 0; i < 3; i++ )
{
upVecs.push_back( vector<double> () );
eastVecs.push_back( vector<double> () );
northVecs.push_back( vector<double> () );
posVecs.push_back( vector<double> () );
}

I'm learning a TON in this thread - thank you for any advice or pointers you can give.

Thanks,
-Kevin
 

degibson

Golden Member
Mar 21, 2008
1,389
0
0
This thread has grown enormously in size, so I can't do my usual due diligence this morning and read everything. Sorry. However:
1) x++ is not safe to use on a shared variable without synchronization.
2) The bakery algorithm will not succeed on modern hardware with shared memory. It will succeed among timeshared processes, but it fails under weaker consistency models without fences.
3) OP: If you're going to restructure your code for threading, get rid of calls to push_back, as those create loop-carried dependencies. Instead, resize() the vector before you start operating on it rather than adding things to it on the fly (see the sketch after this list).
4) OP: vector<vector<double>> is fine once they are loaded. As you already know, they can be a pain to load in the first place.
5) OP: Listen to cogman's post about GetTickCount(). Do not ignore the need to profile where the time goes in your application.
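A minimal before/after sketch of point 3 (names are hypothetical):

#include <vector>
using std::vector;

void fill(vector<double>& out, size_t n)
{
    // Before: every push_back depends on the previous one (size bookkeeping,
    // possible reallocation), so iterations can't run independently.
    //     out.push_back(work(i));

    // After: size the vector once, then each iteration writes its own slot,
    // so the loop can later be split across threads.
    out.resize(n);
    for (size_t i = 0; i < n; ++i)
        out[i] = static_cast<double>(i) * 0.5; // stand-in for the real work
}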
 

Gamingphreek

Lifer
Mar 31, 2003
11,679
0
81
Originally posted by: degibson
This thread has grown enormously in size, so I can't do my usual due diligence this morning and read everything. Sorry. However:
1) x++ is not safe to use on a shared variable without synchronization.
2) The bakery algorithm will not succeed on modern hardware with shared memory. It will succeed among timeshared processes, but it fails under weaker consistency models without fences.
3) OP: If you're going to restructure your code for threading, get rid of calls to push_back, as those create loop-carried dependencies. Instead, resize() the vector before you start operating on it rather than adding things to it on the fly.
4) OP: vector<vector<double>> is fine once they are loaded. As you already know, they can be a pain to load in the first place.
5) OP: Listen to cogman's post about GetTickCount(). Do not ignore the need to profile where the time goes in your application.

Right now, on a 3x500,000 vector<vector<double>> (I filled it from [trigfunction](0:499,999)), I used QueryPerformanceCounter() to determine the run time of my simple crossProduct function.
vector<vector<double>> crossProduct(vector<vector<double> > &vector1, vector<vector<double> > &vector2)
{
LARGE_INTEGER time1, time2;

QueryPerformanceCounter(&time1);

// Value for holding the cross product before returning the value.
vector<vector<double>> tempVector;

// Number of elements per row.
size_t size = vector1[0].size();

// Instantiate vector rows.
tempVector.reserve( 3 );
for ( int i = 0; i < 3; i++ )
{
tempVector.push_back( vector<double> () );
tempVector[i].reserve( size );
}

for ( unsigned int i = 0; i < size; i++ )
{
tempVector[0].push_back( vector1[1][i] * vector2[2][i] - ( vector1[2][i] * vector2[1][i] ) );
tempVector[1].push_back( -( vector1[0][i] * vector2[2][i] ) + vector1[2][i] * vector2[0][i] );
tempVector[2].push_back( vector1[0][i] * vector2[1][i] - ( vector1[1][i] * vector2[0][i] ) );
}

QueryPerformanceCounter(&time2);

LARGE_INTEGER freq;
QueryPerformanceFrequency(&freq);

double elapsed = ((double)(time2.QuadPart - time1.QuadPart) * 1000) / (double)(freq.QuadPart);

cout << elapsed << "ms have elapsed";
system("PAUSE");

return tempVector;
}

I call this function 5 times in my test and I got back:
37.35ms
37.61ms
37.13ms
37.35ms
36.98ms

I am trying to figure out how I can manage to write this function using compiler intrinsics or inline assembly.

-Kevin
 

Ken g6

Programming Moderator, Elite Member
Moderator
Dec 11, 1999
16,328
4,003
75
Looking through what I used last time I used compiler intrinsics, it looks like the PDFs on this page were my main guide, followed by Intel's manuals.

Stack allocation is a simple matter of adding to the stack pointer register (ESP), if I recall. The data might even get stored in registers, if there are enough. (8 registers, 6 values, not including temp calculation values - the compiler probably can't do it, but you might be able to by hand.) Alternatively, you could try something like this, though I don't use C++ much, so I have no idea if my syntax is correct:
vector<double> *vector10 = &vector1[0];
vector<double> *vector20 = &vector2[0];
vector<double> *vector11 = &vector1[1];
vector<double> *vector22 = &vector2[2];
vector<double> *vector12 = &vector1[2];
vector<double> *vector21 = &vector2[1];
for ( unsigned int i = 0; i < readingsArr[0].size(); i++ )
{
tempVector[0].push_back( (*vector11)[i] * (*vector22)[i] - ( (*vector12)[i] * (*vector21)[i] ) );
tempVector[1].push_back( -( (*vector10)[i] * (*vector22)[i] ) + (*vector12)[i] * (*vector20)[i] );
tempVector[2].push_back( (*vector10)[i] * (*vector21)[i] - ( (*vector11)[i] * (*vector20)[i] ) );
}
 

Cogman

Lifer
Sep 19, 2000
10,280
131
106
OK, so with the code you provided, on my Q6600 at default settings I was able to run it in 300 ms on average; after making the following changes, I dropped it down to an average of about 120 ms. You'll probably have to change some things around, but this is my final optimized code (for the cross product).

void crossProduct(vector<vector<double> > &vector1, vector<vector<double> > &vector2, vector< vector<double> >& out)
{
// Instantiate vector rows.
out.resize(vector1.size());
size_t vecSize = vector1[0].size();

for (size_t i = 0; i < vector1.size(); ++i)
{
out[i].resize(vecSize);
}

for ( size_t i = 0; i < vecSize; ++i )
{
double vector10i = vector1[0][i];
double vector20i = vector2[0][i];
double vector11i = vector1[1][i];
double vector22i = vector2[2][i];
double vector12i = vector1[2][i];
double vector21i = vector2[1][i];
out[0][i] = ( vector11i * vector22i - ( vector12i * vector21i ) );
out[1][i] = ( -( vector10i * vector22i ) + vector12i * vector20i );
out[2][i] = ( vector10i * vector21i - ( vector11i * vector20i ) );
}
}

The biggest speed impact came from using resize instead of push_back every time; that cut about 100 ms off the time. The local-double trick cut off about 40 ms, and replacing the return value with a pass-by-reference output parameter cut off the remaining 40 ms. You could probably shave a few more instructions off if you assume that the first dimension of the vector will always be 3.

I'll see if I can't get a working threaded example for you next :)

The data set was just rand() cast to a double 200,000 times for all 3 of the dimensions. Timing was done outside of the function.
 

Gamingphreek

Lifer
Mar 31, 2003
11,679
0
81
Well, I found that doing a resize() was more expensive than using a for loop. I don't quite understand why, though -- in my mind it should be faster. I did eliminate the push_back, which saved 10 ms off the aforementioned times.

I also found that if you do a .reserve() it shaves off a small amount of time as well.

Ken - that website is amazing. I am reading through every bit of that code that I can get my hands on. If I can get my cross product function running with compiler intrinsics, the Project Manager has given me approval to present my findings, and my division may, perhaps, switch to my code (I'm only an intern too :) :) :) ).

I can't thank all of you enough for this thread - I can't begin to describe how much I have learned thus far, not to mention the number of resources that I am now reading through.

-Kevin
 

Cogman

Lifer
Sep 19, 2000
10,280
131
106
OK, be forewarned that a ton of code follows. Also note that the gains I noticed were minimal at best. This is your warning.

I managed to thread the cross product; however, there were no noticeable speed increases at 200,000 units, so I tested it with 200,000,000 units. The difference was still minimal (1200 ms vs 1300 ms). Either way, this is what I came up with (I haven't checked the results, but I believe they should be correct).

struct Input
{
size_t start;
size_t end;
vector< vector<double> >* vec1;
vector< vector<double> >* vec2;
vector< vector<double> >* out;
};

DWORD WINAPI crossProduct_thread(Input* data)
{
size_t start = data->start;
size_t end = data->end;
vector<double>* vec1 = &(*data->vec1)[0];
vector<double>* vec2 = &(*data->vec2)[0];
vector<double>* out = &(*data->out)[0];
for (size_t i = start; i < end; ++i)
{
double vector10i = vec1[0][i];
double vector20i = vec2[0][i];
double vector11i = vec1[1][i];
double vector22i = vec2[2][i];
double vector12i = vec1[2][i];
double vector21i = vec2[1][i];
out[0][i] = ( vector11i * vector22i - ( vector12i * vector21i ) );
out[1][i] = ( -( vector10i * vector22i ) + vector12i * vector20i );
out[2][i] = ( vector10i * vector21i - ( vector11i * vector20i ) );
}
delete data; // free the argument that crossProduct allocated for this thread
return 0;
}

void crossProduct(vector<vector<double> > &vector1, vector<vector<double> > &vector2, vector< vector<double> >& out)
{
const int NUM_THREADS = 4;

// Instantiate vector rows.
out.resize(3);
size_t vecSize = vector1[0].size();

for (size_t i = 0; i < vector1.size(); ++i)
{
out[i].resize(vecSize);
}

int step = vecSize / NUM_THREADS;
vector<HANDLE> threads;

for (int i = 0; i < NUM_THREADS - 1; ++i)
{
Input* data = new Input;
data->vec1 = &vector1;
data->vec2 = &vector2;
data->out = &out;
data->start = step * i;
data->end = step * (i + 1);
threads.push_back(CreateThread(NULL, 0, (LPTHREAD_START_ROUTINE)crossProduct_thread, (void*)data, 0, NULL));
}

for ( size_t i = step * (NUM_THREADS - 1); i < vecSize; ++i )
{
double vector10i = vector1[0][i];
double vector20i = vector2[0][i];
double vector11i = vector1[1][i];
double vector22i = vector2[2][i];
double vector12i = vector1[2][i];
double vector21i = vector2[1][i];
out[0][i] = ( vector11i * vector22i - ( vector12i * vector21i ) );
out[1][i] = ( -( vector10i * vector22i ) + vector12i * vector20i );
out[2][i] = ( vector10i * vector21i - ( vector11i * vector20i ) );
}

for (size_t i = 0; i < threads.size(); ++i)
{
WaitForSingleObject(threads[i], INFINITE);
CloseHandle(threads[i]); // don't leak the thread handles
}
}

Again, gains were minimal. I suspect that thread creation is eating up a fair amount of the actual gains; a thread pool might help mitigate that loss.

I didn't employ any sort of mutex/semaphore as none of the threads should be modifying the same data at the same time. (if they do, that data is invalid :))

BTW, are you using compiler optimizations? If not, turn them on. That might cause you to see some nifty speed boosts.
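For the thread-pool idea, a rough sketch (names are hypothetical) using the Win32 QueueUserWorkItem pool so no threads are created per call; completion is tracked with a counter and an event:

#include <windows.h>

// Hypothetical helper struct; "Input" is the one defined above.
struct PoolJob
{
    Input* job;               // freed by crossProduct_thread (see above)
    volatile LONG* remaining; // chunks still outstanding
    HANDLE done;              // event signaled when the last chunk finishes
};

DWORD WINAPI poolWorker(LPVOID arg)
{
    PoolJob* pj = static_cast<PoolJob*>(arg);
    crossProduct_thread(pj->job);
    if (InterlockedDecrement(pj->remaining) == 0)
        SetEvent(pj->done);
    delete pj;
    return 0;
}

// In crossProduct, replace each CreateThread call with something like:
//     PoolJob* pj = new PoolJob; // fill in job, remaining, done
//     QueueUserWorkItem(poolWorker, pj, WT_EXECUTEDEFAULT);
// and replace the WaitForSingleObject loop with a single
//     WaitForSingleObject(done, INFINITE);
// where done = CreateEvent(NULL, TRUE, FALSE, NULL).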
 

Gamingphreek

Lifer
Mar 31, 2003
11,679
0
81
Optimizations are indeed turned on. They make a rather large difference actually - and this is the MS VC++ Compiler. I would imagine the Intel or GNU compiler would work even better.

I am really struggling at writing code that would compute the cross product in SIMD, though. I could have sworn that there was a SIMD instruction explicitly for the cross product; however, I can't find it.

-Kevin

 

degibson

Golden Member
Mar 21, 2008
1,389
0
0
Originally posted by: Gamingphreek
I am really struggling at writing code that would compute the cross product in SIMD, though. I could have sworn that there was a SIMD instruction explicitly for the cross product; however, I can't find it.

Well, there is the problem that your data type is double. SIMD works a lot better when your native data type is small (e.g., char, short, or even float). Many datapaths are only 64 or 128-bit in the first place, and on a 128-bit datapath you will not get more than 2x the FP computational throughput, ignoring all the other overheads. Depending on your platform, you might even be limited more by memory bandwidth.

If this is really critical code, you might try a modern GPU. They recently introduced support for double data types (though not all are IEEE-compliant double types). However, if the whole computation is just a cross product, it's probably not worth the effort of streaming the data out over PCIe and then back again.
 

Gamingphreek

Lifer
Mar 31, 2003
11,679
0
81
Originally posted by: degibson
Originally posted by: Gamingphreek
I am really struggling at writing code that would compute the cross product in SIMD, though. I could have sworn that there was a SIMD instruction explicitly for the cross product; however, I can't find it.

Well, there is the problem that your data type is double. SIMD works a lot better when your native data type is small (e.g., char, short, or even float). Many datapaths are only 64 or 128-bit in the first place, and on a 128-bit datapath you will not get more than 2x the FP computational throughput, ignoring all the other overheads. Depending on your platform, you might even be limited more by memory bandwidth.

If this is really critical code, you might try a modern GPU. They recently introduced support for double data types (though not all are IEEE-compliant double types). However, if the whole computation is just a cross product, it's probably not worth the effort of streaming the data out over PCIe and then back again.

Well this is one of many critical operations. From what I can understand, the code this concerns uses a lot of trig operations over vectors, lots of vector computation.

We use doubles simply because 6 decimal places of precision is not enough. The numbers are certainly small enough to be floats, but we need somewhere in the neighborhood of 10 decimals of precision.

If there is someway we can get this precision with SP Floating Points, I am more than willing to research it.

-Kevin
 

Cogman

Lifer
Sep 19, 2000
10,280
131
106
Originally posted by: Gamingphreek
Originally posted by: degibson
Originally posted by: Gamingphreek
I am really struggling at writing code that would compute the cross product in SIMD, though. I could have sworn that there was a SIMD instruction explicitly for the cross product; however, I can't find it.

Well, there is the problem that your data type is double. SIMD works a lot better when your native data type is small (e.g., char, short, or even float). Many datapaths are only 64 or 128-bit in the first place, and on a 128-bit datapath you will not get more than 2x the FP computational throughput, ignoring all the other overheads. Depending on your platform, you might even be limited more by memory bandwidth.

If this is really critical code, you might try a modern GPU. They recently introduced support for double data types (though not all are IEEE-compliant double types). However, if the whole computation is just a cross product, it's probably not worth the effort of streaming the data out over PCIe and then back again.

Well this is one of many critical operations. From what I can understand, the code this concerns uses a lot of trig operations over vectors, lots of vector computation.

We use doubles simply because 6 decimal places of precision is not enough. The numbers are certainly small enough to be floats, but we need somewhere in the neighborhood of 10 decimals of precision.

If there is someway we can get this precision with SP Floating Points, I am more than willing to research it.

-Kevin

Well, if you're interested, I've made an intrinsic SSE3 version of the cross product. However, the results are horrible (they did verify that thread creation was the major slowdown, though :)).

For a single-threaded app, the code doubled the time it takes to do a cross multiply; the multithreaded version absorbed the difference with no problem.

Just to show how I did it (it may not be the best way), here is the code I used.

void crossProduct_old(vector<vector<double> > &vector1, vector<vector<double> > &vector2, vector< vector<double> >& out)
{
// Instantiate vector rows.
out.resize(vector1.size());
size_t vecSize = vector1[0].size();

for (size_t i = 0; i < vector1.size(); ++i)
{
out[i].resize(vecSize);
}

__m128d a, b, c, d;
double first[2] __attribute__((aligned(16)));
double last[2] __attribute__((aligned(16)));
double output[2] __attribute__((aligned(16)));
for ( size_t i = 0; i < vecSize; ++i )
{
double vector10i = vector1[0][i];
double vector20i = vector2[0][i];
double vector11i = vector1[1][i];
double vector22i = vector2[2][i];
double vector12i = vector1[2][i];
double vector21i = vector2[1][i];
first[0] = vector11i;
first[1] = vector12i;
last[0] = vector22i;
last[1] = vector21i;

a = _mm_load_pd(first);
b = _mm_load_pd(last);
c = _mm_mul_pd(a, b);

first[0] = vector12i;
first[1] = vector10i;
last[0] = vector20i;
last[1] = vector22i;

a = _mm_load_pd(first);
b = _mm_load_pd(last);
d = _mm_mul_pd(a, b);

c = _mm_hsub_pd(c, d);

_mm_store_pd(output, c);

out[0][i] = (output[0]);
out[1][i] = (output[1]);
out[2][i] = ( vector10i * vector21i - ( vector11i * vector20i ) );
}
}
 

xtknight

Elite Member
Oct 15, 2004
12,974
0
71
Maybe you could use arrays and allocate with new/delete instead of using vectors. It might lower overhead.
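One way to picture that: a flat 3xN block in a single allocation removes both the per-row heap allocations and the double indirection. A sketch (sizes made up, not a drop-in replacement):

size_t n = 200000;
double* data = new double[3 * n]; // one contiguous allocation

// Row r, element i lives at data[r * n + i]:
data[0 * n + 5] = 1.0; // what vec[0][5] = 1.0 would have been

// ... use it ...

delete[] data;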
 

Ken g6

Programming Moderator, Elite Member
Moderator
Dec 11, 1999
16,328
4,003
75
I've been working on my own SSE2 (don't think it needs SSE3!) version of the cross multiply code. One thing I've found is that a Core 2 is limited by the number of memory reads it can do: one per cycle, with latency 2. Assuming vectors work like arrays, reading vector1[0][i] requires 3 reads, or 6 cycles; but it can be interleaved with another read, so two of those "only" take 7 cycles. That means the double assignments inside these loops take at least 21 cycles.

Again, I'm not sure about vectors; but if the outer vectors contained arrays of doubles, and you did:

double *vector10 = vector1[0];
double *vector20 = vector2[0];
double *vector11 = vector1[1];
double *vector21 = vector2[1];
double *vector12 = vector1[2];
double *vector22 = vector2[2];

before the loop, that would save two cycles on each load, making it take just 15 cycles. Doing the same for out[][] might help, too.
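To complete the idea (borrowing vecSize and out from Cogman's function, and assuming the rows really are contiguous arrays of double), the out rows can be hoisted the same way:

double *out0 = &out[0][0];
double *out1 = &out[1][0];
double *out2 = &out[2][0];
for ( size_t i = 0; i < vecSize; ++i )
{
    out0[i] = vector11[i] * vector22[i] - vector12[i] * vector21[i];
    out1[i] = vector12[i] * vector20[i] - vector10[i] * vector22[i];
    out2[i] = vector10[i] * vector21[i] - vector11[i] * vector20[i];
}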

Cogman, could you please post your testing code, so I can compare what I'm working on with your results?
 

Cogman

Lifer
Sep 19, 2000
10,280
131
106
int main()
{
vector< vector<double> > vec1;
vector< vector<double> > vec2;
vec1.resize(3);
vec2.resize(3);

for (int i = 0; i < 3; ++i)
for (int j = 0; j < 2000000; ++j)
{
vec1[i].push_back((double)rand());
vec2[i].push_back((double)rand());
}

LARGE_INTEGER time1, time2;

QueryPerformanceCounter(&time1);

vector< vector<double> > cross;
vector< vector<double> > cross2;

for (int i = 0; i < 100; ++i)
{
crossProduct(vec1, vec2, cross);
crossProduct_old(vec1, vec2, cross2);
}

for (int j = 0; j < cross.size(); ++j)
for (int i = 0; i < cross[j].size(); ++i)
{
if (cross[j][i] != cross2[j][i])
{
cout << "Not the same at: " << i << endl;
}
}

QueryPerformanceCounter(&time2);

LARGE_INTEGER freq;
QueryPerformanceFrequency(&freq);

double elapsed = ((double)(time2.QuadPart - time1.QuadPart) * 1000) / (double)(freq.QuadPart);

cout << elapsed << "ms have elapsed\n";

return 0;
}

The code isn't exactly the same as what I started with, but it gets the job done. :) I threw in a check to make sure that I was actually doing a cross multiply with SSE.
 

Gamingphreek

Lifer
Mar 31, 2003
11,679
0
81
Well, I am trying to implement the cross product by swapping vectors, but with a double it messes things up a bit.

Is there any way to extract more decimal places of precision out of a single-precision floating point? If I could use an SP float, I could theoretically get 4x the throughput. I could also store all 3 values in one register instead of using 2 registers to hold the first 3.

This is what I am working with right now:

V1 <3xn>:
x1 x4
x2 x5
x3 x6

V2 <3xn>:
y1 y4
y2 y5
y3 y6

If I do a read into a standard __m128d data type from the vector<vector<double>> array, I should grab x1, x2 and y1, y2. In order to do the cross product using swaps, I have to waste half of an __m128d read.

__m128d = *vector1
__m128d = *vector2
__m128d = ++*vector1
__m128d = ++*vector2

That would mean x3 and y3 are read in twice (or, if I increment twice more, extra values would be read in).

From what I have been reading, using the swap calls and then multiplying the __m128d values in their entirety at once is the most efficient way.

If someone knows any way to grab more decimal places of precision using a float instead of "wasting" a double, I am all ears.

-Kevin
 

Gamingphreek

Lifer
Mar 31, 2003
11,679
0
81
Well, I just had an idea, which may end up taking more time overall:

When I read the values into the __m128 data type, if I do:

__m128 = (float)(*vector1 * 10000)

I would get 10 decimal places of precision and still be well within the bounds of a 32-bit SP float value.

Then, before I pass the value back to the caller, I simply cast it to a double and divide by 10000.

That would allow me to fit 4 values in each SSE2 register and thereby move through the vector much faster, in my mind. I'm not sure how much overhead is involved in the multiply and cast, though (or whether my cast will even work).

-Kevin
 

Venix

Golden Member
Aug 22, 2002
1,084
3
81
The precision of a float is fixed. You will not gain more precision by multiplying it by 10,000; all that will do is move some of the lost precision to the left of the decimal point.

I have great difficulty believing that this function is the performance bottleneck of your application. Have you done any profiling yet, or are you just assuming that this needs to be aggressively optimized?

Even an SSE2 implementation isn't that much faster:

const __m128d* v10 = reinterpret_cast<__m128d*>(&vector1[0][0]);
const __m128d* v11 = reinterpret_cast<__m128d*>(&vector1[1][0]);
const __m128d* v12 = reinterpret_cast<__m128d*>(&vector1[2][0]);
const __m128d* v20 = reinterpret_cast<__m128d*>(&vector2[0][0]);
const __m128d* v21 = reinterpret_cast<__m128d*>(&vector2[1][0]);
const __m128d* v22 = reinterpret_cast<__m128d*>(&vector2[2][0]);

__m128d* res0 = reinterpret_cast<__m128d*>(&res[0][0]);
__m128d* res1 = reinterpret_cast<__m128d*>(&res[1][0]);
__m128d* res2 = reinterpret_cast<__m128d*>(&res[2][0]);

for (unsigned int i = 0; i < (size / 2); ++i)
{
__m128d xmml0 = _mm_mul_pd(v11[i], v22[i]);
__m128d xmmr0 = _mm_mul_pd(v12[i], v21[i]);

__m128d xmml1 = _mm_mul_pd(v10[i], v22[i]);
__m128d xmmr1 = _mm_mul_pd(v12[i], v20[i]);

__m128d xmml2 = _mm_mul_pd(v10[i], v21[i]);
__m128d xmmr2 = _mm_mul_pd(v11[i], v20[i]);

res0[i] = _mm_sub_pd(xmml0, xmmr0);
res1[i] = _mm_sub_pd(xmmr1, xmml1);
res2[i] = _mm_sub_pd(xmml2, xmmr2);
}

It takes about 60-70% of the time of the original code, but it requires the vector to be 16-byte aligned (although that's the default for MSVC++'s "new") and its size to be a multiple of two. Meanwhile, the majority of the time in the function is eaten by the resize of the vectors, since vector default initializes every element. You would see much greater performance gains by simply using a container that allows for uninitialized values; one possibility may be to write an allocator that does nothing in its "construct" function.

But again, I imagine that there are other areas in the program that can be more easily optimized for much greater gains. For example, copying one of these humongous vector<vector<>>'s would be horribly slow, so ensure that you never return one by value or assign to a new variable.

With 2,000,000 elements, this cross-product loop (excluding the vector resize) takes only ~65 ms (~41 with SSE2); unless you're calling it hundreds or thousands of times, that's pretty insignificant.
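One sketch of the do-nothing-construct allocator idea. An assumption on my part: this uses C++11 allocator_traits, which is newer than the compilers discussed in this thread. Default construction becomes a no-op for doubles while value construction still works:

#include <memory>
#include <utility>
#include <vector>

template <typename T, typename A = std::allocator<T> >
struct default_init_allocator : public A
{
    typedef std::allocator_traits<A> traits;
    template <typename U> struct rebind
    {
        typedef default_init_allocator<U,
            typename traits::template rebind_alloc<U> > other;
    };
    using A::A;

    // Default construction: default-initialize, i.e. leave doubles alone.
    template <typename U>
    void construct(U* p) { ::new (static_cast<void*>(p)) U; }

    // Value construction: defer to the wrapped allocator.
    template <typename U, typename... Args>
    void construct(U* p, Args&&... args)
    {
        traits::construct(static_cast<A&>(*this), p,
                          std::forward<Args>(args)...);
    }
};

// std::vector<double, default_init_allocator<double> > v;
// v.resize(2000000); // allocates, but does not zero the elements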
 

Ken g6

Programming Moderator, Elite Member
Moderator
Dec 11, 1999
16,328
4,003
75
Well, I figured out why none of my ideas were working. I'm using gcc 4.3.3, and gcc is too smart for me! Once I compiled with gcc -S to look at the assembly source, I found that it (1) pre-calculates things like out[1], so out[1][i] is only two operations, and (2) it found the possibility for SIMD all by itself and inserted it automatically!

I'm going to try OpenMP next...

Edit: OpenMP is just slightly faster, but uses much more CPU. To enable it, #include <omp.h>, compile with OpenMP turned on (-fopenmp for gcc, /openmp for MSVC), and make the central loop look like the following:
int i; // declared outside the loop so private(i) can name it; OpenMP 2.5 wants a signed index
int n = (int)vecSize;
int chunk = (n / 4) + 1;
#pragma omp parallel shared(n, chunk) private(i)
{
#pragma omp for schedule(dynamic,chunk)
for ( i = 0; i < n; ++i )
{
double vector10i = vector1[0][i];
double vector20i = vector2[0][i];
double vector11i = vector1[1][i];
double vector22i = vector2[2][i];
double vector12i = vector1[2][i];
double vector21i = vector2[1][i];
out[0][i] = ( vector11i * vector22i - ( vector12i * vector21i ) );
out[1][i] = ( -( vector10i * vector22i ) + vector12i * vector20i );
out[2][i] = ( vector10i * vector21i - ( vector11i * vector20i ) );
}
}
 

Gamingphreek

Lifer
Mar 31, 2003
11,679
0
81
The precision of a float is fixed. You will not gain more precision by multiplying it by 10,000; all that will do is move some of the lost precision to the left of the decimal point.

So if I had 1.23456789, obviously in a float only 1.234567 would be accurate. So even if I multiplied by 100, resulting in a double of 123.456789, and then cast it to a float, it would still not be accurate?

I have great difficulty believing that this function is the performance bottleneck of your application. Have you done any profiling yet, or are you just assuming that this needs to be aggressively optimized?

Well, in the program/project I am working on, it more than likely isn't the problem. The cross product, however, is one of the bottlenecks of the code that my boss and others use elsewhere. I haven't actually seen that code, as it is in the TS/SCI area (someone would have to bring it out to me).

As for the profiler, I actually haven't run the code through one yet, for the simple reason that, in VS2008, the profiler is only available in two editions. My company owns VS2008 Professional, which does not include it.

It takes about 60-70% of the time of the original code, but it requires the vector to be 16-byte aligned (although that's the default for MSVC++'s "new") and its size to be a multiple of two. Meanwhile, the majority of the time in the function is eaten by the resize of the vectors, since vector default initializes every element. You would see much greater performance gains by simply using a container that allows for uninitialized values; one possibility may be to write an allocator that does nothing in its "construct" function.

Well, the code, running on multiple computers, normally takes a weekend to run through the datasets it is tasked with. So cutting 30-40% off one part of the program that is called extensively should be at least somewhat beneficial.

As for writing our own allocator - honestly, that is above me. If you point me to some guides that explain the concept, I will happily read over them, learn it, and try my hand at it.

But again, I imagine that there are other areas in the program that can be more easily optimized for much greater gains. For example, copying one of these humongous vector<vector<>>'s would be horribly slow, so ensure that you never return one by value or assign to a new variable.

Well, in my code, I am in control of the array in question; however, in others' code, it is not in my control and is used a little too extensively to just up and switch unless I can prove huge gains.

With 2,000,000 elements, this cross-product loop (excluding the vector resize) takes only ~65 ms (~41 with SSE2); unless you're calling it hundreds or thousands of times, that's pretty insignificant.

Array sizes, from what I am told, are in the 3x(x)00,000s. There are, again from what I understand, many of these per data set, so there should be at least a decent gain.

--------------------

Having said that, the only other information I know at this point is that trig functions are currently called for each element in the array. I assume that if I read the values into the __m128 data type, called the cross product, and looped that way, I would save time?

--------------------

Well, I figured out why none of my ideas were working. I'm using gcc 4.3.3, and gcc is too smart for me! Once I compiled with gcc -S to look at the assembly source, I found that it (1) pre-calculates things like out[1], so out[1][i] is only two operations, and (2) it found the possibility for SIMD all by itself and inserted it automatically!

The gcc compiler is so great :). I think they switched to the Intel compiler briefly and, while it was faster, it also broke some things. So, until someone rewrites that program, we are stuck with the MS (<- bleh) compiler.

Edit: OpenMP is just slightly faster, but uses much more CPU. To enable it, #include <omp.h>, compile with OpenMP turned on (-fopenmp for gcc, /openmp for MSVC), and make the central loop look like the following:

Yea, I was looking at that as well. I'll have to investigate the results when paired with the SSE code.

Thanks for all the help guys. As I said, I have learned a TON from this thread and am still eagerly burning through all documentation on both concepts that I can :)

-Kevin
 

iCyborg

Golden Member
Aug 8, 2008
1,330
56
91
Originally posted by: Gamingphreek
So if I had 1.23456789, obviously in a float only 1.234567 would be accurate. So even if I multiplied by 100, resulting in a double of 123.456789, and then cast it to a float, it would still not be accurate?
Floats aren't like that; they are like scientific notation. So, hypothetically, if 1.23456789 is rounded to 1.234567, then as a float it would be 1.234567 x 10^0. If you multiply by 10,000, you get 1.234567 x 10^4 -- so no, you don't gain anything with that. You don't get a fixed number of digits to the right of the decimal point; that's why it's called a floating point. I'm just using base 10 for the idea here; the base is actually 2, so the representation is different.
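A quick way to convince yourself (a hypothetical demo; the exact digits printed depend on the value, but a float keeps roughly 7 significant decimal digits no matter where the point sits):

#include <cstdio>

int main()
{
    double d = 1.23456789;
    float direct = (float)d;             // ~7 significant digits survive
    float scaled = (float)(d * 10000.0); // scaling first doesn't add any
    printf("%.9f\n", direct);            // prints something like 1.234567881
    printf("%.9f\n", scaled / 10000.0f); // same ~7 good digits, still off past them
    return 0;
}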