Multithreading/SIMD C++

Gamingphreek

Lifer
Mar 31, 2003
11,679
0
81
I have code that I am working on that I am able to extract a great deal of parallelism out of.

For instance, I have for loops that have nothing to do with each other, each reading values in from an external binary file and storing them in a vector<vector<double>>.

These could be done simultaneously; I simply don't know how.

Can someone point me in the right direction for learning multithreading in C++? I searched Google, but the information is kind of scattered.

Thanks
-Kevin
 

nickbits

Diamond Member
Mar 10, 2008
4,122
1
81
What platform?

If your loops are pulling files from the same disk, it might be slower than if you read the data sequentially.
 

DaveSimmons

Elite Member
Aug 12, 2001
40,730
670
126
Have you done some profiling to see if it will help? If those sections aren't the bottleneck for your application then it adds a lot of effort and chances for new bugs for little or no gain.
 
IHateMyJob2004

Sep 29, 2004
18,656
67
91
Originally posted by: sao123
Originally posted by: nickbits
What platform?

Yaaaaaa

In the real world:
Don't multithread unless you already know that there is a performance issue

In academia:
do it!

Advice:
You will want to load the files in their entirety, sequentially. So:
1) Load the first file into memory.
2) Start processing file 1 in thread one.
3) Start loading file two.
4) After file 2 is loaded, start processing it.


The point is, don't try to load the files in parallel. The hard drive will be a bottleneck, especially as the head jumps all over the place.

And of course, I'd assume you have 2+ processors/cores.

Regardless of the OS, read up on semaphores: what they are and why they are needed.

I don't think Java VMs take advantage of two cores yet. Anyone?

 

Gamingphreek

Lifer
Mar 31, 2003
11,679
0
81
Hey everyone - sorry, commute back home sucks on a lot of levels.

The platform is 32bit Windows. I'm compiling in VS2008 Professional.

As for the application... I can't get into too much but this is what I can say.

I am reading a flat binary file into a buffer, which is a dynamically allocated one-dimensional array. Since the read() function only accepts char*, the array is of type char*. Then, since the data is stored as doubles, I create a pointer to a double, iterate through the array with it, and store the values in a 2D vector (vector<vector<double>>).

At any rate, the entire portion of the file that I need is loaded into the buffer and, subsequently, the vector (Also a buffer I suppose). The operations are vector ops so they are all operations that are in memory.

Things such as populating new vectors with the result of the initial vector and mathematical computations are things I would like to perform in parallel. Many times, they are in the same loop (but it can be broken up), but they are not dependent on values that would be using the same memory address (Thus, there should be no deadlock from what I can see)

Thanks,
-Kevin

 

Cogman

Lifer
Sep 19, 2000
10,284
138
106
Read up on the Windows CreateThread function and the EnterCriticalSection/LeaveCriticalSection functions. Those should be what you need. Also make sure you understand race conditions.

IE

for(int i = 0; i < 3; ++i)
    createthread(func, i);

will cause a race condition because the main thread could update i before the new thread is created. There are ways around this, but just be aware that it is an issue.

For the most part, Unix and Windows multithreading are pretty similar. Read up on Pthreads if you want to know how to code Unix-style threads.

As for your problem, unless your operation takes longer than it takes to get the data off the hard drive, you aren't going to see any benefit from multithreading in this case. But you can test that for yourself. Just make sure you lock the vector before you read/write to it, and keep it locked until all operations on it are done; otherwise things get really messy really quickly.
 

Gamingphreek

Lifer
Mar 31, 2003
11,679
0
81
Originally posted by: Cogman
Read up on the Windows CreateThread function and the EnterCriticalSection/LeaveCriticalSection functions. Those should be what you need. Also make sure you understand race conditions.

IE

for(int i = 0; i < 3; ++i)
    createthread(func, i);

will cause a race condition because the main thread could update i before the new thread is created. There are ways around this, but just be aware that it is an issue.

For the most part, Unix and Windows multithreading are pretty similar. Read up on Pthreads if you want to know how to code Unix-style threads.

As for your problem, unless your operation takes longer than it takes to get the data off the hard drive, you aren't going to see any benefit from multithreading in this case. But you can test that for yourself. Just make sure you lock the vector before you read/write to it, and keep it locked until all operations on it are done; otherwise things get really messy really quickly.

The sizes of the arrays vary, but in this particular data set, if I were to load every single value, two things happen:
1. I run out of room in the physical address space on a 32-bit machine. I hit 2GB and then, obviously, the program crashes.
2. If I remember right, that would be loading in excess of 14 million values. 99.9% of the time this won't happen; however, this is one instance (if I managed to get it into a buffer) where I believe multithreading would help process a data set that large.

Thanks for the advice - I will definitely look into those functions. I understand the principles behind threading and locking values; however, I have never done it in practice.

-Kevin
 

JasonCoder

Golden Member
Feb 23, 2005
1,893
1
81
Originally posted by: Cogman
Read up on the Windows CreateThread function and the EnterCriticalSection/LeaveCriticalSection functions. Those should be what you need. Also make sure you understand race conditions.

IE

for(int i = 0; i < 3; ++i)
    createthread(func, i);

will cause a race condition because the main thread could update i before the new thread is created. There are ways around this, but just be aware that it is an issue.

Wait, what? I'm pretty sure the for loop will wait for createthread to return before proceeding with the loop.

Races are definitely important to understand, though. For instance, if you have lots of routines in different threads incrementing a counter, like the old interface reference counter in COM, you'll want to make sure even increment operations are mutexed or somehow locked so they happen one at a time.

It turns out even the increment operator (the ++ in C++) isn't actually one statement. It compiles down to two or three operations. If you're using .NET, use the Interlocked class.

 

degibson

Golden Member
Mar 21, 2008
1,389
0
0
I agree with IHateMyJob2004:
1) Only thread for performance when you're sure you can get performance from threading. Single disks strongly prefer sequential accesses. What you would probably benefit from more is a second file descriptor on your big file, opened non-blocking, that always read()s N chunks ahead of where you're actually operating (aka an asynchronous prefetch). No thread required.

In agreement with Cogman & JasonCoder:
2) Beware data races. Also beware assumptions of orderedness. It's often difficult to be sure that A happens before B when A and B are in different threads.
3) Fair warning: never try to roll your own synchronization unless you know exactly what you are doing. Use someone else's mutexes, locks, semaphores, condition variables. Only make your own if you know the difference between a store fence and a load fence, and can explain the differences between total store order and processor consistency to your grandmother.

General observations:
4) It sounds like you're doing some data movement after you read the file -- transforming a double* to a vector<vector<double> >. That part might benefit from threading. However, put a couple of calls to time() around it and make sure that step takes seconds (not microseconds or milliseconds) to run before you start spawning threads.

5) If your parallelism takes the form of loops, I'd encourage you NOT to look directly at createThread() -- that's like staring at the sun. Check out one of the various parallelism infrastructures designed for loop parallelization, like Intel's TBB (http://www.threadingbuildingblocks.org/) or OpenMP (http://openmp.org/wp/).
 

Gamingphreek

Lifer
Mar 31, 2003
11,679
0
81
Originally posted by: degibson
I agree with IHateMyJob2004:
1) Only thread for performance when you're sure you can get performance from threading. Single disks strongly prefer sequential accesses. What you would probably benefit from more is a second file descriptor on your big file, opened non-blocking, that always read()s N chunks ahead of where you're actually operating (aka an asynchronous prefetch). No thread required.

In agreement with Cogman & JasonCoder:
2) Beware data races. Also beware assumptions of orderedness. It's often difficult to be sure that A happens before B when A and B are in different threads.
3) Fair warning: never try to roll your own synchronization unless you know exactly what you are doing. Use someone else's mutexes, locks, semaphores, condition variables. Only make your own if you know the difference between a store fence and a load fence, and can explain the differences between total store order and processor consistency to your grandmother.

General observations:
4) It sounds like you're doing some data movement after you read the file -- transforming a double* to a vector<vector<double> >. That part might benefit from threading. However, put a couple of calls to time() around it and make sure that step takes seconds (not microseconds or milliseconds) to run before you start spawning threads.

5) If your parallelism takes the form of loops, I'd encourage you NOT to look directly at createThread() -- that's like staring at the sun. Check out one of the various parallelism infrastructures designed for loop parallelization, like Intel's TBB (http://www.threadingbuildingblocks.org/) or OpenMP (http://openmp.org/wp/).

Yes, the disk read will stay strictly sequential. Performing reads from multiple points in the same file at the same time would, I imagine, decrease performance.

I'll have to research data races and all of that. I imagine my notion of what they are is correct; however, I have no knowledge of the intricacies.

Yes, I am doing data movement after the initial read. Because of the way the data is structured in the file, as well as the way the read() function operates, I need to read it into a buffer before I can move it to my 2D vector.

I immediately deallocate the first buffer (the array) after everything is organized into the vector.

What I envision, and correct me if I am wrong, I can extract a great deal of parallelism from these areas:
1. Moving the data from the array (Memory) into the vector (Memory)
2. Performing operations by rows of data in the vector and storing them in another vector

For instance:
*********CODE*************
for ( unsigned int i = 0; i < readingsArr[0].size(); i++ )
{
    posVecs[0].push_back( eRad[i] * ( cos( readingsArr[2][i] ) * cos( gcLat[i] ) ) + readingsArr[3][i] * upVecs[0][i] );
    posVecs[1].push_back( eRad[i] * ( sin( readingsArr[2][i] ) * cos( gcLat[i] ) ) + readingsArr[3][i] * upVecs[1][i] );
    posVecs[2].push_back( eRad[i] * ( sin( gcLat[i] ) ) + readingsArr[3][i] * upVecs[2][i] );
}
*********CODE************

Since the rows are, in no way, dependent on each other each could have their own thread (It may not seem like much in this example, but each value is a double, and there can be thousands of numbers in each row for each array that I pull from).

With the data set parameters that I use right now, the code is near instantaneous; however, if I increase one aspect or another of the program and take into account that this is merely a subroutine of a larger program - I imagine that it will take a little longer.

Additionally, this is an internship - so now is as good a time as any for me to push my boundaries and start working with threading (Benefits both myself and my employer).

-Kevin
 

Cogman

Lifer
Sep 19, 2000
10,284
138
106
Originally posted by: JasonCoder
Originally posted by: Cogman
Read up on the Windows CreateThread function and the EnterCriticalSection/LeaveCriticalSection functions. Those should be what you need. Also make sure you understand race conditions.

IE

for(int i = 0; i < 3; ++i)
    createthread(func, i);

will cause a race condition because the main thread could update i before the new thread is created. There are ways around this, but just be aware that it is an issue.

Wait, what? I'm pretty sure the for loop will wait for createthread to return before proceeding with the loop.

Races are definitely important to understand, though. For instance, if you have lots of routines in different threads incrementing a counter, like the old interface reference counter in COM, you'll want to make sure even increment operations are mutexed or somehow locked so they happen one at a time.

It turns out even the increment operator (the ++ in C++) isn't actually one statement. It compiles down to two or three operations. If you're using .NET, use the Interlocked class.

Ummm. What are you smoking? The increment operator, especially as I have it written, is exactly one instruction. In fact, most processors have a special increment instruction, specifically inc. Even if that weren't the case, it would just be something to the effect of add $1, %eax.

And don't be so sure that won't cause a race condition. Give it a try if you don't believe me. It turns out that sometimes i will update before the thread has time to use it.
 

Ken g6

Programming Moderator, Elite Member
Moderator
Dec 11, 1999
16,585
4,495
75
Originally posted by: Gamingphreek
What I envision, and correct me if I am wrong, I can extract a great deal of parallelism from these areas:
1. Moving the data from the array (Memory) into the vector (Memory)
2. Performing operations by rows of data in the vector and storing them in another vector

...

Since the rows are, in no way, dependent on each other each could have their own thread (It may not seem like much in this example, but each value is a double, and there can be thousands of numbers in each row for each array that I pull from).

With the data set parameters that I use right now, the code is near instantaneous; however, if I increase one aspect or another of the program and take into account that this is merely a subroutine of a larger program - I imagine that it will take a little longer.
OK, I'm going to approach this from the outside in. First, are you sure your code is the bottleneck in the application? Have you profiled the application and found that your code takes the most time? For the journey of 1000 miles, making your starter motor start your car a few seconds faster won't significantly improve your travel time.

Next, for area 1, is the 2D vector constant once loaded? Or does its size at least remain the same? If so, you can fairly easily treat the initial buffer as a constant-sized 2D array. The basic idea is to create the buffer and create a different-type pointer to the same memory. You should probably either create the buffer as doubles, or explicitly give __declspec(align()) directives (which you might want to anyway), so that the doubles aren't split across DWord boundaries. Then, if the rows are a constant width, you can access them with a function like val(x,y) { return a[WIDTH*y+x]; }. If they're not, or you don't like that syntax, you can create a smaller array of double pointers, each pointed at the beginning of a row. Then you access an entry with parr[y][x]. This is known as a "jagged array" in C# and Java.

Oh, and I recall degibson mentioned mmap() recently; it might be useful to you too.

Finally, for point 2, don't overlook SIMD. We had a thread about MMX/SSEx a few months back. I've found compiler intrinsics work well if you have either very few individual SIMD variables/constants (4 or so) or very many (much more than 8). In between there you might do better with some inline assembly. But for a loop, also use OpenMP like degibson said.
 

Gamingphreek

Lifer
Mar 31, 2003
11,679
0
81
Originally posted by: Ken g6
Originally posted by: Gamingphreek
What I envision, and correct me if I am wrong, I can extract a great deal of parallelism from these areas:
1. Moving the data from the array (Memory) into the vector (Memory)
2. Performing operations by rows of data in the vector and storing them in another vector

...

Since the rows are, in no way, dependent on each other each could have their own thread (It may not seem like much in this example, but each value is a double, and there can be thousands of numbers in each row for each array that I pull from).

With the data set parameters that I use right now, the code is near instantaneous; however, if I increase one aspect or another of the program and take into account that this is merely a subroutine of a larger program - I imagine that it will take a little longer.
OK, I'm going to approach this from the outside in. First, are you sure your code is the bottleneck in the application? Have you profiled the application and found that your code takes the most time? For the journey of 1000 miles, making your starter motor start your car a few seconds faster won't significantly improve your travel time.

Next, for area 1, is the 2D vector constant once loaded? Or does its size at least remain the same? If so, you can fairly easily treat the initial buffer as a constant-sized 2D array. The basic idea is to create the buffer and create a different-type pointer to the same memory. You should probably either create the buffer as doubles, or explicitly give __declspec(align()) directives (which you might want to anyway), so that the doubles aren't split across DWord boundaries. Then, if the rows are a constant width, you can access them with a function like val(x,y) { return a[WIDTH*y+x]; }. If they're not, or you don't like that syntax, you can create a smaller array of double pointers, each pointed at the beginning of a row. Then you access an entry with parr[y][x]. This is known as a "jagged array" in C# and Java.

Oh, and I recall degibson mentioned mmap() recently; it might be useful to you too.

Finally, for point 2, don't overlook SIMD. We had a thread about MMX/SSEx a few months back. I've found compiler intrinsics work well if you have either very few individual SIMD variables/constants (4 or so) or very many (much more than 8). In between there you might do better with some inline assembly. But for a loop, also use OpenMP like degibson said.

I am not able to profile the code because MS, in all its wisdom, decided not to include the code profiler in VS2008 Professional. I downloaded a trial version of dotTrace, but I am waiting for IT to come by and install it on my machine.

OK - I will get as detailed as I am able, because the compiler intrinsics have me excited about the learning benefit. My boss and other employees (who are professional software developers) had never heard of them and are now researching them.

void read_sbet_ECEF::bufferToArray()
{
    // Read the entire group of data into a buffer.
    // (Note the byte count: chunkSize records of 11 doubles each.)
    char* memBlock = new char[chunkSize * 11 * sizeof(double)];
    outFile->read( memBlock, chunkSize * 11 * sizeof(double) );

    // Reinterpret the char buffer as an array of doubles.
    double* doubleptr = (double*)memBlock;

    // Store in an [11][n] vector array.
    for ( int i = 0; i < 11; i++ )
    {
        readingsArr.push_back( vector<double>() );

        for ( int j = 0; j < chunkSize; j++ )
        {
            readingsArr[i].push_back( doubleptr[i + ( j * 11 )] );
        }
    }

    delete[] memBlock;
}

That is how I read in the values.

The file that I read in is formatted as a .out file which is a flat binary structure. The file is, in my current test set, ~500MB.

The value chunkSize is determined later and is the number of records to read from this file. The data in the file is read in to be formatted as 11 rows by 660 columns. Since I am reading the data in as chars, to accommodate values of type double, I multiply by sizeof(double).

Once the values are in a formatted vector buffer, I can then do operations and store the results in other vectors for use elsewhere in the program. For instance, I wrote a generic cross-product function. Another example: I have to iterate through row x, take the cos() of every value, multiply it by row y, and then save the result in a new vector VEC.

There are a TON of instances where I have to do operations like that throughout the program. For instance, here is one of many loops that are similar.

for ( int i = 0; i < 3; i++ )
{
    upVecs.push_back( vector<double>() );
    eastVecs.push_back( vector<double>() );
    northVecs.push_back( vector<double>() );
    posVecs.push_back( vector<double>() );
}


for ( unsigned int i = 0; i < readingsArr[1].size(); i++ )
{
    // Calculates the geocentric latitude.
    gcLat.push_back( atan( pow( ( 1 - Constants::EARTH_FLAT ), 2 ) * tan( readingsArr[1][i] ) ) );

    // Calculates the earth radius value vector.
    eRad.push_back( 1 /
        sqrt(
            pow( ( cos( gcLat[i] ) / Constants::EARTH_EQU_RADIUS ), 2 ) +
            pow( ( sin( gcLat[i] ) / Constants::EARTH_POLAR_RADIUS ), 2 )
        )
    );

    // Calculates the upVectors.
    upVecs[0].push_back( cos( readingsArr[2][i] ) * cos( readingsArr[1][i] ) );
    upVecs[1].push_back( sin( readingsArr[2][i] ) * cos( readingsArr[1][i] ) );
    upVecs[2].push_back( sin( readingsArr[1][i] ) );

    // Calculates the eastVectors.
    eastVecs[0].push_back( -sin( readingsArr[2][i] ) );
    eastVecs[1].push_back( cos( readingsArr[2][i] ) );
    eastVecs[2].push_back( 0 );
}

For reference, here is my cross product function as well:
vector<vector<double> > read_sbet_ECEF::crossProduct( vector<vector<double> >& vector1, vector<vector<double> >& vector2 )
{
    // Holds the cross product before returning the value.
    vector<vector<double> > tempVector;

    // Instantiate the three vector rows.
    for ( int i = 0; i < 3; i++ )
    {
        tempVector.push_back( vector<double>() );
    }

    // Iterate over the operands' own length (not the member readingsArr),
    // so the function works for any pair of equal-length inputs.
    for ( unsigned int i = 0; i < vector1[0].size(); i++ )
    {
        tempVector[0].push_back( vector1[1][i] * vector2[2][i] - ( vector1[2][i] * vector2[1][i] ) );
        tempVector[1].push_back( -( vector1[0][i] * vector2[2][i] ) + vector1[2][i] * vector2[0][i] );
        tempVector[2].push_back( vector1[0][i] * vector2[1][i] - ( vector1[1][i] * vector2[0][i] ) );
    }

    return tempVector;
}

I apologize for the length of this post and I certainly don't expect any code on anyone's part, but some advice and recommendations (Possible Tutorials, Books, or websites) would be amazing!

-Kevin

Edit: On a major side note, we use A LOT of FFTs - is there, by chance, any way I can use SIMD to speed those up significantly as well? If so, I may have just found the holy grail for my division.
 

Cogman

Lifer
Sep 19, 2000
10,284
138
106
Well, as far as optimizations go, there are a couple of things I noticed that you could improve to see some speedups.

Every time you do something like vector[w] in your code, you are inserting a function call. While it won't make your code look cleaner (optimizations rarely do), a faster method is to save off the values of vector[w] into variables and then do the math on those. For example,

for ( unsigned int i = 0; i < readingsArr[0].size(); i++ )
{
    tempVector[0].push_back( vector1[1][i] * vector2[2][i] - ( vector1[2][i] * vector2[1][i] ) );
    tempVector[1].push_back( -( vector1[0][i] * vector2[2][i] ) + vector1[2][i] * vector2[0][i] );
    tempVector[2].push_back( vector1[0][i] * vector2[1][i] - ( vector1[1][i] * vector2[0][i] ) );
}

could be changed to

for ( unsigned int i = 0; i < readingsArr[0].size(); i++ )
{
    double vector10i = vector1[0][i];
    double vector20i = vector2[0][i];
    double vector11i = vector1[1][i];
    double vector22i = vector2[2][i];
    double vector12i = vector1[2][i];
    double vector21i = vector2[1][i];
    tempVector[0].push_back( vector11i * vector22i - ( vector12i * vector21i ) );
    tempVector[1].push_back( -( vector10i * vector22i ) + vector12i * vector20i );
    tempVector[2].push_back( vector10i * vector21i - ( vector11i * vector20i ) );
}

This will cut the number of instructions used in half. You might remove some memory dereferencing as well (not guaranteed). The same could be done with the sin and cos stuff, IE double sinreadingsArr2i = sin( readingsArr[2][i] ); That will cut down on your use of the sin/cos functions as well (again, not pretty, but faster).

You'll also want to decrease your loop overhead:

for ( unsigned int i = 0; i < readingsArr[0].size(); i++ )

This has to call the vector operator[] function and then the vector.size() function for EVERY iteration! Yikes. Rather,

size_t size = readingsArr[0].size();
for ( unsigned int i = 0; i < size; ++i )

will dramatically decrease your loop overhead.

I would do stuff like this to the code before considering multithreading. You'll be amazed at the speed increase it can offer. One further optimization that might interest you: if you have a vector whose size you won't be changing, you can get a good speed increase by doing something like

someFunc( vector<double>& bob )
{
    double* rawbob = &bob[0];
}

and then working only with rawbob. That bypasses the vector operator[] function altogether, which decreases the overhead of using a vector by a ton.

I hope this helps somewhat. Even if you do go the route of multithreading, these optimizations will still offer a huge speed increase (for those functions; Amdahl's law and all).

[edit] Fuse talk, I hate you![/edit]
 

iCyborg

Golden Member
Aug 8, 2008
1,342
59
91
Originally posted by: Gamingphreek
Edit: On a major side note, we use A LOT of FFTs - is there, by chance, any way I can use SIMD to speed those up significantly as well? If so, I may have just found the holy grail for my division.

Both Intel's IPP and AMD's Framewave libraries have FFT routines, and both are optimized for SIMD and multi-core. The downside is that each one is optimized for its own vendor's CPUs and, even worse, Intel's IPP isn't free. If you're writing something commercial, it might be worthwhile to check.
 

Cogman

Lifer
Sep 19, 2000
10,284
138
106
For the FFTs, check out Doom9's board and ask them for recommendations; they use them all the time. I wouldn't be surprised if Dark Shikari gave you some good suggestions.
 

Gamingphreek

Lifer
Mar 31, 2003
11,679
0
81
Oh wow - this is amazing. I felt like there was a lot of optimization that could be done, but I didn't know where to look. Are there books or websites where I can read about optimizing code? I know you generally pick it up through working with low-level code; however, this is really neat stuff.

As for the FFTs, right now I have no idea what they use. The FFT usage I am referring to is in a SCIF, so I have no knowledge of what they are using. I would imagine it is absolutely horrid, though, given all the rewriting of code we are doing (code sets run for days right now - if there is some way I can reduce that to fewer days, or hours... watch out - I'm learning everything I can). I will make these suggestions with regard to GPU-based code as well as SSE-optimized code. Nobody here has even heard of SSE... I plan on making myself the first :)

My program director said that he will purchase any books that I feel might be beneficial. Are there books where I can learn about compiler intrinsics such as SSE? I would love to learn and, hopefully, if I can show promising results, go through and optimize code.

-Kevin
 

Cogman

Lifer
Sep 19, 2000
10,284
138
106
For SSE, I don't really know of any books (I've been wanting to learn more about SSE myself; however, as you've probably found, there aren't a whole lot of resources and examples for it).

For optimizations, a good computer architecture book will do the trick. "Computer Systems: A Programmer's Perspective" by Randal E. Bryant and David R. O'Hallaron is an AMAZING book when it comes to understanding what is going on at the low levels of a computer. However, it should be noted that most optimizations are already done by a modern compiler. The biggest ones that aren't are removals of unnecessary function calls: it is hard for the compiler to know that vector[] isn't changing anything, and that the value can therefore just be loaded into a register.

Other than that, umm, a quick search for C optimizations might yield some fruit.

Just remember Amdahl's law: you want to spend your time optimizing the code that takes the longest to execute. You really need a profiler to do this (unless you just want to optimize everything :)). Barring that, a timer might be your only other option, IE

timer = GetTickCount();
someFunc();
timer = GetTickCount() - timer;
cout << timer;

For any code you optimize, you'll want to do this anyway, to ensure that you are actually making things run faster and not slower.
 

JasonCoder

Golden Member
Feb 23, 2005
1,893
1
81
Originally posted by: Cogman
Originally posted by: JasonCoder
Originally posted by: Cogman
Read up on the Windows CreateThread function and the EnterCriticalSection/LeaveCriticalSection functions. Those should be what you need. Also make sure you understand race conditions.

IE

for(int i = 0; i < 3; ++i)
    createthread(func, i);

will cause a race condition because the main thread could update i before the new thread is created. There are ways around this, but just be aware that it is an issue.

Wait, what? I'm pretty sure the for loop will wait for createthread to return before proceeding with the loop.

Races are definitely important to understand, though. For instance, if you have lots of routines in different threads incrementing a counter, like the old interface reference counter in COM, you'll want to make sure even increment operations are mutexed or somehow locked so they happen one at a time.

It turns out even the increment operator (the ++ in C++) isn't actually one statement. It compiles down to two or three operations. If you're using .NET, use the Interlocked class.

Ummm. What are you smoking? The increment operator, especially as I have it written, is exactly one instruction. In fact, most processors have a special increment instruction, specifically inc. Even if that weren't the case, it would just be something to the effect of add $1, %eax.

And don't be so sure that won't cause a race condition. Give it a try if you don't believe me. It turns out that sometimes i will update before the thread has time to use it.

I quit smoking many years ago, but thanks for your concern. And I don't care if most processors do x... there's nothing in the standard about ++ being thread safe. I brought it up because it's a classic threading gotcha. Assume at your own peril.

Ownage link 1

Ownage link 2

Ownage link 3

I see what you are saying now about the createThread call. I misunderstood the first time. I thought you were saying the call stack pops off a function return before the function returns.
 

DaveSimmons

Elite Member
Aug 12, 2001
40,730
670
126
for(int i = 0; i < 3; ++i)
    createthread(func, i);

will cause a race condition because the main thread could update i before the new thread is created. There are ways around this, but just be aware that it is an issue.
Unless I'm missing something, this example is thread-safe, because the new thread doesn't exist until after i has been -copied- to make the function call that creates it, and we know the copy always happens before i is incremented, regardless of any context switches to other threads.

Your example would make sense if you passed &i, not i; then the new thread could potentially see i before or after one or more loop increments.
 

Venix

Golden Member
Aug 22, 2002
1,084
3
81
Originally posted by: Cogman
Well, as far as optimizations go, there are a couple of things that I noticed that you could improve on to see some speed ups.

...

This is incorrect. There is no performance benefit to taking a pointer to the beginning of a vector, nor is there any performance impact in using operator[]. Unless you're using debug builds or checked iterators, every compiler available will generate identical code for rawbob[i] and bob[i].

Similarly, every compiler I've ever used--MSVC, GCC, Comeau--will call size() once before the loop and cache the result, even if you put the call in the loop condition. Most "optimizations" like this just muddle the code; it's best to identify problem areas and only then attempt to manually optimize.


As for real optimizations, using a vector of vectors is rarely a good idea. Any time the outer vector re-allocates, all the inner vectors will be copied. Since you seem to always know the size of the result vector, one mitigating technique is to call reserve(size) before the loop that fills the vector. If you do this, push_back will be guaranteed to never cause a reallocation.

A better solution would be to not use vectors of vectors. A list of vectors might be a possibility, although a real multi-dimensional array library would probably be best.
 

EagleKeeper

Discussion Club Moderator<br>Elite Member
Staff member
Oct 30, 2000
42,589
5
0
Originally posted by: Venix
Originally posted by: Cogman
Well, as far as optimizations go, there are a couple of things that I noticed that you could improve on to see some speed ups.

...

This is incorrect. There is no performance benefit to taking a pointer to the beginning of a vector, nor is there any performance impact in using operator[]. Unless you're using debug builds or checked iterators, every compiler available will generate identical code for rawbob[i] and bob[i].

Similarly, every compiler I've ever used--MSVC, GCC, Comeau--will call size() once before the loop and cache the result, even if you put the call in the loop condition. Most "optimizations" like this just muddle the code; it's best to identify problem areas and only then attempt to manually optimize.


As for real optimizations, using a vector of vectors is rarely a good idea. Any time the outer vector re-allocates, all the inner vectors will be copied. Since you seem to always know the size of the result vector, one mitigating technique is to call reserve(size) before the loop that fills the vector. If you do this, push_back will be guaranteed to never cause a reallocation.

A better solution would be to not use vectors of vectors. A list of vectors might be a possibility, although a real multi-dimensional array library would probably be best.

The bolded area (caching size() before the loop) can cause a serious problem if the size of the array is modified within the loop. If the array shrinks, you end up accessing "foreign" memory. If the array grows, you may never reach the true end of the array.