Shared memory parallelisation with independent processes but shared output data in C++

Carlis

Senior member
May 19, 2006
Hi all,


I am developing a simulation tool for a physics problem that I am working on (using C++). Essentially, the computation consists of manipulating pointer structures that correspond to different physical processes. The output is “events” in time and space, which need to be stored in a tensor (array) with four dimensions.


Normally I would parallelise this using MPI and keep one tensor for every process, thus storing N tensors. However, I now need to be able to store much larger tensors (to obtain adequate accuracy), so ideally I would like to use a shared memory approach and store only a single tensor.


But the only technique I am familiar with is OpenMP, which works fine for manipulating arrays. The question is: what is a good approach to shared memory parallelisation when one has independent processes that produce data, yet a common data structure to store the output?


I am not very well versed in parallel programming techniques, so I would really appreciate your input.


Thanks

//

Johan
 

DaveSimmons

Elite Member
Aug 12, 2001
https://msdn.microsoft.com/en-us/library/windows/desktop/aa365574(v=vs.85).aspx

Do the workers need read access to the result or just to send (x,y,z,t, value) new cells to add to it?

For write-only updates, WM_COPYDATA looks like it will work: one "result" or "control" process receives updates from the worker processes.
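
A minimal sketch of the worker side, assuming the control process owns a window with a made-up class name ("ResultWindowClass") and an app-defined Cell layout; none of these names come from the thread:

```cpp
#include <windows.h>

// Hypothetical cell layout: one (x, y, z, t, value) event.
struct Cell { double x, y, z, t, value; };

// Worker side: push one cell to the control process via WM_COPYDATA.
// "ResultWindowClass" is a made-up window class owned by the control process.
bool send_cell(const Cell& c) {
    HWND target = FindWindowA("ResultWindowClass", nullptr);
    if (!target) return false;
    COPYDATASTRUCT cds;
    cds.dwData = 1;                      // app-defined message type
    cds.cbData = sizeof(Cell);
    cds.lpData = const_cast<Cell*>(&c);
    // SendMessage blocks until the receiver has processed the data,
    // so the Cell can safely live on this worker's stack.
    return SendMessageA(target, WM_COPYDATA, 0,
                        reinterpret_cast<LPARAM>(&cds)) != 0;
}
```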

For read-write access, File Mapping is probably the easiest. But as the linked page mentions, you need to come up with your own locking / synchronization scheme if multiple workers might read and write the same cells. PostMessage can be used to send messages from the workers to the control process.

For file mapping, remember that the address of the shared buffer is NOT the same across processes! Also, only the mapped buffer itself is shared, so if you store a pointer in it from process A, that pointer is garbage to process B.

For memory within the shared buffer, you can work around this by storing offsets instead of pointers:

Process A has its buffer mapped at $080000 and wants to store a pointer to $080100; instead it stores the offset (address - base) = $000100.

Process B has its buffer mapped at $065000 and wants to use that not-a-pointer, so it adds the base back: (offset + base) = $065100.
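
A small sketch of that offset trick, with a hypothetical Node type and a local array standing in for the mapping (in a real program the base pointer would come from MapViewOfFile on Windows or mmap on POSIX):

```cpp
#include <cstddef>
#include <cstdio>
#include <new>

struct Node {
    double value;
    std::ptrdiff_t next_offset;   // offset of the next Node, or -1 for "null"
};

// Convert a stored offset to a usable pointer, relative to THIS process's base.
inline Node* from_offset(char* base, std::ptrdiff_t off) {
    return off < 0 ? nullptr : reinterpret_cast<Node*>(base + off);
}

// Convert a pointer back to an offset before storing it in the shared buffer.
inline std::ptrdiff_t to_offset(char* base, const Node* p) {
    return p ? reinterpret_cast<const char*>(p) - base : -1;
}

int main() {
    // Stand-in for the shared region; in real code this pointer comes from
    // mmap() (POSIX) or MapViewOfFile() (Windows) and DIFFERS per process.
    alignas(Node) static char region[1024];

    Node* a = new (region) Node{1.0, -1};
    Node* b = new (region + sizeof(Node)) Node{2.0, -1};
    a->next_offset = to_offset(region, b);   // store an offset, not a pointer

    // Any process mapping the same region can walk the chain via offsets,
    // regardless of where the mapping landed in its address space.
    for (Node* n = a; n; n = from_offset(region, n->next_offset))
        std::printf("%g\n", n->value);
}
```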
 

exdeath

Lifer
Jan 29, 2004
Dunno if relevant, but double buffering your data state with public and private copies makes for very fast parallel processing at the expense of memory (which is cheap and abundant). Every object that depends on other objects reads their current state from the public-facing copy, which stays static for the processing interval, while updating its own private copy; the two are swapped at the end. With the public data guaranteed consistent for the frame / interval, you can charge ahead with thousands of worker threads with no worries about synchronization or partial updates (e.g. reading data from another object while parts of it are being modified, so that it contains both old and new data simultaneously).

A double-buffered SoA (structure of arrays) pattern works very well for highly scalable, arbitration-less multithreading.
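
Something like this, as a rough sketch of the double-buffered SoA idea with made-up particle data (the field names and the step body are illustrative, not from any particular engine):

```cpp
#include <cstddef>
#include <utility>
#include <vector>

// Hypothetical particle state in SoA form: one array per field.
struct ParticlesSoA {
    std::vector<double> x, v;
};

struct DoubleBuffered {
    ParticlesSoA pub;   // read-only during a step: consistent for everyone
    ParticlesSoA priv;  // write-only during a step

    // Each worker gets a disjoint [begin, end) range, so writes never clash,
    // and all reads hit the stable public copy from the previous interval.
    void step(double dt, std::size_t begin, std::size_t end) {
        for (std::size_t i = begin; i < end; ++i) {
            priv.v[i] = pub.v[i];                 // forces would go here
            priv.x[i] = pub.x[i] + pub.v[i] * dt;
        }
    }

    void swap_buffers() { std::swap(pub, priv); } // after all workers finish
};

int main() {
    DoubleBuffered d;
    d.pub.x = {0.0, 1.0};
    d.pub.v = {1.0, 1.0};
    d.priv = d.pub;        // size the private buffers once
    d.step(0.1, 0, 2);     // in real use, each worker handles its own range
    d.swap_buffers();
}
```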

For your shared output stream, I would precommit private blocks to each process, again to avoid synchronization pitfalls as much as possible. Synchronization spin-locks are counterproductive to multithreading; you want to privatize the output segments to avoid them.
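
One way to precommit blocks is a chunked bump allocator: each worker grabs a whole private block of the shared output with a single atomic, then fills it with no further synchronization. A sketch, with a hypothetical Event layout and capacity checks omitted:

```cpp
#include <atomic>
#include <cstddef>

// Hypothetical output record: one (x, y, z, t, value) event.
struct Event { float x, y, z, t, value; };

constexpr std::size_t kBlock = 4096;    // events per private block

struct SharedOutput {
    Event* buf;                         // would live in the shared mapping
    std::atomic<std::size_t> next;      // first unreserved slot

    explicit SharedOutput(Event* b) : buf(b), next(0) {}

    // One cheap atomic per block instead of a lock per event; the atomic
    // must be lock-free for cross-process use.
    std::size_t reserve_block() { return next.fetch_add(kBlock); }
};

struct Writer {
    SharedOutput* out;
    std::size_t pos = 0, end = 0;

    explicit Writer(SharedOutput* o) : out(o) {}

    void push(const Event& e) {
        if (pos == end) {               // current private block exhausted
            pos = out->reserve_block();
            end = pos + kBlock;
        }
        out->buf[pos++] = e;            // private slot: no contention
    }
};

int main() {
    static Event storage[1 << 16];      // capacity checks omitted for brevity
    SharedOutput out(storage);
    Writer w(&out);
    w.push({1.0f, 2.0f, 3.0f, 4.0f, 0.5f});
}
```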

Also, for optimal performance, create one real thread per logical core at init and have them sit in an idle loop, self-feeding from a microthread/job queue in user land with no sys calls or context switches (this does require a lock on the queue, but one that is brief and non-blocking for the most part). You want to avoid actual OS thread creation/destruction at run time, as it's very expensive.
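
A bare-bones sketch of such a pool in standard C++ (using a condition variable to park idle workers rather than the pure spin loop described above; all names are made up):

```cpp
#include <condition_variable>
#include <functional>
#include <mutex>
#include <queue>
#include <thread>
#include <vector>

// Minimal persistent pool: N threads created once at init, then fed from a
// queue. The queue lock is held only long enough to pop one job.
class JobQueue {
public:
    explicit JobQueue(unsigned n = std::thread::hardware_concurrency()) {
        for (unsigned i = 0; i < n; ++i)
            workers_.emplace_back([this] { run(); });
    }
    ~JobQueue() {
        { std::lock_guard<std::mutex> lk(m_); done_ = true; }
        cv_.notify_all();
        for (auto& t : workers_) t.join();
    }
    void submit(std::function<void()> job) {
        { std::lock_guard<std::mutex> lk(m_); jobs_.push(std::move(job)); }
        cv_.notify_one();
    }
private:
    void run() {
        for (;;) {
            std::function<void()> job;
            {
                std::unique_lock<std::mutex> lk(m_);
                cv_.wait(lk, [this] { return done_ || !jobs_.empty(); });
                if (done_ && jobs_.empty()) return;
                job = std::move(jobs_.front());
                jobs_.pop();
            }
            job();   // run the work outside the lock
        }
    }
    std::mutex m_;
    std::condition_variable cv_;
    std::queue<std::function<void()>> jobs_;
    std::vector<std::thread> workers_;
    bool done_ = false;
};

int main() {
    JobQueue pool;                       // one thread per logical core
    for (int i = 0; i < 1000; ++i)
        pool.submit([] { /* simulate one object */ });
}   // destructor drains the queue and joins the workers
```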

Though the last time I looked at parallel processing seriously was when Cell first came out, so I don't know if things are better in modern Windows concerning microthreads/jobs, etc. I just rolled my own microthread engine to do everything in user mode without relying on the OS at all once the initial independent system threads were created (one per logical core at init). Then you take a scatter-gather approach: a linear, sequentially dependent primary code path (I focused on game engines) where each step has to run in order, but at each step there are thousands of parallel objects that can be dumped to the job queue, with the main path sleeping until the job queue hits zero before proceeding. I was able to scale almost linearly with core count this way.
 

Carlis

Senior member
May 19, 2006
Hi, thanks for the replies...
So I only need to write to the array, not read from it.
It appears to me that WM_COPYDATA is a Microsoft/Windows thing? I will be using a Linux cluster for this...

And no, I don't want to store multiple instances of the data. Memory may be cheap, but for this application I will gain a great deal of accuracy by having a fine mesh for my problem...

Perhaps I should use MPI and send all the data to a master process that stores it? On a shared memory machine, the bandwidth should be pretty good, right?

Best
//
Johan
 

DaveSimmons

Elite Member
Aug 12, 2001
Yes, if you only want one copy of the array and the workers do not need to read from it, then having a master / controller process that builds it from result messages sent by worker processes or threads is a reasonable solution.

If you run into bandwidth issues, you could optimize by having the workers cache results and send them in blocks of N, so a message might look like: (count = N) (cell 1) (cell 2) ... (cell N), where each cell is (x, y, z, t, value).
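
A sketch of that batching scheme using MPI, since you mention a Linux cluster; the tag, batch size, and Cell layout here are arbitrary, and the calls would sit between MPI_Init and MPI_Finalize:

```cpp
#include <mpi.h>
#include <cstddef>
#include <vector>

// One output cell: (x, y, z, t, value), sent as a flat run of 5 doubles.
struct Cell { double x, y, z, t, value; };

const int kTag = 42;                 // arbitrary message tag
const std::size_t kBatch = 1024;     // cells cached before each send

// Worker side: cache results locally and ship them in blocks of kBatch.
void push_cell(std::vector<Cell>& cache, const Cell& c) {
    cache.push_back(c);
    if (cache.size() == kBatch) {
        MPI_Send(cache.data(), static_cast<int>(cache.size() * 5),
                 MPI_DOUBLE, /*master rank*/ 0, kTag, MPI_COMM_WORLD);
        cache.clear();
    }
}

// Master side: receive one block and scatter its cells into the big tensor.
void receive_block(std::vector<Cell>& block) {
    MPI_Status st;
    MPI_Probe(MPI_ANY_SOURCE, kTag, MPI_COMM_WORLD, &st);
    int n_doubles = 0;
    MPI_Get_count(&st, MPI_DOUBLE, &n_doubles);
    block.resize(n_doubles / 5);
    MPI_Recv(block.data(), n_doubles, MPI_DOUBLE,
             st.MPI_SOURCE, kTag, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
}
```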