MPI -- many calls to the same process and stability

Carlis

Senior member
May 19, 2006
237
0
76
Hi
I am writing an application that will use MPI and a large number of processes-- up to 4096 or so.
At a certain stage all processes except rank=0 should send some result to process zero. Is there an issue with stability if thousands of processes try to send messages more or less at the same time to a single process?

I am not concerned with efficiency-- The communication is a negligible part of the computational effort.
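
Roughly, what I have in mind looks like this (a simplified sketch; the single-double payload is just a stand-in for the real results):

```c
#include <mpi.h>

int main(int argc, char **argv)
{
    int rank, size;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    double result = (double)rank;          /* stand-in for the real computation */

    if (rank != 0) {
        /* every worker sends its result to rank 0 */
        MPI_Send(&result, 1, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD);
    } else {
        /* rank 0 receives one message from each of the (size - 1) workers */
        for (int i = 1; i < size; i++) {
            double value;
            MPI_Recv(&value, 1, MPI_DOUBLE, MPI_ANY_SOURCE, 0,
                     MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            /* ... store or accumulate value ... */
        }
    }

    MPI_Finalize();
    return 0;
}
```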

Best
//
Johan
 

Cogman

Lifer
Sep 19, 2000
10,284
138
106
If your messages are large, there may be problems with overflowing memory. Otherwise, not much of a problem here.

Are these processes distributed, or are they all on one box? If they are all on one box, you may be spending a lot of time thrashing the CPU scheduler and doing excessive context switches. You may be better off using a green-thread solution (such as goroutines in a language like Go) instead of having a bunch of processes; you'll save the performance lost to context switching.

Also, what is the consequence of message loss? Does it matter or not? The drawback of MPI is that it is non-durable, so if these messages are really important you might look into using something with a little more heft and durability. If failure is only moderately inconvenient, you might be able to get by with Redis. On the other hand, if it is mission critical that these messages are processed, then Kafka or an AMQP system like RabbitMQ might be in the cards.
 

Gryz

Golden Member
Aug 28, 2010
1,551
204
106
Yes, there will be issues.

This is my favorite RFC. By far.
https://tools.ietf.org/html/rfc1925

It's meant to be funny (hence it was released on April 1st).
But there is a lot of truth in it.
However, what I wanted to quote is not in there. :D
Sorry.


Cisco (a company that understands networking) once bought a mathematical routing algorithm, called the DUAL algorithm.
They used it as the basis for a new routing protocol, called EIGRP.
That new protocol, EIGRP, was implemented by two good programmers. (Well, at least one very good programmer; he later laid the foundations of many multicast-routing and LISP-related things.) However, they wrote a pretty naive implementation of the algorithm. In particular, they didn't take the real world into account.

In the DUAL paper, it is assumed that all messages between routers are delivered instantaneously, and that those messages are never lost. Basically, for DUAL to work, three things had to hold:
1) Round-trip time is 0 microseconds
2) Bandwidth is infinite
3) There is never any packet loss

That's nice in a mathematical paper. But those conditions are never (NEVER!!) met in the real world.
As soon as EIGRP was run in a small to medium-size network and there was some "churn", the protocol would implode. Routers hit 100% CPU load, the network was flooded with messages, input and output queues filled up everywhere, routes were flapping, interfaces were flapping, etc. It just didn't work. (BTW, this is a real story. It happened in 1995.)

Another guy had to come in, who had a lot more experience with real-world networking (and IGPs). And he fixed EIGRP. He made it behave the way it should have: take into account that packets can get lost or dropped, don't burst your data but rate-limit your packets, don't assume all packets are delivered immediately. Half a year later EIGRP turned out to be a great protocol (especially in certain types of networks). But it never lost that smell of its first half year of existence.

So I got some suggestions for you.
See the next post.
 

Gryz

Golden Member
Aug 28, 2010
1,551
204
106
First suggestion: use TCP.

TCP gives you reliable delivery. It gives you rate-limiting, and it tries to use the maximum bandwidth it can. It works in high-speed environments and in low-speed environments. It might seem like it's not always 100% efficient (because of TCP header overhead), but forget about that. If you're gonna write your own protocol, e.g. based on UDP, you'll be wasting a lot of time: you'll be trying to solve all the problems that TCP has already solved. It'll take you weeks or months to come up with something that works. In the end it'll look a lot like TCP, but it will work less well than TCP.


Second suggestion: make the receiver initiate the transfer

What you wrote kinda suggests that you have a lot (thousands) of programs working towards a partial solution. And once they are done, they will decide by themselves to send the result to the gatherer. Don't do that. Make the gatherer gather the information. That way the gatherer dictates the speed at which data comes in. And the gatherer can make sure it slows down if it gets overwhelmed.

E.g. when each process starts, let it set up a TCP connection to the gatherer and send a small message saying that it exists. Then it starts working on the problem. When it is done and has the results ready, it sends another small message to the gatherer. The gatherer waits for these notifications to come in; when it sees one, it sends a command to that node requesting it to start sending data. You can do this in parallel (I assume I don't need to teach you about listening sockets, the select system call, non-blocking I/O, etc.), but even when you gather data in parallel, the gatherer controls how many nodes are allowed to send at the same time. You might think this is a bit complicated to program, but I think it's not that bad. And if you really have a lot of nodes, and a lot of data, it will pay off in the end.
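
To make it concrete, here's a rough sketch of the node side in plain C with BSD sockets. The one-byte messages ('H' for hello, 'R' for ready, 'G' for go), the address and the port are all made up, and most error handling is left out; it just shows the shape of the thing, not a finished implementation:

```c
#include <arpa/inet.h>
#include <netinet/in.h>
#include <stdio.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

#define GATHERER_HOST "192.168.1.10"   /* made-up address of the gatherer */
#define GATHERER_PORT 5000             /* made-up port                    */

int main(void)
{
    int fd = socket(AF_INET, SOCK_STREAM, 0);
    if (fd < 0) { perror("socket"); return 1; }

    struct sockaddr_in addr;
    memset(&addr, 0, sizeof addr);
    addr.sin_family = AF_INET;
    addr.sin_port   = htons(GATHERER_PORT);
    inet_pton(AF_INET, GATHERER_HOST, &addr.sin_addr);

    if (connect(fd, (struct sockaddr *)&addr, sizeof addr) < 0) {
        perror("connect"); return 1;
    }

    write(fd, "H", 1);                   /* 1. announce that we exist          */

    /* 2. ... do the actual computation here ... */

    write(fd, "R", 1);                   /* 3. tell the gatherer we are ready  */

    char cmd;
    if (read(fd, &cmd, 1) == 1 && cmd == 'G') {    /* 4. wait for permission   */
        const char *data = "result bytes go here"; /* 5. only now send results */
        write(fd, data, strlen(data));
    }

    close(fd);
    return 0;
}
```

On the gatherer side you accept all these connections, remember which sockets have sent 'R', and hand out 'G' to only a handful of nodes at a time; when one node finishes, the next waiting node gets its 'G'. That is your flow control.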


Third suggestion: don't do a simple copy of all data over the connection

I'm not sure if you're gonna use scp or ssh to copy data, or if you are just gonna write your data over the TCP connection. The latter might be more efficient (no encryption overhead, which has a big impact on large data sets). How much data are we talking about? A megabyte per node? A gigabyte per node? More? What you don't want is all nodes starting to send data, then the shit hitting the fan and connections being reset, and then all nodes starting their transfers all over again from scratch.

Instead, make your own simple little protocol (over TCP, of course). You are already implementing a few messages in either direction in your custom protocol: one message so that a node can tell the gatherer it exists, one message for the gatherer to tell the node to start sending. While you're at it, add another little message: a header for your data, describing the offset and the length of the data in the file you are transferring. Say, for every 1 MB or 10 MB, you start that data with your little message telling the gatherer "here is data from 180 MB to 190 MB". That way, if the TCP connection gets reset, the gatherer knows how many of those blocks it has received fully and which block was in transmission but didn't finish. The gatherer can then request the node to resume sending halfway through its data set. First of all, this can prevent a lot of problems, imho. But the nice thing is that it almost guarantees continuous progress, and hopefully at some point all data will have made it across.
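
Something like this, roughly (the struct and the 10 MB block size are just examples; byte-order conversion and error handling are left out):

```c
#include <stdint.h>
#include <unistd.h>

#define BLOCK_SIZE (10u * 1024u * 1024u)   /* e.g. 10 MB per block */

/* Little header that goes in front of every block of payload. */
struct block_header {
    uint64_t offset;   /* where this block starts in the node's data set  */
    uint64_t length;   /* number of payload bytes that follow this header */
};

/* Send one block over an already-connected socket: header first, then data. */
static void send_block(int fd, const char *data, uint64_t offset, uint64_t length)
{
    struct block_header hdr = { offset, length };
    write(fd, &hdr, sizeof hdr);
    write(fd, data + offset, length);
}
```

The gatherer remembers the highest offset it has received completely, so after a connection reset it simply asks the node to resume from that offset instead of starting over.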


Fourth suggestion: prevent synchronization between your nodes

What you don't want is 4000 nodes being ready at exactly the same time and starting to send packets at exactly the same time. You also don't want the gatherer to send one broadcast or multicast message and have all 4000 nodes respond to it at the same time. That is a guarantee that messages will be lost.

A good protocol has random delays in it. Whenever a host or router decides that it needs to send a message, don't send it immediately. Instead, start a timer, e.g. for 10 milliseconds or 100 milliseconds, and jitter the exact time. Higher granularity of the timer helps. Now suppose your protocol lets the gatherer broadcast a message to the nodes saying: "tell me who is ready and wants to send data to me". If all nodes reply immediately, that causes packet drops. But if they all wait between 100 milliseconds and a full second, then the gatherer will receive 4000 messages spread over a 900 millisecond timeframe. That's about 4 or 5 messages per millisecond. Maybe still a bit much, but better than getting 4000 messages within a few milliseconds.
Of course, if you use TCP, some of these issues will be less relevant. But keep in mind, if you design or implement a protocol, try to smooth out all actions over time, and try to prevent synchronization.
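
In C the jitter part can be as simple as something like this (the 100-999 ms range is just the example from above):

```c
#include <stdlib.h>
#include <time.h>
#include <unistd.h>

/* Seed the generator once at startup with something different per node
 * (pid, rank, ...), so the nodes don't all pick the same delay. */
static void seed_jitter(void)
{
    srand((unsigned)(time(NULL) ^ getpid()));
}

/* Sleep a random 100..999 milliseconds before replying to the gatherer. */
static void jitter_delay(void)
{
    unsigned delay_ms = 100 + (unsigned)(rand() % 900);
    usleep(delay_ms * 1000);
}
```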



This should get you started.
I hope these ideas help. They are all old ideas that were implemented decades ago, but you can still learn from them.
If you have more questions, or ideas to discuss, let me know. I enjoy this kind of stuff. :)
Good luck.
 

Gryz

Golden Member
Aug 28, 2010
1,551
204
106
I kinda ignored the fact that you said "an application that will use MPI".
I now see that MPI is a framework for message passing in large clusters of computers.

I have no experience with MPI. I had never heard of it.
I have no idea what underlying protocols or algorithms it is using.
But I hope my remarks are still useful.

If you have to combine fast passing of many small messages with transfers of large data sets, then maybe you should use the best of both worlds: use MPI to send small messages around, but use TCP to transfer the large data sets. Anyway, good luck.
 

Gryz

Golden Member
Aug 28, 2010
1,551
204
106
Well Carlis (OP), even if you didn't want to answer to the replies in this thread, or tell us what you're gonna do (just curious), the least you could do is acknowledge that you've read our posts. :/
 

Carlis

Senior member
May 19, 2006
237
0
76
Well Carlis (OP), even if you didn't want to answer to the replies in this thread, or tell us what you're gonna do (just curious), the least you could do is acknowledge that you've read our posts. :/

Hi all

Sorry for being a prick. Somehow I managed to forget about this post that I made, and I did not see the replies until now. I have had a bit too much work to do lately. Shame on me.

So, long story short: I managed to solve this problem by changing the structure of the communication. It somehow seems like a bad idea to have thousands of processes crowding one process with data, so I put everything in arrays and used MPI_Reduce instead. It uses more memory, but it is at least stable and predictable.
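
Roughly, the new structure looks like this (a simplified sketch; the real arrays are much bigger, and the sum is just an example of the combining operation):

```c
#include <mpi.h>

#define N 1000   /* example array length; the real arrays are larger */

int main(int argc, char **argv)
{
    int rank;
    double local[N], total[N];

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* every process fills local[] with its own contribution */
    for (int i = 0; i < N; i++)
        local[i] = (double)(rank + i);   /* stand-in for the real results */

    /* element-wise sum of all local arrays ends up in total[] on rank 0 */
    MPI_Reduce(local, total, N, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);

    MPI_Finalize();
    return 0;
}
```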

The software is intended for some statistical mechanics applications.

Best
//
Carlis
 

Gryz

Golden Member
Aug 28, 2010
1,551
204
106
Thanks for the update, Carlis. Much appreciated.

So if I understand correctly, having 4k processes sending messages to 1 process, without any form of flow control, was indeed a problem? Cool. I predicted that correctly then. :) Sometimes I think I am too much of a pessimist. But maybe not.

I don't know enough about massive parallel computing, or the MPI library to say anything about your solution. But I'm happy you solved the problem. Have a nice day.
 

Carlis

Senior member
May 19, 2006
237
0
76
Actually, I did not test that approach. MPI is very well designed and incredibly stable (though I know nothing about how it actually works), and I would not be surprised if it did work. However, it struck me as a bad idea to implement it that way: who's to say that what works for 4k processes would still work for 8k? So it seemed like bad practice.

If you are interested in parallel programming, then MPI is great: open source, easy to use, stable, predictable, with good performance (much more so than OpenMP), and usable on clusters where you do not have shared memory.

Anyway, thanks for the input!