TCP packet encapsulation

jaydee

Diamond Member
May 6, 2000
I have a program (written in C#) running on Win7 Pro that sends up to a couple hundred messages, 56 bytes long each, over the network (through a switch, via TCP) to a different node as fast as it can. When I trace the messages in Wireshark (from the sending PC), it always sends the first 52 messages in individual packets of 56 bytes. But after that, it groups messages oddly, and I'm trying to figure out why. I'm having issues getting the correct responses, and it may have to do with the way the node is receiving the messages.

For instance, on the last run I tried, I attempted to send 200 messages of 56 bytes of data each. Using Wireshark, I can see that the first 52 messages were sent in individual packets. The next 5 packets had data lengths of 1460, 1012, 1460, 1228, and 2792 bytes (none of which is individually divisible by 56, but collectively they are). The last 6 packets all had data lengths of 56 bytes. If I run this several times, it's repeatable to the extent that the first 52 messages are always sent in their own individual packets; it's unpredictable after that.

The messages in the long packets are the ones I'm not getting correct responses to from the end node.

Can someone explain this behavior to me, and perhaps point me in the direction of how to send all the messages individually in their own packets? I did not write the C# program myself, but I do have access to the code and the guy who wrote it.

Thanks!
 

alkemyst

No Lifer
Feb 13, 2001
It's TCP windowing. Your packets can go out in a different order, with sequence numbers. The sequence numbers allow the receiving device to put them back in the right order.
 

alkemyst

No Lifer
Feb 13, 2001
It's normal for TCP traffic to do this. Otherwise one transmission would hog up everything else.
 

jaydee

Diamond Member
May 6, 2000
It's normal for TCP traffic to do this. Otherwise one transmission would hog up everything else.

Can you be a little more specific? It's normal for 50-some messages to transmit individually, then cram the next 100-some messages into 5 packets, then send the last 6 messages individually? I'm not doubting you, but I'm not following the logic.
 

alkemyst

No Lifer
Feb 13, 2001
Can you be a little more specific? It's normal for 50-some messages to transmit individually, then cram the next 100-some messages into 5 packets, then send the last 6 messages individually? I'm not doubting you, but I'm not following the logic.

TCP/IP can be forgiving to new connections. Once you wear out your welcome, it will start breaking things up.
 

imagoon

Diamond Member
Feb 19, 2003
Have you tried opening the socket with TCP_NODELAY? It sounds a bit like Nagle is kicking in. How are you acking this data? Are there delayed acks involved? TCP/IP engines on the major OSes are pretty complex now and do a lot of manipulation in the name of performance. I don't know the exact triggers in the stack you are working with, but after a certain number of packets, mechanisms like Nagle's algorithm will start to put the messages together and send them as one large packet to reduce overhead. Add to that, most stacks use sliding windows and variable packet sizes. So what may be happening is: the stream starts, you trigger Nagle, large packets are built and sent. The socket stops sending, a delayed ack gets generated, and Nagle releases the last leftover messages. Is there about a 200ms delay between the last large packet and the last 6?

Use TCP_NODELAY or UDP to get around this. Your utilization efficiency will suffer with TCP_NODELAY, but that may be acceptable. With UDP you need to do your own packet ordering and lost-packet management in the application.
 

drebo

Diamond Member
Feb 24, 2006
Use UDP and have your application handle buffering/ordering manually. What you're observing is TCP windowing and it's by design. It sounds like your application probably would be better served by UDP anyway.
 

Gryz

Golden Member
Aug 28, 2010
The real question here is: what is your problem?

TCP is working as designed. It's a transport service that gives you a reliable, ordered byte-stream from one application on one machine to another application on another machine. The bytes you put in on one side will come out on the other side, in the same order, reliably, without loss, without duplication. How TCP transmits those bytes is none of our concern.

TCP has been improved over the decades. Windowing, slow-start, maximum segment size, elephant optimizations, nagle, selective acks, etc. There are lots of little tricks that have been invented over the years to make TCP work better. Not all of those are intuitive.

Bottom line is that TCP transports bytes. If you want to transport packets, and maintain the original boundaries of your blobs of data, then TCP is not the right tool. You'll have to use UDP. And UDP does not have reliability. So you'll have to implement that yourself. Which is probably a lot more work than you realize. And once you're done with implementing all the reliability and speed features that TCP has, your own homegrown protocol will probably look a lot like TCP.

Again, the question is: what is the problem?
 

jaydee

Diamond Member
May 6, 2000
Thank you all for the responses. I'm an EE by schooling and a test engineer by employment, not a network engineer, so all this info is very helpful to me.

The real question here is: what is your problem?

TCP is working as designed. It's a transport service that gives you a reliable, ordered byte-stream from one application on one machine to another application on another machine. The bytes you put in on one side will come out on the other side, in the same order, reliably, without loss, without duplication. How TCP transmits those bytes is none of our concern.
.
.

I am simulating the 56-byte messages in the lab from one single PC; that is the lab tool I have available to me at the moment. In the field, the 56-byte messages will be coming from hundreds of different devices, so they won't be bunched together within a single packet. So I am trying to send these messages in their own individual packets to better simulate a field scenario. I am testing the performance of the middleware of the end-node, so I am trying to slam these hundreds of messages at it as fast as the network will allow me.
 

jaydee

Diamond Member
May 6, 2000
Have you tried to open the socket with TCP_NODELAY? It sounds a bit like Nagle is kicking in.
.
.

The Nagle algorithm is already turned off; there is a toggle for it in the program interface.

The first 52 messages, which went individually in their own 56-byte packets, took an average of 0.064ms per message.

The next 142 messages, which came condensed in 5 packets, took an average of 0.116ms per message.

The last 6 messages, which went individually in their own 56-byte packets, took an average of 0.056ms per message.

I found a 0.09ms delay between the last large packet and the first of the last 6 packets.

UDP is not an option. The end-node is the test article, I have no control of what protocol it talks.
 

alkemyst

No Lifer
Feb 13, 2001
In your wrapper you will need to reassemble the transmission by the sequence numbers... though technically that should already be done for you by the transport service.
 

Gryz

Golden Member
Aug 28, 2010
I am trying to slam these 100's of messages as fast as the network will allow me.
Now I understand.

Let's make a distinction between "messages" and "packets".
Your application sends messages.
Those go into TCP. TCP doesn't do packets. It does a byte-stream. Under the hood, TCP uses the concept of "segments". But you have no influence over that.
The TCP segments are sent inside IP packets.
The IP packets are sent inside Ethernet frames.

Your application sends messages. It has no control over segments, packets, or frames.
So it's fine if your simulation also only deals with messages.
So what you want is to maximize the number of incoming messages at the receiving application.

If you want to maximize throughput in a network, one of the tricks is to minimize overhead. This can be done by bundling data into chunks as big as possible, so you'll have relatively fewer headers. Sending fewer, larger packets can also have a good impact on RTTs.

Whether you can improve this depends on your simulation software.
How do you send your messages?

Code:
char buffer[56];
while (not_done_yet()) {
   create_message(buffer);
   write(socket, buffer, 56);
}
Is that how you do it?
That means one context switch per message.
You could try this instead:

Code:
#define MESSAGE_SIZE 56
#define BEST_TCP_MSS 1460 /* Ethernet frame data size - TCP and IP headers. */
#define MESSAGES_IN_ONE_TCP_MSS (BEST_TCP_MSS / MESSAGE_SIZE)
#define BUFFER_SIZE (MESSAGES_IN_ONE_TCP_MSS * MESSAGE_SIZE)

char buffer[BUFFER_SIZE];
char *next_message;
int messages_in_buffer;

while (true) {
   next_message = buffer;
   messages_in_buffer = 0;

   while (not_done_yet() &&
            (messages_in_buffer < MESSAGES_IN_ONE_TCP_MSS)) {
      create_message(next_message);
      next_message += MESSAGE_SIZE;
      messages_in_buffer++;
   }

   write(socket, buffer, messages_in_buffer * MESSAGE_SIZE);
   if (we_are_done()) {
      return;
   }
}
I'm sure you would have figured this out. I wrote this just to make clear what I meant. And I enjoyed writing the pseudo-code. :)

Edit: Actually, there is no reason to create and write() only one Ethernet frame at a time. To speed things up you can write a lot more: 10 frames at a time, or a full TCP send window (64 Kbytes), or 128K at a time. That should push TCP to its max.

Note, TCP has slow start. No matter what you do, the first (dozens of) packets will not go full speed. During slow start, TCP tries to detect the bandwidth it has. After a while, TCP will send close to wirespeed. So for your simulation, during the first seconds you will never reach full speed. But after 10-20 seconds or so, you should see a good throughput. Which will stress your simulation. Hopefully. If you want full speed at the start then: 1) you'll need to use UDP, and 2) you'll need to inform your application exactly how much bandwidth it can/may use. I think that's not worth it. So you'll have to live with the limitations of TCP.
 

RadiclDreamer

Diamond Member
Aug 8, 2004
Now I understand.

Let's make a distinction between "messages" and "packets".
.
.

Very nice, one of the best-written posts I've seen on a forum in YEARS! Kudos to those like you who still go out of their way to write long but descriptive answers to complex questions.
 

jaydee

Diamond Member
May 6, 2000
Gryz, thank you very much for your detailed and well-thought-out response. So I don't make this post unnecessarily long, I will only quote a portion of it.

Edit: Actually, there is no reason to create and write() only 1 ethernet frame at a time. To speed up things you can actually write a lot more. Like 10 frames at a time. Or a full TCP send window (64 Kbytes). Or 128K at a time. That should push TCP to its max.

Note, TCP has slow start. No matter what you do, the first (dozens of) packets will not go full speed. During slow start, TCP tries to detect the bandwidth it has. After a while, TCP will send close to wirespeed. So for your simulation, during the first seconds you will never reach full speed. But after 10-20 seconds or so, you should see a good throughput. Which will stress your simulation. Hopefully. If you want full speed at the start then: 1) you'll need to use UDP, and 2) you'll need to inform your application exactly how much bandwidth it can/may use. I think that's not worth it. So you'll have to live with the limitations of TCP.

I sent the code on to the software engineer who wrote the application. I have a few questions in the meantime.

Are you saying that I have no chance of sending my hundreds of 56-byte messages in their own individual packets at the application level (unless I introduce a time buffer, which would defeat the point of sending them "as fast as I can"), because I cannot influence how TCP/IP handles it?

Also, I don't have the luxury of 10-20 seconds. With the performance I have now, I'm capable of sending 200 messages in about 20ms (confirmed via Wireshark). I have no need to ramp up to more than 1200 messages at one time, so I'm hoping that can take place in about 120ms.

Thank you again for such a thorough and well-thought-out response.
 

Gryz

Golden Member
Aug 28, 2010
Are you saying that I have no chance of sending my hundreds of 56-byte messages in their own individual packets at the application level
I don't understand this question.
Your messages are at the application level.
TCP does a byte-stream.
IP does packets.

Maybe your question is:
Can we make it so that when the application hands over a bunch of bytes (which is a message, in your application) to TCP, TCP will immediately send them to the other end, with no unnecessary delays at all?

No. You can't do that.
TCP tries to do a lot of stuff: send large files, do interactive stuff (like ssh/telnet), and be efficient. When you build a protocol like that, you have to make choices. And the result is that TCP might not be perfect for real-time applications. (RT applications are applications where input->output has a maximum time that needs to be met 100% of the time.)

Note, TCP does a pretty good job. Many games use TCP. Most MMOs do. And it's good enough for quick interactive gameplay. Not perfect, but good enough.

(unless I introduce a time buffer, which would defeat the point of sending them "as fast as I can"), because I can not influence how TCP/IP handles it?
Nope, you cannot influence how TCP handles them. TCP is a protocol; that means it (should) describe what happens on the wire. It does not tell you how to implement it in detail. Of course, many TCP implementations share the same source. But even if you know exactly how the TCP stack behaves in your OS, you cannot be sure how it behaves on any other machine or implementation.

If TCP is not good enough for your real-time needs, maybe you should use another protocol. I have no experience with those, unfortunately.
Maybe something like RTP is better ?
https://en.wikipedia.org/wiki/Real-time_Transport_Protocol

I was first thinking of SCTP.
https://en.wikipedia.org/wiki/Stream_Control_Transmission_Protocol
But that has congestion control too, so it might not be good enough for your real-time needs.

Also I don't have the luxury of 10-20 seconds.
I was just thinking that maybe you wanted to test the performance of the receiver under full load. During such a test, the load might not be 100% during the first seconds (while slow start happens). But some time after that, messages will come in at wire speed. It seems you are testing something else, though.

The performance I have now, I'm capable of sending 200 messages in about 20ms (confirmed, via wireshark). I have no need to ramp up more than 1200 messages at one time, so I'm hoping that can take place in about 120ms.
Are we talking about a test environment or a production environment? Do you have 1 sender and 1 receiver, or 1 receiver and multiple senders? Is the hardware fairly modern? Are they standard PCs, or embedded chips with their own OS and TCP stack? I don't need (or want) to know the answers to those questions. But when you are talking about performance, all those details are factors that can limit it.
 

jaydee

Diamond Member
May 6, 2000
.
.
Are we talking about a test environment ? Or production environment ? Do you have 1 sender and 1 receiver ? Or 1 receiver and multiple sender ? Is the hardware fairly modern, are they standard PCs, or embedded chips with their own OS and TCP stack ? I don't need (or want) to know the answer to those questions. But when you are talking about performance, all those details are factors that can limit performance.

Maybe this will clarify: this is a test environment, and I'm testing production hardware/software. I have no control over the network protocol of the test article itself (which I refer to here as the "end-node"). I am trying to determine the performance limitations of the end-node (which is a safety-critical device with proprietary hardware and software).

I am testing the response time of the end-node. When it goes into service, the end-node will be receiving many 56-byte messages from many different "field nodes" and responding to them individually. I have a simulation tool, written in C# on a Windows 7 PC, that simulates the field nodes. The messages from the field nodes will obviously come in their own TCP packets, because they will be coming from different devices and IPs.

So in my simulator (just a Windows PC), when I send hundreds of messages from the same IP and trace it with Wireshark, it looks like this (I abbreviated a little bit, but you get the picture):

1.62067 192.168.180.35 192.168.180.40 TCP [data Length=56 bytes]
1.620738 192.168.180.35 192.168.180.40 TCP [data Length=56 bytes]
1.620803 192.168.180.35 192.168.180.40 TCP [data Length=56 bytes] (52 consecutive packets look like the three above, where the data length is 56)
1.622674 192.168.180.35 192.168.180.40 TCP [data Length=1568 bytes] (after the first 52 packets, the rest of the messages get bunched together like this)

Ideally, I'd like to send the messages in a way that all 200+ of them look like the first 52, in their own 56-byte packets (like in the field), but it sounds like I can't do that with the tools I have.
 

alkemyst

No Lifer
Feb 13, 2001
Are you saying that I have no chance of sending my hundreds of 56-byte messages in their own individual packets at the application level (unless I introduce a time buffer, which would defeat the point of sending them "as fast as I can"), because I cannot influence how TCP/IP handles it?

That is correct.

One has to understand that the network stack is designed to handle everyone's traffic at the same time and fairly. Things like QoS can influence this and prioritize traffic, but it will not change how that traffic flows through layers 1 through 7 of the OSI model.

TCP/IP uses a windowing technique that sends a quick small burst of data immediately, but for longer streams it opens a small and then increasingly larger 'window' to the destination. This is a lot of the reason why, on a very large download off the internet, you will see something like 100Kbps to start, eventually increasing to 1.2Mbps or beyond.

What you need to do in your application is look at the assembled data and not the TCP/IP stream (or do your own sequencing and reading, which is possible).
 

alkemyst

No Lifer
Feb 13, 2001
Maybe this will clarify, this is a test environment, I'm testing production hardware/software.
.
.

This may help, http://www.wireshark.org/docs/wsug_html_chunked/ChapterWork.html

Also, if you can send your data to a specific port (like 192.168.180.35:8081 or something), that may make it easier to sniff.

You can write custom filters in Wireshark as well.
 

BrightCandle

Diamond Member
Mar 15, 2007
The 56 byte messages I am simulating in the lab from one single PC.
.
.

Are you trying to send these down a single connection from one machine? One of the problems is that you want to simulate hundreds of machines, so to do that you are going to need hundreds of connections. Then the packets won't get merged; they will have to be separate, as each comes from a different port on the source machine and is a separate connection to the target.

I think it's just how the test tool is written that is your issue here. I could be wrong, I am far from a network expert, but if I connect 100 times and send a 56-byte message on each connection and then close it, I would expect hundreds of individual packets; I don't see how the network could merge them.
 

alkemyst

No Lifer
Feb 13, 2001
Are you trying to send these down a single connection from one machine?
.
.

It's all about encapsulation and decapsulation.

For TCP it would assign something like SEQ 1092938485 to one and SEQ 1029384845 to the other, perhaps put those side by side via encapsulation going up the stack, and then decapsulate and reassemble them in proper order as they come back down the stack.