What could cause TCP checksum failures?

spidey07

No Lifer
Aug 4, 2000
65,469
5
76
I don't think it's TCP offloading because they're just PC network cards. On top of that, the checksum errors really do tear down the TCP sessions: the host sends a FIN.

client <-> host

The host starts a TCP session to the client. The 3-way handshake looks good, then data is sent to the client. The ACKs from the client then have TCP checksum errors; the host retransmits, only to be met with more ACKs from the client with checksum problems. The host finally sends a FIN to the client.

Second problem is the session will run just fine for hours and then all of a sudden stop responding to the host. The trace was run on the client machine.

Any ideas? I'm thinking maybe a driver/stack problem and told level 3 to focus on that.

Oh, and the trace shows that it will work good/clean/no checksum errors for hours.
 

cmetz

Platinum Member
Nov 13, 2001
2,296
0
0
spidey07, traces on the host and any checksum or TCP segment offload don't mix. Put a tap on the wire and trace from there.

What NIC chipset?

Have you tried disabling checksum / TSO and repeating? It should be easy to do and will let you test that variable right away.
 

spidey07

Originally posted by: cmetz
spidey07, traces on the host and any checksum or TCP segment offload don't mix. Put a tap on the wire and trace from there.

What NIC chipset?

Have you tried disabling checksum / TSO and repeating? It should be easy to do and will let you test that variable right away.

Lines of responsibility are blurred with this one; we don't have direct access to the client systems. Really no idea what PC hardware is being used. Think two client machines that the host (mainframe and its support systems/servers) sends messages/updates to over a WAN. The trace was run on the client PC. I have traces of the host, core and client...just haven't looked at them too deeply.

One question about the TCP offloading - why would TCP work just fine, then the client stops responding? How is this related to the TCP checksum errors?

There are two symptoms here:

1) TCP breaks down because of checksum errors, yet layer2 CRC is good
2) Host retransmits datagrams because it doesn't receive ACK from client, client does NOT send a response...remember this particular trace is on the client. Host finally says "FIN, I'm done with you". That's the key point - trace was performed on the client - client receives retransmissions and yet NEVER sends a reply.

Both of these symptoms seem to occur at random because it is a polling/push type application. Symptom 1 and symptom 2 occur at different times (hours apart, different TCP connections, different sessions).

Oh - and dagnabbit, call me spidey. ;)
Thanks in advance for any help.
 

cmetz

Without seeing a trace *from the wire* it's hard to tell what's really going on. (remember: either end's view can't be totally trusted)

The TCP checksum offload in the network stack or the offload engine in a modern NIC is totally separate from the CRC computation+insert logic in the NIC chip. What (1) tells you is that the data isn't getting corrupted on the wire, it's somewhere in the host stack or NIC.
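For anyone who wants to check a suspect segment by hand, here is a minimal sketch of the TCP checksum over the IPv4 pseudo-header, which is a completely separate calculation from the Ethernet CRC. The function names are made up for illustration, and it assumes you have already pulled the raw TCP segment bytes (header plus payload) and both IP addresses out of the capture.

```python
import socket
import struct


def ones_complement_sum(data: bytes) -> int:
    """16-bit one's-complement sum with end-around carry."""
    if len(data) % 2:
        data += b"\x00"  # pad odd-length data with a zero byte
    total = 0
    for (word,) in struct.iter_unpack("!H", data):
        total += word
    while total >> 16:  # fold the carries back into the low 16 bits
        total = (total & 0xFFFF) + (total >> 16)
    return total


def pseudo_header(src_ip: str, dst_ip: str, tcp_len: int) -> bytes:
    """IPv4 pseudo-header: src, dst, zero byte, protocol 6 (TCP), TCP length."""
    return struct.pack(
        "!4s4sBBH",
        socket.inet_aton(src_ip),
        socket.inet_aton(dst_ip),
        0,
        6,
        tcp_len,
    )


def tcp_checksum(src_ip: str, dst_ip: str, segment: bytes) -> int:
    """Checksum to store in the header; the checksum field must be zero on input."""
    data = pseudo_header(src_ip, dst_ip, len(segment)) + segment
    return ~ones_complement_sum(data) & 0xFFFF


def tcp_checksum_ok(src_ip: str, dst_ip: str, segment: bytes) -> bool:
    """Verify a received segment with its checksum field left in place."""
    data = pseudo_header(src_ip, dst_ip, len(segment)) + segment
    return ones_complement_sum(data) == 0xFFFF
```

Keep in mind that with checksum offload enabled, a capture taken on the sending machine usually sees the segment before the NIC has filled in the checksum, so a check like this can fail there even though the packet on the wire is fine.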

(2) could be caused by many things. For example, if the windows got out of whack somehow (e.g. internal corruption) and the server keeps retransmitting and the client keeps dropping those segments for being out of window, you could get the behavior you describe. Check your trace for where the pointers are and what the current window size is.
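To make the out-of-window idea concrete, here is a rough sketch of the receive-side acceptability test as described in RFC 793, using 32-bit wraparound arithmetic. The function and parameter names are hypothetical; you would plug in the sequence number and payload length of a retransmitted segment, plus the receiver's next expected sequence number and advertised window as read from the trace.

```python
SEQ_MOD = 1 << 32  # TCP sequence numbers wrap at 2**32


def in_window(seq: int, rcv_nxt: int, rcv_wnd: int) -> bool:
    """True if seq lies in [rcv_nxt, rcv_nxt + rcv_wnd), modulo 2**32."""
    return (seq - rcv_nxt) % SEQ_MOD < rcv_wnd


def segment_acceptable(seg_seq: int, seg_len: int, rcv_nxt: int, rcv_wnd: int) -> bool:
    """RFC 793 acceptability test for an incoming segment."""
    if seg_len == 0:
        if rcv_wnd == 0:
            # Zero-length segment, zero window: must sit exactly at rcv_nxt.
            return seg_seq % SEQ_MOD == rcv_nxt % SEQ_MOD
        return in_window(seg_seq, rcv_nxt, rcv_wnd)
    if rcv_wnd == 0:
        # Data can never be accepted into a closed window.
        return False
    last = (seg_seq + seg_len - 1) % SEQ_MOD
    # Acceptable if either the first or the last byte falls in the window.
    return in_window(seg_seq, rcv_nxt, rcv_wnd) or in_window(last, rcv_nxt, rcv_wnd)
```

For example, `segment_acceptable(500, 100, 1000, 8192)` is False because the whole segment sits below the window, which is the kind of mismatch to look for if the two ends disagree about where the pointers are.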

Is this a real operating system? Can you check the TCP stack's counters? Events like bad checksum, retransmits seen, and out of window should be counted somewhere.
 

spidey07

Thanks cmetz, OS is Windows XP.

I keep pointing fingers at the client OS stack/drivers. The trace shows only 3 layer 2 devices: client 1, client 2 and a Cisco router. My guess is they have a hub hanging off the Cisco router to provide connectivity.

Seeing the "wire" would yield better results but I'm sure you understand that we don't have access to that. Might have to roll a tech to get a "true" trace.

-edit- I've already instructed a replacement of the NIC/driver/cabling and a stack repair....we'll see how that goes.
 

nweaver

Diamond Member
Jan 21, 2001
6,813
1
0
I have had issues in the past where running TCP checksum offloading would cause problems (only on the server with a gig card, an Intel server adapter). I just turned it off on the server side and the problem went away. It was a known issue within the Altiris community...not sure if it was just them or if they were simply the ones who saw it the most?
 

spidey07

update - after a conference call with 16 people involving Cisco product management and Microsoft, it turns out these jackholes had two network cards on the host.

grrrrrrrr. don't do that.