Switch or NIC problem?

cpals · May 16, 2011

Have a problem at work I'm troubleshooting and will most likely open a ticket tomorrow for it, but wanted to see what you guys thought first...

We have a switch with two servers on it and 10-15 desktops connected in the same vlan. It's a Cisco 3750G - 48 port.

I'm still testing some theories, but it appears that anytime we have a large amount of data go through server #1 (either a SQL backup or system backup) it causes some sort of flood of traffic on all the other ports. Looking at our monitoring software I see 16 million discards on a bunch of the ports and on our server #2 ports I see over 20 million.

All of the desktops mainly talk to server #2, which has two active NICs set at 10/half... requirements by vendor. So anytime all of these discards occur it seems to knock off all of the clients.

What could be causing surges of traffic/discards on all other ports? I did a couple packet captures, but am not the most packet sniffing expert. Could this be bad ports on the switch, bad switch, bad NIC...? Just trying to figure out the most logical place to begin looking.

Thanks!

spidey07 · May 16, 2011

Need to determine if the switch is truly flooding traffic. It's only supposed to do that on bcast/mcast frames or unknown MACs. Also look at spanning tree to see if you've got a lot of topology change notifications (TCNs), because those cause brief flooding. Make sure you're using portfast on all access layer ports so that end stations leaving/joining the bridge won't trigger a TCN.

If it's discards and not errors, that is normally an indication of simply too much traffic trying to egress that port. The 10/half thing is pretty redonkulous given the year is 2011.

However, I've seen ports go bad and you'll see all kinds of errors on a bank or group of ports. IIRC the 3750 has 12 ports per ASIC (you can check with "show int g1/0/1 capabilities" to see which ports are on the same bank).

The only way to see if the ports are bad would be a POST...boot the switch and have a console connection to it to see diagnostic results.

-edit-
Two active NICs in the server? That can cause ALL sorts of problems if not set up carefully.
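
To make the "it's only supposed to flood on bcast/mcast or unknown MACs" rule concrete, here is a minimal Python sketch of a learning switch's forwarding decision. It is an illustration only, not how the 3750 is implemented; the class name, port numbers, and MAC addresses are all made up for the example.

```python
def is_group_address(mac: str) -> bool:
    """True for broadcast/multicast: low bit of the first octet is set."""
    return int(mac.split(":")[0], 16) & 1 == 1


class LearningSwitch:
    """Hypothetical single-VLAN switch: learn source MACs, forward or flood."""

    def __init__(self, ports):
        self.ports = set(ports)
        self.mac_table = {}  # MAC address -> port it was last seen on

    def receive(self, in_port, src, dst):
        """Learn the source, then return the set of egress ports."""
        self.mac_table[src.lower()] = in_port
        dst = dst.lower()
        if is_group_address(dst) or dst not in self.mac_table:
            # Broadcast, multicast, or unknown unicast: flood everywhere
            # except the port the frame arrived on.
            return self.ports - {in_port}
        out_port = self.mac_table[dst]
        # Known unicast: exactly one egress port (none if it is the ingress).
        return set() if out_port == in_port else {out_port}


sw = LearningSwitch(range(1, 5))
server, desktop = "00:11:22:33:44:01", "00:11:22:33:44:02"
print(sorted(sw.receive(1, server, desktop)))  # desktop unknown: flooded
print(sorted(sw.receive(2, desktop, server)))  # server already learned
print(sorted(sw.receive(1, server, desktop)))  # desktop now learned
```

If the third call still returned every port, the table entry for the desktop would have gone missing in between, which is the kind of thing worth chasing on the real switch.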

Pheran · May 16, 2011

You've got a SERVER with its NICs at 10/half??? I'm sorry, but my first response to any vendor who told me to set things up that way would be "F*ck off!" I would imagine that any server at 10/half being accessed by gigabit clients would barely function in the first place.

You also mention it has 2 active NICs - if you don't have more than 1 VLAN, I hope that you are doing something to manage this dual connection (NIC failover/trunking software and/or etherchannel) and are not creating a bridging loop/broadcast storm.

cpals · May 16, 2011

Pheran said:
You've got a SERVER with its NICs at 10/half??? I'm sorry, but my first response to any vendor who told me to set things up that way would be "F*ck off!" I would imagine that any server at 10/half being accessed by gigabit clients would barely function in the first place.

You also mention it has 2 active NICs - if you don't have more than 1 VLAN, I hope that you are doing something to manage this dual connection (NIC failover/trunking software and/or etherchannel) and are not creating a bridging loop/broadcast storm.

Yep, it's an old Motorola Tandem server... don't know too much about it, but I believe they keep the NICs separate inside their system. They refuse to change their NIC speed... we're supposed to be going to their next-gen Windows-based system sometime, but that's far out from now.

As far as I know, it was a very expensive piece of equipment back in the day.

spidey07 · May 16, 2011

Back in what day? 1995?:biggrin:

cpals · May 16, 2011

Spidey: Yes, our Solarwinds Orion server shows 0s across the board for misses, errors, etc... only thing with huge numbers are the transmit discards on the switch interfaces.

I don't see any discards on the server generating the traffic or on our uplink to our main switch... only on other interfaces.

spidey07 · May 16, 2011

My guess then, if it's ONLY happening during big transfers, is that the switch is indeed flooding the traffic (sending it to all ports instead of the single port where the DST MAC lives) and it's overloading the egress queues for the other ports. Discards are basically the output queue being full. You can also try to catch it with "show int" and see the depth of the output queue.

Make sure you're using spanning-tree portfast on all ports used by hosts and that your MAC table isn't full (highly unlikely).

And check to make sure it isn't bcast or mcast traffic because it's supposed to flood those.
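
A rough way to picture why flooded gigabit traffic piles up discards on a 10 Mb/s port: the sketch below is a toy tail-drop queue in Python. The 100:1 arrival-to-drain ratio stands in for gigabit traffic hitting a 10 Mb/s port; the queue limit of 40 frames and the tick count are arbitrary values picked for the example, not numbers taken from the 3750.

```python
def tail_drop(arrivals_per_tick, drains_per_tick, queue_limit, ticks):
    """Toy egress queue: returns (sent, dropped, final_depth) in frames."""
    depth = sent = dropped = 0
    for _ in range(ticks):
        # Transmit as many queued frames as the line rate allows.
        tx = min(depth, drains_per_tick)
        depth -= tx
        sent += tx
        # Enqueue new arrivals; whatever does not fit is discarded.
        accepted = min(arrivals_per_tick, queue_limit - depth)
        depth += accepted
        dropped += arrivals_per_tick - accepted
    return sent, dropped, depth


# 100 frames offered for every 1 the port can send (gigabit into 10 Mb/s).
sent, dropped, depth = tail_drop(100, 1, queue_limit=40, ticks=1000)
print(sent, dropped, depth)  # 999 98961 40
```

Once the queue is full, roughly 99 of every 100 flooded frames are dropped, so output discards climb into the millions quickly while error counters stay at zero.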

Pheran · May 16, 2011

Also, beware that if this Tandem server is using a load-balancing algorithm that is obnoxious enough (something like Microsoft's unicast NLB algorithm), it might be causing the clients to flood the entire switch with unknown unicasts or multicasts. We need more info about what traffic you are seeing on the ports that are getting flooded.

cpals · May 16, 2011

spidey07 said:
Back in what day? 1995?:biggrin:

I was in middle school back then... :awe:

cpals · May 16, 2011

From a sh capabilities... I don't see where it would show how the ASICs are set up:

3750#sh int gi 1/0/1 capabilities
GigabitEthernet1/0/1
Model: WS-C3750G-48PS
Type: 10/100/1000BaseTX
Speed: 10,100,1000,auto
Duplex: half,full,auto
Trunk encap. type: 802.1Q,ISL
Trunk mode: on,off,desirable,nonegotiate
Channel: yes
Broadcast suppression: percentage(0-100)
Flowcontrol: rx-(off,on,desired),tx-(none)
Fast Start: yes
QoS scheduling: rx-(not configurable on per port basis),
tx-(4q3t) (3t: Two configurable values and one fixed.)
CoS rewrite: yes
ToS rewrite: yes
UDLD: yes
Inline power: yes
SPAN: source/destination
PortSecure: yes
Dot1x: yes

spidey07 · May 16, 2011

I'd say the best course of action would be to see if it is indeed flooding. You can get a packet capture on a regular host port (no need to setup span session) and then look at the destination MAC addresses. If it's not your MAC or bcast/mcast then it's flooding and you'll need to look at portfast and TCNs to figure out why it's flooding.
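
The destination MAC check can be scripted once the addresses are exported from the capture. A minimal Python sketch, assuming you already have the destination MACs as a list of strings; the function names and the sample addresses are hypothetical.

```python
from collections import Counter


def classify_dst(dst_mac, my_mac):
    """Label one destination MAC as seen from the capturing host."""
    dst = dst_mac.lower().replace("-", ":")
    mine = my_mac.lower().replace("-", ":")
    if dst == "ff:ff:ff:ff:ff:ff":
        return "broadcast"
    if int(dst.split(":")[0], 16) & 1:
        return "multicast"  # group bit set in the first octet
    if dst == mine:
        return "mine"
    # Unicast to some other station: should not show up on a plain host port.
    return "flooded-unicast"


def summarize(dst_macs, my_mac):
    """Count how many captured frames fall into each category."""
    return Counter(classify_dst(d, my_mac) for d in dst_macs)


my_mac = "00:11:22:33:44:25"  # made-up address of the capturing host
capture = [
    "ff:ff:ff:ff:ff:ff",
    "01:00:5e:00:00:fb",
    "00:11:22:33:44:25",
    "00:11:22:33:44:04",
    "00:11:22:33:44:04",
]
print(summarize(capture, my_mac))
```

A large "flooded-unicast" count on an ordinary host port (not a SPAN destination) is the sign that the switch is flooding.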

cpals · May 17, 2011

spidey07 said:
I'd say the best course of action would be to see if it is indeed flooding. You can get a packet capture on a regular host port (no need to setup span session) and then look at the destination MAC addresses. If it's not your MAC or bcast/mcast then it's flooding and you'll need to look at portfast and TCNs to figure out why it's flooding.

Okay, so I figured out that it's not just SQL dump traffic causing problems, but large amounts of traffic on the switch in general. I turned on port mirroring on gi1/0/36 to gi1/0/25. I initiated traffic from my server on gi1/0/30 to a desktop on gi1/0/4. It was a small 300MB file and it flooded my sniffer on port 25 and filled up the buffer. In the capture I see tons of packets addressed from the server IP to the desktop IP, but I wouldn't think I should be seeing that traffic on my port?

At layer 2 level I see a source mac of the server and a destination mac of the desktop I was file transferring to.

I've attached two views... one is from Clearsight Analyzer showing a packet capture and then one with a Wireshark view.

http://www.baacktech.net/sniffer.jpg
http://www.baacktech.net/sniffer2.jpg

spidey07 · May 17, 2011

Sounds like it's flooding. You shouldn't see that traffic from the client or server unless it was part of your span session. I'd open a case with cisco.

cpals · May 17, 2011

After troubleshooting even more... I found that the flooding only occurred when the server sent data to other IPs. If I initiated a transfer from a computer on the same switch to another computer on the switch, the traffic was normal. So either that specific port on the 3750 was somehow bad, or the NIC on the server was.

I moved the port on the switch to a different one and the traffic is acting normal now... so bad port? I guess I should follow through and plug a laptop or something into the bad port to see if it stays bad, but don't know if I'm going to do that.

spidey07 · May 17, 2011

Could be. But I've never seen a bad port cause flooding. It could still be a portfast/TCN thing. Check overall spanning-tree and config of that port.

Good read:
http://www.cisco.com/en/US/products/hw/switches/ps700/products_tech_note09186a00801d0808.shtml
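
For anyone wondering how a TCN turns into flooding: after a topology change, bridges temporarily age out MAC table entries much faster, so hosts that have been quiet for a short while fall out of the table and the next frames sent to them are flooded as unknown unicast. A small Python sketch of that idea, assuming the common defaults of a 300 second aging time and a 15 second forward delay (check the actual values on your switch):

```python
DEFAULT_AGING_S = 300  # assumed normal MAC aging time
FORWARD_DELAY_S = 15   # assumed STP forward delay, used for aging after a TCN


def entry_survives(idle_seconds, topology_change):
    """True if a MAC entry idle for idle_seconds is still in the table."""
    aging = FORWARD_DELAY_S if topology_change else DEFAULT_AGING_S
    return idle_seconds < aging


# A host that last sent a frame 60 seconds ago:
print(entry_survives(60, topology_change=False))  # True: still known
print(entry_survives(60, topology_change=True))   # False: traffic to it floods
```

A non-portfast port bouncing repeatedly keeps the switch in that short-aging state, which is why the portfast config on host ports matters here.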
