Question Re: Spanning tree and broadcast storms

Saulbadguy

Diamond Member
Jan 27, 2003
5,573
12
81
I work in an IT department for a school district with very poor management. We run a giant Layer 2 switched network with over 7,000 nodes, which only one person has access to and control over. Right now, several sites are down. I don't know why, but the "switch/router guy" says a hub or a personal switch caused spanning tree to stop functioning, and now we are having broadcast storms.

Is this possible, or is he BSing us?
 

ScottMac

Moderator, Networking, Elite member
Mar 19, 2001
5,471
2
0
Yes, it's absolutely possible.

If someone brought in a SOHO switch / hub and created parallel links between two (or more) points, then you get a broadcast storm (this could also happen with managed switches with STP turned off).

SOHO stuff doesn't have Spanning-Tree, so there's no way to control the ports (via STP protocol(s)).

Find the twit and kill 'em (or at least fire / dismiss / expel / bend / fold / spindle / mutilate 'em).

This can also happen with rogue access points, wireless bridges ... and a few other scenarios ... anything that creates an uncontrolled parallel span / segment will get you a broadcast storm.
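
To make the mechanism concrete, here is a minimal sketch of why a parallel link turns one broadcast into a storm. It is a toy model, not a description of any real switch: the switch names and the `flood` function are made up for illustration, and it assumes every switch simply floods a broadcast out every link except the one it arrived on, with nothing like a TTL to stop it (Ethernet frames have none).

```python
from collections import defaultdict

def flood(links, source, max_ticks=6):
    """Count broadcast frame copies in flight at each tick.

    links: list of (switch_a, switch_b) pairs; a repeated pair is a parallel link.
    Toy rule: a switch floods a received broadcast out every link
    except the one the frame arrived on. No STP, no TTL.
    """
    # switch -> list of (link_id, neighbor)
    ports = defaultdict(list)
    for link_id, (a, b) in enumerate(links):
        ports[a].append((link_id, b))
        ports[b].append((link_id, a))

    # Each frame in flight is (link it travels on, switch it is heading to).
    in_flight = list(ports[source])
    history = []
    for _ in range(max_ticks):
        history.append(len(in_flight))
        next_flight = []
        for arrived_on, switch in in_flight:
            for link_id, neighbor in ports[switch]:
                if link_id != arrived_on:
                    next_flight.append((link_id, neighbor))
        in_flight = next_flight
    return history

# Loop-free tree: the broadcast reaches the edge switches and dies out.
tree = [("core", "idf1"), ("core", "idf2")]
# Hypothetical SOHO switch plugged into idf1 twice: a parallel link.
looped = [("core", "idf1"), ("idf1", "soho"), ("idf1", "soho")]

print(flood(tree, "core"))
print(flood(looped, "core"))
```

In the tree the count drops to zero after the first hop. With the double-plugged switch the copies never drain, and every lap around the loop also sprays fresh copies toward the core, which is what the rest of the network experiences as a storm.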

Good Luck

Scott
 

spidey07

No Lifer
Aug 4, 2000
65,469
5
76
scottmac has it down.

It is VERY possible and HIGHLY likely. Happens all the time.

oh - and you guys need to totally redesign your network. a large layer2 network like that is just asking for trouble, exactly like the trouble you are having.
 

spidey07

No Lifer
Aug 4, 2000
65,469
5
76
A parallel link is really a layer2 loop: a hub/switch/router/access point/bridge that links two parts of the same layer2 broadcast domain. Say some yahoo plugs a SOHO switch or router into the network, then he plugs it in again - effectively two plugs into the network.

The only real way to troubleshoot it is to look at the spanning-tree parameters, see where all the TCN (topology change notification) flags are coming from, and trace it down. Very difficult to troubleshoot.

Normally it's just easier to disassemble the network with a sniffer running and narrow down the source of the storm.

so word to the wise - don't be plugging in stuff on a network that isn't yours.
 

Saulbadguy

Diamond Member
Jan 27, 2003
5,573
12
81
I would think a broadcast storm would be limited to one network or VLAN - this broadcast storm has taken down all 7,000 some nodes. Unbelievable.
 

spidey07

No Lifer
Aug 4, 2000
65,469
5
76
Originally posted by: Saulbadguy
I would think a broadcast storm would be limited to one network or VLAN - this broadcast storm has taken down all 7,000 some nodes. Unbelievable.

You guys are probably spanning vlans all over the place. That's why I suggested you need a redesign. With good design these kinds of storms are contained to a single wiring closet.
 

ScottMac

Moderator, Networking, Elite member
Mar 19, 2001
5,471
2
0
Constrained to one VLAN (per broadcast storm), yes, that's true.

You've described your network as flat ... you apparently have one **HUGE** VLAN.

VLAN = Broadcast Domain

Very poor design (though, once upon a time, Microsoft was also on one huge flat network)

Maybe this will convince the administration that a few bucks spent on a router is cheaper than 7000 nodes screaming their guts out at each other without any work getting done.

Good Luck

Scott
 

KLin

Lifer
Feb 29, 2000
30,216
564
126
I had to trace a broadcast storm like this before. I started at my fiber MDF. I disconnected fiber connections until the network went to normal and I had found the segment that was causing the problem. I then went to that IDF and disconnected copper until I found the port that was causing the problems. Turned out to be a netgear switch with a patch cord looped into it.
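
The two-stage hunt described here (pull uplinks at the MDF first, then individual ports at the suspect IDF) can be sketched roughly as below. Everything in it is hypothetical: `find_offender`, the topology dictionary, and the `storm_active` callback stand in for a person pulling cables and watching a sniffer, and the sketch assumes a single offending port.

```python
def find_offender(topology, storm_active):
    """Locate the (idf, port) whose disconnection stops the storm.

    topology: {idf_name: [port names]}, a made-up inventory of closets.
    storm_active(disconnected): hypothetical check that returns True if the
    storm is still visible with the given set of uplinks/ports unplugged.
    Assumes exactly one offending port.
    """
    for idf, idf_ports in topology.items():
        # Stage 1: pull this closet's uplink at the MDF.
        if storm_active({idf}):
            continue  # storm still raging, so this closet is not the source
        # Stage 2: uplink restored, now pull copper ports one at a time.
        for port in idf_ports:
            if not storm_active({port}):
                return idf, port
        return idf, None  # closet identified, port not pinned down
    return None

# Simulated network: the loop sits on port "b2" in closet "idf-b".
topology = {"idf-a": ["a1", "a2"], "idf-b": ["b1", "b2", "b3"]}
culprit = {"idf-b", "b2"}

def simulated_storm(disconnected):
    # The storm stops only if the bad port or its uplink is unplugged.
    return not (culprit & disconnected)

print(find_offender(topology, simulated_storm))
```

Working from the uplinks down keeps the number of disconnections small: a closet is ruled in or out with one pull, and ports are only touched in the one closet that matters.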
 

randal

Golden Member
Jun 3, 2001
1,890
0
71
Hmmm, he says that a random switch/hub caused spanning-tree to not function? That doesn't make much sense, as that's the whole reason spanning tree exists: to prevent such network loops. If a loop is detected, it shuts off one of the parallel paths (based on cost). This works, regardless of whether or not the offending bridge runs spanning tree -- some switch up the chain would detect it and shut it down.

Sounds to me like somebody doesn't have spanning-tree turned on everywhere or configured properly.

$.02
 

spidey07

No Lifer
Aug 4, 2000
65,469
5
76
Originally posted by: randal
Hmmm, he says that a random switch/hub caused spanning-tree to not function? That doesn't make much sense, as that's the whole reason spanning tree exists: to prevent such network loops. If a loop is detected, it shuts off one of the parallel paths (based on cost). This works, regardless of whether or not the offending bridge runs spanning tree -- some switch up the chain would detect it and shut it down.

Sounds to me like somebody doesn't have spanning-tree turned on everywhere or configured properly.

$.02

Randal, read up on spanning-tree. You can bring down even the most massive network with a hub. There are modern features out there to prevent this, but basic 802.1d spanning-tree cannot stop it. Basically both switchports are in a forwarding state - bridge these two ports with a bridge that isn't running spanning-tree and you have a layer2 loop, and spanning-tree cannot stop this.

Basically spanning-tree is the devil and you architect/design your network to avoid it as best as possible. The problem comes from the fact that SOHO gear doesn't run spanning-tree - imagine just how much trouble that can cause... it's worse than a uni-directional link: an L2 bridge that doesn't participate in spanning-tree.

And people wonder why us "network nazis" are so adamantly against this crap.
 

randal

Golden Member
Jun 3, 2001
1,890
0
71
Ah-hah. I misread the OP, my apologies.

We turn off ports we aren't using, which has been a pretty good solution. Also, storm-control? Sure, it doesn't fix the problem, but it sure does help.

VLANs? Routers? ... preaching to the choir, I'm sure.

 

cmetz

Platinum Member
Nov 13, 2001
2,296
0
0
A proper 802.1d STP implementation will correctly detect and sever a loop caused by a hub or switch that spans two ports in the same STP domain (e.g., the office user who knows just enough to be dangerous, and accidentally plugs two wall-jack ports into the same hub). The STP switch(es)' ports will transition to LISTENING on link-up, during which they send out periodic STP BPDUs and do NOT forward traffic. Once each port sees the BPDU from the other, the switch will flip one of those ports to BLOCKING, while the other can in time transition into FORWARDING. In a proper STP implementation, at most one of those ports is ever FORWARDING (the other in LISTENING or BLOCKING) and therefore there's no way the broadcast packets can loop around.

A SOHO switch or hub, or even properly designed managed switches with STP disabled, will just pass the STP BPDUs on from the received ports to all the rest. This is the behavior that allows it all to work; there's not really a difference between a dumb switch/hub connecting two STP ports and a crossover cable connecting them, from STP's perspective.
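
As a rough illustration of that tie-break, here is a sketch of one switch whose ports hear each other's BPDUs through a dumb hub. It is a simplification, not a full 802.1d state machine: the `Bpdu` tuple and `resolve_self_loop` are names invented for this example, the switch is assumed to believe it is the root (so every BPDU it sends advertises cost 0), and only the final port states are shown, skipping LISTENING and LEARNING.

```python
from typing import NamedTuple

class Bpdu(NamedTuple):
    # Field order matters: BPDUs are compared field by field, lower wins.
    root_id: int
    root_path_cost: int
    sender_bridge_id: int
    sender_port_id: int

def resolve_self_loop(bridge_id, port_ids):
    """Final state of each port of ONE switch when all listed ports are
    plugged into the same non-STP hub and so hear each other's BPDUs.

    Simplifying assumption: the switch thinks it is the root bridge.
    """
    sent = {p: Bpdu(bridge_id, 0, bridge_id, p) for p in port_ids}
    states = {}
    for port in port_ids:
        heard = [bpdu for other, bpdu in sent.items() if other != port]
        # A port that hears a better (lower) BPDU than its own must not
        # forward; here the only difference is the sender port ID.
        if any(bpdu < sent[port] for bpdu in heard):
            states[port] = "BLOCKING"
        else:
            states[port] = "FORWARDING"
    return states

# Two wall jacks on ports 3 and 7 end up on the same hub.
print(resolve_self_loop(0x8000, [3, 7]))
```

The hub does not need to understand any of this. It only has to repeat the BPDUs, which is why a crossover cable and a dumb hub look the same from STP's point of view.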

If you configured your STP to defeat this safety mechanism (e.g., Cisco's port fast), you lose.

Modern switches also contain capabilities like the ability to disable ports automatically if they receive too many broadcasts, or to rate-limit broadcast traffic, or at least to filter out broadcast traffic. Unfortunately, many switches' CPUs deal with broadcast storms by falling over. One of the things that separate the men from the boys in the switch business.

spidey07, "spanning-tree is the devil" is a bit harsh. Spanning tree is a good protocol for solving the problem that it was intended to solve -- detecting loops caused by bridges in Ethernet networks. It's simply being stretched WAY beyond its design intentions. It was in particular never intended to be used for switched station ports. And STP & VLANs remain a major kluge fest. Too many people get frustrated by the trademark 802.1d 30s of silence when you plug a port in (esp. if you have stupid NICs that toggle the link state when you reboot the PC), and configure STP in a way that effectively disables its upside while maintaining much of the downside.
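
For anyone wondering where the 30 seconds comes from, a quick worked example, assuming the standard 802.1d default timers (forward delay 15 s, max age 20 s). The constant and function names are just labels for this sketch.

```python
# 802.1d default timers, in seconds (assumed unchanged from defaults).
FORWARD_DELAY_S = 15
MAX_AGE_S = 20

def link_up_delay(forward_delay=FORWARD_DELAY_S):
    """Time a port spends in LISTENING plus LEARNING before FORWARDING."""
    return 2 * forward_delay

def worst_case_reconvergence(max_age=MAX_AGE_S, forward_delay=FORWARD_DELAY_S):
    """Stale BPDU info must age out before LISTENING and LEARNING begin."""
    return max_age + 2 * forward_delay

print(link_up_delay())             # 30
print(worst_case_reconvergence())  # 50
```

So the silence on link-up is two forward-delay periods back to back, and it is that wait which keeps a freshly plugged loop from forwarding before the BPDUs have been compared.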

My only real true beef with STP is that it fails the obviousness test -- that is, I can build you a network topology rather easily where if I take five experienced network guys and ask them all how the traffic flows, at least one of them will get it wrong, probably a majority. It requires a strong understanding of the protocol just to know how the tree is really going to get built, and what it is going to do if some arbitrary link gets cut. And that is very bad. Many many networks I have seen where people designed a beautiful redundant structure that STP will not use the way that the designers thought.

Saulbadguy, the Ethernet spec caps the number of stations on a network at 1,000. I don't know if the IEEE specs relax this, but it's a very good limit. You might do some research and see if you can find something in writing to back that up. 7,000 stations on the same broadcast domain is a design that asks for trouble. Also, it's 2006. In general, you don't bridge between sites. Maybe back when the world was IPX or something similar, but we have IP and fast routers/L3 switches now, so there's no excuse for not separating sites into routed networks.
 

spidey07

No Lifer
Aug 4, 2000
65,469
5
76
cmetz,

I guess I did go overboard. These days networks I design have no spanning-tree blocking ports, no loops. The redundancy is all at layer3 so that's what I meant about it being avoided.

You're definitely correct. Not too many people know the hardcore ins and outs of spanning-tree; however, it really is necessary to fully know and comprehend it. Not to mention designing your trees and documenting which ports should be blocking and which should be forwarding. That way you already know how traffic is flowing. This can be done in the design phase and verified after/during installation.

I second that: I've seen rather large networks that on the surface look well designed, but the layer2 paths are very, well.... "not right", not deterministic.
 

TC10284

Senior member
Nov 1, 2005
308
0
0
I guess I get to look forward to all this lovely stuff in my near-future net-admin/tech jobs.

Good stuff guys. :D
 

spidey07

No Lifer
Aug 4, 2000
65,469
5
76
Originally posted by: TC10284
I guess I get to look forward to all this lovely stuff in my near-future net-admin/tech jobs.

Good stuff guys. :D

Troubleshooting spanning-tree is one of the most difficult things there is.

Imagine a spanning-tree that is fluctuating wildly, with ports transitioning between the states, in a large network of say 30 or more switches. Not fun. That's where I normally just start yanking blades to isolate the problem. Now this shouldn't happen with a properly designed net, but with consulting you get to look at other people's messes.
 

TC10284

Senior member
Nov 1, 2005
308
0
0
I've always thought working with a consulting team would be pretty interesting, Spidey. I don't know if I'd have enough experience to get hired though in the eyes of employers. IMO I do, but who knows. Maybe you can tell from some of my other posts.
Definitely not enough to go-it alone and tackle something like the above problem. I mean, I'm CCNA certified, but I need/want more experience working with Cisco routers/switches.
 

spidey07

No Lifer
Aug 4, 2000
65,469
5
76
Originally posted by: TC10284
I've always thought working with a consulting team would be pretty interesting, Spidey. I don't know if I'd have enough experience to get hired though in the eyes of employers. IMO I do, but who knows. Maybe you can tell from some of my other posts.
Definitely not enough to go-it alone and tackle something like the above problem. I mean, I'm CCNA certified, but I need/want more experience working with Cisco routers/switches.

Tip for you.....

routing and switching are the bread and butter...
all the money is in new technology - wireless, IP communications, storage, etc. All of which need a good foundation in routing/switching.
 

TC10284

Senior member
Nov 1, 2005
308
0
0
Originally posted by: spidey07

Tip for you.....

routing and switching are the bread and butter...
all the money is in new technology - wireless, IP communications, storage, etc. All of which need a good foundation in routing/switching.

Thank you sir. I greatly appreciate it and I will keep that in mind. Should I look toward one of the Cisco wireless certs?
 

cmetz

Platinum Member
Nov 13, 2001
2,296
0
0
spidey07, basically, my rule is to not use STP for redundancy (don't create intentional loops/parallel paths), use routing protocols for that. That keeps people mostly out of trouble, because you're avoiding the place where STP gets complex.

STP is IMO best used simply as a protocol for preventing a stupid cabling error from taking down your network. We've all been in cable plants that were messy and tangled, and we've all been working on networks "hot" (live and fully in production, not a maintenance window in which you are allowed to kill it all...). It's so easy to make a loop by accident, especially temporarily when you're reorganizing. STP helps you keep a small mistake small. A true broadcast storm is a real <expletive deleted> especially when you have stupid routers and stupid hosts who deal with storms by falling over hard. It's 2006, gear should be QAd to handle such an event, but not all is.

It's ironic, then, that the original poster in this thread is having the problem that STP isn't doing its job. My bet is that the switch admin misconfigured it. It's very easy to confuse behaviors you don't like (the 30s delay on link-up) with behaviors you should "fix."
 

spidey07

No Lifer
Aug 4, 2000
65,469
5
76
cmetz,

I think it is a network diameter problem (bigger than 7). Coupled with some misconfiguration, coupled with bad design, along with other nastiness.

Basically any one of the things that could cause STP to not do what it is supposed to. Or all of them combined.
;)
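
A quick way to sanity-check the diameter point on paper is to count the switches on the longest shortest path through the layer2 topology. The sketch below does that with a breadth-first search. The adjacency dictionary and function names are made up for illustration, the graph is assumed to be connected, and "diameter" is counted as the number of bridges on the path (hops plus one), which is how the usual limit of 7 with default timers is stated.

```python
from collections import deque

def hops_from(graph, start):
    """Breadth-first search: link hops from start to every other switch."""
    dist = {start: 0}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for neighbor in graph[node]:
            if neighbor not in dist:
                dist[neighbor] = dist[node] + 1
                queue.append(neighbor)
    return dist

def bridge_diameter(graph):
    """Most switches on the shortest path between any two switches.

    graph: {switch: [neighbor switches]}, assumed connected.
    """
    longest_hops = max(max(hops_from(graph, s).values()) for s in graph)
    return longest_hops + 1  # N hops pass through N + 1 switches

# Hypothetical daisy chain of nine switches, site to site to site.
names = [f"sw{i}" for i in range(1, 10)]
chain = {n: [] for n in names}
for a, b in zip(names, names[1:]):
    chain[a].append(b)
    chain[b].append(a)

print(bridge_diameter(chain))      # 9
print(bridge_diameter(chain) > 7)  # True
```

A district that daisy-chains sites at layer2 can blow past 7 without anyone noticing, at which point the default timers no longer guarantee that BPDU information reaches the far end before it ages out.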
 

cmetz

Platinum Member
Nov 13, 2001
2,296
0
0
spidey07, true. Anyone who would design a school-district wide network as one big broadcast domain.... there's no telling what kind of bad design and bad implementation could be lurking in there.

Sorry, Saulbadguy - there's no telling how deep this particular rabbit hole goes. I know you're just the poor guy who needs the network to work. Perhaps you could convince the school system's management to bring in an outside networking person to fix the design so that this sort of thing can't happen. You might find more success with management if you make the task narrow: just to look at what the current design is and to recommend ways to solve this specific problem, for the switch/router guy to implement. Otherwise you can quickly enter into a turf war. Of course, once the wedge is driven and you have an outsider's opinion that your network is fubar, you can try to expand the scope of the outside party and take it away from the school system's switch/router guy, at least to implement fixes.