Your most memorable network screw-up ...

ScottMac

Moderator, Networking, Elite member
I had one of those moments this week. It didn't kill the network per se, but still reminded me that nobody is bulletproof ...

In this case, it was making "just one little code change" to a perl script that does a format conversion for data going to a remote monitoring system (that, in production, generates trouble tickets that cause network engineers to fix things).

Rather than firing up the dev environment (which error-checks the code), I used a text editor, because I was just making a couple of small changes. I tested the changes in the dev environment, and the actual changes were good & valid ... the issue was the *TYPO* I made while making the simple little changes. The script didn't run, the data didn't convert, the monitor didn't get its data, and tickets were generated (but, thankfully, in the test environment, disaster averted).


SO ... what I thought might be interesting, and possibly educational, for the newer IT/Data/Networking folks, is for the ol' folk to fess up and tell us all about that one thing you did ... you know the moment ... when your brain is screaming "WAAAAAAAITTTTTTTTT!!!!!!!!" but your finger ignores the command and presses the return key (button, switch ...) that brings the house down. Or when that one simple thing that was overlooked came back to haunt you and ruined your weekend (vacation, career, life): that one thing that taught you how evil the word "assumed" can be ...or that moment of laziness that left permanent teeth marks on your posterior.

We have all been there, I know it for certain; time to confess for the good of all ...

 
I got a call that the network monitoring alarms were going off. I went in and checked my edits, saw the typo, and fixed it (I fat-fingered it, causing an execution error, so the script didn't run).

If I'd done the edit in my usual IDE editor, it would have flagged the typo.
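For the newer folks, here is a minimal sketch of catching that kind of typo without the IDE: run a compile-only check on the script before it goes anywhere near the monitoring feed. It assumes `perl` is on the PATH and uses `perl -c`, which checks syntax without executing the program. The function name and the example filename are hypothetical, made up for illustration.

```python
import subprocess


def passes_syntax_check(script_path, checker=("perl", "-c")):
    """Run `checker script_path` and report whether it exited cleanly.

    With the default checker this is `perl -c script_path`, which compiles
    the script without running it.
    """
    result = subprocess.run(
        [*checker, script_path],
        capture_output=True,
        text=True,
    )
    if result.returncode != 0:
        # Show whatever the checker complained about (perl reports on stderr).
        print(result.stderr.strip())
    return result.returncode == 0
```

Something like `passes_syntax_check("convert_feed.pl")` (hypothetical filename) could gate the copy to the live box, so a fat-fingered edit gets rejected before the monitor ever misses its data.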

 
7200 router with multiple DS3/OC3s, a critical router. Needed a down and dirty way to check out a few packets of a small conversation, so debug ip packet detail and make sure you put an access list on it. Easy peasy.

debug ip packet detail <CR>

Forgot to put the ACL on it, router dead.
 
Not network killing, but definitely time consuming: rm -rf /usr /local/bin/file
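For anyone wondering why one stray space hurts so much, a small Python sketch shows how the command line gets split into words. `shlex.split` follows POSIX shell word-splitting rules for simple cases like this one; the helper name `rm_targets` is made up for illustration, and nothing here deletes anything.

```python
import shlex


def rm_targets(command):
    """Return the path arguments a POSIX shell would pass to rm."""
    words = shlex.split(command)
    # Skip the command name and any option words such as -rf.
    return [w for w in words[1:] if not w.startswith("-")]


print(rm_targets("rm -rf /usr /local/bin/file"))  # ['/usr', '/local/bin/file']
print(rm_targets("rm -rf /usr/local/bin/file"))   # ['/usr/local/bin/file']
```

With the space, rm is handed /usr as a target of its own, and the second path most likely does not even exist.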

A boss of mine managed to pull the wrong drive out of a RAID array one night when a drive went bad. Lost /usr/local on a critical Solaris box. Managed to repopulate it from another box through an ssh session I had open with the bad one using uuencode/uudecode.

A bad IDS signature cemented my rule of never pushing new sigs after noon (now with management backing!). That was a fun Friday night. That was almost a network down event.
 
I'm still pretty green (about 3 yrs full-time in IT), and can't think of any big goof-ups on my part, but I'm sure it will happen sometime.

But I'll share what a former boss did that has always stuck out in my mind:

He wanted to change the switch port that one of our production servers was plugged into (the reasons for changing the port escape me... but it was something relatively unimportant, like he was trying to clean up the cabling in the rack). So he tells me that he's not going to wait for downtime. He's going to unplug it and move it really quickly - "it will take less than a full second and in a congested network, latency could take up more time than that, so the application will be fine and users won't know the difference". I said "I don't think it really works like that...". He insisted and said "trust me, I'll show you"... but I was a new employee then, so I wasn't going to take it any further than that. It didn't end up doing any damage to the application/db, but it resulted in a ton of phone calls about application errors.

Thankfully, I don't work for him anymore. He was really just a programmer that managed to pad his resume enough to get a job as the IT Manager for a small/medium business, and tried to fake his way through everything.
 
I was at our colocation facility once. I had a customer that needed me to add a virtual machine to a box running ESXi...but the virtual machine had to be on a different physical network than the host system was connected to. No problem, I thought, there's a free NIC, so I'll just plug it in and configure it. Easy cheesy. Well, I ran the cable between the cabinets and plugged it in...and got called away for a few minutes. Well, dumb me, I didn't reconfigure the NIC in software before I plugged it in...this box had been configured at one point to aggregate these two NICs, so they were both part of the same vSwitch. Every VM on the box immediately lost network connectivity. One stupid mistake and I managed to take down the phone systems for 3 doctors' offices (this particular host runs virtual hosted PBXes)! Woo!
 
Just a few weeks ago I was working on a route map for our 3750 switches that lead to our internet firewalls. We are in the process of moving to a Check Point VSX firewall cluster and off of our old firewalls, and we are using policy-based routing to selectively move traffic over before we change the default route, as well as to make certain types of traffic hit specific VSs (e.g. all SMTP traffic goes out one VS). I was cleaning up our access lists to be named rather than numbered so that they would be more easily understandable for anyone else working on this. I created the new access list and removed the match statement (effectively matching all traffic) before I put the new match statement back in, instantly killing my access to the switch and routing all of our internet traffic to the wrong firewall.

Rookie mistake from someone with more than 10 years' experience in networking. Luckily, there was another interface that I was able to get to in order to fix it, and it was after the majority of people were gone for the day, so I didn't catch too much flak for it other than from myself. I hate when I make a stupid mistake.
 
This one wasn't directly my fault, but I had a part in it a few months back.

We are in the process of a data center migration, and are also doubling our hosting size at this site. Needless to say, lots of new stuff coming in and old stuff going out. That day we needed to install 4 VPN boxes into the old side and old cabinets. These boxes were heavy and we had to install rails that attach to both sides of the cabinet. The old cabinets' PDUs look basically like a longer one of these. The PDUs were on the back side of the cabinet, and run from the bottom of the rack to the top.

First 3 boxes go in without issue (had to adjust the PDU spacing so we didn't run into plugs with the rails), with me in the front and a coworker at the back of the cabinet guiding it in. While installing the 4th box, we almost had it in when it slipped from the coworker's hand. The rails on the back side drop, and manage to bridge the 2 prongs on a plugged-in power cord. Sparks shot right in front of his face, and shot across the cabinet right next to my face. The breaker tripped, killing the PDU and taking out a cabinet. None of the servers that went down were critical to the company (hence no 2nd power source), but we did get a few calls about it.

Pics One Two. They were taken with my phone, so excuse the crap quality.
 
Woohoo! I did that type of direct short inadvertently when I was about 6 years old. We had open box springs and the bed got pushed over too far. I sit down and the thin angle iron slips right between the plug and the wall. BZZZT, light go out, I jump up and run from the room! Dad showed me the two little slots carved in the angle iron and the nearly sawed off copper plug 🙂
I recently took down an accountant's server a bit prematurely. This place was a networking disaster waiting to happen, and the box was just sitting on the dusty carpet headless. I could not even get a password for it or any cooperation until AFTER I killed it for the move 🙂
Luckily no corruption, but it was a wake up call.
Now the box is cleaned up, backed up, off the floor and on a UPS.
 
I had a medallion around my neck on my lanyard. One very late night working on a cabinet install, I was poking around the phone system nearby. When I leaned over to look at the back of the cabinet, the medallion managed to slip into a slot barely wider than the medallion and shorted out the power supply to 2 PBX's. The site lost half of their phones. Upon further review, there were plastic covers not put onto the PBX's that would have prevented this, and it was a one in a million chance that the medallion could have gone through that tiny slot. I try not to wear anything metal ever anywhere on me.
 
I had two that I can remember:
1.
Was adding a redundant ACE load balancer into a production environment.
I forgot to apply the correct license on the new unit, and it somehow wiped out the config on both LB's.
The rollback command wasn't working, and I had to piece back every single serverfarm & VIP.

2.
I was trying to move a cart full of a stack of HP DL360G5's by myself.
There's an up ramp into the datacenter due to the raised floors, so I had a "head start" run, hoping to be able to make it all the way up the ramp.
I made it alright, but I ran too hard, and when I made a turn, two servers flew out of the cart and fell on the floor.
Two colleagues who saw the whole thing were laughing like bastards.
The servers powered up, but one SAS drive kept failing and would not form a RAID w/ the rest of the drives, and I had to request an RMA w/ the vendor...
 
Originally posted by: Cooky
I had two that I can remember:
1.
Was adding a redundant ACE load balancer into a production environment.
I forgot to apply the correct license on the new unit, and it somehow wiped out the config on both LB's.
The rollback command wasn't working, and I had to piece back every single serverfarm & VIP.

Holy fuck that sounds painful. I hope you learned your lesson.

ps - been there, done that, fucking sucks.

Sorry for the language but what cooky had to do, it's justified.

 
I've done a few bad things in my time, but this is the one that comes to mind at the moment. I had a router running the IOS firewall feature set ("ip inspect" commands) that was one of the primary routers in the network. I was trying to remove stateful inspection from a single interface, the command looks like "ip inspect <name> in". I was in interface config mode and I thought I could just abbreviate the removal as "no ip inspect". WRONG. That's actually a global command that wipes out every single piece of IOS firewall config on the router (which unfortunately will happily execute from interface config mode). Argh! Fortunately my access to the router wasn't lost and it didn't take too long to paste back the appropriate commands from a config backup, but that blew away just about every established session on the network for a few minutes. 😛

 
Originally posted by: spidey07
7200 router with multiple DS3/OC3s, a critical router. Needed a down and dirty way to check out a few packets of a small conversation, so debug ip packet detail and make sure you put an access list on it. Easy peasy.

debug ip packet detail <CR>

Forgot to put the ACL on it, router dead.

Heh, some genius (fortunately not on my team) at work once executed "debug all" on one of their switches in a production lab, with the expected result.

EDIT: I forgot the saddest part, when you enter that command the switch asks you if you are really sure about that (basically it should just ask you if you are a dumbass), and he still entered "y".
 
Yeah, when I ran into the ACE problem, the daily backup hadn't kicked in yet, so the backed-up config didn't contain the changes that had gone in earlier.
Luckily I was able to restore the delta from my logged CRT sessions so it wasn't too bad afterward.

When I saw a blank config in all contexts I did scream the F word loud, and thought for sure I was gonna get fired.

Lesson learned - always back up a copy of config to either laptop or mgmt station before proceeding, and don't rely on rollback or archive commands.
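A minimal Python sketch of that habit, assuming the running config has already been pulled off the device as text (how it gets pulled is out of scope here). The function name, directory name, and file naming scheme are all hypothetical.

```python
from datetime import datetime
from pathlib import Path


def save_config_backup(config_text, device, backup_dir="config-backups", now=None):
    """Write config_text to a timestamped file and return its path."""
    stamp = (now or datetime.now()).strftime("%Y%m%d-%H%M%S")
    directory = Path(backup_dir)
    # Create the backup directory on first use.
    directory.mkdir(parents=True, exist_ok=True)
    path = directory / f"{device}-{stamp}.cfg"
    path.write_text(config_text)
    return path
```

Calling it right before a change leaves a local copy that does not depend on the daily backup job or on the device's own rollback feature.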
 