sourceninja
Diamond Member
I have a Solaris/Sun story from hell. I should first point out I did not set up these servers; it was done by the admin I am referencing in this post.
Sunday night at about 11pm our web server goes down. I try to ssh in and the server is not responding, but it pings. I drive in expecting a small, easy-to-resolve issue. The KVM is corrupt and just printing out things like "$%$%@^&^@#$@!$$^$%&@^%$@#!!!!DF" and nothing human readable. The serial console will not respond. So I decide to just reboot the server. After POST it just sits there and will not boot. I break out and try to boot it manually and it will not boot. I try to boot the mirror and it will not boot either. So I decide this is a hardware issue, and since that is not my job I place the appropriate call and go home for the night.
I come in this morning and the server is being worked on. It is an old server (Sun Fire 280R). Our admin pulled the drives from the server and was trying to get them to boot in another 280R. They will not boot. He then tries to put the old disks back into the server he was testing on, and now it will not boot back up either. So he calls our hardware support vendor. This is starting to look bad, so I decide we need a backup plan and start looking for a replacement server. Our important files are all in a folder called /var/web, which is supposed to be mounted on our SAN. It should be a simple matter of setting up Apache/PHP on another server, mounting that SAN volume, changing the DNS, and we are back in business. Or so I thought....
Fast forward to 11am. The support vendor arrives and looks at the disks. Initially he thinks the motherboard has died. Then he hears about the second server. Now he asks our admin if he fully unplugged the servers before removing the drives. The admin says no. The vendor tells us that that can hose the boot sectors on the drives, or worse, cause full data loss. Not what I want to hear. He then determines that the first disk had a motor failure, which is why the first server went down. I ask why the mirror would not boot and he tells me that disk is reading blank!!! So we have no mirror. Worse than that, he is unable to recover the second server's drives, and it must be reinstalled. Luckily for us that server is not production critical (it runs some minor backups). I add the files it needs to a quick rsync to a box that is still getting backed up. At this point our admin breaks the news to me: he never got around to setting up the SAN volume on that production-critical web server. This means we have a total loss of data. At this point I decide to go to backups, and I am informed he never set up the Veritas backup software either! This server has been in production for 5 months now. I have 6 months of work whose only copy was on this server. On top of that, our entire public website is on this server. I am pissed! I force myself to go to lunch.
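In hindsight, a 30-second sanity check would have caught the missing SAN mount on day one instead of month five. A minimal sketch of what that check could look like (plain shell; /var/web is the path from the story, everything else is just an illustration, adjust for your own boxes):

```shell
#!/bin/sh
# Verify that a directory is actually its own mount point (e.g. a SAN
# volume) and not just a plain directory sitting on the local disk.
dir=/var/web   # path from the story; substitute your own

# df -P prints the mount point of the filesystem the directory lives on
# in column 6 of the second line; if that equals the directory itself,
# the directory is a separate mount.
mount_of_dir=$(df -P "$dir" 2>/dev/null | awk 'NR==2 {print $6}')

if [ "$mount_of_dir" = "$dir" ]; then
    echo "$dir is a separate mount point"
else
    echo "WARNING: $dir is NOT a separate mount -- data is on local disk"
fi
```

Run something like that from cron with an alert on the WARNING case and a missing mount gets noticed in minutes, not months.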
I come back from lunch with a mission. We do not have any free Sun boxes, so I find a Dell 2850, stick 6 gigs of RAM and two 73-gig SCSI drives in it, pop in a fiber card, and put Linux on it. I set up the RAID, install Linux, set up the SAN volume, configure and install PHP, Apache, SSH, etc. Run all the security updates, add all our users, the normal setup stuff. Now I start hunting for our files. My work is 100% lost; the last check-in I made in the repository was 2 months old. So I'm screwed. I didn't check in frequently because it was a work in progress and I figured it was on the SAN. I was the only one working on it, so why bother checking it in? Big mistake. Our website was another problem. The last check-in was about a week old. We managed to get the other files from developers' desktops, and we had to recreate a few things, but we got the website 99% running as of 10pm tonight.
I have called a meeting to discuss the fate of the admin tomorrow. This is not the first time he has cost me days, if not weeks, of work. Three weeks ago he tried to break a mirror on a new Sun T2000 server (a development server where the developer wanted a copy to roll back to if he screwed up a very tricky install) and hosed the entire box. No data was recoverable, and it had to be formatted and Solaris 10 reinstalled. The server was 1 week old and had just been set up by me. It was not running bare-metal backups, so that was no help. The reason he had to build the last web server we used is because he managed to 'mess' up our old one while trying to update PHP. PHP was compiled on that server in /usr/local. Somehow he compiled it into /usr, but some files also got updated in /usr/local, and it was faster to build a new server than to figure out what the hell he did.
Anyways, I thought you would enjoy the drama.