
Do you like the drama?

sourceninja

Diamond Member
I have a solaris/sun story from hell. I should first point out that I did not set up these servers; that was done by the admin I am referencing in this post.

Sunday night at about 11pm our web server goes down. I try to ssh in and the server is not responding, but it pings. I drive in expecting a small, easy-to-resolve issue. The kvm is corrupt and just printing out things like "$%$%@^&^@#$@!$$^$%&@^%$@#!!!!DF" and nothing human readable. The serial console will not respond. So I decide to just reboot the server. After POST it just sits there and will not boot. I break out and try to boot it manually and it will not boot. I try to boot the mirror and it will not boot. So I decide this is a hardware issue and that is not my job, so I place the appropriate call and go home for the night.

I come in this morning and the server is being worked on. It is an old server (sunfire 280r). Our admin pulled the drives from the server and was trying to get them to boot in another 280r. It will not boot. He then tries to put the old disks back into the server he was testing on, and it will not boot back up now either. So he calls our hardware support vendor. This is starting to look bad. I've decided we need a backup plan and I start looking for a replacement server to get going. Our important files are all in a folder called /var/web which is supposed to be mounted on our SAN. Should be a simple matter of setting up apache/php on another server, mounting that SAN volume, changing the DNS and we are back in business. Or so I thought....

Fast forward to 11am. Support vendor arrives, looks at disks. Initially he thinks the motherboard has died. Then he hears about the second server. Now he asks our admin if he fully unplugged the servers before removing the drives. Admin says no. Vendor tells us that that can hose the boot sectors on the drives or, worse, cause full data loss. Not what I want to hear. He then determines that the first disk had a motor failure and that is why the first server went down. I ask why the mirror would not boot and he tells me that disk is reading blank!!! So we have no mirror. Worse than that, he is unable to recover the second server's drives and it must be reinstalled. Luckily for us that server is not production critical (it runs some minor backups). I add the files needed to a quick rsync to a box that is still getting backed up. At this point our admin breaks the news to me. He never got around to setting up the SAN volume on that production critical web server. This now means we have total loss of data. At this point I decide to go to backups and I am informed he never set up the veritas backup software either! This server has been in production for 5 months now. I have 6 months of work whose only copy was on this server. On top of that our entire public website is on this server. I am pissed! I force myself to go to lunch.

I come back from lunch with a mission. We do not have any free sun boxes, so I find a Dell 2850, stick 6 gigs of ram and 2 73gig scsi drives in it. Pop in a fiber card and put linux on it. I set up the raid, install linux, set up the SAN volume, configure and install php, apache, ssh, etc. Run all the security updates, add all our users, the normal setup stuff. Now I start hunting for our files. My work is 100% lost; the last checkin I made in the repository was 2 months old. So I'm screwed. I didn't check in frequently because it was a work in progress and I figured it was on the SAN. I was the only one working on it, so why bother checking it in. Big mistake. Our website was another problem. The last checkin was about a week old. We managed to get the other files from developers' desktops and we had to recreate a few things, but we got the website 99% running as of 10pm tonight.

I have called a meeting to discuss the fate of the admin tomorrow. This is not the first time he has cost me days if not weeks of work. 3 weeks ago he tried to break a mirror on a new sun T2000 server (development server where the developer wanted a copy to roll back to if he screwed up a very tricky install) and hosed the entire box. No data was recoverable and it had to be formatted and solaris 10 reinstalled. The server was 1 week old and had just been set up by me. It was not running bare metal backups so that was no help. The reason he had to build the last webserver we used is because he managed to 'mess' up our old one while trying to update php. Php was compiled on that server in /usr/local. Somehow he compiled it into /usr, but some files also got updated in /usr/local and it was faster to build a new server than to figure out what the hell he did.

Anyways, I thought you would enjoy the drama.
 
I'm officially a software developer. I am also responsible for any issues keeping our website running software/process wise. We are a small shop with 2 system/network admins, 3 techs, and 4 developers, so we do a lot of crossover work.
 
That was the short version, only way to make it shorter is to say this:

System admin is an idiot, destroyed our server. I yet again am forced to save the day.
 
A lot of scraping last-edit files from developers' desktops. Thank god most of us use editors that create backup files when you edit a file. So we were close and it was just some minor changes. Plus a lot of our source code was in our repository. The repository is not 100% current, but we were able to check in a few things and get about 99% back. We are still finding some missing things (our site is very large) and are fixing them on the fly. I have an intern going link by link through our website now to look for broken links or errors (I still have php reporting errors on the page until we have checked everything).
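
For anyone doing the same kind of scrape, here is a minimal sketch of hunting down editor backup files on a desktop and listing them newest first. The suffixes (`~`, `.bak`, `.swp`) and the function name `find_backup_files` are assumptions for illustration, not what was actually used here; adjust them for whatever editors your developers run.

```python
import os
from pathlib import Path

# Assumed backup suffixes; these are illustrative, not taken from the thread.
BACKUP_SUFFIXES = ("~", ".bak", ".swp")


def find_backup_files(root):
    """Return backup-looking files under root, most recently modified first."""
    hits = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            # str.endswith accepts a tuple of candidate suffixes.
            if name.endswith(BACKUP_SUFFIXES):
                path = Path(dirpath) / name
                hits.append((path.stat().st_mtime, path))
    # Sort on modification time only, newest first.
    hits.sort(key=lambda item: item[0], reverse=True)
    return [path for _mtime, path in hits]
```

Sorting by modification time puts the most recent edits at the top, which is usually where the work missing from the repository lives.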

Our meeting about the admin in question was interesting. He was not in the office today (he was sent to cisco training). When he gets back next week there is going to be a joint meeting between me, my boss (Director and CIO) and our president (my boss's boss) about the situation. On a side note, I have been tasked with verifying all backups are taking place on every server and doing a general audit of the network/servers/security. I'm also getting sent to get solaris 10 certified.
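
A backup audit like the one described can start from something very small. The sketch below assumes you can collect, per server, the time of its last successful backup (or `None` if there has never been one); the function name `stale_backups`, the one-day threshold, and the hostnames in the usage note are all hypothetical.

```python
from datetime import datetime, timedelta


def stale_backups(last_backup, now, max_age=timedelta(days=1)):
    """Return sorted hostnames whose last backup is missing or older than max_age.

    last_backup maps hostname -> datetime of the last good backup, or None
    if no backup has ever been recorded for that host.
    """
    stale = []
    for host, when in last_backup.items():
        if when is None or now - when > max_age:
            stale.append(host)
    return sorted(stale)
```

Called with a made-up inventory such as `{"web1": None, "db1": datetime(2024, 1, 2, 3, 0)}` and `now=datetime(2024, 1, 2, 12, 0)`, it would flag only `web1`. Treating "never backed up" as stale is the point: that is exactly the case that went unnoticed for months here.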
 
So hopefully you have learned to check stuff in any time you make a change now, and the admin has learned to start looking for a new job.
 
Most of our sun servers are legacy and left over from a decision made years ago. The admin at the time did not want to risk using linux and opted for sun for everything. DNS, apache, you name it. These servers, even as old as they are, are still very good servers and thus need to be used (although management now believes me that the old servers should not be for production critical systems). After my arrival that admin left and I have been pushing the direction of our technology toward open source. As of the disaster we are now using sun for 3 things.
1) Our SunGard applications
2) Our Oracle DB
3) Our Oracle applications server

The only reason we are doing this is because SunGard does not support their software on anything but solaris, aix, windows via cygwin, and redhat 3. They are also our primary support for oracle. We have been told they are going to support a more recent version of redhat, or suse, in the near future. At that point, when we look to upgrade those servers, we will probably look to move off sun entirely. It is simply too costly for support and hardware. It is also more difficult to use when 99% of the office is already familiar with linux and gnu tools. It seems like I'm always running into an "Oh, sun does that differently" situation.

I've grown to have a deep love of Dell servers and linux. In fact, tomorrow I get to setup a maxed out Dell 1950 for use as our moodle server.

And yes, I learned my lesson about cowboy coding. I will be checking in everything and on top of that doing my checkouts on my notebook and not directly on the server. My early looks into our backup situation are NOT promising at all. I can't even find a person who remembers who might have rotated the tapes, or taken them to our second campus for off-site storage. It looks like I will probably be moving away from programming for a good long while.

I guess it is good that my job history has dealt with most of these job roles. I've been a network admin, a system admin, a web developer, a software developer, etc. I'm also glad that our other admin is not an idiot. The only reason he does not know about these issues is because he strictly deals with novell. And that part of our world is basically flawless.
 
Originally posted by: sourceninja
That was the short version, only way to make it shorter is to say this:

System admin is an idiot, destroyed our server. I yet again am forced to save the day.

Ah, now that's drama I can appreciate. 😀
 
Update: it looks like this admin is going to get away free of any punishment! Current talks have me becoming a new server admin and him being moved to deal only with the network.....What the hell is going on?
 
Sadly, I've seen admins survive more retarded fuckups with their jobs intact.

Also, business is noticing that you are a pretty competent software developer and admin. Bonus! Sorry 🙂.
 
Originally posted by: skace
Sadly, I've seen admins survive more retarded fuckups with their jobs intact.

Also, business is noticing that you are a pretty competent software developer and admin. Bonus! Sorry 🙂.

No doubt.
 
Ok one last bitch. On the todo list for this admin was to prepare two servers for use by one of our developers for a project that has a go-live of end of september. It has been well over 3 weeks and he has 'been working on it'. Today I set up and deployed both servers in the afternoon. It was such a simple task: just install linux, perform all our default security tasks, give it a static ip/dns entry, and deliver.

I don't understand why they can't just give me his salary in addition to my own and send him packing.
 