I just saved an admin's job today

AStar617

Diamond Member
Sep 29, 2002
4,983
0
0
Being in tech support does have its occasional benefits.

Long story short, I just finished showing one of our customers how to restore production data that the faint of heart might have considered hopelessly lost forever. The only backup would have been a month old and he would have had MANY questions to answer from his management. No partition table dumps to work with either, I would later learn. Took me about 5hrs from very beginning to end, and we aren't even contractually obligated to say anything other than "the LUN is seen by the server, restore from backup". But I could tell he was not being a demanding, contract-renewal-threatening dick about his situation, so I stuck with it. He's checking the data now, and it all appears to be there.

All is well in the world right now. :) I feel like I've done my good deed for the day. My job is a constant pressure cooker so it's not like I'm rarely in this situation... but instead I'm illustrating that it's not very hard to offer help to those who make it known that they genuinely appreciate it. :thumbsup:
 

Looney

Lifer
Jun 13, 2000
21,938
5
0
Yeah, the worst thing you can do to a CS rep is threaten to take your business elsewhere or to call your lawyer (this ALWAYS gets a LOL from me... I do it just to piss them off, in fact).
 

voodoochylde

Senior member
Feb 19, 2004
305
0
71
Good man. Kudos to your customer service skills and dedication. Wish I could find some new-hires at Kroger with your ethics. All we've got now are lazy bums who cry about not being able to push 3 freaking bascarts...
 

AStar617

Diamond Member
Sep 29, 2002
4,983
0
0
Super-condensed, quick and dirty details:

Sun StorEdge 3510 FC (Fibre Channel, hardware RAID) array is directly attached to a Sun server. It has 4x73gb disks configured as a single LUN in RAID5 with 1x73gb global hot-spare, and a single RAID controller rather than dual redundant controllers. The RAID controller got parity errors and at some point the array bounced. The NVRAM settings were backed up from the controller directly to the LUN (this is supported) and the controller replaced live by one of our field guys (this too is supported) but upon replacement the host could not see the LUN at all.
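
For anyone who wants the RAID5 arithmetic spelled out, here is a rough Python sketch. It assumes textbook single-parity RAID5 (one disk's worth of space goes to parity, and a lost block is the XOR of the surviving blocks); the function names and the toy two-byte blocks are made up for illustration and have nothing to do with the 3510 firmware.

```python
from functools import reduce


def raid5_usable_gb(disk_count: int, disk_size_gb: int) -> int:
    """Usable capacity of a RAID5 set: one disk's worth of space holds parity."""
    if disk_count < 3:
        raise ValueError("RAID5 needs at least 3 disks")
    return (disk_count - 1) * disk_size_gb


def xor_blocks(blocks: list[bytes]) -> bytes:
    """XOR equal-length blocks together, byte by byte."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


# The array described above: 4 x 73GB in RAID5 (the hot-spare is outside the set).
usable = raid5_usable_gb(4, 73)

# Toy stripe: three data blocks plus one parity block (hypothetical contents).
data = [b"\x01\x02", b"\x10\x20", b"\xff\x00"]
parity = xor_blocks(data)

# Pretend the disk holding data[1] is gone; rebuild it from the rest plus parity.
rebuilt = xor_blocks([data[0], data[2], parity])
```

So the 4-disk set gives three disks' worth of usable space, and any single missing block can be recomputed from the others.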

Many things were wrong at this point. The controller unique identifier (CUI) did not match that of the chassis. This is bad because it's like a hostid, and the WWNs, MAC addresses, etc. are all based off of it, so when this messes up, most connectivity is fvcked too (luckily in-band communication right on the FC-AL loop was established thru sccli). But there was evidence in old outputs that BEFORE the problem, they didn't match either... this was confirmed by the customer's info that the box had been bought as a JBOD but then had the RAID controller added afterwards. Went thru lots of hoop-jumping to get that straightened out and allow us direct access to the array to fully restore the settings and confirm that the LUN setup info was correct (because the disks never had a problem, the underlying data simply needs the right pointers to pull it back into the config).

That was only half the fun tho... when the controller settings were fully restored on the 3510, the host could finally see the LUN at the Solaris level... but as the dreaded "drive type unknown" in format's available disk list. The customer did not have a disk label backup, a partition map printout, or even a similarly configured system to refer to. I told him I'd try to help but was not sure the HW RAID manipulation had fully worked. Knowing full well that one wrong cylinder boundary would mean disaster, we went thru the guesswork of assuming how he PROBABLY had his partition table laid out (again, this is just a series of pointers to untouched underlying data), crossed our fingers, and wrote the label. fsck -m showed the FS as passing the sanity check, and he mounted... then breathed out, "my god, it's all there". :D:D
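
To show why one wrong cylinder boundary matters, here is a small Python sketch of the arithmetic. It assumes slices are described by a starting cylinder and a size in cylinders, and that sectors per cylinder is heads times sectors per track. The geometry and slice numbers are invented for the example; they are NOT the customer's real values, and the class and function names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Geometry:
    cylinders: int
    heads: int
    sectors_per_track: int

    @property
    def sectors_per_cylinder(self) -> int:
        return self.heads * self.sectors_per_track


@dataclass
class Slice:
    number: int
    start_cyl: int
    size_cyl: int


def to_sectors(geom: Geometry, s: Slice) -> tuple[int, int]:
    """Return (first sector, sector count) for a slice given in cylinders."""
    spc = geom.sectors_per_cylinder
    return s.start_cyl * spc, s.size_cyl * spc


def check_layout(geom: Geometry, slices: list[Slice]) -> list[str]:
    """Flag slices that overlap or run past the end of the disk.

    Pass only the data slices; the conventional whole-disk slice 2
    overlaps everything on purpose and would trip this check.
    """
    problems = []
    prev_end = 0
    for s in sorted(slices, key=lambda x: x.start_cyl):
        end = s.start_cyl + s.size_cyl
        if end > geom.cylinders:
            problems.append(f"slice {s.number} runs past the last cylinder")
        if s.start_cyl < prev_end:
            problems.append(f"slice {s.number} overlaps the previous slice")
        prev_end = max(prev_end, end)
    return problems


# Hypothetical geometry and a guessed two-slice layout.
geom = Geometry(cylinders=1000, heads=16, sectors_per_track=128)
guess = [Slice(0, 0, 400), Slice(6, 400, 600)]
start, count = to_sectors(geom, guess[1])

# Same guess, off by a single cylinder on slice 6.
bad_guess = [Slice(0, 0, 400), Slice(6, 399, 601)]
shift = to_sectors(geom, guess[1])[0] - to_sectors(geom, bad_guess[1])[0]
```

With this made-up geometry, being off by one cylinder moves the start of the slice by 2048 sectors, so the filesystem would not be found where the label points. A check like this only catches overlaps and overruns; a wrong boundary that happens to be self-consistent still has to be caught by something like the fsck -m sanity check.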

Technically his config is still somewhat incorrect because he has two different sets of WWNs in the same array enclosure (the onboard SES devices, midplane, etc. don't match the controller because nobody ever made it match when they installed it as an upgrade) but it works, and that's all he (or I) cares about.
 

BigJ

Lifer
Nov 18, 2001
21,330
1
81
I've worked in retail and various CS positions, and I absolutely love helping people that genuinely need help and are nice about it. If you're nice to me, I will do everything within my power to try to help you.
 

Kenazo

Lifer
Sep 15, 2000
10,429
1
81
I hope that you talked him into a pretty decent tape backup, and daily backups. We run a 10 tape schedule and have never lost more than a few hours' work.
 

AStar617

Diamond Member
Sep 29, 2002
4,983
0
0
Originally posted by: Kenazo
I hope that you talked him into a pretty decent tape backup, and daily backups. We run a 10 tape schedule and have never lost more than a few hours' work.
My exact words? "[Admin], I will fly to [State] and CHOKE you if you don't back up the FS, save the custom label, and dump the partition table immediately." We both shared a laugh at that one. :laugh:
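
For the "dump the partition table" part, a minimal Python sketch of what that could look like on Solaris. It assumes the stock prtvtoc and fmthard utilities are on the PATH; the device path and file name in the comments are placeholders, not the customer's, and the helper names are made up.

```python
import subprocess
from pathlib import Path


def prtvtoc_command(raw_device: str) -> list[str]:
    """Command that prints the label (VTOC) of a raw disk device."""
    return ["prtvtoc", raw_device]


def fmthard_restore_command(vtoc_file: str, raw_device: str) -> list[str]:
    """Command that writes a previously saved VTOC back onto a device."""
    return ["fmthard", "-s", vtoc_file, raw_device]


def save_vtoc(raw_device: str, out_file: Path) -> None:
    """Run prtvtoc and keep its output somewhere OFF the disk it describes."""
    result = subprocess.run(
        prtvtoc_command(raw_device),
        capture_output=True,
        text=True,
        check=True,  # raise if prtvtoc fails instead of saving an empty file
    )
    out_file.write_text(result.stdout)


# Example call (placeholder device and path, only meaningful on a Solaris host):
# save_vtoc("/dev/rdsk/c1t0d0s2", Path("/var/tmp/c1t0d0.vtoc"))
```

The point is just that the saved output has to live somewhere other than the LUN it describes, otherwise it is gone exactly when it is needed.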
 

Kelvrick

Lifer
Feb 14, 2001
18,422
5
81
Don't get how you saved an admin's job. Is it because the latest backup was from a month ago? Cuz it sounds like a hardware fault. At first, I thought you were talking about an admin assistant and was like, how would they be held responsible?
 

AStar617

Diamond Member
Sep 29, 2002
4,983
0
0
Originally posted by: Kelvrick

Don't get how you saved an admin's job. Is it because the latest backup was from a month ago? Cuz it sounds like a hardware fault. At first, I thought you were talking about an admin assistant and was like, how would they be held responsible?
As their HW support provider, we are responsible for making sure that the array can present a recognizable LUN to the host, and nothing more. As soon as we achieve that, the initial hardware fault (the controller issue) is considered resolved, and the contents of that LUN are NOT our responsibility.

Consider this: I could have told him to restore from backup onto the LUN that was now seen, then "have a nice day"... and not gotten in trouble, per se. However, the data from a month ago would have been unacceptable from his management's perspective, with nobody to blame but themselves (him?). Ask anyone in a datacenter environment what a month's worth of production data lost means.

 

miniMUNCH

Diamond Member
Nov 16, 2000
4,159
0
0
"FUUUUUUUUUUUUUCK!" is what it means.

That guy would have been out on his ass by the end of day. Nightly backups are the norm at any business that is large enough to require an IT professional. Rolling/rotating backups are commonplace.

Hell, our department servers at school (chemical engineering) are backed up every night.
 

AStar617

Diamond Member
Sep 29, 2002
4,983
0
0
Originally posted by: Goosemaster
well done.

You should've had it backed up anyway (for all your customers) and made them pay out ;)
Heh... all fun and games, cashing those big checks, until a Fortune 1000 holds us directly liable for lost funds due to compromised data integrity... it'd be a "payout" alright, just in the wrong direction :p