Well, if the LUN hostnames can't be resolved, or the NFS server can't resolve the name of the VM server (to determine whether it should have access), then the LUNs drop. Then everything else drops. My setup is rather basic: a single VM server, a single storage server, and LUNs set up via NFS. There are a couple of physical servers too for other stuff.
Though I realized something interesting: I falsely thought my environmental server (which also does backup DNS) went down, since when I started investigating I could not SSH into it and my alarm display web page was showing a timeout. It turns out that when there are DNS issues, SSH does not work well (no idea why; does it try to resolve the client IP on login and then fail?). I keep an alarm display up on a Raspberry Pi, and since the browser on the RPi closes on its own after a while, I use Firefox via an X11 session on the environmental server. If that server DID go down I would have lost that X11 session. I never realized this till now while recollecting the situation; I was in pure panic and not really thinking straight. So that server probably never actually went down and I rebooted it for nothing. The primary DNS would have been down at this point since the LUNs were down. (I did not know that at the time, since I could not log in to the ESX server due to DNS being down.)
I just checked the uptime of my storage server, and it turns out it DID drop. That would have brought down all the VMs, including primary DNS. I originally thought it hadn't, because I would not expect that box to come back up cleanly on its own after losing power, but it did. My file shares on my workstation were actually working. Honestly, now that I know it did go down, I'm surprised I did not sustain more damage.
So what I figure happened:
- AC power failed; it was a very dirty drop, not a clean cut
- UPS failed to trip fast enough
- File server dropped, other servers not affected (as far as I know) - I guess the storage server PSUs are more sensitive to dips in the AC cycle.
- All VMs dropped/stopped responding, including primary DNS
- Even though the file server came back up, the VM server could not resolve the LUN hostnames so it could not reconnect
- The storage server also could not resolve any other hostnames, so even if the VM server had resolved the LUNs and tried to connect, the storage server could not grant access because it could not resolve the hostnames in its allow list
The problem still remains though that my DNS is not failing over properly. It should not have taken this much effort to get the LUNs to show up again.
Also, connecting to anything was very hit and miss, as I was getting a lot of DNS resolution failures. SSH also seems to get really flaky if DNS is down; it takes like 5-10 minutes to actually connect to anything. Even if I get a successful name resolution for the server I'm connecting to, it still takes a long time for the password prompt to come up.
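One possible explanation for the slow password prompt, which I have not confirmed on these boxes: sshd can do a reverse lookup on the connecting client's address (the `UseDNS` option in `sshd_config`), and with DNS down that lookup has to time out before the prompt appears. A rough Python sketch to see how long forward and reverse lookups take from a given machine is below. The helper names `timed` and `check_host` are just made up for this example.

```python
import socket
import time

def timed(fn, *args):
    """Call fn(*args); return (result or None on lookup failure, seconds taken)."""
    start = time.monotonic()
    try:
        result = fn(*args)
    except OSError:  # socket.gaierror and socket.herror are both OSError subclasses
        result = None
    return result, time.monotonic() - start

def check_host(name):
    """Time a forward lookup of name, then a reverse lookup of the address found."""
    addr, fwd_secs = timed(socket.gethostbyname, name)
    rev, rev_secs = (None, 0.0)
    if addr is not None:
        rev, rev_secs = timed(socket.gethostbyaddr, addr)
    return {
        "name": name,
        "address": addr,
        "forward_seconds": fwd_secs,
        # gethostbyaddr returns (hostname, aliases, addresses); keep the hostname
        "reverse_name": rev[0] if rev else None,
        "reverse_seconds": rev_secs,
    }
```

Running `check_host("some-server-name")` from the client and from the server side while the primary DNS is off should show whether the lookups themselves are eating the time. If the reverse lookup is the slow part, that would point at the sshd side rather than the client.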
I think my easiest bet might be to just go to a single physical DNS server and then have a failover box that is kept offline but has the same IP. I don't get it though: I should be able to just have two DNS servers and put both IPs in each client, but for some reason it just does not fail over properly. The clients keep trying the primary even if it's down, and then that hangs up everything.
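For what it's worth, on Linux clients the stock resolver tries the nameservers in the order listed and, by default, waits 5 seconds per attempt before moving on (the `timeout`, `attempts` and `rotate` options in `resolv.conf` control this), so a dead primary adds a delay to every single lookup. That may be what "not failing over" looks like here, but it is a guess. To check each DNS server on its own, bypassing the system resolver, something like this Python sketch should work. It hand-builds a plain A-record query and only checks whether the server replies at all; the function names and the example IPs are placeholders, not my real config.

```python
import socket
import struct

def build_query(name, txid=0x1234):
    """Build a minimal DNS A-record query packet (RFC 1035 wire format)."""
    # Header: ID, flags (0x0100 = recursion desired), 1 question, 0 other records
    header = struct.pack(">HHHHHH", txid, 0x0100, 1, 0, 0, 0)
    # Name is encoded as length-prefixed labels ending with a zero byte
    qname = b"".join(
        bytes([len(label)]) + label.encode("ascii")
        for label in name.rstrip(".").split(".")
    ) + b"\x00"
    return header + qname + struct.pack(">HH", 1, 1)  # QTYPE=A, QCLASS=IN

def server_answers(server_ip, name, timeout=1.0):
    """Return True if server_ip sends back a reply with a matching ID in time."""
    query = build_query(name)
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.settimeout(timeout)
        try:
            sock.sendto(query, (server_ip, 53))
            reply, _ = sock.recvfrom(512)
        except OSError:  # covers timeouts and unreachable hosts
            return False
    return reply[:2] == query[:2]

def first_live_server(servers, name, timeout=1.0):
    """Return the first server in the list that answers, or None if none do."""
    for ip in servers:
        if server_answers(ip, name, timeout):
            return ip
    return None
```

Calling `first_live_server(["192.168.1.10", "192.168.1.11"], "storage.example.lan")` with the primary shut down would show whether the backup actually answers within a second. If it does, the problem is more likely the client-side timeout and retry settings than the backup DNS itself.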