Ran into an issue with my home lab over the weekend that has me a bit perplexed. My storage-admin experience is extremely limited, so feel free to point out if I missed something obvious.
Setup: 2x ESXi 5.5 hosts connected to a Solaris/napp-it-based SAN running ZFS, attached via Fibre Channel. One host was powered off at the time of the event.
Situation: ESXi hosts show massive datastore latency (as high as 15 seconds).
Details:
The setup has been running fine for some time now. The UPS randomly powered off for about 20 seconds on Saturday and then powered back on; the devices did not switch over to battery power. The UPS had passed a self-test the night before.
Once the UPS came back up, I powered up the SAN. It appeared to boot normally, and the disk pool and LUNs looked fine.
Powered the host back up. It boots off local disks and came up without issue. Started powering the VMs back up. Some VMs booted but were basically non-responsive; other VMs powered themselves off during the boot process. The vmkernel log shows connections to the VMDKs timing out (log check below). The performance tab showed the massive latency spikes mentioned above. So I began troubleshooting as best I could with my limited storage experience.
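For anyone wanting to follow the same trail, the timeout entries show up in the standard vmkernel log on the host:

# On the ESXi host, via SSH:
tail -f /var/log/vmkernel.log | grep -i "timed out"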
1) A dd benchmark on the SAN itself shows normal throughput (command sketched below), so the hardware appears to be working properly.
2) Manually kicked off a ZFS scrub (commands below), which completed and reported no errors, so ZFS seems to be intact.
3) Recreated one of the VMs from backup on the host's local storage. The VM runs perfectly with no latency, so the host appears to be fine.
4) Powered on the second host, which had been shut down for the summer and therefore should have been unaffected by all this. It sees the same latency issues as the first one, so, again, the hosts don't seem to be the problem.
5) Created a new LUN on the SAN and presented it to the hosts. Downloaded a VM off the old datastore onto the new datastore on the new LUN. No latency; the VM runs normally. So again, the hardware seems fine, and this would seem to indicate the data is intact, at least partially.
6) Attempted to run a VMDK check via SSH (sketched below); it timed out.
7) Unmounted and remounted the datastores (commands below); no change.
8) Used a VM recovery tool to browse the contents of the VMs and proceeded to download files off the file-server VM. Slow going, but the data again all appears to be intact.
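For the curious, the dd test from step 1 was along these lines; the pool name "tank" and the file size are just examples, and keep in mind /dev/zero can give inflated numbers if compression is enabled on the pool:

# On the SAN: write a 4 GB test file, read it back, then clean up (pool name is an example)
dd if=/dev/zero of=/tank/ddtest bs=1024k count=4096
dd if=/tank/ddtest of=/dev/null bs=1024k
rm /tank/ddtest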
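The scrub from step 2 was the standard zpool commands (again, "tank" stands in for the real pool name):

# On the SAN:
zpool scrub tank
zpool status -v tank   # reports scrub progress/result and any errors found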
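The VMDK check from step 6 was vmkfstools' consistency check, roughly as below; the datastore and VM names are placeholders:

# On the ESXi host, via SSH:
vmkfstools -x check /vmfs/volumes/datastore1/somevm/somevm.vmdk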
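And the unmount/remount from step 7 was done through esxcli (datastore label is a placeholder):

# On the ESXi host:
esxcli storage filesystem unmount -l datastore1
esxcli storage filesystem mount -l datastore1
esxcli storage filesystem list   # confirm the volume is mounted again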
So, I'm about out of ideas. In addition to the steps above, I did the standard reboots and port enable/disable cycles on the fiber switch, all of which had no effect. Any ideas? I've got backups, so I can nuke the LUNs and recreate them if that's where I'm at; I just don't like running into a problem I can't explain.