I have a server that dies at random, maybe once or twice a year. When I get console access, the screen is flooded with "pciehp failed to check link status" messages. (They include a bunch of numbers that I could not copy and paste, since I was on a raw console.)
I also get tons of "card not present" errors in the dmesg log. Rebooting solves it.
Is there any way to troubleshoot this? When it happens, the only way to reach the server is the console, but the spam makes it practically unusable. It looks as if the NIC just drops out for no reason. What makes this even harder is that it only happens a few times a year. Any clues as to where I should start looking? Any specific logs?
OS is CentOS 6.9.