Server randomly goes offline with PCIe link status errors

Red Squirrel

No Lifer
May 24, 2003
70,157
13,566
126
www.anyf.ca
I have a server that randomly dies for no apparent reason, maybe once or twice a year. When I get console access, all I see is spam saying "pciehp failed to check link status". (There's a bunch of numbers and other details too, but I couldn't copy and paste them since I was on a raw console.)

I also get tons of "card not present" errors in the dmesg log. Rebooting solves it.

Is there any way I can troubleshoot this? When it happens, the only way to access the server is the console, but because of the spam it's not really usable. It's as if the NIC just drops out for no reason. What makes this even harder is that it only happens a few times a year. Any clues as to where I can start checking? Any specific logs?

OS is CentOS 6.9.
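
Since the console is unusable once the spam starts, one place to start is counting these messages in a saved log after the reboot. Below is a rough sketch in Python, not a definitive tool. It assumes the kernel messages were persisted to a text file (for example /var/log/messages, which is an assumption about the syslog setup), and it matches only the two phrases quoted above, case-insensitively. The function names `summarize` and `summarize_file` are made up for illustration, and the PCI address pattern is a guess at what the "bunch of numbers" might contain.

```python
import re
from collections import Counter

# Phrases taken from the messages quoted in the post, matched case-insensitively.
PATTERNS = {
    "link_status": re.compile(r"failed to check link status", re.IGNORECASE),
    "card_not_present": re.compile(r"card not present", re.IGNORECASE),
}

# Assumption: the line may carry a PCI address of the form dddd:bb:dd.f
# (e.g. the kind shown by lspci -D). If it does not, devices stays empty.
PCI_ADDR = re.compile(
    r"\b[0-9a-f]{4}:[0-9a-f]{2}:[0-9a-f]{2}\.[0-7]\b", re.IGNORECASE
)


def summarize(lines):
    """Count matching messages and the PCI addresses they mention."""
    counts = Counter()     # how often each phrase appears
    devices = Counter()    # how often each PCI address appears in matches
    first_seen = {}        # first full line seen for each phrase
    for line in lines:
        for name, pattern in PATTERNS.items():
            if pattern.search(line):
                counts[name] += 1
                first_seen.setdefault(name, line.rstrip("\n"))
                match = PCI_ADDR.search(line)
                if match:
                    devices[match.group(0)] += 1
    return counts, devices, first_seen


def summarize_file(path):
    """Run summarize() over a log file, tolerating odd bytes in the log."""
    with open(path, errors="replace") as handle:
        return summarize(handle)
```

Calling `summarize_file("/var/log/messages")` after a reboot would show whether the errors were logged before the box became unreachable, and the first matching line gives a timestamp and possibly the device address to compare against the onboard NIC.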
 

XavierMace

Diamond Member
Apr 20, 2013
4,307
450
126
Onboard NIC or add-on? I had a motherboard where one of the onboard ports had this issue. I just disabled the port and called it good.
 

Red Squirrel

Onboard. It's a Supermicro board. I don't recall the exact model; I'd have to check. It's an Atom barebones where you just add your own RAM and HDD. What would cause it to go bad so randomly, though?