
NVME drives do go bad or....

Markfw

Moderator Emeritus, Elite Member
My 7452 EPYC box went to la-la land... I could boot the Windows install I did, but that was it. Then I powered down and rebooted, and Ubuntu showed up, but I got nothing but I/O errors. Not a big heatsink, just a tiny one, on an ADATA NVMe. So I am in the process of re-installing Linux, and BOINC, etc... I will never put a small heatsink on one again, as I am sure that was the problem.
 
Time was I could tell if a solid state component was running out of spec by laying a finger on it; try that now and you might get a blister.
 
I am using an old B350 motherboard. I can tell you once in a while my PC will hang and not boot with an ADATA SX8100. It's rare but a reboot fixes the problem. I can't guarantee that it's the NVMe drive.
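One way to tell whether intermittent hangs are really the drive is to check the NVMe SMART health log, e.g. with smartmontools (`smartctl -j -a /dev/nvme0`). A minimal sketch, assuming smartmontools' JSON schema with the `nvme_smart_health_information_log` section; field names may vary by version:

```python
import json

def nvme_health_summary(smartctl_json: str) -> dict:
    """Flag worrying fields in `smartctl -j -a /dev/nvme0` output.

    Assumes smartmontools' JSON field names; adjust if your version differs.
    """
    data = json.loads(smartctl_json)
    log = data.get("nvme_smart_health_information_log", {})
    return {
        "critical_warning": log.get("critical_warning", 0) != 0,
        "media_errors": log.get("media_errors", 0),
        "percentage_used": log.get("percentage_used", 0),
        "temperature_c": log.get("temperature"),
    }

# Fabricated sample report for a healthy drive:
sample = json.dumps({
    "nvme_smart_health_information_log": {
        "critical_warning": 0,
        "media_errors": 0,
        "percentage_used": 3,
        "temperature": 41,
    }
})
print(nvme_health_summary(sample))
```

A nonzero `critical_warning` or a climbing `media_errors` count would point at the drive rather than the board or BIOS.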
 
I can tell you once in a while my PC will hang and not boot with an ADATA SX8100. [...]
But this is a server motherboard...... a $500 motherboard.
 
I don't even know if my main rig's NVMe drive has a heatsink, can't remember what it is and I'm hoping my sig will show it, lol. [edit] Yep 🙂
 
I'm not sure of the exact failure rate, but out of probably 800-1000 computers at work that have NVMe drives, we've had to replace at least 4 or 5 dozen of the drives due to partial or complete failure...

So yes, they can and do fail sometimes.
 
The numbers are just rough estimates off the top of my head, but yes, the failure rate is annoyingly high. Most of the failed drives have been from Lite-On and Adata, but a surprising number have also been from Intel and Samsung. Not "pro" drives, but still more than expected from those companies.
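For a rough sense of scale, those off-the-cuff numbers work out to a mid-single-digit failure rate. A quick bound, using only the estimates from the post (800-1000 machines, "4 or 5 dozen" failed drives):

```python
# Rough bounds from the estimates above: 800-1000 machines,
# "4 or 5 dozen" (48-60) failed drives.
fleet_low, fleet_high = 800, 1000
failures_low, failures_high = 4 * 12, 5 * 12

rate_low = failures_low / fleet_high    # best case: fewest failures, biggest fleet
rate_high = failures_high / fleet_low   # worst case: most failures, smallest fleet

print(f"estimated failure rate: {rate_low:.1%} to {rate_high:.1%}")
# prints "estimated failure rate: 4.8% to 7.5%"
```

Either end of that range is well above the sub-1% annualized rates typically quoted for client SSDs.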
 
Wow. That seems to be a rather high failure rate. 🙁 Only Seagate Spinners are worse.

Here is the most recent Backblaze reliability report. Seagate doesn’t look so bad at the moment.

 
I don't remember the exact model, but I bought around 10 1TB Seagate drives back around 2005/2006 (give or take a couple of years, it's been a while) and every single one of those drives failed within 6 months. It was my first time buying Seagate drives and my last.

Sadly, they sent me replacements, exact models, which also failed in a very short time. I didn't even bother trying to RMA the drives after the first batch.
 
I had the same experience with 500GB EDGE-branded SATA SSDs. We bought about 20 of them for our site and used them to replace spinners in ordinary, well-cooled desktop PCs for generic office work. Out of the 20, 5 failed within one year, 14 had failed by the second year, and an additional two failed by the third year. Never again, Edge, never again...
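Tallied cumulatively, those counts make an ugly curve (numbers taken from the post: 5 failed in year one, 14 total by year two, 16 total by year three, out of 20):

```python
total = 20
cumulative_failed = {1: 5, 2: 14, 3: 16}  # failures by end of each year, per the post

for year, failed in cumulative_failed.items():
    print(f"year {year}: {failed}/{total} failed ({failed / total:.0%})")
# prints:
# year 1: 5/20 failed (25%)
# year 2: 14/20 failed (70%)
# year 3: 16/20 failed (80%)
```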
 
Heh, and yeah, that's outrageous.
You should send them a really snotty email and post them all your dead drives with a note saying 'you can bin all this crap!', or similar 😉
 
Oh, there were words exchanged. Both the vendor and tech support at Edge got strongly worded letters of dissatisfaction from us. It wasn't worth involving counsel, given the dollar amount at play, but both were removed from our approved-sources lists for our MUCH larger national-level buys.
 
Update!!!! I think the NVMe slot on the motherboard went bad. I replaced the drive after it would not POST today (I accidentally turned the breaker off while working on the panel for my new AC), and it would not even go into the BIOS. Code 92, PCIe devices. I removed the NVMe drive and put a SATA drive in, and BAM, up and running, but I still get the code 92 upon boot.
 
Sounds worrisome; all of the PCIe lanes are directly driven by the 7452 after all.

What board is it?

Half a year ago, I had a failure on a Supermicro H11DSi, after the board was in service for half a year. A small chip which is part of the power delivery to one out of four RDIMM sections on that board, I believe it was a PWM controller, had gone up in smoke (literally, not just figuratively). When I populated the replacement board with the components of the failed computer, I learned that the two 7452 CPUs luckily came out of the disaster unscathed, but the RDIMMs within the respective section had been killed in the event.

I wonder how the power delivery to the m.2 slot is implemented. I don't see a typical VRM section in the PCIe/m.2 area, nor is there any VRM section besides CPU VRMs, SoC VRMs, and RAM VRMs listed in the IPMI sensors. But since you get the error even if the slot is not populated, it might not be an issue with power delivery, at least not with peripheral power delivery.
 
I don't remember the exact model, but I bought around 10 1TB Seagate drives back around 2005/2006, and every single one of those drives failed within 6 months. [...]

1TB in 2005/2006?


I was still using 80GB drives in customers' builds in Aug 2006 🙂

The flood in Thailand that affected hard drive manufacturers was in 2011. I was using 500GB drives as standard then, and 1TB didn't cost a great deal more. IIRC a lot of drives had reliability issues around that time. IIRC in 2010, Seagate produced some high capacity drives (over 1TB IIRC) that were notorious for high failure rates too.
 
Sounds worrisome; all of the PCIe lanes are directly driven by the 7452 after all.

What board is it? [...]
It's the EPYCD8-2T; almost all my EPYC setups use this board.

Edit: What I don't get is that the 2080 Ti is PCIe, and it works fine (F@H). So why would the NVMe slot error out? And why will it not even go into the BIOS when the NVMe is populated, but when it's removed, it boots fine to SATA?
 
1TB in 2005/2006?

I was still using 80GB drives in customers' builds in Aug 2006 🙂 [...]

Well, I did say give or take a couple of years. :relaxed: Perhaps it was 2007/2008. I bought them because they were real cheap (retail) and I wanted to fool around with RAID arrays. Then again, they might not have been 1TB drives; possibly 160GB drives, come to think of it. I know the drives were super cheap, like ~$50 per drive. Brand new, retail. I got them all and played around with RAID on them; luckily I only experimented with the drives, and they weren't being used in anything where data loss would have mattered.

It was the first time I had bought a Seagate drive, as I had only ever bought WD and Hitachi drives, a practice I still follow today. With the exception of SSDs, I will only buy WD or Hitachi hard drives.
 
Well, I did say give or take a couple of years. Perhaps it was 2007/2008. [...] With the exception of SSDs, I will only buy WD or Hitachi hard drives.
Hitachi bought the IBM Deskstar line… perhaps the highest-failure-rate line of HDDs ever.
 
Hitachi bought the IBM Deskstar line… perhaps the highest-failure-rate line of HDDs ever.
That problem was unique to the Deskstar 75GXP, which was made by IBM, not by HGST. The Deskstar 120GXP and 180GXP, the models which crossed the acquisition, were not affected. The 75GXP and 120GXP had glass platters; the 180GXP was back to aluminum platters. The 75GXP's problem of the coating coming loose was evidently fixed in the 120GXP. AFAIK.
 