NVMe drives do go bad, or...

Markfw

Moderator Emeritus, Elite Member
May 16, 2002
My EPYC 7452 box went to la-la land... I could boot the Windows install I had done, but that was it. Then I powered down and rebooted, and Ubuntu now showed up, but I got nothing but I/O errors. The drive is an ADATA NVMe with a tiny heatsink, NOT a big one. So I am in the process of reinstalling Linux, BOINC, etc. I will never put a small HSF on one again, as I am sure that was the problem.
 

crashtech

Lifer
Jan 4, 2013
Time was I could tell if a solid state component was running out of spec by laying a finger on it; try that now and you might get a blister.
 

Hans Gruber

Platinum Member
Dec 23, 2006
I am using an old B350 motherboard. I can tell you that once in a while my PC will hang and not boot with an ADATA SX8100. It's rare, and a reboot fixes the problem. I can't guarantee that it's the NVMe drive, though.
 

Markfw

Moderator Emeritus, Elite Member
May 16, 2002
But this is a server motherboard... a $500 motherboard.
 

Assimilator1

Elite Member
Nov 4, 1999
I don't even know if my main rig's NVMe drive has a heatsink; I can't remember what it is, and I'm hoping my sig will show it, lol. [edit] Yep :)
 

Fardringle

Diamond Member
Oct 23, 2000
I'm not sure of the exact failure rate, but out of probably 800-1000 computers at work that have NVMe drives, we've had to replace at least 4 or 5 dozen of the drives due to partial or complete failure...

So yes, they can and do fail sometimes.
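Taking those rough numbers at face value (they are off-the-cuff estimates, not measured data), 4 to 5 dozen failures out of 800 to 1000 machines puts the failure rate somewhere around 4.8% to 7.5%. A quick Python sketch of that arithmetic; the function name is just illustrative:

```python
# Ballpark bounds on the failure rate from the estimates in the post:
# 4-5 dozen replaced drives out of roughly 800-1000 NVMe machines.
# These inputs are rough guesses, not measured fleet data.

def failure_rate_pct(failed: int, total: int) -> float:
    """Return failed / total as a percentage."""
    return 100.0 * failed / total

# Lowest plausible rate: fewest failures spread over the largest fleet.
low = failure_rate_pct(4 * 12, 1000)
# Highest plausible rate: most failures over the smallest fleet.
high = failure_rate_pct(5 * 12, 800)

print(f"{low:.1f}% to {high:.1f}%")  # prints "4.8% to 7.5%"
```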
 

Fardringle

Diamond Member
Oct 23, 2000
The numbers are just rough estimates off the top of my head, but yes, the failure rate is annoyingly high. Most of the failed drives have been from Lite-On and ADATA, but a surprising number have also been from Intel and Samsung. Not "pro" drives, but still more failures than expected from those companies.
 

Endgame124

Senior member
Feb 11, 2008
Wow. That seems to be a rather high failure rate. :( Only Seagate spinners are worse.

Here is the most current Backblaze reliability report. Seagate doesn't look so bad at the moment.

 

Skillz

Senior member
Feb 14, 2014
I don't remember the exact model, but I bought around ten 1TB Seagate drives back around 2005/2006 (give or take a couple of years; it's been a while), and every single one of those drives failed within 6 months. It was my first time buying Seagate drives and my last time.

Sadly, they sent me replacements of the exact same model, which also failed in a very short time. I didn't even bother trying to RMA the drives after the first batch.
 

LightningZ71

Golden Member
Mar 10, 2017
I had the same experience with 500GB Edge-branded SATA SSDs. We bought about 20 of them for our site and used them to replace spinners in ordinary, well-cooled desktop PCs for generic office work. Out of the 20, 5 failed within one year, 14 had failed by the second year, and an additional two failed by the third year. Never again Edge, never again...
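Reading "14 had failed by the second year" as a running total (it can't be 14 new failures, since 5 + 14 + 2 would be 21 drives out of 20), that batch works out to 16 of 20 dead within three years, i.e. 80%. A small Python sketch of the year-by-year tally under that assumption:

```python
# Year-by-year tally for the 20 Edge 500GB SATA SSDs described above.
# Assumption: "14 had failed by the second year" is a cumulative count,
# because 5 + 14 + 2 = 21 would exceed the 20 drives bought.
fleet = 20
cumulative_failed = {1: 5, 2: 14, 3: 14 + 2}

previous = 0
for year, failed in cumulative_failed.items():
    new_failures = failed - previous  # drives that died during this year
    previous = failed
    print(
        f"year {year}: {new_failures} new, {failed}/{fleet} dead "
        f"({100 * failed / fleet:.0f}%), {fleet - failed} still running"
    )
```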
 

Assimilator1

Elite Member
Nov 4, 1999
Heh, and yeah, that's outrageous.
You should send them a really snotty email and post them all your dead drives with a note saying 'you can bin all this crap!', or similar ;)
 

LightningZ71

Golden Member
Mar 10, 2017
Oh, there were words exchanged. Both the vendor and tech support at Edge got strongly worded letters of dissatisfaction from us. It wasn't worth involving counsel, given the dollar amount at play, but both were removed from our approved sources lists for our MUCH larger national-level buys.
 

Markfw

Moderator Emeritus, Elite Member
May 16, 2002
Update!!!! I think the NVMe slot on the motherboard went bad. I replaced the drive after the machine would not POST today (I accidentally turned the breaker off while working on the panel for the new AC), and it would not even go into the BIOS. Code 92, PCIe devices. I removed the NVMe drive and put a SATA drive in, and BAM, up and running, but I still get code 92 on boot.
 

StefanR5R

Elite Member
Dec 10, 2016
Sounds worrisome; all of the PCIe lanes are directly driven by the 7452 after all.

What board is it?

Half a year ago, I had a failure on a Supermicro H11DSi after the board had been in service for half a year. A small chip that is part of the power delivery to one of the four RDIMM sections on that board, I believe a PWM controller, had gone up in smoke (literally, not just figuratively). When I populated the replacement board with the components of the failed computer, I learned that the two 7452 CPUs had luckily come out of the disaster unscathed, but the RDIMMs in the affected section had been killed in the event.

I wonder how the power delivery to the M.2 slot is implemented. I don't see a typical VRM section in the PCIe/M.2 area, nor is there any VRM section besides CPU VRMs, SoC VRMs, and RAM VRMs listed in the IPMI sensors. But since you get the error even if the slot is not populated, it might not be an issue with power delivery, at least not with peripheral power delivery.
 

mikeymikec

Lifer
May 19, 2011

1TB in 2005/2006?


I was still using 80GB drives in customers' builds in Aug 2006 :)

The flood in Thailand that affected hard drive manufacturers was in 2011. I was using 500GB drives as standard then, and 1TB didn't cost a great deal more. IIRC, a lot of drives had reliability issues around that time, and in 2010 Seagate produced some high-capacity drives (over 1TB, I think) that were notorious for high failure rates too.
 

Markfw

Moderator Emeritus, Elite Member
May 16, 2002
As for the board, it's the EPYCD8-2T; almost all my EPYC setups use this board.

Edit: What I don't get is that the 2080 Ti is also PCIe, and it works fine (F@H). So why would the NVMe slot error out? And why will it not even go into the BIOS when the NVMe slot is populated, yet boots fine from SATA when the drive is removed?
 

Skillz

Senior member
Feb 14, 2014

Well, I did say give or take a couple of years. :) Perhaps it was 2007/2008. I bought them because they were really cheap (retail) and I wanted to fool around with RAID arrays. Then again, they might not have been 1TB drives; come to think of it, they were possibly 160GB drives. I know the drives were super cheap, like ~$50 per drive or something, brand new, retail. I got them all and played around with RAID on them. Luckily I only experimented with the drives, and they weren't being used in anything where data loss would have mattered.

It was the first time I bought a Seagate drive, as I had only bought WD and Hitachi drives, which is still my practice today. With the exception of SSDs, I will only buy WD or Hitachi hard drives.
 

Endgame124

Senior member
Feb 11, 2008
Hitachi bought IBM's Deskstar line… perhaps the line of HDDs with the highest failure rate ever.
 

StefanR5R

Elite Member
Dec 10, 2016
This problem was unique to the Deskstar 75GXP, which was made by IBM, not by HGST. The Deskstar 120GXP and 180GXP, the models that crossed the acquisition, were not affected. The 75GXP and 120GXP had glass platters; the 180GXP went back to aluminum platters. The 75GXP's problem of the coating coming loose was evidently fixed in the 120GXP. AFAIK.
 