Bad blocks: Keep or RMA? And some useful info on SeaTools for DOS

zir_blazer

Golden Member
Jun 6, 2013
1,219
508
136
The HD in question is a Seagate Desktop HDD.15 ST4000DM000, which I got in early December 2013, here. It has been powered on nearly 24/7 since I installed it, and back then it passed the SeaTools for DOS long test with no issues.

A few days ago I tried to copy the IMG file containing my everyday WXP SP3 VM in Arch Linux, and it consistently reported an I/O error around the 1 GB mark. Copying other big files succeeded, yet that one failed several times, even after a restart. This prompted me to run fsck.ext4, which detected at least one bad block affecting the IMG. I was hoping it was file system corruption due to some improper shutdowns, but no, it had to be the HD...
I also installed the smartmontools package, and it showed 22 SMART errors, of which only the last 5 are logged in detail. Those five report the same LBA at pretty much the same time, only seconds apart - I suppose all 22 could have been caused by my repeated attempts to copy the IMG file.
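A rough way to pin down where in a file the reads start failing, as a minimal Python sketch. It assumes Linux, a regular file, and that the kernel surfaces an unreadable sector as an OSError on read (typically EIO). The name find_unreadable_ranges and the 1 MiB chunk size are made up for illustration; this is not part of fsck.ext4 or smartmontools.

```python
import os


def find_unreadable_ranges(path, chunk_size=1024 * 1024):
    """Read a regular file chunk by chunk and return the (offset, length)
    ranges whose read raised an OSError."""
    bad = []
    size = os.path.getsize(path)
    fd = os.open(path, os.O_RDONLY)
    try:
        offset = 0
        while offset < size:
            length = min(chunk_size, size - offset)
            try:
                # pread reads at an explicit offset, so one failed chunk
                # does not stop us from trying the following ones.
                os.pread(fd, length, offset)
            except OSError:
                # Assumption: a bad sector shows up here, usually as EIO.
                bad.append((offset, length))
            offset += length
    finally:
        os.close(fd)
    return bad
```

Calling it on the affected image and printing the result would give byte ranges to compare against the ~1 GB mark where the copy failed. Expect it to be slow around a bad spot, since the drive retries before giving up.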

This finding made me worry that some unusual behavior I had been experiencing for months was related to data corruption inside the VM, which Windows XP itself may be unable to detect. I download files with Chrome in the VM, and installer EXEs and ZIP files often turned out corrupt: installers reported errors, while the Windows shell and WinRAR refused to open the ZIPs.
After re-downloading them they usually worked, but then I noticed that the corrupted versions were 1 KB smaller than the working ones. At first I thought this was caused by Chrome not finishing downloads properly - anyone old-school enough should remember that Internet Explorer sometimes reported a file size when downloading, then abruptly finished early and reported success, leaving you with an incomplete file. Now I'm inclined to think that it was caused by the bad block: if I deleted the corrupted file and immediately re-downloaded it, it was corrupt again, while if I downloaded it under another name without touching the existing copy (and the space it occupied), it simply worked. So it seems that the bad block was a sort of undetected black hole.
However, due to the lack of information about how bad blocks affect the inside of a VM (and whether they get detected there), I don't know if my theory is true. Also, assuming that data corruption did happen, I don't know what else could have been sitting on that bad block and got corrupted in the process, so most of the data in the VM would need a check - or better, a fresh installation just to be sure.
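To check a theory like this, a corrupt download can be compared byte by byte against a known good copy. A minimal Python sketch, assuming both copies are still on disk; compare_files and its return format are invented for this example.

```python
import os


def compare_files(good_path, suspect_path, chunk_size=64 * 1024):
    """Return (size_difference_in_bytes, first_differing_offset).

    The offset is None when both files are byte-for-byte identical."""
    size_diff = os.path.getsize(good_path) - os.path.getsize(suspect_path)
    offset = 0
    with open(good_path, "rb") as good, open(suspect_path, "rb") as suspect:
        while True:
            a = good.read(chunk_size)
            b = suspect.read(chunk_size)
            if a != b:
                # Find the exact byte inside the mismatching chunk.
                for i, (x, y) in enumerate(zip(a, b)):
                    if x != y:
                        return size_diff, offset + i
                # One chunk is a prefix of the other, so the files
                # diverge where the shorter one ends.
                return size_diff, offset + min(len(a), len(b))
            if not a:
                return size_diff, None
            offset += len(a)
```

A size difference of 1024 with the first difference at the very end would point to a truncated download, while a difference in the middle of the file would fit the bad block theory better.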


Anyways. After fsck.ext4 in Linux, I decided to test with SeaTools for DOS, which I had already used to make sure the HD was intact when I purchased it. I prepared a USB pendrive with XBoot, but couldn't get SeaTools working, neither the latest version nor the older ones. The error was "invalid opcode", usually followed by some hex numbers. After googling a bit I found that this error is very common, but no one knows what causes it, and most people seem to just accept that their systems are somehow incompatible with it - which can't be the case here, because I already ran it once.
Besides "invalid opcode", it sometimes also said "can't open" in a green font. I googled that instead and found a single post from a guy who had that issue while running a dual monitor setup - which I have. After lots of testing, it seems that SeaTools for DOS does NOT like my DVI monitor, because with the VGA one it simply worked. This info may be useful for anyone who has issues with the tool.
Anyways, after 7 hours on the long test, no bad blocks were found and the HD passed - it didn't even report SMART errors. Supposedly, at the end of the test SeaTools asks you what to do with any bad blocks it found, but this did NOT happen. After running the SMART tools again, I noticed that I now have 2 bad blocks, which should have already been remapped before I ran SeaTools. This annoys me because it seems that these bad blocks aren't enough to RMA the HD, as Seagate wants errors recorded by SeaTools as proof that it needs a replacement.

I was already intending to do a fresh start with this HD and repartition it for my storage purposes explained here, so I was going to zero-fill it anyway, but now I have doubts about its physical condition. As the long test took the same 7 hours or so that it took the first time, I suppose the HD may be mechanically fine. But the spontaneous appearance of bad blocks that actually manage to cause issues by corrupting blocks holding data makes me wonder how likely it is that more will appear over time. (At least one bad block was involved in the VM IMG file corruption; I'm not sure where the other one is located. But I don't think it's random that, with less than 200 GB used on a 4 TB HD, it fell in the most used part.)
I'm aware that a bad block every now and then may be normal as the HD gets older, and that it has spare blocks to replace the bad ones, which shouldn't be that bad if the firmware catches them in time while the blocks are still readable. But there are also bad block swarms, which may mean a platter that is deteriorating very fast. So I need to know what sort of risk my data will be taking.
Also, the effective HD warranty is just 6 months (and I'm sitting at the 4 month mark): there is no Seagate representative in Argentina, and if I wanted the full 2 year warranty I would have to ship it internationally to Seagate with all the cost involved, so the warranty that matters is the one offered by the local vendor. So I'm torn between zero-filling and starting from scratch with this HD, or RMAing it (though I don't think they will take it for 2 bad blocks on an otherwise fully functional HD) and hoping that the replacement is a brand new, non-refurbished unit.


This is the relevant SMART data, acquired via smartctl --all /dev/sda


SMART Attributes Data Structure revision number: 10
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000f   113   090   006    Pre-fail  Always       -       57921512
  3 Spin_Up_Time            0x0003   091   091   000    Pre-fail  Always       -       0
  4 Start_Stop_Count        0x0032   100   100   020    Old_age   Always       -       98
  5 Reallocated_Sector_Ct   0x0033   100   100   010    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x000f   063   052   030    Pre-fail  Always       -       81642038890
  9 Power_On_Hours          0x0032   097   097   000    Old_age   Always       -       2778
 10 Spin_Retry_Count        0x0013   100   100   097    Pre-fail  Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   020    Old_age   Always       -       98
183 Runtime_Bad_Block       0x0032   098   098   000    Old_age   Always       -       2
184 End-to-End_Error        0x0032   100   100   099    Old_age   Always       -       0
187 Reported_Uncorrect      0x0032   078   078   000    Old_age   Always       -       22
188 Command_Timeout         0x0032   100   100   000    Old_age   Always       -       0 0 0
189 High_Fly_Writes         0x003a   100   100   000    Old_age   Always       -       0
190 Airflow_Temperature_Cel 0x0022   062   052   045    Old_age   Always       -       38 (Min/Max 30/41)
191 G-Sense_Error_Rate      0x0032   100   100   000    Old_age   Always       -       0
192 Power-Off_Retract_Count 0x0032   100   100   000    Old_age   Always       -       69
193 Load_Cycle_Count        0x0032   100   100   000    Old_age   Always       -       1690
194 Temperature_Celsius     0x0022   038   048   000    Old_age   Always       -       38 (0 24 0 0 0)
197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0010   100   100   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x003e   200   200   000    Old_age   Always       -       0
240 Head_Flying_Hours       0x0000   100   253   000    Old_age   Offline      -       2748h+38m+17.608s
241 Total_LBAs_Written      0x0000   100   253   000    Old_age   Offline      -       2013183627
242 Total_LBAs_Read         0x0000   100   253   000    Old_age   Offline      -       3069367568

SMART Error Log Version: 1
ATA Error Count: 22 (device log contains only the most recent five errors)


There is a lot more info (the errors themselves), but as I said they all refer to the same LBA, so I consider them redundant. So far the drive looks quite good except for the bad blocks.
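For keeping an eye on these counters over time, the attribute table can be parsed with a few lines of Python. This is only a sketch: it assumes the ten-column layout shown above, and the set of "watched" attribute IDs (5, 183, 187, 197, 198) is my own pick of counters commonly associated with bad sectors, not an official Seagate or smartmontools list. flag_attributes is a hypothetical helper name.

```python
# Assumption: these IDs are the ones worth watching for bad-sector trouble.
WATCHED_IDS = {5, 183, 187, 197, 198}


def flag_attributes(smartctl_output, watched=WATCHED_IDS):
    """Return {attribute_name: raw_value} for watched attributes whose
    raw value is greater than zero."""
    flagged = {}
    in_table = False
    for line in smartctl_output.splitlines():
        if line.lstrip().startswith("ID#"):
            in_table = True  # header row of the attribute table
            continue
        if not in_table:
            continue
        # Split into at most 10 fields so a raw value that contains
        # spaces, like "38 (Min/Max 30/41)", stays in the last field.
        fields = line.split(None, 9)
        if len(fields) < 10 or not fields[0].isdigit():
            break  # a blank or unrelated line ends the table
        if int(fields[0]) not in watched:
            continue
        raw = fields[9].split()[0]
        if raw.isdigit() and int(raw) > 0:
            flagged[fields[1]] = int(raw)
    return flagged
```

Fed the table above, it would report Runtime_Bad_Block and Reported_Uncorrect, which are the two counters that stand out here. Running it periodically and comparing the numbers would show whether they keep growing.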


Also, the computer case sometimes vibrates (if I put a finger on top of it and apply pressure, it usually stops for a while), which as far as I know may also be a cause of bad blocks, because HDs are sensitive to vibration. So whether I keep this one or go for the RMA, I will have to check the desktop stability, the case (a generic one from 2006), or an alternative like suspending the HD like this.

For anyone who read from start to end, any insights?
 

Elixer

Lifer
May 7, 2002
10,371
762
126
Corruption isn't good, and it sounds like you have a bad HD. I assume you have already done a memtest86+ scan to check the memory, and you aren't o/cing.
IMO (if it is the HD), just RMA it, and hopefully what you get back will actually work.
Do an advanced RMA if they offer it in Argentina, that way, they will send you the HD first, and you use the same box to send the old one back.
 

Buk

Senior member
Oct 9, 1999
558
0
76
Probably not directly relevant but I'll share a recent experience with a couple of hard drives.

First: I had a Hitachi 2 TB HD begin to act strangely, taking a long time to read and write some files (this was a storage HD). I removed it from the computer, placed it in a sealed antistatic bag with cables attached, and put it in my deep freeze. Then I set up a computer with the new HD in it beside the freezer, attached the cables from the one in the freezer and proceeded to copy files to the new HD. Most files copied fine, a lot were very slow, and a couple wouldn't copy at all. After all was done I ran Hitachi's DFT on it and it passed with no errors! Then I ran HD Tune and saw that, as the drive was benchmarked, instead of a graph starting high and slowing as the test finished, there was a marked dip maybe 10% into the test where the transfer rate dropped dramatically, then rose again to rejoin the normal and expected curve. Interesting! Another test with DFT resulted in an error, so the HD was sent back for replacement.

Second: My wife's computer was very slow to boot and normal operations (email & facebook) were slow as well. Memory tested fine, as did the HD. Remembering my experience with the 2 TB Hitachi, I ran HD Tune and the benchmark showed a similar dip early in the test. I cloned it to another HD that tested as expected with HD Tune, and that restored the pep that her desktop hadn't had in a while.

I have had another computer regain its youthful vigor after doing the same thing again. Maybe I'm reading something into the HD Tune benchmarking test that isn't there but so far it has worked as above three out of three. Perhaps the hd controller gets tired and loses its memory for some areas of the disk.
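The kind of dip described above can in principle be picked out of a list of throughput samples automatically. A small Python sketch, assuming the samples are a plain list of MB/s values taken in order across the disk; find_dips, the window of 5 and the 0.6 threshold are arbitrary choices for illustration, not anything HD Tune does.

```python
from statistics import median


def find_dips(speeds, window=5, threshold=0.6):
    """Return the indices of samples that fall below `threshold` times
    the median of their neighbours (up to `window` on each side)."""
    dips = []
    for i, speed in enumerate(speeds):
        lo = max(0, i - window)
        hi = min(len(speeds), i + window + 1)
        # Neighbours on both sides, excluding the sample itself.
        neighbours = speeds[lo:i] + speeds[i + 1:hi]
        if neighbours and speed < threshold * median(neighbours):
            dips.append(i)
    return dips
```

Comparing against nearby samples rather than a fixed number matters because the normal curve slopes down from the outer to the inner tracks, so a healthy drive is also slower at the end than at the start.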

Run HD Tune on your errant disk and see what you get........
 

zir_blazer

Golden Member
Jun 6, 2013
1,219
508
136
Elixer said: "Corruption isn't good, and it sounds like you have a bad HD. I assume you have already done a memtest86+ scan to check the memory, and you aren't o/cing."
No overclocking is possible, as I'm on a Haswell Xeon and a Supermicro motherboard - I can't make anything run out of spec. I haven't run any recent RAM test, but the symptoms don't point to RAM anyway: the system has been rock solid for weeks of continuous uptime. If the RAM were bad, I would have random crashes or issues, and probably no consistent SMART errors at all.


Buk said: "Run HD Tune on your errant disk and see what you get........"
I can't do that currently, as this machine runs Linux with Windows in a VM, so Windows won't be able to see the HD directly. And as far as I could find by googling, there isn't a Linux application equivalent to HD Tune (smartmontools, which I used, is the next closest thing, and there is also hdparm, which is usually used to disable APM - something I didn't do). So I'm missing the cute graph where you can see the performance dips.
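In the absence of the graph, a crude approximation can be scripted: read a block at evenly spaced offsets across the device and time each read. The Python sketch below is only that, a sketch. It assumes Linux, read permission on the target (root for something like /dev/sda), and it does not bypass the page cache, so repeated runs or recently read areas will look unrealistically fast. sample_read_speed and its defaults are made up for this example.

```python
import os
import time


def sample_read_speed(path, samples=50, block_size=4 * 1024 * 1024):
    """Return a list of (offset, MB_per_second) pairs measured at evenly
    spaced offsets of a file or block device. Read-only."""
    results = []
    fd = os.open(path, os.O_RDONLY)
    try:
        # Seeking to the end gives the size for files and block devices.
        total_size = os.lseek(fd, 0, os.SEEK_END)
        span = max(total_size - block_size, 0)
        for i in range(samples):
            offset = (span * i) // max(samples - 1, 1)
            start = time.perf_counter()
            data = os.pread(fd, block_size, offset)
            elapsed = max(time.perf_counter() - start, 1e-9)
            if data:
                results.append((offset, len(data) / elapsed / 1e6))
    finally:
        os.close(fd)
    return results
```

Printing the pairs, or plotting them with any tool at hand, gives a rough version of the HD Tune curve. A read that hits a bad sector would raise OSError here rather than show up as a dip.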
 
Feb 25, 2011
16,983
1,616
126
Bad Blocks = Replacement time.

Every drive has a few, but the controller hides them and replaces them with spares.

That means it's either gone stupid or it's out of spares.

And if it's out of spares, that probably means the drive's on its way out.