
ISRT and reduction of disk corruption for BSODs or power loss

BonzaiDuck

Lifer
Jun 30, 2004
16,632
2,027
126
I just came up with a hypothesis about the SSD-caching ISRT feature.

For several months, I've been having very occasional 10-day to 1-month unanticipated resets, and a smaller number of BSODs. I now believe that I've solved the problem, but only by addressing several suspected causes at once. With the risk of disk corruption on my mind, I regularly clone the hard disk after running it through CHKDSK with repair enabled. Troubleshooting became more of an urgency because of those fears: there's no time to make one change targeting a single suspected cause and then wait 7, 10, or 12 days to find out whether the fix has worked.

But through all this, when rebooting from a crash, freeze, or cold shutdown, the Intel disk controller goes through its BIOS "thing" as it counts through checksums within the cache. Always (so far) it gives itself a clean bill of health, and the system boots.

I've run CHKDSK more and more with these fears, and on these high-capacity drives (even 500GB), the most extensive passes take a lot of time. But nary a problem. Perhaps a "USN Journal" error that was automatically corrected on an auxiliary drive running on the non-Intel Marvell controller.

It would seem the ISRT cache may have a data-integrity feature going for it. I don't know. Someone else could come forward with more than intuition, or with better knowledge.
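
To put the hypothesis in concrete terms, here is a minimal sketch of the idea. Intel does not publish the SRT cache internals, so the structures and the checksum choice below are invented purely for illustration; the sketch only shows how a cache that stores per-block checksums could refuse to replay damaged blocks after a dirty shutdown.

Code:
# Hypothetical sketch only; this is NOT Intel's actual SRT/IRST cache format.
# The idea: if each cached block stores a checksum, a recovery pass after a
# crash can flush only blocks that still verify, rather than writing torn
# data back to the HDD. That would limit, not eliminate, corruption.
import zlib
from dataclasses import dataclass

@dataclass
class CachedBlock:
    lba: int        # destination sector on the HDD
    data: bytes     # cached payload
    crc: int        # checksum recorded when the block entered the cache

def crc(data: bytes) -> int:
    return zlib.crc32(data)

def recover(cache):
    """Split the cache into blocks safe to flush and blocks damaged by the crash."""
    good, suspect = [], []
    for blk in cache:
        (good if crc(blk.data) == blk.crc else suspect).append(blk)
    return good, suspect

# One intact block, one torn by a power cut mid-write.
cache = [
    CachedBlock(100, b"journal entry", crc(b"journal entry")),
    CachedBlock(200, b"torn wri\x00e", crc(b"torn write")),
]
flushable, dropped = recover(cache)
print(f"flush {len(flushable)} block(s); drop {[b.lba for b in dropped]}")

Even if the boot-time count-through does something along these lines, it could only protect data that had reached the cache; it says nothing about the rest of the filesystem, which is presumably why CHKDSK still matters.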
 

razel

Platinum Member
May 14, 2002
2,337
93
101
I assume this is on your current rig in your sig, which is overclocked... If so, perhaps throttle everything back to normal, within specifications, for at least a month and see if it still happens.
 

BrightCandle

Diamond Member
Mar 15, 2007
4,762
0
76
It seems kind of crazy to me that you are worrying so much about the security of your data, backups and such, and yet you are running a clearly unstable overclock. Solving the crashing by reducing the clock speed or increasing the voltage should get the machine back to a solid bill of health and avoid the crashes that carry a small chance of data corruption. I feel you're expending your effort in the wrong direction, considering that the cause of your issue is something you have done and is easily fixed.
 

Soulkeeper

Diamond Member
Nov 23, 2001
6,736
156
106
I agree with the posters above, 4.7GHz is like an extreme benchmark clock. Something you would set for bragging rights on a forum :)
I believe the recent poll in the cpu section showed most people not going above 4.2GHz 24/7 ... and for good reason.

Just because a system doesn't crash doesn't mean things aren't quietly being corrupted in the CPU cache, memory, and everything you read/write and open/close, etc. (including your filesystem).
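
That point can be made testable: a rig can run for weeks while data quietly changes underneath. A generic way to catch it, independent of ISRT and of this particular build, is to record file checksums before a cloning pass and compare them afterwards. The sketch below is a plain Python illustration with an example path, not something any poster here actually ran.

Code:
# Generic silent-corruption check: hash every file under a tree, save the
# digests, and diff against the previous run. The path is just an example.
import hashlib, json
from pathlib import Path

def sha256_of(path: Path) -> str:
    h = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()

root = Path(r"D:\ImportantData")            # example directory to watch
manifest = Path("manifest.json")            # where the digests are kept

current = {str(p): sha256_of(p) for p in root.rglob("*") if p.is_file()}
if manifest.exists():
    previous = json.loads(manifest.read_text())
    changed = sorted(p for p, d in previous.items() if current.get(p) != d)
    print("changed or missing since last run:", changed or "none")
manifest.write_text(json.dumps(current, indent=2))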
 

john3850

Golden Member
Oct 19, 2002
1,436
21
81
After reading your posts, I remember that you always ran the lowest possible vcore to keep your temps on the low side, and I believe that's half of your problem.
 

Essence_of_War

Platinum Member
Feb 21, 2013
2,650
4
81
BonzaiDuck said:
I've been having very occasional 10-day to 1-month unanticipated resets, and a smaller number of BSODs.

*notes i7 OC'd to 4.7*

I agree with others, this sounds like an unstable OC. The first, and easiest, thing you could do to verify that is to drop back to stock clocks for a month or so, and see if the problem disappears.
 

BonzaiDuck

Lifer
Jun 30, 2004
16,632
2,027
126
BrightCandle said:
It seems kind of crazy to me that you are worrying so much about the security of your data, backups and such, and yet you are running a clearly unstable overclock. Solving the crashing by reducing the clock speed or increasing the voltage should get the machine back to a solid bill of health and avoid the crashes that carry a small chance of data corruption. I feel you're expending your effort in the wrong direction, considering that the cause of your issue is something you have done and is easily fixed.

John3850 said:
After reading your posts, I remember that you always ran the lowest possible vcore to keep your temps on the low side, and I believe that's half of your problem.

Soulkeeper said:
I agree with the posters above, 4.7GHz is like an extreme benchmark clock. Something you would set for bragging rights on a forum

I thought so too, for a long time. But it didn't matter whether I was at 4.6 or 4.7. It would go ten days or so rock-stable, and then at EIST idle (except for running Media Center -> AVR/HDTV), a reset would occur. A BSOD was much less frequent than an unannounced reset.

As my concern about the data grew while I tried to troubleshoot the problem, I made more frequent clones of my HDD. I even set the OC back to stock. And, again, with that concern, I began applying multiple possible fixes at once, working through an inventory of things I'd read could cause similar trouble.

John3850 may be right about parsimony with voltage, but I've been careful to avoid that, knowing from experience over the years. The voltage for these clocks has a margin of several millivolts added.

The inventory: check and retest the clocks and voltages [no cigar]. Replace the RAM [I had been coveting a 2x8GB set] -- no cigar. Update all the drivers [except for the Hauppauge HVR-2250 tuner card, which I'd forgotten until a little later]. I corresponded with hardware tech-support and scoured forums for similar problems.

I had been planning to replace my graphics card anyway. So I did that, together with updating the HVR-2250 drivers.

I learned a few things. Chief among them: the PLL Voltage default on Z68, and probably on later comparable IB boards -- maybe even Haswell -- is excessive. Reducing that voltage was no panacea for stability, but it definitely helps with temperatures. Maybe not for others, but very likely for me, VCCIO doesn't need to be bumped up much unless you're having trouble OC'ing RAM [and I don't now need to do that, either]. Most of all, there are commonly cited rules-of-thumb for OC'ing -- for instance, enabling PLL Overvoltage, increasing "current capability," and other features -- which actually might increase temperatures slightly. I found that I'm fine without them.

I also spoke to some tech-support reps. There seems to be a common problem with the HVR-2250 causing freezes, resets, and BSODs. I'm standing pat with the driver update before I choose to remove the card.

Finally -- the old graphics card was running hot. It had been clogged with cruft, and I have no idea how it was getting reasonable ventilation. So I also suspect some weakness in the VRAM -- something with that graphics card (GTX 570). I found other people who may have been having trouble with their GTX 570, experiencing the same symptoms.

Whatever it was, it seems to have disappeared (knock on wood). But it does seem to have disappeared.

Back to the question: is it possible that ISRT might reduce the chance of disk corruption? What's in the cache is still there upon reboot. I don't know what sort of checksums are done by the IRST software, but it would seem the SSD cache could function something like ECC. This was just a thought -- a theory -- that I had.

On the 4.7-as-bragging-rights point: I've since revised my own judgment that a "safe" limit for SB-K chips is 1.35V VCORE; it's more like 1.38V, as with Nehalem. At this point, it's an easy 24/7 overclock. Sometimes I run the system at stock, sometimes at 4.6, and sometimes at 4.7. But even at 4.7, IBT doesn't heat the processor past about 71C. There's actually some room left there.

I think this problem went away. I might still suspect the HVR-2250 driver, but I have stronger suspicions about ineffective cooling on the graphics card, which I'm no longer using.

One more thing. If one is going to OC the processor cores, and the HD 3000 iGPU ramps up in proportion to that overclock, one should either OC the iGPU deliberately or turn the damn thing off. Which I did.
 
Last edited:

Essence_of_War

Platinum Member
Feb 21, 2013
2,650
4
81
BonzaiDuck said:
Whatever it was, it seems to have disappeared (knock on wood). But it does seem to have disappeared.

It's only been three days since you posted this, and you originally said that:

For several months, I've been having very occasional 10-day to 1-month unanticipated resets

I don't see how you're justified in claiming that the problem has disappeared if you're still inside of the typical inter-arrival time of the problem.

Is there something specific stopping you from dropping back to stock for 10 days or so to confirm that it's not your overclock?
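
To put a rough number on the inter-arrival point: if the resets were arriving at random with a mean gap of, say, 15 days (an assumed figure, sitting inside the reported 10-day-to-1-month range), an exponential model says a quiet stretch of 3 days would happen most of the time even with nothing fixed. A back-of-envelope check:

Code:
# Back-of-envelope only: treat crashes as a Poisson process with an assumed
# mean inter-arrival time and ask how likely a quiet stretch is by pure luck.
import math

mean_gap_days = 15.0          # assumed average gap between crashes
for quiet_days in (3, 10, 30):
    p_quiet = math.exp(-quiet_days / mean_gap_days)
    print(f"{quiet_days:>2} quiet days: ~{p_quiet:.0%} chance even if nothing changed")

By that yardstick, roughly a month of silence is what starts to look like evidence that something actually changed.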
 

BonzaiDuck

Lifer
Jun 30, 2004
16,632
2,027
126
Essence_of_War said:
It's only been three days since you posted this, and you originally said that:



I don't see how you're justified in claiming that the problem has disappeared if you're still inside of the typical inter-arrival time of the problem.

Is there something specific stopping you from dropping back to stock for 10 days or so to confirm that it's not your overclock?

I already did that. To address it, I adjusted my voltages to raise the offset, since a number of enthusiasts had reported idle instability after arriving at seemingly stable settings under load. I also noted that the three power-saving settings besides EIST cause the EIST idle voltage to vary. With only EIST, you might have an idle voltage of 1.016V for a particular offset value; using LLC would make it lower than without LLC, but there wouldn't be any variation. With those C1E-type features disabled, I discovered I had a problem with sleep, most likely only when the OS tweaks allow sleep with "hibernation." So I re-enabled them. Even so, the collective wisdom and experience had been that these power-saving features, too, might be causing EIST-idle instability. But for me, apparently not.
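
For anyone following the offset discussion, the arithmetic behind those idle-instability reports is roughly this: in offset mode, the same offset rides on top of whatever VID the CPU requests, so a value tuned at the high load VID can undershoot at the much lower EIST-idle VID. The numbers below are invented for illustration; real VIDs vary chip to chip, and LLC behavior is board-specific.

Code:
# Rough illustration of offset-mode Vcore; every VID value here is made up.
load_vid = 1.300    # example VID requested at the full multiplier
idle_vid = 0.960    # example VID requested at EIST idle

for offset_v in (-0.020, 0.000, +0.020):
    print(f"offset {offset_v:+.3f} V -> "
          f"load ~{load_vid + offset_v:.3f} V, idle ~{idle_vid + offset_v:.3f} V")
# A negative offset that is fine under load can leave idle marginal, which
# is one reading of why raising the offset cured idle-time resets.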

Your attention to the chronology in my statements is almost amazing, so I checked my event logs. The last time I got an Event ID 41 was on the 9th, and that was because of the sleep-state problem, now resolved. Before that, the 6th, when I attempted to run 3dMark Vantage with LUCID still enabled; tech support informed me that LUCID is a no-no for 3dMark benches -- it won't work. That's when I disabled the onboard HD 3000 entirely, until the next time I want to experiment with it, and that's "no time soon." So, technically, about 9 days. Counting only what I couldn't explain in the interim, more like 15 days. That breaks the pattern.
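
For anyone who wants to run the same chronology check, Kernel-Power Event ID 41 ("the system has rebooted without cleanly shutting down first") can be pulled straight from the System log. A small Python wrapper around the stock wevtutil tool is sketched below; it assumes a Windows machine and may need an elevated prompt.

Code:
# List the most recent Kernel-Power Event ID 41 entries (unexpected reboots).
# wevtutil ships with Windows; run elevated if access is denied.
import subprocess

cmd = [
    "wevtutil", "qe", "System",
    "/q:*[System[Provider[@Name='Microsoft-Windows-Kernel-Power'] and (EventID=41)]]",
    "/f:text",     # human-readable output
    "/c:10",       # only the last 10 matches...
    "/rd:true",    # ...newest first
]
print(subprocess.run(cmd, capture_output=True, text=True).stdout)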

During that time, I disabled the iGPU, replaced the video card, and re-installed the remaining drivers. I agree that I shouldn't count my chickens so soon, but I think these last actions may have resolved it. I'll eat my hat otherwise, and I'll order a new one made out of butter-cookie dough!! :biggrin:

"Inter-arrival time." Did you study Queuing Theory?
 
Last edited:

razel

Platinum Member
May 14, 2002
2,337
93
101
You are stuck up in the branches when the problem lies closer to the root. If you have dropped back to stock and are still having issues, then I would begin testing the RAM at stock. Go back to the beginning and simplify. Don't get stuck up in the trees; you'll get light-headed and lose your way.
 

Essence_of_War

Platinum Member
Feb 21, 2013
2,650
4
81
"Inter-arrival time." Did you study Queuing Theory?

I did, just enough to apply to some stuff for my research group, and to be dangerous to myself and others :p

The timetable sounds better than I originally thought, but I'd still HIGHLY recommend you drop back to stock clocks for a little while. This sounds very much like an unstable OC. If you could go to stock for a few weeks, then metaphorically turn up the gas slowly and steadily and see if the problem returns.

Like razel said, the best matches for the symptoms you describe are an unstable CPU OC and bad or unstably overclocked RAM, so it seems very prudent to be 100% sure that neither of those is the issue.
 

BonzaiDuck

Lifer
Jun 30, 2004
16,632
2,027
126
Your OC is a problem but cloning or imaging with an OC is a major problem.

[Also acknowledging Essence_of_War . . ]

Don't go out of your way to trouble yourself with my thread-and-post verbosity, but here's where I am on the operating system and HDD:

http://forums.anandtech.com/showthread.php?t=2373811

The only "problem" in the scan-and-repair results was a missing file, not a corrupt one. I had seen this before: I found a forum post by a fellow who had exactly the same single result. It has more to do with mis-installed software and C++ redistributables, nothing to do with corruption. The general impression: the missing MFC80.DLL was likely benign. I'm currently pondering whether to dump the 32-bit C++ redistributable, or download and install it. I had explained that in the linked thread.
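
Assuming the scan-and-repair in that linked thread was sfc /scannow, the detail on exactly which file was flagged ends up in CBS.log, and filtering for the "[SR]" lines is the usual way to see whether anything beyond a single missing DLL turned up. A quick Python filter (paths assume a stock Windows install):

Code:
# Pull only the System File Checker lines (tagged "[SR]") out of CBS.log so
# a single flagged file such as MFC80.DLL is easy to spot. Windows only;
# the log may require an elevated prompt to read.
import os
from pathlib import Path

cbs_log = Path(os.environ.get("windir", r"C:\Windows")) / "Logs" / "CBS" / "CBS.log"
with cbs_log.open(encoding="utf-8", errors="ignore") as f:
    for line in f:
        if "[SR]" in line and ("cannot repair" in line.lower() or "repairing" in line.lower()):
            print(line.rstrip())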

On the RAM issue: until January, I had a 4x4GB set of G.SKILL GBRLs running. They'd passed the full 1000% "thorough" HCI MEMTEST. I earlier thought this may have been a problem; I'd coveted a 2x8GB set of DDR3-1866 modules and bought them [pretty much needlessly].

Twiddling with OC settings, dropping back to stock, replacing RAM -- the problem persisted. Advice from different corners pointed at other hardware and drivers which I replaced. [SiliconDust had some things to say about Hauppauge tuner-cards and instability -- as did several folks posting in various forums.]

But I think I have this licked. Under full CPU load, there's no way I'd be able to run Media Center LiveTV while getting through 50 iterations of LinX without good clock settings. Nothing under load would account for the problem. And I took measures to assure that EIST idle was solid and ample.

I mentioned an "0C2" error in the OS thread, which occurred right away after committing security and ownership changes to a component service. There was no trace of that stop error, which occurred only that one time, once I'd reinstalled the Intel network driver. Of course, none of the stop errors I'd experienced on that occasional basis included this 0C2. Or maybe I didn't notice it when it showed in a BSOD; most of these occasional errors were just resets and not BSODs. But the 0C2 error was the network driver and the IPBusEnum service, at least in my opinion.

Over the last few days, I was thinking that hardware problems (including overclock settings) could corrupt the OS, which in turn would proliferate more problems, which, in agreement with Old Hippie, would cause more confusion. But there was no OS corruption and no HDD problem.

Between the hot-running gfx card and the tuner card with its driver (for which I need to "wait and see"), I'm now beginning to wonder whether the Event-Log errors, specifically the one associated with the network device, might have been at the root of this all along.

So -- tell ya what. I'll get back to this thread as soon as I'm "confirmed" -- either way.
 
Last edited: