Seeking insight on changing stability

BonzaiDuck

Lifer
Jun 30, 2004
16,593
2,002
126
I forget whether my system configuration shown below reflects that I changed out my 3.0C Northwood for a 3.2E Prescott (socket 478) last September.

I sinked the Mosfets on my ASUS P4P800 SE. I used a ThermalRight SI-120 cooler for the Prescott. The system memory is OCZ EL Gold DDR500 dual channel, 2x1024 kit.

This system was rock-solid until two weeks ago with ASUS "Lock-Free" enabled, so at the stock FSB speed, the processor ran at an effective 2.8 Ghz. Bumping up the external frequency to 200 and FSB to 1000, it was tested extensively with OCZ's MEMTEST86 version and many hours running S&M -- and then PRIME95 -- without flaw. The VCORE was only set to 1.3875V, and VDIMM at 2.85V. AGP VDDQ voltage was bumped up to 1.6V, and the AGP/PCI setting was fixed at 66/33. Running the OCZ DDR500's at their full spec, the processor ran at 3.5Ghz.

The temperatures were ridiculously low, and I had posted many pages on this web-site about my cooling and ducting experience. The processor never got hotter than maybe 104F at room ambient of 75F -- while running at full load under S&M.

The PSU is an OCZ Powerstream 520.

So -- the problem at hand -- two weeks ago, I had some system crashes.

Suddenly I was getting maybe 12 errors under MEMTEST86 at those (wonderful) settings. I dropped all the settings back to stock values, and began testing it again.

I was able to get the system back up to an external frequency of 244 Mhz, the processor at around 3.4Ghz -- without changing the VCORE setting. But now, in order to get the memory to test well at 246 Mhz, the VCORE had to be increased to something like 1.42V.

With Speedfan, I could be imagining things, but I think the range of variation around the mean voltage on the 12V rail is 11.8 to 12.15V. And I thought that when I built the system, that range had been tighter by +/- 0.05V, but I cannot be sure.
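The rail question above can be put in numbers. This is only a minimal sketch: it assumes the commonly cited ATX tolerance of +/-5% on the +12V rail, and it uses just the two extremes quoted in the post rather than a real Speedfan log. The function name `rail_report` is made up for illustration.

```python
# Assumption: ATX +12V rail tolerance of +/-5% (roughly 11.40V to 12.60V).
NOMINAL_12V = 12.0
TOLERANCE = 0.05


def rail_report(readings, nominal=NOMINAL_12V, tolerance=TOLERANCE):
    """Summarize min, max and spread of rail readings and check the tolerance band."""
    low, high = min(readings), max(readings)
    floor = nominal * (1 - tolerance)
    ceiling = nominal * (1 + tolerance)
    return {
        "min": low,
        "max": high,
        "spread": round(high - low, 3),
        "within_spec": floor <= low and high <= ceiling,
    }


# Only the two extremes mentioned in the post; a real log would have many samples.
report = rail_report([11.80, 12.15])
print(report)
```

Readings inside the band do not prove the PSU is healthy, since motherboard sensors are coarse; a voltmeter on a Molex connector would be a better check.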

What is happening here? Is my power supply losing its performance edge? Or would the high FSB speeds be causing my CPU to slowly deteriorate? Would the chipset be going bad? Or has something happened to the memory modules?



 

BonzaiDuck

Lifer
Jun 30, 2004
16,593
2,002
126
Correction: In my explanation of the over-clock settings, I didn't mean that I bumped up my external frequency to 200, but rather to 250. Just to avoid confusion here . . .
 

trexpesto

Golden Member
Jun 3, 2004
1,237
0
0
Funny parallel between this thread and my Barton thread.
I made a lot of changes though, which muddied the waters.

One would think that if memtest is having problems, and voltage is really within specs (got voltmeter?), then it is a problem with the memory or memory controller.

EDIT: compare and contrast Prime-Blend (RAM-intensive) versus Prime-SmallFFT.

Did you update your chipset drivers?

Try switching out the memory if you have some other sticks. Or try just one of them.

EDIT: I have bumped VCore from 1.65 to 1.7 in BIOS with same old 216x11 OC, stock VDimm 2.6 and VChipset 1.6: still was failing Prime-Blend in a few hours.
So backed off my timings from 2-2-3-7 to 2-3-3-11 on good advice from Snipershide and Maluckey, and Prime-Blend is running much longer now which MAY be an indication of better stability.

It never had a problem running memtest at stock, but I only ran it 21 passes or so.


FREEEEKING sunspots ;)
 

BonzaiDuck

Lifer
Jun 30, 2004
16,593
2,002
126
Well, that's one area of this recently acquired Over-Clocking obsession where I'm blind: "Is degradation of components gradual?" "If you make settings right on the edge -- beyond which the system posts memory errors and stress-test failures, does that line change here or there ever so slightly? And Why?"

Like I said -- It was posting no MEMTEST86 errors or stress-test errors up to and including FSB settings at 246 after the minor disaster occurred. I ran S&M for several hours. Ditto for Prime95.

If I knew for sure which component was the cause of the problem, it wouldn't seem like such a pickle for me. And it DOES run fine with the external frequency just a few megahertz below the original setting.

But obviously something has changed.

The latencies for the OCZ EL Gold modules are 3,4,4, 8. OCZ and others suggest that 3,4,3,8 is possible. But the first set of latencies is the module spec, and that is as far as I can adjust the motherboard.

Also -- I did flash the newest BIOS into the system. I looked for updated chipset drivers, but there are none. But obviously, with the MEMTEST86 results, this is not a software or firmware problem.


 

trexpesto

Golden Member
Jun 3, 2004
1,237
0
0
But now, in order to get the memory to test well at 246 Mhz, the VCORE had to be increased to something like 1.42V.
That is with slack timings? To clarify:
Slacking the timings on the RAM to a "known slack favorite" of the sticks doesn't allow you to pass memtest at previous VCore and CPU OC?

What is S+M?

For me, the RAM-intensive Blend version of Prime was failing way before the SmallFFT Test, which I thought pointed at the RAM or NB.
However some people have been telling me that it doesn't matter WHEN you fail out, only IF.

If you have another PSU, even a crappy one, try it in addition to your current (pun) one. Like run the case fans and drives off of one, MB off the other, and nothing chained off a single set of molex wires.

EDIT: like I'm some kind of expert :laugh:

 

BonzaiDuck

Lifer
Jun 30, 2004
16,593
2,002
126
trexpesto --

I read your thread.

I've been VERY cautious about bumping up the VCORE. The setting from September '05 through the crash a couple weeks ago was only one notch above the minimum setting allowed by the ASUS motherboard. That was 1.3875V and now I've pushed it up to about 1.4215V.

I also notice some people saying they get better OC results with the VDIMM set LOWER. Does this mean they are pushing even more juice through the CPU? I have DDR Booster, and was thinking about bumping the VDIMM up by 0.05 to 0.1V. Until last week, VCORE and VDIMM were always within factory spec.

Also -- on the issue of running the CPU fan off the mobo header. I have a Delta Tri-Blade, which runs 12V @ 0.9A at its top setting, but Speedfan is set to run it only at top speed. You yourself did not notice differences by running the fan off a Molex. And my MOBO allows something in excess of 2A draw from the combined fan headers. None of the other fan headers are in use.

I thought the remark about updated Windows components very interesting -- as though such software changes would stress the hardware differently. Considering my skepticism about this possibility, and if such effects actually occur, it could be consistent with the slight change in stable settings.

I have a query sent to OCZ tech support. It will be interesting to hear what they say.
 

letdown427

Golden Member
Jan 3, 2006
1,594
1
0
So it's your RAM that is deteriorating surely? Now failing memtest at same speeds as before?

It's happened (is happening, even) to me, although I've got some notorious 2x1GB Ballistix. For months it's been running at 2-2-2-5 at a mere 200Mhz (ran memtest86 on it for over 60 passes without error), but recently I started getting odd memory-related crashes. So I tried memtest again, and lo, it craps out briefly (a few errors) in test 5, then in test 8 up come a thousand or so errors.

I've just loosened the timings back to stock, and am running at 220Mhz 3-4-4-8 atm. I thought this was just my ballistix doing their thing and dying on me, maybe not?
 

trexpesto

Golden Member
Jun 3, 2004
1,237
0
0
yep I hate to add voltage. My ram's not overclocked at all so I am sticking to stock voltage on VDimm for now at least.

It was rated at 2-3-3-6 or 7 and ran fine at that and stock Voltage for about a year and a half. Still no trouble in memtest, but now I can't recall if I had raised the VCore already.

I wonder how they test it at the factory?

Edit: my stock VCore is 1.45 (Mobile Barton) and VDimm 2.6
 

BonzaiDuck

Lifer
Jun 30, 2004
16,593
2,002
126
No final word on this yet, but today's observations at about 70F room temperature -- purely qualitative but nevertheless . . .

I don't have a thermometer to catch the air blowing out of my Powerstream 520.

But yesterday, I took the system apart to clean out the dust, install a new DVD-burner, improve the front-panel wiring so I can remove the bezel easily for cleaning, clean the fan-filters and revise their installation.

The "revised" fan-filtering provides greater air-flow. The system is designed around a pressurized case, which forces air through a motherboard duct. The original assumption of this design included plenty of pressure to feed both the duct and the PSU. The PSU ventilation is not part of the ducting design.

The temperature of air coming out of the PSU is much lower than before. OCZ says their PSU is designed to operate at 0C to 50C. But one would think that it would only perform better at lower temperatures.

Just for fun, now that the system is back up and running, I decided to disable ASUS "Lock-Free" and test a higher CPU over-clock at a lower memory speed. The VCORE is still set to 1.42V -- when I dropped it to 1.4V there were memory errors at this setting immediately.

The external frequency is only 236Mhz, which is about 10Mhz less than the top setting I was able to achieve after I experienced the problem and topic of this thread, but the CPU is running at nearly 3.8Ghz, and I don't believe I've had it running this fast before.

I should probably test the configuration with "Lock-Free" and DDR500 again. What I should have done, when I achieved that stable configuration last September, was back off the setting to something between 244 and 248. There is no "safety-margin" at DDR500 and external frequency 250. I could never make it run at 251 or higher -- but then I never had the VCORE up this high, either.

Some people say they clock the VCORE on this Prescott to 1.5V, but frankly, I'd say they either have money to burn or they haven't read the tech-news articles about why such a setting would be bad for a Prescott.

This is also an old configuration, and I should probably start saving some money to move up to an AMD socket 939 system. The MSI-K8N Diamond Plus SLI board -- touted by Maximum PC -- uses DDR ram, although it probably wouldn't work well with these OCZ's.

I don't believe everything in Maximum PC, but the Newegg ratings on the board are splendid. Ordinarily, I'd stick with ASUS.

More on the slightly degraded performance of the Prescott as things develop over the next few days.

 

BonzaiDuck

Lifer
Jun 30, 2004
16,593
2,002
126
As a matter of fact, someone else asked me about that, and I sent them pictures of the ducting job via e-mail a few days ago.

How can I post jpg files here? My old personal web-site with earthlink is defunct -- just because I changed an e-mail address.

I was going to take photos of the new fan-filter setup. The filters were not the fancy-dan stuff with the transparent blue/violet/etc. rings you get at frozencpu.com, but those aluminum jobs -- square -- with the thick wire mesh. The front intake fans are installed between the case-chassis exterior and the front (plastic) fascia/bezel. The bezel has narrow slots in the plastic for intake -- running halfway up the computer front. With the filters affixed to the inside of the plastic, these narrow apertures simply clogged much faster with dust and dirt. This time, with a little trimming, I was able to install them on the fan shrouds.

The fan shrouds are aluminum -- the fans are those Evercool 120mm fans. And the screws holding the fans to the chassis weren't long enough to accommodate the thickness of the filters.

I had these left-over spring-steel clips for 92mm fans that come with the ThermalRight heatpipe coolers, and I made some nifty little clips that span each corner of the filters and slide into the decorative slots on the sides of the Evercools. They grab the aluminum snugly, and must be pried loose to get the fans off, but it sure beats removing the fan screws.

And there's more airspace between the bezel and the filters, so I suspect that airflow is improved.

One of the problems I had with this old Gateway 1995 full-tower case was the front-panel wiring. The wiring wasn't long enough to move the bezel far enough from the chassis to do a really good cleaning job. So I got some wiring scraps for soldering together a pin-and-plug assembly so that the wires to the LEDs and switches could simply be unplugged at the chassis-front.

Now that I've done that, I'm going to widen the plastic bezel slots without any degradation in the appearance. It won't be a chore anymore to remove the bezel for dremeling.

I have another issue, which I think has been explored before. I'll post that and see if anyone has any answers. It's about Windows Activation and over-clock settings.

 

palouse

Member
Sep 28, 2004
90
0
0
Originally posted by: BonzaiDuck
I changed out ... last September.
...two weeks ago, I had some system crashes. Suddenly I was getting maybe 12 errors under MEMTEST86
I assembled the sig rig in July 2005. I ran many hours of Memtest86+, along with other system tests, without errors before turning it over to my boy for use. (Although it may have been an error to let him have a game-capable machine...)

About two months ago, he started complaining about system crashes and lockups. I checked memory, and found it failing Memtest86+. After several hours of switching the 4 DIMMs around, I found one DIMM was "bad". The other 3 would run individually or in pairs without any failures. The bad DIMM always gave errors, individually or in a pair with another.

Do the errors occur at the same memory address each time? If so, you likely have a bad DIMM. RMA it or the "Dual Channel" pair back to OCZ. If you bought it from a really good retailer or eRetailer, they will handle RMA on your behalf.

You have to find the bad DIMM. Fairly easy to do in your 2x1024 case, just run the system with one DIMM. Set BIOS options to stock, and let Memtest86 run for a reasonable amount of time. It may take longer for errors to show at stock settings, but if the DIMM is bad, they will show. Same process for AMD or Intel. If you don't find errors in either DIMM after 8 to 12 hours of Memtest86 and stock settings, then it is likely something else.
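The swap-and-test procedure described above boils down to a simple elimination rule: a stick is suspect if it was installed in every failing run and in no passing run. Here is a rough sketch of that rule; the function name `suspect_dimms`, the labels "A" and "B", and the pass/fail results are all hypothetical, not data from this thread.

```python
def suspect_dimms(runs):
    """runs: list of (set of installed DIMM labels, passed flag).

    Returns the DIMMs present in every failing run and absent from every passing run.
    """
    failing = [set(dimms) for dimms, passed in runs if not passed]
    passing = [set(dimms) for dimms, passed in runs if passed]
    if not failing:
        return set()  # nothing failed, so nothing to blame
    suspects = set.intersection(*failing)
    for dimms in passing:
        suspects -= dimms  # a stick that passed on its own is cleared
    return suspects


# Hypothetical Memtest86 outcomes at stock settings for a two-stick kit.
runs = [
    ({"A", "B"}, False),  # both sticks installed: errors
    ({"A"}, True),        # stick A alone: clean
    ({"B"}, False),       # stick B alone: errors
]
print(suspect_dimms(runs))  # {'B'}
```

An empty result after long clean runs on each stick would point away from the memory and toward something else, as the post says.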

I can see how someone would try to "just live with" RAM errors, though. I believe you risk disk corruption doing so.
 

BonzaiDuck

Lifer
Jun 30, 2004
16,593
2,002
126
Yo. I didn't realize I'd had so many good responses to this thread.

Yes, something not so nice has happened to my MOJO machine.

Someone asked "What is S&M?" -- That was Trexpesto.

S&M -- which somehow implies "sado-masochism" on your system -- was introduced to me about eight months ago by a friend (AMD aficionado) in New Mexico. (He may even be posting here under an alias that I don't recognize, even though he knows me as "the Bonzai-Duck-Meister")

I even corresponded with the program's author -- a guy named "Serj." It is apparently a Russian creation, touted in sticky posts or compilations of benchmarking programs here at Anandtech. The home page for its source is:

TestMem

The latest version is 1.7.6, which I just discovered can be downloaded here -- just search through the text until you find reference to S&M:

Benchmark HQ

There seems to be a lack of documentation on the program, but it's pretty easy to figure out, and much more elegant than CPU-Burn-in and some other OC'ing stress-tests. The graphic displays of temperature variation are great, although I would suggest that you disable (close or exit) SpeedFan or any other program that polls the motherboard and processor sensors.

Back to my problem, and what seems parallel to that of Letdown427.

First, I had an OCZ DDR Booster in my system, which is supposed to stabilize RAM voltage even if left at the default setting with the BIOS VDIMM setting at its maximum -- the default being precisely the VDIMM specified in BIOS. You twist the little black screw on the Booster slightly, and an LED readout shows the new (and higher) voltage.

I DID see what would happen if I increased VDIMM beyond the motherboard (BIOS) maximum of 2.85V to 2.9 or 2.95. The OCZ DDR500 EL Gold's are warrantied for life at VDIMMs up to 2.9 + 5% -- which puts the maximum acceptable voltage for replacement at 3.045V. It only allowed me to increase my "new" stable settings from a maximum of 244 Mhz to 246 Mhz at the "favored" or "slack" latencies of 3, 4, 4, 8 -- the latencies specified as default on the heatspreader labels. Those were the latencies I had from the very beginning with this 2x1GB kit.

So I set the Booster down just to the threshold between 2.9 and 3.0 -- which experience had told me was effectively close to 2.95V. Not much improvement there.

I also found some review articles on the EL Gold's which had been published more recently and after I had purchased an earlier set of (2x512) 1GB -- which I gave to my brother for Xmas. Those never failed, but on the mobo I gave to my bro' -- the earlier P4P800 (standard) similar to the P4P800 SE -- I had never, ever been able to get those modules to be error free at the chosen VCORE setting beyond 240 Mhz external frequency. Even so, the latency on the 1GB kit is tighter at 2.5, 4, 4, 7, and at 240 I was able to get them to work perfectly at 2.5, 4, 4, 6.

This is why I was not willing to accept the possibility that the EL Gold modules had degraded, as suggested here by Palouse and Letdown427. There are other possibilities.

But I don't think it would be the CPU. I ran S&M at different settings, and the CPU doesn't show errors unless the voltage is insufficient for some OC setting. But I get either one single-bit error or a single multi-bit error after running the test at near 100% CPU usage for about six hours.

And I'm looking at the possibilities here. Where's the real stress that's "beyond spec?" You could argue that OC'ing the processor is "beyond spec," but for the six months preceding my recent troubles, I had it under-clocked (from 3.2 to 2.8) at the lower multiplier, and then over-clocked on the FSB bringing it back to 3.5 -- up 300 Mhz from 3.2. Any decent Prescott 3.2E should be able to sustain that setting as though it were stock. So at a VCORE of 1.3875V (one notch above minimum), I don't think the processor was being stressed.

That leaves the memory and the memory controller on the mobo. The memory was being run at spec, but the mobo was over spec with FSB at 1000 instead of 800.

Palouse may have a good point, even so. I did seem to notice - when errors occurred under MEMTEST86 -- there were a few at the same location, so I should be able to tell which stick it is. Or, for that matter, you'd think I could just RMA the two-stick kit and get a replacement. Supposedly OCZ is very good about this. If the modules don't run at spec when they did earlier, they should be replaced.

The cheapest component here is the mobo. And I've invalidated the mobo warranty by gluing on the aluminum Mosfet sinks. In fact, I believe I was only able to run the OCZ's at spec because the Mosfet heatsinks stabilized the voltage and reduced heat. And that's probably why I became so comfortable with the 500 Mhz DDR setting.

Now -- looking into this problem -- I discovered another reason to feel stupid. MEMTEST86 reports a bandwidth measure in MB/sec which is probably good for measuring performance of latency and speed settings -- if only in relative terms. With the modules running at DDR500 and 3,4,4,8, I think the bandwidth reported was either 2,792 MB/sec or just over 2,800. As I said about the more recent review articles on the memory, they noted that the OCZ EL Gold's had acceptable and tighter latencies at lower speeds. So at DDR466, you could run 3, 3, 3, 6 latencies, and at DDR433 you could run the modules with 2.5, 3, 3, 6 settings.

I tried it. The bandwidth reported at those latencies and at DDR466 was 2,800-something. So I can use less power, put less stress on the mobo, and get about the same memory performance with 2/3 of the original memory over-clocking. Why wouldn't that be better?
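One way to see why tighter latencies at a lower speed can come out about even is to convert the timings from clock cycles to nanoseconds. This is only a sketch: it assumes the four numbers are in the usual CL, tRCD, tRP, tRAS order and that the memory clock is half the DDR rating. It does not reproduce the MEMTEST86 bandwidth figure, which is a measurement.

```python
def timings_ns(ddr_rating, timings):
    """Convert memory timings from clock cycles to nanoseconds."""
    clock_mhz = ddr_rating / 2  # DDR moves data twice per clock
    return [round(t / clock_mhz * 1000, 2) for t in timings]


# Settings mentioned in the post (timing order is an assumption, see above).
settings = {
    500: (3, 4, 4, 8),
    466: (3, 3, 3, 6),
    433: (2.5, 3, 3, 6),
}

for rating, timings in settings.items():
    print(f"DDR{rating} {timings}: {timings_ns(rating, timings)} ns")
```

At DDR500 the 3, 4, 4, 8 timings work out to 12, 16, 16 and 32 ns; the tighter settings at DDR466 and DDR433 give shorter row timings in absolute terms, which offsets some of the lost clock speed.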

It would especially be better if I were previously only running the processor at 3.5Ghz in order to get the FSB to 1000 Mhz. Because now, I can bump the multiplier back up to stock (16 x 200 = 3,200 Mhz or 3.2 Ghz for the 3.2E Prescott), and get the processor to 3.7 Ghz with the FSB running at only 932 Mhz -- less stress on the mobo, less power consumption (I would think) -- same bandwidth.
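The clock figures quoted in this thread all follow from the external frequency. As a quick sketch: the CPU clock is external frequency times multiplier, the Pentium 4 FSB is quad-pumped (x4), and the DDR rating is x2 when memory runs 1:1 with the bus. The 14x multiplier for "Lock-Free" is inferred from 2.8 GHz at 200 MHz, and the 1:1 memory ratio is an assumption.

```python
def clocks(external_mhz, multiplier):
    """Derive CPU, FSB and DDR figures from the external frequency."""
    return {
        "cpu_mhz": external_mhz * multiplier,
        "fsb_mhz": external_mhz * 4,  # quad-pumped front-side bus
        "ddr": external_mhz * 2,      # assumes memory runs 1:1 with the bus
    }


print(clocks(200, 16))  # stock 3.2E: 3200 MHz CPU, FSB 800, DDR400
print(clocks(250, 14))  # "Lock-Free" overclock: 3500 MHz CPU, FSB 1000, DDR500
print(clocks(233, 16))  # stock multiplier: 3728 MHz CPU, FSB 932, DDR466
```

The last line matches the "3.7 Ghz at FSB 932" setting described above.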

I'm testing that setting now.

But again -- Why would I have been able to run the system with VCORE 1.3875V and the OCZ EL-Gold's at full spec?

Something is degraded here. I should probably RMA the OCZ's, since they show memory errors under extreme S&M tests after five or six hours -- where they never showed errors before. And at speed settings of between DDR 488 to DDR500. The processor -- with decent VCORE voltages that still are within spec -- seems to "pass" its S&M tests.

And if it's the memory controller -- well -- $90 for a replacement and another $10 for ramsinks on the Mosfets so I can invalidate my ASUS warranty again.

Palouse is abso-freakin-lutely right about data corruption on the hard disk, which is why I dropped the over-clock settings for a day and backed up all my data, created a NotePad TXT list of installed programs, CD-keys and license codes, and put it all on my RAID5 (Pentium 3) server upstairs.

This hobby can be (a) expensive unless you're careful and (b) damaging to your files -- the extension of your mind's history over 20 years -- unless you're careful about disciplined backups.

Anyway -- it has been a learning experience. Why stress your mobo at FSB 1,000 Mhz when you can stress it less at FSB 932 Mhz and get the same memory bandwidth with a higher CPU clock setting?

Old age is making me dumber.
 

BonzaiDuck

Lifer
Jun 30, 2004
16,593
2,002
126
OK -- this just in:

S&M(r) 1.7.5 (beta)

Start: 18.04.2006 (Tuesday) 21:07
Stop: 18.04.2006 (Tuesday) 22:33
Processor # 0:
Cache L1 passed
Cache L2 passed
Integer passed
FPU passed
Power Supply passed
Processor # 1:
Cache L1 passed
Cache L2 passed
Integer passed
FPU passed
Power Supply passed

System Memory passed


That's about an hour and a half of stress-testing at the "average" setting and 80% CPU load. I should probably run it through six to ten hours. Serj had said that (for instance) a failed "Integer test" on "Processor #1" at 100% load doesn't necessarily mean the processor is bad -- that the program will give such results at that level in its attempt to deal with hyperthreading.
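For the record, the run length can be read straight off the Start and Stop lines of the report. A small sketch, assuming the stamps are day.month.year followed by a 24-hour time; `parse_stamp` is just an illustrative helper, not part of S&M.

```python
from datetime import datetime


def parse_stamp(text):
    """Parse a stamp like '18.04.2006 (Tuesday) 21:07', ignoring the weekday."""
    date_part, _weekday, time_part = text.split(" ")
    return datetime.strptime(f"{date_part} {time_part}", "%d.%m.%Y %H:%M")


start = parse_stamp("18.04.2006 (Tuesday) 21:07")
stop = parse_stamp("18.04.2006 (Tuesday) 22:33")
minutes = int((stop - start).total_seconds() // 60)
print(minutes)  # 86
```

So this pass covered 86 minutes, well short of the six to ten hours suggested for a real soak test.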

I should run MEMTEST86 all night. Last night, it ran through 21 iterations without a single error. But at the conservative settings to which I've retreated. Palouse may have something there -- if they don't run at a few Mhz lower than DDR500, and if they used to -- RMA the memory sticks.

The elusiveness of these memory errors still makes me wonder if it isn't the mobo going South . . . .
 

BonzaiDuck

Lifer
Jun 30, 2004
16,593
2,002
126
For anyone who may have shown interest in this thread, the problem has been resolved.

Palouse was right, and more confident in his assessment of the problem than I was.

Apparently, one of the OCZ EL Gold DDR500 modules "went south," or had degraded.

Anyone who has OCZ high-performance memory modules will be interested in my experience.

I sent e-mail to their tech-support. The response was very quick -- within the day.

We discussed the possibility that it was either the memory or the motherboard -- perhaps the memory controller (this was an Intel configuration). The tech-rep immediately suggested I RMA the complete Gold EL dual-channel kit. I asked about the turnaround time, and the assessment was five days to a week. Before I could even think of it as an alternative, the tech-rep -- Jimmy -- offered to send me replacements in advance, charging a temporary $10 to my credit card. I would return the Gold modules within the 15-day time-frame.

Jimmy later noted that there were no "Gold" replacement kits, but he would send a set of Platinum XTC DDR500 modules. He noted that they had tighter latencies -- 3, 3, 3, 8 instead of 3, 4, 3, 8.

The XTC modules arrived about four days after the e-mail exchange and assignment of an RMA number. My postal costs were around $12 to mail off the Gold modules.

It was, indeed, a problem with the Gold EL's. Thus, no need to replace a motherboard.

I'm happier than a pig in poop. (Just an expression; I know they are clean animals!) But even better than that: (1) the XTC modules are compatible with socket-939 AMD platforms, and (2) the bandwidth at DDR498 is about 80 MB/sec greater for the XTC's than the bandwidth for the Gold modules at DDR500.

I can drop the FSB setting a bit more and still squeeze out the same performance I had before -- assuming that a slight drop in CPU speed will have much less effect than the memory bandwidth on overall performance. That should mean less stress on the modules. I can play around later with finding their full speed and stability limit.

OCZ rocks!! LifeTime Warranty rocks!!