Error checking

Rubycon · Jan 30, 2008

How do the clients like BOINC check for errata in overclocked systems? For example suppose a user signs up with an overclocked system that was stable, etc. But summer heat comes, dog hair, other stuff plugs up the heatsink, etc. System starts producing errors. What kind of checking happens to prevent error-laden completed work units from getting submitted back to the mothership? Does the client check like P95?

I'm curious about this as it looks like a lot of overclockers are also participants in distributed computing.

Insidious · Jan 30, 2008

I don't know the detailed 'how', but when their program reads data that is not in line with constraints they have written into the program they usually throw up an error.

Other times, an error gives an impossible answer to a calculation and the program (or sometimes even the PC) locks up or crashes altogether.

You'll know it if your machine goes unstable. Different projects will throw different errors, but you will know it has crapped out.

-Sid

VirtualLarry · Jan 30, 2008

Originally posted by: Rubycon
How do the clients like BOINC check for errata in overclocked systems? For example suppose a user signs up with an overclocked system that was stable, etc. But summer heat comes, dog hair, other stuff plugs up the heatsink, etc. System starts producing errors. What kind of checking happens to prevent error-laden completed work units from getting submitted back to the mothership? Does the client check like P95?

I'm curious about this as it looks like a lot of overclockers are also participants in distributed computing.

I think the real answer is... that they really don't know if the results are correct or not.
The only reliable way that I can see is to establish a quorum - that is, to send the same work unit out to multiple independent people, and then check that the returned results all agree.

But I doubt that these projects implement such a thing.

SoB was implementing some sort of double-checks, for precisely this reason, but I don't think that all of the work units sent out were double-checked. I kind of wish that they were; I'm suspicious of bad data. I might have even sent some bad data when I OCed my rig to 3.28GHz. I thought that it was Prime95 stable, but later on I was doing more testing and it failed Prime95, so I dropped it back down to 3.2GHz. Now it seems stable, but apart from testing, how can I know for certain?

It also seems that some systems that are Prime95 stable are not F@H stable. (Prime95 stresses the FPU; F@H stresses the SSE units.)

Ideally, I guess, BOINC or other distributed-computing clients would come with an integrated stress-tester that ran tests on the integer, FPU, and SSE units of the CPU to ensure stability. Perhaps in the future this will happen. Perhaps the client would even refuse to download work units unless the stress test passed successfully.
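To make the "integrated stress-tester" idea concrete, here is a minimal sketch of a pre-flight self-test. It is not how BOINC or any real client works; all function names are made up for illustration. It only shows the two basic tricks: compare a computation against a known answer, and repeat the same computation to see whether the result ever changes. A real tester would need native code to load the FPU and SSE units properly, which plain Python cannot do.

```python
import hashlib
import math


def integer_workload(n):
    # Sum of squares 1^2 + ... + n^2, computed the slow way.
    return sum(i * i for i in range(1, n + 1))


def float_workload(n):
    # Deterministic floating-point work; fsum gives a reproducible total.
    return math.fsum(math.sqrt(i) * math.sin(i) for i in range(1, n + 1))


def self_test(rounds=3, n=50_000):
    """Return True if every check passes (hypothetical pre-flight test)."""
    # Known-answer check: the loop must match the closed-form formula.
    if integer_workload(n) != n * (n + 1) * (2 * n + 1) // 6:
        return False

    # Repeat check: identical input must give a bit-identical result.
    reference = float_workload(n)
    for _ in range(rounds):
        if float_workload(n) != reference:
            return False

    # Known-answer check against the standard SHA-256 test vector for "abc".
    expected = "ba7816bf8f01cfea414140de5dae2223b00361a396177a9cb410ff61f20015ad"
    if hashlib.sha256(b"abc").hexdigest() != expected:
        return False

    return True


if __name__ == "__main__":
    print("self-test passed" if self_test() else "self-test FAILED")
```

A client built this way could skip fetching new work whenever self_test() returns False. A short test like this would only catch gross instability, though, so it complements server-side checking rather than replacing it.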

Alyx · Jan 31, 2008

I'm pretty sure that SETI waits for two computers to crunch a WU before it gives points for it. After the little time I've been on malariacontrol, I'm thinking they do this too. Some WUs give points right away, others are "pending".

biodoc · Jan 31, 2008

Yes, many of the WUs at malariacontrol have a quorum of 3. I'm assuming they use this for error checking but they also use it for assigning points to a WU. The same WU is sent out to 3 clients and the three clients return the completed WU with a "claimed credit" of xx. The middle number is the points granted to all three clients.
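The quorum scheme described above can be sketched roughly like this. It is only an illustration under assumptions: the function name, the tuple layout, and the idea of comparing result digests are all hypothetical and not taken from malariacontrol or BOINC. The one piece taken from the post is that the middle (median) claimed credit is granted.

```python
from collections import Counter
from statistics import median


def validate_quorum(results, min_quorum=3):
    """Check one work unit's returned results (hypothetical sketch).

    results: list of (client_id, result_digest, claimed_credit) tuples.
    Returns None while the WU is still pending or has no agreed result.
    """
    if len(results) < min_quorum:
        return None  # not enough results yet, so the WU stays "pending"

    # Find the most common result and require a strict majority for it.
    counts = Counter(digest for _, digest, _ in results)
    digest, votes = counts.most_common(1)[0]
    if votes <= len(results) // 2:
        return None  # clients disagree too much to pick a winner

    return {
        "canonical": digest,
        # The middle claimed credit is what gets granted.
        "granted_credit": median(credit for _, _, credit in results),
        "agreeing_clients": [cid for cid, d, _ in results if d == digest],
    }


if __name__ == "__main__":
    wu = [("pc1", "abc", 10.0), ("pc2", "xyz", 12.0), ("pc3", "abc", 30.0)]
    print(validate_quorum(wu))
```

In the example, pc2 returned a different result (say, from an unstable overclock), so it is left out of agreeing_clients while the other two results form the majority. What a real project does with the odd one out is up to that project.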

F@H seems to be a good project for testing an OC'd system. If I push my system a bit too far, the client "kicks out" with an error, so I back off a bit. I'm assuming the data is fine if the system is stable 24/7 for x number of weeks, though.

Maybe I'll go over to the forum and check, though 😉
