Weekly Stats-18FEB2018


Kiska

Golden Member
Apr 4, 2012
1,016
290
136
Right, because lots of other projects with overcrowded servers would be affected by such a race condition. Though maybe Cosmology@Home has some file shuffling of its own going on in the background, such that it introduces a race condition of its own.

I think this may be an incorrect assumption, but I'll say it anyway.
Projects with high volumes of connections/tasks tend to use RAM disks to temporarily store incoming results. And because C@H uses a quorum of 1, a task gets added to the validator queue and is validated immediately. What I'm assuming happens to some of my submitted tasks is that the validator doesn't wait for the file handler to move the result from the RAM disk to storage. This wouldn't be a problem if the invalid:valid ratio were low, but 35% is way too much.

That is the basis on which I think they are creating their own race condition: they use a quorum of 1, whereas most other projects I know of, such as PG or S@H, use 2 or more.
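
To make the suspected race concrete, here is a minimal Python sketch of the failure mode I have in mind. None of this is C@H's actual server code; the directory names, the 0.5 s move delay, and the validator logic are all invented for illustration:

Code:
import os
import shutil
import threading
import time

RAMDISK = "/tmp/ramdisk"   # hypothetical upload staging area (RAM disk)
STORAGE = "/tmp/storage"   # hypothetical permanent result storage

def file_handler(result_name, delay=0.5):
    """Move an uploaded result from the RAM disk to permanent storage.
    The artificial delay stands in for whatever background shuffling the
    project does; during that window the file exists only on the RAM disk."""
    time.sleep(delay)
    shutil.move(os.path.join(RAMDISK, result_name),
                os.path.join(STORAGE, result_name))

def validator(result_name):
    """Quorum-of-1 validator that runs as soon as the scheduler marks the
    task finished.  If it looks in permanent storage before the file
    handler has moved the file, the result gets marked invalid."""
    path = os.path.join(STORAGE, result_name)
    try:
        with open(path, "rb") as f:
            f.read()
        print(result_name, "-> valid")
    except (FileNotFoundError, PermissionError) as err:
        print(result_name, "-> invalid:", err)

if __name__ == "__main__":
    os.makedirs(RAMDISK, exist_ok=True)
    os.makedirs(STORAGE, exist_ok=True)
    # Simulate an uploaded result landing on the RAM disk.
    with open(os.path.join(RAMDISK, "result_1"), "wb") as f:
        f.write(b"science output")
    # File handler and validator start at (almost) the same time.
    threading.Thread(target=file_handler, args=("result_1",)).start()
    validator("result_1")   # races the move and usually loses

The point of the sketch is just that with a quorum of 1 there is no second result to wait for, so nothing naturally forces the validator to run only after the file handler has finished the move.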

EDIT: I may have found a config tag that specifies the delay between the scheduler declaring the task finished and the validator running, and I think C@H have set it to 0. Meaning that as soon as your scheduler request goes through, it tries to validate.
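
I'm not certain of the exact tag, so take this as a guess: a BOINC project's config.xml lists the validator as a daemon, and the stock validators accept a --sleep_interval argument (how long the daemon sleeps between passes). Whether that is the knob I saw, whether C@H really runs it at 0, and the app name below are all assumptions. A purely hypothetical fragment:

Code:
<!-- Hypothetical config.xml fragment for a BOINC project.  The daemon
     layout follows BOINC's usual scheme, but the app name and the
     zero-second sleep interval are only my guesses at what C@H might
     be running. -->
<boinc>
  <daemons>
    <daemon>
      <cmd>sample_trivial_validator --app camb_boinc2docker --sleep_interval 0</cmd>
    </daemon>
  </daemons>
</boinc>

With the validator effectively polling without pause, the window for the file handler to finish its move shrinks to nothing, which would match the behaviour I'm seeing.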
 

StefanR5R

Elite Member
Dec 10, 2016
5,562
7,927
136
I have stopped LHC from running on the machine that was getting the most errors. Here's a log file from one of the failed tasks:
https://lhcathome.cern.ch/lhcathome/result.php?resultid=176988322
There is stuff about "object not found" and "invalid object state" and "access denied" and whatnot in my log, whereas yours says "access denied" only.
@crashtech, incidentally I discovered a faulty WU on my host (which still has only a small fraction of failures) with an error log very similar to yours:
https://www.cosmologyathome.org/result.php?resultid=65190118

So, contrary to what I thought, this E_ACCESSDENIED failure mode is not tied to whether a host experiences many or few failures.