August 22, 2005 - 18:00 UTC
We are currently in the middle of the first of the scheduled daily 3-hour outages to clear out the large number of "antique" results. Some numbers will be in an addendum at the bottom of this post when the outage is over. Until then, here's a fun FAQ about the current situation:
Q: Where did all these antique results come from?
A: The client finishes a workunit and sends the result to the file upload handler, which writes it to our file system. The file upload handler has no connection to the database. Then the client contacts the scheduling server saying, "I just uploaded result XYZ, please give me more work." The scheduling server, which does have a connection to the database, normally updates the result entry and sends more work. However, if the result being sent back is so far beyond its deadline that the result entry in the database has already been purged, nothing happens and the file remains on the system. This problem is currently being fixed. One additional theory is that as more people start using BOINC (especially now that all new users have to use the BOINC version), more slow/busy computers are being employed which can't return results before the deadline.
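To make the failure path concrete, here is a minimal sketch (in Python, not actual BOINC code) of the scheduler-side logic described above. The names result_db and handle_scheduler_request are hypothetical; the point is simply that once the database row has been purged, nothing is left to mark the uploaded file for validation or deletion.

```python
# Hedged sketch of how an "antique" gets stranded. Purged DB rows are
# modeled as missing dictionary keys; real BOINC uses a MySQL database.
result_db = {"XYZ": {"state": "in_progress"}}  # purged rows are simply absent

def handle_scheduler_request(result_name):
    """Client reports: 'I just uploaded result <name>, give me more work.'"""
    row = result_db.get(result_name)
    if row is None:
        # Result is so far past deadline that its DB row was purged:
        # nothing gets updated, and nothing ever schedules the uploaded
        # file for deletion. It sits on disk as an antique.
        return "no DB row; %s stays on disk as an antique" % result_name
    row["state"] = "over"  # normal path: validator/assimilator/deleter take over
    return "recorded; %s will be validated, assimilated, then deleted" % result_name

print(handle_scheduler_request("XYZ"))    # normal case
print(handle_scheduler_request("OLD42"))  # antique case
```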
Q: So why is this a problem?
A: Because so many antique files were being left on our server, validation slowed down. We calculated that about 40% of the result files in the upload directories are antiques. To validate (or assimilate, or delete) any result, the file system needs to do a "directory lookup" to find the files before it can read them. The more files there are in a single directory, the longer each lookup takes.
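Here is a rough, self-contained sketch of why oversized directories hurt. It times a full directory listing as the entry count grows, which scales with directory size on any filesystem; on filesystems of this era (e.g. ext3 without dir_index), individual name lookups degraded the same way, which is what slowed the validator. Counts and filenames are illustrative, and the creation step takes a few seconds at the largest size.

```python
import os, tempfile, time

def listing_time(n_files):
    """Create n_files empty files in one directory and time a full listing."""
    with tempfile.TemporaryDirectory() as d:
        for i in range(n_files):
            open(os.path.join(d, "result_%06d" % i), "w").close()
        start = time.perf_counter()
        os.listdir(d)  # walk every directory entry, like a worst-case lookup
        return time.perf_counter() - start

for n in (1_000, 10_000, 100_000):
    print("%7d files: listing took %.4f s" % (n, listing_time(n)))
```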
Q: Are there other reasons the directories got so huge?
A: Actually, yes. We had a master science database crash a month ago. It was recoverable, but for several days the assimilators had to be shut off because they couldn't write to the database. This forced regular non-antique files to linger on disk longer than they normally would. Then we discovered a logic problem in the file deleter (the process that removes files from disk after they finish assimilation). Workunits and results were being deleted from disk at the same rate, when results should have been deleted at four times that rate (since there are four results for each workunit). So result deletion was being throttled by the workunit deletion rate. These problems have all been fixed, and the regular file deletion queue has been slowly but steadily dropping. However, this compounds the current validator backlog problem.
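A minimal sketch of the deleter fix, assuming a simple one-pass-per-iteration design. The 4:1 result-to-workunit ratio comes from the post; the queue structure and names are invented for illustration:

```python
from collections import deque

# Hypothetical deletion queues: roughly 4 result files per workunit file.
workunit_queue = deque("wu_%d" % i for i in range(100))
result_queue = deque("result_%d" % i for i in range(400))

RESULTS_PER_WORKUNIT = 4  # delete results 4x as fast as workunits

def deleter_pass():
    """One pass of the fixed deleter: one workunit, four results.
    The buggy version deleted one of each per pass, so results piled up."""
    if workunit_queue:
        workunit_queue.popleft()      # remove one workunit file from disk
    for _ in range(RESULTS_PER_WORKUNIT):
        if result_queue:
            result_queue.popleft()    # remove result files at 4x the rate

while workunit_queue or result_queue:
    deleter_pass()
print("both queues drain at the same time")
```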
Q: How could you possibly get one million results behind in validation? Doesn't this mean SETI@home/BOINC is a complete failure?
A: Here's a perspective check: in classic SETI@home, we only validate results after we give users credit. At this point in time, classic SETI@home is roughly 50 million results behind in validation. At other times the number has been in the hundreds of millions. Nobody notices because in classic, users get their credit first. The BOINC validation system is actually faster and more elegant, and the current situation was brought on by the several problems listed above, all of which are fixed or being fixed.
Q: Why do you need an outage to delete antiques?
A: It's much faster that way. When the system is running full bore, there are too many processes accessing the upload directories to make antique file deletion worthwhile - it would just slow everything down.
Addendum: Some fun numbers: All the uploaded results are randomly distributed into 1024 subdirectories. Last Thursday we removed 235,666 antique results from 44 subdirectories, and today 560,755 more from 105 subdirectories (796,421 results from 149 subdirectories so far). So about 14.6% of the subdirectories have been cleaned up in roughly 4 total hours of outage time.
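For the curious, the fan-out works roughly like this sketch: hash the filename and use it to pick one of the 1024 subdirectories, so uploads spread evenly and no single directory grows huge. The MD5-mod-fanout hash here is an assumption for illustration, not necessarily the exact function our upload handler uses.

```python
import hashlib

FANOUT = 1024  # number of upload subdirectories (from the post)

def subdir_for(filename):
    """Map a result filename to one of FANOUT subdirectories.
    Hash choice (MD5 mod FANOUT) is illustrative, not the project's exact scheme."""
    h = int(hashlib.md5(filename.encode()).hexdigest(), 16)
    return "%03x" % (h % FANOUT)

for name in ("result_123_0", "result_123_1", "result_456_0"):
    print(name, "-> upload/%s/%s" % (subdir_for(name), name))
```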