Rattledagger
Latest News:
Due to the various problems, no new stats dumps have been generated, which means there are no new stats to post... So expect a jump, possibly after today's normal outage...
Technical News: July 25, 2008
We've started creating work for SETI@home's new Astropulse application. At first we will just create a small amount, but we expect to enter full production next week.
May I Have Another (Jul 28 2008)
Wow. What a weird weekend. A lot of little minor things went wrong causing a bunch of "perfect storms" in succession. I have a technical term for this which I can't say in public. Anyway, I'll spell some of it out in no particular order and in varying amounts of detail.
Our workunit storage server filled up again. We got the warnings too late, as mounting problems were keeping the server status scripts from running, which obscured a rather large assimilator queue backlog. When results stay on disk waiting to be assimilated, so do their respective workunits. Plus, with Astropulse ramping up, those giant workunits were filling up the storage faster than usual. Eric had already put in code for the splitter (which generates the workunits) to check for a full disk before attempting to write anything. Of course, this fix had only been deployed in beta so far. The result: there are about 20,000 workunits of zero length, which will cause annoying errors for all clients trying to download them, but they should pass through like kidney stones before too long. For a while I stopped the splitters to reduce the disk usage. Today we put the updated splitter in the main project.
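For anyone curious how zero-length files like these could be spotted on a storage volume, here is a minimal sketch. It is not the project's actual tooling; the function name `find_zero_length_files` and the idea of scanning a directory tree are assumptions made purely for illustration.

```python
import os


def find_zero_length_files(root):
    """Return the sorted paths of regular files under `root` that are 0 bytes long.

    Hypothetical helper: the real workunit layout and cleanup process
    are not described in the post.
    """
    empty = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            # Only count regular files; this skips broken symlinks and the like.
            if os.path.isfile(path) and os.path.getsize(path) == 0:
                empty.append(path)
    return sorted(empty)
```

Calling `find_zero_length_files("/some/workunit/dir")` (a placeholder path) would list the files that clients would fail to download.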
We've been having general scheduler problems over the last week as BOINC code updates were made in preparation for Astropulse. We hadn't built a new scheduler process in a while, which brought to light several problems, mostly due to our database schema being outdated and therefore out of sync with what the code expected. This didn't cause any data corruption, but caused random hosts to be unable to connect. For no real good reason a lot of the hosts reporting problems were Macs, which added to the difficulty of diagnosis - we thought it was an architecture-dependent issue at first. In any case, we came to understand those problems late last week and planned to clean it all up early this week. There was some miscommunication, and the new "broken" scheduler was turned on again last Friday for about a day.
On Sunday our bandwidth dropped to zero. At this point we threw up our hands and decided we'd figure this out when we were all in the lab together on Monday (today). Remember, we do have a policy that it is perfectly okay for our project to be down for a day or two, as this is BOINC and people can crunch on other projects in the meantime. Nevertheless, we don't want to be too cavalier about that, as we know a lot of people just crunch SETI data. But still, given our meager resources our average uptime is quite good, so a day or two of occasional downtime is acceptable. But I digress... Turns out Apache was the problem on this server (once again a problem obscured by alerts not running due to mounting issues) and we had to kick it a couple of times (including a full system reboot due to messed-up shared memory segments) to get it going again. Once it was going, both download servers choked. So I had to kick both of them as well.
Then we ran out of work. Remember how I said we put a fix in the splitter to keep it from writing if the workunit storage server was full? Well, it was being extra cautious and not writing if the storage server was over 90% full. So as I write this paragraph we're low on work to send out, but Eric gave me permission to turn file deletion on in beta, so that'll clear up space soon enough and we'll generate fresh work.
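The kind of guard described above, refusing to write once the disk is more than 90% full, can be sketched roughly as follows. This is only an illustration and not the splitter's real code: `can_write`, `FULL_THRESHOLD` and the `usage_fn` parameter are made-up names, and the only detail taken from the post is the 90% figure.

```python
import shutil

# From the post: the splitter held off writing when storage was over 90% full.
FULL_THRESHOLD = 0.90


def can_write(path, threshold=FULL_THRESHOLD, usage_fn=shutil.disk_usage):
    """Return True if the filesystem holding `path` is at or below `threshold` full.

    `usage_fn` is an assumed hook so the check can be exercised without a
    real disk; by default it is the standard library's shutil.disk_usage,
    which returns (total, used, free) in bytes.
    """
    usage = usage_fn(path)
    fraction_used = usage.used / usage.total
    # "Over 90% full" means strictly greater than the threshold blocks writes.
    return fraction_used <= threshold
```

A caller would check `can_write(storage_dir)` before creating each new file and skip the write (rather than leave a zero-length file behind) when it returns False. The trade-off is the one the post ran into: a cautious threshold protects the disk but can starve the project of new work until space is freed.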
And oh yeah... we were slashdotted again on Sunday.
That's enough for today. We'll have the usual outage tomorrow (may be slightly longer than normal) and maybe start splitting some more Astropulse workunits to send out!
- Matt