SETI@Home technical news after today's unexpected power outage.

Rattledagger · Feb 28, 2006

Technical News:

February 28, 2006 - 21:15 UTC
We had a planned outage today to remove a couple more items from the server closet (the Classic SETI@home data server and several large, heavy disk arrays which contained the old science database). In order to safely do so, we wanted to power down several important machines so they wouldn't accidentally get bumped and go down ungracefully.

The Bay Area is having a rough winter, and a storm today brought lightning which knocked out power to the entire campus, including our lab, around 8am. Most of the servers went down without a hitch. And with the power off anyway we went ahead and cleaned up the closet as planned. We can now get behind the racks again without painful contortion.

Powering up the entire network is painful, as servers need to revive in a set order, and many hidden mounting issues come to light (that only get tickled by a reboot). Plus some drives needed some fsck'ing. Everything eventually booted up just fine, except for the master science database.
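For anyone wondering what "revive in a set order" amounts to in practice: it is a dependency-ordering problem, where a machine should only come up after the machines whose disks or services it relies on. Here is a minimal sketch in Python, assuming a made-up dependency map. The host names and the `boot_order` helper are hypothetical, for illustration only, and are not the actual SETI@home machines or tooling.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Hypothetical hosts, for illustration only. Each key maps to the set of
# hosts that must already be running before that host is powered on.
depends_on = {
    "file-server": set(),
    "database": {"file-server"},
    "web-server": {"file-server", "database"},
    "scheduler": {"database"},
}


def boot_order(deps):
    """Return hosts in an order where every host comes after its dependencies.

    Raises graphlib.CycleError if the dependencies are circular.
    """
    return list(TopologicalSorter(deps).static_order())


order = boot_order(depends_on)
print(order)
```

A circular dependency (two machines that each mount the other's disks, say) makes `static_order()` raise `CycleError`, which is the kind of hidden mounting issue that tends to show up only on a full reboot.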

One of the Fibre Channel loops disappeared on this particular server. Bad cable? Bad GBIC? Not sure just yet, as the terminal wasn't working well enough to give us all the boot diagnostics. We hooked up a laptop and fought with HyperTerminal to see these messages, but by the time we got that working the machine booted just fine for no apparent reason... but all the metadevices needed to be resynced. This resync could take up to 24 hours, during which the master science database will be down. That means no splitting and no assimilating, and we'll probably run out of work to send before too long. Oh well.

February 28, 2006 - 00:30 UTC
Just a quick update so you know we haven't disappeared. We've entered a phase of massive cleanup - moving machines around in preparation to put newer ones in the server closet. Since we were cracking the whole system open we figured we might as well bite the bullet and clean all our /usr/local's, update old versions of software, etc. So naturally, everything broke. The last couple of weeks have been spent playing a non-stop game of Whac-a-Mole, trying to fix one minor broken thing after another. You may have noticed some of these failures. For example, the user-of-the-day selection was stuck for a week due to a broken path.

There were some other minor issues. One of the assimilators kept crashing with no error messages - after some painful debugging we found it was freaked out by a single corrupt record in the database. But other than that there has been slow, steady progress. The new data recorder is nearing completion (being stress tested at this point), and we're planning to move more old servers out of the closet tomorrow.

networkman · Feb 28, 2006

It's the aliens causing all these problems - I just know it is.. but how to prove it..? 😕
