News from SETI@home/AstroPulse Beta:
Note: it's currently in beta, so you'll need to attach to http://setiweb.ssl.berkeley.edu/beta/ if you want to test it out.
Dec. 10, 2008
We've released a new SETI@home app that runs on NVIDIA GPUs. Please run it if possible; you'll need to install a new driver. Instructions are here. If you encounter problems, please post on the SETI@home Enhanced message board.
The beta has both SETI wu's and Astropulse wu's, and just like on the main project, you can choose to get only one of the types by editing your project-specific preferences.
It seems to be Windows-only for the moment, and Nvidia-only, so unfortunately I have no way to test it.
Lastly, you'll need to use BOINC client v6.4.5 or later, which is currently being alpha-tested. There have been many more-or-less chaotic changes to work scheduling, so it's possible you'll suddenly be sitting on a year's worth of wu's, or idling on all cores...
Until "stable" release, don't be surprised if a new build every 1-2 days, that can be asked to upgrade to...
BTW, since this is the first public beta test of SETI@home's CUDA application, it wouldn't be surprising if various bugs pop up and new applications are released. So I wouldn't recommend running with a large cache... There's little point sitting on a 10-day cache crunching v6.05 wu's if v6.06, v6.07 and v6.08 have already been released...
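If you want to keep the cache clamped down while testing, the client-side override file is the simplest lever. A minimal sketch - the values are only a suggestion; the BOINC client reads global_prefs_override.xml from its data directory:

    <!-- global_prefs_override.xml: cap the queue at roughly a quarter
         day of work; local prefs here override the web-based ones -->
    <global_preferences>
        <work_buf_min_days>0.1</work_buf_min_days>
        <work_buf_additional_days>0.15</work_buf_additional_days>
    </global_preferences>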
Being a beta, it's possible it won't work at all, or will screw up something else, so don't use it on computers that can't handle downtime.
Normal SETI@home-News:
December 8, 2008
Giuseppe Cocconi, author (with Philip Morrison) of the first paper on radio SETI, 'Searching for Interstellar Communications' and a central figure in particle physics and cosmic rays, passed away on 9 November at age 94. Read more about Cocconi.
December 8, 2008
Overland Storage donated 10 Terabytes of network attached storage to us today (in the form of one Snap Server 18000 and two SD30 expansion units). This will vastly help with our current workunit/raw data storage problems. Thank you, Overland!
Technical News:
Done with the Eternal (Nov 24 2008)
Welcome back from the weekend, which was actually relatively painless except for the usual set of automounter issues. We're close to giving up on all that. Today was a day filled with lots of chores - including trying to maximize how much raw data we have on line for splitting over the long weekend.
We did have a server hiccup today due to an administrative script corrupting an /etc/passwd file (thanks to the aforementioned automounter problems). It's hard to maintain a server if the "root" user disappears from the passwd file. So I had to boot from DVD to fix the corrupt file. It just so happens this was the server I was having BIOS issues with last week, and they happened again! Without my consent it reset the boot drive sequence, causing a little bit of annoyance and grief. Eric and I are thinking there's a dead battery involved.
Reminder: this is a "short week" for us, thanks to turkey day.
- Matt
Nefarious Designs, Inc. (Nov 25 2008)
Happy Tuesday. We had the usual outage rigamarole today and should be recovering from that in due time. Right after the backup was finished we restarted mysql with full query logging turned on. We knew this would choke the server a bit, and would just be on temporarily. After about a half hour we had over a million queries in the log, so we brought everything back down and turned logging off. We'll parse this log file, and perhaps others we generate over the next 24 hours, in order to find pesky unoptimized queries, anything that would die if we remove all multidimensional indexes, or queries running far more often than we expected.
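For the curious, full statement logging in mysql of that vintage is a server-side switch. A sketch, assuming a 5.0-era server (5.1 and later can toggle it at runtime with SET GLOBAL general_log = 'ON'):

    # my.cnf, [mysqld] section: log every statement the server receives.
    # Heavy on I/O, so switch it back off once you have your sample.
    log = /var/log/mysql/general-query.log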
Also during the outage I moved some big directories around - more NAS shell games. Other than that I've been reconfiguring some more web server stuff (internal use pages) and trying to maximize the raw data pipeline plumbing to get as much work online as possible. It doesn't help that a lot of our raw data drives are showing weird signs of corruption. Don't worry - we do checksums at every important transfer to ensure the data are sound, and the splitters cannot operate on garbage (there are keyword strings occurring regularly throughout the files). Nevertheless, we're having to throw away some files, which is sad. My spider sense tells me this has to do with our general SATA enclosure mounting/unmounting woes. For example, we're finding drives that are 500GB thinking they are 750GB when mounted. Was this because a drive previously on that mount point was 750GB and some bookkeeping bits hadn't been cleared? I dunno, but I'm sure this isn't good.
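The checksum step is nothing exotic, for what it's worth. A sketch of how such a verification might look (file name hypothetical):

    # At the recording end, note a checksum as each raw data file is written:
    md5sum 12mr08aa.dat >> manifest.md5
    # After the drive arrives and the files are copied off, verify the lot:
    md5sum -c manifest.md5    # prints OK per file and flags any mismatch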
In a couple hours I get to call a number where an automated voice will tell me if I have to attend jury duty tomorrow or not. I get dragged in for potential jury duty an astonishing amount (pretty much the legal maximum) considering I never actually get selected for trial, and never will.
- Matt
Crisis Management (Nov 26 2008)
Oops. My web configuration changes yesterday afternoon seemed to work at first (I checked the logs, tested it myself, etc.), but something bad got exercised, probably at the next web log rotation (which quickly stops/starts the web server), which then made it impossible for people to see the home page for a couple of hours. Instead they got a broken link to our subversion page (an interface to our freely available source code). My bad. I fixed this as soon as I noticed it later in the evening.
Later on we had some weird behavior on the scheduling server (anakin), where it ran out of memory due to too many httpd/cgi processes running. It actually recovered on its own around midnight, then got choked up again. Nothing had really changed as far as our configuration or our executables, so we restarted it again this morning with the "ceiling" process limit values lower than before. However, I noticed the fastcgi's were growing as they stuck around. A memory leak, perhaps? Dave pointed out we have been doing client logging the past couple of weeks (which we usually don't do). Maybe that part of the code contains a leak - he's checking. Maybe that, combined with the short period of mysql query logging slowing everything down, caused the scheduler fastcgi processes to bloat. Not sure exactly, but we turned client logging off, and I added another flag to the fastcgis to force them to exit from time to time regardless of error, just to make sure they don't bloat for too long and eat up RAM. I also finally bit the bullet, figured out our broken/wonky web log rotation system given all the above, and fixed all that (I think).
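A crude cron-job check can catch that kind of bloat before it eats a machine. A sketch, assuming the fastcgi processes answer to a name like fcgi_scheduler (hypothetical here) and a ~512 MB ceiling:

    # ps reports rss in KB; complain about any process over 512 MB.
    ps -C fcgi_scheduler -o pid=,rss= | while read pid rss; do
        [ "$rss" -gt 524288 ] && echo "fastcgi $pid at ${rss} KB" \
            | mail -s "scheduler fastcgi bloating" admin@example.edu
    done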
Obviously I didn't get dinged with jury duty this time around, though last night the automated reporting instructions hotline told me to call again today at 11am for further instructions. So I did, but then the service kept saying it was "unavailable at this time." You know, I tried. Anyway.. Happy day of turkey. Actually I think we're having goose this year. Jeff and I will both be around and checking in from time to time (as usual).
- Matt
Keep in Touch (Dec 01 2008)
Welcome back from the holiday weekend, those who actually had a holiday weekend. Things were more or less calm around here. However, thanks to our predictable nemesis autofs, some things got a little murky yesterday. The mysql replica lost contact with the master - a regular occurrence - but we didn't get the warnings, as mail was hung on a dead mount. Now that the replica has fallen behind (though it is catching up), the stats/server pages are a bit behind as well. This will clean itself up in due time. A few hours, perhaps.
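For those playing along at home, the lag in question shows up in one place. A sketch of the kind of check our alert scripts boil down to (host name hypothetical, credentials elided):

    # Seconds_Behind_Master is NULL when the replica has lost the master
    # entirely, and a (hopefully shrinking) number while it catches up.
    mysql -h replica-host -e 'SHOW SLAVE STATUS\G' \
        | grep -E 'Slave_(IO|SQL)_Running|Seconds_Behind_Master'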
Otherwise work/data seems to be flowing normally, or normal enough. Dave incorporated some new scheduler logic (not sure what offhand) that is being tested in beta, probably rolled out to the public tomorrow. I'm bouncing around between data management, radar blanking code, and OS upgrade projects today.
- Matt
You Know You Know (Dec 02 2008)
Typical Tuesday outage day today (for database maintenance), and currently we're in the midst of a smooth recovery from that, more or less. Things sometimes seem weirder on the server status page than they actually are, as the replica database (where we collect the stats) is too far behind the master. Sometime soon I'll add some stats to show this, hopefully reducing confusion (and fix the broken XML stuff while I'm at it).
Major improvements during the outage: Jeff put in some freshly compiled servers that went into beta last week, Bob rebuilt an index on the result table that had been missing for some time (used for occasional statistics Eric checks by hand), and I changed the data selection priority to match between the Astropulse and Multibeam splitters (so they chew on the same files at the same time - making it easier to determine who's splitting faster).
I've also been busy with other sysadmin-y tasks: moving accounts around (still), kicking one of our internal diagnostic cronjobs that has been hanging on stale lock files in /var/lib/rpm, data pipeline management (including shipping empty drives to Arecibo), and messing around with FC10.
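The /var/lib/rpm hang is the classic stale Berkeley DB lock symptom, by the way. The usual remedy, after confirming nothing live is actually using the database:

    fuser -v /var/lib/rpm/__db.*    # confirm no running process holds the locks
    rm -f /var/lib/rpm/__db.*       # clear the stale lock/region files
    rpm --rebuilddb                 # rebuild the rpm database indexes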
- Matt
Surviving a Methodology (Dec 03 2008)
Ah, Wednesday. It's usually today when Jeff and I swap our "focus." Early in the week I'm aimed at hardware/sysadmin and he's deep in software development, and then later in the week we switch. This is an attempt to make sure we both get some programming time while the other person is taking the helm. He's mostly working on the NTPCkr, and me on radar blanking stuff. Both projects are slow going.
There are a lot of chores we both manage. Maintaining the raw data pipeline eats up an astonishing amount of time, so we swap those duties as well. Simply "walking the beat" - chasing down alerts, fixing hung processes and broken services - could easily eat up a whole day, every day, if we're not careful. Today a huge chunk of my time was spent moving home accounts off the old server onto the new one (and cleaning up a bunch of old garbage in the process). I also lost an hour with Jeff trying to figure out why his subversion repository was out of sync in such a way that he couldn't check changes in. I did get a moment to get the latest version of the software radar blanking signal generator to compile - and I just started a test run.
- Matt
Crazy (Dec 05 2008)
Happy Friday! I don't really have much to add to the proceedings.. today was a lot like Wednesday when last I was here at the lab. Time spent on more filesystem shell games, compiling/running code, and working with Josh to figure out some weird discrepancies between beta/public Astropulse results.
I should point out I added a couple more stats to the server status page: mysql queries per second, along with the number of seconds the replica is behind the master. Maybe this will help clarify when things go awry, though I know sometimes more information obscures the pertinent stuff.
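The queries-per-second figure is just the delta of mysql's cumulative Questions counter over a sampling interval. A sketch of the arithmetic (not necessarily how the status page computes it):

    # Sample the global statement counter twice, 10 seconds apart.
    q1=$(mysql -N -e "SHOW GLOBAL STATUS LIKE 'Questions'" | awk '{print $2}')
    sleep 10
    q2=$(mysql -N -e "SHOW GLOBAL STATUS LIKE 'Questions'" | awk '{print $2}')
    echo "queries/sec: $(( (q2 - q1) / 10 ))"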
I foresee a couple of dams breaking in the very near future, resulting in massive server closet updates/upgrades including, but not limited to: shutting down the incredibly solid (but physically large and logically small) NetApp rack to be replaced by a 3U system with twice the storage, thus making room to (finally) put vader and sidious in the closet, along with several UPSes and another CPU server, clarke, which has been waiting too long to be employed. Sometimes these things have to happen serially. Ducks in a row and all that.
- Matt
Go Gone Green (Dec 08 2008)
Happy Monday, folks. Things were sort of okay over the weekend. The replica mysql database got stuck on Sunday - the usual drill - I logged in and quickly restarted it. The science database also choked; that happened on Friday. Jeff's been doing some NTPCkr testing that would have run all through the weekend, except the excess I/O ate up all the informix threads, causing the splitters/assimilators to slow down and run out of work to send. Luckily I caught this before bedtime that night and broke that dam. Jeff's looking into why it happened.
In good news, Overland Storage (formerly Snap Appliance, then Adaptec) donated 10 Terabytes of NAS storage in the form of a new "head" and two expansion units. One of the expansion units we'll try to get onto our current workunit storage server ASAP (so we stop running out of room to split new work), and with the rest we'll make a new temporary (possibly permanent) raw data reserve so we can play the big shell game and convert all the science database devices from RAID5 to RAID10. Thanks, Overland!
- Matt
Simple Life (Dec 09 2008)
Tuesday outage day (mysql database backup/maintenance). Today Bob took care of the final step of the "single vs. multi-dimensional indexes" exercise. That is, he dropped all the multi-dimensional indexes on the result table in the main project on the master database, and we crossed our fingers. It looks like mysql is neatly, or smartly, parsing queries and merging single indexes as needed just fine. The whole point was to reduce the number of indexes we need, and thus keep a slightly smaller footprint in memory, which in turn helps performance.
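A sketch of what that exercise boils down to in SQL - the index and column names here are illustrative rather than our exact schema:

    # Drop a composite index, keeping the single-column ones:
    mysql -e 'ALTER TABLE result DROP INDEX ind_state_user'
    # Then confirm the optimizer copes by merging what remains:
    mysql -e 'EXPLAIN SELECT id FROM result WHERE server_state=2 AND userid=42'
    # "index_merge" in the EXPLAIN type column means mysql is intersecting
    # the single-column indexes on its own.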
The raw data pipeline has been a major headache, if only because our hot-swap enclosures have been giving us grief. Jeff and I determined one of them is flat out broken, so that reduces our current maximum throughput by half until we get it replaced. This isn't a disaster, as we pretty much never reach half of our maximum throughput anyway, but still a slight inconvenience as we have to more rigorously schedule drive swaps.
Gearing up for the donation drive, I discovered our mass mail server had lost its DNS entry for some reason. The lab DNS master replaced it, but not before I had turned sendmail on an hour earlier and started my tests, thus causing all kinds of circular bounces that clogged the entire lab's mail queue with literally thousands of e-mails (maybe tens of thousands). It's still draining as I type this. Don't blame me - I didn't remove that DNS entry.
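If you ever find yourself in the same boat: mailq shows what's stuck, and sendmail -q forces a queue run rather than waiting for the next scheduled one.

    mailq | tail -5    # peek at what's still sitting in the queue
    sendmail -q -v     # force a queue run right now, verbosely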
We're another step closer to removing that NetApp box. In fact, it's out of the automounter maps, everything on it is sym-linked elsewhere or chmod'ed to 0, and I scoured all the other servers to remove sym-links to it. Part of this project meant resurrecting server "clarke" (donated many months ago) to be a CPU server (or otherwise for internal use), as it will soon have room in the closet. It had a stale configuration at this point which needed refreshing.
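Scouring servers for leftover links is mercifully scriptable, at least. A sketch, with the NetApp's automount path hypothetical:

    # List any symlink under /home still pointing into the old automount
    # path, so nothing breaks when the box disappears for good.
    find /home -type l -lname '/net/netapp/*' -print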
No news on the Overland boxes - though one question was: why not combine them into one big box? Well, we have two separate needs: workunit storage and raw data storage. The former we already have, and it works great - we just need more room - so we'll plug in one of the new expansions and get that room. The latter we don't really have, and we'd like to keep it on separate volumes (you read the raw data and write out workunits, so you don't want the I/O to compete as it would on shared drives). Also.. part of the deal is that we're going to continue helping them beta test their latest OS, which they have on the second head unit they gave us. So in a sense we're obliged to have two separate entities - the raw data on the beta-test head/expansion and the workunits on the known-reliable head and additional expansion. Other question: form factor - the heads are about 2U and the expansions are about 3U. We have 2 of the former and 3 of the latter now. We'll have room for them eventually. I will update the closet photos when we do the next major move (next week, I hope?).
- Matt
Stats will follow "soon"...