SETI@Home/BOINC finished moving results to new array.

Rattledagger

Elite Member
Feb 5, 2001
September 8, 2005
Outage Notice: We are now in the middle of a day-long outage to reconfigure our upload/download file systems. Uploads and downloads will be unavailable at this time. This should vastly improve our general system performance.

Technical News:
September 8, 2005 - 17:00 UTC
We are now moving all the results from the upload/download file server onto a separate file system (directly attached to the upload/download server). We are copying as fast as we can - our early estimates show this will take about 24 hours. After that we will turn on all the backend processes and drain all the queues throughout the night. Tomorrow morning we will turn everything back on. Since the upload directories will be on local disks, and the download directories won't be bogged down with upload traffic, we should see a vast improvement in performance.

September 7, 2005 - 20:30 UTC
A temporary solution to our current woes is at hand. In fact, it's already half implemented. During our regular weekly database-backup outage we dismantled the disk volume attached to our old replica database server (which hasn't been in use for months) and attached it to the E3500 which is currently handling all the uploads/downloads. Right now a new 0.25TB RAID 10 filesystem is being created/synced. Should take about a day.

This space should be enough to hold the entire upload directory, but that's all. Thus we are splitting the uploads and downloads onto two separate file servers, with the upload disks directly attached to the server that writes the result files.

When the system is ready, we estimate it will take about half a day to move the upload directories to the new location, during which all services will be offline. This may happen very soon.

Note that this is not a permanent fix, but something has to happen ASAP before a new client (or new hardware) arrives. We'd rather have both the upload and download directories on directly attached storage, but we currently don't have the disk space available. And the disks we are going to use are old, with a potentially high rate of failure (there are several hot spare disks in the RAID system). But we're running out of space as the queues fail to drain, so we're out of options.
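For a rough sanity check on that "about a day" figure, here is a minimal back-of-the-envelope sketch. Only the 0.25 TB size comes from the bulletin; the effective sync rate is an assumed value chosen for illustration (old disks, server still busy serving uploads).

```python
# Back-of-the-envelope estimate of the RAID 10 sync time mentioned above.
# The 0.25 TB size is from the bulletin; the 3 MB/s effective sync rate is
# an assumption for illustration only.

VOLUME_BYTES = 0.25 * 1000**4   # 0.25 TB
SYNC_RATE_BPS = 3 * 1000**2     # assumed 3 MB/s effective mirror-sync rate

hours = VOLUME_BYTES / SYNC_RATE_BPS / 3600
print(f"Estimated sync time: ~{hours:.0f} hours (~{hours / 24:.1f} days)")
```

At that assumed rate the sync comes out to roughly 23 hours, consistent with the "about a day" estimate in the bulletin.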
 

Rattledagger

Elite Member
Feb 5, 2001
Updated Technical News:
September 9, 2005 - 17:00 UTC
(This is an update to a post made yesterday)

We are now moving all the results from the upload/download file server onto a separate file system (directly attached to the upload/download server). We are copying as fast as we can - our early estimates were a bit off. Now we see that the entire file transfer process will take about 48 hours, all told (it should finish Saturday morning, Pacific time). After that we will turn on all the backend processes and drain all the queues. Since the upload directories will be on local disks, and the download directories won't be bogged down with upload traffic, we should see a vast improvement in performance.


 

Rattledagger

Elite Member
Feb 5, 2001
Originally posted by: Assimilator1
Thx RD

I take it this is part of the scaling up for the big switch?

Well, not quite; it's more that the SnapAppliance can't even handle the current load, so they're seeing whether putting the uploads on some old disks instead will help out...


September 4, 2005 - 23:00 UTC
So we're still suffering from the inexplicably slow reads/writes to the file server that holds the results and workunits. This server worked much better in the past - we're not sure what changed. Perhaps just the influx of new users?

We tried some reconfiguration today, none of which helped. For example, we moved some services off of Solaris 9 machines onto Solaris 10 machines. The Sol 10 machines seemed at first to be able to access the file server much better, but when push came to shove this was simply not true. Linux machines don't fare much better, either.

Basically, nothing we try helps because the nfsd's on the file server are always in a disk wait state. You can add a million validators/assimilators/etc. running on the fastest machines in the world, but nothing will improve if the disks are on hold.

Meanwhile, the queues are barely moving, only a fraction of the backend services are actually running, and the filesystem is filling up again. So the big question is: what are we going to do about this?

Several people on the message boards suggested we split the upload/download directories onto separate servers. This has always been our plan, but due to lack of hardware this is difficult to enact. We don't have an extra terabyte just hanging around ready to use. Though we've made plans to move a bunch of things around in order to make some room, we would still need a really long outage (ugh) in order to copy the upload or download directories to the new space.

An even better (and quicker) solution is to release the new SETI@home/BOINC client, which does a lot more science (with much better resolution in chirp space) and therefore takes much longer to complete each workunit. While this will not affect user credit (as BOINC credit is based on actual work, not the more arbitrary number of workunits), it will reduce the load on our servers by as much as 75% (maybe more), since there will be a lot fewer workunits/results to process. This should have an immediate positive effect on all our backend services, and then we can diagnose our disk-wait issues in a less stressful environment. We are still testing this new client, and the scientist/programmer doing most of the work on it will be returning from vacation shortly.

Others have mentioned we should just shut down SETI classic to make use of the extra hardware. This won't happen at least until we finish improving the BOINC user interface and get the aforementioned new client released. Even after classic is shut down there is, at best, a month of post-project data management before we could make use of these servers, and then there would be at best a week of OS upgrades, hardware configuration, etc. In short: not an option.
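To make the "as much as 75%" figure above concrete: if each workunit from the new client takes roughly four times as long to crunch (the bulletin only says "much longer"; the 4x multiplier is an assumption for illustration), the same hosts return a quarter as many results per day, so uploads, validations, assimilations and deletions all drop by about 75%. A minimal sketch:

```python
# How a longer-running client cuts backend load.
# The 4x runtime multiplier and the current daily result rate are
# hypothetical; the bulletin only says workunits will take "much longer".

runtime_multiplier = 4.0           # assumed: each workunit takes 4x longer
results_per_day_now = 1_000_000    # hypothetical current daily result rate

results_per_day_new = results_per_day_now / runtime_multiplier
reduction = 1 - results_per_day_new / results_per_day_now
print(f"Results per day with new client: {results_per_day_new:,.0f}")
print(f"Backend load reduction: {reduction:.0%}")   # -> 75%
```

Credit is unaffected because it is granted for the work actually done, not per workunit.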
 

Rattledagger

Elite Member
Feb 5, 2001
Latest update, hidden away in the forums:
Yes. We're gonna clear all queues first. There are still millions of files to delete. We want them outta here. Plus we want a full ready-to-send queue.

So it might be a while (a few hours? half a day?) after WFV reaches 0 before the schedulers are actually turned back on.

- Matt (Lebofsky)


A rough estimate shows the validator queue is decreasing by 60k wu/h and should hit zero before 04:00 GMT.
Ready to send is filling a little more slowly, at 40k results/h, but should hit 700k by 10:00 GMT, so both of these queues should be ready by whenever Berkeley wakes up on Sunday... :sun:

Since neither the assimilator queue nor the file_deleter queue is shown, I can't guess how long those will take...
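Those ETAs are just linear extrapolations of the rates shown on the status page. A small sketch of the arithmetic, with the observation time and starting queue sizes filled in with assumed values (the posts only give the rates, the 700k target and the predicted finish times):

```python
from datetime import datetime, timedelta

# Linear extrapolation behind the ETAs above. The observation time and
# starting queue sizes are assumed; only the rates (60k wu/h draining,
# 40k results/h filling) and the 700k target come from the post.

start = datetime(2005, 9, 10, 18, 0)   # assumed observation time, GMT
validator_backlog = 600_000            # assumed validator queue at that time
validator_rate = 60_000                # wu/h, draining

ready_to_send = 60_000                 # assumed ready-to-send size at that time
ready_rate = 40_000                    # results/h, filling
ready_target = 700_000                 # results

validator_eta = start + timedelta(hours=validator_backlog / validator_rate)
ready_eta = start + timedelta(hours=(ready_target - ready_to_send) / ready_rate)
print("Validator queue empty around:", validator_eta.strftime("%H:%M GMT"))
print("Ready to send hits 700k around:", ready_eta.strftime("%H:%M GMT"))
```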
 

Rattledagger

Elite Member
Feb 5, 2001
Looks like they turned uploads/downloads back on immediately after the validator queue hit zero, so everything was up again at 04:00 GMT.

As usual after any SETI@Home/BOINC outage, "classic" was shut down at the same time to give full network capacity to SETI@Home/BOINC, and "classic" is still down. Since the initial rush now seems to be over, it probably won't be long before "classic" is turned back on as well...



While after the previous outage it took a couple of days before it was possible to upload any results at all, today it took only a couple of minutes after connections were allowed before all results were uploading without any backoffs at all. :cool: The only thing left to wait on is the 10-minute delay between connections to the scheduling server. :beer:
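For anyone wondering what that 10-minute wait looks like from the client's side, here is a minimal sketch of a client honoring a fixed minimum delay between scheduler requests. The interval, names and structure are illustrative; the real BOINC client combines a server-requested delay with its own backoff logic, which isn't shown anywhere in this thread.

```python
import time

# Sketch of a client honoring a minimum delay between scheduler contacts.
# The fixed 600-second interval mirrors the 10-minute wait described above;
# names and structure are illustrative, not actual BOINC client code.

MIN_RPC_INTERVAL = 600.0   # seconds between scheduler requests
_last_request = 0.0        # time of the previous request (0 = never)

def can_request_work(now: float) -> bool:
    """True once the minimum interval since the last scheduler request has passed."""
    return now - _last_request >= MIN_RPC_INTERVAL

def request_work(now: float) -> None:
    """Record a (hypothetical) scheduler request."""
    global _last_request
    _last_request = now
    print("contacting scheduler...")

now = time.time()
if can_request_work(now):
    request_work(now)
else:
    print(f"waiting {MIN_RPC_INTERVAL - (now - _last_request):.0f}s before next request")
```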

 

Rattledagger

Elite Member
Feb 5, 2001
SETI@Home/BOINC has now stabilized, and "classic" has therefore also just been turned back on.


The SETI@Home/BOINC status page just added the assimilator and file_deleter queues, so it will be interesting to follow these...

Currently, as of 16:10:07 GMT:
Ready to send: 408,944 results
In progress: 1,063,211 results
Waiting in validator queue: 313,140 wu
Waiting in assimilator queue: 1,130,714 wu
Wu-deleter queue: 3,917
Result-deleter queue: 1,110,511
Transitioner backlog: 0 hours


For anyone who doesn't know: Ready to send is normally around 500k; at that point the splitters slow down until it drops below 500k again...
In progress is around 1M; the more users are running, and the larger the caches they keep, the larger this number gets.
The rest of the queues should ideally be zero...
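A minimal sketch of that ready-to-send throttle as I understand it from the status page: the splitters stop producing new work at the high-water mark and resume once the queue falls back below it. The names, batch size and the pretend work-generation loop are purely illustrative; the actual splitter code isn't shown anywhere in this thread.

```python
# Sketch of the ready-to-send throttle described above: splitters only
# generate new workunits while the queue is below the ~500k high-water mark.
# Names, batch size and starting value are illustrative.

HIGH_WATER = 500_000   # results; splitters pause at or above this level

def splitters_should_run(ready_to_send: int) -> bool:
    """Splitters generate new work only while the queue is below the mark."""
    return ready_to_send < HIGH_WATER

queue = 408_944                 # ready-to-send value from the snapshot above
while splitters_should_run(queue):
    queue += 10_000             # pretend the splitters add a batch of results
print(f"Splitters pause at {queue:,} results ready to send")
```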
 

petrusbroder

Elite Member
Nov 28, 2004
:D Thanks for the current and detailed info, Rattledagger. You are a solid pillar of information in this community and I am very grateful for your efforts and info. You save us a lot of time - we just have to read your posts and do not need to dig up the info ourselves.... Thanks again! :D
 

Rattledagger

Elite Member
Feb 5, 2001
Updated technical news:
September 11, 2005 - 19:00 UTC
After weeks of dealing with stymied servers and painful outages we're back on line and catching up with the backlog of work. It was a month in the making, but it was always the same problem - dozens of processes randomly accessing thousands of directories each containing up to (and over) ten thousand files located on a single file server which doesn't have enough RAM to contain these directories in cache.

Since this file server is maxed out in RAM, our only immediate option was to create a second file server out of parts we have at the lab. So the upload and download directories are on physically separate devices, and no longer competing with each other. The upload directories are actually directly attached to the upload/download server, so all the result writes are to local storage, which vastly helps the whole system.

While this is all very good news, it isn't the final step. The disks on the new upload file server are old - we'd like to replace this whole system at some point soon (something with bigger, newer, faster disks and faster CPUs).
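To see why the old single file server kept hitting disk, here is a rough estimate of how much RAM it takes to keep that many directory entries cached. The directory count and per-entry cost are assumptions for illustration; the bulletin only says "thousands of directories" with "up to (and over) ten thousand files" each.

```python
# Rough estimate of the metadata cache needed to keep the upload/download
# directories hot in RAM. The directory count and per-entry cost are assumed;
# the ~10k files per directory figure is from the bulletin above.

directories = 4_000            # assumed number of fan-out directories
files_per_directory = 10_000   # "up to (and over) ten thousand files" each
bytes_per_entry = 500          # assumed dentry/inode cache cost per file

cache_bytes = directories * files_per_directory * bytes_per_entry
print(f"Approx. cache needed: {cache_bytes / 1024**3:.1f} GiB")
```

If the real numbers are anywhere near these, the working set simply doesn't fit in the old server's RAM, so most lookups go to disk - which matches the constant disk-wait state described in the earlier bulletins.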


As for the current queue status, 4 hours after the last update it's now:
Ready to send: 460k
In progress: 1,076k
Waiting validation: 281
Waiting assimilation: 1,215,620
Wu-deletion: 1,978
Result-deletion: 754k
Transitioner: 0 h

This means the only queue going the wrong way is the assimilator queue, which has increased by 85k in 4 hours. But at the same time the validator queue dropped by 313k. Since every newly validated wu adds 1 to the assimilator queue, now that the validator queue has hit zero I'd expect the assimilator queue to start decreasing as well...
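Working from the two snapshots (taken roughly 4 hours apart), you can also back out a rough assimilation rate and a drain time for the backlog. A small sketch of that arithmetic, using the queue numbers quoted above (the 4-hour gap and the rounding are mine, and it assumes little new work entered the validator queue in that window):

```python
# Back out an approximate assimilation rate from the two queue snapshots
# above (~4 hours apart), then estimate how long the backlog takes to drain
# once the validator stops feeding it. Queue numbers are from the posts.

hours_between = 4
validated = 313_140 - 281                   # validator queue drop
backlog_growth = 1_215_620 - 1_130_714      # assimilator queue grew by ~85k

assimilated = validated - backlog_growth    # ~228k wu in 4 hours
rate_per_hour = assimilated / hours_between # ~57k wu/h

backlog = 1_215_620
print(f"Assimilation rate: ~{rate_per_hour:,.0f} wu/h")
print(f"Backlog drain time: ~{backlog / rate_per_hour:.0f} hours")
```

At that pace the assimilator backlog should clear in roughly a day once the validator stops feeding it.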
 

Pokey

Platinum Member
Oct 20, 1999
Thanks for the info RD.
I don't always understand the process and set-up completely, but your posts sure do help me toward that goal. :light:
Keep up the good work. :thumbsup:
 

Assimilator1

Elite Member
Nov 4, 1999
Originally posted by: Rattledagger
SETI@Home/BOINC has now stabilized, and "classic" has therefore also just been turned back on.


The SETI@Home/BOINC status page just added the assimilator and file_deleter queues, so it will be interesting to follow these...

Currently, as of 16:10:07 GMT:
Ready to send: 408,944 results
In progress: 1,063,211 results
Waiting in validator queue: 313,140 wu
Waiting in assimilator queue: 1,130,714 wu
Wu-deleter queue: 3,917
Result-deleter queue: 1,110,511
Transitioner backlog: 0 hours


For anyone who doesn't know: Ready to send is normally around 500k; at that point the splitters slow down until it drops below 500k again...
In progress is around 1M; the more users are running, and the larger the caches they keep, the larger this number gets.
The rest of the queues should ideally be zero...

:confused:
Man, BOINC is so confusing, I'll never remember what all this stuff is! (well, OK, I remember what validation is about :p)