Technical News, and SETI@home daily stats for 17. - 18.05.2007.

Rattledagger

Fast One (May 16 2007)

Quick note as I gotta catch a bus..

Wow - what a mess. I think we're in the middle of our biggest outage recovery to date, and it's breaking everything. The good news is we're coming into some newer hardware which we'll get on line to help somehow.

See Eric's thread in the Staff Blog. He's been working overtime getting a new frankenstein machine together to act as another upload/download server and reduce the load on bruno. The scheduling server (galileo) has been choking - I just now moved all that over to bruno as well. So we may retire galileo soon, too. Jeff has been going nuts trying to track down errors in validator/assimilator code so we can get those on line as well. And our old friend "slow feeder query" is back, probably just being aggravated by the heavy load.

Gotta go..

- Matt
This one could probably go in the technical news, but since I haven't blogged in a while, I decided to jot it down here.

Following the large outage, bruno's been having some problems keeping up. Lots of dropped connections. I guess most of you noticed that. It's not a lack of hardware this time, just an over-abundance of connection attempts.

Some of the dropped connections were local file-server connections, which causes some of the http processes to wait around, which in turn causes more dropped connections. Changing some of the TCP tuning parameters helped, but didn't solve the problem.

We did some brainstorming before the outage and have come up with some tactics to combat these issues.

We're setting up our router to proxy the SYN/ACK handshakes. That way if we are flooded, the connections will be dropped before they get to bruno. That'll in turn prevent the NFS connections from getting dropped.

We're also getting rid of some configuration remnants from earlier BOINC server code. Currently bruno handles all of the incoming connections and forwards them to other machines when appropriate for uploads and downloads. We can designate other machines as upload or download handlers so that bruno won't have to touch those connections at all.

If that's not enough, we'll set up web servers on some of the other machines and get back to round robin DNS for the upload and download servers.

Well, that's enough typing for now. This weekend, one of my fingers had an unfortunate meeting with the leading edge of a 120mm fan blade inside a server case. Fortunately the fan blade broke and it doesn't look like I'll lose the fingernail. I've learned my lesson: always approach case fans from the trailing edge.

--
Eric
Unfortunately the router couldn't handle the load so we're back to dropping connections at bruno. I spent the last few hours getting a bruno clone, which I have tentatively named Ptolemy, up and running. (It's not quite a clone: dual 3.06 GHz hyperthreaded processors rather than dual 2.8 GHz non-hyperthreaded. Where it came from is a story for another time.) I've got the OS installed and am at the point where Matt and/or Jeff need to work some apache magic in order to have it be usable in a round robin DNS with bruno.
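For readers unfamiliar with round robin DNS, here is a toy sketch of the idea: each lookup returns the same set of hosts, rotated by one, so clients that pick the first entry get spread across the machines. The class name and the in-memory list are made up for illustration; this is not how the project's DNS is actually configured.

```python
from collections import deque


class RoundRobinRecords:
    """Toy model of round robin DNS: every answer lists all hosts,
    rotated one position further than the previous answer."""

    def __init__(self, hosts):
        self._hosts = deque(hosts)

    def answer(self):
        current = list(self._hosts)
        self._hosts.rotate(-1)  # the next query starts with the next host
        return current


# Host names taken from the post; the pairing is only an example.
upload_servers = RoundRobinRecords(["bruno", "ptolemy"])
first_choices = [upload_servers.answer()[0] for _ in range(4)]
print(first_choices)
```

Successive lookups alternate which host is listed first, which is all the load spreading this scheme provides; it does not notice when one host is down or overloaded.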

I'm going to go get some dinner, then I'll mail Matt and Jeff with a progress report. I think they'll be surprised how far I've gotten this evening.

Eric
Addendumb: I had a 'd'Oh!' moment this morning. Apparently we were running with the upload timeout set at 20 minutes (which I think is the apache default), so our connections were being dominated by machines that couldn't get through, but were hanging onto the connection.

If you look at our network traffic, you can see what happened when I lowered that to 30 seconds... We're sending about 4 times as much work as we were when I got in this morning.
We're about to put "ptolemy" in the mix in the next few hours. I'll certainly let you know if we need more beyond that.
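A rough way to see why the shorter timeout described above helps so much: if every connection slot is busy, throughput is the number of slots divided by the average time a connection holds a slot, and stalled connections hold theirs until the timeout expires. The sketch below is a back-of-the-envelope model only; the slot count, stall fraction and transfer time are hypothetical numbers, not measurements from bruno.

```python
def completed_per_second(slots, stall_fraction, timeout_s, ok_time_s):
    """Successful transfers per second when every slot is busy.

    Assumes (hypothetically) that a fixed fraction of connections stall
    and hold their slot until the timeout, while the rest finish in
    ok_time_s seconds.
    """
    avg_hold_s = stall_fraction * timeout_s + (1 - stall_fraction) * ok_time_s
    connections_per_second = slots / avg_hold_s
    return connections_per_second * (1 - stall_fraction)


# Hypothetical inputs: 150 slots, 10% of connections stall, 5 s per good transfer.
with_20_minutes = completed_per_second(150, 0.10, 20 * 60, 5)
with_30_seconds = completed_per_second(150, 0.10, 30, 5)
print(with_20_minutes, with_30_seconds)
```

With these made-up inputs the long timeout lets a small share of stalled clients occupy most of the slots. The model is not meant to reproduce the observed traffic change, only to show the direction of the effect.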

Eric
We've moved the scheduler to bruno (from galileo) and both bruno and ptolemy are handling uploads. Only penguin is on download duty, but that may change if downloads start becoming a problem.

We'll round-robin the scheduler once we can get round-robin capable feeders built. Matt wasn't able to do it before he left for vacation.

Validators and assimilators are offline while Jeff tracks down a strange segfault. The std::vector<>::size() method is reporting an incorrect value, even though the pointers to the start and end of data are correct. IBTHOOM.

Apache on bruno hung last night in a weird state. Lots of httpd processes running, but no connections getting through. We'll need to come up with a way to detect that state and fix it without human intervention.
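One possible shape for that kind of automatic detection is a small watchdog that probes the server over HTTP and triggers a restart after several consecutive failures. This is only a sketch under assumptions: the names `http_probe` and `Watchdog` are invented here, the threshold of three failures is arbitrary, and the restart action is left as a callable because the post does not say how apache is restarted on bruno.

```python
import urllib.request


def http_probe(url, timeout_s=10):
    """Return True if the server answers a request within timeout_s."""
    try:
        with urllib.request.urlopen(url, timeout=timeout_s):
            return True
    except OSError:
        # URLError, HTTPError, timeouts and refused connections are all
        # OSError subclasses, so each of them counts as a failed probe.
        return False


class Watchdog:
    """Call restart() after max_failures consecutive failed probes."""

    def __init__(self, probe, restart, max_failures=3):
        self.probe = probe
        self.restart = restart
        self.max_failures = max_failures
        self.failures = 0

    def tick(self):
        """Run one probe; return True if a restart was triggered."""
        if self.probe():
            self.failures = 0
            return False
        self.failures += 1
        if self.failures >= self.max_failures:
            self.restart()
            self.failures = 0
            return True
        return False
```

In use, `tick()` would be called on a schedule (from cron or a sleep loop), with `probe` wrapping `http_probe` for a URL the server should always answer. Requiring several failures in a row avoids restarting on a single dropped connection, which matters on a server that is already dropping connections under load.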

Eric
It's still pretty hit and miss (as you can see). Hopefully it's getting toward more hit than miss at this point.

Ptolemy lost connectivity to the upload directories on bruno for a while. Just fixed that, so our upload rate should double.

This Graph is still your best bet of checking your chances of getting through. The higher the graph is, the better. But we should be hovering around 22 Mbps rather than 15.

We're still operating on a single scheduler due to compile problems. G'nite. I'll catch up on where we are in the morning.

Eric

#____Total Work Done____Todays WD_____AWD________overtake_______Team-name
01______259.283.081______807.210______342.616______impossible______SETI.USA
02______231.401.263______424.031______160.639______impossible______SETI.Germany
03_______98.336.456_______92.374_______43.326______impossible______BroadbandReports.com Team Starfire
04_______91.701.565______179.630_______90.167______impossible______L'Alliance Francophone
05_______82.232.959______140.351_______65.643______impossible______BOINC Synergy
06_______80.743.057______105.339_______41.977______impossible______Czech National Team
07_______69.645.543______101.105_______42.877______impossible______SETI@Netherlands
08_______46.368.618______145.479_______66.360______impossible______The Knights Who Say Ni!
09_______43.847.063________3.100________2.091______impossible______OcUK - Overclockers UK
10_______32.458.677_______-8.674_______-4.359______7.446 days______BOINC.Italy
11_______30.914.397________7.305_______19.402______impossible______Overclockers.com
12_______29.504.138_______54.606_______28.740______impossible______Team Art Bell
13_______26.254.691______104.391_______31.441______impossible______Team 2ch
14_______21.664.423_______58.915_______17.513______impossible______The Planetary Society
15_______16.143.340_______19.749_______10.379______impossible______Ars Technica
16_______14.115.365______110.117_______48.016______impossible______Team MacNN
17________2.874.839_______-4.772________9.386______impossible______Universe Examiners
18_______50.334.220_______76.487_______29.667______notanoption_____TeAm AnandTech
19_______-2.084.215_______16.084________6.322________330 days______Phoenix Rising
20_______-5.211.107_______70.801_______20.205________258 days______SETI@Taiwan
21_______-9.920.573________3.596_________-552______impossible______Hewlett-Packard
22______-10.075.065______-26.801_______-2.489______impossible______Amateur Radio Operators
23______-10.759.112_______25.182_______42.652________252 days______Team Starfire World BOINC
24______-11.203.213______-31.539_______-6.171______impossible______PC Perspective Killer Frogs
25______-11.331.584______-46.437______-16.741______impossible______Planet 3DNow!
26______-11.734.416_______28.124_______12.443________943 days______SETI@China
27______-11.987.093______-12.195_______-3.867______impossible______Canada
28______-12.743.632______-41.831______-11.291______impossible______2CPU.com
29______-13.941.413_______39.934_______18.071________771 days______Dutch Power Cows
30______-14.991.612________7.116_______-1.178______impossible______Team MacAddict
31______-16.141.904______-13.758_______-3.551______impossible______Team NIPPON
32______-17.857.345______-15.368_______-7.362______impossible______BOINC SETI@home RUSSIA
33______-18.085.241______-30.090______-10.447______impossible______BOINC@Denmark
34______-18.431.482______-33.872______-14.534______impossible______Portugal@Home
35______-19.213.819_______-9.626_______-4.429______impossible______Hungary
36______-19.392.208______-56.536______-22.556______impossible______Picard
37______-20.748.039______-50.043______-19.000______impossible______LittleWhiteDog
38______-21.173.362_______13.456________8.061______2.627 days______UK BOINC Team
39______-22.985.892_______11.545________5.623______4.088 days______US NAVY
40______-23.666.554______-29.607______-11.316______impossible______SETI@klamm.de
41______-23.845.956______-27.709_______-6.907______impossible______U.S.Air Force
42______-23.947.102______-15.783_______-4.087______impossible______BOINC@AUSTRALIA
43______-24.033.285______-40.344______-16.743______impossible______Team EDGE
44______-24.609.653______-21.334______-11.891______impossible______HispaSeti & BOINC
45______-24.803.975______-30.865_______-5.644______impossible______SETI.hr
46______-25.918.690______-44.296______-16.151______impossible______SETI Sverige [Sweden]
47______-27.560.178______-26.849______-10.412______impossible______BOINC UK
48______-28.499.205______-64.041______-21.891______impossible______The Final Front Ear
49______-28.725.061______-37.009______-16.431______impossible______SETI@Home Poland
50______-29.020.299______-43.330______-18.431______impossible______World Wide S.E.T.I.

Apart from Anandtech's own stats, the figures show how much more/less each team has than Anandtech.
The overtake column shows, based on Average Work Done, how many days until Anandtech overtakes the team, or is overtaken by a team behind.
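The overtake column can be reproduced from the other two relative columns. The sketch below is inferred from the table rows, not taken from the script that generated them: the gap only closes when the total-work difference and the AWD difference have opposite signs, and the number of days is then the gap divided by the AWD difference, rounded. The function name is made up for illustration.

```python
def overtake_days(gap, awd_diff):
    """Days until the gap to TeAm AnandTech closes, or None if it never does.

    gap      -- the team's total work minus AnandTech's total
    awd_diff -- the team's average work done minus AnandTech's
    """
    # Same sign means the team ahead is also the faster one, so the gap
    # only grows; a zero AWD difference means it never changes.
    if gap == 0 or awd_diff == 0 or (gap > 0) == (awd_diff > 0):
        return None
    return round(abs(gap) / abs(awd_diff))


print(overtake_days(-2_084_215, 6_322))   # Phoenix Rising row: 330
print(overtake_days(32_458_677, -4_359))  # BOINC.Italy row: 7446
```

A `None` result corresponds to "impossible" in the table, for example SETI.USA, which is both ahead and producing more per day.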

 

petrusbroder

Thanks for the news and stats, Rattledagger.
I still have some 20 WUs trying to get reported - 2 or 3 of them are due May 18 and May 19.
I hope I will get some creds for them - as they are uploaded but not reported ... we will see.