Noddy Goes to Sweden (Dec 12 2007)
Blech. The fallout from yesterday's business wasn't very pretty. The science database server had a migraine all night, thanks to the load-intensive index build and the mounting errors that followed from the heavy disk I/O. So the assimilators were off until this morning after we rebooted the system and cleared its pipes.
However, towards the end of the day yesterday I spotted something funny. Of our two scheduling servers, bruno and ptolemy, the former was refusing to send out any work. This wasn't a network issue, nor was it a real lack-of-work issue. There was plenty of work in bruno's queue, and the feeder had it all stowed up in shared memory ready to go, but the scheduler for no apparent reason was allowing none of it through. Clients were requesting N seconds of work and bruno would send them 0 workunits. The clients requesting the same N seconds of work on ptolemy were getting work. This was weird and nothing like anything we've seen before. Of course, bruno and ptolemy have identical kernels, scheduler executables, apache configurations, database permissions, file server permissions, network routes, etc. etc. etc. Jeff and I have been beating our heads on this for basically all last night and this morning and we still have no idea. Jeff's adding some new debug code to the scheduler as I type.
We do have a workaround - just dump all the traffic on ptolemy until we figure it out. We may very well do this by the end of the day if the real problem doesn't present itself.
Also in the "of course" department, this all happens just as soon as we start sending the mass e-mail requesting much needed funds for our project. We seem to have a bad track record of poor timing, but this is more about rotten luck than anything else. It's always some kind of struggle given our lack of resources. You should know this by now.
By the way, Bob is taking over adding a "median" form of the result turnaround time query and determining if it will hit the database as hard as I feared. Cool.
- Matt
The Story of the Hare who Lost his Spectacles (Dec 13 2007)
Roll up your sleeves, get the coffee brewing, etc.
So yesterday's "bug" hasn't been 100% solved yet, but there is a workaround in place. Here are the details (continued from yesterday's spiel): We have two redundant schedulers on bruno/ptolemy, both running the exact same executable (mounted from the same NAS, no less), on the exact same Linux OS/kernel. One was sending work, the other was not. By "not" I mean there was work available, but something was causing the scheduler processes on bruno to wrongly think that the work wasn't suitable for sending out.
Since this was all old, stable code, running on identical servers, this naturally pointed to some kind of broken network plumbing on bruno at first. A large part of the day was spent tracking this down. We checked everything: ifconfigs, MTU sizes, DNS records, router settings, routing tables, apache configurations, everything. We rebooted switches and servers to no avail. We had no choice but to begin questioning the actual code that has been working for months and happens to still be working perfectly on ptolemy.
Jeff attached a debugger to the many scheduler cgi processes and eventually spotted something odd. Why was the scheduler tagging the ready-to-send results in the shared memory (which is filled by the feeder) as "beta" results? We looked on ptolemy. They were not tagged as "beta" there. A clue!
Scheduler code was pored through and digested and it was determined this was indeed the heart of the problem - results tagged as "beta" were not to be sent out to regular clients asking for non-beta work. So bruno refused to send any of these results out - it was erroneously thinking these were all "beta" results. But why?!
After countless fprintf's were added to the scheduler code we found this actually wasn't the scheduler's fault - it was the feeder! The feeder is a relatively simple part of the back end which keeps a buffer of ready results to send out in shared memory for the hundreds of scheduler processes to pick and choose from. The scheduler plucks results from the array, creating an empty slot which the feeder fills up again. When the feeder first starts up it reads the application info from the database to determine which application is "current" and then gets the pertinent information about the application, including whether or not it is "beta." This information is then tied to the ready-to-send results as they are pulled from the database. We found that even though beta was "0" in the database, it was being set to "1" after that particular row was read into memory.
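To make the moving parts concrete, here is a rough sketch of the feeder/scheduler handoff described above, written in Python rather than taken from the project's actual code. Every name in it (`Result`, `feeder_fill`, `scheduler_pick`, the four-slot array) is made up for illustration; the real back end uses shared memory and hundreds of scheduler processes, and its details may differ.

```python
from dataclasses import dataclass


@dataclass
class Result:
    name: str
    beta: bool  # copied from the application info when the feeder loads the result


def feeder_fill(slots, ready_queue):
    """Fill every empty slot from the queue of ready-to-send results."""
    for i, slot in enumerate(slots):
        if slot is None and ready_queue:
            slots[i] = ready_queue.pop(0)


def scheduler_pick(slots, wanted, client_wants_beta=False):
    """Take up to `wanted` sendable results, leaving their slots empty."""
    sent = []
    for i, slot in enumerate(slots):
        if len(sent) == wanted:
            break
        if slot is None:
            continue  # nothing in this slot yet
        if slot.beta and not client_wants_beta:
            continue  # beta results are withheld from regular clients
        sent.append(slot)
        slots[i] = None  # the feeder will refill this slot later
    return sent


# Hypothetical "ptolemy-like" case: results are correctly tagged non-beta.
healthy = [None] * 4
feeder_fill(healthy, [Result(f"wu_{n}", beta=False) for n in range(6)])
print(len(scheduler_pick(healthy, wanted=2)))  # prints 2

# Hypothetical "bruno-like" case: every result is wrongly tagged beta.
broken = [None] * 4
feeder_fill(broken, [Result(f"wu_{n}", beta=True) for n in range(6)])
print(len(scheduler_pick(broken, wanted=2)))   # prints 0
```

The second case reproduces the symptom: the buffer is full of work, yet a regular client asking for it gets nothing.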
Was this a database connection problem then? We checked. Both bruno and ptolemy were connecting to the same database and getting at the same rows with the same values, so no. However, during this exercise we noted that the C struct in the BOINC db code for the application had an extra field "weight" and of course this was the penultimate field, just before the final field "beta." What does that mean? Well, when filling this struct with a stream coming from MySQL, whatever value MySQL thinks is "beta" will be put in the struct as "weight" and whatever random data (on disks or in memory) lies beyond that would be put in the struct as "beta." This has been the case for months, if not years (?!) but since these fields are never used by us (our beta project is basically a "real" project that's completely separate from the public project so its beta value is "0" as well), this never was an issue. We were fine as long as beta happened to be set to "0" (correctly or incorrectly) which it always had been...
...until JUST NOW! And only on bruno! This seems statistically impossible without any good explanation, but before getting lost down that road we put in a one-line hack which forces beta to be "0" no matter what bogus values get put in the oversized C struct, and immediately bruno was back in business. Until we get the whole gang in the lab at the same time and we can answer the final questions and confirm the appropriate fixes, it will remain this way.
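The off-by-one-field problem and the one-line hack can be mimicked in a few lines of Python. This is only an illustration of the mechanism as described above, not the BOINC code: the column names other than "weight" and "beta", the sample row, and the single `stray` cell standing in for "whatever random data lies beyond the row" are all assumptions.

```python
# Field order the code expects: an extra "weight" sits just before "beta".
# ("id" and "name" are placeholder columns invented for this sketch.)
STRUCT_FIELDS = ["id", "name", "weight", "beta"]


def parse_by_position(memory, fields):
    """Mimic unchecked positional parsing: field i is read from cell i."""
    return {field: memory[i] for i, field in enumerate(fields)}


def load_app(row, stray):
    """Parse a 3-column row that is followed in memory by one unrelated cell."""
    return parse_by_position(row + [stray], STRUCT_FIELDS)


def load_app_with_hack(row, stray):
    """Same parse, followed by the workaround of forcing beta to "0"."""
    app = load_app(row, stray)
    app["beta"] = "0"  # ignore whatever bogus value was parsed
    return app


row = ["1", "example_app", "0"]  # beta really is "0" in the database

print(load_app(row, stray="0")["beta"])            # prints 0 (lucky garbage)
print(load_app(row, stray="1")["beta"])            # prints 1 (unlucky garbage)
print(load_app(row, stray="1")["weight"])          # prints 0 (the real beta value)
print(load_app_with_hack(row, stray="1")["beta"])  # prints 0
```

The parse only looks correct while the stray cell happens to hold "0", which matches the "fine for months, then suddenly not" behavior; the hack hides the symptom without fixing the mismatch between struct and table.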
Now back to some actual programming (helping Jeff wrap up work on radar blanking code).
- Matt
#____Total Work Done____Today's WD______AWD________overtake________Team-name
01______379.627.624______606.246______540.496______impossible______SETI.USA
02______307.782.778______558.445______411.002______impossible______SETI.Germany
03______132.418.312______187.905______166.240______impossible______L'Alliance Francophone
04______114.242.809_______93.056_______58.734______impossible______BroadbandReports.com Team Starfire
05______110.493.415______139.479______117.471______impossible______BOINC Synergy
06_______96.277.224_______69.533_______45.829______impossible______Czech National Team
07_______89.001.560______105.735_______85.752______impossible______SETI@Netherlands
08_______78.197.010______126.288______109.352______impossible______The Knights Who Say Ni!
09_______42.064.378_______63.616_______61.438______impossible______Overclockers.com
10_______41.573.352______-33.751______-37.228______1.117 days______OcUK - Overclockers UK
11_______39.513.545_______29.761_______28.411______impossible______Team Art Bell
12_______33.315.154________8.384________1.518______impossible______Team 2ch
13_______28.715.567_______14.368_______-1.316_____21.820 days______The Planetary Society
14_______27.787.121_______36.278_______26.596______impossible______Team MacNN
15_______25.036.047______-79.975______-87.289________287 days______BOINC.Italy
16_______13.482.892______-47.457______-57.139________236 days______Ars Technica
17_______70.870.492______161.655______157.656______notanoption_____TeAm AnandTech
19__________-63.088_______13.264______-10.415______impossible______SETI@Taiwan
18_________-231.109______-59.385______-63.636______impossible______Universe Examiners
20_________-564.678_______85.069_______55.058_________10 days______Team China
21_______-1.582.965______-13.920______-23.683______impossible______Phoenix Rising
22_______-4.879.081_______71.111_______28.199________173 days______Team Starfire World BOINC
23______-13.848.878______-26.304______-42.284______impossible______Canada
24______-13.857.346______-51.488______-46.831______impossible______Dutch Power Cows
25______-15.631.762______-94.360______-95.790______impossible______Hewlett-Packard
26______-15.706.246______-68.039______-69.971______impossible______Amateur Radio Operators
27______-15.715.059______-75.490______-62.852______impossible______PC Perspective Killer Frogs
28______-20.565.802______-27.760______-29.146______impossible______UK BOINC Team
29______-22.123.436______-82.406______-85.636______impossible______Team MacAddict
30______-22.221.643______-64.662______-63.160______impossible______Team NIPPON
31______-23.204.698______-36.417______-36.436______impossible______US NAVY
32______-23.251.915______-40.864______-51.248______impossible______BOINC SETI@home RUSSIA
33______-23.683.589______-46.829______-54.746______impossible______BOINC@AUSTRALIA
34______-24.136.588______-22.673________9.966______2.422 days______AUSTRIA - NATIONAL - TEAM
35______-24.726.257______-54.735______-68.543______impossible______BOINC@Denmark
36______-24.740.969_____-100.990_____-106.545______impossible______2CPU.com
37______-25.025.769______-70.395______-71.784______impossible______Hungary
38______-25.155.920_____-124.360_____-121.641______impossible______Planet 3DNow!
39______-26.694.835______-34.869______-39.524______impossible______U.S.Air Force
40______-30.491.495_____-112.594_____-114.025______impossible______Portugal@Home
41______-32.200.937______-89.131______-91.739______impossible______SETI.hr
42______-32.962.761______-85.956______-85.919______impossible______SETI@klamm.de
43______-35.958.623_____-113.600_____-112.370______impossible______Team EDGE
44______-36.130.114_____-104.613_____-108.463______impossible______HispaSeti & BOINC
45______-36.396.593_____-138.298_____-136.643______impossible______Picard
46______-36.616.189______-95.747______-90.421______impossible______BOINC.SK
47______-37.203.225_____-141.794_____-139.776______impossible______LittleWhiteDog
48______-37.481.180______-97.807_____-100.901______impossible______SETI Sverige [Sweden]
49______-39.790.514______-49.417______-59.045______impossible______BOINC@Poland
50______-40.746.273_____-114.226_____-116.663______impossible______BOINC UK
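Apart from AnandTech's own row, which shows its actual stats, each row shows how much more or less than AnandTech that team has.
The "overtake" column shows, based on Average Work Done (AWD), how many days it would take AnandTech to overtake a team ahead of it, or to be overtaken by a team behind it.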
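For anyone wanting to check the "overtake" column, the sketch below reproduces it from the other two difference columns. The formula is inferred from the numbers in the table, not stated anywhere in the post: the gap closes only when the total difference and the AWD difference have opposite signs, and the days are the gap divided by the daily rate, rounded. It also assumes the dots are thousands separators. The function names are made up.

```python
def parse_stat(text):
    """Convert a table cell such as '-37.228' (dot = thousands separator) to an int."""
    return int(text.replace(".", ""))


def days_to_overtake(total_diff, awd_diff):
    """Return the rounded days until the gap closes, or None if it never does."""
    # Same sign (or a zero) means the gap is holding or growing: "impossible".
    if total_diff * awd_diff >= 0:
        return None
    return round(abs(total_diff) / abs(awd_diff))


# Row 10: ahead of AnandTech but gaining 37,228 fewer credits per day.
print(days_to_overtake(parse_stat("41.573.352"), parse_stat("-37.228")))   # prints 1117

# Row 20: behind AnandTech but gaining 55,058 more credits per day.
print(days_to_overtake(parse_stat("-564.678"), parse_stat("55.058")))      # prints 10

# Row 01: ahead and still pulling away, so the table says "impossible".
print(days_to_overtake(parse_stat("379.627.624"), parse_stat("540.496")))  # prints None
```

These match the table's "1.117 days", "10 days", and "impossible" entries for those rows.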