Oh no! Bruno! (Oct 29 2008)
Well, we haven't really gotten completely around the general problems with our raw data drives being unreadable via our tangled web of SATA enclosures, USB converters, etc. However, I did find one thing this morning which helped. It turns out one enclosure had simply stopped working. Long story short, upon very careful inspection I found that one of the drive bays had a tiny, tiny piece of pink fluff wedged in the SATA power plug. The fluff was from our shipping containers to/from Arecibo. Bits of it get torn off from regular use, and it looks like some got stuck on a drive, which then got wedged into the power plug upon insertion into the enclosure. I dug it out, replaced the drives, and they were visible again. At least for now. I do appreciate the "modprobe" suggestion in the last thread, which may help with other similar issues.
Jeff and I were discussing a lot of stuff today, focused mainly on future planning and needs, i.e. what are our current bottlenecks, how do we fix them, and then what will our new bottlenecks be? We're resurrecting the conversation with campus, possibly to have them research the current cost/feasibility of increasing our bandwidth. We're also internally discussing what we'd need for a potential move towards less redundancy - which will pretty much double our load if we decide to keep up with demand, and can keep up. We were also scratching our heads about these semi-regular bandwidth spikes that max out our current bandwidth and wreak general havoc for an hour or so at a time.
As for that last item, I found an important clue today. The assimilator code has a memory leak - it's had the leak for years now, but it's usually not a problem. It eventually reaches a limit, fails, then restarts within a few minutes. Today I found the assimilators have been dying quite often recently, and their failures line up perfectly with the upward bumps we see in upload traffic. No surprise, as the assimilators and uploads run on the same machine (bruno) - so if bloated, resource-consuming assimilators suddenly disappear from the process queue, more resources are suddenly given to uploads.
The story goes on from there, but I have to get back to work and will leave the conclusion until tomorrow. You see, I put in an "assimilator killer" cronjob today that runs every two hours to restart the assimilators regularly and prevent them from bloating too much. I think observing the effects of that over the next 24 hours will inform what I think about other network problems we've been having...
- Matt
The End of All Things (Oct 30 2008)
Okay. So the assimilator memory leak wasn't a problem so much as an effect. Its consumption of resources still needs to be addressed, but it was only affecting itself, and being aggravated by the other problems around it.
Poring through logs I confirmed that the network bursts were indeed due to Astropulse downloads - during the "baseline" 2 out of 100 workunit downloads are Astropulse, but during the "burst" 40 out of 100 are Astropulse. The Astropulse workunits are much larger in size than SETI@home workunits, hence the bandwidth consumption. I also confirmed it wasn't a single client (or a few) hitting us at once - connections were randomly distributed over many IP addresses.
It finally dawned on me, and now, like most things, it is painfully obvious in hindsight. The SETI@home and Astropulse splitters have separate high water marks. For SETI@home, if we get above 50000 results ready to send, we temporarily halt splitting. For Astropulse, it is still set pretty low at 2500. Every so often a splitter process checks the size of the queue to see if it should stop. Since there are many SETI@home splitters running at a time, and there is always a delay in transitioning state, thousands of workunits may be generated before the splitters actually realize they are above the high water mark. And then they go to sleep for a while - like an hour or so - until the queue drains enough and they wake up again and get back to work.
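To make the overshoot concrete, here is a rough Python sketch of several splitters that only check the queue size every so often. Only the 50000 high water mark comes from the post; the splitter count, per-tick output, check interval, and starting queue size are made-up numbers, and the model ignores results being sent out in the meantime.

```python
def simulate_overshoot(high_water, n_splitters, per_tick, check_every, start=0):
    """Return the queue size at the moment the splitters notice they should stop.

    Hypothetical model: on every tick each splitter adds `per_tick` results,
    but the queue size is only compared against `high_water` once every
    `check_every` ticks, so work keeps piling up between checks.
    """
    if n_splitters * per_tick <= 0:
        raise ValueError("splitters must produce work for the loop to finish")
    queue = start
    tick = 0
    while True:
        tick += 1
        queue += n_splitters * per_tick
        if tick % check_every == 0 and queue > high_water:
            return queue


# Six splitters overshoot much further than one before anyone notices.
many = simulate_overshoot(50000, n_splitters=6, per_tick=100, check_every=10, start=49000)
one = simulate_overshoot(50000, n_splitters=1, per_tick=100, check_every=10, start=49000)
print(many - 50000, one - 50000)
```

The more splitters run between checks, the bigger the pile of extra workunits sitting above the high water mark when they finally go to sleep.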
The thing is, during SETI@home's "sleep until we're needed again" phase the Astropulse splitters continue to run, since they haven't reached their high water mark even though it's much lower - those splitters are fewer in number and run slower. Now remember, when workunits are created the transitioners also create the respective results to "send." New results are id'ed serially - i.e. they are tagged with a number in the database which increases automatically. So during these periods you'll get an area in database id space rich in Astropulse results.
Moving on to the feeder. Since it's stupid regarding application types, it fills its own send queue with the oldest results ready to send regardless of application, and the way mysql works this tends to mean in database id order. Of course with the ready-to-send queue at 50000 or so, we have to send out 50000 results before we finally see the effects of what happened above - many hours, usually. Then suddenly - bam! - 20 times more Astropulse workunits than normal. That arbitrary time delay really confused matters.
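A toy Python model of that delay, just to illustrate the idea: results get serial ids, the "feeder" here simply hands them out lowest id first, and the Astropulse-rich stretch only shows up once everything ahead of it has gone out. The 2-in-100 and 40-in-100 mixes are the figures from the post; the queue lengths, batch size, and function names are invented for the example, and this is not the real feeder code.

```python
def build_results(baseline_len, burst_len):
    """Create (id, app) pairs with serially increasing ids.

    The first `baseline_len` results are 2% Astropulse; the following
    `burst_len` results (created while the SETI@home splitters sleep)
    are 40% Astropulse.
    """
    results = []
    next_id = 1
    for i in range(baseline_len):
        app = "astropulse" if i % 50 == 0 else "setiathome"
        results.append((next_id, app))
        next_id += 1
    for i in range(burst_len):
        app = "astropulse" if i % 5 < 2 else "setiathome"
        results.append((next_id, app))
        next_id += 1
    return results


def astropulse_per_batch(results, batch=100):
    """Send results oldest (lowest id) first; count Astropulse in each batch."""
    ordered = sorted(results)
    counts = []
    for start in range(0, len(ordered), batch):
        chunk = ordered[start:start + batch]
        counts.append(sum(1 for _, app in chunk if app == "astropulse"))
    return counts


counts = astropulse_per_batch(build_results(baseline_len=500, burst_len=200))
print(counts)
```

The first five batches carry 2 Astropulse results each, then the next two jump to 40 each. Scale the baseline up to a 50000-deep queue and the jump arrives hours after the splitters caused it.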
Anyway, one easy solution is to make the feeder smarter. It does have an "-allapps" flag to send to all applications equally. We were hesitant to use this before for fear it would give too many shared memory slots to Astropulse - and it may very well cause periods of low work during peak periods, as the feeder has half the memory for SETI@home workunits that it had before. Nevertheless we turned this on today and it had an immediate, positive smoothing effect. Sweet.
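For illustration only, one way "send to all applications equally" could look is a round-robin over per-application queues when filling the shared memory slots. This Python sketch is an assumption about the general idea, not how the actual feeder implements "-allapps"; the function name, slot count, and ids are hypothetical.

```python
from collections import deque


def fill_slots_all_apps(queues, n_slots):
    """Fill up to `n_slots` slots by cycling across applications.

    `queues` maps an application name to its result ids, oldest first.
    Each pass gives every application with pending results one slot,
    so no single application can crowd out the others.
    """
    pending = {app: deque(ids) for app, ids in queues.items()}
    slots = []
    while len(slots) < n_slots and any(pending.values()):
        for app, ids in pending.items():
            if ids and len(slots) < n_slots:
                slots.append((app, ids.popleft()))
    return slots


slots = fill_slots_all_apps(
    {"setiathome": [1, 2, 3, 4, 5, 6], "astropulse": [7, 8]},
    n_slots=6,
)
print(slots)
```

With plenty of both kinds queued, each application ends up with half the slots, which is exactly the trade-off mentioned above: smoother Astropulse delivery, but only half the room for SETI@home workunits.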
Other than that today... some data pipeline scripting, and continuing discussions amongst the gang regarding changing redundancy to zero - trying to wrap our brains around all the current bottlenecks and what will suffer depending on what we do. As it stands now, our servers most likely will not be able to support reducing redundancy all the way to zero *and* keeping up with current workunit demand. So we have to either improve our server i/o or figure out what other knobs to turn.
- Matt
#____Total Work Done____Todays WD_______AWD________overtake________Team-name
01______691.109.738______793.600______776.564______impossible______SETI.USA
02______494.740.890______577.366______599.195______impossible______SETI.Germany
03______217.386.992______219.212______228.108______impossible______L'Alliance Francophone
04______132.159.553______128.216______141.333______impossible______The Knights Who Say Ni!
05______129.283.746______-22.433______-15.467______8.359 days______BOINC Synergy
06______123.860.825______137.638______149.171______impossible______SETI@Netherlands
07______122.047.366_______85.290_______97.071______impossible______Czech National Team
08______118.447.452_______-8.386_______12.255______impossible______BroadbandReports.com Team Starfire
09_______72.160.355_______50.559_______53.662______impossible______Overclockers.com
10_______35.062.331_______12.532_______14.072______impossible______Team 2ch
11_______23.094.384_____-100.656_____-103.303________224 days______Team MacNN
12_______22.948.007_____-150.618_____-143.796________160 days______Team Art Bell
13_______20.158.832______-88.664______-76.950________262 days______The Planetary Society
14_______18.416.370_____-140.406_____-131.912________140 days______OcUK - Overclockers UK
15________4.342.215______-46.875______-47.951_________91 days______Team Starfire World BOINC
16________3.028.090______-15.905______-12.000________252 days______SETI@Taiwan
17______150.491.447______318.067______317.375______notanoption_____TeAm AnandTech
18_________-363.224______-48.591______-46.984______impossible______Team China
19______-24.011.976_____-203.965_____-202.194______impossible______BOINC.Italy
20______-24.757.171_____-191.323_____-189.178______impossible______Ars Technica
21______-27.004.396______-46.975______-42.723______impossible______US NAVY
22______-32.474.979_____-194.004_____-192.277______impossible______Phoenix Rising
23______-34.458.006_____-108.282_____-105.606______impossible______Canada
24______-39.007.543_____-118.807______-95.741______impossible______Amateur Radio Operators
25______-39.366.726______-93.682______-96.233______impossible______U.S.Air Force
26______-42.746.779_____-200.109_____-201.931______impossible______Universe Examiners
27______-45.677.159_____-158.405_____-155.029______impossible______UK BOINC Team
28______-46.196.728_____-146.960_____-150.658______impossible______Dutch Power Cows
29______-48.387.436_____-116.659_____-116.103______impossible______BOINC@AUSTRALIA
30______-50.133.037_____-171.872_____-174.325______impossible______AUSTRIA - NATIONAL - TEAM
31______-50.589.517_____-192.768_____-188.359______impossible______PC Perspective Killer Frogs
32______-56.159.229_____-160.127_____-151.555______impossible______BOINC SETI@home RUSSIA
33______-60.928.726_____-185.848_____-179.897______impossible______Team NIPPON
34______-64.184.927_____-213.654_____-192.325______impossible______BOINC@Denmark
35______-65.440.448_____-103.340______-99.109______impossible______BOINC@Poland
36______-68.668.904_____-203.396_____-202.632______impossible______Hungary
37______-72.522.022_____-257.696_____-254.881______impossible______Hewlett-Packard
38______-74.750.994_____-133.140_____-132.824______impossible______Elite Games
39______-74.788.993_____-240.672_____-240.396______impossible______Team MacAddict
40______-75.575.173_____-196.260_____-197.487______impossible______Planet 3DNow!
41______-84.728.948_____-251.926_____-251.493______impossible______2CPU.com
42______-85.408.302_____-225.292_____-220.363______impossible______Portugal@Home
43______-86.158.435_____-255.856_____-249.865______impossible______SETI@klamm.de
44______-86.205.276_____-237.018_____-239.025______impossible______SETI.hr
45______-90.554.483_____-121.291_____-121.576______impossible______Boone Community School District - Iowa
46______-91.797.506_____-236.454_____-230.874______impossible______BOINC.SK
47______-92.809.489_____-242.968_____-239.382______impossible______SETI Sverige [Sweden]
48______-93.531.519_____-268.449_____-257.576______impossible______BOINC.BE
49______-95.198.893_____-251.387_____-249.358______impossible______Team EDGE
50______-96.035.036_____-261.969_____-260.574______impossible______HispaSeti & BOINC
Apart from AnandTech's own stats, each figure shows how much more/less than AnandTech that team has.
It also shows, based on Average Work Done, how many days it would take for AnandTech to overtake the team, or to be overtaken by a team behind...
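The overtake column appears to be the credit gap divided by the AWD gap, rounded to the nearest day (the dots in the table are thousands separators). That formula is inferred from the figures, not stated anywhere, so treat this Python sketch as a guess that happens to reproduce several rows; the function name is made up.

```python
def days_to_overtake(gap, awd_diff):
    """Estimate days until positions swap, or None if they never will.

    gap      -- the other team's total credit minus AnandTech's
    awd_diff -- the other team's Average Work Done minus AnandTech's

    A swap only happens when the team ahead is the slower one, i.e. when
    the two differences have opposite signs.
    """
    if gap == 0 or awd_diff == 0 or (gap > 0) == (awd_diff > 0):
        return None  # shown as "impossible" in the table
    return round(abs(gap) / abs(awd_diff))


print(days_to_overtake(129283746, -15467))   # BOINC Synergy row
print(days_to_overtake(23094384, -103303))   # Team MacNN row
print(days_to_overtake(-363224, -46984))     # Team China row
```

These give 8359, 224, and None, matching the "8.359 days", "224 days", and "impossible" entries for those teams.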
000______________Credit________Total_____
#00_____Pos._____/day__________credit_______User Name
001_____001_____68.008_____16.839.978_____dajeepster
002_____002_____41.180_____12.614.237_____keeleysam
003_____003_____34.997______9.628.360_____Todd Hebert
004_____006_____21.313______4.874.616_____Bryan Wallace
005_____005______9.866______6.501.223_____baxsie
006_____035______5.798________759.051_____Wiz
007_____027______5.725______1.185.436_____Yooshaw
008_____009______5.652______3.264.332_____Astro-AL
009_____004______5.119______6.581.044_____cory e
010_____021______4.648______1.401.690_____Jack Whitmire
011_____015______4.225______2.063.677_____RussianSensation
012_____078______4.222________285.255_____citsacras
013_____016______3.795______1.900.772_____Lane42
014_____019______3.347______1.580.430_____WhoKnows
015_____074______3.333________308.086_____Rebel Alliance
016_____036______3.106________757.951_____Assimilator1 - Team AnandTech
017_____010______2.906______2.852.148_____Alan J. Simpson
018_____307______2.858_________25.497_____Grimwulf@work
019_____031______2.547________931.718_____Wayne S
020_____011______2.115______2.722.247_____petrusbroder
021_____029______2.052______1.058.620_____tonyhams
022_____115______1.911________158.530_____Vindum
023_____047______1.625________608.157_____RedFish
024_____014______1.408______2.333.198_____nova
025_____022______1.306______1.397.030_____TeAm Enterprise
026_____028______1.286______1.114.424_____CraigRT
027_____034______1.248________793.865_____QuietDad
028_____059______1.231________396.853_____Smoke
029_____342______1.169_________19.097_____Andrew Rynn
030_____080______1.164________281.001_____jlaine
031_____062______1.141________377.033_____JCA
032_____181______1.130_________77.854_____mrwizer
033_____086______1.054________249.573_____IJump - Team Anandtech
034_____030______1.018______1.017.344_____Splotto
035_____350______1.013_________18.170_____Phiton
036_____072________941________314.605_____Daniel Schulken
037_____155________926________103.081_____Chad
038_____083________913________267.809_____Jason
039_____069________904________344.274_____SilverHair
040_____178________894_________79.789_____Beets
041_____248________818_________39.245_____Patrick
042_____120________801________148.198_____jta-seti
043_____053________797________508.297_____DimensionX
044_____232________757_________45.828_____amish
045_____038________755________752.271_____Spongebob_fan
046_____124________753________139.749_____mrbeci
047_____109________752________177.762_____Webmonkey
048_____032________706________861.186_____Chris Buchach
049_____140________676________120.335_____Shelgeyr
050_____007________611______4.843.943_____TenaciousT - Huntsville, AL
051_____065________576________365.604_____Steve
052_____023________480______1.344.401_____Chris S
053_____052________478________522.912_____Patrick Van Uffelen
054_____042________468________654.931_____Rich Hisgen
055_____243________396_________41.440_____littletemple
056_____386________394_________13.261_____NPG
057_____013________357______2.488.686_____del Sol
058_____039________348________743.322_____Niege
059_____045________322________624.960_____Fee
060_____037________303________753.428_____Voodoo
061_____125________279________135.496_____Arvind
062_____091________278________229.108_____McCormick
063_____017________265______1.729.229_____mk
064_____055________265________488.277_____rlame
065_____092________253________226.041_____cryhavoc
066_____160________247_________99.208_____PATHIK
067_____095________241________221.247_____bigwoofer
068_____008________237______4.396.892_____Ace
069_____068________225________346.914_____RoHo (Apple G5 dual 2.3)
070_____075________218________301.312_____Zim Hosein
071_____076________216________298.994_____CheesePoofs
072_____073________201________308.925_____bountyhunterxl
073_____082________195________271.715_____panhead49
074_____285________173_________29.567_____CrimsonWolf
075_____090________164________229.652_____chusteczka
076_____123________164________140.479_____DrKicker
077_____219________163_________52.533_____filibusterman
078_____209________154_________56.927_____linuxidiot
079_____136________149________124.088_____asim
080_____128________123________134.117_____theguru
081_____049________117________534.381_____Pokey
082_____061________109________382.830_____iamwiz82
083_____067________109________350.194_____bsauerbr
084_____084________108________267.123_____Soggy
085_____097________108________218.299_____mnelsonx
086_____064________103________367.959_____Alpha Psi
087_____099________103________209.407_____Romulanmale
088_____040________102________672.270_____Boosted Def
089_____146________102________108.476_____Tom Philippart
090_____093_________98________225.688_____vulcan800
091_____327_________97_________21.280_____freeloader
092_____050_________92________531.915_____groundedsailor
093_____103_________92________201.818_____Soni
094_____020_________90______1.428.547_____plasma
095_____202_________78_________62.248_____TA_andy
096_____113_________76________168.977_____zordz
097_____276_________75_________31.589_____Tom Pounds
098_____228_________70_________46.628_____Dougal
099_____212_________63_________56.074_____flywaldo
100_____262_________63_________35.060_____Takemaru
101_____265_________63_________34.326_____cyost91
102_____066_________60________356.612_____freakyfragiles
103_____448_________58__________7.119_____Jeffe19007
104_____290_________57_________28.725_____Thyme
105_____688_________57____________153_____bup
106_____051_________46________522.983_____LANMAN
107_____223_________46_________50.733_____Evadman
108_____233_________46_________45.370_____Ken_g6 - Team Anandtech
109_____253_________46_________37.413_____CB0159
110_____320_________46_________22.590_____Sickle 584th
111_____355_________45_________17.392_____Chad Green
112_____043_________41________646.442_____matchbook
113_____142_________37________110.602_____Mortimer
114_____094_________34________223.521_____Blade
115_____183_________33_________76.884_____benedict
116_____186_________32_________76.411_____jnj
117_____235_________32_________44.508_____Cydd
118_____085_________18________250.162_____Viperoni
119_____189_________16_________74.161_____Jarmonk