My Setiqueue ..........Error Log Files.....

MoFunk · Dec 29, 2003

So I have been re-vamping my lan a bit. I have a smoothwall system working as my router and now will be moving a webserver to the more secure "orange" network of the smoothie system. So I would basically like to move my setiqueue there as well so I can take it off my main rig. Since Seti1 is supposed to end soon, should I even bother? Or if the end is still a few months away, what is the best way to move the queue? Thanks.

Oh and on my main rig I have the queue running as a service, and I totally forgot how I did that, could someone give me a refresher on starting and stopping setiqueue as a service? Thanks.

soni · Dec 29, 2003

setiq -u

move queue, entire folder to new machine

setiq -i

MoFunk · Dec 29, 2003

Thanks Soni. I think I will do that just to get it off my main rig.

MoFunk · Dec 30, 2003

Well I am kinda glad (I guess) that I decided to move my queue. I have not paid much attention to it but I have a ton of error logs. Here is what they say....
the log_date.txt says
12:00am: s@h_wrk Queued passthrough request
12:00am: s@h_wrk Passing through request send_result_get_user_stats
12:00am: s@h_wrk Seti@home message: 'Corrupt user ID. Please exit SETI@Home, delete user_info.sah and result.sah and restart SETI@Home. This problem results from a server side error that has been fixed.'
12:00am: s@h_wrk Passthrough: Seti@home status: ErrorCode 0x00000064 100
12:00am: s@h_wrk Returning passing through request response

and the err_date.txt file says
12:00am: s@h_wrk Seti@home message: 'Corrupt user ID. Please exit SETI@Home, delete user_info.sah and result.sah and restart SETI@Home. This problem results from a server side error that has been fixed.'

So, how am I supposed to figure this out? I have a ton of them.
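One way to get a handle on the volume is to count the "Corrupt user ID" lines per error log. The sketch below is only an illustration: the `err_*.txt` glob is an assumption based on the `err_date.txt` naming mentioned above, and `count_corrupt_id_errors` is a hypothetical helper, not part of SetiQueue. The excerpts quoted above do not appear to name the sending client, so a count like this can show whether the problem is still going on, but not which machine is causing it.

```python
from pathlib import Path

# Text of the Berkeley message as it appears in the quoted log lines.
MARKER = "Corrupt user ID"


def count_corrupt_id_errors(log_dir, pattern="err_*.txt"):
    """Return {file name: number of lines containing the corrupt-ID message}.

    The default pattern is an assumption about how the queue names its
    error logs; adjust it to match the files actually on disk.
    """
    counts = {}
    for path in sorted(Path(log_dir).glob(pattern)):
        with open(path, encoding="utf-8", errors="replace") as handle:
            counts[path.name] = sum(1 for line in handle if MARKER in line)
    return counts
```

Calling `count_corrupt_id_errors` on the queue's log folder before and after cleaning up a client should show the daily count dropping if that client was the source.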

Smoke · Dec 30, 2003

My Q is also suffering from the same thing. It is a result of the "defective WUs" Berkeley sent out over a month ago when their server was crashing.

I have taken all sorts of measures to remedy this problem without a whole lot of success.

When it was first discovered I took the extreme measure of completely deleting all 7,000+ cached WUs on my Q to ensure that no additional "defective WUs" were given out. But it appears the damage was done and a whole lot of them were distributed.

The "defective WUs" don't have a proper "user_info.sah" file. This causes an additional error in the "result.sah" file (no ID or KEY). If the machine can be discovered that has the "defective WU(s)", it can be fixed. But in my case there are so many Queues and Clients that I have no way of knowing which machine is at fault.

In case you are wondering how this affects a SetiQueue, this is what happens: If the machine is running a caching program like SetiDriver, it starts working on another WU as soon as it completes one (or almost immediately for you people who like to nit-pick 😛). Let's say the "next" WU is a "good one". When it finishes crunching the good WU, SetiDriver dutifully transmits the freshly completed good WU and tries once more to transmit the old "defective WU". Therein lies the dilemma of trying to figure out which machine is the culprit. If the "defective WU" completely halted the process, we could just look for machines that have stopped transmitting completed WUs. But that is not the case. The machines carrying a defective WU continue crunching fresh good WUs and endlessly try to transmit both the result of each freshly completed good WU and the old "defective WU".

The result of all of this is that our SetiQueues waste a good bit of time trying to submit these defective WUs to Berkeley. Berkeley refuses them, but the Q tries again and again each time the infected machine transmits. You see the evidence of all of this when you inspect the LOGs. 🙁

The only FIX I can think of would be to just start everything all over from scratch. I'm not just talking about the SetiQueue installation but every machine that is using the Q. If you have only your own machines this is doable but if you have a large Q like the one I'm running you are between a rock and a hard place.

I keep telling myself to just struggle on because S@H-1 will be ending soon and then all of this "stuff" will be behind us. But in the meantime, I'm personally suffering a lot of wasted bandwidth and I'm probably not alone in this situation.

Freewolf · Dec 30, 2003

I know we are still getting it on the Rebel Alliance team q. I finally got rid of the problem at home. If you aren't using a caching program your production on the machine with the bad work unit stops because you never download a new work unit to start on.

Crazee · Dec 30, 2003

It has wreaked havoc for me 🙁

MoFunk · Dec 30, 2003

So do these wu's get counted?

Smoke · Dec 30, 2003

Originally posted by: MoFunk
So do these wu's get counted?

Only if the user_info.sah file is replaced with a good one and only if the result.sah file is "fixed".

I've written about this before so I don't know if it is necessary for me to outline the procedure again?

I have discovered a way to find the defective WU. Just search your Seti Folder (or hard drive using Windows Search Function) for a file named "result.sah". This file only exists when the computation of the WU is finished.
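For anyone who would rather script that search than use the Windows Search dialog, here is a minimal sketch. `find_result_files` is a hypothetical helper name, and the root folder is whatever you pass in. The match is case-insensitive because the file shows up in this thread as both "result.sah" and "RESULT.SAH". Since the file exists whenever a WU has just finished, a hit is only a candidate; one that is still there on a later run is the likelier sign of a stuck WU.

```python
import os


def find_result_files(root, target="result.sah"):
    """Walk `root` and return the full path of every file named `target`.

    The comparison ignores case, so RESULT.SAH and result.sah both match.
    """
    hits = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            if name.lower() == target:
                hits.append(os.path.join(dirpath, name))
    return sorted(hits)
```

An empty list means no finished-but-unsent result was found under that folder.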

A quick fix would be to just delete (thereby losing the analysis) the numbered folder containing a "result.sah" file. I am of course referring to SetiDriver installs.

The above "quick fix" may also work on service installs that are hung. Why doesn't one of you service install users give the quick fix a try and report whether it actually is a "fix"?

MoFunk · Dec 30, 2003

Well Smoke I started with my main system. I am using setihide as a service so here is what I did. I stopped the service and dumped the entire folder! The stuff that is there for the service install I have backed up and it's already installed as a service anyway. So then I deleted EVERYTHING. Copied over my seti client that I have backed up as well as setihide. I then brought up setihide, configured, started grabbing wu's and closed the program. Then I started the service back up. I then dumped the error logs and will see if I still get them. If so, I will try the other computers on my lan. If that does not do it, then it may be a system here at work, which is fine since I have to take them down soon again anyway.

MoFunk · Dec 31, 2003

Now with this error, would the setiqueue even show any results from that specific machine? See I have 5 on my lan and they are all showing results being received. If this error will not allow a client to upload results then I can pretty easily rule out my lan.

Freewolf · Dec 31, 2003

If you're using a program that caches work units, then yes, they will still produce. They will just keep trying to upload that bad work unit every time the machine sends a good one. If you are using a service install that doesn't cache work units, then no, it will not produce, because the machine cannot download a new work unit to start on. You have to stop the service, delete or fix the bad work unit, then restart the service. In some cases I've had to reboot the machine before the service would download a new work unit to process.

MoFunk · Dec 31, 2003

Thanks Freewolf. I can then say it is not a pc on my lan then. I am using setihide for them all and they all are able to dump the wu they completed and then move on to the next. None of my home pc's are clogged. It must be one from work. I will have to search around tomorrow.

Freewolf · Dec 31, 2003

Setihide can cache work units. If you have work units in the cache on the machine, it will try to send the bad work unit, give a brief error message, then start processing the next work unit in the cache. When that work unit finishes, it will send the good work unit, then try again to send the bad one, and repeat the cycle until the bad work unit is deleted or fixed.

Smoke · Dec 31, 2003

I agree with Freewolf.

That means you cannot rule out your home machines!

Did you do that search I recommended?

MoFunk · Dec 31, 2003

I see what you guys are saying.....

What I did was look at all the cached wu's. They were all ready to crunch except for the one it was working on, and that is what made me think it was not my lan. Setihide gives a WU a green happy face when it is done, and since I saw none of those I thought I was fine. So to make sure, I turned off ALL the seti threads on the pc's, and the error log continued. I will search my work for the bad boy today.

Smoke · Dec 31, 2003

I can't seem to get a response to my question.

Did you use the SEARCH function for "result.sah" on your entire hard drive?

If you don't find any such file then "that" machine is OK.

It is the easiest and quickest method to locate the "defective WUs".

MoFunk · Dec 31, 2003

Sorry. No I did not do that. I will right now though and let you know.

Smoke · Dec 31, 2003

I apologize for writing, "I can't seem to get a response to my question."

I was a little perturbed at the moment. A momentary power-outage (all of 1 second) crashed 8 computers on my crack rack. It took several trips and about 45 minutes to get them all started again and I was in the middle of a "trade".

Well, all of the computers are running again and the trade has gone my way and I'm out. 😀

Let me rephrase what I wrote:

"I'd really like to find out if my SEARCH idea works for finding these defective WUs. I especially would like to find out if the SEARCH idea works on SERVICE INSTALLS. So if you (Mo) or anyone else can try it out I'd appreciate a report so we can pass on the idea (fix) to others." 🙂

There ... that sounds a lot better. 😛 😀

MoFunk · Dec 31, 2003

Smoke - No worries man..... Yes you can do a search and if you have a clogged system you will find a result.sah file. I did find the culprits. I have all but 2 of my work computers running the cli from the flex installer. The 2 others are running setihide. I started with the flex computers and found they were fine. I then went to setihide system A, and opened setihide. There I found a WU that was stuck and trying to phone home. I watched it for a moment and would see it try to transmit and then stop, then a minute later do it again and stop. This is the exact pattern my error log was showing, a corrupt connection every minute. So I dumped the entire directory, ran the cli to get my user file and loaded setihide again. Set it all up and have it running as a service now. Then I went to setihide system B and saw that there were 5 doing the phone home thing, so I dumped and started over. No error logs for a while now!

Before I dumped the stuff though, I did a search for the result.sah file and it did find them. If you are looking at these systems from a lan, you can search rather easily. Bring up your search window and search for files named result.sah, then in the "look in" area type in the computer name and the folder if it is shared, or the default c$ to search the whole drive (NT based os's). So it will look like this: \\computername\c$. That is the quickest way I could come up with to search 8 computers that are in another building.
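The same idea can be scripted for a list of machines. This is a rough sketch, assuming it runs on Windows under an account that is allowed to reach the `c$` administrative share on each machine; `unc_root` and `search_hosts` are hypothetical helper names, and the computer names are whatever you supply. A machine that cannot be reached simply comes back with an empty list, so an empty result does not prove that machine is clean.

```python
import os


def unc_root(computer, share="c$"):
    """Build a UNC path such as \\\\computername\\c$ for one machine."""
    return "\\\\" + computer + "\\" + share


def search_hosts(computers, share="c$", target="result.sah"):
    """Return {computer: [paths of files named `target` on its share]}.

    os.walk silently skips roots it cannot open, so unreachable
    machines map to an empty list rather than raising an error.
    """
    report = {}
    for computer in computers:
        hits = []
        for dirpath, _dirnames, filenames in os.walk(unc_root(computer, share)):
            hits.extend(
                os.path.join(dirpath, name)
                for name in filenames
                if name.lower() == target
            )
        report[computer] = sorted(hits)
    return report
```

Walking a whole drive over the network can be slow, so passing the name of a shared Seti folder as `share` instead of `c$` is likely to be much quicker where such a share exists.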

Smoke · Dec 31, 2003

Thanks for that report, Mo. 🙂

I'd like to try this FIX if you find another "service installed" system with a defective WU (read "result.sah" file):

Just DELETE the RESULT.SAH file and reboot the computer. I would like to know if the Seti Service starts up and runs properly without any further action?

I know on systems with SetiDriver all you have to do is DELETE the "numbered folder" that contains the RESULT.SAH file and you are good to go. SetiHide may work the same way.

But we have a definite need for a quick easy fix for service installs that get clogged because of these "defective WUs". 😉
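The SetiDriver-style fix described above (delete the numbered folder that holds the RESULT.SAH file) could be scripted roughly as follows. This is only a sketch under assumptions: it treats any direct subfolder with an all-digit name as a "numbered folder", which is a guess at the cache layout, and `clogged_folders` and `remove_clogged` are hypothetical helper names. It defaults to a dry run because deleting the folder throws away the finished analysis, and the client or service should be stopped before anything is actually removed.

```python
import os
import shutil


def clogged_folders(cache_root, target="result.sah"):
    """List all-digit subfolders of `cache_root` that contain `target`."""
    found = []
    for entry in sorted(os.listdir(cache_root)):
        folder = os.path.join(cache_root, entry)
        if entry.isdigit() and os.path.isdir(folder):
            if any(name.lower() == target for name in os.listdir(folder)):
                found.append(folder)
    return found


def remove_clogged(cache_root, dry_run=True):
    """Return the clogged folders; delete them only when dry_run is False."""
    folders = clogged_folders(cache_root)
    if not dry_run:
        for folder in folders:
            shutil.rmtree(folder)  # the finished result in this folder is lost
    return folders
```

Running `remove_clogged` with the default `dry_run=True` first shows what would be deleted without touching anything.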

MoFunk · Dec 31, 2003

Yes Setihide works the same way, just a numbered folder that you can delete. If I have time today I will try this on one of my systems that has the regular service installed. I kept a copy of the bum result file. I will let you know if I get a chance to do that.

Smoke · Dec 31, 2003

I just found a shareware program that really makes looking for this easy.

LAN Find

I have all of my Seti installs in folders named, "Seti@Home". I put "result.sah" in the "File or folder mask block" and "Seti@Home" in the "Share mask block" ... the search took all of 2 seconds. My LAN consists of approximately 20 computers.

😀

MoFunk · Dec 31, 2003

Thanks for the link Smoke! I am playing with it right now. I have a few hundred PC's on my LAN at work so it will take longer, but this is very cool!

Coolkid · Dec 31, 2003

Wouldn't you need to delete all the .sah files in the directory?
The RESULT.SAH file is just the result, which would mean that it would restart, find no result.sah file and just start processing the defective WU again?
