folding on skulltrail


BlackMountainCow

Diamond Member
May 28, 2003
5,759
0
0
My 2 cents:

I just had the idea that maybe you should download and run BOINC on that skulltrail for a while. To test the bad RAM idea, I'd suggest 8 climateprediction (CPDN) models at the same time. That'll give your system quite a ride. CPDN has been known to be one of the most picky projects when it comes to RAM quality. Some folks even use it to test their overclocks for stability. If a CPDN model crashes, it's likely (of course not 100%) that something is indeed wrong with your system.

Had the same happen to me. New box, passed all "regular" stress tests (Prime95, Orthos, 12 hours of memtest86x, 3DMark, ...). But still, I had random reboots while gaming. So I tried CPDN and indeed, the models crashed as well, but only if two ran at the same time and peaked the memory usage. So I went and checked the RAM again with memtest86+, and guess what: around the 15th hour (aimed for a 48h test) the first error reports started to occur. Oh well, changed the RAM, tested for 48h, no problems anymore, no crashes and CPDN runs fine ever since.

:beer:

 

DerekWilson

Platinum Member
Feb 10, 2003
2,920
34
81
i did not use internet explorer settings

... i also set affinity (manually using task manager) to odd numbered cores so the clients would be evenly spread across the CPUs.
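For anyone who would rather script the affinity step than click through Task Manager, here is a rough Python sketch. It only computes the bitmask that corresponds to a set of core numbers (bit n set means core n is allowed). The function name and the 0 to 7 core numbering are assumptions for illustration, not something from the post, and which numbers land on which physical CPU depends on how the OS enumerates the cores.

```python
def affinity_mask(cores):
    """Return an integer bitmask with one bit set per allowed core index."""
    mask = 0
    for core in cores:
        if core < 0:
            raise ValueError("core index must be non-negative")
        mask |= 1 << core
    return mask


# Assumed numbering 0..7 on an 8-core box; the odd numbered cores are 1, 3, 5, 7.
odd_cores = [c for c in range(8) if c % 2 == 1]

# One mask covering all odd cores, and one single-core mask per client.
print(hex(affinity_mask(odd_cores)))
print([hex(affinity_mask([c])) for c in odd_cores])
```

The single-core masks are the useful ones if each client is supposed to stay on its own core.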

it ran all night just fine like that.

i'm going to try running 3 or 4 clients on a single CPU and see if i get stability issues ...

i don't have much time today, but i'll check out BOINC and CPDN if i get a chance.

I am planning on testing out different RAM soon (4x 4GB of 4:4:4:12 FBDIMMs if everything works out), but i don't have any other modules yet.
 

DerekWilson

Platinum Member
Feb 10, 2003
2,920
34
81
by the way ... is there any inherent issue setting things up like this:

I've got an F@H directory and then 8 subdirs. in the F@H dir i have a batch file that does this:

cd 1
start FAH504-Console.exe -local -advmethods -forceasm -verbosity 9
cd ..
cd 2
start FAH504-Console.exe -local -advmethods -forceasm -verbosity 9
cd ..

...

cd 8
start FAH504-Console.exe -local -advmethods -forceasm -verbosity 9

i created a shortcut to this on my desktop and set the advanced properties to run with administrative privs ... (now that biodoc suggested it -- i haven't tried running all 8 again)

all 8 fire very quickly ... should i put some sleeps in there before i fire each one off?
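On the question of adding sleeps: a minimal Python sketch of a staggered launcher is below, in case it helps to try it. It assumes each numbered subdirectory holds its own copy of FAH504-Console.exe and reuses the flags from the batch file. The function names, the 10 second default delay and the injectable `spawn`/`sleep` parameters are all made up for this example. Unlike `start`, this does not open a separate console window per client.

```python
import os
import subprocess
import time

CLIENT = "FAH504-Console.exe"
FLAGS = ["-local", "-advmethods", "-forceasm", "-verbosity", "9"]


def launch_plan(count, base_dir="."):
    """Return (working_dir, argv) pairs for client directories 1..count."""
    plan = []
    for n in range(1, count + 1):
        cwd = os.path.abspath(os.path.join(base_dir, str(n)))
        # Assumption: every subdirectory contains its own copy of the client.
        argv = [os.path.join(cwd, CLIENT)] + FLAGS
        plan.append((cwd, argv))
    return plan


def launch_all(plan, delay_seconds=10.0, spawn=subprocess.Popen, sleep=time.sleep):
    """Start each client in its own directory, pausing between launches."""
    started = []
    for index, (cwd, argv) in enumerate(plan):
        if index > 0:
            sleep(delay_seconds)  # stagger every launch after the first
        started.append(spawn(argv, cwd=cwd))
    return started


# Only print the plan here; call launch_all(launch_plan(8)) to actually start clients.
for cwd, argv in launch_plan(8):
    print(cwd, " ".join(argv[1:]))
```

Whether a delay changes anything is an open question; it just rules out the simultaneous start as a variable.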

so far 4 is running very stable ... still going from last night ... haven't put them on a single CPU yet, but I'll try that tomorrow or something.
 

Insidious

Diamond Member
Oct 25, 2001
7,649
0
0
Since they run for a few hours (at least) before crapping out, I don't think there is reason to suspect the way you are starting them.

I'm kind of baffled in all this. I wouldn't think the system would care if you ran 8, but obviously when you run less, your system seems able to do them without errors.

So there's two possibilities that I can see.

1. F@H CLI clients can't run if there are more than 4 installed. (or so it looks so far)

I tend to think this is somewhat unlikely. I know the SMP clients get more and more froggy with more clients, but they are communicating between cores using the loopback adapter and the smpd.exe process which is flakey and they just can't compete with themselves... so to speak.

2. The Skulltrail system gets less and less stable as it is more fully utilized. Maybe it's the memory controllers for all those cores, maybe it's marginal RAM, maybe it's something that we haven't even thought of yet....

I would love to know what is the result of running memtest86+ for 24 hours or so
I would love to know what 8 instances of the Gromacs tester that GLeeM linked to (above) does.

honestly, I think it's a long shot that it's a weak link in the CLI clients on this one. If you were running 4 gpu clients (with that monster mobo) or 4 SMP clients, I'd feel differently. Those are beta clients. But the CLI has an awful lot of experience under its belt.

-Sid
 

Rattledagger

Elite Member
Feb 5, 2001
2,994
19
81
Originally posted by: Insidious
Since they run for a few hours (at least) before crapping out, I don't think there is reason to suspect the way you are starting them.

I'm kind of baffled in all this. I wouldn't think the system would care if you ran 8, but obviously when you run less, your system seems able to do them without errors.

So there's two possibilities that I can see.

1. F@H CLI clients can't run if there are more than 4 installed. (or so it looks so far)

I tend to think this is somewhat unlikely. I know the SMP clients get more and more froggy with more clients, but they are communicating between cores using the loopback adapter and the smpd.exe process which is flakey and they just can't compete with themselves... so to speak.
Running 8 instances of the normal client shouldn't be a problem, as long as they're correctly configured and started, as they seem to be...

2. The Skulltrail system gets less and less stable as it is more fully utilized. Maybe it's the memory controllers for all those cores, maybe it's marginal RAM, maybe it's something that we haven't even thought of yet....

I would love to know what is the result of running memtest86+ for 24 hours or so
I would love to know what 8 instances of the Gromacs tester that GLeeM linked to (above) does.
The Gromacs tester should automatically spawn N threads, unless you override it.

honestly, I think it's a long shot that it's a weak link in the CLI clients on this one. If you were running 4 gpu clients (with that monster mobo) or 4 SMP clients, I'd feel differently. Those are beta clients. But the CLI has an awful lot of experience under its belt.

-Sid
FAH and Vista aren't the best combination, but as long as it doesn't immediately crap out, that shouldn't be the problem. More likely is overheating, a bad CPU or RAM, maybe the PSU isn't up to the job or something... Since Skulltrail is new, it's also possible there are some unexpected bugs somewhere...

BTW, it's possible there's 1 "bad" core, so you'll be stable as long as you're affinity-locked to the other cores, but using the "wrong" one crashes the system...
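One way to work through the bad-core idea is to keep a list of which cores each run was locked to and whether it crashed, then look for cores that only ever show up in crashing runs. The Python sketch below does that bookkeeping. The function name and the sample runs are hypothetical (loosely modelled on "odd cores stable, all eight crash"), and the result only narrows the search: instability that depends on total load would produce the same pattern.

```python
def suspect_cores(runs):
    """runs: iterable of (cores_used, crashed) pairs.

    Returns the sorted cores that appear in at least one crashing run
    and in no stable run.
    """
    crashed_in = set()
    stable_in = set()
    for cores, crashed in runs:
        if crashed:
            crashed_in.update(cores)
        else:
            stable_in.update(cores)
    return sorted(crashed_in - stable_in)


# Hypothetical observations, not measured data.
runs = [
    ({1, 3, 5, 7}, False),               # four clients on odd cores ran overnight
    ({0, 1, 2, 3, 4, 5, 6, 7}, True),    # all eight cores in use, box locked up
]
print(suspect_cores(runs))
```

Adding runs locked to the even cores, or to one core at a time, shrinks the suspect list further.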




 

Foxery

Golden Member
Jan 24, 2008
1,709
0
0
An interesting tidbit from the project head, for anyone else who might be crazy enough to run on a >4 core system :)


Re: Releasing a 8-instance version of FAH-SMP?

New post by VijayPande on Wed Mar 05, 2008 11:05 pm

Our plan is to have a core which can handle an arbitrary # of cores. We are doing testing there, but it's not ready to roll out just yet.
 

DerekWilson

Platinum Member
Feb 10, 2003
2,920
34
81
soon as i get this never ending article done ... (grumble grumble) ... i'll set up the gromacs test ... if that fails i'll run memtest ...

if that fails i'll reinstall the os ...

if that fails i'll have to wait until i get my new RAM -- and I'm not sure exactly when that will be ...
 

DerekWilson

Platinum Member
Feb 10, 2003
2,920
34
81
should i try the v6 beta clients ? would there be any possible difference that could somehow lead to better stability?
 

GLeeM

Elite Member
Apr 2, 2004
7,199
128
106
Originally posted by: DerekWilson
should i try the v6 beta clients ? would there be any possible difference that could somehow lead to better stability?

I think the v6 clients are OK and will be leaving beta soon, but I know that the v5 console client is very stable, at least on Win XP.

Also, the client starts the "core", which is what does the work and uses the computer's resources. You have the same chance of getting any given WU core to crunch with either client if they are configured the same.

What does the FAHlog.txt file show for error type at the point where the app crashes?
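To pull the end of each client's log after a lockup without opening eight files by hand, something like the Python sketch below would do. It makes no assumptions about what the error lines look like; it just returns the last few lines of a text file. The function name and the default of 40 lines are arbitrary choices for this example.

```python
from collections import deque


def tail_lines(path, count=40):
    """Return the last `count` lines of a text file, without trailing newlines."""
    with open(path, "r", encoding="utf-8", errors="replace") as handle:
        # deque with maxlen keeps only the most recent lines while streaming.
        return [line.rstrip("\n") for line in deque(handle, maxlen=count)]
```

With the directory layout described earlier in the thread, `tail_lines("1/FAHlog.txt")` would give the end of the first client's log.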

Sorry we haven't been able to help you more :( The folders here really are very good at helping, but you are trying things none of us have experienced before :thumbsup: so we are anxiously hoping you/we can resolve these issues with future technology :D

Well, time to finish the stats, I see you have your first Milestone ;)
 

DerekWilson

Platinum Member
Feb 10, 2003
2,920
34
81
....

could this be the problem

i'm gonna kick myself if it is ...

i didn't set the clients to report less memory than the system has ... each client was reporting 4GB RAM ...

it might be that 4 clients reporting 4GB didn't have a problem but 8 clients reporting 4GB might ...

i set up all 8 to report 2024mb ... which is less reported total than the 4x 4092 that the stable 4 clients were running ...

should i drop it down to less than that? is this not a problem at all usually?
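The arithmetic in the post checks out, as the short Python snippet below shows. Whether the sum of the reported values matters at all is only the hypothesis being tested here, and the helper name is invented for the example.

```python
def total_reported_mb(clients, per_client_mb):
    """Sum of the memory sizes that all clients together report, in MB."""
    return clients * per_client_mb


stable_four = total_reported_mb(4, 4092)   # the 4 clients that ran fine
all_eight = total_reported_mb(8, 2024)     # the new setting for 8 clients
print(stable_four, all_eight, all_eight < stable_four)
```

That is 16368 MB for the stable four against 16192 MB for all eight at the new setting.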
 

Rattledagger

Elite Member
Feb 5, 2001
2,994
19
81
Originally posted by: DerekWilson
....

could this be the problem

i'm gonna kick myself if it is ...

i didn't set the clients to report less memory than the system has ... each client was reporting 4GB RAM ...

it might be that 4 clients reporting 4GB didn't have a problem but 8 clients reporting 4GB might ...

i set up all 8 to report 2024mb ... which is less reported total than the 4x 4092 that the stable 4 clients were running ...

should i drop it down to less than that? is this not a problem at all usually?

Most FAH WUs use either around 10 MB or around 100 MB of memory, but there are some using around 250 MB (especially SMP WUs). But regardless of which WUs you're getting, you shouldn't have a problem on a 4 GB system...
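As a quick sanity check on those figures, the Python sketch below works out how much of a 4 GB system is left if all 8 clients happen to get the largest kind of WU mentioned. The 10/100/250 MB sizes are the rough figures from this post; the helper name is hypothetical, and the OS's own memory use is not counted.

```python
def headroom_mb(system_mb, clients, per_wu_mb):
    """Memory left over if every client runs a WU of the given size."""
    return system_mb - clients * per_wu_mb


for per_wu in (10, 100, 250):
    print(per_wu, headroom_mb(4096, 8, per_wu))
```

Even the 250 MB case uses about 2000 MB across 8 clients, which leaves roughly half of the 4096 MB free.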

It's possible you'll hit the bug where memory size is wrongly reported, but the only effect this would have is you'll be assigned only "small" WUs (or none at all). Still, setting all to 2 GB or 1 GB shouldn't be a problem.


As for v6, it seems to have roughly the same bugs as v5, and in any case it's running the same science cores, and it's the science cores that crash your computer...

 

DerekWilson

Platinum Member
Feb 10, 2003
2,920
34
81
after i set mem reporting to <2GB, 5 clients ran all night ... i'm gonna set them all to report 1 gb and try running all 8 again ...

also, can you just drop in a v6 client into the folder where v5 was running and have it work? just wondering ...
 

biodoc

Diamond Member
Dec 29, 2005
6,338
2,243
136
Not sure about that but in ver 6 you will need a passkey to enter during config.

Here's a link to get the passkey.

Here's a link to the passkey FAQ.

Cheers!
 

Foxery

Golden Member
Jan 24, 2008
1,709
0
0
Derek, I think you can overwrite v5 with v6, but I'd back up the directory before trying one just in case - or use the "-oneunit" flag to have the client finish its current WU without retrieving a new one. But like others said, it's unlikely the client will change much since the work is done by the "Cores."

biodoc, The passkey is an optional feature. My clients all run without one.
 

biodoc

Diamond Member
Dec 29, 2005
6,338
2,243
136
Nope, you don't need a passkey but probably a good idea to use one. If you check out the FAQ, they were talking about handing out bonus points if you use the passkey.

Could be just Stanford "bait" though for whatever reason.;)
 

Insidious

Diamond Member
Oct 25, 2001
7,649
0
0
I think that passkey thing was pretty much just a ploy to try and scare people out of cherry-picking the better scored work units (deleting the ones that were scored ridiculously far below others in terms of PPD and only crunching the ones that were scored higher.)
Stanford's "non official" representatives (moderators, etc..) posted stuff like... "you better watch out.... they might take your points away"... "we'll know who you are now...." and the like.
But of course, Stanford pursued this "anti-WU choice" software with the same vigor they have pursued fixing lots of other known issues (eg: the Windows SMP clients, etc.....) so all that was really done was add another useless and confusing step to running F@H with no benefit whatsoever.

(IMO)

-Sid

 

Rattledagger

Elite Member
Feb 5, 2001
2,994
19
81
Originally posted by: Insidious
I think that passkey thing was pretty much just a ploy to try and scare people out of cherry-picking the better scored work units (deleting the ones that were scored ridiculously far below others in terms of PPD and only crunching the ones that were scored higher.)
Stanford's "non official" representatives (moderators, etc..) posted stuff like... "you better watch out.... they might take your points away"... "we'll know who you are now...." and the like.
But of course, Stanford pursued this "anti-WU choice" software with the same vigor they have pursued fixing lots of other known issues (eg: the Windows SMP clients, etc.....) so all that was really done was add another useless and confusing step to running F@H with no benefit whatsoever.

(IMO)

-Sid
If the Passkey-FAQ is anything to go by, you'll only get bonus if you're using a Passkey... But, as long as v6 is still "beta", they can't really stop giving bonuses to non-passkey-clients...

As for the Passkey itself, it's a 32-character random string made up of the digits 0123456789 and, AFAIK, only the letters ABCDEF...

... Meaning, it looks just like a BOINC Account Key...
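Taking that description at face value (32 characters, hexadecimal digits only), a quick shape check can be sketched in Python as below. This only tests the format described in this post, not whether a key is valid with Stanford; accepting lowercase letters is an assumption, and the function name is made up.

```python
import re

# 32 characters, each one of 0-9 or A-F (lowercase accepted as an assumption).
_HEX32 = re.compile(r"[0-9A-Fa-f]{32}")


def looks_like_passkey(text):
    """True if `text` has the 32-hex-character shape described above."""
    return _HEX32.fullmatch(text) is not None


print(looks_like_passkey("0123456789abcdef0123456789abcdef"))
print(looks_like_passkey("not-a-passkey"))
```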

Still, even if support for an Account Key has now been added, it's doubtful Stanford will manage to make a usable FAH/BOINC application this decade...


 

DerekWilson

Platinum Member
Feb 10, 2003
2,920
34
81
running all 8 with each reporting 1 GB crashed the box in a couple hours again ... but here's a strange thing ...

even though the keyboard didn't respond and hitting the soft power button on the mobo didn't work ... i was able to access shared files over smb ...

if i could get to the HD through the NIC, doesn't that mean i should be able to do other things as well? if i setup telnet could i figure out what is going on? what about if i set up the remote assistance thing? i've never used it before, but would that work?
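A cheap way to see which services still answer from another machine during a lockup is to try opening TCP connections to their ports. The Python sketch below does only that. The function name and the timeout are choices made for this example. A port that accepts a connection shows the network stack is still alive, not that the service behind it can do useful work.

```python
import socket


def port_answers(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port can be opened."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and name lookup failures.
        return False


# Standard ports: 445 is SMB, 23 is telnet, 3389 is Remote Desktop.
# "skulltrail-box" is a hypothetical host name; run this from a second PC:
#   for port in (445, 23, 3389):
#       print(port, port_answers("skulltrail-box", port))
```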

... also when the machine locked up a bunch of checksums didn't match and WUs that were at ~80% had to start over :-(
 

bryanW1995

Lifer
May 22, 2007
11,144
32
91
Originally posted by: petrusbroder
:D No, Jim, we are not doing a rerun of the "please delete" - thread - it was way too much writing, work, stats and so on! ;)

But it was a lot of fun ...

OnTopic: I am following this discussion because I am contemplating a similar rig.

why do you want a skulltrail?? Who's gonna use all those old p3's and a64's ??? :)
 

Insidious

Diamond Member
Oct 25, 2001
7,649
0
0
a bunch of checksums didn't match and WUs that were at ~80% had to start over :-(

Time to test the hardware (RAM, HD and MoBo)

Is that motherboard sporting the latest BIOS? Is Vista running the latest drivers?

Seems like when that machine gets busy, it dies

-Sid
 

bryanW1995

Lifer
May 22, 2007
11,144
32
91
Derek, I would run the gromacs app on all 8 cores overnight. It could also very easily be cpu overheating. 8 cores at 3.2 ghz running f@h has to generate a TON of heat!

btw, I think that your earlier post about possibly using f@h as a stability tester is a very good idea. It appears that you're doing everything right but the system has a flaw in one of its components. Maybe intel/amd will test their components a little more rigorously if they know that AT and others are going to really thrash the hell outta them with f@h!
 

Foxery

Golden Member
Jan 24, 2008
1,709
0
0
Originally posted by: DerekWilson
even though the keyboard didn't respond and hitting the soft power button on the mobo didn't work ... i was able to access shared files over smb ...

if i could get to the HD through the NIC, doesn't that mean i should be able to do other things as well? if i setup telnet could i figure out what is going on? what about if i set up the remote assistance thing? i've never used it before, but would that work?

Huh, that shouldn't be. When do the logs say the clients stopped responding; overnight, or when you turned off the machine? Sounds more like a Windows problem to me if the CPU is still active at all. You could also share a Folding directory or two over the network, and the next time it crashes, see if it's still running. (Timestamps are in one of the European time zones. Subtract hours as needed for where you live.)
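If a Folding directory is shared, one rough test from a second machine is to look at how long ago a file in it was last modified. The Python sketch below compares modification times, which avoids doing time-zone arithmetic on the log's own timestamps. It assumes the client writes to the file reasonably often while it is working; the 30 minute threshold and the function names are arbitrary picks for this example.

```python
import os
import time


def seconds_since_modified(path, now=None):
    """Seconds elapsed since the file at `path` was last modified."""
    if now is None:
        now = time.time()
    return now - os.path.getmtime(path)


def still_being_written(path, max_idle_seconds=1800, now=None):
    """True if the file changed within the last `max_idle_seconds`."""
    return seconds_since_modified(path, now) <= max_idle_seconds


# Hypothetical share path, checked from another PC after a lockup:
#   still_being_written(r"\\skulltrail-box\fah\1\FAHlog.txt")
```

If the clocks on the two machines disagree, the result shifts by that difference, so a generous threshold is safer.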

I'd be very amused if you wound up discovering a fatal hardware flaw in Intel's newest poster-boy platform... but maybe that's just my long, bad day at work talking :p
 

biodoc

Diamond Member
Dec 29, 2005
6,338
2,243
136
Yep, I agree, need to do bios & driver updates, then do extensive hardware tests.

In the immortal words of Ross Perot, "This dog won't hunt!"

cheers! :beer:

 

GLeeM

Elite Member
Apr 2, 2004
7,199
128
106
Originally posted by: DerekWilson
....

could this be the problem

i'm gonna kick myself if it is ...

i didn't set the clients to report less memory than the system has ... each client was reporting 4GB RAM ...

it might be that 4 clients reporting 4GB didn't have a problem but 8 clients reporting 4GB might ...

i set up all 8 to report 2024mb ... which is less reported total than the 4x 4092 that the stable 4 clients were running ...

should i drop it down to less than that? is this not a problem at all usually?

You could set each client to report 512 MB and you would still get the same WUs. That feature was added for rigs with not much memory.

The WUs would still run on rigs with not enough ram, just not very well as virtual memory (hard drive) would be used.

Someone should go over to the folding forum with this problem or PM a good helper over there to come over here and help. They should be interested in this new technology as it is their future. They should like to see this eight core work with their software.