BOINC: Windows to Linux


crashtech

Lifer
Jan 4, 2013
10,682
2,280
146
A room thermometer? :-)

If FAHControl doesn't show that FahCore is stuck, you could run something like nvidia-smi dmon -d 30 instead.
Ha, well, one of these is in a cold garage, and the other in the house, where we are performing our annual ritual to see who gives in and turns on the heat. I think the Sprint may have staved off the need to burn natural gas for another week or so, but it was 14C in the great room this morning and noticeably colder in the bedrooms.
 

biodoc

Diamond Member
Dec 29, 2005
6,326
2,243
136
My Linux dual 1080Ti rig stopped folding a couple of days back. I haven't stopped the client yet, so I still have the log files "frozen in time". It looks like both GPUs stopped due to downloading issues from each of the 2 servers at Temple University.

Code:
09:12:48:WU00:FS00:0x21:Completed 12000000 out of 12500000 steps (96%)
09:15:07:WU00:FS00:0x21:Completed 12125000 out of 12500000 steps (97%)
09:17:26:WU00:FS00:0x21:Completed 12250000 out of 12500000 steps (98%)
09:19:46:WU00:FS00:0x21:Completed 12375000 out of 12500000 steps (99%)
09:22:04:WU00:FS00:0x21:Completed 12500000 out of 12500000 steps (100%)
09:22:05:WU02:FS00:Connecting to 65.254.110.245:8080
09:22:05:WU02:FS00:Assigned to work server 155.247.166.219
09:22:05:WU02:FS00:Requesting new work unit for slot 00: RUNNING gpu:0:GP102 [GeForce GTX 1080 Ti] 11380 from 155.247.166.219
09:22:05:WU02:FS00:Connecting to 155.247.166.219:8080
09:22:05:WU02:FS00:Downloading 27.47MiB
09:22:05:WU00:FS00:0x21:Saving result file logfile_01.txt
09:22:05:WU00:FS00:0x21:Saving result file checkpointState.xml
09:22:05:WU00:FS00:0x21:Saving result file checkpt.crc
09:22:05:WU00:FS00:0x21:Saving result file log.txt
09:22:05:WU00:FS00:0x21:Saving result file positions.xtc
09:22:05:WU00:FS00:0x21:Folding@home Core Shutdown: FINISHED_UNIT
09:22:06:WU00:FS00:FahCore returned: FINISHED_UNIT (100 = 0x64)
09:22:06:WU00:FS00:Sending unit results: id:00 state:SEND error:NO_ERROR project:14180 run:4 clone:407 gen:44 core:0x21 unit:0x0000003c0002894c5d3b54d93db7f612
09:22:06:WU00:FS00:Uploading 14.80MiB to 155.247.166.220
09:22:06:WU00:FS00:Connecting to 155.247.166.220:8080
09:22:07:WU00:FS00:Upload complete
09:22:08:WU00:FS00:Server responded WORK_ACK (400)
09:22:08:WU00:FS00:Final credit estimate, 197856.00 points
09:22:08:WU00:FS00:Cleaning up
09:22:20:WU02:FS00:Download 0.23%
09:23:17:WU02:FS00:Download 0.46%
09:23:32:WU02:FS00:Download 0.91%

Code:
05:16:41:WU00:FS01:0x21:Completed 12000000 out of 12500000 steps (96%)
05:19:07:WU00:FS01:0x21:Completed 12125000 out of 12500000 steps (97%)
05:21:32:WU00:FS01:0x21:Completed 12250000 out of 12500000 steps (98%)
05:23:58:WU00:FS01:0x21:Completed 12375000 out of 12500000 steps (99%)
05:26:22:WU00:FS01:0x21:Completed 12500000 out of 12500000 steps (100%)
05:26:23:WU01:FS01:Connecting to 65.254.110.245:8080
05:26:23:WU01:FS01:Assigned to work server 155.247.166.219
05:26:23:WU01:FS01:Requesting new work unit for slot 01: RUNNING gpu:1:GP102 [GeForce GTX 1080 Ti] 11380 from 155.247.166.219
05:26:23:WU01:FS01:Connecting to 155.247.166.219:8080
05:26:23:WU01:FS01:Downloading 27.50MiB
05:26:24:WU00:FS01:0x21:Saving result file logfile_01.txt
05:26:24:WU00:FS01:0x21:Saving result file checkpointState.xml
05:26:24:WU00:FS01:0x21:Saving result file checkpt.crc
05:26:24:WU00:FS01:0x21:Saving result file log.txt
05:26:24:WU00:FS01:0x21:Saving result file positions.xtc
05:26:24:WU00:FS01:0x21:Folding@home Core Shutdown: FINISHED_UNIT
05:26:24:WU00:FS01:FahCore returned: FINISHED_UNIT (100 = 0x64)
05:26:24:WU00:FS01:Sending unit results: id:00 state:SEND error:NO_ERROR project:14180 run:7 clone:469 gen:66 core:0x21 unit:0x0000006b0002894c5d3b555d1adf11ae
05:26:24:WU00:FS01:Uploading 14.88MiB to 155.247.166.220
05:26:24:WU00:FS01:Connecting to 155.247.166.220:8080
05:26:26:WU00:FS01:Upload complete
05:26:26:WU00:FS01:Server responded WORK_ACK (400)
05:26:26:WU00:FS01:Final credit estimate, 193512.00 points
05:26:26:WU00:FS01:Cleaning up
05:27:19:WU01:FS01:Download 0.45%
05:27:25:WU01:FS01:Download 1.59%
05:27:37:WU01:FS01:Download 2.95%
05:27:45:WU01:FS01:Download 3.18%
05:27:59:WU01:FS01:Download 3.64%
05:28:06:WU01:FS01:Download 3.86%
05:28:16:WU01:FS01:Download 4.55%
05:29:14:WU01:FS01:Download 5.68%
05:29:20:WU01:FS01:Download 8.18%
05:29:26:WU01:FS01:Download 10.91%
05:29:32:WU01:FS01:Download 14.09%
05:29:40:WU01:FS01:Download 15.00%
05:29:49:WU01:FS01:Download 18.18%
05:29:55:WU01:FS01:Download 18.41%
05:30:02:WU01:FS01:Download 18.64%
05:30:09:WU01:FS01:Download 20.91%
05:30:16:WU01:FS01:Download 23.41%
05:30:27:WU01:FS01:Download 24.77%
05:30:48:WU01:FS01:Download 25.00%
05:30:58:WU01:FS01:Download 25.23%

I don't see the wrapper running.

Code:
mark@x20-linux:~$ ps -ef | grep fah
fahclie+   688     1  0 Oct16 ?        00:02:45 /usr/bin/FAHClient /etc/fahclient/config.xml --run-as fahclient --pid-file=/var/run/fahclient.pid --daemon
fahclie+   690   688  0 Oct16 ?        00:09:45 /usr/bin/FAHClient --child --lifeline 688 /etc/fahclient/config.xml --run-as fahclient --pid-file=/var/run/fahclient.pid --daemon
boinc     3578 15046 99 05:35 ?        05:41:31 ../../projects/www.worldcommunitygrid.org/wcgrid_fahb_bedam_7.30_x86_64-pc-linux-gnu -seed 656793313 -trickle 0 -upload 0 -wcgval 10000
boinc    13722 15046 99 09:27 ?        01:49:58 ../../projects/www.worldcommunitygrid.org/wcgrid_fahb_bedam_7.30_x86_64-pc-linux-gnu -seed 964070089 -trickle 0 -upload 0 -wcgval 10000
mark     16909  2363  0 11:17 pts/0    00:00:00 grep --color=auto fah
mark@x20-linux:~$ ps -ef | grep FAH
fahclie+   688     1  0 Oct16 ?        00:02:45 /usr/bin/FAHClient /etc/fahclient/config.xml --run-as fahclient --pid-file=/var/run/fahclient.pid --daemon
fahclie+   690   688  0 Oct16 ?        00:09:45 /usr/bin/FAHClient --child --lifeline 688 /etc/fahclient/config.xml --run-as fahclient --pid-file=/var/run/fahclient.pid --daemon
mark     14903  1392  1 10:21 ?        00:00:39 /usr/bin/python /usr/bin/FAHControl
mark     16912  2363  0 11:17 pts/0    00:00:00 grep --color=auto FAH

Anything else you'd like me to check before I restart F@H?
 

Markfw

Moderator Emeritus, Elite Member
May 16, 2002
27,096
16,014
136
Anything else you'd like me to check before I restart F@H?
I have gotten so used to things like these, I just reboot.
 

crashtech

Lifer
Jan 4, 2013
10,682
2,280
146
Anything else you'd like me to check before I restart F@H?
I don't know what I don't know! For the race, I just wish there was a way for the system to trip an alarm when this happens. It's a busy time of year, and potentially millions of ppd could be at stake.
 

Markfw

Moderator Emeritus, Elite Member
May 16, 2002
27,096
16,014
136
I don't know what I don't know! For the race, I just wish there was a way for the system to trip an alarm when this happens. It's a busy time of year, and potentially millions of ppd could be at stake.
Use this:
[Image: PhEVFyY.png]


And you only have to check one computer, and click on each client for "running" and ppd. It takes 30 seconds total, and I do it 3 times a day or more. If you are in a real hurry, sometimes just looking at the total farm ppd (lower left corner) will tell you if a box is not working correctly.

Edit: the above works in Windows or Linux, with clients mixed. In the above, it's running on Windows, and 8 of the 12 clients are Linux.
 

StefanR5R

Elite Member
Dec 10, 2016
6,553
10,293
136
In the meantime, is there a better method to keep tabs on these GPUs and be warned when they fall idle?
A script could check periodically whether FahCore is running. If it isn't for more than, say, one or two minutes, the script could killall -9 FAHClient, then restart the service. (Multi-GPU hosts require extra consideration.)

Another but unproven idea would be to blacklist the offending work servers in a firewall rule.
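A minimal sketch of the first idea, in Python, assuming that the science core shows up in the process list under a name starting with "FahCore" (as in FahCore_21). The names fahcore_running, IdleTracker and watch are made up for illustration. The sketch only prints a warning; the kill and restart step is left out on purpose, because the right service command depends on the install.

```python
import subprocess
import time


def fahcore_running(ps_comm_output):
    """Return True if any process name in `ps -eo comm=` output starts with FahCore."""
    return any(
        line.strip().startswith("FahCore")
        for line in ps_comm_output.splitlines()
    )


class IdleTracker:
    """Tracks how long FahCore has been continuously absent."""

    def __init__(self, grace_seconds=120):
        self.grace_seconds = grace_seconds
        self.idle_since = None  # time at which FahCore was first seen missing

    def update(self, running, now):
        """Record one observation; return True once the grace period has elapsed."""
        if running:
            self.idle_since = None
            return False
        if self.idle_since is None:
            self.idle_since = now
        return now - self.idle_since >= self.grace_seconds


def watch(poll_seconds=30, grace_seconds=120):
    """Poll forever and print a warning while FahCore has been gone too long."""
    tracker = IdleTracker(grace_seconds)
    while True:
        out = subprocess.run(
            ["ps", "-eo", "comm="], capture_output=True, text=True, check=True
        ).stdout
        if tracker.update(fahcore_running(out), time.monotonic()):
            print("FahCore has not been running for at least", grace_seconds, "seconds")
        time.sleep(poll_seconds)
```

Calling watch() starts the loop. On a multi-GPU host this check is too coarse as written: one FahCore that is still running hides a second slot that has fallen idle.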
 

crashtech

Lifer
Jan 4, 2013
10,682
2,280
146
Both suggestions have merit, gentlemen. In the meantime, I am continuing migration to Linux (terra incognita for me). Hopefully by race time either or both solutions can be implemented!
 

StefanR5R

Elite Member
Dec 10, 2016
6,553
10,293
136
A script is available for this purpose. I'm not sure I'll run it though.
This script does something different from what I suggested. It reboots.* I for one would not run something like this.

*) ..under a variety of circumstances. Under other circumstances, it does other things. Since it has literally a dozen purposes, it is highly convoluted. I believe in shell tools which do one thing, and do it well.
 

crashtech

Lifer
Jan 4, 2013
10,682
2,280
146
@StefanR5R, I don't know enough to say, but it seems that a script that kills and restarts the client when it shows no progress is probably not very difficult for someone like you or @Ken g6.
 

crashtech

Lifer
Jan 4, 2013
10,682
2,280
146
@StefanR5R, I suppose I ought to at least say please! :oops:

I should be more ambitious and try to learn to do it myself, like I did when I was younger. But I suspect the learning curve would be pretty steep. Over time, studying what you present is helpful though. Thanks!
 

Markfw

Moderator Emeritus, Elite Member
May 16, 2002
27,096
16,014
136
OK, it just happened to me. It was an error I have never seen before, but I can't find the log file now. I just rebooted and BAM, all is fine.

Something about a timeout, multiple times.

Edit: after it started up, it's on unit 14230. The ppd calculator knows nothing about it, and it's doing 180k on a 2080 Ti, which normally does 2.2 million ppd.
 

biodoc

Diamond Member
Dec 29, 2005
6,326
2,243
136
Last night before I went to bed I checked my dual 1080Ti rig and found one GPU had stopped/hung during a WU download but the other GPU was still working normally. I was too tired to do any detective work so I committed a linux user cardinal sin and rebooted the rig and went to bed. This morning I did some searching and found the "parent" fahclient process ID is logged in /var/run/fahclient.pid.

I was thinking a monitoring script for a single gpu rig would be more straightforward than a rig with multiple gpus. For a single gpu rig, a script could monitor changes in the log file (/var/lib/fahclient/log.txt) and if no changes are detected after 15-30 minutes, then kill the primary PID and restart the client. That would not work in a multiple gpu rig since all the clients write to the same log file. Detecting one "hung" gpu in a multiple gpu rig is a more complicated problem.
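A rough sketch of the single-GPU check described above, in Python. It only decides whether the log has gone quiet; killing the primary PID and restarting the client is left out. The function names and the 30-minute default are illustrative assumptions, and the path is the one mentioned in the post.

```python
import os
import time


def seconds_since_modified(path, now=None):
    """Seconds elapsed since the file at `path` was last written."""
    now = time.time() if now is None else now
    return now - os.path.getmtime(path)


def log_is_stale(path, max_age_seconds=30 * 60, now=None):
    """True if the log has not changed for longer than max_age_seconds."""
    return seconds_since_modified(path, now) > max_age_seconds


# Example call (path taken from the post above):
# if log_is_stale("/var/lib/fahclient/log.txt"):
#     print("no log activity for 30 minutes")
```

In the logs posted earlier in the thread, a running WU writes a progress line every two to three minutes, so a 15 to 30 minute threshold leaves a wide margin.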
 

biodoc

Diamond Member
Dec 29, 2005
6,326
2,243
136
That would not work in a multiple gpu rig since all the clients write to the same log file.

Correction: There are log files for each slot too. In my dual gpu rig with 2 gpu folding slots:

/var/lib/fahclient/work/01/logfile_01.txt
/var/lib/fahclient/work/02/logfile_01.txt
 

biodoc

Diamond Member
Dec 29, 2005
6,326
2,243
136
/var/lib/fahclient/work/01/logfile_01.txt
/var/lib/fahclient/work/02/logfile_01.txt

Apparently these directories are unit #'s rather than slots. Now I'm seeing:

/var/lib/fahclient/work/00/logfile_01.txt
/var/lib/fahclient/work/02/logfile_01.txt

Lol, it's more complicated again. Directory 01 was replaced with 00 when a new WU was started.
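Since the directory numbers change from WU to WU, a script would have to discover them each time instead of hard-coding 00, 01 or 02. A sketch in Python under that assumption, with made-up function names. It also assumes the core keeps writing to logfile_01.txt while it runs, which has not been verified here.

```python
import glob
import os
import time


def work_unit_log_ages(work_root="/var/lib/fahclient/work", now=None):
    """Map each work-unit directory name to the age in seconds of its core log."""
    now = time.time() if now is None else now
    ages = {}
    for path in glob.glob(os.path.join(work_root, "*", "logfile_01.txt")):
        unit_dir = os.path.basename(os.path.dirname(path))
        ages[unit_dir] = now - os.path.getmtime(path)
    return ages


def stale_work_units(work_root="/var/lib/fahclient/work",
                     max_age_seconds=15 * 60, now=None):
    """Sorted list of work-unit directories whose core log has gone quiet."""
    ages = work_unit_log_ages(work_root, now)
    return sorted(d for d, age in ages.items() if age > max_age_seconds)
```

One gap to keep in mind: a slot that hangs during the download may not have a core log at all yet, so comparing the number of fresh logs with the number of GPU slots could be a useful extra check.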
 

crashtech

Lifer
Jan 4, 2013
10,682
2,280
146
Two out of three of my Linux F@H clients were stuck this morning. One of them is single GPU, the other dual.
 

biodoc

Diamond Member
Dec 29, 2005
6,326
2,243
136
Two out of three of my Linux F@H clients were stuck this morning. One of them is single GPU, the other dual.

Bummer. :( I'm also running WCG on 18/20 threads with 2 reserved for FAH GPUs. I was wondering if BOINC uploads/downloads were competing or interfering with FAH downloads, or taking priority over them, so in FAHControl/configure/advanced I switched "folding core priority" from "lowest possible" to "slightly higher".
 

StefanR5R

Elite Member
Dec 10, 2016
6,553
10,293
136
I'd worry about a possibility of home networking stalls if LHC@home was on. But since Mark & Mark & Jon & presumably others experience this problem, it is almost certainly a networking problem near the F@H servers instead, as has been alluded to in a forum thread which one of you found earlier. (That said, it's a bug in the client if it hangs when a download stalls, instead of trying to recover, e.g. by a new work request.)

Monitoring F@H log files could be useful, if there was a way to unjam F@H not by restarting it completely but by kicking only those subprocesses which are associated with a specific slot.
 

biodoc

Diamond Member
Dec 29, 2005
6,326
2,243
136
I was looking through FAHClient --help output and found this one:

respawn <boolean=false>
Run the application as a child process and respawn if it is killed or exits.

Do I dare add "respawn true" to the client options? The description makes it sound like it's difficult to stop if necessary.

Monitoring F@H log files could be useful, if there was a way to unjam F@H not by restarting it completely but by kicking only those subprocesses which are associated with a specific slot.

The output of 'ps -u fahclient f' would be useful I think for that purpose.

Code:
mark@x20-linux:~$ ps -u fahclient f
  PID TTY      STAT   TIME COMMAND
2442 ?        Ssl    0:25 /usr/bin/FAHClient /etc/fahclient/config.xml --run-as fahclient --pid-file=/var/run/fahclient.pid --daemon
2444 ?        Sl     2:18  \_ /usr/bin/FAHClient --child --lifeline 2442 /etc/fahclient/config.xml --run-as fahclient --pid-file=/var/run/fahclient.pid --daemon
  972 ?        SNl    0:03      \_ /usr/bin/FAHCoreWrapper /var/lib/fahclient/cores/cores.foldingathome.org/Linux/AMD64/NVIDIA/Fermi/Core_21.fah/FahCore_21 -dir 02 -suffix 01 -version
  976 ?        RNl  215:19      |   \_ /var/lib/fahclient/cores/cores.foldingathome.org/Linux/AMD64/NVIDIA/Fermi/Core_21.fah/FahCore_21 -dir 02 -suffix 01 -version 705 -lifeline 972 -
5223 ?        SNl    0:01      \_ /usr/bin/FAHCoreWrapper /var/lib/fahclient/cores/cores.foldingathome.org/Linux/AMD64/NVIDIA/Fermi/Core_21.fah/FahCore_21 -dir 00 -suffix 01 -version
5227 ?        RNl  112:52          \_ /var/lib/fahclient/cores/cores.foldingathome.org/Linux/AMD64/NVIDIA/Fermi/Core_21.fah/FahCore_21 -dir 00 -suffix 01 -version 705 -lifeline 5223
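To turn that listing into something a script can act on, the wrapper lines can be parsed for the PID and the -dir argument. A sketch in Python; the function name and the regular expression are illustrative and were written against the output format shown above only. Note that -dir is the work-unit directory, not the slot number, and whether the client recovers cleanly when a single wrapper is killed is untested.

```python
import re

# Leading PID, then a FAHCoreWrapper command line that carries "-dir NN".
WRAPPER_RE = re.compile(r"^\s*(\d+)\s.*FAHCoreWrapper\s.*?-dir\s+(\d+)")


def wrappers_by_work_dir(ps_output):
    """Map work-unit directory (e.g. "02") to the PID of its FAHCoreWrapper.

    Expects text in the format printed by `ps -u fahclient f`.
    """
    result = {}
    for line in ps_output.splitlines():
        match = WRAPPER_RE.match(line)
        if match:
            result[match.group(2)] = int(match.group(1))
    return result
```

Fed the listing above, this should map directory 02 to PID 972 and directory 00 to PID 5223.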
 

crashtech

Lifer
Jan 4, 2013
10,682
2,280
146
Bringing this thread back to the top again because I have a new problem. One Mint box in particular, an X99 2676v3 with two 1070s in it, has a FAH problem where the WUs appear to get stuck at 99.99%. Trying to stop and restart the client does not work, and "kill -9" does not work either. This is what I get after trying to kill the client(s):
Code:
c7x99@c7x99-C7X99-OCE-F:~$ ps -A | grep FAH
1820 ?        00:00:18 FAHClient <defunct>
1822 ?        00:02:26 FAHClient <defunct>
3574 ?        00:10:37 FAHControl
What I have learned is that the client has become a zombie process, possibly because of my overzealous use of kill -9, or for some other reason. Using "ps j" to determine the parent process of the zombies returns the value "1". Since 1 is "init" and the processes stay this way indefinitely, I have to reboot at this point, since init can't be killed and FAHClient can't be restarted while its zombie self "lives" on. A curious thing is why there are two entries; I will have to keep an eye on that to see if there are two when it's running normally.
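For spotting this state from a script, a sketch in Python that scans `ps -eo pid=,ppid=,stat=,comm=` output for entries whose state starts with Z. The function name is made up for illustration. It only reports the zombies: kill -9 has no effect on a process that is already a zombie, because it has already exited.

```python
def find_zombies(ps_output, name="FAHClient"):
    """Return (pid, ppid) pairs for zombie processes whose name starts with `name`.

    Expects text from `ps -eo pid=,ppid=,stat=,comm=`.
    """
    zombies = []
    for line in ps_output.splitlines():
        fields = line.split(None, 3)  # pid, ppid, stat, rest of the line
        if len(fields) < 4:
            continue
        pid, ppid, stat, comm = fields
        if stat.startswith("Z") and comm.startswith(name):
            zombies.append((int(pid), int(ppid)))
    return zombies
```

A monitoring loop could call this on each pass and raise an alert as soon as the list is non-empty, instead of finding out when the restart fails.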

Edit:
When running "normally," there are two FAHClients. The parent of the first is init, the parent of the second is the first FAHClient:
Code:
c7x99@c7x99-C7X99-OCE-F:~$ ps -A | grep FAH
 1805 ?        00:00:00 FAHClient
 1808 ?        00:00:01 FAHClient
 1841 ?        00:00:00 FAHCoreWrapper
 1865 ?        00:00:00 FAHCoreWrapper
 2704 ?        00:00:03 FAHControl
c7x99@c7x99-C7X99-OCE-F:~$ ps j 1805
 PPID   PID  PGID   SID TTY      TPGID STAT   UID   TIME COMMAND
    1  1805  1805  1805 ?           -1 Ssl    124   0:00 /usr/bin/FAHClient /etc
c7x99@c7x99-C7X99-OCE-F:~$ ps j 1808
 PPID   PID  PGID   SID TTY      TPGID STAT   UID   TIME COMMAND
 1805  1808  1805  1805 ?           -1 Sl     124   0:01 /usr/bin/FAHClient --ch
 