BOINC: Windows to Linux


Markfw

Moderator Emeritus, Elite Member
May 16, 2002
25,555
14,511
136
Bringing this thread back to the top again because I have a new problem. An X99 2676v3 Mint box in particular that has two 1070's in it has a FAH problem where the WUs appear to get stuck at 99.99%. Trying to stop and restart the client does not work, and "kill -9" does not work either. This is what I get after trying to kill the client(s):
Code:
c7x99@c7x99-C7X99-OCE-F:~$ ps -A | grep FAH
1820 ?        00:00:18 FAHClient <defunct>
1822 ?        00:02:26 FAHClient <defunct>
3574 ?        00:10:37 FAHControl
What I have learned is that the client has become a zombie process, possibly because of my overzealous use of kill -9, or for some other reason. Using "ps j" to determine the parent process of the zombies returns the value "1." Since 1 is "init," and the processes stay this way indefinitely, it means I have to reboot at this point, since init can't be killed, and FAHClient can't be restarted while its zombie self "lives" on. A curious thing is why there are two entries; I will have to keep an eye on that to see if there are two when it's running normally.

Edit:
When running "normally," there are two FAHClients. The parent of the first is init, the parent of the second is the first FAHClient:
Code:
c7x99@c7x99-C7X99-OCE-F:~$ ps -A | grep FAH
 1805 ?        00:00:00 FAHClient
 1808 ?        00:00:01 FAHClient
 1841 ?        00:00:00 FAHCoreWrapper
 1865 ?        00:00:00 FAHCoreWrapper
 2704 ?        00:00:03 FAHControl
c7x99@c7x99-C7X99-OCE-F:~$ ps j 1805
 PPID   PID  PGID   SID TTY      TPGID STAT   UID   TIME COMMAND
    1  1805  1805  1805 ?           -1 Ssl    124   0:00 /usr/bin/FAHClient /etc
c7x99@c7x99-C7X99-OCE-F:~$ ps j 1808
 PPID   PID  PGID   SID TTY      TPGID STAT   UID   TIME COMMAND
 1805  1808  1805  1805 ?           -1 Sl     124   0:01 /usr/bin/FAHClient --ch
Reboot the computer and it may fix it. I think I remember having this and that was the fix.
 

crashtech

Lifer
Jan 4, 2013
10,524
2,111
146
Reboot the computer and it may fix it. I think I remember having this and that was the fix.
Yeah, but it has to be done multiple times a day. So if I can't fix it, I'll just nuke the install and start over rather than have to reboot so often.

@biodoc, I'll check the sleep settings.

Edit: It is and has been set to never sleep. I've never noticed it suspended/sleeping before, the fans are always running when inspected.
 
Last edited:

Markfw

Moderator Emeritus, Elite Member
May 16, 2002
25,555
14,511
136
Yeah, but it has to be done multiple times a day. So if I can't fix it, I'll just nuke the install and start over rather than have to reboot so often.

@biodoc, I'll check the sleep settings.

Edit: It is and has been set to never sleep. I've never noticed it suspended/sleeping before, the fans are always running when inspected.
It only happened to me once in a while, not all the time....
 

StefanR5R

Elite Member
Dec 10, 2016
5,510
7,817
136
An X99 2676v3 Mint box in particular that has two 1070's in it has a FAH problem where the WUs appear to get stuck at 99.99%.
It may not be of much help to you, but I never encountered this with X99 2696v4, 3x 1080Ti, Mint 18.3, nvidia driver 384.130.
What I have learned is that the client has become a zombie process, possibly because of my overzealous use of kill -9, or some other reason. using "ps j" to determine the parent process of the zombies returns value "1."
When you use "ps afx" (or other means to check the command line), do the zombie processes have "--child --lifeline ..." in their command line? Edit, when it happens the next time, check the process hierarchy with the "f" and "x" flags of ps, to see which processes are still there, and what their states and relationships are, before you begin to kill them.
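A minimal sketch of that check, assuming a procps-style ps: it keeps only the processes in state Z (zombie) or D (uninterruptible sleep), together with their parent PID. The function name stuck_procs is made up for illustration, and it reads ps output on stdin so it can also be fed saved output.

```shell
# Hypothetical helper: keep the header line plus every row whose STAT
# column starts with Z (zombie) or D (uninterruptible sleep).
# Expects columns in the order PID PPID STAT COMMAND.
stuck_procs() {
    awk 'NR == 1 || $3 ~ /^[ZD]/'
}

# Live usage, run before killing anything:
#   ps -eo pid,ppid,stat,comm | stuck_procs
```

A zombie has already exited, so kill -9 has nothing left to act on; the PPID column shows which process would have to reap it.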
 

crashtech

Lifer
Jan 4, 2013
10,524
2,111
146
Output of ps afx | grep FAH:
Code:
c7x99@c7x99-C7X99-OCE-F:~$ ps afx | grep FAH
 2712 ?        Sl     3:35          |       \_ /usr/bin/python /usr/bin/FAHControl
 6447 pts/0    S+     0:00          \_ grep --color=auto FAH
 1814 ?        Ssl    0:06 /usr/bin/FAHClient /etc/fahclient/config.xml --run-as fahclient --pid-file=/var/run/fahclient.pid --daemon
 1816 ?        Sl     0:36  \_ /usr/bin/FAHClient --child --lifeline 1814 /etc/fahclient/config.xml --run-as fahclient --pid-file=/var/run/fahclient.pid --daemon
 1853 ?        SNl    0:04      \_ /usr/bin/FAHCoreWrapper /var/lib/fahclient/cores/cores.foldingathome.org/v7/lin/64bit/Core_21.fah/FahCore_21 -dir 01 -suffix 01 -version 705 -lifeline 1816 -checkpoint 15 -gpu-vendor nvidia -opencl-platform 0 -opencl-device 1 -cuda-device 1 -gpu 1
 1891 ?        SNl    0:04      \_ /usr/bin/FAHCoreWrapper /var/lib/fahclient/cores/cores.foldingathome.org/v7/lin/64bit/Core_21.fah/FahCore_21 -dir 02 -suffix 01 -version 705 -lifeline 1816 -checkpoint 15 -gpu-vendor nvidia -opencl-platform 0 -opencl-device 0 -cuda-device 0 -gpu 0

Image of FAHControl, both WUs sitting at 99.99%:
fah99.jpg

Set logfile to highest verbosity, but there are no recorded errors.
 

StefanR5R

Elite Member
Dec 10, 2016
5,510
7,817
136
Weird! There are no FahCore_21 processes running.
I suppose you already tried to pause and resume the slots, right?
Next would be to gently "sudo killall FAHCoreWrapper", I'd say.


Edit,
wait. You did a "grep FAH". Remove that, or at least do "grep -i fah" for case-insensitive filtering.

Please look up whether or not the cores are still present, and if they are, in which state.

Edit 2, I'm away for ~6 h.
 
Last edited:

crashtech

Lifer
Jan 4, 2013
10,524
2,111
146
Okay, here it is:

Code:
c7x99@c7x99-C7X99-OCE-F:~$ ps afx | grep -i fah
 2712 ?        Sl     4:00          |       \_ /usr/bin/python /usr/bin/FAHControl
 6555 pts/0    S+     0:00          \_ grep --color=auto -i fah
 1814 ?        Ssl    0:06 /usr/bin/FAHClient /etc/fahclient/config.xml --run-as fahclient --pid-file=/var/run/fahclient.pid --daemon
 1816 ?        Sl     0:40  \_ /usr/bin/FAHClient --child --lifeline 1814 /etc/fahclient/config.xml --run-as fahclient --pid-file=/var/run/fahclient.pid --daemon
 1861 ?        ZNl  291:44 [FahCore_21] <defunct>
 1898 ?        DN   102:04 [FahCore_21]

Oh, and here's a massively truncated version of the logfile after I try to pause Folding in preparation for a reboot:
Code:
22:50:24:WARNING:FS01:Killing WU01
22:50:24:WARNING:FS00:Killing WU02
22:50:24:WARNING:FS01:Killing WU01
22:50:24:WARNING:FS00:Killing WU02
22:50:24:WARNING:FS01:Killing WU01
22:50:24:WARNING:FS00:Killing WU02
22:50:24:WARNING:FS01:Killing WU01
22:50:25:WARNING:FS00:Killing WU02
22:50:25:WARNING:FS01:Killing WU01
22:50:25:WARNING:FS00:Killing WU02
22:50:25:WARNING:FS01:Killing WU01
22:50:25:WARNING:FS00:Killing WU02
22:50:25:WARNING:FS01:Killing WU01
22:50:25:WARNING:FS00:Killing WU02
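Since the log repeats the same line over and over, a rough sketch like the following can collapse it into one line per message with a count. It assumes every line starts with an HH:MM:SS: timestamp as in the excerpt above; summarize_log is a made-up name.

```shell
# Hypothetical helper: drop the HH:MM:SS timestamp (the first three
# colon-separated fields), then count how often each message repeats,
# most frequent first.
summarize_log() {
    cut -d: -f4- | sort | uniq -c | sort -rn
}

# Usage: feed it the client log, wherever your install writes it:
#   summarize_log < log.txt
```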
 

Markfw

Moderator Emeritus, Elite Member
May 16, 2002
25,555
14,511
136
Don't pause before reboot, just do it @!
 

crashtech

Lifer
Jan 4, 2013
10,524
2,111
146
The first time I tried to fix this, I uninstalled and reinstalled F@H, but that did not remove the data, which is normally okay but is not what was needed here. Now I've managed to delete the slots and recreate them. One of the slots would not delete, so another reboot was required. If this doesn't fix it, I'll have to uninstall F@H, nuke the data directory, and reinstall. At the moment, it seems okay, though.
 

biodoc

Diamond Member
Dec 29, 2005
6,262
2,238
136
@crashtech, it's possible it's a driver issue. I've got the 390 driver running with my dual GTX 1080 Ti rig. When I change or reinstall the drivers on Mint, I purge the system of nvidia-related files first.

sudo apt purge nvidia*
sudo apt install nvidia-driver-390
reboot
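One caveat with the purge line, offered as a side note rather than a correction: an unquoted nvidia* is seen by the shell first, which replaces it with matching file names if any exist in the current directory. The sketch below demonstrates that in a scratch directory (the file name is made up); quoting the pattern hands it to apt unchanged.

```shell
# Demonstrate shell globbing in a throwaway directory.
demo_dir=$(mktemp -d)
cd "$demo_dir" || exit 1
touch nvidia-installer.log      # made-up file that happens to match

set -- nvidia*                  # unquoted: the shell expands the glob
unquoted=$1
set -- 'nvidia*'                # quoted: the pattern stays literal
quoted=$1

echo "unquoted -> $unquoted"
echo "quoted   -> $quoted"
```

So the safer spelling would be sudo apt purge 'nvidia*', with the quotes.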
 

crashtech

Lifer
Jan 4, 2013
10,524
2,111
146
This rig has had 430 drivers all along. I am wondering if the problem was caused by me setting this rig up as single GPU, then throwing another 1070 in later. Also they are different brands, but I believe their performance characteristics are similar. The second GPU seemed to work okay, at first.
 

biodoc

Diamond Member
Dec 29, 2005
6,262
2,238
136
This rig has had 430 drivers all along. I am wondering if the problem was caused by me setting this rig up as single GPU, then throwing another 1070 in later. Also they are different brands, but I believe their performance characteristics are similar. The second GPU seemed to work okay, at first.

After adding a second gpu, you should run sudo nvidia-xconfig -a
 

crashtech

Lifer
Jan 4, 2013
10,524
2,111
146
After adding a second gpu, you should run sudo nvidia-xconfig -a
This is what I get:
Code:
WARNING: Unable to locate/open X configuration file.

Package xorg-server was not found in the pkg-config search path.
Perhaps you should add the directory containing `xorg-server.pc'
to the PKG_CONFIG_PATH environment variable
No package 'xorg-server' found

WARNING: Unable to use the nvidia-cfg library to query NVIDIA hardware.


ERROR: Unable to determine number of GPUs in system; cannot honor
       '--enable-all-gpus' option.

New X configuration file written to '/etc/X11/xorg.conf'
The good news is that it appears to be cured, my forced deletion of all slots might have cured it for now.
 

crashtech

Lifer
Jan 4, 2013
10,524
2,111
146
Ugh, no, it's still broken. One slot hangs at 99.99%, and when I try to pause to reboot, the log file gives the dreaded "WARNING:FS01:Killing WU01" forever. Perhaps I just need to re-install with both GPUs in place, though this seems pretty stupid coming from Win10, where such hardware swaps are a breeze.
 

StefanR5R

Elite Member
Dec 10, 2016
5,510
7,817
136
1898 ? DN 102:04 [FahCore_21]
"D" = The process is blocked in a kernel call, and cannot be interrupted. It may be possible to find out where exactly it is blocked in the kernel, but I'd have to look this up. Can't do it right now.

It's possible it's a driver issue.
It very likely is. This being a 10-series GPU system, removing the 400 driver and installing a 300 driver would be worthwhile IMO.

Another possibility is a hardware defect, causing the driver to hang. But a driver bug is more likely IMO than a hardware defect.

I am wondering if the problem was caused by me setting this rig up as single GPU, then throwing another 1070 in later. Also they are different brands,
No, this is not a problem at all. In my experience,¹ once you have one 10-series GPU working properly, you can just add in more, including others (1080Ti, 1080, 1070...), or remove one again, without any changes to the system config, e.g. xorg.

¹) I have a simple software setup though, e.g. without coolbits. My fans are set to a fixed speed in hardware.
 
Last edited:

biodoc

Diamond Member
Dec 29, 2005
6,262
2,238
136
Maybe it's time to start with a clean slate. Uninstall and reinstall fahclient and nvidia drivers.

sudo dpkg -P fahclient
sudo apt purge nvidia*
reboot
sudo apt install nvidia-driver-390 (I have driver 430 installed on my dual 1080 rig and it's fine)
reboot
reinstall fahclient, etc.
 

crashtech

Lifer
Jan 4, 2013
10,524
2,111
146
@biodoc, I'll give that a try, thanks.

Edit: Right off the bat, I get this:
Removing system user fahclient...failed
Because one of the damn clients is always hung. But I'll get it.
 

crashtech

Lifer
Jan 4, 2013
10,524
2,111
146
Pausing and rebooting worked. I don't mind autostart, really. One less thing to do on the 10 machines I manage.

Anyway, it's chugging away now, let's see if I can get back to where I was before:
output.jpg
 

StefanR5R

Elite Member
Dec 10, 2016
5,510
7,817
136
Did the problem start right away when you added the 2nd GPU,
or did it start only several days later after you added it?

Did the start of the problem coincide with a kernel update or driver update?

"D" = The process is blocked in a kernel call, and cannot be interrupted. It may be possible to find out where exactly it is blocked in the kernel, but I'd have to look this up.
cat /proc/12345/wchan shows the kernel function in which a process is blocked, if any.
Replace "12345" by the process ID.

The key combination [Alt][SysRq][t] dumps kernel traces of all current processes into the kernel log. This is a huge amount of data, but more informative than wchan. One way to look at the kernel log is via the "dmesg" command. If you are not at a local console, you can induce the same effect as the key combo by echo t | sudo tee /proc/sysrq-trigger. More on SysRq can be found e.g. at wikipedia.
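To do the per-process lookup in one go, here is a rough sketch rather than a tested tool. It prints the State line from /proc/PID/status, the wchan value, and, where readable, /proc/PID/stack (which usually needs root and a kernel built with stack tracing). The names inspect_pid and PROC_ROOT are made up; PROC_ROOT exists only so the function can be pointed at a fake directory for testing.

```shell
# Hypothetical helper: show state, wait channel and (if readable) the
# kernel stack of one process. PROC_ROOT defaults to /proc.
inspect_pid() {
    pid=$1
    proc=${PROC_ROOT:-/proc}/$pid
    [ -d "$proc" ] || { echo "no such process: $pid" >&2; return 1; }

    # e.g. "State:  D (disk sleep)" for an uninterruptible process
    grep '^State:' "$proc/status"
    printf 'wchan: %s\n' "$(cat "$proc/wchan" 2>/dev/null)"
    # The kernel stack is typically readable by root only.
    if [ -r "$proc/stack" ]; then
        cat "$proc/stack" 2>/dev/null
    fi
}

# Usage, with the PID of the stuck core:
#   sudo sh -c '. ./inspect_pid.sh; inspect_pid 12345'
```

The file name inspect_pid.sh in the usage line is likewise only a placeholder for wherever you save the function.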
 

crashtech

Lifer
Jan 4, 2013
10,524
2,111
146
It was well after the 2nd GPU was added, and I can't remember exactly if I did anything. I didn't think so, but something changed. The only thing I remember doing was adding "next-unit-percentage" to the advanced settings with a value of 100.

So far it is working with the driver downgrade to 390 and F@H reinstall. Let's see what happens.
 

crashtech

Lifer
Jan 4, 2013
10,524
2,111
146
@StefanR5R, it's still broken. "cat /proc/12345/wchan" appears to return "0."

Deleted link, possible security risk...
 
Last edited:

biodoc

Diamond Member
Dec 29, 2005
6,262
2,238
136
From the Google searches I've done, there's no known cause, but some speculate that it's a driver crash issue. Nearly all the posts are several years old. One person said remote desktop apps can cause driver crashes on the remote computer.
 

crashtech

Lifer
Jan 4, 2013
10,524
2,111
146
Tomorrow I am probably going to nuke it and pave it. The only reason I didn't start yet is to get that kernel dump in case it had interesting info for Stefan.