BOINC: Windows to Linux

Markfw · Dec 11, 2019

crashtech said:
Bringing this thread back to the top again because I have a new problem. An X99 2676v3 Mint box in particular that has two 1070's in it has a FAH problem where the WUs appear to get stuck at 99.99%. Trying to stop and restart the client does not work, and "kill -9" does not work either. This is what I get after trying to kill the client(s):

Code:

c7x99@c7x99-C7X99-OCE-F:~$ ps -A | grep FAH
 1820 ?        00:00:18 FAHClient <defunct>
 1822 ?        00:02:26 FAHClient <defunct>
 3574 ?        00:10:37 FAHControl

What I have learned is that the client has become a zombie process, possibly because of my overzealous use of kill -9, or some other reason. Using "ps j" to determine the parent process of the zombies returns the value "1." Since 1 is "init," and the processes stay this way indefinitely, it means I have to reboot at this point, since init can't be killed, and FAHClient can't be restarted while its zombie self "lives" on. A curious thing is why there are two entries; I will have to keep an eye on that to see if there are two when it's running normally.

Edit:
When running "normally," there are two FAHClients. The parent of the first is init, the parent of the second is the first FAHClient:

Code:

c7x99@c7x99-C7X99-OCE-F:~$ ps -A | grep FAH
 1805 ?        00:00:00 FAHClient
 1808 ?        00:00:01 FAHClient
 1841 ?        00:00:00 FAHCoreWrapper
 1865 ?        00:00:00 FAHCoreWrapper
 2704 ?        00:00:03 FAHControl
c7x99@c7x99-C7X99-OCE-F:~$ ps j 1805
 PPID   PID  PGID   SID TTY      TPGID STAT   UID   TIME COMMAND
    1  1805  1805  1805 ?           -1 Ssl    124   0:00 /usr/bin/FAHClient /etc
c7x99@c7x99-C7X99-OCE-F:~$ ps j 1808
 PPID   PID  PGID   SID TTY      TPGID STAT   UID   TIME COMMAND
 1805  1808  1805  1805 ?           -1 Sl     124   0:01 /usr/bin/FAHClient --ch

Reboot the computer and it may fix it. I think I remember having this and that was the fix.

crashtech · Dec 11, 2019

Markfw said:
Reboot the computer and it may fix it. I think I remember having this and that was the fix.

Yeah, but it has to be done multiple times a day. So if I can't fix it, I'll just nuke the install and start over rather than have to reboot so often.

@biodoc, I'll check the sleep settings.

Edit: It is and has been set to never sleep. I've never noticed it suspended/sleeping before, the fans are always running when inspected.

Markfw · Dec 11, 2019

crashtech said:
Yeah, but it has to be done multiple times a day. So if I can't fix it, I'll just nuke the install and start over rather than have to reboot so often.

@biodoc, I'll check the sleep settings.

Edit: It is and has been set to never sleep. I've never noticed it suspended/sleeping before, the fans are always running when inspected.

It only happened to me once in a while, not all the time....

StefanR5R · Dec 11, 2019

crashtech said:
An X99 2676v3 Mint box in particular that has two 1070's in it has a FAH problem where the WUs appear to get stuck at 99.99%.

It may not be of much help to you, but I never encountered this with X99 2696v4, 3x 1080Ti, Mint 18.3, nvidia driver 384.130.

crashtech said:
What I have learned is that the client has become a zombie process, possibly because of my overzealous use of kill -9, or some other reason. Using "ps j" to determine the parent process of the zombies returns the value "1."

When you use "ps afx" (or other means to check the command line), do the zombie processes have "--child --lifeline ..." in their command line? Edit, when it happens the next time, check the process hierarchy with the "f" and "x" flags of ps, to see which processes are still there, and what their states and relationships are, before you begin to kill them.
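A rough sketch of the check suggested above: list only the processes that are zombies (state Z) or stuck in uninterruptible sleep (state D), together with their parent PIDs, before sending any signals. The function name `stuck_procs` is made up for illustration; the `ps` invocation in the comment assumes the procps version of `ps` that ships with Mint.

```shell
#!/bin/sh
# Reads "PID PPID STAT COMMAND..." lines on stdin and prints only those
# whose STAT column starts with Z (zombie) or D (uninterruptible sleep).
# stuck_procs is a hypothetical helper name, not part of any FAH tooling.
stuck_procs() {
    awk '$3 ~ /^[ZD]/ { print }'
}

# Typical use on a live system (the trailing "=" suppresses the header line):
#   ps -eo pid=,ppid=,stat=,args= | stuck_procs
```

The PPID column tells you which process would have to reap a zombie; a D-state process will not react to any signal until the kernel call it is blocked in returns.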

crashtech · Dec 11, 2019

Output of ps afx | grep FAH:

Code:

c7x99@c7x99-C7X99-OCE-F:~$ ps afx | grep FAH
 2712 ?        Sl     3:35          |       \_ /usr/bin/python /usr/bin/FAHControl
 6447 pts/0    S+     0:00          \_ grep --color=auto FAH
 1814 ?        Ssl    0:06 /usr/bin/FAHClient /etc/fahclient/config.xml --run-as fahclient --pid-file=/var/run/fahclient.pid --daemon
 1816 ?        Sl     0:36  \_ /usr/bin/FAHClient --child --lifeline 1814 /etc/fahclient/config.xml --run-as fahclient --pid-file=/var/run/fahclient.pid --daemon
 1853 ?        SNl    0:04      \_ /usr/bin/FAHCoreWrapper /var/lib/fahclient/cores/cores.foldingathome.org/v7/lin/64bit/Core_21.fah/FahCore_21 -dir 01 -suffix 01 -version 705 -lifeline 1816 -checkpoint 15 -gpu-vendor nvidia -opencl-platform 0 -opencl-device 1 -cuda-device 1 -gpu 1
 1891 ?        SNl    0:04      \_ /usr/bin/FAHCoreWrapper /var/lib/fahclient/cores/cores.foldingathome.org/v7/lin/64bit/Core_21.fah/FahCore_21 -dir 02 -suffix 01 -version 705 -lifeline 1816 -checkpoint 15 -gpu-vendor nvidia -opencl-platform 0 -opencl-device 0 -cuda-device 0 -gpu 0

Image of FAHControl, both WUs sitting at 99.99%:

Set logfile to highest verbosity, but there are no recorded errors.

StefanR5R · Dec 11, 2019

Weird! There are no FahCore_21 processes running.
I suppose you already tried to pause and resume the slots, right?
Next would be to gently "sudo killall FAHCoreWrapper", I'd say.

Edit,
wait. You did a "grep FAH". Remove that, or at least do "grep -i fah" for case-insensitive filtering.

Please look up whether or not the cores are still present, and if they are, in which state.

Edit 2, I'm away for ~6 h.
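To illustrate the case-sensitivity point: the core processes are named FahCore_21, so a plain "grep FAH" only matches them through the FAHCoreWrapper command line and misses them once the wrapper is gone. A minimal sketch, assuming any POSIX grep; the helper name `fah_procs` is invented here, and the `[f]` bracket is just a common trick so the grep command itself does not show up in the result.

```shell
#!/bin/sh
# Filter a process listing for Folding@home processes regardless of case.
# The bracket expression [f]ah matches "fah", "FAH", "Fah" (with -i), but
# not the literal text "[f]ah" in grep's own command line.
fah_procs() {
    grep -i '[f]ah'
}

# Typical use:
#   ps afx | fah_procs
```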

crashtech · Dec 11, 2019

Okay, here it is:

Code:

c7x99@c7x99-C7X99-OCE-F:~$ ps afx | grep -i fah
 2712 ?        Sl     4:00          |       \_ /usr/bin/python /usr/bin/FAHControl
 6555 pts/0    S+     0:00          \_ grep --color=auto -i fah
 1814 ?        Ssl    0:06 /usr/bin/FAHClient /etc/fahclient/config.xml --run-as fahclient --pid-file=/var/run/fahclient.pid --daemon
 1816 ?        Sl     0:40  \_ /usr/bin/FAHClient --child --lifeline 1814 /etc/fahclient/config.xml --run-as fahclient --pid-file=/var/run/fahclient.pid --daemon
 1861 ?        ZNl  291:44 [FahCore_21] <defunct>
 1898 ?        DN   102:04 [FahCore_21]

Oh, and here's a massively truncated version of the logfile after I try to pause Folding in preparation for a reboot:

Code:

22:50:24:WARNING:FS01:Killing WU01
22:50:24:WARNING:FS00:Killing WU02
22:50:24:WARNING:FS01:Killing WU01
22:50:24:WARNING:FS00:Killing WU02
22:50:24:WARNING:FS01:Killing WU01
22:50:24:WARNING:FS00:Killing WU02
22:50:24:WARNING:FS01:Killing WU01
22:50:25:WARNING:FS00:Killing WU02
22:50:25:WARNING:FS01:Killing WU01
22:50:25:WARNING:FS00:Killing WU02
22:50:25:WARNING:FS01:Killing WU01
22:50:25:WARNING:FS00:Killing WU02
22:50:25:WARNING:FS01:Killing WU01
22:50:25:WARNING:FS00:Killing WU02
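Since the log repeats the same two messages over and over, one way to condense it is to strip the timestamps and count identical lines. This is only a sketch: `summarize_log` is a made-up helper name, and it assumes every line starts with an HH:MM:SS: prefix as in the excerpt above.

```shell
#!/bin/sh
# Strip the leading HH:MM:SS: timestamp from each log line, then count
# how often each remaining message occurs, most frequent first.
# summarize_log is a hypothetical helper, not part of FAHClient.
summarize_log() {
    sed 's/^[0-9][0-9]:[0-9][0-9]:[0-9][0-9]://' | sort | uniq -c | sort -rn
}

# Typical use (pass whatever log file you are inspecting):
#   summarize_log < log.txt
```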

Markfw · Dec 11, 2019

crashtech said:

Okay, here it is:

Code:

c7x99@c7x99-C7X99-OCE-F:~$ ps afx | grep -i fah
 2712 ?        Sl     4:00          |       \_ /usr/bin/python /usr/bin/FAHControl
 6555 pts/0    S+     0:00          \_ grep --color=auto -i fah
 1814 ?        Ssl    0:06 /usr/bin/FAHClient /etc/fahclient/config.xml --run-as fahclient --pid-file=/var/run/fahclient.pid --daemon
 1816 ?        Sl     0:40  \_ /usr/bin/FAHClient --child --lifeline 1814 /etc/fahclient/config.xml --run-as fahclient --pid-file=/var/run/fahclient.pid --daemon
 1861 ?        ZNl  291:44 [FahCore_21] <defunct>
 1898 ?        DN   102:04 [FahCore_21]

Oh, and here's a massively truncated version of the logfile after I try to pause Folding in preparation for a reboot:

Code:

22:50:24:WARNING:FS01:Killing WU01
22:50:24:WARNING:FS00:Killing WU02
22:50:24:WARNING:FS01:Killing WU01
22:50:24:WARNING:FS00:Killing WU02
22:50:24:WARNING:FS01:Killing WU01
22:50:24:WARNING:FS00:Killing WU02
22:50:24:WARNING:FS01:Killing WU01
22:50:25:WARNING:FS00:Killing WU02
22:50:25:WARNING:FS01:Killing WU01
22:50:25:WARNING:FS00:Killing WU02
22:50:25:WARNING:FS01:Killing WU01
22:50:25:WARNING:FS00:Killing WU02
22:50:25:WARNING:FS01:Killing WU01
22:50:25:WARNING:FS00:Killing WU02

Don't pause before reboot, just do it!

crashtech · Dec 11, 2019

The first time I tried to fix this, I uninstalled and reinstalled F@H, but that did not remove the data, which is normally okay but is not what was needed. Now I've managed to delete the slots and recreate them. One of the slots would not delete; another reboot was required. If this doesn't fix it, I'll have to uninstall F@H, nuke the data directory, and reinstall. At the moment, it seems okay, though.

biodoc · Dec 11, 2019

@crashtech , It's possible it's a driver issue. I've got the 390 driver running with my dual GTX 1080 Ti rig. When I change or reinstall the drivers on Mint, I purge the system of nvidia related files first.

sudo apt purge 'nvidia*'
sudo apt install nvidia-driver-390
reboot

crashtech · Dec 11, 2019

This rig has had 430 drivers all along. I am wondering if the problem was caused by me setting this rig up as single GPU, then throwing another 1070 in later. Also they are different brands, but I believe their performance characteristics are similar. The second GPU seemed to work okay, at first.

biodoc · Dec 11, 2019

crashtech said:
This rig has had 430 drivers all along. I am wondering if the problem was caused by me setting this rig up as single GPU, then throwing another 1070 in later. Also they are different brands, but I believe their performance characteristics are similar. The second GPU seemed to work okay, at first.

After adding a second gpu, you should run sudo nvidia-xconfig -a

crashtech · Dec 11, 2019

biodoc said:
After adding a second gpu, you should run sudo nvidia-xconfig -a

Oh, shite...
I will do that now.

crashtech · Dec 11, 2019

biodoc said:
After adding a second gpu, you should run sudo nvidia-xconfig -a

This is what I get:

Code:

WARNING: Unable to locate/open X configuration file.

Package xorg-server was not found in the pkg-config search path.
Perhaps you should add the directory containing `xorg-server.pc'
to the PKG_CONFIG_PATH environment variable
No package 'xorg-server' found

WARNING: Unable to use the nvidia-cfg library to query NVIDIA hardware.


ERROR: Unable to determine number of GPUs in system; cannot honor
       '--enable-all-gpus' option.

New X configuration file written to '/etc/X11/xorg.conf'

The good news is that it appears to be cured; my forced deletion of all slots might have fixed it for now.

crashtech · Dec 11, 2019

Ugh, no, it's still broken. One slot hangs at 99.99%, and when I try to pause to reboot, the log file gives the dreaded "WARNING:FS01:Killing WU01" forever. Perhaps I just need to re-install with both GPUs in place, though this seems pretty stupid coming from Win10, where such hardware swaps are a breeze.

StefanR5R · Dec 11, 2019

crashtech said:
1898 ? DN 102:04 [FahCore_21]

"D" = The process is blocked in a kernel call, and cannot be interrupted. It may be possible to find out where exactly it is blocked in the kernel, but I'd have to look this up. Can't do it right now.

biodoc said:
It's possible it's a driver issue.

It very likely is. This being a 10-series GPU system, removing the 400 driver and installing a 300 driver would be worthwhile IMO.

Another possibility is a hardware defect, causing the driver to hang. But a driver bug is more likely IMO than a hardware defect.

crashtech said:
I am wondering if the problem was caused by me setting this rig up as single GPU, then throwing another 1070 in later. Also they are different brands,

No, this is not a problem at all. In my experience,¹ once you have one 10-series GPU working properly, you can just add in more, including others (1080Ti, 1080, 1070...), or remove one again, without any changes to the system config, e.g. xorg.

¹) I have a simple software setup though, e.g. without coolbits. My fans are set to a fixed speed in hardware.

biodoc · Dec 12, 2019

Maybe it's time to start with a clean slate. Uninstall and reinstall fahclient and nvidia drivers.

sudo dpkg -P fahclient
sudo apt purge 'nvidia*'
reboot
sudo apt install nvidia-driver-390 (I have driver 430 installed on my dual 1080 rig and it's fine)
reboot
reinstall fahclient, etc.

crashtech · Dec 12, 2019

@biodoc, I'll give that a try, thanks.

Edit: Right off the bat, I get this:
Removing system user fahclient...failed
Because one of the damn clients is always hung. But I'll get it.

biodoc · Dec 12, 2019

crashtech said:
Removing system user fahclient...failed

Disable autostart on boot. I hate these things that start automatically on boot.

sudo systemctl disable FAHClient.service

reboot, then uninstall FAH

crashtech · Dec 12, 2019

Pausing and rebooting worked. I don't mind autostart, really. One less thing to do on the 10 machines I manage.

Anyway, it's chugging away now, let's see if I can get back to where I was before:

StefanR5R · Dec 12, 2019

Did the problem start right away when you added the 2nd GPU,
or did it start only several days later after you added it?

Did the start of the problem coincide with a kernel update or driver update?

StefanR5R said:
"D" = The process is blocked in a kernel call, and cannot be interrupted. It may be possible to find out where exactly it is blocked in the kernel, but I'd have to look this up.

cat /proc/12345/wchan shows the kernel function in which a process is blocked, if any.
Replace "12345" by the process ID.

The key combination [Alt][SysRq][t] dumps kernel traces of all current processes into the kernel log. This is a huge amount of data, but more informative than wchan. One way to look at the kernel log is via the "dmesg" command. If you are not at a local console, you can induce the same effect as the key combo by echo t | sudo tee /proc/sysrq-trigger. More on SysRq can be found e.g. at wikipedia.
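The wchan lookup described above can be wrapped in a small helper so it is easy to run against every D-state process at once. This is a sketch under assumptions: `wchan_of` is an invented name, and its optional second argument (an alternative proc root) exists only so the function can be tried without a real stuck process.

```shell
#!/bin/sh
# Print the kernel wait channel for one PID as "PID: name".
# The second argument is a hypothetical convenience for testing;
# it defaults to the real /proc.
wchan_of() {
    pid=$1
    root=${2:-/proc}
    file="$root/$pid/wchan"
    if [ -r "$file" ]; then
        printf '%s: %s\n' "$pid" "$(cat "$file")"
    else
        printf '%s: wchan not readable\n' "$pid"
        return 1
    fi
}

# On a live system, check every process currently in state D:
#   for p in $(ps -eo pid=,stat= | awk '$2 ~ /^D/ { print $1 }'); do
#       wchan_of "$p"
#   done
```

A result of 0 generally means the task is not waiting in a kernel function at that moment, so it is worth running this while the process is actually showing state D.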

crashtech · Dec 12, 2019

StefanR5R said:
Did the problem start right away when you added the 2nd GPU,
or did it start only several days later after you added it?

Did the start of the problem coincide with a kernel update or driver update?

cat /proc/12345/wchan shows the kernel function in which a process is blocked, if any.
Replace "12345" by the process ID.

The key combination [Alt][SysRq][t] dumps kernel traces of all current processes into the kernel log. This is a huge amount of data, but more informative than wchan. One way to look at the kernel log is via the "dmesg" command. If you are not at a local console, you can induce the same effect as the key combo by echo t | sudo tee /proc/sysrq-trigger. More on SysRq can be found e.g. at wikipedia.

It was well after the 2nd GPU was added, and I can't remember exactly if I did anything. I didn't think so, but something changed. The only thing I remember doing was adding "next-unit-percentage" to the advanced settings with a value of 100.

So far it is working with the driver downgrade to 390 and F@H reinstall. Let's see what happens.

crashtech · Dec 12, 2019

@StefanR5R , It's still broken. "cat /proc/12345/wchan " appears to return "0."

Deleted link, possible security risk...

biodoc · Dec 12, 2019

From the google searches I've done, there's no known cause but some speculate that it's a driver crash issue. Nearly all the posts are several years old. One person said remote desktop apps can cause driver crashes on the remote computer.

crashtech · Dec 12, 2019

Tomorrow I am probably going to nuke it and pave it. The only reason I didn't start yet is to get that kernel dump in case it had interesting info for Stefan.
