Getting the most PPD out of your hardware for F@H

Discussion in 'Distributed Computing' started by Markfw, Nov 13, 2016.

  1. Markfw

    Markfw CPU Moderator VC&G Moderator Elite Member
    Super Moderator

    Joined:
    May 16, 2002
    Messages:
    14,801
    Likes Received:
    1,450
    So, after several weeks tweaking my systems for the upcoming race, here is some advice.

    1) Unless you have no GPUs to fold with, CPU folding for F@H is currently almost not worth turning on. For example, the slowest video card I have in my "farm" gets 200,000 PPD on average. The BIGGEST CPU machine has 24 threads, gets 40,000 PPD on average, and uses far more electricity.

    2) Today's cards require very strong individual CPU cores. I believe fahcore_21.exe is single-threaded (don't hold me to that). SO.... turn off hyperthreading on boxes that support it, and again, for GPU boxes, I don't recommend even letting F@H create and run the CPU (or SMP, depending on your version) slot. Also, if you have a choice among multiple boxes, pick the one with the strongest single-core performance to host your video cards.

    3) Putting your video cards in older-technology boxes (PCI-e 1.0, DDR2 RAM, etc.) can hurt your PPD. Almost any socket 775 motherboard is a bad idea: better than nothing if you don't have anything newer, but it will hurt your GPU PPD.

    4) I am fairly confident that Windows 10 has a far superior scheduler and may help your PPD over Win 7. For me, I really saw a difference.
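To put point 1 in rough numbers, here is a minimal points-per-watt sketch; the wattage figures are illustrative assumptions on my part, not measurements from the farm:

```python
# Rough efficiency comparison for the PPD figures quoted above.
# Wattages are illustrative assumptions, not measured values.
gpu_ppd, gpu_watts = 200_000, 150   # slowest GPU in the farm, assumed ~150 W
cpu_ppd, cpu_watts = 40_000, 200    # 24-thread CPU box, assumed ~200 W

gpu_eff = gpu_ppd / gpu_watts       # PPD per watt for the GPU
cpu_eff = cpu_ppd / cpu_watts       # PPD per watt for the CPU box
print(f"GPU is {gpu_eff / cpu_eff:.1f}x more points-efficient")
# -> GPU is 6.7x more points-efficient
```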

    Below is my current farm. The descriptions tell you what CPU and what video card that line is for.
    [image: screenshot of the current farm list]

    If you have questions, please let me know, we really need to get the best out of our hardware for the upcoming race.
     
    Orange Kid likes this.

  3. TennesseeTony

    TennesseeTony Elite Member

    Joined:
    Aug 2, 2003
    Messages:
    2,155
    Likes Received:
    491
    Thanks Mark!

    I've a bit to add to paragraph number 2 above. The last sentence in that paragraph may confuse some: Mark is not suggesting running a single-core machine; he means that if you have a choice between, for example, a Celeron and an i3, use the faster CPU with your GPU.

    Additionally, you will want to allow one full (fast) core to be used with each GPU, and I personally suggest leaving one core for background processes, virus scans, etc. Therefore dual/triple GPU machines should have a quad core processor or better.
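A minimal sketch of the rule of thumb above (one full core per GPU plus one spare for background tasks); the function name is mine for illustration, not anything from the F@H client:

```python
def min_cores(num_gpus: int, spare: int = 1) -> int:
    """Recommended minimum physical cores: one per GPU, plus spares
    for background processes, virus scans, etc."""
    return num_gpus + spare

for gpus in (1, 2, 3):
    print(f"{gpus} GPU(s) -> at least {min_cores(gpus)} cores")
# A dual- or triple-GPU machine lands at 3-4 cores, i.e. a quad core or better.
```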

    Regarding the PCIe bus, I have personally found that maximum performance requires PCIe v2.0 @8x or better. Performance drops off sharply at 4x, and nearly dies at 1x.
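That PCIe floor lines up with the standard per-lane throughput figures (roughly 250 MB/s for v1.x and 500 MB/s for v2.0 after 8b/10b encoding overhead, ~985 MB/s for v3.0 with 128b/130b); a quick sketch:

```python
# Approximate usable bandwidth per PCIe lane, in GB/s, after
# encoding overhead (8b/10b for 1.x/2.0, 128b/130b for 3.0).
PER_LANE_GBPS = {"1.x": 0.25, "2.0": 0.50, "3.0": 0.985}

def link_bandwidth(gen: str, lanes: int) -> float:
    """Total one-direction bandwidth of a PCIe link in GB/s."""
    return PER_LANE_GBPS[gen] * lanes

print(link_bandwidth("2.0", 8))   # 4.0 GB/s -- the suggested floor
print(link_bandwidth("2.0", 4))   # 2.0 GB/s -- where PPD drops off sharply
print(link_bandwidth("2.0", 1))   # 0.5 GB/s -- where it "nearly dies"
```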

    This information from Mark and myself has been proven on GTX 1080, 1070, 980, and 980Ti cards. The faster your GPU, the more CPU it needs. Lesser cards may not benefit as substantially, so don't be discouraged if you have prior gen hardware. Get 'em built and running! :)
     
  4. Markfw

    Markfw CPU Moderator VC&G Moderator Elite Member
    Super Moderator

    Joined:
    May 16, 2002
    Messages:
    14,801
    Likes Received:
    1,450
    Tony, I have found that even one strong 4 GHz core may not be enough! Allow at least 2 strong cores per card. My dual-card boxes without hyperthreading are i7 quad cores with NO CPU client running, and they still go over 50% load at times.
     
  5. petrusbroder

    petrusbroder Elite Member

    Joined:
    Nov 28, 2004
    Messages:
    12,891
    Likes Received:
    248
    Thanks for the info! :)
     
  6. Orange Kid

    Orange Kid Elite Member

    Joined:
    Oct 9, 1999
    Messages:
    3,210
    Likes Received:
    264
    Thanks for the info :cool:
     
  7. crashtech

    crashtech Diamond Member

    Joined:
    Jan 4, 2013
    Messages:
    7,314
    Likes Received:
    545
    The thing about W10 is that it will restart the machine automatically from time to time. There doesn't seem to be an elegant fix at this time; I have disabled the Windows Update service on machines that will be competing. Of course, the service must be re-enabled and updates applied manually on occasion if this is done.
     
    TennesseeTony likes this.
  8. Markfw

    Markfw CPU Moderator VC&G Moderator Elite Member
    Super Moderator

    Joined:
    May 16, 2002
    Messages:
    14,801
    Likes Received:
    1,450
    OK, I just rebuilt my tri-1070 system and was having all sorts of issues with low PPD. So I paused the SMP (CPU) client, and an hour later, all fixed! Then I went to the only other box with an SMP client enabled: the CPU slot was doing 12k and the GPU 550k. I paused the CPU client, and the GPU's PPD went up to 600k within 10 minutes!

    I think you should all disable any SMP client on any box that has a video card for F@H.
     
  9. crashtech

    crashtech Diamond Member

    Joined:
    Jan 4, 2013
    Messages:
    7,314
    Likes Received:
    545
    I thought F@H was always SMP these days, unless there is a special client for multi-CPU systems?
     
  10. Markfw

    Markfw CPU Moderator VC&G Moderator Elite Member
    Super Moderator

    Joined:
    May 16, 2002
    Messages:
    14,801
    Likes Received:
    1,450
    It will handle anywhere from one to at least 32 cores in a single client. More than that is a config nightmare.

    BUT 24 cores gets 40k PPD, while just about the worst video card of the last 3 years gets 300k PPD, so there is no sense in running CPU/SMP (same thing).
     
  11. crashtech

    crashtech Diamond Member

    Joined:
    Jan 4, 2013
    Messages:
    7,314
    Likes Received:
    545
    I'm still doing my best to decipher what you are saying, sorry. Are multi-CPU systems required to run more than one client when the number of F@H threads will exceed 32? Can a dual CPU system run one client if it has two 8-core CPUs (32 threads)?
     
  12. Markfw

    Markfw CPU Moderator VC&G Moderator Elite Member
    Super Moderator

    Joined:
    May 16, 2002
    Messages:
    14,801
    Likes Received:
    1,450
    There is a thread here somewhere about systems with more than 32 threads; I think it's the one with 4P in the title, not sure.

    BUT for 32 threads or fewer in total (regardless of 1, 2, or 4 physical sockets), only one F@H task is required.
     
  13. crashtech

    crashtech Diamond Member

    Joined:
    Jan 4, 2013
    Messages:
    7,314
    Likes Received:
    545
    My plan was to put a high core count system with GPU(s) in service this year, so your result alarms me. What about configuring the CPU slot to leave one or two cores per GPU available?
     
  14. Markfw

    Markfw CPU Moderator VC&G Moderator Elite Member
    Super Moderator

    Joined:
    May 16, 2002
    Messages:
    14,801
    Likes Received:
    1,450
    This talks about 72 threads

    https://forums.anandtech.com/threads/help-with-f-h-client.2492328/

    As for configuring the CPU slot, I USED to say leave 2 threads or cores per card. But I now think the problem is that memory access is being maxed out when GPU + CPU slots run together. When I was getting 200k PPD per card (600k total), I had reserved 8 threads for the GPUs and had 16 threads on the CPU/SMP slot. When I removed that slot, the PPD went to 1.9 million, 600k or more per card.
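A quick sanity check of those totals, just arithmetic on the figures quoted above:

```python
cards = 3                      # the tri-1070 box
before_per_card = 200_000      # PPD per card with the CPU/SMP slot running
after_total = 1_900_000        # total PPD with the CPU/SMP slot removed

assert cards * before_per_card == 600_000   # matches the 600k total quoted
after_per_card = after_total / cards        # ~633k per card
print(f"per-card gain: {after_per_card / before_per_card:.1f}x")
# -> per-card gain: 3.2x
```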

    As I said, since the ppd is so low for CPU, if you have video cards, just leave that off.
     
  15. crashtech

    crashtech Diamond Member

    Joined:
    Jan 4, 2013
    Messages:
    7,314
    Likes Received:
    545
    I guess my concern is really about CPU-only BOINC projects; these are what I would have liked to run alongside F@H on one or more GPUs. If memory really is the problem, then the approach I am using now for F@H seems to be the most effective: multiples of a cheap dual-core CPU (well, Sandy or better) on a cheap motherboard with a single GPU.

    Given this new info, CPU-only projects might as well be fed with surplus server boards that may not even possess an x16 slot for a GPU.
     
  16. Markfw

    Markfw CPU Moderator VC&G Moderator Elite Member
    Super Moderator

    Joined:
    May 16, 2002
    Messages:
    14,801
    Likes Received:
    1,450
    Ah, I see your point. I have such an animal; not sure what to do with it.
     
  17. crashtech

    crashtech Diamond Member

    Joined:
    Jan 4, 2013
    Messages:
    7,314
    Likes Received:
    545
    Run CPU-only DC projects on it? Sell it to me? :)
     
  18. Markfw

    Markfw CPU Moderator VC&G Moderator Elite Member
    Super Moderator

    Joined:
    May 16, 2002
    Messages:
    14,801
    Likes Received:
    1,450
    Sell it, in a heartbeat: 2 x 5639 CPUs (24 threads, 2.13 GHz), a Supermicro board, 12 GB of RAM (if you need it), an ATX-sized motherboard (that's rare for 2 sockets), and 2 heatsinks if you need them. Where are you (roughly), and how much? PM me if interested.

    PM sent.
     
    #17 Markfw, Jan 14, 2017
    Last edited: Jan 14, 2017
  19. StefanR5R

    StefanR5R Senior member

    Joined:
    Dec 10, 2016
    Messages:
    800
    Likes Received:
    365
    I got the impression from a few short-term checks on one of my PCs that simply raising the priority of the graphics card feeder process (fahcore_21.exe, or whatever it is called) from "lowest" to "normal" gets it back up to fair PPD, even when there is 100% other CPU load at "lowest" scheduling priority (e.g. a F@H CPU slot or a BOINC CPU project). Perhaps below-normal (second-to-lowest) priority would be sufficient too. I haven't checked this systematically yet, i.e. longer-term and compared against removing all other CPU load entirely.

    This is with Windows 7 (I also have Linux, but only with CPU clients), and recently I looked only at the effect on a small AMD GPU whose feeder process actually takes very little CPU time. But if Windows 7 is tardy waking up those processes when it should, then I figure such scheduling latency would introduce stalls into the GPU workload.

    On a related note, I have the option "Configure" -> "Advanced" -> "Folding Core Priority" -> "Slightly higher. Use this if other distributed computing applications are stopping Folding@home from running" enabled but it does not improve things. Apparently it raises priority of the fahcorewrapper, but definitely not of the GPU feeder process. Well, of course the problem here is not that the GPU client is not running at all, only that its throughput is decreased dramatically in presence of CPU load.

    If my observation with manually raised priority was correct, then perhaps a feature request should be opened with Pande group for a third option which would raise the priority of GPU feeder processes. Or maybe it should not even be an option, but simply hardwired (schedule the GPU feeder at second-to-lowest priority, or even normal priority, instead of lowest priority).

    I haven't checked at the folding@home support site yet whether something like this was already discussed there.
     
  20. Markfw

    Markfw CPU Moderator VC&G Moderator Elite Member
    Super Moderator

    Joined:
    May 16, 2002
    Messages:
    14,801
    Likes Received:
    1,450
    I thought I had everything set to 100% or high or whatever, since these are all dedicated folding boxes. Since nothing was running except folding, AT ALL, I still think it's a memory bandwidth problem.
     
  21. StefanR5R

    StefanR5R Senior member

    Joined:
    Dec 10, 2016
    Messages:
    800
    Likes Received:
    365
    Regarding GPU jobs being bottlenecked by concurrent CPU jobs:

    I just learned how to use Windows PowerShell to set process priorities, and came up with the following script. It simply loops infinitely, looks every 60 seconds for processes named FahCore_1? or FahCore_2?, and switches their CPU scheduling priority to "AboveNormal".

    Save this as "whatever.ps1", then right-click on it and execute with PowerShell:
    Code:
    $ProcessNamePattern = "FahCore_[12]?"   # wildcard: matches FahCore_1x and FahCore_2x
    
    for (;;) {
        # Find any running folding cores; ignore the error if none are running
        Get-Process $ProcessNamePattern -ErrorAction SilentlyContinue | foreach {
            $_.PriorityClass = "AboveNormal"
        }
        Start-Sleep 60   # re-check every minute, since new WUs spawn new processes
    }
    
    Actually "AboveNormal" may be overkill; "Normal" priority may be sufficient, as long as the concurrent CPU jobs run at "Lower than normal" or "Low" priority.

    As this script will normally show only a blank black shell window, you can check that it really does something as follows: start Windows Task Manager, switch to the "Processes" tab, sort the table by "Image Name", enable "View / Select columns / Base priority", and confirm that FahCore_21 is running at "Low" priority; then start the script. Task Manager should then show FahCore_21 running at "Above normal" priority.

    From my short-term experience, this should help keep up F@H GPU PPD even if the CPU is heavily loaded with CPU jobs. Just now I am running a nasty load of memory-intensive BOINC PrimeGrid SoB-LLR jobs in the background. Raising FahCore_* priority seems to help from what I see as estimated PPD. I currently have hyperthreading off, i.e. have not looked how it works out with hyperthreading on.

    However, I have not yet attempted to prove or disprove the effect of this script by proper measurements. So maybe I am merely wishfully thinking that it helps.
     
    TennesseeTony likes this.
  22. StefanR5R

    StefanR5R Senior member

    Joined:
    Dec 10, 2016
    Messages:
    800
    Likes Received:
    365
    At least with bigger/newer Nvidia GPUs, FahCore_21.exe is obviously almost always loading one CPU hardware thread to its fullest. Therefore I am wondering whether single-core performance of the CPU can bottleneck GPU PPDs.

    To test that, I still plan to find out how to set up FAHClient to repeatedly process a given copy of the same Work Unit.

    But for now I turned to FAHBench to get somewhat closer to an answer. The current FAHBench v2 supposedly uses the very same code as FahCore_21. FAHBench v2 comes with three different built-in WUs, and it is also possible to add custom WUs to FAHBench which can be derived from "real" Folding@Home WUs. I don't know how well the three built-in WUs reflect typical Folding@Home WUs.

    During benchmarking, FAHBench also loads one CPU hardware thread fully. So, there is at least a distinct chance that any CPU single-thread performance bottleneck would also be showing up in FAHBench.

    Software used in the tests:
    FAHBench v2.2.5, OpenMM version 6.2-core21-0.0.17
    options: OpenCL, single precision, accuracy check enabled, NaN check disabled, 60 s run length
    Nvidia driver version 272.06
    Windows 7​

    Hardware:
    Core i7-6950X, HT off, EIST off
    reference GTX 1080
    factory-overclocked GTX 1070 (Gainward Phoenix GS, presumably 170 W TDP)
    both cards in 16-lane PCIe 3.0 slots​

    In all tests, the GPU performance cap was shown to be Voltage (not power or temperature), according to GPU-Z. Temperatures remained moderate, and cards ran at about 1.9 GHz (1080) and 2.0 GHz (1070).

    All FAHBench scores shown below in absolute numbers are averages from three consecutive runs. Variability between those triple runs was reasonably low. The percentages in the table are simply the score at the given CPU clock divided by the score at 4.0 GHz CPU clock.

    Work Unit: dhfr
    Code:
    CPU clock          4.0 GHz   3.5 GHz   3.0 GHz   2.5 GHz   2.0 GHz   1.5 GHz
    ----------------------------------------------------------------------------
    GTX 1080 scores     110       106       103        98        92        82
                      (100 %)    (97 %)    (94 %)    (89 %)    (84 %)    (75 %)
    ----------------------------------------------------------------------------
    GTX 1070 scores     106       103       100        95        89        80
                      (100 %)    (98 %)    (94 %)    (90 %)    (84 %)    (76 %)
    
    Work Unit: dhfr-implicit
    Code:
    CPU clock          4.0 GHz   3.5 GHz   3.0 GHz   2.5 GHz   2.0 GHz   1.5 GHz
    ----------------------------------------------------------------------------
    GTX 1080 scores     519       517       516       514       515       520
                      (100 %)   (100 %)    (99 %)    (99 %)    (99 %)   (100 %)
    ----------------------------------------------------------------------------
    GTX 1070 scores     477       476       475       473       476       480
                      (100 %)   (100 %)   (100 %)    (99 %)   (100 %)    (99 %)
    
    Work Unit: nav
    Code:
    CPU clock          4.0 GHz   3.5 GHz   3.0 GHz   2.5 GHz   2.0 GHz   1.5 GHz
    ----------------------------------------------------------------------------
    GTX 1080 scores     14.4      14.4      14.3      14.3      14.2      14.1
                      (100 %)   (100 %)   (100 %)    (99 %)    (99 %)    (98 %)
    ----------------------------------------------------------------------------
    GTX 1070 scores     14.7      14.6      14.6      14.6      14.5      14.4
                      (100 %)    (99 %)    (99 %)    (99 %)    (99 %)    (98 %)
    
    So, there is practically no drop-off in the dhfr-implicit and nav tests, while the dhfr test shows a ~5 % loss of performance when going from 4.0 to 3.0 GHz CPU clock, and ~10 % loss at 2.5 GHz. Not as pronounced as I suspected.
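The percentage rows can be reproduced from the absolute scores; as a check, here are the GTX 1080 dhfr numbers (rounding may differ from the table by a point):

```python
# GTX 1080 dhfr scores from the table above, keyed by CPU clock in GHz.
scores = {4.0: 110, 3.5: 106, 3.0: 103, 2.5: 98, 2.0: 92, 1.5: 82}
baseline = scores[4.0]

for clock, score in scores.items():
    print(f"{clock} GHz: {score / baseline:.0%} of the 4.0 GHz score")
```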

    It remains to be seen how this scales in FAHClient with typical WUs.
     
    crashtech and TennesseeTony like this.
  23. crashtech

    crashtech Diamond Member

    Joined:
    Jan 4, 2013
    Messages:
    7,314
    Likes Received:
    545
    Very interesting, I may have to add some 1060 results to this.

    I remoted in and tried to run it on one of my 1060s, but got this error:

    [image: screenshot of the FAHBench error message]

    Maybe because it is headless? But F@H runs fine...
     
    #22 crashtech, Mar 8, 2017
    Last edited: Mar 8, 2017
  24. StefanR5R

    StefanR5R Senior member

    Joined:
    Dec 10, 2016
    Messages:
    800
    Likes Received:
    365
    @crashtech, my first thought was that FAHClient needs to be paused or shut down before FAHBench can be started, but this is not true: They can in fact run together (although that does not make a lot of sense), at least on the 1080.

    I would be surprised if headless was a problem for FAHBench. During my tests, only one of my cards was connected to a display, and the other was not.

    Browsing a random web shop for 1060 cards confuses me: Some cards are advertised with OpenCL support, while OpenCL is not mentioned at other 1060 offerings.

    GPU-Z's first tab is showing ticks at all "Computing" features of my 1080 (OpenCL, CUDA, PhysX, Direct Compute 5.0). Can't check the 1070 right now. Of course FAHBench uses OpenCL, although it can be recompiled to use CUDA, FWIW. BTW, I had downloaded the FAHBench 2.2.5 installer; haven't tried the standalone version. Neither did I try FAHBench 1.2.

    nvidia-smi.exe lists the compute mode ("Compute M." field) of my 1080 as "Default". This means that several concurrent compute contexts are supported. In contrast, modes "Exclusive Process" or "Prohibited" would mean max. 1 or 0 compute contexts, respectively. (The nvidia-smi command line utility is located at C:\Program Files\NVIDIA Corporation\NVSMI\ here.)

    Anyhow, could this be a problem with your Nvidia driver version?
     
  25. crashtech

    crashtech Diamond Member

    Joined:
    Jan 4, 2013
    Messages:
    7,314
    Likes Received:
    545
    I'm not sure about the issue on the little workhorse that FAHBench was first tried on, but I got it to work on my work machine, an i3-4370 @ 3.8 GHz with a GTX 1060 (6GB) @ 1873 MHz. The scores are as follows (average of five runs):

    dhfr: 83
    dhfr-implicit: 349
    nav: 9.9

    These figures are of extremely limited usefulness, since I can't alter the clock speed on any Haswell board I have right now. If there is interest, I might be able to put a 1060 into a Skylake rig with an i5-6500. That rig has, at minimum, BCLK frequency control, though that might poison the data if not done carefully. It's a bit embarrassing not to have a "K" CPU handy; I have sold them all in preparation for other things.

    Honestly I wish I could more closely duplicate your setup for testing a 1060, since I believe it will be even less sensitive to CPU power than its bigger brothers.

    Edit: Clock speed varied during this test between 1847 and 1911 MHz depending on the test and its duration; it's a little funny having a card that likes to play by its own rules.
     
  26. Orange Kid

    Orange Kid Elite Member

    Joined:
    Oct 9, 1999
    Messages:
    3,210
    Likes Received:
    264
    Added a link to this in the Project List thread.