PrimeGrid: CPU benchmarks

Discussion in 'Distributed Computing' started by StefanR5R, Jan 5, 2017.

  1. Kiska

    Kiska Senior member

    Joined:
    Apr 4, 2012
    Messages:
    556
    Likes Received:
    100
    RUN! An uncaffeinated @StefanR5R has appeared!
     
  2. TennesseeTony

    TennesseeTony Elite Member

    Joined:
    Aug 2, 2003
    Messages:
    2,788
    Likes Received:
    1,114
    And so has a sleepy Australian! :D
     
    Kiska likes this.
  3. Kiska

    Kiska Senior member

    Joined:
    Apr 4, 2012
    Messages:
    556
    Likes Received:
    100
    Sleepy? WHO TOLD YOU THAT?!
     
  4. Ken g6

    Ken g6 Programming Moderator, Elite Member
    Moderator

    Joined:
    Dec 11, 1999
    Messages:
    13,720
    Likes Received:
    890
    There's one option I didn't see you consider: HT on and using multiple cores. I hear that 4 core processors, using all cores on some large WUs, do better with HT on than off. I'm not sure about processors with more cores.
     
  5. StefanR5R

    StefanR5R Golden Member

    Joined:
    Dec 10, 2016
    Messages:
    1,529
    Likes Received:
    1,174
    I actually ran my Windows tests with a single 6-threaded task at a time on a 6-core CPU, configured as 6C/6T versus as 6C/12T. If the computationally intensive threads are placed on distinct physical cores (and that's what the Windows 7 kernel did in fact do, at least with the "performance" energy scheme), then 6C/12T should be superior to 6C/6T because the former has threads left over for all the bloat of a typical desktop PC.

    I am now running the respective comparison on Linux with 7x4-threaded tasks on 2x14C/28T versus 2x14C/14T.

    With IIRC 2x 1.5 days for the Windows test, and less than 2x 1.5 days for the Linux test, I suspect the performance difference will be smaller than the error of measurement caused by variability between WUs.

    One thing which I did not test was multithreaded tasks occupying more CPU threads than there are physical cores. So far the only test with more CPU threads occupied than there are physical cores was a test with as many singlethreaded tasks as there are CPU threads.

    I should be putting together some tables with actual results now...
     
    crashtech likes this.
  6. StefanR5R

    StefanR5R Golden Member

    Joined:
    Dec 10, 2016
    Messages:
    1,529
    Likes Received:
    1,174
    GCW-LLR, "Generalized Cullen/Woodall (LLR) v8.00" application
    (name for app_config.xml: llrGCW)

    Tested on Broadwell-EP (Xeon E5-2690v4) @ 2.9 GHz (AVX all-core turbo), dual processor board, Linux, unless noted otherwise.


    Q: Should Hyperthreading be used?

    A: No. You take a small performance hit with Hyperthreading. Either switch Hyperthreading off in the BIOS, or leave it on but use only half of the logical CPUs (e.g., set "Use at most 50 % of the CPUs" in BOINC's computing preferences).
    Edit, September 9:
    This finding is true with singlethreaded tasks. See later posts for discussion of multithreaded tasks and Hyperthreading.

    Code:
    hyperthreading        on           off
    threads/task          1            1
    simultaneous tasks    56           28
    ----------------------------------------------
    avg task run time     41h25m       20h15m
    PPD                   118,000      126,000
    

    Q: What's preferable, HT off & 100 % CPU usage, or HT on & 50 % CPU usage?

    A: Impossible for me to say so far. The differences which you see in the next two tables are below the error of measurement caused by variability between WUs. It seems HT on & 50 % CPU usage is useful on hosts with a low core count and some background load, e.g. due to a desktop OS. But make sure that the operating system treats hyperthreaded cores correctly, i.e. spreads the PrimeGrid load across all physical cores.

    Here are results from a 6-core Ivy Bridge-E (i7-4960X) @ 4.2 GHz, Windows 7 Pro SP1, energy options set to "high performance power plan":
    Code:
    hyperthreading        off          on
    threads/task          6            6
    simultaneous tasks    1            1
    ----------------------------------------------
    avg task run time     1h53m        1h50m
    PPD                   45,900       46,100
    
    Next are results with the dual-socket Broadwell-EP and Linux which I also used in the other tests. The CPU frequency governor was, counter-intuitively, set to "powersave" rather than "performance"; but with the intel_pstate driver on recent Linux kernels, "powersave" still scales up to full turbo frequencies under load, so this shouldn't make a difference with recent Intel processors at least.
    Code:
    hyperthreading        off          on
    threads/task          4            4
    simultaneous tasks    7            7
    ----------------------------------------------
    avg task run time     2h57m        3h03m
    PPD                   207,000      208,000
    
    All the remaining tables are from dual-socket Broadwell-EP and Linux as well.
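
    For reference: on Linux, the active frequency scaling driver and governor can be read from sysfs, e.g. for CPU 0 (assuming all cores are configured alike):
    Code:
    cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_driver    # e.g. intel_pstate
    cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_governor  # e.g. powersave
    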


    Q: To how many threads per task should I configure the LLR tasks?

    A: This certainly depends on CPU type and memory performance. But generally, a moderate number of threads per process should be best. With fewer processes (but correspondingly more threads per process), RAM usage, as well as pressure on the CPU's caches and TLBs, is probably reduced. Also, context switches between threads are faster than between processes; but this shouldn't matter much, because you set up BOINC such that only as many processes × threads are running as there are hardware threads or cores, apart from system threads and whatever other background load. On the other hand, too many threads per LLR process are not advisable, because performance scales sub-linearly with thread count.

    Besides the impact on throughput (points per day), the impact on task duration may be important to you as well: during a competition which lasts, say, 3 days, you would have a bad standing with tasks that require 2 days to run. Or, if you are keen on being the primary reporter of new primes instead of merely the validator, you want to complete your task earlier than your wingman does.
    Code:
    hyperthreading        off          off          off          off          off
    threads/task          1            2            4            7            14
    simultaneous tasks    28           14           7            4            2
    ----------------------------------------------------------------------------------
    avg task run time     20h15m       7h53m        2h57m        1h43m        1h02m
    PPD                   126,000      149,000      207,000      204,000      172,000
    
    In order to enable multithreading, place a file named "app_config.xml" with the following contents into C:\ProgramData\BOINC\projects\www.primegrid.com\ (on Linux, typically /var/lib/boinc/projects/www.primegrid.com/). Replace the two occurrences of "4" with the thread count which you desire.
    Code:
    <app_config>
        <app>
            <name>llrGCW</name>
            <fraction_done_exact/>
        </app>
        <app_version>
            <app_name>llrGCW</app_name>
            <cmdline>-t 4</cmdline>
            <avg_ncpus>4</avg_ncpus>
        </app_version>
    </app_config>
    
    To apply this configuration, either restart the BOINC client, or let it re-read all configuration files. The <avg_ncpus> element will take effect immediately. It is used by the client to determine how many tasks can be started until the user-configured CPU utilization is approached. The <cmdline> element takes effect as soon as a task is being started (or when resumed from suspension if "Leave non-GPU tasks in memory while suspended" is off, or when resumed after client restart). The <cmdline> element is the one which actually causes the LLR application to spawn multiple threads. The <fraction_done_exact> element merely tells the client to display the estimated time remaining until task completion on the basis of what the LLR application reports to the client.
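
    By the way, the re-read can also be triggered from the command line with BOINC's boinccmd tool; as far as I know, recent clients re-read the projects' app_config.xml files along with cc_config.xml on this command (if in doubt, just restart the client):
    Code:
    boinccmd --read_cc_config
    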
     
    #31 StefanR5R, Aug 20, 2017
    Last edited: Sep 8, 2017
    biodoc and TennesseeTony like this.
  7. crashtech

    crashtech Diamond Member

    Joined:
    Jan 4, 2013
    Messages:
    8,258
    Likes Received:
    949
    Any info about NUMA awareness? On my 2P 20C/40T box, configuring either 4 or 5 threads per task puts all tasks on one CPU instead of spreading them out amongst both CPUs.

    I may have to manually disable HT on the two 2P boxen if I can.

    EDIT: Attempting to turn HT off on the 2P box somehow broke Windows 10. Turning it back on revealed that Win 10 is STILL broken, and I don't have time to fix it. It can't even make it to the point where Safe Mode can be activated. So I'm going to be a lot less help now, because of an attempt to fix something that wasn't broken.
     
    #32 crashtech, Aug 20, 2017
    Last edited: Aug 20, 2017
  8. StefanR5R

    StefanR5R Golden Member

    Joined:
    Dec 10, 2016
    Messages:
    1,529
    Likes Received:
    1,174
    That's weird. I can't answer this as I am not familiar with the Windows scheduler policies on dual socket machines.

    It's actually something where the kernel's scheduler needs some advice from the application or from the admin: A multithreaded process with lots of inter-thread communication should keep the communicating threads on the same socket. But a multithreaded process with high RAM bandwidth requirement should spread its threads across sockets. Without knowing much about PrimeGrid's LLR applications, my guess is they are of the latter type. But if several multithreaded processes are being run which all equally require RAM bandwidth, then it is optimal again to keep threads of the same process on the same socket (but put processes on both sockets of course).

    But why would Windows, when faced with several CPU intensive processes, saturate one socket first and only then begin to use the other socket? I can't think of any good application for such a policy.

    Perhaps you can fix this with Process Lasso.
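
    On Linux, the rough equivalent would be numactl. That only works for processes you launch yourself, not for tasks started by boinc-client, but as a sketch, two hand-started multithreaded LLR processes could be pinned to one socket each like this (llr.exe and its arguments standing in for whatever the wrapper normally runs):
    Code:
    numactl --cpunodebind=0 --membind=0 ./llr.exe ... &   # threads and memory on socket 0
    numactl --cpunodebind=1 --membind=1 ./llr.exe ... &   # threads and memory on socket 1
    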
     
  9. crashtech

    crashtech Diamond Member

    Joined:
    Jan 4, 2013
    Messages:
    8,258
    Likes Received:
    949
    As it stands, it looks like I might have to repair/reload the OS to get it working at all, and due to the confluence of several unusual events, I am working today and won't be able to fix it in time. Perhaps this evening I will get it running and experiment with Process Lasso.

    Odd and frustrating to me that simply disabling HT would cause boot failure on this one machine. I've done that bunches of times on other PCs, and never had this happen. Murphy's Law stalks me continually.

    Edit: The other 2P box (2 Westmere hexacores) is NOT acting weird, it's distributing the threads evenly. So it's something about the 20C Ivy box that is causing a problem, perhaps an arcane BIOS setting.

    EDIT2: The Westmere box IS doing the same thing, but only when multithreading is used. Here is a screencap showing the problem, the CPUs on top are Node 0:

    [screenshot: per-core CPU utilization, with all load concentrated on Node 0]

    What we ought to see is every other core being used instead of all threads on one node.

    Even with multithreading off, something doesn't look right with core utilization:

    [screenshot: per-core CPU utilization with multithreading off]

    @StefanR5R , if this is cluttering up your benchmark thread, I can put this info somewhere else. I thought it might be tangentially relevant to the matter of getting the most out of our CPUs, though 2P boxes are the exception.
     
    #34 crashtech, Aug 20, 2017
    Last edited: Aug 20, 2017
  10. StefanR5R

    StefanR5R Golden Member

    Joined:
    Dec 10, 2016
    Messages:
    1,529
    Likes Received:
    1,174
    bill1024 says he tested this on a 6C/12T Ivy Bridge-E, with positive results.
    https://forums.evga.com/FindPost/2705100

    Maybe I will take measurements sometime later when there is no conflicting competition. Also depends on whether or not this will help me colour-coordinate my PrimeGrid badges.

    @crashtech, maybe HT on / 100 % of CPUs allowed to be used / two 12-threaded tasks at a time would fit your 2P Westmere-EP then.

    BTW, if you try to use your primegrid.com results table to check the throughput, keep in mind that "Run time" in these tables may be wrong. Perhaps it gets reported properly if your client only downloads one WU at a time, not several; but no guarantees.
     
    crashtech likes this.
  11. StefanR5R

    StefanR5R Golden Member

    Joined:
    Dec 10, 2016
    Messages:
    1,529
    Likes Received:
    1,174
    Some remarks about performance measurements with multithreaded LLR:

    Run time, CPU time, and credit of the tasks, as well as the ratios between them, are somewhat variable. Not hugely, but this variability makes it necessary to complete and validate many tasks in order to get reasonably precise PPD estimates.

    In order to get the average PPD of a host, do not take an average of the PPD of several tasks. Rather, take the sum of the run time of the tasks, the sum of the credits of the tasks, and then compute PPD from them. (The latter is a weighted average with run time being the weight. This is necessary because your host spends more time with longer tasks than with shorter tasks...)
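
    That is, PPD = 86,400 s/day × (sum of credits) / (sum of run times in seconds). A little sketch with awk, assuming a hypothetical file tasks.txt with one validated task per line, run time in seconds in column 1 and credit in column 2:
    Code:
    awk '{ rt += $1; cr += $2 } END { printf "%.0f PPD\n", 86400 * cr / rt }' tasks.txt
    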

    The above is true for single-threaded as well as multi-threaded tasks. But here is a catch: Run time of multithreaded tasks is often misreported at PrimeGrid's web interface if you don't take some precautions. This bug is known to the PrimeGrid developers and was attributed to boinc-client, not to the LLR applications. The precautions:
    • Always download at most 1 task at a time. So far, this seems sufficient to avoid whatever bug causes the wrong reporting of run times.
    • If you accidentally downloaded several tasks in one go while doing performance monitoring, cancel these tasks, and proceed to download one task after another.
    How to make sure that only 1 task is being downloaded?
    • Set "Store at least 0 days of work" and "Store up to an additional 0 days of work".
    • Begin with "Use at most 1 % of the CPUs". This will cause the client to use only 1 logical CPU, tell that to the server, and receive only 1 task.
    • Then increase "Use at most _ % of the CPUs" step by step, such that the client determines in each step that it has exactly 1 idle logical CPU to ask a new task for.
    These points remain true after you have configured app_config.xml for multithreaded tasks. Say you configured 4 threads per task, and you then increase "Use at most _ % of the CPUs" in one step such that the client figures there are now 4 idle logical CPUs: when the client requests new work, the server will send 4 tasks, not 1 as desired.

    (If you add 1 idle CPU, then download e.g. a 4-threaded task, that task won't be started of course. Either add 3 more CPUs to get the task started, or add 4 more CPUs to start the task and download another one.)
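
    As a worked example of the percent stepping on this dual 14-core host with 56 logical CPUs: one logical CPU corresponds to 100/56 ≈ 1.8 %, and the client appears to round the resulting CPU count down (with a minimum of 1, which is why 1 % already yields 1 CPU).
    Code:
    "Use at most 2 % of the CPUs"  ->  floor(56 * 0.02) = 1 logical CPU
    "Use at most 4 % of the CPUs"  ->  floor(56 * 0.04) = 2 logical CPUs
    "Use at most 6 % of the CPUs"  ->  floor(56 * 0.06) = 3 logical CPUs
    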
     
    crashtech likes this.
  12. crashtech

    crashtech Diamond Member

    Joined:
    Jan 4, 2013
    Messages:
    8,258
    Likes Received:
    949
    @StefanR5R , thanks for the in-depth analysis. I am at once fascinated and discouraged; it seems unlikely that I'll have time to do what is required to get accurate numbers.
     
  13. StefanR5R

    StefanR5R Golden Member

    Joined:
    Dec 10, 2016
    Messages:
    1,529
    Likes Received:
    1,174
    Such tests take a lot of time between races, because validations take so long. It's much better during a race.

    It also helps a lot if different configurations can be tested on two or more hosts with same hardware in parallel.

    Oh, I forgot: the hand-holding which I described for downloading only 1 task at a time is required only initially, not anymore once the host is loaded and running. In my experience, the host is then unlikely to download more than one task at a time.

    To be sure, when I copy & paste results of validated tasks from the web page into a spreadsheet, I add a column in the spreadsheet which contains CPU time divided by run time of each task. This ratio must be slightly less than the number of threads per task. If I discover any outliers in this column, I remove the respective task from evaluation. It would most likely be one for which a wrong run time was reported.
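
    This check can be scripted as well. A little sketch with awk, assuming a hypothetical file times.txt with one task per line, run time in seconds in column 1 and CPU time in seconds in column 2, for 4-threaded tasks (the ratio should be slightly below 4):
    Code:
    awk -v t=4 '{ r = $2 / $1; if (r > t || r < 0.8 * t) print "suspect run time in line", NR, "ratio", r }' times.txt
    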
     
    crashtech likes this.
  14. StefanR5R

    StefanR5R Golden Member

    Joined:
    Dec 10, 2016
    Messages:
    1,529
    Likes Received:
    1,174
    321-LLR, "321 (LLR) 8.00" application
    (name for app_config.xml: llr321)

    I checked dual-processor Linux hosts with two different configurations per processor type.
    Config 1:
    4 threads/task, use 50 % of hardware threads
    (Hyperthreading is on but not used)
    Config 2:
    4 tasks/socket, use 100 % of hardware threads
    (Hyperthreading is on and used)
    The latter config gives better throughput in both cases.

    dual E5-2690v4 (2x 14C/28T = 56T) @ 2.9 GHz,
    7 tasks at a time x 4 threads/task:
    184,000 PPD
    35 minutes mean time between task completions (4h06m average task duration)
    8 tasks at a time x 7 threads/task:
    208,000 PPD (+13 %)
    31 minutes mean time between task completions (4h08m average task duration)

    dual E5-2696v4 (2x 22C/44T = 88T) @ 2.6 GHz,
    11 tasks at a time x 4 threads/task:
    262,000 PPD
    25 minutes mean time between task completions (4h32m average task duration)
    8 tasks at a time x 11 threads/task:
    270,000 PPD (+3 %)
    24 minutes mean time between task completions (3h12m average task duration)

    ------------

    Furthermore, I ran a few configurations on a single-processor host, socket 1150, Xeon E3-1245v3 (Haswell 4C/8T) @ 3.4 GHz, Linux. In each of these tests, I always ran the very same WU, giving an error of measurement below 0.2 %.

    1 process with 8 threads:
    12,460 s (3.46 h) run time, 89,390 s (24.8 h) CPU time, 33,800 PPD
    1 process with 7 threads:
    12,520 s (3.48 h) run time, 81,320 s (22.6 h) CPU time, 33,600 PPD
    1 process with 6 threads:
    12,790 s (3.55 h) run time, 72,560 s (20.2 h) CPU time, 32,900 PPD

    2 processes with 4 threads each:
    33,570 s (9.33 h) run time, 248,950 s (69.2 h) CPU time, 25,000 PPD
    4 processes with 2 threads each:
    about 22 h run time, about 21,000 PPD
    (extrapolated after 15% completion)
    2 processes with 2 threads each:
    about 9.4 h run time, about 24,800 PPD
    (extrapolated after 50% completion)

    Conclusion:
    On the 4C/8T Haswell with dual-channel memory, it is best to run only one task at a time, leave Hyperthreading enabled, and give the task as many processor threads as can be spared.
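
    Expressed as an app_config.xml in the format shown earlier in this thread, that could look as follows. This is only a sketch; it uses BOINC's <max_concurrent> element to limit the subproject to one task at a time, and gives that task all 8 hardware threads (use -t 7 if a thread is to be spared for other work):
    Code:
    <app_config>
        <app>
            <name>llr321</name>
            <fraction_done_exact/>
            <max_concurrent>1</max_concurrent>
        </app>
        <app_version>
            <app_name>llr321</app_name>
            <cmdline>-t 8</cmdline>
            <avg_ncpus>8</avg_ncpus>
        </app_version>
    </app_config>
    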

    ------------

    Edit September 10, 2017:
    fixed typo in tasks/socket, clarified use of Hyperthreading, increased accuracy of average task duration
    Edit March 31, 2018:
    added Xeon E3 results
    Edits April 2-5, 2018:
    Xeon E3 ran at 3.4 GHz, typo in s->h conversion
     
    #39 StefanR5R, Sep 8, 2017
    Last edited: Apr 5, 2018
    zzuupp, crashtech and Ken g6 like this.
  15. ao_ika_red

    ao_ika_red Senior member

    Joined:
    Aug 11, 2016
    Messages:
    928
    Likes Received:
    332
    This should be useful for new year's SoB-LLR challenge.
     
    Ken g6 likes this.
  16. StefanR5R

    StefanR5R Golden Member

    Joined:
    Dec 10, 2016
    Messages:
    1,529
    Likes Received:
    1,174
    Note, SoB-LLR is now on version 8.00. (Post #2 was from before the update.) I guess it now scales similarly to the other LLR applications with long-running tasks. Still, I wonder how much of the memory bandwidth dependence that I saw in #2 remains.
     
  17. petrusbroder

    petrusbroder Elite Member

    Joined:
    Nov 28, 2004
    Messages:
    13,133
    Likes Received:
    762
    StefanR5R, I really like your analysis and empirical data. Although I have not run PrimeGrid for quite some time, this makes it interesting again!
     
  18. biodoc

    biodoc Diamond Member

    Joined:
    Dec 29, 2005
    Messages:
    4,819
    Likes Received:
    134
    Excellent experimental designs and analyses StefanR5R. :sunglasses:
     
  19. StefanR5R

    StefanR5R Golden Member

    Joined:
    Dec 10, 2016
    Messages:
    1,529
    Likes Received:
    1,174
    Testing the LLR application outside of Boinc

    Almost all of the results which I reported in this thread came from running many random WUs (a sufficient number for each configuration to be tested), adding their run times and credits, and thus arriving at PPD of a given configuration. Benefits of this method are that you get credit for all CPU hours expended, might even find primes, and may do this testing even during a PrimeGrid challenge. The downsides are imprecision due to variability between WUs, and lack of repeatability.

    Alternatively, the LLR application can be run stand-alone without boinc and PrimeGrid's llr_wrapper. That way, you can feed the very same WU to LLR over and over, thus eliminate variability between WUs, and can directly compare results from a single run on each configuration to be tested. (In turn, the downside of this method is that your CPUs don't earn any boinc credit on the side while testing.)

    What's needed:

    I didn't go ask how to run LLR stand-alone; I merely watched how boinc-client + llr_wrapper run it and replicated that. Hence my recipe may contain some bits of cargo cult. File names and scripts are shown for Linux here, but the method can be replicated on Windows as well.
    1. Create working directory for testing. Copy the following files into this directory, and create a test script within the directory as shown below.
    2. Use boinc-client to run a single WU of the PrimeGrid LLR-based subproject that you want to test for. Watch which program binary it runs, and which input file is fed into the LLR application.
    3. Optionally: later, when PrimeGrid has validated the WU, make a note of the credit that was given for it.
    These files need to be copied out of the boinc data directory:
    • The executable file of the LLR program. Example name of file to copy:
      /var/lib/boinc/projects/www.primegrid.com/sllr64.3.8.20
    • The template for LLR's ini file. Example:
      /var/lib/boinc/projects/www.primegrid.com/llr.ini.6.07
    • The input file of the WU to test. Example:
      /var/lib/boinc/projects/www.primegrid.com/llr321_299635024
    The input file llr321_299635024 was the one which I used to obtain the results with Xeon E3 which I reported in post #39, and the content of this file was:
    Code:
    600000000000000:P:1:2:257
    3 13826132
    
    This WU was credited with 4,865.57 cobblestones.

    After having copied these files, I made a shell script which lets me run the LLR program with different thread counts, as one or several instances of the program in parallel, and which records the run times and more in a log file:
    Code:
    #!/bin/bash
    
    # edit these for the right file names
    # (files copied from boinc's projects/www.primegrid.com/ subdirectory)
    LLREXE="sllr64.3.8.20"
    LLRINI="llr.ini.6.07"
    LLRINPUT="llr321_299635024"    # 4,865.57 cobblestones
    
    
    TIMEFORMAT=$'\nreal\t%1lR\t(%0R s)\nuser\t%1lU\t(%0U s)\nsys\t%1lS\t(%0S s)'
    
    # run_one - run a single LLR process, and show timing information when finished
    #
    #     argument 1: slot number, i.e. unique name of the instance
    #     argument 2: thread count of the process
    #
    run_one () {
        SLOT="slot_$1"
        rm -rf ${SLOT}
        for ((;;))
        do
            mkdir ${SLOT}                    || break
            ln    ${LLREXE} ${SLOT}/llr.exe  || break
            cp -p ${LLRINI} ${SLOT}/llr.ini  || break
            cd ${SLOT}                       || break
    
            echo "---- slot $1 ----" > stdout
            time ./llr.exe -d -oDiskWriteTime=10 -oThreadsPerTest=$2 ../${LLRINPUT} >> stdout 2> stderr
            cat stdout stderr
            cd ..
            break
        done
        rm -rf ${SLOT}
    }
    
    
    # run_series - run one or more LLR processes in parallel, and log everything
    #
    #     argument 1: number of processes to run at once
    #     argument 2: thread count of each process
    #
    # stdout and stderr are appended into the file protocol.txt
    #
    run_series () {
        {
            echo "======== $(date) ======== starting $1 process(es) with $2 thread(s) ========"
            time {
                for (( s=1; s<=$1; s++ ))
                do
                    run_one $s $2 &
                done
                wait
            }
            echo "======== $(date) ======= done with $1 process(es) with $2 thread(s) ========"
            echo
        } 2>&1 | tee -a protocol.txt
    }
    
    
    # edit these for the desired test programme
    run_series 1 8
    run_series 1 7
    run_series 1 6
    run_series 2 4
    run_series 4 2
    run_series 2 2
    

    This script is to be placed and run within the same directory as the copies of the exe, ini, and input file. After the top and tail of the script have been edited suitably and the script has been started, the waiting begins. To check on the progress of the running LLR process(es), the following simple script can be used:
    Code:
    #!/bin/bash
    
    SLOTS=$(echo slot_*)
    
    if [ "${SLOTS}" = 'slot_*' ]
    then
        echo "No processes found."
    else
        for s in ${SLOTS}
        do
            echo "---- ${s} ----"
            tail -1 ${s}/stdout
            echo
        done
    fi
    

    Note, the big wrapper script above (_run.sh) is a little bit flawed if LLR subprojects with very short-running WUs are to be tested with several processes running in parallel: The processes are started at virtually the same time, but they will finish after slightly different times. During the last part of such a test run, when some processes have already finished and only some are still running, the host is under less load than in steady state. Keep this in mind when you compute the average run time from the individual run times of the worker processes. This effect is negligible with longer running WUs though.

    Finally, PPD can of course be computed as
    credits/WU * number of processes in parallel * 24 h/day * 3600 s/h / average run time in seconds
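
    For example, plugging in the 1-process, 8-thread Xeon E3 run from post #39 (4,865.57 credits, 12,460 s run time):
    Code:
    4865.57 credits/WU * 1 process * 86400 s/day / 12460 s ≈ 33,700 PPD
    
    This agrees with the ~33,800 PPD reported there, up to rounding of the quoted inputs.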
     
    TennesseeTony likes this.