PrimeGrid: CPU benchmarks

Discussion in 'Distributed Computing' started by StefanR5R, Jan 5, 2017.

  1. Kiska

    Kiska Senior member

    Joined:
    Apr 4, 2012
    Messages:
    482
    Likes Received:
    73
    RUN! An uncaffeinated @StefanR5R has appeared!
     
  2. TennesseeTony

    TennesseeTony Elite Member

    Joined:
    Aug 2, 2003
    Messages:
    2,221
    Likes Received:
    539
    And so has a sleepy Australian! :D
     
    Kiska likes this.
  3. Kiska

    Kiska Senior member

    Joined:
    Apr 4, 2012
    Messages:
    482
    Likes Received:
    73
    Sleepy? WHO TOLD YOU THAT?!
     
  4. Ken g6

    Ken g6 Programming Moderator, Elite Member
    Moderator

    Joined:
    Dec 11, 1999
    Messages:
    13,081
    Likes Received:
    496
    There's one option I didn't see you consider: HT on and using multiple cores. I hear that 4-core processors, using all cores on some large WUs, do better with HT on than off. I'm not sure about processors with more cores.
     
  5. StefanR5R

    StefanR5R Senior member

    Joined:
    Dec 10, 2016
    Messages:
    923
    Likes Received:
    447
    I actually ran my Windows tests with a single 6-threaded task at a time on a 6-core CPU, configured as 6C/6T versus as 6C/12T. If the computationally intensive threads are placed on distinct physical cores (and that's what the Windows 7 kernel did in fact do, at least with the "performance" energy scheme), then 6C/12T should be superior to 6C/6T because the former has threads left over for all the bloat of a typical desktop PC.

    I am now running the respective comparison on Linux with 7x4-threaded tasks on 2x14C/28T versus 2x14C/14T.

    With IIRC 2x 1.5 days for the Windows test, and less than 2x 1.5 days for the Linux test, I suspect the performance difference will be smaller than the error of measurement caused by variability between WUs.

    One thing which I did not test was multithreaded tasks occupying more CPU threads than there are physical cores. So far, my only test that oversubscribed the physical cores ran as many singlethreaded tasks as there are CPU threads.

    I should be putting together some tables with actual results now...
     
    crashtech likes this.
  6. StefanR5R

    StefanR5R Senior member

    Joined:
    Dec 10, 2016
    Messages:
    923
    Likes Received:
    447
    GCW-LLR, "Generalized Cullen/Woodall (LLR) v8.00" application
    (name for app_config.xml: llrGCW)

    Tested on Broadwell-EP (Xeon E5-2690v4) @ 2.9 GHz (AVX all-core turbo), dual processor board, Linux, unless noted otherwise.


    Q: Should Hyperthreading be used?

    A: No. You take a small performance hit with Hyperthreading. Either switch Hyperthreading off in the BIOS, or leave it on but use only half of the logical CPUs.
    Edit, September 9:
    This finding is true with singlethreaded tasks. See later posts for discussion of multithreaded tasks and Hyperthreading.

    Code:
    hyperthreading        on           off
    threads/task          1            1
    simultaneous tasks    56           28
    ----------------------------------------------
    avg task run time     41h25m       20h15m
    PPD                   118,000      126,000
    

    Q: What's preferable, HT off & 100 % CPU usage, or HT on & 50 % CPU usage?

    A: Impossible for me to say so far. The differences which you see in the next two tables are below the error of measurement caused by variability between WUs. It seems HT on & 50 % CPU usage is useful on hosts with low core count and some background load, e.g. due to a desktop OS. But make sure that the operating system treats hyperthreaded cores correctly, i.e. spreads the PrimeGrid load across all physical cores.

    Here are results from a 6-core Ivy Bridge-E (i7-4960X) @ 4.2 GHz, Windows 7 Pro SP1, energy options set to "high performance power plan":
    Code:
    hyperthreading        off          on
    threads/task          6            6
    simultaneous tasks    1            1
    ----------------------------------------------
    avg task run time     1h53m        1h50m
    PPD                   45,900       46,100
    
    Next are results with the dual-socket Broadwell-EP and Linux which I also used in the other tests. The CPU frequency governor was counter-intuitively set to "powersave", not "performance", but at least with recent Intel processors and recent Linux kernels this shouldn't make a difference.
    Code:
    hyperthreading        off          on
    threads/task          4            4
    simultaneous tasks    7            7
    ----------------------------------------------
    avg task run time     2h57m        3h03m
    PPD                   207,000      208,000
    
    All the remaining tables are from dual-socket Broadwell-EP and Linux as well.


    Q: To how many threads per task should I configure the LLR tasks?

    A: This certainly depends on CPU type and memory performance. But generally, a moderate number of threads per process should be best. With fewer processes but respectively more threads per process, RAM usage as well as utilization of the CPU's caches and TLBs probably are reduced. Also, context switches between threads are faster than between processes; but this shouldn't matter much because you set up BOINC such that only as many processes × threads are running as there are hardware threads or cores, except for system threads and whatever other background load. On the other hand, too many threads per LLR process are not advisable because there is sub-linear performance scaling with thread count.

    Besides the impact on throughput (points per day), the impact on task duration may be important to you as well: during a competition which lasts, say, 3 days, you would have a bad standing with tasks that require 2 days to run. Or if you are keen on being the primary reporter of new primes instead of merely the validator, you want to complete your task earlier than the wingman.
    Code:
    hyperthreading        off          off          off          off          off
    threads/task          1            2            4            7            14
    simultaneous tasks    28           14           7            4            2
    ----------------------------------------------------------------------------------
    avg task run time     20h15m       7h53m        2h57m        1h43m        1h02m
    PPD                   126,000      149,000      207,000      204,000      172,000
    
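    The relation between the columns is just throughput arithmetic: PPD = simultaneous tasks × credit per task × 86400 / run time in seconds. A minimal sketch in Python (the 3,600-credit figure is invented for illustration, not a real GCW-LLR credit):

```python
def ppd(simultaneous_tasks, credit_per_task, run_time_s):
    """Points per day of a host running identical tasks back to back."""
    return simultaneous_tasks * credit_per_task * 86400 / run_time_s

# e.g. with a hypothetical 3,600-credit task:
# 7 tasks x 4 threads, 2h57m (10620 s) each -> about 205,000 PPD
daily_points = ppd(7, 3600, 10620)
```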
    In order to enable multithreading, place a file named "app_config.xml" with the following contents into C:\ProgramData\BOINC\projects\www.primegrid.com\. Replace the two occurrences of "4" with the thread count which you desire.
    Code:
    <app_config>
        <app>
            <name>llrGCW</name>
            <fraction_done_exact/>
        </app>
        <app_version>
            <app_name>llrGCW</app_name>
            <cmdline>-t 4</cmdline>
            <avg_ncpus>4</avg_ncpus>
        </app_version>
    </app_config>
    
    To apply this configuration, either restart the BOINC client, or let it re-read all configuration files. The <avg_ncpus> element will take effect immediately. It is used by the client to determine how many tasks can be started until the user-configured CPU utilization is approached. The <cmdline> element takes effect as soon as a task is being started (or when resumed from suspension if "Leave non-GPU tasks in memory while suspended" is off, or when resumed after client restart). The <cmdline> element is the one which actually causes the LLR application to spawn multiple threads. The <fraction_done_exact> element merely tells the client to display the estimated time remaining until task completion on the basis of what the LLR application reports to the client.
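    If you change the thread count often, the file can also be generated instead of hand-edited, which keeps the two occurrences of the number in sync. A minimal sketch in Python (the output path in the comment is the Windows default named above; the Linux location differs by distribution):

```python
# Sketch: emit PrimeGrid's app_config.xml for a chosen thread count.
TEMPLATE = """<app_config>
    <app>
        <name>llrGCW</name>
        <fraction_done_exact/>
    </app>
    <app_version>
        <app_name>llrGCW</app_name>
        <cmdline>-t {t}</cmdline>
        <avg_ncpus>{t}</avg_ncpus>
    </app_version>
</app_config>
"""

def make_app_config(threads):
    return TEMPLATE.format(t=threads)

# Example (Windows default path from above):
# with open(r"C:\ProgramData\BOINC\projects\www.primegrid.com\app_config.xml", "w") as f:
#     f.write(make_app_config(4))
```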
     
    #31 StefanR5R, Aug 20, 2017
    Last edited: Sep 8, 2017
    TennesseeTony likes this.
  7. crashtech

    crashtech Diamond Member

    Joined:
    Jan 4, 2013
    Messages:
    7,590
    Likes Received:
    666
    Any info about NUMA awareness? On my 2P 20C/40T box, configuring either 4 or 5 threads per task puts all tasks on one CPU instead of spreading them out amongst both CPUs.

    I may have to manually disable HT on the two 2P boxen if I can.

    EDIT: Attempting to turn HT off on the 2P box somehow broke Windows 10. Turning it back on revealed that Win 10 is STILL broken, and I don't have time to fix it. It can't even make it to the point where Safe Mode can be activated. So I'm going to be a lot less help now because of an attempt to fix something that wasn't broken.
     
    #32 crashtech, Aug 20, 2017
    Last edited: Aug 20, 2017
  8. StefanR5R

    StefanR5R Senior member

    Joined:
    Dec 10, 2016
    Messages:
    923
    Likes Received:
    447
    That's weird. I can't answer this as I am not familiar with the Windows scheduler policies on dual socket machines.

    It's actually something where the kernel's scheduler needs some advice from the application or from the admin: A multithreaded process with lots of inter-thread communication should keep the communicating threads on the same socket. But a multithreaded process with high RAM bandwidth requirement should spread its threads across sockets. Without knowing much about PrimeGrid's LLR applications, my guess is they are of the latter type. But if several multithreaded processes are being run which all equally require RAM bandwidth, then it is optimal again to keep threads of the same process on the same socket (but put processes on both sockets of course).

    But why would Windows, when faced with several CPU intensive processes, saturate one socket first and only then begin to use the other socket? I can't think of any good application for such a policy.

    Perhaps you can fix this with Process Lasso.
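    For what it's worth, on Linux the pinning which Process Lasso does on Windows can be scripted. A minimal sketch in Python (Linux-only `sched_setaffinity`; the CPU numbering is an assumption matching the dual 14-core Broadwell-EP above, verify yours with /proc/cpuinfo or lscpu):

```python
import os

# Hypothetical layout for the dual E5-2690v4: socket 0 owns logical
# CPUs 0-13 plus their HT siblings 28-41. Check /proc/cpuinfo for yours.
SOCKET0 = set(range(0, 14)) | set(range(28, 42))

def pin_to(cpus, pid=0):
    """Restrict a process (default: this one) to the given CPUs."""
    allowed = os.sched_getaffinity(pid) & cpus   # only CPUs we may use
    if allowed:
        os.sched_setaffinity(pid, allowed)
    return os.sched_getaffinity(pid)
```

    Pinning each multithreaded LLR process to one socket's set keeps its threads on one node, which is what one would hope the scheduler does by itself.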
     
  9. crashtech

    crashtech Diamond Member

    Joined:
    Jan 4, 2013
    Messages:
    7,590
    Likes Received:
    666
    As it stands, it looks like I might have to repair/reload the OS to get it working at all, and due to the confluence of several unusual events, I am working today and won't be able to fix it in time. Perhaps this evening I will get it running and experiment with Process Lasso.

    Odd and frustrating to me that simply disabling HT would cause boot failure on this one machine. I've done that bunches of times on other PCs, and never had this happen. Murphy's Law stalks me continually.

    Edit: The other 2P box (2 Westmere hexacores) is NOT acting weird, it's distributing the threads evenly. So it's something about the 20C Ivy box that is causing a problem, perhaps an arcane BIOS setting.

    EDIT2: The Westmere box IS doing the same thing, but only when multithreading is used. Here is a screencap showing the problem, the CPUs on top are Node 0:

    [screenshot: Task Manager, all multithreaded tasks on Node 0]

    What we ought to see is every other core being used instead of all threads on one node.

    Even with multithreading off, something doesn't look right with core utilization:

    [screenshot: uneven core utilization with multithreading off]

    @StefanR5R , if this is a cluttering of your benchmark thread, I can put this info somewhere else. I thought it might be tangentially relevant to the matter of getting the most out of our CPUs, though 2P boxes are the exception.
     
    #34 crashtech, Aug 20, 2017
    Last edited: Aug 20, 2017
  10. StefanR5R

    StefanR5R Senior member

    Joined:
    Dec 10, 2016
    Messages:
    923
    Likes Received:
    447
    bill1024 says he tested this positively on a 6C/12T Ivy Bridge-E.
    https://forums.evga.com/FindPost/2705100

    Maybe I will take measurements sometime later when there is no conflicting competition. Also depends on whether or not this will help me colour-coordinate my PrimeGrid badges.

    @crashtech, maybe HT on/ 100 % CPUs allowed to use/ two 12-threaded tasks at a time would fit your 2P Westmere-EP then.

    BTW if you try to use your primegrid.com results table to check the throughput, keep in mind that "Run time" in these tables may be wrong. Perhaps it gets reported properly if your client only downloads one WU at a time, not several; but no guarantees.
     
    crashtech likes this.
  11. StefanR5R

    StefanR5R Senior member

    Joined:
    Dec 10, 2016
    Messages:
    923
    Likes Received:
    447
    Some remarks about performance measurements with multithreaded LLR:

    Run time, CPU time, and credit of the tasks, and the ratios between them, are somewhat variable. Not by much, but this variability makes it necessary to complete and validate many tasks in order to get reasonably precise PPD estimates.

    In order to get the average PPD of a host, do not take an average of the PPD of several tasks. Rather, take the sum of the run time of the tasks, the sum of the credits of the tasks, and then compute PPD from them. (The latter is a weighted average with run time being the weight. This is necessary because your host spends more time with longer tasks than with shorter tasks...)
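    That weighted average can be sketched in a few lines of Python (the task data are invented for illustration):

```python
# Correct host PPD: credit-weighted by run time, versus the misleading
# per-task average. Rows are (run time in seconds, credit); made-up data.
tasks = [
    (20.25 * 3600, 3550),
    (41.40 * 3600, 3560),
    ( 7.90 * 3600, 3540),
]

total_runtime = sum(rt for rt, _ in tasks)
total_credit  = sum(cr for _, cr in tasks)
ppd = total_credit / total_runtime * 86400               # correct

naive = sum(cr / rt * 86400 for rt, cr in tasks) / len(tasks)  # misleading
```

    With this data the naive per-task average overstates the host's PPD, because the short task's high per-task PPD is not down-weighted by the little time spent on it.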

    The above is true for single-threaded as well as multi-threaded tasks. But here is a catch: Run time of multithreaded tasks is often misreported at PrimeGrid's web interface if you don't take some precautions. This bug is known to the PrimeGrid developers and was attributed to boinc-client, not to the LLR applications. The precautions:
    • Always download only at most 1 task at a time. So far, this seems sufficient to avoid whatever bug causes the wrong reporting of run times.
    • If you accidentally downloaded several tasks in one go while you do performance monitoring, cancel these tasks, and proceed to download one task after another.
    How to make sure that only 1 task is being downloaded?
    • Set "Store at least 0 days of work" and "Store up to an additional 0 days of work".
    • Begin with "Use at most 1 % of the CPUs". This will cause the client to use only 1 logical CPU, tell that to the server, and receive only 1 task.
    • Then increase "Use at most _ % of the CPUs" step by step, such that the client determines in each step that it has exactly 1 idle logical CPU to ask a new task for.
    These points remain true after you have configured app_config.xml for multithreaded tasks. Say you configured 4 threads per task, and then increase "Use at most _ % of the CPUs" such that the client figures there are now 4 idle logical CPUs: when the client requests new work, the server will send 4 tasks, not 1 as desired.

    (If you add 1 idle CPU, then download e.g. a 4-threaded task, that task won't be started of course. Either add 3 more CPUs to get the task started, or add 4 more CPUs to start the task and download another one.)
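    The arithmetic behind those percentage steps can be sketched as follows. This assumes, as the behaviour above suggests, that the client floors n_logical × pct / 100 with a minimum of one CPU; that rounding rule is my assumption, not something documented here:

```python
import math

def pct_for(n_cpus, n_logical):
    """Smallest integer percentage for "Use at most _ % of the CPUs"
    that makes exactly n_cpus of n_logical CPUs usable."""
    pct = math.ceil(100 * n_cpus / n_logical)
    # sanity check under the assumed floor-with-minimum-one rule
    assert max(1, n_logical * pct // 100) == n_cpus
    return pct

# e.g. on the 2x 14C/28T host (56 logical CPUs):
# pct_for(1, 56) -> 2   (start with a single task)
# pct_for(4, 56) -> 8   (one 4-threaded task fits)
```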
     
    crashtech likes this.
  12. crashtech

    crashtech Diamond Member

    Joined:
    Jan 4, 2013
    Messages:
    7,590
    Likes Received:
    666
    @StefanR5R , thanks for the in-depth analysis. I am at once fascinated and discouraged; it seems unlikely that I'll have time to do what is required to get accurate numbers.
     
  13. StefanR5R

    StefanR5R Senior member

    Joined:
    Dec 10, 2016
    Messages:
    923
    Likes Received:
    447
    Such tests take a lot of time between races, because validations take so long. It's much better during a race.

    It also helps a lot if different configurations can be tested on two or more hosts with same hardware in parallel.

    Oh, I forgot: The hand-holding which I described to download only 1 task at a time is only required initially, but not anymore once the host is loaded and running. In my experience, the host is then unlikely to download more than one task at a time.

    To be sure, when I copy & paste results of validated tasks from the web page into a spreadsheet, I add a column in the spreadsheet which contains CPU time divided by run time of each task. This ratio must be slightly less than the number of threads per task. If I discover any outliers in this column, I remove the respective task from evaluation. It would most likely be one for which a wrong run time was reported.
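    That spreadsheet check can be sketched the same way (Python; the rows are made-up examples for 4-threaded tasks):

```python
# Flag tasks whose CPU-time/run-time ratio is implausible: for 4-threaded
# tasks the ratio must be slightly below 4. Rows are (run time s, CPU time s).
results = [
    (10620, 41000),   # ratio ~3.86 -> plausible
    (10980, 42300),   # ratio ~3.85 -> plausible
    ( 2700, 42100),   # ratio ~15.6 -> misreported run time, drop
]

threads_per_task = 4
kept = [(rt, ct) for rt, ct in results if ct / rt <= threads_per_task]
```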
     
    crashtech likes this.
  14. StefanR5R

    StefanR5R Senior member

    Joined:
    Dec 10, 2016
    Messages:
    923
    Likes Received:
    447
    321-LLR, "321 (LLR) 8.00" application
    (name for app_config.xml: llr321)

    I checked dual-processor Linux hosts with two different configurations per processor type.
    Config 1:
    4 threads/task, use 50 % of hardware threads
    (Hyperthreading is on but not used)​
    Config 2:
    4 tasks/socket, use 100 % of hardware threads
    (Hyperthreading is on and used)​
    The latter config gives slightly better throughput.

    dual E5-2690v4 (2x 14C/28T = 56T) @ 2.9 GHz,
    7 tasks at a time x 4 threads/task:
    184,000 PPD
    35 minutes mean time between task completions (4h06m average task duration)​
    8 tasks at a time x 7 threads/task:
    208,000 PPD (+13 %)
    31 minutes mean time between task completions (4h08m average task duration)​

    dual E5-2696v4 (2x 22C/44T = 88T) @ 2.6 GHz,
    11 tasks at a time x 4 threads/task:
    262,000 PPD
    25 minutes mean time between task completions (4h32m average task duration)​
    8 tasks at a time x 11 threads/task:
    270,000 PPD (+3 %)
    24 minutes mean time between task completions (3h12m average task duration)​
    ------------
    Edit September 10: fixed typo in tasks/socket, clarified use of Hyperthreading, increased accuracy of average task duration
     
    #39 StefanR5R, Sep 8, 2017
    Last edited: Sep 10, 2017
    Ken g6 likes this.