
PrimeGrid: CPU benchmarks


Ken g6

Programming Moderator, Elite Member
Moderator
Dec 11, 1999
15,417
2,462
55
There's one option I didn't see you consider: HT on and using multiple cores. I hear that 4 core processors, using all cores on some large WUs, do better with HT on than off. I'm not sure about processors with more cores.
 

StefanR5R

Elite Member
Dec 10, 2016
4,078
4,479
136
I actually ran my Windows tests with a single 6-threaded task at a time on a 6-core CPU, configured as 6C/6T versus 6C/12T. If the computationally intensive threads are placed on distinct physical cores (and that is what the Windows 7 kernel did in fact do, at least with the "performance" energy scheme), then 6C/12T should be superior to 6C/6T because the former has threads left for all the bloat of a typical desktop PC.

I am now running the respective comparison on Linux with 7x4-threaded tasks on 2x14C/28T versus 2x14C/14T.

With IIRC 2x 1.5 days for the Windows test, and less than 2x 1.5 days for the Linux test, I suspect the performance difference will be smaller than the error of measurement caused by variability between WUs.

One thing which I did not test was multithreaded tasks occupying more CPU threads than there are physical cores. So far the only test with more CPU threads occupied than there are physical cores was a test with as many singlethreaded tasks as there are CPU threads.

I should be putting together some tables with actual results now...
 

StefanR5R

GCW-LLR, "Generalized Cullen/Woodall (LLR) v8.00" application
(name for app_config.xml: llrGCW)

Tested on Broadwell-EP (Xeon E5-2690v4) @ 2.9 GHz (AVX all-core turbo), dual processor board, Linux, unless noted otherwise.


Q: Should Hyperthreading be used?

A: No. You take a small performance hit with Hyperthreading. Either switch hyperthreading off in the BIOS, or leave it on but use only half of the logical CPUs.
Edit, September 9:
This finding is true with singlethreaded tasks. See later posts for discussion of multithreaded tasks and Hyperthreading.

Code:
hyperthreading        on           off
threads/task          1            1
simultaneous tasks    56           28
----------------------------------------------
avg task run time     41h25m       20h15m
PPD                   118,000      126,000

Q: What's preferable, HT off & 100 % CPU usage, or HT on & 50 % CPU usage?

A: Impossible for me to say so far. The differences which you see in the next two tables are below the error of measurement caused by variability between WUs. It seems HT on & 50 % CPU usage is useful on hosts with low core count and some background load, e.g. due to a desktop OS. But make sure that the operating system treats hyperthreaded cores correctly, i.e. spreads the PrimeGrid load across all physical cores.

Here are results from a 6-core Ivy Bridge-E (i7-4960X) @ 4.2 GHz, Windows 7 Pro SP1, energy options set to "high performance power plan":
Code:
hyperthreading        off          on
threads/task          6            6
simultaneous tasks    1            1
----------------------------------------------
avg task run time     1h53m        1h50m
PPD                   45,900       46,100
Next are results with the dual-socket Broadwell-EP and Linux which I also used in the other tests. The CPU frequency governor was counter-intuitively set to "powersave", not "performance", but this shouldn't make a difference with recent Intel processors and recent Linux kernels at least.
Code:
hyperthreading        off          on
threads/task          4            4
simultaneous tasks    7            7
----------------------------------------------
avg task run time     2h57m        3h03m
PPD                   207,000      208,000
All the remaining tables are from dual-socket Broadwell-EP and Linux as well.


Q: To how many threads per task should I configure the LLR tasks?

A: This certainly depends on CPU type and memory performance. But generally, a moderate number of threads per process should be best. With fewer processes but respectively more threads per process, RAM usage as well as utilization of the CPU's caches and TLBs probably are reduced. Also, context switches between threads are faster than between processes; but this shouldn't matter much because you set up BOINC such that only as many processes × threads are running as there are hardware threads or cores, except for system threads and whatever other background load. On the other hand, too many threads per LLR process are not advisable because there is sub-linear performance scaling with thread count.

Besides the impact on throughput (points per day), the impact on task duration may be important to you as well: During a competition which lasts, say, 3 days, you would have a bad standing with tasks that require 2 days to run. Or if you are keen on being the primary reporter of new primes instead of merely the validator, you want to complete your task earlier than your wingman.
Code:
hyperthreading        off          off          off          off          off
threads/task          1            2            4            7            14
simultaneous tasks    28           14           7            4            2
----------------------------------------------------------------------------------
avg task run time     20h15m       7h53m        2h57m        1h43m        1h02m
PPD                   126,000      149,000      207,000      204,000      172,000
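The thread counts in the table above are the ones that divide the 28 physical cores evenly. A minimal sketch that lists such splits for any core count follows; list_configs is a hypothetical helper name, not something from BOINC or PrimeGrid.

```shell
#!/bin/bash

# list_configs - print every "tasks x threads" split that occupies exactly
# the given number of physical cores (hypothetical helper for illustration)
#
#     argument 1, mandatory: number of physical cores in the host
#
list_configs () {
    local cores=$1 t
    for (( t=1; t<=cores; t++ ))
    do
        if (( cores % t == 0 ))
        then
            echo "$(( cores / t )) tasks x ${t} threads"
        fi
    done
}

# the dual-socket Broadwell-EP host has 2 x 14 = 28 cores
list_configs 28
```

On a dual-socket host, splits whose thread count does not divide the per-socket core count (e.g. 4 threads on 14 cores) force at least one task to straddle both sockets.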
In order to enable multithreading, place a file named "app_config.xml" with the following contents into C:\ProgramData\BOINC\projects\www.primegrid.com\. Replace the two occurrences of "4" with the thread count which you desire.
Code:
<app_config>
    <app>
        <name>llrGCW</name>
        <fraction_done_exact/>
    </app>
    <app_version>
        <app_name>llrGCW</app_name>
        <cmdline>-t 4</cmdline>
        <avg_ncpus>4</avg_ncpus>
    </app_version>
</app_config>
To apply this configuration, either restart the BOINC client, or let it re-read all configuration files. The <avg_ncpus> element will take effect immediately. It is used by the client to determine how many tasks can be started until the user-configured CPU utilization is approached. The <cmdline> element takes effect as soon as a task is being started (or when resumed from suspension if "Leave non-GPU tasks in memory while suspended" is off, or when resumed after client restart). The <cmdline> element is the one which actually causes the LLR application to spawn multiple threads. The <fraction_done_exact> element merely tells the client to display the estimated time remaining until task completion on the basis of what the LLR application reports to the client.
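Since the thread count has to be changed in two places, it is easy to edit only one of them. A small sketch that prints the file for a given application name and thread count is shown here; make_app_config is a hypothetical helper, and the XML it emits simply mirrors the file above.

```shell
#!/bin/bash

# make_app_config - print an app_config.xml to stdout
# (hypothetical helper; the XML mirrors the file shown in the post)
#
#     argument 1, mandatory: application name, e.g. llrGCW
#     argument 2, mandatory: threads per task
#
make_app_config () {
    local app=$1 threads=$2
    cat <<EOF
<app_config>
    <app>
        <name>${app}</name>
        <fraction_done_exact/>
    </app>
    <app_version>
        <app_name>${app}</app_name>
        <cmdline>-t ${threads}</cmdline>
        <avg_ncpus>${threads}</avg_ncpus>
    </app_version>
</app_config>
EOF
}

# print a 4-thread config for llrGCW; redirect the output into
# app_config.xml in the projects/www.primegrid.com/ directory to use it
make_app_config llrGCW 4
```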
 

crashtech

Diamond Member
Jan 4, 2013
9,758
1,592
126
Any info about NUMA awareness? On my 2P 20C/40T box, configuring either 4 or 5 threads per task puts all tasks on one CPU instead of spreading them out amongst both CPUs.

I may have to manually disable HT on the two 2P boxen if I can.

EDIT: Attempting to turn HT off on the 2P box somehow broke Windows 10. Turning it back on revealed that Win 10 is STILL broken, and I don't have time to fix it. It can't even make it to the point where Safe Mode can be activated. So I'm going to be a lot less help now because of an attempt to fix something that wasn't broken.
 

StefanR5R

That's weird. I can't answer this as I am not familiar with the Windows scheduler policies on dual socket machines.

It's actually something where the kernel's scheduler needs some advice from the application or from the admin: A multithreaded process with lots of inter-thread communication should keep the communicating threads on the same socket. But a multithreaded process with high RAM bandwidth requirement should spread its threads across sockets. Without knowing much about PrimeGrid's LLR applications, my guess is they are of the latter type. But if several multithreaded processes are being run which all equally require RAM bandwidth, then it is optimal again to keep threads of the same process on the same socket (but put processes on both sockets of course).

But why would Windows, when faced with several CPU intensive processes, saturate one socket first and only then begin to use the other socket? I can't think of any good application for such a policy.

Perhaps you can fix this with Process Lasso.
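On Linux, the placement described above (threads of one process on one socket, processes spread over both sockets) could be tried with numactl. The sketch below only prints the command prefix per process instead of running anything. It assumes that numactl is installed and that NUMA nodes are numbered 0 and 1; whether pinning actually helps LLR is untested here.

```shell
#!/bin/bash

# numa_prefix - print a numactl prefix which would pin a process and its
# memory to one socket, alternating sockets round-robin
# (sketch; assumes NUMA node numbers run from 0 to sockets-1)
#
#     argument 1, mandatory: slot number of the process, starting at 1
#     argument 2, mandatory: number of sockets
#
numa_prefix () {
    local slot=$1 sockets=$2
    local node=$(( (slot - 1) % sockets ))
    echo "numactl --cpunodebind=${node} --membind=${node}"
}

# dry run for 4 processes on a dual-socket host
for s in 1 2 3 4
do
    echo "slot ${s}: $(numa_prefix ${s} 2) <command>"
done
```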
 

crashtech

As it stands, it looks like I might have to repair/reload the OS to get it working at all, and due to the confluence of several unusual events, I am working today and won't be able to fix it in time. Perhaps this evening I will get it running and experiment with Process Lasso.

Odd and frustrating to me that simply disabling HT would cause boot failure on this one machine. I've done that bunches of times on other PCs, and never had this happen. Murphy's Law stalks me continually.

Edit: The other 2P box (2 Westmere hexacores) is NOT acting weird, it's distributing the threads evenly. So it's something about the 20C Ivy box that is causing a problem, perhaps an arcane BIOS setting.

EDIT2: The Westmere box IS doing the same thing, but only when multithreading is used. Here is a screencap showing the problem, the CPUs on top are Node 0:



What we ought to see is every other core being used instead of all threads on one node.

Even with multithreading off, something doesn't look right with core utilization:



@StefanR5R , if this is a cluttering of your benchmark thread, I can put this info somewhere else. I thought it might be tangentially relevant to the matter of getting the most out of our CPUs, though 2P boxes are the exception.
 

StefanR5R

Ken g6 said:
There's one option I didn't see you consider: HT on and using multiple cores. I hear that 4 core processors, using all cores on some large WUs, do better with HT on than off. I'm not sure about processors with more cores.

bill1024 says he tested this positively on 6c/12T Ivy Bridge-E.
https://forums.evga.com/FindPost/2705100

Maybe I will take measurements sometime later when there is no conflicting competition. Also depends on whether or not this will help me colour-coordinate my PrimeGrid badges.

@crashtech, maybe HT on/ 100 % CPUs allowed to use/ two 12-threaded tasks at a time would fit your 2P Westmere-EP then.

BTW if you try to use your primegrid.com results table to check the throughput, keep in mind that "Run time" in this table may be wrong. Perhaps it gets reported properly if your client only downloads one WU at a time, not several; but no guarantees.
 

StefanR5R

Some remarks about performance measurements with multithreaded LLR:

Run time, CPU time, Credit of the tasks, and the ratios between them, are somewhat variable. Not much, but this variability makes it necessary that many tasks are being completed and validated in order to get reasonably precise PPD estimations.

In order to get the average PPD of a host, do not take an average of the PPD of several tasks. Rather, take the sum of the run time of the tasks, the sum of the credits of the tasks, and then compute PPD from them. (The latter is a weighted average with run time being the weight. This is necessary because your host spends more time with longer tasks than with shorter tasks...)
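The run-time-weighted average can be sketched with a few lines of awk. host_ppd is a hypothetical helper name, the input format (run time in seconds and credit, one task per line) is an assumption, and the three sample tasks are made up.

```shell
#!/bin/bash

# host_ppd - read "run_time_seconds credit" pairs from stdin, one task per
# line, and print sum of credits per day of summed run time
#
#     argument 1, optional: number of tasks running at a time (default 1)
#
host_ppd () {
    awk -v n="${1:-1}" '
        { secs += $1; credit += $2 }
        END { if (secs > 0) printf "%.0f\n", n * credit * 86400 / secs }'
}

# made-up sample: three tasks of different length
printf '%s\n' "3600 100" "7200 150" "10800 350" | host_ppd
```

For these sample tasks the result is 2400, whereas a plain average of the three per-task PPD values (2400, 1800, 2800) would give about 2333, because the plain average gives the short task the same weight as the long one.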

The above is true for single-threaded as well as multi-threaded tasks. But here is a catch: Run time of multithreaded tasks is often misreported at PrimeGrid's web interface if you don't take some precautions. This bug is known to the PrimeGrid developers and was attributed to boinc-client, not to the LLR applications. The precautions:
  • Always download only at most 1 task at a time. So far, this seems sufficient to avoid whatever bug causes the wrong reporting of run times.
  • If you accidentally downloaded several tasks in one go while you do performance monitoring, cancel these tasks, and proceed to download one task after another.
How to make sure that only 1 task is being downloaded?
  • Set "Store at least 0 days of work" and "Store up to an additional 0 days of work".
  • Begin with "Use at most 1 % of the CPUs". This will cause the client to use only 1 logical CPU, tell that to the server, and receive only 1 task.
  • Then increase "Use at most _ % of the CPUs" step by step, such that in each step the client finds exactly 1 idle logical CPU for which to request a new task.
These points still apply after you have configured app_config.xml for multithreaded tasks. Say you configured 4 threads per task. If you then increase "Use at most _ % of the CPUs" such that the client figures there are now 4 idle logical CPUs, the server will send 4 tasks when the client requests new work, not 1 as desired.

(If you add 1 idle CPU, then download e.g. a 4-threaded task, that task won't be started of course. Either add 3 more CPUs to get the task started, or add 4 more CPUs to start the task and download another one.)
 

crashtech

@StefanR5R , thanks for the in-depth analysis. I am at once fascinated and discouraged; it seems unlikely that I'll have time to do what is required to get accurate numbers.
 

StefanR5R

Such tests take a lot of time between races, because validations take so long. It's much better during a race.

It also helps a lot if different configurations can be tested on two or more hosts with same hardware in parallel.

Oh, I forgot: The hand-holding which I described to download only 1 task at a time is only required initially, but not anymore once the host is loaded and running. In my experience, the host is then unlikely to download more than one task at a time.

To be sure, when I copy & paste results of validated tasks from the web page into a spreadsheet, I add a column in the spreadsheet which contains CPU time divided by run time of each task. This ratio must be slightly less than the number of threads per task. If I discover any outliers in this column, I remove the respective task from evaluation. It would most likely be one for which a wrong run time was reported.
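The same plausibility check can be sketched outside of a spreadsheet. flag_outliers is a hypothetical helper, the input format (run time and CPU time in seconds, one task per line) is an assumption, the lower threshold is a value to pick per host, and the sample rows are made up.

```shell
#!/bin/bash

# flag_outliers - read "run_time_seconds cpu_time_seconds" per task from
# stdin and mark tasks whose CPU time / run time ratio is implausible
#
#     argument 1, mandatory: threads per task (upper limit of the ratio)
#     argument 2, mandatory: lowest accepted ratio (assumed threshold)
#
flag_outliers () {
    awk -v threads="$1" -v lo="$2" '
        $1 > 0 {
            ratio = $2 / $1
            verdict = (ratio > threads || ratio < lo) ? "suspect" : "ok"
            printf "%s %s %.2f %s\n", $1, $2, ratio, verdict
        }'
}

# made-up sample for 4-threaded tasks; the second row has a run time that
# is too short for its CPU time, as with a misreported run time
printf '%s\n' "12460 48000" "6000 47500" | flag_outliers 4 3.0
```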
 

StefanR5R

321-LLR, "321 (LLR) 8.00" application
(name for app_config.xml: llr321)

I checked dual-processor Linux hosts with two different configurations per processor type.
Config 1:
4 threads/task, use 50 % of hardware threads
(Hyperthreading is on but not used)
Config 2:
4 tasks/socket, use 100 % of hardware threads
(Hyperthreading is on and used)
The latter config gives slightly better throughput.

dual E5-2690v4 (2x 14C/28T = 56T) @ 2.9 GHz,
7 tasks at a time x 4 threads/task:
184,000 PPD
35 minutes mean time between task completions (4h06m average task duration)
8 tasks at a time x 7 threads/task:
208,000 PPD (+13 %)
31 minutes mean time between task completions (4h08m average task duration)

dual E5-2696v4 (2x 22C/44T = 88T) @ 2.6 GHz,
11 tasks at a time x 4 threads/task:
262,000 PPD
25 minutes mean time between task completions (4h32m average task duration)
8 tasks at a time x 11 threads/task:
270,000 PPD (+3 %)
24 minutes mean time between task completions (3h12m average task duration)
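The "mean time between task completions" figures are consistent with dividing the average task duration by the number of tasks running at a time. A minimal sketch of that arithmetic follows; completion_interval is a hypothetical helper, and the HhMMm input format is just the notation used in this post.

```shell
#!/bin/bash

# completion_interval - mean minutes between task completions in steady state
#
#     argument 1, mandatory: average task duration written as HhMMm, e.g. 4h06m
#     argument 2, mandatory: number of tasks running at a time
#
completion_interval () {
    local h=${1%%h*} m=${1#*h}
    m=${m%m}
    awk -v h="$h" -v m="$m" -v n="$2" \
        'BEGIN { printf "%.0f\n", (h * 60 + m) / n }'
}

completion_interval 4h06m 7     # E5-2690v4, 7 tasks x 4 threads
completion_interval 4h32m 11    # E5-2696v4, 11 tasks x 4 threads
```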

------------

Furthermore, I ran a few configurations on a single-processor host, socket 1150, Xeon E3-1245v3 (Haswell 4C/8T) @ 3.4 GHz, Linux. In each of these tests, I ran always the very same WU, getting an error of measurement below 0.2 %. The WU in this test had an FFT length of 768K (data size: 6.0 MB).

1 process with 8 threads:
12,460 s (3.46 h) run time, 89,390 s (24.8 h) CPU time, 33,800 PPD
1 process with 7 threads:
12,520 s (3.48 h) run time, 81,320 s (22.6 h) CPU time, 33,600 PPD
1 process with 6 threads:
12,790 s (3.55 h) run time, 72,560 s (20.2 h) CPU time, 32,900 PPD

2 processes with 4 threads each:
33,570 s (9.33 h) run time, 248,950 s (69.2 h) CPU time, 25,000 PPD
4 processes with 2 threads each:
about 22 h run time, about 21,000 PPD
(extrapolated after 15% completion)
2 processes with 2 threads each:
about 9.4 h run time, about 24,800 PPD
(extrapolated after 50% completion)

Conclusion:
On the 4C/8T Haswell with dual-channel memory, it is best to run only one task at a time, leave Hyperthreading enabled, and give the task as many processor threads as can be spared.

------------

Edit September 10, 2017:
fixed typo in tasks/socket, clarified use of Hyperthreading, increased accuracy of average task duration
Edit March 31, 2018:
added Xeon E3 results
Edits April 2-5, 2018:
Xeon E3 ran at 3.4 GHz, typo in s->h conversion
Edit October 20, 2019:
added FFT length of the Xeon E3 tests
 

StefanR5R

Note, SoB-LLR is now on version 8.00. (Post #2 was from before the update.) I guess it now scales similarly to other LLR applications with long-running tasks. Still, I wonder how much of the memory bandwidth dependence that I saw in #2 remains.
 

petrusbroder

Elite Member
Nov 28, 2004
13,322
1,067
126
StefanR5R, I really like your analysis and empirical data. Although I have not run PrimeGrid for quite some time, this makes it interesting again!
 

StefanR5R

Testing the LLR application outside of Boinc

Almost all of the results which I reported in this thread came from running many random WUs (a sufficient number for each configuration to be tested), adding their run times and credits, and thus arriving at PPD of a given configuration. Benefits of this method are that you get credit for all CPU hours expended, might even find primes, and may do this testing even during a PrimeGrid challenge. The downsides are imprecision due to variability between WUs, and lack of repeatability.

Alternatively, the LLR application can be run stand-alone without boinc and PrimeGrid's llr_wrapper. That way, you can feed the very same WU to LLR over and over, thus eliminate variability between WUs, and can directly compare results from a single run on each configuration to be tested. (In turn, the downside of this method is that your CPUs don't earn any boinc credit on the side while testing.)

What's needed:

I didn't go ask how to run LLR stand-alone; I merely watched how boinc-client + llr_wrapper run it and replicated that. Hence my recipe may contain some bits of cargo cult. File names and scripts are shown for Linux here, but the method can be replicated on Windows as well.
  1. Create a working directory for testing. Copy the following files into this directory, and create a test script within the directory as shown below.
  2. Use boinc-client to run a single WU of the PrimeGrid LLR-based subproject that you want to test for. Watch which program binary it runs, and which input file it feeds into the LLR application.
  3. Optionally: Later, when PrimeGrid validated the WU, make a note of the credit that was given for the WU.
These files need to be copied out of the boinc data directory:
  • The executable file of the LLR program. Example name of file to copy:
    /var/lib/boinc/projects/www.primegrid.com/sllr64.3.8.20
  • The template for LLR's ini file. Example:
    /var/lib/boinc/projects/www.primegrid.com/llr.ini.6.07
  • The input file of the WU to test. Example:
    /var/lib/boinc/projects/www.primegrid.com/llr321_299635024
The input file llr321_299635024 was the one which I used to obtain the results with Xeon E3 which I reported in post #39, and the content of this file was:
Code:
600000000000000:P:1:2:257
3 13826132
This WU was credited with 4,865.57 cobblestones.

After having copied these files, I made a shell script which lets me run the LLR program with different thread counts, as one or several instances of the program in parallel, and which records the run times and more in a log file:
Bash:
#!/bin/bash

# edit these for the right file names
# (files copied from boinc's projects/www.primegrid.com/ subdirectory)
LLREXE="sllr64.3.8.20"
LLRINI="llr.ini.6.07"
LLRINPUT="llr321_299635024"    # 4,865.57 cobblestones


LOGFILE="protocol.txt"
TIMEFORMAT=$'\nreal\t%1lR\t(%0R s)\nuser\t%1lU\t(%0U s)\nsys\t%1lS\t(%0S s)'

# run_one - run a single LLR process, and show timing information when finished
#
#     argument 1, mandatory: slot number, i.e. unique name of the instance
#     argument 2, mandatory: thread count of the process
#     argument 3, optional: completion percentage at which to terminate a test
#
run_one () {
    SLOT="slot_$1"
    rm -rf ${SLOT}
    for ((;;))
    do
        mkdir ${SLOT}                    || break
        ln    ${LLREXE} ${SLOT}/llr.exe  || break
        cp -p ${LLRINI} ${SLOT}/llr.ini  || break
        cd ${SLOT}                       || break

        echo "---- slot $1 ----" > stdout
        if [ -z "$3" ]
        then
            time ./llr.exe -d -oDiskWriteTime=10 -oThreadsPerTest=$2 ../${LLRINPUT} >> stdout 2> stderr
        else
            ./llr.exe -d -oDiskWriteTime=10 -oThreadsPerTest=$2 ../${LLRINPUT} >> stdout 2> stderr &
            LLRPID=$!
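            while sleep 5
            do
                # done once the last progress line shows the requested percentage
                tail -1 stdout | grep -e "[[]$3[.]" > /dev/null && break
                # also stop polling if LLR has already exited on its own
                kill -0 ${LLRPID} 2> /dev/null || break
            done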
            kill ${LLRPID}
            wait ${LLRPID} 2> /dev/null
        fi
        cat stdout stderr
        cd ..
        break
    done
    rm -rf ${SLOT}
}


# run_series - run one or more LLR processes in parallel, and log everything
#
#     argument 1, mandatory: number of processes to run at once
#     argument 2, mandatory: thread count of each process
#     argument 3, optional: completion percentage at which to terminate a test
#
# stdout and stderr are appended into ${LOGFILE}.
#
run_series () {
    {
        echo "======== $(date) ======== starting $1 process(es) with $2 thread(s) ========"
        time {
            for (( s=1; s<=$1; s++ ))
            do
                run_one $s $2 $3 &
            done
            wait
        }
        echo "======== $(date) ======= done with $1 process(es) with $2 thread(s) ========"
        echo
    } 2>&1 | tee -a "${LOGFILE}"
}


# edit these for the desired test programme
run_series 1 8
run_series 1 7
run_series 1 6
run_series 2 4
run_series 4 2
run_series 2 2

This script is to be placed and run within the same directory as the copies of the exe, ini, and input file. After the top and the tail of this script were edited suitably and the script was started, the waiting begins. To check on the progress of the running LLR process(es), the following simple script can be used:
Bash:
#!/bin/bash

SLOTS=$(echo slot_*)

if [ "${SLOTS}" = 'slot_*' ]
then
    echo "No processes found."
else
    for s in ${SLOTS}
    do
        echo "---- ${s} ----"
        tail -1 ${s}/stdout
        echo
    done
fi

Note, the big wrapper script above (_run.sh) is a little bit flawed if LLR subprojects with very short-running WUs are to be tested with several processes running in parallel: The processes are started at virtually the same time, but they will finish after slightly different times. During the last part of such a test run, when some processes have already finished and only some are still running, the host is under less load than in steady state. Keep this in mind when you compute the average run time from the individual run times of the worker processes. This effect is negligible with longer running WUs though.

Finally, PPD can of course be computed as
credits/WU * number of processes in parallel * 24 h/day * 3600 s/h / average run time in seconds
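As a sketch, the formula above translates into a one-line awk call. ppd is a hypothetical helper name; the example values are the credit of llr321_299635024 and the run time of the "2 processes with 4 threads each" test reported earlier in this thread.

```shell
#!/bin/bash

# ppd - points per day according to the formula above
#
#     argument 1, mandatory: credit per WU
#     argument 2, mandatory: number of processes running in parallel
#     argument 3, mandatory: average run time in seconds
#
ppd () {
    awk -v c="$1" -v n="$2" -v t="$3" \
        'BEGIN { printf "%.0f\n", c * n * 24 * 3600 / t }'
}

# 4,865.57 cobblestones, 2 processes in parallel, 33,570 s average run time
ppd 4865.57 2 33570
```

This prints 25045, in line with the roughly 25,000 PPD reported for that configuration.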

――――
Edit January 3, 2019:
Script updated to optionally run llr.exe only until a given percentage of completion:

E.g. run_series 2 4 10 would start 2 tasks in parallel, each with 4 threads, and terminate them when 10 % complete.
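When a test is terminated early, the full run time has to be extrapolated from the elapsed time. A minimal sketch, assuming that LLR progresses at a constant rate over the whole WU (which the partial run itself cannot confirm); extrapolate_runtime is a hypothetical helper and the elapsed time in the example is made up.

```shell
#!/bin/bash

# extrapolate_runtime - estimate the full run time of a partially run WU,
# assuming linear progress
#
#     argument 1, mandatory: elapsed seconds when the test was terminated
#     argument 2, mandatory: completion percentage at that time
#
extrapolate_runtime () {
    awk -v s="$1" -v p="$2" 'BEGIN { printf "%.0f\n", s * 100 / p }'
}

# made-up example: 11,880 s elapsed at 15 % gives 79,200 s, i.e. 22 h
extrapolate_runtime 11880 15
```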
 

StefanR5R

ESP-LLR, tests done in April with sllr64.3.8.20
(name for app_config.xml: llrESP)

I used the test method described in post #44 with the following WU:
llrESP_295093708 (6,554 cobblestones)
Code:
1000000000:P:0:2:257
200749 11371602
Haswell Xeon E3-1245v3 @3.4 GHz, 4C/8T, Linux:

1 task x 4 threads: 18,577 s = 5.16 h, 30.5 kPPD
1 task x 6 threads: 17,840 s = 4.96 h, 31.7 kPPD <- best
1 task x 7 threads: 17,866 s = 4.96 h, 31.7 kPPD
1 task x 8 threads: 19,089 s = 5.30 h, 29.7 kPPD

Broadwell-EP dual E5-2690v4 @2.9 GHz, 2x 14C/28T, Linux:

HT on, but only ~50% of the hardware threads used:
2 tasks x 14T: 1.82 h, 173 kPPD
4 tasks x 7T: 3.15 h, 200 kPPD
6 tasks x 5T: 4.71 h, 201 kPPD
7 tasks x 4T: 5.54 h, 199 kPPD

HT on, ~all hardware threads used:
2 tasks x 28T: 1.99 h, 158 kPPD
4 tasks x 14T: 3.50 h, 180 kPPD
6 tasks x 9T: 4.27 h, 220 kPPD <- best
7 tasks x 8T: 6.93 h, 159 kPPD
8 tasks x 7T: 6.16 h, 204 kPPD
10 tasks x 5T: 9.72 h, 162 kPPD

HT on, 3 tasks per socket, 54...96% of the hardware threads used:
6 tasks x 5T: 4.71 h, 201 kPPD
6 tasks x 7T: 4.54 h, 208 kPPD
6 tasks x 9T: 4.27 h, 220 kPPD <- best

My interpretation of the E5v4 results so far:
  • The optimum number of simultaneous ESP-LLR tasks on MCC-die Broadwell-EP is 3 per processor.
  • If the number of simultaneous tasks is less or more than the optimum, then it's better to use only "real cores".
  • If the number of simultaneous tasks is at the optimum, then it's better to use all hardware threads.
In the two configs with 7 tasks, at least one task needed to be spread across sockets. I haven't checked how Linux assigned the tasks to the processors and cores. In the 7x4 test, one of the seven task durations was indeed a bit longer than the other six, but not by much. In the 7x8 test, there was only a small difference between durations of the longest few tasks.

I tested the best config of 6x9 twice, to be sure. The second test gave 4.29 h, 220 kPPD.
 

crashtech

I want to do a bit of testing on my 2P 2680 v2 system. I wonder what a presumed optimal configuration would be for llrSR5? 4 tasks and 10T, or 6 tasks and 6T?
 

StefanR5R

Or 2 tasks x 5T? (I.e. HT off. — In the llrESP tests on Broadwell-EP, the best non-HT config was only 10 % behind the best HT config. And many HT configs were worse than similar non-HT configs.)

I am in the process of testing llrSR5 v8.01 with the following exe and inputs:
LLREXE="sllr64.3.8.21"
LLRINI="llr.ini.6.07"
LLRINPUT="llrSR5_299571544" # 1,678.34 cobblestones

Contents of the input file:
Code:
1000000000:M:0:5:258
322498 2527171
 

crashtech

Well, with 2x E5-2680 v2 (10C/20T) on Win 10, it looks like both 4 tasks x 10T and 2 tasks x 5T load the sockets unevenly, favoring one over the other. I'm really puzzled why the 4 tasks x 10T config only resulted in about 45% load on the machine. So far the only config that seems to load the sockets evenly is 8 tasks x 5T.

Edit:
It looks like 8 tasks x 5T is resulting in an average of 4.38 hour completion times, which seems not too shabby for Ivy Bridge, I think.
 

StefanR5R

One factor which results in lower CPU utilization of 4x10T compared with 8x5T is the higher number of threads per process. The application scales sub-linearly to higher thread counts. There may even be a point beyond which it scales negatively... The other factor must be Windows. I don't know how the Windows process scheduler works, but I suspect Linux would do better.
 

StefanR5R

User forest of CNT suggests using Prime95 for offline testing of different #tasks x #threads. The FFT length needs to be chosen to match the PrimeGrid subproject that is to be tested for. For example, he suggests 640k for SR5-LLR.

(The FFT length used by LLR in any given PrimeGrid LLR-based subproject can be found in the stderr output of a PrimeGrid task; i.e. in slots\#\stderr.txt of a running task, or at the web page of a finished and reported task. A quick look at three SR5-LLR tasks of mine showed FFT lengths of 640k and 560k.)
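Picking the FFT length out of one or more stderr files can be sketched with grep. This assumes that LLR reports it with the phrase "FFT length" followed by the size, such as "FFT length 640K"; the exact wording of the surrounding line may differ between LLR versions, and the sample line below is made up for illustration. fft_lengths is a hypothetical helper name.

```shell
#!/bin/bash

# fft_lengths - print the distinct FFT lengths mentioned in the given files,
# or in stdin if no file is given
# (assumes the output contains "FFT length <number>K" or "...M")
#
fft_lengths () {
    grep -oh 'FFT length [0-9]*[KM]' "$@" | sort -u
}

# made-up sample line standing in for the contents of a task's stderr.txt
printf '%s\n' "Using FFT length 640K, 4 threads" | fft_lengths
```

With real tasks, the function would be called with the stderr.txt files of the running slots as arguments.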

The big benefit would be that Prime95 tests would be done a lot quicker than typical LLR runs.

I do wonder whether Woltman's Prime95 and Penné's LLR really have similar enough performance profiles, such that Prime95's and LLR's sweet spots are the same or at least close to each other. Something to test in the future.
 
