PrimeGrid: CPU benchmarks

biodoc · Oct 7, 2019

@StefanR5R , I was going to use your script to test my 3700X but I got distracted messing with the new PG preferences settings. It appears PG thinks I only have 8 threads. It seems only combinations of # of tasks and # max threads per task that add up to 8 threads work. If I use settings that use 16 threads I get eventual computational errors and odd task usage as monitored by top. Basically it seems to work as if HT or SMT is turned off. Boinc is set to use 100% of the CPUs, so that's not the problem.

StefanR5R · Oct 8, 2019

Are the web prefs passed through as expected?
To verify, look at stderr.txt (while the task is running or suspended: in the respective slot directory; after the task was reported: at the task web page at primegrid.com) and see whether or not the number of threads logged by LLR matches your "Multi-threading: Max # of threads for each task(Only applies to LLR tasks)" web preference.
Related to that: Remember, if you have an app_config.xml with <app_version> sections with <plan_class>mt</plan_class>, then these override the web preference. (Old <app_version> sections of LLR-based applications without this new <plan_class> are now ignored.)
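The stderr.txt check described above can be sketched in shell. This is only a rough helper, assuming the log contains a line of the form "..., 16 threads, a = 3" as in the excerpt quoted later in this thread; the function name llr_thread_count is made up for illustration, and the slots path is the usual Linux package default, which may differ on your system.

```bash
# llr_thread_count (hypothetical helper): read an LLR stderr.txt on stdin and
# print the thread count that LLR itself logged, e.g. "16" for a line
# containing "..., 16 threads, a = 3". Prints nothing if no such line exists.
llr_thread_count () {
    grep -oE '[0-9]+ threads?' | tail -n 1 | cut -d' ' -f1
}

# Assumed location of the slot directories; adjust to your boinc data directory.
for f in /var/lib/boinc/slots/*/stderr.txt; do
    if [ -f "$f" ]; then
        echo "$f: $(llr_thread_count < "$f") thread(s) logged by LLR"
    fi
done
```

Compare the printed number with the "Max # of threads for each task" web preference (or with your app_config.xml, if you use one).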

If this doesn't work as expected, then the respective support thread should be a good place to bring this up.

If you run a small task like PPS-DIV's with a high number of threads per task, then it is normal IME that a process monitor like "top" shows a lot less CPU usage than theoretically possible. Maybe this is due to thread synchronization overhead overshadowing the actual computation.

I see in the results list of one of your hosts at primegrid.com that you got 3 error results (each with 4 threads per task, error finish after a few minutes, and exit code 139 = segfault), and 2 results marked invalid (one ran with 4 threads, the other with 2 threads). I suspect these are not related to the fact that you ran multithreaded. I rather wonder whether this could be due to too tight memory timings, new RAM which was not yet extensively tested for defects, or non-default CPU settings, e.g. undervolting.

biodoc · Oct 8, 2019

@StefanR5R , It was a very disturbing Gigabyte X570 BIOS issue. With CPB (core performance boost) disabled, it runs the llr app with a maximum of 8 of the 16 threads. When I enable CPB, everything is OK. I believe the task errors were due to running 2 x 8 thread tasks on 8 threads. Monitoring using top, one task was using most of the 8 thread processor power and the other task was struggling to get processor time. I would have expected the Linux scheduler to have distributed half the processing power to each task. On the ASRock X470 board with CPB disabled the llr app uses all 16 threads with no errors. The Gigabyte board has AGESA 1.0.0.3 ABBA and the ASRock board has AGESA 1.0.0.3 ABB.

One odd thing I observed on both computers is that a setting of 1 task/16 threads resulted in the llr app only using 11 threads. The contents of stderr.txt were as follows:

BOINC llr wrapper (version 8.04)
Using Jean Penne's llr (64 bit)
LLR Program - Version 3.8.23, using Gwnum Library Version 29.8

LLR command line: primegrid_llr -d -oDiskWriteTime=1 -oThreadsPerTest=16 llr.in
Using all-complex FMA3 FFT length 256K, Pass1=1K, Pass2=256, clm=1, 16 threads, a = 3

Could this be a llr app bug?

StefanR5R · Oct 8, 2019

biodoc said:
Monitoring using top, one task was using most of the 8 thread processor power and the other task was struggling to get processor time. I would have expected the linux scheduler to have distributed half the processing power to each task.

Did they have same priority? (If they were both started through the same boinc-client instance, then the answer is most certainly yes. If so, then that's odd indeed.)

biodoc said:
One odd thing I observed on both computers is that a setting of 1 task/16 threads resulted in the llr app only using 11 threads. The contents of stderr.txt were as follows:

[...]
LLR command line: primegrid_llr -d -oDiskWriteTime=1 -oThreadsPerTest=16 llr.in
Using all-complex FMA3 FFT length 256K, Pass1=1K, Pass2=256, clm=1, 16 threads, a = 3

It claims to use 16 threads, and I believe it to have done so.
Do you mean you saw only ~1100 % CPU utilization, instead of the ideal ~1600 %? This would be expected. LLR can only scale that high if you give it really large WUs to chew on, such as SoB or PSP --- and even these are not able to reach that theoretical utilization at such high thread counts, if I remember correctly.

But there is one important thing to keep in mind when it comes to Zen/ Zen+/ Zen2: If program threads are spread over more than one core complex (CCX, i.e. a 4c/8t unit in fully enabled SKUs), they can only synchronize via main memory, not through processor caches. (At least this is what the more technically in-depth reviews claimed.) It is plausible that this becomes visible in multithreaded programs like LLR as a drop in processor utilization, depending on the ratio of synchronization to computation (which is presumably higher the smaller the FFT sizes are).

biodoc · Oct 8, 2019

StefanR5R said:
Did they have same priority? (If they were both started through the same boinc-client instance, then the answer is most certainly yes. If so, then that's odd indeed.)

Yes, same priority.

StefanR5R said:
Do you mean you saw only ~1100 % CPU utilization, instead of the ideal ~1600 %?

Yes.

StefanR5R said:
But there is one important thing to keep in mind when it comes to Zen/ Zen+/ Zen2: If program threads are spread over more than one core complex (CCX, i.e. a 4c/8t unit in fully enabled SKUs), they can only synchronize via main memory, not through processor caches. (At least this is what the more technically in-depth reviews claimed.) It is plausible that this becomes visible in multithreaded programs like LLR as a drop in processor utilization, depending on the ratio of synchronization to computation (which is presumably higher the smaller the FFT sizes are).

That was the intent of that particular test (drive synchronization into main memory to see how it affects PPD). I have a better test when I post the results of your script.

biodoc · Oct 8, 2019

PPS-DIV v8.04

llrDIV_324142536

Here are test results with a workunit which is 240K FFT size ( = 1.875 MB FFT data size, getting 389.87 credits).
All tests were done on Linux 64 bit, otherwise idle systems, running each job until completion using @StefanR5R 's script. Times are median run times.

AMD 3700X (8C/16T, 32 MB L3 cache divided between 2 CCX caches (16 MB each)) @3.7-3.95 GHz (varied since CPB was on) with DDR4-3000 CL16 RAM.

SMT on and used:
2 concurrent tasks with 8 threads each: 993 s (174 tasks/day, 68 kPPD)
4 concurrent tasks with 4 threads each: 1790 s (193 tasks/day, 75 kPPD)
8 concurrent tasks with 2 threads each: 3238 s (213 tasks/day, 83 kPPD) <---best

SMT on and not used:
1 task with 8 threads: 1067 s (81 tasks/day, 31.5 kPPD) <----bad choice for CPU architecture; forces synchronization into main memory
2 concurrent tasks with 4 threads each: 1097 s (157 tasks/day, 61 kPPD)
4 concurrent tasks with 2 threads each: 1961 s (176 tasks/day, 69 kPPD)
8 concurrent tasks with 1 thread each: 3340 s (207 tasks/day, 80.7 kPPD) <---second best
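For anyone who wants to redo the arithmetic behind the figures in parentheses, here is a minimal sketch. It assumes tasks/day = concurrent tasks × 86400 / run time and kPPD = tasks/day × credit / 1000, which reproduces the numbers above; the function name throughput is made up for illustration.

```bash
# throughput RUNTIME_S CONCURRENT_TASKS CREDIT_PER_TASK
# Prints the host's estimated tasks/day and kPPD for one configuration.
throughput () {
    awk -v t="$1" -v n="$2" -v c="$3" 'BEGIN {
        tasks_per_day = n * 86400 / t
        printf "%.0f tasks/day, %.1f kPPD\n", tasks_per_day, tasks_per_day * c / 1000
    }'
}

# 2 concurrent tasks with 8 threads each, 993 s median run time, 389.87 credits
throughput 993 2 389.87    # prints "174 tasks/day, 67.8 kPPD"
```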

crashtech · Oct 8, 2019

Wow, SMT produces gains with Ryzen where Intel HT does not, at least as far as we know; there's Skylake-X to consider.

biodoc · Oct 9, 2019

PPS-DIV v8.04

llrDIV_324142536

Here are test results with a workunit which is 240K FFT size ( = 1.875 MB FFT data size, getting 389.87 credits).
All tests were done on Linux 64 bit, otherwise idle systems, running each job until completion using @StefanR5R 's script. Times are median run times.

This is my other 3700X where CPB is disabled so clock speed is fixed under full load and I can measure power draw from the wall.

AMD 3700X (8C/16T, 32 MB L3 cache divided between 2 CCX caches (16 MB each)) @3.55 GHz with DDR4-3200 CL14 RAM.

SMT on and used: 127 watts from the wall during test
8 concurrent tasks with 2 threads each: 3423 s (202 tasks/day, 78.7 kPPD); 620 PPD/watt

SMT on and not used: 113 watts from the wall during test
8 concurrent tasks with 1 thread each: 3551 s (195 tasks/day, 76 kPPD); 673 PPD/watt

EDIT: This computer has 2 GTX 1080 cards: Idle power draw from the wall is 47 watts
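The PPD/watt figures can be recomputed the same way. This sketch assumes PPD/watt is simply the extrapolated PPD divided by the wall power during the test (idle draw is not subtracted); ppd_per_watt is a hypothetical name.

```bash
# ppd_per_watt RUNTIME_S CONCURRENT_TASKS CREDIT_PER_TASK WALL_WATTS
ppd_per_watt () {
    awk -v t="$1" -v n="$2" -v c="$3" -v w="$4" 'BEGIN {
        ppd = n * 86400 / t * c      # points per day at this run time
        printf "%.0f PPD/W\n", ppd / w
    }'
}

# 8 tasks x 2 threads, 3423 s, 389.87 credits, 127 W at the wall
ppd_per_watt 3423 8 389.87 127    # prints "620 PPD/W"
```

Because the posted values seem to be derived from already rounded kPPD, the last digit can differ by one for other rows.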

StefanR5R · Oct 9, 2019

Thanks, very informative.
I should repeat some of my test cases with a power meter directly on the computer too. (Usually I only have one meter for several computers together; and the tests typically run unattended.) This should be interesting for workload sizes at which the throughput sweet spot is beginning to shift from single-threaded to multi-threaded and/or from HT-off to HT-on.

biodoc · Oct 9, 2019

I switched my power meter to the other 3700X (X570 board with CPB on) and am rerunning the 2 tests. I want to make sure the bios isn't overvolting the processor. The first test is running at 3.675 GHz with a 133 watt power draw.

biodoc · Oct 9, 2019

biodoc said:
AMD 3700X (8C/16T, 32 MB L3 cache divided between 2 CCX caches (16 MB each)) @3.55 GHz with DDR4-3200 CL14 RAM.

SMT on and used: 127 watts from the wall during test
8 concurrent tasks with 2 threads each: 3423 s (202 tasks/day, 78.7 kPPD); 620 PPD/watt

SMT on and not used: 113 watts from the wall during test
8 concurrent tasks with 1 thread each: 3551 s (195 tasks/day, 76 kPPD); 673 PPD/watt

Same tests on other 3700X (X570 board with CPB on) with power meter at the wall.

SMT on and used: 133 watts from the wall during test and running @3.675 GHz
8 concurrent tasks with 2 threads each: 3428 s (202 tasks/day, 78.7 kPPD); 592 PPD/watt

SMT on and not used: 139 watts from the wall during test and running @3.892 GHz
8 concurrent tasks with 1 thread each: 3308 s (209 tasks/day, 81.5 kPPD); 586 PPD/watt

I don't think there's a significant difference between the 2 computers with SMT on and used. With SMT on and not used, CPB pushes the clock speed up a bit (3.892 GHz vs. 3.55 GHz), so kPPD goes up (81.5 vs. 76) but PPD/watt goes down (586 vs. 673). There are more aggressive CPB core boost curves in the X570 BIOS that I could use, but I can live with the most conservative one. The other choices result in progressively higher clock speeds during all-core loads.

StefanR5R · Oct 15, 2019

For posterity: PPS-DIV performance with increasingly sized jobs (already reported elsewhere)

Test device: Kaby Lake i7-7700K (4C/8T, 8 MB L3 cache) @4.2 GHz with DDR4-3000c14 dual-channel RAM.

Notes on the test method:

I modified my script slightly for convenience, such that I can pass k*b^n+c expressions as input directly.
Post #71 showed a real WU, downloaded on October 4, ran to 100 % completion, run times reported as median values from each test (that is: if a test runs more than one task concurrently, I reported the run time out of the middle), rounded towards nearest 10 s.
This time I tested the same WU again but also four "synthetic" larger WUs. I ran them only to 20 % completion, and run times are here reported as the maximum value of each test (that is: if a test runs more than one task concurrently, I report the run time of the one which took the longest to reach 20 % completion), rounded up to the next 5 s (actually, not rounded at all, but the script watched task progress in 5 s intervals).
This different reporting is due to how I implemented the script when it quits a test at a percentage below 100 %. It is the reason why my re-run of the October 4 tests gives now a little less estimated tasks-per-day.

How to read the table:

Each column is for one of the 5 WUs tested. The first one is outdated, as it is smaller than the tasks which are emitted now.
The second WU is representative of those which were available at the beginning of the challenge. WUs of the 3rd size made up the bulk of at least the second half of the challenge. WUs of the 4th and 5th size were initially estimated by Michael Goetz as possible work near the end of the challenge, but the challenge only got as far as 288K and no further.
Each row is for one configuration WRT number of concurrent tasks and threads per task.
The important number to look at is "tasks/d" = estimated number of tasks completed by the host per day.
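Since most runs below were stopped at 20 % completion, the tasks/d values are extrapolations. A minimal sketch of that extrapolation, assuming progress is linear over the whole task (the function name is made up for illustration):

```bash
# tasks_per_day_partial CONCURRENT_TASKS SECONDS_UNTIL_CUTOFF CUTOFF_PERCENT
# Scales the measured time to 100 % and converts it to tasks per day per host.
tasks_per_day_partial () {
    awk -v n="$1" -v t="$2" -v p="$3" 'BEGIN {
        full_runtime = t * 100 / p
        printf "%.0f\n", n * 86400 / full_runtime
    }'
}

tasks_per_day_partial 4 645 20    # 4 tasks x 1 thread, 240K FFT: prints 107
tasks_per_day_partial 2 330 20    # 2 tasks x 4 threads, 256K FFT: prints 105
```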

Code:

candidate               15*2^4110098+1          15*2^4400000+1          15*2^4700000+1          15*2^5200000+1          15*2^5900000+1
FFT length              240K                    256K                    288K                    320K                    384K
FFT data size           1.875 MB                2.0 MB                  2.25 MB                 2.5 MB                  3.0 MB
credit/task             389.87 cobblestones     unknown                 unknown                 unknown                 unknown
===============================================================================================================================================
HT on but not used
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
1 task at a time,        260 s until 20 %        210 s until 20 %        270 s until 20 %        305 s until 20 %        415 s until 20 %
4 threads/task           ~66 tasks/d             ~82 tasks/d             ~64 tasks/d            ~57 tasks/d              ~42 tasks/d
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
2 tasks at a time,       470 s until 20 %        370 s until 20 %        485 s until 20 %        560 s until 20 %        770 s until 20 %
2 threads/task           ~74 tasks/d             ~93 tasks/d             ~71 tasks/d             ~62 tasks/d             ~45 tasks/d
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
3 tasks at a time,       630 s until 20 %        655 s until 20 %        865 s until 20 %       1025 s until 20 %       1420 s until 20 %
1 thread/task            ~82 tasks/d             ~79 tasks/d             ~60 tasks/d             ~51 tasks/d             ~37 tasks/d
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
4 tasks at a time,       645 s until 20 %        675 s until 20 %        865 s until 20 %       1155 s until 20 %       1696 s until 20 %
1 thread/task           ~107 tasks/d  <-- best  ~102 tasks/d             ~80 tasks/d             ~60 tasks/d             ~41 tasks/d
===============================================================================================================================================
HT on and used
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
1 task at a time,        250 s until 20 %        190 s until 20 %        235 s until 20 %        275 s until 20 %        390 s until 20 %
8 threads/task           ~69 tasks/d             ~91 tasks/d             ~74 tasks/d             ~63 tasks/d             ~44 tasks/d
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
2 tasks at a time,       390 s until 20 %        330 s until 20 %        415 s until 20 %        495 s until 20 %        715 s until 20 %
4 threads/task           ~89 tasks/d            ~105 tasks/d  <-- tied   ~83 tasks/d  <-- best   ~70 tasks/d  <-- best   ~48 tasks/d  <-- best
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
3 tasks at a time,       610 s until 20 %        510 s until 20 %        655 s until 20 %        785 s until 20 %       1155 s until 20 %
2 threads/task           ~85 tasks/d            ~102 tasks/d             ~79 tasks/d             ~66 tasks/d             ~45 tasks/d
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
4 tasks at a time,       730 s until 20 %        660 s until 20 %        850 s until 20 %       1206 s until 20 %       1841 s until 20 %
2 threads/task           ~95 tasks/d            ~105 tasks/d  <-- tied   ~81 tasks/d             ~57 tasks/d             ~38 tasks/d
===============================================================================================================================================

Conclusions:

As soon as FFT data size equals the desktop-Skylake derivative's amount of L3 cache per core, the setting for optimum throughput flips from 1 task/core, HT unused (i.e.: 4 single-threaded tasks at once on this 4C/8T processor), to 1 task per 2 cores, HT used (i.e.: two 4-threaded tasks at once on this processor).
The idea to leave a bit of room on this 4-core processor by running only 3 tasks is not entirely unfounded. But in the end, it is not the best compromise between utilization of execution units and cache utilization. This idea has yet to be tested on high-core-count server CPUs.
The phenomenon that HyperThreading suddenly turns from detrimental to beneficial warrants testing on CPUs with different HT/SMT implementations, different cache performance, different prefetching heuristics, etc.
Let's assume the operator is lazy and is not switching from 4 tasks x 1 thread/task for a while, even though bigger WUs are already sent to the host. The host will no longer run at the optimum, but will still be very near the optimum --- even with 2.25 MB sized tasks on this 2.0 MB L3$/core processor. But with 2.5 MB sized tasks, that 4x1 setting is falling behind the optimum more noticeably.
My tests of the small October 4 WU to a mere 20 % completion agree very well with the earlier tests that went to 100 % completion. This supports my claim that running a test just once is already highly accurate on an otherwise idle Linux system. It also indicates that it is probably possible to cut the tests off at an even lower percentage without notable loss of accuracy.

I haven't run similar tests on E5 v4 yet. Maybe I still will, maybe not. To them, 2.5 MB should be an inflection point similar to the 2.0 MB point on the desktop CPU.
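The rule of thumb from the first conclusion can be sketched as follows. Two assumptions are baked in: the FFT data size is taken as FFT length × 8 bytes (which matches the sizes quoted in the table, e.g. 240K → 1.875 MB), and the flip point is "FFT data size reaches L3 cache per core", which so far is only observed on this one i7-7700K. Both function names are hypothetical.

```bash
# fft_data_mb FFT_LENGTH_K
# FFT data size in MB, assuming 8 bytes per FFT element.
fft_data_mb () {
    awk -v k="$1" 'BEGIN { printf "%.3f\n", k * 8 / 1024 }'
}

# suggest_config FFT_LENGTH_K L3_MB CORES
# Applies the observed rule: below L3-per-core, one single-threaded task per
# core; at or above it, fewer multi-threaded tasks with HT in use.
suggest_config () {
    awk -v k="$1" -v l3="$2" -v cores="$3" 'BEGIN {
        data = k * 8 / 1024
        per_core = l3 / cores
        if (data < per_core) print "single-threaded tasks, one per core"
        else                 print "multi-threaded tasks, fewer at a time"
    }'
}

fft_data_mb 240           # prints 1.875
suggest_config 240 8 4    # i7-7700K: 1.875 MB < 2.0 MB per core
suggest_config 288 8 4    # 2.25 MB >= 2.0 MB per core
```

Treat the output as a starting point for testing, not as a measured optimum.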

StefanR5R · Nov 2, 2019

PrimeGrid 321-LLR v8.04

Before I forget --- here is a copy&paste of results from October.
Conclusions:

With these large WUs, LLR scales reasonably to program thread counts of 8 per task or even more.
Best throughput is with HyperThreading in use, on iterations of Skylake and Haswell-EP at least.
Best throughput is with configs which make good use of the L3 cache and all (or nearly all) available hardware threads.
An odd number of concurrent tasks doesn't work so well on dual socket computers.
There is a PPD advantage of 3*2^n-1 over 3*2^n+1 (seen with just one WU of each type), but it amounts to less than 10 %.
I ran almost all tests to only 10 % completion. This gave slightly conservative results compared to tests running to 100 % completion.

The work units had 800k FFT length = 6.25 MB FFT data size.

Kaby Lake i7-7700K (4C/8T, 8 MB L3 cache) --- llr321_327835846 ( = testing 3*2^15085929+1), 5,668.37 credit

concurrent tasks	threads/task	completion	run time (s)	tasks/day	credits/task	kPPD	remark
1	4	10%	1135	7.61	5668.37	43
1	5	10%	1135	7.61	5668.37	43
1	6	10%	1090	7.93	5668.37	45
1	7	10%	1065	8.11	5668.37	46
1	8	10%	1040	8.31	5668.37	47	best
1	8	100%	10283	8.40	5668.37	48	for validation

Kaby Lake i7-7700K (4C/8T, 8 MB L3 cache) --- llr321_327835971 ( = testing 3*2^15086740-1), 5,689.43 credit

concurrent tasks	threads/task	completion	run time (s)	tasks/day	credits/task	kPPD	remark
1	4	10%	1025	8.43	5689.43	48
1	5	10%	1025	8.43	5689.43	48
1	6	10%	995	8.68	5689.43	49
1	7	10%	965	8.95	5689.43	51
1	8	10%	945	9.14	5689.43	52	best
1	8	100%	9379	9.21	5689.43	52	for validation

Broadwell-EP dual E5-2690 v4 (2x 14C/28T, 2x 35 MB L3 cache) --- llr321_327835971

HT on but not used (or at least not much)

concurrent tasks	threads/task	completion	run time (s)	tasks/day	credits/task	kPPD	remark
1	28	10%	955	9.0	5689.43	51
2	14	10%	715	24.2	5689.43	138
3	9	10%	1095	23.7	5689.43	135
4	7	10%	1120	30.9	5689.43	176
5	6	10%	1545	28.0	5689.43	159
6	5	10%	1560	33.2	5689.43	189
7	4	10%	2076	29.1	5689.43	166
8	3	10%	2331	29.7	5689.43	169
9	3	10%	2536	30.7	5689.43	174
10	2	10%	3422	25.2	5689.43	144
11	2	10%	4362	21.8	5689.43	124
12	2	10%	3692	28.1	5689.43	160

HT used (but not over-used, i.e. some hardware threads remain unused in some tests)

concurrent tasks	threads/task	completion	run time (s)	tasks/day	credits/task	kPPD	remark
1	56	10%	1375	6.3	5689.43	36
2	28	10%	830	20.8	5689.43	118
3	18	10%	1055	24.6	5689.43	140
4	14	10%	1005	34.4	5689.43	196
5	11	10%	1315	32.9	5689.43	187
6	9	10%	1401	37.0	5689.43	211
7	8	10%	1701	35.6	5689.43	202
8	7	10%	1831	37.7	5689.43	215	best
9	6	10%	2236	34.8	5689.43	198
10	5	10%	2386	36.2	5689.43	206
11	5	10%	3008	31.6	5689.43	180
12	4	10%	3258	31.8	5689.43	181
13	4	10%	4035	27.8	5689.43	158
14	4	10%	4511	26.8	5689.43	153

Broadwell-EP dual E5-2696 v4 (2x 22C/44T, 2x 55 MB L3 cache) --- llr321_327835971

HT on but not used (or at least not much)

concurrent tasks	threads/task	completion	run time (s)	tasks/day	credits/task	kPPD	remark
1	44	10%	1310	6.6	5689.43	38
2	22	10%	920	18.8	5689.43	107
3	14	10%	990	26.4	5689.43	150
4	11	10%	980	28.8	5689.43	164
5	9	10%	1200	35.6	5689.43	202
6	7	10%	1215	32.9	5689.43	187
7	6	10%	1575	37.4	5689.43	213
8	5	10%	1616	36.6	5689.43	208
9	5	10%	1891	39.2	5689.43	223
10	4	10%	1986	38.5	5689.43	219
11	4	10%	2246	36.4	5689.43	207
12	3	10%	2611	37.6	5689.43	214
13	3	10%	2761	42.7	5689.43	243
14	3	10%	2631	40.2	5689.43	228
15	3	10%	3012	43.0	5689.43	245
16	2	10%	3872	35.7	5689.43	203

HT used (but not over-used, i.e. some hardware threads remain unused in some tests)

concurrent tasks	threads/task	completion	run time (s)	tasks/day	credits/task	kPPD	remark
1	88	10%	1881	4.6	5689.43	26
2	44	10%	1160	14.9	5689.43	85
3	29	10%	1220	21.2	5689.43	121
4	22	10%	875	39.5	5689.43	225
5	17	10%	1145	37.7	5689.43	215
6	14	10%	1085	47.8	5689.43	272
7	12	10%	1340	45.1	5689.43	257
8	11	10%	1451	47.6	5689.43	271
9	9	10%	1666	46.7	5689.43	266
10	8	10%	1701	50.8	5689.43	289
11	8	10%	1891	50.3	5689.43	286
12	7	10%	1971	52.6	5689.43	299	best
13	6	10%	2336	48.1	5689.43	274
14	6	10%	2327	52.0	5689.43	296
15	5	10%	2922	44.4	5689.43	252
16	5	10%	2857	48.4	5689.43	275

crashtech · Nov 25, 2020

Bump because I think this is valuable!

Unfortunately, nothing about Woodall or Cullen which are the current race choices.

StefanR5R · Nov 26, 2020

I am leaving a copy of these quick checks here:

StefanR5R said:
A quick and superficial look at the PG web site shows me that llrCUL is now using about 2 M long FFTs on FMA3 supporting hardware like Ryzen 3000. This means circa 16 MByte footprint of the hottest program data, per each task.

Ryzen 3600 has got 2 (two) core complexes. Each core complex has got 3 cores/ 6 threads and 16 MByte level 3 cache.

StefanR5R said:
llrCUL still seems to be at 2M. llrWOO may be at 2340K now (this is 18 MByte size) and thus a bit much for Zen2.

Hence, unless proven otherwise, better deselect llrWOO for Zen2 based computers. Run one llrCUL task per core complex.

On laptop and desktop class Intel CPUs, all is lost. Run only a single task at a time and watch it waiting on accesses to main memory. Or if you are in need of some excitement, go watch some paint drying instead.

On Haswell-E/EP and Broadwell-E/EP, run one llrCUL or llrWOO task per each ~18MB of L3$. E.g. 3 at a time on the 22-core E5 v4 which has 55 MB unified L3$.
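As a rough sketch of this "one task per so-many MB of L3$" sizing: divide the cache that is shared by the participating cores by the per-task footprint and round down, but never go below one task. The function name is made up, and whole-MB integer inputs are an assumption to keep the shell arithmetic simple.

```bash
# tasks_by_cache L3_MB FOOTPRINT_MB   (whole numbers of MB)
# Number of concurrent tasks whose hot data fits into the given cache.
tasks_by_cache () {
    n=$(( $1 / $2 ))
    if [ "$n" -lt 1 ]; then n=1; fi
    echo "$n"
}

tasks_by_cache 55 18    # 22-core E5 v4, 55 MB unified L3$: prints 3
tasks_by_cache 16 16    # one Zen2 core complex, llrCUL at 2M FFT: prints 1
```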

On Skylake-X/SP and its refreshes, FFT sizes are a little different (because the application needs to use 512-bit wide vectors there, otherwise half of the processor's vector units would be inaccessible to the application) and last level cache policy is different from BDW and earlier. I don't have one of these processors.

StefanR5R · Jan 10, 2021

Testing the LLR application outside of Boinc
Update from post #44: Now based on LLR2.

Preface
Results which I posted early in this thread came from running many random WUs (a sufficient number for each configuration to be tested), adding their run times and credits, and thus arriving at PPD of a given configuration. Benefits of this method are that you get credit for all CPU hours expended, might even find primes, and may do this testing even during a PrimeGrid challenge. The downsides are imprecision due to variability between WUs, and lack of repeatability.

Alternatively, the LLR application can be run stand-alone without boinc and PrimeGrid's llr_wrapper. That way, you can feed the very same WU to LLR over and over, thus eliminate variability between WUs, and can directly compare results from a single run on each configuration to be tested. (In turn, the downside of this method is that your CPUs don't earn any boinc credit on the side while testing.)

How the offline test works in principle
I didn't go ask how to run LLR stand-alone; I merely watched how boinc-client + llr_wrapper run it and replicated that. Hence my recipe may contain some bits of cargo cult. File names and scripts are shown for Linux here, but the method can be replicated on Windows as well.

1. Use boinc-client to run a single WU of the PrimeGrid LLR-based subproject that you want to test for. Watch which program binary it runs, and which input file it feeds into the LLR application.
2. Optionally: Later, when PrimeGrid has validated the WU, make a note of the credit that was given for the WU.
3. Create a working directory for testing. Copy the following files into this directory, and create a test script within the directory as shown below.
4. Choose combinations of the number of concurrent tasks and the number of threads per task which interest you, and edit the end section of the script accordingly.
5. Start the script and wait for the tests to complete.
6. Look through the log which the script wrote and make sense of the run times which are reported in it.

A note on steps 1 and 2: Instead of taking the input data from a real workunit, you can also directly specify a candidate number of your choice. E.g., look up which parameters a given PrimeGrid subproject is currently working on (Subproject status) and pick a k and n combination from there. However, you won't know the credits for this workunit then.

Required files
First, make a subdirectory in which the tests shall be executed.

Bash:

mkdir ~/PrimeGrid_Tests
cd ~/PrimeGrid_Tests

Get the LLR2 program. At the time of this writing, sllr2_1.1.0_linux64_201114 was current on x86-64 Linux. You may have it already in your boinc data directory:

Bash:

cp -p /var/lib/boinc/projects/www.primegrid.com/sllr2_1.1.0_linux64_201114 .

If boinc does not have this file yet, download it:

Bash:

wget https://www.primegrid.com/download/sllr2_1.1.0_linux64_201114.gz
gunzip sllr2_1.1.0_linux64_201114.gz
chmod +x sllr2_1.1.0_linux64_201114

Copy the input file as noted in the previous subsection from the PrimeGrid project directory within boinc's data directory. Such files have names like "llrDIV_694839913". — This is not required if you enter the formula with the k and n to test directly into the following script.

Create the script file.

Bash:

cat >_run_llr2.sh <<'EOF'
#!/bin/bash

# (Don't edit, unless PrimeGrid updates this.)
# The LLR2 program, copied from boinc's projects/www.primegrid.com/ subdirectory,
# or downloaded from https://www.primegrid.com/download/sllr2_1.1.0_linux64_201114.gz
# and then gunzip'ed.
LLREXE="sllr2_1.1.0_linux64_201114"

# (Don't edit, unless PrimeGrid updates this.)
LLROPTIONS=\
"-oGerbicz=1 -oProofName=proof -oProofCount=64 -oProductName=prod "\
"-oPietrzak=1 -oCachePoints=1 -pSavePoints -d -oDiskWriteTime=10"

# Edit this:
# The input can be given either as a filename,
# or as an expression in the form of "k*b^n+c" or "b^n-b^m+c".
LLRINPUT="11*2^7811487+1"       # llrDIV, FFT length 480K (FFT data size = 3.75 MB), 1,534.49 cobblestones

# Edit this:
# Choose a file name of the log.
LOGFILE="$(hostname)_llrDIV_480K_${LLRINPUT}_protocol.txt"

# Edit this:
# Set to 1 if boinc-client shall be suspended and resumed before/ after the tests.
SUSPEND_RESUME_BOINC=0

# (Don't edit.)
TIMEFORMAT=$'\nreal\t%1lR\t(%0R s)\nuser\t%1lU\t(%0U s)\nsys\t%1lS\t(%0S s)'

# (Don't edit.)
# run_one - run a single LLR process, and show timing information when finished
#
#     argument 1, mandatory: slot number, i.e. unique name of the instance
#     argument 2, mandatory: thread count of the process
#     argument 3, optional: completion percentage at which to terminate a test
#
run_one () {
        SLOT="slot_$1"
        rm -rf ${SLOT}
        for ((;;))
        do
                mkdir ${SLOT}                    || break
                ln    ${LLREXE} ${SLOT}/llr.exe  || break
                cd ${SLOT}                       || break

                [ -f "../${LLRINPUT}" ] && i="../${LLRINPUT}" || i="-q${LLRINPUT}"

                echo "---- slot $1 ----" > stdout
                if [ -z "$3" ]
                then
                        time ./llr.exe ${LLROPTIONS} -t"$2" "$i" >> stdout 2> stderr
                else
                        ./llr.exe ${LLROPTIONS} -t"$2" "$i" >> stdout 2> stderr &
                        LLRPID=$!
                        while sleep 5
                        do
                                tail -1 stdout | grep -e "[[]$3[.]" > /dev/null && break
                        done
                        kill ${LLRPID}
                        wait ${LLRPID} 2> /dev/null
                fi
                cat stdout stderr
                cd ..
                break
        done
        rm -rf ${SLOT}
}

# (Don't edit.)
# run_series - run one or more LLR processes in parallel, and log everything
#
#     argument 1, mandatory: number of processes to run at once
#     argument 2, mandatory: thread count of each process
#     argument 3, optional: completion percentage at which to terminate a test
#
# stdout and stderr are appended into ${LOGFILE}.
#
run_series () {
        {
                echo "======== $(date) ======== starting $1 process(es) with $2 thread(s) ========"
                time {
                        for (( s=1; s<=$1; s++ ))
                        do
                                run_one $s $2 $3 &
                        done
                        wait
                }
                echo "======== $(date) ======= done with $1 process(es) with $2 thread(s) ========"
                echo
        } 2>&1 | tee -a "${LOGFILE}"
}

# Edit the passwd part if necessary.
((SUSPEND_RESUME_BOINC)) && boinccmd --passwd "$(< /var/lib/boinc/gui_rpc_auth.cfg)" --set_run_mode never

# Edit this:
# Choose your set of tests here.

# SMT on but not used
run_series  4  1  10
run_series  2  2  10
run_series  1  4  10

# SMT used
run_series  8  1  10
run_series  4  2  10
run_series  2  4  10
run_series  1  8  10

# Edit the passwd part if necessary.
((SUSPEND_RESUME_BOINC)) && boinccmd --passwd "$(< /var/lib/boinc/gui_rpc_auth.cfg)" --set_run_mode auto
EOF
chmod +x _run_llr2.sh

Edit this script file in your favorite text editor, as desired for the particular tests which you plan to run.
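If you don't want to type the test set by hand, the run_series lines can be generated. This is a minimal sketch that lists every tasks × threads split which uses exactly the given number of hardware threads; print_test_matrix is a hypothetical helper and not part of the script above. Run it with the physical core count for the "SMT on but not used" set and with the hardware thread count for the "SMT used" set.

```bash
# print_test_matrix HARDWARE_THREADS CUTOFF_PERCENT
# Prints one run_series line per divisor of HARDWARE_THREADS, most tasks first.
print_test_matrix () {
    total=$1
    pct=$2
    tasks=$total
    while [ "$tasks" -ge 1 ]; do
        if [ $(( total % tasks )) -eq 0 ]; then
            printf 'run_series  %d  %d  %d\n' "$tasks" "$(( total / tasks ))" "$pct"
        fi
        tasks=$(( tasks - 1 ))
    done
}

print_test_matrix 8 10
```

For 8 hardware threads this prints the same four lines as the "SMT used" section of the script; paste the output into the end section of _run_llr2.sh.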

Optionally: Create another script which can later be used to see to which percentage a currently running test has advanced. This script needs to be run in a separate terminal.

Bash:

cat >_show_progress.sh <<'EOF'
#!/bin/bash

SLOTS=$(echo slot_*)

if [ "${SLOTS}" = 'slot_*' ]
then
        echo "No processes found."
else
        for s in ${SLOTS}
        do
                echo "---- ${s} ----"
                tail -1 ${s}/stdout
                echo
        done
fi
EOF
chmod +x _show_progress.sh

When done, start the launcher script:
./_run_llr2.sh

The optional script to show the point of progress is started like this in another terminal:
cd ~/PrimeGrid_Tests
./_show_progress.sh

Summary
Get hold of the LLR2 executable, and copy the _run_llr2.sh script from above. Edit the script as advised for a specific test. And off you go.

StefanR5R · Jan 10, 2021

PPS-DIV v9.01

Tested workunit: 11*2^7811487+1
FFT length is 480K (FFT data size = 3.75 MB), gives 1,534.49 credit.

Each test was run until 10.11 % completion, and the test durations in seconds are reported for that point. (Edit: In multi-tasking tests, I am reporting the duration of the longest task here. The other tasks finished mere seconds earlier than the last one.) Tasks/day and PPD are linearly extrapolated to 100 % completion. Power consumption is "at the wall".
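
For anyone who wants to redo the extrapolation with their own numbers, here is a minimal sketch of the arithmetic described above. `extrapolate` is a hypothetical helper name. It assumes linear scaling from the partial run to 100 % completion, and it takes the credit per task and the wall power as plain inputs.

```shell
# Hypothetical helper: extrapolate a partial run to tasks/day, kPPD, kPPD/W.
#   $1 number of concurrent tasks
#   $2 duration of the test in seconds
#   $3 completion percentage at which the test was stopped
#   $4 credit per task
#   $5 power draw at the wall in W
extrapolate () {
        awk -v n="$1" -v t="$2" -v pct="$3" -v credit="$4" -v watts="$5" 'BEGIN {
                full = t / (pct / 100)            # projected seconds for 100 %
                tasks_per_day = n * 86400 / full
                kppd = tasks_per_day * credit / 1000
                printf "%.1f tasks/day, %.1f kPPD, %.3f kPPD/W\n",
                        tasks_per_day, kppd, kppd / watts
        }'
}

# 2 concurrent tasks with 4 threads each on the i7-7700K @3.4 GHz:
extrapolate 2 765 10.11 1534.49 106
# prints: 22.8 tasks/day, 35.0 kPPD, 0.331 kPPD/W
```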

Kaby Lake i7-7700K (4C/8T, 8 MB L3 cache) @3.4 GHz (turbo boost disabled in the BIOS) with DDR4-3200c14 dual-channel RAM, two idle Pascal GPUs in the system (48 W system power consumption when idle), Linux Mint 20, kernel 5.4 NO_HZ PREEMPT_VOLUNTARY, idle Cinnamon desktop.

1 task with 4 threads: 420 s, 93 W:
20.8 tasks/day, 31.9 kPPD, 0.343 kPPD/W <-- best power efficiency
2 concurrent tasks with 2 threads each: 805 s, 99 W:
21.7 tasks/day, 33.3 kPPD, 0.336 kPPD/W
4 concurrent tasks with 1 thread each: 1656 s, 103 W:
21.1 tasks/day, 32.4 kPPD, 0.314 kPPD/W <-- exceeds cache size

1 task with 8 threads: 415 s, 97 W:
21.0 tasks/day, 32.3 kPPD, 0.333 kPPD/W
2 concurrent tasks with 4 threads each: 765 s, 106 W:
22.8 tasks/day, 35.0 kPPD, 0.331 kPPD/W <-- best throughput
4 concurrent tasks with 2 threads each: 1747 s, 105 W:
20.0 tasks/day, 30.7 kPPD, 0.292 kPPD/W <-- exceeds cache size
8 concurrent tasks with 1 thread each: 3721 s, 103 W:
18.8 tasks/day, 28.8 kPPD, 0.280 kPPD/W <-- exceeds cache size

Kaby Lake i7-7700K @4.4 GHz, otherwise same as above.

1 task with 4 threads: 325 s, 146 W:
26.8 tasks/day, 41.2 kPPD, 0.282 kPPD/W <-- best power efficiency
2 concurrent tasks with 2 threads each: 620 s, 155 W:
28.2 tasks/day, 43.2 kPPD, 0.279 kPPD/W
4 concurrent tasks with 1 thread each: 1521 s, 148 W:
23.0 tasks/day, 35.3 kPPD, 0.238 kPPD/W <-- exceeds cache size

1 task with 8 threads: 320 s, 155 W:
27.3 tasks/day, 41.9 kPPD, 0.270 kPPD/W
2 concurrent tasks with 4 threads each: 605 s, 166 W:
28.9 tasks/day, 44.3 kPPD, 0.267 kPPD/W <-- best throughput
4 concurrent tasks with 2 threads each: 1677 s, 151 W:
20.8 tasks/day, 32.0 kPPD, 0.212 kPPD/W <-- exceeds cache size
8 concurrent tasks with 1 thread each: 3641 s, 147 W:
19.2 tasks/day, 29.5 kPPD, 0.200 kPPD/W <-- exceeds cache size

EPYC Rome dual 7452 (2× 32C/64T, 2× 128 MB L3 cache) or (16× 4C/8T, 16× 16 MB L3 cache) @ 2× 155 W PPT, 2×8-channel DDR4-3200c22, headless, ≈100 W system power consumption when idle, OpenSuse Leap 15.2, kernel 5.3.18 NO_HZ PREEMPT_VOLUNTARY, display-manager service stopped

8 tasks with 8 threads: 500 s, 355 W:
140 tasks/day, 214 kPPD, 0.60 kPPD/W
16 tasks with 4 threads: 790 s, 360 W:
177 tasks/day, 271 kPPD, 0.75 kPPD/W
32 tasks with 2 threads: 1206 s, 345 W:
232 tasks/day, 356 kPPD, 1.03 kPPD/W
64 tasks with 1 thread: 2232 s, 365 W:
250 tasks/day, 384 kPPD, 1.05 kPPD/W <-- best throughput

8 tasks with 16 threads: 1050 s, 305 W:
67 tasks/day, 102 kPPD, 0.33 kPPD/W <-- incurs traffic across CCXs
16 tasks with 8 threads: 570 s, 340 W:
245 tasks/day, 376 kPPD, 1.11 kPPD/W <-- best power efficiency
32 tasks with 4 threads: 1287 s, 365 W:
217 tasks/day, 333 kPPD, 0.91 kPPD/W
64 tasks with 2 threads: 2293 s, 380 W:
244 tasks/day, 374 kPPD, 0.98 kPPD/W
128 tasks with 1 thread: 7073 s, 400 W:
158 tasks/day, 243 kPPD, 0.61 kPPD/W <-- exceeds cache size

crashtech · Jan 10, 2021

The time figure you use in these calculations from the output file is user + sys?

StefanR5R · Jan 11, 2021

No, I use "real". In boinc's terminology, this is "run time".
(User + sys would be CPU time in boinc.)

Apropos: The script logs real/user/sys time for each task in a test, and for the test as a whole. (Edit: That's if the tasks are carried out fully. If the test is set up to terminate the tasks at less than 100 % completion, only the whole duration of the test is logged.) To obtain the test durations in #92, I ~~was lazy and~~ merely reported the "real" time of the respective test as a whole. In a multi-tasked test, some tasks may randomly finish a little earlier than others. The real time of the test is then equal to the real time of the task which took the longest. (User + sys of the whole test are the sum of user + sys of all tasks in the test.) Instead of the longest run time, the average or median of the individual task run times would be a somewhat better metric. That's still not perfect, because the tasks which take longer benefit from overall lighter system load towards the very end of the test when fewer tasks remain running. But the error of measurement due to this seems to be generally small to me.
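
To put numbers on longest vs. average vs. median, something like the following could be used. It is only a sketch: `runtime_stats` is a made-up helper, and it assumes that you have already collected the per-task "real" times in whole seconds (the script itself does not compute these statistics).

```shell
# Hypothetical helper: summarize per-task "real" times given in seconds.
# Prints the longest, mean and median run time.
runtime_stats () {
        printf '%s\n' "$@" | sort -n | awk '
                { v[NR] = $1; sum += $1 }
                END {
                        # Middle value for odd counts, mean of the two
                        # middle values for even counts.
                        if (NR % 2)
                                median = v[(NR + 1) / 2]
                        else
                                median = (v[NR / 2] + v[NR / 2 + 1]) / 2
                        printf "longest %d s, mean %.1f s, median %.1f s\n",
                                v[NR], sum / NR, median
                }'
}

# Example with two per-task real times from a 2-process test:
runtime_stats 4392 4475
# prints: longest 4475 s, mean 4433.5 s, median 4433.5 s
```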

Furthermore, if the test is set up to run to less than 100 % completion (e.g. in #92 I stopped at 10 %, as noted), then the timing is at 5 s granularity. This is because the script happens to check for progress percentage every 5 seconds; this can be trivially changed in the script. OTOH, tests should generally be configured to run reasonably long, therefore this is not really important.
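
For illustration, this is roughly what such a progress check with an adjustable interval looks like. It is a sketch, not the loop from the launcher script: the function names are made up, and only the grep pattern is modeled on the script's `tail -1 stdout | grep -e "[[]$3[.]"` line.

```shell
# reached_percent FILE PERCENT
# Succeeds if the last line of FILE contains "[PERCENT." as in "[10.11%]".
reached_percent () {
        tail -n 1 "$1" | grep -q -e "[[]$2[.]"
}

# wait_for_percent FILE PERCENT INTERVAL
# Polls every INTERVAL seconds; the timing granularity of a partial test
# is therefore INTERVAL seconds.
wait_for_percent () {
        until reached_percent "$1" "$2"
        do
                sleep "$3"
        done
}
```

A call like `wait_for_percent stdout 10 1` would give 1 s granularity instead of 5 s. Note that this sketch waits forever if the process exits before it prints the requested percentage, so a real loop also needs to check whether the process is still alive.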

StefanR5R · Jan 11, 2021

I added i7-7700K @4.4 GHz results to #92.

As soon as the workload exceeds the CPU cache size, throughput of the 4.4 GHz CPU goes down to almost the same level as the 3.4 GHz CPU which has the same main memory performance.

Clock ratio: 4.4 GHz ÷ 3.4 GHz = 1.29
Throughput ratio in the best throughput config: 44.3 kPPD ÷ 35.0 kPPD = 1.27
Throughput ratio in the 4-tasks config: 35.3 kPPD ÷ 32.4 kPPD = 1.09
Throughput ratio in the 8-tasks config: 29.5 kPPD ÷ 28.8 kPPD = 1.02

On another note, the 4.4 GHz CPU's throughput of 1.27 times that of the 3.4 GHz CPU comes at a price of 1.57 times the power draw, and 1.24 times the energy consumption to accomplish the given work. And that's even though I included the entire system overhead into the measurement (2 idle GPUs, a watercooling pump, six fans, et cetera).
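
The energy figure follows from the other two: energy per task scales with power divided by throughput. A small sketch of that arithmetic, using the best-throughput configuration (2 tasks with 4 threads each) from #92; `compare_points` is a made-up helper name.

```shell
# Hypothetical helper: compare two operating points of the same CPU.
#   $1 power (W) and $2 throughput (kPPD) of the faster setting
#   $3 power (W) and $4 throughput (kPPD) of the slower setting
compare_points () {
        awk -v p1="$1" -v t1="$2" -v p0="$3" -v t0="$4" 'BEGIN {
                power = p1 / p0
                throughput = t1 / t0
                # Energy per unit of work scales with power / throughput.
                printf "throughput x%.2f, power x%.2f, energy per task x%.2f\n",
                        throughput, power, power / throughput
        }'
}

# 4.4 GHz: 166 W, 44.3 kPPD; 3.4 GHz: 106 W, 35.0 kPPD
compare_points 166 44.3 106 35.0
# prints: throughput x1.27, power x1.57, energy per task x1.24
```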

crashtech · Jan 11, 2021

I ran your script with only the task and threads modified on a 3700X and got some results, but every time I tried to run either 8 tasks 2 threads, or 2 tasks 8 threads, there would be "Gerbicz check failed" errors. The other permutations I tried passed without problems, the current best looking like 8 tasks of 1 thread each.

Edit: For the record, this test is picking up an instability that Memtest is not able to find. Perhaps the CPU is unstable at its current settings, though I know it's not overclocked...

StefanR5R · Jan 12, 2021

So far I tested the script after its LLR2 update on Zen2 only very quickly, and I didn't see anything unexpected. I'll post some Zen2 results eventually.

"Gerbicz check failed" of course means that there were miscalculations. Anecdotally, the most frequent source of such miscalculations are RAM overclocks, or more generally speaking, unstable RAM settings.

biodoc · Jan 12, 2021

@StefanR5R , why is there an "extra" set of results for each test? This set is 2 tasks with 4 threads each (SMT on but not used). Is the top set an estimate or an actual run?

Code:

======== Mon 11 Jan 2021 03:23:09 PM EST ======== starting 2 process(es) with 4 thread(s) ========

real    73m12.9s        (4392 s)
user    281m18.2s       (16878 s)
sys     2m22.0s (142 s)
---- slot 1 ----
Starting Proth prime test of 21*2^7823737+1
Using all-complex FMA3 FFT length 480K, Pass1=640, Pass2=768, clm=1, 4 threads, a = 5, L2 = 366*334, M = 122245
21*2^7823737+1, bit: 10000 / 7823736 [0.12%], 0 checked.  Time per bit: 0.635 ms.^M21*2^7823737+1, bit: 20000 / 7823736 [0.25%], 0 checked.  Time per bit: 0.>
                                                                                                   ^M21*2^7823737+1 is not prime.  Proth RES64: 55737EBBC1BFA>

real    74m35.8s        (4475 s)
user    286m20.9s       (17180 s)
sys     2m25.7s (145 s)
---- slot 2 ----
Starting Proth prime test of 21*2^7823737+1
Using all-complex FMA3 FFT length 480K, Pass1=640, Pass2=768, clm=1, 4 threads, a = 5, L2 = 366*334, M = 122245
21*2^7823737+1, bit: 10000 / 7823736 [0.12%], 0 checked.  Time per bit: 0.611 ms.^M21*2^7823737+1, bit: 20000 / 7823736 [0.25%], 0 checked.  Time per bit: 0.>
                                                                                                   ^M21*2^7823737+1 is not prime.  Proth RES64: 55737EBBC1BFA>

real    74m35.8s        (4475 s)
user    567m39.1s       (34059 s)
sys     4m47.7s (287 s)
======== Mon 11 Jan 2021 04:37:45 PM EST ======= done with 2 process(es) with 4 thread(s) ========

EDIT: I looked at the script and it appears to me (script idiot) that 1 process with 4 threads is run first, then 2 processes with 4 threads are run in parallel.

StefanR5R · Jan 12, 2021

The log which you posted is from a test which is configured as follows:

Test the candidate 21*2^7823737+1.
Run two LLR instances in parallel.
Use 4 threads per instance.
Keep each LLR instance running until it exits on its own. In other words, run until 100 % completion; do not kill these instances earlier at less than 100 % completion.

The two instances run in throwaway subdirectories named slot_1 and slot_2. These subdirectories are deleted as soon as the respective instance exits. Before deletion, stdout and stderr from this slot directory are appended to the log file.

In the log, the first real:user:sys time stats relate to the first LLR instance. The second real:user:sys time stats relate to the second LLR instance. That is, differently from what one would expect from just looking at the script code, the time stats make it into the log before the corresponding "---- slot i ----" divider, rather than after. It's weird, and there are probably ways to order it differently, but I haven't looked into it further.

The last real:user:sys time stats right before the terminating "======== date ======= done with n processes… ========" line are the total stats over both instances. As one would expect, the total real time is the same as the real time of the instance which took longest, while the total user and sys times are the sum of the respective times of all concurrent instances.
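
If you only want the run times out of such a log, the "real" lines can be filtered. This is a sketch under the assumption that the log uses the time format shown in the posted excerpt, with the seconds in parentheses; `real_seconds` is a made-up helper name.

```shell
# Hypothetical helper: print the "(N s)" value of every "real" line.
# Reads the named log file, or standard input if no file is given.
# Assumes lines like:  real    73m12.9s        (4392 s)
real_seconds () {
        sed -n 's/^real.*(\([0-9][0-9]*\) s).*$/\1/p' "$@"
}

# Example with the three "real" lines of a 2-process test:
real_seconds <<'LOG'
real    73m12.9s        (4392 s)
user    281m18.2s       (16878 s)
---- slot 1 ----
real    74m35.8s        (4475 s)
---- slot 2 ----
real    74m35.8s        (4475 s)
LOG
# prints 4392, 4475 and 4475 on separate lines
```

Within one test, the last value printed is the whole-test total and the ones before it are the individual instances.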

biodoc · Jan 12, 2021

Here are the PPS-Div results from a 3700X (PPT set at 65 watts) using the script from @StefanR5R . The tests were run to 100% completion. I took the LLR input and cobblestones from a task that was completed and validated.
llrDIV, FFT length 480K (FFT data size = 3.75 MB), 1,536.90 cobblestones
LLRINPUT="21*2^7823737+1"

Table 1. SMT on but not used

# processes	# threads	run time (sec)	PPD
4	2	8116	65,445
2	4	4475	59,347
1	8	2978	44,590

Table 2. SMT on and used

# processes	# threads	run time (sec)	PPD
8	2	14523	73,146
4	4	7380	71,972
2	8	3832	69,305
