PrimeGrid: CPU benchmarks


Tejas_2

Member
Jun 8, 2018
31
43
61
Both applications use George Woltman's gwnum library, the best-optimized GMP-style library for arbitrary-precision arithmetic (operating on signed integers, rational numbers, and floating-point numbers) available. The FFT lengths depend on factors unique to each prime candidate (for example, the prime base, the exponent, and the type of prime being searched for) as well as on the CPU architecture.
 

Ken g6

Programming Moderator, Elite Member
Moderator
Dec 11, 1999
16,220
3,801
75
Yeah, the best way to test LLR is to run the real LLR app with a real input file from PrimeGrid.
 

biodoc

Diamond Member
Dec 29, 2005
6,257
2,238
136
Yeah, the best way to test LLR is to run the real LLR app with a real input file from PrimeGrid.

I agree.

In the middle of the race (Sierpinski/Riesel Base 5 (LLR) application), I tested two other options on my dual-socket E5-2690 v2 (Ivy Bridge EP; 10 cores per socket) with HT off, running 64-bit Linux as the OS. The data were collected using 4 x 5t settings (4 concurrent tasks, 5 threads each) in app_config.xml.
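
A 4 x 5t setup like this is normally expressed in app_config.xml. A minimal sketch (assuming the llrSR5 app name given later in this thread; max_concurrent and -t are the knobs being varied):
Code:
<app_config>
   <app>
      <name>llrSR5</name>
      <max_concurrent>4</max_concurrent>   <!-- 4 tasks at once -->
   </app>
   <app_version>
      <app_name>llrSR5</app_name>
      <cmdline>-t 5</cmdline>              <!-- 5 threads per task -->
      <avg_ncpus>5</avg_ncpus>
   </app_version>
</app_config>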

Method 1: BOINC client run times. Queue up 20+ WUs or so and turn off networking to get run times for multiple completed tasks as reported by the client. Average points per task are calculated by averaging a set of completed tasks on the server; these are not the same tasks for which I'm measuring run times on the client.

For an average of 20 tasks, I see 5,467 seconds per task, which works out to 103.8K PPD. For this calculation, I used 1,642.53 points per task.

Method 2: Run times reported by the server. These are collected over time from WUs downloaded to the client one at a time. In the client computing preferences I used "store at least 0.01 days of work" and "store additional 0.01 days of work".

For an average of 16 tasks, I see 6,653.69 seconds per task, which works out to 86.1K PPD. Average points per task was 1,657.47.

Clearly there is a disconnect between the run times reported by the BOINC client and by the server. FYI, the client version used was 7.9.3.
 

StefanR5R

Elite Member
Dec 10, 2016
5,459
7,718
136
Method 2: Run times reported by the server. These are collected over time from WUs downloaded to the client one at a time.
In the past, downloading only one WU per request was a sufficient workaround to avoid the run time tracking bug. But not anymore. :-( I don't know why.

Both applications use George Woltman's gwnum library,
So, for forest's method to be effective, one needs to pick a Prime95 binary which is linked against the same gwnum version as the LLR binary used by PrimeGrid, and built with the same compiler, with the same compiler flags...
 

StefanR5R

Elite Member
Dec 10, 2016
5,459
7,718
136
SR5-LLR v8.01, tests done in June with sllr64.3.8.21
(name for app_config.xml: llrSR5)

I used the test method described in post #44 with the following WU:
llrSR5_299571544 (1,678.34 cobblestones)
Code:
1000000000:M:0:5:258
322498 2527171
All tests were done on 64-bit Linux without background load.
Run times in seconds are the median per task; overall throughput, in thousands of points per day (kPPD), is per host.
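
As a quick sketch of the arithmetic behind these throughput numbers (not the actual benchmark script from post #44):
Bash:
# kPPD = (concurrent tasks) * (86400 s/day / median run time in s) * (credits per task) / 1000
credits=1678.34   # cobblestones per llrSR5_299571544 task
kppd() {
    local tasks=$1 runtime=$2
    awk -v n="$tasks" -v t="$runtime" -v c="$credits" \
        'BEGIN { printf "%.1f kPPD\n", n * (86400 / t) * c / 1000 }'
}
kppd 8 5429   # e.g. dual E5-2690v4, 8 tasks x 7 threads -> ~214 kPPD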

Kaby Lake i7-7700K @4.2 GHz (4C/8T)

2 tasks x 4 threads: 6587 s, 44.0 kPPD
1 task x 4 threads: 3162 s, 45.9 kPPD
1 task x 5 threads: 3142 s, 46.2 kPPD
1 task x 6 threads: 3022 s, 48.0 kPPD
1 task x 7 threads: 2937 s, 49.4 kPPD
1 task x 8 threads: 2813 s, 51.5 kPPD <- best​

Broadwell-EP dual E5-2690v4 @2.9 GHz (2x 14C/28T)

2 tasks x 28 threads: 2464 s, 118 kPPD
4 tasks x 14 threads: 3040 s, 191 kPPD
6 tasks x 9 threads: 4265 s, 204 kPPD
7 tasks x 8 threads: 4952 s, 205 kPPD
8 tasks x 7 threads: 5429 s, 214 kPPD <- best
9 tasks x 6 threads: 6224 s, 210 kPPD
10 tasks x 5 threads: 6990 s, 207 kPPD​

Broadwell-EP dual E5-2696v4 @2.6 GHz (2x 22C/44T)

4 tasks x 22 threads: 2748 s, 211 kPPD
5 tasks x 17 threads: 3227 s, 225 kPPD
6 tasks x 14 threads: 3336 s, 261 kPPD
7 tasks x 12 threads: 3874 s, 262 kPPD
8 tasks x 11 threads: 4098 s, 283 kPPD
9 tasks x 9 threads: 4746 s, 275 kPPD
10 tasks x 8 threads: 5120 s, 283 kPPD
11 tasks x 8 threads: 5452 s, 293 kPPD <- close second best
12 tasks x 7 threads: 5935 s, 293 kPPD <- close second best
13 tasks x 6 threads: 6631 s, 284 kPPD
14 tasks x 6 threads: 6851 s, 296 kPPD <- best​

Notes:
  • It is interesting how an odd number of tasks performs in the same ballpark as an even number of tasks on the dual-socket machines. That is, there is no pronounced penalty when the threads of at least one process have to synchronize from socket to socket via QPI.
  • Best kPPD / cores / GHz:
    i7-7700K: 3.07 <- best
    dual E5-2690v4: 2.64
    dual E5-2696v4: 2.59
    The two E5 v4s don't have the same kPPD / cores / GHz because both have the same absolute memory bandwidth, and hence the 2690v4 has higher memory bandwidth per core than the 2696v4.
  • Best kPPD / Watt:
    i7-7700K: 0.57
    dual E5-2690v4: 0.79
    dual E5-2696v4: 1.02 <- best
    Caveat: This is relative to the TDP of these processors. Their actual consumption was probably quite a bit higher, let alone what the entire system pulls at the wall. Unfortunately I did not have power meters on the individual hosts at the time.
 
Last edited:

StefanR5R

Elite Member
Dec 10, 2016
5,459
7,718
136
PPS-LLR v8.01, tests done today with sllr64.3.8.21
(name for app_config.xml: llrPPS)

I used the test method described in post #44 with the following WU:
llrPPS_303160028 (159.04 cobblestones)
Code:
1000000000:P:1:2:257
839 2610671
All tests were done on 64-bit Linux without background load.
Run times in seconds are the median per task; overall throughput, in thousands of points per day (kPPD), is per host.

Kaby Lake i7-7700K @4.2 GHz (4C/8T)
HT on but not used:
4 tasks x 1 thread: 1434 s, 38.3 kPPD <- second best
2 tasks x 2 threads: 870 s, 31.6 kPPD
1 task x 4 threads: 490 s, 28.0 kPPD​
HT on and used:
8 tasks x 1 thread: 3938 s, 27.9 kPPD
4 tasks x 2 threads: 1374 s, 40.0 kPPD <- best
2 tasks x 4 threads: 747 s, 36.8 kPPD
1 task x 8 threads: 436 s, 31.5 kPPD​

Broadwell-EP dual E5-2690v4 @2.9 GHz (2x 14C/28T)
HT on but not used:
28 tasks x 1 thread: 2336 s, 165 kPPD <- best
14 tasks x 2 threads: 1424 s, 135 kPPD​
HT on and used:
56 tasks x 1 thread: 6678 s, 115 kPPD
28 tasks x 2 threads: 2420 s, 159 kPPD <- second best
18 tasks x 3 threads: 1663 s, 149 kPPD
14 tasks x 4 threads: 1282 s, 150 kPPD
8 tasks x 7 threads: 810 s, 136 kPPD​

Broadwell-EP dual E5-2696v4 @2.6 GHz (2x 22C/44T)
HT on but not used:
44 tasks x 1 thread: 2613 s, 231 kPPD <- best
22 tasks x 2 threads: 1622 s, 186 kPPD​
HT on and used:
88 tasks x 1 thread: 10130 s, 119 kPPD
44 tasks x 2 threads: 2692 s, 225 kPPD <- second best
29 tasks x 3 threads: 1889 s, 211 kPPD
22 tasks x 4 threads: 1455 s, 208 kPPD
17 tasks x 5 threads: 1196 s, 195 kPPD
14 tasks x 6 threads: 1013 s, 190 kPPD
12 tasks x 7 threads: 904 s, 182 kPPD
11 tasks x 8 threads: 888 s, 170 kPPD​

K10 Deneb Phenom II X4 905e @2.5 GHz (4C/4T)
4 tasks x 1 thread: 8445 s, 6.5 kPPD <- best
2 tasks x 2 threads: 5555 s, 4.9 kPPD
1 task x 4 threads: 3280 s, 4.2 kPPD​

Edit:
  • On Kaby Lake, once again a pattern can be observed:
    If the number of threads per task is not configured to the optimum, HyperThreading can be detrimental.
    Conversely, if the number of threads per task is at the optimum, HyperThreading is beneficial.
  • But on Broadwell-EP, this time single-threaded tasks with HyperThreading left unused win. This is in contrast to most of the other LLR-based subprojects at PrimeGrid, which have longer run times and larger FFT lengths. (PPS-LLR on both KBL and BDW-EP: 192K FFT length; SR5-LLR: 1152K FFT length.)
  • Why does Kaby Lake still gain from HyperThreading in this application, while Broadwell-EP does not? It could be due to core architecture improvements. But I rather suspect it is because of the much higher memory bandwidth per core of the little desktop CPU compared to the big Xeons.
 
Last edited:

StefanR5R

Elite Member
Dec 10, 2016
5,459
7,718
136
I repeated the tests from post #56 (only those which gave the best PPD), and measured power consumption of the hosts "at the wall".

Kaby Lake i7-7700K @4.2 GHz (4C/8T)
power consumption:
50 W idle (includes two idle 1080Ti)
137 W when running PPS-LLR with 4 tasks x 1 threads
145 W when running PPS-LLR with 4 tasks x 2 threads​
host efficiency at PPS-LLR:
3.24 points/kJ ( = 38,300 PPD / 86,400 seconds/day / 0.137 kW) for 4 tasks x 1 thread
3.19 points/kJ ( = 40,000 PPD / 86,400 seconds/day / 0.145 kW) for 4 tasks x 2 threads​

Broadwell-EP dual E5-2690v4 @2.9 GHz (2x 14C/28T)
power consumption:
68 W idle (has some more peripherals in it than the 2696 below)
410 W when running PPS-LLR with 28 tasks x 1 thread​
host efficiency at PPS-LLR:
4.7 points/kJ ( = 165,000 PPD / 86,400 s/d / 0.410 kW)​

Broadwell-EP dual E5-2696v4 @2.6 GHz (2x 22C/44T)
power consumption:
62 W idle
432 W when running PPS-LLR with 44 tasks x 1 thread​
host efficiency at PPS-LLR:
6.2 points/kJ ( = 231,000 PPD / 86,400 s/d / 0.432 kW)​

K10 Deneb Phenom II X4 905e @2.5 GHz (4C/4T)
power consumption:
55 W idle
114 W when running PPS-LLR with 4 tasks x 1 thread​
host efficiency at PPS-LLR:
0.66 points/kJ ( = 6,500 PPD / 86,400 s/d / 0.114 kW)​
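
As a sketch, the host efficiency figures above are just the PPD divided by 86,400 seconds per day and by the measured power draw in kW:
Bash:
# points/kJ = PPD / (86400 s/day) / (power draw in kW)
awk -v ppd=38300 -v watts=137 \
    'BEGIN { printf "%.2f points/kJ\n", ppd / 86400 / (watts / 1000) }'
# i7-7700K at 4 tasks x 1 thread -> 3.24 points/kJ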
 

crashtech

Lifer
Jan 4, 2013
10,521
2,111
146
Thanks Stefan! Your results will hopefully help us come up with closer to optimal configs for this project. Under Windows with a 6C/12T Coffeelake 8700K, it looks like the benefit of using HT is even more pronounced, although interestingly I got my very best result with 2 tasks x 5 threads, with 2 x 6 being a very close second.

Best Coffeelake result in Win 10 is actually 6 tasks x 2 threads, followed closely by 4 tasks x 3 threads. Sorry for my earlier confusion! I have more results for 10C/20T x2 Ivy and 8C Ryzen, but I am going to double check them, and perhaps PM them to you before posting them here.
 
Last edited:

StefanR5R

Elite Member
Dec 10, 2016
5,459
7,718
136
I got my very best result with 2 tasks x 5 threads, with 2 x 6 being a very close second.
It surprises me that such high thread counts per task are effective on the small PPS WUs. Is the CFL perhaps running at different clocks when loaded with e.g. 6x2 versus 2x5...6?

I admit I haven't monitored my KBL closely during all the PPS runs, but when I looked (and under other AVX loads in the past too), it ran at 4.2 GHz constantly. This is because I specified a -2 AVX offset in the BIOS. Before I did that, the BIOS defaults would cause the 7700K to attempt to run at up to 4.4 GHz, and then soon throttle randomly. (It's watercooled, but not delidded, hence insulated by Intel's toothpaste. The board is an ASRock Z270M Extreme 4.)
 

crashtech

Lifer
Jan 4, 2013
10,521
2,111
146
Have you heard the one about the dyslexic that walked into a bra?
I had your numbers mixed up in my mind, so I'll be running more tests now.

Edit: See above edited post, thanks.
 
Last edited:

biodoc

Diamond Member
Dec 29, 2005
6,257
2,238
136
I have some llrPPS data for my single-socket E5-2690 v2 @ 3.3 GHz (10 cores; 20 threads). The OS is 64-bit Linux Mint 19, and the run times (average of 20) were taken from BOINC Manager with networking off. All tasks were validated and received 159.06 credits per task.

HT on in bios:

1 task per thread (20 simultaneous tasks): 7050 seconds per task = 245 tasks per day = 38986 PPD
1 task per 4 threads (5 simultaneous tasks): 1800 seconds per task = 240 tasks per day = 38174 PPD

Simulated HT off (50% CPU set in boinc manager):

1 task per thread (10 simultaneous tasks): 2820 seconds per task = 306 tasks per day = 48733 PPD

HT off in bios:

1 task per thread (10 simultaneous tasks): 2800 seconds per task = 308 tasks per day = 49081 PPD
1 task per 2 threads (5 simultaneous tasks): 1680 seconds per task = 257 tasks per day = 40901 PPD
 

StefanR5R

Elite Member
Dec 10, 2016
5,459
7,718
136
K10: 650 PPD/core/GHz
IVB-EP: 1500 PPD/core/GHz
BDW-EP: 2000 PPD/core/GHz
KBL: 2400 PPD/core/GHz

(The KBL result is helped by higher RAM bandwidth per core compared to the EPs, and by sacrificing power efficiency.)
 

Orange Kid

Elite Member
Oct 9, 1999
4,323
2,110
146
I have found that with Ryzens it does not seem to make any difference whether I run one task per core or 8 cores per task. The PPD will be the same: 16-ish minutes per task at 8 cores and 135-ish minutes at 1. Memory and CPU usage is the same in both instances.
The only advantage of more cores per task comes at the end of a race, as tasks get done and returned faster. So rather than sixteen tasks being 2 seconds too late, only two will miss. :D
 

crashtech

Lifer
Jan 4, 2013
10,521
2,111
146
My Ryzen 8C results point toward 8 tasks of 2 cores, SMT on being best, but not by a lot. But since I don't use Stefan's more exacting method of benching offline with one WU, my results have been inconsistent even with multiple runs and averaging of 10+ WUs each time. So what I am compiling might not even be worth the effort.

Edit: So far, all my 4-8C/8-16T CPUs with HT or SMT on, in Windows 10, regardless of year or make, like 2 logical cores per task the best. Perhaps the Windows scheduler is smart enough to put each instance on one physical core, simulating HT off? Not sure.
 
Last edited:

StefanR5R

Elite Member
Dec 10, 2016
5,459
7,718
136
So far, all my 4-8C/8-16T CPUs with HT or SMT on, in Windows 10, regardless of year or make, like 2 logical cores per task the best.
With <cmdline>-t 2</cmdline> or <cmdline>-t 1</cmdline>?
Perhaps the Windows scheduler is smart enough to put each instance on one physical core, simulating HT off?
It tries to, and mostly succeeds. But I read in the PrimeGrid forum that turning HT off in the BIOS still provides a discernible benefit on Windows for the smaller LLR projects (compared to HT on and 50 % CPUs used in BOINC).
 

crashtech

Lifer
Jan 4, 2013
10,521
2,111
146
With <cmdline>-t 2</cmdline> or <cmdline>-t 1</cmdline>?

It tries to, and mostly succeeds. But I read in the PrimeGrid forum that turning HT off in the BIOS still provides a discernible benefit on Windows for the smaller LLR projects (compared to HT on and 50 % CPUs used in BOINC).
<cmdline>-t 2</cmdline>

Turning HT off and on in the BIOS just isn't a practical solution for me. If there were an app to toggle it from within the OS and have it take effect upon reboot, I'd do it.
 

StefanR5R

Elite Member
Dec 10, 2016
5,459
7,718
136
Regarding the optimum number of concurrent LLR tasks, here is some useful info from the PrimeGrid forum:
mackerel said:
As for data requirements, the FFT data by itself is 8x the FFT size, so Woodall units would currently take just under 16MB each. There is also some other data that is needed, but based on observations only considering the FFT data relative to L3 (or L4) cache is sufficient to indicate where cache makes ram considerations unimportant. Exceeding that generally pushes you into ram limited performance.
Hence, on processors like Intel Sandy Bridge...Coffee Lake and Sandy Bridge-E/EP...Broadwell-E/EP, the size of a given processor's shared inclusive L3 cache, divided by 8x the FFT size, should give the optimum number of tasks to run concurrently. (From there, the optimum number of threads per task needs to be figured out for a given processor and operating system.)

On Intel Skylake-X/SP with their increased L2 and decreased, non-inclusive L3, the math is probably a bit different. Also, scaling on AMD Zen, where L3 is segmented into 8 MB per core complex and 16 MB per die, may not be as easy to figure out as on Sandy Bridge and its siblings.
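
As a worked example of that rule of thumb, using figures quoted elsewhere in this thread (a sketch only; the optimum threads-per-task split still has to be found experimentally):
Bash:
# concurrent tasks ~= shared L3 size / (8 bytes per FFT element * FFT length)
# SR5-LLR: 1152K FFT length -> 8 * 1152 KiB = 9 MiB of FFT data per task
awk 'BEGIN {
    l3_mib  = 2 * 35            # dual E5-2690v4: 2 x 35 MB shared inclusive L3
    fft_mib = 8 * 1152 / 1024   # ~9 MiB FFT data per SR5 task
    printf "~%.1f concurrent tasks\n", l3_mib / fft_mib
}'
# -> ~7.8, matching the measured SR5 optimum of 8 tasks x 7 threads on that machine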
 
Last edited:

crashtech

Lifer
Jan 4, 2013
10,521
2,111
146
Hmm, this might partly explain the poor performance I've been seeing on Ivy and Sandy Xeons; apparently only one task per socket will fit into L3!
 

StefanR5R

Elite Member
Dec 10, 2016
5,459
7,718
136
Regarding LLR scaling on AMD CPUs:
  • Bulldozer: I was told in the SETI.Germany forum that overall throughput does not change much with the number of LLR tasks running concurrently. Going from more tasks with fewer threads/task to fewer tasks with more threads/task results (a) in lower heat output, (b) in shorter run times.
    My personal guess is that lower heat output also means somewhat lower throughput. OTOH the user noted that if more tasks run simultaneously, the higher heat output may cause thermal throttling.
  • Zen: At the PrimeGrid forum, somebody posted results of an offline test with a presumably fixed WU on a Threadripper 1950X. These showed negative scaling of throughput with increasing thread count per task:
    1 task x 16 threads/task: 10.4 h/task, 2.31 tasks/day/host <-- best run time
    2 tasks x 8 threads/task: 16.1 h/task, 2.98 tasks/day/host
    4 tasks x 4 threads/task: 31.4 h/task, 3.05 tasks/day/host
    8 tasks x 2 threads/task: 62.7 h/task, 3.06 tasks/day/host
    16 tasks x 1 thread/task: 111 h/task, 3.47 tasks/day/host <-- best throughput
    (all results extrapolated after 0.12 % completion, i.e. 0.7...8 minutes run time per test)
Bulldozer based CPUs have 2 MB L2 cache per module. At least in the first iteration, this was write-through cache, meaning that synchronization between modules caused snoop traffic up to the L1D cache. (I haven't looked up whether this was changed in later iterations of this CPU family.) Optionally, Bulldozer based CPUs have an L3 cache which is shared between modules, but it is merely an exclusive cache = victim cache, i.e. doesn't hold copies of L2 data.

Zen based CPUs have 0.5 MB L2 cache per core, and 8 MB L3 cache per core complex. This too is an exclusive cache = victim cache.

So, from what I understand, Bulldozer's and Zen's caches are segmented such that the LLR program's performance is very much limited by memory performance, not so much by cache performance. On the other hand, Bulldozer's and Zen's vector math execution units are not as wide in relation to the width of the memory interface as Haswell's (or even as Sandy Bridge's? I haven't looked it up). Expressed the other way around, Bulldozer and Zen have more memory interface width per vector execution unit than Haswell. Maybe this makes the cache deficit a moot point if we look only at throughput.

LLR performance on Haswell and later (perhaps Skylake-X/SP excluded; these have non-inclusive L3 cache now) obviously depends on the active FFT data fitting into the shared inclusive L3. The upside is that the #tasks:#threads setting of LLR which achieves best throughput on Haswell is at the same time a setting which also achieves good run times, in certain cases even best run times. On Bulldozer and Zen, it appears to be the other way around: The setting with best throughput has worst run times, and vice versa (except for PrimeGrid LLR subprojects with very small FFT data sizes that fit into L2 cache).
 
Last edited:

StefanR5R

Elite Member
Dec 10, 2016
5,459
7,718
136
GFN21 v3.20 for CPUs

I haven't run any proper benchmarks of it yet, but some performance data and discussion can be found on the last pages of the "PrimeGrid Races 2018" thread, posts #608 and #638 especially.

――――――――

SoB-LLR v8.01

Most of the current WUs take an FFT size of 2560K ( = 20.0 MB FFT data size; these tasks receive ~52,000 credits) or 2880K ( = 22.5 MB FFT data size; ~58,000 credits). There are also still a few WUs with lower FFT size (and lower credits) around.

Right now I am beginning tests with a 2560K WU for 52,140.08 credits with the following input file:
Code:
8.0 3.8.20 Primality
1000000000:P:0:2:257
24737 28634431
and with the current "sllr64.3.8.21" Linux exe and "llr.ini.6.07" config file. (The 1st line in llrSOB_284832625 is bogus, but this is what was downloaded by the BOINC client for a real WU. sllr64 ignores this line after emitting an error notice. primegrid_llr_wrapper apparently removes this top line when it puts the input file into the task's slot directory.)

An i7-7700K (Kaby Lake, 4C/8T, 8 MB L3 cache) @4.2 GHz with DDR4-3000c14 dual-channel RAM takes about 1 day to complete a single task. To speed up testing, I updated my script in post #44 to optionally terminate each test run after a certain percentage of completion. So, running an llrSOB test only until 1 % takes this processor less than 1/4 h for a single task with 4 threads, and less than 1/2 h for two concurrent tasks with 2 threads each.
Bash:
LLREXE="sllr64.3.8.21"
LLRINI="llr.ini.6.07"
LLRINPUT="llrSOB_284832625"  # FFT length 2560K (FFT data size = 20.0 MB), ~52,000 cobblestones
Bash:
run_series 1 4 1    # 1 task at once,  4 threads/task, quit at 1 % completion
run_series 2 2 1    # 2 tasks at once, 2 threads/task, quit at 1 % completion
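
Extrapolating such a partial run to full-task throughput is straightforward. A sketch of the arithmetic only (not the run_series script itself; the elapsed time below is a hypothetical figure, consistent with the "less than 1/4 h at 1 %" observation above):
Bash:
# estimated full run time = elapsed seconds / completion fraction
credits=52140.08   # cobblestones for this 2560K llrSOB WU
elapsed=880        # hypothetical: seconds for 1 task x 4 threads, stopped at 1 %
fraction=0.01
awk -v n=1 -v e="$elapsed" -v f="$fraction" -v c="$credits" \
    'BEGIN { t = e / f; printf "~%.0f s/task, ~%.1f kPPD\n", t, n * (86400 / t) * c / 1000 }'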
 
Last edited:

StefanR5R

Elite Member
Dec 10, 2016
5,459
7,718
136
PPS-DIV v8.04

Here are test results with a workunit of 240K FFT size ( = 1.875 MB FFT data size, worth 389.87 credits), which is somewhat smaller than the work to be emitted during the upcoming challenge.
Code:
1000000000:P:0:2:257
15 4110098
All tests were done on 64-bit Linux on otherwise idle systems, running each job until completion. Times are median run times.

Zen 2 Ryzen 7 3700X (8C/16T) --> reported by @biodoc in posts #81 and #83.

Kaby Lake i7-7700K (4C/8T, 8 MB L3 cache) @4.2 GHz with DDR4-3000c14 dual-channel RAM,
HT on but not used:
  • 1 task with 4 threads: 1300 s (66 tasks/day, 26 kPPD)
  • 2 concurrent tasks with 2 threads each: 2330 s (74 tasks/day, 29 kPPD)
  • 4 concurrent tasks with 1 thread each: 3140s (110 tasks/day, 43 kPPD) <-- best throughput
HT on and used:
  • 1 task with 8 threads: 1220 s (71 tasks/day, 28 kPPD)
  • 2 concurrent tasks with 4 threads each: 1900 s (91 tasks/day, 35 kPPD)
  • 4 concurrent tasks with 2 threads each: 3530 s (98 tasks/day, 38 kPPD)
  • 8 concurrent tasks with 1 thread each: 9470 s (73 tasks/day, 28 kPPD) <-- exceeds the processor cache
Broadwell-EP dual E5-2690 v4 (2x 14C/28T, 2x 35 MB L3 cache) @2.9 GHz with DDR4-2400c17 2x quad-channel RAM,
HT on but not used:
  • 14 concurrent tasks with 2 threads each: 3890 s (311 tasks/day, 121 kPPD)
  • 28 concurrent tasks with 1 thread each: 4990 s (485 tasks/day, 189 kPPD) <-- best throughput
HT on and used:
  • 28 concurrent tasks with 2 threads each: 5740 s (421 tasks/day, 164 kPPD)
  • 56 concurrent tasks with 1 thread each: 15800 s (306 tasks/day, 119 kPPD) <-- exceeds the processor cache
Broadwell-EP dual E5-2696 v4 (2x 22C/44T, 2x 55 MB L3 cache) @2.6 GHz with DDR4-2400c17 2x quad-channel RAM,
HT on but not used:
  • 22 concurrent tasks with 2 threads each: 4350 s (437 tasks/day, 170 kPPD)
  • 44 concurrent tasks with 1 thread each: 5590 s (680 tasks/day, 265 kPPD) <-- best throughput
HT on and used:
  • 44 concurrent tasks with 2 threads each: 6520 s (583 tasks/day, 227 kPPD)
  • 88 concurrent tasks with 1 thread each: 24640 s (309 tasks/day, 120 kPPD) <-- exceeds the processor cache
Conclusions:
‒ On Broadwell-E/EP and Skylake, these small tasks work best with single-threaded jobs and HyperThreading unused.
‒ In the cases where the workload exceeds the processor cache by far, the dual-14C and dual-22C computers show practically the same performance, no doubt because of the same RAM performance. (And the two octa-channel machines show 4 times the performance of the dual-channel machine, which is a bit surprising as the latter has a faster RAM clock. Perhaps this is due to system overhead.)

For the upcoming challenge, it will be interesting to test the case where the FFT data size is the same as the per-core amount of L3 cache. Will it be better to run half as many concurrent tasks as there are cores (using dual-threaded tasks), or to stick with single-threaded tasks but maybe run one task fewer per CPU than there are cores, to leave a bit of space in the shared L3 cache for program code, kernel data, etc.? --- Update: Test results with larger FFT lengths are reported in post #87.

------------
Edit October 9: added links to Zen 2 results
Edit October 15: added link to variable FFT length tests
 
Last edited:

crashtech

Lifer
Jan 4, 2013
10,521
2,111
146
There's also significant practical value in Stefan's testing, in that we can infer close-to-optimal numbers of tasks and threads for other CPUs based on their L3 cache size, since while core performance might vary, the memory requirements should not.
 

StefanR5R

Elite Member
Dec 10, 2016
5,459
7,718
136
@crashtech, beyond the monoculture I am testing on, there are other L3 cache policies though (Zen(+,2), Skylake-X/SP), segmented caches (Zen(+,2)), wider SMT implementations (Zen(+,2)), narrower FMAs per core (Zen(+)), and wider FMAs per core (Skylake-X/SP models on which the extra vector unit isn't disabled). Any of these factors could shift the sweet spot in border cases. Also, I have abandoned testing on Windows, whose process scheduler policies are known to differ from Linux's in hard-to-predict ways.

Apropos border cases: I have now kicked off an array of tests with 256K...384K FFT length (2.0...3.0 MB FFT data size) on the little i7 with 2.0 MB L3$/core. Some tests with 320K...384K FFT length (2.5...3.0 MB FFT data size) on the E5s with 2.5 MB L3$/core will follow.
 
Last edited:

crashtech

Lifer
Jan 4, 2013
10,521
2,111
146
@StefanR5R, spot on as usual, but your testing is about the best we have as a starting point. I hope that the differences aren't too large between architectures; I feel fairly confident that Haswell-EP, for one, should be really close.
 
Last edited: