Info PrimeGrid Challenges 2024, sieve-free edition


StefanR5R

Elite Member
Dec 10, 2016
6,125
9,254
136
This reduces the risk of damage from unforeseen scripting errors:
sudo -u boinc ./llr2_affinity.sh
or, more completely:
sudo -u boinc -g boinc ./llr2_affinity.sh
Then the script can't do anything that the pseudo-user boinc isn't allowed to do, such as FORMAT C:. :-)
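
A quick, generic sanity check (standard tools only, nothing specific to the script) that such a command really runs with the pseudo-user's credentials:
sudo -u boinc -g boinc id -un
sudo -u boinc -g boinc id -gn
Both should print "boinc"; a script launched the same way inherits exactly those privileges.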

- - - - - - - -

I couldn't find a validated PSP result from a 3840K workunit. So I took a 3456K workunit from the results table of Pavel Atnashev's computer cluster instead. (That's 27.0 MB cache footprint of FFT coefficients.) It's the WU with the largest credit on this host when I looked about two hours ago. I ran this WU for 20 minutes per test and extrapolated total duration from the progress made until then.
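
The extrapolation itself is just elapsed time divided by fraction done (read e.g. from boinccmd --get_tasks). A minimal sketch with placeholder numbers, not taken from any of the tests below:
awk -v t=1200 -v p=0.025 'BEGIN { printf "projected total: %.0f s\n", t/p }'
With t = 1200 s elapsed and fraction p = 0.025 done, this prints 48000 s.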

workunit: 222113*2^34206293+1 for 82,165.65 credits
software: SuSE Linux, display-manager shut down, sllr2_1.3.0_linux64_220821
hardware: EPYC 9554P (Zen 4 Genoa 64c/128t), cTDP = PPT = 400 W, 12 channels of DDR5-4800

test | affinity | avg. duration | avg. tasks/day | avg. PPD | avg. core clock | host power | power efficiency
8×8 | none (random scheduling by Linux) | 35:49:20 (128960 s) | 5.4 | 0.440 M | 3.60 GHz | 370 W | 1.19 kPPD/W
8×8 | 1 task : 1 CCX, only lower SMT threads | 12:52:37 (46357 s) | 14.9 | 1.225 M | 3.34 GHz | 485 W | 2.53 kPPD/W
8×16 | 1 task : 1 CCX, all SMT threads | 13:02:32 (46952 s) | 14.7 | 1.210 M | 3.05 GHz | 500 W | 2.42 kPPD/W
4×16 | 1 task : 2 CCXs, only lower SMT threads | 8:35:14 (30914 s) | 11.1 | 0.919 M | 3.60 GHz | 480 W | 1.91 kPPD/W
4×32 | 1 task : 2 CCXs, all SMT threads | 8:39:42 (31182 s) | 11.0 | 0.911 M | 3.18 GHz | 490 W | 1.86 kPPD/W
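
For clarity on how the derived columns relate, here is the arithmetic for the second row, rounded: 8 simultaneous tasks finishing in 46,357 s each give 8 × 86,400 / 46,357 ≈ 14.9 tasks/day; 14.9 × 82,165.65 credits ≈ 1.225 M PPD; and 1,225,000 PPD / 485 W ≈ 2.53 kPPD/W.
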
I ran further tests but with a more recent, bigger workunit.

workunit: 225931*2^34726136+1 for 91,933.52 credits,
on Zen 4: all-complex AVX-512 FFT length 3600K (28.125 MBytes),
on Zen 2 and Broadwell-EP: zero-padded FMA3 FFT length 3840K (30.0 MBytes)
software: SuSE Linux, display-manager shut down, sllr2_1.3.0_linux64_220821, 15 minutes/test

hardware: dual EPYC 7452 (Zen 2 Rome, 2× 32c/64t, 2× 8×16 MB L3$), cTDP = PPT = 180 W, 2×8 channels of DDR4-3200

test | affinity | avg. duration | avg. tasks/day | avg. PPD | avg. core clock | host power | power efficiency
8×8 | none (random scheduling by Linux) | 53:36:35 (192995 s) | 3.6 | 0.329 M | 3.11 GHz | 460 W | 0.72 kPPD/W
8×8 | 1 task : 2 CCXs, only lower SMT threads | 32:23:52 (116632 s) | 5.93 | 0.545 M | 2.87 GHz | 480 W | 1.14 kPPD/W
8×16 | 1 task : 2 CCXs, all SMT threads | 32:59:08 (118748 s) | 5.82 | 0.535 M | 2.75 GHz | 495 W | 1.08 kPPD/W
4×16 | 1 task : 4 CCXs, only lower SMT threads | 17:02:29 (61349 s) | 5.63 | 0.518 M | 2.94 GHz | 460 W | 1.13 kPPD/W
4×32 | 1 task : 4 CCXs, all SMT threads | 17:12:43 (61963 s) | 5.58 | 0.513 M | 2.83 GHz | 475 W | 1.08 kPPD/W

A note on the "affinity = none" test: The outcome there is sensitive to the random nature of thread scheduling. If I re-ran this test, or ran it for much longer, I might get notably higher or lower results.

- - - - - - - -

hardware: dual Xeon E5-2696 v4 (Broadwell-EP, 2× 22c/44t, 2× 55 MB L3$), unlimited all-core turbo duration, 2×4 channels of DDR4-2133

test | affinity | avg. duration | avg. tasks/day | avg. PPD | avg. core clock | host power | power efficiency
4×11 | 2 tasks : 1 socket, only lower SMT threads | 28:04:01 (101041 s) | 3.42 | 0.314 M | 1.95 GHz | 475 W | 0.66 kPPD/W
2×22 | 1 task : 1 socket, only lower SMT threads | 12:58:18 (46698 s) | 3.70 | 0.340 M | 1.95 GHz | 440 W | 0.77 kPPD/W

There is considerable overhead here, from memory accesses (especially in the 4×11 test) and from inter-thread synchronization (especially in the 2×22 test). Because of this overhead, the host does not maintain its all-core turbo clock, which is 2.6 GHz in AVX2 workloads. In PrimeGrid subprojects with considerably smaller workunits, 2.6 GHz can be maintained, which pushes host power consumption well over 500 W.

Still, LLR2's performance scaling to 22 threads per task is pretty good, much better than that of many other multithreaded programs, at least on this kind of host hardware with a big unified inclusive CPU cache.

- - - - - - - -

The workunit from the above tests completed after 15.6 hours on the 9554P:

hardware: EPYC 9554P (Zen 4 Genoa 64c/128t, 8×32 MB L3$), cTDP = PPT = 400 W, 12 channels of DDR5-4800
running 1 task of the 225931*2^34726136+1 workunit along with 7 other random llrPSP tasks

setup | affinity | actual duration | tasks/day¹ | PPD¹ | avg. core clock | host power | power efficiency¹
8×8 | 1 task : 1 CCX, only lower SMT threads | 15:36:16 (56176 s) | 12.3 | 1.131 M | 3.57 GHz | 475 W | 2.38 kPPD/W
¹) if all tasks had the same performance

The first eight tasks which this host completed after the start of the challenge had AVX-512 FFT lengths of 3600K (6×) and 3456K (2×).
The six 3600K (28.125 MBytes) units took ≈56,300 s (15.6 h)² and gave 1.13 MPPD (2.38 kPPD/W)³ on average.
The two 3456K (27.0 MBytes) units took ≈43,600 s (12.1 h)² and gave 1.32 MPPD (≈2.75 kPPD/W)³ on average.​
²) per task
³) per host, if it ran only this type of tasks, 8 at once

That is, my earlier 20-minute short test with a 3456K unit underestimated the actual performance for this type of workunit, and it is not representative of the host's performance with the slightly bigger workunit type. (PrimeGrid estimates the credit for each workunit a priori, based on the expected computational workload, with the goal of keeping PPD constant per subproject. But no estimation is perfect. The -14% drop in PPD from 3456K units to 3600K units is a bit surprising to me.) Still, between these two workunit types, the relative performance of the various threadcount/affinity combos which I tested should be very similar on the Zen computers with 16 or 32 MByte L3$ segments.

In contrast, for the tests of the Broadwell-EP with its 55 MB L3$, it made sense to me to wait for a 30.0 MByte FMA3 FFT workunit to show up.
 

StefanR5R

Elite Member
Dec 10, 2016
6,125
9,254
136
PS,
a properly configured Ryzen 9 7950X outperforms this dual Xeon E5-2696 v4, or am I mistaken?

(That is, can do two 3600K tasks at once in under 46,000 seconds.)
 

emoga

Member
May 13, 2018
198
316
136
PS,
a properly configured Ryzen 9 7950X outperforms this dual Xeon E5-2696 v4, or am I mistaken?

(That is, can do two 3600K tasks at once in under 46,000 seconds.)

Using workunit 225931*2^34726136+1 for 91,933.52 credits on a 7950X (4.5 GHz / 1.0 V).
Power was measured at the wall.

setup | affinity | actual duration | tasks/day | PPD | avg. core clock | host power | power efficiency
2×8 | ascending | 11:08:55 (40135 s) | 4.305 | 395,773 | 4.5 GHz | 211 W | 1.88 kPPD/W
 
  • Like
Reactions: StefanR5R

Orange Kid

Elite Member
Oct 9, 1999
4,391
2,176
146
Question using affinity
If I configured 2 tasks at 16 threads each on a 5950X, could I use
0-7,16-23 and 8-15,24-31,
or would
0-15 and 16-31 be best?
If I am remembering correctly, 0-7 are the "real cores" and 16-23 are the "virtual cores" on the same CCX, and 8-15 and 24-31 are on the other, or am I completely wrong?
I'm trying to use all "cores" on one CCX per task.
 

Ken g6

Programming Moderator, Elite Member
Moderator
Dec 11, 1999
16,438
4,270
75
Day 2 stats:

Rank___Credits____Username
1______16010194___markfw
4______7711744____w a h
6______5461176____Icecold
11_____3672649____crashtech
16_____3010042____ChelseaOilman
17_____2789036____cellarnoise2
44_____867732_____mmonnin
59_____489164_____Orange Kid
90_____267917_____waffleironhead
112____176237_____StephieDolores
114____175969_____Ken_g6

Rank__Credits____Team
1_____40631866___TeAm AnandTech
2_____14236956___Czech National Team
3_____12349651___Antarctic Crunchers
4_____12236424___SETI.Germany
 

StefanR5R

Elite Member
Dec 10, 2016
6,125
9,254
136
Day 2 stats:
14,236,956 + 12,349,651 + 12,236,424 = 38,823,031 < 40,631,866 :-O

Question using affinity
First of all, the throughput loss when cache-aligned CPU affinity isn't assigned on a dual-CCX Ryzen is probably not as large as on my 2P Rome (sixteen CCXs) or 1P Genoa (eight CCXs) because the EPYCs have so many more cache boundaries which are getting in the way. On the other hand, the gap between core speed and RAM speed tends to be a bit bigger on Ryzen systems than on EPYCs.

Now to the numbering part of your questions:
From what I have heard, it differs between Windows and Linux. On Linux, it should be as you remember:

5950X:
0-7 = lower threads of CCX0, 8-15 = lower threads of CCX1,
16-23 = upper threads of CCX0, 24-31 = upper threads of CCX1​
5900X:
0-5 = lower threads of CCX0, 6-11 = lower threads of CCX1,
12-17 = upper threads of CCX0, 18-23 = upper threads of CCX1​

This can be verified with various tools. One which is available on probably all Linux installations is lscpu -e. In its output, threads which belong to the same physical core are attached to the same level 1 cache. And threads which belong to the same CCX are attached to the same level 3 cache.
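
For example, a compact invocation (column availability can differ slightly between util-linux versions):
lscpu -e=CPU,CORE,CACHE
The CACHE column shows the L1d:L1i:L2:L3 identifiers for each logical CPU.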

On 5950X, my guess is that it's best to run one task on CPUs 0-7 and the other on CPUs 8-15 (8 threads/task, leaving half of the host's hardware threads alone; background stuff can run there). Ditto, on 5900X one task on 0-5 and the other on 6-11 (6 threads/task). However, maybe somebody here with a Ryzen 59#0X has actually measured what's best...?
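
If somebody wants to experiment by hand rather than through the script, a minimal sketch with taskset, assuming the Linux numbering above and two made-up PIDs of already running task processes:
taskset -a -c -p 0-7 11111
taskset -a -c -p 8-15 22222
The -a switch moves all threads of each process; replace the PIDs with the real ones.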
 

mmonnin03

Senior member
Nov 7, 2006
255
242
116
Question using affinity
If I configured 2 tasks at 16 threads each on a 5950X, could I use
0-7,16-23 and 8-15,24-31,
or would
0-15 and 16-31 be best?
If I am remembering correctly, 0-7 are the "real cores" and 16-23 are the "virtual cores" on the same CCX, and 8-15 and 24-31 are on the other, or am I completely wrong?
I'm trying to use all "cores" on one CCX per task.
Assuming Process Lasso is correct, when I tell it to disable SMT it disables all of the odd numbered cores.
 

Markfw

Moderator Emeritus, Elite Member
May 16, 2002
26,477
15,588
136
Assuming Process Lasso is correct, when I tell it to disable SMT it disables all of the odd numbered cores.
When you say "disables", maybe it pins tasks only to the even-numbered threads, so that it simulates that.
 

Orange Kid

Elite Member
Oct 9, 1999
4,391
2,176
146
When I use Stef's affinity script with blocks of 8, no SMT, it assigns one task 0-7 and the other 8-15. Using that logic, I thought that if I assigned all the threads of a CCX to each task, it might be more effective (faster).
You know the old adage: if some is good, more is better.
Thanks for the answers.
 

StefanR5R

Elite Member
Dec 10, 2016
6,125
9,254
136
Assuming Process Lasso is correct, when I tell it to disable SMT it disables all of the odd numbered cores.
On Windows, this is the numbering of logical CPUs which I have read elsewhere (but don't know how to verify):

5950X, while SMT is enabled in the BIOS:
0,2,4,6,8,10,12,14 = lower threads of CCX0, 16,18,20,22,24,26,28,30 = lower threads of CCX1,
1,3,5,7,9,11,13,15 = upper threads of CCX0, 17,19,21,23,25,27,29,31 = upper threads of CCX1​
5900X, while SMT is enabled in the BIOS:
0,2,4,6,8,10 = lower threads of CCX0, 12,14,16,18,20,22 = lower threads of CCX1,
1,3,5,7,9,11 = upper threads of CCX0, 13,15,17,19,21,23 = upper threads of CCX1​
 

StefanR5R

Elite Member
Dec 10, 2016
6,125
9,254
136
If I configured 2 tasks at 16 threads each on a 5950X, could I use
0-7,16-23 and 8-15,24-31,
or would
0-15 and 16-31 be best?
OK, I haven't really directly answered this part of your post. The first CPU list is the better one, as this is indeed putting all threads of one CCX exclusively for one task, and then all threads of the other CCX exclusively to the other task.

(However, 8 program threads per task, and 1 physical core : 1 program thread would likely be a little bit better... is my guess. Edit: That is, merely 8 program threads per task, but still only 2 simultaneous tasks on the host.)

PS: These CPU numbers are only valid on Linux. On Windows, see above.
 
  • Like
Reactions: Orange Kid

Orange Kid

Elite Member
Oct 9, 1999
4,391
2,176
146
OK, I haven't really directly answered this part of your post. The first CPU list is the better one, as this is indeed putting all threads of one CCX exclusively for one task, and then all threads of the other CCX exclusively to the other task.

(However, 8 program threads per task, and 1 physical core : 1 program thread would likely be a little bit better... is my guess.)

PS: These CPU numbers are only valid on Linux. On Windows, see above.
I guess I should have mentioned that I use Linux.
I remembered seeing somewhere the numbering for threads but could not find it after many searches.
Using your affinity script with blocks of 8, no SMT, I get 0-7 and 8-15.
Using it with blocks of 16, I get 0-15 and 16-31.
This then confused me (easily done) and made me second-guess myself.
I shall let them run using full threads and see what happens. They are dedicated DC boxes.
 

StefanR5R

Elite Member
Dec 10, 2016
6,125
9,254
136
The syntax "blocks of ..." is the script author's shorthand for consecutive numbering, "block" meaning just a "block of numbers". That is, it definitely does not mean "figure out how the hardware is organized into 'blocks' or 'complexes' or 'domains' or whatever".

There is indeed no shorthand for "each block = all hardware threads of a CCX". Not because it would be impossible or difficult to add a shorthand for this case, but because the author hasn't gotten around to adding one yet. You have to write the lists of CPU numbers explicitly.
 

crashtech

Lifer
Jan 4, 2013
10,620
2,181
146
(snip)
On 5950X, my guess is that it's best to run one task on CPUs 0-7 and the other on CPUs 8-15 (8 threads/task, leaving half of the host's hardware threads alone; background stuff can run there). Ditto, on 5900X one task on 0-5 and the other on 6-11 (6 threads/task). However, maybe somebody here with a Ryzen 59#0X has actually measured what's best...?
My testing of PSP on the 5950X shows a small regression in PPD when using SMT threads, even with Linux affinity set to 0-7,16-23 and 8-15,24-31. However, AFAICT the 7950X shows SMT to be a small help, though it is probably a regression in PPD/watt; I haven't tested that.
 
  • Like
Reactions: Orange Kid

Ken g6

Programming Moderator, Elite Member
Moderator
Dec 11, 1999
16,438
4,270
75
Day 3 stats:

Rank___Credits____Username
1______24611606___markfw
5______11966306___w a h
9______7518588____cellarnoise2
11_____6423280____Icecold
13_____6058677____ChelseaOilman
14_____5897854____crashtech
39_____1661832____mmonnin
66_____755220_____Orange Kid
75_____619605_____waffleironhead
111____360372_____Ken_g6
114____354779_____biodoc
115____352940_____StephieDolores
132____278443_____johnnevermind
219____12756______xxshanshon

Rank__Credits____Team
1_____66872263___TeAm AnandTech
2_____25030267___Czech National Team
3_____23672161___Antarctic Crunchers
4_____22665242___AMD Users
 

StefanR5R

Elite Member
Dec 10, 2016
6,125
9,254
136
number of participants of the top ten teams, and median of their individual scores:

... 14 ....... 1,209k ........ TeAm AnandTech
... 32 .......... 407k ........ Czech National Team
... 12 .......... 470k ........ Antarctic Crunchers
..... 7 .......... 260k ........ AMD Users
... 22 .......... 430k ........ SETI.Germany
..... 1 ..... 16,392k ........ Romania
..... 6 ....... 1,102k ........ Aggie The Pew
..... 9 .......... 352k ........ Ukraine
..... 4 ....... 1,925k ........ BOINC@MIXI
..... 1 ..... 10,444k ........ Ural Federal University
 

Ken g6

Programming Moderator, Elite Member
Moderator
Dec 11, 1999
16,438
4,270
75
Day 4 stats:

Rank___Credits____Username
1______35216675___markfw
6______15539186___w a h
10_____10204796___cellarnoise2
11_____8840990____crashtech
12_____8572735____ChelseaOilman
18_____7491777____Icecold
39_____2458049____mmonnin
65_____1183202____Orange Kid
90_____727140_____biodoc
96_____704104_____waffleironhead
125____453378_____Ken_g6
139____362950_____johnnevermind
145____352940_____StephieDolores
200____105768_____xxshanshon
282____1455_______SlangNRox

Rank__Credits____Team
1_____92215153___TeAm AnandTech
2_____37216282___Czech National Team
3_____35106174___Antarctic Crunchers
4_____34807515___AMD Users
 

Ken g6

Programming Moderator, Elite Member
Moderator
Dec 11, 1999
16,438
4,270
75
Day 5 stats:

Rank___Credits____Username
1______46285444___markfw
6______19113066___w a h
10_____12948987___cellarnoise2
12_____11570548___ChelseaOilman
13_____11298990___crashtech
20_____8103498____Icecold
42_____2907579____mmonnin
67_____1546329____Orange Kid
88_____1080424____biodoc
98_____965188_____waffleironhead
115____727736_____johnnevermind
125____639927_____Ken_g6
165____352940_____StephieDolores
223____105768_____xxshanshon
320____1455_______SlangNRox

Rank__Credits____Team
1_____117647887___TeAm AnandTech
2_____51068086___Czech National Team
3_____47628145___Antarctic Crunchers
4_____46696108___AMD Users
 

Ken g6

Programming Moderator, Elite Member
Moderator
Dec 11, 1999
16,438
4,270
75
Day 6 stats:

Rank___Credits____Username
2______56844754___markfw
6______23139273___w a h
11_____14722877___cellarnoise2
12_____14260987___crashtech
13_____14113089___ChelseaOilman
21_____9288531____Icecold
47_____3448382____mmonnin
67_____1908793____Orange Kid
81_____1446609____biodoc
89_____1320981____waffleironhead
110____997281_____johnnevermind
134____733648_____Ken_g6
169____437322_____StephieDolores
246____105768_____xxshanshon
352____1455_______SlangNRox

Rank__Credits____Team
1_____142769757___TeAm AnandTech
2_____66220790___Czech National Team
3_____62704884___Antarctic Crunchers
4_____62194555___AMD Users
 

cellarnoise

Senior member
Mar 22, 2017
769
417
136
Well, the TeAm is doing great!

Let's see what the final outcome looks like. A few members are moving down the stack a little bit!

But it is great to see others moving on up!
 
  • Like
Reactions: crashtech

Ken g6

Programming Moderator, Elite Member
Moderator
Dec 11, 1999
16,438
4,270
75
More-or-less final stats:

Rank___Credits____Username
2______67858580___markfw
4______40409619___crashtech
5______36445715___w a h
8______32139817___Icecold
14_____19224902___cellarnoise2
15_____16738054___ChelseaOilman
54_____3541952____mmonnin
78_____1908793____Orange Kid
82_____1806413____biodoc
101____1508370____waffleironhead
107____1370718____johnnevermind
131____921725_____Ken_g6
181____437322_____StephieDolores
269____105768_____xxshanshon
378____1455_______SlangNRox

Rank__Credits____Team
1_____224419210___TeAm AnandTech
2_____90911081___Czech National Team
3_____79739432___AMD Users
4_____74411166___Antarctic Crunchers
 

StefanR5R

Elite Member
Dec 10, 2016
6,125
9,254
136
TeAm AnandTech ranking in the last five seasons:

2020: 9, 9, 8, 7, 6, 16, 10, 14, 18
2021: 5, 7, 9, 1, 1, 1, 1, 1, 1
2022: 1, 2¹, 1, 2², 1, 1, 1, 1, 1³
2023: 1, 1, 1, 1, 1, 1, 1, 1, 1
2024: 1, 1, 1, 1⁴, 1, 1, 1, 1, ?⁵

________
¹) with extensive guest-computing by TeAm members for team Ukraine who won this
²) Antarctic Crunchers won this, with Gelly of AC duking it out with Skillz in the individuals' race
³) GFN-21 like in the upcoming challenge, plus GFN-22 and DYFL: TAAT won by making 22.8% of all points, AC = 11.2%, CNT = 10.1%, SG = 9.8%.
⁴) GFN-19: This was the latest combined GPU+CPU challenge, with GPUs generally having an edge in performance and performance/Watt over CPUs in this one. It ended with TAAT = 15.6% of all points, AC = 12.1%, SG = 10.8%, CNT = 10.4%.
⁵) GFN-21: This is going to be mostly a GPU challenge, unless somebody has a large number of X3D$- or HBM-equipped CPUs at their disposal.
 

Markfw

Moderator Emeritus, Elite Member
May 16, 2002
26,477
15,588
136
So, I thought this next one was GPU only, yes? Even though it can do GPU + CPU? Is the CPU a waste, and that's why?
 

Skillz

Senior member
Feb 14, 2014
987
1,020
136
GPUs will do them much faster. CPUs can do them, but I believe the L3 cache requirement is around 40 MB, which very few CPUs have, leading to very long run times on CPUs.