PrimeGrid Challenges 2023


Ken g6

Programming Moderator, Elite Member
Moderator
Dec 11, 1999
16,200
3,764
75
Day 4.5 stats:

Rank___Credits____Username
2______29573280___markfw
3______17035868___xii5ku
6______13334279___crashtech
11_____7892320____w a h
15_____4624072____cellarnoise2
38_____1797687____biodoc
40_____1757185____Orange Kid
41_____1631826____[TA]Skillz
57_____1185992____mmonnin
66_____1032301____waffleironhead
70_____980348_____Fardringle
111____569937_____Letin Noxe
129____450831_____Icecold
132____441562_____Ken_g6
286____31083______DROFFUNGUS

Rank__Credits____Team
1_____82338579___TeAm AnandTech
2_____45147691___Czech National Team
3_____44539120___BOINC@AUSTRALIA
4_____34153087___Antarctic Crunchers

Please remember to finish or abort your double-check WUs. (The fast ones.) The main ones don't matter as much to finalizing the stats.
 

StefanR5R

Elite Member
Dec 10, 2016
5,412
7,583
136
I just found the attached guide on Discord on 10/3.
https://forums.anandtech.com/attachments/pg_ht-png.86619/
The cache footprints of LLR2-based projects are of course slowly creeping up all the time (a bit faster during challenges). The T5K column is dynamic as well, but changes much more slowly over the years; it might still be a long while until the next subproject drops out. The "HT or SMT" column should be ignored, because it depends on hardware, hardware tuning, and personal priorities.
 
  • Like
Reactions: cellarnoise

StefanR5R

Elite Member
Dec 10, 2016
5,412
7,583
136
@crashtech, @Markfw, the AVX-512 question arose for me because I came across a report from a SETI.Germany member about lower per-core performance of his Ryzen 7950X (Zen 4, 2x 8c) compared to his Ryzen 5700X (Zen 3, 1x 8c) and some other 8-core computers of his, in several but not all PrimeGrid LLR2-based projects, in PRST, and in PFGW (a non-BOINC app which, like LLR2 and PRST, is based on the GWNUM library). I.e., the context is primarily a suspected regression from Zen 3 to Zen 4, with the GWNUM lib at the center of the suspicion.
Based on his different PrimeGrid results, the reporter already saw himself that his 7950X seemed to run into a memory bottleneck (SGS was somewhat decent, but larger projects regressed). All this came up in a parallel discussion on CCX affinity, and I am pretty sure that a lack of the latter is heavily throttling this 7950X in the three mentioned applications. Now, SETI.Germany have their own (not-so-)secret sauce for CCX affinity on Windows computers, but seemingly not on Linux computers, and the reporter is in the Linux users minority of his team. In addition, the reporter seemed to ignore some results which his teammates provided for a direct comparison of runs without versus with the special sauce on Windows.

He gave me some details on PFGW, and based on this I might be able to make my own direct comparisons of AVX-512 on/off and CCX affinity on/off on Zen 4 with PFGW, though I'll first examine llrESP because I've got my reproducible test for this ready to run.

From what I understood from Mysticial's Zen 4 AVX-512 analysis, a regression potential exists but should only occur in obscure edge cases, if at all. AMD's implementation of 512 bits wide vector processing on 256 bits wide hardware units in Zen 4 has got similar precedent in their AVX2 implementation on 128 bits wide hardware in Zen 1, IIRC. This worked out pretty well back then, given the transistor budget and power characteristics of the GloFo 14nm process.

Intel's first implementation of AVX-512 happened to take place on 14nm too (but Intel's own 14nm of course), with Skylake-SP/-X. These processors still worked with fixed, per-core-utilization based boost clock steps, a fixed negative AVX2 clock offset, and another fixed negative AVX-512 clock offset. And Intel dedicated a lot of the overall transistor budget to dual 256-bit wide vector units plus another specialized 512-bit wide FMA unit per core. All this taken together made for a powerful but at the same time power-hungry implementation, with regressions in some cases, e.g. when software couldn't be ported to double vector width very well, or when Intel added processor models which had the extra FMA unit disabled but nevertheless featured the fixed negative AVX-512 clock offset. I haven't been following Intel's subsequent processor generations closely enough to know whether or not they have addressed these old trouble points by now. But if we compare current Intel gear with current AMD gear, there is obviously a large manufacturing process disadvantage on Intel's side, which makes it difficult to isolate architectural disadvantages or advantages.
Anyway, long story short: I don't believe right now that LLR2 or PFGW fall into the category of edge cases which could theoretically regress from Zen 3 to Zen 4 (or from Zen 4 AVX2 to Zen 4 AVX-512). But measuring is better than believing, IMO.
 
  • Like
Reactions: crashtech

StefanR5R

Elite Member
Dec 10, 2016
5,412
7,583
136
I was almost, but only almost, tossed off of 3rd place by a cloud user. Once again running without at least a little bunkered reserve made me a sitting duck, but the shotgun missed by a hair.

Meanwhile, those evil ;-) bunker builders do exist, even at PrimeGrid. Or one at least.
 

Ken g6

Programming Moderator, Elite Member
Moderator
Dec 11, 1999
16,200
3,764
75
More-or-less final stats:

Rank___Credits____Username
2______32007892___markfw
3______18691951___xii5ku
6______14531162___crashtech
10_____8634439____w a h
15_____5054384____cellarnoise2
38_____1984113____biodoc
40_____1950740____Orange Kid
49_____1631826____[TA]Skillz
55_____1409905____mmonnin
68_____1128246____waffleironhead
71_____1043847____Fardringle
106____667561_____Letin Noxe
128____506856_____Ken_g6
130____483633_____Icecold
299____31083______DROFFUNGUS

Rank__Credits____Team
1_____89757646___TeAm AnandTech
2_____50221309___Czech National Team
3_____45465770___BOINC@AUSTRALIA
4_____37496123___Antarctic Crunchers
 

StefanR5R

Elite Member
Dec 10, 2016
5,412
7,583
136
Here are the tests of llrESP on Zen 4, forced to use FMA3 as Zen 3 would, or using the new AVX-512 features, as Zen 4 does by default.

Software
sllr2_1.3.0_linux64_220821 from https://www.primegrid.com/download/
"LLR2 Program - Version 1.3.0, using Gwnum Library Version 30.9"

Task
Proth prime test of 238411*2^24201228+1 (would give 31,512.33 credits at PrimeGrid)

Host
Epyc 9554P, all BIOS settings at default, OpenSUSE Leap 15.5, graphical desktop shut off

Tests with FMA3
The BIOS has an AVX512 option which can be set to auto/disabled/enabled, but it does nothing at all; this must be a BIOS bug. Instead, I forced LLR2 to use FMA3 (256-bit wide AVX2) instead of AVX-512 by booting with the kernel command line option clearcpuid=304. I stole this trick from Phoronix's report AMD Zen 4 AVX-512 Performance Analysis On The Ryzen 9 7950X.
Code:
# |     task duration     | tasks/day | points/day | avg clock | Tctl  | system power | efficiency
--+-----------------------+-----------+------------+-----------+-------+--------------+-----------
1 |  15:41:41 =   56501 s |      12.2 |    385,490 |  3.67 GHz | 47 °C |  405...435 W |  920 PPD/W
2 |  15:40:07 =   56407 s |      12.2 |    386,120 |  3.45 GHz | 49 °C |  405...425 W |  930 PPD/W
3 |  10:21:40 =   37300 s |      18.5 |    583,923 |  3.18 GHz | 51 °C |  439...442 W | 1330 PPD/W
4 |   8:46:59 =   31619 s |      21.8 |    688,859 |  3.01 GHz | 52 °C |  425...426 W | 1620 PPD/W
5 |  22:44:17 =   81857 s |     8.443 |    266,058 |  3.73 GHz | 43 °C |  375...405 W |  680 PPD/W
Code:
>>>> Sun  5 Nov 14:58:12 CET 2023, starting test 1 of 5: 238411*2^24201228+1 (31,512.33 credits), 8 tasks x 8 threads
Starting Proth prime test of 238411*2^24201228+1
Using all-complex FMA3 FFT length 2560K, Pass1=512, Pass2=5K, clm=2, 8 threads, a = 3, L2 = 468*404, M = 189072
238411*2^24201228+1, bit: 250000 / 24201227 [1.03%], 189072 checked.  Time per bit: 2.373 ms.
elapsed: 0:10:01 (601 s), remaining: 15:59:38 (57578 s), total: 16:09:39 (58179 s)
min = 55942 s, median = 55942 s, max = 58179 s
average task duration: 15:41:41 (56501 s), 12.2 tasks/day, 385,490 points/day
<<<< Sun  5 Nov 15:08:13 CET 2023, finished test 1 of 5: 238411*2^24201228+1 (31,512.33 credits), 8 tasks x 8 threads
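The extrapolation in the log above is plain arithmetic; here is a minimal sketch of it (the function name and the 420 W figure are mine, the latter picked from the middle of the table's power range):

```python
# Sketch of the throughput math behind the tables above: 8 tasks run in
# parallel, and each Proth test is worth 31,512.33 PrimeGrid credits.
def throughput(task_seconds, parallel_tasks=8, credits_per_task=31512.33):
    tasks_per_day = parallel_tasks * 86400 / task_seconds
    points_per_day = tasks_per_day * credits_per_task
    return tasks_per_day, points_per_day

tpd, ppd = throughput(56501)   # test 1 of the FMA3 runs
efficiency = ppd / 420         # "points/day" / "system power", ~920 PPD/W
```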

Tests with AVX-512
That is, after reboot without the mentioned kernel command line parameter.
Code:
# |     task duration     | tasks/day | points/day | avg clock | Tctl  | system power | efficiency
--+-----------------------+-----------+------------+-----------+-------+--------------+-----------
1 |  15:40:48 =   56448 s |      12.2 |    385,836 |  3.71 GHz | 46 °C |  380...405 W |  980 PPD/W
2 |  15:15:06 =   54906 s |      12.5 |    396,677 |  3.71 GHz | 47 °C |  385...422 W |  980 PPD/W
3 |  10:38:09 =   38289 s |      18.0 |    568,860 |  3.65 GHz | 50 °C |  426...431 W | 1330 PPD/W
4 |   8:04:01 =   29041 s |      23.8 |    749,993 |  3.43 GHz | 51 °C |  418...420 W | 1790 PPD/W
5 |  22:36:39 =   81399 s |     8.491 |    267,571 |  3.65 GHz | 43 °C |  340...352 W |  770 PPD/W
Code:
>>>> Sun  5 Nov 16:12:06 CET 2023, starting test 1 of 5: 238411*2^24201228+1 (31,512.33 credits), 8 tasks x 8 threads
Starting Proth prime test of 238411*2^24201228+1
Using all-complex AVX-512 FFT length 2520K, Pass1=1344, Pass2=1920, clm=1, 8 threads, a = 3, L2 = 468*404, M = 189072
238411*2^24201228+1, bit: 250000 / 24201227 [1.03%], 189072 checked.  Time per bit: 2.334 ms.
elapsed: 0:10:00 (600 s), remaining: 15:58:02 (57482 s), total: 16:08:02 (58082 s)
min = 53780 s, median = 58082 s, max = 58082 s
average task duration: 15:40:48 (56448 s), 12.2 tasks/day, 385,836 points/day
<<<< Sun  5 Nov 16:22:06 CET 2023, finished test 1 of 5: 238411*2^24201228+1 (31,512.33 credits), 8 tasks x 8 threads

Explanation of the table contents
  • Each test ran only ten minutes. After that, progress was checked, and total task duration was extrapolated from this. Task duration is trivially converted to tasks/day (in all tests, 8 tasks were running in parallel) and points/day (PPD).
  • "avg clock" is the average of core clocks across all logical CPUs, sampled at circa 5 minutes into a test. If you dig into the results, you will notice that high clocks indicate that the CPUs didn't have much to do (waited a lot for RAM accesses to complete), whereas low clocks indicate that the vector arithmetic units were hard at work.
  • "Tctl" is the reading of the "controlling" CPU temperature at the same time. Tccd1…Tccd8 are always a little lower than Tctl. Ambient temperature was at a constant 24 °C during all tests, and the liquid cooling system operated with a fixed pump speed and fixed fan speeds at all times. Thus, Tctl and power draw correlate closely in these tests.
  • "system power" was also obtained at ~5 minutes into each test. It is the reading on a kill-a-watt, that is, includes not only CPU power draw but RAM, SSD, cooling system (which was constant during the tests), and PSU conversion losses (it's a platinum PSU).
  • "efficiency" is "points/day" divided by "system power".
  • Now the important bits, about the first column:
    All 5 tests were run with the mentioned test candidate (spoiler: it's composite, not prime), with 8 simultaneous tasks and 8 threads per task (so, 64 software threads in total on this 64c/128t hardware, which consists of 8 CCXs with 8c/16t/32MB L3$ each). Each task has a cache footprint of ~20 MB.
    What differed between the 5 tests was CPU affinity.
    1. No affinity; the software threads were randomly spread by the OS kernel over the 128 hardware threads.
    2. 8x affinity masks of 0-63. This informed the OS to leave the SMT sibling threads alone and use only real cores. As you can see, this makes only a little difference, because the Linux kernel isn't totally dumb.
    3. Pairs of tasks were assigned to pairs of CCXs. A similar situation would arise if you operated a 7950X (= 2x 8c/16t CCXs) without affinity. Again, only the lower half of the SMT sibling threads was admitted.
    4. Each task was assigned to one CCX exclusively. This is the only one of the 5 tests which guaranteed that each task had enough level-3 cache available to itself, and that inter-thread communication could happen within cache. Only the lower half of the SMT sibling threads was admitted here as well.
    5. This test was set up just for the LULZ: each task was bound to 8 logical CPUs taken from 8 different CCXs. In other words, this test forced 8 concurrent tasks onto each CCX. Only the lower half of the SMT sibling threads was used.
    Code:
    1) 0-127 0-127 0-127 0-127 0-127 0-127 0-127 0-127
    2) 0-63 0-63 0-63 0-63 0-63 0-63 0-63 0-63
    3) 0-15 0-15 16-31 16-31 32-47 32-47 48-63 48-63
    4) 0-7 8-15 16-23 24-31 32-39 40-47 48-55 56-63
    5) 0,8,16,24,32,40,48,56
       1,9,17,25,33,41,49,57
       2,10,18,26,34,42,50,58
       3,11,19,27,35,43,51,59
       4,12,20,28,36,44,52,60
       5,13,21,29,37,45,53,61
       6,14,22,30,38,46,54,62
       7,15,23,31,39,47,55,63
    Notes:
    – Linux numbers logical CPUs differently from Windows.
    – The masks were applied to each task as a whole. All individual threads of the task inherited the same mask. Thus, how the threads were scheduled within the constraints of these masks was up to the operating system.
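If anyone wants to reproduce these layouts, they can be generated rather than typed out; a small sketch (the function name is mine, logical-CPU numbering as in the masks above):

```python
# Sketch: generate the five affinity layouts above for 8 tasks x 8 threads
# on a 64c/128t part with 8 CCXs of 8 cores (SMT siblings are CPUs 64-127).
def affinity_masks(scheme, tasks=8):
    if scheme == 1:  # no affinity: all 128 logical CPUs for every task
        return [list(range(128))] * tasks
    if scheme == 2:  # physical cores only
        return [list(range(64))] * tasks
    if scheme == 3:  # pairs of tasks on pairs of CCXs
        return [list(range((t // 2) * 16, (t // 2) * 16 + 16)) for t in range(tasks)]
    if scheme == 4:  # one CCX per task
        return [list(range(t * 8, (t + 1) * 8)) for t in range(tasks)]
    if scheme == 5:  # one logical CPU from each of the 8 CCXs per task
        return [list(range(t, 64, 8)) for t in range(tasks)]
```

The resulting lists can be fed to, e.g., taskset -c when launching each task.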

Conclusion
With the exception of test #3, AVX-512 provided superior performance and power efficiency over FMA3. Note however that all tests except #4 were heavily influenced by randomness in the thread scheduling, and by GMI/IMC/RAM performance. Conversely, these latter influences were eliminated in test #4 and the vector units were used to their fullest, highlighting the FMA3 -> AVX-512 improvements.

edited to fix typos: …fixed fan speeds… / …over FMA3…/ added the final sentence
 
Last edited:
  • Like
Reactions: Trotador22

crashtech

Lifer
Jan 4, 2013
10,511
2,105
146
@StefanR5R, have you tried any 16T per CCX tests on Genoa? Both @biodoc and I noticed a small uplift on ESP with our 7950Xs, though it is likely not power efficient (did not measure that). No other CPU I tested exhibited such behavior in ESP.
 

StefanR5R

Elite Member
Dec 10, 2016
5,412
7,583
136
Yes. Before the challenge I did several tests at 400 W PPT, including ones in which I used all threads. This gave noticeably better throughput and even a tiny bit better efficiency (perf/W) than if I only used half the hardware threads at the same power level. (Older CPUs of mine behaved as I was used to: HT/SMT might increase throughput somewhat, but would decrease perf/W in llrESP.) I have read about a throughput increase from SMT on the Ryzen 7900X too; efficiency wasn't reported.

The above tests at default PPT (360 W) are missing tests with all SMT threads used because of… no deeper reason.
 
  • Like
Reactions: crashtech


Ken g6

Programming Moderator, Elite Member
Moderator
Dec 11, 1999
16,200
3,764
75
I anticipate GFN-16 will drop in early February, when everybody's looking for reportable primes.
 

StefanR5R

Elite Member
Dec 10, 2016
5,412
7,583
136
10-20 December
GFN-18: 5.07 MB
GFN-19: 10.1 MB
GFN-20: 20.3 MB
GFN-18…20 are also available on GPUs.
I haven't checked yet but am supposing that these values are still current. Which means GFN-18 and -19 should work decently on a large variety of CPUs, and Zen 3 and Zen 4 as well as some big Intel CPUs should be able to deal with GFN-20 without memory bottleneck too.
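That cache reasoning can be sketched in a few lines; the per-CCX/per-chip L3 sizes here are my own assumptions for illustration, the footprints are the figures quoted above:

```python
# Sketch: per-task transform footprints versus assumed L3 cache sizes,
# to see where a memory bottleneck is unlikely.
footprints_mb = {"GFN-18": 5.07, "GFN-19": 10.1, "GFN-20": 20.3}
l3_mb = {"Zen 3/Zen 4 CCX": 32, "older 8 MB L3 CPU": 8}

for cache, size in l3_mb.items():
    fits = [p for p, mb in footprints_mb.items() if mb <= size]
    print(f"{cache}: {', '.join(fits)} fit in cache")
```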

Didn't we figure last year that running CPUs on the lower GFN was worth it?
For Linux folks, there are my scripts from last year for genefer22 (CPUs) and genefer22g (GPUs), which give you a direct comparison. These scripts should still give you decent PPD estimates for the upcoming challenge, because the "leading edges" of Genefer-based projects are moving more slowly than the LLR-based ones. And notably, the memory allocations of the transforms aren't creeping up gradually; instead they stay the same for a very long time until a limit of a transform is reached. Nevertheless, I'll eventually look around for updated test workunits for these genefer22(g) tests, unless somebody else gets to it quicker.
 
  • Like
Reactions: Ken g6

Markfw

Moderator Emeritus, Elite Member
May 16, 2002
25,380
14,344
136
I haven't checked yet but am supposing that these values are still current. Which means GFN-18 and -19 should work decently on a large variety of CPUs, and Zen 3 and Zen 4 as well as some big Intel CPUs should be able to deal with GFN-20 without memory bottleneck too.


For Linux folks, there are my scripts from last year for genefer22 (CPUs) and genefer22g (GPUs), which give you a direct comparison. These scripts should still give you decent PPD estimates for the upcoming challenge, because the "leading edges" of Genefer-based projects are moving more slowly than the LLR-based ones. And notably, the memory allocations of the transforms aren't creeping up gradually; instead they stay the same for a very long time until a limit of a transform is reached. Nevertheless, I'll eventually look around for updated test workunits for these genefer22(g) tests, unless somebody else gets to it quicker.
I looked above and it says these are available for GPUs. Are CPUs or GPUs more proficient in PPD/Watt of electricity, and PPD/day?
 

StefanR5R

Elite Member
Dec 10, 2016
5,412
7,583
136
Are CPUs or GPUs more proficient in PPD/Watt of electricity, and PPD/day
I expect there to be a wide range of both metrics among old…new CPUs, as well as on old…new GPUs, with a certain overlap between GPUs' and CPUs' ranges. Traditionally, Genefer is more of a GPU play, especially if you have recent GPUs.

The last GFN-18/19/20 challenge was three years ago, and the TeAm was only 18th in it. So a few of us, myself included, appear to have got a bit of catching up to do WRT the technical details.
 
Last edited:

Skillz

Senior member
Feb 14, 2014
890
902
136
I expect there to be a wide range of both metrics among old…new CPUs, as well as on old…new GPUs, with a certain overlap between GPUs' and CPUs' ranges. Traditionally, Genefer is more of a GPU play, especially if you have recent GPUs.

The last GFN-18/19/20 challenge was three years ago, and the TeAm was only 18th in it. So a few of us, myself included, appear to have got a bit of catching up to do WRT the technical details.

Yeah, the last GFN challenge though was the much higher GFNs (21, 22 and DYFL) and we rocked 1st place with 2x the output of #2 team. So I think we'll do very well with this one.

I've got a bunch of new EPYCs I've built recently (some are waiting for parts). Pretty sure I'll have more EPYCs running 24/7 than anyone else on the team. :) Except I don't have the fancy EPYC Genoa like Xii5ku and Cellar do. My sheer numbers should make up for it though. I hope. Haha
 

Ken g6

Programming Moderator, Elite Member
Moderator
Dec 11, 1999
16,200
3,764
75
Are CPUs or GPUs more proficient in PPD/Watt of electricity, and PPD/day
GPUs are more efficient when running GFN-20 than when running GFN-18 or 19. This is mostly not true of CPUs, and cache limits may reverse the effect.

Comparing GPUs and CPUs is almost apples and oranges. (Grapefruits and mandarins?) I tend to think CPUs are more efficient but I'm really not sure. It very likely depends on the specific GPU, the specific CPU, and whether they're running optimal tasks.
 

StefanR5R

Elite Member
Dec 10, 2016
5,412
7,583
136
Besides different computational efficiency of one and the same CPU or GPU at GFN-18 vs. GFN-19 vs. GFN-20, there is also the factor that PrimeGrid grants 10% "long job credit bonus" for GFN-20, but not for the other two. This bonus alone should be the reason why almost all GPUs are best put to GFN-20. But it's probably not as clear-cut with CPUs, as @Ken g6 noted.
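To put a number on that: the 10% bonus means GFN-20 comes out ahead on credits whenever its raw output stays above roughly 1/1.1 ≈ 91% of the alternative's. A tiny sketch with illustrative (not measured) throughput figures:

```python
# Sketch: does GFN-20's 10% long-job credit bonus outweigh a raw-PPD deficit?
def gfn20_wins(raw_ppd_gfn20, raw_ppd_other, bonus=0.10):
    return raw_ppd_gfn20 * (1 + bonus) > raw_ppd_other

print(gfn20_wins(92_000, 100_000))  # 8% slower raw, still ahead on credits
print(gfn20_wins(88_000, 100_000))  # 12% slower raw, falls behind
```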
 
  • Like
Reactions: Ken g6

Markfw

Moderator Emeritus, Elite Member
May 16, 2002
25,380
14,344
136
8 days to go! And a couple of toys to add to my fleet!
 

Attachments

  • 20231202_140340[1].jpg (477.6 KB)

waffleironhead

Diamond Member
Aug 10, 2005
6,904
427
136
Less than 24 hours to go until the next challenge.

"The ninth and final challenge of the 2023 Series will be a 10-day challenge toasting to the contributions of Professor Chris K. Caldwell, founder of the Largest Known Primes Database. The challenge will be offered on the GFN-18, GFN-19, and/or GFN-20 applications, beginning 10 December 19:00 UTC and ending 20 December 19:00 UTC."
 

Markfw

Moderator Emeritus, Elite Member
May 16, 2002
25,380
14,344
136
Less than 24 hours to go until the next challenge.

"The ninth and final challenge of the 2023 Series will be a 10-day challenge toasting to the contributions of Professor Chris K. Caldwell, founder of the Largest Known Primes Database. The challenge will be offered on the GFN-18, GFN-19, and/or GFN-20 applications, beginning 10 December 19:00 UTC and ending 20 December 19:00 UTC."
I'm ready. I have all my boxes drained of other work, so I don't dump units.
 
  • Like
Reactions: Orange Kid