PrimeGrid Challenges 2021


StefanR5R

Elite Member
Dec 10, 2016
5,459
7,718
136
I did a quick test of all "b" and "min n in progress"/ "max n in progress"/ "max n loaded" which are currently listed at Generalized Cullen Woodall Prime Search statistics:

b   | min n in progress / max n in progress / max n loaded | FMA3 FFT length    | FMA3 FFT size
13  | 4234840 / 4330158 / 4343034                          | 1920K              | 15 MB
29  | 3259942 / 3298500 / 3308112                          | 2016K              | 15.75 MB
47  | 2850640 / 2884836 / 2893336                          | 2M / 2304K / 2304K | 16 MB / 18 MB / 18 MB
49  | 2817958 / 2853898 / 2862348                          | 1792K              | 14 MB
55  | 2707792 / 2771662 / 2779828                          | 2M / 2304K / 2304K | 16 MB / 18 MB / 18 MB
69  | 2581852 / 2623208 / 2630906                          | 2304K              | 18 MB
101 | 2379576 / 2406490 / 2413702                          | 2304K              | 18 MB
109 | 2323728 / 2367570 / 2374138                          | 2304K              | 18 MB
121 | 2290840 / 2316018 / 2322678                          | 1920K              | 15 MB

(All of the tests used "zero-padded FMA3 FFT" on a Haswell CPU.)
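(Side note on how the two right-hand columns relate: unless I am mistaken, the FFT size is simply the FFT length multiplied by 8 bytes per double-precision element, e.g. 2304K × 8 B = 18 MB and 1920K × 8 B = 15 MB, which is why this is the figure to hold against the L3 cache.)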

Well, this is awkward for Zen 2 users. Some of the currently sent work still fits into the L3 cache of a single CCX (and would perform best if tied to the cores of one and the same CCX), while other currently sent work already exceeds the L3 cache of one CCX. — I suppose a good course of action would be to test both smaller and larger work, and then decide whether to proceed with 1 task : 1 CCX or with 1 task : 2 CCXs for all of the upcoming work.
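In case anyone wants to try the comparison on Linux, here is a rough sketch of what I mean, using taskset. The CPU lists and the launched command are only placeholders; the real L3 groups should be read from sysfs first.
Code:
# Rough sketch (Linux/bash): first see which logical CPUs share one L3 slice, i.e. one CCX.
sort -u /sys/devices/system/cpu/cpu*/cache/index3/shared_cpu_list

# Then, purely as an illustration (CPU lists are made up; use the ones printed above),
# pin a test task either to one CCX or to two CCXs and compare run times:
taskset -c 0-3 ./llr_test_task      # 1 task : 1 CCX   (placeholder command; add SMT siblings if you use them)
taskset -c 0-7 ./llr_test_task      # 1 task : 2 CCXs  (placeholder command)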
 
  • Like
Reactions: biodoc and Ken g6

crashtech

Lifer
Jan 4, 2013
10,521
2,111
146
I don't imagine the server discriminates when it sends work, so it seems prudent to prepare for the worst case scenario. That's what I'm going to test for first.
 

StefanR5R

Elite Member
Dec 10, 2016
5,459
7,718
136
A random task with a zero-padded FMA3 FFT length of 2304K was taking ~650 MB virtual and ~540 MB resident memory at 1 % completion.
The task is now 33 % complete. Virtual memory size: 1.0 GB, resident: 930 MB.
I.e., the physical RAM requirement is even larger than indicated on the PG prefs web page.

Update: 75 % complete -> 1.5 GB virtual, 1.4 GB resident

I don't imagine the server discriminates when it sends work, so it seems prudent to prepare for the worst case scenario. That's what I'm going to test for first.
The fraction of >16 MB work which gets sent out will increase anyway. And while running the <16 MB work on merely 4 cores will result in better throughput due to lower synchronization overhead, it will also come at the price of really long task durations. So, biting the bullet and putting all of the work on 2 CCXs on Zen2 machines might make the most sense.
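(For anyone who wants to watch this on their own machine, a quick-and-dirty sketch for Linux; <pid> is a placeholder for the LLR2 process id.)
Code:
# Minimal sketch: print the task's virtual and resident size once a minute.
# <pid> is a placeholder; find it with e.g. "pgrep -f llr".
watch -n 60 'grep -E "VmSize|VmRSS" /proc/<pid>/status'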
 
Last edited:
  • Like
Reactions: crashtech

biodoc

Diamond Member
Dec 29, 2005
6,257
2,238
136
I have a validated GCW-llr task that I completed if anyone is interested in doing some testing.

llrGCW_384103662_0
24,293.93 credits
Starting probable prime test of 50326*55^2767931+1
Using zero-padded FMA3 FFT length 2240K, Pass1=896, Pass2=2560, clm=1, 16 threads, a = 3, L2 = 212*204, M = 43248
 

StefanR5R

Elite Member
Dec 10, 2016
5,459
7,718
136
For what it's worth: With your candidate (as well as with the one from #478: "2766958*55^2766958+1", 24,285.40 cobblestones), sllr2 chooses the PRP test with a zero-padded FMA3 FFT of 2240K length (17.5 MB) on your Zen3 computer as well as on a Zen2 computer of mine. But on Haswell and on Broadwell-EP it's 2304K length (18 MB). I don't recall that I have seen Haswell and Zen2/3 choosing different transforms before.
 

biodoc

Diamond Member
Dec 29, 2005
6,257
2,238
136
I don't know if it's worth running any tests on GCW-llr since I'll probably run with the same settings as PSP-llr:
2 tasks x 16 threads on the 3950X and 5950X
2 tasks x 12 threads on the 3900X
 
Last edited:

StefanR5R

Elite Member
Dec 10, 2016
5,459
7,718
136
I too thought of foregoing the testing and sticking with what I found best for llrPSP, given that the llrGCW workunit sizes are between those of the previous two LLR challenge subprojects, llrESP and llrPSP. But GCW has candidate numbers with odd bases b and uses LLR2's PRP test (like llrSR5, fwiw), while ESP and PSP have b=2 and use LLR2's Proth test. So I might do some tests after all. Though maybe this doesn't really matter, because it boils down to Fast Fourier Transforms of similar size after all.

Also, there are still some points which I haven't investigated yet: E.g., while I know that I shall enforce processor affinity on Zen2, I haven't tried it on Broadwell-EP yet. Or what about the "NUMA nodes per socket" or "L3 Cache as NUMA Domain" options in the Rome BIOS…
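Before benchmarking anything, the effect of those BIOS switches should at least be visible from within Linux. A sketch of what I would check after toggling them, using only standard tools:
Code:
# Sketch: show the NUMA layout Linux actually sees after changing NPS or "L3 Cache as NUMA Domain".
lscpu | grep -i numa     # NUMA node count and per-node CPU lists
numactl --hardware       # same, plus per-node memory sizes and node distances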
 

crashtech

Lifer
Jan 4, 2013
10,521
2,111
146
I'm going to continue to test. My results on the 3900X indicated that 2x6 worked better than 2x12 for GCW.

For now, all my attention is being consumed by my main desktop, a 5950X. I can't get consistent good performance out of it no matter what settings I use. I'm starting to suspect that one of the automatic settings is killing performance, but since there are dozens of those, it's hard to tell what's going on. I can say that I'm ready to give up on Gigabyte motherboards.
 

StefanR5R

Elite Member
Dec 10, 2016
5,459
7,718
136
On Zen2, back in June with the 15 MB llrESP tasks, it made a notable difference whether the tasks were scheduled freely by the Linux kernel, or were each "manually" tied to a single CCX.

I suspect that a 5950X (= Zen3 with two CCXs), when subjected to two concurrent llrPSP or llrGCW tasks, similarly benefits from enforcing that each task stays on cores which belong to a single CCX.

Granted, my Zen2 tests happened on CPUs which have eight CCXs, not just two, with a correspondingly higher chance of inter-CCX traffic when the tasks were not pinned.
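(As an aside, one can watch directly whether the kernel keeps an unpinned task's threads together; a rough sketch, with <pid> again a placeholder for the task's process id.)
Code:
# Sketch: list each thread of a running task and the logical CPU it last ran on ("psr").
# If the psr values wander across CCX boundaries over time, the scheduler is moving
# threads between CCXs.
ps -L -o tid,psr,comm -p <pid>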
 

biodoc

Diamond Member
Dec 29, 2005
6,257
2,238
136
For now, all my attention is being consumed by my main desktop, a 5950X. I can't get consistent good performance out of it no matter what settings I use. I'm starting to suspect that one of the automatic settings is killing performance, but since there are dozens of those, it's hard to tell what's going on. I can say that I'm ready to give up on Gigabyte motherboards.
My 5950X is in a Gigabyte Aorus Master. I just set the PPT @ 105 watts and haven't seen any performance issues. Can you be more specific?
 
Last edited:

crashtech

Lifer
Jan 4, 2013
10,521
2,111
146
My 5950X is in a Gigabyte Aorus Master. I just set the PPT @ 105 watts and haven't seen any performance issues. Can you be more specific?
Specificity would probably end up as a wall of text, because I've tried so many things. The motherboard is an X570 Aorus Elite, and the 5950X is under water with a big radiator. I'm seeing half the performance on it versus my other 5950X which is air cooled and on an Asus Prime X570-P. Also it is massively inconsistent. I have the script set up to do 2% completion of 2x8 four times to help identify inconsistency. Sometimes it will finish faster than my other 5950X, sometimes twice as slow. Something is very very wrong.

Edit: I have decided to order another Asus Prime X570-P just in case my efforts to tame the Aorus fail. I'd actually been considering switching my HTPC over to AM4, so this will work out okay. When in doubt, throw money at it, I guess.
 
Last edited:

Icecold

Golden Member
Nov 15, 2004
1,090
1,008
146
That is bizarre. I can't even think of a BIOS setting that would cause performance that poor. I've used Gigabyte motherboards without issue, but I'm not a huge fan of theirs. I'm assuming you've tried resetting the BIOS to default settings?

Edit - what operating system is it running?
 
Last edited:

crashtech

Lifer
Jan 4, 2013
10,521
2,111
146
That is bizarre. I can't even think of a BIOS setting that would cause performance that poor. I've used Gigabyte motherboards without issue, but I'm not a huge fan of theirs. I'm assuming you've tried resetting the BIOS to default settings?

Edit - what operating system is it running?
Sorry Ice, I didn't see your post. This is Windows 10. I've tried so many things I have lost track: defaults, various PBO iterations, locked at 4 GHz (which is how my other one is, because reasons), nothing works. Here are the latest raw results (snipped down to the relevant info), with the clock locked at 4 GHz, XMP 3200, FCLK 1600, and most other things on Auto, which is as close to my other one as I could manage. The script runs to 2 % completion four consecutive times, which is how I catch the radical inconsistency:

Code:
======== Mon, Oct 18, 2021  6:46:29 PM ======== starting 2 process(es) with 8 thread(s) ========
real    12m11.2s    (731 s)
user    0m4.0s    (4 s)
sys    0m15.0s    (15 s)
======== Mon, Oct 18, 2021  6:58:40 PM ======= done with 2 process(es) with 8 thread(s) ========

======== Mon, Oct 18, 2021  6:58:40 PM ======== starting 2 process(es) with 8 thread(s) ========
real    18m38.3s    (1118 s)
user    0m5.0s    (5 s)
sys    0m19.7s    (19 s)
======== Mon, Oct 18, 2021  7:17:18 PM ======= done with 2 process(es) with 8 thread(s) ========

======== Mon, Oct 18, 2021  7:17:18 PM ======== starting 2 process(es) with 8 thread(s) ========
real    22m39.3s    (1359 s)
user    0m8.7s    (8 s)
sys    0m35.4s    (35 s)
======== Mon, Oct 18, 2021  7:39:58 PM ======= done with 2 process(es) with 8 thread(s) ========

======== Mon, Oct 18, 2021  7:39:58 PM ======== starting 2 process(es) with 8 thread(s) ========
real    26m39.5s    (1599 s)
user    0m10.0s    (10 s)
sys    0m43.1s    (43 s)
======== Mon, Oct 18, 2021  8:06:37 PM ======= done with 2 process(es) with 8 thread(s) ========

You'd think from looking at that that something must be overheating, but it's not. I have a big 140 mm fan blowing on the VRMs, the CPU is under water with a big rad, and HWiNFO says everything is very cool. I've basically given up at this point. I'm going to replace the motherboard.
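For reference, the runs above boil down to roughly this. This is a much-simplified sketch, not the actual script; the timed commands are placeholders for the real LLR2 invocation against a fixed workunit run to 2 % completion.
Code:
# Greatly simplified outline of the repeat test (bash); the commands being timed are
# placeholders, not the real LLR2 invocation.
for run in 1 2 3 4; do
    echo "======== $(date) ======== starting 2 process(es) with 8 thread(s) ========"
    time {
        ./fixed_workunit_task_a &    # placeholder: first task, 8 threads
        ./fixed_workunit_task_b &    # placeholder: second task, 8 threads
        wait
    }
    echo "======== $(date) ======= done with 2 process(es) with 8 thread(s) ========"
done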
 

StefanR5R

Elite Member
Dec 10, 2016
5,459
7,718
136
@crashtech, is the other computer which produces consistent results running Windows too?

How do you run the script and the LLR2 exe?
(Cygwin or WSL for the script?
Linux binary under Cygwin or WSL, or native Windows binary launched from the non-native script?)
 

Markfw

Moderator Emeritus, Elite Member
May 16, 2002
25,482
14,434
136
@crashtech, is the other computer which produces consistent results running Windows too?

How do you run the script and the LLR2 exe?
(Cygwin or WSL for the script?
Linux binary under Cygwin or WSL, or native Windows binary launched from the non-native script?)
Is there any easy way to know how consistent machines are at WCG, for example? I can look at completed results, but I can't really tell how they are doing, as I have two 5950Xs, both doing WCG.
 

crashtech

Lifer
Jan 4, 2013
10,521
2,111
146
@crashtech, is the other computer which produces consistent results running Windows too?

How do you run the script and the LLR2 exe?
(Cygwin or WSL for the script?
Linux binary under Cygwin or WSL, or native Windows binary launched from the non-native script?)
Yes, the other PC is under the same OS with the same procedure using Cygwin and a modification of your llr2 script which I put together before you published your version. When using Cygwin, I've found it necessary to use the Windows binary, which I obtained by downloading a GCW task.
 
  • Like
Reactions: StefanR5R

StefanR5R

Elite Member
Dec 10, 2016
5,459
7,718
136
Is there any easy way to know how consistent machines are at WCG, for example? I can look at completed results, but I can't really tell how they are doing, as I have two 5950Xs, both doing WCG.
When I wanted to check performance in WCG, I used to go to the results table of a "device" a.k.a. computer on the WCG web site, filter for valid results, copy-and-paste one or more pages (15 results per page) into a text editor, reformat the text a little, then copy-and-paste into a LibreOffice spreadsheet (or Excel, if you have that) and compute points/runtime. I could be more specific if I had something running at WCG right now, but I don't.

However, considering that WCG tasks have somewhat variable sizes (speaking of tasks of one and the same WCG subproject of course), and there is CreditNew or something similar in play at WCG, this method may not give a really good answer to the question of computer performance consistency.

The PrimeGrid tests which @crashtech refers to are done with a fixed workunit. I.e., one more or less arbitrary real workunit is chosen, then the science application is launched with this workunit outside of BOINC in a scripted environment. That way, the test is highly repeatable/reproducible, and quite precise comparisons between different configurations and different computers can be made. I have loosely thought about adapting this method of testing with a fixed workunit to other projects (WCG, Rosetta, etc.) but haven't actually started any real work towards it yet.
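(If one wanted to skip the spreadsheet, the same points/runtime arithmetic can be done on the command line. A sketch, assuming a hand-made, tab-separated file; the column layout is just an example.)
Code:
# Sketch: points per runtime with awk instead of a spreadsheet. Assumes results.tsv holds
# "<runtime in hours><TAB><granted credit>" per line; adjust columns to whatever you pasted.
awk -F'\t' '{ printf "%s\t%.1f credits/h\n", $0, $2 / $1 }' results.tsv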
 

crashtech

Lifer
Jan 4, 2013
10,521
2,111
146
@biodoc , @Icecold , @StefanR5R :

Edited with new info:

Turns out the performance problem was due to a bad Windows installation. Specifically, it was a Windows Insider build that I installed to gain access to some features that I ended up not really using. So putting it back to a fresh, stock build resulted in this:
Code:
======== Wed Oct 20 10:07:08 MDT 2021 ======== starting 2 process(es) with 8 thread(s) ========

real    12m14.7s    (734 s)
user    0m3.8s    (3 s)
sys    0m15.8s    (15 s)
======== Wed Oct 20 10:19:23 MDT 2021 ======= done with 2 process(es) with 8 thread(s) ========

======== Wed Oct 20 10:19:23 MDT 2021 ======== starting 2 process(es) with 8 thread(s) ========

real    12m9.5s    (729 s)
user    0m3.8s    (3 s)
sys    0m15.8s    (15 s)
======== Wed Oct 20 10:31:33 MDT 2021 ======= done with 2 process(es) with 8 thread(s) ========

======== Wed Oct 20 10:31:33 MDT 2021 ======== starting 2 process(es) with 8 thread(s) ========

real    12m9.2s    (729 s)
user    0m3.8s    (3 s)
sys    0m15.3s    (15 s)
======== Wed Oct 20 10:43:42 MDT 2021 ======= done with 2 process(es) with 8 thread(s) ========

======== Wed Oct 20 10:43:42 MDT 2021 ======== starting 2 process(es) with 8 thread(s) ========

real    12m9.1s    (729 s)
user    0m4.0s    (4 s)
sys    0m14.9s    (14 s)
======== Wed Oct 20 10:55:51 MDT 2021 ======= done with 2 process(es) with 8 thread(s) ========

Now I have an extra motherboard...
 
Last edited:

StefanR5R

Elite Member
Dec 10, 2016
5,459
7,718
136
729 seconds for 2 % completion --> somewhat north of 36,000 seconds for the full task.

As a 2-tasks x 8-threads result on 5950X without special sauce, this figure looks oddly familiar. ;-)
 
  • Like
Reactions: crashtech

crashtech

Lifer
Jan 4, 2013
10,521
2,111
146
729 seconds for 2 % completion --> somewhat north of 36,000 seconds for the full task.

As a 2-tasks x 8-threads result on 5950X without special sauce, this figure looks oddly familiar. ;-)
I might sauce it up a bit later, now that it's working properly. Still boggles my mind that the OS could mess performance up so badly. Makes me wonder if they were testing some Windows 11 stuff in that build.
 

StefanR5R

Elite Member
Dec 10, 2016
5,459
7,718
136
Maybe it's got a built-in AI which recognized that you wanted to test the same prime candidate all over again, decided that this was a waste, and secretly stuffed the program threads onto fewer cores.
 

Icecold

Golden Member
Nov 15, 2004
1,090
1,008
146
We can probably call this one pretty much done at this point. Current user stats below:

GCW-LLR: Martin Gardner's Birthday Challenge (2021-10-21 00:00:00 to 2021-10-28 00:00:00)
Last update: 2021-10-21 00:30:02
Rank | Name         | Team                    | Score  | Tasks
1    | Pooh Bear 27 | The Knights Who Say Ni! | 675.15 | 2
2    | Icecold      | TeAm AnandTech          | 381.88 | 1

:p

Edit - there are nearly 7 days left for everybody to catch up :p
 
Last edited:

crashtech

Lifer
Jan 4, 2013
10,521
2,111
146
I have nowhere to go but down now. Makes me think I should adopt Stefan's method of doing a little bunkering; that way I could rise instead of fall.
 

StefanR5R

Elite Member
Dec 10, 2016
5,459
7,718
136
What? Bunkering in a PrimeGrid challenge? Who would do such a thing? O:-)


I do have to say though that I always find it a little irritating to wait for the users of ultra-slow rented VMs to crawl out of the woodwork. Who knows what they are up to, might take them days to show something.
 

lane42

Diamond Member
Sep 3, 2000
5,721
624
126