PrimeGrid Challenges 2021


StefanR5R

Elite Member
Dec 10, 2016
5,459
7,718
136
I did a quick test of all "b" and "min n in progress"/ "max n in progress"/ "max n loaded" which are currently listed at Generalized Cullen Woodall Prime Search statistics:

b   | min n in progress / max n in progress / max n loaded | FMA3 FFT length    | FMA3 FFT size
13  | 4234840 / 4330158 / 4343034                          | 1920K              | 15 MB
29  | 3259942 / 3298500 / 3308112                          | 2016K              | 15.75 MB
47  | 2850640 / 2884836 / 2893336                          | 2M / 2304K / 2304K | 16 MB / 18 MB / 18 MB
49  | 2817958 / 2853898 / 2862348                          | 1792K              | 14 MB
55  | 2707792 / 2771662 / 2779828                          | 2M / 2304K / 2304K | 16 MB / 18 MB / 18 MB
69  | 2581852 / 2623208 / 2630906                          | 2304K              | 18 MB
101 | 2379576 / 2406490 / 2413702                          | 2304K              | 18 MB
109 | 2323728 / 2367570 / 2374138                          | 2304K              | 18 MB
121 | 2290840 / 2316018 / 2322678                          | 1920K              | 15 MB

(All of the tests used "zero-padded FMA3 FFT" on a Haswell CPU.)
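(Side note on how the two right-hand columns relate: unless I am mistaken, the FFT size is simply the FFT length multiplied by 8 bytes per double-precision element, e.g. 2304K × 8 B = 18 MB and 1920K × 8 B = 15 MB, which is why this is the figure to hold against the L3 cache.)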

Well, this is awkward for Zen 2 users. Some of the currently sent work still fits into the L3 cache of a single CCX (and would perform best if tied to the cores of one and the same CCX), while other currently sent work already exceeds the L3 cache of one CCX. — I suppose a good course of action would be to test both smaller and larger work, and then decide whether to proceed with 1 task : 1 CCX or with 1 task : 2 CCXs for all of the upcoming work.
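In case anyone wants to try the comparison on Linux, here is a rough sketch of what I mean, using taskset. The CPU lists and the launched command are only placeholders; the real L3 groups should be read from sysfs first.
Code:
# Rough sketch (Linux/bash): first see which logical CPUs share one L3 slice, i.e. one CCX.
sort -u /sys/devices/system/cpu/cpu*/cache/index3/shared_cpu_list

# Then, purely as an illustration (CPU lists are made up; use the ones printed above),
# pin a test task either to one CCX or to two CCXs and compare run times:
taskset -c 0-3 ./llr_test_task      # 1 task : 1 CCX   (placeholder command; add SMT siblings if you use them)
taskset -c 0-7 ./llr_test_task      # 1 task : 2 CCXs  (placeholder command)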
 
  • Like
Reactions: biodoc and Ken g6

crashtech

Lifer
Jan 4, 2013
10,521
2,111
146
I don't imagine the server discriminates when it sends work, so it seems prudent to prepare for the worst case scenario. That's what I'm going to test for first.
 

StefanR5R

Elite Member
Dec 10, 2016
5,459
7,718
136
A random task with a zero-padded FMA3 FFT length of 2304K was taking ~650 MB virtual and ~540 MB resident memory at 1 % completion.
The task is now 33 % complete. Virtual memory size: 1.0 GB, resident: 930 MB.
I.e., the physical RAM requirement is even larger than indicated on the PG prefs web page.

Update: 75 % complete -> 1.5 GB virtual, 1.4 GB resident

I don't imagine the server discriminates when it sends work, so it seems prudent to prepare for the worst case scenario. That's what I'm going to test for first.
The fraction of >16 MB work which gets sent out will increase anyway. And while running the <16 MB work on merely 4 cores will result in better throughput due to lower synchronization overhead, it will also come at the price of really long task durations. So, biting the bullet and putting all of the work on 2 CCXs on Zen2 machines might make the most sense.
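(For anyone who wants to watch this on their own machine, a quick-and-dirty sketch for Linux; <pid> is a placeholder for the LLR2 process id.)
Code:
# Minimal sketch: print the task's virtual and resident size once a minute.
# <pid> is a placeholder; find it with e.g. "pgrep -f llr".
watch -n 60 'grep -E "VmSize|VmRSS" /proc/<pid>/status'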
 
Last edited:
  • Like
Reactions: crashtech

biodoc

Diamond Member
Dec 29, 2005
6,257
2,238
136
I have a validated GCW-llr task that I completed if anyone is interested in doing some testing.

llrGCW_384103662_0
24,293.93 credits
Starting probable prime test of 50326*55^2767931+1
Using zero-padded FMA3 FFT length 2240K, Pass1=896, Pass2=2560, clm=1, 16 threads, a = 3, L2 = 212*204, M = 43248
 

StefanR5R

Elite Member
Dec 10, 2016
5,459
7,718
136
For what it's worth: With your candidate (as well as with the one from #478: "2766958*55^2766958+1", 24,285.40 cobblestones), sllr2 chooses the PRP test with a zero-padded FMA3 FFT of 2240K length (17.5 MB) on your Zen3 computer as well as on a Zen2 computer of mine. But on Haswell and on Broadwell-EP it's 2304K length (18 MB). I don't recall that I have seen Haswell and Zen2/3 choosing different transforms before.
 

biodoc

Diamond Member
Dec 29, 2005
6,257
2,238
136
I don't know if it's worth running any tests on GCW-llr since I'll probably run with the same settings as PSP-llr:
2 tasks x 16 threads on the 3950X and 5950X
2 tasks x 12 threads on the 3900X
 
Last edited:

StefanR5R

Elite Member
Dec 10, 2016
5,459
7,718
136
I too thought of foregoing the testing and sticking with what I found best for llrPSP, given that the llrGCW workunit sizes are between those of the previous two LLR challenge subprojects, llrESP and llrPSP. But GCW has candidate numbers with odd bases b and uses LLR2's PRP test (like llrSR5, fwiw), while ESP and PSP have b=2 and use LLR2's Proth test. So I might do some tests after all. Though maybe this doesn't really matter, because it boils down to Fast Fourier Transforms of similar size after all.

Also, there are still some points which I haven't investigated yet: E.g., while I know that I shall enforce processor affinity on Zen2, I haven't tried it on Broadwell-EP yet. Or what about the "NUMA nodes per socket" or "L3 Cache as NUMA Domain" options in the Rome BIOS…
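Before benchmarking anything, the effect of those BIOS switches should at least be visible from within Linux. A sketch of what I would check after toggling them, using only standard tools:
Code:
# Sketch: show the NUMA layout Linux actually sees after changing NPS or "L3 Cache as NUMA Domain".
lscpu | grep -i numa     # NUMA node count and per-node CPU lists
numactl --hardware       # same, plus per-node memory sizes and node distances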
 

crashtech

Lifer
Jan 4, 2013
10,521
2,111
146
I'm going to continue to test. My results on the 3900X indicated that 2x6 worked better than 2x12 for GCW.

For now, all my attention is being consumed by my main desktop, a 5950X. I can't get consistent good performance out of it no matter what settings I use. I'm starting to suspect that one of the automatic settings is killing performance, but since there are dozens of those, it's hard to tell what's going on. I can say that I'm ready to give up on Gigabyte motherboards.
 

StefanR5R

Elite Member
Dec 10, 2016
5,459
7,718
136
On Zen2, back in June with the 15 MB llrESP tasks, it made a notable difference whether the tasks were scheduled freely by the Linux kernel, or were each "manually" tied to a single CCX.

I suspect that a 5950X (= Zen3 with two CCXs), when subjected to two concurrent llrPSP or llrGCW tasks, similarly benefits from enforcing that each task stays on cores which belong to a single CCX.

Granted, my Zen2 tests happened on CPUs which have eight CCXs, not just two, with a correspondingly higher chance of inter-CCX traffic when the tasks were not pinned.
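(As an aside, one can watch directly whether the kernel keeps an unpinned task's threads together; a rough sketch, with <pid> again a placeholder for the task's process id.)
Code:
# Sketch: list each thread of a running task and the logical CPU it last ran on ("psr").
# If the psr values wander across CCX boundaries over time, the scheduler is moving
# threads between CCXs.
ps -L -o tid,psr,comm -p <pid>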
 

biodoc

Diamond Member
Dec 29, 2005
6,257
2,238
136
For now, all my attention is being consumed by my main desktop, a 5950X. I can't get consistent good performance out of it no matter what settings I use. I'm starting to suspect that one of the automatic settings is killing performance, but since there are dozens of those, it's hard to tell what's going on. I can say that I'm ready to give up on Gigabyte motherboards.
My 5950X is in a Gigabyte Aorus Master. I just set the PPT @ 105 watts and haven't seen any performance issues. Can you be more specific?
 
Last edited:

crashtech

Lifer
Jan 4, 2013
10,521
2,111
146
My 5950X is in a Gigabyte Aorus Master. I just set the PPT @ 105 watts and haven't seen any performance issues. Can you be more specific?
Specificity would probably end up as a wall of text, because I've tried so many things. The motherboard is an X570 Aorus Elite, and the 5950X is under water with a big radiator. I'm seeing half the performance on it versus my other 5950X which is air cooled and on an Asus Prime X570-P. Also it is massively inconsistent. I have the script set up to do 2% completion of 2x8 four times to help identify inconsistency. Sometimes it will finish faster than my other 5950X, sometimes twice as slow. Something is very very wrong.

Edit: I have decided to order another Asus Prime X570-P just in case my efforts to tame the Aorus fail. I'd actually been considering switching my HTPC over to AM4, so this will work out okay. When in doubt, throw money at it, I guess.
 
Last edited:

Icecold

Golden Member
Nov 15, 2004
1,090
1,008
146
That is bizarre. I can't even think of a BIOS setting that would cause performance that poor. I've used Gigabyte motherboards without issue, but I'm not a huge fan of theirs. I'm assuming you've tried resetting the BIOS to default settings?

Edit - what operating system is it running?
 
Last edited:

crashtech

Lifer
Jan 4, 2013
10,521
2,111
146
That is bizarre. I can't even think of a BIOS setting that would cause performance that poor. I've used Gigabyte motherboards without issue, but I'm not a huge fan of theirs. I'm assuming you've tried resetting the BIOS to default settings?

Edit - what operating system is it running?
Sorry Ice, I didn't see your post. This is Windows 10. I've tried so many things I have lost track: defaults, various PBO iterations, locked at 4 GHz (which is how my other one is, because reasons), nothing works. Here are the latest raw results (snipped down to the relevant info), with the clock locked at 4 GHz, XMP 3200, FCLK 1600, and most other things on Auto, which is as close to my other one as I could manage. The script runs to 2 % completion four consecutive times, which is how I catch the radical inconsistency:

Code:
======== Mon, Oct 18, 2021  6:46:29 PM ======== starting 2 process(es) with 8 thread(s) ========
real    12m11.2s    (731 s)
user    0m4.0s    (4 s)
sys    0m15.0s    (15 s)
======== Mon, Oct 18, 2021  6:58:40 PM ======= done with 2 process(es) with 8 thread(s) ========

======== Mon, Oct 18, 2021  6:58:40 PM ======== starting 2 process(es) with 8 thread(s) ========
real    18m38.3s    (1118 s)
user    0m5.0s    (5 s)
sys    0m19.7s    (19 s)
======== Mon, Oct 18, 2021  7:17:18 PM ======= done with 2 process(es) with 8 thread(s) ========

======== Mon, Oct 18, 2021  7:17:18 PM ======== starting 2 process(es) with 8 thread(s) ========
real    22m39.3s    (1359 s)
user    0m8.7s    (8 s)
sys    0m35.4s    (35 s)
======== Mon, Oct 18, 2021  7:39:58 PM ======= done with 2 process(es) with 8 thread(s) ========

======== Mon, Oct 18, 2021  7:39:58 PM ======== starting 2 process(es) with 8 thread(s) ========
real    26m39.5s    (1599 s)
user    0m10.0s    (10 s)
sys    0m43.1s    (43 s)
======== Mon, Oct 18, 2021  8:06:37 PM ======= done with 2 process(es) with 8 thread(s) ========

You'd think from looking at that that something must be overheating, but it's not. I have a big 140 mm fan blowing on the VRMs, the CPU is under water with a big rad, and HWiNFO says everything is very cool. I've basically given up at this point. I'm going to replace the motherboard.
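For reference, the runs above boil down to roughly this. This is a much-simplified sketch, not the actual script; the timed commands are placeholders for the real LLR2 invocation against a fixed workunit run to 2 % completion.
Code:
# Greatly simplified outline of the repeat test (bash); the commands being timed are
# placeholders, not the real LLR2 invocation.
for run in 1 2 3 4; do
    echo "======== $(date) ======== starting 2 process(es) with 8 thread(s) ========"
    time {
        ./fixed_workunit_task_a &    # placeholder: first task, 8 threads
        ./fixed_workunit_task_b &    # placeholder: second task, 8 threads
        wait
    }
    echo "======== $(date) ======= done with 2 process(es) with 8 thread(s) ========"
done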
 

StefanR5R

Elite Member
Dec 10, 2016
5,459
7,718
136
@crashtech, is the other computer which produces consistent results running Windows too?

How do you run the script and the LLR2 exe?
(Cygwin or WSL for the script?
Linux binary under Cygwin or WSL, or native Windows binary launched from the non-native script?)
 

Markfw

Moderator Emeritus, Elite Member
May 16, 2002
25,482
14,434
136
@crashtech, is the other computer which produces consistent results running Windows too?

How do you run the script and the LLR2 exe?
(Cygwin or WSL for the script?
Linux binary under Cygwin or WSL, or native Windows binary launched from the non-native script?)
Is there any easy way to know how consistent machines are at WCG, for example? I can look at completed results, but I can't really tell how they are doing, as I have two 5950Xs, both doing WCG.
 

crashtech

Lifer
Jan 4, 2013
10,521
2,111
146
@crashtech, is the other computer which produces consistent results running Windows too?

How do you run the script and the LLR2 exe?
(Cygwin or WSL for the script?
Linux binary under Cygwin or WSL, or native Windows binary launched from the non-native script?)
Yes, the other PC is under the same OS with the same procedure using Cygwin and a modification of your llr2 script which I put together before you published your version. When using Cygwin, I've found it necessary to use the Windows binary, which I obtained by downloading a GCW task.
 
  • Like
Reactions: StefanR5R

StefanR5R

Elite Member
Dec 10, 2016
5,459
7,718
136
Is there any easy way to know how consistent machines are at WCG, for example? I can look at completed results, but I can't really tell how they are doing, as I have two 5950Xs, both doing WCG.
When I wanted to check performance in WCG, I used to go to the results table of a "device" a.k.a. computer on the WCG web site, filter for valid results, copy-and-paste one or more pages (15 results per page) into a text editor, reformat the text a little, then copy-and-paste into a LibreOffice spreadsheet (or Excel, if you have that) and compute points/runtime. I could be more specific if I had something running at WCG right now, but I don't.

However, considering that WCG tasks have somewhat variable sizes (speaking of tasks of one and the same WCG subproject of course), and there is CreditNew or something similar in play at WCG, this method may not give a really good answer to the question of computer performance consistency.

The PrimeGrid tests which @crashtech refers to are done with a fixed workunit. I.e., one more or less arbitrary real workunit is chosen, then the science application is launched with this workunit outside of BOINC in a scripted environment. That way, the test is highly repeatable/reproducible, and quite precise comparisons between different configurations and different computers can be made. I have loosely thought about adapting this method of testing with a fixed workunit to other projects (WCG, Rosetta, etc.) but haven't actually started any real work towards it yet.
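(If one wanted to skip the spreadsheet, the same points/runtime arithmetic can be done on the command line. A sketch, assuming a hand-made, tab-separated file; the column layout is just an example.)
Code:
# Sketch: points per runtime with awk instead of a spreadsheet. Assumes results.tsv holds
# "<runtime in hours><TAB><granted credit>" per line; adjust columns to whatever you pasted.
awk -F'\t' '{ printf "%s\t%.1f credits/h\n", $0, $2 / $1 }' results.tsv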
 

crashtech

Lifer
Jan 4, 2013
10,521
2,111
146
@biodoc , @Icecold , @StefanR5R :

Edited with new info:

Turns out the performance problem was due to a bad Windows installation. Specifically, it was a Windows Insider build that I installed to gain access to some features that I ended up not really using. So putting it back to a fresh, stock build resulted in this:
Code:
======== Wed Oct 20 10:07:08 MDT 2021 ======== starting 2 process(es) with 8 thread(s) ========

real    12m14.7s    (734 s)
user    0m3.8s    (3 s)
sys    0m15.8s    (15 s)
======== Wed Oct 20 10:19:23 MDT 2021 ======= done with 2 process(es) with 8 thread(s) ========

======== Wed Oct 20 10:19:23 MDT 2021 ======== starting 2 process(es) with 8 thread(s) ========

real    12m9.5s    (729 s)
user    0m3.8s    (3 s)
sys    0m15.8s    (15 s)
======== Wed Oct 20 10:31:33 MDT 2021 ======= done with 2 process(es) with 8 thread(s) ========

======== Wed Oct 20 10:31:33 MDT 2021 ======== starting 2 process(es) with 8 thread(s) ========

real    12m9.2s    (729 s)
user    0m3.8s    (3 s)
sys    0m15.3s    (15 s)
======== Wed Oct 20 10:43:42 MDT 2021 ======= done with 2 process(es) with 8 thread(s) ========

======== Wed Oct 20 10:43:42 MDT 2021 ======== starting 2 process(es) with 8 thread(s) ========

real    12m9.1s    (729 s)
user    0m4.0s    (4 s)
sys    0m14.9s    (14 s)
======== Wed Oct 20 10:55:51 MDT 2021 ======= done with 2 process(es) with 8 thread(s) ========

Now I have an extra motherboard...
 
Last edited:

StefanR5R

Elite Member
Dec 10, 2016
5,459
7,718
136
729 seconds for 2 % completion --> somewhat north of 36,000 seconds for the full task.

As a 2-tasks x 8-threads result on 5950X without special sauce, this figure looks oddly familiar. ;-)
 
  • Like
Reactions: crashtech

crashtech

Lifer
Jan 4, 2013
10,521
2,111
146
729 seconds for 2 % completion --> somewhat north of 36,000 seconds for the full task.

As a 2-tasks x 8-threads result on 5950X without special sauce, this figure looks oddly familiar. ;-)
I might sauce it up a bit later, now that it's working properly. Still boggles my mind that the OS could mess performance up so badly. Makes me wonder if they were testing some Windows 11 stuff in that build.
 

StefanR5R

Elite Member
Dec 10, 2016
5,459
7,718
136
Maybe it's got a built-in AI which recognized that you wanted to test the same prime candidate all over again, decided that this was a waste, and secretly stuffed the program threads onto fewer cores.
 

Icecold

Golden Member
Nov 15, 2004
1,090
1,008
146
We can probably call this one pretty much done at this point. Current user stats below:

GCW-LLR: Martin Gardner's Birthday Challenge (2021-10-21 00:00:00 to 2021-10-28 00:00:00)
Last update: 2021-10-21 00:30:02
Rank | Name         | Team                    | Score  | Tasks
1    | Pooh Bear 27 | The Knights Who Say Ni! | 675.15 | 2
2    | Icecold      | TeAm AnandTech          | 381.88 | 1

:p

Edit - there are nearly 7 days left for everybody to catch up :p
 
Last edited:

crashtech

Lifer
Jan 4, 2013
10,521
2,111
146
I have nowhere to go but down now. Makes me think I should adopt Stefan's method of doing a little bunkering; that way I could rise instead of fall.
 

StefanR5R

Elite Member
Dec 10, 2016
5,459
7,718
136
What? Bunkering in a PrimeGrid challenge? Who would do such a thing? O:-)


I do have to say though that I always find it a little irritating to wait for the users of ultra-slow rented VMs to crawl out of the woodwork. Who knows what they are up to, might take them days to show something.
 

lane42

Diamond Member
Sep 3, 2000
5,721
624
126