PrimeGrid Challenges 2021

Ken g6

Programming Moderator, Elite Member
Current challenge: GFN-21-and-up!, December 7-17 (07:00 UTC)

Welcome back to another year of PrimeGrid challenges. I hope this year's better than the last, but for now we're all stuck social distancing, so might as well sit around and find some primes, right? ;)

#   Date            Time (UTC)  Project(s)            Duration  Challenge
1   14-19 January   00:00:00    PPS-DIV               5 days    Good Riddance 2020! Challenge
2   14-24 March     12:00:00    SoB-LLR               10 days   Sier"pi"nski's Birthday Challenge
3   11-14 April     18:00:00    WW                    3 days    Yuri's Night Challenge
4   12-17 June      13:00:00    ESP-LLR               5 days    PrimeGrid's 16th Birthday Challenge
5   17-20 July      22:00:00    GFN-17-Low            3 days    World Emoji Day Challenge
6   12-22 August    20:00:00    PSP-LLR               10 days   Once In a Blue Moon Challenge
7   21-28 October   00:00:00    GCW-LLR               7 days    Martin Gardner's Birthday Challenge
8   23-26 November  05:00:00    AP27                  3 days    Euler's Constant Number Challenge
9   7-17 December   07:00:00    GFN-21, GFN-22, DYFL  10 days   Geminids Shower Challenge

What you need:
  • One or more fast x86 processors, preferably with lots of cores. (Even slow ones might do!)
  • Windows (Vista or later 64-bit, or XP or later 32-bit), Linux, or MacOS 10.4+.
  • BOINC, attached to PrimeGrid (http://www.primegrid.com/). (If you prefer the command line, see the sketch after this list.)
  • Your PrimeGrid Preferences with only the above project(s) selected in the Projects section.
  • Patience! Most of these projects run long, slow WUs, at least on your CPU. As a result, most challenges are at least five days long. :eek:
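
If you prefer the command line, here is a minimal sketch of attaching a host to PrimeGrid with boinccmd (assumes the BOINC client is already running; the account key placeholder is something you copy from your PrimeGrid account page):
Bash:
# Attach a running BOINC client to PrimeGrid from the command line.
# Replace YOUR_ACCOUNT_KEY with the account key shown on your PrimeGrid account page.
boinccmd --project_attach http://www.primegrid.com/ YOUR_ACCOUNT_KEY

# Verify that the project is attached:
boinccmd --get_project_status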

What may help LLR (all but four of the challenges):
  • An Intel Sandy Bridge or later ("Core series" other than first-generation) processor with AVX may be 20-70% faster than with the default application. Zen 2 and later AMD processors also do well. Sadly, that does not include Pentium or Celeron processors.
  • In most challenges - probably all of these since their WUs are so large - it helps to enable multi-core processing in your PrimeGrid Preferences.
  • Faster RAM might help on some challenges, as long as it's stable.
  • A large amount of RAM, for GCW-LLR.
What may help in other challenges:
  • A GPU helps in four challenges. That's a record!
  • Juggling in some extra WUs may help in challenges where you run more than one WU on the CPU at a time, mainly GFN-17-Low. On other projects, switching all cores to a single WU at the end may work equally well.

What won't help (but won't hurt either):
  • Any ARM, Android, or MIPS devices.
  • Unstable processors (In LLR ONLY.) Look for red Warnings in Your Results pages.

What won't help (and will hurt, sort of):
  • Work not downloaded and uploaded within the challenge. (It's not counted.) If you can't be in front of one or more computers at the start, there are several options:
    • You can often set BOINC's network connection preferences to wait until a minute or two after the challenge start time. (See the boinccmd sketch after this list.)
    • And for short work units, you can just set the queue level very low (0.01 days). This also makes it more likely that you will be a prime finder rather than a double-checker. But you might want to raise the queue size again once the challenge is underway.
  • Unstable processors. (Invalid work will be deducted! :eek: If Prime95 worked recently on your processor, it should be stable. See also the special case of LLR.)
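
For the "wait until the challenge starts" option above, here is a rough boinccmd sketch (assumes BOINC's command-line tool is available; the same can be done from the Manager's Activity menu):
Bash:
# Before the challenge: suspend all network transfers so nothing downloads or uploads early.
boinccmd --set_network_mode never

# A minute or two after the challenge start time: let BOINC use the network again.
boinccmd --set_network_mode auto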

Welcome and good luck to all! :)

P.S. If no one has posted stats lately, try tracking your stats with my user script. With that installed, visit the current challenge's Team stats link for TeAm stats.
 

crashtech

Lifer
Yes, that advice is a bit dated. Many AMD CPUs from Bulldozer up support AVX, although it took until Zen 2 to achieve rough parity with Intel CPUs.
 

StefanR5R

Elite Member
Some October 2019 findings on PPS-DIV:
PrimeGrid: CPU benchmarks, #71 and several follow-ups

But tasks are larger now. Even the small increase which happened during the October 2019 challenge brought on some notable changes already; see post #87 over there.

Current PPS-DIV search range:
Fermat Divisor Prime Search Statistics
Edit: I'll probably post a translation of this table into FMA3 FFT sizes before the challenge.
 

TennesseeTony

Elite Member
Re: Tour de Primes for all of February

What projects are included, or if the list is shorter, what projects are NOT included? What projects are most likely to find a prime?

Side note: JTFrankenstein put up 118M points for SG the last 7 days, but this is the 8th/9th? day...so many more for 2021. Still running sieve. A crap load of points, but a sharp drop off from 2020. Could these be pendings? 120M+ pendings? :openmouth: Whoever they are, daaaaaannnng! Impressive!
 

Skivelitis2

Member
Last year's TdP thread is here. See first post.
Or, anything less than 5k in the Prime Rank column on the front page qualifies.
PPSE LLR for CPU and GFN-16 for GPU are easiest (most likely) for prime finding.
 

Ken g6

Programming Moderator, Elite Member
Last year's TdP thread is here. See first post.
Or, anything less than 5k in the Prime Rank column on the front page qualifies.
PPSE LLR for CPU and GFN-16 for GPU are easiest (most likely) for prime finding.
There's actually a change from last year. If your CPU is so slow that it completes more PPS tests (at any rate) than its average of PPSE "1st" tests, then PPS is better than PPSE.
 

Skivelitis2

Member
Ah yes, forgot about the new fast double check. Thanks for the reminder, I certainly have some CPUs that qualify as slow. Now I gotta rethink my strategy.
 

StefanR5R

Elite Member
Current challenge: PPS-DIV, January 14-19 (00:00 UTC)
The Fermat Divisor Search LLR (PPS-DIV) project is testing k·2^n+1 for k=[5, 7, …, 49].
Recent average CPU time: 5½ hours

If you run this in multithreaded mode at a moderate thread count and without HyperThreading or SMT, then the run time will be a little more than CPU time ÷ thread count per task. With HT/SMT, run time will be considerably longer than CPU time ÷ thread count per task, but you have double the thread count available.

(Whether or not to use HT/SMT somewhat depends on your CPU type, perhaps on the operating system, and on whether you want to optimize for run time, or throughput, or power efficiency. However, the more important question is how many tasks to run concurrently on your CPU.)
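
A rough worked example with the numbers above (an estimate only, actual times vary by CPU and clocks): a 4-threaded task without HT/SMT should finish in a bit more than 5.5 h ÷ 4 ≈ 1.4 hours of run time. With HT/SMT, a 4-threaded task will take noticeably longer than that, but you have twice the threads to spread across concurrent tasks.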

Current PPS-DIV search range:
Fermat Divisor Prime Search Statistics
I'll probably post a translation of this table into FMA3 FFT sizes before the challenge.
I took the ranges of n which are currently in progress from the statistics page and started the LLR2 program for each k × each min…max n in order to check for FFT lengths.

On a Haswell, on which LLR2 uses FMA3 FFTs, the FFT length was always 480K. That is, the FFT data occupy 3.75 MBytes.
Edit, this may differ on CPUs which are not AVX2 capable, and it may also differ on CPUs which are AVX512 capable. I have none of the latter to test with, and some of the former are packed away in "cold storage".

Your CPU will achieve highest throughput if you
  • run as many concurrent PPS-DIV tasks as last-level cache size ÷ FFT data size,
  • configure as many threads per task as required to maximize utilization of the FMA units.
Highest throughput-per-Watt will perhaps be achieved in the same way, but without using HyperThreading or SMT. (To be verified. Edit, @biodoc's tests in October 2019 showed that the efficiency optimum on Zen2 depends on the CPB configuration: #83, #86. These tests were done with a workunit with 240K FFT length, that is, less than 2 MB FFT data size.)
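
To put a number on the first bullet point, here is a small, untested sketch which reads the L3 size that the Linux kernel reports for CPU 0 and divides it by the FFT data size measured above. Note that this is the size of one L3 instance: per CCX on Zen 2, per socket on recent Intel CPUs. A 16 MiB Zen 2 CCX comes out at 4 tasks.
Bash:
#!/bin/bash
# Sketch only: how many PPS-DIV tasks fit into one last-level cache instance,
# assuming 3.75 MiB of FFT data per task (480K FFT length x 8 bytes).
# (index3 is normally the L3 cache; adjust if your cache topology differs.)
fft_kib=$(( 480 * 8 ))
l3_kib=$(sed 's/K$//' /sys/devices/system/cpu/cpu0/cache/index3/size)
echo "PPS-DIV tasks per L3 instance: $(( l3_kib / fft_kib ))"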

For posterity, here is how I checked for FFT lengths.
Bash:
#!/bin/bash

# Start LLR2 on the candidate ${1}*2^${2}${type} in the background, poll stdout until
# the line reporting the chosen FFT length and thread count appears (and echo it),
# then kill the run and clean up its files.
x () {
	${exe} \
		-oGerbicz=1 -oProofName=proof -oProofCount=64 -oProductName=prod \
		-oPietrzak=1 -oCachePoints=1 -pSavePoints -d -oDiskWriteTime=10 \
		-t${threads} -q"${1}*2^${2}${type}" > stdout 2> stderr &
	pid=$!

	while sleep 3
	do
		grep ", ${threads} threads" stdout && break
	done
	kill ${pid}
	wait ${pid} 2> /dev/null

	rm -f llr.ini lresults.txt proof.* stderr stdout [uz][0-9]*
}

# $1 = k, $5 and $6 = the two exponents n to probe; the remaining fields of the
# data rows below are ignored.
y () {
	echo "=========== $1 ==========="
	[ -n "$5" ] && x $1 $5 || echo
	[ -n "$6" ] && x $1 $6 || echo
}


# PPS-DIV, 2021-01-10
exe="/var/lib/boinc/projects/www.primegrid.com/sllr2_1.1.0_linux64_201114"
threads="4"
type="+1"

y 5	25271	93	16656	7748257	7815139	7811065	8999977	
y 7	62002	239	40648	7748464	7815094	7811300	8999992	
y 9	89352	189	27836	7756121	7815143	7811241	8999977	
y 11	89868	152	25342	7776133	7815099	7811163	8999995	
y 13	63708	87	18211	7734466	7815130	7811020	8999992	1
y 15	195129	341	55434	7761739	7815139	7811124	8999992	3
y 17	43159	62	12291	7753739	7815087	7810511	8999879	
y 19	55848	97	15789	7725114	7814994	7810558	8999994	1
y 21	153117	248	43608	7748277	7815017	7811244	8999904	1
y 23	53507	83	15017	7761269	7815133	7810949	8999969	1
y 25	140794	233	39851	7754074	7815092	7811194	8999992	2
y 27	42013	77	11928	7765247	7814867	7810355	8999975	
y 29	140919	213	39454	7738681	7815081	7811247	8999975	3
y 31	51577	102	14298	7766708	7815008	7811244	8999984	2
y 33	131121	231	37249	7729485	7815142	7811218	8999980	1
y 35	133812	199	37842	7739325	7815085	7811191	8999991	
y 37	168520	278	47758	7744910	7815072	7811116	8999990	1
y 39	218219	350	61935	7738723	7815038	7811138	8999998	6
y 41	58099	84	16306	7756801	7815133	7811053	8999891	1
y 43	108547	190	30808	7764032	7815082	7811062	8999984	
y 45	167496	258	47152	7743750	7815141	7810973	8999993	4
y 47	12509	15	3555	7781575	7814263	7810375	8999131	
y 49	80916	132	22613	7761290	7815118	7810966	8999970	1
 

StefanR5R

Elite Member
As far as I know, you need to look up spec sheets from the CPU vendor or from CPU review sites or wikis.
E.g., https://ark.intel.com/content/www/us/en/ark.html#@Processors

Locally on the system, there is for example /proc/cpuinfo. But while this is showing level 3 cache size on a Haswell Xeon E3, it is showing level 2 cache size on an Epyc Rome.

Edit: Zen2 CPUs (but not APUs) have several 16 MB L3 cache segments, one per core complex. That is, 4 concurrent PPS-DIV tasks can be supported per each core complex without spilling over the caches. Desktop Zen2 CPUs have 3 or 4 cores per core complex. Server Zen2 CPUs have 1…4 cores per core complex.

In contrast, Intel CPUs have a single unified L3 cache.
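
Another way to see it locally (a quick Linux-only sketch): sysfs lists each cache instance per logical CPU, so printing the distinct L3 instances shows both their sizes and which CPUs share them - one line per CCX on Zen 2, one per socket on Intel:
Bash:
# List each distinct L3 cache instance: the logical CPUs sharing it and its size.
for d in /sys/devices/system/cpu/cpu[0-9]*/cache/index3; do
    echo "CPUs $(cat $d/shared_cpu_list): $(cat $d/size)"
done | sort -u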
 

biodoc

Diamond Member
lscpu will give you total L3 cache size in linux. My Ivy 2P shows 50 MiB total but that's 25 MiB per socket.

3900X: lscpu output
Code:
L1d cache:                       384 KiB
L1i cache:                       384 KiB
L2 cache:                        6 MiB
L3 cache:                        64 MiB

The 3900X has 4 CCX so that's 16 MiB per CCX.
 

StefanR5R

Elite Member
For Epyc 7002, lscpu from util-linux 2.33.1 is showing L1 and L2 sizes per core, and L3 size per core complex. (That is, not per socket or per machine.)
 

biodoc

Diamond Member
Dec 29, 2005
6,256
2,238
136
For Epyc 7002, lscpu from util-linux 2.33.1 is showing L1 and L2 sizes per core, and L3 size per core complex. (That is, not per socket or per machine.)
For Mint 20, it appears I have a more recent version:
mark@linux-x24:~$ lscpu --version
lscpu from util-linux 2.34

lscpu -C on a 3900X gives me this table

Code:
mark@linux-x24:~$ lscpu -C
NAME ONE-SIZE ALL-SIZE WAYS TYPE        LEVEL
L1d       32K     384K    8 Data            1
L1i       32K     384K    8 Instruction     1
L2       512K       6M    8 Unified         2
L3        16M      64M   16 Unified         3

EDIT: Key for columns:
ALL-SIZE size of all system caches
LEVEL cache level
NAME cache name
ONE-SIZE size of one cache
TYPE cache type
WAYS ways of associativity
 

Icecold

Golden Member
I'm probably being somewhat dense here, but what would be the optimal configuration on a 3900X or 3950X? 64 MB L3 cache divided by 3.75 = about 17? So my takeaway would be not to run more than 17 tasks at a time, or am I misunderstanding that? Would it make sense to leave SMT on and just do 2 threads per task?

My 56 thread dual socket Xeon has 70MB L3 cache total, so if my math is right I need to do at least 3 threads per task?

Is the "smart cache" listed on ark.Intel.com the number to go on for this? I'm looking up my i5 8400, it has 9MB 'smart cache' listed. - https://ark.intel.com/content/www/u...5-8400-processor-9m-cache-up-to-4-00-ghz.html
 

StefanR5R

Elite Member
I am updating my offline test from LLR to LLR2 and will post the update once I have tried it out.

I'm probably being somewhat dense here, but what would be the optimal configuration on a 3900X or 3950X? 64 MB L3 cache divided by 3.75 = about 17? So my takeaway would be not to run more than 17 tasks at a time, or am I misunderstanding that? Would it make sense to leave SMT on and just do 2 threads per task?
3900X has got 12 cores = 12 FMA3 execution units.
3950X has got 16 cores = 16 FMA3 execution units.
It is likely that more than 12 or 16 concurrent tasks are detrimental to throughput, but this needs to be tested.

My 56 thread dual socket Xeon has 70MB L3 cache total, so if my math is right I need to do at least 3 threads per task?
Preferably, run an even number of concurrent tasks on a dual-socket machine. If there is an odd number of tasks (or if the operating system's kernel unnecessarily spreads tasks across sockets), at least one of them will occupy cache on both processors and will require synchronization via the QPI links as well as indirect main memory access through the QPI links.

The machine has got 2× 14 cores = 2× 14 FMA3 units, assuming Haswell-EP or Broadwell-EP. 2× 7 concurrent tasks may be a good choice, with either 2 or 4 threads per task.

However, testing is better than guessing. :-)

Is the "smart cache" listed on ark.Intel.com the number to go on for this? I'm looking up my i5 8400, it has 9MB 'smart cache' listed. - https://ark.intel.com/content/www/u...5-8400-processor-9m-cache-up-to-4-00-ghz.html
Yes. i5-8400 has got 6 cores = 6 FMA3 execution units, 256 kB L2 cache per core, and 9 MB unified L3 cache.
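
Going by the rule of thumb from earlier in the thread, 9 MB ÷ 3.75 MB means at most 2 concurrent PPS-DIV tasks from the cache side, and with 6 cores that suggests something like 2 tasks × 3 threads - but as always, better to verify by testing than to take my arithmetic as gospel.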
 

biodoc

Diamond Member
I'm probably being somewhat dense here, but what would be the optimal configuration on a 3900x or 3950x? 64 MB L3 cache, divided by 3.75 = 17 ?
Normally we break it down to a single CCX in Zen 2 processors. Each CCX has a dedicated 16 MB L3. To make it simple for me, I'll use SMT "on" as the example. Besides exceeding the 16 MB cache limit we want to avoid "cross-talk" between CCX modules which would be via system RAM (slow).

3600 and 3900 series CCX:
16 MB dedicated L3 cache
6 threads available
Configurations that use all 6 threads:
1) 1 task per thread: 3.75 x 6 tasks = 22.5 MB <---exceeds the 16 MB limit
2) 1 task per 2 threads: 3.75 x 3 tasks = 11.25 MB <--OK
3) 1 task per 3 threads: 3.75 x 2 tasks = 7.5 MB <---OK
4) 1 task per 6 threads: 3.75 MB <---OK

3700, 3800 and 3950 series CCX:
16 MB dedicated L3 cache
8 threads available
Configurations that use all 8 threads:
1) 1 task per thread: 3.75 x 8 tasks = 30 MB <---exceeds the 16 MB limit
2) 1 task per 2 threads: 3.75 x 4 tasks = 15 MB <--OK but close
3) 1 task per 4 threads: 3.75 x 2 tasks = 7.5 MB <---OK
4) 1 task per 8 threads: 3.75 MB <---OK

Partially ninja'd by @StefanR5R :)
 

StefanR5R

Elite Member
Your CPU will achieve highest throughput if you
  • run as many concurrent PPS-DIV tasks as last-level cache size ÷ FFT data size,
  • configure as many threads per task as required to maximize utilization of the FMA units.
The first bullet point should rather read
  • run as many concurrent PPS-DIV tasks as possible, but not more than last-level cache size ÷ FFT data size.
To satisfy both criteria as well as possible, one could use two boinc client instances and configure each instance for a given number of concurrent tasks and, notably, for different thread counts per task. E.g. run a few 2-threaded tasks in one instance and a few 3-threaded tasks in the other instance.

However, I never tried this myself, and my offline testing script does not cover such a setup either. It also appears more complicated than necessary: my tests in the past have shown that there are several different configurations which come very close to the known optimum configuration (with the optimization goal being either host throughput, or host throughput per Watt). Therefore, any untested configuration which would be even better than the found optimum is likely only marginally better.
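
If somebody wants to experiment with it anyway, here is an untested sketch of a second client instance (the data directory and RPC port are arbitrary examples; the apps, project attachment, and RPC password in that directory have to be set up separately):
Bash:
# Start a second, independent BOINC client with its own data directory and its own
# GUI RPC port, so it can be configured differently from the default instance.
mkdir -p /var/lib/boinc2
boinc --dir /var/lib/boinc2 --allow_multiple_clients --gui_rpc_port 31418 --daemon

# Talk to that instance (the default client stays on port 31416):
boinccmd --host localhost:31418 --get_tasks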
 

Icecold

Golden Member
I appreciate the info, that all makes sense. I knew, at least somewhat, that we wanted to avoid it having to communicate across QPI links or between the CCXs, but what you both posted really cleared it up for me.

Has anybody tested the optimal thread count (in terms of PPD) for this FFT length yet? I don't mind testing myself but wouldn't want to reinvent the wheel if somebody else already has. It sounds like running as many tasks as possible (bearing in mind the L3 cache limitation, and making sure each is running on one CCX, or if dual socket on one processor) is generally best?