
PrimeGrid Challenges 2020


VirtualLarry

Lifer
Aug 25, 2001
48,792
5,302
126
I'm gunning for a top-59 finish in this Challenge personally; we'll see if I can rise up the ranks fast enough. Not going to be easy.

I put another Haswell quad (i5-4670), 16GB of DDR3-1600, and a GTX 1650 D5 4GB on PrimeGrid. Crossing my fingers that I don't trip a breaker.
 

VirtualLarry

Lifer
Aug 25, 2001
48,792
5,302
126
Didn't know that you could "bunker" PrimeGrid. Unless the point is to sandbag at the beginning.

Well, congrats to TSBT. I feel a small personal victory over them, though, as I came in 58th individually, just beating out two of their members at 59th and 60th respectively. It was close, for me at least.
 

Ken g6

Programming Moderator, Elite Member
Moderator
Dec 11, 1999
15,058
2,036
55
Day 3 stats:

Rank___Credits____Username
13_____43418480___Howdy
16_____39181133___xii5ku
31_____23556548___biodoc
37_____20967620___crashtech
58_____13480629___VirtualLarry
96_____8636502____Orange Kid
98_____8582566____Lane42
100____8184788____emoga
164____4014861____10esseeTony
218____2417007____far
244____1850679____Ken_g6
259____1597854____zzuupp
324____798927_____waffleironhead
450____246083_____geecee

Rank__Credits____Team
4_____225128864___SETI.Germany
5_____199526119___Aggie The Pew
6_____184150988___The Scottish Boinc Team
7_____176933677___TeAm AnandTech
8_____140803299___Sicituradastra.
9_____135011921___[H]ard|OCP
10____71758477___Crunching@EVGA

Well, that's a surprise, but I guess we are 7AA7. ;)
 

VirtualLarry

Lifer
Aug 25, 2001
48,792
5,302
126
Nice, Larry. :) I didn't know you had that many computers.
The "little" ones (quads), are the Fortnite-capable gaming PCs from my FS thread here, with GTX 1650 and 1050 cards. My "big" PCs are 3x 6C/12T Ryzen + GTX 1660 ti. And a mining rig with 3x GTX 1660 ti.
 

StefanR5R

Diamond Member
Dec 10, 2016
3,561
3,842
106
The next PG challenge will be the Summer LOLympics challenge ;-) and it runs for a whole week, July 24–31.

It's going to be TRP-LLR, which was used in previous PG challenges in May 2019 and October 2017, for example. Neither my older PG CPU benchmarks thread nor my more recent locally stored test protocols contain data on TRP. But I found notes which I left in my app_config.xml's in May 2019. At that time, FFTs with a 5.6...7.5 MB memory footprint were in circulation.

According to Tony's calendar, there will be just a few days of spare time available for performance testing before the challenge. Personally, I won't need to test my Xeons, but I should test my newer toys.
 

Ken g6

Programming Moderator, Elite Member
Moderator
Dec 11, 1999
15,058
2,036
55
Bump for the next challenge in almost exactly 5 days!
 

Ken g6

Programming Moderator, Elite Member
Moderator
Dec 11, 1999
15,058
2,036
55
Is this next challenge CPU, GPU, or both? I can be in with both Ryzen R5 3600 CPUs.
That turns out to be a good question, even though the answer is just CPU.

There is a GPU LLR app, but it doesn't work on this project.

Even if this were a project where it might be usable, it wouldn't work on AMD GPUs.

But, yes, your Ryzens should work well. :)
 

StefanR5R

Diamond Member
Dec 10, 2016
3,561
3,842
106
From a quick look through some hosts, current TRP-LLR work units have FFT lengths of 800, 864, 896, 960, 1000, 1008, 1024 K. (That's 6.25...8.0 MBytes of hot data.)

But I also saw a WU which was created eight days ago and had FFT length 1120K (8.75 MBytes) on a Haswell and 1152K (9.0 MBytes) on a Skylake-SP.

That's a very awkward point in this project at which to start a challenge, at least for owners of computers whose cache size is a multiple of 8 MB.

Here are the candidates which are currently being tested:
http://www.primegrid.com/stats_trp_llr.php
 

StefanR5R

Diamond Member
Dec 10, 2016
3,561
3,842
106
Here are the candidates which are currently being tested:
http://www.primegrid.com/stats_trp_llr.php
I wrote a little batch file which starts llr.exe with each candidate, once with "min n in progress" and once with "max n in progress", and kills llr.exe as soon as it has printed the chosen FFT length. On a Haswell using FMA3, this gave the following FFT lengths:
Candidate        min n in progress   max n in progress   FFT length   FFT size
2293*2^n-1       10238231            10293647            768K         6.00 MBytes
9221*2^n-1       10225014            10293678            800K         6.25 MBytes
23669*2^n-1      10238148            10293444            864K         6.75 MBytes
31859*2^n-1      10242276            10293348            864K         6.75 MBytes
38473*2^n-1      10238195            10292387            896K         6.75 MBytes
40597*2^n-1      —                   —                   —            —
46663*2^n-1      10242911            10293479            896K         6.75 MBytes
65531*2^n-1      —                   —                   —            —
67117*2^n-1      10235797            10293637            896K         6.75 MBytes
74699*2^n-1      10228636            10293796            960K         7.00 MBytes
81041*2^n-1      10255714            10292974            960K         7.00 MBytes
93839*2^n-1      10222272            10293720            960K         7.00 MBytes
97139*2^n-1      10238220            10293388            960K         7.00 MBytes
107347*2^n-1     10238269            10293093            960K         7.00 MBytes
121889*2^n-1     10237760            10293632            960K         7.00 MBytes
123547*2^n-1     —                   —                   —            —
129007*2^n-1     10231433            10293689            960K         7.00 MBytes
141941*2^n-1     —                   —                   —            —
143047*2^n-1     10237573            10293157            960K         7.00 MBytes
146561*2^n-1     10226890            10293394            960K         7.00 MBytes
161669*2^n-1     10251344            10293752            960K         7.00 MBytes
162941*2^n-1     —                   —                   —            —
191249*2^n-1     —                   —                   —            —
192971*2^n-1     10237258            10293418            1M           8.00 MBytes
206039*2^n-1     10238128            10292296            1M           8.00 MBytes
206231*2^n-1     10237658            10293314            1M           8.00 MBytes
215443*2^n-1     10238099            10293795            1M           8.00 MBytes
226153*2^n-1     10236035            10293683            1M           8.00 MBytes
234343*2^n-1     10237751            10293503            1M           8.00 MBytes
245561*2^n-1     10240622            10292750            1M           8.00 MBytes
250027*2^n-1     10238089            10293673            1M           8.00 MBytes
252191*2^n-1     —                   —                   —            —
273809*2^n-1     —                   —                   —            —
304207*2^n-1     —                   —                   —            —
315929*2^n-1     10225740            10293660            1M           8.00 MBytes
319511*2^n-1     10223182            10293502            1M           8.00 MBytes
324011*2^n-1     10215490            10293250            1M           8.00 MBytes
325123*2^n-1     10227191            10293791            1M           8.00 MBytes
327671*2^n-1     10237886            10293598            1M           8.00 MBytes
336839*2^n-1     10215620            10293812            1M           8.00 MBytes
342847*2^n-1     10239481            10293337            1M           8.00 MBytes
344759*2^n-1     10237104            10293688            1M           8.00 MBytes
353159*2^n-1     —                   —                   —            —
362609*2^n-1     10226880            10293768            1M           8.00 MBytes
363343*2^n-1     10226891            10293563            1M           8.00 MBytes
364903*2^n-1     10235903            10293167            1M           8.00 MBytes
365159*2^n-1     10237320            10293528            1M           8.00 MBytes
368411*2^n-1     10228826            10293242            1M           8.00 MBytes
371893*2^n-1     10236699            10293795            1M           8.00 MBytes
384539*2^n-1     10228856            10293424            1M           8.00 MBytes
386801*2^n-1     10239262            10292854            1M           8.00 MBytes
397027*2^n-1     10244717            10292885            1M           8.00 MBytes
398023*2^n-1     —                   —                   —            —
402539*2^n-1     —                   —                   —            —
409753*2^n-1     10237907            10293419            1M           8.00 MBytes
415267*2^n-1     —                   —                   —            —
428639*2^n-1     —                   —                   —            —
444637*2^n-1     10235469            10292709            1M...1120K   8.00...8.75 MBytes
470173*2^n-1     10236579            10293639            1M...1120K   8.00...8.75 MBytes
474491*2^n-1     10242506            10293626            1120K        8.75 MBytes
477583*2^n-1     10255747            10293619            1120K        8.75 MBytes
485557*2^n-1     10234077            10293657            1120K        8.75 MBytes
494743*2^n-1     10236927            10293447            1120K        8.75 MBytes
502573*2^n-1     —                   —                   —            —
With this info, we can perform offline tests with specific candidates and specific FFT sizes which we now know beforehand. (Earlier, I used to download one or a few random workunits and used those to set up offline tests.)
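The probing loop described above can be sketched in Python. This is a sketch, not the author's batch file: the llr switches used here (-d to log to stdout, -q"expression" for a single candidate) and the exact wording of the FFT-length line are assumptions; adjust both to your llr build.

```python
import re
import subprocess

# Assumed output format: llr prints a line like
#   "Using zero-padded FMA3 FFT length 1120K"
# (the exact wording is a guess; adjust the pattern to your llr build).
FFT_RE = re.compile(r"FFT length (\d+[KM])")

def parse_fft_length(line: str):
    """Return the FFT length token (e.g. '1120K') if the line reports one."""
    m = FFT_RE.search(line)
    return m.group(1) if m else None

def probe(llr_path: str, candidate: str, n: int):
    """Start llr on k*2^n-1, read its output until the FFT length appears,
    then kill the process. The -d/-q switches are assumptions."""
    expr = candidate.replace("n", str(n))  # e.g. "2293*2^n-1" -> "2293*2^10238231-1"
    proc = subprocess.Popen([llr_path, "-d", f"-q{expr}"],
                            stdout=subprocess.PIPE, text=True)
    try:
        for line in proc.stdout:
            fft = parse_fft_length(line)
            if fft:
                return fft
    finally:
        proc.kill()
    return None

# Hypothetical usage, one probe per (candidate, n) pair from the table:
# probe("llr64.exe", "2293*2^n-1", 10238231)
```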

I will simply configure my Xeon E5's, with their large shared L3 caches (near-uniform latency between each core and each cache segment), such that they work best for work units with 1120K FFT length. My Zen 2 based computers, which have a number of 16 MB L3 caches (which are "non-inclusive"), need a bit of testing. I'll start with 1120K; if I have time, I'll continue testing 1M, then 960K.
 

StefanR5R

Diamond Member
Dec 10, 2016
3,561
3,842
106
The tests which ran last night on the Zen 2 based computer showed best throughput with twice as many concurrent tasks (of the 1120K = 8.75 MBytes flavor) as there are CCXs (with 16 MB L3$ each). This means that I don't need any tests with 1M and smaller FFT sizes, for now at least. But it raises other questions, chiefly why the measured sweet spot is where the FFT data already slightly spill over the caches. Some potential factors come to mind:
  • The LLR program noted that it actually chose zero-padded FMA3 FFTs. Maybe the algorithm and hardware are good at keeping the impact of the zero-padded data portion very low.
    (BTW, from what I remember, the 768K...1M work units in post #90 did not have zero-padding in play.)
  • The Zen 2 level-3 cache is non-inclusive. While an inclusive L3$ always holds copies of everything which is in the level-2 caches, a non-inclusive L3$ does not, most of the time. This saves space.
    (Data which is shared between cores may perhaps still be kept redundantly in the L3$. But just now I don't recall whether this is part of the caching policies in Zen 2.)
  • Relative to core count and core clocks, this computer has got a good amount of RAM bandwidth. (There are four dual-channel DDR4-3200 controllers per 32 cores.) This should help somewhat with moving some of the hot data onto and off the processors.
  • The Linux kernel which I am using is actively maintained by the distributor, and contains a bunch of backports from newer kernels. (I haven't looked into what these backports are precisely.) But it is on a rather old release of the mainline kernel. Maybe this kernel version is not particularly strict in concentrating all threads of a process in the same CCX. If so, the drop in performance when the workload of the process begins to exceed the resources of one CCX would be harder to notice.
  • EDIT: I ran rather short tests, until just 3 % completion. Impossible to say whether longer running tests would show a different sweet spot without actually trying. But that would cost precious computer time.
If I still feel up to it when I get home, I might update the OS, and thus get a kernel on a much newer base, to see if it makes a difference.

Another point to look into — not so much for my own immediate purposes but in general interest — would be to switch from 4-core CCXs to 3-core CCXs in the BIOS. These should perform rather differently.
 

VirtualLarry

Lifer
Aug 25, 2001
48,792
5,302
126
Thanks for the heads-up. Had a busy day today, almost forgot the start of the race!

Anyway, I'm in with both Ryzen R5 3600 CPUs, six tasks per CPU, 2 threads per task. Hopefully that's correct? Maybe I should have gone with 4 tasks per CPU, given it's a 6C/12T, dual-CCX part with 3 cores per CCX?
 

StefanR5R

Diamond Member
Dec 10, 2016
3,561
3,842
106
According to my tests on the Linux dual-socket computer, 3 tasks per CPU x 2 threads per task should give the best throughput on a 6c/12t Ryzen (of the Zen 2 generation, that is). A few other configs come close to this optimum.

The config for the shortest task durations is 1 task x 12 threads, with 89 % of the throughput of the optimum config but task completion time cut to a third.

My extrapolations from the dual-socket computer to Ryzen may be somewhat off, though, because the smaller processor, with relatively more TDP (PPT) headroom, may behave slightly differently.
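Translated into BOINC terms, that recommendation could look like the app_config.xml below. This is a hedged sketch: the app name llrTRP and the -t thread switch are assumptions based on PrimeGrid's usual naming; check client_state.xml on your host for the exact app name before using it.

```xml
<app_config>
   <app>
      <name>llrTRP</name>
      <!-- 3 concurrent tasks on a 6c/12t Ryzen, per the tests above -->
      <max_concurrent>3</max_concurrent>
   </app>
   <app_version>
      <app_name>llrTRP</app_name>
      <plan_class>mt</plan_class>
      <avg_ncpus>2</avg_ncpus>
      <!-- 2 LLR threads per task (switch name assumed) -->
      <cmdline>-t 2</cmdline>
   </app_version>
</app_config>
```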
 

StefanR5R

Diamond Member
Dec 10, 2016
3,561
3,842
106
From the SGS-LLR challenge in April:
dual 14c BDW-EP @2.9 GHz . . . 126 kPPD . . . 385 W (0.33 kPPD/W)
dual 22c BDW-EP @2.6 GHz . . . 177 kPPD . . . 462 W (0.38 kPPD/W)
dual 32c Rome @≈2.4 GHz . . . . 283 kPPD . . . 297 W (0.95 kPPD/W)
Same procedure for TRP-LLR:
Each of the hosts with a power meter in front of it now has 20 or more validated tasks. I took the last 20 valid tasks and got this:

dual 14c BDW-EP @2.9 GHz . . . 226 kPPD . . . 445 W (0.51 kPPD/W)
dual 22c BDW-EP @2.6 GHz . . . 294 kPPD . . . 530 W (0.56 kPPD/W)
dual 32c Rome @≈2.3 GHz . . . . 467 kPPD . . . 360 W (1.30 kPPD/W)

These hosts are of course all configured for what gets them the best throughput, as far as I could determine. In the future I may need to configure the BDW-EPs for best efficiency instead, e.g. disable the BIOS option which keeps them running at all-core turbo all the time (here: AVX2 all-core turbo, which is slightly lower than the normal turbo).
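The kPPD/W column is simply throughput divided by wall power; for example, for the Rome host:

```python
# Efficiency = throughput (kilo-points per day) per watt at the wall.
def kppd_per_watt(kppd: float, watts: float) -> float:
    return kppd / watts

# Dual 32c Rome during TRP-LLR: 467 kPPD at 360 W.
print(round(kppd_per_watt(467, 360), 2))  # → 1.3
```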

Some users like it a lot if they submit the first valid result of a WU. While this doesn't concern me much myself, here is how the same three hosts are doing in this regard, 22 hours into the challenge:
host . . . average task duration . . . number of results validated second . . . number of results either validated first or pending validation* (percentage of all results):
dual 14c BDW-EP @2.9 GHz . . . 4.7 h . . . . 9 . . . 23 (72 %)
dual 22c BDW-EP @2.6 GHz . . . 5.4 h . . . 12 . . . 28 (70 %)
dual 32c Rome @≈2.3 GHz . . . . 8.9 h . . . 20 . . . 44 (69 %)

*) I assume that all results which are pending validation will end up as validated 1st.

So even though none of these hosts are speed demons when it comes to task durations, they are nevertheless the doublechecker less than a third of the time. Maybe that's because large arrays of rented VMs and university datacenters are playing in this challenge, set up with even longer run times.

Of course, PrimeGrid's BOINC scheduler is modified to randomly delay the distribution of one task of the initial replication. But if this were the factor with the biggest impact, my rate of 1st validations should be nearer to 50 %.
 

Ken g6

Programming Moderator, Elite Member
Moderator
Dec 11, 1999
15,058
2,036
55
Day 1 stats:

Rank___Credits____Username
5______1614500____xii5ku
22_____591730_____crashtech
31_____394983_____biodoc
37_____316069_____emoga
77_____139086_____Icecold-Team Anandtech
89_____114059_____Orange Kid
93_____104095_____VirtualLarry
110____90371______Ken_g6
203____22086______zzuupp
268____4786_______waffleironhead

Rank__Credits____Team
3_____4948624____Aggie The Pew
4_____4916740____Czech National Team
5_____4126670____SETI.Germany
6_____3391769____TeAm AnandTech
7_____3149811____Ultimate Chaos
8_____2451368____[H]ard|OCP
9_____1650157____AMD Users

We're doing quite well so far. :)
 

StefanR5R

Diamond Member
Dec 10, 2016
3,561
3,842
106
I just looked up which FFT lengths are current:
  • The min and max candidates and n's from stats_trp_llr.php still give the same range of FFT lengths as on Wednesday.
  • slots/*/stderr.txt in my BOINC data directories show the following mixture:
    3 % 768K, 1 % 864K, 9 % 896K, 15 % 960K, 32 % 1M, 40 % zero padded 1120K.
I am idly wondering: how well do my tests with 100 % zero-padded 1120K represent this actual mixed workload? A test suite with a similar mixture of fixed WUs could be set up, but defining performance for such a mixed workload would be difficult.

With a uniform workload of multiple same-sized WUs, I can simply run all concurrent tasks until completion or until a given progress percentage and look at how long this took. Or I could run the tasks for a given time and look which progress percentage they reached. But with a workload of mixed-sized WUs, I think I would first need to determine weighting factors according to "progress percentage per hour" for each WU size and for each thread count per task. — Sounds like too much effort to pursue this train of thought. ;-)
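For what it's worth, the weighting idea can be put in a few lines. All progress figures below are invented placeholders; they would have to be measured per FFT size and per thread count, as described above.

```python
# Fraction of a task completed per task slot and hour, per FFT size,
# for one candidate config. These numbers are invented placeholders.
progress_per_hour = {"768K": 0.30, "1M": 0.20, "1120K": 0.16}

# Share of each size in the incoming work (simplified from the survey above).
mixture = {"768K": 0.03, "1M": 0.32, "1120K": 0.40}
total = sum(mixture.values())
weights = {size: share / total for size, share in mixture.items()}

# A task of size s takes 1/progress_per_hour[s] hours, so the mean task
# duration is the weighted average of those reciprocals.
mean_hours_per_task = sum(w / progress_per_hour[s] for s, w in weights.items())
tasks_per_slot_hour = 1 / mean_hours_per_task
```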
 

Ken g6

Programming Moderator, Elite Member
Moderator
Dec 11, 1999
15,058
2,036
55
Day 2 stats:

Rank___Credits____Username
4______3716283____xii5ku
23_____1408702____crashtech
37_____842324_____biodoc
41_____706266_____emoga
76_____327831_____Icecold-Team Anandtech
94_____263073_____Orange Kid
103____229197_____VirtualLarry
118____197770_____Ken_g6
211____54250______zzuupp
260____23299______10esseeTony
286____16794______waffleironhead

Rank__Credits____Team
3_____12765070___Czech National Team
4_____11860490___Aggie The Pew
5_____9507575____SETI.Germany
6_____7785794____TeAm AnandTech
7_____5545122____Ultimate Chaos
8_____5542837____[H]ard|OCP
9_____4633744____AMD Users

We're still doing well. @StefanR5R seems to be as high as he can go, unless he has a surprise dump. ;)
 
