
PrimeGrid Challenges 2020


VirtualLarry

Lifer
Aug 25, 2001
48,792
5,302
126
I'm gunning for a top-59 finish in this Challenge personally; we'll see if I can rise up the ranks fast enough. Not going to be easy.

I put another Haswell quad (i5-4670), 16GB of DDR3-1600, and a GTX 1650 D5 4GB on PrimeGrid. Crossing my fingers that I don't trip a breaker.
 

VirtualLarry

Lifer
Aug 25, 2001
48,792
5,302
126
Didn't know that you could "bunker" PrimeGrid. Unless the point is to sandbag at the beginning.

Well, congrats to TSBT. I feel a small personal victory over them, though, as I came in 58th individually, just beating out two of their members at 59th and 60th respectively. It was close, for me at least.
 

Ken g6

Programming Moderator, Elite Member
Moderator
Dec 11, 1999
15,058
2,036
55
Day 3 stats:

Rank___Credits____Username
13_____43418480___Howdy
16_____39181133___xii5ku
31_____23556548___biodoc
37_____20967620___crashtech
58_____13480629___VirtualLarry
96_____8636502____Orange Kid
98_____8582566____Lane42
100____8184788____emoga
164____4014861____10esseeTony
218____2417007____far
244____1850679____Ken_g6
259____1597854____zzuupp
324____798927_____waffleironhead
450____246083_____geecee

Rank__Credits____Team
4_____225128864___SETI.Germany
5_____199526119___Aggie The Pew
6_____184150988___The Scottish Boinc Team
7_____176933677___TeAm AnandTech
8_____140803299___Sicituradastra.
9_____135011921___[H]ard|OCP
10____71758477___Crunching@EVGA

Well, that's a surprise, but I guess we are 7AA7. ;)
 

VirtualLarry

Lifer
Aug 25, 2001
48,792
5,302
126
Nice, Larry. :) I didn't know you had that many computers.
The "little" ones (quads), are the Fortnite-capable gaming PCs from my FS thread here, with GTX 1650 and 1050 cards. My "big" PCs are 3x 6C/12T Ryzen + GTX 1660 ti. And a mining rig with 3x GTX 1660 ti.
 

StefanR5R

Diamond Member
Dec 10, 2016
3,561
3,842
106
The next PG challenge will be the Summer LOLympics challenge ;-) and it runs for a whole week, July 24–31.

It's going to be TRP-LLR, which was used in previous PG challenges in May 2019 and October 2017, for example. Neither my older PG CPU benchmarks thread nor my more recent locally stored test protocols contain data on TRP. But I found notes which I left in my app_config.xml's in May 2019. At that time, FFTs with a 5.6...7.5 MB memory footprint were in circulation.

According to Tony's calendar, there will be just a few days of spare time available for performance testing before the challenge. Personally, I won't need to test my Xeons, but I should test my newer toys.
 

Ken g6

Programming Moderator, Elite Member
Moderator
Dec 11, 1999
15,058
2,036
55
Bump for the next challenge in almost exactly 5 days!
 

Ken g6

Programming Moderator, Elite Member
Moderator
Dec 11, 1999
15,058
2,036
55
Is this next challenge CPU, GPU, or both? I can be in with both Ryzen R5 3600 CPUs.
That turns out to be a good question, even though the answer is just CPU.

There is a GPU LLR app, but it doesn't work on this project.

Even if this were a project where it might be usable, it wouldn't work on AMD GPUs.

But, yes, your Ryzens should work well. :)
 

StefanR5R

Diamond Member
Dec 10, 2016
3,561
3,842
106
From a quick look through some hosts, current TRP-LLR work units have FFT lengths of 800, 864, 896, 960, 1000, 1008, 1024 K. (That's 6.25...8.0 MBytes of hot data.)

But I also saw a WU which was created eight days ago and had FFT length 1120K (8.75 MBytes) on a Haswell and 1152K (9.0 MBytes) on a Skylake-SP.

That's a very awkward point in this project at which to start a challenge, at least for owners of computers whose cache size is a multiple of 8 MB.

Here are the candidates which are currently being tested:
http://www.primegrid.com/stats_trp_llr.php
 

StefanR5R

Diamond Member
Dec 10, 2016
3,561
3,842
106
Here are the candidates which are currently being tested:
http://www.primegrid.com/stats_trp_llr.php
I wrote a little batch file which starts llr.exe with each candidate, once with "min n in progress" and once with "max n in progress", and kills llr.exe as soon as it has printed the chosen FFT length. On a Haswell using FMA3, this gave the following FFT lengths:
Candidate        min n in progress   max n in progress   FFT length   FFT size
2293*2^n-1       10238231            10293647            768K         6.00 MBytes
9221*2^n-1       10225014            10293678            800K         6.25 MBytes
23669*2^n-1      10238148            10293444            864K         6.75 MBytes
31859*2^n-1      10242276            10293348            864K         6.75 MBytes
38473*2^n-1      10238195            10292387            896K         6.75 MBytes
40597*2^n-1      —                   —                   —            —
46663*2^n-1      10242911            10293479            896K         6.75 MBytes
65531*2^n-1      —                   —                   —            —
67117*2^n-1      10235797            10293637            896K         6.75 MBytes
74699*2^n-1      10228636            10293796            960K         7.00 MBytes
81041*2^n-1      10255714            10292974            960K         7.00 MBytes
93839*2^n-1      10222272            10293720            960K         7.00 MBytes
97139*2^n-1      10238220            10293388            960K         7.00 MBytes
107347*2^n-1     10238269            10293093            960K         7.00 MBytes
121889*2^n-1     10237760            10293632            960K         7.00 MBytes
123547*2^n-1     —                   —                   —            —
129007*2^n-1     10231433            10293689            960K         7.00 MBytes
141941*2^n-1     —                   —                   —            —
143047*2^n-1     10237573            10293157            960K         7.00 MBytes
146561*2^n-1     10226890            10293394            960K         7.00 MBytes
161669*2^n-1     10251344            10293752            960K         7.00 MBytes
162941*2^n-1     —                   —                   —            —
191249*2^n-1     —                   —                   —            —
192971*2^n-1     10237258            10293418            1M           8.00 MBytes
206039*2^n-1     10238128            10292296            1M           8.00 MBytes
206231*2^n-1     10237658            10293314            1M           8.00 MBytes
215443*2^n-1     10238099            10293795            1M           8.00 MBytes
226153*2^n-1     10236035            10293683            1M           8.00 MBytes
234343*2^n-1     10237751            10293503            1M           8.00 MBytes
245561*2^n-1     10240622            10292750            1M           8.00 MBytes
250027*2^n-1     10238089            10293673            1M           8.00 MBytes
252191*2^n-1     —                   —                   —            —
273809*2^n-1     —                   —                   —            —
304207*2^n-1     —                   —                   —            —
315929*2^n-1     10225740            10293660            1M           8.00 MBytes
319511*2^n-1     10223182            10293502            1M           8.00 MBytes
324011*2^n-1     10215490            10293250            1M           8.00 MBytes
325123*2^n-1     10227191            10293791            1M           8.00 MBytes
327671*2^n-1     10237886            10293598            1M           8.00 MBytes
336839*2^n-1     10215620            10293812            1M           8.00 MBytes
342847*2^n-1     10239481            10293337            1M           8.00 MBytes
344759*2^n-1     10237104            10293688            1M           8.00 MBytes
353159*2^n-1     —                   —                   —            —
362609*2^n-1     10226880            10293768            1M           8.00 MBytes
363343*2^n-1     10226891            10293563            1M           8.00 MBytes
364903*2^n-1     10235903            10293167            1M           8.00 MBytes
365159*2^n-1     10237320            10293528            1M           8.00 MBytes
368411*2^n-1     10228826            10293242            1M           8.00 MBytes
371893*2^n-1     10236699            10293795            1M           8.00 MBytes
384539*2^n-1     10228856            10293424            1M           8.00 MBytes
386801*2^n-1     10239262            10292854            1M           8.00 MBytes
397027*2^n-1     10244717            10292885            1M           8.00 MBytes
398023*2^n-1     —                   —                   —            —
402539*2^n-1     —                   —                   —            —
409753*2^n-1     10237907            10293419            1M           8.00 MBytes
415267*2^n-1     —                   —                   —            —
428639*2^n-1     —                   —                   —            —
444637*2^n-1     10235469            10292709            1M...1120K   8.00...8.75 MBytes
470173*2^n-1     10236579            10293639            1M...1120K   8.00...8.75 MBytes
474491*2^n-1     10242506            10293626            1120K        8.75 MBytes
477583*2^n-1     10255747            10293619            1120K        8.75 MBytes
485557*2^n-1     10234077            10293657            1120K        8.75 MBytes
494743*2^n-1     10236927            10293447            1120K        8.75 MBytes
502573*2^n-1     —                   —                   —            —
With this info, we can perform offline tests with specific candidates and specific FFT sizes which we now know beforehand. (Earlier, I used to download one or a few random workunits and used those to set up offline tests.)
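The probing loop described above can be sketched in Python. This is a sketch, not the author's batch file: the llr switches used here (-d to log to stdout, -q"expression" for a single candidate) and the exact wording of the FFT-length line are assumptions; adjust both to your llr build.

```python
import re
import subprocess

# Assumed output format: llr prints a line like
#   "Using zero-padded FMA3 FFT length 1120K"
# (the exact wording is a guess; adjust the pattern to your llr build).
FFT_RE = re.compile(r"FFT length (\d+[KM])")

def parse_fft_length(line: str):
    """Return the FFT length token (e.g. '1120K') if the line reports one."""
    m = FFT_RE.search(line)
    return m.group(1) if m else None

def probe(llr_path: str, candidate: str, n: int):
    """Start llr on k*2^n-1, read its output until the FFT length appears,
    then kill the process. The -d/-q switches are assumptions."""
    expr = candidate.replace("n", str(n))  # e.g. "2293*2^n-1" -> "2293*2^10238231-1"
    proc = subprocess.Popen([llr_path, "-d", f"-q{expr}"],
                            stdout=subprocess.PIPE, text=True)
    try:
        for line in proc.stdout:
            fft = parse_fft_length(line)
            if fft:
                return fft
    finally:
        proc.kill()
    return None

# Hypothetical usage, one probe per (candidate, n) pair from the table:
# probe("llr64.exe", "2293*2^n-1", 10238231)
```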

I will simply configure my Xeon E5's, with their large shared L3 caches (near-uniform latency between each core and each cache segment), such that they work best for work units with 1120K FFT length. My Zen 2 based computers, which have a number of 16 MB L3 caches (which are "non-inclusive"), need a bit of testing. I'll start with 1120K; if I have time, I'll continue testing 1M, then 960K.
 

StefanR5R

Diamond Member
Dec 10, 2016
3,561
3,842
106
The tests which ran last night on the Zen 2 based computer showed best throughput with twice as many concurrent tasks (of the 1120K = 8.75 MBytes flavor) as there are CCXs (with 16 MB L3$ each). This means that I don't need any tests with 1M and smaller FFT sizes, for now at least. But it raises other questions, chiefly why the measured sweet spot is where the FFT data already slightly spill over the caches. Some potential factors come to mind:
  • The LLR program noted that it actually chose zero-padded FMA3 FFTs. Maybe the algorithm and hardware are good at keeping the impact of the zero-padded data portion very low.
    (BTW, from what I remember, the 768K...1M work units in post #90 did not have zero-padding in play.)
  • The Zen 2 level-3 cache is non-inclusive. While an inclusive L3$ always holds copies of everything which is in the level-2 caches, a non-inclusive L3$ does not, most of the time. This saves space.
    (Data which is shared between cores may perhaps still be kept redundantly in the L3$. But just now I don't recall whether this is part of the caching policies in Zen 2.)
  • Relative to core count and core clocks, this computer has got a good amount of RAM bandwidth. (There are four dual-channel DDR4-3200 controllers per 32 cores.) This should help somewhat with moving some of the hot data onto and off the processors.
  • The Linux kernel which I am using is actively maintained by the distributor, and contains a bunch of backports from newer kernels. (I haven't looked into what these backports are precisely.) But it is on a rather old release of the mainline kernel. Maybe this kernel version is not particularly strict in concentrating all threads of a process in the same CCX. If so, the drop in performance when the workload of the process begins to exceed the resources of one CCX would be harder to notice.
  • EDIT: I ran rather short tests, until just 3 % completion. Impossible to say whether longer running tests would show a different sweet spot without actually trying. But that would cost precious computer time.
If I still feel up to it when I get home, I might update the OS, and thus get a kernel on a much newer base, to see if it makes a difference.

Another point to look into — not so much for my own immediate purposes but in general interest — would be to switch from 4-core CCXs to 3-core CCXs in the BIOS. These should perform rather differently.
 

VirtualLarry

Lifer
Aug 25, 2001
48,792
5,302
126
Thanks for the heads-up. Had a busy day today, almost forgot the start of the race!

Anyway, I'm in with both Ryzen R5 3600 CPUs, six tasks per CPU, 2 threads per task. Hopefully that's correct? Maybe I should have gone with 4 tasks per CPU, given it's a 6C/12T, dual-CCX part with 3 cores per CCX?
 

StefanR5R

Diamond Member
Dec 10, 2016
3,561
3,842
106
According to my tests on the Linux dual-socket computer, 3 tasks per CPU x 2 threads per task should give the best throughput on a 6c/12t Ryzen (of the Zen 2 generation, that is). A few other configs come close to this optimum.

The config for the shortest task durations is 1 task x 12 threads, with 89 % of the throughput of the optimum config but task completion time cut to a third.

My extrapolations from the dual-socket computer to Ryzen may be somewhat off, though, because the smaller processor, with relatively more TDP (PPT) headroom, may behave slightly differently.
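Translated into BOINC terms, that recommendation could look like the app_config.xml below. This is a hedged sketch: the app name llrTRP and the -t thread switch are assumptions based on PrimeGrid's usual naming; check client_state.xml on your host for the exact app name before using it.

```xml
<app_config>
   <app>
      <name>llrTRP</name>
      <!-- 3 concurrent tasks on a 6c/12t Ryzen, per the tests above -->
      <max_concurrent>3</max_concurrent>
   </app>
   <app_version>
      <app_name>llrTRP</app_name>
      <plan_class>mt</plan_class>
      <avg_ncpus>2</avg_ncpus>
      <!-- 2 LLR threads per task (switch name assumed) -->
      <cmdline>-t 2</cmdline>
   </app_version>
</app_config>
```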
 

StefanR5R

Diamond Member
Dec 10, 2016
3,561
3,842
106
From the SGS-LLR challenge in April:
dual 14c BDW-EP @2.9 GHz . . . 126 kPPD . . . 385 W (0.33 kPPD/W)
dual 22c BDW-EP @2.6 GHz . . . 177 kPPD . . . 462 W (0.38 kPPD/W)
dual 32c Rome @≈2.4 GHz . . . . 283 kPPD . . . 297 W (0.95 kPPD/W)
Same procedure for TRP-LLR:
Each of the hosts with a power meter in front of it now has 20 or more validated tasks. I took the last 20 valid tasks and got this:

dual 14c BDW-EP @2.9 GHz . . . 226 kPPD . . . 445 W (0.51 kPPD/W)
dual 22c BDW-EP @2.6 GHz . . . 294 kPPD . . . 530 W (0.56 kPPD/W)
dual 32c Rome @≈2.3 GHz . . . . 467 kPPD . . . 360 W (1.30 kPPD/W)

These hosts are of course all configured for what gets them the best throughput, as far as I could determine. In the future I may need to configure the BDW-EPs for best efficiency instead, e.g. disable the BIOS option which keeps them running at all-core turbo all the time (here: AVX2 all-core turbo, which is slightly lower than the normal turbo).
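The kPPD/W column is simply throughput divided by wall power; for example, for the Rome host:

```python
# Efficiency = throughput (kilo-points per day) per watt at the wall.
def kppd_per_watt(kppd: float, watts: float) -> float:
    return kppd / watts

# Dual 32c Rome during TRP-LLR: 467 kPPD at 360 W.
print(round(kppd_per_watt(467, 360), 2))  # → 1.3
```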

Some users like it a lot if they submit the first valid result of a WU. While this doesn't concern me much myself, here is how the same three hosts are doing in this regard, 22 hours into the challenge:
host . . . average task duration . . . number of results validated second . . . number of results either validated first or pending validation* (percentage of all results):
dual 14c BDW-EP @2.9 GHz . . . 4.7 h . . . . 9 . . . 23 (72 %)
dual 22c BDW-EP @2.6 GHz . . . 5.4 h . . . 12 . . . 28 (70 %)
dual 32c Rome @≈2.3 GHz . . . . 8.9 h . . . 20 . . . 44 (69 %)

*) I assume that all results which are pending validation will end up as validated 1st.

So even though none of these hosts are speed demons when it comes to task durations, they are nevertheless the doublechecker less than a third of the time. Maybe that's because large arrays of rented VMs and university datacenters are playing in this challenge, set up with even longer run times.

Of course, PrimeGrid's BOINC scheduler is modified to randomly delay the distribution of one task of the initial replication. But if this were the factor with the biggest impact, my rate of 1st validations should be nearer to 50 %.
 

Ken g6

Programming Moderator, Elite Member
Moderator
Dec 11, 1999
15,058
2,036
55
Day 1 stats:

Rank___Credits____Username
5______1614500____xii5ku
22_____591730_____crashtech
31_____394983_____biodoc
37_____316069_____emoga
77_____139086_____Icecold-Team Anandtech
89_____114059_____Orange Kid
93_____104095_____VirtualLarry
110____90371______Ken_g6
203____22086______zzuupp
268____4786_______waffleironhead

Rank__Credits____Team
3_____4948624____Aggie The Pew
4_____4916740____Czech National Team
5_____4126670____SETI.Germany
6_____3391769____TeAm AnandTech
7_____3149811____Ultimate Chaos
8_____2451368____[H]ard|OCP
9_____1650157____AMD Users

We're doing quite well so far. :)
 

StefanR5R

Diamond Member
Dec 10, 2016
3,561
3,842
106
I just looked up which FFT lengths are current:
  • The min and max candidates and n's from stats_trp_llr.php still give the same range of FFT lengths as on Wednesday.
  • slots/*/stderr.txt in my BOINC data directories show the following mixture:
    3 % 768K, 1 % 864K, 9 % 896K, 15 % 960K, 32 % 1M, 40 % zero padded 1120K.
I am idly wondering: how well do my tests with 100 % zero-padded 1120K represent this actual mixed workload? A test suite with a similar mixture of fixed WUs could be set up, but defining performance for such a mixed workload would be difficult.

With a uniform workload of multiple same-sized WUs, I can simply run all concurrent tasks until completion or until a given progress percentage and look at how long this took. Or I could run the tasks for a given time and look which progress percentage they reached. But with a workload of mixed-sized WUs, I think I would first need to determine weighting factors according to "progress percentage per hour" for each WU size and for each thread count per task. — Sounds like too much effort to pursue this train of thought. ;-)
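For what it's worth, the weighting idea can be put in a few lines. All progress figures below are invented placeholders; they would have to be measured per FFT size and per thread count, as described above.

```python
# Fraction of a task completed per task slot and hour, per FFT size,
# for one candidate config. These numbers are invented placeholders.
progress_per_hour = {"768K": 0.30, "1M": 0.20, "1120K": 0.16}

# Share of each size in the incoming work (simplified from the survey above).
mixture = {"768K": 0.03, "1M": 0.32, "1120K": 0.40}
total = sum(mixture.values())
weights = {size: share / total for size, share in mixture.items()}

# A task of size s takes 1/progress_per_hour[s] hours, so the mean task
# duration is the weighted average of those reciprocals.
mean_hours_per_task = sum(w / progress_per_hour[s] for s, w in weights.items())
tasks_per_slot_hour = 1 / mean_hours_per_task
```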
 

Ken g6

Programming Moderator, Elite Member
Moderator
Dec 11, 1999
15,058
2,036
55
Day 2 stats:

Rank___Credits____Username
4______3716283____xii5ku
23_____1408702____crashtech
37_____842324_____biodoc
41_____706266_____emoga
76_____327831_____Icecold-Team Anandtech
94_____263073_____Orange Kid
103____229197_____VirtualLarry
118____197770_____Ken_g6
211____54250______zzuupp
260____23299______10esseeTony
286____16794______waffleironhead

Rank__Credits____Team
3_____12765070___Czech National Team
4_____11860490___Aggie The Pew
5_____9507575____SETI.Germany
6_____7785794____TeAm AnandTech
7_____5545122____Ultimate Chaos
8_____5542837____[H]ard|OCP
9_____4633744____AMD Users

We're still doing well. @StefanR5R seems to be as high as he can go, unless he has a surprise dump. ;)
 
