PrimeGrid Challenges 2021


StefanR5R

Elite Member
Dec 10, 2016
5,509
7,816
136
For maximum throughput, the FMA units need to be utilized as much as possible. To achieve this:
  • Avoid overusing the caches, as discussed. If the active data cannot be cached completely, the memory controller(s) and RAM likely don't have the high throughput and low latency required to keep the FMA execution units fed. (A back-of-the-envelope sizing sketch follows after this list.)
  • Use a sufficiently high number of threads in total across all tasks, such that all FMA execution units are engaged.
  • Use as few threads per task as possible (while keeping the above in mind). More threads per task means more time spent on inter-thread synchronization and less on actual computation.
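Here is a rough sizing sketch in shell, assuming the ~8 bytes per FFT point that the sizes quoted in this thread imply; the cache and FFT-length numbers are illustrative, not a recommendation:

```bash
#!/bin/bash
# Back-of-the-envelope: how many LLR tasks fit into L3 cache at once?
# Assumes ~8 bytes of FFT data per point, which matches the sizes quoted
# in this thread (e.g. 480K points ~= 3.75 MiB).
fft_len_k=480                 # FFT length in K points (PPS-DIV currently)
l3_kib=$((64 * 1024))         # total L3 in KiB (e.g. Ryzen 3900X: 64 MiB)
task_kib=$((fft_len_k * 8))   # 1 K points x 8 B/point = 8 KiB per K points
echo "FFT data per task: ${task_kib} KiB"
echo "Tasks fitting in L3: $((l3_kib / task_kib))"   # ignores the 16 MiB/CCX split
```

With these inputs the answer is 17, which is in the same ballpark as running one single-threaded task per physical core on a 3900X or 3950X.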

Has anybody tested the optimal thread count (in terms of PPD) for this FFT length yet?
For Zen2, @biodoc's results from 2019 with <2 MB FFT data size may still be representative of what works best with <4 MB FFT data size, since Zen2 has 4 MB of L3$ per core and 1 FMA3 unit per core (or: 1 AVX2 pipeline per core). @biodoc measured PPD and PPD/W back then:
(@biodoc's tests in October 2019 showed that the efficiency optimum on Zen2 depends on the CPB configuration: #83, #86. These tests were done with a workunit of 240K FFT length, that is, just under 2 MB FFT data size at 8 bytes per FFT point.)
 

crashtech

Lifer
Jan 4, 2013
10,524
2,111
146
I wonder about the chances of the data size exceeding 4 MB.

Edited so as not to mislead.
 
Last edited:

StefanR5R

Elite Member
Dec 10, 2016
5,509
7,816
136
Do you mean exceeding 4 MB size (512K length)?
In September 2020, somebody posted a chart of how the FMA3 FFT length for PPS-DIV develops depending on n:
https://www.primegrid.com/forum_thread.php?id=9339&nowrap=true#143703

(Work for all of the tabulated k's is given out all the time, and the n's which make up the first column of this chart are slowly creeping higher and higher. I.e., we are slowly advancing through this chart line by line.)

While I have no idea how fast the progress during the challenge will be, it seems to me as if the next length of 576K is still a good way off. We are now squarely in the middle of the 480K range.

Edit:
In my older tests, I found that the performance drop when processor caches are overused is not a sharp one, but merely gradual.
 
Last edited:
  • Like
Reactions: crashtech

StefanR5R

Elite Member
Dec 10, 2016
5,509
7,816
136
In my older tests, I found that the performance drop when processor caches are overused is not a sharp one, but merely gradual.
Meanwhile I have posted the updated recipe for offline tests and test results for an i7-7700K @ 3.4 GHz with a PPS-DIV workunit which takes 480K long FMA3 FFTs. As expected, best throughput is had with 2 concurrent tasks on this processor; running "too many" tasks costs 7…12 % of throughput with 4 concurrent tasks, and 18 % with 8 concurrent tasks.

I should test this processor once more at a higher core clock but the same RAM speed. I suppose the benefit of keeping the entire workload cached would be even more pronounced then.
 
  • Like
Reactions: Ken g6

StefanR5R

Elite Member
Dec 10, 2016
5,509
7,816
136
As expected, the loss of throughput when the workload no longer fits into cache becomes more of an issue (a) when core clocks are higher at the same RAM performance, and (b) when the core count is higher relative to RAM performance.

Furthermore, the comparison of @biodoc's tests on Ryzen (3700X, 3900X) and my tests on EPYC (7452) indicates that the task count × thread count configuration for best throughput on Zen2 depends on the power budget. Evidently, SMT helps if the power budget per core is high, but may hurt if this budget is low. Though maybe processor firmware plays a role too.

What's weird is how well 8-threaded tasks perform on EPYC (when SMT is used). This configuration got me almost optimal throughput and the best power efficiency, combined with naturally short run times. I have no explanation why this ran so much better than 4-threaded tasks. To be sure, I reran this and some of my other tests, and the test durations and power meter readings were exactly the same in the re-runs.
 

Ken g6

Programming Moderator, Elite Member
Moderator
Dec 11, 1999
16,250
3,845
75
What's weird is how well 8-threaded tasks perform on EPYC (when SMT is used). This configuration got me almost optimal throughput and the best power efficiency, combined with naturally short run times. I have no explanation why this ran so much better than 4-threaded tasks. To be sure, I reran this and some of my other tests, and the test durations and power meter readings were exactly the same in the re-runs.
Maybe the threads are occasionally stepping on each other's cores? Maybe if the 4-threaded tasks were assigned to the proper cores, they'd work better? This script can assign PG tasks to cores. I'm not sure if they're the "proper" cores, but it would be interesting to see how much "clumping" or "spreading" helps.
 

StefanR5R

Elite Member
Dec 10, 2016
5,509
7,816
136
Right, I can try this with taskset added to my script.

(Note to self: lscpu shows that logical CPUs 0-31,64-95 belong to socket 0, logical CPUs 32-63,96-127 belong to socket 1. lscpu -e indicates that logical CPUs 0-3,64-67 belong to the first CCX, 4-7,68-71 to the second, and so on.)
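For example, a minimal "clump" pinning sketch along these lines; the process match "llrDIV" is inferred from the task names in this thread, and the CPU ranges come from the lscpu output above, so treat it as illustrative rather than the actual script:

```bash
#!/bin/bash
# Pin each running llrDIV process to its own CCX ("clump"), per the
# topology above: CCX i = logical CPUs 4i..4i+3 plus their SMT
# siblings 64+4i..64+4i+3 (socket 0). -a applies to all threads.
ccx=0
for pid in $(pgrep -f llrDIV); do
    lo=$((ccx * 4)); hi=$((lo + 3))
    taskset -a -c -p "${lo}-${hi},$((lo + 64))-$((hi + 64))" "$pid"
    ccx=$((ccx + 1))
done
```

Mapping consecutive tasks to far-apart CCXs instead would give the "spread" variant.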
 

crashtech

Lifer
Jan 4, 2013
10,524
2,111
146
I was having a difficult time visualizing that numbering scheme, but then I stumbled upon this diagram of a Threadripper 3970X, which helped it make sense, even though it "only" has half the cores:

[Image: AMD Ryzen Threadripper 3970X topology diagram]
 
  • Like
Reactions: Ken g6

Markfw

Moderator Emeritus, Elite Member
May 16, 2002
25,551
14,510
136
I joined with a 3900X @ 4 GHz, all cores (except one for the video card).
 
  • Wow
Reactions: Ken g6

TennesseeTony

Elite Member
Aug 2, 2003
4,209
3,634
136
www.google.com
I am partly in. If I translate StefanR5R correctly, Ryzen 3000 and above should be run at 8 hyper-threads. This means no threads for my GPUs, which is OK; the weather is supposed to be a bit warmer for a day or two.

I will give credit where it is due: PG sure knows how to push the envelope in regards to hardware utilization. I wonder how much attention other BOINC developers pay to PG? It seems they could learn a lot.
 

Markfw

Moderator Emeritus, Elite Member
May 16, 2002
25,551
14,510
136
I am partly in. If I translate StefanR5R correctly, Ryzen 3000 and above should be run at 8 hyper-threads. This means no threads for my GPUs, which is OK; the weather is supposed to be a bit warmer for a day or two.

I will give credit where it is due: PG sure knows how to push the envelope in regards to hardware utilization. I wonder how much attention other BOINC developers pay to PG? It seems they could learn a lot.
I have not read up on how to do the 8 hyper-threads. I have 23 threads running. I have surgery in the morning, so this will have to do. 90% at 13 minutes is the speed I am running at.
 

StefanR5R

Elite Member
Dec 10, 2016
5,509
7,816
136
If I translate StefanR5R correctly, Ryzen 3000 and above should be run at 8 hyper-threads.
@biodoc's results compared to mine show that my EPYC-based tests don't reflect very well what's best on Ryzen. Also, my 8-threaded results were obtained on a processor with 4C/CCX and are therefore not valid on processors with 3C/CCX.

For reference:

I joined with a 3900x@4 ghz, all cores. (except one for the video card)
I have 23 threads running. I have surgery in the morning, so this will have to do.
Performance will go up somewhat if you reduce this to 12 tasks at once. Yep, less will do more. :-) It will be beneficial to PrimeGrid and to the GPU (Folding, certainly).

Or even reduce to 11 tasks at once, which will put a small dent into PrimeGrid but will help keep up Folding performance even more.

The reason why 12 or 11 tasks at once will result in more tasks finished per day than 23 tasks at once is that the latter spend more time fighting for RAM access. With fewer tasks, the work stays mostly in the CPU cache, the available RAM bandwidth is used more effectively, and the AVX2 units can be fully utilized.

A quick way to switch over to 12 tasks is to set "Use at most 50 % of the CPUs" in the computing preferences, or "…49 %…" for 11 tasks at once. (The 3900X presents 24 logical CPUs: 24 × 50 % = 12, while 24 × 49 % = 11.76, which the client rounds down to 11.)

It is possible to configure more than 1 thread per task, either through the PrimeGrid web preferences or with an app_config.xml (see the sketch below). But you will probably get the most tasks/day done if you stick with single-threaded tasks and no more than 1 task per physical core (i.e., 12 tasks at once on a Ryzen 3900X).
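For the app_config.xml route, a minimal sketch might look like this; the app name llrDIV is inferred from the task names in this thread (e.g. llrDIV_358048398c_0), and the file goes into BOINC's projects/www.primegrid.com directory:

```xml
<!-- Hypothetical example: 2 threads per task, 6 tasks at once on a 3900X.
     The app name "llrDIV" is an assumption based on the task names. -->
<app_config>
  <app>
    <name>llrDIV</name>
    <max_concurrent>6</max_concurrent>
  </app>
  <app_version>
    <app_name>llrDIV</app_name>
    <cmdline>-t 2</cmdline>
    <avg_ncpus>2</avg_ncpus>
  </app_version>
</app_config>
```

After saving the file, use "read config files" in the BOINC manager for it to take effect.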

I wish you well for the surgery. Have a good recovery!
 
Last edited:

StefanR5R

Elite Member
Dec 10, 2016
5,509
7,816
136
BTW, I currently have just one 2xE5v4 in the race. When the first task finished and the client attempted to request more work while starting the first two file uploads, the PrimeGrid server's scheduler timed out, and the uploads timed out too. That's not unexpected. This server already struggled in an earlier race even before the introduction of LLR2, in a race on another subproject with comparably small tasks = many tasks in progress = high database load. (In that race, the server was unable to keep up the 15-minute period of race stats updates.) Now, with LLR2, they have not only high database load during a race, but also several orders of magnitude more network I/O and mass-storage I/O.
 
Last edited:

biodoc

Diamond Member
Dec 29, 2005
6,262
2,238
136
I decided on the following settings, which are in the "SMT on but not used" category. Each task is expected to use 3.75 MB of cache (480K FFT length × 8 bytes per point).

3700X (32 MB L3 cache):
8 simultaneous tasks x 1 core each.
8 x 3.75 MB = 30 MB L3 cache used.

3900X (64 MB L3 cache):
12 simultaneous tasks x 1 core each.
12 x 3.75 MB = 45 MB L3 cache used.

3950X (64 MB L3 cache):
16 simultaneous tasks x 1 core each.
16 x 3.75 MB = 60 MB L3 cache used.

The tasks on the 3950X are running significantly slower than on the other 2 computers (clock speeds are similar). I wonder why?

EDIT: the top computer is the 3950X, the 2nd is the 3900X and the bottom is the 3700X.

[Attached screenshot: BOINC task lists for the three hosts]
 
Last edited:

StefanR5R

Elite Member
Dec 10, 2016
5,509
7,816
136
If the 3950X and 3900X were operating at the same power limit, then 130 kPPD : 118 kPPD looks alright to me.

- - - - - - - -

I currently have just one 2xE5v4 in the race.
It's doing 230 kPPD after validation. But from what I believe I saw this morning, it drew 550 W or thereabouts.
 

geecee

Platinum Member
Jan 14, 2003
2,383
43
91
Hmm, I'm getting DLL initialization errors on one of my boxes for tasks from this challenge (example from the event log: PrimeGrid | Task llrDIV_358048398c_0 exited with a DLL initialization error.). Strange; I tried rebooting, setting my CPU usage to 100 %, and removing and re-adding the project, but I'm still getting the errors. I see some people posting about this in BOINC topics in general, but mostly in old threads with old BOINC versions. Has anyone seen this before? Thanks.

EDIT: Running the latest version of BOINC.
 
Last edited:

biodoc

Diamond Member
Dec 29, 2005
6,262
2,238
136
If the 3950X and 3900X were operating at the same power limit, then 130 kPPD : 118 kPPD looks alright to me.

Both are at 105 W PPT. The 3950X seems to be doing better at 8 tasks × 4 threads each, but I don't have enough data yet to be confident in that conclusion. The 3950X has been running on unbuffered ECC RAM for the last few days, but I don't think RAM speed/timings would have an impact in this situation.