
PrimeGrid: CPU benchmarks


biodoc

Diamond Member
Dec 29, 2005
5,787
1,282
136
As one would expect, the total real time is the same as the real time of the instance which took longest, while the total user and sys times are the sum of the respective times of all concurrent instances.
For this thread, should I use the run time of the task with the longest run time or should I calculate an average run time?
 

StefanR5R

Diamond Member
Dec 10, 2016
3,870
4,197
136
So far, in my short llrDIV tests (run to 10 %) on systems which are otherwise idle (no desktop use, or even with the display-manager service stopped), the LLR instances finish within mere seconds of each other. In your 74-minute 2x4 test run to 100 %, one instance took 1m23s longer than the other. That is, the maximum run time in this example was 0.9 % longer than the average run time.

I have seen cases in the past in which one or two of concurrent tasks took awkwardly longer than the rest. I don't recall to which degree the end result then depended on whether the maximum or something else was used.

It would be nice if the script directly summarized something like min/avg/max. Until it does, it should be enough to document, for the sake of repeatability, whether we simply looked at the max, scrolled through the log to find the 50th percentile, or even did the work of calculating an average.
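
A min/avg/max summary of this kind could be sketched roughly as follows. This is not part of the benchmark script. The function name `summarize` and the two run times are hypothetical, chosen only to resemble the 2x4 example (two instances of about 74 minutes, one finishing 1m23s after the other).

```python
from statistics import mean, median


def summarize(run_times_sec):
    """Return min, average, median and max of the per-instance run times,
    plus how far the max lies above the average, in percent."""
    avg = mean(run_times_sec)
    longest = max(run_times_sec)
    return {
        "min": min(run_times_sec),
        "avg": avg,
        "median": median(run_times_sec),
        "max": longest,
        "max_over_avg_pct": 100 * (longest - avg) / avg,
    }


# Hypothetical run times in seconds, not measured values:
# one instance at 74 minutes, the other 1m23s (83 s) longer.
example = [74 * 60, 74 * 60 + 83]
stats = summarize(example)

print(f"min {stats['min']} s, avg {stats['avg']:.1f} s, max {stats['max']} s")
print(f"max is {stats['max_over_avg_pct']:.1f} % above avg")
```

With only two instances the median equals the average. With more concurrent instances, the median would correspond to the 50th percentile mentioned above.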
 

biodoc

Here are the PPS-Div results from a 3900X (PPT set at 105 watts) using the script from @StefanR5R. The tests were run to 10% completion. I took the LLR input and cobblestones from a task that was completed and validated.
llrDIV, FFT length 480K (FFT data size = 3.75 MB), 1,536.90 cobblestones
LLRINPUT="21*2^7823737+1"


Table 1. SMT on and used
# processes   # threads   run time (sec)   PPD
12            2           13810            115,384
8             3           9750             108,954
4             6           5200             102,145
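
The PPD column in Table 1 is reproduced by PPD = processes × cobblestones × 86,400 / run time, which suggests the run times listed are per-task times for a complete task (presumably extrapolated from the 10 % runs, though the post does not say so). A minimal sketch of that arithmetic, with `ppd` as a hypothetical helper name:

```python
SECONDS_PER_DAY = 86_400


def ppd(processes, credit_per_task, run_time_sec):
    """Points per day when `processes` tasks run concurrently and each
    task needs `run_time_sec` seconds to complete."""
    return processes * credit_per_task * SECONDS_PER_DAY / run_time_sec


COBBLESTONES = 1536.90  # credit of the validated task quoted in the post

# (processes, threads per process, run time in seconds) from Table 1
TABLE_1 = [(12, 2, 13810), (8, 3, 9750), (4, 6, 5200)]

for processes, threads, run_time in TABLE_1:
    result = ppd(processes, COBBLESTONES, run_time)
    print(f"{processes:>2} x {threads}: {result:,.0f} PPD")
```

Rounded to whole points, this gives 115,384, 108,954 and 102,145 PPD, matching the table.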
 

StefanR5R

I added dual-32-core EPYC results to post #92.

This server is like a cluster of 8 Ryzen 3700X, each one configured to an anemic 39 W PPT.

The throughput optimum on this computer is with single-threaded tasks, number of concurrent tasks = core count (that is, SMT not used). In contrast, @biodoc's 3700X @ 65 W (post #100) gets best throughput with dual-threaded tasks, number of concurrent tasks = core count (that is, SMT used). The difference between these computers probably comes from the frugal per-core power budget of the EPYC. Edit: maybe it's also a firmware thing. But if there is a relevant difference in the firmware, then it ultimately has to be connected with the different power budgets per core on EPYC and Ryzen.

To my surprise, the EPYC achieves best power efficiency (and at the same time near-optimum throughput) with 8-threaded tasks and use of SMT, that is, when one task uses as many threads as are available per CCX. With this high throughput, we can reasonably assume that the Linux kernel ran this workload with a 1:1 mapping of tasks to CCXs. Edit: but this alone does not explain why this worked so much better than 4-threaded tasks.

I also tested two configurations which were expected to give bad results, but I was curious to learn how bad:
  • With 128 single-threaded tasks, the processor caches are no longer sufficient for this workload. The result is that throughput goes down to 63 % of the throughput optimum, and power efficiency goes down to 58 % of the efficiency at the throughput optimum.
  • With 16-threaded tasks, one task needs to be executed on 2 CCXs. But while core-to-core communication within a CCX is extremely low-latency, communication across CCXs performs about as badly as communication via main memory. Apparently, the inter-thread communication in LLR2 is considerable, and therefore throughput goes down to 27 % of the throughput optimum, and power efficiency goes down to 32 % of the efficiency at the throughput optimum.
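
To make the task-to-CCX arithmetic behind these configurations explicit, here is a rough sketch. It assumes 4 cores (8 hardware threads) per CCX, which is inferred from "as many threads as available per CCX" for 8-threaded tasks, and it assumes the scheduler packs tasks onto CCXs ideally, which is not guaranteed. The function name `layout` is hypothetical.

```python
from math import ceil

CORES = 64                # dual 32-core EPYC, per the post
SMT_THREADS_PER_CORE = 2
CORES_PER_CCX = 4         # assumption inferred from 8 hardware threads per CCX
HW_THREADS_PER_CCX = CORES_PER_CCX * SMT_THREADS_PER_CORE


def layout(threads_per_task, use_smt):
    """Describe how tasks of a given width fit onto CCXs at full occupancy,
    assuming ideal packing by the scheduler."""
    slots_per_ccx = HW_THREADS_PER_CCX if use_smt else CORES_PER_CCX
    total_slots = (CORES // CORES_PER_CCX) * slots_per_ccx
    return {
        "tasks": total_slots // threads_per_task,
        "ccxs_per_task": ceil(threads_per_task / slots_per_ccx),
        "whole_tasks_per_ccx": slots_per_ccx // threads_per_task,
    }


for threads, smt in [(1, False), (1, True), (8, True), (16, True)]:
    print(f"{threads:>2}-threaded, SMT {'on' if smt else 'off'}: {layout(threads, smt)}")
```

Under these assumptions, 128 single-threaded tasks put 8 tasks on each CCX's shared cache, 8-threaded tasks map one task to one CCX, and a 16-threaded task cannot fit within a single CCX and spans two.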
 

biodoc

Here are llrDiv results from 3 different PPT settings (65, 55, 45 watts) on a 3700X. I did notice the CPU clock speed drop when switching from SMT "off" to SMT "on" at the lower power settings, which is reflected in the data. It's too bad I didn't have a power meter at the wall for this computer.

[Attached image 1610565623515.png: llrDiv results on the 3700X at PPT settings of 65, 55, and 45 watts]
 

crashtech

Diamond Member
Jan 4, 2013
9,612
1,509
126
I was a little worried that my results pointing to 8 tasks 1 thread being best on the 3700X were anomalous, since some earlier results pointed to 8/2 being best. Do you know what clock speed that CPU is maintaining at the 65W setting and full load?
 

biodoc

crashtech said: "Do you know what clock speed that CPU is maintaining at the 65W setting and full load?"
~3.5 GHz. If I were to do it again, I'd record the clock speed and compare the values with SMT "off" and "on". I'm pretty sure the clock drops about 10% with SMT "on" in the 45 and 55 watt runs but I didn't check the 65 watt run.
 

biodoc

I collected some additional llrDiv data on the 3800X using Stefan's script. 2 tasks (6 threads each) is clearly a bad choice since it forces inter-CCX communication via system RAM.

[Attached image 1610802922415.png: additional llrDiv results on the 3800X]
 
