That's really interesting that that makes a difference in power usage.
Yes, the difference surprised me too. It's repeatable though. Very similarly, 64 single-threaded tasks pull ≈297 W, 32 dual-threaded tasks ≈335 W.
Looking it up, it seems PrimeGrid tasks take much longer when multithreaded, as long as each task fits in L3 cache. The difference only disappears once the L3 cache is saturated.
Correct; as long as there is enough cache, multithreaded LLR decreases both throughput and throughput-per-watt compared to single-threaded LLR.
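For anyone who wants to make this comparison concrete, here is a minimal sketch of the throughput-per-watt arithmetic. The power draws are the figures quoted above, but the task counts and runtimes are purely hypothetical placeholders, not measurements from this thread:

```python
# Hypothetical sketch: compare throughput-per-watt of single- vs. multithreaded LLR.
# The task counts below are illustrative placeholders; only the wattages come from
# the measurements quoted above.

def throughput_per_watt(tasks_completed: int, hours: float, avg_watts: float) -> float:
    """Tasks per kWh: tasks finished divided by energy consumed in kilowatt-hours."""
    kwh = avg_watts * hours / 1000.0
    return tasks_completed / kwh

# 64 concurrent single-threaded tasks at ~297 W (tasks/day purely illustrative):
single = throughput_per_watt(tasks_completed=640, hours=24.0, avg_watts=297.0)

# 32 concurrent 2-threaded tasks at ~335 W; each task finishes faster, but (per the
# discussion) not fast enough to offset the halved task count while cache suffices:
multi = throughput_per_watt(tasks_completed=580, hours=24.0, avg_watts=335.0)

print(f"single-threaded: {single:.1f} tasks/kWh")
print(f"2-threaded:      {multi:.1f} tasks/kWh")
```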
So maybe you can toy with bigger FFT sizes? It seems there is no job that could saturate the L3 cache on Zen 2, though?
Rome is a cluster of 4c/8t/16 MB L3$ core complexes, of course. It will be interesting to see how it fares once individual tasks exceed the L3 size of one complex.
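As a rough sketch of that cache-fit question: assuming the LLR FFT data is held as 8-byte doubles (my assumption, not something stated here), the working set is roughly the FFT length times 8 bytes, which can be compared against one CCX's 16 MB of L3:

```python
# Rough estimate: does an LLR task's FFT working set fit inside one Rome CCX's L3?
# Assumption: the FFT data is stored as 8-byte doubles, so the working set is about
# fft_length * 8 bytes (ignoring twiddle factors and other overhead).

CCX_L3_BYTES = 16 * 1024 * 1024  # 16 MB of L3 per 4c/8t core complex on Rome

def fits_in_ccx(fft_length: int) -> bool:
    working_set = fft_length * 8  # bytes, under the assumption above
    return working_set <= CCX_L3_BYTES

for fft_length in (256 * 1024, 1024 * 1024, 2048 * 1024, 4096 * 1024):
    ws_mb = fft_length * 8 / (1024 * 1024)
    print(f"FFT length {fft_length:>8}: ~{ws_mb:5.1f} MB "
          f"-> {'fits' if fits_in_ccx(fft_length) else 'exceeds'} one 16 MB CCX L3")
```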
A highly threaded application which makes good use of the AVX2 vector hardware is Folding@home's FahCore_a7. This application is based on GROMACS.
It has already been seen to show mediocre to poor performance on Naples and Rome compared with dual-processor Broadwell-EP. Spanning a single FahCore_a7 process across all hardware threads of both sockets works extremely well with this application on BDW-EP. I am in the process of trying different FahCore_a7 runs on Rome,
but first impressions show what I had already heard from Mark: it's not running as well on EPYC as it does on Intel-based servers.
(Edit: Folding@home performance is impossible to measure at the moment, due to extreme variations in TPF and PPD between work units.)
(Edit 2: Folding@home performance during the last several hours appeared comparable on dual-Rome and dual-Broadwell-EP, and more energy-efficient on dual-Rome; but it is not feasible to really measure this for now.)
(Edit 3: I am still not up to the difficult task of properly measuring FahCore_a7 performance. But from what I have seen so far, per-core per-clock performance of Rome should easily match, and more probably exceed, BDW-EP's in this application, and scaling over all threads and both sockets works just as well as on my dual BDW-EP computers. That means total FahCore_a7 performance on the dual EPYC is considerably higher because of its higher core count combined with the very good scaling of this application, all while the dual 32c EPYC draws only about as much power as a dual 14c BDW-EP.)
Did you check the CPU frequency and the CPU PWM temperature sensor during the AVX2/FMA3 load test?
Alas not. With 64 single-threaded SGS-LLR jobs, the core clocks were around 2.4 GHz. I did not check temperatures.
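For reference, here is one minimal way to sample per-core clocks under Linux and summarize them; this is just a possible approach, not a claim about how the figures in this thread were obtained:

```python
# Minimal sketch: sample the instantaneous per-core clocks from /proc/cpuinfo on Linux
# and report min/median/max. Readings are snapshots and will jitter under a bursty load.

import statistics

def core_clocks_mhz() -> list[float]:
    clocks = []
    with open("/proc/cpuinfo") as f:
        for line in f:
            if line.startswith("cpu MHz"):
                clocks.append(float(line.split(":")[1]))
    return clocks

clocks = core_clocks_mhz()
print(f"{len(clocks)} hardware threads: "
      f"min {min(clocks):.0f} MHz, median {statistics.median(clocks):.0f} MHz, "
      f"max {max(clocks):.0f} MHz")
```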
At the moment, I have a single 128-threaded FahCore_a7 process running (avx_256 enabled, load average is indeed ≈128). Core clocks, from a single observation, are 2.47...2.82 GHz with a median of 2.49 GHz. Temperatures are
- 25 °C ambient at the air intake,
- 52 °C and 49 °C at the CPUs (no doubt because I applied thermal paste somewhat inconsistently),
- 50...69 °C at VRMs.
The BMC is showing 100 °C as critical VRM temperature threshold, and 105 °C as nonrecoverable VRM temperature threshold.
I would like to see the same results with cTDP set to 180 W in the BIOS, if you have spare time and want to dedicate it to this.
In May, TeAm AnandTech will be taking part in the BOINC Pentathlon. This will probably be the moment for me to try a higher cTDP. ;-)
Do you intend to dedicate it to distributed computing applications, BOINC, Folding...?
Not exclusively, but so far I presume that this computer will spend more time on the Distributed Computing hobby than on work.