PPD for my new 2990WX

Markfw

Moderator Emeritus, Elite Member
So, it's hardly OC'ed at all until I get my custom water loop done, but it's getting about 1,000 PPD on Rosetta. What do you think?
 

Orange Kid

Elite Member
I don't do Rosetta much, but for a low-scoring project, that sounds like a lot. :) :cool:
 

Markfw

Moderator Emeritus, Elite Member
Is there a good benchmark for a DC project that I could run? One based on the whole machine, not a single work unit. It runs 64 at a time, so I can't easily work out what one thread does.
 

biodoc

Diamond Member
Here's a list of your machines on Rosetta. If you look at the average credit column, you'll see that your 2990WX is up to 3,031.84 PPD. Unfortunately, it'll take a couple of weeks for that number to reach its maximum because of how recent average credit (RAC) is calculated.

Also, see this link for more details on your 2990WX. You have 254 valid work units, but 115 have failed due to compute errors. I would reduce the overclock to see if you can eliminate future compute errors. Your other machines don't have this problem.
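
As a rough illustration of why the climb takes a couple of weeks, here is a minimal Python sketch, assuming BOINC's usual exponentially smoothed average with a one-week half-life (rac_fraction is just an illustrative name):
Code:
# Fraction of its steady-state value that RAC has reached after d days,
# assuming an exponentially smoothed average with a one-week half-life.
def rac_fraction(days, half_life_days=7):
    return 1 - 0.5 ** (days / half_life_days)

for d in (7, 14, 21, 28):
    print(d, round(rac_fraction(d), 2))   # 7: 0.5, 14: 0.75, 21: 0.88, 28: 0.94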
 

StefanR5R

Elite Member
biodoc said:
You have 254 valid work units, but 115 have failed due to compute errors. I would reduce the overclock to see if you can eliminate future compute errors.
It could also be memory errors, e.g. too high a memory clock (or too tight timings), especially if these are 2 DIMMs per channel; less likely, defects in the new RAM sticks. @Markfw, have you run a memory check on the new sticks yet?

(On the rare occasions when I put a new PC into operation, I always try to set time aside to run a memory checker for ~half a day or even longer, as one of the first steps. I have become cautious since we had a PC in the office which crashed in certain programs. A RAM defect was found to be the cause, but only by a memory check that had to run for about a whole day before it tripped an error.)
 

StefanR5R

Elite Member
Markfw said:
So, it's hardly OC'ed at all until I get my custom water loop done, but it's getting about 1,000 PPD on Rosetta. What do you think?
It's more than 1,000 PPD, and it really should be. (Edit, or did you mean PPD per thread?)

I looked at your AMD hosts. The following calculation assumes that none of the hosts had any errors. (Including errors, PPD are of course lower.) It also assumes that all of the processor threads are used to run Rosetta (i.e. 100 % CPUs in BOINC). I took task run times and credits per task from the latest 100 valid tasks.

CV means coefficient of variation. (The higher the CV, the more uncertainty there is in the evaluation. 10 % would be nice IMO, but we probably can't have that at Rosetta with its crediting scheme, which is similar to Credit New, and also because they run a variety of different models.)

PPD = number_of_logical_CPUs * credits_per_task / seconds_per_task * 24*3,600 seconds_per_day. (This is long-term PPD after validation lag was overcome, if Rosetta ran exclusively on the host.)
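
As a sanity check, here is that formula as a minimal Python sketch (steady_state_ppd is just an illustrative name; the example uses the 2990WX figures listed below):
Code:
# Steady-state PPD from average credit per task and average run time per task.
SECONDS_PER_DAY = 24 * 3600

def steady_state_ppd(logical_cpus, credits_per_task, seconds_per_task):
    """PPD if all logical CPUs run the project and validation has caught up."""
    return logical_cpus * credits_per_task / seconds_per_task * SECONDS_PER_DAY

# Example: the 2990WX figures listed below (64 threads, 141 credits, 26,300 s per task).
print(round(steady_state_ppd(64, 141, 26_300)))   # ~29,650, i.e. the ~29,700 PPD below within rounding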

Host 3418081 Ryzen 2700X Windows
task run time: 27,600 s average (CV 4 %)
credits per task: 191 average (CV 21 %)
= 9,600 PPD​

Host 3398281 Ryzen 2700X Windows
task run time: 27,700 s average (CV 5 %)
credits per task: 240 average (CV 35 %)
= 12,000 PPD​

Host 3397540 Threadripper 1950X Linux
task run time: 27,700 s average (CV 3 %)
credits per task: 223 average (CV 21 %)
= 22,200 PPD​

Host 3437205 Threadripper 2990WX Windows
task run time: 26,300 s average (CV 14 %)
credits per task: 141 average (CV 34 %)
= 29,700 PPD​

These results tell us a few things:
  • The two Ryzen 2700X should have nearly the same PPD, but they are noticeably apart. This could be because you are running additional workloads on these, e.g. Folding@home GPU jobs. But it could just as easily be because of Rosetta@home's fuzziness in its crediting scheme (edit) and because of different batches of tasks, i.e. different models being run by the hosts.
  • The 2990WX has the highest variability in task run times. Maybe this is an effect of the different RAM attachment of 2 of 4 of the processor dies.
  • The 2990WX gets the lowest credits per task. This is most certainly caused by the lower RAM bandwidth (edit: per core) compared to 2700X and 1950X, combined with the fact that Rosetta@home is one of the more RAM bandwidth dependent projects.
  • Of course I don't know how much power these hosts pull when you let them run Rosetta. But going by the 250 W power envelope of the 2990WX and the 180 W power envelope of the 1950X, the 2990WX should get 250/180 = 1.39 times the PPD of the 1950X if performance were governed purely by the CPU power budget and proportional to it (in reality, it isn't).

    However, you get 29.7k/22.2k = 1.34 times the PPD. I think this is still really good scaling, considering that Rosetta likes RAM bandwidth, but both processors have about the same of that. (Edit, and besides, the difference between 1.39 vs. 1.34 is way within the bounds of Rosetta's WU variability anyway.)
So, roughly, 2990WX : 1950X at Rosetta@home = 4:3. (Ignoring the errors that you still get, but hopefully can fix.)
 
Last edited:

IEC

Elite Member
Super Moderator
Hmm, so at least for DC purposes it's looking like a 2990WX may be a great purchase... that's some good efficiency.

I will consider replacing all my Ryzen rigs with a single 2990WX for density reasons.
 

StefanR5R

Elite Member
IEC said:
I will consider replacing all my Ryzen rigs with a single 2990WX for density reasons.
An open question is how well it works as a GPU driver. IIRC, @TennesseeTony had lower GPU performance in a dual-socket Xeon host compared with single-socket hosts. The 2990WX could have similar issues, perhaps more so on Windows than on Linux. (Or maybe not at all on Linux?)
 

Markfw

Moderator Emeritus, Elite Member
Neither of the 2700X's has a GPU, but one is overclocked to 4.1 GHz and the other is stock. As for power draw, the 2990WX @ 3600 MHz is pulling 315 watts from the wall with a 1080 Ti idling but all 64 threads at full load. That's incredible to me. I think the big thing is Linux. Once I have my custom water setup and am OC'ed stable at 4 or 4.1 GHz, I am going to install Linux, as it's way more efficient. It also looks like I should set up a dual boot with Linux on the 2700X's for the same reason. Almost all the other boxes in the house are Linux.
 

Markfw

Moderator Emeritus, Elite Member
I took it down to 3550, and that seems to have fixed the errors. The count is still at 117, the same as when I changed the speed.
 

StefanR5R

Elite Member
Markfw said:
Hey guys, what benchmarks do you want to see out of my 2990WX? I only had one request, and that was for Blender. I did that, so now I need more input.
I have been thinking about it, but can't recall a project which is a perfect benchmark. Most DC projects have a high variability of the computational workload per WU. And on top of that, credit estimation can be very random at times... Besides, there are projects with different performance on Linux and Windows. (Typically better on Linux, but in some cases the other way around.)

Also, I noticed that I have only sparse & old notes about the PPD of my own hosts.

SETI@home:
I have older and current data on CPU Linux jobs, stock and optimized applications. But the trouble is that there is considerable day-to-day variability in PPD on the same host with the same applications, so this makes a very poor case for PPD comparisons.​

TN-Grid:
Early this year I made a note that my dual E5-2696v4 (2x 22C/44T) on Linux made about 52,000 PPD.

There were months during which it was very hard to receive TN-Grid tasks. But they should be plenty now. It's in the field of biology mostly; partly medicine.​

World Community Grid:
PPD measured during November 2017 on dual E5-2690v4 (2x 14C/28T) on Linux:
40,000...42,000 PPD in Mapping Cancer Markers (there was some variability between batches that I captured for measurement)
57,000 PPD in Smash Childhood Cancer
65,000 PPD in OpenZika
69,000 PPD in Outsmart Ebola Together
Note that the Windows applications of OpenZika and of Outsmart Ebola Together do not perform as well as the Linux application.​

PPD measured during November 2017 on dual E5-2696v4 (2x 22C/44T) on Linux:
54,000 PPD in Microbiome Immunity Project

These are Boinc credits, not WCG points.​

All of the PPD figures are calculated from validated tasks only, i.e. they correspond to the steady state once the validation lag has been overcome.
 

StefanR5R

Elite Member
TN-Grid

I went to the web site, computers -> "Detail" -> "Application details". This is what I got:
Code:
                                                  Thread-  dual E5-  dual E5-  dual E5-  dual E5-                   
                                                  ripper   2690 v4   2690 v4   2696 v4   2696 v4                    
                                                  2990WX   3.2 GHz   3.2 GHz   2.8 GHz   2.8 GHz                           
------------------------------------------- per-thread performance -----------------------------------------
gene@home PC-IM 1.10 x86_64-pc-linux-gnu (avx)     5.85      4.63      4.90      4.28      4.35    GFLOPS                  
gene@home PC-IM 1.10 x86_64-pc-linux-gnu (sse2)    5.94      4.85      4.81      3.89      3.93    GFLOPS                  
gene@home PC-IM 1.10 x86_64-pc-linux-gnu (fma)     5.83      4.75      4.56      4.11      4.06    GFLOPS                  
--------------------------------------------- per-host performance -----------------------------------------
number of processors                                 64        56        56        88        88                            
gene@home PC-IM 1.10 x86_64-pc-linux-gnu (avx)      374       259       274       377       383    GFLOPS                  
gene@home PC-IM 1.10 x86_64-pc-linux-gnu (sse2)     380       272       269       342       346    GFLOPS                  
gene@home PC-IM 1.10 x86_64-pc-linux-gnu (fma)      373       266       255       362       357    GFLOPS

IOW, the GFLOPS estimates for one 32C/64T Threadripper 2990WX are pretty much identical to those for a dual-processor 2x 22C/44T Broadwell-EP (if 100 % of the hardware threads are loaded with TN-Grid's current gene@home application).

The four dual-processor hosts together are pulling 1,450 W at the wall. Alas, I don't have instrumentation for individual hosts at the moment.

Edit:
Here are task run times and credits, averaged over the last 100 valid tasks (CV: coefficient of variation), and, calculated from those, the steady-state PPD, i.e. the PPD once validation has caught up with production, assuming all processor threads are used for gene@home.
Code:
                 Thread-   dual E5-   dual E5-   dual E5-   dual E5-
                 ripper    2690 v4    2690 v4    2696 v4    2696 v4
                 2990WX    3.2 GHz    3.2 GHz    2.8 GHz    2.8 GHz
------------------------ per-thread performance ------------------------
run time/task    11,657     13,680     13,577     15,464     14,956   s
(CV)             (0.07)     (0.01)     (0.01)     (0.07)     (0.10)
credits/task        158        152        153        150        147
(CV)             (0.07)     (0.03)     (0.03)     (0.07)     (0.08)
-------------------------- per-host performance ------------------------
# of processors      64         56         56         88         88
PPD              74,725     53,832     54,347     73,740     74,858

PS, edit 2,
steady_state_PPD = number_of_logical_CPUs * credits_per_task / seconds_per_task * 24*3,600 seconds_per_day
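
For example, plugging the 2990WX row from the table into this formula (a minimal one-line sketch; the small offset from the table's 74,725 PPD comes from the rounded averages shown):
Code:
# 2990WX row from the table: 64 threads, 158 credits/task, 11,657 s/task.
print(round(64 * 158 / 11_657 * 24 * 3600))   # ~74,950 PPD; the table's 74,725 differs only by rounding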
 
Last edited:

Markfw

Moderator Emeritus, Elite Member
I am pulling 400 W at the wall including the water cooling. Looks like good PPD for this one-chip monster.
 

StefanR5R

Elite Member
That gives ~190 PPD/W for the 2990WX, vs. ~180 PPD/W for the mix of 14C and 22C Xeons.
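
A quick sanity check of those two ratios, using the TN-Grid PPD table and the wall power figures quoted above:
Code:
# PPD per watt, from the TN-Grid PPD table and the wall power figures above.
print(round(74_725 / 400))                                 # 2990WX: ~187 PPD/W
print(round((53_832 + 54_347 + 73_740 + 74_858) / 1_450))  # four dual-Xeon hosts: ~177 PPD/W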

PPD/$ of initial investment is on a whole other level, even for the period during which 2696 v4 were relatively affordable second hand.
 

StefanR5R

Elite Member
A Planet3DNow! user is testing the waters with a 2990WX on Windows and Linux with the multithreaded Cosmology@home camb_boinc2docker application. He wrote in their German forum (machine-translated and edited):
sompe of P3D said:
Cosmology seems to have quite a problem with the topology of the Threadripper 2990WX. At least, I have been struggling with aborts due to runtime overruns.

After I disabled SMT to rule out its influence, a fairly clear picture emerges: half of the WUs seem to need much more computing time and run roughly 30% longer, unless the slower WUs hop over to the directly connected dies once the faster WUs have finished. If this adds up with the slowdown from SMT, the WUs apparently get shot down automatically after exceeding the allowed computing time and are marked as faulty. The disadvantage of the indirectly attached memory seems to be in full effect here.

I am currently running the last WUs under Windows (limited to 4 cores per WU) and will later continue my attempts with Ubuntu, which was the intended OS for this machine anyway.
The host under Windows: ID 368169 (average processing rate 29 GFLOPS, after >800 tasks)
The host under Linux: ID 362903 (average processing rate 39 GFLOPS, after 17 tasks, presumably with SMT on)

Like user sompe, I run camb_boinc2docker with 4 threads per task.
E5-2690 v4, HT on, Linux: average processing rate 29 GFLOPS
E5-2696 v4, HT on, Linux: average processing rate 23 GFLOPS
 
Last edited:

Markfw

Moderator Emeritus, Elite Member
StefanR5R said:
A Planet3DNow! user is testing the waters with a 2990WX on Windows and Linux with the multithreaded Cosmology@home camb_boinc2docker application. He wrote in their German forum (machine-translated and edited):

The host under Windows: ID 368169 (average processing rate 29 GFLOPS, after >800 tasks)
The host under Linux: ID 362903 (average processing rate 39 GFLOPS, after 17 tasks, presumably with SMT on)

Like user sompe, I run camb_boinc2docker with 4 threads per task.
E5-2690 v4, HT on, Linux: average processing rate 29 GFLOPS
E5-2696 v4, HT on, Linux: average processing rate 23 GFLOPS
How many threads does each of those two boxes of yours have? Trying to compare to the 2990WX.
 

StefanR5R

Elite Member
Mine are dual-processor hosts.
Furthermore, they run Cosmo at their all-core non-AVX turbo clock, which is
E5-2690 v4: 3.2 GHz
E5-2696 v4: 2.8 GHz​

The GFLOPS numbers reported above relate to one application instance which is configured to occupy 4 logical CPUs. This means:
dual E5-2690 v4:
2x 14 cores x 2 threads/core = 56 threads/host = 14 application instances per host
14 * 29 GFLOPS = 400 GFLOPS/host​
dual E5-2696 v4:
2x 22 cores x 2 threads/core = 88 threads/host = 22 application instances per host
22 * 23 GFLOPS = 500 GFLOPS/host​
2990WX, if SMT is on:
32 cores x 2 threads/core = 64 threads/host = 16 application instances per host
16(?) * 39(?) GFLOPS = 620(?) GFLOPS/host​

I don't know, though, whether user sompe ran 16 tasks at a time, or whether the GFLOPS measurement by BOINC has converged enough. So the 620 GFLOPS figure is rather uncertain. The host details page says "consecutive valid tasks: 87" but "number of tasks today: 8". So maybe the host was set up to run only 8 application instances concurrently, cutting the host GFLOPS in half. (Darn, I should get one of these myself for 1:1 comparisons. But there should be even nicer stuff available from AMD next year...* )
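
The per-host figures above are just instances × per-instance GFLOPS; here is a minimal sketch of that arithmetic (host_gflops is an illustrative name, and the 2990WX line rests on the uncertain 39 GFLOPS / 16-instance assumption discussed above):
Code:
# Per-host GFLOPS from per-instance GFLOPS, with 4 logical CPUs per camb_boinc2docker task.
def host_gflops(threads, gflops_per_instance, threads_per_task=4):
    instances = threads // threads_per_task
    return instances * gflops_per_instance

print(host_gflops(56, 29))  # dual E5-2690 v4: 14 instances -> ~406 GFLOPS
print(host_gflops(88, 23))  # dual E5-2696 v4: 22 instances -> ~506 GFLOPS
print(host_gflops(64, 39))  # 2990WX, SMT on (uncertain): 16 instances -> ~624 GFLOPS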

--------
Edit,
* and I already am rather limited by power draw, cooling, shelf space, energy costs...

If anybody wants to compare, here is the app_config.xml to restrict camb_boinc2docker to 4 threads per task:
Code:
<app_config>
        <app_version>
                <app_name>camb_boinc2docker</app_name>
                <plan_class>vbox64_mt</plan_class>
                <avg_ncpus>4</avg_ncpus>
        </app_version>
</app_config>
But it is currently a bit hard to get tasks from Cosmo because their database and work generator are very slow. You only get tasks after many requests, and the host may not stay fully occupied that way.
 
Last edited:

Markfw

Moderator Emeritus, Elite Member
@StefanR5R, so I guess you like the 2990WX and think it's cheaper than a used Xeon E5 setup?

I just spec'ed out a 128-thread dual-EPYC system after I saw that article saying a 16-core EPYC demolishes a 16-core TR by 62%, and it comes to $10,000 for 2 CPUs, a motherboard, 128 GB of DDR4-2666 registered ECC memory, and 2 Noctua heatsinks... Tempting.
 

StefanR5R

Elite Member
Markfw said:
@StefanR5R, so I guess you like the 2990WX and think it's cheaper than a used Xeon E5 setup?
I'm not sure about used Xeon E5 v3 (22 nm, less efficient, therefore I haven't watched these).

More than a year ago, the price of used Xeon E5 v4 (14 nm) was similar to that of new single-socket EPYCs. About a year ago, used Xeon E5 v4 became scarce and pricey. AFAICT, they are no longer a price/performance match for EPYC, let alone Threadripper.

Markfw said:
I just spec'ed out a 128-thread dual-EPYC system after I saw that article saying a 16-core EPYC demolishes a 16-core TR by 62%, and it comes to $10,000 for 2 CPUs, a motherboard, 128 GB of DDR4-2666 registered ECC memory, and 2 Noctua heatsinks... Tempting.
Two thoughts:
  • Single-socket EPYC systems have a much better price per processor (7351P, 7401P, 7551P) and the same price for RAM, but with the downside that 1 mainboard, 1 disk, 1 PSU, and 1 case are needed per processor instead of per 2 processors.
  • I am certain that a 16-core TR performs at least as well as (or even notably better than) a 16-core EPYC in workloads which are not memory-bandwidth bound.
    In AnandTech's TR2 review, a 32-core EPYC (single-socket, 180 W TDP) performed between the 16-core TR2 (180 W TDP) and the 32-core TR2 (250 W TDP) in a number of benchmarks, as is to be expected.
 
Last edited: