PrimeGrid: CPU benchmarks

Discussion in 'Distributed Computing' started by StefanR5R, Jan 5, 2017.

  1. StefanR5R

    StefanR5R Senior member

    Joined:
    Dec 10, 2016
    Messages:
    320
    Likes Received:
    135
    Here are some GCW Sieve stats from the CPU-only "Generalized Cullen/Woodall (Sieve) 1.00" application. Runtimes were taken from the latest 20 valid tasks per machine.

    Code:
                 CPU   2x E5-2690v4 2x E5-2690v4   i7-6950X     i7-4960X     E3-1245v3    i7-4900MQ  Phenom II X4 905e
               clock        3.2 GHz      3.2 GHz      4.0 GHz      4.5 GHz      3.6 GHz     3.15 GHz      2.5 GHz
    ------------------------------------------------------------------------------------------------------------------
     mean runtime/WU       7194 s       7242 s       5602 s       5479 s       6911 s       7123 s      10719 s
      min runtime/WU       6696 s       6772 s       5410 s       5249 s       6494 s       6762 s      10464 s
      max runtime/WU       7957 s       7984 s       5987 s       5835 s       7584 s       7373 s      11069 s
                 COV      0.059        0.055        0.026        0.024        0.048        0.022        0.021
    ------------------------------------------------------------------------------------------------------------------
    normalized runtime   22'900       23'200       22'400       24'700       24'900       22'400       26'800
    ------------------------------------------------------------------------------------------------------------------
    concurrent tasks         56           56           18           12            8            8            4
             WUs/day        673          668          278          189          100           97           32
         credits/day    367'000      365'000      152'000      103'000       55'000       53'000       18'000
    
    clock:
    actual average processor frequency during the GCW sieve multitask workload​

    normalized runtime:
    mean runtime/WU (s), multiplied by clock (GHz), i.e. runtime scaled to 1.0 GHz; lower is better

    credits/day:
    WUs/day multiplied by the average credits/WU, which is slightly more than 546​
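
The derived rows of the table can be recomputed with a short Python sketch. The normalized runtime and credits/day formulas are the ones defined above; the WUs/day formula (concurrent tasks times seconds per day, divided by mean runtime) is an assumption that happens to reproduce the listed values, and 546 credits/WU is a rounded-down stand-in for the "slightly more than 546" mentioned above.

```python
SECONDS_PER_DAY = 86_400
CREDITS_PER_WU = 546  # the post says "slightly more than 546"; exact value not given

# (mean runtime per WU in s, average clock in GHz, concurrent tasks), from the table
machines = {
    "2x E5-2690v4 (first column)": (7194, 3.2, 56),
    "i7-6950X": (5602, 4.0, 18),
    "i7-4960X": (5479, 4.5, 12),
    "Phenom II X4 905e": (10719, 2.5, 4),
}

def normalized_runtime(mean_s, clock_ghz):
    """Runtime scaled to 1.0 GHz, as defined in the post."""
    return mean_s * clock_ghz

def wus_per_day(mean_s, tasks):
    """Assumed formula: each concurrent task finishes one WU per mean runtime."""
    return tasks * SECONDS_PER_DAY / mean_s

for name, (mean_s, clock_ghz, tasks) in machines.items():
    wus = wus_per_day(mean_s, tasks)
    print(f"{name}: normalized {normalized_runtime(mean_s, clock_ghz):.0f}, "
          f"{wus:.0f} WUs/day, about {wus * CREDITS_PER_WU:.0f} credits/day")
```

The table rounds its results, so small deviations from the printed values are expected.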

    Remarks:
    • My actual production in the ongoing Isaac Newton's Birthday Challenge will be lower than the sum of these machines. Some of them have to take breaks from the race periodically.
    • The two i7-X processors are overclocked, but I checked that there are no invalid returns.
    • The 6950X has got 20 logical cores but only 18 are being employed for PrimeGrid. Two remain for F@H.
    • The E3-1245v3 could be a few percent faster; there is a mixed workload on it.
    • Normalized to 1.0 GHz, the speed per core of the Phenom II is not so far behind the other CPUs. But then again, it runs only 1 task per physical core (no HT); all others run 2 tasks per physical core.
     
    TennesseeTony likes this.
  3. StefanR5R

    Here are SoB-LLR runtimes with the "Seventeen or Bust 7.06" CPU application. These are unusually long tasks; I'll list the times in days, not hours.

    As Ken g6 already hinted in the PrimeGrid Races 2017 thread,
    • AVX units, if present, are given a really good workout.
    • Hyper-threading scaling is negative on the CPUs that I tested for it.
    • SoB-LLR chews memory bandwidth like there is no tomorrow.
    I'll list runtimes in four sections:
    1. Haswell laptop with four memory configurations
    2. Light load (2 tasks) on various CPUs
    3. Heavy load (tasks on all cores) versus light load
    4. Heavy load with hyper-threading
    Hyper-threading was disabled in the BIOS, except in section 4.

    Machines used in these tests:
    Deneb
    AMD Phenom II X4 905e, 2.5 GHz, 4 cores, no SMT, no AVX
    DDR2-800 cl6 (15 ns), dual channel​
    HSW
    Intel i7-4900MQ (Haswell), clock as listed in the tests, 4 cores
    DDR3-1600 cl11 (14 ns), single channel, unless noted otherwise in section 1​
    IVB-E
    Intel i7-4960X (Ivy Bridge-E), 4.5 GHz, 6 cores
    DDR3-2400 cl10 (8 ns), quad channel​
    BDW-E
    Intel i7-6950X (Broadwell-E), 3.5 GHz, 10 cores
    DDR4-3000 cl14 (9 ns), quad channel​
    BDW-EP
    dual Intel Xeon E5-2690 v4 (Broadwell-EP), 2.9 GHz, 2x 14 cores
    DDR4-2400 cl17 (14 ns), 2x quad channel, unregistered ECC​
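
The nanosecond figures in parentheses appear to be CAS latencies converted from clock cycles to time. A minimal Python sketch of that conversion, assuming the usual DDR relation that the memory clock is half the data rate; the function name `cas_latency_ns` is made up for this illustration.

```python
def cas_latency_ns(data_rate_mt_s, cl):
    """CAS latency in ns; DDR transfers twice per clock, so clock = data rate / 2."""
    clock_mhz = data_rate_mt_s / 2
    return cl / clock_mhz * 1000

# (data rate in MT/s, CAS latency in cycles) for the machines listed above
configs = {
    "Deneb, DDR2-800 cl6": (800, 6),
    "HSW, DDR3-1600 cl11": (1600, 11),
    "IVB-E, DDR3-2400 cl10": (2400, 10),
    "BDW-E, DDR4-3000 cl14": (3000, 14),
    "BDW-EP, DDR4-2400 cl17": (2400, 17),
}

for name, (rate, cl) in configs.items():
    print(f"{name}: {cas_latency_ns(rate, cl):.1f} ns")
```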

    The listed CPU clocks are the average clocks during the tests. The Haswell PC is a laptop which does not offer any means to influence CPU clock in the BIOS. Its clock was always controlled by the thermal limit and therefore varied from workload to workload.

    Operating systems: Linux on Deneb and BDW-EP, Windows 7 otherwise.


    1. Haswell laptop with four memory configurations

    Workload:
    4 simultaneous SoB-LLR tasks (100 % CPU load)​

    Details of the RAM configurations:
    DDR3-1600, timings 11-11-11-28 (14 ns), command rate 1T, single or dual channel
    DDR3-2133, timings 11-11-11-31 (10 ns), command rate 2T, single or dual channel​

    Min/Max SoB-LLR runtimes:
    HSW@2.9GHz, DDR3-1600, single channel...........13.7 - 13.9 days
    HSW@2.7GHz, DDR3-2133, single channel...........10.8 - 10.9 days
    HSW@2.4GHz, DDR3-1600, dual channel................8.0 - 8.2 days
    HSW@2.4GHz, DDR3-2133, dual channel................7.6 - 7.8 days

    Remarks:
    • Obviously the single channel configs starved the CPU so much that its lower utilization, hence lower heat output, allowed for higher clocks.
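    • Switching from single channel to dual channel reduced the runtimes by about 41 % with DDR3-1600 and about 29 % with DDR3-2133 (based on the midpoints of the ranges above), even though the CPU clock was lower in the dual channel runs.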
    • Comparing the two dual channel configs, 33 % faster memory resulted in 5 % lower runtimes. I wonder how much of a role the command rate regression is playing there.
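
A small Python sketch for recomputing relative runtime changes between the four memory configurations. Taking the midpoint of each min/max range as the representative runtime is an assumption of this sketch, not something stated in the post.

```python
# (min days, max days) per task, copied from the list in section 1
runtimes = {
    ("DDR3-1600", "single"): (13.7, 13.9),
    ("DDR3-2133", "single"): (10.8, 10.9),
    ("DDR3-1600", "dual"): (8.0, 8.2),
    ("DDR3-2133", "dual"): (7.6, 7.8),
}

def midpoint(min_max):
    """Representative runtime: middle of the min/max range (an assumption)."""
    return sum(min_max) / 2

def reduction_pct(before, after):
    """Percentage by which the runtime shrinks when going from before to after."""
    return (1 - midpoint(after) / midpoint(before)) * 100

for ram in ("DDR3-1600", "DDR3-2133"):
    pct = reduction_pct(runtimes[(ram, "single")], runtimes[(ram, "dual")])
    print(f"{ram}, single -> dual channel: {pct:.0f} % lower runtime")

pct = reduction_pct(runtimes[("DDR3-1600", "dual")], runtimes[("DDR3-2133", "dual")])
print(f"dual channel, DDR3-1600 -> DDR3-2133: {pct:.0f} % lower runtime")
```

Note that the CPU clock differed between the runs, so these percentages mix memory effects with clock effects.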

    2. Light load (2 tasks) on various CPUs

    Workload:
    2 simultaneous SoB-LLR tasks
    (on BDW-E: additionally 2 Folding@Home GPU feeder threads)

    Min/Max SoB-LLR runtimes:
    Deneb.............................24.0 - 24.0 days
    HSW@3.4GHz.................7.1 - 7.1 days
    IVB-E................................5.5 - 5.8 days
    BDW-E.............................5.3 - 5.6 days

    Remarks:
    • Needing more than 3 weeks per task, the Deneb won't make it in the upcoming 2-week-long "Year of the Fire Rooster" challenge.
    • Despite being equipped with better AVX units, the Haswell is much slower than the Ivy Bridge-E. The latter compensates with a faster clock (4.5 GHz : 3.4 GHz), much higher memory bandwidth (4-channel DDR3-2400 : 1-channel DDR3-1600), and lower memory latency (8 ns : 14 ns).
    • Yet Ivy Bridge-E is beaten by Broadwell-E despite the latter's lower clock (4.5 GHz vs. 3.5 GHz). The Broadwell-E brings better AVX units, and some more memory bandwidth to feed them.
    • As there is a water cooler mounted on my BDW-E, I could certainly try to clock it higher than 3.5 GHz when put into a race. But once more simultaneous tasks are put on the CPU, memory bandwidth will again become more of a factor than core clock.

    3. Heavy load (tasks on all cores) versus light load

    Workload:
    HSW: 4 simultaneous SoB-LLR tasks on 4 cores (100 % load)
    BDW-E: 8 SoB-LLR tasks + 2 F@H feeders on 10 cores (80 + 20 = 100 % load)
    BDW-EP: 26 SoB-LLR tasks on 2x 14 cores (93 % load)

    Min/Max SoB-LLR runtimes:
    HSW@2.9GHz...............13.7 - 13.9 days
    BDW-E.............................6.2 - 6.2 days
    BDW-EP..........................9.8 - 10.5 days

    Remarks (compare with section 2, light load):
    • Haswell's runtimes increase by 94 % when workload is doubled on the single channel RAM config. I haven't done the same comparison on dual channel RAM.
    • Broadwell-E's runtimes increase by 14 % when going from 2 to 8 SoB-LLR tasks.
    • The fully loaded BDW-EPs show 64 % longer runtimes than the fully loaded BDW-E.
      Differences:
      Number of SoB-LLR tasks per processor differs by 63 %, core clocks differ by 21 %, RAM speed by 25 %, and RAM latency by 53 %.
      Other differences, which may or may not play a role, are dual-socket vs. single socket, MCC die with two ring buses and two home agents vs. LCC die with one ring bus and one home agent, and Linux vs. Windows.

    4. Heavy load with hyper-threading

    Workload:
    HSW: 8 SoB-LLR tasks on 4 cores/ 8 hardware threads (100 % load)
    BDW-E: 20 SoB-LLR tasks on 10 cores/ 20 hardware threads (100 % load)​

    Min/Max SoB-LLR runtimes:
    HSW@2.8GHz...............28.0 - 30.1 days
    BDW-E...........................16.8 - 17.5 days

    Remarks (compare with section 3, HT off):
    • Haswell's runtimes increase by 110 %, hence total throughput decreases by 5 %.
    • BDW-E's runtimes increase by 177 %, hence total throughput decreases by 10 % (20 tasks instead of 8, each taking about 2.77 times as long).
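
The step from runtime increase to throughput change can be redone with a few lines of Python. This is a sketch under two assumptions: throughput is taken as concurrent tasks divided by days per task, and the midpoint of each min/max range stands in for the runtime.

```python
# (concurrent tasks, min days, max days), HT off from section 3, HT on from section 4
results = {
    "HSW": {"ht_off": (4, 13.7, 13.9), "ht_on": (8, 28.0, 30.1)},
    "BDW-E": {"ht_off": (8, 6.2, 6.2), "ht_on": (20, 16.8, 17.5)},
}

def runtime(entry):
    """Midpoint of the min/max range, in days per task (an assumption)."""
    _, lo, hi = entry
    return (lo + hi) / 2

def throughput(entry):
    """Tasks completed per day when all concurrent tasks run in parallel."""
    return entry[0] / runtime(entry)

def pct_change(old, new):
    return (new / old - 1) * 100

for cpu, r in results.items():
    rt = pct_change(runtime(r["ht_off"]), runtime(r["ht_on"]))
    tp = pct_change(throughput(r["ht_off"]), throughput(r["ht_on"]))
    print(f"{cpu}: runtime {rt:+.0f} %, throughput {tp:+.0f} %")
```

On BDW-E the HT-off run shared the CPU with 2 F@H feeder threads, so the two configurations are not perfectly comparable.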
    --------
    Edit, April 18:
    corrected BDW-EP memory spec
     
    #2 StefanR5R, Feb 26, 2017
    Last edited: Apr 18, 2017
  4. TennesseeTony

    TennesseeTony Elite Member

    Joined:
    Aug 2, 2003
    Messages:
    1,763
    Likes Received:
    291
    Wow! What a treasure trove of information! Well done!

    I sure hope this thread doesn't get lost, that's a lot of work and good info.
     
  5. StefanR5R

    Initially I just wanted to get a quick overview of how the different PCs fare with SoB-LLR. When I saw the low results of the Xeons compared to the 6950X, I began the scaling tests. And I finally convinced myself that I needed to do something about that dismal out-of-the-box RAM in my laptop, despite RAM prices going up currently.
     
  6. StefanR5R

    PS:
    SoB-LLR is credited with about 64,500 points/WU. This works out to:

    HSW..............4 concurrent tasks, 13.8 days/task..........19,000 points/day
    BDW-E...........8 concurrent tasks, 6.2 days/task............83,000 points/day
    BDW-EP.......26 concurrent tasks, 10.1 days/task.......166,000 points/day​
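
The points/day figures follow from concurrent tasks, days per task, and points per WU. A minimal Python sketch, assuming the round figure of 64,500 points/WU given above:

```python
POINTS_PER_WU = 64_500  # "about 64,500 points/WU" per the post

# (concurrent tasks, days per task), copied from the list above
hosts = {
    "HSW": (4, 13.8),
    "BDW-E": (8, 6.2),
    "BDW-EP": (26, 10.1),
}

def points_per_day(tasks, days_per_task):
    """Each concurrent task yields one WU's worth of points every days_per_task days."""
    return tasks / days_per_task * POINTS_PER_WU

for name, (tasks, days) in hosts.items():
    print(f"{name}: about {points_per_day(tasks, days):,.0f} points/day")
```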

    Oops, that's about half of the GCW-Sieve points/day from post #1. And this is supposed to include "50 % long job credit bonus and a 10 % conjecture credit bonus".

    The SoB-LLR project was started in 2010. Runtimes on today's CPUs are hardly shorter than those reported in 2010, despite better vector units, surely because of the RAM bottleneck. If the good folks at PrimeGrid calibrated points/task back in 2010 and perhaps even benchmarked with just one task/CPU, then that may explain why PPD are so low still in this day and age.
     
  7. StefanR5R

    PSP-LLR runtimes, "Prime Sierpinski Problem (LLR) 8.00" application,
    Broadwell-EP (Xeon E5-2690v4) @ 2.9 GHz (AVX all-core turbo), dual processor board, Linux

    HT on, 56 single-threaded jobs: 3.95...4.2 days per task .......... 13.7 tasks per day
    HT off, 28 single-threaded jobs: 1.9...2.0 days per task ............ 14.4 tasks per day

    t.b.d.: PPD, and multi-threaded tasks.
     
    TennesseeTony likes this.
  8. Orange Kid

    Orange Kid Elite Member

    Joined:
    Oct 9, 1999
    Messages:
    3,088
    Likes Received:
    134
    Added a link to this in the Project List under PrimeGrid.
     
  9. StefanR5R

    PSP-LLR runtimes, "Prime Sierpinski Problem (LLR) 8.00" application,
    continued: multithreading study, and yield in terms of points per day

    Broadwell-EP (Xeon E5-2690v4) @ 2.9 GHz (AVX all-core turbo), dual processor board, Linux
    HT off
    Code:
                          session 1 (4 days)        session 2 (4 days)
    ----------------------------------------------------------------------
    threads/task          2            7            7            14
    simultaneous tasks    14           4            4            2
    ----------------------------------------------------------------------
    avg task run time     1d7h         5h27m        6h02m        3h07m
    tasks per day         11.0         17.6         15.9         15.4
    PPD                   155,000      250,000      245,000      235,000
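
The "tasks per day" row can be approximately recomputed from the run times and the number of simultaneous tasks. A Python sketch, assuming tasks per day equals simultaneous tasks times 24 h divided by the average run time; `parse_hours` is a helper written for this illustration, and since the run times in the table are rounded, results may differ slightly from the listed values.

```python
import re

def parse_hours(text):
    """Convert a duration such as '1d7h' or '5h27m' into hours."""
    m = re.fullmatch(r"(?:(\d+)d)?(?:(\d+)h)?(?:(\d+)m)?", text)
    if not text or m is None:
        raise ValueError(f"unrecognized duration: {text!r}")
    days, hours, minutes = (int(g) if g else 0 for g in m.groups())
    return days * 24 + hours + minutes / 60

def tasks_per_day(simultaneous_tasks, avg_run_time):
    """Assumed formula: all simultaneous tasks finish once per average run time."""
    return simultaneous_tasks * 24 / parse_hours(avg_run_time)

# (threads/task, simultaneous tasks, avg task run time), columns of the table above
columns = [(2, 14, "1d7h"), (7, 4, "5h27m"), (7, 4, "6h02m"), (14, 2, "3h07m")]

for threads, tasks, run_time in columns:
    print(f"{threads} threads x {tasks} tasks: "
          f"{tasks_per_day(tasks, run_time):.1f} tasks/day")
```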
    
    Edit, April 11:
    all prior results discarded, they were based on wrong run times and CPU times from primegrid.com's host task lists
    Edit, April 12:
    added results from 14-threads/task x 2 tasks
    Edit, April 14, 15:
    improved accuracy with longer session duration
     
    #8 StefanR5R, Apr 6, 2017
    Last edited: Apr 15, 2017