2x Xeon E5-2696 v4 Benchmark results and tuning log (44C/88T)

Storm-Chaser

Senior member
Mar 18, 2020
236
76
71
I like what I am seeing :)

1670027379572.png

This thread will serve as a benchmark log for my ongoing HP Z840 project. The rig is built specifically for benching and nothing else. It cannot be overclocked, but I still think it will perform well against 12th-gen Intel and current AMD processors as it stands. We will see about 13th gen; the only thing I can do is come at it with as many cores as possible. This Z840 is now maxed out at the limit of what the platform supports: the highest core-count CPUs (22C/44T each), the highest allowable memory speed (2400 MHz, effectively eight channels across both sockets), and the lowest possible memory latency (CL17).
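
As a rough sanity check on the memory configuration described above, the theoretical peak bandwidth can be estimated from the channel count and transfer rate. The sketch below assumes four DDR4 channels per socket (eight across both sockets) and a 64-bit (8-byte) data bus per channel. The function name is only illustrative, and measured bandwidth will be lower than this ceiling.

```python
def peak_bandwidth_gbs(channels: int, mega_transfers_per_s: int, bus_bytes: int = 8) -> float:
    """Theoretical peak memory bandwidth in GB/s (decimal gigabytes)."""
    return channels * mega_transfers_per_s * 1_000_000 * bus_bytes / 1e9

# Assumption: 2 sockets x 4 channels each, DDR4-2400 (2400 MT/s).
single_socket = peak_bandwidth_gbs(channels=4, mega_transfers_per_s=2400)
dual_socket = peak_bandwidth_gbs(channels=8, mega_transfers_per_s=2400)
print(f"per socket: {single_socket:.1f} GB/s, both sockets: {dual_socket:.1f} GB/s")
```

Swapping in 2133 for the transfer rate gives the equivalent ceiling for the older Haswell configuration.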

The processors were lost in shipment for years, so I was able to score them brand new from a supplier in the States for a pretty good deal.

I will be posting a number of benchmarks here as I test these brand new CPUs and push them to the limit. I will also compare them against my old CPUs, two E5-2696 v3 processors with 18 cores and 36 threads each.

Specs on new processors:
Note that there is one error in the screenshot below: the single-core turbo speed is actually 3.7 GHz, not 3.6 GHz.
1670027150704.png

Rig we will be working with:
1670029407593.png
1670029435249.png
1670030934916.png
 

Markfw

Moderator Emeritus, Elite Member
May 16, 2002
25,483
14,434
136
Which CPUs may post results here? EPYCs? I'm running Linux, so I would need benchmarks that have a free Linux download. I would be interested. My best would be dual Milan 7763s, 64 cores each, so 128 cores/256 threads, all the way down to a 5950X. I have nothing with fewer than 16 cores.
 

Storm-Chaser

You are welcome to post your results here from any of those machines. Basically any HEDT system is fine.

EDIT: Any newer-gen Intel/AMD CPUs are fine as well, because I am interested in how they stack up against this rig.
 

Markfw

Can someone suggest a Linux benchmark that can also run on Windows? Something that is free?
 

Storm-Chaser

y-cruncher. This benchmark is pretty aggressive, so make sure your cooling is good because it's really going to punish it.

EDIT: Since these machines are long-legged, I'd recommend benching 2.5B and then 10B so we can really see how well the processors scale in multi-core performance.

EDIT 2: I believe PassMark has a Linux version as well, and it's free to test your CPU on.
 

Markfw

OK, I am a noob at Linux. I downloaded this, but there is no .deb package, the readme does not say how to run it, and the command lines file reads like Greek. How do you run the Linux download?
 

Markfw

Here is the result.
y-cruncher v0.7.10 Build 9513

Detecting Environment...

CPU Vendor:
AMD = Yes
Intel = No

OS Features:
* 64-bit = Yes
* OS AVX = Yes
* OS AVX512 = No

Hardware Features:
MMX = Yes
* x64 = Yes
* ABM = Yes
RDRAND = Yes
RDSEED = Yes
BMI1 = Yes
* BMI2 = Yes
* ADX = Yes
MPX = No
PREFETCHW = Yes
PREFETCHWT1 = No
RDPID = Yes
GFNI = No
VAES = No

SIMD: 128-bit
* SSE = Yes
* SSE2 = Yes
* SSE3 = Yes
* SSSE3 = Yes
SSE4a = Yes
* SSE4.1 = Yes
* SSE4.2 = Yes
AES-NI = Yes
SHA = Yes

SIMD: 256-bit
* AVX = Yes
XOP = No
* FMA3 = Yes
* FMA4 = No
* AVX2 = Yes

SIMD: 512-bit
* AVX512-F = No
AVX512-CD = No
AVX512-PF = No
AVX512-ER = No
* AVX512-VL = No
* AVX512-BW = No
* AVX512-DQ = No
* AVX512-IFMA = No
* AVX512-VBMI = No

Alright Intel, how many drinks have you had tonight?
AVX512-VPOPCNTDQ = No
AVX512-4FMAPS = No
AVX512-4VNNIW = No
AVX512-VBMI2 = No
AVX512-VPCLMUL = No
AVX512-VNNI = No
AVX512-BITALG = No
AVX512-BF16 = No
* AVX512-GFNI = No
AVX512-VAES = No
AVX512-VP2INTERSECT = No
AVX512-FP16 = No


Auto-Selecting: 19-ZN2 ~ Kagari

/home/mark/Downloads/y-cruncher v0.7.10.9513-static/Binaries/19-ZN2 ~ Kagari


Launching y-cruncher...
================================================================



Insufficient permissions to set thread priority. Please retry as root.

Further messages for this warning will be suppressed.

Checking processor/OS features...

Required Features:
x64, ABM, BMI1, BMI2, ADX,
SSE, SSE2, SSE3, SSSE3, SSE4.1, SSE4.2,
AVX, FMA3, AVX2



Parsing Core -> Handle Mappings...
Cores: 0-127

Parsing NUMA -> Core Mappings...
Node 0: 0-3 64-67
Node 1: 4-7 68-71
Node 2: 8-11 72-75
Node 3: 12-15 76-79
Node 4: 16-19 80-83
Node 5: 20-23 84-87
Node 6: 24-27 88-91
Node 7: 28-31 92-95
Node 8: 32-35 96-99
Node 9: 36-39 100-103
Node 10: 40-43 104-107
Node 11: 44-47 108-111
Node 12: 48-51 112-115
Node 13: 52-55 116-119
Node 14: 56-59 120-123
Node 15: 60-63 124-127


y-cruncher v0.7.10 Build 9513 ( www.numberworld.org )
Copyright 2008-2020 Alexander J. Yee ( a-yee@u.northwestern.edu )

Distribute Freely - Please report any bugs.

Tuning: Linux/19-ZN2 ~ Kagari - Zen 2 Matisse (x64 ADX)


0 Benchmark Pi (all in ram)
1 Component Stress Tester
2 Run I/O Performance Analysis

3 Custom Compute a Constant
- Compute other constants (e, Golden Ratio, etc...)
- Choose your own settings (use disk for large computations)

4 BBP Digit Extractor for Pi
5 Digit Viewer

6 Advanced Options
7 About

Enter your choice:
option: 0

Benchmark Pi:

Select a Benchmark Type:

0 Single-Threaded
1 Multi-Threaded

option: 0


Available Memory: 90.9 GiB

Option Decimal Digits Approx. Memory Needed

1 25,000,000 211 MiB
2 50,000,000 325 MiB
3 100,000,000 556 MiB
4 250,000,000 1.21 GiB
5 500,000,000 2.34 GiB
6 1,000,000,000 4.52 GiB
7 2,500,000,000 11.0 GiB
8 5,000,000,000 22.8 GiB
9 10,000,000,000 45.5 GiB
10 25,000,000,000 116 GiB
11 50,000,000,000 228 GiB
12 100,000,000,000 457 GiB
13 250,000,000,000 1.12 TiB
14 500,000,000,000 2.24 TiB
15 1,000,000,000,000 4.48 TiB
16 2,500,000,000,000 11.3 TiB

0 I prefer SuperPi sizes... (1M, 2M, 4M...)

option: 1

This process does not have "CAP_IPC_LOCK". Page locking will not be possible.
Please run y-cruncher with elevation to enable page locking.


Constant: Pi
Algorithm: Chudnovsky (1988)

Decimal Digits: 25,000,000
Hexadecimal Digits: Disabled

Computation Mode: Ram Only
Multi-Threading: None (No Multi-threading) -> 1 / ?

Start Time: Fri Dec 2 18:13:09 2022

Working Memory... 131 MiB (spread: ?)
Twiddle Tables... 81.6 MiB (spread: ?)

Begin Computation:

Series CommonP2B3... 1,762,854 terms (Expansion Factor = 2.360)
Time: 7.345 seconds ( 0.122 minutes )
Large Division...
Time: 0.510 seconds ( 0.008 minutes )
InvSqrt(10005)...
Time: 0.336 seconds ( 0.006 minutes )
Large Multiply...
Time: 0.182 seconds ( 0.003 minutes )

Pi: 8.374 seconds ( 0.140 minutes )

Base Converting:
Time: 0.827 seconds ( 0.014 minutes )
Writing Decimal Digits:
Time: 0.042 seconds ( 0.001 minutes )

Start Time: Fri Dec 2 18:13:09 2022
End Time: Fri Dec 2 18:13:18 2022

Total Computation Time: 9.201 seconds ( 0.153 minutes )
Start-to-End Wall Time: 9.358 seconds ( 0.156 minutes )

CPU Utilization: 99.01 % + 0.98 % kernel overhead
Multi-core Efficiency: 0.77 % + 0.01 % kernel overhead

Last Decimal Digits: Pi
3803750790 9491563108 2381689226 7224175329 0045253446 : 24,999,950
0786411592 4597806944 2455112852 2554677483 6191884322 : 25,000,000

Spot Check: Good through 25,000,000

Version: 0.7.10.9513 (Linux/19-ZN2 ~ Kagari)
Processor(s): AMD Eng Sample: 100-000000053-04_32/20_N
Topology: 128 threads / 64 cores / 1 socket / 16 NUMA nodes
Usable Memory: 118,070,956,032 ( 110 GiB)
CPU Base Frequency: 1,996,177,600 Hz

Validation File: Pi - 20221202-181319.txt
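
For anyone reading the log above: the "Multi-core Efficiency" line looks alarming but is expected for a single-threaded run. The figures are consistent with the CPU utilization being divided by the 128 logical cores. That interpretation is an assumption inferred from the numbers, not something the log itself states. A quick check in Python:

```python
# Values copied from the y-cruncher output above.
cpu_utilization_pct = 99.01
kernel_overhead_pct = 0.98
logical_cores = 128

# Assumption: efficiency = utilization / logical core count.
efficiency = cpu_utilization_pct / logical_cores
kernel_efficiency = kernel_overhead_pct / logical_cores
print(f"Multi-core efficiency: {efficiency:.2f} % + {kernel_efficiency:.2f} % kernel overhead")
```

A multi-threaded run that keeps all cores busy should report a value much closer to 100 %.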
 

Storm-Chaser

That was on a 64 core 128 thread 7742 EPYC.
That's the 1B run, so here is my result for contrast. Try running the 2.5B or 10B option; there is probably a command line switch for it.

1670034561246.png

Here is my result for the same run, but on Windows 10. I'm not sure the scores are directly comparable across operating systems.
1670034756801.png
 


Storm-Chaser

As I predicted, this rig does pretty well on HWBOT, because it's a fairly rare hardware configuration and the sample pool is much smaller.
1670120928312.png
 


StefanR5R

Elite Member
Dec 10, 2016
5,459
7,718
136
I have got dual-socket E5-2696 v4 computers too.
And I've got dual-socket 7452 computers.

Similar to Markfw, I use them basically for Distributed Computing/ volunteer science.

I configured the BIOS of the 2696 v4s such that they run at turbo clocks for indefinite time, which isn't the most energy efficient thing to do (the clocks are still low though), but benefits performance consistency.
I configure the BIOS of the 7452s sometimes to their default 155 W TDP and PPT, sometimes to 180 W cTDP_up and PPT. So far I only used the default 'performance determinism' mode, not the alternative 'power determinism' mode. (Hopefully I remembered that right; I can't look into the BIOS settings right now to be sure of what I'm talking here.)

In almost all science applications where I could make a reasonably accurate assessment of performance (often this isn't easy, because workloads can vary widely within one and the same application), the 7452s had about 2x the performance and 2.5x the power efficiency of the 2696 v4s. That is to be expected, as it's 7 nm against 14 nm, and 32c against 22c.
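
The ratios quoted above imply a couple of derived figures. The sketch below only rearranges those approximate numbers (2x performance, 2.5x efficiency, 32 vs. 22 cores per socket). The variable names are illustrative, and the results are no more precise than the inputs.

```python
perf_ratio = 2.0         # 7452 throughput relative to 2696 v4 (approximate)
efficiency_ratio = 2.5   # 7452 performance per watt relative to 2696 v4
cores_7452, cores_2696v4 = 32, 22

# efficiency = performance / power, so power = performance / efficiency.
power_ratio = perf_ratio / efficiency_ratio
# Per-core throughput ratio after normalizing for core count.
per_core_ratio = perf_ratio * cores_2696v4 / cores_7452

print(f"relative power draw: {power_ratio:.2f}x")
print(f"relative per-core performance: {per_core_ratio:.3f}x")
```

In other words, the quoted figures would put the 7452 system at roughly 0.8x the power draw while each core does about 1.4x the work.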

There is one benefit of the 2696 v4s over the 7452s that comes in handy on some occasions: 1x 55 MB of level 3 cache per socket, while a 7452 has 8x 16 MB of level 3 cache. Some applications that support multithreading and operate on larger data structures in their innermost computational loops benefit heavily from the undivided cache of Broadwell-EP; in other words, they take a big performance and power efficiency hit from the many partitions of Epyc Rome's caches.

Example: During the next ~2 weeks, I'll be torturing both with >32 MB large Fast Fourier Transforms [edit: number-theoretic transforms, which AFAIU work somewhat similarly to FFTs; the implementation uses FMA3 operations on vectors of FP64 numbers] (probably about 40 MB cache demand with all the rest). In this exercise, the 7452s only reach about 1.2x the performance of the 2696 v4s. I haven't checked the power efficiency ratio in this application yet. The 1.2x figure is after thorough optimization of the application on both platforms, each with the respective optimum thread count per task and optimum binding of tasks to sets of logical CPUs. That naturally takes more work on the Epycs than on the Xeons, due to the different cache organization.
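
Binding a task to a set of logical CPUs can be sketched on Linux with Python's os.sched_setaffinity. The CPU list format below mirrors the "Node 0: 0-3 64-67" lines in the y-cruncher log earlier in the thread. Both helper functions are hypothetical, and which logical CPUs actually share an L3 slice on a given machine has to be checked separately (for example with lscpu).

```python
import os

def parse_cpu_list(spec: str) -> set[int]:
    """Turn a string such as "0-3 64-67" into a set of logical CPU ids."""
    cpus: set[int] = set()
    for part in spec.replace(",", " ").split():
        if "-" in part:
            first, last = part.split("-")
            cpus.update(range(int(first), int(last) + 1))
        else:
            cpus.add(int(part))
    return cpus

def pin_current_process(spec: str) -> set[int]:
    """Restrict this process to the listed CPUs that the OS lets it use (Linux only)."""
    allowed = os.sched_getaffinity(0)
    wanted = parse_cpu_list(spec) & allowed
    if not wanted:
        raise ValueError(f"none of {spec!r} is available to this process")
    os.sched_setaffinity(0, wanted)
    return os.sched_getaffinity(0)

if __name__ == "__main__":
    print(sorted(parse_cpu_list("0-3 64-67")))
```

Calling pin_current_process before starting the worker threads keeps them on one group of CPUs, which is the kind of per-task binding described above. The right group size and thread count still have to be found by measurement.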
 

Storm-Chaser

@StefanR5R

Thank you for adding some excellent context to the thread. Looking forward to further discussions regarding this CPU tech. At present I am doing my best to get up to speed with these new processors. I'm coming from two Xeon E5-2696 v3 processors (18C/36T each), FWIW. Regarding the memory, I got a decent bump in performance there as well, going from 2133 MHz (the Haswell max) to 2400 MHz, the maximum speed possible with Broadwell. 64 GB in total.

I have been working with the various snoop modes, benchmarking each one to see which offers the best performance for a given workload. It's unfortunate that the vast majority of benchmarks are not NUMA-aware, because NUMA awareness does offer some decent performance boosts under the right circumstances. The benchmark shown below is an interesting alternative to AIDA64 for memory testing. It is much more comprehensive, but its latency numbers seem to be about 20 ns higher than what you will see from AIDA64.

EARLY SNOOP:
1669954374788-png.2586392


HOME SNOOP:
1669954631879-png.2586393


DIRECTORY WITH OSB SNOOP MODE
1669955028143-png.2586394


CLUSTER-ON-DIE SNOOP MODE
1669955302647-png.2586395


The HP Z840 is pretty much locked down, so I cannot manipulate turbo settings in any way whatsoever. At one point I even tried a 1680 v2 just for kicks, but even then HP imposes the stock 150 W TDP limit.

@StefanR5R
@Markfw

I will post a link to the memory bandwidth program shortly; hopefully they let you attach zip files. It will plot the chart for you just like above. To be clear, the first link is for NUMA-enabled CPUs and the second link is for non-NUMA systems with up to 64 threads. You can also measure latency.

HP's performance advisor has a cute little block diagram of my setup... lol
1670294007104.png
 

Storm-Chaser

Steps to run the memory latency/bandwidth benchmark (NOT NUMA-aware)

1) Download the benchmark here:
MicrobenchmarkGui.zip

2) Extract files to a local directory

3) Run the program (MicrobenchmarkGui.exe):

[screenshot]



4) Disregard SmartScreen filter and run it anyway

[screenshot]


5) For the bandwidth benchmark, be sure to max out your thread count.

6) The bandwidth number you will be scored on is highlighted below. Please take a snip like this for your submission:

[screenshot]


7) Please also include CPU-Z screenshots of the CPU and memory tabs, like this. Also include your Windows version with your submission, thank you!
[screenshot]


8) For the latency test, do the same. Once again, the result is listed at the very bottom of your result window:
[screenshot]


There will be two leaderboards for this competition: latency and bandwidth. We can look at L1, L2, and L3 cache numbers later. Here is the breakdown:

Yellow square is L1
Orange square is L2
Red square is L3
Dark red square is System RAM
1668030445095-png.2581867
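
To read the chart, it helps to know roughly where each region starts. The sketch below classifies a test size by the level of the hierarchy it should fit in, assuming the per-core L1D and L2 sizes of Broadwell-EP (32 KB and 256 KB) and the 55 MB per-socket L3 of the E5-2696 v4. Other CPUs need different thresholds, and the function name is hypothetical.

```python
KIB = 1024
MIB = 1024 * KIB

# Assumed capacities for an E5-2696 v4 (Broadwell-EP); adjust for other CPUs.
LEVELS = [
    ("L1", 32 * KIB),    # per-core L1 data cache
    ("L2", 256 * KIB),   # per-core L2 cache
    ("L3", 55 * MIB),    # shared L3 per socket
]

def memory_level(test_size_bytes: int) -> str:
    """Return the smallest level whose capacity can hold the test region."""
    for name, capacity in LEVELS:
        if test_size_bytes <= capacity:
            return name
    return "RAM"

for size in (16 * KIB, 128 * KIB, 8 * MIB, 512 * MIB):
    print(f"{size // KIB:>7} KiB -> {memory_level(size)}")
```

Real curves do not step cleanly at these boundaries, so treat the thresholds as a guide to where latency and bandwidth should start to change rather than as exact cut-offs.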




Additional reading here:

AMD’s Zen 4, Part 2: Memory Subsystem and Conclusion – Chips and Cheese

*I have attached the NUMA-aware benchmark directly to this post as well. You will need to rename the extension from .TIFF to .EXE once you download it. If you have problems downloading it, message me and I will send it to you via email.

This GUI benchmark only measures read performance, but it seems to be quite accurate, and you get a neat little chart to go along with it.











Attachments

  • BWGUI.TIFF (9 KB)

StefanR5R

FWIW, I don't have Windows.
My Xeons are partly populated with 1 DIMM per channel (16 GB single rank, 128 GB total) and partly with 2 DIMMs per channel (8 GB single rank, 128 GB total). I once ran a somewhat memory-performance-sensitive application and got slightly better application performance from the 2 DPC config. I don't remember anymore which application it was. Maybe I should have looked for dual rank DIMMs for the 1 DPC population...
While the 1 DPC config runs its DDR4-2400 at full speed, I *believe* the 2 DPC config has the memory kicked down to 2133. That's common with Broadwell-EP but could be avoided with LR-DIMMs.
 