2x Xeon E5-2696 v4 Benchmark results and tuning log (44C/88T)

Storm-Chaser · Dec 2, 2022

I like what I am seeing

This thread will serve as a benchmark log for my ongoing HP Z840 project. The rig is built specifically for benching and nothing else. Obviously, it cannot be overclocked, so the only thing I can do is come at it with as many cores as possible. Still, I am thinking it will perform well against 12th gen Intel and AMD processors as it stands; we will see about 13th gen. This HP Z840 is now maxed out at the limit of what the platform supports, with the highest core count CPUs available for it (22C/44T each), the highest allowable memory speed (2400 MHz in an effectively octal-channel configuration, i.e. quad channel per socket across two sockets), and the lowest possible memory latency at that speed (CL17).
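
For reference, here is a rough sketch of what those memory numbers work out to on paper. It assumes 64-bit (8-byte) DDR4 channels and four channels per socket, and it only gives theoretical peaks; measured bandwidth will be lower.

```python
# Theoretical peak DRAM bandwidth and CAS latency for DDR4-2400 CL17.
# Assumptions (not measurements): 8 bytes per transfer per channel,
# 4 channels per socket, 2 sockets.

def peak_bandwidth_gbs(mega_transfers_per_s, channels, bytes_per_transfer=8):
    """Return theoretical peak bandwidth in GB/s (decimal gigabytes)."""
    return mega_transfers_per_s * 1e6 * bytes_per_transfer * channels / 1e9

per_channel = peak_bandwidth_gbs(2400, channels=1)
per_socket = peak_bandwidth_gbs(2400, channels=4)
both_sockets = peak_bandwidth_gbs(2400, channels=8)

# CAS latency in ns: CL cycles at the memory clock, which is half the
# transfer rate (1200 MHz for DDR4-2400).
cas_latency_ns = 17 / (2400 / 2) * 1000

print(f"{per_channel:.1f} GB/s per channel")
print(f"{per_socket:.1f} GB/s per socket")
print(f"{both_sockets:.1f} GB/s across both sockets")
print(f"CL17 is about {cas_latency_ns:.2f} ns")
```

CAS latency is only one part of the total memory latency a benchmark reports, so the last figure is a lower bound on that one component and not a prediction of measured latency.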

The processors got lost in a shipment for years, so I was able to score them brand new from a supplier in the States for a pretty good deal.

I will be posting a number of benchmarks here as I test out these brand new CPUs and push them to the limit. Also comparing and contrasting to my old CPUs, two E5 2696 v3 processors, which had 18 cores and 36 threads.

Specs on new processors:
Note there is one error here, which is single core turbo speed. It's actually 3.7GHz not 3.6.

Rig we will be working with:

Markfw · Dec 2, 2022

What CPUs may post results here? EPYCs? I am running Linux, so benchmarks that have free Linux downloads? I would be interested. My best would be dual Milan 64 cores (7763), so 128 cores/256 threads, all the way down to a 5950X. I have nothing with less than 16 cores.

Storm-Chaser · Dec 2, 2022

Markfw said:
What CPUs may post results here? EPYCs? I am running Linux, so benchmarks that have free Linux downloads? I would be interested. My best would be dual Milan 64 cores (7763), so 128 cores/256 threads, all the way down to a 5950X. I have nothing with less than 16 cores.

You are welcome to post your results here from any of those machines. Basically any HEDT system is fine.

EDIT: Any newer gen INTEL/AMD CPUs are fine as well because I am interested in how they stack up against this rig

Markfw · Dec 2, 2022

Can someone suggest a linux benchmark that can also run on windows ? Something that is free ?

Storm-Chaser · Dec 2, 2022

Markfw said:
Can someone suggest a linux benchmark that can also run on windows ? Something that is free ?

y-cruncher. This benchmark is pretty aggressive, so make sure your cooling is good because it's really going to punish it.

EDIT: Since these machines are long legged, I'd recommend benching 2.5B then 10B so we can really see how well the processors scale in terms of multi core performance.

EDIT 2: I do believe PassMark has a Linux version as well, and it's free to test your CPU on.

Markfw · Dec 2, 2022

Storm-Chaser said:
y-cruncher. This benchmark is pretty aggressive, so make sure your cooling is good because it's really going to punish it.

EDIT: Since these machines are long legged, I'd recommend benching 2.5B then 10B so we can really see how well the processors scale in terms of multi core performance.

EDIT 2: I do believe PassMark has a Linux version as well, and it's free to test your CPU on.

OK, I am a noob at linux, and I downloaded this, but no .deb package, the readme does not say how to run it, and the command lines file reads like greek. How do you run the linux download ?

Storm-Chaser · Dec 2, 2022

Markfw said:
OK, I am a noob at linux, and I downloaded this, but no .deb package, the readme does not say how to run it, and the command lines file reads like greek. How do you run the linux download ?

Linux version?

Markfw · Dec 2, 2022

Storm-Chaser said:
Linux version?

Linux Mint 19.2; some are 20.3

Markfw · Dec 2, 2022

This is all I see

Storm-Chaser · Dec 2, 2022

Markfw said:
Linux Mint 19.2; some are 20.3

Okay, not sure if this will help because it's for a different version but check it out anyway.

How to use y-cruncher in ubuntu? | Tom's Hardware Forum (tomshardware.com)

also download CPU-X and screenshot it for the thread (different subject altogether)

Storm-Chaser · Dec 2, 2022

Baseline CPUz run with new processors installed

Markfw · Dec 2, 2022

Here is the result.
y-cruncher v0.7.10 Build 9513

Detecting Environment...

CPU Vendor:
AMD = Yes
Intel = No

OS Features:
* 64-bit = Yes
* OS AVX = Yes
* OS AVX512 = No

Hardware Features:
MMX = Yes
* x64 = Yes
* ABM = Yes
RDRAND = Yes
RDSEED = Yes
BMI1 = Yes
* BMI2 = Yes
* ADX = Yes
MPX = No
PREFETCHW = Yes
PREFETCHWT1 = No
RDPID = Yes
GFNI = No
VAES = No

SIMD: 128-bit
* SSE = Yes
* SSE2 = Yes
* SSE3 = Yes
* SSSE3 = Yes
SSE4a = Yes
* SSE4.1 = Yes
* SSE4.2 = Yes
AES-NI = Yes
SHA = Yes

SIMD: 256-bit
* AVX = Yes
XOP = No
* FMA3 = Yes
* FMA4 = No
* AVX2 = Yes

SIMD: 512-bit
* AVX512-F = No
AVX512-CD = No
AVX512-PF = No
AVX512-ER = No
* AVX512-VL = No
* AVX512-BW = No
* AVX512-DQ = No
* AVX512-IFMA = No
* AVX512-VBMI = No

Alright Intel, how many drinks have you had tonight?
AVX512-VPOPCNTDQ = No
AVX512-4FMAPS = No
AVX512-4VNNIW = No
AVX512-VBMI2 = No
AVX512-VPCLMUL = No
AVX512-VNNI = No
AVX512-BITALG = No
AVX512-BF16 = No
* AVX512-GFNI = No
AVX512-VAES = No
AVX512-VP2INTERSECT = No
AVX512-FP16 = No

Auto-Selecting: 19-ZN2 ~ Kagari

/home/mark/Downloads/y-cruncher v0.7.10.9513-static/Binaries/19-ZN2 ~ Kagari

Launching y-cruncher...
================================================================

Insufficient permissions to set thread priority. Please retry as root.

Further messages for this warning will be suppressed.

Checking processor/OS features...

Required Features:
x64, ABM, BMI1, BMI2, ADX,
SSE, SSE2, SSE3, SSSE3, SSE4.1, SSE4.2,
AVX, FMA3, AVX2

Parsing Core -> Handle Mappings...
Cores: 0-127

Parsing NUMA -> Core Mappings...
Node 0: 0-3 64-67
Node 1: 4-7 68-71
Node 2: 8-11 72-75
Node 3: 12-15 76-79
Node 4: 16-19 80-83
Node 5: 20-23 84-87
Node 6: 24-27 88-91
Node 7: 28-31 92-95
Node 8: 32-35 96-99
Node 9: 36-39 100-103
Node 10: 40-43 104-107
Node 11: 44-47 108-111
Node 12: 48-51 112-115
Node 13: 52-55 116-119
Node 14: 56-59 120-123
Node 15: 60-63 124-127

y-cruncher v0.7.10 Build 9513 ( www.numberworld.org )
Copyright 2008-2020 Alexander J. Yee ( a-yee@u.northwestern.edu )

Distribute Freely - Please report any bugs.

Tuning: Linux/19-ZN2 ~ Kagari - Zen 2 Matisse (x64 ADX)

0 Benchmark Pi (all in ram)
1 Component Stress Tester
2 Run I/O Performance Analysis

3 Custom Compute a Constant
- Compute other constants (e, Golden Ratio, etc...)
- Choose your own settings (use disk for large computations)

4 BBP Digit Extractor for Pi
5 Digit Viewer

6 Advanced Options
7 About

Enter your choice:
option: 0

Benchmark Pi:

Select a Benchmark Type:

0 Single-Threaded
1 Multi-Threaded

option: 0

Available Memory: 90.9 GiB

Option Decimal Digits Approx. Memory Needed

1 25,000,000 211 MiB
2 50,000,000 325 MiB
3 100,000,000 556 MiB
4 250,000,000 1.21 GiB
5 500,000,000 2.34 GiB
6 1,000,000,000 4.52 GiB
7 2,500,000,000 11.0 GiB
8 5,000,000,000 22.8 GiB
9 10,000,000,000 45.5 GiB
10 25,000,000,000 116 GiB
11 50,000,000,000 228 GiB
12 100,000,000,000 457 GiB
13 250,000,000,000 1.12 TiB
14 500,000,000,000 2.24 TiB
15 1,000,000,000,000 4.48 TiB
16 2,500,000,000,000 11.3 TiB

0 I prefer SuperPi sizes... (1M, 2M, 4M...)

option: 1

This process does not have "CAP_IPC_LOCK". Page locking will not be possible.
Please run y-cruncher with elevation to enable page locking.

Constant: Pi
Algorithm: Chudnovsky (1988)

Decimal Digits: 25,000,000
Hexadecimal Digits: Disabled

Computation Mode: Ram Only
Multi-Threading: None (No Multi-threading) -> 1 / ?

Start Time: Fri Dec 2 18:13:09 2022

Working Memory... 131 MiB (spread: ?)
Twiddle Tables... 81.6 MiB (spread: ?)

Begin Computation:

Series CommonP2B3... 1,762,854 terms (Expansion Factor = 2.360)
Time: 7.345 seconds ( 0.122 minutes )
Large Division...
Time: 0.510 seconds ( 0.008 minutes )
InvSqrt(10005)...
Time: 0.336 seconds ( 0.006 minutes )
Large Multiply...
Time: 0.182 seconds ( 0.003 minutes )

Pi: 8.374 seconds ( 0.140 minutes )

Base Converting:
Time: 0.827 seconds ( 0.014 minutes )
Writing Decimal Digits:
Time: 0.042 seconds ( 0.001 minutes )

Start Time: Fri Dec 2 18:13:09 2022
End Time: Fri Dec 2 18:13:18 2022

Total Computation Time: 9.201 seconds ( 0.153 minutes )
Start-to-End Wall Time: 9.358 seconds ( 0.156 minutes )

CPU Utilization: 99.01 % + 0.98 % kernel overhead
Multi-core Efficiency: 0.77 % + 0.01 % kernel overhead

Last Decimal Digits: Pi
3803750790 9491563108 2381689226 7224175329 0045253446 : 24,999,950
0786411592 4597806944 2455112852 2554677483 6191884322 : 25,000,000

Spot Check: Good through 25,000,000

Version: 0.7.10.9513 (Linux/19-ZN2 ~ Kagari)
Processor(s): AMD Eng Sample: 100-000000053-04_32/20_N
Topology: 128 threads / 64 cores / 1 socket / 16 NUMA nodes
Usable Memory: 118,070,956,032 ( 110 GiB)
CPU Base Frequency: 1,996,177,600 Hz

Validation File: Pi - 20221202-181319.txt
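
A quick sanity check on the summary above. The "Multi-core Efficiency" line looks tiny because this was the single-threaded run on a 128-thread machine. The sketch below assumes that figure is simply CPU utilization divided by the number of logical processors; that reproduces the logged values, but it is an inference from the numbers and not something taken from y-cruncher's documentation.

```python
# Values copied from the y-cruncher summary in this post.
cpu_utilization_pct = 99.01
kernel_overhead_pct = 0.98
logical_threads = 128  # "128 threads / 64 cores / 1 socket"

# Assumption: efficiency = utilization spread over all logical processors.
efficiency_pct = cpu_utilization_pct / logical_threads
kernel_share_pct = kernel_overhead_pct / logical_threads

print(f"Multi-core Efficiency: {efficiency_pct:.2f} % "
      f"+ {kernel_share_pct:.2f} % kernel overhead")

# The reported total also lines up with the Pi time plus base conversion.
total_s = 8.374 + 0.827
print(f"Pi + base conversion: {total_s:.3f} seconds")
```

In other words, one thread was fully busy and the other 127 were idle, which is what a single-threaded 25M run should look like.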

Markfw · Dec 2, 2022

That was on a 64 core 128 thread 7742 EPYC.
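
The NUMA section of that log lists two CPU ranges per node. A minimal sketch for expanding those lines into explicit CPU lists is below; the function names are made up for illustration, and reading the second range as the SMT siblings of the first is an assumption based on how Linux usually numbers logical CPUs.

```python
def expand_ranges(spec):
    """Turn a string like '0-3 64-67' into a list of CPU numbers."""
    cpus = []
    for part in spec.split():
        low, _, high = part.partition("-")
        cpus.extend(range(int(low), int(high or low) + 1))
    return cpus

def parse_node_line(line):
    """Parse a line like 'Node 0: 0-3 64-67' into (node_id, cpu_list)."""
    label, _, spec = line.partition(":")
    return int(label.split()[1]), expand_ranges(spec)

# Two lines copied from the log above.
sample = ["Node 0: 0-3 64-67", "Node 15: 60-63 124-127"]
nodes = dict(parse_node_line(line) for line in sample)

for node_id, cpus in nodes.items():
    print(f"Node {node_id}: {len(cpus)} logical CPUs -> {cpus}")
```

With 16 nodes of 8 logical CPUs each, that accounts for all 128 threads in the log.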

Storm-Chaser · Dec 2, 2022

Markfw said:
That was on a 64 core 128 thread 7742 EPYC.

That's the 1B run, so here is my result to contrast. Try to run the 2.5B or 10B option. Probably command line switch.

Here is my result with the same run, but on Windows 10; not sure if the scores translate entirely accurately.

Storm-Chaser · Dec 2, 2022

Better CPUz result!

Storm-Chaser · Dec 2, 2022

Just purchased this for it:

Storm-Chaser · Dec 3, 2022

memory tuning

Storm-Chaser · Dec 3, 2022

As I predicted, this rig does pretty well in the HWBOT rankings, because it's a fairly rare hardware configuration and the sample pool is much smaller.

Storm-Chaser · Dec 5, 2022

@Markfw
Did u get a chance to run 2.5B or 10B on ycruncher? Really interested to see how that EPYC performs.

Markfw · Dec 5, 2022

Storm-Chaser said:
@Markfw
Did u get a chance to run 2.5B or 10B on ycruncher? Really interested to see how that EPYC performs.

How do you do that ? parameters ?????

Storm-Chaser · Dec 5, 2022

Markfw said:
How do you do that ? parameters ?????

Try this (you might have already downloaded this so skip it if you've done that already):
I downloaded the static binaries from numberworld.org (y-cruncher - A Multi-Threaded Pi Program) (http://www.numberworld.org/y-cruncher/y-cruncher v0.7.10.9513-static.tar.xz), extracted them with

Code:
tar xvf [filename]
and just ran

Code:

./y-cruncher bench 10b

from a terminal in the extracted directory.
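
Once a run like that finishes, the lines worth posting are the digit count and the total time. Here is a small sketch for pulling them out of saved output so runs can be compared side by side. The function name is hypothetical, and the patterns are written against the line format shown in the log earlier in this thread, so other y-cruncher versions may need adjusting.

```python
import re

def summarize_run(log_text):
    """Extract digit count and total compute time from y-cruncher output.

    Returns (digits, seconds); either value is None if its line is missing.
    """
    digits = re.search(r"Decimal Digits:\s*([\d,]+)", log_text)
    seconds = re.search(r"Total Computation Time:\s*([\d.]+)\s*seconds", log_text)
    return (
        int(digits.group(1).replace(",", "")) if digits else None,
        float(seconds.group(1)) if seconds else None,
    )

# Lines copied from the single-threaded run posted earlier in the thread.
sample = """Decimal Digits: 25,000,000
Total Computation Time: 9.201 seconds ( 0.153 minutes )
"""

digits, seconds = summarize_run(sample)
print(f"{digits:,} digits in {seconds} s")
```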

StefanR5R · Dec 5, 2022

I have got dual-socket E5-2696 v4 computers too.
And I've got dual-socket 7452 computers.

Similar to Markfw, I use them basically for Distributed Computing/ volunteer science.

I configured the BIOS of the 2696 v4s such that they run at turbo clocks for indefinite time, which isn't the most energy efficient thing to do (the clocks are still low though), but benefits performance consistency.
I configure the BIOS of the 7452s sometimes to their default 155 W TDP and PPT, sometimes to 180 W cTDP_up and PPT. So far I only used the default 'performance determinism' mode, not the alternative 'power determinism' mode. (Hopefully I remembered that right; I can't look into the BIOS settings right now to be sure of what I'm talking about here.)

In almost all science applications in which I could make a reasonably accurate assessment of performance (often this isn't easy, because workloads can vary widely within one and the same application), the 7452s had about 2x the performance and 2.5x the power efficiency of the 2696 v4s. Which is to be expected, as that's 7 nm against 14 nm, and 32c against 22c.
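
Spelling out what those two ratios imply, as a rough sketch. It takes the approximate 2x and 2.5x figures at face value and defines power efficiency as performance per watt; the derived numbers are only as good as those inputs.

```python
# Approximate ratios quoted above (7452 system relative to 2696 v4 system).
perf_ratio = 2.0        # about 2x the performance
efficiency_ratio = 2.5  # about 2.5x the performance per watt
cores_7452, cores_2696v4 = 32, 22

# efficiency = performance / power, so power = performance / efficiency.
power_ratio = perf_ratio / efficiency_ratio

# Performance per core, normalising out the core count difference.
per_core_ratio = perf_ratio / (cores_7452 / cores_2696v4)

print(f"Relative power draw:      {power_ratio:.2f}x")
print(f"Relative per-core perf:   {per_core_ratio:.3f}x")
```

So on those figures the Epyc box would draw roughly 0.8x the power while each core does roughly 1.4x the work.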

There is one benefit of the 2696 v4s in comparison to the 7452s, which comes in handy on some occasions: 1x 55 MB level 3 cache per socket, while a 7452 has got 8x 16 MB level 3 cache. Some applications which support multithreading and operate on larger data structures in their innermost computational loops benefit heavily from the undivided cache of Broadwell-EP, or in other words, take a big performance and power efficiency hit from the many partitions of Epyc Rome's caches.

Example: During the next ~2 weeks, I'll be torturing both with >32 MB large ~~Fast Fourier~~ Transforms [edit: number-theoretic transforms which AFAIU work somewhat similar to FFTs; the implementation uses FMA3 operations on vectors of FP64 numbers] (probably about 40 MB cache demand with all the rest). In this exercise, the 7452s only reach about 1.2x the performance of the 2696v4s. I haven't checked the power efficiency ratio in this application yet. The 1.2x figure is after thorough optimization of the application on both platforms, each with the respective optimum thread count per task, and optimum binding of tasks to sets of logical CPUs. Which takes more work on the Epycs than on the Xeons, naturally, due to the different cache organization.
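
A back-of-envelope illustration of why the roughly 40 MB case is awkward on Rome. This sketch only compares the working set with the size of a single L3 partition, using the cache sizes quoted above; it ignores L2, replacement policy and cross-partition traffic, so it is not a performance model.

```python
working_set_mb = 40  # "probably about 40 MB cache demand"

# (number of L3 partitions per socket, size of each partition in MB)
l3_layout = {
    "E5-2696 v4": (1, 55),
    "EPYC 7452": (8, 16),
}

fits = {}
for cpu, (partitions, size_mb) in l3_layout.items():
    fits[cpu] = working_set_mb <= size_mb
    total_mb = partitions * size_mb
    print(f"{cpu}: {partitions} x {size_mb} MB = {total_mb} MB total, "
          f"task fits in one partition: {fits[cpu]}")
```

The Epyc has more L3 in total, but a single task of that size cannot sit inside any one 16 MB partition, which is consistent with the smaller 1.2x advantage described above.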

Storm-Chaser · Dec 5, 2022

@StefanR5R

Thank you for adding some excellent context to the thread. Looking forward to further discussions regarding this CPU tech stuff.... At present I am doing my best to get myself up to speed with these new processors. I'm coming from two Xeon E5-2696 v3 processors (18C/36T) FWIW. Regarding the memory, I had a decent bump in performance there as well, going from 2133 MHz (Haswell max) to 2400 MHz, the maximum speed possible with Broadwell. 64 GB in total.

I have been working with the various snoop modes, benchmarking each one and experimenting to see which one offers optimal performance for a given workload. It's unfortunate that the vast majority of benchmarks are not NUMA-aware, because NUMA awareness does offer some decent performance boosts under the right circumstances. This is an interesting alternative memory benchmark to AIDA64. It is much more comprehensive, but the latency numbers themselves seem to be about 20 ns higher than what you will see from AIDA64.

EARLY SNOOP:

HOME SNOOP:

DIRECTORY WITH OSB SNOOP MODE

CLUSTER-ON-DIE SNOOP MODE

The HP z840 is pretty much hard locked down so I cannot manipulate turbo settings in any way whatsoever. At one point, I had even tried a 1680 v2 just for kicks but even then, HP imposes the stock 150W TDP wattage limit.

@StefanR5R
@Markfw

I will post a link to memory bandwidth program shortly - hopefully they let you attach zip files. It will plot the chart for you just like above, and to be clear the first link is for NUMA enabled CPUs, second link is for non numa up to 64 threads. You can also measure latency.

HP's performance advisor has a cute little block diagram of my setup... lol

Storm-Chaser · Dec 5, 2022

Steps to run the memory latency/bandwidth benchmark (NOT numa aware)

1) Download the benchmark here:
MicrobenchmarkGui.zip

2) Extract files to a local directory

3) Run the program (MicrobenchmarkGui.exe):

4) Disregard SmartScreen filter and run it anyway

5) For the bandwidth benchmark, be sure to max out your thread count here.

6) The bandwidth number you will be scored on is highlighted below. Please take a snip like this for your submission:

7) Please also include CPUz screenshots of CPU and memory tabs, like this...
Also include your windows version with your submission, thank you!

8) For latency test do the same. Once again, the result is listed at the very bottom of your result window:

There will be two leaderboards for this competition: latency and bandwidth. We can look at L1, L2 and L3 cache numbers later... here is the breakdown:

Yellow square is L1
Orange square is L2
Red square is L3

Dark red square is System RAM

Additional reading here:

AMD’s Zen 4, Part 2: Memory Subsystem and Conclusion – Chips and Cheese

*I have attached the NUMA-aware benchmark as well, directly to this post. You will need to rename the extension from .TIFF to .EXE once you download it. If you have problems downloading it, message me and I will send it to you via email.

This GUI benchmark only measures read performance, but it seems to be quite accurate, and you get a neat little chart to go along with it.

Attachments

StefanR5R · Dec 6, 2022

FWIW, I don't have Windows.
My Xeons are partly populated with 1 DIMM per channel, 16 GB single rank (128 GB total), and partly with 2 DIMMs per channel, 8 GB single rank (128 GB total). I once ran a somewhat memory performance sensitive application and got slightly better application performance from the 2 DPC config. I don't remember anymore which application it was. Maybe I should have looked for dual rank DIMMs for the 1 DPC population...
Though while the 1 DPC config runs its DDR4-2400 at full speed, I *believe* the 2 DPC config has got the memory kicked down to 2133. That's common with Broadwell-EP but could be avoided with LR-DIMMs.
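
To put a number on that downclock, here is a sketch of the theoretical peak bandwidth at both speeds. It assumes 8-byte DDR4 channels and eight channels in total (four per socket, two sockets), and it uses the nominal 2133 and 2400 MT/s figures; real application impact depends on how bandwidth-bound the workload is.

```python
def peak_bandwidth_gbs(mega_transfers_per_s, channels=8, bytes_per_transfer=8):
    """Theoretical peak bandwidth in GB/s for the assumed channel layout."""
    return mega_transfers_per_s * 1e6 * bytes_per_transfer * channels / 1e9

full_speed = peak_bandwidth_gbs(2400)  # 1 DPC at DDR4-2400
downclocked = peak_bandwidth_gbs(2133)  # 2 DPC kicked down to DDR4-2133

loss_pct = (1 - downclocked / full_speed) * 100

print(f"DDR4-2400: {full_speed:.1f} GB/s")
print(f"DDR4-2133: {downclocked:.1f} GB/s")
print(f"Peak bandwidth lost at 2133: {loss_pct:.1f} %")
```

That is about an 11 % drop in peak bandwidth on paper, which the extra ranks of a 2 DPC population may partly offset in practice.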
