Discussion PES | Assessing Power and Performance Efficiency of x86 CPU architectures

BorisTheBlade82 · Oct 9, 2021

Dear Community,

so this is my first thread here as a long-time lurker - but I felt the desire to share a small hobby-project of mine from the last couple of months with you...

Performance Efficiency Suite - What is it about?
Most reviewers focus solely on what they consider the most important aspect of modern CPUs: absolute performance. But this is only one side of the equation. Today power efficiency is at least as important, or to be more precise: the amount of energy (watt-seconds or joules) a CPU needs to accomplish a given workload. Sadly, most reviewers shy away from the extra mile it takes to assess this aspect. This suite measures the Total Package Power of a CPU while running the Cinebench R23 benchmark, first in single-threaded mode (1 run), then in multi-threaded mode (for 10 minutes plus whatever it takes to finish the last run). The results are rendered in the provided Results.xlsx Excel file. To combine efficiency and performance there is also a score called the Performance Efficiency Score (how amazingly inspired I am).

In the meantime I was able to aggregate more than 80 samples from members of the 3DC & CB communities (see below).

How-To

Unzip the latest release to wherever you want EXCEPT on your local OneDrive folder.
Open Settings.txt and insert your local Cinebench23 Directory.
Run PES Start - it will ask for Administrator rights as these are needed for measuring Package Power
Wait until the Powershell finishes.
Open the Excel file...
Allow external connections (to the generated CSV-files with the data)
Go to Data -> Refresh all
Enjoy and share your results - just take a screenshot of what the Excel renders.
If you want to do multiple measurements with different settings just copy the Excel file (inside the root-folder) before running and refreshing the data.

Some explanations about the Suite

This Suite has been made possible by Michael Möller and his amazing free and open-source Open Hardware Monitor and his .NET Library OpenHardwareMonitorLib.dll - Thanks a lot!!!
Homepage: https://openhardwaremonitor.org/
GitHub: https://github.com/openhardwaremonitor
The results for the Package Power look pretty accurate compared to the sparse data the internet provides. It seems that the vendors are much more honest with those sensors than they are with temperature etc.
The suite basically consists of some PowerShell scripts and an Excel file for presentation purposes:
- RunAsAdminWrapper.ps1
  This is needed to have a convenient relative path shortcut in the root folder and request admin-rights at the same time
- Main.ps1
  - After setting up some stuff it basically starts the Cinebench R23 runs one at a time. It then checks whether the "Cinebench.exe" process is active.
  - While this is true it queries the Package Power sensor data with a lower bound of 10ms between queries (in order to keep the CPU-load of the script at bay).
  - After each run the acquired data gets pushed to CSV files located in the LogCsv subfolder.
- Results.xlsx
  - The Excel file basically just does some import, calculations and a hopefully nice presentation of the data.
  - Histogram
    The bold line shows a running average of the last 100 data-points which should be sufficiently accurate. The pale line shows each single data-point.
  - Calculation of Total Package Consumption
    To get that number we need the integral. That is why we first calculate the time span between two data-points and then multiply it by the measured value.
  - Everything else in that Excel file is hopefully more or less self-explanatory
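
The sampling loop described for Main.ps1 can be sketched roughly as follows. This is a Python illustration, not the suite's PowerShell code: the function names, the CSV column names, and the injected `is_running` / `read_power_w` callables are all hypothetical stand-ins for the process check and the OpenHardwareMonitorLib sensor query.

```python
import csv
import time
from typing import Callable, List, Tuple


def sample_while_running(
    is_running: Callable[[], bool],
    read_power_w: Callable[[], float],
    min_interval_s: float = 0.010,
    clock: Callable[[], float] = time.perf_counter,
    sleep: Callable[[float], None] = time.sleep,
) -> List[Tuple[float, float]]:
    """Collect (elapsed_seconds, watts) pairs until is_running() turns False.

    is_running and read_power_w are hypothetical hooks; the real suite checks
    for the Cinebench.exe process and reads the package power sensor instead.
    """
    samples: List[Tuple[float, float]] = []
    start = clock()
    while is_running():
        t0 = clock()
        samples.append((t0 - start, read_power_w()))
        # Wait out the rest of the lower bound so the sampler itself
        # does not add noticeable CPU load.
        remaining = min_interval_s - (clock() - t0)
        if remaining > 0:
            sleep(remaining)
    return samples


def write_csv(path: str, samples: List[Tuple[float, float]]) -> None:
    """Write the samples of one run to a CSV file (column names are assumed)."""
    with open(path, "w", newline="") as fh:
        writer = csv.writer(fh)
        writer.writerow(["elapsed_s", "package_power_w"])
        writer.writerows(samples)
```

Passing the clock and sleep functions in as parameters is only a convenience for trying the loop without a real benchmark running.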

Online Resources

Disclaimer
I am by no means a Powershell professional or a professional Reviewer. I was just sick of the lack of information and wanted to propose a low-effort solution. Any input for further improvement is highly welcomed. Please feel free to use/extend/rip-off this solution as you wish. But please share your findings to the world.

BorisTheBlade82 · Oct 12, 2021

Det0x said:
First post don't really say if its stock only systems, or if tweaked ones also are allowed... But i have done some runs on my 5950x

Well, focus is on stock settings. But I guess I will add your 100w numbers - as they are quite astonishing - in order to show what a 5950x is capable of.

JoeRambo · Oct 12, 2021

Det0x said:
5950x pbo curve optimizer limited to 100w:

ST Efficiency Score = 89,92

MT Efficiency Score = 7516,68

5950x Hydra optimized to 100w:

ST Efficiency Score = 97,04

MT Efficiency Score = 8844,92

Great stuff, yet another reminder how mediocre stock and PBO settings are. Also how far things can be pushed with some know-how and tuning.

psolord · Oct 12, 2021

I thought it had something to do with Pro Evolution Soccer, which isn't even called that any more. xD

moinmoin · Oct 12, 2021

psolord said:
I thought it had something to do with Pro Evolution Soccer, which isn't even called that any more. xD

I honestly thought the timing was no coincidence.
"Konami doesn't use PES anymore, now my tool can use it instead."

amrnuke · Oct 13, 2021

0.7.3 version just runs multi-thread on an endless loop

JoeRambo · Oct 13, 2021

amrnuke said:
0.7.3 version just runs multi-thread on an endless loop

I think it runs for 10 minutes? At least that's what the script is supposed to do from the source.

amrnuke · Oct 13, 2021

JoeRambo said:
I think it runs for 10 minutes? At least that's what the script is supposed to do from the source.

I should probably not exaggerate! It felt like I waited more than 10 minutes, but I'm not sure.
Will trial again and time it to be sure.

therealmongo · Oct 13, 2021

Careful with this thread, he who shall not be named may magically appear (something to do with Apple being the greatest ....)

BorisTheBlade82 · Oct 13, 2021

amrnuke said:
0.7.3 version just runs multi-thread on an endless loop

No, it repeats the run as often as needed until the total duration exceeds 10 minutes. I will add that to the description.
The idea is that CPUs with many cores do not complete the whole test in Turbo.

amrnuke · Oct 15, 2021

5600X - PBO +200MHz
ST PES 65.24, Consumption 29,960, Duration 511.64
MT PES 1515.75, Consumption 8,638, Duration 76.37

Hmm...

BorisTheBlade82 · Oct 15, 2021

@amrnuke
It would be very nice if you could post a screenshot of the results graph.

The efficiency hit in comparison to the stock 5600X looks just as expected.

Det0x · Oct 15, 2021

A small update for me, its named "over nine thousand!"

This is the most efficient i can make my 5950x dual ccd 16 core run.
4250/4075mhz @ 0.962mv get.
Soc undervolted to 850mv get.

JoeRambo · Oct 16, 2021

Det0x said:
This is the most efficient i can make my 5950x dual ccd 16 core run.
4250/4075mhz @ 0.962mv get.
Soc undervolted to 850mv get.

Awesome! What memory can it do at those SOC volts?

BorisTheBlade82 · Oct 16, 2021

So in general, what to make of all these numbers...

Let's have a look at the comparison between the Pentium Silver N6000 and the R3 4300G under ST for example as this should give us some nice clues regarding the underlying architecture and process:

The former is the direct predecessor of the ADL little Gracemont core. The process is also 10nm, although it is closer to the variant ICL was released on than to the current 10ESF / Intel 7.
When comparing to the latter it is clear to me how much better Zen2 and TSMC 7nm would work for a small core. Not only is the latter much much faster, it also is quite a lot more power efficient. So for Intel to catch up or overtake with Gracemont will be quite a stretch. For me it is very impressive how widely Zen2/3 scales. Although it is the best foundation for what we could call a "little" core it also works pretty well as a "big" core at the same time.

Too bad we can not directly compare the Apple M1. I guess the results would be devastating for the competition.

JoeRambo · Oct 16, 2021

BorisTheBlade82 said:
The former is the direct predecessor of the ADL little Gracemont core. The process is also 10nm, although it is closer to the variant ICL was released on than to the current 10ESF / Intel 7.

I think calling it the "predecessor of Gracemont" is giving WAY too much credit to the previous Atom core. It was not meant to run workloads like CB at all. It is an underpowered little core, behind the curve in power efficiency, good for minor integer workloads.
It has no machinery to properly run CB workload, and is meant to run Chromebook style of workloads at ARM SoC speeds.

Tremont:

vs this monster that has Skylake level of FP vec resources:

3x vec ALU, two symmetric FMUL/FADD capable pipes, backed by dual load / dual store.

BorisTheBlade82 · Oct 16, 2021

@JoeRambo
You are absolutely right. What you are pointing out leans more to the performance side of the equation. The question is what will happen wrt power efficiency. And there I do not expect miracles from 10ESF.
If we compare ICL and TGL we can see that 10SF helped with improving max frequency but not so much performance efficiency. Of course we would need a comparison with ISO frequency to be fair.

sallymander · Oct 16, 2021

BorisTheBlade82 said:
Too bad we can not directly compare the Apple M1. I guess the results would be devastating for the competition.

I was actually trying to do this on my M1 Mac Mini. I got similar results to Andrei (about 3.8w package power average over the run) and the performance was about the same as the i7 1165G7 for ST. I think that gives an ST efficiency score of 800+ if I calculated it correctly.

mmaenpaa · Oct 16, 2021

5600G @45W (set in BIOS, memory 3600MHz XMP, other settings at stock), fully passive (the CPU, that is), which begins to show in the multi runs; build-in-progress picture below. The PSU also stays passive with these loads (WAF requirement for the living room).

BorisTheBlade82 · Oct 16, 2021

@sallymander
Yes, you are right. So let's call this an estimate:
According to several reviews (AT for example) the CB23 ST performance is practically identical to the 1165G7. So let's take its 553 seconds, and with the 3,8w from you and Andrei we have 2101Ws for the run. So we are looking at a PES of 860,7. That is total carnage for basically anyone else. I need to insert that into the x-y-chart because I think this is easier to grasp visually. That is like NFL vs. high-school football.

What is interesting is that M1 loses a lot of its relative advantage in MT.
If we take the 7833 points from Andrei we are looking at around 102s for one run (because approximately CB23 score = 800000 / duration in seconds).
So with 15w we get around 1530Ws and a PES of "only" around 6400.
That is still second only to the 16c/32t 5950x, but nevertheless a significant relative regression.
Here I can only speculate:

Icestorm is not so much an efficiency core for PPW but PPA.
At full load Icestorm is way beyond its perf-eff-sweet-spot as it was designed for light load (background tasks, low frequency and voltage).
Something entirely different.
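
The arithmetic in this estimate can be retraced with a short Python sketch. The thread never spells out the PES formula; the posted figures are consistent with PES = 10^9 / (energy in Ws × duration in s), so that formula is an inference from the numbers and should be read as an assumption, not as the definition used in Results.xlsx. The function name `pes` is made up for this sketch.

```python
def pes(energy_ws: float, duration_s: float) -> float:
    """Assumed scoring formula, inferred from the figures posted in this thread."""
    return 1e9 / (energy_ws * duration_s)


# M1 single-thread estimate: 553 s at about 3.8 W package power.
st_duration = 553
st_energy = round(st_duration * 3.8)        # rounded to whole Ws as in the post
st_pes = pes(st_energy, st_duration)

# M1 multi-thread estimate: CB23 score of 7833 at about 15 W,
# using the rule of thumb score = 800000 / duration in seconds.
mt_duration = round(800_000 / 7833)
mt_energy = 15 * mt_duration
mt_pes = pes(mt_energy, mt_duration)

print(f"ST: {st_energy} Ws over {st_duration} s, PES {st_pes:.1f}")
print(f"MT: {mt_energy} Ws over {mt_duration} s, PES {mt_pes:.0f}")
```

With these inputs the sketch lands on the 2101 Ws and 860,7 quoted for ST and on roughly 6400 for MT. The same assumed formula also reproduces the 5600X ST result posted earlier in the thread (29,960 Ws over 511.64 s gives about 65.24).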

sallymander · Oct 16, 2021

BorisTheBlade82 said:
@sallymander
Yes, you are right. So let's call this an estimate:
According to several reviews (AT for example) the CB23 ST performance is practically identical to the 1165G7. So let's take its 553 seconds, and with the 3,8w from you and Andrei we have 2101Ws for the run. So we are looking at a PES of 860,7. That is total carnage for basically anyone else. I need to insert that into the x-y-chart because I think this is easier to grasp visually. That is like NFL vs. high-school football.

What is interesting is that M1 loses a lot of its relative advantage in MT.
If we take the 7833 points from Andrei we are looking at around 102s for one run (because approximately CB23 score = 800000 / duration in seconds).
So with 15w we get around 1530Ws and a PES of "only" around 6400.
That is still second only to the 16c/32t 5950x, but nevertheless a significant relative regression.
Here I can only speculate:

Icestorm is not so much an efficiency core for PPW but PPA.

At full load Icestorm is way beyond its perf-eff-sweet-spot as it was designed for light load (background tasks, low frequency and voltage).

Something entirely different.

I'll see if I can do some MT testing too. I wonder if Ryzen has some fixed overheads (RAM?) that are much lower on the M1, making the Ryzen ST look worse.

Hulk · Oct 16, 2021

@BorisTheBlade82

Great work here. Thanks for your effort. How are you computing efficiency? I assume multiplying total power x time?

BorisTheBlade82 · Oct 16, 2021

Hulk said:
@BorisTheBlade82

Great work here. Thanks for your effort. How are you computing efficiency? I assume multiplying total power x time?

Well, basically sampling package power as often as possible and then calculating the integral (Joule or Wattseconds). Because this is what it is about with a fixed workload: How much energy is needed to work it through?
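
As a rough Python sketch of that calculation (the suite itself does this inside the Excel file): each reading is multiplied by the time elapsed since the previous reading, and the products are summed. Which of the two readings of an interval gets used is not stated in the thread, so taking the later one is an assumption here; a trapezoidal variant is shown next to it for comparison. Both function names are made up for illustration.

```python
from typing import Sequence, Tuple

Sample = Tuple[float, float]  # (time in seconds, package power in watts)


def energy_rectangle(samples: Sequence[Sample]) -> float:
    """Energy in Ws: each reading times the interval since the previous reading."""
    total = 0.0
    for (t_prev, _), (t_now, p_now) in zip(samples, samples[1:]):
        total += p_now * (t_now - t_prev)
    return total


def energy_trapezoid(samples: Sequence[Sample]) -> float:
    """Energy in Ws using the mean of both readings of every interval."""
    total = 0.0
    for (t_prev, p_prev), (t_now, p_now) in zip(samples, samples[1:]):
        total += 0.5 * (p_prev + p_now) * (t_now - t_prev)
    return total


# Three made-up readings: 10 W, 10 W, then 20 W after a longer gap.
demo = [(0.0, 10.0), (1.0, 10.0), (3.0, 20.0)]
print(energy_rectangle(demo), energy_trapezoid(demo))
```

With samples only about 10 ms apart, the difference between the two rules should be small compared to the total energy of a run.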

Hulk · Oct 16, 2021

BorisTheBlade82 said:
Well, basically sampling package power as often as possible and then calculating the integral (Joule or Wattseconds). Because this is what it is about with a fixed workload: How much energy is needed to work it through?

Got it. Calculating the area under the power vs. time function. Smart. You say calculating the integral. Are you finding a function for the curve and then integrating, or using the trapezoidal rule with a definite number of values for "n"? Just wondering because you mentioned the integral. Not sure if you meant calculating the area under the curve numerically or a closed-form solution of the function between two bounds; that's why I'm asking. Just curious.

You could also take the average power during the run and multiply it by the time of the run and get the same result right?

BorisTheBlade82 · Oct 16, 2021

To be precise, this is of course a discretized integral. I am gathering the package power samples and multiplying each one by the amount of time between two samples. Average power only works for very uniform data. With PL1, PL2 and stuff this just does not cut it in order to be accurate.
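
A small Python sketch with made-up numbers shows when the two approaches agree. Multiplying the plain mean of the readings by the total duration gives the same energy as the time-weighted sum as long as the samples are evenly spaced; once the spacing varies, the plain mean weights every reading equally and the results drift apart. The function names and sample values are illustrative only.

```python
from typing import Sequence, Tuple

Sample = Tuple[float, float]  # (time in seconds, package power in watts)


def time_weighted_energy(samples: Sequence[Sample]) -> float:
    """Sum of reading * time since the previous reading (Ws)."""
    return sum(
        p_now * (t_now - t_prev)
        for (t_prev, _), (t_now, p_now) in zip(samples, samples[1:])
    )


def mean_times_duration(samples: Sequence[Sample]) -> float:
    """Plain mean of the same readings multiplied by the total duration (Ws)."""
    readings = [p for _, p in samples[1:]]
    duration = samples[-1][0] - samples[0][0]
    return sum(readings) / len(readings) * duration


# A short boost phase at 60 W followed by 20 W, sampled once per second.
uniform = [(0.0, 60.0), (1.0, 60.0), (2.0, 60.0), (3.0, 20.0), (4.0, 20.0)]
# The same kind of profile, but the last reading arrives after an 8 s gap.
irregular = [(0.0, 60.0), (1.0, 60.0), (2.0, 60.0), (10.0, 20.0)]

for name, data in (("uniform", uniform), ("irregular", irregular)):
    print(name, time_weighted_energy(data), round(mean_times_duration(data), 1))
```

Because the script samples as fast as it can above a 10 ms floor, the intervals are not guaranteed to be equal, which is one reason to weight every reading by its own time span.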

Hulk · Oct 16, 2021

BorisTheBlade82 said:
To be precise, this is of course a discretized integral. I am gathering the package power samples and multiplying each one by the amount of time between two samples. Average power only works for very uniform data. With PL1, PL2 and stuff this just does not cut it in order to be accurate.

Understood. By average power I meant adding the power values sampled and then dividing by the total number of samples. Of course this is ultimately the same thing you are doing though.

Thanks for responding. I'm curious as to the sampling rate?

Finally, as you wrote, with all of the opportunistic frequency manipulation of modern CPUs it is also very difficult to nail down the frequency during a benchmark run. Is the data available to sample the frequency of each core as well? The total of all of these samples, divided by the total number of samples, would provide an average clock speed during the benchmark run. That would be very interesting, as it would allow estimating IPC, or "throughput", for each core architecture and give more insight into the results.

HWiNFO provides a data point like this and they call it "average effective clock."

I'm sorry if I'm being that guy who has to ask someone who is doing something out of the goodness of their heart, donating time and intellect to the community to do additional work!!! Your effort is extremely appreciated!
