Discussion Investigation of Single Thread CPU "Throughput/cycle"

Hulk

Diamond Member
Oct 9, 1999
5,426
4,156
136
Here is my take on determining "throughput/cycle" for the purpose of comparing CPUs and their supporting memory systems. Full explanation in the second post of this thread.

Also, I understand that these benchmarks don't simply test compute. New processors have the advantage not only of better architectures, but also better memory subsystems to feed the cores, as well as new ISA instructions that may be used in the benchmarks. It's all folded in.

Data summary is presented in the first table. "Performance Points" are based on a total possible score of 100. A processor would have to be tops in throughput/GHz in every category to reach that score. Supporting data is in the bottom table, and the comparison of cores is in the table on the right.

Thanks to Igor Kavinski and Thunder 57 for their time in gathering accurate data.

1771252691070.png
 

Hulk

Diamond Member
Oct 9, 1999
5,426
4,156
136
Investigation of “Throughput/Cycle” for Client CPUs

Introduction
In the early days of computing, the computational capability of a microprocessor was often described using IPC (instructions per cycle), which measures how many instructions a processor can complete in a single clock cycle. Early client processors were single-issue, meaning they could issue at most one instruction per cycle. As a result, IPC was capped at 1, and in practice was frequently lower due to pipeline stalls, branch penalties, and memory latency.

This limitation changed with the introduction of superscalar processors, such as the Intel Pentium, which were capable of issuing multiple instructions per cycle under favorable conditions.

An intuitive way to understand this is to imagine an assembly line that, at best, can produce one widget per second. As long as every stage of the line operates perfectly—without errors or interruptions—the throughput remains one widget per second. Any disruption at any stage, however, reduces the effective throughput below this theoretical maximum.

The Pentium represented a major breakthrough because, under certain conditions—specifically when two instructions could be executed simultaneously while remaining in program order—it could achieve more than one instruction per cycle. Extending the analogy, this is similar to adding a second assembly line: even if one of the lines experiences frequent errors, the system can still produce more than one widget per second overall.

The P6 architecture (Pentium Pro) extended this concept further by increasing the number of execution “assembly lines” (making the core wider) and introducing out-of-order execution. This allowed steps in the assembly process to be performed in a different order than originally specified, as long as the final result remained correct. By dynamically reordering work, all assembly lines could be kept more fully occupied, increasing overall throughput.

As processor cores became wider over time and out-of-order execution grew more sophisticated, the concept of IPC alone became increasingly inadequate as a measure of performance. A more useful way to describe compute efficiency is throughput per cycle—that is, how much useful work the CPU, along with its memory subsystem and supporting components, can perform per clock cycle when running real applications.


Purpose
The purpose of this investigation is to compare the single-threaded (ST) throughput per cycle of modern client processors running 64-bit Windows, using standardized benchmark workloads as a proxy for real-world application performance.

Benchmarks
The following four benchmarks were selected to determine a throughput-per-cycle metric for the tested processors. Each benchmark was chosen because it emphasizes different aspects of single-threaded CPU behavior, providing a more representative measure of overall compute efficiency than any single test alone. Of course, no benchmark is perfect, but I have found these to be precise (returning ~the same scores run after run) and ubiquitous, which is nice because we have a "gut" feeling for how they translate to our use cases. The most important aspect of this investigation is getting consistent results.

The benchmarks used in this investigation are:

Geekbench 6.5 (Single-Thread)
Geekbench 6 was selected because it represents a broad mix of real-world workloads, including integer, floating-point, and memory-related tasks. Its short, diverse subtests make it a useful proxy for general application responsiveness and front-end efficiency.

Cinebench 2026 (Single-Core and Single-Thread tests) and/or Cinebench R23
Cinebench 2026 measures sustained single-threaded performance using a ray-tracing workload. It places significant emphasis on floating-point and vector execution and provides insight into a processor’s ability to sustain throughput under continuous computational load. If your processor will only run R23 then just run that one. If it will run both then please submit both scores.

7-Zip 25.01 (x64, Single Thread, run as close to 4.7GHz as possible, explanation for this request below)
The 7-Zip benchmark was included due to its heavy reliance on integer operations, branching behavior, and cache efficiency. When run in 64-bit mode, it minimizes legacy architectural constraints and highlights core execution efficiency in integer-dominated workloads.

7-Zip with a 32MB dictionary size, as used in this test, will "blow through" the L3 cache on most systems and rely heavily on main memory access. As the discrepancy between CPU and main memory frequency grows larger, the CPU will be starved for data and MIPS/GHz will decrease. To keep the CPUs on a relatively level playing field, let's try to keep the CPU frequency for this one around 4.7GHz if possible, so the test stresses compute more than main memory.

CPU-Z
CPU-Z’s benchmark is a FP32 math test using SSE instructions. It does not leverage SSE’s vector math capability with the exception of some 128-bit memory accesses. Most SSE instructions are scalar FP32 adds, multiplies, conversions, or compares. The long average instruction length could mean frontend throughput gets restricted by 16 byte per cycle L1 instruction cache bandwidth on older Intel CPUs. However, that limitation can be mitigated by op caches, and is only an issue if the execution engine can reach high enough IPC for frontend throughput to matter. CPU-Z benchmark has a typical mix of memory accesses. Branches are less common in CPU-Z than in games, compression, and Cinebench 2024. (From Chips and Cheese).


Procedure
For each processor, all benchmarks were run in a 64-bit Windows environment using their respective single-threaded test modes. The operating frequency observed during each benchmark run was recorded, and benchmark scores were normalized by frequency to produce a throughput-per-cycle value. The individual normalized results were then combined to form a composite measure of single-threaded compute efficiency.
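To make the normalization step concrete, here is a minimal Python sketch (the function name and the example numbers are mine and purely illustrative, not taken from the results table):

```python
def throughput_per_ghz(score, observed_ghz):
    """Normalize a raw benchmark score by the core clock observed
    in HWinfo during the run, giving a throughput-per-cycle proxy.

    score: raw benchmark result (e.g. GB6 ST points, 7-Zip GIPS)
    observed_ghz: average core frequency during the run, in GHz
    """
    return score / observed_ghz

# Hypothetical numbers: a GB6 ST score of 3000 achieved at a steady
# 5.0 GHz boost clock works out to 600 points per GHz.
print(throughput_per_ghz(3000, 5.0))  # 600.0
```

The same division applies to every benchmark; only the observed frequency changes from run to run, which is why watching the clock in HWinfo matters so much.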

Both the Single-Thread (ST) and Single-Core (SC) tests in Cinebench were executed to evaluate the effect of simultaneous multithreading (SMT) or hyper-threading (HT), when supported by the processor under test.

Since we are only running single thread benchmarks, the CPU should run very close to its top rated turbo or boost frequency, but some CPUs may run one or two hundred MHz below that rating. Please have HWinfo open so you can watch the core frequency as the benchmark runs. You are looking for the highest current frequency, and it might bounce around from core to core. For example, my HX370 has a boost frequency of 5.1GHz but in these benches averages 4.9 or 5GHz. On the other hand, my 9950X holds a consistent 5.7GHz on all benchmarks.

Geekbench 6.5, Cinebench 2026, Cinebench R23, and HWinfo can be downloaded from their respective sites.

I have zipped up the correct versions of 7-Zip (25.01 x64) and CPUz (2.18.0 64 bit version) and you can download them at the link below. Both are portable and will not install anything on your computer. https://www.dropbox.com/scl/fi/mc5v...ey=o1aef8g5w9ycgnh1sh63pc5sf&st=5pz1npfg&dl=0

Run 7-Zip by double-clicking 7zFM.exe. The benchmark is located under Tools>Benchmark and will start to run as soon as you select it. Change the “Number of CPU Threads” to 1, and make sure the Dictionary size is 32MB and Passes is set to 10. The bench will run 10 times, and you will report the “Total Rating” in GIPS, shown near the bottom right of the dialog.

Run CPUz by double-clicking “CPUz_x64.exe” and then selecting “Bench>Bench CPU.” Report the Single Thread result. I didn’t include the 32 bit version in this zip file so there is no chance of confusion as to which one to run.


A note on “cornering” E cores and benching hybrid CPUs in general
Hybrid cores can be tricky to benchmark. The P cores are relatively straightforward because Windows will generally put a single thread on the most performant core. All you have to do is have a look in HWinfo at the frequency it runs during the test.

For the E cores there are a few techniques you can use to “corner” them. On most desktops you can turn off all of the P cores except one, and then set that one to 900MHz (the lowest generally allowed). This way Windows will look for a more performant core, which will now be an E core.

Igor Kavinski has also been very helpful and clever in how he obtained what I believe are very accurate scores for his Lion Cove and especially Skymont cores, which are very slippery. Here is what he told me he did, and it helped me benchmark the Zen 5c cores in my laptop.

“Use Statuscore. When you tick any core, it boosts to max speed. By clicking each core, you will know which one is the P-core and which one is the E-core from their max boost speeds.

So let's suppose Core 17 is the first Zen5c core. GB6 is the easiest. It obeys process affinity like a good dog. Set affinity to Core 17. In Status Core, turn on data recording and immediately click Run in GB6. When you see the Raytracing test, get ready. As soon as the motion test is done, go to the Statuscore menu and stop the data recording. Now just let the multicore part of GB6 finish so you can get the score. It's still going to test only the affinity core in MT because like I said, GB6 is a good dog.

In the recorded CSV file, go to the E-core section and you will see all the cores have clocks around 100-200 MHz while only the E-core clock will be continuously high. I took the values closest to the boost clock. So if most values were 4800 or so, I disregarded all values below 4800 and took an average of all values 4800 or above.

7-Zip is also easy just like GB6 and you shouldn't have any issue.

CPU-Z is a bit problematic and needs quick reflexes. Choose E-core from the drop down. Set affinity to one E-core only. Then start the test and immediately open the data recording dialog and click it but don't start. Wait for the single core test to start and start recording as soon as you see the ST test begin and stop when it's done. In the CSV, look for when the selected E-core's clocks start at its expected boost clock and take an average from that point till the last point where the boost clock is high and average those.

CB23 is stupid so you have to be quick. Start data recording, start the ST test, immediately set the affinity to the E-core and keep recording till the score appears. Same as before, take an average of the constantly high clocks only. CB26 is a bit better as you get some time while it prepares to run so you can do the above without much pressure.

Setting Max Frequency for non-hybrid CPUs using Windows Power Plan
If you have a non-hybrid CPU then a simple registry tweak will allow you to set a max frequency. Set it to a frequency low enough that no throttling will occur and Windows will hold that frequency. Of course, double-check that with HWinfo.

This file when run will add the needed registry key: https://www.dropbox.com/scl/fi/bigm...ey=jnyrsz7g9bq7nhlb5elx7jwcu&st=kub4nlh6&dl=0

This is where you set a max frequency.
1771031657259.png
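For reference, the same cap can usually also be applied from an elevated Command Prompt with powercfg instead of the registry file. This is a sketch using the built-in PROCFREQMAX setting alias (values are in MHz); verify the result in Power Options and with HWinfo:

```shell
:: Unhide the "Maximum processor frequency" setting (same effect as the registry key)
powercfg -attributes SUB_PROCESSOR PROCFREQMAX -ATTRIB_HIDE
:: Cap the active power plan at 4700 MHz on AC and battery, then re-apply the plan
powercfg -setacvalueindex SCHEME_CURRENT SUB_PROCESSOR PROCFREQMAX 4700
powercfg -setdcvalueindex SCHEME_CURRENT SUB_PROCESSOR PROCFREQMAX 4700
powercfg -setactive SCHEME_CURRENT
```

Setting the value back to 0 removes the cap.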



Results
The benchmark results were compiled into a table for comparison. To enable cross-architecture evaluation, all benchmark scores were normalized relative to the Intel Skylake core, which was used as the baseline reference. Each benchmark score was first converted to a throughput-per-cycle value and then expressed as a ratio relative to the Skylake baseline.

The normalized results were combined into a single composite score using a weighted geometric mean. The weighting applied to each benchmark was as follows:

  • Geekbench 6.5 – 40%
  • Cinebench 2026 and/or R23 – 30%
  • 7-Zip – 20%
  • CPU-Z – 10%
These weightings were selected to balance general-purpose application behavior, sustained floating-point performance, and integer-focused execution characteristics while preventing any single benchmark from disproportionately influencing the final composite score.
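For anyone who wants to reproduce the composite, the weighted geometric mean with those weights can be sketched in a few lines of Python (the benchmark keys and the 1.5 ratio in the example are hypothetical):

```python
import math

# Weights from the methodology above: GB6 40%, Cinebench 30%,
# 7-Zip 20%, CPU-Z 10%.
WEIGHTS = {"gb6": 0.40, "cinebench": 0.30, "7zip": 0.20, "cpuz": 0.10}

def composite(normalized):
    """Weighted geometric mean of per-benchmark throughput/GHz ratios.

    normalized maps benchmark name -> score relative to the baseline
    (1.0 = baseline-level throughput per cycle).
    """
    return math.exp(sum(w * math.log(normalized[b])
                        for b, w in WEIGHTS.items()))

# A core that is 50% faster than baseline in every test scores 1.5.
print(composite({"gb6": 1.5, "cinebench": 1.5, "7zip": 1.5, "cpuz": 1.5}))  # ≈ 1.5
```

Because it is a geometric mean, a single outlier benchmark moves the composite only in proportion to its weight, which is exactly the "no single benchmark dominates" behavior described above.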
 

Markfw

Moderator Emeritus, Elite Member
May 16, 2002
27,411
16,270
136
Well, I know that was a lot of work, but multicore is almost more important, and power usage as well. Think about adding those, and maybe the 285 also?
 

Hulk

Diamond Member
Oct 9, 1999
5,426
4,156
136
Well, I know that was a lot of work, but multicore is almost more important, and power usage as well. Think about adding those, and maybe the 285 also?
I know you are big in DC so that is important to you. But for most people the first 8 or 12 cores are the ones that really matter. The 285K has the same type of cores as the 245K. I'm keeping the scope of this limited to ST work rate.
 

Abwx

Lifer
Apr 2, 2011
12,035
5,005
136
I would discard CPU-Z. I mean, a so-called bench that shows only a 1% IPC difference between Zen 3 and Zen 4 is proof enough that it's a fishy bench, one that looks like a lookup table more than anything else. Indeed, it has Zen 3 barely matching Skylake, while in any other bench the difference is huge.
 

Markfw

Moderator Emeritus, Elite Member
May 16, 2002
27,411
16,270
136
I know you are big in DC so that is important to you. But for most people the first 8 or 12 cores are the ones that really matter. The 285K has the same type of cores as the 245K. I'm keeping the scope of this limited to ST work rate.
But that bench is for ONE core. Even 8-12 core benches included would be great!
 

zir_blazer

Golden Member
Jun 6, 2013
1,266
586
136
One of the reasons why I'm not that fond of this kind of testing is because it can't account for changes like memory subsystem. Zen 3 to Zen 4 includes the DDR4 to DDR5 jump, which for me, made Zen 4 a much smaller jump than it appears on this kind of test because you're supposedly testing with faster memory, so Core architecture IPC increase is LOWER than advertised. Zen 4 vs Zen 5 are far easier to directly compare because, well, you can.

LGA 1700 is perhaps the most interesting recent platform because you can isolate how certain changes may affect IPC, like:
Cache L2 size (Alder Lake vs Raptor Lake)
DDR4 vs DDR5 (With different Motherboard)
...which perhaps can help to extrapolate IPC on other platforms.

Otherwise you have AM4, which covers a whole range of CPUs/APUs Zen, Zen+, Zen 2, Zen 3 (The latter two due to the offdie memory controller again, makes it easier to compare between them but harder to the earlier two. Albeit for Zen vs Zen+ vs Zen 2 vs Zen 3 it may make more sense to compare Raven Ridge, Picasso, Renoir and Cezanne than the CPUs-only Ryzens, precisely, because all 4 are monolithic, so perhaps you can kind of extrapolate offdie memory controller impact).
 

Hulk

Diamond Member
Oct 9, 1999
5,426
4,156
136
One of the reasons why I'm not that fond of this kind of testing is because it can't account for changes like memory subsystem. Zen 3 to Zen 4 includes the DDR4 to DDR5 jump, which for me, made Zen 4 a much smaller jump than it appears on this kind of test because you're supposedly testing with faster memory, so Core architecture IPC increase is LOWER than advertised. Zen 4 vs Zen 5 are far easier to directly compare because, well, you can.

LGA 1700 is perhaps the most interesting recent platform because you can isolate how certain changes may affect IPC, like:
Cache L2 size (Alder Lake vs Raptor Lake)
DDR4 vs DDR5 (With different Motherboard)
...which perhaps can help to extrapolate IPC on other platforms.

Otherwise you have AM4, which covers a whole range of CPUs/APUs Zen, Zen+, Zen 2, Zen 3 (The latter two due to the offdie memory controller again, makes it easier to compare between them but harder to the earlier two. Albeit for Zen vs Zen+ vs Zen 2 vs Zen 3 it may make more sense to compare Raven Ridge, Picasso, Renoir and Cezanne than the CPUs-only Ryzens, precisely, because all 4 are monolithic, so perhaps you can kind of extrapolate offdie memory controller impact).
Yes, as I wrote in the first post.
"Also I understand that these benchmarks don't simply test compute. New processors have the advantage not only of better architectures, but also better memory subsystems to feed the cores, as well as new ISA instructions that may be used in the benchmarks. It's all folded in."

I am making no attempt to isolate the cores from cache, main memory, new ISA, etc. Also, CPUs are generally designed not to have a giant bottleneck somewhere, because the engineers know that putting a super great CPU in a system where it is starved for bandwidth is not economical. Notice I wrote "generally."

This investigation is simply total throughput rate for some common benchmarks very accurately tested and recorded. That's it.

Lots of good ideas here and I suggest someone start another thread and incorporate the ideas suggested here! We only have about 5 active threads in here so more content would be great.

Another issue is when people have a dog in the race and this benchmark or that benchmark doesn't represent their dog fairly. That's a valid complaint, but these benches are ubiquitous and often "leaked," so it's nice to really know how they compare among systems.
 

MS_AT

Senior member
Jul 15, 2024
948
1,878
96
CPU-Z was used as a lightweight, integer-focused benchmark that is particularly sensitive to instruction scheduling, front-end throughput, and branch prediction.
Not to nitpick, but let's link the C&C article again:

https://chipsandcheese.com/p/cpu-zs-inadequate-benchmark?utm_source=publication-search

Quotes:

CPU-Z’s benchmark is a FP32 math test using SSE instructions. It does not leverage SSE’s vector math capability with the exception of some 128-bit memory accesses. Most SSE instructions are scalar FP32 adds, multiplies, conversions, or compares.
So no, it's not integer focused.

Compared to Cinebench and gaming workloads, CPU-Z has fewer branches, which are also easier to predict. (...) CPU-Z has fewer unique branches than Cinebench 2024 or games. Even Goldmont Plus has no problem tracking CPU-Z’s branches, even though its BTB is the smallest of the bunch.
It's also not branch heavy.

A CPU’s frontend is responsible for bringing instructions into the core. It’s not seriously challenged by CPU-Z’s benchmark, but it’s nice to understand why
The bit about frontend does not seem to hold either.

Bonus:
I can’t think of anything that fits within the L1 cache and barely challenges the branch predictor. CPU-Z’s benchmark is an exception. The factors that limit performance in CPU-Z are very different from those in typical real-life workloads. (...) Thus, CPU-Z’s benchmark ends up being useless to both CPU designers and end users.
 

Doug S

Diamond Member
Feb 8, 2020
3,860
6,830
136
So a week ago I said this to @Hulk in response to his complaints about what he saw as shortcomings of benchmarking to produce information he wants to see:

So why are you complaining about the lack of benchmarks that meet your criteria? If you have 30 years of experience in this, know exactly what you want, and believe you know how to get it, why not create this great benchmark yourself?

And well, he did. Not saying I had anything to do with it, but still I'm impressed because too often when people complain about what they see as shortcomings in the way things are currently done, they don't want to step up and do something to address it because that's a lot more work than complaining.

Kudos!
 

poke01

Diamond Member
Mar 8, 2022
4,913
6,257
106
No Apple Silicon?
There’s no RISC-V or Qualcomm either, and that’s fine. Hulk prefers x86 CPUs running Windows, so that’s his focus, and he states it in his OP. Although adding [x86] to the thread title would make it clearer.
 

Thunder 57

Diamond Member
Aug 19, 2007
4,302
7,113
136
I'm sure Apple and others would be welcome. As you can see, though, Hulk is having a difficult enough time getting volunteers for x86.
 

Hulk

Diamond Member
Oct 9, 1999
5,426
4,156
136
No Apple Silicon?
Sorry, this investigation is limited to x86 architecture and ST throughput/cycle. As Thunder 57 wrote, I'm having a hard enough time getting x86 volunteers! I'm okay with this actually. This is a long-term investigation. I'd rather build the results slowly, carefully, and accurately than have a flood of nutty results.

I think the results so far are very interesting and I would "hang my hat" on them.

Lion Cove ~ Zen 5
Lion Cove>Raptor Cove, +7.5%
Skymont>Gracemont (Alder Lake), +35%
Golden Cove>Skymont, +5%
Skylake>Gracemont (Alder Lake), +5.5%

So at the end of the day Lion Cove and Zen 5 are quite comparable and trade blows in different applications.
Lion Cove is a relatively small but significant upgrade from Raptor Cove (except for gaming).
Alder Lake Gracemont is not as performant as Skylake, trailing it by about 5.5%.
Skymont is not as performant as Golden Cove, again about 5% behind, similar to the Gracemont-to-Skylake gap.
Skymont is a massive leap from Gracemont, showing +35% average increase/cycle in work.

These are numbers we kind of knew in our "gut" but couldn't put good evidence behind. These results were recorded very carefully and consistently as I was in contact with the benchmarkers and we made sure we performed the tests the exact same way.

Now you can argue with my weighting and choice of benchmarks, of course, but this is just one limited investigation, so take from it what you will. I think the results so far are very honest, without the crazy outliers we often see in such tables.

If anyone would like to test their system or systems I would be happy to help. If you have a non-hybrid desktop it's really easy: just do the registry tweak to enable max frequency in the Power Plan, set it to 4.7GHz for 7-Zip and 5.2GHz for the rest of the tests, and run them. It's really that easy to obtain solid, repeatable scores run at a known frequency.

Zen, Zen 2, Zen 4, Cypress Cove, Raptor Cove, Raptor Lake Gracemont... let's see how they stack up!
 

Thunder 57

Diamond Member
Aug 19, 2007
4,302
7,113
136
I'd like to see Zen or Zen+ if anyone is willing to contribute. I could borrow a 3500u but I would rather not and using a desktop is preferable.
 

gai

Junior Member
Nov 17, 2020
13
38
91
Skylake processors are shown as variably 60% to 87% higher IPC than...Skylake. The baseline measurement does not appear to be correct.

Measuring performance-per-clock across different processors includes an implicit dichotomy in the motivation for the measurement:
1) Do you want to compare performance-per-clock when operating at peak performance (generally peak frequency)?
2) Do you want to compare performance-per-clock at a normalized clock frequency?

The test suite includes both peak frequency (3/4 of tests) and normalized frequency (7-Zip). Mixing these together makes it difficult to understand the objective.

In the former case, what the user can compare is the time to complete a (hopefully fixed) unit of work. The classic equation from computer architecture textbooks arises: Task length = Cycle Time × Cycles/Instruction × [Dynamic] Instruction Count. Memory latency is only partly dependent on cycle time, so task length is not perfectly linear in cycle time, though it is often close to linear.
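To make that equation concrete, here is a quick sketch with made-up numbers (a 5 GHz clock, a CPI of 0.5, i.e. 2 IPC, and 10 billion dynamic instructions):

```python
# Task length = Cycle Time x Cycles/Instruction x Instruction Count
cycle_time_s = 1 / 5.0e9   # 5 GHz clock -> 0.2 ns per cycle
cpi = 0.5                  # 2 instructions retired per cycle on average
instructions = 10e9        # dynamic instruction count
task_seconds = cycle_time_s * cpi * instructions
print(task_seconds)  # ≈ 1.0 second
```

Halving either the cycle time or the CPI halves the task length, which is why manufacturers improve both in tandem rather than chasing one in isolation.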

In the latter case, there is more room to understand which microarchitecture is more "advanced" in the sense that it completes more work per cycle. The final results for a user are task length and energy consumption, so this metric is a nice bit of curiosity, but is not especially useful outside of processor design itself. It is no accident that all processor manufacturers continue to improve both cycle time and work per cycle in tandem.

When comparing different processors, the same "unit of work" may involve executing more or fewer instructions, so the user-visible measurement is more of a "Performance/Clock" measurement than it is an "Instructions/Clock" measurement. IPC certainly is easier to fit into a table header, though.
 

Hulk

Diamond Member
Oct 9, 1999
5,426
4,156
136
Skylake processors are shown as variably 60% to 87% higher IPC than...Skylake. The baseline measurement does not appear to be correct.

Measuring performance-per-clock across different processors includes an implicit dichotomy in the motivation for the measurement:
1) Do you want to compare performance-per-clock when operating at peak performance (generally peak frequency)?
2) Do you want to compare performance-per-clock at a normalized clock frequency?

The test suite includes both peak frequency (3/4 of tests) and normalized frequency (7-Zip). Mixing these together makes it difficult to understand the objective.

In the former case, what the user can compare is the time to complete a (hopefully fixed) unit of work. The classic equation from computer architecture textbooks arises: Task length = Cycle Time × Cycles/Instruction × [Dynamic] Instruction Count. Memory latency is only partly dependent on cycle time, so task length is not perfectly linear in cycle time, though it is often close to linear.

In the latter case, there is more room to understand which microarchitecture is more "advanced" in the sense that it completes more work per cycle. The final results for a user are task length and energy consumption, so this metric is a nice bit of curiosity, but is not especially useful outside of processor design itself. It is no accident that all processor manufacturers continue to improve both cycle time and work per cycle in tandem.

When comparing different processors, the same "unit of work" may involve executing more or fewer instructions, so the user-visible measurement is more of a "Performance/Clock" measurement than it is an "Instructions/Clock" measurement. IPC certainly is easier to fit into a table header, though.
Thanks for taking the time to analyze and post.

I changed the comparison of performance to a maximum performance score of 100. In order to score 100, a processor would have to be the best in all four categories. This method is better than using Skylake as a baseline because it's easier to make comparisons between the various CPUs in the table.

Yes, this is not IPC, as I noted a few times in both the first post and the second. This is throughput/cycle, or work/cycle, or work rate. It takes more into account than simply architecture, such as cache and memory subsystem, ISA, etc.

The results are accurate and align very well with what we've seen and heard regarding the processors. Having compiled the table myself and worked hand-in-hand with the contributors I have much confidence in the results.

But, at the end of the day all I can tell you is how the data was collected and the numbers. You can take it from there.
 