Investigation of “Throughput/Cycle” for Client CPUs
Introduction
In the early days of computing, the computational capability of a microprocessor was often described using IPC (instructions per cycle), which measures how many instructions a processor can complete in a single clock cycle. Early client processors were single-issue, meaning they could issue at most one instruction per cycle. As a result, IPC was capped at 1, and in practice was frequently lower due to pipeline stalls, branch penalties, and memory latency.
This limitation changed with the introduction of superscalar processors, such as the Intel Pentium, which were capable of issuing multiple instructions per cycle under favorable conditions. An intuitive way to understand this is to imagine an assembly line that, at best, can produce one widget per second. As long as every stage of the line operates perfectly—without errors or interruptions—the throughput remains one widget per second. Any disruption at any stage, however, reduces the effective throughput below this theoretical maximum.
The Pentium represented a major breakthrough because, under certain conditions—specifically when two instructions could be executed simultaneously while remaining in program order—it could achieve more than one instruction per cycle. Extending the analogy, this is similar to adding a second assembly line: even if one of the lines experiences frequent errors, the system can still produce more than one widget per second overall.
The P6 architecture (Pentium Pro) extended this concept further by increasing the number of execution “assembly lines” (making the core wider) and introducing out-of-order execution. This allowed steps in the assembly process to be performed in a different order than originally specified, as long as the final result remained correct. By dynamically reordering work, all assembly lines could be kept more fully occupied, increasing overall throughput.
As processor cores became wider over time and out-of-order execution grew more sophisticated, the concept of IPC alone became increasingly inadequate as a measure of performance. A more useful way to describe compute efficiency is throughput per cycle—that is, how much useful work the CPU, along with its memory subsystem and supporting components, can perform per clock cycle when running real applications.
Purpose
The purpose of this investigation is to compare the single-threaded (ST) throughput per cycle of modern client processors running 64-bit Windows, using standardized benchmark workloads as a proxy for real-world application performance.
Benchmarks
The following four benchmarks were selected to determine a throughput-per-cycle metric for the tested processors. Each benchmark was chosen because it emphasizes different aspects of single-threaded CPU behavior, providing a more representative measure of overall compute efficiency than any single test alone. Of course, no benchmark is perfect, but I have found these benchmarks to be precise (they return ~the same scores run after run) and ubiquitous, which is nice because we have a “gut” feeling about how they translate to our use cases. The most important aspect of this investigation is getting consistent results.
The benchmarks used in this investigation are:
Geekbench 6.5 (Single-Thread)
Geekbench 6 was selected because it represents a broad mix of real-world workloads, including integer, floating-point, and memory-related tasks. Its short, diverse subtests make it a useful proxy for general application responsiveness and front-end efficiency.
Cinebench 2026 (Single-Core and Single-Thread tests) and/or Cinebench R23
Cinebench 2026 measures sustained single-threaded performance using a ray-tracing workload. It places significant emphasis on floating-point and vector execution and provides insight into a processor’s ability to sustain throughput under continuous computational load. If your processor will only run R23 then just run that one. If it will run both then please submit both scores.
7-Zip 25.01 (x64, Single Thread, run as close to 4.7GHz as possible, explanation for this request below)
The 7-Zip benchmark was included due to its heavy reliance on integer operations, branching behavior, and cache efficiency. When run in 64-bit mode, it minimizes legacy architectural constraints and highlights core execution efficiency in integer-dominated workloads.
7-Zip with a 32MB dictionary, as used in this test, will "blow through" the L3 cache on most systems and rely heavily on main memory access. As the discrepancy between CPU and main-memory frequency grows larger, the CPU will be starved for data and MIPS/GHz will decrease. To keep the CPUs on a relatively level playing field, let's try to keep the CPU frequency for this one around 4.7GHz if possible, so the test stays more of a compute stressor and less of a main-memory stressor.
CPU-Z
CPU-Z’s benchmark is a FP32 math test using SSE instructions. It does not leverage SSE’s vector math capability with the exception of some 128-bit memory accesses. Most SSE instructions are scalar FP32 adds, multiplies, conversions, or compares. The long average instruction length could mean frontend throughput gets restricted by 16 byte per cycle L1 instruction cache bandwidth on older Intel CPUs. However, that limitation can be mitigated by op caches, and is only an issue if the execution engine can reach high enough IPC for frontend throughput to matter. CPU-Z benchmark has a typical mix of memory accesses. Branches are less common in CPU-Z than in games, compression, and Cinebench 2024. (From Chips and Cheese).
Procedure
For each processor, all benchmarks were run in a 64-bit Windows environment using their respective single-threaded test modes. The operating frequency observed during each benchmark run was recorded, and benchmark scores were normalized by frequency to produce a throughput-per-cycle value. The individual normalized results were then combined to form a composite measure of single-threaded compute efficiency.
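The frequency normalization step can be sketched in a few lines of Python; every score and frequency below is a made-up placeholder, not a measured result:

```python
# Divide each benchmark score by the frequency observed during its run,
# yielding a throughput-per-cycle figure (score per GHz).
# All values below are hypothetical placeholders, not measurements.
runs = {
    "geekbench6_st": {"score": 2900.0, "freq_ghz": 5.0},
    "cinebench_st":  {"score": 120.0,  "freq_ghz": 5.0},
    "7zip_st_gips":  {"score": 6.5,    "freq_ghz": 4.7},
    "cpuz_st":       {"score": 800.0,  "freq_ghz": 5.0},
}

per_cycle = {name: r["score"] / r["freq_ghz"] for name, r in runs.items()}

for name, value in per_cycle.items():
    print(f"{name}: {value:.2f} per GHz")
```

Because each benchmark uses its own score units, these per-GHz numbers are only comparable across CPUs within the same benchmark, which is why the Results section expresses them as ratios against a baseline core.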
Both the Single-Thread (ST) and Single-Core (SC) tests in Cinebench were executed to evaluate the effect of simultaneous multithreading (SMT) or hyper-threading (HT), when supported by the processor under test.
Since we are only running single-thread benchmarks, the CPU should run very close to the top turbo or boost rated frequency for your CPU, but some of them may run one or two hundred MHz below that rating. Please have HWinfo open so you can watch the core frequency as the benchmark runs. You are looking for the highest current frequency, and it might bounce around from core to core. For example, my HX370 has a boost frequency of 5.1GHz but averages 4.9 or 5.0GHz in these benches. On the other hand, my 9950X holds a consistent 5.7GHz on all benchmarks.
Geekbench 6.5, Cinebench 2026, Cinebench R23, and HWinfo can be downloaded from their respective sites.
I have zipped up the correct versions of 7-Zip (25.01 x64) and CPUz (2.18.0 64 bit version) and you can download them at the link below. Both are portable and will not install anything on your computer.
https://www.dropbox.com/scl/fi/mc5v...ey=o1aef8g5w9ycgnh1sh63pc5sf&st=5pz1npfg&dl=0
Run 7-Zip by double-clicking 7zFM.exe. The benchmark is located under Tools>Benchmark and it will start to run as soon as you select it. Change the “Number of CPU Threads” to 1 and make sure the Dictionary size is 32MB and Passes is set to 10. The bench will run 10 times, and you will report the “Total Rating” in GIPS; it is shown near the bottom right of the dialog.
Run CPUz by double-clicking “CPUz_x64.exe” and then selecting “Bench>Bench CPU.” Report the Single Thread result. I didn’t include the 32 bit version in this zip file so there is no chance of confusion as to which one to run.
A note on “cornering” E cores and benching hybrid CPUs in general
Hybrid cores can be tricky to benchmark. The P cores are relatively straightforward because Windows will generally put a single thread on the most performant core. All you have to do is have a look in HWinfo at the frequency at which it runs the test.
For the E cores there are a few techniques you can use to “corner” them. On most desktops you can turn off all of the P cores except one, and then set that one to 900MHz (the lowest generally allowed). This way Windows will go for a more performant core, which will now be an E core.
Igor Kavinsky has also been very helpful and clever in how he obtained what I believe are very accurate scores of his Lion Cove and especially Skymont cores, which are very slippery. Here is what he told me he did and it helped me to benchmark the Zen 5c cores in my laptop.
“Use Statuscore. When you tick any core, it boosts to max speed. By clicking each core, you will know which one is the P-core and which one is the E-core from their max boost speeds.
So let's suppose Core 17 is the first Zen5c core. GB6 is the easiest. It obeys process affinity like a good dog. Set affinity to Core 17. In Status Core, turn on data recording and immediately click Run in GB6. When you see the Raytracing test, get ready. As soon as the motion test is done, go to the Statuscore menu and stop the data recording. Now just let the multicore part of GB6 finish so you can get the score. It's still going to test only the affinity core in MT because like I said, GB6 is a good dog.
In the recorded CSV file, go to the E-core section and you will see all the cores have clocks around 100-200 MHz while only the E-core clock will be continuously high. I took the values closest to the boost clock. So if most values were 4800 or so, I disregarded all values below 4800 and took an average of all values 4800 or above.
7-Zip is also easy just like GB6 and you shouldn't have any issue.
CPU-Z is a bit problematic and needs quick reflexes. Choose E-core from the drop down. Set affinity to one E-core only. Then start the test and immediately open the data recording dialog and click it but don't start. Wait for the single core test to start and start recording as soon as you see the ST test begin and stop when it's done. In the CSV, look for when the selected E-core's clocks start at its expected boost clock and take an average from that point till the last point where the boost clock is high and average those.
CB23 is stupid so you have to be quick. Start data recording, start the ST test, immediately set the affinity to the E-core and keep recording till the score appears. Same as before, take an average of the constantly high clocks only. CB26 is a bit better as you get some time while it prepares to run so you can do the above without much pressure.
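The clock-averaging step described above (discard the idle samples, keep only readings at or above the boost clock) can be sketched as a small Python helper; the sample values and the 4800 MHz cutoff are hypothetical:

```python
def boosted_average(clock_samples_mhz, cutoff_mhz):
    # Keep only samples at or above the cutoff, discarding the
    # idle/parked readings (the ~100-200 MHz values in the CSV).
    boosted = [c for c in clock_samples_mhz if c >= cutoff_mhz]
    if not boosted:
        raise ValueError("no samples at or above the cutoff")
    return sum(boosted) / len(boosted)

# Hypothetical Statuscore samples for one E core, in MHz
samples = [148, 201, 4805, 4812, 4798, 4809, 4801, 156]
print(boosted_average(samples, 4800))  # averages the four samples >= 4800
```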
Setting Max Frequency for non-hybrid CPUs using Window Power Plan
If you have a non-hybrid CPU then a simple registry tweak will allow you to set a max frequency. Set it low enough that no throttling will occur, and Windows will hold that frequency. Of course, double-check that with HWinfo.
When run, this file will add the needed registry key:
https://www.dropbox.com/scl/fi/bigm...ey=jnyrsz7g9bq7nhlb5elx7jwcu&st=kub4nlh6&dl=0
Once the key is added, the “Maximum processor frequency” option appears under Processor power management in the advanced settings of your Windows power plan. This is where you set a max frequency.
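As an alternative to the registry file, the same hidden setting can usually be unhidden and set from an elevated Command Prompt with powercfg. This is a sketch, not a tested recipe for every Windows build; the 4700 MHz cap is just an example value, and you should verify the resulting frequency in HWinfo as described above.

```shell
:: Unhide the "Maximum processor frequency" setting in the power plan UI
powercfg -attributes SUB_PROCESSOR PROCFREQMAX -ATTRIB_HIDE

:: Cap the current plan at 4700 MHz on both AC and DC, then re-apply it
powercfg /setacvalueindex SCHEME_CURRENT SUB_PROCESSOR PROCFREQMAX 4700
powercfg /setdcvalueindex SCHEME_CURRENT SUB_PROCESSOR PROCFREQMAX 4700
powercfg /setactive SCHEME_CURRENT
```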
Results
The benchmark results were compiled into a table for comparison. To enable cross-architecture evaluation, all benchmark scores were normalized relative to the Intel Skylake core, which was used as the baseline reference. Each benchmark score was first converted to a throughput-per-cycle value and then expressed as a ratio relative to the Skylake baseline.
The normalized results were combined into a single composite score using a weighted geometric mean. The weighting applied to each benchmark was as follows:
- Geekbench 6.5 – 40%
- Cinebench 2026 and/or R23 – 30%
- 7-Zip – 20%
- CPU-Z – 10%
These weightings were selected to balance general-purpose application behavior, sustained floating-point performance, and integer-focused execution characteristics while preventing any single benchmark from disproportionately influencing the final composite score.
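The composite calculation itself is small enough to show; the per-cycle ratios below are made-up placeholders (1.00 would mean the same throughput per cycle as the Skylake baseline), while the weights are the ones listed above:

```python
import math

# Weights from the list above; they sum to 1.0.
weights = {"gb6": 0.40, "cinebench": 0.30, "7zip": 0.20, "cpuz": 0.10}

# Hypothetical throughput-per-cycle ratios vs. the Skylake baseline.
ratios = {"gb6": 1.45, "cinebench": 1.38, "7zip": 1.20, "cpuz": 1.50}

# Weighted geometric mean: exp(sum_i w_i * ln(r_i)). Unlike a weighted
# arithmetic mean, no single benchmark can drag the composite around
# linearly, which matches the intent described above.
composite = math.exp(sum(w * math.log(ratios[k]) for k, w in weights.items()))
print(f"Composite vs. Skylake: {composite:.3f}")
```

With these placeholder ratios the composite comes out to roughly 1.38, i.e. about 38% more throughput per cycle than the Skylake baseline.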