Discussion Investigation of Single Thread CPU "Thoughput/cycle"

Hulk · Wednesday at 6:27 PM

Here is my take on determining "throughput/cycle" for the purpose of comparing CPUs and their supporting memory systems.

The baseline CPU here is my desktop Skylake in blue font. More performant processors are above it and less performant ones below it. Please forgive the use of "IPC." It's shorter to fit in the chart than "Throughput/Cycle." For this thread Throughput/Cycle = IPC.

Also I understand that these benchmarks don't simply test compute. New processors have the advantage not only of better architectures, but also better memory subsystems to feed the cores, as well as new ISA instructions that may be used in the benchmarks. It's all folded in.

Igor Kavinsky has done a lot of good work here. He is very clever in how he's isolated the P's and E's in Alder Lake and Arrow Lake. Thanks to Thunder 57 for submitting his Zen 3 scores. It's takes quite a bit of effort to gets these recorded accurately and I really appreciate the work helping me with this project.

While no benchmark or benchmarks are a perfect representation of real world performance, if you look at these scores they line up quite well with what we "knew" about these CPU's but couldn't quite put firm numbers on.

The percentages are important but even informative because they are all computed the same way using the same benchmarks.

If anyone has a core not in the table and would like to help out just reach out to me and I can help get you going if you need some help with this.

Skymont is about 37-37% better than Gracemont and about 6 or 7% below Golden Cove.
Lion Cove is about 1% better than Zen 5.
Gracemont is about 5 or 6% below Skylake.

Update 2/13 - Added Thunder 57's Zen 3 results.
Update 2/14 - Igor Kavinsky added Golden Cove and Gracemont.

Hulk · Wednesday at 6:27 PM

Investigation of “Throughput/Cycle” for Client CPUs

Introduction

In the early days of computing, the computational capability of a microprocessor was often described using IPC (instructions per cycle), which measures how many instructions a processor can complete in a single clock cycle. Early client processors were single-issue, meaning they could issue at most one instruction per cycle. As a result, IPC was capped at 1, and in practice was frequently lower due to pipeline stalls, branch penalties, and memory latency.

This limitation changed with the introduction of superscalar processors, such as the Intel Pentium, which were capable of issuing multiple instructions per cycle under favorable conditions.

An intuitive way to understand this is to imagine an assembly line that, at best, can produce one widget per second. As long as every stage of the line operates perfectly—without errors or interruptions—the throughput remains one widget per second. Any disruption at any stage, however, reduces the effective throughput below this theoretical maximum.

The Pentium represented a major breakthrough because, under certain conditions—specifically when two instructions could be executed simultaneously while remaining in program order—it could achieve more than one instruction per cycle. Extending the analogy, this is similar to adding a second assembly line: even if one of the lines experiences frequent errors, the system can still produce more than one widget per second overall.

The P6 architecture (Pentium Pro) extended this concept further by increasing the number of execution “assembly lines” (making the core wider) and introducing out-of-order execution. This allowed steps in the assembly process to be performed in a different order than originally specified, as long as the final result remained correct. By dynamically reordering work, all assembly lines could be kept more fully occupied, increasing overall throughput.

As processor cores became wider over time and out-of-order execution grew more sophisticated, the concept of IPC alone became increasingly inadequate as a measure of performance. A more useful way to describe compute efficiency is throughput per cycle—that is, how much useful work the CPU, along with its memory subsystem and supporting components, can perform per clock cycle when running real applications.

Purpose

The purpose of this investigation is to compare the single-threaded (ST) throughput per cycle of modern client processors running 64-bit Windows, using standardized benchmark workloads as a proxy for real-world application performance.

Benchmarks

The following four benchmarks were selected to determine a throughput-per-cycle metric for the tested processors. Each benchmark was chosen because it emphasizes different aspects of single-threaded CPU behavior, providing a more representative measure of overall compute efficiency than any single test alone. Of course, no benchmark is perfect but I have found these benchmarks to be precise (return ~same scores run-after-run) and ubiquitous, which is nice because we have a “gut” feeling about how they translate to our use cases. The most important aspect of this investigation is getting consistent results.

The benchmarks used in this investigation are:

Geekbench 6.5 (Single-Thread)
Geekbench 6 was selected because it represents a broad mix of real-world workloads, including integer, floating-point, and memory-related tasks. Its short, diverse subtests make it a useful proxy for general application responsiveness and front-end efficiency.

Cinebench 2026 (Single-Core and Single-Thread tests) and/or Cinebench R23
Cinebench 2026 measures sustained single-threaded performance using a ray-tracing workload. It places significant emphasis on floating-point and vector execution and provides insight into a processor’s ability to sustain throughput under continuous computational load. If your processor will only run R23 then just run that one. If it will run both then please submit both scores.

7-Zip 25.01 (x64, Single Thread, run as close to 4.7GHz as possible, explanation for this request below)
The 7-Zip benchmark was included due to its heavy reliance on integer operations, branching behavior, and cache efficiency. When run in 64-bit mode, it minimizes legacy architectural constraints and highlights core execution efficiency in integer-dominated workloads.

7-Zip using a 32MB dictionary size as in this test will "blow through" the L3 cache for most systems and rely heavily on main memory access. As the CPU and main memory frequency discrepency grow larger and larger the CPU will be starved for data and MIPS/GHz will decrease. To keep the CPUs on a relatively level playing field let's try to keep the CPU frequency for this one around 4.7GHz is possible to keep this test more compute and less main memory stressor.

CPU-Z
CPU-Z’s benchmark is a FP32 math test using SSE instructions. It does not leverage SSE’s vector math capability with the exception of some 128-bit memory accesses. Most SSE instructions are scalar FP32 adds, multiplies, conversions, or compares. The long average instruction length could mean frontend throughput gets restricted by 16 byte per cycle L1 instruction cache bandwidth on older Intel CPUs. However, that limitation can be mitigated by op caches, and is only an issue if the execution engine can reach high enough IPC for frontend throughput to matter. CPU-Z benchmark has a typical mix of memory accesses. Branches are less common in CPU-Z than in games, compression, and Cinebench 2024. (From Chips and Cheese).

Procedure

For each processor, all benchmarks were run in a 64-bit Windows environment using their respective single-threaded test modes. The operating frequency observed during each benchmark run was recorded, and benchmark scores were normalized by frequency to produce a throughput-per-cycle value. The individual normalized results were then combined to form a composite measure of single-threaded compute efficiency.

Both the Single-Thread (ST) and Single-Core (SC) tests in Cinebench were executed to evaluate the effect of simultaneous multithreading (SMT) or hyper-threading (HT), when supported by the processor under test.

Since we are only running single thread benchmarks, the CPU should run very close to the top turbo or boost rated frequency for your CPU, but of them may run one or two hundred MHZ below that rating. Please have HWinfo open so you can watch the core frequency as the benchmark runs. You are looking for the highest current frequency and it might bounce around from core-to-core. For example, my HX370 has a boost frequency of 5.1GHz but in these benches average 4.9 or 5GHz. On the other hand, my 9950X holds a consistent 5.7GHz on all benchmarks.

Geekbench 6.5, Cinebench 2026, Cinebench R23, and HWinfo can be downloaded from their respective sites.

I have zipped up the correct versions of 7-Zip (25.01 x64) and CPUz (2.18.0 64 bit version) and you can download them at the link below. Both are portable and will not install anything on your computer. https://www.dropbox.com/scl/fi/mc5v...ey=o1aef8g5w9ycgnh1sh63pc5sf&st=5pz1npfg&dl=0

Run 7-Zip by double-clicking 7zFM.exe. The benchmark is located under Tools>Benchmark and it will start to run as soon as you select it. Change the “Number of CPU Threads” to 1 and make sure the Dictionary size is 32MB and Passes is set to 10. The bench will run 10 times and you will report the “Total Rating” in GIPS this is shown near the bottom right of the dialog.

Run CPUz by double-clicking “CPUz_x64.exe” and then selecting “Bench>Bench CPU.” Report the Single Thread result. I didn’t include the 32 bit version in this zip file so there is no chance of confusion as to which one to run.

A note on “cornering” E cores and benching hybrid CPU’s in general

Hybrid cores can be tricky to benchmark. The P cores are relatively straightforward because Windows will generally put a single thread on the most performant core. All you have to do is have a look in HWinfo at the frequency it is running the test.

For the E cores there are a few techniques you can use to “corner” them. On most desktops you can turn off all of the P cores except one, and then set that one to 900MHz (lowest generally allowed). This way the Windows will go for a more performant core, which now will be an E core.

Igor Kavinsky has also been very helpful and clever in how he obtained what I believe are very accurate scores of his Lion Cove and especially Skymont cores, which are very slippery. Here is what he told me he did and it helped me to benchmark the Zen 5c cores in my laptop.

“Use Statuscore. When you tick any core, it boosts to max speed. By clicking each core, you will know which one is the P-core and which one is the E-core from their max boost speeds.

So let's suppose Core 17 is the first Zen5c core. GB6 is the easiest. It obeys process affinity like a good dog. Set affinity to Core 17. In Status Core, turn on data recording and immediately click Run in GB6. When you see the Raytracing test, get ready. As soon as the motion test is done, go to the Statuscore menu and stop the data recording. Now just let the multicore part of GB6 finish so you can get the score. It's still going to test only the affinity core in MT because like I said, GB6 is a good dog.

In the recorded CSV file, go to the E-core section and you will see all the cores have clocks around 100-200 MHz while only the E-core clock will be continuously high. I took the values closest to the boost clock. So if most values were 4800 or so, I disregarded all values below 4800 and took an average of all values 4800 or above.

7-Zip is also easy just like GB6 and you shouldn't have any issue.

CPU-Z is a bit problematic and needs quick reflexes. Choose E-core from the drop down. Set affinity to one E-core only. Then start the test and immediately open the data recording dialog and click it but don't start. Wait for the single core test to start and start recording as soon as you see the ST test begin and stop when it's done. In the CSV, look for when the selected E-core's clocks start at its expected boost clock and take an average from that point till the last point where the boost clock is high and average those.

CB23 is stupid so you have to be quick. Start data recording, start the ST test, immediately set the affinity to the E-core and keep recording till the score appears. Same as before, take an average of the constantly high clocks only. CB26 is a bit better as you get some time while it prepares to run so you can do the above without much pressure.

Setting Max Frequency for non-hybrid CPUs using Window Power Plan
If you have a non-hybrid CPU then a simple registry tweak will allow you to set a max frequency. Set it below a frequency no throttling will occur and Windows will hold that frequency. Of course double-check that with HWinfo.

This file when run will add the needed registry key: https://www.dropbox.com/scl/fi/bigm...ey=jnyrsz7g9bq7nhlb5elx7jwcu&st=kub4nlh6&dl=0

This is where you set a max frequency.

Results

The benchmark results were compiled into a table for comparison. To enable cross-architecture evaluation, all benchmark scores were normalized relative to the Intel Skylake core, which was used as the baseline reference. Each benchmark score was first converted to a throughput-per-cycle value and then expressed as a ratio relative to the Skylake baseline.

The normalized results were combined into a single composite score using a weighted geometric mean. The weighting applied to each benchmark was as follows:

Geekbench 6.5 – 40%
Cinebench 2026 or/and R23 – 30%
7-Zip – 20%
CPU-Z – 10%

These weightings were selected to balance general-purpose application behavior, sustained floating-point performance, and integer-focused execution characteristics while preventing any single benchmark from disproportionately influencing the final composite score.

Hulk · Wednesday at 6:27 PM

(reserved)

Markfw · Wednesday at 7:02 PM

well, I know that was a lot of work, but multicore is almost more important, and power usage as well. Think about adding those. and maybe the 285 also ?

Hulk · Wednesday at 7:55 PM

Markfw said:
well, I know that was a lot of work, but multicore is almost more important, and power usage as well. Think about adding those. and maybe the 285 also ?

I know you are big in DC so that is important to you. But for most people the first 8 or 12 cores are the ones that really matter. 285K has the same type cores as 245K. I'm keeping the scope of this limited to ST workrate.

Abwx · Wednesday at 8:03 PM

I would discard CPUz, i mean, a so called bench that show only 1% IPC difference between Zen 3 and Zen 4 is just a prove that it s a fishy bench, that looks like a look up table more than anything else, indeed it took Zen 3 to barely match Skylake while in any other bench the difference is huge.

Markfw · Wednesday at 8:11 PM

Hulk said:
I know you are big in DC so that is important to you. But for most people the first 8 or 12 cores are the ones that really matter. 285K has the same type cores as 245K. I'm keeping the scope of this limited to ST workrate.

But that bench is for ONE core, even 8-12 core bench's included would be great !

zir_blazer · Wednesday at 11:58 PM

One of the reasons why I'm not that fond of this kind of testing is because it can't account for changes like memory subsystem. Zen 3 to Zen 4 includes the DDR4 to DDR5 jump, which for me, made Zen 4 a much smaller jump than it appears on this kind of test because you're supposedly testing with faster memory, so Core architecture IPC increase is LOWER than advertised. Zen 4 vs Zen 5 are far easier to directly compare because, well, you can.

LGA 1700 is perhaps the most interesing recent platform because you can isolate how certain changes may affect IPC, like:
Cache L2 size (Alder Lake vs Raptor Lake)
DDR4 vs DDR5 (With different Motherboard)
...which perhaps can help to extrapolate IPC on other platforms.

Otherwise you have AM4, which covers a whole range of CPUs/APUs Zen, Zen+, Zen 2, Zen 3 (The latter two due to the offdie memory controller again, makes it easier to compare between them but harder to the earlier two. Albeit for Zen vs Zen+ vs Zen 2 vs Zen 3 it may make more sense to compare Raven Ridge, Picasso, Renoir and Cezanne than the CPUs-only Ryzens, precisely, because all 4 are monolithic, so perhaps you can kind of extrapolate offdie memory controller impact).

Hulk · Thursday at 12:19 AM

zir_blazer said:
One of the reasons why I'm not that fond of this kind of testing is because it can't account for changes like memory subsystem. Zen 3 to Zen 4 includes the DDR4 to DDR5 jump, which for me, made Zen 4 a much smaller jump than it appears on this kind of test because you're supposedly testing with faster memory, so Core architecture IPC increase is LOWER than advertised. Zen 4 vs Zen 5 are far easier to directly compare because, well, you can.

LGA 1700 is perhaps the most interesing recent platform because you can isolate how certain changes may affect IPC, like:
Cache L2 size (Alder Lake vs Raptor Lake)
DDR4 vs DDR5 (With different Motherboard)
...which perhaps can help to extrapolate IPC on other platforms.

Otherwise you have AM4, which covers a whole range of CPUs/APUs Zen, Zen+, Zen 2, Zen 3 (The latter two due to the offdie memory controller again, makes it easier to compare between them but harder to the earlier two. Albeit for Zen vs Zen+ vs Zen 2 vs Zen 3 it may make more sense to compare Raven Ridge, Picasso, Renoir and Cezanne than the CPUs-only Ryzens, precisely, because all 4 are monolithic, so perhaps you can kind of extrapolate offdie memory controller impact).

Yes, as I wrote in the first post.
"Also I understand that these benchmarks don't simply test compute. New processors have the advantage not only of better architectures, but also better memory subsystems to feed the cores, as well as new ISA instructions that may be used in the benchmarks. It's all folded in."

I am making no attempt to isolate the cores from cache, main memory, new ISA, etc.. Also cpus are generally designed not to have a giant bottleneck somewhere because the engineers know putting a super great cpu in a system where it is starved for bandwidth is not economical. Notice I wrote "generally."

This investigation is simply total throughput rate for some common benchmarks very accurately tested and recorded. That's it.

Lots of good ideas here and I suggest someone start another thread and incorporate the ideas suggested here! We only have about 5 active threads in here so more content would be great.

Another issue is when people have a dog in the race and this benchmark or that benchmark doesn't represent their dog fairly. That's a valid complaint but these benches are ubiquitious and often "leaked" so it's nice to really know how they compare among systems.

MS_AT · Thursday at 3:49 AM

Hulk said:
CPU-Z was used as a lightweight, integer-focused benchmark that is particularly sensitive to instruction scheduling, front-end throughput, and branch prediction.

Not to nitpick, but let's link the C&C article again:

https://chipsandcheese.com/p/cpu-zs-inadequate-benchmark?utm_source=publication-search

Quotes:

CPU-Z’s benchmark is a FP32 math test using SSE instructions. It does not leverage SSE’s vector math capability with the exception of some 128-bit memory accesses. Most SSE instructions are scalar FP32 adds, multiplies, conversions, or compares.

So no, it's not integer focused.

Compared to Cinebench and gaming workloads, CPU-Z has fewer branches, which are also easier to predict. (...) CPU-Z has fewer unique branches than Cinebench 2024 or games. Even Goldmont Plus has no problem tracking CPU-Z’s branches, even though its BTB is the smallest of the bunch.

It's also not branch heavy.

A CPU’s frontend is responsible for bringing instructions into the core. It’s not seriously challenged by CPU-Z’s benchmark, but it’s nice to understand why

The bit about frontend does not seem to hold either.

Bonus:

I can’t think of anything that fits within the L1 cache and barely challenges the branch predictor. CPU-Z’s benchmark is an exception. The factors that limit performance in CPU-Z are very different from those in typical real-life workloads. (...) Thus, CPU-Z’s benchmark ends up being useless to both CPU designers and end users.

Hulk · Thursday at 8:57 AM

MS_AT said:
Not to nitpick, but let's link the C&C article again:

https://chipsandcheese.com/p/cpu-zs-inadequate-benchmark?utm_source=publication-search

Quotes:

So no, it's not integer focused.

It's also not branch heavy.

The bit about frontend does not seem to hold either.

Bonus:

Thank you. I updated the description with that from C+C.

Doug S · Thursday at 2:02 PM

So a week ago I said this to @Hulk in response to his complaints about what he saw as shortcomings of benchmarking to produce information he wants to see:

Doug S said:
So why you are complaining about the lack of benchmarks that meet your criteria? If you have 30 years of experience in this and know exactly what you want and believe you know how to get it? Why not create this great benchmark yourself?

And well, he did. Not saying I had anything to do with it, but still I'm impressed because too often when people complain about what they see as shortcomings in the way things are currently done, they don't want to step up and do something to address it because that's a lot more work than complaining.

Kudos!

Search

Discussion Investigation of Single Thread CPU "Thoughput/cycle"

Hulk

Diamond Member

Hulk

Diamond Member

Hulk

Diamond Member

Markfw

Moderator Emeritus, Elite Member

Hulk

Diamond Member

Abwx

Lifer

Markfw

Moderator Emeritus, Elite Member

zir_blazer

Golden Member

Hulk

Diamond Member

MS_AT

Senior member

Hulk

Diamond Member

Doug S

Diamond Member

TRENDING THREADS