Discussion [IPC] Instructions per cycle - How we measure, interpret and apply this metric for modern computing systems

coercitiv · Aug 5, 2019

Lately some of our forum members have expressed divergent opinions on how IPC should be measured and interpreted in the context of systems with different power and clock targets.

This thread is intended to be the right place for opinions on this matter, whether derived from experience or based on academic sources. Discuss!

NTMBK · Aug 5, 2019

Instructions Per Cycle is a meaningless metric unless you specify the precise section of code it is being run on, and the specific data that it is running on. Some code is just naturally low IPC, with lots of memory dependencies with lots of cache misses, or lots of high latency instructions. Or at the other end of the spectrum you could just be running an endless loop of NOPs, in which case you can theoretically hit near infinite IPC.

It has a very, very specific meaning, and it is sadly misused by most tech enthusiasts. They really mean "single threaded performance", which is a much looser term that can encompass many different things.

Consider this- heavily vectorized code that makes use of lots of AVX-512 can easily reduce IPC, due to higher per-instruction latency, and increased pressure on the memory subsystem. But that can still result in much higher performance, as each instruction is doing more work. IPC != per-thread performance.
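To put rough numbers on the vectorization point, here is a toy sketch. Every figure in it (element count, lane count, both IPC values) is hypothetical and chosen only for illustration; nothing here was measured on real hardware.

```python
# Toy comparison of a scalar loop and a vectorized loop.
# All numbers are hypothetical, picked only to illustrate the argument.

def cycles(instructions: int, ipc: float) -> float:
    """Cycles needed to retire `instructions` at a given average IPC."""
    return instructions / ipc

ELEMENTS = 1_000_000

# Scalar version: one instruction per element, high IPC.
scalar_instructions = ELEMENTS
scalar_ipc = 3.0

# Vector version: assume 16 elements per instruction
# (e.g. 32-bit lanes in a 512-bit register), at a much lower IPC.
vector_instructions = ELEMENTS // 16
vector_ipc = 1.0

scalar_cycles = cycles(scalar_instructions, scalar_ipc)
vector_cycles = cycles(vector_instructions, vector_ipc)

# Speedup of the vector version at the same clock, despite its lower IPC.
speedup = scalar_cycles / vector_cycles
print(f"scalar IPC {scalar_ipc}, vector IPC {vector_ipc}, speedup {speedup:.2f}x")
```

Under these assumed numbers the vector version has a third of the IPC yet finishes several times sooner, because it retires far fewer instructions for the same work.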

Edrick · Aug 5, 2019

Exactly what NTMBK said. I do not understand how IPC can be viewed as anything other than Instructions Per Cycle.

Markfw · Aug 5, 2019

My take on this: run several different benchmarks on processor A, and the same benchmarks on processor B at the same MHz or GHz. Whichever does better has more IPC.
Due to the various technical changes in design, that's the only way to tell.

IMO

Schmide · Aug 5, 2019

NTMBK said:
(snip) AVX-512 can easily reduce IPC, due to higher per-instruction latency, and increased pressure on the memory subsystem. But that can still result in much higher performance, as each instruction is doing more work. IPC != per-thread performance.

Technically all SIMD reduces instructions per clock, because it is single instruction, multiple data. Hehe

SarahKerrigan · Aug 5, 2019

NTMBK said:
Instructions Per Cycle is a meaningless metric unless you specify the precise section of code it is being run on, and the specific data that it is running on. Some code is just naturally low IPC, with lots of memory dependencies with lots of cache misses, or lots of high latency instructions. Or at the other end of the spectrum you could just be running an endless loop of NOPs, in which case you can theoretically hit near infinite IPC.

It has a very, very specific meaning, and it is sadly misused by most tech enthusiasts. They really mean "single threaded performance", which is a much looser term that can encompass many different things.

Consider this- heavily vectorized code that makes use of lots of AVX-512 can easily reduce IPC, due to higher per-instruction latency, and increased pressure on the memory subsystem. But that can still result in much higher performance, as each instruction is doing more work. IPC != per-thread performance.

Indeed. Somewhere down the line "IPC" became an incorrect shorthand for "clock-normalized ST" (and frequently even more specifically "clock-normalized ST integer.")

Thala · Aug 5, 2019

coercitiv said:
Lately some of our forum members have expressed divergent opinions on how IPC should be measured and interpreted in the context of systems with different power and clock targets.

This thread is intended to be the right place for opinions on this matter, whether derived from experience or based on academic sources. Discuss!

Regarding measurements, you typically count the number of instructions retired and the number of clock cycles in the same time window. The quotient of the two gives you IPC. Most architectures have hardware counters and generate the corresponding events.

If you keep the workload constant at the binary level, IPC is linearly related to single-core performance per clock for the devices under test.
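A short derivation makes the condition explicit. The symbols are introduced here for illustration only: N is the retired instruction count, C the elapsed core cycles, f the clock frequency, t the run time, and P a performance score assumed to be proportional to 1/t.

```latex
\begin{align*}
\mathrm{IPC} &= \frac{N}{C}, \qquad t = \frac{C}{f} = \frac{N}{\mathrm{IPC}\cdot f} \\
P &\propto \frac{1}{t} = \frac{\mathrm{IPC}\cdot f}{N}
\quad\Longrightarrow\quad
\frac{P}{f} \propto \frac{\mathrm{IPC}}{N} \\
\frac{P_A / f_A}{P_B / f_B} &= \frac{\mathrm{IPC}_A}{\mathrm{IPC}_B}\cdot\frac{N_B}{N_A}
\end{align*}
```

With the same binary and input on devices A and B, N_A = N_B and the performance-per-clock ratio equals the IPC ratio. If the instruction counts differ (a different ISA, code path, or compiler), the extra factor N_B / N_A remains and the two ratios no longer agree.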

HutchinsonJC · Aug 5, 2019

In relation to power or clock/frequency targets specifically, I don't know.

I think there's importance in comparing the same clock speeds comparing one brand to another while doing a certain task, but obviously that importance is largely academic if one CPU is capable of significantly higher clock speeds or if one CPU has significantly more cache. It's more just a fun curiosity to see the clocks be artificially flat lined to see how or what the architectures are doing.

I don't think there's any particularly meaningful way for a single benchmark (even comprised of multiple subsets of benchmarks) to utilize ALL transistors of a CPU in such a way to accurately measure something we'll call or term as "IPC" so as to be useful to everyone's primary use case.

HurleyBird · Aug 5, 2019

SarahKerrigan said:
Indeed. Somewhere down the line "IPC" became an incorrect shorthand for "clock-normalized ST" (and frequently even more specifically "clock-normalized ST integer.")

Or "clock-normalized MT performance" depending. But the thing here is that measuring actual instructions is pretty meaningless. Not all instructions are created equally and the only metric that matters is real performance. But good luck getting people to shift to a term like PPC (performance per clock) instead of IPC. Because the slight misuse of the IPC term hides a completely unimportant metric, and because that misuse has so much momentum behind it as to be effectively immovable, I don't really have a problem with it. Everyone knows that when someone says "IPC" they really mean performance-per-clock, and railing against that comes across as slightly autistic -- what's the actual point of opposition outside of sheer technical correctness?

moinmoin · Aug 5, 2019

Due to the complexity of today's core designs, IPC would actually be very complex to calculate correctly. At its core, IPC depends on the latency an instruction needs to start up and how many cycles it takes to complete. For such info Agner Fog's Microarchitecture guide is an excellent resource. The latency and cycles for instructions on each microarchitecture can be found under "execution units", with a full list of instructions in his Instruction Tables. This gets more complex once you include the different cache and memory bandwidths and latencies ("cache and memory access"), the efficiency of branch prediction and its miss penalty ("branch prediction"), whether an instruction needs one, two or even more µops or possibly even combines with other instructions ("instruction decoding"), and so on and so forth.
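As a very rough sketch of how latency and throughput figures turn into an IPC estimate, the snippet below takes the larger of a latency bound (the loop-carried dependency chain) and a throughput bound (µops divided by issue width). This is a common first-order approximation, not Agner Fog's method; the function names and all the numbers are hypothetical, and it ignores caches, branch mispredictions, and execution-port conflicts entirely.

```python
def cycles_per_iteration(chain_latencies, total_uops, issue_width):
    """First-order lower bound on cycles per loop iteration.

    chain_latencies: latencies (in cycles) of the instructions on the
        loop-carried dependency chain.
    total_uops: micro-ops issued per iteration.
    issue_width: micro-ops the core can issue per cycle.
    """
    latency_bound = sum(chain_latencies)          # the chain must run serially
    throughput_bound = total_uops / issue_width   # the front end can only issue so fast
    return max(latency_bound, throughput_bound)


def ipc_upper_bound(instructions, chain_latencies, total_uops, issue_width):
    """Instructions per iteration divided by the cycle lower bound."""
    return instructions / cycles_per_iteration(chain_latencies, total_uops, issue_width)


# Hypothetical 6-instruction loop on a hypothetical 4-wide core.
# Case A: the loop-carried chain is a single 1-cycle add.
ipc_a = ipc_upper_bound(6, [1], total_uops=6, issue_width=4)
# Case B: the chain is a 3-cycle multiply feeding a 1-cycle add.
ipc_b = ipc_upper_bound(6, [3, 1], total_uops=6, issue_width=4)
print(ipc_a, ipc_b)
```

Same instruction count, same core, and the bound still moves by more than 2x depending on the dependency chain alone, which is part of why a single "IPC" figure for a CPU is hard to state.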

So I'd say IPC is theoretically and technically well defined but due to its intertwined complexity practically impossible to use correctly. So we essentially end up with STPPC if correctly applied in spirit (like referring to SPEC benchmarks as the manufacturers already do).

To give a current example, Ice Lake improves the latency and cycles for several instructions. At the same time it increases L1$ latency by 25%.

Tup3x · Aug 5, 2019

Personally I'd say that the CPU that gives better performance while consuming less power has better "IPC". Often the less power hungry chip would have better performance at the same clock speed too.

The thing is... It doesn't make sense to compare CPUs at the same clock speed if the other CPU can clock way higher and then give better performance. Personally I think that the only things that matter are power consumption and performance. The rest is irrelevant.

moinmoin · Aug 5, 2019

Tup3x said:
Personally I'd say that the CPU that gives better performance while consuming less power has better "IPC". Often the less power hungry chip would have better performance at the same clock speed too.

The thing is... It doesn't make sense to compare CPUs at the same clock speed if the other CPU can clock way higher and then give better performance. Personally I think that the only things that matter are power consumption and performance. The rest is irrelevant.

Both power efficiency and frequency scalability are attributes of process nodes more than IPC (though the core design can obviously help).

NTMBK · Aug 5, 2019

Tup3x said:
Personally I'd say that the CPU that gives better performance while consuming less power has better "IPC". Often the less power hungry chip would have better performance at the same clock speed too.

But that's not what IPC is! IPC is Instructions Per Cycle. As in, take number of instructions executed, and divide by the number of cycles that took. That's the only thing it means.

thigobr · Aug 5, 2019

As many have pointed out in this thread, IPC is just a measure of how many instructions per cycle a given CPU core can execute for a given fixed workload. It has nothing to do with power or any other metric besides the number of instructions and time.

What I have been seeing more and more is a misappropriation of the term to indicate general performance (including there even power/efficiency and multi-thread throughput).

My take on it... Let people misuse the term because it's pretty boring to come at each thread trying to correct it.

HurleyBird · Aug 5, 2019

thigobr said:
My take on it... Let people misuse the term because it's pretty boring to come at each thread trying to correct it.

And when even the likes of AMD, Intel, and others are using the term to mean performance per clock in their marketing, it's an uphill battle you can't win.

Thala · Aug 5, 2019

HurleyBird said:
And when even the likes of AMD, Intel, and others are using the term to mean performance per clock in their marketing, it's an uphill battle you can't win.

And still an increase in IPC precisely means a performance increase for a certain program or a set of programs. So the term is correctly used. Which brings me to the point that the term IPC can of course be correctly used and can be precisely measured using performance monitoring features - which makes it attractive for certain metrics and models.

Andrei. · Aug 5, 2019

If you want to measure actual IPC, you go use perf, uProf, or vTune or whatever other profiler to actually poll the performance counters and get the actual cycles and instruction numbers that were run:
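For readers without the original screenshots, here is a minimal sketch of what such a measurement could look like with `perf stat` on Linux. It assumes perf is installed, that the generic `instructions` and `cycles` events are available, and that the CSV field order matches the perf-stat man page (value, unit, event name, ...). The helper names are hypothetical, and hybrid CPUs that report PMU-qualified event names would need extra handling.

```python
import subprocess


def parse_perf_csv(text):
    """Extract event counts from `perf stat -x,` output.

    Assumes the CSV field order from the perf-stat man page:
    counter value, unit, event name, ... Lines whose first field is not
    an integer (e.g. '<not supported>' or program output) are skipped.
    """
    counts = {}
    for line in text.splitlines():
        fields = line.split(",")
        if len(fields) < 3:
            continue
        try:
            value = int(fields[0])
        except ValueError:
            continue
        # Strip modifiers such as ':u' so 'instructions:u' maps to 'instructions'.
        name = fields[2].split(":")[0]
        counts[name] = counts.get(name, 0) + value
    return counts


def measure_ipc(command):
    """Run `command` (a list of argv strings) under perf stat and return IPC."""
    result = subprocess.run(
        ["perf", "stat", "-x,", "-e", "instructions,cycles", "--", *command],
        stdout=subprocess.DEVNULL,   # discard the benchmark's own stdout
        stderr=subprocess.PIPE,      # perf stat writes its counters to stderr
        text=True,
        check=True,
    )
    counts = parse_perf_csv(result.stderr)
    return counts["instructions"] / counts["cycles"]
```

A call such as `measure_ipc(["./my_benchmark"])` (a placeholder binary name) would then return retired instructions divided by cycles for that one run of that one binary on that one input, which is all the number ever means.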

And so on. This will satisfy the anal people about the term IPC and you'll be the best kind of correct - technically correct.
Colloquially, because we're benchmarking the same standard benchmarks, this means they're the same binaries. This means the instruction counts are the same across all machines.

This means that metrics such as derived PPC / Performance Per Clock metrics are identical in their relative values to the actual IPC values. Yes people still call it IPC - and technically it's wrong - but practically it's also not wrong.
Finally, you do not measure IPC across different platforms at some arbitrary equal frequency. I'll quote myself here with my example:

Andrei. said:

SPEC 429.mcf:

4325MHz: 50.68 score, 11.71 score per GHz
3500MHz: 45.49 score, 12.99 score per GHz +10.9% IPC
3000MHz: 39.43 score, 13.14 score per GHz +12.1% IPC

And this is why measuring IPC at some arbitrary equal frequency between systems and especially between different micro-architectures is a load of crap.
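The score-per-GHz figures quoted here can be re-derived with a few lines of arithmetic. The sketch below uses only the frequencies and scores from the quote; the function name is made up for illustration, and the last digit can differ slightly from the quoted values depending on whether you round or truncate.

```python
# (frequency in MHz, SPEC 429.mcf score), taken from the quoted figures.
runs = [(4325, 50.68), (3500, 45.49), (3000, 39.43)]


def score_per_ghz(freq_mhz, score):
    """Clock-normalized score: benchmark score divided by frequency in GHz."""
    return score / (freq_mhz / 1000)


# Use the highest-clocked run as the baseline, as in the quote.
baseline = score_per_ghz(*runs[0])

for freq_mhz, score in runs:
    per_ghz = score_per_ghz(freq_mhz, score)
    gain = (per_ghz / baseline - 1) * 100
    print(f"{freq_mhz} MHz: {per_ghz:.2f} score per GHz, {gain:+.1f}% vs {runs[0][0]} MHz")
```

The same chip on the same binary shows a per-clock difference of over 10% purely from the choice of frequency, which is the whole objection to picking an arbitrary common clock.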

Another famous example of a memory-intensive workload is, you guessed it, 3D games.

People measuring at some locked frequency between different microarchitecture families will be reporting misleading and wrong PPC/IPC numbers because they are altering the microarchitectural balance and measuring at some random point in the non-linear performance vs clock curve. The only point at which microarchitectural performance characteristics such as IPC should matter is at peak performance, because that's where you'll be spending >90% of your computing time.
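One way to see why the curve is non-linear is a toy model, not a fit to any measured data: assume the core's own work scales with clock, while time spent waiting on DRAM is fixed in wall-clock terms. The function names and workload numbers below are hypothetical.

```python
def runtime_seconds(freq_ghz, core_cycles, memory_stall_seconds):
    """Core work shrinks as the clock rises; time waiting on DRAM does not."""
    return core_cycles / (freq_ghz * 1e9) + memory_stall_seconds


def perf_per_ghz(freq_ghz, core_cycles, memory_stall_seconds):
    """Performance (1 / run time) normalized by clock frequency."""
    return 1.0 / runtime_seconds(freq_ghz, core_cycles, memory_stall_seconds) / freq_ghz


# Hypothetical workload: 6e9 core cycles plus 1 second of DRAM stalls.
for f in (3.0, 3.5, 4.325):
    print(f"{f} GHz: {perf_per_ghz(f, 6e9, 1.0):.4f} perf per GHz")
```

In this model performance per GHz falls as frequency rises whenever the stall term is non-zero, and stays flat only when it is zero. Two microarchitectures with different memory behavior therefore shift by different amounts when both are locked to the same arbitrary clock.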

Thank you for coming to my TED talk.

HurleyBird · Aug 5, 2019

Thala said:
And still an increase in IPC precisely means a performance increase for a certain program or a set of programs. So the term is correctly used. Which brings me to the point that the term IPC can of course be correctly used and can be precisely measured using performance monitoring features - which makes it attractive for certain metrics and models.

No. Intel and AMD marketing count raw performance, not instructions, when they make IPC claims.

lopri · Aug 5, 2019

True enough but when there are several SKUs spanning multiple frequency points among vendors, is it really a crime to pick an arbitrary number in order to give an idea to people who do not have time or means of testing themselves? A generation of CPUs from the same vendor are mostly cut from the same cloth anyway.

naukkis · Aug 6, 2019

Andrei. said:
If you want to measure actual IPC, you go use perf, uProf, or vTune or whatever other profiler to actually poll the performance counters and get the actual cycles and instruction numbers that were run:

But that actual IPC has nothing to do with performance per clock. People and manufacturers are interested in PPC; actual IPC isn't important at all. Different CPUs execute different instructions at different speeds, and different instruction sets don't even have the same instructions to execute. That is why we have benchmarks like SPEC: source code that can be translated to binaries with publicly available compilers, so the comparison is between CPUs, not between binaries that favor different CPU architectures.

And there's nothing wrong with reporting that SPEC-derived performance as IPC, as there is a fixed amount of instructions to run in the source code; how many architectural instructions a CPU uses to accomplish that isn't relevant for performance comparisons.

Thala · Aug 6, 2019

HurleyBird said:
No. Intel and AMD marketing count raw performance, not instructions, when they make IPC claims.

You still do not get it - maybe you should also read what Andrei wrote above to get some insights.
In more generalized form this means: if value A has a linear relation to value B and you measure a relative change of A, you can correctly claim the same relative change of B.

I quote myself here:

Thala said:
If you keep the workload constant at the binary level, IPC is linearly related to single-core performance per clock for the devices under test.

That's precisely what Andrei used here:

Andrei said:
4325MHz: 50.68 score, 11.71 score per GHz
3500MHz: 45.49 score, 12.99 score per GHz +10.9% IPC
3000MHz: 39.43 score, 13.14 score per GHz +12.1% IPC

He measured a performance score and concluded/claimed an IPC gain - technically totally correct.

HurleyBird · Aug 6, 2019

Thala said:
You still do not get it - maybe you should also read what Andrei wrote above to get some insights.

I guess you're referring to:

Colloquially, because we're benchmarking the same standard benchmarks, this means they're the same binaries. This means the instruction counts are the same across all machines.

Except that's not exactly true. It might be, depending on the processors and binaries, but certainly not in all situations. Processors that are not compatible will obviously not be running the same binary, making the above statement seem a bit odd as nothing suggests limiting to x86. Even then, some binaries can have multiple code paths depending on processor features or strengths, or a program may have multiple binaries to accomplish the same. Doing the same task with AVX or AVX 512 will change the number of instructions involved. And that's just the pre-compiled stuff. JITed code could end up quite different given an intelligent JIT compiler that plays to the strengths of whatever processor it's running on.

Andrei. · Aug 6, 2019

HurleyBird said:
I guess you're referring to:

Except that's not exactly true. It might be, depending on the processors and binaries, but certainly not in all situations. Processors that are not compatible will obviously not be running the same binary, making the above statement seem a bit odd as nothing suggests limiting to x86. Even then, some binaries can have multiple code paths depending on processor features or strengths, or a program may have multiple binaries to accomplish the same. Doing the same task with AVX or AVX 512 will change the number of instructions involved. And that's just the pre-compiled stuff. JITed code could end up quite different given an intelligent JIT compiler that plays to the strengths of whatever processor it's running on.

When I use the word colloquially, I intend its actual meaning. You're of course technically correct, but in practical terms you're talking about corner-case benchmarks. For example, in AT's benchmark suite there's one single benchmark which uses AVX512 - everything else follows the same code path with the exact same instructions, making your point moot for the overall discussion.

Thala · Aug 6, 2019

HurleyBird said:
I guess you're referring to:

Except that's not exactly true. It might be, depending on the processors and binaries, but certainly not in all situations. Processors that are not compatible will obviously not be running the same binary, making the above statement seem a bit odd as nothing suggests limiting to x86.
Even then, some binaries can have multiple code paths depending on processor features or strengths, or a program may have multiple binaries to accomplish the same. Doing the same task with AVX or AVX 512 will change the number of instructions involved. And that's just the pre-compiled stuff. JITed code could end up quite different given an intelligent JIT compiler that plays to the strengths of whatever processor it's running on.

Using the same binary (same code path) is a pre-condition for the statements to be true.
And nowhere did I ever mention that the statements are limited to x86. They also hold when comparing ARMv8-A architecture implementations.

Abwx · Aug 6, 2019

Andrei. said:
If you want to measure actual IPC, you go use perf, uProf, or vTune or whatever other profiler to actually poll the performance counters and get the actual cycles and instruction numbers that were run:

And so on. This will satisfy the anal people about the term IPC and you'll be the best kind of correct - technically correct.

Colloquially, because we're benchmarking the same standard benchmarks, this means they're the same binaries. This means the instruction counts are the same across all machines.

This means that metrics such as derived PPC / Performance Per Clock metrics are identical in their relative values to the actual IPC values. Yes people still call it IPC - and technically it's wrong - but practically it's also not wrong.

Finally, you do not measure IPC across different platforms at some arbitrary equal frequency. I'll quote myself here with my example:

Another famous example of a memory-intensive workload is, you guessed it, 3D games.

People measuring at some locked frequency between different microarchitecture families will be reporting misleading and wrong PPC/IPC numbers because they are altering the microarchitectural balance and measuring at some random point in the non-linear performance vs clock curve. The only point at which microarchitectural performance characteristics such as IPC should matter is at peak performance, because that's where you'll be spending >90% of your computing time.

Thank you for coming to my TED talk.

Some good remarks, but when comparing Zen 2 IPC to that of Coffee Lake you are overclocking the latter's RAM: Intel's official spec is 2600MHz but you run it at 3200MHz, while only AMD specifies this frequency as official and guarantees it to be bug-free.

In short, you are displaying an IPC for Coffee Lake that is in no way guaranteed by the manufacturer, since you run the chip out of spec.
