Discussion [IPC] Instructions per cycle - How we measure, interpret and apply this metric for modern computing systems

NTMBK · Aug 6, 2019

Andrei. said:
When I use the word colloquially, I intent its actual meaning. You're of course technically correct, but in practical terms you're talking about corner case benchmarks. For example in AT's benchmark suite there's one single benchmark which uses AVX512 - everything else is following the same code path with the exact same instructions, making your point a moot point for the overall discussion.

What subset of the benchmark suite are you talking about? When I look in Bench, I see comparisons including a bunch of things like video games- where you absolutely won't get consistent number of instructions run. Even if you use canned benchmarks, or canned user inputs, pretty much every modern video game has framerate dependent logic- the longer your frametime, the further you need to increment your position due to velocity, etc. In short, the more frames per second you achieve, the more instructions you will be running in order to simulate the same time period (at a higher fidelity).

Zucker2k · Aug 6, 2019

Oh, where were you guys when the IPC debate broke out here. It's kinda funny because I was arguing the same thing Andrei is arguing now.

NTMBK · Aug 6, 2019

Zucker2k said:
Oh, where were you guys when the IPC debate broke out here. It's kinda funny because I was arguing the same thing Andrei is arguing now.

It's a 73 page thread, ain't nobody got time for that!

Schmide · Aug 6, 2019

HurleyBird said:
I guess you're referring to:

Except that's not exactly true. It might be, depending on the processors and binaries, but certainly not in all situations. Processors that are not compatible will obviously not be running the same binary, making the above statement seem a bit odd as nothing suggests limiting to x86. Even then, some binaries can have multiple code paths depending on processor features or strengths, or a program may have multiple binaries to accomplish the same. Doing the same task with AVX or AVX 512 will change the number of instructions involved. And that's just the pre-compiled stuff. JITed code could end up quite different given an intelligent JIT compiler that plays to the strengths of whatever processor it's running on.

A few caveats:

If you're running special codepaths for different levels of AVX you're going to dynamically load and for the most part recompile everything. The same way you can't compile a 32 and 64 bit executable into the same program, you can't set a certain compiler flag to do both AVX and even AVX2 in the same dll. (AVX and AVX2 are very similar) (Sandy Bridge to Haswell)

Doing the same task with AVX or AVX512 will have similar instructions but AVX512 will issue less instructions to get the same work done. An exception to this is operations that require permutations. A 2 lane AVX path is one pass. For a 4 lane AVX512 path is 4 passes. Plus additional in lane element manipulation for both AVX and AVX512. (1 to 5 passes)

There is an additional penalty for mixing modes. Running SSE3 mixed with AVX for example will require register flushes in the upper lane.

JIT will probably never replace the standard compiler and library system. It has its place in many housekeeping tasks, but traditional code is still the norm.

For the most part AVX2 isn't going anywhere and AVX512 will probably find its way into a few libraries. Its niche is still high performance computing and even there it is far from the norm in code.

HurleyBird · Aug 6, 2019

Schmide said:
The same way you can't compile a 32 and 64 bit executable into the same program, you can't set a certain compiler flag to do both AVX and even AVX2 in the same dll. (AVX and AVX2 are very similar) (Sandy Bridge to Haswell)

I believe ICC can generate multiple SIMD paths in the same binary using -ax<FEATURE> flags. For the most part these aren't as well optimised though.

Schmide · Aug 6, 2019

It's more of a why would you? You end up with a bunch of dead code in memory rather than on disk.

NTMBK · Aug 7, 2019

Schmide said:
It's more of a why would you? You end up with a bunch of dead code in memory rather than on disk.

Because slapping on a few compiler flags takes 10 minutes, unlike maintaining your own custom DLL loading code!

lamedude · Sep 8, 2019

Internet: ICC's dispatcher gives Intel an unfair advantage rabble rabble
ATF: Why would you use it?

Glibc is crippling AMD too so you can't use any Linux benchs to compare IPC.

Hans de Vries · Sep 9, 2019

lamedude said:
Internet: ICC's dispatcher gives Intel an unfair advantage rabble rabble
ATF: Why would you use it?

Glibc is crippling AMD too so you can't use any Linux benchs to compare IPC.

Well, it seems indeed that some optimized libraries are loaded for Haswell but not for Zen 2.
Question is when are these parameters used for which libraries? Source URL

The author wrote this code in May 2017, long before Zen 2. I would trust the original author even though he works for Intel. He produced the very first Linux distibution (version 0.12 in early 1992 on two 5.25" floppies).

Nothingness · Sep 9, 2019

As someone asked on RWT: has it been proven that no AVX2 optimised glibc functions are being used on AMD? Because I'm not sure there's anything wrong with this code: ISA-specific code should be dispatched using the CPU features, and the highlighted code doesn't set these features, it only sets the platform name to "haswell" and AVX512 feature (which is not available on AMD CPUs).

Hans de Vries · Sep 9, 2019

Nothingness said:
As someone asked on RWT: has it been proven that no AVX2 optimised glibc functions are being used on AMD? Because I'm not sure there's anything wrong with this code: ISA-specific code should be dispatched using the CPU features, and the highlighted code doesn't set these features, it only sets the platform name to "haswell" and AVX512 feature (which is not available on AMD CPUs).

Yes. I would like to know that as well. Are these flags indeed used to select libraries? and if so which libraries? I wouldn't know. That's why I posted it on RWT...

yeshua · Sep 9, 2019

Probably I've got a brilliant idea.

Why don't we give up on IPC entirely and instead use "work per second (per watt)", i.e. WPS? This will be workload dependent but then we may use dozens of different tasks and calculate a geometric mean. In this case there's very little to argue about because the same task may require more instructions say on ARM than on x86-64 and vice versa while the time/power it takes to complete the task may differ.

Schmide · Sep 9, 2019

Hans de Vries said:
Well, it seems indeed that some optimized libraries are loaded for Haswell but not for Zen 2.
Question is when are these parameters used for which libraries? Source URL

View attachment 10535

The author wrote this code in May 2017, long before Zen 2. I would trust the original author even though he works for Intel. He produced the very first Linux distibution (version 0.12 in early 1992 on two 5.25" floppies).

Why is this here and in another thread?

Hans de Vries · Sep 9, 2019

Schmide said:
Why is this here and in another thread?

It is Off Topic here.

SAAA · Sep 9, 2019

yeshua said:
Probably I've got a brilliant idea.

Why don't we give up on IPC entirely and instead use "work per second (per watt)", i.e. WPS? This will be workload dependent but then we may use dozens of different tasks and calculate a geometric mean. In this case there's very little to argue about because the same task may require more instructions say on ARM than on x86-64 and vice versa while the time/power it takes to complete the task may differ.

I'm with Yeshua here, we should just cut this trend of IPC word being misused and start defining a simpler word for single threaded performance over clockspeed. Over most workloads.

Also we could really start to use a term to define how well a program is threaded, think this app is 4TH optimized, this one is 1TH heavy etc.
So you just put them together and see over a range of programs what's best. "OK that cpu is good at nTH, what about 5 and below?"

Userbench may have funny parameters now but I like the way it scales from 1 to 4 to many threads: some programs don't use ALL the cores you have or just one, so a mid range evaluation is useful to determine mixed workloads performance.

FlameTail · Jan 30, 2024

Let's revive this thread.

So what's the correct way to measure IPC?

Thibsie · Jan 30, 2024

I think it has been answered pretty much in the first posts. Why bring this back ?

FlameTail · Jan 30, 2024

Thibsie said:
I think it has been answered pretty much in the first posts. Why bring this back ?

Coz the issue of IPC was raised in the Zen 5 thread.

coercitiv · Jan 30, 2024

FlameTail said:
Coz the issue of IPC was raised in the Zen 5 thread.

Something you should consider in relation to what was raised in the Zen 5 thread: performance per clock of a certain architecture is affected by the different memory layers it has access to. You can see this in memory sensitive applications, where much faster DRAM or a much bigger cache can massively increase performance. One easy to pick example is gaming, where 3D v-cache AMD CPUs manage to outperform the vanilla SKUs while running lower clocks too. This is important because the cores themselves are the same, but they are fed with increased efficiency and work closer to their theoretical potential.

Another scenario where we can witness this type of memory PPC scaling is when we normalize for ISO clocks. If we take a core that is built to work at ~6Ghz and downclock it to ~3Ghz, the relative speed of the memory subsystem increases. CPU wastes less clock cycles waiting for memory to deliver the goods. In memory sensitive workloads a 50% drop in CPU clocks will lead to less than 50% drop in performance per clock. This can happen because of DRAM memory alone, but can also be impacted by bus/L3 clocks (depending on architecture).

In the Zen 5 thread you took the GB6 score as a performance per clock indicator and attempted to normalize for ISO clocks. AFAIK GB is memory sensitive, so this means that normalization is not as easy as applying the rule of the thirds.

You'll notice I used PPC instead of IPC. Performance per clock is much closer to describe what we really mean when we colloquially use "IPC" in this forum and in massmedia. This meaning is not really compatible with the pure engineering term (read this thread if you want ot know why). As consumers we talk about IPC as an average obtained from a battery of tests (or obtained from something like SPEC but then also reflected in a battery of workloads to see distribution of gains across the spectrum). The more different the architecures are in terms of clocks and scope, the less relevant these IPC comparisons are, to the point where it is much saner to just compare performance, power and cost as the trifecta that really matters.

DavidC1 · Jan 30, 2024

Finally someone with a sense.

I hate the term "IPC" with a passion. Intel/AMD has relegated to using it. It confuses people. They claim nonsense like it's an instruction, so it's different between CPUs of different ISA's, even in cases where it performs identically per clock. But most people are little evolved compared to literal NPCs in World of Warcraft, so what can you do?

NTMBK · Feb 1, 2024

Huh, I got a lot of likes from this thread all of a sudden. Y'all are still using IPC wrong

JustViewing · Feb 1, 2024

IPC is not a constant. It can very depend on the type application. Theoretical IPC is never achieved in any of the software. Zen4 for example has theoretical IPC of 6. But it would never achieve that due to various constraints, mainly not enough architectural registers and memory bottleneck. It is very hard to get IPC over 2 unless you do assembly language programming.

DavidC1 · Feb 3, 2024

JustViewing said:
IPC is not a constant. It can very depend on the type application. Theoretical IPC is never achieved in any of the software.

Dude.

Did you literally choose to ignore everything said previously? ILP(the REAL IPC) only matters to engineers. We really don't care how much ILP exists, or how much instructions it takes to execute a program. We only care how the chip performs in applications.

When people in general say "IPC" they are talking about a computer chip's performance at the same frequency.

I posted just one above you and said I hate the term IPC because it confuses people - such as yourself.

JustViewing · Feb 3, 2024

DavidC1 said:
I posted just one above you and said I hate the term IPC because it confuses people - such as yourself.

I am certainly not confused by the term and I am very much aware what exactly it means. I also really hate the misuse of the term IPC. But the true IPC is very valuable to compare different processor architectures and find out its strengths and weaknesses. The true IPC can be found in sites like http://instlatx64.atw.hu . BTW, personally I do check actual IPC using profiler for finding ways to optimize.

Mopetar · Feb 3, 2024

IPC is meaningless unless you know what applications you're running. You can even get different numbers for the same application just based on compiler settings or which compiler you're using.

Clock speed doesn't matter either outside of cases where there's a major bottleneck due to memory to the point load/store buffers are completely full and the CPU stalls completely to wait for memory to catch up. Even then dropping the clock speed doesn't really gain you any actual performance even if the IPC technically goes up.

The instructions are tied to cycles, regardless of how fast or slow the clock runs. Clock speed is only relevant if what you really want is overall performance.

Discussion [IPC] Instructions per cycle - How we measure, interpret and apply this metric for modern computing systems

Lifer

Golden Member

Lifer

Diamond Member

Platinum Member

Diamond Member

Lifer

Golden Member

Senior member

Diamond Member

Senior member

Member

Diamond Member

Senior member

Senior member

Diamond Member

Golden Member

Diamond Member

Diamond Member

Platinum Member

Lifer

Senior member

Platinum Member

Senior member

Diamond Member