Discussion [IPC] Instructions per cycle - How we measure, interpret and apply this metric for modern computing systems


NTMBK

Diamond Member
Nov 14, 2011
9,691
3,514
136
When I use the word colloquially, I intend its actual meaning. You're of course technically correct, but in practical terms you're talking about corner-case benchmarks. For example, in AT's benchmark suite there's a single benchmark that uses AVX512 - everything else follows the same code path with the exact same instructions, which makes the point moot for the overall discussion.
What subset of the benchmark suite are you talking about? When I look in Bench, I see comparisons that include things like video games - where you absolutely won't get a consistent number of instructions run. Even if you use canned benchmarks or canned user inputs, pretty much every modern video game has framerate-dependent logic - the longer your frametime, the further you need to increment your position due to velocity, etc. In short, the more frames per second you achieve, the more instructions you will run to simulate the same time period (at higher fidelity).
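The framerate-dependent point can be sketched in a few lines of C. This is a hypothetical toy integrator, not code from any real game: the simulated second comes out the same either way, but the higher-fps run executes proportionally more update instructions.

```c
/* Toy sketch (hypothetical): integrate position over one simulated second.
   A faster machine runs more frames, so it executes more update
   instructions to cover the same simulated time span. */
static double simulate_one_second(int fps, long *updates_run) {
    double pos = 0.0;
    double vel = 3.0;          /* units per simulated second */
    double dt = 1.0 / fps;     /* frametime shrinks as fps rises */
    for (int frame = 0; frame < fps; frame++) {
        pos += vel * dt;       /* one position update per frame */
        (*updates_run)++;      /* count the work actually done */
    }
    return pos;                /* same end state, more work at high fps */
}
```

Run it at 30 fps and at 240 fps: the final position is (to rounding) identical, but the 240 fps run performs 8x the updates - which is exactly why "instructions retired per benchmark run" isn't constant across CPUs in a game benchmark.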
 

Zucker2k

Golden Member
Feb 15, 2006
1,717
1,035
136
Oh, where were you guys when the IPC debate broke out here. It's kinda funny because I was arguing the same thing Andrei is arguing now.
 

Schmide

Diamond Member
Mar 7, 2002
5,426
380
126
I guess you're referring to:



Except that's not exactly true. It might be, depending on the processors and binaries, but certainly not in all situations. Processors that are not compatible will obviously not be running the same binary, making the above statement seem a bit odd as nothing suggests limiting to x86. Even then, some binaries can have multiple code paths depending on processor features or strengths, or a program may have multiple binaries to accomplish the same. Doing the same task with AVX or AVX 512 will change the number of instructions involved. And that's just the pre-compiled stuff. JITed code could end up quite different given an intelligent JIT compiler that plays to the strengths of whatever processor it's running on.
A few caveats:

If you're running special codepaths for different levels of AVX, you're going to dynamically load and, for the most part, recompile everything. The same way you can't compile a 32-bit and a 64-bit executable into the same program, you can't set a single compiler flag that produces both AVX and AVX2 code in the same DLL. (AVX and AVX2 are very similar) (Sandy Bridge to Haswell)

Doing the same task with AVX or AVX512 will use similar instructions, but AVX512 will issue fewer instructions to get the same work done. An exception is operations that require permutations: a 2-lane AVX path is one pass, while a 4-lane AVX512 path is 4 passes, plus additional in-lane element manipulation for both AVX and AVX512. (1 to 5 passes)

There is an additional penalty for mixing modes. Running SSE3 mixed with AVX, for example, will require register flushes in the upper lanes.

JIT will probably never replace the standard compiler-and-library system. It has its place in many housekeeping tasks, but traditionally compiled code is still the norm.

For the most part AVX2 isn't going anywhere and AVX512 will probably find its way into a few libraries. Its niche is still high performance computing and even there it is far from the norm in code.
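The "special codepaths for different levels of AVX" idea above usually boils down to a feature check at startup. A minimal sketch, assuming GCC/Clang on x86-64 (`__builtin_cpu_supports` is their builtin); the AVX variants here are stand-in scalar bodies so the sketch stays portable:

```c
#include <stddef.h>

/* Scalar fallback that works on any CPU. */
static long sum_scalar(const int *v, size_t n) {
    long s = 0;
    for (size_t i = 0; i < n; i++) s += v[i];
    return s;
}
/* Stand-ins: in a real build these would be compiled with -mavx2 /
   -mavx512f in separate translation units or DLLs. */
static long sum_avx2(const int *v, size_t n)   { return sum_scalar(v, n); }
static long sum_avx512(const int *v, size_t n) { return sum_scalar(v, n); }

typedef long (*sum_fn)(const int *, size_t);

/* Pick the widest variant the running CPU supports, once, at startup. */
static sum_fn pick_sum(void) {
#if defined(__GNUC__) && defined(__x86_64__)
    if (__builtin_cpu_supports("avx512f")) return sum_avx512;
    if (__builtin_cpu_supports("avx2"))    return sum_avx2;
#endif
    return sum_scalar;  /* safe fallback everywhere */
}
```

Whichever variant gets picked, the same task retires a different instruction count - which is the whole caveat about comparing "instructions per cycle" across such binaries.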
 

HurleyBird

Platinum Member
Apr 22, 2003
2,427
849
136
The same way you can't compile a 32-bit and a 64-bit executable into the same program, you can't set a single compiler flag that produces both AVX and AVX2 code in the same DLL. (AVX and AVX2 are very similar) (Sandy Bridge to Haswell)
I believe ICC can generate multiple SIMD paths in the same binary using -ax<FEATURE> flags. For the most part these aren't as well optimised though.
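GCC offers roughly the same thing as ICC's `-ax` flags via function multi-versioning: one source function, several ISA-specific clones, and a load-time resolver that picks among them. A minimal sketch, assuming GCC (or recent Clang) on x86-64 Linux, where `target_clones` and the underlying ifunc mechanism are supported:

```c
/* One function, three emitted versions: an AVX2 clone, an SSE4.2 clone,
   and a baseline. The compiler also emits a resolver that selects the
   best clone for the running CPU when the program loads - no custom
   DLL-loading code needed. */
__attribute__((target_clones("avx2", "sse4.2", "default")))
int dot(const int *a, const int *b, int n) {
    int s = 0;
    for (int i = 0; i < n; i++)
        s += a[i] * b[i];   /* auto-vectorized differently per clone */
    return s;
}
```

As with `-ax`, the cost is binary size (every clone ships as dead code on CPUs that never select it) rather than developer effort.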
 

Schmide

Diamond Member
Mar 7, 2002
5,426
380
126
It's more of a why would you? You end up with a bunch of dead code in memory rather than on disk.
 

NTMBK

Diamond Member
Nov 14, 2011
9,691
3,514
136
It's more of a why would you? You end up with a bunch of dead code in memory rather than on disk.
Because slapping on a few compiler flags takes 10 minutes, unlike maintaining your own custom DLL loading code!
 

Hans de Vries

Senior member
May 2, 2008
299
811
136
www.chip-architect.com
Internet: ICC's dispatcher gives Intel an unfair advantage rabble rabble
ATF: Why would you use it?

Glibc is crippling AMD too, so you can't use any Linux benchmarks to compare IPC.
Well, it seems indeed that some optimized libraries are loaded for Haswell but not for Zen 2.
Question is when are these parameters used for which libraries? Source URL

View attachment 10535

The author wrote this code in May 2017, long before Zen 2. I would trust the original author even though he works for Intel. He produced the very first Linux distribution (version 0.12 in early 1992 on two 5.25" floppies).
 

Nothingness

Platinum Member
Jul 3, 2013
2,185
437
136
As someone asked on RWT: has it been proven that no AVX2 optimised glibc functions are being used on AMD? Because I'm not sure there's anything wrong with this code: ISA-specific code should be dispatched using the CPU features, and the highlighted code doesn't set these features, it only sets the platform name to "haswell" and AVX512 feature (which is not available on AMD CPUs).
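The distinction being made here can be modeled in a few lines. This is a hypothetical toy, not glibc's actual API: a resolver keyed off feature bits does the right thing on any vendor, while one keyed off a platform-name string misses AMD parts even when the features exist.

```c
#include <stdbool.h>
#include <string.h>

/* Illustrative model only - names and fields are invented. */
struct cpu {
    bool has_avx2;          /* feature bit, vendor-neutral */
    const char *platform;   /* e.g. "haswell"; vendor/model specific */
};

/* Correct approach: dispatch on the CPU feature itself. */
static const char *pick_memcpy_by_feature(const struct cpu *c) {
    return c->has_avx2 ? "memcpy_avx2" : "memcpy_generic";
}

/* Fragile approach: dispatch on the platform name. An AVX2-capable
   Zen 2 part is not named "haswell", so it falls to the slow path. */
static const char *pick_memcpy_by_platform(const struct cpu *c) {
    return strcmp(c->platform, "haswell") == 0 ? "memcpy_avx2"
                                               : "memcpy_generic";
}
```

So the question in the thread reduces to: which of glibc's selection decisions actually key off the platform string (as in the highlighted code) rather than the feature bits?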
 

Hans de Vries

Senior member
May 2, 2008
299
811
136
www.chip-architect.com
As someone asked on RWT: has it been proven that no AVX2 optimised glibc functions are being used on AMD? Because I'm not sure there's anything wrong with this code: ISA-specific code should be dispatched using the CPU features, and the highlighted code doesn't set these features, it only sets the platform name to "haswell" and AVX512 feature (which is not available on AMD CPUs).
Yes, I would like to know that as well. Are these flags indeed used to select libraries? And if so, which libraries? I wouldn't know. That's why I posted it on RWT...
 

yeshua

Member
Aug 7, 2019
166
134
76
I think I've got a brilliant idea.

Why don't we give up on IPC entirely and instead use "work per second (per watt)", i.e. WPS? This will be workload-dependent, but then we can use dozens of different tasks and calculate a geometric mean. In that case there's very little to argue about, because the same task may require more instructions on, say, ARM than on x86-64 and vice versa, while the time/power it takes to complete the task may differ.
 

Schmide

Diamond Member
Mar 7, 2002
5,426
380
126
Well, it seems indeed that some optimized libraries are loaded for Haswell but not for Zen 2.
Question is when are these parameters used for which libraries? Source URL

View attachment 10535

The author wrote this code in May 2017, long before Zen 2. I would trust the original author even though he works for Intel. He produced the very first Linux distribution (version 0.12 in early 1992 on two 5.25" floppies).
Why is this here and in another thread?
 

SAAA

Senior member
May 14, 2014
541
126
116
I think I've got a brilliant idea.

Why don't we give up on IPC entirely and instead use "work per second (per watt)", i.e. WPS? This will be workload-dependent, but then we can use dozens of different tasks and calculate a geometric mean. In that case there's very little to argue about, because the same task may require more instructions on, say, ARM than on x86-64 and vice versa, while the time/power it takes to complete the task may differ.
I'm with Yeshua here: we should stop this trend of the word IPC being misused and define a simpler term for single-threaded performance per clock, averaged over most workloads.

Also, we could really use a term for how well a program is threaded - think "this app is 4TH-optimized, this one is 1TH-heavy", etc. Then you just put them together and see, over a range of programs, what's best: "OK, that CPU is good at nTH, what about 5 threads and below?"

Userbench may have funny parameters now, but I like the way it scales from 1 to 4 to many threads: some programs don't use ALL the cores you have, or use just one, so a mid-range evaluation is useful for judging mixed-workload performance.
 
