You should tell that to Intel then, because they also use "IPC" in that way:
"broad workload mixture"
"FPS per clock." does not give you information about the architecture.
"FPS per clock per core" would result in an IPC-like result, but because FPS are integer values, it would not be accurate enough!
CPUs are not GPUs. It is not a 1 to 1 comparison.
FPS per clock doesn't give you information about the architecture by design... You're not getting it. You are drawing conclusions you cannot draw. You DON'T know the IPC of a shader core or of any other individual compute unit in the GPU. You CAN'T know it from the data you are claiming shows IPC. GPUs are BY THEIR NATURE highly parallel. Occupancy is as important as per-shader speed, or more so. FPS per clock doesn't pretend to know things it doesn't know.
When you measure "IPC", how do you know it's due to a faster shader core and not due to better scheduling and occupancy? Answer: you don't.
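A toy model (all numbers invented) makes the confound concrete: two hypothetical GPUs land on the exact same measured number by completely different means:

```python
# Toy model: delivered per-shader rate ~ occupancy * per-unit work per clock.
def measured_rate(occupancy, work_per_clock):
    # This product is all an outside observer ever sees.
    return occupancy * work_per_clock

gpu_a = measured_rate(occupancy=0.60, work_per_clock=1.00)  # fast units, poorly fed
gpu_b = measured_rate(occupancy=0.75, work_per_clock=0.80)  # slower units, well fed

print(gpu_a, gpu_b)  # 0.6 vs 0.6 -- indistinguishable from FPS-per-clock data
```

Same measurement, opposite architectural stories.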
Notice Intel is talking about their CPU core (a single core). Not greater multithreading across all present cores (= higher occupancy). Not increased bandwidth. Not any other bottleneck. The measurement you think is IPC includes occupancy bottlenecks, fixed-function logic bottlenecks, memory bottlenecks and more. You are NOT measuring IPC, period. Intel doesn't compute their IPC by looking at how well they can fill every thread on every core in their architecture; that's obviously not IPC.
Repeat after me: Higher Occupancy is Not Higher IPC. Higher Occupancy is Higher Throughput. Higher Concurrency is Higher Throughput.
Let's put this another way to make it extra clear:
Imagine the GPU is a server farm made up of 128 individual CPU cores. If you test that server farm with a program that measures how fast 1 core in 1 CPU can go, without hitting memory or storage, you have measured the IPC of that 1 core in 1 CPU. If you put a highly scalable load that scales to 128 cores against that server, you are not measuring IPC. You are measuring total throughput. Read the recent Xeon D vs Cavium ThunderX review to understand the idea. IPC factors in, but so do, equally or more: thread scheduling, how parallel the workload is, how much bandwidth you need, etc. A higher-throughput machine can beat a higher-IPC machine, which is obvious.
If you used an old 4P server with very poor inter-socket communication, you would NOT say that each core has lower IPC than a single socket of the same CPU just because it has lower total throughput per core (total performance divided by the number of cores). The CPUs are the same. They have the same IPC. Dividing performance by core count did not reveal the IPC; it revealed total throughput (specifically, it would help measure scaling inefficiency as socket count goes up). When you use a Xeon Phi, which has much better inter-core communication than a similar number of Xeon CPUs, you don't claim the Xeon Phi has better IPC. There is more to throughput and performance than IPC. In a workload that does not depend on low latency and inter-core communication, the non-Phi Xeon array would likely go faster due to its higher IPC, but it could also be slower in the right workload. The Phi's modified Silvermont cores' IPC didn't change between workloads; other bottlenecks arose and reduced TOTAL THROUGHPUT.
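Sketch of that in Python (hypothetical numbers): identical cores, identical IPC, and yet performance-divided-by-core-count moves around:

```python
# Every machine uses the same hypothetical core: IPC is fixed by the design.
ipc = 2.0        # instructions per clock of one core
clock_ghz = 2.4
cores = 128

def total_ginstr_per_s(scaling_efficiency):
    # Interconnect / scheduling losses degrade parallel throughput.
    return cores * ipc * clock_ghz * scaling_efficiency

single_socket_box = total_ginstr_per_s(scaling_efficiency=0.95)
old_4p_server = total_ginstr_per_s(scaling_efficiency=0.70)

# Dividing by core count yields per-core THROUGHPUT, not IPC:
print(single_socket_box / cores)  # higher
print(old_4p_server / cores)      # lower -- same cores, same IPC
```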
Notice in that review that certain CPUs, even ones with MORE IPC, actually do worse because they have other bottlenecks, like insufficient RAM or bandwidth. This is like a GPU, where consistently filling each queue with work (CONCURRENCY) without running into bottlenecks is as important as ensuring each single queue can execute the work quickly (IPC).
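That bottleneck logic fits in one line, roofline style (hypothetical numbers again): delivered throughput is the minimum of what the units can execute and what memory can feed them:

```python
# Roofline-style sketch: the lower of the two ceilings wins.
def delivered_gflops(peak_compute, mem_bw_gbs, flops_per_byte):
    return min(peak_compute, mem_bw_gbs * flops_per_byte)

# A higher-IPC machine starved for bandwidth loses to a lower-IPC one that isn't:
print(delivered_gflops(peak_compute=1000, mem_bw_gbs=50, flops_per_byte=4))   # 200
print(delivered_gflops(peak_compute=800, mem_bw_gbs=100, flops_per_byte=4))   # 400
```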
To bring this back to the topic of the thread: we don't know Pascal's IPC, despite how fervently people act like they do. Each core likely does exactly the same amount of work. The structural changes from Maxwell to Pascal likely result in increased occupancy, so that each unit has work more often. I'm just speculating, and so is everyone else. The data only shows how many FPS can be produced with every clock; or, if you divide FPS by shader count and then by clock, you have FPS per shader per cycle. Without better testing or better data we will never know whether it's because each individual computational unit is faster, or because they've increased occupancy, or because they've decreased latencies in the chip fabric, or any of the other intricacies involved in getting thousands of compute units to work together. If you want to call FPS per shader per cycle "IPC", no one can stop you, but it's not going to be correct.
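For concreteness, here is literally everything the public data lets you compute (the inputs are hypothetical, not real Pascal measurements):

```python
# The only derivable number: frames per shader per cycle.
fps = 90.0             # measured frame rate (hypothetical)
shader_count = 2560    # shader units (hypothetical Pascal-class figure)
clock_hz = 1.7e9       # boost clock (hypothetical)

fps_per_shader_per_cycle = fps / shader_count / clock_hz
print(f"{fps_per_shader_per_cycle:.3e} frames per shader per cycle")
# One scalar that folds together per-unit speed, occupancy, scheduling,
# fixed-function limits, bandwidth... which is exactly why it is not IPC.
```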
By your definition: run 2 cards in SLI with 80% scaling. Divide total FPS by card count, then by shader count. Oh my god, my IPC goes down in SLI.
By correct definitions: run 2 cards in SLI with 80% scaling. Divide total FPS by card count, then by shader count. Wow, looks like my total throughput per card and per shader decreased in SLI; since I know IPC doesn't change based on scaling inefficiencies, I know my bottleneck is SLI.
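Running both readings side by side (hypothetical 80%-scaling SLI setup):

```python
fps_one_card = 100.0
shaders_per_card = 2560   # hypothetical
scaling = 0.80            # second card adds 80%, so 1.8x total

fps_sli = fps_one_card * (1 + scaling)        # 180 FPS

single = fps_one_card / 1 / shaders_per_card  # metric on one card
naive = fps_sli / 2 / shaders_per_card        # same metric across two cards

print(single, naive)
# naive < single, yet nothing about any shader changed. The drop is SLI
# scaling loss -- lost throughput per card, not lost IPC.
```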
There is no avoiding that work division and scheduling are major determinants of performance in a parallel computation, and that IPC is separate from that.