Two misconceptions about IPC and GPUs

Piroko · Jun 13, 2016

renderstate said:
The IPC curve is not going to be inversely proportional to frequency at low frequencies where the effect of memories is irrelevant. That's pretty straightforward. It's also a completely uninteresting case.

This case even applies to some GPUs at their default frequency, though. GM204 (Hawaii is probably a better example) has plenty of memory bandwidth to spare, while GP104 is probably slightly starved at default frequencies. That is something you would hide if you clocked the chips to an arbitrary lower frequency, but it is inherent to the product as you actually use it (GM204 at <1.5 GHz, GP104 at >>1.6 GHz).
Both memory bandwidth/clock speed and chip clock speed are equal parts of the design choice made by Nvidia, and together they make up actual performance. Any discussion about IPC should reflect this, since we can't buy the GPU with a different memory arrangement.
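
A toy model can make the point above concrete: if throughput is capped by whichever is slower, the core or the memory system, then performance per clock stays flat at low clocks and drops once the memory cap is hit. This is only a sketch under that assumption; `work_per_clock` and `mem_limit_fps` are made-up illustrative parameters, not measurements of any real GPU.

```python
def frames_per_second(core_clock_ghz, work_per_clock, mem_limit_fps):
    """Toy model: throughput is the lower of the compute rate and the memory cap."""
    compute_fps = core_clock_ghz * work_per_clock
    return min(compute_fps, mem_limit_fps)


def perf_per_clock(core_clock_ghz, work_per_clock, mem_limit_fps):
    """Performance divided by core clock, i.e. what the thread loosely calls IPC."""
    fps = frames_per_second(core_clock_ghz, work_per_clock, mem_limit_fps)
    return fps / core_clock_ghz


# Hypothetical chip: 60 FPS per GHz of core clock, memory caps it at 90 FPS.
for clock in (1.0, 1.5, 2.0):
    print(clock, perf_per_clock(clock, work_per_clock=60.0, mem_limit_fps=90.0))
```

In this sketch the per-clock figure is constant up to 1.5 GHz and falls beyond it, which is the "starved at default frequencies" situation described above.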

The Willamette vs. Tualatin vs. Thunderbird flashback is strong in this thread.

Headfoot · Jun 13, 2016

NeoLuxembourg said:
Am I missing something? Why should the IPC change with the frequency? It's called "Instructions per clock" for a reason.

Yes, you have bottlenecks and underutilisation problems but that's why IPC is normally calculated over a range of frequencies.

People regularly use IPC to mean theoretical maximum IPC (instructions per cycle), i.e. maximum throughput, while other people simultaneously use it to mean measured / actual throughput.

The way it's being used on this board is extremely sloppy and doesn't lead to productive discussion.

I personally avoid the term altogether and use descriptive language like "average performance per clock per shader", which leaves little room for interpretation.

renderstate · Jun 13, 2016

HW architects never talk about max theoretical IPC (they might talk about a core width instead) and IPC is always about measured IPC.

dacostafilipe · Jun 13, 2016

Headfoot said:
People regularly use IPC to mean theoretical maximum IPC (instructions per cycle), i.e. maximum throughput, while other people simultaneously use it to mean measured / actual throughput.

Yes, I know. I'm not talking about that.

Normally IPC is calculated like this: Take some benchmark (e.g. Cinebench) and run it on your hardware (GPU, CPU, whatever ...). You divide the resulting score by the number of cores and by your frequency.

If you don't have a bottleneck and aren't suffering from underutilisation, the IPC will stay more or less the same across frequencies.
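
The calculation described above can be sketched as follows. The scores and clocks are invented for illustration, and the function name is hypothetical; the point is only that the per-core, per-clock figure stays flat until some bottleneck appears.

```python
def score_per_core_per_clock(score, cores, clock_ghz):
    """Benchmark score divided by core count and by clock (GHz)."""
    if cores <= 0 or clock_ghz <= 0:
        raise ValueError("cores and clock_ghz must be positive")
    return score / (cores * clock_ghz)


# Hypothetical runs of one benchmark on one 4-core chip: (clock in GHz, score).
runs = [(2.0, 800.0), (3.0, 1200.0), (4.0, 1500.0)]
cores = 4

per_clock = [score_per_core_per_clock(score, cores, clock) for clock, score in runs]
print(per_clock)
```

With these made-up numbers the figure holds steady for the first two runs and dips at 4 GHz, which would hint at a bottleneck outside the cores.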

dark zero · Jun 13, 2016

Borealis7 said:
so a good, objective, game-independent, measure of a GPU is purely flops? or flops/W?
maybe the old pixel fill rate or triangles per second?

we can't be reduced to comparing GPUs solely on performance in games because of proprietary software that some games run on which is better/worse on certain GPUs of a certain company.

Mining is one of the best ways to see performance. Another is looking at double precision.

renderstate · Jun 13, 2016

dark zero said:
Mining is one of the best ways to see performance. Another is looking at double precision.

If one is interested in gaming performance the tests you mentioned are completely irrelevant.

Deders · Jun 13, 2016

From my understanding, Instructions Per Clock of a CPU is measured by running different CPUs at the same clock speed.

Of course this goes out of the window when you use programs that can take advantage of new data paths like SIMD, and likewise with a GPU if the workload is unbalanced for the architecture and/or hits VRAM limitations, but still the above definition is what I presumed people have been talking about all these years.

TBH I've not seen any mention of IPC relating to GPUs on the forum yet. Maybe I've just not read the threads.

Borealis7 · Jun 14, 2016

if i remember my Intro to EE course from college correctly, without knowing the specific microcode that runs on each core for each operation, all we can do is guesstimate the average Speed-Up between one architecture and the next. (i'm not an EE major...just CompSci)
the only way to get IPC numbers is from the engineers themselves and no one is going to tell you that.

Headfoot · Jun 14, 2016

NeoLuxembourg said:
Yes, I know. I'm not talking about that.

Normally IPC is calculated like this: Take some benchmark (e.g. Cinebench) and run it on your hardware (GPU, CPU, whatever ...). You divide the resulting score by the number of cores and by your frequency.

If you don't have a bottleneck and aren't suffering from underutilisation, the IPC will stay more or less the same across frequencies.

Except that's not measuring IPC. That's measuring average throughput. Measured IPC in a truer sense isolates out the memory bottlenecks as much as possible and focuses on computational core throughput. Typically you see folks using Dhrystone (e.g. divide Dhrystone MIPS by the number of computational units, then by clock) or the entire Linpack suite for this. But a real IPC measurement is pretty useless because actual real-world performance is affected by all the bottlenecks: software, hardware, PCIe bus, CPU speed, memory on card, cache in the GPU, GPU computational width and instructions, etc.

That's why talking about IPC is just poor terminology for GPUs: everybody seems to think it means something slightly different.

I don't know why people are opposed to just using accurate terminology like "FPS per clock." Maybe using the term IPC makes people feel really smart.

Deders · Jun 14, 2016

Aren't modern GPUs measured in teraflops, just as a guideline before actual benchmarks are made?

dacostafilipe · Jun 16, 2016

Headfoot said:
Except that's not measuring IPC. That's measuring average throughput.

You should tell Intel that then, because that's how they also use "IPC":

"broad workload mixture"

Headfoot said:
I don't know why people are opposed to just using accurate terminology like "FPS per clock." Maybe using the term IPC makes people feel really smart

"FPS per clock." does not give you information about the architecture.

"FPS per clock per core" would result in an IPC-like result, but because FPS are integer values, it would not be accurate enough!

William Gaatjes · Jun 16, 2016

BFG10K said:
Actually it was terribly slow and extremely power-hungry. That's why they completely scrapped it with Conroe and have never looked back.

It was not all bad. Some very good ideas that Intel came up with to make the NetBurst architecture faster (for example the trace cache, to counter the effects of a long pipeline) were also used in later x86 CPUs from Intel (but with a much shorter pipeline). Even today's CPUs have them. It's one of many reasons why Intel CPUs are so fast.

Headfoot · Jun 16, 2016

NeoLuxembourg said:
You should tell Intel that then, because that's how they also use "IPC":

"broad workload mixture"

"FPS per clock." does not give you information about the architecture.

"FPS per clock per core" would result in an IPC-like result, but because FPS are integer values, it would not be accurate enough!

CPUs are not GPUs. It is not a 1 to 1 comparison.

FPS per clock doesn't give you information about the architecture by design... You're not getting it. You are drawing conclusions you cannot draw. You DON'T know the IPC of a shader core or of any other individual compute unit in the GPU. You CAN'T know it from the data you are claiming shows IPC. GPUs are BY THEIR NATURE highly parallel. Occupancy is as important or more important than per-shader speed. FPS per clock doesn't pretend to know things it doesn't know.

When you measure "IPC", how do you know if it's due to a faster shader core or if it's due to better scheduling and occupancy? Answer: You don't.

Notice Intel is talking about their CPU core (a single core). Not greater multithreading across all present cores (= higher occupancy). Not increased bandwidth. Not any other bottleneck. Your measurement that you think is IPC includes occupancy bottlenecks, fixed function logic bottlenecks, memory bottlenecks and more. You are NOT measuring IPC, period. Intel doesn't compute their IPC looking at how well they can fill every thread on every core in their architecture, that's obviously not IPC.

Repeat after me: Higher Occupancy is Not Higher IPC. Higher Occupancy is Higher Throughput. Higher Concurrency is Higher Throughput.

Let's put this another way to make it extra clear:

Imagine the GPU is a server farm made up of 128 individual CPU cores. If you test that server farm with a program that measures how fast 1 core in 1 CPU can go that doesn't require hitting memory or storage, you have measured the IPC of that 1 core in 1 CPU. If you put a highly scalable load that scales to 128 cores against that server, you are not measuring IPC. You are measuring total throughput. Read the recent Xeon D vs Cavium ThunderX review to understand the idea. IPC factors in. But equally or more so does: thread scheduling, how parallel the workload is, how much bandwidth you need, etc. A higher throughput machine can beat a higher IPC machine, which is obvious.

If you used an old 4P server with very poor inter-socket communication, you would NOT say that each core has lower IPC versus a single socket of the same CPU just because it has lower total throughput per core (total performance divided by # of cores). The CPUs are the same. They have the same IPC. Dividing performance by core count did not reveal the IPC. It revealed total throughput (specifically, it would help measure scaling inefficiency as socket count goes up). When you use a Xeon Phi, which has much better intercore communication performance compared to a similar number of Xeon CPUs, you don't claim the Xeon Phi has better IPC. There is more to throughput and performance than IPC. And in a workload that is not dependent on low latency and intercore communication, the non-Phi Xeon array would likely go faster due to its higher IPC. But it also could be slower in the right workload. The modified Silvermont core's IPC didn't change between workloads. Other bottlenecks arose and reduced TOTAL THROUGHPUT.

Notice in that review that certain CPUs even those having MORE IPC actually do worse because they have other bottlenecks like insufficient RAM or bandwidth. This is like a GPU, where consistently filling each queue with work (CONCURRENCY) without running into bottlenecks is as important as ensuring each single queue can execute the work quickly (IPC).

To bring this to the topic of the thread: We don't know Pascal's IPC despite how fervently people try to act like they do. Each core likely does exactly the same amount of work. The structural changes from Maxwell to Pascal likely result in increased occupancy so that each unit has work more often. I'm just speculating, and so is everyone else. The data only shows how many FPS can be produced with every clock. Or if you divide FPS by shader count, then by clock, you have FPS per shader per cycle. Without better testing or better data we will never know if it is because each individual computational unit is faster, or if it is because they've increased occupancy, or if it's that they've decreased latencies in the chip fabric, or any of the other intricacies involved in getting thousands of compute units to work together. If you want to call FPS per shader per cycle "IPC" no one can stop you, but it's not going to be correct.

By your definition: Run 2 cards in SLI with 80% scaling. Divide total FPS by card count and by shader count. Oh my god, my IPC goes down in SLI.
By correct definitions: Run 2 cards in SLI with 80% scaling. Divide total FPS by card count and by shader count. Wow, looks like my total throughput per card and per shader decreased in SLI. Since I know IPC doesn't change based on scaling inefficiencies, I know my bottleneck is from SLI.
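
The SLI arithmetic above can be sketched in a few lines. Here "80% scaling" is assumed to mean the second card adds 80% on top of a single card (1.8x total); the FPS, shader count and clock are invented placeholder values, not data for any real card.

```python
def fps_per_shader_per_clock(total_fps, cards, shaders_per_card, clock_mhz):
    """Total FPS divided by card count, shader count and clock (MHz)."""
    return total_fps / (cards * shaders_per_card * clock_mhz)


# Hypothetical card: 2000 shaders at 1500 MHz, 60 FPS on its own.
single_fps = 60.0
sli_fps = single_fps * 1.8  # assumption: the second card adds 80%

single = fps_per_shader_per_clock(single_fps, 1, 2000, 1500)
sli = fps_per_shader_per_clock(sli_fps, 2, 2000, 1500)

# The hardware is identical, yet the per-shader, per-clock figure drops.
ratio = sli / single
print(f"per-shader per-clock figure in SLI: {ratio:.0%} of a single card")
```

The shaders did not change between the two runs, so the 10% drop in this sketch reflects scaling overhead rather than anything about the individual compute units.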

There is no avoiding that work division and scheduling are major determinants of performance in a parallel computation, and that IPC is separate from that.

renderstate · Jun 16, 2016

NeoLuxembourg said:
You should tell Intel that then, because that's how they also use "IPC":

"broad workload mixture"

"FPS per clock." does not give you information about the architecture.

"FPS per clock per core" would result in an IPC-like result, but because FPS are integer values, it would not be accurate enough!

Using FPS to compare perf is bad as it's not a linear scale. Replace it with milliseconds.
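
A short sketch of why frame time is the friendlier unit: the same 10 FPS gain corresponds to very different amounts of time saved per frame depending on where you start. The FPS values are arbitrary examples.

```python
def frame_time_ms(fps):
    """Convert frames per second to milliseconds per frame."""
    if fps <= 0:
        raise ValueError("fps must be positive")
    return 1000.0 / fps


# Both cases are "+10 FPS", but the time saved per frame differs by over 10x.
saved_low = frame_time_ms(30) - frame_time_ms(40)
saved_high = frame_time_ms(120) - frame_time_ms(130)

print(f"30 -> 40 FPS saves {saved_low:.2f} ms per frame")
print(f"120 -> 130 FPS saves {saved_high:.2f} ms per frame")
```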

renderstate · Jun 16, 2016

Headfoot said:
CPUs are not GPUs. It is not a 1 to 1 comparison.

FPS per clock doesn't give you information about the architecture by design... You're not getting it. You are drawing conclusions you cannot draw. You DON'T know the IPC of a shader core or of any other individual compute unit in the GPU. You CAN'T know it from the data you are claiming shows IPC. GPUs are BY THEIR NATURE highly parallel. Occupancy is as important or more important than per-shader speed. FPS per clock doesn't pretend to know things it doesn't know.

When you measure "IPC", how do you know if it's due to a faster shader core or if it's due to better scheduling and occupancy? Answer: You don't.

Notice Intel is talking about their CPU core (a single core). Not greater multithreading across all present cores (= higher occupancy). Not increased bandwidth. Not any other bottleneck. Your measurement that you think is IPC includes occupancy bottlenecks, fixed function logic bottlenecks, memory bottlenecks and more. You are NOT measuring IPC, period. Intel doesn't compute their IPC looking at how well they can fill every thread on every core in their architecture, that's obviously not IPC.

Repeat after me: Higher Occupancy is Not Higher IPC. Higher Occupancy is Higher Throughput. Higher Concurrency is Higher Throughput.

Let's put this another way to make it extra clear:

Imagine the GPU is a server farm made up of 128 individual CPU cores. If you test that server farm with a program that measures how fast 1 core in 1 CPU can go that doesn't require hitting memory or storage, you have measured the IPC of that 1 core in 1 CPU. If you put a highly scalable load that scales to 128 cores against that server, you are not measuring IPC. You are measuring total throughput. Read the recent Xeon D vs Cavium ThunderX review to understand the idea. IPC factors in. But equally or more so does: thread scheduling, how parallel the workload is, how much bandwidth you need, etc. A higher throughput machine can beat a higher IPC machine, which is obvious.

If you used an old 4P server with very poor inter-socket communication, you would NOT say that each core has lower IPC versus a single socket of the same CPU just because it has lower total throughput per core (total performance divided by # of cores). The CPUs are the same. They have the same IPC. Dividing performance by core count did not reveal the IPC. It revealed total throughput (specifically, it would help measure scaling inefficiency as socket count goes up). When you use a Xeon Phi, which has much better intercore communication performance compared to a similar number of Xeon CPUs, you don't claim the Xeon Phi has better IPC. There is more to throughput and performance than IPC. And in a workload that is not dependent on low latency and intercore communication, the non-Phi Xeon array would likely go faster due to its higher IPC. But it also could be slower in the right workload. The modified Silvermont core's IPC didn't change between workloads. Other bottlenecks arose and reduced TOTAL THROUGHPUT.

Notice in that review that certain CPUs even those having MORE IPC actually do worse because they have other bottlenecks like insufficient RAM or bandwidth. This is like a GPU, where consistently filling each queue with work (CONCURRENCY) without running into bottlenecks is as important as ensuring each single queue can execute the work quickly (IPC).

To bring this to the topic of the thread: We don't know Pascal's IPC despite how fervently people try to act like they do. Each core likely does exactly the same amount of work. The structural changes from Maxwell to Pascal likely result in increased occupancy so that each unit has work more often. I'm just speculating, and so is everyone else. The data only shows how many FPS can be produced with every clock. Or if you divide FPS by shader count, then by clock, you have FPS per shader per cycle. Without better testing or better data we will never know if it is because each individual computational unit is faster, or if it is because they've increased occupancy, or if it's that they've decreased latencies in the chip fabric, or any of the other intricacies involved in getting thousands of compute units to work together. If you want to call FPS per shader per cycle "IPC" no one can stop you, but it's not going to be correct.

By your definition: Run 2 cards in SLI with 80% scaling. Divide total FPS by card count and by shader count. Oh my god, my IPC goes down in SLI.
By correct definitions: Run 2 cards in SLI with 80% scaling. Divide total FPS by card count and by shader count. Wow, looks like my total throughput per card and per shader decreased in SLI. Since I know IPC doesn't change based on scaling inefficiencies, I know my bottleneck is from SLI.

There is no avoiding that work division and scheduling are major determinants of performance in a parallel computation, and that IPC is separate from that.

Great post!

dacostafilipe · Jun 16, 2016

Headfoot said:
...

I will stick to how the Intel and the others use IPC, thanks. 😉

As for "GPU IPC", I still think that applying the same technique to GPUs is possible. I agree with you that it's more complicated as we can't isolate the cores, but it still produces useful information that can be used to approximate possible IPC differences.

Feel free to disagree.

PS: How many times did you edit your post?

kraatus77 · Jun 16, 2016

just use perf/tflops. it tells you about aggregate architectural improvements.

i know its not a 100% correct way of measuring it but it's the best and easy way.

and for performance, just test 15-20 games of different engines.
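
A rough sketch of the perf/TFLOPS idea: peak FP32 throughput is conventionally quoted as 2 FLOPs (one fused multiply-add) per shader per clock, and the measured result is divided by that. The shader count, clock and FPS below are placeholder values, and the function names are made up for this example.

```python
def peak_fp32_tflops(shaders, clock_ghz):
    """Theoretical peak: 2 FLOPs (one FMA) per shader per clock."""
    return 2 * shaders * clock_ghz / 1000.0


def fps_per_tflop(avg_fps, shaders, clock_ghz):
    """Measured average FPS divided by theoretical peak TFLOPS."""
    return avg_fps / peak_fp32_tflops(shaders, clock_ghz)


# Hypothetical card: 2000 shaders at 1.5 GHz, averaging 90 FPS over a game suite.
print(peak_fp32_tflops(2000, 1.5))
print(fps_per_tflop(90.0, 2000, 1.5))
```

As noted above, this is an aggregate figure: it folds scheduling, bandwidth and everything else into one number and says nothing about which part of the chip improved.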

Headfoot · Jun 16, 2016

NeoLuxembourg said:
I will stick to how the Intel and the others use IPC, thanks. 😉

As for "GPU IPC", I still think that applying the same technique to GPUs is possible. I agree with you that it's more complicated as we can't isolate the cores, but it still produces useful information that can be used to approximate possible IPC differences.

Feel free to disagree.

PS: How many times did you edit your post?

LOL IM GONNA USE INTELS DEFINITION LOLOLOL. I literally already showed how it is inapplicable. You're just being contrarian and stubborn now. Your disagreement is based in zero logic and zero fact.

You're totally right bro, because you said so.

PS: nobody is disagreeing the measurement is useful. The issue is what people call the measurement.

dacostafilipe · Jun 16, 2016

Headfoot said:
LOL IM GONNA USE INTELS DEFINITION LOLOLOL.

How old are you? :thumbsdown:

Headfoot said:
I literally already showed how it is inapplicable. You're just being contrarian and stubborn now. Your disagreement is based in zero logic and zero fact.
.

It's complete nonsense what you wrote.

There's no "one IPC" in a CPU (or GPU).

Because of how CPUs/GPUs are built, you have multiple IPC values depending on what part of the hardware you use. The only way to extract a "global/average/whatever IPC" to compare with other CPUs/GPUs (I'll remind you that this thread is about comparing Maxwell with Pascal) is to use software to measure it. You should use multiple types of software to test multiple parts of the hardware (decoder, micro-op cache, branching, ALU, int, ...) and create an average.

If you want to test a CPU's IPC you will certainly not try it on a multi-rack farm. Same goes for your SLI nonsense. It's just silly taking extremes to prove your point!
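
One way to "create an average" across several tests is a geometric mean of per-clock ratios between two chips. The post does not say which kind of average is meant, so the geometric mean is a swapped-in choice here (it is a common one for ratios), and the test names and numbers are invented.

```python
import math


def geometric_mean(ratios):
    """Geometric mean of positive ratios, computed via logarithms."""
    if not ratios or any(r <= 0 for r in ratios):
        raise ValueError("ratios must be a non-empty list of positive numbers")
    return math.exp(sum(math.log(r) for r in ratios) / len(ratios))


# Hypothetical new-vs-old per-clock ratios, one per micro-benchmark.
per_test_ratio = {"alu": 1.10, "branching": 1.00, "texture": 0.95, "memory": 1.05}

overall = geometric_mean(list(per_test_ratio.values()))
print(f"overall per-clock ratio: {overall:.3f}")
```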

Headfoot · Jun 16, 2016

NeoLuxembourg said:
How old are you? :thumbsdown:

It's complete nonsense what you wrote.

There's no "one IPC" in a CPU (or GPU).

Because of how CPUs/GPUs are built, you have multiple IPC values depending on what part of the hardware you use. The only way to extract a "global/average/whatever IPC" to compare with other CPUs/GPUs (I'll remind you that this thread is about comparing Maxwell with Pascal) is to use software to measure it. You should use multiple types of software to test multiple parts of the hardware (decoder, micro-op cache, branching, ALU, int, ...) and create an average.

If you want to test a CPU's IPC you will certainly not try it on a multi-rack farm. Same goes for your SLI nonsense. It's just silly taking extremes to prove your point!

No.

Your definition is wrong and my "silly extremes" prove it. IPC is distinct from throughput, end of story. If you can't understand the difference you have no business defining the word IPC.

dark zero · Jul 21, 2016

I need to revive this thread but...

Silverforce11 said:
Here's leaked benches for the Nitro+ 480.

Some in the press said reviews are coming in a few days.

Source: http://forums.anandtech.com/showthread.php?t=2478626&page=52

So... if higher clock speeds are supposed to determine performance, what is the factor that lets an AMD card with lower clocks tie with a higher-clocked Nvidia card?

I really doubt it's the drivers.

jhu · Jul 21, 2016

NeoLuxembourg said:
Because of how CPUs/GPUs are built, you have multiple IPC values depending on what part of the hardware you use. The only way to extract a "global/average/whatever IPC" to compare with other CPUs/GPUs (I'll remind you that this thread is about comparing Maxwell with Pascal)

Strictly speaking, I don't think it's really possible to directly compare IPC between Maxwell and Pascal because they may not be executing the same instructions (maybe they do, I can't tell). As far as I know, unlike AMD with GCN or Intel with HD Graphics, Nvidia doesn't publish its GPUs' ISA or instruction encoding.

sxr7171 · Jul 21, 2016

Remember the P4 that hit a brick wall and could never be released at 4GHz? That will always end up happening if you rely too much on clock speed.

sirmo · Jul 21, 2016

dark zero said:
I need to revive this thread but...

Source: http://forums.anandtech.com/showthread.php?t=2478626&page=52

So... if higher clock speeds are supposed to determine performance, what is the factor that lets an AMD card with lower clocks tie with a higher-clocked Nvidia card?

I really doubt it's the drivers.

More stream processors, and more densely packed too. AMD uses density for more stream processors and space for other features like the command processor and ACEs.

Nvidia seems to be happy trading space for higher clocks, since their architecture is more streamlined for graphics workloads and has fewer compute features.

tajoh111 · Jul 21, 2016

sirmo said:
More stream processors, and more densely packed too. AMD uses density for more stream processors and space for other features like the command processor and ACEs.

Nvidia seems to be happy trading space for higher clocks, since their architecture is more streamlined for graphics workloads and has fewer compute features.

That's mostly a myth. The transistor densities of the RX 480 and the 1080 are almost the same. Clock speed isn't really determined by transistor density once heat is controlled.

The 1080's cores are just designed to go faster because they are larger and have a longer pipeline. The fact that the 1080 has the potential to reach the same overall clocks as the 1060, although the latter has a lower transistor density, is proof of this.

Nvidia's transistor count is very similar to AMD's at this point, so the gap should be closing, but it is getting larger and larger.

What we have is that Nvidia has been focusing on improving the cores of their architecture, while AMD has been adding supplementary components while keeping the cores the same. When AMD changes the cores, they will have a new architecture. Until then it will be GCN.
