Just how long are we going to keep playing the IPC game ?

ThatBuzzkiller · Nov 17, 2014

NTMBK said:
It is a fundamentally different concept. Improving IPC improves performance on existing code, which new ISA extensions do not.

Higher IPC does not necessarily improve performance on existing code either ...

Increasing core counts also mean that your raising IPC since the said processor can now execute more instructions per cycle but does that mean we should start designing our CPUs to be more like GPUs ?

You can argue the technicalities all you want but are you going to address the underlying issue of excessively increasing the amount of "operations per cycle" will eventually yield no gains on a lot of conventional applications ?

Idontcare · Nov 17, 2014

NostaSeronx said:
Efficient large IPS(instructions per second) is what consumers want.

80,000 at arb. 50 watt.

Is better than
100,000 at arb. 100 watt
or
60,000 at arb. 40 watt.

This. IPS/Watt.

http://www.intel.com/pressroom/kits/core2duo/pdf/epi-trends-final2.pdf

III-V · Nov 17, 2014

TuxDave said:
That begs the question... should we just let it go or keep educating people...

...

Do you have any commentary you could give for this thread? I think your insight would be appreciated, seeing as you've worked on these things, from what I've been told.

I suppose I'll go hold down the fort, while I await your response.

serpretetsky · Nov 17, 2014

IPC is a very generic term that depends on the sample code or particular instruction you are executing. Intel might give you a general IPC for their entire processor, is this what you are referring to? I would say it's worthless to attempt to analyze this number because the performance of a processor is far more complex than a single undefined IPC can give you.

A blanket statement like " increasing the IPC after that will no longer provide any meaningful performance gains." really cannot just be applied without clearly defining the IPC of what and how was it measured.

If I improve the IPC of every individual instruction on a processor by 30% I can guarantee you that your code will run 30% faster. However, if you tell me the IPC of a processor has increased by 30%, but you fail to tell me that you are measuring with an assumption that coders/compilers will be using a particular instruction 80% of the time and that particular instruction was the one that obtained all of the IPC improvement, well now, those are two completely different things. It's always relative to a particular test condition.

ThatBuzzkiller · Nov 17, 2014

III-V said:
Well, that's basically what the term IPC has devolved into. It's now a term for per-clock performance. I don't think there's any going back, given its widespread misuse.

I think everyone's set you straight on the IPC thing, so I'll point out that 14nm desktop processors should see a return in clock scaling.

I want to believe but at the same time I don't want to come back crashing into reality ...

sefsefsefsef · Nov 18, 2014

Like others have said, Instructions Per Cycle has a specific technical meaning. Please do not misuse it.

It is not possible to have an "IPC myth." IPC directly relates to performance, as long as clock speeds don't drop. The problem with the "MHz myth" is that IPC actually goes down as clock speed increases for most interesting applications. This is in large part because a cache miss becomes relatively more punishing at higher frequencies. Also, reaching very high clock speeds requires using longer pipelines, which are punished more by branch mis-predicts.

Also, there is no single "IPC" for a processor. It's not a universal thing. Running SPEC CPU 2006 on a processor will yield IPCs in the range of ~0.15 to ~2.2+ (off the top of my head) on a modern x86 CPU. IPC is a function of how a program interacts with a given CPU design. This means the same design will work better for some applications than for others. Some applications require very large caches, some don't. Some benefit from prefetching, some don't.

A higher IPC translating to higher performance is starting to sound more and more like a charade as time goes on.

I'm still really struggling to understand what you might mean by this. Are you really just talking about the execution width and instruction window size of a CPU? If so, you are absolutely right. Those things have already reached extremely harsh diminishing returns. I can't imagine what else you could be talking about.

ThatBuzzkiller · Nov 18, 2014

serpretetsky said:
IPC is a very generic term that depends on the sample code or particular instruction you are executing. Intel might give you a general IPC for their entire processor, is this what you are referring to? I would say it's worthless to attempt to analyze this number because the performance of a processor is far more complex than a single undefined IPC can give you.

A blanket statement like " increasing the IPC after that will no longer provide any meaningful performance gains." really cannot just be applied without clearly defining the IPC of what and how was it measured.

If I improve the IPC of every individual instruction on a processor by 30% I can guarantee you that your code will run 30% faster. However, if you tell me the IPC of a processor has increased by 30%, but you fail to tell me that you are measuring with an assumption that coders/compilers will be using a particular instruction 80% of the time and that particular instruction was the one that obtained all of the IPC improvement, well now, those are two completely different things. It's always relative to a particular test condition.

I should probably clarify but I do mean general IPC for the most part ...

There will come a time where increasing IPC will do nothing for most applications and those that do benefit massively are likely to find better performance out of GPUs since the said program likely has a high amount of DLP.

Increasing IPC does NOT defeat Amdahl's Law! A program that is 100% sequential does does not benefit from more cores, wider execution units, and even pipelining! What dictates execution times in that case are the write result latencies. The only two aspects that can solve the problem is a shorter pipeline and higher clocks. A shorter pipeline means that it requires less cycles to traverse the end of a pipeline which translates to smaller write result latencies. Higher clocks will result in smaller cycle times which also decreases the write result latencies.

Often, an application that performs best on a CPU is likely to be limited by the write result latencies.

Can you imagine a world where 1 cycle pipelines and 10 GHz CPUs were possible ?

Single threaded workloads would perform astronomically faster than they would today!

ThatBuzzkiller · Nov 18, 2014

sefsefsefsef said:
I'm still really struggling to understand what you might mean by this. Are you really just talking about the execution width and instruction window size of a CPU? If so, you are absolutely right. Those things have already reached extremely harsh diminishing returns. I can't imagine what else you could be talking about.

Yes and go check the post below yours to see what I'm elaborating on ...

jhu · Nov 18, 2014

ThatBuzzkiller said:
Can you imagine a world where 1 cycle pipelines and 10 GHz CPUs were possible ?

Single threaded workloads would perform astronomically faster than they would today!

Actually, no. Still limited by how fast memory is. Try turning off the CPU caches: computer responsiveness turns to molasses.

ThatBuzzkiller · Nov 18, 2014

jhu said:
Actually, no. Still limited by how fast memory is. Try turning off the CPU caches: computer responsiveness turns to molasses.

I'm aware of the memory wall but if you so want to discuss that specifically then make a separate thread about the subject ...

sefsefsefsef · Nov 18, 2014

What is DLP?

"Write result latency" isn't really a thing, because of the register bypass network. Wait, are you talking about memory writes, or register writes? Memory writes have store queue->load bypass. Both are already solved problems.

You really gotta stop using the term "IPC" incorrectly. It's really super hard to tell what you are trying to talk about, but it's clear you aren't talking about IPC.

SAAA · Nov 18, 2014

ThatBuzzkiller said:
I should probably clarify but I do mean general IPC for the most part ...

There will come a time where increasing IPC will do nothing for most applications and those that do benefit massively are likely to find better performance out of GPUs since the said program likely has a high amount of DLP.

Increasing IPC does NOT defeat Amdahl's Law! A program that is 100% sequential does does not benefit from more cores, wider execution units, and even pipelining! What dictates execution times in that case are the write result latencies. The only two aspects that can solve the problem is a shorter pipeline and higher clocks. A shorter pipeline means that it requires less cycles to traverse the end of a pipeline which translates to smaller write result latencies. Higher clocks will result in smaller cycle times which also decreases the write result latencies.

Often, an application that performs best on a CPU is likely to be limited by the write result latencies.

Can you imagine a world where 1 cycle pipelines and 10 GHz CPUs were possible ?

Single threaded workloads would perform astronomically faster than they would today!

This is absolutely not Always true, by definition improving IPC (in a certain operation/algorithm) increases performance at the same clocks.

So while it's true that some operations are limited by how fast you can write back the results it's not sure that you can avoid this and improve much more.

Example: in a code that is 100% single threaded, say because each step requires the result of the previous iteration, you can obtain a speed up of x times just by using more transistors and placing an ALU after another for x times, like in series instead of parallel (similar to GPUs).

Now at the same clocks your algorithm is x times faster, because you don't even need to write back to memory every time! You just compute x times in a single clock the same operation, then if you really need to you can write back in the next clock all or the final result.

Yeah I know this is a stupid example but for many particular applications it can work very well: imagine doing this thing for computing pi, a decisively single threaded task.

Nothingness · Nov 18, 2014

SAAA said:
Yeah I know this is a stupid example but for many particular applications it can work very well: imagine doing this thing for computing pi, a decisively single threaded task.

Computing pi digits isn't really single threaded. The multiplication routine used for this computation can be made parallel.

See this for instance: http://www.numberworld.org/y-cruncher/

SAAA · Nov 18, 2014

Nothingness said:
Computing pi digits isn't really single threaded. The multiplication routine used for this computation can be made parallel.

See this for instance: http://www.numberworld.org/y-cruncher/

Oops nevermind, let's pick any other single threaded task then.
BTW 13.3 trillion digits!!?? Wow that's what you call a stress test, it took more than 200 days!

witeken · Nov 18, 2014

Idontcare said:
This. IPS/Watt.

http://www.intel.com/pressroom/kits/core2duo/pdf/epi-trends-final2.pdf

Also known as performance per watt.

Thala · Nov 18, 2014

If I improve the IPC of every individual instruction on a processor by 30% I can guarantee you that your code will run 30% faster.

This is technical nonsense. IPC is not the property of one instruction, but rather the property of the pipeline, where the instruction getting issued.
That's like saying in order to improve speed (km/h) you are going to improve the km/h for each kilometer.

Cerb · Nov 18, 2014

ThatBuzzkiller said:
There will come a time where increasing IPC will do nothing for most applications and those that do benefit massively are likely to find better performance out of GPUs since the said program likely has a high amount of DLP.

When and how?

Increasing IPC means increasing performance per clock cycle, while running the same instructions. For a program not to benefit, the clock speed must have to reduce by the amount of the IPC increase, in the final CPU.

Also, DLP and ILP are not necessarily interchangeable. For any program for which CPU performance is at all interesting, they will not be. Most programs that benefit massively from work to improve IPC are not those programs. For example, compilation has improved at a rate greater than average, across newer Intel generations, and is primarily statistics, decision trees, and lookup tables, none of which are traditionally capable of high-IPC, usually have low ILP, but latent high DLP. But, to exploit that potential DLP would require far faster RAM than we have, today, and many CPUs each with far more performance than GPUs. GPU-like systems would be a poor fit. Same goes for much database work, where the storage and memory bandwidth is a major limiting factor. MySQL and Postgres generically try to keep chunks worked on reasonable, while I know in the past both Oracle and MS have optimized data structures to just fit into common Intel CPU L2 caches (that said, lots of basic querying over large data sets could make good use of GPU-like processors, though AVX2 might end up being a good enough in-between feature set).

Increasing IPC does NOT defeat Amdahl's Law!

Increasing IPC does not change anything with regards to Amdahl's Law whatsoever.

A program that is 100% sequential does does not benefit from A more cores, B wider execution units, and even C pipelining!

A true, B false, outside of possibly synthetic programs with loads that depend on normal ALU results, and C will vary (again, pipelining is a supporting feature, allowing other features to do their thing well, not a direct way to improve performance).

What dictates execution times in that case are the write result latencies.

Ah, but in a purely sequential program, those latencies are basically zero, in a high-performance CPU. Writes hit the register, and then the program continues. Read latency and execution time then dominate. Data does not need to be flushed back to memory before a read instruction from the same thread can load it again, usually, too.

Often, an application that performs best on a CPU is likely to be limited by the write result latencies.

If that is the case, you are dealing with simple programs, like array arithmetic, but that have large working sets. These are not programs for which scalar IPC has ever really mattered, outside of small HPC niches. That, or database-like programs with large working sets that must share resources between CPUs.

Can you imagine a world where 1 cycle pipelines and 10 GHz CPUs were possible ?

Single threaded workloads would perform astronomically faster than they would today!

Only if we also had 0.1ns DRAM latency.

Cerb · Nov 18, 2014

ThatBuzzkiller said:
I'm aware of the memory wall but if you so want to discuss that specifically then make a separate thread about the subject ...

The memory performance, from DDRx down to L1, is an important part of what makes up IPC; why should it get a separate thread?

AtenRa · Nov 18, 2014

Well most people here and on the web consider IPC as single thread performance.

But if you want to see the actual IPC of processors its like this,

L = Latency, T= Throughput

Edit:

And this is Core 2 Duo E6400 IPC running SPEC2000 and 2006

Nothingness · Nov 18, 2014

AtenRa, it's considered good manner to cite your sources

The first one is from Torbjörn Granlund: Instruction latencies and throughput for AMD and Intel x86 processors.

The second is from Tribuvan Kumar Prakash and Lu Peng: Performance Characterization of SPEC CPU2006 Benchmarks on Intel Core 2 Duo Processor

AtenRa · Nov 18, 2014

Nothingness said:
AtenRa, it's considered good manner to cite your sources

The first one is from Torbjörn Granlund: Instruction latencies and throughput for AMD and Intel x86 processors.

The second is from Tribuvan Kumar Prakash and Lu Peng: Performance Characterization of SPEC CPU2006 Benchmarks on Intel Core 2 Duo Processor

Ahh nice thx, i had the pics but i couldnt find the urls

serpretetsky · Nov 18, 2014

Thala said:
This is technical nonsense. IPC is not the property of one instruction, but rather the property of the pipeline, where the instruction getting issued.
That's like saying in order to improve speed (km/h) you are going to improve the km/h for each kilometer.

Perhaps my comment was misleading in that you might try to measure the IPC of a single lone instruction all by itself without the context of what's happening in the rest of the pipeline, and you are right, such an attempt is pretty meaningless.

But otherwise IPC can measure ANY WORKLOAD YOU GIVE THE PROCESSOR. If you want to measure the IPC of an instruction, feed the processor NOTHING but that instruction over and over again. You will now have an IPC value of that single instruction. Is that useful? I don't know, Tell me the context and I can tell you if that number is useful.

A shorter pipeline means that it requires less cycles to traverse the end of a pipeline which translates to smaller write result latencies. Higher clocks will result in smaller cycle times which also decreases the write result latencies.

I'm gonna get pretty basic here.

Computer architecture usually teaches that a basic processor has 5 stages (this a simplified model):

Code:

1) Instruction Fetch
2) Instruction Decode
3) Execution
4) Memory access
5) Register Write-back.

Let's assume you built a processor that somehow completed all of these steps in 2 cycles. I'm not fully sure how you would do it, but that's ok.

Here's the hidden peace of information that might not be immediately obvious: Every one of those stages has certain latency. This latency comes about because the electric charge is not able to instantly rush in and fill the wires and transistors. There is a delay because of capacitance, inductance, and plain old resistance.

So if you were somehow able to compress those 5 stages into 2 stages, you still haven't gotten rid of the latencies. So, let's pretend here were the latencies with the 5 stage

Code:

1) Inst Fet   1ns
2) Inst Dec  1ns
3) Exec       1ns
4) Mem       1ns
5 WriteBack 1ns

You still have the same latencies to deal with, but now it's 2 cycles. Here's your new architecuter

Code:

1) Cycle1  2.5ns
2) Cycle2 2.5ns

You can't simply get rid of these latencies without creating some radical new design or using some new technology.

However, let's also assume you saved some latency because you don't need those pesky registers in between stages:

Code:

1) Cycle1 2ns
2) Cycle2 2ns

Alright. Now you think to yourself "Great! I can complete a single instruction in 4ns instead of 5!". I will also point out, at this stage, that your processor actually has to run at a frequency that is 2 TIMES SLOWER than the original architecture to accomodate the new longest delays (1ns -> 2ns). Your dream of reducing pipelining while increasing frequency seems sort of ridiculous.

You have gained the ability to complete a single instruction in 4ns, instead of 5ns. What have you lost? Something very important: you cannot pipeline your 2-cycle machine as deeply.

YOU ARE WASTING RESOURCES. Every sub-instruction has to wait for the next sub-instruction to complete one giant stage, even though, electrically, there are parts of that stage that next sub instruction is not even using anymore.

In both of these pipelined designs, we theoretically should be able to complete 1 instruction every cycle. However, the 5 stage pipeline will suffer more, as you point out, due to pipeline flushing and other shenanigans. So, let's assume the 5-stage pipeline actually has a IPC of 0.8 for some particular workload with various instructions while the 2 stage has an IPC of 1.0 for the same workload

Which do you think is going to be faster?
1) 5-stage processor running at 1Ghz (1ns clock period) with IPC of 0.8
2) 2-stage processor running at 500Mhz (2ns clock period) with IPC of 1.0.

1Ghz * 0.8 inst/cycle = 0.8 Giga instructions/second = 800 Mega instructions/sec
500Mhz * 1.0 inst/cycle = 500 Mega instructoins/second

All i'm saying is that you can't just claim to reduce pipeline depth and get free performance. If that were true every processor manufacturer would simply reduce there pipeline depth. Obviously it's a balancing act, and you have to choose the write pipeline depth.

Another thing i've completely ignored is power constrains, which as everyone remembers from pentium4/prescott days, can make the above analysis even more complex. Typically the power constraints favor lower frequencies, but it still does not allow us to get a simple answer.

ThatBuzzkiller · Nov 19, 2014

sefsefsefsef said:
What is DLP?

"Write result latency" isn't really a thing, because of the register bypass network. Wait, are you talking about memory writes, or register writes? Memory writes have store queue->load bypass. Both are already solved problems.

You really gotta stop using the term "IPC" incorrectly. It's really super hard to tell what you are trying to talk about, but it's clear you aren't talking about IPC.

It stands for data level parallelism ...

There's a limit to operand forwarding ... Just because you can avoid some stalling due to a register write back instruction doesn't mean that you won't stall at all. This idea benefits most when the instruction pipeline is shorter or the execution phase is shortened.

Then there are other times where your forced to convert the floating point elements into integers and this type of action pays full price since the x86's integer as well as floating point pipelines aren't unified along with their set of registers. There are probably some other situations like these that I'm not aware of ...

Like I said before, just don't focus on the semantics ...

ThatBuzzkiller · Nov 19, 2014

SAAA said:
Example: in a code that is 100% single threaded, say because each step requires the result of the previous iteration, you can obtain a speed up of x times just by using more transistors and placing an ALU after another for x times, like in series instead of parallel (similar to GPUs).

Now at the same clocks your algorithm is x times faster, because you don't even need to write back to memory every time! You just compute x times in a single clock the same operation, then if you really need to you can write back in the next clock all or the final result.

I wonder how well that worked out for Itanium ...

ThatBuzzkiller · Nov 19, 2014

Cerb said:
When and how?

Increasing IPC means increasing performance per clock cycle, while running the same instructions. For a program not to benefit, the clock speed must have to reduce by the amount of the IPC increase, in the final CPU.

There's only so much ILP you can extract from a regular and even a lot of the high performance applications that run on a CPU. At times a section of a program may allow the processor to execute 100 operations in parallel where there's not a whole lot of data dependencies to resolve but there are a lot more sections of the program that that do not allow for an execution of more than 10 operations.

If there were massive gains to be had by ballooning up the amount of execution units then Intel would have done it already a looong time ago ...

Cerb said:
Also, DLP and ILP are not necessarily interchangeable. For any program for which CPU performance is at all interesting, they will not be. Most programs that benefit massively from work to improve IPC are not those programs. For example, compilation has improved at a rate greater than average, across newer Intel generations, and is primarily statistics, decision trees, and lookup tables, none of which are traditionally capable of high-IPC, usually have low ILP, but latent high DLP. But, to exploit that potential DLP would require far faster RAM than we have, today, and many CPUs each with far more performance than GPUs. GPU-like systems would be a poor fit. Same goes for much database work, where the storage and memory bandwidth is a major limiting factor. MySQL and Postgres generically try to keep chunks worked on reasonable, while I know in the past both Oracle and MS have optimized data structures to just fit into common Intel CPU L2 caches (that said, lots of basic querying over large data sets could make good use of GPU-like processors, though AVX2 might end up being a good enough in-between feature set).

A high amount of ILP implies a high amount of DLP ...

A program that has high DLP can probably exploit vectorization very easily and split up the tasks individually among the SIMT units on a GPU.

Just how long are we going to keep playing the IPC game ?

Golden Member

Elite Member

Senior member

Senior member

Golden Member

Senior member

Golden Member

Golden Member

Lifer

Golden Member

Senior member

Senior member

Diamond Member

Senior member

Diamond Member

Golden Member

Elite Member

Elite Member

Lifer

Diamond Member

Lifer

Senior member

Golden Member

Golden Member

Golden Member