Just how long are we going to keep playing the IPC game?


ThatBuzzkiller

Golden Member
Nov 14, 2014
1,120
260
136
The memory performance, from DDRx down to L1, is an important part of what makes up IPC; why should it get a separate thread?

Memory performance and ILP are two ENTIRELY different things ...

Although there is a mutual relationship between the two subjects, this thread deals with the latter and its potential issues of reaching a wall or a plateau, so I feel it is best that the discussion of the former be held in its own separate thread ...
 

sefsefsefsef

Senior member
Jun 21, 2007
218
1
71
It stands for data level parallelism ...

There's a limit to operand forwarding ... Just because you can avoid some stalling on a register write-back doesn't mean that you won't stall at all. This idea benefits most when the instruction pipeline is shorter or the execution phase is shortened.

Then there are other times where you're forced to convert floating-point elements into integers, and that kind of operation pays the full price since x86's integer and floating-point pipelines aren't unified, nor are their register sets. There are probably some other situations like these that I'm not aware of ...

Like I said before, just don't focus on the semantics ...

I figured you meant "data level parallelism" when you said "DLP," but the problem is that "data level parallelism" isn't a real thing that computer architects talk about. Are you just talking about data parallelism? If there is data parallelism in an algorithm then that implies it can have instruction level parallelism (if your window is large enough ...), but the reverse is not necessarily true. In practice, with realistic instruction window sizes, data parallelism and instruction level parallelism are only kind of related.

There is no "limit" to operand forwarding. It's a very straightforward technique, and it's basically a solved problem. Absolute worst case is an L1 write and read, but there are many opportunities to avoid even this. Unless you're talking about situations with no temporal locality between data producers and consumers? That's a whole other story, and nothing you've talked about mentions this scenario, so I assume you're ignoring it.

I'm not sure how float<-->int conversion works in x86 CPUs, but I highly doubt the register access is the hard part of that operation. What do you mean by "pay the full price?"
 

sefsefsefsef

Senior member
Jun 21, 2007
218
1
71
Memory performance and ILP are two ENTIRELY different things ...

Although there is a mutual relationship between the two subjects, this thread deals with the latter and its potential issues of reaching a wall or a plateau, so I feel it is best that the discussion of the former be held in its own separate thread ...

It's true that ILP is just an inherent attribute of a program, and is separate from memory performance ... at least in theory. In practice, however, the two are inseparably intertwined.

I recently wrote an out-of-order CPU simulator where you can configure things like execution width and memory latency. If I set the machine to be very wide, say 32-wide, and set its memory latency to be 0 (basically a magical L1 cache), then I'm seeing IPCs of nearly 32 for some applications. The ILP that exists there is HUGE in some programs. BUT, as soon as I introduce even remotely realistic memory latencies (even a magical L1 that always hits, but has a realistic latency of 4, let's say), then the IPC just plummets. Add in multiple levels of cache and a realistic-ish DRAM simulation, and IPC numbers drop down to the low single digits (<1 for many applications). Then what happens to the IPC if I reduce the width of the machine to a more realistic 4 or 6? Pretty much nothing. The memory subsystem is a far more interesting problem than ILP these days.
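If it helps, here's a crude analytical sketch of the same trend (made-up parameters, not output from my simulator): effective IPC is capped by issue width but dragged down by the average memory cycles per instruction, assuming stalls aren't overlapped with useful work.

Code:
# Crude analytical sketch (made-up parameters, not simulator output):
# effective IPC under a simple "issue width vs. average memory stalls" model.

def effective_ipc(width, load_fraction, miss_rate, miss_latency, hit_latency):
    """Assumes memory stalls are not overlapped with useful work (very pessimistic)."""
    avg_mem_cycles = load_fraction * (hit_latency + miss_rate * miss_latency)
    cycles_per_instr = 1.0 / width + avg_mem_cycles
    return 1.0 / cycles_per_instr

scenarios = [("magic 0-cycle L1",            0, 0.00,   0),
             ("realistic 4-cycle L1",        4, 0.00,   0),
             ("4-cycle L1 + 5% DRAM misses", 4, 0.05, 200)]

for width in (32, 4):
    for name, hit, mr, penalty in scenarios:
        ipc = effective_ipc(width, load_fraction=0.3, miss_rate=mr,
                            miss_latency=penalty, hit_latency=hit)
        print(f"width={width:2d}, {name:28s}: IPC ~= {ipc:5.2f}")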
 

ThatBuzzkiller

Golden Member
Nov 14, 2014
1,120
260
136
I'm gonna get pretty basic here.

Computer architecture usually teaches that a basic processor has 5 stages (this is a simplified model):
Code:
1) Instruction Fetch
2) Instruction Decode
3) Execution
4) Memory access
5) Register Write-back.

Let's assume you built a processor that somehow completed all of these steps in 2 cycles. I'm not fully sure how you would do it, but that's ok.

Here's the hidden piece of information that might not be immediately obvious: every one of those stages has a certain latency. This latency comes about because the electric charge is not able to instantly rush in and fill the wires and transistors. There is a delay because of capacitance, inductance, and plain old resistance.

So if you were somehow able to compress those 5 stages into 2 stages, you still haven't gotten rid of the latencies. So, let's pretend these were the latencies with the 5-stage design:

Code:
1) Inst Fetch    1ns
2) Inst Decode   1ns
3) Execution     1ns
4) Mem Access    1ns
5) Write-back    1ns

You still have the same latencies to deal with, but now it's 2 cycles. Here's your new architecture:

Code:
1) Cycle1  2.5ns
2) Cycle2  2.5ns

You can't simply get rid of these latencies without creating some radical new design or using some new technology.

However, let's also assume you saved some latency because you don't need those pesky registers in between stages:

Code:
1) Cycle1 2ns
2) Cycle2 2ns

Alright. Now you think to yourself "Great! I can complete a single instruction in 4ns instead of 5!". I will also point out, at this stage, that your processor actually has to run at a frequency that is 2 TIMES SLOWER than the original architecture to accommodate the new longest delays (1ns -> 2ns). Your dream of reducing pipelining while increasing frequency seems sort of ridiculous.

You have gained the ability to complete a single instruction in 4ns, instead of 5ns. What have you lost? Something very important: you cannot pipeline your 2-cycle machine as deeply.

YOU ARE WASTING RESOURCES. Every instruction has to wait for the one ahead of it to complete an entire giant stage, even though, electrically, there are parts of that stage the instruction ahead isn't even using anymore.

In both of these pipelined designs, we theoretically should be able to complete 1 instruction every cycle. However, the 5-stage pipeline will suffer more, as you point out, due to pipeline flushing and other shenanigans. So, let's assume the 5-stage pipeline actually has an IPC of 0.8 for some particular workload with various instructions, while the 2-stage has an IPC of 1.0 for the same workload.

Which do you think is going to be faster?
1) 5-stage processor running at 1 GHz (1ns clock period) with IPC of 0.8
2) 2-stage processor running at 500 MHz (2ns clock period) with IPC of 1.0.

1 GHz * 0.8 inst/cycle = 0.8 Giga instructions/second = 800 Mega instructions/sec
500 MHz * 1.0 inst/cycle = 500 Mega instructions/sec
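Same arithmetic, written out as a quick sketch (only the numbers from this post are used):

Code:
# Quick sketch of the arithmetic above (numbers from this post only).

def throughput_minst_per_s(stage_latencies_ns, ipc):
    """Clock period is set by the slowest stage; throughput = frequency * IPC."""
    period_ns = max(stage_latencies_ns)
    freq_mhz = 1000.0 / period_ns
    return freq_mhz * ipc

five_stage = [1.0] * 5      # 1ns per stage -> 1 GHz clock
two_stage  = [2.0, 2.0]     # merged stages, 2ns each -> 500 MHz clock

print(throughput_minst_per_s(five_stage, ipc=0.8))  # 800.0 (Mega instructions/sec)
print(throughput_minst_per_s(two_stage,  ipc=1.0))  # 500.0 (Mega instructions/sec)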

All I'm saying is that you can't just claim to reduce pipeline depth and get free performance. If that were true, every processor manufacturer would simply reduce their pipeline depth. Obviously it's a balancing act, and you have to choose the right pipeline depth.

Another thing I've completely ignored is power constraints, which, as everyone remembers from the Pentium 4/Prescott days, can make the above analysis even more complex. Typically the power constraints favor lower frequencies, but that still does not give us a simple answer.

Actually, assuming that clock speeds stayed the same, you would have solved the register write-back latencies with a shorter pipeline.

A stage in a pipeline represents 1 cycle of work, not 2 and a half; otherwise the instruction pipeline isn't truly 2-stage ...

It's no surprise that the first option will likely perform better for today's and likely future workloads, since it's pretty easy to get an IPC rate of 2 and, if some programs are lucky, maybe we'll even reach 3.5. But as time keeps moving, sooner or later you're going to have to decrease register write-back times to overcome the bottleneck in data dependencies ...

It makes no difference if your IPC is 100, 1000, or even infinity when a routine is limited by how fast you can output a result.

This whole IPC thing is headed to a dead end ...
 

ThatBuzzkiller

Golden Member
Nov 14, 2014
1,120
260
136
I figured you meant "data level parallelism" when you said "DLP," but the problem is that "data level parallelism" isn't a real thing that computer architects talk about. Are you just talking about data parallelism? If there is data parallelism in an algorithm then that implies it can have instruction level parallelism (if your window is large enough ...), but the reverse is not necessarily true. In practice, with realistic instruction window sizes, data parallelism and instruction level parallelism are only kind of related.

There is no "limit" to operand forwarding. It's a very straightforward technique, and it's basically a solved problem. Absolute worst case is an L1 write and read, but there are many opportunities to avoid even this. Unless you're talking about situations with no temporal locality between data producers and consumers? That's a whole other story, and nothing you've talked about mentions this scenario, so I assume you're ignoring it.

I'm not sure how float<-->int conversion works in x86 CPUs, but I highly doubt the register access is the hard part of that operation. What do you mean by "pay the full price?"

How do I go about unwinding myself here ...

Data level parallelism can help in that, as long as there are arbitrarily large sets of data, the computations can be efficiently vectorized. Check up on Gustafson's Law and see how it remains a boon for GPUs. In other words, as long as there are always more sets of data to operate on, applying more ILP or TLP doesn't hurt that type of workload. Gustafson's Law is arguably the biggest reason why GPUs were able to get away with ballooning their core counts compared to CPUs.
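A minimal sketch of Gustafson's Law, for reference (the 5% serial fraction is just an illustration): as long as the problem grows with the machine, scaled speedup is nearly linear in the number of units, which is exactly the GPU argument.

Code:
# Minimal sketch of Gustafson's Law (serial fraction chosen purely for illustration):
# scaled speedup S(N) = s + (1 - s) * N, where s is the serial fraction of the
# scaled workload and N is the number of parallel units.

def gustafson_speedup(n_units, serial_fraction):
    return serial_fraction + (1.0 - serial_fraction) * n_units

for n in (8, 64, 2048):  # CPU-ish, many-core-ish, GPU-lane-ish unit counts
    print(f"N={n:5d}: scaled speedup ~= {gustafson_speedup(n, 0.05):.1f}")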

There IS a limit to operand forwarding. As long as you have an execution phase of more than 2 cycles there will still be gains to be had, and sometimes results can only be accessed after the memory-access phase, as in the case of a load instruction. Believe me, there's more than one stage in the execution phase ...

The hardest part is getting the element to transition into the different pipeline. The first thing the FPU does is a truncating conversion; the next thing that happens is that the element gets evicted to the L1 cache and eventually finds itself in the GPRs. What I mean by "pay the full price" is cases like these, which cannot benefit from operand forwarding because they aren't in-pipeline data dependencies that forwarding can fix.
 
Last edited:

Cerb

Elite Member
Aug 26, 2000
17,484
33
86
There's only so much ILP you can extract from a regular application, and even from a lot of the high-performance applications that run on a CPU. At times a section of a program may allow the processor to execute 100 operations in parallel, where there aren't a whole lot of data dependencies to resolve, but there are many more sections of the program that do not allow for the execution of more than 10 operations.
That's not in question. But, there's no magic pixie dust that can be thrown at every problem to make up for lack of ILP, even if MPMD isn't theoretically necessary. Even problems that have high DLP can't all use GPU-like simplicity, at least not and go faster.

If there were massive gains to be had by ballooning up the amount of execution units then Intel would have done it already a looong time ago ...
No, they wouldn't have. Those that did didn't get much performance from doing so. Why? Because those execution units need to stay fed.

A high amount of ILP implies a high amount of DLP ...
High DLP generally means thousands or more. High ILP usually doesn't even mean 10, with impossible window sizes. If the same value ranges, then yes, but usually with more than single-digit ILP comes opportunities for TLP, if not vectorizing. OTOH, low ILP can still have very high DLP.

A program that has high DLP can probably exploit vectorization very easily and split up the tasks individually among the SIMT units on a GPU.
Only a program that can be made to use simple large data structures, and very few branches, can do that, at least in such a way that the result will not be much slower than just doing it normally on the CPU cores. But simple slow CPU cores, like in the Phis, which are one option that can handle branches well, add a lot of latency, which slows down the overall computation. There's just no free lunch to be had (that said, I know of no theoretical reason that branches, and filter->result->next_loop, couldn't be handled much better on GPU-like processors, only technical and economic ones, with the economic reasons dominating).
 
Last edited:

serpretetsky

Senior member
Jan 7, 2012
642
26
101
I want to make sure I have your opinion correct:

you want to :
1) Decrease pipeline depth
2) Stop focusing on improving IPC values (for which benchmarks? just all of them? including data dependency ones?)
3) Start focusing on increasing clockspeed

Is this correct?

This whole IPC thing is headed to a dead end ...

There is no "whole IPC thing". There is a balancing act. You make it sound like processor companies have some conspiracy against everyone or are on some huge advertisement campaign about IPC. Basically it sounds like you think there is a Mega hertz myth but now with IPC instead. I really don't know anyone who thinks IPC is the the wholy grail measurement of all of performance.

Customers don't buy processors because they have high IPC, they buy processors because they perform well. Most end-clients don't even know what IPC is.

edit:
A stage in a pipeline represents 1 cycle of work, not 2 and a half; otherwise the instruction pipeline isn't truly 2-stage ...
ns stands for nanoseconds, not cycles.
 

ThatBuzzkiller

Golden Member
Nov 14, 2014
1,120
260
136
That's not in question. But, there's no magic pixie dust that can be thrown at every problem to make up for lack of ILP, even if MPMD isn't theoretically necessary. Even problems that have high DLP can't all use GPU-like simplicity, at least not and go faster.

Just why are GPUs simple? Their memory subsystem may not be as capable at handling irregular memory access patterns, but with a lot of DLP it won't matter, since such a workload is often more limited by execution resources.

No, they wouldn't have. Those that did didn't get much performance from doing so. Why? Because those execution units need to stay fed.

Sure, bandwidth can become a problem, but a lot of these programs that run on CPUs are not memory intensive.

High DLP generally means thousands or more. High ILP usually doesn't even mean 10, with impossible window sizes. If the same value ranges, then yes, but usually with more than single-digit ILP comes opportunities for TLP, if not vectorizing. OTOH, low ILP can still have very high DLP.

That's seldom or not even the case at all ...

Just because high DLP can allow for a coarser SIMD unit, it does not mean that a microprocessor designer can get away with making the execution unit 16,384 bits wide, because that is very inefficient since not a whole lot of data sets will follow the same execution/program paths. There's a very good reason why Nvidia designed a warp size of 32 and GCN's vector unit is 512 bits wide, respectively. The same goes for Intel adding another AVX-512 unit in Knights Landing instead of making it fatter/wider to 1024 bits.
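To illustrate the width argument with a toy model (the branch probability and lane counts are assumptions for illustration only): if each element independently takes a rare branch, a SIMD group has to run both paths whenever its lanes disagree, and wider groups disagree more often.

Code:
# Toy divergence model (assumed branch probability; lane counts for illustration):
# expected lane utilization when each element takes a rare branch with probability p.
# A SIMD group runs both paths whenever its lanes disagree.

def expected_utilization(lanes, p_branch):
    p_all_agree = p_branch ** lanes + (1.0 - p_branch) ** lanes
    expected_passes = p_all_agree * 1.0 + (1.0 - p_all_agree) * 2.0
    return 1.0 / expected_passes

for lanes in (16, 32, 512):  # a 512-bit SIMD, a 32-wide warp, a 16,384-bit unit
    print(f"{lanes:4d} lanes: utilization ~= {expected_utilization(lanes, 0.01):.2f}")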

Only a program that can be made to use simple large data structures, and very few branches, can do that, at least in such a way that the result will not be much slower than just doing it normally on the CPU cores. But simple slow CPU cores, like in Phis, which are one option, that can handle branches well, add a lot of latency, that slows down the overall computation. There's just no free lunch to be had (that said, I know of no theoretical reason that branches, and filter->result->next_loop couldn't be handled much better on GPU-like processors, only technical and economic ones, with the economic reasons dominating).

What you are spouting is outdated information ...

GPUs of today can very well handle branches. It's just the divergence that's the issue, and both Nvidia and AMD are doing a good job of keeping that under wraps.
 
Last edited:

ThatBuzzkiller

Golden Member
Nov 14, 2014
1,120
260
136
I want to make sure I have your opinion correct:

you want to :
1) Decrease pipeline depth
2) Stop focusing on improving IPC values (for which benchmarks? just all of them? including data dependency ones?)
3) Start focusing on increasing clockspeed

Is this correct?



There is no "whole IPC thing". There is a balancing act. You make it sound like processor companies have some conspiracy against everyone or are on some huge advertisement campaign about IPC. Basically it sounds like you think there is a Mega hertz myth but now with IPC instead. I really don't know anyone who thinks IPC is the the wholy grail measurement of all of performance.

Customers don't buy processors because they have high IPC, they buy processors because they perform well. Most end-clients don't even know what IPC is.

edit:

ns stands for nanoseconds, not cycles.

I don't want the focus on IPC to be completely diminished, but what I want most is more focus on increasing single-threaded performance in other ways, such as decreasing register write-back times; increasing clock speeds would be very nice too.

A lot of Intel advocates have that mindset and to a lesser extent that goes for the industry as well ...

Intel and AMD can't keep staying on a sinking ship such as IPC; they need a strategy that sustains gains indefinitely, such as increasing clock speeds ...
 

ThatBuzzkiller

Golden Member
Nov 14, 2014
1,120
260
136
And? It was about IPC, not ILP. IPC and memory performance are rather intimately related, on most processors, for most workloads.

Doesn't matter ...

Both of those terms are not mutually exclusive. If anything they mean the same thing for a different acronym since they represent the same idea.

This thread deals with the ILP wall, not the memory wall ...
 

NTMBK

Lifer
Nov 14, 2011
10,461
5,845
136
Doesn't matter ...

Both of those terms are not mutually exclusive. If anything they mean the same thing for a different acronym since they represent the same idea.

This thread deals with the ILP wall, not the memory wall ...

This thread has "IPC" in the title. People kind of presume that IPC is the topic of the thread. And memory system performance is a massive factor in actual real world achievable IPC.
 

Cerb

Elite Member
Aug 26, 2000
17,484
33
86
Doesn't matter ...

Both of those terms are not mutually exclusive. If anything they mean the same thing for a different acronym since they represent the same idea.

This thread deals with the ILP wall, not the memory wall ...
BS. They do not represent the same thing at all. If you run your memory at 800MHz, say, and get 10% less performance than at 1600MHz, while not changing any other speeds (and, let's say, to cover corner cases, have the CPU speed fixed), then IPC was lowered by 10% due to the memory speed change. ILP will not have been changed at all, being a property of the code itself, regardless of the machine it runs on.
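Worked out with made-up absolute numbers (the instruction count and core clock are purely illustrative; only the ratios matter):

Code:
# The example above with made-up absolute numbers (only the ratios matter):
# same code (same ILP), same instruction count, same core clock -- only the
# memory clock changes.

instructions  = 1_000_000_000
core_clock_hz = 3_000_000_000          # fixed CPU clock, value illustrative

runtime_1600_s = 1.00                  # baseline runtime with 1600MHz memory
runtime_800_s  = runtime_1600_s / 0.90 # 10% less performance with 800MHz memory

for label, runtime in (("1600MHz memory", runtime_1600_s),
                       ("800MHz memory ", runtime_800_s)):
    cycles = core_clock_hz * runtime
    print(f"{label}: IPC = {instructions / cycles:.3f}")

# IPC drops by exactly the 10% performance loss; the code's ILP never changed.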
 

Cerb

Elite Member
Aug 26, 2000
17,484
33
86
Just why are GPU's simple ? Their memory subsystem maybe not as capable at handling irregular memory access patterns but with a lot of DLP, it won't matter since the said workload is often more limited by execution resources.
They lack those very memory-related execution resources, as well. Branching on GPUs means a fixed-length no-op, due to having to operate the same way across all data. Branching that involves addressing gives that plus more latency.

Sure bandwidth can become a problem but a lot of these programs that run on CPUs are not memory intensive.
Were that the case, why wouldn't we be able to get near 1 IPC in server loads on high-end chips? They aren't usually bandwidth intensive, but bandwidth is not all there is to memory.

That's seldom or not even the case at all ...
Yet, every single mainstream processor designer, for decades, has worked on SIMD (even if half-assed, like Intel until recently), and many embedded processors have VLIW and VLIW-like units on them. Likewise, it's so uncommon that we have 4-8 core CPUs for our desktops, GPGPUs with hundreds or more "cores," and Intel's even making ~50-core x86 cards. If the work does not [practically] require a PC per data set, vector should be usable in some form or another (whether some given implementation works well in the real world is another matter, of course). If not, another independent thread is needed. Gustafson's Law, regarding embarrassingly parallel workloads, is what allows systems like GPGPU, with low ILP, to perform well, because they can apply that 1 instruction to so much data, since a distinct PC is not really needed for each data set.

Just because high DLP can allow for a more coarse SIMD unit, it does not mean that a microprocessor designer can get away with making the execution unit 16,384-bits wide because that is very inefficient since not a whole lot of data sets will follow the same execution/program paths. There's a very good reason why both Nvidia and AMD designed a warp size of 32 along with GCN's vector unit being 512-bits wide respectively. The same goes for Intel adding another AVX 512 unit in knights landing instead of making it fatter/wider to 1024-bits.
So in one breath it's seldom, then in another it's a reason for wide data paths to narrow execution units? Of course 16Kb would be too wide, mainly due to making caches work (even the mighty Intel had their work cut out for them going with 256b in Haswell!). 512-bit AVX should be a minimum of 1:8 ILP:DLP, nV went back to 1:32, and I honestly can't recall right now where AMD is at or going, w/ the new GCN.

Regular scalar execution is 1:1 min, maxing out at 1:2, and 1:2 only if being generous, with the idea that A op B -> C isn't really just 1. All of those are getting a good bit higher in DLP than ILP.

What you are spouting is outdated information ...
Outdated since when? If you have a tree with mere hundreds of bytes per data-containing node, which GPGPU system is going to handle that better than a fast speculative CPU? It breaks down. With the world moving to SoCs, JIT languages for real work other than Java (or non-standard Java implementations), and x86 having AVX2 (finally, a decent SIMD ISA for x86), we may start to see changes, but of course not every problem will be amenable, as some will spend more time managing better-packed data structures than working on them.

GPU's of today can very well handle branches. It's just the divergence that's the issue and both Nvidia and AMD are doing a good job of keeping that under wraps.
Under wraps so well it hasn't been seen. How are they speeding up branch resolution per thread set, and are they splitting up workloads in the compiler for potentially sparse results? So far, all I've seen are more SMT features in the hope that the GPU stays busy while it waits.
 
Last edited:

lopri

Elite Member
Jul 27, 2002
13,314
690
126
Here is a graph that shows IPC performance across the "tick-tock" cadence, straight from the horse's mouth. (IPC as in instructions per clock, without consideration of power.) Not as pretty as some might have thought in the past, when some reviewers gushed over every new CPU.

IPC_Improvements.png


Biggest IPC gain came from Netburst -> Dothan/Merom on Intel CPUs. Since then there have been "refinements." To be fair, though, the increasing number of cores did change the computing landscape quite a bit.

Still, it is quite pathetic if you compare it to some other tech industries. (e.g. Display, GPUs, HDD, memory, wireless network, etc.)
 

SAAA

Senior member
May 14, 2014
541
126
116
Where's that image from?

Also, odd that Conroe isn't included...

It's from Intel, some server presentation slide that was shown with new Haswell parts.

Why not Conroe? Because it would make the rest of the IPC increases look ridiculous; the bar would be 4-5 times higher than the first (remember the doubling of performance/clock from Netburst?).

Then look also at the cumulative increase and you'll see that both Haswell and Ivy weren't that bad in absolute, not percentage, terms: someone made an example using a skyscraper's height, and in this case it's like they added 100m again to a 1000m-high one.
Broadwell may be only 5% more, but that is +55m, like 10% when the tower was 500m.
It's really becoming harder to do more; let's see how Skylake turns out.
 

NTMBK

Lifer
Nov 14, 2011
10,461
5,845
136
Seriously, guys? :\ "Conroe" is on the chart. It's Merom.

Merom is the mobile version of Conroe. It's a mobile IPC chart.

EDIT: Ninja'd by 20 minutes... I need to leave old tabs open less often :D