Is there an IPC limit or a point of diminishing returns?

positivedoppler

Golden Member
Apr 30, 2012
1,140
236
116
In the case of a single core, is there a theoretical limit on how much IPC can be achieved before we hit a wall? If there is no limit and IPC can continuously increase provided we keep throwing transistors at the CPU, is there a point of diminishing returns at which adding transistors becomes just too expensive? For example, increasing the CPU's transistor count by 100% increases IPC by only 1%.

This is not a thread about which CPU company is best. I am not interested in how amazing a certain company's process and research technology is, and I am not interested in knowing when you think a company will go out of business.

I am just interested in the above question because I am curious to learn at what point the iGPU becomes more important than the CPU. Does the iGPU scale better with added transistors than the CPU does?
 

Roland00Address

Platinum Member
Dec 17, 2008
2,196
260
126
Yes. For single-threaded situations there is a point of diminishing returns on both IPC and on the maximum frequency you can run a processor at within a reasonable thermal limit.

IPC has a lot to do with how the pipeline of the CPU is designed: you want the CPU to be at full utilization as much as possible. Yet the processor must be "stable": it has to finish the previous instruction, reset, and be ready for the next one, and if the timing doesn't have enough clearance, the processor is not stable and the data is not reliable.

You can "flush" the information faster by using higher voltages but there is diminishing returns for higher voltages means more heat and also several other problems.


Another very important thing is to keep the CPU fed as quickly as possible, with as fast a cache as possible and a cache structure that knows what to put where and what is not needed. Cache is 2 to 3 orders of magnitude faster than RAM, RAM is about 2 orders of magnitude faster than SSDs in latency, and SSDs are about 2 orders of magnitude faster than hard drives.
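To put rough numbers on those gaps (all of these are ballpark figures picked for illustration, not measurements of any particular system):

```python
# Very rough, illustrative access latencies showing the orders-of-magnitude
# gaps described above. None of these are real measurements.
latency_ns = {
    "L1 cache": 1,          # ~1 ns
    "DRAM":     100,        # roughly 2 orders of magnitude slower than cache
    "SSD":      10_000,     # ~10 us, roughly 2 orders slower than DRAM
    "HDD":      5_000_000,  # ~5 ms, a few hundred times slower than an SSD
}

prev_name, prev_lat = None, None
for name, lat in latency_ns.items():
    if prev_lat is not None:
        print(f"{name} is ~{lat / prev_lat:,.0f}x slower than {prev_name}")
    prev_name, prev_lat = name, lat
```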

There is a limit on how much you can improve cache speed and cache size. You can't just throw transistors at this, because there is a limit to how much of the fastest cache you can place near where it is needed on the die; slower cache levels have fewer constraints since they can sit farther away.
 

Nothingness

Diamond Member
Jul 3, 2013
3,142
2,158
136
Difficult question :)

If you want some theoretical data, there's an old article by David Wall, "Limits of Instruction-Level Parallelism" (google it). It shows how some programs can have IPC >> 100.

But that just says there's a lot of ILP to extract; the price of extracting it has already become quite high. In particular, branch prediction accuracy and hardware prefetchers (to reduce the impact of memory latency) are what limit IPC now, so you're better off improving those rather than trying to extract more ILP by widening the instruction window and adding functional units.
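To make that concrete, here's a crude back-of-the-envelope CPI model; every number in it is invented for illustration, but it shows why mispredicts and, above all, memory misses swamp the raw width of the machine:

```python
# Crude cycles-per-instruction model: even a very wide core ends up limited
# by branch mispredicts and by loads that miss all caches.
# All the numbers below are assumptions picked for illustration.
ideal_ipc         = 6.0    # what the execution units could sustain in theory
branches_per_ins  = 0.2    # ~1 branch every 5 instructions
mispredict_rate   = 0.02   # 98% branch prediction accuracy
mispredict_cost   = 15     # cycles to refill the pipeline
dram_miss_per_ins = 0.01   # loads per instruction that miss every cache level
dram_latency      = 250    # cycles to main memory

cpi = (1.0 / ideal_ipc
       + branches_per_ins * mispredict_rate * mispredict_cost
       + dram_miss_per_ins * dram_latency)
print(f"effective IPC ~ {1.0 / cpi:.2f}")  # ~0.37: memory misses dominate
```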
 

ThatBuzzkiller

Golden Member
Nov 14, 2014
1,120
260
136
I think we're already at the point of diminishing returns ...

If you're only seeing gains on workloads that have a lot of data-level parallelism and nothing else, then at that point microprocessor designers should shelve making wider cores and invest some of that die space in the GPU ...
 

ShintaiDK

Lifer
Apr 22, 2012
20,378
145
106
There are also other factors in terms of where to put transistors, and those are integration and accelerators.

GPUs always scale better than CPUs due to their parallel nature, but the range of work you can do on them is extremely limited. Then again, accelerators beat GPUs at that as well, with the same trade-off as GPU vs CPU, just applied to accelerators vs GPUs.
 

Enigmoid

Platinum Member
Sep 27, 2012
2,907
31
91
Yes. Consider a given program A running on a single core.

Program A can be sped up (perf/frequency) in two ways.
1) Increase the CPU's ability to process data in parallel (extract ILP).
2) Run each individual operation faster.

Looking at 2) this reaches a maximum when each operation executes in a single cycle. I suppose fused operations are possible (ie FMA) in some cases but logistics will prevent many operations from operating in tandem (at least under reasonable operating conditions - a real world core will make certain trade-offs). Double pumping a unit is possible but there is still the hard cap of one operation per operational unit cycle. The usage of these fused-operations also depends on data dependencies.

For 1) a given program will have a maximum amount of ILP assuming the program is real world and of mixed operations that are not independent of each other (a program whose sole purpose is to increment every element of a 100,000,000 by 100,000,000 array could theoretically be executed with the latency of one operation in a wide enough core - ie like AVX operations but on a much larger scale). However while a machine with any amount of ILP could be built, real-world program A contains a certain amount of ILP after which core resources will go unused.
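A toy sketch of that array case (scaled way down, with NumPy's vectorized add standing in for an arbitrarily wide machine):

```python
import numpy as np

# Incrementing every element has no dependency chain at all: every element
# could in principle be updated in the same cycle given enough hardware.
# (Array scaled way down from the 100,000,000 x 100,000,000 example above.)
a = np.zeros((1_000, 1_000), dtype=np.int32)

# Scalar-looking version: each iteration is independent of every other one.
# for i in range(a.shape[0]):
#     for j in range(a.shape[1]):
#         a[i, j] += 1

# Vectorized version: how parallel this actually runs is decided purely by
# the width of the hardware, not by the program's available ILP.
a += 1
```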

With this in mind, program A will eventually reach a peak perf/frequency limit on some 'perfect core'.

The same will be true for program B, though these limits will be different. Thus for any program with a finite amount of ILP there will be a theoretical performance limit.
 

maddie

Diamond Member
Jul 18, 2010
4,993
5,163
136
Yes. Consider a given program A running on a single core.

Program A can be sped up (perf/frequency) in two ways.
1) Increase the CPU's ability to process data in parallel (extract ILP).
2) Run each individual operation faster.

Looking at 2) this reaches a maximum when each operation executes in a single cycle. I suppose fused operations are possible (ie FMA) in some cases but logistics will prevent many operations from operating in tandem (at least under reasonable operating conditions - a real world core will make certain trade-offs). Double pumping a unit is possible but there is still the hard cap of one operation per operational unit cycle. The usage of these fused-operations also depends on data dependencies.

For 1) a given program will have a maximum amount of ILP assuming the program is real world and of mixed operations that are not independent of each other (a program whose sole purpose is to increment every element of a 100,000,000 by 100,000,000 array could theoretically be executed with the latency of one operation in a wide enough core - ie like AVX operations but on a much larger scale). However while a machine with any amount of ILP could be built, real-world program A contains a certain amount of ILP after which core resources will go unused.

With this in mind, program A will eventually reach a peak perf/frequency limit on some 'perfect core'.

The same will be true for program B, though these limits will be different. Thus for any program with a finite amount of ILP there will be a theoretical performance limit.
This is an excellent explanation.
 

sxr7171

Diamond Member
Jun 21, 2002
5,079
40
91
I think we're already at the point of diminishing returns ...

If you're only seeing gains on workloads that have a lot of data-level parallelism and nothing else, then at that point microprocessor designers should shelve making wider cores and invest some of that die space in the GPU ...

How about, instead of GPUs, they start putting large amounts of eDRAM on?
 

ThatBuzzkiller

Golden Member
Nov 14, 2014
1,120
260
136
How about, instead of GPUs, they start putting large amounts of eDRAM on?

That doesn't do much if you're bound by serial computation ...

The amount of L3 cache in current Xeons is enough to guarantee that you'll almost never get a miss, so adding eDRAM would be a futile attempt ...

The only other option I can think of for gaining appreciable returns on single-threaded performance is lowering latencies overall ...

Having all caches be pseudo-associative would be ideal for lowering memory access times. Making the pipeline shorter and designing the cores for higher frequencies could make some serious strides in single-threaded performance too ...
 

positivedoppler

Golden Member
Apr 30, 2012
1,140
236
116
Yes. Consider a given program A running on a single core.

Program A can be sped up (perf/frequency) in two ways.
1) Increase the CPU's ability to process data in parallel (extract ILP).
2) Run each individual operation faster.

Looking at 2) this reaches a maximum when each operation executes in a single cycle. I suppose fused operations are possible (ie FMA) in some cases but logistics will prevent many operations from operating in tandem (at least under reasonable operating conditions - a real world core will make certain trade-offs). Double pumping a unit is possible but there is still the hard cap of one operation per operational unit cycle. The usage of these fused-operations also depends on data dependencies.

For 1) a given program will have a maximum amount of ILP assuming the program is real world and of mixed operations that are not independent of each other (a program whose sole purpose is to increment every element of a 100,000,000 by 100,000,000 array could theoretically be executed with the latency of one operation in a wide enough core - ie like AVX operations but on a much larger scale). However while a machine with any amount of ILP could be built, real-world program A contains a certain amount of ILP after which core resources will go unused.

With this in mind, program A will eventually reach a peak perf/frequency limit on some 'perfect core'.

The same will be true for program B, though these limits will be different. Thus for any program with a finite amount of ILP there will be a theoretical performance limit.

Thanks for the explanation. The giant-array example seems like a task best suited for the GPU. So from reading all this, I'm guessing the takeaway is that we might be quickly closing in on a point where IPC can no longer be feasibly increased.
 

mahoshojo

Junior Member
Jul 24, 2015
18
0
36
Let's say you are thinking about improving your CPU performance by 5%.

The first and easiest way to do that is to increase the clock frequency by 5%. In order to reach that 5% higher clock rate, the supply voltage might have to go up by, say, 10%. Then your overall power consumption goes up by ~27% (the C*V^2*f rule).

Improving IPC amounts to adding more computing resources, i.e. more transistors, to your CPU core, which is roughly equivalent to adding more C to C*V^2*f. If C (the computing resources) goes up by 27% but you only get 5% higher IPC, it's better to just overclock the CPU.
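Working that comparison out (the +10% voltage and the +27% C are just the assumed example numbers from above):

```python
# Dynamic power scales roughly as P ~ C * V^2 * f.
base_power = 1.0

# Option 1: +5% frequency, which (by assumption) needs +10% supply voltage.
p_overclock = base_power * (1.10 ** 2) * 1.05   # ~1.27x power for +5% perf

# Option 2: +5% IPC from extra logic, modeled as +27% capacitance C
# at the same voltage and frequency.
p_wider_core = base_power * 1.27                # ~1.27x power for +5% perf

print(f"overclock : {p_overclock:.2f}x power")
print(f"wider core: {p_wider_core:.2f}x power")
# Same power cost either way, so if the extra transistors only buy 5% IPC,
# the simple overclock is at least as good a deal.
```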
 

sxr7171

Diamond Member
Jun 21, 2002
5,079
40
91
That doesn't do much if you're bound by serial computation ...

The amount of L3 cache in current Xeons is enough to guarantee that you'll almost never get a miss, so adding eDRAM would be a futile attempt ...

The only other option I can think of for gaining appreciable returns on single-threaded performance is lowering latencies overall ...

Having all caches be pseudo-associative would be ideal for lowering memory access times. Making the pipeline shorter and designing the cores for higher frequencies could make some serious strides in single-threaded performance too ...

Is there a reason they are not working towards shorter pipelines?
 

Nothingness

Diamond Member
Jul 3, 2013
3,142
2,158
136
Is there a reason they are not working towards shorter pipelines?
If you have an operation to perform, it takes a certain amount of computation. You split that computation into (pipeline) stages, one stage per cycle, but the total amount of computation stays roughly constant (not exactly, but let's accept that). So if you reduce the number of stages, you have more work to do in each cycle, which means each stage is deeper and takes longer to traverse (signals have finite speed), so you have to increase the time between stages, hence a lower clock.

To sum up: shorter pipelines mean a lower clock :)
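A toy model of that trade-off, with invented delays just to show the shape of it:

```python
# The total logic delay per instruction is roughly fixed; pipelining chops it
# into stages, and the clock period must cover one stage plus latch overhead.
# Both delay figures below are assumptions for illustration.
total_logic_ns    = 5.00   # assumed total logic delay per instruction
latch_overhead_ns = 0.05   # assumed per-stage register / clock-skew cost

for stages in (20, 14, 8):
    period_ns = total_logic_ns / stages + latch_overhead_ns
    print(f"{stages:2d} stages -> ~{1.0 / period_ns:.2f} GHz")
# Fewer stages => more logic per cycle => longer period => lower clock.
```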
 

Shehriazad

Senior member
Nov 3, 2014
555
2
46
I kind of hope that conventional CPUs hit some sort of stopping point in a few years. I mean, the gains haven't really been impressive the past few years anyway.

If IPC can no longer be increased by a reasonable amount and frequency also hits a dead end... (I really don't think 6+ GHz CPUs will ever be a REAL thing -> now don't spam me with your LN2 overclocks, please)... then decreasing cache latency and other small things is all that's left.

But after that? Conventional-CPU no-man's-land. I feel like it could be quite a turning point in the CPU hardware business. Other companies could actually catch up to Intel (no, I don't mean AMD) and diversify the market.

But then again... with the mobile device market being so huge and small cores totally being a thing... I almost feel like progress in raw performance is going to be set back a few years in favor of mobility/interconnectivity.
 

sxr7171

Diamond Member
Jun 21, 2002
5,079
40
91
If you have an operation to perform, it takes a certain amount of computation. You split that computation into (pipeline) stages, one stage per cycle, but the total amount of computation stays roughly constant (not exactly, but let's accept that). So if you reduce the number of stages, you have more work to do in each cycle, which means each stage is deeper and takes longer to traverse (signals have finite speed), so you have to increase the time between stages, hence a lower clock.

To sum up: shorter pipelines mean a lower clock :)

Thanks for that.
 

maddie

Diamond Member
Jul 18, 2010
4,993
5,163
136
One possible way to get past the IPC barrier. If this works, we have a new paradigm.

This might allow the CPU to reconfigure itself to suit the actual program being processed. As ENIGMOID said, each program sequence would have an optimum IPC associated with it and a virtual CPU could be reconfigured as needed.

Yes, I know it's SemiAccurate, but look at the website and the companies involved in the development.


Soft Machines talks VISC architecture details

http://www.semiaccurate.com/2015/10/08/soft-machines-talks-visc-architecture-details/
 

Cerb

Elite Member
Aug 26, 2000
17,484
33
86
In the case of a single core, is there a theoretical limit on how much IPC can be achieved before we hit a wall?
With conventional CPUs, the program's ILP makes for an absolute limit.

If there is no limit and IPC can continuously increase provided we keep throwing transistors at the CPU, is there a point of diminishing returns at which adding transistors becomes just too expensive? For example, increasing the CPU's transistor count by 100% increases IPC by only 1%.
While not to that point, we're probably at the point where +100% transistors provides +10% or less (on average), just because so much is about memory now (going by what Intel and IBM have been doing the last 5-10 years). Getting the right things into L1 or L2 at the right times won't take much in the way of extra transistors, but it takes a lot of R&D and cleverness. We hit diminishing returns by '04 or so. By '10, we were at severely diminishing returns.

I am just interested in the above question because I am curious to learn at what point the iGPU becomes more important than the CPU. Does the iGPU scale better with added transistors than the CPU does?
For normal people, we've already crossed that line. Skylake's relatively huge IGP, even on 2C models, is no coincidence. It's exceptionally powerful, given what it is, and it was more important to improve it generation over generation than to add more cores, or to make the cores much faster.

How about, instead of GPUs, they start putting large amounts of eDRAM on?
Intel is doing both. How they built a big L4 that serves not just the GPU but still performs well in general is beyond my tiny brain's capability to understand, but they've done it. They can find OEMs that can find end users who will pay more for such premium features, though, so it's not going to be mainstream for a while, if ever. Still, while a big eDRAM cache might improve those hard-to-benchmark cases like background VM work, multitasking, etc., by not evicting back out to RAM so much, it won't do much to speed up whatever is using CPU time right this instant, because that usually requires looking into the future more than the past (assuming L3 is big enough, which is usually the case).
 

boozzer

Golden Member
Jan 12, 2012
1,549
18
81
My question is this: is current IPC reaching THE limit? The last 5 years have been really bad in that department.
 

Cerb

Elite Member
Aug 26, 2000
17,484
33
86
My question is this: is current IPC reaching THE limit? The last 5 years have been really bad in that department.
There have been no breakthroughs in memory. While we'll get bigger and faster SRAM and eDRAM, the coherent CAM (tag lookup) parts will eat up the gains from the faster memory cells themselves (this is already part of why L1 and L2 caches are fairly small). That's the practical limit today.

In real-system DRAM latency, I don't think we've improved by more than 2x, if that, since '03-'05 (when we got IMCs in mainstream CPUs), and a lot of that has come from tricks with new memory types and buses more than from the memory itself running faster. The memory itself is getting faster only minimally, and it's that RAM bandwidth and latency that's really hurting improvements in single-core performance. I'm sure once HMC gets to its 2nd or 3rd generation we'll see improvements from that, but still only marginal ones.

Let's say you're running a CPU around 3.5GHz, and random memory access averages 75ns (not far off from what Haswell with typical DDR3 seems to get in some synthetic tests, IIRC). The CPU effectively needs to be correct often enough to stay around 250 cycles ahead of what the memory can offer, so as not to stall too much and just waste time. That's why minor prefetching improvements, or a 0.1% gain in average branch prediction accuracy, can matter so much. One screwup that has to go to L2 might burn maybe 20 cycles; L3 even more (since its speed relative to the core varies quite a bit); and missing in L3 could cost anywhere from under 100 cycles, if the page is open or right next to an open one, to over 200, if it's a totally different address from what's been worked on recently.
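That ~250-cycle figure, worked out (3.5GHz and 75ns being the assumed numbers above):

```python
clock_ghz = 3.5   # assumed core clock in GHz (i.e. cycles per nanosecond)
dram_ns   = 75    # assumed average random DRAM access latency in ns

stall_cycles = clock_ghz * dram_ns
print(f"one trip to DRAM costs ~{stall_cycles:.0f} core cycles")  # ~262
```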
 

ShintaiDK

Lifer
Apr 22, 2012
20,378
145
106
eDRAM/large caches, as seen with Broadwell-C and the HEDT line, are a good example of the memory issue. Those CPUs can easily perform 20-30% better than their raw core performance alone would suggest.
 

boozzer

Golden Member
Jan 12, 2012
1,549
18
81
There have been no breakthroughs in memory. While we'll get bigger and faster SRAM and eDRAM, the coherent CAM (tag lookup) parts will eat up the gains from the faster memory cells themselves (this is already part of why L1 and L2 caches are fairly small). That's the practical limit today.

In real-system DRAM latency, I don't think we've improved by more than 2x, if that, since '03-'05 (when we got IMCs in mainstream CPUs), and a lot of that has come from tricks with new memory types and buses more than from the memory itself running faster. The memory itself is getting faster only minimally, and it's that RAM bandwidth and latency that's really hurting improvements in single-core performance. I'm sure once HMC gets to its 2nd or 3rd generation we'll see improvements from that, but still only marginal ones.

Let's say you're running a CPU around 3.5GHz, and random memory access averages 75ns (not far off from what Haswell with typical DDR3 seems to get in some synthetic tests, IIRC). The CPU effectively needs to be correct often enough to stay around 250 cycles ahead of what the memory can offer, so as not to stall too much and just waste time. That's why minor prefetching improvements, or a 0.1% gain in average branch prediction accuracy, can matter so much. One screwup that has to go to L2 might burn maybe 20 cycles; L3 even more (since its speed relative to the core varies quite a bit); and missing in L3 could cost anywhere from under 100 cycles, if the page is open or right next to an open one, to over 200, if it's a totally different address from what's been worked on recently.
thank you for the thorough post.
 

Nothingness

Diamond Member
Jul 3, 2013
3,142
2,158
136
eDRAM/large caches, as seen with Broadwell-C and the HEDT line, are a good example of the memory issue. Those CPUs can easily perform 20-30% better than their raw core performance alone would suggest.
As always, it depends a lot on the characteristics of your program. As an example, Anandtech showed only a modest IPC increase of 3.3% going from the 4770K to the 5775C.
 

JoeRambo

Golden Member
Jun 13, 2013
1,814
2,105
136
While there is a hard IPC bound determined by the properties of the code, that does not mean we are near the limits. Previous posters mentioned "problems" once your operations are down to 1 cycle: you can double-pump units or raise the clock. But it's not that limited; CPUs are out-of-order machines that can execute instructions far in the future as long as they don't have dependencies (and even then, branch prediction or outright crazy stuff like Itanium can save the day).
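A small illustration of the dependency point, in plain Python (the interpreter hides the effect; the difference only matters in compiled code running on a real out-of-order core):

```python
# The same work written as one long dependency chain vs. four independent
# chains. An out-of-order core can overlap the four accumulators, but the
# single chain forces each add to wait for the previous one.
data = list(range(1_000_000))

# One chain: serial by construction.
total = 0
for x in data:
    total += x

# Four independent chains: these adds have no dependencies on each other,
# so the hardware is free to execute them in parallel and merge at the end.
a = b = c = d = 0
for i in range(0, len(data), 4):
    a += data[i]
    b += data[i + 1]
    c += data[i + 2]
    d += data[i + 3]
assert a + b + c + d == total
```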

Even if a real-world program has an average IPC of 1.7, it's an average, which means a machine built for 1.7 IPC is 'economical' but will still be slower than a machine built for an average of 2 IPC, even if only by several percent.

Those are the execution improvements Intel has been adding lately: increasing the OoO "window" and also working on improving the core to serve those ops.

It's hard and "not rewarding" (compared to the department that doubles throughput by going from AVX to AVX-512 or enables FMA), but those improvements still happen. Haswell and Skylake both added quite a bit of additional hardware.

One can only wonder what progress we would see if AMD was not so grossly incompetent... And we should hope that Zen will turn the tide and force Intel to pump transistors into cores.
 

gdansk

Diamond Member
Feb 8, 2011
3,311
5,243
136
Depending on the workload, certain operations must be executed in order, which limits the number of instructions that can realistically be executed in a single clock. Furthermore, widening the execution units is difficult because you must widen or speed up everything else in order to feed them (or add more SMT). There isn't a hard limit, but there is probably a soft barrier depending on the number of transistors in your budget.