Is there an IPC limit or a point of diminishing returns?

Discussion in 'CPUs and Overclocking' started by positivedoppler, Oct 19, 2015.

  1. positivedoppler

    positivedoppler Senior member

    Joined:
    Apr 30, 2012
    Messages:
    872
    Likes Received:
    8
    In the case of a single core, is there a theoretical limit on how much IPC can be achieved before we hit a wall? If there is no limit and IPC can keep increasing as long as we keep throwing transistors at the CPU, is there a point of diminishing returns at which adding transistors becomes just too expensive? For example, increasing the CPU's transistor count by 100% only increases IPC by 1%.

    This is not a thread about which CPU company is best. I am not interested in how amazing a certain company's process and research technology is, and I am not interested in knowing when you think a company will go out of business.

    I am just interested in the above question because I am curious to learn at what point the iGPU becomes more important than the CPU cores. Does the iGPU scale better with added transistors than the CPU does?
     

  3. Roland00Address

    Roland00Address Golden Member

    Joined:
    Dec 17, 2008
    Messages:
    1,795
    Likes Received:
    8
    Yes, for single-threaded situations there is a point of diminishing returns on both IPC and the maximum frequency you can run a processor at within a reasonable thermal limit.

    IPC has a lot to do with how the CPU's pipeline is designed, because you want the CPU to be at full utilization as much as possible. Yet the processor must be "stable": it has to finish the previous instruction, reset, and be ready for the next one, and if the timing does not have enough clearance, the processor is not stable and the data is not reliable.

    You can "flush" the information faster by using higher voltages but there is diminishing returns for higher voltages means more heat and also several other problems.


    Another very important thing is to keep the CPU fed as quickly as possible, with cache that is as fast as possible and a cache hierarchy that knows what to keep and what to evict. In latency, cache is 2 to 3 orders of magnitude faster than RAM, RAM is 2 orders of magnitude faster than SSDs, and SSDs are 2 orders of magnitude faster than hard drives.

    There is a limit on how much you can improve cache speed and cache size; you can't just throw transistors at it, because the fastest cache has to sit at certain locations on the processor die. Slower cache has fewer constraints and can be placed farther away.
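
    To make the latency gap concrete, here is a minimal pointer-chasing sketch in C (the array size, step count, and measurement approach are illustrative assumptions, not figures from this thread). Each load depends on the previous one, so prefetchers cannot hide the latency, and once the working set outgrows the caches the time per step approaches raw DRAM latency.

    ```c
    /* Pointer-chasing latency sketch: dependent loads defeat hardware
       prefetchers, so time per step approximates memory access latency.
       Sizes are illustrative; compile with e.g. gcc -O2. */
    #include <stdio.h>
    #include <stdlib.h>
    #include <time.h>

    #define N     (1u << 22)    /* ~4M entries (32 MB), larger than typical L3 */
    #define STEPS (1u << 24)

    int main(void)
    {
        size_t *next = malloc((size_t)N * sizeof *next);
        if (!next) return 1;

        /* Sattolo's algorithm: build one big random cycle so the chase
           visits every entry and cannot settle into a cached subset. */
        for (size_t i = 0; i < N; i++) next[i] = i;
        for (size_t i = N - 1; i > 0; i--) {
            size_t j = (size_t)rand() % i;
            size_t t = next[i]; next[i] = next[j]; next[j] = t;
        }

        struct timespec t0, t1;
        size_t p = 0;
        clock_gettime(CLOCK_MONOTONIC, &t0);
        for (size_t s = 0; s < STEPS; s++) p = next[p];   /* serial load chain */
        clock_gettime(CLOCK_MONOTONIC, &t1);

        double ns = (t1.tv_sec - t0.tv_sec) * 1e9 + (t1.tv_nsec - t0.tv_nsec);
        printf("~%.1f ns per dependent access (p=%zu)\n", ns / STEPS, p);
        free(next);
        return 0;
    }
    ```

    Shrinking N until the working set fits in L2 or L1 should show the orders-of-magnitude latency gap described above.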
     
  4. Nothingness

    Nothingness Golden Member

    Joined:
    Jul 3, 2013
    Messages:
    1,612
    Likes Received:
    79
    Difficult question :)

    If you want some theoretical data, there's an old paper by David Wall, "Limits of Instruction-Level Parallelism" (google it). It shows that some programs can have IPC >> 100.

    But that just says there's a lot of ILP to extract; the price to pay for extracting it has already become quite high. In particular, branch prediction accuracy and hardware prefetchers (to reduce the impact of memory latency) are what limit IPC, so you'd be better off improving those rather than trying to extract more ILP by widening the instruction window and adding functional units.
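
    As a rough illustration of how much branch prediction accuracy matters (a toy sketch, not something from Wall's paper), the same loop runs much faster once its data is sorted, purely because the branch becomes predictable:

    ```c
    /* Branch predictability sketch: identical work on random vs. sorted data.
       Note: at high optimization levels a compiler may turn the branch into
       branchless code and hide the effect; this is only an illustration. */
    #include <stdio.h>
    #include <stdlib.h>
    #include <time.h>

    static int cmp(const void *a, const void *b)
    {
        return *(const int *)a - *(const int *)b;
    }

    static double run(const int *v, int n)
    {
        struct timespec t0, t1;
        long long sum = 0;
        clock_gettime(CLOCK_MONOTONIC, &t0);
        for (int r = 0; r < 100; r++)
            for (int i = 0; i < n; i++)
                if (v[i] >= 128) sum += v[i];   /* hard to predict on random data */
        clock_gettime(CLOCK_MONOTONIC, &t1);
        printf("sum=%lld\n", sum);              /* keep the work from being elided */
        return (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;
    }

    int main(void)
    {
        enum { N = 1 << 20 };
        int *v = malloc(N * sizeof *v);
        if (!v) return 1;
        for (int i = 0; i < N; i++) v[i] = rand() % 256;

        double random_time = run(v, N);
        qsort(v, N, sizeof *v, cmp);
        double sorted_time = run(v, N);
        printf("random: %.3fs  sorted: %.3fs\n", random_time, sorted_time);
        free(v);
        return 0;
    }
    ```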
     
  5. ThatBuzzkiller

    ThatBuzzkiller Senior member

    Joined:
    Nov 14, 2014
    Messages:
    687
    Likes Received:
    14
    I think we're already at the point of diminishing returns ...

    If you're only seeing gains on workloads that have a lot of data-level parallelism and nothing else, then at that point microprocessor designers should shelve making wider cores and invest some of that die space in the GPU ...
     
  6. ShintaiDK

    ShintaiDK Lifer

    Joined:
    Apr 22, 2012
    Messages:
    20,392
    Likes Received:
    120
    There are also other factors in terms of where to put transistors, namely integration and accelerators.

    GPUs always scale better than CPUs due to their parallel nature, but the range of work you can do on them is extremely limited. Accelerators then beat GPUs on that front as well, though with the same trade-off as GPU vs CPU, just shifted to accelerators vs GPUs.
     
  7. Enigmoid

    Enigmoid Platinum Member

    Joined:
    Sep 27, 2012
    Messages:
    2,907
    Likes Received:
    22
    Yes. Consider a given program A running on a single core.

    Program A can be sped up (perf/frequency) in two ways.
    1) Increase the CPU's ability to process data in parallel (extract ILP).
    2) Run each individual operation faster.

    Looking at 2), this reaches a maximum when each operation executes in a single cycle. I suppose fused operations (e.g. FMA) are possible in some cases, but logistics will prevent many operations from being combined (at least under reasonable operating conditions - a real-world core will make certain trade-offs). Double pumping a unit is possible, but there is still the hard cap of one operation per execution-unit cycle. The usefulness of these fused operations also depends on data dependencies.

    For 1), a given program will have a maximum amount of ILP, assuming the program is real-world and made of mixed operations that are not all independent of each other (a program whose sole purpose is to increment every element of a 100,000,000 by 100,000,000 array could theoretically be executed with the latency of one operation on a wide enough core - like AVX operations but on a much larger scale). However, while a machine with any amount of ILP extraction could be built, real-world program A contains only a certain amount of ILP, beyond which core resources go unused.

    With this in mind, program A will eventually reach a peak perf/frequency limit on some 'perfect core'.

    The same will be true for program B, though these limits will be different. Thus for any program with a finite amount of ILP there will be a theoretical performance limit.
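
    A minimal sketch of the two extremes described above, with toy sizes chosen only for illustration: the first loop has no cross-iteration dependencies, so arbitrarily wide hardware (or SIMD) could in principle execute all of it at once, while the second is one long dependency chain that extra width cannot speed up.

    ```c
    /* ILP extremes sketch (toy sizes, for illustration only). */
    #include <stdio.h>

    #define N 1000000

    static double a[N];

    int main(void)
    {
        /* Independent work: every iteration could execute in parallel. */
        for (int i = 0; i < N; i++) a[i] += 1.0;

        /* Serial work: iteration i needs the result of iteration i-1, so the
           chain's latency bounds performance no matter how wide the core is. */
        double running = 0.0;
        for (int i = 0; i < N; i++) running += a[i];

        printf("%f\n", running);
        return 0;
    }
    ```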
     
  8. maddie

    maddie Golden Member

    Joined:
    Jul 18, 2010
    Messages:
    1,389
    Likes Received:
    26
    This is an excellent explanation.
     
  9. sxr7171

    sxr7171 Diamond Member

    Joined:
    Jun 21, 2002
    Messages:
    5,012
    Likes Received:
    20
    How about, instead of GPUs, they start putting large amounts of eDRAM on the die?
     
  10. ThatBuzzkiller

    ThatBuzzkiller Senior member

    Joined:
    Nov 14, 2014
    Messages:
    687
    Likes Received:
    14
    That doesn't do much if you're serially computation bound ...

    The amount of L3 cache in current Xeons is enough to guarantee that you'll almost never get a miss, so adding eDRAM is a futile attempt ...

    The only other option I can think of for gaining appreciable returns on single-threaded performance is lowering latencies overall ...

    Having all caches be pseudo-associative is ideal for lowering memory access times. Making the pipeline shorter and designing the cores for higher frequencies could make some serious strides in single-threaded performance too ...
     
  11. positivedoppler

    positivedoppler Senior member

    Joined:
    Apr 30, 2012
    Messages:
    872
    Likes Received:
    8
    Thanks for the explanation. The large-array example seems to be a task best suited for the GPU. So from reading all this, I'm guessing the takeaway is that we might be quickly closing in on a point at which IPC can no longer be feasibly increased.
     
  12. mahoshojo

    mahoshojo Junior Member

    Joined:
    Jul 24, 2015
    Messages:
    16
    Likes Received:
    0
    Let's say you are thinking about improving your CPU performance by 5%.

    The first and easiest way to do so is to increase the clock frequency by 5%.
    In order to reach that 5% higher clock rate, the supply voltage has to go up by, say, 10%.
    Then your overall power consumption goes up by ~27% (by the C*V^2*f rule).

    Improving the IPC is sort of like adding more computing resources, i.e. more transistors, to your CPU core, which is roughly equivalent to adding more C to C*V^2*f. If C (the computing resources) goes up by 27% but you only get 5% higher IPC, it's better to just overclock the CPU.
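
    The arithmetic above, written out as a tiny program (the 5% and 10% figures are the post's example numbers, not measurements):

    ```c
    /* Worked C*V^2*f example using the post's illustrative 5%/10% figures. */
    #include <stdio.h>

    int main(void)
    {
        double freq_scale = 1.05;   /* +5% clock */
        double volt_scale = 1.10;   /* +10% supply voltage */

        /* Dynamic power scales as P = C * V^2 * f, so with C held fixed: */
        double power_scale = volt_scale * volt_scale * freq_scale;

        printf("relative power: %.4f (~%.0f%% increase)\n",
               power_scale, (power_scale - 1.0) * 100.0);   /* ~1.27 -> ~27% */
        return 0;
    }
    ```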
     
  13. sxr7171

    sxr7171 Diamond Member

    Joined:
    Jun 21, 2002
    Messages:
    5,012
    Likes Received:
    20
    Is there a reason they are not working towards shorter pipelines?
     
  14. Nothingness

    Nothingness Golden Member

    Joined:
    Jul 3, 2013
    Messages:
    1,612
    Likes Received:
    79
    If you have an operation to do, it takes a certain amount of computation. You split this computation into (pipeline) stages, one stage per cycle, but the total amount of computation remains constant (not exactly, but let's assume it does). So if you reduce the number of stages, you have more work to do in each cycle, which means each stage is deeper and hence takes longer to traverse (signals propagate at finite speed), so you have to increase the time between stages, hence a lower clock.

    To sum up: shorter pipelines mean lower clock :)
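
    Here is a toy model of that trade-off, with invented delay numbers purely for illustration: splitting a fixed amount of logic into fewer, deeper stages lengthens the cycle time and lowers the achievable clock, while more stages raise the clock at the cost of per-stage latch overhead (and a bigger branch misprediction penalty).

    ```c
    /* Toy pipeline-depth model: cycle time = logic delay per stage + latch
       overhead. The 10 ns / 0.1 ns numbers are invented for illustration. */
    #include <stdio.h>

    int main(void)
    {
        double logic_ns = 10.0;   /* total combinational delay of the operation */
        double latch_ns = 0.1;    /* fixed per-stage register/latch overhead */

        for (int stages = 5; stages <= 30; stages += 5) {
            double cycle_ns = logic_ns / stages + latch_ns;
            printf("%2d stages -> %.2f ns/cycle -> %.2f GHz max clock\n",
                   stages, cycle_ns, 1.0 / cycle_ns);
        }
        return 0;
    }
    ```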
     
  15. Shehriazad

    Shehriazad Senior member

    Joined:
    Nov 3, 2014
    Messages:
    554
    Likes Received:
    1
    I kind of hope that conventional CPUs hit some sort of stopping point in a few years. I mean, the gains haven't really been impressive over the past few years anyway.

    If IPC can no longer be increased by a reasonable amount and frequency also hits a dead end... (I really don't think 6+ GHz CPUs will ever be a REAL thing -> now don't spam me with your LN2 overclocks please)... then decreasing cache latency and other small things is all that's left.

    But after that? Conventional CPUs' no-man's-land. I feel like it could be quite a turning point in the CPU hardware business. Other companies could actually catch up to Intel (no, I don't mean AMD) and diversify the market.

    But then again... with the mobile device market being so huge and small cores totally being a thing... I almost feel like progress in raw performance is going to be set back a few years in favor of mobility/interconnectivity.
     
  16. sxr7171

    sxr7171 Diamond Member

    Joined:
    Jun 21, 2002
    Messages:
    5,012
    Likes Received:
    20
    Thanks for that.
     
  17. maddie

    maddie Golden Member

    Joined:
    Jul 18, 2010
    Messages:
    1,389
    Likes Received:
    26
    Here is one possible way to get past the IPC barrier. If this works, we have a new paradigm.

    It might allow the CPU to reconfigure itself to suit the actual program being processed. As Enigmoid said, each program sequence would have an optimum IPC associated with it, and a virtual CPU could be reconfigured as needed.

    Yes, I know it's SemiAccurate, but look at the website and the companies involved in the development.


    Soft Machines talks VISC architecture details

    http://www.semiaccurate.com/2015/10/08/soft-machines-talks-visc-architecture-details/
     
  18. Cerb

    Cerb Elite Member

    Joined:
    Aug 26, 2000
    Messages:
    17,409
    Likes Received:
    0
    With conventional CPUs, the program's ILP makes for an absolute limit.

    While not to that point, we probably are at the point where +100% transistors provides +10% or less (on average), just because so much is about memory now (going by what Intel and IBM have been doing the last 5-10 years). Getting the right things into L1 or L2 at the right times won't take much in the way of extra transistors, but it takes a lot of R&D and cleverness. We hit diminishing returns by '04 or so. By '10, we were at severely diminishing returns.

    For normal people, we've already crossed that line. Skylake's relatively huge IGP, even on 2C models, is no coincidence. It's exceptionally powerful for what it is, and improving it generation over generation has mattered more than adding cores, or making the cores much faster.

    Intel is doing both. How they built a big L4 that serves not just the GPU but still performs well in general is beyond my tiny brain's capability to understand, but they've done it. They can find OEMs that can find end users who will pay more for such premium features, so it's not going to be mainstream for a while, if ever. Still, while a big eDRAM cache might improve the hard-to-benchmark cases like background VM work, multitasking, etc., by not evicting back out to RAM so much, it won't do much to speed up whatever is using CPU time right this instant, because that usually needs to look into the future more than the past (assuming L3 is big enough, which is usually the case).
     
  19. boozzer

    boozzer Golden Member

    Joined:
    Jan 12, 2012
    Messages:
    1,549
    Likes Received:
    17
    My question is this: is current IPC reaching THE limit? The last 5 years have been really bad in that department.
     
  20. Cerb

    Cerb Elite Member

    Joined:
    Aug 26, 2000
    Messages:
    17,409
    Likes Received:
    0
    There have been no breakthroughs in memory. While we'll get bigger and faster SRAM and eDRAM, the coherent CAM (tag lookup) parts will offset the gains from the faster memory cells themselves (this is already part of why L1 and L2 caches are fairly small). That's the practical limit today.

    In real-system DRAM latency, I don't think we've improved by more than 2x, if that, since '03-'05 (when we got IMCs in mainstream CPUs), and a lot of that has come from tricks with new memory types and buses rather than the memory itself running faster. The memory cells themselves are getting faster only minimally, and it's RAM bandwidth and latency that are really hurting improvements in single-core performance. I'm sure once HMC gets to its 2nd or 3rd generation we'll see improvements from that, but still only marginal ones.

    Let's say you're running a CPU around 3.5GHz, and random memory access averages 75ns (not far off from what Haswell with typical DDR3 seems to get in some synthetic tests, IIRC). The CPU effectively needs to be correct often enough to stay around 250 cycles ahead of what the memory can offer, so as not to stall too much and just waste time. That's why minor prefetching improvements, and a 0.1% gain in average branch prediction accuracy, can matter so much. One screwup that has to go to L2 might burn maybe 20 cycles, L3 even more (since it runs at a different speed relative to the cores, it will vary quite a bit), and a miss all the way to RAM could cost anywhere from under 100 cycles, if the page is open or right next to an open one, to over 200 if it's a totally different address from what's been worked on recently.
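
    The cycle counts above, reproduced as simple arithmetic (the latency values are rough assumptions in the same ballpark as the post, not measurements of any particular CPU):

    ```c
    /* Latency-in-cycles arithmetic; all nanosecond figures are assumptions. */
    #include <stdio.h>

    int main(void)
    {
        double clock_ghz = 3.5;

        struct { const char *level; double ns; } lat[] = {
            { "L2 hit",           6.0 },   /* ~20 cycles */
            { "L3 hit",          12.0 },
            { "DRAM, open page", 25.0 },   /* under 100 cycles */
            { "DRAM, random",    75.0 },   /* the ~250-cycle case */
        };

        for (int i = 0; i < 4; i++)
            printf("%-16s %5.1f ns = ~%.0f cycles at %.1f GHz\n",
                   lat[i].level, lat[i].ns, lat[i].ns * clock_ghz, clock_ghz);
        return 0;
    }
    ```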
     
  21. ShintaiDK

    ShintaiDK Lifer

    Joined:
    Apr 22, 2012
    Messages:
    20,392
    Likes Received:
    120
    eDRAM/large caches, as seen with Broadwell-C and the HEDT line, are a good example of the memory issue. Those CPUs can easily perform 20-30% better than their raw core performance alone would suggest.
     
  22. boozzer

    boozzer Golden Member

    Joined:
    Jan 12, 2012
    Messages:
    1,549
    Likes Received:
    17
    thank you for the thorough post.
     
  23. Nothingness

    Nothingness Golden Member

    Joined:
    Jul 3, 2013
    Messages:
    1,612
    Likes Received:
    79
    As always, it depends a lot on the characteristics of your program. As an example, Anandtech showed only a modest 3.3% increase in IPC when going from the 4770K to the 5775C.
     
  24. ShintaiDK

    ShintaiDK Lifer

    Joined:
    Apr 22, 2012
    Messages:
    20,392
    Likes Received:
    120
    Oh I agree. But we can also see in how many places it's memory that is the limitation rather than IPC - or raw computational power, if you wish.
     
  25. JoeRambo

    JoeRambo Senior member

    Joined:
    Jun 13, 2013
    Messages:
    332
    Likes Received:
    31
    While there is a hard IPC bound decided by the properties of the code, that does not mean we are near those limits. Previous posters mentioned "problems" once your operations are down to 1 cycle: you can double-pump units or raise the clock. But it is not that limited; CPUs are out-of-order machines that can execute instructions far in the future as long as they don't have dependencies (and even then, branch prediction or outright crazy stuff like Itanium can save the day).

    Even if a real-world program has an average IPC of 1.7, that is an average, and it means a machine built for 1.7 IPC is 'economical' but will be slower than a machine built for an average of 2 IPC, even if only by several percent.

    Those are the execution improvements Intel has been adding lately: increasing the OoO "window" and also working on improving the core to serve those ops.

    It's hard and "not rewarding" (compared to the department that doubles throughput by going from AVX to AVX-512 or enables FMA), but those improvements still happen. Haswell and Skylake both added quite a bit of additional hardware.

    One can only wonder what progress we would see if AMD was not so grossly incompetent... And we should hope that Zen will turn the tide and force Intel to pump transistors into cores.
     
  26. gdansk

    gdansk Senior member

    Joined:
    Feb 8, 2011
    Messages:
    275
    Likes Received:
    0
    Depending on the workload, certain operations must be conducted in order and thus limit the number of instructions that can realistically be executed in a single clock. Furthermore, widening execution units is difficult because you must widen/speed up everything else in order to feed them (or add more SMT). There isn't a hard limit, but there is probably a soft barrier depending on the number of transistors in your budget.