Question Is there an absolute limit for IPC in CPU design?

MasterofZen

Junior Member
Jul 12, 2021
15
16
41
With the end of Dennard scaling, single-thread performance is more and more reliant on achieving higher IPC (instructions per cycle). With new techniques we have been able to push IPC higher, but it is getting more and more difficult (Apple's A14 and Arm's Cortex cores have seen diminishing generational improvements). I'm wondering whether there is an absolute limit for IPC. If there is, how close are we to this limit (for CISC/RISC respectively)?
 
  • Like
Reactions: Tlh97 and Carfax83

IntelUser2000

Elite Member
Oct 14, 2003
8,686
3,785
136
With the end of Dennard scaling, single-thread performance is more and more reliant on achieving higher IPC (instructions per cycle). With new techniques we have been able to push IPC higher, but it is getting more and more difficult (Apple's A14 and Arm's Cortex cores have seen diminishing generational improvements). I'm wondering whether there is an absolute limit for IPC. If there is, how close are we to this limit (for CISC/RISC respectively)?

There will be a limit for some applications, but similar concerns were raised 10 years ago.

A relatively recent study by Google has said much greater instruction parallelism is possible. Also, the workload size continues to grow, meaning wider processors will be faster in current and future applications.
 

maddie

Diamond Member
Jul 18, 2010
4,722
4,626
136
With the end of Dennard scaling, single-thread performance is more and more reliant on achieving higher IPC (instructions per cycle). With new techniques we have been able to push IPC higher, but it is getting more and more difficult (Apple's A14 and Arm's Cortex cores have seen diminishing generational improvements). I'm wondering whether there is an absolute limit for IPC. If there is, how close are we to this limit (for CISC/RISC respectively)?
A few years ago, most thought yes, and that it was close. Now no one knows, as the last few years have delivered what was thought impossible. At least, I have never seen a theory establishing such a limit. Maybe it's beyond us and AI will have to unlock it.
 

dullard

Elite Member
May 21, 2001
24,998
3,325
126
This is application dependent.

Suppose one car/driver can average running one errand per hour (clock). That would be an IPC of 1 errand per hour. In order to run 5 errands in that same hour, you can just have 5 cars. In order to run 1 million errands in an hour you can have 1 million cars, giving an IPC of 1 million per hour. So, in that situation, there is no limit as long as you can keep adding cars and drivers.

But that starts to fall apart when you need a specific task done with a specific resource. Suppose your car has 2 flat tires, needs to be washed, and needs to have its emissions inspected across town. Those 3 tasks will take an average of 3 clock ticks. To have your specific car do specific tasks, your IPC is only as good as the average number of those tasks you can complete per hour. Sure, by throwing money at it you might be able to increase that slightly (such as NASCAR-like pit stops with teams of people at your beck and call), but at some point you can't physically do anything more with your specific car at one time, since only so many people can work on a car at once. You can't have 1 million people working on your car during the exact same clock tick.

The cliché version is that 100 women can have 100 children in 9 months, but you can't speed up your wife's pregnancy no matter what.
 
  • Like
Reactions: Carfax83

Borealis7

Platinum Member
Oct 19, 2006
2,914
205
106
This is what they taught us at university.
IPC is a product of architecture. If you increase register sizes and add higher-order "macro-ops", you can achieve higher IPC at the same switching speed and transistor budget (more or less, because registers are also transistors...). We know there is a theoretical lower bound on the complexity of some operations (for instance, shifting bits, XORing values, any "order of n" operation), and we also know that some operations do not benefit from parallelism. So it looks like there is a theoretical limit dictated by mathematics and logic.
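A tiny example of that last point, as a sketch (the recurrence itself is arbitrary and the function name is mine): no matter how wide the core is, each step needs the previous step's result, so the dependency chain sets the speed limit.

/* A loop-carried dependency: iteration i needs the result of
 * iteration i-1, so the operations must execute one after another.
 * Extra ALUs can't help; the chain length sets the minimum time. */
long serial_chain(long x, int n)
{
    for (int i = 0; i < n; i++)
        x = x * 3 + 1;   /* each step consumes the previous step's x */
    return x;
}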
 

Nothingness

Platinum Member
Jul 3, 2013
2,371
713
136
A relatively recent study by Google has said much greater instruction parallelism is possible.
Do you have a link?

Also, the workload size continues to grow, meaning wider processors will be faster in current and future applications.
I’d say for large workloads, large caches and better prefetchers matter much more than being wider.
 
  • Like
Reactions: Carfax83

aigomorla

CPU, Cases&Cooling Mod PC Gaming Mod Elite Member
Super Moderator
Sep 28, 2005
20,839
3,174
126
I thought IPC and power draw/heat grew exponentially the higher you went on current tech.
This is why the really insane stuff is only possible with superconductors, or in very low-temperature benchmarking scenarios like pouring liquid helium over the silicon.
 
  • Like
Reactions: Carfax83

diediealldie

Member
May 9, 2020
77
68
61
It's also a function of compiler and workload evolution, so it's really hard to tell. Also, vector extensions are eating into the number of instructions which can run in parallel.

In any case, I kinda see that it's approaching the limit. In the Conroe days (2007) the reorder buffer size was 96 entries, in Skylake (2015) it was 224, and Sunny Cove has a whopping 352 entries.
This indicates that in order to extract instruction-level parallelism, CPUs now need to scan through almost 1.2 KB of instructions (a rough estimate, assuming an average x86 instruction of 3-4 bytes), which is already much larger than a typical single function. It's also approaching the size of one L1 way (32 kB with 8-way --> 4 kB per way), where we can somehow expect related instructions to be stored together.
 
  • Like
Reactions: Carfax83

Mopetar

Diamond Member
Jan 31, 2011
7,797
5,899
136
I don't know if you're necessarily interpreting that right. Was the buffer size in Conroe chosen because it was the ideal amount or because it couldn't be made larger without considerable cost or resulting in no additional performance gain due to a bottleneck elsewhere?

Buffers are larger if they can provide additional performance gain. Suppose we had some possible breakthrough where we could increase memory bandwidth by something like two orders of magnitude. Outside of a few applications or domains that are largely bound by memory bandwidth the overall performance wouldn't increase by much, maybe not even more than a few dozen percent in a lot of cases. The new abundance of memory bandwidth merely shifts any bottlenecks to another part of the chip and until that's improved the massive bandwidth gains can't be fully utilized.

I'm sure there's a chip designer out there wishing for an even bigger buffer because of all of the wonderful things they could do, but there's a transistor budget, and some other designer wants a bigger cache, and the next one wants to add another ALU. They can't all get what they want.
 

gdansk

Golden Member
Feb 8, 2011
1,973
2,353
136
Achievable IPC varies widely with the problem. If you know the problem exactly, you could design an ASIC for it. You could design an ASIC just for some SPEC benchmarks if you wanted to. It'd be totally useless, but it could probably achieve dozens of times higher IPC than any general-purpose CPU. That's why you see more and more die space dedicated to accelerators of various sorts, especially ML.

But in general the realistic IPC for many common programs is quite low, and increasing it is very difficult. I have no idea how close modern CPUs are to the realistic limit, but multiple attempts at using VLIW and smarter compilers to achieve higher IPC (Itanium, Denver) have failed to even match them.
 
  • Like
Reactions: Carfax83

MasterofZen

Junior Member
Jul 12, 2021
15
16
41
There will be a limit for some applications, but similar concerns were raised 10 years ago.

A relatively recent study by Google has said much greater instruction parallelism is possible. Also, the workload size continues to grow, meaning wider processors will be faster in current and future applications.
Can you give us the link to this study?
 

MasterofZen

Junior Member
Jul 12, 2021
15
16
41
A few years ago, most thought yes, and that it was close. Now no one knows, as the last few years have delivered what was thought impossible. At least, I have never seen a theory establishing such a limit. Maybe it's beyond us and AI will have to unlock it.
Even if there is no limit to IPC, how IPC scales with processor width n (where width could be decoder width, buffer size, number of ALUs, or whatever) is also very interesting. If IPC scales with log(n), then even though there is no upper limit for IPC, the future of IPC improvement would still be grim.
 

MasterofZen

Junior Member
Jul 12, 2021
15
16
41
This is what they taught us at university.
IPC is a product of architecture. If you increase register sizes and add higher-order "macro-ops", you can achieve higher IPC at the same switching speed and transistor budget (more or less, because registers are also transistors...). We know there is a theoretical lower bound on the complexity of some operations (for instance, shifting bits, XORing values, any "order of n" operation), and we also know that some operations do not benefit from parallelism. So it looks like there is a theoretical limit dictated by mathematics and logic.
Yeah, that's what I was asking. Can you elaborate on this limit, or give us some references?
What's also interesting is how IPC scales with processor width n (where width could be decoder width, buffer size, number of ALUs, or whatever). Do they scale differently for RISC and CISC? For example, if CISC IPC scales with log(n) and RISC IPC scales with sqrt(n), then RISC is the future.
 
  • Like
Reactions: Carfax83

IntelUser2000

Elite Member
Oct 14, 2003
8,686
3,785
136
I don't remember. It's one of those things you see when you are looking for something else and think "oh, that's interesting."

Mr. Keller said the same thing, that there's a lot more room to grow.

It doesn't matter whether it's CISC or RISC. It's x86 vs non x86.

The argument revolves around x86 instructions having variable length, which results in increased decoder complexity. So when you want to add more decoders, the transistor requirements don't grow linearly but much faster than that.
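To make that concrete, here's a rough sketch (the length-encoding rule is invented for illustration; real x86 length decoding is far messier): with variable-length instructions, finding where instruction N starts depends on having worked out the lengths of all the instructions before it, while with fixed-length instructions every decoder knows its start offset up front.

#include <stddef.h>
#include <stdint.h>

/* Hypothetical ISA for illustration only: pretend the first byte of
 * each instruction encodes its length (1..15 bytes). Not real x86. */
static size_t insn_length(const uint8_t *p)
{
    return (size_t)(*p % 15) + 1;
}

/* Variable-length decode: the start of instruction i is only known
 * after stepping over every instruction before it - a serial chain. */
size_t start_of_insn_variable(const uint8_t *code, size_t i)
{
    size_t off = 0;
    for (size_t k = 0; k < i; k++)
        off += insn_length(code + off);  /* each step depends on the last */
    return off;
}

/* Fixed-length decode (4-byte instructions): every decoder slot can
 * compute its start offset independently and work in parallel. */
size_t start_of_insn_fixed(size_t i)
{
    return i * 4;
}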

Beyond that they are pretty similar. At one point some doubted whether it was possible to have an out of order x86 core.
 
  • Like
Reactions: Carfax83

Doug S

Platinum Member
Feb 8, 2020
2,201
3,405
136
Yeah, that's what I was asking. Can you elaborate on this limit, or give us some references?
What's also interesting is how IPC scales with processor width n (where width could be decoder width, buffer size, number of ALUs, or whatever). Do they scale differently for RISC and CISC? For example, if CISC IPC scales with log(n) and RISC IPC scales with sqrt(n), then RISC is the future.

It's not simply a matter of width. If someone designed an ARM CPU that's 16 wide, it isn't necessarily going to have higher IPC than Apple's M1 just because it is wider. There are all sorts of potential bottlenecks, like the number of physical registers, the size and accuracy of branch prediction, the latency and associativity of the L1 caches (both I and D), and so forth. Any one of those could constrain its performance enough that it performs worse than the M1 despite being twice as wide.

IPC depends on what code is being run. Some code is more amenable to concurrent execution, but other code will have too many dependencies and won't even come close to an IPC of 1.0. Something that has to load values from main memory in a fashion too random for prefetch to work - an Oracle database being the classic example - will always have terrible IPC, and as long as main memory is many CPU cycles away, no CPU, no matter how wide or magical, can ever change that.
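A minimal sketch of that kind of access pattern (a generic linked-list walk, nothing to do with Oracle's actual code; the names are mine):

struct node { struct node *next; long payload; };

/* Every iteration needs the pointer loaded by the previous one, and
 * the addresses are effectively random, so prefetchers and a wide
 * out-of-order core can't hide the misses. IPC stays well below 1
 * whenever the next node isn't already in cache. */
long chase(const struct node *n)
{
    long sum = 0;
    while (n) {
        sum += n->payload;  /* needs the n loaded last iteration */
        n = n->next;        /* the load that gates the next iteration */
    }
    return sum;
}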

There won't be a formula it scales by, because of that and because the returns from going wider will always diminish: there will be fewer and fewer code segments where extracting that much parallelism is possible. For any particular block of code there's an ultimate maximum IPC determined by its dependencies. If you talk about IPC you have to talk about something like "IPC while running SPEC benchmark x"; there's no such thing as an IPC value specific to a CPU. It is always specific to a CPU running some particular segment of code. It's even worse if you try to compare across CPUs, because it is easier to have a higher IPC when clocked at 3.2 GHz than at 5.0 GHz: a given main memory latency in ns is over 50% higher measured in CPU cycles at 5.0 GHz.

At the other extreme of IPC from Oracle, if you have some loop that's for example multiplying an array of 1000 numbers by another array of 1000 numbers, theoretically it could have an IPC of 1000 if you had a wide enough machine to execute all those multiplications in a single cycle. If your application was mostly about doing big multiplications, the sky's the limit for how high its IPC could be if you had a crazy wide enough CPU. But that magic 1000 wide CPU isn't going to help your IPC running Oracle much, it will still be less than 1.0.
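In code form, that other extreme is just (a generic element-wise multiply; the 1000 matches the example):

/* Each multiply is independent of all the others, so a hypothetically
 * wide enough machine could in principle issue all 1000 in the same
 * cycle; the practical limits are fetch/issue width and memory
 * bandwidth, not data dependencies. */
void mul_arrays(const double *a, const double *b, double *out)
{
    for (int i = 0; i < 1000; i++)
        out[i] = a[i] * b[i];
}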
 

Cardyak

Member
Sep 12, 2018
72
159
106
There have been lots of studies within academic circles regarding IPC and ILP limits for a long time, one of my favourite papers for presenting IPC limits with clarity in a way that the layman can understand is the following paper: http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.47.9196&rep=rep1&type=pdf

If you build a perfect machine with infinite registers, infinite width/depth, perfect branch predictor, limitless caches, etc: Then you are looking at an IPC of approximately 40 in most integer workloads. However it should be noted that this is adhering to data dependencies and therefore does not include data speculation.

If you start to break data dependencies (I.E: Value Prediction) then there is technically infinite IPC, because you could *in theory* predict every instruction ahead of time and execute them all in a single cycle. However, in real terms, a value predictor will never be 100% accurate, and of course we will never have infinitely wide/deep machines. As people have mentioned, diminishing returns may put a limit on scaling and extending certain elements of a CPU, which will certainly present new challenges moving forward. But even with an ambitious (but somewhat realistic) scenario you can end up with IPC well into the hundreds.
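For anyone who hasn't met value prediction before, here's a toy sketch of its simplest form, a last-value predictor (the table size and indexing are made up purely for illustration): the core guesses that an instruction will produce the same result it produced last time, lets dependent instructions run speculatively on that guess, and squashes them if the real result disagrees.

#include <stdint.h>
#include <stdbool.h>

#define VP_ENTRIES 1024  /* made-up table size, illustration only */

static uint64_t last_value[VP_ENTRIES];  /* indexed by instruction PC */

static unsigned vp_index(uint64_t pc)
{
    return (unsigned)((pc >> 2) % VP_ENTRIES);
}

/* Predict: guess the same value the instruction produced last time,
 * so instructions that depend on it can start executing speculatively. */
uint64_t vp_predict(uint64_t pc)
{
    return last_value[vp_index(pc)];
}

/* Verify and train: record the real result; if it differs from the
 * guess, everything that consumed the guess must be squashed and replayed. */
bool vp_verify(uint64_t pc, uint64_t actual, uint64_t predicted)
{
    last_value[vp_index(pc)] = actual;
    return actual == predicted;
}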

My 2 cents for what it's worth? - Using value prediction and making incredibly wide and deep OoO machines (I'm talking reorder buffers in the thousands of entries, enormous caches and branch predictors, and a 64 wide decode engine in the front end) I'm pretty confident we can get IPC up to at least 50, and there's no real reason to think you couldn't push past even that.

With aggressive enough value prediction and hundreds of execution ports I can't see a good reason why IPC couldn't break the 100 barrier (Yes it would be incredibly complex, and it is decades and decades away, but all the theory is there)

I'd really recommend reading through the paper I linked, particularly pages 12 through to 15.
 

TheELF

Diamond Member
Dec 22, 2012
3,967
720
126
There is no logical reason to increase IPC beyond what the average software uses.
If you run an embarrassingly parallel workload you can just use additional cores to provide more IPC and they don't even have to be on the same CPU.
If you run a normal workload and it can only use x IPC, then anything more than that is wasted, unless you forcibly run more things on that core just so you don't waste the additional IPC.
And while we do run a lot of stuff all at once on modern systems, the individual tasks are so small that they don't need hugely wide cores; most of the time people's CPUs are sitting below 10% usage, and that's already with them downclocked to something like 800 MHz.
 
  • Like
Reactions: Hulk

Cardyak

Member
Sep 12, 2018
72
159
106
There is no logical reason to increase IPC beyond what the average software uses.
If you run an embarrassingly parallel workload you can just use additional cores to provide more IPC and they don't even have to be on the same CPU.

Well yes, but there are two issues with this:

1. The majority of software is created by human beings, who can't (or won't) capture all of the work that can be parallelised and separate it into different threads. The benefit of IPC gains is that they increase the performance of a whole set of programs without any work on the programmer's part. This isn't even necessarily about laziness but more about economics and resource management. (Programmers are expensive - silicon is relatively cheap and dependable.)

2. Even the most single-threaded data dependent tasks still contain an inherent amount of parallelism that can be exploited. Multi-threaded programming doesn't really have much granularity at lower levels. Yes, each large scale task can be broken down into many threads, but even on a heavily multi-threaded workload there will still be performance left on the table without more investment in OoO Execution and Branch Prediction, etc.

In essence, splitting large abstract parts of a system into different threads is easy and attainable; splitting the small, granular parts of code into separate threads is really hard. For example:

1. c = a + b
2. w = k - 5
3. Store w -> Address 0x80
4. d = c * 8

Instructions 2 & 3 can be executed independently of 1 & 4. You can't reasonably expect a programmer to create an entirely new thread just to extract that small piece of performance, especially when you consider that code like this will appear thousands of times throughout the program.
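The same snippet in C, just to make the dependence structure explicit (the pointer parameter stands in for the store to address 0x80):

void example(int a, int b, int k, int *addr)
{
    int c = a + b;  /* 1: starts one dependency chain */
    int w = k - 5;  /* 2: independent of 1 */
    *addr = w;      /* 3: depends only on 2 */
    int d = c * 8;  /* 4: depends only on 1 */
    (void)d;        /* an OoO core can overlap {1,4} with {2,3} */
}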

I repeat my earlier statement: there is a level of parallelism in code that is almost impossible for human beings to extract, because it is too obscure to identify in high-level programming languages (or low-level ones for that matter), and even with identification, the overhead of creating extra threads and managing them at the OS level would still make this slower than simply letting the CPU process it out of order.

Given this problem, there is an enormous amount of performance being left on the table by these small parcels of code that *can* be parallelised but will always slip by human eyes.

I think IPC gains are going to be incredibly important moving forward. Encourage human beings to multi-thread the big tasks by all means, but let the CPU extract the low-level ILP.

Your statement is still correct of course: there is no logical reason to increase IPC capability beyond what most software offers - but we are an incredibly long way from that. Most top-end architectures today execute between 4-5 instructions per clock (I'm referring to designs such as Apple's Firestorm here, certainly NOT Intel). We don't know what the ceiling is yet, but what we do know is that it is orders of magnitude above this (as I referenced earlier, numerous studies have shown IPC can be pushed into the high tens or even hundreds of instructions per cycle).

We still have a long, long road of improvement ahead of us.
 

TheELF

Diamond Member
Dec 22, 2012
3,967
720
126
1. c = a + b
2. w = k - 5
3. Store w -> Address 0x80
4. d = c * 8

Instructions 2 & 3 can be executed independently of 1 & 4. You can't reasonably expect a programmer to create an entirely new thread just to extract that small piece of performance, especially when you consider that code like this will appear thousands of times throughout the program.
Yes, in fact they are so independent that they can run on different cores entirely without any performance loss...
I know you are just trying to make a point with that example, but so am I: just because you can extract ILP doesn't mean it has to run on the same core; only if there are many dependencies between results would they lose performance.
Your statement is still correct of course: there is no logical reason to increase IPC capability beyond what most software offers - but we are an incredibly long way from that. Most top-end architectures today execute between 4-5 instructions per clock
Are we though? For normal desktop class software that isn't embarrassingly parallel?!
How would one even go about judging that?!
I guess one could run performance analysis on a bunch of software and see how much IPC it actually achieves, to check whether it even reaches the 4-5 IPC current CPUs are capable of. Intel has PCM-core, which you can run in real time to show how much IPC your system is achieving at any given moment; give it a go with a heavy game or whatever you think qualifies.
 

maddie

Diamond Member
Jul 18, 2010
4,722
4,626
136
Yes, in fact they are so independent that they can run on different cores entirely without any performance loss...
I know you are just trying to make a point with that example, but so am I: just because you can extract ILP doesn't mean it has to run on the same core; only if there are many dependencies between results would they lose performance.

Are we though? For normal desktop class software that isn't embarrassingly parallel?!
How would one even go about judging that?!
I guess one could run performance analysis on a bunch of software and see how much IPC it actually achieves, to check whether it even reaches the 4-5 IPC current CPUs are capable of. Intel has PCM-core, which you can run in real time to show how much IPC your system is achieving at any given moment; give it a go with a heavy game or whatever you think qualifies.
Are you suggesting that CPU architects return to simpler cores that work on part of a thread? Sounds like reverse hyperthreading to me, where multiple cores can concurrently work on the same thread.
 

diediealldie

Member
May 9, 2020
77
68
61
There have been lots of studies within academic circles regarding IPC and ILP limits for a long time, one of my favourite papers for presenting IPC limits with clarity in a way that the layman can understand is the following paper: http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.47.9196&rep=rep1&type=pdf

If you build a perfect machine with infinite registers, infinite width/depth, perfect branch predictor, limitless caches, etc: Then you are looking at an IPC of approximately 40 in most integer workloads. However it should be noted that this is adhering to data dependencies and therefore does not include data speculation.

If you start to break data dependencies (I.E: Value Prediction) then there is technically infinite IPC, because you could *in theory* predict every instruction ahead of time and execute them all in a single cycle. However, in real terms, a value predictor will never be 100% accurate, and of course we will never have infinitely wide/deep machines. As people have mentioned, diminishing returns may put a limit on scaling and extending certain elements of a CPU, which will certainly present new challenges moving forward. But even with an ambitious (but somewhat realistic) scenario you can end up with IPC well into the hundreds.

My 2 cents for what it's worth? - Using value prediction and making incredibly wide and deep OoO machines (I'm talking reorder buffers in the thousands of entries, enormous caches and branch predictors, and a 64 wide decode engine in the front end) I'm pretty confident we can get IPC up to at least 50, and there's no real reason to think you couldn't push past even that.

With aggressive enough value prediction and hundreds of execution ports I can't see a good reason why IPC couldn't break the 100 barrier (Yes it would be incredibly complex, and it is decades and decades away, but all the theory is there)

I'd really recommend reading through the paper I linked, particularly pages 12 through to 15.

This is quite an interesting paper, thank you. But it seems the assumptions in the paper are too strong, like you said. If you can predict values (wow?!) with 100% accuracy, then who needs a CPU? We wouldn't even need ALUs.

We are already in an age in which more than 200 ROB entries are required to feed 1~2 additional ALUs in the CPU backend ports (Conroe (3 ALUs) to Skylake (4 ALUs?)). This indicates that we need to scan through more than 1 KB of code (assuming each instruction is 4 bytes) to extract 1~2 independent ALU operations. Then if we're to add 1~2 more ALUs, hmm... maybe 2000 ROB entries? Even then there will be some workloads in which no extractable ILP exists at all.

Intel's Skylake showed 2.1~3.4 instructions per cycle, depending on the workload. Assuming we put 8 ALUs + memory ports etc. into a CPU, then we'd achieve 4.2~6.8 if everything worked perfectly. But if we assume something like "each additional ALU requires 2x the ROB to be fully utilized", then we'd need ~3200 reorder buffer entries, which would require a structure 4 times as dense, 4 times as power efficient, and with 4 times more complex metal wiring. Can't really imagine 50 instructions per clock.
 

MasterofZen

Junior Member
Jul 12, 2021
15
16
41
There have been lots of studies within academic circles regarding IPC and ILP limits for a long time, one of my favourite papers for presenting IPC limits with clarity in a way that the layman can understand is the following paper: http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.47.9196&rep=rep1&type=pdf

If you build a perfect machine with infinite registers, infinite width/depth, perfect branch predictor, limitless caches, etc: Then you are looking at an IPC of approximately 40 in most integer workloads. However it should be noted that this is adhering to data dependencies and therefore does not include data speculation.

If you start to break data dependencies (I.E: Value Prediction) then there is technically infinite IPC, because you could *in theory* predict every instruction ahead of time and execute them all in a single cycle. However, in real terms, a value predictor will never be 100% accurate, and of course we will never have infinitely wide/deep machines. As people have mentioned, diminishing returns may put a limit on scaling and extending certain elements of a CPU, which will certainly present new challenges moving forward. But even with an ambitious (but somewhat realistic) scenario you can end up with IPC well into the hundreds.

My 2 cents for what it's worth? - Using value prediction and making incredibly wide and deep OoO machines (I'm talking reorder buffers in the thousands of entries, enormous caches and branch predictors, and a 64 wide decode engine in the front end) I'm pretty confident we can get IPC up to at least 50, and there's no real reason to think you couldn't push past even that.

With aggressive enough value prediction and hundreds of execution ports I can't see a good reason why IPC couldn't break the 100 barrier (Yes it would be incredibly complex, and it is decades and decades away, but all the theory is there)

I'd really recommend reading through the paper I linked, particularly pages 12 through to 15.
Thanks for the explanation. It's very insightful.
 

JoeRambo

Golden Member
Jun 13, 2013
1,814
2,105
136
We are already in an age in which more than 200 ROB entries are required to feed 1~2 additional ALUs in the CPU backend ports (Conroe (3 ALUs) to Skylake (4 ALUs?)). This indicates that we need to scan through more than 1 KB of code (assuming each instruction is 4 bytes) to extract 1~2 independent ALU operations.

It does not work that way; it is not just 1-2 values extracted. Even if the first 100 instructions in the ROB are stalled waiting for operands (be it from memory or from other instructions), the CPU can still find work ahead if the ROB size allows. Exaggerated example: your next 500 operations could be a simple loop incrementing registers and writing out the values to memory at the end.
So the CPU will happily process them, and once the initially stalled instructions complete, the loop calculations could already be well under way.
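A rough sketch of that situation (the names and the array are arbitrary, just for illustration):

/* The first load may miss all the way to DRAM and stall for hundreds
 * of cycles. The loop after it doesn't depend on x at all, so a core
 * with a big enough ROB can keep fetching and executing the loop
 * while the miss is still outstanding. */
long run_ahead(const long *slow_ptr, long *counters, int n)
{
    long x = *slow_ptr;          /* potentially a long-latency cache miss */

    for (int i = 0; i < n; i++)  /* independent work the OoO engine */
        counters[i]++;           /* can pull ahead and finish meanwhile */

    return x + counters[0];      /* x is only consumed here */
}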

That is the beauty of out-of-order execution. Obviously it lives and dies with state-of-the-art branch prediction and various other tricks, but I think Apple has already shown that there is at least 50-100% more IPC to extract with today's technology versus what we have in x86 now.
 
  • Like
Reactions: Magic Carpet

TheELF

Diamond Member
Dec 22, 2012
3,967
720
126
Are you suggesting that CPU architects return to simpler cores that work on part of a thread? Sounds like reverse hyperthreading to me, where multiple cores can concurrently work on the same thread.
What!? No! Of course not!
I'm saying that the two lines can be a separate thread that doesn't need to run on the same core.