
Will we ever see a step change in IPC from Intel?


sefsefsefsef

Senior member
Outside of totally disruptive technologies which we all know about (most notably 3D stacking), there is still room for improving IPC of an individual, traditional CPU core. We're really running out of room for extracting ILP from a program (this is a fundamental mathematical problem that has to do with how we write programs), but there are some other sources of slow-down which we might be able to overcome. Three main sources of slowdown are: cache miss on something that used to be cached but got evicted, not prefetching something that could have been prefetched, and missing a branch prediction. Modern CPUs still leave a lot on the table when it comes to intelligent caching and prefetching, IMO.
 

Cerb

Elite Member
Maybe I should try rephrasing what I'm fundamentally trying to understand:

Why are Intel's IPC increases sort of iterative at this point? Is there a technological reason they can't improve faster? If there is a ceiling, why does that ceiling exist, fundamentally?
All the low-hanging fruit is gone. Totally gone. Increases in speed require more power. Bigger data structures require superlinear space, and thus cost and power. Part of the reason we even had the P4 and IA-64 was that some very smart people genuinely believed that today's CPUs, with their mildly increasing IPC, were going to be impossible to design.

Ultimately, today, IPC comes down to memory. Maybe 4 cycles for L1 (I think that's what Haswell has), ~10-20 for L2 (can't recall exactly), 30+ for L3 (varies with clock speed), maybe 100+ for RAM (RAM varies a lot, and that variability has only increased with power saving in modern CPUs). Those are cycles where, on a miss, nothing gets done by that thread. It doesn't take too many cache misses to ruin the effective IPC of what looks like fine, fast code. New, disruptive, commercially viable RAM technology is a requirement for any major IPC increases in the future.
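
Back-of-the-envelope, the effect looks something like this (a rough sketch; the 2-IPC no-stall baseline, the per-level latencies, and the miss rates are all just illustrative assumptions, not measurements):

```python
# Rough model of how cache misses eat effective IPC.
# All numbers are illustrative: ballpark Haswell-ish latencies, assumed miss rates.
BASE_IPC = 2.0                          # assumed IPC when the core never stalls on memory
L2_LAT, L3_LAT, RAM_LAT = 12, 35, 200   # extra cycles paid on a miss to each next level

def effective_ipc(l1_miss, l2_miss, l3_miss, mem_ops_per_inst=0.3):
    """Effective IPC = 1 / (base CPI + average stall cycles per instruction)."""
    stalls_per_mem_op = (l1_miss * L2_LAT
                         + l1_miss * l2_miss * L3_LAT
                         + l1_miss * l2_miss * l3_miss * RAM_LAT)
    cpi = 1.0 / BASE_IPC + mem_ops_per_inst * stalls_per_mem_op
    return 1.0 / cpi

print(round(effective_ipc(0.00, 0.0, 0.0), 2))   # 2.0  -- everything hits in L1
print(round(effective_ipc(0.05, 0.2, 0.5), 2))   # ~0.92 -- a few misses more than halve it
```

That's the sense in which code that looks fine and fast can still end up memory-bound.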

Apple could do so much because (a) they bought an all-star design team, (b) Apple knows R&D spending matters, so they try not to be too stingy with it, (c) ARM's own CPU designs suffer from design-by-committee inefficiencies, and (d) Apple didn't get stingy with die space, giving the whole thing plenty of cache. Qualcomm, for instance, has kept rolling its own cores to optimize the hardware for its target markets, and was doing so since before Apple was. Broadcom used to do this too, but I haven't heard much from them lately involving custom cores.
 

Cerb

Elite Member
The first thing that comes to my mind is that 6-issue is better than 4-issue but obviously nothing is ever that simple. What is the benefit of going "wider"? What is the drawback? Are there specific workloads where one is better than the other?
Look at the diagrams above. Each port isn't doing the same thing as every other port. Adding one is done based on expecting a certain mix of instructions with data ready to go at any given time. Wider is better because even low-IPC work tends to be bursty. One cycle it might be able to issue 5 instructions, and then for the next 8 cycles, none (an exaggeration, but it makes the point). A core wide enough to issue all 5 at once finishes that small batch in 9 cycles (~0.55 IPC); a 2-issue core needs 11 (~0.45 IPC); a 4-issue core, 10 (0.5 IPC). Small gains, but they can translate into a few percent of total performance if the hot code (i.e., the 1% that runs the most) can make good use of the width. But what, specifically, each of those ports can execute matters, since they're basically never symmetrical.
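
Spelling that arithmetic out (same exaggerated 5-instructions-then-8-idle-cycles burst as above, best case only):

```python
import math

def burst_ipc(issue_width, burst=5, idle_cycles=8):
    """Best-case IPC for a burst of ready instructions followed by idle cycles."""
    issue_cycles = math.ceil(burst / issue_width)
    return burst / (issue_cycles + idle_cycles)

for width in (2, 4, 6):
    print(f"{width}-issue: {burst_ipc(width):.2f} IPC")
# 2-issue: 5/11 ~= 0.45, 4-issue: 5/10 = 0.50, 6-issue (or anything >= 5): 5/9 ~= 0.56
```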

In Intel's case, they also have to worry about what two threads are doing at the same time.

Theoretically, single-issue at an ever-faster clock speed would be best, as it wouldn't depend on the code having any ILP. But, well, NetBurst. With clock speeds limited by power, and customers caring about power use, that's just not feasible.
 

Exophase

Diamond Member
Modern CPUs still leave a lot on the table when it comes to intelligent caching and prefetching, IMO.

How would you make caching more intelligent? You can make it larger, add more levels, add more associativity, and so on, but that falls more under the category of working harder than working smarter. I think it's telling how Intel has basically left the L1 and L2 cache designs alone since Nehalem, not counting the uop cache.

I guess you could do things like skewed associativity, or having separate caches for the stack, maybe a separate cache for FP/vector. Not sure how much any of those things buys you.

As for prefetching, I'd love to comment, only I don't think Intel really divulges it in great detail (but they do seem to be doing a lot here). I think the situation is a lot like branch prediction: you can try to improve it to cover slightly more cases, but there will always be a set of cases that are impossible to predict, and you'll hit ever-diminishing returns approaching it.
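
To illustrate the "some cases are just unpredictable" point, here's the textbook 2-bit saturating counter predictor (a toy sketch, nothing like what Intel actually ships): it nails a simple loop branch but gets a perfectly alternating branch wrong every single time.

```python
class TwoBitPredictor:
    """Classic 2-bit saturating counter per branch PC: 0-1 predict not-taken, 2-3 predict taken."""
    def __init__(self):
        self.counters = {}                       # branch PC -> counter, starts weakly not-taken

    def predict(self, pc):
        return self.counters.get(pc, 1) >= 2

    def update(self, pc, taken):
        c = self.counters.get(pc, 1)
        self.counters[pc] = min(3, c + 1) if taken else max(0, c - 1)

patterns = {
    "loop (T x9, then NT)": [True] * 9 + [False],
    "alternating T/NT":     [True, False] * 5,
}
for name, history in patterns.items():
    pred, hits = TwoBitPredictor(), 0
    for taken in history:
        hits += pred.predict(0x400) == taken
        pred.update(0x400, taken)
    print(f"{name}: {hits}/{len(history)} correct")   # 8/10 vs 0/10
```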
 

Homeles

Platinum Member
Is it really confirmed 100% that Cyclone is 6-issue wide? I read somewhere that the tests Anand did to confirm it are somehow flawed, but I lack the programming expertise to really investigate it myself. Even Haswell's front end is only 4-wide, so Cyclone is 50% wider than Haswell and on par with an old Itanium? The new one is 12-wide, but it's not really comparable given the very different approach between those uArchs.
A potential issue is that just because something can issue 6 operations per cycle doesn't mean it's capable of sustaining that. If what Exophase said is true, this wouldn't be a problem that the A7 suffers from.
 

Lepton87

Platinum Member
A potential issue is that just because something can issue 6 operations per cycle doesn't mean it's capable of sustaining that. If what Exophase said is true, this wouldn't be a problem that the A7 suffers from.

If it can issue 6 ops in one cycle, and the instruction mix stays the same, is it even possible that it can't sustain that, however rare such an instruction mix might be?
 

sefsefsefsef

Senior member
How would you make caching more intelligent? You can make it larger, add more levels, add more associativity, and so on, but that falls more under the category of working harder than working smarter. I think it's telling how Intel has basically left the L1 and L2 cache designs alone since Nehalem, not counting the uop cache.

I guess you could do things like skewed associativity, or having separate caches for the stack, maybe a separate cache for FP/vector. Not sure how much any of those things buys you.

As for prefetching, I'd love to comment, only I don't think Intel really divulges it in great detail (but they do seem to be doing a lot here). I think the situation is a lot like branch prediction: you can try to improve it to cover slightly more cases, but there will always be a set of cases that are impossible to predict, and you'll hit ever-diminishing returns approaching it.

Caching has many opportunities for intelligence. For starters, you can choose whether or not data should be cached in the first place (separate from explicit no-cache instructions). Next, you can choose how much eviction resistance you want a cache block to have (cache insertion policies). You can also decide when to evict something (dead-block prediction, LRU, or RRIP strategies). Here's one of my favorite innovative caching policies:

http://users.ece.cmu.edu/~omutlu/pub/eaf-cache_pact12.pdf
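
For a flavor of what an insertion policy with "eviction resistance" looks like (the EAF cache in that paper is more sophisticated; this is just a toy SRRIP-style set with 2-bit re-reference counters):

```python
class SRRIPSet:
    """Toy SRRIP-style cache set: each block carries a 2-bit re-reference
    prediction value (RRPV). Hits get strong protection (RRPV=0), new blocks
    get weak protection (RRPV=2), and the victim is any block at RRPV=3."""
    MAX_RRPV = 3

    def __init__(self, ways=4):
        self.ways = ways
        self.blocks = {}                                 # tag -> RRPV

    def access(self, tag):
        if tag in self.blocks:                           # hit: promote to most protected
            self.blocks[tag] = 0
            return True
        if len(self.blocks) >= self.ways:                # miss in a full set: find a victim
            while not any(v == self.MAX_RRPV for v in self.blocks.values()):
                self.blocks = {t: v + 1 for t, v in self.blocks.items()}   # age everyone
            victim = next(t for t, v in self.blocks.items() if v == self.MAX_RRPV)
            del self.blocks[victim]
        self.blocks[tag] = self.MAX_RRPV - 1             # insert with weak protection
        return False
```

A streaming scan has to age the whole set before it can push out a block that has actually been re-used, unlike plain LRU, where a scan flushes everything.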

Prefetching is another rich area of research, even though it's been "well studied" for decades. AMPM prefetching was published 5 years ago, and even though it's probably un-implementable in hardware, it achieves very high performance in simulation, and there have been other prefetchers since then that are implementable and match or beat its performance. Also, here's an irregular-data prefetcher I saw at a conference last year that I like a lot:

https://www.cs.utexas.edu/~lin/papers/micro13.pdf
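
Neither of those is a simple stride prefetcher, but the classic stride scheme shows the basic shape of the idea (a minimal sketch; the table format, confidence threshold, and prefetch degree are made up): watch the address stream per load PC and fetch ahead once the stride looks stable. Irregular, pointer-chasing access patterns are exactly what this kind of thing can't catch, which is why papers like the one above exist.

```python
class StridePrefetcher:
    """Per-load-PC stride detector: after seeing the same non-zero stride twice
    in a row, prefetch the next `degree` addresses along that stride."""
    def __init__(self, degree=2):
        self.table = {}                   # load PC -> (last address, last stride, confidence)
        self.degree = degree

    def observe(self, pc, addr):
        last_addr, last_stride, conf = self.table.get(pc, (addr, 0, 0))
        stride = addr - last_addr
        conf = conf + 1 if stride != 0 and stride == last_stride else 0
        self.table[pc] = (addr, stride, conf)
        if conf >= 2:                     # stride looks stable: issue prefetches
            return [addr + stride * i for i in range(1, self.degree + 1)]
        return []

pf = StridePrefetcher()
for addr in range(0x1000, 0x1200, 64):    # a streaming loop over 64-byte lines
    print(hex(addr), [hex(a) for a in pf.observe(pc=0x400123, addr=addr)])
```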

The point is that caching and prefetching are very active areas of research, and I believe (although I have no actual knowledge of this) that real-world CPUs are not using anything significantly better than what exists in the literature, which is constantly improving. My reason for thinking real-world CPUs don't hold much secret magic is that a lot of the published research actually comes out of the research arms of these CPU companies, or is done in collaboration with them.
 

Homeles

Platinum Member
If it can issue 6 ops in one cycle, and the instruction mix stays the same, is it even possible that it can't sustain that, however rare such an instruction mix might be?
Exophase might be able to answer better, as could Cerb and sef. I'm not a software guy, so this subject isn't exactly my strength...
 

witeken

Diamond Member
Maybe I should try rephrasing what I'm fundamentally trying to understand:

Why are Intel's IPC increases sort of iterative at this point?
Because there's no point in making a new architecture from scratch every 2 years. This is how it's been done for a long time.

Is there a technological reason it can't improve faster?
Transistor budget and power budget. For instance, Intel doesn't want performance/watt to go down, so a new feature mustn't increase power by more than half of the performance increase it delivers.
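
As a concrete reading of that rule of thumb (my paraphrase, not an official Intel formula): roughly 2% of performance per 1% of power, or the feature doesn't go in.

```python
def feature_passes(perf_gain_pct, power_cost_pct):
    """Rule of thumb as described above: a feature may cost at most half as much
    power (in %) as the performance (in %) it adds."""
    return power_cost_pct <= perf_gain_pct / 2.0

print(feature_passes(4.0, 1.5))   # True:  4% faster for 1.5% more power
print(feature_passes(2.0, 1.5))   # False: 2% faster for 1.5% more power doesn't make the cut
```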

If there is a ceiling, why does that ceiling exist, fundamentally?
Single-threaded performance is bottlenecked by clock speed, because there's only so much ILP and other low-hanging fruit left to extract.

Or is the company not really under any pressure to innovate at a more rapid clip, given that they own the PC space and we really just have to buy what they sell?
They are under pressure. But there are other things besides pure performance, like the ISA and battery life, that improved in Haswell.

I ask because we saw some pretty spectacular gains from Apple between the A6 and the A7. Could Apple get a massive step up in the A8 vs. the A7 as well, or are they also going to run into an upper limit for a given TDP?
Smartphones were a new market, so there was a lot of room for improvements. At the same time, power usage skyrocketed.
 

ShintaiDK

Lifer
The first thing that comes to my mind is that 6-issue is better than 4-issue but obviously nothing is ever that simple. What is the benefit of going "wider"? What is the drawback? Are there specific workloads where one is better than the other?

Utilization is the biggest problem. Back in the Conroe era, Intel said that the 4th issue port added around 5% to total performance. So unless the rest improves, including software, it's idle transistors most of the time.