Try to predict the branch, or load both branches?

niinja2

Junior Member
Jan 22, 2013
2
0
0
I'm certain someone has already come up with this, but what the heck.

There has been a lot of talk about the branch misprediction penalty being one of the main problems for Bulldozer's performance.

A CPU's branch predictor tries to predict which branch is going to be executed, so it loads the most probable one into one pipeline/core.

Why doesn't it load both possible branches into 2 different pipelines/cores, then use the data from the pipeline/core that turned out to be the right one, and flush the other?




Imagine you work in a vehicle factory and you get an order that you need to make a car or a truck, but you are not sure which one. You could load one assembly line with parts for whichever vehicle you find most probable. But you could also load 2 assembly lines, one with car parts and the other with truck parts, and when the final decision comes on which vehicle is to be made, you just finish the product on the line you need and remove the parts from the other assembly line.

This solution would need more power, since it would load 2 or more cores with only one thread, but it would negate the branch misprediction penalty that any CPU has, since there would be no missed branches, and would thus improve single-thread performance. It would improve execution of any thread that has branch instructions, as long as there are enough resources. Imagine running a 2-3 thread application on an 8-core CPU loaded to 100%.

It would increase performance, but it would use more power for the same task. Maybe the user could choose whether to use a branch-predict mode, which would be power optimized, or a load-2-cores mode, which would be performance optimized.

Lord knows AMD could use more single-threaded performance, and their 8-core CPUs have more than enough resources to execute the task. Those cores are not even 50% used most of the time.

My question is, why isn't this used in modern CPU architectures? Or is it? Can it be implemented in software?
 

TuxDave

Lifer
Oct 8, 2002
10,571
3
71
Disclaimer: Relative to my coworkers, I'm terrible at computer architecture.

Possible problems:
1) If you choose to create the 2nd possibility on another thread or core, you would have to stall that core, retire and copy all the existing registers/cache from one core/thread to the other before you get to start again.
2) To mitigate that, you could execute the 2nd possibility on the same core/thread but then you definitely hit resource constraints.

Branch prediction is pretty good these days. Maybe like 90% (random number). So 10% of the time you'll be wasting your time and hurting performance; 90% of the time you're doing the right thing. With the "do both" approach, you're 100% guaranteed to be wasting time and resources on one of the paths. :)
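
To put toy numbers on that (completely made up, just to show the tradeoff):

Code:
#include <stdio.h>

int main(void) {
    /* Completely made-up numbers: 90% prediction accuracy and a
       15-cycle flush penalty. */
    double accuracy      = 0.90;
    double flush_penalty = 15.0;

    /* Predicting: on average you only pay the flush on the guesses
       you get wrong. */
    double avg_penalty = (1.0 - accuracy) * flush_penalty;
    printf("average penalty per branch when predicting: %.1f cycles\n",
           avg_penalty);

    /* Doing both: you burn a whole second path's worth of issue slots,
       cache space and power on every branch, right or wrong. */
    return 0;
}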

Anyways, I have heard of this method of handling branches, I just haven't read up on the real pros and cons.
 

ChronoReverse

Platinum Member
Mar 4, 2004
2,562
31
91
The branch misprediction rate of Sandy Bridge is like 2%. You're probably better off filling the extra execution units with actual work, AKA SMT (Hyperthreading).
 

Exophase

Diamond Member
Apr 19, 2012
4,439
9
81
Without in-depth analysis, including comparison with other uarchs, one can't really say that branch misprediction penalty is one of the biggest weaknesses on Bulldozer.

Some CPUs offer a mechanism called predication, which lets you conditionally execute instructions. With predication you can, in software, convert a conditional to something branch-less. There are also ways to do this without hardware predication, with varying amounts of extra overhead.
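
As a trivial example, here's the same select done with a branch and done branch-free in plain C (just a sketch; whether the compiler actually emits a cmov or a predicated instruction depends on the target and optimization settings):

Code:
#include <stdio.h>

int main(void) {
    int a = 7, b = 3;

    /* Branchy version: the CPU has to predict which way this goes. */
    int max_branchy;
    if (a > b)
        max_branchy = a;
    else
        max_branchy = b;

    /* Branch-free version: compute the condition as 0/1 and select with
       arithmetic. Compilers will often turn either form into a conditional
       move (cmov on x86) or a predicated instruction on ISAs that have
       them, so there's nothing left to mispredict. */
    int cond = (a > b);
    int max_branchless = cond * a + (1 - cond) * b;

    printf("%d %d\n", max_branchy, max_branchless);
    return 0;
}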

The problem is, it's an oversimplification to think of branches as going from one path to two paths. What happens when the two branch paths then themselves contain branches? The number of paths increases exponentially. Eventually these paths tend to merge, but the path count can get really wide before this happens. There are also indirect branches, which are multi-way - these will contribute to branch misprediction too, but you generally don't want to try to convert these to executing all paths in parallel.

The hardware would need to know when it's appropriate to do this, but its ability to do so would be limited.

Looking deeper into what you're suggesting, it sounds like it'd have problems really working on Bulldozer's CMT design. The two cores in the module have completely separate register files. So to split one thread between the two of them you must copy the registers. This is a fairly big deal. Also, the two cores have separate L1 dcaches. While they're coherent with each other they're not going to be optimized to share the same data. Writes (both to registers and memory) would be nullified so you wouldn't have to worry about those, but you'd effectively still have to copy over L1 reads that are stale. In the best case you'd be negating the benefit of having two separate caches. In the worst case it'd mean invalidating and going to L2 cache instead.

You have to balance these costs with the average branch misprediction cost. I'm assuming that the hardware wouldn't even try such a thing unless the branch is flagged as having a history of being difficult to predict, and is a direct branch. Still, a branch that mispredicts half the time (which tends to be around the worst case) would result in an average penalty of maybe 10 cycles.

So the way I see it, if you have CMT like in Bulldozer you have to worry about coherency overhead. If you do it with conventional SMT you're increasing pressure on execution resources that could have gone to the single thread. In all cases you're burning more energy. Definitely some major compromises here.

It does look like the latest POWER CPUs are now performing dynamic predication for small branches that merely skip the next instruction, if they're marked poorly predictable. This is one of the best use cases for predication: one-armed if statements over very small bodies.
 

Ventanni

Golden Member
Jul 25, 2011
1,432
142
106
Why doesn't it load both possible branches into 2 different pipelines/cores, then use the data from the pipeline/core that turned out to be the right one, and flush the other?

It's an interesting and valid idea, but you're literally talking about doubling the total die area and power consumption for, at best, <1-10% performance gain. I'm sure this has crossed Intel and AMD microarchitecture engineers' minds at some point during an after-work event involving heavy drinking, but it's probably the least efficient way to improve performance. I'm sure there's also a high memory overhead here just to check and compare the data; constantly comparing each core's results is going to take extra clock cycles too. And secondly, what if both cores were right? You've effectively just doubled the power consumption for nothing.

Also, from my understanding, the front-end portion of the CPU responsible for decoding, branch prediction, etc. is one of the highest power-consuming parts of the overall architecture. Designing branch predictors to be more efficient and accurate is likely far more power efficient and effective in the long run than simply doubling them. Part of the reason ARM cores are typically more power friendly than x86 (something that's starting to change) is that they have a significantly less capable front end.

I could be wrong on that last part.

With that said, isn't AMD planning on doubling up part of the front end for Steamroller or something?
 

Turbonium

Platinum Member
Mar 15, 2003
2,157
82
91
Guys, guys...

1 + 1 = 2

I is CPU? :D

I wish I knew what you guys were talking about, lol. I've read descriptions of how a CPU works before (including on AT), but with little success (I guess that's what I get for being a biology major). But that was years ago. Perhaps I'll try it again now and see how far I get.

Carry on. The nerd is strong with this one.
 

pm

Elite Member Mobile Devices
Jan 25, 2000
7,419
22
81
This is a form of speculative execution - http://en.wikipedia.org/wiki/Speculative_execution

I worked on Itanium for a long while, and it had a couple of forms of speculation, control and data, plus predication (which isn't the same thing but is a close relative).

See:
http://www.siliconintelligence.com/people/binu/pubs/vliw/node14.html

For example, in IA64 you can do a speculative load, which is a load executed before the branch, whose result is then tossed if that branch isn't taken or an exception occurs.
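
In plain C terms the idea looks roughly like this (just a sketch of the concept with made-up names, not actual IA-64 code):

Code:
#include <stdio.h>

static int table[4] = {10, 20, 30, 40};

/* Without speculation: the load can't start until the branch resolves. */
int lookup_plain(int i, int use_it) {
    if (use_it)
        return table[i];
    return 0;
}

/* With control speculation, conceptually: the load is hoisted above the
   branch so it can start early, and the value is simply thrown away if the
   branch goes the other way. On IA-64 the speculative load also defers any
   fault until the value is actually checked and used. */
int lookup_speculative(int i, int use_it) {
    int v = table[i];   /* hoisted, possibly useless, load */
    if (use_it)
        return v;
    return 0;           /* speculative result discarded */
}

int main(void) {
    printf("%d %d\n", lookup_plain(2, 1), lookup_speculative(2, 0));
    return 0;
}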
 

A5

Diamond Member
Jun 9, 2000
4,902
5
81
Guys, guys...

1 + 1 = 2

I is CPU? :D

I wish I knew what you guys were talking about, lol. I've read descriptions of how a CPU works before (including on AT), but with little success (I guess that's what I get for being a biology major). But that was years ago. Perhaps I'll try it again now and see how far I get.

Carry on. The nerd is strong with this one.

Branch prediction in general isn't too complicated as a concept (though understanding what an execution pipeline is helps):

A lot of computer code is based on checking the current state of something and then tailoring the next bit of execution based on that result. The problem is that modern architectures have that check occur before the previous few instructions are finished executing.

Luckily, the CPU can use a variety of methods to try to make an informed decision about which direction it will have to go before those previous instructions finish. A correct guess means that the CPU can just keep working as-is, while a wrong guess means the CPU has to flush all the data after the guess (since it'll be a few more instructions ahead by the time it can actually verify its guess) and start over.
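
A toy example of the kind of check I mean (hypothetical C with made-up data):

Code:
#include <stdio.h>

int main(void) {
    int data[8] = {1, 5, 2, 9, 3, 8, 4, 7};   /* made-up values */
    int sum = 0;

    for (int i = 0; i < 8; i++) {
        /* The CPU reaches this check before earlier iterations have fully
           finished, so it guesses taken/not-taken and keeps going. A wrong
           guess means flushing the work done past this point. On random
           data this branch is hard to predict; on sorted data it becomes
           nearly free. */
        if (data[i] > 4)
            sum += data[i];
    }

    printf("%d\n", sum);
    return 0;
}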
 

Idontcare

Elite Member
Oct 10, 1999
21,110
64
91
Welcome to the forums niinja2 :thumbsup:

Why doesn't it load both possible branches into 2 different pipelines/cores, then use the data from the pipeline/core that turned out to be the right one, and flush the other?

The gist of your proposal is just one variant of a generalized class of computing models that are referred to as "speculative processing" or "speculative execution".

Speculative execution is a performance optimization. The main idea is to do work before it is known whether that work will be needed at all, so as to prevent a delay that would have to be incurred by doing the work after it is known whether it is needed. If it turns out the work wasn't needed after all, the results are simply ignored.

The specific aspect of speculative execution that you are speaking to is what is called "eager execution" or "oracle execution" (oracle because it is "predicting" the future so to speak):

Eager execution is a form of speculative execution where both sides of the conditional branch are executed, however the results are committed only if the predicate is true. With unlimited resources, eager execution (also known as oracle execution) would in theory provide the same performance as perfect branch prediction. With limited resources eager execution should be employed carefully since the number of resources needed grows exponentially with each level of branches executed eagerly.
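
In software terms the idea caricatures like this (a toy C sketch with made-up function names; real eager execution happens in hardware with duplicated pipeline state, not as two explicit calls):

Code:
#include <stdio.h>

/* Toy stand-ins for the work on each side of a branch. */
static int work_if_taken(int x)     { return x * 2; }
static int work_if_not_taken(int x) { return x + 100; }

int main(void) {
    int x = 21;
    int predicate = (x > 10);

    /* Eagerly do the work for both arms... */
    int result_taken     = work_if_taken(x);
    int result_not_taken = work_if_not_taken(x);

    /* ...but commit only the result whose predicate turned out true. */
    int committed = predicate ? result_taken : result_not_taken;
    printf("%d\n", committed);
    return 0;
}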

Speculative processing does increase performance, but at the cost of increasing power-consumption and die-size.

The challenge is striking the right balance between the increase in power, die-size, and performance.

Doubling the resources (unfettered dual branch execution) for maybe a 2-3% performance gain is probably not a good use of one's transistor budget, nor would it be good for one's power budget and performance/watt metrics.

And since clockspeeds tend to be TDP or temperature limited, the extra power consumed by speculative processing might actually harm performance if it means clocking both branches at lower clockspeeds (think turbo bins and the reason they are only used in lightly-threaded situations).

Speculative processing is a great way to improve performance and IPC if that is the ultimate goal and you can optimize for it in an unconstrained fashion. It is the sort of stuff you'd expect to see pop up in an IBM Power processor ;) :D

It may end up being put to work down the road (5nm or 7nm) when xtors are so abundant that people quite literally have no idea what to do with them besides adding another 128MB of on-die cache.

For now, the cost-adder (die size) and the power footprint make such approaches prohibitive.

edit: LOL, spent too long typing this I see, pm deftly beat me to it! :p
 

niinja2

Junior Member
Jan 22, 2013
2
0
0
Thank you for the responses, guys, they are informative.

Secondly, I just found what I was looking for in posts from a few of you, but before that I wrote a reply for 2 hours and I don't want it to go to waste, so I'll just post it. (Must remember to read ALL of the replies :D before I reply back.)

The problem is, it's an oversimplification to think of branches as going from one path to two paths. What happens when the two branch paths then themselves contain branches? The number of paths increases exponentially. Eventually these paths tend to merge, but the path count can get really wide before this happens. There are also indirect branches, which are multi-way - these will contribute to branch misprediction too, but you generally don't want to try to convert these to executing all paths in parallel.

Yes, things grow exponentially if there are branches that contain branches. One thread could theoretically load 100% of an 8-core CPU. But if you have idle cores and you want to run your program faster, and if you value performance over power consumption, this is a better option. In cases where performance is the priority, it's a good tradeoff.
In cases like Cinebench, where you are already using almost 100% of the CPU without it, this would not be desirable. Maybe a user could be given a choice of which mode the CPU should use: performance or power optimized?

Also, the two cores have separate L1 dcaches. While they're coherent with each other they're not going to be optimized to share the same data. Writes (both to registers and memory) would be nullified so you wouldn't have to worry about those, but you'd effectively still have to copy over L1 reads that are stale. In the best case you'd be negating the benefit of having two separate caches. In the worst case it'd mean invalidating and going to L2 cache instead.

I'm not quite familiar with every term you are using, like "memory overhead", but I guess that working one thread on two different pieces of hardware would cause slowdowns due to synchronization or whatever.

Bulldozer already has 1 instruction cache for 2 cores; if they made 1 data cache for 2 cores, would this negate the stalling? This way you duplicate data and instructions only within internal CPU registers, and I think things go a lot faster there.
It's common for single-threaded applications to execute on all cores of a CPU and not only one. If it's so taxing to spread one thread over virtually all cores, why does the program do it?


Still, a branch that mispredicts half the time (which tends to be around the worst case) would result in an average penalty of maybe 10 cycles.

doubling the total die area and power consumption for, at best, <1-10% performance gain.

Doubling the resources (unfettered dual branch execution) for maybe a 2-3% performance gain is probably not a good use of one's transistor budget, nor would it be good for one's power budget and performance/watt metrics.

If a CPU misses 2% of instructions due to branching, and the penalty for each miss is 10 cycles, it means you execute 98 out of 100 instructions in 98 cycles, and 2 instructions in 2*10=20 cycles, since every time you miss a branch you have to wait 10 cycles for instructions to start popping out of the pipeline again. 98+20=118. So 2% missed branches means 118 cycles, and 0% missed branches means 100 cycles. That's 18% faster. This is a rough estimate and I might be wrong, but I doubt it. Bulldozer has a penalty of 20 cycles, and that would mean 128 cycles, or a 28% speedup.
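
Put as a tiny program, the same rough model with the same assumed numbers looks like this:

Code:
#include <stdio.h>

int main(void) {
    /* The rough model above, with my assumed numbers: a baseline of 1
       instruction per cycle, some fraction of instructions being
       mispredicted branches, and a fixed penalty per miss. These are
       assumptions, not measurements. */
    double miss_fraction = 0.02;   /* 2 misses per 100 instructions */
    double penalty       = 10.0;   /* cycles lost per miss */

    double cycles = (100.0 - 100.0 * miss_fraction)
                  + 100.0 * miss_fraction * penalty;
    printf("cycles per 100 instructions: %.0f\n", cycles);   /* 118 */
    return 0;
}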

The Opteron 6276 (Bulldozer) should perform 33% better than the Opteron 6176 (K10), since they are kept at the same clocks.

These are the expectations.

http://www.anandtech.com/show/5057/the-bulldozer-aftermath-delving-even-deeper/7
"and we also kept the clock speed the same to focus on the architecture. Where a 30-35% performance increase is good, anything over 35% indicates that the Bulldozer architecture handles that particular sort of software better than Magny-Cours."

These are the results in applications with hard-to-predict branching.

http://www.anandtech.com/show/5057/the-bulldozer-aftermath-delving-even-deeper/8
"Closer inspection shows that four benchmarks regress. The regression appears to be small in most benchmarks (7 to 14%), but remember that we have 33% more cores. Even a small regression of 7% means that we are losing up to 30% of the previous architecture's single-threaded performance!"

http://www.anandtech.com/show/5057/the-bulldozer-aftermath-delving-even-deeper/12
"plus the fact that the worst performing subbenches in SPEC CPU2006 int are the ones with hard to predict branches,"

Even though the branch penalty might not be the only cause of low performance in these 4 benchmarks, it is the most probable one, since they have hard-to-predict branches. Bulldozer is expected to perform at 133% of K10, and it is performing at 85% of K10's performance. 133/85 = 1.56. Expected performance is 56% higher than the real one, so Bulldozer is missing 56% of its potential here.



"The green bars show the performance improvement going from single to dual core, the red bars show the performance improvement from having the benchmark and its data stored entirely within L1 cache (no cache misses) and finally the yellow bars show the performance improvement due to the use of the Mitosis compiler/hardware modifications with a dual core CPU. As you can see, offering in many cases a 2 - 3x performance improvement is nothing short of impressive."



I'm no expert and I know things are a lot more complicated than my calculations, but it seems to me the real performance penalty is way more than 10%, depending on the density and the type of the branches, of course.

As for doubling the die size, that is not necessary. This could be used only for lightly threaded applications. We already have 8-core CPUs that are not used over 50% most of the time.


So the way I see it, if you have CMT like in Bulldozer you have to worry about coherency overhead. If you do it with conventional SMT you're increasing pressure on execution resources that could have gone to the single thread. In all cases you're burning more energy. Definitely some major compromises here.

Hm, so loading one cache with data from 2 threads would reduce single-thread performance. Well, Bulldozer needs a bigger cache anyway :D.

The other compromise is burning more energy, but I think people who buy AMD's products don't really care about power consumption :D. And I think the average gamer would not mind if his 8-core CPU used 50 watts more power so his game could run 10, 20 or 30% faster.

Could this be done to mitigate cache misses? Run one thread on, let's say, 4 cores. If you get a cache hit on one core and a miss on the other 3, you continue the thread from the "hit" core and copy the new data from it into the rest of the cores. This would effectively increase cache size 4 times. It would also use more power, and would not be preferable where the application would use 100% of the CPU without it.
 

BrightCandle

Diamond Member
Mar 15, 2007
4,762
0
76
The cost of going from one core to another is going to rival the branch misprediction cost, unfortunately. Copying data back and forth between cores isn't really practical when it's unlikely to be faster than an L3 access (30+ cycles). Using resources within the core in parallel is certainly possible, but they are actually already in use: the predicted branch is already being speculatively executed using the core's instruction-level parallelism.

The operating system would really need to know what was going on as well, because it is happily trying to schedule tasks on these CPUs and thinks it's in full control.

I doubt there are any practical performance benefits on offer from doing it. The number of tricks we can play is rapidly approaching zero.
 

Exophase

Diamond Member
Apr 19, 2012
4,439
9
81
Bulldozer already has 1 instruction cache for 2 cores; if they made 1 data cache for 2 cores, would this negate the stalling? This way you duplicate data and instructions only within internal CPU registers, and I think things go a lot faster there.
It's common for single-threaded applications to execute on all cores of a CPU and not only one. If it's so taxing to spread one thread over virtually all cores, why does the program do it?

Having separate L1 data caches is central to the entire Bulldozer CMT design. It lets you make the caches smaller and faster. There's a reason that Intel even uses exclusive L2 caches. Instruction cache access was deemed far enough outside the critical path to be okay to share.

If you have two independent integer cores, even if you share the L1 data cache, you still need to transfer the entire internal state. That means the physical register file, which is huge, plus all of the mappings, buffers, etc., or you get a stall when the code starts running on the other core. That stall is similar in nature to what you get with a branch misprediction, but probably worse. The cost of copying this state is probably similar.

You could have two cores running the same code redundantly before the branch happens, then take different paths at the branch. When the predicted outcome is known, the core which took the right path becomes the master and the other one tries to catch up. Maybe if the master core stalls enough, the slave will get the chance to catch up faster. Then you only merge again after it has caught up. Of course, until you're resynced you can't do more branches this way, so it'd really only help with first-generation branches until it catches up again. And I don't know if you can even really make it gradually sync up without getting the other core to stop and let it. At least some information would have to be communicated between the two, so the catching-up core knows which registers are clean.

I'm really pretty skeptical that this approach would ever work out well in practice.

If a CPU misses 2% of instructions due to branching, and the penalty for each miss is 10 cycles, it means you execute 98 out of 100 instructions in 98 cycles, and 2 instructions in 2*10=20 cycles, since every time you miss a branch you have to wait 10 cycles for instructions to start popping out of the pipeline again. 98+20=118. So 2% missed branches means 118 cycles, and 0% missed branches means 100 cycles. That's 18% faster. This is a rough estimate and I might be wrong, but I doubt it. Bulldozer has a penalty of 20 cycles, and that would mean 128 cycles, or a 28% speedup.

The Opteron 6276 (Bulldozer) should perform 33% better than the Opteron 6176 (K10), since they are kept at the same clocks.

Start over.

Around 10-20% of instructions are branches. Branch prediction rates tend to be higher than 95% for high-end predictors like the one on BD. That means mispredicted branches are closer to 0.5 to 1% of instructions. So cut your penalty estimate to somewhere between 1/2 and 1/4 of that.

The branch mispredict penalty is basically shared between the two cores in a module if both are running. What that means is that if the cores were bottlenecked later in the pipeline they might not notice all of the stall.

And you can't just use a 1-instruction-per-cycle estimate for everything else; it doesn't work like that. K10 and Bulldozer have a lot of differences that impact performance; it goes far beyond just branch misprediction. Anand never makes the claim you're making, that a heavier misprediction penalty is to blame. Consider that the misprediction penalty on K10 is something like 10-12 cycles and the prediction rate on BD is higher than K10's, so the gap between them is nothing like the difference you'd get from having no penalty at all.

The performance claims given for BD early on were bogus. Don't listen to marketing numbers.

Mitosis is also doing much more than what you're saying, so it's not applicable, and I think Intel must have used very favorable situations for their slides.

Could this be done to mitigate cache misses? Run one thread on lets say 4 cores. If you get a cache hit on one core and miss on 3 you continue the thread from the "hit" core and copy new data from it in the rest of the cores. This would effectively increase cache size 4 times. It would also use more power and would not be preferable where aplication would use 100% of the cpu without this.

You sound like you're basically saying to use another core to provide the OoO execution capability the core already has...

Getting at another core's exclusive cache is generally even slower than going to a cache they all share. So not worth it.
 

pm

Elite Member Mobile Devices
Jan 25, 2000
7,419
22
81
One thing that I'll add is that it's one thing to mispredict a cached branch (i.e. ~10 cycles), but it's a whole different thing if you mispredict on a load and then, worst case, have to go to main memory or to a different CPU via a snoop. So speculative execution, particularly for loads, can make some sense in some large-memory (i.e. server) workloads on systems with multiple CPUs... hence its inclusion in Itanium. On the other hand, I'm a circuit guy, not an RTL guy or a microarchitect, but I spent a fair bit of time doing debug, and I will say that, having worked on it a bit, speculation can be difficult to validate.
 
GillyBillyDilly

Jan 8, 2013
59
0
0
Another reason why this is not being done may be that it could cause terrible bugs. Suppose the loop must branch at a certain iteration, at a certain point, otherwise you would be accessing parts of memory which are not available to your process. With multiple paths executing, your code could now try to load from those addresses.

Hope someone can explain this better.
 

Homeles

Platinum Member
Dec 9, 2011
2,580
0
0
Without in-depth analysis, including comparison with other uarchs, one can't really say that branch misprediction penalty is one of the biggest weaknesses on Bulldozer.
That in-depth analysis has already been done, on this very site, by Johan de Gelas. The article is titled "The Bulldozer Aftermath: Delving Even Deeper." Relative to Sandy and Ivy Bridge, it's one of Bulldozer's biggest weaknesses. Not only is SNB's pipeline shorter, but the μop cache minimizes the penalty even further, and it has an 80-something percent hit rate, if I remember correctly.

Interestingly enough, Steamroller will be implementing a μop cache.

Now, Bulldozer's branch predictor was stronger than in AMD's previous μarchs, but it wasn't quite strong enough. Both Piledriver and Steamroller significantly improve on Bulldozer's BPU.
 

pm

Elite Member Mobile Devices
Jan 25, 2000
7,419
22
81
Another reason why this is not being done may be that it could cause terrible bugs. Suppose the loop must branch at a certain iteration, at a certain point, otherwise you would be accessing parts of memory which are not available to your process. With multiple paths executing, your code could now try to load from those addresses.

Hope someone can explain this better.

Yeah, this is what I was trying to say with my terse wording of "difficult to validate". :)

IA64 handles this with a concept called "Not a Thing" (aka "NaT bits"), which tells whether a register entry is speculative or real, plus an Advanced Load Address Table (ALAT) ( http://en.wikipedia.org/wiki/Advanced_load_address_table ). It works on Itanium, so it's not theoretical. But, yeah, I agree with you. It all gets pretty confusing... at least from my perspective.
 

Exophase

Diamond Member
Apr 19, 2012
4,439
9
81
One thing that I'll add is that it's one thing to mispredict a cached branch (i.e. ~10 cycles), but it's a whole different thing if you mispredict on a load and then, worst case, have to go to main memory or to a different CPU via a snoop. So speculative execution, particularly for loads, can make some sense in some large-memory (i.e. server) workloads on systems with multiple CPUs... hence its inclusion in Itanium. On the other hand, I'm a circuit guy, not an RTL guy or a microarchitect, but I spent a fair bit of time doing debug, and I will say that, having worked on it a bit, speculation can be difficult to validate.

Then again, if you execute multiple paths you're guaranteed to need both the instructions and the memory locations touched by all paths. That's not a performance problem if these only hit exclusive resources, but some percentage of cache accesses will inevitably miss to either a shared cache or main memory.

On Bulldozer-style CMT, loading the second core of a module will always increase pressure on the shared L1 icache, fetch bandwidth, and (until Steamroller) decode bandwidth.

GillyBillyDilly said:
Another reason why this is not being done may be that it could cause terrible bugs. Suppose the loop must branch at a certain iteration, at a certain point, otherwise you would be accessing parts of memory which are not available to your process. With multiple paths executing, your code could now try to load from those addresses.

Hope someone can explain this better.

This is no different from what happens when a branch is mispredicted. The core needs to be able to recover from things like exceptions, so that speculative execution doesn't actually cause them. By the time the pipeline is ready to commit the exception, it will already know whether it needed to flush the pipeline due to a mispredict (or, in this case, due to being on the wrong arm of a branch). The same is true for stores: they don't actually get committed until after branches are resolved.

If you have something like device memory that has side effects and therefore can't withstand superfluous loads, the CPU needs to take measures to make sure such loads don't happen speculatively. These regions are marked as such in the TLB, so the CPU knows to take precautions when accessing them, for example by serializing the pipeline first.
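
The classic software-visible case is a guarded load, something like this plain C sketch:

Code:
#include <stdio.h>

int main(void) {
    int *p = NULL;   /* imagine p is sometimes NULL, sometimes valid */

    /* The guard branch exists precisely so that the load never happens
       when p is NULL. If the CPU speculates down the wrong arm (or eagerly
       runs both arms), it may issue that load anyway; the hardware has to
       make sure the fault from such a wrong-path load is never actually
       raised, and is discarded along with the rest of the path. */
    if (p != NULL)
        printf("%d\n", *p);
    else
        printf("p was NULL, no load performed\n");

    return 0;
}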
 
GillyBillyDilly

Jan 8, 2013
59
0
0
Then again, if you execute multiple paths you're guaranteed to need both the instructions and the memory locations touched by all paths. That's not a performance problem if these only hit exclusive resources, but some percentage of cache accesses will inevitably miss to either a shared cache or main memory.

On Bulldozer-style CMT, loading the second core of a module will always increase pressure on the shared L1 icache, fetch bandwidth, and (until Steamroller) decode bandwidth.

Another reason why this is not being done may be that it could cause terrible bugs. Suppose the loop must branch at a certain iteration, at a certain point, otherwise you would be accessing parts of memory which are not available to your process. With multiple paths executing, your code could now try to load from those addresses.

Hope someone can explain this better.



This is no different from what happens when a branch is mispredicted. The core needs to be able to recover from things like exceptions, so that speculative execution doesn't actually cause them. By the time the pipeline is ready to commit the exception, it will already know whether it needed to flush the pipeline due to a mispredict (or, in this case, due to being on the wrong arm of a branch). The same is true for stores: they don't actually get committed until after branches are resolved.

If you have something like device memory that has side effects and therefore can't withstand superfluous loads, the CPU needs to take measures to make sure such loads don't happen speculatively. These regions are marked as such in the TLB, so the CPU knows to take precautions when accessing them, for example by serializing the pipeline first.

Good to have you around. ():)
 

Idontcare

Elite Member
Oct 10, 1999
21,110
64
91
That in-depth analysis has already been done, on this very site, by Johan de Gelas. The article is titled "The Bulldozer Aftermath: Delving Even Deeper." Relative to Sandy and Ivy Bridge, it's one of Bulldozer's biggest weaknesses. Not only is SNB's pipeline shorter, but the μop cache minimizes the penalty even further, and it has an 80-something percent hit rate, if I remember correctly.

Interestingly enough, Steamroller will be implementing a μop cache.

Now, Bulldozer's branch predictor was stronger than in AMD's previous μarchs, but it wasn't quite strong enough. Both Piledriver and Steamroller significantly improve on Bulldozer's BPU.

If one takes a somewhat superficial look at Intel after the Rambus debacle, we can see that they were originally thinking they would have oodles of bandwidth available to their CPUs, and that they weren't as concerned with the issue as they ended up being once Rambus fell out of favor.

Once Rambus was clearly not going to be the future, we saw Intel prioritize the development of unparalleled caches (latency and size) as well as the development of unparalleled branch predictors and prefetchers.

They basically threw resources and prioritization towards insulating their microarchitecture from whatever the ram guys were going to be selling. They did not want their processor's performance to be beholden to the roadmaps of the ram guys.

And we see the fruits of their labor when we read RAM reviews and note that it hardly matters whether our 4GHz 8-thread 3770Ks are being fed by DDR3-1333 or DDR3-2666 DIMMs.

The prefetchers, on-die caches, and branch predictors have all been massively tuned and improved such that it doesn't matter how good or crappy the system memory is, the CPU will get along just fine despite it.

And this is exactly where AMD lags. We see it with their APUs where performance is critically dependent on what the memory guys are selling speed-wise and price-wise. And we see it in their FX line where branch misprediction and cache misses are met with severe performance penalties.

That is why the theoretical IPC of bulldozer is so much higher than the realized IPC, whereas the realized IPC of a 3770k is much closer to its theoretical IPC.

Now I don't know if the narrative is correct, regarding Intel being motivated by the Rambus experience to prioritize the advancement of pre-fetchers, cache, and branch prediction to insulate themselves from the unreliable roadmaps of the memory producers, but that is how I have come to view it.