
Anyone care to explain branch prediction to me :\

smp

Diamond Member
I have a hard enough time following CPU articles and whatnot, and it would help if I had an understanding of the gist behind branch prediction, or rather, how it works. I understand that when an instruction comes from the cache it's designated for a certain execution unit or something, and that the branch prediction unit sort of guesstimates where it's supposed to go and just shoots it off. If you get a miss, then you get wasted cycles. Anyway, I'm sure you can see that my knowledge is limited here; any clarification is much appreciated. Feel free to go off topic and just show off your knowledge of CPU design all you want.
 
TCE

(thread crap elsewhere)

BTW, I'm not in school. There is such a thing as wanting to converse about this type of thing; simply reading texts sometimes isn't enough, and dynamic information is much nicer. That's what forums are for IMO.
So STFU and RTFM or something.
 
100 input C$
200 if C$ = "1" then goto 400
300 end
400 goto 100

<btw this isn't real code 😛>

If compiled and run, line 200 would be a branch: the condition would be evaluated and the program would have to jump to one of two possibilities. Since modern CPUs execute things out of order and try to avoid stalling, they want to keep running while line 200 is being evaluated. So the CPU tries to predict which way the evaluation will turn out and runs instructions in that direction ahead of time. If it's right, it has saved time; if not, it has wasted time executing the wrong instructions.

Voila! That's branch prediction. It's usually done by keeping a table of previous branches and using that to determine which outcome is most likely.


 
Originally posted by: JKing76
DYOH.


(Do your own homework)



Branch prediction uses CPU cycles to go ahead and process whichever branch it thinks the program is going to take... if the guess was correct, it's already processed some way into that branch, so you get a speedup...

If the prediction was wrong, the pipeline has to be "flushed"... therefore on CPUs with very long pipelines, e.g. the P4, a bad prediction slows down a program...

There are some pretty advanced prediction algorithms in use on these modern CPUs... they are right approx 80% of the time I would say... but remember, it isn't possible to create an algorithm that is perfect, 100% (well, without quantum computing anyway, but if you had a quantum computer, you wouldn't need the prediction lol)...
 
I don't have a great deal of knowledge on specific internal design, and what I have is mainly from a few years ago.

Obviously the job of a CPU is to execute instructions. In hardware jargon these would be called op-codes. "op" is short for operation. "code" is in effect a number which designates a particular instruction. All CPU types have their own specific encoding, although generally speaking the instructions are similar. The Motorola PowerPC does not use the same opcodes as the Intel Pentium. The current Intel CPU instruction set is the one we need to be concerned with, and it is usually called x86, because most of it originated in CPUs that had 86 at the end of their model names. (8086, 80286, 80386, 80486, Pentium?)

In order to do an instruction, there are really sub-tasks that are performed to get to the end result. To speed execution, these tasks are separated and "pipelined." In each stage of the so-called pipeline, a certain part of the instruction is executed and then the instruction is passed to the next stage. As soon as it is passed to the next stage, another instruction can come into the stage just left unoccupied. The reason this is faster is that you don't have to wait for the whole instruction to get done before starting the next. And the simpler a task is, the easier it is for hardware to do it fast. The net result is that perhaps 10, or more, instructions may be in the midst of being done at the same time. (To complicate, and speed up, things further, the CPU nowadays has two or more pipelines that can be used simultaneously.)
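The speedup from overlapping stages can be put in numbers with a quick sketch (my illustration, assuming one-cycle stages and no stalls or hazards):

```python
# Back-of-the-envelope sketch: total cycles to run N instructions
# through an S-stage machine, with and without pipelining.

def unpipelined_cycles(n_instructions, stages):
    # Each instruction must finish all stages before the next starts.
    return n_instructions * stages

def pipelined_cycles(n_instructions, stages):
    # Fill the pipeline once, then retire one instruction per cycle.
    return stages + (n_instructions - 1)

print(unpipelined_cycles(100, 10))  # → 1000
print(pipelined_cycles(100, 10))    # → 109
```

The nearly 10x gap is exactly why keeping the pipeline full (and not flushing it) matters so much.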

The first thing that must be done to execute an instruction is to get the next instruction. Seems simple enough, but this turns out to be a major problem. Ordinarily CPU instructions can be divided into two sorts. One does a simple arithmetic or logical operation, or even more simply, just reads or writes data. The other type determines what the next instruction will be. That is a "branch" instruction. In the absence of a branch instruction, the CPU gets the next instruction from the memory location just after the previous instruction. The essence of a branch instruction is that it determines where the next instruction the CPU will execute will come from. It could continue on, or it could "branch" to a new location. If it could be determined beforehand exactly whether the branch would be taken or not, you could eliminate the instruction.

Typically branch instructions occur very frequently (or they did before the P4 became Intel's focus). Let's guess every five or ten instructions in a row would have one. In a conceptual way, branches are the intelligence in a program, and that is why they are so common. If the end result were really so easy to predict, why run the program at all? But branch instructions are a problem for the pipeline. Ideally the pipeline should always be full, but what should you put in the pipeline? Which series of instructions: the ones immediately following, or the ones at the branch location? If you choose the wrong ones, you have just wasted 10 or more clock cycles. That brings us to "branch prediction".

How can you predict a branch when it is unpredictable? One way is to predict it will not be taken. As a matter of fact, this is the way programs are most often designed, so that is pretty good prediction. A programmer's tool, an optimizing compiler, will arrange this automatically. OTOH, another common branch type is to repeat a section some number of times (a "loop"). Mispredicting that the branch is not taken 10,000 times in a row would be a bad idea. So you can keep track of which way it went last time, and predict it will be the same next time, which is also a pretty good guess, and in the case of a loop you would only be wrong twice. Branch prediction can get more elaborate, but the explanation of why the more elaborate systems would do much good I have never seen. Branch predictors keep track of a limited number of the most recent branch instructions and how they went.
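The loop example can be simulated directly; this is a toy sketch of a 1-bit "predict same as last time" scheme, not any real CPU's predictor:

```python
# Sketch (not from the post): predict each branch outcome as whatever
# the branch did last time, and count mispredictions.

def simulate_last_time(outcomes, initial_guess=False):
    prediction = initial_guess
    mispredictions = 0
    for taken in outcomes:
        if prediction != taken:
            mispredictions += 1
        prediction = taken  # remember the actual outcome for next time
    return mispredictions

# A loop branch: taken 10,000 times, then not taken once at loop exit.
loop_branch = [True] * 10000 + [False]

# Static "predict not taken" would be wrong 10,000 times here; "same as
# last time" is wrong only twice (first iteration and loop exit).
print(simulate_last_time(loop_branch))  # → 2
```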

A long pipeline makes the penalty for mispredicted branches worse just because more cycles are wasted.

The other thing you mention is actually a cache miss. In order to execute instructions at a maximum rate, the required instruction and data should be in the cache rather than in the main memory. If the CPU operates at 15 times the memory speed, the penalty for a cache miss is very big. AFAIK there is no prediction of what should be put into the cache. When an instruction references a new location, that location along with the block of locations near it are loaded into the cache (a "cache line"), and what is currently there is kicked out. If the stuff that is being kicked out has been changed while it was there, then it has to be written back to memory. People sometimes overlook the fact that the cache has only a few locations that a particular memory block can occupy. If it is an "8-way" cache, there will be 8 possibilities, and 8 is actually a large number for a cache. If you increase the size of a cache without increasing the number of "ways", you increase the probability of kicking something out.
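The set-associativity point can be illustrated with a small calculation (the 32 KiB / 64-set / 8-way sizing here is my own assumption for the example, not any particular CPU's cache):

```python
# Sketch: which set of an 8-way set-associative cache a memory block
# lands in. The set is fixed by the address; only the 8 ways vary.

LINE_SIZE = 64   # bytes per cache line
NUM_SETS = 64    # 64 sets * 8 ways * 64 B = 32 KiB total
WAYS = 8

def cache_set(address):
    block_number = address // LINE_SIZE
    return block_number % NUM_SETS

# Addresses exactly NUM_SETS * LINE_SIZE bytes apart all collide in the
# same set, so a 9th such block must kick one of the 8 residents out.
colliding = [cache_set(i * NUM_SETS * LINE_SIZE) for i in range(9)]
print(colliding)  # → nine zeros: all nine blocks map to set 0
```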

Well, I tried!
 
Originally posted by: KF

Nice, thank you. That sort of helped 🙂
Now I just have to get some programming under my belt. I'm sure this would make better sense to me if I read the Intel P4 developer's manuals that I ordered for free from Intel 🙂

 
Originally posted by: zsouthboy
Originally posted by: Sunner
I thought modern branch predictors had accuracies around 95+ percent?

Hmmm, last I remember seeing was 80... maybe... I dunno...
It's well over 95% for both the Athlon and the Pentium 4.
 
Originally posted by: AndyHui

It's well over 95% for both the Athlon and the Pentium 4.

It's worth noting though that 95% isn't too great. The 5% it's wrong is absolutely killer on performance; it forces the CPU to spend around 20% of its time recovering from the bad branch.
 
I can go into the specifics of branch predictor implementations since KF gave a good overview of the concept of branches.....

Branch prediction has a long history...performance losses due to conditional branches were recognized in the 1960s, and given initial study in Tomasulo et al.'s IBM System/360 model 91 paper. The problem is that conditional branches pose a control dependence....the instructions following a branch in the control flow of a program are conditionally dependent on the outcome. Should that control dependence cause instruction issue or execution to stall, branches may present themselves as control hazards.

Even in the short microprocessor pipelines of the 80s with 4-5 stages, branches posed a problem. These MPUs only issued one instruction per cycle, and often only stalled for one cycle when encountering a branch in order to determine the outcome of the branch....despite these meager penalties compared to today's MPUs, branch hazards were important enough for designers to implement branch delay slots in the ISA of many RISC architectures, which, through compiler assistance, attempted to insert an instruction in the potential stall cycle before a branch.

But branch delay slots are a hack that attempted to solve "the problem of the day" via modifying the ISA; later implementations with longer pipelines found branch delay slots an annoyance. An ISA-agnostic microarchitectural approach is to attempt to predict the outcome of the branch, which had already been evaluated in larger systems.

The first technique is to predict the outcome of a branch and fetch instructions along the predicted path, but not allow those predicted instructions to start execution. This guarantees the precise program state, since in the case of a miss-prediction, no instructions that should not have executed can modify the program state. But the effectiveness diminishes as pipelines become longer, since this prediction can only hide a couple stall cycles. Something else is needed for out-of-order execution microprocessors, which add at least 2-3 stages over a simple in-order pipeline.

The other option is to not only predict the outcome of the branch, but also speculatively execute instructions down the predicted path. The problem is that the program state must be guaranteed to be precise when a miss-speculation occurs, so there needs to be some method of "backing-up." Fortunately there is already a framework in place for out-of-order execution microprocessors. These dynamically scheduled processors fetch, decode, and issue instructions in program order, and buffer those instructions into reservation stations where they wait until their operands are ready. Through register renaming (I can go into this later if you like, this is a rather brief explanation of dynamic scheduling) data dependencies are avoided, and instructions can be executed out-of-order from the reservation stations. The IBM 360/91 did dynamic scheduling in 1967, but dynamic scheduling lost favor for a period because it does not guarantee precise program exceptions, since the instructions are executed out-of-order. This made debugging and handling program interrupts very difficult. It wasn't until 20 years later that methods of maintaining precise interrupts with dynamic scheduling were developed. One of them, the reorder buffer, maintains a list of instructions currently waiting for operands in the reservation stations or being executed. A spot in the reorder buffer is allocated in program order when an instruction is issued, and the results of the instruction after execution are kept in the reorder buffer. The reorder buffer now makes sure that the program state is updated in-order after out-of-order execution.

The reorder buffer can now be trivially updated to support speculative execution around branches. A speculative bit can be set in an entry that contains a speculative instruction. If the branch prediction was found to be correct, the speculative bits are cleared. If the prediction was incorrect, the speculative instructions are thrown out, and the old mapping of the renaming registers is restored.
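Here's a toy sketch of that idea (my own illustration, not any real microarchitecture): a reorder buffer whose speculative entries are either confirmed or squashed when the branch resolves:

```python
# Toy reorder buffer: entries issued after a predicted branch carry a
# speculative bit; a misprediction throws those entries away.

class ReorderBuffer:
    def __init__(self):
        self.entries = []  # [instruction, speculative] in program order

    def issue(self, instruction, speculative=False):
        self.entries.append([instruction, speculative])

    def resolve_branch(self, prediction_correct):
        if prediction_correct:
            for entry in self.entries:
                entry[1] = False  # speculation confirmed: clear the bits
        else:
            # Squash everything fetched down the wrong path.
            self.entries = [e for e in self.entries if not e[1]]

rob = ReorderBuffer()
rob.issue("add r1, r2")
rob.issue("beq r1, r3")  # predicted taken; what follows is speculative
rob.issue("mul r4, r1", speculative=True)
rob.issue("sub r5, r4", speculative=True)
rob.resolve_branch(prediction_correct=False)
print([e[0] for e in rob.entries])  # → ['add r1, r2', 'beq r1, r3']
```

A real design also restores the old register-rename mapping at this point, which this sketch omits.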

So now you might wonder how branch prediction is performed. The simplest effective way, Smith's 2-bit state machine, has a small memory in which each entry has two bits. If an entry is 00, it's predicted strongly not-taken; if it's 01, it's predicted weakly not-taken; if it's 10, it's predicted weakly taken; if it's 11, it's predicted strongly taken. If, for example, an entry at prediction is 10 (weakly taken), and the branch is actually taken, then the value at the entry is shifted to 11 (strongly taken). Likewise, if that branch was actually not taken, the entry is shifted to 01 (weakly not-taken). The memory is addressed through the low-order bits of the program counter (PC, the address that determines the instruction that is to be executed) of the branch instruction. If the lower 15 bits of the PC are used, then the memory has 2^15 = 32K entries.
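That 2-bit scheme is short enough to sketch in full; the table sizing follows the description above, but the weakly-not-taken starting state is my own choice:

```python
# Smith's 2-bit saturating counter. States: 0 strongly not-taken,
# 1 weakly not-taken, 2 weakly taken, 3 strongly taken.

PC_BITS = 15
TABLE = [1] * (1 << PC_BITS)  # 2^15 = 32K entries, start weakly not-taken

def predict(pc):
    return TABLE[pc & ((1 << PC_BITS) - 1)] >= 2  # True means "taken"

def update(pc, taken):
    i = pc & ((1 << PC_BITS) - 1)
    if taken:
        TABLE[i] = min(TABLE[i] + 1, 3)  # saturate at strongly taken
    else:
        TABLE[i] = max(TABLE[i] - 1, 0)  # saturate at strongly not-taken

# A loop branch at some PC: after a couple of taken outcomes the counter
# saturates and keeps predicting "taken" through the rest of the loop.
pc = 0x400
for _ in range(3):
    update(pc, taken=True)
print(predict(pc))  # → True
```

The two weak states are the point of using 2 bits instead of 1: a single anomalous outcome nudges the counter but doesn't immediately flip the prediction.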

The Smith predictor only tracks local information: only the state of the current branch is used to perform a prediction. There are numerous other schemes, too many to describe, that use global branch history to make predictions: Yeh & Patt, Gshare, Filter, YAGS, tournament, correlating, etc. For example, the Yeh & Patt predictor, used in the Pentium Pro, keeps a table of the branch history per address (although two different branches may map to the same branch history entry, since the predictor can't possibly have an entry for every branch). If the branch history entry has 01001... for example, the recent history is not taken, taken, not taken, not taken, taken, etc. The entry thus determines some history pattern, which is used to index into another table, which might use the Smith 2-bit state machine described earlier.
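As an illustration of a global-history scheme, here is a toy gshare-style predictor (one of the schemes named above; the table sizes and training pattern are arbitrary). Note how it learns an alternating taken/not-taken pattern, which a plain per-branch 2-bit counter handles very poorly:

```python
# Toy gshare: a global history register XORed with the PC indexes a
# table of 2-bit saturating counters.

HIST_BITS = 8
counters = [1] * (1 << HIST_BITS)  # start weakly not-taken
history = 0                        # global branch history register

def gshare_predict(pc):
    return counters[(pc ^ history) & ((1 << HIST_BITS) - 1)] >= 2

def gshare_update(pc, taken):
    global history
    i = (pc ^ history) & ((1 << HIST_BITS) - 1)
    counters[i] = min(counters[i] + 1, 3) if taken else max(counters[i] - 1, 0)
    history = ((history << 1) | int(taken)) & ((1 << HIST_BITS) - 1)

# Train on a branch that strictly alternates taken/not-taken. Once the
# history register settles, each outcome sees a distinct counter entry,
# so the predictor locks onto the pattern after a short warm-up.
pc = 0x40
hits = 0
for taken in [True, False] * 64:
    hits += (gshare_predict(pc) == taken)
    gshare_update(pc, taken)
print(hits)  # → 123 of 128 correct (only the warm-up is mispredicted)
```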

Besides the potential problems that branch hazards may pose, as described earlier, branches also pose a problem with fetching instructions. In the simplest case where you're fetching one instruction per cycle, in the common case you'll be fetching instruction N, then N+1, then N+2, then N+3, etc. When a branch is encountered, it's easy to just stall the fetch until the address of the next instruction is determined, but this kills performance since on average one in every four to seven instructions is a branch. With modern microprocessors fetching a peak of 3-6 instructions per cycle, on average one branch may be encountered per fetch. Say a microprocessor fetches four instructions per cycle; it likely performs this by fetching four instructions from one block in the cache. How do we then know where to fetch instructions during the next cycle? Mere branch prediction isn't enough; branch prediction is often performed during instruction issue, which may be 2-5 or more cycles later. While we're fetching instructions in cycle N, we must also be determining where to fetch instructions from in cycle N+1.

The answer is the branch target buffer (BTB), which is a buffer indexed with the branch PC that produces the predicted PC from which fetching should occur during the next cycle. Thus when you index into the buffer with a PC, you are not only predicting that the particular PC is or is not a branch (since you have not fetched or even decoded the instruction for that PC yet), but also whether the predicted branch is predicted taken or not taken. Often the BTB is indexed with the PC at the start of the cache block, so you're guessing, given that PC, whether or not there is a branch in that group of X instructions and, if there is a branch, whether or not it is taken. On a BTB miss or miss-prediction, you may have to stall one or two cycles before fetching can begin from the correct (predicted) PC.
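A dict-based sketch of the BTB idea (mine, with made-up block size and addresses):

```python
# Toy branch target buffer: maps a fetch-block PC to the predicted next
# fetch PC, consulted while the current block is still being fetched.

btb = {}  # fetch-block PC -> predicted next fetch PC

def next_fetch_pc(pc, block_size=16):
    # BTB hit: a predicted-taken branch lives in this block, so fetch
    # from its recorded target. Miss: predict sequential fall-through.
    return btb.get(pc, pc + block_size)

# After a taken branch in the block at 0x1000 jumps to 0x2040, record it:
btb[0x1000] = 0x2040

print(hex(next_fetch_pc(0x1000)))  # → 0x2040 (predicted taken)
print(hex(next_fetch_pc(0x1010)))  # → 0x1020 (fall through)
```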

It's worth noting though that 95% isn't too great. The 5% it's wrong is absolutely killer on performance; it forces the CPU to spend around 20% of its time recovering from the bad branch.
Yep, and the bad part is that the relative effect of average stall cycles from mispredictions gets worse as IPC increases....just as a rough estimate, say that 15% of instructions are branches, there is a 95% prediction rate, and that there is an average 12 cycle misprediction penalty. The number of stall cycles is (roughly) .15 * .05 * 12 = 0.09 CPI (cycles/instruction, the inverse of IPC). If you have an average "normal" (under correct speculative circumstances) CPI of 0.5, the CPI with miss-speculation stalls is 0.59, which hurts performance by 18%.
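That back-of-the-envelope calculation, spelled out:

```python
# The stall-cycle arithmetic from the post above.
branch_fraction = 0.15   # 15% of instructions are branches
mispredict_rate = 0.05   # 95% prediction accuracy
penalty_cycles = 12      # average misprediction penalty

stall_cpi = branch_fraction * mispredict_rate * penalty_cycles
base_cpi = 0.5           # CPI assuming all speculation is correct

slowdown = (base_cpi + stall_cpi) / base_cpi - 1
print(round(stall_cpi, 2), f"{slowdown:.0%}")  # → 0.09 18%
```

The same 0.09 stall CPI hurts a low-CPI (high-IPC) machine proportionally more, which is the point being made.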
 