I can go into the specifics of branch predictor implementations since KF gave a good overview of the concept of branches.....
Branch prediction has a long history...performance losses due to conditional branches were recognized in the 1960s, and given initial study in Tomasulo et al.'s IBM System/360 Model 91 papers. The problem is that conditional branches pose a control dependence....the instructions following a branch in the control flow of a program are conditionally dependent on its outcome. When that control dependence causes instruction issue or execution to stall, branches present themselves as control hazards.
Even in the short microprocessor pipelines of the 80s with 4-5 stages, branches posed a problem. These MPUs only issued one instruction per cycle, and often only stalled for one cycle when encountering a branch in order to determine its outcome....despite these meager penalties compared to today's MPUs, branch hazards were important enough for designers to add branch delay slots to the ISA of many RISC architectures, which, through compiler assistance, attempted to fill the would-be stall cycle after a branch with a useful instruction.
But branch delay slots are a hack that attempted to solve "the problem of the day" by modifying the ISA; later implementations with longer pipelines found branch delay slots an annoyance. An ISA-agnostic microarchitectural approach is to attempt to predict the outcome of the branch, which had already been evaluated in larger systems.
The first technique is to predict the outcome of a branch and fetch instructions along the predicted path, but not allow those predicted instructions to begin execution. This guarantees a precise program state, since in the case of a misprediction, no instructions that should not have executed can modify the program state. But the effectiveness diminishes as pipelines become longer, since this technique can only hide a couple of stall cycles. Something else is needed for out-of-order execution microprocessors, which add at least 2-3 stages over a simple in-order pipeline.
The other option is to not only predict the outcome of the branch, but also speculatively execute instructions down the predicted path. The problem is that the program state must be guaranteed to be precise when a misspeculation occurs, so there needs to be some method of "backing up." Fortunately there is already a framework in place for out-of-order execution microprocessors. These dynamically scheduled processors fetch, decode, and issue instructions in program order, and buffer those instructions in reservation stations, where they wait until their operands are ready. Through register renaming (I can go into this later if you like; this is a rather brief explanation of dynamic scheduling) false data dependencies are avoided, and instructions can be executed out-of-order from the reservation stations. The IBM 360/91 did dynamic scheduling in 1967, but dynamic scheduling lost favor for a period because it does not guarantee precise exceptions, since the instructions are executed out-of-order. This made debugging and handling program interrupts very difficult. It wasn't until about 20 years later that methods of maintaining precise interrupts with dynamic scheduling were developed. One of them, the reorder buffer, maintains a list of instructions currently waiting for operands in the reservation stations or being executed. A spot in the reorder buffer is allocated in program order when an instruction is issued, and the result of the instruction after execution is kept in the reorder buffer. The reorder buffer then makes sure that the program state is updated in order after out-of-order execution.
The reorder buffer can now be trivially updated to support speculative execution around branches. A speculative bit can be set in an entry that contains a speculative instruction. If the branch prediction was found to be correct, the speculative bits are cleared. If the prediction was incorrect, the speculative instructions are thrown out, and the old mapping of the renaming registers is restored.
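To make the speculative-bit mechanism concrete, here's a minimal Python sketch. The structure (a list of entries with a speculative flag, flushed or committed when the branch resolves) is my own illustrative simplification, not any real microarchitecture:

```python
# Hypothetical sketch: a reorder buffer whose entries carry a speculative
# bit. On a correct prediction the bits are cleared; on a misprediction
# the speculative entries are thrown out.

from collections import deque

class ReorderBuffer:
    def __init__(self):
        self.entries = deque()  # allocated in program order

    def issue(self, instr, speculative=False):
        self.entries.append({"instr": instr, "speculative": speculative})

    def resolve_branch(self, prediction_correct):
        if prediction_correct:
            # prediction correct: the speculative work becomes permanent
            for e in self.entries:
                e["speculative"] = False
        else:
            # misprediction: discard everything fetched down the wrong path
            self.entries = deque(
                e for e in self.entries if not e["speculative"]
            )

rob = ReorderBuffer()
rob.issue("add r1, r2, r3")
rob.issue("beq r1, r0, L")
rob.issue("sub r4, r5, r6", speculative=True)  # fetched past the branch
rob.resolve_branch(prediction_correct=False)   # branch was mispredicted
print([e["instr"] for e in rob.entries])       # speculative sub is gone
```

A real design also restores the register-renaming map at the mispredicted branch, which this sketch glosses over.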
So now you might wonder how branch prediction is actually performed. The simplest effective way, Smith's 2-bit state machine, uses a small memory in which each entry holds two bits. If an entry is 00, it's predicted strongly not-taken; if it's 01, weakly not-taken; if it's 10, weakly taken; if it's 11, strongly taken. If, for example, an entry at prediction time is 10 (weakly taken), and the branch is actually taken, the entry is moved to 11 (strongly taken). Likewise, if that branch was actually not taken, the entry is moved to 01 (weakly not-taken). The memory is addressed with the low-order bits of the program counter (PC, the address of the instruction to be executed) of the branch instruction. If the lower 15 bits of the PC are used, the memory has 2^15 = 32K entries.
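The 2-bit scheme is just a saturating counter per entry, which is easy to sketch in a few lines of Python. The table size and PC hashing below are illustrative choices, not a specific design:

```python
# Sketch of Smith's 2-bit saturating-counter predictor.
# Counter values: 0 = strongly not-taken, 1 = weakly not-taken,
#                 2 = weakly taken,       3 = strongly taken.

class SmithPredictor:
    def __init__(self, index_bits=15):
        self.mask = (1 << index_bits) - 1
        self.table = [0] * (1 << index_bits)   # 2^15 = 32K entries

    def predict(self, pc):
        return self.table[pc & self.mask] >= 2  # 10/11 -> predict taken

    def update(self, pc, taken):
        i = pc & self.mask
        if taken:
            self.table[i] = min(self.table[i] + 1, 3)  # saturate at 11
        else:
            self.table[i] = max(self.table[i] - 1, 0)  # saturate at 00

bp = SmithPredictor()
for _ in range(3):
    bp.update(0x4000, taken=True)  # e.g. a loop branch, taken repeatedly
print(bp.predict(0x4000))          # True
```

The point of the two-bit hysteresis is that a single anomalous outcome (say, a loop exit) only nudges the counter one step, so the next prediction for that loop branch is still taken.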
The Smith predictor only tracks local information: only the state of the current branch is used to perform a prediction. There are numerous other schemes, too many to describe, that use deeper branch history (local or global) to make predictions: Yeh & Patt, Gshare, Filter, YAGS, tournament, correlating, etc. For example, the Yeh & Patt predictor, used in the Pentium Pro, keeps a table of the branch history per address (although two different branches may map to the same branch history entry, since the predictor can't possibly have an entry for every branch). If the branch history entry holds 01001..., for example, the recent history is not taken, taken, not taken, not taken, taken, etc. The entry thus records a history pattern, which is used to index into another table, whose entries might use the Smith 2-bit state machine described earlier.
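A sketch of that two-level idea, in the same style as the Smith example above. The table sizes and the direct-mapped indexing are illustrative assumptions, not the Pentium Pro's actual configuration:

```python
# Sketch of a two-level adaptive predictor in the Yeh & Patt style:
# a per-address branch history table (BHT) whose history pattern
# indexes a second table (PHT) of 2-bit saturating counters.

class TwoLevelPredictor:
    def __init__(self, bht_bits=10, hist_bits=8):
        self.bht_mask = (1 << bht_bits) - 1
        self.hist_mask = (1 << hist_bits) - 1
        self.bht = [0] * (1 << bht_bits)   # per-address history registers
        self.pht = [0] * (1 << hist_bits)  # one 2-bit counter per pattern

    def predict(self, pc):
        hist = self.bht[pc & self.bht_mask]
        return self.pht[hist] >= 2

    def update(self, pc, taken):
        i = pc & self.bht_mask
        hist = self.bht[i]
        if taken:
            self.pht[hist] = min(self.pht[hist] + 1, 3)
        else:
            self.pht[hist] = max(self.pht[hist] - 1, 0)
        # shift the new outcome into that branch's history register
        self.bht[i] = ((hist << 1) | int(taken)) & self.hist_mask
```

Because the counter is selected by the history *pattern*, this predictor can learn repeating sequences, e.g. a branch that strictly alternates taken/not-taken, which a plain Smith predictor mispredicts forever.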
Besides the branch hazards described earlier, branches also pose a problem with fetching instructions. In the simplest case where you're fetching one instruction per cycle, in the common case you'll be fetching instruction N, then N+1, then N+2, then N+3, etc. When a branch is encountered, it's easy to just stall the fetch until the address of the next instruction is determined, but this kills performance, since on average one in every four to seven instructions is a branch. With modern microprocessors fetching a peak of 3-6 instructions per cycle, on average one branch may be encountered per fetch. Say a microprocessor fetches four instructions/cycle; it likely does this by fetching four instructions from one block in the cache. How do we then know where to fetch instructions during the next cycle? Mere branch prediction isn't enough; branch prediction is often performed at instruction issue, which may be 2-5 or more cycles later. While we're fetching instructions in cycle N, we must also be determining where to fetch instructions from in cycle N+1.
The answer is the branch target buffer (BTB), a buffer indexed with the branch PC that produces the predicted PC from which fetching should occur during the next cycle. Thus when you index into the buffer with a PC, you are not only predicting that the particular PC is or is not a branch (since you have not fetched or even decoded the instruction for that PC yet), but also whether the predicted branch is taken or not taken. Often the BTB is indexed with the PC at the start of the cache block, so you're guessing, given that PC, whether or not there is a branch in that group of X instructions and, if there is a branch, whether or not it is taken. On a BTB miss or misprediction, you may have to stall one or two cycles before fetching can begin from the correct (predicted) PC.
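The block-indexed lookup can be sketched as follows. This is a deliberately simplified direct-mapped BTB; the table size, 16-byte fetch block, and full-PC tag are my illustrative assumptions:

```python
# Sketch of a direct-mapped branch target buffer. A hit means "we
# predict a taken branch in this fetch block, redirect to its target";
# a miss means fall through to the next sequential block.

class BTB:
    def __init__(self, index_bits=10, block_bytes=16):
        self.mask = (1 << index_bits) - 1
        self.block = block_bytes  # one fetch group (e.g. 4 instructions)
        self.entries = {}         # index -> (tag, predicted target)

    def next_fetch_pc(self, fetch_pc):
        i = (fetch_pc // self.block) & self.mask
        entry = self.entries.get(i)
        if entry and entry[0] == fetch_pc:
            return entry[1]              # predicted-taken branch in block
        return fetch_pc + self.block     # no hit: fetch next block

    def record_taken_branch(self, fetch_pc, target):
        # called after a taken branch resolves, to train the BTB
        i = (fetch_pc // self.block) & self.mask
        self.entries[i] = (fetch_pc, target)

btb = BTB()
btb.record_taken_branch(0x1000, target=0x2000)
print(hex(btb.next_fetch_pc(0x1000)))  # 0x2000: predicted-taken redirect
print(hex(btb.next_fetch_pc(0x1010)))  # 0x1020: miss, fall through
```

The key property is that `next_fetch_pc` needs nothing but the current fetch PC, so it can run every cycle, in parallel with the fetch itself.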
It's worth noting, though, that 95% isn't too great. The 5% of the time it's wrong is absolutely killer on performance; it forces the CPU to spend around 20% of its time recovering from the bad branch.
Yep, and the bad part is that the relative effect of average stall cycles from mispredictions gets worse as IPC increases....as a rough estimate, say that 15% of instructions are branches, the prediction rate is 95%, and the average misprediction penalty is 12 cycles. The number of stall cycles is (roughly) 0.15 * 0.05 * 12 = 0.09 CPI (cycles/instruction, the inverse of IPC). If you have an average "normal" (under correct speculation) CPI of 0.5, the CPI with misspeculation stalls is 0.59, which hurts performance by 18%.