Originally posted by: CTho9305
Chips nowadays are often (usually?) power-limited, meaning the design could run faster, and reliably enough, at a higher voltage and frequency, but nobody* wants to buy 200-watt processors. Given that branch prediction accuracies are well over 90%, it seems insane to me to quadruple the power for a few percent of performance, and quadruple is what you get: executing both paths of just 2 in-flight branches means running 4 paths. Also, in real-world integer code**, about 20% of instructions are branches. With a conservative estimate of 32 instructions in flight***, you're already dealing with ~6 branches, so the chip that burns 4X the power still has to use prediction for the majority of its branches.

There's an additional complexity too: if you decide to take a branch, you have to figure out where the "branch taken" path actually is. In some cases the target is encoded in the instruction, but sometimes it's a calculated value. If that calculation isn't finished yet, you have to guess where to jump, which means picking 1 of ~2^32 options (anywhere in the address space), not just 1 of 2 (taken or not taken). So even with only 2 branches in flight, you may not be able to tell where they branch to.
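To make the "encoded vs. calculated" distinction concrete, here's a minimal C sketch (the function names and the table are made up for illustration). A plain if/else compiles to a direct conditional branch whose taken target is a constant encoded in the instruction; a call through a function pointer is an indirect branch whose target isn't known until the pointer value is computed:

    #include <stdio.h>

    static int add(int a, int b) { return a + b; }
    static int sub(int a, int b) { return a - b; }

    int main(void) {
        int x = 7, y = 3;

        /* Direct branch: the taken target is a fixed offset encoded in
           the instruction, so the front end knows both possible paths. */
        if (x > y)
            puts("taken");

        /* Indirect branch: the target is a computed value loaded from a
           table. Until ops[i] resolves, the destination could be anywhere
           in the address space: 1 of ~2^32 options, not 1 of 2. */
        int (*ops[2])(int, int) = { add, sub };
        int i = x & 1;
        printf("%d\n", ops[i](x, y));
        return 0;
    }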
*Nobody = not enough people willing to pay enough extra
**Floating point code tends to have fewer branches, and prediction accuracy for floating point code tends to be in the 98%+ range, so you almost never mispredict. Few branches + low mispredict rate = minimal possible performance benefit (the sketch after these footnotes puts rough numbers on this).
***P4 apparently had ~128; I can't find Phenom / i7 numbers.
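To put rough numbers on the argument above, here's a back-of-the-envelope model in C. Every constant in it (branch fraction, predictor accuracy, misprediction penalty, the 1-instruction-per-cycle baseline) is an assumption picked for illustration, not a measurement:

    #include <stdio.h>

    /* Upper bound on speedup from making every misprediction free,
       assuming a baseline of 1 cycle per instruction (CPI = 1). */
    static double speedup_ceiling(double branch_frac, double accuracy,
                                  double penalty_cycles) {
        double cpi_predicted = 1.0 + branch_frac * (1.0 - accuracy) * penalty_cycles;
        double cpi_perfect   = 1.0; /* no misprediction stalls at all */
        return cpi_predicted / cpi_perfect;
    }

    int main(void) {
        /* Assumed: ~20% branches, 95% accuracy, 15-cycle penalty. */
        printf("integer code: at most %.3fx faster\n",
               speedup_ceiling(0.20, 0.95, 15.0));   /* ~1.150x */

        /* Assumed: ~5% branches, 98% accuracy, same penalty. */
        printf("FP code:      at most %.3fx faster\n",
               speedup_ceiling(0.05, 0.98, 15.0));   /* ~1.015x */

        /* Eager execution of N in-flight branches runs 2^N paths. */
        int n = 6; /* ~20% of 32 in-flight instructions */
        printf("paths for %d branches: %d\n", n, 1 << n);  /* 64 */
        return 0;
    }

Even in the best case, where running both paths makes every misprediction free, the ceiling under these assumptions is ~15% for integer code and ~1.5% for FP code, while covering all ~6 in-flight branches costs 2^6 = 64 paths' worth of execution.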