Performance boost of Speculative Execution: what's the speedup? (Spectre-related discussion)

dondon44 · Mar 4, 2018

This thread is currently discussing the topic:

Performance boost of Speculative Execution: what's the speedup? (Spectre-related discussion)

-- additional details are in the post #16 --

Original Post:

Title: Speculative Execution and Branch Prediction

I'm starting this thread to discuss whether branch prediction must include speculative execution, and whether it is a form of speculative execution.

Post your opinions backed up by facts and explanations.

I'll start out with my opinion:
Branch prediction does not necessarily involve Speculative execution, as is demonstrated by the fact that Pentium and Pentium MMX series of CPUs have branch prediction, but they don't have speculative execution.
Furthermore, branch prediction is not a form of speculative execution. The word "execution" in speculative execution refers to the actual execution stage of the CPU. Branch prediction can be simply used to fill the pipeline UP TO the execution stage, and in that case the branch prediction is not a form of speculative execution. In such a case, the branch prediction is just like an enhanced instruction prefetcher.

Otherwise, we would also need to say that instruction prefetching is a form of speculative execution, because, well, the instruction prefetcher can also miss. But we don't say that instruction prefetching is a form of speculative execution; therefore, branch prediction is also not a form of speculative execution.

I'll add more facts and explanations as necessary.

What is your opinion?

whm1974 · Mar 4, 2018

Well Speculative Execution(SE) has been part of CPU design for 20+ years now and I can imagine that modern software expects this feature to be available.

yoram · Mar 5, 2018

There are many kinds of speculative execution in modern systems:

- out of order speculation, as in using Tomasulo's algorithm or similar
- conditional instructions at the compiler/ISA level
- conditional execution at a thread level (common in GPUs)
- hardware transactional execution

Even for the first, one could imagine implementations that are susceptible to Spectre after a branch misprediction but are not true out-of-order execution. But yes, commonly you're probably right for the first. But not for the others.

In the end it's arguing about definitions of words which is usually not very fruitful.

TempAcc99 · Mar 5, 2018

It depends on how you define things.

Wiki says:

Speculative execution is an optimization technique where a computer system performs some task that may not be needed.

and

Without branch prediction, the processor would have to wait until the conditional jump instruction has passed the execute stage before the next instruction can enter the fetch stage in the pipeline. The branch predictor attempts to avoid this waste of time by trying to guess whether the conditional jump is most likely to be taken or not taken. The branch that is guessed to be the most likely is then fetched and speculatively executed. If it is later detected that the guess was wrong then the speculatively executed or partially executed instructions are discarded and the pipeline starts over with the correct branch, incurring a delay.

By that definition fetching data from memory which might not be needed is speculative execution.

If we use your definition that execution = execution stage of CPU pipeline then yes you can do branch prediction without speculative execution. So it's a matter of how you define certain terms and from googling it seems this is in general not well defined.
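
To make the "guess whether the conditional jump is most likely to be taken" part of the quoted passage concrete, here is a minimal sketch of one classic prediction scheme, a 2-bit saturating counter. This is only an illustration: the function name, the starting state, and the loop pattern are assumptions made for the example, and real predictors in the CPUs discussed here are far more elaborate.

```python
def simulate_two_bit_predictor(outcomes, initial_state=2):
    """Count correct predictions for a sequence of branch outcomes.

    The counter holds a state from 0 to 3. States 0-1 predict "not taken",
    states 2-3 predict "taken". (Hypothetical toy model, not a real CPU.)
    """
    state = initial_state
    correct = 0
    for taken in outcomes:
        predicted_taken = state >= 2
        if predicted_taken == taken:
            correct += 1
        # Move the counter toward the actual outcome, saturating at 0 and 3.
        if taken:
            state = min(state + 1, 3)
        else:
            state = max(state - 1, 0)
    return correct


# A loop branch that is taken 9 times and then falls through, repeated 10 times.
loop_pattern = ([True] * 9 + [False]) * 10
hits = simulate_two_bit_predictor(loop_pattern)
print(f"{hits} of {len(loop_pattern)} predictions correct")
```

Note that the predictor only produces a guess. Whether the guessed path is merely fetched or also executed before the branch resolves is exactly the definitional question being argued in this thread.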

dondon44 · Mar 5, 2018

whm1974 said:
Well Speculative Execution(SE) has been part of CPU design for 20+ years now and I can imagine that modern software expects this feature to be available.

I strongly disagree with this statement. Speculative execution is invisible to the software (or, at least, it should be invisible in the absence of design errors such as Spectre). The modern software does not depend on SE or expect SE. For example, many modern processors don't even have SE, but the software works identically whether the processor has SE or not.

yoram said:
There are many kinds of speculative execution in modern systems:

- out of order speculation, as in using Tomasulo's algorithm or similar
- conditional instructions at the compiler/ISA level
- conditional execution at a thread level (common in GPUs)
- hardware transactional execution

Even for the first, one could imagine implementations that are susceptible to Spectre after a branch misprediction but are not true out-of-order execution. But yes, commonly you're probably right for the first. But not for the others.

Unfortunately, and maybe you have misunderstood my intention, but I'm not talking about the bottom three.
I'm talking about processor issuing instructions to the execution units before branch condition is resolved
(or before indirect address is computed).
I think that is the most common and most widely accepted meaning of the word "speculative execution".
Since I'm mostly talking in the context of Spectre, then perhaps the fourth one
could also come into play, but I wasn't referring to that.

yoram said:
In the end it's arguing about definitions of words which is usually not very fruitful.

Well, yes and no. We need to have at least somewhat strict definitions in order to be able to communicate precisely. Otherwise, a discussion becomes meaningless, with no one knowing the exact meaning of particular words.

TempAcc99 said:
By that definition fetching data from memory which might not be needed is speculative execution.

If we use your definition that execution = execution stage of CPU pipeline then yes you can do branch prediction without speculative execution. So it's a matter of how you define certain terms and from googling it seems this is in general not well defined.

I agree, fetching data (not instructions!) which might not be needed from memory
is a form of speculative execution.

OK, I agree that the definition of "speculative execution" might be somewhat fuzzy.
Let's set the definition as said, that speculative execution refers to the execution stage of the CPU.
In that case, I can see that you agree with me.

Zodiark1593 · Mar 5, 2018

whm1974 said:
Well Speculative Execution(SE) has been part of CPU design for 20+ years now and I can imagine that modern software expects this feature to be available.

Not necessarily. Speculation, and for that matter out-of-order execution, are implementation details that the x86 ISA leaves "undefined": they are not part of the spec, but they do not deviate from it either.

X86 code should produce the same result regardless as to whether speculation was used.

Phynaz · Mar 5, 2018

dondon44 said:
For example, many modern processors don't even have SE

Please provide a citation to the above.

Zodiark1593 · Mar 5, 2018

Phynaz said:
Please provide a citation to the above.

https://www.raspberrypi.org/blog/why-raspberry-pi-isnt-vulnerable-to-spectre-or-meltdown/

Would this prove sufficient?

Phynaz · Mar 5, 2018

Zodiark1593 said:
https://www.raspberrypi.org/blog/why-raspberry-pi-isnt-vulnerable-to-spectre-or-meltdown/

Would this prove sufficient?

That's 2, one of which isn't modern. That's not "many".

Zodiark1593 · Mar 5, 2018

Phynaz said:
That's 2, one of which isn't modern. That's not "many".

Perhaps as far as the number of CPU designs that lack speculation goes, though the Cortex A53 design is used today in very many devices, either alone or in a big.LITTLE configuration.

dondon44 · Mar 7, 2018

Recap.

Apparently, the term "Speculative Execution" isn't precisely defined. In the stricter sense, it refers to the ability of a processor to execute instructions that were not requested to be executed. If the processor later determines that the results of some speculatively executed instructions are not required, it will throw away the results and roll back all the effects of speculative execution.

In the looser sense, the term might include various other tasks that a CPU performs, where the results of execution might eventually be thrown away or rolled back.

yoram posted the hardware transactional execution as one example of the 'grey' area. In this case, a programmer specifically requests a CPU to execute a sequence of instructions with a rollback option. Whether the word 'speculative' actually fits the description is probably each person's opinion.

Another interesting example is the recent "Spectre" security problem. In the context of "Spectre", the term "speculative execution" does not refer to the strict sense only, but to anything the processor does that might eventually need to be undone.

From the standpoint of a programmer and the software, the "speculative execution" feature is functionally invisible. The software executes exactly the same, whether speculative execution is used or not. The software never asks for speculative execution to be enabled or disabled, because the results of execution are guaranteed to be the same in both cases. The processor uses speculative execution to speed up the processing without any impact on the software.
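
The "execute, then commit or roll back" idea in this recap can be sketched with a toy model. Everything here is hypothetical: `run_branch`, the register names, and the two tiny "instructions" exist only for illustration, and the model covers architectural register state only. It deliberately ignores timing and cache effects, which is where Spectre comes in.

```python
def run_branch(registers, condition, taken_path, fallthrough_path, predict_taken=None):
    """Execute one conditional branch and return the resulting register state.

    predict_taken=None models a core that waits for the branch to resolve.
    Otherwise the predicted path runs on a scratch copy before the condition
    is checked, and is committed only if the prediction turns out to be right.
    """
    if predict_taken is not None:
        # Speculate: run the predicted path on a copy so it can be discarded.
        scratch = dict(registers)
        predicted_path = taken_path if predict_taken else fallthrough_path
        for instruction in predicted_path:
            instruction(scratch)
        if condition(registers) == predict_taken:
            return scratch  # prediction was right: commit the speculative work
        # Prediction was wrong: drop the scratch copy (roll back) and redo below.
    state = dict(registers)
    correct_path = taken_path if condition(registers) else fallthrough_path
    for instruction in correct_path:
        instruction(state)
    return state


def add_ten(regs):
    regs["r1"] += 10


def double(regs):
    regs["r1"] *= 2


initial = {"r0": 3, "r1": 1}
is_positive = lambda regs: regs["r0"] > 0

# No speculation, correct prediction, wrong prediction: same final state.
results = [
    run_branch(initial, is_positive, [add_ten], [double], predict_taken=p)
    for p in (None, True, False)
]
print(results)
```

In this sketch all three runs end in the same register state, which is the sense in which speculation is "functionally invisible" to software; the only thing that would differ on real hardware is how long each run takes.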

TempAcc99 · Mar 7, 2018

dondon44 said:
The processor uses speculative execution to speed up the processing without any impact on the software.

I'm gonna argue that performance has a huge impact on the software and, as we are starting to see now, this hit can be quite big.

dondon44 · Mar 7, 2018

TempAcc99 said:
I'm gonna argue that performance has a huge impact on the software and, as we are starting to see now, this hit can be quite big.

It is indisputable that performance has a huge impact on software.

But there is a difference between saying that software "doesn't work", "works incorrectly" and "works too slowly". The third case is, in some ways, substantially different from the first two.

So, in the sense of the first two mentioned cases, speculative execution has no effect.

In the third mentioned case, a CPU without SE might run too slowly. It's the same as a CPU with a clock set too low. This might be detrimental in some situations, for example, real-time applications. Or, the loss of performance might cause only some minor annoyances (like lower frame rates).

Of course, the severity of the problems is related to the amount of slowdown. A small slowdown brings only slight problems; large slowdowns might completely break most software. On the other hand, most software is made to function on CPUs of widely different performance. In that sense, a relatively "small" slowdown (of less than 50%) is unlikely to break almost any software.

TempAcc99 · Mar 8, 2018

dondon44 said:
But there is a difference between saying that software "doesn't work", "works incorrectly" and "works too slowly". The third case is, in some ways, substantially different from the first two.

I disagree. Working too slowly can mean it's unusable, like a video game that only outputs 1 fps. It works 100% correctly, but it is too slow and hence fundamentally broken, even if all the calculations done are completely correct. And fixing a bug (e.g. a coding error) can often be much easier than fixing performance issues.

dondon44 · Mar 9, 2018

If you had a game which runs at 30 fps, and then you disable speculative execution and get 1 fps, then I'll agree with you.

The question is then: what's the amount of slowdown that would be caused by disabling speculative execution?

~~But that is a separate question for a separate thread.~~

dondon44 · Mar 9, 2018

Here is a new topic for this thread:

What would be the average speed decrease of a modern CPU when the speculative execution feature is disabled?

Post your opinions backed up by reasons. That is, don't just say "it would be about XX% decrease"; you need to explain why you are estimating approx. XX% speed decrease.

Apparently, the term "speculative execution" is not precisely defined. In the context of this question, we are talking about the strict meaning of speculative execution. Therefore, it is speculative execution following a conditional branch instruction, an indirect jump instruction or a return instruction. It does not include hardware transactional memory.

Also, don't worry about whether speculative execution can actually be disabled or not. That is a separate question.

Markfw · Mar 9, 2018

First, not all CPUs are created equal, so this could have quite a range of performance hits as far as the % goes. It could be 1% on one, and 30% on another. The only way to tell is to TEST all current CPUs and provide numbers for each design. Anything more than that is just wild speculation. It also depends on the application, and the type of data being used.

Just one thing to back this up more: cache size (L1, L2 and L3). A CPU with a one meg L3 and one with a 40 meg L3 will certainly differ in the performance loss %.

dondon44 · Mar 9, 2018

Markfw said:
The only way to tell is to TEST all current CPUs and provide numbers for each design. Anything more than that is just wild speculation. It also depends on the application, and the type of data being used.

In my opinion, you are only partially right. Much of what you say is true.

As you know, this would be practically impossible to test without a special CPU microcode provided by the CPU manufacturer (which was never provided). Therefore, your "only way" is, in fact, a no-way.

But there is a big difference between a 20% drop and a 90% drop. That is an order-of-magnitude difference.

I'm not asking to estimate the drop to 1% precision. I'm asking about a rough estimate. 10%, 20%, 30%,... 90%, that's a rough estimate, and it can be done with a good level of accuracy even WITHOUT testing.

For different applications, there would be substantial differences, but I'm asking ON AVERAGE, like, an average of various benchmark scores.

For different CPUs, the results will be different, but this difference should not be very pronounced because most modern architectures implement Speculative Execution in similar ways (and they all have Spectre).

dondon44 · Mar 10, 2018

As there were no replies in the past 20 hours, I'll start with my estimation.

The performance decrease would be low, about 15% on average. In some specific applications, it might go up to 30%, but that would be uncommon. In pathological cases, it might be up to 50%, but that would be extremely unlikely.

Reasons:
This paper shows that the expected amount of branch instructions in integer code is in the range 7 – 30%, with an average of less than 20%. For floating point code, it is in the range 0-20%, with an average of about 10%.

https://www.spec.org/workshops/2007...ance_Characterization_SPEC_CPU_Benchmarks.pdf

For each branch instruction encountered, a processor wastes cycles until it has resolved the branch condition. The branch condition might be immediately available, or it might be still resolving. While it is hard to estimate how much a CPU has to wait on average, we can put a rough estimate of 0.8 cycles per branch instruction.

A modern CPU executes about 1.4 instructions per cycle. If every seventh instruction is a branch instruction (about 14.3%), then each group of 7 instructions contains one branch. 7 instructions take about 5 cycles to execute, but without speculative execution it would be 0.8 cycles more, resulting in a slowdown of about 14%.

Disabling speculative execution does not mean disabling out-of-order execution. Out-of-order makes a big difference. In a CPU that has o-o-o execution, speculative execution is extremely easy to implement. That's why all CPUs with o-o-o execution also have speculative execution. Speculative execution is like a nice, free bonus of o-o-o execution, but it is not an essential feature of a high-performance CPU.
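
The back-of-the-envelope arithmetic behind the roughly 14% figure can be written out as a short sketch. The function name and parameters are made up for this example; the inputs (1.4 instructions per cycle, one branch in seven instructions, 0.8 stall cycles per branch) are the rough guesses from this post, not measurements.

```python
def slowdown_from_branch_stalls(ipc, branch_fraction, stall_cycles_per_branch):
    """Return the fractional throughput loss when every branch adds a fixed stall.

    Assumes (for illustration only) that stalls simply add to execution time
    and do not overlap with any other work.
    """
    base_cycles_per_instr = 1.0 / ipc
    stalled_cycles_per_instr = base_cycles_per_instr + branch_fraction * stall_cycles_per_branch
    return 1.0 - base_cycles_per_instr / stalled_cycles_per_instr


# Inputs taken from the estimate in this post.
estimate = slowdown_from_branch_stalls(ipc=1.4, branch_fraction=1 / 7, stall_cycles_per_branch=0.8)
print(f"estimated throughput loss: {estimate:.1%}")
```

With these inputs, 7 instructions go from 5 cycles to 5.8 cycles, so throughput drops to 5/5.8 of the original. The result is very sensitive to the stall-per-branch guess, which is the weakest number in the estimate.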

Zodiark1593 · Mar 10, 2018

dondon44 said:
As there were no replies in the past 20 hours, I'll start with my estimation.

The performance decrease would be low, about 15% on average. In some specific applications, it might go up to 30%, but that would be uncommon. In pathological cases, it might be up to 50%, but that would be extremely unlikely.

Reasons:
This paper shows that the expected amount of branch instructions in integer code is in the range 7 – 30%, with an average of less than 20%. For floating point code, it is in the range 0-20%, with an average of about 10%.

https://www.spec.org/workshops/2007...ance_Characterization_SPEC_CPU_Benchmarks.pdf

For each branch instruction encountered, a processor wastes cycles until it has resolved the branch condition. The branch condition might be immediately available, or it might be still resolving. While it is hard to estimate how much a CPU has to wait on average, we can put a rough estimate of 0.8 cycles per branch instruction.

A modern CPU executes about 1.4 instructions per cycle. If every seventh instruction is a branch instruction (about 14.3%), then each group of 7 instructions contains one branch. 7 instructions take about 5 cycles to execute, but without speculative execution it would be 0.8 cycles more, resulting in a slowdown of about 14%.

Disabling speculative execution does not mean disabling out-of-order execution. Out-of-order makes a big difference. In a CPU that has o-o-o execution, speculative execution is extremely easy to implement. That's why all CPUs with o-o-o execution also have speculative execution. Speculative execution is like a nice, free bonus of o-o-o execution, but it is not an essential feature of a high-performance CPU.

These are, of course, synthetic benchmarks, and fairly old ones at that (though somewhat relevant as Conroe is included). I do have to wonder if workloads such as gaming are going to be more branch-heavy than what an average would indicate. The average wouldn't be particularly relevant if it so happens that a common task performed on modern PCs faces relatively steep performance hits.

Unfortunately, I don't think we could test what sort of performance hit would be incurred if speculation ceased to exist in current use cases, considering the last x86 CPU to lack the capability was the original Atom core, and as far as I know, Intel hasn't released any means of disabling the functionality. For what it's worth, I don't know of many architectures that are wider than 2-issue while still omitting speculation.

Anandtech released an article featuring a Q&A with the lead architect of the Cortex A53. In the Q&A, Peter (the architect) mentioned that using speculation reduces performance/watt, but also mentions there is no real way around this tradeoff if higher performance is necessary. There's good information in that article (not necessarily on the topic, but in cpu architecture in general), so it may be worth a read.

https://www.anandtech.com/show/7591...-cortex-a53-lead-architect-peter-greenhalgh/2

dondon44 · Mar 12, 2018

Zodiark1593 said:
These are, of course, synthetic benchmarks, and fairly old ones at that (though somewhat relevant as Conroe is included). I do have to wonder if workloads such as gaming are going to be more branch-heavy than what an average would indicate. The average wouldn't be particularly relevant if it so happens that a common task performed on modern PCs faces relatively steep performance hits.

Unfortunately, I don't think we could test what sort of performance hit would be incurred if speculation ceased to exist in current use cases,

Gaming is mostly bottlenecked by the GPU. Therefore, games should not be substantially affected.

The paper I referenced is apparently from year 2007. I would be surprised if the density of branch instructions changed significantly since then.

I agree that we cannot test the effects of disabling speculative execution. We can only estimate, roughly.

I still stand by my estimation that disabling speculative execution would cause a performance hit in the range of 0 - 30% in most cases, and 15% on average. If someone has different numbers, or better and more accurate data than I do, please post it here. But no one has posted any relevant data so far. The only thing I can see are statements of dubious validity, for example: "anything but testing is a wild speculation".

In my opinion, saying that disabling speculative execution would cause "massive" performance hits is wrong. Also, saying that "performance hit would be much, much worse than 15%" is also wrong and unsubstantiated. Another misleading statement from Mr. Jan Olšan is: "people with some insight into this tend to expect performance drops in absence of branch prediction to be crippling" (Mr.Olšan said in another thread that branch prediction is speculative execution).

Tuna-Fish · Mar 12, 2018

dondon44 said:
As there were no replies in the past 20 hours, I'll start with my estimation.

The performance decrease would be low, about 15% on average. In some specific applications, it might go up to 30%, but that would be uncommon. In pathological cases, it might be up to 50%, but that would be extremely unlikely.

For each branch instruction encountered, a processor wastes cycles until it has resolved the branch condition. The branch condition might be immediately available, or it might be still resolving. While it is hard to estimate how much a CPU has to wait on average, we can put a rough estimate of 0.8 cycles per branch instruction.

This analysis is entirely wrong because you are ignoring pipeline length. Modern CPUs do not execute instructions one at a time, one per cycle. Instead, executing a single instruction takes many cycles (~15-25 on modern cpus), but it can start executing the next one on the next cycle after starting the previous one. For a more involved explanation, look here.

The only part where this does not work is branches, because they need to complete before the address of the next instruction is known. Without any branch prediction, you don't wait 0.8 cycles, you wait for your entire pipeline to drain. The cost of this for modern CPUs can be found here (pdf warning, look for misprediction penalty under any CPU you are interested in). For example, he measures 15-20 cycles for Skylake and 18 for Ryzen.

The real loss of losing branch prediction is somewhere in the neighborhood of 50%-80%, depending on workload.
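
As a rough sanity check, the same 7-instruction group from the estimate earlier in the thread (about 5 cycles at 1.4 instructions per cycle, one branch per group) can be re-run with a full pipeline drain instead of 0.8 cycles. This is only a sketch: `throughput_loss` is a made-up helper, and it assumes every branch pays the full penalty with nothing overlapping, which makes it a pessimistic bound.

```python
def throughput_loss(base_cycles, extra_cycles):
    """Fraction of throughput lost when extra_cycles are added to base_cycles."""
    return extra_cycles / (base_cycles + extra_cycles)


# Assumption carried over from the earlier post: 7 instructions, one of them
# a branch, take about 5 cycles when nothing stalls.
BASE_CYCLES_PER_GROUP = 5.0

# Penalties in cycles quoted in this post (15-20 for Skylake, 18 for Ryzen).
losses = {}
for penalty in (15, 18, 20):
    losses[penalty] = throughput_loss(BASE_CYCLES_PER_GROUP, penalty)
    print(f"penalty {penalty:2d} cycles -> throughput loss {losses[penalty]:.0%}")
```

Under these assumptions the loss comes out at 75% to 80%, near the upper end of the range given above; code with fewer branches per instruction would land lower.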

dondon44 · Mar 13, 2018

Tuna-Fish said:
This analysis is entirely wrong because you are ignoring pipeline length. Modern CPUs do not execute instructions one at a time, one per cycle. Instead, executing a single instruction takes many cycles (~15-25 on modern cpus), but it can start executing the next one on the next cycle after starting the previous one. For a more involved explanation, look here.

I have been programming for 20 years and I'm very familiar with the concept of a pipeline and its effects.

Tuna-Fish said:
Without any branch prediction, you don't wait 0.8 cycles, you wait for your entire pipeline to drain. The cost of this for modern CPUs can be found here (pdf warning, look for misprediction penalty under any CPU you are interested in). For example, he measures 15-20 cycles for Skylake and 18 for Ryzen.

The real loss of losing branch prediction is somewhere in the neighborhood of 50%-80%, depending on workload.

You are completely mistaken. Where did I talk about disabling branch prediction? I said: what if speculative execution is disabled. Speculative execution is not branch prediction, at least not in the context of this thread (see the original posts here #1 and here #16).

Please, don't confuse the terms again. In this discussion, the term "speculative execution" is defined as not including branch prediction. And the question is: what if speculative execution, as defined, is disabled.
