They have a pretty interesting feature called, and I might be referring to the wrong name here, 'conditional execution' [google] that permits execution of instructions past a 'flow-control point' [if-statement] with an 'I'll tell you in the future whether to commit this data', as in 'I'll tell you whether we actually should have executed this' / 'I'll tell you the result of this if-statement'. So when you reach a branch and don't know its result ahead of time, this lets the processor continue executing as if it had perfectly predicted the branch, and then 'back up' [rather, toss] the results of that code if it turns out the guess was wrong. Between this and compiler instruction re-ordering, conditionals that would otherwise precipitate a stall (the wasted execution resources that HT alleviates) can be handled fairly efficiently.
I just realized I have not fully made the connection between this and 'being multi-threaded': I seem to recall this is one thing that's done in lieu of HT to negate the need for it. If you can immediately jump to the end of the flow-control statement and keep going once you learn you didn't need to be executing it in the first place, you--
ok this is the connecting point: the compiler re-orders (shifts forward) some instructions that come after the flow-control/if-statement, the ones 'that it'll need to do no matter what', so the CPU can execute them while it's still waiting on the result of the if-statement. Sometimes there aren't enough of those 'no matter what' computations, and that's when the compiler says 'start executing the conditional block anyways, and if we didn't actually enter the if-statement I'll tell you in the future [conditionally commit], before commit, to dump those results'. Combined with likely() and unlikely() C macros it's gotten to a point where there are limited gains available from HT-- because a sufficiently efficient alternative has already been implemented. This is, for example, one reason why Intel's x86 Atoms are only marginally competitive with ARM for SFF devices-- and that's -with- the process node advantage.
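(for reference, those likely()/unlikely() hints aren't standard C; they're usually just wrappers around GCC/Clang's __builtin_expect, roughly as the Linux kernel defines them. Minimal sketch below; the sum_buf function is just a made-up example of how they get used:)

```c
/* Common Linux-kernel-style definitions of the branch hint macros,
 * built on GCC/Clang's __builtin_expect. They only tell the compiler
 * which way a branch usually goes, so it can put the likely path on
 * the straight-line (fall-through) side of the branch. */
#define likely(x)   __builtin_expect(!!(x), 1)
#define unlikely(x) __builtin_expect(!!(x), 0)

/* Made-up example: the error check is hinted as the cold path, so the
 * hot loop stays contiguous and the usual case falls straight through. */
int sum_buf(const int *buf, int len)
{
    if (unlikely(!buf || len <= 0))
        return -1;

    int sum = 0;
    for (int i = 0; i < len; i++)
        sum += buf[i];
    return sum;
}
```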
Also, I think that re-ordering helps with memory access/cache miss stalls, too. This is the main connecting point I'm missing/that I don't recall/that I can't comment on, AKA 'all that I just typed/described doesn't quite address memory access stalls'.
I think you're confusing some different technologies here.
The ability to speculatively execute instructions past a branch before the branch's direction is known is branch prediction (combined with speculative execution), and it has been used in processors for a long time. That's pretty much what you seem to be describing. This feature doesn't need to be expressed by the instruction set at all; the CPU can and does do it completely transparently. And it doesn't use SMT to accomplish it.
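You can see it working without the ISA being involved at all with the classic sorted-vs-unsorted experiment: the exact same instructions run in both cases, only the predictability of the branch changes. A minimal sketch (function names and array size are just made up for illustration):

```c
#include <stdio.h>
#include <stdlib.h>

#define N (1 << 20)

/* The if() below compiles to the same conditional branch whether the
 * data is sorted or not. With random data the branch goes either way
 * about half the time and mispredicts constantly; with sorted data the
 * pattern becomes trivially predictable, the CPU speculates down the
 * right path almost every time, and the loop runs noticeably faster.
 * Nothing in the instruction stream changes. */
static long sum_above(const int *data, int n, int threshold)
{
    long sum = 0;
    for (int i = 0; i < n; i++)
        if (data[i] >= threshold)
            sum += data[i];
    return sum;
}

static int cmp_int(const void *a, const void *b)
{
    return *(const int *)a - *(const int *)b;
}

int main(void)
{
    int *data = malloc(N * sizeof *data);
    if (!data)
        return 1;
    for (int i = 0; i < N; i++)
        data[i] = rand() % 256;

    long r1 = sum_above(data, N, 128);   /* unpredictable branch */
    qsort(data, N, sizeof *data, cmp_int);
    long r2 = sum_above(data, N, 128);   /* same work, predictable branch */

    printf("%ld %ld\n", r1, r2);         /* same sums, very different speed */
    free(data);
    return 0;
}
```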
ARM has had conditional execution for most instructions, also known as predication. It makes it so individual instructions are nullified if a condition isn't true. Short branches can be converted into branchless code this way. Most predication was removed from AArch64 in ARMv8, because it wasn't worth keeping. It was only really helpful for very poorly predictable branches, and even then not always, because it creates new dependencies which can inhibit the ability to reorder code, and it adds more complexity in the critical path of the processor. Maybe worst of all, making most instructions predicable wastes too many bits in the instruction encoding that could have been used for other things (like more registers). Even before AArch64, ARM started recommending predication only be used in very limited scenarios in their newer 32-bit processors.
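To make that concrete, here's the usual textbook case: a short if/else that a 32-bit ARM compiler can turn into branchless code. The assembly in the comment is roughly what a compiler might emit, not output from any particular toolchain:

```c
/* Branchy version: on most ISAs this becomes a compare plus a
 * conditional branch that the CPU has to predict. */
int max_int(int a, int b)
{
    if (a > b)
        return a;
    return b;
}

/* On 32-bit ARM the same function can be compiled with predicated
 * instructions instead of a branch, roughly:
 *
 *     CMP   r0, r1      ; set flags from a - b
 *     MOVLE r0, r1      ; nullified unless a <= b; otherwise r0 = b
 *     BX    lr
 *
 * No branch left to predict, but the MOVLE now depends on the flags,
 * which is exactly the kind of new dependency mentioned above, and the
 * condition field eats encoding bits on every instruction. */
```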
Neither branch prediction nor predication really negates the benefits of SMT. While SMT could be used to hide the latency of branches where prediction isn't available (AFAIK some GPUs work this way), most CPUs with SMT also have branch prediction.
OoOE, on the other hand, can negate some benefits of SMT by allowing better exploitation of parallelism within the same thread instead of parallelism between multiple threads. So in some cases SMT is more effective on in-order processors, especially wider ones. But for a lot of code it's beneficial on out-of-order processors too, because not enough parallelism can be found in single threads. It can also help reduce the branch misprediction penalty by splitting the number of instructions that would have been queued past the branch for one thread into smaller amounts queued for each of the two threads (assuming the other thread doesn't end up discarding those instructions too).
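A concrete picture of the 'not enough parallelism in one thread' case (a minimal sketch with made-up function names): a reduction written as one long dependency chain gives an out-of-order core almost nothing to overlap, while splitting it into independent accumulators recovers some parallelism from within the thread. SMT attacks the same shortage from the other side, by supplying independent work from a second thread.

```c
/* One long dependency chain: each add needs the previous sum, so even a
 * wide out-of-order core spends most of its time waiting on that chain. */
double sum_serial(const double *x, int n)
{
    double s = 0.0;
    for (int i = 0; i < n; i++)
        s += x[i];
    return s;
}

/* Four independent chains: the adds within an iteration don't depend on
 * each other, so the core can keep several additions in flight at once.
 * When a single thread can't be rewritten like this, SMT can fill the
 * idle execution slots with instructions from another thread instead. */
double sum_unrolled(const double *x, int n)
{
    double s0 = 0.0, s1 = 0.0, s2 = 0.0, s3 = 0.0;
    int i;
    for (i = 0; i + 3 < n; i += 4) {
        s0 += x[i];
        s1 += x[i + 1];
        s2 += x[i + 2];
        s3 += x[i + 3];
    }
    for (; i < n; i++)
        s0 += x[i];
    return (s0 + s1) + (s2 + s3);
}
```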
Something you may be thinking of is speculative multi-threading, or SpMT. Here, a separate thread is used to speculatively execute multiple pieces of code in parallel, for example following both sides of a branch. As far as I know, SpMT hasn't been implemented in any commercial processors, although it was part of Andy Glew's vision for CMT long before Steamroller came out.