Why doesn't ARM implement SMT or CMT modules?


A5

Diamond Member
Jun 9, 2000
4,902
5
81
Smartphones don't need a huge number of ridiculously slow threads, but where MIPS is used there's a need for that.

Yep. MIPS is good if you need a bunch of slow cores in a small amount of space, which isn't really a factor in mobile.
 

Exophase

Diamond Member
Apr 19, 2012
4,439
9
81
SMT on the other hand does not help much with branch mispredicts, because the moment you are able to detect the mispredict you already have wasted time.

I think I /have/ been confusing misprediction penalties with cache miss penalties.

SMT should be able to help reduce the penalty of branch mispredictions too.

Normally, when you hit a misprediction everything ahead of the branch in the instruction window (inserted more recently) has to be flushed. But if the instructions come from another thread they don't have to be. This should average out to a roughly 50% reduction in thrown out work.

It doesn't help much if you hit another branch misprediction in the other thread shortly thereafter, but the same thing goes for cache misses. Fortunately, branch mispredictions and cache misses are often fairly spaced apart and don't correlate much across threads.
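
To make the accounting concrete, here is a toy sketch in Python (everything here is hypothetical for illustration: the 32-entry window, the two-thread round-robin interleave, and the flush_cost helper are made up, not taken from any real core):

from dataclasses import dataclass

@dataclass
class Entry:
    thread: int  # which SMT thread inserted this entry
    seq: int     # insertion order into the window

def flush_cost(window, branch_index):
    """Entries discarded when the branch at branch_index mispredicts."""
    branch = window[branch_index]
    younger = window[branch_index + 1:]
    st_cost = len(younger)  # single-threaded: every younger entry goes
    # SMT: only younger entries from the mispredicting thread go
    smt_cost = sum(1 for e in younger if e.thread == branch.thread)
    return st_cost, smt_cost

# 32-entry window, threads 0 and 1 alternating; thread 0 mispredicts at slot 8
window = [Entry(thread=i % 2, seq=i) for i in range(32)]
st, smt = flush_cost(window, branch_index=8)
print(f"single-thread flush: {st}, SMT flush: {smt}")  # 23 vs 11, roughly half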
 

beginner99

Diamond Member
Jun 2, 2009
5,315
1,760
136
Not true; even small cores can handle multiple threads. MIPS CPUs handle 4 threads per core. They are used in networking, where this is very useful. There's even an article about it on the home page.


For their use case it makes sense; it makes less sense for ARM, as they aren't the preferred solution in networking hardware, just as MIPS isn't the preferred solution for Android smartphones.

Initial Atoms used SMT because they were in-order cores. Then came Silvermont, which is an out-of-order core, and Intel removed it again. That's what the poster probably forgot. These MIPS cores might be in-order too, and it's obvious why SMT is beneficial in in-order cores, even if they are small.
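
As a rough illustration of why that is, here is a minimal Python sketch of a single-issue in-order core (the run helper and all latencies are invented; only the in-order/SMT contrast comes from the post): when one thread stalls, a second hardware thread keeps the issue slot busy.

def run(threads):
    """Issue at most one instruction per cycle from any ready thread."""
    pcs = [0] * len(threads)       # next instruction index per thread
    ready_at = [0] * len(threads)  # cycle at which each thread unblocks
    issued, cycle = 0, 0
    while any(pc < len(t) for pc, t in zip(pcs, threads)):
        for tid, t in enumerate(threads):
            if pcs[tid] < len(t) and ready_at[tid] <= cycle:
                ready_at[tid] = cycle + t[pcs[tid]]  # stall for its latency
                pcs[tid] += 1
                issued += 1
                break              # only one issue slot per cycle
        cycle += 1
    return cycle, issued

prog = [1, 3, 1, 1, 4, 1, 2, 1]    # made-up per-instruction latencies
c1, n1 = run([prog])               # one thread:  8 instrs in 14 cycles (~57%)
c2, n2 = run([prog, list(prog)])   # two threads: 16 instrs in 18 cycles (~89%)
print(c1, n1, c2, n2)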
 
Dec 30, 2004
12,553
2
76
SMT should be able to help reduce the penalty of branch mispredictions too.

Normally, when you hit a misprediction everything ahead of the branch in the instruction window (inserted more recently) has to be flushed. But if the instructions come from another thread they don't have to be. This should average out to a roughly 50% reduction in thrown out work.

It doesn't help much if you hit another branch misprediction in the other thread shortly thereafter, but the same thing goes for cache misses. Fortunately, branch mispredictions and cache misses are often fairly spaced apart and don't correlate much across threads.
so, kill the branch prediction, and use SMT instead
 
Dec 30, 2004
12,553
2
76
Initial Atoms used SMT because they were in-order cores. Then came Silvermont, which is an out-of-order core, and Intel removed it again. That's what the poster probably forgot. These MIPS cores might be in-order too, and it's obvious why SMT is beneficial in in-order cores, even if they are small.
IIRC they can implement HT for 10% of the die space needed to implement the OoOE
 

Thala

Golden Member
Nov 12, 2014
1,355
653
136
SMT should be able to help reduce the penalty of branch mispredictions too.

Normally, when you hit a misprediction everything ahead of the branch in the instruction window (inserted more recently) has to be flushed. But if the instructions come from another thread they don't have to be. This should average out to a roughly 50% reduction in thrown out work.

You first would have to partially decode before you can make the decision to switch threads. And even then you would not want to switch threads on every branch, speculating that the decoded branch target is a miss.

And then, since we are talking ARM, branch mispredicts are much less expensive than on, say, x86. On Silvermont we are talking 10 cycles, while on Cortex-A7 it is about 3.

In one of our early processor designs (15 years ago) we switched threads each clock cycle. With 8 threads we completely hid all branch and load delays. But this concept is not very typical for SMT and does not help much with cache misses.
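
That per-cycle rotation is essentially a barrel processor. A minimal sketch in Python (the 8-thread/8-cycle figures come from the post above; the barrel helper and cycle counts are invented):

def barrel(n_threads, worst_delay, num_cycles=10_000):
    """Strict round-robin, one thread per cycle; count wasted slots."""
    stalls = 0
    unblock = [0] * n_threads  # cycle at which each thread's result is ready
    for cycle in range(num_cycles):
        tid = cycle % n_threads          # pure rotation, no scheduling logic
        if unblock[tid] > cycle:
            stalls += 1                  # revisited before its delay expired
        else:
            unblock[tid] = cycle + worst_delay  # assume worst case every time
    return stalls

print(barrel(8, 8))  # -> 0: eight threads fully hide 8-cycle delays
print(barrel(4, 8))  # -> nonzero: too few threads to cover the delay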
 

Exophase

Diamond Member
Apr 19, 2012
4,439
9
81
You first would have to partially decode before you can make the decision to switch threads. And even then you would not want to switch threads on every branch, speculating that the decoded branch target is a miss.

I'm not talking about switching threads on branches here. The scenario I described above applies with a simple round-robin SMT that alternates fetches between two threads into separate prefetch queues (which is what Bulldozer does, for instance).

If 50% of the instructions ahead of the branch in the instruction window are from another thread, you don't have to flush them on a branch misprediction.
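
A rough sketch of that kind of front end in Python (the queue depths, PCs, and the mispredict helper are all hypothetical; only the alternate-fetch-into-per-thread-queues idea comes from the post):

from collections import deque

queues = [deque(), deque()]   # one prefetch queue per thread
fetch_pc = [0x1000, 0x2000]   # hypothetical fetch PCs per thread

def fetch_cycle(cycle):
    tid = cycle % 2           # simple round-robin fetch between the threads
    queues[tid].append(fetch_pc[tid])
    fetch_pc[tid] += 4

def mispredict(tid, correct_target):
    queues[tid].clear()       # discard only this thread's fetched work
    fetch_pc[tid] = correct_target

for cycle in range(8):
    fetch_cycle(cycle)
print([len(q) for q in queues])  # [4, 4]
mispredict(0, 0x1800)
print([len(q) for q in queues])  # [0, 4] -- the other thread keeps its work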

And then, since we are talking ARM, branch mispredicts are much less expensive than on, say, x86. On Silvermont we are talking 10 cycles, while on Cortex-A7 it is about 3.

I don't know where you got those numbers, but Cortex-A7 has an 8-stage pipeline, so I doubt it has such a low branch misprediction penalty. It also doesn't represent all ARM designs. The branch misprediction penalty on Cortex-A15, A57, and A72 is about 15 cycles. It's similar on Apple's Cyclone core.

In one of our early processor designs (15 years ago) we switched threads each clock cycle. With 8 threads we completely hid all branch and load delays. But this concept is not very typical for SMT and does not help much with cache misses.

Well yeah, if you're switching to threads that aren't even active (because they yielded or because they're stalled on a cache miss), that's not going to be as effective. But if you have two non-blocked threads available, alternating the very first part of the front end roughly 50/50 isn't too bad of an approach.
 

Dresdenboy

Golden Member
Jul 28, 2003
1,730
554
136
citavia.blog.de
I'm not talking about switching threads on branches here. The scenario I described above applies with a simple round-robin SMT that alternates fetches between two threads into separate prefetch queues (which is what Bulldozer does, for instance).

If 50% of the instructions ahead of the branch in the instruction window are from another thread, you don't have to flush them on a branch misprediction.
It's not just the IW, which might fill up if there are some long-latency things going on (cache misses, some FP ops...). But in a superscalar OoO pipeline there will be many instructions on the way to their EX/AGU/FP units, including the branch instruction, spread across stages like dec0, dec1, dec2, pack, dispatch, schedule, issue/rf, ALU/branch unit, writeback. 9 stages in a 4-wide design -> 35 other instructions wasted when the branch gets resolved. If there are some dependencies leading to a later branch resolution -> even more wasted (those instructions got pumped into our big buffers while we sorted out what to do - out of order ;)).
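
The slot arithmetic spelled out in Python (same assumed 9-stage, 4-wide numbers as above):

stages, width = 9, 4        # dispatch-to-EX stages in a 4-wide design
in_flight = stages * width  # 36 instruction slots in that stretch
wasted = in_flight - 1      # everything in flight except the branch itself
print(wasted)               # -> 35, the figure quoted above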
 

Thala

Golden Member
Nov 12, 2014
1,355
653
136
I'm not talking about switching threads on branches here. The scenario I described above applies with a simple round-robin SMT that alternates fetches between two threads into separate prefetch queues (which is what Bulldozer does, for instance).

If 50% of the instructions ahead of the branch in the instruction window are from another thread, you don't have to flush them on a branch misprediction.

Understood.

I don't know where you got those numbers, but Cortex-A7 has an 8-stage pipeline, so I doubt it has such a low branch misprediction penalty. It also doesn't represent all ARM designs. The branch misprediction penalty on Cortex-A15, A57, and A72 is about 15 cycles. It's similar on Apple's Cyclone core.

Disregard the quoted A7 number; it is most likely higher (though lower than on A8/A9/A15/A57).

Well yeah, if you're switching to threads that aren't even active (because they yielded or because they're stalled on a cache miss), that's not going to be as effective. But if you have two non-blocked threads available, alternating the very first part of the front end roughly 50/50 isn't too bad of an approach.

Well, in our particular design we had no caches (TCM/SRAM only) and no address translation. So every possible delay (load delay, branch miss delay, operand-use delay) was deterministic and less than 8 cycles. So with 8 threads we got rid of the issues completely :)
 

Idontcare

Elite Member
Oct 10, 1999
21,110
59
91
Not sure if it was mentioned already, but implementing SMT/CMP effectively comes at a cost in terms of chip design and validation time/expense.

There was a good reason Intel took so long to implement HT: the investment expense had to be worth it. For these ARM chip designs I would be surprised if they've reached the point where it makes sense to invest all the requisite resources in SMT design and validation versus just copy/pasting another core onto the die and connecting it with a ring bus or the like.
 

Exophase

Diamond Member
Apr 19, 2012
4,439
9
81
It's not just the IW, which might fill up if there are some long-latency things going on (cache misses, some FP ops...). But in a superscalar OoO pipeline there will be many instructions on the way to their EX/AGU/FP units, including the branch instruction, spread across stages like dec0, dec1, dec2, pack, dispatch, schedule, issue/rf, ALU/branch unit, writeback. 9 stages in a 4-wide design -> 35 other instructions wasted when the branch gets resolved. If there are some dependencies leading to a later branch resolution -> even more wasted (those instructions got pumped into our big buffers while we sorted out what to do - out of order ;)).

All of those instructions are in the instruction window (i.e., a ROB or similar) until retirement. I'm not talking about it filling up from long-latency instructions.

What I'm trying to say is that you don't have to flush the entire instruction window on a branch misprediction. Only the instructions that are "dependent" on the branch prediction, meaning the ones that entered the window after the branch in the same thread. You don't have to remove ones from another thread that entered afterwards. That, on average, roughly halves the number of instructions that must be flushed by the branch misprediction.
 
Dec 30, 2004
12,553
2
76
You first would have to partially decode before you can make the decision to switch threads. And even then you would not want to switch threads on every branch, speculating that the decoded branch target is a miss.

And then, since we are talking ARM, branch mispredicts are much less expensive than on, say, x86. On Silvermont we are talking 10 cycles, while on Cortex-A7 it is about 3.

In one of our early processor designs (15 years ago) we switched threads each clock cycle. With 8 threads we completely hid all branch and load delays. But this concept is not very typical for SMT and does not help much with cache misses.
wow, that's cool. what architecture?
 

Dresdenboy

Golden Member
Jul 28, 2003
1,730
554
136
citavia.blog.de
All of those instructions are in the instruction window (i.e., a ROB or similar) until retirement. I'm not talking about it filling up from long-latency instructions.

What I'm trying to say is that you don't have to flush the entire instruction window on a branch misprediction. Only the instructions that are "dependent" on the branch prediction, meaning the ones that entered the window after the branch in the same thread. You don't have to remove ones from another thread that entered afterwards. That, on average, roughly halves the number of instructions that must be flushed by the branch misprediction.
Of course, the instructions fill the IW, n per cycle. But they don't stay there and wait for the single branch to go through the n-wide pipeline alone. Ready instructions also enter the pipeline, so there can be up to (n x involved stages) - 1 instructions besides the branch (counting the stages from dispatch to EX), distributed over those stages.

The IW also doesn't have to be flushed; it can simply be invalidated. Instruction streams are only read and executed (self-modifying code aside), not written anywhere.

Did my explanation help? Maybe I'm just bad at it. ;)
 

Exophase

Diamond Member
Apr 19, 2012
4,439
9
81
Of course, the instructions fill the IW, n per cycle. But they don't stay there and wait for the single branch to go through the n-wide pipeline alone. Ready instructions also enter the pipeline, so there can be up to (n x involved stages) - 1 instructions besides the branch (counting the stages from dispatch to EX), distributed over those stages.

The IW also doesn't have to be flushed; it can simply be invalidated. Instruction streams are only read and executed (self-modifying code aside), not written anywhere.

Did my explanation help? Maybe I'm just bad at it. ;)

I keep saying that it has to flush the instructions entered past the branch; I really don't see what you're trying to correct me with here. The point is that instructions that entered the instruction window (and are at varying stages in the rest of the pipeline) after the branch can still continue making progress if they're from a different thread. Seriously, I understand how branch prediction and CPU pipelining work; what exact problem do you have with the statements I'm making?

And yes, I understand that the instruction window doesn't have to be written back somewhere; you shouldn't get caught up on the fact that I used "flushed," because this isn't a cache and the same terminology doesn't apply. If you take the words too literally in their cache sense, then "invalidated" doesn't sit that well with me either, because for a cache it means the line won't cause a hit in the future, and the instruction window isn't "looked up" in the same way. If it's confusing, maybe we should just say that the instructions that came after the mispredicted branch in the same thread must be discarded?
 
Apr 30, 2015
131
10
81
Broadcom's upcoming Vulcan ARMv8 CPU has four hardware threads per core. http://www.broadcom.com/press/release.php?id=s797235

ARM have a paper on multi-threading:
http://community.arm.com/docs/DOC-2823
The basic message is that multi-threading can be efficient and effective with simple core designs, of which the Broadcom design mentioned above is presumably an example. For complex chips, such as Intel's x86 Xeons, validation becomes a nightmare and efficiency falls as more and more silicon lies idle; power gating also becomes increasingly complex with such cores. If this is true, then when ARM or their partners produce more powerful cores, they will be more efficient than corresponding Intel Xeon cores. Time will tell.
 

Yuriman

Diamond Member
Jun 25, 2004
5,530
141
106
ARM have a paper on multi-threading:
http://community.arm.com/docs/DOC-2823
The basic message is that multi-threading can be efficient and effective with simple core designs, of which the Broadcom design mentioned above is presumably an example. For complex chips, such as Intel's x86 Xeons, validation becomes a nightmare and efficiency falls as more and more silicon lies idle; power gating also becomes increasingly complex with such cores. If this is true, then when ARM or their partners produce more powerful cores, they will be more efficient than corresponding Intel Xeon cores. Time will tell.

Or perhaps, as ARM CPUs become more complex (approaching the complexity of Intel's), multithreading will become less efficient and more complex.