Why doesn't ARM implement SMT or CMT modules?


A5

Diamond Member
Jun 9, 2000
4,902
5
81
Smartphones don't need a huge number of ridiculously slow threads, but where MIPS is used there's a need for that.

Yep. MIPS is good if you need a bunch of slow cores in a small amount of space, which isn't really a factor in mobile.
 

Exophase

Diamond Member
Apr 19, 2012
4,439
9
81
SMT on the other hand does not help much with branch mispredicts, because the moment you are able to detect the mispredict you already have wasted time.

I think I /have/ been confusing misprediction penalties with cache miss penalties.

SMT should be able to help reduce the penalty of branch mispredictions too.

Normally, when you hit a misprediction everything ahead of the branch in the instruction window (inserted more recently) has to be flushed. But if the instructions come from another thread they don't have to be. This should average out to a roughly 50% reduction in thrown out work.

It doesn't help much if you hit another branch misprediction in the other thread shortly thereafter, but the same thing goes for cache misses. Fortunately, branch mispredictions and cache misses are often fairly spaced apart and don't correlate much across threads.
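
To make the accounting concrete, here is a toy sketch in Python (everything here is hypothetical for illustration: the 32-entry window, the two-thread round-robin interleave, and the flush_cost helper are made up, not taken from any real core):

from dataclasses import dataclass

@dataclass
class Entry:
    thread: int  # which SMT thread inserted this entry
    seq: int     # insertion order into the window

def flush_cost(window, branch_index):
    """Entries discarded when the branch at branch_index mispredicts."""
    branch = window[branch_index]
    younger = window[branch_index + 1:]
    st_cost = len(younger)  # single-threaded: every younger entry goes
    # SMT: only younger entries from the mispredicting thread go
    smt_cost = sum(1 for e in younger if e.thread == branch.thread)
    return st_cost, smt_cost

# 32-entry window, threads 0 and 1 alternating; thread 0 mispredicts at slot 8
window = [Entry(thread=i % 2, seq=i) for i in range(32)]
st, smt = flush_cost(window, branch_index=8)
print(f"single-thread flush: {st}, SMT flush: {smt}")  # 23 vs 11, roughly half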
 

beginner99

Diamond Member
Jun 2, 2009
5,315
1,760
136
Not true; even small cores can handle multiple threads. MIPS CPUs handle 4 threads per core. They are used in networking, where this is very useful. There's even an article about it on the home page.


For their use case it makes sense; it makes less sense for ARM, as they aren't the preferred solution in networking hardware, just as MIPS isn't the preferred solution for Android smartphones.

Initial Atoms used SMT because they were in-order cores. Then came Silvermont, which is an out-of-order core, and Intel removed it again. That's what the poster probably forgot. These MIPS cores might be in-order too, and it's obvious why SMT is beneficial in in-order cores, even if they are small.
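
As a rough illustration of why that is, here is a minimal Python sketch of a single-issue in-order core (the run helper and all latencies are invented; only the in-order/SMT contrast comes from the post): when one thread stalls, a second hardware thread keeps the issue slot busy.

def run(threads):
    """Issue at most one instruction per cycle from any ready thread."""
    pcs = [0] * len(threads)       # next instruction index per thread
    ready_at = [0] * len(threads)  # cycle at which each thread unblocks
    issued, cycle = 0, 0
    while any(pc < len(t) for pc, t in zip(pcs, threads)):
        for tid, t in enumerate(threads):
            if pcs[tid] < len(t) and ready_at[tid] <= cycle:
                ready_at[tid] = cycle + t[pcs[tid]]  # stall for its latency
                pcs[tid] += 1
                issued += 1
                break              # only one issue slot per cycle
        cycle += 1
    return cycle, issued

prog = [1, 3, 1, 1, 4, 1, 2, 1]    # made-up per-instruction latencies
c1, n1 = run([prog])               # one thread:  8 instrs in 14 cycles (~57%)
c2, n2 = run([prog, list(prog)])   # two threads: 16 instrs in 18 cycles (~89%)
print(c1, n1, c2, n2)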
 
Dec 30, 2004
12,553
2
76
SMT should be able to help reduce the penalty of branch mispredictions too.

Normally, when you hit a misprediction everything ahead of the branch in the instruction window (inserted more recently) has to be flushed. But if the instructions come from another thread they don't have to be. This should average out to a roughly 50% reduction in thrown out work.

It doesn't help much if you hit another branch misprediction in the other thread shortly thereafter, but the same thing goes for cache misses. Fortunately, branch mispredictions and cache misses are often fairly spaced apart and don't correlate much across threads.
so, kill the branch prediction, and use SMT instead
 
Dec 30, 2004
12,553
2
76
Initial Atoms used SMT because they were in-order cores. Then came Silvermont, which is an out-of-order core, and Intel removed it again. That's what the poster probably forgot. These MIPS cores might be in-order too, and it's obvious why SMT is beneficial in in-order cores, even if they are small.
IIRC they can implement HT for 10% of the die space needed to implement the OoOE
 

Thala

Golden Member
Nov 12, 2014
1,355
653
136
SMT should be able to help reduce the penalty of branch mispredictions too.

Normally, when you hit a misprediction everything ahead of the branch in the instruction window (inserted more recently) has to be flushed. But if the instructions come from another thread they don't have to be. This should average out to a roughly 50% reduction in thrown out work.

You first would have to partially decode before you can make the decision to switch threads. And even then you would not want to switch threads on every branch, speculating that the decoded branch target is a miss.

And then, since we are talking ARM, branch mispredicts are much less expensive than on, say, x86. On Silvermont we are talking 10 cycles, while on Cortex-A7 it is about 3.

In one of our early processor designs (15 years ago) we switched threads each clock cycle. With 8 threads we completely hid all branch and load delays. But this concept is not very typical for SMT and does not help much with cache misses.
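
That per-cycle rotation is essentially a barrel processor. A minimal sketch in Python (the 8-thread/8-cycle figures come from the post above; the barrel helper and cycle counts are invented):

def barrel(n_threads, worst_delay, num_cycles=10_000):
    """Strict round-robin, one thread per cycle; count wasted slots."""
    stalls = 0
    unblock = [0] * n_threads  # cycle at which each thread's result is ready
    for cycle in range(num_cycles):
        tid = cycle % n_threads          # pure rotation, no scheduling logic
        if unblock[tid] > cycle:
            stalls += 1                  # revisited before its delay expired
        else:
            unblock[tid] = cycle + worst_delay  # assume worst case every time
    return stalls

print(barrel(8, 8))  # -> 0: eight threads fully hide 8-cycle delays
print(barrel(4, 8))  # -> nonzero: too few threads to cover the delay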
 

Exophase

Diamond Member
Apr 19, 2012
4,439
9
81
You first would have to partially decode before you can make the decision to switch threads. And even then you would not want to switch threads on every branch, speculating that the decoded branch target is a miss.

I'm not talking about switching threads on branches here. The scenario I described above applies with a simple round-robin SMT that alternates fetches between two threads into separate prefetch queues (which is what Bulldozer does, for instance).

If 50% of the instructions ahead of the branch in the instruction window are from another thread, you don't have to flush them on a branch misprediction.
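
A rough sketch of that kind of front end in Python (the queue depths, PCs, and the mispredict helper are all hypothetical; only the alternate-fetch-into-per-thread-queues idea comes from the post):

from collections import deque

queues = [deque(), deque()]   # one prefetch queue per thread
fetch_pc = [0x1000, 0x2000]   # hypothetical fetch PCs per thread

def fetch_cycle(cycle):
    tid = cycle % 2           # simple round-robin fetch between the threads
    queues[tid].append(fetch_pc[tid])
    fetch_pc[tid] += 4

def mispredict(tid, correct_target):
    queues[tid].clear()       # discard only this thread's fetched work
    fetch_pc[tid] = correct_target

for cycle in range(8):
    fetch_cycle(cycle)
print([len(q) for q in queues])  # [4, 4]
mispredict(0, 0x1800)
print([len(q) for q in queues])  # [0, 4] -- the other thread keeps its work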

And then, since we are talking ARM, branch mispredicts are much less expensive than on, say, x86. On Silvermont we are talking 10 cycles, while on Cortex-A7 it is about 3.

I don't know where you got those numbers, but Cortex-A7 has an 8-stage pipeline, so I doubt it has such a low branch misprediction penalty. It also doesn't represent all ARM designs. The branch misprediction penalty on Cortex-A15, A57, and A72 is about 15 cycles. It's similar on Apple's Cyclone core.

In one of our early processor designs (15 years ago) we switched threads each clock cycle. With 8 threads we completely hid all branch and load delays. But this concept is not very typical for SMT and does not help much with cache misses.

Well yeah, if you're switching to threads that aren't even active (because they yielded or because they're stalled on a cache miss), that's not going to be as effective. But if you have two non-blocked threads available, alternating the very first part of the front end roughly 50/50 isn't too bad of an approach.
 

Dresdenboy

Golden Member
Jul 28, 2003
1,730
554
136
citavia.blog.de
I'm not talking about switching threads on branches here. The scenario I described above applies with a simple round-robin SMT that alternates fetches between two threads into separate prefetch queues (which is what Bulldozer does, for instance).

If 50% of the instructions ahead of the branch in the instruction window are from another thread, you don't have to flush them on a branch misprediction.
It's not just the IW, which might fill up if there are some long-latency things going on (cache misses, some FP ops...). But in a superscalar OoO pipeline there will be many instructions on the way to their EX/AGU/FP units, including the branch instruction, spread across stages like dec0, dec1, dec2, pack, dispatch, schedule, issue/rf, ALU/branch unit, writeback. 9 stages in a 4-wide design -> 35 other instructions wasted when the branch gets resolved. If there are some dependencies leading to a later branch resolution -> even more wasted (those instructions got pumped into our big buffers while we sorted out what to do - out of order ;)).
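
The slot arithmetic spelled out in Python (same assumed 9-stage, 4-wide numbers as above):

stages, width = 9, 4        # dispatch-to-EX stages in a 4-wide design
in_flight = stages * width  # 36 instruction slots in that stretch
wasted = in_flight - 1      # everything in flight except the branch itself
print(wasted)               # -> 35, the figure quoted above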
 

Thala

Golden Member
Nov 12, 2014
1,355
653
136
I'm not talking about switching threads on branches here. The scenario I described above applies with a simple round-robin SMT that alternates fetches between two threads into separate prefetch queues (which is what Bulldozer does, for instance).

If 50% of the instructions ahead of the branch in the instruction window are from another thread, you don't have to flush them on a branch misprediction.

Understood.

I don't know where you got those numbers, but Cortex-A7 has an 8-stage pipeline, so I doubt it has such a low branch misprediction penalty. It also doesn't represent all ARM designs. The branch misprediction penalty on Cortex-A15, A57, and A72 is about 15 cycles. It's similar on Apple's Cyclone core.

Disregard the quoted A7 number; it is most likely higher (though lower than on A8/A9/A15/A57).

Well yeah, if you're switching to threads that aren't even active (because they yielded or because they're stalled on a cache miss), that's not going to be as effective. But if you have two non-blocked threads available, alternating the very first part of the front end roughly 50/50 isn't too bad of an approach.

Well, in our particular design we had no caches (TCM/SRAM only) and no address translation. So every possible delay (load delay, branch miss delay, operand-use delay) was deterministic and less than 8 cycles. So with 8 threads we got rid of the issues completely :)
 

Idontcare

Elite Member
Oct 10, 1999
21,110
59
91
Not sure if it was mentioned already, but implementing SMT/CMP effectively comes at a cost in terms of chip design and validation time/expense.

There was a good reason Intel took so long to implement HT: the investment expense had to be worth it. For these ARM chip designs I would be surprised if they've reached the point where it makes sense to invest all the requisite resources in SMT design and validation versus just copy/pasting another core onto the die and connecting it with a ring bus or the like.
 

Exophase

Diamond Member
Apr 19, 2012
4,439
9
81
It's not just the IW, which might fill up if there are some long-latency things going on (cache misses, some FP ops...). But in a superscalar OoO pipeline there will be many instructions on the way to their EX/AGU/FP units, including the branch instruction, spread across stages like dec0, dec1, dec2, pack, dispatch, schedule, issue/rf, ALU/branch unit, writeback. 9 stages in a 4-wide design -> 35 other instructions wasted when the branch gets resolved. If there are some dependencies leading to a later branch resolution -> even more wasted (those instructions got pumped into our big buffers while we sorted out what to do - out of order ;)).

All of those instructions are in the instruction window (i.e., a ROB or similar) until retirement. I'm not talking about it filling up from long-latency instructions.

What I'm trying to say is that you don't have to flush the entire instruction window on a branch misprediction. Only the instructions that are "dependent" on the branch prediction, meaning the ones that entered the window after the branch in the same thread. You don't have to remove ones from another thread that entered afterwards. That, on average, roughly halves the number of instructions that must be flushed by the branch misprediction.
 
Dec 30, 2004
12,553
2
76
You first would have to partially decode before you can make the decision to switch threads. And even then you would not want to switch threads on every branch, speculating that the decoded branch target is a miss.

And then, since we are talking ARM, branch mispredicts are much less expensive than on, say, x86. On Silvermont we are talking 10 cycles, while on Cortex-A7 it is about 3.

In one of our early processor designs (15 years ago) we switched threads each clock cycle. With 8 threads we completely hid all branch and load delays. But this concept is not very typical for SMT and does not help much with cache misses.
wow, that's cool. what architecture?
 

Dresdenboy

Golden Member
Jul 28, 2003
1,730
554
136
citavia.blog.de
All of those instructions are in the instruction window (i.e., a ROB or similar) until retirement. I'm not talking about it filling up from long-latency instructions.

What I'm trying to say is that you don't have to flush the entire instruction window on a branch misprediction. Only the instructions that are "dependent" on the branch prediction, meaning the ones that entered the window after the branch in the same thread. You don't have to remove ones from another thread that entered afterwards. That, on average, roughly halves the number of instructions that must be flushed by the branch misprediction.
Of course, the instructions fill the IW, n per cycle. But they don't stay there and wait for the single branch to go through the n-wide pipeline alone. Ready instructions also enter the pipeline, so there can be up to (n x involved stages) - 1 instructions besides the branch (counting the stages from dispatch to EX), distributed over those stages.

The IW also doesn't have to be flushed; it can simply be invalidated. Instruction streams are only read and executed (self-modifying code aside), not written anywhere.

Did my explanation help? Maybe I'm just bad at it. ;)
 

Exophase

Diamond Member
Apr 19, 2012
4,439
9
81
Of course, the instructions fill the IW, n per cycle. But they don't stay there and wait for the single branch to go through the n-wide pipeline alone. Ready instructions also enter the pipeline, so there can be up to (n x involved stages) - 1 instructions besides the branch (counting the stages from dispatch to EX), distributed over those stages.

The IW also doesn't have to be flushed; it can simply be invalidated. Instruction streams are only read and executed (self-modifying code aside), not written anywhere.

Did my explanation help? Maybe I'm just bad at it. ;)

I keep saying that it has to flush the instructions entered past the branch; I really don't see what you're trying to correct me with here. The point is that instructions that entered the instruction window (and are at varying stages in the rest of the pipeline) after the branch can still continue making progress if they're from a different thread. Seriously, I understand how branch prediction and CPU pipelining work; what exact problem do you have with the statements I'm making?

And yes, I understand that the instruction window doesn't have to be written back somewhere; you shouldn't get caught up on the fact that I used "flushed," because this isn't a cache and the same terminology doesn't apply. If you take the words too literally in their cache sense, then "invalidated" doesn't sit that well with me either, because for a cache it means the line won't cause a hit in the future, and the instruction window isn't "looked up" in the same way. If it's confusing, maybe we should just say that the instructions that came after the mispredicted branch in the same thread must be discarded?
 
Apr 30, 2015
131
10
81
Broadcom's upcoming Vulcan ARMv8 CPU has four hardware threads per core. http://www.broadcom.com/press/release.php?id=s797235

ARM have a paper on multi-threading:
http://community.arm.com/docs/DOC-2823
The basic message is that multi-threading can be efficient and effective with simple core designs, of which the Broadcom design mentioned above is presumably an example. For complex chips, such as Intel's x86 Xeons, validation becomes a nightmare and efficiency falls as more and more silicon lies idle; power gating also becomes increasingly complex with such cores. If this is true, then when ARM or their partners produce more powerful cores, they will be more efficient than corresponding Intel Xeon cores. Time will tell.
 

Yuriman

Diamond Member
Jun 25, 2004
5,530
141
106
ARM have a paper on multi-threading:
http://community.arm.com/docs/DOC-2823
The basic message is that multi-threading can be efficient and effective with simple core designs, of which the Broadcom design mentioned above is presumably an example. For complex chips, such as Intel's x86 Xeons, validation becomes a nightmare and efficiency falls as more and more silicon lies idle; power gating also becomes increasingly complex with such cores. If this is true, then when ARM or their partners produce more powerful cores, they will be more efficient than corresponding Intel Xeon cores. Time will tell.

Or perhaps, as ARM CPUs become more complex (approaching the complexity of Intel's), multithreading will become less efficient and more complex.