Why doesn't ARM implement SMT or CMT modules?

CluelessOne

Member
Jun 19, 2015
76
49
91
As far as I know, ARM CPUs don't implement anything like Hyper-Threading or a CMT module. Is there something in their design philosophy that prevents it?

Instead they go for big.LITTLE clusters.

In my opinion it would be better if they did something like a big-medium-little module approach: a module with one core optimized for a 100 MHz clock, one for around 400 MHz, and another for 1 GHz, with a NEON unit attached to the module. So tri-core plus SIMD in a module.

For a phone, the slowest core should be enough to manage radio/wireless monitoring, human UI input, button presses, etc. The other two would handle the OS and programs. But what do I know, that's just me.

So is there any reason why ARM doesn't go the SMT route?
 

sm625

Diamond Member
May 6, 2011
8,172
137
106
Their cores are just too small to make it worth it. Even Intel moved away from it on their small cores. Apple is more likely to do SMT, since they are actually being smart about it and making their cores wider. The wider the core and the more execution units it has, the more sense SMT makes.

I think big.LITTLE is completely retarded the way they have it implemented. There is no reason to have more than one LITTLE core.
 

dark zero

Platinum Member
Jun 2, 2015
2,655
140
106
Only Apple and nVIDIA could go SMT, due to their bigger cores.

Meanwhile, MediaTek is trying to transform big.LITTLE into a kind of CMT, but much better than AMD's.
 

Lepton87

Platinum Member
Jul 28, 2009
2,544
9
81
Only Apple and nVIDIA could go SMT, due to their bigger cores.

Meanwhile, MediaTek is trying to transform big.LITTLE into a kind of CMT, but much better than AMD's.

Not true. Even small cores can handle multiple threads; MIPS CPUs handle four threads per core. They are used in networking, where it is very useful. There's even an article about it on the home page.

http://www.anandtech.com/show/8457/mips-strikes-back-64bit-warrior-i6400-architecture-arrives/3

[SMT diagram from the linked AnandTech I6400 article]


For their use case it makes sense; it makes less sense for ARM, since ARM isn't the preferred solution in networking hardware, just as MIPS isn't the preferred solution for Android smartphones.
 

CluelessOne

Member
Jun 19, 2015
76
49
91
Not true. Even small cores can handle multiple threads; MIPS CPUs handle four threads per core. They are used in networking, where it is very useful.

Snip

For their use case it makes sense; it makes less sense for ARM, since ARM isn't the preferred solution in networking hardware, just as MIPS isn't the preferred solution for Android smartphones.
So why does it make less sense for ARM in smartphones? Or is it an Android limitation?
 

Exophase

Diamond Member
Apr 19, 2012
4,439
9
81
Most ARM SoCs have at least four cores now; it makes too little sense to add more hardware threads on top of that. Too little phone and tablet software would benefit.

I expect that was also the biggest driver for Silvermont dropping SMT. It was a huge benefit for the Bonnell and Saltwell cores: although they were narrow, they were in-order, so SMT gained a lot from filling in stalls from cache misses and scheduling conflicts. But those were never deployed in more than dual-core configurations.

CMT-style resource sharing, like a shared NEON unit, is undesirable because it adds latency and complexity to a design, especially when they'd want this to be configurable for different customers. It also adds more limitations in power management: you end up having to keep multiple cores in the cluster powered at the same time, or adding more power domains to try to split apart the shared and unshared resources. The individual NEON units per CPU aren't really big enough to be worth trying to share.

Even more, they are on 4 way SMT.

Yes, but for extremely parallel tasks. Most of the time it's running a completely different type of software. It's the same reason it works well to have dozens of cores and very wide SIMD.

And MIPS is a very underrated arch who has great potential there

I don't think it's underrated; if anything, I think people overrate how underrated it is. Every time I hear people talk about it, it's wondering why no one uses these amazing CPUs, when they don't seem especially amazing to me. It's not enough that they're competitive; they're going to have to offer a lot of advantage to get SoC makers and OEMs to switch, because there's a lot of risk and overhead in doing so. If Intel has struggled even with their design expertise, big process advantage, and practically giving the SoCs away, it's not hard to see why it would be very challenging for MIPS to carve out any market share. Software compatibility does actually matter: MIPS entered the game on Android too late, and too many important apps and middleware don't support it on Android.
 
Dec 30, 2004
12,553
2
76
They have a pretty interesting feature (I might be referring to the wrong name here) called 'conditional execution', which permits execution of instructions past a 'flow-control point' (an if-statement) on an 'I'll tell you later whether to commit these results' basis. When you reach a branch whose outcome isn't known ahead of time, the processor can continue executing as if it had perfectly predicted the branch, and then toss the results of that code if it guessed wrong. Between this and compiler instruction re-ordering, the compiler can fairly efficiently manage conditionals that would otherwise cause a stall, i.e. the wasted execution resources that HT alleviates.

I just realized I haven't fully made the connection between this and 'being multi-threaded'. I seem to recall this is one of the things that's done in lieu of HT, to negate the need for it: if you can immediately jump to the end of the flow-control statement and keep executing once you learn you didn't need to be executing it in the first place, you--

OK, this is the connecting point: the compiler shifts forward the instructions that come after the if-statement, the ones that will be needed no matter which way the branch goes, so the CPU can execute them while the condition is still being computed. Sometimes there aren't enough of those 'no matter what' computations, and that's when the compiler says 'start executing the conditional body anyway, and if it turns out we didn't enter the if-statement, I'll tell you before commit to dump those results'. Combined with the likely() and unlikely() C macros, it's gotten to the point where there are limited gains left for HT, because a sufficiently efficient alternative has already been implemented. This is, for example, one reason why Intel's x86 Atoms are only marginally competitive with ARM for SFF devices, and that's with the process node advantage.
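To show the kind of hint I mean, here's a minimal sketch using GCC's __builtin_expect, which is how the likely()/unlikely() macros are usually defined; the function itself is made up for illustration:

[CODE]
#include <stddef.h>

/* likely()/unlikely() as usually defined on GCC/Clang: a hint about
   which way a branch is expected to go, so the compiler can lay the
   hot path out as straight-line fall-through code and move the cold
   path out of line. */
#define likely(x)   __builtin_expect(!!(x), 1)
#define unlikely(x) __builtin_expect(!!(x), 0)

/* Made-up example: out-of-range values are assumed to be rare, so the
   loop body the CPU actually runs is the branch-friendly hot path. */
long sum_in_range(const int *vals, size_t n, int limit)
{
    long sum = 0;
    for (size_t i = 0; i < n; i++) {
        if (unlikely(vals[i] >= limit))
            continue;            /* cold path, laid out out of line */
        sum += vals[i];          /* hot path, falls through */
    }
    return sum;
}
[/CODE]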

Also, I think that re-ordering helps with memory access/cache miss stalls too. That's the main connecting point I'm missing (the part I can't recall or comment on): everything I just described doesn't quite address memory access stalls.
 

Exophase

Diamond Member
Apr 19, 2012
4,439
9
81
They have a pretty interesting feature (I might be referring to the wrong name here) called 'conditional execution'...

Snip

I think you're confusing a few different technologies here...

The ability to speculatively execute instructions past a branch before the branch's direction is known is called branch prediction, and it has been used in processors for a long time. That's pretty much what you're describing. This feature doesn't need to be expressed in the instruction set at all; the CPU can and does do it completely transparently, and it doesn't use SMT to accomplish it.
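You can even see it from software without any ISA support. Here's a rough sketch (my own illustration, nothing ARM-specific): the same loop runs over the same data, and only the data order changes, so any timing difference comes from the hardware predictor. One caveat: an optimizing compiler may turn this exact branch into a conditional move (which is predication territory), so check the generated code or lower the optimization level.

[CODE]
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

static int cmp_int(const void *a, const void *b)
{
    return *(const int *)a - *(const int *)b;
}

/* The branch under test: taken for roughly half of random byte values. */
static long count_big(const int *v, int n)
{
    long hits = 0;
    for (int i = 0; i < n; i++)
        if (v[i] >= 128)
            hits++;
    return hits;
}

int main(void)
{
    enum { N = 1 << 20, REPS = 100 };
    int *v = malloc(N * sizeof *v);
    for (int i = 0; i < N; i++)
        v[i] = rand() % 256;

    /* Unsorted: branch direction is effectively random, so the
       predictor fails often and the loop runs slowly. */
    clock_t t0 = clock();
    long h = 0;
    for (int r = 0; r < REPS; r++)
        h += count_big(v, N);
    printf("unsorted: %ld hits, %.2fs\n",
           h, (double)(clock() - t0) / CLOCKS_PER_SEC);

    /* Sorted: same instructions, same data, new order. The branch now
       falls in two long predictable runs, so the same loop typically
       runs several times faster. Exact numbers vary by CPU. */
    qsort(v, N, sizeof *v, cmp_int);
    t0 = clock();
    h = 0;
    for (int r = 0; r < REPS; r++)
        h += count_big(v, N);
    printf("sorted:   %ld hits, %.2fs\n",
           h, (double)(clock() - t0) / CLOCKS_PER_SEC);

    free(v);
    return 0;
}
[/CODE]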

ARM has had conditional execution for most instructions, also known as predication. It makes individual instructions get nullified if a condition isn't true, so short branches can be converted into branchless code. Most predication was removed from AArch64 in ARMv8 because it wasn't worth keeping. It was only really helpful for very poorly predictable branches, and even then not always, because it creates new dependencies that can inhibit the ability to reorder code, and it adds complexity to the critical path of the processor. Maybe worst of all, giving most instructions the ability to be predicated wastes too many bits in the instruction encoding that could have been used for other things (like more registers). Even before AArch64, ARM had started recommending that predication be used only in very limited scenarios on their newer 32-bit processors.
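To make the tradeoff concrete, here's roughly what predication buys, in C terms. Compilers typically lower ternaries like these to conditional selects (CSEL on AArch64, predicated moves on 32-bit ARM, CMOV on x86), though whether they do is up to the compiler:

[CODE]
/* Branchy version: a hard-to-predict condition pays a mispredict
   penalty every time the predictor guesses wrong. */
int clamp_branchy(int x, int lo, int hi)
{
    if (x < lo)
        return lo;
    if (x > hi)
        return hi;
    return x;
}

/* Branchless version: the same logic expressed as selects, so there is
   no branch to mispredict. The cost is that both inputs are always
   evaluated and a new data dependency is created, which is why this
   only pays off for short, poorly predictable branches. */
int clamp_branchless(int x, int lo, int hi)
{
    int t = (x < lo) ? lo : x;
    return (t > hi) ? hi : t;
}
[/CODE]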

Branch prediction and predication don't really negate the benefits of SMT. While SMT could be used to hide the latency of branches where prediction isn't available (AFAIK some GPUs work this way), most CPUs with SMT also have branch prediction.

OoOE, on the other hand, can negate some of the benefits of SMT by allowing better exploitation of parallelism within the same thread instead of parallelism between multiple threads. So in some cases SMT is more effective on in-order processors, especially wider ones. But for a lot of code it's beneficial on out-of-order processors too, because not enough parallelism can be found in a single thread. It can also help reduce the branch misprediction penalty, by splitting the instructions that would have been queued past a branch for one thread into a smaller number queued for each of two threads (assuming the other thread doesn't end up discarding its instructions too).

Something you may be thinking of is speculative multithreading, or SpMT. Here, a separate thread is used to speculatively execute multiple pieces of code in parallel, for example following both sides of a branch. As far as I know, SpMT hasn't been implemented in any commercial processor, although it was part of Andy Glew's vision for CMT long before Steamroller came out.
 

lamedude

Golden Member
Jan 14, 2011
1,214
19
81
I don't think it's underrated; if anything, I think people overrate how underrated it is. Every time I hear people talk about it, it's wondering why no one uses these amazing CPUs, when they don't seem especially amazing to me.
Mario 64 and FF7 are great, therefore MIPS is great. And don't you tell me those games are overrated. :) And if a new 6502 (6581664?) appeared tomorrow, I'd say it was underrated too. ;)
 

Zodiark1593

Platinum Member
Oct 21, 2012
2,230
4
81
Mario 64 and FF7 are great, therefore MIPS is great. And don't you tell me those games are overrated. :) And if a new 6502 (6581664?) appeared tomorrow, I'd say it was underrated too. ;)

And the PS2 and PSP are both MIPS-based. That doesn't mean MIPS has the resources to go toe to toe with ARM now.
 
Dec 30, 2004
12,553
2
76
I think you're confusing a few different technologies here...

Snip
Branch prediction is something very different from what I'm talking about; branch predication is what I was thinking of.
 

Thala

Golden Member
Nov 12, 2014
1,355
653
136
Branch prediction is something very different from what I'm talking about; branch predication is what I was thinking of.

Predication and conditional execution are essentially the same thing. AArch64 adds some powerful conditional select operations while removing other conditional operations, in order to save instruction set encoding space. An if-else statement converted to predication has, among the things already mentioned here, the advantage that it does not pollute the BTB.
Note that this has nothing to do with the re-ordering of basic blocks when compiler hints like likely() are used.
In addition, predication is orthogonal to SMT.

Finally, SMT for low-power processors is not the best idea, because many expensive resources (flip-flops) have to be doubled up, increasing leakage/power and first-level cache pressure. Essentially, perf/watt decreases.
However, if the plan is to develop a desktop-class ARM CPU, where absolute performance is more important than perf/watt, SMT might be a good option.
 

Borealis7

Platinum Member
Oct 19, 2006
2,901
205
106
Imagine CMT on GPUs, where the increase in performance is almost linear with the increase in the number of "cores". AMD has already done module-level resource sharing (a shared FPU and front end) in its Bulldozer architecture, and GPU SPs are essentially 3-dimensional "vector spinners" which require 3 (or 4 or 5) integer operations each step.

I'm no chip designer, but that sounds like a good idea to me.
 
Dec 30, 2004
12,553
2
76
Predication and conditional execution are essentially the same thing. AArch64 adds some powerful conditional select operations...

Snip

yes

can you describe what you mean by 'orthogonal to SMT'?

I'm saying that, with compiler optimization, it's good enough at keeping the CPU busy that adding SMT wouldn't gain much performance.

the italics part: I'm going off either a professor's or an ARM presenter's side note in a university course; this is what I recall them saying. There was also something about 'because we are and have always been RISC, we can more effectively utilize predication in compiling for the architecture'.
 

Thala

Golden Member
Nov 12, 2014
1,355
653
136
can you describe what you mean by 'orthogonal to SMT'?

By 'orthogonal' I mean that one thing does not impact the other. With predication you get fewer branch mispredict penalties, less pollution of the BTB, and possibly better cache utilization and code density.
SMT, on the other hand, does not help much with branch mispredicts, because by the moment you are able to detect the mispredict you have already wasted the time. A cache miss, however, is detected immediately, and gives SMT the opportunity to switch to another HW thread on the same core while the memory subsystem serves the miss.
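A rough way to see that effect from software; this is only an illustrative sketch (sizes are arbitrary, thread pinning is OS-specific and omitted, and results vary by machine). Each load depends on the previous one and almost always misses cache, so one thread alone mostly stalls; two such threads on the two HW threads of an SMT core typically finish in well under twice the single-thread time.

[CODE]
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

/* Pointer chasing through a random permutation much larger than the
   caches: every load depends on the previous one, so the core mostly
   sits stalled on memory. That stall is what an SMT sibling can fill. */

enum { NODES = 1 << 23 };          /* ~64 MB of size_t entries */
#define STEPS (4UL * NODES)

static size_t *ring;

static void *chase(void *arg)
{
    size_t i = (size_t)arg;        /* starting position in the ring */
    for (unsigned long s = 0; s < STEPS; s++)
        i = ring[i];               /* a chain of serialized cache misses */
    return (void *)i;              /* defeats dead-code elimination */
}

static double now(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return ts.tv_sec + ts.tv_nsec * 1e-9;
}

int main(void)
{
    /* Build one big random cycle (Sattolo's algorithm) so the walk
       visits every node before repeating. */
    ring = malloc(NODES * sizeof *ring);
    for (size_t i = 0; i < NODES; i++)
        ring[i] = i;
    for (size_t i = NODES - 1; i > 0; i--) {
        size_t j = rand() % i;
        size_t t = ring[i]; ring[i] = ring[j]; ring[j] = t;
    }

    /* One chaser alone... */
    double t0 = now();
    chase((void *)0);
    printf("1 thread:  %.2fs\n", now() - t0);

    /* ...then two in parallel. Pinned to the two hardware threads of
       one SMT core, the pair typically takes well under 2x the
       single-thread time, because each thread executes while the other
       waits on a miss. */
    pthread_t a, b;
    t0 = now();
    pthread_create(&a, NULL, chase, (void *)0);
    pthread_create(&b, NULL, chase, (void *)(size_t)(NODES / 2));
    pthread_join(a, NULL);
    pthread_join(b, NULL);
    printf("2 threads: %.2fs\n", now() - t0);

    free(ring);
    return 0;
}
[/CODE]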

I'm saying that, with compiler optimization, it's good enough at keeping the CPU busy that adding SMT wouldn't gain much performance.


Not really. The compiler can only optimize based on static knowledge, that is, information known at compile time. The purpose of SMT, in contrast, is to take care of dynamic effects like cache misses. The compiler has not the slightest clue when a cache miss will happen, nor does it have any means to hide the miss penalty. It can be argued, though, that the compiler can to some extent reduce the average cache miss rate, for instance by "clever" basic block re-ordering and data placement.
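For the data placement part, one textbook example (my own sketch; the structs and field names are made up for illustration) is hot/cold structure splitting, so that a scan touches far fewer cache lines:

[CODE]
#include <stddef.h>

/* Everything in one struct: scanning just 'key' still drags the cold
   description bytes through the cache, so each 64-byte line holds only
   about half a record and the miss rate goes up. */
struct record_mixed {
    int  key;                 /* hot: read on every lookup */
    char description[120];    /* cold: read once in a blue moon */
};

/* Split layout: the hot keys are packed densely (16 per 64-byte line),
   so the same scan touches far fewer lines. The compiler still can't
   know *when* a miss will happen, but this placement statically lowers
   how often one can. */
struct record_hot {
    int key;                  /* hot array, packed */
};
struct record_cold {
    char description[120];    /* cold array, indexed in parallel */
};

static int find_key(const struct record_hot *recs, size_t n, int key)
{
    for (size_t i = 0; i < n; i++)
        if (recs[i].key == key)
            return (int)i;
    return -1;
}
[/CODE]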
 

Exophase

Diamond Member
Apr 19, 2012
4,439
9
81
Branch prediction is something very different from what I'm talking about; branch predication is what I was thinking of.

I figured as much, but your description here:

They have a pretty interesting feature (I might be referring to the wrong name here) called 'conditional execution', which permits execution of instructions past a 'flow-control point' (an if-statement) on an 'I'll tell you later whether to commit these results' basis. When you reach a branch whose outcome isn't known ahead of time, the processor can continue executing as if it had perfectly predicted the branch, and then toss the results of that code if it guessed wrong.

Pretty much perfectly applies to branch prediction.
 
Dec 30, 2004
12,553
2
76
I always considered them different because prediction is 'we think this is the result, based on our branch prediction history' whereas predication is 'it doesn't matter, we'll calculate anyway and only commit if we're told to'. I might be getting rusty, or I might have swallowed marketing material where ARM expressed differentiation from HT for no good reason.
 
Dec 30, 2004
12,553
2
76
By 'orthogonal' I mean that one thing does not impact the other. With predication you get fewer branch mispredict penalties, less pollution of the BTB, and possibly better cache utilization and code density.

Snip

Not really. The compiler can only optimize based on static knowledge, that is, information known at compile time.

Snip

I believe, if I'm remembering right, that I disagree with the first, and agree with the second with a caveat: static optimization is more effective when predication is available. What I'm saying is that the 'toolbox of predication' permits a different form of compiler optimization, one that favors branch predication, in a way that negates many of the benefits of HT. It doesn't negate them for cache misses, but since misses only happen 3-7% of the time...

edit: I think I /have/ been confusing misprediction penalties with cache miss penalties.
 

DrMrLordX

Lifer
Apr 27, 2000
22,526
12,398
136
You know, with the much-ballyhooed push for ARM to enter the server space, you'd think SMT would be highly desirable. I don't think I've seen it on any of the server ARM chips put forth by the ARMy, on paper or in silicon.