ARM instruction set, predication and OoO execution

May 11, 2008
20,139
1,149
126
I was reading about predication in the ARM instruction set and was wondering how out-of-order execution is implemented for it.
The thing is, for instructions to execute out of order, they cannot depend on each other's results.
I read that the 64-bit ARM instruction set no longer has predication bits. I am wondering if it has anything to do with the idea that predication might be a limiting factor for OoO execution.
With most instruction sets only the branches are conditional, but with the ARM instruction set almost any instruction can be made conditional, and this is often used.

For example, in this snippet registers r1 and r2 are tested for zero, and each CMP updates the Z flag.
I wonder how this would work in an OoO CPU.

Code:
1 SUB     r14, r14, #4
2 STMFD   r13!, {r0-r7,r14}
3 LDR     r0, =timer_address
4 LDMIA   r0!, {r1-r10}
5 CMP     r1, #0
6 SUBNE   r1, r1, #1
7 CMP     r2, #0
8 SUBNE   r2, r2, #1

How would the scheduler logic know that instructions 5 and 6 are paired, that instructions 7 and 8 are paired, and that both pairs can be executed in parallel?

Any thoughts?

http://en.wikipedia.org/wiki/ARM_architecture#Conditional_execution
 

Schmide

Diamond Member
Mar 7, 2002
5,595
730
126
I would assume the flags register is considered a dependency. Because instructions 5-8 all use it, they would have to execute in order.
 

Nothingness

Diamond Member
Jul 3, 2013
3,029
1,970
136
Yes, flags are considered as a dependency, and can be renamed too, so nothing prevents (5 6) from running in parallel with (7 8).
 

Tuna-Fish

Golden Member
Mar 4, 2011
1,474
1,964
136
Grossly simplifying, the algorithm for modern OoO is:

The frontend maintains a rename table. On arrival, each result of an instruction is renamed to a new hardware register, the mapping is updated in the rename table, and the new register is marked "not ready". The sources of the instruction are renamed to whatever hardware register currently matches their architectural names.

Then instructions are passed onto the scheduler. Each cycle, it looks up which instructions have their operands set "ready", and sends the oldest ones that have all their operands ready to the execution units.

When an instruction completes and writes its result, the register is marked ready.


In your example, both CMPs depend on the LDMIA. As the load takes a relatively long time even from L1, all the instructions will be waiting in the schedulers when it finishes. The instant r1 and r2 become ready, the CMPs can execute in parallel. As they finish, their renamed results flags1 and flags2 become available on the same cycle, and the SUBNEs can execute in parallel.
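
The rename-and-schedule steps above can be sketched as a toy Python model. Everything in it is an assumption made for illustration: the names `rename`, `schedule` and `latency`, the 4-cycle load and 1-cycle ALU latencies, the unlimited issue width, the shortened LDMIA register list, and the treatment of the flags as one renameable register called "flags". It is not a description of any real ARM core.

```python
from itertools import count

# (text, destination registers, source registers). "flags" is treated as one
# renameable register; the LDMIA register list is shortened to r1-r2.
PROGRAM = [
    ("LDMIA r0!, {r1-r2}", ["r0", "r1", "r2"], ["r0"]),
    ("CMP   r1, #0",       ["flags"],          ["r1"]),
    ("SUBNE r1, r1, #1",   ["r1"],             ["r1", "flags"]),
    ("CMP   r2, #0",       ["flags"],          ["r2"]),
    ("SUBNE r2, r2, #1",   ["r2"],             ["r2", "flags"]),
]

def rename(program):
    """Walk the program in order and map architectural to physical registers."""
    fresh = count()
    table = {}        # architectural name -> current physical register
    live_in = set()   # physical registers holding values from before the snippet
    renamed = []
    for text, dests, srcs in program:
        phys_srcs = []
        for s in srcs:                      # sources read the current mapping
            if s not in table:
                table[s] = f"p{next(fresh)}"
                live_in.add(table[s])
            phys_srcs.append(table[s])
        phys_dests = []
        for d in dests:                     # every result gets a new register
            table[d] = f"p{next(fresh)}"
            phys_dests.append(table[d])
        renamed.append((text, phys_dests, phys_srcs))
    return renamed, live_in

def latency(text):
    """Assumed latencies: 4 cycles for the load, 1 cycle for everything else."""
    return 4 if text.startswith("LDM") else 1

def schedule(renamed, live_in):
    """Return the issue cycle of each instruction, in program order."""
    ready = set(live_in)
    waiting = list(enumerate(renamed))
    in_flight = []                          # (finish cycle, physical dests)
    issue_cycle = [None] * len(renamed)
    cycle = 0
    while waiting or in_flight:
        for finish, dests in in_flight:     # results finishing now become ready
            if finish <= cycle:
                ready.update(dests)
        in_flight = [(f, d) for f, d in in_flight if f > cycle]
        still_waiting = []
        for idx, (text, dests, srcs) in waiting:   # oldest first
            if all(s in ready for s in srcs):
                issue_cycle[idx] = cycle
                in_flight.append((cycle + latency(text), dests))
            else:
                still_waiting.append((idx, (text, dests, srcs)))
        waiting = still_waiting
        cycle += 1
    return issue_cycle

renamed, live_in = rename(PROGRAM)
issue_cycle = schedule(renamed, live_in)
for (text, dests, srcs), cyc in zip(renamed, issue_cycle):
    print(f"cycle {cyc}: {text:20} writes {dests} reads {srcs}")
```

With these assumed latencies the two CMPs issue in the same cycle and the two SUBNEs in the next one, because each CMP writes its own physical copy of the flags and each SUBNE reads only the copy produced by its own CMP.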
 
May 11, 2008
20,139
1,149
126
Thanks all for the reply.

I still wonder, though, whether the predication bits, together with the ability of any instruction to be conditional, prevent the most efficient use of a branch predictor and the OoO logic. I mean in comparison with CPU architectures that only have conditional branches.
I wonder if that is the reason ARM stopped making every instruction conditional with the 64-bit implementation of ARM.
ARM has also abandoned it with the Thumb-2 instruction set. With the exception of the IT (If-Then) instruction, which allows up to four following instructions to be conditional, only the branches are conditional.

EDIT:

Maybe it is patent related, I do not know.
 

Exophase

Diamond Member
Apr 19, 2012
4,439
9
81
AArch64 still has a modest amount of conditional execution via its conditional select instruction. It has modifiers that can increment or invert (take the one's complement of) one of the inputs. With the zero register this allows for conditionally setting a register to 0, 1, or -1, and conditional increment can be useful for things like conditionally copying parts of one array to another. In general, conditional select can be used to merge the outcomes of two branches that are executed unconditionally (assuming that there are no other side effects), and the instruction itself is more general and powerful than the conditional moves that are common in other ISAs. The extra registers in AArch64 help to make select merging more palatable.
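
As a rough illustration of those select variants, here is a small Python model of the data flow of CSEL, CSINC and CSINV. The Python function names and the `copy_nonzero` helper are made up for this sketch, registers are modelled as unsigned 64-bit integers, and the condition is passed in as a plain boolean instead of being read from the flags.

```python
MASK64 = (1 << 64) - 1   # model 64-bit X registers as unsigned Python ints
XZR = 0                  # the zero register always reads as 0

def csel(cond, rn, rm):
    """CSEL: rn if the condition holds, else rm."""
    return rn if cond else rm

def csinc(cond, rn, rm):
    """CSINC: rn if the condition holds, else rm + 1."""
    return rn if cond else (rm + 1) & MASK64

def csinv(cond, rn, rm):
    """CSINV: rn if the condition holds, else the one's complement of rm."""
    return rn if cond else ~rm & MASK64

def cset(cond):
    """CSET is an alias of CSINC Rd, XZR, XZR with the condition inverted."""
    return csinc(not cond, XZR, XZR)

def csetm(cond):
    """CSETM is an alias of CSINV Rd, XZR, XZR with the condition inverted."""
    return csinv(not cond, XZR, XZR)

def copy_nonzero(src):
    """Branch-free filtered copy: store always, advance the index conditionally."""
    dst = [0] * len(src)
    n = 0
    for x in src:
        dst[n] = x                  # unconditional store
        n = csinc(x == 0, n, n)     # keep n when x == 0, otherwise n + 1
    return dst[:n]

# The CMP/SUBNE pair from the first post, merged with a select. In A64 this
# could look like: CMP x1, #0 ; SUB x9, x1, #1 ; CSEL x1, x9, x1, NE
r1 = 5
r1 = csel(r1 != 0, r1 - 1, r1)

print(cset(True), cset(False), csetm(True) == MASK64)
print(copy_nonzero([3, 0, 5, 0, 7]), r1)
```

The subtraction in the last example is executed unconditionally and only the select decides which value survives, which is the "merge two unconditionally executed outcomes" pattern described above.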

Like you say, because of the increased dependencies, predication is less useful than it used to be. But it's not totally useless on OoO CPUs, especially for short one-armed bodies guarded by a poorly predicted condition. ARM has given some guidelines for when and when not to use it on Cortex-A9, A15, A57, and A72.

I wouldn't say they dropped predication from Thumb-2 either. You can think of IT as more of a prefix than an instruction. Some CPUs incorporate the IT early in the front end so it doesn't require execution resources. The way Thumb-2 works, they couldn't have had per-instruction predication even if they had wanted to, not without making the instruction set much more limited compared to ARM. They had to pay an encoding cost upfront to keep it compatible with original Thumb and to allow variable-length encoding at all.

The biggest issue with ARM instructions having predication in every instruction is the wasted opcode bits. Combined with the decreased utility, it just wasn't worth keeping it around in that form any longer, but ARM still distilled the best functionality of it into a modest section of the opcode space. The only thing I wish they still had was conditional store (a masked store for SIMD would be even better), but this can be expensive and problematic to implement efficiently.
 
May 11, 2008
20,139
1,149
126
Interesting. That makes sense. I would like to know more about AArch64, so I am going to read up on it at the ARM Information Center.

I have always liked the auto-increment and auto-decrement features of some of the instructions in the ARM instruction set.

There is one thing I have difficulty remembering and finding.
I have not been coding in ARM assembly for some time now.
And although I can read it, I forgot about this:

Code:
STMFD r13!,{r0-r12,r14}	// Save registers to SVC_stack.

Do you know what the exclamation mark is for again?
EDIT: I found it again. If I am correct, it is the writeback flag: the base register is updated with the new address.
After the store, r13 holds the decremented address, because STMFD is a full descending stack: r13 = r13 - (14 * 4) (14 registers, each 4 bytes wide).
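
To make the writeback concrete, here is a minimal Python sketch of STMFD and LDMFD on a full descending stack (STMFD is the same operation as STMDB, LDMFD the same as LDMIA). The `stmfd` and `ldmfd` helpers, the dictionary used as memory, the start address 0x1000 and the register values are all invented for the example.

```python
def stmfd(mem, sp, regs, writeback=True):
    """Model STMFD sp{!}, {regs}; regs is a list of (register number, value)."""
    regs = sorted(regs)                  # lowest register number -> lowest address
    base = sp - 4 * len(regs)            # full descending: decrement before storing
    for i, (_, value) in enumerate(regs):
        mem[base + 4 * i] = value
    return base if writeback else sp     # '!' writes the new address back to sp

def ldmfd(mem, sp, reg_numbers, writeback=True):
    """Model LDMFD sp{!}, {regs}; returns the loaded values and the new sp."""
    regs = sorted(reg_numbers)
    values = {r: mem[sp + 4 * i] for i, r in enumerate(regs)}
    new_sp = sp + 4 * len(regs)          # increment after loading
    return values, (new_sp if writeback else sp)

mem = {}
numbers = list(range(13)) + [14]              # r0-r12 and r14: 14 registers
saved = [(n, 100 + n) for n in numbers]       # dummy register contents

sp_after_push = stmfd(mem, 0x1000, saved)     # STMFD r13!, {r0-r12,r14}
restored, sp_after_pop = ldmfd(mem, sp_after_push, numbers)

print(hex(sp_after_push), hex(sp_after_pop), restored[14])
```

Without the exclamation mark (`writeback=False` in this model) the same memory locations are written, but r13 keeps its old value, so a later push would overwrite the saved registers.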

End EDIT

It was somewhere in my 900-page PDF document. :(
The same goes for the "^": I forgot what it was for.

Code:
LDMFD r13!,{pc}^	// Restore lr value to pc and return.

I thought it was about restoring the Processor Status Register?
I am not sure.
 

Exophase

Diamond Member
Apr 19, 2012
4,439
9
81
The ^ means one of two things:

- If it's an ldm and the PC is in the register list, it's an exception return. So in addition to the block memory load it restores the appropriate spsr to cpsr which handles the appropriate mode change.
- If it's an stm or PC is not in the register list, it means to use the user/system mode registers instead of the banked special mode registers. This applies to lr and sp and if you're in FIQ mode also some of the other registers.

Both forms are undefined if you're in user/system mode.
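
The two cases above can be written down as a small decision function. This is only a restatement of the rules in this post as Python; the function name, the mode strings and the returned labels are invented for the sketch.

```python
def caret_meaning(is_load, pc_in_list, mode):
    """Classify an LDM/STM with a trailing '^' following the two cases above."""
    if mode in ("usr", "sys"):
        return "undefined"                # neither form may be used here
    if is_load and pc_in_list:
        return "exception return"         # also copies the mode's SPSR to CPSR
    return "user-mode registers"          # transfers the user/system register bank

# LDMFD r13!, {pc}^ executed in supervisor mode:
print(caret_meaning(is_load=True, pc_in_list=True, mode="svc"))
```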
 
May 11, 2008
20,139
1,149
126
Thank you.

I will look through the little assembly code that I have to check whether I used it correctly. To be honest, everything works and I have not run into any issues, but better to be safe than sorry.