Fun Strlen retrieving 4 bytes at once.

William Gaatjes · Jun 16, 2011

After doing this, i realized one issue :
This kind of reverse engineering to optimize code only works with my version GNU GCC 4.4.2. It is very likely that if i use a newer version or a compiler from another vendor, that this optimization may not produce the self code.

The last version i made (have not posted it yet, but it checks for a NULL pointer, a pointer to a NULL string and for word aligned strings. Easy to implement, it is just checking the pointer it self.

Thus i will start to experiment with optimizing the code in assembly while respecting the calling convention C. But even this will only work if the calling convention does not change. But that is a matter of placing clear comment about what calling convention is used and what kind of compiler if problems would arise. It is not that necessary, it is more a practice, a skill that i would like to learn just in case i might need it one day for work related projects. As for hobby material, it is always handy to know these kind of background information.

William Gaatjes · Jun 16, 2011

Of course inline assembly is a possibility too.

I could add a function to W-ARM to convert assembly code to the inline gcc format. Then the compiler takes care of any calling convention issues.
I think i will take this path. As it also will allow more readable code.

Schmide · Jun 16, 2011

William Gaatjes said:
After doing this, i realized one issue :
This kind of reverse engineering to optimize code only works with my version GNU GCC 4.4.2. It is very likely that if i use a newer version or a compiler from another vendor, that this optimization may not produce the self code.

These days you can trust the compiler a fair amount.

As you can see

This code

Code:

DWORD TestStrLen32V2(DWORD *ui_pntr)
{
    if(((DWORD)ui_pntr & 0x00000003) != 0)
    {
      return 0;
    }
  
    int ui_cntr = 0;
  
    do
    {
      if((*ui_pntr & 0x000000FF) == 0)
      {
        goto L1;
      }
      if((*ui_pntr & 0x0000FF00) == 0)
      {
        goto L2;
      }
      if((*ui_pntr & 0x00FF0000) == 0)
      {
        goto L3;
      }
      if((*ui_pntr & 0xFF000000) == 0)
      {
        goto L4;
      }
      ui_pntr++;
      ui_cntr+=4;
    } while (ui_cntr < 1024);
    goto L1;

L4: ui_cntr++;
L3: ui_cntr++;
L2: ui_cntr++;
L1: return (ui_cntr);
}

Produces this x86 asm in release. Almost perfect. (I did rename the labels of the disassembly)

Code:

     xor         eax,eax 
     test        dl,3 
     jne         L1 
TLP: mov         ecx,dword ptr [edx] 
     test        cl,cl 
     je          L1 
     test        ecx,0FF00h 
     je          L2 
     test        ecx,0FF0000h 
     je          L3 
     test        ecx,0FF000000h 
     je          L4 
     add         eax,4 
     add         edx,4 
     cmp         eax,400h 
     jl          TLP
     ret
              
L4:  inc         eax  
L3:  inc         eax  
L2:  inc         eax  
L1:  ret

Note: You should be using a do while loop as the first check is unnecessary. Although the compiler often figures this out it may not always be the case and you should give it the best code for the task.

William Gaatjes · Jun 16, 2011

I messed around a bit but i was not satisfied. On my work in break time i tried it some more and found an almost perfect version with one exception. By placing a label differently i could dump 2 instructions (8 bytes in flash).

I settled for this hand optimized version which is based on the version from work and where i removed 2 not useful instructions. If i am not mistaken it should work.
I added an extra check for a null pointer as well.

It should be an almost copy of your original code.

Hand optimized :

Code:

uint32_t StrLen32(uint32_t *ui_pntr)
{
      subs  r2, r0, #0  
      beq .LB5  
      ands  r0, r2, #3  
      bne .LB5  

.LP1: ldr r3, [r2, r0]  
      tst r3, #-16777216
      beq .LB1
      tst r3, #16711680
      beq .LB2
      tst r3, #65280
      beq .LB3        
      tst r3, #255  
      beq .LB4  
      
      add r0, r0, #4  
      cmp r0, #1024 
      bne .LP1
      bx  lr

.LB4: add r0, r0, #1  
.LB3: add r0, r0, #1
.LB2: add r0, r0, #1  
.LB1: bx  lr  
 
.LB5: mov r0, #0  
.LB6: bx  lr
}

Produced by the compiler after trying. -O2

Code:

      subs  r2, r0, #0  
      beq .L11  
      ands  r0, r2, #3  
      beq .L16  
      b .L11  
.L20: tst r3, #65280
      beq .L14  
      tst r3, #255  
      beq .L15  
      add r0, r0, #4  
      cmp r0, #1024 
      beq .L19  
.L16: ldr r3, [r2, r0]  
      tst r3, #-16777216
      bxeq  lr  
      tst r3, #16711680 
      bne .L20  
.L13: add r0, r0, #1
      bx  lr  
      
.L11: mov r0, #0  
      bx  lr
        
.L15: add r0, r0, #1  
.L14: add r0, r0, #1
      add r0, r0, #1  
      bx  lr  
.L19: bx  lr

William Gaatjes · Jun 16, 2011

I am not 100% assured it will always work but this is the idea :

Inline assembly in GNU GCC.

The last use of the register tricks the compiler not to issue a warning of having no return value.
I think it is better to suppress the warning for this function and just use r0 instead of %[reg0]. But i have to figure out how first.

Code:

uint32_t StrLen32V3(uint32_t *ui_pntr)
{

register uint32_t    ui_cntr asm("r0");

asm volatile(
    "       subs  r2, r0, #0        \r\n"
    "       beq .LB5                \r\n"
    "       ands  r0, r2, #3        \r\n"  
    "       bne .LB5                \r\n"
    ".LP1:  ldr r3, [r2, r0]        \r\n" 
    "       tst r3, #-16777216      \r\n"
    "       beq .LB1                \r\n"
    "       tst r3, #16711680       \r\n"
    "       beq .LB2                \r\n"
    "       tst r3, #65280          \r\n"
    "       beq .LB3                \r\n"
    "       tst r3, #255            \r\n"
    "       beq .LB4                \r\n"
    "       add r0, r0, #4          \r\n"
    "       cmp r0, #1024           \r\n"
    "       bne .LP1                \r\n"
    "       bx  lr                  \r\n"     
    ".LB4:  add r0, r0, #1          \r\n"
    ".LB3:  add r0, r0, #1          \r\n"
    ".LB2:  add r0, r0, #1          \r\n"
    ".LB1:  bx  lr                  \r\n"  
    
    ".LB5:  mov %[reg0], #0         \r\n" : [reg0]"=r" (ui_cntr));
return(ui_cntr);

}

EDIT:
I found a solution to force the use of register r0 for ui_cntr.

Instead of declaring :

uint32_t ui_cntr, do register uint32_t ui_cntr asm("r0");

William Gaatjes · Jun 17, 2011

Sigh... I have tested it and i thought it works but it does not.

I have 4 test strings of 9, 10, 12 and 48 characters long before the NULL termination appears.
Normal StrLen result :
9.
10.
12.
48.

That is good.

StrLen32 result :
8.
8.
12.
48.

It looks as if the last word is not even retrieved in the first 2 tests with StrLen32.
I looked around in the code but i cannot find anything wrong with it.
I have my data abort handler also working but that is not popping up either.
(Which is normal because everything is word aligned.)

I do not understand why the code refuses to check and count the last word.
There is no code in the listing that would cause such behavior. I have turned all optimization off. It makes no sense at the moment.
I have tried different versions that are posted here, all have the same issue.

Most interesting but i am puzzled...

My Screendump :

StrLen test.
Amount of characters in sz_test is : 9.
Amount of characters in sz_test2 is : 10.
Amount of characters in sz_test3 is : 12.
Amount of characters in sz_bl_startup is : 48.
StrLen32 test.
Amount of characters in sz_test is : 8.
Amount of characters in sz_test2 is : 8.
Amount of characters in sz_test3 is : 12.
Amount of characters in sz_bl_startup is : 48.

William Gaatjes · Jun 18, 2011

I can still not explain it. I thought i misinterpreted the endian format of the mcu but i did not interpret it wrong. The MCU is Big Endian.

The first character of the string is placed at the lowest memory address and then continues with increasing memory addresses until the end of the string is reached. Strange.

http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.ddi0234b/i45903.html

William Gaatjes · Jun 18, 2011

This is most surprising :
As stubborn as i am and not having an accepting mentality, i assumed the processor was in fact a little endian model, even though the web site claims(that is what i understand of it) it is big endian. I figure it like this : There cannot be a flaw the processor, i would have experienced it much earlier. I can imagine when the string is placed in memory, that NULL padding is done until a word boundary is reached. This would explain for the results and the behavior.

Thus i took the last inline assembler version again and modified to look at the least significant byte first. Now it works.

But i still have to retrace what is going wrong. Or my information is wrong, or i misinterpreted it...

Code:

uint32_t StrLen32(uint32_t *ui_pntr)
{

register uint32_t ui_cntr asm("r0"); 

asm volatile(
    "         subs  r2, r0, #0          \r\n" // Do a NULL pointer check.
    "         beq .ENDLB5               \r\n" // Jump if ui_pntr = 0.
    "         ands  r0, r2, #0x00000003 \r\n" // Check if BIT0 and BIT are both 0. 
    "         bne .ENDLB5               \r\n" // If not, jump.
    
    ".LPLBL1: ldr r3, [r2, r0]          \r\n" // r3 = memory address pointed at by r2 + r0.
    "         tst r3, #0x000000FF       \r\n" // tst r3 & 0xFF000000 (original).
    "         beq .JLABL1               \r\n" // jump if Z is set.
    "         tst r3, #0x0000FF00       \r\n" // tst r3 & 0x00FF0000 (original).
    "         beq .JLABL2               \r\n" // jump if Z is set.
    "         tst r3, #0x00FF0000       \r\n" // tst r3 & 0x0000FF00 (original).
    "         beq .JLABL3               \r\n" // jump if Z is set.
    "         tst r3, #0xFF000000       \r\n" // tst r3 & 0x000000FF (original).
    "         beq .JLABL4               \r\n" // jump if Z is set.
    "         add r0, r0, #4            \r\n" // r0 = r0 + 4.
    "         cmp r0, #0x0000FF00       \r\n" // r0 - 0x0000FF00. Max string length = 65280 characters.
    "         bne .LPLBL1               \r\n" // jump if Z is cleared.
    "         bx  lr                    \r\n" // return.    

    ".JLABL4: add r0, r0, #1            \r\n" // r0++.
    ".JLABL3: add r0, r0, #1            \r\n" // r0++.
    ".JLABL2: add r0, r0, #1            \r\n" // r0++.
    ".JLABL1: bx  lr                    \r\n" // return.  
    
    ".ENDLB5: mov &#37;[reg0], #0           \r\n" : [reg0]"=r" (ui_cntr));
return(ui_cntr);
}

StrLen test.
Amount of characters in sz_test is : 9.
Amount of characters in sz_test2 is : 10.
Amount of characters in sz_test3 is : 12.
Amount of characters in sz_bl_startup is : 48.
StrLen32 test.
Amount of characters in sz_test is : 9.
Amount of characters in sz_test2 is : 10.
Amount of characters in sz_test3 is : 12.
Amount of characters in sz_bl_startup is : 48.

Sleep really helps...

William Gaatjes · Jun 18, 2011

I finally can explain it : ARM7TDMI processors are bi -endian. Meaning both formats can work. The reason why this can work is because it is a full risc core and as a programmer, you have to take care of everything yourself. The linker file generator that i have made, sets the endian mode of the output format of the code to little endian, thus the core works in little endian mode. Mystery solved. I think i will have to change the linker tab on W-ARM to clearly mention the endian format.

EDIT:
I even found out more when looking in the datasheet of my SAM7S.

18.3 Functional Description
The Memory Controller handles the internal ASB bus and arbitrates the accesses of both
masters.
It is made up of:
• A bus arbiter
• An address decoder
• An abort status
• A misalignment detector
• An Embedded Flash Controller
The MC handles only little-endian mode accesses. The masters work in little-endian mode only.

That is an enforcement that the ARM7TDMI core can only function in little endian format in the SAM7S. Because otherwise the memory controller would have to be augmented to support both endian modes. The memory controller is very simple but IMHO better then what the LPC2x(ARM7TDMI) range from nxp offer.

William Gaatjes · Aug 17, 2011

My latest version :

This is a modified version (GCC 4.4.2).
I just tested it, i seems to work, but i am tired. I can have missed something.
StrLen32 returns 0xFFFFFFF0 with a NULL pointer.
StrLen32 returns 0xFFFFFFFF with a non word aligned address value (Lowest digit of address does not equal : 0x0,0x4,0x8 or 0xC.
StrLen32 returns 0x0000FF00 with a stringsize greater then 65280 characters.

Screendump of terminal program.

StrLen test.
Amount of characters in sz_test is : 9.
Amount of characters in sz_test2 is : 10.
Amount of characters in sz_test3 is : 12.
Amount of characters in sz_bl_startup is : 48.
StrLen32 test.
Amount of characters in sz_test is : 9.
Amount of characters in sz_test2 is : 10.
Amount of characters in sz_test3 is : 12.
Amount of characters in sz_bl_startup is : 48.
StrLen32 test 2, argument = 0.
returnvalue is : FF FF FF F0.
StrLen32 test 3, argument = 3.
returnvalue is : FF FF FF FF.

Here is the code :
I am sorry if it is a bit messed up. I use TAB widths of 2 spaces. The forum software is confused of my use of tabs.

Code:

// *************************************************************************************************
// Counts the amount of characters with a maximum of 65280 in a "NULL" terminated string.
// Returns amount of characters in string.
// This is an higly optimized version that retrieves 4 bytes at once by doing a word access.
// Thus the string must be word aligned. 
// W-ARM generates output files in the little endian format by default.
// This function expects the little endian format to work. 
// When the argument is a NULL pointer, return 0xFFFFFFF0.
// When the value of argument is a not aligned memory address, StrLen32 will return 0xFFFFFFFF.
//
// This function issues a warning because the compiler is not aware of the assembler code.
// I do not know yet how to solve this.
//
uint32_t StrLen32( uint32_t *ui_pntr)
{
register uint32_t	ui_cntr asm("r0"); 

asm volatile(
		"	        subs  r1, r0, #0  		\r\n"	// Do a NULL pointer check.
		"		beq .ENDLB6					\r\n"	// Jump if ui_pntr = 0.
		"		ands  r0, r1, #0x00000003	\r\n" // Check if BIT0 and BIT1 are both 0. 
		"		bne .ENDLB5 			\r\n" // If not, jump.
		
		".LPLBL1:	ldr r2, [r1, r0]					\r\n" // r3 = memory address pointed at by r2 + r0.
		"		tst r2, #0x000000FF				\r\n"	// tst.
		"		beq .JLABL1								\r\n" // jump if Z is set.
		"		tst r2, #0x0000FF00				\r\n" // tst.
		"		beq .JLABL2								\r\n"	// jump if Z is set.
		"		tst r2, #0x00FF0000				\r\n"	// tst.
		"		beq .JLABL3      					\r\n"	// jump if Z is set.
		"		tst r2, #0xFF000000				\r\n"	// tst.
		"		beq .JLABL4								\r\n"	// jump if Z is set.
		"		add r0, r0, #4  					\r\n"	// r0 = r0 + 4.
		"		cmp r0, #0x0000FF00				\r\n"	// r0 - 0x0000FF00. Max string length = 65280 characters.
		"		bne .LPLBL1								\r\n" // jump if Z is cleared.
		"		bx  lr										\r\n"	// return.		

		".JLABL4:	add r0, r0, #1  					\r\n" // r0++.
		".JLABL3:	add r0, r0, #1						\r\n"	// r0++.
		".JLABL2:	add r0, r0, #1  					\r\n"	// r0++.
		".JLABL1:	bx  lr  									\r\n" // return.  
		
		".ENDLB5:	mvn r0, #0								\r\n"	// create the value 0xFFFFFFFF
		"					bx lr											\r\n"	// return.
		".ENDLB6:	mvn r1, #0								\r\n"	// create the value 0xFFFFFFFF
		"					sub &#37;[reg0],r1,#15				\r\n" : [reg0]"=r" (ui_cntr));
																							// If the Z flag is not set, subtract 15. Thus return 0XFFFFFFF0
return(ui_cntr);
}

Darkstone · Aug 17, 2011

So i was thinking... If you have a dword, woudn't it be faster to check if there are 'null' characters on all 4 chars at once?

Code:

DWORD *ptr = whatever string pointer;
DWORD temp = ~ptr;
DWORD temp &= ( temp & 0xAAAAAAAA ) >> 1;  //result = 0x0x0x0x times 4 (binary)
DWORD temp &= ( temp & 0x88888888 ) >> 2;  //result = 000x000x times 4
DWORD temp &= ( temp & 0x10101010 ) >> 4;  //result = 0000000x times 4
// now if temp is nonzero, at least one 'null' character is present at *ptr
// possible value of temp: 0x00000100 for *ptr = 'a' 'b' '\0' 'd'

i realize this is a hell lot of more code, it works O(2log(n)) instand of O(n) for every 4-byte you check. (http://en.wikipedia.org/wiki/Big_O_notation) So it works faster when you add more bytes. It is untested however...

:whiste:

William Gaatjes · Aug 18, 2011

Darkstone said:
So i was thinking... If you have a dword, woudn't it be faster to check if there are 'null' characters on all 4 chars at once?

Code:

DWORD *ptr = whatever string pointer; DWORD temp = ~ptr; DWORD temp &= ( temp & 0xAAAAAAAA ) >> 1; //result = 0x0x0x0x times 4 (binary) DWORD temp &= ( temp & 0x88888888 ) >> 2; //result = 000x000x times 4 DWORD temp &= ( temp & 0x10101010 ) >> 4; //result = 0000000x times 4 // now if temp is nonzero, at least one 'null' character is present at *ptr // possible value of temp: 0x00000100 for *ptr = 'a' 'b' '\0' 'd'

i realize this is a hell lot of more code, it works O(2log(n)) instand of O(n) for every 4-byte you check. (http://en.wikipedia.org/wiki/Big_O_notation) So it works faster when you add more bytes. It is untested however...

:whiste:

It is true that a word wide simultaneous check would be better.
But i have difficulty grasping when it is translated to assembler it would win in execution time. I mean, at the assembler level, you still have to do checking for the result. I think what Schmide created as an example is faster and already fully optimized.

There is only one way to speed this testing up :
It would be faster and handy if there was such a SIMD instruction.
For instance, doing a 4 NOR operation on the 4 bytes at once. Thus word wide at once.

For a little endian version :

NOR 1 = 8 input, bit0 to bit 7.
NOR 2 = 8 input, bit8 to bit 15.
NOR 3 = 8 input, bit16 to bit 23.
NOR 4 = 8 input, bit24 to bit 31.

Then we also need a result register where :
NOR 1 sets BIT0, clears BIT1 and clears BIT2.
NOR 2 sets BIT1, clears BIT0 and clears BIT2.
NOR 3 sets BIT0, sets BIT1 and clears BIT2.
NOR 4 sets BIT2, clears BIT0 and clears BIT1.

Also there has to be priority domination. NOR1 has the highest priority, then NOR2, then NOR3 and finally NOR4.

When the result register is 0, just add 4.
Then when the result register is not 0, it will hold the counted value representing the position where the first 0x00 was found. Just add the result for the final string length value. In separate assembler instructions this is much slower.

Darkstone · Aug 18, 2011

On a side note, on the Z80 you had an instruction just for this: CPIR. It would compare (HL) to A, and kept doing that untel a match was found or BC reached zero.

I ended up never using it, but it would be usefull right now.

William Gaatjes · Aug 19, 2011

Indeed, there are a lot instructions in general that can do string manipulations. However, i use the ARMv4T. ARM is RISC so it does not have a lot of instructions.

CISC-RISC hybrids or pure CISC architectures usually have these kind of instructions. ARM allows to add your own instructions, some of the SOC(system on a chip) developers add their own proprietary instructions to the ARM core.

x86 has a lot of these instructions.

On a sidenote, for those interested.
NOR gate :
http://en.wikipedia.org/wiki/NOR_logic

Handy material.

Truth Table
A B Q
0 0 1
0 1 0
1 0 0
1 1 0

William Gaatjes · Dec 15, 2011

I learned some new tricks and i found a way to remove the compiler warning by making clever use the optimizer to remove useless code :

Code:

[COLOR="Blue"]uint32_t[/COLOR] StrLen32( [COLOR="Blue"]uint32_t[/COLOR] *ui_pntr)
{
[COLOR="Blue"]register uint32_t[/COLOR]	ui_cntr [COLOR="Blue"]asm[/COLOR]("r0"); 

	ui_pntr = ui_pntr;  [COLOR="SeaGreen"]// This is a hack to circumvent the "unused variable" warning.
											// The optimizer will remove it.[/COLOR]
	
[COLOR="Blue"]asm volatile[/COLOR](
		"					subs  r1, r0, #0  				\r\n"	// Do a NULL pointer check.
		"					beq .ENDLB6								\r\n"	// Jump if ui_pntr = 0.
		"					ands  r0, r1, #0x00000003	\r\n" // Check if BIT0 and BIT1 are both 0. 
		"					bne .ENDLB5 							\r\n" // If not, jump.
		
		".LPLBL1:	ldr r2, [r1, r0]					\r\n" // r3 = memory address pointed at by r2 + r0.
		"					tst r2, #0x000000FF				\r\n"	// tst.
		"					beq .JLABL1								\r\n" // jump if Z is set.
		"					tst r2, #0x0000FF00				\r\n" // tst.
		"					beq .JLABL2								\r\n"	// jump if Z is set.
		"					tst r2, #0x00FF0000				\r\n"	// tst.
		"					beq .JLABL3      					\r\n"	// jump if Z is set.
		"					tst r2, #0xFF000000				\r\n"	// tst.
		"					beq .JLABL4								\r\n"	// jump if Z is set.
		"					add r0, r0, #4  					\r\n"	// r0 = r0 + 4.
		"					cmp r0, #0x0000FF00				\r\n"	// r0 - 0x0000FF00. Max string length = 65280 characters.
		"					bne .LPLBL1								\r\n" // jump if Z is cleared.
		"					bx  lr										\r\n"	// return.		

		".JLABL4:	add r0, r0, #1  					\r\n" // r0++.
		".JLABL3:	add r0, r0, #1						\r\n"	// r0++.
		".JLABL2:	add r0, r0, #1  					\r\n"	// r0++.
		".JLABL1:	bx  lr  									\r\n" // return.  
		
		".ENDLB5:	mvn r0, #0								\r\n"	// create the value 0xFFFFFFFF
		"					bx lr											\r\n"	// return.
		".ENDLB6:	mvn r1, #0								\r\n"	// create the value 0xFFFFFFFF
		"					sub %[reg0],r1,#15				\r\n" : [reg0]"=r" (ui_cntr));
																							// If the Z flag is not set, subtract 15. Thus return 0XFFFFFFF0
[COLOR="Blue"]return[/COLOR](ui_cntr);
}

I added a line of c code :

Code:

ui_pntr = ui_pntr;

This removes the warning. But since it does nothing, the optimizer removes the instruction : mov r0,r0 since it is of no use in the code.
At least that is what i assume and the list file seems to confirm this.

Now finally the compiler warning is gone. I am happy. ^_^

Schmide, thank you. I added your name honorably in the readme text for the StrLen32 function.:thumbsup:

William Gaatjes · Dec 24, 2017

It has been a while since i played with this function again.
And well, it always bothered me that the strings to be tested had to be word aligned.
So, i thought that it would be good to test if word aligned then do word aligned memory access.
If not word aligned, do byte access until word aligned. Then start doing word aligned access and doing 4 tests in registers only.
This version is universal and will always work.

And i just made this better version of the original code :

Code:

uint32_t StrLen32(char *pntr)
{
register uint32_t    cntr asm("r0");

    pntr = pntr;              // This is a hack to circumvent the "unused variable" warning.
                                            // The optimizer will remove it.
 
asm volatile(
        "                    subs  r1, r0, #0                  \r\n"    // Do a NULL pointer check.
        "                    beq .ENDLB6                                \r\n"    // Jump if pntr = 0.
        "                    ands  r0, r1, #0x00000003    \r\n" // Check if BIT0 and BIT1 are both 0.
        "                    beq .LPLBL1                                 \r\n" // If so , go word aligned testing. If not, go on with byte access untill word aligned.
     
        "         mov r0, #0                                \r\n" // Load R0 wih 0.
        ".LPBA1:  ldrb r2, [r1], #1                    \r\n" // Do a byte access, load r2 with value pointed by r1 and post increment r1.
        "         ands r2, #0xFF                        \r\n" // Check if zero.  
        "         beq .JLABL1                                \r\n" // Jump if zero.
        "                    add r0, r0, #1                         \r\n" // r0++.
        "                    ands  r2, r1, #0x00000003    \r\n" // Check if BIT0 and BIT1 of r1 are both 0.
        "                    beq .LPLBL1                                 \r\n" // If so , go word aligned testing. If not, go on with byte access untill word aligned.
        "                    b .LPBA1                                    \r\n"
     
        ".LPLBL1:    ldr r2, [r1, r0]                    \r\n" // r2 = loaded with data from memory address pointed at by r1 + r0.
        "                    tst r2, #0x000000FF                \r\n"    // tst.
        "                    beq .JLABL1                                \r\n" // jump if Z is set.
        "                    tst r2, #0x0000FF00                \r\n" // tst.
        "                    beq .JLABL2                                \r\n"    // jump if Z is set.
        "                    tst r2, #0x00FF0000                \r\n"    // tst.
        "                    beq .JLABL3                          \r\n"    // jump if Z is set.
        "                    tst r2, #0xFF000000                \r\n"    // tst.
        "                    beq .JLABL4                                \r\n"    // jump if Z is set.
        "                    add r0, r0, #4                      \r\n"    // r0 = r0 + 4.
        "                    cmp r0, #0x0000FF00                \r\n"    // r0 - 0x0000FF00. Max string length = 65280 characters.
        "                    bne .LPLBL1                                \r\n" // jump if Z is cleared.
        "                    bx  lr                                        \r\n"    // return.    

        ".JLABL4:    add r0, r0, #1                      \r\n" // r0++.
        ".JLABL3:    add r0, r0, #1                        \r\n"    // r0++.
        ".JLABL2:    add r0, r0, #1                      \r\n"    // r0++.
        ".JLABL1:    bx  lr                                      \r\n" // return.
     
        ".ENDLB5:    mvn r0, #0                                \r\n"    // create the value 0xFFFFFFFF
        "                    bx lr                                            \r\n"    // return.
        ".ENDLB6:    mvn r1, #0                                \r\n"    // create the value 0xFFFFFFFF
        "                    sub %[reg0],r1,#15                \r\n" : [reg0]"=r" (cntr));
                                                                                            // If the Z flag is not set, subtract 15. Thus return 0XFFFFFFF0
return(cntr);
}

I tested it with setting a pin high and low and sampling that pin with my saleae logic analyzer.
On the screen i can measure the amount of time each counting of characters costs.
Real world performance is different than i expected. But this StrLen32 is at least 2 times as fast when the amount to count characters is at least 10.
I tested with lower amount of characters and this version is always a bit faster, between 1.2x and 2.1x than my standard strlen c version while running on the sam7s256 i have with the test code below.
This version is less deterministic because it all depends on what address the string starts and i need to test some more and see if i can remove redundant code.

My test code :

Code:

void Test_StrLen32(void)
{
    const char    string1[] = "1";
    const char    string2[] = "12";
    const char    string3[] = "123";
    const char    string4[] = "1234567890";
    const char    string5[] = "12345678901234567890";
    const char    string6[] = "123456789012345678901234567890";
 
    uint32_t        result[12];
 
 
    AT91C_BASE_PIOA->PIO_SODR = LED2;
    result[0] = StrLen32((char*)string1);
    AT91C_BASE_PIOA->PIO_CODR = LED2;
 
    AT91C_BASE_PIOA->PIO_SODR = LED2;
    result[1] = StrLen((char*)string1);
    AT91C_BASE_PIOA->PIO_CODR = LED2;

 
    AT91C_BASE_PIOA->PIO_SODR = LED2;
    result[2] = StrLen32((char*)string2);
    AT91C_BASE_PIOA->PIO_CODR = LED2;

    AT91C_BASE_PIOA->PIO_SODR = LED2;
    result[3] = StrLen((char*)string2);
    AT91C_BASE_PIOA->PIO_CODR = LED2;


    AT91C_BASE_PIOA->PIO_SODR = LED2;
    result[4] = StrLen32((char*)string3);
    AT91C_BASE_PIOA->PIO_CODR = LED2;
 
    AT91C_BASE_PIOA->PIO_SODR = LED2;
    result[5] = StrLen((char*)string3);
    AT91C_BASE_PIOA->PIO_CODR = LED2;
 

    AT91C_BASE_PIOA->PIO_SODR = LED2;
    result[6] = StrLen32((char*)string4);
    AT91C_BASE_PIOA->PIO_CODR = LED2;
 
    AT91C_BASE_PIOA->PIO_SODR = LED2;
    result[7] = StrLen((char*)string4);
    AT91C_BASE_PIOA->PIO_CODR = LED2;


    AT91C_BASE_PIOA->PIO_SODR = LED2;
    result[8] = StrLen32((char*)string5);
    AT91C_BASE_PIOA->PIO_CODR = LED2;
     
    AT91C_BASE_PIOA->PIO_SODR = LED2;
    result[9] = StrLen((char*)string5);
    AT91C_BASE_PIOA->PIO_CODR = LED2;


    AT91C_BASE_PIOA->PIO_SODR = LED2;
    result[10] = StrLen32((char*)string6);
    AT91C_BASE_PIOA->PIO_CODR = LED2;
     
    AT91C_BASE_PIOA->PIO_SODR = LED2;
    result[11] = StrLen((char*)string6);
    AT91C_BASE_PIOA->PIO_CODR = LED2;
 
    Printf("strlen32 result string1 = %d \r\n",result[0]);
    Printf("strlen result string1 = %d \r\n",result[1]);
    Printf("strlen32 result string2 = %d \r\n",result[2]);
    Printf("strlen result string2 = %d \r\n",result[3]);
    Printf("strlen32 result string3 = %d \r\n",result[4]);
    Printf("strlen result string3 = %d \r\n",result[5]);
    Printf("strlen32 result string4 = %d \r\n",result[6]);
    Printf("strlen result string4 = %d \r\n",result[7]);    
    Printf("strlen32 result string5 = %d \r\n",result[8]);
    Printf("strlen result string5 = %d \r\n",result[9]);
    Printf("strlen32 result string6 = %d \r\n",result[10]);
    Printf("strlen result string6 = %d \r\n",result[11]);
 
}

My results printed at the terminal:

strlen32 result string1 = 1
strlen result string1 = 1
strlen32 result string2 = 2
strlen result string2 = 2
strlen32 result string3 = 3
strlen result string3 = 3
strlen32 result string4 = 10
strlen result string4 = 10
strlen32 result string5 = 20
strlen result string5 = 20
strlen32 result string6 = 30
strlen result string6 = 30

Schmide · Dec 24, 2017

You could make it a lot faster using a neon instruction.

Code:

       ".LPLBL1:    ldr r2, [r1, r0]                    \r\n" // r2 = loaded with data from memory address pointed at by r1 + r0.

// check 4 bytes 

       "                    vceq.i8        r3, r2,  #0                \r\n"    //  do a 4 byte compare
       "                    tst                r3, #0                      \r\n"    // any zero will fill r3 with 0xff 
       "                    beq              .LPLBL1                 \r\n"    // so if they are all not zero r3 will eq 0

// drop to byte check if one is zero

       "                    tst r2, #0x000000FF                \r\n"    // tst.
       "                    beq .JLABL1                                \r\n" // jump if Z is set.

I haven't done this assembly in forever but I think the above is correct.

William Gaatjes · Dec 24, 2017

Schmide said:

You could make it a lot faster using a neon instruction.

Code:

       ".LPLBL1:    ldr r2, [r1, r0]                    \r\n" // r2 = loaded with data from memory address pointed at by r1 + r0.

// check 4 bytes

       "                    vceq.i8        r3, r2,  #0                \r\n"    //  do a 4 byte compare
       "                    tst                r3, #0                      \r\n"    // any zero will fill r3 with 0xff
       "                    beq              .LPLBL1                 \r\n"    // so if they are all not zero r3 will eq 0

// drop to byte check if one is zero

       "                    tst r2, #0x000000FF                \r\n"    // tst.
       "                    beq .JLABL1                                \r\n" // jump if Z is set.

I haven't done this assembly in forever but I think the above is correct.

Hi Schmide, long time no see.

Well, i have no neon instructions on my arm7tdmi microcontroller.
But such a version could be used for the raspberry pi.

Schmide · Dec 24, 2017

I wonder if a highly parallel routine like this would be faster.

Code:

       "                    lsr        r6, r2,  #24                \r\n"    // logical shift high byte
       "                    and        r5, r2,  #0x00FF0000        \r\n"    // mask others
       "                    and        r4, r2,  #0x0000FF00        \r\n" 
       "                    and        r3, r2,  #0x000000FF        \r\n" 
     
       "                    rsb        r6, r6,  #0                 \r\n"    // sub 0 no carry
       "                    rsb        r5, r5,  #0                 \r\n" 
       "                    rsb        r4, r4,  #0                 \r\n" 
       "                    rsb        r3, r3,  #0                 \r\n" 

       "                    asr        r6, r6,  #32                \r\n"    // arithmetic rotate sign bit into dword
       "                    asr        r5, r5,  #32                \r\n" 
       "                    asr        r4, r4,  #32                \r\n" 
       "                    asr        r3, r3,  #32                \r\n" 

       "                    and        r6, r6,  #0xFF000000        \r\n"    // mask values
       "                    and        r5, r5,  #0x00FF0000        \r\n" 
       "                    and        r4, r4,  #0x0000FF00        \r\n" 
       "                    and        r3, r3,  #0x000000FF        \r\n" 

       "                    orr        r6, r6,  r5                 \r\n"    // combine values
       "                    orr        r4, r4,  r3                 \r\n" 
       "                    orr        r6, r6,  r4                 \r\n" 
     
       "                    tst        r6, #0xFFFFFFFF               \r\n"    // test values
       "                    beq        .LPLBL1                     \r\n"

Just being a spaz.

William Gaatjes · Dec 24, 2017

I need some time to digest that. It has been years since i did some assembly and i really have to read up on all the details of arm assembly instructions again.
I have not counted how much cycles it costs, but i was thinking with the old code see if i can do a tst, or cmp and do a barrel shift at the same time to move the 0xFF around the bits of the 32bit value.
It is just an idea, not sure it will work at all. Barrel shift is pretty powerfull.

From my book :
ARM System Developers Guide-Designing and Optimizing System Software

Data processing instructions are processed within the arithmetic logic unit (ALU).
A unique and powerful feature of the ARM processor is the ability to shift the 32-bit binary
pattern in one of the source registers left or right by a specific number of positions before
it enters the ALU. This shift increases the power and flexibility of many data processing
operations.
There are data processing instructions that do not use the barrel shift, for example,
the MUL (multiply), CLZ (count leading zeros), and QADD (signed saturated 32-bit add)
instructions.
Pre-processing or shift occurs within the cycle time of the instruction. This is particularly
useful for loading constants into a register and achieving fast multiplies or division by
a power of 2.

It is really a big pill to swallow.

edit:
I digested it, how do you take into account non aligned data access ?
If you have less than 4 characters and the result is smaller than 4 ?

Schmide · Dec 27, 2017

I was just augmenting your routines.

To get to aligned access, simply negate the number then mask to the bits needed to align to.

Defines that I created to do it are the following:

Code:

#define MASK(a) ( ( 1 << (a) ) - 1 )

#define TOALIGNED32(a,b) ( ((unsigned long)(-reinterpret_cast<long>(a))) & MASK(b) )

// bitsToAlignTo 2=dword 3=qword 4=8word 5=16word and so on
unsigned long toAligned = TOALIGNED32(currentPosition, bitsToAlignTo);

This is for 32bit. 64bit below.

Code:

#define TOALIGNED32(a,b) ( ((unsigned long)(-(long)reinterpret_cast<long long>(a))) & MASK(b) )

William Gaatjes · Dec 28, 2017

I was meaning address alignment.
My mcu throws abort exceptions when i do a misaligned memory access.
I do not know how other processors handle that.
I know i have read somewhere that some cpu cores can do misaligned word(32bit) and half word (16bit) access at a cost of needing more clockcycles to do so.

I am sorry, i am not understanding what you mean.
Your examples i have never seen before.
What does : "-reinterpret_cast" do ?
I am just a hobby programmer ...

Schmide · Dec 28, 2017

Oh. This is my routine for aligned allocation.

Code:

void * H_malloc_simd(const unsigned long size, unsigned long alignment)
{
#if defined WIN32           // WIN32
   return _aligned_malloc(size, alignment);
#elif defined __linux__     // Linux
   return memalign(alignment, size);
#elif defined __MACH__      // Mac OS X
   return malloc(size);
#else                       // other (use valloc for page-aligned memory)
   return valloc(size);
#endif
}

void H_free_simd(void* mem)
{
#if defined WIN32
   _aligned_free(mem);
#else  
   free(mem);
#endif
}

Schmide · Dec 28, 2017

reinterpret_cast basically tells the compiler to allow the pointer to be converted to a integer or back.

Schmide · Dec 28, 2017

Most compilers will align stack data to the nearest 16 bytes. If you need more than that there are platform specifics or in C++ 11 onward there is the alignas operator.

Fun Strlen retrieving 4 bytes at once.

Lifer

Lifer

Diamond Member

Lifer

Lifer

Lifer

Lifer

Lifer

Lifer

Lifer

Junior Member

Lifer

Junior Member

Lifer

Lifer

Lifer

Diamond Member

Lifer

Diamond Member

Lifer

Diamond Member

Lifer

Diamond Member

Diamond Member

Diamond Member