Without writing the code, here are the things that would probably make up the fastest generic algorithm.
Check for small copies first and handle them with a simple byte loop.
For large copies:
Do a 1-3 byte copy to dword align the destination if necessary.
Rotate the source string (if alignment was necessary above) and adjust it by the following rules:
If there is an odd number of extra bytes at the end of the source string (length mod 4 is 1 or 3), you need to quadruple the source string, repeating it as necessary, to make its length a whole number of dwords.
If there are 2 extra bytes at the end of the source string, you need to double it.
Otherwise the source string is already a whole number of dwords.
Do dword copies for as many dwords as possible. Unrolling by 4 would probably be best. For really large copies, unrolling by 16 is likely better, since 16 dwords is 64 bytes, the size of most cache lines.
Do the final 0-3 byte copy.
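The steps above can be sketched roughly as follows. This is only a minimal sketch, assuming the goal is to fill a destination buffer with repeated copies of a short source string. The name `fill_repeat`, the 64-byte small-copy threshold, and the temporary heap buffer are all assumptions made for illustration, and the unrolling is left out to keep it short. The 4-byte `memcpy` calls stand in for dword moves; compilers generally turn them into a single load and store.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: fill dst[0..dst_len) with repeated copies of
 * src[0..src_len). Returns 0 on success, -1 on bad input or allocation failure. */
int fill_repeat(unsigned char *dst, size_t dst_len,
                const unsigned char *src, size_t src_len)
{
    if (src_len == 0)
        return dst_len == 0 ? 0 : -1;

    /* Small copies: just do a quick byte loop (threshold is a guess). */
    if (dst_len < 64) {
        for (size_t i = 0; i < dst_len; i++)
            dst[i] = src[i % src_len];
        return 0;
    }

    /* Copy 0-3 bytes so the destination becomes dword aligned. */
    size_t head = (size_t)(-(uintptr_t)dst & 3u);
    for (size_t i = 0; i < head; i++)
        dst[i] = src[i % src_len];

    /* Repeat the source 1, 2 or 4 times so its length is a multiple of 4,
     * rotated by `head` bytes so the pattern continues where it left off. */
    size_t extra = src_len & 3u;
    size_t reps = (extra == 0) ? 1 : (extra == 2) ? 2 : 4;
    size_t pat_len = src_len * reps;
    unsigned char *pat = malloc(pat_len);
    if (pat == NULL)
        return -1;
    for (size_t i = 0; i < pat_len; i++)
        pat[i] = src[(head + i) % src_len];

    /* Dword copies for as many dwords as possible (not unrolled here). */
    unsigned char *out = dst + head;
    size_t dwords = (dst_len - head) / 4;
    size_t pat_dwords = pat_len / 4;
    size_t p = 0;                       /* current dword index in pat */
    for (size_t i = 0; i < dwords; i++) {
        memcpy(out, pat + 4 * p, 4);
        out += 4;
        if (++p == pat_dwords)
            p = 0;
    }

    /* Final 0-3 byte copy, continuing from the current pattern position. */
    size_t tail = (dst_len - head) & 3u;
    for (size_t i = 0; i < tail; i++)
        out[i] = pat[4 * p + i];

    free(pat);
    return 0;
}
```

Calling `fill_repeat(buf, 100, (const unsigned char *)"12345678912", 11)` would fill `buf` with the 11-byte string repeated end to end. A tuned version would unroll the dword loop and avoid the allocation for short sources.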
Alignment examples (a | marks where one copy of the string ends and the next begins)
String with 11 bytes has 3 extra bytes.
1234 5678 912
Needs to be quadrupled to force alignment.
1234 5678 912|1 2345 6789 12|12 3456 7891 2|123 4567 8912
String with 10 bytes has 2 extra bytes.
1234 5678 91
Needs to be doubled to force alignment.
1234 5678 91|12 3456 7891
String with 9 bytes has 1 extra byte. (same rule as the first example)
1234 5678 9
Needs to be quadrupled to force alignment.
1234 5678 9|123 4567 89|12 3456 789|1 2345 6789
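The doubling and quadrupling rule in these examples can be checked with a few lines of C. This is a small sketch only; `reps_for_dword_multiple` and `expand_grouped` are hypothetical helper names, and the second one simply writes out the expanded string with a space after every 4 bytes so it can be compared with the examples above.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: how many copies of a string of src_len bytes are
 * needed before the total length is a multiple of 4. */
size_t reps_for_dword_multiple(size_t src_len)
{
    switch (src_len & 3u) {
    case 0:  return 1;   /* already a whole number of dwords */
    case 2:  return 2;   /* 2 extra bytes: double it */
    default: return 4;   /* 1 or 3 extra bytes: quadruple it */
    }
}

/* Hypothetical helper: write the expanded string into out, grouped in
 * dwords separated by spaces. Returns the expanded length in bytes,
 * or 0 if src is empty or out is too small. */
size_t expand_grouped(const char *src, char *out, size_t out_cap)
{
    size_t src_len = strlen(src);
    size_t total = src_len * reps_for_dword_multiple(src_len);
    /* total bytes, (total / 4 - 1) separators, and one terminator */
    size_t need = total + total / 4;
    if (src_len == 0 || need > out_cap)
        return 0;

    size_t o = 0;
    for (size_t i = 0; i < total; i++) {
        if (i != 0 && (i & 3u) == 0)
            out[o++] = ' ';
        out[o++] = src[i % src_len];
    }
    out[o] = '\0';
    return total;
}
```

For the 10-byte string `"1234567891"` this gives 20 bytes, grouped as `1234 5678 9112 3456 7891`, which matches the doubled example.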