memset an array of bytes

Rakehellion

Lifer
Jan 15, 2013
12,182
35
91
Is there a way to memset multiple bytes at once? Such as:

memset(data, "0123", sizeof(data));

so that data contains "0123012301230123..."
 

DaveSimmons

Elite Member
Aug 12, 2001
40,730
670
126
memcpy in a loop :)

Be careful if the buffer is not an exact multiple of the size of your fill-data.
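
A minimal sketch of the "memcpy in a loop" idea, including the tail case where the buffer is not an exact multiple of the pattern size. The name fill_pattern and its parameters are made up for illustration; nothing in the thread defines them.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Fill dst[0..dst_len) with repeated copies of pat[0..pat_len).
 * fill_pattern is a hypothetical helper name, not a standard function. */
void fill_pattern(void *dst, size_t dst_len, const void *pat, size_t pat_len)
{
    unsigned char *out = dst;
    size_t done = 0;

    assert(pat_len > 0);

    /* Whole copies of the pattern. */
    while (dst_len - done >= pat_len) {
        memcpy(out + done, pat, pat_len);
        done += pat_len;
    }

    /* Partial copy when the buffer is not an exact multiple of pat_len. */
    if (done < dst_len)
        memcpy(out + done, pat, dst_len - done);
}
```

Calling fill_pattern(data, sizeof(data), "0123", 4) on a 10-byte buffer should leave two whole copies followed by "01".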
 

degibson

Golden Member
Mar 21, 2008
1,389
0
0
Your example is a little confusing. "0123" is five bytes (NUL terminator in there), but your example output discards the NUL?

Anyway, assuming you want 'data' filled with ASCII 012301230123... with no NUL terminator, and assuming that 'data's size is a multiple of 4, and assuming C++:
Code:
assert(0 == (sizeof(data)%4));
const int kValue = 0x30313233;  // 0123 ASCII
int* data_as_int = reinterpret_cast<int*>(data);
for (int i = 0; i < sizeof(data)/4; ++i) {
  *(data_as_int+i) = kValue;
}
(note 0x30313233 is 0123's ASCII encoding represented as four bytes)

EDIT: This doesn't seem to sniff like homework... but if it is, well, sorry for giving it away.
 

Rakehellion

It's already inside a nested loop, so I need it to be as fast as possible. Adding an extra loop incurs a 20% performance penalty in my code, while memset is less than 1%.

I'll probably have to precalculate the values as far ahead of time as possible.
 

Merad

Platinum Member
May 31, 2010
2,586
19
81
Rakehellion said:
It's already inside a nested loop, so I need it to be as fast as possible. Adding an extra loop incurs a 20% performance penalty in my code, while memset is less than 1%.

I'll probably have to precalculate the values as far ahead of time as possible.

Yeah, I'm gonna go out on a limb and call BS on that. Unless you're doing a loop with billions of iterations, there is literally no way that a loop which sets a primitive value is causing a 20% performance impact. Wanna know how I know? Well, here's the secret: memset is a loop.

Typical implementation (this happens to be from GCC):

Code:
PTR
memset (PTR dest, register int val, register size_t len)
{
  register unsigned char *ptr = (unsigned char*)dest;
  while (len-- > 0)
    *ptr++ = val;
  return dest;
}

Unless you hard code the values into your code there is simply no way to do this without a loop. The real question though is why you'd need this to begin with? An array with the same pattern repeated again and again is just wasted memory. Store the pattern once and access it as needed.

degibson said:
Anyway, assuming you want 'data' filled with ASCII 012301230123... with no NUL terminator, and assuming that 'data's size is a multiple of 4, and assuming C++:

No offense but that's a pretty horrible way to write it. It's also going to come out backwards if he/she is using a little endian system. Just do it in a simple and readable way:

Code:
for (int i = 0; i < arrSize; ++i)
{
  arr[i] = (i % 4) + '0';
}

Where arr is an array of char.
 

iCyborg

Golden Member
Aug 8, 2008
1,327
52
91
I remember looking at the disassembly with the VC++ 2010 compiler: if compiled with optimization flags, memset will use SSE2 or some such. Perhaps his loop isn't getting optimized the same way, or at all. If he's comfortable with asm, he could look at what memset does and use inline asm to do the same or something similar.
 

Rakehellion

Merad said:
Yeah, I'm gonna go out on a limb and call BS on that. Unless you're doing a loop with billions of iterations, there is literally no way that a loop which sets a primitive value is causing a 20% performance impact. Wanna know how I know? Well, here's the secret: memset is a loop.

I was assuming memset used some low-level hardware feature to set the memory because it's practically free while plain assignments are predictably slow.

And the loop I'm using can get to hundreds of millions of iterations. I'm putting the preset data in one of the outer loops to increase performance.

Merad said:
No offense but that's a pretty horrible way to write it. It's also going to come out backwards if he/she is using a little endian system. Just do it in a simple and readable way:

Code:
for (int i = 0; i < arrSize; ++i)
{
  arr[i] = (i % 4) + '0';
}

Where arr is an array of char.

Ewwww, modulo. I'm trying to save cycles here.

Merad said:
Unless you hard code the values into your code there is simply no way to do this without a loop. The real question though is why you'd need this to begin with? An array with the same pattern repeated again and again is just wasted memory. Store the pattern once and access it as needed.

I'm doing vector operations and hopefully gaining performance by wasting a little extra memory.

I found a way to do it without using memset, but I'm still wondering why it's so much faster than a plain loop. Am I getting unrolled by the compiler maybe? I just assumed memset did some special sorcery behind the scenes that you can't do with plain C.
 

degibson

Merad said:
No offense but that's a pretty horrible way to write it. It's also going to come out backwards if he/she is using a little endian system. Just do it in a simple and readable way:

A per-byte strategy ignores the requirement I had inferred from the original description: 4-byte-wide operation. That was just a guess on my part, trying to infer what OP wanted... but you're right about the endianness, and about the readability.
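
One way to keep the 4-byte-wide stores while avoiding the endianness problem is to build the word from the pattern bytes instead of from a hex literal. This is only a sketch under the same assumption as before (size is a multiple of 4); fill_words4 is a hypothetical name.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Fill dst with the 4-byte pattern pat, one 32-bit word at a time.
 * Requires dst_len to be a multiple of 4.
 * fill_words4 is a hypothetical name used for illustration. */
void fill_words4(void *dst, size_t dst_len, const char pat[4])
{
    unsigned char *out = dst;
    uint32_t word;
    size_t i;

    assert(dst_len % 4 == 0);

    /* Copying the bytes into the word keeps them in memory order on any
     * endianness, unlike writing the literal 0x30313233. */
    memcpy(&word, pat, sizeof word);

    for (i = 0; i < dst_len; i += sizeof word) {
        /* memcpy sidesteps alignment and strict-aliasing concerns;
         * compilers commonly lower a fixed 4-byte memcpy to one store. */
        memcpy(out + i, &word, sizeof word);
    }
}
```

Whether this beats the per-byte loop depends on the compiler and flags, so it is worth measuring rather than assuming.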

Rakehellion said:
I found a way to do it without using memset, but I'm still wondering why it's so much faster than a plain loop. Am I getting unrolled by the compiler maybe? I just assumed memset did some special sorcery behind the scenes that you can't do with plain C.

memset() is typically heavily optimized for its platform. If you're doing large memsets, you can do wonders with prefetching, super-word stores, non-temporal accesses, etc., depending on your platform. Don't feel bad if you can't beat memset()'s native performance.
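
If the library routines are that well tuned, one option is to let memcpy do most of the work: seed the buffer with one copy of the pattern, then keep copying the already-filled prefix onto the space after it, so the filled region doubles on each pass. This is a common trick rather than anything proposed in the thread, and fill_doubling is a made-up name.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Fill dst[0..dst_len) with repeats of pat using O(log n) memcpy calls.
 * fill_doubling is a hypothetical helper name. */
void fill_doubling(void *dst, size_t dst_len, const void *pat, size_t pat_len)
{
    unsigned char *out = dst;
    size_t filled;

    assert(pat_len > 0);
    if (dst_len == 0)
        return;

    /* Seed the buffer with one (possibly truncated) copy of the pattern. */
    filled = pat_len < dst_len ? pat_len : dst_len;
    memcpy(out, pat, filled);

    /* Copy the filled prefix onto the space right after it.
     * Source [0, n) and destination [filled, filled + n) never overlap
     * because n <= filled, so plain memcpy is allowed here. */
    while (filled < dst_len) {
        size_t n = dst_len - filled;
        if (n > filled)
            n = filled;
        memcpy(out + filled, out, n);
        filled += n;
    }
}
```

The filled length stays a multiple of the pattern length until the last pass, so every copy starts at the right phase and odd buffer sizes need no special handling.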
 

Merad

Rakehellion said:
I was assuming memset used some low-level hardware feature to set the memory because it's practically free while plain assignments are predictably slow.

Memset is a good candidate for inlining, loop unrolling and other optimizations. On some platforms it may use special instructions, but in general... it's a loop.

Rakehellion said:
And the loop I'm using can get to hundreds of millions of iterations. I'm putting the preset data in one of the outer loops to increase performance.

Well if you have only a few values repeated over and over, you could enter them once and keep accessing them using some pointer arithmetic (which is already happening anyway with the array). Doing that might require using mod, but are the cycles required by mod really worse than the cycles required to set up an array with millions of entries? Also, IIRC modding by a power of 2 is often replaced with bitwise operations as an optimization.
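
A rough sketch of the "store the pattern once" idea, assuming the pattern length is a power of two so the mod can be written as a bitwise AND. pattern_at is a hypothetical name, not something from the thread.

```c
#include <assert.h>
#include <stddef.h>

/* Return the byte that a fully expanded buffer would hold at index i,
 * reading from a single stored copy of the pattern.
 * pat_len must be a power of two so that i & (pat_len - 1) == i % pat_len.
 * pattern_at is a hypothetical name. */
unsigned char pattern_at(const unsigned char *pat, size_t pat_len, size_t i)
{
    assert(pat_len != 0 && (pat_len & (pat_len - 1)) == 0);
    return pat[i & (pat_len - 1)];
}
```

With the 4-byte pattern '0','1','2','3', index 5 maps to '1' without the big array ever existing. Whether this helps with vector code is a separate question, since wide loads usually want the data laid out contiguously.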

Rakehellion said:
I found a way to do it without using memset, but I'm still wondering why it's so much faster than a plain loop. Am I getting unrolled by the compiler maybe? I just assumed memset did some special sorcery behind the scenes that you can't do with plain C.

You're going to have to look at the assembly output to see what the compiler is doing in each case. If you can figure out what it's doing, know assembly language for your platform, and don't give a flip about portability, you could always write your own optimized function using inline assembly.
 

Schmide

Diamond Member
Mar 7, 2002
5,587
719
126
Without writing the code, here are the things that would probably make the fastest generic algorithm.

Do quick checks for small copies and just do a quick loop.

For large copies -

Do a 1-3 byte copy to dword align the destination if necessary.

Rotate (if alignment was necessary above) and adjust source string by the following rules

If there is an odd number of bytes at the end of the source string (1 or 3 bytes), you need to quadruple the source string to force dword alignment (repeating as necessary).

If there are 2 extra bytes at the end of the source string, you need to double it.

Otherwise the source string is dword aligned.

Do dword copies for as many dwords as possible. Unrolling by 4 would probably be best. For really large copies, unrolling by 16 would be optimal, since 16 dwords is the size of most cache lines (64 bytes).

Do the final 0-3 byte copy.

Code:
Alignment examples ("|" marks the boundary between copies of the string)

String with 11 bytes has 3 extra bytes.

1234 5678 912

Needs to be quadrupled to force alignment.

1234 5678 912|1 2345 6789 12|12 3456 7891 2|123 4567 8912

String with 10 bytes has 2 extra bytes.

1234 5678 91

Needs to be doubled to force alignment.

1234 5678 91|12 3456 7891

String with 9 bytes has 1 extra byte. (same as first example)

1234 5678 9

Needs to be quadrupled to force alignment.

1234 5678 9|123 4567 89|12 3456 789|1 2345 6789
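
For what it's worth, the steps above could be sketched roughly like this. It is an untuned illustration with no unrolling; the name fill_aligned, the MAX_PAT cap, and the 16-byte "small fill" cutoff are all assumptions made for the example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

enum { MAX_PAT = 64 };  /* arbitrary cap chosen for this sketch */

/* fill_aligned is a hypothetical name for the algorithm outlined above. */
void fill_aligned(void *dst, size_t dst_len, const void *pat, size_t pat_len)
{
    unsigned char *out = dst;
    const unsigned char *p = pat;
    unsigned char expanded[4 * MAX_PAT];
    size_t copies, exp_len, head, phase, i, words, w;

    assert(pat_len > 0 && pat_len <= MAX_PAT);

    /* Small fills: just do a quick byte loop. */
    if (dst_len < 16) {
        for (i = 0; i < dst_len; ++i)
            out[i] = p[i % pat_len];
        return;
    }

    /* 1. Copy 0-3 bytes so the destination is dword aligned. */
    head = (4 - ((uintptr_t)out & 3)) & 3;
    for (i = 0; i < head; ++i)
        out[i] = p[i % pat_len];

    /* 2. Repeat the pattern until its length is a multiple of 4:
     *    x4 for odd lengths, x2 when 2 bytes are left over, else x1.
     *    Starting at `phase` is the "rotate" step after the head copy. */
    copies = (pat_len % 2) ? 4 : (pat_len % 4) ? 2 : 1;
    exp_len = copies * pat_len;
    phase = head % pat_len;
    for (i = 0; i < exp_len; ++i)
        expanded[i] = p[(phase + i) % pat_len];

    /* 3. Store whole dwords, cycling through the expanded pattern. */
    words = (dst_len - head) / 4;
    for (w = 0, i = 0; w < words; ++w) {
        memcpy(out + head + 4 * w, expanded + i, 4);
        i += 4;
        if (i == exp_len)
            i = 0;
    }

    /* 4. Final 0-3 bytes. */
    for (i = head + 4 * words; i < dst_len; ++i)
        out[i] = p[i % pat_len];
}
```

The 4-byte memcpy in step 3 stands in for a dword store; a real implementation would unroll that loop as described and would need benchmarking against the platform's memcpy.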
 