C++: Time spent declaring an array

Carlis

Senior member
May 19, 2006
237
0
76
Hi

I am writing some simulation software, and speed is crucial. At a particular step in my program I have the following options:

either just 'double G[3][3][3][72]' is needed, or 'double GS[18][18][18][72]' is needed as well. So if I declare the GS array but never actually use it, will many CPU cycles be spent? Should I write a separate method for each case?

Best regards
Carlis.
 

degibson

Golden Member
Mar 21, 2008
1,389
0
0
Carlis said:
Hi

I am writing some simulation software, and speed is crucial. At a particular step in my program I have the following options:

either just 'double G[3][3][3][72]' is needed, or 'double GS[18][18][18][72]' is needed as well. So if I declare the GS array but never actually use it, will many CPU cycles be spent? Should I write a separate method for each case?

Best regards
Carlis.

Both of those are static, fixed-size declarations. If they're at global scope, those memory areas are set up at program start. If they're at local scope, they're allocated with a single subtract from the stack pointer, so both allocations take the same (negligible) amount of time regardless of size.

There is one fine point to consider: the second declaration is quite large: 18x18x18x72 doubles is about 3.2 MB. That might overflow your stack, which would show up as a mysterious segfault. FYI (perhaps consider malloc() instead).

In general, memory allocation is cheap: incredibly cheap on the stack, just a little more expensive on the heap (via malloc() or new). The real overhead of using that memory is incurred when you fill it for the first time (large malloc()s are usually satisfied lazily with copy-on-write zero pages via anonymous mmap()). There is also an overhead for having large data arrays, which occupy valuable cache space, but those overheads are paid when the memory is accessed, not when it is allocated.
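
For concreteness, here is a minimal sketch of moving the big array to the heap. The dimension constants and the gs() helper are made up for illustration, not from the OP's code:

Code:
#include <vector>

// the 18x18x18x72 array is ~3.2 MB -- too big for a default 1 MB stack,
// so keep it in one contiguous heap block instead
const int NX = 18, NY = 18, NZ = 18, NK = 72;
std::vector<double> GS(NX * NY * NZ * NK);

// flatten the four indices into one offset (row-major, last index fastest)
inline double& gs(int x, int y, int z, int k)
{
    return GS[((x * NY + y) * NZ + z) * NK + k];
}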
 

Cogman

Lifer
Sep 19, 2000
10,286
145
106
degibson said:
Both of those are static, fixed-size declarations. If they're at global scope, those memory areas are set up at program start. If they're at local scope, they're allocated with a single subtract from the stack pointer, so both allocations take the same (negligible) amount of time regardless of size.

There is one fine point to consider: the second declaration is quite large: 18x18x18x72 doubles is about 3.2 MB. That might overflow your stack, which would show up as a mysterious segfault. FYI (perhaps consider malloc() instead).

In general, memory allocation is cheap: incredibly cheap on the stack, just a little more expensive on the heap (via malloc() or new). The real overhead of using that memory is incurred when you fill it for the first time (large malloc()s are usually satisfied lazily with copy-on-write zero pages via anonymous mmap()). There is also an overhead for having large data arrays, which occupy valuable cache space, but those overheads are paid when the memory is accessed, not when it is allocated.

Everything degibson has said is true. Your biggest overhead is going to come from doing useful things with the data, not from allocating the data space itself. If you are frequently allocating/deallocating memory off the heap, a decent speed boost can be had by reusing memory blocks, as in the sketch below. (This will also potentially produce some interesting, hard-to-find errors, but what's speed without a little risk :D)
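
A minimal sketch of that reuse idea; simulate_step and the sizes are placeholders standing in for the OP's real per-step work:

Code:
#include <vector>

// stand-in for the real per-step computation
void simulate_step(std::vector<double>& scratch)
{
    scratch[0] += 1.0;
}

void run(int steps)
{
    // allocate the scratch block once and reuse it every step,
    // instead of malloc/free (or new/delete) inside the loop
    std::vector<double> scratch(18 * 18 * 18 * 72);
    for (int s = 0; s < steps; ++s)
        simulate_step(scratch);   // overwrites scratch in place
}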

My only recommendation on top of degibson's is to try to use aligned memory whenever possible. Because of the way your processor accesses memory, it can achieve higher speeds when data is aligned to a 16-byte boundary, and alignment also makes it easier to use SIMD instructions.
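
A sketch of two ways to get 16-byte alignment. alignas is C++11 and later; posix_memalign is POSIX (on MSVC, _aligned_malloc/_aligned_free play the same role) -- which one applies depends on your toolchain:

Code:
#include <cstdlib>   // posix_memalign, free

// static/stack arrays: alignas forces the boundary (C++11 and later)
alignas(16) static double G[3][3][3][72];

void use_aligned_heap_block()
{
    // heap: ask for a 16-byte-aligned block explicitly
    void* raw = 0;
    if (posix_memalign(&raw, 16, 18 * 18 * 18 * 72 * sizeof(double)) == 0)
    {
        double* buf = (double*)raw;
        buf[0] = 1.0;   // ... use buf ...
        free(buf);      // posix_memalign memory is released with free()
    }
}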
 

Schmide

Diamond Member
Mar 7, 2002
5,745
1,036
126
Edit: See note below.

I don't think allocation is going to be your bottleneck. With a structure like double G[3][3][3][72], resolving any one element costs roughly 3 integer multiplications (3 cycles each*), 3 adds (1 cycle each*), and a load effective address (1-2 cycles*), totaling around 14 cycles. Best case, two identical resolves enter the pipeline together and the pair completes in twenty-something cycles. When dealing with data like this you have to make sure your access pattern matches the array's memory layout (its strides). Caching a resolved look-up in a typed pointer variable can enable optimizations the compiler otherwise just can't assume are safe.

If the last dimension held, say, an 8x8 matrix (using 64 of the 72 entries), you could declare

Code:
double dMatrixData[3][3][3][72];
typedef double pMatrixType[8][8];

int main()
{
	int iX = 0;
	int iY = 0;
	int iZ = 0;

	/* non-optimized access: the [iX][iY][iZ] prefix is re-resolved
	   (3 imuls) on every store */
	for (int i = 0; i < 8; i++)
		for (int j = 0; j < 8; j++)
			dMatrixData[iX][iY][iZ][(i << 3) + j] = (double)(i + j);

	/* optimized access: resolve the sub-array pointer once, then index
	   the 8x8 block directly -- no imuls inside the loop */
	pMatrixType *pMatrix = (pMatrixType *)dMatrixData[iX][iY][iZ];
	for (int i = 0; i < 8; i++)
		for (int j = 0; j < 8; j++)
			(*pMatrix)[i][j] = -(double)(i + j);

	return 0;
}

*Core 2/i7/K8/K10; count on more cycles on lesser architectures

Edit: I did some tests and I was very surprised to see how well the compiler optimized out the imuls. For small strides it would replace the imul with lea [r+r*c]. The disassembly of the above code (VC release build) was nearly identical for both loops. So take what I wrote with a grain of salt: it's more effort for very little, if any, gain.
 
Last edited:

degibson

Golden Member
Mar 21, 2008
1,389
0
0
Schmide said:
Edit: I did some tests and I was very surprised to see how well the compiler optimized out the imuls. For small strides it would replace the imul with lea [r+r*c]. The disassembly of the above code (VC release build) was nearly identical for both loops. So take what I wrote with a grain of salt: it's more effort for very little, if any, gain.

The other thing to keep in mind is that, despite what the programming model says, instruction timings in cycles aren't really fixed. Modern x86s are almost universally (1) out-of-order and (2) superscalar, which tends to obscure instruction 'latencies' as the code perceives them.
 

eLiu

Diamond Member
Jun 4, 2001
6,407
1
0
degibson said:
The other thing to keep in mind is that, despite what the programming model says, instruction timings in cycles aren't really fixed. Modern x86s are almost universally (1) out-of-order and (2) superscalar, which tends to obscure instruction 'latencies' as the code perceives them.

Isn't part of the reason he's seeing similar results that the array is allocated statically? I.e., if the loop limits were all variables (not known at compile time) and the array size weren't known at compile time, I would expect the compiled code to be worse.

So if he did use malloc, I think the optimal way to access the array entries would be to step a pointer through the array in the innermost loop (e.g., *A++) and update that pointer in the outer loops if necessary. (Though I guess if the compute time for each element is large, how you access the elements isn't such a big deal.)

Also OP, if you use malloc, make sure you allocate all the array entries in one large block: one single call to malloc for the whole array, not multiple calls.

And regardless of static or dynamic, make sure the "fastest changing" index comes last. If I have A[10][20], I want to access A as "10 sets of 20" rather than "20 sets of 10."
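
A minimal sketch putting those two points together; the dimension names are made up, and the innermost loop just zero-fills for illustration:

Code:
#include <cstdlib>

int main()
{
	const int NX = 18, NY = 18, NZ = 18, NK = 72;

	/* one malloc for the whole 4D array */
	double* A = (double*)malloc(NX * NY * NZ * NK * sizeof(double));

	/* walk a pointer through the block so the innermost loop is *p++;
	   the last ("fastest changing") index varies in the inner loop */
	double* p = A;
	for (int x = 0; x < NX; ++x)
		for (int y = 0; y < NY; ++y)
			for (int z = 0; z < NZ; ++z)
				for (int k = 0; k < NK; ++k)
					*p++ = 0.0;

	free(A);
	return 0;
}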
 

degibson

Golden Member
Mar 21, 2008
1,389
0
0
eLiu said:
Isn't part of the reason he's seeing similar results that the array is allocated statically? I.e., if the loop limits were all variables (not known at compile time) and the array size weren't known at compile time, I would expect the compiled code to be worse.

It does help quite a bit when array sizes and loop counts are known at compile time. The former lets the compiler turn multiply instructions into shifts in many cases, and the latter gives it more freedom in loop unrolling.
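
A toy illustration of both effects, assuming a hypothetical [rows][72] layout (the names are made up):

Code:
const int NK = 72;   // extent known at compile time

double sum_row(const double* A, int y)
{
	double s = 0.0;
	for (int k = 0; k < NK; ++k)   // fixed trip count -> unrollable
		s += A[y * NK + k];        // y*72 can become (y<<6)+(y<<3)
	return s;
}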
 

Carlis

Senior member
May 19, 2006
237
0
76
Thank you for your replies. Obviously, it is too large. I am using it in a method, and I assume that with multiple threads running, that means a lot of data on the stack. I am trying a different route. Thanks a lot.
 

Cogman

Lifer
Sep 19, 2000
10,286
145
106
Carlis said:
Thank you for your replies. Obviously, it is too large. I am using it in a method, and I assume that with multiple threads running, that means a lot of data on the stack. I am trying a different route. Thanks a lot.

You should know that each thread gets its own new stack (in Windows, this is typically 1 MB). So if the array fits in one thread's stack, it will most likely fit in all of them. You should also realize that (at least in Windows) the stack size can actually be set at thread creation.
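
For example, a minimal Win32 sketch; the worker function and the 8 MB figure are just placeholders. CreateThread's second parameter is the stack size, with 0 meaning the default:

Code:
#include <windows.h>

// stand-in for the real thread function
DWORD WINAPI worker(LPVOID arg)
{
	// a ~3.2 MB local array could live here with room to spare
	return 0;
}

int main()
{
	// ask for an 8 MB stack instead of the ~1 MB default
	HANDLE h = CreateThread(NULL, 8 * 1024 * 1024, worker, NULL, 0, NULL);
	WaitForSingleObject(h, INFINITE);
	CloseHandle(h);
	return 0;
}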