3DPM: Can we fix it? Yes we can!


Enigmoid

Platinum Member
The only problem I see is here:

Code:
alpha = pi*rand() + pi;
That's going to generate values from pi to 2pi, rather than 0 to 2pi. You could switch to:

Code:
alpha = rand() * (pi + pi);
Or you could predefine a constant named twopi equal to pi * 2 (so you don't have to constantly recalculate pi * 2 . . . that's what I did in my Java code) and do this:

Code:
alpha = rand() * twopi;
Everything else looks right. Dr. Cutress seems to have used 10000 individual particles, all starting at the origin (0, 0, 0). The calculations do not seem to take any kind of particle-on-particle collision into account, which is just fine with me.

My rand() is uniformly distributed from -1 to 1.

Multiplying by pi creates a uniform distribution from -pi to pi.

Adding pi to it shifts the distribution to the range 0 to 2*pi.
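In code form (a minimal sketch; rand() here stands for my uniform [-1, 1] generator, not the C stdlib rand):

Code:
float u = rand();                             // uniform in [-1, 1]
float alpha = 3.141592654f*u + 3.141592654f;  // u = -1 -> alpha = 0, u = +1 -> alpha = 2*pi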

I will upload a more advanced C++ version in a bit. I need the experience with MT and compilers.
 

DrMrLordX

Lifer
My rand() is uniformly distributed from -1 to 1.

Multiplying by pi creates a uniform distribution from -pi to pi.

Adding pi to it shifts the distribution to the range 0 to 2*pi.

I will upload a more advanced C++ version in a bit. I need the experience with MT and compilers.

Oh okay, I had thought you were only using that rand() implementation for newz. Oops! In which case yes, you are going to get 0 to 2pi from your code.

You may still squeeze out some extra performance by using a predefined value for 2pi and just multiplying it by a standard 0 to 1 random float value. Bear in mind that Dr. Cutress' code calls for two different random values (one used to generate newz, the other used to generate alpha), so making two different rand() calls (one for -1 to 1, one for 0 to 1) adds no extra work.

Recycling the same random value would speed things up, but it would also produce different particle movement behavior.
 

Enigmoid

Platinum Member
C++ Version

This is a really rough, unoptimized version. If there are errors, please tell me. And I know there are errors, because adding '-O' or '-O2' speeds this code up tremendously.

Code:
#include<iostream>
#include<cmath>
#include<cstdio>
#include<ctime>
#include<random>
#include<functional>
using namespace std;

int main()
{
    clock_t start;
    double duration;
    start = clock();

    int steps=10240,counter,dummy;

    float newz, alpha, r;
    float pos[10000][3] = {0};

    mt19937::result_type seed = time(0);
    auto real_rand = bind(uniform_real_distribution<float>(0,1),mt19937(seed));
    for(dummy=0;dummy<10000;dummy++)
    {
        for (counter=0;counter<steps;counter++)
        {
            newz = real_rand();
            alpha=real_rand()*3.141592654;
            r = sqrt(1-newz*newz);

            pos[dummy][1] += r*cos(alpha);
            pos[dummy][2] += r*sin(alpha);
            pos[dummy][3] += abs(newz);
        }
    }

    duration = (clock()- start)/(double) CLOCKS_PER_SEC;
    cout<<"\nDuration is "<<duration<<"\n";
}
This is single-threaded and uses standard library calls, basically written in the shortest amount of time possible. Compiled with the '-std=c++0x' flag to use the Mersenne Twister.

(3630QM @ 3.2 GHz, GCC 4.7.1 with Code::Blocks) 10240 steps/particle, 10000 particles

No other compiler settings
67.7 seconds

-O2
3.98 seconds (17x speedup; something is being wrongly optimized away if I had to guess - extremely likely it is the RNG).

At 102400 steps, -O2 takes 31 seconds.

Will try my hand at MT later.

Edit: this code is wrong
 

bronxzv

Senior member
IMO this should be:

Code:
   [...]
        for (counter=0;counter<steps;counter++)
        {
            newz = 2.0*real_rand()-1.0;
            alpha = real_rand()*2.0*3.141592654;
            r = sqrtf(1-newz*newz);
            pos[dummy][0] += r*cosf(alpha);
            pos[dummy][1] += r*sinf(alpha);
            pos[dummy][2] += abs(newz);
        }

I'd also advise displaying a checksum of the final positions, to easily track regressions in future refactorings.
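For example (a minimal sketch over the pos array; accumulating into a double keeps the comparison stable):

Code:
// print a checksum of the final positions and compare it across refactorings
double checksum = 0.0;
for (int i = 0; i < 10000; i++)
    checksum += pos[i][0] + pos[i][1] + pos[i][2];
cout << "checksum = " << checksum << "\n";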
 

DrMrLordX

Lifer
Question:

Code:
 mt19937::result_type seed = time(0);
 auto real_rand = bind(uniform_real_distribution<float>(0,1),mt19937(seed));

I've never seen code like that before. I think I understand the first line (you're using the mt19937 namespace, and you're creating a variable of type result_type named seed that's equal to the current system time. I think?). I'll be buggered if I know exactly how the second line works, though the end result is that you've got an MT random number generation function named real_rand.

So does real_rand() generate a random float value between 0 and 1?
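For reference, the idiom works roughly like this: mt19937 is a class rather than a namespace, result_type is its nested typedef for the integer type the engine produces, and bind() fuses the distribution and the engine into a single callable. A minimal annotated sketch:

Code:
#include <random>
#include <functional>
#include <ctime>
using namespace std;

mt19937::result_type seed = time(0);  // result_type: the unsigned type mt19937 emits
// bind() returns a callable that, on each call, feeds the engine's raw output
// through the distribution, yielding a uniform float in [0, 1):
auto real_rand = bind(uniform_real_distribution<float>(0,1), mt19937(seed));
float sample = real_rand();           // so yes: a random float between 0 and 1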

^ the compiler is treating the loop as dead code.

I run into that with Java all the time. Gotta send those values to an output somewhere or else the JVM ignores all the work. Blah.

IMO this should be:

Code:
   [...]
        for (counter=0;counter<steps;counter++)
        {
            newz = 2.0*real_rand()-1.0;
            alpha = real_rand()*2.0*3.141592654;
            r = sqrt(1-newz*newz);
            pos[dummy][0] += r*cos(alpha);
            pos[dummy][1] += r*sin(alpha);
            pos[dummy][2] += abs(newz);
        }

I think you're right on that; he's only generating values from 0 to 1 for newz and from 0 to pi for alpha. The more I look at it, though, the more I realize that Dr. Cutress' code never actually needs negative values in newz, since it squares them when calculating r and it uses the absolute value when stepping z.

So what he could do is keep his code for newz, eliminate the abs call when calculating z, and change his alpha calculation to:

Code:
alpha = real_rand() * 6.283185308;

It should produce results statistically identical to what is called for by Dr. Cutress' code, since the absolute value of a uniform [-1, 1] variate has the same distribution as a uniform [0, 1] variate.
 

Enigmoid

Platinum Member
Fixing those gives ~ 12 seconds for 10240 steps.

The RNG is not running multithreaded, as the program is only using 12-13% of my CPU. (The RNG is the Mersenne Twister.)

lol yeah, I pretty much never run compiler options for that reason. You don't really know what's happening when you are an amateur programmer.

I've never done any multithreading, but I will try something using OpenMP. (I just let it run and do work on something else, so I've never really prioritized time.)
 

bronxzv

Senior member
The more I look at it, though, the more I realize that Dr. Cutress' code never actually needs negative values in newz since it squares them when calculating r and it uses the absolute value when stepping z.

As I mentioned above in this thread, an alternate possible interpretation is:

Code:
  [...]
  pos[dummy][2] += newz;
  pos[dummy][2] = abs(pos[dummy][2]);

This variant seems more likely when you imagine the particles bouncing on the plane z=0.
The original author's input would be welcome.
 

DrMrLordX

Lifer
Yeah, that does seem logical. As long as the sum total z value never goes into negatives, it ought to be able to increment downward just as easily as upward on any given step . . .

Oh yeah! Another thing. Enigmoid, I think you can use cosf() and sinf() in lieu of the standard cos()/sin() functions.
 

Enigmoid

Platinum Member
I'd advise reposting the whole code after fixes

Sure, though I'm most definitely working on it. I will try to get it multithreaded (which may take a while) and then work on optimizations.

Code:
#include<iostream>
#include<cmath>
#include<cstdio>
#include<ctime>
#include<random>
#include<functional>

using namespace std;

int main()
{
    clock_t start;
    double duration;
    start = clock();

    int steps=10240,counter,dummy,cumulative=0;

    float newz, alpha, r;
    float pos[10000][3] = {0};
    float non=0;//prevent nothing from happening!

    mt19937::result_type seed = time(0);
    auto real_rand = bind(uniform_real_distribution<float>(-1,1),mt19937(seed));
    for(dummy=0;dummy<10000;dummy++)
    {
        for (counter=0;counter<steps;counter++)
        {
            newz = real_rand();
            alpha=real_rand()*3.141592654+3.141592654;
            r = sqrt(1-newz*newz);

            pos[dummy][0] += r*cos(alpha);
            pos[dummy][1] += r*sin(alpha);
            pos[dummy][2] += abs(newz);
            cumulative++; //make sure it at least enters every loop
        }
        non += pos[dummy][0]+ pos[dummy][1]+pos[dummy][2];
    }

    duration = (clock()- start)/(double) CLOCKS_PER_SEC;
    cout<<"dummy value is "<<non;
    cout<<"\nDuration is "<<duration<<"\n";
    cout<<"A total of "<<cumulative<<" steps were taken.";
}
I will also try to get ICC up and running (looks like I need Visual Studio) to see how AMD and Intel chips do. This may take a while, and I will upload the exe files.

I also want to check two things: what happens when the program spawns a ton of threads, and whether there is a cache-sharing issue (unrolled loops).

Yeah, that does seem logical. As long as the sum total z value never goes into negatives, it ought to be able to increment downward just as easily as upward on any given step . . .

Oh yeah! Another thing. Enigmoid, I think you can use cosf() and sinf() in lieu of the standard cos()/sin() functions.

Will try that.

Edit: not seeing any preliminary difference with sinf() and cosf(). C++ overloads sin() and cos() with float equivalents (C does not). I will optimize much more after I'm sure the code is correct and multithreaded well.
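A minimal illustration of the overload point:

Code:
#include <cmath>
using namespace std;

float x = 0.5f;
float a = sin(x);   // C++: <cmath> provides a float overload of std::sin
float b = sinf(x);  // explicit C-style float variant; same result
// In C, plain sin() always takes and returns double.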

As it stands, the code above is pretty much (assuming there aren't more problems) how it's going to be written for any time-non-critical task by many scientists. Many will find the additional time spent optimizing is better spent working on other projects.
 

DrMrLordX

Lifer
Interesting how it overloads sin() and cos(). I didn't know that, but now I do. Thanks!

I copy-pasted your code into VS2013, ripped off a few tricks that I used in my Java version, set every compiler switch I could to "performance", and got completion time down to ~7s on my 7700K (10240 steps). Still single-threaded. I tried using #pragma omp parallel for on the main loop and it slowed down to ~16s, even though it pegged all four cores on my Kaveri. I'm thinking the use of a single, two-dimensional array is causing thread contention over a shared resource . . . or maybe I'm just talking out of my butt.

By the by, I'm thinking you can get rid of the int cumulative. As long as each loop iteration is doing work which affects the final output (which in the case of your code, it is), the compiler won't be able to flag it as dead code and remove it.

Thanks for pasting that though, you really jump-started my C++ work. If you want me to paste my work thus far or send you a copy, just let me know. Right now I'm still tooling around with stuff.
 

Enigmoid

Platinum Member
Interesting how it overloads sin() and cos(). I didn't know that, but now I do. Thanks!

I copy-pasted your code into VS2013, ripped off a few tricks that I used in my Java version, set every compiler switch I could to "performance", and got completion time down to ~7s on my 7700K (10240 steps). Still single-threaded. I tried using #pragma omp parallel for on the main loop and it slowed down to ~16s, even though it pegged all four cores on my Kaveri. I'm thinking the use of a single, two-dimensional array is causing thread contention over a shared resource . . . or maybe I'm just talking out of my butt.

By the by, I'm thinking you can get rid of the int cumulative. As long as each loop iteration is doing work which affects the final output (which in the case of your code, it is), the compiler won't be able to flag it as dead code and remove it.

Thanks for pasting that though, you really jump-started my C++ work. If you want me to paste my work thus far or send you a copy, just let me know. Right now I'm still tooling around with stuff.

Yep. I think the one large array has to go for multithreading.
 

Enigmoid

Platinum Member
Cool to see how it evolves. I'd advise using a double for the checksum value non to have a better check against regressions.

That cast is quite expensive, especially if I optimize. non is primarily a way to see if the results are biased: it adds up a bunch of random numbers with mean value 0, so it should be around 0 with a wide degree of variability (I typically get something with a magnitude of a couple thousand).

Here is 3DPM MT! Still no optimizations. I'm working on them now.

Finally got OpenMP working.

Code:
#include<iostream>
#include<cmath>
#include<cstdio>
#include<ctime>
#include<random>
#include<functional>
#include<omp.h>

using namespace std;

int main()
{
    int threadnum;
    cout<<"Enter the number of threads: 1, 2, 4, 8\n";
    cin>>threadnum;

    clock_t start;
    double duration;
    start = clock();

    int steps=102400,cumulative=0, counter;
    float x,y,z;

    float newz, alpha, r;
    float pos[10000][3] = {0};
    float non=0;//prevent nothing from happening!

    /*mt19937::result_type seed = time(0);
    auto real_rand = bind(uniform_real_distribution<float>(-1,1),mt19937(seed));*/
    #pragma omp parallel private(newz,r,alpha,counter,x,y,z)  reduction (+: non, cumulative) num_threads(threadnum)
    {
        for(int dummy=0;dummy<(10000);dummy++)
        {
            mt19937::result_type seed = time(0);
            auto real_rand = bind(uniform_real_distribution<float>(-1,1),mt19937(seed));

            #pragma omp for
                for (counter=0;counter<steps;counter++)
                {
                    //int x=0,y=0,z=0; //individual x,y,z declared for each thread loop
                    x=0,y=0,z=0;
                    newz = real_rand();
                    alpha=real_rand()*3.141592654+3.141592654;
                    r = sqrt(1-newz*newz);

                    x += r*cosf(alpha);
                    y += r*sinf(alpha);
                    z += abs(newz);

                    cumulative++; //make sure it at least enters every loop
                }
                pos[dummy][0] = x;
                pos[dummy][1] = y;
                pos[dummy][2] = z;
                non+= pos[dummy][0];
        }
    }

    duration = (clock()- start)/(double) CLOCKS_PER_SEC;
    cout<<"dummy value is "<<non;
    cout<<"\nDuration is "<<duration<<"\n";
    cout<<"A total of "<<cumulative<<" steps were taken.\n";
    cout<<"Score (M steps/sec) is "<<cumulative/(duration*1000000)<<".";
    return 0;
}
Tried Stack Overflow here but got no results, so I eventually just moved the RNG into the first for loop; you can see the commented-out code above what I moved. This increased MT scaling from ~2x on 1->4 threads to 3.9x. Now each thread has its own RNG (it was on the main thread before, meaning each thread had to query the main thread for a random value).

http://stackoverflow.com/questions/30944569/openmp-double-for-loop-array-with-stored-results
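One caveat worth flagging with per-thread generators (a minimal sketch, assuming OpenMP's omp_get_thread_num()): seeding every engine with time(0) alone gives all threads the same seed and therefore identical random streams, so mixing in the thread number keeps them distinct.

Code:
#include <omp.h>
#include <random>
#include <functional>
#include <ctime>
using namespace std;

int main()
{
    #pragma omp parallel
    {
        // distinct seed per thread; time(0) alone would hand every thread
        // the same seed and thus the same sequence of "random" values
        mt19937::result_type seed = time(0) + omp_get_thread_num();
        auto real_rand = bind(uniform_real_distribution<float>(-1,1), mt19937(seed));
        // ... per-thread particle loop using real_rand() ...
    }
}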

I have redone the bench to work on private x, y, z variables that, at the end of the loop, place their final values at the appropriate position in the array. This helps scaling. Note that the reassignment to the array is what is holding back perfect scaling; I will have to figure out a better way to do this, as reduction does not work for arrays.

Using GCC 4.9.2 (seems to be about 5% faster) with -O2 and -std=c++11 enabled. Throwing on the SSE and AVX compiler switches gains about 10-15% performance.

The RNG is using up a lot of CPU time. I erased the contents of the second for loop and replaced them with the following (getting rid of the array and other variables as well):

Code:
for (counter=0;counter<steps;counter++)
{
    x=real_rand();
    y=real_rand();
    z=real_rand();
    non+=x+y+z;
    cumulative++; //loop counter
}
Gives ~62M random numbers/sec (1 thread), i.e. 31M particle loops/sec at two random numbers per loop. Since the full benchmark executes ~9M particle loops per second, the RNG alone accounts for roughly 9/31 ≈ 29% of the total CPU time.

Now I will optimize some more. First, here are some preliminary results. Note that while its scaling is worse than my new benchmark's, Cutress' test performs far better (140M vs. 58M steps/sec for 8T). I cannot get the performance of Dr. Cutress' test at thread counts other than 1 and 8, because the trig subsection behaves quite differently from the other tests and 3DPM is an aggregate of 6 tests.

[Attached images: preliminary benchmark scaling charts]


The 8T results are the best of several runs, as I have a lot of other stuff running. The 8T tests also do NOT run the CPU at 100%; looking at Task Manager, I typically see utilization jumping between 90-96%. I will try to fix this later.
 

DrMrLordX

Lifer
Question . . . how is the last build of my Java implementation (6162015) running on your 3630qm? I am curious how the optimizations I made affect performance on your CPU.
 

Enigmoid

Platinum Member
Question . . . how is the last build of my Java implementation (6162015) running on your 3630qm? I am curious how the optimizations I made affect performance on your CPU.

Mode 3 (102400 steps) takes 11.4 seconds MT and 53.2 seconds ST (89M steps/sec for MT).

C++ (Enigmoid) takes 17 seconds MT and 114 seconds ST at the same settings.

Looks like it's faster than my C++ implementation, but I haven't pulled any optimization steps yet.
 

DrMrLordX

Lifer
Thanks! So I've got it up to ~89.8M particle movements/second on your 3630qm. Still not as good as Dr. Cutress' implementation on Ivy Bridge. Hmm! I wonder why my Java implementation is slower on Ivy Bridge but faster on Kaveri . . .

I'm gonna tinker with your MT code and see if I can do something interesting with it.
 

bronxzv

Senior member
That cast is quite expensive

Nope, don't worry: a good compiler will use a single-clock reciprocal-throughput instruction such as vcvtps2pd (for AVX targets), and you do it only once per >10000 iterations (each with sin() and cos()). I see that it's no longer a full checksum (x only) in your new version, though - why? *Trust me*, one can spend days on "successful" optimizations producing nice speedups but wrong results when seriously tested later. A cornerstone of code optimization (and refactoring in general) is *automated regression testing* - and obviously to optimize true hotspots first, not imaginary hotspots.

Now, if you care about good performance, I'll suggest replacing pos[10000][3] with the more sensible (vectorizer-friendly) layout pos[3][10000], or the more classical const size_t size = 10000; float posx[size], posy[size], posz[size]; (in this respect your code is a regression from DrMrLordX's example with a hybrid SoA layout). Then use something like SVML for a vectorized cos(), and don't forget my advice to use a sqrt() expression for the sin(): http://forums.anandtech.com/showpost.php?p=37478266&postcount=55

IMO, the goal should be to process N*8 particles (N = amount of unrolling) per loop iteration for AVX targets, if you want something fast that can really claim to be "floating point intensive".

I predict a >10x speedup from this advice vs. a scalar version using a scalar sin/cos/rand library.
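A minimal sketch of the suggested layout plus the sqrt() expression for the sine (the comparison against pi is one assumed way to restore the sign the square root discards):

Code:
  [...]
  const size_t size = 10000;
  float posx[size], posy[size], posz[size]; // SoA: one contiguous array per component
  [...]
  // per-step update for particle i:
  const float c = cosf(alpha);              // single transcendental call
  float s = sqrtf(1.0f - c*c);              // |sin(alpha)| recovered algebraically
  if (alpha > 3.141592654f) s = -s;         // sin is negative on (pi, 2*pi)
  posx[i] += r*c;
  posy[i] += r*s;
  posz[i] += fabsf(newz);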
 

DrMrLordX

Lifer
A few thoughts:

The line

Code:
z += abs(newz);

could probably be replaced with:

Code:
z = abs(z + newz);

to allow negative movement along the z-axis without the running total of z ever becoming negative. I tried this in my singlethreaded C++ h4x of Enigmoid's code, and it only slowed it down a little bit.

I also tried to copy-pasta Enigmoid's OpenMP code. It worked, but it was kinda slow; scaling between 2 threads and 4 threads was not optimal. I know how to put together and use a thread pool in Java, but I don't really know how to do it in C++ (using Win32 threading APIs in VS or otherwise), so I can't be too critical. Regardless, I think breaking the workload down into smaller workers and orchestrating their execution by way of a thread pool would probably work out better than trying to get OpenMP to do it.

Finally, I did try replacing the large two-dimensional array with a struct in the singlethreaded code. I think I did it wrong somehow, since the inline function I defined inside the struct to sum the total of all three member arrays kept producing a final non value of -1.#NAN or something like that. Regardless, the struct had three arrays (one each for x, y, and z) as members, and each array had 8 entries. Then I created an array of 1250 structs. It slowed my code down from ~70s for 102400 steps to ~75-76 seconds for the same. Not a major speed penalty, but still slower than the big array. It might just have been due to my code being crap.
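For reference, a minimal sketch of the layout described (names hypothetical; one plausible source of the -1.#NAN is uninitialized member arrays, which the = {} below avoids):

Code:
struct ParticleBlock {                    // hybrid SoA: 8 particles per block
    float x[8], y[8], z[8];
    float sum() const {                   // total of all three member arrays
        float s = 0.0f;
        for (int i = 0; i < 8; ++i) s += x[i] + y[i] + z[i];
        return s;
    }
};
ParticleBlock blocks[1250] = {};          // 1250 * 8 = 10000 particles, zero-initialized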
 

Enigmoid

Platinum Member
Remember, Kaveri has only two FPUs, and this is running x87 FP code.

Unrolled the loops and enabled AVX and the like. Now getting 10.5M ST and 33M at 4T (down from 34-35M), but no improvement at 8T. Need to tackle the trig functions.
 

Enigmoid

Platinum Member
Some more testing. I've been trying to get an AVX trig library, but no matter what, I keep getting errors or suffering from a lack of documentation (I am no expert programmer, and I'm not good at digging through huge interdependent AVX/SSE libraries and getting stuff properly linked and built).

I've gotten ICC installed and running through VS (which is a steaming pile of crap from head to toe compared to Code::Blocks at getting basic code out the door). The RNG does NOT function well on VS with the Microsoft compiler: it is about 1/10 the speed (7M numbers/s) of GCC's, and slower than the entire particle-movement loop (8M/s) on GCC. DO NOT compile with the VS Microsoft compiler - you must fix the RNG first.

Notes on the RNG: adding the SSE and AVX extensions speeds up the RNG greatly on GCC (looks like SIMD support is built in). ICC shows little gain from AVX for strictly the RNG.

Scores for 8T at this time are roughly 60M for GCC, 22M for MS Visual C++, and 57M for ICC with no optimizations beyond -O2 (87M with AVX optimizations). ICC with AVX gives ~23M ST.

At this point, with ICC, I am level with the Java version 6162015 (slightly faster ST, slightly slower MT). I have not touched the stock trig functions. Loops are unrolled 8x.

I'm really thinking VS is not the way to go. It only has OpenMP 2.0 support, which may be hurting things (GCC has 4.0 and ICC has 3.0).

Finally said screw it and am trying to get Intel MKL linked, which so far is not happening - Intel is so stringent about integration into VS, yet it cannot automatically integrate when I call a header. This is going to bias the code, so if I get it working I will post results and then attempt something open-source.

IMO: lol at ICC and AMD. Intel doesn't compile properly for AMD, and AMD's AVX math libraries require FMA4, which Intel CPUs don't have.

http://www.agner.org/optimize/blog/read.php?i=209&v=t

Sums up cos calculations and libraries nicely. Honestly, I do not see a significant difference between 112 and 108 cycles, as that is basically one add or mul operation. I can compile for static targets as well (all-generic SSE or AVX).
 

bronxzv

Senior member
At this point, with ICC, I am level with the Java version 6162015 (slightly faster ST, slightly slower MT). I have not touched the stock trig functions. Loops are unrolled 8x.

Are you sure SVML is used for the trig funcs? You should see stuff such as call __svml_cosf8 in the ASM dump.
 

Enigmoid

Platinum Member
Are you sure SVML is used for the trig funcs? You should see stuff such as call __svml_cosf8 in the ASM dump.

I have not touched the stock trig functions; SVML is not used. I am trying to figure out how to get it enabled. IMO, it looks like ICC is using Intel's stock trig functions, which are faster than GCC's but still not vectorized (see the Agner Fog link).
 

bronxzv

Senior member
I have not touched the stock trig functions; SVML is not used.

It is used automatically by the ICC vectorizer once you have a data layout amenable to vectorization:

Code:
#include <cmath>   // sin()
#include <cstddef> // size_t
using namespace std;

void SinTest (const float * __restrict x, float * __restrict y, size_t size)
{
  __assume_aligned(x,32);
  __assume_aligned(y,32);
  for (int i=0; i<size; i++) y[i] = sin(x[i]);
}


Code:
; mark_description "Intel(R) C++ Intel(R) 64 Compiler XE for applications running on Intel(R) 64, Version 15.0.2.179 Build 20150";
 
; [...]
 
.B1.4:: ; Preds .B1.14 .B1.3
L4:: ; optimization report
; LOOP WAS VECTORIZED
; VECTORIZATION SPEEDUP COEFFECIENT 9.953125
$LN24:
$LN25:
call __svml_sinf8 ;538.10
$LN26:
; LOE rbx rbp rsi rdi r12 r14 r15d xmm6 xmm7 xmm8 xmm9 xmm10 xmm11 xmm12 xmm13 xmm14 xmm15 ymm0
.B1.14:: ; Preds .B1.4
$LN27:
add r15d, 8 ;76.3
$LN28:
vmovups YMMWORD PTR [rsi+r14*4], ymm0 ;76.30
$LN29:
add r14, 8 ;76.3
$LN30:
cmp r15, rdi ;76.3
$LN31:
jb .B1.4 ; Prob 82% ;76.3
 