3DPM: Can we fix it? Yes we can!


bronxzv

Senior member
Jun 13, 2011
460
0
71
I may write this, I may not. I was really just enjoying the review process.

it's not clear (IMHO) from your description what you are aiming at:

1) processing fewer particles
2) processing fewer steps per particle
3) faster steps
4) other

some pseudo-code would probably help us understand what you are trying to explain with these rather strange and (to me) irrelevant analogies
 

bronxzv

Senior member
Jun 13, 2011
460
0
71
- Source code is lost.

too bad, that explains why it wasn't so easy to fix the supposed cache issues I raised (hint: the out-of-this-world speedup with HT enabled)

anyway thank you for your answer, too bad I don't have free time these days to post a fully vectorized solution
 

Schmide

Diamond Member
Mar 7, 2002
5,745
1,036
126
it's not clear (IMHO) from your description what you are aiming at:

1) processing fewer particles
2) processing fewer steps per particle
3) faster steps
4) other

The problem here, as it exists, is 3D particle generation. The less time you spend on this, the more time you can spend elsewhere. So...

1) Number of particles is up to later parts of the simulation. I am only focusing on generating random particles.
2) I'm not sure 'steps' is the right word. In general, the goal is to produce more throughput and parallelism.
3) See 2

some pseudo-code would probably help us understand what you are trying to explain with these rather strange and (to me) irrelevant analogies

I have listed the operations in detail. It would be hard to write pseudo-code for it, as it would span more than one routine. I may try, but at this level of complexity you generally just code it after listing the operations, as I did.

If you have a question on any specific point, I will explain it to the best of my ability.
 

bronxzv

Senior member
Jun 13, 2011
460
0
71
1) Number of particles is up to later parts of the simulation. I am only focusing on generating random particles.

I don't get it, still can't tell if you have more/less/the same number of particles as the original, i.e. the number of distinct positions at the end of the simulation

2) I'm not sure 'steps' is the right word

well, this is the common word for random walks and, more generally, for all numerical simulations; anyway, it's the one used by borandi in the code discussed here


I don't get it

unless I can see some *pseudo code* I'll say you just wrote some *pseudo scientific* gibberish
 

DrMrLordX

Lifer
Apr 27, 2000
22,948
13,035
136
DrMrLordX - did you delegate the particle generation inside the MT loop, to be created and destroyed within the thread rather than an overriding array that keeps track of data?

I generate the particles in groups of eight, using the Particles class, which happens before the creation of a thread pool and the spawning of worker threads. The Particles class does little more than store three single-dimension arrays with 8 entries each (one array for x, one for y, one for z). It also has a method, addvalues, that updates the x, y, and z positions of each particle by accepting three arrays (8 entries each) and summing accordingly:

Code:
public void addvalues(float[] xvalues, float[] yvalues, float[] zvalues)
	{
		int i = 0;

		do
		{
			xpos[i] += xvalues[i];                    // step particle i in x
			ypos[i] += yvalues[i];                    // ... and in y
			zpos[i] = Math.abs(zpos[i] + zvalues[i]); // z reflects off the 0 boundary

			++i;
		}
		while (i < 8);                                // 8 particles per Particles object
	}

The Organize class creates 1250 Particles objects, stores them in an ArrayList, and assigns each one to an instance of ParticleMovement. Each instance of ParticleMovement is created as a worker to be handled by the thread pool created via ExecutorService in the Organize class.

In theory, I could create the particles local to each thread. The Particles class just makes it easy to retain data after the threads are closed and the work is complete, so I can write it to a file and prevent the JVM from flagging anything important as dead code. It also helped me to avoid a massive array (okay, there is a huge ArrayList but whatever). Big arrays, in my experience, slow Java to a crawl. Probably has to do with how data fits into cache and all that.

The downside to using the Particles class is that it makes 3DPMRedux a memory hog compared to the original 3DPM. There also seems to be a memory leak (or no gc is going on, or something), since repeated iterations of the benchmark during a single execution result in successively greater memory consumption.
 

Schmide

Diamond Member
Mar 7, 2002
5,745
1,036
126
I don't get it, still can't tell if you have more/less/the same number of particles as the original, i.e. the number of distinct positions at the end of the simulation

OK, it's the same number of particles. The reality is that any random number generation scheme can work with any number of particles.

This part of 3DPM is not a simulation. It is nothing more than a unit-sphere 3D point generator with a 4/pi z bias. Intentional or not, it is as stated in my signature.

well, this is the common word for random walks and, more generally, for all numerical simulations; anyway, it's the one used by borandi in the code discussed here

I understand the simulation, what Brownian motion is, how Monte Carlo methods can be used to simulate and verify Fick's laws of diffusion, and so on.

If I get to the point of producing and testing it, I'll run it through a simulation just to quantify the algorithm. Any random number generator must be statistically validated to prove its worth.
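
A cheap first test along those lines (an untested sketch of mine, not part of the project): points uniform on a unit sphere have z uniform on [-1, 1] (Archimedes' hat-box theorem), so the z histogram of any candidate generator has to come out flat.

Code:
// necessary-condition check for a unit-sphere point generator: histogram z.
#include <cmath>
#include <cstdio>
#include <random>

int main()
{
   std::mt19937 rng(12345);
   std::uniform_real_distribution<float> u(-1.0f, 1.0f);
   const int N = 1000000, BINS = 10;
   int hist[BINS] = { 0 };
   for (int n = 0; n < N; n++) {
      // stand-in generator under test: the trig method from this thread
      float z = u(rng);
      float alpha = (u(rng) + 1.0f) * 3.14159265f;  // azimuth in [0, 2*pi)
      float x = std::sqrt(1.0f - z * z) * std::cos(alpha);
      float y = std::sqrt(1.0f - z * z) * std::sin(alpha);
      (void)x; (void)y;                             // only z matters for this test
      int b = (int)((z + 1.0f) * 0.5f * BINS);      // map [-1,1] onto bins 0..9
      if (b == BINS) b = BINS - 1;                  // guard the z == 1 edge
      hist[b]++;
   }
   for (int b = 0; b < BINS; b++)
      std::printf("bin %d: %d (expect ~%d)\n", b, hist[b], N / BINS);
   return 0;
}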

I don't get it

unless I can see some *pseudo code* I'll say you just wrote some *pseudo scientific* gibberish

I don't think you're trying. When you call what I wrote "*pseudo scientific* gibberish" you are diminishing those who do not code but design algorithms. I do both, but would never diminish those who are good at one side or the other. My logic is sound; if you wish to pick it apart, please do so, but please don't just sit back and throw stones. It's not cool.
 

bronxzv

Senior member
Jun 13, 2011
460
0
71
Sorry I misquoted your insult. Edit: Fixed.

neat! thank you

a good example of pseudoscientific arguments IMHO is your analogy with "DFT vs FFT" computational complexity: it makes it sound as if your proposal is somewhat less "brute force" than the original

your solution actually looks more complex than the original and less amenable to vectorization and branch elimination, i.e. more complex and slower, a lose-lose situation
 

DrMrLordX

Lifer
Apr 27, 2000
22,948
13,035
136
Ultimately, a lightweight PRNG with a period of 2^100 should be sufficient here.

I totally forgot to respond to that. Oops! Sorry.

I have read that it is "best practice" to make sure that the period of the PRNG is at least the square of the number of randoms that will need to be generated by any one instance of the PRNG. In this case, SplittableRandom would appear to suffice, since it has a period of "at least 2^64". 3DPMRedux is set up to create a separate instance of SplittableRandom - that is, a separate PRNG - for each worker (there are 1250 of them). Assuming the user selects 1024000 steps, each worker should only create 16384000 successive randoms, so a period as low as ~2.68 x 10^14 (16384000 squared) would satisfy the rule, and 2^64 (~1.8 x 10^19) comfortably exceeds that.
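
For the record, the arithmetic (a throwaway sketch of mine, not part of 3DPMRedux):

Code:
// square-of-the-count rule of thumb, checked numerically
#include <cmath>
#include <cstdio>

int main()
{
   const double randomsPerWorker = 16384000.0;                  // at 1024000 steps
   const double required = randomsPerWorker * randomsPerWorker; // ~2.68e14
   const double available = std::pow(2.0, 64.0);                // ~1.84e19
   std::printf("required >= %.3e, period ~ %.3e -> %s\n",
               required, available, available >= required ? "OK" : "too short");
   return 0;
}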

Now, the "best practice" about which I read may have no relevance to the issue at hand, or may be otherwise inappropriate when attempting to model particle movements, so take that with a grain of salt. I will note that SplittableRandom is not suitable for crypto.

edit: it also looks as though I may be misusing SplittableRandom somewhat... it's meant to be split to create new instances of SplittableRandom. Generating new instances per thread, within the thread, seems to be mechanically out of step with that intention. Not sure if it's functionally different that way.
 

Schmide

Diamond Member
Mar 7, 2002
5,745
1,036
126
neat! thank you

a good example of pseudoscientific arguments IMHO is your analogy with "DFT vs FFT" computational complexity: it makes it sound as if your proposal is somewhat less "brute force" than the original

your solution actually looks more complex than the original and less amenable to vectorization and branch elimination, i.e. more complex and slower, a lose-lose situation

The science and the analogy are sound. A DFT is O(N^2); an FFT is O(N log N). You can write a DFT in a few lines, while an FFT takes an order of magnitude more. One does every calculation for every point; the other shares calculations.
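
For reference, here's the "few lines" version (a generic sketch of mine; the double loop is where the O(N^2) lives):

Code:
// naive DFT: N outputs, each touching all N inputs -> O(N^2)
#include <complex>
#include <vector>

std::vector<std::complex<double>> dft(const std::vector<std::complex<double>>& in)
{
	const double PI = 3.14159265358979323846;
	const size_t N = in.size();
	std::vector<std::complex<double>> out(N);
	for (size_t k = 0; k < N; k++)       // N outputs...
		for (size_t n = 0; n < N; n++)   // ...each a sum over all N inputs
			out[k] += in[n] * std::polar(1.0, -2.0 * PI * double(k * n) / double(N));
	return out;
}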

The Spherical Vector Distributions version is less of a brute-force method. It was what spurred my thoughts on how to keep a point on a sphere without using transcendentals or sqrt.

You look at what I put forth and think it's big, and therefore could in no way compete with a simple few-line program. It may not compete. But it should be fully vectorizable and, for the most part, linear. I think it has merit, especially if I can keep the wobbles under 40 ops.

I really don't mind being wrong, but being called pseudoscientific is flat-out wrong and insulting. What I am putting forth is pure science. For you learn just as much in science from your failures as you do from your successes.
 

bronxzv

Senior member
Jun 13, 2011
460
0
71
The science and the analogy are sound. A DFT is O(N^2); an FFT is O(N log N).

I don't see what this has to do with the problem at hand, whose complexity is O(M*N) with M particles and N steps, for both the original and your proposal

now, maybe some pseudo-code will convince me that I'm wrong?
 

Schmide

Diamond Member
Mar 7, 2002
5,745
1,036
126
I don't see what this has to do with the problem at hand, whose complexity is O(M*N) with M particles and N steps, for both the original and your proposal

I'll concede this point.

But consider also the cost of producing one n-tuple. If each part is calculated as a separate costly operation, it's O(M*N): M items, N expansions. If the relationship is joined procedurally, as in the Spherical Vector Distributions, it's closer to O(M log N).

now, maybe some pseudo-code will convince me that I'm wrong?

Fine, but don't expect it tonight. I may have time tomorrow.
 

bronxzv

Senior member
Jun 13, 2011
460
0
71
If the relationship is joined procedurally, as in the Spherical Vector Distributions, it's closer to O(M log N).

I don't see how you can reach this conclusion (log N instead of N) based on the spherical vector distribution method, which has an extra inner loop for each step with a hard-to-predict termination branch, IIRC the example discussed here the other day: http://forums.anandtech.com/showpost.php?p=37483197&postcount=63
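
for reference, that rejection loop looks roughly like this (a generic sketch of the technique, not the exact code from that post):

Code:
// rejection sampling for a uniform random unit vector: draw in the cube
// [-1,1]^3 and retry until the point lands inside the unit ball, then
// normalize. the retry loop is the hard-to-predict branch: the ball fills
// pi/6 ~ 52% of the cube, so roughly every other draw is rejected.
#include <cmath>
#include <random>

template <class Rng>
void randomUnitVector(Rng& rng, float& x, float& y, float& z)
{
    std::uniform_real_distribution<float> dist(-1.0f, 1.0f);
    float s;
    do {
        x = dist(rng); y = dist(rng); z = dist(rng);
        s = x * x + y * y + z * z;
    } while (s > 1.0f || s == 0.0f);  // termination branch
    const float inv = 1.0f / std::sqrt(s);
    x *= inv; y *= inv; z *= inv;
}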

I may be wrong but it looks like you missed my "branch miss more costly than polynomial evaluation" argument in this old post of mine

now, you can get log N instead of N with tables of precomputed values prefiltered into a pyramid, as in the MipMap analogy you raised, but that cuts off all high frequencies, so it is arguably not equivalent to the example discussed here for high particle counts (typical in the field, as per borandi's explanation), i.e. high sampling frequencies; anyway, this is orthogonal to the choice of method for generating the unit vector at each step, which is the only area for which you have provided details as far as I can tell
 

Schmide

Diamond Member
Mar 7, 2002
5,745
1,036
126
I stayed home today and made a pass at it using Enigmoid's template. It took a few hours and, to be honest, it really doesn't do that well: probably a third as fast as the base code, so where Enigmoid's code would get 43, the code I just wrote would get 16. Looking at the way it finally fell together, I really don't see it competing in any way. To avoid branches I made tables; in doing so, a fair amount of the execution time shifted to either retrieving many randoms or parsing a random into pieces to index the tables.

Code:
// partrand.cpp : Defines the entry point for the console application.
//
#include "stdafx.h"

#define _USE_MATH_DEFINES
#include<iostream>
#include<cmath>
#include<cstdio>
#include<ctime>
#include<random>
#include<functional>
#include<omp.h>
using namespace std;

#define TABLESIZES 128 
#define INDEXSIZE 7
float vectorX[3][TABLESIZES][TABLESIZES];
float vectorY[3][TABLESIZES][TABLESIZES];
float vectorZ[3][TABLESIZES][TABLESIZES];
float inversionTable[3][8];
int swapTable[3][8] = { { 0, 0, 1, 1, 2, 2 }, { 1, 2, 0, 2, 0, 1 }, { 2, 1, 2, 0, 1, 0 } }; // the six axis permutations; slots 6-7 of each row stay zero-initialized, so swap = 6 or 7 degenerates to (0, 0, 0)
float wobbleFactors[2][TABLESIZES];

int _tmain(int argc, _TCHAR* argv[])
{
   int threadnum;
   cout << "Enter the number of threads: 1, 2, 4, 8\n";
   cin >> threadnum;


   clock_t start;
   double duration;
   start = clock();

   int steps = 1024000, cumulative = 0, counter, dummyval = 10000;
   float x, y, z;

   float pos[10000][3] = { 0 };
   float non = 0;//prevent nothing from happening!


   // build the tables
   for (int i = 0; i < TABLESIZES; i++) {
      vectorX[0][0][i] = vectorX[1][0][TABLESIZES - 1 - i] = cos(M_PI * ((double)i) / ((double)TABLESIZES)); // equator arc
      vectorY[1][0][i] = 1.0; // top point
   }
   for (int i = 1; i < TABLESIZES; i++) {
      for (int j = 0; j < TABLESIZES; j++) {
         vectorX[0][i][j] = vectorX[0][0][2 * i] * vectorX[0][0][j]; // cos2 cos
         vectorX[1][i][j] = vectorX[0][0][2 * i] * vectorX[1][0][j]; // cos2 sin
         vectorX[2][i][j] = vectorX[1][0][2 * i]; // sin2
      }
   }
   // y vector all transposed x vector reverse order (-y, x)
   for (int i = 1; i < TABLESIZES; i++) {
      for (int j = 0; j < TABLESIZES; j++) {
         vectorY[0][i][j] = -vectorX[1][TABLESIZES - 1 - i][j];
         vectorY[1][i][j] = vectorX[0][TABLESIZES - 1 - i][j];
         vectorY[2][i][j] = vectorX[2][TABLESIZES - 1 - i][j];
      }
   }
   // z vector all transposed x zero vector ( y, -x)
   for (int i = 0; i < TABLESIZES; i++) {
      for (int j = 0; j < TABLESIZES; j++) {
         vectorZ[0][i][j] = vectorX[1][0][j];
         vectorZ[1][i][j] = -vectorX[0][0][j];
      }
   }

   // for flipping signs: one bit per axis
   for (int i = 0; i < 8; i++) {
      inversionTable[0][i] = i & 1 ? -1.0 : 1.0;
      inversionTable[1][i] = i & 2 ? -1.0 : 1.0;
      inversionTable[2][i] = i & 4 ? -1.0 : 1.0; // row 2, not 3: the array is [3][8] and the step loop reads inversionTable[2][neg] for z
   }

   for (int i = 0; i < TABLESIZES; i++) {
      wobbleFactors[0][i] = cos(((double)M_PI) * ((double)i)) / (((double)TABLESIZES) *((double)TABLESIZES));
      wobbleFactors[1][i] = 1.0 / sqrt(1.0 + wobbleFactors[0][i] * wobbleFactors[0][i]);
   }

//   std::srand(std::time(0)); // use current time as seed for random generator
//   int random_variable = std::rand();


   /*mt19937::result_type seed = time(0);
   auto real_rand = bind(uniform_real_distribution<float>(-1,1),mt19937(seed));*/
   int arcx, arcy, swap, neg, wobblex, wobbley, wobbleSignx, wobbleSigny, localrand;
#pragma omp parallel private(arcx, arcy, swap, neg, wobblex, wobbley, wobbleSignx, wobbleSigny,counter,x,y,z)  reduction (+: non, cumulative) num_threads(threadnum)
   {
      for (int dummy = 0; dummy<dummyval; dummy++)
      {

                  mt19937::result_type seed = time(0); // note: time(0) seeds every thread identically
                  auto int_rand = bind(uniform_int_distribution<>(), mt19937(seed));
         //         mt19937::result_type seed = time(0);
         //         auto int_rand = bind(uniform_real_distribution<float>(-1, 1), mt19937(seed));

#pragma omp for
         for (counter = 0; counter<steps; counter++)
         {
            //int x=0,y=0,z=0; //individual x,y,z declared for each thread loop
            x = 0, y = 0, z = 0;

            localrand = int_rand();                             // one random int, sliced into fields below
            arcx = localrand&((1 << INDEXSIZE) - 1);            // bits 0-6
            arcy = (localrand >> 8)&((1 << INDEXSIZE) - 1);     // bits 8-14
            wobblex = (localrand >> 16)&((1 << INDEXSIZE) - 1); // bits 16-22
            wobbley = (localrand >> 24)&((1 << INDEXSIZE) - 1); // bits 24-30
            wobbleSignx = (localrand >> 30) & 1;                // overlaps wobbley's top bit
            wobbleSigny = (localrand >> 31) & 1;                // always 0: uniform_int_distribution<>() defaults to [0, INT_MAX]
            swap = (localrand >> 7)&((1 << 3) - 1);             // bits 7-9, overlapping arcy
            neg = (localrand >> 12)&((1 << 3) - 1);             // bits 12-14, overlapping arcy

            x += (((vectorX[swapTable[0][swap]][arcx][arcy] +
               vectorY[swapTable[0][swap]][arcx][arcy] * wobbleFactors[0][wobblex]) * wobbleFactors[1][wobblex] * inversionTable[0][wobbleSignx] +
               vectorZ[swapTable[0][swap]][arcx][arcy] * wobbleFactors[0][wobbley]) * wobbleFactors[1][wobbley] * inversionTable[0][wobbleSigny]) * inversionTable[0][neg];
            y += (((vectorX[swapTable[1][swap]][arcx][arcy] +
               vectorY[swapTable[1][swap]][arcx][arcy] * wobbleFactors[0][wobblex]) * wobbleFactors[1][wobblex] * inversionTable[0][wobbleSignx] +
               vectorZ[swapTable[1][swap]][arcx][arcy] * wobbleFactors[0][wobbley]) * wobbleFactors[1][wobbley] * inversionTable[0][wobbleSigny]) * inversionTable[1][neg];
            z += abs((((vectorX[swapTable[2][swap]][arcx][arcy] +
               vectorY[swapTable[2][swap]][arcx][arcy] * wobbleFactors[0][wobblex]) * wobbleFactors[1][wobblex] * inversionTable[0][wobbleSignx] +
               vectorZ[swapTable[2][swap]][arcx][arcy] * wobbleFactors[0][wobbley]) * wobbleFactors[1][wobbley] * inversionTable[0][wobbleSigny]) * inversionTable[2][neg]);

            cumulative++; //make sure it at least enters every loop
         }
         pos[dummy][0] = x;
         pos[dummy][1] = y;
         pos[dummy][2] = z;
         non += pos[dummy][0];
      }
   }

   duration = (clock() - start) / (double)CLOCKS_PER_SEC;
   cout << "dummy value is " << non;
   cout << "\nDuration is " << duration << "\n";
   cout << "A total of " << cumulative << " steps were taken.\n";
   cout << "Score (M steps/sec) is " << cumulative / (duration * 1000000) << ".";
   cout << "Press any key to continue\n";
   cin >> threadnum;
   return 0;
}

My original thought was to do each step in order then parse out the results, but after refining the complexities of the algorithm I doubt it could compete.

I also expanded the tables more than I wished to just to get the code functional. Most of the data within them is redundant but I doubt any cleanup in that area will lead to any improvements.

So I'm most likely closing this branch of the project.
 

bronxzv

Senior member
Jun 13, 2011
460
0
71
I stayed home today and made a pass at it using Enigmoid's template. It took a few hours and, to be honest, it really doesn't do that well: probably a third as fast as the base code, so where Enigmoid's code would get 43, the code I just wrote would get 16. Looking at the way it finally fell together, I really don't see it competing in any way. To avoid branches I made tables; in doing so, a fair amount of the execution time shifted to either retrieving many randoms or parsing a random into pieces to index the tables.



thank you for taking the time to write it down, this is clearly O(M*N), not O(M log N)

vectorizing it will use a lot of serially dependent gathers, which will be quite slow, even on Broadwell
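
for illustration, each table lookup in vector code becomes a gather; eight lookups at once look like this under AVX2 (a generic sketch, not code from this thread):

Code:
#include <immintrin.h>

// eight parallel table lookups in one AVX2 gather. the instruction helps,
// but when the indices are themselves produced by earlier vector work (as
// with the bit-sliced random fields above), the chain of gathers stays
// serially dependent and latency-bound.
__m256 gather8(const float* table, __m256i idx)
{
   return _mm256_i32gather_ps(table, idx, 4); // scale = sizeof(float)
}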

it reminds me a bit of my legacy code for Perlin noise, which sucks at vectorization
 

DrMrLordX

Lifer
Apr 27, 2000
22,948
13,035
136
I know this topic is slipping into necroland, but I finally got around to running 3DPMRedux 7122015 on an AMD E1-2500.

Original 3DPM (Stage 1): 14.7194M particle movements per second

3DPMRedux:
Build: 7122015
1024000 steps completed.
Mode: multi-threaded

It took 421346 milliseconds to complete the workload.
(24.303066M steps per second)
It took 343 milliseconds to write the output file.

So, while the speedup isn't as impressive on Jaguar as it is on Kaveri, it's still a 65% increase in speed.
 

Essence_of_War

Platinum Member
Feb 21, 2013
2,650
4
81
I have some anec-data from some particle pushing code of my own.

It's single-threaded, but decently vectorized, Matlab/Octave code designed to push particles in a 2D plane with a known electromagnetic field derived from a Vlasov simulation of particle distribution functions. It's not, strictly speaking, self-consistent, but it's a useful diagnostic. I use it with Matlab 2014, and with Octave from the Ubuntu repos (3.8.2). There is no stochastic component to the motion once initial velocities and positions are seeded, because I'm simulating a plasma where the particle density is so low, and the time scale so fast, that collisions just do not happen.

C2D E8400 (3.0 GHz) - Matlab - ~107,000 particle pushes per second
C2D E8400 (3.0 GHz) - Octave - ~24,000 particle pushes per second

Haswell i5 (3.2-3.6 GHz) - Matlab - ~270,000 ppps
Haswell i5 (3.2 -3.6 GHz) - Octave - ~63,000 ppps

I'm hoping to test it on a first-gen i5 and a 3rd gen i7 today as well.
 

DrMrLordX

Lifer
Apr 27, 2000
22,948
13,035
136
Interesting. Do you want anyone else to test your code? And are you interested in publishing it for others to see?
 

Essence_of_War

Platinum Member
Feb 21, 2013
2,650
4
81
I'd be happy to. I'd need to make some modifications so that it is usable without the associated electric field data files, which are quite large. If you don't mind it basically just modeling cyclotron motion of a thermal distribution of particles, I can just set the electric field data set to zeros.
 

DrMrLordX

Lifer
Apr 27, 2000
22,948
13,035
136
That's fine. None of us end-users will really know the difference, so long as you can avoid dead code removal (always a risk with compilers that optimize aggressively).

On a side note, I tried converting my most recent Java version of 3DPMRedux to 32-bit int code using low, nonsense numbers that were guaranteed not to cause overflows. The code is basically the same, with the primary exceptions being that:

a). the code no longer produces scientifically valid results, despite carrying out the same number of operations as before, and
b). the code is able to directly create random 32-bit integers instead of creating a random double which is then cast to a float thanks to the way SplittableRandom works

The results were surprising. The best performance I could get out of my A10-7700K @ 3.4 GHz (same settings as elsewhere in this thread) was ~75M movements per second. The fp code was faster on Kaveri. Hmm. Very interesting.
 

borandi

Member
Feb 27, 2011
138
116
116
Hi all,

Just an update from my end. I'm updating the benchmark for 2016 somewhat, rewriting it from scratch. I've laid out an outline and implemented the first algorithm (trig). Here are the salient points:


1) PRNG is still Ranq2 from Numerical Recipes. This has a period of ~8.5 x 10^37 and as far as I can tell is faster than common Mersenne Twisters.
2) Particles are now generated in thread.
3) The PRNG takes the particle number as the seed.
4) The results are taken at the end of the movement and summed. I've adjusted the data structure for the sum to avoid false sharing.
5) No Z-boundary. All particles start from 0,0,0. Thus, with a large enough sample, the average position should stay near 0,0,0.
6) Still using C++ and OpenMP. Moved to Visual Studio Community 2015, which comes with OpenMP 2.0.


To clarify as well: yes, I've simulated Fick's Laws in 2D implicit/explicit and 3D explicit. Both were interesting to do on GPUs. However, the research I was doing at the time was starting to focus on low concentrations of solutions in microchannels, hence the move from PDEs to parallel movement algorithms (which in this case was valid if the particles were non-interacting and electric fields were negligible).


Here's the code as it stands. Ran.h holds the PRNG, main has everything else. Currently set to run 1e6 particles with 1000 steps each.


Ran.h
Code:
struct Ranq2 {
	// Numerical Recipes' Ranq2: a 64-bit xorshift generator combined with
	// a multiply-with-carry generator; period ~8.5 x 10^37
	unsigned long long v, w;
	Ranq2(unsigned long long j) : v(4101842887655102017LL), w(1) {
		v ^= j;
		w = int64();
		v = int64();
	}
	inline unsigned long long int64() {
		v ^= v >> 17; // xorshift steps
		v ^= v << 31;
		v ^= v >> 8;
		w = 4294957665U * (w & 0xffffffff) + (w >> 32); // multiply-with-carry step
		return v^w;
	}
	inline double doub() { return 5.42101086242752217e-20 * int64(); } // 2^-64: maps to [0,1)
	inline unsigned int int32() { return (unsigned int)int64(); }
};

main.cpp
Code:
#include "stdafx.h"
#include "math.h"
#include <string>
#include "stdio.h"
#include <iostream>
#include "omp.h"
#include "windows.h"
#include "ran.h"

using namespace std;

#ifdef _DEBUG
static long particles = 20;
int steps = 2000;
#else
static long particles = 1e6;
int steps = 1000;
#endif

#define twopi 6.28318531f

int main()
{
	// thread detect and display
	int threads = omp_get_max_threads();
	cout << "Max Number of Threads = " << threads << endl;
	cout << "Max Number of Particles = " << particles << endl;
	cout << "Max Number of Steps = " << steps << endl;

	// final position calculation setup: one partial-sum slot per thread,
	// spaced 64 doubles (512 bytes) apart so threads never share a cache line
	double* sumx = new double[threads * 64];
	double* sumy = new double[threads * 64];
	double* sumz = new double[threads * 64];
	for (int i = 0; i < threads; i++) {
		sumx[64 * i] = 0.0;
		sumy[64 * i] = 0.0;
		sumz[64 * i] = 0.0;
	}

	// start timer
	double start = omp_get_wtime();





#pragma omp parallel for ordered
	for (int i = 0; i < particles; i++) {

		float x = 0.0;
		float y = 0.0;
		float z = 0.0;
		unsigned long seed = i;
		Ranq2 ran(seed);

		for (int j = 0; j < steps; j++) {

			float newz = 2.0f * ran.doub() - 1.0f; // z uniform on [-1,1]
			float alpha = ran.doub() * twopi;      // azimuth uniform on [0,2pi)
			float r = sqrtf(1 - newz*newz);        // radius of the z slice
			x += r*cosf(alpha);
			y += r*sinf(alpha);
			z += newz;


		}
		
		// calc location from 0,0,0

#pragma omp atomic
		sumx[omp_get_thread_num() * 64] += x;
#pragma omp atomic
		sumy[omp_get_thread_num() * 64] += y;
#pragma omp atomic
		sumz[omp_get_thread_num() * 64] += z;


			
	}


	double end = omp_get_wtime();

	cout.precision(15);

	for (int i = 1; i < threads; i++) {
		sumx[0] += sumx[64 * i];
		sumy[0] += sumy[64 * i];
		sumz[0] += sumz[64 * i];
	}

	double average = (double)steps * (double)( particles) / (end - start);
		cout << "x = " << sumx[0] / particles << endl;
		cout << "y = " << sumy[0] / particles << endl;
		cout << "z = " << sumz[0] / particles << endl;
		cout << "time = " << end - start << endl;
		cout << "average = " << average << " ops/sec" << endl;

		system("pause");
	return 0;

}

Flags: /O2 /fp:fast /openmp /Oi /GL /MD /arch:SSE2

(basically optimised for fast math and program speed).

As it stands I have the OpenMP timer in here, though that might not be wholly accurate, so I'll invoke system time instead (or as a double-check).
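
A minimal double-check against the C++11 steady clock might look like this (an untested sketch):

Code:
// cross-check omp_get_wtime() against std::chrono::steady_clock
#include <chrono>
#include <iostream>
#include <omp.h>

int main()
{
	double w0 = omp_get_wtime();
	auto t0 = std::chrono::steady_clock::now();

	volatile double sink = 0.0; // stand-in workload
	for (int i = 0; i < 100000000; i++) sink = sink + 1e-9;

	auto t1 = std::chrono::steady_clock::now();
	double w1 = omp_get_wtime();

	std::cout << "omp_get_wtime: " << (w1 - w0) << " s\n";
	std::cout << "steady_clock : "
	          << std::chrono::duration<double>(t1 - t0).count() << " s\n";
	return 0;
}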

At the minute it only does one pass of this, rather than 3DPM, which loops over each algorithm for 10 seconds or so and then takes the total movement.

I've only run 3DPM2 on my ASUS Zenbook so far, but the i7-4558U in here gets around 75 million ops/sec. The TRIG function in 3DPM currently gets around the same.

Obviously the move to larger blocks of particles should be considered, using wider instructions, although that would essentially mean building SSE, SSE2, SSE3 and other variants depending on processor support. This somewhat goes against the mantra of 'self-taught non-CompSci chemist writing code good enough to just work', but if there's an easy way to do it, I'm all ears.
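
One low-effort option might be runtime dispatch (an untested, MSVC-specific sketch; the kernel names are hypothetical stand-ins for the same loop compiled in separate files with different /arch flags):

Code:
// pick a kernel at runtime via cpuid (MSVC intrinsics)
#include <intrin.h>
#include <iostream>

bool hasSSE41()
{
	int regs[4];
	__cpuid(regs, 1);
	return (regs[2] & (1 << 19)) != 0; // ECX bit 19 = SSE4.1
}

int main()
{
	std::cout << (hasSSE41() ? "dispatch to the SSE4.1 kernel\n"
	                         : "dispatch to the SSE2 baseline\n");
	return 0;
}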

I also need to adjust the PRNG to output floats, not doubles.
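
Something like this would probably do (an untested sketch, a hypothetical addition to the struct, not final code):

Code:
// single-precision analogue of doub(): map int32() into [0,1).
// 2.3283064e-10f is 1/2^32.
inline float flt() { return 2.3283064e-10f * int32(); }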
 