3DPM: Can we fix it? Yes we can!


Schmide

Diamond Member
Mar 7, 2002
5,745
1,036
126
I notice you have a series of alphas that are shifted by half-pi and then recalculated. If you really wanted to speed up the cos calculation, you could transpose your loop and do a recurrence over smaller deltas, then randomize over those indexes.

Code:
int numberSteps = 128;

float theta = ...; // some small radian value
float tempReal = sin(0.5 * theta);
float deltaReal = -2.0 * tempReal * tempReal; // cos(theta) - 1
float deltaImag = -1.0 * sin(theta);
float baseReal = cos(theta * numberSteps);
float baseImag = sin(theta * numberSteps);

for (int j = 0; j < numberSteps; j++)
{
   (output) = baseReal;
   // recurrence: rotates (baseReal, baseImag) by -theta each step
   baseReal = (tempReal = baseReal) * deltaReal - baseImag * deltaImag + baseReal;
   baseImag = baseImag * deltaReal + tempReal * deltaImag + baseImag;
}
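A quick standalone Java check of this recurrence (my own harness, not from the post): since 1 + deltaReal + i*deltaImag = cos(theta) - i*sin(theta), each pass rotates (baseReal, baseImag) by -theta, so the j-th output should equal cos(theta * (numberSteps - j)):

```java
public class RotRecurrence {
    public static void main(String[] args) {
        int n = 128;
        double theta = 0.01; // hypothetical small angle in radians
        double tempReal = Math.sin(0.5 * theta);
        double deltaReal = -2.0 * tempReal * tempReal; // cos(theta) - 1
        double deltaImag = -Math.sin(theta);
        double baseReal = Math.cos(theta * n);
        double baseImag = Math.sin(theta * n);
        double maxErr = 0.0;
        for (int j = 0; j < n; j++) {
            // before the j-th update, baseReal should be cos(theta * (n - j))
            maxErr = Math.max(maxErr, Math.abs(baseReal - Math.cos(theta * (n - j))));
            double t = baseReal;
            baseReal = t * deltaReal - baseImag * deltaImag + baseReal;
            baseImag = baseImag * deltaReal + t * deltaImag + baseImag;
        }
        System.out.println(maxErr < 1e-9); // rounding error stays tiny over 128 steps
    }
}
```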
 
Last edited:

DrMrLordX

Lifer
Apr 27, 2000
22,967
13,062
136
one may also argue that it's not fair to write the code with 8x unrolling by hand as you do; how much performance does it buy you, btw?

As long as the code is doing essentially the same thing rolled vs. unrolled, I don't see it as being that big of a deal. Sometimes Java gives me no benefit from unrolling small loops, and sometimes it gives me a great deal. Sometimes it's slower that way (for very long loops). The JVM is a little bit odd in that regard.

In terms of what may be gained from reducing loop overhead, not much. Frequently the bytecode is similar/identical between a rolled and unrolled loop. What changes is whether or not the JVM chooses to autovectorize. Sometimes you have to hit it upside the head with a fully unrolled loop to get it to do that.
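For illustration, here's a hypothetical Java sketch of the rolled-vs-unrolled transformation being discussed (a plain array sum, not the actual 3DPM loop); note the 8x version silently assumes the length is a multiple of 8, which is part of why hand-unrolling hurts maintainability:

```java
public class UnrollDemo {
    // straightforward rolled loop
    static float sumRolled(float[] a) {
        float s = 0f;
        for (int i = 0; i < a.length; i++) s += a[i];
        return s;
    }

    // 8x hand-unrolled with four accumulators; assumes a.length % 8 == 0
    static float sumUnrolled8(float[] a) {
        float s0 = 0f, s1 = 0f, s2 = 0f, s3 = 0f;
        for (int i = 0; i < a.length; i += 8) {
            s0 += a[i]     + a[i + 4];
            s1 += a[i + 1] + a[i + 5];
            s2 += a[i + 2] + a[i + 6];
            s3 += a[i + 3] + a[i + 7];
        }
        return (s0 + s1) + (s2 + s3);
    }

    public static void main(String[] args) {
        float[] a = new float[1024];
        for (int i = 0; i < a.length; i++) a[i] = i * 0.001f;
        // results differ only by float rounding (different summation order)
        System.out.println(Math.abs(sumRolled(a) - sumUnrolled8(a)) < 0.5f);
    }
}
```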

I notice you have a series of alphas that are shifted by half-pi and then recalculated. If you really wanted to speed up the cos calculation, you could transpose your loop and do a recurrence over smaller deltas, then randomize over those indexes.

Code:
int numberSteps = 128;

float theta = ...; // some small radian value
float tempReal = sin(0.5 * theta);
float deltaReal = -2.0 * tempReal * tempReal; // cos(theta) - 1
float deltaImag = -1.0 * sin(theta);
float baseReal = cos(theta * numberSteps);
float baseImag = sin(theta * numberSteps);

for (int j = 0; j < numberSteps; j++)
{
   (output) = baseReal;
   // recurrence: rotates (baseReal, baseImag) by -theta each step
   baseReal = (tempReal = baseReal) * deltaReal - baseImag * deltaImag + baseReal;
   baseImag = baseImag * deltaReal + tempReal * deltaImag + baseImag;
}

I'll have to play with that some, thanks.
 

bronxzv

Senior member
Jun 13, 2011
460
0
71
As long as the code is doing essentially the same thing rolled vs unrolled, I don't see it as being that big of a deal.

it's a big deal in the real world for code maintenance. why don't you work in C++ with a mature vectorizer instead? the source code will be simpler and more readable, and the "fixed" benchmark really faster
 

DrMrLordX

Lifer
Apr 27, 2000
22,967
13,062
136
Unrolled loops do reduce code readability. I don't think it's a big deal with respect to the underlying hardware and whether or not it's properly supported. For me, ultimately, it's about changing 3DPM from being a dubious cache benchmark into a "legit" fp benchmark while keeping the underlying functionality as close to the original as possible. Whether or not the code looks like a dog's breakfast is of secondary importance.

I haven't touched C/C++ for this project yet 'cuz, honestly, I'm pretty bad at it (worse than I am in Java programming anyway). It would be a good opportunity for me to get better.

That aside . . .

Here is build 6132015:

https://www.dropbox.com/s/a7j864pdnu30gld/3DPMRedux6132015.zip?dl=0

And source:

https://www.dropbox.com/s/471xjzpykaze7s9/3DPMReduxSource6132015.zip?dl=0

This version uses MersenneTwisterFast for random number generation. Doing so helped my 7700k speed up a little. Now it's taking ~18500ms to complete 102400 steps per particle. It's about 1000ms slower than build 692015 in terms of raw speed. Still haven't done anything with Schmide's suggestion.

Note that there is an associated license in this version of the software to accommodate the use of MersenneTwisterFast, which seems to use a BSD-style OSS license. I have only included the license as required by MersenneTwisterFast; it does not apply to any other part of the software.
 
Last edited:

bronxzv

Senior member
Jun 13, 2011
460
0
71
Unrolled loops do reduce code readability

not only that: unrolling by hand makes it very difficult to experiment, unlike with a compiler where you simply say unroll 4x, 8x, etc. the end result is code with subpar performance on most targets, since the best amount of unrolling is code-path dependent (obviously, you won't unroll the same for x87, SSE, AVX or AVX-512)

anyway, provided that you want peer review (TBC), it would be a good idea to provide a basic version without unrolling, showing clearly what you do with respect to the original loop body http://forums.anandtech.com/showpost.php?p=37006057&postcount=197
which is only 7 lines of code

I see that you replaced the original clumsy AoS layout with an SoA layout, which is clever, and assignment (an obvious bug in the original code) with summation, which is most probably the right thing to do, but it looks like you are ignoring the last statement (absolute value according to the original author IIRC) which must be there for some reason in the original source

one transformation I'd advise is to replace the sin/cos pair with a single sin or cos + sqrt, native vectorized sqrt is very fast on modern cores
I mean something like:

float c = cosf(alpha); // cos
float s = sign(pi-alpha) * sqrtf(1.0f-c*c); // sin

side note: sign(x) is vectorizer friendly, unlike range reduction in sin/cos with branches and thus a lot of extra code for if-conversion (or worse: prevented vectorization)
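As a quick sanity check on that identity in Java (my own illustration, not the benchmark code; Math.signum and Math.sqrt stand in for sign and sqrtf): sign(pi - alpha) * sqrt(1 - cos(alpha)^2) should track sin(alpha) across [0, 2*pi):

```java
public class SinFromCos {
    public static void main(String[] args) {
        final float pi = (float) Math.PI;
        float maxErr = 0f;
        for (int i = 0; i < 1000; i++) {
            float alpha = (2f * pi) * i / 1000f;
            float c = (float) Math.cos(alpha);
            // recover the sine from the cosine; the sign flips at alpha = pi
            float s = Math.signum(pi - alpha)
                    * (float) Math.sqrt(Math.max(0f, 1f - c * c));
            maxErr = Math.max(maxErr, Math.abs(s - (float) Math.sin(alpha)));
        }
        // accuracy is limited by cancellation in 1 - c*c near alpha = 0,
        // but stays well inside float benchmark tolerances
        System.out.println(maxErr < 1e-3f);
    }
}
```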
 
Last edited:

DrMrLordX

Lifer
Apr 27, 2000
22,967
13,062
136
not only that: unrolling by hand makes it very difficult to experiment, unlike with a compiler where you simply say unroll 4x, 8x, etc. the end result is code with subpar performance on most targets, since the best amount of unrolling is code-path dependent (obviously, you won't unroll the same for x87, SSE, AVX or AVX-512)

anyway, provided that you want peer review (TBC), it would be a good idea to provide a basic version without unrolling, showing clearly what you do with respect to the original loop body http://forums.anandtech.com/showpost.php?p=37006057&postcount=197
which is only 7 lines of code

I'm still kicking myself for not trying rolled loops first. I "rolled up" the main body of code in ParticleMovement.java into a do/while loop and performance improved by about 2 seconds on setting 3. I've seen that happen before, even in my own code (on other projects).

I also "rolled up" the loop in Particles.java (in the addvalues method) without any gain or loss in performance.

but it looks like you are ignoring the last statement (absolute value according to the original author IIRC) which must be there for some reason in the original source

Reading his mini-essay from 2013 on the subject, it appeared to me that Stage 1 was somehow dependent on the z value having a positive sign. So I have actually restored the absolute value element of Dr. Cutress' code snippet in the Particles.java file (addvalues method). The core of the addvalues() method now reads:

Code:
theseparticles.xpos[i] += xvalues[i];
theseparticles.ypos[i] += yvalues[i];
theseparticles.zpos[i] += Math.abs(zvalues[i]);

Math.abs applies a speed penalty on my 7700k of about ~500ms with 102400 steps per particle.

one transformation I'll advise to do is to replace the sin/cos pair with a single sin or cos + sqrt, native vectorized sqrt is very fast on modern cores
I mean something like:

float c = cosf(alpha); // cos
float s = sign(pi-alpha) * sqrtf(1.0f-c*c); // sin

side note: sign(x) is vectorizer friendly, unlike range reduction in sin/cos with branches and thus a lot of extra code for if-conversion (or worse: prevented vectorization)

I tried implementing that code directly:

Code:
tempcos[m] = this.minicos(alpha[m] - halfpi);

newx[m] = r[m] * tempcos[m];

newy[m] = r[m] * (Math.signum(pi - alpha[m]) * (float)Math.sqrt(1.0f - (tempcos[m] * tempcos[m])));

It was faster this way, by about 1000ms on my 7700k. Currently, 102400 steps per particle requires ~15500ms @ 3.4 ghz (no turbo). It also appears that the current code spits out sine function results that are slightly more accurate than minisin().

It should be interesting to see how these optimizations will affect performance on Enigmoid's 3630qm . . .
 

bronxzv

Senior member
Jun 13, 2011
460
0
71
Reading his mini-essay from 2013 on the subject, it appeared to me that Stage 1 was somehow dependent on the z value having a positive sign. So I have actually restored the absolute value element of Dr. Cutress' code snippet in the Particles.java file (addvalues method). The core of the addvalues() method now reads:

Code:
theseparticles.xpos[i] += xvalues[i];
theseparticles.ypos[i] += yvalues[i];
theseparticles.zpos[i] += Math.abs(zvalues[i]);

yes, this looks like the author's intended meaning

note that the same result will be obtained by replacing

Code:
float newz = 2 * randgen() - 1;
in the original source

with simply
Code:
float newz = randgen();

this way newz will be >= 0 from the start and no absolute value will be required


[EDIT] oops, I was wrong, I wrote it too fast; that's not strictly equivalent, since the random distribution for newz will be different with this simplification


I tried implementing that code directly:

Code:
tempcos[m] = this.minicos(alpha[m] - halfpi);

newx[m] = r[m] * tempcos[m];

newy[m] = r[m] * (Math.signum(pi - alpha[m]) * (float)Math.sqrt(1.0f - (tempcos[m] * tempcos[m])));

It was faster this way

I'm glad that you tested it but I'm afraid most of the potential speedup is lost in the library call overhead (library calls most probably also prevent vectorization); the goal here, for AVX-enabled targets for example, will be to use a single inlined VSQRTPS instruction for 8 packed sqrtf(x)

pls wake me up when you switch to C/C++ so that we can start to talk about truly optimized code
 
Last edited:

Schmide

Diamond Member
Mar 7, 2002
5,745
1,036
126
Ok. I'm going in the poo camp.

I had some time today to actually look into this, and this would never be viable for showing real-world performance. There are just so many ways to do what you're doing more efficiently and accurately, the best and easiest probably being

gnu.org - Spherical Vector Distributions
java lib gsl

They are using a really cool method by Robert E. Knop that seems to use linear coefficients between pairs to maintain uniformity.

Once you have uniformity you can bias any movement with position and scale. Moreover, trigonometric values are inherent in the unit vector; i.e., if you pick a random |x| <= 1, it only takes a sqrt and a sign choice to determine the y.
 

bronxzv

Senior member
Jun 13, 2011
460
0
71
gnu.org - Spherical Vector Distributions
java lib gsl
They are using a really cool method by Robert E. Knop that seems to use linear coefficients between pairs to maintain uniformity.

looks cool but this isn't exactly what the code discussed here is doing, the original code makes some progress in Z at each step, only dx/dy are selected on a circle, so the x,y sum is very near 0,0 after a lot of steps but the z sum progresses toward infinity

it's more or less a long-cylinder vs. small-sphere thing, if I'm picturing the particle swarms of both cases correctly

btw where can we find the source for gsl_ran_dir_3d() ? it's not available at your java lib gsl link (or I'm blind ?)
 
Last edited:

Schmide

Diamond Member
Mar 7, 2002
5,745
1,036
126
Once you have uniformity you can bias any movement with position and scale.

looks cool but this isn't exactly what the code discussed here is doing, the original code makes some progress in Z at each step, only dx/dy are selected on a circle so that the x,y sum is very near 0,0 after a lot of steps but z is a big value

it's more or less a cylinder vs. sphere thing

Whatever the constraints, you can easily bias it with vector math. To make the cone you're talking about, add a unit z vector and scale by 0.5 in the z direction.

Or use a 2d vector if you want to keep the z bias out of the unit calculations.

The point is the most efficient way to do this calculation is to make uniform pairs or tuples and bias the sign from there.

Edit: http://www.gnu.org/software/gsl/
 
Last edited:

Essence_of_War

Platinum Member
Feb 21, 2013
2,650
4
81
btw where can we find the source for gsl_ran_dir_3d() ? it's not available at your java lib gsl link (or I'm blind ?)

I think you can find it here:

ftp://ftp.gnu.org/gnu/gsl/

Latest release is 1.16

Code:
/* randist/sphere.c
 * 
 * Copyright (C) 1996, 1997, 1998, 1999, 2000, 2004, 2007 James Theiler, Brian Gough
 * 
 * This program is free software; you can redistribute it and/or modify
 * it under the terms of the GNU General Public License as published by
 * the Free Software Foundation; either version 3 of the License, or (at
 * your option) any later version.
 * 
 * This program is distributed in the hope that it will be useful, but
 * WITHOUT ANY WARRANTY; without even the implied warranty of
 * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
 * General Public License for more details.
 * 
 * You should have received a copy of the GNU General Public License
 * along with this program; if not, write to the Free Software
 * Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301, USA.
 */

#include <config.h>
#include <math.h>
#include <gsl/gsl_rng.h>
#include <gsl/gsl_randist.h>

void
gsl_ran_dir_2d (const gsl_rng * r, double *x, double *y)
{
  /* This method avoids trig, but it does take an average of 8/pi =
   * 2.55 calls to the RNG, instead of one for the direct
   * trigonometric method.  */

  double u, v, s;
  do
    {
      u = -1 + 2 * gsl_rng_uniform (r);
      v = -1 + 2 * gsl_rng_uniform (r);
      s = u * u + v * v;
    }
  while (s > 1.0 || s == 0.0);

  /* This is the Von Neumann trick. See Knuth, v2, 3rd ed, p140
   * (exercise 23).  Note, no sin, cos, or sqrt !  */

  *x = (u * u - v * v) / s;
  *y = 2 * u * v / s;

  /* Here is the more straightforward approach, 
   *     s = sqrt (s);  *x = u / s;  *y = v / s;
   * It has fewer total operations, but one of them is a sqrt */
}

void
gsl_ran_dir_2d_trig_method (const gsl_rng * r, double *x, double *y)
{
  /* This is the obvious solution... */
  /* It ain't clever, but since sin/cos are often hardware accelerated,
   * it can be faster -- it is on my home Pentium -- than von Neumann's
   * solution, or slower -- as it is on my Sun Sparc 20 at work
   */
  double t = 6.2831853071795864 * gsl_rng_uniform (r);          /* 2*PI */
  *x = cos (t);
  *y = sin (t);
}

void
gsl_ran_dir_3d (const gsl_rng * r, double *x, double *y, double *z)
{
  double s, a;

  /* This is a variant of the algorithm for computing a random point
   * on the unit sphere; the algorithm is suggested in Knuth, v2,
   * 3rd ed, p136; and attributed to Robert E Knop, CACM, 13 (1970),
   * 326.
   */

  /* Begin with the polar method for getting x,y inside a unit circle
   */
  do
    {
      *x = -1 + 2 * gsl_rng_uniform (r);
      *y = -1 + 2 * gsl_rng_uniform (r);
      s = (*x) * (*x) + (*y) * (*y);
    }
  while (s > 1.0);

  *z = -1 + 2 * s;              /* z uniformly distributed from -1 to 1 */
  a = 2 * sqrt (1 - s);         /* factor to adjust x,y so that x^2+y^2
                                 * is equal to 1-z^2 */
  *x *= a;
  *y *= a;
}

void
gsl_ran_dir_nd (const gsl_rng * r, size_t n, double *x)
{
  double d;
  size_t i;
  /* See Knuth, v2, 3rd ed, p135-136.  The method is attributed to
   * G. W. Brown, in Modern Mathematics for the Engineer (1956).
   * The idea is that gaussians G(x) have the property that
   * G(x)G(y)G(z)G(...) is radially symmetric, a function only
   * r = sqrt(x^2+y^2+...)
   */
  d = 0;
  do
    {
      for (i = 0; i < n; ++i)
        {
          x[i] = gsl_ran_gaussian (r, 1.0);
          d += x[i] * x[i];
        }
    }
  while (d == 0);
  d = sqrt (d);
  for (i = 0; i < n; ++i)
    {
      x[i] /= d;
    }
}
 

bronxzv

Senior member
Jun 13, 2011
460
0
71
I think you can find it here:

ftp://ftp.gnu.org/gnu/gsl/

Latest release is 1.16

Code:
void
gsl_ran_dir_3d (const gsl_rng * r, double *x, double *y, double *z)
{
  double s, a;

  /* This is a variant of the algorithm for computing a random point
   * on the unit sphere; the algorithm is suggested in Knuth, v2,
   * 3rd ed, p136; and attributed to Robert E Knop, CACM, 13 (1970),
   * 326.
   */

  /* Begin with the polar method for getting x,y inside a unit circle
   */
  do
    {
      *x = -1 + 2 * gsl_rng_uniform (r);
      *y = -1 + 2 * gsl_rng_uniform (r);
      s = (*x) * (*x) + (*y) * (*y);
    }
  while (s > 1.0);

  *z = -1 + 2 * s;              /* z uniformly distributed from -1 to 1 */
  a = 2 * sqrt (1 - s);         /* factor to adjust x,y so that x^2+y^2
                                 * is equal to 1-z^2 */
  *x *= a;
  *y *= a;
}

thank you! I see that the 1st step is an iterative method (calling two rand() per loop iteration!), so it will be way slower than the *simpler* solution discussed here, and it isn't amenable to vectorization since the number of loop iterations is not known statically. for modern code we prefer to use polynomial approximations since they are easily vectorizable (and well suited for FMA); legacy iterative methods such as this must be *avoided like the plague*: for ex. the cost of a single branch miss in this scalar case (due to the unpredictable loop count) can be more than the evaluation of 8 polynomials in parallel with AVX2!
 
Last edited:

bronxzv

Senior member
Jun 13, 2011
460
0
71
There is never a need to re-normalize, you always start with a normalized vector?

there is obviously a need to renormalize after your proposed transform (translate + scale z)

example:
let's start with a normalized vector v = [0.4 0.3 -0.866]

after the transform you described it will become v' = [0.4 0.3 0.067]

|v| = ~1.0, but |v'| = ~0.5 (|v'|^2 = ~0.25)

v' is about 2x too short in this example

the best counter example will be v = [0 0 -1] where v' = [0 0 0] (!)
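A small Java sketch of this counterexample (my own, just reproducing the numbers above): applying the translate-and-scale to the z component shortens the vector, and an explicit renormalization step restores unit length:

```java
public class RenormDemo {
    public static void main(String[] args) {
        double[] v  = {0.4, 0.3, -0.866};              // ~unit length
        double[] vp = {v[0], v[1], (v[2] + 1) * 0.5};  // translate + scale z
        double len = Math.sqrt(vp[0] * vp[0] + vp[1] * vp[1] + vp[2] * vp[2]);
        System.out.println(len < 0.6);                 // no longer unit length (~0.5)
        // renormalize to recover a unit vector
        double[] u = {vp[0] / len, vp[1] / len, vp[2] / len};
        double lenU = Math.sqrt(u[0] * u[0] + u[1] * u[1] + u[2] * u[2]);
        System.out.println(Math.abs(lenU - 1.0) < 1e-12);
    }
}
```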
 
Last edited:

DrMrLordX

Lifer
Apr 27, 2000
22,967
13,062
136
I'm glad that you tested it but I'm afraid most of the potential speedup is lost in the library call overhead (library calls most probably also prevent vectorization); the goal here, for AVX-enabled targets for example, will be to use a single inlined VSQRTPS instruction for 8 packed sqrtf(x)

pls wake me up when you switch to C/C++ so that we can start to talk about truly optimized code

Okay, will do. And yes, Java library calls seem to be a source of slowdown depending on how the method operates. Replacing the call to minisin() still sped things up nicely, so I'll take what I can get.

It's still interesting to see Java outperforming the original 3DPM (Stage 1 anyway) on my 7700k . . . not sure why it was slower on a 3630qm, but maybe I'll get to the bottom of that later.

Ok. I'm going in the poo camp.

I had some time today to actually look into this, and this would never be viable for showing real-world performance. There are just so many ways to do what you're doing more efficiently and accurately, the best and easiest probably being

gnu.org - Spherical Vector Distributions
java lib gsl

They are using a really cool method by Robert E. Knop that seems to use linear coefficients between pairs to maintain uniformity.

Once you have uniformity you can bias any movement with position and scale. Moreover, trigonometric values are inherent in the unit vector; i.e., if you pick a random |x| <= 1, it only takes a sqrt and a sign choice to determine the y.

You'll see no inherent disagreement from me here. I'm only copying someone else's work and trying to spruce it up a bit, hopefully in such a way that we can all learn more about why AMD processors suffer and fail so badly on the original. Fact is that I'm just a hobbyist programmer/forum zombie, so that doesn't really put me in any better category than the original author, much less anyone else. What you're discussing is a bit over my head at the moment.

The cool thing is that I'm not the only one allowed to take a whack at re-imagining Dr. Cutress' 3DPM Stage 1. That is, unless Dr. Cutress exercises IP rights to all derivative works and demands that we cease distribution of such, but I don't think he's going to do that.

In other news, here's build 6162015:

https://www.dropbox.com/s/158tbdoo8rwdno8/3DPMRedux6162015.zip?dl=0

Source:

https://www.dropbox.com/s/crjlp6tnk5twoab/3DPMReduxSource6162015.zip?dl=0

As discussed, it includes some code tweaks that have improved performance over the previous version without making things less accurate. I expect that this will be my final iteration of the Java version unless I find a way to improve the efficiency of the range reduction for the minicos() method.

Here's my results running 6162015:

Code:
Build: 6162015
102400 steps completed.
Mode: multi-threaded

It took 15524 milliseconds to complete the workload.
It took 61 milliseconds to write the output file.

Build: 6162015
102400 steps completed.
Mode: multi-threaded

It took 15231 milliseconds to complete the workload.
It took 16 milliseconds to write the output file.

Build: 6162015
102400 steps completed.
Mode: multi-threaded

It took 15455 milliseconds to complete the workload.
It took 44 milliseconds to write the output file.

Judging by those results, the 7700k @ 3.4 ghz (no turbo), DDR3-1600, 1800mhz NB is getting ~63M movements per second, which is 46-47% faster than what I get running 3DPM in Stage 1. To the extent that I had sought to prove that the same code could be made to run faster on an AMD processor, I consider this build to be "mission accomplished". I'd still like to see some others post their results if they're willing, perhaps to show whether or not my implementation of Stage 1 has hurt other processors (such as an Ivy Bridge) as much as it has helped my Steamroller. If it's still running slower on Intel chips than does Dr. Cutress' code, then I've traded one performance discrepancy for another, and that isn't necessarily a good thing.

Future versions will probably be in C/C++. Please don't be too alarmed if you find that the code is utter rubbish. It's easy to write bad C++, and I'll be stumbling around for a bit until I manage some core competency.
 

Essence_of_War

Platinum Member
Feb 21, 2013
2,650
4
81
DrMrLordX,

Allow a fortran and matlab programmer to reveal their mega-ignorance of java: how does one go from source files to compiled binaries? Do you use a makefile, or is it something different? I didn't see a makefile in your source stuff, so I wasn't sure.

Also, I was thinking about trying to implement this in fortran. I've done a little OMP in fortran before, and although I haven't used anything like AVX intrinsics, I have done some particle pushing before. In this vein, could you run through your/Ian's algorithm in prose or pseudocode, just so we can get on the same page?
 

Schmide

Diamond Member
Mar 7, 2002
5,745
1,036
126
This is kind of stupid but kind of fun.

I went back and took a little longer to understand the implications of what the original code is doing and what could be done to form equivalence without using transcendentals.

So let's look at it in pieces.

Code:
float newz = 2 * randgen() - 1;

This essentially maps [0, 1] to [-1, 1]. Later in the code it has the effect of slightly biasing the computed vector in the z direction and, half the time, producing a non-unit, non-z vector.

Code:
  float alpha = 2 * pi * randgen();
  float r = sqrtf(1 - newz * newz);
  particle[i].x = r*cosf(alpha);
  particle[i].y = r*sinf(alpha);
  particle[i].z = newz;

This makes a random angle and then produces an xy offset such that the resulting vector is unit.

Code:
if(particle[i].z < 0) {particle[i].z -= particle[i].z;}

The final line essentially throws away the z value if it is negative. This is where a non-unit vector is created.

So let's look at the statistical outcomes for the generated random z number.

Half the time (randgen() from 0 to 0.5, i.e. negative newz) it produces a non-unit vector whose magnitude in the xy plane is between 0 and 1.

The other half (0.5 to 1) it produces a unit vector with positive z.

The crux is designing a system that produces the same results while mapping linear space, where statistical outcomes are evenly spaced, to curved space, where statistical outcomes are biased relative to the area under the curve.

If you drop random points on a unit square with a circle on it.
(aka monte carlo. We could also integrate but for visual perspective this is easier)

[image: random points dropped on a unit square with an inscribed circle]


The chance of hitting the circle is the area of the circle divided by the area of the square: (pi r^2) / (2r)^2 = pi / 4.

Essentially, when the original algorithm chooses its z first, the bias of its method vs. a random point on a curve is 1 / (pi/4) = 4/pi, or ~1.27.

If one were to choose a random vector using the library method and do the same negative-z drop, it would on average travel at pi/4 the speed in the z direction and 4/pi in the xy plane.

The easiest and most likely equally effective method to alleviate this issue would be to choose another random number. If it is below pi/4, find the xy component with the most magnitude and exchange it with the z. This maintains unit length and pushes the random tuple into the z direction.

pseudo code for it
Code:
if (randgen() < (pi/4)) {
   if (abs(particle[i].x) < abs(particle[i].y)) {
      if (abs(particle[i].y) > abs(particle[i].z)) {
         double tempZ = particle[i].z;
         particle[i].z = particle[i].y;
         particle[i].y = tempZ;
      }
   } else if (abs(particle[i].x) > abs(particle[i].z)) {
      double tempZ = particle[i].z;
      particle[i].z = particle[i].x;
      particle[i].x = tempZ;
   }
}

Now is this what should be done? I don't know but I could guess that this is what the pros would do if faced with the same constraint.
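Here's that pseudocode as a self-contained Java method (hypothetical: the `biasTowardZ` name and the test vector are mine, not from the post). Since it only permutes components, the result is still a unit vector:

```java
public class ZBiasSwap {
    // With probability pi/4 (decided by the caller-supplied random draw),
    // swap the largest-magnitude xy component into z, preserving unit length.
    static double[] biasTowardZ(double[] p, double rand) {
        double x = p[0], y = p[1], z = p[2];
        if (rand < Math.PI / 4) {
            if (Math.abs(x) < Math.abs(y)) {
                if (Math.abs(y) > Math.abs(z)) { double t = z; z = y; y = t; }
            } else if (Math.abs(x) > Math.abs(z)) { double t = z; z = x; x = t; }
        }
        return new double[]{x, y, z};
    }

    public static void main(String[] args) {
        double[] v = {0.8, 0.36, 0.48};      // unit: 0.64 + 0.1296 + 0.2304 = 1.0
        double[] w = biasTowardZ(v, 0.5);    // 0.5 < pi/4, so the swap happens
        // |x| is the largest xy component and |x| > |z|, so x and z exchange
        System.out.println(w[2] == 0.8 && w[0] == 0.48);
        double norm = Math.sqrt(w[0] * w[0] + w[1] * w[1] + w[2] * w[2]);
        System.out.println(Math.abs(norm - 1.0) < 1e-12); // still unit length
    }
}
```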
 
Last edited:

bronxzv

Senior member
Jun 13, 2011
460
0
71
Code:
if(particle[i].z < 0) {particle[i].z -= particle[i].z;}

The final line essentially throws away the z value if it is negative. This is where a non unit vector is created.

from this original author's comment http://forums.anandtech.com/showpost.php?p=37249640&postcount=235 we have deduced that it should be interpreted as:

Code:
particle[i].z += abs(newz);

not the clamp to 0.0 in the pseudo code; this way the delta is a unit vector at each step
our main problem here is that we're still missing a fixed version of the buggy pseudocode posted by the original author, so it's not 100% clear if the absolute value applies to the running total z or to the delta z at each step

If one were to choose a random vector using the library method and do the same negative-z drop, it would on average travel at pi/4 the speed in the z direction and 4/pi in the xy plane.

note that the average travel speed on the xy plane tends to 0 for high step counts (if you compare the starting position to the ending position); with a good random angular distribution a particle always passes very near the origin after a while, again and again
 
Last edited:

DrMrLordX

Lifer
Apr 27, 2000
22,967
13,062
136
DrMrLordX,

Allow a fortran and matlab programmer to reveal their mega-ignorance of java, how does one go from source files to compiled binaries. Do you use a makefile? Or is it something different. I didn't see a makefile in your source stuff, so I wasn't sure.

Hey I got ignorance to spare, so you're in good company. Java's pretty easy. What you do is use the "javac" command to "compile" your java source into .class files, which is what you see in the binary distribution. The .class files aren't executable in any real sense, they're just binary-format bytecode that the JVM dynamically compiles into machine code at runtime.

Normally I rely on my IDE (I use Eclipse) to do the compiling. If you want to do it from the command line, you have to specify the .java file with the main method (static public void main(String args[]) is usually what you're looking for).

In this case, the project format is pretty simple, so all you have to do is "javac ThreeDPMRedux.java" inside the directory containing the source files. Be aware that Windows distributions of the Oracle JDK may not set PATH statements correctly for you to use javac.exe anywhere you want, especially if you did a version upgrade. For example, I installed JDK 8 update 25, and it has since patched to update 40, and the PATH statement points to the updated java.exe in the bin directory of update 40 which does not contain javac.exe 'cuz they haven't modified it since then (so the JDK still relies on an old javac.exe).

As a consequence, javac.exe is in the bin dir in jdk1.8.0_25 while the PATH statement points to the bin dir of jre1.8.0_40. Fun, huh? Fortunately, Eclipse isn't easily confused by all that and keeps up rather nicely. Eclipse CDT is another story . . .

Also, I was thinking about trying to implement this in Fortran. I've done a little OMP in Fortran before, and although I haven't used anything like AVX intrinsics, I have done some particle pushing before. In this vein, could you run through your/Ian's algorithm in plain language or pseudocode, just so we can get on the same page?

I think Schmide has summed it up rather nicely. Personally I do not fully understand what it was that Dr. Cutress was trying to do here. He's following what is apparently an algorithm from some other researcher's work on three-dimensional particle movement in some particular scenario. He wrote back in 2013:

The purpose of Brownian motion simulation is to calculate what happens to individual chemicals in circumstances where the finite motion is what dictates the activity occurring. For example, experiments dealing with single molecule detection rely on how a single molecule moves, or during deposition chemistry, the movement of individual chemicals will dictate how a surface achieves deposition in light of other factors. The movement of the single particle is often referred to as a Random Walk (of which there are several types).

So my guess is that 3DPM has something to do with particle movement in a chemical deposition scenario. That implies some base or substrate acting as the deposition surface, hence the need for non-negative z values (maybe?).

When I saw the pseudocode, all I saw was some math and thought to myself, hot damn, you could totally autovectorize that!

But, basically, to echo what Schmide said:

First you generate a random 32-bit float value (newz) from -1 to 1, with the understanding that this particular value is going to be used as a component of other value calculations.

Then you generate an entirely different random 32-bit float value (alpha) ranging from 0 to 2pi.

Then you calculate an r (radius?) value (32-bit float) which is the square root of (1 - newz^2).

Finally you calculate the "walk" values using trig (r * cos(alpha), r * sin(alpha)) except for the z value which is the absolute value of newz.

My guess is, the particle starts at origin in some chemical deposition process, and the distance it "walks" from origin is based on whatever z value it assumes plus some randomness, and that the algorithm we're seeing here matches the experimental data of some researcher(s) who wrote a paper about it. The z value has to be positive in the end since a negative value implies penetration of the underlying substrate.

Now given all that, I'm not 100% sure why the starting z value has to range from -1 to 1. newz goes positive in the r calculation (newz^2 will always be positive) and in the final determination of the particle walk in the z direction.

Given any random distribution of values from -1 to 1 stored in newz, the resulting distributions of newz^2 and abs(newz) should be identical to what you'd get if the values stored in newz ranged only from 0 to 1.
 

Enigmoid

Platinum Member
Sep 27, 2012
2,907
31
91
When I saw the pseudocode, all I saw was some math and thought to myself, hot damn, you could totally autovectorize that!

But, basically, to echo what Schmide said:

First you generate a random 32-bit float value (newz) from -1 to 1, with the understanding that this particular value is going to be used as a component of other value calculations.

Then you generate an entirely different random 32-bit float value (alpha) ranging from 0 to 2pi.

Then you calculate an r (radius?) value (32-bit float) which is the square root of (1 - newz^2).

Finally you calculate the "walk" values using trig (r * cos(alpha), r * sin(alpha)) except for the z value which is the absolute value of newz.

My guess is, the particle starts at origin in some chemical deposition process, and the distance it "walks" from origin is based on whatever z value it assumes plus some randomness, and that the algorithm we're seeing here matches the experimental data of some researcher(es) who wrote a paper about it. The z value has to be positive in the end since a negative value implies penetration of the underlying substrate.

Now given all that, I'm not 100% sure why the starting z value has to range from -1 to 1. newz goes positive in the r calculation (newz^2 will always be positive) and in the final determination of the particle walk in the z direction.

Given any random distribution of values from -1 to 1 stored in newz, the final values of newz^2 and abs(newz) should be identical if the random distribution of values stored in newz ranged only from 0 to 1.

So basically in C pseudocode for a single particle using vanilla c


Code:
int steps, count;
float newz, alpha, r, part_x = 0, part_y = 0, part_z = 0;

for (count = 0; count < steps; count++)
{
    newz = rand();           /* rand() is U[-1,1]                  */
    alpha = pi*rand() + pi;
    r = sqrt(1 - newz*newz);
    part_x += r*cos(alpha);
    part_y += r*sin(alpha);
    part_z += fabs(newz);    /* fabs, since C's abs() is int-only  */
}
Is this correct? Well basically correct?

Edit: Of course part_n is an array.
 
Last edited:

bronxzv

Senior member
Jun 13, 2011
460
0
71
So basically in C pseudocode for a single particle using vanilla c


Code:
int steps, count;
float newz, alpha, r, part_x=0, part_y=0, part_z=0; 
 
for (count = 0; count < steps; count++)
{
newz = rand();          //rand() is U[-1,1]
alpha = pi*rand() + pi;
r = sqrt(1-newz*newz);
part_x += r*cos(alpha);
part_y += r*sin(alpha);
part_z += abs(newz);
}
Is this correct? Well basically correct?

Edit: Of course part_n is an array.

this looks basically correct. I would use a name other than "rand", though, since that name is already taken in libc with a different meaning (it returns integers)

another possible interpretation is:

Code:
int steps, count;
float newz, alpha, r, part_x = 0, part_y = 0, part_z = 0;

for (count = 0; count < steps; count++)
{
    newz = rand();           /* rand() is U[-1,1] */
    alpha = pi*rand() + pi;
    r = sqrt(1 - newz*newz);
    part_x += r*cos(alpha);
    part_y += r*sin(alpha);
    part_z += newz;
    part_z = fabs(part_z);   /* reflect at the plane z = 0 */
}

it would be as if the particle were rebounding off the plane z=0
 

DrMrLordX

Lifer
Apr 27, 2000
22,967
13,062
136
So basically in C pseudocode for a single particle using vanilla c


Code:
int steps, count;
float newz, alpha, r, part_x=0, part_y=0, part_z=0; 

for (count = 0; count < steps; count++)
{
newz = rand();          //rand() is U[-1,1]
alpha = pi*rand() + pi;
r = sqrt(1-newz*newz);
part_x += r*cos(alpha);
part_y += r*sin(alpha);
part_z += abs(newz);
}
Is this correct? Well basically correct?

Edit: Of course part_n is an array.

The only problem I see is here:

Code:
alpha = pi*rand() + pi;

If rand() here returns the usual U[0,1], that's going to generate values from pi to 2pi rather than 0 to 2pi. (With the U[-1,1] convention in the comment, the expression does cover 0 to 2pi, but then the replacements below would need a U[0,1] generator.) You could switch to:

Code:
alpha = rand() * (pi + pi);

Or you could predefine a constant named twopi equal to pi * 2 (so you don't have to constantly recalculate pi * 2 . . . that's what I did in my Java code) and do this:

Code:
alpha = rand() * twopi;

Everything else looks right. Dr. Cutress seems to have used 10000 individual particles, and they all start with the same coordinates for origin (0, 0, 0). The calculations do not seem to take any kind of particle-on-particle collision into account which is just fine with me.
 
Last edited: