Just to let you guys know, I've been monitoring this thread for about a month and watching what you're doing. I've not had much time to post as I really want to put an in-depth piece together for you explaining some of it, but after a month on the road following Computex I'm now more time constrained than ever with main site business. Stay tuned, but don't expect miracles.
I have been contacted by DrMrLordX, I believe through email, and have shed some light on his questions. But I want to put down a few things here:
- Source code is lost. Not my fault: a jobsworth admin wiped my system in the lab because he uninstalled the graphics drivers and then wiped the data rather than fix his error. I thought I had a backup, but best laid plans and all that.
- Written in C++ using VS2012 and OpenMP extensions. All compiler flags set to optimise for speed over anything else.
- The base of 3DPM is three-dimensional movement algorithms.
In the field I was working in (computational electrochemistry), 3D movement was defined as:
// rand_uniform(0, 6) stands in for the original rand(0,6): a uniform
// float in [0, 6). Flooring gives a uniform integer in {0, ..., 5}.
int i = (int)floorf(rand_uniform(0.0f, 6.0f));
if (i == 0) { moveup(); }
if (i == 1) { movedown(); }
if (i == 2) { moveleft(); }
if (i == 3) { moveright(); }
if (i == 4) { moveback(); }
if (i == 5) { moveforward(); }
This is essentially the Bipyramidal method in 3DPM. It's quick and dirty, but ultimately it doesn't move particles on a sphere but on a bipyramid, reducing accuracy.
3DPM runs six algorithms for 3D movement, all of which are in the literature, some of which have been discussed in this thread.
There is a trap with 3D movement: just picking two random angles and calculating a position on the sphere from them gives an uneven distribution, with points clustering at the poles. I go through this in detail in my research, because it seems like an easy trap to fall into (and I did, for about two weeks).
The method I outlined in my pseudocode before, a uniform distribution over Z and a random angle for X,Y, is a proven spherical movement algorithm (as in mathematically proven, if you integrate over the surface of a sphere) and ends up being the fastest on most architectures if you design accordingly.
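To illustrate both the trap and the fix, here's a rough sketch (the names are illustrative, since the original source is lost; rand01() is a placeholder for a uniform [0,1) generator):

#include <cmath>
#include <random>

static std::mt19937 rng_state(1);
float rand01() { // placeholder: uniform float in [0,1)
    static std::uniform_real_distribution<float> d(0.0f, 1.0f);
    return d(rng_state);
}

// The trap: two uniform random angles. Points bunch up at the poles,
// because the surface area element shrinks towards them.
void naive_two_angles(float &x, float &y, float &z) {
    float theta = 6.2831853f * rand01(); // longitude in [0, 2*pi)
    float phi   = 3.1415927f * rand01(); // uniform latitude -- the mistake
    x = sinf(phi) * cosf(theta);
    y = sinf(phi) * sinf(theta);
    z = cosf(phi);
}

// The fix: z uniform over [-1,1), random angle for x,y. Integrating
// over the sphere's surface shows this distribution is uniform.
void uniform_over_z(float &x, float &y, float &z) {
    z = 2.0f * rand01() - 1.0f;
    float theta = 6.2831853f * rand01();
    float r = sqrtf(1.0f - z * z); // radius of the circle at height z
    x = r * cosf(theta);
    y = r * sinf(theta);
}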
If anyone has access to online scientific journals, my main paper on this is here:
http://www.sciencedirect.com/science/article/pii/S1572665711001068
I think I go over the algorithms in more detail in the supporting information.
Or if you have access to Oxford University's PhD thesis library, I'm in there. Unfortunately I can't give the thing out, due to publisher's copyright and such.
The methods I use for 3DPM are, as I believe they are called (or perhaps I made up these names):
Bipyramidal Method (quick dirty ifs, as above)
Cosine Method (spherical polar coordinates, integrate over the surface area element)
Normal Deviate Method (x, y, z are normal N(0,1) deviates, normalise onto the sphere; see the sketch after this list)
Hypercube Rejection Method (x, y, z are (-1,1], reject those outside the sphere, scale to the sphere; also sketched below)
Trigonometric Method (uniform over z, random angle to calc x, y)
2-D Rejection (uniform over z, then random x, y over (-1,1] and reject if outside the sphere)
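As flagged in the list, here are rough sketches of the Normal Deviate and Hypercube Rejection methods (illustrative names and RNG wrappers, not the original code):

#include <cmath>
#include <random>

struct Vec3 { float x, y, z; };

static std::mt19937 gen(1);
float rand01() { static std::uniform_real_distribution<float> d(0.0f, 1.0f); return d(gen); }
float randn()  { static std::normal_distribution<float> d(0.0f, 1.0f); return d(gen); }

// Normal Deviate Method: three N(0,1) deviates, normalised onto the
// unit sphere. The Gaussian is rotationally symmetric, so the result
// is uniform over the sphere's surface.
Vec3 normal_deviate() {
    float x = randn(), y = randn(), z = randn();
    float s = 1.0f / sqrtf(x * x + y * y + z * z);
    return { x * s, y * s, z * s };
}

// Hypercube Rejection Method: sample the cube (-1,1]^3, reject points
// outside the unit ball, then scale the survivor out to the surface.
Vec3 hypercube_rejection() {
    float x, y, z, r2;
    do {
        x = 2.0f * rand01() - 1.0f;
        y = 2.0f * rand01() - 1.0f;
        z = 2.0f * rand01() - 1.0f;
        r2 = x * x + y * y + z * z;
    } while (r2 > 1.0f || r2 == 0.0f);
    float s = 1.0f / sqrtf(r2);
    return { x * s, y * s, z * s };
}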
Each of these differs in speed due to the various math calls. Ultimately you want to avoid division at all costs (if you divide by a constant, hoist its reciprocal out into a variable and multiply instead), but the algorithm should spend most of its time in the cosf/sinf functions.
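As a trivial example of the division point (illustrative code, not from 3DPM):

// Dividing by a constant inside a hot loop pays the division cost
// every iteration:
void scale_divide(const float *in, float *out, int n) {
    for (int i = 0; i < n; ++i) out[i] = in[i] / 3.0f;
}

// Hoist the reciprocal once and multiply instead:
void scale_multiply(const float *in, float *out, int n) {
    const float third = 1.0f / 3.0f; // hoisted reciprocal
    for (int i = 0; i < n; ++i) out[i] = in[i] * third;
}

Compilers can sometimes make this transformation themselves, but only when it is exact or when fast-math style flags allow it, so being explicit keeps it guaranteed.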
For anyone wondering about accuracy, I used single precision here. I have done work on partial differential equations which requires double precision due to iterations over matrices, but the diffuse and parallel nature of this problem is not affected by precision to that extent. I have a feeling using 8-bit numbers might even help. For as many functions as I could, I used the lower accuracy versions.
Technically, I used 3DPM for verification of the GPU version of the program. Being a parallel problem, GPUs were much faster, and that's what I used in my published papers. At the time the best on the market was a GTX 280, so go figure, but I was still reducing the time it took other people to do similar work from one month per simulation down to one minute. CPUs got me some of the way there (better algorithm, multithreaded), but GPUs made this research a tangible asset going forward.
With regard to random number generation, I used the Ranq2 algorithm for CPUs, from Numerical Recipes 3rd Edition: The Art of Scientific Computing (https://*******/WZGYPr), but modified it slightly to output in (0,1]. If anyone's interested, I used XORWOW for GPUs.
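I can't reproduce the Numerical Recipes source here, but as a sketch of what that kind of modification looks like, here is a mapping of a generic 64-bit generator onto (0,1], i.e. excluding exactly zero (the xorshift64 stand-in is illustrative, not Ranq2):

#include <cstdint>

// Stand-in 64-bit generator (Marsaglia xorshift64 for illustration;
// the original code used Ranq2 instead).
static uint64_t state = 88172645463325252ULL;
uint64_t next_u64() {
    state ^= state << 13;
    state ^= state >> 7;
    state ^= state << 17;
    return state;
}

// Map to (0,1]: 2^-64 * u covers [0, 1 - 2^-64]; adding one more
// 2^-64 shifts the range to [2^-64, 1], so zero is excluded.
double doub_open_at_zero() {
    const double k = 5.421010862427522e-20; // 2^-64
    return k * (double)next_u64() + k;
}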
Another element to speed up the CPU simulation, which was perhaps dubious in my pseudocode earlier, is that you do not need to generate the particles before the threads are issued. Initiate the xyz coordinates (and a Ranq2 sequence) when you start the thread, and then loop over steps as necessary. When the steps end, find the final distance from the origin and dump it into an overarching results array so the compiler can't optimise the work away (make the array the size of the particle count, then do a final reduction for the average displacement from the origin over all particles, which should be near 0 if the RNG is good), then move to the next thread. I believe my numbers were 10,000 threads at 100 steps each, though I'm not 100% sure on that. It was definitely enough steps to cover thread generation and enough threads to make the big CPUs sweat for a time. In 3DPM, the single-thread test spends 5 seconds on each algorithm and then does a calculation for speed; the multithreaded test spends 10 seconds on each algorithm (loop until time > 60 sec, count the steps, then divide by time, making sure you keep accuracy).
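A minimal sketch of that structure with OpenMP, using the 10,000/100 numbers from above (the RNG and the movement call are placeholders, not the original code):

#include <cmath>
#include <cstdio>
#include <random>

const int NUM_PARTICLES = 10000; // particles, one unit of work each
const int NUM_STEPS     = 100;   // steps per particle

// Placeholder per-particle move: one unit step in a uniform random
// direction (uniform over z, random angle), as described earlier.
void move_on_sphere(std::mt19937 &rng, float &x, float &y, float &z) {
    std::uniform_real_distribution<float> d(0.0f, 1.0f);
    float dz    = 2.0f * d(rng) - 1.0f;
    float theta = 6.2831853f * d(rng);
    float r     = sqrtf(1.0f - dz * dz);
    x += r * cosf(theta);
    y += r * sinf(theta);
    z += dz;
}

int main() {
    static float finalDist[NUM_PARTICLES]; // write-out so work isn't elided

    #pragma omp parallel for
    for (int p = 0; p < NUM_PARTICLES; ++p) {
        std::mt19937 rng(p + 1);          // one RNG stream per particle
        float x = 0.0f, y = 0.0f, z = 0.0f;
        for (int s = 0; s < NUM_STEPS; ++s)
            move_on_sphere(rng, x, y, z);
        finalDist[p] = sqrtf(x * x + y * y + z * z);
    }

    // Final reduction over all particles.
    double sum = 0.0;
    for (int p = 0; p < NUM_PARTICLES; ++p) sum += finalDist[p];
    printf("mean final distance: %f\n", sum / NUM_PARTICLES);
    return 0;
}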
Regarding movement in the negative z direction, I distinctly remember adjusting this slightly.
The idea is that a particle should bounce off a solid plane, but you only perform the 'bounce' if the particle ends up below the plane, e.g.:
z-value at each step, where the plane is z = 0:
0.7
0.8
0.2
-0.3 -> becomes 0.3
0.4
-0.5 -> becomes 0.5
0.6
etc.
I remember in my code that the boundary for z was actually -10, not 0.
So
if (z < -10.0f) { z = z - 2.0f * (z + 10.0f); }
Or something similar; working it through, z - 2*(z + 10) = -20 - z, which is z reflected about the z = -10 plane (e.g. z = -10.4 becomes -9.6). I did this so the final if statement would have less effect; the boundary really is there in the simulation, even if only a few particles ever hit it. (It actually makes a big difference when you're dealing with warps of particles on a GPU, as every particle in the warp has to execute the code even when it doesn't apply.)
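On that warp point, one branchless way to express the same reflection (my illustration here, not necessarily how I wrote it back then) is to fold the condition into the arithmetic so every lane in a warp does identical work:

#include <cmath>

// Reflect z about the z = -10 plane without a branch.
// If z >= -10: fabsf(z + 10) - 10 = (z + 10) - 10 = z (unchanged).
// If z <  -10: fabsf(z + 10) - 10 = -(z + 10) - 10 = -20 - z (reflected).
float bounce(float z) {
    return fabsf(z + 10.0f) - 10.0f;
}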
I think that's most of the basic stuff here. Clearly at some point in the future it needs a rewrite, and I think I'll have to come back here so we can discuss potential optimisations and make it a 'fairer' test, as it were. I still believe it's fair, because it's basic code that I wrote for my work at the time, even though it's not CompSci-inspired and I was learning as I was doing.
If anyone has any questions, I'd be happy to answer them to the best of my abilities. I have an email subscription to the thread. Or email me direct through the main AT website, that works too.