3DPM: Can we fix it? Yes we can!


DrMrLordX

Lifer
Apr 27, 2000
22,953
13,043
136
Double post. Oops!

I'll salvage this by saying that I have made a minor update in the form of build 6222015:

https://www.dropbox.com/s/vpa8t5yl04ajape/3DPMRedux6222015.zip?dl=0

Source:

https://www.dropbox.com/s/enz9kz2u8up8v2x/3DPMReduxSource6222015.zip?dl=0

I modified the program to produce results in millions of steps/second. It also corrects the issue of an ever-increasing z value as I've discussed elsewhere in this thread. Should run about as fast as before.

Build: 6222015
102400 steps completed.
Mode: multi-threaded

It took 15301 milliseconds to complete the workload.
(66.92373M steps per second)
It took 51 milliseconds to write the output file.

(that's at 3.4 GHz)
 
Last edited:

DrMrLordX

Lifer
Apr 27, 2000
22,953
13,043
136
Some more testing. I've been trying to get an AVX trig library working, but no matter what I keep getting errors or running into a lack of documentation (I am no expert programmer, and I'm not good at digging through huge interdependent AVX/SSE libraries and getting everything properly linked and built).

Have you looked at Yeppp! yet? I haven't done anything with it (couldn't get it to work properly in Java for some reason, though that may be user stupidity).

I've gotten ICC installed and running through VS (which, compared to Code::Blocks, is a steaming pile of crap from head to toe at getting basic code out the door). The RNG does NOT function well when built with the Microsoft compiler in VS. It runs at about 1/10 the speed (7M numbers/s) of the GNU build and is slower than the entire particle movement (8M/s) on GNU. DO NOT compile with the VS Microsoft compiler - you must fix the RNG first.

Ouch. VS is often lambasted for producing slow code. Interestingly enough, I think Dr. Cutress used VS2008 to compile 3DPM.

I'm really thinking VS is not the way to go. It only has OpenMP 2.0 support, which may be hurting things (GNU has 4.0 and ICC has 3.0).

Agreed. There are a lot of things not right about VS. There may be other ways to squeeze performance out of it, but frankly I don't think it'll be worth the effort.

IMO: lol at ICC and AMD. ICC doesn't compile properly for AMD, and AMD's AVX math libraries require FMA4, which Intel CPUs don't have.

ICC is funny like that. GCC is probably going to work better for AMD build targets. Unfortunately, it may also require fundamentally different code in a few places depending on what libraries are in use and other things. Or maybe it won't.

So what toolchain are you using for GCC right now, and are you using it in Windows or some other OS? I can use GCC just fine in teh Lunix, but Eclipse CDT + MinGW aren't playing along well at all. I can compile using MinGW from the command-line, but that's about it.
 

Enigmoid

Platinum Member
Sep 27, 2012
2,907
31
91
Are you sure SVML is used for the trig funcs? You should see stuff such as call __svml_cosf8 in the ASM dump.

It is used automatically by the ICC vectorizer once you have a data layout amenable to vectorization.

Code:
void SinTest (const float * __restrict x, float * __restrict y, size_t size)
{
  __assume_aligned(x,32);
  __assume_aligned(y,32);
  for (int i=0; i<size; i++) y[i] = sin(x[i]);
}
Code:
; mark_description "Intel(R) C++ Intel(R) 64 Compiler XE for applications running on Intel(R) 64, Version 15.0.2.179 Build 20150";
 
; [...]
 
.B1.4:: ; Preds .B1.14 .B1.3
L4:: ; optimization report
; LOOP WAS VECTORIZED
; VECTORIZATION SPEEDUP COEFFECIENT 9.953125
$LN24:
$LN25:
call __svml_sinf8 ;538.10
$LN26:
; LOE rbx rbp rsi rdi r12 r14 r15d xmm6 xmm7 xmm8 xmm9 xmm10 xmm11 xmm12 xmm13 xmm14 xmm15 ymm0
.B1.14:: ; Preds .B1.4
$LN27:
add r15d, 8 ;76.3
$LN28:
vmovups YMMWORD PTR [rsi+r14*4], ymm0 ;76.30
$LN29:
add r14, 8 ;76.3
$LN30:
cmp r15, rdi ;76.3
$LN31:
jb .B1.4 ; Prob 82% ;76.3

Will look into it. SVML is not being called anywhere that I can see in the assembly code. Probably need to reformat it. Still, I thought ICC would catch something like
x[0] = r[0]*cos(valpha[0]);
.
.
x[7] = r[7]*cos(valpha[7]);

Perhaps that's the wrong way to write it.

Have you looked at Yeppp! yet? I haven't done anything with it (couldn't get it to work properly in Java for some reason, though that may be user stupidity).

ICC is funny like that. GCC is probably going to work better for AMD build targets. Unfortunately, it may also require fundamentally different code in a few places depending on what libraries are in use and other things. Or maybe it won't.

So what toolchain are you using for GCC right now, and are you using it in Windows or some other OS? I can use GCC just fine in teh Lunix, but Eclipse CDT + MinGW aren't playing along well at all. I can compile using MinGW from the command-line, but that's about it.

Will try Yeppp! I've tried others like

http://software-lisc.fbk.eu/avx_mathfun/

and SLEEF, but SLEEF has zero documentation and avx_mathfun gives a ton of inline errors in the AVX header files (immintrin.h and others).

I'm using Code::Blocks with the newest version of GNU GCC (TDM-GCC):

http://tdm-gcc.tdragon.net/download

Select OpenMP during installation and change the linker settings in Code::Blocks. Code::Blocks has support for ICC but I cannot get it working.

I'm trying to get the Math Kernel Library (MKL) in ICC working (and SVML, like bronxzv suggested). I don't like using ICC, however, and would prefer something open source. Will try Yeppp! on GCC though.
 

bronxzv

Senior member
Jun 13, 2011
460
0
71
SVML is not being called anywhere looking at the assembly code. Probably need to format it. Still I thought ICC would catch something like
x[0]=r[0]*cos(valpha[0]);
.
.
x[7] = r[7]*cos(valpha[7]);

Perhaps that's the wrong way to write it.

I don't see the point of unrolling by hand; you seem to have the same strange idea as DrMrLordX on this. Just write
Code:
x[i] = r[i]*cos(valpha[i]);
in a loop; it's easy to maintain and vectorizes well.
 
Last edited:

Enigmoid

Platinum Member
Sep 27, 2012
2,907
31
91
I don't see the point of unrolling by hand; you seem to have the same strange idea as DrMrLordX on this. Just write
Code:
x[i] = r[i]*cos(valpha[i]);
in a loop; it's easy to maintain and vectorizes well.

Yeah. I tried both and I'm not seeing anything. Need to investigate. Anyway, some stuff came up and I'm going to have to put this project on hold for a few days.
 

Schmide

Diamond Member
Mar 7, 2002
5,745
1,036
126
I'm pretty sure you can't vectorize transcendentals. When the pipeline is 14-17 cycles and one instruction takes 100, it kind of defeats the purpose.

Now, one thing you can vectorize is Cephes-style operations.

http://gruntthepeon.free.fr/ssemath/

The nice thing is they can run 4 calculations in parallel in one pass.
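
For what it's worth, calling into that library looks roughly like the sketch below - assuming the header is sse_mathfun.h and exposes sin_ps() on __m128 vectors as described on that page (check the names in your copy):

Code:
#include <xmmintrin.h>
#include "sse_mathfun.h"   // from http://gruntthepeon.free.fr/ssemath/

// Four sine evaluations in one pass using the Cephes-style sin_ps().
void sin4(const float* in, float* out)
{
    __m128 x = _mm_loadu_ps(in);   // load 4 packed floats
    __m128 s = sin_ps(x);          // 4 sines at once
    _mm_storeu_ps(out, s);
}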

There has been a bit of talk about making things faster by loss of precision and such. Are there any declared rules for this?
 

DrMrLordX

Lifer
Apr 27, 2000
22,953
13,043
136
There has been a bit of talk about making things faster by loss of precision and such. Are there any declared rules for this?

Only that Dr. Cutress chose to use 32-bit floats, which limits accuracy (to an extent).
 

DrMrLordX

Lifer
Apr 27, 2000
22,953
13,043
136
Enigmoid: sorry it took me so long to think of this, but I would like to know if your i7-3630qm is throttling at all while running the Java 3DPMRedux? Or is it staying steady @ 3.2 GHz?
 

Enigmoid

Platinum Member
Sep 27, 2012
2,907
31
91
Enigmoid: sorry it took me so long to think of this, but I would like to know if your i7-3630qm is throttling at all while running the Java 3DPMRedux? Or is it staying steady @ 3.2 GHz?

Never checked throttling, but it runs Prime95 @ 3.2 without breaking a sweat.

Checked now: it's 3.2 solid.
 

DrMrLordX

Lifer
Apr 27, 2000
22,953
13,043
136
Okay, thanks. If you have the time, could you try disabling HT and rerunning 3DPM Stage 1 vs 3DPMRedux and see what happens? It's kind of a shot in the dark, but the problem my code has on Intel chips may be related to the way the thread pool is assigning workers . . . maybe.
 

borandi

Member
Feb 27, 2011
138
116
116
Just to let you guys know, I've been monitoring this thread for about a month and watching what you're doing. I've not had much time to post as I really want to put an in-depth piece together for you explaining some of it, but after a month on the road following Computex I'm now more time constrained than ever with main site business. Stay tuned, but don't expect miracles.

I have been contacted by DrMrLordX, I believe through email, and have shed some light on his questions. But I want to put down a few things here:

- Source code is lost. Not my fault: a jobsworth admin wiped my system in the lab because he uninstalled the graphics drivers and then wiped the data rather than fix his error. I thought I had a backup, but best laid plans and all that.
- Written in C++ using VS2012 and OpenMP extensions. All flags for optimising speed over anything else.
- The base of 3DPM is three-dimensional movement algorithms.

In the field I was working in (computational electrochemistry), 3D movement was defined as:

Code:
i = floor(rand(0,6));
if(i == 0) {moveup();}
if(i == 1) {movedown();}
if(i == 2) {moveleft();}
if(i == 3) {moveright();}
if(i == 4) {moveback();}
if(i == 5) {moveforward();}

This is essentially the Bipyramidal method in 3DPM. It's quick and dirty, but ultimately it doesn't move particles on a sphere but on a bipyramid, reducing accuracy.

3DPM does 6 algorithms for 3D movement, all of which are in the literature, some of which have been discussed in this thread.

There is a trap with 3D movement: just taking two random angles and calculating a position on the sphere gives an uneven distribution at the poles. I go through this in detail in my research, because it seems like an easy trap to fall into (and I fell into it for about 2 weeks).

The method I outlined in my pseudo code before - the even distribution over Z and a random angle for X,Y - is a proven spherical movement algorithm (as in mathematically proven, if you integrate over the surface of a sphere) and ends up being the fastest on most architectures if you design accordingly.

If anyone has access to online scientific journals, my main paper on this is here:

http://www.sciencedirect.com/science/article/pii/S1572665711001068

I think I go over the algorithms in more detail in the supporting information.

Or if you have access to Oxford University's PhD Thesis library, I'm in there. Unfortunately I can't give the thing out, due to publishers copyright and such.

But these are the methods I used for 3DPM, as I believe they are called (or perhaps I made up these names):

Bipyramidal Method (quick dirty ifs)
Cosine Method (spherical polar coordinates, integrate over surface area element)
Normal Deviate Method (xyz are (0,1], normalise on sphere)
Hypercube Rejection Method (xyz are (-1,1], reject those outside the sphere, scale to sphere)
Trigonometric Method (Uniform over z, random angle to calc xy)
2-D Rejection (Uniform over z then random XY over (-1,1] and reject if outside sphere)

Each of these differs in speed due to the various math calls. Ultimately you want to avoid division at all costs (if the divisor is a constant, hoist its reciprocal out to a variable and multiply instead), but the algorithm should spend most of its time in the cosf/sinf functions.
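
To make the Trigonometric Method concrete, a single step looks something like this - a minimal sketch rather than the original 3DPM source (which is lost), with u1 and u2 standing in for two draws from your (0,1] generator:

Code:
#include <cmath>

struct Vec3 { float x, y, z; };

// One unit step in a uniformly random direction: uniform over z,
// random angle for x,y.
Vec3 trigStep(Vec3 p, float u1, float u2)   // u1, u2 in (0,1]
{
    const float ct  = 2.0f*u1 - 1.0f;            // cos(theta), uniform over z
    const float st  = std::sqrt(1.0f - ct*ct);   // sin(theta)
    const float phi = 6.2831853f*u2;             // random angle in (0, 2*pi]
    p.x += st*std::cos(phi);
    p.y += st*std::sin(phi);
    p.z += ct;
    return p;
}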

For anyone who's wondering about accuracy, I used single precision here. I have done work in partial differential equations which requires double precision due to iterations over matrices, but the diffuse and parallel nature of this problem is not affected by precision to that extent. I have a feeling using 8-bit numbers might even help. But for as many functions as I could, I used the lower-accuracy models.

Technically, I used 3DPM for verification of my GPU version of the program. Being a parallel problem, GPUs were much faster and that's what I used in my published papers. But at the time the best on the market was a GTX 280, so go figure, but I was still reducing the time it took other people to do similar work down from one month per simulation to one minute. CPUs got me some of the way there (better algorithm, multithreaded) but GPUs made this research a tangible asset going forward.

With regard to random number generation, I used the Ranq2 algorithm for CPUs from Numerical Recipes 3rd Edition: The Art of Scientific Computing (https://*******/WZGYPr), but modified it slightly to output between (0,-1]. If anyone's interested, I used XORWOW for GPUs.

Another element to speed up the CPU simulation, which was perhaps dubious in my pseudo code earlier, is that you do not need to generate the particles before the threads are issued. Initiate the xyz coordinates (and a Ranq2 sequence) when you start the thread and then loop over the steps as necessary. When the steps end, find the final distance from the origin and dump that into an overriding array so the thread's work doesn't get compiled out (make the array the size of the particle count, then do a final reduction for the average distance from origin over all particles - it should be near 0 if the RNG is good), then move to the next thread.

I believe my numbers were 10000 threads at 100 steps each, though I'm not 100% sure on that. It was definitely enough steps to cover thread generation and enough threads to make the big CPUs sweat for a while. In 3DPM, single thread spends 5 seconds on each algorithm and then does a calc for speed; multithread spends 10 seconds on each algorithm (do until time > 60 sec, calc steps then divide by time, making sure you keep accuracy).
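
As a rough, modern reconstruction of that scheme (again, not the lost original source - std::mt19937 here is just a stand-in for a per-thread Ranq2 sequence, and the step is the Trigonometric Method sketched above), with OpenMP it would look something like:

Code:
#include <cmath>
#include <cstdio>
#include <random>
#include <vector>
#include <omp.h>

int main()
{
    const int particles = 10000;
    const int steps     = 100;
    std::vector<float> finalDist(particles);   // overriding results array

    #pragma omp parallel
    {
        // per-thread RNG, initialised inside the thread
        std::mt19937 gen(12345u + omp_get_thread_num());
        std::uniform_real_distribution<float> uni(0.0f, 1.0f);

        #pragma omp for
        for (int p = 0; p < particles; ++p)
        {
            float x = 0.0f, y = 0.0f, z = 0.0f;   // particle created in the thread
            for (int s = 0; s < steps; ++s)
            {
                const float ct  = 2.0f*uni(gen) - 1.0f;     // uniform over z
                const float st  = std::sqrt(1.0f - ct*ct);
                const float phi = 6.2831853f*uni(gen);      // random angle
                x += st*std::cos(phi);
                y += st*std::sin(phi);
                z += ct;
            }
            finalDist[p] = std::sqrt(x*x + y*y + z*z);      // distance from origin
        }
    }

    // final reduction over all particles
    double sum = 0.0;
    for (float d : finalDist) sum += d;
    std::printf("mean final distance: %f\n", sum / particles);
    return 0;
}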

Regarding negative z direction movement, I distinctly remember adjusting this slightly.
So the idea is that it should bounce off a solid plane, but you only perform the 'bounce' if the particle ends up below the plane. e.g.

z-value at each step, where plane is z=0

0.7
0.8
0.2
-0.3 -> becomes 0.3
0.4
-0.5 -> becomes 0.5
0.6

etc.

I remember in my code that the boundary for z was actually -10, not 0.
So

if (z < -10) {z = z - 2*(z+10);}

Or something similar - double check to make sure that's right! In other words, it should bounce off the z=-10 plane. I did this so the final if statement would have less effect, though in the simulation only a few particles hit it anyway. (It actually makes a big difference when you're dealing with warps of particles on a GPU, as each particle in the warp has to execute the code even if it doesn't apply.)
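
In code form, my best reconstruction of that bounce (with the plane as a parameter rather than hard-coded at -10; double check it against your own implementation):

Code:
// Reflect off the solid plane at z = zPlane, but only when the step
// has left the particle below it.
inline float bounceZ(float z, float zPlane)   // zPlane was -10 in my code
{
    if (z < zPlane)
        z = z - 2.0f*(z - zPlane);   // mirror about the plane
    return z;
}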

I think that's most of the basic stuff here. Clearly at some point in the future it needs a re-write, and I think I'll have to come back here and we can discuss potential optimizations and make it a 'fairer' test as it were. I still believe it's fair, because it's basic code that I wrote for my work at the time, even though it's not CompSci inspired and I was learning as I was doing.

If anyone has any questions, I'd be happy to answer them to the best of my abilities. I have an email subscription to the thread. Or email me direct through the main AT website, that works too.
 

DrMrLordX

Lifer
Apr 27, 2000
22,953
13,043
136
Good post! I think that clarifies many questions people might still have about the benchmark. One of the things I'm learning about 3DPM is that it seems to gain a great deal from Hyperthreading, which hints at possible pipeline stalls. It might also explain why AMD processors have so much trouble with it.

edit: and yeah that was me with the email
 

Ken g6

Programming Moderator, Elite Member
Moderator
Dec 11, 1999
16,700
4,661
75
Wow, scheduling FP instructions is hard! I managed to speed up minicos on Wolfdale a little:
Code:
float result = x - (x3c * x * x2) + (x5c * x * x4) - (x7c * x * x2 * x4);
But that's because you can only have 4-6 fmuls in flight at once. On Sandy/Ivy you can have ten. (!) Haswell has one port dedicated to FMA, so unless your compiler is crazy-smart you can only have 5 again. So this might help there too.

If you're wondering where I got those numbers: http://agner.org/optimize/#manuals

This should really be done in C with SSE or AVX intrinsics. That way you can batch up groups of 4 or 8 floats to work on at the same time. (But AVX would kill AMD performance.)
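
As a rough sketch (not tested against 3DPMRedux - x3c/x5c/x7c are placeholders for whatever coefficients the benchmark actually uses), batching 8 floats with AVX intrinsics for the polynomial above might look like:

Code:
#include <immintrin.h>

// Evaluate x - x3c*x^3 + x5c*x^5 - x7c*x^7 for 8 floats at once.
void minicos8(const float* in, float* out, float x3c, float x5c, float x7c)
{
    __m256 x  = _mm256_loadu_ps(in);
    __m256 x2 = _mm256_mul_ps(x, x);
    __m256 x4 = _mm256_mul_ps(x2, x2);

    __m256 t3 = _mm256_mul_ps(_mm256_set1_ps(x3c), _mm256_mul_ps(x, x2));
    __m256 t5 = _mm256_mul_ps(_mm256_set1_ps(x5c), _mm256_mul_ps(x, x4));
    __m256 t7 = _mm256_mul_ps(_mm256_set1_ps(x7c),
                              _mm256_mul_ps(x, _mm256_mul_ps(x2, x4)));

    __m256 r = _mm256_sub_ps(_mm256_add_ps(_mm256_sub_ps(x, t3), t5), t7);
    _mm256_storeu_ps(out, r);
}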
 

DrMrLordX

Lifer
Apr 27, 2000
22,953
13,043
136
Interesting. The same modification slowed performance on my Kaveri by a hair (102400 steps was ~500ms slower).

I believe that Enigmoid's efforts were going in the general direction you indicated, though he is leaning on OpenMP for multithreading and relying on the compiler to autovectorize. I know I do not currently have the skill to pull that off. Java allegedly autovectorizes where possible, and there has reportedly been AVX support since Java 6 (or so) and AVX2 support since Java 8. Obviously that is not going to produce results as strong as C with intrinsics.
 

Schmide

Diamond Member
Mar 7, 2002
5,745
1,036
126
I did a bit of thinking on this a week ago when I had time, and I think there needs to be a set of constraints on the problem. As it is, no matter how many threads you spawn, it is still basically just 2 calls to 2 unique libraries (rand and the transcendentals). In the world of simulations it is really a brute-force method, with results somewhat akin to DFT vs FFT in computational complexity - i.e. it works fine for small samples, but once you get to large resolutions you're just bashing your face into the wall.

I remembered reading about Monte Carlo in 3D space for lighting a while ago and was able to pull up the GPU Gems 3 article on it (here). They use a texture to sample random light vectors around the pixel in question to produce a good area distribution, and further reduce aliasing with mipmaps.

So why use Monte Carlo to integrate? Unless you have a very tame function, certain errors can crop up. If, for example, you point-sample a function, you are susceptible to aliasing and harmonics. If your random sample is biased in some manner, the results may indicate trends that are not actually part of the solution. Quantifying small differences over large areas can also lead to computational pitfalls.

Slightly related: noise is often essential to making ordered systems look natural; Perlin noise is one such example. It's a different application using similar ingredients.

So the algorithm I was thinking of involves hemisphere based arcs remapped over one octant of the sphere. Build basis vectors from those and wobble them using vector math and normalizing magnitudes. (this maintains unity) Then use statically sampled symmetry to distribute points over the sphere.

pseudo code

1) Build cos table first quadrant (power of 2 size). This is first arc at equator. (x, y) = (cos ordered, cos reversed).

2) Project into octant scaling by cos2.

Each arc is xyz ( cos * cos2, sin * cos2, sin2) which is equivalent to walking the original table with the following values.
cos = ordered
cos2 = ordered 2x index
sin = reversed
sin2 = reversed 2x index

3) Basis vectors are
x = arcs in order
y = reverse order -x component
z = first arc y -x components

4) Wobble by adding scaled tangent vectors and readjusting magnitude.

5) Sample into other octants (neg, invert, etc) based on points per area.

Just my thoughts.
 
Last edited:

DrMrLordX

Lifer
Apr 27, 2000
22,953
13,043
136
Schmide, I'll let you and borandi sort that out. I have nothing especially useful to add to that conversation.

While my C++ efforts have not been particularly inspiring, Cogman offered some suggestions that helped me speed up my Java code. Here is build 782015:

https://www.dropbox.com/s/c2pd4hicoy93bdb/3DPMRedux782015.zip?dl=0

Source:

https://www.dropbox.com/s/i8alkrsn99so3r7/3DPMReduxSource782015.zip?dl=0

Try it out! On my 7700k, it got me up to ~75M steps per second @ 3.4 GHz (no turbo).

Build: 782015
102400 steps completed.
Mode: multi-threaded

It took 13634 milliseconds to complete the workload.
(75.10635M steps per second)
It took 20 milliseconds to write the output file.
 
Last edited:

Enigmoid

Platinum Member
Sep 27, 2012
2,907
31
91
Schmide, I'll let you and borandi sort that out. I have nothing especially useful to add to that conversation.

While my C++ efforts have not been particularly inspiring, Cogman offered some suggestions that helped me speed up my Java code. Here is build 782015:

https://www.dropbox.com/s/c2pd4hicoy93bdb/3DPMRedux782015.zip?dl=0

Source:

https://www.dropbox.com/s/i8alkrsn99so3r7/3DPMReduxSource782015.zip?dl=0

Try it out! On my 7700k, it got me up to ~75M steps per second @ 3.4 GHz (no turbo).

95M/s on the 3630qm (but I am doing other things on the computer as well at the same time).
 

DrMrLordX

Lifer
Apr 27, 2000
22,953
13,043
136
Okay, an improvement, but still behind the original 3DPM. Something I am doing is inadvertently hurting Ivy Bridge. It may come down to the way I have set up the minicos method, per Ken g6's experience on Wolfdale.
 

DrMrLordX

Lifer
Apr 27, 2000
22,953
13,043
136
Cogman showed the modifications he made to the code to run it faster on his machine here:

http://forums.anandtech.com/showthread.php?t=2438542

I cribbed some more stuff from him that made my 7700k run faster than 782015. The result is 792015:

https://www.dropbox.com/s/mxzoy30cj79n3u2/3DPMRedux792015.zip?dl=0

Source:

https://www.dropbox.com/s/yne3bdaeusmqndy/3DPMReduxSource792015.zip?dl=0

A10-7700k @ 3.4 GHz (no turbo) is up to ~78M movements per second:

Build: 792015
1024000 steps completed.
Mode: multi-threaded

It took 131236 milliseconds to complete the workload.
(78.027374M steps per second)
It took 15 milliseconds to write the output file.

As an added bonus, I think the way Cogman changed the minicos method jibes somewhat with the changes suggested above by Ken g6, so it ought to improve performance on his Wolfdale without hurting performance on my Kaveri. Maybe!
 

DrMrLordX

Lifer
Apr 27, 2000
22,953
13,043
136
Holy cripes on toast! Switching from MersenneTwisterFast to SplittableRandom (part of Java 8) was an enormous boost to performance. Here's build 7122015:

https://www.dropbox.com/s/k18zcf6yza6e63k/3DPMRedux7122015.zip?dl=0

source:

https://www.dropbox.com/s/artotdvmbra5c1l/3DPMReduxSource7122015.zip?dl=0

Note that this version has no MTFast implementation, so I have excised the license.

Aaaaaaand here's the performance on my 7700k @ 3.4 GHz (no turbo), all else default settings:

Build: 7122015
1024000 steps completed.
Mode: multi-threaded

It took 98785 milliseconds to complete the workload.
(103.659454M steps per second)
It took 24 milliseconds to write the output file.

~103M steps per second is awesome! That's ~139% faster than the performance of the same CPU at the same clockspeed in Stage 1 of Dr. Cutress' 3DPM. And I did it all in Java. Yaaahooo!

Enigmoid, would you mind running this build on your 3630qm? Hopefully it'll break 120M steps per second. Hopefully.
 
Last edited:

Enigmoid

Platinum Member
Sep 27, 2012
2,907
31
91
Lol... it's broken on my computer. I get 83M steps/sec.

I'm just running the exe, so maybe it's a Java runtime problem.
 
Last edited:

DrMrLordX

Lifer
Apr 27, 2000
22,953
13,043
136

borandi

Member
Feb 27, 2011
138
116
116
I did a bit of thinking on this a week ago when.

snip

This simulation has nothing to do with graphics. There are a lot of nuances in graphics, especially with random numbers, to avoid edge-case issues and maintain uniformity. We're not talking about effects being placed onto 3D environments, where particle numbers are small and edge cases happen more often - the larger the particle numbers, the less likely an overall edge effect occurs.

The random number generators are more complex than rand() due to the use of billions of particle movements over 3D space. Each particle moves several hundred times, rather than an effect being placed once and being fixed.

The method of movement you outlined is akin to one of the cosine or hypercube rejection methods listed above and adds a lot of extra steps, some of which (particularly the inverse) are computationally expensive. When dealing with graphics, where you have a couple of million pixels, that's usually fine - but dealing with billions of particles is several orders of magnitude more time consuming.


-------

The battle here is PRNG vs movement. It would be prudent to do a kernel timing profile of PRNG generation against movement to see which requires the most CPU time. The Mersenne Twister is an example of an expensive PRNG that most people like to use, because it's fairly well documented and offers several advantages. Ultimately, a lightweight PRNG with a period of 2^100 should be sufficient here.
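
For example, the weight class I mean is something like the generic xorshift64* sketch below - not the Ranq2 from Numerical Recipes, and its period is 2^64 - 1 rather than 2^100, but it shows how cheap each number can be:

Code:
#include <cstdint>

// Generic xorshift64* sketch: three shifts, three xors and one multiply
// per 64-bit output. The seed must be non-zero.
struct XorShift64Star
{
    uint64_t state;
    explicit XorShift64Star(uint64_t seed) : state(seed) {}

    uint64_t next()
    {
        state ^= state >> 12;
        state ^= state << 25;
        state ^= state >> 27;
        return state * 0x2545F4914F6CDD1DULL;
    }

    // Map the top 24 bits to a float in (0, 1].
    float nextFloat()
    {
        return ((next() >> 40) + 1) * (1.0f / 16777216.0f);
    }
};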

DrMrLordX - did you delegate the particle generation to inside the MT loop, so particles are created and destroyed within the thread, rather than keeping an overriding array that tracks the data?
 

Schmide

Diamond Member
Mar 7, 2002
5,745
1,036
126
This simulation has nothing to do with graphics. There are a lot of nuances in graphics, especially with random numbers, to avoid edge-case issues and maintain uniformity. We're not talking about effects being placed onto 3D environments, where particle numbers are small and edge cases happen more often - the larger the particle numbers, the less likely an overall edge effect occurs.

Regardless of what a simulation does, it is the application of random numbers that relates the operations. By definition, when you start with a 3D particle you have started the simulation in a 3D environment. Whatever constraints you place after that reduce that subspace. I do not know the details of the original simulation, but I assume it has something akin to Euler's method, i.e. sampling a lot of small steps can approximate the integration of a function.

The random number generators are more complex than rand() due to the use of billions of particle movements over 3D space. Each particle moves several hundred times, rather than an effect being placed once and being fixed.

This really doesn't make sense to me. A random number is a random number. How you apply those numbers is what makes a simulation complex. You seem to be justifying a separation from rand() by iteration count. Further, random numbers in practice are only pseudo random numbers generated by a function. A lot of the complexity of this problem seems to come from the first application of the random numbers.

What I and other 3D random-direction generators are doing is, rather than starting with multiple linear ranges that are mapped into the unit 3D domain, starting in the unit 3D domain and applying operations that work within that subspace.

Your method of movement outlined is akin to one of the cosine or hypercube rejection methods listed above and implements a lot of extra steps, some of which (particularly inverse) are computationally expensive. When dealing with graphics, where you have a couple of million pixels, that's usually fine - but dealing with billions of particles is several orders of magnitude more time consuming.

The only thing it has in relation to the cosine or hypercube methods is that it starts in one octant and then branches out to the others from there. It never rejects anything, but rather resamples the environment and maps those samples to skewed unit positions. (They are only skewed on the unit sphere and maintain unity.)

There is no inverse? I don't know if you read "invert" and thought "inverse"? The invert is basically taking a vector in 3-space and projecting it in the opposite direction (antipodal). That, together with switching components and negation, can map one octant onto all octants.

Edit: I guess the other place where you could assume inverse (normalize) was in the wobble. There is no normalize as it is just a scaled unit vector add and multiply. (/edit)

Off the top of my head the number of calculations in my proposed method is

# cos in the original arc (2^N cos) (basically nothing)

Remapping of the arcs into the positive octant (3 mults) (basically nothing)

Remapping of those points into basis vectors (6 memory copies) (basically nothing)

Random sampling of bases - (memory operations no worse than rand)

Wobble - 2 vector (mult, add, mult) (3*3*3*2=18) (may be applied twice)

I was also thinking of converting the original arc points into quaternions and slerping randomly between them. This would reduce the wobble from I believe 18-36 to 12-14.

I may write this, I may not. I was really just enjoying the review process.
 
Last edited: