3DPM: Can we fix it? Yes we can!


DrMrLordX

Lifer
Apr 27, 2000
22,706
12,663
136
Epic necro. It hurts a little seeing @The_Stilt posting in a thread, along with @borandi. But time marches on.

Glad it's still available.

Maybe someday I'll revisit my janky Java version and attempt to emulate all six modes based on the pseudocode @borandi provided. That day is not today.
 

Schmide

Diamond Member
Mar 7, 2002
5,712
978
126
If I knew then what I know now, I could have expressed my thoughts better.

I still think that a half-angle table and a quasi-quaternion slerp would avoid a lot of expensive trigonometry calculations.
 
Jul 27, 2020
26,117
18,016
146
1751653868567.png

It compiled but not without some "head scratching" thanks to being a Visual Studio project.

Gonna try to test it on my Tiger Lake laptop when I get the time to see if the AVX512 optimization helps or not. There are no explicit AVX512 instructions in the code unfortunately.
 

DrMrLordX

Lifer
Apr 27, 2000
22,706
12,663
136
From what I understand, Dr. Cutress got a third party to help him with the final builds of 3DPMv2 that he used in some of his later benchmark articles for AT. It did utilize AVX2 and AVX512, and apparently it didn't rely just on autovectorization via OpenMP.
 
Jul 27, 2020
26,117
18,016
146
1751714538125.png

Pretty awful. Looked in the IDE project options and enabled OpenMP.

No ISA extensions

1751718642240.png

SSE

1751719010270.png

SSE2

1751719202047.png

AVX

1751719598497.png

AVX2

1751715624781.png

AVX-512 256-bit vector length

1751719867656.png

AVX-512 512-bit vector length

1751720097785.png

Highest score came from the AVX executable.

Now I need to figure out how to compile this as a 64-bit executable. I hate Visual Studio.

Looks like Dr. Ian got a pretty seriously heavy uplift from the manual optimization.
 

Schmide

Diamond Member
Mar 7, 2002
5,712
978
126
OK, since this is back in my head, I will expand on why my algorithm was designed the way it was.

The original algorithm is flawed. The whole point of it is to produce random vectors to simulate a particle moving randomly. In post #4 a simple explanation of it is given.

This is a mapping problem that cartographers have mulled over endlessly and is the reason maps are so distorted.

The algorithm should be a bit different. Some simple code:

Code:
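    // Needs <math.h>; on MSVC, define _USE_MATH_DEFINES before including it to get M_PI.
    // ang[0] is the angle in the x/y plane, ang[1] is the z (elevation) angle, both in radians.
    double x, y, z, zp, ang[2], mag;
    ang[0] = 22.5 / 180.0 * M_PI;
    ang[1] = 30.0 / 180.0 * M_PI;
    x = sin(ang[0]);
    y = cos(ang[0]);
    z = sin(ang[1]);
    zp = cos(ang[1]);                   // z': radius of the circle at that elevation
    x *= zp;                            // scale x and y so the vector keeps unit length
    y *= zp;
    mag = sqrt(x * x + y * y + z * z);  // comes out as 1.0, up to rounding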

This adds a z' (zp in the code) that scales x and y, which normalizes the z transform and maintains 3D polar coordinates.

As long as the angles chosen are < 45 degrees, this will perform a close-to-uniform mapping onto the equatorial area of the sphere. However, as the z angle approaches the middle latitudes (parallels), the mapping compresses: in this example, when the z angle reaches 45 degrees it is down to about 70% of the equatorial area (cos 45° ≈ 0.707), and it then drops off rapidly towards the poles, where it reaches zero.

This will statistically skew the frequency and randomness of the data. If one did not restrict the angles to < 45 degrees, or used a non-normalized z coordinate, the skew would be even greater.
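The 70% figure can be checked directly: on a unit sphere, a thin band at latitude φ has an area proportional to cos φ, and the zone between latitudes -φ and +φ covers sin φ of the whole surface. The following is a minimal sketch of that arithmetic; the function names are made up for illustration and do not come from the 3DPM source.

```cpp
#include <cassert>
#include <cmath>

// Relative surface area per unit of latitude angle, normalized so that the
// equator scores 1.0. A thin band at latitude phi on a unit sphere has
// circumference 2*pi*cos(phi), so the factor is simply cos(phi).
double band_area_factor(double latitude_deg) {
    assert(latitude_deg >= -90.0 && latitude_deg <= 90.0);
    const double pi = std::acos(-1.0);
    return std::cos(latitude_deg * pi / 180.0);
}

// Fraction of the sphere's surface lying between latitudes -phi and +phi.
// That zone has area 4*pi*sin(phi) out of a total of 4*pi.
double covered_area_fraction(double max_latitude_deg) {
    assert(max_latitude_deg >= 0.0 && max_latitude_deg <= 90.0);
    const double pi = std::acos(-1.0);
    return std::sin(max_latitude_deg * pi / 180.0);
}
```

Calling band_area_factor(45.0) gives roughly 0.707, which matches the compression quoted above, and the factor falls to zero at the poles.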

Why not just use 3 random values and normalize the final outcome? Well, in this case you're mapping a cube to a patch, thus adding all the statistical biases of the cube, which makes things even worse.
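To put a number on the cube bias, here is a small sketch under one assumption: the three random values are uniform in [-1, 1]. A ray from the center leaves the cube at distance r = 1 / max(|x|, |y|, |z|) for a unit direction, and the number of cube points that normalize onto a given direction grows as r^3. The function name is hypothetical.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>

// Relative density of directions obtained by picking a point uniformly in
// the cube [-1,1]^3 and normalizing it. A direction along an axis scores 1.0.
// The argument does not need to be normalized.
double cube_direction_density(double x, double y, double z) {
    const double len = std::sqrt(x * x + y * y + z * z);
    assert(len > 0.0);
    // Largest component of the unit direction decides which face the ray hits.
    const double m = std::max({std::fabs(x), std::fabs(y), std::fabs(z)}) / len;
    const double r = 1.0 / m;   // distance from the center to the cube surface
    return r * r * r;           // volume per unit solid angle scales as r^3
}
```

For the corner direction (1, 1, 1) this returns 3·√3 ≈ 5.2, so directions towards the cube's corners would be drawn about five times as often as directions along the axes.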

If one is willing to accept the above patch mapping (not the cube), you can further randomize the vector by randomly permuting and negating its components. This will map to the entire sphere with some overlap and bias. This is better than the original algorithm and may be acceptable, but it's still not perfect.
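The permute-and-negate step can be sketched as follows. This is an illustration, not the poster's code: the type alias, the function name and the encoding of the choice as a permutation index plus a sign mask are all assumptions. There are 6 orderings of the axes and 8 sign combinations, so each input vector has 48 possible images.

```cpp
#include <array>
#include <cassert>

using Vec3 = std::array<double, 3>;

// Apply one of the 48 signed axis permutations to v.
// perm (0..5) selects an ordering of (x, y, z); bit i of signs (0..7)
// negates output component i. The length of v is unchanged.
Vec3 permute_negate(const Vec3& v, unsigned perm, unsigned signs) {
    static const int order[6][3] = {
        {0, 1, 2}, {0, 2, 1}, {1, 0, 2}, {1, 2, 0}, {2, 0, 1}, {2, 1, 0}};
    assert(perm < 6 && signs < 8);
    Vec3 out{};
    for (int i = 0; i < 3; ++i) {
        const double c = v[order[perm][i]];
        out[i] = ((signs >> i) & 1u) ? -c : c;
    }
    return out;
}
```

In use, perm and signs would each come from the random number generator, for example one draw in [0, 5] and one in [0, 7].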

What my algorithm did, although it was poorly implemented, was to maintain 3 basis vectors, then roll, pitch and yaw them relative to their own coordinate system. This, along with inverting and permuting the system randomly, would produce a purely random system as long as the random angles were chosen appropriately (< 45 degrees).

Issues with this implementation: it needs about 5 times the calculation and may be subject to vector drift, requiring stabilization.

After all these years, and a few days of thinking about it, the issues above can be alleviated by just randomizing the roll, pitch and yaw from the base vectors, then getting the rest of the randomness from permuting and negating those vectors. This would give 3x the output, full uniform coverage of the sphere, and less reliance on polar values.
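One possible reading of that last idea, sketched under assumptions: rotate the standard basis by a random roll, pitch and yaw, and emit all three rotated basis vectors, so each set of angles yields three unit vectors. The rotation order (yaw about z, then pitch about y, then roll about x) and all names here are assumptions for illustration, not the poster's implementation, and the sketch makes no claim about how uniform the result is.

```cpp
#include <array>
#include <cassert>
#include <cmath>

using Vec3 = std::array<double, 3>;

double dot(const Vec3& a, const Vec3& b) {
    return a[0] * b[0] + a[1] * b[1] + a[2] * b[2];
}

// Returns the images of the x, y and z axes under
// R = Rz(yaw) * Ry(pitch) * Rx(roll), i.e. the three columns of R.
// Angles are in radians. The three results are orthonormal.
std::array<Vec3, 3> rotated_basis(double roll, double pitch, double yaw) {
    assert(std::isfinite(roll) && std::isfinite(pitch) && std::isfinite(yaw));
    const double cr = std::cos(roll),  sr = std::sin(roll);
    const double cp = std::cos(pitch), sp = std::sin(pitch);
    const double cy = std::cos(yaw),   sy = std::sin(yaw);
    return {{
        {cy * cp,                sy * cp,                -sp},      // image of x
        {cy * sp * sr - sy * cr, sy * sp * sr + cy * cr, cp * sr},  // image of y
        {cy * sp * cr + sy * sr, sy * sp * cr - cy * sr, cp * cr},  // image of z
    }};
}
```

Because the basis is rebuilt from the base vectors on every call rather than rotated incrementally, rounding errors do not accumulate, which is how this variant would sidestep the drift and stabilization issue mentioned above.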
 
Jul 27, 2020
26,117
18,016
146
I think I'm in love with a new compiler: https://github.com/skeeto/w64devkit/releases/tag/v2.3.0

Command-line: g++ -Wall -g -fopenmp -static Dr_Ian_3DPM.cpp

1751733127823.png

It's fully PORTABLE!

Just look at the side by side comparison with VStudio's highest score:

1751733641430.png

g++ -Wall -g -fopenmp -static -march=cascadelake Dr_Ian_3DPM.cpp

1751733937427.png

1751734054733.png


AND THEN...

g++ -Wall -g -fopenmp -static -march=cascadelake -Ofast -frename-registers Dr_Ian_3DPM.cpp

1751735377569.png

BEAST MODE ACTIVATED!!!!! :eek:

1751735543317.png

1751735716293.png

Despite losing three test scores to Dr. Ian's hand optimized executable, the open source compiled exe comes out a winner by 9.97% !!!

Thanks, @DrMrLordX ! A simple mention of 3DPM led to so much amazing learning!
 
Jul 27, 2020
26,117
18,016
146
So two advantages of that BEAUTIFUL portable compiler.

The EXE compiled as 64-bit, and it uses some better threading model (or maybe a better version of OpenMP): the VStudio EXE topped out at 80% CPU utilization, but the g++ 15.1.0 EXE maxes the CPU cores out so much that even mouse movement becomes laggy, so it makes for a really great MT stress test too.
 

MS_AT

Senior member
Jul 15, 2024
743
1,509
96
Despite losing three test scores to Dr. Ian's hand optimized executable, the open source compiled exe comes out a winner by 9.97% !!!
Since you have enabled -Ofast (which turns on -ffast-math), it would be nice to test if the results are still correct. It does not matter how fast something is if it is producing wrong results ;) If you want to be on the safer side, stick to -O3. For further reading: https://gcc.gnu.org/onlinedocs/gcc/Optimize-Options.html#index-Ofast
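A correctness check of that kind could look like the sketch below: dump the results from an -O3 build and from the -Ofast build, then compare them with a tolerance. The function name and the idea of comparing flat arrays of doubles are assumptions; the actual 3DPM output format is not shown in this thread. The checker itself should be built without -ffast-math, since that mode lets the compiler assume NaN and infinity never occur.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <limits>
#include <vector>

// Largest relative difference between a reference result set and a
// candidate result set of the same length. Any NaN or infinity in either
// set is reported as an infinite error.
double max_relative_error(const std::vector<double>& reference,
                          const std::vector<double>& candidate) {
    assert(reference.size() == candidate.size());
    double worst = 0.0;
    for (std::size_t i = 0; i < reference.size(); ++i) {
        const double a = reference[i];
        const double b = candidate[i];
        if (!std::isfinite(a) || !std::isfinite(b))
            return std::numeric_limits<double>::infinity();
        const double scale = std::fmax(std::fabs(a), std::fabs(b));
        if (scale > 0.0)
            worst = std::fmax(worst, std::fabs(a - b) / scale);
    }
    return worst;
}
```

What counts as an acceptable error is a judgment call for the benchmark author; small differences in the last few digits are expected once floating-point operations are reordered.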

It would also be nice to list the flags you passed to MSVC, for completeness.

AVX512F target but that CPU IS Cascade Lake.
Not really. Besides, with gcc you can enable only AVX512F (-mavx512f) without specifying the arch.
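To confirm what a given set of flags actually allows the compiler to target, the predefined macros can be inspected at compile time. A minimal sketch, with a hypothetical function name: GCC and Clang define __AVX512F__, __AVX2__ and __AVX__ from the corresponding -m flags or from -march, and MSVC defines them under the matching /arch option. A build line such as g++ -O3 -fopenmp -mavx512f Dr_Ian_3DPM.cpp is only an example combination of real flags, not a command taken from the thread.

```cpp
#include <cassert>

// Widest x86 vector register width, in bits, that this translation unit
// was compiled to use. Returns 0 when no known x86 vector macro is set.
int compiled_vector_bits() {
    int bits = 0;
#if defined(__AVX512F__)
    bits = 512;
#elif defined(__AVX2__) || defined(__AVX__)
    bits = 256;
#elif defined(__SSE2__) || defined(_M_X64) || \
    (defined(_M_IX86_FP) && _M_IX86_FP >= 2)
    bits = 128;
#endif
    assert(bits == 0 || bits == 128 || bits == 256 || bits == 512);
    return bits;
}
```

This only reports what the compiler was permitted to emit. Whether the hot loops really use 512-bit instructions still has to be checked in the generated assembly.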
 
Jul 27, 2020
26,117
18,016
146
It would also be nice to list the flags you passed to MSVC, for completeness.
Code:
/vlen=512 /Yu"stdafx.h" /ifcOutput "Release\" /GS /Qpar /GL /analyze- /W3 /Gy /Zc:wchar_t /Zi /Gm- /O2 /Ob2 /Fd"Release\vc143.pdb" /Zc:inline /fp:precise /D "WIN32" /D "NDEBUG" /D "_CONSOLE" /D "_UNICODE" /D "UNICODE" /errorReport:prompt /WX- /Zc:forScope /arch:AVX512 /Gd /Oy- /Oi /MD /openmp /FC /Fa"Release\" /EHsc /nologo /Fo"Release\" /Ot /Fp"Release\Dr_Ian_3DPM.pch" /diagnostics:column
 

MS_AT

Senior member
Jul 15, 2024
743
1,509
96
If you wish to remove redundant and/or duplicate options, consult https://learn.microsoft.com/en-us/c...iler-options-listed-by-category?view=msvc-170
 

Schmide

Diamond Member
Mar 7, 2002
5,712
978
126
I did some vibe coding (GitHub Copilot), and for the most part it is memory bound on my 9950X3D. It uses a std::mersenne_twister_engine for random values.

Code:
[Standard] Generated 134217728 random normalized vectors in 177.939 ms.
Vector 0: (0.453678, 0.710586, 0.537815, 1)
Vector 1: (0.121839, -0.833704, 0.538603, 1)
Vector 2: (0.80229, -0.26128, -0.536715, 1)
Vector 3: (-0.467435, 0.526024, 0.710495, 1)
Vector 4: (-0.39317, 0.833771, 0.387613, 1)
[AVX] Generated 134217728 random normalized vectors in 168.096 ms.
Vector 0: (0.720739, 0.504543, 0.475365, 1)
Vector 1: (-0.116079, -0.275841, -0.954168, 1)
Vector 2: (0.698501, 0.150877, 0.699523, 1)
Vector 3: (0.469703, 0.634067, -0.614279, 1)
Vector 4: (0.568022, -0.415479, -0.710442, 1)
[AVX-512] Generated 134217728 random normalized vectors in 161.694 ms.
Vector 0: (-0.0977937, 0.84237, 0.529952, 1)
Vector 1: (-0.257729, -0.572719, -0.778183, 1)
Vector 2: (0.662639, 0.0722229, -0.745448, 1)
Vector 3: (-0.208273, -0.940973, -0.266819, 1)
Vector 4: (-0.729976, -0.246146, 0.637611, 1)
[No Normalization] Generated 134217728 random vectors in 159.824 ms.
Vector 0: (0.155903, -0.115883, 0.138903, 1)
Vector 1: (0.89671, 0.804303, 0.444347, 1)
Vector 2: (0.915703, 0.984343, -0.692272, 1)
Vector 3: (0.310002, 0.466344, 0.993065, 1)
Vector 4: (-0.90372, 0.87979, -0.0445499, 1)
[Fixed] Generated 134217728 fixed vectors in 142.19 ms.
Vector 0: (0, 1, 2, 1)
Vector 1: (0, 1, 2, 1)
Vector 2: (0, 1, 2, 1)
Vector 3: (0, 1, 2, 1)
Vector 4: (0, 1, 2, 1)

Depending on the run, the middle three (AVX, AVX-512, no normalization) can flip places. The non-vectorized and fixed-vector runs always bookend the results.

Edit: Doing the math: 2^27 vectors * 32 bytes per vector (4 * 8 bytes) * 6.25 per second (1 / 0.16 s) = 26.8 GB/s

Fixed vectors ... (1/0.142 = 7) = 30 GB/s
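The same arithmetic as a small helper, in case anyone wants to plug in their own timings. The function name is made up for illustration, and GB here means 10^9 bytes.

```cpp
#include <cassert>
#include <cstdint>

// Effective bandwidth in GB/s (1 GB = 1e9 bytes) for producing `count`
// elements of `bytes_each` bytes in `elapsed_ms` milliseconds.
double bandwidth_gb_per_s(std::uint64_t count, std::uint64_t bytes_each,
                          double elapsed_ms) {
    assert(elapsed_ms > 0.0);
    const double total_bytes =
        static_cast<double>(count) * static_cast<double>(bytes_each);
    return total_bytes / (elapsed_ms / 1000.0) / 1e9;
}
```

With 2^27 vectors of 32 bytes, 160 ms works out to about 26.8 GB/s and the 142.19 ms fixed-vector run to about 30.2 GB/s, in line with the figures above.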
 
Jul 27, 2020
26,117
18,016
146
Depending on the run they can flip places. (avx, avx512, no norm) Except for the non-vectored and fixed vectored book ending the results
Is it possible for you to devise a vector benchmark that does a lot of repeated calculations on a large (100+ MB) set of data? Then you could run that separately through affinity on the normal CCD and V-cache CCD and see how much the V-cache's terabyte worth of bandwidth helps the AVX-512 instructions?

Also, would you like to share the vibe coded vector bench code? Would be fun to try out different optimization switches on it to see what has the best effect in the end.
 

MS_AT

Senior member
Jul 15, 2024
743
1,509
96
AMD maybe (assuming it's free) but I think Intel doesn't give away its tools for free?
Both are Clang derivatives. Both are accessible for free. The AMD one is Linux-only, so harder to run on Windows, iirc. But the Intel one can be used for AMD if one is careful with flags.

You can use Intel DPC++ Compiler for better performance if you want on Intel system or AMDs compiler toolchain
While that is nice advice in principle, we know nothing about the code in question, so the code gen might be good enough already, either due to use of intrinsics or inline assembly.

Is it possible for you to devise a vector benchmark that does a lot of repeated calculations on a large (100+ MB) set of data? Then you could run that separately through affinity on the normal CCD and V-cache CCD and see how much the V-cache's terabyte worth of bandwidth helps the AVX-512 instructions?
If the workload is streaming in nature, then the answer is: not at all, as L3 is a victim cache. If the data is reused, then it helps in proportion to how much of the data set fits in the cache. But bandwidth is the same between the two CCDs. The X3D one just has more space.
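The streaming versus reuse distinction could be probed with something like the sketch below, which sweeps a buffer a configurable number of times. It is a rough illustration only: the function name, the update formula and the use of a single thread are assumptions, and a real test would also pin the process to one CCD and try buffer sizes on both sides of the L3 capacity.

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>
#include <vector>

// Sweep the whole buffer `passes` times, updating every element in place.
// With passes == 1 the access pattern is purely streaming; with many passes
// the data is reused, so a buffer that fits in cache can be served from it
// after the first sweep. Returns the elapsed time in seconds and stores a
// checksum so the compiler cannot discard the work.
double sweep_seconds(std::vector<float>& data, int passes, double* checksum) {
    assert(passes > 0 && checksum != nullptr);
    const auto start = std::chrono::steady_clock::now();
    for (int p = 0; p < passes; ++p)
        for (std::size_t i = 0; i < data.size(); ++i)
            data[i] = data[i] * 0.5f + 1.0f;   // simple update the compiler can vectorize
    const auto stop = std::chrono::steady_clock::now();
    double sum = 0.0;
    for (float v : data) sum += v;
    *checksum = sum;
    return std::chrono::duration<double>(stop - start).count();
}
```

Comparing the time per pass for a small buffer and for a 100+ MB buffer on each CCD would show whether the extra cache capacity, rather than bandwidth, is what changes the result.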
 