3DPM: Can we fix it? Yes we can!

DrMrLordX · Jul 1, 2025

Epic necro. It hurts a little seeing @The_Stilt posting in a thread, along with @borandi . But time marches on.

Glad it's still available.

Maybe someday I'll revisit my janky Java version and attempt to emulate all six modes based on the pseudocode @borandi provided. That day is not today.

igor_kavinski · Jul 1, 2025

DrMrLordX said:
Maybe someday I'll revisit my janky Java version and attempt to emulate all six modes based on the pseudocode @borandi provided.

What makes you say that? Isn't that working C/C++ code that he provided?

DrMrLordX · Jul 1, 2025

igor_kavinski said:
What makes you say that? Isn't that working C/C++ code that he provided?

It's his benchmark. He never provided the full source code that I can recall. All he did was provide pseudocode.

igor_kavinski · Jul 1, 2025

DrMrLordX said:
It's his benchmark. He never provided the full source code that I can recall. All he did was provide pseudocode.

This very much looks like working code: https://forums.anandtech.com/threads/3dpm-can-we-fix-it-yes-we-can.2433693/post-38095680

I may try compiling it tomorrow if I can find the time.

DrMrLordX · Jul 1, 2025

Oh I forgot about that. Let us know how it works for you!

Schmide · Jul 1, 2025

If I knew then what I know now, I could have expressed my thoughts better.

I still think that a half-angle table and a quasi-quaternion slerp would avoid a lot of expensive trigonometry calculations.

igor_kavinski · Jul 4, 2025

It compiled but not without some "head scratching" thanks to being a Visual Studio project.

Gonna try to test it on my Tiger Lake laptop when I get the time to see if the AVX512 optimization helps or not. There are no explicit AVX512 instructions in the code unfortunately.

DrMrLordX · Jul 4, 2025

igor_kavinski said:
It compiled but not without some "head scratching" thanks to being a Visual Studio project.

Gonna try to test it on my Tiger Lake laptop when I get the time to see if the AVX512 optimization helps or not. There are no explicit AVX512 instructions in the code unfortunately.

From what I understand, Dr. Cutress got a third party to help him with the final builds of 3DPMv2 that he used in some of his later benchmark articles for AT. It did utilize AVX2 and AVX512, and apparently it didn't rely just on autovectorization via OpenMP.

igor_kavinski · Jul 5, 2025

Pretty awful. Looked in the IDE project options and enabled OpenMP.

No ISA extensions

SSE

SSE2

AVX

AVX2

AVX-512 256-bit vector length

AVX-512 512-bit vector length

Highest score came from the AVX executable.

Now I need to figure out how to compile this as a 64-bit executable. I hate Visual Studio.

Looks like Dr. Ian got a pretty seriously heavy uplift from the manual optimization.

Schmide · Jul 5, 2025

OK, since this is back in my head, I will expand on why my algorithm was designed the way it was.

The original algorithm is flawed. The whole point of it is to produce random vectors to simulate a particle moving randomly. In post #4 a simple explanation of it is given.

This is a mapping problem that cartographers have mulled over endlessly and is the reason maps are so distorted.

The algorithm should be a bit different. Some simple code

Code:

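    // Needs <cmath>; on MSVC, M_PI also needs _USE_MATH_DEFINES.
    double x, y, z, zp, ang[2], mag;
    ang[0] = 22.5 / 180.0 * M_PI;        // angle in the x/y plane, in radians
    ang[1] = 30.0 / 180.0 * M_PI;        // z angle (latitude), in radians
    x = sin(ang[0]);
    y = cos(ang[0]);
    z = sin(ang[1]);
    zp = cos(ang[1]);                    // z': radius of the circle at that latitude
    x *= zp;                             // scale x and y onto that circle
    y *= zp;
    mag = sqrt(x * x + y * y + z * z);   // 1.0 up to rounding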

This adds a z' term (zp in the code) to normalize the z transform and maintain 3D polar coordinates.

As long as the angles chosen are < 45 degrees, this will perform a close-to-uniform mapping onto the equatorial area of the sphere. However, as the z angle approaches the middle latitudes (parallels), the mapping compresses: in this example, at 45 degrees it is down to about 70% of the equatorial area, and it drops off rapidly toward the poles, where it reaches zero.

This will statistically skew the frequency and randomness of the data. If one did not restrict the data to < 45 degrees, or used a non-normalized z coordinate, the skew would be even greater.
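To put a number on that compression: on a unit sphere, a small patch of fixed angular size has an area proportional to the cosine of its latitude. A minimal sketch, assuming that standard area element (the function name `latitude_area_factor` is made up for illustration and is not from the original code):

```cpp
#include <cassert>
#include <cmath>

// Relative area of a small longitude/latitude patch at the given latitude,
// compared with the same patch at the equator. On a unit sphere the area
// element is cos(lat) * dlat * dlon, so the factor is simply cos(lat).
double latitude_area_factor(double latitude_deg) {
    assert(latitude_deg >= -90.0 && latitude_deg <= 90.0);
    const double pi = std::acos(-1.0);
    return std::cos(latitude_deg * pi / 180.0);
}
```

Calling `latitude_area_factor(45.0)` gives roughly 0.707, which is the "70%" figure above, and the factor falls to zero (up to rounding) at 90 degrees.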

Why not just use 3 random values and normalize the final outcome? Well, in this case you're mapping a cube to a patch, thus adding all the statistical biases of the cube, which makes things even worse.
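The cube bias can be estimated without running a simulation. A rough sketch, assuming the three values are drawn uniformly from [-1, 1] (an assumption, since the range is not stated here): the share of cube volume that projects onto a given direction grows with the cube of the distance from the center to the cube surface along that direction. The function name `cube_direction_density` is hypothetical.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>

// Relative density of directions obtained by picking a point uniformly in the
// cube [-1, 1]^3 and normalizing it. For a unit direction u the ray leaves the
// cube at distance R = 1 / max(|ux|, |uy|, |uz|), and the cube volume that
// projects onto u scales with R^3. The result is 1.0 along an axis.
double cube_direction_density(double ux, double uy, double uz) {
    const double len = std::sqrt(ux * ux + uy * uy + uz * uz);
    assert(len > 0.0);
    const double m = std::max({std::fabs(ux), std::fabs(uy), std::fabs(uz)}) / len;
    const double r = 1.0 / m;
    return r * r * r;
}
```

For the corner direction (1, 1, 1) this returns about 5.2, meaning directions toward the cube corners come up roughly five times as often as directions toward the face centers.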

If one is willing to accept the above patch mapping (not the cube), you can further randomize the vector by randomly permuting and negating its components. This will map to the entire sphere with some overlap and bias. This is better than the original algorithm and may be acceptable, but it's still not perfect.
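One way the permute-and-negate step could look, as a sketch only: there are 6 orderings of the three components and 8 sign patterns, so a single random integer in 0..47 selects one combination. The names `Vec3` and `permute_negate` and the encoding of `code` are assumptions for illustration.

```cpp
#include <array>
#include <cassert>

using Vec3 = std::array<double, 3>;

// Apply one of the 48 signed axis permutations (6 orderings x 8 sign
// patterns) to v. `code` would come from a random number generator and must
// be in the range 0..47.
Vec3 permute_negate(const Vec3& v, unsigned code) {
    assert(code < 48);
    static const int perms[6][3] = {
        {0, 1, 2}, {0, 2, 1}, {1, 0, 2}, {1, 2, 0}, {2, 0, 1}, {2, 1, 0}};
    const int* p = perms[code % 6];
    const unsigned signs = code / 6;  // three sign bits, 0..7
    Vec3 out;
    for (int i = 0; i < 3; ++i) {
        const double s = ((signs >> i) & 1u) ? -1.0 : 1.0;
        out[i] = s * v[p[i]];
    }
    return out;
}
```

The length of the vector is unchanged by any of the 48 codes, so a unit vector stays a unit vector.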

What my algorithm did, although it was poorly implemented, was to maintain 3 basis vectors, then roll, pitch, and yaw them relative to their own coordinate system. This, along with inverting and permuting the system randomly, would produce a purely random system as long as the random angles were chosen appropriately (< 45 degrees).

There are issues with this implementation: it takes about 5 times the calculation and may be subject to vector drift, requiring stabilization.

After all these years, and a few days of thinking about it, I believe the issues above can be alleviated by just randomizing the roll, pitch, and yaw from the base vectors, then pushing the randomness from the permutation and negation of those vectors. This would give 3x output, full uniform coverage of the sphere, and less reliance on polar values.
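A rough sketch of where the "3x output" comes from, assuming the roll, pitch, and yaw are combined as R = Rz(yaw) * Ry(pitch) * Rx(roll). That convention, and the names `Basis` and `rotated_basis`, are assumptions and not taken from the original implementation. Rotating the base vectors from scratch each time, rather than accumulating rotations, also sidesteps the drift problem.

```cpp
#include <array>
#include <cassert>
#include <cmath>

using Vec3 = std::array<double, 3>;
using Basis = std::array<Vec3, 3>;

// Rotate the standard basis by R = Rz(yaw) * Ry(pitch) * Rx(roll), angles in
// radians. The three returned vectors are the columns of R: each has unit
// length and they are mutually perpendicular, so one set of three angles
// yields three output vectors.
Basis rotated_basis(double roll, double pitch, double yaw) {
    assert(std::isfinite(roll) && std::isfinite(pitch) && std::isfinite(yaw));
    const double cr = std::cos(roll),  sr = std::sin(roll);
    const double cp = std::cos(pitch), sp = std::sin(pitch);
    const double cy = std::cos(yaw),   sy = std::sin(yaw);
    Basis b;
    b[0] = {cy * cp, sy * cp, -sp};
    b[1] = {cy * sp * sr - sy * cr, sy * sp * sr + cy * cr, cp * sr};
    b[2] = {cy * sp * cr + sy * sr, sy * sp * cr - cy * sr, cp * cr};
    return b;
}
```

Whether the resulting directions are uniform depends on how the angles are drawn and on the permutation and negation step, neither of which this sketch addresses.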

igor_kavinski · Jul 5, 2025

You are a close or distant relative of John Carmack, aren't you?

igor_kavinski · Jul 5, 2025

I think I'm in love with a new compiler: https://github.com/skeeto/w64devkit/releases/tag/v2.3.0

Command-line: g++ -Wall -g -fopenmp -static Dr_Ian_3DPM.cpp

It's fully PORTABLE!

Just look at the side by side comparison with VStudio's highest score:

g++ -Wall -g -fopenmp -static -march=cascadelake Dr_Ian_3DPM.cpp

AND THEN...

g++ -Wall -g -fopenmp -static -march=cascadelake -Ofast -frename-registers Dr_Ian_3DPM.cpp

BEAST MODE ACTIVATED!!!!!

Despite losing three test scores to Dr. Ian's hand optimized executable, the open source compiled exe comes out a winner by 9.97% !!!

Thanks, @DrMrLordX ! A simple mention of 3DPM led to so much amazing learning!

igor_kavinski · Jul 5, 2025

So two advantages of that BEAUTIFUL portable compiler.

The EXE compiled as 64-bit, and it uses some better threading model (or maybe a better version of OpenMP): the VStudio EXE was reaching at most 80% CPU utilization, but the g++ 15.1.0 EXE maxes the CPU cores out so much that even mouse movement becomes laggy. So it makes for a really great MT stress test too.

DrMrLordX · Jul 5, 2025

Word. Also I find it interesting that Cascade Lake is the compiler target . . .

igor_kavinski · Jul 5, 2025

DrMrLordX said:
Word. Also I find it interesting that Cascade Lake is the compiler target . . .

I mean, there is the AVX512F target but that CPU IS Cascade Lake.

MS_AT · Jul 5, 2025

igor_kavinski said:
Despite losing three test scores to Dr. Ian's hand optimized executable, the open source compiled exe comes out a winner by 9.97% !!!

Since you have enabled -Ofast, it would be nice to test whether the results are still correct. It does not matter how fast something is if it is producing wrong results.

If you want to be on the safer side, stick to -O3. For further reading https://gcc.gnu.org/onlinedocs/gcc/Optimize-Options.html#index-Ofast
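One simple way to run such a check, sketched under the assumption that the benchmark's numeric results can be dumped from both builds into arrays (the 3DPM code itself is not shown in this thread, so the function name `max_relative_error` and the idea of dumping results are hypothetical):

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Largest relative difference between a reference result set (say, from an
// -O3 build) and a candidate result set (say, from an -Ofast build).
// `tiny` keeps the division stable when a reference value is near zero.
double max_relative_error(const std::vector<double>& reference,
                          const std::vector<double>& candidate,
                          double tiny = 1e-12) {
    assert(reference.size() == candidate.size());
    double worst = 0.0;
    for (std::size_t i = 0; i < reference.size(); ++i) {
        const double scale = std::max(std::fabs(reference[i]), tiny);
        worst = std::max(worst, std::fabs(candidate[i] - reference[i]) / scale);
    }
    return worst;
}
```

What counts as an acceptable error is a judgment call for the workload. For this comparison to mean anything, both builds would need to use the same random seed.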

It would also be nice to list the flags you passed to MSVC, for completeness.

igor_kavinski said:
AVX512F target but that CPU IS Cascade Lake.

Not really. Besides, with gcc you can enable only AVX512F (-mavx512f) without specifying the arch.
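To confirm what a given set of flags actually enabled, the compiler's predefined macros can be checked from inside the program. A small sketch (the function name `enabled_vector_isa` is made up; the macros are the ones GCC, Clang, and MSVC predefine for these instruction sets):

```cpp
#include <string>

// Report the widest vector ISA the compiler was allowed to use for this
// translation unit, based on predefined macros set by flags such as
// -mavx512f, -march=... (GCC/Clang) or /arch:... (MSVC).
std::string enabled_vector_isa() {
#if defined(__AVX512F__)
    return "AVX-512F";
#elif defined(__AVX2__)
    return "AVX2";
#elif defined(__AVX__)
    return "AVX";
#elif defined(__SSE2__) || defined(_M_X64) || \
    (defined(_M_IX86_FP) && _M_IX86_FP >= 2)
    return "SSE2";
#else
    return "baseline";
#endif
}
```

This only reports what the compiler may emit. It does not say whether the autovectorizer actually used those instructions in the hot loop.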

igor_kavinski · Jul 5, 2025

Ah. So more testing!

Yay?

igor_kavinski · Jul 7, 2025

MS_AT said:
It would be also nice to list flags you passed down to MSVC, for completeness.

Code:

/vlen=512 /Yu"stdafx.h" /ifcOutput "Release\" /GS /Qpar /GL /analyze- /W3 /Gy /Zc:wchar_t /Zi /Gm- /O2 /Ob2 /Fd"Release\vc143.pdb" /Zc:inline /fp:precise /D "WIN32" /D "NDEBUG" /D "_CONSOLE" /D "_UNICODE" /D "UNICODE" /errorReport:prompt /WX- /Zc:forScope /arch:AVX512 /Gd /Oy- /Oi /MD /openmp /FC /Fa"Release\" /EHsc /nologo /Fo"Release\" /Ot /Fp"Release\Dr_Ian_3DPM.pch" /diagnostics:column

MS_AT · Jul 7, 2025

igor_kavinski said:

Code:

/vlen=512 /Yu"stdafx.h" /ifcOutput "Release\" /GS /Qpar /GL /analyze- /W3 /Gy /Zc:wchar_t /Zi /Gm- /O2 /Ob2 /Fd"Release\vc143.pdb" /Zc:inline /fp:precise /D "WIN32" /D "NDEBUG" /D "_CONSOLE" /D "_UNICODE" /D "UNICODE" /errorReport:prompt /WX- /Zc:forScope /arch:AVX512 /Gd /Oy- /Oi /MD /openmp /FC /Fa"Release\" /EHsc /nologo /Fo"Release\" /Ot /Fp"Release\Dr_Ian_3DPM.pch" /diagnostics:column

If you wish to remove redundant and/or duplicate options, consult https://learn.microsoft.com/en-us/c...iler-options-listed-by-category?view=msvc-170

Schmide · Jul 13, 2025

I did some vibe coding (GitHub Copilot), and for the most part it is memory bound on my 9950X3D. It uses a std::mersenne_twister_engine for random values.

Code:

[Standard] Generated 134217728 random normalized vectors in 177.939 ms.
Vector 0: (0.453678, 0.710586, 0.537815, 1)
Vector 1: (0.121839, -0.833704, 0.538603, 1)
Vector 2: (0.80229, -0.26128, -0.536715, 1)
Vector 3: (-0.467435, 0.526024, 0.710495, 1)
Vector 4: (-0.39317, 0.833771, 0.387613, 1)
[AVX] Generated 134217728 random normalized vectors in 168.096 ms.
Vector 0: (0.720739, 0.504543, 0.475365, 1)
Vector 1: (-0.116079, -0.275841, -0.954168, 1)
Vector 2: (0.698501, 0.150877, 0.699523, 1)
Vector 3: (0.469703, 0.634067, -0.614279, 1)
Vector 4: (0.568022, -0.415479, -0.710442, 1)
[AVX-512] Generated 134217728 random normalized vectors in 161.694 ms.
Vector 0: (-0.0977937, 0.84237, 0.529952, 1)
Vector 1: (-0.257729, -0.572719, -0.778183, 1)
Vector 2: (0.662639, 0.0722229, -0.745448, 1)
Vector 3: (-0.208273, -0.940973, -0.266819, 1)
Vector 4: (-0.729976, -0.246146, 0.637611, 1)
[No Normalization] Generated 134217728 random vectors in 159.824 ms.
Vector 0: (0.155903, -0.115883, 0.138903, 1)
Vector 1: (0.89671, 0.804303, 0.444347, 1)
Vector 2: (0.915703, 0.984343, -0.692272, 1)
Vector 3: (0.310002, 0.466344, 0.993065, 1)
Vector 4: (-0.90372, 0.87979, -0.0445499, 1)
[Fixed] Generated 134217728 fixed vectors in 142.19 ms.
Vector 0: (0, 1, 2, 1)
Vector 1: (0, 1, 2, 1)
Vector 2: (0, 1, 2, 1)
Vector 3: (0, 1, 2, 1)
Vector 4: (0, 1, 2, 1)

Depending on the run, they can flip places (AVX, AVX-512, no normalization), except for the non-vectorized and fixed-vector runs, which bookend the results.
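For anyone wanting to reproduce something similar: the code behind these numbers has not been posted, so the following is only a guess at what a scalar "[Standard]" path might look like. It assumes 4 doubles per vector (32 bytes), components drawn uniformly from [-1, 1) with std::mt19937, and w fixed at 1 as in the printed output. All names are hypothetical.

```cpp
#include <cmath>
#include <cstddef>
#include <random>
#include <vector>

struct Vec4 { double x, y, z, w; };
static_assert(sizeof(Vec4) == 32, "4 doubles per vector");

// Fill a buffer with `count` unit-length vectors. Each component is drawn
// uniformly from [-1, 1) and the result is normalized, with w fixed at 1.
// Draws with near-zero length are retried to avoid dividing by zero.
std::vector<Vec4> random_unit_vectors(std::size_t count, unsigned seed) {
    std::mt19937 rng(seed);
    std::uniform_real_distribution<double> dist(-1.0, 1.0);
    std::vector<Vec4> out;
    out.reserve(count);
    while (out.size() < count) {
        const double x = dist(rng), y = dist(rng), z = dist(rng);
        const double len = std::sqrt(x * x + y * y + z * z);
        if (len < 1e-12) continue;
        out.push_back({x / len, y / len, z / len, 1.0});
    }
    return out;
}
```

Note that this is the "3 random values, then normalize" approach, so it carries the cube bias discussed earlier in the thread. It is meant as a throughput test, not a uniform sampler.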

Edit: Doing the math: 2^27 vectors * 32 bytes per vector (4 * 8) * (1/0.16 = 6.25) = 26.8 GB/s

Fixed vectors: ... (1/0.142 = 7) = 30 GB/s
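The same arithmetic as a small helper, in case anyone wants to plug in their own timings (the function name `bandwidth_gb_per_s` is made up, and 1 GB is taken as 1e9 bytes):

```cpp
#include <cassert>
#include <cstddef>

// Write bandwidth in GB/s (1 GB = 1e9 bytes) for `count` elements of
// `bytes_each` bytes produced in `seconds`.
double bandwidth_gb_per_s(std::size_t count, std::size_t bytes_each,
                          double seconds) {
    assert(seconds > 0.0);
    return static_cast<double>(count) * static_cast<double>(bytes_each) /
           seconds / 1e9;
}
```

With 2^27 vectors of 32 bytes, 0.16 s works out to about 26.8 GB/s and 0.142 s to about 30.2 GB/s.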

igor_kavinski · Jul 13, 2025

Schmide said:
Depending on the run, they can flip places (AVX, AVX-512, no normalization), except for the non-vectorized and fixed-vector runs, which bookend the results.

Is it possible for you to devise a vector benchmark that does a lot of repeated calculations on a large (100+ MB) set of data? Then you could run that separately through affinity on the normal CCD and V-cache CCD and see how much the V-cache's terabyte worth of bandwidth helps the AVX-512 instructions?

Also, would you like to share the vibe coded vector bench code? Would be fun to try out different optimization switches on it to see what has the best effect in the end.

511 · Jul 13, 2025

You can use the Intel DPC++ compiler for better performance if you want on an Intel system, or AMD's compiler toolchain.

igor_kavinski · Jul 13, 2025

511 said:
You can use the Intel DPC++ compiler for better performance if you want on an Intel system, or AMD's compiler toolchain.

AMD maybe (assuming it's free) but I think Intel doesn't give away its tools for free?

MS_AT · Jul 13, 2025

igor_kavinski said:
AMD maybe (assuming it's free) but I think Intel doesn't give away its tools for free?

Both are Clang derivatives. Both are accessible for free. The AMD one is Linux only, so harder to run on Windows, iirc. But the Intel one can be used for AMD if one is careful with flags.

511 said:
You can use the Intel DPC++ compiler for better performance if you want on an Intel system, or AMD's compiler toolchain.

While that is nice advice in principle, we know nothing about the code in question, so the code gen might already be good enough, either due to the use of intrinsics or inline assembly.

igor_kavinski said:
Is it possible for you to devise a vector benchmark that does a lot of repeated calculations on a large (100+ MB) set of data? Then you could run that separately through affinity on the normal CCD and V-cache CCD and see how much the V-cache's terabyte worth of bandwidth helps the AVX-512 instructions?

If the workload is streaming in nature, then the answer is nothing, as L3 is a victim cache. If the data is reused, then it helps in proportion to how much of the data set fits in the cache. But bandwidth is the same between the two CCDs. The X3D one just has more space.
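A minimal sketch of a reuse-style test along those lines, assuming a plain repeated sum is representative enough (the names `PassResult` and `repeated_sum` are hypothetical, and pinning the process to one CCD would have to be done separately through the OS affinity settings):

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>
#include <numeric>
#include <vector>

struct PassResult { double checksum; double seconds; };

// Sum a working set of `elements` doubles `passes` times and time it.
// With passes > 1 the data is reused, so a working set that fits in L3 can be
// served from cache after the first pass, while a working set far larger than
// L3 behaves like a streaming load on every pass.
PassResult repeated_sum(std::size_t elements, int passes) {
    assert(elements > 0 && passes > 0);
    std::vector<double> data(elements, 1.0);
    const auto start = std::chrono::steady_clock::now();
    double checksum = 0.0;
    for (int p = 0; p < passes; ++p)
        checksum += std::accumulate(data.begin(), data.end(), 0.0);
    const std::chrono::duration<double> elapsed =
        std::chrono::steady_clock::now() - start;
    return {checksum, elapsed.count()};
}
```

Running it at a few working-set sizes on each CCD and comparing bytes per second would show where the extra cache capacity starts to matter. The checksum is returned so the compiler cannot discard the loop.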

Schmide · Jul 13, 2025

It's a big nothing. Are we sharing? I don't want to be the only one.
