> Maybe someday I'll revisit my janky Java version and attempt to emulate all six modes based on the pseudocode @borandi provided.
What makes you say that? Isn't that working C/C++ code that he provided?
It's his benchmark. He never provided the full source code that I can recall. All he did was provide pseudo-code.
This very much looks like working code: https://forums.anandtech.com/threads/3dpm-can-we-fix-it-yes-we-can.2433693/post-38095680
It compiled but not without some "head scratching" thanks to being a Visual Studio project.
Gonna try to test it on my Tiger Lake laptop when I get the time to see if the AVX512 optimization helps or not. There are no explicit AVX512 instructions in the code unfortunately.
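For anyone wondering what "explicit AVX512 instructions" would look like, here is a minimal sketch of a vector normalization written with AVX-512F intrinsics. It is not taken from the 3DPM source: the function name `normalize_soa` and the separate x/y/z array layout are assumptions made for illustration. The wide path is only compiled when the compiler defines `__AVX512F__` (GCC/Clang with -mavx512f, MSVC with /arch:AVX512); otherwise the scalar loop does all the work.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#if defined(__AVX512F__)
#include <immintrin.h>
#endif

// Hypothetical helper: normalizes n vectors stored as separate x/y/z arrays.
void normalize_soa(float* x, float* y, float* z, std::size_t n) {
    assert(x != nullptr && y != nullptr && z != nullptr);
    std::size_t i = 0;
#if defined(__AVX512F__)
    // Explicit AVX-512F path: 16 floats per iteration.
    for (; i + 16 <= n; i += 16) {
        const __m512 vx = _mm512_loadu_ps(x + i);
        const __m512 vy = _mm512_loadu_ps(y + i);
        const __m512 vz = _mm512_loadu_ps(z + i);
        __m512 len2 = _mm512_mul_ps(vx, vx);      // x*x
        len2 = _mm512_fmadd_ps(vy, vy, len2);     // + y*y
        len2 = _mm512_fmadd_ps(vz, vz, len2);     // + z*z
        const __m512 len = _mm512_sqrt_ps(len2);
        _mm512_storeu_ps(x + i, _mm512_div_ps(vx, len));
        _mm512_storeu_ps(y + i, _mm512_div_ps(vy, len));
        _mm512_storeu_ps(z + i, _mm512_div_ps(vz, len));
    }
#endif
    // Scalar path: handles the remainder, or everything without AVX-512F.
    for (; i < n; ++i) {
        const float len = std::sqrt(x[i] * x[i] + y[i] * y[i] + z[i] * z[i]);
        x[i] /= len;
        y[i] /= len;
        z[i] /= len;
    }
}
```

Without intrinsics like these, whether 512-bit instructions end up in the binary depends entirely on the compiler's auto-vectorizer.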
double x, y, z, zp, ang[2], mag;
ang[0] = 22.5 / 180.0 * M_PI;
ang[1] = 30.0 / 180.0 * M_PI;
x = sin(ang[0]);
y = cos(ang[0]);
z = sin(ang[1]);
zp = cos(ang[1]);
x *= zp;
y *= zp;
mag = sqrt(x * x + y * y + z * z);
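The snippet above can be wrapped into something that compiles on its own. This is only a sketch: the names `Vec3`, `direction_from_angles` and `magnitude` are made up for illustration, and the arithmetic is copied from the snippet. Since x and y are scaled by cos of the second angle and z is its sin, the magnitude should come out as 1 up to rounding.

```cpp
#include <cassert>
#include <cmath>

struct Vec3 { double x, y, z; };

// Same arithmetic as the forum snippet, with the two angles given in degrees.
Vec3 direction_from_angles(double azimuth_deg, double elevation_deg) {
    assert(std::isfinite(azimuth_deg) && std::isfinite(elevation_deg));
    // Local constant so the code does not depend on M_PI, which MSVC only
    // defines when _USE_MATH_DEFINES is set.
    const double pi = 3.14159265358979323846;
    const double az = azimuth_deg / 180.0 * pi;
    const double el = elevation_deg / 180.0 * pi;
    const double zp = std::cos(el);  // projection onto the x/y plane
    return { std::sin(az) * zp, std::cos(az) * zp, std::sin(el) };
}

double magnitude(const Vec3& v) {
    return std::sqrt(v.x * v.x + v.y * v.y + v.z * v.z);
}
```

Calling `direction_from_angles(22.5, 30.0)` reproduces the values used in the snippet.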
> Word. Also I find it interesting that Cascade Lake is the compiler target . . .
I mean, there is the AVX512F target but that CPU IS Cascade Lake.
> Despite losing three test scores to Dr. Ian's hand-optimized executable, the open source compiled exe comes out a winner by 9.97% !!!
Since you have enabled -Ofast, it would be nice to test whether the results are still correct. It does not matter how fast something is if it is producing wrong results.
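One cheap sanity check for the normalized modes is to confirm that every generated vector still has length 1 within a tolerance. The sketch below assumes the vectors can be copied into a simple `Vec3` array; both `Vec3` and `count_non_unit` are hypothetical names, not part of the benchmark. The checker itself is best compiled without -Ofast, since fast-math allows the compiler to assume NaN never occurs.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

struct Vec3 { float x, y, z; };

// Returns how many vectors have a length that differs from 1 by more than tol.
std::size_t count_non_unit(const std::vector<Vec3>& vectors, double tol) {
    assert(tol > 0.0);
    std::size_t bad = 0;
    for (const Vec3& v : vectors) {
        // Accumulate in double so the check is more precise than the data.
        const double len = std::sqrt(double(v.x) * v.x + double(v.y) * v.y +
                                     double(v.z) * v.z);
        // Negated form so that a NaN length is also counted as bad.
        if (!(std::fabs(len - 1.0) <= tol)) ++bad;
    }
    return bad;
}
```

A length check only catches broken normalization. It says nothing about whether the random directions are still distributed the way the original code intended.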
> AVX512F target but that CPU IS Cascade Lake.
Not really, not to mention that with gcc you can enable only AVX512F (for example with -mavx512f) without specifying the arch.
It would also be nice to list the flags you passed to MSVC, for completeness.
/vlen=512 /Yu"stdafx.h" /ifcOutput "Release\" /GS /Qpar /GL /analyze- /W3 /Gy /Zc:wchar_t /Zi /Gm- /O2 /Ob2 /Fd"Release\vc143.pdb" /Zc:inline /fp:precise /D "WIN32" /D "NDEBUG" /D "_CONSOLE" /D "_UNICODE" /D "UNICODE" /errorReport:prompt /WX- /Zc:forScope /arch:AVX512 /Gd /Oy- /Oi /MD /openmp /FC /Fa"Release\" /EHsc /nologo /Fo"Release\" /Ot /Fp"Release\Dr_Ian_3DPM.pch" /diagnostics:column
If you wish to remove redundant and/or duplicate options, consult https://learn.microsoft.com/en-us/c...iler-options-listed-by-category?view=msvc-170
[Standard] Generated 134217728 random normalized vectors in 177.939 ms.
Vector 0: (0.453678, 0.710586, 0.537815, 1)
Vector 1: (0.121839, -0.833704, 0.538603, 1)
Vector 2: (0.80229, -0.26128, -0.536715, 1)
Vector 3: (-0.467435, 0.526024, 0.710495, 1)
Vector 4: (-0.39317, 0.833771, 0.387613, 1)
[AVX] Generated 134217728 random normalized vectors in 168.096 ms.
Vector 0: (0.720739, 0.504543, 0.475365, 1)
Vector 1: (-0.116079, -0.275841, -0.954168, 1)
Vector 2: (0.698501, 0.150877, 0.699523, 1)
Vector 3: (0.469703, 0.634067, -0.614279, 1)
Vector 4: (0.568022, -0.415479, -0.710442, 1)
[AVX-512] Generated 134217728 random normalized vectors in 161.694 ms.
Vector 0: (-0.0977937, 0.84237, 0.529952, 1)
Vector 1: (-0.257729, -0.572719, -0.778183, 1)
Vector 2: (0.662639, 0.0722229, -0.745448, 1)
Vector 3: (-0.208273, -0.940973, -0.266819, 1)
Vector 4: (-0.729976, -0.246146, 0.637611, 1)
[No Normalization] Generated 134217728 random vectors in 159.824 ms.
Vector 0: (0.155903, -0.115883, 0.138903, 1)
Vector 1: (0.89671, 0.804303, 0.444347, 1)
Vector 2: (0.915703, 0.984343, -0.692272, 1)
Vector 3: (0.310002, 0.466344, 0.993065, 1)
Vector 4: (-0.90372, 0.87979, -0.0445499, 1)
[Fixed] Generated 134217728 fixed vectors in 142.19 ms.
Vector 0: (0, 1, 2, 1)
Vector 1: (0, 1, 2, 1)
Vector 2: (0, 1, 2, 1)
Vector 3: (0, 1, 2, 1)
Vector 4: (0, 1, 2, 1)
> Depending on the run, they can flip places (avx, avx512, no norm), except for the non-vectored and fixed vectored results, which bookend the list.
Is it possible for you to devise a vector benchmark that does a lot of repeated calculations on a large (100+ MB) set of data? Then you could run that separately through affinity on the normal CCD and the V-cache CCD and see how much the V-cache's terabyte worth of bandwidth helps the AVX-512 instructions.
> You can use the Intel DPC++ Compiler for better performance if you want on an Intel system, or AMD's compiler toolchain.
AMD maybe (assuming it's free) but I think Intel doesn't give away its tools for free?
Both are Clang derivatives and both are available for free. The AMD one is Linux-only, so harder to run on Windows, IIRC. But the Intel one can be used on AMD if one is careful with flags.
> You can use the Intel DPC++ Compiler for better performance if you want on an Intel system, or AMD's compiler toolchain.
While that is nice advice in principle, we know nothing about the code in question, so the code gen might be good enough already thanks to the use of either intrinsics or inline assembly.
> Is it possible for you to devise a vector benchmark that does a lot of repeated calculations on a large (100+ MB) set of data? Then you could run that separately through affinity on the normal CCD and the V-cache CCD and see how much the V-cache's terabyte worth of bandwidth helps the AVX-512 instructions.
If the workload is streaming in nature, then the answer is "not at all", as L3 is a victim cache. If the data is reused, then the benefit is proportional to how much of the data set fits in the cache. But bandwidth is the same between the two CCDs. The X3D CCD just has more space.
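To see the streaming versus reuse difference in practice, something like the following could be a starting point. It is a minimal sketch and not part of 3DPM: `time_repeated_passes` and `checksum` are made-up names, and the multiply-add kernel is only a stand-in for real work. The idea is to sweep the same buffer many times, so that a buffer which fits in one CCD's L3 but not the other's should show a timing difference, while a much larger buffer should not.

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>
#include <vector>

// Sweeps the whole buffer `passes` times with a simple multiply-add and
// returns the elapsed wall time in milliseconds.
double time_repeated_passes(std::vector<float>& data, int passes) {
    assert(passes > 0);
    const auto start = std::chrono::steady_clock::now();
    for (int p = 0; p < passes; ++p) {
        for (std::size_t i = 0; i < data.size(); ++i) {
            data[i] = data[i] * 0.5f + 1.0f;  // easy for the compiler to vectorize
        }
    }
    const auto stop = std::chrono::steady_clock::now();
    return std::chrono::duration<double, std::milli>(stop - start).count();
}

// Sum of all elements; printing it keeps the optimizer from discarding the work.
double checksum(const std::vector<float>& data) {
    double sum = 0.0;
    for (float f : data) sum += f;
    return sum;
}
```

Pinning the process to one CCD can be done from outside the program, for example with `start /affinity` on Windows or `taskset` on Linux, and the buffer size can then be varied between runs.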