3DPM: Can we fix it? Yes we can!

Essence_of_War · Feb 9, 2016

Great to hear from you Ian, and I am really pleased to see you re-writing so that you have source moving forward for the 3DPM benchmark! Would you mind if anyone tinkers with this code?

Edit:

A quick scan through the code indicates to me that up to a header file or two, this would compile on g++ more or less as written.

DrMrLordX · Feb 9, 2016

borandi said:
Obviously the movement to larger blocks of particles should be considered, using larger instructions, although that would essentially mean building an SSE, SSE2, SSE3 and other variants depending on processor support. This somewhat goes against the mantra of 'self-taught non-CompSci chemist writing code good enough to just work', but if there's an easy way to do it, I'm all ears.

First of all, thanks for revisiting 3DPM! I'd be interested in hearing how your results compare to the Java 3DPMRedux. And thank you for sharing code! I'd like to see your code for the other five movement models if/when you get around to it, so I could tinker with it and maybe try the same in 3DPMRedux, if that's okay with you.

If you're concerned about SIMD, autovectorization might be the way to go here, though I am not sure how well any of the MS compilers handle that sort of thing. You may get better performance using a GCC variant instead.
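
As a rough illustration of what gives an autovectorizer something to work with, here is a minimal sketch that keeps each coordinate in its own contiguous array and applies one step to a whole block of particles in a plain counted loop. `ParticleBlock` and `apply_steps` are hypothetical names, not part of 3DPM, and whether a given compiler actually vectorizes the loop depends on its version and flags.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical structure-of-arrays layout: one contiguous array per coordinate.
struct ParticleBlock {
    std::vector<float> x, y, z;
    explicit ParticleBlock(std::size_t n) : x(n, 0.0f), y(n, 0.0f), z(n, 0.0f) {}
};

// Adds one precomputed step (dx, dy, dz) to every particle in the block.
// The loop body has no branches and no cross-iteration dependency, which is
// the shape autovectorizers generally handle best.
void apply_steps(ParticleBlock& p,
                 const std::vector<float>& dx,
                 const std::vector<float>& dy,
                 const std::vector<float>& dz) {
    const std::size_t n = p.x.size();
    float* px = p.x.data();
    float* py = p.y.data();
    float* pz = p.z.data();
    for (std::size_t i = 0; i < n; ++i) {
        px[i] += dx[i];
        py[i] += dy[i];
        pz[i] += dz[i];
    }
}
```

The step arrays would still have to be filled by the PRNG and the movement model, and that part is usually the harder one to vectorize.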

borandi · Mar 11, 2016

Progress:

- Added Normal Distribution PRNG
- Added NormDev function
- Converted to function pointer array to add new algorithms quicker

Here's a mid-beta test to try:

https://dl.dropboxusercontent.com/u/49768099/3DPM.rar

I've included the original 3DPM and the new v2, as well as a screenshot.
v2 only has three algorithms at the minute, still unoptimized. Each algorithm runs for 15 seconds in a do { } while loop, which is still timed with the OpenMP timer.
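
The "run each algorithm for a fixed time in a do/while loop" idea can be sketched as below. This is only an illustration: it swaps in `std::chrono::steady_clock` for the OpenMP timer, and `run_for` is a hypothetical helper name.

```cpp
#include <chrono>

// Hypothetical helper: calls work() repeatedly until at least `seconds` of
// wall-clock time have passed, and returns how many calls completed.
// Because it is a do/while loop, work() always runs at least once.
template <typename Work>
int run_for(double seconds, Work work) {
    using clock = std::chrono::steady_clock;
    const auto start = clock::now();
    int loops = 0;
    do {
        work();
        ++loops;
    } while (std::chrono::duration<double>(clock::now() - start).count() < seconds);
    return loops;
}
```

Since the last pass always runs to completion, the real elapsed time overshoots the target, so a score should be divided by the measured time and not by the nominal 15 seconds.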

So far, on a 3.5 GHz i7-3960X:

3D Trig seems 25% slower (149.4 from 205.9)
BiPy is 50% faster (369.9 from 235.7)
NormDev is 50% faster (67.6 from 43.8)

Still to do:

- Add Polar Reject Algorithm
- Add Cosine Algorithm
- Add Hypercube algorithm
- Find why 3D Trig is slower
- Optimise for particle/step
- Make display nicer
- Output result text file
- Create benchmark upload site / security
- Some other stuff I can't think of at 4am

Edit: Just tried on an E3-1225 v5

v1 vs v2
3DTrig: 99.62 vs 114.97
BiPy: 110.37 vs 194.02
NormDev: 21.19 vs 43.48

So big speed ups on each.

Edit2: A6-7400K

3DTrig: 24.79 vs 25.72
BiPy: 29.20 vs 55.46
NormDev: 5.67 vs 11.65

Edit3: i3-6100TE at 35% OC

3DTrig: 94.86 vs 80.08
Bipy: 106.41 vs 167.17
NormDev: 21.28 vs 32.73

DrMrLordX · Mar 12, 2016

Thanks for the download. For those who don't know, you'll need the MSVC 2015 redistributable to run this thing unless you have the compiler installed on your system. Anyway, here are numbers from my 7700k (currently @ 4.5 GHz, 2100 MHz NB, DDR3-2400 CL10):

v1 vs v2 vs 3DPMRedux

3DTrig: 58.1377 vs 61.5679 vs 141.28035
Bipy: 68.0851 vs 130.0988 vs NA
NormDev: 13.3856 vs 27.2759 vs NA

Big improvements on Bipy and NormDev. 3DPMRedux is still faster in 3DTrig mode, but not by much vs v2 Bipy. Would definitely like to see some code snippets on v2 if possible.

borandi · Mar 14, 2016

Progress, 3DPM v2.0 b1

- Added all algorithms
- Adjusted layout with function pointer array
- Tidied up the look, detects CPU
- Auto sets process to high priority
- adjusted particles/steps to 100k and 1k
- Does some initialization

To do:

- Should add a 10 second gap between tests, to allow cool down for turbo.
- add more comments
- Output result text file
- Create benchmark upload site / security
- Find a FP PRNG
- Identify bottlenecks

Updated download in same place:
https://dl.dropboxusercontent.com/u/49768099/3DPM.rar

Results breakdown on an i7-3960X at 3.5G:
v2b1 against v1.03

Code:

Stage 1: 3DTrig      - 170.75 : 165.50 (+  3.17%)
Stage 2: BiPy        - 433.75 : 186.90 (+132%)
Stage 3: PolarReject - 208.88 : 123.46 (+ 69.2%)
Stage 4: Cosine      -  84.41 :  92.80 (-  9.04%)
Stage 5: Hypercube   - 143.17 :  59.93 (+139%)
Stage 6: NormDev     -  76.54 :  38.98 (+ 96.4%)

Total Score: 1117.49 vs 667.56 (+67.4%)

This was on my main work PC, which has tons of crap in the background. Note that 3DTrig and Cosine seem very susceptible to it.

Here's the code:

ran.h

Code:

#pragma once
#include <math.h>	

struct Ranq2 {
	const double twoPower64 = 18446744073709551616.0;
	const double oneOver2Power64Minus1 = 1 / (twoPower64 - 1);
	unsigned long long v, w;
	Ranq2(unsigned long long j) : v(4101842887655102017LL), w(1) {
		seed(j);
	}
	void seed(unsigned long long int seed) {
		v = 4101842887655102017LL;
		w = 1;
		v ^= seed;
		w = int64();
		v = int64();
		for (size_t i = 0; i < 10; i++) { int64(); }
	}
	inline double normal(double stdDev = 1) {
		double u = 0;
		double v = 0;
		double x = 0;
		double y = 0;
		double q = 0;

		do {
			u = real64Closed();
			v = 1.7156 * (real64Closed() - 0.5);
			x = u - 0.449871;
			y = fabs(v) + 0.386595;
			q = x*x + y*(0.19600 * y - 0.25472 * x);
		} while (q > 0.27597 && (q > 0.27846 || v*v > -4.0 * log(u) * u * u));

		return stdDev * v / u;

	}
	inline double doub() { return 5.42101086242752217e-20 * int64(); }
	inline unsigned int int32() { return (unsigned int)int64(); }
	inline float float32() { return int32() * 2.328306437e-10f; }

private:
	inline unsigned long long int64() {
		v ^= v >> 17;
		v ^= v << 31;
		v ^= v >> 8;
		w = 4294957665U * (w & 0xffffffff) + (w >> 32);
		return v^w;
	}
	inline double real64Closed() { return oneOver2Power64Minus1 * int64(); }
};

struct Simul {
	float x = 0.0f;
	float y = 0.0f;
	float z = 0.0f;
	float time = 0.0f;
	float average = 0.0f;
};

main.cpp

Code:

// 3DPM v2.cpp : Defines the entry point for the console application.
//

#include "stdafx.h"
#include <math.h>
#include <string.h>   // memcpy, strlen
#include <ctype.h>    // isspace
#include <stdio.h>
#include <string>
#include <iostream>
#include <iomanip>
#include <omp.h>
#include <windows.h>
#include <intrin.h>   // __cpuid
#include "ran.h"

using namespace std;

#ifdef _DEBUG
static long particles = 20;
int steps = 2000;
float timeperloop = 5.0f;
#else
static long particles = 1e5;
int steps = 1000;
float timeperloop = 20.0f;
#endif

#define twopi 6.28318531f

Simul Sim13DTrig(Simul partic, Ranq2 &ran); // 3DTrig
Simul Sim2BiPy(Simul partic, Ranq2 &ran); // BiPy
Simul Sim3PolarReject(Simul partic, Ranq2 &ran); // PolarReject
Simul Sim4Cosine(Simul partic, Ranq2 &ran); // Cosine
Simul Sim5Hypercube(Simul partic, Ranq2 &ran); // Hypercube
Simul Sim6NormDev(Simul partic, Ranq2 &ran); // NormDev


Simul(*functptr[])(Simul parti, Ranq2 &ran) = { Sim13DTrig, Sim2BiPy, Sim3PolarReject, Sim4Cosine, Sim5Hypercube, Sim6NormDev };
const char* names[] = { " 3DTrig", "   BiPy", "PlrRjct", " Cosine", "HypCube", "NormDev" }; // string literals are const

float outputFunc(double* x, double* y, double* z, double start, double end, int threads, int looptimer, int s);
void printCPUName();
char *trim(char *str);

int main()
{

	//set high priority
	SetPriorityClass(GetCurrentProcess(), HIGH_PRIORITY_CLASS);


	//get system info

	SYSTEM_INFO sysInfo;
	GetSystemInfo(&sysInfo);

	MEMORYSTATUSEX statex;
	statex.dwLength = sizeof(statex);
	GlobalMemoryStatusEx(&statex);

	// thread detect and display
	int threads = omp_get_max_threads();

	cout << "                               3DPM v2.0 beta-1" << endl;
	cout << "                             (c) Ian Cutress 2016" << endl;
	cout << "/------------------------------------------------------------------------------\\";
	cout << "|--- CPU             ---/  "; printCPUName(); cout << endl;
	cout << "|--- Cores / Threads ---/  " << sysInfo.dwNumberOfProcessors << " / " << threads << endl;
	cout << "|--- Memory          ---/  " << (statex.ullTotalPhys / 1024) / 1024 << "MB" << endl;
	cout << "|------------------------------------------------------------------------------|";
	cout << "|--- Points / Steps  ---/  " << particles << " / " << steps << endl;
	cout << "\\------------------------------------------------------------------------------/";
	cout << endl;
	cout << "Initializing six algorithms of >20 seconds each." << endl;
	cout << "Stage 0";

	for (int s = 0; s < 6; s++) // simulations
	{
		Simul(*functptr2)(Simul parti, Ranq2 &ran) = *functptr[s];
		for (int i = 0; i < particles; i++) {
			Simul partic;
			Ranq2 ran(i);
			for (int j = 0; j < 100; j++) {
				partic = functptr2(partic, ran);
			}
		}
		cout << "\rStage " << s+1 << " ";
		for (int i = 0; i < s+1; i++) { cout << "."; }
	}
	cout << "\rInitialization Complete" << endl << endl;

//	cout << "Max Number of Threads = " << threads << endl;
//	cout << "Max Number of Particles = " << particles << endl;
//	cout << "Max Number of Steps = " << steps << endl;

	// final score counter
	double totalScore = 0.0;

	// initial final position calculation
	double* sumx = new double[threads * 64];
	double* sumy = new double[threads * 64];
	double* sumz = new double[threads * 64];
	for (int i = 0; i < threads; i++) {
		int a = 64 * i;
		sumx[a] = sumy[a] = sumz[a] = 0.0;
	}



	for (int s = 0; s < 6; s++) // simulations
	{

		// get correct function
		Simul(*functptr2)(Simul parti, Ranq2 &ran) = *functptr[s];

		// start timer
		double start = omp_get_wtime();

		// start looptimer counter
		int looptimer = 0;

		do {
			looptimer++;

#pragma omp parallel for ordered
			for (int i = 0; i < particles; i++) {
				Simul partic;
				Ranq2 ran(i);

				for (int j = 0; j < steps; j++) {
					partic = functptr2(partic, ran);
				} // Simulation loop


#pragma omp atomic
				sumx[omp_get_thread_num() * 64] += partic.x; // calc location from 0,0,0
#pragma omp atomic
				sumy[omp_get_thread_num() * 64] += partic.y; // calc location from 0,0,0
#pragma omp atomic
				sumz[omp_get_thread_num() * 64] += partic.z; // calc location from 0,0,0

			}
			double diff = omp_get_wtime() - start; // finish time
			cout << "\r" << names[s] << " stage " << s+1 << ": " << looptimer << " loops @ " << diff;
		} while (omp_get_wtime() - start < timeperloop); // end OpenMP loop
		double end = omp_get_wtime(); // finish time
		totalScore += outputFunc(sumx, sumy, sumz, start, end, threads, looptimer, s); // calculate final position

		looptimer = 0;

		for (int i = 0; i < threads; i++) {
			int a = 64 * i;
			sumx[a] = sumy[a] = sumz[a] = 0.0;
		}


		//write results files

	}

	cout << "|------------------------------------------------------------------------------|";
	cout << "|--- Total Score --- | " << setprecision(3) << fixed << totalScore /1000 /1000 << " Mops/sec" << endl;
	cout << "|------------------------------------------------------------------------------|";

	system("pause");
	return 0;

}

float outputFunc(double* x, double* y, double* z, double start, double end, int threads, int looptimer, int s) {

	cout.precision(2);


	for (int i = 1; i < threads; i++) {
		x[0] += x[64 * i];
		y[0] += y[64 * i];
		z[0] += z[64 * i];
	}

	x[0] /= particles;
	x[0] /= looptimer;
	y[0] /= particles;
	y[0] /= looptimer;
	z[0] /= particles;
	z[0] /= looptimer;

	float magnitude = sqrtf(x[0]*x[0] + y[0]*y[0] + z[0]*z[0]);
	float average = (double)looptimer * (double)steps * (double)(particles) / (end - start);
	cout << setprecision(3) << fixed;
	cout << "\r" << names[s] << " " << looptimer << "*(" << magnitude << ") ";
	cout << setprecision(4) << fixed;
	cout << "in " << end - start << " sec : " << (float)average / 1000 / 1000 << " Mops/sec" << endl;

	return average;
}

Simul Sim13DTrig(Simul partic, Ranq2 &ran) {

	// 3DTrig 
	float newz = 2.0f * ran.doub() - 1.0f;
	float alpha = ran.doub() * twopi;
	float r = sqrtf(1 - newz*newz);
	partic.x += r*cosf(alpha);
	partic.y += r*sinf(alpha);
	partic.z += newz;

	return partic;
}

Simul Sim2BiPy(Simul partic, Ranq2 &ran) {

	// BiPy
	int k = (int)floor(6 * ran.doub());
	if (k == 0) { partic.x += 1; }
	else if (k == 1) { partic.x -= 1; }
	else if (k == 2) { partic.y += 1; }
	else if (k == 3) { partic.y -= 1; }
	else if (k == 4) { partic.z += 1; }
	else { partic.z -= 1; }

	return partic;
}

Simul Sim3PolarReject(Simul partic, Ranq2 &ran) {

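	// Polar Reject
	// Note: as posted, s is the radius rather than the squared radius and z is
	// an independent draw, so the step is not unit length. Marsaglia's method
	// uses s = u*u + v*v and z = 1 - 2*s.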
	float u, v, s;
	do {
		u = 2.0f * ran.doub() - 1.0f;
		v = 2.0f * ran.doub() - 1.0f;
		s = sqrtf(u*u + v*v);
	} while (s > 1.0f);
	float a = 2 * sqrt(1.0f - s);
	partic.x += a * u;
	partic.y += a * v;
	partic.z += 2.0f * ran.doub() - 1.0f;

	return partic;
}

Simul Sim4Cosine(Simul partic, Ranq2 &ran) {

	// Cosine
	float phi = twopi * ran.doub();
	float theta = acosf(2 * ran.doub() - 1);
	partic.x += sinf(phi) * sinf(theta);
	partic.y += cosf(phi) * sinf(theta);
	partic.z += cosf(theta);

	return partic;
}

Simul Sim5Hypercube(Simul partic, Ranq2 &ran) {

	//Hypercube
	float u, v, w, s;
	do {
		u = 2.0f * ran.doub() - 1.0f;
		v = 2.0f * ran.doub() - 1.0f;
		w = 2.0f * ran.doub() - 1.0f;
		s = sqrtf(u*u + v*v + w*w);
	} while (s > 1.0f);
	float invs = 1.0f / s;
	partic.x += u * invs;
	partic.y += v * invs;
	partic.z += w * invs;
	return partic;
}

Simul Sim6NormDev(Simul partic, Ranq2 &ran) {

	// NormDev
	float u = ran.normal(1);
	float v = ran.normal(1);
	float w = ran.normal(1);
	float gamma = 1 / sqrtf(u*u + v*v + w*w);
	partic.x += gamma * u;
	partic.y += gamma * v;
	partic.z += gamma * w;

	return partic;
}

void printCPUName() {

	int CPUInfo[4] = { -1 };
	unsigned   nExIds, i = 0;
	char CPUBrandString[0x40] = { 0 }; // zero-filled in case the brand leaves are unsupported
	// Get the information associated with each extended ID.
	__cpuid(CPUInfo, 0x80000000);
	nExIds = CPUInfo[0];
	for (i = 0x80000000; i <= nExIds; ++i)
	{
		__cpuid(CPUInfo, i);
		// Interpret CPU brand string
		if (i == 0x80000002)
			memcpy(CPUBrandString, CPUInfo, sizeof(CPUInfo));
		else if (i == 0x80000003)
			memcpy(CPUBrandString + 16, CPUInfo, sizeof(CPUInfo));
		else if (i == 0x80000004)
			memcpy(CPUBrandString + 32, CPUInfo, sizeof(CPUInfo));
	}
	//string includes manufacturer, model and clockspeed
	cout << trim(CPUBrandString);

}

char *trim(char *str)
{
	size_t len = 0;
	char *frontp = str;
	char *endp = NULL;

	if (str == NULL) { return NULL; }
	if (str[0] == '\0') { return str; }

	len = strlen(str);
	endp = str + len;

	/* Move the front and back pointers to address the first non-whitespace
	* characters from each end.
	*/
	while (isspace((unsigned char)*frontp)) { ++frontp; }
	if (endp != frontp)
	{
		while (isspace((unsigned char)*(--endp)) && endp != frontp) {}
	}

	if (str + len - 1 != endp)
		*(endp + 1) = '\0';
	else if (frontp != str &&  endp == frontp)
		*str = '\0';

	/* Shift the string so that it starts at str so that if it's dynamically
	* allocated, we can still free it on the returned pointer.  Note the reuse
	* of endp to mean the front of the string buffer now.
	*/
	endp = str;
	if (frontp != str)
	{
		while (*frontp) { *endp++ = *frontp++; }
		*endp = '\0';
	}


	return str;
}
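
For comparison with `Sim3PolarReject`, here is a minimal sketch of Marsaglia's rejection method for a uniformly distributed unit vector. It uses the squared radius s = u² + v² both for the rejection test and for the mapping, and takes z = 1 − 2s, so x² + y² + z² = 4s(1 − s) + (1 − 2s)² = 1. The sketch uses `std::mt19937` in place of `Ranq2`, and `Vec3` and `marsaglia_unit_vector` are hypothetical names, not part of 3DPM.

```cpp
#include <cmath>
#include <random>

struct Vec3 { float x, y, z; };

// Marsaglia's method: pick (u, v) uniformly inside the unit disc by
// rejection, then map the point onto the unit sphere.
template <typename Rng>
Vec3 marsaglia_unit_vector(Rng& rng) {
    std::uniform_real_distribution<float> dist(-1.0f, 1.0f);
    float u, v, s;
    do {
        u = dist(rng);
        v = dist(rng);
        s = u * u + v * v;          // squared radius, no sqrt needed
    } while (s >= 1.0f);            // reject points outside the disc
    const float a = 2.0f * std::sqrt(1.0f - s);
    return { a * u, a * v, 1.0f - 2.0f * s };
}
```

This version needs two random numbers and one square root per accepted sample.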

I've been trying to use Visual Studio's profiler, but it's not really showing me any particular bottlenecks. Random numbers are fast, divisions seem OK. Main point of contention are the trig functions.

Edit: Two more results:

AMD Athlon X4 845 (Carrizo) scores 401.60:
v2b1 vs v1.03

Code:

Stage 1: 3DTrig      -  69.64 :  36.73 (+ 89.60%)
Stage 2: BiPy        - 130.31 :  41.03 (+217.60%)
Stage 3: PolarReject -  85.70 :  26.17 (+227.47%)
Stage 4: Cosine      -  35.30 :  24.41 (+ 44.61%)
Stage 5: Hypercube   -  51.07 :  11.13 (+358.85%)
Stage 6: NormDev     -  29.53 :  12.37 (+138.72%)

  Total: 401.60 : 151.84 (+164.49%)

Intel Core i3-6100TE (Skylake) at stock scores 348.14:
v2b1 vs v1.03

Code:

Stage 1: 3DTrig      -  60.38 :  70.13 (- 13.90%)
Stage 2: BiPy        - 124.75 :  80.94 (+ 54.13%)
Stage 3: PolarReject -  65.84 :  48.33 (+ 36.23%)
Stage 4: Cosine      -  30.38 :  30.65 (-  0.88%)
Stage 5: Hypercube   -  42.55 :  22.23 (+ 91.41%)
Stage 6: NormDev     -  24.23 :  15.75 (+ 53.84%)

  Total: 348.14 : 268.04 (+ 29.88%)

Wow, this is exciting. $70 AMD Carrizo based APU at 65W gets a better score than a $117 Intel at 35W.

Edit2: More
Intel Core i7-6700K scores 1021.98:
v2b1 vs v1.03

Code:

Stage 1: 3DTrig      - 175.92 : 208.36 (- 15.57%)
Stage 2: BiPy        - 364.97 : 238.63 (+ 52.94%)
Stage 3: PolarReject - 194.18 : 146.63 (+ 32.43%)
Stage 4: Cosine      -  88.95 :  90.10 (-  1.28%)
Stage 5: Hypercube   - 125.74 :  65.10 (+ 93.15%)
Stage 6: NormDev     -  72.22 :  46.31 (+ 55.95%)

  Total: 1021.98 : 795.13 (+ 28.53%)

Intel Core i7-4558U (Asus Zenbook) scores 414.48

Code:

Stage 1: 3DTrig      -  69.64 :  77.26 (-  9.86%)
Stage 2: BiPy        - 156.54 :  85.68 (+ 82.70%)
Stage 3: PolarReject -  75.85 :  51.05 (+ 48.58%)
Stage 4: Cosine      -  34.63 :  31.10 (+ 11.35%)
Stage 5: Hypercube   -  50.85 :  21.21 (+139.75%)
Stage 6: NormDev     -  26.97 :  15.44 (+ 74.68%)

  Total: 414.48 : 281.75 (+ 47.11%)

Essence_of_War · Mar 14, 2016

borandi said:
I've been trying to use Visual Studio's profiler, but it's not really showing me any particular bottlenecks. Random numbers are fast, divisions seem OK. Main point of contention are the trig functions.

Ian,

Are the trig functions being optimized into calls to FSINCOS? I think that's the fastest way of doing sin/cos, right?

borandi · Mar 14, 2016

Essence_of_War said:
Ian,

Are the trig functions being optimized into calls to FSINCOS? I think that's the fastest way of doing sin/cos, right?

sinf(); cosf(); and acosf(); are all in play 🙂
I would assume that /fp:fast would make this happen anyway in the compiler

DrMrLordX · Mar 14, 2016

Thanks for the additional updates and the code snippets. Great stuff! I'll run your latest version later when I get the chance . . .

Essence_of_War · Mar 14, 2016

borandi said:
sinf(); cosf(); and acosf(); are all in play 🙂
I would assume that /fp:fast would make this happen anyway in the compiler

Sorry, I didn't notice that in your compiler flags, but I would expect that also.

DrMrLordX · Mar 14, 2016

Here are my results with v2.0 b1 on a 7700k @ 4.5 GHz:

3DTrig: 61.6035
BiPy: 131.1203
PlrRjct: 62.9879 (v1 score: 41.3570)
Cosine: 32.7215 (v1 score: 26.8851)
HypCube: 43.9045 (v1 score: 17.0292)
NormDev: 27.4274

3DTrig, BiPy, and NormDev remain largely unchanged from v2 original. PlrRjct and HypCube show significant performance increases over v1, and Cosine shows a moderate performance increase.

TheRyuu · Mar 15, 2016

borandi said:
...

Have you checked out Agner's PRNG page[1]? Not sure if you would be interested in anything there. This is the same guy that publishes the really good optimization manuals[2] (for C++ and x86 asm).

[1] http://www.agner.org/random/?e=0#0
[2] http://www.agner.org/optimize/?e=0#0

borandi · Mar 15, 2016

TheRyuu said:
Have you checked out Agner's PRNG page[1]? Not sure if you would be interested in anything there. This is the same guy that publishes the really good optimization manuals[2] (for C++ and x86 asm).

[1] http://www.agner.org/random/?e=0#0
[2] http://www.agner.org/optimize/?e=0#0

Most of them are based on Mersenne Twisters. They have a large period (super hideously large, 10^6000), but also a large memory footprint.

My comment was more about the double->float cast time and if that's an issue or bottleneck. It shouldn't be more than a couple of micro-ops on a modern microarchitecture.
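
One common way to avoid the double->float round trip entirely is to build the float straight from the random bits. A minimal sketch, assuming a generator that returns 64-bit words like `Ranq2::int64()`; `bits_to_float` is a hypothetical name:

```cpp
#include <cstdint>

// Hypothetical helper: maps the top 24 bits of a 64-bit random word to a
// float in [0, 1). A float holds 24 significant bits, so the integer
// converts exactly and the result never rounds up to 1.0f.
inline float bits_to_float(std::uint64_t r) {
    return static_cast<float>(r >> 40) * (1.0f / 16777216.0f); // scale by 2^-24
}
```

Usage would be along the lines of `float u = bits_to_float(word);`, where `word` is the next 64-bit output of the generator.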

BtDaG · Mar 18, 2016

First - really looking forward to the update.

Second, I found some formatting issues on Windows 10. The line breaks aren't being observed for some reason. (I know, not really high priority.)

Third - is there any chance the new version could have command line arguments supported? For example, I would really like to be able to control the location & name of the log file that's generated.

borandi · Mar 21, 2016

BtDaG said:
First - really looking forward to the update.

Second, I found some formatting issues on Windows 10. The line breaks aren't being observed for some reason. (I know, not really high priority.)

Third - is there any chance the new version could have command line arguments supported? For example, I would really like to be able to control the location & name of the log file that's generated.

1) Thanks 🙂
2) I saw that as well after I posted the last update. It's to do with the cmd width no longer being 80, so I need to add in line breaks.
3) With any luck, that's the plan. If I get time, I want to code up a database for people to submit results. Though at this rate I'll never get any reviews done 😛

DrMrLordX · Mar 21, 2016

Think of it this way: a new 3DPM will enhance your future reviews. Your work is appreciated regardless.

BtDaG · Mar 22, 2016

borandi said:
1) Thanks 🙂
2) I saw that as well after I posted the last update. It's to do with the cmd width no longer being 80, so I need to add in line breaks.
3) With any luck, that's the plan. If I get time, I want to code up a database for people to submit results. Though at this rate I'll never get any reviews done 😛

Amazing! Obviously, I would prefer the reviews continue but I'm quite excited by the new version of this tool too.

A "loop" command (e.g. -l 3) might be nice as well, with the idea that it automatically runs the benchmark 3 times and averages the results to give a final score.

borandi · Apr 9, 2016

BtDaG said:
Amazing! Obviously, I would prefer the reviews continue but I'm quite excited by the new version of this tool too.

A "loop" command (e.g. -l 3) might be nice as well, with the idea that it automatically runs the benchmark 3 times and averages the results to give a final score.

Here's a new version, called beta-2 :

https://dl.dropboxusercontent.com/u/49768099/3DPMv2.0 beta-2.rar

Run the batch file.

- It should go through the whole thing 6 times and output a text file with the final result (average score from six runs). Final result is just a number.
- It also rests between sub-tests for 10 seconds for consistency
- Layout updated slightly

Next part is to output the per-loop results as well, in case people want to test algorithm consistency or see the console output.

Batch file can be edited. Main exe is called with three flags: loops fileName pauseAtEnd
So if you want to run 100 loops (the maximum), change 6 to 100.
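
For readers wondering how three positional flags like these might be handled, here is a minimal sketch. It is not the actual 3DPM parser: the defaults, the default file name, and the 0/1 encoding of `pauseAtEnd` are all assumptions, and `Options` and `parse_args` are hypothetical names. Only the argument order and the 100-loop maximum come from the post above.

```cpp
#include <cstdlib>
#include <string>

// Hypothetical option set mirroring the three positional flags.
struct Options {
    int loops = 6;                        // assumed default
    std::string fileName = "result.txt";  // assumed default
    bool pauseAtEnd = true;               // assumed default
};

// Parses: exe loops fileName pauseAtEnd
// Missing arguments keep their defaults; loops is clamped to 1..100.
Options parse_args(int argc, char** argv) {
    Options opt;
    if (argc > 1) {
        int n = std::atoi(argv[1]);
        if (n < 1) n = 1;
        if (n > 100) n = 100;
        opt.loops = n;
    }
    if (argc > 2) opt.fileName = argv[2];
    if (argc > 3) opt.pauseAtEnd = std::atoi(argv[3]) != 0; // assumed: 0 = no pause
    return opt;
}
```

Calling `parse_args(argc, argv)` at the top of `main` would then give the loop count, output file and pause setting for the rest of the run.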

borandi · Apr 15, 2016

Going to add a caveat to using the software. If using 3DPM to obtain data for reviews outside AT, then a link to www.anandtech.com should be added. I've seen 3DPM be used in a few places, so it seems like someone is looking at it 🙂

DrMrLordX · Apr 15, 2016

No surprise there . . .

BtDaG · Apr 18, 2016

borandi said:
Going to add a caveat to using the software. If using 3DPM to obtain data for reviews outside AT, then a link to www.anandtech.com should be added. I've seen 3DPM be used in a few places, so it seems like someone is looking at it 🙂

Makes sense, I only use it for internal reporting, nothing is made public. We all know where it comes from as we're all fans of the website and massive techies 😀

borandi · Feb 15, 2017

Update to v2.1 (2017-Feb-15)

Download: link

Change Log
- Added a copy-constructor to the particle struct
- Changed calls to the particle struct to ByRef
- Changed BiPy algorithm to switch/case rather than if (compiles to look-up table over predicates)
- PRNG now outputs a float rather than a double (stops some upconvert and deconvert)
- Scores are NOT comparable to v2.0
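
The v2.1 source is not posted here, so the following is only a guess at what the by-reference and switch/case changes might look like for BiPy. `Particle` and `bipy_step` are hypothetical names, and whether the switch becomes a jump or look-up table is up to the compiler.

```cpp
// Hypothetical particle type standing in for the benchmark's Simul struct.
struct Particle {
    float x = 0.0f;
    float y = 0.0f;
    float z = 0.0f;
};

// One BiPy step: k selects one of the six axis directions (0..5).
// The particle is updated in place through a reference, so nothing is
// copied in or out of the call.
inline void bipy_step(Particle& p, int k) {
    switch (k) {
        case 0:  p.x += 1.0f; break;
        case 1:  p.x -= 1.0f; break;
        case 2:  p.y += 1.0f; break;
        case 3:  p.y -= 1.0f; break;
        case 4:  p.z += 1.0f; break;
        default: p.z -= 1.0f; break;
    }
}
```

With a PRNG that returns a float `u` in [0, 1), the direction index could be taken as `static_cast<int>(6.0f * u)`.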

Score Comparison on a Core i7-6950X between versions

DrMrLordX · Feb 15, 2017

Thanks for continuing to update this benchmark! I look forward to seeing it run on Ryzen.

The Stilt · Mar 6, 2018

AVX-512 seems to provide a pretty nice boost in 3DPM.

v2.0b1 built on the newest Intel Compiler and compiled using QaxCORE-AVX512 option (i.e. auto dispatched).

Zero performance difference between QaxCORE-AVX2 and QaxCORE-AVX512 on Ryzens and SKL/KBL/CFL, however on Skylake-X:

ST, 3.8GHz fixed frequency.

AVX2

3DTrig 9*(0.119) in 21.2947 sec : 42.2641 Mops/sec
BiPy 11*(0.096) in 21.2700 sec : 51.7161 Mops/sec
PlrRjct 8*(0.030) in 21.4103 sec : 37.3653 Mops/sec
Cosine 5*(0.108) in 21.2139 sec : 23.5694 Mops/sec
HypCube 5*(0.049) in 23.2903 sec : 21.4682 Mops/sec
NormDev 3*(0.062) in 23.4378 sec : 12.7999 Mops/sec
||--- Total Score --- | 189.183 Mops/sec

AVX512

3DTrig 11*(0.119) in 20.2216 sec : 54.3971 Mops/sec
BiPy 11*(0.096) in 21.3113 sec : 51.6158 Mops/sec
PlrRjct 10*(0.030) in 21.8429 sec : 45.7815 Mops/sec
Cosine 6*(0.108) in 22.1933 sec : 27.0352 Mops/sec
HypCube 7*(0.049) in 22.6453 sec : 30.9116 Mops/sec
NormDev 4*(0.062) in 22.4101 sec : 17.8491 Mops/sec
||--- Total Score --- | 227.590 Mops/sec

DrMrLordX · Mar 6, 2018

Hey thanks for remembering that. Good to see something take advantage of AVX-512, especially considering where the original benchmark was before this thread.

edit: Intel made a big to-do about the Java 8 JVM optimizing for AVX/AVX2 back in the day. Have they been working with Oracle to update the JVM to target AVX-512 where available?

igor_kavinski · Jul 1, 2025

Link to the benchmark here: https://www.anandtech.com/show/14605/the-and-ryzen-3700x-3900x-review-raising-the-bar/8
