How much electricity would be saved worldwide if Windows was written in Assembly?

Status
Not open for further replies.

Scali

Banned
Dec 3, 2004
2,495
1
0
Thanks.
Now, if you would only recommend a compiler, IDE, and some learning stuff?

Assuming Windows, then Visual Studio 2010 Express (doesn't do 64-bit though, Microsoft wants to see $$ for that :) but for learning it doesn't matter. You can always plug in gcc later).
 

Any_Name_Does

Member
Jul 13, 2010
143
0
0
Assuming Windows, then Visual Studio 2010 Express (doesn't do 64-bit though, Microsoft wants to see $$ for that :) but for learning it doesn't matter. You can always plug in gcc later).

So if you don't want to do away with 64-bit, GCC would be it?

How about a good code editor? What do you use, if I may ask?
 
Last edited:

Scali

Banned
Dec 3, 2004
2,495
1
0
So if you don't want to do away with 64-bit, GCC would be it?

How about a good code editor? What do you use, if I may ask?

That's the thing... gcc is just a compiler.
Visual Studio is an IDE... You get a good code editor, nice project management, integrated source control etc... and a decent compiler to boot.
With gcc, I don't know, everyone has their preferences. Some like Eclipse, others use Netbeans, there's the Qt development environment, Anjuta, KDevelop, and tons of others...

You could still use the Visual Studio IDE and plug gcc in as a custom compiler.
 

Cogman

Lifer
Sep 19, 2000
10,286
147
106
I just did a quick check and I think your foiling is foiled?

Code:
SchmideSSE 1.716 seconds
simplesol 1.857 seconds
simplesolASM 1.357 seconds
SchmideSSEASM 0.827 seconds
cogmanSimplesol 1.716 seconds
Hit any key to continue (Where's the ANY key?)

I think what happened is that you put this stray semicolon here in the second loop. The assignment to the solution variable is important to prevent dead code. VC caught it. Though I would have dug into it, as it just didn't make sense.

Code:
			if (d == (double)iD);   // <-- the stray semicolon

This caused the compiler to see the second loop as dead code (i.e. no need to execute it)

If you look at the disassembly.

Code:
	for(int i=2;i<5001;++i)
01361068  inc         eax  
01361069  cmp         eax,1389h 
0136106E  jl          CogmanSimpleSolution+66h (1361066h) 
for(int iRuncount=0;iRuncount<10;iRuncount++)
01361070  sub         edi,1 
01361073  jne         CogmanSimpleSolution+20h (1361020h) 
			int c;
			double d;
			c = mulTable[i] + mulTable[j];
			d=sqrt((double)c);
			int iD=(int)d;
			if (d == (double)iD)
			{
			}
		}
	}
}

Nice try, but multiplying 2 numbers locally should only save you 3 ticks at most, nowhere near the amount you saved there, especially with the extra lookup-table access.

Crap, should have known something went wrong :p Just a sec, I actually did have a faster version than the original, but I was so excited to see the .25 seconds I thought I had discovered gold :).

Code:
bool CogmanSimpleSolution()
{
	bool solution;
	int iStart=GetTickCount();
    int mulTable[5001];
    mulTable[2] = 7;
for(int iRuncount=0;iRuncount<10;iRuncount++)
{
    int start = 1;
    if (mulTable[2] != 4)
    {
    for (int i = 1; i < 5001; ++i)
    {
        solution = false;
        mulTable[i] = i * i;
        int c;
        int d;
        c = 1 + mulTable[i];
        d = sqrt(c);
        if (d * d == c)
        {
            solution = true;
        }
    }
        start = 2;
    }

	for(int i=start;i<5001;++i)
	{
		for(int j=i;j<5001;++j)
		{
			solution=false;
			int c;
			int d;
			c = mulTable[i] + mulTable[j];
			d=sqrt(c);
			if (mulTable[d] == c)
			{
				solution=true; // without this it would deadcode the solution.
#ifdef _OUTPUTENABLE
				cout<<i;
				cout<<' ';
				cout<<j;
				cout<<' ';
				cout<<d;
				cout<<endl;
#endif
			}
		}
	}
}
	int iEnd=GetTickCount();
	double dEnd=(double) iEnd;
	double dStart=(double) iStart;
	double dTotalTick=((dEnd-dStart)/(double) 1000);
	cout<<"cogmanSimplesol ";
	cout<<dTotalTick;
	cout<<" seconds"<<endl;
	return solution;
}

This version is about 100ms faster than the simple solution that you posted (on my laptop, at least). There are SOME savings to using a table vs doing the straight multiplication.
 

Any_Name_Does

Member
Jul 13, 2010
143
0
0
That's the thing... gcc is just a compiler.
Visual Studio is an IDE... You get a good code editor, nice project management, integrated source control etc... and a decent compiler to boot.
With gcc, I don't know, everyone has their preferences. Some like Eclipse, others use Netbeans, there's the Qt development environment, Anjuta, KDevelop, and tons of others...

You could still use the Visual Studio IDE and plug gcc in as a custom compiler.

I would like to start with 64-bit right away. I already have a wonderful code editor with every feature you can think of, and it supports several languages. One of those languages is listed as cpp - is that C++?
 

Scali

Banned
Dec 3, 2004
2,495
1
0
I would like to start with 64 bit right away.

If you're doing C++, it really doesn't matter. You write pretty much the exact same code (assuming you don't screw up, and write clean code using intptr_t etc. for pointer-to-int casts). You just use a different compiler.

But I think gcc is the only free option for 64-bit.

And yes, cpp is probably C++. Generally you use the .cpp extension for C++ source files (most filesystems won't accept '+' as a valid character in a filename). Alternatively you may see .cxx occasionally.
 

Any_Name_Does

Member
Jul 13, 2010
143
0
0
If you're doing C++, it really doesn't matter. You write pretty much the exact same code (assuming you don't screw up, and write clean code using intptr_t etc. for pointer-to-int casts). You just use a different compiler.

But I think gcc is the only free option for 64-bit.

And yes, cpp is probably C++. Generally you use the .cpp extension for C++ source files (most filesystems won't accept '+' as a valid character in a filename). Alternatively you may see .cxx occasionally.

So, here comes the most important question: learning stuff.

OK. I used Google.
 
Last edited:

Schmide

Diamond Member
Mar 7, 2002
5,747
1,039
126
Crap, should have known something went wrong :p Just a sec, I actually did have a faster version than the original, but I was so excited to see the .25 seconds I thought I had discovered gold :).

Code:
bool CogmanSimpleSolution()
{
	bool solution;
	int iStart=GetTickCount();
    int mulTable[5001];
    mulTable[2] = 7;
for(int iRuncount=0;iRuncount<10;iRuncount++)
{
    int start = 1;
    if (mulTable[2] != 4)
    {
    for (int i = 1; i < 5001; ++i)
    {
        solution = false;
        mulTable[i] = i * i;
        int c;
        int d;
        c = 1 + mulTable[i];
        d = sqrt(c);
        if (d * d == c)
        {
            solution = true;
        }
    }
        start = 2;
    }

	for(int i=start;i<5001;++i)
	{
		for(int j=i;j<5001;++j)
		{
			solution=false;
			int c;
			int d;
			c = mulTable[i] + mulTable[j];
			d=sqrt(c);
			if (mulTable[d] == c)
			{
				solution=true; // without this it would deadcode the solution.
#ifdef _OUTPUTENABLE
				cout<<i;
				cout<<' ';
				cout<<j;
				cout<<' ';
				cout<<d;
				cout<<endl;
#endif
			}
		}
	}
}
	int iEnd=GetTickCount();
	double dEnd=(double) iEnd;
	double dStart=(double) iStart;
	double dTotalTick=((dEnd-dStart)/(double) 1000);
	cout<<"cogmanSimplesol ";
	cout<<dTotalTick;
	cout<<" seconds"<<endl;
	return solution;
}

This version is about 100ms faster than the simple solution that you posted (on my laptop, at least). There are SOME savings to using a table vs doing the straight multiplication.

How is that even running on your system? It overruns the table at i=101, j=5000. Edit: (i.e. (int)sqrt(101^2 + 5000^2) = 5001, one past the last valid index)
 
Last edited:

Cogman

Lifer
Sep 19, 2000
10,286
147
106
How is that even running on your system? It overruns the table at i=101, j=5000. Edit: (i.e. (int)sqrt(101^2 + 5000^2) = 5001, one past the last valid index)

gcc is less strict in the way it sets up addressing. You're right, I'm wrong on this one. Just reading from a memory address out of bounds usually doesn't cause problems.
 
Last edited:

Voo

Golden Member
Feb 27, 2009
1,684
0
76
And yes, cpp is probably C++. Generally you use the .cpp extension for C++ source files (most filesystems won't accept '+' as a valid character in a filename). Alternatively you may see .cxx occasionally.
And not to forget .cc - Google and some OSS projects use that. And don't forget printf when talking about 64-bit portability - they somehow forgot to specify format specifiers for some types until C99.. (oh, and those fun alignment problems when storing structs that should be shared between 32- and 64-bit programs; for gcc there's __attribute__((packed)) - I assume MSVC has something similar)

Ahem, b2t - I just SSHed into a cluster and ran the small CUDA program on a FX5600 (yeah, sorry, couldn't find anything slower :p ) and it takes ~35ms for one run (the CUDA timer has a resolution of better than 1ms, so that's fine).
Not bad for an ancient GPU and a thrown-together program.. a threaded version of Schmide's SSE code on a modern CPU should be faster if we extrapolate from the single-threaded results, but the CPU program is already highly optimized and we would have to use a modern GPU as well (and no doubt the CUDA program could be improved) ;)

code for reference:
Code:
#include <stdlib.h>
#include <stdio.h>

#include <cuda.h>
#include <cutil.h>

#define LEAF_SIZE (100)
#define MATRIX_SIZE (5000)
#define THREADS_PER_BLOCK (256)
#define MAX_NR_BLOCKS (65535)
#define MAX_NR_THREADS (MAX_NR_BLOCKS * THREADS_PER_BLOCK)

__device__ int ComputeRow(int i) {
	int ii = MATRIX_SIZE * (MATRIX_SIZE + 1) / 2 - 1 - i;
	int k = (((int) sqrt((float)8 * ii + 1)) - 1) / 2;
	return MATRIX_SIZE - 1 - k;
}

__device__ int ComputeCol(int row, int i) {
	return i - MATRIX_SIZE * row + row * (row + 1) / 2;
}

__device__ int IsTriple(int a2, int b) {
	int b2 = b * b;
	int c2 = a2 + b2;
	float c = sqrt((float)c2);
	return (int) c == c;
}

__global__ void ComputeTriples(int *erg) {
	int row, row2, col, i, j, sum = 0; // sum must start at 0 - it was uninitialized
	i = (blockIdx.x * THREADS_PER_BLOCK + threadIdx.x) * LEAF_SIZE;
	row = ComputeRow(i);
	col = ComputeCol(row, i);
	row2 = row * row;
	for(j = 0; j < LEAF_SIZE; j++) {
		sum += IsTriple(row2, col);
		col++;
		if (col == MATRIX_SIZE) {
			row++;
			row2 = row * row;
			col = row;
		}
	} 
	// make sure nothing is optimized away
	*erg += sum;
}

int main() {
	unsigned int timer, nr_blocks;
	int *erg;
	// don't care about one off errors here, at worst it computes a little bit too much.
	nr_blocks = MATRIX_SIZE * (MATRIX_SIZE + 1) / ( 2 * THREADS_PER_BLOCK * LEAF_SIZE) + 1;
	if (nr_blocks >= MAX_NR_BLOCKS) {
		printf("Too many blocks for one kernel invocation.\n");
		return 1;
	}
	cutCreateTimer(&timer);
	cutStartTimer(timer);
	cudaMalloc((void**) &erg, sizeof(int));
	ComputeTriples <<< nr_blocks, THREADS_PER_BLOCK >>> (erg);
	// cudaFree(erg); not sure if it'd optimize something away if we just free it after the call.
	cutStopTimer(timer);
	printf("done in: %fms\n", cutGetTimerValue(timer));
	cutDeleteTimer(timer);
	return 0;
}
 
Last edited:

iCyborg

Golden Member
Aug 8, 2008
1,363
68
91
Cogman's 2nd solution needs extending the mulTable to have 7072 entries. I also had to fix some conversions, VC++ 2008 complains with an error...

I have two solutions, both are math/algorithm optimizations, not ASM, my goal was to beat the fastest SSE+ASM solution by math only :)

The first one improves Cogman's solution by taking into account several trivial facts. The first is that there are no triples involving 1 or 2. The second is that j can start from i+1, as 2*i^2 is never a square (one could prove that sqrt(2) would be rational otherwise :)). The third is that j only needs to go as far as ((i*i)-1)/2; after that, j is too big, so i*i will not be enough to reach (j+1)^2. This cuts very little in practice (around 0.030s), because this i-expression quickly exceeds 5000 (at i=100), so the smaller limit of 5000 has to be used anyway. Overall, it is only a little over 10% faster than Cogman's simple solution...

The second solution relies on Euclid's formula to generate triples. The formula itself is simple; converting it to find triples with i,j < 5000 is not so simple, however, as it doesn't find them ordered by size, and it looks a bit convoluted without explanation. It also doesn't give you unique triples, so a trivial implementation will produce many duplicates. If that is OK (we do get all the triples, after all), this algorithm is ridiculously fast, something around 0.0009s (obtained by doing 10000 runs instead of 10, which still finishes under a second (0.904s)). I've added a very crude way to guarantee uniqueness, and it comes at the expense of a 25MB matrix that keeps this info. The matrix is symmetric and generally has lots of wasted space, so one could do considerably better than this. This does find exactly the same triples as the other correct algorithms, and still beats everything, but the overhead of this 25MB matrix is substantial and represents 99+% of the CPU time.

Also, I've added a solution count to be reported by all functions, which adds some overhead. I wanted it to serve as a check that each function is correct, and it seems that SchmideSSEASM, the fastest apart from Euclid, is so aggressively optimized that it optimizes some solutions away :)

SchmideSSE 1.763 seconds, solutions found: 66160
simplesol 1.888 seconds, solutions found: 66160
simplesolASM 1.31 seconds, solutions found: 66160
SchmideSSEASM 0.562 seconds, solutions found: 37660
cogmanSimplesol 1.529 seconds, solutions found: 66160
iCyborgSimplesol 1.357 seconds, solutions found: 66160
iCyborgEuclidSol 0.546 seconds, solutions found: 66160

All results done on i7 920@3.60GHz


Code:
bool iCyborgSimpleSolution()
{
	bool solution;
	int solutions = 0;
	int iStart=GetTickCount();
for(int iRuncount=0;iRuncount<10;iRuncount++)
{
    int mulTable[7072];
    for (int i = 3; i < 7072; ++i)
    {
        mulTable[i] = i * i;
    }

	for(int i=3;i<5000;++i)
	{
		int lim = (i>100) ? 5001 : (mulTable[i]-1)/2;
		for(int j=i+1;j<=lim;++j)
		{
			solution=false;
			int c;
			int d;
			c = mulTable[i] + mulTable[j];
			d=sqrt(double(c));
			if (mulTable[d] == c)
			{
				solution=true; // without this it would deadcode the solution.
				++solutions;
#ifdef _OUTPUTENABLE
				cout<<i;
				cout<<' ';
				cout<<j;
				cout<<' ';
				cout<<d;
				cout<<endl;
#endif
			}
		}
	}
}

	int iEnd=GetTickCount();
	double dEnd=(double) iEnd;
	double dStart=(double) iStart;
	double dTotalTick=((dEnd-dStart)/(double) 1000);
	cout<<"iCyborgSimplesol ";
	cout<<dTotalTick;
	cout<<" seconds, solutions found: " << solutions <<endl;
	return solution;
}

Code:
bool iCyborgEuclidSolution()
{
	#define _UNIQUE

	int solutions = 0;
	int iStart=GetTickCount();
for(int iRuncount=0;iRuncount<10;iRuncount++)
{
    int mulTable[2501];
#ifdef _UNIQUE
	 bool* unique = new bool[5001*5001];

	 for (int i=1; i<5001; ++i)
		 for (int j=1; j<5001; ++j)
			 unique[i*5001+j]=false;
#endif

	 for (int i = 1; i < 2501; ++i)
    {
        mulTable[i] = i * i;
    }

	for(int n=1; n<2501;++n)
	{
		for (int m=n+1; m<=2500/n; ++m) 
		{
			int l = mulTable[m]-mulTable[n];
			for (int k=1; true; ++k) 
			{
				if (k*l>5000 || k*m*n>2500) break;
#ifdef _UNIQUE				
				if (!unique[5001*k*l + 2*k*m*n]) {
					unique[5001*k*l + 2*k*m*n] = unique[5001*2*k*m*n + k*l] = true;
					++solutions;
				}
#else
				++solutions;
#endif
#ifdef _OUTPUTENABLE
				cout<<k*l;
				cout<<' ';
				cout<<k*m*n;
				cout<<' ';
				cout<<k*(mulTable[m]+mulTable[n]);
				cout<<endl;
#endif
			}
		}
		
	}
#ifdef _UNIQUE
	delete[] unique;
#endif
	
}

	int iEnd=GetTickCount();
	double dEnd=(double) iEnd;
	double dStart=(double) iStart;
	double dTotalTick=((dEnd-dStart)/(double) 1000);
	cout<<"iCyborgEuclidSol ";
	cout<<dTotalTick;
	cout<<" seconds, solutions found: "<< solutions << endl;
	return true;
}
 

Cogman

Lifer
Sep 19, 2000
10,286
147
106
well... would it be written by microsoft

Since WinXP, I've yet to see Microsoft release a truly poor piece of software (well, OK, IE6 and 7 aren't all that great, but other than that...). To attack them at this point is quite ignorant.
 

MrPickins

Diamond Member
May 24, 2003
9,125
792
126
This thread is priceless! :D

I challenge the OP to code an object oriented system using assembly.

I'm not saying it's impossible, but the difficulty would be orders of magnitude higher than coding in, for example, C++.
 

Cogman

Lifer
Sep 19, 2000
10,286
147
106
This thread is priceless! :D

I challenge the OP to code an object oriented system using assembly.

I'm not saying it's impossible, but the difficulty would be orders of magnitude higher than coding in, for example, C++.

The Win32 API isn't STRICTLY OO... it isn't strictly procedural either. It is like this weird hybrid. Part of the reason is that, when it was initially designed, they didn't want to alienate their C programmers. The trend, though, is moving more and more towards an OO model.
 

Schmide

Diamond Member
Mar 7, 2002
5,747
1,039
126
SchmideSSE 1.763 seconds, solutions found: 66160
simplesol 1.888 seconds, solutions found: 66160
simplesolASM 1.31 seconds, solutions found: 66160
SchmideSSEASM 0.562 seconds, solutions found: 37660
cogmanSimplesol 1.529 seconds, solutions found: 66160
iCyborgSimplesol 1.357 seconds, solutions found: 66160
iCyborgEuclidSol 0.546 seconds, solutions found: 66160

All results done on i7 920@3.60GHz

Fixed version. It was still doing the calculations, I just set up the variables wrong.


Code:
bool PythsseASM()
{
	bool bSolution=false;
	int iStart=GetTickCount();
	int solutions = 0;
for(int iRuncount=0;iRuncount<10;iRuncount++)
	for(int i=1;i<5001;i+=2) 
	{
		sSimdScalar a;
		a.m_ints[0]=i; // load
		a.m_ints[1]=i+1;
		a.m_vec128d=_mm_cvtpi32_pd(a.m_vec64[0]); // convert double
		a.m_vec128d=_mm_mul_pd(a.m_vec128d,a.m_vec128d);

		for(int j=i;j<5001;j+=2)
		{
			sSimdScalar b, c;
			sSimdScalarInt bInt, cInt;
			bInt.m_ints[0]=j;
			_asm {
				mov			eax, j
				psubd		mm3, mm3
				movd		mm0, eax
				inc			eax
				movd		mm1, eax
				xor			eax, eax			
				inc			eax
				movd		mm2, eax
				punpckldq   mm3, mm2
				paddd		mm2, mm3
				punpckldq   mm0, mm1
				paddd		mm2, mm0
				cvtpi2pd    xmm0, mm0 
				cvtpi2pd    xmm1, mm2 
				movapd		xmm2, a.m_vec128d
				mulpd		xmm0, xmm0
				mulpd		xmm1, xmm1
				addpd		xmm0, xmm2
				addpd		xmm1, xmm2
				sqrtpd		xmm0, xmm0
				sqrtpd		xmm1, xmm1
				movaps		xmm2, xmm0
				movaps		xmm3, xmm1
				cvtpd2pi    mm0,xmm0 
				movq		bInt.m_vec64, mm0
				cvtpd2pi    mm1,xmm1 
				movq		cInt.m_vec64, mm1
				cvtpi2pd    xmm0, mm0 
				cvtpi2pd    xmm1, mm1 
				cmpeqpd     xmm0,xmm2 
				movapd      b.m_vec128d,xmm0 
				cmpeqpd     xmm1,xmm3 
				movapd      c.m_vec128d,xmm1 
				nop
			}
			for(int k=0;k<2;k++)
			{
				if(b.m_ints[k<<1])
				{
					solutions++;
					bSolution=true;
#ifdef _OUTPUTENABLE
					cout<<((int)(i+k));
					cout<<' ';
					cout<<((int)(j+k));
					cout<<' ';
					cout<<bInt.m_ints[k];
					cout<<endl;
#endif
				}
				if(c.m_ints[k<<1])
				{
					solutions++;
					bSolution=true;
#ifdef _OUTPUTENABLE
					cout<<((int)(i+k));
					cout<<' ';
					cout<<((int)(j+k+1));
					cout<<' ';
					cout<<cInt.m_ints[k];
					cout<<endl;
#endif
				}
			} 
		} 
	}
	_mm_empty();
	int iEnd=GetTickCount();
	double dEnd=(double) iEnd;
	double dStart=(double) iStart;
	double dTotalTick=((dEnd-dStart)/(double) 1000);
	cout<<"SchmideSSEASM ";
	cout<<dTotalTick;
	cout<<" seconds, solutions found: "<< solutions << endl;
	return bSolution;
}
 

eLiu

Diamond Member
Jun 4, 2001
6,407
1
0
HAY GUYZ!

Wow this thread is fucking epic. I don't care about Windows in ASM, but generating pythagorean triples quickly does pique my curiosity! (This is how we know I'm an academic...) Anyway, I don't have time to implement this right now, but I'm pretty sure I have a faster algorithm than anything proposed so far. (Not to mention that I'm too much of a noob to hand-code anything in ASM, much less SSE in ASM. I can barely do SSE with intrinsics... typically rely on ICC to save me. Hats off to Schmide for being awesome.)

Basically, looping a & b is wasteful. If I generate (3,4,5) then I know many more triples... like (6,8,10), (9,12,15), etc. So why not only generate "primitive" triples (triples that are not [integer >1] multiples of other triples) and then scale them up?

How do you generate all primitive triples? That's easy; we'll do it by construction. Take two positive integers, (m,n):
(m^2 - n^2)^2 + (2*m*n)^2 = (m^2+n^2)^2

Then the triple is: (m^2-n^2, 2*m*n, m^2+n^2).

So the idea is to generate a primitive triple. Then write down all the multiples of that triple that are within the size-constraint.

Implementation-wise... we know m>n (otherwise m^2-n^2 is 0 or less). So your loop should start n at 1 and m at n+1=2. (Which will give (3,4,5)). Then keep incrementing m until the maximum leg-length is reached, and start over at the next n value. For a given n value, it's easy to figure out how high m can go before the OP's bound is exceeded. (He wanted triples where one leg is at most 10,000??)

So:
for(n=1; n<too_lazy_to_compute; n++){
for(m=n+1, m<_too_lazy_again; m++){
}
}

Every (m,n) pair generates a valid triple: (a,b,c). Identify whether a or b is bigger. (I bet if you're smart about it, you can know this without checking explicitly.) Divide max-leg-length by the larger of a,b. Say this gives me K possibilities... like with (3,4,5) and a max-leg-length of 10,000, I would now want to output (2*3,2*4,2*5), (3*3, 3*4, 3*5), ... , (2500*3, 2500*4, 2500*5).

K, I'm done with (3,4,5) (m=2,n=1), so move on to (m=3,n=1), rinse and repeat.

Though I'm not sure how to code this thing to be a fair comparison with the previous examples. My version never involves checking the validity of my triples: they're all valid and exhaustive by the construction of the (m,n) formula. To prevent the compiler from ignoring code, probably the easiest way would be to also output the total number of triples found. Or perhaps output the final state of some inner-loop temp quantity.

This should be WAY faster:
1) No floating point arithmetic, no sqrts. With proper loop unrolls, SSE should really rock some socks here b/c we only need 32 bit unsigned integers and we can run wild with those new 256-bit wide integer registers (errr... maybe that's not out yet, I forgot). All we need to do are integer multiplies (most of the work), integer adds, and the occasional division (to figure out how many non-primitive triples must be generated from each primitive triple).
2) Iterating over far fewer numbers, at least in the (m,n) loops. Somewhere between half as many and sqrt as many outer loop iterations. Like suppose I want to count all triples involving 11. 2mn never equals 11 (it's even), so we need all solutions of m^2 - n^2 = 11, s.t. m>n>=1. The only choice is m=6,n=5: (11,60,61). Voila, all triples with a=11 without checking b=1..59,61..10000.
3) No unpredictable if-checks.

Edit: NUTS, iCyborg already got it. I didn't read the whole thread before I started thinking about how to design a better algorithm. Apparently it's been known to mankind for a few thousand years! Haha, well played Euclid.

Edit2: Didn't consider uniqueness, whoops. Seems like a hash table would get the job done better than a gigantic array. Also I believe you only need to check the uniqueness of triples generated with the base (m,n). No need to check the multiples, because this algorithm only generates primitive triples. And it would be impossible to have two primitive triples (a,b,c) and (x,y,z) such that (Na,Nb,Nc) = (Mx,My,Mz), as this would imply one triple is non-primitive. I think anyway. Indexing on (m,n) only should knock your table size down into the hundreds of thousands.
 
Last edited:

Voo

Golden Member
Feb 27, 2009
1,684
0
76
The Win32 API isn't STRICTLY OO... it isn't strictly procedural either. It is like this weird hybrid. Part of the reason is that, when it was initially designed, they didn't want to alienate their C programmers. The trend, though, is moving more and more towards an OO model.
Considering the ugly hacks you need for even basic stuff sometimes (well, that's more a C++ limitation, but nonetheless), I'd only consider the .NET stuff really OO.

I mean, just consider all the callbacks that can't be class members, so you end up stashing `this` pointers at initialization and fetching them back later (hi, window procedures) - I wouldn't consider that very OO-ish.. or beautiful, for that matter
 

iCyborg

Golden Member
Aug 8, 2008
1,363
68
91
Fixed version. It was still doing the calculations, I just set up the variables wrong.
True story, the time is identical. Anyway, I have a new winner. I did some optimizations in some multiplications, but that didn't help much. What HELPED A LOT was replacing that huge double loop that initializes unique[][] to false element by element with a single memset() to 0 of size 5001^2; boy, that cuts the time by 5x.

SchmideSSE 1.778 seconds, solutions found: 66160
simplesol 1.872 seconds, solutions found: 66160
simplesolASM 1.311 seconds, solutions found: 66160
SchmideSSEASM 0.561 seconds, solutions found: 66160
cogmanSimplesol 1.529 seconds, solutions found: 66160
iCyborgSimplesol 1.357 seconds, solutions found: 66160
iCyborgEuclidSol 0.094 seconds, solutions found: 66160
Hit any key to continue (Where's the ANY key?)



Code:
bool iCyborgEuclidSolution()
{
	#define _UNIQUE

	int solutions = 0;
	int iStart=GetTickCount();
for(int iRuncount=0;iRuncount<10;iRuncount++)
{
    int mulTable[2501];
#ifdef _UNIQUE
	 bool* unique = new bool[5001*5001];
	 memset(unique, 0, 5001*5001);
#endif

	 for (int i = 1; i < 2501; ++i)
    {
        mulTable[i] = i * i;
    }

	for(int n=1; n<2501;++n)
	{
		for (int m=n+1; m<=2500/n; ++m) 
		{
			int l = mulTable[m]-mulTable[n];
			int a = 0;
			int b = 0;
			for (int k=1; true; ++k) 
			{
				a += l;
				b += m*n;
				if (a>5000 || b>2500) break;
#ifdef _UNIQUE				
				if (!unique[5001*a + 2*b]) {
					unique[5001*a + 2*b] = unique[10002*b + a] = true;
					++solutions;
				}
#else
				++solutions;
#endif
#ifdef _OUTPUTENABLE
				cout<<k*l;
				cout<<' ';
				cout<<k*m*n;
				cout<<' ';
				cout<<k*(mulTable[m]+mulTable[n]);
				cout<<endl;
#endif
			}
		}
		
	}
#ifdef _UNIQUE
	delete[] unique;
#endif
	
}

	int iEnd=GetTickCount();
	double dEnd=(double) iEnd;
	double dStart=(double) iStart;
	double dTotalTick=((dEnd-dStart)/(double) 1000);
	cout<<"iCyborgEuclidSol ";
	cout<<dTotalTick;
	cout<<" seconds, solutions found: "<< solutions << endl;
	return true;
}
 

MrPickins

Diamond Member
May 24, 2003
9,125
792
126
The Win32 API isn't STRICTLY OO... it isn't strictly procedural either. It is like this weird hybrid. Part of the reason for that is the fact that when it was initially programmed they didn't want to alienate their C programmers. Though the trend is going more and more towards an OO model.

I should have mentioned that I was going on a bit of a tangent. It may not directly relate to the Win32 API, but it might firm up in the OP's mind why writing everything in ASM is unreasonable (to say the least).
 

iCyborg

Golden Member
Aug 8, 2008
1,363
68
91
HAY GUYZ!

Wow this thread is fucking epic. I don't care about Windows in ASM, but generating pythagorean triples quickly does pique my curiosity! (This is how we know I'm an academic...) Anyway, I don't have time to implement this right now, but I'm pretty sure I have a faster algorithm than anything proposed so far. (Not to mention that I'm too much of a noob to hand-code anything in ASM, much less SSE in ASM. I can barely do SSE with intrinsics... typically rely on ICC to save me.)

Basically, looping a & b is wasteful. If I generate (3,4,5) then I know many more triples... like (6,8,10), (9,12,15), etc. So why not only generate "primitive" triples (triples that are not [integer >1] multiples of other triples) and then scale them up?

How do you generate all primitive triples? That's easy; we'll do it by construction. Take two positive integers, (m,n):
(m^2 - n^2)^2 + (2*m*n)^2 = (m^2+n^2)^2

Then the triple is: (m^2-n^2, 2*m*n, m^2+n^2).

So the idea is to generate a primitive triple. Then write down all the multiples of that triple that are within the size-constraint.
That is exactly what my solution does - that's Euclid's formula, there's a reason why I'm the only one who named his loop variables m,n,k instead of i,j :)
The tricky part is generating only those triples where a,b <= 5000, because you loop over m,n, which don't produce the triples in order of size.
If you look at my loop, it has:
1. n going from 1 to 2500 - this comes from 2*m*n<5000, so n<2500 (I just realized that since m>n>=1, I can cut the n loop to 1250, but I did a quick run and, due to timer resolution, saw no observable improvement)
2. m goes from n+1 to 2500/n for the same 2*m*n<5000 reason
3. Now we have a primitive triple, so there goes a loop by k which generates triples by multiplying by k until one member is bigger than 5000

That's it. I'm not an ASM expert either, especially not SSE/intrinsics, so optimizations would take me a couple days most likely...
 

Modelworks

Lifer
Feb 22, 2007
16,240
7
76
This thread is priceless! :D

I challenge the OP to code an object oriented system using assembly.

I'm not saying it's impossible, but the difficulty would be orders of magnitude higher than coding in, for example, C++.


It has been done many times in the past. The reason it isn't done on the PC now is that it isn't required. I work mainly with embedded processors, where it's still required quite a bit because of space concerns. Eventually even that will move to high-level languages. Some of it is already in the C++/Java realm, but the majority is still ASM.

From 1996: GEOS, written in ASM and object-oriented.
http://www.pencomputing.com/developer/geos_introduction.html
 

MrPickins

Diamond Member
May 24, 2003
9,125
792
126
It has been done many times in the past. The reason it isn't done on the PC now is that it isn't required. I work mainly with embedded processors, where it's still required quite a bit because of space concerns. Eventually even that will move to high-level languages. Some of it is already in the C++/Java realm, but the majority is still ASM.

From 1996: GEOS, written in ASM and object-oriented.
http://www.pencomputing.com/developer/geos_introduction.html

I never said it was impossible or had never been done. I was suggesting that the OP try it for himself to see the difference in effort required.

I do realize that ASM has its place, but the OP seems to think it's the best everywhere...
 

iCyborg

Golden Member
Aug 8, 2008
1,363
68
91
Edit: NUTS, iCyborg already got it. I didn't read the whole thread before I started thinking about how to design a better algorithm. Apparently it's been known to mankind for a few thousand years! Haha, well played Euclid.

Edit2: Didn't consider uniqueness, whoops. Seems like a hash table would get the job done better than a gigantic array. Also I believe you only need to check the uniqueness of triples generated with the base (m,n). No need to check the multiples, because this algorithm only generates primitive triples. And it would be impossible to have two primitive triples (a,b,c) and (x,y,z) such that (Na,Nb,Nc) = (Mx,My,Mz), as this would imply one triple is non-primitive. I think anyway.
Yeah, that's true for the check, but I'm not sure about the hash table, because a lookup would still involve pretty much the same computation as the array-index calculation.

I think I've nailed down
for(n=1; n<too_lazy_to_compute; n++){
for(m=n+1, m<_too_lazy_again; m++){

The best I can do is n=1; n<50
and m goes from m=n+1; m<2500/n

It doesn't improve the time much, though. I should correct my earlier figure to 0.114s; running 100 runs instead of 10 gives a more accurate measurement.
 