I plugged in the code from this post and got the same results. Maybe its an AMD/Intel/I7 thingy because the code does seem updated.
My first Euclid was around your SSEASM time (which was ~0.550s for me), and the 2nd ~0.115s. Both of your versions of Euclid seem quite better than SSEASM, if you really got the right code, then my i7 isn't doing something very well in the first case.
Anyway, I have something much, much better. I was surprised how much I could cut the time by pulling that big memory allocation out of the iRuncount loop. I cut the time to 0.026s just by allocating it once, and not for every run. This is, of course, cheating, as I'm sharing stuff between the runs, that's not what the runs are for, but it really made me want to get rid of it altogether. And it can be done. And the resulting algorithm does a single run under 0.1ms! (0.952s for 10,000 runs, granted, it's on an OCed i7)
Any meaningful gains would probably come from optimizing this algorithm with SSE, not sure how much more optimal you can get without it, as there are not that many arithmetic operations involved per generated triple, and all are integer ops (most of them come from only 2 integer additions involved!).
@eLiu
After going through wiki on this, the problem is that the algorithm does not generate primitive triples unless m,n are coprime and exactly one of them is odd. Since n<50, there are only 14 primes from 1 to 50 (not including 2 which is handled by the m-loop going from n+1 (surely different parity) and incrementing by 2 preserving different parity) so you only check in a loop that m,n do not share any of those primes as factors and you have a primitive triple. Once you have a primitive triple, you just keep multiplying with integers until one of them is larger than 5000 (there's a possible small optimization here to check which member is bigger, and loop until that one is bigger).
Also, I've taken into account that m^2-n^2 must be <=5000, and this means that m<=min(87, 2500/n), because if m>87, then since n is max 50, m^2-n^2 >= 88^2-50^2 > 5000. Could probably add a small optimization to make this 87 limit vary with n.
Taking all this into account, I get (per 10 runs, Euclid does 10,000 to get around timer res, but I account for that in the printout):
SchmideSSE
1.732 seconds, solutions found: 66160
simplesol
1.794 seconds, solutions found: 66160
simplesolASM
1.279 seconds, solutions found: 66160
SchmideSSEASM
0.546 seconds, solutions found: 66160
SchmideSSEASMOpt
0.53 seconds, solutions found: 66160
cogmanSimplesol
1.528 seconds, solutions found: 66160
iCyborgSimplesol
1.342 seconds, solutions found: 66160
iCyborgEuclidSol
0.000952 seconds, solutions found: 66160
AnyASM
10.484 seconds
Hit any key to continue (Where's the ANY key?)
Crazy
The code does start to look less understandable than that simplest ASM though, because it needs quite a bit of math explanation as to why it's like it is...
Code:
bool iCyborgEuclidSolution()
{
#define _UNIQUE
int solutions = 0;
int iStart=GetTickCount();
for(int iRuncount=0;iRuncount<10000;iRuncount++)
{
int mulTable[2501];
#ifdef _UNIQUE
int primes[] = {3, 5, 7, 11, 13, 17, 19, 23, 29, 31, 37, 41, 43, 47};
#endif
for (int i = 1; i < 2501; ++i)
{
mulTable[i] = i * i;
}
for(int n=1; n<50;++n)
{
int lim = (n>28) ? 2500/n : 87;
for (int m=n+1; m<=lim; m+=2)
{
#ifdef _UNIQUE
bool coprime = true;
for (int i=0; i<14; i++)
{
if ((n%primes[i])==0 && (m%primes[i])==0)
{
coprime = false;
break;
}
}
if (!coprime) continue; // not primitive, skip
#endif
int l = mulTable[m]-mulTable[n];
int a = 0;
int b = 0;
for (int k=1; true; ++k)
{
a += l;
b += m*n;
if (a>5000 || b>2500) break;
++solutions;
#ifdef _OUTPUTENABLE
cout<<a;
cout<<' ';
cout<<b;
cout<<' ';
cout<<k*(mulTable[m]+mulTable[n]);
cout<<endl;
#endif
}
}
}
}
int iEnd=GetTickCount();
double dEnd=(double) iEnd;
double dStart=(double) iStart;
double dTotalTick=((dEnd-dStart)/(double) 1000000);
cout<<"iCyborgEuclidSol ";
cout<<dTotalTick;
cout<<" seconds, solutions found: "<< solutions/1000 << endl;
return true;
}