Thanks.
Now, if you would only recommend a compiler, ide and learn stuff?
Assuming Windows, then Visual Studio 2010 Express (doesn't do 64-bit though, Microsoft wants to see $$ for that
Thanks.
Now, if you would only recommend a compiler, ide and learn stuff?
Assuming Windows, then Visual Studio 2010 Express (doesn't do 64-bit though, Microsoft wants to see $$ for thatbut for learning it doesn't matter. You can always plug in gcc later).
so if you don't want to do away with 64 bit, GCC would be it?
How about a good code editor, what do you use if I may ask?
I just did a quick check and I think your foiling is foiled?
Code:SchmideSSE 1.716 seconds simplesol 1.857 seconds simplesolASM 1.357 seconds SchmideSSEASM 0.827 seconds cogmanSimplesol 1.716 seconds Hit any key to continue (Where's the ANY key?)
What happened I think is you put this semicolon here in the second loop. The assignment to the solution variable is important to prevent dead code. VC caught it. Though I would of dug into it as it just didn't make sense.
Code:if (d == (double)iD)[COLOR="Red"];[/COLOR]
This caused the compiler to see the second loop as dead code (i.e. no need to execute it)
If you look at the disassembly.
Code:for(int i=2;i<5001;++i) 01361068 inc eax 01361069 cmp eax,1389h 0136106E jl CogmanSimpleSolution+66h (1361066h) for(int iRuncount=0;iRuncount<10;iRuncount++) 01361070 sub edi,1 01361073 jne CogmanSimpleSolution+20h (1361020h) int c; double d; c = mulTable[i] + mulTable[j]; d=sqrt((double)c); int iD=(int)d; if (d == (double)iD) { } } } }
Nice try though but multiplying 2 numbers locally should only save you 3 ticks at most nowhere near the amount you saved there, especially with the extra access of the lookup table.
bool CogmanSimpleSolution()
{
bool solution;
int iStart=GetTickCount();
int mulTable[5001];
mulTable[2] = 7;
for(int iRuncount=0;iRuncount<10;iRuncount++)
{
int start = 1;
if (mulTable[2] != 4)
{
for (int i = 1; i < 5001; ++i)
{
solution = false;
mulTable[i] = i * i;
int c;
int d;
c = 1 + mulTable[i];
d = sqrt(c);
if (d * d == c)
{
solution = true;
}
}
start = 2;
}
for(int i=start;i<5001;++i)
{
for(int j=i;j<5001;++j)
{
solution=false;
int c;
int d;
c = mulTable[i] + mulTable[j];
d=sqrt(c);
if (mulTable[d] == c)
{
solution=true; // without this it would deadcode the solution.
#ifdef _OUTPUTENABLE
cout<<i;
cout<<' ';
cout<<j;
cout<<' ';
cout<<iD;
cout<<endl;
#endif
}
}
}
}
int iEnd=GetTickCount();
double dEnd=(double) iEnd;
double dStart=(double) iStart;
double dTotalTick=((dEnd-dStart)/(double) 1000);
cout<<"cogmanSimplesol ";
cout<<dTotalTick;
cout<<" seconds"<<endl;
return solution;
}
That's the thing... gcc is just a compiler.
Visual Studio is an IDE... You get a good code editor, nice project management, integrated source control etc... and a decent compiler to boot.
With gcc, I don't know, everyone has their preferences. Some like Eclipse, others use Netbeans, there's the Qt development environment, Anjuta, KDevelop, and tons of others...
You could still use the Visual Studio IDE and plug gcc in as a custom compiler.
I would like to start with 64 bit right away.
If you're doing C++, it really doesn't matter. You write pretty much the exact same code (assuming you don't screw up, and write nice code using intptr_t etc when using pointer-int casts etc). You just use a different compiler.
But I think gcc is the only free option for 64-bit.
And yes, cpp is probably C++. Generally you use the .cpp extension for C++ source files (most filesystems won't accept '+' as a valid character in a filename). Alternatively you may see .cxx occassionally.
Crap, Should have known something went wrongJust a sec, I actually did have a faster version then the original, but was so excited to see the .25 seconds I thought I had discovered gold
.
Code:bool CogmanSimpleSolution() { bool solution; int iStart=GetTickCount(); int mulTable[5001]; mulTable[2] = 7; for(int iRuncount=0;iRuncount<10;iRuncount++) { int start = 1; if (mulTable[2] != 4) { for (int i = 1; i < 5001; ++i) { solution = false; mulTable[i] = i * i; int c; int d; c = 1 + mulTable[i]; d = sqrt(c); if (d * d == c) { solution = true; } } start = 2; } for(int i=start;i<5001;++i) { for(int j=i;j<5001;++j) { solution=false; int c; int d; c = mulTable[i] + mulTable[j]; d=sqrt(c); if (mulTable[d] == c) { solution=true; // without this it would deadcode the solution. #ifdef _OUTPUTENABLE cout<<i; cout<<' '; cout<<j; cout<<' '; cout<<iD; cout<<endl; #endif } } } } int iEnd=GetTickCount(); double dEnd=(double) iEnd; double dStart=(double) iStart; double dTotalTick=((dEnd-dStart)/(double) 1000); cout<<"cogmanSimplesol "; cout<<dTotalTick; cout<<" seconds"<<endl; return solution; }
This version is about 100ms faster then the simple solution that you posted (on my laptop at least.). There is SOME savings to using a table vs doing the straight multiplication.
So, here come the most important question. Learning stuff.
How is that even running on your system? It overloads the table at i=101 j=5000. Edit: (i.e. sqrt(101^2 + 5000^2) = 5001)
And not to forget .cc - google and some OS project use that. And don't forget printing when talking about 64bit portability - they somehow forgot to specify some types in C99.. (oh and those fun alignment problems when storing structs that should be shared between 32 and 64bit programs, for gcc there's __attribute__(packed) - I assume MSVC has something as well)And yes, cpp is probably C++. Generally you use the .cpp extension for C++ source files (most filesystems won't accept '+' as a valid character in a filename). Alternatively you may see .cxx occassionally.
#include <stdlib.h>
#include <stdio.h>
#include <cuda.h>
#include <cutil.h>
#define LEAF_SIZE (100)
#define MATRIX_SIZE (5000)
#define THREADS_PER_BLOCK (256)
#define MAX_NR_BLOCKS (65535)
#define MAX_NR_THREADS (MAX_NR_BLOCKS * THREADS_PER_BLOCK)
__device__ int ComputeRow(int i) {
int ii = MATRIX_SIZE * (MATRIX_SIZE + 1) / 2 - 1 - i;
int k = (((int) sqrt((float)8 * ii + 1)) - 1) / 2;
return MATRIX_SIZE - 1 - k;
}
__device__ int ComputeCol(int row, int i) {
return i - MATRIX_SIZE * row + row * (row + 1) / 2;
}
__device__ int IsTriple(int a2, int b) {
int b2 = b * b;
int c2 = a2 + b2;
float c = sqrt((float)c2);
return (int) c == c;
}
__global__ void ComputeTriples(int *erg) {
int row, row2, col, i, j, sum;
i = (blockIdx.x * THREADS_PER_BLOCK + threadIdx.x) * LEAF_SIZE;
row = ComputeRow(i);
col = ComputeCol(row, i);
row2 = row * row;
for(j = 0; j < LEAF_SIZE; j++) {
sum += IsTriple(row2, col);
col++;
if (col == MATRIX_SIZE) {
row++;
row2 = row * row;
col = row;
}
}
// make sure nothing is optimized away
*erg += sum;
}
int main() {
unsigned int timer, nr_blocks;
int *erg;
// don't care about one off errors here, at worst it computes a little bit too much.
nr_blocks = MATRIX_SIZE * (MATRIX_SIZE + 1) / ( 2 * THREADS_PER_BLOCK * LEAF_SIZE) + 1;
if (nr_blocks >= MAX_NR_BLOCKS) {
printf("Too many blocks for one kernel invocation.\n");
return 1;
}
cutCreateTimer(&timer);
cutStartTimer(timer);
cudaMalloc((void**) &erg, sizeof(int));
ComputeTriples <<< nr_blocks, THREADS_PER_BLOCK >>> (erg);
// cudaFree(erg); not sure if it'd optimize something away if we just free it after the call.
cutStopTimer(timer);
printf("done in: %fms\n", cutGetTimerValue(timer));
cutDeleteTimer(timer);
return 0;
}
bool iCyborgSimpleSolution()
{
bool solution;
int solutions = 0;
int iStart=GetTickCount();
for(int iRuncount=0;iRuncount<10;iRuncount++)
{
int mulTable[7072];
for (int i = 3; i < 7072; ++i)
{
mulTable[i] = i * i;
}
for(int i=3;i<5000;++i)
{
int lim = (i>100) ? 5001 : (mulTable[i]-1)/2;
for(int j=i+1;j<=lim;++j)
{
solution=false;
int c;
int d;
c = mulTable[i] + mulTable[j];
d=sqrt(double(c));
if (mulTable[d] == c)
{
solution=true; // without this it would deadcode the solution.
++solutions;
#ifdef _OUTPUTENABLE
cout<<i;
cout<<' ';
cout<<j;
cout<<' ';
cout<<iD;
cout<<endl;
#endif
}
}
}
}
int iEnd=GetTickCount();
double dEnd=(double) iEnd;
double dStart=(double) iStart;
double dTotalTick=((dEnd-dStart)/(double) 1000);
cout<<"iCyborgSimplesol ";
cout<<dTotalTick;
cout<<" seconds, solutions found: " << solutions <<endl;
return solution;
}
bool iCyborgEuclidSolution()
{
#define _UNIQUE
int solutions = 0;
int iStart=GetTickCount();
for(int iRuncount=0;iRuncount<10;iRuncount++)
{
int mulTable[2501];
#ifdef _UNIQUE
bool* unique = new bool[5001*5001];
for (int i=1; i<5001; ++i)
for (int j=1; j<5001; ++j)
unique[i*5001+j]=false;
#endif
for (int i = 1; i < 2501; ++i)
{
mulTable[i] = i * i;
}
for(int n=1; n<2501;++n)
{
for (int m=n+1; m<=2500/n; ++m)
{
int l = mulTable[m]-mulTable[n];
for (int k=1; true; ++k)
{
if (k*l>5000 || k*m*n>2500) break;
#ifdef _UNIQUE
if (!unique[5001*k*l + 2*k*m*n]) {
unique[5001*k*l + 2*k*m*n] = unique[5001*2*k*m*n + k*l] = true;
++solutions;
}
#else
++solutions;
#endif
#ifdef _OUTPUTENABLE
cout<<k*l;
cout<<' ';
cout<<k*m*n;
cout<<' ';
cout<<k*(mulTable[m]+mulTable[n]);
cout<<endl;
#endif
}
}
}
#ifdef _UNIQUE
delete[] unique;
#endif
}
int iEnd=GetTickCount();
double dEnd=(double) iEnd;
double dStart=(double) iStart;
double dTotalTick=((dEnd-dStart)/(double) 1000);
cout<<"iCyborgEuclidSol ";
cout<<dTotalTick;
cout<<" seconds, solutions found: "<< solutions << endl;
return true;
}
well... would it be written by microsoft
This thread is priceless!
I challenge the OP to code an object oriented system using assembly.
I'm not saying it's impossible, but the difficulty would be orders of magnitude higher than coding in, for example, c++
SchmideSSE 1.763 seconds, solutions found: 66160
simplesol 1.888 seconds, solutions found: 66160
simplesolASM 1.31 seconds, solutions found: 66160
SchmideSSEASM 0.562 seconds, solutions found: 37660
cogmanSimplesol 1.529 seconds, solutions found: 66160
iCyborgSimplesol 1.357 seconds, solutions found: 66160
iCyborgEuclidSol 0.546 seconds, solutions found: 66160
All results done on i7 920@3.60GHz
bool PythsseASM()
{
bool bSolution=false;
int iStart=GetTickCount();
int solutions = 0;
for(int iRuncount=0;iRuncount<10;iRuncount++)
for(int i=1;i<5001;i+=2)
{
sSimdScalar a;
a.m_ints[0]=i; // load
a.m_ints[1]=i+1;
a.m_vec128d=_mm_cvtpi32_pd(a.m_vec64[0]); // convert double
a.m_vec128d=_mm_mul_pd(a.m_vec128d,a.m_vec128d);
for(int j=i;j<5001;j+=2)
{
sSimdScalar b, c;
sSimdScalarInt bInt, cInt;
bInt.m_ints[0]=j;
bInt.m_ints[0]=j;
_asm {
mov eax, j
psubd mm3, mm3
movd mm0, eax
inc eax
movd mm1, eax
xor eax, eax
inc eax
movd mm2, eax
punpckldq mm3, mm2
paddd mm2, mm3
punpckldq mm0, mm1
paddd mm2, mm0
cvtpi2pd xmm0, mm0
cvtpi2pd xmm1, mm2
movapd xmm2, a.m_vec128d
mulpd xmm0, xmm0
mulpd xmm1, xmm1
addpd xmm0, xmm2
addpd xmm1, xmm2
sqrtpd xmm0, xmm0
sqrtpd xmm1, xmm1
movaps xmm2, xmm0
movaps xmm3, xmm1
cvtpd2pi mm0,xmm0
movq bInt.m_vec64, mm0
cvtpd2pi mm1,xmm1
movq cInt.m_vec64, mm1
cvtpi2pd xmm0, mm0
cvtpi2pd xmm1, mm1
cmpeqpd xmm0,xmm2
movapd b.m_vec128d,xmm0
cmpeqpd xmm1,xmm3
movapd c.m_vec128d,xmm1
nop
}
for(int k=0;k<2;k++)
{
if(b.m_ints[k<<1])
{
solutions++;
bSolution=true;
#ifdef _OUTPUTENABLE
cout<<((int)(i+k));
cout<<' ';
cout<<((int)(j+k));
cout<<' ';
cout<<bInt.m_ints[k];
cout<<endl;
#endif
}
if(c.m_ints[k<<1])
{
solutions++;
bSolution=true;
#ifdef _OUTPUTENABLE
cout<<((int)(i+k));
cout<<' ';
cout<<((int)(j+k+1));
cout<<' ';
cout<<cInt.m_ints[k];
cout<<endl;
#endif
}
}
}
}
_mm_empty();
int iEnd=GetTickCount();
double dEnd=(double) iEnd;
double dStart=(double) iStart;
double dTotalTick=((dEnd-dStart)/(double) 1000);
cout<<"SchmideSSEASM ";
cout<<dTotalTick;
cout<<" seconds, solutions found: "<< solutions << endl;
return bSolution;
}
Considering the ugly hacks you need for even basic stuff sometimes (well that's more a c++ limitation but nonetheless), I'd only consider the .NET stuff really OO.The Win32 API isn't STRICTLY OO... it isn't strictly procedural either. It is like this weird hybrid. Part of the reason for that is the fact that when it was initially programmed they didn't want to alienate their C programmers. Though the trend is going more and more towards an OO model.
True story, the time is identical. Anyway, I have a new winner, I did some optimizations in some multiplications, but that didn't help much. What HELPED A LOT was replacing that huge double loop with which initializes unique[][] to false for each member to do it by a memset() to 0 of size 5001^2, boy, that cuts the time by 5x.Fixed version. It was still doing the calculations, I just set up the variables wrong.
bool iCyborgEuclidSolution()
{
#define _UNIQUE
int solutions = 0;
int iStart=GetTickCount();
for(int iRuncount=0;iRuncount<10;iRuncount++)
{
int mulTable[2501];
#ifdef _UNIQUE
bool* unique = new bool[5001*5001];
memset(unique, 0, 5001*5001);
#endif
for (int i = 1; i < 2501; ++i)
{
mulTable[i] = i * i;
}
for(int n=1; n<2501;++n)
{
for (int m=n+1; m<=2500/n; ++m)
{
int l = mulTable[m]-mulTable[n];
int a = 0;
int b = 0;
for (int k=1; true; ++k)
{
a += l;
b += m*n;
if (a>5000 || b>2500) break;
#ifdef _UNIQUE
if (!unique[5001*a + 2*b]) {
unique[5001*a + 2*b] = unique[10002*b + a] = true;
++solutions;
}
#else
++solutions;
#endif
#ifdef _OUTPUTENABLE
cout<<k*l;
cout<<' ';
cout<<k*m*n;
cout<<' ';
cout<<k*(mulTable[m]+mulTable[n]);
cout<<endl;
#endif
}
}
}
#ifdef _UNIQUE
delete[] unique;
#endif
}
int iEnd=GetTickCount();
double dEnd=(double) iEnd;
double dStart=(double) iStart;
double dTotalTick=((dEnd-dStart)/(double) 1000);
cout<<"iCyborgEuclidSol ";
cout<<dTotalTick;
cout<<" seconds, solutions found: "<< solutions << endl;
return true;
}
The Win32 API isn't STRICTLY OO... it isn't strictly procedural either. It is like this weird hybrid. Part of the reason for that is the fact that when it was initially programmed they didn't want to alienate their C programmers. Though the trend is going more and more towards an OO model.
That is exactly what my solution does - that's Euclid's formula, there's a reason why I'm the only one who named his loop variables m,n,k instead of i,jHAY GUYZ!
Wow this thread is fucking epic. I don't care about Windows in ASM, but generating pythagorean triples quickly does pique my curiosity! (This is how we know I'm an academic...) Anyway, I don't have time to implement this right now, but I'm pretty sure I have a faster algorithm than anything proposed so far. (Not to mention that I'm too much of a noob to hand-code anything in ASM, much less SSE in ASM. I can barely do SSE with intrinsics... typically rely on ICC to save me.)
Basically, looping a & b is wasteful. If I generate (3,4,5) then I know many more triples... like (6,8,10), (9,12,15), etc. So why not only generate "primitive" triples (triples that are not [integer >1] multiples of other triples) and then scale them up?
How do you generate all primitive triples? That's easy; we'll do it by construction. Take two positive integers, (m,n):
(m^2 - n^2)^2 + (2*m*n)^2 = (m^2+n^2)^2
Then the triple is: (m^2-n^2, 2*m*n, m^2+n^2).
So the idea is to generate a primitive triple. Then write down all the multiples of that triple that are within the size-constraint.
This thread is priceless!
I challenge the OP to code an object oriented system using assembly.
I'm not saying it's impossible, but the difficulty would be orders of magnitude higher than coding in, for example, c++
It has been done many times in the past. The reason it isn't done on the pc now is it isn't required. I use mainly embedded processors and there it is required quite a bit because of space concerns. Eventually even that will become high level languages. Some of it already is in the C++, Java realm but the majority is still ASM.
From 1996 , GEOS, written in ASM and object oriented.
http://www.pencomputing.com/developer/geos_introduction.html
Yeah, that's true for the check, but for hash table I'm not sure, because lookup would still include pretty much the same function as the index into array computation.Edit: NUTS, iCyborg already got it. I didn't read the whole thread before I started thinking about how to design a better algorithm. Apparently it's been known to mankind for a few thousand years! Haha, well played Euclid.
Edit2: Didn't consider uniqueness, whoops. Seems like a hash table would get the job done better than a gigantic array. Also I believe you only need to check the uniqueness of triples generated with the base (m,n). No need to check the multiples, because this algorithm only generates primitive triples. And it would be impossible to have two primitive triples (a,b,c) and (x,y,z) such that (Na,Nb,Nc) = (Mx,My,Mz), as this would imply one triple is non-primitive. I think anyway.
