How much electricity would be saved worldwide if Windows were written in Assembly?


Any_Name_Does

Member
Jul 13, 2010
143
0
0
You lost the argument. /thread.

Not quite yet.

To Schmide:

To make it fair, would you be willing to code that without those XMM registers?

I mean, defeating a kung fu fighter with a machine gun isn't a big deal, is it?
 
Last edited:

Ben90

Platinum Member
Jun 14, 2009
2,866
3
0
I don't think you guys see the bigger picture; ASM can do so much more than that crappy C code.

My crappy C-compiled Windows install uses ~500 watts at full load. Using the savings from ASM, I calculate my computer will use:
500 W * 177 = 88.5 kW

Now, this is a lot of heat. By placing a geothermal power plant over my computer at 80% efficiency, I generate ~70.8 kW of pure power.

I then resell this to the grid at a rate of $0.12/kWh for a profit of ~$8.50 an hour, 24/7.

I now make ~$74k a year by doing nothing. Thank you ASM, I can now live out my dream of being a freelance lumberjack.
 

Schmide

Diamond Member
Mar 7, 2002
5,747
1,039
126
Not bad at all. Your code is more complicated than mine; one would expect it the other way around when comparing a high-level language to a low-level one. So you are using a set of more appropriate registers. The results you are getting are chaotic, but that's OK because you haven't had time to take care of the small details. I knew my function was bottlenecked by the mul instruction, having to use the result immediately when it is not ready yet, but there was nothing I could do about it. All in all you win; your algorithm is better. But how did you get that 177x? On my computer your code does it in almost 4 seconds without outputting a file; mine takes 13 seconds, with output to a file. You win anyway, but how did you get that 177x?

Oops, I typed it wrong! Fixed above. It was 117x. It could depend on your hardware. The first test was on an Athlon II 620 (Propus) at 2.6GHz. I just ran it on my Q9550 at 2.83GHz and I got 0.093 seconds and 14.015 seconds. So for that run it's 150x.

Your bottleneck is the way you approach the final c value. It takes many more cycles the way you do it than to calculate the sqrt and check whether that sqrt is an int.

Ok I'll make a simple solution... but you won't be happy.
 
Last edited:

Schmide

Diamond Member
Mar 7, 2002
5,747
1,039
126
Not quite yet.

To Schmide:

To make it fair, would you be willing to code that without those XMM registers?

I mean, defeating a kung fu fighter with a machine gun isn't a big deal, is it?

Ok I created a simple solution and put in a couple of extra assignments just to make sure the original code wasn't getting optimized out in the release build. If you don't use a calculation, the compiler will sometimes remove the calculation altogether. It did increase the time by a few hundredths of a second: 0.12 vs 0.17. The SSE code is about twice as fast as the simple code.
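
A generic way to keep a benchmarked result live is a volatile sink; this is a sketch of the trick with illustrative names (g_sink, RunBenchmark), not necessarily the exact assignments used here:

Code:
bool SimpleSolution(); // forward declaration of the routine being timed

volatile bool g_sink;  // a volatile store cannot be optimized away

void RunBenchmark()
{
	g_sink = SimpleSolution(); // forces the compiler to keep the computation
}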

The results

SchmideSSE 0.171 seconds
simplesol 0.281 seconds
AnyASM 14.695 seconds
Hit any key to continue (Where's the ANY key?)


Code:
bool SimpleSolution()
{
	bool solution;
	int iStart=GetTickCount();
	for(int i=1;i<5001;i++) 
	{
		for(int j=i;j<5001;j++)
		{
			solution=false;
			double a=(double)i, b=(double)j, c, d;
			a*=a;
			b*=b;
			c=a+b;
			d=sqrt(c);
			int iD=(int)d;
			if( ((double)iD) == d)
			{
				solution=true; // without this it would deadcode the solution.
#ifdef _OUTPUTENABLE
				cout<<i;
				cout<<' ';
				cout<<j;
				cout<<' ';
				cout<<iD;
				cout<<endl;
#endif
			}
		}
	}
	int iEnd=GetTickCount();
	double dEnd=(double) iEnd;
	double dStart=(double) iStart;
	double dTotalTick=((dEnd-dStart)/(double) 1000);
	cout<<"simplesol ";
	cout<<dTotalTick;
	cout<<" seconds"<<endl;
	return solution;
}
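
Side note: at these magnitudes every i*i+j*j fits exactly in a double, so the int/double compare above is safe. A variant that avoids floating-point equality entirely could look like this sketch (IsPerfectSquare is an illustrative helper, not part of the program above):

Code:
#include <cmath>

// Verify c is a perfect square using integer math only; sqrt is just a guess.
bool IsPerfectSquare(long long c)
{
	long long r = (long long)std::sqrt((double)c);
	while (r > 0 && r * r > c) --r;     // correct any rounding error downward
	while ((r + 1) * (r + 1) <= c) ++r; // ...or upward
	return r * r == c;
}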

Updated SSE2 code

Code:
bool Pythsse()
{
	bool bSolution=false;
	int iStart=GetTickCount();
	for(int i=1;i<5001;i+=2) 
	{
		sSimdScalar a;
		a.m_ints[0]=i; // load
		a.m_ints[1]=i+1;
		a.m_vec128d=_mm_cvtpi32_pd(a.m_vec64[0]); // convert double
		a.m_vec128d=_mm_mul_pd(a.m_vec128d,a.m_vec128d);

		for(int j=i;j<5001;j+=2)
		{
			sSimdScalar b,c,d,e;
			sSimdScalarInt bInt, cInt;
			b.m_ints[0]=j; // load 
			b.m_ints[1]=j+1;
			c.m_ints[0]=j+1;
			c.m_ints[1]=j+2;
			b.m_vec128d=_mm_cvtpi32_pd(b.m_vec64[0]); // convert double
			b.m_vec128d=_mm_mul_pd(b.m_vec128d,b.m_vec128d); // square
			c.m_vec128d=_mm_cvtpi32_pd(c.m_vec64[0]);
			c.m_vec128d=_mm_mul_pd(c.m_vec128d,c.m_vec128d); 
			b.m_vec128d=_mm_add_pd(a.m_vec128d,b.m_vec128d); // add 
			c.m_vec128d=_mm_add_pd(a.m_vec128d,c.m_vec128d);
			d.m_vec128d=_mm_sqrt_pd(b.m_vec128d); // sqrt
			e.m_vec128d=_mm_sqrt_pd(c.m_vec128d);
			b.m_vec64[0]=_mm_cvtpd_pi32(d.m_vec128d);  // convert int
			c.m_vec64[0]=_mm_cvtpd_pi32(e.m_vec128d);
			bInt.m_vec64=b.m_vec64[0];
			cInt.m_vec64=c.m_vec64[0];
			b.m_vec128d=_mm_cvtpi32_pd(b.m_vec64[0]); // convert back double 
			c.m_vec128d=_mm_cvtpi32_pd(c.m_vec64[0]); 
			d.m_vec128d=_mm_cmpeq_pd(b.m_vec128d, d.m_vec128d); // compare int to double
			e.m_vec128d=_mm_cmpeq_pd(c.m_vec128d, e.m_vec128d);
			for(int k=0;k<2;k++)
			{
				if(d.m_ints[k<<1])
				{
					bSolution=true;
#ifdef _OUTPUTENABLE
					cout<<((int)(i+k));
					cout<<' ';
					cout<<((int)(j+k));
					cout<<' ';
					cout<<bInt.m_ints[k];
					cout<<endl;
#endif
				}
				if(e.m_ints[k<<1])
				{
					bSolution=true;
#ifdef _OUTPUTENABLE
					cout<<((int)(i+k));
					cout<<' ';
					cout<<((int)(j+k+1));
					cout<<' ';
					cout<<cInt.m_ints[k];
					cout<<endl;
#endif
				}
			} 
		} 
	}
	_mm_empty();
	int iEnd=GetTickCount();
	double dEnd=(double) iEnd;
	double dStart=(double) iStart;
	double dTotalTick=((dEnd-dStart)/(double) 1000);
	cout<<"SchmideSSE ";
	cout<<dTotalTick;
	cout<<" seconds"<<endl;
	return bSolution;
}
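
The sSimdScalar and sSimdScalarInt unions aren't shown in the thread; a plausible definition, assuming MSVC-era MMX/SSE2 intrinsic types, would be:

Code:
#include <emmintrin.h> // __m128d, __m64 and the SSE2 intrinsics

// One 128-bit value viewable as packed doubles, MMX halves, or int32 lanes.
union sSimdScalar
{
	__m128d m_vec128d;  // two packed doubles
	__m64   m_vec64[2]; // two 64-bit MMX halves
	int     m_ints[4];  // four 32-bit lanes
};

// One 64-bit MMX value viewable as two int32 lanes.
union sSimdScalarInt
{
	__m64 m_vec64;
	int   m_ints[2];
};

This data layout is what lets the routine move between integer loads, MMX conversions, and packed-double math without explicit shuffles.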
 
Last edited:

Any_Name_Does

Member
Jul 13, 2010
143
0
0
Ok I created a simple solution and put in a couple of extra assignments just to make sure the original code wasn't getting optimized out in the release build. If you don't use a calculation, the compiler will sometimes remove the calculation altogether. It did increase the time by a few hundredths of a second: 0.12 vs 0.17. The SSE code is about twice as fast as the simple code.

OK, your code destroys mine in a D: manner. But I still feel like the winner, because I learnt how powerful those XMM registers are; I never worked with them before. Time for me to move on to a high-level language and see if I can optimize a few things there :)
Anyway, I had fun with this thread. Thanks, guys :awe:
By the way, any recommendations for a good compiler with good documentation?
 

jchu14

Senior member
Jul 5, 2001
613
0
0
Thanks Schmide for adding educational value to an entertaining thread :)

Assembly optimization may not be so practical in a commercial/development environment, but it does have its place in high-performance scientific computing. Kazushige Goto has a wiki page for his contribution in hand-tuning a (parallel-processing-friendly) linear algebra library. My parallel computing professor says Goto is the only guy he's met who can think and write in assembly as naturally as typical programmers can think in pseudo-code.

Well-done parallelization in C and Fortran is already tough; I can't even begin to imagine doing parallelization in assembly. :eek:
 

EarthwormJim

Diamond Member
Oct 15, 2003
3,239
0
76
Thanks Schmide for adding educational value to an entertaining thread :)

Assembly optimization may not be so practical in a commercial/development environment, but it does have its place in high-performance scientific computing. Kazushige Goto has a wiki page for his contribution in hand-tuning a (parallel-processing-friendly) linear algebra library. My parallel computing professor says Goto is the only guy he's met who can think and write in assembly as naturally as typical programmers can think in pseudo-code.

Well-done parallelization in C and Fortran is already tough; I can't even begin to imagine doing parallelization in assembly. :eek:

That's funny; one of the best ASM programmers in the world works for Microsoft :D
 

Scali

Banned
Dec 3, 2004
2,495
1
0
My parallel computing professor says Goto is the only guy he's met that can really think and write in assembly as natural as typical programmers can think in pseudo-code.

With a name like Goto, is that any surprise? :)
 

Any_Name_Does

Member
Jul 13, 2010
143
0
0
Ladies and gentlemen, let me congratulate myself as the winner of this thread. :eek:

I started coding a few years ago in ASM, wrote a couple of small apps, and lost my enthusiasm fairly quickly: writing ASM code was pretty mentally consuming, and looking at my own code a day later I wouldn't understand what was going on. So I needed a good reason to start learning something new. That is why I came here challenging everyone: to get that reason. And now I have it. :twisted:

Once more, thanks Schmide (and others) for taking the time. ():)
 

Cogman

Lifer
Sep 19, 2000
10,286
147
106
Not bad at all. Your code is more complicated than mine; one would expect it the other way around when comparing a high-level language to a low-level one. So you are using a set of more appropriate registers. The results you are getting are chaotic, but that's OK because you haven't had time to take care of the small details. I knew my function was bottlenecked by the mul instruction, having to use the result immediately when it is not ready yet, but there was nothing I could do about it. All in all you win; your algorithm is better. But how did you get that 177x? On my computer your code does it in almost 4 seconds without outputting a file; mine takes 13 seconds, with output to a file. You win anyway, but how did you get that 177x?

You seem to be falling under the illusion that smaller code is faster code. It simply isn't.

While I think Schmide sort of cheated by using intrinsics (it defeats the purpose of a compiler code-generation/optimization comparison; the compiler should be able to tell when to use the SSE instruction set), he does prove that poorly written assembly will be slower than compiler-optimized code. Properly created hand-written assembly usually shines in the area of register usage. Most compilers really sort of suck at deciding when to use, say, edx vs. putting the value into memory (gcc does this), or at just plain using SSE instructions.

x264 is a good example of a complex modern program that relies heavily on ASM. While a good portion of it is C, larger and larger portions of it are being rewritten in asm (by some real assembly pros).
 

Voo

Golden Member
Feb 27, 2009
1,684
0
76
While I think Schmide sort of cheated by using intrinsics (it defeats the purpose of a compiler code-generation/optimization comparison; the compiler should be able to tell when to use the SSE instruction set), he does prove that poorly written assembly will be slower than compiler-optimized code. Properly created hand-written assembly usually shines in the area of register usage. Most compilers really sort of suck at deciding when to use, say, edx vs. putting the value into memory (gcc does this), or at just plain using SSE instructions.
Yeah, manual register allocation is usually rather useful. But I don't think he cheated with the SSE stuff, because that's really the only way to get good performance out of those instructions - at least I haven't seen much code that profits from them without it, and IMHO it's still far easier to read than pure asm code.
And I assume they rewrite stuff like the MV search and the decoding routines in asm?

But even his standard C version is still faster, and that's code that anyone would write to solve the problem. (I assume the compiler hoists the a^2 computation into the outer loop, but that's a safe bet.)
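
The hoist assumed here would look like this sketch: i*i computed once per outer iteration instead of once per pair:

Code:
for (int i = 1; i < 5001; i++)
{
	const double a = (double)i * (double)i; // hoisted out of the inner loop
	for (int j = i; j < 5001; j++)
	{
		double b = (double)j;
		double d = sqrt(a + b * b);
		// ... integer check as in SimpleSolution ...
	}
}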
 

Schmide

Diamond Member
Mar 7, 2002
5,747
1,039
126
You seem to be falling under the illusion that smaller code is faster code. It simply isn't.

While I think Schmide sort of cheated by using intrinsics (it defeats the purpose of a compiler code-generation/optimization comparison; the compiler should be able to tell when to use the SSE instruction set).

Cheated??? It is the epitome of giving the compiler enough information to complete the required task. Yes, it's basically assembly, but the key lies in the data organization of the structure.

The compiler, even Intel's magic vectorizing one, will never truly extract packed parallel SSE from code (yet?). There really just isn't enough information in regular code to decipher which independent variables could be packed; thus the need for intrinsics.

He does prove that poorly written assembly will be slower than compiler-optimized code.

I proved nothing of the sort. Actually, his assembly was well optimized; I don't think anyone could extract any extra cycles out of it. It was the algorithm that was lacking.

Properly created hand-written assembly usually shines in the area of register usage. Most compilers really sort of suck at deciding when to use, say, edx vs. putting the value into memory (gcc does this), or at just plain using SSE instructions.

This is very true, especially with regard to SSE/x87 code.

For example, here is the compiler's code for the simple solution.

Code:
			double a=(double)i, b=(double)j, c, d;
00FA126E  fild        dword ptr [esp+18h] 
00FA1272  fmul        st(0),st 
00FA1274  fstp        qword ptr [esp+20h] 
00FA1278  fild        dword ptr [esp+14h] 
00FA127C  xor         bl,bl 
			a*=a;
			b*=b;
00FA127E  fmul        st(0),st 
			c=a+b;
00FA1280  fadd        qword ptr [esp+20h] 
			d=sqrt(c);
00FA1284  call        _CIsqrt (0FA1F1Ch) 
			int iD=(int)d;
			if( ((double)iD) == d)
00FA1289  fld         st(0) 
00FA128B  call        _ftol2_sse (0FA1E70h) 
00FA1290  mov         dword ptr [esp+18h],eax 
00FA1294  fild        dword ptr [esp+18h] 
00FA1298  fucompp          
00FA129A  fnstsw      ax   
00FA129C  test        ah,44h

The optimal x87 code would be something like this.

Code:
	fld         dI2
	fild        j
	fmul		st(0),st 
	fadd		st(0),st(1)
	fsqrt
	fldz    
	fadd		st(0),st(1)
	frndint
	fucom   
	fnstsw      ax   
	test        ah,44h
	je nosol
	mov solution,1
	nosol:
	ffree		st
	ffree		st(1)
	ffree		st(2)

But there is no way for the compiler to know I want to round, compare and forget. This would probably be 10% faster. You're basically hitting a couple of L1 3-cycle penalties and a couple of extra FP ops.
 
Last edited:

Schmide

Diamond Member
Mar 7, 2002
5,747
1,039
126
Ladies and gentlemen, let me congratulate myself as the winner of this thread. :eek:

I started coding a few years ago in ASM, wrote a couple of small apps, and lost my enthusiasm fairly quickly: writing ASM code was pretty mentally consuming, and looking at my own code a day later I wouldn't understand what was going on. So I needed a good reason to start learning something new. That is why I came here challenging everyone: to get that reason. And now I have it. :twisted:

Once more, thanks Schmide (and others) for taking the time. ():)

Here's my advice: take a C++ class. This will give you a twofold advantage: first, you will be a registered student and get access to free/discounted compilers and tools; second, it will give you some perspective on what is important in the software design industry.

A software trick can only do so much when compared to a good structured design. With what you've shown here, a good data structures class/book could give you what you seem to be lacking in your coding skills. You seem to have the drive; not many people jump head-first into assembly.
 

jlee

Lifer
Sep 12, 2001
48,518
223
106
Again, I should emphasize that I know exactly what I am talking about. For example, the Windows string concatenation API takes an order of magnitude more clock cycles than the one I have written in ASM. It is just one example. Windows has grown so big that it needs to be optimized; otherwise it has no future.

Those who know what they are talking about do not typically have to tell everyone they know what they're talking about.

FYI. :p
 

Schmide

Diamond Member
Mar 7, 2002
5,747
1,039
126
Ha. We have a new winner.

I wrote the x87 simple-solution code into the system and it scored 0.125 seconds, besting the SSE2 intrinsic version by 0.05 seconds.

I'm really not sure if I want to do an optimized SSE2 version.
 

Voo

Golden Member
Feb 27, 2009
1,684
0
76
Ha. We have a new winner.

I wrote the x87 simple-solution code into the system and it scored 0.125 seconds, besting the SSE2 intrinsic version by 0.05 seconds.

I'm really not sure if I want to do an optimized SSE2 version.
You know, if you don't stop right now, I'll have to write a CUDA version and see how that fares (hey, I don't think I could hand-optimize the SSE2 code, so that's the only thing left) ;)


@Any_Name_Does: I think learning data structures/algorithms in a higher-level language is easier, but if you're not fixed on x86 asm and just like the real low-level stuff, then TAOCP (http://en.wikipedia.org/wiki/The_Art_of_Computer_Programming) may be the right thing for you. It's really one of the books every programmer worth his salt should have lying around: greatly written by one of the brightest guys in the field, and it includes lots of interesting examples (with solutions and difficulty ratings) for lots of important stuff.
Introduction to Algorithms by CLR (http://en.wikipedia.org/wiki/Introduction_to_Algorithms) is another classic that's definitely worth reading.
 
Last edited:

aigomorla

CPU, Cases&Cooling Mod PC Gaming Mod Elite Member
Super Moderator
Sep 28, 2005
21,087
3,598
126
I just realized something...

We're arguing about the efficiency of Windows here.
But we really don't need to, because although Windows may be hella efficient, the secondary software we run on top of Windows won't be.

And to wish everything were efficient means I'm an iPhone-hugging Prius worshiper... <don't mean any offense to Prius and iPhone owners>
So I guess we're asking for an impossible solution.

A more universal and easier fix would be just to make the CPUs faster.
 

Schmide

Diamond Member
Mar 7, 2002
5,747
1,039
126
You know, if you don't stop right now, I'll have to write a CUDA version and see how that fares (hey, I don't think I could hand-optimize the SSE2 code, so that's the only thing left) ;)

You made a deal with the devil???

Code:
SchmideSSE 0.171 seconds
simplesol 0.297 seconds
simplesolASM 0.125 seconds
SchmideSSEASM 0.078 seconds
Hit any key to continue (Where's the ANY key?)

Edit: I will add that I think I'm near the limit of the timer's resolution, which seems to be about 0.0125 seconds.

So the SSE2 asm is about
40% faster than the x87 asm
2.2x faster than the C++ intrinsics
3.8x faster than simple C++

Core of the optimized SSE2 routine:

Code:
_asm {
	mov			eax, j
	movd		mm0, eax          ; mm0 = j
	inc			eax
	movd		mm1, eax          ; mm1 = j+1
	movq		mm2, mm0
	punpckldq   mm2, mm0          ; duplicate j into both dwords of mm2
	punpckldq   mm0, mm1          ; mm0 = {j, j+1}
	paddd		mm2, mm0          ; packed add forms the second pair (cf. c[] in the intrinsic version)
	cvtpi2pd    xmm0, mm0         ; packed int32 -> packed double
	cvtpi2pd    xmm1, mm2 
	movapd		xmm2, a.m_vec128d ; precomputed i^2 terms
	mulpd		xmm0, xmm0        ; square
	mulpd		xmm1, xmm1
	addpd		xmm0, xmm2        ; a^2 + b^2
	addpd		xmm1, xmm2
	sqrtpd		xmm0, xmm0        ; sqrt(a^2 + b^2)
	sqrtpd		xmm1, xmm1
	movaps		xmm2, xmm0        ; keep the unrounded roots for the compare
	movaps		xmm3, xmm1
	cvtpd2pi    mm0,xmm0          ; round to packed int32
	movq		bInt.m_vec64, mm0 ; save the integer candidates
	cvtpd2pi    mm1,xmm1 
	movq		cInt.m_vec64, mm1
	cvtpi2pd    xmm0, mm0         ; convert back to double
	cvtpi2pd    xmm1, mm1 
	cmpeqpd     xmm0,xmm2         ; all-ones lane where the root is an exact integer
	movapd      b.m_vec128d,xmm0  ; write the compare masks back for the C++ side
	cmpeqpd     xmm1,xmm3 
	movapd      c.m_vec128d,xmm1 
}
 
Last edited:

Cogman

Lifer
Sep 19, 2000
10,286
147
106
Use QueryPerformanceCounter (Windows) if you are near the resolution of your timer.

GetTickCount has a resolution of ~15ms. QueryPerformanceCounter often uses the CPU frequency for its resolution. Of course, since we are working with assembly, you could use RDTSC as well.
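
A minimal QueryPerformanceCounter harness, as a sketch (TimeIt and Work are illustrative names; error checks omitted):

Code:
#include <windows.h>

// Returns elapsed seconds for one call to Work().
double TimeIt(void (*Work)())
{
	LARGE_INTEGER freq, t0, t1;
	QueryPerformanceFrequency(&freq); // ticks per second
	QueryPerformanceCounter(&t0);
	Work();
	QueryPerformanceCounter(&t1);
	return (double)(t1.QuadPart - t0.QuadPart) / (double)freq.QuadPart;
}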
 

Munky

Diamond Member
Feb 5, 2005
9,372
0
76
Yeah, manual register allocation is usually rather useful. But I don't think he cheated with the SSE stuff, because that's really the only way to get good performance out of those instructions - at least I haven't seen much code that profits from them without it, and IMHO it's still far easier to read than pure asm code.
And I assume they rewrite stuff like the MV search and the decoding routines in asm?

But even his standard C version is still faster, and that's code that anyone would write to solve the problem. (I assume the compiler hoists the a^2 computation into the outer loop, but that's a safe bet.)

I tested out the code. Both the SSE and the simple C versions ran in about 0.1-0.2 seconds. The ASM version took about 10 seconds.
 

Schmide

Diamond Member
Mar 7, 2002
5,747
1,039
126
Ok I ran it on my Q9550, and although the scores do fluctuate a bit, it was quite a bit faster, especially with the standard code. I think I've reached the limits of the timer, so I made each routine run 10 times to shift the resolution; all times are 10x, i.e. just a shift of the decimal point. Results:

Q9550 @ 2.83GHz

Code:
SchmideSSE 1.328 seconds
simplesol 1.64 seconds
simplesolASM 1.344 seconds
SchmideSSEASM 0.453 seconds
Hit any key to continue (Where's the ANY key?)

Athlon II 620 @ 2.6GHz

Code:
SchmideSSE 1.685 seconds
simplesol 2.87 seconds
simplesolASM 1.342 seconds
SchmideSSEASM 0.827 seconds
Hit any key to continue (Where's the ANY key?)

WTH is up with the Athlon II and the straight-up x87 code?

If someone with an i7 would run this, I'd like to see the results.

EXE pythsse2.zip

SRC and VCproj pythsse2src.zip
 
Last edited: