How much electricity would be saved worldwide if Windows was writen in Assembly?

Any_Name_Does · Jul 16, 2010

jchu14 said:
So you want to find all of the Pythagorean triples such that a and b < 10,000? Is that right? and out put the results to a text file.

for the text file would for example, look like this:

3 4 5
4 3 5
5 12 13
6 8 10
... etc.

Then I don't see how aphorism's program does that?

that is why I asked him to output it to a text file as I am c illiterate .

Any_Name_Does · Jul 16, 2010

EarthwormJim said:
Here's the way I think of it. The OS is not the program anyone really wants to use. All the OS does is enable people to use the programs that they want to actually use. Who really just uses windows.

Ideally an OS is transparent to hardware, it doesn't suck any cycles. That's not possible so what we want is as efficient as possible of an OS, that doesn't get in the way of launching whatever program we want to use (crash or be incompatible). Since the OS is running 100% of the time, any increase in efficiency will add up over time. 1ms saved every minute can add up.

At the same time, an OS needs to not get in the way of letting us do what we want with the computer. That's where the difficulty in using ASM comes in. Difficult to trouble-shoot, difficult to implement for large amounts of code, difficult to upgrade, difficult to use in the role that OS's play. The OS also needs to be compatible with whatever a user is expected to throw at it software or hardware wise.

You cannot measure efficiency in only power usage. We don't use computers just to see how much power they draw or could draw. We actually do work with them! What makes windows relatively easy to use and highly compatible is the fact that most of it was written in high level (or in C's case middle level?).

Wherever it is feasible to use asm in the OS, and the gains are substantial enough, I cannot fathom that Microsoft or Apple or Linux developers wouldn't use asm. So it seems like the majority of the work that any_name is proposing is already being done or has been done. You can't write an OS like Windows completely in asm, but the parts where gains can be had, asm is used.

Conversely lets be realistic, an OS written in mostly in asm will not be as reliable as an OS written in higher level languages (you cannot debug to the same degree with asm). Think about the wasted energy from users restarting their computers, restarting their programs and dealing with crashes.

Any_name_does are you a Compsci major or did you learn assembly on your own (oh God why learn asm on your own!!!)? You had mentioned you don't know C, yet you know assembly? I have yet to meet an EE or CE or Compsci Major who didn't know C.

1+

the reason I concluded that windows is not optimized was the benchmarks I did on some kernel APIs.

power saving is something that comes consequently with efficiency. And efficiency is never good enough. there are going to come more and more gadgets and toys like these smart phone, which require more code efficiency.
Yes I learnt it on my own, and it is the proper language for me, as I am an optimizer by nature.

Any_Name_Does · Jul 16, 2010

aphorism said:

cycles arent cheap for all applications. i really dont know about you but i need more compute power. hardware companies would not bother making faster hardware if it wasnt needed. they wouldnt spend time on things like avx or sse either.

i can give you more info about the source code if i disassemble it. i added a integer sqrt function. it's not the same code i posted.

do you want the .exe?

Code:

; Listing generated by Microsoft (R) Optimizing Compiler Version 16.00.30319.01 

    TITLE    c:\Users\aphorism\documents\visual studio 2010\Projects\....
    .686P
    .XMM
    include listing.inc
    .model    flat

INCLUDELIB OLDNAMES

PUBLIC    ?isqrt@@YAHH@Z                    ; isqrt
PUBLIC    __real@3fc00000
PUBLIC    __real@3f000000
EXTRN    __fltused:DWORD
;    COMDAT __real@3fc00000
; File c:\users\aphorism\documents\visual studio 2010\projects\....
CONST    SEGMENT
__real@3fc00000 DD 03fc00000r            ; 1.5
CONST    ENDS
;    COMDAT __real@3f000000
CONST    SEGMENT
__real@3f000000 DD 03f000000r            ; 0.5
; Function compile flags: /Ogtp
CONST    ENDS
;    COMDAT ?isqrt@@YAHH@Z
_TEXT    SEGMENT
_x$ = -4                        ; size = 4
_rr$ = -4                        ; size = 4
?isqrt@@YAHH@Z PROC                    ; isqrt, COMDAT
; _r$ = edx
; Line 20
    push    ebp
    mov    ebp, esp
    push    ecx
; Line 25
    movss    xmm5, DWORD PTR __real@3f000000
; Line 28
    movss    xmm4, DWORD PTR __real@3fc00000
    xorps    xmm3, xmm3
    cvtsi2ss xmm3, edx
    movss    DWORD PTR _rr$[ebp], xmm3
    mov    eax, -1100021760            ; be6f0000H
    sub    eax, DWORD PTR _rr$[ebp]
    movaps    xmm2, xmm3
    shr    eax, 1
    mov    DWORD PTR _x$[ebp], eax
    movss    xmm1, DWORD PTR _x$[ebp]
    movaps    xmm6, xmm1
    movaps    xmm0, xmm1
    mulss    xmm6, xmm1
    mulss    xmm2, xmm5
    mulss    xmm1, xmm2
    mulss    xmm0, xmm4
    mulss    xmm6, xmm1
    subss    xmm0, xmm6
; Line 29
    cmp    edx, 101123                ; 00018b03H
    jle    SHORT $LN1@isqrt
    movaps    xmm1, xmm0
    mulss    xmm1, xmm4
    movaps    xmm4, xmm0
    mulss    xmm4, xmm0
    mulss    xmm0, xmm2
    mulss    xmm4, xmm0
    subss    xmm1, xmm4
    movaps    xmm0, xmm1
$LN1@isqrt:
; Line 30
    mulss    xmm0, xmm3
    addss    xmm0, xmm5
    cvttss2si ecx, xmm0
; Line 31
    mov    eax, 1
    sub    eax, ecx
    imul    eax, ecx
    add    eax, edx
    sar    eax, 31                    ; 0000001fH
; Line 32
    mov    esp, ebp
    pop    ebp
    ret    0
?isqrt@@YAHH@Z ENDP                    ; isqrt
_TEXT    ENDS
PUBLIC    _main
; Function compile flags: /Ogtp
;    COMDAT _main
_TEXT    SEGMENT
_main    PROC                        ; COMDAT
; Line 17
    xor    eax, eax
; Line 18
    ret    0
_main    ENDP
_TEXT    ENDS
END

Is this your disassembled code or are you using inline assembly?

well it looks like inline assembly. it only confirms my point that any language other than assembly is insufficient/efficient when it gets tough.
by the way, my code is much faster, as it only uses general purpose registers.

Voo · Jul 16, 2010

Any_Name_Does said:
Is this your disassembled code or are you using inline assembly?

well it looks like inline assembly. it only confirms my point that any language other than assembly is insufficient/efficient when it gets tough.
by the way, my code is much faster, as it only uses general purpose registers.

Post your code but for a and b up to, let's say 45k (only the generating part, e.g. no printing/writing to files, write them to an array so we can also check if it's correct). Then just post the code (I hope you respected some calling conventions, otherwise you'll have to fix that I fear) and we'll measure how much faster your version is compared to my brute force approach.

mv2devnull · Jul 16, 2010

aphorism said:
i already did, in less than 10 lines. i didnt output to a .txt file b/c that's not necessary.

Code:

int main() { int c, time = GetTickCount(); for(int i=1; i<10000; i++)c = sqrt(i * i + i * i); time -= GetTickCount(); cout << time; return 0; }

What you do there is computing 9999 values (for i = 1..9999).
And what you compute is sqrt( 2*i² ).

But that is not really what the task was. If I'm not mistaken, a brute solution would look like:

Code:

int c = sum = 0;
 for ( int a=1; a < 10000; ++a ) {
  int a2 = a*a;
  for ( int b=a; b < 10000; ++b ) {
    sum = a2 + b*b;
    c = sqrt( sum );
    if ( c*c == sum ) printf( "%d %d %d\n", a, b, c );
  }
}

But I have to agree with others; the only thing that could relate to CPU's in this discussion would be, if code implementation with instruction set of one CPU model would significantly save cycles compared to some other CPU model. For example, encryption routines with 980X compared to i486.

Any_Name_Does · Jul 16, 2010

ok. it goes up to 5000 for a and b

; a^2 + b^2 = c^2 = d

pushad
xor esi,esi

a1:
inc esi
xor ebp,ebp
mov eax,esi
mul esi
cmp esi,5001
je finish
mov edi,eax

a2:
inc ebp
cmp ebp,5001
je a1
mov eax,ebp
mul ebp
mov ecx,ebp
cmp esi,ebp
jbe a_
mov ecx,esi
a_:
lea ebx,[eax+edi]

a3:
mov eax,ecx
mul ecx
inc ecx
cmp eax,ebx
jb a3
ja a2

; here we have a hit. with a b c and d in esi ebp ecx and eax respectively.

dec ecx

; do what you want. I call wsprintf and write to the file

jmp a2

finish:
popad
push eax
call ExitProcess

Voo · Jul 16, 2010

Ahm that's no function (okay probably just the extracted code, but even with my rusty asm knowledge there should be some kind of .type <name>, @function iirc)

Just change it so, that I can call it from C code (even if you don't know C, you surely know the different calling conventions) that gets an integer (max size of a and b), a pointer to an integer where we safe the number of found triples and returns let's say an array containing all c values to make it easy.

Any_Name_Does · Jul 16, 2010

Voo said:
Ahm that's no function.

Just change it so, that I can call it from C code (even if you don't know C, you surely know the different calling conventions) that gets an integer (max size of a and b), a pointer to an integer where we safe the number of found triples and returns let's say an array containing all c values to make it easy.

I am not using any pointers. that is my advantage. pointers are slow. besides you don't want me to do your homework do you?

try this with pure c and you come up with an .exe which is an order of magnitude slower than this code.

Voo · Jul 16, 2010

Any_Name_Does said:
I am not using any pointers. that is my advantage. pointers are slow. besides you don't want me to do your homework do you?

Your code right now is _USELESS_. You can't call it from any larger project because it's not a function, it writes to some file instead of returning the triples and even calls ExitProcess instead of returning.

You think it's slower to write numbers to memory then to write them in a file? And what if we want to use them afterwards? Write them to the file, read them from the file into memory and use them then? And guess what wsprintf does with your arguments (and where probably 90% of the functions time is spent)
If you can't change the code to do this small, trivial, but nontheless essential things, how would you write anything larger than 30 loc projects?
I'll be integrating your code in the C file so you're not doing my homework, you should be doing yours and write a useable function. That thing couldn't be sensibly used in any asm project either, so that's really not the problem here.

Right now we can't even measure how fast the function really is, since we've got the different startup times, the problems with IO and so on..

EarthwormJim · Jul 16, 2010

mv2devnull said:
What you do there is computing 9999 values (for i = 1..9999).
And what you compute is sqrt( 2*i² ).

I think it uses i = 2 and 10000 in the sqrt equation. First loop i is 1, compared with result true, then incremented to 2 THEN used in the equation.

Any_Name_Does · Jul 16, 2010

Voo said:
Your code right now is _USELESS_. You can't call it from any larger project because it's not a function, it writes to some file instead of returning the triples and even calls ExitProcess instead of returning.

You think it's slower to write numbers to memory then to write them in a file? And what if we want to use them afterwards? Write them to the file, read them from the file into memory and use them then? And guess what wsprintf does with your arguments (and where probably 90% of the functions time is spent)
If you can't change the code to do this small, trivial, but nontheless essential things, how would you write anything larger than 30 loc projects?
I'll be integrating your code in the C file so you're not doing my homework, you should be doing yours and write a useable function. That thing couldn't be sensibly used in any asm project either, so that's really not the problem here.

Right now we can't even measure how fast the function really is, since we've got the different startup times, the problems with IO and so on..

write your own code and compile it. we'll have our executables measured for speed. But it must work, and as proof of that you need to write to a file and display it.

Voo · Jul 16, 2010

Any_Name_Does said:
write your own code and compile it. we'll have our executables measured for speed. But it must work, and as proof of that you need to write to a file and display it.

God I start to doubt that you've got ANY idea what you're doing here. What's your biggest project you ever worked on?

I just give up if you can't even be able to write a small function in asm and understand what you're measuring it's just useless to continue that.. God you really don't understand that most of your "functions" time is spent in a C function and that the comparable C program needs some more initilisation (which every real program would need in every case and is a constant startup time factor which has NOTHING to do with the speed of the c function compared to the asm function), lots of time is waited for IO and that you couldn't use your kind of function in any larger project?
Also I'd love to see how you compared a non existing C function that you couldn't write/understand to your asm project - not that it matters since whatever you measured is just useless and no indicator for anything.

Come on, the asked function is trivial to write, even in assembly, shouldn't take much time and then we could compare it..

Sorry, but it seems you lack the experience for those kinds of things..

zephyrprime · Jul 16, 2010

Any_Name_Does said:
gettickcount is good for very small time measurement. for this code it would overlap itself several times.

gettickcount can time up to 49.7 days.

There's a higher performance counter instruction that returns real processor cycle counts as a 64bit number if you need real accuracy.

jvroig · Jul 16, 2010

@Voo: He has consistently disregarded any point about maintainability, scalability, or even just outright feasibility of implementing such projects or courses of action within a project. He has also admitted to not having done any big projects. His industry experience is also a bit fuzzy (if any at all), since he has also admitted to not knowing anything but assembly. Why do you expect him to understand now what you are trying to say? If he were able or willing to understand, he would have understood that point several pages before this one. Let him be and let's just let the thread die (he did also ask the mods to lock up the thread already before), if/when he gets exposure to the real world of business software development, he'll make the leap in understanding himself. Until then, he can believe in whatever thoughts or ideals he wants to, and until he starts to actively persuade people to do things his way, there is no harm done here by him and no need to try to make him change his mind.

Voo · Jul 16, 2010

Yeah you're right, I just like to argue about stuff like that - but since he seems to have no notion about calling conventions, functions or anything but most basic asm stuff, it seems kinda pointless to argue about that.

We can continue probably after he has worked on a multi million line project and had to maintain some stuff (will also help him learn stuff about performance measuring ~~).. though I fear that's the way TDWTFs arise.

@zephyrprime: Honestly for that stuff, clock() should give a good enough resolution anyways and is portable..

Any_Name_Does · Jul 16, 2010

Voo said:
Yeah you're right, I just like to argue about stuff like that - but since he seems to have no notion about calling conventions, functions or anything but most basic asm stuff, it seems kinda pointless to argue about that.

We can continue probably after he has worked on a multi million line project and had to maintain some stuff (will also help him learn stuff about performance measuring ~~).. though I fear that's the way TDWTFs arise.

@zephyrprime: Honestly for that stuff, clock() should give a good enough resolution anyways and is portable..

I know how to make a procedure out of this code. but it is not my business here. I am just saying that there is no way you can come close to my code with c. I doubt it you can do it at all without spending far more time on it than I did, which was the point of this thread. asm is not as sexy as c but it does thing no other language can. Why on earth do you want me to make this code callable from your c code? you have understood the task. and I am convinced that you are a far more advanced programmer than I, because you know c, I saw you know assembly, and I noticed you have an understanding of far more registers than I. I can only work with general purpose registers. I haven't had any need for more. Anyway you are MUCH better than I in programming and despite this, I am saying that my code is MUCH faster than whatever you come up with. Do you get me?
Someone said asm is for small projects. TRUE. But every API in windows is small project on it's own. and that is where optimizations are required.

Munky · Jul 16, 2010

Any_Name_Does said:
I know how to make a procedure out of this code. but it is not my business here. I am just saying that there is no way you can come close to my code with c. I doubt it you can do it at all without spending far more time on it than I did, which was the point of this thread. asm is not as sexy as c but it does thing no other language can. Why on earth do you want me to make this code callable from your c code? you have understood the task. and I am convinced that you are a far more advanced programmer than I, because you know c, I saw you know assembly, and I noticed you have an understanding of far more registers than I. I can only work with general purpose registers. I haven't had any need for more. Anyway you are MUCH better than I in programming and despite this, I am saying that my code is MUCH faster than whatever you come up with. Do you get me?
Someone said asm is for small projects. TRUE. But every API in windows is small project on it's own. and that is where optimizations are required.

I disagree. Your code may "fast" by purely academic standards, but it does not take advantage of instruction-level parallelism or multithreading. A well-written C program running on a modern cpu would do better.

Any_Name_Does · Jul 16, 2010

Munky said:
I disagree. Your code may "fast" by purely academic standards, but it does not take advantage of instruction-level parallelism or multithreading. A well-written C program running on a modern cpu would do better.

You must be joking. are you saying you can't do multithreading in asm?
Give up guys. Not a chance.
You could however, optimize the windows multithreading API with asm!

Munky · Jul 16, 2010

Any_Name_Does said:
You must be joking. are you saying you can't do multithreading in asm?
Give up guys. Not a chance.
You could however, optimize the windows multithreading API with asm!

You could.. but your code does not do multithreading.

Any_Name_Does · Jul 16, 2010

Munky said:
You could.. but your code does not do multithreading.

you didn't quite get it?
drop that gun, put a smile on your face and you will.

Munky · Jul 16, 2010

Any_Name_Does said:
you didn't quite get it?
drop that gun, put a smile on your face and you will.

Looks to me like you don't get it. A wise man once said "In theory, there's no difference between theory and practice. In practice there is."

Schmide · Jul 16, 2010

Ok I did a quick implementation using cpp intrensics and compared it to your function. It's compiled with VS 2008. My routine will output the triples if you add the commented out the #define _OUTPUTENABLE. I didn't check every triple as I just threw this together, but I ended up besting your ASM by 117x. However that's mostly because I used a better algorithm. But this goes to what you can do quickly with a higher level language.

Edit: I was skipping cases where a=b. Fixed. However the execution time did not change it was a very small subset of the total execution time.

BTW: I thought your routine was in a infinite loop and set a break point int it a few times because it took so long. He He

The results

SchmideSSE 0.125 seconds
AnyASM 14.695 seconds

The code

Code:

// pythsse2.cpp : Defines the entry point for the console application.
//

#include "stdafx.h"
#include <windows.h>
#include <iostream>
#include <fstream>
#include <mmintrin.h>
#include <fvec.h>
#include <dvec.h>
#include <cmath>
#include <conio.h>

using namespace std;

//#define _OUTPUTENABLE

struct	sSimdScalar
{
	union
	{
		__m128d		m_vec128d;
		__m128		m_vec128;
		__m64		m_vec64[2];
		double		m_doubles[2];
		float		m_floats[4];
		int			m_ints[4];
	};
};

struct	sSimdScalarInt
{
	union
	{
		__m64		m_vec64;
		int			m_ints[2];
	};
};

void Pythsse()
{
	int iStart=GetTickCount();
	for(int i=1;i<5001;i+=2) 
	{
		sSimdScalar a;
		a.m_ints[0]=i; // load
		a.m_ints[1]=i+1;
		a.m_vec128d=_mm_cvtpi32_pd(a.m_vec64[0]); // convert double
		a.m_vec128d=_mm_mul_pd(a.m_vec128d,a.m_vec128d);

		for(int j=i;j<5001;j+=2)
		{
			sSimdScalar b,c,d,e;
#ifdef _OUTPUTENABLE
			sSimdScalarInt bInt, cInt;
#endif
			b.m_ints[0]=j; // load 
			b.m_ints[1]=j+1;
			c.m_ints[0]=j+1;
			c.m_ints[1]=j+2;
			b.m_vec128d=_mm_cvtpi32_pd(b.m_vec64[0]); // convert double
			b.m_vec128d=_mm_mul_pd(b.m_vec128d,b.m_vec128d); // square
			c.m_vec128d=_mm_cvtpi32_pd(c.m_vec64[0]);
			c.m_vec128d=_mm_mul_pd(c.m_vec128d,c.m_vec128d); 
			b.m_vec128d=_mm_add_pd(a.m_vec128d,b.m_vec128d); // add 
			c.m_vec128d=_mm_add_pd(a.m_vec128d,c.m_vec128d);
			d.m_vec128d=_mm_sqrt_pd(b.m_vec128d); // sqrt
			e.m_vec128d=_mm_sqrt_pd(c.m_vec128d);
			b.m_vec64[0]=_mm_cvtpd_pi32(d.m_vec128d);  // convert int
			c.m_vec64[0]=_mm_cvtpd_pi32(e.m_vec128d);
#ifdef _OUTPUTENABLE
			bInt.m_vec64=b.m_vec64[0];
			cInt.m_vec64=c.m_vec64[0];
#endif
			b.m_vec128d=_mm_cvtpi32_pd(b.m_vec64[0]); // convert back double 
			c.m_vec128d=_mm_cvtpi32_pd(c.m_vec64[0]); 
			d.m_vec128d=_mm_cmpeq_pd(b.m_vec128d, d.m_vec128d); // compare int to double
			e.m_vec128d=_mm_cmpeq_pd(c.m_vec128d, e.m_vec128d);
#ifdef _OUTPUTENABLE
			for(int k=0;k<2;k++)
			{
				if(d.m_ints[k<<1])
				{
					cout<<((int)(i+k));
					cout<<' ';
					cout<<((int)(j+k));
					cout<<' ';
					cout<<bInt.m_ints[k];
					cout<<endl;
				}
				if(e.m_ints[k<<1])
				{
					cout<<((int)(i+k));
					cout<<' ';
					cout<<((int)(j+k+1));
					cout<<' ';
					cout<<cInt.m_ints[k];
					cout<<endl;
				}
			} 
#endif
		} 
	}
	_mm_empty();
	int iEnd=GetTickCount();
	double dEnd=(double) iEnd;
	double dStart=(double) iStart;
	double dTotalTick=((dEnd-dStart)/(double) 1000);
	cout<<"SchmideSSE ";
	cout<<dTotalTick;
	cout<<" seconds"<<endl;
}

void AnyAsm()
{
	int iStart=GetTickCount();
	_asm {
		pushad
		xor esi,esi

		a1:
		inc esi
		xor ebp,ebp
		mov eax,esi
		mul esi
		cmp esi,5001
		je finish
		mov edi,eax

		a2:
		inc ebp
		cmp ebp,5001
		je a1
		mov eax,ebp
		mul ebp
		mov ecx,ebp
		cmp esi,ebp
		jbe a_
		mov ecx,esi
		a_:
		lea ebx,[eax+edi]

		a3:
		mov eax,ecx
		mul ecx
		inc ecx
		cmp eax,ebx
		jb a3
		ja a2

		; here we have a hit. with a b c and d in esi ebp ecx and eax respectively.

		dec ecx

		; do what you want. I call wsprintf and write to the file

		jmp a2

		finish:
		popad
	}
	int iEnd=GetTickCount();
	double dEnd=(double) iEnd;
	double dStart=(double) iStart;
	double dTotalTick=((dEnd-dStart)/(double) 1000);
	cout<<"AnyASM ";
	cout<<dTotalTick;
	cout<<" seconds"<<endl;
}

int _tmain(int argc, _TCHAR* argv[])
{
	Pythsse();
	AnyAsm();
	char cTemp;
	cout<<"Hit any key to continue (Where's the ANY key?)";
	cTemp=_getch();
	return 0;
}

Adding if you look at the disassembly of my core routine, it is almost pure. I.e. register to register out of order pipelined. I could probably squeeze another 1-5% writing it by hand and I'm very doubtful of that. I guarantee no power savings.

Code:

for(int j=i;j<5001;j+=2)
01371020  cmp         ecx,1389h 
01371026  lea         eax,[ecx+1] 
01371029  mov         dword ptr [esp+10h],ecx 
0137102D  mov         dword ptr [esp+14h],eax 
01371031  movq        mm0,mmword ptr [esp+10h] 
01371036  cvtpi2pd    xmm2,mm0 
0137103A  mulpd       xmm2,xmm2 
0137103E  movapd      xmmword ptr [esp+10h],xmm2 
01371044  jge         Pythsse+0BAh (13710BAh) 
01371046  jmp         Pythsse+50h (1371050h) 
01371048  lea         esp,[esp] 
0137104F  nop              
		{
			sSimdScalar b,c,d,e;
#ifdef _OUTPUTENABLE
			sSimdScalarInt bInt, cInt;
#endif
			b.m_ints[0]=j; // load 
01371050  lea         edx,[eax-1] 
01371053  mov         dword ptr [esp+20h],edx 
			b.m_ints[1]=j+1;
01371057  mov         dword ptr [esp+24h],eax 
			c.m_ints[0]=j+1;
			c.m_ints[1]=j+2;
			b.m_vec128d=_mm_cvtpi32_pd(b.m_vec64[0]); // convert double
0137105B  movq        mm0,mmword ptr [esp+20h] 
01371060  lea         edx,[eax+1] 
01371063  cvtpi2pd    xmm0,mm0 
			b.m_vec128d=_mm_mul_pd(b.m_vec128d,b.m_vec128d); // square
01371067  mulpd       xmm0,xmm0 
0137106B  mov         dword ptr [esp+30h],eax 
0137106F  mov         dword ptr [esp+34h],edx 
			c.m_vec128d=_mm_cvtpi32_pd(c.m_vec64[0]);
01371073  movq        mm0,mmword ptr [esp+30h] 
			c.m_vec128d=_mm_mul_pd(c.m_vec128d,c.m_vec128d); 
			b.m_vec128d=_mm_add_pd(a.m_vec128d,b.m_vec128d); // add 
01371078  addpd       xmm0,xmm2 
			c.m_vec128d=_mm_add_pd(a.m_vec128d,c.m_vec128d);
			d.m_vec128d=_mm_sqrt_pd(b.m_vec128d); // sqrt
0137107C  sqrtpd      xmm0,xmm0 
01371080  cvtpi2pd    xmm1,mm0 
			e.m_vec128d=_mm_sqrt_pd(c.m_vec128d);
			b.m_vec64[0]=_mm_cvtpd_pi32(d.m_vec128d);  // convert int
01371084  cvtpd2pi    mm0,xmm0 
			c.m_vec64[0]=_mm_cvtpd_pi32(e.m_vec128d);
#ifdef _OUTPUTENABLE
			bInt.m_vec64=b.m_vec64[0];
			cInt.m_vec64=c.m_vec64[0];
#endif
			b.m_vec128d=_mm_cvtpi32_pd(b.m_vec64[0]); // convert back double 
01371088  cvtpi2pd    xmm0,mm0 
0137108C  add         eax,2 
0137108F  mulpd       xmm1,xmm1 
01371093  movapd      xmmword ptr [esp+20h],xmm0 
01371099  addpd       xmm1,xmm2 
0137109D  sqrtpd      xmm0,xmm1 
013710A1  lea         edx,[eax-1] 
013710A4  cmp         edx,1389h 
013710AA  cvtpd2pi    mm0,xmm0 
			c.m_vec128d=_mm_cvtpi32_pd(c.m_vec64[0]); 
013710AE  cvtpi2pd    xmm0,mm0 
013710B2  movapd      xmmword ptr [esp+30h],xmm0 
013710B8  jl          Pythsse+50h (1371050h) 
	for(int i=1;i<5001;i+=2) 
013710BA  add         ecx,2 
013710BD  cmp         ecx,1389h 
013710C3  jl          Pythsse+20h (1371020h) 
			d.m_vec128d=_mm_cmpeq_pd(b.m_vec128d, d.m_vec128d); // compare int to double
			e.m_vec128d=_mm_cmpeq_pd(c.m_vec128d, e.m_vec128d);

outfull.txt
exe txt pythsse2.zip

Any_Name_Does · Jul 16, 2010

Schmide said:

Ok I did a quick implementation using cpp intrensics and compared it to your function. It's compiled with VS 2008. My routine will output the triples if you add the commented out the #define _OUTPUTENABLE. I didn't check every triple as I just threw this together, but I ended up besting your ASM by 177x. However that's mostly because I used a better algorithm. But this goes to what you can do quickly with a higher level language.

Edit: I was skipping cases where a=b. Fixed. However the execution time did not change it was a very small subset of the total execution time.

BTW: I thought your routine was in a infinite loop and set a break point int it a few times because it took so long. He He

The results

The code

Code:

// pythsse2.cpp : Defines the entry point for the console application.
//

#include "stdafx.h"
#include <windows.h>
#include <iostream>
#include <fstream>
#include <mmintrin.h>
#include <fvec.h>
#include <dvec.h>
#include <cmath>
#include <conio.h>

using namespace std;

//#define _OUTPUTENABLE

struct    sSimdScalar
{
    union
    {
        __m128d        m_vec128d;
        __m128        m_vec128;
        __m64        m_vec64[2];
        double        m_doubles[2];
        float        m_floats[4];
        int            m_ints[4];
    };
};

struct    sSimdScalarInt
{
    union
    {
        __m64        m_vec64;
        int            m_ints[2];
    };
};

void Pythsse()
{
    int iStart=GetTickCount();
    for(int i=1;i<5001;i+=2) 
    {
        sSimdScalar a;
        a.m_ints[0]=i; // load
        a.m_ints[1]=i+1;
        a.m_vec128d=_mm_cvtpi32_pd(a.m_vec64[0]); // convert double
        a.m_vec128d=_mm_mul_pd(a.m_vec128d,a.m_vec128d);

        for(int j=i;j<5001;j+=2)
        {
            sSimdScalar b,c,d,e;
#ifdef _OUTPUTENABLE
            sSimdScalarInt bInt, cInt;
#endif
            b.m_ints[0]=j; // load 
            b.m_ints[1]=j+1;
            c.m_ints[0]=j+1;
            c.m_ints[1]=j+2;
            b.m_vec128d=_mm_cvtpi32_pd(b.m_vec64[0]); // convert double
            b.m_vec128d=_mm_mul_pd(b.m_vec128d,b.m_vec128d); // square
            c.m_vec128d=_mm_cvtpi32_pd(c.m_vec64[0]);
            c.m_vec128d=_mm_mul_pd(c.m_vec128d,c.m_vec128d); 
            b.m_vec128d=_mm_add_pd(a.m_vec128d,b.m_vec128d); // add 
            c.m_vec128d=_mm_add_pd(a.m_vec128d,c.m_vec128d);
            d.m_vec128d=_mm_sqrt_pd(b.m_vec128d); // sqrt
            e.m_vec128d=_mm_sqrt_pd(c.m_vec128d);
            b.m_vec64[0]=_mm_cvtpd_pi32(d.m_vec128d);  // convert int
            c.m_vec64[0]=_mm_cvtpd_pi32(e.m_vec128d);
#ifdef _OUTPUTENABLE
            bInt.m_vec64=b.m_vec64[0];
            cInt.m_vec64=c.m_vec64[0];
#endif
            b.m_vec128d=_mm_cvtpi32_pd(b.m_vec64[0]); // convert back double 
            c.m_vec128d=_mm_cvtpi32_pd(c.m_vec64[0]); 
            d.m_vec128d=_mm_cmpeq_pd(b.m_vec128d, d.m_vec128d); // compare int to double
            e.m_vec128d=_mm_cmpeq_pd(c.m_vec128d, e.m_vec128d);
#ifdef _OUTPUTENABLE
            for(int k=0;k<2;k++)
            {
                if(d.m_ints[k<<1])
                {
                    cout<<((int)(i+k));
                    cout<<' ';
                    cout<<((int)(j+k));
                    cout<<' ';
                    cout<<bInt.m_ints[k];
                    cout<<endl;
                }
                if(e.m_ints[k<<1])
                {
                    cout<<((int)(i+k));
                    cout<<' ';
                    cout<<((int)(j+k+1));
                    cout<<' ';
                    cout<<cInt.m_ints[k];
                    cout<<endl;
                }
            } 
#endif
        } 
    }
    _mm_empty();
    int iEnd=GetTickCount();
    double dEnd=(double) iEnd;
    double dStart=(double) iStart;
    double dTotalTick=((dEnd-dStart)/(double) 1000);
    cout<<"SchmideSSE ";
    cout<<dTotalTick;
    cout<<" seconds"<<endl;
}

void AnyAsm()
{
    int iStart=GetTickCount();
    _asm {
        pushad
        xor esi,esi

        a1:
        inc esi
        xor ebp,ebp
        mov eax,esi
        mul esi
        cmp esi,5001
        je finish
        mov edi,eax

        a2:
        inc ebp
        cmp ebp,5001
        je a1
        mov eax,ebp
        mul ebp
        mov ecx,ebp
        cmp esi,ebp
        jbe a_
        mov ecx,esi
        a_:
        lea ebx,[eax+edi]

        a3:
        mov eax,ecx
        mul ecx
        inc ecx
        cmp eax,ebx
        jb a3
        ja a2

        ; here we have a hit. with a b c and d in esi ebp ecx and eax respectively.

        dec ecx

        ; do what you want. I call wsprintf and write to the file

        jmp a2

        finish:
        popad
    }
    int iEnd=GetTickCount();
    double dEnd=(double) iEnd;
    double dStart=(double) iStart;
    double dTotalTick=((dEnd-dStart)/(double) 1000);
    cout<<"AnyASM ";
    cout<<dTotalTick;
    cout<<" seconds"<<endl;
}

int _tmain(int argc, _TCHAR* argv[])
{
    Pythsse();
    AnyAsm();
    char cTemp;
    cout<<"Hit any key to continue (Where's the ANY key?)";
    cTemp=_getch();
    return 0;
}

Adding if you look at the disassembly of my core routine, it is almost pure. I.e. register to register out of order pipelined. I could probably squeeze another 1-5% writing it by hand and I'm very doubtful of that. I guarantee no power savings.

Code:

for(int j=i;j<5001;j+=2)
01371020  cmp         ecx,1389h 
01371026  lea         eax,[ecx+1] 
01371029  mov         dword ptr [esp+10h],ecx 
0137102D  mov         dword ptr [esp+14h],eax 
01371031  movq        mm0,mmword ptr [esp+10h] 
01371036  cvtpi2pd    xmm2,mm0 
0137103A  mulpd       xmm2,xmm2 
0137103E  movapd      xmmword ptr [esp+10h],xmm2 
01371044  jge         Pythsse+0BAh (13710BAh) 
01371046  jmp         Pythsse+50h (1371050h) 
01371048  lea         esp,[esp] 
0137104F  nop              
        {
            sSimdScalar b,c,d,e;
#ifdef _OUTPUTENABLE
            sSimdScalarInt bInt, cInt;
#endif
            b.m_ints[0]=j; // load 
01371050  lea         edx,[eax-1] 
01371053  mov         dword ptr [esp+20h],edx 
            b.m_ints[1]=j+1;
01371057  mov         dword ptr [esp+24h],eax 
            c.m_ints[0]=j+1;
            c.m_ints[1]=j+2;
            b.m_vec128d=_mm_cvtpi32_pd(b.m_vec64[0]); // convert double
0137105B  movq        mm0,mmword ptr [esp+20h] 
01371060  lea         edx,[eax+1] 
01371063  cvtpi2pd    xmm0,mm0 
            b.m_vec128d=_mm_mul_pd(b.m_vec128d,b.m_vec128d); // square
01371067  mulpd       xmm0,xmm0 
0137106B  mov         dword ptr [esp+30h],eax 
0137106F  mov         dword ptr [esp+34h],edx 
            c.m_vec128d=_mm_cvtpi32_pd(c.m_vec64[0]);
01371073  movq        mm0,mmword ptr [esp+30h] 
            c.m_vec128d=_mm_mul_pd(c.m_vec128d,c.m_vec128d); 
            b.m_vec128d=_mm_add_pd(a.m_vec128d,b.m_vec128d); // add 
01371078  addpd       xmm0,xmm2 
            c.m_vec128d=_mm_add_pd(a.m_vec128d,c.m_vec128d);
            d.m_vec128d=_mm_sqrt_pd(b.m_vec128d); // sqrt
0137107C  sqrtpd      xmm0,xmm0 
01371080  cvtpi2pd    xmm1,mm0 
            e.m_vec128d=_mm_sqrt_pd(c.m_vec128d);
            b.m_vec64[0]=_mm_cvtpd_pi32(d.m_vec128d);  // convert int
01371084  cvtpd2pi    mm0,xmm0 
            c.m_vec64[0]=_mm_cvtpd_pi32(e.m_vec128d);
#ifdef _OUTPUTENABLE
            bInt.m_vec64=b.m_vec64[0];
            cInt.m_vec64=c.m_vec64[0];
#endif
            b.m_vec128d=_mm_cvtpi32_pd(b.m_vec64[0]); // convert back double 
01371088  cvtpi2pd    xmm0,mm0 
0137108C  add         eax,2 
0137108F  mulpd       xmm1,xmm1 
01371093  movapd      xmmword ptr [esp+20h],xmm0 
01371099  addpd       xmm1,xmm2 
0137109D  sqrtpd      xmm0,xmm1 
013710A1  lea         edx,[eax-1] 
013710A4  cmp         edx,1389h 
013710AA  cvtpd2pi    mm0,xmm0 
            c.m_vec128d=_mm_cvtpi32_pd(c.m_vec64[0]); 
013710AE  cvtpi2pd    xmm0,mm0 
013710B2  movapd      xmmword ptr [esp+30h],xmm0 
013710B8  jl          Pythsse+50h (1371050h) 
    for(int i=1;i<5001;i+=2) 
013710BA  add         ecx,2 
013710BD  cmp         ecx,1389h 
013710C3  jl          Pythsse+20h (1371020h) 
            d.m_vec128d=_mm_cmpeq_pd(b.m_vec128d, d.m_vec128d); // compare int to double
            e.m_vec128d=_mm_cmpeq_pd(c.m_vec128d, e.m_vec128d);

outfull.txt
exe txt pythsse2.zip

Not bad at all. your code is more complicated than mine, one would expect it the other way around when comparing a high level language to a low level one. so you are using a set of more appropriate registers. the results you are getting is a chaos. but it is ok because you haven't had time to take care small details. I knew my function is bottlenecked by the mul instruction, having to immediately use the result when it is not ready yet, but there was nothing I could do about it. but all in all you win. your algorythm is better. but how do get that 177x. on my computer your code does it in almost 4 seconds without outputting a file. Mine takes 13 seconds, with outputting a file. you win anyway, but how did you get that 177x?

Any_Name_Does · Jul 16, 2010

Munky said:
Looks to me like you don't get it. A wise man once said "In theory, there's no difference between theory and practice. In practice there is."

get terminated

Munky · Jul 16, 2010

Any_Name_Does said:
get terminated

You lost the argument. /thread.

How much electricity would be saved worldwide if Windows was writen in Assembly?

Member

Member

Member

Golden Member

Golden Member

Member

Golden Member

Member

Golden Member

Diamond Member

Member

Golden Member

Diamond Member

Platinum Member

Golden Member

Member

Diamond Member

Member

Diamond Member

Member

Diamond Member

Diamond Member

Member

Member

Diamond Member