How much electricity would be saved worldwide if Windows were written in Assembly?

Status
Not open for further replies.

Voo

Golden Member
Feb 27, 2009
1,684
0
76
It has been done many times in the past. The reason it isn't done on the PC now is that it isn't required. I work mainly with embedded processors, and there it is required quite a bit because of space concerns. Eventually even that will move to high-level languages. Some of it already is in the C++/Java realm, but the majority is still ASM.
I only played around a bit with µcs, but can't you use C most of the time? I ran into a few problems where the C code was too large and had to change some bits, but most of the time that went well (hobby projects only). And C is a lot more comfortable than asm imho.
 

eLiu

Diamond Member
Jun 4, 2001
6,407
1
0
That is exactly what my solution does - that's Euclid's formula, there's a reason why I'm the only one who named his loop variables m,n,k instead of i,j :)
The tricky part is to generate only those triples where a,b are <= 5000, because you loop by m,n which don't produce members in an orderly fashion.
If you look at my loop, it has:
1. n going from 1 to 2500 - this comes from 2*m*n<5000, so n<2500 (I just realized that since m>n>=1, I can cut the n loop to 1250, but I did a quick run and, due to timer resolution, saw no observable improvement)
2. m goes from n+1 to 2500/n for the same 2*m*n<5000 reason
3. Now we have a primitive triple, so there's a loop over k which generates triples by multiplying by k until one member is bigger than 5000

That's it. I'm not an ASM expert either, especially not SSE/intrinsics, so optimizations would take me a couple days most likely...

Yeah I saw that... didn't read the whole thread before posting, whoops. See my edits. I used (m,n) because (a,b,c) were already taken, these aren't primes so (p,q) are out, and I never use (i,j) for math--only programming. Edits also discuss how I think you can vastly decrease the size of the uniqueness lookup table... and maybe hashing is also useful here.

It's also probably worth checking which of (m^2 - n^2) vs 2mn will reach the size limit sooner. Similarly when you scale by k, probably faster to branch on which of (m^2-n^2) or (2mn) reaches 5000 first, and then have the loop over k only scale one of those quantities. (Granted you probably won't be able to measure the difference, lol.)
 

eLiu

Diamond Member
Jun 4, 2001
6,407
1
0
Yeah, that's true for the check, but for the hash table I'm not sure, because a lookup would still involve pretty much the same computation as indexing into the array.

I think I've nailed down
for(n=1; n<too_lazy_to_compute; n++){
for(m=n+1; m<too_lazy_again; m++){

The best I can do is n=1; n<50
and m goes from m=n+1; m<2500/n

It doesn't improve the time much, though. I should correct my time to 0.114s; running 100 runs makes it more accurate.

lol, I keep posting before seeing your posts. The hash table might be faster if we can find a way to accurately estimate the number of (m,n) pairs generating unique primitive triples. That table should only need to be 1.5-2x larger than the total number of unique triple generating pairs.

Even w/your current choices, you only need an array of ~1.5million, which is a good bit smaller than 25million. In C++, is "bool" actually 1 bit? If not, you can tighten up more by using an array of integers and using bit-ops to manipulate individual bits as your flags.
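The bit-packing idea can be sketched like this (a minimal sketch; `BitTable` and its methods are hypothetical names, not from the thread):

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Bit-flag table along these lines: each uint32_t word stores 32 yes/no
// flags, so a 5001*5001 uniqueness table drops from ~25 MB of byte-sized
// bools to ~3.1 MB of bits.
struct BitTable {
    std::vector<uint32_t> words;
    explicit BitTable(std::size_t nbits) : words((nbits + 31) / 32, 0u) {}

    void set(std::size_t i)        { words[i >> 5] |=  (1u << (i & 31)); }
    void clear(std::size_t i)      { words[i >> 5] &= ~(1u << (i & 31)); }
    bool test(std::size_t i) const { return (words[i >> 5] >> (i & 31)) & 1u; }
};
```

Whether this beats a plain byte array depends on cache behavior versus the cost of the extra shift/mask per access, which is exactly what gets debated later in the thread.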
 

Cogman

Lifer
Sep 19, 2000
10,286
147
106
lol, I keep posting before seeing your posts. The hash table might be faster if we can find a way to accurately estimate the number of (m,n) pairs generating unique primitive triples. That table should only need to be 1.5-2x larger than the total number of unique triple generating pairs.

Even w/your current choices, you only need an array of ~1.5million, which is a good bit smaller than 25million. In C++, is "bool" actually 1 bit? If not, you can tighten up more by using an array of integers and using bit-ops to manipulate individual bits as your flags.

A bool in C++ is a byte in size. Though going to smaller-than-byte sizes will slow things down significantly; bit operations aren't exactly fast.
 

Schmide

Diamond Member
Mar 7, 2002
5,747
1,039
126
I added iCyborg's code, applied his loop reduction to my sseasm and did a full run.

Code:
SchmideSSE 1.716 seconds, solutions found: 66160
simplesol 1.888 seconds, solutions found: 66160
simplesolASM 1.326 seconds, solutions found: 66160
SchmideSSEASM 0.842 seconds, solutions found: 66160
SchmideSSEASMOpt 0.78 seconds, solutions found: 66160
cogmanSimplesol 1.623 seconds, solutions found: 66160
iCyborgSimplesol 3.057 seconds, solutions found: 66160
iCyborgEuclidSol 0.203 seconds, solutions found: 66160
iCyborgEuclidSol2 0.219 seconds, solutions found: 66160
Hit any key to continue (Where's the ANY key?)

Edit 10:17pm east:

Updated exe and proj/code.

Changed all loops to use loop reduction except SchmideSSEASM since it has a special case of SchmideSSEASMOpt.

EXE pythsse2.zip

SRC and VCproj pythsse2src.zip
 
Last edited:

Schmide

Diamond Member
Mar 7, 2002
5,747
1,039
126
True story, the time is identical. Anyway, I have a new winner. I did some optimizations in some multiplications, but that didn't help much. What HELPED A LOT was replacing that huge double loop which initializes unique[][] to false member by member with a single memset() to 0 of size 5001^2; boy, that cuts the time by 5x.

I plugged in the code from this post and got the same results. Maybe it's an AMD/Intel/i7 thingy, because the code does seem updated.
 

Net

Golden Member
Aug 30, 2003
1,592
3
81
Since WinXP, I've yet to see Microsoft release a truly poor piece of software (well, OK, IE6 and 7 aren't all that great, but other than that...). To attack them at this point is quite ignorant.

We are talking low power consumption. They aren't focused on lightweight code; simply put, that's not in their business model. If they write it, they are going to focus on getting it out quickly, looking good, and with new features.
 

Cogman

Lifer
Sep 19, 2000
10,286
147
106
We are talking low power consumption. They aren't focused on lightweight code; simply put, that's not in their business model. If they write it, they are going to focus on getting it out quickly, looking good, and with new features.

Oh? Do you have an OS that is constantly pegging the CPU at 100%? 'Cause I certainly have never seen an OS that keeps the CPU constantly occupied with OS tasks.

Bloat doesn't necessarily translate into power loss. If my hard drive is filled up but I don't run anything, it is just as efficient as if I were to run with an empty hard drive.

There is very little an OS can do about power consumption beyond implementing the ACPI standards, PowerNow, and whatever other cutesy "downclock the CPU" schemes are out there. Once your computer finishes loading, the OS stops being a major power draw. Your CPU drops to around 0.1% utilization and spends most of its time idling. Older, "less bloated" operating systems will do worse as far as power consumption goes because they haven't implemented the downclocking features. However, if the CPU doesn't support them, then an old OS and a new one will perform about the same.

Microsoft has been very thoughtful about power usage. It's called "notebooks", and they have been around for a while. Ever wonder why you can disable just about every OS feature in Windows? Or why there's the "Power Options" section of the Control Panel (which has been around since the 95 days)?

If this thread has proven anything, it's that lightweight code is not power-efficient code. I don't know if you noticed, but the code got longer and longer as the speeds went up. In fact, the shortest solution was the least efficient of all of them. Judging anything by code size is stupid and shows no understanding of how software or algorithms work. Bitching about Microsoft doesn't help your case either, as it just shows how truly ignorant you are about what is going on with OSes. You've fallen into the new-agey computer snobbery of bashing Microsoft for no good, solid reason.

Do I sound like I'm being harsh? Probably. I am, however, sick and tired of all the ignorant comments I've been hearing about Microsoft since Vista's release. Vista and 7 are good, solid OSes; the only people who bitch about them are those who don't know how to use them, or can't wrap their minds around the fact that 1 TB hard drives are commonplace, so a 5-10 GB footprint is really peanuts.
 

iCyborg

Golden Member
Aug 8, 2008
1,363
68
91
I plugged in the code from this post and got the same results. Maybe its an AMD/Intel/I7 thingy because the code does seem updated.
My first Euclid was around your SSEASM time (which was ~0.550s for me), and the 2nd ~0.115s. Both of your versions of Euclid seem quite a bit better than SSEASM; if you really got the right code, then something isn't running well on my i7 in the first case.

Anyway, I have something much, much better. I was surprised how much I could cut the time by pulling that big memory allocation out of the iRuncount loop. I cut the time to 0.026s just by allocating it once, and not for every run. This is, of course, cheating, as I'm sharing stuff between the runs, that's not what the runs are for, but it really made me want to get rid of it altogether. And it can be done. And the resulting algorithm does a single run under 0.1ms! (0.952s for 10,000 runs, granted, it's on an OCed i7)
Any meaningful gains would probably come from optimizing this algorithm with SSE; I'm not sure how much more optimal you can get without it, as there are not that many arithmetic operations involved per generated triple, and all are integer ops (the bulk of the work is just 2 integer additions per triple!).

@eLiu
After going through the wiki on this, the problem is that the algorithm does not generate primitive triples unless m,n are coprime and exactly one of them is odd. Since n<50, there are only 14 odd primes from 1 to 50 (2 is excluded because the m-loop starts from n+1, guaranteeing different parity, and increments by 2, preserving it), so you only check in a loop that m and n do not share any of those primes as factors, and you have a primitive triple. Once you have a primitive triple, you just keep multiplying by integers until one member is larger than 5000 (there's a possible small optimization here: check which member is bigger and loop until that one passes the bound).
Also, I've taken into account that m^2-n^2 must be <=5000, and this means that m<=min(87, 2500/n): if m>87, then since n is at most 50, m^2-n^2 >= 88^2-50^2 > 5000. Could probably add a small optimization to make this 87 limit vary with n.

Taking all this into account, I get (per 10 runs, Euclid does 10,000 to get around timer res, but I account for that in the printout):

SchmideSSE 1.732 seconds, solutions found: 66160
simplesol 1.794 seconds, solutions found: 66160
simplesolASM 1.279 seconds, solutions found: 66160
SchmideSSEASM 0.546 seconds, solutions found: 66160
SchmideSSEASMOpt 0.53 seconds, solutions found: 66160
cogmanSimplesol 1.528 seconds, solutions found: 66160
iCyborgSimplesol 1.342 seconds, solutions found: 66160
iCyborgEuclidSol 0.000952 seconds, solutions found: 66160
AnyASM 10.484 seconds
Hit any key to continue (Where's the ANY key?)

Crazy :)

The code does start to look less understandable than that simplest ASM though, because it needs quite a bit of math explanation as to why it's like it is...

Code:
bool iCyborgEuclidSolution()
{
#define _UNIQUE
    int solutions = 0;
    int iStart = GetTickCount();

    for (int iRuncount = 0; iRuncount < 10000; iRuncount++)
    {
        int mulTable[2501];
#ifdef _UNIQUE
        int primes[] = {3, 5, 7, 11, 13, 17, 19, 23, 29, 31, 37, 41, 43, 47};
#endif

        for (int i = 1; i < 2501; ++i)
        {
            mulTable[i] = i * i;
        }

        for (int n = 1; n < 50; ++n)
        {
            int lim = (n > 28) ? 2500 / n : 87;
            for (int m = n + 1; m <= lim; m += 2)
            {
#ifdef _UNIQUE
                bool coprime = true;
                for (int i = 0; i < 14; i++)
                {
                    if ((n % primes[i]) == 0 && (m % primes[i]) == 0)
                    {
                        coprime = false;
                        break;
                    }
                }
                if (!coprime) continue; // not primitive, skip
#endif
                int l = mulTable[m] - mulTable[n];
                int a = 0;
                int b = 0;
                for (int k = 1; true; ++k)
                {
                    a += l;
                    b += m * n;
                    if (a > 5000 || b > 2500) break;
                    ++solutions;
#ifdef _OUTPUTENABLE
                    cout << a << ' ' << b << ' ' << k * (mulTable[m] + mulTable[n]) << endl;
#endif
                }
            }
        }
    }

    int iEnd = GetTickCount();
    double dTotalTick = ((double)iEnd - (double)iStart) / 1000000.0;
    cout << "iCyborgEuclidSol ";
    cout << dTotalTick;
    cout << " seconds, solutions found: " << solutions / 1000 << endl;
    return true;
}
 

eLiu

Diamond Member
Jun 4, 2001
6,407
1
0
My first Euclid was around your SSEASM time (which was ~0.550s for me), and the 2nd ~0.115s. Both of your versions of Euclid seem quite a bit better than SSEASM; if you really got the right code, then something isn't running well on my i7 in the first case.

Anyway, I have something much, much better. I was surprised how much I could cut the time by pulling that big memory allocation out of the iRuncount loop. I cut the time to 0.026s just by allocating it once, and not for every run. This is, of course, cheating, as I'm sharing stuff between the runs, that's not what the runs are for, but it really made me want to get rid of it altogether. And it can be done. And the resulting algorithm does a single run under 0.1ms! (0.952s for 10,000 runs, granted, it's on an OCed i7)
Any meaningful gains would probably come from optimizing this algorithm with SSE; I'm not sure how much more optimal you can get without it, as there are not that many arithmetic operations involved per generated triple, and all are integer ops (the bulk of the work is just 2 integer additions per triple!).

@eLiu
After going through the wiki on this, the problem is that the algorithm does not generate primitive triples unless m,n are coprime and exactly one of them is odd. Since n<50, there are only 14 odd primes from 1 to 50 (2 is excluded because the m-loop starts from n+1, guaranteeing different parity, and increments by 2, preserving it), so you only check in a loop that m and n do not share any of those primes as factors, and you have a primitive triple. Once you have a primitive triple, you just keep multiplying by integers until one member is larger than 5000 (there's a possible small optimization here: check which member is bigger and loop until that one passes the bound).
Also, I've taken into account that m^2-n^2 must be <=5000, and this means that m<=min(87, 2500/n): if m>87, then since n is at most 50, m^2-n^2 >= 88^2-50^2 > 5000. Could probably add a small optimization to make this 87 limit vary with n.

Taking all this into account, I get (per 10 runs, Euclid does 10,000 to get around timer res, but I account for that in the printout):

Wow I got lucky ages ago then. I first worked out this way of generating pythagorean triples back in HS for a math contest. It gave me the right answer at the time, but I guess I got that problem right for the wrong reasons b/c I had "proved" that you only get primitive triples. Fail! I must have at least known the "only one can be even" condition at some point b/c that's such an obvious condition for non-primitive triples. *sigh* Well, you learn something new everyday, thanks!

Also yeah, memory allocation is surprisingly painful eh? Until recently, malloc/free were the most often called functions in the project I work on... pretty ridiculous. Granted when you're doing substantial amounts of computation it doesn't matter that much, but still.

Cogman: Why are bitwise operations slow? Flipping a bit involves XOR-ing with the appropriate mask. The mask is an integer so you can write it into code as a literal & that won't require any loads from memory/cache (unlike w/float & double). The bitop itself is only 1 cycle and the compute-unit is separate from the integer adder & multiplier, so it can happen in parallel with other computations.

Additionally it may be possible to structure the table so that when you load a cache-line (or even an individual byte or word or whatever unit you want) from it, you end up checking/manipulating many bits of that data.

Anyway in my experience, (like solving the n-queens problem) using bit representations is substantially faster than using int, short, or even char representations of 'yes/no' flags. (Well, as long as you can structure the data array for good locality. I guess it would not be helpful if you loaded an array entry, edited one bit, and jumped to somewhere far away in the table.)

Schmide: Are there gains to be had by converting the Euclid code to take advantage of SSE? Maybe it'd have to be one of the earlier versions w/less branching?

Also, unrelated question: are you familiar with using the prefetchnta (or other prefetch*) asm command?
 

Schmide

Diamond Member
Mar 7, 2002
5,747
1,039
126
Included it in the files. There is absolutely no real ASM speedup for that code that I can see. No way to execute two iterations in the same loop. You could maybe squeeze out single-digit-percent performance improvements.

Code:
SchmideSSE 1.669 seconds, solutions found: 66160
simplesol 1.81 seconds, solutions found: 66160
simplesolASM 1.341 seconds, solutions found: 66160
SchmideSSEASM 0.811 seconds, solutions found: 66160
SchmideSSEASMOpt 0.78 seconds, solutions found: 66160
cogmanSimplesol 1.638 seconds, solutions found: 66160
iCyborgEuclidSol 0.203 seconds, solutions found: 66160
iCyborgEuclidSolPrime 0.002293 seconds, solutions found: 66160
Hit any key to continue (Where's the ANY key?)
 

iCyborg

Golden Member
Aug 8, 2008
1,363
68
91
Went to look into the disassembly, and ahem, I was compiling in debug mode with no optimizations whatsoever (I was counting on at least hoisting out obvious loop invariants, like m*n in the for-k loop instead of an imul every time...). When I enable optimizations, the non-ASM versions get a pretty decent boost.

Code:
SchmideSSE 1.263 seconds, solutions found: 66160
simplesol 1.295 seconds, solutions found: 66160
simplesolASM 1.185 seconds, solutions found: 66160
SchmideSSEASM 0.562 seconds, solutions found: 66160
SchmideSSEASMOpt 0.546 seconds, solutions found: 66160
cogmanSimplesol 1.295 seconds, solutions found: 66160
iCyborgSimplesol 1.217 seconds, solutions found: 66160
iCyborgEuclidSol 0.000733 seconds, solutions found: 66160
AnyASM 10.514 seconds
Hit any key to continue (Where's the ANY key?)
 

Schmide

Diamond Member
Mar 7, 2002
5,747
1,039
126
Just for the hell of it, I changed the variables in your final product to see if using the floating-point unit would be faster. Well, it wasn't; it was slower by a factor of 3-4x. Apparently fmod is not faster than the int % version. Although the functions remain in the code, I'm deprecating AnyASM and iCyborgEuclidSolPrimeDouble from the test, as one takes too long and the other was a bad idea.

Code:
SchmideSSE 1.669 seconds, solutions found: 66160
simplesol 1.825 seconds, solutions found: 66160
simplesolASM 1.326 seconds, solutions found: 66160
SchmideSSEASM 0.827 seconds, solutions found: 66160
SchmideSSEASMOpt 0.811 seconds, solutions found: 66160
cogmanSimplesol 1.451 seconds, solutions found: 66160
iCyborgSimplesol 2.652 seconds, solutions found: 66160
iCyborgEuclidSol 0.187 seconds, solutions found: 66160
iCyborgEuclidSolPrime 0.002325 seconds, solutions found: 66160
iCyborgEuclidSolPrimeDouble 0.008674 seconds, solutions found: 66160
Hit any key to continue (Where's the ANY key?)

I'm actually impressed by cogman's regular code. It's beating every regular-code version and holding its own with the x87 ASM.

BTW make sure you also select fastmath and sse2 in the optimizations. It makes a rather large difference especially on AMD processors.
 
Last edited:

eLiu

Diamond Member
Jun 4, 2001
6,407
1
0
Included it in the files. There is absolutely no real ASM speedup for that code that I can see. No way to execute two iterations in the same loop. You could maybe squeeze out single-digit-percent performance improvements.

Code:
SchmideSSE 1.669 seconds, solutions found: 66160
simplesol 1.81 seconds, solutions found: 66160
simplesolASM 1.341 seconds, solutions found: 66160
SchmideSSEASM 0.811 seconds, solutions found: 66160
SchmideSSEASMOpt 0.78 seconds, solutions found: 66160
cogmanSimplesol 1.638 seconds, solutions found: 66160
iCyborgEuclidSol 0.203 seconds, solutions found: 66160
iCyborgEuclidSolPrime 0.002293 seconds, solutions found: 66160
Hit any key to continue (Where's the ANY key?)

Would it be possible (or at least advantageous) to unroll the loop over n with SSE instructions? Several "copies" of the innermost loop, where you're just scaling by k each time (well, iCyborg has it written with additions), could be handled in parallel. Since m is fixed, you'd run however many iterations is implied by the smallest n value in the current set.

I think you'd still want to do the validity checks first. If a (m,n) pair is invalid, then mask out that part of the SSE register with 0s. (Since your innermost loop is multiplying by k or more simply adding two SSE registers together). Then when you print out, you only print nonzero results. Counting the number of solutions would be relatively easy since it's easy enough to calculate how many each (m,n) pair can contribute: min(5000/(m^2-n^2), 5000/(2*m*n)).
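That per-pair count can be written as a small closed-form helper (hypothetical, just following the min(...) formula above):

```cpp
// Closed-form count of how many multiples k of the primitive generated by
// (m, n) keep both legs in bounds: k*(m^2 - n^2) <= 5000 and k*2mn <= 5000.
// Integer division floors, which is exactly the largest valid k.
int multiples_within_bound(int m, int n) {
    int capOdd  = 5000 / (m * m - n * n);  // largest k for the m^2 - n^2 leg
    int capEven = 2500 / (m * n);          // largest k for the 2mn leg
    return capOdd < capEven ? capOdd : capEven;
}
```

For (m, n) = (3, 2), i.e. the (5, 12, 13) family, the 2mn leg caps first at 2500/6 = 416 multiples. When only the count is needed (not the triples themselves), this replaces the whole inner k loop with two divisions.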
 

Scali

Banned
Dec 3, 2004
2,495
1
0
That Euclid's method reminds me of calculating primes with Eratosthenes' sieve.
I once created a sieve implementation that way, but with a 'partial' table.
You have an array of booleans, and mark each prime you find as prime, and all its multiples as non-prime...
Problem is, for large ranges, the array gets very large. So firstly I'd remove all the even numbers, as they can't be primes anyway. Secondly, I'd pack them into bits.
Lastly, I don't really need the entire table at a time.
Namely, if I know which numbers are primes (only a relatively short list compared to all the possible numbers in a range), I can reconstruct a sub-range of the table at any point. I simply need to figure out the starting point, and then mark the multiples from there.
This way you can make a small table that can remain in cache, and greatly improve speed.

http://srcvault.scali.eu.org/cgi-bin/Syntax/Syntax.cgi?SmallSieve.c

Perhaps some of this is useful for this algorithm.
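The odds-only, bit-packed part of Scali's scheme can be sketched roughly like this (a simplified, non-segmented version of what he describes; his SmallSieve.c additionally reconstructs sub-ranges on the fly, which this sketch omits):

```cpp
#include <cstdint>
#include <vector>

// Odds-only, bit-packed Eratosthenes sketch: bit i stands for the odd
// number 2*i + 1, so evens are never stored and each flag costs one bit.
// Returns pi(n), the number of primes <= n.
int count_primes(int n) {
    if (n < 2) return 0;
    int half = (n - 1) / 2;                       // index of largest odd <= n
    std::vector<uint32_t> composite((half + 32) / 32, 0u);
    for (int i = 1; (2 * i + 1) * (2 * i + 1) <= n; ++i) {
        if (composite[i >> 5] & (1u << (i & 31))) continue;
        int p = 2 * i + 1;
        // Cross off odd multiples of p starting at p*p; the index step is p
        // because consecutive odd multiples differ by 2p in value.
        for (int j = (p * p - 1) / 2; j <= half; j += p)
            composite[j >> 5] |= 1u << (j & 31);
    }
    int count = 1;                                // count 2, the only even prime
    for (int i = 1; i <= half; ++i)               // i = 0 stands for 1, not prime
        if (!(composite[i >> 5] & (1u << (i & 31)))) ++count;
    return count;
}
```

The segmented variant keeps only a cache-sized window of `composite` live at a time and re-marks multiples per window from the list of known primes; the packing trick is the same.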

At any rate, this proves once again that the algo is far more important than the language.
The "iCyborgEuclidSol" routine would probably still be faster than the assembly one if it were coded in VB :)
 

BoberFett

Lifer
Oct 9, 1999
37,562
9
81
[Personal attacks, gratuitous profanity removed]

Keep the conversation civil and respectful, and the profanity to a minimum, or go back to whatever forum this monstrosity of a thread originated in.

Markbnj
Programming moderator
 
Last edited by a moderator:

Any_Name_Does

Member
Jul 13, 2010
143
0
0
:eek::eek::eek::eek::eek:

It's getting more and more fun.
Where are we now? 100,000 times faster than my original code.

You guys didn't disappoint me. ():)():)
 

Modelworks

Lifer
Feb 22, 2007
16,240
7
76
I only played around a bit with µcs, but can't you use C most of the time? I ran into a few problems where the C code was too large and had to change some bits, but most of the time that went well (hobby projects only). And C is a lot more comfortable than asm imho.

Yes, you can use C quite a bit. The problem with the embedded world is that even the best compilers do not make full use of the hardware. Things like ARM have so many variations that compilers end up targeting a generic ARM chip rather than specific versions. So ASM is needed when you want to access special functions your chip may have but the compiler has no knowledge of, or worse, compiles incorrectly so the function doesn't work.

When you go to debug, the compiled code is running on the processor, and it doesn't come back referring to the C source; it comes back as register and memory data in hex. So without knowing ASM, you can't figure out what the processor is doing internally.
 

Cogman

Lifer
Sep 19, 2000
10,286
147
106
Cogman: Why are bitwise operations slow? Flipping a bit involves XOR-ing with the appropriate mask. The mask is an integer so you can write it into code as a literal & that won't require any loads from memory/cache (unlike w/float & double). The bitop itself is only 1 cycle and the compute-unit is separate from the integer adder & multiplier, so it can happen in parallel with other computations.

Additionally it may be possible to structure the table so that when you load a cache-line (or even an individual byte or word or whatever unit you want) from it, you end up checking/manipulating many bits of that data.

Anyway in my experience, (like solving the n-queens problem) using bit representations is substantially faster than using int, short, or even char representations of 'yes/no' flags. (Well, as long as you can structure the data array for good locality. I guess it would not be helpful if you loaded an array entry, edited one bit, and jumped to somewhere far away in the table.)

Well first, what do you save? Space? We aren't in a space-constrained situation, so cutting that back doesn't really help. Next, can an XOR compare to a single immediate-to-register mov? Not likely; the excess of having to store and load masking flags just adds an extra burden.

When you are strictly using the variable for a true/false value (and nothing else) the operation for comparing for "truthiness" is a single compare operator.

I'm skeptical it would be faster; when comparing true/false values, there isn't a whole lot going on.

I should note, however, that if you are doing logic operations, that can be a significant slowdown. The & operator is much faster than the && operator. So something like

this && is && probably && true
is going to be slower than
this & is & probably & true

And that comes down to the fact that, in the first case, each variable has to be evaluated with a cmp, stored, and then checked with an "and" operator, vs. just a single and. Now, if you are disciplined and make sure your true/false values contain either 1 or 0, then you are safe using the bitwise operations instead of the logical ones. And in that case, it will pretty much always be faster to use a byte value instead of a bit field.
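The equivalence being relied on here can be made concrete (the `all_set_*` helpers are hypothetical, not code from the thread): with operands constrained to exactly 0 or 1, bitwise AND computes the same answer as logical AND, but without short-circuit branches.

```cpp
// '&&' may emit a conditional branch per operand (short-circuit evaluation),
// while '&' is straight-line code; for pure 0/1 flags the results agree.
inline int all_set_logical(int a, int b, int c) { return a && b && c; }
inline int all_set_bitwise(int a, int b, int c) { return a & b & c; }
```

The swap is only safe for pure 0/1 operands with no side effects; `&&` is still required when the right-hand side must not always be evaluated (e.g. `p && p->next`).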

I really am interested to see your n-queens solutions to see the difference and be able to comb through the code.
 

Cogman

Lifer
Sep 19, 2000
10,286
147
106
Small note: Cyborg's solutions calculation was wrong... it's off by a factor of 10 (an extra zero :p).
There are really 6616 solutions.
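For anyone wanting to double-check the corrected figure, a brute-force counter (hypothetical; nobody posted this exact check) that tests every leg pair directly should land on the same 6616, since the Euclid versions bound both legs by 5000:

```cpp
#include <cmath>

// Exhaustive check: count unordered leg pairs a < b <= limit for which
// a^2 + b^2 is a perfect square. O(limit^2), so far slower than the Euclid
// generator, but completely independent of it.
int count_triples_brute(int limit) {
    int count = 0;
    for (long long a = 1; a <= limit; ++a) {
        for (long long b = a + 1; b <= limit; ++b) {
            long long cc = a * a + b * b;
            long long c = (long long)std::sqrt((double)cc);
            while (c * c > cc) --c;                 // correct any fp rounding
            while ((c + 1) * (c + 1) <= cc) ++c;
            if (c * c == cc) ++count;               // integral hypotenuse
        }
    }
    return count;
}
```

Agreeing with the generator's count rules out both the extra-zero slip and any missed or duplicated primitives in the coprime/parity logic.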
 

Any_Name_Does

Member
Jul 13, 2010
143
0
0
[Personal attacks, gratuitous profanity removed]

Keep the conversation civil and respectful, and the profanity to a minimum, or go back to whatever forum this monstrosity of a thread originated in.

Markbnj
Programming moderator

Thanks Markbnj,

Well, I actually didn't mind his remarks. I saw it more as provocation than insult, and it takes much more than that to provoke me. But you are right; rules are rules.
 

Cogman

Lifer
Sep 19, 2000
10,286
147
106
Cyborg's Algorithm, Threaded! :) (It had to be done :awe:)
Code:
struct threadData
{
    int nStart;
    int nEnd;
    int* mulTable;
    int* sol;
    CRITICAL_SECTION* CS;
};

DWORD WINAPI CyborgThread(void* n)
{
#define _UNIQUE
    threadData* tData = (threadData*)n;
    int* mulTable = tData->mulTable;
    int solutions = 0;

#ifdef _UNIQUE
    int primes[] = {3, 5, 7, 11, 13, 17, 19, 23, 29, 31, 37, 41, 43, 47};
#endif

    for(int n=tData->nStart; n<tData->nEnd; ++n)
    {
        int lim = (n>28) ? 2500/n : 87;
        for (int m=n+1; m<=lim; m+=2)
        {
#ifdef _UNIQUE
            bool coprime = true;
            for (int i=0; i<14; i++)
            {
                if ((n%primes[i])==0 && (m%primes[i])==0)
                {
                    coprime = false;
                    break;
                }
            }
            if (!coprime) continue; // not primitive, skip
#endif
            int l = mulTable[m]-mulTable[n];
            int a = 0;
            int b = 0;
            for (int k=1; true; ++k)
            {
                a += l;
                b += m*n;
                if (a>5000 || b>2500) break;
                ++solutions;
#ifdef _OUTPUTENABLE
                cout<<a;
                cout<<' ';
                cout<<b;
                cout<<' ';
                cout<<k*(mulTable[m]+mulTable[n]);
                cout<<endl;
#endif
            }
        }
    }
    EnterCriticalSection(tData->CS);
    *tData->sol += solutions;
    LeaveCriticalSection(tData->CS);
    return 0;
}

bool iCyborgEuclidSolutionThreaded()
{
    const int NUM_RUNS = 5000;
#define _UNIQUE
    LARGE_INTEGER start, end, freq;
    int solutions = 0;
    QueryPerformanceCounter(&start);

    for(int iRuncount=0; iRuncount<NUM_RUNS; iRuncount++)
    {
        int mulTable[2501];

        for (int i = 1; i < 2501; ++i)
        {
            mulTable[i] = i * i;
        }
#define NUMTHREADS 4
        CRITICAL_SECTION CS;
        InitializeCriticalSection(&CS);
        HANDLE threads[NUMTHREADS];
        threadData tData[NUMTHREADS];
        int stopPos = 1;
        for (int j = 0; j < NUMTHREADS; ++j)
        {
            tData[j].nStart = stopPos;
            stopPos += 50 / NUMTHREADS;
            if (stopPos >= 49) // Make sure we catch the end
                stopPos = 50;
            tData[j].nEnd = stopPos;
            tData[j].sol = &solutions;
            tData[j].mulTable = mulTable;
            tData[j].CS = &CS;
            threads[j] = CreateThread(NULL, 0, CyborgThread, (void*)&tData[j], 0, NULL);
        }
        WaitForMultipleObjects(NUMTHREADS, threads, true, INFINITE);
        for (int j = 0; j < NUMTHREADS; ++j)   // close handles so 5000 runs don't leak 20000 of them
            CloseHandle(threads[j]);
        DeleteCriticalSection(&CS);
    }
    QueryPerformanceCounter(&end);
    QueryPerformanceFrequency(&freq);
    double dEnd=(double) end.QuadPart / (double)freq.QuadPart;
    double dStart=(double) start.QuadPart / (double)freq.QuadPart;
    double dTotalTick=((dEnd-dStart)/(double) NUM_RUNS);
    cout<<"iCyborgEuclidSolT ";
    cout<<dTotalTick;
    cout<<" seconds, solutions found: "<< solutions/NUM_RUNS << endl;
    return true;
}

The only changes I really made were to break it up into NUMTHREADS threads and to use a higher-resolution timer (lazy fetchers :p). Oh, and I corrected the solutions bug (6616, not 66160; slip of the zero).

The results?
iCyborgEuclidSolT 0.00049463 seconds, solutions found: 6616
iCyborgEuclidSol 0.000159711 seconds, solutions found: 6616

Whoops, slower :p. I didn't really expect it to be faster (in fact, I'm somewhat surprised it wasn't a lot slower). Cyborg's solution is already in the realm of "faster than the thread overhead". I'm sure our slower solutions would benefit more from threading.
 

Scali

Banned
Dec 3, 2004
2,495
1
0
Don't use critical sections; they're expensive.
Try InterlockedAdd() instead.
Not that it would matter here, I suppose.
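In portable C++11 terms, the suggestion amounts to replacing the critical section around `*tData->sol += solutions;` with a single lock-free atomic add (on Win32, `InterlockedExchangeAdd` on a volatile LONG does the same thing; `g_solutions`/`publish_solutions` below are illustrative names, not thread code):

```cpp
#include <atomic>

// Shared counter: each worker publishes its local tally with one atomic
// read-modify-write instead of an Enter/LeaveCriticalSection pair.
std::atomic<int> g_solutions(0);

void publish_solutions(int localCount) {
    // One hardware-level atomic add; no kernel object, no blocking.
    g_solutions.fetch_add(localCount, std::memory_order_relaxed);
}
```

Each CyborgThread would call `publish_solutions(solutions)` at the end instead of taking the critical section; with only one add per thread per run, the saving is small, but it also removes the InitializeCriticalSection overhead per run.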
 

iCyborg

Golden Member
Aug 8, 2008
1,363
68
91
I'm actually impressed by cogman's regular code. It's beating every regular-code version and holding its own with the x87 ASM.

BTW make sure you also select fastmath and sse2 in the optimizations. It makes a rather large difference especially on AMD processors.
I'm actually puzzled by this, because on my system my version improves on that slightly. And it should, because I ripped off his solution and just added those small optimizations you've included in your versions as well (i starts from 3 not 1, j starts from i+1 not i, and the upper limit uses that expression). I based it on his 2nd solution, which basically replaces the double->int conversion with a mulTable access (now constructed for 7072 elements, not 5001), as that gave a slight boost. So I'm surprised this change ends up quite a bit slower on your machine than Cogman's. Perhaps you could try those optimizations on Cogman's method; perhaps there really is some arch difference that makes it ~2x slower than the i7 (relative to Cogman).


fastmath seems to help for some (not for Euclid, obviously; no FP there), while sse2 makes things worse, especially for your simplesol, which gets a nice boost from fastmath:


Code:
Method            Reg       fastmath  SSE2      fastmath+SSE2
-------------------------------------------------------------
SchmideSSE        1.263     1.263     1.264     1.264
simplesol         1.295     1.155     1.529     1.528
simplesolASM      1.185     1.185     1.185     1.186
SchmideSSEASM     0.562     0.562     0.562     0.562
SchmideSSEASMOpt  0.546     0.546     0.546     0.546
cogmanSimplesol   1.295     1.186     1.263     1.248
iCyborgSimplesol  1.217     1.092     1.217     1.232
iCyborgEuclidSol  0.000733  0.000717  0.000718  0.000702
 

iCyborg

Golden Member
Aug 8, 2008
1,363
68
91
Small note: Cyborg's solutions calculation was wrong... it's off by a factor of 10 (an extra zero :p).
There are really 6616 solutions.
I was actually aware of that; I realized it when I increased the number of runs and the number of solutions also increased, and I normalized everything to the original 10 runs. Btw, all times reported are also per 10 runs, not per 1 run :)

As for the MT version, I actually planned to do that; perhaps you can try Scali's suggestion, as I won't be able to until later today.
I was going to divide the work a bit better, though. You have it by n, e.g. for 2 cores it'd be 1-25 and 26-50. The thing is that the inner m and k loops execute a lot more for smaller n's, so it would be better to have something like 1-15/16-50 (and say 7, 10, 16, 17 for 4 cores), as the first thread has lots more to do than the last. It would need some hardcoding, though, and perhaps experimenting, but the overhead does make it quite unlikely that much can be gained, unless the interlocked add helps.
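The uneven split can be sketched by estimating the m-loop trip count for each n using the solver's own limits and cutting ranges where the cumulative work crosses an equal share (`work_for`/`balance_cuts` are hypothetical helpers, not thread code):

```cpp
#include <vector>

// Rough per-n work estimate: number of m values the solver visits for this n.
static int work_for(int n) {
    int lim = (n > 28) ? 2500 / n : 87;    // same m upper bound as the solver
    int trips = (lim - n) / 2;             // rough count of m = n+1, n+3, ...
    return trips > 0 ? trips : 0;
}

// Cut [1, 50) into numThreads ranges of roughly equal total work.
std::vector<int> balance_cuts(int numThreads) {
    int total = 0;
    for (int n = 1; n < 50; ++n) total += work_for(n);

    std::vector<int> cuts;                 // range i is cuts[i] .. cuts[i+1]-1
    cuts.push_back(1);
    int acc = 0, made = 1;
    int target = total / numThreads;
    for (int n = 1; n < 50 && made < numThreads; ++n) {
        acc += work_for(n);
        if (acc >= target) {               // this range has its fair share
            cuts.push_back(n + 1);
            acc = 0;
            ++made;
        }
    }
    cuts.push_back(50);                    // one past the last n
    return cuts;
}
```

For 2 threads this lands the cut near n = 18, close to the 1-15/16-50 split suggested above (the k loop, which this estimate ignores, shifts it a little toward smaller n).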
 