HAY GUYZ!
Wow this thread is fucking epic. I don't care about Windows in ASM, but generating pythagorean triples quickly does pique my curiosity! (This is how we know I'm an academic...) Anyway, I don't have time to implement this right now, but I'm pretty sure I have a faster algorithm than anything proposed so far. (Not to mention that I'm too much of a noob to hand-code anything in ASM, much less SSE in ASM. I can barely do SSE with intrinsics... typically rely on ICC to save me. Hats off to Schmide for being awesome.)
Basically, looping a & b is wasteful. If I generate (3,4,5) then I know many more triples... like (6,8,10), (9,12,15), etc. So why not only generate "primitive" triples (triples that are not [integer >1] multiples of other triples) and then scale them up?
How do you generate all primitive triples? That's easy; we'll do it by construction. Take two positive integers, (m,n):
(m^2 - n^2)^2 + (2*m*n)^2 = (m^2+n^2)^2
Then the triple is: (m^2-n^2, 2*m*n, m^2+n^2).
So the idea is to generate a primitive triple. Then write down all the multiples of that triple that are within the size-constraint.
Implementation-wise... we know m>n (otherwise m^2-n^2 is 0 or less). So your loop should start n at 1 and m at n+1=2. (Which will give (3,4,5)). Then keep incrementing m until the maximum leg-length is reached, and start over at the next n value. For a given n value, it's easy to figure out how high m can go before the OP's bound is exceeded. (He wanted triples where one leg is at most 10,000??)
So:
for(n=1; n<too_lazy_to_compute; n++){
for(m=n+1, m<_too_lazy_again; m++){
}
}
Every (m,n) pair generates a valid triple: (a,b,c). Identify whether a or b is bigger. (I bet if you're smart about it, you can know this without checking explicitly.) Divide max-leg-length by the larger of a,b. Say this gives me K possibilities... like with (3,4,5) and a max-leg-length of 10,000, I would now want to output (2*3,2*4,2*5), (3*3, 3*4, 4*5), ... , (2500*3, 2500*4, 2500*5).
K, I'm done with (3,4,5) (m=2,n=1), so move on to (m=3,n=1), rinse and repeat.
Though I'm not sure how to code this thing to be a fair comparison with the previous examples. My version never involves checking the validity of my triples: they're all valid and exhaustive by the construction of the (m,n) formula. To prevent the compiler from ignoring code, probably the easiest way would be to also output the total number of triples found. Or perhaps output the final state of some inner-loop temp quantity.
This should be WAY faster:
1) No floating point arithmetic, no sqrts. With proper loop unrolls, SSE should really rock some socks here b/c we only need 32 bit unsigned integers and we can run wild with those new 256-bit wide integer registers (errr... maybe that's not out yet, I forgot). All we need to do are integer multiplies (most of the work), integer adds, and the occasional division (to figure out how many non-primitive triples must be generated from each primitive triple).
2) Iterating over far fewer numbers, at least in the (m,n) loops. Somewhere btwn half as many and sqrt as many outer loop iterations. Like suppose I want to count all triples involving 11. 2mn ever equals 11, so we need all solutions for m^2 - n^2 = 11, s.t. m>n >=1. Only choice is m=6,n=5: (11,60,61). Voila, all triples with a=11 without checking b=1..59,61..10000.
3) No unpredictable if-checks.
Edit: NUTS, iCyborg already got it. I didn't read the whole thread before I started thinking about how to design a better algorithm. Apparently it's been known to mankind for a few thousand years! Haha, well played Euclid.
Edit2: Didn't consider uniqueness, whoops. Seems like a hash table would get the job done better than a gigantic array. Also I believe you only need to check the uniqueness of triples generated with the base (m,n). No need to check the multiples, because this algorithm only generates primitive triples. And it would be impossible to have two primitive triples (a,b,c) and (x,y,z) such that (Na,Nb,Nc) = (Mx,My,Mz), as this would imply one triple is non-primitive. I think anyway. Indexing on (m,n) only should knock your table size down into the hundreds of thousands.