Cogman
Lifer
- Sep 19, 2000
I was actually aware of that; I realized it when I increased the number of runs and the number of solutions increased with them, so I normalized everything to the original 10 runs. BTW, all times reported are also per 10 runs, not per 1 run.
As for the MT version, I actually planned to do that. Perhaps you can try it with Scali's suggestion; I won't be able to until later today.
I was going to divide the work up a bit better, though. You split it by n, e.g. for 2 cores it'd be 1-25 and 26-50. The thing is that those inner m and k loops are executed a lot more for smaller n's, so it would be better to have something like 1-15/16-50 (and say 7, 10, 16, 17 for 4 cores), as otherwise the first thread has a lot more to do than the last. It would need some hardcoding, though, and perhaps some experimenting, but the overhead makes it quite unlikely that much can be gained, unless an interlocked add helps.
Meh, the critical section is only entered once per thread, and critical sections are really only slow if there is a collision (not all that likely here). I did an interlocked add just to make sure and got similar results (OK, it was with wonky GCC inline asm, as MinGW doesn't support the InterlockedAdd function).
With the times at ~2x that of the single-threaded version, I'm going to say that thread overhead is the big killer here. Threading will probably only benefit this if we're finding more solutions or using a crappier algorithm. Perhaps if I'm bored, I'll thread my original solution for the heck of it.
