Also, it seems as though Java's StrictMath (to which Java defaults when Math.sin() and Math.cos() are used) uses a 14-term Taylor series itself . . .
this looks very strange to me, where have you seen this explained?
What is generally agreed-upon is that StrictMath uses fdlibm.
FYI, a Google search returns this page with the whole source code: http://www.netlib.org/fdlibm/
the source for sin() after range reduction is in the file k_sin.c, and it's based on a 13th-order minimax polynomial approximation (using Horner's form of the polynomial and the Remez algorithm for better precision), not a Taylor series
if I count correctly, in the general case this implementation uses 10 mul and 8 add/sub; muls and add/subs are well balanced, so this is a nice example to compile for FMA targets
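For anyone curious what that structure looks like, here's a minimal Java sketch of a k_sin-style kernel: an odd polynomial of degree 13 evaluated in Horner form on the reduced argument. The coefficients below are just the plain Taylor values for illustration; fdlibm's actual minimax (Remez) coefficients differ slightly and are listed in k_sin.c at the link above, and fdlibm's general-case path also carries a correction term for the low part of the reduced argument, which accounts for the extra mul/adds.

```java
// Minimal sketch of a k_sin-style kernel: an odd polynomial of degree 13
// evaluated in Horner form over the reduced argument |x| <= pi/4.
// NOTE: the coefficients below are the plain Taylor values (-1/3!, 1/5!, ...)
// used purely for illustration; fdlibm's k_sin.c uses slightly different
// minimax (Remez) coefficients for better worst-case error.
public final class SinKernelSketch {
    private static final double S1 = -1.0 / 6.0;          // -1/3!
    private static final double S2 =  1.0 / 120.0;        //  1/5!
    private static final double S3 = -1.0 / 5040.0;       // -1/7!
    private static final double S4 =  1.0 / 362880.0;     //  1/9!
    private static final double S5 = -1.0 / 39916800.0;   // -1/11!
    private static final double S6 =  1.0 / 6227020800.0; //  1/13!

    /** sin(x) for |x| <= pi/4 (range reduction is assumed to happen elsewhere). */
    static double kernelSin(double x) {
        double z = x * x;          // 1 mul
        double v = z * x;          // 1 mul
        double r = S2 + z * (S3 + z * (S4 + z * (S5 + z * S6))); // 4 mul, 4 add
        return x + v * (S1 + z * r);                             // 2 mul, 2 add
    }

    public static void main(String[] args) {
        double x = 0.5;
        System.out.println(kernelSin(x) + " vs " + StrictMath.sin(x));
    }
}
```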
How well can we expect 3DPM to scale with very high core counts and multiple sockets (e.g., 4P)?
Looking through AnandTech Bench, it appears to scale well on the 1100T and the i7-5960X, taking into account the 1C turbo on the 1100T and the Hyper-Threading on the i7-5960X.
Singlethread
Build: 692015
1024 steps completed.
It took 252 milliseconds to complete the workload.
It took 37 milliseconds to write the output file.
Build: 692015
10240 steps completed.
It took 1378 milliseconds to complete the workload.
It took 20 milliseconds to write the output file.
Build: 692015
102400 steps completed.
It took 13687 milliseconds to complete the workload.
It took 25 milliseconds to write the output file.
Build: 692015
1024000 steps completed.
It took 131440 milliseconds to complete the workload.
It took 19 milliseconds to write the output file.
Scaling
Build: 692015
1024 steps completed.
It took 719 milliseconds to complete the workload.
It took 51 milliseconds to write the output file.
Build: 692015
10240 steps completed.
It took 6012 milliseconds to complete the workload.
It took 19 milliseconds to write the output file.
Build: 692015
102400 steps completed.
It took 59810 milliseconds to complete the workload.
It took 16 milliseconds to write the output file.
Build: 692015
1024000 steps completed.
It took 598482 milliseconds to complete the workload.
It took 15 milliseconds to write the output file.
So, assuming that the codes are accurate, a direct comparison of method 1 means that the new fixed method is running at just over half the speed.

<s1>142.4068</s1>
<s2>163.9074</s2>
<s3>102.9352</s3>
<s4>65.1068</s4>
<s5>46.2636</s5>
<s6>30.6991</s6>
<MT>1</MT>
I have actually examined the accuracy of my alternate sin/cos methods this time, and found them to be accurate to about 3 decimal places, which is . . . okay, but not great.
On the A10 there is a significant improvement (42 M/s -> 58 M/s), but on the i7 the implementation seems borked. Perhaps because the compiler settings are not tuned well.
The code using Java has 8x the memory footprint; it went from something like ~1.7 MB to ~13-14 MB.
As it stands, on the tested platform, it's a significant step backwards.
if such low accuracy is OK for you, I'd suggest using a 1024-entry lookup table instead; it's probably too low a precision for the problem at hand, though
anyway, without access to the original 3DPM source and some checksums/results to compare with your version, comparing timings is a worthless exercise
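For reference, a sketch of what such a table could look like in Java (the class and method names are mine, not from any 3DPM source). With 1024 slots covering [0, 2*pi), the worst-case angular error is about half a slot, ~0.003 rad, which is in the same ballpark as the "about 3 decimal places" accuracy mentioned above.

```java
// Hypothetical 1024-entry sine/cosine lookup table of the kind suggested above.
// The angle is quantized to the nearest of 1024 slots covering [0, 2*pi), so the
// worst-case angular error is about half a slot (~0.003 rad), which in turn bounds
// the error of the returned sin/cos values to roughly the same magnitude.
public final class TrigTable {
    private static final int SIZE = 1024;
    private static final double TWO_PI = 2.0 * Math.PI;
    private static final double[] SIN = new double[SIZE];
    private static final double[] COS = new double[SIZE];

    static {
        for (int i = 0; i < SIZE; i++) {
            double angle = TWO_PI * i / SIZE;
            SIN[i] = Math.sin(angle);
            COS[i] = Math.cos(angle);
        }
    }

    private static int index(double angle) {
        // Map the angle onto [0, SIZE) and round to the nearest slot.
        double t = angle / TWO_PI;
        t -= Math.floor(t);                          // fractional turns in [0, 1)
        return (int) Math.round(t * SIZE) & (SIZE - 1);
    }

    static double sin(double angle) { return SIN[index(angle)]; }
    static double cos(double angle) { return COS[index(angle)]; }

    public static void main(String[] args) {
        double a = 1.2345;
        System.out.println(sin(a) + " vs " + Math.sin(a));
        System.out.println(cos(a) + " vs " + Math.cos(a));
    }
}
```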
Hmm! That's interesting. I'm not sure why we'd see an improvement on Kaveri but no improvement on Ivy Bridge. That is, assuming Dr. Cutress' 3DPM is reporting steps/second, and I'm not really sure it's doing that?
edit: The performance dropoff on the 3630qm here might have something to do with L2 cache size . . . the 3630qm "only" has 1 MB of L2, while the 7700k has 4 MB of L2. Also, right now, I'm not using any JVM switches, so all compiler/JVM settings are running at their defaults for a 64-bit machine.
Results are then expressed in the form of million particle movements per second.
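To make that unit concrete, here's a small, hypothetical Java conversion from a run's step count and elapsed time into a millions-of-movements-per-second figure. The particle count is a placeholder, since the number of particles 3DPM actually simulates per step isn't stated in this thread; the step count and time are taken from the 102400-step single-thread run posted above.

```java
// Illustrative conversion of a run into a "million particle movements per
// second" figure as described above. The particle count (10,000) is just a
// placeholder; the actual number 3DPM uses is not given in this thread.
public final class MovementsPerSecond {
    public static void main(String[] args) {
        long particles = 10_000;       // assumed particle count (placeholder)
        long steps = 102_400;          // steps completed in the run
        double elapsedMs = 13_687.0;   // elapsed time reported for the run
        double movementsPerSec = particles * (double) steps / (elapsedMs / 1000.0);
        System.out.printf("%.2f million movements/s%n", movementsPerSec / 1e6);
    }
}
```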
Java is a memory hog. You pay a hefty price just to run something in the JVM. Or, at least, that's my experience with it.
Thanks for taking the time to run it, though. I was hoping to see it run on an Ivy Bridge.
It's hard to sort out an exact comparison between the data from my program and Dr. Cutress' since we don't know what the reported number from 3DPM really means. Steps/second seems like a rational interpretation. I'm going to tentatively agree with bronxzv here by saying that a direct comparison between the two is difficult.
What I did want to see was a comparative between an Ivy and Kaveri on both benchmarks so we could see relative performance between the two.
In 3DPM (score, higher is better):
7700k: ~43
3630qm: ~142
performance delta: the 3630qm is ~3.3x as fast
In 3DPMRedux 692015 (time in ms, lower is better):
7700k: 17509 ms (most common data point)
3630qm: 13687 ms
performance delta: the 3630qm is ~28% faster
There's still more I can do to improve performance and accuracy, so I'm going to do that before I try to draw too many conclusions. Eventually it may require a shift to C/C++.
No, I don't consider the current implementation's precision to be very good. It's a start, but it needs more work.
I considered lookup tables. Since the original implementation is meant to be an fp benchmark, it would probably not be a good idea to switch to lookup tables.
I'm going to tentatively agree with you here. The only thing we can do is examine relative performance. Until my own attempt is of better quality, though, such examinations need to be taken with a grain of salt.
If I had to comment I would say that the error in the trig functions is problematic.
It is. In computational chemistry the goal is not to get to the wrong answer as quickly as possible. I got my PhD in comp chem/quantum chem, and, to be sure, everyone wants to arrive at a usable (low-error) answer.
Look-up tables were a more desirable solution back when processor speed did not so far exceed memory I/O.
That's also why I haven't (yet) pre-generated random values and passed them along prior to starting.
in this case (pre-generated random values) you can simply store precomputed sin(alpha)/cos(alpha) values, since alpha depends on a single random source
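A rough Java sketch of that idea (all class, field, and method names here are hypothetical, not taken from the original 3DPM or 3DPMRedux source): the angles are drawn once from a single random source up front, and the particle loop then just indexes into the two arrays.

```java
import java.util.Random;

// Sketch of the suggestion above: because each alpha is derived from a single
// random draw, sin(alpha) and cos(alpha) can be precomputed once and reused in
// the particle-movement loop. All names here are hypothetical.
public final class PrecomputedAngles {
    final double[] sinAlpha;
    final double[] cosAlpha;

    PrecomputedAngles(int count, long seed) {
        Random rng = new Random(seed);
        sinAlpha = new double[count];
        cosAlpha = new double[count];
        for (int i = 0; i < count; i++) {
            double alpha = 2.0 * Math.PI * rng.nextDouble(); // one draw per angle
            sinAlpha[i] = Math.sin(alpha);
            cosAlpha[i] = Math.cos(alpha);
        }
    }

    public static void main(String[] args) {
        PrecomputedAngles table = new PrecomputedAngles(4096, 42L);
        // Inside the benchmark loop, an index into the table replaces the
        // per-step sin/cos calls:
        int i = 123;
        System.out.println(table.sinAlpha[i] + ", " + table.cosAlpha[i]);
    }
}
```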
Why not just pre-compute the random angles? You could even generate a handful and reuse them if you're worried about memory footprint?
If a dominant part of the benchmark is generating random numbers, not pushing the particles, that seems like it's something you want to eliminate.
I could just spit out a bunch of random numbers from -1 to 1 and use those in lieu of the outputs of sin/cos alpha[], but the relationship between those values would change unless I made sure the random values followed sine/cosine curves, and the best way to do that would be to use those functions during the pre-compute segment of the benchmark.
Yeah, I mean, you certainly can't draw from a rectangular distribution and expect to get a spherical distribution out.
It might be more fair to pre-generate the random values in randomone and randomtwo and keep all the trig in the main benchmark.
not three flat rectangular distributions?
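To illustrate the distribution point, here's a small self-contained Java sketch. It pre-generates only the raw random numbers (the array names echo the randomone/randomtwo idea above and are otherwise hypothetical) and keeps the trig in the loop, drawing cos(theta) uniformly in [-1, 1] and phi uniformly in [0, 2*pi) so the resulting directions are uniform over the sphere, which three independent flat values would not give you.

```java
import java.util.Random;

// Sketch of the distribution point above: three independent flat random values
// do NOT give uniformly distributed directions, but drawing cos(theta) uniformly
// in [-1, 1] and phi uniformly in [0, 2*pi) does. Only the raw random numbers are
// pre-generated (array names are hypothetical); the trig stays in the main loop.
public final class SphereDirections {
    public static void main(String[] args) {
        Random rng = new Random(42L);
        int n = 100_000;
        double[] randomOne = new double[n]; // pre-generated, later mapped to cos(theta)
        double[] randomTwo = new double[n]; // pre-generated, later mapped to phi
        for (int i = 0; i < n; i++) {
            randomOne[i] = rng.nextDouble();
            randomTwo[i] = rng.nextDouble();
        }

        double sumX = 0.0, sumY = 0.0, sumZ = 0.0;
        for (int i = 0; i < n; i++) {
            double cosTheta = 2.0 * randomOne[i] - 1.0;       // uniform in [-1, 1]
            double sinTheta = Math.sqrt(1.0 - cosTheta * cosTheta);
            double phi = 2.0 * Math.PI * randomTwo[i];        // uniform in [0, 2*pi)
            sumX += sinTheta * Math.cos(phi);
            sumY += sinTheta * Math.sin(phi);
            sumZ += cosTheta;
        }
        // For a uniform spherical distribution all three means should be near zero.
        System.out.printf("mean direction: (%.4f, %.4f, %.4f)%n",
                sumX / n, sumY / n, sumZ / n);
    }
}
```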
