3DPM: Can we fix it? Yes we can!

Page 2

bronxzv

Senior member
Jun 13, 2011
460
0
71
Also, it seems as though Java's StrictMath (to which Java defaults when Math.sin() and Math.cos() are used) uses a 14-term Taylor series itself . . .

this looks very strange to me, where have you seen this explained ?
 

DrMrLordX

Lifer
Apr 27, 2000
22,965
13,057
136
this looks very strange to me, where have you seen this explained ?

http://stackoverflow.com/questions/25150660/how-does-java-calculates-trigonometry

That's one place where some posters seem to think that StrictMath uses a Taylor series. That information may be inaccurate. What is generally agreed-upon is that StrictMath uses fdlibm.

I have done some accuracy testing on my code and found it to be simply awful. That is what I get for not properly implementing range reduction! Version 662015 is not producing accurate results.

Right now I'm poking around with this stuff:

http://www.research.scea.com/gdc2003/fast-math-functions_p1.pdf
http://www.research.scea.com/gdc2003/fast-math-functions_p2.pdf

and

http://lab.polygonal.de/?p=205
 

cbn

Lifer
Mar 27, 2009
12,968
221
106
How well can we expect 3DPM to scale with very high core counts and multiple sockets (e.g. 4P)?

Looking through AnandTech Bench, it appears to scale well on the 1100T and the i7-5960X, taking into account the 1C turbo on the 1100T and the Hyper-Threading on the i7-5960X.
 

bronxzv

Senior member
Jun 13, 2011
460
0
71
What is generally agreed-upon is that StrictMath uses fdlibm.

FYI a Google search returns this page with the whole source code: http://www.netlib.org/fdlibm/

the source for sin() after range reduction is in the file k_sin.c, and it's based on a 13th-order minimax polynomial approximation (evaluated in Horner form, with coefficients from the Remez algorithm for better precision), not a Taylor series

if I count well, in the general case this implementation uses 10 mul and 8 add/sub; muls and add/subs are well balanced, so this is a nice example to compile for FMA targets
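
in Java terms the kernel boils down to something like this (a sketch only: same structure as k_sin.c, but the coefficients are truncated here, take the exact constants from the Netlib source)

Code:
// degree-13 odd polynomial in Horner form, along the lines of k_sin.c;
// only valid after the argument has been reduced to roughly [-pi/4, pi/4]
// (coefficients truncated for readability, use the exact fdlibm constants)
static double kernelSin(double x) {
    final double S1 = -1.666666666e-01;
    final double S2 =  8.333333333e-03;
    final double S3 = -1.984126982e-04;
    final double S4 =  2.755731370e-06;
    final double S5 = -2.505076025e-08;
    final double S6 =  1.589690995e-10;
    double z = x * x;
    double r = S2 + z * (S3 + z * (S4 + z * (S5 + z * S6)));
    return x + x * z * (S1 + z * r);  // = x + S1*x^3 + S2*x^5 + ... + S6*x^13
}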
 
Last edited:

Essence_of_War

Platinum Member
Feb 21, 2013
2,650
4
81
FYI a Google search returns this page with the whole source code: http://www.netlib.org/fdlibm/

the source for sin() after range reduction is in the file k_sin.c, and it's based on a 13th-order minimax polynomial approximation (evaluated in Horner form, with coefficients from the Remez algorithm for better precision), not a Taylor series

if I count well, in the general case this implementation uses 10 mul and 8 add/sub; muls and add/subs are well balanced, so this is a nice example to compile for FMA targets

The source of k_sin is MUCH more like what I expected a sin function in a math library to be. It has a degree-13 polynomial, but it doesn't blindly apply it: it considers a "small x" case, a sin(x+y) case, and the even/odd symmetries of sin to use a really good polynomial approximation on a more restricted domain.
 

DrMrLordX

Lifer
Apr 27, 2000
22,965
13,057
136
How well can we expect 3DPM to scale with very high core counts and multiple sockets (e.g. 4P)?

Looking through AnandTech Bench, it appears to scale well on the 1100T and the i7-5960X, taking into account the 1C turbo on the 1100T and the Hyper-Threading on the i7-5960X.

Based on the sample snippet from Dr. Cutress, it should scale almost perfectly with core count. There's zero thread interdependency, and it (apparently) spawns 10k different threads, all of which must be completed over the course of the benchmark.

My own attempt "only" spawns 1250 threads (handling 8 particles per thread).
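
The partitioning itself is nothing fancy; roughly this shape (a sketch, with movementLoop() standing in for the real per-particle work):

Code:
// rough shape of the partitioning: one worker thread per block of 8 particles
static void runWorkload(int particles, int perThread) throws InterruptedException {
    java.util.List<Thread> workers = new java.util.ArrayList<>();
    for (int start = 0; start < particles; start += perThread) {
        final int lo = start;
        final int hi = Math.min(start + perThread, particles);
        Thread t = new Thread(() -> movementLoop(lo, hi)); // movementLoop() = stand-in
        workers.add(t);
        t.start();
    }
    for (Thread t : workers)
        t.join(); // wait for every worker before stopping the clock
}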

FYI a Google search returns this page with the whole source code: http://www.netlib.org/fdlibm/

the source for sin() after range reduction is in the file k_sin.c, and it's based on a 13th-order minimax polynomial approximation (evaluated in Horner form, with coefficients from the Remez algorithm for better precision), not a Taylor series

if I count well, in the general case this implementation uses 10 mul and 8 add/sub; muls and add/subs are well balanced, so this is a nice example to compile for FMA targets

Yeah, I found that, I just didn't know what I was looking at when I saw it. When others started saying, "yeah, looks like a Taylor series," I just sort of went with that.

It would be interesting to see how that polynomial performs if I implement it myself. I may try that in a future version.

For now, I have corrected the glaring flaw in version 662015. Here is 692015:

https://www.dropbox.com/s/5k0ckp9ltj2hdmg/3DPMRedux692015.zip?dl=0

With source:

https://www.dropbox.com/s/lysfssq8p7vlwbe/3DPMReduxSource692015.zip?dl=0

I have actually examined the accuracy of my alternate sin/cos methods this time, and found them to be accurate to about 3 decimal places, which is . . . okay, but not great. I think I can do better. I also think I could make the current implementation faster by doing better range reduction to get rid of the branchy nature of the code, and by processing an entire array at once instead of finding the sin/cos of each array entry individually (more SIMD opportunities that way). Given the accuracy, it may not be worth the trouble.
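
What I have in mind is roughly this (a sketch only; polySin() is a stand-in for whichever kernel I end up using):

Code:
// branch-free range reduction to [-pi, pi), applied across a whole array at once
static void sinArray(float[] angles, float[] out) {
    final float TWO_PI = (float) (2.0 * Math.PI);
    final float INV_TWO_PI = (float) (1.0 / (2.0 * Math.PI));
    for (int i = 0; i < angles.length; i++) {
        float x = angles[i];
        float r = x - TWO_PI * (float) Math.floor(x * INV_TWO_PI + 0.5f); // r in [-pi, pi)
        out[i] = polySin(r); // polySin() = stand-in for the polynomial kernel
    }
}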

I also added results recording (the program now creates results.txt, or appends to it if it already exists) and switching between multi-threaded and single-threaded mode, and . . . that's it.
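
The results recording is nothing clever, just an append along these lines (elapsedMillis here standing in for whatever the timer measured):

Code:
// open results.txt in append mode: created if missing, appended to if present
try (java.io.PrintWriter out = new java.io.PrintWriter(
        new java.io.FileWriter("results.txt", true))) { // true = append
    out.println("Build: 692015");
    out.println("It took " + elapsedMillis + " milliseconds to complete the workload.");
} catch (java.io.IOException e) {
    e.printStackTrace();
}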

More to come as time permits. Thanks for all the feedback. Now that the sin/cos implementations aren't complete garbage, maybe somebody will start reporting some completion times? If you do decide to run the program, please consider also reporting your results from Dr. Cutress' 3DPM (Stage 1 in particular) so we can compare and contrast.

edit: here's some of my results

A10-7700k: 3.4 GHz (turbo disabled) / 1800 MHz NB / DDR3-1600 9-11-12-32 1T
Three runs:

Build: 692015
102400 steps completed.
Mode: multi-threaded

It took 17509 milliseconds to complete the workload.
It took 60 milliseconds to write the output file.

It took 17509 milliseconds to complete the workload.
It took 23 milliseconds to write the output file.

It took 17112 milliseconds to complete the workload.
It took 38 milliseconds to write the output file.

3DPM original, Stage 1:
Three runs:

41.993
43.0722
43.028

Later on, I'll crank it up to 4.7 GHz and post new results.
 
Last edited:

Enigmoid

Platinum Member
Sep 27, 2012
2,907
31
91
3630qm @ 3.2 ghz

Multithreaded

Build: 692015
1024 steps completed.

It took 252 milliseconds to complete the workload.
It took 37 milliseconds to write the output file.

Build: 692015
10240 steps completed.

It took 1378 milliseconds to complete the workload.
It took 20 milliseconds to write the output file.

Build: 692015
102400 steps completed.

It took 13687 milliseconds to complete the workload.
It took 25 milliseconds to write the output file.

Build: 692015
1024000 steps completed.

It took 131440 milliseconds to complete the workload.
It took 19 milliseconds to write the output file.
Singlethread

Build: 692015
1024 steps completed.

It took 719 milliseconds to complete the workload.
It took 51 milliseconds to write the output file.

Build: 692015
10240 steps completed.

It took 6012 milliseconds to complete the workload.
It took 19 milliseconds to write the output file.

Build: 692015
102400 steps completed.

It took 59810 milliseconds to complete the workload.
It took 16 milliseconds to write the output file.

Build: 692015
1024000 steps completed.

It took 598482 milliseconds to complete the workload.
It took 15 milliseconds to write the output file.
Scaling

1024 - 1.91x
10240 - 4.36x
102400 - 4.37x
1024000 - 4.55x

Using option 4 (MT, 1,024,000 steps):

10,000 particles x 1,024,000 steps = 1.024 x 10^10 particle-steps; 1.024 x 10^10 / 131.4 sec ≈ 78M steps per second

Cutress' MT algorithm gives ~ 547M steps/sec total with

<s1>142.4068</s1>
<s2>163.9074</s2>
<s3>102.9352</s3>
<s4>65.1068</s4>
<s5>46.2636</s5>
<s6>30.6991</s6>
<MT>1</MT>
So, assuming both codes are accurate, a direct comparison against method 1 suggests the new fixed method is running at just over half the speed.

On the A10 there is a significant improvement (42M/s -> 58M/s), but on the i7 the implementation seems borked. Perhaps because the compiler settings are not well tuned.

The Java code also has about 8x the memory footprint: it went from ~1.7 MB to ~13-14 MB.

As it stands, on the tested platform, it's a significant step backwards.
 
Last edited:

bronxzv

Senior member
Jun 13, 2011
460
0
71
I have actually examined the accuracy of my alternate sin/cos methods this time, and found them to be accurate to about 3 decimal places, which is . . . okay, but not great.

if such low accuracy is OK for you, I'd suggest using a 1024-entry lookup table instead; it's probably too low precision for the problem at hand, though
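
something along these lines, i.e. nearest-entry lookup with no interpolation (names only for illustration)

Code:
// 1024-entry sine table covering one period, nearest-entry lookup (no interpolation)
static final int N = 1024;
static final float[] SIN_TABLE = new float[N];
static {
    for (int i = 0; i < N; i++)
        SIN_TABLE[i] = (float) Math.sin(2.0 * Math.PI * i / N);
}
static float lutSin(float angle) {                  // angle in radians
    int i = (int) (angle * (N / (2.0 * Math.PI)));  // scale one period to N entries
    return SIN_TABLE[i & (N - 1)];                  // mask wraps the index since N is a power of two
}
static float lutCos(float angle) {
    return lutSin(angle + (float) (Math.PI / 2.0)); // cos(x) = sin(x + pi/2)
}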

anyway, without access to the original 3DPM source and some checksums/results to compare against your version, comparing timings is a worthless exercise
 
Last edited:

DrMrLordX

Lifer
Apr 27, 2000
22,965
13,057
136
On the A10 there is a significant improvement (42M/s -> 58M/s), but on the i7 the implementation seems borked. Perhaps because the compiler settings are not well tuned.

Hmm! That's interesting. I'm not sure why we'd see an improvement on Kaveri but no improvement on Ivy Bridge. That is, assuming Dr. Cutress' 3DPM is reporting steps/second, and I'm not really sure it's doing that?

edit: The performance dropoff on the 3630qm here might have something to do with L2 cache size . . . the 3630qm "only" has 1 MB of L2, while the 7700k has 4 MB. Also, right now, I'm not using any JVM switches, so all compiler/JVM settings are at their defaults for a 64-bit machine.

The Java code also has about 8x the memory footprint: it went from ~1.7 MB to ~13-14 MB.

Java is a memory hog. You pay a hefty price just to run something in the JVM. Or, at least, that's my experience with it.

As it stands, on the tested platform, it's a significant step backwards.

Thanks for taking the time to run it, though. I was hoping to see it run on an Ivy Bridge.

It's hard to sort out an exact comparison between the data from my program and Dr. Cutress' since we don't know what the reported number from 3DPM really means. Steps/second seems like a rational interpretation. I'm going to tentatively agree with bronxzv here by saying that a direct comparison between the two is difficult.

What I did want to see was a comparative between an Ivy and Kaveri on both benchmarks so we could see relative performance between the two.

In 3DPM:

7700k: ~43
3630qm: ~142
performance delta: 3630qm is ~3.3x as fast

In 3DPMRedux 692015:

7700k: 17509ms (most common data point)
3630qm: 13687ms
performance delta: 3630qm is ~28% faster

There's still more I can do to improve performance and accuracy, so I'm going to do that before I try to draw too many conclusions. Eventually it may require a shift to C/C++.

if such low accuracy is OK for you, I'd suggest using a 1024-entry lookup table instead; it's probably too low precision for the problem at hand, though

No, I don't consider the current implementation's precision to be very good. It's a start, but it needs more work.

I considered lookup tables. Since the original implementation is meant to be an fp benchmark, it would probably not be a good idea to switch to lookup tables.

anyway, without access to the original 3DPM source and some checksums/results to compare against your version, comparing timings is a worthless exercise

I'm going to tentatively agree with you here. The only thing we can do is examine relative performance. Until my own attempt is of better quality, though, such examinations need to be taken with a grain of salt.
 
Last edited:

Enigmoid

Platinum Member
Sep 27, 2012
2,907
31
91
Hmm! That's interesting. I'm not sure why we'd see an improvement on Kaveri but no improvement on Ivy Bridge. That is, assuming Dr. Cutress' 3DPM is reporting steps/second, and I'm not really sure it's doing that?

edit: The performance dropoff on the 3630qm here might have something to do with L2 cache size . . . the 3630qm "only" has 1 MB of L2, while the 7700k has 4 MB. Also, right now, I'm not using any JVM switches, so all compiler/JVM settings are at their defaults for a 64-bit machine.

http://www.anandtech.com/show/6533/...essor-motherboard-through-a-scientists-eyes/9

I'm taking the scores from here.

Results are then expressed in the form of million particle movements per second.

Java is a memory hog. You pay a hefty price just to run something in the JVM. Or, at least, that's my experience with it.

There's still more I can do to improve performance and accuracy, so I'm going to do that before I try to draw too many conclusions. Eventually it may require a shift to C/C++.

No, I don't consider the current implementation's precision to be very good. It's a start, but it needs more work.

I think a C implementation is the way to go, simply because Java introduces a ton of overhead (which is why a lot of scientific computation is done in C/C++, Fortran, or Python) and is pretty slow.

If I had to comment I would say that the error in the trig functions is problematic.

I also doubt it's the L2 on the 3630qm, as Ivy Bridge's L3 is faster than Kaveri's L2.
 

DrMrLordX

Lifer
Apr 27, 2000
22,965
13,057
136
Okay, so it IS millions of particle movements per second. That clarifies things a bit. In fact, we could approximate more of the stages of 3DPM using that information. Thanks for linking to that page.

Hmm, he isn't using standard random number generation either . . . might be time to use someone's Mersenne Twister implementation, then.
 

Idontcare

Elite Member
Oct 10, 1999
21,110
64
91
If I had to comment I would say that the error in the trig functions is problematic.

It is. In computational chemistry, the goal is not to get to the wrong answer as quickly as possible. I got my PhD in comp chem/quantum chem, and to be sure, everyone wants to arrive at a usable (low-error) answer.
 

Enigmoid

Platinum Member
Sep 27, 2012
2,907
31
91
It is. In computational chemistry, the goal is not to get to the wrong answer as quickly as possible. I got my PhD in comp chem/quantum chem, and to be sure, everyone wants to arrive at a usable (low-error) answer.

This. I agree 100%. I didn't want to say anything but I guess I will add something now.

For scientific computing it really doesn't matter how fast the program runs if it doesn't give the right result. Accuracy to three decimal places is pretty much garbage unless you can justify it, or unless the problem/program has intrinsic error (i.e. you are assuming or leaving out effects that are expected to have large contributions to the final answer and that you cannot, or for some justified reason do not, include). The problem with accuracy to three decimal places is that the accumulated error leaves you with something on the order of 2 decimal places, and perhaps worse if the error is systematically higher (or lower) than the correct result. Errors also matter more for small values: 0.523 +/- 0.002 is much better, relatively speaking, than 0.078 +/- 0.002, and things get really troublesome if a value is being squared (1.02^2 = 1.0404, so a 2% error becomes roughly a 4% error).

Right now I'm doing something similar for my degree (stochastic processes in physics) and my advisor is pretty much: "code in doubles and be as accurate as possible, try to aim for at least 5-7 significant digits". My case is significantly more troublesome than 3DPM because small errors cause massive decay in the system as I iterate over millions of steps.

However, for something like 3DPM I would expect the random number generator to introduce so much 'fuzz' that this kind of accuracy is not really needed (the 'randomness' simply fuzzes everything up, and you run the simulation thousands of times and average everything out). The one thing to really worry about is systematic rounding up or down by the trig functions (values consistently larger or smaller than they should be). You should state and justify the trig approximation methods if you use them.
 

videogames101

Diamond Member
Aug 24, 2005
6,783
27
91
Aren't LUTs the real solution for accuracy without consuming compute? Of course, it costs memory and I/O rather than compute. Not speaking about overall precision, just the trig approximations.
 

DrMrLordX

Lifer
Apr 27, 2000
22,965
13,057
136
Look-up tables were a more desirable solution back when processor speed did not so far exceed memory I/O. Some are in the habit of saying that look-up tables are "so 1985".

Going with a look-up table would take the benchmark further away from being an fp bench and more towards a cache/memory benchmark, which is the opposite of what I'm trying to do anyway. That's also why I haven't (yet) pre-generated random values and passed them along prior to starting the timer, though Dr. Cutress did mention that as being a valid technique in his lengthy commentary concerning 3DPM from 2013. I may consider it in the future as a cheap way to boost performance. 3DPMRedux is already a memory hog.

On the plus side, I was able to reintroduce my Taylor series solution from 662015 and add range reduction which made it more accurate than the solution I used in 692015. The negative consequence of my branchy range reduction code is slower performance. Instead of completing the benchmark at setting #3 in ~17500 ms, it's now closer to 22000 ms.

It's still more than twice as fast as using the standard Java API Math.sin/cos methods (~59000 ms on my machine), mostly because Java insists on working with doubles instead of floats and provides no alternate methods. And that really cuts to the core of why it is necessary to replace the standard Math library methods for sin and cos (and eventually sqrt): not because it's wise to do something inaccurate as quickly as possible, but because the current specification of 3DPM is to use 32-bit floating-point values (which bring with them their own degree of error). There's no sane reason to use doubles in CPU-intensive calculations to produce floats, unless you just like watching stuff be unnecessarily slow.

Regardless, the existing solution appears to be accurate to five decimal places in the worst case. That's better than before. I'm still not happy enough with it to sign off on a new build.

edit: I squeezed a little extra performance out by switching to a minimax-style polynomial (same form as the Taylor series, but with tweaked coefficients):

sin(x) ≈ x - .16666x^3 + .0083143x^5 - .00018542x^7

The error there is the same as with my Taylor series solution (five decimal places in the worst case), but performance jumped a bit, so I can now clear 102400 steps in ~19400 ms. Still slower than 692015, but it's an improvement.
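
In code that's just the Horner form of the polynomial above (a sketch; it assumes the argument has already been reduced to roughly [-pi/2, pi/2]):

Code:
// the polynomial above in Horner form; only meaningful after range reduction
static float fastSin(float x) {
    float x2 = x * x;
    return x * (1.0f + x2 * (-0.16666f + x2 * (0.0083143f - 0.00018542f * x2)));
}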
 
Last edited:

bronxzv

Senior member
Jun 13, 2011
460
0
71
Look-up tables were a more desirable solution back when processor speed did not so far exceed memory I/O.

note that small LUTs are typically used for the polynomial coefficients in piecewise approximations; this is of no use for smooth functions like sin/cos, though

That's also why I haven't (yet) pre-generated random values and passed them along prior to starting

in this case (pre-generated random values) you can simply store precomputed sin(alpha) and cos(alpha) values, since alpha depends on a single random source
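
i.e. something like this (a sketch, assuming one random alpha per step; "steps" is just a placeholder for however many values you need)

Code:
// precompute the (sin, cos) pairs once, outside the timed section
float[] sinA = new float[steps];
float[] cosA = new float[steps];
java.util.Random rng = new java.util.Random();
for (int i = 0; i < steps; i++) {
    double alpha = 2.0 * Math.PI * rng.nextDouble();
    sinA[i] = (float) Math.sin(alpha);
    cosA[i] = (float) Math.cos(alpha);
}
// the timed loop then just reads sinA[i] / cosA[i]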
 

DrMrLordX

Lifer
Apr 27, 2000
22,965
13,057
136
in this case (pre-generated random values) you can simply store precomputed sin(alpha) and cos(alpha) values, since alpha depends on a single random source

Tempting. That would feel like gutting the entire benchmark. Still tempting.

Regardless, I'm going to go ahead and release build 6112015:

https://www.dropbox.com/s/ro1wyazng8i6qwq/3DPMRedux6112015.zip?dl=0

source:
https://www.dropbox.com/s/qo06kpljlbom6u0/3DPMReduxSource6112015.zip?dl=0

Changes include: better accuracy (5+ decimal places of precision), slightly worse performance (10-15% slower), and particle z values that are now positive-only (reading Dr. Cutress' 2013 entry about the bench convinced me that negative z values may be effectively meaningless for what the bench is intended to represent). According to my analysis, the biggest hit to performance is coming from Random.nextFloat(), so I'm looking at replacing it with a Mersenne Twister implementation:

http://cs.gmu.edu/~sean/research/mersenne/MersenneTwisterFast.java

I had considered switching to an alternate method for sqrt calculation, but even removing Math.sqrt() completely, without replacing it with anything else, only sped up selection 3 by ~500 ms. It doesn't seem worth the effort.
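
The swap itself should be mechanical, assuming the linked class keeps the java.util.Random-style method names (it provides a nextFloat()):

Code:
// intended swap, assuming MersenneTwisterFast's Random-like interface;
// it is not thread-safe, so each worker thread gets its own instance
MersenneTwisterFast rng = new MersenneTwisterFast(System.nanoTime());
float r1 = rng.nextFloat();  // in place of Random.nextFloat()
float r2 = rng.nextFloat();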
 

Essence_of_War

Platinum Member
Feb 21, 2013
2,650
4
81
Why not just pre-compute the random angles? You could even generate a handful and re-use them if you're worried about the memory footprint.

If a dominant part of the benchmark is generating random numbers, not pushing the particles, that seems like it's something you want to eliminate.
 

DrMrLordX

Lifer
Apr 27, 2000
22,965
13,057
136
Why not just pre-compute the random angles? You could even generate a handful and re-use them if you're worried about the memory footprint.

If a dominant part of the benchmark is generating random numbers, not pushing the particles, that seems like it's something you want to eliminate.

It's a question of whether or not the random angles I pre-computed would follow the same distribution that I'm getting now (which is what 3DPM produces). I could just spit out a bunch of random numbers from -1 to 1 and use those in lieu of the outputs of sin/cos of alpha[], but the relationship between those values would change unless I made sure the random values followed sine/cosine curves, and the best way to do that would be to use those functions during the pre-compute segment of the benchmark.

At that point, I would have moved all the trig out of the benchmark itself and into the initialization sequence. Outside of tracking z movement, that's the majority of the benchmark itself. It would certainly be faster in terms of the reported times.

It might be more fair to pre-generate the random values in randomone and randomtwo and keep all the trig in the main benchmark.
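
Roughly: fill the arrays before the clock starts and leave the trig where it is (a sketch; the array size is a placeholder):

Code:
// pre-generate the raw random values outside the timed section,
// keep the sin/cos work inside the benchmark loop
float[] randomone = new float[steps];   // "steps" = placeholder for the real count
float[] randomtwo = new float[steps];
java.util.Random rng = new java.util.Random();
for (int i = 0; i < steps; i++) {
    randomone[i] = rng.nextFloat();
    randomtwo[i] = rng.nextFloat();
}
// timed section: derive the angle from randomone[i]/randomtwo[i] and do the trig there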
 

Essence_of_War

Platinum Member
Feb 21, 2013
2,650
4
81
I could just spit out a bunch of random numbers from -1 to 1 and use those in lieu of the outputs of sin/cos of alpha[], but the relationship between those values would change unless I made sure the random values followed sine/cosine curves, and the best way to do that would be to use those functions during the pre-compute segment of the benchmark.

Yeah, I mean, you certainly can't draw from a rectangular distribution and expect to get a spherical distribution out.
 

bronxzv

Senior member
Jun 13, 2011
460
0
71
Yeah, I mean, you certainly can't draw from a rectangular distribution and expect to get a spherical distribution out.

note that the original pseudo code isn't a proper spherical distribution: the delta-Z component is chosen over a uniform [-1.0, 1.0] range, so the elevation angle (aka "beta" in the literature) distribution isn't uniform
 

bronxzv

Senior member
Jun 13, 2011
460
0
71
It might be more fair to pre-generate the random values in randomone and randomtwo and keep all the trig in the main benchmark.

one may also argue that it's not fair to write the code with 8x unrolling by hand as you do; how much performance does that buy you, btw?
 

Essence_of_War

Platinum Member
Feb 21, 2013
2,650
4
81
That seems strange too... maybe I'm misunderstanding something, but if you're simulating something like Brownian motion of thermal particles, don't you want their velocities to be drawn from a 3D Maxwellian (or equivalently 3x 1D Maxwellians, whatever, it gives the same result), not three flat rectangular distributions?

I guess it doesn't matter if you're not trying to use the code to do anything but push imaginary particles, but this is the sort of thing people actually think about.