3DPM: Can we fix it? Yes we can!

DrMrLordX

Lifer
Apr 27, 2000
21,629
10,841
136
Cue up Bob the Builder intro music.

Some of you may remember this thread in which Ian Cutress showed some code which is allegedly representative of what is in his 3DPM benchmark. Subsequent posts by bronxzv and JoeRambo go into some detail as to why the code is perhaps less-than-great. We still haven't gotten solid answers as to whether or not the actual code in the 3DPM benchmark is just as flawed, but results seem to indicate that the false cache line sharing issue pointed out by bronxzv in post 226 is in full effect. I expect that 3DPM could be made faster on Intel processors and considerably faster on AMD processors by restructuring the program as recommended in post 226 (or something similar).
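Without the real source we can only sketch the fix bronxzv proposed: assign each worker thread a contiguous chunk of particles instead of interleaving them, so no two threads write into the same cache line. A toy illustration (names and structure hypothetical, not from 3DPM itself):

```java
// Hypothetical sketch of coarse-grained particle partitioning. Each worker
// owns one contiguous chunk of the array, so writes from different threads
// land on different cache lines instead of ping-ponging a shared line.
import java.util.concurrent.CountDownLatch;

public class ChunkedUpdate {
    public static void move(float[] x, int nThreads) {
        int chunkSize = (x.length + nThreads - 1) / nThreads; // ceiling division
        CountDownLatch done = new CountDownLatch(nThreads);
        for (int t = 0; t < nThreads; t++) {
            final int lo = Math.min(t * chunkSize, x.length);
            final int hi = Math.min(lo + chunkSize, x.length);
            new Thread(() -> {
                for (int i = lo; i < hi; i++) {
                    x[i] += 1.0f;   // stand-in for the real per-particle step
                }
                done.countDown();
            }).start();
        }
        try {
            done.await();           // wait for every chunk to finish
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }
}
```

The key property is that each thread's write range is contiguous, so at most the two boundary cache lines of each chunk are ever shared.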

Without the full source code, a complete fix of the existing benchmark is impossible. Dr. Cutress has not released all of the source. What we CAN do is attempt to reproduce one of the six testing stages (Stage 1: TRIG) by examining the code presented by Dr. Cutress in the above posts. Perhaps we can do more with his cooperation.

I'll throw my hat into the ring with a Java implementation, though I am uncertain what the benchmark scores actually represent. For example, locked at 3.4 GHz, my 7700K scores ~43 in Stage 1 (which is the second-fastest stage for me; Stage 2 is a bit higher at ~50). What does that score mean? Without knowing more about the scores produced by 3DPM, it would be difficult to compare results from an "optimized" version of 3DPM (or an approximation thereof) with the original.

edit: apparently the score means "millions of particle movements per second".

Software is now available:

https://www.dropbox.com/s/k18zcf6yza6e63k/3DPMRedux7122015.zip?dl=0

There's a simple file called runme.bat inside the archive. Just extract the archive somewhere and double-click runme.bat and it should start right up. Java 8 required. Source is here:

https://www.dropbox.com/s/artotdvmbra5c1l/3DPMReduxSource7122015.zip?dl=0

Older builds

792015:
https://www.dropbox.com/s/mxzoy30cj79n3u2/3DPMRedux792015.zip?dl=0

source:
https://www.dropbox.com/s/yne3bdaeusmqndy/3DPMReduxSource792015.zip?dl=0

782015:
https://www.dropbox.com/s/c2pd4hicoy93bdb/3DPMRedux782015.zip?dl=0

source:
https://www.dropbox.com/s/i8alkrsn99so3r7/3DPMReduxSource782015.zip?dl=0

6222015:
https://www.dropbox.com/s/vpa8t5yl04ajape/3DPMRedux6222015.zip?dl=0

source:
https://www.dropbox.com/s/enz9kz2u8up8v2x/3DPMReduxSource6222015.zip?dl=0

6162015:
https://www.dropbox.com/s/158tbdoo8rwdno8/3DPMRedux6162015.zip?dl=0

source:
https://www.dropbox.com/s/crjlp6tnk5twoab/3DPMReduxSource6162015.zip?dl=0

6132015:
https://www.dropbox.com/s/a7j864pdnu30gld/3DPMRedux6132015.zip?dl=0

source:
https://www.dropbox.com/s/471xjzpykaze7s9/3DPMReduxSource6132015.zip?dl=0

6112015:
https://www.dropbox.com/s/ro1wyazng8i6qwq/3DPMRedux6112015.zip?dl=0

source:
https://www.dropbox.com/s/qo06kpljlbom6u0/3DPMReduxSource6112015.zip?dl=0

692015:
https://www.dropbox.com/s/5k0ckp9ltj2hdmg/3DPMRedux692015.zip?dl=0

source:
https://www.dropbox.com/s/lysfssq8p7vlwbe/3DPMReduxSource692015.zip?dl=0

Results reporting

If you wish to report results in this thread (which I highly encourage), here are my recommendations:

1). Report the build first. Example: the build released on 6/9/2015 is 692015.
2). Make sure to let us know if you are running in single-threaded or multi-threaded mode.
3). Try running it 2-3 times to see what kind of variation you get. The variations are small, but they do exist.
4). While results from all the different settings could prove useful, results from selections 3-5 will probably be the most useful.
5). Please also report your results from Stage 1 of Dr. Ian Cutress' 3DPM program. Let us know if you are using the single-threaded or multi-threaded version of the program.

You will see that results.txt is formatted so that you can report build, steps, threading, and completion time just by copy/pasting from the file. 3DPMRedux appends future results (it does not delete results.txt), so you can carry the same results file over between versions to keep a record of all your results.

For now, I have applied no license to the software with the understanding that it is derivative of a code snippet posted by Dr. Ian Cutress earlier this year. Until I have a chance to discuss the matter with him, I'm going to hold open the possibility that, legally speaking, this code may fall under his copyright.

Until I have had a chance to discuss the matter with him or receive an authoritative legal opinion on the matter, I offer the code and binaries in the same spirit in which the original code snippet was shared on this forum. Note that it would otherwise be my intention to offer the software under a BSD 2-clause license.

Further discussion of this software is in subsequent posts (check post 17 and later).
 

Enigmoid

Platinum Member
Sep 27, 2012
2,907
31
91
You can fix it or allow it to exist as it is.

particle.x = r*cosf(alpha);
particle.y = r*sinf(alpha);
particle.z = newz;


Is how a lot of people in Cutress' field are going to write it. They aren't computer scientists and won't be optimizing for cache behavior. It may not be a good indication of processor performance, but it will be a good indication of how the average scientist's code is going to run.
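For contrast, here is a hypothetical structure-of-arrays version of the same update in Java (assuming, for simplicity, that r and alpha are shared per step, and using "+=" so movement actually accumulates); contiguous same-typed arrays are what caches and vectorizers prefer:

```java
// Hypothetical structure-of-arrays layout: one flat array per coordinate
// instead of an array of Particle objects. Contiguous same-typed data is
// friendlier to both the cache and autovectorization than the AoS form.
public class ParticlesSoA {
    final float[] x, y, z;

    ParticlesSoA(int n) {
        x = new float[n];
        y = new float[n];
        z = new float[n];
    }

    void step(float r, float alpha, float newZ) {
        // Hoist the trig out of the loop; the loop body is then pure
        // float multiply-adds over flat arrays.
        float c = (float) Math.cos(alpha);
        float s = (float) Math.sin(alpha);
        for (int i = 0; i < x.length; i++) {
            x[i] += r * c;   // "+=" so each step accumulates movement
            y[i] += r * s;
            z[i] += newZ;
        }
    }
}
```

This is only a sketch; the real bench presumably draws a fresh r and alpha per particle, which would keep the trig inside the loop.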

http://www.agner.org/optimize/instruction_tables.pdf

The cos and sin functions take around 130 cycles each.
 

Abwx

Lifer
Apr 2, 2011
10,947
3,457
136
You can fix it or allow it to exist as it is.

If a bench showed a Kabini trouncing an i5, I'm not sure you would have written the bolded part of your sentence. I guess that biased benches suit you perfectly, provided they favour the right brand...


It may not be a good indication of processor performance

Yet before bronxzv pointed out the discrepancies, you were stating that it was representative of FP performance, going as far as to claim that Bay Trail having 2x the FP IPC of a Piledriver wasn't weird at all and was quite expected...

Thanks to this flawed bench, Ian Cutress wrote that Bay Trail has superior FP IPC to Kabini, never mind that the latter shows 32% and 38% better IPC in Cinebench 11.5 and POV-Ray respectively. We must abide by the results provided by an amateurish piece of code that I wouldn't dare call software...
 

bronxzv

Senior member
Jun 13, 2011
460
0
71
You can fix it or allow it to exist as it is.



Is how a lot of people in Cutress' field are going to write it.

Yes, but replacing "=" with "+=", very obviously.

The fact that this is AoS (not vectorizer-friendly) is orthogonal to the main issue, IMO, which is the false cache line sharing problem (due to the way threads are assigned to particles: way too fine-grained, IIRC).
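Where coarser work assignment isn't possible, another standard mitigation for false sharing is padding: space each thread's hot data a full 64-byte cache line apart. A toy sketch (not from any 3DPM code):

```java
// Hypothetical sketch: pad each thread's mutable slot out to 64 bytes
// (16 floats) so two threads' hot fields never share a cache line.
public class PaddedCounter {
    private static final int PAD = 16;      // floats per 64-byte cache line
    private final float[] slots;

    public PaddedCounter(int nThreads) {
        slots = new float[nThreads * PAD];  // thread t uses slots[t * PAD]
    }

    public void add(int thread, float v) {
        slots[thread * PAD] += v;           // each thread writes its own line
    }

    public float total() {
        float sum = 0f;
        for (int t = 0; t * PAD < slots.length; t++) {
            sum += slots[t * PAD];
        }
        return sum;
    }
}
```

The cost is wasted memory (15 of every 16 floats unused), which is why per-thread chunking is usually the better first fix.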

The cos and sin functions are using around 130 cycles each.

This is irrelevant, since x87 sin and cos functions aren't used in the real-world bench according to VTune (well, IIRC).
 

bronxzv

Senior member
Jun 13, 2011
460
0
71
What are you using then?

If it were me, I would use a library such as Intel's SVML for such applications.

IIRC the bench discussed here is compiled with an obsolete VC++ compiler and uses software-synthesized sin/cos in x87 code, which is probably better than the native x87 sin/cos, by the way. Anyone interested can download it and profile it, as I did in the past discussion at the link provided by the OP.

Now, I will not waste more time on this, this time around.
 

Essence_of_War

Platinum Member
Feb 21, 2013
2,650
4
81
Without the full source code, a complete fix of the existing benchmark is impossible. Mr. Cutress has not released all of the source. What we CAN do is attempt to reproduce one of the six testing stages (Stage 1: TRIG) by examining the code presented by Mr. Cutress in the above posts. Perhaps we can do more with his cooperation.

Is your goal to reproduce Ian's 3DPM movement, specifically, or is it to develop/build a floating-point intensive benchmark that simulates a scientific code?
 

DrMrLordX

Lifer
Apr 27, 2000
21,629
10,841
136
what's the significance of this 3DPM bench?

It is a benchmark that has been used by some to show the weakness of AMD's floating-point performance, when in reality it shows the weakness of AMD's cache implementation in situations where multiple, repeated cache flush/refill operations are inevitable. It commonly shows up in Anandtech reviews, and is the source of much consternation in AMD vs. Intel threads that keep popping up here and elsewhere.

Fixing the problem, at least in a limited sense, could show us what 3DPM should have revealed some time ago: that both Intel and AMD processors should be able to perform better than indicated by the existing benchmark.

Is your goal to reproduce Ian's 3DPM movement, specifically, or is it to develop/build a floating-point intensive benchmark that simulates a scientific code?

Mostly it is to reproduce Ian's 3DPM benchmark, at least for Stage 1 (Stages 2-6 use algorithms he hasn't chosen to share with us).
 

videogames101

Diamond Member
Aug 24, 2005
6,777
19
81
If a bench showed a Kabini trouncing an i5, I'm not sure you would have written the bolded part of your sentence. I guess that biased benches suit you perfectly, provided they favour the right brand...




Yet before bronxzv pointed out the discrepancies, you were stating that it was representative of FP performance, going as far as to claim that Bay Trail having 2x the FP IPC of a Piledriver wasn't weird at all and was quite expected...

Thanks to this flawed bench, Ian Cutress wrote that Bay Trail has superior FP IPC to Kabini, never mind that the latter shows 32% and 38% better IPC in Cinebench 11.5 and POV-Ray respectively. We must abide by the results provided by an amateurish piece of code that I wouldn't dare call software...

Most developers write "amateurish" code mate. Cache performance is important too.
 

Abwx

Lifer
Apr 2, 2011
10,947
3,457
136
Most developers write "amateurish" code mate. Cache performance is important too.

It's quite possible, but most products are improved in later iterations, while a bench can't be upgraded, by definition; it must be well designed from the start or be thrown in the trash bin for good.

That's why, despite the criticisms, Cinebench 11.5 is still widely used as an FP bench. It somewhat favours Intel, but it's still very effective for checking a uarch's performance and evolution.
 

TuxDave

Lifer
Oct 8, 2002
10,572
3
71
It's quite possible, but most products are improved in later iterations, while a bench can't be upgraded, by definition; it must be well designed from the start or be thrown in the trash bin for good.

That's why, despite the criticisms, Cinebench 11.5 is still widely used as an FP bench. It somewhat favours Intel, but it's still very effective for checking a uarch's performance and evolution.

So how come Geekbench is still around? ;)

/don't flame me bro

But seriously now, there's a huge mountain of legacy math libraries that will never get updated and CPU designers need to either throw all of them under a bus or just make sure that the CPU is robust enough to handle crappy code. :)
 

Enigmoid

Platinum Member
Sep 27, 2012
2,907
31
91
So how come Geekbench is still around? ;)

/don't flame me bro

But seriously now, there's a huge mountain of legacy math libraries that will never get updated and CPU designers need to either throw all of them under a bus or just make sure that the CPU is robust enough to handle crappy code. :)

Agreed. Geekbench is an exceptionally bad piece of software for comparing architectural performance.

It's good at what it does, and to their credit, the devs are pretty up front about what that is. It basically shows how naively written code, without any optimization, performs across architectures.
 

DrMrLordX

Lifer
Apr 27, 2000
21,629
10,841
136
3DPM isn't the only "bad" benchmark out there. It is somewhat unique in its ability to show poor performance on AMD platforms, true, but doubtless there are other benchmarks out there with problems of their own. Some of them use dated libraries or have not been updated to use modern ISA extensions.

If it were a cache benchmark that showed cache bandwidth and latency, or showed performance in optimized vs unoptimized code blocks, then it would make more sense. As it stands, the original release info clearly states that AMD is crippled by the FP-heavy nature of the test. We now know that to be generally untrue.

If nobody here has any specific insight regarding the meaning of the scores reported by 3DPM, then it should be possible to establish a different scoring system that we can use to compare Intel and AMD processors and detect performance deltas between specific CPUs. We can then match those deltas against the ones produced by the established 3DPM benchmark and see if there is any difference, at least for Stage 1.

There are other questions remaining: what are the values of the constants steps and particles? The value of particles is probably 10000, since:

a). 3DPM spawns one thread per particle and
b). the release post for 3DPM says that it forces the host computer to handle 10k threads

I'm not really sure that the value of steps will make a big difference in relative performance, so any old value could suffice. Perhaps it and particles should be made user-selectable before the benchmark is run?

edit: there's also the issue of significant figures. How much accuracy is needed here? The need for accuracy can't be that great, or else Dr. Cutress would not have opted for 32-bit fp.
 

Essence_of_War

Platinum Member
Feb 21, 2013
2,650
4
81
As it stands, the original release info clearly states that AMD is crippled by the FP-heavy nature of the test. We now know that to be generally untrue.

We do?

There are other questions remaining: what are the values of the constants steps and particles? The value of particles is probably 10000, since:

a). 3DPM spawns one thread per particle and
b). the release post for 3DPM says that it forces the host computer to handle 10k threads

Isn't this just "OMP parallel for" looped over the num particles?
 

DrMrLordX

Lifer
Apr 27, 2000
21,629
10,841
136

Yes. Bay Trail-D beat out some AMD processors in 3DPM that had/have faster 32-bit fp throughput according to some other software (AIDA64, etc). In light of observations made about cache size/speed on Bay Trail-D and the thread linked in the OP where Dr. Cutress' code snippet is held up for some review, it is logical to conclude that there is more to 3DPM than a simple fp benchmark.

Isn't this just "OMP parallel for" looped over the num particles?

Yes. It should create one separate thread each time it loops over the constant particles. The release post says that it creates 10k threads, so it's reasonable to assume that the value of particles is 10000.
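For reference, the closest Java analogue to an "OMP parallel for" over the particles is a parallel stream over the index range. Worth noting: neither construct literally spawns one OS thread per iteration, so the "10k threads" figure may really describe 10k work items. A sketch with hypothetical names:

```java
// Rough Java analogue of "#pragma omp parallel for" over a particle array.
// The runtime partitions the index range across a shared thread pool; it
// does NOT create one thread per index.
import java.util.stream.IntStream;

public class ParallelFor {
    public static float[] stepAll(float[] z, float dz) {
        float[] out = new float[z.length];
        IntStream.range(0, z.length)
                 .parallel()
                 .forEach(i -> out[i] = z[i] + dz); // independent per-index work
        return out;
    }
}
```

Under the hood this uses the common ForkJoinPool, which defaults to roughly one worker per core.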

My first attempt to improve on Dr. Cutress' code snippet is now available. I have updated the OP to reflect this fact, but for those reading this far down in the thread, it is here:

https://www.dropbox.com/s/4q4f9unhwhtswcf/3DPMRedux662015.zip?dl=0

As with my other Java software, there's a simple file called runme.bat inside the archive. Just extract the archive somewhere and double-click runme.bat and it should start right up. Java 8 required. Source is here:

https://www.dropbox.com/s/jrpvwxr8aq3au8a/3DPMReduxSource662015.zip?dl=0

The program creates a small text file called output.txt which shows the final x,y, and z coordinates for all 10000 particles handled by the benchmark. It will attempt to delete old output files before writing a new one. The user is free to select how many "steps" of movement are applied to each particle. Be warned that selections past 4 can be a little slow.

You may notice selections 1 and 2 running faster the second time (and all subsequent times) if you select them repeatedly during the same program execution. This is likely due to JIT "warmup" in the JVM, though you will probably not notice it in selections 3 or higher.

This program will probably run a lot faster on AVX2-enabled Haswell processors than on anything else. Java 8 is capable of producing AVX/AVX2 instructions via autovectorization through the JVM. The code itself blatantly hints to the JVM that autovectorization should happen, and the same techniques have borne fruit in other software I've written in the past.
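To illustrate the kind of hinting meant here (my characterization of HotSpot's superword optimization, not anything taken from the 3DPMRedux source): a plain counted loop over flat float arrays, with no calls or branches in the body, is the shape the JVM can turn into SSE/AVX code:

```java
// A simple counted loop over flat float arrays, with no method calls or
// branches in the body, is the loop shape HotSpot's superword pass can
// compile to SSE/AVX instructions. Whether it actually vectorizes depends
// on the JVM version and flags; this only illustrates the shape.
public class VecFriendly {
    public static void axpy(float a, float[] x, float[] y) {
        for (int i = 0; i < x.length; i++) {
            y[i] = a * x[i] + y[i];   // independent iterations, unit stride
        }
    }
}
```

Anything that breaks this shape, such as a call to Math.sin() inside the body, tends to defeat the optimization.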

3DPMRedux does feature some deviations from what was in Dr. Cutress' code snippet that is meant to be representative of 3DPM. Key differences:

1). Instead of using stock sin/cos functions, I implemented a simplified Taylor Series for both functions. The "fakesin" and "fakecos" methods should still produce accuracy out to about 6 significant figures, which is probably "good enough" when using 32-bit floating point numbers.

2). I allowed negative values for z-axis particle movement.

3). I allowed user-selection of the "steps" value.

4). Instead of reporting a performance number without discernible units, I track the total time of execution for the entire benchmark process (save the process of initializing the particles themselves when the program is first opened).
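For the curious, a truncated Taylor-series sin/cos in pure float, as described in (1), might look something like this (my own sketch, not the actual 3DPMRedux source):

```java
// Truncated Taylor series around 0, evaluated Horner-style and kept in
// float throughout so no double<->float casts are needed. Accuracy is a
// few 1e-6 for arguments already reduced to roughly [-pi/2, pi/2]; wider
// ranges need more terms or range reduction first.
public class FakeTrig {
    public static float fakeSin(float x) {
        float x2 = x * x;
        // x - x^3/3! + x^5/5! - x^7/7! + x^9/9!
        return x * (1f + x2 * (-1f/6f + x2 * (1f/120f
                 + x2 * (-1f/5040f + x2 / 362880f))));
    }

    public static float fakeCos(float x) {
        float x2 = x * x;
        // 1 - x^2/2! + x^4/4! - x^6/6! + x^8/8!
        return 1f + x2 * (-0.5f + x2 * (1f/24f
                 + x2 * (-1f/720f + x2 / 40320f)));
    }
}
```

The Horner form keeps the operation count low and gives the JIT a clean chain of float multiply-adds to work with.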

Possible bugs:

Repeated operation of the program may eventually cause the Particles objects to store X and Y values that can no longer increment upwards. This flaw stems from the limited precision of 32-bit fp: once a value grows large enough, adding a fixed increment no longer changes it at all.

For example, if you have a 32-bit fp variable and you repeatedly add 1.0f to it, it will eventually reach a value where adding 1.0f changes it by exactly 0.

3DPMRedux makes no attempt to compensate for this problem. Later versions could reinitialize all Particles objects to reset the starting values to 0, but for now, it doesn't seem to be a big issue. It might be for very large values of steps.
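The effect is easy to demonstrate: a float carries a 24-bit significand, so once a value reaches 2^24 = 16,777,216, adding 1.0f rounds straight back to the same value:

```java
// Above 2^24 a float can no longer represent every integer: the spacing
// between adjacent floats becomes 2, so "+ 1.0f" rounds back to the same
// value and an accumulator silently stops moving.
public class FloatSaturation {
    public static float limit() {
        return 16777216f;         // 2^24, the last point where step 1 works
    }

    public static boolean stuck(float v) {
        return v + 1.0f == v;     // true once increments are lost to rounding
    }
}
```

A periodic reinitialization (or accumulating in double and casting at the end) would sidestep the problem.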
 

Essence_of_War

Platinum Member
Feb 21, 2013
2,650
4
81
A bold first pass!

1). Instead of using stock sin/cos functions, I implemented a simplified Taylor Series for both functions. The "fakesin" and "fakecos" methods should still produce accuracy out to about 6 significant figures, which is probably "good enough" when using 32-bit floating point numbers.

This jumped out at me. Are you familiar with the implementation of the built-in sin/cos functions? I'm certainly not, but re-writing the metaphorical wheel might not reliably be faster than calls to the math library. But then again lib calls could be expensive...
 

jhu

Lifer
Oct 10, 1999
11,918
9
81
This is the problem with closed-source CPU benchmark programs: most people don't actually know what the code is doing unless the binary is decompiled, and sometimes not even then.
 

bronxzv

Senior member
Jun 13, 2011
460
0
71
1). Instead of using stock sin/cos functions, I implemented a simplified Taylor Series for both functions. The "fakesin" and "fakecos" methods should still produce accuracy out to about 6 significant figures, which is probably "good enough" when using 32-bit floating point numbers.

Taylor series are generally a bad idea; you get far better results (speed vs. accuracy) with Legendre or Chebyshev polynomial approximations.

Also, for sin/cos, with any approximation, a challenging part is range reduction.

A very good book on the subject that I'll recommend is Elementary Functions: Algorithms and Implementation, Jean-Michel Muller, Birkhäuser, 2005.
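To make the range-reduction point concrete, here is a deliberately naive sketch (adequate for small arguments only; production libraries use extended-precision reduction such as Payne-Hanek for huge inputs):

```java
// Naive range reduction: fold x into [-pi, pi] by subtracting whole turns.
// For very large |x| the subtraction cancels catastrophically in float,
// which is exactly why real math libraries use extended-precision
// (Payne-Hanek style) reduction instead.
public class RangeReduce {
    private static final float TWO_PI = (float) (2.0 * Math.PI);

    public static float reduce(float x) {
        return x - TWO_PI * Math.round(x / TWO_PI);
    }
}
```

Only after reduction is it safe to feed the argument to a small polynomial approximation of sin or cos.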
 

DrMrLordX

Lifer
Apr 27, 2000
21,629
10,841
136
It's Dr. Cutress, isn't it?

Ah! Thanks for that. I'll edit my posts to correct that issue.

A bold first pass!

This jumped out at me. Are you familiar with the implementation of the built-in sin/cos functions? I'm certainly not, but re-writing the metaphorical wheel might not reliably be faster than calls to the math library. But then again lib calls could be expensive...

According to my limited research, the sin/cos implementations in the standard Java API use Taylor series approximations out to 14 terms. In my personal experience, they seem resistant to JVM autovectorization for whatever reason. They also return double values, which is unnecessary given that Dr. Cutress' benchmark uses 32-bit fp. Any computation involving doubles will be slower than the equivalent float math, so by switching to alternative sin and cos functions, we can avoid double-precision fp math AND we can avoid casting as well.

edit: taking a second look at it, StrictMath may not be using a Taylor Series at all. I am seeing contradictory information on the topic.

Taylor series are generally a bad idea; you get far better results (speed vs. accuracy) with Legendre or Chebyshev polynomial approximations.

Also, for sin/cos, with any approximation, a challenging part is range reduction.

A very good book on the subject that I'll recommend is Elementary Functions: Algorithms and Implementation, Jean-Michel Muller, Birkhäuser, 2005.

An excellent suggestion. I did notice similar recommendations on StackExchange and elsewhere. For my first attempt, I stuck with something simple that I could implement within the course of a day. Also, it seems as though Java's StrictMath (to which Java defaults when Math.sin() and Math.cos() are used) uses a 14-term Taylor series itself...

edit: A correction. Depending on the JVM, Math.sin() may or may not default to StrictMath's implementation of the sine function.

edit edit: The next build will have some upgrades here and there. Here's what I have planned:

Improved sin/cos implementations: If I can find something faster/more accurate than the Taylor serieseseses I'm using right now, I'll implement them. That may take a little time, and depending on that factor, it may take another build before I actually put that in the code.

Results reporting: I'll include some code to record the amount of time it took to perform the benchmark. It'll include the selection and the build number, at least.

Single-threaded mode: I'll add a toggle to switch between Single-threaded and Multi-threaded mode.
 