Java software not gaining much from HT

DrMrLordX

Lifer
Apr 27, 2000
Some of you might have read this thread in the CPU forum:

http://forums.anandtech.com/showthread.php?t=2433693

I also solicited for testers on OCN's Intel forum and got one guy with a 4770k to help me by testing the software on his machine:

http://www.overclock.net/t/1563664/please-help-me-make-this-software-more-intel-friendly/0_100

Long story short: the software I wrote works just fine and dandy on my A10-7700k, showing a notable improvement over the program it was intended to emulate, Dr. Cutress' 3DPM (Stage 1 only). It runs rather poorly on Enigmoid's i7-3630QM compared to 3DPM, which I still cannot explain.

On top of that, it is slightly slower than 3DPM on an i7-4770K with HT on, but faster than 3DPM with HT off. Dr. Cutress' 3DPM Stage 1 gains ~67% on the 4770K from HT, while my 3DPMRedux only gains ~20% from HT. And that's on Haswell . . . I haven't seen how HT affects things on Ivy Bridge.

The source code for 3DPM (Stage 1 or otherwise) is not available in its entirety, but the source for the latest build of my 3DPMRedux is linked in those threads. Or heck, I'll just link it here:

https://www.dropbox.com/s/enz9kz2u8up8v2x/3DPMReduxSource6222015.zip?dl=0

So can anyone here think of why HT is having such a muted effect on 3DPMRedux? If I had to guess why it's helping 3DPM Stage 1 so much, it's that there are probably some pipeline stalls leaving open a lot of execution resources for the extra logical processors to use on their assigned threads. It is also possible that the Java version is experiencing fewer pipeline stalls, but is (overall) running more slowly thanks to Java-inflicted overhead. It's close, darn close, but it isn't quite "there" yet.

There is also the possibility that the way I've set up my thread pool is slowing things down for HT, but I'm not really sure why that might be.
 

beginner99

Diamond Member
Jun 2, 2009
You can't just compare Java with C(++). Yes, Java has a reputation for being slower, and it is slower than optimized C++ in most cases, though in a few cases the JIT actually makes Java faster. The problem is that optimizing C++ is time-consuming, often not needed at all, and often not worth it from a business perspective.

In this case it can be anything from Java overhead to a specific implementation detail. Do you know how HT works?
 

DrMrLordX

Lifer
Apr 27, 2000
Vaguely. I have always regarded it as technology that allows a single core to assign execution resources to two different threads if there are any open stages in the pipeline. Or, at least, that's how I interpret Intel's implementation. There's more to it than that.

I know that Java has a rep for being slower compared to optimized C/C++ code. Compared to poorly written C++, good Java can be faster. I'm not trying to say that my Java is "good" per se, but I did manage to wring faster performance out of Java on the same basic algorithm than someone else did with C++, at least on my A10-7700k. That was also the case on a Haswell with HT disabled. Enabling HT sent the C++ implementation back into the lead by a small margin, approximately 11% faster.

If it's just Java overhead then I'd have no choice but to go back to the drawing board, or try to convert it to some competent C/C++ (I stink at both, so that option is not entirely desirable).
 

Spungo

Diamond Member
Jul 22, 2012
Is it possible Java is assigning threads to the wrong cores at the wrong times? I remember HT having a problem where games would assign things to two logical cores that are the same physical core. The core numbers would be like this:
0 = first core
1 = HT of first core
2 = second core
3 = HT of second core

Games would assign work to cores 0 and 1, which are really the same core. Turning off HT would make the game faster because it would then be assigning work to cores 0 and 2 only.
 

Cogman

Lifer
Sep 19, 2000
Is it possible Java is assigning threads to the wrong cores at the wrong times? I remember HT having a problem where games would assign things to two logical cores that are the same physical core. The core numbers would be like this:
0 = first core
1 = HT of first core
2 = second core
3 = HT of second core

Games would assign work to cores 0 and 1, which are really the same core. Turning off HT would make the game faster because it would then be assigning work to cores 0 and 2 only.

Nope, Java relies on the OS for thread scheduling.
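
About the most you can do from plain Java is ask how many logical processors the OS reports and pick thread counts and priorities to match; which physical core a given thread lands on is entirely the OS scheduler's call. Roughly this (just a sketch, not taken from your code):

Code:
// The JVM only sees logical processors (8 on a 4C/8T chip with HT on);
// it has no idea which two logical CPUs share a physical core.
int logicalCpus = Runtime.getRuntime().availableProcessors();

for (int i = 0; i < logicalCpus; i++) {
    Thread worker = new Thread(() -> {
        // hot loop would go here; which core it runs on is up to the OS
    });
    worker.setPriority(Thread.NORM_PRIORITY); // priority is the only scheduling hint you get
    worker.start();
}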
 

Cogman

Lifer
Sep 19, 2000
Ok, few things.

First, looking at the code, it doesn't surprise me that you are getting little performance benefit from HT. This code is crazy heavy on FP math with almost no memory lookups, branching, etc. Congrats! (That is usually pretty hard to do.)

It also doesn't surprise me that it varies wildly with architectures. You're pretty tightly bound to the number of FP execution units available on a processor. The more a particular architecture has, the more benefit you'll see from HT. Honestly, your best bet for seeing which is best is going to be just doing a test run on the target proc.

With a few small tweaks, I was able to get a 10% performance increase with the Java. I don't think you can get much more without some major refactoring. (I'm at work now and we don't have access to Dropbox... so I'm going to spout this from memory.)

The three big things that improved performance the most: first, making the constants at the top of the Particle run thinger static (private static final for constants in Java, always; it allows optimizations that the JVM doesn't do otherwise). Second, some of the arrays are unneeded in that inner loop. You pretty much only need the newx/y/z arrays; the other arrays can and should be plain floating-point variables.
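
From memory, the shape of those two changes is something like this (illustrative names, not a paste of your actual source):

Code:
// At the top of the class: private static final so the JIT can fold the
// constants into the compiled loop instead of reloading fields every pass.
private static final float STEP_SIZE = 1.0f;

// Inside the inner loop: scratch values become plain locals; only the
// newx/newy/newz arrays need to outlive the iteration.
float cosTheta = (float) Math.cos(theta);   // was: cosArray[i] = ...
float sinTheta = (float) Math.sin(theta);   // was: sinArray[i] = ...
newx[i] = x[i] + STEP_SIZE * cosTheta;
newy[i] = y[i] + STEP_SIZE * sinTheta;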

Finally, you need to realize that the JVM optimizes at the function level. On top of that, it does a better job with smaller functions (inlining, code alignment, etc.; everything gets better with a smaller function). I split the inner loop into two functions: one to load the random values and another to update the particle with the new information. That alone gave about a 5% performance bump.
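
Something along these lines; the point is just to keep each method small enough for the JIT to inline and lay out well (again from memory, names are made up):

Code:
// inside the run() method:
for (int i = 0; i < particles; i++) {
    float theta = nextAngle(rng);   // helper #1: just produce the random value
    moveParticle(i, theta);         // helper #2: apply it to the particle
}

// Each helper stays tiny, so the JIT inlines it and optimizes the pair together.
private static float nextAngle(MersenneTwisterFast rng) {
    return rng.nextFloat() * (float) (2.0 * Math.PI);
}

private void moveParticle(int i, float theta) {
    newx[i] = x[i] + (float) Math.cos(theta);
    newy[i] = y[i] + (float) Math.sin(theta);
}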

With all that said, if possible, I think your best bet is to dump Java and C++ and instead use CUDA or OpenCL for that hot loop. This problem looks like it should be very amenable to GPGPU computing.
 

DrMrLordX

Lifer
Apr 27, 2000
Is it possible Java is assigning threads to the wrong cores at the wrong times?

I had thought that might be the case, which is why I asked the helpful poster on OCN to disable HT in the first place. But the OS scheduler handles how threads are assigned to cores/processors, and he reports 100% utilization on all cores throughout the benchmark. So, I can't assume that the thread pool is doing a poor job of assigning workers to threads.

Ok, few things.

First, looking at the code, it doesn't surprise me that you are getting little performance benefit from HT. This code is crazy heavy on FP math with almost no memory lookups, branching, etc. Congrats! (That is usually pretty hard to do.)

Thanks! I've been working on getting better at that for a little while now . . . anything less seems to run like a dog on these newer AMD chips.

It also doesn't surprise me that it varies wildly with architectures. You're pretty tightly bound to the number of FP execution units available on a processor. The more a particular architecture has, the more benefit you'll see from HT. Honestly, your best bet for seeing which is best is going to be just doing a test run on the target proc.

Makes sense, though the only test units I have available right now are an A10-7700k and an E1-2500 (lulz). My goal was to put together something that could improve performance for both AMD and Intel processors, but I did not expect to hurt any Intel CPU performance in the process. Java just happens to be what I know.

With a few small tweaks, I was able to get a 10% performance increase with the Java. I don't think you can get much more without some major refactoring. (I'm at work now and we don't have access to Dropbox... so I'm going to spout this from memory.)

Any chance you could paste a link to your modifications at a later date? I can try modifying it on my end as per your instructions, but it would be interesting to compare/contrast what you did with what I may attempt later today.

The three big things that improved performance the most: first, making the constants at the top of the Particle run thinger static (private static final for constants in Java, always; it allows optimizations that the JVM doesn't do otherwise). Second, some of the arrays are unneeded in that inner loop. You pretty much only need the newx/y/z arrays; the other arrays can and should be plain floating-point variables.

Finally, you need to realize that the JVM optimizes at the function level. On top of that, it does a better job with smaller functions (inlining, code alignment, etc.; everything gets better with a smaller function). I split the inner loop into two functions: one to load the random values and another to update the particle with the new information. That alone gave about a 5% performance bump.

Okay, I'll try that. It shouldn't be too hard to implement those suggestions.

edit: I tried all three changes you recommended and got a performance bump on the 7700k. I'll do more testing, clean things up, and put the new code + classes on dropbox later today.

Here is the source for the 782015 build incorporating your suggestions:

https://www.dropbox.com/s/i8alkrsn99so3r7/3DPMReduxSource782015.zip?dl=0

I went from ~67M movements per second to ~75M movements per second. Pretty good stuff!

With all that said, if possible, I think your best bet is to dump Java and C++ and instead use CUDA or OpenCL for that hot loop. This problem looks like it should be very amenable to GPGPU computing.

I agree completely. It's just that there is a talent/skill deficit on my end. If I could code as well as mat and poke349 (authors of GPUPI and y-cruncher, respectively), I could crank out OpenCL 1.2/2.0 AND CUDA and do SIMD via intrinsics. I'm still trying to figure out how to set up a suitably fast threading model for a C++ version of 3DPMRedux in Windows. I'm not wedded to the idea of using VS2013, and I'd rather use MinGW. Thread pools are easy to set up in Java with an ExecutorService.
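
For comparison, here's roughly what the Java side boils down to (a sketch, not the exact 3DPMRedux code; runParticleChunk() is just a stand-in for the per-thread slice of the work):

Code:
int threads = Runtime.getRuntime().availableProcessors();
ExecutorService pool = Executors.newFixedThreadPool(threads);

List<Future<?>> pending = new ArrayList<>();
for (int t = 0; t < threads; t++) {
    final int workerId = t;
    pending.add(pool.submit(() -> runParticleChunk(workerId)));
}
for (Future<?> f : pending) {
    f.get();   // blocks until that worker finishes (throws checked exceptions)
}
pool.shutdown();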

Also, if I go with OpenCL/CUDA, then I'm benching GPUs instead of CPUs. Not that there's anything wrong with that. I would love to put my iGPU to work! Actually, I'd prefer to do it with HSA, but . . . grumble grumble . . . yeah.
 

Cogman

Lifer
Sep 19, 2000
Ok, so here is my time wasted for the afternoon :)

https://github.com/cogman/ThreeDBMRedux

This is my best shot after a couple of hours. I didn't really get much more out of it; the gains are hard to come by, but I did manage a little bit of a bump (I honestly don't remember all the little changes... so, yeah, lots of guess and check).

One thing that I did notice was that "oversaturating" the CPU (more threads than cores) did result in a slight performance increase. You might consider exceeding the core count by somewhere around 2/3 to get the best performance.
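
In other words, something like this instead of matching the logical processor count exactly (the exact factor is worth experimenting with, this is just a sketch):

Code:
// roughly 2/3 more worker threads than logical processors; tune per machine
int threads = Runtime.getRuntime().availableProcessors() * 5 / 3;
ExecutorService pool = Executors.newFixedThreadPool(threads);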

I did "javaify" some things and cut down a little of logic. Also, I threw this into a maven project, just 'cause.

Let me know if you have any questions.
 

DrMrLordX

Lifer
Apr 27, 2000
Thanks! I'll give it a look and see how it performs on my test machine.

By the way, what was your test machine when working with the code?

edit: nnngh, lambdas *head explodes*. Nah, it wasn't that bad. The rework on the thread pool as presented slowed things down some on my 7700k . . . however, your changes to the "meat" of minicos sped it up a teeny tiny bit (or were just as fast, hard to tell exactly), which, in light of the input Ken g6 offered with respect to his Wolfdale, might be a superior offering. Also, your changes to Particles.java (notably making the arrays final) sped things up a tiny bit too.

I tried oversaturating the thread count by +50%. It didn't seem to hurt performance; if anything it gained maybe ~200 ms @ 102400 steps.
 

Cogman

Lifer
Sep 19, 2000
I'm on an Intel i7-3770K (not overclocked).

I didn't see any performance loss with the changes to the thread pool. Though, I did make it so the threads start with min priority (which is what I believe you were trying to achieve by having Organize extend Thread). My guess is you're seeing the losses because the threads aren't running at normal priority, but rather at min priority. Try changing this:

Code:
ThreadPoolExecutor executor = (ThreadPoolExecutor) Executors.newFixedThreadPool(threads, (r) -> {
    Thread thread = new Thread(r);
    thread.setPriority(Thread.MIN_PRIORITY);
    return thread;
});

to this

Code:
ThreadPoolExecutor executor = (ThreadPoolExecutor) Executors.newFixedThreadPool(threads);

and see if that buys back the performance lost.
 

DrMrLordX

Lifer
Apr 27, 2000
Okay, looks like removing the minimum priority bit got me back some of the performance. What really helped was switching from MersenneTwisterFast to Java 8's SplittableRandom. Sadly it has no .nextFloat() method, but even using .nextDouble() and casting the result to float, I saw a big performance increase.
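
The swap itself looks roughly like this (a sketch, not the exact code from the build):

Code:
// One SplittableRandom per worker; split() gives each worker its own
// independent stream, so there's no shared state between threads.
SplittableRandom root = new SplittableRandom();
SplittableRandom rng = root.split();    // do this once per worker, not per step

// Java 8's SplittableRandom has no nextFloat(), so take a double and narrow it.
float theta = (float) (rng.nextDouble() * 2.0 * Math.PI);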

If you don't mind me asking, can you tell me how the change to SplittableRandom affects performance on your 3770k? Source is here:

https://www.dropbox.com/s/artotdvmbra5c1l/3DPMReduxSource7122015.zip?dl=0

It's identical to the 792015 build except for the RNG change.