
Float-heavy code, HT, and top

Essence_of_War

Platinum Member
I have some very float-intensive fortran code that is totally serial (no OMP, no MPI, no pthreads, no nothing). I typically run multiple instances of it with slightly different parameters to do simulation work, and recently I've had my first opportunity to run it on a Xeon that has HT.

I have a quad-core Xeon w/ HT and when I load up 4 instances of this code, watching top consistently shows a load of 3.8. Am I leaving performance on the table if I only run 4 instances? If I loaded up an additional 4 instances, assuming I have enough RAM to not start swapping, will I slow down all of my simulation runs by forcing the extra scheduling on non-physical threads?

For reference, this system doesn't have to do anything but run this simulation code.
 
Trying it is the only real way to tell. Though if the code isn't perfectly optimized, you'll probably see a modest 10-20% bump in throughput.
 
It might not benefit at all, either. Prime95 doesn't, and it's FP-heavy. But it's also often limited by RAM bandwidth.

If your code doesn't benefit from HT, it might benefit from turning HT off in the BIOS. Prime95 does for sure.
 
"top" is an indication of total CPU thread-seconds being used. It doesn't say anything about particular resources inside the CPU. I'll reiterate Accord99's statement:
Trying it is the only real way to tell.
 
Scheduling is usually pretty dang fast. I wouldn't really worry about 8 threads being too many threads.

Have you profiled the code? How old is the code?

Profiling, if you haven't done it yet, will probably get you the best performance gain per buck. If the code is really old, you may see good gains simply from rewriting it with modern idioms and patterns. Old code did a lot in the name of performance that can really get in the way of an optimizing compiler: hand-unrolled loops, manual inlining, things like that. It would also often optimize for memory size, something you shouldn't really worry about today. Const correctness and access restrictions that are as tight as possible give the compiler a lot of room for optimization. Building with whole-program optimization will also give a good amount of benefit.
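To illustrate the hand-unrolling point, here's a small hypothetical C sketch (the Fortran situation is analogous): the manually unrolled version mostly obscures intent, while the plain version with const/restrict gives a modern compiler room to unroll and vectorize on its own.

```c
#include <stddef.h>

/* Old style: manual unrolling; assumes n is a multiple of 4. */
double dot_unrolled(const double *a, const double *b, size_t n) {
    double s = 0.0;
    for (size_t i = 0; i < n; i += 4) {
        s += a[i]   * b[i];
        s += a[i+1] * b[i+1];
        s += a[i+2] * b[i+2];
        s += a[i+3] * b[i+3];
    }
    return s;
}

/* Modern style: const + restrict tell the compiler the arrays don't
   alias, so at -O2/-O3 it can unroll and vectorize as it sees fit. */
double dot_simple(const double * restrict a,
                  const double * restrict b, size_t n) {
    double s = 0.0;
    for (size_t i = 0; i < n; i++)
        s += a[i] * b[i];
    return s;
}
```

Both compute the same dot product; the difference shows up in the generated assembly, not the result.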

For floating-point-heavy code, memory bandwidth is often the bottleneck; you might look into some techniques for optimizing CPU cache utilization.
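One common such technique is loop blocking (tiling). A minimal C sketch using a matrix transpose (sizes here are arbitrary illustration, not from the original code):

```c
#include <stddef.h>

#define N 256
#define B 32   /* block size chosen so a tile fits comfortably in L1 */

/* Cache-blocked (tiled) transpose: instead of striding through the
   whole destination column by column, work on B x B tiles so reads
   and writes both stay within recently loaded cache lines. */
void transpose_blocked(const double *src, double *dst) {
    for (size_t ii = 0; ii < N; ii += B)
        for (size_t jj = 0; jj < N; jj += B)
            for (size_t i = ii; i < ii + B; i++)
                for (size_t j = jj; j < jj + B; j++)
                    dst[j * N + i] = src[i * N + j];
}
```

The same idea applies to matrix multiplies and stencil loops: pick a tile size so the working set of the inner loops fits in cache.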

And finally, if all else fails, you may be able to get some good bang for your buck by introducing some well placed intrinsic methods.
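For example, a hot loop rewritten with SSE2 intrinsics in C might look like the sketch below (illustrative only, x86-specific, not from the original code):

```c
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stddef.h>

/* Adds two double arrays two elements at a time using the 128-bit
   XMM registers. Assumes n is even; uses the unaligned load/store
   forms so the data doesn't have to be 16-byte aligned. */
void add_sse2(const double *a, const double *b, double *out, size_t n) {
    for (size_t i = 0; i < n; i += 2) {
        __m128d va = _mm_loadu_pd(&a[i]);
        __m128d vb = _mm_loadu_pd(&b[i]);
        _mm_storeu_pd(&out[i], _mm_add_pd(va, vb));
    }
}
```

Note that a loop this simple is one a good compiler will usually vectorize by itself; intrinsics pay off in loops the auto-vectorizer can't untangle.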
 
If you are running x64 code on a recent-ish Xeon (Nehalem, or maybe slightly older), then the FP code runs in the SSE2 XMM registers. I believe there is one SSE unit per physical core, shared between the two hardware threads. I would suspect that two FP-intensive threads on the same physical core may hurt performance by contending for the SSE unit, but the only way to know for sure is to test it.
 
The floating-point pipeline is deep, often 14-17 cycles, so unless you have highly optimized vector code, dependency chains are going to keep you from saturating it.
 
The floating-point pipeline is deep, often 14-17 cycles, so unless you have highly optimized vector code, dependency chains are going to keep you from saturating it.

The FP units are also split: there are separate units for addition and multiplication, so saturation is going to be even harder. Not impossible, but hard.
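The dependency-chain point can be sketched in C: a single accumulator serializes every add behind the previous one, while multiple accumulators keep independent chains in flight in the pipeline at once.

```c
#include <stddef.h>

/* One accumulator: each add must wait for the previous result, so
   the loop runs at one add per FP-add latency. */
double sum_one_acc(const double *x, size_t n) {
    double s = 0.0;
    for (size_t i = 0; i < n; i++)
        s += x[i];
    return s;
}

/* Four accumulators: four independent chains can overlap in the
   pipeline. Assumes n is a multiple of 4 for brevity. */
double sum_four_acc(const double *x, size_t n) {
    double s0 = 0, s1 = 0, s2 = 0, s3 = 0;
    for (size_t i = 0; i < n; i += 4) {
        s0 += x[i];
        s1 += x[i + 1];
        s2 += x[i + 2];
        s3 += x[i + 3];
    }
    return (s0 + s1) + (s2 + s3);
}
```

Both return the same sum (up to FP reassociation); the second version simply exposes more instruction-level parallelism to the pipeline.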
 
AFAIK HT performs best with a mixed workload, where one thread can use the integer resources while the other uses the floating-point resources. For float-only or integer-only workloads, HT has little to no benefit, and sometimes a negative one.
 
So after the discussion here, I decided to just go all-in and try it. I fired up another four instances of the code, and my load in top is now ~7.9. Each instance prints a small status line to the console after every time-step, and the inter-arrival time of those check-in messages doesn't appear to have slowed at all.
 
AFAIK HT performs best with a mixed workload, where one thread can use the integer resources while the other uses the floating-point resources. For float-only or integer-only workloads, HT has little to no benefit, and sometimes a negative one.

It really depends on what you are doing. Even with floating point or integer heavy computations, you can see benefits if you are doing a lot of memory access.

Take this benchmark as an example of that

http://www.tomshardware.com/reviews/winrar-winzip-7-zip-magicrar,3436-11.html

Compression is mostly integer-bound and does very little floating-point work, yet hyperthreading shows anywhere from a huge benefit to a negative impact across the methods. The difference is, IMO, memory access. The slower, higher-compression methods tend to have much larger dictionaries and deeper memory searches, so one thread's cache misses leave execution resources idle for the sibling thread to use. The fast compression methods, on the other hand, have much smaller dictionaries and shallower memory access, so there is little stall time for hyperthreading to fill.

Point being, you can't really say anything about the impacts of hyperthreading without running some tests of your own.
 
Would CUDA or a video card in general help with this (provided it was written in a more modern language with support)?
 
Regardless of the project, migrating from Fortran to CUDA would be a huge change. The first step would be to analyze the project to see whether massive parallelism like that is even possible. (Think thousands of parallel threads.)

I would say the second step should be to find the inner loop and implement it in C or C++. Apparently, mixing Fortran and C like this is possible.
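As a sketch of that mixing, the C side of an inner loop callable from Fortran might look like the following. (The function name, signature, and update rule are hypothetical; a Fortran caller would declare an interface with ISO_C_BINDING, e.g. `bind(C, name="step_kernel")`.)

```c
#include <stddef.h>

/* Hypothetical inner-loop kernel, callable from Fortran via
   ISO_C_BINDING. Fortran passes arrays as pointers, so the
   interface maps naturally onto C pointer arguments. */
void step_kernel(const double *state, double *next, size_t n, double dt) {
    for (size_t i = 0; i < n; i++)
        next[i] = state[i] + dt * state[i];  /* placeholder update rule */
}
```

Keeping the kernel's interface to plain pointers, sizes, and scalars is what makes the Fortran/C boundary painless.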

If massive GPU parallelism isn't workable, maybe SSE or AVX intrinsic functions in C could provide a parallelism boost? (They're not available in Fortran.) There's a lot less back-and-forth latency with intrinsics, but they can only operate on at most 4 doubles or 8 singles at a time, and the instruction set is limited.
 
I'm not sure that top is a good indicator for what you want to know.

The number you're looking at (the load average) is the average number of processes (or threads) in the run queue, i.e., processes that want to run. Processes that are sleeping or waiting on I/O aren't counted. A load of 7.9 means your 8 processes want to run essentially all the time. It says nothing about how much time each of them actually got to run (which isn't simply 100% when you use Hyper-Threading), and nothing about efficiency.

As said before here, the best way is to benchmark it. I don't even think the benchmarks need to take long; 10-30 seconds should be enough. Run your program for 30 seconds, see how much work gets done, and repeat for the different setups.
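A minimal fixed-duration harness along those lines might look like this in C (`workload()` here is a hypothetical stand-in for one simulation time-step):

```c
#define _POSIX_C_SOURCE 199309L
#include <time.h>

/* Monotonic wall-clock time in seconds. */
static double now_sec(void) {
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return ts.tv_sec + ts.tv_nsec * 1e-9;
}

/* Hypothetical stand-in for one simulation time-step; the volatile
   sink keeps the compiler from optimizing the work away. */
static volatile double sink;
static void workload(void) {
    double s = 0.0;
    for (int i = 1; i <= 100000; i++)
        s += 1.0 / i;
    sink = s;
}

/* Run the workload repeatedly for roughly `seconds` of wall time
   and report how many iterations completed. */
long run_for(double seconds) {
    long iters = 0;
    double start = now_sec();
    while (now_sec() - start < seconds) {
        workload();
        iters++;
    }
    return iters;
}
```

Run it once per instance with 4 instances, then with 8, and compare the summed iteration counts: if the total with 8 is higher, HT is paying off.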
 
I'm not sure on exactly what system you're using, but I like using the Zoom profiler from rotateright (http://www.rotateright.com/zoom/) to check the performance of my simulation code. It used to be a paid product but is free now.

If you use the perf driver, you can do a "scheduler trace" report, which shows how the threads are interacting. In my case, it was pretty obvious that the threads were waiting due to some false sharing in the initial versions of the code, which I was able to eliminate. It can also show you where stalls and misses occur. Although in your case it sounds like you have completely separate instances running that do not need to communicate with each other, so this may not be as helpful.

It does a pretty decent job of figuring out where time is being spent, and although it isn't perfect, I find that it gives a pretty good starting point for optimization. I started using this because gprof has major issues with multi-threaded code, and I find this better in practically every way even with single-threaded code.

Fair warning, I've only used this with C and C++, so I'm not sure how it will perform with Fortran. Just something to think about if you plan on going the route that Cogman suggested.
 
As others have already said, so much of this is code- and compiler-specific that you simply need to test your code and see whether using the hyper-threads gives you a throughput bonus. It sounds like you're essentially doing Monte Carlo simulations. The best test is to time how long it takes to complete the same fixed set of runs (say 16 or 32 total): once with only 4 instances, and once with 8. If the overall time to finish is lower with 8, then HT is indeed giving you a performance boost.

For reference, our code was able to gain 60% of a real core's performance on each hyperthreaded core when running on all real and hyperthreaded cores in the system. That's almost two-thirds of another machine for free, just by using the hyperthreaded cores.
 