
Float-heavy code, HT, and top

Essence_of_War

Platinum Member
I have some very float-intensive fortran code that is totally serial (no OMP, no MPI, no pthreads, no nothing). I typically run multiple instances of it with slightly different parameters to do simulation work, and recently I've had my first opportunity to run it on a Xeon that has HT.

I have a quad-core Xeon w/ HT and when I load up 4 instances of this code, watching top consistently shows a load of 3.8. Am I leaving performance on the table if I only run 4 instances? If I loaded up an additional 4 instances, assuming I have enough RAM to not start swapping, will I slow down all of my simulation runs by forcing the extra scheduling on non-physical threads?

For reference, this system doesn't have to do anything but run this simulation code.
 
Trying it is the only real way to tell. Though if the code isn't perfectly optimized, you'll probably see a modest 10-20% bump in throughput.
 
It might not benefit at all, either. Prime95 doesn't, and it's FP-heavy. But it's also often limited by RAM bandwidth.

If your code doesn't benefit from HT, it might benefit from turning HT off in the BIOS. Prime95 does for sure.
 
"top" is an indication of total CPU thread-seconds being used. It doesn't say anything about particular resources inside the CPU. I'll reiterate Accord99's statement:
Trying it is the only real way to tell.
 
Scheduling is usually pretty dang fast. I wouldn't really worry about 8 threads being too many threads.

Have you profiled the code? How old is the code?

Profiling, if you haven't done it yet, will probably get you the best performance gain per buck. If the code is really old, you may see good gains simply from rewriting it with modern idioms and patterns. Old code did a lot in the name of performance that can really get in the way of an optimizing compiler: hand-unrolled loops, manual inlining, things like that. It would also often optimize for memory size, something you shouldn't really worry about today. Const correctness and access restrictions that are as tight as possible give the compiler a lot of room for optimization. Building with whole-program optimization will also give a good amount of benefit.
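To illustrate the hand-unrolling point, here's a small hypothetical C sketch (the Fortran situation is analogous): the manually unrolled version mostly obscures intent, while the plain version with const/restrict gives a modern compiler room to unroll and vectorize on its own.

```c
#include <stddef.h>

/* Old style: manual unrolling; assumes n is a multiple of 4. */
double dot_unrolled(const double *a, const double *b, size_t n) {
    double s = 0.0;
    for (size_t i = 0; i < n; i += 4) {
        s += a[i]   * b[i];
        s += a[i+1] * b[i+1];
        s += a[i+2] * b[i+2];
        s += a[i+3] * b[i+3];
    }
    return s;
}

/* Modern style: const + restrict tell the compiler the arrays don't
   alias, so at -O2/-O3 it can unroll and vectorize as it sees fit. */
double dot_simple(const double * restrict a,
                  const double * restrict b, size_t n) {
    double s = 0.0;
    for (size_t i = 0; i < n; i++)
        s += a[i] * b[i];
    return s;
}
```

Both compute the same dot product; the difference shows up in the generated assembly, not the result.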

For floating-point-heavy code, memory bandwidth is often the bottleneck; you might look into some techniques for optimizing CPU cache utilization.
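One common such technique is loop blocking (tiling). A minimal C sketch using a matrix transpose (sizes here are arbitrary illustration, not from the original code):

```c
#include <stddef.h>

#define N 256
#define B 32   /* block size chosen so a tile fits comfortably in L1 */

/* Cache-blocked (tiled) transpose: instead of striding through the
   whole destination column by column, work on B x B tiles so reads
   and writes both stay within recently loaded cache lines. */
void transpose_blocked(const double *src, double *dst) {
    for (size_t ii = 0; ii < N; ii += B)
        for (size_t jj = 0; jj < N; jj += B)
            for (size_t i = ii; i < ii + B; i++)
                for (size_t j = jj; j < jj + B; j++)
                    dst[j * N + i] = src[i * N + j];
}
```

The same idea applies to matrix multiplies and stencil loops: pick a tile size so the working set of the inner loops fits in cache.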

And finally, if all else fails, you may be able to get some good bang for your buck by introducing some well placed intrinsic methods.
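For example, a hot loop rewritten with SSE2 intrinsics in C might look like the sketch below (illustrative only, x86-specific, not from the original code):

```c
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stddef.h>

/* Adds two double arrays two elements at a time using the 128-bit
   XMM registers. Assumes n is even; uses the unaligned load/store
   forms so the data doesn't have to be 16-byte aligned. */
void add_sse2(const double *a, const double *b, double *out, size_t n) {
    for (size_t i = 0; i < n; i += 2) {
        __m128d va = _mm_loadu_pd(&a[i]);
        __m128d vb = _mm_loadu_pd(&b[i]);
        _mm_storeu_pd(&out[i], _mm_add_pd(va, vb));
    }
}
```

Note that a loop this simple is one a good compiler will usually vectorize by itself; intrinsics pay off in loops the auto-vectorizer can't untangle.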
 
If you are running x64 code on a recent-ish Xeon (Nehalem, or maybe slightly older), then the FP code runs in the SSE2 XMM registers. I believe there is one SSE unit per physical core, shared between the two hardware threads. I would suspect that two FP-intensive threads on the same physical core may hurt performance by contending for the SSE unit, but the only way to know for sure is to test it.
 
The floating-point pipeline is deep, often 14-17 cycles, so unless you have highly optimized vector code, dependency chains are going to keep you from saturating it.
 
The floating-point pipeline is deep, often 14-17 cycles, so unless you have highly optimized vector code, dependency chains are going to keep you from saturating it.

The FP units are also split: there are separate units for addition and multiplication, so saturation is going to be even harder. Not impossible, but hard.
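The dependency-chain point can be sketched in C: a single accumulator serializes every add behind the previous one, while multiple accumulators keep independent chains in flight in the pipeline at once.

```c
#include <stddef.h>

/* One accumulator: each add must wait for the previous result, so
   the loop runs at one add per FP-add latency. */
double sum_one_acc(const double *x, size_t n) {
    double s = 0.0;
    for (size_t i = 0; i < n; i++)
        s += x[i];
    return s;
}

/* Four accumulators: four independent chains can overlap in the
   pipeline. Assumes n is a multiple of 4 for brevity. */
double sum_four_acc(const double *x, size_t n) {
    double s0 = 0, s1 = 0, s2 = 0, s3 = 0;
    for (size_t i = 0; i < n; i += 4) {
        s0 += x[i];
        s1 += x[i + 1];
        s2 += x[i + 2];
        s3 += x[i + 3];
    }
    return (s0 + s1) + (s2 + s3);
}
```

Both return the same sum (up to FP reassociation); the second version simply exposes more instruction-level parallelism to the pipeline.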
 
AFAIK HT performs best with a mixed workload, where one thread can use the integer resources while the other uses the floating-point resources. For float-only or integer-only workloads, HT has little to no benefit, and sometimes a negative one.
 
So after the discussion here, I decided to just go all-in and try it. I fired up another four instances of the code, and my load in top is now ~7.9. Each instance prints a small status line to the console after every time-step, and the inter-arrival time of those check-in messages doesn't appear to have slowed at all.
 
AFAIK HT performs best with a mixed workload, where one thread can use the integer resources while the other uses the floating-point resources. For float-only or integer-only workloads, HT has little to no benefit, and sometimes a negative one.

It really depends on what you are doing. Even with floating point or integer heavy computations, you can see benefits if you are doing a lot of memory access.

Take this benchmark as an example of that

http://www.tomshardware.com/reviews/winrar-winzip-7-zip-magicrar,3436-11.html

Compression is mostly integer-bound and does very little floating-point work, yet hyperthreading shows anywhere from a huge benefit to a negative impact across the methods. The difference is, IMO, memory access. The slower, higher-compression methods tend to have much larger dictionaries and deeper memory searches, so one thread's cache misses leave execution resources idle for the sibling thread to use. The fast compression methods, on the other hand, have much smaller dictionaries and shallower memory access, so there is little stall time for hyperthreading to fill.

Point being, you can't really say anything about the impacts of hyperthreading without running some tests of your own.
 
Would CUDA or a video card in general help with this (provided it was written in a more modern language with support)?
 
Regardless of the project, migrating from Fortran to CUDA would be a huge change. The first step would be to analyze the project to see whether massive parallelism like that is even possible. (Think thousands of parallel threads.)

I would say the second step should be to find the inner loop and implement it in C or C++. Apparently, mixing Fortran and C like this is possible.
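As a sketch of that mixing, the C side of an inner loop callable from Fortran might look like the following. (The function name, signature, and update rule are hypothetical; a Fortran caller would declare an interface with ISO_C_BINDING, e.g. `bind(C, name="step_kernel")`.)

```c
#include <stddef.h>

/* Hypothetical inner-loop kernel, callable from Fortran via
   ISO_C_BINDING. Fortran passes arrays as pointers, so the
   interface maps naturally onto C pointer arguments. */
void step_kernel(const double *state, double *next, size_t n, double dt) {
    for (size_t i = 0; i < n; i++)
        next[i] = state[i] + dt * state[i];  /* placeholder update rule */
}
```

Keeping the kernel's interface to plain pointers, sizes, and scalars is what makes the Fortran/C boundary painless.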

If massive GPU parallelism isn't workable, maybe SSE or AVX intrinsic functions in C could provide a parallelism boost? (They're not available in Fortran.) There's a lot less back-and-forth latency with intrinsics, but they can only operate on at most 4 doubles or 8 singles at a time, and the instruction set is limited.
 
I'm not sure that top is a good indicator for what you want to know.

The number you're looking at (the load average) is the average number of processes (or threads) in the run queue, i.e., processes that want to run. Processes that are sleeping or waiting on I/O aren't counted. A load of 7.9 means your 8 processes want to run essentially all the time. It says nothing about how much time each of them actually got to run (which isn't simply 100% when you use Hyper-Threading), and nothing about efficiency.

As said before here, the best way is to benchmark it. I don't even think the benchmarks need to take long; 10-30 seconds should be enough. Run your program for 30 seconds, see how much work gets done, and repeat for the different setups.
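A minimal fixed-duration harness along those lines might look like this in C (`workload()` here is a hypothetical stand-in for one simulation time-step):

```c
#define _POSIX_C_SOURCE 199309L
#include <time.h>

/* Monotonic wall-clock time in seconds. */
static double now_sec(void) {
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return ts.tv_sec + ts.tv_nsec * 1e-9;
}

/* Hypothetical stand-in for one simulation time-step; the volatile
   sink keeps the compiler from optimizing the work away. */
static volatile double sink;
static void workload(void) {
    double s = 0.0;
    for (int i = 1; i <= 100000; i++)
        s += 1.0 / i;
    sink = s;
}

/* Run the workload repeatedly for roughly `seconds` of wall time
   and report how many iterations completed. */
long run_for(double seconds) {
    long iters = 0;
    double start = now_sec();
    while (now_sec() - start < seconds) {
        workload();
        iters++;
    }
    return iters;
}
```

Run it once per instance with 4 instances, then with 8, and compare the summed iteration counts: if the total with 8 is higher, HT is paying off.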
 
I'm not sure on exactly what system you're using, but I like using the Zoom profiler from rotateright (http://www.rotateright.com/zoom/) to check the performance of my simulation code. It used to be a paid product but is free now.

If you use the perf driver, you can do a "scheduler trace" report, which shows how the threads are interacting. In my case, it was pretty obvious that the threads were waiting due to some false sharing in the initial versions of the code, which I was able to eliminate. It can also show you where stalls and misses occur. Although in your case it sounds like you have completely separate instances running that do not need to communicate with each other, so this may not be as helpful.

It does a pretty decent job of figuring out where time is being spent, and although it isn't perfect, I find that it gives a pretty good starting point for optimization. I started using this because gprof has major issues with multi-threaded code, and I find this better in practically every way even with single-threaded code.

Fair warning, I've only used this with C and C++, so I'm not sure how it will perform with Fortran. Just something to think about if you plan on going the route that Cogman suggested.
 
As others have already said, so much of this is code- and compiler-specific that you simply need to test your code and see whether using the hyper-threads gives you a throughput bonus. It sounds like you're essentially doing Monte Carlo simulations. The best test is to time how long it takes to complete the same fixed set of runs (say 16 or 32 total): once with only 4 instances, and once with 8. If the overall time to finish is lower with 8, then HT is indeed giving you a performance boost.

For reference, our code was able to gain 60% of a real core's performance on each hyperthreaded core when running on all real and hyperthreaded cores in the system. That's almost two-thirds of another machine for free, just by using the hyperthreaded cores.
 