Single Threaded FP Benchmarks Still Matter in the Real World.

mustache143

Junior Member
Sep 11, 2007
I have been tasked with evaluating hardware platforms at work for an extremely demanding, FP-intensive algorithm. The nature of it is confidential; however, the gist is that a large Lagrangian matrix needs to be solved for an optimal solution using double-precision floating-point calculations. The solution typically takes almost 100 million iterations and runs for about 4 minutes on a 1.15 GHz Alpha 21264. We have evaluated Itanium 2s at 1.6 GHz and have seen only a modest improvement (10-20%). The problem I am having is that all the FP benchmarks I find are multithreaded. Because my application needs the output of each iteration before the next one can run, I am stuck using only one core. Ideally, I would just port the application to Power 6 and x86-64 and run a benchmark. However, that port is extremely expensive, and I need a high level of confidence before taking on that expense. My guess is that Barcelona's FPU would offer the greatest performance for the money invested. However, a port to prove that cannot be justified unless I have a direct comparison of single-threaded performance between Alpha 21264, Itanium 2, Xeon, Opteron and Power5/6. Below are a few more details about the specific problem. Please respond if you have any suggestions or need more details.

FORTRAN algorithm
Double float precision
Very serial
Large matrix manipulation
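
As a rough illustration of the "very serial" point, here is a minimal sketch in Python of a loop-carried dependence. The update rule is a made-up stand-in (a Newton step for the square root of 2), not the confidential algorithm, and the names `iterate`, `x0` and `steps` are hypothetical. It only shows why the work cannot be spread across cores: each pass needs the value produced by the pass before it.

```python
def iterate(x0, steps):
    """Run a chain of updates where each step consumes the previous result."""
    x = x0
    for _ in range(steps):
        # Hypothetical update rule: Newton step for sqrt(2).
        # The new x depends on the old x, so steps cannot run concurrently.
        x = 0.5 * (x + 2.0 / x)
    return x

result = iterate(1.0, 50)
```

With a dependence like this, throughput is bounded by how fast one core can finish a single step, which is why only single-threaded numbers are relevant here.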
 

IntelUser2000

Elite Member
Oct 14, 2003
Unless your program is optimized purely to extract maximum performance from the CPU, and the CPU only, other variables such as memory bandwidth will matter. In pure FP performance, the Barcelona/Phenom and Core 2 cores probably perform similarly. But Core 2 has less memory bandwidth, which would put Barcelona/Phenom in the lead.

The best should be Power 6 with massive memory bandwidth, super high clocks and its dual FMACs. Nothing should come even close to that. That's in theory of course, so it could be different with your application.
 

Idontcare

Elite Member
Oct 10, 1999
Originally posted by: Accord99
Based on the single-threaded SPECfp2006 (a suite of common FP type applications) score, the fastest would be the Power6, followed closely by the 3GHz C2D E6850.

http://www.spec.org/cpu2006/results/cfp2006.html

Exactly. Lagrangian matrices are encountered in computational quantum chemistry (Gaussian, TURBOMOLE, NWChem, GAMESS, etc.), and I would expect anything that does well on the 416.gamess component to handle Lagrangian matrices well too.

You can read about the individual components of SPECfp here: http://www.spec.org/cpu2006/CFP2006/
 

firewolfsm

Golden Member
Oct 16, 2005
You would want a higher clocked dual core rather than a high-end quad. The E6850 will probably be your best bet, especially if overclocking is possible.
 

bdobz

Junior Member
Apr 18, 2007
I would bet that the Core 2 systems would beat Barcelona in these types of calculations... but if you could clarify your statement about requiring 100 million iterations to converge, and doing so (converging) in about 4 minutes, that would help. Is it four minutes PER iteration, or four minutes total? I guess what I'm getting at here is, how much memory does the code typically require? (I would've guessed on the small side if the entire problem runs in 4 minutes!)

I have run on a Barcelona and I've run on a Xeon 5xxx system (Core 2). In my codes, most of which iteratively solve large matrices, the Core 2 based systems are faster on a single processor. It IS interesting, on 8-core nodes, what happens when you scale out to using all cores, but since this doesn't interest you I won't get side-tracked. The fact of the matter is that for most applications I've seen, a 2.0 GHz Barcelona is just about the same as a 2.0 GHz Xeon when using a single core. But given that Xeons are available at much higher clock speeds, you'd do well to go that route.

Finally, what makes you think the 'port' would be difficult? Having ported SGI and RS/6000 apps to x86 systems over the years, I've found that these days it's often as simple as a recompile, plus maybe a few tweaks because of compiler differences (e.g., IBM allows redefinition of Fortran subroutines, whereas PGI on Linux complains by default, etc.), with no other issues. I'd say give it a shot; it might take less time than you're spending reading these forums. :)

Good luck!

(And, ooh, a PS since I can't resist: depending upon which algorithm you're using for solving the Lagrangian, you might be able to parallelize some of the solve... definitely the sort of thing you want to look into since, looking to the future, core counts are likely to increase more rapidly than clock speed.)

EDIT: I'm guessing four minutes total, since otherwise 100 million iterations at 4 minutes each is a long time to wait for results. So the memory footprint probably isn't very large, so go Core 2. I'd say, if this is really important, go with Xeon chips for ECC memory, but if performance is the be-all and end-all, just get a Core 2 desktop chip at the highest clock rate possible.
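
The "four minutes total" reading can be sanity-checked with quick arithmetic. This is a rough sketch using only the figures from the original post (almost 100 million iterations, about 4 minutes, a 1.15 GHz Alpha 21264); the variable names are just for illustration, and the cycle count assumes the whole run time is spent iterating.

```python
iterations = 100_000_000   # "almost 100 million iterations" (original post)
total_seconds = 4 * 60     # "about 4 minutes" total run time
clock_hz = 1.15e9          # 1.15 GHz Alpha 21264

# Average time budget for one iteration, in seconds.
seconds_per_iter = total_seconds / iterations

# The same budget expressed in clock cycles on the Alpha.
cycles_per_iter = seconds_per_iter * clock_hz
```

That works out to roughly 2.4 microseconds, or on the order of 2,760 clock cycles, per iteration. A budget that small suggests each iteration touches only a modest amount of data, which is consistent with the guess above that the memory footprint is small.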