Linpack Challenge


Hyperlite

Diamond Member
May 25, 2004
5,664
2
76
Heh. So Athlon II X4 scales basically 100%:

scaling1.jpg



scaling2.jpg



scaling3.jpg



currentoc.jpg
 
Last edited:

WildW

Senior member
Oct 3, 2008
984
20
81
evilpicard.com
Had a brief play with my E5400 last night: 3.6GHz (267*13.5). Due to lousy memory dividers my DDR2 is only running at 889MHz.

Depending how much memory I let it play with, it can score anywhere between about 17 and 21 GFLOPS. Around 512MB lets it score highest; anything smaller is over too quickly to score as high. Giving it all the memory drops the score to around 17. What a curious benchmark. Guess it's something that likes a fast memory subsystem.
 

Idontcare

Elite Member
Oct 10, 1999
21,110
64
91
Only did 5 runs each...

i7920(D0)@4.03 GHz

Compiled google doc for IDC:
LinXi920D0

Thanks dank69, I sent you a PM requesting access to the google doc; I think you have it set to non-share at the moment.

Heh. So Athlon II X4 scales basically 100%:

There is something goofy about your results; the computation times are in an entirely different ballpark from mine. Are you using max ram? (How much ram are you using?)

For example, your single-thread process time is ~96 seconds, whereas on my Q6600 the runs consistently require ~350 seconds to complete.

The comparisons between chips/architectures won't be apples-to-apples if we can't convince LinX to run the same matrices thru our processors (this goes for Lopri's gflops database too).

I have little experience with LinX, but looking at your "Problem Size" I see it is 11530 versus mine which is 16134...I'm wondering if we need to hold the "problem size" constant across all of our tests in order to make sure we are generating the same GFlops number from the same matrices?

Anyone have more experience with benching Linpack that can chime in here?

edit: yep I just did a test run and confirmed the problem size is what counts if we want to generate apples-to-apples thread scaling results:
LinXproblemsize11530.png
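For anyone who wants the reasoning in numbers: Linpack-style benchmarks report GFlops as a fixed operation count, which depends only on the problem size N, divided by the wall-clock time. A minimal sketch, assuming the standard HPL flop count of 2/3·N³ + 2·N² (the run times below are made up purely for illustration):

```python
# Sketch of how a Linpack-style GFlops number is derived. Assumes the
# standard HPL operation count of 2/3*N^3 + 2*N^2 flops for an N x N solve;
# the run times below are made up purely for illustration.

def linpack_gflops(n, seconds):
    """GFlops for a problem size n that completed in `seconds`."""
    flops = (2.0 / 3.0) * n**3 + 2.0 * n**2
    return flops / seconds / 1e9

# Same problem size: times are directly comparable, and so are the GFlops.
print(linpack_gflops(16134, 70.0))   # hypothetical 70 s pass
print(linpack_gflops(16134, 350.0))  # hypothetical 350 s pass

# Different problem size: even an identical time gives a different GFlops
# figure, because the flop count itself changed with N.
print(linpack_gflops(11530, 70.0))
```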
 
Last edited:

Rubycon

Madame President
Aug 10, 2005
17,768
485
126
You're going to want to select a problem size that's large but not too large, as some users may have 4GB or less total ram. If the ram size needed is close to what's free and the system starts paging excessively, the numbers will be skewed downward!
 

Idontcare

Elite Member
Oct 10, 1999
21,110
64
91
Yeah, I am noticing that if you have a 32-bit OS then LinX sets the max problem size to 16134, and max memory usage actually tops out at 1999MB even if you have more ram available.

LinXmaxramusagefor32bitOS.png


So it looks like the most reasonable thing to do is for everyone to use problem size 16134 (the max allowed on a 32-bit OS, and it only uses 2GB of ram, which most people here are likely to have). That way at least we know everyone's rig is plowing thru the same set of matrices in Linpack, so the GFlops numbers are apples-to-apples.
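As a sanity check on that ~2GB ceiling, the bulk of the footprint is the N x N matrix of 8-byte doubles, so the arithmetic works out like this (a back-of-envelope sketch assuming the 8·N² approximation; LinX adds some workspace on top):

```python
# Back-of-envelope memory footprint for Linpack problem size N, assuming the
# dominant cost is the N x N matrix of 8-byte doubles (LinX tacks some
# workspace on top of this, hence the ~1999MB it actually reports).

def matrix_mib(n):
    return 8 * n * n / (1024 ** 2)

print(f"N=16134 -> {matrix_mib(16134):.0f} MiB")  # ~1986 MiB, just under 2GB
print(f"N=11530 -> {matrix_mib(11530):.0f} MiB")  # ~1014 MiB
```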
 

Idontcare

Elite Member
Oct 10, 1999
21,110
64
91
I just did a quickie scaling calc using dank69's data (only the peak; I did not enter all the data to compute the average, which would be preferred) to show the roll-off in thread scaling that occurs:

Corei79204GHzwithHT.png


This roll-off could be "artificial" because of resource contention in the implementation of hyperthreading or it could be "real" owing to resource contention in terms of bandwidth and latency at the IMC with that many active threads.

dank69, if you don't mind running more tests, would you please run them again using problem size 16134 with hyperthreading disabled (so 1-4 threads), and again with hyperthreading enabled (so 1-8 threads)?

Also what is your ram speed?

edit: nevermind the ram speed - I see it now in the spreadsheet. Thanks!
 
Last edited:

dank69

Lifer
Oct 6, 2009
37,349
32,977
136
I will try to run some more tests tonight when I get home. The good news is the tests should run a lot quicker at the smaller problem size. The ram speed is ~1684 MHz (I think I listed some system info in the google doc at the bottom). EDIT: I see you found it.

The 1684 is what I think I remember seeing in the BIOS. The BIOS settings are 202x20, 2:8, 16x; not sure how that = 1684 when 200x20, 2:8, 16x = 1600...
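For what it's worth, here's the arithmetic I'd expect, assuming the 2:8 ratio means the DRAM runs at 8x BCLK (just a sketch of my understanding, not a statement of what the board is actually doing):

```python
# Sketch of the memory-speed arithmetic, assuming "2:8" means the DRAM runs
# at 8x BCLK (and guessing the 16x is the uncore multiplier). Illustrative
# only; the board may be reporting something else entirely.
bclk = 202        # MHz, from the 202x20 setting
cpu_mult = 20
dram_mult = 8     # assumed meaning of the 2:8 ratio

print("CPU :", bclk * cpu_mult, "MHz")    # 4040 MHz
print("DRAM:", bclk * dram_mult, "MT/s")  # 1616 MT/s -- 1684 doesn't fit this math
```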

I also want to point out that if you look at the 3 and 4 thread non-HT results, the 920 gets back to 96-97% scaling. HT seems to really hurt LinX performance.
 
Last edited:

Idontcare

Elite Member
Oct 10, 1999
21,110
64
91
Cool. Yeah, if in doubt about your ram speed/timings, just pull open a CPU-Z window and do a quick check. I hope you don't mind that I added a column for standard deviation to your spreadsheet and added +/- 3-sigma error bars to my scaling graph above, to highlight the fact that LinX went a little bi-modal in the 3 and 4 thread runs.
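In case anyone wants to reproduce the spreadsheet math, the stdev column and the error bars are just the usual sample statistics over the individual passes, i.e. something like this (the run values are made up for illustration):

```python
# Sample mean, standard deviation, and +/- 3-sigma range over a set of
# LinX passes. The GFlops values below are made up for illustration
# (two loose clusters, mimicking a bi-modal run).
import statistics

runs = [58.1, 58.3, 51.2, 58.0, 51.5]   # hypothetical GFlops from 5 passes

mean = statistics.mean(runs)
sigma = statistics.stdev(runs)           # sample standard deviation

print(f"mean  = {mean:.2f} GFlops")
print(f"stdev = {sigma:.2f} GFlops")
print(f"+/- 3-sigma band: {mean - 3*sigma:.2f} .. {mean + 3*sigma:.2f}")
```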
 

dank69

Lifer
Oct 6, 2009
37,349
32,977
136
No problem, I gave you r/w access so you can tweak however you need to.

I think some background process(es) might have fired up during the 3-4 thread HT runs, skewing the results.
 

Idontcare

Elite Member
Oct 10, 1999
21,110
64
91
It's also the kind of symptom we'd expect to see if the thread scheduler is double-teaming two threads onto the same physical core, not realizing another physical core is sitting idle/unused.

So it needlessly invokes resource contention with hyperthreading on a core or two.
 

dank69

Lifer
Oct 6, 2009
37,349
32,977
136
It's also the kind of symptom we'd expect to see if the thread scheduler is double-teaming two threads onto the same physical core, not realizing another physical core is sitting idle/unused.

So it needlessly invokes resource contention with hyperthreading on a core or two.

Well, looking at the STDEV column you added, I think that alone adds a little insight. You can see even from the small sample size that the 1 and 8 thread passes have the smallest deviation, and it scales up as you move to the 2 and 7 thread runs, and again to the 3 and 6 thread runs. From the temp readings during the tests it was obvious that the threads were bouncing from core to core faster than RealTemp could update its output. So, with a single thread you never have 2 threads running on the same core (ignoring any minor OS threads). With 2 threads running you see a slight increase in deviation due to the times when the scheduler runs both on a single core. With 3 threads you get an even higher chance that 2 threads will go to one core while 2 cores sit unused.

Working back from 8 threads you see a similar pattern: 8 threads = always 2 threads per core. Hmm, I would think 7 threads would show a deviation similar to 1 thread, as you will always have 3 cores with 2 threads and 1 core with 1 thread, but it is slightly higher. 6 threads introduces the same probability as 2 threads, in that 1 core could go unused part of the time.
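To put a rough number on that intuition, here's a quick toy simulation of a scheduler that places threads on logical CPUs at random with no awareness of HT (purely my own assumption, not how Windows actually behaves):

```python
# Toy model of "the scheduler doesn't see HT": drop k threads onto 8 logical
# CPUs (4 cores x 2) uniformly at random, one thread per logical CPU, and count
# how often a core ends up running 2 threads while another core sits idle.
# Purely illustrative; a real scheduler is smarter than this.
import random
from collections import Counter

CORES, SMT = 4, 2
LOGICAL = [(core, t) for core in range(CORES) for t in range(SMT)]

def doubled_up_with_idle_core(k):
    picked = random.sample(LOGICAL, k)
    per_core = Counter(core for core, _ in picked)
    return max(per_core.values()) == 2 and len(per_core) < CORES

TRIALS = 100_000
for k in range(1, 9):
    hits = sum(doubled_up_with_idle_core(k) for _ in range(TRIALS))
    print(f"{k} threads: bad placement in {hits / TRIALS:.0%} of trials")
```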

TBH though, I'm a fkn noob to all of this so anything I am thinking should be taken with a grain of salt, preferably after a shot of tequila.
 

Hyperlite

Diamond Member
May 25, 2004
5,664
2
76
alright i'll re-run with that problem size later today when i get some free system time.

side note: anyone like to hazard a guess as to what TMPIN(x) in HWmonitor are?
 

Mir96TA

Golden Member
Oct 21, 2002
1,950
37
91
Here is my little AMD at stock with 2 unlocked cores (stock clock).
Didn't want to run it 20 times.
39.7265 GFlops
GFlops.jpg
 

lopri

Elite Member
Jul 27, 2002
13,314
690
126
anyone like to hazard a guess as to what TMPIN(x) in HWmonitor are?
On my system TMPIN0 matches "motherboard" temp, and TMPIN1 matches "CPU" temp (not the individual core but somewhere around the socket, I think). No idea where TMPIN2 comes from.



BTW, I'm loving the HD 5870's idle temp. The above shot was taken while running LinX and room temp was around 68F. It sits quietly @38C and is cooler than a HDD! (The 41C HDD is running a couple VMs, though. Hehe)
 

lopri

Elite Member
Jul 27, 2002
13,314
690
126
I've run it with varying thread counts, but with a fixed problem size of 15000. LinX assigned 1729MB of memory.

CPU: Phenom II 955 BE
Core Frequency: 3600 MHz
NB/Uncore Frequency: 2250 MHz
Memory Frequency: DDR2-900 (5-5-5-15)

Thread Count : GFlops

1 : 12.24 GFlops
2 : 23.96 GFlops
3 : 35.20 GFlops
4 : 45.73 GFlops


http://cid-17de86f1059defe0.skydriv...%20Challenge/Perf^_per^_Thread^_955BE.jpg
(898MHz reported by cpu-z is due to C'nQ)

CPU: Core 2 Duo E8400
Core Frequency: 3600 MHz
Memory Frequency: DDR2-900 (5-5-5-15)
Thread Count : GFlops

1 : 13.42 GFlops
2 : 25.28 GFlops

http://cid-17de86f1059defe0.skydriv...inpack Challenge/Perf^_per^_Thread^_E8400.jpg
 
Last edited:

lopri

Elite Member
Jul 27, 2002
13,314
690
126
You're going to want to select a problem size that's large but not too large
That's what I was trying to say the other day. A larger problem size will likely produce a better score, but that itself encounters diminishing returns.

Johan De Gelas said:
So we started an in depth comparison of the 45 nm Opterons, Xeons and Core i7 CPUs. One of our benchmarks, the famous LINPACK (you can read all about it here) painted a pretty interesting performance picture. We had to test with a matrix size of 18000 (2.5 GB of RAM necessary), as we only had 3 GB of DDR-3 on the Core i7 platform. That should not be a huge problem as we tested with only one CPU. We normally need about 4 GB for each quadcore CPU to reach the best performance.

http://it.anandtech.com/weblog/showpost.aspx?i=528

Judging from Dank69's results, it looks like Linpack doesn't know what to do with HyperThreading. (or vice versa) The Flops shouldn't vary that much for each pass. The results become much more consistent with HT off, and are in line with n7's. (4.0 GHz ~ 60 GFlops)
 

Hyperlite

Diamond Member
May 25, 2004
5,664
2
76
OK i don't feel like posting screens again but here's what i did:

16133 problem size (2048mb ram)
Athlon II X4 @ 3211mhz, 1976mhz HT, 2476mhz FSB
4GB DDR3 @ 9-9-9-24

4 threads: 40.34 Gflops
3 threads: 31.55 Gflops
2 threads: 19.29 Gflops
1 thread: 9.24 Gflops
 

Idontcare

Elite Member
Oct 10, 1999
21,110
64
91
OK i don't feel like posting screens again but here's what i did:

16133 problem size (2048mb ram)
Athlon II X4 @ 3211mhz, 1976mhz HT, 2476mhz FSB
4GB DDR3 @ 9-9-9-24

4 threads: 40.34 Gflops
3 threads: 31.55 Gflops
2 threads: 19.29 Gflops
1 thread: 9.24 Gflops

I'm not questioning the validity of your results; I believe them. I just can't fathom a computer-science-based reason to explain them.

Every thread you added to your computation resulted in super-linear speedup, which just defies reason.

I've seen cases of legitimate super-linear speedup in the past, but those were explainable: certain datasets fell outside cache boundaries when fewer cores were used, and as more cores were added to the system/calculation the computations were able to fit inside a higher-tier (faster) level of cache, so performance suddenly sped up at a rate that exceeded linear.

Here's the graph of your data. Note how it all falls above the black line, which represents the theoretical maximum speedup provided by linear scaling (attainable only if the code is 100% parallelized and there is zero interprocessor communications delay).

LinXAthlonIIX4.png
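Just to show the arithmetic behind that graph, here are the speedups computed straight from the GFlops you posted; anything above n for n threads is super-linear (a trivial sketch):

```python
# Speedups computed from the GFlops posted above for the Athlon II X4
# at problem size 16133. Anything above n for n threads is super-linear,
# i.e. above the black line in the graph.
gflops = {1: 9.24, 2: 19.29, 3: 31.55, 4: 40.34}

base = gflops[1]
for n, g in gflops.items():
    print(f"{n} threads: {g / base:.2f}x speedup (linear limit is {n}x)")
# -> 2.09x, 3.41x, 4.37x for 2, 3, 4 threads: all above linear.
```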


Unless you have some weird confluence of thread migration and CnQ settings that is resulting in a serious performance penalty when running any application that uses fewer than all four cores?

To check this theory, could you run the single-thread test again, only this time disable CnQ (if it is enabled) and use Task Manager to set the affinity so that LinX is forced to use only one core during the test?

I'm curious to see how much higher than 9.24 Gflops the single-threaded test turns out when you do that.
 

Idontcare

Elite Member
Oct 10, 1999
21,110
64
91
I've run it with varying thread counts, but with a fixed problem size of 15000. LinX assigned 1729MB of memory.

CPU: Phenom II 955 BE
Core Frequency: 3600 MHz
NB/Uncore Frequency: 2250 MHz
Memory Frequency: DDR2-900 (5-5-5-15)

Thread Count : GFlops

1 : 12.24 GFlops
2 : 23.96 GFlops
3 : 35.20 GFlops
4 : 45.73 GFlops

Damn lopri, that Deneb architecture really rocks at LinX thread scaling. Your data falls almost precisely on the Amdahl limit (the red line, which is not a curve fit to your data despite looking like it), implying interprocessor communications are not a rate-limiting step in any of the computations whatsoever. (Which could be taken to mean either that the IMC is silly awesome compared to the FSB/MC of the old Kentsfield/Yorkfield rigs, or that the L3$ on Phenom II is just that good.)

LinXPhenomIIX4.png
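For the curious, Amdahl's law is S(n) = 1/((1-p) + p/n); backing the parallel fraction p out of lopri's measured speedups comes out to roughly 98% at every thread count. A quick sketch of that calc (my own check, not necessarily how the red line was constructed):

```python
# Amdahl's-law sanity check on lopri's Phenom II 955 numbers. S(n) = g(n)/g(1),
# and Amdahl gives S(n) = 1 / ((1 - p) + p/n), so p = (1 - 1/S) / (1 - 1/n).
# My own quick calc, not necessarily how the red line in the graph was drawn.
gflops = {1: 12.24, 2: 23.96, 3: 35.20, 4: 45.73}

base = gflops[1]
for n, g in gflops.items():
    if n == 1:
        continue
    s = g / base
    p = (1 - 1 / s) / (1 - 1 / n)
    print(f"{n} threads: speedup {s:.2f}x -> implied parallel fraction ~{p:.1%}")
# Comes out around 97-98% parallel at every thread count.
```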


That's amazing


Now if only we had the thread scaling results for 1-4 threads on a Nehalem with HT turned off...
 

Hyperlite

Diamond Member
May 25, 2004
5,664
2
76
I'm not questioning the validity of your results; I believe them. I just can't fathom a computer-science-based reason to explain them.

Every thread you added to your computation resulted in super-linear speedup, which just defies reason.

I've seen cases of legitimate super-linear speedup in the past, but those were explainable: certain datasets fell outside cache boundaries when fewer cores were used, and as more cores were added to the system/calculation the computations were able to fit inside a higher-tier (faster) level of cache, so performance suddenly sped up at a rate that exceeded linear.

Here's the graph of your data. Note how it all falls above the black line, which represents the theoretical maximum speedup provided by linear scaling (attainable only if the code is 100% parallelized and there is zero interprocessor communications delay).

LinXAthlonIIX4.png


Unless you have some weird confluence of thread migration and CnQ settings that is resulting in a serious performance penalty when running any application that uses fewer than all four cores?

To check this theory, could you run the single-thread test again, only this time disable CnQ (if it is enabled) and use Task Manager to set the affinity so that LinX is forced to use only one core during the test?

I'm curious to see how much higher than 9.24 Gflops the single-threaded test turns out when you do that.

interesting. brb with results!
 

Hyperlite

Diamond Member
May 25, 2004
5,664
2
76
affinity0.jpg


CnQ OFF. LinX affinity 0. 1 thread. Beats me IDC!! any other ideas?

Heh, at least my OC is stable...albeit backed up against a wall the size of the Three Gorges Dam.
 
Last edited:

Idontcare

Elite Member
Oct 10, 1999
21,110
64
91
CnQ OFF. LinX affinity 0. 1 thread. Beats me IDC!! any other ideas?

Heh, at least my OC is stable...albeit backed up against a wall the size of the Three Gorges Dam.

Bingo! You got it. Your single-threaded performance jumped from 9.24 GFlops to 10.76 GFlops.

That makes the 40.34 GFlops value with four threads (which won't be affected by core affinity) represent a thread scaling of 3.75, right in line with lopri's results.

Super-linear speedup results explained: your single-threaded performance was sucking wind.

If you have the time/energy/desire, would you mind re-running the 2-thread test with LinX affinity set to two cores, and again for the 3-thread test (affinity locked to 3 cores), for the sake of completeness?