Linpack Challenge


Hyperlite

Diamond Member
May 25, 2004
5,664
2
76
Heh. So Athlon II X4 scales basically 100%:

scaling1.jpg



scaling2.jpg



scaling3.jpg



currentoc.jpg
 
Last edited:

WildW

Senior member
Oct 3, 2008
984
20
81
evilpicard.com
Had a brief play with my E5400 last night: 3.6GHz (267*13.5). Due to lousy memory dividers my DDR2 is only running at 889MHz.

Depending how much memory I let it play with, it can score anywhere between about 17 and 21 GFLOPS. Around 512MB lets it score highest; anything smaller is over too quickly to score as high. Giving it all the memory drops the score to around 17. What a curious benchmark. Guess it's something that likes a fast memory subsystem.
 

Idontcare

Elite Member
Oct 10, 1999
21,110
64
91
Only did 5 runs each...

i7920(D0)@4.03 GHz

Compiled google doc for IDC:
LinXi920D0

Thanks dank69, I sent you a PM requesting access to the google doc; I think you have it set to non-share at the moment.

Heh. So Athlon II X4 scales basically 100%:

There is something goofy about your results; the computation times are in an entirely different ballpark from mine. Are you using max ram? (How much ram are you using?)

For example, your single-thread process time is ~96 seconds, whereas on my Q6600 the runs consistently require ~350 seconds to complete.

The comparisons between chips/architectures won't be apples-to-apples if we can't convince LinX to run the same matrices thru our processors (this goes for Lopri's gflops database too).

I have little experience with LinX, but looking at your "Problem Size" I see it is 11530 versus mine which is 16134...I'm wondering if we need to hold the "problem size" constant across all of our tests in order to make sure we are generating the same GFlops number from the same matrices?

Anyone have more experience with benching Linpack that can chime in here?

edit: yep I just did a test run and confirmed the problem size is what counts if we want to generate apples-to-apples thread scaling results:
LinXproblemsize11530.png
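For anyone who wants the reasoning in numbers: Linpack-style benchmarks report GFlops as a fixed operation count, which depends only on the problem size N, divided by the wall-clock time. A minimal sketch, assuming the standard HPL flop count of 2/3·N³ + 2·N² (the run times below are made up purely for illustration):

```python
# Sketch of how a Linpack-style GFlops number is derived. Assumes the
# standard HPL operation count of 2/3*N^3 + 2*N^2 flops for an N x N solve;
# the run times below are made up purely for illustration.

def linpack_gflops(n, seconds):
    """GFlops for a problem size n that completed in `seconds`."""
    flops = (2.0 / 3.0) * n**3 + 2.0 * n**2
    return flops / seconds / 1e9

# Same problem size: times are directly comparable, and so are the GFlops.
print(linpack_gflops(16134, 70.0))   # hypothetical 70 s pass
print(linpack_gflops(16134, 350.0))  # hypothetical 350 s pass

# Different problem size: even an identical time gives a different GFlops
# figure, because the flop count itself changed with N.
print(linpack_gflops(11530, 70.0))
```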
 
Last edited:

Rubycon

Madame President
Aug 10, 2005
17,768
485
126
You're going to want to select a problem size that's large but not too large, as some users may have 4GB or less total ram. If the ram size needed is close to what's free and the system starts paging excessively, the numbers will be skewed downward!
 

Idontcare

Elite Member
Oct 10, 1999
21,110
64
91
Yeah, I am noticing that if you have a 32-bit OS then LinX sets the max problem size to 16134, and max memory usage actually tops out at 1999MB even if you have more ram available.

LinXmaxramusagefor32bitOS.png


So it looks like the most reasonable thing to do is for everyone to use problem size 16134 (the max allowed on a 32-bit OS, and it only uses 2GB of ram, which most people here are likely to have). That way at least we know everyone's rig is plowing thru the same set of matrices in Linpack, so the GFlops numbers are apples-to-apples.
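As a sanity check on that ~2GB ceiling, the bulk of the footprint is the N x N matrix of 8-byte doubles, so the arithmetic works out like this (a back-of-envelope sketch assuming the 8·N² approximation; LinX adds some workspace on top):

```python
# Back-of-envelope memory footprint for Linpack problem size N, assuming the
# dominant cost is the N x N matrix of 8-byte doubles (LinX tacks some
# workspace on top of this, hence the ~1999MB it actually reports).

def matrix_mib(n):
    return 8 * n * n / (1024 ** 2)

print(f"N=16134 -> {matrix_mib(16134):.0f} MiB")  # ~1986 MiB, just under 2GB
print(f"N=11530 -> {matrix_mib(11530):.0f} MiB")  # ~1014 MiB
```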
 

Idontcare

Elite Member
Oct 10, 1999
21,110
64
91
I just did a quickie scaling calc using dank69's data (only the peak; I did not enter all the data to compute the average, which would be preferred) to show the roll-off in thread scaling that occurs:

Corei79204GHzwithHT.png


This roll-off could be "artificial" because of resource contention in the implementation of hyperthreading or it could be "real" owing to resource contention in terms of bandwidth and latency at the IMC with that many active threads.

dank69, if you don't mind running more tests, would you please run them again using problem size 16134 with hyperthreading disabled (so 1-4 threads), and again with hyperthreading enabled (so 1-8 threads)?

Also what is your ram speed?

edit: nevermind the ram speed - I see it now in the spreadsheet. Thanks!
 
Last edited:

dank69

Lifer
Oct 6, 2009
37,349
32,977
136
I will try to run some more tests tonight when I get home. The good news is the tests should run a lot quicker at the smaller problem size. The ram speed is ~1684 MHz (I think I listed some system info in the google doc at the bottom). EDIT: I see you found it.

The 1684 is what I think I remember seeing in the BIOS. The BIOS settings are 202x20, 2:8, 16x; not sure how that = 1684 when 200x20, 2:8, 16x = 1600...
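For what it's worth, here's the arithmetic I'd expect, assuming the 2:8 ratio means the DRAM runs at 8x BCLK (just a sketch of my understanding, not a statement of what the board is actually doing):

```python
# Sketch of the memory-speed arithmetic, assuming "2:8" means the DRAM runs
# at 8x BCLK (and guessing the 16x is the uncore multiplier). Illustrative
# only; the board may be reporting something else entirely.
bclk = 202        # MHz, from the 202x20 setting
cpu_mult = 20
dram_mult = 8     # assumed meaning of the 2:8 ratio

print("CPU :", bclk * cpu_mult, "MHz")    # 4040 MHz
print("DRAM:", bclk * dram_mult, "MT/s")  # 1616 MT/s -- 1684 doesn't fit this math
```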

I also want to point out that if you look at the 3 and 4 thread non-HT results, the 920 gets back to 96-97% scaling. HT seems to really hurt LinX performance.
 
Last edited:

Idontcare

Elite Member
Oct 10, 1999
21,110
64
91
Cool. Yeah, if in doubt about your ram speed/timings, just pull open a CPU-Z window and do a quick check. I hope you don't mind that I added a column for standard deviation to your spreadsheet and added +/- 3-sigma error bars to my scaling graph above, to highlight the fact that LinX went a little bi-modal in the 3 and 4 thread runs.
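In case anyone wants to reproduce the spreadsheet math, the stdev column and the error bars are just the usual sample statistics over the individual passes, i.e. something like this (the run values are made up for illustration):

```python
# Sample mean, standard deviation, and +/- 3-sigma range over a set of
# LinX passes. The GFlops values below are made up for illustration
# (two loose clusters, mimicking a bi-modal run).
import statistics

runs = [58.1, 58.3, 51.2, 58.0, 51.5]   # hypothetical GFlops from 5 passes

mean = statistics.mean(runs)
sigma = statistics.stdev(runs)           # sample standard deviation

print(f"mean  = {mean:.2f} GFlops")
print(f"stdev = {sigma:.2f} GFlops")
print(f"+/- 3-sigma band: {mean - 3*sigma:.2f} .. {mean + 3*sigma:.2f}")
```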
 

dank69

Lifer
Oct 6, 2009
37,349
32,977
136
No problem, I gave you r/w access so you can tweak however you need to.

I think some background process(es) might have fired up during the 3-4 thread HT runs, skewing the results.
 

Idontcare

Elite Member
Oct 10, 1999
21,110
64
91
It's also the kind of symptom we'd expect to see if the thread scheduler is double-teaming two threads onto the same physical core, not realizing another physical core is sitting idle/unused.

So it needlessly invokes resource contention with hyperthreading on a core or two.
 

dank69

Lifer
Oct 6, 2009
37,349
32,977
136
It's also the kind of symptom we'd expect to see if the thread scheduler is double-teaming two threads onto the same physical core, not realizing another physical core is sitting idle/unused.

So it needlessly invokes resource contention with hyperthreading on a core or two.

Well, looking at the STDEV column you added, I think that alone adds a little insight. You can see even from the small sample size that the 1 and 8 thread passes have the smallest deviation, and it scales up as you move to the 2 and 7 thread runs, and again to the 3 and 6 thread runs. From the temp readings during the tests it was obvious that the threads were bouncing from core to core faster than RealTemp could update its output. So, with a single thread you never have 2 threads running on the same core (ignoring any minor OS threads). With 2 threads running you see a slight increase in deviation due to the times when the scheduler runs both on a single core. With 3 threads you get an even higher chance that 2 threads will go to one core while 2 cores sit unused.

Working back from 8 threads you see a similar pattern: 8 threads = always 2 threads per core. Hmm, I would think 7 threads would show a deviation similar to 1 thread, as you will always have 3 cores with 2 threads and 1 core with 1 thread, but it is slightly higher. 6 threads introduces the same probability as 2 threads, in that 1 core could go unused part of the time.
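To put a rough number on that intuition, here's a quick toy simulation of a scheduler that places threads on logical CPUs at random with no awareness of HT (purely my own assumption, not how Windows actually behaves):

```python
# Toy model of "the scheduler doesn't see HT": drop k threads onto 8 logical
# CPUs (4 cores x 2) uniformly at random, one thread per logical CPU, and count
# how often a core ends up running 2 threads while another core sits idle.
# Purely illustrative; a real scheduler is smarter than this.
import random
from collections import Counter

CORES, SMT = 4, 2
LOGICAL = [(core, t) for core in range(CORES) for t in range(SMT)]

def doubled_up_with_idle_core(k):
    picked = random.sample(LOGICAL, k)
    per_core = Counter(core for core, _ in picked)
    return max(per_core.values()) == 2 and len(per_core) < CORES

TRIALS = 100_000
for k in range(1, 9):
    hits = sum(doubled_up_with_idle_core(k) for _ in range(TRIALS))
    print(f"{k} threads: bad placement in {hits / TRIALS:.0%} of trials")
```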

TBH though, I'm a fkn noob to all of this so anything I am thinking should be taken with a grain of salt, preferably after a shot of tequila.
 

Hyperlite

Diamond Member
May 25, 2004
5,664
2
76
alright i'll re-run with that problem size later today when i get some free system time.

side note: anyone like to hazard a guess as to what TMPIN(x) in HWmonitor are?
 

Mir96TA

Golden Member
Oct 21, 2002
1,950
37
91
Here is my little AMD at stock with 2 unlocked cores (stock clock).
Didn't want to run it 20 times.
39.7265 GFlops
GFlops.jpg
 

lopri

Elite Member
Jul 27, 2002
13,314
690
126
anyone like to hazard a guess as to what TMPIN(x) in HWmonitor are?
On my system TMPIN0 matches "motherboard" temp, and TMPIN1 matches "CPU" temp (not the individual core but somewhere around the socket, I think). No idea where TMPIN2 comes from.



BTW, I'm loving the HD 5870's idle temp. The above shot was taken while running LinX and room temp was around 68F. It sits quietly @38C and is cooler than a HDD! (The 41C HDD is running a couple VMs, though. Hehe)
 

lopri

Elite Member
Jul 27, 2002
13,314
690
126
I've run it with varying thread counts, but with a fixed problem size of 15000. LinX assigned 1729MB of memory.

CPU: Phenom II 955 BE
Core Frequency: 3600 MHz
NB/Uncore Frequency: 2250 MHz
Memory Frequency: DDR2-900 (5-5-5-15)

Thread Count : GFlops

1 : 12.24 GFlops
2 : 23.96 GFlops
3 : 35.20 GFlops
4 : 45.73 GFlops


http://cid-17de86f1059defe0.skydriv...%20Challenge/Perf^_per^_Thread^_955BE.jpg
(898MHz reported by cpu-z is due to C'nQ)

CPU: Core 2 Duo E8400
Core Frequency: 3600 MHz
Memory Frequency: DDR2-900 (5-5-5-15)
Thread Count : GFlops

1 : 13.42 GFlops
2 : 25.28 GFlops

http://cid-17de86f1059defe0.skydriv...inpack Challenge/Perf^_per^_Thread^_E8400.jpg
 
Last edited:

lopri

Elite Member
Jul 27, 2002
13,314
690
126
You're going to want to select a problem size that's large but not too large
That's what I was trying to say the other day. A larger problem size will likely produce a better score, but that itself encounters diminishing returns.

Johan De Gelas said:
So we started an in depth comparison of the 45 nm Opterons, Xeons and Core i7 CPUs. One of our benchmarks, the famous LINPACK (you can read all about it here) painted a pretty interesting performance picture. We had to test with a matrix size of 18000 (2.5 GB of RAM necessary), as we only had 3 GB of DDR-3 on the Core i7 platform. That should not be a huge problem as we tested with only one CPU. We normally need about 4 GB for each quadcore CPU to reach the best performance.

http://it.anandtech.com/weblog/showpost.aspx?i=528

Judging from Dank69's results, it looks like Linpack doesn't know what to do with HyperThreading. (or vice versa) The Flops shouldn't vary that much for each pass. The results become much more consistent with HT off, and are in line with n7's. (4.0 GHz ~ 60 GFlops)
 

Hyperlite

Diamond Member
May 25, 2004
5,664
2
76
OK i don't feel like posting screens again but here's what i did:

16133 problem size (2048mb ram)
Athlon II X4 @ 3211mhz, 1976mhz HT, 2476mhz FSB
4GB DDR3 @ 9-9-9-24

4 threads: 40.34 Gflops
3 threads: 31.55 Gflops
2 threads: 19.29 Gflops
1 thread: 9.24 Gflops
 

Idontcare

Elite Member
Oct 10, 1999
21,110
64
91
OK i don't feel like posting screens again but here's what i did:

16133 problem size (2048mb ram)
Athlon II X4 @ 3211mhz, 1976mhz HT, 2476mhz FSB
4GB DDR3 @ 9-9-9-24

4 threads: 40.34 Gflops
3 threads: 31.55 Gflops
2 threads: 19.29 Gflops
1 thread: 9.24 Gflops

I'm not questioning the validity of your results; I believe them. I just can't fathom a computer-science-based reason to explain them.

Every thread you added to your computation resulted in super-linear speedup, which just defies reason.

I've seen cases of legitimate super-linear speedup in the past, but those were explainable: certain datasets fell outside cache boundaries when fewer cores were used, and as more cores were added to the system/calculation the computations were able to fit inside a higher-tier (faster) level of cache, so performance suddenly sped up at a rate that exceeded linear.

Here's the graph of your data. Note how it all falls above the black line, which represents the theoretical maximum speedup provided by linear scaling (attainable only if the code is 100% parallelized and there is zero interprocessor communications delay).

LinXAthlonIIX4.png
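Just to show the arithmetic behind that graph, here are the speedups computed straight from the GFlops you posted; anything above n for n threads is super-linear (a trivial sketch):

```python
# Speedups computed from the GFlops posted above for the Athlon II X4
# at problem size 16133. Anything above n for n threads is super-linear,
# i.e. above the black line in the graph.
gflops = {1: 9.24, 2: 19.29, 3: 31.55, 4: 40.34}

base = gflops[1]
for n, g in gflops.items():
    print(f"{n} threads: {g / base:.2f}x speedup (linear limit is {n}x)")
# -> 2.09x, 3.41x, 4.37x for 2, 3, 4 threads: all above linear.
```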


Unless you have some weird confluence of thread migration and CnQ settings that is resulting in a serious performance penalty when running any application that uses fewer than all four cores?

To check this theory, could you run the single-thread test again, only this time disable CnQ (if it is enabled) and use Task Manager to set the affinity so that LinX is forced to use only one core during the test?

I'm curious to see how much higher than 9.24 Gflops the single-threaded test turns out when you do that.
 

Idontcare

Elite Member
Oct 10, 1999
21,110
64
91
I've run it with varying thread counts, but with a fixed problem size of 15000. LinX assigned 1729MB of memory.

CPU: Phenom II 955 BE
Core Frequency: 3600 MHz
NB/Uncore Frequency: 2250 MHz
Memory Frequency: DDR2-900 (5-5-5-15)

Thread Count : GFlops

1 : 12.24 GFlops
2 : 23.96 GFlops
3 : 35.20 GFlops
4 : 45.73 GFlops

Damn lopri, that Deneb architecture really rocks at LinX thread scaling. Your data falls almost precisely on the Amdahl limit (the red line, which is not a curve fit to your data despite looking like it), implying interprocessor communications are not a rate-limiting step in any of the computations whatsoever. (Which could be taken to mean either that the IMC is silly awesome compared to the FSB/MC of the old Kentsfield/Yorkfield rigs, or that the L3$ on Phenom II is just that good.)

LinXPhenomIIX4.png
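For the curious, Amdahl's law is S(n) = 1/((1-p) + p/n); backing the parallel fraction p out of lopri's measured speedups comes out to roughly 98% at every thread count. A quick sketch of that calc (my own check, not necessarily how the red line was constructed):

```python
# Amdahl's-law sanity check on lopri's Phenom II 955 numbers. S(n) = g(n)/g(1),
# and Amdahl gives S(n) = 1 / ((1 - p) + p/n), so p = (1 - 1/S) / (1 - 1/n).
# My own quick calc, not necessarily how the red line in the graph was drawn.
gflops = {1: 12.24, 2: 23.96, 3: 35.20, 4: 45.73}

base = gflops[1]
for n, g in gflops.items():
    if n == 1:
        continue
    s = g / base
    p = (1 - 1 / s) / (1 - 1 / n)
    print(f"{n} threads: speedup {s:.2f}x -> implied parallel fraction ~{p:.1%}")
# Comes out around 97-98% parallel at every thread count.
```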


That's amazing


Now if only we had the thread scaling results for 1-4 threads on a Nehalem with HT turned off...
 

Hyperlite

Diamond Member
May 25, 2004
5,664
2
76
I'm not questioning the validity of your results; I believe them. I just can't fathom a computer-science-based reason to explain them.

Every thread you added to your computation resulted in super-linear speedup, which just defies reason.

I've seen cases of legitimate super-linear speedup in the past, but those were explainable: certain datasets fell outside cache boundaries when fewer cores were used, and as more cores were added to the system/calculation the computations were able to fit inside a higher-tier (faster) level of cache, so performance suddenly sped up at a rate that exceeded linear.

Here's the graph of your data. Note how it all falls above the black line, which represents the theoretical maximum speedup provided by linear scaling (attainable only if the code is 100% parallelized and there is zero interprocessor communications delay).

LinXAthlonIIX4.png


Unless you have some weird confluence of thread migration and CnQ settings that is resulting in a serious performance penalty when running any application that uses fewer than all four cores?

To check this theory, could you run the single-thread test again, only this time disable CnQ (if it is enabled) and use Task Manager to set the affinity so that LinX is forced to use only one core during the test?

I'm curious to see how much higher than 9.24 Gflops the single-threaded test turns out when you do that.

interesting. brb with results!
 

Hyperlite

Diamond Member
May 25, 2004
5,664
2
76
affinity0.jpg


CnQ OFF. LinX affinity 0. 1 thread. Beats me IDC!! any other ideas?

Heh, at least my OC is stable...albeit backed up against a wall the size of the Three Gorges Dam.
 
Last edited:

Idontcare

Elite Member
Oct 10, 1999
21,110
64
91
CnQ OFF. LinX affinity 0. 1 thread. Beats me IDC!! any other ideas?

Heh, at least my OC is stable...albeit backed up against a wall the size of the Three Gorges Dam.

Bingo! You got it. Your single-threaded performance jumped from 9.24 GFlops to 10.76 GFlops.

That makes the 40.34 GFlops value with four threads (which won't be affected by core affinity) represent a thread scaling of 3.75, right in line with lopri's results.

Super-linear speedup results explained: your single-threaded performance was sucking wind.

If you have the time/energy/desire, would you mind re-running the 2-thread test with LinX affinity set to two cores, and again for the 3-thread test (affinity locked to 3 cores), for the sake of completeness?