Heh. So Athlon II X4 scales basically 100%:
Thanks dank69, I sent you a PM requesting access to the Google Doc; I think you have it set to non-shared at the moment.
It's also the kind of symptom we'd expect to see if the thread scheduler is double-teaming two threads onto the same physical core without realizing another physical core is sitting idle/unused.
That needlessly creates Hyper-Threading resource contention on a core or two.
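If anyone wants to rule that out on their own box, here's a minimal sketch (Python with psutil, assuming it's installed) that restricts a process to one logical CPU per physical core so the scheduler can't stack two of its threads onto HT siblings. The sibling layout (0/1 on core 0, 2/3 on core 1, and so on) is an assumption about how the CPUs are enumerated, not a guarantee:

```python
# Restrict a process to one logical CPU per physical core so the scheduler
# cannot stack two of its threads onto Hyper-Threading siblings.
import psutil

physical = psutil.cpu_count(logical=False)
logical = psutil.cpu_count(logical=True)
print(f"{physical} physical cores, {logical} logical CPUs")

if logical == 2 * physical:
    # Assumes siblings are enumerated in pairs (0/1, 2/3, ...), which is
    # common on Windows but not guaranteed on every system.
    one_per_core = list(range(0, logical, 2))
else:
    one_per_core = list(range(logical))

proc = psutil.Process()            # or psutil.Process(pid_of_the_benchmark)
proc.cpu_affinity(one_per_core)    # restrict the process to those CPUs
print("Affinity now:", proc.cpu_affinity())
```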
On my system TMPIN0 matches the "motherboard" temp, and TMPIN1 matches the "CPU" temp (not the individual cores, but somewhere around the socket, I think). No idea where TMPIN2 comes from. Anyone care to hazard a guess as to what the TMPIN(x) readings in HWMonitor are?
That's what I was trying to say the other day. A larger problem size will likely produce a better score, but that itself runs into diminishing returns. You're going to want to select a problem size that's large, but not too large.
Johan De Gelas said: So we started an in-depth comparison of the 45 nm Opterons, Xeons, and Core i7 CPUs. One of our benchmarks, the famous LINPACK (you can read all about it here), painted a pretty interesting performance picture. We had to test with a matrix size of 18000 (2.5 GB of RAM necessary), as we only had 3 GB of DDR3 on the Core i7 platform. That should not be a huge problem, as we tested with only one CPU. We normally need about 4 GB for each quad-core CPU to reach the best performance.
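As a side note, the memory figures in that quote line up with a plain N x N double-precision matrix at 8 bytes per element. Here's a quick Python check; it ignores solver workspace, which is a simplification:

```python
# An N x N double-precision LINPACK matrix needs 8 * N^2 bytes.
def linpack_matrix_mib(n: int) -> float:
    return 8 * n * n / 2**20

for n in (18000, 16133, 15000):
    print(f"N = {n:>5}: ~{linpack_matrix_mib(n):,.0f} MiB")

# N = 18000: ~2,472 MiB (the ~2.5 GB quoted for the Core i7 run)
# N = 16133: ~1,986 MiB (close to the 2048 MB used in the run further down)
# N = 15000: ~1,717 MiB (close to the 1729 MB LinX reports further down)
```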
Here is my little AMD at stock with 2 unlocked cores (stock clock).
Didn't want to run it 20 times.
39.7265 GFlops
OK, I don't feel like posting screens again, but here's what I did:
Problem size: 16133 (2048 MB RAM)
Athlon II X4 @ 3211 MHz, 1976 MHz HT, 2476 MHz FSB
4 GB DDR3 @ 9-9-9-24
4 threads: 40.34 Gflops
3 threads: 31.55 Gflops
2 threads: 19.29 Gflops
1 thread: 9.24 Gflops
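Out of curiosity, here's a rough Python sanity check of those numbers against theoretical peak. It assumes a K10 core can retire 4 double-precision FLOPs per cycle (128-bit SSE add plus multiply pipes); that per-cycle figure is my assumption, not something measured here:

```python
# Rough comparison of the posted Athlon II X4 results against theoretical peak.
clock_ghz = 3.211
flops_per_cycle = 4                                  # assumed for a K10 core
results = {1: 9.24, 2: 19.29, 3: 31.55, 4: 40.34}    # GFlops from the list above

for threads, measured in results.items():
    peak = threads * clock_ghz * flops_per_cycle
    print(f"{threads} threads: {measured:5.2f} of {peak:5.2f} GFlops peak "
          f"({100 * measured / peak:.0f}% efficiency)")

# Prints roughly 72%, 75%, 82%, and 79% efficiency for 1 to 4 threads.
# Efficiency rising with thread count is unusual, which is where the scaling
# discussion below comes from.
```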
I've run it with varying thread counts, but with a fixed problem size of 15000. LinX assigned 1729 MB of memory.
CPU: Phenom II 955 BE
Core Frequency: 3600 MHz
NB/Uncore Frequency: 2250 MHz
Memory Frequency: DDR2-900 (5-5-5-15)
Thread Count : GFlops
1 : 12.24 GFlops
2 : 23.96 GFlops
3 : 35.20 GFlops
4 : 45.73 GFlops
I'm not questioning the validity of your results; I believe them. I just can't fathom a computer-science-based reason to explain them.
Every thread you added to your computation resulted in super-linear speedup, which just defies reason.
I've seen cases of legitimate super-linear speedup in the past, but those were explainable: certain datasets fell outside cache boundaries when fewer cores were used, and as more cores were added to the system/calculation the working set then fit inside a higher-tier (faster) level of cache, so performance suddenly sped up at a rate that exceeded linear.
Here's the graph of your data. Note how it all falls above the black line, which represents the theoretical maximum speedup from linear scaling (attaining it requires code that is 100% parallelized and zero interprocessor communication delay).
[Graph: measured speedup vs. thread count, with the linear-scaling maximum drawn as a black line]
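For reference, here's a quick Python recalculation of the speedup ratios behind that graph, using the Athlon II X4 numbers posted above:

```python
# Speedup of the Athlon II X4 results relative to the 1-thread run, compared
# against ideal linear scaling (speedup == thread count).
gflops = {1: 9.24, 2: 19.29, 3: 31.55, 4: 40.34}

for threads, gf in gflops.items():
    print(f"{threads} threads: {gf / gflops[1]:.2f}x measured "
          f"vs {threads}.00x linear")

# Prints 1.00x, 2.09x, 3.41x, 4.37x: every multi-threaded point sits above the
# linear line. Also note the 4-thread run averages 40.34 / 4 = 10.09 GFlops per
# core, higher than the 9.24 GFlops single-thread run, which is why the
# 1-thread baseline looks suspect (hence the CnQ/affinity test below).
```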
Unless you have some weird confluence of thread migration and CnQ settings that results in a serious performance penalty when running any application that uses fewer than all four cores?
To check this theory, could you run the single-thread test again, only this time disable CnQ (if it is enabled) and use Task Manager to set the thread affinity so that LinX is forced to use only one core during the test?
I'm curious to see how much higher than 9.24 GFlops the single-threaded result turns out to be when you do that.
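If Task Manager is a pain, here's a sketch that does the same pin programmatically with Python and psutil (assuming it's installed); "LinX.exe" is just my guess at the process name, so check what Task Manager actually shows:

```python
# Programmatic version of the affinity pin for the single-core test.
# "LinX.exe" is a hypothetical process name; some LinX versions spawn a
# separate linpack worker, so adjust TARGET to the actual executable.
import psutil

TARGET = "LinX.exe"  # hypothetical name, adjust to match your system

for proc in psutil.process_iter(["name"]):
    if (proc.info["name"] or "").lower() == TARGET.lower():
        proc.cpu_affinity([0])   # lock the whole process to logical CPU 0
        print(f"Pinned PID {proc.pid} ({proc.info['name']}) to CPU 0")
```

On Windows 7 and later you can also launch it already pinned from a command prompt with `start /affinity 1 LinX.exe` (the mask is hex, so 1 means logical CPU 0).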
CnQ OFF. LinX affinity set to 0. 1 thread. Beats me, IDC!! Any other ideas?
Heh, at least my OC is stable... albeit backed up against a wall the size of the Three Gorges Dam.