Linpack Challenge


Idontcare

Elite Member
Oct 10, 1999
21,110
64
91
One thread: 8.0352 GFLOPS (will update as I can)
Two threads: 15.65 GFLOPS
Three threads: 23.2 GFLOPS
Four threads: 30.46 GFLOPS
Five threads: 36.37 GFLOPS

OK, I went to 6 threads... all of a sudden it went to 12 GFLOPS!!!

WTF??

Yep. Is that with HT or without?

How are your DIMMs populated?

The dual-socket mobo is NUMA, IIRC, and Windows is basically ignorant of the performance ramifications of NUMA.
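(For anyone curious what Windows actually sees of the topology, here is a minimal sketch using the Win32 NUMA query APIs -- the 8-processor count is an assumption for a dual quad-core box, and API availability varies by Windows version:)

[code]
#include <windows.h>
#include <stdio.h>

int main(void)
{
    ULONG highest = 0;

    /* Ask the OS how many NUMA nodes it recognizes (highest index, 0-based). */
    if (!GetNumaHighestNodeNumber(&highest)) {
        fprintf(stderr, "GetNumaHighestNodeNumber failed: %lu\n", GetLastError());
        return 1;
    }
    printf("NUMA nodes visible to Windows: %lu\n", highest + 1);

    /* Map each logical processor to its node (8 CPUs assumed here). */
    for (UCHAR cpu = 0; cpu < 8; cpu++) {
        UCHAR node;
        if (GetNumaProcessorNode(cpu, &node))
            printf("CPU %u -> node %u\n", (unsigned)cpu, (unsigned)node);
    }
    return 0;
}
[/code]

If it reports a single node on a dual-socket board, the OS is scheduling blind to memory locality, which would fit the behavior above.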
 

Markfw

Moderator Emeritus, Elite Member
May 16, 2002
27,258
16,116
136
It was non-HT, and I have the two DIMM slots closest to each socket (of the three per bank) populated. They are 2 GB DIMMs, for 8 GB total.
 

Shmee

Memory & Storage, Graphics Cards Mod Elite Member
Super Moderator
Sep 13, 2008
8,219
3,130
146
Hi, is there a way to disable HT within LinX? Or do I have to do it in the BIOS? 980X @ 4.3 GHz.
 

Shmee

Memory & Storage, Graphics Cards Mod Elite Member
Super Moderator
Sep 13, 2008
8,219
3,130
146
Doing 4.4 GHz with HT off, max is 82 GFLOPS.
 

Shmee

Memory & Storage, Graphics Cards Mod Elite Member
Super Moderator
Sep 13, 2008
8,219
3,130
146
Max temps were 68°C, using a TRUE.
 

Tsavo

Platinum Member
Sep 29, 2009
2,645
37
91
Ahhh, yes. I don't have the $600 mobo to overclock...

Lowly 2366 MHz (each).

Why did you pick that system, with CPUs at 2.2 GHz?

?

And at what point did you think that two low-clocked Xeons would = the schitt?
Beyond that, you've got a problem with the board or the RAM. Not much of a problem, though. Low clocks are always going to bench as a joke.

I can't believe you bought 2 CPUs to run at stock clocks on a board that has NO o/c.
 

faxon

Platinum Member
May 23, 2008
2,109
1
81
Doing 4.4 GHz with HT off, max is 82 GFLOPS.

Did you see IDC's post about setting your problem size on the last page? See what it reports once you do that, since you were using that run as a stress test :)
 

Idontcare

Elite Member
Oct 10, 1999
21,110
64
91
It was non-HT, and I have the two DIMM slots closest to each socket (of the three per bank) populated. They are 2 GB DIMMs, for 8 GB total.

I wonder how well 32-bit XP handles the NUMA when it only allows addressing of ~3.25 GB of that 8 GB. Could it be that Windows is only addressing 3.25 GB of the 4 GB you have installed on one CPU?

So threads on the other CPU are incurring that extra QPI hop to access the 3.25 GB of system RAM?

Don't be surprised if this is related to thread migration too. If a thread on CPU1 migrates to CPU0, then its portion of the data stored in cache has to migrate as well.

This is part of the reason AMD implemented that snoop filter on their multi-socket Istanbul chips (and MC).

I think you should continue to investigate this, Mark; these "quirks" are real, and the more you learn about them the better you'll be able to maximize your system's performance. At the moment you are still in the discovery phase.

Tests to try: pull a DIMM from each side to force the condition where the OS is accessing RAM from both CPUs' channels.

Also try locking thread affinity. For two threads, try both the case where each thread is locked to cores on the same physical CPU, and the case where each thread is locked to a core on a different CPU. (A minimal sketch of that affinity test follows below.)

^ That last set of tests will tell you whether you have a RAM-access issue if you run them prior to pulling a couple of DIMMs.
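(Here is a minimal Win32 sketch of that affinity test. The masks assume logical CPUs 0-3 sit on one package and 4-7 on the other -- verify your actual mapping first, since it is not guaranteed:)

[code]
#include <windows.h>
#include <stdio.h>

int main(void)
{
    /* Two cores on the same physical CPU: bits 0 and 1.   */
    DWORD_PTR sameSocket  = 0x03;
    /* One core on each physical CPU: bits 0 and 4 (0x11). */
    DWORD_PTR crossSocket = 0x11;

    /* Pin this process (and everything it spawns) to the chosen cores.
       Swap in crossSocket for the second test case. */
    if (!SetProcessAffinityMask(GetCurrentProcess(), sameSocket)) {
        fprintf(stderr, "SetProcessAffinityMask failed: %lu\n", GetLastError());
        return 1;
    }

    (void)crossSocket;  /* unused in this variant of the test */

    /* ... launch the two benchmark threads here and compare GFLOPS ... */
    return 0;
}
[/code]

If the same-socket case is markedly faster than the cross-socket case, you are looking at a NUMA/RAM-access penalty rather than a compute limit.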
 

Idontcare

Elite Member
Oct 10, 1999
21,110
64
91
Why did you pick that system, with CPUs at 2.2 GHz?

?

And at what point did you think that two low-clocked Xeons would = the schitt?
Beyond that, you've got a problem with the board or the RAM. Not much of a problem, though. Low clocks are always going to bench as a joke.

I can't believe you bought 2 CPUs to run at stock clocks on a board that has NO o/c.

No need to be so judgemental... who's to say he didn't get a really good deal on them? Look at his rigs... the guy obviously likes playing around with hardware; he probably snagged these for a song and just wants to see what they can deliver in F@H. No reason to crap on his wheaties, is there?
 

Markfw

Moderator Emeritus, Elite Member
May 16, 2002
27,258
16,116
136
Why did you pick that system, with CPUs at 2.2 GHz?

?

And at what point did you think that two low-clocked Xeons would = the schitt?
Beyond that, you've got a problem with the board or the RAM. Not much of a problem, though. Low clocks are always going to bench as a joke.

I can't believe you bought 2 CPUs to run at stock clocks on a board that has NO o/c.

I picked the system since I got the CPUs dirt cheap. And at what point did I say, or even intimate, that they would be the schitt?

There is no problem with the motherboard or memory; what would make you even say that? And almost all server motherboards don't have overclocking options. The only one that does is $600. I wanted to play with these dual-sockets and see what they would do, so that wasn't an option.
 

WildW

Senior member
Oct 3, 2008
984
20
81
evilpicard.com
So I was messing with tweaking my overclock a bit last night, and tried running LinX to see if what I was doing made any difference. I'm running a Phenom II at 3.8 GHz with 4 GB DDR2-1000. I was trying to see if there was any difference between 800 and 1000 MHz on the RAM: basically I switched from 200x19, 2000 MHz NB, DDR2-800 to 246x15.5, 2200 MHz NB, DDR2-984.

Mostly this program tells me a roughly consistent 48/49 GFLOPS, but I was seeing a few runs with wildly deviating numbers. I saw a 60-something, followed by a 25, a 35... At first I thought maybe I was getting thermally throttled, so I reseated my cooler, but later I saw low scores followed by higher scores. Temps are hitting max 54°C after a few runs, which I know is a bit high for a Phenom.

The odd thing is, the numerical results that LinX outputs are consistent across runs. All that seems to change is the time taken: at the very end of a run I seem to drop to only one thread running, which flits between cores. Presumably this is some kind of "finishing up" that brings everything together across the 4 cores... the run times just seem to vary from about 90 to 120 seconds.

I haven't seen anyone else in this thread mention this, but does anyone have any clue what's going on? It's really weird. As I said, I've managed one run at nearly 70 GFLOPS, and it got the same "right" answer as all the lower scores I got in the same set of runs. WTF?
 

DrMrLordX

Lifer
Apr 27, 2000
22,931
13,014
136
Ruby has mentioned before that throughput from LinX can jump around on you. If you do enough runs consecutively, you should get a pretty clear idea of what throughput you really can get with your CPU.
 

Idontcare

Elite Member
Oct 10, 1999
21,110
64
91
So I was messing with tweaking my overclock a bit last night, and tried running LinX to see if what I was doing made any difference. I'm running a Phenom II at 3.8 GHz with 4 GB DDR2-1000. I was trying to see if there was any difference between 800 and 1000 MHz on the RAM: basically I switched from 200x19, 2000 MHz NB, DDR2-800 to 246x15.5, 2200 MHz NB, DDR2-984.

Mostly this program tells me a roughly consistent 48/49 GFLOPS, but I was seeing a few runs with wildly deviating numbers. I saw a 60-something, followed by a 25, a 35... At first I thought maybe I was getting thermally throttled, so I reseated my cooler, but later I saw low scores followed by higher scores. Temps are hitting max 54°C after a few runs, which I know is a bit high for a Phenom.

The odd thing is, the numerical results that LinX outputs are consistent across runs. All that seems to change is the time taken: at the very end of a run I seem to drop to only one thread running, which flits between cores. Presumably this is some kind of "finishing up" that brings everything together across the 4 cores... the run times just seem to vary from about 90 to 120 seconds.

I haven't seen anyone else in this thread mention this, but does anyone have any clue what's going on? It's really weird. As I said, I've managed one run at nearly 70 GFLOPS, and it got the same "right" answer as all the lower scores I got in the same set of runs. WTF?

Yep, Hyperlite encountered this issue; when that final thread is thrashing about between all the cores, it makes a marked negative impact on the apparent performance.

See this post and the ones preceding it.

Also, you can see from this post that the range represented by the error bars on the graph (included below for clarity) is likewise due to thread migration and the frequency x duration of threads concurrently residing on the same physical core when HT is involved.

[Graph: Core i7 920 @ 4 GHz LinX scaling with HT]


The performance impact you are seeing is real; it really does hurt performance when your threads migrate around the CPU like you are witnessing. The cache contents have to migrate with the thread, and that takes time.

Plus, if your CPU is actively trying to change C-states (power savings) while threads migrate around in the background, then you add additional delay, as C-state transitions take time. This is in part what doomed AMD's original implementation of CnQ on the 65nm Phenoms.
 

WildW

Senior member
Oct 3, 2008
984
20
81
evilpicard.com
Thanks, and doh. I had a quick scan through the thread before I posted, but I obviously wasn't awake yet.

So as far as thread contention goes, I guess variable results are likely. The single thread leaping around is due to... Windows being crap? I mean, with one thread fully utilizing a core and all the Windows background threads idling along, why move the busy thread around?

The thing still confusing me is the one-off 60+ GFLOPS result compared to the usual 48-ish... is that still real? Is that what my CPU could do if nothing else bothered it for the whole run?
 

Idontcare

Elite Member
Oct 10, 1999
21,110
64
91
The single thread leaping around is due to... Windows being crap? I mean, with one thread fully utilizing a core and all the Windows background threads idling along, why move the busy thread around?

We've frequently wondered the same thing here... no satisfactory answer has come to light.

The thing still confusing me is the one-off 60+ GFLOPS result compared to the usual 48-ish... is that still real? Is that what my CPU could do if nothing else bothered it for the whole run?

Yes, it is highlighting the difference between "actual" and "maximal" IPC for your platform.

The architecture is designed to be capable of delivering a "maximal" IPC... all the crap you, the user, load onto the system to run in the background, as well as how you specifically implement the CPU in a platform (RAM bandwidth, hard drive, etc.), grinds that maximal value down until you arrive at your "actual" IPC.

Every now and then the thread contention going on in your CPU just so happens to line up such that the cache misses and so on are "golden", and your "actual" IPC shoots upward toward that "maximal" value.

People spend thousands and thousands of dollars eliminating bottlenecks in their rigs so that their actual IPC better approaches that maximal value. For some, the performance/dollar involved just doesn't warrant it.
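(As a back-of-the-envelope check -- assuming K10 retires up to 4 double-precision FLOPs per cycle per core, two adds plus two multiplies through its 128-bit SSE units -- the "maximal" figure for that Phenom II lines up neatly with the 60-something spikes:)

[code]
#include <stdio.h>

int main(void)
{
    /* Theoretical peak for a quad-core Phenom II at 3.8 GHz,
       assuming 4 DP FLOPs/cycle/core for the K10 architecture. */
    double ghz = 3.8, cores = 4.0, flops_per_cycle = 4.0;
    double peak = ghz * cores * flops_per_cycle;       /* 60.8 GFLOPS */

    printf("theoretical peak: %.1f GFLOPS\n", peak);
    printf("typical 48.5 GFLOPS run = %.0f%% of peak\n", 48.5 / peak * 100.0);
    return 0;
}
[/code]

By that accounting the 48-ish runs are about 80% of peak, the 60-something spikes are brushing the ceiling, and anything reported above ~61 GFLOPS is more likely timing noise in the measurement than real throughput.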
 

Mir96TA

Golden Member
Oct 21, 2002
1,950
37
91
With the old P35 chipset MB:
[Screenshot: LinX result on the P35 board]
I just upgraded the MB to a P45 chipset from P35, and I got a fairly huge jump in performance!
I always thought they were about the same; I guess I was wrong!
It gave me a 24% jump with the same OS, CPU, and memory.
[Screenshot: LinX result on the P45 board]
 

coffeejunkee

Golden Member
Jul 31, 2010
1,153
0
0
Ehm, you're using almost 3 times as much memory and double the problem size on the P45. No wonder the GFLOPS figure is higher...
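(For anyone wanting to sanity-check runs like these, the standard Linpack bookkeeping is simple: memory scales with N^2 and work with N^3. A quick sketch -- the 17.5 s run time is a made-up example value, not a measurement:)

[code]
#include <stdio.h>

int main(void)
{
    /* LinX/Linpack solves a dense N x N system in double precision:
       memory ~ 8*N^2 bytes, work = (2/3)*N^3 + 2*N^2 flops. */
    double n = 10000.0;
    double seconds = 17.5;  /* hypothetical example run time */

    double bytes = 8.0 * n * n;
    double flops = (2.0 / 3.0) * n * n * n + 2.0 * n * n;

    printf("memory: %.0f MiB\n", bytes / (1024.0 * 1024.0));  /* ~763 MiB */
    printf("GFLOPS: %.1f\n", flops / seconds / 1e9);          /* ~38.1    */
    return 0;
}
[/code]

Doubling the problem size grows the flop count eightfold while memory only quadruples, so a larger run keeps the cores busier relative to fixed overhead and naturally posts a higher GFLOPS number on the same hardware.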
 

coffeejunkee

Golden Member
Jul 31, 2010
1,153
0
0
Problem size 1000:

[Screenshot: LinX run at problem size 1000]

Problem size 10000:

[Screenshot: LinX run at problem size 10000]

i5 750 at stock. The difference is pretty clear. Using all the RAM (4 GB total) will give about 38 GFLOPS as well, though.

Maybe the P35 LinX run wasn't fully utilizing the CPU. Sometimes that happens, don't know why.

Or the P45 is indeed the better platform...
 

muskie32

Diamond Member
Sep 13, 2010
3,115
7
81
That download link does not work for me...

Where should I download it from?
 

aigomorla

CPU, Cases&Cooling Mod PC Gaming Mod Elite Member
Super Moderator
Sep 28, 2005
21,067
3,574
126
Does it make a difference with these Linpack runs if Hyper-Threading is running or not? I'm hitting 54.9 or so GFLOPS on my i7 with Hyper-Threading enabled, but I need to find a good screenshot host to show it.

It's because when LinX does work on the HT cores, an HT core is only half to a third the actual speed of a physical core.

So you tend to get lower calculation speeds because HT starts on a pending thread but does the work slowly until a physical core can take it over.


Is that with the assumption of the HT cores being slower?
Because if you disable HT, he will get the red line.

It's like this... without HT I can pull around 79 GFLOPS... no joke.
With HT I pull roughly 60 GFLOPS.
 

Idontcare

Elite Member
Oct 10, 1999
21,110
64
91
Is that with the assumption of the HT cores being slower?
Because if you disable HT, he will get the red line.

It's like this... without HT I can pull around 79 GFLOPS... no joke.
With HT I pull roughly 60 GFLOPS.

I'm not following; the post you quoted contains a graph of actual scaling data from a member here, so there are no assumptions to be made or detailed.

What am I missing?
 

aigomorla

CPU, Cases&Cooling Mod PC Gaming Mod Elite Member
Super Moderator
Sep 28, 2005
21,067
3,574
126
I'm not following; the post you quoted contains a graph of actual scaling data from a member here, so there are no assumptions to be made or detailed.

What am I missing?

LinX does not do bulk processing across all threads and add it all up into the total GFLOPS number.

It loads up as many cores as you want it to load up, and then averages the GFLOPS across all your working cores.

The HT cores, counting for half your total threads, will drag down the average in LinX.

If you disable HT, you will get a better number in LinX because you're not averaging the slower cores into your total GFLOPS count.
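(A toy model of that averaging argument -- every figure here is illustrative, assumed only for the sake of the arithmetic, not measured:)

[code]
#include <stdio.h>

int main(void)
{
    /* Toy model: 4 physical cores, each good for ~20 GFLOPS in LinX.
       With HT on, 8 threads share the same 4 FPUs: total hardware
       throughput is unchanged, but the solver's synchronized phases
       now wait on threads stuck sharing a core, so the reported
       number drops. The 25% loss factor is an assumption chosen to
       mirror the ~79 -> ~60 GFLOPS figures quoted above. */
    double per_core = 20.0;
    double ht_loss  = 0.25;

    double no_ht   = 4.0 * per_core;                   /* ~80 GFLOPS */
    double with_ht = 4.0 * per_core * (1.0 - ht_loss); /* ~60 GFLOPS */

    printf("HT off: %.0f GFLOPS, HT on: %.0f GFLOPS\n", no_ht, with_ht);
    return 0;
}
[/code]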