Linpack Challenge


Idontcare

Elite Member
Oct 10, 1999
21,110
64
91
One thread: 8.0352 GFLOPS (will update as I can)
Two threads: 15.65 GFLOPS
Three threads: 23.2 GFLOPS
Four threads: 30.46 GFLOPS
Five threads: 36.37 GFLOPS

OK, I went to 6 threads... all of a sudden it went to 12 GFLOPS!!!

WTF??

Yep. Is that with HT or without?

How are your DIMMs populated?

The dual-socket mobo is NUMA, IIRC, and Windows is basically ignorant of the performance ramifications of NUMA.
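(For anyone curious what Windows actually sees of the topology, here is a minimal sketch using the Win32 NUMA query APIs -- the 8-processor count is an assumption for a dual quad-core box, and API availability varies by Windows version:)

[code]
#include <windows.h>
#include <stdio.h>

int main(void)
{
    ULONG highest = 0;

    /* Ask the OS how many NUMA nodes it recognizes (highest index, 0-based). */
    if (!GetNumaHighestNodeNumber(&highest)) {
        fprintf(stderr, "GetNumaHighestNodeNumber failed: %lu\n", GetLastError());
        return 1;
    }
    printf("NUMA nodes visible to Windows: %lu\n", highest + 1);

    /* Map each logical processor to its node (8 CPUs assumed here). */
    for (UCHAR cpu = 0; cpu < 8; cpu++) {
        UCHAR node;
        if (GetNumaProcessorNode(cpu, &node))
            printf("CPU %u -> node %u\n", (unsigned)cpu, (unsigned)node);
    }
    return 0;
}
[/code]

If it reports a single node on a dual-socket board, the OS is scheduling blind to memory locality, which would fit the behavior above.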
 

Markfw

Moderator Emeritus, Elite Member
May 16, 2002
27,258
16,116
136
It was non-HT, and I have the two DIMM slots closest to each socket (of the three per bank) populated. They are 2 GB DIMMs, for 8 GB total.
 

Shmee

Memory & Storage, Graphics Cards Mod Elite Member
Super Moderator
Sep 13, 2008
8,219
3,130
146
Hi, is there a way to disable HT within LinX? Or do I have to do it in the BIOS? 980X @ 4.3 GHz.
 

Shmee

Memory & Storage, Graphics Cards Mod Elite Member
Super Moderator
Sep 13, 2008
8,219
3,130
146
Doing 4.4 GHz with HT off, max is 82 GFLOPS.
 

Shmee

Memory & Storage, Graphics Cards Mod Elite Member
Super Moderator
Sep 13, 2008
8,219
3,130
146
Max temps were 68°C, using a TRUE.
 

Tsavo

Platinum Member
Sep 29, 2009
2,645
37
91
Ahhh, yes. I don't have the $600 mobo to overclock...

Lowly 2366 MHz (each).

Why did you pick that system, with CPUs at 2.2 GHz?

?

And at what point did you think that two low-clocked Xeons would = the schitt?
Beyond that, you've got a problem with the board or the RAM. Not much of a problem, though. Low clocks are always going to bench as a joke.

I can't believe you bought 2 CPUs to run at stock clocks on a board that has NO o/c.
 

faxon

Platinum Member
May 23, 2008
2,109
1
81
Doing 4.4 GHz with HT off, max is 82 GFLOPS.

Did you see IDC's post about setting your problem size on the last page? See what it reports once you do that, since you were using that run as a stress test :)
 

Idontcare

Elite Member
Oct 10, 1999
21,110
64
91
It was non-HT, and I have the two DIMM slots closest to each socket (of the three per bank) populated. They are 2 GB DIMMs, for 8 GB total.

I wonder how well 32-bit XP handles the NUMA when it only allows addressing of ~3.25 GB of that 8 GB. Could it be that Windows is only addressing 3.25 GB of the 4 GB you have installed on one CPU?

So threads on the other CPU are incurring that extra QPI hop to access the 3.25 GB of system RAM?

Don't be surprised if this is related to thread migration too. If a thread on CPU1 migrates to CPU0, then its portion of the data stored in cache has to migrate as well.

This is part of the reason AMD implemented that snoop filter on their multi-socket Istanbul chips (and MC).

I think you should continue to investigate this, Mark; these "quirks" are real, and the more you learn about them the better you'll be able to maximize your system's performance. At the moment you are still in the discovery phase.

Tests to try: pull a DIMM from each side to force the condition where the OS is accessing RAM from both CPUs' channels.

Also try locking thread affinity. For two threads, try both the case where each thread is locked to cores on the same physical CPU, and the case where each thread is locked to a core on a different CPU. (A minimal sketch of that affinity test follows below.)

^ That last set of tests will tell you whether you have a RAM-access issue if you run them prior to pulling a couple of DIMMs.
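(Here is a minimal Win32 sketch of that affinity test. The masks assume logical CPUs 0-3 sit on one package and 4-7 on the other -- verify your actual mapping first, since it is not guaranteed:)

[code]
#include <windows.h>
#include <stdio.h>

int main(void)
{
    /* Two cores on the same physical CPU: bits 0 and 1.   */
    DWORD_PTR sameSocket  = 0x03;
    /* One core on each physical CPU: bits 0 and 4 (0x11). */
    DWORD_PTR crossSocket = 0x11;

    /* Pin this process (and everything it spawns) to the chosen cores.
       Swap in crossSocket for the second test case. */
    if (!SetProcessAffinityMask(GetCurrentProcess(), sameSocket)) {
        fprintf(stderr, "SetProcessAffinityMask failed: %lu\n", GetLastError());
        return 1;
    }

    (void)crossSocket;  /* unused in this variant of the test */

    /* ... launch the two benchmark threads here and compare GFLOPS ... */
    return 0;
}
[/code]

If the same-socket case is markedly faster than the cross-socket case, you are looking at a NUMA/RAM-access penalty rather than a compute limit.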
 

Idontcare

Elite Member
Oct 10, 1999
21,110
64
91
Why did you pick that system, with CPUs at 2.2 GHz?

?

And at what point did you think that two low-clocked Xeons would = the schitt?
Beyond that, you've got a problem with the board or the RAM. Not much of a problem, though. Low clocks are always going to bench as a joke.

I can't believe you bought 2 CPUs to run at stock clocks on a board that has NO o/c.

No need to be so judgemental... who's to say he didn't get a really good deal on them? Look at his rigs... the guy obviously likes playing around with hardware; he probably snagged these for a song and just wants to see what they can deliver in F@H. No reason to crap on his wheaties, is there?
 

Markfw

Moderator Emeritus, Elite Member
May 16, 2002
27,258
16,116
136
Why did you pick that system, with CPUs at 2.2 GHz?

?

And at what point did you think that two low-clocked Xeons would = the schitt?
Beyond that, you've got a problem with the board or the RAM. Not much of a problem, though. Low clocks are always going to bench as a joke.

I can't believe you bought 2 CPUs to run at stock clocks on a board that has NO o/c.

I picked the system since I got the CPUs dirt cheap. And at what point did I say, or even intimate, that they would be the schitt?

There is no problem with the motherboard or memory; what would make you even say that? And almost all server motherboards don't have overclocking options. The only one that does is $600. I wanted to play with these dual-sockets and see what they would do, so that wasn't an option.
 

WildW

Senior member
Oct 3, 2008
984
20
81
evilpicard.com
So I was messing with tweaking my overclock a bit last night, and tried running LinX to see if what I was doing made any difference. I'm running a Phenom II at 3.8 GHz with 4 GB DDR2-1000. I was trying to see if there was any difference between 800 and 1000 MHz on the RAM: basically I switched from 200x19, 2000 MHz NB, DDR2-800 to 246x15.5, 2200 MHz NB, DDR2-984.

Mostly this program tells me a roughly consistent 48/49 GFLOPS, but I was seeing a few runs with wildly deviating numbers. I saw a 60-something, followed by a 25, a 35... At first I thought maybe I was getting thermally throttled, so I reseated my cooler, but later I saw low scores followed by higher scores. Temps are hitting max 54°C after a few runs, which I know is a bit high for a Phenom.

The odd thing is, the numerical results that LinX outputs are consistent across runs. All that seems to change is the time taken: at the very end of a run I seem to drop to only one thread running, which flits between cores. Presumably this is some kind of "finishing up" that brings everything together across the 4 cores... the run times just seem to vary from about 90 to 120 seconds.

I haven't seen anyone else in this thread mention this, but does anyone have any clue what's going on? It's really weird. As I said, I've managed one run at nearly 70 GFLOPS, and it got the same "right" answer as all the lower scores I got in the same set of runs. WTF?
 

DrMrLordX

Lifer
Apr 27, 2000
22,931
13,014
136
Ruby has mentioned before that throughput from LinX can jump around on you. If you do enough runs consecutively, you should get a pretty clear idea of what throughput you really can get with your CPU.
 

Idontcare

Elite Member
Oct 10, 1999
21,110
64
91
So I was messing with tweaking my overclock a bit last night, and tried running LinX to see if what I was doing made any difference. I'm running a Phenom II at 3.8 GHz with 4 GB DDR2-1000. I was trying to see if there was any difference between 800 and 1000 MHz on the RAM: basically I switched from 200x19, 2000 MHz NB, DDR2-800 to 246x15.5, 2200 MHz NB, DDR2-984.

Mostly this program tells me a roughly consistent 48/49 GFLOPS, but I was seeing a few runs with wildly deviating numbers. I saw a 60-something, followed by a 25, a 35... At first I thought maybe I was getting thermally throttled, so I reseated my cooler, but later I saw low scores followed by higher scores. Temps are hitting max 54°C after a few runs, which I know is a bit high for a Phenom.

The odd thing is, the numerical results that LinX outputs are consistent across runs. All that seems to change is the time taken: at the very end of a run I seem to drop to only one thread running, which flits between cores. Presumably this is some kind of "finishing up" that brings everything together across the 4 cores... the run times just seem to vary from about 90 to 120 seconds.

I haven't seen anyone else in this thread mention this, but does anyone have any clue what's going on? It's really weird. As I said, I've managed one run at nearly 70 GFLOPS, and it got the same "right" answer as all the lower scores I got in the same set of runs. WTF?

Yep, Hyperlite encountered this issue; when that final thread is thrashing about between all the cores, it makes a marked negative impact on the apparent performance.

See this post and the ones preceding it.

Also, you can see from this post that the range represented by the error bars on the graph (included below for clarity) is likewise due to thread migration and the frequency x duration of threads concurrently residing on the same physical core when HT is involved.

[Graph: Core i7 920 @ 4 GHz LinX scaling with HT]


The performance impact you are seeing is real; it really does hurt performance when your threads migrate around the CPU like you are witnessing. The cache contents have to migrate with the thread, and that takes time.

Plus, if your CPU is actively trying to change C-states (power savings) while threads migrate around in the background, then you add additional delay, as C-state transitions take time. This is in part what doomed AMD's original implementation of CnQ on the 65nm Phenoms.
 

WildW

Senior member
Oct 3, 2008
984
20
81
evilpicard.com
Thanks, and doh. I had a quick scan through the thread before I posted, but I obviously wasn't awake yet.

So as far as thread contention goes, I guess variable results are likely. The single thread leaping around is due to... Windows being crap? I mean, with one thread fully utilizing a core and all the Windows background threads idling along, why move the busy thread around?

The thing still confusing me is the one-off 60+ GFLOPS result compared to the usual 48-ish... is that still real? Is that what my CPU could do if nothing else bothered it for the whole run?
 

Idontcare

Elite Member
Oct 10, 1999
21,110
64
91
The single thread leaping around is due to... Windows being crap? I mean, with one thread fully utilizing a core and all the Windows background threads idling along, why move the busy thread around?

We've frequently wondered the same thing here... no satisfactory answer has come to light.

The thing still confusing me is the one-off 60+ GFLOPS result compared to the usual 48-ish... is that still real? Is that what my CPU could do if nothing else bothered it for the whole run?

Yes, it is highlighting the difference between "actual" and "maximal" IPC for your platform.

The architecture is designed to be capable of delivering a "maximal" IPC... all the crap you, the user, load onto the system to run in the background, as well as how you specifically implement the CPU in a platform (RAM bandwidth, hard drive, etc.), grinds that maximal value down until you arrive at your "actual" IPC.

Every now and then the thread contention going on in your CPU just so happens to line up such that the cache misses and so on are "golden", and your "actual" IPC shoots upward toward that "maximal" value.

People spend thousands and thousands of dollars eliminating bottlenecks in their rigs so that their actual IPC better approaches that maximal value. For some, the performance/dollar involved just doesn't warrant it.
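(As a back-of-the-envelope check -- assuming K10 retires up to 4 double-precision FLOPs per cycle per core, two adds plus two multiplies through its 128-bit SSE units -- the "maximal" figure for that Phenom II lines up neatly with the 60-something spikes:)

[code]
#include <stdio.h>

int main(void)
{
    /* Theoretical peak for a quad-core Phenom II at 3.8 GHz,
       assuming 4 DP FLOPs/cycle/core for the K10 architecture. */
    double ghz = 3.8, cores = 4.0, flops_per_cycle = 4.0;
    double peak = ghz * cores * flops_per_cycle;       /* 60.8 GFLOPS */

    printf("theoretical peak: %.1f GFLOPS\n", peak);
    printf("typical 48.5 GFLOPS run = %.0f%% of peak\n", 48.5 / peak * 100.0);
    return 0;
}
[/code]

By that accounting the 48-ish runs are about 80% of peak, the 60-something spikes are brushing the ceiling, and anything reported above ~61 GFLOPS is more likely timing noise in the measurement than real throughput.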
 

Mir96TA

Golden Member
Oct 21, 2002
1,950
37
91
With the old P35 chipset MB:
[Screenshot: LinX result on the P35 board]
I just upgraded the MB to a P45 chipset from P35, and I got a fairly huge jump in performance!
I always thought they were about the same; I guess I was wrong!
It gave me a 24% jump with the same OS, CPU, and memory.
[Screenshot: LinX result on the P45 board]
 

coffeejunkee

Golden Member
Jul 31, 2010
1,153
0
0
Ehm, you're using almost 3 times as much memory and double the problem size on the P45. No wonder the GFLOPS figure is higher...
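(For anyone wanting to sanity-check runs like these, the standard Linpack bookkeeping is simple: memory scales with N^2 and work with N^3. A quick sketch -- the 17.5 s run time is a made-up example value, not a measurement:)

[code]
#include <stdio.h>

int main(void)
{
    /* LinX/Linpack solves a dense N x N system in double precision:
       memory ~ 8*N^2 bytes, work = (2/3)*N^3 + 2*N^2 flops. */
    double n = 10000.0;
    double seconds = 17.5;  /* hypothetical example run time */

    double bytes = 8.0 * n * n;
    double flops = (2.0 / 3.0) * n * n * n + 2.0 * n * n;

    printf("memory: %.0f MiB\n", bytes / (1024.0 * 1024.0));  /* ~763 MiB */
    printf("GFLOPS: %.1f\n", flops / seconds / 1e9);          /* ~38.1    */
    return 0;
}
[/code]

Doubling the problem size grows the flop count eightfold while memory only quadruples, so a larger run keeps the cores busier relative to fixed overhead and naturally posts a higher GFLOPS number on the same hardware.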
 

coffeejunkee

Golden Member
Jul 31, 2010
1,153
0
0
Problem size 1000:

[Screenshot: LinX run at problem size 1000]

Problem size 10000:

[Screenshot: LinX run at problem size 10000]

i5 750 at stock. The difference is pretty clear. Using all the RAM (4 GB total) will give about 38 GFLOPS as well, though.

Maybe the P35 LinX run wasn't fully utilizing the CPU. Sometimes that happens, don't know why.

Or the P45 is indeed the better platform...
 

muskie32

Diamond Member
Sep 13, 2010
3,115
7
81
That download link does not work for me...

Where should I download it from?
 

aigomorla

CPU, Cases&Cooling Mod PC Gaming Mod Elite Member
Super Moderator
Sep 28, 2005
21,067
3,574
126
Does it make a difference with these Linpack runs if Hyper-Threading is running or not? I'm hitting 54.9 or so GFLOPS on my i7 with Hyper-Threading enabled, but I need to find a good screenshot host to show it.

It's because when LinX does work on the HT cores, an HT core is only half to a third the actual speed of a physical core.

So you tend to get lower calculation speeds because HT starts on a pending thread but does the work slowly until a physical core can take it over.


Is that with the assumption of the HT cores being slower?
Because if you disable HT, he will get the red line.

It's like this... without HT I can pull around 79 GFLOPS... no joke.
With HT I pull roughly 60 GFLOPS.
 

Idontcare

Elite Member
Oct 10, 1999
21,110
64
91
Is that with the assumption of the HT cores being slower?
Because if you disable HT, he will get the red line.

It's like this... without HT I can pull around 79 GFLOPS... no joke.
With HT I pull roughly 60 GFLOPS.

I'm not following; the post you quoted contains a graph of actual scaling data from a member here, so there are no assumptions to be made or detailed.

What am I missing?
 

aigomorla

CPU, Cases&Cooling Mod PC Gaming Mod Elite Member
Super Moderator
Sep 28, 2005
21,067
3,574
126
I'm not following; the post you quoted contains a graph of actual scaling data from a member here, so there are no assumptions to be made or detailed.

What am I missing?

LinX does not do bulk processing across all threads and add it all up into the total GFLOPS number.

It loads up as many cores as you want it to load up, and then averages the GFLOPS across all your working cores.

The HT cores, counting for half your total threads, will drag down the average in LinX.

If you disable HT, you will get a better number in LinX because you're not averaging the slower cores into your total GFLOPS count.
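(A toy model of that averaging argument -- every figure here is illustrative, assumed only for the sake of the arithmetic, not measured:)

[code]
#include <stdio.h>

int main(void)
{
    /* Toy model: 4 physical cores, each good for ~20 GFLOPS in LinX.
       With HT on, 8 threads share the same 4 FPUs: total hardware
       throughput is unchanged, but the solver's synchronized phases
       now wait on threads stuck sharing a core, so the reported
       number drops. The 25% loss factor is an assumption chosen to
       mirror the ~79 -> ~60 GFLOPS figures quoted above. */
    double per_core = 20.0;
    double ht_loss  = 0.25;

    double no_ht   = 4.0 * per_core;                   /* ~80 GFLOPS */
    double with_ht = 4.0 * per_core * (1.0 - ht_loss); /* ~60 GFLOPS */

    printf("HT off: %.0f GFLOPS, HT on: %.0f GFLOPS\n", no_ht, with_ht);
    return 0;
}
[/code]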