Why is the scaling of my Core i7 920 so poor, even worse than the Q6600?

jsmith001

Junior Member
Nov 21, 2008
2
0
0
Update:

I redid my tests on Windows XP, with and without HT, using compiler flags similar to those used for SPEC 2006.
I also overclocked the Core i7 from 2.66G to 3.5G. Below are the setups and the comparison.

Core i7 computer
CPU: core i7 920 at 2.66G
MB: ASUS P6T
memory: 6G DDR3 1600
OS: Fedora 10 64bit
Compiler: Intel Fortran Compiler Professional 11.0.069
Compiler flags: -r8 -xSSE4.2 -ip -openmp -opt-prefetch

OS: Windows XP 64bit
Compiler: Intel Visual Fortran Compiler Professional 11.0.066
Compiler flags: /Qautodouble /QxSSE4.2 /Qip /Qopenmp /Qopt-prefetch /F1000000000

Core 2 Quad computer
CPU: Core 2 Q6600 at 2.4G
MB: gigabyte p35
memory: 4G DDR2 800
OS: fedora 9 64bit
Compiler: Intel Fortran Compiler Professional 11.0.069
Compiler flags: -r8 -ip -openmp

Dual Xeon computer
CPU: dual Xeon E5410 at 2.33G
MB: Tyan S5396A2NRF
memory: 8G DDR2 667 ECC Fully Buffered
OS: fedora 9 64bit
Compiler: Intel Fortran Compiler Professional 11.0.069
Compiler flags: -r8 -ip -openmp


The test code is an OpenMP CFD (computational fluid dynamics) code. It uses about 500MB of memory. The test results:

results

*For Core i7, Turbo mode is OFF.

For four threads, Core i7 920@3.55G is slower than Q6600@2.4G!
The scaling of Xeon E5410 is the best.

----------------------------------------------------------------------------------------------

I read a lot of reviews and forum posts about the new Intel Core i7 CPU. It looks great. For example, according to SPEC CFP2006,
http://www.amdzone.com/phpbb3/...opic.php?f=52&t=135802, Core i7 is about 100% faster than Core 2 in floating-point performance. Although someone suggested that Intel might have optimized the CPU for the benchmark codes, I still decided to upgrade my computer from Core 2 (Q6600) to Core i7 (920), because I thought the higher bandwidth of Core i7 would help multithreaded computation. Well, I was wrong. My test results left me very disappointed with the new Core i7 CPU.

My new computer
CPU: core i7 920 at 2.66G
MB: asus p6t
memory: 6G DDR3 1600
OS: Linux fedora 10 64bit

My old computer
CPU: core 2 Q6600 at 2.4G
MB: gigabyte p35
memory: 4G DDR2 800
OS: Linux fedora 9 64bit

compiler: Intel ifort 11

The test code is an OpenMP CFD (computational fluid dynamics) code. It uses about 500MB of memory. The test results:

1 thread : 202.7s (core i7 920) , 213.1s (q6600), 5.13%(core i7 advantage)
2 threads: 109.0s (core i7 920) , 109.3s (q6600), 0.27%(core i7 advantage)
4 threads: 96.1s (core i7 920) , 68.3s (q6600), -28.93%(core i7 advantage)

(Core i7 HT is off; it is slower with HT on. The clock of the Core i7 920 is about 10% higher than that of the Core 2 Q6600.)

The scaling of Core i7 at 4 threads is really bad. Considering that the bandwidth of Core i7 is about twice that of Core 2, this result is really strange. I don't know why; maybe the OS doesn't use the full potential of the new CPU?
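
For reference, the timings above can be turned into speedup and parallel efficiency figures. This is only a minimal sketch: the dictionary layout and the `scaling` helper are names made up for illustration, and the only inputs are the wall-clock times listed in this post.

```python
# Wall-clock times in seconds from the post above, keyed by thread count.
times = {
    "i7 920": {1: 202.7, 2: 109.0, 4: 96.1},
    "Q6600": {1: 213.1, 2: 109.3, 4: 68.3},
}


def scaling(t):
    """Return {threads: (speedup, efficiency)} relative to the 1-thread time."""
    return {n: (t[1] / t[n], t[1] / t[n] / n) for n in sorted(t)}


# "results" is an illustrative name, not something from the benchmark itself.
results = {cpu: scaling(t) for cpu, t in times.items()}

for cpu, table in results.items():
    for n, (speedup, eff) in table.items():
        print(f"{cpu:7s} {n} threads: speedup {speedup:.2f}x, efficiency {eff:.0%}")
```

Speedup here is the 1-thread time divided by the n-thread time, and efficiency is that speedup divided by n, so a perfectly scaling code would show 100% at every thread count.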
 

Borealis7

Platinum Member
Oct 19, 2006
2,901
205
106
Well, it could be that the program you're using isn't optimized for quad cores (you say that turning HT on makes it slower, which could be due to technical reasons)

...or simply PEBCAK. :D
 

Idontcare

Elite Member
Oct 10, 1999
21,110
59
91
Originally posted by: jsmith001
The scaling of Core i7 at 4 threads is really bad. Considering that the bandwidth of Core i7 is about twice that of Core 2, this result is really strange. I don't know why...?

I came to the exact same conclusions from detailed analysis of two other scaling benchmarks from real application data as well, as JackyP noted.

I'll toss your data into my spreadsheet and will report back the results by similar analysis. Stay tuned.
 

Idontcare

Elite Member
Oct 10, 1999
21,110
59
91
http://i272.photobucket.com/al...DFBenchmarkScaling.gif

As you can see in the graph the i7 920 scaling performance on this code does appear quite dismal in comparison to that of the Q6600.

However the scaling data for the i7 in this case does not conform to the standard Almasi/Gottlieb modified Amdahl equation.

My primary suspicion for this being the case is that maybe you left the dynamic clocking feature enabled during the benchmarking? If this is true then we can't really ratio the results from four threads to that of one thread and call it a scaling ratio in the terms that scaling ratio numbers are intended to imply.

In other words, with dynamic clocking enabled the performance for 1 thread is actually based on a clockspeed of 2.93GHz, whereas the data for 2 threads would be derived from core speeds of 2.8GHz and finally the benchmark results from 4 threads were generated from cores with clockspeeds of 2.66GHz.

If this is the case then naturally the scaling results from 1 to 4 threads are going to be flat and non-conformal with the standard models.
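
One rough way to check whether timings conform to a standard model is to back out the serial fraction at each thread count and see whether it stays constant. The sketch below uses plain Amdahl's law, S = 1 / (f + (1 - f) / n), which is an assumption on my part and not necessarily the exact Almasi/Gottlieb modified form mentioned above. The helper name `serial_fraction` is made up for illustration, and the timings are the ones the OP posted.

```python
# OP's wall-clock times in seconds, keyed by thread count.
times = {
    "i7 920": {1: 202.7, 2: 109.0, 4: 96.1},
    "Q6600": {1: 213.1, 2: 109.3, 4: 68.3},
}


def serial_fraction(speedup, threads):
    """Invert plain Amdahl's law S = 1 / (f + (1 - f) / n) to get f."""
    return (threads / speedup - 1.0) / (threads - 1.0)


# Implied serial fraction per CPU and thread count (1 thread is undefined).
implied = {
    cpu: {n: serial_fraction(t[1] / t[n], n) for n in t if n > 1}
    for cpu, t in times.items()
}

for cpu, fractions in implied.items():
    for n, f in sorted(fractions.items()):
        print(f"{cpu:7s} {n} threads: implied serial fraction {f:.3f}")
```

If the data followed plain Amdahl scaling, the implied fraction would be about the same at 2 and 4 threads. A large jump between the two is a sign that something other than a fixed serial portion is limiting the 4-thread run.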

To really generate scaling data for the Nehalem architecture for your specific app (to see if there is a weakness in the underlying interprocessor communications) you need to generate scaling data with a fixed core clockspeed. You should be able to disable dynamic clocking (aka turbomode) in the BIOS.

In doing so we would expect the result for four threads to remain the same as before (but check it just to confirm) while the absolute performance for 1 thread and 2 threads would decrease relative to your initial scores...thus boosting the scaling number for the 4 thread result commensurately more.
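
To put a rough number on that expectation, here is a sketch that rescales the posted i7 times to a fixed 2.66GHz clock. It rests on two loud assumptions: that turbo really was active at 2.93/2.8/2.66GHz for 1/2/4 threads (the suspicion above, not a confirmed fact), and that runtime is inversely proportional to core clock, which only holds for purely compute-bound code. All variable names are illustrative.

```python
BASE_GHZ = 2.66

# Hypothetical turbo clocks per thread count, taken from the suspicion above.
turbo_ghz = {1: 2.93, 2: 2.80, 4: 2.66}

# Posted i7 920 wall-clock times in seconds.
measured_s = {1: 202.7, 2: 109.0, 4: 96.1}

# Assumption: runtime scales as 1 / clock, so a lower clock stretches the time.
fixed_clock_s = {n: measured_s[n] * turbo_ghz[n] / BASE_GHZ for n in measured_s}

speedup_measured = {n: measured_s[1] / measured_s[n] for n in measured_s}
speedup_fixed = {n: fixed_clock_s[1] / fixed_clock_s[n] for n in fixed_clock_s}

for n in sorted(measured_s):
    print(
        f"{n} threads: estimated {fixed_clock_s[n]:.1f}s at fixed clock, "
        f"speedup {speedup_measured[n]:.2f}x -> {speedup_fixed[n]:.2f}x"
    )
```

Under these assumptions the 4 thread time is unchanged and the 4 thread speedup rises from roughly 2.1x to roughly 2.3x, so turbo alone would account for only part of the gap. Real fixed-clock measurements are still needed to settle it.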

Remember, we do this only to generate scaling results, so that we can analyze the data and determine whether there is a weakness or an opportunity to improve something in the underlying memory subsystem and ultimately increase the raw performance of the 4 thread result. We are not trying to intentionally generate poor results just to make the scaling look good on a graph.
 

Griswold

Senior member
Dec 24, 2004
630
0
0
Originally posted by: SunnyD
Originally posted by: palladium
why did you use fedora 9 for your i7 and fedora 10 for your Q6600?

Indeed... I was wondering the same thing. Fedora 10 probably has better optimizations.

If he keeps the kernel and relevant applications up to date, there is no reason to upgrade the entire distro just because there is a new version out. That's the beauty of Linux compared to Windows, you know.
 

nyker96

Diamond Member
Apr 19, 2005
5,630
2
81
I think you should use the same Fedora version to do the comparison. The results would be more comparable.

But let's assume the results are valid; then going from 2->4 cores there's definitely a bottleneck in the system. Where? That's a good question, and someone more familiar with the i7 architecture might be able to say. But you're only talking about one application on i7 running in Fedora, which is still not enough data to draw a broad conclusion on i7 thread scaling. Do you have other threaded apps to run, say in Windows, for a test? The Q6600 is quite a bit more mature by now and i7 is still young, so it may possibly be a problem with compiler/OS code.
 

Idontcare

Elite Member
Oct 10, 1999
21,110
59
91
Originally posted by: nyker96
I think you should use the same Fedora version to do the comparison. The results would be more comparable.

But let's assume the results are valid; then going from 2->4 cores there's definitely a bottleneck in the system. Where? That's a good question, and someone more familiar with the i7 architecture might be able to say. But you're only talking about one application on i7 running in Fedora, which is still not enough data to draw a broad conclusion on i7 thread scaling. Do you have other threaded apps to run, say in Windows, for a test? The Q6600 is quite a bit more mature by now and i7 is still young, so it may possibly be a problem with compiler/OS code.

My suspicion is that the data are skewed because of dynamic clocking (turbomode) having elevated the 1 thread and 2 thread results, making it seem like the scaling from 2 to 4 threads is silly low. I'm hoping the OP runs the tests I suggested above to prove or disprove this theory.
 

magreen

Golden Member
Dec 27, 2006
1,309
1
81
If that theory were true it would resolve the i7 scaling discrepancy, but would bode very badly for the single-threaded performance of i7 vs. Penryn, at least in this application.
 

Idontcare

Elite Member
Oct 10, 1999
21,110
59
91
Originally posted by: magreen
If that theory were true it would resolve the i7 scaling discrepancy, but would bode very badly for the single-threaded performance of i7 vs. Penryn, at least in this application.

Originally posted by: jsmith001
1 thread : 202.7s (core i7 920) , 213.1s (q6600), 5.13%(core i7 advantage)

Yeah, no kidding, magreen. The i7 scores a 5% performance advantage over the 2.4GHz Q6600 in a single thread, and it does so either (1) in the best case, running at 2.66GHz (about 10% higher than the Q6600), or (2) in the worst case, running at 2.93GHz (a 22% clock increase over the Q6600) thanks to turbo mode. That implies the IPC of i7 on this app is worse than the IPC of Kentsfield in either case. :shocked:

If this is the case, that is, if the 202.7s score is from a 2.93GHz Nehalem core in comparison to a 2.4GHz Kentsfield scoring 213.1s, then I'd have to conclude this application is L2$ starved on the i7 architecture, as the Kentsfield gives the thread 4MB of L2$ while the Nehalem gives the thread a much smaller 256KB L2$.

The OP could very well have taken a large step back in upgrading from Q6600 to i7. The real upgrade for them would have been to step up to a Q9550 or Q9650, with the even larger/faster L2$ and faster core clockspeed.
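
The per-clock comparison in this post can be checked with a few lines of arithmetic. This is a sketch only: it uses the single-thread times the OP posted, treats "work per clock" as a crude stand-in for IPC, and the function name `relative_per_clock` is made up for illustration.

```python
Q6600_GHZ = 2.40
Q6600_TIME_S = 213.1  # OP's 1-thread time on the Q6600
I7_TIME_S = 202.7     # OP's 1-thread time on the i7 920


def relative_per_clock(i7_ghz):
    """i7 work per clock relative to the Q6600 (1.0 means equal)."""
    perf_ratio = Q6600_TIME_S / I7_TIME_S  # > 1 means the i7 finished sooner
    clock_ratio = i7_ghz / Q6600_GHZ
    return perf_ratio / clock_ratio


# 2.66GHz is the stock clock; 2.93GHz is the hypothetical turbo clock.
per_clock = {ghz: relative_per_clock(ghz) for ghz in (2.66, 2.93)}

for ghz, ratio in per_clock.items():
    print(f"i7 at {ghz:.2f}GHz: {ratio:.2f}x the Q6600's work per clock")
```

A ratio below 1.0 in both cases is what backs the claim that, for this one application, the i7 does less work per clock than the Q6600 whether or not turbo was active.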
 

dmens

Platinum Member
Mar 18, 2005
2,275
965
136
What compiler flags did you use? Prefetching optimization on Nehalem is quite different compared to Merom.