Core i7 Reviews

IntelUser2000

Elite Member
Oct 14, 2003
8,686
3,787
136
(2) why Nehalem 8MB shared L3$ cache runs so slow...2.66GHz...and has such piss-poor latency...compared to the nearly same size (6MB) shared L2$ on Penryn which runs faster clockspeed and MUCH lower latency?
Simple, it doesn't:
http://www.incrysis.com/forums/viewtopic.php?pid=412057

The L3 cache bandwidth shown there is significantly higher than in the Xbitlabs results, from which they concluded a 2.66GHz clock. I don't assume they were wrong about the clock at the point they benchmarked it, but it's not running at a fixed clock. I'd bet that the L3 cache can vary its clock speed to save power.
 

Janooo

Golden Member
Aug 22, 2005
1,067
13
81
Originally posted by: IntelUser2000
Looks like its not really Hyperthreading that affects multi-thread performance increase but rather the core itself. The multi-core performance scaling must be much better on Core i7 than Core 2. Increased bandwidth, better thread synchronization, L3 acting as a buffer for multi-threading.

???
HT is at work here. If it were a core improvement, I would expect results in a 2.44/3.57 ratio in favor of the i7 for the 1-, 2-, and 4-thread scenarios. That's not the case.
 

Idontcare

Elite Member
Oct 10, 1999
21,110
59
91
Originally posted by: IntelUser2000
To Idontcare:

You do live up to the name. Go look up at Xbitlabs and Tomshardware results. Compare the i7 benchmarks against QX9770 and see the HT-off results against QX9770. Most of the benefits come from the core changes

Xbitlabs

Core i7-965 SMT off/Turbo off vs Core 2 Quad QX9770: http://www.xbitlabs.com/articl...-core-i7_10.html#sect0

There are significant advantages to be had by CPU core changes alone, average benefit from non-SMT changes and SMT changes(not including PCMark and 3DMark) shows 8% each.

Tomshardware

Core i7-965 SMT on vs Core 2 Quad QX9770:
http://www.tomshardware.com/re...7-Nehalem,2057-11.html

Core i7-965 SMT off/on comparison:
http://www.tomshardware.com/re...7-Nehalem,2057-12.html

Substantial amount of performance comes from core changes.

Cinebench shows no difference because it's a real-world app that doesn't reflect the majority of other real-world apps.

It is indifferent to cache changes (<1% difference), whereas lots of apps will be affected by them:
http://www.nordichardware.com/.../?skrivelse=514&page=5

:thumbsup: Thanks for the synopsis and links :)
 

IntelUser2000

Originally posted by: Janooo
???
HT is at work here. If it was core improvement then I would expect 2.44/3.57 ratio results in favor of i7 for 1, 2, 4 threads scenarios. That's not the case.
That's simply because Cinebench doesn't reflect the vast majority of real-world benchmarks. As I said in the post above, it doesn't even scale with cache size.

In other applications, around half of Core i7's performance advantage in multi-threaded workloads comes from changes other than multi-threading itself.

Originally posted by: dmens
(3) all that work to transition from domino to static CMOS and the power savings are where? I expected vastly better power consumption numbers, vastly better.
You aren't supposed to see it, because the design is focused on performance. Doubling cores every process generation while also increasing performance per core isn't going to work in the future. It would run so hot that we'd be back in the age of Prescott in no time. Things like static CMOS and advanced clocking techniques are there just to keep power under control while pushing performance higher.

That's why we won't see 8-core consumer Nehalems. Even 32nm Westmere has 6 cores as its baseline. It'll be at 22nm, with Haswell, that we see 8-core consumer chips.
 

Idontcare

I don't get all the forum-level hubbub over whether SMT is a benefit or not.

Quite simply, you can't expect SMT to help any more than adding additional cores would help, i.e. your application of interest has to have >4 threads before the capability of processing >4 threads can be expected to provide a benefit.

Finding random benchmarks that show SMT improvement is pretty much useless in terms of trying to paint a picture of "SMT improves performance by an average of XX%".

To me the only purposeful discussion to have regarding SMT is one regarding the efficacy of SMT relative to having more hypothetical cores instead of logical ones.

In other words for a >4 thread application how well does SMT handle the extra threads over and above the native core threads versus the "what if" scenario of having more native core thread capability without SMT.

These tests where reviewers manually enable/disable SMT and then plot a graph showing x% improvement or deficit are really just highlighting the performance degradation that can come from thread migration versus the benefit of being able to process that 5th thread without impacting (by much) the processing speed of the other four active threads.

This kind of analysis is pretty useless to the consumer, but you can't fault the reviewers as they don't have experience reviewing or operating in an environment where multi-core multi-threading is a way of life. (as evidenced by the fact not a single review site crunched their data so as to present the viewer with scaling graphs as one would encounter in the HPC community)
 

Janooo

Originally posted by: Idontcare
I don't get all the forum-level hub-bub over SMT: does it benefit or not.

Quite simply you can't expect SMT to help anymore than adding additional cores would help...i.e. your application of interest has to have >4 threads before the capability of processing >4 threads can be expected to provide a benefit.

Finding random benchmarks that show SMT improvement is pretty much useless in terms of trying to paint a picture of "SMT improves performance by an average of XX%".

To me the only purposeful discussion to have regarding SMT is one regarding the efficacy of SMT relative to having more hypothetical cores instead of logical ones.

In other words for a >4 thread application how well does SMT handle the extra threads over and above the native core threads versus the "what if" scenario of having more native core thread capability without SMT.

These tests that reviewers do where they manually enable/disable SMT and then plot a graph showing x% improvement or deficit is really just highlighting the performance degradation that can come from thread migration versus the benefit of being able to process that 5th thread without impacting (by much) the processing speed of the other four active threads.

This kind of analysis is pretty useless to the consumer, but you can't fault the reviewers as they don't have experience reviewing or operating in an environment where multi-core multi-threading is a way of life. (as evidenced by the fact not a single review site crunched their data so as to present the viewer with scaling graphs as one would encounter in the HPC community)
Hmm, you live up to your name. :)
You know, there are people who make money by rendering. They take any performance increase that is available. They don't care about your point of view; they are interested in the 'XX%'.
 

JackyP

Member
Nov 2, 2008
66
0
0
To answer your question, Idontcare, we'd first need to know how much of the die is used up by HT logic (%), or how much the power budget is increased due to HT (%).
I think HT would need to take up ~20% of the cores or increase their power draw by ~20% before Intel could economically add another core instead. However, (some?) software does not really cope well with core counts that are not multiples of 2, IIRC, so Intel probably had the choice between 4 cores + 4 virtual ones or 6 cores, and the latter was probably impossible due to die size/heat/power draw considerations.
5 cores would at best increase performance by 25% over 4 cores. HT can increase performance by more than that, and I'm pretty sure that 6 cores would not be feasible, so that makes HT superior in my opinion.
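
The arithmetic behind that comparison can be sketched in a few lines of Python. This is only an idealized illustration: it assumes perfect scaling with core count, and the function names and the sample SMT gain are made up for the example, not taken from any benchmark.

```python
def ideal_core_gain(cores_before, cores_after):
    """Best-case throughput gain from adding cores, assuming perfect scaling."""
    return cores_after / cores_before - 1.0


def smt_beats_extra_cores(smt_gain, cores_before=4, cores_after=5):
    """True if a (hypothetical) SMT gain exceeds the ideal gain from extra cores."""
    return smt_gain > ideal_core_gain(cores_before, cores_after)


if __name__ == "__main__":
    print(f"4 -> 5 cores: {ideal_core_gain(4, 5):.0%} at best")
    print(f"4 -> 6 cores: {ideal_core_gain(4, 6):.0%} at best")
    # 0.30 is an illustrative SMT gain, not a measured result.
    print("30% SMT gain beats a fifth core:", smt_beats_extra_cores(0.30))
```

Under these assumptions SMT only has to deliver more than 25% to beat a fifth core, but it would need more than 50% to beat a six-core part.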

I hope this makes some sense *g*
 

myocardia

Diamond Member
Jun 21, 2003
9,291
30
91
Originally posted by: JackyP
However, (some?) software does not really cope well with core numbers that are not multiples of 2 IIRC, so Intel probably had the choice to go with 4 cores + 4 virtual ones or 6 cores, but the latter was probably impossible due to die size/heat/power draw considerations.

Not only is it possible, 6 cores has already been done, and even reviewed.
 

JackyP

I should have worded it more clearly: not feasible for a desktop/two-socket server platform (die too big, too costly, too high power draw, making it way too expensive for the end consumer).
Neither Beckton nor Tukwila will launch on the desktop first, even though they can be produced one day ;)
 

Idontcare

Originally posted by: Janooo
Hmm, you live up to your name. :)

Why do people keep saying this lately? You guys are going to start making me paranoid ;)

Originally posted by: Janooo
You know there are people who make money by rendering. They take any performance increase that is available. They don't care about your point of view they are interested in the 'XX%'.

You aren't disagreeing with what I wrote; you are discounting the fact that such folks will most certainly be comparing 8-thread performance from a dual-socket Harpertown board to the 8-thread performance of an i7.

They won't be interested in a table showing rendering performance with SMT turned off relative to rendering performance with SMT turned on. (which is more on point to what I was attempting to lament about)

Again, SMT is not about speeding up applications involving fewer threads than the CPU's core count. Anyone who thinks this is what SMT is about is getting side-tracked. If SMT improves application performance, then so too would a Skulltrail system over a single-socket Yorkfield. This isn't news, or at least it shouldn't be, which is why I am perplexed at the response to SMT from many forum-goers.

Originally posted by: JackyP
To answer your question idontcare, first we'd need to know how much of the die is used up by HT logic (%)? Or how much the power budget is increased due to HT (%).
I think HT would need to take up ~20% of the cores or increase their power draw by ~20% so that Intel could economically add another core instead. However, (some?) software does not really cope well with core numbers that are not multiples of 2 IIRC, so Intel probably had the choice to go with 4 cores + 4 virtual ones or 6 cores, but the latter was probably impossible due to die size/heat/power draw considerations.
5 cores would at best increase performance by up to 25%, HT can increase performance more than that and I'm pretty sure that 6 cores would not be feasible, so that makes HT superior in my opinion.

I hope this makes any sense *g*

I don't need to know any layout specifics of SMT on i7 to analyze the efficacy of the logical cores relative to the native cores (not that there is a distinction to be made until >1 thread resides on a given core). For this I merely need a set of carefully designed and controlled scaling tests for a given application.

This is not uncommon in the HPC community, but it appears to be relatively new to the enthusiast community, and I see a big wheel is about to be re-invented in the desktop evaluation/review community (and why not, fresh eyes may reveal new ways to analyze scaling data compared to the ways we've done it since Amdahl and the 1960s).
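
For anyone who hasn't run into the classic scaling model, here is a minimal sketch of Amdahl's law in Python. The function name and the 0.95 parallel fraction are assumptions picked purely for illustration; real applications have to be measured.

```python
def amdahl_speedup(parallel_fraction, n_threads):
    """Amdahl's law: speedup when only `parallel_fraction` of the work scales."""
    if not 0.0 <= parallel_fraction <= 1.0:
        raise ValueError("parallel_fraction must be between 0 and 1")
    serial = 1.0 - parallel_fraction
    return 1.0 / (serial + parallel_fraction / n_threads)


if __name__ == "__main__":
    # 0.95 is an assumed parallel fraction, chosen only for illustration.
    for n in (1, 2, 4, 8):
        print(n, "threads:", round(amdahl_speedup(0.95, n), 2), "x")
```

Comparing measured speedups against a curve like this is one way to see whether the 5th through 8th threads behave like real cores or fall short.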
 

Janooo

Originally posted by: Idontcare
...
You aren't disagreeing with what I wrote, you are discounting the fact that such folks will most certainly be comparing 8 thread performance from a dual-socket harpertown board to the 8 thread performance of an i7.
...
Well, partly correct. They'll be comparing price/performance ratio.
Dual-socket is more expensive. If money is no issue then dual i7 will be great.
Nevertheless any performance increase is welcome. HT works, great.
 

JackyP

Originally posted by: Idontcare


I don't need to know any layout specifics of SMT on i7 to analyze the efficacy of the logical cores relative to the native cores (not that there is a distinguishment to be made until >1 thread resides on a given core). For this I merely need a set of carefully designed and controlled scaling tests for a given application.
I just offered another take on this problem, I don't even know if you disagree with my methodology. Maybe you really don't care.

You said:
"To me the only purposeful discussion to have regarding SMT is one regarding the efficacy of SMT relative to having more hypothetical cores instead of logical ones."

This statement made me think you wanted to find out whether Intel should've implemented more cores or rather HT. Hypothetical comparisons may be interesting, but I prefer the practical approach. Even if HT is less efficient than more core(s) it's absolutely irrelevant for this generation, if it was impossible to have more cores due to die size or thermal constraints.

I thought efficacy must always be seen in relation to die size or power draw.
 

Idontcare

Originally posted by: JackyP
I just offered another take on this problem, I don't even know if you disagree with my methodology. Maybe you really don't care.

I don't disagree with your methodology, but it isn't necessary to answer the question I am personally interested in. Sure, if we could answer the questions you pose then I'd like to have that analysis as well, but I view it as secondary to my current questions about Nehalem's SMT implementation.

Originally posted by: JackyP
I thought efficacy must always be seen in relation to die size or power draw.

And you would be correct if the objective were to understand the efficacy of SMT relative to die size or power draw.

While such info would be nice to know, I am primarily interested in understanding how efficient SMT is in truly emulating a full core in the absence of SMT.

Whether it does a good job at this task, emulating a full-blown core, or not is of relevance to me and anyone who pays per thread (vs per socket) for their apps or is currently creating multi-threaded code and needs to prioritize development resources to balance threading versus timeline to release.

The efficacy I am referring to is in ability to emulate another core in thread processing capability and not so much to answer a question of whether it does a shitty job but a justifiably shitty job because it only adds a paltry amount of xtors to the core or watts to the power consumption.

Those numbers would be great to have as well for the enthusiast/geeky side of me, but the business side needs to know whether 5 threads on an i7 are going to scale as expected based on the performance scaling of 1 through 4 threads. The Euler benchmark data discussed earlier does in fact suggest SMT is nearly the hardware equivalent of full-blown cores. More thread-scaling data from more applications would be what's needed here (easily generated by setting application affinity in order to exclude cores in sequential fashion).

I'm sure knowing SMT's power footprint or xtor footprint so we can create performance/footprint numbers would be nice too.
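
A rough sketch of how such a scaling run could be organized, in Python. The helper names are hypothetical, the wall times are placeholders rather than measurements, and which logical CPU IDs are SMT siblings of the same physical core depends on the OS and has to be checked before reading anything into the results.

```python
def affinity_sets(n_logical_cpus):
    """CPU sets of growing size: {0}, {0, 1}, ..., one per scaling run."""
    return [set(range(k)) for k in range(1, n_logical_cpus + 1)]


def scaling_table(wall_times):
    """Map {thread_count: seconds} to rows of (threads, speedup, efficiency)."""
    base = wall_times[1]
    rows = []
    for n in sorted(wall_times):
        speedup = base / wall_times[n]
        rows.append((n, speedup, speedup / n))
    return rows


if __name__ == "__main__":
    # Hypothetical wall times in seconds, NOT measured on any real CPU.
    times = {1: 100.0, 2: 51.0, 4: 26.0, 5: 24.0, 8: 20.0}
    for n, speedup, eff in scaling_table(times):
        print(f"{n} threads: {speedup:.2f}x speedup, {eff:.0%} efficiency")
```

On Linux each set could be applied before launching the workload with `os.sched_setaffinity(0, cpus)`; on Windows the affinity would be set through Task Manager or a similar tool instead. A visible drop in efficiency between 4 and 5 threads would be the signature of a logical core doing less than a native one.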
 

Tempered81

Diamond Member
Jan 29, 2007
6,374
1
81
Originally posted by: Idontcare
Originally posted by: JackyP
I just offered another take on this problem, I don't even know if you disagree with my methodology. Maybe you really don't care.

I don't disagree with your methodology, but it isn't necessary to answer the question I am personally interested in. Sure if we could answer the questions you pose then I'd like to have that analyses as well, but I view it secondary to my current questions about Nehalem's SMT implementation.

Originally posted by: JackyP
I thought efficacy must always be seen in relation to die size or power draw.

And you would be correct if the objective were to understand the efficacy of SMT relative to die size or power draw.

While such info would be nice to know, I am primarily interested in understanding how efficient SMT is in truly emulating a full core in the absence of SMT.

Whether it does a good job at this task, emulating a full-blown core, or not is of relevance to me and anyone who pays per thread (vs per socket) for their apps or is currently creating multi-threaded code and needs to prioritize development resources to balance threading versus timeline to release.

The efficacy I am referring to is in ability to emulate another core in thread processing capability and not so much to answer a question of whether it does a shitty job but a justifiably shitty job because it only adds a paltry amount of xtors to the core or watts to the power consumption.

Those numbers would be great to have as well for the enthusiast/geeky side of me, but the business side needs to know is 5 threads on an i7 are going to scale as expected based on the performance scaling of 1 thru 4 threads. The Euler benchmark data discussed earlier does in fact suggest SMT is nearly hardware equivalent to emulating full-blown cores. More thread scaling data from more applications would be whats needed here. (easily generated by setting application affinity in order to exclude cores in sequential fashion)

I'm sure knowing SMT's power footprint or xtor footprint so we can create performance/footprint numbers would be nice too.

I get the impression that you just don't care, buddy.

What gives?

:laugh:
:)
 

Idontcare

Originally posted by: dmens
(3) all that work to transition from domino to static CMOS and the power savings are where? I expected vastly better power consumption numbers, vastly better.

your expectations were misplaced.

Anand's latest article actually addresses my expectations quite deftly.

Note that the idle power on the i7-965 is very low, one thing that must be enabled to achieve this is the QPI power management option in the X58 BIOS which for whatever reason was disabled by default in our original review.

Power consumption is actually pretty nicely improved based on the results published here:

http://www.anandtech.com/cpuch...howdoc.aspx?i=3453&p=4
 

dmens

Platinum Member
Mar 18, 2005
2,275
965
136
i think that is the PCU's fine-grained power control at work rather than the domino->static conversion, imo. domino can be better than static logic in some cases.
 

Idontcare

Originally posted by: dmens
i think that is the PCU's fine-grained power control at work rather than the domino->static conversion, imo. domino can be better than static logic in some cases.

I don't know enough about the technical details of any of that to add further value by commenting but it continues to beg the question "what the hell was Intel doing at IDF bothering everyone with these slides and graphs"?

I'm not trying to make a big deal out of the static/domino deal; I am trying to figure out why Intel (and by consequence their press kits, and by consequence every reviewer out there who regurgitated the press kits, etc.) went to such lengths to make a big deal out of it.

I mean, they don't usually waste their time or our time (at least lately) with superfluous marketing stuff, so I am assuming a priori that it is me that is missing the boat here, ergo my line of questioning: where's the beef? (Or perhaps I should ask why Intel told me there would be beef but then brought beef that tastes like the same old chicken they've been serving since 1990?)
 

dmens

lol good question. i think it makes a great talking point. the funny thing is, there's still lots of domino in the design in register files. its not a completely static design at all.
 

Idontcare

Thanks, I appreciate the sanity check, so I'm not entirely off-base on this I guess.

Still though I like the new power numbers that Anand published. It's actually quite a nice showing for Nehalem.

Not sure why nobody actually crunches the data into performance/watt metrics anymore; I guess it's not sexy enough anymore. It's so 2007.

I went ahead and crunched Anand's data to convert it to performance/watt:

Benchmark                  QX9770 (3.2GHz)      Core i7-965 (3.2GHz)   Improvement
POV-Ray                    11.4 PPS/Watt        17.5 PPS/Watt          53%
Cinebench (1 thread)       20.3 CBMarks/Watt    26.6 CBMarks/Watt      31%
Cinebench (max threads)    61.8 CBMarks/Watt    81.5 CBMarks/Watt      32%
3dsmax 9 SPECapc CPU       0.060 /Watt          0.084 /Watt            41%
x264 HD Encode Test        0.32 fps/Watt        0.44 fps/Watt          38%
DivX 6.8.3                 2.61 Watts           1.84 Watts             29%
Windows Media Encoder      2.01 Watts           1.34 Watts             33%
Age of Conan               0.35 fps/Watt        0.46 fps/Watt          31%
Race Driver GRID           0.30 fps/Watt        0.34 fps/Watt          15%
Crysis                     0.14 fps/Watt        0.16 fps/Watt          15%
FarCry 2                   0.32 fps/Watt        0.42 fps/Watt          34%
Fallout 3                  0.25 fps/Watt        0.37 fps/Watt          45%

Unless I made a mistake in the math, the i7 beat the QX9770 in every test. The average percent power consumption reduction per unit of work being done is 33% for the i7 over Yorkfield.

Now I am finally seeing the 30-40% power consumption reduction numbers I was expecting once performance is normalized :D Me much happier now!
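
For anyone who wants to check the math, here is a small Python sketch that redoes the calculation from the per-watt figures in the table. Two assumptions are baked in: the two encoder rows are treated as lower-is-better quantities, and the posted percentages are presumed to come from unrounded source data, so recomputing from the rounded table values will not match every row exactly. The function names are just for the example.

```python
# (QX9770, Core i7-965) values as posted in the table; higher is better.
HIGHER_IS_BETTER = {
    "POV-Ray": (11.4, 17.5),
    "Cinebench (1 thread)": (20.3, 26.6),
    "Cinebench (max threads)": (61.8, 81.5),
    "3dsmax 9 SPECapc CPU": (0.060, 0.084),
    "x264 HD Encode Test": (0.32, 0.44),
    "Age of Conan": (0.35, 0.46),
    "Race Driver GRID": (0.30, 0.34),
    "Crysis": (0.14, 0.16),
    "FarCry 2": (0.32, 0.42),
    "Fallout 3": (0.25, 0.37),
}
# Assumption: for these two rows a lower number is better.
LOWER_IS_BETTER = {
    "DivX 6.8.3": (2.61, 1.84),
    "Windows Media Encoder": (2.01, 1.34),
}
# Improvement column exactly as posted.
POSTED = [53, 31, 32, 41, 38, 29, 33, 31, 15, 15, 34, 45]


def gain_pct(old, new):
    """Percent increase of a higher-is-better metric."""
    return (new / old - 1.0) * 100.0


def reduction_pct(old, new):
    """Percent decrease of a lower-is-better metric."""
    return (1.0 - new / old) * 100.0


def energy_saving_pct(perf_per_watt_gain_pct):
    """Energy saved per unit of work for a given perf/watt gain (both in %)."""
    return perf_per_watt_gain_pct / (100.0 + perf_per_watt_gain_pct) * 100.0


def recomputed():
    """Improvements recomputed from the rounded table values."""
    values = [gain_pct(old, new) for old, new in HIGHER_IS_BETTER.values()]
    values += [reduction_pct(old, new) for old, new in LOWER_IS_BETTER.values()]
    return values


if __name__ == "__main__":
    print("average of posted column:", round(sum(POSTED) / len(POSTED), 1))
    vals = recomputed()
    print("average recomputed from table:", round(sum(vals) / len(vals), 1))
    print("energy saved at +33% perf/watt:", round(energy_saving_pct(33.0), 1))
```

One caveat worth keeping in mind: a 33% gain in performance per watt is not quite the same thing as 33% less power per unit of work. By the last function it works out to roughly a quarter less energy for the same job.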
 

SsupernovaE

Golden Member
Dec 12, 2006
1,128
0
76
Originally posted by: Idontcare
Thanks, I appreciate the sanity check, so I'm not entirely off-base on this I guess.

Still though I like the new power numbers that Anand published. It's actually quite a nice showing for Nehalem.

Not sure why nobody actually crunches the data into performance/watt metrics anymore, guess its not sexy enough anymore. It's so 2007.

I went ahead and crunched Anand's data to convert it to performance/watt:

...

Unless I made a mistake in the math the i7 beat the QX9770 in every test. The average percent power consumption reduction per unit of work being done is 33% for the i7 over yorkfield.

Now I am finally seeing the 30-40% power consumption reduction numbers I was expecting once performance is normalized :D Me much happier now!


So you think this architecture would work well in notebooks?
 

Idontcare

Originally posted by: SsupernovaE
So you think this architecture would work well in notebooks?

Would work very well I would think. Especially a dual-core quad-thread chip.

Set the max TDP lower of course, 35W or less, and let the chip regulate itself so that the multiplier and GHz always keep the chip within the budget while the PCU shuts off entire cores when they are unused and idle (which Yorkfield cannot do: 1 loaded core means the other 3 cores are at full speed and full voltage).

I don't see how it could not be a winner performance wise...may not be priced to our liking though :)
 

Denithor

Diamond Member
Apr 11, 2004
6,298
23
81
Originally posted by: Idontcare
Would work very well I would think. Especially a dual-core quad-thread chip.

I wouldn't be surprised to see Intel launch the notebook version as a dual-core without HT enabled. When HT kicks in and works like it was designed power consumption goes up like crazy (more cores are utilized fully). On a desktop that's not a problem but on a laptop that eats battery time fast.
 

dmens

im still waiting for dynamic HT turnon/turnoff. it's there, where's the software to use it?
 

sunnn

Member
Oct 30, 2008
30
0
0
Originally posted by: Idontcare
Thanks, I appreciate the sanity check, so I'm not entirely off-base on this I guess.

Still though I like the new power numbers that Anand published. It's actually quite a nice showing for Nehalem.

Not sure why nobody actually crunches the data into performance/watt metrics anymore, guess its not sexy enough anymore. It's so 2007.

I went ahead and crunched Anand's data to convert it to performance/watt:

...

Unless I made a mistake in the math the i7 beat the QX9770 in every test. The average percent power consumption reduction per unit of work being done is 33% for the i7 over yorkfield.

Now I am finally seeing the 30-40% power consumption reduction numbers I was expecting once performance is normalized :D Me much happier now!

just to caution you on throwing numbers out there: i965 data is hardly representative of the whole nehalem family. tests were done with turbo on. WITH TURBO OFF, i920 power consumption is hardly (5-10%?) improved compared to q9450.
also, i failed to find power consumption for the other tests. maybe i just missed them; can you kindly point me to them?
lastly, i965s are being marketed towards people with multi-gpu setups, people who overclock the hell out of their cpus; the performance/watt metric is the least of their concerns.

data taken from nehalem:dark knight page 9:
benchmark         q9450         i920          improvement   i965-over-q9770
crysis            0.154fps/w    0.144fps/w    -6.9%         15%
pov-ray 3.7       13.32pps/w    17.44pps/w    31.1%         53%
cinebench xcpu    65.6cbm/w     77.71cbm/w    18.6%         32%
x264 hd           0.37fps/w     0.42fps/w     13.5%         38%
edit: a bit off topic:
it's been generally accepted that in gaming, phenom is 15% and 20% slower than kentsfield and yorkfield, respectively, clock-for-clock. so far, across the 5 games anand tested here (9950be vs. q9450), the average difference is 19.5%.