Anand's Sandy Bridge performance preview is up

Edrick

Golden Member
Feb 18, 2010
1,939
230
106
Why would Intel be packing the PCIe controller onto the CPU now if they were going to try and kill off the discrete GPU market? What else would you need that much PCIe bandwidth for other than a GPU?

The 40 lanes of PCIe 3.0 on the LGA2011 SB chips are much more exciting than the integrated GPU in these "fusion" products.
 

ilkhan

Golden Member
Jul 21, 2006
1,117
1
0
IMO the reason it happened is that they wanted to keep to the "Tick/Tock" schedule. Although Intel is generally regarded as the uncontested leader in process technology, only a small portion of their products have that lead.

Servers were where they were lacking badly, so they must have decided Bloomfield should arrive as fast as possible. The dual cores could have arrived with Lynnfield, but they were delayed and the 32nm parts were pulled in instead (search for Havendale). Otherwise, the first 32nm parts might have been March/April with Gulftown instead of January with Clarkdale.
Didn't server Bloomfield arrive like 6 months after desktop Bloomfield?
 

IntelUser2000

Elite Member
Oct 14, 2003
8,686
3,787
136
March 30, so 4 months? But other than that, it's basically the same platform. They could have decided to bring Bloomfield out in late '09 and launch Lynnfield in early '09.
 

ilkhan

Golden Member
Jul 21, 2006
1,117
1
0
Bump. :)
Did Anand ever find out/disclose whether the GPU used was GT1 or GT2?
There was an update article but at that point Anand still wasn't sure if it was 6 or 12 EUs.
 

IntelUser2000

Elite Member
Oct 14, 2003
8,686
3,787
136
Nope, he never said anything. But Intel is claiming 2x over the previous gen in literally every presentation, so there's still a chance the AnandTech preview numbers are very close to final.
 

Fjodor2001

Diamond Member
Feb 6, 2010
4,160
566
126
Just wondering if there's any reason the L2 cache is so small in the Sandy Bridge CPUs? For most models, such as the i5-2500, it's only 6MB. The older quad cores such as the Q9550 had 12MB, i.e. twice as much. Why are the Sandy Bridge CPUs scaling back on the amount of cache memory?
 

exar333

Diamond Member
Feb 7, 2004
8,518
8
91
Just wondering if there's any reason the L2 cache is so small in the Sandy Bridge CPUs? For most models, such as the i5-2500, it's only 6MB. The older quad cores such as the Q9550 had 12MB, i.e. twice as much. Why are the Sandy Bridge CPUs scaling back on the amount of cache memory?

Supposedly the current architectures (Bloomfield and SB) do not need a lot of cache. If you look at the recently launched quad-core i7 Xeons with 12MB, they really don't show much of a difference in performance compared to the 8MB versions. There are certainly cases where the cache could make a difference, but performance doesn't seem to suffer from a lack of it.
 

jvroig

Platinum Member
Nov 4, 2009
2,394
1
81
Just wondering if there's any reason the L2 cache is so small in the Sandy Bridge CPUs? For most models, such as the i5-2500, it's only 6MB. Why are the Sandy Bridge CPUs scaling back on the amount of cache memory?
The 6MB cache is L3, not L2. The L2 remains at 256KB per core, same as Nehalem, so it is not just Sandy Bridge that fits your observation.

The cache design has changed a lot since Core2. Lack of L3 (in Core2) was one of the things cited as an enterprise shortcoming (in their C2Q-based Xeon lines, of course), and it didn't matter that it had tremendous amounts of L2. Understanding what cache does and doesn't do (and what L1/L2/L3 are supposed to accomplish and their respective roles) will provide part of the answer to your question.

The rest of the answer is simply: design trade-offs and constraints. Given an unlimited transistor budget, an unlimited R&D budget, and freedom from economic constraints (such as not having to hit a particular price bracket), their design would be totally different. Since that is clearly impossible (they are limited by their transistor and R&D budgets, and have to meet business goals such as selling different SKUs at different performance levels while maintaining Intel's particularly high gross margins), the design has to be balanced and optimized to make the best use of those budgets and still meet those goals. In this case (Sandy Bridge, and Nehalem before it), they made optimizations and improvements all around (some examples: Nehalem's IMC, 2nd-level branch predictor, and HTT; Sandy Bridge's uop cache), including a different cache hierarchy. This is not to say there is no need for more cache: with an unlimited transistor budget, they certainly would have added more. But given that they don't operate in a "no limits" scenario, the design has become what it is.

A particular self-imposed design constraint Intel started applying to their desktop processors with Nehalem was the 2:1 rule: for a feature to make its way into the design, it had to increase performance by 2% for every 1% it added to power consumption, otherwise it wouldn't make the cut. This has no doubt affected the outcome of the design as well; without this rule, all bets are off and who knows how Nehalem's final design would have turned out.
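
In code terms that rule is just a ratio test. A toy sketch, with feature numbers invented for illustration (not Intel data):

def passes_2_to_1_rule(perf_gain_pct, power_cost_pct):
    # True only if the feature buys at least 2% performance
    # per 1% of power it adds (the 2:1 threshold described above).
    return perf_gain_pct >= 2.0 * power_cost_pct

print(passes_2_to_1_rule(6.0, 2.0))  # True: 3:1, easily makes the cut
print(passes_2_to_1_rule(3.0, 2.0))  # False: 1.5:1, gets dropped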
 

Idontcare

Elite Member
Oct 10, 1999
21,110
59
91
Supposedly the current architectures (Bloomfield and SB) do not need a lot of cache. If you look at the recently launched quad-core i7 Xeons with 12MB, they really don't show much of a difference in performance compared to the 8MB versions. There are certainly cases where the cache could make a difference, but performance doesn't seem to suffer from a lack of it.

If you consider what the shared cache is supposed to accomplish for the architecture (reducing interprocessor communication latency in multi-threaded apps), it's a simple argument to make that for a select group of apps the cache isn't all that needed.

Move to another select group of apps, found predominantly in the server space or the HPC space, and suddenly the interprocessor communication requirements really need the shared cache to avoid performance bottlenecks.

The vast majority (if not all) of desktop/consumer apps that are multi-threaded are also what one would consider embarrassingly parallel and coarse-grained: video transcoders, image filters, file compression, etc. There is very little to actually challenge the thread-communication fabric in a desktop processor.

The Athlon II X4 speaks to this directly. You won't find a CPU without a shared cache performing all that well in the server space, nor in HPC, but it's quite reasonable on the desktop because the apps are just so darned coarse-grained that it's a non-issue.
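
To make "coarse-grained" concrete, a minimal sketch (the workload is invented): each worker gets its own independent slice of the data and never talks to the others, so there is no interprocessor traffic for a shared cache to accelerate.

from multiprocessing import Pool

def process_chunk(bounds):
    # Stand-in for per-chunk work: transcoding frames, filtering an
    # image tile, compressing a file block. No shared state at all.
    start, end = bounds
    return sum(x * x for x in range(start, end))

if __name__ == "__main__":
    n, workers = 4_000_000, 4
    step = n // workers
    chunks = [(i * step, (i + 1) * step) for i in range(workers)]
    with Pool(workers) as pool:
        results = pool.map(process_chunk, chunks)  # zero cross-talk between workers
    print(sum(results))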
 

OBLAMA2009

Diamond Member
Apr 17, 2008
6,574
3
0
I'm a little confused about the designations. Why is a performance-optimized S chip clocked lower? Isn't that power-optimized?


Also, didn't this site already show that diagram a few months ago?
 

Nemesis 1

Lifer
Dec 30, 2006
11,366
2
0
Well, it really is kind of hard to talk about the feel and the quickness of the circuits. The low-level cache latency is less significant. Combine that with two other variables: the cache speed increases as you overclock the CPU, and the ring bus latency also scales with the number of stops.

It's hard to really describe the feel of SB. The best descriptor I could use is fluid.
 

Kuzi

Senior member
Sep 16, 2007
572
0
0
Perhaps somebody could go into more detail on this?

Well, I can try to give it a go in general terms, although there are many guys here who are way more knowledgeable and can explain in much more detail :)

Think of cache (all levels) as very high-speed memory that speeds up access to instructions/data when the CPU requests them. A modern CPU is much faster than main memory, and if it had no cache, every time it needed certain data it would have to go over the comparatively slow memory bus to fetch it. A modern CPU fetches data as follows:

CPU -> L1 -> L2 -> L3 -> RAM -> SSD/HD/CD

The L1 cache is the smallest but also the fastest, so that is where the CPU looks first. If there is a "hit", meaning the instruction/data is found there, it can be executed; if there is a "miss", the CPU scans the L2 cache, and so on. Usually every level is bigger than the one before it, meaning it holds more data, but it also ends up being slower. Let's give an example of the speed difference between data found in the L2 cache and in main memory:

Let's say a CPU accesses L1 in 3 cycles, L2 in 15 cycles, and main memory in 70 cycles. First we get a "miss" on L1, so we lose 3 CPU cycles, then a "hit" on L2 (15 cycles), so in total we find the needed data after 18 cycles. If the same data had to be fetched with no cache at all, 70 cycles would be spent. 18 vs. 70 is less than 1/3 of the time needed, so you can see how cache can boost performance tremendously.
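
Here is that walk written out as a minimal sketch; the cycle counts are the made-up numbers from the example above, not real Sandy Bridge latencies:

L1_CYCLES, L2_CYCLES, RAM_CYCLES = 3, 15, 70

def lookup_cost(hit_level):
    # Cycles spent checking L1, then L2, then RAM, until the data is found.
    cost = L1_CYCLES                 # always check L1 first
    if hit_level == "L1":
        return cost
    cost += L2_CYCLES                # L1 missed, scan L2
    if hit_level == "L2":
        return cost
    return cost + RAM_CYCLES         # both caches missed, fetch from RAM

print(lookup_cost("L2"))  # 18 cycles: L1 miss (3) + L2 hit (15)
print(RAM_CYCLES)         # 70 cycles: the no-cache case from the example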

The way cache works is that the most-used data is kept there, increasing the chance of getting a hit. Another way to increase the chance of a hit is having a larger cache (more data). Generally speaking, larger caches improve performance even at the cost of taking more CPU cycles to access. For example, let's say our L2 cache is 512KB in size, and after doubling it to 1MB the latency increases by 20%. Which would be preferable? In such a case the 1MB cache would increase performance because it can hold twice as much data (increasing the hit rate), so it is preferable. But we have to keep in mind that cache takes up a big chunk of the CPU die, so it can't be increased indefinitely. Also, depending on the application/software used, there can be situations where a certain threshold is met, meaning that increasing the size may not help much and could even hurt performance.
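
The 512KB-vs-1MB question can be put in numbers with an average-access-cost calculation. The hit rates below are invented for illustration; only the 15-cycle latency, the 20% penalty, and the 70-cycle RAM figure come from the examples above:

RAM_CYCLES = 70

def avg_access_cycles(l2_cycles, l2_hit_rate):
    # Expected cycles per access: either we hit in L2, or we pay the
    # L2 lookup anyway and then fall through to RAM.
    miss_rate = 1.0 - l2_hit_rate
    return l2_hit_rate * l2_cycles + miss_rate * (l2_cycles + RAM_CYCLES)

print(avg_access_cycles(15, 0.80))  # 512KB L2, 15 cycles, 80% hits -> 29.0
print(avg_access_cycles(18, 0.90))  # 1MB L2, 18 cycles (20% slower), 90% hits -> 25.0

With those (made-up) hit rates the bigger-but-slower cache wins on average, which is exactly the trade-off being described; push the latency penalty high enough, or the hit-rate gain low enough, and it flips the other way.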

There are many cache design decisions with pros and cons, such as size, associativity, inclusive vs. exclusive, etc., and there is no one perfect design for all situations. That's why if we compare Intel and AMD processors, their cache subsystems may seem similar on the outside but are actually very different. Bulldozer, for example, will reportedly have a 1:1 ratio between L2 and L3 cache: 4MB L2 and 4MB L3, or 8MB and 8MB. AMD usually uses "exclusive" caches, meaning data won't be duplicated across the levels, so the L3 cache would hold data other than what the L2 cache contains. Intel uses "inclusive" caches, meaning data is duplicated in all the caches. This method works well when the next level of cache is much larger than the one before it, for example L1 64K, L2 2MB, L3 8MB.
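
A rough way to see why that size ratio matters for an inclusive design, using the sizes mentioned above (the helper is just for illustration):

def unique_mb(l2_mb, l3_mb, inclusive):
    # Distinct data the L2+L3 pair can hold. With inclusion, everything
    # in L2 is duplicated in L3, so distinct capacity is bounded by L3;
    # with exclusion the two capacities add up.
    return l3_mb if inclusive else l2_mb + l3_mb

print(unique_mb(4, 4, inclusive=False))  # exclusive at 1:1 (BD-style): 8MB distinct
print(unique_mb(4, 4, inclusive=True))   # inclusive at 1:1: only 4MB, half of L3 is copies
print(unique_mb(2, 8, inclusive=True))   # inclusive with a big L3: 8MB, only 1/4 of L3 is copies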

I know there was a lengthy discussion about how AMD is making a mistake with their L2 and L3 cache ratios for BD, and I hope it doesn't open up again here, but there are so many details that we have no idea about (and never will). So talking about size alone as a handicap is futile in my humble view. Also keep in mind that when we look at BD's cache on a per-module basis, a module has 2MB L2 and 4MB L3 (or 8MB for 4 modules), so the L3 is still larger when looked at this way.

Finally, I just want to add that even though AMD was the first to have an integrated memory controller/NB, I believe it's actually Intel that has the edge in cache/memory subsystem performance. This is one area where I see Bulldozer improving, and at least reaching parity with the current i7 CPUs. Sandy Bridge of course will be another beast, but like I said before, BD would most likely compete against it with higher-frequency SKUs.
 

ehume

Golden Member
Nov 6, 2009
1,511
73
91
Oooh. Only 1.472V to get 5100MHz! If that were a Lynnfield it would be under the 'absolute maximum' of 1.55V Intel originally set for that chip. At 32nm, it's nice he hasn't burned an electron path across the chip. I'm impressed.
 

khon

Golden Member
Jun 8, 2010
1,318
124
106
If accurate, those voltage numbers scare the shit outta me. If it really takes that much voltage to do 5GHz....

He's running it at 5.1GHz with HT on....

I wouldn't personally run it that high, but it seems like 4.5GHz might be a fairly easy overclock.

Also, those SuperPI scores are sick. Both 1M and 32M look like they easily beat the previous world records on water cooling.
 

ilkhan

Golden Member
Jul 21, 2006
1,117
1
0
Damn. In that case, 1.4xV seems much milder in comparison. Granted, those are hex cores.