
First Sandy Bridge Numbers?

I am not talking about their business strategy (although that might delay it).

AFAIK Llano/Bobcat development is far behind Sandy Bridge.
Sandy Bridge has samples and is being moved ahead to release sooner, while Llano is nowhere in sight as of yet.

I could certainly be wrong about it though; it's not like I have concrete data... it's sort of my overall assessment of the tiny snippets of information I have managed to acquire thus far.

Intel also tested the waters with a GPU+CPU on the same package (but on a different die)
 
Considering that SB was booting Windows some 9 months before Llano taped out, I'd have to say your expectations are well-founded.
 
Considering that SB was booting Windows some 9 months before Llano taped out, I'd have to say your expectations are well-founded.

This is basically what I was talking about. 🙂
AFAIK taping out happens before a chip is ready to boot Windows. So we are talking about Intel having a more-than-9-month lead; if neither company pulls ahead, Intel has quite a head start.

Now that I think about it though, I am curious to know: what does "taped out" actually mean... and how long did it take SB to go from being "taped out" to booting Windows?
 
Tape out has roots in the fact that back in the day when an IC design was finalized the layout information was stored on magnetic tapes which were then sent to the mask manufacturer (usually an internal customer back then).

Naturally, layout designs are not transferred via magnetic tape in modern times; it is all transmitted via secure FTP.

What the term is intended to convey is the completion of a specific milestone on the timeline of an IC, running from conception to production. "Tape out" means a specific stepping has been finalized and the masks are being ordered from the mask producer.

The masks themselves are physical entities that go into the litho tool such that the specific layout features can be printed in the photoresist on the wafer. A "mask set" entails all the various layout levels (30-50 for a complex IC like a CPU) and an individual mask is sometimes referred to as a reticle (the terms have come to be used interchangeably but they did not always mean the same thing).

For leading edge process nodes the mask set itself can cost anywhere from $1m to $10m.

So generally they like to avoid "taping out" an IC until thorough validation has been completed in silico, as uncovering design/layout mistakes once you get samples back from the fab is costly both in terms of time and, quite literally, in terms of money wasted on unusable reticles.

In terms of putting numbers to the timescale spanning these milestones... it takes about a month to go from tape out to having "first silicon" in hand. This can be expedited at great cost (and human suffering! 😛) to around 2 weeks if desired.

So if the design does not have fatal flaws precluding it from functionally operating (i.e. it does math like 1+1=2 correctly, but maybe only at 500MHz and with 2V Vcc), then it is conceivable to boot an OS in as little as 2 weeks, but more typical is 1 month for aggressively timelined projects. Longer still if resources are prioritized toward expediting other milestones that are running in parallel, of course.

And longer still if there are fatal flaws in the design or if the yield turns out to be zero for fab related issues that are not related to the functional design of the IC itself.
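Taking those ballpark figures together with the "9 months before Llano taped out" claim upthread, the implied lead works out roughly as follows (a back-of-the-envelope sketch using only the rough numbers quoted in this thread, not official data):

```python
# Back-of-the-envelope, using only the ballpark numbers quoted in this thread.
sb_boot_before_llano_tapeout = 9  # months: SB booted Windows this long before Llano taped out
tapeout_to_os_boot_typical = 1    # months: "more typical is 1 month" per the post above

# Llano still has to cover tape-out -> first silicon -> OS boot, a stage
# Sandy Bridge had already finished 9 months earlier, so the implied lead is:
implied_lead_months = sb_boot_before_llano_tapeout + tapeout_to_os_boot_typical
print(implied_lead_months)  # 10
```

So even under typical (non-rushed) assumptions, Intel's lead here is on the order of 10 months.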
 
It's been a while since I have heard anything about the 1355 socket. Do you have a recent link? I tried Google and didn't come up with anything really new other than the 2011 server version, plus the high-end desktop B2, which I found listed as socket 2011 with 4 memory channels.

Based on the info in the link below, it's pretty hard to get excited about 1355 (if the info is accurate). This basically means no 6- or 8-core Sandys until Q3 2011.

http://www.bit-tech.net/hardware/cpus/2010/04/21/intel-sandy-bridge-details-of-the-next-gen/1
 
Based on the info in the link below, it's pretty hard to get excited about 1355 (if the info is accurate). This basically means no 6- or 8-core Sandys until Q3 2011.

http://www.bit-tech.net/hardware/cpus/2010/04/21/intel-sandy-bridge-details-of-the-next-gen/1

Is this the intended Bulldozer competitor from Intel? Will LGA2011 CPUs incorporate the IGP, or will this be more like the current Clarkdale vs. Gulftown difference when it comes to the IGP?

If BD is not a Fusion product (which would seem to be a requirement if it is to be socket G34/AM3 compatible), then it stands to reason that Intel would likely not hamstring the thermal budget (and cost structure) of their leading-edge enthusiast platform with an IGP as well.
 
Is this the intended Bulldozer competitor from Intel? Will LGA2011 CPUs incorporate the IGP, or will this be more like the current Clarkdale vs. Gulftown difference when it comes to the IGP?

If BD is not a Fusion product (which would seem to be a requirement if it is to be socket G34/AM3 compatible), then it stands to reason that Intel would likely not hamstring the thermal budget (and cost structure) of their leading-edge enthusiast platform with an IGP as well.

Yeah, that's the Bulldozer competitor, and it will indeed be without an IGP.

As I understand it there are 4 desktop processors coming:

AMD
Llano : 4 Phenom-derived cores + IGP
Bulldozer : 2-4 Bulldozer modules with 2 cores each (4-8 cores total) - No IGP

Intel
Sandy Bridge-DT : 2-4 cores + IGP
Sandy Bridge-B2 : 6-8 cores - No IGP

Llano and Sandy Bridge-DT should come out first (Q4 2010 or Q1 2011), with Sandy Bridge having the CPU advantage and Llano having the GPU advantage. It's anyone's guess whether Sandy Bridge-B2 or Bulldozer is faster.
 
Tape out has roots in the fact that back in the day when an IC design was finalized the layout information was stored on magnetic tapes which were then sent to the mask manufacturer (usually an internal customer back then). [...]

Thank you very much for clarifying.
 
Yeah, that's the Bulldozer competitor, and it will indeed be without an IGP.

As I understand it there are 4 desktop processors coming:

AMD
Llano : 4 Phenom-derived cores + IGP
Bulldozer : 2-4 Bulldozer modules with 2 cores each (4-8 cores total) - No IGP

Intel
Sandy Bridge-DT : 2-4 cores + IGP
Sandy Bridge-B2 : 6-8 cores - No IGP

Llano and Sandy Bridge-DT should come out first (Q4 2010 or Q1 2011), with Sandy Bridge having the CPU advantage and Llano having the GPU advantage. It's anyone's guess whether Sandy Bridge-B2 or Bulldozer is faster.

I like how that stacks up. Things could get really interesting if GloFo's 32nm HKMG+SOI implementation delivers anywhere close to the same performance gap (45nm->32nm scaling-wise) that Intel's HKMG did for them (albeit 65nm->45nm of course).

Bifurcating the consumer desktop market by socket - mainstream vs. enthusiast - is another interesting development as well. We naturally resist the changes we perceive as not producing immediate benefits to ourselves, but without change in this industry, where would we be?

I'm quite curious to see how this twist plays out.
 
If I dragged this horse any closer to the water I'd be drowning it.

I hear Bulldozer has a single FPU per module... I sure hope the LAST thing anyone with a BD sample bothers to check is how the FPU performance stacks up. Why would anyone care to check out one of the key microarchitecture differences of a new chip? Beats the shit out of me, I guess.

Most consumer applications barely use the FPU, and current FPU performance with the Phenom II X4/X6 is quite stellar, close on the heels of the Core i7. Integer performance is the opposite story: the Core i7 has a considerable lead there, which is why it's faster in most desktop applications, though that dominance isn't the same in the server market.
 
So if the design does not have fatal flaws precluding it from functionally operating (i.e. it does math like 1+1=2 correctly, but maybe only at 500MHz and with 2V Vcc) then it is conceivable to boot an OS in as little as 2 weeks but more typical is 1 month for aggressively timelined projects.

Tick projects are a little unfair to include in this benchmark, I guess. We have some pretty record-breaking "time to get the chip to boot an OS" metrics going on in here if we include those.
 
Most consumer applications barely use the FPU, and current FPU performance with the Phenom II X4/X6 is quite stellar, close on the heels of the Core i7. Integer performance is the opposite story: the Core i7 has a considerable lead there, which is why it's faster in most desktop applications, though that dominance isn't the same in the server market.

Actually, Nehalem only has a ~15% lead in single-threaded applications. What really helps Intel here is Hyper-Threading. Multi-threaded performance without Hyper-Threading is still good, but Hyper-Threading is the main helper here.

And the server advantage is considerable, much more than desktop: http://www.anandtech.com/show/2774

Thuban fares more favorably against the desktop Bloomfield when normalized for core count and clock. The pricing is whack, but they are merely pricing it to the performance differences.

Llano and Sandy Bridge-DT should come out first (Q4 2010 or Q1 2011), with Sandy Bridge having the CPU advantage and Llano having the GPU advantage. It's anyone's guess whether Sandy Bridge-B2 or Bulldozer is faster.

Although if the "Fusion" part is done well here, the CPU advantage can cancel out the GPU advantage, and to a lesser extent, vice versa. 🙂
 
As I understand it there are 4 desktop processors coming:

AMD
Llano : 4 Phenom-derived cores + IGP
Bulldozer : 2-4 Bulldozer modules with 2 cores each (4-8 cores total) - No IGP

Intel
Sandy Bridge-DT : 2-4 cores + IGP
Sandy Bridge-B2 : 6-8 cores - No IGP

Llano and Sandy Bridge-DT should come out first (Q4 2010 or Q1 2011), with Sandy Bridge having the CPU advantage and Llano having the GPU advantage. It's anyone's guess whether Sandy Bridge-B2 or Bulldozer is faster.

Excellent summary, thanks! Though I'm sad to see rumors of yet more delays for Bulldozer (wasn't it originally supposed to be released in 2009??) and that AMD will initially still only have Phenom II-derived cores to compete against Intel's new Sandy Bridge.
 
Tick projects are a little unfair to include in this benchmark, I guess. We have some pretty record-breaking "time to get the chip to boot an OS" metrics going on in here if we include those.

<2 weeks tape-out to boot? That is impressive.

For DSPs, a vastly simpler architecture and ISA, I've seen 11 days (6ML, not your usual 10-11ML), but that was a balls-out sprint pace that made life absolutely miserable for everyone involved (save the executive team of course 😉).

Probably helps, by hours at best in those conditions, that Intel still has their mask house internal. The rest of us chumps have to wait for reticles to cross the ocean, as they usually come from Japan.

The complex Sun stuff usually took no less than a full month to get samples back into their hands, including time to boot Solaris.
 
Most consumer applications barely use the FPU, and current FPU performance with the Phenom II X4/X6 is quite stellar, close on the heels of the Core i7. Integer performance is the opposite story: the Core i7 has a considerable lead there, which is why it's faster in most desktop applications, though that dominance isn't the same in the server market.

It will be interesting to see whether AMD can correctly spin the PR for their BD chip, given that it has two INT pipelines for each FPU pipeline. Either the market is going to think that they have: 1) average INT speed and lower-than-average FPU speed, or 2) better-than-average INT speed and average FPU speed.

AMD needs to spin it as 2.

Unfortunately, the distributed-computing crowd may see it as 1.
 
Probably helps, by hours at best in those conditions, that Intel still has their mask house internal. The rest of us chumps have to wait for reticles to cross the ocean, as they usually come from Japan.

I forget which tapeout it was, but there was an email asking which engineers were travelling between specific Intel sites so they could be recruited to hand-carry the wafers from the fab to the lab and get them tested ASAP.
 
AMD needs to spin it as 2.

Unfortunately, the distributed-computing crowd may see it as 1.

Because most applications running FP code are limited by bandwidth, it isn't as bad as it sounds. Applications like Linpack are a different story, but most HPC apps are not Linpack-like.

It's hard to say anything about integer yet, but the multi-threaded performance should be really good, which is what they lack against Nehalem.
 
Whether it is seen as #1 or #2 will be decided by benchmark results against its competition, not spinning.

I agree with both of you, I think. There will be marketing by AMD, for sure. Just like Intel puts some marketing spin on Hyper-Threading, and some AMD fans make fun of it.

And AMD *does* really...really...REALLY badly need to work on their marketing. lol.

However, all that matters is performance/dollar for us enthusiasts, or performance/dollar/watt for server/HPC stuff...
 
Because most applications running FP code are limited by bandwidth, it isn't as bad as it sounds. Applications like Linpack are a different story, but most HPC apps are not Linpack-like.

I'm sure you meant that as being true only within some context. Linpack is pretty much matrix manipulation, and many HPC apps use matrix operations.

Nearly 100% of the math done in computational chemistry involves matrices, for example.

I've no doubt you know this, so I'm only left to conclude you meant to add some caveat to your statement that I am not following. How is it that HPC apps are not Linpack-like?
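For what it's worth, the bandwidth-vs-FPU distinction can be made concrete with a back-of-the-envelope arithmetic-intensity calculation (a hypothetical sketch, not from the thread): a naive n x n matrix multiply performs about 2n³ flops on 3n² matrix entries, so its flops-per-byte ratio grows with n, while streaming FP code has a constant (and tiny) ratio and stays bandwidth-bound.

```python
def arithmetic_intensity_matmul(n: int, bytes_per_element: int = 8) -> float:
    """Flops per byte of matrix data for a naive n x n double-precision matmul.

    C = A @ B does ~2*n**3 flops (a multiply and an add per inner-product
    term) while touching only 3*n**2 matrix elements (A, B and C).
    """
    return (2 * n**3) / (3 * n**2 * bytes_per_element)


def arithmetic_intensity_stream(bytes_per_element: int = 8) -> float:
    """Flops per byte for a streaming kernel like y[i] += a * x[i]:
    2 flops per 3 elements moved (load x, load y, store y)."""
    return 2 / (3 * bytes_per_element)


# Matmul intensity grows linearly with n; streaming code is a constant:
print(arithmetic_intensity_matmul(1000))  # ~83.3 flops/byte -> compute-bound
print(arithmetic_intensity_stream())      # ~0.083 flops/byte -> bandwidth-bound
```

This is the sense in which Linpack-style matrix kernels exercise the FPU, while most other FP-heavy code sits waiting on memory.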
 
Ah, now I see the basis on which you were making the delineation.

Sure, the specific matrix sizes and maths performed change the relevance of Linpack as an indicator (proxy, really) of performance in specific apps, no different than benching the fps of one game and then making performance projections for another game.

All the standard benching caveats apply.

But I think you'd be hard-pressed to find corner cases where a change in the hardware that improves Linpack performance (lower-latency cache, higher single-thread IPC, etc.) does not also elevate the performance of a given HPC app that relies on matrix manipulations at its core.

That's not to say a 5% Linpack improvement means a 5% improvement in Euler3D, but the direction and approximate scaling of the performance improvement trend ought to hold.

It's no different for any benchmarking approach, SPECfp, etc. We can always get pedantic and go straight to the QED by stating that the only performance that matters is that of the end user's specific application of interest (which is always true).

But sometimes it helps to have some manner of generic benchmark; Linpack serves that purpose for cache-bound matrix math, CrystalMark serves that purpose for 4K random IO, etc.
 
Actually, Nehalem only has a ~15% lead in single-threaded applications. What really helps Intel here is Hyper-Threading. Multi-threaded performance without Hyper-Threading is still good, but Hyper-Threading is the main helper here.

And the server advantage is considerable, much more than desktop: http://www.anandtech.com/show/2774

Thuban fares more favorably against the desktop Bloomfield when normalized for core count and clock. The pricing is whack, but they are merely pricing it to the performance differences.

Although if the "Fusion" part is done well here, the CPU advantage can cancel out the GPU advantage, and to a lesser extent, vice versa. 🙂

IPC also matters. Nehalem having 4 cores with Hyper-Threading means that, in the best-case scenario, about a 50% performance gain can be obtained when multi-threading is used. That's why Nehalem is up to 50% faster than a similar Core 2 Quad in heavily multi-threaded scenarios, but you never see it being twice as fast.

But check this out: Thuban, which has six cores compared to the Phenom II X4's four, shows a 50% performance boost in multi-threaded scenarios, but why is it still slower than Nehalem? Because AMD's current architecture has lower IPC and higher cache latency. If the Phenom II X4 only managed to stay competitive with the Core 2 Quad, I wouldn't expect it to outperform Nehalem simply by adding two cores.
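The scaling argument above can be sketched numerically. This is a hypothetical illustration with made-up relative numbers (the ~15% single-thread lead mentioned upthread and a best-case 50% Hyper-Threading gain), not measured data:

```python
def throughput(cores, single_thread_perf, smt_bonus=0.0):
    """Idealized multi-threaded throughput: cores scale linearly and
    SMT (Hyper-Threading) adds a fractional bonus on top."""
    return cores * single_thread_perf * (1.0 + smt_bonus)

# Made-up relative numbers chosen to mirror the post: give Nehalem the
# ~15% single-thread lead mentioned upthread and a best-case 50% HT gain.
phenom_x4 = throughput(4, 1.00)
thuban_x6 = throughput(6, 1.00)
nehalem   = throughput(4, 1.15, smt_bonus=0.50)

print(thuban_x6 / phenom_x4)  # 1.5: six cores buy a 50% boost over four
print(nehalem / thuban_x6)    # ~1.15: HT plus higher IPC keeps Nehalem ahead
```

Under these toy assumptions, Thuban's two extra cores close much of the gap, but Nehalem's per-core advantage plus Hyper-Threading still puts it on top, which is the post's point.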
 