A look into Intel's Sandy Bridge.

dangerman1337 · Sep 28, 2010

Here is an article looking into the microarchitecture of Sandy Bridge:

http://www.realworldtech.com/page.cfm?ArticleID=RWT091810191937.

Nemesis 1 · Sep 28, 2010

Good read. While I was there I went to the home page and read the BD article as well. Kanter does a good job.

nyker96 · Sep 28, 2010

Excellent article, very enjoyable read.

Borealis7 · Sep 28, 2010

My cousin is an engineer who worked on some of the graphics elements of SB at Intel's Haifa (Israel) branch. What I understand from him (obviously he couldn't say much) is that there is going to be an initial product release, then Intel is going to sit and wait for BD and come out with a response, possibly a chip with 12 EUs (or more?) on the GPU side.

He also said that SB has some really nice features that Intel isn't advertising because they aren't "sexy" to consumers. He whined that the part of the chip he worked on wasn't even mentioned at IDF, and that they focused instead on some trivial feature that isn't even new but sounds more appealing.

Martimus · Sep 28, 2010

While the entire article was very interesting (I think I found my new favorite tech site!), the conclusion seems pretty spot on:

In looking at the two designs, it is sensible to compare a multi-threaded Sandy Bridge core to a Bulldozer module and separately consider single threaded operation as a special case. Both support two threads although the resources are very different. At a high level, Sandy Bridge shares everything between threads, whereas Bulldozer flexibly shares the front-end and floating point units, while separating the integer cores.

A Sandy Bridge core should have substantially higher performance than a Bulldozer module across the board for single threaded or lightly threaded code. It will also have an additional advantage for floating point workloads that use AVX, (e.g. numerical analysis for finance, engineering). With AVX, each Sandy Bridge core can have up to 2X the FLOP/cycle of a Bulldozer module, although they would be at parity if the code is compiled to use AMD’s FMA4 (e.g. via OpenCL). FMA4 will be relatively rare because, while elegant, it is likely to be a historical footnote for x86, supplanted by Intel’s FMA3. For software still relying on SSE, the difference between the two should be minimal. In comparison, Bulldozer will favor heavily multi-threaded software. Each module has twice the memory pipelines and slightly more resources (e.g. retirement queue/ROB entries, memory buffers) than a single Sandy Bridge core with two threads, so Bulldozer should do very well in many highly parallel integer workloads that exercise the memory pipelines.

In many ways, the strengths of Sandy Bridge reflect the intentions of the architects. Sandy Bridge is first and foremost a client microprocessor – which requires single threaded performance. Bulldozer is firmly aimed at the server market, where sacrificing single threaded performance for aggregate throughput is an acceptable decision in some cases. Perhaps in future articles, we can examine the components of performance in greater detail (e.g. frequency, IPC, etc.), but for now, high level guidance seems appropriate – given the level of disclosure from both vendors.
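To make the "up to 2X the FLOP/cycle" and "parity with FMA4" claims in the quoted conclusion concrete, here is a rough back-of-the-envelope sketch. It assumes (this is not stated in the quote) that a Sandy Bridge core can issue one 256-bit AVX add and one 256-bit AVX multiply per cycle, and that a Bulldozer module shares two 128-bit pipes that can each do a fused multiply-add. The function name and parameters are made up for illustration, and the figures are single-precision peak numbers, not measured performance.

```python
def peak_flop_per_cycle(vector_bits, pipes, fused, element_bits=32):
    """Peak floating point operations per cycle under a simple lane-counting model.

    vector_bits  -- width of each vector pipe (assumed value, see lead-in)
    pipes        -- number of vector pipes that can issue per cycle
    fused        -- True if each pipe does a fused multiply-add (2 FLOP per lane)
    element_bits -- 32 for single precision, 64 for double precision
    """
    lanes = vector_bits // element_bits
    flop_per_lane = 2 if fused else 1
    return lanes * pipes * flop_per_lane


# Assumed: one 256-bit add pipe + one 256-bit multiply pipe per Sandy Bridge core.
sandy_bridge_avx = peak_flop_per_cycle(256, pipes=2, fused=False)

# Assumed: two 128-bit pipes per Bulldozer module, without and with FMA4.
bulldozer_no_fma = peak_flop_per_cycle(128, pipes=2, fused=False)
bulldozer_fma4 = peak_flop_per_cycle(128, pipes=2, fused=True)

print(sandy_bridge_avx, bulldozer_no_fma, bulldozer_fma4)
```

Under those assumptions the Sandy Bridge core comes out at twice the module's peak without FMA and level with it when the code uses FMA4, which matches the ratios the article describes.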

tokie · Sep 28, 2010

So basically, anything which is difficult to parallelize is likely going to perform better on Sandy Bridge?

I guess that includes game engines, seeing as how they are already reaching the maximum realistic number of threads. Although Tim Sweeney did give that talk about the future of the Unreal Engine and how it was all about going wide. Maybe they won't hit a brick wall after all.

Dark_Archonis · Sep 28, 2010

Excellent article.

Single thread performance will definitely improve on Sandy Bridge. The whole argument of "parallel is the future" is a bit silly and it's a poor excuse for not pushing technology forward to enable higher clock speeds.

We already have Lynnfield and Bloomfield CPUs that can run stably on air at around 4 GHz. The top LGA 1155 Sandy Bridge part will be clocked at 3.4 GHz and will be able to turbo to 3.8 GHz. They will be able to run overclocked on air, stably, at almost 5 GHz.

Per-clock performance will improve and clock speeds will continue to go up in the future.

ydnas7 · Sep 29, 2010

Basically it's about as new a CPU design as you will get from Intel:
"The Sandy Bridge CPU cores can truly be described as a brand new microarchitecture that is a synthesis of the P6 and some elements of the P4. Although Sandy Bridge most strongly resembles the P6 line, it is an utterly different microarchitecture. Nearly every aspect of the core has been substantially improved over the previous generation Nehalem. Many of these changes, such as the uop cache or physical register files, are drawn from aspects of or concepts behind the P4 microarchitecture. ....
...Nehalem and Westmere rely on the same mechanisms that date back to the original P6. Sandy Bridge changes the underlying out-of-order engine and uses the more efficient approach taken by the EV6 and P4. That one change alone qualifies Sandy Bridge as a different breed entirely from the P6. But, there are changes in almost every other aspect of the design."

Riek · Sep 29, 2010

Martimus said:
While the entire article was very interesting (I think I found my new favorite tech site!), the conclusion seems pretty spot on:

Just curious how it can be spot on if we have literally no information about BD? We have no idea about the front-end or data prefetching, no idea what the throughput of their integer EUs is (AGen?), no information about the FPU throughput, absolutely no information about the clock speeds, etc.

I do believe Borealis7's story holds some truth: that Intel will keep an ace hidden for SB until it is needed. I don't think it will be a 12 EU GPU, though, since BD doesn't even have a GPU onboard. (That would be more of a reaction to Llano.)

Martimus · Sep 29, 2010

imported_Riek said:
Just curious how it can be spot on if we have literally no information about BD? We have no idea about the front-end or data prefetching, no idea what the throughput of their integer EUs is (AGen?), no information about the FPU throughput, absolutely no information about the clock speeds, etc.

I do believe Borealis7's story holds some truth: that Intel will keep an ace hidden for SB until it is needed. I don't think it will be a 12 EU GPU, though, since BD doesn't even have a GPU onboard. (That would be more of a reaction to Llano.)

It is true that we don't know much about the design of the BD chip, but from what we do know, its execution resources are tailored more for aggregate throughput than for retiring a single thread as quickly as possible.

I like that the two designs have completely different goals: the Intel processor is focused on single-threaded performance, with multithreaded performance improved but not optimized, while the AMD processor is focused on multithreaded performance, with single-threaded performance improved but not optimized.

It seems like the two chips will have their own market segment, although there will be some crossover between them. It appears there will be some real options next year.

Cerb · Sep 29, 2010

tokie said:
So basically, anything which is difficult to parallelize is likely going to perform better on Sandy Bridge?

I guess that includes game engines, seeing as how they are already reaching the maximum realistic number of threads. Although Tim Sweeney did give that talk about the future of the Unreal Engine and how it was all about going wide. Maybe they won't hit a brick wall after all.

Timing, though. Game engines that will scale out to much more than 4 threads, even if started now, would not be ready to license for what, 5 years? By then, AMD and Intel will both be talking about a CPU three generations ahead of what they have now.

BD will really need many threads to shine, assuming IB doesn't smash it to little bits even in those cases, and games won't have those in the very near future. AMD will need to be able to make decent money on <$200 BD CPUs for the desktop. That is, plan for something like the current Athlon II/Phenom II situation from the get-go. We may not know the whole story on BD, but I can't imagine it will compete toe to toe, on the desktop, with near-future performance-oriented CPUs from Intel.

Dark_Archonis · Sep 30, 2010

We don't know about BD's FPU throughput? Really? I thought at least that much was quite clear by now, the FPU throughput per BD core, specifically AVX throughput.

I think it's a bit shortsighted to say Sandy Bridge has improved multi-threading but not optimized. SB's memory controller is supposed to have improved bandwidth compared to Nehalem. Also, let's not forget about the L3 ring bus. Intel has said that the more cores you add, the more the bandwidth of the ring bus increases. The ring bus L3 makes Sandy Bridge very scalable in terms of multithreaded performance. The many other advancements and improvements of Sandy Bridge should also help Hyper-Threading performance.

BD is simply a different way or perspective to approach multi-threaded performance. BD also seems mainly focused on aggregate throughput, while with Sandy Bridge, Intel seems focused on a balance of aggregate core throughput/multi-threaded performance as well as single-core performance and efficiency.

Based off of everything I have seen and read about Bulldozer, I can only conclude that it was designed primarily as a server chip, and may not show exemplary performance in the desktop market.

Martimus · Sep 30, 2010

Dark_Archonis said:
We don't know about BD's FPU throughput? Really? I thought at least that much was quite clear by now, the FPU throughput per BD core, specifically AVX throughput.

I think it's a bit shortsighted to say Sandy Bridge has improved multi-threading but not optimized. SB's memory controller is supposed to have improved bandwidth compared to Nehalem. Also, let's not forget about the L3 ring bus. Intel has said that the more cores you add, the more the bandwidth of the ring bus increases. The ring bus L3 makes Sandy Bridge very scalable in terms of multithreaded performance. The many other advancements and improvements of Sandy Bridge should also help Hyper-Threading performance.

BD is simply a different way or perspective to approach multi-threaded performance. BD also seems mainly focused on aggregate throughput, while with Sandy Bridge, Intel seems focused on a balance of aggregate core throughput/multi-threaded performance as well as single-core performance and efficiency.

Based off of everything I have seen and read about Bulldozer, I can only conclude that it was designed primarily as a server chip, and may not show exemplary performance in the desktop market.

Both designs are balanced. SB is obviously slanted toward single-threaded performance, as almost every change in the design was made with this in mind (some to the detriment of multithreaded performance, which has even been shown in some of the few benchmarks we have seen). BD seems much more focused on the total output, rather than the individual core output.

Both have improvements to both multi-threaded and single-threaded throughput, but Intel focused on single-threaded throughput, while AMD focused on multi-threaded throughput. I stand by my statement that Intel did not optimize multi-threaded throughput in order to optimize single-threaded throughput. I also stand by my statement that AMD did not optimize single-threaded throughput in order to optimize multi-threaded throughput.

Dark_Archonis · Sep 30, 2010

Martimus said:
Both designs are balanced. SB is obviously slanted toward single-threaded performance, as almost every change in the design was made with this in mind (some to the detriment of multithreaded performance, which has even been shown in some of the few benchmarks we have seen). BD seems much more focused on the total output, rather than the individual core output.

Both have improvements to both multi-threaded and single-threaded throughput, but Intel focused on single-threaded throughput, while AMD focused on multi-threaded throughput. I stand by my statement that Intel did not optimize multi-threaded throughput in order to optimize single-threaded throughput. I also stand by my statement that AMD did not optimize single-threaded throughput in order to optimize multi-threaded throughput.

Well then it seems we differ in our interpretations of optimized in this context.

IntelUser2000 · Oct 1, 2010

In response to Martimus in the other thread.

Think of a marble and a pipe.

A ring bus combined with a ring buffer allows n pieces of data to exist on the ring simultaneously (where n is the number of ring stops), while a crossbar only allows one piece of data to be in flight at a time. So on the ring, new data can reach the output every cycle, while on a crossbar the second piece has to wait until the first one arrives before it can be sent. It's like sending one marble at a time (crossbar) vs. filling the whole pipe with marbles and pushing them through (ring bus), which allows more marble output.

It sacrifices a little bit of latency for bandwidth without increasing complexity greatly.
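The marble analogy can be sketched as a toy cycle-by-cycle simulation. This follows the simplified model in the post (a pipelined link that accepts a new item every cycle vs. a link that holds only one item at a time); it is not a model of Intel's actual ring or of how real crossbars behave, and the function name and latency values are hypothetical.

```python
def delivered(latency, cycles, pipelined):
    """Count items that reach the output within `cycles` cycles.

    latency   -- cycles an item needs to travel from input to output
    pipelined -- True: a new item enters every cycle (the "pipe full of marbles")
                 False: a new item enters only when the link is empty
    """
    in_flight = []  # remaining travel time of each item currently on the link
    done = 0
    for _ in range(cycles):
        if pipelined or not in_flight:
            in_flight.append(latency)
        # Every item moves one step closer to the output.
        in_flight = [remaining - 1 for remaining in in_flight]
        done += in_flight.count(0)
        in_flight = [remaining for remaining in in_flight if remaining > 0]
    return done


# Hypothetical numbers: a 4-cycle pipelined link vs. a faster 2-cycle link
# that can only carry one item at a time.
print(delivered(latency=4, cycles=12, pipelined=True))
print(delivered(latency=2, cycles=12, pipelined=False))
```

In this toy model the pipelined link delivers more items over the same window even though each individual item takes longer to arrive, which is the latency-for-bandwidth trade described above.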

maddie · Oct 1, 2010

Did not Kanter himself say that BD appeared to be optimised for higher clocks?

Throughput is (IPC x Clockrate).

Is it not a mistake to concentrate on only one part of the equation? I find a lot of people doing that.
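A quick sketch of why both factors matter: the throughput relation above means a design with lower IPC can still match or beat another if its clock is high enough. The IPC and clock ratios below are purely hypothetical, since neither vendor has disclosed real figures, and the function names are made up for illustration.

```python
def relative_throughput(ipc_ratio, clock_ratio):
    """Throughput of design A relative to design B, given A/B ratios.

    Uses throughput = IPC x clock rate, so the ratios simply multiply.
    """
    return ipc_ratio * clock_ratio


def break_even_clock_ratio(ipc_ratio):
    """Clock ratio design A needs to match design B at the given IPC ratio."""
    return 1.0 / ipc_ratio


# Hypothetical: design A has 80% of design B's IPC.
ipc_ratio = 0.8
for clock_ratio in (1.0, 1.25, 1.5):
    result = relative_throughput(ipc_ratio, clock_ratio)
    print(f"clock x{clock_ratio:.2f} -> throughput x{result:.2f}")

print(f"break-even clock ratio: x{break_even_clock_ratio(ipc_ratio):.2f}")
```

Looking only at IPC (or only at clock speed) would get the comparison wrong whenever the other factor differs between the two designs.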

IntelUser2000 · Oct 1, 2010

My guess is Bulldozer should be able to clock 30-40% better than Barcelona.

Nemesis 1 · Oct 1, 2010

^ I would say that is what AMD is shooting for. If they get it right it could be interesting. But I think they need 50-60% higher clocks to compete.

IntelUser2000 · Oct 1, 2010

If they can aim for 2x the cores they can at least get top multi-threading performance.
