- Sep 16, 2010
Here is an article looking into the microarchitecture of Sandy Bridge:
http://www.realworldtech.com/page.cfm?ArticleID=RWT091810191937.
In looking at the two designs, it is sensible to compare a multi-threaded Sandy Bridge core to a Bulldozer module and separately consider single threaded operation as a special case. Both support two threads although the resources are very different. At a high level, Sandy Bridge shares everything between threads, whereas Bulldozer flexibly shares the front-end and floating point units, while separating the integer cores.
A Sandy Bridge core should have substantially higher performance than a Bulldozer module across the board for single threaded or lightly threaded code. It will also have an additional advantage for floating point workloads that use AVX (e.g. numerical analysis for finance, engineering). With AVX, each Sandy Bridge core can have up to 2X the FLOP/cycle of a Bulldozer module, although they would be at parity if the code is compiled to use AMD’s FMA4 (e.g. via OpenCL). FMA4 will be relatively rare because, while elegant, it is likely to be a historical footnote for x86, supplanted by Intel’s FMA3. For software still relying on SSE, the difference between the two should be minimal. In comparison, Bulldozer will favor heavily multi-threaded software. Each module has twice the memory pipelines and slightly more resources (e.g. retirement queue/ROB entries, memory buffers) than a single Sandy Bridge core with two threads, so Bulldozer should do very well in many highly parallel integer workloads that exercise the memory pipelines.
In many ways, the strengths of Sandy Bridge reflect the intentions of the architects. Sandy Bridge is first and foremost a client microprocessor – which requires single threaded performance. Bulldozer is firmly aimed at the server market, where sacrificing single threaded performance for aggregate throughput is an acceptable decision in some cases. Perhaps in future articles, we can examine the components of performance in greater detail (e.g. frequency, IPC, etc.), but for now, high level guidance seems appropriate – given the level of disclosure from both vendors.
While the entire article was very interesting (I think I found my new favorite tech site!), the conclusion seems pretty spot on:
Just curious how it can be spot on if we have literally no information about BD? We have no idea about the front-end or data prefetching, no idea what the throughput of their integer EUs is (AGen?), no information about the FPU throughput, absolutely no information about the clock speeds, etc.
I do believe Borealis7's story holds some truth: that Intel will keep an ace hidden for SB until it is needed. I don't think it will be the 12 EU GPU, since BD doesn't even have a GPU onboard (that would be more of a reaction to Llano).
Timing, though. Game engines that scale out to many more than 4 threads, even if started now, would not be ready to license for what, 5 years? By then, AMD and Intel will both be talking about CPUs three generations ahead of what they have now.
So basically, anything which is difficult to parallelize is likely going to perform better on Sandy Bridge?
I guess that includes game engines, seeing as how they are already reaching the maximum realistic number of threads. Although Tim Sweeney did give that talk about the future of the Unreal Engine and how it was all about going wide. Maybe they won't hit a brick wall after all.
We don't know about BD's FPU throughput? Really? I thought at least that much was quite clear by now, the FPU throughput per BD core, specifically AVX throughput.
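For anyone who wants the numbers behind the article's "2X with AVX, parity with FMA4" line, here is a rough back-of-the-envelope sketch. The unit widths are my assumptions based on what has been publicly described (Sandy Bridge: one multiply port plus one add port, 256 bits wide for AVX; Bulldozer: two 128-bit FMAC pipes per module, ganged together for a 256-bit AVX op). These are peak single-precision figures per cycle, not measured performance, and the names in the code are just for illustration.

```python
# Peak single-precision FLOP/cycle, paper math only.
# ASSUMPTIONS (from public disclosures, not measurements):
#   Sandy Bridge core: 1 multiply port + 1 add port, 128-bit (SSE) or 256-bit (AVX).
#   Bulldozer module:  2 x 128-bit FMAC pipes shared by both cores;
#                      one 256-bit AVX op occupies both pipes in a cycle.

SP_BITS = 32  # bits per single-precision float


def peak_flops_per_cycle(vector_bits, ops_per_cycle, flops_per_lane):
    """Lanes per vector * vector ops issued per cycle * FLOPs per lane per op."""
    lanes = vector_bits // SP_BITS
    return lanes * ops_per_cycle * flops_per_lane


PEAK = {
    # (design, instruction set): peak SP FLOP/cycle
    ("Sandy Bridge core", "SSE"): peak_flops_per_cycle(128, 2, 1),  # mul + add
    ("Sandy Bridge core", "AVX"): peak_flops_per_cycle(256, 2, 1),  # mul + add
    ("Bulldozer module", "SSE"): peak_flops_per_cycle(128, 2, 1),   # two pipes
    ("Bulldozer module", "AVX"): peak_flops_per_cycle(256, 1, 1),   # pipes ganged
    ("Bulldozer module", "FMA4"): peak_flops_per_cycle(128, 2, 2),  # fused mul-add
}

for (design, isa), flops in sorted(PEAK.items()):
    print(f"{design:18s} {isa:5s} {flops:2d} SP FLOP/cycle")
```

Under those assumptions the AVX ratio comes out to 2:1 in favor of a Sandy Bridge core, while SSE and FMA4 code land at parity, which is the same picture the article paints. Clock speed and how often real code hits peak are separate questions.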
I think it's a bit shortsighted to say Sandy Bridge has improved multi-threading but not optimized it. SB's memory controller is supposed to have improved bandwidth compared to Nehalem. Also, let's not forget about the L3 ring bus. Intel has said that ring bus bandwidth increases as you add more cores. The ring bus L3 makes Sandy Bridge very scalable in terms of multithreaded performance. The many other advancements and improvements in Sandy Bridge should also help Hyper-Threading performance.
BD is simply a different way or perspective to approach multi-threaded performance. BD also seems mainly focused on aggregate throughput, while with Sandy Bridge, Intel seems focused on a balance of aggregate core throughput/multi-threaded performance as well as single-core performance and efficiency.
Based off of everything I have seen and read about Bulldozer, I can only conclude that it was designed primarily as a server chip, and may not show exemplary performance in the desktop market.
Both designs are balanced. SB is obviously slanted toward single threaded performance, as almost every change in the design was made with this in mind (some to the detriment of multithreaded performance, which has even been shown in some of the few benchmarks we have seen). BD seems much more focused on the total output, rather than the individual core output.
Both have improvements to both multi-threaded throughput and single-threaded throughput, but Intel focused on single-threaded throughput, while AMD focused on multi-threaded throughput. I stand by my statement that Intel did not optimize multi-threaded throughput in order to optimize single-threaded throughput. I also stand by my statement that AMD did not optimize single-threaded throughput in order to optimize multi-threaded throughput.
