Rumour: AMDs 8core "Bullsharks" coming close to Gulftown.


PlasmaBomb

Lifer
Nov 19, 2004
11,636
2
81
+1

If AMD can produce a CPU that matches a 980X, then we can expect it to be priced similar to the 980X (perhaps a little cheaper). They already do this price dance with Nvidia on the Graphics side of the house (because they can). The people looking for a $200 980X killer are going to be very disappointed.

Exactly. If an 8 Core BD can beat Gulftown expect it around 700 dollars

You are talking as if Sandy Bridge didn't exist...

Assuming that Bulldozer performance is around 980X and 2600K then to compete on value for money the price is going to have to be nearer the 2600k level than the 980X level...

Circa $200-300.
 

exar333

Diamond Member
Feb 7, 2004
8,518
8
91
Price Performance is more important than IPC.

I agree with Rubycon here. IPC is very important because many programs these days are not highly threaded. It may be that way someday, but I also want good single and dual-threaded performance. To get this, I need high IPC and high clockspeed. I wonder if BD will have either (initially at least).

Definitely need some actual benchmarks and proposed price points to make a final decision. It does make me a little nervous that no benches or previews are around... if this is launching this year, I would hope it was a little more visible.
 

exar333

Diamond Member
Feb 7, 2004
8,518
8
91
Close in what? IPC? Clockspeed? Both? And is this serial or embarrassingly multi-threaded apps?
Am I the only one that reads "BullShark" and thinks "BullSh!t"?

Probably the latter. AMD has always been strong in heavily multi-threaded applications. BD seems to stay true to that tradition, but only time will tell for sure. More real details would be awesome.
 

HW2050Plus

Member
Jan 12, 2011
168
0
0
It is quite easy to get a general estimation of AMD Bulldozer performance, as there are such clear hints of the already presented architecture. Of course the exact performance has to be measured when the parts are out.

Here is how BD 1st. gen. performance will be:

In Integer it will be faster than any current Core i7 including Sandybridge, maybe by some 10-50 % depending on application and when comparing equal number of Bulldozer Modules with Sandybridge Cores which is the right comparison regarding die size.

The reason why Bulldozer will be faster is totally simple, because Bulldozer just has the double amount of cores compared to Intel. With that it is just easy to surpass Sandy Bridge. They have HT but this is not enough. You can expect a performance improvement over Stars by around 70-100%(*), Bulldozer will be a great leap in integer performance.

On the other hand IPC will be lower than of any Intel offering if you compare a AMD core (half of the module) with a Intel core. So AMD Bulldozer will be faster but not too much, because of the lower IPC.

Sandy Bridge performance improvements over previous generation come mostly from higher overclocking in turbo mode, so Sandy Bridge was only a very little improvement regarding IPC. There is a critical barrier in IPC improvements and AMD has overcome that with Bulldozer.

Surely Bulldozer will have a great start until Intel implements the Bulldozer Architecture type in their future products which will be 1-1.5 years later. But Intel will have more difficulties with that, because they have to redesign their decoder front end and mem access backend as well. The first CPUs using this architecture was IBM's Power 7 processors. If Intel will make this architecture switch with their next generation they will be in the lead again.

So I guess we will see some repetition of 2005/2006 when AMD was first with significant architectural improvements. AMD has to go this way as they cannot compete in IPC.

Regarding FP, results could be more interesting, but I guess AMD Bulldozer will get ~ on par with Sandy Bridge.

That is basically all which can be said without benchmarks, especially as it is unclear (*) where in the 4-5 GHz range the first AMD Bulldozer units will actually run.

The Fudzilla rumor is just presenting that: 4 module AMD Bulldozer ~ on par with 6 core Gulftown means that AMD Bulldozer is ~50% faster than Gulftown (4 Modules vs. 6 Cores) at equal core count (equal die size).
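
The arithmetic behind that ~50% figure can be sketched as follows. This is only an illustration of the ratio, assuming the rumour's "on par" means exactly equal total performance; the function name `per_unit_ratio` is made up for this example and nothing here is a measurement.

```python
def per_unit_ratio(units_a: int, units_b: int, perf_ratio: float = 1.0) -> float:
    """Throughput of one unit of chip A relative to one unit of chip B.

    perf_ratio is chip A's total performance divided by chip B's
    (1.0 means the two chips are on par overall).
    """
    return perf_ratio * units_b / units_a


# Rumoured case: 4 Bulldozer modules roughly match 6 Gulftown cores.
module_vs_core = per_unit_ratio(units_a=4, units_b=6)
print(f"One module vs one Gulftown core: {module_vs_core:.2f}x")  # 1.50x

# Counting each module as two cores gives the per-core picture instead.
core_vs_core = per_unit_ratio(units_a=8, units_b=6)
print(f"One Bulldozer core vs one Gulftown core: {core_vs_core:.2f}x")  # 0.75x
```

The same rumour therefore reads as "50% faster per module" or "25% slower per core", depending on which unit is taken as the fair comparison.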
 

HW2050Plus

Member
Jan 12, 2011
168
0
0
I agree with Rubycon here. IPC is very important because many programs these days are not highly threaded. It may be that way someday, but I also want good single and dual-threaded performance. To get this, I need high IPC and high clockspeed. I wonder if BD will have either (initially at least).
Bulldozer will have high clock but low IPC. If you multiply clock and IPC Bulldozer will still be lower than e.g. Intel Sandy Bridge. As said before they do the trick with doubling the cores.
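
To make the "multiply clock and IPC" point concrete, here is a minimal sketch. The clock and IPC values are placeholders chosen for illustration, not figures for any real Bulldozer or Sandy Bridge part, and the function names are hypothetical.

```python
def single_thread_score(clock_ghz: float, relative_ipc: float) -> float:
    """Rough single-thread performance: clock multiplied by relative IPC."""
    return clock_ghz * relative_ipc


def break_even_clock(ref_clock_ghz: float, ref_ipc: float, ipc: float) -> float:
    """Clock a lower-IPC core needs to match a reference core's score."""
    return ref_clock_ghz * ref_ipc / ipc


# Placeholder numbers: a reference core at 3.4 GHz with IPC normalised to 1.0,
# and a hypothetical core with 80% of that IPC.
reference = single_thread_score(clock_ghz=3.4, relative_ipc=1.0)
needed = break_even_clock(ref_clock_ghz=3.4, ref_ipc=1.0, ipc=0.8)

print(f"Reference score: {reference:.2f}")
print(f"Clock needed at 0.8x IPC to break even: {needed:.2f} GHz")
```

Under these assumed numbers the lower-IPC core has to clock 25% higher just to draw level in a single thread, which is why the extra cores are what carry the overall result.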

However, the question is whether you really need high performance in applications that do not use multithreading. I say no, even today, unless you have some very old software licences which you do not want to upgrade to newer versions.

There are lots of applications out which are CPU power hungry like hell, but all of them are heavily multi threaded (renderer, packers, en-/decoder, database, web server, web browser, chess, anything@home, games which need, etc.).

Your windows calculator or ASCII editor however will e.g. run slower, but does it matter?
 

hamunaptra

Senior member
May 24, 2005
929
0
71
Yeah so, from what Ive read so far, it pretty much comes down to this:
We can expect some pretty impressive multithreaded performance from this CPU.
We can expect ok performance in single threaded situations.

It's an architecture designed to run at pretty high clock speeds: there is much more latency in the L2 cache, and many stages have been added to the pipelines, all to attain higher clock speeds. Kinda like NetBurst.

But the hopeful outcome this time, is through the space savings mechanisms they have implemented (mostly shared resources per modules, while still maintaining fully independent ALU pipes) they have attained the ability to reach those higher clocks while keeping a very low TDP.

So, hopefully their multithreading will be similar or greater than SB , but I am pretty sure their single threaded performance will fall short of the ixxxx processors clock for clock.

BUT, if they come out of the gate with speeds above 4ghz on the midend, maybe around 3.6-3.8ghz on lowend and up to 4.5ghz on the highend. Then we could have an impressively performing processor, surpassing SB at stock speeds in both single threaded situations and definitely multithreaded situations. All maintaining the TDP of current processors.

One of the big determining factors of FPU performance is still up in the air. That is, its still unknown (afaik) whether or not the 2 FPU pipes can be ganged together to perform both halves of a 256bit AVX instruction down each pipe simultaneously , rather than splitting it into 2 128bit macro ops and having the schedulers deal with how they should go through the pipes.

I think one of the major things AMD has going for it is its change from a vastly dedicated scheduling and other dedicated things per pipeline of the cores and changed over to the Unified schedulers that intel has had in their cores since Core2.
Hopefully the Unified schedulers and so on are big enough to accommodate for AMD's INT and FPU core throughput capabilities.

There are still other things up in the air, which is obvious... but in the end it will be a pretty dang interesting architecture and hopefully, a huge potential for OCers! If this thing was designed for high clocks, I would love to see those extreme OC people hit 10ghz+ on these chips!
 

PreferLinux

Senior member
Dec 29, 2010
420
0
0
I agree with Rubycon here. IPC is very important because many programs these days are not highly threaded. It may be that way someday, but I also want good single and dual-threaded performance. To get this, I need high IPC and high clockspeed. I wonder if BD will have either (initially at least).

Definitely need some actual benchmarks and proposed price points to make a final decision. It does make me a little nervous that no benches or previews are around... if this is launching this year, I would hope it was a little more visible.
Well, actually price to single-threaded performance is more important than IPC.
 

OCGuy

Lifer
Jul 12, 2000
27,224
37
91
It is quite easy to get a general estimation of AMD Bulldozer performance, as there are such clear hints of the already presented architecture. Of course the exact performance has to be measured when the parts are out.

Here is how BD 1st. gen. performance will be:

In Integer it will be faster than any current Core i7 including Sandybridge, maybe by some 10-50 % depending on application and when comparing equal number of Bulldozer Modules with Sandybridge Cores which is the right comparison regarding die size.

The reason why Bulldozer will be faster is totally simple, because Bulldozer just has the double amount of cores compared to Intel. With that it is just easy to surpass Sandy Bridge. They have HT but this is not enough. You can expect a performance improvement over Stars by around 70-100%(*), Bulldozer will be a great leap in integer performance.

On the other hand IPC will be lower than of any Intel offering if you compare a AMD core (half of the module) with a Intel core. So AMD Bulldozer will be faster but not too much, because of the lower IPC.

Sandy Bridge performance improvements over previous generation come mostly from higher overclocking in turbo mode, so Sandy Bridge was only a very little improvement regarding IPC. There is a critical barrier in IPC improvements and AMD has overcome that with Bulldozer.

Surely Bulldozer will have a great start until Intel implements the Bulldozer Architecture type in their future products which will be 1-1.5 years later. But Intel will have more difficulties with that, because they have to redesign their decoder front end and mem access backend as well. The first CPUs using this architecture was IBM's Power 7 processors. If Intel will make this architecture switch with their next generation they will be in the lead again.

So I guess we will see some repetition of 2005/2006 when AMD was first with significant architectural improvements. AMD has to go this way as they cannot compete in IPC.

Regarding FP, results could be more interesting, but I guess AMD Bulldozer will get ~ on par with Sandy Bridge.

That is basically all which can be said without benchmarks, especially as it is unclear (*) where in the 4-5 GHz range the first AMD Bulldozer units will actually run.

The Fudzilla rumor is just presenting that: 4 module AMD Bulldozer ~ on par with 6 core Gulftown means that AMD Bulldozer is ~50% faster than Gulftown (4 Modules vs. 6 Cores) at equal core count (equal die size).


Well, that is surely a rosy way to look at things...
 

IntelUser2000

Elite Member
Oct 14, 2003
8,686
3,787
136
Sandy Bridge performance improvements over previous generation come mostly from higher overclocking in turbo mode, so Sandy Bridge was only a very little improvement regarding IPC. There is a critical barrier in IPC improvements and AMD has overcome that with Bulldozer.

First, Turbo brings LESS for Sandy Bridge than Lynnfield:
http://www.computerbase.de/artikel/...-sandy-bridge/47/#abschnitt_skalierungsrating

Core i7 2600: 2%
Core i7 870: 8%

Second, you seem to be confusing multi-thread performance with single thread IPC.
 

HW2050Plus

Member
Jan 12, 2011
168
0
0
That is, its still unknown (afaik) whether or not the 2 FPU pipes can be ganged together to perform both halves of a 256bit AVX instruction down each pipe simultaneously , rather than splitting it into 2 128bit macro ops and having the schedulers deal with how they should go through the pipes.
That has now been clarified in an additional article by Mr. Fruehe from AMD: AVX operations can be executed in parallel using both 128-bit FMAC units (or, if the other FMAC is busy, e.g. used by the other core, they can be executed in serial).
 

HW2050Plus

Member
Jan 12, 2011
168
0
0
First, Turbo brings LESS for Sandy Bridge than Lynnfield:
http://www.computerbase.de/artikel/...-sandy-bridge/47/#abschnitt_skalierungsrating

Core i7 2600: 2%
Core i7 870: 8%

Second, you seem to be confusing multi-thread performance with single thread IPC.
First this is just not the case and very obvious. Think about e.g. Core i7 2600, it runs at 11% higher clock resulting in 2% higher performance? Computerbase is wrong here, though you do not know exactly what they mean. My statement was about IPC and therefore single thread analysis.

Yes, I may have caused some confusion, because my post is about overall performance but that specific paragraph is about IPC (I pointed that out, but maybe not clearly enough).

So Sandy Bridge brought performance improvements. They came from better scaling, better HT, higher overclocking in Turbo mode and higher IPC. However, the IPC gain is very, very small, because Intel had already reached a very high level. I made this statement to point out why AMD did not focus on IPC gains (which are, besides, very costly in R&D power) but used other techniques to achieve a high performance level. Intel, too, mainly used techniques other than improving IPC for Sandy Bridge (though they have high R&D power and already very high IPC). This was already true for Nehalem.

So mainly in the past 3-4 years you have stagnation regarding IPC with Intel on a very high level and AMD on a high level, though they still squeeze out some 1-3% with each generation.

Performance improvements in the last 3-4 years come from more cores, higher clocks or better scaling (meaning that cores are less affected when other cores are also fully busy, or when threads switch from one core to another).

Therefore situation for AMD Bulldozer was like this:
a) Push IPC from high to very high as Intel -> very costly regarding R&D power
or
b) Improve core count/scaling/frequency

For Bulldozer they followed route b: double core count, higher frequency, lower IPC (compared with Star core, compensated by higher frequency), better scaling

Basically you can assume, that lower IPC is compensated by higher frequency, so then the double core count by using this "module technique" remains giving an ~80% performance boost over current Phenom CPUs. And that is enough to surpass Sandy Bridge.
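
As a rough sketch of that estimate: treat multithreaded throughput as the product of the IPC ratio, the clock ratio, the core-count ratio and a factor for what a shared module loses compared with two fully separate cores. All four inputs below are assumptions picked to reproduce the ~80% figure, not measured values, and `relative_throughput` is a name invented for this example.

```python
def relative_throughput(ipc_ratio: float, clock_ratio: float,
                        core_ratio: float, module_efficiency: float) -> float:
    """Multithreaded throughput of a new chip relative to an old one.

    module_efficiency is the fraction of two independent cores that a
    two-core module delivers (1.0 would mean no loss from sharing).
    """
    return ipc_ratio * clock_ratio * core_ratio * module_efficiency


# Assumed inputs: 20% lower IPC cancelled by a 25% higher clock,
# twice the core count, and 90% efficiency for shared modules.
boost = relative_throughput(ipc_ratio=0.8, clock_ratio=1.25,
                            core_ratio=2.0, module_efficiency=0.9)

print(f"Estimated throughput vs. the old chip: {boost:.2f}x")
print(f"Estimated gain: {boost - 1:.0%}")
```

Note that this only holds for workloads that keep every core busy; for a single thread the core-count term drops out and only the IPC and clock ratios remain.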

As you can see, there is no way they could have achieved such a tremendous performance improvement by improving IPC.

Or to explain you that in another way:
Sandy Bridge Core superscalarity = 3
Bulldozer Core superscalarity = 2
Bulldozer Module superscalarity = 4 (2*2)*

*It is even better than 4, because it is two independent of 2, therefore any mispredictions/pipeline stalls always affect only 2 pipelines in Bulldozer (all 3 in Sandy Bridge, though for some of these stall types HT can be used)
 

JFAMD

Senior member
May 16, 2009
565
0
0
One of the big determining factors of FPU performance is still up in the air. That is, its still unknown (afaik) whether or not the 2 FPU pipes can be ganged together to perform both halves of a 256bit AVX instruction down each pipe simultaneously , rather than splitting it into 2 128bit macro ops and having the schedulers deal with how they should go through the pipes.

I think one of the major things AMD has going for it is its change from a vastly dedicated scheduling and other dedicated things per pipeline of the cores and changed over to the Unified schedulers that intel has had in their cores since Core2.
Hopefully the Unified schedulers and so on are big enough to accommodate for AMD's INT and FPU core throughput capabilities.

We have already said that the 2 dedicated FMACs can be merged into a single 256-bit FMAC to handle AVX operations.

I don't think enough attention has been given to the dedicated schedulers in the architectures. We have 3 schedulers per module, two integer and one FP. We have far more entries to handle the integer and FP executions. Much has been made of our shared front end, even those that have tried (unsuccessfully) to argue that it would be a bottleneck, yet they neglect the shared scheduler that intel employs in their architecture.
 

HW2050Plus

Member
Jan 12, 2011
168
0
0
We have already said that the 2 dedicated FMACs can be merged into a single 256-bit FMAC to handle AVX operations.

I don't think enough attention has been given to the dedicated schedulers in the architectures. We have 3 schedulers per module, two integer and one FP. We have far more entries to handle the integer and FP executions. Much has been made of our shared front end, even those that have tried (unsuccessfully) to argue that it would be a bottleneck, yet they neglect the shared scheduler that intel employs in their architecture.
Indeed. One problem AMD had with Phenom was that they could not make use of their high-performance front end (high-bandwidth decoder) and back end, because everything got stuck in the execution units or the usual stalls.

Separating the execution power across two integer schedulers to overcome these natural limitations improved the stall situation extremely, and adding an additional execution unit/AGU now makes use of the high-performance front/back end, which has in addition been improved again over Stars.

That is why a 2 scheduler * 2 IEU is much more powerful than 1 scheduler * 4 execution units would have been (and more than a 1 scheduler * 3 execution units used by Intel, despite that Intel does not have a dedicated FP scheduler).

That and the higher clock are the reasons for nearly double performance of Bulldozer vs. Stars.
 

IntelUser2000

Elite Member
Oct 14, 2003
8,686
3,787
136
First this is just not the case and very obvious. Think about e.g. Core i7 2600, it runs at 11% higher clock resulting in 2% higher performance? Computerbase is wrong here, though you do not know exactly what they mean. My statement was about IPC and therefore single thread analysis.

I'd like to say thanks for the good analysis.

What I meant for 2 and 8% is for Turbo on and off, which they have there. They also have clock-equalized comparison which shows 15% gain.
 

HW2050Plus

Member
Jan 12, 2011
168
0
0
I'd like to say thanks for the good analysis.

What I meant for 2 and 8% is for Turbo on and off, which they have there. They also have clock-equalized comparison which shows 15% gain.
Yes I saw that, but I do not know what they tested. If they tested with an application using all cores and all threads then that means that turbo mode was on much lower frequency. If they tested otherwise their test is simply wrong.

Since I was talking about IPC, I was not interested in a test that fully uses all cores, since the gains from scaling and Hyper-Threading do not allow any conclusions about IPC.

I analyzed the benchmark results from the AnandTech Sandy Bridge review and arrived at an IPC improvement of approximately 1-3% from Nehalem to Sandy Bridge. I took several suitable benchmarks, stripped out the effects of HT, scaling (cache architecture) and the turbo mode frequency gain, and normalized to a common frequency by simple division. This simple division has some error margin, since memory speed stays the same at higher frequency, but it is good enough for an estimate.
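
The normalization by simple division can be sketched like this. The benchmark scores and clocks below are made-up placeholders, not numbers from the AnandTech review, and the helper names are hypothetical. It assumes single-threaded runs with HT and turbo already taken out, so that the clock given is the clock the test actually ran at.

```python
from statistics import geometric_mean


def per_clock(score: float, clock_ghz: float) -> float:
    """Benchmark score normalised to 1 GHz by simple division."""
    return score / clock_ghz


def ipc_gain(old_runs, new_runs) -> float:
    """Average per-clock gain of new_runs over old_runs.

    Each argument is a list of (score, clock_ghz) pairs, one pair per
    benchmark, in the same benchmark order for both chips.
    """
    ratios = [per_clock(*new) / per_clock(*old)
              for old, new in zip(old_runs, new_runs)]
    return geometric_mean(ratios) - 1.0


# Placeholder data: two benchmarks, higher score is better.
old_chip = [(100.0, 3.0), (50.0, 3.0)]
new_chip = [(121.0, 3.3), (60.5, 3.3)]

print(f"Raw gain on benchmark 1: {new_chip[0][0] / old_chip[0][0] - 1:.1%}")
print(f"Per-clock (IPC) gain: {ipc_gain(old_chip, new_chip):.1%}")
```

With these placeholder values the raw score rises 21%, but once the 10% clock difference is divided out only a 10% per-clock gain remains. As said above, the division ignores that memory speed does not scale with core clock, so it is an estimate only.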

Therefore Sandy Bridge is faster, but only a minor part of that comes from IPC improvements. It was already like this with Nehalem, where the improvements came from integrated interfaces, the memory controller and the cache architecture (though they also lost something there regarding L2 size, by the way).

That makes sense, since the Conroe core was a big leap with a lot of improvements (superscalarity increased by one, micro/macro-op fusion, faster division, speculative read/write and so on), while Nehalem and Sandy Bridge only added minor improvements. There were additions in SSE and AVX, of course. You did not hear about a lot of such core improvements for Nehalem and Sandy Bridge. There are some, but they are few in number and small in effect.

Same applies to AMD which used the same architecture from K7 to K10.5. But now with Bulldozer they have a complete redesign as it was done by Intel with Core/Conroe, though you can say, that integration of memory controller and inter chip interface with K8 and Nehalem was also a major design change.

With Sandy Bridge very little was done to the cores themselves (okay, AVX was added, of course). Likewise, AMD changed little in the Athlon to Phenom transition: mainly the additional L3 cache and core/uncore decoupling, with only small core changes. There were many changes to the front end and back end, but these heavy improvements (a doubling of capacity!) showed little effect. As explained in one of my previous posts, AMD is now using that capacity to make the switch to modules.

Anyway, these Computerbase results are strange or at least unclear. I rely more on the AnandTech benches, though there was no separate test like this in the review. I mean, the clock gain from Turbo mode is higher than Nehalem's, so why should it bring less gain? That just makes no sense; it would mean Sandy Bridge has a severe clock scaling issue, but you can trust me that it has not.
 

IntelUser2000

Elite Member
Oct 14, 2003
8,686
3,787
136
Yes I saw that, but I do not know what they tested. If they tested with an application using all cores and all threads then that means that turbo mode was on much lower frequency. If they tested otherwise their test is simply wrong.

I'm not sure how you pulled 2-3% off that review. I was talking about the first results in the page where they are all clocked at 2.8GHz and have Turbo off. They made it even convenient to compare by showing it in percentages when you hover over the names with your mouse.

That shows 15% gain with 2500 and 17% with 2600. Certainly not low.

Another one: http://www.hardware.fr/articles/815-15/intel-core-i7-core-i5-lga-1155-sandy-bridge.html

Shows 11% gain
 

IntelUser2000

Elite Member
Oct 14, 2003
8,686
3,787
136
Separating the execution power across two integer schedulers to overcome these natural limitations improved the stall situation extremely, and adding an additional execution unit/AGU now makes use of the high-performance front/back end, which has in addition been improved again over Stars.

Aside from the fact that you are somehow trying to make the IPC gain as being only 2-3%, you seem to be knowledgeable.

You can put the discussion in two ways:
1. Favor of multi-core: Programs are becoming more and more threaded and those are usually the demanding ones that need processing power
2. Favor of single-thread performance: While programs are becoming more threaded, the adoption is extremely slow. Single thread IPC allows everyone to gain performance. Plus, if multi-threading performance was all that mattered, Larrabee-style core or even GPU is better off.
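
One way to put numbers on both positions is Amdahl's law, which nobody in the thread names but which fits the argument: the speedup from extra cores is capped by the share of the program that stays serial. A minimal sketch, with the parallel fractions chosen purely for illustration:

```python
def amdahl_speedup(parallel_fraction: float, cores: int) -> float:
    """Speedup over one core when only part of the work can run in parallel."""
    if not 0.0 <= parallel_fraction <= 1.0:
        raise ValueError("parallel_fraction must be between 0 and 1")
    if cores < 1:
        raise ValueError("cores must be at least 1")
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / cores)


# Illustrative fractions: a lightly threaded, a well threaded and a
# near-perfectly threaded program, on 4 and 8 cores.
for fraction in (0.5, 0.9, 0.99):
    for cores in (4, 8):
        speedup = amdahl_speedup(fraction, cores)
        print(f"{fraction:.0%} parallel on {cores} cores: {speedup:.2f}x")
```

A program that is only half parallel gains little from going from 4 to 8 cores, which supports point 2; a program that is almost entirely parallel scales close to the core count, which supports point 1.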