AMD sheds light on Bulldozer, Bobcat, desktop, laptop plans

Soleron · Nov 15, 2009

Fox5 said:
I don't exactly understand the difference.

Is bulldozer basically creating a mixed mode dual core/single core? As in, if can function as two simple dual cores (two wide int execution), or one 4 wide int execution single core, per core?

No, it has 2 lots of 4 int execution units. Think of it like two cores, but sharing the fetch/decode stage, an L2 cache, and a single FPU between them.

"1 Bulldozer core = (2 Int units + 1 FP unit)". Am I correct?

Two bulldozer cores = One module = 2 Int units + 1 FP unit. So their 16-core chip features 8 modules, 16 cores, 16 threads, 16 Int units (each 4 wide), 8 L2 caches, 4 memory links and one L3 cache.

HT takes less than the 5% of the total transistor budget because it only duplicates register states and some schedulers, so it's approach is to maximize the execution engine usage. AMD approach takes more die space and adds you more execution resources.

AMD's approach uses the same die space, in percentage terms.

piesquared · Nov 15, 2009

Yeah, i'm sure there will be a ton of if's and but's until the architecture details are revealed. There is obviously much info that won't be released due to competitive reasons.

GaiaHunter · Nov 15, 2009

Sincerely what I want to know is if Zambezi can do 8 or 16 threads.

We get this question answered and we can have a better perception of AMD approach and tactics.

If Zambezi can do 16 threads, it will have 8 BD modules. If it is only 8, only 4 BD modules will be present.

16 threads and 8 BD modules would mean absurdly powerful computation power and a good chance of being competitive with Sandy Bridge, without knowing any more details on speed clock or IPC improvement.

If it is "only" 4 BD modules/ 8 threads, it depends more on architecture and the aforementioned variables.

BD does seem quite interesting, mind you, and should be a nice upgrade to Phenom II. But I guess many here, are hoping for AMD to get another Athlon 64/X2 to improve its situation, and no out of love for AMD, but for the future of competition.

Even IDC, I believe, that is all about "R&D budget wins", does seem to have some tiny hope on a stroke of genius.

In the end BD with 4 modules/8 threads might be a brilliant CPU, but 8 modules/16 threads is more impressive (and sandy bridge is rumored to be up to 8 cores/16 threads). That is all.

ilkhan · Nov 15, 2009

Soleron said:
No, it has 2 lots of 4 int execution units. Think of it like two cores, but sharing the fetch/decode stage, an L2 cache, and a single FPU between them.

Two bulldozer cores = One module = 2 Int units + 1 FP unit. So their 16-core chip features 8 modules, 16 cores, 16 threads, 16 Int units (each 4 wide), 8 L2 caches, 4 memory links and one L3 cache.

As I've been saying, I think that explanation is wrong. But until we get more details we can't KNOW anything.

Soleron · Nov 16, 2009

ilkhan said:
As I've been saying, I think that explanation is wrong. But until we get more details we can't KNOW anything.

Since I was paraphrasing JF and the Analyst Day presentations, and JF actually made some of those presentations, this is the most reliable it's going to get short of release-day articles.

@Gaia

If you go by JF, then 4 modules/8 cores/8 threads. I agree, an 8C/16T Sandy Bridge would be faster. But desktop apps can't take advantage of 8 threads, never mind 16, so it will still come down to performance on one, two, three or four threads, and those CPUs should be pretty close as they will be running a single thread per module or core.

On the server, AMD will probably have more native cores than Intel even in 2011. They will be selling 12 cores cheaper than Intel's 8C/16T, and Intel hasn't stated they'll outdo that in 2010.

jvroig · Nov 16, 2009

Bulldozer has taped out, yes? (A1, or is it A0? I think cpu makers use A0 instead of A1 to refer to the 1st tape out?). If so, how long before we get a real preview? Perhaps March? Seems to me that's the only way we actually resolve this

Personally, I side with the "Anand" interpretation, simply because, in my limited knowledge, I believe that's the best way AMD would have a chance to have a solid win against Intel.

If the "Matthias" or JF way is true... well, the engineers at AMD are far smarter than me.

Idontcare · Nov 16, 2009

It would be A0, A0 is always first tape-out. Now first silicon isn't always necessarily A0 because the possibility always exists for a critical flaw being discovered after A0 tapeout and maskset printing but before first round A0 wafers actually reach fab exit. In those rare events the wafers are scrapped (no point continuing processing on known-dead ICs) and a new stepping is issues to correct the fatal bugs found in A0 stepping, and so on.

This can happen at any point in the stepping iteration enumeration by the way, stepping B2 can tape-out and be running in the fab on first silicon for that stepping when a fatal flaw is uncovered during inline probe or the ever persistent ongoing layout verification that runs in parallel with finer and finer granularity and B2 silicon never makes it out of the fab.

Regarding the whole "who knows best - JH-AMD or Anand and ROW?" debacle over core counting methodology in upcoming bulldozer derivatives I am just going to say AMD has history with otherwise inaccurate statements and representations by "rogue" employees...if I were a betting man I would assume at this point in time that Anand and his sources at AMD likely know the answer to the question and that JH-AMD is either (a) slightly out of the loop, (b) misunderstood the question, or (c) has had his response taken out of context.

In big companies it is difficult enough getting everyone on the same page at the same time within just the circles of people that are supposed to be "in the know"...you add in a few more enthusiastic employees who just so badly want to be in the know or desperately want to be perceived as being in the know and suddenly you have the appearance of left-hands not talking to right-hands and the folks who do know are just slapping their foreheads going "dear god not this again...".

So is JH-AMD in the know and the AMD folks who talked to all the media and enthusiast sites the ones behind the curve? Or is it the other way around? We'll never hear an official response from AMD either way, those "40% faster than Clovertown" comments were never retracted or qualified, AMD just pushed them into the closet.

I say until Anand says otherwise we should probably assume his AMD contacts know what they are talking about as Anand's website has a little more public visibility than JH-AMD and surely AMD would contact Anand and ensure the right information is out there if AMD really felt Anand's version of the story was in error.

Soleron · Nov 16, 2009

Idontcare said:
Regarding the whole "who knows best - JH-AMD or Anand and ROW?" debacle over core counting methodology in upcoming bulldozer derivatives...

If you look at his AMDzone posts, JF makes it very clear a number of times what he believes to be true. I don't think he could be taken out of context on this, so either he is wrong (and I accept that could be the case) or Anand is (and given the lack of detail in his article compared to the Analyst Day as a whole it could be that we are interpreting Anand wrong too).

But it's irrelevant. What matters is performance, not naming. And both JF and the Analyst Day talks said that another thread on the same module is an 80% speedup while not using more die area than HT would. That's clearly a lot better than HT assuming 80% would be reached a lot of the time. Given that that architecture is about performance consistency (while HT is opportunistic based on workload) I think we can assume that.

"Since I have answered the question at least a dozen times, let me be completely explicit here.

The graphic that you saw for analyst's day was a bulldozer MODULE. There is no such thing as a "bulldozer core." The cores inside the Bulldozer module are integer cores, NOT bulldozer cores.

Here is how it works.

A Bulldozer module has the following: 2 Integer cores. One 256-bit shared FPU (that can be addressed as a single 256-bit unit or 2 128-bit units per cycle). Shared front end. Shared L2 cache.

Each Bulldozer module is seen by the OS as 2 cores. The OS does not see a module, only cores.

Interlagos has 8 Bulldozer modules for a total of 16 cores.

Valencia has 4 Bulldozer modules for a total of 8 cores."

ilkhan · Nov 16, 2009

OK, that was a rather explicit explaination. Basically, I was right.

One module = one core = 2 threads.
4 modules = 4 cores = 8 threads; 8 modules = 8[2x4] cores = 16 threads.
L1 per module/core, L2 is per module/core, L3 is per die (if there is any?).

JFAMD · Nov 16, 2009

Let me explain clearly to everyone what is going on.

There are bulldozer modules, not bulldozer cores. Let's all get on the same page here and this will go a lot quicker. Half of the problem is someone confusing a core for a module.

I will use interlagos for this explanation since I am in the server business (I will never comment on desktop, don't know enough about that business.)

Interlagos is a 16-core processor. It will have 16 logical integer cores and it will appear to the hardware and OS as 16 cores.

An interlagos will be made up of EIGHT bulldozer modules. Each module will have 2 integer cores plus a shared 256-bit FPU (which we will get to in a second). 8 x 2 = 16.

Each integer core will run one thread (there are 4 pipelines). That means 2 cores per module, simultaneously.

The FPU is 256-bit. During each clock cycle it can be either 256-bit for either core OR it can be 128-bit for each core simultaneously.

Now, on to HT. Proponents of HT claim "performance improvement with ~5% die space increase." The problem with the performance increase is that it is generally ~10-20%. Sometimes it is negative (in which case they recommend that you turn off HT). So, as a tradeoff, 5% die space for ~20% performance increase seems fine, right?

Well we had our engineers do the math on our core. If I took an Interlagos (16 cores, 8 modules) and pulled out half of the cores, I would save ~5% of the die space. You see, there is a lot on the die other than the integer cores. There is cache, northbridge, FPU, etc. In our case, a 16-core interlagos should perform ~80%+ faster than an 8-core Interlagos. With ~5% more die space.

Some will still try to argue HT as a better technology, but it boils down to this: If you are going to add 5% die space to a CPU, would you rather have 10-20% performance increase (with the chance that it is also negative) or would you rather have 80%+ performance increase.

We believe that real cores and real threads give you the best performance.

JFAMD · Nov 16, 2009

ilkhan said:
OK, that was a rather explicit explaination. Basically, I was right.
One module = one core = 2 threads.
4 modules = 4 cores = 8 threads; 8 modules = 8[2x4] cores = 16 threads.
L1 per module/core, L2 is per module/core, L3 is per die (if there is any?).

No. One module = 2 cores = 2 threads.

JFAMD · Nov 16, 2009

Idontcare said:
Regarding the whole "who knows best - JH-AMD or Anand and ROW?" debacle over core counting methodology in upcoming bulldozer derivatives I am just going to say AMD has history with otherwise inaccurate statements and representations by "rogue" employees...if I were a betting man I would assume at this point in time that Anand and his sources at AMD likely know the answer to the question and that JH-AMD is either (a) slightly out of the loop, (b) misunderstood the question, or (c) has had his response taken out of context.

In big companies it is difficult enough getting everyone on the same page at the same time within just the circles of people that are supposed to be "in the know"...you add in a few more enthusiastic employees who just so badly want to be in the know or desperately want to be perceived as being in the know and suddenly you have the appearance of left-hands not talking to right-hands and the folks who do know are just slapping their foreheads going "dear god not this again...".

So is JH-AMD in the know and the AMD folks who talked to all the media and enthusiast sites the ones behind the curve? Or is it the other way around? We'll never hear an official response from AMD either way, those "40% faster than Clovertown" comments were never retracted or qualified, AMD just pushed them into the closet.

I say until Anand says otherwise we should probably assume his AMD contacts know what they are talking about as Anand's website has a little more public visibility than JH-AMD and surely AMD would contact Anand and ensure the right information is out there if AMD really felt Anand's version of the story was in error.

As the director of product marketing for servers and the guy who put many of the slides together for analyst's day, I think I fall into the camp of "in the know".

Anand and I have met several times in the past and I would be more than happy to sit down for another interview.

Idontcare · Nov 16, 2009

Soleron, I agree that thread count per socket and performance scaling over that thread count are what matters to the majority of enthusiasts, once normalized with respect to cost naturally.

It isn't clear to me at this stage though that we should just assume a Bulldozer module will outperform a Sandy Bridge core though when it comes to single-threaded and dual-threaded apps.

Projecting performance scalings beyond that (one SB core w/HT versus one BD module) are really academic until we have some basis for making assumptions regarding resource utilization and congestion of shared resources for even the simplest of workload scenarios.

AMD saying another thread in the same module is an 80% speedup is another of the same kind of super-general totally-unqualified statements that got us all needlessly excited over Barcelona back in the day. I'd like to see some qualifications to those statements, specific application of interest, the specific workloads being characterized as delivering 80% scaling on the same BD module.

Otherwise there really is no value in the 80% statement as is, it could end up representing some absolute best case corner-case example that is never seen outside AMD's testing centers and that hardly makes it relevant then when comparing it to the efficacy of hyperthreading on Sandy Bridge for scaling efficiency purposes.

ilkhan · Nov 16, 2009

JFAMD said:
No. One module = 2 cores = 2 threads.

No, we're just counting differently. You want us to call an int unit a core. Thats fine, but its not really honest. An int unit can run a separate thread, but without the shared hardware it does nothing.

JFAMD · Nov 16, 2009

evolucion8 said:
HT takes less than the 5% of the total transistor budget because it only duplicates register states and some schedulers, so it's approach is to maximize the execution engine usage. AMD approach takes more die space and adds you more execution resources.

Actually, they both take ~5% of die space. The difference is in what we duplicate vs. what they duplicate. We duplicate the integer cores and that results in a bigger uptick in performance. HT is 10-20% increase in server workloads, and there are places where it actually has better performance when disabled. With two discrete cores, you'll see ~80%+ more throughput.

Idontcare · Nov 16, 2009

JFAMD said:
Actually, they both take ~5% of die space. The difference is in what we duplicate vs. what they duplicate. We duplicate the integer cores and that results in a bigger uptick in performance. HT is 10-20% increase in server workloads, and there are places where it actually has better performance when disabled. With two discrete cores, you'll see ~80%+ more throughput.

Given that there are shared resources allocated between two threads within a Bulldozer module, from fetch to decode to L2$, how confident are you that thread performance won't decrease from resource contention and cache congestion on a bulldozer module as it does with hyperthreading in an Intel core?

When you make these "10-20% vs. 80%" statements, are these for identical workloads and applications?

You are confident that applications which HT only delivers 10% speedup will see an 80% speedup with thread scaling on Bulldozer?

JFAMD said:
As the director of product marketing for servers and the guy who put many of the slides together for analyst's day, I think I fall into the camp of "in the know".

Anand and I have met several times in the past and I would be more than happy to sit down for another interview.

Thanks for confirming that the discrepancy does not lie with you, now if we could just get to the bottom of why Anand reported differently then I think a lot of folks would be better off.

Also, thank you for personally coming here to clarify the subject matter. Its always a treat to get a chance to interact with the knowledgeable pros in the industry versus filtering through all the speculation that we usually do around here. I hope you find time to continue to help educate us and clarify misinterpretations of AMD's publicly disclosed info.

GaiaHunter · Nov 16, 2009

JFAMD said:
Let me explain clearly to everyone what is going on.

There are bulldozer modules, not bulldozer cores. Let's all get on the same page here and this will go a lot quicker. Half of the problem is someone confusing a core for a module.

Well you have to understand our confusion, when on the own AMD slides AMD talk about "Bulldozer cores".

And ty for explanations.

deimos3428 · Nov 16, 2009

If the Int schedulers are considered cores, is there a reason why the FP schedulers don't rate?

Maybe I'm over-simplifying things, but I count three cores per module -- two general-purpose and one specialized for FP.

Soleron · Nov 16, 2009

deimos3428 said:
If the Int schedulers are considered cores, is there a reason why the FP schedulers don't rate?

Most workloads traditionally associated with CPUs are Int.

deimos3428 · Nov 16, 2009

Soleron said:
Most workloads traditionally associated with CPUs are Int.

Yeah, I get that. But based on that logic, the 80387 wasn't a processor.

evolucion8 · Nov 16, 2009

May be that's why some people refers the Bulldozer as a 1.5 core

JFAMD · Nov 16, 2009

Idontcare said:
Given that there are shared resources allocated between two threads within a Bulldozer module, from fetch to decode to L2$, how confident are you that thread performance won't decrease from resource contention and cache congestion on a bulldozer module as it does with hyperthreading in an Intel core?

When you make these "10-20% vs. 80%" statements, are these for identical workloads and applications?

You are confident that applications which HT only delivers 10% speedup will see an 80% speedup with thread scaling on Bulldozer?

It is always difficult to make definite statements about a processor in the future that I don't have in my hand.

The 80% already reflects the fact that there are shared resources. With 100% dedicated resources, there is ~20% hit. But there is a big power savings in sharing those resources.

The 10-20% is based off of most server workloads from the customers I talk to. There are client workloads that may scale better, but I am a server guy. The 80% is an aggregate estimate, but should be close enough for most cases. Some may be higher, some may be lower, but they shouldn't vary wildly.

Idontcare · Nov 16, 2009

@JFAMD, fair enough. I understand it is not going to be easy qualify the 80% statements until chips are released to reviewers, I appreciate you taking the time to qualify it as much as you already have.

evolucion8 said:
May be that's why some people refers the Bulldozer as a 1.5 core

Let's be fair to ourselves and to AMD though...we want them to do something different, we want to see them step outside the box and one-up Intel so we can consumers get to see some competitive pressures in the high-end space. (remember those heady days leading up to 1GHz where we had new SKU's released every month?)

So its not really fair to want AMD to do something different but at the same time expect them to not do anything different.

So they are fusing cores together, maximizing resource utilization in ways that seems foreign to us arm-chair microarchitects. We'll just have to get over it.

Who knows, we saw what AMD did to the whole "GHz race" (IPC x GHz matters, not just GHz) maybe they are about to do the equivalent to the "core race" by showing its not just the core count that matters but how well you use those cores. (thread performance scaling, etc)

Whatever AMD is doing here at least we can all agree it is definitely the start of something different, and different can be good if change is what is needed.

JFAMD · Nov 16, 2009

I couldn't agree more.

deputc26 · Nov 16, 2009

The definition of a "core" is blurring and it looks like the industry will now have the unenviable job of accurately portraying performance to the consumer without being able to use the term "core" in an apples-to-apples way. Just as Ghz became a clearly inadequate measure of performance earlier.

Will AMD and Intel have the honesty to come up with new metrics that the average consumer will understand or will they exploit the blurred lines for less-than-100%-accurate marketing purposes? I sincerely hope that both choose the former.

AMD sheds light on Bulldozer, Bobcat, desktop, laptop plans

Senior member

Golden Member

Diamond Member

Golden Member

Senior member

Platinum Member

Elite Member

Senior member

Golden Member

Senior member

Senior member

Senior member

Elite Member

Golden Member

Senior member

Elite Member

Diamond Member

Senior member

Senior member

Senior member

Platinum Member

Senior member

Elite Member

Senior member

Senior member