AMD sheds light on Bulldozer, Bobcat, desktop, laptop plans


Kuzi

Senior member
Sep 16, 2007
They don't want to make 500mm2 chips, server or not. If my estimate of one Nehalem core = 1 BD module holds, an 8-module version will reach over 330mm2 even on a 32nm process. If they want to do 16 modules, it'll be greater than 660mm2 even if they add nothing for better scalability.

I believe an 8-module BD would end up larger than 330mm2. Remember that L3 cache size can make a big difference. But if I were going to make a wild guess about the die size of BD, it would be as follows:
2-module/4-core (with 4MB L3 cache) = ~150mm2
4-module/8-core (with 8MB L3 cache) = ~300mm2
8-module/16-core (with 16MB L3 cache) = ~600mm2

If we compare to CPUs available today, and the projected performance of BD, it would seem like a really great CPU. Let's take Istanbul for example: it has 6 cores, 6MB of L3 cache, and a 345mm2 die size. It's very possible a 4-module BD would end up smaller than that, while offering much better performance and higher clocks (or the same 2.6GHz clock with a lower TDP).
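A quick back-of-envelope comparison of those guesses against Istanbul, just to put the per-core area side by side (all inputs are the estimates quoted above, nothing official from AMD):

```python
# Rough area-per-core comparison between Istanbul (45nm, 6 cores, 6MB L3,
# 345mm2) and the guessed 32nm Bulldozer configurations. All numbers are
# the forum estimates above, not AMD data.
istanbul = {"cores": 6, "die_mm2": 345}

bd_guesses = [
    {"modules": 2, "cores": 4,  "die_mm2": 150},
    {"modules": 4, "cores": 8,  "die_mm2": 300},
    {"modules": 8, "cores": 16, "die_mm2": 600},
]

print(f"Istanbul: {istanbul['die_mm2'] / istanbul['cores']:.1f} mm2 per core")
for g in bd_guesses:
    per_core = g["die_mm2"] / g["cores"]
    print(f"{g['modules']}-module BD guess: {per_core:.1f} mm2 per core "
          f"({g['die_mm2']} mm2 total)")
```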
 

Kuzi

Senior member
Sep 16, 2007
Even if a BD module has the same integer resources as, but half the FP resources of, 2 Nehalem CORES, is FP that important compared to integer? Or do you need that much FP?

The FPU in a BD module can run as 1x256-bit or 2x128-bit, while a Nehalem FPU is at most 128 bits wide, so I would think the FPU in a BD module performs just as well as two Nehalem FPUs, and probably much better.

The Nehalem FPU is exactly the same as the Core 2 FPU; there was no change there except the addition of SSE4.2 instructions. The FPU in K10/K10.5 is comparable in performance to its Intel counterparts, and according to Anand's article the FP unit in BD should offer 25% higher performance compared to Phenom.

The only architecture the BD FPU may look weak against is Sandy Bridge, because each core will have its own full 256-bit-wide unit. And who knows, maybe AMD's plan is to throw more "Modules" and/or a GPU in to compete :)
 

Idontcare

Elite Member
Oct 10, 1999
I believe an 8-module BD would end up larger than 330mm2. Remember that L3 cache size can make a big difference. But if I were going to make a wild guess about the die size of BD, it would be as follows:
2-module/4-core (with 4MB L3 cache) = ~150mm2
4-module/8-core (with 8MB L3 cache) = ~300mm2
8-module/16-core (with 16MB L3 cache) = ~600mm2

If we compare to CPUs available today, and the projected performance of BD, it would seem like a really great CPU. Let's take Istanbul for example: it has 6 cores, 6MB of L3 cache, and a 345mm2 die size. It's very possible a 4-module BD would end up smaller than that, while offering much better performance and higher clocks (or the same 2.6GHz clock with a lower TDP).

Kuzi, are you accounting for the 45nm->32nm transition between Istanbul and Bulldozer when tallying up your Bulldozer die-size estimates?
 

jvroig

Platinum Member
Nov 4, 2009
Kuzi, are you accounting for the 45nm->32nm transition between Istanbul and Bulldozer when tallying up your Bulldozer die-size estimates?
I am curious about this as well, only because of the node difference.

IDC, how much smaller would it be, given that node labels are arbitrary? We just know that it should be smaller, but by how much? Is 45nm->32nm a 40% decrease in die size, all other factors being equal? 30%? 60%?
 

Triskain

Member
Sep 7, 2009
By studying the die plots of AMD's 45nm CPUs I have determined the following data on a 45nm K10 core: it has a total area of 16.61 mm² (give or take 0.5 mm²) including L1 Cache, and about 34 million transistors. The breakdown is as follows:

- 5.81 mm² for the Decoders, Branch Prediction, the Instruction Cache, ITLB, Microcode ROM etc.

- 2.2 mm² for the Integer ALUs, Registers, Schedulers, ROB etc.

- 4.6 mm² for the LoadStore Unit and the L1 Data Cache

- 4.0 mm² for the FPU Execution Units, Registers, Schedulers, Rename etc.

If I take that as a base, I can approximately determine the possible Die area of a Bulldozer Module:

- The Front End will have to feed the two Integer Cores and the FPU, so I would propose at least a 50% increase in size --> 8.72 mm²

- The Integer Cores will probably be 4-Way (up from 3-Way as it is on K10) so I say we add 40% on top and double them --> 6.16 mm²

- The amount of LS Units and Data Caches will double (and their capabilities will surely improve from what they are now) so I say 110% --> 19.32 mm²

- The K10 FPU can do 1 128-bit MUL and 1 128-bit ADD now; according to Dresdenboy the Bulldozer FPU will be able to do twice that (2 128-bit MULs and 2 128-bit ADDs), so we should assume at least a doubling of the size --> 8.4 mm²

Summed together we come to a total area of 42.6 mm² for a hypothetical Bulldozer module in 45 nm and about 87 million Transistors.
If we assume an average scaling of 0.6 from 45nm to 32nm (which is realistic if we look at the size of a K10 core at 32nm - 9.69 mm² according to the ISSCC presentation from AMD), we come to a Core size of 25.26 mm².
That is pretty much the size of a Nehalem Core at 45nm, so I think my guesstimates are realistic.
What do you think IDC? As a former process engineer you can surely give us some additional info about process scaling.
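For anyone who wants to play with the numbers, here is a small sketch that simply re-runs the arithmetic from Triskain's post; the base block areas are his 45nm K10 measurements and the growth factors are his guesses, so treat it as an illustration rather than anything authoritative:

```python
# Re-running Triskain's back-of-envelope Bulldozer module estimate.
# Base block areas are his 45nm K10 measurements; growth factors are
# his guesses for Bulldozer, not AMD data.
k10 = {"front_end": 5.81, "int_core": 2.2, "load_store": 4.6, "fpu": 4.0}

bd_45nm = {
    "front_end":  k10["front_end"] * 1.5,       # wider front end feeding 2 cores + FPU
    "int_cores":  k10["int_core"] * 1.4 * 2,    # 4-wide instead of 3-wide, and two of them
    "load_store": k10["load_store"] * 2.1 * 2,  # doubled LSU/L1D with improved capability
    "fpu":        k10["fpu"] * 2.1,             # "at least a doubling" of the K10 FPU
}

total_45nm = sum(bd_45nm.values())   # ~42.6 mm2 at 45nm
total_32nm = total_45nm * 0.6        # ~25.5 mm2 with a 0.6x areal shrink to 32nm

print(f"Hypothetical BD module at 45nm: {total_45nm:.1f} mm2")
print(f"Scaled to 32nm (x0.6):          {total_32nm:.1f} mm2")
```

(This lands a few tenths above the 25.26 mm² figure in the post; the small difference is just rounding in the shrink factor.)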
 

Idontcare

Elite Member
Oct 10, 1999
@JVroig: The scaling efficiencies are actually dictated by economic necessity.

If the physical scaling is too little then the cost incurred in doing the design/layout work to shrink the chip overwhelms the return-on-investment of doing the shrink in the first place. In industry-speak we refer to the concept of "entitlement" when discussing the scaling/cost/gross-margin balance that goes into developing new nodes and the chips to be manufactured on them.

Suffice it to say, a good rule of thumb is that successive nodes shrink the area of SRAM by 50% (linear dimensions scale to ~70.7%, and 0.707 x 0.707 = 0.50 areal shrink), logic by 30-40%, and IO by 10-20% (at best).

There is no physical limitation behind the varying shrink ratios; it all comes down to development time and R&D expense. It would cost a lot more money (or time, or both) to develop a process node that shrank the logic areas by the same scaling factor as that seen in SRAM.

In a world of limited budget and finite project time, the priorities get ranked (and budget allocated accordingly) such that SRAM scaling is pursued more aggressively than logic, which in turn is pursued more aggressively than IO.
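To make that rule of thumb concrete, here is a quick sketch that applies those per-block shrink factors to a hypothetical die; the 60/30/10 SRAM/logic/IO area split is invented purely for illustration, only the shrink factors come from the post above:

```python
# Apply the rule-of-thumb areal shrink factors above to a hypothetical die.
# The 60/30/10 mm2 SRAM/logic/IO split is made up for illustration;
# only the shrink factors come from the post.
shrink = {"sram": 0.50, "logic": 0.65, "io": 0.85}   # fraction of area retained (midpoints of the ranges)
old_die = {"sram": 60.0, "logic": 30.0, "io": 10.0}  # hypothetical 100 mm2 die

new_die = {block: area * shrink[block] for block, area in old_die.items()}

old_total, new_total = sum(old_die.values()), sum(new_die.values())
print(f"Old die: {old_total:.0f} mm2")
print(f"New die: {new_total:.1f} mm2 ({new_total / old_total:.0%} of the original area)")
```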

Have you seen these images from the CELL 90nm->65nm->45nm shrink progression before?

[Images: kaigai-01.jpg and kaigai-06.jpg, CELL die shots comparing the 90nm, 65nm, and 45nm versions]


Source: http://pc.watch.impress.co.jp/docs/column/kaigai/20090824_310373.html
 

Idontcare

Elite Member
Oct 10, 1999
By studying the die plots of AMD's 45nm CPUs I have determined the following data on a 45nm K10 core: it has a total area of 16.61 mm² (give or take 0.5 mm²) including L1 Cache, and about 34 million transistors. The breakdown is as follows:

- 5.81 mm² for the Decoders, Branch Prediction, the Instruction Cache, ITLB, Microcode ROM etc.

- 2.2 mm² for the Integer ALUs, Registers, Schedulers, ROB etc.

- 4.6 mm² for the LoadStore Unit and the L1 Data Cache

- 4.0 mm² for the FPU Execution Units, Registers, Schedulers, Rename etc.

If I take that as a base, I can approximately determine the possible Die area of a Bulldozer Module:

- The Front End will have to feed the two Integer Cores and the FPU, so I would propose at least a 50% increase in size --> 8.72 mm²

- The Integer Cores will probably be 4-Way (up from 3-Way as it is on K10) so I say we add 40% on top and double them --> 6.16 mm²

- The amount of LS Units and Data Caches will double (and their capabilities will surely improve from what they are now) so I say 110% --> 19.32 mm²

- The K10 FPU can do 1 128-bit MUL and 1 128-bit ADD now; according to Dresdenboy the Bulldozer FPU will be able to do twice that (2 128-bit MULs and 2 128-bit ADDs), so we should assume at least a doubling of the size --> 8.4 mm²

Summed together we come to a total area of 42.6 mm² for a hypothetical Bulldozer module in 45 nm and about 87 million Transistors.
If we assume an average scaling of 0.6 from 45nm to 32nm (which is realistic if we look at the size of a K10 core at 32nm - 9.69 mm² according to the ISSCC presentation from AMD), we come to a Core size of 25.26 mm².
That is pretty much the size of a Nehalem Core at 45nm, so I think my guesstimates are realistic.
What do you think IDC? As a former process engineer you can surely give us some additional info about process scaling.

That's a pretty damn good estimate in my humble opinion


Can you do us a favor and take a die map of Shanghai/Deneb and graphically highlight the various areas of the chip with labels, so we can see which part is the FPU/ALU/etc.?

I don't make this request as a means to critique or scrutinize your claims; rather, I want to learn, and yours is perhaps the most succinct post on the intricacies of the K10 core architecture that I have read.
 

Triskain

Member
Sep 7, 2009
Here is something I threw together quickly, but it shows the most important facts:

[annotated Deneb die map image]
It is based on that picture by AMD : http://www.flickr.com/photos/amd_unprocessed/3585623217/sizes/l/

If you want to find out more about how AMD's architecture works, I recommend reading this: http://www.chip-architect.com/news/2003_09_21_Detailed_Architecture_of_AMDs_64bit_Core.html
It may be about the K8 architecture, but as there were no significant changes between K8 and K10, most of it is still correct. It is a very enjoyable read, at least for someone who is technically interested.
 

Idontcare

Elite Member
Oct 10, 1999
Awesome! Thanks Triskain... man, that is a blast from the past to read thru Hans' old Opteron dissection; I pored over that thing years and years ago. Reading thru it again, it is kind of sad to realize how much I've forgotten in the 6 yrs since. Wow, what a resource, gonna bookmark it this time ;)
 

Kuzi

Senior member
Sep 16, 2007
Kuzi, are you accounting for the 45nm->32nm transition between Istanbul and Bulldozer when tallying up your Bulldozer die-size estimates?

Sure IDC. I did account for the transition to 32nm, although I used a worst-case scenario and lowered the chip size by only 35% instead of 40%. And I just noticed from your post that SRAM could shrink by up to 50% with each successive node, so my L3 cache area estimate for BD was probably larger than it should be.

I used Triskain's nice estimate of 25mm2 for each BD module, and changed my L3 cache numbers to shrink by 45% compared to 45nm (instead of my earlier 35% shrink). I got the following:
2-module/4-core (with 4MB L3 cache) = ~130mm2
4-module/8-core (with 8MB L3 cache) = ~260mm2
8-module/16-core (with 16MB L3 cache) = ~520mm2

Okay, these numbers look better, the info you guys provided helped :)

These numbers are very rough estimates, but they could give an idea of what a future BD on a new process would be like. Another guess I want to make is that for BD, AMD may go the Nehalem route and have smaller/faster L2 caches, for example 256KB L2 per core (512KB per Module).
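As a rough cross-check of those revised numbers against Triskain's ~25 mm² module estimate, here is a small sketch showing how much of each guess is left over for L3 and uncore once the modules themselves are subtracted (all inputs are the thread's estimates, nothing official):

```python
# Cross-checking Kuzi's revised guesses against Triskain's ~25 mm2 module.
# All numbers are forum estimates from this thread, not AMD figures.
module_mm2 = 25.0                    # Triskain's 32nm module estimate
revised = {2: 130, 4: 260, 8: 520}   # modules -> guessed total die size (mm2)

for modules, die in revised.items():
    cores = modules * 2
    leftover = die - modules * module_mm2   # area implied for L3 + uncore
    print(f"{modules} modules / {cores:>2} cores: {die} mm2 total, "
          f"{die / cores:.1f} mm2 per core, "
          f"{leftover:.0f} mm2 left for L3 + uncore")
```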
 

Martimus

Diamond Member
Apr 24, 2007
Another guess I want to make is that for BD, AMD may go the Nehalem route and have smaller/faster L2 caches, for example 256KB L2 per core (512KB per Module).

That may be true, but with two cores sharing each L2 Cache, I think it is more likely they will increase the size of the cache or leave it the same than reduce it.

Martimus, the Bulldozer cores share more resources than just the FPU. The integer units share resources as well (within the same module).

At any rate, there is a distinction to be made here between expecting 20% and expecting more than 23% as a minimum. 20% is reasonable as an upper limit of what we tend to expect of a microarchitecture; expecting more than that as the minimum starting value for our range is what seems unrealistic.

It is just numerology and opinion; no reason mine should be any more valid than yours. When was the last time we had a new architecture debut that improved IPC by 25% across virtually every application category?

Also, do go ahead and read JFAMD's comments in this thread; he stated plainly (as far as I perceived it) that the 1.8x thread scaling was "best case". I'm not doing anything here blindly; I am taking AMD's marketing at face value and assuming it is 100% correct. From there I am simply walking out the implications and ramifications of those statements.

IDC, I haven't responded to your earlier questions since I have been rather busy. I thought I remembered JFAMD stating that the 80% estimate was conservative, and not best case. I couldn't find where he said it was best case, but I couldn't find where he said it was conservative either. I did see his explanation for where the figure came from, and I don't have the link handy but will say it was like you said: Some of the components are shared between the two cores, so he extrapolated a 20% loss in efficiency.

What I was trying to get at with my earlier post was that a 20% increase in a specific application is not uncommon at all, even with small architecture improvements. You often see tweaks that correct issues older architectures had in specific applications, causing large increases in those specific areas. This is only one specific application where the increase is said to be ~20-30%, not a broad spectrum of applications. If you were going to rebut my argument, I expected you to respond by explaining what the application stresses, and showing me that increases in that application shouldn't be as high because the known improvements in those areas on the BD chip are insignificant. As it is, I still don't have an idea as to why you feel BD didn't improve in integer calculations over Magny-Cours by an appreciable amount.

With that being said, I actually think the comparison was a TDP-equalized test, not a clockspeed-equalized test. This is the way they have shown results with Shanghai, so I would expect the same with Bulldozer. While it doesn't give you a direct IPC comparison, it does give you a better comparison of the actual end products (since clockspeed isn't an external constraint on the design, while TDP is; clockspeed will likely be the highest possible while staying within the design constraints).

On a different tangent, how come we are getting so much information about the BD architecture, but almost nothing about the upcoming Intel architecture? Is it because not much is going to be different, since it should be a minor change? I find it somewhat strange, since BD is so far away, yet some members on our forum already have some of the Intel chips.
 

Kuzi

Senior member
Sep 16, 2007
That may be true, but with two cores sharing each L2 Cache, I think it is more likely they will increase the size of the cache or leave it the same than reduce it.

I just checked the slide and you are right about the shared L2 cache. For a shared cache, the data does not have to be duplicated for each "core", but because we have two cores and possibly two threads using this shared cache, it cannot be too small. So your assumption that it would stay the same size (512KB) or increase slightly (1MB) seems correct.
 

GaiaHunter

Diamond Member
Jul 13, 2008
2-module/4-core (with 4MB L3 cache) = ~130mm2
4-module/8-core (with 8MB L3 cache) = ~260mm2
8-module/16-core (with 16MB L3 cache) = ~520mm2

Do we have any data on how much L3 cache they are going to use?

I think the L3$ will be shared by all the modules, so the 4-module/8-module parts may well have something like 4-6MB, and the 2-module/4-core part 2-4MB.
 

Fox5

Diamond Member
Jan 31, 2005
I just checked the slide and you are right about the shared L2 cache. For a shared cache, the data does not have to be duplicated for each "core", but because we have two cores and possibly two threads using this shared cache, it cannot be too small. So your assumption that it would stay the same size (512KB) or increase slightly (1MB) seems correct.

2 threads sharing 512KB sounds bad, but I guess the i7 pulls it off with 256KB. Still, CPUs with 512KB of L2 cache have been available since 2002; I'd think a lot of software would be designed to expect that much to be available.
 

Triskain

Member
Sep 7, 2009
I just noticed two mistakes in my calculations. I double-counted the LSU (I had already doubled it once), so the corrected size at 32nm is ~20 mm². The other mistake is that I forgot there will be a second Instruction Cache, which counterbalances that a bit; I would say it adds about 1 mm², so we come to 21 mm².
I'm in the process of doing some calculations on possible die sizes; I'll let you hear about it once I'm finished.
 

JFAMD

Senior member
May 16, 2009
We have not released the L2 or L3 cache data to the public yet. Nor the L1 for that matter.

As for the 80%, yes I said that and I also said that I believe it to be a conservative number. However, without silicon in my hand, I am going off of engineering estimates. We tend to be very conservative with our estimates, especially when we are this far out.

When I do performance estimates, I never use a single benchmark as a data point; it is always a "basket" of several benchmarks. So think of the typical server benchmarks that you have (int, FP, Java, database, web, etc.) and think of an aggregate of those.

There will definitely be places where it will be over 80% and definitely places where it will be under 80%. Because we tend to be conservative, there should be more above than below, and I don't expect the variance to be huge. Much of this is still too early to put a stake in the ground. However, I am pretty confident that you won't see any cases of negative scaling (i.e. the throughput of 2 threads being lower than the throughput of 1 thread) like you do with SMT. That was one of the biggest drivers of the architectural choices that we made. Consistency and predictability are high on the list of goals. As a marketing organization, you don't want to have to defend turning off a performance feature in order to get more performance.
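To illustrate the "basket" idea, here is a tiny sketch that aggregates per-benchmark 2-thread scaling into a single figure; the per-benchmark numbers are invented purely for illustration, and only the ~1.8x ballpark and the "never below 1.0x" property come from JFAMD's post:

```python
# Illustrating the "basket of benchmarks" aggregation described above.
# The per-benchmark scaling factors are hypothetical; only the ~1.8x
# ballpark and the "no negative scaling" claim come from the post.
from statistics import geometric_mean

# throughput of 2 threads on one module relative to 1 thread (hypothetical)
scaling = {"int": 1.85, "fp": 1.70, "java": 1.90, "database": 1.80, "web": 1.85}

assert all(s >= 1.0 for s in scaling.values())   # no case where a 2nd thread hurts throughput

print(f"Aggregate 2-thread scaling: {geometric_mean(list(scaling.values())):.2f}x")
```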
 

Kuzi

Senior member
Sep 16, 2007
Do we have any data on how much L3 cache are they going to use?

I think L3$ will be shared by all the modules so 4 modules/8 modules may as well be something 4-6MB and the 2 module/4 core 2-4MB.

From the recent AMD slides, I got that a 4-module Zambezi has 8MB L3 cache, so I assumed the 2-module version would have 4MB, and the 8-module server BD would have 16MB. I'm sure for different market segments AMD would have different versions of BD with different L3 configurations. For example, a cacheless 1-module and 2-module BD for the notebook market is possible. Or an 8-module BD for servers with 24MB L3 cache (like Nehalem EX).

Taking AMD's previous cache densities, we can say that every 4MB of L3 added to the CPU increases die size by roughly 16-20mm2 (@32nm). So it's up to AMD to decide whether or not it's worth it to add more cache.
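Tabulating that density guess for the cache sizes mentioned in the thread (again, these are estimates, not AMD numbers):

```python
# Kuzi's rule of thumb above: 4MB of L3 costs roughly 16-20 mm2 at 32nm,
# i.e. about 4-5 mm2 per MB. These are thread estimates, not AMD data.
mm2_per_mb_low, mm2_per_mb_high = 4.0, 5.0

for l3_mb in (4, 8, 16, 24):   # cache sizes mentioned in the thread
    print(f"{l3_mb:>2} MB L3: roughly {l3_mb * mm2_per_mb_low:.0f}-"
          f"{l3_mb * mm2_per_mb_high:.0f} mm2")
```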
 

IntelUser2000

Elite Member
Oct 14, 2003
- The Integer Cores will probably be 4-Way (up from 3-Way as it is on K10) so I say we add 40% on top and double them --> 6.16 mm²

- The amount of LS Units and Data Caches will double (and their capabilities will surely improve from what they are now) so I say 110% --> 19.32 mm²

to a Core size of 25.26 mm².

- Why do we double the integer cores? Because there are 4 for each integer core? Based on the diagrams given by AMD/PCWatch, some patent extrapolation by Dresdenboy, and guesstimation by Hans, the ALUs will also have the capability to perform AGU functions, so there is no separate Load/Store unit, just more ALUs.
- The data caches per integer core are smaller, according to one report.

Kuzi: 20mm2/4MB L3
Triskain: 25.26mm2 per module

That's ~280mm2 for JUST the L3 and cores alone on an 8-module part. Can the unknowns (IMC/HyperTransport/tags/misc) add enough to take it from that to 500mm2 (+220mm2)?
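For reference, here is the arithmetic behind that ~280mm2 figure, using the two thread estimates it is built from:

```python
# Sanity-checking the ~280mm2 "cores + L3 only" figure for an 8-module part,
# built from Triskain's module estimate and Kuzi's L3 density estimate.
module_mm2 = 25.26        # Triskain's 32nm module estimate
l3_mm2_per_4mb = 20.0     # Kuzi's estimate (upper end of 16-20 mm2)
modules, l3_mb = 8, 16

cores_area = modules * module_mm2
l3_area = (l3_mb / 4) * l3_mm2_per_4mb
print(f"Cores: {cores_area:.0f} mm2, L3: {l3_area:.0f} mm2, "
      f"total: {cores_area + l3_area:.0f} mm2")   # ~282 mm2, i.e. the ~280 figure
```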
 

cbn

Lifer
Mar 27, 2009
I feel pretty confident that the argument in 2011 is not going to be about single-core performance; it's going to be more about scheduler efficiency.

Scheduler efficiency? So this sounds like a huge bottleneck then.


JFAMD said:
My guess is that in 2011 the real discussion is going to be the price, the performance and the power consumption of competing processors. That is all people really care about.

I just hope AMD can have a justifiable reason to eventually beat Intel to smaller nodes. Obviously this starts with having really good designs.
 

cbn

Lifer
Mar 27, 2009
I believe an 8-module BD would end up larger than 330mm2. Remember that L3 cache size can make a big difference. But if I were going to make a wild guess about the die size of BD, it would be as follows:
2-module/4-core (with 4MB L3 cache) = ~150mm2
4-module/8-core (with 8MB L3 cache) = ~300mm2
8-module/16-core (with 16MB L3 cache) = ~600mm2

If we compare to CPUs available today, and the projected performance of BD, it would seem like a really great CPU. Let's take Istanbul for example: it has 6 cores, 6MB of L3 cache, and a 345mm2 die size. It's very possible a 4-module BD would end up smaller than that, while offering much better performance and higher clocks (or the same 2.6GHz clock with a lower TDP).

So the single module Bulldozer would end up being the same size as the Core i3 CPU?

If that is true then I hope this CPU is much more powerful than I am thinking.
 

ilkhan

Golden Member
Jul 21, 2006
So the single module Bulldozer would end up being the same size as the Core i3 CPU?

If that is true then I hope this CPU is much more powerful than I am thinking.
Actually, if a BD module is the same size as an i3 then it'll be a good match: the i3 die doesn't include the memory controller, GPU, DMI, or PCI-E buses, and it's 2 cores; it's JUST the CPU and QPI bus. Would make for a good comparison point.
 

IntelUser2000

Elite Member
Oct 14, 2003
Do we have confirmation that the 16-core Bulldozer is indeed 60-80% faster than the 12-core Magny-Cours? Is that an estimate from an ambiguous graph, or is it roughly true?
 

cbn

Lifer
Mar 27, 2009
Do we have confirmation that the 16-core Bulldozer is indeed 60-80% faster than the 12-core Magny-Cours? Is that an estimate from an ambiguous graph, or is it roughly true?

If that is true, then it is much more impressive than octo core being 60%-80% faster than hex core.