IntelUser2000
Elite Member
- Oct 14, 2003
They don't want to make 500mm2 chips, server or not. If my estimate of 1 Nehalem core = 1 BD module holds, an 8-module version will reach over 330mm2 even on a 32nm process. If they want to do 16 modules, it'll be greater than 660mm2 even if they add nothing for better scalability.
Even if a BD module has the same integer resources and half the FP resources of 2 Nehalem cores, is FP that important compared to integer? Or do you need so much FP?
I believe an 8-module BD would end up larger than 330mm2. Remember that L3 cache size can make a big difference. But if I were to make a wild guess about the die size of BD, it would be as follows:
2-module/4-core (with 4MB L3 cache) = ~150mm2
4-module/8-core (with 8MB L3 cache) = ~300mm2
8-module/16-core (with 16MB L3 cache) = ~600mm2
If we compare to CPUs available today, and the projected performance of BD, it would seem like a really great CPU. Let's take Istanbul for example: it has 6 cores, 6MB L3 cache and a 345mm2 die size. It's very possible a 4-module BD would end up smaller than that, while offering much better performance and also higher clocks (or the same clock, 2.6GHz, with lower TDP).
I am curious about this as well, only because of the node difference. Kuzi, are you accounting for the 45nm->32nm transition between Istanbul and Bulldozer when tallying up your Bulldozer die size estimates?
By studying the die plots of AMD's 45nm CPUs I have determined the following data for a 45nm K10 core: it has a total area of 16.61 mm² (give or take 0.5 mm²) including L1 cache, and about 34 million transistors. The breakdown is as follows:
- 5.81 mm² for the Decoders, Branch Prediction, the Instruction Cache, ITLB, Microcode ROM etc.
- 2.2 mm² for the Integer ALUs, Registers, Schedulers, ROB etc.
- 4.6 mm² for the Load/Store Unit and the L1 Data Cache
- 4.0 mm² for the FPU Execution Units, Registers, Schedulers, Rename etc.
If I take that as a base, I can approximately determine the possible die area of a Bulldozer module:
- The Front End will have to feed the two Integer Cores and the FPU, so I would propose at least a 50% increase in size --> 8.72 mm²
- The Integer Cores will probably be 4-way (up from 3-way as on K10), so I say we add 40% on top and double them --> 6.16 mm²
- The number of LS Units and Data Caches will double (and their capabilities will surely improve from what they are now) so I say 110% --> 19.32 mm²
- The K10 FPU can do 1 128-bit MUL and 1 128-bit ADD now; according to Dresdenboy the Bulldozer FPU will be able to do twice that (2 128-bit MUL and ADD), so we should assume at least a doubling of the size --> 8.4 mm²
Summed together we come to a total area of 42.6 mm² for a hypothetical Bulldozer module in 45 nm and about 87 million Transistors.
If we assume an average scaling of 0.6 from 45nm to 32nm (which is realistic if we look at the size of a K10 core in 32nm: 9.69 mm² according to AMD's ISSCC presentation), we come to a module size of 25.26 mm².
That is pretty much the size of a Nehalem Core at 45nm, so I think my guesstimates are realistic.
What do you think IDC? As a former process engineer you can surely give us some additional info about process scaling.
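For anyone who wants to play with the assumptions, here is a rough Python sketch of the arithmetic in the post above. The block areas, the 34 million transistor count and the 9.69 mm² 32nm K10 figure are taken from the post; the multipliers are backed out from the per-block results it quotes. In particular, 4.6 -> 19.32 mm² works out to x4.2, which I am reading as "+110%, then doubled" (my interpretation, not something the post states), and 4.0 -> 8.4 mm² is x2.1 for the FPU. All names are made up for illustration.

```python
# K10 core area breakdown at 45nm in mm^2 (figures quoted in the post).
K10_AREA = {
    "front_end": 5.81,
    "integer": 2.2,
    "load_store": 4.6,
    "fpu": 4.0,
}

# Growth factors per block, inferred from the per-block results in the post.
# load_store = 2.1 * 2 is an ASSUMPTION that reproduces the quoted 19.32 mm^2.
MODULE_FACTOR = {
    "front_end": 1.5,        # +50%
    "integer": 1.4 * 2,      # +40%, then two integer cores
    "load_store": 2.1 * 2,   # assumed: +110%, then doubled
    "fpu": 2.1,              # slightly more than a doubling
}

K10_TRANSISTORS_M = 34.0     # millions, from the post
K10_AREA_32NM = 9.69         # mm^2, K10 core at 32nm as cited in the post


def module_area_45nm():
    """Estimated Bulldozer module area at 45nm in mm^2."""
    return sum(K10_AREA[block] * MODULE_FACTOR[block] for block in K10_AREA)


def scale_area(area, factor):
    """Apply a flat area scaling factor for a node shrink."""
    return area * factor


k10_total = sum(K10_AREA.values())
bd_45nm = module_area_45nm()
# Assumes transistor density stays the same as on the K10 core.
bd_transistors_m = K10_TRANSISTORS_M * bd_45nm / k10_total
# Scaling implied by the two K10 core sizes, for comparison with the flat 0.6.
k10_scaling = K10_AREA_32NM / k10_total

print(f"K10 core at 45nm:          {k10_total:.2f} mm^2")
print(f"BD module at 45nm:         {bd_45nm:.2f} mm^2")
print(f"BD module transistors:     {bd_transistors_m:.1f} M")
print(f"BD module at 32nm (x0.60): {scale_area(bd_45nm, 0.6):.2f} mm^2")
print(f"BD module at 32nm (x{k10_scaling:.3f}): {scale_area(bd_45nm, k10_scaling):.2f} mm^2")
```

With these inputs the 45nm module comes out at about 42.6 mm² and roughly 87 million transistors, matching the post. A flat 0.6 gives about 25.6 mm² at 32nm, and the scaling implied by the K10 numbers (about 0.58) gives a bit less, so the quoted 25.26 mm² sits between the two.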
Another guess I want to make is that for BD, AMD may go the Nehalem route and have smaller/faster L2 caches, for example 256KB L2 per core (512KB per Module).
Martimus, the Bulldozer cores share more resources than just the FPU. The integer units share resources as well (within the same module).
At any rate, there is a distinction to be made here between expecting 20% and expecting more than 23% as a minimum. 20% is reasonable as an upper limit of what we tend to expect from a new microarchitecture. Expecting more than that as the minimum start value for our range is what seems unrealistic.
It is just numerology and opinion, no reason mine should be any more valid than yours. When was the last time we had a new architecture debut which improved IPC by 25% across the range of virtually every application category?
Also do go ahead and read JFAMD's comments in this thread; he stated plainly (as far as I perceived it) that the 1.8x thread scaling was "best case". I'm not doing anything here blindly: I am taking AMD's marketing at face value and assuming it is 100% correct. From there I am simply walking out the implications and ramifications of those statements.
That may be true, but with two cores sharing each L2 cache I think it is more likely they will increase the size of the cache or leave it the same than reduce it.
2-module/4-core (with 4MB L3 cache) = ~130mm2
4-module/8-core (with 8MB L3 cache) = ~260mm2
8-module/16-core (with 16MB L3 cache) = ~520mm2
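To put these numbers next to the earlier ~150/300/600mm2 guess and Istanbul's 345mm2, here is a small Python sketch. It only uses figures already posted in this thread; the variable and function names are just for illustration.

```python
# Die size guesses in mm^2, keyed by module count (figures from this thread).
ORIGINAL_MM2 = {2: 150, 4: 300, 8: 600}
REVISED_MM2 = {2: 130, 4: 260, 8: 520}
ISTANBUL_MM2 = 345  # 6-core Istanbul die size quoted earlier


def per_module(estimates):
    """Die area per module, including that module's share of L3 and uncore."""
    return {modules: area / modules for modules, area in estimates.items()}


original_per_module = per_module(ORIGINAL_MM2)
revised_per_module = per_module(REVISED_MM2)
# How much smaller the revised guess is relative to the original one.
shrink = {m: REVISED_MM2[m] / ORIGINAL_MM2[m] for m in ORIGINAL_MM2}
# 4-module Bulldozer guess relative to Istanbul.
vs_istanbul = REVISED_MM2[4] / ISTANBUL_MM2

for m in sorted(REVISED_MM2):
    print(f"{m}-module: {original_per_module[m]:.0f} -> {revised_per_module[m]:.0f} "
          f"mm^2 per module (x{shrink[m]:.3f})")
print(f"4-module guess vs Istanbul: {vs_istanbul:.2f}x the die area")
```

Both sets of guesses scale linearly with module count (75 and 65 mm² per module), so they implicitly assume the L3 and everything else grows in step with the modules.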
I just checked the slide and you are right about the shared L2 cache. For a shared cache, the data does not have to be duplicated for each "core", but because we have two cores and maybe two threads using this shared cache, it cannot be too small. So your assumption that it would stay the same size (512KB) or increase slightly (1MB) seems correct.
Do we have any data on how much L3 cache they are going to use?
I think L3$ will be shared by all the modules, so the 4-module/8-module parts may well have something like 4-6MB and the 2-module/4-core part 2-4MB.
I feel pretty confident that the argument in 2011 is not going to be about single core performance; it's going to be more about scheduler efficiency.
JFAMD said: My guess is that in 2011 the real discussion is going to be the price, the performance and the power consumption of competing processors. That is all people really care about.
Actually, if a BD module is the same size as an i3 then it'll be a good match: the i3 die doesn't include the memory controller, GPU, DMI, or PCI-E busses, and it's 2 cores; it's JUST the CPU and QPI bus. Would make for a good comparison point. So the single module Bulldozer would end up being the same size as the Core i3 CPU?
If that is true then I hope this CPU is much more powerful than I am thinking.
Do we have confirmation that 16 core Bulldozer is indeed 60-80% faster than 12 core Magny Cours? Is that an estimate from an ambiguous graph or is that roughly true?
