Any Transistor count est. for BullDozer 8 core yet?

nyker96

Diamond Member
Apr 19, 2005
5,630
2
81
Just wondering if any transistor count est. surfaced from the 8 core BD chip?

I did some calculation: 8-core BD = 1.5 x 6-core PII = the equivalent of 9 PII cores, which gives each BD core just 12.5% more efficiency, assuming they operate at the same frequency. AMD has in the past made the incredible claim that each additional core costs only about 15% more die space compared to a PII core while achieving something like 75% of its performance. Just wondering how many transistors the final BD design actually ends up with. It does seem to have a lot more L3 though, 16MB as rumored.
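Running that arithmetic as a quick sanity check (the 1.5x figure is the rumored claim, not a measured result, and this assumes equal clocks and perfect scaling):

```python
# 8 BD cores claimed to match 1.5x a 6-core PII, i.e. the throughput
# of 9 PII cores spread over 8 BD cores.
pii_cores = 6
bd_cores = 8
speedup = 1.5  # rumored throughput gain, not a benchmark

pii_core_equivalents = pii_cores * speedup           # 9.0
per_core_gain = pii_core_equivalents / bd_cores - 1  # 0.125
print(f"Per-core gain at equal clocks: {per_core_gain:.1%}")  # 12.5%
```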
 

IntelUser2000

Elite Member
Oct 14, 2003
8,686
3,787
136
My guess:

2 core, 1 module with 2MB L2: 213 million transistors(from ISSCC)

4 module: 852 million

+8MB L3: 1.32 billion(~60 million transistors/MB)
+I/O and memory controller: 1.37 billion
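A quick sketch of that arithmetic (every input here is a guess, not an official figure; the uncore count is simply whatever remains to reach the ~1.37 billion total):

```python
# Rough reconstruction of the estimate above.
module = 213_000_000               # 2 cores + 2MB L2, per ISSCC
four_modules = 4 * module          # 852 million
l3 = 8 * 60_000_000                # 8MB L3 at ~60M transistors/MB
cores_plus_l3 = four_modules + l3  # ~1.33 billion
uncore = 1_370_000_000 - cores_plus_l3  # implied I/O + memory controller
print(cores_plus_l3, uncore)       # 1332000000 38000000
```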
 

ElFenix

Elite Member
Super Moderator
Mar 20, 2000
102,402
8,574
126
My guess:

2 core, 1 module with 2MB L2: 213 million transistors(from ISSCC)

4 module: 852 million

+8MB L3: 1.32 billion(~60 million transistors/MB)
+I/O and memory controller: 1.37 billion

now that we're using big L3 caches, is it possible to move to eDRAM or 1T-SRAM and save some real estate? or does that involve too many other tradeoffs?
 

Ajay

Lifer
Jan 8, 2001
16,094
8,114
136
1T-SRAM would require licensing from MoSys plus some R&D work since MoSys' IP targets large ASICs like SOCs.
 

Kuzi

Senior member
Sep 16, 2007
572
0
0
Well GlobalFoundries licensed T-RAM technology in 2009, and I think T-RAM could be a viable alternative to conventional SRAM, as long as there is no performance loss from using it.

It's possible we might see T-RAM used in future AMD processors at the 22nm and smaller nodes, as the improvement in die size/power characteristics of such processors would be huge.
 

Kuzi

Senior member
Sep 16, 2007
572
0
0
Any estimates about the die size of BD?

As some have commented before, it seems AMD improved cache density on the 32nm process to the point where they basically match Intel. With that in mind, I did my own calculations and came up with these die size ranges for BD:

2-Module/4-Core (4MB L2/4MB L3 cache) = 150-160mm^2
4-Module/6-Core (6MB L2/8MB L3 cache) (dead/weak module disabled) = 300-320mm^2
4-Module/8-Core (8MB L2/8MB L3 cache) (same size as 6-core)= 300-320mm^2
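Those ranges scale linearly with module count; a minimal check of that assumption (it ignores uncore area, which would not double, so it is a simplification):

```python
# The 4-module figure is just double the 2-module estimate, since both
# core count and cache roughly double between the two configurations.
two_module = (150, 160)  # mm^2, estimated range for 2-module/4-core
four_module = tuple(2 * x for x in two_module)
print(four_module)  # (300, 320)
```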
 

Ajay

Lifer
Jan 8, 2001
16,094
8,114
136
JFAMD said 8-core BD will be smaller than PII. I think that's the only hard fact we have right now.
 

Ajay

Lifer
Jan 8, 2001
16,094
8,114
136
Well GlobalFoundries licensed T-RAM technology in 2009, and I think T-RAM could be a viable alternative to conventional SRAM, as long as there is no performance loss from using it.

That would make sense - I think 1T-SRAM is used mainly in SOCs.
 

nyker96

Diamond Member
Apr 19, 2005
5,630
2
81
From the initial numbers it doesn't look like BD has been significantly enhanced per core over the K10 architecture. But the fact that they can make 8 cores using a far lower transistor count compared to K10 is the real advancement.

I'd predict BD will have to spam a lot more cores to match the multithreaded performance of SB or Ivy. Single- or low-threaded apps aren't going to be comparable to SB/Ivy.
 

Martimus

Diamond Member
Apr 24, 2007
4,490
157
106
From the initial numbers it doesn't look like BD has been significantly enhanced per core over the K10 architecture. But the fact that they can make 8 cores using a far lower transistor count compared to K10 is the real advancement.

I'd predict BD will have to spam a lot more cores to match the multithreaded performance of SB or Ivy. Single- or low-threaded apps aren't going to be comparable to SB/Ivy.

I'm not sure what you mean by initial numbers, since I haven't seen a single benchmark from the Bulldozer architecture. I do know from the things that were released during Hot Chips about the architecture that the vast majority of the architecture has been improved from K10.

The front end has been completely overhauled, including the branch prediction which probably is the most improved part of this architecture (although it was a weakness for the STARS architecture, so how improved this is will have a big impact on the Bulldozer performance since the new architecture has deeper pipelines.) The Branch target buffer now uses a two level hierarchy, just like Intel does on Nehalem and Sandybridge. Plus, now a mispredicted branch will no longer corrupt the entire stack, which means that the penalties for a misprediction are far less than in the STARS architecture. (Nehalem also has this feature, so it brings Bulldozer to parity with Nehalem wrt branch mispredictions)

Decoding has improved, but not nearly as much as the fetching on the processor. Bulldozer can now decode up to four (4) instructions per cycle (vs. 3 for Istanbul). This brings Bulldozer to parity with Nehalem, which can also decode four (4) instructions per cycle. Bulldozer also brings branch fusion to AMD, which is a feature that Intel introduced with C2D. This allows for some instructions to be decoded together, saving clock cycles. Again, this seems to bring Bulldozer into parity with Nehalem (although this is more cloudy, as there are restrictions for both architectures, and since Intel has more experience with this feature they are likely to have a more robust version of branch fusion.)

Bulldozer can now retire up to 4 Macro-ops per cycle, up from 3 in the STARS architecture. It is difficult for me to compare the out-of-order engine between STARS and Bulldozer, as they seem so dissimilar. I can say that it seems a lot more changed than just being able to retire 50% more instructions per cycle. Mostly the difference seems to be moving from dedicated lanes using dedicated ALUs and AGUs, to a shared approach.

Another major change is in the memory subsystem. AMD went away from the two-level load-store queue (where different functions were performed in each level), and adopted a simple 40-entry load queue with a 24-entry store queue. This actually increases the memory operations by 33% over STARS, but still keeps it ~20% less than Nehalem. The new memory subsystem also has an out-of-order pipeline, with a predictor that determines which loads can pass stores. (STARS had a strictly in-order memory pipeline.) This brings Bulldozer to parity with Nehalem, as Intel has used this technique since C2D. Another change is that L1 cache is now duplicated in L2 cache (which Intel has been doing for as long as I remember), although L3 cache is still exclusive.

Bulldozer now implements true power gating, although unlike Intel, who gates each core, AMD power gates at the module level. This shouldn't really affect IPC, but might affect the max frequency, so it is a point to bring up when discussing changes to performance. The ability to completely shut off modules should allow higher turbo frequencies than we saw in Thuban, but we won't know what they are until we see some reviews.

Well, those are the main differences that I know of. Add to that the fact that this processor was actually designed for a 32nm process, versus a 130nm process for STARS, and you should see additional efficiencies. I expect a good IPC improvement along with a large clockspeed boost, although I can't say how much, and I am really looking more for parity with Nehalem-based processors than with Sandybridge-based processors.

References:
Butler, Mike. "Bulldozer" A new approach to multithreaded compute performance. Hot Chips XXII, August 2010.

http://www.realworldtech.com/page.cfm?ArticleID=RWT082610181333&p=1
 

drizek

Golden Member
Jul 7, 2005
1,410
0
71
Great post, Martimus.

I am wondering about the power gating and turbo. I've had so many problems with Cool'n'Quiet on my PII 720, particularly after overclocking. I just keep it turned off now, because with it on my computer would often stutter as it clocked up and down. Is this a common problem and, more importantly, is it present in Thubans? Since Bulldozer is more reliant on this capability, I want to know whether it is actually going to work properly and whether motherboards/BIOSes support it properly.
 

CTho9305

Elite Member
Jul 26, 2000
9,214
1
81
The new memory subsystem also has an out-of-order pipeline, with a predictor that determines which loads can pass stores. (STARS had a strictly in-order memory pipeline) This brings Bulldozer to parity with Nehalem, as Intel has used this technique since C2D.

This presentation says "Load can bypass other loads and non-conflicting stores", which means family 10h was not "strictly in-order", but rather less out-of-order than Nehalem. This Anandtech article seems to say that even K8 could do some (extremely limited) reordering of loads.
 

Martimus

Diamond Member
Apr 24, 2007
4,490
157
106
This presentation says "Load can bypass other loads and non-conflicting stores", which means family 10h was not "strictly in-order", but rather less out-of-order than Nehalem. This Anandtech article seems to say that even K8 could do some (extremely limited) reordering of loads.

It seems that I have been corrected. Thanks for the clarification. I thought that AMD didn't have any out-of-order memory structure until Istanbul, but even that was very limited. This new setup should be very similar to what Nehalem has.

For those that don't want to read through the entire presentation CTho posted, this is what it says about the K10's out of order memory pipeline:
• More out-of-order Ld/St capability.
• Loads can bypass other loads and non-conflicting stores.
• LS1 queue (12 entries) - can issue 2 operations per cycle (load or store tag check).
• LS2 queue (32 entries) - holds requests that miss in L1 cache.
 

nyker96

Diamond Member
Apr 19, 2005
5,630
2
81
It seems that I have been corrected. Thanks for the clarification. I thought that AMD didn't have any out-of-order memory structure until Istanbul, but even that was very limited. This new setup should be very similar to what Nehalem has.

For those that don't want to read through the entire presentation CTho posted, this is what it says about the K10's out of order memory pipeline:
• More out-of-order Ld/St capability.
• Loads can bypass other loads and non-conflicting stores.
• LS1 queue (12 entries) - can issue 2 operations per cycle (load or store tag check).
• LS2 queue (32 entries) - holds requests that miss in L1 cache.

Thanks for the post, Martimus. I'm in no way that knowledgeable about chip design, so I'm simply going by the AMD claim that an 8-core BD runs about 50% faster than the current PII X6, which gives a rough per-core efficiency estimate. Added to the claim that each additional core only requires slightly more real estate, it seems logical that AMD has slightly improved per-core efficiency over Thuban while achieving this with a lower transistor count. Of course, like you said, since there's no benchmark of an actual product, it's only speculation.
 

Kuzi

Senior member
Sep 16, 2007
572
0
0
JFAMD said 8-core BD will be smaller than PII. I think that's the only hard fact we have right now.

He meant the PII X6, right? The PII X4 is 258mm^2 in size, and the PII X6 is 346mm^2. An 8-core BD will probably fall in between, which is another reason to believe that AMD should easily be able to push the clock speeds of an 8-core BD over X6 clocks (3.4GHz+).
 

JFAMD

Senior member
May 16, 2009
565
0
0
Thanks for the post, Martimus. I'm in no way that knowledgeable about chip design, so I'm simply going by the AMD claim that an 8-core BD runs about 50% faster than the current PII X6, which gives a rough per-core efficiency estimate. Added to the claim that each additional core only requires slightly more real estate, it seems logical that AMD has slightly improved per-core efficiency over Thuban while achieving this with a lower transistor count. Of course, like you said, since there's no benchmark of an actual product, it's only speculation.

What I said was a 16-core Interlagos will have ~50% more throughput than a 12-core AMD Opteron 6100 processor (top bin to top bin).

I have not made any comparisons to client products because I am in the server division. Client workloads are different. Clients measure speed, servers measure throughput. That is like saying an SUV can pull a 5000lb boat so it can go 0-60 in 4 seconds. Two different metrics.

He meant the PII X6, right? The PII X4 is 258mm^2 in size, and the PII X6 is 346mm^2. An 8-core BD will probably fall in between, which is another reason to believe that AMD should easily be able to push the clock speeds of an 8-core BD over X6 clocks (3.4GHz+).

No, I said an 8-core Valencia die is smaller than a 6-core AMD Opteron 4100 die. The client dies *may* be the same size, but I did not make a statement about client die sizes, only server die sizes.
 

Soleron

Senior member
May 10, 2009
337
0
71
@nyker96 about per-core performance

Bear in mind that a 50% speedup from 12-core MC to 16-core BD doesn't tell you much about desktop performance at all, because server dials the clockspeeds down a lot compared to what the silicon is capable of (MC launched at 2.3GHz; BD is officially "capable" of 3.5GHz+, but I guess server won't get that high due to thermals).

Also, scaling from fewer to more cores is hardly perfect. An i7 980X usually performs about 30% faster than an i7 975, rather than the 50% the core count implies, so if BD actually does get 50% going from 12 to 16 cores, that's a lot more than the 12.5% per-core gain (12/16 x 1.5 = 1.125) a naive calculation would suggest.
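Spelling out that naive calculation (assumes perfect core scaling and equal clocks, which the i7 example shows is optimistic):

```python
# If 16 BD cores deliver 1.5x the throughput of 12 MC cores, the naive
# per-core ratio at equal clocks is only slightly above 1.
mc_cores, bd_cores = 12, 16
throughput_gain = 1.5  # AMD's claimed server top-bin figure
per_core = mc_cores / bd_cores * throughput_gain  # 1.125
print(f"naive per-core ratio: {per_core:.3f}")    # 1.125, i.e. +12.5%
```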

Without final clockspeeds or information about performance in desktop apps, I don't think it's possible to get a performance estimate for the product this forum would care about: the top desktop BD compared to SB.

With perfect turbo and the headroom of 8 cores applied to 1 core, they could technically run one thread on one core at 5GHz or something and get amazing performance.
 