I do see die size utilization as a performance issue. As you know, you cannot make a chip as large as you want. There are many limits: speed paths, TDP and finally costs.
As long as a BD module's area w/ L2 is about the same as an SB core's w/ L3 (and the core sizes are comparable too), this shouldn't be a performance issue. The Orochi die size might affect latencies between modules and L3 subcaches, IMC and HT PHYs, but so does a ring bus (where data moves from stop to stop until reaching its destination). And if I look at both dies side by side, the likely signal traveling distances between components on the Orochi die don't look that long.
There is no direct relation between die size and TDP (which is actually a specified value the chips are binned for). I could make a 1000 mm² die with one BD module at a given clock speed, NB, IMC and DDR pads. It wouldn't have higher power consumption than the same components placed on a <100 mm² die. Instead, the larger die might even provide better hot spot separation.
The observed higher power consumption of larger dies actually comes from more transistors being used: the transistors have leakage and switching power, not the blank silicon around them.
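To put a rough formula behind that (the constants below are placeholders I picked to land in a plausible range, not measured values), a minimal sketch:

```python
# Toy power model: chip power = switching + leakage of the transistors that are
# actually there, independent of how much blank silicon surrounds them.
def chip_power(n_transistors, c_per_tr, v_dd, f_hz, activity, leak_per_tr):
    dynamic = activity * n_transistors * c_per_tr * v_dd**2 * f_hz  # P = a*C*V^2*f
    leakage = n_transistors * leak_per_tr
    return dynamic + leakage

# The same transistor budget on a 100 mm^2 or a 1000 mm^2 die gives the same number;
# die area only enters via hot-spot density and cooling, not via this equation.
print(chip_power(1e9, 1e-15, 1.2, 3.6e9, 0.02, 20e-9), "W")  # ~124 W with these guesses
```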
AMD will not suffer from the initial Zambezi die size. But they already suffer when it comes to Interlagos, and they will also suffer in the desktop space by 2012 at the latest, when Sandy Bridge E parts become available.
Interlagos is not just larger in overall die size (with the benefit of improved binning opportunities due to being an MCM) and thus maybe $40 more expensive to make; it also offers more performance in a single socket (this is also about density).
Let me just explain what I mean so you understand where I am seeing the problem with an example:
Intel 4C/8T: ~150-160 mm²
AMD 4M/8C: ~280 mm²
Intel 6C/12T: ~220 mm²
Intel 8C/16T: ~290 mm²
(theor. AMD 8M/16C: 560 mm²)
All that still on 32 nm! This means that, at roughly the same die area, Intel's architecture is much faster than AMD's Bulldozer.
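To make that concrete, here is the same list reduced to die area per hardware thread (just the rough numbers from above, taking the midpoint of the Intel 4C range; this says nothing about actual per-thread performance):

```python
# Area per thread from the rough die sizes listed above (mm^2, ballpark figures).
parts = {
    "Intel 4C/8T":         (155, 8),
    "AMD 4M/8C":           (280, 8),
    "Intel 6C/12T":        (220, 12),
    "Intel 8C/16T":        (290, 16),
    "AMD 8M/16C (theor.)": (560, 16),
}
for name, (area_mm2, threads) in parts.items():
    print(f"{name:22s} {area_mm2 / threads:5.1f} mm^2 per thread")
```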
As others already said, we don't know anything about BD's performance yet. And 32 nm processes are not that easily comparable: gate-first HKMG vs. gate-last, bulk vs. SOI, and a lot of other differences (stress techniques, metal gate materials, geometries, etc.). Think of Ontario produced on TSMC's 40 nm process compared to the 45 nm Atom.
An 8C BD is a throughput machine, but a 4C variant (6C might be harvested from 8C dies) could be ~150 mm² as well and might clock ~30-40% higher with twice the TDP available per module (roughly a square-root relation) -> but yields would need to be better because it would be a volume product.
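What I mean by the square-root relation, as a tiny sketch (the f² power assumption is a rough simplification, not a measured law):

```python
# Once voltage has to track frequency, power grows roughly with f^2,
# so the achievable clock scales roughly with sqrt of the TDP share.
def clock_scale(tdp_ratio, power_exp=2.0):
    return tdp_ratio ** (1.0 / power_exp)

print(clock_scale(2.0))  # twice the TDP per module -> ~1.41x clock (~+40%)
print(clock_scale(0.5))  # half the TDP per module  -> ~0.71x clock (~-30%)
```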
I mean AMD's Bulldozer will surpass Intel's SB 4C in performance, but Intel can just double the core count and AMD is in the same position as now: significantly behind the competition. Okay, you could ask why I'm complaining when that's the same situation as today, but the difference is that AMD introduced a brand new architecture and did not change anything by that. That will affect average selling prices. The problem with performance per die size is not the die size, it is the performance.
You mean doubling the core count is that simple? With roughly half the TDP available per core, base clocks would have to go down (as you also said), and its performance profile would look different, only mitigated by its advanced Turbo mechanisms.
Performance per die size indirectly translates to a price/performance metric.
Now die size is not everything, because when you are TDP limited a small die does not help. But there is the same bad news there: Intel can do their SB 8C/16T within roughly the maximum TDP they have (still on 32 nm). AMD obviously has problems there as well, because for the high-end 4M/8C part they already specified a 125 W TDP. So AMD would also have to drop frequency significantly when adding more cores (the Interlagos problem).
Doubling the cores has usually meant dropping the clock frequency by about 30%. Thanks to the most recent Turbo boost modes, the highest boosted frequency might actually stay at the same level (or even increase, if the TDP increases too). But as long as we don't know whether it takes a 125 W BD to be faster than a 95 W SB or whether a 95 W BD is enough, we can only speculate here.
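Roughly what the boost modes buy in budget terms (the TDP split and the uncore share below are invented, purely for illustration):

```python
# Within a fixed socket TDP, each of 8 fully loaded cores gets only a small slice,
# but a 2-core boost state can hand almost half the budget to each active core
# and therefore hold (or exceed) the old peak clocks.
TDP_W = 125.0                                    # hypothetical socket budget
UNCORE_W = 25.0                                  # hypothetical NB/IMC/L3 share
per_core_all_loaded = (TDP_W - UNCORE_W) / 8     # ~12.5 W per core
per_core_two_boosted = (TDP_W - UNCORE_W) / 2    # ~50 W per boosted core
print(per_core_all_loaded, per_core_two_boosted)
```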
BTW, the 8C SB-EP/EX (~400 mm² according to the ISSCC 2011 die photo presentation) also needs to set aside some TDP for the additional memory channels and PCIe lanes. OTOH, its TDP can go up to 150 W. This might become a top performer, but surely not at price points <$500. And there is the problem again: are we looking for the top-performing CPU? x86 or non-x86? 2-, 3- or 4-channel memory? Desktop or server? Same price or max price?
As you see, even though they surpass Sandy Bridge 4C, this is just not enough to be competitive. And by competitive I mean that AMD is able to raise average selling prices. Again, 2H 2011 might go well for AMD because of Bulldozer, but in 2012 it will look the same for AMD as it does now, or (hopefully not) even worse.
IB is said to be 20% faster at the same TDP. This is in line with Intel's data about their Tri-Gate process. Die size and idle power (if Vcc is still provided, i.e. no power gating) will shrink as well, and so will costs.
AMD's options are:
- µarch improvements ("enhanced BD", some interesting patented options weren't implemented in BD1)
- natural process maturity and process improvements at GF (similar to CTI), something like low-k for Thuban or smaller steps like adding a new way of applying stress etc.
- adding a few cores (like 10 with Komodo -> 25% more) with a small drop (if any at all) in base frequency
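On the last option, a quick toy calculation (clocks invented, near-linear thread scaling assumed):

```python
# 25% more cores with a small base-clock drop still nets a gain on threaded loads.
def mt_throughput(cores, clock_ghz):
    return cores * clock_ghz        # assumes near-perfect multithreaded scaling

orochi = mt_throughput(8, 3.6)      # hypothetical 8-core baseline
komodo = mt_throughput(10, 3.4)     # +25% cores, slightly lower base clock
print(komodo / orochi)              # ~1.18x in this toy model
```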
BTW, is Intel improving their processes in the same way, or do they make them as good as possible (under cost considerations) from the start, so that any improvements are left for the next smaller process node?
So Bulldozer should fix these design issues:
a) CMT -> This is just not effective enough regarding die size.
b) High Frequency Design -> Obviously not effective regarding TDP (I mean AMD even has SOI and a higher TDP).
c) High uncore die consumption
There is no proof and not even a hint that a) or b) are true, while c) has an obvious effect on die size. Further, the die size is not directly related to the module size; there is a lot more to consider. Better compare module sizes with core sizes than just die sizes. Otherwise it is like judging the efficiency of an engine by measuring the car's length.
While we are at it:
High Frequency Design (it's not extreme, with a ~20-25% frequency increase) can be used in two ways: clock the logic faster at the same voltage, or keep the clock frequency while lowering the voltage (which influences the transistors' switching speed). According to Mike Butler, they increased frequency without increasing power consumption while keeping IPC on par with their previous architecture (10h/12h).
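As a toy model of those two options (my simplification; the assumption that voltage can drop in proportion to the headroom is optimistic, real Vmin curves are flatter):

```python
# Two ways to spend a high-frequency design's ~20% headroom, with P ~ f*V^2.
def dynamic_power(f, v):
    return f * v * v                        # normalized units

p_base = dynamic_power(1.0, 1.0)            # reference design
p_fast = dynamic_power(1.2, 1.0)            # option 1: +20% clock, same voltage
p_low  = dynamic_power(1.0, 1.0 / 1.2)      # option 2: same clock, lower voltage
print(p_fast / p_base, p_low / p_base)      # ~1.2x vs ~0.69x the reference power
```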
So my proposal for AMD for Bulldozer II:
to solve a):
BD brings a lot of fine things with it: a decoder capable of decoding 4 ops per cycle. This is an advantage over Intel. So just add another one to feed the other integer core. That takes ~6 mm² per core. A lot, but you gain a lot.
Now also widen the core to 4 ALUs + 2 AGLUs. That costs very little, ~2 mm² per core. The scheduler must of course be changed as well, but that is no problem: the scheduler step-back was related to b), and since we fix b) anyway, it goes away. The result could be a core that is equal to or faster than an SB core. On top of that, utilizing the amazing decoder, they can add SMT. So another SB feature included.
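Adding up the area cost of that, using my own rough per-core figures from above (guesses, not measurements):

```python
# ~6 mm^2 for the extra decoder plus ~2 mm^2 for the wider ALU/AGLU set, per core,
# on a ~280 mm^2 4M/8C die.
extra_per_core_mm2 = 6.0 + 2.0
cores = 8
die_mm2 = 280.0
added = extra_per_core_mm2 * cores
print(added, (die_mm2 + added) / die_mm2)   # ~64 mm^2 extra, ~1.23x the die size
```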
As others already wrote, you might see this stuff in a too simple way. You need to think of signal traveling times and such things. Adding port 5 to Intel's core also meant more complex scheduling, bypass logic, and read/write ports (ROB, RRF). Such changes increase cycle time and might require some trade-offs. Adding further execution units and issue ports to BD also needs more bypass logic, more register ports, more complex scheduler logic (complexity grows roughly quadratically), flag-processing logic, etc.
So while you "fix" b) you will create a bigger, slower-clocked core and need to rebalance everything. The need to exploit higher ILP with a fatter core can actually only be addressed by adding SMT to the integer cores (but BD's huge front end, L2 cache and FPU already process 2 threads). You avoid that by having a narrower core churning faster. Example: 4-wide ALUs. To make efficient use of them, there need to be many opportunities to find 3 or 4 independent (not waiting for other ops' results) ALU ops to execute. Then remember that x86 has 8 GPRs in 32-bit mode and 16 in 64-bit mode, one of which is not generally available because it serves as the stack pointer. There is not much room for ILP. With 2-wide ALUs you only have to find 2 independent instructions. If they run faster, they might deliver results to dependent ops after 60-70% of the time (a 4-wide core might need to run at 60-70% of the frequency at the same power consumption).
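To illustrate that trade-off (the 0.65x clock factor and the ILP value are picked for illustration only):

```python
# With limited ILP, a narrow but faster core finishes a dependent instruction
# stream sooner than a wide core forced to a lower clock at equal power.
def exec_time(n_ops, ilp, width, clock):
    cycles = n_ops / min(ilp, width)    # issue rate limited by ILP or by width
    return cycles / clock

ops = 100
print(exec_time(ops, 2, 2, 1.0))        # 2-wide core at full clock   -> 50.0
print(exec_time(ops, 2, 4, 0.65))       # 4-wide core at 0.65x clock  -> ~76.9
```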
BTW, the IRF is already replicated in BD so that it only needs 4R/2W ports and has short routing distances (dunno if they want to clock it at 4 GHz at 1.2 V or …). So for your idea there would need to be yet another IRF. Further, the whole integer core is only 2.37 mm² according to an AMD paper.
b) Remove the high-speed design, fix latencies (esp. INT-SSE), etc. That will fix the TDP issues.
And what will we do with b) if there are no TDP issues?
c) Optimize the uncore: that will result in an even smaller chip, even though we added more transistors with a).
This is a real option, even w/o a).