Cerb
Elite Member
> So Bulldozer should fix three design issues:
> a) CMT -> This is just not effective enough regarding die size.
> b) High Frequency Design -> Obviously not effective regarding TDP (I mean, AMD even has SOI and still a higher TDP).
> c) High uncore die consumption.

a) Only known by people under NDA, right now. Definitely not the slam dunk for high performance on the desktop, but that hasn't been BD's goal from the start. If they can sell cheaper chips without losing money on them, approximating Nehalem performance on the desktop will be good.

> Oh yes, I know. CMT forces 2 cores to share a single decoder.

No, AMD chose that for an implementation of CMT. One hallmark of CMT is sharing high-throughput resources that are not timing-sensitive, and leaving timing-sensitive resources separate.

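For reference, here is roughly how a BD module splits things up, going by AMD's public presentations so far; this is my paraphrase of that split, not an official spec:

```python
# Rough resource split of a Bulldozer module, paraphrased from AMD's
# public disclosures -- treat it as a sketch, not an official spec.
bd_module = {
    "shared": [            # high-throughput, less timing-critical
        "fetch + branch prediction",
        "4-wide x86 decode",
        "FPU (2x 128-bit FMAC)",
        "L1 instruction cache",
        "L2 cache",
    ],
    "per_core": [          # timing-critical, duplicated per integer core
        "integer scheduler",
        "2 ALUs + 2 AGUs",
        "integer register file",
        "L1 data cache",
    ],
}
```
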
> That is the issue. You invest a lot of silicon with extremely little gain. Just compare the die size of a Llano core (~9.6 mm²) with that of a BD module (~18.9 mm²). As two Llano cores are faster than a BD module, AMD ends up with worse performance per unit of die area than they get from their current chips.

Sorry, but no, only people under AMD NDAs know how a Llano core compares to a BD core in performance, how 2 Llano cores compare to a BD module, and whether there are situations where the decoder will be a bottleneck. AMD has been tight-lipped, and very careful, here.

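To put rough numbers on it: taking the die sizes quoted above at face value, the whole efficiency claim turns on a performance ratio nobody outside an NDA has. A quick back-of-the-envelope, with the relative-performance figure as a pure placeholder:

```python
# Back-of-the-envelope perf-per-area, using the die sizes quoted above.
# The relative-performance figures are placeholders -- exactly the number
# nobody outside an NDA actually has.
llano_core_mm2 = 9.6   # ~9.6 mm^2 per Llano core (quoted above)
bd_module_mm2 = 18.9   # ~18.9 mm^2 per BD module (quoted above)

def perf_per_mm2(relative_perf, area_mm2):
    """Performance per mm^2, with performance in arbitrary units."""
    return relative_perf / area_mm2

# Hypothetical: one BD module's throughput relative to two Llano cores.
for bd_vs_two_llano in (0.8, 1.0, 1.2):
    two_llano = perf_per_mm2(1.0, 2 * llano_core_mm2)
    module = perf_per_mm2(bd_vs_two_llano, bd_module_mm2)
    print(f"BD module at {bd_vs_two_llano:.1f}x two Llano cores: "
          f"{module / two_llano:.2f}x the perf/mm^2")
```

Note that even at equal performance the module is already slightly ahead on area, since 18.9 mm² is less than 2 x 9.6 mm²; the conclusion flips entirely on the unknown performance number.
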
b) Again, unknown. Higher TDP than if they had designed BD to top out at 3 GHz, instead of having 4 GHz turbos for release models? Sure. But there's no reason to believe they did not tune the design for GloFo's 32nm and have plenty of headroom. (Rough frequency/voltage scaling sketch below, after c.)

c) Probably. They do have a history. In the end, we'll just have to see, though. Maybe they've fixed that. We just plain do not know.

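On the frequency-vs-TDP point in b): the usual first-order relation is dynamic power ≈ α·C·V²·f, and chasing a higher clock target generally also means a higher voltage, so power grows much faster than the clock does. A toy illustration; the voltages are invented, not BD numbers:

```python
# First-order dynamic power: P ~ alpha * C * V^2 * f. Higher clock targets
# usually also need higher voltage, so power grows much faster than clock.
# The voltages below are invented purely for illustration, not BD numbers.
def relative_dynamic_power(freq_ghz, vdd, base_freq_ghz=3.0, base_vdd=1.0):
    return (freq_ghz / base_freq_ghz) * (vdd / base_vdd) ** 2

print(relative_dynamic_power(3.0, 1.00))  # 1.00x baseline
print(relative_dynamic_power(4.0, 1.15))  # ~1.76x the power for ~1.33x the clock
```

Whether GloFo's 32nm SOI leaves headroom for that at a sane TDP is, again, something only AMD knows right now.
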
> So my proposal for AMD for Bulldozer II:
> To solve a):
> BD brings a lot of fine things with it: a decoder capable of decoding 4 ops per cycle. This is an advantage over Intel. So just add another one to feed the other integer core. That takes ~6 mm²/core. A lot, but you gain a lot.

If it would only take 6 mm², would not increase power consumption much, and would not reduce clock speeds, I imagine AMD would have done it. I was surprised to see that they only do 4 decode slots for two cores, rather than 5-6 over two cores, but 4 more? Even if it weren't x86, you'd need VLIW or very low speeds to make that worth it. Neither is the best choice for general-purpose computing.

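Taking the proposal's own ~6 mm² figure at face value (and reading it as one extra decoder per module; the "/core" phrasing could mean an even bigger hit), the trade looks like this, using only the decode widths discussed in this thread:

```python
# What a second decoder buys vs. what it costs, using only numbers from
# this thread (reading "~6 mm^2/core" as one extra decoder per module;
# if it really meant 6 mm^2 for each core, the area hit is larger).
bd_module_mm2 = 18.9
extra_decoder_mm2 = 6.0

peak_decode_now = 4           # one shared 4-wide decoder per module
peak_decode_proposed = 2 * 4  # a private 4-wide decoder per integer core

area_growth = (bd_module_mm2 + extra_decoder_mm2) / bd_module_mm2 - 1
print(f"peak decode: {peak_decode_now} -> {peak_decode_proposed} ops/cycle")
print(f"module area: +{area_growth:.0%}")  # ~+32%
```

Whether real code ever actually starves on 4 decoded ops per cycle per module is the part only AMD can measure right now.
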
> Now also widen up to 4 ALUs + 2 AGLUs. That costs very little, ~2 mm²/core. The scheduler must also be changed, of course, but that is no issue: the scheduler step back was related to b), and as we fix b), that should be no problem. The result could be a core that is equal to or faster than an SB core. On top of that, utilizing the amazing decoder, they can add SMT. So another SB feature included.

Now that's just insane. 2:2, while very odd, gives them some of the 4-issue cake, I imagine to bottleneck the ALUs less than their other options would, while maximizing the effectiveness of those 2 ALUs.

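Adding up the wish list's own numbers per module (again, only figures from this thread, and how to read the "/core" qualifiers is my guess):

```python
# Rough tally of the whole "Bulldozer II" wish list per module, using only
# the figures quoted in this thread; how to read "/core" is my guess.
bd_module_mm2 = 18.9
extra_decoder_mm2 = 6.0    # second 4-wide decoder
wider_alus_mm2 = 2.0 * 2   # +2 ALUs per core at ~2 mm^2/core, 2 cores
llano_core_mm2 = 9.6

proposed_mm2 = bd_module_mm2 + extra_decoder_mm2 + wider_alus_mm2
print(f"{proposed_mm2:.1f} mm^2 per module "
      f"({proposed_mm2 / bd_module_mm2 - 1:.0%} bigger), "
      f"vs. {2 * llano_core_mm2:.1f} mm^2 for two Llano cores")
# -> 28.9 mm^2, ~53% bigger, before any scheduler or SMT changes.
```

That is a lot of area to spend chasing Sandy Bridge, which cuts against the cost-per-core argument the module exists to make.
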
> b) Remove the high-speed design, fix latencies (esp. INT-SSE), etc. That will fix the TDP issues.

You mean those perfectly reasonable TDPs that AMD has stated and hinted at? Some latencies are fairly high, but I doubt making a slower chip would allow them to rival Intel on servers, which is the goal. x86, regardless of desktop or server, tends to like higher speeds over higher IPC. Code that might get 2-3 average IPC on a slow RISC CPU plain can't, using x86. But try to get code that can't theoretically do better than 1 IPC to perform better on that slower RISC CPU, and... it won't, at least not by much. You end up better off with higher speeds and deeper pipelines (unless you go to Netburst-like extremes), even at the cost of CPI, and as the speeds get higher (relative to cache and main memory latencies), the real-world advantages of wider issue get smaller and smaller (as do the differences between ISAs). Since so much actual code is not high-IPC, going with speed makes a lot of sense. Given the poor scaling of all the resources involved in managing more instructions in flight at once, it is more efficient to have separate sets of them than to maximize their use via SMT.

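The iron law behind that trade-off, with made-up numbers just to show its shape (none of these are real BD or SB figures):

```python
# Iron law of CPU time: time = instructions * CPI / frequency.
# All figures below are invented just to show the shape of the trade-off,
# not real Bulldozer or Sandy Bridge numbers.
def runtime_s(instructions, cpi, freq_ghz):
    return instructions * cpi / (freq_ghz * 1e9)

work = 1e9  # one billion instructions

# Low-ILP code, stuck near 1 IPC no matter how wide the core is:
print(runtime_s(work, cpi=1.0, freq_ghz=4.0))  # the faster clock wins
print(runtime_s(work, cpi=1.0, freq_ghz=3.0))  # extra width buys nothing here

# High-ILP code, where a wider core really does lower CPI:
print(runtime_s(work, cpi=0.8, freq_ghz=4.0))
print(runtime_s(work, cpi=0.5, freq_ghz=3.0))  # here width pays off
```
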
> c) Optimize the uncore: that will result in an even smaller chip, even though we added more transistors with a).

How much would it cost to do, though? It could be a case where they might not be able to make the money back if it isn't a major success. Meanwhile, if it is a major success, they could make plenty of money selling more chips, and will have been better off by not delaying it any further.

> APUs alone will not do it for AMD as you do not earn enough to finance the business.

Red herring. It will be awhile before BD becomes an APU.

In the end, what you're basically saying is that you think CMT sucks the big one, and that they should have designed it more like Intel's CPUs. Meanwhile, cost- and power-efficient CMT is pretty much the whole point of BD, and if they have to stick with a design for a while, better that they design for the future now than design for the now and be stuck without enough R&D money, or time, to change pace (even Intel and AMD don't know what the x86 market will look like in 5-10 years, so designs that are forward-looking and flexible are a must). However, right now, we just don't know. BD is too different from any CPUs we are used to, at this time.