A couple of weeks ago, I gave best and worst case estimates for AMD's GPU releases. What I described as the "most pessimistic plausible scenario" is, essentially, what actually happened:
> How bad can it get? Pretty bad. If AMD is unable to tame the challenges of first-generation HBM, then Fiji could prove to be more of a tech demo than a viable commercial product. We could be looking at Bulldozer 2.0, and that's a blow AMD may not recover from. Fiji is a massive, go-for-broke chip; if despite its technical advancements it fails to recover the performance crown (falling short of Titan X by 5%-10%), then it will be a failure. AMD won't be able to sell it for more than $499, and that price won't be sufficient to recoup R&D costs. Meanwhile, let's suppose that the pessimistic rumors are true and all the cards except Fiji are straight rebrands. No port to GloFo process, just better binning, higher clocks, and more RAM. This leaves AMD's entire lineup in an uncompetitive position; without bringing anything new to the table, they'll continue to have to compete on price with the legion of ex-mining cards out there, not to mention the cheap R9 200 series clearance sales. AMD continues to hemorrhage market share for 18 months or more, waiting for the FinFET+ process to arrive... assuming they survive that long. The only real hope at that point is a buyout offer from Samsung.
The Bulldozer analogy may be harsh; while the construction equipment CPU cores were a complete failure, Fury X isn't too far behind its strongest competitor (the GTX 980 Ti) in either performance or perf/watt. And while Bulldozer proved to be a complete architectural dead end that did nothing but waste AMD's limited R&D budget, the HBM expertise that AMD gained with Fiji is likely to be helpful to them in designing the next generation of GPUs. But in one important respect, the analogy works. Bulldozer managed to barely keep up with Sandy Bridge in heavily multi-threaded benchmarks, but in single-threaded performance, it fell significantly behind. Likewise, Fury X manages to barely keep up with the GTX 980 Ti in 4K performance on most titles, but at 1080p, it falls significantly behind. In both cases, AMD optimized for "the future" at the expense of present-day performance, and naively expected everyone else to play along. Note how low-level APIs (formerly Mantle, now DirectX 12) have often been touted as the magic cure-all for AMD's performance shortcomings in both arenas.
There is a strong connection between performance per transistor, performance per shader, and performance per watt. And because every manufacturing process has a maximum feasible die size, the most efficient architecture will usually win the battle for highest raw gaming performance as well. This is what has been killing AMD ever since Maxwell came out: 2048 Maxwell shaders on a 398 sq. mm. die (GM204) can handily beat 2816 GCN 1.1 shaders on a much more densely packed 438 sq. mm. die (Hawaii). Likewise, the GM206 die is only a few square millimeters bigger than Pitcairn, has roughly the same transistor count, and matches the shader/TMU/ROP configuration (1024/64/32) of the *cut-down* Pitcairn. Yet even the full Pitcairn chip can't come close to GM206's performance.
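The back-of-envelope math here, using only the die sizes and shader counts quoted above (my own arithmetic, not figures from any review), makes the point starkly: Hawaii actually packs *more* shaders per square millimeter than GM204, yet loses anyway, so the deficit is in performance per shader, not in packing density.

```python
# Shader density from the figures quoted above (shader count, die area in sq. mm.).
chips = {
    "GM204 (Maxwell)":  (2048, 398),
    "Hawaii (GCN 1.1)": (2816, 438),
}

for name, (shaders, area) in chips.items():
    print(f"{name}: {shaders / area:.2f} shaders per sq. mm.")
# Hawaii comes out denser (~6.4 vs. ~5.1 shaders/sq. mm.), yet GM204 wins on performance.
```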
Why does this happen? In some cases, AMD is shorting its chips on ROPs; this is clearly the case with Fiji compared to GM200, and with Tonga compared to GM204. GCN also lags Maxwell badly in geometry throughput, yet AMD obstinately acts as if everyone should write games to favor its architecture, despite holding only about 20% of the market. It's absurd when AMD and its fans treat tessellation as some kind of dirty trick.
But as we saw with Pitcairn vs. GM206, even when AMD's chips have adequate ROPs, they still fall short of their Maxwell counterparts. Clock speeds are another part of the equation. At 1080p, the GTX 960 at stock clocks holds a 35.7% lead over the R7 265. But the GTX 960 has a 1127 MHz base clock and a 1178 MHz boost clock, while the R7 265 runs at only 900 MHz base and 925 MHz boost. Overclocked to 1140 MHz, the R7 265 gains 21.3% in performance, which doesn't fully close the gap but does show that the gap is smaller than it first appears. This raises the question: why do Nvidia's Maxwell cards overclock so much better than AMD's GCN offerings? One possibility is that GCN's pipeline is too short, limiting attainable clock rates.
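A quick sanity check on those overclocking numbers (my arithmetic, not the source review's): the 925 → 1140 MHz overclock is a ~23% clock increase that yields a 21.3% performance gain, which is nearly linear scaling, and dividing the stock lead by that gain shows roughly a 12% deficit remains even clock-for-clock.

```python
# How much of the GTX 960's stock lead survives once the R7 265 is overclocked.
stock_lead = 1.357            # GTX 960 vs. stock R7 265 at 1080p
oc_gain = 1.213               # measured gain from the 925 -> 1140 MHz overclock

clock_ratio = 1140 / 925
print(f"clock increase: {clock_ratio - 1:.1%}")          # -> 23.2%
print(f"remaining lead: {stock_lead / oc_gain - 1:.1%}") # -> 11.9%
```

So even with clocks normalized, Maxwell keeps a double-digit lead, which is why the clock-speed explanation is only part of the story.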
So what needs to happen next? For one thing, AMD needs 14nm FinFET *yesterday*: it has to get the drop on Nvidia, beating them to market on FinFETs, preferably by at least six months. It may not be feasible to ship FinFET GPUs by the end of the year (the Financial Analyst Day presentation schedules them for 2016), but if AMD can do it, they should. Even a small part using standard GDDR5 would give AMD a temporary jump on Nvidia and paper over its architectural shortcomings for the moment. Nvidia has won 28nm; AMD needs to move on to the next node immediately. But this will only buy some breathing room. Nvidia will eventually get to 16nm FinFET with HBM2, and at that point AMD needs to have a competitive architecture. They can't stick with lightly modified die-shrinks of GCN for too long, or they'll get bulldozed by the competition yet again.
If AMD doesn't have the R&D necessary to do this, they need to hurry up and sell the company to Samsung.