Is AMD mismanaging Fiji? It can't be ROP ratio.

crisium

Platinum Member
Aug 19, 2001
2,643
615
136
The more I look into this, the more I think Fiji is architecturally flawed.

Ok, Fiji and/or HBM is crap right now. Hitman DX12 shows that all that extra Texture Fillrate/FLOPs and memory bandwidth results in a negligible advantage. And DX11 hardly looks any better.

7870XT vs 7970 is very similar to 390X vs Fury. This is why it cannot be the Pixel Fillrate alone because these two comparisons have a very similar ROP ratio.

Given the very close ratios, you'd expect the 7970 (925MHz vanilla version) to have a lead over the 7870XT that is moderately larger than the Fury's (Air) lead over the 390X. Larger, but not drastically so - certainly not double or anywhere close to that.

7870XT vs 7970:
24 CUs @975MHz vs 32 CUs @925MHz = 7970 Advantage 26%
32 ROPs @975MHz vs 32 ROPs @925MHz = 7870XT Advantage 5.4%
192 GB/s vs 264 GB/s memory bandwidth = 7970 Advantage 37.5%

390X vs Fury
44 CUs @ 1050MHz vs 56 CUs @ 1000MHz = Fury Advantage 21.2%
64 ROPs @ 1050MHz vs 64 ROPs @ 1000MHz = 390X Advantage 5%
384GB/s vs 512GB/s memory bandwidth = Fury Advantage 33%
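
For anyone who wants to check the ratios, here is a quick sketch of the arithmetic. It assumes throughput scales as unit count times clock, and uses only the CU/ROP counts, clocks and bandwidth figures listed above; the helper names are made up for illustration.

```python
# Relative throughput comparison using the unit counts and clocks quoted above.
# "Advantage" means (larger / smaller - 1) in percent; it does not say which
# card leads (for ROPs the lower-tier card leads, because of its higher clock).

def advantage_pct(a: float, b: float) -> float:
    """Percent by which the larger of a and b exceeds the smaller."""
    hi, lo = max(a, b), min(a, b)
    return (hi / lo - 1.0) * 100.0

# name: (CUs, ROPs, core MHz, memory GB/s), figures as quoted in the post
cards = {
    "7870XT": (24, 32, 975, 192),
    "7970": (32, 32, 925, 264),
    "390X": (44, 64, 1050, 384),
    "Fury": (56, 64, 1000, 512),
}

def compare(lower: str, upper: str) -> dict:
    """Shader, ROP and bandwidth gaps between two cards, in percent."""
    cu_l, rop_l, mhz_l, bw_l = cards[lower]
    cu_u, rop_u, mhz_u, bw_u = cards[upper]
    return {
        "shader": advantage_pct(cu_l * mhz_l, cu_u * mhz_u),
        "rop": advantage_pct(rop_l * mhz_l, rop_u * mhz_u),
        "bandwidth": advantage_pct(bw_l, bw_u),
    }

for pair in (("7870XT", "7970"), ("390X", "Fury")):
    gaps = compare(*pair)
    print(pair, {name: round(value, 1) for name, value in gaps.items()})
```

The two pairs come out within a few points of each other on every axis, which is the whole point of the comparison.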

So the 7970 vanilla has more of a shader and bandwidth lead compared to the Fury. But it's not drastic, and thus the Fury should only have a little less of a lead over its subordinate card - moderately less.

But Fiji and/or HBM is crap. AMD is incapable of managing it properly. They don't know what they are doing. The struggle is real, marked, sad, and worthy of pity and shaming.

TPU 7870XT Launch: 7970 has a 22-25% adv at 1200p-1600p
TPU Fury Launch: Fury has a 8-11% advantage at 1080p-1440p
TPU latest GPU review: Fury has a 7-14% advantage at 1080p-1440p

Of course 7870XT and 7970 are both GCN 1.0; both Tahiti; both GDDR5. 390X and Fury are GCN 1.1 vs 1.2; Hawaii vs Fiji; GDDR5 vs HBM. AMD lacks the ability to properly harness their new architecture + memory. The ratio of ROPs and Shaders is very similar - if Fury is ROP limited, then the 7970 should be as well. Yet Fury has about half the advantage the 7970 has. It's not a ROP limit.

Now the 7970 has a VRAM advantage over its subordinate; the Fury has a disadvantage - could that be it? However, most games still use below 4GB and the Fury gets a larger lead over the 390X at 4K than at 1440p, so I do not believe VRAM capacity is the culprit. It's a failure to optimize. What else could it be? Any thoughts or corrections to my post are appreciated, because right now I see waste.
 

ThatBuzzkiller

Golden Member
Nov 14, 2014
1,120
260
136
I don't believe ROPs are the bottleneck but I'd wager on it being the front end along with the somewhat anemic geometry engines ...
 

flynnsk

Member
Sep 24, 2005
98
0
0
Look back through AMD's previous comments about Fiji, specifically how the 4GB "limit" of HBM was not as limiting as the framebuffer size alone suggests..

(hint: AMD, now RTG, specifically mentioned having to fine tune games/applications to make better use of HBM)..

http://www.anandtech.com/show/9390/the-amd-radeon-r9-fury-x-review/7

while AMD can’t do anything about the amount of VRAM they have, they can and are working on doing a better job of using it

The good news here is that the current situation leaves overhead that AMD can optimize around. AMD has been creating both generic and game-specific memory optimizations in order to better manage VRAM usage and what resources are held in local VRAM versus paging out to system memory. By controlling duplicate resources and clamping down on overzealous caching by games, it is possible to get more mileage out of the 4GB of VRAM AMD has.

In short, better memory management and hand tuning is most likely the short term answer... it's not like the front end was overhauled THAT much

Let's not forget the 390 (290 redux) is a very mature part with very mature drivers. Given that AMD/ATI has been the main driver of video memory standards, I myself am willing to give them the benefit of the doubt..
 

Head1985

Golden Member
Jul 8, 2014
1,867
699
136
How GCN works:
The basic block is the compute unit (CU)
cumpute-unit.jpg

Each CU has 64 SPs and 4 TMUs
Each CU is organised in 4 blocks

Next is the GCN shader engine (pipeline)
shader-engine.jpg

Each pipeline has:
Compute units (CUs)
1 geometry processor
1 rasterizer
Max 4x RBs per pipeline
Each RB has 4 ROP units
So if Fiji has only 4x pipelines, it must also have 64 ROPs.

The 290X has 2816 SPs and 4x pipelines. That's OK for 2816 SPs, but how can 4x pipelines still "feed" 4096 SPs on Fiji? They pretty much can't.
HawaiiArch_575px.png

Full Fiji diagram. It's bottlenecked by the front end (only 4x pipelines for too many CUs/shaders). Fiji can do 4 triangles/clock and 64 pixels/clock, the same as the 290X.
fiji-varianta-1.jpg


How Fiji should look to not be bottlenecked by 4x pipelines: it has 8x pipelines and 128 ROPs, because each RB has 4 ROPs. This Fiji could do 8 triangles/clock and 128 pixels/clock, twice as much as the 290X and the current Fiji. 8x pipelines would feed 4096 SPs with zero problems.
fiji-varianta-2.jpg
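
A rough sketch of the counting behind those diagrams, assuming the per-pipeline limits listed earlier in this post (1 geometry processor, 1 rasterizer, up to 4 RBs of 4 ROPs each) and one triangle per clock per geometry processor. The `front_end` helper and the 8-pipeline configuration are purely hypothetical.

```python
# Front-end totals as a function of shader engine ("pipeline") count.
# Per-engine limits follow the description in this post.

ROPS_PER_RB = 4          # each render back-end (RB) holds 4 ROPs
MAX_RBS_PER_ENGINE = 4   # at most 4 RBs per shader engine
SP_PER_CU = 64           # stream processors per compute unit

def front_end(engines: int, cus: int) -> dict:
    """Peak per-clock front-end rates and shader counts for a configuration."""
    rops = engines * MAX_RBS_PER_ENGINE * ROPS_PER_RB
    shaders = cus * SP_PER_CU
    return {
        "triangles_per_clock": engines,  # assumed: 1 triangle/clock per engine
        "pixels_per_clock": rops,        # assumed: 1 pixel/clock per ROP
        "shaders": shaders,
        "shaders_per_engine": shaders // engines,
    }

hawaii = front_end(engines=4, cus=44)   # 290X
fiji = front_end(engines=4, cus=64)     # Fiji as shipped
fiji_8 = front_end(engines=8, cus=64)   # hypothetical 8-pipeline Fiji

for name, cfg in (("290X", hawaii), ("Fiji", fiji), ("Fiji 8x", fiji_8)):
    print(name, cfg)
```

With 4 pipelines, each one has to feed 1024 shaders on Fiji versus 704 on the 290X; the hypothetical 8-pipeline layout drops that to 512.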


I think an 8x pipeline Fiji would be 10-15% faster than the Titan X at 1920x1080 and around 20-25% faster at 4K (maybe even more at 4K, because 4K is bottlenecked by the number of ROPs and 128 ROPs would just destroy the Titan X at 4K).
Why doesn't AMD do that? Well, Fiji is already 600mm2, and if they made it with 8x pipelines it would be much bigger and they probably couldn't make it on 28nm. So yeah, Fiji is a bad design from the start.
 

ThatBuzzkiller

Golden Member
Nov 14, 2014
1,120
260
136
How Fiji should look to not be bottlenecked by 4x pipelines: it has 8x pipelines and 128 ROPs, because each RB has 4 ROPs. This Fiji could do 8 triangles/clock and 128 pixels/clock, twice as much as the 290X and the current Fiji. 8x pipelines would feed 4096 SPs with zero problems.
fiji-varianta-2.jpg


I think an 8x pipeline Fiji would be 10-15% faster than the Titan X at 1920x1080 and around 20-25% faster at 4K.

Doubling Hawaii is impossible given a die space limit of 600 mm^2 considering Hawaii was already 438 mm^2. I think the sweet spot should have been 5 shader engines with 55 CUs ...

There are also other things to take into consideration, like the fact that it has a lot more changes than one thinks!

GCN 3 is NOT backwards compatible with GCN 1/2 (well, technically GCN 2 isn't backwards compatible with the original GCN architecture either, because they removed 4 instructions) since they have different microcode formats, so the compiler engineers definitely need to do some optimizing on their part despite the microarchitectures having similar characteristics ...
 

HurleyBird

Platinum Member
Apr 22, 2003
2,800
1,528
136
It should be a lot better in future DX12 games that are able to utilize the shaders better. On the other hand, it will also start to be bottlenecked by the 4GB framebuffer.

Fury X would be a good card if it weren't front end bottlenecked (to take the DX11 crown) or if it had a larger frame buffer (to take the future game crown), but the lack of either makes it not very enticing.
 

Head1985

Golden Member
Jul 8, 2014
1,867
699
136
Btw NV increased the number of pipelines going from the GTX 980 to the Titan X, so there is no bottleneck.
GTX 980: 4x pipelines
GeForce_GTX_980_Block_Diagram_FINAL_575px.png

Titan X: 6x pipelines
TITAN_X_Block_Diagram_FINAL_575px.png
 

Mahigan

Senior member
Aug 22, 2015
573
0
0
The L2 cache: The L2 cache only went up in size by 33%, from 768KB to 1MB, going from Tahiti to Hawaii. It only increased in bandwidth by 33% to 1TB/s. From Tahiti to Fiji we have doubled the number of CUs and ROPs, yet we still have the same L2 size and bandwidth as Hawaii. I'm expecting 2MB of L2 for Polaris.

The Memory Controller: While Hawaii gained a 512-bit bus, the memory controller shrank in size compared to the one in Tahiti. This limited the memory speeds it could handle relative to Tahiti. Fiji pretty much retains these components from Hawaii. This is why Fiji only really taps into 333-387GB/s of its 512GB/s (a 24-35% loss) and Hawaii only 263GB/s of its 320GB/s (an 18% loss). So if we take 384GB/s for Grenada and subtract 18% we end up with about 315GB/s, or a 6-23% advantage for Fiji over Grenada in terms of actual memory bandwidth.
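
The percentages here depend on whether the shortfall is expressed relative to the measured figure or relative to the peak, so a small sketch that prints both may help. It uses only the bandwidth numbers quoted above and assumes (as the post does) that Grenada reaches the same fraction of its peak as Hawaii; the function names are illustrative.

```python
# Effective vs. peak memory bandwidth, using the figures quoted in this post.

def gap_over_measured(peak_gbs: float, measured_gbs: float) -> float:
    """How far the peak sits above the measured figure, in percent."""
    return (peak_gbs / measured_gbs - 1.0) * 100.0

def loss_from_peak(peak_gbs: float, measured_gbs: float) -> float:
    """Fraction of the peak that is not delivered, in percent."""
    return (1.0 - measured_gbs / peak_gbs) * 100.0

FIJI_PEAK, FIJI_MEASURED = 512.0, (333.0, 387.0)
HAWAII_PEAK, HAWAII_MEASURED = 320.0, 263.0
GRENADA_PEAK = 384.0

# Assumption: Grenada achieves the same share of its peak as Hawaii does.
hawaii_efficiency = HAWAII_MEASURED / HAWAII_PEAK
grenada_effective = GRENADA_PEAK * hawaii_efficiency

# Fiji's lead over that Grenada estimate at the low and high measurements.
fiji_vs_grenada = [(m / grenada_effective - 1.0) * 100.0 for m in FIJI_MEASURED]

for measured in FIJI_MEASURED:
    print("Fiji", measured,
          round(gap_over_measured(FIJI_PEAK, measured), 1),
          round(loss_from_peak(FIJI_PEAK, measured), 1))
print("Hawaii",
      round(gap_over_measured(HAWAII_PEAK, HAWAII_MEASURED), 1),
      round(loss_from_peak(HAWAII_PEAK, HAWAII_MEASURED), 1))
print("Grenada effective GB/s:", round(grenada_effective, 1))
print("Fiji lead over Grenada %:", [round(x, 1) for x in fiji_vs_grenada])
```

Relative to the measured figure the gaps are about 32-54% (Fiji) and 22% (Hawaii); as a loss from peak they are about 24-35% and 18%.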

The Geometry Units: Though some tweaks were made from Hawaii to Fiji, they only account for a small bump in real world performance.

The Rasterizers: Same Gtris/s rate so same Polygon throughput.

So really, don't count on the theoreticals. I'm thinking that this is why Polaris comes with a new memory controller, new Geometry processors (primitive discard accelerators) as well as L2 Cache. Most probably more shader engines so more rasterizers.
 

Bacon1

Diamond Member
Feb 14, 2016
3,430
1,018
91
Remember that GCN 1.2+ also has better color compression, including delta compression, but I'm guessing that needs to be tweaked per game?

Since AMD is only meeting us half-way here we don’t know much more about what this does. Though the fact that they’re calling it delta compression implies that AMD has implemented a further layer of compression that works off of the changes (deltas) in frame buffers, on top of the discrete compression of the framebuffer. In this case this would not be unlike modern video compression codecs, which between keyframes will encode just the differences to reduce bandwidth requirements (though in AMD’s case in a lossless manner).

AMD’s own metrics call for a 40% gain in memory bandwidth efficiency, and if that is the average case it would more than make up for the loss of memory bandwidth from working on a narrower memory bus. We’ll see how this plays out over our individual games over the coming pages, but it’s worth noting that even our most memory bandwidth-sensitive games hold up well compared to the R9 280, never losing anywhere near the amount of performance that such a memory bandwidth reduction would imply (if they lose performance at all).

http://www.anandtech.com/show/8460/amd-radeon-r9-285-review/3
 

Mahigan

Senior member
Aug 22, 2015
573
0
0
Remember that GCN 1.2+ also has better color compression, including delta compression, but I'm guessing that needs to be tweaked per game?



http://www.anandtech.com/show/8460/amd-radeon-r9-285-review/3
That's to save on memory bandwidth. So clearly, AMD is aware of their memory bandwidth issues and that explains the switch to HBM for Fiji and then the Memory Controller and L2 Cache changes for Polaris. AMD is including newer compression techniques for Polaris as well.

Bandwidth is likely the main bottleneck.

See..
71bb06f6ee3185fc7c686622d6d4bb30.jpg


Oh and GCN's ROPs are individually more powerful than those of Kepler and Maxwell. So it's not likely that 64 ROPs affected AMD negatively compared to NVIDIA's ROP count.

Fyi:
The GTX 980 has 64 ROPs
GTX 980 Ti/TitanX have 96 ROPs

Fiji has 64 ROPs and competes with the 96 ROPs from GM200.
 

Tuna-Fish

Golden Member
Mar 4, 2011
1,650
2,481
136
Why doesn't AMD do that? Well, Fiji is already 600mm2, and if they made it with 8x pipelines it would be much bigger and they probably couldn't make it on 28nm. So yeah, Fiji is a bad design from the start.

It's not just about size. Per comments made by an AMD rep online, the original GCN design simply maxed out at 4 pipelines; to have more you have to redo the entire front end. (I'm looking for the quote of this right now.) They didn't do this for Fiji. Hopefully, this will be fixed for Polaris.

I7oeOzP.jpg
 

Mahigan

Senior member
Aug 22, 2015
573
0
0
4 shader engines not pipelines.

And the fact that AMD didn't change the rasterizer or the render back end in the Polaris die shot hints at a move to 6 shader engines (96 ROPs), perhaps for Vega, on the horizon. That would explain the focus on a new L2 cache and memory controller to accommodate more ROPs, CUs and TMUs.

HBM2 Vega 11 is going to be a monster GPU. It would take 144 Maxwell ROPs (with color compression schemes) to match 96 GCN 1.2 ROPs. Imagine 96 Vega 11 ROPs with their own color compression schemes? Wowzers! Hello 4K gaming at 60FPS+ if that's what Vega 11 brings to the table.
 

Dygaza

Member
Oct 16, 2015
176
34
101
That's to save on memory bandwidth. So clearly, AMD is aware of their memory bandwidth issues and that explains the switch to HBM for Fiji and then the Memory Controller and L2 Cache changes for Polaris. AMD is including newer compression techniques for Polaris as well.

Bandwidth is likely the main bottleneck.

See..

Explains why Fiji generally scales quite well when you overclock HBM. Shame it can only be overclocked in steps (500/545.45/600/666MHz). I can't get mine to run 600 :(
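
To put those steps into bandwidth terms, here is a minimal sketch. It assumes Fiji's publicly listed 4096-bit HBM interface (four 1024-bit stacks) with two transfers per clock, which is what gives 512GB/s at the stock 500MHz; the function name is just for illustration.

```python
# Theoretical HBM bandwidth at each overclock step mentioned above.

BUS_WIDTH_BITS = 4096     # assumed: four HBM stacks x 1024 bits each
TRANSFERS_PER_CLOCK = 2   # double data rate

def hbm_bandwidth_gbs(clock_mhz: float) -> float:
    """Peak bandwidth in GB/s for a given HBM clock in MHz."""
    bits_per_second = BUS_WIDTH_BITS * TRANSFERS_PER_CLOCK * clock_mhz * 1e6
    return bits_per_second / 8 / 1e9

for step_mhz in (500, 545.45, 600, 666):
    print(f"{step_mhz} MHz -> {hbm_bandwidth_gbs(step_mhz):.1f} GB/s")
```

So the 600MHz step would be worth 20% more theoretical bandwidth than stock, if the card can hold it.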
 

Flapdrol1337

Golden Member
May 21, 2014
1,677
93
91
That's to save on memory bandwidth. So clearly, AMD is aware of their memory bandwidth issues and that explains the switch to HBM for Fiji and then the Memory Controller and L2 Cache changes for Polaris. AMD is including newer compression techniques for Polaris as well.

Bandwidth is likely the main bottleneck.

See..
71bb06f6ee3185fc7c686622d6d4bb30.jpg


Oh and GCN's ROPs are individually more powerful than those of Kepler and Maxwell. So it's not likely that 64 ROPs affected AMD negatively compared to NVIDIA's ROP count.

Fyi:
The GTX 980 has 64 ROPs
GTX 980 Ti/TitanX have 96 ROPs

Fiji has 64 ROPs and competes with the 96 ROPs from GM200.

How do you measure the capacity of rops though? Maybe if you overclock the memory and massively downclock the card or something?
 

ShintaiDK

Lifer
Apr 22, 2012
20,378
145
106
When the 2 engineers get relocated it will go completely wrong. And Fury will make 780 owners look lucky ;)

If you want good stable performance, get GCN 1.1 and avoid GCN 1.2. There are also only 3 parts based on GCN 1.2 and one of those is an APU.
 

Mahigan

Senior member
Aug 22, 2015
573
0
0
How do you measure the capacity of rops though? Maybe if you overclock the memory and massively downclock the card or something?
Well, theoretically Maxwell has a significantly larger pixel throughput than GCN 1.1 or 1.2:
974d29132a536b97367050bab3baa31f.jpg


Maxwell can output around:

GTX 980: 75 GPixels/s
GTX 980 Ti: 85 GPixels/s
TitanX: 95 GPixels/s
See here:
235afe3140e35971a9b03930eafcdb57.jpg


That puts Maxwell GM20x at around 88, 89 and 92% efficiency, respectively, in terms of theoretical vs real world ROP throughput.

Fiji, on the other hand, can theoretically output 67 GPixels/s but manages 64 GPixels/s in the real world. That's a ROP efficiency of 96%.

So if we clocked all the parts at 1GHz we'd get these theoretical throughput results:
FuryX: 64 GPixels/s
GTX 980: 64 GPixels/s
GTX 980 Ti: 96 GPixels/s
TitanX: 96 GPixels/s

Real world results would look like this:
FuryX: 61.4 GPixels/s
GTX 980: 56.3 GPixels/s
GTX 980 Ti: 85.4 GPixels/s
TitanX: 88.3 GPixels/s
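
The normalisation above can be sketched like this, assuming one pixel per ROP per clock for the theoretical rate and reusing the efficiency figures quoted in this post (they are the post's numbers, not something measured here):

```python
# Clock-normalised pixel throughput: theoretical = ROPs x clock (GHz),
# "real world" = theoretical x the efficiency quoted in this post.

ROPS = {"FuryX": 64, "GTX 980": 64, "GTX 980 Ti": 96, "TitanX": 96}
EFFICIENCY = {"FuryX": 0.96, "GTX 980": 0.88, "GTX 980 Ti": 0.89, "TitanX": 0.92}

def theoretical_gpixels(card: str, clock_ghz: float = 1.0) -> float:
    """Peak fill rate in GPixels/s, assuming 1 pixel per ROP per clock."""
    return ROPS[card] * clock_ghz

def real_world_gpixels(card: str, clock_ghz: float = 1.0) -> float:
    """Peak fill rate scaled by the quoted efficiency."""
    return theoretical_gpixels(card, clock_ghz) * EFFICIENCY[card]

for card in ROPS:
    print(card, theoretical_gpixels(card), round(real_world_gpixels(card), 1))
```

Passing a different `clock_ghz` shows how the same efficiencies play out at actual clocks rather than the 1GHz baseline.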

Which appears to give Maxwell a distinct advantage, except when we look at Maxwell's memory bandwidth efficiency. Maxwell is even more limited by bandwidth than Hawaii or Fiji are:
d2543366c382ec79aea022d2cddadfaa.jpg

Maxwell's saving grace is its color compression algorithms. But random texture throughput is telling.

This explains why Maxwell can't pull ahead of Fiji by much at 4K despite its theoretical ROP throughput advantage. Maxwell is architecturally starving for bandwidth. This could be due to one, or a combination, of two things:

1. A memory controller that is less efficient than that of Fiji and Hawaii.

2. Less caching redundancy in the architecture.


Maxwell's inability to truly pull away from Fiji at 4K is not because Fiji uses HBM and has more bandwidth, because even Hawaii is more efficient than Maxwell memory bandwidth wise.

What this pretty much means is that Maxwell's extra ROPs are useless. Maxwell can't make use of them because it is starved for bandwidth. All 96 ROPs in GM200 share the same 3MB L2 cache pool, while the 64 ROPs in GM204 share the same 2MB L2 cache pool. We see the same cache sharing throughout Maxwell. Within the SMMs, the texture units and compute units share a 24KB L1 cache per two 32 CUDA core partitions and the whole SMM shares a 64KB cache.

Not only does GCN have an L2 cache, a Global Data Share and a Local Data Share, but each group of 4 ROPs has a high speed color cache and each Z/stencil unit has its own depth cache. Each CU (64 SPs) has a 64KB Local Data cache and each grouping of 4 texture units has its own 16KB L1 cache per CU. And each group of 4 CUs has a 48KB L1 cache. Pretty much every "unit" within GCN has its own dedicated cache.

This explains why Fiji can keep up with Maxwell at 4K. Fiji is maxing out its 64 ROPs' throughput whereas Maxwell is running out of bandwidth on two levels, cache and framebuffer. 96 ROPs means absolutely nothing for Titan X and the GTX 980 Ti.

With Polaris adding a refined memory controller, a new L2 cache and potentially more ROPs, what we have is a recipe for 4K gaming being possible.

We should know more about Pascal soon. Pascal will probably be increasing its cache sizes and improving its memory controller over Maxwell. HBM2 alone won't do it.

Just a thought.
 

el etro

Golden Member
Jul 21, 2013
1,584
14
81
How GCN works:
The basic block is the compute unit (CU)
cumpute-unit.jpg

Each CU has 64 SPs and 4 TMUs
Each CU is organised in 4 blocks

Next is the GCN shader engine (pipeline)
shader-engine.jpg

Each pipeline has:
Compute units (CUs)
1 geometry processor
1 rasterizer
Max 4x RBs per pipeline
Each RB has 4 ROP units
So if Fiji has only 4x pipelines, it must also have 64 ROPs.

The 290X has 2816 SPs and 4x pipelines. That's OK for 2816 SPs, but how can 4x pipelines still "feed" 4096 SPs on Fiji? They pretty much can't.
HawaiiArch_575px.png

Full Fiji diagram. It's bottlenecked by the front end (only 4x pipelines for too many CUs/shaders). Fiji can do 4 triangles/clock and 64 pixels/clock, the same as the 290X.
fiji-varianta-1.jpg


How Fiji should look to not be bottlenecked by 4x pipelines: it has 8x pipelines and 128 ROPs, because each RB has 4 ROPs. This Fiji could do 8 triangles/clock and 128 pixels/clock, twice as much as the 290X and the current Fiji. 8x pipelines would feed 4096 SPs with zero problems.
fiji-varianta-2.jpg


I think an 8x pipeline Fiji would be 10-15% faster than the Titan X at 1920x1080 and around 20-25% faster at 4K (maybe even more at 4K, because 4K is bottlenecked by the number of ROPs and 128 ROPs would just destroy the Titan X at 4K).
Why doesn't AMD do that? Well, Fiji is already 600mm2, and if they made it with 8x pipelines it would be much bigger and they probably couldn't make it on 28nm. So yeah, Fiji is a bad design from the start.

David Kanter says he thinks this too. Fiji is bottlenecked by the front end, but as AMD ran out of die area they had to bake the chip this way.
 

el etro

Golden Member
Jul 21, 2013
1,584
14
81
When the 2 engineers get relocated it will go completely wrong. And Fury will make 780 owners look lucky ;)

If you want good stable performance, get GCN 1.1 and avoid GCN 1.2. There are also only 3 parts based on GCN 1.2 and one of those is an APU.

AMD should not do 3 steps for GCN1. Just two at the max. Although GCN 1.2 brought important things and nice new resources.
 

gamervivek

Senior member
Jan 17, 2011
490
53
91
Not doubling the front end is not new for AMD. The 5870, for instance, didn't double it and would end up close to the 5770 where the front end was the bottleneck (HAWX iirc).

How do you measure the capacity of rops though? Maybe if you overclock the memory and massively downclock the card or something?

I'm not sure about what you're suggesting, but there are some tests that go for the pixel fill rate in practical conditions.

75487.png


Looking at the 285's pixel fill-rate, AMD could have done Fiji with 512-bit GDDR5 as well for the same results.
 

crisium

Platinum Member
Aug 19, 2001
2,643
615
136
Very interesting about the pipelines. So the Fury gets very little advantage from all those extra Compute Units, and the memory bandwidth is where much of the (relatively minor) gains are from? You'd expect Fury Air to be ~16-20% faster than 390X at 1080p/1440p just going off of 7870XT->7970, but you're telling me the front end will not allow this so that's why we are seeing 7-14%.
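
One way to land in roughly that range (not necessarily how the estimate was originally made) is to assume the observed lead scales linearly with the shader throughput lead, then carry the 7870XT->7970 result over to 390X->Fury. The sketch below uses only the CU counts, clocks and TPU percentages from the first post; the linear scaling is the assumption.

```python
# Project an "expected" Fury lead from the 7870XT -> 7970 result,
# assuming performance lead scales linearly with shader throughput lead.

def shader_lead_pct(cus_low: int, mhz_low: int, cus_high: int, mhz_high: int) -> float:
    """Percent shader throughput lead of the higher card (CUs x clock)."""
    return (cus_high * mhz_high / (cus_low * mhz_low) - 1.0) * 100.0

tahiti_lead = shader_lead_pct(24, 975, 32, 925)   # 7870XT -> 7970
fiji_lead = shader_lead_pct(44, 1050, 56, 1000)   # 390X -> Fury

observed_7970 = (22.0, 25.0)   # TPU: 7970 over 7870XT at 1200p-1600p
observed_fury = (7.0, 14.0)    # TPU latest: Fury over 390X at 1080p-1440p

# Scale the 7970's observed lead by the ratio of the two shader leads.
scale = fiji_lead / tahiti_lead
expected_fury = tuple(lead * scale for lead in observed_7970)

print("shader lead %:", round(tahiti_lead, 1), round(fiji_lead, 1))
print("expected Fury lead %:", [round(x, 1) for x in expected_fury])
print("observed Fury lead %:", observed_fury)
```

Under that assumption the projection comes out a little under 18% to about 20%, against the 7-14% actually measured.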
 

Mahigan

Senior member
Aug 22, 2015
573
0
0
The memory bandwidth and color compression pretty much allowed for the last bit of pixel throughput to be exhausted. So Fiji is pretty much a maxed out Tonga (GCN 1.2).

Fiji can still make gains relative to other GPUs, but more so through the render:compute ratio that games are headed towards. As games utilize more and more compute resources, all other cards will begin to be compute bottlenecked much quicker than Fiji will.

So yeah, Fiji will age well ironically.
 

Qwertilot

Golden Member
Nov 28, 2013
1,604
257
126
Or it might die entirely if all this optimisation gets dropped :(

Have to disagree about GCN 1.2 being a mistake - they had to improve GCN. The current line up is a logical mess in all sorts of ways.

Purely objectively, I think the oddest thing is actually having Fury and the 390(x) in the line up at once. Huge overlap.

Presume that the initial very low availability of HBM basically forced that though.
 

crisium

Platinum Member
Aug 19, 2001
2,643
615
136
I think it was necessary to get products that consistently edge out the 980 and crawl at the heels of the 980 Ti. The Fury and Fury X respectively do this, and the 390X just isn't there on average, especially at 1080p. As bad as they are at getting as close to their theoretical performance as older GCN does, they are faster cards. And this would really only be possible with HBM bringing down power consumption. The brand perception of seeing 3 tiers of Nvidia all consistently beat the top AMD chip would be bad. But the Fiji cards really do seem like very gradual steps in performance over Hawaii, which is disappointing on its own.