4. A scientific paper confirmed my suspicions about Maxwell's lack of parallelism. Theoretically, each Maxwell SM can host 64 concurrent Warps, with each Warp made up of 32 threads, for a total of 2,048 threads per SM. Sadly, this is not the case in practice: Maxwell loses performance once we move beyond 16 concurrent Warps per SM. This puts the maximum threads per SM, before performance drops off, at 512.
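The occupancy arithmetic above can be sketched in a few lines; all figures come straight from the text:

```python
# Maxwell per-SM occupancy math (figures from the text above).
WARP_SIZE = 32               # threads per warp on Maxwell
MAX_WARPS_PER_SM = 64        # theoretical concurrent warps per SM
EFFECTIVE_WARPS_PER_SM = 16  # point where performance drops off in practice

theoretical_threads = MAX_WARPS_PER_SM * WARP_SIZE      # 2,048 threads per SM
effective_threads = EFFECTIVE_WARPS_PER_SM * WARP_SIZE  # 512 threads per SM

print(theoretical_threads, effective_threads)  # 2048 512
```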
Maxwell is thus not a good candidate for Asynchronous compute + graphics, even if its static scheduler can emulate the process. On top of that, Maxwell's static scheduler hits the CPU hard when attempting to emulate Asynchronous compute + graphics, as revealed by Dan Baker of Oxide:
5&6. In order to understand what I mean by the ROp-to-cache or ROp-to-memory-controller ratio, we need to look at a schematic of GM107.
GM20x differs from GM107 in that NVIDIA increased the ratio of ROps per 64-bit memory controller from 8:1 to 16:1. So let's look at both GM204 and GM200.
GM204
- 64 ROps divided by 16 = 4 groupings.
- 2MB of L2 cache divided by 4 = 512KB.
- 256-bit bus divided by 4 = 64-bit.
- Each grouping of 16 ROps has 512KB of L2 cache and a 64-bit memory controller at its disposal (aside from the color cache).
GM200
- 96 ROps divided by 16 = 6 groupings.
- 3MB of L2 cache divided by 6 = 512KB.
- 384-bit bus divided by 6 = 64-bit.
- Each grouping of 16 ROps has 512KB of L2 cache and a 64-bit memory controller at its disposal (aside from the color cache).
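The GM204 and GM200 breakdowns above are the same division applied to different chip-wide totals, which a small sketch makes explicit (all figures from the lists above):

```python
# Per-partition resources implied by the GM204/GM200 figures above.
def rop_partition(rops, l2_kb, bus_bits, rops_per_partition=16):
    """Divide chip-wide totals across groupings of 16 ROps."""
    partitions = rops // rops_per_partition
    return {
        "partitions": partitions,
        "l2_kb_per_partition": l2_kb // partitions,
        "bus_bits_per_partition": bus_bits // partitions,
    }

print(rop_partition(64, 2048, 256))  # GM204 -> 4 partitions, 512KB L2, 64-bit each
print(rop_partition(96, 3072, 384))  # GM200 -> 6 partitions, 512KB L2, 64-bit each
```

Note that both chips land on the same 512KB-of-L2 and 64-bit-controller allotment per 16 ROps, which is the point of the comparison.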
The result is that there isn't enough bandwidth to feed these ROps: they consistently fall about 10GPixel/s short of their theoretical throughput, and that is without any other work straining the memory controllers or L2 cache, as seen here:
NVIDIA, knowing this was a limitation, thus invested heavily in color compression algorithms in order to reach parity, or near parity, with Fiji and its 64 ROps, as seen here:
This issue is further compounded by the inefficient memory controllers used by GM20x. NVIDIA had to sacrifice efficiency in order to keep die size down and power usage low as seen here:
Fiji and Hawaii use the following ROp setup:
Fiji differs from Hawaii in two ways: its 4096-bit bus, divided by 8, gives each group of 8 ROps a 512-bit memory controller (GP100 will have 16 ROps per 512-bit memory controller), and each group of 8 ROps gets 256KB of L2 cache versus 128KB per 8 ROps on Hawaii.
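The same per-group arithmetic used for GM20x can be applied to the GCN figures. Note the chip-wide totals here (64 ROps on both chips, 2MB of L2 on Fiji, 1MB on Hawaii) are my assumed figures, back-derived from the per-group ratios stated above rather than taken directly from the text:

```python
# Per-Render-Back-End resources for GCN, 8 ROps per group.
def rbe_resources(rops, bus_bits, l2_kb, rops_per_rbe=8):
    groups = rops // rops_per_rbe
    return bus_bits // groups, l2_kb // groups

# Fiji: 64 ROps, 4096-bit HBM bus, 2MB L2 (totals assumed, see lead-in)
print(rbe_resources(64, 4096, 2048))  # (512, 256): 512-bit and 256KB per 8 ROps
# Hawaii: 64 ROps, 512-bit GDDR5 bus, 1MB L2 (totals assumed, see lead-in)
print(rbe_resources(64, 512, 1024))   # (64, 128): 64-bit and 128KB per 8 ROps
```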
That blue bar above all the Render Back Ends links them together (pipelined ROps) for resource sharing. Each Render Back End has its own depth (Z)/stencil cache and color cache and, on top of that, access to a Global Data Share cache.
GCN Hawaii/Fiji ROps are thus not bandwidth-starved the way Maxwell's are. This is why Fiji can compete with GM200 at 4K resolution.
7. And finally, P100. P100 has moved to 64 SIMD FP32 cores per SM, versus 128 in GM20x and 192 in Kepler, just like GCN's 64 SIMD cores per CU. P100 has also moved to 4 Texture Mapping Units per SM, again like a GCN CU. With less logic per SM but the same amount of local cache per SM as GM20x, P100 will be less likely to suffer cache spills into L2 when running concurrent Warps. This means that, like GCN, P100 is a highly parallel architecture. To top it off, P100 will use Warps with a maximum thread size of 64, just like a GCN Wavefront. So NVIDIA has moved, as I suspected it would, towards a more GCN-like architecture.
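A rough sketch of what halving the cores per SM does to per-core resources. The core counts come from the text; the 256KB register file per SM and the shared-memory sizes (96KB on a GM20x SM, 64KB on a GP100 SM) are my assumed figures from NVIDIA's published specs, not from the text itself:

```python
# Per-core register-file and shared-memory allotment (KB per FP32 core).
# Register/shared totals are assumed figures, not from the text above.
def per_core(cores, reg_kb, shared_kb):
    return reg_kb / cores, shared_kb / cores

print(per_core(128, 256, 96))  # GM20x SM: 2.0 KB registers, 0.75 KB shared per core
print(per_core(64, 256, 64))   # GP100 SM: 4.0 KB registers, 1.0 KB shared per core
```

With fewer cores contending for a similar pool of local storage, each core on GP100 gets roughly twice the registers, which is consistent with the reduced pressure on L2 described above.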
This quote, from NVIDIA, confirms what I had been saying about GM20x:
Overall shared memory across the GP100 GPU is also increased due to the increased SM count, and aggregate shared memory bandwidth is effectively more than doubled. A higher ratio of shared memory, registers, and warps per SM in GP100 allows the SM to more efficiently execute code. There are more warps for the instruction scheduler to choose from, more loads to initiate, and more per-thread bandwidth to shared memory (per thread).
Source:
https://devblogs.nvidia.com/parallelforall/inside-pascal/
Conclusion:
We're looking at GM20x remaining the better DX11 GPU of this generation, not due to hardware superiority but rather to RTG's software inferiority.
As DX12 titles release, GM20x will begin to lose steam the same way Kepler did and will be surpassed by its GCN competitors.
NVIDIA's new architecture, Pascal, will likely turn out to be quite successful and, by virtue of being more GCN-like, is not likely to suffer the shorter life spans of its Kepler and Maxwell ancestors.
Performance/watt? Knowing all of this, it would seem to me that Polaris/Vega and Pascal will be quite similar in performance/watt due to their overall similarities.
Polaris/Vega will bring Asynchronous compute + graphics into the mix, which will offset the higher power usage of their extra logic (ACEs, caches, and hardware scheduling). Every aspect of Polaris/Vega will be tuned for higher efficiency; thus, while they won't likely arrive with more SIMD cores than their predecessors, they will instead raise the IPC of each one.
The end result? Similar TDP and performance from RTG and NVIDIA architectures, with RTG's Polaris/Vega maybe even drawing slightly less wattage for the same performance output.