Performance per Watt: What chance does Polaris have?


sandorski

No Lifer
Oct 10, 1999
70,784
6,343
126
Waiting to see how things play out. One thing is certain: the first gen on these new processes will be less efficient than later gens.
 

IntelUser2000

Elite Member
Oct 14, 2003
8,686
3,787
136
I don't have the source, but NV has reportedly confirmed that GP100 (specifically) will not be making it into a GeForce card. I saw someone post this on OCN, with a source, but I can't find the post.

This means we shouldn't be surprised if we see a gaming oriented part with more than 3800SPs, maybe 4500-5000.

And it makes sense. Otherwise, if Nvidia cut the low-end parts down from the GP100 design, we'd see $100 cards with heavy FP64 capability, which would hurt them in the market they're selling into (gaming). You'd make a version without FP64, and from there cut it down to make smaller parts for consumers.
 
Feb 19, 2009
10,457
10
76
This means we shouldn't be surprised if we see a gaming oriented part with more than 3800SPs, maybe 4500-5000.

And it makes sense. Otherwise, if Nvidia cut the low-end parts down from the GP100 design, we'd see $100 cards with heavy FP64 capability, which would hurt them in the market they're selling into (gaming). You'd make a version without FP64, and from there cut it down to make smaller parts for consumers.

The versions without FP64 have been the x04, x06, and x07 chips.
 

JDG1980

Golden Member
Jul 18, 2013
1,663
570
136
This means we shouldn't be surprised if we see a gaming oriented part with more than 3800SPs, maybe 4500-5000.

I'm expecting 4096 SPs on GP104, and 6144 SPs on GP102 (assuming it exists, which it probably will).

A lot of people have adjusted their expectations downward as a result of the low SP count on GP100, but that's due simply to the fact that GP100 is a dedicated HPC chip. It wastes lots of transistors on stuff that gamers (and even most Quadro workstation users) don't need. It has 15 billion transistors, which means if it was simply a die-shrunk Maxwell, we should be getting about double the shaders of GM200. 6144, not a mere 3840.

Nvidia uses an extremely inefficient method of providing 64-bit FP support. Consider how Hawaii managed to beat GK110 in gaming performance with a considerably smaller die (about 100mm^2 smaller) while still providing 1/2 FP64 support for the FirePro market. Unless Nvidia creates a large die dedicated to gaming/FP32, AMD's Vega will absolutely dominate the high end when it is released. I don't think Nvidia intends to let that happen.

A lot of people are grossly underestimating what a die shrink should actually mean. The upper-mid-size 40nm Fermi chip, GF114, was a 360mm^2 chip and it had 1.95 billion transistors. The corresponding 28nm Kepler chip, GK104, was noticeably smaller (294mm^2), but even so, it still had ~82% more transistors - 3.54 billion. That's a ~2.2x increase in transistor density from a single node shrink.

And look at the shader counts! GF114 had 384 shaders. GK104? 1536 shaders - four times as many. Sure, it's a different architecture, but you can't tell me that a single Fermi shader is more powerful than two Kepler shaders. We're looking at roughly a doubling of actual performance.

And that's without getting into the clock boosts that FinFET enables. 40% increase in clocks from GM200 to GP100 on the Tesla side! Considering that pro cards always have lower clocks than consumer-grade cards, we should be seeing ~1.6 GHz base clock rates on GeForce versions.
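Just to spell the arithmetic out, here is a quick back-of-envelope sketch of the two claims above. It is only a sketch: it reuses the figures quoted in this post, plus GM200's commonly cited ~8 billion transistors and 3072 shaders as assumed inputs, not anything confirmed about Pascal.

```cpp
// Back-of-envelope check of the die-shrink argument above.
// Inputs: the post's GF114/GK104 figures plus assumed GM200 numbers (~8.0B transistors, 3072 shaders).
#include <cstdio>

int main() {
    // 40nm GF114 vs 28nm GK104
    double gf114_xtors = 1.95e9, gf114_area = 360.0;   // transistors, mm^2
    double gk104_xtors = 3.54e9, gk104_area = 294.0;
    double density_gain = (gk104_xtors / gk104_area) / (gf114_xtors / gf114_area);
    printf("GF114 -> GK104 density gain: %.2fx\n", density_gain);              // ~2.2x

    // If GP100's ~15B transistors were spent Maxwell-style:
    double gm200_xtors = 8.0e9, gm200_shaders = 3072.0;
    double scaled = gm200_shaders * (15.0e9 / gm200_xtors);
    printf("Maxwell-style shader budget at 15B transistors: ~%.0f\n", scaled); // ~5800, i.e. roughly 2x GM200
    return 0;
}
```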
 

Adored

Senior member
Mar 24, 2016
256
1
16
40% increase in clocks from GM200 to GP100 on the Tesla side! Considering that pro cards always have lower clocks than consumer-grade cards, we should be seeing ~1.6 GHz base clock rates on GeForce versions.

The problem is GM200 was rated at 250W TDP but is likely drawing less than that in practice. It could easily do 1.2GHz at 300W.

P100 is rated at 300W TDP but quite often "300W" TDP means "more than 300W" if you look at it historically.
 
Feb 19, 2009
10,457
10
76
Consider how Hawaii managed to beat GK110 in gaming performance with a considerably smaller die (about 100mm^2 smaller) while still providing 1/2 FP64 support for the FirePro market. Unless Nvidia creates a large die dedicated to gaming/FP32, AMD's Vega will absolutely dominate the high end when it is released. I don't think Nvidia intends to let that happen.

Hawaii also had a power-hungry hardware scheduler to enable that feature.

NV tore it out of Kepler & Maxwell.

GCN could also do 2x FP16 on a single FP32 SP; what NV calls a revolution in Deep Learning computation, GCN basically had many years ago.

Now, I expect GP104 to also have this feature (since JHH is pushing Deep Learning!), so it will retain some of the compute capabilities of GP100, just with fewer FP64 units.

NV can make a gaming-focused chip with 3840 CCs like GP100; removing all the FP64 units would put them around 400-450mm2.
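To make the "2x FP16 on an FP32 SP" idea concrete, here is a minimal CUDA sketch of packed-half math using the cuda_fp16.h intrinsics. This is the CUDA-side expression of the idea, not GCN code, and it only hits the doubled rate on hardware with native FP16 math (compute capability 5.3+, e.g. GP100 or Tegra X1); treat the kernel as an illustration, not a benchmark.

```cuda
// Minimal sketch: packed FP16 ("half2") AXPY. Two 16-bit values share one 32-bit register,
// and a single __hfma2 issues a fused multiply-add for both lanes, which is where the
// "2x FP16 per FP32 unit" rate comes from on hardware with native FP16 math.
// Build for sm_53+ (e.g. nvcc -arch=sm_60); host setup and allocations omitted.
#include <cuda_fp16.h>

__global__ void axpy_half2(int n2, __half2 a, const __half2 *x, __half2 *y) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n2) {
        y[i] = __hfma2(a, x[i], y[i]);   // per lane: y = a * x + y
    }
}
```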
 

xpea

Senior member
Feb 14, 2014
458
156
116
I'm expecting 4096 SPs on GP104, and 6144 SPs on GP102 (assuming it exists, which it probably will).
Seems a bit too much. For consumer/Quadro parts, I was thinking more like 96 FP32 units + 6 FP64 units per SM. That would give the well-known 1/16 FP64 rate, with an FP32 core count per SM sitting between GM204's and GP100's.

A lot of people have adjusted their expectations downward as a result of the low SP count on GP100, but that's due simply to the fact that GP100 is a dedicated HPC chip. It wastes lots of transistors on stuff that gamers (and even most Quadro workstation users) don't need. It has 15 billion transistors, which means if it was simply a die-shrunk Maxwell, we should be getting about double the shaders of GM200. 6144, not a mere 3840.
Agree

Nvidia uses an extremely inefficient method of providing 64-bit FP support. Consider how Hawaii managed to beat GK110 in gaming performance with a considerably smaller die (about 100mm^2 smaller) while still providing 1/2 FP64 support for the FirePro market. Unless Nvidia creates a large die dedicated to gaming/FP32, AMD's Vega will absolutely dominate the high end when it is released. I don't think Nvidia intends to let that happen.
Don't agree. The only real competitor in the HPC field is Intel KNL, which provides a bit more than 3 TFLOPS FP64 and 6 TFLOPS FP32 with a much bigger 683mm2 die on a 14nm process. P100 is very competitive when you take into account the huge 14MB register file and the 4 Nvlinks (which occupy 400 pins!!!) that allow a very efficient 8-GPU node.
And please don't talk about AMD; it's a negligible quantity in this market, and without something like Nvlink they don't scale very well. Maybe OK for hobby research, but not where the real money is made (yes, I know they won a nice contract lately, but it's the exception, not the norm).

A lot of people are grossly underestimating what a die shrink should actually mean. The upper-mid-size 40nm Fermi chip, GF114, was a 360mm^2 chip and it had 1.95 billion transistors. The corresponding 28nm Kepler chip, GK104, was noticeably smaller (294mm^2), but even so, it still had ~82% more transistors - 3.54 billion. That's a ~2.2x increase in transistor density from a single node shrink. And look at the shader counts! GF114 had 384 shaders. GK104? 1536 shaders - four times as many. Sure, it's a different architecture, but you can't tell me that a single Fermi shader is more powerful than two Kepler shaders. We're looking at roughly a doubling of actual performance. And that's without getting into the clock boosts that FinFET enables. 40% increase in clocks from GM200 to GP100 on the Tesla side! Considering that pro cards always have lower clocks than consumer-grade cards, we should be seeing ~1.6 GHz base clock rates on GeForce versions.
I think 1.5GHz base clock will already be a very good achievement and will give some headroom for partners to make factory OCed models.
 
Last edited:

xpea

Senior member
Feb 14, 2014
458
156
116
P100 is rated at 300W TDP but quite often "300W" TDP means "more than 300W" if you look at it historically.
Not on Tesla parts. The power rating is always the maximum value. Nvidia said again yesterday that 300W is the absolute maximum power.
 

Dribble

Platinum Member
Aug 9, 2005
2,076
611
136
Don't think you can use the P100 to predict Nvidia's gaming performance. As others have said, it's full of stuff for HPC (FP64, nvlink, cache, etc); take all that away and you'll lose a lot of die size and power for no loss in gaming performance. Make a version for gaming the same size as a P100 and you'll end up with significantly higher performance.

As for efficiency, well, Nvidia has been ahead there since Kepler - that was the chip that killed AMD in discrete mobile (where efficiency matters most), and Maxwell just continued what it started. I doubt AMD has any chance of overtaking Nvidia; I doubt they'll even be able to catch up - that's just simple common sense:
-Nvidia is starting ahead in efficiency.
-It has the bigger R&D budget.
-It's using TSMC, who have been the premium chip manufacturer (excluding Intel) and have lots of experience making huge GPUs.

Whereas:
-AMD is starting behind in efficiency.
-It has a smaller R&D budget.
-The company is in a bad way, so it's probably already lost many of its best R&D people (when things go wrong, those who can get another job leave first, i.e. the best).
-It's allegedly using GloFo, who have been inferior to TSMC for the last few generations and have little experience making huge GPUs.
 
Last edited:

flopper

Senior member
Dec 16, 2005
739
19
76
While the transition to FinFETs should help with AMD/RTG's performance per watt, is this the only thing Polaris has going for it, or will the architecture be designed with efficiency as a higher priority than in the previous generation?

To put it roughly, based on most benchmarks, AMD's performance per watt seems to be roughly 70-80% of nVidia's. To put it in perspective from my own findings, my R9 390 is set to power limit -30% / vcore -30mv at all times, giving it roughly GTX 980 power consumption figures but only about 75-80% of the performance. nVidia has come a long way since Fermi: Fermi to Kepler was akin to Pentium 4 to Conroe, and Maxwell somehow repeated that feat even on the same process node.

If nVidia can manage another feat through architectural optimization alone regardless of the benefits of a smaller fab process, is there any way for AMD to catch up?

The question is this: can Pascal do async compute in hardware?
Adding compute adds to power, and judging 28nm vs 14nm we have a whole new baseline that is difficult to compare, as each new iteration of cards on 14nm will be improved as the tech ages.
If Pascal can do async compute in hardware, AMD then catches up in power efficiency.
If not, well, AMD isn't the one that should be worried.
 

Erenhardt

Diamond Member
Dec 1, 2012
3,251
105
101
Don't agree. The only real competitor in the HPC field is Intel KNL, which provides a bit more than 3 TFLOPS FP64 and 6 TFLOPS FP32 with a much bigger 683mm2 die on a 14nm process.
A single FirePro W9100 has 2.6 TFLOPS FP64, and that is a 28nm, 275W TDP chip. Porting this to 14nm would allow doubling its resources, putting it right where NV claims P100 will be.

P100 is very competitive when you take into account the huge 14MB register file and the 4 Nvlinks (which occupy 400 pins!!!) that allow a very efficient 8-GPU node.
And please don't talk about AMD; it's a negligible quantity in this market, and without something like Nvlink they don't scale very well. Maybe OK for hobby research, but not where the real money is made (yes, I know they won a nice contract lately, but it's the exception, not the norm).
You can't have it both ways. Either it doesn't scale, or it scales and was used in a 1000 dual-GPU server that will crunch our universe's dimensions?

I see we have a new buzz word, NVlink.
 

Flapdrol1337

Golden Member
May 21, 2014
1,677
93
91
The question is this: can Pascal do async compute in hardware?
Adding compute adds to power, and judging 28nm vs 14nm we have a whole new baseline that is difficult to compare, as each new iteration of cards on 14nm will be improved as the tech ages.
If Pascal can do async compute in hardware, AMD then catches up in power efficiency.
If not, well, AMD isn't the one that should be worried.
Async compute in hardware isn't the reason for the difference in efficiency. Kepler doesn't do it either, but Maxwell is massively more efficient.

I don't see efficiency being very important. It may be with the ridiculously large GP100, but that's out of most people's league anyway. Polaris and the Nvidia equivalent will be much smaller and probably more gaming focused, so they'll have modest power consumption anyway.
 

Erenhardt

Diamond Member
Dec 1, 2012
3,251
105
101
performance metrics:
P100 5.3TF FP64, 10.6TF FP32 300W TDP @16nm
Hawaii (W9100) 2.6TF FP64, 5.2TF FP32 275W TDP @28nm

We know Polaris will be 2-2.5x perf/watt. What AMD needs is a Hawaii-class GPU at 14nm. 14nm is supposed to offer 50% area scaling compared to 28nm, doubling the number of xtors. Then we would have ~440mm2 AMD vs ~600mm2 NV duking it out.
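Spelling the same back-of-envelope math out (a sketch only: it uses the figures above plus Hawaii's commonly cited ~438mm^2 die size as an assumption, and treats "perf/watt" as a straight multiplier):

```cpp
// Doubled-Hawaii-at-14nm sketch, using the post's numbers plus an assumed ~438mm^2 Hawaii die.
#include <cstdio>

int main() {
    // Hawaii / FirePro W9100 at 28nm
    double fp64_tf = 2.6, fp32_tf = 5.2, tdp_w = 275.0, die_mm2 = 438.0;

    // Double the resources on a process with ~2x density ("50% area scaling")
    double scaled_die  = 2.0 * die_mm2 * 0.5;      // ~438 mm^2, i.e. the post's ~440mm2
    double scaled_fp64 = 2.0 * fp64_tf;            // 5.2 TFLOPS vs P100's claimed 5.3
    double scaled_fp32 = 2.0 * fp32_tf;            // 10.4 TFLOPS vs P100's claimed 10.6

    // Power for 2x the work if perf/watt improves by the claimed 2x-2.5x
    double power_lo = tdp_w * 2.0 / 2.5;           // ~220 W
    double power_hi = tdp_w * 2.0 / 2.0;           // ~275 W

    printf("doubled Hawaii @14nm: ~%.0fmm2, %.1f/%.1f TFLOPS FP64/FP32, ~%.0f-%.0fW\n",
           scaled_die, scaled_fp64, scaled_fp32, power_lo, power_hi);
    return 0;
}
```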
 

coercitiv

Diamond Member
Jan 24, 2014
7,374
17,480
136
Don't agree. The only real competitor in the HPC field is Intel KNL, which provides a bit more than 3 TFLOPS FP64 and 6 TFLOPS FP32 with a much bigger 683mm2 die on a 14nm process.
Keep in mind the KNL TFLOPS figures are given for 200W TDP.
 

xpea

Senior member
Feb 14, 2014
458
156
116
A single FirePro W9100 has 2.6 TFLOPS FP64, and that is a 28nm, 275W TDP chip. Porting this to 14nm would allow doubling its resources, putting it right where NV claims P100 will be.
performance metrics:
P100 5.3TF FP64, 10.6TF FP32 300W TDP @16nm
Hawaii (W9100) 2.6TF FP64, 5.2TF FP32 275W TDP @28nm

We know Polaris will be 2-2.5x perf/watt. What AMD needs is a Hawaii-class GPU at 14nm. 14nm is supposed to offer 50% area scaling compared to 28nm, doubling the number of xtors. Then we would have ~440mm2 AMD vs ~600mm2 NV duking it out.
I love this speculation period :cool:
With AMD, it's always the same. Great on paper... full of fanboys with "you will see next", but the reality is:
1- AMD is a negligible quantity in HPC. Please tell me how many of these W9100 wonders AMD sold? No need to bother, RTG's financial performance says it all...

2- To sell in this market, you must provide support. In other words, hardware without software is useless. And AMD has basically nothing. Where's their SDK? What's their equivalent of cuDNN? Do they provide libraries to get Caffe / Theano / CNTK / TensorFlow / Torch GPU-accelerated on their wonderful hardware? You know the answer, it's a massive NO.

On the green side, you can find everything here: https://developer.nvidia.com/deep-learning
and on the thousands of useful pages related to CUDA deep learning you can find on Google...

To understand what I'm talking about: cuDNN 5 is now available as a release candidate:
https://developer.nvidia.com/cudnn

The new cuDNN 5 release delivers new features and performance improvements. Highlights include:

LSTM recurrent neural networks that deliver up to 6x speedup in Torch
Up to 44% faster training on a single NVIDIA Pascal GPU
Accelerated networks with 3x3 convolutions, such as VGG, GoogleNet, and ResNets
Improved performance and reduced memory usage with FP16 routines on Pascal GPUs
Support for Jetson TX1

Adding high-performance LSTM layers to cuDNN helps us immensely in accelerating all of our NLP use-cases. [This] is awesome work by NVIDIA, as always.
- Soumith Chintala, Facebook AI Research

We are amazed by the steady stream of improvements made to the NVIDIA Deep Learning SDK and the speedups that they deliver. This new version of the SDK significantly improves our convolution algorithms, and goes so far as to accelerate the 3D convolution by a factor of 3x! On top of that, we are excited about their decision to provide tools for other models such as LSTM, RNN and GRU in this new version.
- Frédéric Bastien, Team Lead - Software Infrastructure at MILA

CNTK relies on the NVIDIA Deep Learning SDK for performance and scalability. The time we save by not having to implement and optimize the latest algorithms from scratch helps us invest more time in improving CNTK's strengths in speech, image and text processing.
- Xuedong Huang (XD), Distinguished Engineer at Microsoft Research Advanced Technology Group

The performance of mxnet has consistently improved with each release of the NVIDIA Deep Learning SDK and with the latest release mxnet is now 10% faster! We’re excited about NVIDIA’s decision to introduce Winograd and LSTM which are highlights of this release.
- Bing Xu, Masters student at University of Alberta
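For anyone who hasn't used it, here is roughly what that kind of library looks like in practice: a minimal, heavily trimmed sketch of the cuDNN v5-era C setup. Error checking, device memory, the filter/convolution descriptors and the actual cudnnConvolutionForward() call are omitted, and the tensor sizes are made up for illustration.

```cpp
// Minimal cuDNN sketch: create a handle, describe an input tensor in FP16
// (the data type the Pascal FP16 routines above operate on), then clean up.
#include <cudnn.h>

int main(void) {
    cudnnHandle_t handle;
    cudnnCreate(&handle);                               // one handle per GPU context

    cudnnTensorDescriptor_t x_desc;
    cudnnCreateTensorDescriptor(&x_desc);
    // batch=32, channels=3, 224x224 image, NCHW layout, half-precision storage
    cudnnSetTensor4dDescriptor(x_desc, CUDNN_TENSOR_NCHW, CUDNN_DATA_HALF,
                               32, 3, 224, 224);

    /* ... filter + convolution descriptors, algorithm selection and the
       forward convolution call would go here ... */

    cudnnDestroyTensorDescriptor(x_desc);
    cudnnDestroy(handle);
    return 0;
}
```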
 
Last edited:

parvadomus

Senior member
Dec 11, 2012
685
14
81
performance metrics:
P100 5.3TF FP64, 10.6TF FP32 300W TDP @16nm
Hawaii (W9100) 2.6TF FP64, 5.2TF FP32 275W TDP @28nm

We know Polaris will be 2-2.5x perf/watt. What AMD needs is a Hawaii-class GPU at 14nm. 14nm is supposed to offer 50% area scaling compared to 28nm, doubling the number of xtors. Then we would have ~440mm2 AMD vs ~600mm2 NV duking it out.

AMD should just shrink Hawaii and add some optimizations to it. Then launch a Hawaii x2 PRO card and goodbye GP100. Then simply focus on gaming graphics cards.

EDIT: I really don't know why they make very big chips for HPC when, in the end, they will be used in big arrays. Why not make a lot of 200mm2 HPC chips?
 
Last edited:
Feb 19, 2009
10,457
10
76
AMD should just shrink Hawaii and add some optimizations to it. Then launch a Hawaii x2 PRO card and goodbye GP100. Then simply focus on gaming graphics cards.

EDIT: I really don't know why they make very big chips for HPC when, in the end, they will be used in big arrays. Why not make a lot of 200mm2 HPC chips?

A shrunk Hawaii with updated video blocks on FinFET, running higher clocks, would still kick some major ass in the DX12 era. But Polaris GCN is enhanced, souped up; it'll be better.

HPC is limited in performance per slot (and also limited by the CPU : accelerator ratio). Because those arrays are limited in size, let's say you can put a max of 4,500 GPUs into that building. Do you want 4,500 huge GP100-sized accelerators or small ones? Yeah?
 

xpea

Senior member
Feb 14, 2014
458
156
116
Keep in mind the KNL TFLOPS figures are given for 200W TDP.
Really? Where? Please show me the official number and the corresponding SKU.

All that's public is that first-gen Xeon Phi is rated from 225W (on the 5110P) to 300W for the highest-performance model.
I haven't seen any numbers yet on KNL, except the "160-200W" target from 2 years ago. Since then... nothing...
Besides, KNL is now 3 quarters late. The last slide says it will ship in Q3-16.
[attached slide image]
 
Feb 19, 2009
10,457
10
76
Besides, KNL is now 3 quarters late. The last slide says it will ship in Q3-16.
[attached slide image]

IIRC, Intel is already shipping to select customers building supercomputers.

Look at that pic though: contracts for >100 PFLOPS. Probably rising fast too.

That's what NV is afraid of. Not AMD. :)
 

Erenhardt

Diamond Member
Dec 1, 2012
3,251
105
101
Stop spreading FUD.
Nvlink doesn't need POWER. It has 2 implementations:
One for GPU-GPU interconnect, like the one used on DGX-1 (with dual Xeon CPUs, you see)
Another for CPU-GPU interconnect that will work with second-gen POWER8 and the future POWER9. The first POWER8 with Nvlink is coming by the end of the year:
http://www.anandtech.com/show/10230...-openpower-hpc-server-with-power8-cpus-nvlink

So it is not required. A gimmick then?


Threadcrapping and trolling are not allowed
Markfw900
 
Last edited by a moderator:

xthetenth

Golden Member
Oct 14, 2014
1,800
529
106
Also we learned that HPC dev support is relevant to a discussion that's trying to figure out performance/power characteristics of new chips. So that's interesting.