I don't have the source, but NV has reportedly confirmed that GP100 (specifically) will not be making it into a GeForce card. I saw someone post this on OCN, with a source, but I can't find the post.
This means we shouldn't be surprised if we see a gaming-oriented part with more than 3800 SPs, maybe 4500-5000.
And it makes sense. Otherwise, if Nvidia cuts the low-end parts from the GP100 design, we'll see $100 cards with heavy FP64 capability, which will hurt it in the market it's selling to (gaming). You'd make a version without FP64, and from there you cut down to make smaller parts for consumers.
Seems a bit too much. For consumer/Quadro parts, I was thinking more like 96 FP32 units + 6 FP64 per SM. That would give the well-known 1/16 FP64 rate and an FP32 core count per SM between GM204 and GP100.
I'm expecting 4096 SPs on GP104, and 6144 SPs on GP102 (assuming it exists, which it probably will).
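For what it's worth, the 1/16 figure follows directly from the unit counts in the 96 FP32 + 6 FP64 suggestion. A minimal Python sketch, assuming one operation per unit per clock; the GM204 and GP100 per-SM counts are not from the post and are included only as assumed reference points, and `fp64_rate` is a made-up helper name.

```python
from fractions import Fraction

def fp64_rate(fp32_units_per_sm, fp64_units_per_sm):
    """FP64 throughput relative to FP32, assuming one op per unit per clock."""
    return Fraction(fp64_units_per_sm, fp32_units_per_sm)

# Configuration suggested in the post for consumer/Quadro parts.
proposed = fp64_rate(96, 6)

# Assumed reference points (not stated in the post).
gm204 = fp64_rate(128, 4)   # Maxwell SM: 128 FP32, 4 FP64
gp100 = fp64_rate(64, 32)   # GP100 SM: 64 FP32, 32 FP64

print(f"Proposed SM: {proposed}")
print(f"GM204 SM:    {gm204}")
print(f"GP100 SM:    {gp100}")
```

With those assumptions, 96 FP32 units per SM does sit between GM204 and GP100, as the post says.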
Agree.
A lot of people have adjusted their expectations downward as a result of the low SP count on GP100, but that's due simply to the fact that GP100 is a dedicated HPC chip. It wastes lots of transistors on stuff that gamers (and even most Quadro workstation users) don't need. It has 15 billion transistors, which means if it were simply a die-shrunk Maxwell, we should be getting about double the shaders of GM200. 6144, not a mere 3840.
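As a rough check on the "die-shrunk Maxwell" argument, here is a minimal Python sketch that scales shader count in proportion to transistor count. The GM200 figures (about 8 billion transistors, 3072 shaders) are assumptions not given in the post; the 15 billion figure is the post's. The helper name is hypothetical, and real chips do not scale this linearly.

```python
def scaled_shader_count(base_shaders, base_transistors, new_transistors):
    """Naive estimate: shader count grows in proportion to transistor count."""
    return base_shaders * new_transistors / base_transistors

# Assumed GM200 figures (not stated in the post).
GM200_SHADERS = 3072
GM200_TRANSISTORS = 8.0e9

# Transistor count quoted in the post for GP100.
GP100_TRANSISTORS = 15.0e9

estimate = scaled_shader_count(GM200_SHADERS, GM200_TRANSISTORS, GP100_TRANSISTORS)
ratio = estimate / GM200_SHADERS

print(f"Naive estimate: {estimate:.0f} shaders ({ratio:.2f}x GM200)")
```

Under these assumptions the estimate comes out a little under the 6144 quoted, so "about double" is a round-up rather than an exact figure.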
Don't agree. The only real competitor in the HPC field is Intel's KNL, which provides a bit more than 3 TFLOPS FP64 and 6 TFLOPS FP32 with a much bigger 683mm^2 die on a 14nm process. P100 is very competitive when taking into account the huge 14MB register file and the 4 NVLinks (which occupy 400 pins!) that allow a very efficient 8-GPU node.
Nvidia uses an extremely inefficient method of providing 64-bit FP support. Consider how Hawaii managed to beat GK110 in gaming performance with a considerably smaller die (about 100mm^2 smaller) while still providing 1/2 FP64 support for the FirePro market. Unless Nvidia creates a large die dedicated to gaming/FP32, AMD's Vega will absolutely dominate the high end when it is released. I don't think Nvidia intends to let that happen.
I think a 1.5GHz base clock will already be a very good achievement and will give some headroom for partners to make factory OCed models.
A lot of people are grossly underestimating what a die shrink should actually mean. The upper-mid-size 40nm Fermi chip, GF114, was a 360mm^2 chip and it had 1.95 billion transistors. The corresponding 28nm Kepler chip, GK104, was noticeably smaller (294mm^2), but even so, it still had ~82% more transistors - 3.54 billion. That's a ~2.2x increase in transistor density from a single node shrink. And look at the shader counts! GF114 had 384 shaders. GK104? 1536 shaders - four times as many. Sure, it's a different architecture, but you can't tell me that a single Fermi shader is more powerful than two Kepler shaders. We're looking at roughly a doubling of actual performance. And that's without getting into the clock boosts that FinFET enables. 40% increase in clocks from GM200 to GP100 on the Tesla side! Considering that pro cards always have lower clocks than consumer-grade cards, we should be seeing ~1.6 GHz base clock rates on GeForce versions.
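The GF114 vs GK104 numbers in the quoted post can be checked with a few lines of Python. This is only a sketch of the arithmetic, using the die sizes, transistor counts and shader counts exactly as quoted; the helper name is made up.

```python
def density_mtr_per_mm2(transistors, die_area_mm2):
    """Transistor density in millions of transistors per mm^2."""
    return transistors / die_area_mm2 / 1e6

# Figures as quoted in the post.
gf114 = {"transistors": 1.95e9, "area_mm2": 360, "shaders": 384}
gk104 = {"transistors": 3.54e9, "area_mm2": 294, "shaders": 1536}

d_fermi = density_mtr_per_mm2(gf114["transistors"], gf114["area_mm2"])
d_kepler = density_mtr_per_mm2(gk104["transistors"], gk104["area_mm2"])

transistor_gain = gk104["transistors"] / gf114["transistors"] - 1
density_ratio = d_kepler / d_fermi
shader_ratio = gk104["shaders"] / gf114["shaders"]

print(f"Extra transistors: {transistor_gain:.0%}")
print(f"Density: {d_fermi:.2f} -> {d_kepler:.2f} MTr/mm^2 ({density_ratio:.2f}x)")
print(f"Shader count ratio: {shader_ratio:.1f}x")
```

The ~82% and ~2.2x figures in the post both fall out of these inputs.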
Not on Tesla parts. The power rating is always the maximum value. Nvidia said again yesterday that 300W is the absolute maximum power.
P100 is rated at 300W TDP, but quite often "300W" TDP means "more than 300W" if you look at it historically.
While the transition to FinFETs should help with AMD/RTG's performance per watt, is this the only thing Polaris has going for it, or will the architecture be designed with efficiency as a higher priority than the previous generation?
To put it roughly, based on most benchmarks, AMD's performance per watt seems to be about 70-80% of nVidia's. To put it in perspective from my own findings, my R9 390 is set to power limit -30% / vcore -30 mV at all times, giving it roughly GTX 980 power consumption figures but only about 75-80% of the performance. nVidia has come a long way since Fermi: Fermi to Kepler was akin to Pentium 4 to Conroe, and Maxwell somehow repeated that feat even on the same process node.
If nVidia can manage another feat through architectural optimization alone regardless of the benefits of a smaller fab process, is there any way for AMD to catch up?
A single FirePro W9100 has 2.6 TFLOPS FP64, and that is a 28nm, 275W TDP chip. Porting this to 14nm would allow doubling the resources, putting it right where NV claims P100 will be.
Don't agree. The only real competitor in the HPC field is Intel's KNL, which provides a bit more than 3 TFLOPS FP64 and 6 TFLOPS FP32 with a much bigger 683mm^2 die on a 14nm process.
You can't have both. Either it doesn't scale, or it scales and was used in a 1000 dual-GPU server that will crunch our universe's dimensions?
P100 is very competitive when taking into account the huge 14MB register file and the 4 NVLinks (which occupy 400 pins!) that allow a very efficient 8-GPU node.
And please don't talk about AMD; it's a negligible quantity in this market, and without something like NVLink, they don't scale very well. Maybe OK for hobby research, but not where the real money is made (yes, I know they recently won a nice contract, but it's the exception, not the norm).
Async compute in hardware isn't the reason for the difference in efficiency. Kepler doesn't do it either, but Maxwell is massively more efficient.
The question is this: can Pascal do async compute in hardware?
Adding compute adds to power, and going from 28nm to 14nm gives us a whole new baseline that is difficult to compare, as each new iteration of cards on 14nm will improve as the tech ages.
If Pascal can do async compute in hardware, AMD then catches up in power efficiency.
If not, well, AMD then isn't the worried one.
Keep in mind the KNL TFLOPS figures are given for 200W TDP.
Don't agree. The only real competitor in the HPC field is Intel's KNL, which provides a bit more than 3 TFLOPS FP64 and 6 TFLOPS FP32 with a much bigger 683mm^2 die on a 14nm process.
And they do not require an IBM platform to work.
Keep in mind the KNL TFLOPS figures are given for 200W TDP.
I love this speculation period.
Performance metrics:
P100 5.3TF FP64, 10.6TF FP32 300W TDP @16nm
Hawaii (W9100) 2.6TF FP64, 5.2TF FP32 275W TDP @28nm
We know Polaris will be 2-2.5x perf/watt. What AMD needs is a Hawaii-class GPU at 14nm. 14nm is supposed to offer 50% area scaling compared to 28nm, doubling the number of transistors. Then we would have a 440mm^2 AMD chip vs a 600mm^2 NV chip duking it out.
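The perf/watt comparison behind those metrics can be sketched in Python. The TFLOPS and TDP values are the ones listed above; applying the 2-2.5x Polaris perf/watt claim directly to Hawaii's FP64 efficiency is an assumption made purely for illustration, and `gflops_per_watt` is a made-up helper.

```python
def gflops_per_watt(tflops, tdp_watts):
    """Peak GFLOPS per watt of rated TDP."""
    return tflops * 1000 / tdp_watts

# FP64 figures as listed in the post.
p100 = gflops_per_watt(5.3, 300)    # 16nm
w9100 = gflops_per_watt(2.6, 275)   # 28nm Hawaii

# Hypothetical: a Hawaii-class 14nm part that gets the claimed
# 2x to 2.5x perf/watt uplift on FP64 (an assumption, not a known spec).
projected_low = w9100 * 2.0
projected_high = w9100 * 2.5

print(f"P100:  {p100:.2f} GFLOPS/W FP64")
print(f"W9100: {w9100:.2f} GFLOPS/W FP64")
print(f"Projected 14nm Hawaii-class: {projected_low:.2f} to {projected_high:.2f} GFLOPS/W")
```

On paper, that projection lands at or slightly above P100's FP64 efficiency, which is the point the post is making; whether the perf/watt claim carries over to FP64 compute is an open question.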
The new cuDNN 5 release delivers new features and performance improvements. Highlights include:
LSTM recurrent neural networks that deliver up to 6x speedup in Torch
Up to 44% faster training on a single NVIDIA Pascal GPU
Accelerated networks with 3x3 convolutions, such as VGG, GoogleNet, and ResNets
Improved performance and reduced memory usage with FP16 routines on Pascal GPUs
Support for Jetson TX1
Adding high-performance LSTM layers to cuDNN helps us immensely in accelerating all of our NLP use-cases. [This] is awesome work by NVIDIA, as always.
- Soumith Chintala, Facebook AI Research
We are amazed by the steady stream of improvements made to the NVIDIA Deep Learning SDK and the speedups that they deliver. This new version of the SDK significantly improves our convolution algorithms, and goes so far as to accelerate the 3D convolution by a factor of 3x! On top of that, we are excited about their decision to provide tools for other models such as LSTM, RNN and GRU in this new version.
- Frédéric Bastien, Team Lead - Software Infrastructure at MILA
CNTK relies on the NVIDIA Deep Learning SDK for performance and scalability. The time we save by not having to implement and optimize the latest algorithms from scratch helps us invest more time in improving CNTK's strengths in speech, image and text processing.
- Xuedong Huang (XD), Distinguished Engineer at Microsoft Research Advanced Technology Group
The performance of mxnet has consistently improved with each release of the NVIDIA Deep Learning SDK and with the latest release mxnet is now 10% faster! We’re excited about NVIDIA’s decision to introduce Winograd and LSTM which are highlights of this release.
- Bing Xu, Masters student at University of Alberta
Stop spreading FUD.
And they do not require an IBM platform to work.
AMD should just shrink Hawaii and add some optimizations. Then launch a Hawaii x2 PRO card and goodbye GP100. Then simply focus on gaming graphics cards.
Edit: I really don't know why they make very big chips for HPC when, in the end, they will use them in big arrays. Why not make a lot of 200mm^2 HPC chips?
Really? Where? Please show me an official number and the corresponding SKU.
Keep in mind the KNL TFLOPS figures are given for 200W TDP.
Besides, KNL is now 3 quarters late. The last slide says it will ship in Q3 2016.

Stop spreading FUD.
NVLink doesn't need POWER. It has two implementations:
One for GPU-GPU interconnect, as used on DGX-1 (with dual Xeon CPUs, you see)
Another for CPU-GPU interconnect that will work with second-gen POWER8 and future POWER9. The first POWER8 with NVLink is coming by the end of the year:
http://www.anandtech.com/show/10230...-openpower-hpc-server-with-power8-cpus-nvlink
Now you are trolling :thumbsdown:
So it is not required. A gimmick then?
Hey, thanks to you, today we learned that CERN and CHIME are essentially a hobbyist dream D:
Now you are trolling :thumbsdown:

 
				
		