NVIDIA GeForce 20 Series (Volta) to be released later this year - GV100 announced

Page 13 - Seeking answers? Join the AnandTech community: where nearly half-a-million members share solutions and discuss the latest tech.

Samwell

Senior member
May 10, 2015
225
47
101
I did some simulation of the Volta GPU that should slot in as the GTX 2060.

IMO we are looking at a 250 mm2 die with a 125W TDP: 2048 CUDA cores clocked at 1.5 GHz. Performance around 1.8 times the GTX 1060 (gaming performance). Overall max TFLOPS output around 6 TFLOPS.
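The 6 TFLOPS figure follows from cores x clock x 2 FLOPs per FMA; a quick sketch (the 2060-class specs are the poster's speculation, not confirmed hardware):

```python
# Peak FP32 throughput = cores * clock * 2 (one fused multiply-add = 2 FLOPs).
# The 2048-core / 1.5 GHz figures are speculative; the GP100 figures are published.
def peak_tflops(cuda_cores, clock_ghz, flops_per_clock=2):
    return cuda_cores * clock_ghz * flops_per_clock / 1000.0

print(peak_tflops(2048, 1.5))   # speculative "GTX 2060": ~6.1 TFLOPS
print(peak_tflops(3584, 1.48))  # GP100 at boost clock: ~10.6 TFLOPS
```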

You are expecting too much, as I already wrote above. You don't get more than maybe a 5% gain in perf per FLOP in gaming. At least not from the shaders alone. Just look at the HPC numbers:

[Chart: V100 HPC benchmark speedups (hpc_perf_v100.png)]


DGEMM is almost purely shader-limited: it scales linearly with GFLOPS, and that's mostly what you will see in games, if other stuff like the front end isn't the limiting factor. The improvements only help in special cases like the FFT here. And gaming workloads are much simpler code-wise.
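For what "scales linearly with GFLOPS" means in practice, here is a minimal sketch of how an effective DGEMM rate is measured (a pure-Python stand-in for illustration, not how you would benchmark a GPU):

```python
import random
import time

# DGEMM performs 2*m*n*k floating-point operations; the effective GFLOPS
# rate is that count divided by the runtime. If DGEMM is almost purely
# shader (ALU) limited, the measured rate tracks a chip's peak FLOPS
# nearly linearly -- the point being made above.
def dgemm_gflops(n=64):
    a = [[random.random() for _ in range(n)] for _ in range(n)]
    b = [[random.random() for _ in range(n)] for _ in range(n)]
    t0 = time.perf_counter()
    c = [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
         for i in range(n)]
    dt = time.perf_counter() - t0
    return 2 * n**3 / dt / 1e9

print(f"effective rate: {dgemm_gflops():.4f} GFLOPS")
```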

What's important is that Volta makes this 40% speedup possible at the same power, and that's the kind of successor I would expect. 1.8x is too high and in no way possible by the same reasoning (pretty sure that's just marketing, like Vega, which uses GF's third FinFET process).

Oh, and because some people find it so important, hardware scheduling is back:
[Image: Volta hardware scheduling slide (C_fs13WUMAAdFAy.jpg:large)]


Found it on Twitter, from their deep dive into Volta, but they didn't publish it online in detail.
 

Glo.

Diamond Member
Apr 25, 2015
5,928
4,989
136
You are expecting too much, as I already wrote above. You don't get more than maybe a 5% gain in perf per FLOP in gaming. At least not from the shaders alone. Just look at the HPC numbers:

[Chart: V100 HPC benchmark speedups (hpc_perf_v100.png)]

DGEMM is almost purely shader-limited: it scales linearly with GFLOPS, and that's mostly what you will see in games, if other stuff like the front end isn't the limiting factor. The improvements only help in special cases like the FFT here. And gaming workloads are much simpler code-wise.

What's important is that Volta makes this 40% speedup possible at the same power, and that's the kind of successor I would expect. 1.8x is too high and in no way possible by the same reasoning (pretty sure that's just marketing, like Vega, which uses GF's third FinFET process).

Oh, and because some people find it so important, hardware scheduling is back:
[Image: Volta hardware scheduling slide (C_fs13WUMAAdFAy.jpg:large)]

Found it on Twitter, from their deep dive into Volta, but they didn't publish it online in detail.
That calculation of mine is wrong. However, you are also looking for clues in the wrong places. You compare the GP100 chip to the GV100 chip, and you've actually validated my point: core for core, clock for clock, GV100 has the same level of FP32 performance as GP100. It is just a much bigger chip; that's why we see a 50% increase on average, because that's what we should see.

But consumer Pascal GPUs are just Maxwell GPUs on the 16 nm process. The GP100 and GV100 chips will be at least 30% faster than them, clock for clock, core for core.
 

beginner99

Diamond Member
Jun 2, 2009
5,312
1,749
136
I think we should use one of these new GPUs and deep learning to analyze posts from Sweepr, xpea and other candidates and see whether they actually are from the same author.

To contribute to the topic: what decides whether the tensor cores get used? Is this transparent, or do I need to use CUDA, and special instructions at that? E.g., are software changes required?



This forum isn't for hunting socks or RBMs or shills or anyone else. Posting something topical doesn't excuse that.

AT Moderator ElFenix
 
Last edited by a moderator:

maddie

Diamond Member
Jul 18, 2010
5,085
5,410
136
A question.

When we discussed asynchronous compute as utilized by AMD, the refrain was that Nvidia did not get the same performance increase because they already had extremely good utilization of the cores, and not for other reasons that will not be repeated in this thread.

Now, we are hearing that they are getting this huge boost in utilization with Volta.

What is really happening?

I hope it's understood that you can't get more than 100% utilization, although by reading some comments one can conclude that a few believe this to be possible.
 

Malogeek

Golden Member
Mar 5, 2017
1,390
778
136
yaktribe.org
It is just a much bigger chip; that's why we see a 50% increase on average, because that's what we should see.
I don't think they're understanding this point. To go with the obvious car analogy: the cylinders perform the same, but you're comparing a V8 to a V10. I'm sure there will be performance improvements from a better scheduler and other possible refinements in the consumer version, but the point @Glo. is trying to make is that per-core FP32 performance should be about the same.
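The "same cylinders, just more of them" point can be checked from the published peaks: both chips work out to roughly 2 FP32 FLOPs (one FMA) per core per clock. A quick sketch (the ~1.48 / ~1.455 GHz boost clocks are the commonly cited figures):

```python
# FP32 FLOPs per CUDA core per clock, derived from published peak numbers:
#   GP100: 10.6 TFLOPS, 3584 cores, ~1.48 GHz boost
#   GV100: 15.0 TFLOPS, 5120 cores, ~1.455 GHz boost
def flops_per_core_per_clock(tflops, cores, clock_ghz):
    return tflops * 1e12 / (cores * clock_ghz * 1e9)

print(flops_per_core_per_clock(10.6, 3584, 1.48))   # ~2.0
print(flops_per_core_per_clock(15.0, 5120, 1.455))  # ~2.0
```

Both land at ~2, i.e. one FMA per core per clock: the extra throughput comes from having more cores, not faster ones.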
 

tamz_msc

Diamond Member
Jan 5, 2017
3,865
3,729
136
Even today's NVIDIA gaming GPUs have a few DP cores for SW compatibility reasons. The same will be true for the Tensor Cores.
There are no DP cores in consumer cards. What is indeed there is DP *capability*. Just Google the GP100 SM and GP104 SM diagrams; you'll see exactly what the difference is.
Also, why would TSMC and NV collaborate to build 12FFN for one product? This node was clearly designed to serve NVIDIA's GPU needs until the move to 7nm HPC.
There might be a number of reasons: an 815 mm^2 die not being possible on 16FF+, or NVIDIA expecting higher demand relative to GP100 and wanting to ensure supply. There is no fundamental reason why Volta cannot exist on 16FF+ for consumer variants.
 
Last edited:

Glo.

Diamond Member
Apr 25, 2015
5,928
4,989
136
but the point @Glo. is trying to make is that the performance per core for fp32 should be about the same.
The same as the GP100 chip. It will ultimately be faster, clock for clock, core for core, than the GP102, GP104, GP106, etc. chips.
 

tamz_msc

Diamond Member
Jan 5, 2017
3,865
3,729
136
I think we should use one of these new GPUs and deep learning to analyze posts from Sweepr, xpea and other candidates and see whether they actually are from the same author.

To contribute to the topic: what decides whether the tensor cores get used? Is this transparent, or do I need to use CUDA, and special instructions at that? E.g., are software changes required?
Think of it as an analogue to AVX but for GPUs.
 

raghu78

Diamond Member
Aug 23, 2012
4,093
1,475
136
Comparing GV100 (5120 cores) to GP100 (3584 cores), we see a 42.9% increase in CUDA cores. The FLOPS increase is 41.5%, from 10.6 to 15.0 TFLOPS. Core clocks have fallen only around 1.5%, and TDP is still 300W. This is an amazing engineering achievement, even considering the 33.6% die size increase from GP100 (610 sq mm) to GV100 (815 sq mm). Volta is a massive improvement in power efficiency, and I guess there are two parts to how Nvidia achieved that: architecture and the 12FFN process.
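Those percentages check out against the published GP100/GV100 figures; a quick sketch:

```python
# Generation-over-generation increases from the published spec sheet numbers.
def pct_increase(new, old):
    return (new / old - 1) * 100

print(f"CUDA cores:  +{pct_increase(5120, 3584):.1f}%")  # ~42.9%
print(f"FP32 TFLOPS: +{pct_increase(15.0, 10.6):.1f}%")  # ~41.5%
print(f"Die area:    +{pct_increase(815, 610):.1f}%")    # ~33.6%
```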

Looking forward to GV102, GV104 and GV106, we can expect the following:

GV102 (5376 cores) - 5120 enabled at launch. Performance 40-50% faster than the Titan Xp. Die size approx. 600 sq mm
GV104 (3584 cores) - 3584 enabled at launch. Performance 5-10% faster than the Titan Xp. Die size approx. 400 sq mm
GV106 (1792 cores) - 1792 enabled at launch. Performance equal to the GTX 1070 or slightly better. Die size approx. 250 sq mm

AMD will have a very tough time in 2018 against this stack. Nvidia could again get to 80+% market share in 2018.
 

tamz_msc

Diamond Member
Jan 5, 2017
3,865
3,729
136
http://www.anandtech.com/show/10325/the-nvidia-geforce-gtx-1080-and-1070-founders-edition-review/5





People here that don't know anything shouldn't be posting.

Volta GeForce consumer GPUs will have tensor cores for compatibility, so developers can buy cheap GeForce cards to code and debug, then take that code to a Tesla V100 for maximum throughput. It won't be fast, just enough for coding & debugging purposes.
People who don't grasp that the discussion was about dedicated FP64 hardware shouldn't derail the thread by bringing in FP16 and then using it as a justification for the inclusion of tensor cores in GV104.
 

Carfax83

Diamond Member
Nov 1, 2010
6,841
1,536
136
A question.

When we discussed asynchronous compute as utilized by AMD, the refrain was that Nvidia did not get the same performance increase because they already had extremely good utilization of the cores, and not for other reasons that will not be repeated in this thread.

Now, we are hearing that they are getting this huge boost in utilization with Volta.

What is really happening?

I don't really think you can correlate asynchronous compute in graphics workloads with Volta's MPS. MPS seems heavily geared towards running multiple HPC compute workloads with different kernels (perhaps a more refined version of Hyper-Q), which are far heavier than anything you will encounter in a gaming scenario. In those situations latency is the biggest bottleneck by far, so a hardware scheduler located on the GPU itself, rather than on the CPU, makes more sense.
 

lobz

Platinum Member
Feb 10, 2017
2,057
2,856
136
Well, NVIDIA has earned its credibility. Pascal delivered what NVIDIA said, Maxwell delivered what NVIDIA said, and Kepler delivered what NVIDIA said.

All solid releases, and it has only gotten better with each generation.
It can't be another Maxwell anyway; Maxwell was on the same process as Kepler. Different situations.

Sent from my VTR-L09 using Tapatalk
 

Samwell

Senior member
May 10, 2015
225
47
101
A question.

When we discussed asynchronous compute as utilized by AMD, the refrain was that Nvidia did not get the same performance increase because they already had extremely good utilization of the cores, and not for other reasons that will not be repeated in this thread.

Now, we are hearing that they are getting this huge boost in utilization with Volta.

What is really happening?

I hope it's understood that you can't get more than 100% utilization, although by reading some comments one can conclude that a few believe this to be possible.

That's what I've been trying to explain to Glo., but he doesn't believe me. GP100 has no better utilization than GP102 in most cases, aside from some register-heavy, harder algorithms. The same applies to Volta versus Pascal. Only in special cases will the SM changes make it faster; in many cases they will just make it easier for developers to reach maximum speed.
 

raghu78

Diamond Member
Aug 23, 2012
4,093
1,475
136
If 12 FFN vs 16 FF+ is similar to this:
http://www.anandtech.com/show/11337/samsung-and-tsmc-roadmaps-12-nm-8-nm-and-6-nm-added/4
You can calculate how much they got from process and how much from arch. changes :)

The 12FFC to 16FFC comparison is unlikely to be the same as 12FFN to 16FF+. The 20% area reduction and 25% power reduction from 16FFC to 12FFC come from the use of a 6-track library on 12FFC instead of a 7.5-track library on 16FFC.

https://community.cadence.com/cadence_blogs_8/b/breakfast-bytes/archive/2017/03/22/tsmc2

"12FFC is an optical shrink from 16FFC, but some of the logic density and power reduction comes from the low-track standard-cell libraries, so it is best not to just shrink at the die level. Instead, logic should be re-implemented with the new libraries, but SRAM, analog, and I/O just require recharacterization. Comparing N12 to N16 there is a 20% area reduction with the 6-track library, or a 14% area reduction with the 6-track turbo library. There is also a higher performance 9-track library that obviously gives up more area."
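The quoted 6-track figure follows directly from cell-height scaling; a quick sketch (assuming, as a simplification, that standard-cell area scales with track count alone, width unchanged):

```python
# Standard-cell height is measured in routing tracks, so moving from a
# 7.5-track library to a 6-track one shrinks cell area by the height ratio.
def area_reduction_pct(tracks_new, tracks_old):
    return (1 - tracks_new / tracks_old) * 100

print(f"{area_reduction_pct(6, 7.5):.0f}% area reduction")  # 20%, matching the quote
```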

Comparing GP100 to GV100, transistor density has barely changed at all, as these high-performance chips are most likely using high-performance 9-track libraries.

http://www.anandtech.com/show/10222/nvidia-announces-tesla-p100-accelerator-pascal-power-for-hpc
http://www.anandtech.com/show/11367...v100-gpu-and-tesla-v100-accelerator-announced

GP100 - 15,300 million transistors, 610 sq mm die = 25.08 million transistors per sq mm
GV100 - 21,100 million transistors, 815 sq mm die = 25.89 million transistors per sq mm

IMO the vast majority of Volta's power efficiency gains (>80% of them) are architecture-related.
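The near-flat density is easy to verify from the figures above (pure arithmetic, no assumptions beyond the quoted numbers):

```python
# Transistor density: transistor count (in millions) over die area (sq mm).
def density_mtr_per_mm2(mtransistors, die_mm2):
    return mtransistors / die_mm2

print(f"GP100: {density_mtr_per_mm2(15300, 610):.2f} MTr/sq mm")  # ~25.08
print(f"GV100: {density_mtr_per_mm2(21100, 815):.2f} MTr/sq mm")  # ~25.89
```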
 

crisium

Platinum Member
Aug 19, 2001
2,643
615
136
TFTCentral says the first 4K 144Hz monitors (120Hz mode for 4:4:4?) will have retail availability around September or October. Some of the more optimistic posters think GV104 could be out around then. That'd be a very nice combo this fall...
 

swilli89

Golden Member
Mar 23, 2010
1,558
1,181
136
So does 12FFN offer that 40%? GPUs are not CPUs: generally you get performance either by going wide (moar cores) or fast, or both. If 12FFN offers, say, 40% more density at the same power (but GV100 is 33% bigger than GP100, not all parts of a chip scale equally, and the expected clocks are a bit lower than GP100's, so it looks like they are running it nearer the sweet spot to get it under 300W), then Nvidia may use some of that increased density for smaller dies and only some for increased performance. So if they feel that +25% is enough for GV104 vs GP104, I don't see why they would offer +40%. In HPC they are worried about competition from Intel, FPGAs, etc., and hence are willing to eat some of the extra cost; in the desktop gaming market they might not be as willing.

Point being, the top-of-the-range GV100 is not necessarily an indication of the consumer cards. To maximise future profits they have to weigh the expected competition and the likelihood of current users upgrading against how long they expect 12FFN to last and how soon, and at what price, the next node arrives.
Well, it may ruffle some feathers, but it appears that 12nm is basically just 16nm that has matured ever so slightly. Companies are doing this more and more; look at "Kaby Lake" with its "enhanced" 14nm. This has always been a thing, but now that process changes are so far apart, companies have ramped up their marketing to make something "new". We have ALWAYS gotten CPUs back in later steppings that clocked higher with better power draw. It's just that now companies can't sell the same product for multiple years. It's like OEM rebranding, but for foundry processes ;)

TSMC "16nm" Density (GP100)
15.3 billion transistors / 610mm2 = 25.082 million transistors per sq mm (this absolutely blows my mind when I picture one single square millimeter.. but I digress...)

TSMC "12FFN" Density (GV100)
21.1 billion transistors / 815mm2 = 25.889 million transistors per sq mm

"12nm FFN"
A different chip design resulted in a 3% density improvement. Now, this is a NEW chip design, and every chip design has its own properties, as we all know: how well it clocks, how dense it is, etc. As this is a newer design, it stands to reason that the process is no denser than last year's 16nm and that nVidia simply designed this chip ever so slightly denser than the one it replaces.

As we can reasonably expect it to have the same electrical properties, since it's obviously just a matured 16nm process, the question becomes: has Nvidia redesigned the architecture itself to improve efficiency? It's definitely possible with their vast R&D. Time will tell.
 
Mar 10, 2006
11,715
2,012
126
Well, it may ruffle some feathers, but it appears that 12nm is basically just 16nm that has matured ever so slightly. Companies are doing this more and more; look at "Kaby Lake" with its "enhanced" 14nm. This has always been a thing, but now that process changes are so far apart, companies have ramped up their marketing to make something "new". We have ALWAYS gotten CPUs back in later steppings that clocked higher with better power draw. It's just that now companies can't sell the same product for multiple years. It's like OEM rebranding, but for foundry processes ;)

TSMC "16nm" Density (GP100)
15.3 billion transistors / 610mm2 = 25.082 million transistors per sq mm (this absolutely blows my mind when I picture one single square millimeter.. but I digress...)

TSMC "12FFN" Density (GV100)
21.1 billion transistors / 815mm2 = 25.889 million transistors per sq mm

"12nm FFN"
A different chip design resulted in a 3% density improvement. Now, this is a NEW chip design, and every chip design has its own properties, as we all know: how well it clocks, how dense it is, etc. As this is a newer design, it stands to reason that the process is no denser than last year's 16nm and that nVidia simply designed this chip ever so slightly denser than the one it replaces.

As we can reasonably expect it to have the same electrical properties, since it's obviously just a matured 16nm process, the question becomes: has Nvidia redesigned the architecture itself to improve efficiency? It's definitely possible with their vast R&D. Time will tell.

12FFN can be thought of as a kind of 16nm++: basically a performance-refined version of 16FF+ tailored to NVIDIA's needs.
 

swilli89

Golden Member
Mar 23, 2010
1,558
1,181
136

I see you beat me to this calculation :p

TFTCentral says the first 4K 144Hz monitors (120Hz mode for 4:4:4?) will have retail availability around September or October. Some of the more optimistic posters think GV104 could be out around then. That'd be a very nice combo this fall...
GV104 will most likely be out this year. The difference from the last cycle is that the 980 Ti came out in the summer of 2015 and the 1080 launched about 10 months later, which let the 980 Ti thrive and clear inventory after a strong sales period. This time the 1080 Ti only just launched in March, so I doubt nVidia will want to cannibalize its sales so quickly. I would wager GV104 drops around September-October, a Titan Volta at $1200+ before the holidays, and finally the GV102 Ti next spring.
 

OatisCampbell

Senior member
Jun 26, 2013
302
83
101
I see you beat me to this calculation :p


GV104 will most likely be out this year. The difference from the last cycle is that the 980 Ti came out in the summer of 2015 and the 1080 launched about 10 months later, which let the 980 Ti thrive and clear inventory after a strong sales period. This time the 1080 Ti only just launched in March, so I doubt nVidia will want to cannibalize its sales so quickly. I would wager GV104 drops around September-October, a Titan Volta at $1200+ before the holidays, and finally the GV102 Ti next spring.
Maybe.

I see the Ti parts as mopping up sales of parts on the way out, and as a way to get people who jumped on the midrange parts (1070/1080 in this case) to buy another card.

My guess is we get the $1200 part and the next "1080" in Feb/March 2018, and the Ti next summer. NV doesn't have any reason to release earlier in the absence of competition.

If Vega turns out to beat the 1080 Ti and 1080 (the high-end segments that matter), we might see parts earlier if possible.
 
Mar 10, 2006
11,715
2,012
126
I see you beat me to this calculation :p


GV104 will most likely be out this year. The difference from the last cycle is that the 980 Ti came out in the summer of 2015 and the 1080 launched about 10 months later, which let the 980 Ti thrive and clear inventory after a strong sales period. This time the 1080 Ti only just launched in March, so I doubt nVidia will want to cannibalize its sales so quickly. I would wager GV104 drops around September-October, a Titan Volta at $1200+ before the holidays, and finally the GV102 Ti next spring.

The Titan Volta doesn't come until 2018; that's the GPU using 384-bit, 16Gbps GDDR6 memory.
 

IronWing

No Lifer
Jul 20, 2001
71,884
31,963
136
Walk me through this like I forgot 90% of the math I ever learned. What is the benefit of the tensor units, and is the 4x4+C configuration of particular value for a specific problem that comes up a lot? Is 4x4 better than, say, 3x3 or 5x5? I'm just trying to understand the connection between tensor units and real-world problems.
 

tamz_msc

Diamond Member
Jan 5, 2017
3,865
3,729
136
Walk me through this like I forgot 90% of the math I ever learned. What is the benefit of the tensor units, and is the 4x4+C configuration of particular value for a specific problem that comes up a lot? Is 4x4 better than, say, 3x3 or 5x5? I'm just trying to understand the connection between tensor units and real-world problems.
These are for fused multiply-add, but with 4x4 matrices. It's 4x4 because most algorithms, like FFTs, are built around powers of two.
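As a concrete picture of the 4x4+C operation being described, here is a minimal emulation of the tensor-core dataflow (the FP16-input / FP32-accumulate precision behaviour is not modelled here; on real hardware this is reached via CUDA's WMMA API or cuBLAS):

```python
# A tensor core computes D = A*B + C on 4x4 tiles: A and B are FP16 inputs,
# C and D are FP32 accumulators -- 64 fused multiply-adds per operation.
# Pure-Python emulation of the dataflow, for illustration only.
def tensor_core_op(a, b, c):
    return [[sum(a[i][k] * b[k][j] for k in range(4)) + c[i][j]
             for j in range(4)] for i in range(4)]

A = [[1.0] * 4 for _ in range(4)]
B = [[1.0] * 4 for _ in range(4)]
C = [[0.5] * 4 for _ in range(4)]
D = tensor_core_op(A, B, C)
print(D[0][0])  # each output sums four 1*1 products, plus 0.5 -> 4.5
```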