NVIDIA GeForce 20 Series (Volta) to be released later this year - GV100 announced


xpea

Senior member
Feb 14, 2014
429
135
116
Glo. said:
Ok, and how does this affect FP32 performance? How will gamers benefit from this?

We are here to talk about game performance and GPU architectures, not brand cheerleading. You are trying to move the goalposts away from the merit of my posts to prove that GV100 is revolutionary. In FP32 it isn't. In DL it is a huge leap forward.

But 90% of the market is still FP32 workloads. Gaming is still FP32, and it is only now starting to move to FP16 in small steps.
You want to talk about FP32? OK, so let's see... in FP32, GV100 has 42% more FLOPS than GP100 at the same 300W TDP. And it includes the new scheduler, the new L0 cache and the new tensor cores. If we exclude the tensor cores, and maybe the L0 cache, which won't make it into consumer Volta, these missing features should counterbalance whatever the optimized 12FFN brings in power efficiency compared to 16FF+.
So we can safely say that consumer Volta will offer at least 40% more FP32 FLOPS than Pascal at the same power level. Not a small jump in performance for basically the same node...
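The arithmetic behind that ~40% figure can be checked from the launch specs Nvidia published for the two Tesla parts (FP32 cores × boost clock × 2 ops per clock for a fused multiply-add); a quick sketch:

```python
# Back-of-the-envelope peak FP32 throughput, from published launch
# specs: cores x boost clock x 2 (one fused multiply-add per cycle).
gp100_tflops = 3584 * 1480e6 * 2 / 1e12  # Tesla P100: ~10.6 TFLOPS
gv100_tflops = 5120 * 1455e6 * 2 / 1e12  # Tesla V100: ~14.9 TFLOPS
gain = gv100_tflops / gp100_tflops - 1
print(f"peak FP32 gain: {gain:.1%}")     # ~40%, both rated at 300 W
```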
 

tamz_msc

Diamond Member
Jan 5, 2017
3,758
3,577
136
xpea said:
You want to talk about FP32? OK, so let's see... in FP32, GV100 has 42% more FLOPS than GP100 at the same 300W TDP. And it includes the new scheduler, the new L0 cache and the new tensor cores. If we exclude the tensor cores, and maybe the L0 cache, which won't make it into consumer Volta, these missing features should counterbalance whatever the optimized 12FFN brings in power efficiency compared to 16FF+.
So we can safely say that consumer Volta will offer around 40% more FP32 FLOPS than Pascal at the same power level. Not a small jump in performance for basically the same node...

That is going to depend on whether consumer Volta will use 12FFN, and on the die size saved by shaving off the non-FP32 cores. Not saying that it isn't possible, but you're going to have to give up on a small die size if you want to fit more cores.
 

Glo.

Diamond Member
Apr 25, 2015
5,696
4,533
136
xpea said:
You want to talk about FP32? OK, so let's see... in FP32, GV100 has 42% more FLOPS than GP100 at the same 300W TDP. And it includes the new scheduler, the new L0 cache and the new tensor cores. If we exclude the tensor cores, and maybe the L0 cache, which won't make it into consumer Volta, these missing features should counterbalance whatever the optimized 12FFN brings in power efficiency compared to 16FF+.
So we can safely say that consumer Volta will offer at least 40% more FP32 FLOPS than Pascal at the same power level. Not a small jump in performance for basically the same node...

You are again trying to move the goalposts away from the merit of my posts.

CLOCK FOR CLOCK, CORE FOR CORE, in FP32 Volta has the same level of performance as the GP100 chip. It will be at least 30% faster, clock for clock, core for core, than consumer Pascal GPUs, but not faster than GP100.

This is not revolutionary. You quoted my post in the first place, which directly addressed that, tried to spin it out to the DL stuff as revolutionary, and right now have turned back to FP32. This is my last post on this matter to you.
 

Glo.

Diamond Member
Apr 25, 2015
5,696
4,533
136
tamz_msc said:
That is going to depend on whether consumer Volta will use 12FFN, and on the die size saved by shaving off the non-FP32 cores. Not saying that it isn't possible, but you're going to have to give up on a small die size if you want to fit more cores.

Exactly. Nvidia can also decide that the current die sizes and core counts are enough, if they just want to take the "at least 30%" increase in performance, in the same thermal envelope, from the changed high-level architecture layout. Yes, clock for clock, core for core, the GPUs will not use less power. But they will have more performance.
 

xpea

Senior member
Feb 14, 2014
429
135
116
Glo. said:
CLOCK FOR CLOCK, CORE FOR CORE, in FP32 Volta has the same level of performance as the GP100 chip.

Breaking news: a Fermi core, clock for clock, core for core, performs an ADD or MUL operation at the same speed as Volta, and even Polaris. But does Fermi offer the same performance as Volta? Of course not.
There is no such concept as an IPC gain per core in the GPU world. Not from Nvidia. Not from AMD. Gains come from moar cores, higher clock speeds and better core utilization.
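That claim is easy to illustrate numerically: peak FP32 throughput on every one of these architectures is just cores × clock × 2 (one FMA per core per cycle), so the per-core, per-clock figure never changes. A small sketch using published core counts and clocks (the Fermi entry uses its doubled "hot" shader clock):

```python
def peak_fp32_tflops(cores, clock_mhz, ops_per_clock=2):
    # One fused multiply-add per FP32 core per cycle = 2 FLOPS/clock,
    # on Fermi, Polaris and Volta alike.
    return cores * clock_mhz * 1e6 * ops_per_clock / 1e12

# Published core counts and (shader) clocks:
gpus = {
    "GTX 580 (Fermi)":    (512, 1544),   # hot clock
    "RX 480 (Polaris)":   (2304, 1266),
    "Tesla V100 (Volta)": (5120, 1455),
}
for name, (cores, mhz) in gpus.items():
    print(f"{name}: {peak_fp32_tflops(cores, mhz):.2f} peak TFLOPS")
```

The totals (roughly 1.6, 5.8 and 14.9 TFLOPS) differ by an order of magnitude even though the per-core math is identical, which is exactly the point about where GPU gains actually come from.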
 

Glo.

Diamond Member
Apr 25, 2015
5,696
4,533
136
xpea said:
Breaking news: a Fermi core, clock for clock, core for core, performs an ADD or MUL operation at the same speed as Volta, and even Polaris. But does Fermi offer the same performance as Volta? Of course not.
There is no such concept as an IPC gain per core in the GPU world. Not from Nvidia. Not from AMD. Gains come from moar cores, higher clock speeds and better core utilization.

Ok, I'm done.

I don't think you understand GPUs at all if you post something like this. I don't think you even understand what you are trying to discuss.

I will not respond to your posts anymore.
 

xpea

Senior member
Feb 14, 2014
429
135
116
tamz_msc said:
That is going to depend on whether consumer Volta will use 12FFN, and on the die size saved by shaving off the non-FP32 cores. Not saying that it isn't possible, but you're going to have to give up on a small die size if you want to fit more cores.

My contact told me just now that all Volta chips are made on 12FFN, including the consumer ones. I can even say that consumer Volta will have some tensor cores for compatibility reasons (it will be like FP64, where performance is crippled on GeForce cards).
So Nvidia has 3 choices:
1/ moar cores at the same TDP
2/ the same number of cores at a much reduced TDP
3/ a mix of the above
Looking at history, I think the market is mature and already well segmented. So I predict GV104 at around 180W with 40% moar FP32 cores than GP104.
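For scale, here is what that prediction would imply on paper; the GV104 numbers are the poster's guess plus an assumed GP104-like boost clock, not anything announced:

```python
# Hypothetical GV104 per the prediction above: 40% more FP32 cores
# than GP104 (GTX 1080: 2560 cores, 1733 MHz boost), same clocks.
# The resulting part is pure speculation, shown only for scale.
gp104_cores, boost_mhz = 2560, 1733
gv104_cores = int(gp104_cores * 1.4)               # 3584 cores
gv104_tflops = gv104_cores * boost_mhz * 1e6 * 2 / 1e12
print(f"hypothetical GV104: {gv104_cores} cores, "
      f"{gv104_tflops:.1f} peak FP32 TFLOPS")
```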
 

xpea

Senior member
Feb 14, 2014
429
135
116
Glo. said:
Ok, I'm done.

I don't think you understand GPUs at all if you post something like this. I don't think you even understand what you are trying to discuss.

I will not respond to your posts anymore.

I think the same about you. Welcome to my ignore list.
 

Glo.

Diamond Member
Apr 25, 2015
5,696
4,533
136
xpea said:
My contact told me just now that all Volta chips are made on 12FFN, including the consumer ones. I can even say that consumer Volta will have some tensor cores for compatibility reasons (it will be like FP64, where performance is crippled on GeForce cards).
So Nvidia has 3 choices:
1/ moar cores at the same TDP
2/ the same number of cores at a much reduced TDP
3/ a mix of the above
Looking at history, I think the market is mature and already well segmented. So I predict GV104 at around 180W with 40% moar FP32 cores than GP104.

Which shows that your contact has no clue about the stuff he is talking about. The cores will not have a reduced TDP unless they run at lower core clocks, which would mitigate the performance advantage the Volta architecture has over consumer Pascal GPUs in FP32 workloads.

If anyone is interested here is my analysis from another forum:
https://forums.macrumors.com/thread...ce-keynote-today.2043059/page-6#post-24568464

In essence: a GTX 2060 with 1280 CUDA cores and a 192-bit GDDR6 memory bus will have at least 30% higher performance than the GTX 1060, IN THE SAME THERMAL ENVELOPE (125W). It is all due to reduced starvation of the cores for resources.
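The memory side of that guess is straightforward to quantify. The 14 Gbps GDDR6 data rate below is an assumption (announced GDDR6 parts spanned roughly 12-16 Gbps); the GTX 1060 figures are its published specs:

```python
def bandwidth_gb_s(bus_width_bits, gbps_per_pin):
    # Peak memory bandwidth: bus width in bytes x per-pin data rate.
    return bus_width_bits / 8 * gbps_per_pin

gtx_1060 = bandwidth_gb_s(192, 8)    # GDDR5 @ 8 Gbps -> 192 GB/s
hypo_2060 = bandwidth_gb_s(192, 14)  # GDDR6 @ 14 Gbps -> 336 GB/s
print(gtx_1060, hypo_2060)           # a 75% bandwidth increase
```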

I have a strong suspicion that we have a shill in this thread...
 

tamz_msc

Diamond Member
Jan 5, 2017
3,758
3,577
136
xpea said:
My contact told me just now that all Volta chips are made on 12FFN, including the consumer ones. I can even say that consumer Volta will have some tensor cores for compatibility reasons (it will be like FP64, where performance is crippled on GeForce cards).
So Nvidia has 3 choices:
1/ moar cores at the same TDP
2/ the same number of cores at a much reduced TDP
3/ a mix of the above
Looking at history, I think the market is mature and already well segmented. So I predict GV104 at around 180W with 40% moar FP32 cores than GP104.

If your contact has said that consumer Volta, that is, gaming GPUs, will have tensor cores for 'compatibility' reasons, he/she has no idea what they're talking about.

Nvidia have learned their lesson after Kepler Titan.

As for whether it will be on 12FFN, I'll believe it when I see it.
 

xpea

Senior member
Feb 14, 2014
429
135
116
Interesting stuff about Volta:
“It has a completely different instruction set than Pascal,” remarked Bryan Catanzaro, vice president, Applied Deep Learning Research at Nvidia. “It’s fundamentally extremely different. Volta is not Pascal with Tensor Core thrown onto it – it’s a completely different processor.”

Catanzaro, who returned to Nvidia from Baidu six months ago, emphasized how the architectural changes wrought greater flexibility and power efficiency.

“It’s worth noting that Volta has the biggest change to the GPU threading model basically since I can remember, and I’ve been programming GPUs for a while,” he said. “With Volta we can actually have forward progress guarantees for threads inside the same warp even if they need to synchronize, which we have never been able to do before. This is going to enable a lot more interesting algorithms to be written using the GPU, so a lot of code that you just couldn’t write before because it potentially would hang the GPU based on that thread scheduling model is now possible. I’m pretty excited about that, especially for some sparser kinds of data analytics workloads; there’s a lot of use cases where we want to be collaborating between threads in more complicated ways and Volta has a thread scheduler that can accommodate that.

It’s actually pretty remarkable to me that we were able to get more flexibility and better performance-per-watt. Because I was really concerned when I heard that they were going to change the Volta thread scheduler that it was going to give up performance-per-watt, because the reason that the old one wasn’t as flexible is you get a lot of energy efficiency by ganging up threads together and having the capability to let the threads be more independent then makes me worried that performance-per-watt is going to be worse, but actually it got better, so that’s pretty exciting.”

Added Alben: “This was done through a combination of process and architectural changes but primarily architecture. This was a very significant rewrite of the processor architecture. The Tensor Core part is obviously very [significant] but even if you look at FP32 and FP64, we’re talking about 50 percent more performance in the same power budget as where we’re at with Pascal. Every few years, we say, hey we discovered something really cool. We basically discovered a new architectural approach we could pursue that unlocks even more power efficiency than we had previously. The Volta SM is a really ambitious design; there’s a lot of different elements in there, obviously Tensor Core is one part, but the architectural power efficiency is a big part of this design.”
Source: https://www.hpcwire.com/2017/05/10/nvidias-mammoth-volta-gpu-aims-high-ai-hpc/
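The forward-progress point in the Catanzaro quote can be illustrated with a toy model (plain Python with made-up scheduler functions, not real GPU semantics): under pre-Volta lockstep SIMT, one side of a divergent branch runs to completion before the other side starts, so a spin-wait on a flag set by the other side hangs; with Volta-style independent thread scheduling the two sides can interleave and the same code completes.

```python
def spin(flag):
    # One side of the divergent branch: busy-wait until the flag is set.
    while not flag["set"]:
        yield "spinning"

def raise_flag(flag):
    # The other side of the branch: set the flag, then finish.
    flag["set"] = True
    yield "raised"

def run_lockstep(max_steps=100):
    # Pre-Volta model: the first side runs to completion before the
    # second side ever starts, so the spinner never sees the flag.
    flag = {"set": False}
    spinner = spin(flag)
    for _ in range(max_steps):
        if next(spinner, None) is None:  # spinner finished (it can't)
            break
    else:
        return "hang"
    list(raise_flag(flag))               # would only run after the spin
    return "done"

def run_independent(max_steps=100):
    # Volta model: every thread is guaranteed forward progress, so a
    # round-robin scheduler interleaves the two sides of the branch.
    flag = {"set": False}
    live = [spin(flag), raise_flag(flag)]
    for _ in range(max_steps):
        live = [t for t in live if next(t, None) is not None]
        if not live:
            return "done"
    return "hang"

print(run_lockstep(), run_independent())  # hang done
```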
 

xpea

Senior member
Feb 14, 2014
429
135
116
tamz_msc said:
If your contact has said that consumer Volta, that is, gaming GPUs, will have tensor cores for 'compatibility' reasons, he/she has no idea what they're talking about.

Nvidia have learned their lesson after Kepler Titan.

As for whether it will be on 12FFN, I'll believe it when I see it.

For history's sake, I was the one who first leaked the 12nm node for Volta, the story that every single website reported a few months ago. So yes, my contact has been pretty reliable :)

edit for proof: https://www.fool.com/investing/2017/01/22/nvidia-corporation-volta-architecture-rumor-emerge.aspx
 

tamz_msc

Diamond Member
Jan 5, 2017
3,758
3,577
136
xpea said:
For history's sake, I was the one who first leaked the 12nm node for Volta, the story that every single website reported a few months ago. So yes, my contact has been pretty reliable :)

edit for proof: https://www.fool.com/investing/2017/01/22/nvidia-corporation-volta-architecture-rumor-emerge.aspx
I'll decide for myself how reliable your contact is when I see official specs of consumer Volta with tensor cores.

All your source has confirmed till now is that GV100 uses 12FFN.
 

nvgpu

Senior member
Sep 12, 2014
629
202
81
http://www.anandtech.com/show/10325/the-nvidia-geforce-gtx-1080-and-1070-founders-edition-review/5

To get right to the point then, each SM on GP104 only contains a single FP16x2 core. This core is in turn only used for executing native FP16 code (i.e. CUDA code). It’s not used for FP32, and it’s not used for FP16 on APIs that can’t access the FP16x2 cores (and as such promote FP16 ops to FP32).

FP64 has been treated as a Tesla feature since the beginning, and consumer parts have either shipped with a very small number of FP64 CUDA cores for binary compatibility purposes

People here that don't know anything shouldn't be posting.

Volta GeForce consumer GPUs will have tensor cores for compatibility, so developers can buy cheap GeForce cards to code and debug on, then take that code to a Tesla V100 for maximum throughput. It won't be fast, just enough for coding & debugging purposes.
 

Samwell

Senior member
May 10, 2015
225
47
101
Yes, it would make no sense not to have at least 1 tensor core in gaming products. I don't even need sources to be sure about that.
The big question is: what can you do with these tensor cores? Vega has 2xFP16 and I expect Volta to have the same. In deep learning, the tensor cores do FP16 calculations. Is this also possible for graphics FP16 calculations? If yes, then we should get 4-8 tensor cores in a 128-shader SMM (V100 has 8 TCs per 64 shaders). A limited amount would also be important for slower inference chips. You can't use GV100 for everything; there have to be cheap alternatives, like the Tesla P4 nowadays.

Tegra Xavier even has a specifically designed deep-learning ASIC inside it, so you would think it doesn't need the tensor cores. But additionally, it still has 8 TCs per 128 shaders.
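The 8-TCs-per-64-shaders ratio above lines up with V100's quoted deep-learning throughput: each tensor core performs a 4×4×4 matrix FMA per clock. A quick sanity check from the published specs:

```python
# Tesla V100 launch specs: 80 SMs, 8 tensor cores per SM, 1455 MHz
# boost. Each tensor core does a 4x4x4 FMA per clock
# = 64 FMAs = 128 FLOPS per clock.
sms, tc_per_sm, boost_mhz = 80, 8, 1455
flops_per_tc = 4 * 4 * 4 * 2
tensor_tflops = sms * tc_per_sm * flops_per_tc * boost_mhz * 1e6 / 1e12
print(f"{tensor_tflops:.0f} tensor TFLOPS")  # ~119, vs the quoted 120
```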
 
Mar 10, 2006
11,715
2,012
126
tamz_msc said:
If your contact has said that consumer Volta, that is, gaming GPUs, will have tensor cores for 'compatibility' reasons, he/she has no idea what they're talking about.

Nvidia have learned their lesson after Kepler Titan.

As for whether it will be on 12FFN, I'll believe it when I see it.

Even today's NVIDIA gaming GPUs have a few DP cores for SW compatibility reasons. The same will be true for the Tensor Cores.

Also, why would TSMC and NV collaborate to build 12FFN for one product? This node was clearly designed to serve NVIDIA's GPU needs until the move to 7nm HPC.
 
Mar 10, 2006
11,715
2,012
126
xpea said:
Breaking news: a Fermi core, clock for clock, core for core, performs an ADD or MUL operation at the same speed as Volta, and even Polaris. But does Fermi offer the same performance as Volta? Of course not.
There is no such concept as an IPC gain per core in the GPU world. Not from Nvidia. Not from AMD. Gains come from moar cores, higher clock speeds and better core utilization.

This is such a great point. Notice how NVIDIA said in the Volta paper that it "streamlined" the ISA and reduced instruction latencies? That's definitely going to give a nice boost in real-world performance.
 

Samwell

Senior member
May 10, 2015
225
47
101
This is such a great point. Notice how NVIDIA said in the Volta paper that it "streamlined" the ISA and reduced instruction latencies? That's definitely going to give a nice boost in real-world performance.

Not in graphics. Graphics speed per GFLOP is pretty much at a limit; there's not much left for improvements. That's why you see some shader-limited games scale almost with the theoretical GFLOPS of the cards, and Fury also performing well there. The same goes for many of the HPC applications where Nvidia published Volta results: there you just have GFLOP scaling. The many changes Nvidia made to the shaders matter more for cases/algorithms where it hasn't been possible to get near the theoretical GFLOPS. It will be possible to run more code on a GPU, code which was Larrabee-exclusive before.
 

KompuKare

Golden Member
Jul 28, 2009
1,013
924
136
xpea said:
You want to talk about FP32? OK, so let's see... in FP32, GV100 has 42% more FLOPS than GP100 at the same 300W TDP. And it includes the new scheduler, the new L0 cache and the new tensor cores. If we exclude the tensor cores, and maybe the L0 cache, which won't make it into consumer Volta, these missing features should counterbalance whatever the optimized 12FFN brings in power efficiency compared to 16FF+.
So we can safely say that consumer Volta will offer at least 40% more FP32 FLOPS than Pascal at the same power level. Not a small jump in performance for basically the same node...

So does 12FFN offer that 40%? Because GPUs are not CPUs: generally you get performance either by going wide (moar cores) or by going fast, or both. Suppose 12FFN offers, say, 40% more density at the same power (though GV100 is 33% bigger than GP100, not all parts of a chip scale equally, and the expected clocks are a bit lower than GP100's, so it looks like they are running it nearer the sweet spot to get it under 300W). Then it is possible that Nvidia will use some of that increased density for smaller dies and only some for increased performance. If they feel that +25% is enough for GV104 vs GP104, I don't see why they would offer +40%. In HPC they are worried about competition from Intel, FPGAs etc., and hence are willing to eat some of the extra costs; in the desktop gaming market they might not be as willing.
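The die-size aside checks out against the published figures (GP100 at 610 mm², GV100 at 815 mm²):

```python
# Published die sizes for the two flagship chips.
gp100_mm2, gv100_mm2 = 610, 815
growth = gv100_mm2 / gp100_mm2 - 1
print(f"GV100 is {growth:.1%} bigger than GP100")  # 33.6%
```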

Point being, the top-of-the-range GV100 is not necessarily an indication of the consumer cards. In terms of maximising future profits, they have to look at their expected competition and the likelihood of current users upgrading, versus how long they expect 12FFN to last and how soon, and at what price, the next node comes.
 

JoeRambo

Golden Member
Jun 13, 2013
1,814
2,105
136
This is such a great point. Notice how NVIDIA said in the Volta paper that it "streamlined" the ISA and reduced instruction latencies? That's definitely going to give a nice boost in real-world performance.

And there are nice changes in the SIMT model that will surely increase utilization and reduce pain for both developers and the driver team.
 