NVIDIA GeForce 20 Series (Volta) to be released later this year - GV100 announced


crisium

Platinum Member
Aug 19, 2001
2,643
615
136
No. Look for reviews of Maxwell GPUs, and compare to Kepler.

There are no Kepler and Maxwell cards with the same exact configuration to do 1:1 tests, sadly. That's what makes 7970 vs 380x vs 470 so nice.

Although I guess if a 760 (1152 cores) loses or ties to a 960 (1024 cores) at the same clocks, that would prove it. But no one ever did a matched-clocks test like that, to my knowledge. And then there are the memory compression improvements. How do we isolate those?

But then again we already know there are driver differences between the archs since Maxwell has overall improved relative to Kepler from 2014. Who knows if you can compare evenly. =\
 

Harmaaviini

Member
Dec 15, 2016
34
11
36
Yes, you are correct that Nvidia can still offer a rebranded Maxwell architecture as Volta, like they did with Pascal. But that would be just ridiculous.

If they want progress that comes from anywhere else, they have to use Volta or at least GP100 chip architecture.

How large a difference is there in engineering effort/cost between designing a gaming chip based on GV100 architecture and re-using Pascal/Maxwell like gaming architecture on this improved manufacturing process? What about time to market?
 

crisium

Platinum Member
Aug 19, 2001
2,643
615
136
It'd be worse for gaming than Titan Xp.

Why? 3584 cores. 720 GB/s bandwidth. At equal core clocks to Titan X (2016) and 1080 Ti it should win. Compared to the 3840-core TXp, perhaps not. It depends on this 64 shaders per L2 cache theory and whether it's bandwidth starved, right?

Am I missing something? Something about the ROPs? AT said 128 with a question mark...
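
To put a number on the "equal core clocks" comparison, here is a minimal sketch. It assumes the usual theoretical-peak formula (one FMA, i.e. 2 FLOPs, per core per clock) and a hypothetical common clock of 1.5 GHz; neither comes from a benchmark, and the function name is made up for illustration.

```python
def peak_fp32_tflops(cuda_cores: int, clock_ghz: float) -> float:
    """Theoretical peak FP32 rate, assuming one FMA (2 FLOPs) per core per clock."""
    return 2 * cuda_cores * clock_ghz / 1000.0


CLOCK_GHZ = 1.5  # hypothetical common clock, chosen only for illustration

cards = {
    "Quadro GP100": 3584,
    "Titan X (2016) / 1080 Ti": 3584,
    "Titan Xp": 3840,
}

for name, cores in cards.items():
    print(f"{name}: {peak_fp32_tflops(cores, CLOCK_GHZ):.2f} TFLOPS")

# Relative paper advantage of the 3840-core part at the same clock.
txp_advantage = cards["Titan Xp"] / cards["Quadro GP100"] - 1
print(f"Titan Xp paper advantage: {txp_advantage:.1%}")
```

On paper the 3584-core parts are identical at equal clocks, so any real difference would have to come from bandwidth, ROPs, or SM organisation, not from raw FLOPS.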
 

Glo.

Diamond Member
Apr 25, 2015
5,845
4,855
136
How large a difference is there in engineering effort/cost between designing a gaming chip based on GV100 architecture and re-using Pascal/Maxwell like gaming architecture on this improved manufacturing process? What about time to market?
Those questions should not be addressed to me.
Why? 3584 cores. 720 GB/s bandwidth. At equal core clocks to Titan X (2016) and 1080 Ti it should win. Compared to the 3840-core TXp, perhaps not. It depends on this 64 shaders per L2 cache theory and whether it's bandwidth starved, right?

Am I missing something? Something about the ROPs? AT said 128 with a question mark...
It will be much faster in gaming.

The GP100 chip is faster in FP32 compute benchmarks than the P6000. Why is that, do you think, when it has fewer cores and a lower core clock?
 

crisium

Platinum Member
Aug 19, 2001
2,643
615
136
The GP100 chip is faster in FP32 compute benchmarks than the P6000. Why is that, do you think, when it has fewer cores and a lower core clock?

Ah, links please. You may have found the smoking gun. Assuming ROPs and bandwidth aren't the defining factor...

AT said:
With 720GB/sec of memory bandwidth – thanks to 4 HBM2 stacks clocked at 1.4Gbps each – the Quadro GP100 has 66% more memory bandwidth than the Quadro P6000’s mere 432GB/sec. Coupled with what’s almost certainly a ROP count advantage – NVIDIA still hasn’t disclosed GP100’s ROP count, but based on what we know of GP102, 128 ROPs is a safe bet – and Quadro GP100’s pure pixel pushing power should be greater than even P6000 by around 22%. Given that CAD/CAE can be very pixel-bound, and this should be a tangible benefit for some Quadro customers.
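
The bandwidth figures in that quote can be checked with a quick sketch. The 1024-bit width per HBM2 stack and the P6000's 384-bit bus at 9 Gbps are assumptions taken from public spec sheets, not from the quote itself, and the helper name is made up.

```python
def bandwidth_gb_s(bus_width_bits: int, data_rate_gbps: float) -> float:
    """Peak memory bandwidth in GB/s: bus width (bits) x per-pin rate (Gbps) / 8."""
    return bus_width_bits * data_rate_gbps / 8


HBM2_STACK_WIDTH_BITS = 1024  # assumption: standard HBM2 stack interface width

# Quadro GP100: 4 HBM2 stacks at 1.4 Gbps per pin (from the quote).
gp100 = bandwidth_gb_s(4 * HBM2_STACK_WIDTH_BITS, 1.4)

# Quadro P6000: assumed 384-bit GDDR5X at 9 Gbps, which gives the quoted 432 GB/s.
p6000 = bandwidth_gb_s(384, 9.0)

advantage = gp100 / p6000 - 1
print(f"GP100: {gp100:.1f} GB/s, P6000: {p6000:.1f} GB/s, advantage: {advantage:.0%}")
```

The exact result for GP100 is a little under the rounded 720 GB/s in the article, which is why the advantage lands at roughly 66%.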
 

Glo.

Diamond Member
Apr 25, 2015
5,845
4,855
136
Ah, links please. You may have found the smoking gun.
I will not give you links. The tests were done by my friends from the industry, which I left a few months ago.

The difference is between 20 and 30% for the GPUs.
 

CatMerc

Golden Member
Jul 16, 2016
1,114
1,153
136
Those questions should not be addressed to me.

It will be much faster in gaming.

The GP100 chip is faster in FP32 compute benchmarks than the P6000. Why is that, do you think, when it has fewer cores and a lower core clock?
Might be why they kept 128 CC/SM with Pascal, so they could have a generational uplift for 2000 series. With GV100 they sidestepped the issue by just making a giant die, getting the generational uplift from there.
 

Glo.

Diamond Member
Apr 25, 2015
5,845
4,855
136
Might be why they kept 128 CC/SM with Pascal, so they could have a generational uplift for 2000 series. With GV100 they sidestepped the issue by just making a giant die, getting the generational uplift from there.
If they knew that they would not be able to increase core clocks or core counts for Volta chips within reasonable thermal envelopes, then you are correct.
 

Cookie Monster

Diamond Member
May 7, 2005
5,161
32
86
I thought both camps were already close to extracting the most they can from delta color compression. There is only so much you can compress, after all.

Not sure about AMD, since they are 1-2 generations behind nVIDIA in this regard.

However, it's not just about compressing (that is the method in this case); it's part of the whole 'discarding useless work' concept to minimise the use of the resources at hand.

There will always be opportunities to implement newer, innovative methods (most likely not disclosed to the public) that minimise resource usage within a GPU, whether that's bandwidth (e.g. delta compression) or something else.

On this note, I tend to think nVIDIA has been doing better than AMD in this area, since when you compare GPU architectures they have been able to do more with a lot less for a couple of generations now. But I don't blame AMD, given nVIDIA's massive R&D budget.
 

Harmaaviini

Member
Dec 15, 2016
34
11
36
Those questions should not be addressed to me.

Your post just seemed to sum up so precisely the sentiment that Volta gaming chips will resemble GV100 architecture rather than previous gaming architectures. So it seemed a logical starting point for my questions.
 

Glo.

Diamond Member
Apr 25, 2015
5,845
4,855
136
Your post just seemed to sum up so precisely the sentiment that Volta gaming chips will resemble GV100 architecture rather than previous gaming architectures. So it seemed a logical starting point for my questions.
It is the most logical thing, to assume.
 

Cookie Monster

Diamond Member
May 7, 2005
5,161
32
86
On topic with GV100, I think it's really impressive what they were able to do while staying in the same power envelope as the P100.

A 33% increase in CUDA core count along with tensor cores, TMUs and ROPs (they exist in the chip), other logic, schedulers etc. resulted in the die size being 33% larger, but performance is on average 50% faster, all within the same TDP!

The 12nm process should have helped them, but only a little imo. So in this case they were able to get a lot of performance out of the architecture, because they are meeting the power targets even with the increased core count (clocks are about the same). That's the impressive part.
 

Glo.

Diamond Member
Apr 25, 2015
5,845
4,855
136
On topic with GV100, I think it's really impressive what they were able to do while staying in the same power envelope as the P100.

A 33% increase in CUDA core count along with tensor cores, TMUs and ROPs (they exist in the chip), other logic, schedulers etc. resulted in the die size being 33% larger, but performance is on average 50% faster, all within the same TDP!

The 12nm process should have helped them, but only a little imo. So in this case they were able to get a lot of performance out of the architecture, because they are meeting the power targets even with the increased core count (clocks are about the same). That's the impressive part.
I would like to point out that the GP100 chip actually has a 235W TDP. Let's wait for GV100 to land in PCIe form factor to judge its efficiency, rather than rely on Nvidia marketing. Mezzanine GPUs really do have a 300W TDP, but that's because of standardized server rack mounting for fanless environments. Same thing with Vega in the MI25. It can have a 300W TDP, but the actual power consumption of the GPU can be much lower.
 

jpiniero

Lifer
Oct 1, 2010
15,551
6,026
136
A 33% increase in CUDA core count along with tensor cores, TMUs and ROPs (they exist in the chip), other logic, schedulers etc. resulted in the die size being 33% larger, but performance is on average 50% faster, all within the same TDP!

I guess if you put it that way. BTW, the full GV100 die has 5376 cores, so it's really 40% more cores on a 33% bigger die, and that includes the kitchen sink, etc. The shrink only seems to have given about 2-3% in density when you factor in the transistor count.

The shrink may have helped more on the voltage side.
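
For anyone who wants to check the arithmetic, a rough sketch follows. The die sizes (610 and 815 mm²) and transistor counts (15.3 and 21.1 billion) are the publicly quoted figures as far as I know and should be treated as assumptions here; the core counts are for the full dies.

```python
# Full-die figures; die area and transistor counts are assumed from public specs.
GP100 = {"cores": 3840, "die_mm2": 610.0, "transistors_bn": 15.3}
GV100 = {"cores": 5376, "die_mm2": 815.0, "transistors_bn": 21.1}


def density(chip: dict) -> float:
    """Transistor density in billions of transistors per mm^2."""
    return chip["transistors_bn"] / chip["die_mm2"]


core_gain = GV100["cores"] / GP100["cores"] - 1
area_gain = GV100["die_mm2"] / GP100["die_mm2"] - 1
density_gain = density(GV100) / density(GP100) - 1

print(f"cores:   +{core_gain:.1%}")
print(f"area:    +{area_gain:.1%}")
print(f"density: +{density_gain:.1%}")
```

With these inputs the density gain comes out at roughly 3%, so almost all of the extra transistors were paid for with extra area, not with the shrink.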

I would like to point out that the GP100 chip actually has a 235W TDP.

You mean 250, and the PCIe version has the clock speed cut vs the mezzanine version.
 

Harmaaviini

Member
Dec 15, 2016
34
11
36
It is the most logical thing, to assume.

Could you write out your logic? In my opinion the questions I asked are pertinent. For example if TSMC 12nm is just an improved 16nm then it could be fairly easy to manufacture existing Pascal gaming chips at the new node. I did quick googling: https://www.extremetech.com/computing/245880-rumor-nvidias-volta-built-tsmcs-new-12nm-process

"[O]ur strategy is continuously to improve every node in the performance, such as 28 nanometer. And are continuing to improve the 16 nanometers technology. And we have some very good progress, and you might call it the 12 nanometer because we’re improve in the density, classical density, performance and power consumption. Yes, we have that."

Perhaps it is a similar situation to what AMD did with Phenom II: continuous improvement of clock speeds as the process matured. In the best-case scenario, nVidia might be able to use the Pascal designs as they are, adding about a year's worth of small improvements. With a decent clock speed/efficiency bump they could basically do RX480->RX580, but perhaps with a bigger boost in performance if the process is good enough. Even something like a 15% boost across the board is basically a new product at each tier. We've seen smaller re-brands.

So that's why the question of engineering effort in the choice of architecture is important. The results of our respective logic depend on our premises. My logic itself is sound, I think.
 

Cookie Monster

Diamond Member
May 7, 2005
5,161
32
86
I guess if you put it that way. BTW, the full GV100 die has 5376 cores, so it's really 40% more cores on a 33% bigger die, and that includes the kitchen sink, etc. The shrink only seems to have given about 2-3% in density when you factor in the transistor count.

The shrink may have helped more on the voltage side.

Yeah, some good points. It's interesting that P100 also has 4 SMs disabled, just like the V100.

Now if we were extrapolating the possible specs of a GV104, the transition from GP100 to GP102 could give us a rough estimate.

GP100 is 610mm2 while GP102 is 471mm2. That's roughly a ~23% reduction, where the savings come from the removal (or heavy minimisation) of HPC-related features like FP64, FP16 + NVLink PHY. Not sure about the actual size of the HBM2 memory controllers compared to the 384bit GDDR5 controller, but I'm certain GV102 will be 384bit thanks to the GDDR6 memory.

Taking this into account (since both GPUs have the same number of SMs), I'm thinking that we are now looking at a GV102 at ~627mm2, depending on whether or not it's on 12nm FFN or 16nm FF+ (but it shouldn't make a huge difference imo).

Now the interesting part is that we can assume GV102 could potentially be at 256bit using GDDR6 to save die space and power while maintaining a healthy bandwidth of 512GB/s. If not, then it'd be at 768GB/s or lower, depending on how high they clock.

Assuming that the 42 SMs are packing 128 FP32 CUDA cores each (or 84 SMs with 64 cores each), with most likely some FP64/FP16/Tensor cores left for compatibility purposes (1:32 FP64 rate etc.), we could have a chip with 6 GPCs each containing 14 SMs, totalling 5376 CC, or 7 SMs per GPC if you partition them at 128 CC per SM.
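
The two ways of partitioning the same 5376 cores can be written out as a small sketch (plain multiplication; the function name is made up for illustration):

```python
def total_cuda_cores(gpcs: int, sms_per_gpc: int, cores_per_sm: int) -> int:
    """Total FP32 CUDA cores for a GPC x SM x core hierarchy."""
    return gpcs * sms_per_gpc * cores_per_sm


# GV100-style SMs: 64 cores each, 14 SMs per GPC, 84 SMs in total.
volta_style = total_cuda_cores(gpcs=6, sms_per_gpc=14, cores_per_sm=64)

# Pascal-gaming-style SMs: 128 cores each, 7 SMs per GPC, 42 SMs in total.
pascal_style = total_cuda_cores(gpcs=6, sms_per_gpc=7, cores_per_sm=128)

print(volta_style, pascal_style)
```

Both layouts give the same core count, so the core total alone doesn't tell you which SM design a gaming chip would use.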

Actually this could give us a clue that the SM design in Volta, even for consumer-level GPUs, is different from that of Pascal, and perhaps a big reason for that is the improved memory/cache systems along with logic/schedulers that enable much more thread parallelism within the cores.

A GV104 will be a 4 GPC version imo, i.e. a 256bit memory interface with either GDDR5X/5, 3584 CC (4 GPCs / 14 SMs per GPC / 64 CC per SM), 64 ROPs and other improvements. Possibly with a die size of around ~418mm2 (or smaller), based on GP102 vs GP104.
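
The die-size extrapolation above can be reproduced with a short sketch. It assumes GV100 is 815mm2 and GP104 is 314mm2 (public figures that are not stated in this post) and simply reuses the Pascal area ratios for Volta, which is a big assumption in itself.

```python
# Die sizes in mm^2. GP100 and GP102 are from the post; GP104 and GV100 are assumed.
GP100_MM2 = 610.0
GP102_MM2 = 471.0
GP104_MM2 = 314.0  # assumption: public GP104 die size
GV100_MM2 = 815.0  # assumption: public GV100 die size

# Area left after stripping HPC features, taken from the Pascal generation.
hpc_to_gaming_ratio = GP102_MM2 / GP100_MM2

# Area ratio between the x04 and x02 gaming dies, also taken from Pascal.
x04_to_x02_ratio = GP104_MM2 / GP102_MM2

gv102_estimate = GV100_MM2 * hpc_to_gaming_ratio
gv104_estimate = gv102_estimate * x04_to_x02_ratio

print(f"HPC -> gaming area ratio: {hpc_to_gaming_ratio:.3f}")
print(f"GV102 estimate: ~{gv102_estimate:.0f} mm^2")
print(f"GV104 estimate: ~{gv104_estimate:.0f} mm^2")
```

The results land within a couple of mm2 of the ~627 and ~418 figures, with the small gap coming from rounding the reduction to 23%.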

It all depends on how many CCs the GeForce GPUs will have. They've always targeted around 290-340mm2 dies with their x70/x80 SKUs, but Volta might break that tradition, possibly due to the process limitation. If not, we may have a GPU that's less spec'd but at least somewhat compensated for by architectural improvements.

Food for thought.
 

itsmydamnation

Platinum Member
Feb 6, 2011
2,984
3,671
136
My point was that as long as AMD still lags significantly behind NVidia in the performance per watt category, they won't be able to catch up and reach parity. Take the recently released Titan Xp for example. Its' advertised Tflop rating is 12, but in reality it's closer to, or slightly above 13 due to boost. And this is in a 250w TDP envelope.
That's fine, but it's really just throwaway words. What you need to look at is what causes high energy usage.
1. Data movement
2. pushing the voltage/clocks/circuit curve
3. executing things you don't need to
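
To illustrate point 2, here is a minimal sketch using the textbook first-order approximation that dynamic power scales with voltage squared times frequency. This is a simplification I'm assuming for illustration (it ignores leakage and everything else), and the 10% figures are made up.

```python
def relative_dynamic_power(voltage_scale: float, clock_scale: float) -> float:
    """First-order CMOS dynamic power scaling: P ~ C * V^2 * f (capacitance held constant)."""
    return voltage_scale ** 2 * clock_scale


# Hypothetical case: a 10% clock bump that needs 10% more voltage to stay stable.
clock_scale = 1.10
voltage_scale = 1.10

power = relative_dynamic_power(voltage_scale, clock_scale)
perf_per_watt = clock_scale / power  # assumes performance scales linearly with clock

print(f"power: x{power:.3f}, perf/W: x{perf_per_watt:.3f}")
```

Under these assumptions a 10% clock gain costs about 33% more power, which is why pushing up the curve hurts performance per watt so quickly.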

To hit 12.5 TFlops at 300w, AMD will likely require an AiO cooler a la Fury X, which just emphasizes what I'm saying.
Except the part where it's stated that they are DC passively cooled parts, which means they have the shared cooling of the chassis fans.

I'll give credit to AMD though, they have certainly come a long way since Fiji when it comes to performance per watt, and with Vega, I hope they take it even further. But they will need much more to be on somewhat equal footing with NVidia.

So
1. If they actually increased GCN wavefront occupancy (higher IPC), that's more performance for no power cost (see the GCN developer docs to understand how GCN executes data)
2. If they actually have refactored the pipelines to increase clocks, that will be more performance for no power cost, or the same perf for less power
3. If they actually have the ROP cache in L2, that is more performance and less power at the same clock

You then have the other stuff, where we have no idea what/where it will be used: the TBR, primitive shaders, cache controller (if it actually helps prefetching into GPU L2 etc., that could save a lot of power).

Look at what changed between Kepler and Maxwell and what didn't change on AMD's side: NV refactored their "CUDA cores" for higher clocks and refactored their ROPs for much higher efficiency, and that's the point where they pulled away from AMD. NV always had a geometry perf advantage, which AMD finally fixed in Polaris.

I think a lot of the issue with Polaris was that people expected the things described in some of the new GCN-based patents to be in it, and none of them were. But given the much more explicit slideware for Vega, it looks like they will be in Vega. The question is which ideas, and to what degree (in Zen the ideas ended up in a more advanced form than what was described in the patents).

The proof will be in the pudding and we will have to wait and see. But it was the same over the last 2 years in the Zen threads, when every troll was running with Sandy Bridge IPC and a max 3.0GHz clock. When you looked at where the architectural issues were, at the published information (compiler notes/slideware etc.) and at the available patents, there was no logic to that position. To me, I see much of the same stuff going on here as well.

We will have to wait and see what GV102, 104, 106, 107, 108 etc. bring, which could change this calculation, but I think at this point it's obvious that the Gx100s are now a significant departure in terms of ALU design from the rest of the GeForce stack.
 
May 11, 2008
20,873
1,197
126
We will have to wait and see what GV102, 104, 106, 107, 108 etc. bring, which could change this calculation, but I think at this point it's obvious that the Gx100s are now a significant departure in terms of ALU design from the rest of the GeForce stack.

When it comes to AMD, are the different GCN versions not backward compatible?
And how is this for Nvidia? Do the CUDA cores not need to be backward compatible?
If they need to be backward compatible, it may be difficult to do a radical new ALU design and keep the efficiency Pascal has.
Then it would be more a matter of architectural tweaks here and there to alleviate bottlenecks and to increase clocks at the same energy consumption.
So instead it would be more SMs and higher clocks, combined with tweaks, to get more FLOPS throughput.
I can imagine this would need to be the case for the industry cards like GP100 and GV100.
But for a consumer graphics card, I wonder if that needs to be the case. As seen with all the architectural changes in GPU design over the years, a radical change would require new drivers and new shader compilers; everything has to be redone to get performance, which will take some time.

Compatibility, with more SMs, higher clocks and tweaks, is the best path for consumer Volta GPUs.
It costs less in development and gives a shorter time to market.