[vrworld] Pascal Secrets: What makes Nvidia Geforce GTX 1080 so fast?

Azix

Golden Member
Apr 18, 2014
1,438
67
91
In our initial talks with Nvidia and their partners, we learned that the GeForce GTX 1080 is coming to market in several shapes:

GeForce GTX 1080 8GB
GeForce GTX 1080 Founders Edition
GeForce GTX 1080 Air Overclocked Edition
GeForce GTX 1080 Liquid Cooled Edition

In the search for absolute performance per transistor, Nvidia revised how their Streaming Multiprocessor works. When we compare GM200 versus GP100 clock for clock, Pascal (slightly) lags behind Maxwell. This change to a more granular architecture was made in order to deliver higher clocks and more performance. Splitting the single Maxwell SM into two, and doubling the amount of shared memory, warps, and registers, enabled the FP32 and FP64 cores to operate with previously unseen efficiency. For GP104, Nvidia disabled/removed the FP64 units, reducing the double-precision compute performance to a meaningless number, just like its predecessors.
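To put rough numbers on that split, here is a quick sanity check against the public per-SM figures from the Maxwell and GP100 whitepapers (those figures are an assumption here, not data from this article), written as CUDA C++ host code:

```cuda
// Sanity-check sketch (assumed whitepaper figures, not from the article):
// one Maxwell SMM = 128 cores, 96 KB shared memory, 64K 32-bit registers,
// 64 resident warps; one GP100 SM = 64 cores, 64 KB, 64K regs, 64 warps.
#include <cstdio>

int main() {
    // Compare equal slices of 128 CUDA cores: 1 Maxwell SM vs 2 GP100 SMs.
    printf("per 128 cores         Maxwell SM   2x GP100 SM\n");
    printf("shared memory (KB)    %10d   %11d\n", 96, 2 * 64);
    printf("32-bit registers (K)  %10d   %11d\n", 64, 2 * 64);
    printf("resident warps        %10d   %11d\n", 64, 2 * 64);
    return 0;
}
```

Per 128 cores, registers and resident warps double, while shared memory grows by about a third, so "doubling" is approximate on that last count.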

GP100: 15.3 billion transistors, 3840 cores, 60 SM, 4096-bit memory, 1328 MHz GPU clock
GP104: 7.2 billion transistors, 2560 cores, 40 SM, 256-bit memory, 1660 MHz GPU clock
What remains is single-precision (FP32) performance, which stands at 9 TFLOPS. While the GP100 chip needs to boost to 1.48 GHz in order to deliver 10.6 TFLOPS, GP104 clocks up to 1.73 GHz, and that's not the end. If you clock the GTX 1080 to 2.1 GHz, which is achievable on air, you will go past the GP100. We can already see the developers and scientists who need single-precision performance placing orders for air- and liquid-cooled GTX 1080s.
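Those TFLOPS figures fall straight out of cores x 2 FLOPs (one FMA) x clock. A quick sketch reproducing them; the 3584-core count is Tesla P100's shipping configuration (56 of 60 SMs enabled), which the article does not state:

```cuda
// Back-of-envelope FP32 throughput: cores x 2 (one FMA = 2 FLOPs) x clock.
#include <cstdio>

static double tflops(double cores, double ghz) {
    return cores * 2.0 * ghz / 1000.0;
}

int main() {
    printf("GTX 1080 @ 1.73 GHz:   %.1f TFLOPS\n", tflops(2560, 1.733)); //  ~8.9
    printf("GTX 1080 @ 2.10 GHz:   %.1f TFLOPS\n", tflops(2560, 2.10));  // ~10.8
    printf("Tesla P100 @ 1.48 GHz: %.1f TFLOPS\n", tflops(3584, 1.48));  // ~10.6
    return 0;
}
```

At 2.1 GHz, the 2560-core GP104 does indeed edge past the 10.6 TFLOPS quoted for GP100.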

For DirectX 12 and VR, the term Asynchronous Compute was thrown around, especially since AMD Radeon-based cards were beating Nvidia GeForce cards in DirectX 12 titles such as Ashes of the Singularity and Rise of the Tomb Raider. We were told that the Pascal architecture doesn't have Asynchronous Compute, but that there are some aspects of this feature which qualified the card for the DirectX '12_1' feature level.

However, DX12 titles face another battle altogether, and that is delivering a great gaming experience. This is something where titles such as Gears of War Ultimate Edition or Quantum Break failed entirely, as Microsoft 'screwed the pooch' with disastrous conversions and limitations set forth by the Windows Store. Tim Sweeney even wrote an in-depth column in The Guardian stating what's wrong with Microsoft. These days, game developers work hand in hand with both AMD and Nvidia in order to extract as much performance out of DirectX 12 as possible, which is needed for challenging VR environments.

http://vrworld.com/2016/05/10/pascal-secrets-nvidia-geforce-gtx-1080/

The slight decrease in IPC from Maxwell to Pascal might explain some of the discrepancy between performance and clocks.
 
Last edited:

Ancalagon44

Diamond Member
Feb 17, 2010
3,274
202
106
Bolded part is not true.

The Maxwell cards do have Asynchronous Compute Engines - they just aren't as good at it as AMD cards are. No reason to suspect Nvidia would remove them from Pascal.

EDIT:

Look at the table near the bottom of this page.
 
Last edited:

Azix

Golden Member
Apr 18, 2014
1,438
67
91
Bolded part is not true.

The Maxwell cards do have Asynchronous Compute Engines - they just aren't as good at it as AMD cards are. No reason to suspect Nvidia would remove them from Pascal.

EDIT:

Look at the table near the bottom of this page.

If that's what you're saying then you aren't saying anything new. I've been through all that and it's obvious that Maxwell doesn't have ACEs, nor can it handle async compute.

Of course, we are still waiting on an Nvidia driver update to be sure...
 

.vodka

Golden Member
Dec 5, 2014
1,203
1,537
136
The main limiting factor for overclocking beyond 2.2 GHz is 225 W, which is how much the board can officially pull from the power circuitry: 75 W from the motherboard slot and 150 W through the 8-pin PEG connector. However, there are power supply manufacturers which provide more juice per rail, and we've seen a single 8-pin connector deliver 225 W on its own.

Still, partners such as ASUS, Colorful, EVGA, Galax, Gigabyte, and MSI are preparing custom boards with two or three 8-pin connectors. According to our sources, reaching 2.5 GHz using a liquid cooling setup such as a Corsair H115i or an EK Waterblocks loop should not be too much of a hassle.
That slight perf/clock decrease plus the node shrink has given them around 1 GHz extra clock speed vs GM204, if that 2.5 GHz quote turns out to be true... the NetBurst/Bulldozer philosophy seems to work for GPUs. You'll be at 250-300 W, just like on GM204/Fiji, yet blowing them both out of the water at the same time.

I wonder how GP100 overclocks. If GM204 is anything to go by, it'll be just as impressive in that regard.

If that's what you're saying then you aren't saying anything new. I've been through all that and it's obvious that Maxwell doesn't have ACEs, nor can it handle async compute.

Of course, we are still waiting on an Nvidia driver update to be sure...

This. For all intents and purposes, the feature is broken on nV hardware so far, relative to AMD's implementation.
 
Last edited:

Ancalagon44

Diamond Member
Feb 17, 2010
3,274
202
106
If that's what you're saying then you aren't saying anything new. I've been through all that and it's obvious that Maxwell doesn't have ACEs, nor can it handle async compute.

Of course, we are still waiting on an Nvidia driver update to be sure...

See here

EDIT:

Also

From that article:
Keep in mind, however, that even Maxwell featured Asynchronous Compute on paper. Unfortunately, expensive software-based context switching had to be employed before it could be used (since Maxwell did not have a dedicated hardware scheduler like AMD's GCN), which resulted in lowered performance on Maxwell-based graphics cards.

So, it isn't true to say that Maxwell has no ACE. It just lacks a hardware scheduler.
 
Last edited:

Azix

Golden Member
Apr 18, 2014
1,438
67
91
That slight perf/clock decrease plus the node shrink has given them around 1 GHz extra clock speed vs GM204, if that 2.5 GHz quote turns out to be true... the NetBurst/Bulldozer philosophy seems to work for GPUs. You'll be at 250-300 W, just like on GM204/Fiji, yet blowing them both out of the water at the same time.

I wonder how GP100 overclocks. If GM204 is anything to go by, it'll be just as impressive in that regard.

OC scaling should be worse than Maxwell's, though.
 

Flapdrol1337

Golden Member
May 21, 2014
1,677
93
91
From what I understand, the problem with async and Maxwell is that the whole GPU has to be in either graphics or compute mode, and there's a penalty when switching.

If Pascal can handle compute and graphics side by side, it will be able to benefit from async, although probably not as much as GCN.

Anyway, async is nice, but if you look at tests in Ashes of the Singularity, AMD cards are only 11% faster with async on. It's only part of the reason why they perform better than Nvidia under DX12 in Ashes.
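To illustrate that queue model loosely in CUDA terms: independent work submitted on separate streams can overlap on hardware that runs queues concurrently, or serialize on hardware that has to switch modes. A hypothetical sketch (the kernel and sizes are made up, and compute streams only stand in for DX12's graphics + compute queues):

```cuda
// Hypothetical sketch: two independent kernels issued on separate CUDA
// streams. Hardware that can run queues concurrently may overlap them;
// hardware that must context-switch runs them back to back.
#include <cuda_runtime.h>

__global__ void busyWork(float *data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        float v = data[i];
        for (int k = 0; k < 4096; ++k) v = v * 1.0001f + 0.0001f; // burn cycles
        data[i] = v;
    }
}

int main() {
    const int n = 1 << 20;
    float *a, *b;
    cudaMalloc(&a, n * sizeof(float));
    cudaMalloc(&b, n * sizeof(float));

    cudaStream_t s1, s2;
    cudaStreamCreate(&s1);
    cudaStreamCreate(&s2);

    // Two independent queues, analogous to a graphics queue plus a compute
    // queue in DX12. Whether these overlap is up to the hardware/driver.
    busyWork<<<(n + 255) / 256, 256, 0, s1>>>(a, n);
    busyWork<<<(n + 255) / 256, 256, 0, s2>>>(b, n);
    cudaDeviceSynchronize();

    cudaStreamDestroy(s1); cudaStreamDestroy(s2);
    cudaFree(a); cudaFree(b);
    return 0;
}
```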
 

AtenRa

Lifer
Feb 2, 2009
14,003
3,362
136
From what I understand, the problem with async and Maxwell is that the whole GPU has to be in either graphics or compute mode, and there's a penalty when switching.

If Pascal can handle compute and graphics side by side, it will be able to benefit from async, although probably not as much as GCN.

Anyway, async is nice, but if you look at tests in Ashes of the Singularity, AMD cards are only 11% faster with async on. It's only part of the reason why they perform better than Nvidia under DX12 in Ashes.

Yeap, but having a 10-15% increase in performance allows you to compete against the competition with a smaller die, thus lower cost, lower TDP, etc.

We will see a lot higher performance from Pascal vs Maxwell in DX12 games, but mostly in GameWorks titles.
 

ShintaiDK

Lifer
Apr 22, 2012
20,378
145
106
The article is from the same person who claimed the GTX 1080 was a 1920 SP part ;)

And it's already in the Nvidia Pascal thread.
 

airfathaaaaa

Senior member
Feb 12, 2016
692
12
81
See here

EDIT:

Also

From that article:


So, it isn't true to say that Maxwell has no ACE. It just lacks a hardware scheduler.
It lacks quite a lot for having any serious async compute capability:
1) a hardware scheduler (it was the main reason AMD had such a high TDP in the first place)
2) more pipelines to feed the ACEs
3) ACEs
4) flexible cores that can jump/cycle workloads mid-cycle

It's pretty obvious that they're going to try to brute-force it this time around instead of actually having it.
 

TheELF

Diamond Member
Dec 22, 2012
4,027
753
126
It lacks quite a lot for having any serious async compute capability:
1) a hardware scheduler (it was the main reason AMD had such a high TDP in the first place)
2) more pipelines to feed the ACEs
3) ACEs
4) flexible cores that can jump/cycle workloads mid-cycle

It's pretty obvious that they're going to try to brute-force it this time around instead of actually having it.
Getting rid of the high context-switch cost is all they need to do, and from what I heard, they stated that they did it.
No more regression from DX11 to DX12... end of complaints.
 

airfathaaaaa

Senior member
Feb 12, 2016
692
12
81
Getting rid of the high context-switch cost is all they need to do, and from what I heard, they stated that they did it.
No more regression from DX11 to DX12... end of complaints.

Yeah, like the context switch was the only problem holding back the Maxwell cards...
 

kraatus77

Senior member
Aug 26, 2015
266
59
101
Getting rid of the high context-switch cost is all they need to do, and from what I heard, they stated that they did it.
No more regression from DX11 to DX12... end of complaints.
Two things: 1. regression, 2. performance boost.

Pascal only solves the regression problem. ACEs are still needed to get the performance boost, and Pascal doesn't have them.
 

guskline

Diamond Member
Apr 17, 2006
5,338
476
126
In response to the OP's thread title: a shrink from 28 nm to 16 nm alone was expected to really improve performance, and it apparently did, while sipping less energy.

For those of us with GTX 980 Tis and Titan Xs, we were at the top of the heap for performance in most games in the 28 nm era of GPU chips, but now we have to pass the baton to the new chips on the block.

Too bad Neil Sedaka couldn't write a song to make us happy!

On a serious note, the question is: do you "dump" your GTX 980 Ti for a 1080 and stay on the never-ending upgrade trail?

In the forums there appear to be tomes on this. To each her/his own.

I run two rigs, one with Nvidia and one with AMD, and my goal is to replace dual cards with single cards while still showing a performance boost.

So I will focus on an AMD card powerful enough to replace my two R9 290s. I replaced two GTX 670s with the GTX 980 Ti. Mission accomplished on the Nvidia side, but not forever. This cycle I go AMD, probably Big Vega.
 
Last edited:

Mahigan

Senior member
Aug 22, 2015
573
0
0
the search for absolute performance per transistor, Nvidia revised how their Streaming Multiprocessor works. When we compare GM200 versus GP100 clock for clock, Pascal (slightly) lags behind Maxwell. This change to a more granular architecture was made in order to deliver higher clocks and more performance. Splitting the single Maxwell SM into two, and doubling the amount of shared memory, warps, and registers, enabled the FP32 and FP64 cores to operate with previously unseen efficiency. For GP104, Nvidia disabled/removed the FP64 units, reducing the double-precision compute performance to a meaningless number, just like its predecessors

See, I was right when I stated that past 16 concurrent warps per SM, Maxwell would spill into L2 cache.

Folks who keep criticizing me don't understand that I used to do this for a living.
 

Mahigan

Senior member
Aug 22, 2015
573
0
0
Let me explain again...

[attached image]

The reason behind Maxwell having issues with more than 16 concurrent warps per SM was the shared L1/texture cache and the shared memory per SM.

You have twice as many texture mapping units and CUDA cores per SM with Maxwell as you do with Pascal, despite Maxwell having the same amount of cache per SM as Pascal.

The end result was a spill-over into L2 cache, which brought about a sharp drop in performance when Maxwell was pushed compute-wise.

This spill into L2 cache also affected the overall ROP performance, as I've mentioned before as well.

What NVIDIA have done is they've delivered a more GCN-like architecture with Pascal, which I alluded to as being the solution prior to Pascal being announced.

So yeah... I called it :)
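For what it's worth, the resident-warp budget under discussion can be queried directly: the CUDA occupancy API reports how many blocks of a given kernel fit on one SM under its register, shared-memory, and warp-slot limits. A minimal sketch (the kernel is a stand-in; this shows the budget, not the claimed L2 spill):

```cuda
// Minimal occupancy sketch: ask the runtime how many blocks of a kernel can
// be resident on one SM; resident warps = blocks x (threads per block / 32).
#include <cstdio>
#include <cuda_runtime.h>

__global__ void standIn(float *d, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) d[i] += 1.0f;  // trivial body; real kernels shift the limits
}

int main() {
    int blocks = 0;
    const int blockSize = 256;  // 8 warps per block
    cudaOccupancyMaxActiveBlocksPerMultiprocessor(&blocks, standIn, blockSize, 0);
    printf("blocks/SM: %d -> resident warps/SM: %d\n",
           blocks, blocks * (blockSize / 32));
    return 0;
}
```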
 

guskline

Diamond Member
Apr 17, 2006
5,338
476
126
Mahigan, first, I tip my hat to you! :thumbsup::D

How does AMD counter punch?
 
Last edited:

Killer_Croc

Member
Jan 4, 2016
29
0
0
Let me explain again...

[attached image]

The reason behind Maxwell having issues with more than 16 concurrent warps per SM was the shared L1/texture cache and the shared memory per SM.

You have twice as many texture mapping units and CUDA cores per SM with Maxwell as you do with Pascal, despite Maxwell having the same amount of cache per SM as Pascal.

The end result was a spill-over into L2 cache, which brought about a sharp drop in performance when Maxwell was pushed compute-wise.

This spill into L2 cache also affected the overall ROP performance, as I've mentioned before as well.

What NVIDIA have done is they've delivered a more GCN-like architecture with Pascal, which I alluded to as being the solution prior to Pascal being announced.

So yeah... I called it :)

So from looking at that graph and chart, Fermi isn't affected by it, but Kepler is affected as well?

Does that also explain why the GTX 970 struggled with VRAM issues?
 

TheELF

Diamond Member
Dec 22, 2012
4,027
753
126
Two things: 1. regression, 2. performance boost.

Pascal only solves the regression problem. ACEs are still needed to get the performance boost, and Pascal doesn't have them.

There is plenty of performance boost with Nvidia in the scenarios DX12 was made for, small CPU cores vs. huge GPUs. Look up the benches. Of course you don't get a boost when you bench with a single-core monster that just brute-forces through everything on a single thread (DX11).
 

airfathaaaaa

Senior member
Feb 12, 2016
692
12
81
There is plenty of performance boost with Nvidia in the scenarios DX12 was made for, small CPU cores vs. huge GPUs. Look up the benches. Of course you don't get a boost when you bench with a single-core monster that just brute-forces through everything on a single thread (DX11).
So far, from what we have seen, Nvidia gains 1-2% going from DX11 to DX12.
If the DX11 performance of the 1080 doesn't make a similar jump on DX12, then it's all once more bogus.
 

kraatus77

Senior member
Aug 26, 2015
266
59
101
There is plenty of performance boost with Nvidia in the scenarios DX12 was made for, small CPU cores vs. huge GPUs. Look up the benches. Of course you don't get a boost when you bench with a single-core monster that just brute-forces through everything on a single thread (DX11).
It could have gotten another 20% boost on top of whatever performance it has if it had ACEs, but it doesn't have those. So no 20% boost. Understand?

100+20 is more than 100, but 100+40 is more than both 100 and 120. Get it?
 

flopper

Senior member
Dec 16, 2005
739
19
76
Let me explain again...



What NVIDIA have done is they've delivered a more GCN-like architecture with Pascal, which I alluded to as being the solution prior to Pascal being announced.

So yeah... I called it :)

Yeah, more AMD-like.
Cookies in the mail.
 

boozzer

Golden Member
Jan 12, 2012
1,549
18
81
Let me explain again...

[attached image]

The reason behind Maxwell having issues with more than 16 concurrent warps per SM was the shared L1/texture cache and the shared memory per SM.

You have twice as many texture mapping units and CUDA cores per SM with Maxwell as you do with Pascal, despite Maxwell having the same amount of cache per SM as Pascal.

The end result was a spill-over into L2 cache, which brought about a sharp drop in performance when Maxwell was pushed compute-wise.

This spill into L2 cache also affected the overall ROP performance, as I've mentioned before as well.

What NVIDIA have done is they've delivered a more GCN-like architecture with Pascal, which I alluded to as being the solution prior to Pascal being announced.

So yeah... I called it :)
In this entire subforum, you, glo, and rs post the most informative posts :) :thumbsup::thumbsup::thumbsup: