
[vrworld] Pascal Secrets: What makes Nvidia Geforce GTX 1080 so fast?

Azix

Golden Member
In our initial talks with Nvidia and their partners, we learned that the GeForce GTX 1080 is coming to market in several shapes:

GeForce GTX 1080 8GB
GeForce GTX 1080 Founders Edition
GeForce GTX 1080 Air Overclocked Edition
GeForce GTX 1080 Liquid Cooled Edition

In the search for absolute performance per transistor, Nvidia revised how its Streaming Multiprocessor works. When we compare GM200 and GP100 clock-for-clock, Pascal (slightly) lags behind Maxwell. This move to a more granular architecture was made in order to deliver higher clocks and more performance. Splitting the single Maxwell SM into two, doubling the amount of shared memory, warps and registers, enabled the FP32 and FP64 cores to operate with unprecedented efficiency. For GP104, Nvidia disabled/removed the FP64 units, reducing double-precision compute performance to a meaningless number, just like its predecessors.

GP100: 15.3 billion transistors, 3840 cores, 60 SM, 4096-bit memory, 1328 MHz GPU clock
GP104: 7.2 billion transistors, 2560 cores, 40 SM, 256-bit memory, 1660 MHz GPU clock
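The SM split described above shows up directly in the cores-per-SM ratio. A quick sanity check (the GP100/GP104 figures come from the spec lines above; the GM200 Maxwell figures of 3072 cores across 24 SMs are my own addition for comparison, not from the article):

```python
# Cores per Streaming Multiprocessor, before and after the split.
chips = {
    "GM200 (Maxwell)": (3072, 24),  # (CUDA cores, SM count) - assumed figures
    "GP100 (Pascal)":  (3840, 60),
    "GP104 (Pascal)":  (2560, 40),
}

for name, (cores, sms) in chips.items():
    print(f"{name}: {cores // sms} cores per SM")
# Maxwell packs 128 cores per SM; both Pascal chips drop to 64,
# i.e. each Maxwell SM was effectively split into two.
```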
What remains is single-precision (FP32) performance, which stands at 9 TFLOPS. While the GP100 chip needs a boost to 1.48 GHz in order to deliver 10.6 TFLOPS, GP104 clocks up to 1.73 GHz, and that's not the end. If you clock the GTX 1080 to 2.1 GHz, which is achievable on air, you will go past the GP100. We can already see developers and scientists who need single-precision performance placing orders for air- and liquid-cooled GTX 1080s.
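Those throughput figures follow from the standard peak-FLOPS formula: cores × 2 FLOPs per fused multiply-add × clock. A back-of-the-envelope check against the article's numbers (a sketch, not a benchmark):

```python
def fp32_tflops(cores: int, clock_ghz: float) -> float:
    """Peak single-precision throughput, assuming each CUDA core
    retires one fused multiply-add (2 FLOPs) per clock."""
    return cores * 2 * clock_ghz / 1000.0

# GTX 1080 (GP104, 2560 cores) at its quoted 1.73 GHz boost clock
print(round(fp32_tflops(2560, 1.733), 1))  # 8.9 -> the "9 TFLOPS" figure

# At a 2.1 GHz overclock it passes GP100's quoted 10.6 TFLOPS
print(round(fp32_tflops(2560, 2.1), 1))    # 10.8
```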

For DirectX 12 and VR, the term Asynchronous Compute was thrown around, especially since AMD Radeon-based cards were beating Nvidia GeForce cards in DirectX 12 titles such as Ashes of the Singularity and Rise of the Tomb Raider. We were told that the Pascal architecture doesn't have Asynchronous Compute, but that there are some aspects of this feature which qualified the card for the DirectX '12_1' feature level.

However, DX12 titles face another battle altogether, and that is delivering a great gaming experience. This is something where titles such as Gears of War Ultimate Edition and Quantum Break failed entirely, as Microsoft 'screwed the pooch' with disastrous conversions and limitations imposed by the Windows Store. Tim Sweeney even wrote an in-depth column in The Guardian on what's wrong with Microsoft's approach. These days, game developers work hand in hand with both AMD and Nvidia to extract as much performance out of DirectX 12 as possible, which is needed for demanding VR environments.

http://vrworld.com/2016/05/10/pascal-secrets-nvidia-geforce-gtx-1080/

The slight decrease in IPC from Maxwell to Pascal might explain some of the discrepancy in performance per clock.
 
Bolded part is not true.

The Maxwell cards do have Asynchronous Compute Engines - they just aren't as good at it as AMD cards are. No reason to suspect Nvidia would remove them from Pascal.

EDIT:

Look at the table near the bottom of this page.
 
Bolded part is not true.

The Maxwell cards do have Asynchronous Compute Engines - they just aren't as good at it as AMD cards are. No reason to suspect Nvidia would remove them from Pascal.

EDIT:

Look at the table near the bottom of this page.

If that's what you're saying, then you aren't saying anything new. I've been through all that and it's obvious that Maxwell doesn't have ACEs, nor can it handle async compute.

Of course, we are still waiting on an Nvidia driver update to be sure...
 
The main limiting factor for overclocking beyond 2.2 GHz is 225 watts, which is how much the board can officially pull from the power circuitry: 75 watts from the motherboard slot and 150 W through the 8-pin PEG connector. However, some power supply manufacturers provide more juice per rail, and we've seen a single 8-pin connector deliver 225 W on its own.

Still, partners such as ASUS, Colorful, EVGA, Galax, Gigabyte and MSI are preparing custom boards with two or three 8-pin connectors. According to our sources, reaching 2.5 GHz using a liquid cooling setup such as a Corsair H115i or EK Waterblocks should not be too much of a hassle.
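The 225 W figure above is just the sum of what each source may officially deliver under the PCIe spec: 75 W from the slot, 75 W per 6-pin, 150 W per 8-pin. A small sketch of that budget arithmetic (the function and names are mine, for illustration):

```python
# Official PCIe power limits per source.
CONNECTOR_WATTS = {"6-pin": 75, "8-pin": 150}
SLOT_WATTS = 75  # delivered through the motherboard slot

def board_power_budget(connectors: list[str]) -> int:
    """Total official board power: slot plus every auxiliary connector."""
    return SLOT_WATTS + sum(CONNECTOR_WATTS[c] for c in connectors)

print(board_power_budget(["8-pin"]))           # 225 -> reference GTX 1080
print(board_power_budget(["8-pin", "8-pin"]))  # 375 -> dual 8-pin custom board
```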
That slight perf/clock decrease plus the node shrink has given them around 1 GHz extra clock speed vs GM204, if that 2.5 GHz quote turns out to be true... the NetBurst/Bulldozer philosophy seems to work for GPUs. You'll be at 250-300 W just like on GM204/Fiji, yet blowing them both out of the water at the same time.

I wonder how GP100 overclocks. If GM204 is anything to go by, it'll be just as impressive in that regard.




If that's what you're saying, then you aren't saying anything new. I've been through all that and it's obvious that Maxwell doesn't have ACEs, nor can it handle async compute.

Of course, we are still waiting on an Nvidia driver update to be sure...

This. For all intents and purposes, the feature is broken on nV hardware so far, relative to AMD's implementation.
 
If that's what you're saying, then you aren't saying anything new. I've been through all that and it's obvious that Maxwell doesn't have ACEs, nor can it handle async compute.

Of course, we are still waiting on an Nvidia driver update to be sure...

See here

EDIT:

Also

From that article:
Keep in mind, however, that even Maxwell featured Asynchronous Compute on paper. Unfortunately, expensive software-based context switching had to be employed before it could be used (since Maxwell lacked a dedicated hardware scheduler like AMD's GCN), which resulted in lowered performance on Maxwell-based graphics cards.

So, it isn't true to say that Maxwell has no ACE. It just lacks a hardware scheduler.
 
That slight perf/clock decrease plus the node shrink has given them around 1 GHz extra clock speed vs GM204, if that 2.5 GHz quote turns out to be true... the NetBurst/Bulldozer philosophy seems to work for GPUs. You'll be at 250-300 W just like on GM204/Fiji, yet blowing them both out of the water at the same time.

I wonder how GP100 overclocks. If GM204 is anything to go by, it'll be just as impressive in that regard.

OC scaling should be worse than Maxwell's, though.
 
From what I understand, the problem with async on Maxwell is that the whole GPU has to be in either graphics or compute mode, and there's a penalty when switching.

If Pascal can handle compute and graphics side by side, it will be able to benefit from async, although probably not as much as GCN.

Anyway, async is nice, but if you look at tests in Ashes of the Singularity, AMD cards are only 11% faster with async on. It's only part of the reason why they perform better than Nvidia under DX12 in Ashes.
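The mode-switch penalty described above can be put into a toy timing model. All the millisecond figures below are invented for illustration (chosen so the overlap gain lands near the 11% seen in Ashes), not measurements:

```python
def frame_time_serial(gfx_ms: float, compute_ms: float, switch_ms: float) -> float:
    """Maxwell-style: the whole GPU runs graphics, pays a context
    switch, then runs the compute work."""
    return gfx_ms + switch_ms + compute_ms

def frame_time_async(gfx_ms: float, compute_ms: float) -> float:
    """GCN-style: compute overlaps graphics, so the frame takes as
    long as the larger of the two workloads."""
    return max(gfx_ms, compute_ms)

gfx, compute, switch = 10.0, 1.0, 0.1  # invented workload sizes
serial = frame_time_serial(gfx, compute, switch)  # 11.1 ms
overlapped = frame_time_async(gfx, compute)       # 10.0 ms
print(f"async gain: {serial / overlapped - 1:.0%}")
```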
 
From what I understand, the problem with async on Maxwell is that the whole GPU has to be in either graphics or compute mode, and there's a penalty when switching.

If Pascal can handle compute and graphics side by side, it will be able to benefit from async, although probably not as much as GCN.

Anyway, async is nice, but if you look at tests in Ashes of the Singularity, AMD cards are only 11% faster with async on. It's only part of the reason why they perform better than Nvidia under DX12 in Ashes.

Yeah, but having a 10-15% increase in performance allows you to compete against the competition with a smaller die, and thus lower cost, lower TDP, and so on.

We will see a lot higher performance from Pascal vs Maxwell in DX12 games, but mostly in GameWorks titles.
 
The article is from the same person who claimed the GTX 1080 was a 1920SP part 😉

And it's already in the Nvidia Pascal thread.
 
See here

EDIT:

Also

From that article:


So, it isn't true to say that Maxwell has no ACE. It just lacks a hardware scheduler.
It lacks quite a lot for having any async compute capability at any serious level:
1) a hardware scheduler (it was one of the main reasons AMD had such a high TDP in the first place)
2) more pipelines to feed the ACEs
3) ACEs
4) flexible cores that can jump/cycle between workloads mid-cycle

It's pretty obvious that they're going to try to brute-force it this time around instead of actually having it.
 
It lacks quite a lot for having any async compute capability at any serious level:
1) a hardware scheduler (it was one of the main reasons AMD had such a high TDP in the first place)
2) more pipelines to feed the ACEs
3) ACEs
4) flexible cores that can jump/cycle between workloads mid-cycle

It's pretty obvious that they're going to try to brute-force it this time around instead of actually having it.
Getting rid of the high context-switch cost is all they need to do, and from what I've heard, they stated that they did. No more regression from DX11 to DX12... end of complaints.
 
Getting rid of the high context-switch cost is all they need to do, and from what I've heard, they stated that they did. No more regression from DX11 to DX12... end of complaints.

Yeah, like the context switch was the only problem holding back the Maxwell cards...
 
Getting rid of the high context-switch cost is all they need to do, and from what I've heard, they stated that they did. No more regression from DX11 to DX12... end of complaints.
Two things: 1. regression, 2. performance boost.

Pascal only solves the regression problem, so ACEs are still needed to get the performance boost, and Pascal doesn't have them.
 
In response to the OP's thread title, the shrink from 28 nm to 16 nm alone was expected to really improve performance, and it apparently did, while sipping less energy.

Those of us with GTX 980 Tis and Titan Xs were at the top of the heap for performance in most games in the 28 nm era of GPU chips, but now have to pass the baton to the new chips on the block.

Too bad Neal Sedaka couldn't write a song to make us happy!

On a serious note, the question is: do you "dump" your GTX 980 Ti for a 1080 and stay on the never-ending upgrade trail?

In the forums there appear to be tomes on this. To each her/his own.

Since I run 2 rigs, one with Nvidia and one with AMD, and since my goal is to go to single cards that replace dual cards and still show a performance boost, I will focus on an AMD card powerful enough to replace my 2 R9 290s. I replaced 2 GTX 670s with the GTX 980 Ti. Mission accomplished on the Nvidia side, but not forever. This cycle I go AMD, probably Big Vega.
 
In the search for absolute performance per transistor, Nvidia revised how its Streaming Multiprocessor works. When we compare GM200 and GP100 clock-for-clock, Pascal (slightly) lags behind Maxwell. This move to a more granular architecture was made in order to deliver higher clocks and more performance. Splitting the single Maxwell SM into two, doubling the amount of shared memory, warps and registers, enabled the FP32 and FP64 cores to operate with unprecedented efficiency. For GP104, Nvidia disabled/removed the FP64 units, reducing double-precision compute performance to a meaningless number, just like its predecessors.

See, I was right when I stated that past 16 concurrent warps per SM, Maxwell would spill into L2 cache.

Folks who keep criticizing me don't understand that I used to do this for a living.
 
Let me explain again...

[attached image]

The reason behind Maxwell having issues with more than 16 concurrent warps per SM was the shared L1 cache/texture cache and the shared SM memory.

You have twice as many Texture Mapping Units and CUDA cores per SM with Maxwell as you do with Pascal, despite Maxwell having the same amount of cache per SM as Pascal.

The end result was a spillover into L2 cache, which brought about a sharp drop in performance when Maxwell was pushed compute-wise.

This spill into L2 cache also affected the overall ROP performance, as I've mentioned before as well.

What NVIDIA have done is they've delivered a more GCN-like architecture with Pascal, which I alluded to as being the solution prior to Pascal being announced.

So yeah... I called it 🙂
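The spill behaviour described above amounts to a simple capacity check: once the combined working set of the concurrent warps exceeds what the SM's shared L1/texture cache can hold, accesses fall through to L2. A sketch, where the byte figures are hypothetical, picked only to reproduce the 16-warp threshold (they are not Maxwell's real cache sizes):

```python
def spills_to_l2(concurrent_warps: int, bytes_per_warp: int, l1_bytes: int) -> bool:
    """True when the combined warp working set no longer fits in the
    SM's shared L1/texture cache and must fall back to L2."""
    return concurrent_warps * bytes_per_warp > l1_bytes

L1_BYTES = 48 * 1024       # hypothetical per-SM cache capacity
BYTES_PER_WARP = 3 * 1024  # hypothetical working set per warp

print(spills_to_l2(16, BYTES_PER_WARP, L1_BYTES))  # False: 16 warps still fit
print(spills_to_l2(17, BYTES_PER_WARP, L1_BYTES))  # True: the 17th warp spills
```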
 
Mahigan, first to tip my hat to you!:thumbsup:😀

How does AMD counter punch?
 
Let me explain again...

[attached image]

The reason behind Maxwell having issues with more than 16 concurrent warps per SM was the shared L1 cache/texture cache and the shared SM memory.

You have twice as many Texture Mapping Units and CUDA cores per SM with Maxwell as you do with Pascal, despite Maxwell having the same amount of cache per SM as Pascal.

The end result was a spillover into L2 cache, which brought about a sharp drop in performance when Maxwell was pushed compute-wise.

This spill into L2 cache also affected the overall ROP performance, as I've mentioned before as well.

What NVIDIA have done is they've delivered a more GCN-like architecture with Pascal, which I alluded to as being the solution prior to Pascal being announced.

So yeah... I called it 🙂

So, looking at that graph and chart, Fermi isn't affected by it, but Kepler is as well?

Does that also explain why the GTX 970 struggled with VRAM issues?
 
Two things: 1. regression, 2. performance boost.

Pascal only solves the regression problem, so ACEs are still needed to get the performance boost, and Pascal doesn't have them.

There is plenty of performance boost with Nvidia in the scenarios DX12 was made for: weak CPU cores paired with huge GPUs. Look up the benches. Of course you don't get a boost when you bench with a single-core monster that just brute-forces through everything on a single thread (DX11).
 
There is plenty of performance boost with Nvidia in the scenarios DX12 was made for: weak CPU cores paired with huge GPUs. Look up the benches. Of course you don't get a boost when you bench with a single-core monster that just brute-forces through everything on a single thread (DX11).
So far, from what we have seen, Nvidia gains 1-2% going from DX11 to DX12. If the DX11 performance of the 1080 doesn't see a similar jump in DX12, then it's all bogus once more.
 
There is plenty of performance boost with Nvidia in the scenarios DX12 was made for: weak CPU cores paired with huge GPUs. Look up the benches. Of course you don't get a boost when you bench with a single-core monster that just brute-forces through everything on a single thread (DX11).
It could have gotten another 20% boost on top of whatever performance it has if it had ACEs, but it doesn't have those, so no 20% boost. Understand?

100+20 is more than 100, but 100+40 is even more than 100 or 120. Get it?
 
Let me explain again...



What NVIDIA have done is they've delivered a more GCN-like architecture with Pascal, which I alluded to as being the solution prior to Pascal being announced.

So yeah... I called it 🙂

Yeah, more AMD-like.
Cookies in the mail.
 
Let me explain again...

[attached image]

The reason behind Maxwell having issues with more than 16 concurrent warps per SM was the shared L1 cache/texture cache and the shared SM memory.

You have twice as many Texture Mapping Units and CUDA cores per SM with Maxwell as you do with Pascal, despite Maxwell having the same amount of cache per SM as Pascal.

The end result was a spillover into L2 cache, which brought about a sharp drop in performance when Maxwell was pushed compute-wise.

This spill into L2 cache also affected the overall ROP performance, as I've mentioned before as well.

What NVIDIA have done is they've delivered a more GCN-like architecture with Pascal, which I alluded to as being the solution prior to Pascal being announced.

So yeah... I called it 🙂
You, glo, and rs post the most informative posts in this entire subforum 🙂 :thumbsup::thumbsup::thumbsup:
 