Why Doesn't Anyone Get to the Bottom of the Aging of Kepler and GCN?


Mahigan

Senior member
Aug 22, 2015
573
0
0
This. Games are too reliant on specific driver support; if it's gone, performance plummets.

Look at this: performance of the GTX 980 vs the 780Ti in a GPGPU situation, Octane Render:

https://render.otoy.com/octanebench/results.php

780Ti - 103 points
980 - 98 points

Pretty much tied for performance. How much faster is the 980 in games?
It has to do with GPU compute utilization under compute loads, not with pure render loads.

Basically, the thread-group size of the compute workloads translates into GPU utilization on the hardware side.

This is decided by the developers, not AMD or NVIDIA.
 

ShintaiDK

Lifer
Apr 22, 2012
20,378
146
106
Crysis 3 October 24 2013:
[benchmark chart]


Crysis 3 September 2014:
[benchmark chart]


Crysis 3 July 2 2015:
[benchmark chart]


7970 regressed performance from 2013 to 2015.
290X regressed performance from 2013 to 2015.
780TI regressed performance from 2014 to 2015.

Driver optimization is everything.
 

Timmah!

Golden Member
Jul 24, 2010
1,571
935
136
It has to do with GPU compute utilization under compute loads, not with pure render loads.

Basically, the thread-group size of the compute workloads translates into GPU utilization on the hardware side.

This is decided by the developers, not AMD or NVIDIA.

Not sure what you are trying to say. That actual GPGPU performance mirrors the objective "strength" (if anything like that even exists) of the GPU worse than performance in games does, because devs decide this? And if they decided differently, the 980 would smoke the 780Ti, or vice versa? I don't think so.
 

Mahigan

Senior member
Aug 22, 2015
573
0
0
Not sure what you are trying to say. That actual GPGPU performance mirrors the objective "strength" (if anything like that even exists) of the GPU worse than performance in games does, because devs decide this? And if they decided differently, the 980 would smoke the 780Ti, or vice versa? I don't think so.
Not at all,

GPGPU performance depends entirely on GPU compute utilization. Utilization depends on the code and how it relates to the architecture.

In games, compute shaders are used to run small programs (work items) on the GPU. When a GPU begins work on a work item, it does so by executing kernels (data-parallel programs); kernels are further broken down into work groups, and work groups are broken down into wavefronts (or warps).
[diagrams: work items, work groups, and wavefronts/warps]


So work groups are segmented into wavefronts (GCN) or warps (Kepler/Maxwell).

The programmer decides the size of the work group; how that work group is split into smaller segments is up to the hardware.

If a program is optimized for GCN, the work groups will be sized in multiples of 64 (matching a wavefront).

If a program is optimized for Kepler/Maxwell, the work groups will be sized in multiples of 32 (matching a warp).

Prior to the arrival of GCN-based consoles, developers would size their work groups in multiples of 32. This left GCN compute units partially idle, with lanes going unused in every CU.

Your Octane renderer is probably a relic of that past. It is no longer relevant. Games are now arriving with GCN-centric optimizations.

Under those scenarios, Kepler is underutilized to a large degree. This is due to the way the CUDA cores in the SMXs were organized (192 CUDA cores per SMX). NVIDIA took notice of this and reduced the number of CUDA cores per SM to 128 for Maxwell's SMM, segmenting those 128 CUDA cores into four groups of 32 (each mapping directly to a warp).

So yes, how an application is written and optimized largely determines performance.
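To make the utilization arithmetic concrete, here is a minimal sketch (my own illustration, not from the post; the work-group sizes are hypothetical) of how many SIMD lanes actually do useful work for a given work-group size, under the simplified model above where any partially filled wavefront/warp still occupies a whole one:

import math

# Fraction of SIMD lanes doing useful work for one work group, assuming
# fixed-width wavefronts (64 lanes, GCN) or warps (32 lanes, Kepler/Maxwell)
# and that a partially filled wavefront/warp still occupies a full one.
def lane_utilization(work_group_size, simd_width):
    wavefronts = math.ceil(work_group_size / simd_width)  # hardware rounds up
    return work_group_size / (wavefronts * simd_width)

for wg in (32, 64, 96, 128):  # hypothetical work-group sizes
    print(f"work group {wg:>3}: "
          f"warp-32 utilization {lane_utilization(wg, 32):.0%}, "
          f"wavefront-64 utilization {lane_utilization(wg, 64):.0%}")

# A work group of 32 fills a warp completely (100%) but only half of a GCN
# wavefront (50%); multiples of 64 keep both architectures fully occupied.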
 

exar333

Diamond Member
Feb 7, 2004
8,518
8
91
Not at all,

GPGPU performance depends entirely on GPU compute utilization. Utilization depends on the code and how it relates to the architecture.

In games, compute shaders are used to run small programs (work items) on the GPU. When a GPU begins work on a work item, it does so by executing kernels (data-parallel programs); kernels are further broken down into work groups, and work groups are broken down into wavefronts (or warps).
[diagrams: work items, work groups, and wavefronts/warps]


So work groups are segmented into wavefronts (GCN) or warps (Kepler/Maxwell).

The programmer decides the size of the work group; how that work group is split into smaller segments is up to the hardware.

If a program is optimized for GCN, the work groups will be sized in multiples of 64 (matching a wavefront).

If a program is optimized for Kepler/Maxwell, the work groups will be sized in multiples of 32 (matching a warp).

Prior to the arrival of GCN-based consoles, developers would size their work groups in multiples of 32. This left GCN compute units partially idle, with lanes going unused in every CU.

Your Octane renderer is probably a relic of that past. It is no longer relevant. Games are now arriving with GCN-centric optimizations.

Under those scenarios, Kepler is underutilized to a large degree. This is due to the way the CUDA cores in the SMXs were organized (192 CUDA cores per SMX). NVIDIA took notice of this and reduced the number of CUDA cores per SM to 128 for Maxwell's SMM, segmenting those 128 CUDA cores into four groups of 32 (each mapping directly to a warp).

So yes, how an application is written and optimized largely determines performance.

Great write-up, thank you. :)
 

Mahigan

Senior member
Aug 22, 2015
573
0
0
Crysis 3 October 24 2013:
[benchmark chart]


Crysis 3 September 2014:
[benchmark chart]


Crysis 3 July 2 2015:
[benchmark chart]


7970 regressed performance from 2013 to 2015.
290X regressed performance from 2013 to 2015.
780TI regressed performance from 2014 to 2015.

Driver optimization is everything.
I don't see regression for those GCN cards. I see results which are within the realm of normalcy (margin of error). Different benchmark runs will give you different results within a few FPS of each other.

Crysis 3 performance looks to have remained the same for Kepler and GCN-based cards. Maxwell-based cards received further optimizations. Nothing earth-shattering.
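For what it's worth, a quick way to sanity-check whether a frame-rate delta is noise or a real regression is to compare it against typical run-to-run variance. A minimal sketch (the ~3% tolerance and the FPS numbers are hypothetical, not taken from the charts above):

RUN_TO_RUN_TOLERANCE = 0.03  # assume roughly 3% noise between benchmark runs

def classify(old_fps, new_fps, tolerance=RUN_TO_RUN_TOLERANCE):
    # Relative change between two benchmark results for the same card.
    change = (new_fps - old_fps) / old_fps
    if abs(change) <= tolerance:
        return f"{change:+.1%}: within margin of error"
    return f"{change:+.1%}: real {'regression' if change < 0 else 'improvement'}"

# Hypothetical example: a sub-1 FPS swing around 40 FPS is noise, not a regression.
print(classify(40.0, 39.2))  # -2.0%: within margin of error
print(classify(40.0, 35.0))  # -12.5%: real regression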
 

Timmah!

Golden Member
Jul 24, 2010
1,571
935
136
Not at all,

GPGPU performance depends entirely on GPU compute utilization. Utilization depends on the code and how it relates to the architecture.

In games, compute shaders are used to run small programs (work items) on the GPU. When a GPU begins work on a work item, it does so by executing kernels (data-parallel programs); kernels are further broken down into work groups, and work groups are broken down into wavefronts (or warps).
[diagrams: work items, work groups, and wavefronts/warps]


So work groups are segmented into wavefronts (GCN) or warps (Kepler/Maxwell).

The programmer decides the size of the work group; how that work group is split into smaller segments is up to the hardware.

If a program is optimized for GCN, the work groups will be sized in multiples of 64 (matching a wavefront).

If a program is optimized for Kepler/Maxwell, the work groups will be sized in multiples of 32 (matching a warp).

Prior to the arrival of GCN-based consoles, developers would size their work groups in multiples of 32. This left GCN compute units partially idle, with lanes going unused in every CU.

Your Octane renderer is probably a relic of that past. It is no longer relevant. Games are now arriving with GCN-centric optimizations.

Under those scenarios, Kepler is underutilized to a large degree. This is due to the way the CUDA cores in the SMXs were organized (192 CUDA cores per SMX). NVIDIA took notice of this and reduced the number of CUDA cores per SM to 128 for Maxwell's SMM, segmenting those 128 CUDA cores into four groups of 32 (each mapping directly to a warp).

So yes, how an application is written and optimized largely determines performance.

It's certainly not a relic of the past, nor is it no longer relevant. It has been in constant development since only circa 2009/10; its kernels have been rewritten from scratch several times to allow the addition of new advanced features and reach the feature level of old unbiased renderers like V-Ray, and it is currently getting its 3.0 version, which will eventually be ported to OpenCL as well (at last I will be able to choose between NVIDIA and AMD again). Bottom line, it is up to date and, IMO, less prone to performance decreases over time due to lack of driver support than games are.

Bottom line, the relatively tied performance between the 780Ti and the 980 is exactly the result of Octane being optimized by its devs for both Kepler and Maxwell. Under these circumstances the 780Ti has a slight performance lead, which pretty much mirrors its theoretical SP peak performance lead over the 980... which is not true for games at all, where the 980 beats the crap out of the 780Ti (AFAIK).

Based on this, I maintain my opinion that apps like Octane are a more reliable measure of GPU peak performance than games, just as Cinebench is a better benchmark than games.
 
Feb 19, 2009
10,457
10
76
Bottom line, the relatively tied performance between the 780Ti and the 980 is exactly the result of Octane being optimized by its devs for both Kepler and Maxwell. Under these circumstances the 780Ti has a slight performance lead, which pretty much mirrors its theoretical SP peak performance lead over the 980... which is not true for games at all, where the 980 beats the crap out of the 780Ti (AFAIK).

You notice that in neutral titles, the 780Ti often performs very close to the 980, well ahead of the 970. It really only tanks in NV GameWorks titles from 2014 onward.

But sadly, these days Kepler runs poorly in all the new AAA games.
 

Mahigan

Senior member
Aug 22, 2015
573
0
0
It's certainly not a relic of the past, nor is it no longer relevant. It has been in constant development since only circa 2009/10; its kernels have been rewritten from scratch several times to allow the addition of new advanced features and reach the feature level of old unbiased renderers like V-Ray, and it is currently getting its 3.0 version, which will eventually be ported to OpenCL as well (at last I will be able to choose between NVIDIA and AMD again). Bottom line, it is up to date and, IMO, less prone to performance decreases over time due to lack of driver support than games are.

Bottom line, the relatively tied performance between the 780Ti and the 980 is exactly the result of Octane being optimized by its devs for both Kepler and Maxwell. Under these circumstances the 780Ti has a slight performance lead, which pretty much mirrors its theoretical SP peak performance lead over the 980... which is not true for games at all, where the 980 beats the crap out of the 780Ti (AFAIK).

Based on this, I maintain my opinion that apps like Octane are a more reliable measure of GPU peak performance than games, just as Cinebench is a better benchmark than games.
It's not relevant for measuring DirectCompute performance, which is what we're really discussing when we talk about DX11/DX12 gaming.

When we discuss gaming, we're not discussing peak or theoretical compute throughput. We're discussing GPU compute utilization, based on game-specific optimizations. You're right to say that Kepler can perform admirably if developers optimize their code for it. Sadly, that's not the case, due to the console effect.

Recent DX11 games have been starting to favor GCN due to the console effect. That being said, GCN is API-bound under DX11. DX12 allows for the removal of the API overhead issues affecting GCN, which further boosts GCN's performance. Throw in async compute and you have another boost to GCN compute utilization.

The topic was discussing why Kepler has seemingly regressed relative to Maxwell and GCN in newer titles. That can be explained by compute utilization: GCN-specific optimizations from the console arena are making their way into PC ports.
 

Timmah!

Golden Member
Jul 24, 2010
1,571
935
136
It's not relevant for measuring DirectCompute performance, which is what we're really discussing when we talk about DX11/DX12 gaming.

When we discuss gaming, we're not discussing peak or theoretical compute throughput. We're discussing GPU compute utilization, based on game-specific optimizations. You're right to say that Kepler can perform admirably if developers optimize their code for it. Sadly, that's not the case, due to the console effect.

Recent DX11 games have been starting to favor GCN due to the console effect. That being said, GCN is API-bound under DX11. DX12 allows for the removal of the API overhead issues affecting GCN, which further boosts GCN's performance. Throw in async compute and you have another boost to GCN compute utilization.

The topic was discussing why Kepler has seemingly regressed relative to Maxwell and GCN in newer titles. That can be explained by compute utilization: GCN-specific optimizations from the console arena are making their way into PC ports.

I see what you mean now. I guess I can't disagree with that.
 

moonbogg

Lifer
Jan 8, 2011
10,731
3,440
136
You're nuts.

There is a constant improvement of base technology. Every year we learn how to put more transistors on a square millimeter. Every year we gain more knowledge on how to design chips. We get better software to help design chips.

The GPU industry succeeds in using that overall progress to manufacture better products. The CPU industry does not. They have better base material to work with, but they do not succeed in converting those better opportunities into a better end result for the customers. IMNSHO the GPU industry is the successful one here, and the CPU industry is the failing one.


Unless you wanted us all to still use slide rules. And horses and carriages. Humanity makes technological progress. If you don't like that, maybe you should move back to the Dark Ages. TBH, now that I think of it, I have no idea what you are doing on AT technical forums.

I develop emotional attachments to my hardware. I love my hardware. It's like a friend to me. I can rely on it to provide great experiences for years to come. But there is a problem with my relationship with GPUs.
GPUs are like a really sexy hot chick, and I like this girl a lot, but the problem is she has never been in a relationship for more than 18 months. I knew that going in, and although I was hoping for the best, right around that 18-month mark she sort of just stopped performing. I had no choice but to replace her with a new girl, a fresh one who is eager to please. But alas, it was more of the same. Pretty soon she had to go as well. So you see, I find it hard to develop attachments to my GPUs. I can love my CPU long time. I just bought some HD700 headphones that look like they will take care of me for years, heart and soul. But the GPUs... they only break my heart.
So, I must now ask: what the hell are you doing on AnandTech if you don't have these emotional connections?! You are a hardware sociopath, it seems. You aren't capable of connecting with your hardware, and that's not enthusiast-like at all. That's cold... and lifeless.
 

Genx87

Lifer
Apr 8, 2002
41,091
513
126
In the video card market there seem to be certain trends in how GPUs age over time, but the reasons why all seem to be speculation. Why doesn't someone (aka one of these sites that review GPUs) actually do the testing to figure it out?

Take Kepler, for example. It is obviously not doing as well as it once did in new games vs both GCN and Maxwell. But the stated reasons vary: some blame lack of driver optimization for Kepler, some blame overall trends in development. This seems easy to test. Go back and benchmark older games (say, 2013) on Maxwell to make sure there wasn't some sort of overall driver boost, and then retest slightly older games (say, 2014-2015) to see whether subsequent driver updates help Kepler. If Maxwell runs better in old games, or Kepler runs better after six to eight months, then obviously there is some truth to the claims that the priority is Maxwell.

Or GCN vs NVIDIA aging overall. It is obvious that GCN has aged well, especially vs Kepler and somewhat vs Maxwell, but again the reasons vary: some blame developers targeting consoles, others say the drivers got better. This too seems easy to test. Go back and retest old games to see whether performance got better or worse vs competitors and vs old results. If old games run better, then clearly the driver got better.

I get that these sites are focused on new cards and the enthusiasm around new technology, and at some level there might be a hesitation to revisit things in hindsight because it might contradict their recommendations from an earlier period. That said, there is no new technology most people can buy for months, and since a new generation of GPUs is coming, observations from previous generations can help us anticipate unexpected outcomes. At the very least, it seems like there are a million clicks waiting for whoever uncovers actual proof of what is going on, as the two groups of online fans battle each other with "leaked" or made-up evidence of what the new hardware will do.

I just don't understand.

Marketing. You may not believe this, but I doubt AMD wants people to know their older cards don't need to be upgraded. Also, as I have said numerous times in other threads about how older NVIDIA cards perform in DX12: barely anybody cares about going back 1-2 generations to see how they fare. Benchmarking hardware is marketing for that hardware, much like music videos were marketing for artists. Once a new song came out, the old video was retired from the marketing cycle. Same thing with hardware.
 

Zodiark1593

Platinum Member
Oct 21, 2012
2,230
4
81
Not at all,

GPGPU performance depends entirely on GPU compute utilization. Utilization depends on the code and how it relates to the architecture.

In games, compute shaders are used to run small programs (work items) on the GPU. When a GPU begins work on a work item, it does so by executing kernels (data-parallel programs); kernels are further broken down into work groups, and work groups are broken down into wavefronts (or warps).
[diagrams: work items, work groups, and wavefronts/warps]


So work groups are segmented into wavefronts (GCN) or warps (Kepler/Maxwell).

The programmer decides the size of the work group; how that work group is split into smaller segments is up to the hardware.

If a program is optimized for GCN, the work groups will be sized in multiples of 64 (matching a wavefront).

If a program is optimized for Kepler/Maxwell, the work groups will be sized in multiples of 32 (matching a warp).

Prior to the arrival of GCN-based consoles, developers would size their work groups in multiples of 32. This left GCN compute units partially idle, with lanes going unused in every CU.

Your Octane renderer is probably a relic of that past. It is no longer relevant. Games are now arriving with GCN-centric optimizations.

Under those scenarios, Kepler is underutilized to a large degree. This is due to the way the CUDA cores in the SMXs were organized (192 CUDA cores per SMX). NVIDIA took notice of this and reduced the number of CUDA cores per SM to 128 for Maxwell's SMM, segmenting those 128 CUDA cores into four groups of 32 (each mapping directly to a warp).

So yes, how an application is written and optimized largely determines performance.

Sort of matches my hypothesis of Maxwell being largely a modified and optimized Kepler. The only thing really lacking now is the brute force compared to GCN.
 

nenforcer

Golden Member
Aug 26, 2008
1,778
20
81
Marketing. You may not believe this, but I doubt AMD wants people to know their older cards don't need to be upgraded. Also, as I have said numerous times in other threads about how older NVIDIA cards perform in DX12: barely anybody cares about going back 1-2 generations to see how they fare. Benchmarking hardware is marketing for that hardware, much like music videos were marketing for artists. Once a new song came out, the old video was retired from the marketing cycle. Same thing with hardware.

I totally agree with this. The AMD GCN architecture with Asynchronous Compute support is totally forward-thinking and was ahead of its time when it was released in 2012. The problem was that NVIDIA graphics cards had better DirectX 11 support through their better-threaded driver and (arguably) performed better in the DirectX 11 titles shipped over the last 5-6 years, since Fermi. The NVIDIA graphics cards certainly sold better than their AMD counterparts.

Now here is the dilemma: NVIDIA still to this day doesn't have Asynchronous Compute in the consumer GeForce driver up through the Maxwell series. However, once Pascal releases later this year, I suspect they will have added Asynchronous Compute to their driver, and then we will really see who performs better in the current and upcoming DirectX 12 Asynchronous Compute releases. AMD (Tahiti) owners can sit tight knowing they at least have hardware-level support for DirectX 12 Feature Level 1 without being forced to upgrade like NVIDIA users will. I own a Fermi and a Kepler but skipped Maxwell and went with AMD Tonga for this very reason, especially since the consoles are all AMD and the next generation is supposed to be the same as well.
 

Flapdrol1337

Golden Member
May 21, 2014
1,677
93
91
AMD (Tahiti) owners can sit tight knowing they at least have hardware-level support for DirectX 12 Feature Level 1 without being forced to upgrade like NVIDIA users will.

Nobody is forced to upgrade. Games run fine in DX12 without async compute.
 
Feb 19, 2009
10,457
10
76
I own a Fermi and a Kepler but skipped Maxwell and went with AMD Tonga for this very reason, especially since the consoles are all AMD and the next generation is supposed to be the same as well.

I think whatever the cause, we know it has occurred, and GCN will continue to mature well as long as devs target console hardware that is GCN. In the next console cycle, if it's not GCN, we won't see this effect anymore. That's all this comes down to on the GCN side.

The NV side is a different issue.
 

airfathaaaaa

Senior member
Feb 12, 2016
692
12
81
You notice that in neutral titles, the 780Ti often performs very close to the 980, well ahead of the 970. It really only tanks in NV GameWorks titles from 2014 onward.

But sadly, these days Kepler runs poorly in all the new AAA games.

Yeah, I find it very funny that AotS actually exposes what NVIDIA does to their cards: the 780Ti reaching the 980 and even beating it sometimes. It's the epitome of irony.
 

ShintaiDK

Lifer
Apr 22, 2012
20,378
146
106
I totally agree with this. The AMD GCN architecture with Asynchronous Compute support is totally forward-thinking and was ahead of its time when it was released in 2012. The problem was that NVIDIA graphics cards had better DirectX 11 support through their better-threaded driver and (arguably) performed better in the DirectX 11 titles shipped over the last 5-6 years, since Fermi. The NVIDIA graphics cards certainly sold better than their AMD counterparts.

Now here is the dilemma: NVIDIA still to this day doesn't have Asynchronous Compute in the consumer GeForce driver up through the Maxwell series. However, once Pascal releases later this year, I suspect they will have added Asynchronous Compute to their driver, and then we will really see who performs better in the current and upcoming DirectX 12 Asynchronous Compute releases. AMD (Tahiti) owners can sit tight knowing they at least have hardware-level support for DirectX 12 Feature Level 1 without being forced to upgrade like NVIDIA users will. I own a Fermi and a Kepler but skipped Maxwell and went with AMD Tonga for this very reason, especially since the consoles are all AMD and the next generation is supposed to be the same as well.

It's so great and forward-thinking that we already see GCN 1.0 and 1.2 either stagnate or regress in DX12, unless it's a Fury card. And even then it depends on the game.

We are down to SKU-specific optimization, and Tonga users, for example, are the big losers.
 

airfathaaaaa

Senior member
Feb 12, 2016
692
12
81
So just because the devs can't really use the DX12 path in Rise of the Tomb Raider yet, it means the cards are becoming obsolete? AMD isn't NVIDIA.
 

Mahigan

Senior member
Aug 22, 2015
573
0
0
You act like it doesn't happen in other games ;)

[benchmark chart]
Odd... every other source shows this pattern:
[benchmark charts]

Whereas only Tonga regresses.

Rise of the Tomb Raider is broken. Best to wait until they release a fix for their broken DX12 path (like GoW did).