First off, please keep this to technical chat. I don't want fanboy nonsense cluttering up the thread.
So there's plenty of evidence that Kepler is not doing as well in recent games as its contemporary AMD GPU, Hawaii; see for example this thread. What I haven't seen is much well-informed speculation as to why.
Here's my theory:
Instruction Level Parallelism
On a fundamental level, the Kepler architecture needs shader code to provide ILP in order to achieve full utilization of its CUDA cores: there is an imbalance between the number of warp schedulers in a Kepler SMX and the number of CUDA cores. Here's the relevant part from the Kepler Tuning Guide:
https://docs.nvidia.com/cuda/kepler-tuning-guide/index.html
"Also note that Kepler GPUs can utilize ILP in place of thread/warp-level parallelism (TLP) more readily than Fermi GPUs can. Furthermore, some degree of ILP in conjunction with TLP is required by Kepler GPUs in order to approach peak single-precision performance, since SMX's warp scheduler issues one or two independent instructions from each of four warps per clock. ILP can be increased by means of, for example, processing several data items concurrently per thread or unrolling loops in the device code, though note that either of these approaches may also increase register pressure."
An SMX has 4 warp schedulers (each of which can issue up to two instructions per cycle) and 192 CUDA cores. Each issued instruction provides work to 32 CUDA cores (as each warp is 32 threads wide), so saturating all 192 cores requires at least 6 instructions issued per cycle, which is more than single-issue from 4 schedulers can supply. The only way a warp scheduler can issue more than 1 instruction per cycle is if an independent second instruction is available, i.e. if there is ILP present, hence ILP is needed to fully utilize the SMX.
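To put rough numbers on that (my arithmetic, not a quote from the guide):

single-issue: 4 schedulers x 1 instruction x 32 threads = 128 lanes fed per cycle, roughly 2/3 of the 192 cores
dual-issue: 4 schedulers x 2 instructions x 32 threads = 256 lanes fed per cycle, enough to cover all 192 cores

So without any ILP, roughly a third of the SMX sits idle every cycle, at least for pure ALU work.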
Maxwell and Pascal do not have this limitation. They still have 4 schedulers per SM, but each SM has only 128 CUDA cores (4 schedulers x 32 threads = 128), meaning that full utilization can be attained with single-issue alone, i.e. without ILP. From the Maxwell Tuning Guide:
https://docs.nvidia.com/cuda/maxwell-tuning-guide/index.html
"Each warp scheduler still has the flexibility to dual-issue (such as issuing a math operation to a CUDA Core in the same cycle as a memory operation to a load/store unit), but single-issue is now sufficient to fully utilize all CUDA Cores."
This is a fundamental feature of the Kepler architecture, so why has it only become a "problem" recently? My suspicion is that developers (both game devs, and the NVidia engineers who provide optimization assistance to game devs through e.g. GameWorks) are no longer focusing on optimizing shader code for ILP. Maximising ILP requires tricks like unrolling loops or processing several data items per thread, and since newer NVidia GPUs no longer need these tricks, they aren't being applied as often.
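To illustrate what I mean by those tricks, here's a toy CUDA-style sketch (made-up kernel, not from any game; the same idea applies to HLSL/GLSL shader code). The first version has a serial dependency through the accumulator, so the scheduler never has a second independent math instruction to dual-issue; the unrolled version keeps four independent accumulators in flight.

// Naive per-thread reduction: every multiply-add depends on the previous sum,
// so there is no ILP for Kepler's schedulers to dual-issue.
__global__ void reduce_naive(const float* a, const float* b, float* out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // bounds check omitted for brevity
    float sum = 0.0f;
    for (int k = 0; k < n; ++k)
        sum += a[i * n + k] * b[i * n + k];
    out[i] = sum;
}

// Unrolled with independent accumulators: the four multiply-adds in each
// iteration don't depend on each other, giving the warp scheduler ILP to exploit.
__global__ void reduce_unrolled(const float* a, const float* b, float* out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // bounds check omitted for brevity
    float s0 = 0.0f, s1 = 0.0f, s2 = 0.0f, s3 = 0.0f;
    int k = 0;
    for (; k + 3 < n; k += 4) {
        s0 += a[i * n + k + 0] * b[i * n + k + 0];
        s1 += a[i * n + k + 1] * b[i * n + k + 1];
        s2 += a[i * n + k + 2] * b[i * n + k + 2];
        s3 += a[i * n + k + 3] * b[i * n + k + 3];
    }
    for (; k < n; ++k)  // remainder loop
        s0 += a[i * n + k] * b[i * n + k];
    out[i] = (s0 + s1) + (s2 + s3);
}

As the tuning guide notes, the cost of this is register pressure: the unrolled version needs four live accumulators instead of one.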
Could someone who knows more about GCN than me comment on instruction issue and ILP for that architecture, and whether it can reach peak throughput without ILP? I remember AMD saying that one reason they moved away from VLIW was to reduce the need for ILP in shader code, but I don't know much more than that. If GCN doesn't need ILP either, that would help explain why shaders originally written for the GCN-based consoles don't run well on Kepler.
TLDR: Kepler requires optimization tricks that newer architectures don't, and I suspect newer games are not using those tricks.