[Videocardz] AMD Polaris 11 SKU spotted, has 16 Compute Units


coercitiv

Diamond Member
Jan 24, 2014
7,518
17,987
136
This remains one of the biggest lessons I've learned about the general human population.

This lesson is that the majority of humans do not care about the truth...they cannot handle the truth. They instead prefer to live their lives clinging onto non-truths/fantasies/lies and seek the validation of their pre-conceived notions of reality. That is to say that the majority of humans suffer from a confirmation bias.
The lesson you learned is based on a classic example of confirmation bias. Some people, a minority actually, pushed you around because they didn't like your ideas, hence you concluded people in general cannot handle the truth. That is a classic example of faulty generalization.

There's more to having people accept truth than just posting it on a forum: part of it is establishing yourself as a trusted source, part is making compelling arguments, and most importantly it's understanding who you should talk to. In forums you win controversial debates by convincing all the other readers, not your adversaries. You talk to them, not to your opponents.

While NVIDIA's Pascal will be boosting all of its CUDA cores together, say from 1300MHz to 1500MHz, with the power usage that comes with it, GCN 4.0 will have clock gating incorporated at the Vector ALU level. This means that the potential for a boost clock is well beyond 1500MHz, dare I say perhaps even near or over 2GHz?!
While I do enjoy reading your insights into how modern GPUs work (if only I had a thank you button...), when I see you comparing a feature that Polaris is likely to have with the lack of said feature on Pascal, I feel compelled to ask: at this moment do you have any insights into specific power usage improvements or lack of such improvements in the new Pascal architecture?
 
Last edited:

KompuKare

Golden Member
Jul 28, 2009
1,236
1,611
136
hm, just because the patent is from Sept. 2014 doesn't mean all of the ideas have been incorporated into Polaris, does it?
I had wondered why Polaris and Vega were so close together, and more importantly how AMD hoped to gain another big perf/watt uplift in such a short period between the two, presumably on the same node. HBM2 will save some power, but not this much.
Isn't it possible that Polaris only incorporates some of the ideas outlined in the 2014 patent, while Vega implements more of them?
And that AMD had two separate teams working on these architectures, possibly intending to release them about a year apart (targeting the same node, like Nvidia did with Maxwell)?
 

Mopetar

Diamond Member
Jan 31, 2011
8,534
7,799
136
The lesson you learned is based on a classic example of confirmation bias. Some people, a minority actually, pushed you around because they didn't like your ideas, hence you concluded people in general cannot handle the truth. That is a classic example of faulty generalization.

You're missing something important though, which is the people who agreed with what he posted simply because it confirmed their own biases.

Humans in general are wired to be susceptible to confirmation bias (among other common cognitive biases) and practically everyone falls into that pitfall on a regular basis. It's not that people can't handle the truth, it's just that once we've made up our minds about something, we tend not to want to change, even in light of evidence.

You see it everywhere: religion, climate change, economic policy, vaccination, etc. The craziest part is that even when a person believes in an evidence-based approach for most things, there's usually some area where they will disregard that approach in favor of maintaining their own opinion.

It's not a generalization to suggest that most people would rather bury their head in the sand than accept something which does not conform to their existing views. It's just that what that thing is tends to vary from person to person. It's not that humans actively want to do that; it's just how we evolved, probably because at some point in our past it was beneficial to our survival to just go with an idea rather than second-guessing ourselves at every turn. But in a world that's approaching a post-scarcity society, such traits are detrimental rather than beneficial.
 

Saylick

Diamond Member
Sep 10, 2012
4,152
9,691
136
The ideas should have been baked into Polaris/Vega during design development before the patent was formally filed. AMD's engineers must have been working on fleshing out the ideas summarized in the patent well ahead of the patent's filing date because it would not make sense for AMD to file the patent before deciding to incorporate it into the design. I would imagine that if you, as an inventor, know a certain design concept is feasible, you should start building it and take the time to work out the kinks before you officially submit your design to the patent office.

Now, with regards to the patent, it sounds like these new "4th Gen" CUs will either be built with a mix of ALUs with different thread-widths (e.g. 2, 4, 8-thread wide) - let's call this Method A - or be built with a 16-thread wide SIMD unit (a la current GCN) but have the capability of power gating any number of threads such that if, for example, a 4-thread operation comes down the pipe, 12 of those ALUs can be shut down and the saved power can be used to increase the clock speed of the remaining 4 ALUs - Method B.

With that said, both methods should increase the total throughput of the chip but it sounds like going with Method A would be better for increasing the total ALU utilization by using up as many ALUs as possible given a mixed workload, whereas Method B relies more on power gating unused ALUs and increasing clock speeds to make up for inefficiency. Given a hypothetical chip using either of these methods, with Method A you may have (just guessing now) 90%+ utilization whereas you may only have 70% utilization with Method B but are also able to boost clocks 25% due to the power savings of shutting down unused ALUs. That allows Method B to at least catch up in terms of maximizing the total throughput of the chip with respect to Method A but at the cost of needing to run the ALUs at a higher clock.
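To put rough numbers on it, here's a quick back-of-the-envelope sketch using the purely made-up 90%/70% utilization and 25% boost figures from above (they are illustrative guesses, nothing more):

```python
# Toy comparison of the two hypothetical CU approaches described above.
# All numbers are illustrative guesses, not measured figures.

BASE_CLOCK = 1.0  # normalized clock

def relative_throughput(utilization, clock):
    """Throughput relative to a fully-utilized CU at the base clock."""
    return utilization * clock

method_a = relative_throughput(utilization=0.90, clock=BASE_CLOCK)         # mixed-width SIMDs
method_b = relative_throughput(utilization=0.70, clock=BASE_CLOCK * 1.25)  # power-gated 16-wide + boost

print(f"Method A (mixed widths): {method_a:.3f}")   # 0.900
print(f"Method B (gate + boost): {method_b:.3f}")   # 0.875
```

With those guesses, Method B lands at 0.875 vs 0.900 for Method A, i.e. it roughly catches up but only by spending power on higher clocks.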

My understanding is that it is better to keep clocks as low as possible while engaging as many ALUs as possible; that way, you keep the voltage low and thus keep the power low, hence why Method A sounds like the better approach. Now, I would not be surprised if there are corner cases where using a 2, 4, 8-thread breakdown (along with 2 scalar ALUs to fill in the gaps) does not allow for a 90%+ utilization rate. Given that there is no reason why both Method A and Method B can't be baked into the same chip, you can make up for this small amount of inefficiency by upping the clocks.

In the past, a GCN CU was comprised of (1) scalar ALU with (4) 16-thread wide SIMD units (64 vector ALUs per CU). With what the patent suggests, the change in CU will be comprised of (2) scalar ALUs with (1) 2-thread wide SIMD unit, (1) 4-thread wide SIMD unit, and (1) 8-thread wide SIMD unit a la Method A (14 vector ALUs per CU). This means that GCN 4 should have a lot more CUs than GCN 1-3 if you wanted to match the same number of SPs as before, but the good news is that you essentially have higher "IPC" due to the fact that for a given number of SPs, you have a higher SP utilization with GCN 4. By factoring in the benefits of Method B (aka "turbo mode"), you can increase "IPC" even further by turbo'ing the ALUs which are actually doing useful work.
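For a sense of scale, here is the same bookkeeping as a tiny sketch (the "GCN 4" CU layout is purely my reading of the patent, not a confirmed spec):

```python
# Rough per-CU ALU bookkeeping for the two layouts described above.

gcn_alus_per_cu  = 4 * 16     # classic GCN CU: four 16-wide SIMDs = 64 vector ALUs
gcn4_alus_per_cu = 2 + 4 + 8  # speculative patent-style CU: 2/4/8-wide SIMDs = 14 vector ALUs

# CUs needed to match a 4096-SP (Fiji-sized) shader array at each width
print(4096 // gcn_alus_per_cu)        # 64 CUs
print(-(-4096 // gcn4_alus_per_cu))   # 293 CUs (ceiling division)
```

So matching Fiji's SP count with the narrower CUs would take several times as many CUs, which is exactly why the "higher IPC per SP" framing matters.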

I would not be surprised if this patent is the "secret sauce" that AMD is using when they claim their 30/70 split on the 2.5x perf/W increase.

Edit:

I gave this some more thought and realized how much of an impact the implications of this patent have when you combine it with something like Async Shading. In the past, if you didn't have AC and the workload was light, you'd turbo the entire chip until thermal or power limits were reached. Now, if you have power gating at the ALU level, you can turbo only the active ALUs and achieve higher performance than before because you don't have to power up unused portions of the chip. No change in how PowerTune works here. Now, with AC, you can fill up the unused ALUs with meaningful work and bring back down the clocks to improve perf/W but the crucial difference now is that with the increased granularity that Method A provides, you have a "double whammy" effect where both Method A and AC increase utilization. If both the graphical pipeline and the compute pipelines are unable to fill up all ALUs, then you can power down the unused ALUs and turbo the active ALUs for even more throughput.

Beyond that, all that you need to do is make sure no other portion of the chip holds back ALU throughput (i.e. make sure the front end and ROPs are fast enough) and you should see pretty good utilization and/or maximum throughput in a given thermal/power envelope most of the time. This is where the Primitive Discard Accelerator, improved Command Processor, Geometry Processor, L2 cache, and Memory Controller come in.
 
Last edited:

coercitiv

Diamond Member
Jan 24, 2014
7,518
17,987
136
It's not a generalization to suggest that most people would rather bury their head in the sand than accept something which does not conform to their existing views.
But it is: the simple fact that our society has greatly evolved throughout millennia is the most compelling argument against this generalization.

Or do you reckon we have somehow managed to change despite being naturally wired to repel change? It's one thing to acknowledge being susceptible to such a fault, another to proclaim "the majority of humans do not care about the truth...they cannot handle the truth".
 

Mopetar

Diamond Member
Jan 31, 2011
8,534
7,799
136
But it is: the simple fact that our society has greatly evolved throughout millennia is the most compelling argument against this generalization.

Or do you reckon we have somehow managed to change despite being naturally wired to repel change? It's one thing to acknowledge being susceptible to such a fault, another to proclaim "the majority of humans do not care about the truth...they cannot handle the truth".

We move forward because no matter how much you might want to believe something, those who cling to that which is wrong tend to be less successful than those who eventually move towards what is right. For example, if AMD did come out with a monster graphics card that destroys NV, a lot of people (even if they were huge NV fans) might still buy the AMD card because it's a better value for their money.

Look back far enough through human history and all of the things which we accept as true today were once considered nonsense because of a previous prevailing theory. Once upon a time it was common knowledge that the earth/sun was the center of the universe or that disease spread through noxious odors referred to as miasma.

I don't think that people can't handle the truth (though depending on what the OP meant by this I might accept that argument as well), but I do believe it's fair to say that they don't care about it. Most of what people believe is just what's been handed to them, much like the ideas of miasma or a geocentric universe. They never really question why something is true or look for reasons to change their mind.

But we're kind of straying off topic with this, so back to GPUs I suppose.
 

coercitiv

Diamond Member
Jan 24, 2014
7,518
17,987
136
But we're kind of straying off topic with this, so back to GPUs I suppose.
Indeed, going back to GPUs and these changes that may come with the next iteration of GCN, I wonder if they will also change the way we understand and use GPU overclocking.
 

Magee_MC

Senior member
Jan 18, 2010
217
13
81
Indeed, going back to GPUs and these changes that may come with the next iteration of GCN, I wonder if they will also change the way we understand and use GPU overclocking.

My first thought is that it shouldn't have much of an effect on how we overclock GPUs. As I understand it, all of these new functions are hardware controlled and increasing the clocks shouldn't affect their function.
 

Mahigan

Senior member
Aug 22, 2015
573
0
0
My first thought is that it shouldn't have much of an effect on how we overclock GPUs. As I understand it, all of these new functions are hardware controlled and increasing the clocks shouldn't affect their function.

That all depends on whether or not a clock multiplier is used. If a multiplier is used then overclocking the GPU will result in an overclock to the boost mechanism.
 

.vodka

Golden Member
Dec 5, 2014
1,203
1,538
136
On top of all this new functionality, there's the increased raw clock speeds from the 28nm -> 14FF move that will further increase performance. If perf/SP is going up this much then the rumored SP counts for Polaris 10 and 11 start to make much more sense. 2048sp GM204 did wreck 2880sp GK110 after all (a combination of better architecture and higher clock speeds).
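Just to put rough numbers on that GM204 vs GK110 comparison (the reference boost clocks below are approximate and from memory, not from this thread):

```python
# Paper throughput at roughly reference boost clocks (approximate figures).
def sp_tflops(shaders, clock_ghz):
    return 2 * shaders * clock_ghz / 1000  # 2 FLOPs per shader per clock (FMA)

gk110 = sp_tflops(2880, 0.93)   # GTX 780 Ti, ~928 MHz boost (approx.)
gm204 = sp_tflops(2048, 1.22)   # GTX 980, ~1216 MHz boost (approx.)

print(f"GK110 ~{gk110:.1f} TFLOPS, GM204 ~{gm204:.1f} TFLOPS")  # ~5.4 vs ~5.0
```

At reference clocks GK110 actually still had more raw FLOPS on paper, so the wrecking came largely from architecture/utilization (plus how much further the custom Maxwell cards could clock).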


The Vega chips could be monsters at Fiji's number of SPs, plus HBM2 to feed the beast.
 

Magee_MC

Senior member
Jan 18, 2010
217
13
81
That all depends on whether or not a clock multiplier is used. If a multiplier is used then overclocking the GPU will result in an overclock to the boost mechanism.

Would that mean that the boost mechanism might limit the amount that the card could be OC'd?
 

Mopetar

Diamond Member
Jan 31, 2011
8,534
7,799
136
Also, if it's mostly there to provide turbo-boost-like functionality and you overclock high enough to eat all of that headroom, it probably won't boost much at all.

I think that this is probably a bigger deal for people who don't overclock themselves. Ideally it could be used to keep the frame rate smoother by temporarily boosting if there's a sudden spike in workload.
 

Mahigan

Senior member
Aug 22, 2015
573
0
0
Also, if it's mostly there to provide turbo-boost-like functionality and you overclock high enough to eat all of that headroom, it probably won't boost much at all.

I think that this is probably a bigger deal for people who don't overclock themselves. Ideally it could be used to keep the frame rate smoother by temporarily boosting if there's a sudden spike in workload.

Technically you couldn't overclock the GPU as high as the per-Vector-ALU boost clock, because when you overclock the GPU you overclock every component within it. You would probably hit a limit well before the potential per-ALU boost clock this patent mentions.

Since unused ALUs are powered down so that the active ALUs can be boosted, you could see Vector ALUs running in excess of 2GHz.
 
Feb 19, 2009
10,457
10
76
The ideas should have been baked into Polaris/Vega during design development before the patent was formally filed. AMD's engineers must have been working on fleshing out the ideas summarized in the patent well ahead of the patent's filing date because it would not make sense for AMD to file the patent before deciding to incorporate it into the design. I would imagine that if you, as an inventor, know a certain design concept is feasible, you should start building it and take the time to work out the kinks before you officially submit your design to the patent office.

Now, with regards to the patent, it sounds like these new "4th Gen" CUs will either be built with a mix of ALUs with different thread-widths (e.g. 2, 4, 8-thread wide) - let's call this Method A - or be built with a 16-thread wide SIMD unit (a la current GCN) but have the capability of power gating any number of threads such that if, for example, a 4-thread operation comes down the pipe, 12 of those ALUs can be shut down and the saved power can be used to increase the clock speed of the remaining 4 ALUs - Method B.

With that said, both methods should increase the total throughput of the chip but it sounds like going with Method A would be better for increasing the total ALU utilization by using up as many ALUs as possible given a mixed workload, whereas Method B relies more on power gating unused ALUs and increasing clock speeds to make up for inefficiency. Given a hypothetical chip using either of these methods, with Method A you may have (just guessing now) 90%+ utilization whereas you may only have 70% utilization with Method B but are also able to boost clocks 25% due to the power savings of shutting down unused ALUs. That allows Method B to at least catch up in terms of maximizing the total throughput of the chip with respect to Method A but at the cost of needing to run the ALUs at a higher clock.

My understanding is that it is better to keep clocks as low as possible while engaging as many ALUs as possible; that way, you keep the voltage low and thus keep the power low, hence why Method A sounds like the better approach. Now, I would not be surprised if there are corner cases where using a 2, 4, 8-thread breakdown (along with 2 scalar ALUs to fill in the gaps) does not allow for a 90%+ utilization rate. Given that there is no reason why both Method A and Method B can't be baked into the same chip, you can make up for this small amount of inefficiency by upping the clocks.

In the past, a GCN CU was comprised of (1) scalar ALU with (4) 16-thread wide SIMD units (64 vector ALUs per CU). With what the patent suggests, the change in CU will be comprised of (2) scalar ALUs with (1) 2-thread wide SIMD unit, (1) 4-thread wide SIMD unit, and (1) 8-thread wide SIMD unit a la Method A (14 vector ALUs per CU). This means that GCN 4 should have a lot more CUs than GCN 1-3 if you wanted to match the same number of SPs as before, but the good news is that you essentially have higher "IPC" due to the fact that for a given number of SPs, you have a higher SP utilization with GCN 4. By factoring in the benefits of Method B (aka "turbo mode"), you can increase "IPC" even further by turbo'ing the ALUs which are actually doing useful work.

I would not be surprised if this patent is the "secret sauce" that AMD is using when they claim their 30/70 split on the 2.5x perf/W increase.

Edit:

I gave this some more thought and realized how much of an impact the implications of this patent have when you combine it with something like Async Shading. In the past, if you didn't have AC and the workload was light, you'd turbo the entire chip until thermal or power limits were reached. Now, if you have power gating at the ALU level, you can turbo only the active ALUs and achieve higher performance than before because you don't have to power up unused portions of the chip. No change in how PowerTune works here. Now, with AC, you can fill up the unused ALUs with meaningful work and bring back down the clocks to improve perf/W but the crucial difference now is that with the increased granularity that Method A provides, you have a "double whammy" effect where both Method A and AC increase utilization. If both the graphical pipeline and the compute pipelines are unable to fill up all ALUs, then you can power down the unused ALUs and turbo the active ALUs for even more throughput.

Beyond that, all that you need to do is make sure no other portion of the chip holds back ALU throughput (i.e. make sure the front end and ROPs are fast enough) and you should see pretty good utilization and/or maximum throughput in a given thermal/power envelope most of the time. This is where the Primitive Discard Accelerator, improved Command Processor, Geometry Processor, L2 cache, and Memory Controller come in.

An excellent post, it seems someone has a deep understanding of technical reading. :)

We don't know whether this patent makes it into Polaris, but this version of GCN is revolutionary.

It is in effect Hyper-threading (SMT) for individual SPs, with power gating and turbo boost at the SP level.

Any game engine that utilizes more compute tasks will fly on this uarch, as they can all be processed in parallel along with graphics threads in the different ALUs. Yeah, it's got a crazy level of Async Shading active all the time. Let that sink in a bit.
 

Mopetar

Diamond Member
Jan 31, 2011
8,534
7,799
136
Technically you couldn't overclock the GPU as high as the per-Vector-ALU boost clock, because when you overclock the GPU you overclock every component within it. You would probably hit a limit well before the potential per-ALU boost clock this patent mentions.

Since unused ALUs are powered down so that the active ALUs can be boosted, you could see Vector ALUs running in excess of 2GHz.

But you also reach a limit where the boost you can expect becomes practically zero if you're overvolting to get a better OC. Sure, it could still boost, but the inefficiency at such levels means there's less benefit before the chip needs to govern itself. If the chip is pushed to its thermal limits without the boost, then the boost can't be expected to kick in under normal circumstances.

If you're running the card against something that doesn't utilize most of the ALUs most of the time, then the boost doesn't get you much more than smoother frame rates during periodic spikes of demand. But if you're already maxing out the chip to attain a higher frame rate (as opposed to targeting a fixed frame rate), then there isn't much room for the boost to come into play unless the drivers are poorly written or there's other non-optimal code at play. Maybe that functionality helps them deal with games that aren't written to fully or best utilize the hardware, but that's still less than ideal.

Thinking back though, AMD did show off Polaris running Hitman at 60 FPS. They specifically mentioned that they'd capped the frame rate and showed that it remained stable. That makes me think they have this technology enabled, since that game is known to have sections that cause frame rate drops, which this technology could theoretically smooth out. The only other explanation is that they've got a card that can do more than 60 FPS, but don't want to tip their hand as to exactly how much more.
 

Saylick

Diamond Member
Sep 10, 2012
4,152
9,691
136
At the end of the day, the patent only allows AMD to extract performance out of the ALUs closer to what the theoretical maximum should be.

Simply put, with other bottlenecks aside, Perf = (Clock Speed) x (Number of ALUs) x (Utilization Ratio). The patent helps with Clock Speed and Utilization but AMD really needs to make sure they produce a chip with > 4096 ALUs if they want to hit it out of the park this generation.

Why? It's because nVidia already achieved high utilization of their CCs ever since Maxwell was released (which is why nVidia is downplaying Async Compute because they don't need it to max out their utilization ratio), and Pascal only seeks to improve on that.

My understanding is that Maxwell's IPC and perf/W gains were largely due to an increase in the granularity of the CCs: instead of having an SMX with all (192) CCs powered on, you have an SMM with (4) blocks of (32) CCs, where I presume some of the blocks could be powered down when they were not fed useful work. This is in contrast with Kepler's SMX, where you had all of the shared resources powered on even if only a fraction of the (192) CCs were used.
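To put toy numbers on that granularity difference (remember, the per-block power gating on Maxwell is just my presumption, and the last line simply extends the same arithmetic to the patent's ALU-level gating):

```python
# Fraction of a shader block that must stay powered for a given amount of active work.
import math

def powered_fraction(active_lanes, block_size, num_blocks):
    """Fraction of the unit's lanes that stay powered to run `active_lanes` lanes of work."""
    total = block_size * num_blocks
    powered = min(total, math.ceil(active_lanes / block_size) * block_size)
    return powered / total

# 48 active lanes of work:
print(powered_fraction(48, block_size=192, num_blocks=1))   # Kepler SMX:  1.00 (all 192 on)
print(powered_fraction(48, block_size=32,  num_blocks=4))   # Maxwell SMM: 0.50 (2 of 4 blocks)
print(powered_fraction(48, block_size=1,   num_blocks=64))  # ALU-level gating a la the patent, 64-ALU CU: 0.75
```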

The patent of course takes a similar approach but takes the concept to a whole new level of granularity by allowing the power gating to be done at the ALU level, thus allowing AMD to close the utilization gap between Pascal and GCN.

All that's left at this point, assuming utilization is the same between the two architectures, is to ensure that the theoretical maximum throughput remains higher than the competitor's solution.

Let's assume that nVidia will release a big die (~600 mm^2) catered towards gaming with all of the DP compute stripped out. Realistically, we'd be looking at a GM200-styled chip that has been scaled down to 14nm then scaled back up in die size until 600 mm^2 was hit. That produces a chip with basically double the number of CCs that GM200 has, i.e. 6144 CCs, but I do not expect nVidia to be able to reach the same clocks with this theoretical "GP200" as it did with P100 due to the increase in CCs. Let's use 1.1 GHz for discussion's sake, since nVidia's chips seem to clock higher than AMD's counterparts. That gives you a theoretical peak throughput of 13.5 TFLOPS (single precision).

If AMD were to match this number, they would need 6758 SPs at a clock of 1 GHz. Fiji was 598 mm^2, so Fiji on 14nm should be no more than 300 mm^2. Scaling that back up to a Hawaii-level die size gives you 5980 SPs (4096 x 438 / 300). Now, 5980 is still less than 6758 but AMD has a significant die size advantage here. The performance gap is roughly 10-15% in favor of "GP200" over "Fiji 2.0", but if AMD decided to produce an even larger chip, they should have the performance crown. I could imagine an 8000 SP chip if AMD decides to go all-out with a 600 mm^2 die on 14nm. Again, assuming utilization ratios were equal, nVidia would need a 33% clock speed advantage to tie.
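For anyone who wants to check the arithmetic, here is the same back-of-the-envelope math; every input is an assumption lifted straight from this post, not a real spec:

```python
# Reproducing the scaling math above: Perf = Clock x Number of ALUs x Utilization.

def sp_tflops(alus, clock_ghz, utilization=1.0):
    return 2 * alus * clock_ghz * utilization / 1000  # 2 FLOPs per ALU per clock (FMA)

gp200_tflops   = sp_tflops(6144, 1.1)               # hypothetical 600 mm^2 gaming "GP200"
amd_sps_needed = gp200_tflops * 1000 / (2 * 1.0)     # SPs AMD would need at 1 GHz to match

fiji_14nm_area   = 300                               # Fiji (598 mm^2) shrunk to 14nm, ~300 mm^2
hawaii_sized_sps = 4096 * 438 / fiji_14nm_area       # scale Fiji's 4096 SPs up to a 438 mm^2 die

print(round(gp200_tflops, 1))    # ~13.5 TFLOPS
print(round(amd_sps_needed))     # ~6758 SPs
print(round(hawaii_sized_sps))   # ~5980 SPs -> the ~10-15% gap mentioned above
```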
 

gamervivek

Senior member
Jan 17, 2011
490
53
91
As before, much of it is going to hinge on clock speeds. A 390X could match a 980 Ti if it were at the 980's boost clocks. With Nvidia's focus on DP this time round, if AMD can close the clock speed gap with the new process, it could be a repeat of the Fermi days.

Doesn't look likely though, given the Polaris leaks and the ~1.5GHz from GP100.
 

3DVagabond

Lifer
Aug 10, 2009
11,951
204
106
It's because nVidia already achieved high utilization of their CCs ever since Maxwell was released (which is why nVidia is downplaying Async Compute because they don't need it to max out their utilization ratio), and Pascal only seeks to improve on that.

I've heard fans on forums claim this. Has it been documented anywhere that nVidia has any advantage other than better threading with DX11 drivers? I think this is where they get their advantage.
 
Feb 19, 2009
10,457
10
76
@Saylick

SPs and CUDA cores normally have individual math units like scalar and vector; AFAIK they can only run one workload/thread at a time. So take the 16-wide vector ALU for example: if it's running work that only requires a 2-8 wide ALU, it's under-performing relative to its design.

There's no multi-threading at the SP/CC level currently.

NV's utilization advantage is DX11 Multi-Threaded Rendering. For GM200 and its 3,072 CCs, most of them will be utilized because the GigaThread engine is fed by multiple CPU cores scheduling the tasks.

Fury X for example has 4,096 SPs, but at low resolution, where the threads get processed faster, the SPs basically run out of work to do because the single-threaded driver can't feed enough work to the Command Processor to keep those shaders utilized. At higher resolution, where the workload increases and takes longer to process, more of the SPs are running at any one time, hence it scales better at higher resolutions.

DX12 completely solves that aspect already.

This GCN with SMT and power gating/boost is another beast altogether. Imagine on Vega: take the 4096 SPs, and now each CU is capable of processing 2 scalar threads and up to 4 vector threads concurrently. Massive uplift in work per SP. Of course this is the maximum ideal scenario, where there are workloads that can be distributed across the 2, 4, 8 and 16-wide vector ALUs and the 2 scalar ALUs. In games it won't be 6 threads, as gaming loads tend to be repetitive and consistent. The kicker here is that compute workloads can vary more, with post effects that need various different maths.

The cool thing is that if there are, say, workloads that only need an 8-wide vector ALU and 1 scalar, it can power down the other ALUs and boost the 8-wide and scalar units to higher clocks to finish the task faster.
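A toy illustration of that dispatch idea (the 2/4/8/16-wide CU layout is my speculation from the patent, and the scheduling here is deliberately simplistic):

```python
# Assign each in-flight vector op to the narrowest free SIMD that fits it,
# then gate whatever is left idle and boost what's active.
SIMD_WIDTHS = [2, 4, 8, 16]

def schedule(op_widths):
    free = set(SIMD_WIDTHS)
    active = []
    for width in sorted(op_widths, reverse=True):
        fits = sorted(w for w in free if w >= width)
        if fits:
            active.append(fits[0])
            free.remove(fits[0])
    return active, sorted(free)

active, gated = schedule([8])    # e.g. only an 8-wide workload in flight
print("boost:", active)          # [8]        -> clock these up
print("gate :", gated)           # [2, 4, 16] -> power these down
```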

It's a win-win scenario and a really clever design, ON PAPER.

On paper it sounds like the second coming of Jesus for GPU tech; it really is that good. But let's see how it actually translates to reality. I'm pumped, this is the most exciting GPU uarch launch in a very long time.
 
Feb 19, 2009
10,457
10
76
I've heard fans on forums claim this. Has it been documented anywhere that nVidia has any advantage other than better threading with DX11 drivers? I think this is where they get their advantage.

Exactly.

In DX12, where AMD's single-thread woes are no longer holding it back, we see GCN leap up in every segment, most of all Fury X. It pulls away from Hawaii more than in DX11, and with Async Compute in play it has a sizeable lead.

This is related to shader/CC utilization.

Not actually shader/CC IPC, which this GCN patent is talking about.
 

Saylick

Diamond Member
Sep 10, 2012
4,152
9,691
136
I've heard fans on forums claim this. Has it been documented anywhere that nVidia has any advantage other than better threading with DX11 drivers? I think this is where they get their advantage.

I have not seen formal documentation, but the fact that Maxwell's (128) CCs perform at roughly 90% of Kepler's (192) CCs should say something about how much extra utilization they are getting out of Maxwell on a per-CC basis. Whether or not there is a definitive DX11 advantage to Maxwell over Kepler is out of my realm of knowledge, but I do not recall there being a distinct advantage in terms of driver multi-threading in favor of Maxwell... but I could be mistaken.
 

Saylick

Diamond Member
Sep 10, 2012
4,152
9,691
136
Exactly.

In DX12, where AMD's single-thread woes are no longer holding it back, we see GCN leap up in every segment, most of all Fury X. It pulls away from Hawaii more than in DX11, and with Async Compute in play it has a sizeable lead.

This is related to shader/CC utilization.

Not actually shader/CC IPC, which this GCN patent is talking about.

I would argue that the patent increases shader utilization just as much as one can argue that it improves IPC. Shader utilization and IPC are two performance/efficiency indices that are joined at the hip, in my opinion. :)
 

3DVagabond

Lifer
Aug 10, 2009
11,951
204
106
I have not seen formal documentation, but the fact that Maxwell's (128) CCs perform at roughly 90% of Kepler's (192) CCs should say something about how much extra utilization they are getting out of Maxwell on a per-CC basis. Whether or not there is a definitive DX11 advantage to Maxwell over Kepler is out of my realm of knowledge, but I do not recall there being a distinct advantage in terms of driver multi-threading in favor of Maxwell... but I could be mistaken.

Were there any other changes to the CCs that would explain their higher performance? I doubt it's simply higher utilization, especially considering nVidia's performance advantage vs. AMD at lower res was present in Kepler too.

Remove or shift the bottleneck from the CPU to the GPU and AMD performs better relative to nVidia. That's the only trend I've seen and it's been going on for longer than Maxwell.
 
Feb 19, 2009
10,457
10
76
I have not seen formal documentation, but the fact that Maxwell's (128) CCs perform at roughly 90% of Kepler's (192) CCs should say something about how much extra utilization they are getting out of Maxwell on a per-CC basis. Whether or not there is a definitive DX11 advantage to Maxwell over Kepler is out of my realm of knowledge, but I do not recall there being a distinct advantage in terms of driver multi-threading in favor of Maxwell... but I could be mistaken.

There was a really informative diagram from a scientific research paper on Fermi, Kepler and Maxwell warp/thread size and its effect on memory access/bandwidth utilization. Mahigan linked it, and it is really striking.

Kepler had issues hitting its 192 CC per SM cluster with warp sizes commonly used in games. Basically, warp size 32 was optimal (still not peak); at 64 it would tank, losing a lot of performance.

In compute work with Hyper-Q, I am sure GK110 and GK210 were very good, which is why they retain the HPC Tesla crown, a role Maxwell never filled due to its lack of FP64.

So despite Maxwell having 128 CCs per SM and fewer CCs overall, its much higher clocks and better CC utilization mean it can pull ahead.

Pascal has moved to an even smaller CC-per-SM cluster, 64, which matches GCN. Basically, as all modern games are console ports and use warps/wavefronts of 64, Maxwell loses some utilization and Kepler tanks even worse; Pascal won't, it will be just fine.
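To see the point in the crudest possible terms (this ignores schedulers, dual-issue and occupancy entirely; it's just lane counting):

```python
# With console-style wavefronts of 64 threads, how many must be in flight
# just to cover every CC in one SM? A deliberately over-simplified model.
import math

WAVEFRONT = 64
for arch, cc_per_sm in [("Kepler SMX", 192), ("Maxwell SMM", 128), ("Pascal SM", 64)]:
    needed = math.ceil(cc_per_sm / WAVEFRONT)
    print(f"{arch:11s}: {cc_per_sm} CCs -> {needed} wavefront(s) of 64 to fill")
```

The narrower the SM, the less concurrent work it takes to keep it busy, which is the utilization argument in a nutshell.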

These changes all increase CC/SP utilization, i.e. how many shaders are being used at any one time.

GCN's SMT and gating/boost at the SP level increase IPC, i.e. when a shader is working, how much throughput/perf it delivers.

Better pre-fetch and a new Command Processor address SP utilization.

The Primitive Discard Accelerator addresses overall efficiency by not rendering invisible geometry at all.