[Videocardz] AMD Polaris 11 SKU spotted, has 16 Compute Units


Saylick

Diamond Member
Sep 10, 2012
4,154
9,693
136

SPs and CUDA cores normally have individual math units, like scalar and vector ALUs; AFAIK, they can only run one workload/thread at a time. Take the 16-wide vector ALU, for example: if it's running work that only requires a 2- to 8-wide ALU, it's under-performing relative to its design.

There's no multi-threading at the SP/CC level currently.

NV's utilization advantage is DX11 multi-threaded rendering. Take GM200 and its 3,072 CCs: most of them will be utilized because they are fed by the GigaThread engine, which uses multiple CPU cores to schedule the tasks.

Fury X, for example, has 4,096 SPs, but at low resolution, where threads get processed faster, the SPs basically run out of work to do because the single-threaded driver can't feed enough work to the Command Processor to keep those shaders utilized. At higher resolution, where the workload increases and takes longer to process, more of the SPs are running at any one time, hence it scales better at higher resolutions.

DX12 completely solves that aspect already.

This GCN with SMT and power gating/boost is another beast altogether. Imagine on Vega: take the 4,096 SPs, and now each SP is capable of running 2 scalar threads and up to 4 vector threads concurrently. That's a massive uplift in work per SP. Ofc this is the maximum ideal scenario, where there are workloads that can be distributed across 2-, 4-, 8-, and 16-wide vector ALUs and 2 scalar ALUs. In games it won't be 6x threads, as gaming loads tend to be repetitive and consistent. The kicker here is that compute workloads can vary more, with post effects that need various different maths.

The cool thing is, if there's a workload that only needs an 8-wide vector ALU and 1 scalar, it can power down the other ALUs and boost the 8-wide and scalar units to higher clocks to finish the task faster.
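The gate-and-boost idea above can be sketched in a few lines. To be clear, this is just a toy model and not AMD's implementation: the 1000 MHz base clock, the 1 W per-lane power figure, the 1.5x boost cap, and the linear clock-vs-power scaling are all my assumptions.

```python
# Toy model of per-ALU power gating plus boost. All figures hypothetical:
# 1 W per active lane, 1000 MHz base clock, boost capped at 1.5x.

BASE_CLOCK_MHZ = 1000
POWER_PER_LANE_W = 1.0
MAX_BOOST = 1.5

def gate_and_boost(needed_lanes, total_lanes=16):
    """Gate the lanes a workload doesn't need, then spend the freed
    power budget on a (capped) clock boost for the remaining lanes."""
    gated = total_lanes - needed_lanes
    budget_w = total_lanes * POWER_PER_LANE_W    # fixed per-SIMD budget
    per_lane_w = budget_w / needed_lanes         # concentrated on active lanes
    boost = min(per_lane_w / POWER_PER_LANE_W, MAX_BOOST)
    return gated, BASE_CLOCK_MHZ * boost

# e.g. an 8-wide vector op on a 16-lane SIMD in this model:
gated, clock = gate_and_boost(needed_lanes=8)   # 8 lanes gated, 1500 MHz
```

Same die, same power budget, but the work that actually exists finishes sooner.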

It's a win-win scenario and a really clever design, ON PAPER.

On paper it sounds like the next coming of Jesus for GPU tech, it really is that good. But let's see how it actually translates to reality. I'm pumped; this is the most exciting GPU uarch launch in a very long time.

I'm not entirely sure that this patent is enabling "multi-threading" at the SP level, but we may be agreeing on the same thing. I think you're right to frame the concept as an SMT-esque approach, in that what would normally be underutilized SPs are now being fed with useful work, very much in line with how unused execution units in Intel's CPUs may be used by another thread if the resources are mutually exclusive. If anything, I would argue that this "GPU-SMT" happens at the CU level.

With regards to DX11, we're talking about two constraints which hold back the GPU from 100% SP utilization:
1) Driver/CPU side: If the driver can't issue enough work faster than the GPU can process the work, you will be under-utilizing the computational resources of the GPU. This is the issue with AMD's DX11 implementation that nVidia does not have an issue with. DX12 will alleviate this issue.
2) Workload-dependent/GPU side: If a hypothetical workload were highly mixed and highly scrambled, such that the GPU only receives sub-16-thread vector math, then the GPU will be under-utilizing its resources, since it was built for 16-thread-wide vector operations. The patent aims to alleviate this issue by enabling A) a more granular breakdown of vector widths, such that a mixed workload may be more easily divided amongst all ALUs without leaving resources unused, and/or B) power gating any unused ALUs and using the available power budget to power up the ALUs doing useful work.

But again, these two issues, if resolved, only help with maximizing utilization, and thus allow the GPU's expected performance to come closer to its peak performance.
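Constraint 2 is easy to quantify with a toy model. The op widths below are made up; the point is just that fixed 16-wide issue leaves lanes idle in proportion to how far each op falls short of 16 lanes.

```python
# Toy utilization model: each vector op occupies a whole 16-wide SIMD for
# one issue cycle, so lanes beyond the op's width simply idle.

SIMD_WIDTH = 16

def utilization(op_widths):
    """Fraction of issued lanes that did useful work."""
    useful = sum(op_widths)
    issued = SIMD_WIDTH * len(op_widths)
    return useful / issued

mixed = [2, 4, 8, 12, 16]        # hypothetical "scrambled" workload
uniform = [16, 16, 16, 16, 16]   # the workload the hardware was built for

# mixed -> 42/80 = 52.5% of lanes busy; uniform -> 100%
```

That gap between 52.5% and 100% is exactly the headroom options A and B above are chasing.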
 

Slaughterem

Member
Mar 21, 2016
77
23
51
Beyond that, all that you need to do is make sure no other portion of the chip holds back ALU throughput (i.e. make sure the front end and ROPs are fast enough) and you should see pretty good utilization and/or maximum throughput in a given thermal/power envelope most of the time. This is where the Primitive Discard Accelerator, improved Command Processor, Geometry Processor, L2 cache, and Memory Controller come in.
And so here is some more for those that understand it. Unfortunately I do not, but I really appreciate the simple explanations that you and Mahigan provide.
http://companyprofiles.justia.com/company/amd
Ordering thread wavefronts instruction operations based on wavefront priority, operation counter, and ordering scheme
Patent Number 9304772 - April 5, 2016
A system and method is provided for improving efficiency, power, and bandwidth consumption in parallel processing. Rather than requiring…
Detecting multiple stride sequences for prefetching
Patent Number 9304919 - April 5, 2016
The present application describes some embodiments of a prefetcher that tracks multiple stride sequences for prefetching. Some embodiments of…
Techniques for identifying and handling processor interrupts
Patent Number 9304955 - April 5, 2016
A method for identifying and reporting interrupt behavior includes incrementing a counter when an interrupt signal is a designated type and is…
Selection of an operating point of a memory physical layer interface and a memory controller based on memory bandwidth utilization
Patent Number 9298243 - March 29, 2016
The present application describes embodiments of a method that includes modifying an operating point of at least one of a memory physical…
 
Feb 19, 2009
10,457
10
76

Nice. Gonna take some time to digest that.

This is basically the new tech for Polaris, as they are talking about a new prefetcher, wavefront optimizations, and enhanced command processors for thread scheduling, etc.

AMD-Polaris-5.jpg


http://patents.justia.com/patent/20160085551

"HETEROGENEOUS FUNCTION UNIT DISPATCH IN A GRAPHICS PROCESSING UNIT"

A compute unit configured to execute multiple threads in parallel is presented. The compute unit includes one or more single instruction multiple data (SIMD) units and a fetch and decode logic. The SIMD units have differing numbers of arithmetic logic units (ALUs), such that each SIMD unit can execute a different number of threads. The fetch and decode logic is in communication with each of the SIMD units, and is configured to assign the threads to the SIMD units for execution based on such differing numbers of ALUs.

^ SMT for SIMD units. The wording is quite clear, and when you examine it in detail, it is actually multi-threading the SIMDs to target individual scalar and vector ALUs with separate threads.
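As a rough sketch of what "heterogeneous function unit dispatch" could look like, here is a toy fetch/decode stage that routes each op to the narrowest SIMD wide enough to hold it. The narrowest-fit policy and the 2/4/8 unit mix are my assumptions; the abstract only says assignment is based on the differing ALU counts.

```python
# Hypothetical dispatch: route each op to the narrowest SIMD unit that
# still fits it, keeping wide units free for wide work.

def dispatch(op_widths, simd_units=(2, 4, 8)):
    """Return (op_width, chosen_unit) pairs. Ops wider than the widest
    unit are sent there anyway and would issue over multiple cycles."""
    plan = []
    for w in op_widths:
        fitting = [u for u in sorted(simd_units) if u >= w]
        plan.append((w, fitting[0] if fitting else max(simd_units)))
    return plan

# a 3-wide op lands on the 4-wide unit, a 7-wide on the 8-wide,
# a 12-wide overflows onto the 8-wide unit over two cycles
```

A real scheduler would also have to track which units are busy, but the core idea of matching op width to unit width is the same.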
 

Saylick

Diamond Member
Sep 10, 2012
Were there any other changes to the CCs that would explain their higher performance? I doubt it's simply higher utilization, especially considering nVidia's performance advantage over AMD at lower res was present in Kepler too.

Remove or shift the bottleneck from the CPU to the GPU and AMD performs better relative to nVidia. That's the only trend I've seen and it's been going on for longer than Maxwell.

Yeah, I'm not sure about the specific details that enabled nVidia to hit their 1.35x IPC increase from Kepler to Maxwell outside of what AT reported, but from AT's analysis alone, it appears that a lot of the improvements made to Maxwell were aimed at increasing the utilization ratio of the CCs, or at least getting more of them doing something useful for a longer period of time. The heavy increase in L2 cache in Maxwell supposedly helps ensure that the CCs aren't waiting on memory to do work, and the refinement from 192 CCs to 128 CCs per SM (broken down into 4 groups) probably helps ensure that more CCs are doing work and fewer of them are idling. Nvidia's "secret sauce" wasn't fully divulged, so I have no clue as to the exact reason why Maxwell has better perf/CC than Kepler.

As for having AMD GPUs perform better relative to nVidia when the bottleneck shifts towards the GPU, I think that is simply because AMD GPUs have more resources under the hood. In the past few generations, AMD GPUs tend to have more SPs than their respective nVidia offerings, and while I know that SPs =/= CCs, my understanding of why that equivalency cannot be made comes down to their respective IPCs (or utilization). In other words, AMD made up for weaker IPC/utilization by throwing more resources at the task. With DX12, by alleviating the CPU bottleneck and thus putting more of the bottleneck on the GPU, you're uncovering more of the GPU's full potential and it is my understanding that AMD GPUs just had more potential than nVidia GPUs from a pure resource point of view.
 

beginner99

Diamond Member
Jun 2, 2009
5,320
1,768
136
The ideas should have been baked into Polaris/Vega during design development before the patent was formally filed. AMD's engineers must have been working on fleshing out the ideas summarized in the patent well ahead of the patent's filing date because it would not make sense for AMD to file the patent before deciding to incorporate it into the design.

At first I was skeptical about your line of thought. It does make sense to file a patent you are not using if you want to prevent a competitor from using it. But AFAIK NV and AMD have a cross-license agreement, so sooner or later NV can use it once it's published.

The question remains: if you really have a great invention which can't be easily reverse engineered, why patent it at all when you have a cross-license agreement with your competitor? You only give him the idea, and he can easily copy it. Would it even be that easy to detect this invention in running silicon and thereby help the competitor implement it? Sometimes trade secrets have their advantages (see Coke).
 
Feb 19, 2009

Because reading a patent and understanding the high level concepts is easy, but actually implementing it on transistor level is far from easy.

How do you get each SIMD to inform the hardware scheduler on the current occupancy (of each scalar/vector ALU), the task and the ETA for when the task gets finished? In terms of transistors, that's an insane level of monitoring and dynamism, for each of the many SIMDs.

DX12, for example, was announced a few years ago. NV would have known full well that to leverage it they needed a multi-engine design, but how to implement it in a way that's compatible with their current designs? Not so easy.

This is why reading an existing patent doesn't actually help you implement the technology. That requires the actual brains behind those designs. A higher salary or bonus, perhaps, to incentivize engineering talent to move around?
 

Vaporizer

Member
Apr 4, 2015
137
30
66
To me it seems that the SMT and shutdown of unused ALUs will be a killer feature in mobile graphics, because wasting power on unused resources will be prevented. I also assume that desktop power consumption will improve tremendously as well, e.g. while surfing.
 

Saylick

Diamond Member
Sep 10, 2012
At first I was skeptical about your line of thought. It does make sense to file a patent you are not using if you want to prevent a competitor from using it. But AFAIK NV and AMD have a cross-license agreement, so sooner or later NV can use it once it's published.

The question remains: if you really have a great invention which can't be easily reverse engineered, why patent it at all when you have a cross-license agreement with your competitor? You only give him the idea, and he can easily copy it. Would it even be that easy to detect this invention in running silicon and thereby help the competitor implement it? Sometimes trade secrets have their advantages (see Coke).

I think you bring up a good point here, but I think Silverforce hit the nail on the head. Here are my thoughts:

1) Nvidia is going to figure out at some point what is going on with Polaris, or rather GCN 4, and start to develop its own method of GPU-SMT or ALU-level power-gating/turbo. It's only a matter of time before this happens, in my opinion. The method they use to discover how GCN 4 operates could be a combination of inside sources, reading AMD's press slides and internal white papers, or maybe even buying a few GCN 4 chips themselves and doing some die investigation. Either way, it is my understanding that nVidia will at some point figure out what is going on, even if only conceptually.
2) If we presume that Nvidia will eventually understand the implications of GCN 4's architectural advantages, then a patent which, at the end of the day, only describes at a high level what the innovation does, but not necessarily how it is implemented, leaves Nvidia no better off, even if I were to personally hand JHH a copy of the patent itself.
3) So that leaves AMD with two options: A) don't file the patent and hope that Nvidia doesn't implement a fundamentally similar innovation, or B) file the patent, fully knowing that if you don't, Nvidia eventually might, and then enjoy having an extra patent under your belt along with all the benefits of owning a patent your competitor does not.

I think it makes more sense to go with Option B.
 

antihelten

Golden Member
Feb 2, 2012
That rumor comes from a questionable interpretation of someone's LinkedIn profile (which has since been deleted). The profile claimed the author was working on the "Greenland" project, and that this GPU would have 4096 shaders. It's pure speculation that the GPU formerly known as Greenland is now going to be Vega 10 (the bigger of the two). It could be Vega 11 (the smaller chip). It could be a project that will never be publicly released for whatever reason. It could even be the iGPU for the massive "Zeppelin" server APU we've heard rumors about.

I'm fairly certain that Greenland has been rumored as the flagship GPU of the Arctic Islands series since before we even knew of the Polaris/Vega codenames, so it would seem perfectly reasonable to assume that it is Vega 10, imho.

A Vega chip with 4096 shaders should be able to do quite a bit better than that. Remember, Fiji is severely bottlenecked in many games by its weak front-end, which is no better than Hawaii's. Raja Koduri admitted that this was a trade-off to fit it in the reticle limit on 28nm. A Vega chip won't have that problem, and should be able to actually make use of its 4096 shaders. Right now, Fury X is only ~16% faster than R9 390X at 1080p, despite having 45% more shaders. Even at 4K, it's only ~25% faster. If a better front end and improved scheduler let a 4096-shader card actually be 45% better than Hawaii, that alone would be a substantial improvement.

You're absolutely right that I forgot to account for the front-end problems of Fiji in my estimates. As you mentioned Fiji ought to be about 45% faster than Hawaii, yet it is only 15-25% faster depending upon resolution. Thus we're looking at a 15-25% performance deficit.

Take these 15-25% and combine them with the 30-35% I was predicting previously, and you end up with 50-70%, which would be within range of the 60-70% of the GP100 (not quite pwning like Silverforce put it, but at least matching).

To me this would actually indicate that a 4096-shader Vega 10 is quite likely. Sure, AMD could make something bigger, but if there's one thing Nvidia taught us this last generation, it is that trickling out a larger number of moderate improvements instead of a small number of large improvements is much better for business, since you get to sell to the same consumer 2-3 times vs. just once. In other words, AMD will want to make sure that they are competitive with Nvidia, but not much more, so that they have room for an upgrade down the road.
 

airfathaaaaa

Senior member
Feb 12, 2016
Nice. Gonna take some time to digest that.

This is basically the new tech for Polaris, as they are talking about a new pre-fetch, wavefront optimizations and enhanced command processors for thread scheduling etc.

AMD-Polaris-5.jpg


http://patents.justia.com/patent/20160085551

"HETEROGENEOUS FUNCTION UNIT DISPATCH IN A GRAPHICS PROCESSING UNIT"



^ SMT for SIMD units. The wording is quite clear, and when you examine it in detail, it is actually multi-threading the SIMDs to target individual scalar and vector ALUs with separate threads.
Isn't this http://patents.justia.com/patent/20160055033 the same thing with different wording? (They are actually describing what the previous patent does, but at a software level?)
 

Head1985

Golden Member
Jul 8, 2014
I'm fairly certain that Greenland has been rumored as the flagship GPU of the Arctic Islands series since before we even knew of the Polaris/Vega codenames, so it would seem perfectly reasonable to assume that it is Vega 10, imho.



You're absolutely right that I forgot to account for the front-end problems of Fiji in my estimates. As you mentioned Fiji ought to be about 45% faster than Hawaii, yet it is only 15-25% faster depending upon resolution. Thus we're looking at a 15-25% performance deficit.

Take these 15-25% and combine them with the 30-35% I was predicting previously, and you end up with 50-70%, which would be within range of the 60-70% of the GP100 (not quite pwning like Silverforce put it, but at least matching).

To me this would actually indicate that a 4096-shader Vega 10 is quite likely. Sure, AMD could make something bigger, but if there's one thing Nvidia taught us this last generation, it is that trickling out a larger number of moderate improvements instead of a small number of large improvements is much better for business, since you get to sell to the same consumer 2-3 times vs. just once. In other words, AMD will want to make sure that they are competitive with Nvidia, but not much more, so that they have room for an upgrade down the road.
Fiji at 14nm and with a better front-end would be only Tahiti-sized. AMD could pull off a Hawaii-sized SKU with around 5,120 SPs.
 

Mahigan

Senior member
Aug 22, 2015
573
0
0
To be fair, ALUs could already support the concurrent execution of threads. Each CU can execute 40 concurrent wavefronts. That means 40 concurrent threads per ALU.

What's really different is the clock gating and overclocking on a per ALU basis.
 

AtenRa

Lifer
Feb 2, 2009
14,003
3,362
136

If I'm not mistaken, each CU has 4x SIMD units and each SIMD has 1x ALU. Each ALU can have up to 16x threads.

That makes 4x SIMDs x (1x ALU x 16 threads) = 64 threads per CU.
 
Feb 19, 2009

Each CU has 1 scalar unit and 4x SIMDs. Each SIMD has 16 integer/float vector ALUs (i.e. it is 16 lanes wide). Hence the 64 SPs per CU.

This is the current GCN (all versions):

UHMmBIL.jpg


This is the new GCN:

onf4yQR.jpg


There's more to this change; it takes some time to get your head around it.

My understanding of it: there will be many more CUs with fewer total ALUs/SPs per CU, going from 64 down to 14. But they're arranged as 2-wide, 4-wide and 8-wide units, and the scheduler will assign the best-fitting workloads to those SIMDs.

Reading it several times again, what AMD is saying is that the current Compute Unit's 4x SIMD-16 arrangement leads to inefficiencies: the workload is dynamic and may not require all 16 lanes, but the ALUs are still running, resulting in wasted power and die space.
 

Mahigan

Senior member
Aug 22, 2015
The register space allows for 40 concurrent wavefronts per CU. A wavefront is 64 threads wide. A CU is comprised of 4x16 wide vector ALUs for a total of 64 ALUs per CU.

That's 40x64 = 2,560 concurrent threads per CU.
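The occupancy arithmetic above, spelled out (figures as stated in the thread for GCN 1-3; "concurrent" here means resident in the register file, while only one 16-thread slice per SIMD actually executes per cycle):

```python
# GCN 1-3 per-CU occupancy math, as given in the post above.
WAVEFRONTS_PER_CU = 40       # limited by register/wave-slot space
THREADS_PER_WAVEFRONT = 64
SIMDS_PER_CU = 4
LANES_PER_SIMD = 16

resident_threads = WAVEFRONTS_PER_CU * THREADS_PER_WAVEFRONT  # in flight
executing_lanes = SIMDS_PER_CU * LANES_PER_SIMD               # per cycle
```

So 2,560 threads are resident per CU, but only 64 lanes do math in any given cycle; the rest are there to hide latency.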
 

Mahigan

Senior member
Aug 22, 2015
That's for GCN 1-3, of course. GCN 4.0 changes this, as per Silverforce's comments.
 

AtenRa

Lifer
Feb 2, 2009
Yeah, my bad, I didn't express it correctly.

Each CU has 4x SIMDs

Each SIMD has 16 ALUs (1 Op x 16 Threads)

If you start to use different SIMD count per CU, you will lose Throughput but you'll gain in latency and energy consumption.

This may be better for VR due to lower latency but you will need more CUs to get the same Throughput as the standard 4x SIMD x16 Threads configuration.
 
Feb 19, 2009
To be fair, ALUs could already support the concurrent execution of threads. Each CU can execute 40 concurrent wavefronts. That means 40 concurrent threads per ALU.

What's really different is the clock gating and overclocking on a per ALU basis.

Does it process those threads (do the maths operations) at the same time, or rather does it hold those threads to be fired on consecutive wavefronts, as in a thread-queue system?

Say you have 8 ALUs/SPs per SIMD and schedule a 32-thread wavefront: it does not mean each SP processes 4 threads at the same time, but rather over 4 cycles... that's my understanding.

Correct me if I have misunderstood it. :)
 

Mahigan

Senior member
Aug 22, 2015

All 4 are processed at the same time in lock-step fashion. It takes 4 cycles to fill them up, but afterwards they're executed at the same time until the kernel terminates.
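The issue pattern Mahigan describes can be sketched like this: a 64-thread wavefront on a 16-lane SIMD is filled over 4 consecutive cycles, one 16-thread slice per cycle, after which the lanes run in lock step.

```python
# Slice a wavefront into the per-cycle groups a 16-lane SIMD would issue.

def issue_slices(wavefront=64, lanes=16):
    """Thread IDs issued on each cycle while the wavefront fills."""
    return [list(range(c * lanes, min((c + 1) * lanes, wavefront)))
            for c in range((wavefront + lanes - 1) // lanes)]

slices = issue_slices()
# 4 slices of 16 threads each for the standard 64-thread wavefront
```

This is only the fill pattern; the lock-step execution after the fill, and the latency-hiding from other resident wavefronts, are not modelled here.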
 
Feb 19, 2009
If you start to use different SIMD count per CU, you will lose Throughput but you'll gain in latency and energy consumption.

This may be better for VR due to lower latency but you will need more CUs to get the same Throughput as the standard 4x SIMD x16 Threads configuration.

If you use 2-, 4-, and 8-wide SIMDs per CU, and you have a scheduler able to distribute workloads suitable for each SIMD type, you gain a lot of efficiency in terms of SP utilization.

There needs to be a high CU count to offset the lower ALU count per CU. But it results in more work done per ALU, as well as better perf/W and perf/mm2, as there is much less ALU waste.
 

AtenRa

Lifer
Feb 2, 2009
Guys, an ALU is a single compute element. It can compute data from a single thread ONLY.

Each SIMD has 16x ALUs, so you can only compute data from 16 concurrent threads per cycle (per SIMD).

What this patent says is that there are wavefronts smaller than 16 threads (such as 2x or 4x). If you use a single 16-thread SIMD, you lose efficiency because 12x ALUs will do nothing but stay awake (consuming power while doing nothing).
 

Saylick

Diamond Member
Sep 10, 2012
Each CU has 1 scalar unit, 4x SIMDs. Each SIMD has 16 wide integer/float vector ALUs. Hence the 64 SP per CU.

This is the current GCN (all versions):

/snip

This is the new GCN:

/snip

There's more to this change; it takes some time to get your head around it.

My understanding of it: there will be many more CUs with fewer total ALUs/SPs per CU, going from 64 down to 14. But they're arranged as 2-wide, 4-wide and 8-wide units, and the scheduler will assign the best-fitting workloads to those SIMDs.

Reading it several times again, what AMD is saying is that the current Compute Unit's 4x SIMD-16 arrangement leads to inefficiencies: the workload is dynamic and may not require all 16 lanes, but the ALUs are still running, resulting in wasted power and die space.

Exactly.

Correct me if I am mistaken, but in theory, by having granularity all the way down to a 2-thread-wide vector SIMD (technically, it's inclusive of most thread widths due to the "binary-esque" 1-2-4-8 breakdown when you factor in the scalar ALUs), there should only be a few corner cases where you couldn't "nicely" break down a workload into some combination of 2-, 4-, or 8-thread vector operations occurring over a few cycles.

If my logic is sound, then when a 12-thread-wide operation comes in, in the past you would have to use a 16-wide SIMD unit and leave 4 of those 16 ALUs unused during that cycle. With GCN 4, you could either A) use a 4-wide and an 8-wide to hit your 12-wide target, or B) issue onto an 8-wide in cycle 1 and then issue the remainder onto a 4-wide in cycle 2, or vice versa. I suppose as long as an even-numbered thread-wide operation comes in, GCN 4 should be able to tackle it without wasting ALUs. The only drawback is that it now might have to be issued over 2 cycles instead of 1, but so long as there is enough work coming in, you should be able to populate most, if not all, of the ALUs in a given cycle.

Now, I don't see why an odd-numbered thread-wide vector operation can't come in (if someone knows that this is untrue, please feel free to weigh in), but I would hope that the fact that AMD doubled the scalar ALUs per CU means they are hedging against this scenario, so as to ensure that the vector ALUs are almost always working on even-numbered thread-wide work where you get a "perfect" breakdown. If you think about it, the scalar-to-vector ALU ratio has gone up significantly (from 1:64 to 2:14).
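The breakdown logic discussed above can be sketched as a greedy decomposition into the 8/4/2-wide vector units, with any odd remainder falling to the scalar ALUs. The largest-first policy is my assumption, and a real scheduler would also have to account for which units are free in a given cycle.

```python
# Decompose an op's thread width into the available vector unit widths,
# largest first; a leftover odd lane is handed to a scalar ALU.

def decompose(width, vector_units=(8, 4, 2)):
    parts = []
    for u in vector_units:           # assumes descending order
        while width >= u:
            parts.append(u)
            width -= u
    parts.extend([1] * width)        # odd remainder -> scalar ALUs
    return parts

# the 12-wide example above maps to an 8-wide plus a 4-wide: no idle lanes
```

Under this scheme every even width up to 14 decomposes with zero waste, and odd widths waste nothing either as long as a scalar ALU is free, which matches the intuition about why AMD would double the scalar ALU count.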