[Videocardz] AMD Polaris 11 SKU spotted, has 16 Compute Units

Glo. · Apr 11, 2016

tential said:
You think apple would let amd get a new product launch at their own keynote presentation?

If they would announce partnership - yes, they would. Just like they announced partnership with Intel, and Paul Otellini came to stage to announce that. The same happened when Apple asked them to build first custom CPU from Intel - first Macbook Air CPU.

Piroko · Apr 11, 2016

Silverforce11 said:
This is the new GCN:

There's more to this change, takes some time to get head over it.

My understanding of it, there will be much more CUs of less total ALU/SP per CU, going from 64 down to 14. But it's arranged in 2 wide, 4 wide and 8 wide. The scheduler will assign the best workloads for those SIMDs.

You missed this paragraph:

It is noted that while the compute unit 400 is shown with two scalar ALUs 404, one two thread wide vector SIMD unit 406, one four thread wide vector SIMD unit 412, and one eight thread wide vector SIMD unit 418, the compute unit 400 may be constructed with different numbers of the scalar units and the SIMD units without affecting the overall operation of the compute unit 400.
Alternatively, SIMD units 406, 412, and 418 may initially have the same width (e.g., each being an eight thread wide SIMD unit) but may be configured (on a demand basis) to deactivate (e.g., through gating mechanisms, disabling, powering off, etc.) to have different widths (e.g., a two thread wide, a four thread wide, and an eight thread wide SIMD unit, as described above, by deactivating, six, four, and zero, respectively, pipes or ALUs in each unit).

The patent does not rule out a 64 ALUs per CU design.

Mopetar · Apr 11, 2016

tential said:
You think apple would let amd get a new product launch at their own keynote presentation?

Maybe, but only if Apple were getting some kind of timed exclusive. They haven't refreshed their notebook lineup yet and if they could get a 3-month exclusive on AMD's mobile chips it might be worth it simply because the performance/power ratio is going to be so much better than what they've had before.

It's an even sweeter deal if NV isn't going to have mobile chips available in that window either as it would mean that Apple's notebooks are the only ones to offer the newest generation of GPUs. Apple certainly has the cash to make a deal like that.

However, even if we assume all of that is true or close enough to the truth, one would think AMD would go to other events, especially to show off their desktop cards that Apple perhaps aren't as inclined to use.

AtenRa · Apr 11, 2016

Piroko said:
You missed this paragraph:
The patent does not rule out a 64 ALUs per CU design.

Yes, but due to power gating each CU will be significantly larger than the current 4 SIMD x16 ALU design.

Mahigan · Apr 11, 2016

I think that they'll retain the 64 Vector ALU design but add the power gating. That way the CUs can intelligently reconfigure themselves to suit the incoming instructions. Every instruction will be boosted relative to their CU occupancy.

The end result is better compute performance than current GCN designs because current GCN designs are underutilized. GCN is massively underutilized. If you program software to fully utilize GCN, it would crash on Maxwell (causes a driver crash and/or Windows protection fault as it did over at Beyond3D and appears to do for Quantum Break).

Applications like folding@home massively underutilize GCN for example. Their workloads are based on CUDA but ported to OpenCL (32 thread loads).

If you ran those loads on GCN 4.0, they'd likely operate at near twice the performance as they do on GCN1/2/3.

This solution not only brings perf/watt but also helps in cases of under utilization which are, for GCN, the majority of cases.

This is why this patent is a BIG deal.

Vesku · Apr 11, 2016

Yes, sounds like the CUs will get a bit bigger due to power gating and will allow full mix-mode use running different levels of precision in addition to more fine-grained clock boosting.

Not sure if it will be present in Polaris but it would be especially useful in a compute oriented large die Vega.

Mopetar · Apr 11, 2016

Mahigan said:
If you ran those loads on GCN 4.0, they'd likely operate at near twice the performance as they do on GCN1/2/3.

Doesn't that assume linear scaling in terms of overclocking though? Unless the architecture is designed to be very wide with a lower clock that has more room to scale (which it may very well be for a larger chip like Vega) upwards eventually you hit diminishing returns and adding another 10% to the clockspeed requires far more than 10% additional power.

There have been some leeks or other information floating around that puts the shader clock at 1000 MHz for various Polaris parts, which on its own seems rather low, especially when you expect a large bump just from moving to the new process node. Another way to look at is a wider architectural approach that could boost to much, much higher levels without as big of a falloff in efficiency.

However, at the same time the current GCN cards don't overclock all that well, especially when compared to NV's offerings. AMD would have to make some fairly significant changes to get around whatever is the current limiting factor. Perhaps that was accomplished with the changes they've been making already, but it seems like more of a radical shift than one should normally expect.

3DVagabond · Apr 11, 2016

Mopetar said:
However, at the same time the current GCN cards don't overclock all that well, especially when compared to NV's offerings. AMD would have to make some fairly significant changes to get around whatever is the current limiting factor. Perhaps that was accomplished with the changes they've been making already, but it seems like more of a radical shift than one should normally expect.

Isn't it nVidia's boost that gives them the higher clocks. Does the base frequency go much higher?

Silverforce11 · Apr 11, 2016

Saylick said:
Good catch.

If the Scalar ALUs can be used to catch the one-off vector math, then the total ALU count for GCN 4's CU will be 16 ALUs, which allows it to operate 16-thread wide vector math in one cycle if all ALUs are engaged. Thus, (4) "GCN 4" CUs will match (1) "GCN 1-3" CU (64 ALUs total for each).

The current GCN has in a CU/SM block 4x SIMD (16 ALU ea) and 1x Scalar ALU. But they don't count the Scalar. It would be 65 SP. That's why I was hesitant to count the new design, going for a 2x Scalar ALU and 14x total Vector ALU. But the wording suggests the Scalar can handle threads too, especially priority threads that need the lowest latency.

This change is going to take a real knowledgeable dev (@ zlatan ??) to analyzed rather than us forum warriors. I feel we're only scratching the surface.

Slaughterem · Apr 11, 2016

Silverforce11 said:
The current GCN has in a CU/SM block 4x SIMD (16 ALU ea) and 1x Scalar ALU. But they don't count the Scalar. It would be 65 SP. That's why I was hesitant to count the new design, going for a 2x Scalar ALU and 14x total Vector ALU. But the wording suggests the Scalar can handle threads too, especially priority threads that need the lowest latency.

This change is going to take a real knowledgeable dev (@ zlatan ??) to analyzed rather than us forum warriors. I feel we're only scratching the surface.

I agree with your hesitation I am certainly not a dev but (and I always like to use a but), let me throw another thought out, if a 2nd scaler can be an ALU and since it is a 2 cycle as compared to 4 cycle ALU could it do 2 threads in the same time?

AtenRa · Apr 12, 2016

Slaughterem said:
I agree with your hesitation I am certainly not a dev but (and I always like to use a but), let me throw another thought out, if a 2nd scaler can be an ALU and since it is a 2 cycle as compared to 4 cycle ALU could it do 2 threads in the same time?

Can you show me where in GCN says that Scalar can process one op per 2 cycles ??

Tapoer · Apr 12, 2016

Would it make sense for the new CU configuration to be like:
(1+1+2+4+8)+(16)+(16)+(16)
or
(1+1+2+4+8)+(1+1+2+4+8)+(16)+(16)

??

Piroko · Apr 12, 2016

Silverforce11 said:
14x total Vector ALU.

Please don't repeat this, that's not at all what the patent talks about and it only creates confusion.

Reading through the patent I think you make this out to be a much larger thing than it actually is. Boiling it down I came to the following impression:

The design may be modified to include N issue units that share M SIMD units, and those SIMD units may be of different thread widths.

... The active threads are assigned to the SIMD units, in a combination such that work is not wasted...

That could have a couple of implications, but nothing fantastic like magic huge performance gains.

Slaughterem · Apr 12, 2016

AtenRa said:
Can you show me where in GCN says that Scalar can process one op per 2 cycles ??

The issue logic dynamically determines which execution unit to target for a given collection of threads (a wavefront) based on a variety of factors. For instance, if the number of active threads in a wavefront is very small (for example, one or two), the threads may be dispatched to the high-performance scalar unit, where the instruction will complete in only a couple of cycles. This enables threads that are potentially the bottleneck of computation to be executed more quickly and efficiently than would occur on a heavily underutilized 64 element wide vector pipeline.
http://patents.justia.com/patent/20160085551 Figure 4 discription paragraph 13

Slaughterem · Apr 12, 2016

Silverforce11 · Apr 12, 2016

Piroko said:
That could have a couple of implications, but nothing fantastic like magic huge performance gains.

When you reduce waste or inefficiency (increasing SIMD/ALU utilization), you increase performance and perf/w.

When the ALUs can be power gated down if not used, and other ALUs turbo boosted, this leads to increased performance and also perf/w.

Going for a more efficient SIMD layout saves on perf/mm2, so you can have more SIMD/ALUs in a given die area.

These all add up.

The only thing we don't know, what kind of % improvement, how much does utilization improve? How high does it boost it's clocks? I mean it could be a very small figure, like 5% and it will be meh.

Slaughterem · Apr 12, 2016

Tapoer said:
Would it make sense for the new CU configuration to be like:
(1+1+2+4+8)+(16)+(16)+(16)
or
(1+1+2+4+8)+(1+1+2+4+8)+(16)+(16)

??

Once again I am no expert but I believe 4 CU were usually controlled as a group. IMO What combination within that group would be the best secret sauce would need to be tested not only for games but other applications. If you think about it Snapdragon currently uses a mixture of cores 2 high speed and 2 lower speed. It might be wise to have 2 CU with 4 x 16 and 2 CU with 1+1+2+4+8.

AtenRa · Apr 13, 2016

Slaughterem said:
The issue logic dynamically determines which execution unit to target for a given collection of threads (a wavefront) based on a variety of factors. For instance, if the number of active threads in a wavefront is very small (for example, one or two), the threads may be dispatched to the high-performance scalar unit, where the instruction will complete in only a couple of cycles. This enables threads that are potentially the bottleneck of computation to be executed more quickly and efficiently than would occur on a heavily underutilized 64 element wide vector pipeline.
http://patents.justia.com/patent/20160085551 Figure 4 discription paragraph 13

Still cant find where it says that the Scalar needs 2 cycles to be processed. The ALUs inside the SIMD core need 4 cycles (dispatch, decode, execute and retire). And a ~~SIMD~~ CU can retire 256 threads per 4 cycles.
I know the Scalar can dispatch one op per cycle but i havent seen how many cycles it needs until it will retire, so im assuming it also takes 4 cycles.

The only difference i see between the Scalar ALU and those within the SIMD core is that the Scalar is a native 64bit, it can process ordinary Int and Float but also special functions.

Now if you want to use 2 scalar units to process 2 threads it will not do it in less cycles but it will lower the energy compared doing it on the SIMD core (16 ALU).

Also to know that power gating will ADD latency since you need one or more extra cycles to close or open the ALUs under the power gating.

Example, you have power gate over half of the 16 ALUs inside each SIMD core, so you can close/open 8 ALUs per cycle.

You start by processing 16 Threads and then you only have 8 threads. So you close 8 ALUs but you will need an extra cycle to close them and then process them. Then you may have 16 Threads again, so you will need another cycle to open the 8 ALUs you closed before in order to process those 16 Threads.
Well that is if you only have one 16 ALU SIMD available. If you have multiply SIMDs you may have a lot of SIMDs with 8 power gated ALUs closed and other SIMDs with all 16 ALUs working etc etc.

Piroko · Apr 13, 2016

Silverforce11 said:
When you reduce waste or inefficiency (increasing SIMD/ALU utilization), you increase performance and perf/w.

The patent also talks about increasing SIMD/ALU utilization by reconfiguring SIMD units (shutting off ALUs), which might save power but likely won't increase absolute performance compared to a chip that was fully enabled in the first place.

When the ALUs can be power gated down if not used, and other ALUs turbo boosted, this leads to increased performance and also perf/w.

Power consumption increases drastically if you leave the clock sweet spot of Fin FETs, more so than planar. This in turn means that, while absolute power might increase with higher turbo clocks, you may see a reduction in perf/w under that scenario.

Going for a more efficient SIMD layout saves on perf/mm2, so you can have more SIMD/ALUs in a given die area.

These all add up.

The only thing we don't know, what kind of % improvement, how much does utilization improve? How high does it boost it's clocks? I mean it could be a very small figure, like 5% and it will be meh.

The way I read it is that most of the text is talking about how to avoid performance regressions with a heterogenous SIMD layout and not how to improve peak performance. Honestly, the whole patent could also be interpreted as "giving AMD an even finer grained binning method for defective ALUs" which has no implications on % improvements and boost clocks of the top SKU. It would make a lot of sense for console binning though, you could squeeze a couple % more working chips out of the only bin that you can sell.

Slaughterem · Apr 13, 2016

AtenRa said:
Still cant find where it says that the Scalar needs 2 cycles to be processed. The ALUs inside the SIMD core need 4 cycles (dispatch, decode, execute and retire). And a ~~SIMD~~ CU can retire 256 threads per 4 cycles.
I know the Scalar can dispatch one op per cycle but i havent seen how many cycles it needs until it will retire, so im assuming it also takes 4 cycles.

The only difference i see between the Scalar ALU and those within the SIMD core is that the Scalar is a native 64bit, it can process ordinary Int and Float but also special functions.

Now if you want to use 2 scalar units to process 2 threads it will not do it in less cycles but it will lower the energy compared doing it on the SIMD core (16 ALU).

Also to know that power gating will ADD latency since you need one or more extra cycles to close or open the ALUs under the power gating.

Example, you have power gate over half of the 16 ALUs inside each SIMD core, so you can close/open 8 ALUs per cycle.

You start by processing 16 Threads and then you only have 8 threads. So you close 8 ALUs but you will need an extra cycle to close them and then process them. Then you may have 16 Threads again, so you will need another cycle to open the 8 ALUs you closed before in order to process those 16 Threads.
Well that is if you only have one 16 ALU SIMD available. If you have multiply SIMDs you may have a lot of SIMDs with 8 power gated ALUs closed and other SIMDs with all 16 ALUs working etc etc.

Like I said I am not a dev and do not have much knowledge on this subject. That is why I placed a ? mark at the end of my statement of a scaler completing a thread in 2 cycles. In regards to power gating a new patent came out it revolves around dynamic gating as opposed to static gating. What you described is Static gating?
Dynamic Medium Grain Clock Gating

As discussed above, in conventional approaches, clocking of all SIMD units in a shader complex is either enabled or disabled simultaneously. In many applications, not all SIMDs are assigned work. However, conventional approaches continue to actively provide clocking signals to such SIMDs. This approach increases power consumption of a graphics processing unit and is inefficient. Conventional approaches can include static clock gating for shader complex blocks in which, when a request is initiated by a SPI, clocks of shader complex blocks are turned-on, one by one, with a di/dt (i.e., rate of change of current) avoidance count delay. Once started, the clocks keep clocking for the entire shader complex even if there is no work for many blocks inside the shader complex. In other words, only a few SIMDs are active at any given time. Once work is completed by the shader complex, the clocks are shut-off automatically using the di/dt avoidance count delay. Thus, in conventional approaches, clock gating is static in nature, and treats the shader complex as a single system.

In contrast to conventional approaches, embodiments of the invention achieve dynamic grain (e.g., dynamic medium grain) clock gating of individual SIMDs in a shader complex. Switching power is reduced by shutting down clock trees to unused logic, and by providing a clock on demand mechanism (e.g., a true clock on demand mechanism). In this way, clock gating can be enhanced to save switching power for a duration of time when SIMDs are idle (or assigned no work).

Embodiments of the present invention also include dynamic control of clocks to each SIMD in a shader complex. Each SIMD is treated as shader complex sub-system that manages its own clocks. Dynamic control for each block/tile in an SIMD is also provided. Clocking can start before actual work arrives at SIMDs and can stay enabled until all the work has been completed by the SIMDs.

Dynamic medium grain clock gating, according to the embodiments, causes negligible performance impact to the graphics processing unit. Embodiments of the present invention can also be used to control power of SIMDs by power gating switches and thus save leakage power of SIMDs.
http://patents.justia.com/patent/9311102
You probably understand this better than I do, would like your thoughts on this.

monstercameron · Apr 13, 2016

http://vrworld.com/2016/04/13/amd-polaris-10-gpu-beat-competing-pascal/

Silverforce11 · Apr 13, 2016

monstercameron said:
http://vrworld.com/2016/04/13/amd-polaris-10-gpu-beat-competing-pascal/

Their article has nothing to do with the actual URL link haha.

"amd-polaris-10-gpu-beat-competing-pascal"

Are they going with subtle leaks now?

tential · Apr 13, 2016

Ya I got tricked too. Can you add a quote and summarize I hate giving click bait articles revenue

Sent from my C6833 using Tapatalk

antihelten · Apr 14, 2016

tential said:
Ya I got tricked too. Can you add a quote and summarize I hate giving click bait articles revenue

Sent from my C6833 using Tapatalk

All the actual information (true or not) with all the fluff and speculation removed:

While we cannot go into more details in order to protect the source, we can confirm that AMD Polaris 10 engineering samples are varying in clock between 800 and 1050 MHz, depending on the partner

...

Polaris 10 GPU 67DF:C4 Specifications

14nm FinFET, GlobalFoundries

Diffused in USA (New York state)

Assembled in Taiwan

2304 Cores (silicon: 2560)

36 Enabled Core Clusters (silicon: 40)

New GPU Architecture (not GCN 1.5)

256-bit Memory Controller

8GB GDDR5/GDDR5X (when available)

...

Depending on the sample, the memory is clocked at 1.25-1.50 GHz QDR (5000-6000 MHz), which the 256-bit memory bus converts to 160-192 GB/s. This however, probably wont be the final shipping clock. Memory is estimated to reach as high as 1.75-2.0 GHz QDR stock (224-256 GB/s)

...

Today, AMD is tuning clocks above 1.05 GHz, and the final configuration might be as high as 1.15 GHz for the stock clock.

...

It is expected that AMD will launch its Polaris-based family of GPUs on Computex 2016 (May 31-June 4) or E3 2016 (June 14-17)

[Videocardz] AMD Polaris 11 SKU spotted, has 16 Compute Units

Diamond Member

Senior member

Diamond Member

Lifer

Senior member

Diamond Member

Diamond Member

Lifer

Lifer

Member

Lifer

Member

Senior member

Member

Member

Lifer

Member

Lifer

Senior member

Member

Diamond Member

Lifer

Diamond Member

Golden Member