I still can't find where it says that the scalar unit needs 2 cycles to process an op. The ALUs inside the SIMD core need 4 cycles (dispatch, decode, execute and retire), and a SIMD CU can retire 256 threads per 4 cycles.
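A quick sanity check of the 256-threads-per-4-cycles figure, as a rough sketch. It assumes a CU holds 4 SIMDs of 16 ALUs each; the SIMD count is an assumption and is not stated above, and the function name is made up for illustration.

```python
# Assumption: 4 SIMDs per CU (not stated in the post); 16 ALUs per SIMD is from the post.
SIMDS_PER_CU = 4
ALUS_PER_SIMD = 16


def threads_retired(cycles, simds=SIMDS_PER_CU, alus=ALUS_PER_SIMD):
    """Threads a CU can retire in `cycles` cycles if every ALU retires one thread per cycle."""
    return simds * alus * cycles


print(threads_retired(4))  # 256
```

Under that assumption the numbers line up: 64 ALUs retiring one thread per cycle gives 256 threads every 4 cycles.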
I know the scalar unit can dispatch one op per cycle, but I haven't seen how many cycles it needs before the op retires, so I'm assuming it also takes 4 cycles.
The only difference I see between the scalar ALU and those within the SIMD core is that the scalar ALU is natively 64-bit; it can process ordinary int and float operations but also special functions.
Now, if you want to use 2 scalar units to process 2 threads, it will not take fewer cycles, but it will use less energy compared to doing it on the SIMD core (16 ALUs).
Also note that power gating will ADD latency, since you need one or more extra cycles to close or open the ALUs under the power gate.
For example, say you have a power gate over half of the 16 ALUs inside each SIMD core, so you can close/open 8 ALUs per cycle.
You start by processing 16 threads and then you only have 8 threads. So you close 8 ALUs, but you need an extra cycle to close them before you can process the 8 threads. Then you may have 16 threads again, so you need another cycle to open the 8 ALUs you closed before in order to process those 16 threads.
Well, that is if you only have one 16-ALU SIMD available. If you have multiple SIMDs, you may have a lot of SIMDs with 8 power-gated ALUs closed and other SIMDs with all 16 ALUs working, and so on.
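The 16 → 8 → 16 example above can be sketched as a toy cycle counter for a single SIMD. This is only an illustration under simplifying assumptions: each batch of threads is treated as taking one cycle, and every open or close of the gated half costs one extra cycle. The function and parameter names are hypothetical and do not come from any real tool.

```python
def total_cycles(thread_counts, alus=16, gated_alus=8, switch_cost=1):
    """Count cycles for one SIMD whose `gated_alus` ALUs sit behind a power gate.

    Toy model (assumption, not real hardware timing): each batch takes one
    cycle, and opening or closing the gate costs `switch_cost` extra cycles.
    """
    cycles = 0
    gated = False  # start with all ALUs powered
    for threads in thread_counts:
        if threads > alus:
            raise ValueError("batch does not fit on one SIMD")
        # Close the gate only when the batch fits on the ungated ALUs.
        want_gated = threads <= alus - gated_alus
        if want_gated != gated:
            cycles += switch_cost  # extra cycle to open/close the gate
            gated = want_gated
        cycles += 1  # process the batch
    return cycles


print(total_cycles([16, 8, 16]))   # 5: three batches plus two gate switches
print(total_cycles([16, 16, 16]))  # 3: no switching needed
```

In this model the same three batches cost 5 cycles with gating versus 3 without, which is the added latency described above.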
Like I said, I am not a dev and do not have much knowledge on this subject. That is why I placed a ? mark at the end of my statement about a scalar unit completing a thread in 2 cycles. Regarding power gating, a new patent came out that revolves around dynamic gating as opposed to static gating. What you described is static gating?
Dynamic Medium Grain Clock Gating
As discussed above, in conventional approaches, clocking of all SIMD units in a shader complex is either enabled or disabled simultaneously. In many applications, not all SIMDs are assigned work. However, conventional approaches continue to actively provide clocking signals to such SIMDs. This approach increases power consumption of a graphics processing unit and is inefficient. Conventional approaches can include static clock gating for shader complex blocks in which, when a request is initiated by a SPI, clocks of shader complex blocks are turned-on, one by one, with a di/dt (i.e., rate of change of current) avoidance count delay. Once started, the clocks keep clocking for the entire shader complex even if there is no work for many blocks inside the shader complex. In other words, only a few SIMDs are active at any given time. Once work is completed by the shader complex, the clocks are shut-off automatically using the di/dt avoidance count delay. Thus, in conventional approaches, clock gating is static in nature, and treats the shader complex as a single system.
In contrast to conventional approaches, embodiments of the invention achieve dynamic grain (e.g., dynamic medium grain) clock gating of individual SIMDs in a shader complex. Switching power is reduced by shutting down clock trees to unused logic, and by providing a clock on demand mechanism (e.g., a true clock on demand mechanism). In this way, clock gating can be enhanced to save switching power for a duration of time when SIMDs are idle (or assigned no work).
Embodiments of the present invention also include dynamic control of clocks to each SIMD in a shader complex. Each SIMD is treated as shader complex sub-system that manages its own clocks. Dynamic control for each block/tile in an SIMD is also provided. Clocking can start before actual work arrives at SIMDs and can stay enabled until all the work has been completed by the SIMDs.
Dynamic medium grain clock gating, according to the embodiments, causes negligible performance impact to the graphics processing unit. Embodiments of the present invention can also be used to control power of SIMDs by power gating switches and thus save leakage power of SIMDs.
http://patents.justia.com/patent/9311102
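A rough way to picture the static vs. dynamic difference in the excerpt is to count how many SIMD-cycles are clocked for a given workload. This is a simplified sketch: it ignores the di/dt ramp delays and the early clock start the patent mentions, and the schedule and function name are invented for illustration.

```python
def clocked_simd_cycles(busy, dynamic):
    """Count SIMD-cycles that receive a clock.

    busy[t][s] is True when SIMD s has work in cycle t.
    Static gating (simplified): every SIMD is clocked while any SIMD has work.
    Dynamic gating (simplified): a SIMD is clocked only while it has work.
    """
    total = 0
    for cycle in busy:
        if dynamic:
            total += sum(1 for has_work in cycle if has_work)
        elif any(cycle):
            total += len(cycle)
    return total


# Hypothetical workload: 4 SIMDs over 3 cycles, mostly idle.
schedule = [
    [True, False, False, False],
    [True, True, False, False],
    [True, False, False, False],
]

print(clocked_simd_cycles(schedule, dynamic=False))  # 12
print(clocked_simd_cycles(schedule, dynamic=True))   # 4
```

With this made-up schedule, static gating clocks all 4 SIMDs for 3 cycles (12 SIMD-cycles), while dynamic gating clocks only the 4 SIMD-cycles that actually carry work.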
You probably understand this better than I do, so I would like your thoughts on this.