These ideas were most likely baked into Polaris/Vega during design development, before the patent was formally filed. AMD's engineers must have been fleshing out the concepts summarized in the patent well ahead of its filing date, because it would not make sense for AMD to file the patent before deciding to incorporate it into a design. I would imagine that if you, as an inventor, know a certain design concept is feasible, you start building it and take the time to work out the kinks before you officially submit your design to the patent office.
Now, with regard to the patent, it sounds like these new "4th Gen" CUs will be built in one of two ways. Method A: a mix of ALUs with different thread widths (e.g. 2, 4, and 8 threads wide). Method B: a 16-thread-wide SIMD unit (a la current GCN) with the ability to power gate any number of lanes, so that if, for example, a 4-thread operation comes down the pipe, 12 of those ALUs can be shut down and the saved power can be used to increase the clock speed of the remaining 4 ALUs.
With that said, both methods should increase the total throughput of the chip. Method A sounds better for raising total ALU utilization, since it keeps as many ALUs as possible busy under a mixed workload, whereas Method B relies on power gating unused ALUs and raising clock speeds to make up for the inefficiency. Given a hypothetical chip using either of these methods, with Method A you may have (just guessing now) 90%+ utilization, whereas with Method B you may only have 70% utilization but can boost clocks 25% thanks to the power saved by shutting down unused ALUs. That lets Method B nearly catch up to Method A in total throughput (0.70 x 1.25 = 0.875 effective versus 0.90), but at the cost of running the ALUs at a higher clock.
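To make that comparison concrete, here is a quick back-of-the-envelope sketch. It assumes throughput is simply utilization times clock scale, and it uses my guessed 90% / 70% / 25% figures from above, not anything from the patent. The function name is made up for illustration.

```python
def relative_throughput(utilization, clock_scale=1.0):
    """Effective throughput as a fraction of peak at base clock.

    Assumption: throughput scales linearly with both ALU utilization
    and clock speed (no memory or front-end bottlenecks).
    """
    return utilization * clock_scale


# Guessed numbers from the post, not measured data.
method_a = relative_throughput(0.90)        # mixed-width SIMDs, base clock
method_b = relative_throughput(0.70, 1.25)  # power-gated lanes, 25% boost

# Clock scale Method B would need to fully match Method A.
break_even_clock = 0.90 / 0.70

print(f"Method A: {method_a:.3f}")                           # 0.900
print(f"Method B: {method_b:.3f}")                           # 0.875
print(f"Break-even clock scale: {break_even_clock:.3f}")     # 1.286
```

So with these guesses, Method B lands just short of Method A and would need roughly a 29% boost to pull even.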
My understanding is that it is better to keep clocks as low as possible while engaging as many ALUs as possible; that way, you keep the voltage low and thus keep the power low, which is why Method A sounds like the better approach. Now, I would not be surprised if there are corner cases where a 2, 4, 8-thread breakdown (along with 2 scalar ALUs to fill in the gaps) does not allow for a 90%+ utilization rate. Since there is no reason why Method A and Method B can't both be baked into the same chip, you can make up for this small amount of inefficiency by upping the clocks.
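Here is a rough sketch of why low clocks matter, using the textbook dynamic power relation P ~ C * V^2 * f. The big assumptions are mine: voltage has to rise roughly in proportion to clock speed (real voltage/frequency curves differ), leakage is ignored, and gated or idle ALUs draw nothing. The 90% / 70% / 25% figures are the guesses from earlier, and the function name is hypothetical.

```python
def relative_dynamic_power(active_fraction, clock_scale, voltage_exponent=1.0):
    """Relative dynamic power of the ALU array, from P ~ C * V^2 * f.

    Assumption: voltage scales as clock_scale ** voltage_exponent.
    With the default exponent of 1.0, power grows with the cube of clock.
    """
    voltage_scale = clock_scale ** voltage_exponent
    return active_fraction * voltage_scale ** 2 * clock_scale


# Method A: 90% of ALUs busy at base clock.
power_a = relative_dynamic_power(0.90, 1.00)
perf_a = 0.90 * 1.00

# Method B: 70% of ALUs busy, boosted 25%.
power_b = relative_dynamic_power(0.70, 1.25)
perf_b = 0.70 * 1.25

print(f"Method A: perf {perf_a:.3f}, power {power_a:.3f}, "
      f"perf/W {perf_a / power_a:.2f}")
print(f"Method B: perf {perf_b:.3f}, power {power_b:.3f}, "
      f"perf/W {perf_b / power_b:.2f}")
```

Under these assumptions Method B delivers slightly less throughput while drawing roughly 50% more ALU power, which is the intuition behind preferring more ALUs at lower clocks. A flatter voltage curve (a smaller `voltage_exponent`) narrows the gap.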
In the past, a GCN CU consisted of (1) scalar ALU and (4) 16-thread wide SIMD units (64 vector ALUs per CU). Going by what the patent suggests, the new CU would consist of (2) scalar ALUs with (1) 2-thread wide SIMD unit, (1) 4-thread wide SIMD unit, and (1) 8-thread wide SIMD unit a la Method A (14 vector ALUs per CU). This means that GCN 4 would need a lot more CUs than GCN 1-3 to match the same number of SPs as before (roughly 4.5x as many, since 64 / 14 is about 4.6), but the good news is that you essentially have higher "IPC", because for a given number of SPs you get higher SP utilization with GCN 4. By factoring in the benefits of Method B (aka "turbo mode"), you can increase "IPC" even further by turbo'ing the ALUs which are actually doing useful work.
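The CU arithmetic above can be sketched like this. It is a toy model, not how the patent or real hardware schedules work: I am assuming an operation with N active threads gets placed on whichever combination of SIMD units in one CU covers it with the fewest lanes, and I am ignoring the scalar ALUs. The 2048 SP target and all function names are hypothetical.

```python
from itertools import combinations
from math import ceil

GCN_SIMD_WIDTHS = (16, 16, 16, 16)  # GCN 1-3 CU as described above
MIXED_SIMD_WIDTHS = (2, 4, 8)       # Method A CU as described above


def lanes_per_cu(widths):
    """Total vector ALUs (SPs) in one CU."""
    return sum(widths)


def cus_needed(total_sps, widths):
    """CUs required to reach at least total_sps vector ALUs."""
    return ceil(total_sps / lanes_per_cu(widths))


def best_fit_width(active_threads, widths):
    """Fewest lanes of any subset of SIMD units that covers the threads.

    Assumes active_threads does not exceed the lanes in one CU.
    """
    fits = [
        sum(combo)
        for r in range(1, len(widths) + 1)
        for combo in combinations(widths, r)
        if sum(combo) >= active_threads
    ]
    return min(fits)


def lane_utilization(active_threads, widths):
    """Fraction of the occupied lanes doing useful work."""
    return active_threads / best_fit_width(active_threads, widths)


print(lanes_per_cu(GCN_SIMD_WIDTHS), lanes_per_cu(MIXED_SIMD_WIDTHS))  # 64 14
print(cus_needed(2048, GCN_SIMD_WIDTHS))    # 32
print(cus_needed(2048, MIXED_SIMD_WIDTHS))  # 147

for threads in (2, 4, 5, 8, 12):
    gcn = lane_utilization(threads, GCN_SIMD_WIDTHS)
    mixed = lane_utilization(threads, MIXED_SIMD_WIDTHS)
    print(f"{threads:2d} threads: GCN {gcn:.0%}, mixed {mixed:.0%}")
```

In this toy model a 4-thread operation occupies a whole 16-wide SIMD at 25% lane utilization on the old CU, but fits the 4-wide unit exactly on the mixed CU, which is the kind of gain I mean by higher "IPC".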
I would not be surprised if this patent is the "secret sauce" that AMD is using when they claim their 30/70 split on the 2.5x perf/W increase.
Edit:
I gave this some more thought and realized how big the implications of this patent are when you combine it with something like Async Shading (async compute, AC for short). In the past, if you didn't have AC and the workload was light, you'd turbo the entire chip until thermal or power limits were reached. Now, if you have power gating at the ALU level, you can turbo only the active ALUs and achieve higher performance than before, because you don't have to power up unused portions of the chip. No change in how PowerTune works here. With AC, you can fill the unused ALUs with meaningful work and bring the clocks back down to improve perf/W, but the crucial difference now is that with the increased granularity Method A provides, you get a "double whammy" effect where both Method A and AC increase utilization. If the graphics pipeline and the compute pipelines together still can't fill all the ALUs, you can power down the unused ALUs and turbo the active ones for even more throughput.
Beyond that, all that you need to do is make sure no other portion of the chip holds back ALU throughput (i.e. make sure the front end and ROPs are fast enough) and you should see pretty good utilization and/or maximum throughput in a given thermal/power envelope most of the time. This is where the Primitive Discard Accelerator, improved Command Processor, Geometry Processor, L2 cache, and Memory Controller come in.