There's little point in going with a large iGPU if you cannot offer advantages in cost and power over a dGPU. The latter is flexible and has a better market perception.
A large iGPU also limits you to pretty much a single market, because nobody would pair it with a dGPU: the combination would just cost more.
When has a large iGPU ever offered a cost advantage over a comparable dGPU? Do you think the manufacturers making them will produce large dies just to sell them at the same price as a regular, boring iGPU?
Nvidia killed both Intel's Iris and Kaby-G efforts by offering discounts on their low end dGPUs. It's that simple.
Sure, dGPUs are in theory more expensive, but GDDR memory modules are produced in much higher volume and have been forever. Whether you go with 3D-stacked cache or HBM, it's lower volume and likely more expensive. GPU boards are built every day, in every segment. New design, low volume = higher cost.
So for just a 2x performance gain over a regular iGPU, you might (potentially, maybe) get a setup with better idle battery life than a dGPU setup, but one that's less flexible and costs about the same or even more.
And that advantage lasts just a single year, because they're impossible to upgrade. More e-waste, and no repairability either!
Designing ONE chip per process node, instead of two or even three separate chips, benefits both companies like AMD and Nvidia, and consumers.
I think if a 256-bit bus comes to mainstream platforms, Nvidia's dGPUs up to the 106-class die will disappear, because they will simply be useless.
I expect that at some point, AMD will be designing three APUs, something like this:
Let's say, for the sake of the discussion, that Strix Point is the first architecture to have three APUs in the lineup.
Small Strix Point: 4P/8E, 8 CU (1024 ALU), 128-bit DDR5/LPDDR5 bus, no Infinity Cache. Sort of AMD's A16 chip.
"Normal" Strix Point: 8P/8E, 16 CU (2048 ALU), 128-bit DDR5/LPDDR5 bus, 32 MB Infinity Cache. Sort of AMD's M2 chip.
Large Strix Point: 8P/16E, 32 CU (4096 ALU), 256-bit LPDDR5 bus, 64 MB Infinity Cache. Sort of AMD's M2 Pro chip.
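To put rough numbers on those buses, here's a back-of-the-envelope sketch. The LPDDR5X-7500 speed grade is my assumption for illustration, not something from the lineup above; real parts span roughly 6400-8533 MT/s:

```python
# Peak DRAM bandwidth for the hypothetical Strix Point buses above.
# Assumption: LPDDR5X at 7500 MT/s (speed grades vary, ~6400-8533 MT/s).
def peak_bandwidth_gbs(bus_bits: int, transfer_rate_mts: int) -> float:
    """Peak bandwidth in GB/s: (bus width in bytes) * MT/s / 1000."""
    return bus_bits / 8 * transfer_rate_mts / 1000

narrow = peak_bandwidth_gbs(128, 7500)  # Small / "Normal" Strix Point
wide = peak_bandwidth_gbs(256, 7500)    # Large Strix Point

print(f"128-bit: {narrow:.0f} GB/s")  # 120 GB/s
print(f"256-bit: {wide:.0f} GB/s")    # 240 GB/s
```

240 GB/s raw is in the same ballpark as a mid-range dGPU's memory subsystem before you even count the Infinity Cache, which is why the 32 CU part isn't automatically starved.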
For the "Normal" and Large Strix Point, AMD would design one type of chiplet combining the DDR5/LPDDR5 memory controller with Infinity Cache, exactly like Navi 31's cache+memory-controller chiplets, which makes the designs interchangeable at the chiplet level. You simply take two more chiplets for the Large Strix Point.
The APU portion of these designs stays monolithic; only the cache is moved to chiplets, and the die size of the Large Strix Point would NOT exceed 250 mm2. An APU like this offers console levels of performance without exceeding a certain thermal threshold, while also allowing new, smaller footprints and form factors for desktop PCs. It would fit basically everywhere OEMs would want to compete against Apple: AIOs, mini PCs, laptops, etc. Three designs spanning mobile to desktop. It saves development costs, saves manufacturing costs, saves time.
A 32 CU design is large, to the degree that we are talking about desktop RX 6800-6800 XT performance levels in an APU. Is it pointless? Hell, no.
And lastly, the price. For something like this to be viable you need real benefits: low manufacturing costs, high scalability, and a large enough market for volume. Is it impossible to imagine what I've described: a 250 mm2 die, with 4 chiplets on a larger, cheaper node, costing $499 (assuming no hyperinflation in the coming years)?
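A quick sanity check on whether a 250 mm2 die leaves room for a $499 product. Every input here is an assumption for illustration (leading-edge wafer prices and defect densities are not public), using the standard gross-dies-per-wafer approximation and a simple Poisson yield model:

```python
import math

# Back-of-the-envelope silicon cost for a 250 mm^2 monolithic APU die.
# ALL inputs are assumptions, not AMD figures.
WAFER_DIAMETER_MM = 300
WAFER_COST_USD = 17_000          # assumed leading-edge wafer price
DIE_AREA_MM2 = 250
DEFECT_DENSITY_PER_CM2 = 0.1     # assumed mature-node defect density

def dies_per_wafer(die_area: float, diameter: float = WAFER_DIAMETER_MM) -> int:
    """Standard gross-dies-per-wafer approximation (accounts for edge loss)."""
    r = diameter / 2
    return int(math.pi * r**2 / die_area - math.pi * diameter / math.sqrt(2 * die_area))

def poisson_yield(die_area_mm2: float, d0_per_cm2: float) -> float:
    """Simple Poisson yield model: y = exp(-A * D0), with A in cm^2."""
    return math.exp(-(die_area_mm2 / 100) * d0_per_cm2)

gross = dies_per_wafer(DIE_AREA_MM2)
good = gross * poisson_yield(DIE_AREA_MM2, DEFECT_DENSITY_PER_CM2)
print(f"~{gross} gross dies, ~{good:.0f} good dies, ~${WAFER_COST_USD / good:.0f}/die")
```

Under these assumptions the big die lands somewhere around $90 of silicon before the cheaper-node chiplets, packaging, and margin, so a $499 product isn't obviously absurd; the point is only that the number is sensitive to the assumed wafer price and yield.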
Mark this as speculation. But it should give you all an idea of where things are going.
Great. So which is more ideal for meeting the bandwidth and power requirements, then?
1. Doubling the bus width to 256-bit and using LPDDR5
2. Using stacked 3D V-Cache as an SLC/Infinity Cache
What will happen is both of them.
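The two levers compose, which is why "both" is the likely answer. A rough sketch of why, with assumed numbers (LPDDR5X-7500 and a 50% cache hit rate are illustrative picks; real hit rates depend on cache size, resolution, and workload):

```python
# Both options attack the same problem: effective bandwidth to the GPU.
# Assumptions: LPDDR5X at 7500 MT/s, 50% cache hit rate (illustrative).
def dram_bw_gbs(bus_bits: int, transfer_rate_mts: int) -> float:
    """Peak DRAM bandwidth in GB/s."""
    return bus_bits / 8 * transfer_rate_mts / 1000

def effective_bw_gbs(dram_bw: float, hit_rate: float) -> float:
    """Cache hits never touch DRAM, so DRAM traffic shrinks by the hit
    rate; effective bandwidth scales by 1 / (1 - hit_rate)."""
    return dram_bw / (1 - hit_rate)

base = dram_bw_gbs(128, 7500)             # 120 GB/s: today's 128-bit bus
wider_bus = dram_bw_gbs(256, 7500)        # 240 GB/s: option 1 alone
with_cache = effective_bw_gbs(base, 0.5)  # 240 GB/s: option 2 alone
both = effective_bw_gbs(wider_bus, 0.5)   # 480 GB/s: both combined
```

Either option alone roughly doubles effective bandwidth; stacking them quadruples it, and the cache additionally cuts DRAM power per byte delivered, since hits stay on-package.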