I think this is what I remembered:
Still leaves a lot of open questions. For one, are NPUs deprecated next generation in all Zen 6 + RDNA 5 based products, or only later? If it applies to mobile too, how will AMD address the concerns regarding time to first token (execution latency) and power efficiency? Getting an inferior solution in terms of battery life on a brand-new product vs. last gen is just unacceptable.
Are we talking about customizations to the ML hardware and cores in RDNA 5 to effectively emulate an "NPU mode" to save power? That could mean very fine-grained power gating, architectural changes to the ML hardware, special modes of operation, and in general major changes to the cache/memory hierarchy and data locality. Clearly spitballing here.
This patent might help increase ML performance:
https://patents.google.com/patent/US20240264942A1
Sounds a bit like compute-in-memory (CIM), but as of now there is no confirmation of whether it will actually be implemented, and if so, in which product lineups.
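For intuition on why a CIM-style approach could boost ML performance per watt: inference energy is often dominated by moving weights from DRAM rather than by the math itself, so keeping compute in or next to the memory arrays mostly saves data-movement energy. A rough back-of-the-envelope sketch (all per-operation energy numbers are made-up illustrative values, not AMD figures or anything from the patent):

```python
# Back-of-the-envelope: energy of one matrix-vector multiply (MxN weights).
# All per-operation energies below are ASSUMED illustrative values,
# not measurements of any real product.

M, N = 4096, 4096            # weight matrix dimensions
macs = M * N                 # one multiply-accumulate per weight

E_MAC = 1.0                  # pJ per MAC (assumed)
E_DRAM_PER_BYTE = 100.0      # pJ per byte fetched from DRAM (assumed)
BYTES_PER_WEIGHT = 1         # e.g. INT8 weights

# Conventional: every weight is streamed from DRAM to the compute units.
e_conventional = macs * E_MAC + macs * BYTES_PER_WEIGHT * E_DRAM_PER_BYTE

# CIM-style: weights stay in (or next to) the memory arrays, so the
# DRAM streaming term largely disappears; compute energy remains.
e_cim = macs * E_MAC

print(f"conventional: {e_conventional / 1e6:.1f} uJ")
print(f"cim-style:    {e_cim / 1e6:.1f} uJ")
print(f"ratio:        {e_conventional / e_cim:.0f}x")
```

With these assumed numbers the data-movement term dwarfs the compute term, which is the whole argument for pushing MACs toward memory; the real-world gain obviously depends on how much of the weight traffic a given design actually keeps local.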