I wasn't thinking about the "can't execute" problem, but more of the "shouldn't go there" situation because it would be slow.
Both are valid problems though... and as your example shows, big ones.
AMD's approach of using the same core architecture gets around the "can't execute" problem, but doesn't get around the "shouldn't go there" problem.
Without knowledge of which threads belong on which tier of cores, scheduling decisions go wrong and performance suffers. This is particularly true of the LPE cores, an architectural direction I believe both AMD Zen 6 and Intel Nova Lake are pursuing.
You could mostly solve the "shouldn't go there" problem by having the CPU feature flags lie: a core might support a sluggish AVX512 implementation, but when queried it claims no AVX512 support at all. Software would then select a more efficient non-AVX512 code path. And in the corner case where a process gets migrated mid-sequence from a P core with good AVX512 performance to an E/LPE core with wimpy AVX512 performance, the AVX512 sequence still completes correctly, just more slowly than it otherwise would have.
There are alternate architectural decisions to consider as well. Look at what Apple did with SME2. Ever since they created it as the Apple-only "AMX" instruction group, it has been implemented not per core but per cluster. They considered it important to have, but not important enough that every core needs its own full-sized AMX unit. From the standpoint of the instruction stream there is no difference, and indeed there would be no way to tell, unless you tried to have two P cores execute AMX instructions at once. (I'm not sure what happens in that case; I assume the second core generates some sort of exception handled similarly to a process waiting on I/O, and the scheduler requeues that process to run when the AMX unit is free.)
There's no reason small cores couldn't support AVX512 at the same speed as big cores while remaining area efficient - if your cluster of small cores all shared one (or more) full-sized AVX512 execution resources. When they encounter AVX512 instructions they can run them at full speed, but small cores aren't likely to be scheduled for the kind of heavy number crunching that grinds away at long sequences of AVX512 instructions (at least not if your scheduler is doing its job), so sharing should work fine unless you're trying to "win" at Cinebench.