The theoretical limit for HT/SMT-2 scaling is +100%, since no more than 2 threads can run per core on Hyperthreading machines - at best you get 200% of single-thread throughput. It might be a fun exercise to produce code that actually gets there. ^^
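Just as a sketch of what that exercise could look like (assuming Linux, g++ and that logical CPUs 0 and 1 are SMT siblings of the same physical core - check /sys/devices/system/cpu/cpu0/topology/thread_siblings_list or lstopo): pin two complementary, latency-bound loops to the two siblings and compare the combined run against each loop alone. If they barely compete for units, the combined time stays near the slower single run, i.e. close to the +100% limit.

```cpp
// Sketch only. Assumes Linux/glibc and that logical CPUs 0 and 1 share a core.
// Build: g++ -O2 -pthread smt_friendly.cpp -o smt_friendly
#include <pthread.h>
#include <sched.h>
#include <chrono>
#include <cstdio>
#include <thread>

static void pin_to_cpu(int cpu) {
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(cpu, &set);
    pthread_setaffinity_np(pthread_self(), sizeof(set), &set);
}

// Serial integer multiply-add chain: latency-bound, leaves most issue slots free.
static void int_chain(long iters) {
    volatile long x = 1;                     // volatile keeps the loop from being optimized away
    for (long i = 0; i < iters; ++i) x = x * 3 + 1;
}

// Serial FP divide chain: bound by the divider, barely touches the integer units.
static void fp_chain(long iters) {
    volatile double d = 1e9;
    for (long i = 0; i < iters; ++i) d = d / 1.0000001;
}

// Run one function pinned to one logical CPU and return its wall-clock time.
static double run_alone(void (*fn)(long), int cpu, long iters) {
    auto t0 = std::chrono::steady_clock::now();
    std::thread t([=] { pin_to_cpu(cpu); fn(iters); });
    t.join();
    return std::chrono::duration<double>(std::chrono::steady_clock::now() - t0).count();
}

int main() {
    const long N = 200000000L;
    double a = run_alone(int_chain, 0, N);
    double b = run_alone(fp_chain, 0, N);

    // Now both at once, on the two SMT siblings of the same core.
    auto t0 = std::chrono::steady_clock::now();
    std::thread t1([=] { pin_to_cpu(0); int_chain(N); });
    std::thread t2([=] { pin_to_cpu(1); fp_chain(N); });
    t1.join(); t2.join();
    double both = std::chrono::duration<double>(std::chrono::steady_clock::now() - t0).count();

    // Near the theoretical limit, 'together' approaches max(a, b) instead of a + b.
    std::printf("int alone: %.2fs  fp alone: %.2fs  together: %.2fs\n", a, b, both);
    return 0;
}
```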
The other extreme you cite is a power virus - or close to it: Prime95, which I used. It got slowed down by nearly 50% because the less prioritized threads sitting on the same core occupy the available resources. That's the problem with the missing prioritization of Hyperthreads/logical cores. ATM I don't know where this equalization happens, maybe already during fetch (in alternating cycles?).

That's also what BD did in its shared front-end units, but that part isn't SMT, it's fine-grained MT. So once instructions of the second thread enter the OoO section, the first thread can't do anything against them. Not even trying to occupy the whole core (impossible), or just the AGUs for example, would help, because that thread wouldn't get a sufficient supply of instructions to keep playing this game.
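For the opposite case, a hedged sketch in the same spirit as Prime95 (not its actual code, just a dense AVX FMA kernel): run one copy alone, then one copy on each SMT sibling. On cores where a single thread can already keep the FMA pipes busy, each copy typically drops to roughly half of its solo speed, so the two threads just split the machine. Again assumes Linux, an FMA-capable CPU and siblings on logical CPUs 0 and 1.

```cpp
// Sketch only. Build: g++ -O2 -mfma -pthread smt_hostile.cpp -o smt_hostile
#include <immintrin.h>
#include <pthread.h>
#include <sched.h>
#include <chrono>
#include <cstdio>
#include <thread>

static void pin_to_cpu(int cpu) {
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(cpu, &set);
    pthread_setaffinity_np(pthread_self(), sizeof(set), &set);
}

// Throughput-bound 256b FMA kernel: several independent accumulators so one
// thread alone already keeps the FMA ports close to fully busy.
static double fma_kernel(long iters) {
    __m256d b = _mm256_set1_pd(1.0000000001), c = _mm256_set1_pd(1e-9);
    __m256d a0 = c, a1 = c, a2 = c, a3 = c, a4 = c, a5 = c, a6 = c, a7 = c;
    for (long i = 0; i < iters; ++i) {
        a0 = _mm256_fmadd_pd(a0, b, c);
        a1 = _mm256_fmadd_pd(a1, b, c);
        a2 = _mm256_fmadd_pd(a2, b, c);
        a3 = _mm256_fmadd_pd(a3, b, c);
        a4 = _mm256_fmadd_pd(a4, b, c);
        a5 = _mm256_fmadd_pd(a5, b, c);
        a6 = _mm256_fmadd_pd(a6, b, c);
        a7 = _mm256_fmadd_pd(a7, b, c);
    }
    double out[4];
    _mm256_storeu_pd(out, _mm256_add_pd(_mm256_add_pd(a0, a1), _mm256_add_pd(a2, a3)));
    double s = out[0] + out[1] + out[2] + out[3];
    _mm256_storeu_pd(out, _mm256_add_pd(_mm256_add_pd(a4, a5), _mm256_add_pd(a6, a7)));
    return s + out[0] + out[1] + out[2] + out[3];   // keep the result live
}

static double run_fma_on(int cpu, long iters) {
    auto t0 = std::chrono::steady_clock::now();
    std::thread t([=] { pin_to_cpu(cpu); volatile double s = fma_kernel(iters); (void)s; });
    t.join();
    return std::chrono::duration<double>(std::chrono::steady_clock::now() - t0).count();
}

int main() {
    const long N = 100000000L;
    double alone = run_fma_on(0, N);   // one copy, core otherwise idle

    // Two identical copies on the SMT siblings: both fight for the same FMA pipes.
    auto t0 = std::chrono::steady_clock::now();
    std::thread t1([=] { pin_to_cpu(0); volatile double s = fma_kernel(N); (void)s; });
    std::thread t2([=] { pin_to_cpu(1); volatile double s = fma_kernel(N); (void)s; });
    t1.join(); t2.join();
    double shared = std::chrono::duration<double>(std::chrono::steady_clock::now() - t0).count();

    // With one thread already saturating the FP pipes, 'shared' tends toward 2 * 'alone',
    // i.e. each copy runs at roughly half of its solo speed.
    std::printf("alone: %.2fs  two copies on siblings: %.2fs\n", alone, shared);
    return 0;
}
```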
I see what you mean. But among the millions of instructions executed during one thread's timeslice there are so many different types (simple ALU, IMUL, IDIV, AGU, loads, stores, FP SIMD, Int SIMD, FMUL, FADD, shuffle, etc.) that there will always be some room left for instructions from the other thread - especially with 8 or more execution units.
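To make that concrete, here is a made-up, ordinary hot loop annotated with the unit classes it roughly maps to; even this single snippet spreads its work over branches, simple ALU ops, loads/AGUs and FP math, and still can't fill all 8+ ports every cycle, which is exactly the slack a sibling thread can use.

```cpp
// Illustrative only: a hypothetical hot loop, annotated with the execution-unit
// classes each statement roughly maps to on a modern x86 core.
#include <cstdio>
#include <vector>

static float score(const float* w, const int* idx, int n) {
    float s = 0.0f;
    for (int i = 0; i < n; ++i) {   // simple ALU: counter, compare, branch
        int j = idx[i] & 1023;      // AGU + load, then simple ALU (AND)
        s += w[j] * 1.5f;           // AGU + load, FP multiply + add (possibly fused)
    }
    return s;                       // plenty of issue slots per cycle stay unused
}

int main() {
    std::vector<float> w(1024, 0.5f);
    std::vector<int> idx(1 << 20);
    for (size_t i = 0; i < idx.size(); ++i) idx[i] = static_cast<int>(i * 7);
    std::printf("score = %f\n", score(w.data(), idx.data(), static_cast<int>(idx.size())));
    return 0;
}
```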
Agreed, the realistic range for SMT scaling is more like that. As often said, there will be ST code (like Blender with 1T) that might show even higher IPC on Zen, and there will be other code with lower IPC. That's natural given the wide variety of instruction mixes; 256b AVX alone will be enough to show a difference.
You are right, increased memory fetch latency due to OC'ing is another factor: the DRAM latency in nanoseconds stays roughly the same, so a higher core clock means more stall cycles per miss (e.g. ~70 ns is ~280 core cycles at 4 GHz but ~315 cycles at 4.5 GHz).