From what I understand, Abwx is implying that, due to their OoOE design, cores end up executing instructions from a thread even after its time slice is over and the CPU is servicing another thread. I find that hard to believe, but I'm already near the limit of my CPU & OS understanding, which is that a thread requires an execution context in order to run.
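Just to make "execution context" concrete, here's roughly the bundle of state I mean, sketched in C (the struct and field names are mine and purely illustrative, not from any real kernel):

```c
#include <stdint.h>

/* Hypothetical per-thread execution context: roughly the state an OS
 * must save on a context switch and restore before the thread can run
 * again. Field names are illustrative, not from any real kernel. */
struct thread_context {
    uint64_t gpr[16];  /* general-purpose registers (rax..r15)     */
    uint64_t rip;      /* instruction pointer: where to resume     */
    uint64_t rsp;      /* stack pointer                            */
    uint64_t rflags;   /* flags register                           */
    uint64_t cr3;      /* root of the thread's page tables, i.e.   */
                       /* which address space its loads/stores see */
    /* plus FPU/SSE/AVX state, segment bases, etc.                 */
};
```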
Yeah, I don't buy Abwx's theory either. In HT-enabled Intel CPUs, that much IS true: a core can be executing instructions from two different threads at once. And I suppose it's possible that switching thread contexts doesn't serialize the core's pipeline, since a full serialization would flush and delay the other HyperThread. I had always assumed that a thread switch serializes the core, but now that I think about it, maybe it doesn't. If it doesn't, then some of that could come into play. But without HyperThreading enabled on the core, you wouldn't have "lingering" instructions from a prior thread executing alongside a newly-scheduled thread.
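For what it's worth, you can actually see two hyperthreads fighting over one core's execution resources from userspace. Here's a rough Linux sketch, entirely my own and not from anything in this thread; the {0,4} sibling pair and {0,1} separate-core pair are assumptions about one particular machine's topology, so check /sys/devices/system/cpu/cpu0/topology/thread_siblings_list and adjust:

```c
/* Pin two busy-loop threads either to HT siblings or to separate
 * physical cores and compare wall-clock time.
 * Build: gcc -O2 -pthread ht_demo.c */
#define _GNU_SOURCE
#include <pthread.h>
#include <sched.h>
#include <stdint.h>
#include <stdio.h>
#include <time.h>

#define ITERS 500000000ULL

static void *spin(void *arg) {
    int cpu = *(int *)arg;
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(cpu, &set);
    pthread_setaffinity_np(pthread_self(), sizeof(set), &set);

    /* Four independent chains, to keep multiple ALU ports busy. */
    uint64_t a0 = 0, a1 = 0, a2 = 0, a3 = 0;
    for (uint64_t i = 0; i < ITERS; i++) {
        a0 += i; a1 ^= i; a2 += i; a3 ^= i;
    }
    volatile uint64_t sink = a0 + a1 + a2 + a3;  /* defeat dead-code elim. */
    (void)sink;
    return NULL;
}

int main(void) {
    int pairs[2][2] = { {0, 4},    /* assumed HT siblings    */
                        {0, 1} };  /* assumed separate cores */
    for (int p = 0; p < 2; p++) {
        struct timespec t0, t1;
        pthread_t a, b;
        clock_gettime(CLOCK_MONOTONIC, &t0);
        pthread_create(&a, NULL, spin, &pairs[p][0]);
        pthread_create(&b, NULL, spin, &pairs[p][1]);
        pthread_join(a, NULL);
        pthread_join(b, NULL);
        clock_gettime(CLOCK_MONOTONIC, &t1);
        printf("CPUs %d+%d: %.2f s\n", pairs[p][0], pairs[p][1],
               (t1.tv_sec - t0.tv_sec) +
               (t1.tv_nsec - t0.tv_nsec) / 1e9);
    }
    return 0;
}
```

If the sibling pairing comes out noticeably slower, that's the two logical threads sharing the physical core's execution ports.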
That may be so, but for me the more important finding is that the CPU execution engine is not the main culprit for the performance loss; otherwise we would have seen the same loss on the HEDT platform, and we do not. Even with WinRAR not scaling past 8 threads, the HEDT chip should still behave somewhat like the mainstream i7, and it does not.
That observation confirms even more strongly, IMO, that the issue is L3 cache size. If it were contention for core execution resources, Haswell-E / HEDT should behave the same way the mainstream Haswell desktop CPUs do, since they use the same cores. The fact that they don't points away from a core resource issue and toward something else that differs between the HEDT design and the mainstream desktop. Besides PCI-E bandwidth, the major differences are L3 cache size and quad-channel RAM. (I had forgotten about the RAM; the extra bandwidth could come into play here as well. But the FX also has only dual-channel RAM, so I feel more confident that the issue is L3 cache size.)
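One way to test the L3 theory directly is a pointer-chasing microbenchmark: chase a randomized chain through working sets of increasing size and watch the per-load latency jump once the set spills out of L3. This is my own sketch, not anything from the article, and the 8 MB vs. 20 MB figures in the comment are just illustrative Haswell-ish sizes:

```c
/* Random pointer chase over growing working sets. Per-load latency
 * steps up as the set outgrows each cache level; the L3 cliff lands
 * somewhere different on an 8 MB mainstream i7 than on a 20 MB
 * Haswell-E (check your own CPU's cache sizes).
 * Build: gcc -O2 chase.c */
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

int main(void) {
    const size_t steps = 5 * 1000 * 1000;
    for (size_t mb = 1; mb <= 64; mb *= 2) {
        size_t n = mb * 1024 * 1024 / sizeof(size_t);
        size_t *next = malloc(n * sizeof(size_t));
        if (!next) return 1;

        /* Sattolo's shuffle: one cycle through all n slots, in an
         * order the hardware prefetchers can't predict. */
        for (size_t i = 0; i < n; i++) next[i] = i;
        for (size_t i = n - 1; i > 0; i--) {
            size_t j = (((size_t)rand() << 16) ^ (size_t)rand()) % i;
            size_t t = next[i]; next[i] = next[j]; next[j] = t;
        }

        struct timespec t0, t1;
        size_t idx = 0;
        clock_gettime(CLOCK_MONOTONIC, &t0);
        for (size_t s = 0; s < steps; s++)
            idx = next[idx];              /* serially dependent loads */
        clock_gettime(CLOCK_MONOTONIC, &t1);

        double ns = ((t1.tv_sec - t0.tv_sec) * 1e9 +
                     (t1.tv_nsec - t0.tv_nsec)) / (double)steps;
        printf("%3zu MB: %5.1f ns/load (idx=%zu)\n", mb, ns, idx);
        free(next);
    }
    return 0;
}
```

If WinRAR's hot working set straddles the mainstream chip's L3 but fits inside the HEDT's, behavior like we saw would be exactly what you'd expect.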
Edit: Thinking about it more, I feel strongly that there has to be at least SOME serialization in the core when switching thread contexts, because otherwise a few in-flight micro-ops from the prior thread could access pages of memory belonging to the newly-scheduled thread, and that would be a security issue at the level of the processor architecture.
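FWIW, on x86 the architecture does back that up: MOV to a control register (CR8 excepted) is a serializing instruction, and a write to CR3 also flushes non-global TLB entries (PCID aside). A hypothetical kernel-side sketch, just to show where that happens in a context switch; the names are mine, this is NOT real OS code, and it only runs in ring 0:

```c
#include <stdint.h>

/* Hypothetical context-switch core. The point is the CR3 write: on
 * x86 it is serializing, so the pipeline drains before it executes,
 * and it flushes non-global TLB entries (ignoring PCID). No in-flight
 * micro-op from the previous thread can reach the next thread's pages
 * through stale translations. */
struct context {
    uint64_t cr3;   /* physical address of the thread's page tables */
    uint64_t rsp;   /* its saved kernel stack pointer                */
};

static void switch_address_space(struct context *next) {
    __asm__ volatile("mov %0, %%cr3"     /* serializing on x86 */
                     :: "r"(next->cr3) : "memory");
    /* ...restore next->rsp and pop its saved registers here... */
}
```

So at least on the OS-driven context-switch path, there IS a hard serialization point.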
Likewise, there would have to be two banks of registers for the memory-access descriptors in the processor, one for each hyperthread. (This is my speculation.) Unless the AGUs and load/store units somehow deal only with physical addresses.