Typically, many server applications are extremely throughput-oriented, from what I have seen.
Some are, some aren't. If VM response times and overprovisioning of hardware are on the table, then CMT runs into the same intrinsic disadvantages that you encounter in consumer workloads (more or less).
Certainly at this point many fields are much more threaded than they were 8 years ago when Bulldozer launched.
It is, but with the core wars you basically have the same problem. Yes, it's more likely that a 4M CMT chip will see all its resources used today . . . but AMD has been producing 8c SMT chips for over two years. Would an 8M chip on the same process from the same design team ever be as flexible in terms of resource allocation as Zen2 is today or Zen3 will be tomorrow? Likely not. When you run what is essentially an 8t workload with light SIMD on 8c/16t Zen2, you are leaving maybe 25-30% of your execution resources idle. A hypothetical 8M/16t XV in the same scenario is leaving somewhere around 45% of its execution resources idle. I remember testing that on my old Steamroller. Going to 1 thread per module resulted in ridiculous losses of performance.
If we get to the point where software developers start pushing out tons of software that demands more thread-level parallelism than CPUs can realistically provide, then maybe CMT will make more sense. That isn't happening right now.
Also, Bulldozer did support AVX too; XOP just added instructions that had no equivalent in AVX at the time BD was designed (including FMA4, of course, on which Intel clearly sold AMD down the river by doing a 180 and going with FMA3).
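For context, the FMA4 vs. FMA3 split is purely about operand encoding, not the math. A toy sketch of the two forms (register names are just illustrative):

```python
# Toy model of the two FMA encodings; the arithmetic itself is identical.

def fma4(a, b, c):
    # FMA4 (Bulldozer): vfmaddps xmm0, xmm1, xmm2, xmm3
    # Four operands: the destination is distinct from every source,
    # so no input value is destroyed.
    return a * b + c

def fma3(regs, d, s2, s3):
    # FMA3 (Haswell): vfmadd213ps xmm1, xmm2, xmm3
    # Three operands: the destination doubles as a source and is clobbered.
    regs[d] = regs[s2] * regs[d] + regs[s3]

regs = {"xmm1": 2.0, "xmm2": 3.0, "xmm3": 4.0}
fma3(regs, "xmm1", "xmm2", "xmm3")
print(regs["xmm1"])  # 3.0*2.0 + 4.0 = 10.0, and xmm1's old value is gone
```

Same result either way; FMA3 just forces the compiler to burn a register move whenever it needs to preserve the clobbered input.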
XOP was just faster than 128-bit AVX overall, I think, when properly implemented.
Take a look here:
Specifically, observe the 4.2 GHz 3930k time (233.251s) vs a 4.21 GHz FX-8350 (267.329s). Both are 128-bit SIMD implementations. Those 4 PD modules are very close to 6 Sandy cores. A hypothetical 6M PD @ ~4.2GHz would have turned in a time of around 178s, which, interestingly enough, is really darn close to some of the 4930k numbers on that list (which supported AVX256, but not AVX2). AMD's 2013 chip was dangerously close to Intel's 2011 chip in a very narrow set of circumstances. The potential was there. AMD was limited by process and software support.
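The 6M figure above is just straight linear scaling by module count at fixed clocks, which is optimistic but fine for a ballpark:

```python
# Sanity check of the linear-scaling extrapolation above.
# Assumes the workload scales perfectly with module count at a
# fixed ~4.2 GHz, which is optimistic but fine for a rough estimate.

fx8350_time = 267.329        # seconds, measured on 4 PD modules
modules_actual = 4
modules_hypothetical = 6

# Perfect scaling: runtime is inversely proportional to module count.
projected = fx8350_time * modules_actual / modules_hypothetical
print(f"Hypothetical 6M PD @ ~4.2 GHz: {projected:.0f}s")  # ~178s
```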
In a hypothetical alternate universe where GF iterated quickly and successfully on advanced SOI nodes and software developers supported XOP, XOP2, etc., AMD would have done much better with their con cores.
In our universe, nobody much supported FMA4 + XOP and GF wrecked AMD with dated process nodes. 2013's Haswell with AVX2 left XOP in the dust. The one narrow circumstance where XOP enabled AMD to sort-of reach parity with Intel in fp performance swung wildly in Intel's favor. The rest is history.
I think AVX2 must have covered most of the remaining instructions, if not all of them, considering XOP is deprecated now. At least, that is the impression I got of why some apps used XOP at the time.
AVX2, as implemented in Haswell, annihilated anything that ever supported XOP. The difference is just night and day. If AMD ever expects to go back to CMT and tries to use SIMD to shore up an otherwise-weak fp unit, they've got to have something better than XOP under their belt. It would be interesting to see (in one of those alternate universes) what a CMT CPU that supported SVE2 would be like.