Always makes you wonder what could have been if AMD had just kept shrinking and tweaking and adding cores to the existing K architecture in Thuban instead of dumping all kinds of R&D into the Dozerpile.
My understanding is that K-projects were always going for actual CMT2.
Early K8 designs were all clustered architectures with 2x K6-IV execution cores.
Early K10 was derived from the earlier clustered architecture implementation (K8, by David Witt/Jim Keller). It adds CMT2 (both clusters can run different threads) and removes the integrated FPU (K6's Multimedia/Floating Point Unit) from each of the execution clusters.
The K8/K9 clustered patents have a K6-esque execution core (Int/Mem/FPU), whereas K10 ditched the duplicated FPU.
This is the K10 CMT patent:

Notice that in Bulldozer's early days there was only ever ONE retire queue.
The architectural elements changed from clusters to CPUs at the point where the retirement logic went from shared to dedicated.
---
Meet the Bulldozer genius
Moore was the first Bulldozer chief architect and then became a senior fellow on another project:
https://ieeexplore.ieee.org/document/4771772
He is also the one who coined "Cluster-based Multithreading" in 2005.
Going through it, though, by 2009 Butler/Moore weren't singing the praises of a brand new Cluster-based Multithreading architecture, but rather of a brand new Conjoined Cores architecture.
"Chuck Moore, chief technical officer of AMD's technology development group, said a new chip, code-named Bulldozer, “is designed from the bottom up to take advantage of low-power technologies.” =>> Each chip has conjoined cores, the big management portions of the chip, which share some real estate and architecture."
In this case, they aren't describing it as Cluster-based Multithreading. From the IEEE paper:
This new micro-architecture contains two processor cores that implement chip-level multi-threading (CMT).
----
The simpler what-if to wonder about is what would have happened if AMD had actually implemented Cluster-based Multithreading as intended.

Linear scaling => Clustered

Singular core => Clustered
AMD's internal numbers for multithreaded scaling:
SMT = ~1.3x scaling, +5% area
CMP = ~1.7x scaling, +100% area
Cluster-based Multithreading = ~1.8x scaling, +50% area
The 1.8x scaling can also be applied to monolithic gains: Zen's ~+52% * 0.8 => ~+41.6% single-threaded improvement just by actually doing clusters instead of cores.
Roadmap-wise, the removal of Cluster-based Multithreading was set sometime before November 2008, since by 2009 they were already saying it was tightly linked cores.
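To put the three scaling/area figures on the same footing, here is a rough Python sketch that just redoes the arithmetic. The two ratios (throughput per unit area, and second-thread gain per unit of added area) are my own choice of comparison, not something from AMD's numbers, and the function names are made up for illustration.

```python
# (multithreaded scaling, extra area as a fraction) as quoted in the post.
DESIGNS = {
    "SMT": (1.3, 0.05),
    "CMP": (1.7, 1.00),
    "Cluster-based Multithreading": (1.8, 0.50),
}

def throughput_per_area(scaling, extra_area):
    """Two-thread throughput divided by total area (single core = 1.0)."""
    return scaling / (1.0 + extra_area)

def gain_per_added_area(scaling, extra_area):
    """Extra throughput from the second thread per unit of added area."""
    return (scaling - 1.0) / extra_area

def scaled_uplift(uplift_percent, factor):
    """The post's own estimate: apply a 0.8 factor to a quoted uplift."""
    return uplift_percent * factor

if __name__ == "__main__":
    for name, (scaling, extra_area) in DESIGNS.items():
        print(
            f"{name}: "
            f"{throughput_per_area(scaling, extra_area):.2f} throughput/area, "
            f"{gain_per_added_area(scaling, extra_area):.2f} gain per added area"
        )
    # Zen's ~+52% times 0.8, as written in the post.
    print(f"Scaled uplift: ~+{scaled_uplift(52.0, 0.8):.1f}%")
```

On these figures SMT is by far the cheapest way to get a second thread, while clustering lands well ahead of CMP on both ratios, which is the point of the comparison.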
If it had launched as intended with the correct threading/architectural layout, AMD would have been at least two years ahead of Haswell's 4-ALU implementation. AMD would also have been able to dodge most, if not all, of the negatives that popped up.
Cluster-based Multithreading was also well researched:
"Note that the cycle-time of these clustered architectures is much smaller than that of the centralized SMT. Indeed, Palacharla and Jouppi [12] estimate that the cycle-time for an 8-issue processor will be twice as long as a 4-issue processor when using 0.18um technology. In the light of their observations, clustered SMT, with two 4-issue clusters, may have a frequency that is twice higher than centralized SMT." - A Clustered Approach to Multithreaded Processors - 1998
"Clustering is an architectural technique that allows the design of wide superscalar processors without sacrificing cycle time, but at the cost of longer communication latencies. Simultaneous multithreading architectures effectively tolerate instruction latency, but put even more pressure on timing-critical processor resources. This paper shows that the synergistic combination of the two techniques minimizes the IPC impact of the clustered architecture, and even permits more aggressive clustering of the processor than is possible with a single-threaded processor." - Clustered Multithreaded Architectures – Pursuing Both IPC and Cycle Time - 2004
"The corresponding clustered multi-threaded (CMT) architecture is highly competitive with un-realizable SMT processors, achieving 90-96% of the cycle-level performance of a partitioned SMT (which improves on the base SMT), while dissipating about 50% of its energy." - Partitioning Multi-Threaded Processors with a Large Number of Threads - 2005
Post-RISC architectures were all clustering up as well:
Initial design was simple CMT4 of 3-wide VLIW(Int+Mem+FP) clusters with 5 temporal threads
Meanwhile... CMT2 in the FPU is just sitting there passive aggressive like.

Shared Retire, Shared Rename, Independent Repeated Scheduler/Execution Units/Physical Register Files, Shared Load/Store.
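A minimal sketch of that shared/repeated split, purely as a bookkeeping illustration. The `LAYOUT` table and `instance_counts` helper are hypothetical names of mine; the only thing taken from the list above is which blocks are shared and which are repeated per cluster.

```python
# Which blocks are shared and which are repeated per cluster,
# following the list above. Names are illustrative only.
LAYOUT = {
    "retire": "shared",
    "rename": "shared",
    "scheduler": "per-cluster",
    "execution units": "per-cluster",
    "physical register file": "per-cluster",
    "load/store": "shared",
}

def instance_counts(n_clusters):
    """Number of copies of each block for a design with n_clusters clusters."""
    return {
        block: (1 if kind == "shared" else n_clusters)
        for block, kind in LAYOUT.items()
    }

if __name__ == "__main__":
    # CMT2: two clusters behind one shared front/back end.
    for block, count in instance_counts(2).items():
        print(f"{block}: {count}")
```

Going from CMT2 to a hypothetical CMT4 only multiplies the per-cluster rows, which is why the area cost stays well under that of duplicating whole cores.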