Bulldozer cores are based on the CIC (Clustered Integer Core) architecture developed by DEC in 1996.
I know you got this from Wikipedia, but I find this claim questionable. The Alpha 21264 physically partitions its two ALU/AGU pairs (which aren't even fully symmetric) across separate register files, but both clusters are still fed by the same scheduler and still share the same load/store queue, DTLBs, L1 dcache, etc. DEC's scheme existed purely to reduce the number of ports on the register files, and the only difference between it and cloning the reg file to double the read ports (a bog-standard technique) is that writes weren't automatically synchronized, so it also doubled the write ports. The downside was a cycle penalty whenever a value crossed domains, but the crossing happened implicitly, which means the two clusters still worked on the same logical thread. Probably the only reason anyone made this claim is that both designs use the term "cluster" for their partitioning.
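As a rough illustration of why cloning a register file helps with read ports but not write ports, here is a toy port-count model. The unit counts and the 2-reads/1-write-per-op figures are assumptions chosen for illustration, not the 21264's actual port counts, and the model covers only plain cloning, not DEC's unsynchronized-write variant.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PortCount:
    reads: int   # read ports on each physical copy of the register file
    writes: int  # write ports on each physical copy of the register file


def unified_ports(units: int, reads_per_unit: int = 2, writes_per_unit: int = 1) -> PortCount:
    """A single register file feeds every execution unit (assumed 2R/1W per unit)."""
    return PortCount(units * reads_per_unit, units * writes_per_unit)


def cloned_ports(units: int, copies: int, reads_per_unit: int = 2, writes_per_unit: int = 1) -> PortCount:
    """The register file is duplicated. Each copy only serves the reads of its own
    units, but every result must be written into every copy to keep them identical,
    so per-copy write ports do not shrink."""
    if units % copies:
        raise ValueError("units must divide evenly among copies")
    return PortCount((units // copies) * reads_per_unit, units * writes_per_unit)


if __name__ == "__main__":
    print(unified_ports(4))           # PortCount(reads=8, writes=4)
    print(cloned_ports(4, copies=2))  # PortCount(reads=4, writes=4)
```

With four hypothetical units, cloning halves the read ports each copy needs, while each copy still needs a write port for every unit that produces results.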
According to Andy Glew, who worked as a CPU architect at both Intel and AMD, the CMT concept came about because he saw SMT on NetBurst thrashing the small dcache. The idea was to replicate the dcache, but since the load/store units, AGUs, and even the ALUs are on the critical path to the dcache, they all needed to be replicated too, along with part of the scheduler. AMD took the idea a little further, replicating the entire integer scheduler (and, on Steamroller, the decoders as well). The point is that the split dcache is the root of CMT, so a design without one probably isn't following the same idea.
