Originally posted by: Idontcare
...I vaguely remembered coming across a forum thread about this topic over on aces...here it is:
http://aceshardware.freeforums...-amd-sse5-t538-75.html
Sweet info IDC. I checked Dresdenboy's page with the patent info and the CPU diagram. The diagram is really interesting: it shows two integer units, each with two ALUs. The K8/K10 architectures have one integer unit with three ALUs.
So the CPU has a 4-way decoder, like the Core2/i7 architectures. And having two INT units, each with a dedicated L1 data cache, gives us a clue that it was very likely designed with the capability to run separate threads. Seems like a more brute-force approach than Intel's SMT.
Posters Opteron, Dresdenboy, and Hans de Vries appear to be convinced based on the patent sifting they have done that Bulldozer will likely have at least SMT for Integer processing. They are not convinced BD will support SMT for FP.

(Confused: what would that look like to an OS?)
These guys know what they are talking about, and I agree with their assessment that Bulldozer is getting SMT.
For the FPU there has to be SMT support also, otherwise, as you say, the OS can't perceive an extra "complete" core. At least we know that the FPU in Bulldozer has to be 256 bits wide to support AVX, and if that is the case, the FPU can be designed in such a way as to run multiple 64-bit or two 128-bit instructions simultaneously. Just a thought here, so anyone with more knowledge correct me if I'm wrong.
Originally posted by: Hans de Vries
Bulldozer's clustered multiprocessor architecture
Design a 4-way processor which has a pipeline which can be split up into two independent 2-way pipes. In this case both threads have their own set of resources without interfering with each other. Part of the pipeline would not be split. Wide instruction decoding would be alternating for both threads.
Hans does a great job of hypothesizing how a Bulldozer core can run multiple threads. This method would work better than Intel's HT because each thread has its own "independent" INT unit. The concern here, though, is what happens with only a single thread that could issue more than two instructions per cycle: can both INT units be combined to work as one unit? Otherwise the single-thread IPC for BD can be lower than K10 in certain situations.
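To put rough numbers on that concern, here is a toy model of the scheme Hans describes. It is only a sketch, not anything AMD has confirmed: the names `DECODE_WIDTH`, `CLUSTER_WIDTH` and `simulate` are made up for illustration, every instruction is assumed to be an independent single-cycle integer op, and a lone thread is assumed to be stuck on one 2-wide cluster.

```python
DECODE_WIDTH = 4   # shared 4-way decoder (assumed from the diagram discussion)
CLUSTER_WIDTH = 2  # two ALUs per integer unit (assumed)

def simulate(instr_counts, cycles):
    """Return ops retired per thread after `cycles` cycles.

    instr_counts[i] is the number of independent integer ops waiting
    in thread i. Each thread is pinned to its own integer cluster.
    """
    pending = list(instr_counts)      # not yet decoded
    queues = [0] * len(pending)       # decoded, waiting to issue
    retired = [0] * len(pending)
    for cycle in range(cycles):
        # The wide decoder serves one thread per cycle, alternating.
        t = cycle % len(pending)
        decoded = min(DECODE_WIDTH, pending[t])
        pending[t] -= decoded
        queues[t] += decoded
        # Every cluster issues up to two ops from its own queue.
        for i in range(len(pending)):
            issued = min(CLUSTER_WIDTH, queues[i])
            queues[i] -= issued
            retired[i] += issued
    return retired

two_threads = simulate([1000, 1000], 100)
one_thread = simulate([1000], 100)
print(sum(two_threads) / 100, sum(one_thread) / 100)
```

In this model two threads together get close to 4 ops per cycle, while a single thread tops out at 2 even though the decoder could feed it 4. That is below the three ALUs a K8/K10 core can throw at one thread, unless the two clusters can somehow be ganged.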
The 128 bit SSE/FP units could be modified partly in connection with the read/write ports. There was some improvement but not that much when AMD almost doubled the SSE2/FP hardware going from 64 bit units in K8 to 128 bit units in the K10.
There is lots of efficiency to be gained by using two K8-like SSE/FP units which can operate independently in 2-way mode and which can operate together as a single 128 bit unit in 4-way mode. Other similar tricks can be beneficial as well.
As Hans mentioned here, the FPU could be split to run (smaller) independent instructions, or ganged to run one wide instruction (e.g. 2x 128-bit or 1x 256-bit). Some circuitry would have to be added to support this of course. Very interesting stuff.
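A quick way to picture the split/ganged idea is the sketch below. It is purely hypothetical: `fp_cycles` is an invented helper, the FPU is modeled as two 128-bit halves, every op is assumed to take one cycle, and ops are assumed to issue strictly in order.

```python
def fp_cycles(ops):
    """Count cycles to issue `ops`, a list of op widths in bits (128 or 256).

    Assumed model: a 256-bit op takes both 128-bit halves for one cycle,
    while two adjacent 128-bit ops can share a cycle, one per half.
    """
    cycles = 0
    i = 0
    while i < len(ops):
        if ops[i] == 256:
            i += 1                      # ganged mode: both halves busy
        elif i + 1 < len(ops) and ops[i + 1] == 128:
            i += 2                      # split mode: pair two narrow ops
        else:
            i += 1                      # lone 128-bit op, other half idle
        cycles += 1
    return cycles

print(fp_cycles([128, 128, 128, 128]))  # four narrow ops
print(fp_cycles([256, 256]))            # two wide ops
```

Under these assumptions both streams move 512 bits in the same number of cycles, which is the appeal of the design: the same hardware serves two threads doing 128-bit work or one thread doing 256-bit AVX work.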