I can imagine the meeting at AMD when the engineers were presenting their great new 2xALU arch, Bulldozer.
三三ᕕ( ᐛ )ᕗ
"Alternately, integer instruction operations can be dispatched to the integer execution units 212 and 214 opportunistically. To illustrate, assume again that two threads T0 and T1 are being processed by the processing pipeline 200. In this example, the instruction dispatch module 210 can dispatch integer instruction operations from the threads T0 and T1 to either of the integer execution units 212 and 214 depending on thread priority, loading, forward progress requirements, and the like."
"In certain instances, the processing pipeline 200 may be processing only a single thread. In this case, the instruction dispatch module 210 can be configured to dispatch integer instruction operations associated with the thread to both integer execution units 212 and 214 based on a predefined or opportunistic dispatch scheme. Alternately, the instruction dispatch module 210 can be configured to dispatch integer instruction operations of the single thread to only one of the integer execution units 212 or 214 and the unused integer execution unit can be shut down or otherwise disabled so as to reduce power consumption. The unused integer execution unit can be disabled by, for example, reducing the power supplied to the circuitry of the integer execution unit, clock-gating the circuitry of the integer execution unit, and the like."
"To illustrate, the front-end unit 202 can fetch and decode instructions associated with a thread such that load instructions later in the program sequence of the thread are prefetched and dispatched to one of the integer execution units for execution while the other integer execution unit is still executing non-memory-access instructions at an earlier point in the program sequence. In this way, memory data will already be prefetched and available in a cache (or already in the process of being prefetched) by the time one of the integer execution units prepares to execute an instruction dependent on the load operation."
"By utilizing multiple integer execution units that share an FPU (or share multiple FPUs) and that share a single pre-processing front-end unit, increase processing bandwidth afforded by multiple execution units can be achieved while reducing or eliminating the design complexity and power consumption attendant with conventional designs that utilize a separate pre-processing front-end for each integer execution unit. Further, because in many instances it is the execution units that result in bottlenecks in processing pipelines, the use of a single shared front-end may introduce little, if any, delay in the processing bandwidth as the fetch, decode, and dispatch operations of the front-end unit often can be performed at a higher instruction-throughput than the instruction-throughput of two or more execution units combined."
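The dispatch options in the quoted passages can be sketched roughly as follows. This is only a toy model, not AMD's logic: the names `IntegerUnit`, `make_units`, `dispatch` and the `single_thread_power_save` flag are made up for illustration, "least-loaded unit" stands in for the patent's "thread priority, loading, forward progress requirements, and the like", and the unit labels 212/214 are just the patent's reference numerals.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class IntegerUnit:
    """Toy stand-in for one integer execution unit (hypothetical model)."""
    name: str
    enabled: bool = True
    queue: list = field(default_factory=list)


def make_units():
    # Two integer units, labeled after the patent's reference numerals.
    return [IntegerUnit("212"), IntegerUnit("214")]


def dispatch(threads, units, single_thread_power_save=False):
    """Dispatch integer ops from `threads` (thread id -> ops in program order).

    Assumed policy: threads are visited round-robin and each op goes to the
    least-loaded enabled unit. If only one thread has work and
    `single_thread_power_save` is set, every unit but the first is disabled,
    which models the clock-gating / power-down option.
    """
    active = [t for t, ops in threads.items() if ops]
    if len(active) == 1 and single_thread_power_save:
        for unit in units[1:]:
            unit.enabled = False

    usable = [u for u in units if u.enabled]
    pending = {t: deque(ops) for t, ops in threads.items()}

    while any(pending.values()):
        for t in pending:
            if pending[t]:
                op = pending[t].popleft()
                # Ties go to the first unit in the list.
                target = min(usable, key=lambda u: len(u.queue))
                target.queue.append((t, op))

    return {u.name: list(u.queue) for u in units}


if __name__ == "__main__":
    # Two threads sharing both units.
    print(dispatch({"T0": ["add", "sub"], "T1": ["shl", "xor"]}, make_units()))
    # One thread spread across both units.
    print(dispatch({"T0": ["add", "sub", "shl"]}, make_units()))
    # One thread with the second unit powered down.
    print(dispatch({"T0": ["add", "sub", "shl"]}, make_units(),
                   single_thread_power_save=True))
```

The single-thread case is the interesting one: the same thread either gets both integer units or gets one while the other is gated off, which is the choice the patent text leaves open.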
AMD was never going to stop at single-threaded CMT; instead it was going to diverge into clustered simultaneous multithreading (CSMT).
Ex: Zen2 -> Adds CSMT -> Zen3
8x ALUs (4 ALUs per cluster), 6x AGUs (3 AGUs per cluster)
Zen2-class split-FPU: 8x FPUs (4 per FPU cluster, potential fused 512-bit mode)
Zen2-class fused-FPU: 4x FPUs (no clusters, pure 512-bit units)
Cluster of Zen-class FPUs: 16x FPUs (4*128-bit, 2 FMUL + 2 FADD) (probably complex enough to allow cluster fusion for single-cycle 256-bit and 512-bit)
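Quick sanity check on those three FPU options, as a back-of-the-envelope sketch. The 128-bit and 512-bit widths come from the lists above. The 256-bit width for the split-FPU case is my assumption (two units fusing into the 512-bit mode). The names `FPU_OPTIONS` and `aggregate_bits` are only for illustration.

```python
# (unit count, assumed width in bits per unit) for each speculated layout.
FPU_OPTIONS = {
    "Zen-class FPU clusters": (16, 128),
    "Zen2-class split-FPU": (8, 256),   # 256-bit per unit is an assumption
    "Zen2-class fused-FPU": (4, 512),
}


def aggregate_bits(count, width_bits):
    """Total FP datapath width if every unit is busy in the same cycle."""
    return count * width_bits


if __name__ == "__main__":
    for name, (count, width) in FPU_OPTIONS.items():
        print(f"{name}: {count} x {width}-bit = {aggregate_bits(count, width)} bits")
```

Under those assumptions all three layouts add up to the same total width, so the difference between them would be in how the units are grouped and fused, not in raw datapath.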