Bulldozer module (CMT core)
L1-I 64KB 2-Way
4 x86 decoders
2x 2ALU, 2AGU (CMT)
2x L1-D 16KB 4-Way (CMT)
FPU 2x 128-bit (SMT)
Bulldozer module = CMT module
Multithreaded Branch Predictor -> Vertical Multithreaded AMD64 Fetch/1x 64 KB 2-way L1i -> Per-core Instruction Byte Buffer -> Vertical Multithreaded AMD64 Decode/Dispatch -> Per-core Macro-op Retire Queue -> Per-core Macro-op Dispatch -> Per-core Macro-op to Micro-op Scheduler -> Per-core 2ALU+2AGU or per-unit 2xFMAC+2xPALU -> Per-core LSU -> Per-core 16KB 4-way L1d -> 4KB, 4-way WCC and 64B 4-entry WCB -> L2 cache.
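The chain above can be tallied by how each stage is labeled. This is a rough sketch: the stage names are shortened from the line above, and two readings are assumptions, not something the chain states outright: "Multithreaded"/"Vertical Multithreaded" is read as shared between the two cores, and "per-unit" for the FMAC/PALU pipes is read as one shared FPU per module. Stages the chain does not label (WCC/WCB, L2) are left as "unlabeled". PIPELINE is just an illustrative name.

```python
from collections import Counter

# (stage, sharing) pairs in the order given in the chain above.
# "shared" for the multithreaded stages and the per-unit FPU is an assumed reading.
PIPELINE = [
    ("Branch predictor", "shared"),
    ("Fetch / 64 KB 2-way L1i", "shared"),
    ("Instruction byte buffer", "per-core"),
    ("Decode/dispatch", "shared"),
    ("Macro-op retire queue", "per-core"),
    ("Macro-op dispatch", "per-core"),
    ("Macro-op to micro-op scheduler", "per-core"),
    ("2 ALU + 2 AGU", "per-core"),
    ("2x FMAC + 2x PALU", "shared"),
    ("LSU", "per-core"),
    ("16 KB 4-way L1d", "per-core"),
    ("4 KB 4-way WCC / 64 B 4-entry WCB", "unlabeled"),
    ("L2 cache", "unlabeled"),
]

# Count how many stages fall under each label.
counts = Counter(sharing for _, sharing in PIPELINE)

for stage, sharing in PIPELINE:
    print(f"{sharing:>9}  {stage}")
print(dict(counts))
```

Under those assumptions the front end and the FPU are the shared parts, and everything from the instruction byte buffer through L1d (apart from decode) is duplicated per core.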
The FPU is 4x 128-bit
PADD can be executed on MAL[P2 | P3] => 2-cycle latency
PMUL, PMADD can be executed on MMA[P0] => 5-cycle latency (PMUL) / 4-cycle latency (PMADD)
FMUL, FADD, FMADD can be executed on FMA[P0 | P1] => 6-cycle latency (FMUL, FADD, FMADD)
Four 128-bit pipes: 2x FMACs (one of which also handles packed multiply and multiply-add), 2x PALUs.
In a non-CMT design, only the P0/P3 pipes would be present.
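The port and latency figures above fit in a small lookup table. A minimal sketch: FPU_OPS, chain_latency and ops_on_port are made-up names, and the chain estimate is a simplification that just sums latencies for back-to-back dependent ops, ignoring bypass delays and port contention.

```python
# Unit, ports and latency per op, copied from the notes above.
FPU_OPS = {
    "PADD":  {"unit": "MAL", "ports": ("P2", "P3"), "latency": 2},
    "PMUL":  {"unit": "MMA", "ports": ("P0",),      "latency": 5},
    "PMADD": {"unit": "MMA", "ports": ("P0",),      "latency": 4},
    "FMUL":  {"unit": "FMA", "ports": ("P0", "P1"), "latency": 6},
    "FADD":  {"unit": "FMA", "ports": ("P0", "P1"), "latency": 6},
    "FMADD": {"unit": "FMA", "ports": ("P0", "P1"), "latency": 6},
}

def chain_latency(ops):
    """Sum of latencies for a serial dependency chain (simplified model)."""
    return sum(FPU_OPS[op]["latency"] for op in ops)

def ops_on_port(port):
    """Ops from the table that can issue on the given port, sorted by name."""
    return sorted(op for op, info in FPU_OPS.items() if port in info["ports"])

print(chain_latency(["FMUL", "FADD"]))  # 12: separate multiply then add
print(chain_latency(["FMADD"]))         # 6: same work fused
print(ops_on_port("P0"))
print(ops_on_port("P3"))
```

In this model a fused FMADD halves the latency of a dependent FMUL + FADD pair, and P0 is the busiest port since it takes both the FMA ops and the packed multiplies.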
Husky -> 84-entry OoO window + 42-entry FPU scheduler + 3x 128-bit pipes + 120-entry PRF <- 32nm PDSOI
Bulldozer -> 256-entry OoO window (from both cores) + 64-entry FPU scheduler + 4x 128-bit pipes + 160-entry PRF <- 32nm PDSOI
Bobcat -> 56-entry OoO window + 18-entry FPU scheduler + 2x 64-bit pipes + 88-entry PRF <- 40nm bulk
Jaguar -> 64-entry OoO window + 18-entry FPU scheduler + 2x 128-bit pipes + 72-entry PRF <- 28nm bulk
Zen -> 192-entry OoO window + 36-entry FPU scheduler + 4x 128-bit pipes + 160-entry PRF <- 14nm bulk
Zen2 is the only one where the width of the rename stage is labeled:
224-entry OoO window + 64-entry NSQ + 36-entry FPU scheduler + 4x 256-bit pipes + 160-entry PRF <- 7nm bulk
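For a side-by-side comparison, the list above can be put into records. A sketch only: the Core class and its field names are invented for illustration, the numbers are copied as given, and "total FP datapath bits" is simply pipes times pipe width, not a throughput claim. Bulldozer's 256-entry window counts both cores of a module, so it is not directly comparable per core. Zen2's 64-entry NSQ is left out of the record.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Core:
    name: str
    ooo_window: int   # entries (Bulldozer: both cores combined)
    fpu_sched: int    # FPU scheduler entries
    pipes: int        # number of FPU pipes
    pipe_bits: int    # width of each pipe
    prf: int          # PRF entries
    process: str

    @property
    def fp_datapath_bits(self) -> int:
        """Pipes times pipe width; a crude size measure, not throughput."""
        return self.pipes * self.pipe_bits

# Figures copied from the list above.
CORES = [
    Core("Husky",      84, 42, 3, 128, 120, "32nm PDSOI"),
    Core("Bulldozer", 256, 64, 4, 128, 160, "32nm PDSOI"),
    Core("Bobcat",     56, 18, 2,  64,  88, "40nm bulk"),
    Core("Jaguar",     64, 18, 2, 128,  72, "28nm bulk"),
    Core("Zen",       192, 36, 4, 128, 160, "14nm bulk"),
    Core("Zen2",      224, 36, 4, 256, 160, "7nm bulk"),
]

BY_NAME = {c.name: c for c in CORES}

for c in CORES:
    print(f"{c.name:<10} {c.fp_datapath_bits:>5} bits  window={c.ooo_window}  ({c.process})")
```

By this measure Bulldozer and Zen both come out at 512 bits of FP datapath, and Zen2 doubles that by widening each pipe to 256 bits.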