AMD could bring back CMT, except modernized with way more cores, different types of cores, some type of scheduling optimizer, etc. 😉
They never launched the original cluster-based multithreading core in the first place. Bulldozer -> Excavator is a chip-level multithreading architecture, not a cluster-based multithreading one. AMD would not be bringing anything back, but rather launching it for the first time, since the original got replaced by the chip-level multithreading architecture as the production product.
In fact, they technically already shipped the architecture style within Zen3. Zen3 has two FPU scheduler clusters which can execute in single-threaded mode [both schedulers get one thread] or simultaneous-multithreaded mode [each scheduler gets a different thread].
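Those two modes can be sketched as a toy assignment model (purely illustrative; `assign_fpu_clusters` and the hard-coded two-cluster count are my assumptions for the sketch, not AMD's actual arbitration logic):

```python
# Toy model (not AMD's real logic) of a core with two FPU scheduler
# clusters: 1T mode feeds both clusters from the one thread, while SMT
# mode pins each hardware thread to its own cluster.
def assign_fpu_clusters(active_threads):
    """Map scheduler-cluster index -> thread id for a 2-cluster FPU."""
    if len(active_threads) == 1:
        t = active_threads[0]
        return {0: t, 1: t}  # single-threaded: both clusters serve one thread
    if len(active_threads) == 2:
        return {0: active_threads[0], 1: active_threads[1]}  # SMT: one each
    raise ValueError("this sketch only models a 2-way SMT core")

print(assign_fpu_clusters(["t0"]))        # {0: 't0', 1: 't0'}
print(assign_fpu_clusters(["t0", "t1"]))  # {0: 't0', 1: 't1'}
```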
1st Cluster-based core(single-threaded only) = Clustered Integer&FPU, Monolithic Memory(LSU)
::
https://patents.google.com/patent/US6256721B1
2nd Cluster-based core(adds multithreading and has a shared FPU instead of duplicated integrated FPU) = Clustered Integer + Monolithic Memory(LSU) + Monolithic FPU
:: patents are wonky because teams kept shifting. (MGB switched it toward multi-core rather than multi-cluster)
::
https://patents.google.com/patent/US7043626B1 starts here and runs to slightly before mid-2007; if you go beyond mid-2007 it becomes
https://patents.google.com/patent/US7877559B2, which is the chip-level multithreading architecture. From what I can gather, the switch away from cluster-based to chip-level was caused by Microsoft.
:: 2004-1H2007 => cluster-based multithreading
:: 2H2007+ => chip-level multithreading, with Microsoft being involved up to 2009, when whoever it was ditched the collaboration. (Microsoft wanted a cheap and fast x86-64 to swap out Xenon, and they wanted something for early cloud. 2001 - x86 -> 2005 - PowerPC -> 2009 - AMD64 (Bulldozer 2.0) -> 2013 - AMD64 (originally Bulldozer 2.5 (Steamroller)) -> 2017 - AMD64, then a 3-year cycle.)
:: The not-AMD timeline is: Chuck Moore announces cluster-based multithreading in 2005; then an AMD + laptop-manufacturer meeting in 2006, where Bulldozer/Bobcat were both single-core, announcing that both products would be early Fusion products for late 2008/early 2009. Bobcat first to launch with single-core + 1 GPU (40 SP? Radeon HD 3450-related?) (late 2008) -> Bulldozer second to launch with single-core + 3 GPU (120 SP? Radeon HD 3650-related?) (early 2009). This meeting also described each core's target: 1+ GHz for Bobcat and 2+ GHz for Bulldozer. Also, later on, at AMD + Laptop Meet 2.0 in 2007, they announced a third product: three Bulldozer cores + 1 GPU core for late 2009.
:: 2007 is off the rails... 1H2007 announced the delay that shifted all new products to late 2009. Then 2H2007 announced that, nope, it won't be coming until 2010+. They literally dumped the slides for the old architecture in mid-2007, which is when they knew they had changed gears to another architecture for the "Bulldozer" processor.
:: Pretty confident that Bulldozer 1.0 was 2x2 ALU + 3 AGU + 4 FPU and was referred to as a single-core with multithreading throughout. It is Bulldozer 2.0 (2x2 ALU, 2x2 AGLU, 1x4 FPU) which was referred to as a dual-core throughout and was called chip-level multithreading by the chief architect that launched it.
:: I also probably identified the original mm2 size for Bulldozer 1.0 as ~10.8 mm2 (11.3 mm2 with pervasive bits) for a single core on 45nm. It is also likely that the FPU was 2x64b for FMA (the penalty for MUL and ADD (different instructions) simultaneously was low) and 2x64b for MMX. I believe at the time AMD wasn't eyeballing 256-bit, but rather speeding up 64-bit (2x the throughput of K8) and 128-bit (same throughput in 2/3rd the area of GH45/Stars45). Bulldozer 1.0 and Bobcat were very close in design and target; however, Bulldozer 1.0 could scale out and up. Bobcat had a ~10W target for a single core (~5.4 mm2 on 45nm / 1+ GHz), and Bulldozer went for a ~100W target with eight cores (~10.8 mm2 * 8 / 2+ GHz).
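A quick sanity check of those area figures (all of them the post's own estimates, not official numbers):

```python
# Back-of-the-envelope check of the quoted (assumed) area figures:
# ~10.8 mm^2 Bulldozer 1.0 core and ~5.4 mm^2 Bobcat core, both on 45nm.
bulldozer_core_mm2 = 10.8
bobcat_core_mm2 = 5.4
eight_core_area = 8 * bulldozer_core_mm2   # the ~100W eight-core scale-out
ratio = bulldozer_core_mm2 / bobcat_core_mm2
print(f"eight Bulldozer 1.0 cores: {eight_core_area:.1f} mm^2")  # 86.4 mm^2
print(f"Bulldozer/Bobcat core area ratio: {ratio:.1f}x")         # 2.0x
```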
:: Greyhound standard cells were same across => Agena, Deneb, Thuban, Llano
:: Bobcat and Bulldozer 1.0 standard cells were the same && Bulldozer 2.0 used its own ground-up standard cells.
-- Chief Architect 0 (1997-2002): Low Power (1st Gen Clustered Core)
-- Chief Architect 1 (2002-2004): LP -> High Performance (Adopts K10 and is very close to Bulldozer that launched)
-- Chief Architect 2 (2004 to December 2007): HP -> Low Power (Adopts Bulldozer and is mobile-focused and is closer to DW/JK's K8-alternative architecture)
-- Chief Architect 3 (2008 to 2012) -> LP -> HP (Architecture that would launch, starting on Steamroller 1.0 in 2009)
-- (Acting) Chief Architect 4 (2012 to 2015?) -> Mobile-ify Steamroller/Excavator and work on 3rd Gen architecture(back to original design)
-- Chief Architect 5* (2016 to 2020) -> Ultra Low Power cluster-based Multithreading architecture (22FDX/12FDX)
-- Chief Architect 6* (2020 to present) -> Ultra Low Power grid-based multiclustered architecture
* => same person, maybe.
With the nodes of the new architecture being: 90CPP/80Mx FDX and 64CPP/56Mx FDX and 45CPP/40Mx FDX
New Malta test site/shuttle, September 2021-present, for the above nodes. AMD is exclusively getting their own FDX nodes. AMD/GF DTCO covers ultra-low-voltage custom digital, SerDes, memories, etc.
Zen3 core = Monolithic Integer + Monolithic Memory(LSU) + Clustered FPU
::
https://patents.google.com/patent/US11281466B2
:: literal die shot + software optimization guide.
I believe you are wanting to refer to cluster-based multithreading and not chip-level multithreading.
Example: Zen3/4 w/ partial architecture cluster-based multithreading to Zen5 w/ full architecture cluster-based multithreading:
8-core => 8-core
Only FPU clusters have thread to cluster pairing => Both Int and FPU clusters have thread to cluster pairing.
The reason for clusters to become more prevalent is detailed all the way back from 2001:
"A comparison of the two microarchitectures, both optimized for energy efficiency, shows that the multicluster architecture is potentially up to twice as energy efficient for wide issue processors,
with an advantage that grows with the issue width. Conversely, at the same power dissipation level, the multicluster architecture supports configurations with measurably higher performance than equivalent conventional designs."
If Zen5 uses cluster-based for the front-end and integer-end to get 6-wide decode/ALU, then it would be more efficient than Apple's P-core (if it is still 6 ALU by N3) and Intel's P-core (if it is still somewhere around 6 ALU by Intel 3?). Cluster-based architectures are always faster and more efficient than their conventional competitors. It is the main reason Power9/Power10 is cluster-based multithreaded; each thread gets its own cluster: t0->s0, t1->s1, t2->s2, t3->s3 under full load.
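That t0->s0 ... t3->s3 pairing under full load is just an identity mapping of threads to scheduler slices; a trivial sketch (`pair_threads_to_slices` is a made-up name for illustration, not IBM terminology):

```python
# Illustrative sketch of the per-thread cluster pairing described above for
# a Power9/Power10-style SMT4 design: under full load, each thread maps to
# its own scheduler slice.
def pair_threads_to_slices(n_threads, n_slices):
    if n_threads > n_slices:
        raise ValueError("more threads than scheduler slices")
    return {f"t{i}": f"s{i}" for i in range(n_threads)}

print(pair_threads_to_slices(4, 4))
# {'t0': 's0', 't1': 's1', 't2': 's2', 't3': 's3'}
```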
Also, to cover all bases just in case... AMD could delete the integer execution and integer scheduler and re-purpose the repeated instances inside the FPU clusters for general purpose. The area gained with a shrink should allow for a second decode/fetch path and a second op-cache.
Decode 4-wide + >4K op-cache (lo-path) and Decode 4-wide + >4K op-cache (hi-path) (Decodes have two pipeline stages, so 4+4 equals 8 macro-ops for each path, total decode being 16 macro-ops.)
P0/P1/P2/P3 => 8 64-bit ALUs + 8 64-bit ALUs within two+two Vector Integer 256-bit vALUs. These units are already exposed to 64-bit operations; allowing the other 3x64-bit internal paths to do 64-bit is a minimal area gain. There are also shift/rotate instructions that can shift/rotate hi-64 and lo-64 already by Zen4. This also has the side-effect of giving EVEX/VEX/SSE 128-bit operations the upper half for a second execution. The vPRFs in Zen3/Zen4 have a 128-bit unit size, so the pairing only needs to get 2x64-bit ops into 1x128-bit op, which can then use the upper half when 2x128-bit is fused into 1x256-bit op.
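My reading of that pairing path, as a toy model (the pairing mechanism and the `fuse_ops` helper are assumptions for illustration, not a documented design): two 64-bit ops pack into one 128-bit vPRF entry, and two 128-bit entries pack into one 256-bit op.

```python
# Toy sketch (assumed mechanism, not a documented one) of width pairing:
# adjacent ops of N bits are paired into ops of 2*N bits post-decode.
def fuse_ops(ops, width):
    """Pair adjacent `width`-bit ops into (lo, hi, new_width) tuples."""
    assert len(ops) % 2 == 0, "needs an even number of ops to pair"
    return [(ops[i], ops[i + 1], 2 * width) for i in range(0, len(ops), 2)]

ops64 = ["a", "b", "c", "d"]    # four 64-bit GPR-style ops
ops128 = fuse_ops(ops64, 64)    # 2x64b -> 1x128b (matches the vPRF unit size)
ops256 = fuse_ops(ops128, 128)  # 2x128b -> 1x256b (upper half gets used)
print(len(ops128), ops128[0][2])  # 2 128
print(len(ops256), ops256[0][2])  # 1 256
```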
Minus: Integer/GPR scheduler+execution (subtraction of power consumption (-20%))
Addition: second decode+op-cache (addition of power consumption (+10%))
Re-using: FPU scheduler/VI datapaths for grid-GPR Integer (similar power consumption as before (no change%))
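As a sanity check, that tally nets out like this (the percentages are the post's own rough estimates, not measurements):

```python
# Rough power-budget tally of the three changes listed above.
delta_remove_int = -0.20  # delete integer scheduler + execution
delta_add_decode = +0.10  # add second decode + op-cache
delta_reuse_fpu  =  0.00  # re-used FPU schedulers/datapaths, no change
net = delta_remove_int + delta_add_decode + delta_reuse_fpu
print(f"net core power change: {net:+.0%}")  # -10%
```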
This would have them skipping straight to grid-based multicluster on Zen5. Reasons for this:
- Integer going cluster-based and adding cores fall under chip scaling.
- The FPU being improved to support GPR-Int falls under logic scaling, as the PRFs already allow manipulation of 64-bit (Lo-XMM) and 2x64-bit (Lo & Hi swaps).
- It is the easiest way to improve superscalar performance on both Integer and FPU. The slowest code on any benchmark is code that spends ~50-70% of its time running easily vectorized, but still superscalar, int/fp operations.
- It is also the easiest way to improve area/power efficiency with a huge IPC increase. Vectorized instructions avoid cache dependencies most of the time;
3x64b+2x64b = 320b (a lot of operations over time)
3x128b+2x128b = 640b
2x256b+1x256b = 768b (least amount of operations over time)
While the average load/store bandwidth might be the same, the instantaneous load/store behavior is much better (power & perf) for 4-wide 64-bit superscalar code compartmentalized into 256-bit instructions post-decode.
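The per-configuration width totals quoted above can be checked directly:

```python
# Total load/store bits per cycle for each configuration listed above.
cfg = {
    "3x64b + 2x64b":   3 * 64 + 2 * 64,    # 320b: most individual ops over time
    "3x128b + 2x128b": 3 * 128 + 2 * 128,  # 640b
    "2x256b + 1x256b": 2 * 256 + 1 * 256,  # 768b: fewest ops over time
}
for name, bits in cfg.items():
    print(f"{name} = {bits}b")
```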