NostaSeronx
The pipeline stage increase is no different from the increase from Core2 (14-stages) -> Nehalem (16-stages) -> Sandy Bridge (fetch-based 19-stages/uop-based 14-stages).
"Ok, it looks to me that most of the items in that list are architecture/AMD related. I've read that Bulldozer had long pipelines (like the P4) to enable higher frequencies so was the process node to blame for the underwhelming frequencies?"
Relative to Husky (Llano), the increase in pipeline depth for Bulldozer (Zambezi) was meant to reduce power at 1.3V and increase frequency at 0.8V.
HP-team (CTO of Cores:FW -> DM(interim)):
K9, 65nm PDSOI = 5 GHz, target frequency
K10 (proto-Bulldozer), 45nm PDSOI = 9 GHz, target frequency (quad-processor/quad-core/octo-cluster)
Bulldozer, 32nm PDSOI = 3.5 GHz, target frequency (quad-processor/octo-core)
LP-team (CTO of Cores: PH):
The unreleased Bulldozer, 45nm PDSOI = 2 GHz, target frequency (octo-processor/octo-core)
The unreleased Bobcat, 65nm PDSOI = 1 GHz, target frequency (single-core). This is the same node as K8L, Lion core, Fam 11h, and is the reason the IEEE Bobcat article says: "The Bobcat core uses about one-third the area the earlier K8 architecture would have used if implemented in the same fabrication process."
The unreleased Bulldozer comes to roughly 2/3rd the area of Husky.
Husky = 9.69 mm2 core area
Husky 2-core = 19.36 mm2 core area (however, more care is put on L2 size than on this)
Unreleased Bulldozer on 32nm = 14.52 mm2, if unr_Bulldozer had been based on Husky, which it wasn't. It was instead based on a Bobcat with a higher performance target. Going from 40nm bulk -> 32nm PDSOI can cover the high-performance cost-add, giving 7.35 mm2 as the general area of the unreleased, fully mobile-focused/high core count Bulldozer.
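For anyone who wants to check the area figures above, here is a minimal sketch that only does arithmetic on the numbers quoted in this post. The variable names are made up for illustration, and which Husky figure counts as the baseline is a choice of the reader, so treat the ratios as rough.

```python
# Area figures (mm^2) exactly as quoted in the post; names are illustrative only.
HUSKY_CORE_MM2 = 9.69                # Husky single-core area
HUSKY_2CORE_QUOTED_MM2 = 19.36       # Husky 2-core area as quoted
UNRELEASED_BD_IF_HUSKY_MM2 = 14.52   # unreleased Bulldozer if it were Husky-based
UNRELEASED_BD_MM2 = 7.35             # unreleased Bulldozer, Bobcat-derived estimate


def ratio(area_mm2: float, baseline_mm2: float) -> float:
    """Return area as a fraction of the chosen baseline."""
    return area_mm2 / baseline_mm2


# Doubling the single-core figure, for comparison with the quoted 2-core figure.
husky_2core_computed = 2 * HUSKY_CORE_MM2

ratios = {
    "Husky-based estimate vs quoted Husky 2-core":
        ratio(UNRELEASED_BD_IF_HUSKY_MM2, HUSKY_2CORE_QUOTED_MM2),
    "Bobcat-derived estimate vs Husky core":
        ratio(UNRELEASED_BD_MM2, HUSKY_CORE_MM2),
    "Bobcat-derived estimate vs Husky-based estimate":
        ratio(UNRELEASED_BD_MM2, UNRELEASED_BD_IF_HUSKY_MM2),
}

if __name__ == "__main__":
    print(f"2 x Husky core = {husky_2core_computed:.2f} mm2")
    for label, value in ratios.items():
        print(f"{label}: {value:.3f}")
```

The point of keeping the baseline explicit is that "fraction of Husky" changes a lot depending on whether one core or a 2-core pair is meant.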
ILP efficiency (OoO) with 16 registers and a single L1 memory, or TLP efficiency (CMT/SMT) with 2x16 registers and a single L1 memory.
To find out why the released Bulldozer is so large, pay attention to the out-of-order resources, which is basically where the leap in size comes from:
- Bobcat = 56-entry Retire
- Husky = 84-entry Retire
- Bulldozer (module/processor-level resource) = 256-entry Retire, which is comparable to Zen3 on 7nm with its 256-entry Retire.
If the retire is bigger, then all other OoO resources are bigger.
Also, newer designs can use fewer OoO resources for the same performance.
Cortex-A65/Neoverse-E1 = 40-entry OoO with SMT2, with slightly better capabilities from its 2x64-bit ALUs, 2x64-bit FPUs.
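To put the retire sizes in the list above side by side, a small sketch follows. The entry counts are the ones quoted in this post; the helper name is hypothetical, and the even per-core split of Bulldozer's module-level retire is purely an assumption for illustration, not something stated here.

```python
# Retire queue sizes as quoted in the post; dictionary keys are illustrative labels.
RETIRE_ENTRIES = {
    "Bobcat": 56,
    "Husky": 84,
    "Bulldozer (module)": 256,
}


def relative_to(baseline: str, entries: dict[str, int] = RETIRE_ENTRIES) -> dict[str, float]:
    """Express every retire size as a multiple of the chosen baseline core."""
    base = entries[baseline]
    return {name: count / base for name, count in entries.items()}


vs_husky = relative_to("Husky")
vs_bobcat = relative_to("Bobcat")

# ASSUMPTION (not from the post): split the module-level retire evenly across
# the two cores of a module to get a rough per-core figure.
bulldozer_per_core = RETIRE_ENTRIES["Bulldozer (module)"] / 2

if __name__ == "__main__":
    for name, multiple in vs_husky.items():
        print(f"{name}: {multiple:.2f}x Husky")
    print(f"Bulldozer per core (assumed even split): {bulldozer_per_core:.0f} entries")
```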
E1 on 7nm = 2.5 GHz/183 mW/0.46 mm2 with 128KB L2.
If the MT-core design comes back in ARMv9, it is a candidate for Cluster-based Multithreading if pushed for Cortex-A76/N1 performance.
The switch goes back to this: The Importance of Being Low Power in High Performance Computing. It began in 2005, when some of the box packages for AMD carried the green leaf or green circle logo, with the focus that HE/EE parts would replace SE parts. Hence the low-power focus of the first Bulldozer (2005/2007), and why Cluster-based Multithreading used a pure green tone.