A 4-way superscalar OoO ALU with deep pipelines and multithreading? Talk about complexity. Even worse, the vast majority of code tends towards 1 IPC
(COTS software, straight from a compiler, not a line of assembly or an intrinsic in sight), and more threads sharing execution resources means needing more register and cache resources. That makes it an easier trade-off to go with a larger number of simpler cores, even at the expense of the occasional high-IPC code
(MLP is the Way of the Future(tm), and we are going to get there by pounding square ISAs into round holes). High-IPC code, OTOH, can always benefit from having the extra instructions issued the next cycle, which BD appears to be betting on. I also wouldn't be surprised if keeping enough instructions issuing at any given time has something to do with the fast caches, too
(not "Intel" fast, but built for speed rather than density).
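
To put rough numbers on the trade-off, here is a back-of-the-envelope sketch. It is a toy model I made up for illustration, not a description of BD or any real chip: the function name, the two "designs" and the IPC figures are all assumptions, and it ignores caches, branch mispredicts and everything else that matters in practice.

```python
def aggregate_ipc(issue_width, cores, smt_ways, thread_ipc, threads):
    """Toy model: total instructions per cycle across the whole chip.

    Assumptions (all hypothetical): each thread could sustain `thread_ipc`
    on an infinitely wide core, threads are packed onto cores up to
    `smt_ways` per core, and a core never issues more than `issue_width`
    instructions per cycle.
    """
    remaining = threads
    total = 0.0
    for _ in range(cores):
        on_core = min(smt_ways, remaining)
        remaining -= on_core
        # A single thread cannot exceed the core's width either.
        per_thread = min(thread_ipc, issue_width)
        total += min(issue_width, on_core * per_thread)
    return total


# Two made-up designs with the same total issue width of 4.
wide = dict(issue_width=4, cores=1, smt_ways=2)    # one wide SMT core
narrow = dict(issue_width=2, cores=2, smt_ways=1)  # two simple cores

for label, ipc, threads in [
    ("two threads of ~1 IPC compiled code", 1.0, 2),
    ("one thread of hand-tuned ~3 IPC code", 3.0, 1),
]:
    print(label,
          "| wide:", aggregate_ipc(thread_ipc=ipc, threads=threads, **wide),
          "| narrow:", aggregate_ipc(thread_ipc=ipc, threads=threads, **narrow))
```

Under these assumptions the two simple cores match the wide SMT core on ordinary ~1 IPC code, and only lose on the single high-IPC thread, which is the "occasional high-IPC code" cost mentioned above. What the model can't show is the extra register and cache resources the wide SMT core needs to get there.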