I don't think I'm smart, but you have done nothing other than "teh APPLE!@!#@!#@!#". So now that you have said 4x AGU: how much load and store bandwidth, and how many ports to cache? What's the cache configuration? In multi-ported caches you are wire limited. Let's not even talk about getting enough decode/dispatch for the mythical 4 threads. Funny how you think others are dumb and you are smart.
This is exactly what I say. On paper, the stronger K8 with ALU and AGU tied together (3xALU+3xAGU) was much slower than the theoretically weaker Core 2 Duo with decoupled 3xALU+2xAGU. C2D had speculative loads and some other new stuff which was possible in that big cluster, and K8 was missing all of that. That's why fusing two cores into one big Zen 3 core with 8xALU + 4xAGU + SMT4 could provide enough room to implement some new advanced logic to extract more ILP/IPC, which is not possible in a narrow 4xALU+SMT2 core. Especially for the next iterations in Zen 4 and Zen 5. It looks like you argued in favor of my dumb wide Zen 3 + SMT4 core, thanks.
No, I'm right and you're wrong (see, I provided as much evidence as you do).
You are wrong about that. Zen 2 does not have a new full AGU but only a store unit, which is much, much simpler than a load unit with all the speculative loading and load predictors. Lowest-hanging fruit; it was the easiest way. Maybe you noticed that Intel has been using a dedicated store unit for a while too.
Maybe you should go read the patent on how it actually works (yes, it's published). It is one unified queue from which it picks 3 addresses to generate and load/store. It wasn't simple and can't be done in a single cycle. There is no point adding the 3rd AGU to the load side of the equation because there are only 2 load ports to cache. But the AGUs have nothing to do with prefetch/predict, so I don't know why you're trying to conflate that. And store has to deal with store-to-load forwarding / memory disambiguation, and it still needs to connect to the PRF.
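To put rough numbers on the "only 2 load ports" point, here is a toy Python sketch, not a description of any real core. The function name `peak_mem_ops` and the single store port are assumptions made for illustration; only the 2 load ports and the 2 vs. 3 AGU counts come from the posts above.

```python
def peak_mem_ops(agus, load_ports, store_ports):
    """Toy upper bound on (loads, total memory ops) per cycle.

    Assumes every memory op needs one AGU slot and one cache port,
    and ignores queues, bank conflicts, and everything else.
    """
    loads = min(agus, load_ports)                # loads are capped by load ports
    total = min(agus, load_ports + store_ports)  # all ops are capped by AGUs and ports
    return loads, total


# Hypothetical configurations: 2 load ports (from the post), 1 store port (assumed).
two_agu = peak_mem_ops(agus=2, load_ports=2, store_ports=1)
three_agu = peak_mem_ops(agus=3, load_ports=2, store_ports=1)
print("2 AGUs:", two_agu)
print("3 AGUs:", three_agu)
```

In this model the third AGU leaves peak loads per cycle unchanged and only raises the total by letting a store issue alongside two loads.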
So first, they don't tell us how any of their cores work at all; for all you know it could be two clusters of 3 ALU + branch + AGU / split PRF (just like z15). You have no idea how their prefetch/predict/L2/stream page walkers etc. work, and you have no idea what kind of memory disambiguation they are doing (ARM has a weaker memory model). The only thing you know is they have 6 ALUs, so that MUST be it. Just ignore that Hurricane has 4 ALUs and would still beat Skylake in your metric quite handily, and that Apple has massively improved their cache and memory subsystems from A10 to A12, as can be seen in the AnandTech reviews, along with dispatch and all the prefetch/predict/etc. improvements you would expect. Also, ARM has load/store pair instructions and Apple has complete control of their ecosystem/compilers, so those 2 load/store units can be loading/storing 4 "bits" of data a cycle.
How is Apple feeding those 6xALUs in the Vortex core? They can do that with just 2xAGUs. How do they gain +58% INT IPC over Skylake? Maybe Apple hired some black magic voodoo shaman, or maybe they know what they are doing. And unfortunately Apple engineers forgot to ask you, so nobody told them it's not possible.
See, I've never said we won't see more ALUs on Zen. Unlike you, I don't see more ALUs being the "killer feature"; the killer feature is all the other microarchitectural improvements that allow you to get enough ILP to be worth having more ALUs.
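A minimal sketch of why extra ALUs only pay off when there is enough ILP to fill them. This is a toy unit-latency scheduler in Python, not a model of Zen or any real core; `cycles_needed` and the two example dependency graphs are hypothetical names and data made up for illustration.

```python
def cycles_needed(deps, width):
    """Cycles to run all ops, issuing up to `width` ready ops per cycle.

    `deps` maps each op to the ops it depends on. Assumes unit latency
    and an acyclic dependency graph.
    """
    done = set()
    remaining = set(deps)
    cycles = 0
    while remaining:
        # An op is ready once all of its inputs finished in earlier cycles.
        ready = [op for op in sorted(remaining)
                 if all(d in done for d in deps[op])]
        issued = ready[:width]
        if not issued:
            raise ValueError("dependency cycle or missing op in deps")
        done.update(issued)
        remaining.difference_update(issued)
        cycles += 1
    return cycles


# Eight ops in one serial chain: no ILP to extract.
chain = {i: ([i - 1] if i > 0 else []) for i in range(8)}
# Eight fully independent ops: plenty of ILP.
independent = {i: [] for i in range(8)}

for width in (4, 8):
    print(width, cycles_needed(chain, width), cycles_needed(independent, width))
```

The serial chain takes the same number of cycles at width 4 and width 8, while the independent ops finish twice as fast on the wider machine. That is the sense in which the rest of the microarchitecture has to expose the ILP before more ALUs are worth having.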