Something like 50% of x86-64 non-SIMD operations, on average, contain a load or a store. The idea that there is all this free ILP just lying around and all you need is a few more ALUs is, quite frankly, stupid. You need to be loading and storing more, which means you need a better front end to predict and prefetch, and better caches to keep the data closer when the front end misses. What you need for more ILP is to get the data into the core sooner, so go have a look at A12 vs Zen per core:
L1: 128KB vs 32KB
L2: 8MB vs 512KB
The Apple mobile SoCs are just that, mobile SoCs: they can have a small, tight interconnect with larger per-core caches, and they don't have to worry about scaling coherency, interconnect, and cache size to 64 cores. Just look at the random-access latency between Zen2 and A12:
Seeking answers? Join the AnandTech community: where nearly half-a-million members share solutions and discuss the latest tech.
www.anandtech.com
You can see how much deeper into the buffer sizes a big A12 core can hold the lower latency.
If you look at instruction throughput you will also see that while the A12 is 6-wide, its throughput on a per-instruction basis isn't much higher, and on many common instructions Zen2 has slightly lower latency (fpmul/mul/imul: 4 vs 3 cycles, in Zen2's favour). If AMD wanted to hit the same instruction throughput as the A12, they could just as easily add a bit more functionality to the existing ALUs; no need to go 6-wide. One area AMD can obviously improve is DIV, but that's a latency problem, not a number-of-issue-ports problem.
Apple is winning at IPC not by its width, but by its front end and its big, low-latency caches.
Apple's internal width could just as easily be about power/clock gating: only powering up the ALUs with the more expensive complex logic when they're needed.