I don’t really have time to study the specifics here. Performance of modern processors is really difficult to predict with multiple levels of caches, prefetch, TLB effects, and everything else. Those things affect applications differently: one application may work well with the prefetcher while another might not. More bandwidth will help the prefetcher, but it might evict useful data in some cases. I do know that a large number of ALUs is very unlikely to be fully utilized. You might get better utilization with SMT, but that doesn’t help single-thread performance. The complexity of the scheduler goes up significantly with added units, so it can limit clock speed. That doesn’t affect Apple as much since their target clock isn’t 5 GHz.

I had some good laughs at a certain member of this forum who used to shout about the 6 ALUs that AMD/Intel need to add to match Apple, who is already 6-wide (at a much lower clock, with a much tighter memory subsystem and a different ABI).
So let's not pretend to be smarter than actual chip designers.
But I think some facts hold for all chips. Even if average IPC is 1, there are sections of code where a lot of ops suddenly become ready (say, a blocking dependency finally arrived from DRAM), and then 5 ALUs and a wider machine can chew through those ready ops faster. Maybe the difference in average IPC ends up being 0.97 versus 0.98, but the wider chip will still come out on top.
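The burst argument can be sketched with a toy simulation (made-up numbers, not a real pipeline model): most cycles expose only an op or two, but when a dependency resolves a burst of ops becomes ready at once, and a wider machine drains the burst before the next one arrives while a narrower one falls behind.

```python
# Toy issue-width model: `ready_per_cycle` is how many ops become ready
# each cycle; the machine issues at most `width` ops per cycle and any
# excess piles up in a backlog that must be drained later.
def cycles_to_retire(ready_per_cycle, width):
    backlog = 0
    cycles = 0
    for ready in ready_per_cycle:
        backlog += ready
        backlog -= min(backlog, width)  # issue up to `width` ops this cycle
        cycles += 1
    while backlog > 0:                  # drain leftovers after the stream ends
        backlog -= min(backlog, width)
        cycles += 1
    return cycles

# Mostly-idle stream: 1 ready op per cycle, plus a burst of 50 every 10th
# cycle (e.g. a load from DRAM unblocking a dependency chain). 590 ops total.
stream = [50 if i % 10 == 0 else 1 for i in range(100)]
ops = sum(stream)
for width in (4, 6):
    c = cycles_to_retire(stream, width)
    print(f"{width}-wide: {c} cycles, IPC = {ops / c:.2f}")
# → 4-wide: 148 cycles, IPC = 3.99
# → 6-wide: 100 cycles, IPC = 5.90
```

With these (arbitrary) burst sizes the 6-wide machine keeps up with every burst while the 4-wide one accumulates a backlog; in real code the gap is far smaller, which is the "0.97 versus 0.98" point above.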
So armed with this information we find that in web browser benchmarks, especially Speedometer 2.0, Apple and Alder Lake are particularly strong. And I think most of that advantage comes from having massive OoO cores backed by 5–6 ALUs and other resources. How else would you explain a score of 250 vs. 325 for a 5 GHz Zen 3 vs. Alder Lake?
I ran the test on Zen 3 as well, and with tuned memory at 4.4 GHz it scores 7950 MIPS (vs. 6800 at 4.9 GHz in the article, I believe), so that's further confirmation that the 7-Zip compression benchmark scales too well with memory to be a proper measure of ALU throughput. Who knows where Zen 3 or Alder Lake peak?
Due to the complexity of modern systems it is difficult to determine the cause for sure, but the out-of-order window on all of these processors is quite large, so do you really expect it to be limiting Zen 3 performance? At 4 or 5 GHz, I would expect even a single ALU to be waiting on memory accesses quite a bit. Also, isn't Zen 3 a four-ALU design? How often do the extra 1 or 2 ALUs come into play? It just doesn't seem like they would make a difference very often.
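The "waiting on memory" point is easy to put numbers on (the latency and miss-rate figures below are round-number assumptions, not measurements):

```python
# Back-of-envelope: at 5 GHz, one load that misses every cache level and
# goes to DRAM stalls for hundreds of core cycles. Numbers are assumed
# ballpark figures, and misses are treated as fully serialized (no MLP),
# which is exactly what the large OoO window exists to avoid.
clock_hz = 5e9            # 5 GHz core clock
dram_latency_s = 80e-9    # ~80 ns loaded DRAM latency (assumption)

stall_cycles = dram_latency_s * clock_hz
print(f"One DRAM miss ~ {stall_cycles:.0f} core cycles")   # → 400

# If even 1% of instructions miss to DRAM and nothing overlaps,
# the miss term alone caps IPC well below 1:
miss_rate = 0.01
ipc_cap = 1 / (miss_rate * stall_cycles)
print(f"IPC ceiling from misses alone: {ipc_cap:.2f}")     # → 0.25
```

In that regime, whether the core has 4 or 6 ALUs is nearly irrelevant; the extra ports only matter during the bursts between stalls.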
You have a subset of applications that are likely very cacheable, and two processors with larger and/or faster L2 that do a lot better. It seems likely that it is the caches rather than the “core”; you can’t really separate them. Cache design is probably the most complicated part of a high-performance CPU these days. Intel was winning for a long time because they were very good at it. I would say that AMD is brute-forcing it a bit, going for larger cache sizes and using the advantages of their MCM architecture to compete, or dominate in some cases.

Intel’s Sapphire Rapids appears to still be made up of rather large, expensive, near-monolithic dies. I haven’t read too much on it, but it looks like 4 × ~400 mm², i.e. ~1600 mm² of silicon on an advanced process, while Milan is about 300 mm² of compute dies for a common 4-die part and 600 mm² for the more expensive 8-die part; the IO die is cheap GlobalFoundries silicon. That is a big difference. It would be a crippling one for Intel if they didn’t own their own fabs, and it will still limit Intel’s capacity and pricing for these parts. AMD can make a lot of Epyc chips per 5 nm wafer. Genoa will use 4, 8, or 12 dies vs. just 4 or 8 with Milan, but the parts with more than 8 will likely be very expensive and less common.