YOU should read the article again.
Specifically this page for memory latency
and bandwidth:
https://www.anandtech.com/show/14892/the-apple-iphone-11-pro-and-max-review/3
After you are done showing yourself out, I'll leave it up to you to find the floating point results.
I'm objective, but calling Apple's ARM CPUs "competitive" with high end is a stretch.
You're making the classic mistake of assuming that the ONLY way to solve a problem is the way you know.
No one buys a CPU for the latency and bandwidth; they buy it for the performance on their workload. Apple achieves their uncore performance the same way they achieve their core performance --- by throwing massive amounts of (low-power, but high-area) logic at the problem rather than (high-power) speed. Thus, as I keep telling you, their caches and prefetchers are ASTONISHINGLY good, so you get vastly more hits in cache and just don't need to go to DRAM that often.
Look at their caches.
Compare
to
In particular, look at the L2 and L3 and compare the amount of storage (the regular grid pattern) to the amount of logic (the irregular stuff in the same block). Note how Apple has VASTLY more logic in its caches. That logic is being used to ensure that its caches hold much more data that will be useful in the future, compared to Intel's caches.
How is this done? Well, go read many, many academic papers. But for example, when you have a cache and a read misses, you fetch that line from RAM and then you have various choices, like:
- do you put that line in L3, or L2, or only L1?
- which line in each of these caches do you remove to make space for the new line?
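To make those choices concrete, here's a toy Python sketch of the insertion decision alone (the sizes and the `looks_streaming` hint are invented for illustration; this is nobody's actual policy):

```python
from collections import OrderedDict

class ToyCache:
    """One level of a toy fully associative cache with LRU order: last = MRU."""
    def __init__(self, num_lines):
        self.num_lines = num_lines
        self.lines = OrderedDict()

    def lookup(self, addr):
        if addr in self.lines:
            self.lines.move_to_end(addr)  # hit: promote to MRU
            return True
        return False

    def insert(self, addr, at_mru=True):
        if len(self.lines) >= self.num_lines:
            self.lines.popitem(last=False)  # evict the current LRU line
        self.lines[addr] = True
        if not at_mru:
            # Insert at the LRU position: the line gets evicted soon unless it
            # proves itself with a hit, so streaming data can't pollute the cache.
            self.lines.move_to_end(addr, last=False)

# The refill choice on a miss: put the line everywhere, or let suspected
# streaming data enter L3 at LRU so it can't displace useful lines.
l2, l3 = ToyCache(512), ToyCache(8192)

def refill(addr, looks_streaming):
    l2.insert(addr)
    l3.insert(addr, at_mru=not looks_streaming)
```

Even this trivial knob changes which lines survive: a line inserted at LRU disappears after two more misses unless it's actually reused.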
There are easy answers to these questions, answers known in the 1980s (like making all the caches inclusive and using random or pseudo-LRU replacement). But those answers aren't very good! For example, on Intel's L3, under most circumstances, over half the lines are dead (i.e. sitting there, but will never be touched again). Or: how does your L2 treat D vs I lines? I lines are more critical because the instruction-fetch subsystem can only tolerate a few cycles of delay before grinding to a halt, while D can tolerate 50 cycles or more. So you should prioritize I lines over D in L2 and L3. But that requires a smarter cache. (I don't think Intel or AMD do such prioritization; I've never heard of it.)
Working harder, you can come up with much better answers, so that your caches hold vastly more USEFUL data. Connect that up to smarter prefetchers and, woo hoo, you no longer need to burn so much power connecting to DRAM as fast as possible.
And this is everywhere. TLBs are also a caching system. And once again you can run your TLBs (and the caches that feed them) the smart way or the easy way.
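Same sketch for a TLB: it's just another cache, keyed by virtual page number, and the same keep/evict decisions apply (`fake_walk` below is a stand-in for a real page-table walk; sizes are illustrative):

```python
from collections import OrderedDict

class ToyTLB:
    """Toy fully associative TLB with LRU replacement."""
    def __init__(self, entries=64, page_size=4096):
        self.entries = entries
        self.page_size = page_size
        self.map = OrderedDict()  # vpn -> pfn, last = most recently used
        self.walks = 0            # how often we paid for a page-table walk

    def translate(self, vaddr):
        vpn, offset = divmod(vaddr, self.page_size)
        if vpn in self.map:
            self.map.move_to_end(vpn)         # hit: cheap
        else:
            self.walks += 1                   # miss: slow walk through the page tables
            if len(self.map) >= self.entries:
                self.map.popitem(last=False)  # evict the LRU translation
            self.map[vpn] = self.fake_walk(vpn)
        return self.map[vpn] * self.page_size + offset

    def fake_walk(self, vpn):
        return vpn + 1000  # placeholder for a real page-table walk
```

Every walk you avoid here is several dependent memory accesses you didn't make, so smarter TLB management pays off exactly the same way smarter cache management does.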