The ILP wall is valid for in-order CPU microarchitectures only. In an OoO CPU there is massive reordering, parallel speculative execution, speculative loading, etc. The CPU is limited by the size of its OoO window and by how good its predictors are. So for an OoO CPU there is no hard IPC/ILP wall. That said, I'm not claiming it's easy to get more IPC.
More transistors doesn't mean more power. Apple is a good example of how, despite almost double the transistor count, an A13 Lightning core can be twice as efficient as Zen 2. Apple must be able to power gate a lot of the parts inside the core when they're not in use; there is no other way. It's something like a Cortex core that, when it detects a high misprediction rate, minimizes speculative execution to save energy. And consider that Apple is at least 4 years ahead of everybody else in development...
Regarding A13:
You are right that 2x A13 Lightning cores draw 5-6 watts, so an 8-core version would consume 20-24 watts. But Andrei measured system consumption, including dual-channel LPDDR4... so that 8-core A13 at 20-24 watts would include 8-channel LPDDR4 too. And PCIe link power consumption is not that much; in reality it's just a fraction of what the CPU cores draw.
And the best part: those 8x A13 Lightning cores at 2.6 GHz would deliver performance equal to 8x Zen 2 at 4.7 GHz... Show me a Ryzen CPU that can run all eight of its cores at 4.7 GHz simultaneously at 24 W (even if you found such a rare golden chip, it would consume 150 W). The Zen 2 Renoir APU shows definitely better efficiency, but again, AMD bins the best low-leakage dies for laptops and the rest go to desktop later this year. Every Apple A13 can reach this performance, which means there is some decent performance margin.
Regarding similarity between ARM and X86 when scaling up:
Did you see Andrei's test of Graviton2? I doubt it.
32c/64t Zen1@2.9GHz vs 64c/64t A76@2.5GHz
- higher performance per thread despite lower frequency
- higher MT throughput
- half the power consumption (90W vs. 180W)
- even 14nm vs. 7nm cannot explain such a difference in power consumption
- cheap to manufacture: an A76/N1 core is only 1.4mm2, compared to 3.4mm2 for a Zen 2 core
....
So, as a preface (because this is a long post): I'm not bashing ARM. I am slightly bashing the claim that the A13 is somehow faster than a general-purpose desktop processor (for reasons I'll get into below), and I'm absolutely bashing the idea that ARM CPUs are somehow much more efficient. Let's get into it:
First, a response to some of your comments:
- Regarding your comment about LPDDR4: LPDDR4 is a very different beast from DDR4. As a further aside, the Apple A13 can be (and very likely is) optimized around a fixed platform. This means that the memory controller is likely NOT a full general-purpose memory controller, but one tailored to work specifically with the iPhone platform at hand. Furthermore, we don't know what the addressable memory limit of this chip is. It's likely not very high, given the emphasis on power over performance.
- Regarding the Graviton 2, I read that article the day it came out: we don't have enough data to make a real comparison. That article was meant to compare cloud platforms on a single provider. For example, one can look at actual Zen 2/EPYC benchmarks and see that they are higher than the Graviton 2 numbers.
- You seem to think clock speed is a game changer here. Different platforms clock differently. Just because the A13 is clocked at 2.66 GHz does NOT mean it's more efficient than x86! Way back in the day, Intel and AMD frequently traded places on clock speed, perf/watt, IPC, and other metrics. For Apple to speed up their chip, they'd likely have to adjust the number of pipeline stages, among other changes. That can hurt IPC even though overall performance increases. In Apple's case, the A13 is a fixed platform, and their next chip is rumored to be on 5nm; Apple is using node shrinks for performance increases.
- Regarding the power consumption: AMD's 15 watt parts easily beat out the A13 by most performance metrics. The important takeaway here is that Apple's A13 is only faster for very specific workloads (source below). This says to me that the A13 has a vastly superior cache subsystem or maybe something else, but it in no way means that the A13 is a faster chip!
- Chips do not, and will not, scale up linearly. An 8 core, 8 thread A13 would have a 45 watt TDP. I've seen the A13 in my iPhone draw 7 watts of power and heat the phone up until it was uncomfortably hot, and that was in a GAME. While TDP isn't a direct measure of power draw, the two are usually pretty close. Keep this in mind with the data I present below.
- The A13 doesn't support AVX, SSE, or a myriad of other instruction-set extensions that current x86 processors accelerate in hardware.
- The iPhone platform is built for power savings, not performance. The macOS platform is a completely different beast. That is why Apple is doing what it is doing now: Macbooks, Power Macs, etc. are all segmented. iPads are taking over the lower end. (Ironically, iPads don't have the A13; even the newest iPad is using an A12 variant).
Now let's look at some data.
Here is an early benchmark on Geekbench 5 (my favorite benchmark) of the 15 watt Ryzen 4800U:
https://browser.geekbench.com/v5/cpu/1373084
Here is a random benchmark of the iPhone 11 Pro Max:
https://browser.geekbench.com/v5/cpu/1498904
Notice the areas where the A13 is winning:
- Text Compression
- Navigation
- HTML5
- SQLite
- PDF Rendering
- Text Rendering
- Clang
- N-Body Physics
- Face Detection
- Horizon Detection
- Image Inpainting
- Ray Tracing
- Speech Recognition
In most cases, it's a pretty narrow victory. However, that's not what is alarming here. What IS alarming is the pattern that emerges: it's almost like those very specific mobile-oriented workloads are being accelerated in some way! With the exception of SQLite, N-Body Physics, and Ray Tracing, all of the workloads above are common on a smartphone. Since the developers of Geekbench have little control over how their benchmark is compiled for each platform, it's very likely that Apple is using a number of accelerators in both the CPU and GPU to speed up these workloads, while the Ryzen 4800U is doing everything by brute force, with the exception of built-in instruction sets.
So you are probably wondering about SQLite, Clang, N-Body Physics, and Ray Tracing. Let's address those now.
- SQLite largely consists of taking plain-text SQL statements and parsing them into a machine-readable format. For a platform that accelerates text I/O, as hinted at above, it stands to reason that the A13 would perform well at this workload. Interestingly enough, it doesn't perform all that much better than the 4800U.
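To make "parsing text" concrete, the hot loop in this kind of workload looks roughly like the toy tokenizer below: byte-at-a-time, branchy integer code that leans on caches and branch predictors rather than the FPU. This is a sketch of the flavor of work, not what SQLite or Geekbench actually run:

```c
#include <ctype.h>

/* Toy SQL tokenizer: counts tokens in a statement.  The real SQLite
 * parser is far more involved, but the inner work has the same shape:
 * scan bytes, classify characters, branch. */
int count_tokens(const char *sql) {
    int n = 0;
    while (*sql) {
        while (*sql && isspace((unsigned char)*sql)) sql++;  /* skip blanks */
        if (!*sql) break;
        if (isalnum((unsigned char)*sql) || *sql == '_') {
            /* identifier, keyword, or number: consume the whole word */
            while (isalnum((unsigned char)*sql) || *sql == '_') sql++;
        } else {
            sql++;  /* single-character token: punctuation or operator */
        }
        n++;
    }
    return n;
}
```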
- Clang once again consists of parsing a bunch of text (1094 lines of it!). Want to know what would really accelerate this process? The ability to accelerate pattern recognition and text parsing! I don't see any evidence of Geekbench actually running a linker. Their own document says they build for the AArch64 architecture, but not how much of the compilation pipeline they run. Furthermore, the score is listed in klines/sec. Interesting indeed. That suggests to me that they aren't going as far as generating code, but are mostly measuring parsing of the source. We can speculate on this all day, but it's a very small win regardless.
- N-Body Physics is an interesting benchmark to have a win in. However, given the architecture of the A13, it doesn't really surprise me. Cache helps with this workload immensely, and it's likely the platform is accelerating things in some way, given the large "win" compared to other narrow wins.
- Ray tracing is another interesting one. Yet another workload that builds and uses a tree of data (N-Body Physics uses an octree; the ray tracing benchmark uses a k-d tree). After analyzing these results, I realize that I know exactly what is going on here. The exact same mechanism that accelerates text rendering and compression is at work here as above. The one thing the A13 absolutely excels at is analyzing and parsing data. Why is this? We'll look at that in a moment.
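The reason tree-walking workloads stress the memory subsystem is that every step of the walk is a load that depends on the previous one. Here's a minimal sketch of that access pattern using a plain 1-D binary search tree as a stand-in for the octree/k-d tree cases (the real benchmarks are much larger and more complex):

```c
#include <stdlib.h>

/* Each step of a tree lookup is a dependent pointer load: the CPU cannot
 * start fetching the child until the parent arrives, so cache and load
 * latency dominate.  Octree and k-d tree traversals behave the same way. */
typedef struct Node { int key; struct Node *lo, *hi; } Node;

Node *insert(Node *n, int key) {
    if (!n) {
        n = calloc(1, sizeof *n);  /* leaked in this sketch; fine for demo */
        n->key = key;
        return n;
    }
    if (key < n->key)      n->lo = insert(n->lo, key);
    else if (key > n->key) n->hi = insert(n->hi, key);
    return n;
}

int contains(const Node *n, int key) {
    while (n) {                    /* one dependent load per iteration */
        if (key == n->key) return 1;
        n = (key < n->key) ? n->lo : n->hi;
    }
    return 0;
}
```

A core with a large, low-latency cache hierarchy, like the A13's, keeps more of the tree close to the pipeline, which is consistent with the wins above.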
What is more alarming to me is where the A13 loses. The AES-XTS benchmark I can understand -- modern x86 processors all accelerate AES in some form or fashion, and Zen 2 has first-class support. However, some benchmarks the A13 should be good at, it loses. HDR is one example; Gaussian Blur is another; Image Compression is another. In each of those benchmarks, a clear pattern emerges. All of them throw computing power at image manipulation, and all of them would benefit immensely from both the FPU and traditional instruction sets such as SSE. Indeed, if we compare the numbers of the Ryzen 3900X (note that the 4800U will likely score somewhat lower), we discover that Ryzen wins most of the benchmarks involving floating point or multimedia (it wins most of the benchmarks, period -- links below).
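The shape of an image-filter inner loop makes the point. Below is one horizontal pass of a 3-tap blur with clamped edges; real Gaussian blurs use wider separable kernels, but the pattern is the same: streaming floating-point multiply-accumulates, exactly the loop SSE/AVX (or NEON) chews through several pixels at a time:

```c
/* One horizontal pass of a 3-tap blur (weights 0.25/0.5/0.25, edges
 * clamped).  Pure FP multiply-accumulate over a contiguous array -- the
 * kind of loop wide SIMD units turn into a throughput contest. */
void blur3(const float *src, float *dst, int n) {
    for (int i = 0; i < n; i++) {
        float l = src[i > 0 ? i - 1 : 0];
        float r = src[i < n - 1 ? i + 1 : n - 1];
        dst[i] = 0.25f * l + 0.5f * src[i] + 0.25f * r;
    }
}
```

Unlike the tree walks, there are no dependent loads here, so a big cache buys little; raw FP and SIMD width decide the score, which is where Zen 2 pulls ahead.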
So why is it that the A13 excels at those very specific workloads above?
- Cache is a big factor. Indeed, the A13 has awesome cache latency up to a queue depth of 4096. (coincidentally, all of the benchmarks in the GB5 suite are tiny enough that they are heavily influenced by cache, but that isn't the sole reason why the A13 performs well)
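Cache latency is usually measured with a pointer chase, which is worth sketching because it shows why latency, not bandwidth, is what these small benchmarks feel. Each load depends on the previous one, so the OoO engine cannot overlap them; with a small working set the chain stays in cache, and past the cache sizes the same loop runs at DRAM latency. Timing is omitted here; this just shows the access pattern:

```c
#include <stddef.h>

/* Pointer chase: next[] encodes a permutation, and every iteration's load
 * address depends on the previous load's result.  This serialization is
 * how memory-latency tests defeat prefetchers and the OoO window. */
size_t chase(const size_t *next, size_t start, size_t steps) {
    size_t i = start;
    while (steps--) i = next[i];   /* fully serialized, dependent loads */
    return i;
}
```

Run over arrays sized to each cache level (and shuffled so the prefetcher can't guess the pattern), the per-step time of this loop traces out the latency curve Andrei publishes.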
- The toolchain for anything iPhone-related is Xcode. Because of this, applications can be optimized, like the processor, to run in a fixed environment, taking full advantage of the hardware. This is important for the next point.
- The NPU could be playing a very big factor here. The NPU would excel at parsing octrees, k-d trees, and other tree structures. Incidentally, that includes all four of the benchmarks I singled out earlier.
- Finally, AMX could very well have a hand in this as well. Unfortunately the iPhone is a (relatively) closed platform, so we don't know.
Don't get me wrong, ARM is getting there. However, anyone expecting magic overnight needs to calm their expectations.
ARM isn't magic, and x86 isn't done.
iPhone 11 Review/Deep Dive:
https://www.anandtech.com/show/14892/the-apple-iphone-11-pro-and-max-review/
Ryzen 3900X Review/Deep Dive:
https://www.anandtech.com/show/14605/the-and-ryzen-3700x-3900x-review-raising-the-bar/
Geekbench 5 workload details:
https://www.geekbench.com/doc/geekbench5-cpu-workloads.pdf
Please let me know if you see any typos; I typed this up in a relative hurry and barely skimmed over it for proofreading.