My point in bringing up single-thread performance is the A13's cache structure. It has a huge L1 cache and a huge L2 cache, which doubtless help tremendously in achieving high IPC for single-threaded workloads. Obviously the cache structure would have to be redesigned to make it more scalable and performant in multithreaded workloads, which would probably result in lower single threaded performance.
Someone posted a graph of the voltage/frequency curve for the A12 a few pages back, and it didn't look too convincing. The voltage spikes big time at around the 2.6 GHz mark, so can you imagine the power draw if you put even more of these cores on a single die with a more robust uncore? I just don't see the A13 or a CPU like it being able to be competitive with Intel and AMD in heavier multithreaded workloads without a substantial redesign of the entire CPU, and not just the uncore. The entire microarchitecture would need to be redesigned, I believe.
Also, the A13 only has NEON.
"Obviously the cache structure would have to be redesigned to make it more scalable and performant in multithreaded workloads, which would probably result in lower single threaded performance."
Why? Like so many x86 people, you simply cannot understand that there are DIFFERENT WAYS to solve the same problem.
Look at something like the Graviton 2 topology
www.anandtech.com
Look at something like A64FX
www.anandtech.com
There are MANY ways to solve these issues.
For example, Apple could create a hierarchical system. The fundamental units are 4 CPUs + a large L2, and several of these "tiles" share a distributed LLC. Look at what Graviton did.
With a performant enough L2, you can even scale this up to 8 cores sharing an L2 without difficulty (look at A64FX).
With few enough of these larger tiles (say 4 or 8), a ring is fine, and the NoC uses two-level addressing: first to a station, then to a unit within that station (which is probably what Apple is already doing).
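To make the two-level addressing idea concrete, here is a minimal sketch. The tile size, tile count, and all function names are assumptions picked for illustration, not anything known about Apple's actual NoC. A flat core ID is split into (station, local index), and the ring only has to route between stations:

```python
CORES_PER_TILE = 4  # assumption: 4 cores share one L2 per tile
NUM_TILES = 8       # assumption: 8 tiles (stations) sit on the ring


def split_address(core_id):
    """Split a flat core ID into (station on the ring, core within that station)."""
    if not 0 <= core_id < CORES_PER_TILE * NUM_TILES:
        raise ValueError("core_id out of range")
    return divmod(core_id, CORES_PER_TILE)


def ring_hops(src_station, dst_station):
    """Shortest hop count between two stations on a bidirectional ring."""
    d = abs(src_station - dst_station)
    return min(d, NUM_TILES - d)


def route(src_core, dst_core):
    """Return (ring hops to the destination station, local index inside it)."""
    src_station, _ = split_address(src_core)
    dst_station, dst_local = split_address(dst_core)
    return ring_hops(src_station, dst_station), dst_local
```

With these numbers the ring never sees more than 8 stops no matter how many cores sit behind each station, and traffic between cores in the same tile (zero hops) never touches the ring at all.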
It's simply nonsense to claim that "scaling up number of cores" is some crazy hard problem that only x86 knows how to do properly.
A more serious critique would worry about things that impact cross-core operations. Things like locks. But ARM has a better ISA here (better scope for clarifying just how much coherency you require and no more) and similarly scoped atomics.
And the A13 doesn't just have NEON. Even apart from the A14 probably getting SVE, the A13 also has AMX. We don't know anything about it yet, but it's there on the core, and it's claimed to give 6x the throughput of the 3 NEON pipes together. We'll probably learn a whole lot more about it (and get compilers that target it) with WWDC and the Xcode released at that point.