
Zen 6 Speculation Thread

Page 393
ARM was designed with efficiency in mind; x86, not so much. I mean, ARM was designed from the start to be a full 32-bit RISC architecture, no need for uops. AMD and Intel have been working around that limitation since the PowerPC era, and still are!
No ISA is inherently better. There’s plenty of ARM designs with worse efficiency than x86 designs.

I'm looking forward to Nova Lake with APX, but I believe it won't be until Unified Core that Intel can totally beat ARM, and even then, who knows what those Cambridge folks will cook up to counter it.
 
There’s plenty of ARM designs with worse efficiency than x86 designs.
The way I currently see it is that Apple is a gen ahead of Qcomm + ARM, Qcomm + ARM is a gen or two ahead of AMD, and god knows where Intel is lol.
(for power).
Apple trades area for power, yes. But not much of it, no. And they're a sizeable chunk more efficient, thus.
Yea, they don't spank them, that's just me having a bit of fun lol
12 Apple M cores in the M5 Pro/Max are faster than 16 Skymont cores in the 285K while being much more efficient, like much more.
They might have much better power, sure, but I doubt they will have the lead in perf/mm2.
The only ARM core I know of that competes (and I think even beats it) is the X4 in the MediaTek 9400.
Otherwise there’s always a “yeah but AMD does better SIMD” argument to be made.
It's an important caveat to mention, sure, but a comparison can still be easily made even if there is an asterisk.
 
No, they're just on N3e/p.
Fun's over soon.
No, this is in reference to Qualcomm's original X Elite vs Strix Point, both on N4P. Though, interestingly, TechInsights claims AMD uses one more metal layer too.
Using the DT CCX and also better memory might shrink that gap altogether, but even then, Qcomm's still getting crutched here because they have a faster release cadence than AMD. By the time Zen 6 comes out, Qcomm would be what, at or close to Oryon gen 4 cores? Plus I doubt Qcomm can't or won't beef up their own cache hierarchy for server or higher-margin markets. Maybe similar to what Apple did with their super cores.
 
No, this is in reference to Qualcomm's original X Elite vs Strix Point, both on N4P. Though, interestingly, TechInsights claims AMD uses one more metal layer too.
Horrible metric, we're flogging Andrei for it.
By the time Zen 6 comes out, Qcomm would be what, at or close to Oryon gen 4 cores?
All that matters is the final PPA. gen #whatever is irrelevant.
Plus I doubt Qcomm can't or won't beef up their own cache hierarchy for server or higher margin markets.
They're doing a Dunnington and it sucks.
 
RISC-V is a newer ISA, and RISC-V Linux doesn't have the same baggage as x86 Linux, and yet not a single RISC-V core can compete with Skymont in terms of perf/mm2 or absolute perf. Not a SINGLE company managed it with their latest RISC-V designs.
AArch64 was designed by industry professionals with the aim of building something that was as fast as possible on big cores. RISC-V was mainly designed by academics, and initially aimed for ease of implementation for a wide range of targets, including the tiniest embedded cores possible. (And with many design decisions driven not by what makes a good cpu, but doctrine firm enough to verge on being religious.) It should come as no surprise that AArch64 is faster.

VM page size doesn't affect that stuff. For years we had most CPUs with a 4K VM page size and hard drives with a 512-byte sector size.
Most operating systems (Windows since NT 3.1 in 1993, Linux since its first release) always read from and wrote to disk in full VM pages, emitting 8 sector operations per page. Dealing with a different sector size and page size is just painful, and OS devs generally refuse to do it.
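The 8-operations figure is just the page-to-sector ratio; a quick sanity check with the sizes stated above (sketch only, variable names are my own):

```python
# A 4 KiB VM page written to a drive with classic 512-byte sectors
PAGE_SIZE = 4096    # bytes, typical VM page size
SECTOR_SIZE = 512   # bytes, pre-Advanced-Format hard-drive sector

# Each full-page write the OS issues becomes this many sector operations
ops_per_page = PAGE_SIZE // SECTOR_SIZE
print(ops_per_page)  # -> 8
```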

No ISA is inherently better. There’s plenty of ARM designs with worse efficiency than x86 designs.
That absolutely doesn't mean that no ISA is inherently better. They certainly can be worse, see IA-64 and iAPX432.

The µarch can always be better or worse, but the architecture sets the boundaries inside which the people designing the µarch have to work. AArch64 is the best currently existing architecture with reasonable implementations; it least limits the design team from making a really fast core.

This doesn't mean that every AArch64 core is automatically fast, you still need to do the work, nor does it mean that x86 can't get there with heroic effort. But getting the same level of performance out of AArch64 just takes less work.
 
Your argument for Apple's cores being the best is that the ARM ISA is "new" and that Intel/AMD can do something similar. That's not even remotely true. RISC-V is a newer ISA, and RISC-V Linux doesn't have the same baggage as x86 Linux, and yet not a single RISC-V core can compete with Skymont in terms of perf/mm2 or absolute perf. Not a SINGLE company managed it with their latest RISC-V designs.
It's mostly about the design goals. Intel did have a plan for what Apple is doing (see Royal), but that was a horrible attempt; now they are back at a saner attempt, and if rumors are true, AMD is doing the same.
The ARM ISA is as old as x86; it's just that AArch64 is newer relative to x86 and doesn't have binary compatibility with ARM32.
Nothing stops AMD/Intel from designing wider/deeper cores, except themselves, and they have realized that, so they are doing it.
 
It's mostly about the design goals. Intel did have a plan for what Apple is doing (see Royal), but that was a horrible attempt; now they are back at a saner attempt, and if rumors are true, AMD is doing the same.
The ARM ISA is as old as x86; it's just that AArch64 is newer relative to x86 and doesn't have binary compatibility with ARM32.
Nothing stops AMD/Intel from designing wider/deeper cores, except themselves, and they have realized that, so they are doing it.
Royal was something else, it was peak Intel idiocy. They made it too wide.
 
Horrible metric, we're flogging Andrei for it.
This is pretty much the only way to measure it if you have an intense hatred of software power measurements, despite numerous papers saying Intel RAPL is actually pretty good lol.

I guess you could make it better by idle-normalizing it even harder, to exclude memory PHY and other IO power? But it's not as if Qcomm has significantly better idle power than AMD here. NotebookCheck's minimum idle power readings, when connected to an external monitor, for what little value that has, also show that.

Also, before anyone says it: yes, the first Qcomm graph shows AMD having much higher min power than Qcomm. But they have the same graph with more data points (aka the 350 at lower power) on different slides:
All that matters is the final PPA. gen #whatever is irrelevant.
Well, generally new gens improve, importantly for this context, perf/watt. At least in the middle/upper part of the curve.
It's mostly about the design goals. Intel did have a plan for what Apple is doing (see Royal)
Would have been way wider than what Apple is doing
but that was a horrible attempt
Royal was something else, it was peak Intel idiocy. They made it too wide.
I think it makes complete sense: one, because it almost certainly would have had something novel to actually get high IPC, and not the usual "just fatten everything up" Intel does (SNC, GLC).
And two, I don't think it's a coincidence that the industry's best cores are also the highest-IPC ones. Though I guess ARM is getting pretty close? Funnily enough, ARM's stock cores also slap on area, despite being such high IPC.

And I think I said this before, but as a slight tangent, I don't see why Intel and AMD can't realistically get away with chasing much more power and perf by increasing Vmin significantly. With dense cores specifically for core spam in server, and LP islands specifically for mobile in client, why worry nearly as much about perf at low power for the P-cores?
 
Royal was something else, it was peak intel idiocy. They made it too wide
Too much resourcing, yes; a 10mm2 core wouldn't fly.

This is pretty much the only way to measure it if you have an intense hatred of software power measurements, despite numerous papers saying Intel RAPL is actually pretty good lol.

I guess you could make it better by idle-normalizing it even harder, to exclude memory PHY and other IO power? But it's not as if Qcomm has significantly better idle power than AMD here. NotebookCheck's minimum idle power readings, when connected to an external monitor, for what little value that has, also show that.

Also, before anyone says it: yes, the first Qcomm graph shows AMD having much higher min power than Qcomm. But they have the same graph with more data points (aka the 350 at lower power) on different slides:

Well, generally new gens improve, importantly for this context, perf/watt. At least in the middle/upper part of the curve.
Idle power is about the entire platform: power delivery, choice of RAM, etc. Qualcomm is locked down; for Intel/AMD it becomes more about the laptop as a whole, which makes for a good laptop comparison but not a CPU-only comparison.
Would have been way wider than what Apple is doing


I think it makes complete sense: one, because it almost certainly would have had something novel to actually get high IPC, and not the usual "just fatten everything up" Intel does (SNC, GLC).
You would be disappointed with Coyote then, cause it's GLC 2.0 🤣🤣
And two, I don't think it's a coincidence that the industry's best cores are also the highest-IPC ones. Though I guess ARM is getting pretty close? Funnily enough, ARM's stock cores also slap on area, despite being such high IPC.
Best for what? Client, yes; server, no.
And I think I said this before, but as a slight tangent, I don't see why Intel and AMD can't realistically get away with chasing much more power and perf by increasing Vmin significantly. With dense cores specifically for core spam in server, and LP islands specifically for mobile in client, why worry nearly as much about perf at low power for the P-cores?
LP/Dense are the same design for Intel.
Anyway, we are moving off topic; this is the Zen 6 thread, not x86 vs ARM.
 
VM page size doesn't affect that stuff. For years we had most CPUs with a 4K VM page size and hard drives with a 512-byte sector size. It wasn't until they crossed the 2 TB barrier that sector sizes went to 4096 bytes to match the VM page size. Hopefully if you are writing log files you are allowing some level of write caching rather than syncing to storage with every 50-byte message. That's gonna be a MUCH bigger problem for NAND than the difference between 4K and 16K VM page size.
It 100% does. Go make a web server/firewall/etc. VM that's not sector-aligned, with any of the logging needed for PCI / ISO 27001 / ISM (I'm Australian; for the US, FedRAMP) etc., and watch your workload performance plummet. When we ran 15k SCSI disks we had maybe 4 CPU cores a box, generally bare metal, and old-school FC SANs had large SRAM caches that could hide the write latency pretty well. Now we run hundreds of threads a box, generally against MLC local disk. We have good IOPS, but not as good as you would think, with way, way more CPU performance and way more logging. It's even more amazing when you see an unaligned qcow on ZFS.

It's not uncommon on some of the boxes I have oversight/design responsibility for to hit ~25% of CPU time on logging IO.
 
Wide datapaths aren't that useful for client workloads
Don't confuse "not widely available enough for it to be worth implementing" with "not being useful" 😉
(new helpful 256 bit wide instructions in AVX-512 are another matter, but there ARM has analogues)
AVX-512's new instructions often extend to 128b as well. And for feature parity you need SVE, which is not as widespread as NEON.

they've been able to handle four 128 bit wide NEON instructions per cycle for a while now. That's equivalent (for most stuff, there are exceptions like rotate) to one 512 bit wide AVX instruction per cycle
Well, if I want to add two 16-element arrays of 32b integers together, I need 8 loads, 4 additions, and 4 stores with NEON. That's 16 instructions, compared to 3 instructions with AVX-512 (1 load, 1 addition with a memory operand, 1 store). Yes, this is the trivial naive case.
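The instruction counting above can be sketched as straight arithmetic; this is just the counting model for that naive case (the helper is my own, not actual SIMD code):

```python
def simd_op_count(total_bits, vector_bits, mem_operand=False):
    """Instructions to add two arrays of total_bits each, processed in
    vector_bits-wide registers, assuming everything is vectorizable."""
    chunks = total_bits // vector_bits  # register-wide pieces per array
    loads = 2 * chunks                  # load a chunk of each source array
    adds = chunks                       # one vector add per chunk
    if mem_operand:                     # x86 can fold one load into the add
        loads -= chunks
    stores = chunks                     # write each result chunk back
    return loads + adds + stores

# Two 16-element int32 arrays = 512 bits of data each.
neon = simd_op_count(512, 128)                      # 8 + 4 + 4 = 16
avx512 = simd_op_count(512, 512, mem_operand=True)  # 1 + 1 + 1 = 3
print(neon, avx512)  # -> 16 3
```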

And as I noted earlier Neon does not have feature parity with AVX512.

In general I haven't seen SME used yet to speed up text manipulation, nor have I heard of GPUs being used in that context, so if you have seen some deployed in production I would gladly read about them.

And particularly since the Zen 5 implementation is of the "use it or lose it" variety, per Mystical:
Well, other code still benefits from other changes than the width. But the gains will be more situational and not as impressive as from wider registers.

Plus it's more nuanced than "well, they could have doubled the execution unit count and kept them at 256b so AVX2 code could use it." AVX2 code is limited to 16 architectural registers, so it's harder to keep that many units fed than with AVX-512's 32. It's possible you would run out of registers before being able to use all that hardware. Plus more execution units mean more silicon to control them, more pressure on the memory subsystem, etc. What I am trying to say is that adding more 256b execution units would not necessarily improve the AVX2 code situation; you might in the end still recompile to AVX-512 to get more architectural registers.

The biggest problem with Zen 5's AVX512 is not that the 512b width is extremely niche (the ISA makes it easy to use less than full width and still remain useful compared to AVX2 or NEON for that matter), but Intel politics that basically withheld widespread adoption for years so it's still treated as an afterthought (the same problem SVE is facing and problem that Apple does not have).
 
Well, if I want to add 2 16 element arrays of 32b integers together I need 8 loads, 4 additions and 4 stores with Neon. That's 16 instructions. Compared to 3 instructions with AVX512. (1 load, 1 addition with mem operand, one store). Yes, this is trivial naive case.
I know you know, but your example is too simple to demonstrate your point: cache bandwidth might limit the loop, number of 512-bit units vs 128-bit units will also play a role. All you've gained is less stress on ifetch and occupancy of some structures, both of which may not significantly show enough gain. Don't get me wrong, I like wide SIMD 🙂

And as I noted earlier Neon does not have feature parity with AVX512.
But SVE has feature parity, I think.

In general I haven't seen SME used yet to speed up text manipulation, nor I have heard of GPUs being used in that context, so if you have seen some deployed in production I would gladly read about them.
If you look for GPU and pattern matching, you should find some interesting uses of GPU (sorry, too lazy for a reference at the moment), but I guess that like for most GPU offloading stuff, it only shows benefits for large data sets (DNA for instance). For matrix units, I'm not sure there's any use, but I don't really follow research in that domain.

The biggest problem with Zen 5's AVX512 is not that the 512b width is extremely niche (the ISA makes it easy to use less than full width and still remain useful compared to AVX2 or NEON for that matter), but Intel politics that basically withheld widespread adoption for years so it's still treated as an afterthought (the same problem SVE is facing and problem that Apple does not have).
The SVE situation is really depressing with Qualcomm disabling it in the firmware despite the CPU having support for it. I don't know if this has changed in the last few years. And also having only 128-bit SVE is often not enough to show any gain. My take is that AVX-512/AVX10/SVE with 256-bit vectors is the way to go for client market.
 