Question x86 and ARM architectures comparison thread.

johnsonwax · Sunday at 2:52 PM

S'renne said:
Yeah pretty sure part of what makes Apple Silicon works so well is also from the MacOS optimisations for the target audience

Yeah, that's a big part of my thesis. That's not to say that it won't perform on SPEC which kind of bypass all of that, but it's important to understand that Apple knows what their target markets are and they are very much building for that market. If it's terrible at mining bitcoin, Apple literally doesn't give a damn. They are perfectly happy to have 0% of that market - in fact, that might be desirable. But everyone else in the industry doesn't get to do that as much. And Apple's market is really, really clearly consumers with an extension into the creative space. In terms of industry it's professional audio/video, software development, data science. Go to a conference for web development and it's all Macs. Data science tends to be nearly all Mac as well on the front end (mostly because Python on Windows is useless though WSL is fine) and then a lot of DC or maybe local Nvidia AI on the back.

Even the discussions around SMT have limited application to Apple's market because anything sufficiently parallellizable that Apple sees in it's market space they're going to do custom silicon for because they have the resources to do that. That's why they shipped the first NPUs to process the camera output on iPhone faster. It wasn't for general AI purposes, it was to make fake depth-of-field. The last x86 Mac Pro had a huge FPGA on board just to do codecs that their vendors wouldn't support. There just isn't a lot of generic SMT compute in Apple's space.

Doug S · Sunday at 4:05 PM

OneEng2 said:
M4 core itself is hard to prove one way or another that it would perform as well as Zen 5 in DC since no platform exists to test the theory. I SUSPECT that it would not perform as well simply because that is NOT what it was designed for. Zen 5 (and several previous generations) have been specifically architected "Server First" (AMD's quote, not mine). It is therefore likely that M4 wouldn't fare well in such a contest.

On the flip side, Zen 5 wouldn't work well at all in a phone or tablet.

To date, this is the only ARM vs Zen 5 benchmark in DC I have seen:

AMD EPYC 9965 "Turin Dense" Delivers Better Performance/Power Efficiency vs. AmpereOne 192-Core ARM CPU Review - Phoronix

www.phoronix.com

It didn't look very flattering for ARM.

BS.

"Designed for server" vs "designed for mobile" is just as meaningless in the days of billion transistor chips as decoding complex x86 instructions vs decoding simple AArch64 instructions. It mattered when there was only room for one or two big cores on a die, but now that there are dozens there's almost no difference in the P core you'd design for mobile vs the P core you'd design for server as far as the CPU structures itself.

You might make difference choices in the transistor types with FinFlex, but even that I'm kinda skeptical about - because today power matters almost as much in DC chips as phone chips, since the allowable per core power draw has been falling for 20 years - ever since the first dual core CPUs were sold. The per core power draw of DC chips isn't as low as it is in mobile, but it is a lot closer than it was a decade ago, and is going to keep getting closer unless Intel/AMD decide to start designing DC chips that require liquid cooling like IBM mainframes.

The differences when designing for the DC come in the uncore, stuff like the number of memory channels, support for ECC/chipkill on those memory channels, the need for a "bigger" fabric to connect many more CPU cores together, more PCIe links, that sort of thing. Ideally you want a really fat last level cache to support the much larger memory a server will have than a phone, but mobile CPUs ALSO like pretty fat caches for a different reason - it lets them power off the power hungry memory controllers. I remember reading a while back that memory controllers are responsible for almost all the power draw of DRAM. The DRAM itself - charging capacitors, bitsense amplifiers, row/column selectors etc. are a low single digit percentage of the overall DRAM power budget! So being able to power off those memory controllers, even momentarily, is a huge win. That's why Apple has a much bigger SLC on the iPhone SoC than the base Apple Silicon SoC.

Pointing to that Ampere One benchmark as any sort of "evidence" that ARM is not suitable for DC is lunacy. I might as well point to Zhaoxin benchmarks as evidence that x86 is unsuitable, because Ampere One and M4 have about as much in common as Zhaoxin and Zen 5.

johnsonwax · Sunday at 4:08 PM

mikegg said:
1. How do we know macOS has better power management? For all we know, macOS has worse power management than Windows and Linux but the SoC carries the OS. How does OS power management lower total SoC power when running ST loads? We seem to be making a ton of assumptions without any proof.

You don't know for sure, but that's just what Apple does. We know that all of their scheduler/power management was reworked just for Apple Silicon, and we know from their engineers that there is an almost obsessive focus in that direction across the company. I mean, they changed their preferred memory management technique when Swift was introduced in 2014 so that they could design more performant silicon around that in A and M series - they forced every developer to do this slightly harder thing in their code because it would allow the system to be faster. It's why I'm skeptical of people demanding a linux OS on Apple Silicon for performance testing because I seriously doubt it would have a lighter footprint. The linux devs aren't even going to be aware of where Apple seeks power management because they don't disclose that.

"How does OS power management lower total SoC power when running ST loads?"

We know that Apple's scheduler puts all of the service threads onto the e cores. They'd been doing that since they started doing asymmetric cores. It's not a traditional scheduler that simply round-robins any given thread depending on which core is open - the OS stuff goes on the e cores even if they're congested so that user applications can have 100% of the P cores all the time. And whenever possible higher compute needs are getting put on ANE or GPU when those are suitable. There really is no lag between when Apple puts silicon out and when their software makes maximal use of it. In-house their adoption curve is a step function - it effectively goes from 0% to 100% the day it ships.

If you look in the linux contributor space, there is a 'payoff' function to adopt something. It's not immediate. They're not necessarily going to ifdef some part of the kernel around a new feature in the 9000 series on day one because there is an opportunity cost in doing so, there is a cost in code stability, and so on. But Apple is in the job of selling Apple Silicon and part of how they do that is to use every inch of it immediately so that when you buy your M4 MBA your reaction is 'good lord this is fast' and so they will take every opportunity for even marginal gains and spend those resources to rewrite that code. And in a lot of cases that new silicon was created in consultation with the OS developers - the said 'hey, we can cut power if that cache is bigger' or whatever, and that gets evaluated against their other priorities and if they need the power savings and have the silicon that gets implemented both in hardware and software at the same time. They aren't doing the 'oh that's marginal and so it's unlikely anyone will use it in their code so let's not burn silicon on it', everyone report to the same guy at Apple - there is a 100% chance it will be implemented in every possible opportunity, even if Apple needs to create a new API and shove developers through it against their will (which they do often). And that's not unique to Apple. The big data center operations all have their own custom designs for servers because they know exactly which marginal improvements will pay off for them which a generic company isn't going to do because they don't know if the 1% improvement is worth the effort, but Meta knows that 1% improvement because they do x, y, z is like $10M a year in operating expenses or something. That is part of why vertical businesses seek verticality.

So I just can't imagine a scenario where Apple's not maximally taking advantage of any silicon benefit they might have. That said, Apple has their other priorities like the overhead of a GUI that you can more easily bypass on linux, but you can bypass it on MacOS if performance testing is what you are trying to do. You can strip MacOS back surprisingly far.

johnsonwax · Sunday at 4:09 PM

Covfefe said:
And ARM servers have been a thing for a while now, so the idea that Apache server is somehow half baked for ARM CPUs doesn't pass the smell test.

I didn't suggest it was half baked for ARM. I suggested it might be half baked for Apple Silicon.

Doug S · Sunday at 4:22 PM

Covfefe said:
Phoronix's M4 benchmarks have several examples where the it loses badly to x86 CPUs.

Apple M4 Mac Mini With macOS vs. Intel / AMD With Ubuntu Linux Performance Review - Phoronix

www.phoronix.com

View attachment 128153

So they tested the base Mac Mini with only 16 GB against a bunch of other stuff that they don't provide links for so you can't see how they were configured as far as RAM, and every one of them has more cores than the M4's 4 P cores. It has 6 E cores but Apple's E cores are at best 1/3 of a P core so charitably you can call it 6 cores total. I'll bet every one of those x86 boxes has 32 GB of RAM or more.

Given how much a web server trying to satisfy that many requests per second is going to depend on caching, RAM is a HUGE difference. That's a terrible comparison. Let's put it up against the Ultra 285K's 8P and 16E cores (and their E cores have targeted performance rather than efficiency unlike Apple so that's equivalent to at least 16 P cores) with only 16GB of RAM and see how it does then. Or put it up against a Mini equipped with an M4 Pro and 64 GB of RAM.

Try again with a benchmark that's not totally rigged against Apple.

johnsonwax · Sunday at 4:29 PM

yottabit said:
Talking about server space, think the most feasible thing Apple could target (with minimal adjustment to their architecture) would be HPC workloads. Either with a high memory bandwidth, all P-Core die for CPU workloads or more probably an MI300 type product which could probably just be an M- Ultra with some extra IO.

I mean people are already building their own clusters of Macs for this even before LLMs took off. I doubt they have aspirations for it but it would be cool to see Apple silicon on the supercomputer charts. Also would get around their hesitancy to sell servers broadly, and be good PR. Tim Cook are you listening??

HPC doesn't make enough money to catch Apple's attention. Apple makes more money off of Watch than AMD makes as an enterprise, and Watch isn't even big enough to break out as a category in their 10-Q.

There are 5 industries large enough to get Apple's attention: healthcare, energy, finance, defense, and transportation. Apple is doing things in 3 of those (and will never touch Defense because it would kill their other business). HPC as a specialized segment mostly died off in the 90s, nobody is catching Nvidia on AI, and what's left to chase Apple spends more money on landscaping than they would be able to get. Apple isn't hesitant to sell servers broadly, they are a consumer facing company not an enterprise facing one. You want to know why HP got sh*tty? Because they made computers for CIOs, not for you. Apple isn't going down that path.

Y'all want HPC because you want a faster lambo to wave around at the neighbors, not because it makes any sense for Apple. I read the 9950 threads - it's bragging rights all the way down. Apple doesn't do that. It's why they market based on how many 8K RAW video streams they can process and won't even tell you how fast the RAM is - because their market doesn't care how fast the RAM is, they care how many video streams they can process. Apple cares more about the Oscars than they do about the supercomputer charts because that's something that consumers look at.

johnsonwax · Sunday at 4:37 PM

poke01 said:
It’s macOS that’s the problem. Linux is truly the best getting the best performance out of any CPU based tasks.

These Apple chips are so powerful that macOS is the limiter in some cases. Take a look at this, this was on M1 testing on bare Linux vs macOS.

Apple M1 Performance On Linux: Benchmarks Better Than Expected For Its Alpha State Review - Phoronix

www.phoronix.com

View attachment 128170
View attachment 128172

100% performance gap from Linux to MacOS on the same hardware? Gonna call some BS on that. Something there we're not being told about.

poke01 · Sunday at 4:45 PM

Covfefe said:
In terms of the Apache benchmark, we can't say with certainty that its all due to MacOS. Will OS make a difference in some benchmarks? undoubtedly. Is it enough to give the M4 a 3x speedup and catch the 9600x? maybe, but probably not. At the very least I don't think its fair to blame MacOS whenever the M4 loses

True. In any case, that Apache benchmark makes no sense?

Why does the 9950X do so bad and yet the 285K does so well? Clearly it’s not testing CPU performance. The 9900X is somehow better. It’s makes no sense.

poke01 · Sunday at 4:52 PM

These Phoronix tests are funny cause Michael could contact Apple for better hardware, he has the platform that Apple would respond. Digital Foundry contacted Apple PR for Mac devices and Apple sent some.

Some of these tests are wild, testing a ***fanless*** M2 against AMD Zen4 mobile. Most of those tests don’t even have ARM optimisations. There is no note or warning. It would be good to knows for each test what kernel the tests were conducted on ie x86 or arm64.

poke01 · Sunday at 4:55 PM

johnsonwax said:
I didn't suggest it was half baked for ARM. I suggested it might be half baked for Apple Silicon.

Apple CPUs conform to ARMv8 standards, if it’s not optimised for an Apple M4 it’s not optimised for a Raspberry Pi 5.

This isn’t the GPU where you need specific seperate architecture optimisations for each GPU architecture from different companies.

johnsonwax · Sunday at 5:11 PM

poke01 said:
Apple CPUs conform to ARMv8 standards, if it’s not optimised for an Apple M4 it’s not optimised for a Raspberry Pi 5.

This isn’t the GPU where you need specific seperate architecture optimisations.

I guess I don't understand what the connection test is therefore doing. Presumably it's not hitting the I/O subsystem at all, not hitting any system calls, not reading data, etc. If so, who cares? It's not testing anything, because how many servers are going to be CPU constrained at that scale? Saying 'this is the fastest web server provided there are no inbound or outbound connections, only synthetic ones, and no reading of data, only what's been shoved in cache'. Like, what even is the point of a web server test that never leaves the CPU core?

poke01 · Sunday at 5:15 PM

johnsonwax said:
I guess I don't understand what the connection test is therefore doing. Presumably it's not hitting the I/O subsystem at all, not hitting any system calls, not reading data, etc. If so, who cares? It's not testing anything, because how many servers are going to be CPU constrained at that scale? Saying 'this is the fastest web server provided there are no inbound or outbound connections, only synthetic ones, and no reading of data, only what's been shoved in cache'. Like, what even is the point of a web server test that never leaves the CPU core?

Ehh, in that Apache benchmark the 9950X is losing to a 9900X. I wouldn’t put too much thought into it. It’s an outlier, leave it at that.

poke01 · Sunday at 5:23 PM

Qualcomm Snapdragon X Elite Benchmarks On Ubuntu Linux vs. AMD vs. Intel - Phoronix

www.phoronix.com

I was having look at other ARM Benchmarks. I found this interesting.

Hope QC improves for V3 of their core.

yottabit · Sunday at 5:46 PM

johnsonwax said:
HPC doesn't make enough money to catch Apple's attention. Apple makes more money off of Watch than AMD makes as an enterprise, and Watch isn't even big enough to break out as a category in their 10-Q.

There are 5 industries large enough to get Apple's attention: healthcare, energy, finance, defense, and transportation. Apple is doing things in 3 of those (and will never touch Defense because it would kill their other business). HPC as a specialized segment mostly died off in the 90s, nobody is catching Nvidia on AI, and what's left to chase Apple spends more money on landscaping than they would be able to get. Apple isn't hesitant to sell servers broadly, they are a consumer facing company not an enterprise facing one. You want to know why HP got sh*tty? Because they made computers for CIOs, not for you. Apple isn't going down that path.

Y'all want HPC because you want a faster lambo to wave around at the neighbors, not because it makes any sense for Apple. I read the 9950 threads - it's bragging rights all the way down. Apple doesn't do that. It's why they market based on how many 8K RAW video streams they can process and won't even tell you how fast the RAM is - because their market doesn't care how fast the RAM is, they care how many video streams they can process. Apple cares more about the Oscars than they do about the supercomputer charts because that's something that consumers look at.

I mean you’re probably not wrong, but it does make me sad.

This is a technology enthusiast forum so don’t be surprised many of us are… enthused by technology even if there isn’t a strong business case behind it.

An Apple silicon supercomputer would be purely for bragging rights and PR

Covfefe · Sunday at 5:49 PM

poke01 said:
True. In any case, that Apache benchmark makes no sense?

Why does the 9950X do so bad and yet the 285K does so well? Clearly it’s not testing CPU performance. The 9900X is somehow better. It’s makes no sense.

I don't know what causes the 9900x oddity, but it arises in several of Phoronix's benchmarks. See their Postgres benchmarks here. https://www.phoronix.com/review/amd-ryzen-9950x-9900x/8

Other than that, the benchmark is just super cache dependent. L2 and L3 size and bandwidth are key. The cross CCX latency is an issue. The 3D cache also helps here, which is rare outside of games.

poke01 said:
These Phoronix tests are funny cause Michael could contact Apple for better hardware, he has the platform that Apple would respond. Digital Foundry contacted Apple PR for Mac devices and Apple sent some.

Some of these tests are wild, testing a ***fanless*** M2 against AMD Zen4 mobile. Most of those tests don’t even have ARM optimisations. There is no note or warning. It would be good to knows for each test what kernel the tests were conducted on ie x86 or arm64.

Phoenix's articles are definitely a mess. So many benchmarks with so little explanation. Sometimes I wonder if Michael Larabee even understands what they all mean.

johnsonwax · Sunday at 6:12 PM

poke01 said:
Ehh, in that Apache benchmark the 9950X is losing to a 9900X. I wouldn’t put too much thought into it. It’s an outlier, leave it at that.

I don't understand the point of the test in the first place? Just because someone puked out a bunch of code doesn't mean it's a useful benchmark. What is that test informing us of? There is no universe where a single CPU is spinning up 500 threads, allocating memory, blocking for disk and network I/O, and then retiring memory and threads to produce 250,000 connections per second. What's the point of a web server test that never leaves the CPU? Hello, it's a web server. To my eye that looks like a very tight performance loop that is going to be extremely compiler dependent and thread allocation/retirement on that scale is itself not a neutral thing. Is it even talking to the scheduler, because in a lot of web server use cases you have a lower QoS for I/O threads because you're blocking constantly, so you want to give priority to whatever is constructing the page or querying the database which are running in their own processes and then handing the result back to be sent out so that connection can be closed, freeing up another connection - there's no point creating a new thread if you are unable to retire your existing ones so they get lowest priority. But on Apple Silicon QoS tells you what core to run on. Did the Apache test run entirely on e cores because normally those connection threads run on e cores. That's literally why the e cores exist so you can put all of your I/O blocking garbage somewhere that it won't pollute the p core.

We also know that the Mac has a scheduler that you can say 'run this on an e core even if the p core is idle' that normally you don't find on ARM. ARM may prefer the e cores with a low QoS but it'll happily move them to a p core. Apples does allow you to say 'never put this on a p core'. That's why I don't understand what this thing is testing.

poke01 · Sunday at 6:33 PM

johnsonwax said:
I don't understand the point of the test in the first place? Just because someone puked out a bunch of code doesn't mean it's a useful benchmark. What is that test informing us of? There is no universe where a single CPU is spinning up 500 threads, allocating memory, blocking for disk and network I/O, and then retiring memory and threads to produce 250,000 connections per second. What's the point of a web server test that never leaves the CPU? Hello, it's a web server. To my eye that looks like a very tight performance loop that is going to be extremely compiler dependent and thread allocation/retirement on that scale is itself not a neutral thing. Is it even talking to the scheduler, because in a lot of web server use cases you have a lower QoS for I/O threads because you're blocking constantly, so you want to give priority to whatever is constructing the page or querying the database which are running in their own processes and then handing the result back to be sent out so that connection can be closed, freeing up another connection - there's no point creating a new thread if you are unable to retire your existing ones so they get lowest priority. But on Apple Silicon QoS tells you what core to run on. Did the Apache test run entirely on e cores because normally those connection threads run on e cores. That's literally why the e cores exist so you can put all of your I/O blocking garbage somewhere that it won't pollute the p core.

We also know that the Mac has a scheduler that you can say 'run this on an e core even if the p core is idle' that normally you don't find on ARM. ARM may prefer the e cores with a low QoS but it'll happily move them to a p core. Apples does allow you to say 'never put this on a p core'. That's why I don't understand what this thing is testing.

This one is interesting too.

The M4 Mac mini remains ahead here because it’s a single threaded test but Zen5 loses against Arrow Lake here. The X3D loses to the non-X3D model.

Can anyone here make sense of this?

johnsonwax · Sunday at 6:43 PM

yottabit said:
An Apple silicon supercomputer would be purely for bragging rights and PR

Yeah, Apple doesn't play that - they got here because they are disciplined. Keep in mind, AMDs entire server business would be considered a large hobby/accessory segment to Apple. HPC even less than that. Apple's markets are enormous. And one of the problems they kind of have is that most remaining bits of tech are either change on the floor or really hard markets for Apple to enter. Like, gaming PCs/consoles are technologically in reach but out of reach in terms of software, and it's a BIG lift to fix that.

Now, if it unlocked some bigger market, they'd do it, but I don't see where it does. Apple whole car instrument gauge thing seems really out of character, but it's there to hook iPhone demand. It's a hook into that transportation market which is pretty huge. ApplePay is a similar foot in the door to finance that hangs off of wearables and iPhone. What door does HPC open?

johnsonwax · Sunday at 6:59 PM

poke01 said:
This one is interesting too.
View attachment 128200
The M4 Mac mini remains ahead here because it’s a single threaded test but Zen5 loses against Arrow Lake here. The X3D loses to the non-X3D model.

Can anyone here make sense of this?

FLAC is just a big LPC model, so would favor any specialized linear algebra compute. Is it leaning on AMX/SME/etc? I'd imagine a fair bit of that would come down to whether the data is well suited to the scale of the unit - how vector units fall off a cliff once you exceed their vector size. Under that size they're fantastic and over that size they're garbage. It's not like FLAC was designed with encode compute details in mind, the whole point of it is that you're going to pay dearly up front (as in literally who cares how long it takes) but it'll be fast on playback because that's where your real-time compute budget is, so how suitable it is to a given compute arrangement is kind of arbitrary. Maybe Apple just got lucky with FLAC and Ryzen AI super unlucky. With a good compiler it could really be making hay with SME.

Covfefe · Sunday at 7:06 PM

Doug S said:
So they tested the base Mac Mini with only 16 GB against a bunch of other stuff that they don't provide links for so you can't see how they were configured as far as RAM, and every one of them has more cores than the M4's 4 P cores. It has 6 E cores but Apple's E cores are at best 1/3 of a P core so charitably you can call it 6 cores total. I'll bet every one of those x86 boxes has 32 GB of RAM or more.

Given how much a web server trying to satisfy that many requests per second is going to depend on caching, RAM is a HUGE difference. That's a terrible comparison. Let's put it up against the Ultra 285K's 8P and 16E cores (and their E cores have targeted performance rather than efficiency unlike Apple so that's equivalent to at least 16 P cores) with only 16GB of RAM and see how it does then. Or put it up against a Mini equipped with an M4 Pro and 64 GB of RAM.

Try again with a benchmark that's not totally rigged against Apple.

The system configurations are on page 2. The other CPUs have 32GB and 64GB, so RAM could be part of it.

There are other x86 CPUs in that test besides the 285k. The 9600X is a 6 core CPU which is pretty close in core counts to the M4. Hope that helps.

I would love to see some benchmarks with like to like configurations. Unfortunately practically no one does server benchmarks on Apple CPUs. The Qualcomm Phoronix article that poke01 shared might be close to what you're looking for. It has plenty of examples of an ARM CPU losing badly to x86 in sever benchmarks.

poke01 · Sunday at 7:22 PM

johnsonwax said:
Is it leaning on AMX/SME/etc?

not supported by the benchmark so no matmul acceleration

OneEng2 · Sunday at 9:56 PM

Doug S said:
"Designed for server" vs "designed for mobile" is just as meaningless in the days of billion transistor chips as decoding complex x86 instructions vs decoding simple AArch64 instructions. It mattered when there was only room for one or two big cores on a die, but now that there are dozens there's almost no difference in the P core you'd design for mobile vs the P core you'd design for server as far as the CPU structures itself.

Agree to disagree. I think there is a great deal of difference between designing for mobile and designing for DC.

poke01 said:
This one is interesting too.
View attachment 128200
The M4 Mac mini remains ahead here because it’s a single threaded test but Zen5 loses against Arrow Lake here. The X3D loses to the non-X3D model.

Can anyone here make sense of this?

Memory Bandwidth?

S'renne · Sunday at 10:32 PM

Could someone check if the MacOS scheduler affects the overall benchmarks results or not? As some of the developer documentations seems to state that unless explicitly allocated, most tasks would preferably be E cores only rather than using the P cores for burst acceleration

Thibsie · Sunday at 10:37 PM

This is a bit OT but please note that FLACCL exists which is a Flax version encoder for GPUs and is wayyy fatser.

Geddagod · Sunday at 10:55 PM

poke01 said:
Let’s do a fun exercise. Let’s create a “Zen5 CCD” sorta chiplet, what would be its die area. Include the full caches etc needed for a hypothetical M4 16 P core CPU for the mm2 calculation.

We often hear that M4 P core is too big for DC, so it would be interesting to see people’s perspectives.

Credit for die shot is Tech insights.
View attachment 128179

These are my numbers. Any obvious mistakes, I would be happy to have them pointed out. The measurements were done a while ago, so any specific questions about methodology I would have to go and double check lol.

Doug S said:
BS.

"Designed for server" vs "designed for mobile" is just as meaningless in the days of billion transistor chips as decoding complex x86 instructions vs decoding simple AArch64 instructions. It mattered when there was only room for one or two big cores on a die, but now that there are dozens there's almost no difference in the P core you'd design for mobile vs the P core you'd design for server as far as the CPU structures itself.

You might make difference choices in the transistor types with FinFlex, but even that I'm kinda skeptical about - because today power matters almost as much in DC chips as phone chips, since the allowable per core power draw has been falling for 20 years - ever since the first dual core CPUs were sold. The per core power draw of DC chips isn't as low as it is in mobile, but it is a lot closer than it was a decade ago, and is going to keep getting closer unless Intel/AMD decide to start designing DC chips that require liquid cooling like IBM mainframes.

The differences when designing for the DC come in the uncore, stuff like the number of memory channels, support for ECC/chipkill on those memory channels, the need for a "bigger" fabric to connect many more CPU cores together, more PCIe links, that sort of thing. Ideally you want a really fat last level cache to support the much larger memory a server will have than a phone, but mobile CPUs ALSO like pretty fat caches for a different reason - it lets them power off the power hungry memory controllers. I remember reading a while back that memory controllers are responsible for almost all the power draw of DRAM. The DRAM itself - charging capacitors, bitsense amplifiers, row/column selectors etc. are a low single digit percentage of the overall DRAM power budget! So being able to power off those memory controllers, even momentarily, is a huge win. That's why Apple has a much bigger SLC on the iPhone SoC than the base Apple Silicon SoC.

Pointing to that Ampere One benchmark as any sort of "evidence" that ARM is not suitable for DC is lunacy. I might as well point to Zhaoxin benchmarks as evidence that x86 is unsuitable, because Ampere One and M4 have about as much in common as Zhaoxin and Zen 5.

I think there's 3 main ways where Apple silicon could be undesirable for server.

No avx-512 or IIRC not even 256 bit vector units? Seems to be pretty important considering even with Zen 5C, AMD is keeping the massive full width 512 bit FPU implementation even though they would undoubtedly be able to save a good bit of area not doing so.
Lack of SMT. Something which Intel admits is a mistake and is reversing course in, for server at least.
Cache hierarchy that doesn't offer a bunch of cache capacity when all the cores are loaded up, and seems to depend on a shit on of memory bandwidth- something that is hard to scale up in server. An interesting rumor I saw on reddit regarding Qcomms rumored server parts was a 80 core oryon server part that was planned- that had 16 channels of ddr5. Which seems like complete overkill for only 80 cores, but may be needed to match the memory bandwidth per core that their client parts have.

mikegg said:
What is the actual SoC power? I'm not convinced that Zen5 is only 30% away from M4. How does David Huang measure power? Is it through the wall doing load - idle for both Zen5 and M4?

Cinebench ST perf/watt:

M4 Pro: 9.52 pts/W
Strix Halo 395: 2.62 pts/W

3.6x better ST perf/watt is closer to the real world experience of using a Zen5 laptop vs an M4 laptop.

Software based power readings.

Question x86 and ARM architectures comparison thread.

Senior member

Diamond Member

Senior member

Senior member

Diamond Member

Senior member

Senior member

Diamond Member

Diamond Member

Diamond Member

Senior member

Diamond Member

Diamond Member

Golden Member

Junior Member

Senior member

Diamond Member

Senior member

Senior member

Junior Member

Diamond Member

Senior member

Member

Golden Member

Golden Member