Question: x86 and ARM architectures comparison thread.


poke01

Diamond Member
Mar 8, 2022
3,893
5,221
106
Could someone check whether the macOS scheduler affects the overall benchmark results or not? Some of the developer documentation seems to state that unless explicitly allocated, most tasks would preferentially run on the E cores only rather than using the P cores for burst acceleration.

Some of the docs S'renne may be referring to.



I use this as a guide for powermetrics; hopefully it will be of use to anyone in the future.
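For anyone who wants to sanity-check which cluster a benchmark actually lands on, this is roughly what I run. Just a sketch from my memory of the powermetrics man page, so double-check the flags locally; the interval and sample counts are arbitrary:

# Cluster-level power and frequency, sampled once a second for 10 seconds while the
# benchmark runs. On Apple Silicon the output is broken down per E-cluster and P-cluster,
# so you can see which cluster is doing the work.
sudo powermetrics --samplers cpu_power -i 1000 -n 10

# Per-process CPU usage over the same kind of window, to confirm the benchmark process
# itself is what's being measured.
sudo powermetrics --samplers tasks -i 1000 -n 10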
 
  • Like
Reactions: S'renne

S'renne

Member
Oct 30, 2022
146
108
86

Some of the docs S'renne may be referring to.



I use this as a guide for powermetrics; hopefully it will be of use to anyone in the future.
These also suggest the default state would be E cores only unless the ports are specifically built with Apple Silicon as the target and not ARM in general, unless I'm misreading something there.
 

johnsonwax

Senior member
Jun 27, 2024
261
421
96
These also suggest the default state would be E cores only unless the ports are specifically built with Apple Silicon as the target and not ARM in general, unless I'm misreading something there.
That document is for Finder-initiated tasks, and my understanding is those tasks default to the E cores and only promote to P if there is user interaction (mouse, keyboard input, etc.). I know all processes run from cron that don't specify a QoS will run exclusively on E because they inherit cron's QoS - they will never promote to P. So what's Terminal's QoS? I'm guessing it's E only unless you're launching a GUI app to receive interaction.

But if you're benchmarking, you're likely running it from a bash or Python script so you can record the time. Setting nice doesn't impact what core it's on, so the person doing the benchmarks would need to know to call taskpolicy, which sets the QoS as documented there.
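To make that concrete, here's a rough sketch of the relevant invocations - flag spellings are from my memory of the taskpolicy man page, so verify with man taskpolicy, and ./mybench and the PID are just placeholders:

# Launch the benchmark clamped to background QoS, i.e. E cores only - the state a
# benchmark can silently inherit if its parent (cron, a low-QoS script) already has it.
taskpolicy -b ./mybench

# Lift the background policy from an already-running process by PID.
taskpolicy -B -p 12345

# For contrast: nice only changes priority within whatever QoS band the process is in,
# it does not move the process between the E and P clusters.
nice -n 10 ./mybench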
 
  • Like
Reactions: S'renne

Doug S

Diamond Member
Feb 8, 2020
3,362
5,902
136
I think there are 3 main ways where Apple Silicon could be undesirable for servers.
  • No AVX-512, or IIRC not even 256-bit vector units? Seems to be pretty important considering that even with Zen 5c, AMD is keeping the massive full-width 512-bit FPU implementation even though they would undoubtedly be able to save a good bit of area by not doing so.
  • Lack of SMT, something which Intel admits was a mistake and is reversing course on, for server at least.
  • A cache hierarchy that doesn't offer a bunch of cache capacity when all the cores are loaded up, and that seems to depend on a shit ton of memory bandwidth - something that is hard to scale up in servers. An interesting rumor I saw on Reddit regarding Qualcomm's rumored server parts was an 80-core Oryon server part that was planned, with 16 channels of DDR5. That seems like complete overkill for only 80 cores, but may be needed to match the memory bandwidth per core that their client parts have.

That's not a list of ways Apple Silicon, or ARM, is "undesirable for server". That's a list of ways their current implementation may be. Obviously they aren't designing cores with features that they feel are only useful on servers when they don't make servers. It isn't as if adding SVE2 (or beefing up their SSVE implementation) would be that hard if they decided to target that. There's a lot of server loads that don't use much SIMD, and most server loads are almost exclusively INT. If Apple wanted to offer a much bigger cache to support chips that have many more cores they could do what AMD has done and simply stack a cache chip. TSMC is researching putting eDRAM in the BEOL; I'm not sure how far along that is, but if they make it work then even phones will have cache sizes that would embarrass AMD's X3D stuff.

I'm skeptical that SMT is any sort of a "must have". The average performance gain with SMT is less than half the performance of one of Apple's E cores. I think a P/E mix may work better - and I think Intel's designers probably agreed but due to their precarious state they're focusing on a "unified core" so they have little choice but to go back to SMT to have it do the work that small cores could be doing.

Sure, you can say "well their current implementation is all we have to go on", but given that Apple is designing custom chips for their own use we may see some DC features appear in the CPU cores on A19/M5 or A20/M6, just like we could see the hints of what Apple was doing to create Apple Silicon in their A-series SoCs (e.g. supporting more physical/virtual address bits that were never going to be needed on a phone, stuff like that) before Apple Silicon Macs were released.
 
  • Like
Reactions: mikegg and DavidC1

S'renne

Member
Oct 30, 2022
146
108
86
That's not a list of ways Apple Silicon, or ARM, is "undesirable for server". That's a list of ways their current implementation may be. Obviously they aren't designing cores with features that they feel are only useful on servers when they don't make servers. It isn't as if adding SVE2 (or beefing up their SSVE implementation) would be that hard if they decided to target that. There's a lot of server loads that don't use much SIMD, and most server loads are almost exclusively INT. If Apple wanted to offer a much bigger cache to support chips that have many more cores they could do what AMD has done and simply stack a cache chip. TSMC is researching putting eDRAM in the BEOL; I'm not sure how far along that is, but if they make it work then even phones will have cache sizes that would embarrass AMD's X3D stuff.

I'm skeptical that SMT is any sort of a "must have". The average performance gain with SMT is less than half the performance of one of Apple's E cores. I think a P/E mix may work better - and I think Intel's designers probably agreed but due to their precarious state they're focusing on a "unified core" so they have little choice but to go back to SMT to have it do the work that small cores could be doing.

Sure, you can say "well their current implementation is all we have to go on", but given that Apple is designing custom chips for their own use we may see some DC features appear in the CPU cores on A19/M5 or A20/M6, just like we could see the hints of what Apple was doing to create Apple Silicon in their A-series SoCs (e.g. supporting more physical/virtual address bits that were never going to be needed on a phone, stuff like that) before Apple Silicon Macs were released.
Would Apple even need to stack cache? With how they're sharing their L2 across each core cluster, couldn't they just increase it instead?
 

Geddagod

Golden Member
Dec 28, 2021
1,440
1,551
106
That's not a list of ways Apple Silicon, or ARM, is "undesirable for server". That's a list of ways their current implementation may be
I don't think anyone in this thread is arguing that the ARM architecture is inherently worse for server. All I've seen is discussion about current implementations.
Obviously they aren't designing cores with features that they feel are only useful on servers when they don't make servers
And yet you also insisted that there was no difference between "designed for server" vs "designed for mobile"? I agree with you on most of the points you presented in that post, and yet...
It isn't as if adding SVE2 (or beefing up their SSVE implementation) would be that hard if they decided to target that.
Perhaps. Intel faced issues when they first added AVX-512 IIRC, but that's also Intel lol.
The problem I see is that doing so would also increase core area, and I would imagine by a good bit too. ARM's classic cores might be able to tank this and retain good PPA, but Apple's cores would get even chonkier than they already are. Power probably won't benefit either, even with great power gating, in cases where the wider vector units aren't used, though I also don't think it would be hurt too much.
There's a lot of server loads that don't use much SIMD, and most server loads are almost exclusively INT.
Despite that, Intel is beefing up their E-cores with wider vector units, and AMD is retaining full width AVX-512 on their cores that are custom designed to compete in non-HPC markets.
Clearly it's still pretty important.
If Apple wanted to offer a much bigger cache to support chips that have many more cores they could do what AMD has done and simply stack a cache chip
This seems like it will increase cost dramatically.
Also a good bit of Apple's area efficiency advantage rests on the unique cache hierarchy they have employed.
I'm skeptical that SMT is any sort of a "must have". The average performance gain with SMT is less than half the performance of one of Apple's E cores. I think a P/E mix may work better - and I think Intel's designers probably agreed but due to their precarious state they're focusing on a "unified core" so they have little choice but to go back to SMT to have it do the work that small cores could be doing.
For client I agree, but for DC I doubt it. No one is doing heterogeneous designs in DC, and I think it's because no one wants to deal with it.
There's also the question of how much of a benefit Apple adding SMT to their cores would bring vs AMD. AFAIK, longer pipelines benefit more from SMT, so AMD may see larger gains in nT perf with one core than Apple does.
 
  • Like
Reactions: Tlh97 and OneEng2

Gideon

Platinum Member
Nov 27, 2007
2,022
5,006
136
I'm skeptical that SMT is any sort of a "must have". The average performance gain with SMT is less than half the performance of one of Apple's E cores. I think a P/E mix may work better - and I think Intel's designers probably agreed but due to their precarious state they're focusing on a "unified core" so they have little choice but to go back to SMT to have it do the work that small cores could be doing.
According to people that have actually ported Linux to ARM, like Jon Masters (who was the lead for ARM servers at Red Hat for a decade and later worked at Nuvia and now Google), there is a reason why every single vendor avoids hybrid cores there. I'll try to find his articles/tweets when I get the time (failed in a quick search), but he has been really vocal about how hybrid cores, while excellent for client, are absolutely terrible for servers.

And he's a total Apple fanboy btw, so it's not that.

EDIT:
OK, here I found one instance where he talks about it:
cJoKEC4.png

"Who is this guy anyway?"
In 2019:


EE Times said:
The OS ports are just the tip of a software iceberg, warns Jon Masters. He has spent the last nine years working on a standard version of Red Hat Linux for Arm servers. So far, just two commercial systems have been announced as certified to run it.

The process involved identifying a set of low-level hardware primitives that are assumed in the x86 world for things like how interrupts and power states work. Those de facto standards were articulated in a 50+ page document, then applied to the Arm architecture.

A separate effort documented x86 boot standards across Linux and Windows and applied them to Arm cores. They included obscure but crucial details of BIOS, power, and multiprocessing functions.
 
Last edited:

poke01

Diamond Member
Mar 8, 2022
3,893
5,221
106
Regarding the talk of cache, David Huang tested the M4 Pro cache system.

He notes some interesting details.
IMG_2359.jpeg
IMG_2360.png

This reddit thread goes into further detail.

This comment from u/b3081a stood out to me.

“ Apple's L2 cache works like virtual cache in a lot of ways.

In a single cluster, the L2 latency isn't uniform in different sizes even when we exclude the TLB overhead, so a slice of the L2 (~2-3MB) is faster to each core, making the other slices look more like L3 cache to a single core. This has been the case since a long time ago, perhaps since they first began building multi-core processors.

In M3 max or M4 pro/max, this got extended to multiple clusters, L2 cache from neighbor clusters could be accessed with an even higher latency, and the 16 MB cache in the other P cluster looks more like L4 cache from a single core perspective.

It's actually a quite clever design that balances between single thread performance, multi thread performance and design complexity quite well.”

The only other architecture that does this is IBM's Telum; I still remember that AnandTech article and discussion regarding it.
 

OneEng2

Senior member
Sep 19, 2022
730
978
106
These are my numbers. Any obvious mistakes, I would be happy to have them pointed out. The measurements were done a while ago, so any specific questions about methodology I would have to go and double check lol.
View attachment 128207
View attachment 128208

I think there are 3 main ways where Apple Silicon could be undesirable for servers.
  • No AVX-512, or IIRC not even 256-bit vector units? Seems to be pretty important considering that even with Zen 5c, AMD is keeping the massive full-width 512-bit FPU implementation even though they would undoubtedly be able to save a good bit of area by not doing so.
  • Lack of SMT, something which Intel admits was a mistake and is reversing course on, for server at least.
  • A cache hierarchy that doesn't offer a bunch of cache capacity when all the cores are loaded up, and that seems to depend on a shit ton of memory bandwidth - something that is hard to scale up in servers. An interesting rumor I saw on Reddit regarding Qualcomm's rumored server parts was an 80-core Oryon server part that was planned, with 16 channels of DDR5. That seems like complete overkill for only 80 cores, but may be needed to match the memory bandwidth per core that their client parts have.

Software based power readings.
I would love to have been able to "like" this about 10 times! Thanks! That is great work and a really nice comparison chart!
That's not a list of ways Apple Silicon, or ARM, is "undesirable for server". That's a list of ways their current implementation may be. Obviously they aren't designing cores with features that they feel are only useful on servers when they don't make servers. It isn't as if adding SVE2 (or beefing up their SSVE implementation) would be that hard if they decided to target that. There's a lot of server loads that don't use much SIMD, and most server loads are almost exclusively INT. If Apple wanted to offer a much bigger cache to support chips that have many more cores they could do what AMD has done and simply stack a cache chip. TSMC is researching putting eDRAM in the BEOL; I'm not sure how far along that is, but if they make it work then even phones will have cache sizes that would embarrass AMD's X3D stuff.
It is a very weak argument.

A Honda Civic can't outrun a dragster, but that's only because it doesn't have .....

Conversely, a Dragster can't be driven on the road legally, but that's only because it doesn't have ....

In the real world of engineering, the fact is you can't make a design that does EVERYTHING better than everyone else's design, because you can't get something for nothing.
I'm skeptical that SMT is any sort of a "must have". The average performance gain with SMT is less than half the performance of one of Apple's E cores. I think a P/E mix may work better - and I think Intel's designers probably agreed but due to their precarious state they're focusing on a "unified core" so they have little choice but to go back to SMT to have it do the work that small cores could be doing.
SMT is simply a fantastic use of die space to raise MT performance. It came out of the implementation of superscalar designs where resources were frequently idled. SMT gave them something useful to do. It's simply a really good idea and only adds 5% die space to the core for 40% MT performance.

Intel is doing exactly what you are saying. Why have SMT and all its complexities when you can just have another full core? The answer? Another full core costs a LOT more than 5% die space for the core. It costs 100%. Intel is putting SMT back. It's no wonder though.
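Putting rough numbers on that (these are just the figures from the post above, not measurements), the marginal throughput per unit of added die area is what makes SMT look so good:

# SMT: +40% MT throughput for +5% core area
echo "scale=2; 0.40 / 0.05" | bc   # ~8 units of extra throughput per unit of extra area

# A second full core: +100% throughput for +100% area
echo "scale=2; 1.00 / 1.00" | bc   # 1 unit of extra throughput per unit of extra area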
 
  • Like
Reactions: booklib28 and Tlh97

511

Diamond Member
Jul 12, 2024
3,274
3,200
106
Intel is doing exactly what you are saying. Why have SMT and all its complexities when you can just have another full core? The answer? Another full core costs a LOT more than 5% die space for the core. It costs 100%. Intel is putting SMT back. It's no wonder though.
Intel is putting in E cores, which are way more area efficient for scaling MT in their case.
 

AMDK11

Senior member
Jul 15, 2019
472
401
136
Does anyone have access to a table of IPC results from FlopsCPU (x87, SSE, SSE2, AVX, AVX2, FP, and SMT) for various x86 microarchitectures? Last time I saw it, Zen3/4 and Raptor/GoldenCove were added. Maybe LionCove and Zen5 are also included. I can't find it anywhere.
 

511

Diamond Member
Jul 12, 2024
3,274
3,200
106
So you think CWF will have a chance at besting Venice D? Both will likely be on N2.
Nope, it will beat Turin Dense in integer workloads by a good margin, though it will fall quite short of Turin Dense, but Turin Dense is up to Rouge River Forest.
 

Geddagod

Golden Member
Dec 28, 2021
1,440
1,551
106
Regarding the talk of cache, David Huang tested the M4 Pro cache system.
Latency picture for anyone who was wondering about that, from the same source
1754331407380.png
The only other architecture that does this is IBM's Telum; I still remember that AnandTech article and discussion regarding it.
IIRC Qualcomm does something similar too, prob because it's the "same" team.
I would love to have been able to "like" this about 10 times! Thanks! That is great work and a really nice comparison chart!
<3
Intel is putting in E cores, which are way more area efficient for scaling MT in their case.
How does the math work out for this one?
So you think CWF will have a chance at besting Venice D? Both will likely be on N2.
Intel confirmed that CWF would have 18A compute tiles and Intel 3 base tiles. Idk if server will ever go external tbh.
Nope, it will beat Turin Dense in integer workloads by a good margin, though it will fall quite short of Turin Dense, but Turin Dense is up to Rouge River Forest.
I'm assuming for the latter half of that statement you mean it will fall quite short of Venice Dense?
TBH, I think Turin Dense and CLF are going to be pretty close. CLF has a ~50% core count advantage, but I doubt CLF has a higher all core turbo frequency, Turin Dense should have higher perf/core, and also benefits from SMT.
I also think CLF will be competing with Venice Dense for most of its lifespan. I doubt Rouge River Forest, if it is going to come out, comes till late 2027/early 2028 at best, and I would imagine that's around the time, or at most ~1/2 a year earlier, than when Zen 7 server parts come out.
 
  • Like
Reactions: Tlh97

Jan Olšan

Senior member
Jan 12, 2017
549
1,089
136
I'm skeptical that SMT is any sort of a "must have".

Very much agreed; it's just a tool. Calculating that by dropping SMT you can make a core smaller and stuff more of them on the chip to compensate is likely a very valid approach.

I think a P/E mix may work better - and I think Intel's designers probably agreed but due to their precarious state they're focusing on a "unified core" so they have little choice but to go back to SMT to have it do the work that small cores could be doing.
If you mean that there will be a mix of P and E on a server chip, or worse, that instead of SMT you would basically always have a P-Core and E-Core companion providing the second thread, I think that would be the most cursed and hated thing in the field.

Multiple architectures can have their place in servers, but it's likely only so if they are separate products, like Intel does it now.
Possible exception: say somebody like Amazon designs a chip for their own cloud stack. Then a processor could have, say, 128 P-Cores that the customers will see and use, and separately 32 E-Cores that would not be rented and not be visible from the cloud side; they would be reserved for invisible Amazon housekeeping, storage/network access processing, secure enclave functionality, and other shenanigans.
 

johnsonwax

Senior member
Jun 27, 2024
261
421
96
According to people that have actually ported Linux to ARM, like Jon Masters (who was the lead for ARM servers at Red Hat for a decade and later worked at Nuvia and now Google), there is a reason why every single vendor avoids hybrid cores there. I'll try to find his articles/tweets when I get the time (failed in a quick search), but he has been really vocal about how hybrid cores, while excellent for client, are absolutely terrible for servers.

And he's a total Apple fanboy btw, so it's not that.

EDIT:
OK, here I found one instance where he talks about it:
cJoKEC4.png
I mean, as noted above, to Apple the distinction between P cores and E cores is loosely user interaction vs. non-user interaction, which adds another wrinkle to the 'what is the point of this benchmark' question if its in-the-wild execution on the Mac would be on an E core; if you're batch encoding FLAC files, why not do that in the background while you're playing Minecraft or whatever. None of that makes any sense on a server, though.

But I wonder if the server silicon space isn't as diverse as it should be to ask where someone like Apple slots in. Just thinking of the server applications I've personally been involved in, they range from highly parallel HPC, to high-thread/low-compute/I/O-bound, to balanced compute with different I/O bounds, etc., and apart from the AI space, which is pretty specialized, the rest of the market is too small to make really specialized designs to fit the various needs, so we throw a fairly generic blanket over all of it to cover up how bumpy everything below it is. The Mac Pro kind of is such a thing, in that it's designed for high-volume video and audio processing that takes a LOT of local data and turns it into less data very frequently, and is therefore not well suited to the datacenter (because bandwidth is a very limiting factor) but is well suited to custom silicon. Things like Bitcoin went to custom silicon as well and took out one chunk of that market.

It seems to me a lot of Masters' statements there amount to 'we don't know our use case, so we need to be as generic and safe as possible', and that's also largely true for AMD/Intel, because something like a server hosting web pages would probably run really well with a f-ton of E cores that can spin up an absurd number of threads/watt and will block on I/O most of the time. But is that market large enough to build that product, or for AWS to turn it into a tier, etc.? Probably not, so it doesn't exist. What we get is more-or-less detuned P cores that better fit that profile, but are a bit slower for things that are non-parallelizable, like database queries.

Essentially DC compute is the RMS of the overall compute profile, and as some application in that market gets big enough, a more dedicated compute profile pops out of it - AI, GPU, etc. So would Apple be good for DC? Probably - but which part of it? The whole point of Apple Silicon is that it's NOT the RMS of desktop compute - it's a better-tuned corner of desktop, because Apple knows what to tune it to and can steer customers to that corner. We shouldn't expect Apple Silicon to be good for the RMS of server compute for the same reason - what corner do you want to tune it to? And is that a market Apple would care about? (No.)

Even looking at the rumors around Apple Silicon for their AI servers, the primary motivation for that to me isn't that it's more performant, it's code reuse and security, because the point is to offload device compute without translation - compute that is already built around their silicon and can leverage all of the existing E2E stuff already built into Apple Silicon. It's not performance - they're designing new silicon to close that gap up as much as possible, but they must know it'll be miles from a dedicated Nvidia product. So it becomes this tuned little corner of the space.
 

511

Diamond Member
Jul 12, 2024
3,274
3,200
106
I'm assuming for the latter half of that statement you mean it will fall quite short of Venice Dense?
Yes lol it's confusing sometimes
TBH, I think Turin Dense and CLF are going to be pretty close. CLF has a ~50% core count advantage, but I doubt CLF has a higher all core turbo frequency, Turin Dense should have higher perf/core, and also benefits from SMT.
I also think CLF will be competing with Venice Dense for most of its lifespan. I doubt Rouge River Forest, if it is going to come out, comes till late 2027/early 2028 at best, and I would imagine that's around the time, or at most ~1/2 a year earlier, than when Zen 7 server parts come out.
For this I have a theory: a 144C SRF part scores 710 in SPEC2017, so a dual socket is roughly about 1400, while Turin Dense is roughly a 1500-1600 score.
So with the same core count, plus a full node jump, plus Skymont's insane IPC improvement over Crestmont (and Darkmont adds another 3-5% improvement), I don't see why Clearwater Forest wouldn't hit a ~2000 SPEC score.

Crestmont (155H) -> 1.46
Skymont (255H) -> 1.88
1.88 / 1.46 = 1.28, x1.05 (Darkmont improvement) = ~1.35x INT IPC vs SRF
Plus a 15% baseline from the node (I think the gain from the node would be more, considering Intel's obsession with HP libs, and Intel 3 HD vs 18A HP is 15% PPW).
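Spelling that arithmetic out as something anyone can rerun with their own assumptions (the 1.05 Darkmont factor and the 15% node gain are the assumptions from the post above, not measured numbers):

# Skymont-over-Crestmont INT IPC from the 255H/155H scores, times the assumed Darkmont bump
echo "scale=4; (1.88 / 1.46) * 1.05" | bc          # ~1.35x vs SRF's cores

# ...and with the assumed ~15% from the node on top
echo "scale=4; (1.88 / 1.46) * 1.05 * 1.15" | bc   # ~1.55x per core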
 
Last edited:

johnsonwax

Senior member
Jun 27, 2024
261
421
96
If you mean that there will be a mix of P and E on a server chip, or worse, that instead of SMT you would basically always have a P-Core and E-Core companion providing the second thread, I think that would be the most cursed and hated thing in the field.

Multiple architectures can have their place in servers, but it's likely only so if they are separate products, like Intel does it now.
Possible exception: say somebody like Amazon designs a chip for their own cloud stack. Then a processor could have, say, 128 P-Cores that the customers will see and use, and separately 32 E-Cores that would not be rented and not be visible from the cloud side; they would be reserved for invisible Amazon housekeeping, storage/network access processing, secure enclave functionality, and other shenanigans.
That's purely an economic question, not a technical one. And I think the reason why you're looking at Amazon and not AMD for that is because Amazon is better able to square that economic question than AMD is. This is why I've previously said elsewhere that x86 is a Moore's Law dead end, because Amazon can't do that with x86, only with ARM or RISC-V, etc. If x86 were FRAND licensable, I think the space would look incredibly different. By holding onto that x86 IP, AMD/Intel are saying 'we can meet every compute need', and like hell they can; anyone who has the cash to assemble a silicon team and cut the necessary check to TSMC can build just what they need, provided it's not x86. Note: nobody except Apple is using ARM for performance reasons. Nobody. And yet, x86 continues to shrink.

Before I retired I moved my rented compute from x86 to ARM when Apple announced their switch. Part of that was a test, but part of it was 'my future local compute is likely going to be ARM, so having that alignment simplifies things for me even if it's reasonably less performant', and it was cheaper per compute unit, and I took that savings. I was the market for AWS in that I scaled my compute needs up and down a lot, so throwing a larger server or an additional one on the manifest was fine given they were ⅓ cheaper. I wasn't Netflix running maximal loads per physical server, so I didn't particularly care about that. A big part of AWS's pitch is value, and there's a lot of different ways they deliver that.
 

Doug S

Diamond Member
Feb 8, 2020
3,362
5,902
136
Multiple architectures can have their place in servers, but it's likely only so if they are separate products, like Intel does it now.
Possible exception: say somebody like Amazon designs a chip for their own cloud stack. Then a processor could have, say, 128 P-Cores that the customers will see and use, and separately 32 E-Cores that would not be rented and not be visible from the cloud side; they would be reserved for invisible Amazon housekeeping, storage/network access processing, secure enclave functionality, and other shenanigans.

How does the cloud world charge for SMT-capable processors now? If you're paying for a "core" but sharing it with someone else, you're not only getting less performance for your thread due to that resource sharing, you also have the potential security issues to worry about. IMHO it would be preferable to pay more to get a "core" that's an entire physical core even if SMT was turned off so it was still only one thread, because at least I'd know I'd get 100% of its resources and 0% of the potential SMT-related security issues.
 

OneEng2

Senior member
Sep 19, 2022
730
978
106
Nope, it will beat Turin Dense in integer workloads by a good margin, though it will fall quite short of Turin Dense, but Turin Dense is up to Rouge River Forest.
Agree to disagree. I think it likely that Turin D will be competitive with CWF and possibly even beat it.

Venice D will likely be a definitive victory over CWF.
Intel confirmed that CWF would have 18A compute tiles and Intel 3 base tiles. Idk if server will ever go external tbh.
Thanks.

It will be interesting then to see how it performs. I think it likely that Venice D will be capable of running at decent clock speeds on N2 (somewhere north of 5.0 GHz under load with SMT and AVX-512 on all cores, I would guess).

Both Intel's E core architecture AND 18A with BSPDN are likely to limit the clock speeds that CWF can accomplish IMO.
I'm assuming for the latter half of that statement you mean it will fall quite short of Venice Dense?
TBH, I think Turin Dense and CLF are going to be pretty close. CLF has a ~50% core count advantage, but I doubt CLF has a higher all core turbo frequency, Turin Dense should have higher perf/core, and also benefits from SMT.
I also think CLF will be competing with Venice Dense for most of its lifespan. I doubt Rouge River Forest, if it is going to come out, comes till late 2027/early 2028 at best, and I would imagine that's around the time, or at most ~1/2 a year earlier, than when Zen 7 server parts come out.
I agree.
Yes lol it's confusing sometimes
I keep botching the names as well. My head thinks "Venice" and my fingers type "Turin" ;).
 
  • Like
Reactions: Tlh97

poke01

Diamond Member
Mar 8, 2022
3,893
5,221
106
IIRC Qualcomm does something similar too, prob because it's the "same" team.
No I don’t believe they do. Qualcomm X Elite can access the entire L2 cache within the same cluster but cannot access another P cores cluster L2 cache.
 

DavidC1

Golden Member
Dec 29, 2023
1,693
2,774
96
SMT is simply a fantastic use of die space to raise MT performance. It came out of the implementation of superscalar designs where resources were frequently idled. SMT gave them something useful to do. It's simply a really good idea and only adds 5% die space to the core for 40% MT performance.
Die area, yes, but it adds something bigger on the negative side, which is increased validation difficulty and risk, made worse in the Meltdown/Spectre era.
2. How do we know that AMD's ST power efficiency is truly only measuring the core in the exact same way the Apple Silicon is? I think the best way is to take ST load power and subtract idle power, both taken from the wall. This is how Notebookcheck does it. When power is measured this way, M4 is 3.6x more efficient than Zen 5 in Cinebench. That seems way more in line with real-world usage experience.
Notebookcheck's method is not completely accurate either. Notebooks disable a lot of power savings when powered from the wall; only on battery can you do a proper test. You can see many of their tests where a system shows high power in the AC test but good battery life under load, and vice versa.

That said, in a general sense you are right.

For example:

The Lunar Lake system has a consistent ~4W advantage at all tested TDP levels, meaning it's due to uncore/platform optimizations, which Lunar Lake has lots of. Now if you assume Apple is doing more of that, then that number is greater, say 6W or 7W.

At higher power levels, even 7W is not a big deal. You are talking 57W vs 50W. But at 15W total, that's half of your total power use.
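Put as numbers (the 7W platform delta is the assumed figure from above, not a measurement):

echo "scale=2; 7 / 57" | bc   # ~12% of the total at a 57W operating point
echo "scale=2; 7 / 15" | bc   # ~47%, i.e. roughly half, of the total at 15W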
 
Last edited:

DavidC1

Golden Member
Dec 29, 2023
1,693
2,774
96
Agree to disagree. I think it likely that Turin D will be competitive with CWF and possibly even beat it.
The 144-core Crestmont-based Sierra Forest is already pretty competitive in integer workloads such as cloud, virtualization, and kernel compile: https://www.servethehome.com/wp-con...rin-Linux-Kernel-Compile-Benchmark-scaled.jpg

A 288-core Clearwater Forest with Skymont cores will equal Turin D even if you assume only 2x gains. We would have seen most of that 2x gain had they released a 288-core Sierra Forest.

There are many cases where Sierra Forest is substantially behind and would continue to remain behind even against Turin D, but Clearwater Forest will improve substantially due to:
-2x the FP capability, in addition to substantial uarch advancements
-A large under-the-core cache, connected by Foveros
-Additional optimizations specific to certain workloads that we don't know about

That's why @511 said it's already close in SpecCPU.
Highest 2P 6780E: 1460/1410
Highest 2P 9965: 3380/3230

That's 144 Crestmont cores vs 192 Zen 5c cores + SMT.
 
Last edited:
  • Like
Reactions: Tlh97 and 511

johnsonwax

Senior member
Jun 27, 2024
261
421
96
The value of having the hard R&D being subsidized by ARM itself.
Which is not something that's gonna last.
Why wouldn't it last? Presumably ARM by now has figured out how much the licenses need to cost to keep it going, and there's zero marginal cost on delivering them, so scale keeps that engine going - that's the Moore's Law point I keep making. The problem is on AWS's side: achieving the necessary scale/value to produce enough revenue to pay for the increasing development cost. It's not like AMD isn't having the same problem, given their need to offer so many SKUs.