Question: Qualcomm's first Nuvia-based SoC - Hamoa


Thibsie

Senior member
Apr 25, 2017
791
860
136
Nuvia was designing a server core that was supposed to be released in like 2021. It's been how long since Qualcomm bought Nuvia? In all that time you don't think they had ample opportunity to change focus to something more mobile (at least laptop if not phone) focused? It seems to be pretty power efficient; if it had been targeting servers it would be burning more power, unless they had been planning to cram 256 cores onto a reticle-sized die.
No, I don't think they could do that.
Nuvia was absorbed.
Part of the team went kaboom.
Some QC execs played their "mine is bigger than yours" dirty games.
They had to integrate their Nuvia server core into a QC uncore.

They really had other things to do (understandably)
 

naukkis

Senior member
Jun 5, 2002
727
610
136
Your Zen4 numbers are with L2, not just core area. . .

Yes - for a fair comparison, only the core's last-level cache should be excluded. Apple's cores have much larger, very low density L1 caches, which makes it possible to do without an intermediate cache level. For Zen4, the L2 is a core-private resource and should be included in the calculations when comparing silicon efficiency.
 

Hitman928

Diamond Member
Apr 15, 2012
5,445
8,373
136
Yes - for a fair comparison, only the core's last-level cache should be excluded. Apple's cores have much larger, very low density L1 caches, which makes it possible to do without an intermediate cache level. For Zen4, the L2 is a core-private resource and should be included in the calculations when comparing silicon efficiency.

I don't think it's that straightforward. M2 has another cache level beyond L2; it's just shared at the SoC level, but the cores have basically sole use of it when CPU-only loads are being run. The L2 cache on M2 is also only shared between the P-cores. Obviously AMD and Apple made different design decisions in their products because they have different goals and markets they are trying to address. You can make an argument that Apple's L1 is purposefully large and L2 should be excluded in area calculations but not AMD's, but then you can't also hold the argument about higher IPC from the Apple cores at the same time. It's a tradeoff. You either need to take core area for core area or include the L2 area / P-cores in the Apple calculation. Either way, you're going to come to the same end conclusion: AMD's (non-C) and Apple's cores are roughly the same size, with one being much wider by design for higher IPC and the other being narrower but higher frequency.
 

soresu

Platinum Member
Dec 19, 2014
2,827
2,027
136
According to AnandTech's live blog of the QC summit, Blackmagic are working on a native ARM64 Windows version of DaVinci Resolve for release next year.

That's a huge win for the platform considering how many editors, color graders and compositors use it.

Now if only DVR wasn't such a crash-happy, Frankensteinian trainwreck to work with 😂
 

naukkis

Senior member
Jun 5, 2002
727
610
136
I don't think it's that straightforward. M2 has another cache level beyond L2; it's just shared at the SoC level, but the cores have basically sole use of it when CPU-only loads are being run. The L2 cache on M2 is also only shared between the P-cores. Obviously AMD and Apple made different design decisions in their products because they have different goals and markets they are trying to address. You can make an argument that Apple's L1 is purposefully large and L2 should be excluded in area calculations but not AMD's, but then you can't also hold the argument about higher IPC from the Apple cores at the same time. It's a tradeoff. You either need to take core area for core area or include the L2 area / P-cores in the Apple calculation. Either way, you're going to come to the same end conclusion: AMD's (non-C) and Apple's cores are roughly the same size, with one being much wider by design for higher IPC and the other being narrower but higher frequency.

Cache density isn't equal - L1 cache density is about half of L2 density. If Zen4 had had a 512KB L2, comparing Zen's L2 against only the L1 of Apple's cores would be quite fair. Apple could also optimize their design for higher clock speeds with smaller L1 caches and an intermediate L2 to compensate - but as they don't aim for such high clocks, they can use bigger L1 caches instead. And that's just the design philosophy - targeting high clocks creates the need for a more complex cache hierarchy to meet the required timings in silicon.
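To make the density point concrete, here is a back-of-envelope sketch. The 2:1 L2:L1 density ratio is the rough figure claimed above and the capacities are the ones discussed in this thread; none of this is die-measured data.

```python
# Rough silicon-area comparison under the assumption (from the post
# above) that L1 SRAM is about half as dense as L2 SRAM.
L1_AREA_FACTOR = 2.0  # area of 1KB of L1, measured in KB-of-L2 units

def l2_equivalent_area(l1_kb, l2_kb):
    """Total cache area expressed in KB-of-L2-equivalent units."""
    return l1_kb * L1_AREA_FACTOR + l2_kb

apple_p_core = l2_equivalent_area(l1_kb=192 + 128, l2_kb=0)  # 320KB L1, no private L2
zen4 = l2_equivalent_area(l1_kb=32 + 32, l2_kb=1024)         # 64KB L1 + 1MB L2
zen4_512k = l2_equivalent_area(l1_kb=32 + 32, l2_kb=512)     # hypothetical 512KB-L2 case

print(apple_p_core, zen4, zen4_512k)  # 640.0 1152.0 640.0
```

Under that (very rough) density assumption, Apple's 320KB of L1 occupies about the same area as a 64KB L1 + 512KB L2 arrangement, which is exactly the "fair comparison" being argued for here.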
 

Hitman928

Diamond Member
Apr 15, 2012
5,445
8,373
136
Cache density isn't equal - L1 cache density is about half of L2 density. If Zen4 had had a 512KB L2, comparing Zen's L2 against only the L1 of Apple's cores would be quite fair. Apple could also optimize their design for higher clock speeds with smaller L1 caches and an intermediate L2 to compensate - but as they don't aim for such high clocks, they can use bigger L1 caches instead. And that's just the design philosophy - targeting high clocks creates the need for a more complex cache hierarchy to meet the required timings in silicon.

This seems like a long way of saying Apple chose a lower-frequency, higher-IPC, larger-area design as compared to AMD...
 

naukkis

Senior member
Jun 5, 2002
727
610
136
This seems like a long way of saying Apple chose a lower-frequency, higher-IPC, larger-area design as compared to AMD...

More like that, by not chasing high frequency, they could fit a larger logical core design into the same or smaller area than AMD and especially Intel.
 

Hitman928

Diamond Member
Apr 15, 2012
5,445
8,373
136
More like that, by not chasing high frequency, they could fit a larger logical core design into the same or smaller area than AMD and especially Intel.

I don't think that was ever in dispute. The only thing I was replying to you on originally is your claim that Apple's core designs are almost 1/2 the size of AMD's due to AMD chasing high frequency. That is not accurate as they have roughly the same area when comparing their top performing cores on the same node. I think Apple's approach is certainly the better one and it seems that AMD is now starting to follow that same path though they will still keep one foot in the higher frequency door probably because they have the engineers with the talent and experience to do so. I believe @adroc_thurston has mentioned another core design that is specifically for low frequencies (rather than shrinking the high frequency design a la Zen4c), so maybe we'll see a full crossover at some point or maybe a bifurcation of product lines.
 

naukkis

Senior member
Jun 5, 2002
727
610
136
I don't think that was ever in dispute. The only thing I was replying to you on originally is your claim that Apple's core designs are almost 1/2 the size of AMD's due to AMD chasing high frequency. That is not accurate as they have roughly the same area when comparing their top performing cores on the same node. I think Apple's approach is certainly the better one and it seems that AMD is now starting to follow that same path though they will still keep one foot in the higher frequency door probably because they have the engineers with the talent and experience to do so. I believe @adroc_thurston has mentioned another core design that is specifically for low frequencies (rather than shrinking the high frequency design a la Zen4c), so maybe we'll see a full crossover at some point or maybe a bifurcation of product lines.

And my argument was against someone who claimed that x86 designs chose a high-frequency design instead of a wider design to save silicon space, which I disagree with.

And about including or excluding Zen's L2 cache from core area - the A15 has a whopping 256KB more L1 cache than Zen4. Theoretically, Apple could easily, within the same total area, switch that to a 64KB L1 + 512KB L2 arrangement and probably win a bit more IPC from doing so. But it's a power-inefficient way to do things - Apple saves quite a bit of power with those bigger L1 caches, which massively cut energy-consuming L1-L2 traffic.
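The traffic argument can be sketched with a toy energy model. Every number here is an illustrative assumption - the relative energies and hit rates are invented for the sake of the sketch, not measured:

```python
# Toy model: every access probes L1; L1 misses also pay for an L2 access.
E_L1, E_L2 = 1.0, 5.0  # assumed relative energy per access (L2 ~5x an L1 hit)

def energy_per_access(l1_hit_rate):
    return E_L1 + (1.0 - l1_hit_rate) * E_L2

big_l1 = energy_per_access(0.98)    # guessed hit rate for a 128KB+ L1D
small_l1 = energy_per_access(0.94)  # guessed hit rate for a 32KB L1D

print(f"big L1: {big_l1:.2f}, small L1: {small_l1:.2f}")
```

Even a modest hit-rate gap makes the smaller L1 pay noticeably more energy per access, because every extra miss turns into L1-L2 traffic - the effect described above.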
 
Last edited:
  • Like
Reactions: SpudLobby

JoeRambo

Golden Member
Jun 13, 2013
1,814
2,105
136
Theoretically, Apple could easily, within the same total area, switch that to a 64KB L1 + 512KB L2 arrangement and probably win a bit more IPC from doing so. But it's a power-inefficient way to do things - Apple saves quite a bit of power with those bigger L1 caches, which massively cut energy-consuming L1-L2 traffic.

I don't think they would gain any IPC; most likely they'd lose quite some, and then lose on power.

People underestimate Apple's incredibly strong caching. The 192KB L1I is obviously awesome due to its massive size, but it's the L1D that is the real gem: almost 4GHz, 3-cycle latency for basic pointer loads, 128KB size. Untouchable capacity and untouchable latency.
And they are backing this with a ~20-cycle-latency 16MB L2 cache.

So they are at the sweet spot already; the only thing that can improve is pretty much L2 cache size, and they are doing just that with each gen.
 

SpudLobby

Senior member
May 18, 2022
843
546
106
You are so completely wrong about that. nT performance is derived from 1T performance. The quicker a single task is performed, the quicker the chip can go back to sleep.

AMD chips also scale rather well from a power standpoint. A 7950X at 65W is one of the most efficient chips out there.
This is utterly silly. Desktops are a small and shrinking portion of the market for one thing.

Second of all, the quadratic relationship between voltage and power consumption in a CPU implies, as most understand, that running these chips at the top of their curves will reduce energy and power efficiency even under race-to-idle, which is part of why Intel- and AMD-powered laptops so often reduce their frequencies/performance on battery power.

It's also why Lunar Lake won't have a 5+GHz frequency. "Oh but you can race to idle" is a wordcel argument about the kind of thing AMD and Intel do that lacks enumeration, which is why people like "Spec" (Adroc_thurson) make it to justify AMD and Intel being retarded about ST power consumption - it might pass go for some others here but not me because you're full of it.
 
Last edited:

naukkis

Senior member
Jun 5, 2002
727
610
136
I don't think they would gain any IPC; most likely they'd lose quite some, and then lose on power.

People underestimate Apple's incredibly strong caching. The 192KB L1I is obviously awesome due to its massive size, but it's the L1D that is the real gem: almost 4GHz, 3-cycle latency for basic pointer loads, 128KB size. Untouchable capacity and untouchable latency.
And they are backing this with a ~20-cycle-latency 16MB L2 cache.

So they are at the sweet spot already; the only thing that can improve is pretty much L2 cache size, and they are doing just that with each gen.

I don't have simulation tools to verify my claims - but if Apple switched to a 32/32/512 cache arrangement in that low-clock scheme, they could make the L1D 2-cycle latency, with that 512KB L2 backing it up at sub-10-cycle latency. From there, there is plenty of potential to increase IPC. That cache arrangement really is a low-hanging fruit that x86 manufacturers have taken (and actually ARM too, for their designs), but it comes with a pretty big efficiency drop from increased cache traffic. At this point it's pretty obvious that Apple's choice is the right one, and everybody else will switch to large L1 caches sooner or later or they will totally lose the efficiency race.
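The IPC side of this claim can be framed as a quick average-memory-access-time (AMAT) sketch. The latencies are the ones discussed in the thread; every miss rate below is an invented illustration, since (as said) nobody here has simulation data:

```python
def amat(levels):
    """levels = [(latency_cycles, miss_rate), ...]; the last level is
    treated as always hitting."""
    total, reach = 0.0, 1.0  # reach = fraction of accesses that get this far
    for latency, miss_rate in levels:
        total += reach * latency
        reach *= miss_rate
    return total

# Today's arrangement: 3-cycle 128KB L1D backed by a ~20-cycle shared L2.
current = amat([(3, 0.02), (20, 0.0)])
# Hypothetical 32/32/512: 2-cycle L1D, sub-10-cycle 512KB L2, then the big cache.
hypothetical = amat([(2, 0.06), (10, 0.15), (20, 0.0)])

print(round(current, 2), round(hypothetical, 2))
```

With these made-up miss rates the tiered arrangement comes out slightly ahead on average latency, matching the "a bit more IPC" intuition - but the outcome is entirely at the mercy of the assumed miss rates, which is the whole debate.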
 
  • Like
Reactions: SpudLobby

SpudLobby

Senior member
May 18, 2022
843
546
106
I don't have simulation tools to verify my claims - but if Apple switched to a 32/32/512 cache arrangement in that low-clock scheme, they could make the L1D 2-cycle latency, with that 512KB L2 backing it up at sub-10-cycle latency. From there, there is plenty of potential to increase IPC. That cache arrangement really is a low-hanging fruit that x86 manufacturers have taken (and actually ARM too, for their designs), but it comes with a pretty big efficiency drop from increased cache traffic. At this point it's pretty obvious that Apple's choice is the right one, and everybody else will switch to large L1 caches sooner or later or they will totally lose the efficiency race.
Yeah, agree here. Keeping data movement less frequent and closer to the core has been a huge win for Apple via larger L1s and L2s. It's interesting: right now even the Cortex X4 has a larger combined L1 than AMD or Intel, lol.

Anyways I expect the Nuvia cores to have a massive L1i/L1D. I'm positive they will.
 

FlameTail

Platinum Member
Dec 15, 2021
2,669
1,503
106
Yeah, agree here. Keeping data movement less frequent and closer to the core has been a huge win for Apple via larger L1s and L2s. It's interesting: right now even the Cortex X4 has a larger combined L1 than AMD or Intel, lol.

Anyways I expect the Nuvia cores to have a massive L1i/L1D. I'm positive they will.
I am indeed curious about the cache system in the Oryon CPU.

They quoted 42 MB of total cache.

I wonder if that includes the LLC/SLC?

Also, does the Oryon CPU have an L3 cache? If they are following Apple's design philosophy, they would not have one.
 

SpudLobby

Senior member
May 18, 2022
843
546
106
I am indeed curious about the cache system in the Oryon CPU.

They quoted 42 MB of total cache.

I wonder if that includes the LLC/SLC?

Also, does the Oryon CPU have an L3 cache? If they are following Apple's design philosophy, they would not have one.
Yeah, it's not fully clear yet. Likely though that figure is L2 + L3 or L2 + SLC; I would be willing to bet each cluster of 4 cores has 12MB of L2. Conditional on 12MB, if they are counting the L1 and not an SLC/L3, each core would have ~500KB of combined L1. Which is... possible, seeing that Apple has 320KB and this was the next step for the core design, but I think it'll just come in pretty close to Apple, and instead that last 6MB is L3/SLC.

Either way, I do anticipate the L1 being >= 2x the size of the Cortex X4's combined 128KB given the emphasis on low-power operation that Nuvia had, and the L2 similarly being effectively much larger than Intel's (especially given they share accesses... a single core for Apple can access ~75-80% of the L2).
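The arithmetic behind that guess, spelled out. All of it is speculation about Oryon, not confirmed specs - 3 clusters of 4 cores and 12MB of L2 per cluster are the assumptions:

```python
total_mb = 42           # Qualcomm's quoted total cache
clusters = 3
l2_per_cluster_mb = 12  # assumed, per the post above
cores = 12

remainder_mb = total_mb - clusters * l2_per_cluster_mb
print(remainder_mb)  # 6 -> read as a 6MB SLC/L3 ...

# ... or, if that remainder were counted as L1 instead:
per_core_l1_kb = remainder_mb * 1024 / cores
print(per_core_l1_kb)  # 512.0 -> the "~500KB of combined L1" figure
```

Either reading fits the 42MB total; the two prints are the two interpretations discussed above.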
 

JoeRambo

Golden Member
Jun 13, 2013
1,814
2,105
136
I don't have simulation tools to verify my claims - but if Apple switched to a 32/32/512 cache arrangement in that low-clock scheme, they could make the L1D 2-cycle latency, with that 512KB L2 backing it up at sub-10-cycle latency.

I don't have simulation tools either, but applying common sense would work too in this case. Why would Apple trade 192+128KB of excellent L1 for 576KB of tiered cache?

Apple's L1I is very busy, as they don't have a uOP cache and have at least 8 decoders, so the requirements for that hypothetical L2 would be huge. Ports and bandwidth would be eaten by instruction streams, and that would result in either energy waste or increased latency (or both).
A sub-10-cycle 512KB L2 is frankly a fantasy. Intel/AMD have done 256-512KB L2 caches and they've always come out at ~12-13 cycles, so shaving a cycle for the faster L1 vs AMD + secret sauce => I think 10 cycles is what is achievable. Some other ARM designs had an 11-cycle 512KB L2, so maybe Apple could pull it off and still target 4GHz?
L1D with 2-cycle latency? Now that's where we get into "last done on the 130nm Pentium 4 and never done since" territory. Of course, the P4 was doing it with an 8KB L1D, whereas Apple would have 32KB instead. Overall I don't think this is a good tradeoff, and it would put huge pressure on the L2 as well.

Insertion of another cache level currently only makes sense for Intel, where it's 32KB + 48KB at 5 cycles and the 2nd level is 16 cycles (and probably getting to 17-18 cycles with 3MB).
So it makes sense for Intel to insert a ~256KB private L2 at some 10-11 cycles (and maybe combine 2 P-cores into a cluster that shares the "current L2", losing even more latency but saving on power?).
Still, even for Intel, I wonder whether, after an L1I expansion to 64KB, such a small private L2 would make much sense; probably 512KB is the minimum.
 
  • Like
Reactions: Tlh97 and Schmide

Henry swagger

Senior member
Feb 9, 2022
398
256
106
I don't have simulation tools either, but applying common sense would work too in this case. Why would Apple trade 192+128KB of excellent L1 for 576KB of tiered cache?

Apple's L1I is very busy, as they don't have a uOP cache and have at least 8 decoders, so the requirements for that hypothetical L2 would be huge. Ports and bandwidth would be eaten by instruction streams, and that would result in either energy waste or increased latency (or both).
A sub-10-cycle 512KB L2 is frankly a fantasy. Intel/AMD have done 256-512KB L2 caches and they've always come out at ~12-13 cycles, so shaving a cycle for the faster L1 vs AMD + secret sauce => I think 10 cycles is what is achievable. Some other ARM designs had an 11-cycle 512KB L2, so maybe Apple could pull it off and still target 4GHz?
L1D with 2-cycle latency? Now that's where we get into "last done on the 130nm Pentium 4 and never done since" territory. Of course, the P4 was doing it with an 8KB L1D, whereas Apple would have 32KB instead. Overall I don't think this is a good tradeoff, and it would put huge pressure on the L2 as well.

Insertion of another cache level currently only makes sense for Intel, where it's 32KB + 48KB at 5 cycles and the 2nd level is 16 cycles (and probably getting to 17-18 cycles with 3MB).
So it makes sense for Intel to insert a ~256KB private L2 at some 10-11 cycles (and maybe combine 2 P-cores into a cluster that shares the "current L2", losing even more latency but saving on power?).
Still, even for Intel, I wonder whether, after an L1I expansion to 64KB, such a small private L2 would make much sense; probably 512KB is the minimum.
Intel P-cores are bandwidth hungry; that's why Intel has increased L2 cache since Alder Lake, and Lion Cove will have a 3MB L2, but it will be restructured.
 

adroc_thurston

Platinum Member
Jul 2, 2023
2,982
4,305
96
Desktops are a small and shrinking portion of the market for one thing.
They're not really shrinking.
It's also why Lunar Lake won't have a 5+GHz frequency
ughhhhh.
lol.
is part of why Intel and AMD-powered laptops so often reduce their frequencies/performances on battery power.
Nothing recent drops 1t clk target.
"Oh but you can race to idle" is a wordcel argument about the kind of thing AMD and Intel do that lacks enumeration, which is why people like "Spec" (Adroc_thurson) make it to justify AMD and Intel being retarded about ST power consumption - it might pass go for some others here but not me because you're full of it.
rent free
 

FlameTail

Platinum Member
Dec 15, 2021
2,669
1,503
106
Yeah, it's not fully clear yet. Likely though that figure is L2 + L3 or L2 + SLC; I would be willing to bet each cluster of 4 cores has 12MB of L2. Conditional on 12MB, if they are counting the L1 and not an SLC/L3, each core would have ~500KB of combined L1. Which is... possible, seeing that Apple has 320KB and this was the next step for the core design, but I think it'll just come in pretty close to Apple, and instead that last 6MB is L3/SLC.

Either way, I do anticipate the L1 being >= 2x the size of the Cortex X4's combined 128KB given the emphasis on low-power operation that Nuvia had, and the L2 similarly being effectively much larger than Intel's (especially given they share accesses... a single core for Apple can access ~75-80% of the L2).
You know, I find it fascinating that the Apple M chips have smaller SLC than their A series counterparts. And the Oryon CPU too with only a suspected 6 MB SLC.

The Apple M2 has an 8MB SLC, whereas the A15 Bionic it was based on has a whopping 32MB.

One would think that the fatter CPU and GPU would naturally require even beefier caches, but it's not the case.

I believe the answer is that the fat system caches in the Apple A series chips were for efficiency more than performance. Any memory bandwidth lost by shaving the SLC size would be more than compensated for by the wider RAM bus (128-bit or more) in the M chips.
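Rough numbers behind the wider-bus point. The transfer rates are assumed typical configurations (LPDDR4X-4266 for the A15, LPDDR5-6400 for the M2), not official figures:

```python
def dram_bw_gb_s(bus_bits, mega_transfers_s):
    """Peak DRAM bandwidth in GB/s: bytes per transfer x transfer rate."""
    return bus_bits / 8 * mega_transfers_s / 1000

a15_like = dram_bw_gb_s(64, 4266)   # ~34 GB/s on a 64-bit bus
m2_like = dram_bw_gb_s(128, 6400)   # ~102 GB/s on a 128-bit bus

print(round(a15_like, 1), round(m2_like, 1))
```

Under those assumptions the M chip's wider, faster bus delivers roughly 3x the raw DRAM bandwidth of an A15-style configuration, which is plenty of headroom to shave SLC against.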
 

SpudLobby

Senior member
May 18, 2022
843
546
106
They're not really shrinking.
The money is in mobile and in servers, more so than 10 years ago.
ughhhhh.
lol.
The "I know x y and z" routine isn't going to pass go because you've worn your *** for a hat too often in the last two years, man. But if it is true, colossal self-own on Intel's part given the part's target.
Nothing recent drops 1t clk target.
Yikes. Going backwards for them.
rent free
Sophomoric. At least evict Andrei before slinging.
 
  • Like
Reactions: Nothingness

SpudLobby

Senior member
May 18, 2022
843
546
106
You know, I find it fascinating that the Apple M chips have smaller SLC than their A series counterparts. And the Oryon CPU too with only a suspected 6 MB SLC.

The Apple M2 has an 8MB SLC, whereas the A15 Bionic it was based on has a whopping 32MB.

One would think that the fatter CPU and GPU would naturally require even beefier caches, but it's not the case.

I believe the answer is that the fat system caches in the Apple A series chips were for efficiency more than performance. Any memory bandwidth lost by shaving the SLC size would be more than compensated for by the wider RAM bus (128-bit or more) in the M chips.
The system caches are partially about efficiency, yes, but not primarily. The first thing is the other IP, like the GPU etc. The nice side effect of growing it is reducing DRAM accesses for the CPU, for example. You'll note Apple actually shrunk the SLC from 32MB to 24MB from the A15 to the A16, probably in part because they upgraded the DRAM to LPDDR5 and increased the L2 size, and looking at the efficiency for CPU work, it ended up fine in that case. But it can help. A14 -> A15 was a tell on that front.