Question Qualcomm's first Nuvia based SoC - Hamoa

Page 29 - Seeking answers? Join the AnandTech community: where nearly half-a-million members share solutions and discuss the latest tech.

SpudLobby

Golden Member
May 18, 2022
1,041
702
106
From the article that @John Bruno posted:

I'm not really sure what to make of that quote. It's reasonable to assume the TDPs are in the ballpark of the reference designs' thermal envelopes, but I think it's inaccurate to claim the tested SoCs had TDPs of 23W and 80W.

Nitpicks aside, I'm excited to see what one of these SoCs can do in a passively cooled laptop. I'm definitely interested in a fanless laptop that isn't made by Apple. Although I have to admit that I'm a bit worried that when devices with these SoCs hit the market, the pricing might be a kick to the groin.
Yeah, I don't think it's accurate to see those as the actual power draw for the SoC package + DRAM in the windows of the benchmarks we saw, because they define TDP on those laptops as the maximum power draw the SoC + DDR + PD can sustain long-term at a (presumably reasonable) max temp.

But that doesn't actually mean those benchmarks were run at 23W/80W of power draw for the SoC + DRAM. It could've been more for a short duration, particularly for the former. They're just showing what the SoC can do in those classes of devices, which is relevant to heat etc.

From Andrei in another forum:
"[It's only correlated in 30+ minute workloads, in which case TDP == power consumption, but that's only valid for Qualcomm, as that correlation doesn't exist for Intel/AMD.]

[What those 23/80W are supposed to mean is 'thermal envelope of the given test devices'.]"

Andrei's comment on the actual power and performance of the SoC:

"[The perf/power curves are the actual workload power, without gaps.]"

So clearly we shouldn't take much from those other than as a demonstration of scaling within a device's thermal constraints in short-run tests. If we want to see the power/perf scaling, just look at the curves.
 
  • Like
Reactions: Rigg

SpudLobby

https://x.com/QaM_Section31/status/1719944806502466005?s=20 on the A15 IPC, Nuvia.

Though, the M3 looks like about 3000-3150, and this year the M3 Max has no frequency boost over the others - still 4.05GHz.


Taking the highest as best, it doesn't look like Apple beat QC on GB6, at least for the Linux comparison from QC, and even the Windows ones are within rounding error of a lot of the M3 listings, which range around 2950-3050.


At 3.8GHz on Windows, QC is at 2777 for GB6, and at 4.3GHz, about 2996. The 3.8GHz one is basically identical to an M3 on IPC, the speedier one less so^1. Overall a huge win for QC, and I don't think Apple's stuff looks that great here; I think QC is basically right with them on IPC in a pretty major benchmark.

1: https://www.notebookcheck.net/First...gen-H-and-14th-gen-desktop-CPUs.763149.0.html


And the 8 Gen 3 with the X4 and 2MB of cache is at about 2326 from Qualcomm's reference device @ 3.3GHz.

GB6 ST perf/GHz of recent chips and cores (keep in mind, higher clock speeds reduce average IPC albeit probably to varying extents depending on the SoC)

Qualcomm:
8 Gen 3 at 3.3GHz: 704
8 Gen 2 at 3.36GHz: 619^2
Snapdragon X Elite at 3.8GHz, Windows: 730
Snapdragon X Elite at 4.3GHz, Windows: 696
Apple:
M3, M3 Pro, M3 Max at 4.05GHz: range from 716 to 777^1
A15 at 3.23GHz: about 720
A16 at 3.4-3.46GHz: similar to A15

1: Highest score I've seen from an M3 Max is 3150 for GB6 ST, but most scores cluster around the mid-2900s to low 3000s.
2: One can find 2000-2080 ST listings from S23s using the 3.36GHz 8 Gen 2. What we care about is what the chip can really hit, and with Android scheduling and OEM stuff, which is less consistent, it gets messier.
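The perf/GHz figures in the table above are just the GB6 ST score divided by the clock. A quick sketch to reproduce them (all scores are the forum-quoted listings above, not official data):

```python
# Recompute GB6 ST perf/GHz from the score/clock pairs quoted in this post.
# Scores are forum-quoted listings, not official figures.
listings = {
    "8 Gen 3 @ 3.30 GHz":          (2326, 3.30),
    "X Elite (Win) @ 3.80 GHz":    (2777, 3.80),
    "X Elite (Win) @ 4.30 GHz":    (2996, 4.30),
    "M3 family @ 4.05 GHz (low)":  (2900, 4.05),
    "M3 family @ 4.05 GHz (high)": (3150, 4.05),
}

for name, (score, ghz) in listings.items():
    print(f"{name}: {score / ghz:.0f} points/GHz")
```

Note how the 4.3GHz X Elite listing comes out *lower* per GHz than the 3.8GHz one, which is the clock-vs-average-IPC effect mentioned in the parenthetical.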
 

SpudLobby

Notes:

Apple obviously hasn't gained much. More interestingly, we see that the X4 really did get a 13% perf/GHz improvement, albeit it now draws 27% more power than the A15 in SPEC and is still down a few points.
In GB5, they've basically closed the gap with the A15 on ST with the X4 and 8 Gen 3 @ 3.3GHz.

So the X4 has really closed some gaps, albeit at higher power consumption (at least in SPEC) than Apple two years ago for similar performance. But say they put this in a laptop - would 25% more power consumption than an absurdly low starting value (the M1 roughly equals the A15 in performance) at peak ST put them in league with AMD's Phoenix or Intel's MTL for the same ST? I doubt it; I think it'd still be better.

Before we see dooming about the Nuvia cores vs the X4 and its successor: I think if you dialed Qualcomm's Oryon cores down to the same frequency as the X4, they'd blow it out on power draw. IPC alone is only part of the story in building a low-power, performant core.
 
  • Like
Reactions: Tlh97

SpudLobby

Also: Zen 4 and Intel's Golden Cove/Redwood Cove seem to trail the X4, anything since the M1 from Apple, and the new Qualcomm cores by a solid 25-30% in perf/GHz, and it's not like they're winning on e.g. idle power or power draw at peak ST. The next few years should be interesting.
 

SpudLobby


IT'S THE GOOD STUFF

THEY MEASURED THE DIE SIZE OF THE X ELITE
This is incredible.



Coming at it this way we got 12.49 x 13.19 or 164.74mm^2, which is also the wrong answer, too low. That said, we have a range of sizes to work with and the right answer is somewhere in between. We won’t say how we know, but think low 170s for the real answer. Do note that AMD’s Phoenix Point CPU is 178mm^2, so X Elite is a little smaller on an equivalent process. If the early performance claims hold up, that could make for a very interesting battle next year.

So in the end we have a device that is really a pain in the proverbial backside to measure, but we tried. The die size is between 165 and 182mm^2, but we are confident in saying the real answer lies in the low 170s. Let's go out on a limb and call it 171 +/- 2mm^2. The thickness is also quite impressive, being more akin to cell phone dimensions than PC CPU ones. Even with preliminary numbers in hand, this could be a device to watch.
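As a quick check on the quoted bracketing (the bounds are from the article above; the midpoint is my own crude estimate, not SemiAccurate's method):

```python
# Bracket the X Elite die area from the quoted measurements: 12.49 x 13.19 mm
# is their too-low measurement, 182 mm^2 their too-high one.
low = 12.49 * 13.19          # ~164.7 mm^2, the "wrong answer, too low"
high = 182.0                 # mm^2, upper bound from the article
midpoint = (low + high) / 2  # ~173 mm^2, consistent with "low 170s"

print(f"bounds: {low:.1f}-{high:.0f} mm^2, midpoint ~{midpoint:.0f} mm^2")
```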

I was thinking more like a 200-220mm^2 die based on Adreno perf/watt, the display stuff, and then of course the 12 big cores with 36MB of cache. But they probably also saved some with their cache hierarchy trending toward shared, huge L2 clusters and a smaller L3 or SLC - that lets them still get great performance/watt while making a bit more economic use of cache, I guess.

But 170-something? That's Phoenix-sized and smaller, as he says, except this chip beats Phoenix on peak MT performance, wins or matches on ST for U configurations, blows it out on GPU efficiency with similar performance elsewhere, has similar media and display capabilities, and a more capable NPU (at worst similar, I bet, but it seems superior). I suspect it scales down better for both ST and MT and/or idle power, too.

They'll charge a premium for this, but probably not as much as people think, and it means their whole enterprise here is sustainable margin-wise; it's not like this stuff is bloated.

2024, 2025 and on are going to be... Interesting.
 
  • Like
Reactions: Tlh97 and Tup3x

FlameTail

Diamond Member
Dec 15, 2021
4,384
2,762
106
I am not entirely surprised. For reference the Snapdragon 8 Gen 2 on N4 is ~120 mm².

(annotated die shot of the Snapdragon 8 Gen 2)
And the 5G modem here (which is not integrated on the X Elite) itself takes about 15 mm².
 

FlameTail


Leaksters on Twitter are saying the Snapdragon 8 gen 4 will use 'Gen 1 Oryon' cores, implying that it and perhaps the X Elite Gen 2 will use the same cores as the X Elite.

That doesn't sound good to me. They should iterate fast and aggressively. Right now the X Elite's Oryon CPU is on par with the competition.

But a year-plus later, I doubt that will hold if they're using the same Oryon core, even with the upgrade to a 3nm node.
 

SpudLobby

I am not entirely surprised. For reference the Snapdragon 8 Gen 2 on N4 is ~120 mm².

And the 5G modem here (which is not integrated on the X Elite) itself takes about 15 mm².
Well, that is a phone chip using Arm reference cores. This is an entirely different type of CPU core, there are 12 of them, they're certainly a bit bigger than an X4, and there's a lot more total cache. Adreno might be the same size or smaller, but then you also have other stuff like a larger Hexagon.

I, among others, know how big the phone dies are - they all cluster around 90-120mm^2 in the last few years, with or without modems. But you can only extrapolate so much from them given the major CPU change here.

I think it's safe to say most of us are just slightly surprised. I expected them to be relatively area-efficient for the part, and it's hilarious how efficient and performant Adreno is while probably not even being larger than the new 8 Gen 3's Adreno - which is telling, because if it were larger I'd expect it to make RDNA3 look even worse than it does.
 

SpudLobby


Leaksters on Twitter are saying the Snapdragon 8 gen 4 will use 'Gen 1 Oryon' cores, implying that it and perhaps the X Elite Gen 2 will use the same cores as the X Elite.

That to me doesn't sound good. They should iterate fast and strongly. Right now X Elite's Oryon CPU remains on par with the competition.

But 1+ year later I doubt that will hold if they are using the same Oryon core, even with the upgrade to 3nm node.
Gen 1 could mean a lot of different things in practice, and it most likely still entails some power/performance tweaks; it just means they aren't aiming for major microarchitectural advances, given the schedule pressure. If they thought it'd be *worse* than an X5 for power/performance, or not worth the cost and value proposition, they most likely wouldn't do it, if that's any consolation. IPC isn't really their issue; what they need is to blow the X4/X5 out on performance within constrained power profiles - you need IPC to get great performance at low frequencies, but it doesn't imply great perf/W ipso facto.

The X4 in the 8 Gen 3 demonstrates this. It's pretty good - it matches the A15 or the M1 at similar frequency, doubtless at vastly lower power draw than AMD/Intel can offer for the same ST - but it still uses 25% more power than the A15 did for a similar score.

Iso-performance, do we think Qualcomm is closer to the X4's power draw here or closer to beating Apple with a hypothetical 8 Gen 4 on N4P? I think the answer is very likely the latter lol.

Qualcomm's goal should be to beat Apple's A14/A15/A16, or at worst match the last of those (correcting for process node gains). In practice it'll be a bit better due to N3E, but still.
 
  • Like
Reactions: Tlh97 and FlameTail

SpudLobby

Frankly, ignore Revegnus; he's an idiot who just reads Korean forums and relays them. He's gotten an unholy sum of things wrong already.
 

roger_k

Member
Sep 23, 2021
102
219
86
At 3.8GHz on Windows, QC is at 2777 for GB6, and 4.3GHz, about 2996. That 3.8GHz one is basically identical to an M3 on IPC, the speedier one less so^1. Overall huge win for QC and I don't think Apple's stuff looks that great here, I think QC is basically right with them on IPC in a pretty major benchmark.

It’s important to note that IPC for M1/M2/M3/Oryon is pretty much within rounding error across the board. Which leads me to my earlier comment that the Nuvia team has likely rebuilt their work at Apple with Oryon (and there is nothing wrong with that). To my amateur eye it looks like Oryon is more or less equivalent to the M2, and 4.3GHz is an aggressive overclock that likely takes the chip outside its comfort range (hence the normal peak operating frequency is “only” 3.8GHz). The improved energy efficiency claims over the M2 are mostly in line with the N4-vs-N5 advantage (with either additional small improvements in Oryon or some shenanigans with measurements). At any rate, it’s a very strong initial showing for the Qualcomm/Nuvia team; looking forward to what they bring out in the future.

The lack of IPC improvements in A17/M3 is certainly concerning, especially since the new architecture is substantially wider and can handle more branches. This is a good example that a wide core is not everything. We see something similar with the X4, which can barely keep up in IPC with the considerably narrower Firestorm/Avalanche.
 

SpudLobby

It’s important to note that IPC for M1/M2/M3/Oryon is pretty much within rounding error across the board. Which leads me to my earlier comment that the Nuvia team has likely rebuilt their work at Apple with Oryon (and there is nothing wrong with that). To my amateur eye it looks like Oryon is more or less equivalent to the M2, and 4.3GHz is an aggressive overclock that likely takes the chip outside its comfort range (hence the normal peak operating frequency is “only” 3.8GHz). The improved energy efficiency claims over the M2 are mostly in line with the N4-vs-N5 advantage (with either additional small improvements in Oryon or some shenanigans with measurements). At any rate, it’s a very strong initial showing for the Qualcomm/Nuvia team; looking forward to what they bring out in the future.

The lack of IPC improvements in A17/M3 is certainly concerning, especially since the new architecture is substantially wider and can handle more branches. This is a good example that a wide core is not everything. We see something similar with the X4, which can barely keep up in IPC with the considerably narrower Firestorm/Avalanche.

Pretty much, but one thing to note: there is a separate component beyond IPC that comes down to energy efficiency via the cache, the physical design, and yes, the process node.

I don't, however, think that whatever advantage Qualcomm has over the M2 Max and M2 (on N5P) is entirely down to the process node going from N5P to N4P - nonzero, sure, but:

N5P is -10% power, +5% performance (iso-power, iso-arch) over N5. N4P is -22% power and +11% performance (iso-power, iso-arch) over N5. Transitively, N4P offers about 13% lower power than N5P, and about 5-6% more performance.
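The transitive arithmetic can be sketched directly from the N5-relative figures above (TSMC's headline iso-arch marketing numbers, not measurements); the power delta comes out nearer 13% than 15%:

```python
# N5-relative node figures quoted above (TSMC marketing numbers, iso-arch).
n5p_power, n5p_perf = 0.90, 1.05   # N5P: -10% power, +5% perf vs N5
n4p_power, n4p_perf = 0.78, 1.11   # N4P: -22% power, +11% perf vs N5

# Transitive N4P-vs-N5P deltas (ratios of the N5-relative factors)
power_cut = 1 - n4p_power / n5p_power   # ~13% lower power
perf_gain = n4p_perf / n5p_perf - 1     # ~6% more performance

print(f"N4P vs N5P: -{power_cut:.0%} power, +{perf_gain:.0%} perf")
```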

This is nonzero, of course, but it's nothing too crazy. Apple also has more cache on the M2 in the L2 clusters, and likely still on the SLC vs QC, so if Qualcomm manages to beat them on ST power draw by 20-30% iso-performance, that's a pretty big achievement.

Worth noting, there were still some gains from the additional cache going M1 -> M2, or A14 -> A15 (see the L2/SLC), and then the A16 - which was on N4, not a real node improvement, but still increased ST by 10% at the exact same power, which was impressive. So there have been tidbits of gains here and there from that kind of thing, but you're right that the overall IPC improvements have been middling.

On width: decode is only one part of it, but the X4 did get a 13-15% IPC improvement from the bigger core, dropping the MOP cache, more ALUs, and 2MB of L2. That was significant for them. It's only barely behind Firestorm or the A15/M1 - 1693 in GB5 at 3.3GHz vs 1730 at 3.23GHz, very similar.
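The per-GHz comparison in that last sentence works out like this (scores as quoted in-thread):

```python
# GB5 ST per-GHz: Cortex-X4 (8 Gen 3) vs Firestorm (A15/M1-class),
# using the scores and clocks quoted above.
x4 = 1693 / 3.30          # ~513 points/GHz
firestorm = 1730 / 3.23   # ~536 points/GHz

gap = firestorm / x4 - 1  # Firestorm's remaining per-GHz lead, ~4%
print(f"X4 {x4:.0f} vs Firestorm {firestorm:.0f} points/GHz, gap {gap:.1%}")
```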

Overall it just doesn't have as big a ROB, the combined L1 is nearly a third the size of Apple's, and the L2 is smaller (albeit private). Another way to think of it is that Arm isn't doing too badly for their resources, and I bet they might be doing a bit worse if they hadn't gone as far as they did.

Either way, yes, Apple is stalling, and it's possible Arm and QC will too, but if I had to bet on who we'll see a significant increase from in the next 2-3 years, it'd be Qualcomm or Arm. Apple lost their top guys.

AMD and Intel will definitely get some increases too in the next 1-2 years, but they're further behind.
 
  • Like
Reactions: Tlh97 and roger_k

FlameTail

I think it's safe to say most of us are just slightly surprised, I expected them to be relatively area efficient for the part and it's hilarious how efficient and performant Adreno is while it probably isn't even larger than the new 8 Gen 3's Adreno - which is telling because if it were I expect it to make RDNA3 look even worse than it does
So Adreno is more area efficient than RDNA?

The legacy in smartphone SoC design keeps on giving.
 

SpudLobby

So Adreno is more area efficient than RDNA?

The legacy in smartphone SoC design keeps on giving.
I don't know about vs the RDNA3 in Phoenix, no. It also depends on the benchmark: if you have RDNA3 underperforming in benchmarks that Qualcomm uses in mobile, it may well be a driver/software issue. It's very easy to shift "performance per watt" on a GPU with stuff like that, since your output is heavily dependent on it. TimeSpy would be best.

But what we do know is that in Control, which was ported to Arm, performance was identical between the 7840HS (I think that was it) and the 23W-TDP device with an X Elite chip - around 30-33 frames or whatever. We don't have power figures, but it's entirely possible it was also using less power overall for that demo, which makes the other claims more believable.

Adreno is at 4.6 TFLOPS in this config and maxes out at lower power than AMD does (like 30W vs more like 50+), and it draws about 3x the power of the 8 Gen 2's GPU for about 2-2.25x the TFLOPS of the Adreno 740. Most likely it's a very slightly larger GPU than the 740 at about 2x the frequency. I don't think it's the 750, though.
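A rough sanity check on the "similar-size GPU at roughly double the clock" guess, using the standard FP32 formula TFLOPS = 2 x ALUs x clock (one FMA counts as two ops). The ALU count and clocks below are my assumptions for illustration, not confirmed Adreno figures:

```python
# FP32 throughput: 2 ops per ALU per cycle (FMA counts as two operations).
def tflops(alus: int, ghz: float) -> float:
    return 2 * alus * ghz / 1000

# Hypothetical configs: same ALU count, roughly doubled clock.
x_elite = tflops(1536, 1.50)     # ~4.6 TFLOPS, matching the quoted figure
adreno_740 = tflops(1536, 0.68)  # ~2.1 TFLOPS at a phone-class clock

print(f"X Elite ~{x_elite:.1f} TFLOPS, {x_elite / adreno_740:.2f}x the 740")
```

Under those assumed numbers the ratio lands right in the quoted 2-2.25x band, which is consistent with a roughly 740-sized shader array clocked about twice as high.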
 

soresu

Diamond Member
Dec 19, 2014
4,117
3,572
136
if you have RDNA3 underperforming in benchmarks that Qualcomm uses in mobile it may well be a driver/software issue
Given that OGL ES is a derivative of OGL, and AMD's OGL performance on Windows is still less than stellar even after the major rework done a year ago, I'd say you really need to compare the best of RadeonSI on Linux against any Qualcomm benchmarks.
 

soresu

Adreno is at 4.6 TFLOPS in this config and maxes out at lower power than AMD does (like 30W vs more like 50+), and it draws about 3x the power of the 8 Gen 2's GPU for about 2-2.25x the TFLOPS of the Adreno 740
Be careful comparing GPUs using TFLOPS.

It works for some pure compute workloads - but for gaming workloads there are other constraints on a GPU, which means n TFLOPS on one µArch can perform significantly differently from n TFLOPS on another µArch.

Case in point, Vega vs RDNA2: for the same TFLOPS you get substantially more FPS in a given game/app/benchmark.
 

SpudLobby

It works for some pure compute - but for gaming workloads there are other constraints on a GPU that can mean n TFLOPS for one µArch can be significantly different for n TFLOPS of another µArch.

Case in point Vega vs RDNA2 - for the same TFLOPS you get substantially more FPS in a given game/app/benchmark.
Well, mostly comparing Adreno to Adreno here which is more kosher assuming there haven’t been massive changes in the structures but yeah.
 

soresu

Well, mostly comparing Adreno to Adreno here which is more kosher assuming there haven’t been massive changes in the structures but yeah.
I would expect that serious µArch changes have been happening in Adreno over the years, as they have for Mali over at ARM.

Exactly when and how is anyone's guess, as they are probably the least forthcoming vendor in this regard, plus the fast pace of switching process nodes in mobile SoCs muddles up perf- and area-efficiency comparisons between generations.
 

SpudLobby

I would expect that serious µArch changes have been happening in Adreno over the years, as they have for Mali over at ARM.

Exactly when and how is anyone's guess, as they are probably the least forthcoming vendor in this regard, plus the fast pace of switching process nodes in mobile SoCs muddles up perf- and area-efficiency comparisons between generations.
They have, but not on the order of the arch differences between Adreno and RDNA. Qualcomm uses a shader-heavy arch that's pretty wide yet, it seems, also area-efficient.
 

roger_k

It works for some pure compute - but for gaming workloads there are other constraints on a GPU that can mean n TFLOPS for one µArch can be significantly different for n TFLOPS of another µArch.

Doesn’t even work for compute often enough. Look at AMD's newer GPUs, where peak TFLOPS is calculated using their dual-issue instructions, but real-world opportunities to actually use them in practice are limited.
 

soresu

Leaksters on Twitter are saying the Snapdragon 8 gen 4 will use 'Gen 1 Oryon' cores, implying that it and perhaps the X Elite Gen 2 will use the same cores as the X Elite.
This doesn't surprise me at all.

I'm pretty sure the 8cx Gen 1/2 used the same A76 CPU cores too, despite reviews complaining that the speed of 8cx Gen 1 WoA laptops was less than stellar.
 

SpudLobby

This doesn't surprise me at all.

I'm pretty sure that 8cx gen 1/2 used the same A76 CPU cores too despite complaints about speed on 8cx gen1 WoA laptops being less than stellar in reviews.
That’s not even comparable. The 8 Gen 4 he's talking about is for phones. This is a first-generation custom core used in laptops and ported to a phone SoC coming out in 2024. Even an X5 with another +15% performance would still likely be in line with, and most likely inferior in overall perf/W to, the Oryon cores.

You're talking about Qualcomm's earlier crappy PC efforts using a reference core that was sloppy and dated - that was just laziness.

So this is… totally unrelated and not a good analogy at all, because the counterfactual in this case is using a recent Cortex-X core for the 8 Gen 4, which will most likely be worse than what they have to offer on the custom end. The reason they'd stick with the first generation (albeit likely with some physical design changes) is really not hard to see from a risk POV, and it'd still get them ahead of their peers on at least something.

They’ve already come out and said they expect to charge a bit more for the 8 Gen 4 (phone chip, again) due to the custom cores. Sounds to me like they’ve got something worthwhile.



Not trying to be rude, but come on, guys - think.