News [Anandtech] Arm Announces Mobile Armv9 CPU Microarchitectures: Cortex-X2, Cortex-A710 & Cortex-A510


Gideon

Platinum Member
Nov 27, 2007
2,030
5,035
136
ARM consumer stack updates:

  • Finally a decent "little" core update (A55 -> A510) with a 35% performance gain
  • The big core update is less ambitious (A78 -> A710), with a 10% uplift mentioned
  • The X2 is supposedly 16% faster than the X1
  • Lots of other changes: the Armv9 ISA (with decent vector ops, finally), new interconnects, larger L3 cluster designs, and more

 
  • Like
Reactions: Tlh97 and NTMBK

JoeRambo

Golden Member
Jun 13, 2013
1,814
2,105
136
Thanks for that... The EPYC 7742 is 51% faster than Graviton 2, so this new processor should still be slower than the 7742, let alone Milan or Milan-X, which will be out in the same timeframe (or already are). It will be a massacre.

In a cloud setting, we are getting to the point of "good enough" for ARM. If this thing is 30% faster than Graviton2 in a 100W envelope, it won't matter for typical cloud workloads if a 225W CPU is, say, 25% faster.
TCO for Amazon will be highly in favor of Graviton3, and by the time the cloud guys sample Bergamo in 2023, Amazon will be out with Graviton4.

In generic cloud computing, the writing is on the wall for both Intel/AMD.
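A back-of-the-envelope sketch of that perf/watt argument (a minimal calculation using only the rough figures above - 30%, 25%, 100W, 225W - not measured numbers):

```python
# Hypothetical perf/watt comparison, normalizing Graviton2 performance to 1.0.
g3_perf, g3_power = 1.30, 100            # "30% faster than Graviton2 in a 100W envelope"
x86_perf, x86_power = 1.30 * 1.25, 225   # a 225W x86 part "say, 25% faster" than the G3

g3_ppw = g3_perf / g3_power              # ~0.0130 perf/W
x86_ppw = x86_perf / x86_power           # ~0.0072 perf/W
print(f"Graviton3 perf/W advantage: {g3_ppw / x86_ppw:.2f}x")  # ~1.80x
```

Under those assumptions the ARM part comes out around 1.8x better on perf/watt, which is the whole TCO story.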
 
  • Like
Reactions: Gideon

gdansk

Diamond Member
Feb 8, 2011
4,587
7,708
136
In generic cloud computing, the writing is on the wall for both Intel/AMD.
That seems only to be a problem in scale-out. And even in that segment, I'm not sure why you think it will exclude them. Genoa will have more cores and be faster still. And who knows whether Intel will do a many-core Goldmont derivative.
 

Doug S

Diamond Member
Feb 8, 2020
3,603
6,368
136
Somewhat off-topic but relevant enough to discuss: Andrei is leaving Anandtech :oops:

He was the main guy covering mobile, ARM, and often the nitty-gritty parts of processors in general (such as cache-latency graphs). While his coverage certainly wasn't perfect and his graphs were peculiar (we've complained about them in this very thread), it was at least very different from everyone else's and almost always highly informative.

I just can't help but wonder what happens to Anandtech in general now that Ian is the only one left really doing reviews or benchmarks, and he seems to have his own stint going with TechTechPotato ...

Ryan hasn't updated his GPU test suite since 2019 and hasn't done anything worth mentioning since the RTX 2xxx series.


Yikes, the technical quality of this site is about to take a big dive. What sucks is that there really isn't anyone anywhere else doing the kind of in-depth CPU microarchitecture reviews he did. You have people covering little pieces of it, like Hans DeVries, or the few people investigating the Apple M1 Max and so on, but they are mostly narrowly focused on a few things, like instruction timings and how they affect their pet application, rather than giving a more general overview. I always looked forward to his articles every October about Apple's new SoC, and to his deep dives into the nitty-gritty of the latest Intel or AMD designs.

Sounds like he's taking an industry job based on the "conflicts of interest" comment. Good luck wherever you're headed, Andrei!
 

beginner99

Diamond Member
Jun 2, 2009
5,318
1,763
136
In a cloud setting, we are getting to the point of "good enough" for ARM. If this thing is 30% faster than Graviton2 in a 100W envelope, it won't matter for typical cloud workloads if a 225W CPU is, say, 25% faster.

You are probably right. But in my area of work, bandwidth/requests per second is never an issue; latency is, or rather the "above normal" complexity of the data to display. So that 25% faster processing might be the difference in the application feeling fast enough to end users.
It's the difference between many simple requests vs. a few complex ones.
 

JoeRambo

Golden Member
Jun 13, 2013
1,814
2,105
136
You are probably right. But in my area of work, bandwidth/requests per second is never an issue; latency is, or rather the "above normal" complexity of the data to display. So that 25% faster processing might be the difference in the application feeling fast enough to end users.
It's the difference between many simple requests vs. a few complex ones.

Of course, I am dealing with the same type of complex requests every day, running our own servers, etc.
But even in this environment we find plenty of uses for "generic" cloud computing - work that is perfectly offloadable to the cloud.
Even if we were not using it directly, I know the 3rd-party services we already depend on are using Azure and Amazon stuff. They are probably on x64 for now, but heck, as long as they meet our SLA, they could run on ZX Spectrums; data is data.

So there is an expanding sea of generic computing that these ARM CPUs excel at. And with Graviton2 and 3, things are looking better each day.

And obviously, even if we look beyond HW - judging by that Phoronix benchmark, software support for ARM64 is moving forward in leaps and bounds. I think 99% of "server" software like compilers, frameworks, and JDKs is already ARM64-capable and being optimized for ARM as we speak.

A recent example with JDKs - Amazon is maintaining their own OpenJDK distro


That means a very important thing: they are testing it on their own ARM cloud infrastructure and hardware, and probably don't mind optimizing for their hardware either, as they seem to have very capable JDK developers as well! (for example, Announcing preview release for the generational mode to the Shenandoah GC | AWS Developer Tools Blog (amazon.com))

Mindshare is increasing. I used to say the biggest obstacle was the lack of a capable workstation for development; I think the M1 derivatives are now taking care of that. We need one more vendor with a Linux/Windows workstation and we are set.
 
  • Like
Reactions: BorisTheBlade82

DrMrLordX

Lifer
Apr 27, 2000
22,931
13,014
136
Sounds like he's taking an industry job based on the "conflicts of interest" comment. Good luck wherever you're headed, Andrei!

Might be sharing a cubicle with jonnyguru.

(Probably not, but you never know)
 

naukkis

Golden Member
Jun 5, 2002
1,020
853
136
You are probably right. But in my area of work, bandwidth/requests per second is never an issue; latency is, or rather the "above normal" complexity of the data to display. So that 25% faster processing might be the difference in the application feeling fast enough to end users.
It's the difference between many simple requests vs. a few complex ones.

But isn't that exactly what Graviton3 is optimized for? All the extra performance comes from better per-thread performance, not from an increased thread count. So 64 G3 threads offer similar throughput to 128 EPYC threads while providing much better per-thread processing speed.

The core count hasn't been revealed yet, but they did reveal that latency decreased 35%, and since throughput didn't improve by vastly more than that, the G3 core count seems to be the same as G2's.
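A minimal sketch of the arithmetic behind that claim (the 64-vs-128 thread counts are the figures being discussed in this thread, not confirmed specs):

```python
# For 64 G3 threads to match the throughput of 128 EPYC threads,
# each G3 thread must deliver 128/64 = 2x the per-thread performance.
g3_threads, epyc_threads = 64, 128
print(f"Required per-thread advantage: {epyc_threads / g3_threads:.1f}x")  # 2.0x

# A 35% latency reduction by itself corresponds to ~1/0.65 = 1.54x per-thread speed;
# the remainder would have to come from SMT threads sharing a core on the EPYC side.
latency_reduction = 0.35
print(f"Per-thread speedup from latency alone: {1 / (1 - latency_reduction):.2f}x")  # 1.54x
```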
 

Thala

Golden Member
Nov 12, 2014
1,355
653
136
And obviously, even if we look beyond HW - judging by that Phoronix benchmark, software support for ARM64 is moving forward in leaps and bounds. I think 99% of "server" software like compilers, frameworks, and JDKs is already ARM64-capable and being optimized for ARM as we speak.

This is an excellent point. Last time I checked, the Phoronix benchmark suite contained many programs that were heavily hand-optimized for x64, making it a lousy benchmark in general when run on ARM. But, as you mention, I would not be surprised if action is taken to improve the situation over time.
 

tamz_msc

Diamond Member
Jan 5, 2017
3,865
3,730
136
I may be the only one who doesn't find this exciting at all.
300mm2 on N5 is a lot of transistors, more than EPYC Rome, when tailored for lower clocks and high density.
The efficiency gains are again contributed in a big way by N5.
25% more perf than Graviton 2 is not at all impressive in 2022. Graviton 2 is far behind even the 2nd-gen EPYC 7742 in almost all workloads.
How is it not impressive?

It's slightly faster than the Rome 7742 at less than half the power consumption. I doubt 5nm alone is responsible for the power savings. What's even more impressive is that they're fabbing the DDR5 controllers and PCIe 5 controllers each as separate dies, while AMD is going to be stuck with a single IO die that incorporates both. Graviton 3 can probably completely turn off the PCIe dies when not in use, getting immense power savings.

 

NTMBK

Lifer
Nov 14, 2011
10,448
5,831
136
How is it not impressive?

It's slightly faster than the Rome 7742 at less than half the power consumption. I doubt 5nm alone is responsible for the power savings. What's even more impressive is that they're fabbing the DDR5 controllers and PCIe 5 controllers each as separate dies, while AMD is going to be stuck with a single IO die that incorporates both. Graviton 3 can probably completely turn off the PCIe dies when not in use, getting immense power savings.


Apparently they have halved the number of PCIe lanes from Graviton 2:

The Graviton3 is the first to deliver PCI-Express 5.0 and DDR5, and the former can deliver high bandwidth with half as many lanes as its PCI-Express 4.0 predecessor while the latter can deliver 50 percent more memory bandwidth with the same capacity and in the same power envelope.


So that means they've dropped to only 32 lanes of PCIe, which will be a big part of the power savings. I'd be surprised if AMD does the same on their PCIe5 platform.

Sounds like they kept 8 channels of memory, but increased to DDR5-4800.
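The bandwidth math behind both halves of that quote, as a quick sketch (the per-lane PCIe rates are approximate standard figures; the 64-lane Graviton2 / 32-lane Graviton3 counts are the inference above, not a confirmed spec):

```python
# PCIe: per-lane bandwidth roughly doubles each generation,
# so half the lanes deliver the same aggregate bandwidth.
PCIE4_GBPS_PER_LANE = 2.0   # ~2 GB/s per PCIe 4.0 lane, one direction
PCIE5_GBPS_PER_LANE = 4.0   # ~4 GB/s per PCIe 5.0 lane
print(64 * PCIE4_GBPS_PER_LANE)  # 128.0 GB/s (Graviton2, assumed 64 lanes of PCIe 4.0)
print(32 * PCIE5_GBPS_PER_LANE)  # 128.0 GB/s (Graviton3, inferred 32 lanes of PCIe 5.0)

# DDR: 8 channels, each 64 bits (8 bytes) wide.
def mem_bw_gbps(mega_transfers: int, channels: int = 8) -> float:
    return mega_transfers * 8 * channels / 1000

print(mem_bw_gbps(3200))  # 204.8 GB/s at DDR4-3200 (Graviton2)
print(mem_bw_gbps(4800))  # 307.2 GB/s at DDR5-4800 -> exactly the quoted +50%
```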
 
  • Like
Reactions: Tlh97 and gdansk

Thala

Golden Member
Nov 12, 2014
1,355
653
136
Apparently they have halved the number of PCIe lanes from Graviton 2:



So that means they've dropped to only 32 lanes of PCIe, which will be a big part of the power savings.

That's not really a power issue at all. Unused lanes are power-gated and do not consume any power. The number of lanes is mostly an area and pin-count issue.
 

NTMBK

Lifer
Nov 14, 2011
10,448
5,831
136
That's not really a power issue at all. Unused lanes are power-gated and do not consume any power. The number of lanes is mostly an area and pin-count issue.

Surely there's the power cost of routing data within the IOD to all those IOs?
 

Gideon

Platinum Member
Nov 27, 2007
2,030
5,035
136
Surely there's the power cost of routing data within the IOD to all those IOs?
Yes, when they're all fired up. It's pretty clear that Amazon was interested in fitting 3 nodes into one server tray, and that limits package size. It also means that, per tray, the 50% cut in I/O is compensated for by having an extra CPU (and since all the lanes are PCIe 5.0 now, it's still formidable).



Bear in mind these 3 CPUs have a far smaller combined TDP than even two 180W Milan processors, let alone the 220W or 280W versions.

This chip draws 30% less power under full load than a Ryzen 5950X (100W vs. 142W), and that's despite also using chiplets and considerably more I/O.

Is TCO suddenly a totally irrelevant metric? I remember it being all the rage when Rome came out.
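For reference, a quick sketch of the tray-level math using the TDP figures above (the ~100W Graviton3 figure is the estimate being discussed here, not an official spec):

```python
# Per-tray power with the numbers from this thread.
g3_tray_watts = 3 * 100    # 3 Graviton3 nodes per tray -> ~300W, 3 * 64 = 192 cores
milan_180_tray = 2 * 180   # 2S tray of 180W Milan -> 360W, 128 cores
milan_280_tray = 2 * 280   # 2S tray of 280W Milan -> 560W, 128 cores

print(g3_tray_watts, milan_180_tray, milan_280_tray)  # 300 360 560
```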
 

NTMBK

Lifer
Nov 14, 2011
10,448
5,831
136
Yes, when they're all fired up. It's pretty clear that Amazon was interested in fitting 3 nodes into one server tray, and that limits package size. It also means that, per tray, the 50% cut in I/O is compensated for by having an extra CPU (and since all the lanes are PCIe 5.0 now, it's still formidable).



Bear in mind these 3 CPUs have a far smaller combined TDP than even two 180W Milan processors, let alone the 220W or 280W versions.

This chip draws 30% less power under full load than a Ryzen 5950X (100W vs. 142W), and that's despite also using chiplets and considerably more I/O.

Is TCO suddenly a totally irrelevant metric? I remember it being all the rage when Rome came out.

Oh for sure, it's an impressive part. I think ARM servers are looking very competitive.
 
  • Like
Reactions: Tlh97 and Gideon

jpiniero

Lifer
Oct 1, 2010
16,823
7,267
136
Is TCO suddenly a totally irrelevant metric? I remember it being all the rage when Rome came out.

If you think about it, TCO doesn't matter to AWS customers. Only the price they pay. TCO is Amazon's problem.
 

Thala

Golden Member
Nov 12, 2014
1,355
653
136
Surely there's the power cost of routing data within the IOD to all those IOs?

As Gideon already pointed out, the power cost is only there if the lanes are fired up (i.e. in use), and even then the power cost is largely a function of utilization.
 

LightningZ71

Platinum Member
Mar 10, 2017
2,536
3,227
136
TCO apparently isn't something to be considered in isolation anymore. Amazon seems to be taking a holistic, whole-system approach, since they control it all from top to bottom. It's more about TMpR (Total Margin per Rack). Since Amazon can go the (almost) custom route with everything, they can more delicately balance the TCO of the rack contents against the profit they can make on that rack. If they can double their number of sellable instances or, to use a more generic term, increase their total sold compute hours per rack by enough to surpass their own development, procurement, and operational cost per rack, they make more money over the long run. That was always a goal for most companies that weren't pure science in nature, but none of them had the scale that Amazon (and Google, MS, etc.) have to actually affect things like TCO at the design phase. In other words, you used to get a menu of options like Package 1, Package 2, and Package 3. Now, you get to say "I want a package with these specific options at the circuit and module level to exactly meet my needs".
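A minimal sketch of that TMpR idea in code. Every number below is purely hypothetical, and the formula is just the obvious revenue-minus-amortized-cost one, not anything Amazon has published:

```python
# Total Margin per Rack: sold compute revenue minus amortized rack costs.
def tmpr(instances: int, revenue_per_instance_yr: float,
         rack_build_cost: float, power_cooling_yr: float, years: int = 4) -> float:
    """Average yearly margin for one rack over its service life (all inputs hypothetical)."""
    revenue = instances * revenue_per_instance_yr * years
    cost = rack_build_cost + power_cooling_yr * years
    return (revenue - cost) / years

# A denser, cheaper-to-run custom rack can win even at lower per-instance pricing:
print(f"off-the-shelf rack: ${tmpr(400, 500, 300_000, 40_000):,.0f}/yr")  # $85,000/yr
print(f"custom rack:        ${tmpr(800, 400, 350_000, 30_000):,.0f}/yr")  # $202,500/yr
```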
 

JoeRambo

Golden Member
Jun 13, 2013
1,814
2,105
136
64 V1 cores in 100W is scary, considering the core was originally aimed at HPC projects where vector FP with SVE is important. This thing can do 4x128-bit or 2x256-bit SVE and is more at home in HPC/ML than in a "generic" cloud setting.
What Amazon did is basically "take the best off-the-shelf ARM core and cram as many as we can fit on a chiplet". And there's some real innovation at the system level too: this 3-chips-per-tray thing provides them with 192C per tray, which AMD's and Intel's most popular 2S systems will struggle to match in "rack" throughput.
 

moinmoin

Diamond Member
Jun 1, 2017
5,248
8,462
136
Apparently they have halved the number of PCIe lanes from Graviton 2:


So that means they've dropped to only 32 lanes of PCIe, which will be a big part of the power savings. I'd be surprised if AMD does the same on their PCIe5 platform.

Sounds like they kept 8 channels of memory, but increased to DDR5-4800.
Where did you get the info that Amazon halved the number of PCIe lanes from Graviton 2? If it's from that fluffy, overly wordy Next Platform piece, that's just Next Platform raving about PCIe 5 delivering double the bandwidth of PCIe 4 and as such needing only half the lanes for the same bandwidth, duh. Beyond the use of PCIe 5, there's no mention of Graviton 3's actual PCIe configuration in there at all.

this 3-chips-per-tray thing provides them with 192C per tray, which AMD's and Intel's most popular 2S systems will struggle to match in "rack" throughput.
Genoa will match and Bergamo surpass it in core count (never mind thread count).

The big difference is not performance density but power consumption, the resulting TDP, and the necessary cooling infrastructure. If Amazon's solution were available on the open market, many data centers may well choose it instead of moving from air to water cooling; but it isn't, so this may well turn out to be a miss for Arm-based servers overall...
 
Last edited:

LightningZ71

Platinum Member
Mar 10, 2017
2,536
3,227
136
I hadn't even thought about thread count. As I recall, the Neoverse V1 is a single-threaded processor. Zen 3 and Zen 4 are SMT2. Assuming Genoa is still going to be 96 cores (12 CCDs, 8-core CCXs), that's 192 threads PER socket, and 384 threads per 2S system. I can't imagine they will have a problem with system throughput, even if they have to dial back clocks a lot to keep power draw reasonable. This should be an interesting race...
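The socket/thread arithmetic spelled out, using the core counts assumed above:

```python
# Thread counts from the figures in this post.
genoa_threads_per_socket = 96 * 2                # 96 cores, SMT2 -> 192 threads
genoa_threads_2s = genoa_threads_per_socket * 2  # 384 threads per 2S system

v1_threads_per_socket = 64 * 1                   # Neoverse V1 has no SMT
graviton3_threads_per_tray = 3 * v1_threads_per_socket  # 192 across Amazon's 3-node tray

print(genoa_threads_2s, graviton3_threads_per_tray)  # 384 192
```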
 

jpiniero

Lifer
Oct 1, 2010
16,823
7,267
136
The big difference is not performance density but power consumption, the resulting TDP, and the necessary cooling infrastructure. If Amazon's solution were available on the open market, many data centers may well choose it instead of moving from air to water cooling; but it isn't, so this may well turn out to be a miss for Arm-based servers overall...

Cloud is a big chunk of the market. I wouldn't be surprised if the majority of Epyc sales are to the cloud, for instance.
 

moinmoin

Diamond Member
Jun 1, 2017
5,248
8,462
136
I hadn't even thought about thread count. As I recall, the Neoverse V1 is a single-threaded processor.
The only SMT-capable core Arm has launched so far is the A65, back in 2018. That the A6x line hasn't been updated since makes me think either Arm isn't interested in pursuing SMT further or there isn't enough demand for it.

Cloud is a big chunk of the market. I wouldn't be surprised if the majority of Epyc sales are to the cloud, for instance.
I think we already presumed (since the launch of Naples, whose main selling point at the time was increasing core density over Intel's offerings) that most sales go to the cloud, and that AMD is trying to expand from there into the slower-moving enterprise market.
 

Hitman928

Diamond Member
Apr 15, 2012
6,696
12,373
136
Cloud is a big chunk of the market. I wouldn't be surprised if the majority of Epyc sales are to the cloud, for instance.

This is correct; cloud customers make up the bulk of AMD's Epyc sales. Additionally, while AMD is starting to get a bigger foothold in the enterprise market, the cloud market is only growing, so they are doing a lot to keep those customers happy and continually buying Epyc.