News [Anandtech] Arm Announces Mobile Armv9 CPU Microarchitectures: Cortex-X2, Cortex-A710 & Cortex-A510


Gideon

Platinum Member
Nov 27, 2007
2,030
5,035
136
ARM consumer stack updates:

  • Finally a decent "little" core update (A55 -> A510) with a 35% performance gain
  • The big core update is less ambitious (A78 -> A710), with a 10% uplift mentioned
  • The X2 is supposedly 16% faster than the X1
  • Lots of other changes: the Armv9 ISA (with decent vector ops, finally), new interconnects, larger L3 cluster designs, and more

 
  • Like
Reactions: Tlh97 and NTMBK

JoeRambo

Golden Member
Jun 13, 2013
1,814
2,105
136
Thanks for that... The EPYC 7742 is 51% faster than Graviton 2, so this new processor should still be slower than the 7742, let alone Milan or Milan-X, which will be out in the same timeframe (or already are). It will be a massacre.

In a cloud setting, we are getting to the point of "good enough" for ARM. If this thing is 30% faster than Graviton2 in a 100W envelope, it won't matter for typical cloud workloads if a 225W CPU is, say, 25% faster.
TCO for Amazon will be highly in favor of Graviton3, and by the time the cloud guys sample Bergamo in 2023, Amazon will be out with Graviton4.

In generic cloud computing, the writing is on the wall for both Intel/AMD.
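A back-of-the-envelope sketch of that perf/watt argument (a minimal calculation using only the rough figures above - 30%, 25%, 100W, 225W - not measured numbers):

```python
# Hypothetical perf/watt comparison, normalizing Graviton2 performance to 1.0.
g3_perf, g3_power = 1.30, 100            # "30% faster than Graviton2 in a 100W envelope"
x86_perf, x86_power = 1.30 * 1.25, 225   # a 225W x86 part "say, 25% faster" than the G3

g3_ppw = g3_perf / g3_power              # ~0.0130 perf/W
x86_ppw = x86_perf / x86_power           # ~0.0072 perf/W
print(f"Graviton3 perf/W advantage: {g3_ppw / x86_ppw:.2f}x")  # ~1.80x
```

Under those assumptions the ARM part comes out around 1.8x better on perf/watt, which is the whole TCO story.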
 
  • Like
Reactions: Gideon

gdansk

Diamond Member
Feb 8, 2011
4,587
7,708
136
In generic cloud computing, the writing is on the wall for both Intel/AMD.
That seems only to be a problem in scale-out. And even in that segment, I'm not sure why you think it will exclude them. Genoa will have more cores and be faster still. And who knows whether Intel will do a many-core Goldmont derivative.
 

Doug S

Diamond Member
Feb 8, 2020
3,603
6,368
136
Somewhat off-topic but relevant enough to discuss: Andrei is leaving Anandtech :oops:

He was the main guy covering mobile, ARM, and often the nitty-gritty parts of processors in general (such as cache-latency graphs). While his coverage certainly wasn't perfect and his graphs were peculiar (we've complained about them in this very thread), it was at least very different from everyone else's and almost always highly informative.

I just can't help but wonder what happens to Anandtech in general now that Ian is the only one left really doing reviews or benchmarks, and he seems to have his own stint going with TechTechPotato ...

Ryan hasn't updated his GPU test suite since 2019 and hasn't done anything worth mentioning since the RTX 2xxx series.


Yikes, the technical quality of this site is about to take a big dive. What sucks is that there really isn't anyone anywhere else doing the kind of in-depth CPU microarchitecture reviews he did. You have people covering little pieces of it, like Hans DeVries, or the few people investigating the Apple M1 Max and so on, but they are mostly narrowly focused on a few things, like instruction timings and how they affect their pet application, rather than giving a more general overview. I always looked forward to his articles every October about Apple's new SoC, and to his deep dives into the nitty-gritty of the latest Intel or AMD designs.

Sounds like he's taking an industry job based on the "conflicts of interest" comment. Good luck wherever you're headed, Andrei!
 

beginner99

Diamond Member
Jun 2, 2009
5,318
1,763
136
In a cloud setting, we are getting to the point of "good enough" for ARM. If this thing is 30% faster than Graviton2 in a 100W envelope, it won't matter for typical cloud workloads if a 225W CPU is, say, 25% faster.

You are probably right. But in my area of work, bandwidth/requests per second is never an issue; latency is, or rather the "above normal" complexity of the data to display. So that 25% faster processing might be the difference in the application feeling fast enough to end users.
It's the difference between many simple requests vs. a few complex ones.
 

JoeRambo

Golden Member
Jun 13, 2013
1,814
2,105
136
You are probably right. But in my area of work, bandwidth/requests per second is never an issue; latency is, or rather the "above normal" complexity of the data to display. So that 25% faster processing might be the difference in the application feeling fast enough to end users.
It's the difference between many simple requests vs. a few complex ones.

Of course, I am dealing with the same type of complex requests every day, running our own servers, etc.
But even in this environment we find plenty of uses for "generic" cloud computing - work that is perfectly offloadable to the cloud.
Even if we were not using it directly, I know the 3rd-party services we already depend on are using Azure and Amazon stuff. They are probably on x64 for now, but heck, as long as they meet our SLA, they could run on ZX Spectrums; data is data.

So there is an expanding sea of generic computing that these ARM CPUs excel at. And with Graviton2 and 3, things are looking better each day.

And obviously, even if we look beyond HW - judging by that Phoronix benchmark, software support for ARM64 is moving forward in leaps and bounds. I think 99% of "server" software like compilers, frameworks, and JDKs is already ARM64-capable and being optimized for ARM as we speak.

A recent example with JDKs - Amazon is maintaining their own OpenJDK distro


That means a very important thing: they are testing it on their own ARM cloud infrastructure and hardware, and probably don't mind optimizing for their hardware either, as they seem to have very capable JDK developers as well! (for example, Announcing preview release for the generational mode to the Shenandoah GC | AWS Developer Tools Blog (amazon.com))

Mindshare is increasing. I used to say the biggest obstacle was the lack of a capable workstation for development; I think the M1 derivatives are now taking care of that. We need one more vendor with a Linux/Windows workstation and we are set.
 
  • Like
Reactions: BorisTheBlade82

DrMrLordX

Lifer
Apr 27, 2000
22,931
13,014
136
Sounds like he's taking an industry job based on the "conflicts of interest" comment. Good luck wherever you're headed, Andrei!

Might be sharing a cubicle with jonnyguru.

(Probably not, but you never know)
 

naukkis

Golden Member
Jun 5, 2002
1,020
853
136
You are probably right. But in my area of work, bandwidth/requests per second is never an issue; latency is, or rather the "above normal" complexity of the data to display. So that 25% faster processing might be the difference in the application feeling fast enough to end users.
It's the difference between many simple requests vs. a few complex ones.

But isn't that exactly what Graviton3 is optimized for? All the extra performance comes from better per-thread performance, not from an increased thread count. So 64 G3 threads offer similar throughput to 128 EPYC threads while providing much better per-thread processing speed.

The core count hasn't been revealed yet, but they did reveal that latency decreased 35%, and since throughput didn't improve by vastly more than that, the G3 core count seems to be the same as G2's.
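A minimal sketch of the arithmetic behind that claim (the 64-vs-128 thread counts are the figures being discussed in this thread, not confirmed specs):

```python
# For 64 G3 threads to match the throughput of 128 EPYC threads,
# each G3 thread must deliver 128/64 = 2x the per-thread performance.
g3_threads, epyc_threads = 64, 128
print(f"Required per-thread advantage: {epyc_threads / g3_threads:.1f}x")  # 2.0x

# A 35% latency reduction by itself corresponds to ~1/0.65 = 1.54x per-thread speed;
# the remainder would have to come from SMT threads sharing a core on the EPYC side.
latency_reduction = 0.35
print(f"Per-thread speedup from latency alone: {1 / (1 - latency_reduction):.2f}x")  # 1.54x
```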
 

Thala

Golden Member
Nov 12, 2014
1,355
653
136
And obviously, even if we look beyond HW - judging by that Phoronix benchmark, software support for ARM64 is moving forward in leaps and bounds. I think 99% of "server" software like compilers, frameworks, and JDKs is already ARM64-capable and being optimized for ARM as we speak.

This is an excellent point. Last time I checked, the Phoronix benchmark suite contained many programs that were heavily hand-optimized for x64, making it a lousy benchmark in general when run on ARM. But, as you mention, I would not be surprised if action is taken to improve the situation over time.
 

tamz_msc

Diamond Member
Jan 5, 2017
3,865
3,730
136
I may be the only one who doesn't find this exciting at all.
300mm2 on N5 is a lot of transistors, more than EPYC Rome, when tailored for lower clocks and high density.
The efficiency gains are again contributed in a big way by N5.
25% more perf than Graviton 2 is not at all impressive in 2022. Graviton 2 is far behind even the 2nd-gen EPYC 7742 in almost all workloads.
How is it not impressive?

It's slightly faster than the Rome 7742 at less than half the power consumption. I doubt 5nm alone is responsible for the power savings. What's even more impressive is that they're fabbing the DDR5 controllers and PCIe 5 controllers each as separate dies, while AMD is going to be stuck with a single IO die that incorporates both. Graviton 3 can probably completely turn off the PCIe dies when not in use, getting immense power savings.

 

NTMBK

Lifer
Nov 14, 2011
10,448
5,831
136
How is it not impressive?

It's slightly faster than the Rome 7742 at less than half the power consumption. I doubt 5nm alone is responsible for the power savings. What's even more impressive is that they're fabbing the DDR5 controllers and PCIe 5 controllers each as separate dies, while AMD is going to be stuck with a single IO die that incorporates both. Graviton 3 can probably completely turn off the PCIe dies when not in use, getting immense power savings.


Apparently they have halved the number of PCIe lanes from Graviton 2:

The Graviton3 is the first to deliver PCI-Express 5.0 and DDR5, and the former can deliver high bandwidth with half as many lanes as its PCI-Express 4.0 predecessor while the latter can deliver 50 percent more memory bandwidth with the same capacity and in the same power envelope.


So that means they've dropped to only 32 lanes of PCIe, which will be a big part of the power savings. I'd be surprised if AMD does the same on their PCIe5 platform.

Sounds like they kept 8 channels of memory, but increased to DDR5-4800.
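The bandwidth math behind both halves of that quote, as a quick sketch (the per-lane PCIe rates are approximate standard figures; the 64-lane Graviton2 / 32-lane Graviton3 counts are the inference above, not a confirmed spec):

```python
# PCIe: per-lane bandwidth roughly doubles each generation,
# so half the lanes deliver the same aggregate bandwidth.
PCIE4_GBPS_PER_LANE = 2.0   # ~2 GB/s per PCIe 4.0 lane, one direction
PCIE5_GBPS_PER_LANE = 4.0   # ~4 GB/s per PCIe 5.0 lane
print(64 * PCIE4_GBPS_PER_LANE)  # 128.0 GB/s (Graviton2, assumed 64 lanes of PCIe 4.0)
print(32 * PCIE5_GBPS_PER_LANE)  # 128.0 GB/s (Graviton3, inferred 32 lanes of PCIe 5.0)

# DDR: 8 channels, each 64 bits (8 bytes) wide.
def mem_bw_gbps(mega_transfers: int, channels: int = 8) -> float:
    return mega_transfers * 8 * channels / 1000

print(mem_bw_gbps(3200))  # 204.8 GB/s at DDR4-3200 (Graviton2)
print(mem_bw_gbps(4800))  # 307.2 GB/s at DDR5-4800 -> exactly the quoted +50%
```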
 
  • Like
Reactions: Tlh97 and gdansk

Thala

Golden Member
Nov 12, 2014
1,355
653
136
Apparently they have halved the number of PCIe lanes from Graviton 2:



So that means they've dropped to only 32 lanes of PCIe, which will be a big part of the power savings.

That's not really a power issue at all. Unused lanes are power-gated and do not consume any power. The number of lanes is mostly an area and pin-count issue.
 

NTMBK

Lifer
Nov 14, 2011
10,448
5,831
136
That's not really a power issue at all. Unused lanes are power-gated and do not consume any power. The number of lanes is mostly an area and pin-count issue.

Surely there's the power cost of routing data within the IOD to all those IOs?
 

Gideon

Platinum Member
Nov 27, 2007
2,030
5,035
136
Surely there's the power cost of routing data within the IOD to all those IOs?
Yes, when they're all fired up. It's pretty clear that Amazon was interested in fitting 3 nodes into one server tray, and that limits package size. It also means that, per tray, the 50% cut in I/O is compensated for by having an extra CPU (and since all the lanes are PCIe 5.0 now, it's still formidable).



Bear in mind these 3 CPUs have a far smaller combined TDP than even two 180W Milan processors, let alone the 220W or 280W versions.

This chip draws 30% less power under full load than a Ryzen 5950X (100W vs. 142W), and that's despite also using chiplets and considerably more I/O.

Is TCO suddenly a totally irrelevant metric? I remember it being all the rage when Rome came out.
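For reference, a quick sketch of the tray-level math using the TDP figures above (the ~100W Graviton3 figure is the estimate being discussed here, not an official spec):

```python
# Per-tray power with the numbers from this thread.
g3_tray_watts = 3 * 100    # 3 Graviton3 nodes per tray -> ~300W, 3 * 64 = 192 cores
milan_180_tray = 2 * 180   # 2S tray of 180W Milan -> 360W, 128 cores
milan_280_tray = 2 * 280   # 2S tray of 280W Milan -> 560W, 128 cores

print(g3_tray_watts, milan_180_tray, milan_280_tray)  # 300 360 560
```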
 

NTMBK

Lifer
Nov 14, 2011
10,448
5,831
136
Yes, when they're all fired up. It's pretty clear that Amazon was interested in fitting 3 nodes into one server tray, and that limits package size. It also means that, per tray, the 50% cut in I/O is compensated for by having an extra CPU (and since all the lanes are PCIe 5.0 now, it's still formidable).



Bear in mind these 3 CPUs have a far smaller combined TDP than even two 180W Milan processors, let alone the 220W or 280W versions.

This chip draws 30% less power under full load than a Ryzen 5950X (100W vs. 142W), and that's despite also using chiplets and considerably more I/O.

Is TCO suddenly a totally irrelevant metric? I remember it being all the rage when Rome came out.

Oh for sure, it's an impressive part. I think ARM servers are looking very competitive.
 
  • Like
Reactions: Tlh97 and Gideon

jpiniero

Lifer
Oct 1, 2010
16,823
7,267
136
Is TCO suddenly a totally irrelevant metric? I remember it being all the rage when Rome came out.

If you think about it, TCO doesn't matter to AWS customers. Only the price they pay. TCO is Amazon's problem.
 

Thala

Golden Member
Nov 12, 2014
1,355
653
136
Surely there's the power cost of routing data within the IOD to all those IOs?

As Gideon already pointed out, the power cost is only there if the lanes are fired up (i.e. in use), and even then the power cost is largely a function of utilization.
 

LightningZ71

Platinum Member
Mar 10, 2017
2,536
3,227
136
TCO apparently isn't something to be considered in isolation anymore. Amazon seems to be taking a holistic, whole-system approach, since they control it all from top to bottom. It's more about TMpR (Total Margin per Rack). Since Amazon can go the (almost) custom route with everything, they can more delicately balance the TCO of the rack contents against the profit they can make on that rack. If they can double their number of sellable instances or, to use a more generic term, increase their total sold compute hours per rack by enough to surpass their own development, procurement, and operational cost per rack, they make more money over the long run. That was always a goal for most companies that weren't pure science in nature, but none of them had the scale that Amazon (and Google, MS, etc.) have to actually affect things like TCO at the design phase. In other words, you used to get a menu of options like Package 1, Package 2, and Package 3. Now, you get to say "I want a package with these specific options at the circuit and module level to exactly meet my needs".
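A minimal sketch of that TMpR idea in code. Every number below is purely hypothetical, and the formula is just the obvious revenue-minus-amortized-cost one, not anything Amazon has published:

```python
# Total Margin per Rack: sold compute revenue minus amortized rack costs.
def tmpr(instances: int, revenue_per_instance_yr: float,
         rack_build_cost: float, power_cooling_yr: float, years: int = 4) -> float:
    """Average yearly margin for one rack over its service life (all inputs hypothetical)."""
    revenue = instances * revenue_per_instance_yr * years
    cost = rack_build_cost + power_cooling_yr * years
    return (revenue - cost) / years

# A denser, cheaper-to-run custom rack can win even at lower per-instance pricing:
print(f"off-the-shelf rack: ${tmpr(400, 500, 300_000, 40_000):,.0f}/yr")  # $85,000/yr
print(f"custom rack:        ${tmpr(800, 400, 350_000, 30_000):,.0f}/yr")  # $202,500/yr
```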
 

JoeRambo

Golden Member
Jun 13, 2013
1,814
2,105
136
64 V1 cores in 100W is scary, considering the core was originally aimed at HPC projects where vector FP with SVE is important. This thing can do 4x128-bit or 2x256-bit SVE and is more at home in HPC/ML than in a "generic" cloud setting.
What Amazon did is basically "take the best off-the-shelf ARM core and cram as many as we can fit on a chiplet". And there's some real innovation at the system level too: this 3-chips-per-tray thing provides them with 192C per tray, which AMD's and Intel's most popular 2S systems will struggle to match in "rack" throughput.
 

moinmoin

Diamond Member
Jun 1, 2017
5,248
8,462
136
Apparently they have halved the number of PCIe lanes from Graviton 2:


So that means they've dropped to only 32 lanes of PCIe, which will be a big part of the power savings. I'd be surprised if AMD does the same on their PCIe5 platform.

Sounds like they kept 8 channels of memory, but increased to DDR5-4800.
Where did you get the info that Amazon halved the number of PCIe lanes from Graviton 2? If it's from that fluffy, overly wordy Next Platform piece, that's just Next Platform raving about PCIe 5 delivering double the bandwidth of PCIe 4 and as such needing only half the lanes for the same bandwidth, duh. Beyond the use of PCIe 5, there's no mention of Graviton 3's actual PCIe configuration in there at all.

this 3-chips-per-tray thing provides them with 192C per tray, which AMD's and Intel's most popular 2S systems will struggle to match in "rack" throughput.
Genoa will match and Bergamo surpass it in core count (never mind thread count).

The big difference is not performance density but power consumption, the resulting TDP, and the necessary cooling infrastructure. If Amazon's solution were available on the open market, many data centers may well choose it instead of moving from air to water cooling; but it isn't, so this may well turn out to be a miss for Arm-based servers overall...
 
Last edited:

LightningZ71

Platinum Member
Mar 10, 2017
2,536
3,227
136
I hadn't even thought about thread count. As I recall, the Neoverse V1 is a single-threaded processor. Zen 3 and Zen 4 are SMT2. Assuming Genoa is still going to be 96 cores (12 CCDs, 8-core CCXs), that's 192 threads PER socket, and 384 threads per 2S system. I can't imagine they will have a problem with system throughput, even if they have to dial back clocks a lot to keep power draw reasonable. This should be an interesting race...
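The socket/thread arithmetic spelled out, using the core counts assumed above:

```python
# Thread counts from the figures in this post.
genoa_threads_per_socket = 96 * 2                # 96 cores, SMT2 -> 192 threads
genoa_threads_2s = genoa_threads_per_socket * 2  # 384 threads per 2S system

v1_threads_per_socket = 64 * 1                   # Neoverse V1 has no SMT
graviton3_threads_per_tray = 3 * v1_threads_per_socket  # 192 across Amazon's 3-node tray

print(genoa_threads_2s, graviton3_threads_per_tray)  # 384 192
```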
 

jpiniero

Lifer
Oct 1, 2010
16,823
7,267
136
The big difference is not performance density but power consumption, the resulting TDP, and the necessary cooling infrastructure. If Amazon's solution were available on the open market, many data centers may well choose it instead of moving from air to water cooling; but it isn't, so this may well turn out to be a miss for Arm-based servers overall...

Cloud is a big chunk of the market. I wouldn't be surprised if the majority of Epyc sales are to the cloud, for instance.
 

moinmoin

Diamond Member
Jun 1, 2017
5,248
8,462
136
I hadn't even thought about thread count. As I recall, the Neoverse V1 is a single-threaded processor.
The only SMT-capable core Arm has launched so far is the A65, back in 2018. That the A6x line hasn't been updated since makes me think either Arm isn't interested in pursuing SMT further or there isn't enough demand for it.

Cloud is a big chunk of the market. I wouldn't be surprised if the majority of Epyc sales are to the cloud, for instance.
I think we already presumed (since the launch of Naples, whose main selling point at the time was increasing core density over Intel's offerings) that most sales go to the cloud, and that AMD is trying to expand from there into the slower-moving enterprise market.
 

Hitman928

Diamond Member
Apr 15, 2012
6,696
12,373
136
Cloud is a big chunk of the market. I wouldn't be surprised if the majority of Epyc sales are to the cloud, for instance.

This is correct; cloud customers make up the bulk of AMD's Epyc sales. Additionally, while AMD is starting to get a bigger foothold in the enterprise market, the cloud market is only growing, so they are doing a lot to keep those customers happy and continually buying Epyc.