News Ampere Altra Launched with 80 Arm Cores for the Cloud (Performance Estimates)

csbin · Mar 3, 2020

Ampere Altra Launched with 80 Arm Cores for the Cloud

The Ampere Altra 80-core Arm server CPU is upon us with 128 PCIe Gen4 lanes per CPU, CCIX, and dual-socket server capabilities. We assess the impact.

www.servethehome.com

Schmide · Mar 4, 2020

Andrei. said:
Saying NEON doesn't even reach SSE is insane, please name one thing you cannot do with NEON?

I could name a few but by far the biggest is movemask.
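
For anyone who hasn't used it: movemask (`pmovmskb`, exposed as `_mm_movemask_epi8` on SSE2) gathers the top bit of each of the 16 byte lanes into one integer, which is handy right after a byte-wise compare. NEON has no single-instruction counterpart. Below is a minimal scalar sketch in Python of what the operation computes; the function name and sample data are made up for illustration, and this is not intrinsic code.

```python
def movemask_epi8(lanes):
    """Emulate SSE2 pmovmskb: pack the MSB of each of 16 byte lanes into an int."""
    assert len(lanes) == 16
    mask = 0
    for i, byte in enumerate(lanes):
        mask |= ((byte >> 7) & 1) << i
    return mask

# Typical use: a byte-wise compare yields 0xFF or 0x00 per lane,
# and the resulting mask tells you which lanes matched.
haystack = b"hello, neon/sse!"          # exactly 16 bytes
cmp_result = [0xFF if c == ord("e") else 0x00 for c in haystack]
mask = movemask_epi8(cmp_result)

# Index of the first matching lane (lowest set bit), or -1 if none matched.
first_match = (mask & -mask).bit_length() - 1 if mask else -1
```

On NEON the same result usually takes a short sequence of narrowing or pairwise-add instructions, which is the gap being pointed at here.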

USER8000 · Mar 4, 2020

moinmoin said:
According to the STH article the 80 cores ARM chip is a 210W TDP part as well. And the End Notes above are a huge bummer with those Epyc and Xeon results being retroactively reduced to make the results of different compilers "more comparable". They should have excluded Epyc from the comparison, the latest Epyc really is way too close to those ARM efforts to make one wonder why bother.

Because it's about generating hype and best-case scenarios, and then you have people doing the following on the internet:

A lot of these competing parts don't see significant volume, or don't actually ship in significant volume for some time after the announcement, so they end up competing with later designs from incumbents. Then there is also the matter of support and experience: delivering designs, supporting them, ease of acquiring spare parts, etc.

What has piqued my interest is the Fujitsu A64FX, which is doing something different from the current paradigm, and Fujitsu is experienced.

insertcarehere · Mar 4, 2020

@Andrei.

What sort of timeframe are you looking at for releasing the results of your testing of the AWS Graviton2 implementation? It would be an interesting look into how the N1 core performs.

Schmide · Mar 5, 2020

Pushing the technicalities of metrics aside, this is a decent step beyond niche status. The ecosystem may have some lag time, but there is a whole generation of ARM users and developers who will fill out the toolchain, from phones to SBCs to these servers.

What amazes me is that these things are basically on par with EPYC in terms of PCIe lanes, interconnects, and memory. Yeah, they're not identical, but all the pieces are there.

Nothingness · Mar 5, 2020

Schmide said:
Estimates, rates, synthetics, and overwhelmingly stream-oriented workloads may favor a certain niche. Other workloads, not so much. x86 is on its fourth major iteration of SIMD. NEON, while moving fast, has still not even reached the level of SSE.

I'd like you to list where you think NEON is lagging behind SSE. I know people who wrote assembly routines for FFmpeg and they said NEON is better than SSE. If you don't mind I'll take their words rather than yours, unless you provide evidence.

Schmide · Mar 5, 2020

Nothingness said:
I'd like you to list where you think NEON is lagging behind SSE. I know people who wrote assembly routines for FFmpeg and they said NEON is better than SSE. If you don't mind I'll take their words rather than yours, unless you provide evidence.

I actually said the exact same thing Tuesday at [H] (can't go back in time)

Could be very good at video, as anyone who's worked with NEON and its cousins understands from their color channel muxing.

That actually is an area where NEON is a bit better, or at least more specifically optimized for RGB. Its vector loads interleave and de-interleave RGB seamlessly, where SSE would require shuffles and blends. It also has vzip/vuzp, the equivalents of pack/unpack, to alternate data quickly.
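
To make the interleaved-load point concrete, here is a plain-Python sketch of the data movement that a NEON structure load such as `vld3` performs on packed RGB bytes, plus a `vzip`-style interleave going the other way. The helper names are hypothetical, and this models only where the bytes end up, not performance.

```python
def ld3_deinterleave(packed):
    """Model of a vld3-style load: RGBRGB... -> (R lanes, G lanes, B lanes)."""
    assert len(packed) % 3 == 0
    return packed[0::3], packed[1::3], packed[2::3]

def zip_lanes(a, b):
    """Model of a vzip-style interleave of two equal-length vectors.

    The real instruction returns the result split across two registers;
    here the two halves are simply concatenated.
    """
    assert len(a) == len(b)
    out = []
    for x, y in zip(a, b):
        out.extend((x, y))
    return out

pixels = [10, 20, 30, 11, 21, 31, 12, 22, 32, 13, 23, 33]  # 4 RGB pixels
r, g, b = ld3_deinterleave(pixels)
rg = zip_lanes(r, g)
```

With SSE the same split into planes takes a handful of shuffle and blend steps after a plain load, which is the difference described above.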

Please take other people's word for the best available information.

I often wonder how ARM's 256-bit SIMD will differ from AVX. Laning has its advantages as well as its annoyances. There were a lot of growing pains in the early days of Intel SIMD. ARM will have to go through the same process to reach parity.

Different architectures have different trade-offs, which is what most of my arguments against the "IPC is greater" claim expand on. However, IMO there is at most a small set of operations where a simple, efficient core can outperform a monolithic x86.

When full reviews are made, I hope I am pleasantly surprised.

Nothingness · Mar 5, 2020

Schmide said:
I actually said the exact same thing Tuesday at [H] (can't go back in time)

That actually is an area where NEON is a bit better, or at least more specifically optimized for RGB. Its vector loads interleave and de-interleave RGB seamlessly, where SSE would require shuffles and blends. It also has vzip/vuzp, the equivalents of pack/unpack, to alternate data quickly.

Please take other people's word for the best available information.

Well, given that you admit it's good for video, which is the info I had, I have no other source to counter your previous claim, so no provable reason not to believe you.

I often wonder how ARM's 256-bit SIMD will differ from AVX. Laning has its advantages as well as its annoyances. There were a lot of growing pains in the early days of Intel SIMD. ARM will have to go through the same process to reach parity.

ARM isn't following the silly Intel path of creating a new ISA for each widening of vectors. To go beyond NEON they have SVE, which is vector-length agnostic. Obviously, I guess that to get the best performance you'll have to stick to some vector length, but it's still the same instructions from 128-bit up to whatever chips will implement (up to 2048-bit, but I don't think anyone will go that far). And no, I'm not qualified enough to comment on whether it's better than AVX-512 or not.
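
A rough way to picture "vector-length agnostic": the loop steps by however many lanes the hardware has and uses a predicate to mask off the tail, so the same code gives the same answer at any width. The sketch below is plain Python with `vl_bits` standing in for the hardware vector length; the names are illustrative, not SVE intrinsics.

```python
def vla_add(a, b, vl_bits):
    """Add two lists of 32-bit values the way an SVE-style loop would."""
    assert vl_bits % 128 == 0 and 128 <= vl_bits <= 2048
    lanes = vl_bits // 32          # lanes per vector for 32-bit elements
    n = len(a)
    out = [0] * n
    i = 0
    while i < n:
        # Predicate: lane k is active only while i + k < n (WHILELT-style).
        pred = [i + k < n for k in range(lanes)]
        for k, active in enumerate(pred):
            if active:
                out[i + k] = a[i + k] + b[i + k]
        i += lanes                 # advance by the hardware vector length
    return out

a = list(range(10))
b = [100] * 10
results = {vl: vla_add(a, b, vl) for vl in (128, 256, 512, 2048)}
```

Nothing in the loop body names a width, which is the contrast with having separate SSE, AVX and AVX-512 code paths.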

SarahKerrigan · Mar 5, 2020

Schmide said:
I actually said the exact same thing Tuesday at [H] (can't go back in time)

That actually is an area where NEON is a bit better, or at least more specifically optimized for RGB. Its vector loads interleave and de-interleave RGB seamlessly, where SSE would require shuffles and blends. It also has vzip/vuzp, the equivalents of pack/unpack, to alternate data quickly.

Please take other people's word for the best available information.

I often wonder how ARM's 256-bit SIMD will differ from AVX. Laning has its advantages as well as its annoyances. There were a lot of growing pains in the early days of Intel SIMD. ARM will have to go through the same process to reach parity.

Different architectures have different trade-offs, which is what most of my arguments against the "IPC is greater" claim expand on. However, IMO there is at most a small set of operations where a simple, efficient core can outperform a monolithic x86.

When full reviews are made, I hope I am pleasantly surprised.

N1 isn't a "simple efficient core", though. It is a big, serious OoO core with a very aggressive cache hierarchy, and while its SIMD is indeed narrower than current x86 types, this is basically irrelevant to most non-HPC server code streams.

I absolutely think it's plausible that iso-clock integer ST perf exceeds that of SKL, and I'm excited for the review.

Richie Rich · Mar 6, 2020

SVE2 has some nice HW and efficiency advantages:

A 2048-bit vector is 16x longer than a 128-bit NEON vector, so the reorder engine saves energy when searching for dependencies, and the reorder buffer saves 15 entries (the work is packed, like macro-ops). So IMHO it's more about efficiency (the primary target for a mobile uarch) than about performance (a secondary target/side effect).

Andrei is right about the ST IPC comparison, but real MT performance is influenced by SMT in ARM's favor, because SMT lowers Zen2's per-thread IPC by roughly half. I also assume that a fully MT-loaded Neoverse N1 will be close to A76 IPC (as N1 is boosted by its cache in ST).

For iso-clock:

Zen2 is +25% faster than A76 and SMT gives another +25% throughput, for a total of +56% (1.56x) IPC over A76 per two threads, or 0.78x per thread.
Under real load, A76 will therefore be faster per thread (1/0.78 = 1.28, i.e. +28% at iso-clock) despite its narrower core and lower ST IPC.

For real clock:

Zen2 @ 2.6 GHz vs. Altra @ 3.0 GHz: the result is 1.28 x 3.0/2.6 = 1.48x faster per thread.
All-core throughput will be Altra's 80 x 1.48 = 118.4 vs. EPYC's 128... EPYC wins!
Unless Altra clocks at 3.3 GHz, in which case Altra is 80 x 1.62 = 130... a tight win for Altra (now we know where that 3.3 GHz comes from).
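
The back-of-envelope numbers above can be reproduced in a few lines. The two +25% figures and the clocks are this post's assumptions, not measurements, and the variable names are only for illustration.

```python
# Inputs are the post's assumptions, not measured values.
ZEN2_VS_A76_ST = 1.25   # Zen2 single-thread IPC relative to A76
SMT_UPLIFT = 1.25       # extra throughput from running two threads per core

zen2_core = ZEN2_VS_A76_ST * SMT_UPLIFT   # per core with two threads: 1.5625
zen2_thread = zen2_core / 2               # per thread under SMT: ~0.78
a76_vs_zen2_thread = 1 / zen2_thread      # A76 thread vs. loaded Zen2 thread

def altra_total(cores, altra_ghz, epyc_ghz=2.6):
    """Altra throughput in units of one loaded EPYC thread at epyc_ghz."""
    return cores * a76_vs_zen2_thread * altra_ghz / epyc_ghz

EPYC_THREADS = 128                        # 64 cores x 2 threads
at_3_0 = altra_total(80, 3.0)
at_3_3 = altra_total(80, 3.3)
```

Without the intermediate rounding to 1.48, the 3.0 GHz case comes out at about 118.2 rather than 118.4, and the 3.3 GHz case at about 130, so the conclusion does not change.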

To sum up, Altra will outperform EPYC systems per thread despite having weaker cores (remember, A76 is only 3xALU+1xJump vs. Zen2's 4xALU) and will at least match them in overall performance (the big shared L3 cache in an 80-core monolith could also be a performance advantage for some types of code). As soon as they implement A77 (a much wider 4xALU+2xJump design) with +20% IPC (and +8% higher than Zen2), things will get even more interesting for customers. If Zen3's IPC jump is lower than A77's +20%, the situation will turn in favor of ARM systems even more.

JoeRambo · Dec 18, 2020

AnandTech now has an awesome review of this beast:

www.anandtech.com

What is interesting is that AnandTech's JVM testing has finally almost caught up with real-world usage, and they are running instances as they should be:

JAVA_OPTS_BE="-server -Xms172g -Xmx172g -Xmn156g -XX:+AlwaysPreTouch"

Looking great here when combined with "numactl --cpunodebind" and "--membind" to bind to a specific NUMA node.

But it can be made even better by using -XX:+UseTransparentHugePages. Since the JVM allocates all memory up front with -XX:+AlwaysPreTouch, you end up madvising all those pre-touched pages at init, and instead of 45088768 4K pages you end up with 88064 2M THP pages.
The actual performance impact of course depends on access patterns, but we've seen 5-10% free performance. What is obvious, though, is the chunk of memory saved in page-table size.
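
The page counts above follow directly from the 172g heap. The page-table estimate in this sketch rests on an assumption that is not from the review: 8 bytes per last-level page-table entry (typical for 64-bit x86 and for AArch64 with a 4K granule), with upper-level tables ignored.

```python
GIB = 1024 ** 3
heap_bytes = 172 * GIB                    # -Xms172g / -Xmx172g

pages_4k = heap_bytes // (4 * 1024)       # number of 4K pages in the heap
pages_2m = heap_bytes // (2 * 1024 ** 2)  # number of 2M huge pages

# Assumption: 8 bytes per last-level page-table entry; upper levels ignored.
PTE_BYTES = 8
pte_mib_4k = pages_4k * PTE_BYTES / 1024 ** 2
```

Under that assumption the 4K mapping needs roughly 344 MiB of last-level page-table entries for the heap alone, and most of that goes away once the heap is backed by 2M pages.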

Oh, and they even asked on Twitter whether people use THP enabled by default. The answer to that question is: they don't, but they make use of madvise THP where possible, like the JVM support above.

SarahKerrigan · Dec 18, 2020

Wow, that's quite a chip! Wonder how it would be doing if it wasn't constrained by a fun-size LLC. I really hope they aren't going to end up trying to use a 32MB LLC for Altra Max too - that would start to get a bit silly.

Looks like the perf/W sweet spot is probably a bit further down the SKU list - the Q64-24, in particular, looks like a heck of a lot of compute in a pretty minimal power profile.

Gideon · Dec 18, 2020

SarahKerrigan said:
Wow, that's quite a chip! Wonder how it would be doing if it wasn't constrained by a fun-size LLC.

Looks like the perf/W sweet spot is probably a bit further down the SKU list - the Q64-24, in particular, looks like a heck of a lot of compute in a pretty minimal power profile.

The 128 core version will probably be clocked a bit lower, but should still be a good option for workloads that can use all the threads.

Milan should be competitive with it (at a bit worse power draw but often a bit better performance), but Genoa vs. 5nm ARM cores is going to get tight for AMD.
ARM promises +50% IPC for Neoverse V1. To compete, Genoa really has to do some magic on the uncore/packaging side as well as deliver another IPC uplift of around 20%.

insertcarehere · Dec 18, 2020

Gideon said:
The 128 core version will probably be clocked a bit lower, but should still be a good option for workloads that can use all the threads.

Milan should be competitive with it (at a bit worse power draw but often a bit better performance), but Genoa vs. 5nm ARM cores is going to get tight for AMD.
ARM promises +50% IPC for Neoverse V1. To compete, Genoa really has to do some magic on the uncore/packaging side as well as deliver another IPC uplift of around 20%.

The more impressive thing is that Ampere can put 80 (maybe 128?) N1 cores into one single die w/ I/O. AMD on the same process had to resort to a ton of chiplets for 64 cores, and Intel's Ice Lake-SP supposedly only fits 42 cores into a single die.

beginner99 · Dec 18, 2020

insertcarehere said:
AMD on the same process had to resort to a ton of chiplets for 64 cores

Had to? or was it just smart in terms of available 7nm capacity and re-usability?

uzzi38 · Dec 18, 2020

beginner99 said:
Had to? or was it just smart in terms of available 7nm capacity and re-usability?

The latter.

I kind of want to know what such a Rome/Milan would look like in terms of power draw tbh. One without the I/O die. Would have been cool to see.

amrnuke · Dec 18, 2020

Gideon said:
The 128 core version will probably be clocked a bit lower, but should still be a good option for workloads that can use all the threads.

Milan should be competitive with it (at a bit worse power draw but often a bit better performance), but Genoa vs. 5nm ARM cores is going to get tight for AMD.
ARM promises +50% IPC for Neoverse V1. To compete, Genoa really has to do some magic on the uncore/packaging side as well as deliver another IPC uplift of around 20%.

What's impressive is that despite what was labeled as PR fluff in March, this chip actually met expectations: SPEC 2017 rate-n int scores on GCC are 1.05-1.06x the EPYC 7742's and 2.35x the Xeon 8280's, which is pretty close to what they claimed.

While, yes, that's against chips that came out >18 months ago, and it'd be interesting to see how it stacks up to the Zen3 server chips that are surely already rolled out in some application or another, this is still a great accomplishment.

And if the 50% IPC for V1 doesn't result in any substantial clock speed losses or too much power usage, could be a very awesome next few years in the server/data center space.

name99 · Dec 18, 2020

moinmoin said:
According to the STH article the 80 cores ARM chip is a 210W TDP part as well. And the End Notes above are a huge bummer with those Epyc and Xeon results being retroactively reduced to make the results of different compilers "more comparable". They should have excluded Epyc from the comparison, the latest Epyc really is way too close to those ARM efforts to make one wonder why bother.

The reason for "bothering" is an expectation that ARM will get faster more rapidly than AMD. Which means that the large companies that aren't Amazon (and Apple?...?), if they have any sense, will be buying a few today to start preparing for their large scale transitions over the next few years.

name99 · Dec 18, 2020

Schmide said:
Single lane 128bit simd. Certainly not as beefy as it could have been but fp16 support is good.

Probably the best balance for where arm is in this stage of development.

Please, for the love of god, stop talking about "128bit simd" like that is some sort of monolithic thing across all ARM.
There is ZERO reason why multiple SIMD units implemented on a very wide machine should perform any worse than fewer wider SIMD units, and this is in fact exactly what we see for Apple.

ARM MacBook vs Intel MacBook: a SIMD benchmark

In my previous blog post, I compared the performance of my new ARM-based MacBook Pro with my 2017 Intel-based MacBook Pro. I used a number parsing benchmark. In some cases, the ARM-based MacBook Pro was nearly twice as fast as the older Intel-based MacBook Pro. I think that the Apple M1...

lemire.me

That's for SIMD used for "data processing". If you prefer SIMD used for dense linear algebra:

LG Electronics IceLake Platform vs iPhone13,3 - Geekbench

browser.geekbench.com

(look at SGEMM and SFFT).

If you're unhappy with the Altra SIMD results, complain about Altra. But STFU about "128bit simd" in general, at least till you have schooled yourself in the different IMPLEMENTATIONS of NEON, and in what Apple achieves with their NEON performance.

moinmoin · Dec 18, 2020

Looks to be a great chip. I wonder how big the die is. The slightly awkward cooler seems to indicate it's not that big.

For AMD this reinforces its roadmap, they'll have to continue executing well to keep up and get/stay ahead with the advances of Apple M1 in laptop space and now Ampere Altra in server space. Exciting times! Tough luck for Intel though.

name99 · Dec 18, 2020

Schmide said:
I actually said the exact same thing Tuesday at [H] (can't go back in time)

That actually is an area where NEON is a bit better, or at least more specifically optimized for RGB. Its vector loads interleave and de-interleave RGB seamlessly, where SSE would require shuffles and blends. It also has vzip/vuzp, the equivalents of pack/unpack, to alternate data quickly.

Please take other people's word for the best available information.

I often wonder how ARM's 256-bit SIMD will differ from AVX. Laning has its advantages as well as its annoyances. There were a lot of growing pains in the early days of Intel SIMD. ARM will have to go through the same process to reach parity.

Different architectures have different trade-offs, which is what most of my arguments against the "IPC is greater" claim expand on. However, IMO there is at most a small set of operations where a simple, efficient core can outperform a monolithic x86.

When full reviews are made, I hope I am pleasantly surprised.

What EXACTLY are you wondering about?
There is no ARM 256bit SIMD. There is SVE/2, and you don't have to wonder about it, there's plenty of documentation available on the internet...

itsmydamnation · Dec 18, 2020

name99 said:
The reason for "bothering" is an expectation that ARM will get faster more rapidly than AMD. Which means that the large companies that aren't Amazon (and Apple?...?), if they have any sense, will be buying a few today to start preparing for their large scale transitions over the next few years.

You ARM fanboys don't really live in the real world, do you?

Also, almost all non-hyperscalers buy via an OEM/ODM. Until the likes of Dell and HPE are fully on board, the TCO for an entire ARM-based server is going to be higher for volume customers, because storage/memory costs are what really drive server costs.

Maybe in 2-3 gens' time, when an advantage might start to exist, people might start planning, but right now they don't give a crap. Also, enterprise licensing SKUs don't favour massive core counts. The big outlier here is RHEL; let's wait and see how long it takes IBM to "fix" that, just like they fixed CentOS.

The thing that will be interesting to me is how ARM server companies sustain themselves until they can reach viable levels of market share. The TAM for their SoCs is smaller compared to x86, and with single-SoC products they will eat more cost in the lower and mid SKUs where most servers are sold (18-24 cores for Intel, 32 for AMD) compared to AMD's chiplet and Intel's multi-SoC approach. In a race to the bottom, both AMD and Intel have big advantages in "floor" pricing: Intel can eat manufacturing margin, and AMD amortises costs and yield over both consumer and server.

If I've learnt anything over 15 years of designing/selling datacentre infrastructure, it's that the market in general always cares way less than you think it does. You have to have a massive, undeniable advantage over multiple generations to gain inertia.

insertcarehere · Dec 18, 2020

beginner99 said:
Had to? or was it just smart in terms of available 7nm capacity and re-usability?

The actual EPYC CPU turned out to have eight 75mm^2 chiplets and a ~450mm^2 14nm IO die. A hypothetical monolithic die to fit those together might straight up exceed what can be fabbed on 7nm due to the size alone.
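
A quick sanity check of that claim, using the die sizes quoted above. The one outside number is the reticle limit, assumed here to be the standard 26 mm x 33 mm exposure field; the naive sum also ignores that the I/O die would shrink somewhat if ported from 14nm to 7nm.

```python
CHIPLETS = 8
CHIPLET_MM2 = 75        # per the post
IO_DIE_MM2 = 450        # per the post, on 14nm

# Assumption: a standard lithography reticle field of 26 mm x 33 mm.
RETICLE_MM2 = 26 * 33

naive_total = CHIPLETS * CHIPLET_MM2 + IO_DIE_MM2
exceeds_reticle = naive_total > RETICLE_MM2
```

That gives about 1050 mm^2 against a limit of 858 mm^2, so the "might exceed what can be fabbed" point holds unless the I/O portion shrinks a lot, and I/O and analog blocks tend to scale poorly.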

moinmoin · Dec 18, 2020

Btw. anybody happen to know what the 4926 contacts of the Altra socket could be for? I/O is comparable to Epyc, but SP3 manages with "only" 4094 contacts (so Altra has over 20% more).

beginner99 · Dec 19, 2020

itsmydamnation said:
the TCO for an entire ARM-based server is going to be higher for volume customers, because storage/memory costs are what really drive server costs.

My 2 cents is that we can't (yet?) compare based on list prices from AMD/Intel. I'm finally getting the budget for a nice server for data science stuff.
Anyway, from the total hardware cost, I can say that server CPU list prices in the x86 camp are essentially meaningless. The list price for the CPU would make up over half of the total server cost (which includes an expensive A100; i.e., the CPU + GPU list price is already more than we pay for the whole thing).

RasCas99 · Dec 19, 2020

itsmydamnation said:
You ARM fanboys don't really live in the real world, do you?

Also, almost all non-hyperscalers buy via an OEM/ODM. Until the likes of Dell and HPE are fully on board, the TCO for an entire ARM-based server is going to be higher for volume customers, because storage/memory costs are what really drive server costs.

Maybe in 2-3 gens' time, when an advantage might start to exist, people might start planning, but right now they don't give a crap. Also, enterprise licensing SKUs don't favour massive core counts. The big outlier here is RHEL; let's wait and see how long it takes IBM to "fix" that, just like they fixed CentOS.

The thing that will be interesting to me is how ARM server companies sustain themselves until they can reach viable levels of market share. The TAM for their SoCs is smaller compared to x86, and with single-SoC products they will eat more cost in the lower and mid SKUs where most servers are sold (18-24 cores for Intel, 32 for AMD) compared to AMD's chiplet and Intel's multi-SoC approach. In a race to the bottom, both AMD and Intel have big advantages in "floor" pricing: Intel can eat manufacturing margin, and AMD amortises costs and yield over both consumer and server.

If I've learnt anything over 15 years of designing/selling datacentre infrastructure, it's that the market in general always cares way less than you think it does. You have to have a massive, undeniable advantage over multiple generations to gain inertia.

Why does everything need to be about fanboys? Dude, you are replying in an ARM CPU thread... the topic will be ARM CPUs. I'll go off topic with this one post, but I had to write it.
MS is working on its own ARM servers, Amazon is already there, and Google will be on board shortly. It's not about fanboys, it's about money. With you being a salesperson, I'm surprised you haven't figured it out yet. Do you really believe that Amazon and MS, two of the biggest cloud companies in the world bar none, building their own ARM servers is not enough to push this segment forward? It already happened. Once Amazon started building its own server CPUs, you'd better believe MS was already recruiting. I know because I got the first recruitment call 3-4 years ago. Could it be for a different CPU? Sure, but my guess is it's for a server part. Google has been recruiting for the past 3 years, heavily at that, and I don't believe it's just for their camera IP.

The biggest problem for Intel and AMD is that their biggest clients are so large and rich that they have decided DIY is more beneficial to them. So I wouldn't worry about Altra CPUs if I were Intel/AMD; the ARM competition is coming fast and furious from the companies that are buying those AMD/Intel CPUs in bulk today.
