
News Ampere Altra Launched with 80 Arm Cores for the Cloud (Performance Estimates)


USER8000

Golden Member
Jun 23, 2012
1,516
737
136
According to the STH article, the 80-core ARM chip is a 210 W TDP part as well. And the End Notes above are a huge bummer, with those Epyc and Xeon results being retroactively reduced to make the results of different compilers "more comparable". They should have excluded Epyc from the comparison; the latest Epyc is so close to those ARM efforts that one wonders why bother.
Because it's about generating hype and best-case scenarios, and then you have people doing the following on the internet:



A lot of these competing parts don't see significant volume, or don't actually ship in significant volume until some time after the announcement, so they end up competing with later designs from incumbents (and then there is also the matter of support and experience in delivering designs, ease of acquiring spare parts, etc.).

What has piqued my interest is the Fujitsu A64FX, which is doing something different from the current paradigm, and Fujitsu are experienced.
 
  • Like
Reactions: lightmanek

insertcarehere

Senior member
Jan 17, 2013
304
110
116
@Andrei.

In what sort of timeframe would you be able to release the results of your testing of the AWS Graviton2 implementation? It would be an interesting look into how the N1 core performs.
 

Schmide

Diamond Member
Mar 7, 2002
5,326
216
106
Pushing the technicalities of metrics aside, this is a decent step beyond niche. The ecosystem may have some lag time, but there is a whole generation of ARM users and developers who will fill out the toolchain, from phones to SBCs to these servers.

What amazes me is that these things are basically on par with Epyc in terms of PCIe lanes, interconnects, and memory. Yeah, they're not an exact match, but all the pieces are there.
 

Nothingness

Platinum Member
Jul 3, 2013
2,153
397
126
Estimates, rates, synthetics, and overwhelmingly stream-oriented workloads may favor a certain niche. Other workloads, not so much. x86 is on its fourth major iteration of SIMD. NEON, while moving fast, has still not even reached the level of SSE.
I'd like you to list where you think NEON is lagging behind SSE. I know people who wrote assembly routines for FFmpeg and they said NEON is better than SSE. If you don't mind I'll take their words rather than yours, unless you provide evidence.
 

Schmide

Diamond Member
Mar 7, 2002
5,326
216
106
I'd like you to list where you think NEON is lagging behind SSE. I know people who wrote assembly routines for FFmpeg and they said NEON is better than SSE. If you don't mind I'll take their words rather than yours, unless you provide evidence.
I actually said the exact same thing Tuesday at [H] (can't go back in time)

Could be very good at video, as anyone who's worked with NEON and its cousins understands from their color channel muxing.
That actually is an area where NEON is a bit better, or at least more specifically optimized for RGB. Its vector loads and stores interleave and de-interleave RGB seamlessly, where SSE would require shuffles and blends. It also has the equivalent vzip/vuzp instructions to pack and unpack alternating data quickly.
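To make the interleave/de-interleave point concrete, here is a rough sketch in plain Python of what those operations do to the data layout. It is only a model of the data movement, not NEON code: the function names are made up for illustration, and mapping them to the `vld3`/`vst3` structure load/store instructions and to `vzip`/`vuzp` is my own gloss on the post above.

```python
def deinterleave3(pixels: bytes) -> tuple[bytes, bytes, bytes]:
    """Split packed RGBRGB... bytes into separate R, G and B planes.

    Conceptually what a de-interleaving structure load (vld3-style) delivers
    in one instruction; with SSE this takes a sequence of shuffles and blends.
    """
    if len(pixels) % 3 != 0:
        raise ValueError("expected a whole number of RGB triplets")
    return pixels[0::3], pixels[1::3], pixels[2::3]


def interleave3(r: bytes, g: bytes, b: bytes) -> bytes:
    """Pack separate R, G and B planes back into RGBRGB... order (vst3-style)."""
    if not (len(r) == len(g) == len(b)):
        raise ValueError("planes must have equal length")
    out = bytearray(3 * len(r))
    out[0::3] = r
    out[1::3] = g
    out[2::3] = b
    return bytes(out)


def zip2(a: bytes, b: bytes) -> bytes:
    """Alternate the elements of two vectors: a0 b0 a1 b1 ... (zip-style)."""
    if len(a) != len(b):
        raise ValueError("vectors must have equal length")
    out = bytearray(2 * len(a))
    out[0::2] = a
    out[1::2] = b
    return bytes(out)


def unzip2(v: bytes) -> tuple[bytes, bytes]:
    """Undo zip2: split even-indexed and odd-indexed elements (unzip-style)."""
    return v[0::2], v[1::2]


if __name__ == "__main__":
    packed = bytes([10, 20, 30, 11, 21, 31])   # two RGB pixels
    r, g, b = deinterleave3(packed)
    print(list(r), list(g), list(b))
    print(interleave3(r, g, b) == packed)
```

The real instructions work on fixed-width registers rather than arbitrary byte strings, so treat this purely as a picture of the layout change.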

Please take other people's word for the best available information.

I often wonder how ARM's 256-bit SIMD will differ from AVX. Laning has its advantages as well as its annoyances. There were a lot of growing pains in the early days of Intel SIMD. ARM will have to go through the same process to reach parity.

Different architectures have different trade-offs, which is what most of my arguments against the "IPC is greater" claim expand on. However, IMO there is at most a small set of operations where a simple efficient core can outperform a monolithic x86.

When full reviews are made, I hope I am pleasantly surprised.
 

Nothingness

Platinum Member
Jul 3, 2013
2,153
397
126
I actually said the exact same thing Tuesday at [H] (can't go back in time)

That actually is an area where NEON is a bit better, or at least more specifically optimized for RGB. Its vector loads and stores interleave and de-interleave RGB seamlessly, where SSE would require shuffles and blends. It also has the equivalent vzip/vuzp instructions to pack and unpack alternating data quickly.

Please take other people's word for the best available information.
Well, given that you admit it's good for video, which is the info I had, I have no other source to counter your previous claim, so no provable reason not to believe you :)

I often wonder how ARM's 256-bit SIMD will differ from AVX. Laning has its advantages as well as its annoyances. There were a lot of growing pains in the early days of Intel SIMD. ARM will have to go through the same process to reach parity.
ARM isn't following the silly Intel path of creating a new ISA for each widening of vectors. To go beyond NEON they have SVE, which is vector-length agnostic. I guess that to get the best performance you'll have to tune for a particular vector length, but it's still the same instructions from 128-bit up to whatever chips will implement (up to 2048-bit, but I don't think anyone will go that far). And no, I'm not qualified enough to comment on whether it's better than AVX-512 or not :)
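For anyone wondering what "vector-length agnostic" means in practice, here is a toy Python model of the idea, not actual SVE code. The function name `vla_add` and its parameters are invented for the sketch; the only thing taken from SVE itself is that vector lengths are multiples of 128 bits up to 2048. The loop never hard-codes a width: it asks how many lanes the "hardware" has and masks off the lanes that run past the end of the array.

```python
def vla_add(a: list[float], b: list[float], vector_bits: int,
            elem_bits: int = 32) -> list[float]:
    """Element-wise add written once, run at any supported vector length.

    vector_bits plays the role of the hardware vector length, which the
    program only learns at run time (hypothetical parameter for this sketch).
    """
    if vector_bits % 128 != 0 or not 128 <= vector_bits <= 2048:
        raise ValueError("vector length must be a multiple of 128 in 128..2048")
    if len(a) != len(b):
        raise ValueError("inputs must have equal length")

    lanes = vector_bits // elem_bits      # elements processed per "instruction"
    out = [0.0] * len(a)
    i = 0
    while i < len(a):
        # Predicate: only lanes that still fall inside the array are active,
        # so no separate scalar tail loop is needed.
        active = min(lanes, len(a) - i)
        for lane in range(active):
            out[i + lane] = a[i + lane] + b[i + lane]
        i += lanes
    return out


if __name__ == "__main__":
    x = [float(n) for n in range(10)]
    y = [1.0] * 10
    # The same code gives the same answer at every vector length.
    print(vla_add(x, y, 128) == vla_add(x, y, 512) == vla_add(x, y, 2048))
```

A wider implementation simply takes fewer trips around the loop; nothing in the code has to be rewritten, which is the contrast with having separate SSE, AVX and AVX-512 code paths.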
 
  • Like
Reactions: Etain05

SarahKerrigan

Member
Oct 12, 2014
196
202
116
N1 isn't a "simple efficient core", though. It is a big, serious OoO core with a very aggressive cache hierarchy, and while its SIMD is indeed narrower than current x86 types, this is basically irrelevant to most non-HPC server code streams.

I absolutely think it's plausible that iso-clock integer ST perf exceeds that of SKL, and I'm excited for the review.
 

Richie Rich

Senior member
Jul 28, 2019
438
200
76
SVE2 has some nice HW and efficiency advantages:
  • A 2048-bit vector is 16x longer than a 128-bit NEON vector, so the out-of-order engine spends less energy searching for dependencies, and the reorder buffer saves 15 entries (packed like macro-ops). So IMHO it's more about efficiency (the primary target for a mobile uarch) than about performance (a secondary target/side effect)

Andrei is right about the ST IPC comparison, but real MT performance is influenced by SMT in favor of ARM, because it lowers Zen2's per-thread IPC (two threads share one core). I also assume that a fully MT-loaded Neoverse N1 will be close to A76 IPC (as N1's ST results are boosted by its cache).

For iso-clock:
  • Zen2 is 25% faster than A76 and SMT gives another 25% throughput, for a total of +56% (1.25 x 1.25 = 1.56x) IPC over A76 across two threads, or 0.78x per thread
  • Under real load A76 will be faster per thread (1/0.78 = 1.28, i.e. +28% at iso-clock) despite its narrower core and lower ST IPC

For real clock:
  • Zen2 @ 2.6 GHz vs. Altra @ 3.0 GHz: the result is 1.28 x 3.0/2.6 = 1.48x faster per thread.
  • All-core throughput will be Altra 80 x 1.48 = 118.4 vs. EPYC's 128 (64 cores x 2 threads): EPYC wins!
  • Unless Altra clocks at 3.3 GHz, in which case Altra is 80 x 1.62 = 130: a narrow win for Altra (now we know where that 3.3 GHz comes from)
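The back-of-the-envelope numbers above can be reproduced with a few lines of Python. This is only a sketch of the arithmetic: the inputs (the 25% ST IPC lead, the 25% SMT uplift, the clocks and the core counts) are the estimates from this post rather than measurements, and the constant and function names are made up for illustration. Because the script does not round intermediate values, it lands slightly off the rounded figures in the bullets.

```python
# All inputs are the post's assumptions, not measured data.
ZEN2_ST_IPC_VS_A76 = 1.25   # Zen2 single-thread IPC relative to A76
SMT_UPLIFT = 1.25           # extra throughput from running two threads per core
ZEN2_GHZ = 2.6
EPYC_THREADS = 128          # 64 cores x 2 threads, each counted as 1.0
ALTRA_CORES = 80


def zen2_per_thread_ipc() -> float:
    """Zen2 per-thread IPC relative to A76 when both SMT threads are busy."""
    return ZEN2_ST_IPC_VS_A76 * SMT_UPLIFT / 2


def altra_per_thread_speedup(altra_ghz: float) -> float:
    """Altra thread speed relative to one loaded Zen2 thread at ZEN2_GHZ."""
    return (1 / zen2_per_thread_ipc()) * altra_ghz / ZEN2_GHZ


def altra_total_throughput(altra_ghz: float) -> float:
    """Altra all-core throughput in units of 'one loaded Zen2 thread'."""
    return ALTRA_CORES * altra_per_thread_speedup(altra_ghz)


if __name__ == "__main__":
    print(f"Zen2 per-thread IPC vs A76: {zen2_per_thread_ipc():.2f}")
    for ghz in (3.0, 3.3):
        total = altra_total_throughput(ghz)
        winner = "Altra" if total > EPYC_THREADS else "EPYC"
        print(f"Altra @ {ghz} GHz: {altra_per_thread_speedup(ghz):.2f}x per thread, "
              f"total {total:.1f} vs EPYC {EPYC_THREADS} -> {winner}")
```

Changing `SMT_UPLIFT` or `ZEN2_ST_IPC_VS_A76` shows how sensitive the conclusion is to those two guesses: the 3.3 GHz case only just clears 128.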

To sum up, Altra will outperform EPYC systems per thread despite having weaker cores (remember A76 is only 3xALU+1xJump vs. Zen2's 4xALU) and will at least match them in overall performance (the big shared L3 cache in an 80-core monolith could also be a performance advantage for some types of code). As soon as they implement A77 (much wider, 4xALU+2xJump) with +20% IPC (and 8% higher than Zen2), things will be even more interesting for customers. If Zen3's IPC jump is lower than A77's +20%, the situation will turn in favor of ARM systems even more.
 
