Solved! ARM Apple High-End CPU - Intel replacement

Richie Rich · Oct 14, 2019

There is a first rumor about an Intel replacement in Apple products:

ARM based high-end CPU
8 cores, no SMT
IPC +30% over Cortex A77
desktop performance (Core i7/Ryzen R7) with much lower power consumption
introduction with new gen MacBook Air in mid 2020 (considering also MacBook PRO and iMac)
massive AI accelerator

Source Coreteks:

OriAr · Mar 25, 2020

trivik12 said:
So 1st Arm Macbook would use A14x. I wonder if 14" MBP will use Tigerlake or A series SOC.

The first ARM Mac will almost certainly be an MBA (or a new consumer-focused product, to put it bluntly a Facebook machine, which is why the new iPad Pro makes me think they might not even do that).

If there were an MBP coming this fall, you'd already have heard about developers working on software for it.

Carfax83 · Mar 26, 2020

naukkis said:
And why not? Look at how A76, optimized for server workloads along with its cache and memory controllers, performs: its IPC and multicore scores improve a lot, IPC increases 30% and so on. Don't expect the Apple chip to perform any differently. So a desktop-class version of the Apple core will probably have better IPC and much improved multicore speed compared with the phone chip.

That's my whole point. The Graviton 2 is substantially different from the A13 because it's not narrowly optimized for single-threaded burst performance like the A13 is. My whole problem with this thread is how certain people have been implying that the A13 core could be akin to a drop-in solution that is successful across a wide variety of workloads just as it is. To scale the A13 up to be successful in more diverse and multithreaded workloads would probably require some serious architectural changes, which would, from what I've read in this thread, lower the single-thread/IPC performance substantially.

Carfax83 · Mar 26, 2020

Nothingness said:
At the other end you have a server chip from AWS that is competitive against Intel and AMD chips.

How competitive is it really, though? The benchmarks were limited to Spec2006, and the competition was a 32-core first-generation Zen CPU and a 28-core Intel Cascade Lake CPU that is based on the Skylake architecture from nearly 5 years ago.

Yes this chip and other ARM CPU derivatives will get much better with each iteration, but AMD and Intel aren't going to be standing still either.

Hitman928 · Mar 26, 2020

Carfax83 said:
How competitive is it really, though? The benchmarks were limited to Spec2006, and the competition was a 32-core first-generation Zen CPU and a 28-core Intel Cascade Lake CPU that is based on the Skylake architecture from nearly 5 years ago.

Yes this chip and other ARM CPU derivatives will get much better with each iteration, but AMD and Intel aren't going to be standing still either.

The only Spec2017 numbers I think we have for Graviton2 are from Anandtech, and they didn't run Spec2017 on Rome. There are plenty of published and verified Rome results, but they use AOCC instead of GCC. It also needs to be noted that Graviton2 was running as a cloud server, whereas I'm sure this Rome test was running on bare metal. With that in mind, I compared the numbers below.

CPU2017 Integer Rate Result: ASUSTeK Computer Inc. ASUS RS500A-E10(KRPA-U16) Server System 2.25 GHz, AMD EPYC 7742

CINT2017 result for ASUS RS500A-E10(KRPA-U16) Server System 2.25 GHz, AMD EPYC 7742; SPECrate2017_int_base: 353; SPECrate2017_int_peak: 385

www.spec.org

https://www.anandtech.com/show/15578/cloud-clash-amazon-graviton2-arm-against-intel-and-amd/7

Benchmark	Rome 7742	Graviton2	Rome vs Graviton2
500.perlbench_r	310.58	174.4	178.09%
502.gcc_r	334.34	176.9	189.00%
505.mcf_r	442.24	103.1	428.94%
520.omnetpp_r	159.59	85.6	186.43%
523.xalancbmk_r	374.43	131.4	284.95%
525.x264_r	845.29	304.4	277.69%
531.deepsjeng_r	368.39	202.7	181.74%
541.leela_r	359.44	204.4	175.85%
548.exchange2_r	1000.03	385.7	259.28%
557.xz_r	232.59	114.7	202.78%
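
As a rough cross-check of the ratio column, here's a minimal Python sketch that recomputes each per-benchmark ratio from the two score columns above and adds a geometric mean of the ratios. The geometric mean is an extra summary that isn't in the table, the variable names are made up for illustration, and all the caveats about compiler (AOCC vs GCC) and cloud vs bare metal still apply.

```python
from math import prod

# (Rome 7742 score, Graviton2 score) per benchmark, copied from the table above.
scores = {
    "500.perlbench_r": (310.58, 174.4),
    "502.gcc_r": (334.34, 176.9),
    "505.mcf_r": (442.24, 103.1),
    "520.omnetpp_r": (159.59, 85.6),
    "523.xalancbmk_r": (374.43, 131.4),
    "525.x264_r": (845.29, 304.4),
    "531.deepsjeng_r": (368.39, 202.7),
    "541.leela_r": (359.44, 204.4),
    "548.exchange2_r": (1000.03, 385.7),
    "557.xz_r": (232.59, 114.7),
}

# Rome-to-Graviton2 ratio per benchmark (2.0 means Rome scores twice as high).
ratios = {name: rome / g2 for name, (rome, g2) in scores.items()}

# Geometric mean of the ratios; an illustrative summary, not an official SPEC score.
geomean = prod(ratios.values()) ** (1 / len(ratios))

for name, ratio in ratios.items():
    print(f"{name:<18}{ratio:8.2%}")
print(f"{'geomean':<18}{geomean:8.2%}")
```

A geometric mean is used here because it's the usual way to average ratios across benchmarks; one outlier like 505.mcf_r pulls it up less than it would an arithmetic mean.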

Markfw · Mar 26, 2020

Hitman928 said:
The only Spec2017 numbers I think we have for Graviton2 are from Anandtech, and they didn't run Spec2017 on Rome. There are plenty of published and verified Rome results, but they use AOCC instead of GCC. It also needs to be noted that Graviton2 was running as a cloud server, whereas I'm sure this Rome test was running on bare metal. With that in mind, I compared the numbers below.

CPU2017 Integer Rate Result: ASUSTeK Computer Inc. ASUS RS500A-E10(KRPA-U16) Server System 2.25 GHz, AMD EPYC 7742

CINT2017 result for ASUS RS500A-E10(KRPA-U16) Server System 2.25 GHz, AMD EPYC 7742; SPECrate2017_int_base: 353; SPECrate2017_int_peak: 385

www.spec.org

https://www.anandtech.com/show/15578/cloud-clash-amazon-graviton2-arm-against-intel-and-amd/7

Benchmark	Rome 7742	Graviton2	Rome vs Graviton2
500.perlbench_r	310.58	174.4	178.09%
502.gcc_r	334.34	176.9	189.00%
505.mcf_r	442.24	103.1	428.94%
520.omnetpp_r	159.59	85.6	186.43%
523.xalancbmk_r	374.43	131.4	284.95%
525.x264_r	845.29	304.4	277.69%
531.deepsjeng_r	368.39	202.7	181.74%
541.leela_r	359.44	204.4	175.85%
548.exchange2_r	1000.03	385.7	259.28%
557.xz_r	232.59	114.7	202.78%

So, do I read that Rome is 2-4 times faster than graviton2 ???? Maybe this will shut Richie Rich up.....

Hitman928 · Mar 26, 2020

Markfw said:
So, do I read that Rome is 2-4 times faster than graviton2 ???? Maybe this will shut Richie Rich up.....

There's some decent caveats here, but yeah, even taking those into account, it's going to be much faster.

Carfax83 · Mar 26, 2020

Hitman928 said:
There's some decent caveats here, but yeah, even taking those into account, it's going to be much faster.

And Zen 3 should be much more potent. I've always said that Zen 3 would be AMD's true break away moment, if it's ever going to happen. Zen 2 was playing catch up with Intel, and luckily for AMD, Intel screwed up badly with their 10nm node so it will make it easier for Zen 3 to really do some damage.

Zen 3 is going to be a monster!

ARM has their work cut out for them if they want to catch up with x86-64.

Richie Rich · Mar 26, 2020

Markfw said:
So, do I read that Rome is 2-4 times faster than graviton2 ???? Maybe this will shut Richie Rich up.....

523.xalancbmk .... 32c Zen1 ….. 53.7
523.xalancbmk .... 64c Zen2 … 374.4 ……….. that's 7x more than Zen1 and 3.5x more per core

505.mcf.... 32c Zen1 ….. 73.2
505.mcf.... 64c Zen2 … 442.2 ……….. that's 6x more than Zen1 and 3x more per core

I didn't notice that Zen2 has >200% higher IPC than Zen1. I thought it was about 15%.
According to Andrei's test, G2 is about 1.7x faster than Zen1 while having 2x more cores. That's to be expected.
If Rome were 2-4 times faster than G2, then Rome would also be 3.4-6.8 times faster than Zen1 Naples. And this is impossible.

Do you still believe those numbers are correct and fully comparable?
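
The per-core arithmetic above can be reproduced with a few lines of Python. This is only a sketch: the scores are the ones quoted in this post, the core counts are the 32 and 64 mentioned, and the helper name is made up. It says nothing about why the ratios are so large (compiler, clocks, memory setup), only what they are.

```python
def scaling(old_score, old_cores, new_score, new_cores):
    """Return (total speedup, per-core speedup) of the new part over the old one."""
    total = new_score / old_score
    per_core = total / (new_cores / old_cores)
    return total, per_core

# Scores quoted above: 32-core Zen1 vs 64-core Zen2.
quoted = {
    "523.xalancbmk": (53.7, 374.4),
    "505.mcf": (73.2, 442.2),
}

results = {name: scaling(zen1, 32, zen2, 64) for name, (zen1, zen2) in quoted.items()}
for name, (total, per_core) in results.items():
    print(f"{name}: {total:.1f}x total, {per_core:.1f}x per core")
```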

@Carfax83
I agree Zen3 is gonna be much better than Zen2, however A78 will have higher IPC. And after that comes the new ARMv9 core lineup with SVE2 2048-bit vectors. Well, you can see it isn't ARM who needs to catch up. Since A77 delivers 8% higher IPC than Zen2, ARM has become super dangerous for the x86 world.

DrMrLordX · Mar 26, 2020

Carfax83 said:
That's my whole point. The Graviton 2 is substantially different from the A13 because it's not narrowly optimized for single-threaded burst performance like the A13 is. My whole problem with this thread is how certain people have been implying that the A13 core could be akin to a drop-in solution that is successful across a wide variety of workloads just as it is. To scale the A13 up to be successful in more diverse and multithreaded workloads would probably require some serious architectural changes, which would, from what I've read in this thread, lower the single-thread/IPC performance substantially.

If you look at Graviton2, even it struggles with performance whenever too many cores are engaged on the same workload. Anandtech went well out of their way to point out that Graviton2 performs better running multiple, low-resource VM instances.

DrMrLordX · Mar 26, 2020

Richie Rich said:
I didn't notice that Zen2 has >200% higher IPC than Zen1. I thought it was about 15%.

Where have you been? Rome crushes Naples.

AMD Rome Second Generation EPYC Review: 2x 64-core Benchmarked

www.anandtech.com

With an updated version of GCC it does even better. Also again with the IPC? You have to take into consideration that Rome allows higher clocks within the same power envelope.

Hitman928 · Mar 26, 2020

Richie Rich said:
523.xalancbmk .... 32c Zen1 ….. 53.7
523.xalancbmk .... 64c Zen2 … 374.4 ……….. that's 7x more than Zen1 and 3.5x more per core

505.mcf.... 32c Zen1 ….. 73.2
505.mcf.... 64c Zen2 … 442.2 ……….. that's 6x more than Zen1 and 3x more per core

I didn't notice that Zen2 has >200% higher IPC than Zen1. I thought it was about 15%.
According to Andrei's test, G2 is about 1.7x faster than Zen1 while having 2x more cores. That's to be expected.
If Rome were 2-4 times faster than G2, then Rome would also be 3.4-6.8 times faster than Zen1 Naples. And this is impossible.

Do you still believe those numbers are correct and fully comparable?

@Carfax83
I agree Zen3 is gonna be much better than Zen2, however A78 will have higher IPC. And after that comes the new ARMv9 core lineup with SVE2 2048-bit vectors. Well, you can see it isn't ARM who needs to catch up. Since A77 delivers 8% higher IPC than Zen2, ARM has become super dangerous for the x86 world.

The compiler can make a huge difference on certain tests. Look at the published results for Spec2006 against Andrei's tests on Zen1. The libquantum score increases by 1700% on Zen1 by using an AMD optimized compiler. If you look at the published Spec2017 results, they all use AMD's open compiler, no one uses GCC.

As far as I'm aware, GCC is basically performance equal to ARM's own compiler (could be wrong, I don't follow ARM that much).

Edit: I should add that later versions of GCC I believe incorporate more of AMD's optimizations. Additionally, some tests (like libquantum) are very memory bound and will vary greatly depending on how the memory of the system is configured so that test in a cloud instance could show really low performance compared to a bare metal test. I'd love to see some bare metal tests of Graviton2 but unfortunately I don't think Amazon wants this.

Thunder 57 · Mar 26, 2020

Markfw said:
So, do I read that Rome is 2-4 times faster than graviton2 ???? Maybe this will shut Richie Rich up.....

I very much doubt that. He's dead set on SMT4, x86 is dead, ARM is the best thing ever, etc. Maybe Apple should hire him?

DrMrLordX · Mar 26, 2020

Thunder 57 said:
Maybe Apple should hire him?

If they did, would he tell us?

Nothingness · Mar 27, 2020

Hitman928 said:
There's some decent caveats here, but yeah, even taking those into account, it's going to be much faster.

I see two caveats: AOCC is like icc, a SPEC compiler and making any comparison with it is bound to be useless; Rome does turbo in single thread. ~~The second point matters, especially when some (rightly) complain that SPEC isn't representative of server workloads, single-thread is even less representative~~.

EDIT: my bad, you quoted MT results, great!

Hitman928 said:
The compiler can make a huge difference on certain tests. Look at the published results for Spec2006 against Andrei's tests on Zen1. The libquantum score increases by 1700% on Zen1 by using an AMD optimized compiler. If you look at the published Spec2017 results, they all use AMD's open compiler, no one uses GCC.

Why would system makers use a compiler that gets worse results when they can use a compiler that targets SPEC? Oh wait.

Why didn't you make a SPEC 2006 comparison where you have gcc results? And why not post this in the Graviton thread?

EDIT: I saw @amrnuke posted his results for SPEC 2006 Graviton2 vs 7742 with gcc. And that definitely paints a different picture from what you get, though Rome still has the lead by 30%.

Hitman928 · Mar 27, 2020

Nothingness said:
I see two caveats: AOCC is like icc, a SPEC compiler and making any comparison with it is bound to be useless; Rome does turbo in single thread. ~~The second point matters, especially when some (rightly) complain that SPEC isn't representative of server workloads, single-thread is even less representative~~.

EDIT: my bad, you quoted MT results, great!

Why would system makers use a compiler that gets worse results when they can use a compiler that targets SPEC? Oh wait.

Why didn't you make a SPEC 2006 comparison where you have gcc results? And why not post this in the Graviton thread?

EDIT: I saw @amrnuke posted his results for SPEC 2006 Graviton2 vs 7742 with gcc. And that definitely paints a different picture from what you get, though Rome still has the lead by 30%.

The latest versions of GCC have the Rome optimizations included now anyway. I agree that some of the results might be a little optimistic using AOCC, but using GCC8 also makes things a little pessimistic for Rome (see libquantum results with GCC8 showing Rome regressed in perf compared to Naples). The reason I used the scores I did is because they are published and verified by Spec. This is also the reason I didn't use Spec2006, by the time Rome came out, Spec2006 was EOL and no published Spec2006 results exist for Rome. If Amazon would allow Graviton2 to have published Spec results we could make a better comparison but so far they won't (and I doubt they ever will).

Even if we take Ampere's numbers and do some basic calculations to adjust them for Graviton2 (since they're using the same Arm design) you get that Rome is ~50% faster than Graviton2, and that's with them de-rating Rome's score assuming an older version of GCC and not even the highest score published for the Epyc 7742.

Ampere 3.3 GHz, 80 core N1 CPU est. Spec score = 1.04 * Epyc 7742.

Graviton2 = Ampere est. Spec score * (2.5/3.3) * (64/80) = 0.63 * Epyc 7742

Or in other words, Epyc 7742 is 58.7% faster than Graviton2 by this estimate, maybe closer to 50% given the lack of multi-core scaling with N1. That's using Ampere's own numbers, which are probably pretty optimistic for the Arm design in comparison. So is Epyc 2x - 4x faster than Graviton2? Probably not. But it's also probably faster than what we get using Ampere's numbers, so it could be 2x as fast or very close to it.
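
For anyone who wants to redo the back-of-the-envelope math above, here's a small Python sketch. It assumes, as the estimate does, that performance scales linearly with clock speed and core count, which is optimistic; the variable names are just for illustration.

```python
# Ampere's own estimate: 80-core, 3.3 GHz N1 part relative to the Epyc 7742.
ampere_vs_epyc = 1.04

# Naive linear scaling down to Graviton2: 64 cores at 2.5 GHz (assumes perfect scaling).
clock_scale = 2.5 / 3.3
core_scale = 64 / 80
graviton2_vs_epyc = ampere_vs_epyc * clock_scale * core_scale

# Invert to express how much faster the Epyc 7742 would be than Graviton2.
epyc_advantage = 1 / graviton2_vs_epyc - 1

print(f"Graviton2 ~ {graviton2_vs_epyc:.2f} x Epyc 7742")
print(f"Epyc 7742 ~ {epyc_advantage:.1%} faster than Graviton2")
```

Any real deviation from linear scaling (interconnect, memory controllers) would move these numbers, so treat the result as a bound on the estimate rather than a measurement.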

Thala · Mar 27, 2020

Markfw said:
So, do I read that Rome is 2-4 times faster than graviton2 ???? Maybe this will shut Richie Rich up.....

Not surprised you are not questioning numbers as long as they follow your agenda. Or perhaps you did not notice that the compiler is different? Or maybe that the number of HW threads is different by a factor of 2? Is the gate count or power somewhat equal? Probably not.

Hitman928 said:
The latest versions of GCC have the Rome optimizations included now anyway.

The problem is not the Rome optimizations, the problem is the SPEC optimizations. In any case, when comparing architectures always use the same compiler; everything else is speculation.
Why do you compare a 64-thread CPU with a 128-thread CPU anyway? Is it not within reasonable expectation that the latter is faster in multi-threaded workloads?

Nothingness · Mar 27, 2020

Hitman928 said:
The latest versions of GCC have the Rome optimizations included now anyway. I agree that some of the results might be a little optimistic using AOCC, but using GCC8 also makes things a little pessimistic for Rome (see libquantum results with GCC8 showing Rome regressed in perf compared to Naples). The reason I used the scores I did is because they are published and verified by Spec. This is also the reason I didn't use Spec2006, by the time Rome came out, Spec2006 was EOL and no published Spec2006 results exist for Rome. If Amazon would allow Graviton2 to have published Spec results we could make a better comparison but so far they won't (and I doubt they ever will).

I agree we should use the latest compiler (as long as it's the same) and SPEC 2017. But what you're doing here is like comparing AMD vs Intel on benchmarks where it's been shown that Intel cheated. Don't you remember icc and AMD fans rightly crying that Intel was cheating?

Here is what AOCC and icc do on SPECrate2017 on a 7601 and a Gold 6148:

Cavium ThunderX2 Review and Benchmarks a Real Arm Server Option

The Cavium ThunderX2 is a complete game changer in the server CPU market. Backed by a vastly improved Arm ecosystem, the ThunderX2 features 32 high speed Arm cores capable of a total of 128 threads and 56 PCIe lanes in a single socket, or 256 threads in a dual socket server

www.servethehome.com

[Chart: Cavium ThunderX2 SPEC Int Rate Peak, compiler-optimized results]

[Chart: Cavium ThunderX2 SPEC Int Rate Peak, GCC 7]

Vendor compilers should not be used on such benchmarks when doing cross-vendor comparisons. Period. And this has nothing to do with ARM vs x86.

Even if we take Ampere's numbers and do some basic calculations to adjust them for Graviton2 (since they're using the same Arm design) you get that Rome is ~50% faster than Graviton2, and that's with them de-rating Rome's score assuming an older version of GCC and not even the highest score published for the Epyc 7742.

Ampere 3.3 GHz, 80 core N1 CPU est. Spec score = 1.04 * Epyc 7742.

Graviton2 = Ampere est. Spec score * (2.5/3.3) * (64/80) = 0.63 * Epyc 7742

Or in other words, Epyc 7742 is 58.7% faster than Graviton2 by this estimate, maybe closer to 50% given the lack of multi-core scaling with N1. That's using Ampere's own numbers, which are probably pretty optimistic for the Arm design in comparison. So is Epyc 2x - 4x faster than Graviton2? Probably not. But it's also probably faster than what we get using Ampere's numbers, so it could be 2x as fast or very close to it.

Sorry, but this is again wild guessing (you have no way to know how Ampere's interconnect and memory controllers will behave) and meaningless computation.

Hitman928 · Mar 27, 2020

Nothingness said:
I agree we should use the latest compiler (as long as it's the same) and SPEC 2017. But what you're doing here is like comparing AMD vs Intel on benchmarks where it's been shown that Intel cheated. Don't you remember icc and AMD fans rightly crying that Intel was cheating?

Here is what AOCC and icc do on SPECrate2017 on a 7601 and a Gold 6148:

Cavium ThunderX2 Review and Benchmarks a Real Arm Server Option

The Cavium ThunderX2 is a complete game changer in the server CPU market. Backed by a vastly improved Arm ecosystem, the ThunderX2 features 32 high speed Arm cores capable of a total of 128 threads and 56 PCIe lanes in a single socket, or 256 threads in a dual socket server

www.servethehome.com

Vendor compilers should not be used on such benchmarks when doing cross-vendor comparisons. Period. And this has nothing to do with ARM vs x86.

Sorry, but this is again wild guessing (you have no way to know how Ampere's interconnect and memory controllers will behave) and meaningless computation.

The problem with using old GCC versions is that AMD doesn't upstream their architecture optimizations for GCC, so you also get a very unfair comparison. The best comparison, again, would be the same compiler but with the latest and best version so each company has their optimizations included, but we don't have that because neither Ampere nor Amazon has really allowed for that (yet).

As far as the numbers go compared to Ampere, sure, it's obviously not perfect, but if you throw that out then we're left with basically nothing and the best you can say is that they both are server CPUs and that 64 core Graviton2 on 7nm is faster than 32 core Zen1 on 14 nm. Might as well not try to make any comparison to Rome at all.

Thala · Mar 27, 2020

Hitman928 said:
The problem with using old GCC versions is that AMD doesn't upstream their architecture optimizations for GCC, so you also get a very unfair comparison. The best comparison, again, would be the same compiler but with the latest and best version so each company has their optimizations included, but we don't have that because neither Ampere nor Amazon has really allowed for that (yet).

This just means we cannot conclude yet. It does not mean we should start juggling around with unreasonable results.
In addition when comparing architectures, you need a meaningful metric. Just comparing a 128 thread implementation against a 64 thread implementation with respect to absolute performance is moot.

Hitman928 · Mar 27, 2020

Thala said:
This just means we cannot conclude yet. It does not mean we should start juggling around with unreasonable results.
In addition when comparing architectures, you need a meaningful metric. Just comparing a 128 thread implementation against a 64 thread implementation with respect to absolute performance is moot.

1) I never said my numbers were a conclusion, on the contrary I said that they contained lots of caveats but wanted to show what we get with the numbers we have understanding there's caveats either way we do it.

2) So should we compare a 2 rack solution of Graviton2 versus a 1 rack solution of Epyc since that's what it would take to reach thread parity between the two?

Nothingness · Mar 27, 2020

Hitman928 said:
1) I never said my numbers were a conclusion, on the contrary I said that they contained lots of caveats but wanted to show what we get with the numbers we have understanding there's caveats either way we do it.

Indeed you made it clear. But people jumped to your data and made utterly stupid statements as if that was the Truth because that fits their beliefs.

It's sometimes much better not to provide data rather than juggling with computations (and I plead guilty as I sometimes do that myself).

2) So should we compare a 2 rack solution of Graviton2 versus a 1 rack solution of Epyc since that's what it would take to reach thread parity between the two?

IMHO the best comparison would be if AWS offered Rome and Andrei updated his review with it.

Hitman928 · Mar 27, 2020

Nothingness said:
I agree we should use the latest compiler (as long as it's the same) and SPEC 2017. But what you're doing here is like comparing AMD vs Intel on benchmarks where it's been shown that Intel cheated. Don't you remember icc and AMD fans rightly crying that Intel was cheating?

Here is what AOCC and icc do on SPECrate2017 on a 7601 and a Gold 6148:

Cavium ThunderX2 Review and Benchmarks a Real Arm Server Option

The Cavium ThunderX2 is a complete game changer in the server CPU market. Backed by a vastly improved Arm ecosystem, the ThunderX2 features 32 high speed Arm cores capable of a total of 128 threads and 56 PCIe lanes in a single socket, or 256 threads in a dual socket server

www.servethehome.com

Thanks for linking that. When we look at the Spec2017 GCC results, TX2 sure looks really strong, beating the 7601 by a decent margin and crushing the Gold 6148. How did the TX2 fare in the STH test suite though (also using GCC)?

[Chart: Cavium ThunderX2 c-ray 8K benchmark comparison]

Hmm. So should we just throw away Spec entirely or. . .

Nothingness · Mar 27, 2020

Hitman928 said:
Thanks for linking that. When we look at the Spec2017 GCC results, TX2 sure looks really strong, beating the 7601 by a decent margin and crushing the Gold 6148. How did the TX2 fare in the STH test suite though (also using GCC)?

Hmm. So should we just throw away Spec entirely or. . .

Are you trying to compare microbenchmarks and domain specific benchmarks with SPEC? Really?

Hitman928 · Mar 27, 2020

Nothingness said:
IMHO the best comparison would be if AWS offered Rome and Andrei updated his review with it.

The best comparison would be bare metal setups, but I doubt Amazon would ever allow that.

Hitman928 · Mar 27, 2020

Nothingness said:
Are you trying to compare microbenchmarks and domain specific benchmarks with SPEC? Really?

So we can't compare a collection of individual benchmarks to form a custom suite, we have to stick to Spec's collection of tests?

The reality is these CPUs will be tested by customers on their own optimized setups with their own actual flow being tested. Everything else is just talking points but as just consumers on a consumer forum, that's all we really have, right? The only way we'd have any info to go off of would be someone to release their internal testing (which almost no one will do).

Look, I'm not trying to downplay what ARM is doing in the server space, they've made a ton of progress and are starting to become a real threat to x86, just trying to bring some perspective compared to the marketing from ARM partners which is all we have to go off of because they haven't (maybe won't) release test systems for independent reviewers to publish their results. My personal opinion is that ARM isn't quite there yet with this generation, but the next generation could be a whole different story, especially with Intel continuing to struggle to put anything really competitive out in this space. The next generation or two may come down to how valuable system admins see sticking with x86 would be and using AMD versus switching to an ARM ecosystem.
