Solved! ARM Apple High-End CPU - Intel replacement


Richie Rich

Senior member
Jul 28, 2019
470
229
76
The first rumor about an Intel replacement in Apple products has arrived:
  • ARM-based high-end CPU
  • 8 cores, no SMT
  • IPC +30% over Cortex-A77
  • desktop performance (Core i7/Ryzen 7) with much lower power consumption
  • introduction with the next-gen MacBook Air in mid-2020 (MacBook Pro and iMac also under consideration)
  • massive AI accelerator

Source Coreteks:
 
  • Like
Reactions: vspalanki
Solution
What an understatement :D And it looks like it doesn't want to die. Yet.


Yes, the A13 is competitive against Intel chips, but the emulation tax is about 2x. So given that the A13 ~= Intel, emulated x86 programs would run at about half the speed of an equivalent x86 machine. This is one of the reasons they haven't switched yet.
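
(Spelling out the arithmetic, with both figures being the assumptions above rather than measurements:)

```python
# Both numbers are assumptions from the post above, not measurements.
native_a13 = 1.0       # assume A13 ~= a comparable Intel core on native code
emulation_tax = 2.0    # assumed ~2x slowdown for emulated x86

effective = native_a13 / emulation_tax
print(f"Emulated x86 speed vs. an equivalent x86 machine: {effective:.0%}")
# -> 50%, i.e. half the speed
```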

Another reason is that it would prevent the use of Windows on their machines, something some say is very important.

The level of ignorance in this thread would be shocking if it weren't depressing.
Let's state some basics:

(a) History. Apple has never let backward compatibility limit what they do. They are not Intel, they are not Windows. They don't sell perpetual compatibility as a feature. Christ, the big...

Nothingness

Platinum Member
Jul 3, 2013
2,442
767
136
He must have changed his numbers? The difference I saw in MT was huge for Rome. Naples, on the other hand, wasn't as big, though there was still a swing. Naples lost in ST but won in MT.
Yes, he had made a mistake in his original data. It's depressing that the wrong results are what stick in people's minds :(
 
  • Like
Reactions: lightmanek

DrMrLordX

Lifer
Apr 27, 2000
21,678
10,940
136
Yes, he had made a mistake in his original data. It's depressing that the wrong results are what stick in people's minds :(

He also deleted the Naples numbers, and his original numbers had Graviton2 winning the ST SPEC2006 outright. What happened?

edit: the Naples numbers were important in that they demonstrated some of the scaling trouble Graviton2 has, albeit not to the extent that the (apparently erroneous) Rome numbers did. Naples was losing to Graviton2 in ST but winning in MT. Now with the Naples numbers removed from the ST comparison, that isn't even evident anymore.
 

soresu

Platinum Member
Dec 19, 2014
2,695
1,902
136
Yeah, they have gone from the great Wii
More like the GameCube to the Wii.

The GC was brand new hardware (custom ATI graphics? first use of PowerPC in a console?), whereas the Wii was basically an overclocked GC hardware platform with new motion controllers. It was an innovative move for a console, to be sure, but the underlying hardware was already years old by then.

I actually wonder if the devkit was a modified PowerPC Mac.
Nintendo Switch has 0.2 TFLOPS FP32
The Switch is based on the Tegra X1 from the SHIELD TV, which could manage up to 500 gigaflops. Obviously that would be more than a stretch in handheld mode, but it should get a lot closer to that maximum in docked mode, certainly higher than 200 gigaflops.

Bear in mind that the TX1 also has double-rate FP16, like Pascal, Vega, and the PS4 Pro. It is the only Maxwell GPU that has this feature unless I'm mistaken, somewhat like the PS4 Pro is the only Polaris GPU that has it.
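
For reference, the peak-FLOPS arithmetic behind those figures, using the commonly cited TX1/Switch GPU clocks (treat the clocks as approximate):

```python
# Peak FP32 throughput = CUDA cores * 2 FLOPs/clock (FMA) * clock speed.
# Clocks below are the commonly reported ones, not official spec sheets.
CORES = 256  # Maxwell CUDA cores in the Tegra X1

def gflops_fp32(clock_mhz: float) -> float:
    return CORES * 2 * clock_mhz / 1000.0

for label, mhz in [("TX1 max (SHIELD TV)", 1000.0),
                   ("Switch docked", 768.0),
                   ("Switch handheld", 307.2)]:
    fp32 = gflops_fp32(mhz)
    # Double-rate FP16 doubles the FP32 figure on this GPU
    print(f"{label:20s} {fp32:6.1f} GFLOPS FP32 / {fp32 * 2:6.1f} GFLOPS FP16")
```

That puts docked mode around 390 GFLOPS FP32, consistent with "certainly higher than 200 gigaflops" above.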
 
Last edited:

Nothingness

Platinum Member
Jul 3, 2013
2,442
767
136
He also deleted the Naples numbers, and his original numbers had Graviton2 winning the ST SPEC2006 outright. What happened?

edit: the Naples numbers were important in that they demonstrated some of the scaling trouble Graviton2 has, albeit not to the extent that the (apparently erroneous) Rome numbers did. Naples was losing to Graviton2 in ST but winning in MT. Now with the Naples numbers removed from the ST comparison, that isn't even evident anymore.
You can make that computation yourself, it's easy. That will prevent you from thinking there's some hidden agenda :D
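
(If you want to redo it, it's just a per-socket normalization; a minimal sketch with placeholder scores, not the article's actual numbers:)

```python
# Placeholder scores for illustration only; substitute the real SPECrate results.
two_socket_rome = 400.0   # hypothetical 2S Rome SPECrate score
graviton2_1s    = 210.0   # hypothetical 1S Graviton2 SPECrate score

# The spreadsheet error: 2-socket scores must be halved before a
# per-socket comparison against a 1-socket system.
rome_per_socket = two_socket_rome / 2
print(f"Rome per socket: {rome_per_socket:.0f}")
print(f"Graviton2 (1S):  {graviton2_1s:.0f}")
print(f"ratio:           {graviton2_1s / rome_per_socket:.2f}x")
```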
 

soresu

Platinum Member
Dec 19, 2014
2,695
1,902
136
Gaming is about the fun factor, and that's where Nintendo excels. Most Android phones could serve as a gaming console from a performance point of view. I think a combination of 8x Cortex-A78 cores with a second-generation Valhall GPU, the Mali-G78, could deliver pretty decent performance, around 1.5 TFLOPS. I'm not sure how much they can scale it up. To 6 TFLOPS max, maybe?
Valhall is still a question mark, really. We all hope ARM fixed whatever ailed Bifrost/Bifrost2 (G76), but that hope needs vindication with good perf/watt results versus Adreno.

If the G78 doesn't deliver for ARM, then, coupled with interference from the US government, high-end Mali may wither on the vine, as it were. With Samsung switching to RDNA, it has too few customers now that Huawei is an uncertain customer due to imperial entanglements.
 

amrnuke

Golden Member
Apr 24, 2019
1,181
1,772
136
He must have changed his numbers? The difference I saw in MT was huge for Rome. Naples, on the other hand, wasn't as big, though there was still a swing. Naples lost in ST but won in MT.
Yes, big apologies. I made a follow-up post; I had calculated something wrong somewhere. Someone said something that made me start thinking, and I went back and re-calculated.
 

amrnuke

Golden Member
Apr 24, 2019
1,181
1,772
136
Yes, he had made a mistake in his original data. It's depressing that the wrong results are what stick in people's minds :(
:(

I want to go back, revisit the comparison, and look more closely at the data, but I got caught up in a discussion about SMT in the Graviton thread.

Interesting that I'm talking about Graviton in this thread and SMT in the Graviton thread, haha... I'm arguing with a guy who is CERTAIN that SMT4 is coming in Zen 3, trying to convince him that SMT2 is actually beneficial. It's almost comical.
 

SarahKerrigan

Senior member
Oct 12, 2014
383
558
136
He also deleted the Naples numbers, and his original numbers had Graviton2 winning the ST SPEC2006 outright. What happened?

edit: the Naples numbers were important in that they demonstrated some of the scaling trouble Graviton2 has, albeit not to the extent that the (apparently erroneous) Rome numbers did. Naples was losing to Graviton2 in ST but winning in MT. Now with the Naples numbers removed from the ST comparison, that isn't even evident anymore.

The spreadsheet error applied to both: the Naples and Rome numbers should have been cut in half, and they weren't. You can see the raw numbers (for 2S Naples) here:


for comparison against Grav2, here:


One-socket Graviton2 is competitive against dual-socket 7601, although it typically loses by a little bit. Saying "oh, it has scaling issues because a one-socket ARM chip sometimes loses against dual-socket x86" does not help your case.
 

amrnuke

Golden Member
Apr 24, 2019
1,181
1,772
136
One-socket Graviton2 is competitive against dual-socket 7601, although it typically loses by a little bit. Saying "oh, it has scaling issues because a one-socket ARM chip sometimes loses against dual-socket x86" does not help your case.
Keep in mind that first-gen EPYC had MAJOR NUMA issues, and the inter-socket latency likely hammers 7601 performance quite a bit in these tests.
 

Hitman928

Diamond Member
Apr 15, 2012
5,368
8,180
136
Regarding AnandTech's SPEC scaling numbers: looking back at their compiler flags, I think they're missing a couple of options that would really help (e.g. jemalloc), which all the builders use when compiling for these high core/thread-count machines. I don't know how much it would help for the rate benchmarks, but all the system vendors use it when running rate, and Ampere used it for their SPEC rate estimate scores as well.
 
  • Like
Reactions: lightmanek

amrnuke

Golden Member
Apr 24, 2019
1,181
1,772
136
Regarding AnandTech's SPEC scaling numbers: looking back at their compiler flags, I think they're missing a couple of options that would really help (e.g. jemalloc), which all the builders use when compiling for these high core/thread-count machines. I don't know how much it would help for the rate benchmarks, but all the system vendors use it when running rate, and Ampere used it for their SPEC rate estimate scores as well.
Exactly - and one thing to keep in mind for Graviton vs EPYC in these tests: even though AMD improved NUMA in Rome, it is still a limitation, with intersocket latencies of 200+ns. This is for sure better than 250+ns on Naples but is likely hampering total system throughput on SPEC rate scores.
 

SarahKerrigan

Senior member
Oct 12, 2014
383
558
136
Exactly - and one thing to keep in mind for Graviton vs EPYC in these tests: even though AMD improved NUMA in Rome, it is still a limitation, with intersocket latencies of 200+ns. This is for sure better than 250+ns on Naples but is likely hampering total system throughput on SPEC rate scores.

I would be extremely surprised if rate is going to care much about inter-socket coherence latency; each hardware context is running its own SPEC copy and there's no real sharing. As for jemalloc, the improvements I've seen myself with it are minimal.
 

amrnuke

Golden Member
Apr 24, 2019
1,181
1,772
136
I would be extremely surprised if rate is going to care much about inter-socket coherence latency; each hardware context is running its own SPEC copy and there's no real sharing. As for jemalloc, the improvements I've seen myself with it are minimal.
Interesting. Would it be an issue on a 7601, with 4 NUMA nodes per socket?
 

SarahKerrigan

Senior member
Oct 12, 2014
383
558
136
Interesting. Would it be an issue on a 7601, with 4 NUMA nodes per socket?

Realistically, no; everything is going to be using its home node's memory anyway, barring occasional small spills. SPEC rate scales well to thousands of sockets; it's one of the reasons it's a relatively poor MT benchmark for commercial code (which tends to do a lot of sharing, especially when locks, etc, are used heavily.) It's slightly better for highly parallel workstation code but I still don't find rate terribly useful for MT throughput.

For ST, it remains the gold standard (although the reporting rules in '17 have some annoying aspects that IMO reduce the usefulness of submitted results.)
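
As a toy illustration of that sharing point (my own sketch, nothing to do with SPEC itself): lock-heavy "commercial-style" work serializes on shared state, while rate-style copies touch nothing in common.

```python
import multiprocessing as mp
import time

N_PROCS = 4
ITERS = 100_000

def shared_worker(counter):
    # Lock-heavy sharing, as in commercial code: every increment
    # contends for a single cross-process lock.
    for _ in range(ITERS):
        with counter.get_lock():
            counter.value += 1

def independent_worker(results, idx):
    # Rate-style: a private counter, no sharing at all.
    local = 0
    for _ in range(ITERS):
        local += 1
    results[idx] = local

def run(target, args_for):
    procs = [mp.Process(target=target, args=args_for(i)) for i in range(N_PROCS)]
    t0 = time.perf_counter()
    for p in procs: p.start()
    for p in procs: p.join()
    return time.perf_counter() - t0

if __name__ == "__main__":
    counter = mp.Value("q", 0)
    t_shared = run(shared_worker, lambda i: (counter,))
    results = mp.Array("q", N_PROCS)
    t_indep = run(independent_worker, lambda i: (results, i))
    print(f"lock-shared: {t_shared:.2f}s   independent: {t_indep:.2f}s")
```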
 

Hitman928

Diamond Member
Apr 15, 2012
5,368
8,180
136
Realistically, no; everything is going to be using its home node's memory anyway, barring occasional small spills. SPEC rate scales well to thousands of sockets; it's one of the reasons it's a relatively poor MT benchmark for commercial code (which tends to do a lot of sharing, especially when locks, etc, are used heavily.) It's slightly better for highly parallel workstation code but I still don't find rate terribly useful for MT throughput.

For ST, it remains the gold standard (although the reporting rules in '17 have some annoying aspects that IMO reduce the usefulness of submitted results.)

Do you know if asm is disabled for the x264 test?
 

amrnuke

Golden Member
Apr 24, 2019
1,181
1,772
136
Realistically, no; everything is going to be using its home node's memory anyway, barring occasional small spills. SPEC rate scales well to thousands of sockets; it's one of the reasons it's a relatively poor MT benchmark for commercial code (which tends to do a lot of sharing, especially when locks, etc, are used heavily.) It's slightly better for highly parallel workstation code but I still don't find rate terribly useful for MT throughput.

For ST, it remains the gold standard (although the reporting rules in '17 have some annoying aspects that IMO reduce the usefulness of submitted results.)
What is a benchmark that realistically parallels commercial code?

I know AT specifically points out that they believe the rate score to be terrible for multithreading evaluation, but it's the only MT benchmark we have for an ARM system with more than a handful of big cores. So when we talk about scaling ARM up to the laptop/desktop/HEDT world, it's very interesting and important to consider, I believe. Hence why we keep circling back to Graviton2's results.
 

Nothingness

Platinum Member
Jul 3, 2013
2,442
767
136
What is a benchmark that realistically parallels commercial code?

I know AT specifically points out that they believe the rate score to be terrible for multithreading evaluation, but it's the only MT benchmark we have for an ARM system with more than a handful of big cores. So when we talk about scaling ARM up to the laptop/desktop/HEDT world, it's very interesting and important to consider, I believe. Hence why we keep circling back to Graviton2's results.
SPECrate has nothing to do with multithreading at all. It's multiple processes doing the same thing, launched at the same time. That's a very poor characterization of properly parallelized/multithreaded tasks: for some of the subtests, it's basically a stress test of memory bandwidth. It's not even a good proxy for parallel compilation.
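
To make that concrete, here's a minimal sketch of what a rate-style harness effectively does (a hypothetical stand-in workload, not the actual SPEC tooling):

```python
import multiprocessing as mp
import time

def one_copy(_):
    # Stand-in for one benchmark copy: fully independent work with
    # no locks and no shared state; memory traffic is the only
    # resource the copies compete for.
    total = 0
    for x in range(200_000):
        total += x * x
    return total

if __name__ == "__main__":
    copies = mp.cpu_count()  # rate runs one copy per hardware context
    t0 = time.perf_counter()
    with mp.Pool(copies) as pool:
        pool.map(one_copy, range(copies))
    elapsed = time.perf_counter() - t0
    print(f"{copies} identical copies in {elapsed:.2f}s "
          f"-> {copies / elapsed:.1f} copies/s throughput")
```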
 
  • Like
Reactions: thigobr

amrnuke

Golden Member
Apr 24, 2019
1,181
1,772
136
SPECrate has nothing to do with multithreading at all. It's multiple processes doing the same thing, launched at the same time. That's a very poor characterization of properly parallelized/multithreaded tasks: for some of the subtests, it's basically a stress test of memory bandwidth. It's not even a good proxy for parallel compilation.
In other words we still know nothing about how ARM scales up.
 

SarahKerrigan

Senior member
Oct 12, 2014
383
558
136
What is a benchmark that realistically parallels commercial code?

I know AT specifically points out that they believe the rate score to be terrible for multithreading evaluation, but it's the only MT benchmark we have for an ARM system with more than a handful of big cores. So when we talk about scaling ARM up to the laptop/desktop/HEDT world, it's very interesting and important to consider, I believe. Hence why we keep circling back to Graviton2's results.

For enterprise-type parallel code, SAP SD-2 is pretty good. Any database bench is going to hit coherency structures harder than SPECrate will, I suspect.
 

Andrei.

Senior member
Jan 26, 2015
316
386
136
Most phones also run their SoC with much higher power limits than the Switch. They prioritise benchmark wins over long term thermal performance and battery life... whereas the Switch deliberately aims to provide consistent performance for multiple hours. There's a reason why they drastically underclocked the Tegra X1.

Anyway, I doubt Nintendo would ever swap to a non-NVidia vendor for a Switch follow up. They have a custom low level API written by NVidia, NVN, so they can't just swap in any old ARM SoC.
Hahaha, what?! The Switch is downclocked because otherwise the X1 would melt a hole in it and the battery would last 1-2 hours. At full power, the GPU alone draws about 7 W out of a 13-14 W SoC on a 3D workload. Why the heck do you think the Switch is actively cooled while phones are passive? Phone SoCs are far more efficient and draw less power.
 
  • Like
Reactions: Lodix