Solved! ARM Apple High-End CPU - Intel replacement


Richie Rich

Senior member
Jul 28, 2019
470
229
76
The first rumor about an Intel replacement in Apple products has arrived:
  • ARM-based high-end CPU
  • 8 cores, no SMT
  • IPC +30% over Cortex-A77
  • desktop performance (Core i7/Ryzen 7) with much lower power consumption
  • introduction with the next-gen MacBook Air in mid-2020 (MacBook Pro and iMac also under consideration)
  • massive AI accelerator

Source Coreteks:
 
  • Like
Reactions: vspalanki
Solution
What an understatement :D And it looks like it doesn't want to die. Yet.


Yes, the A13 is competitive against Intel chips, but the emulation tax is about 2x. So given that the A13 ~= Intel, emulated x86 programs would run at about half the speed of an equivalent x86 machine. This is one of the reasons they haven't switched yet.
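
(Spelling out the arithmetic, with both figures being the assumptions above rather than measurements:)

```python
# Both numbers are assumptions from the post above, not measurements.
native_a13 = 1.0       # assume A13 ~= a comparable Intel core on native code
emulation_tax = 2.0    # assumed ~2x slowdown for emulated x86

effective = native_a13 / emulation_tax
print(f"Emulated x86 speed vs. an equivalent x86 machine: {effective:.0%}")
# -> 50%, i.e. half the speed
```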

Another reason is that it would prevent the use of Windows on their machines, something some say is very important.

The level of ignorance in this thread would be shocking if it weren't depressing.
Let's state some basics:

(a) History. Apple has never let backward compatibility limit what they do. They are not Intel, they are not Windows. They don't sell perpetual compatibility as a feature. Christ, the big...

Nothingness

Platinum Member
Jul 3, 2013
2,442
767
136
He must have changed his numbers? The difference I saw in MT was huge for Rome. Naples, on the other hand, wasn't as big, though there was still a swing. Naples lost in ST but won in MT.
Yes, he had made a mistake in his original data. It's depressing that the wrong results are what stick in people's minds :(
 
  • Like
Reactions: lightmanek

DrMrLordX

Lifer
Apr 27, 2000
21,678
10,940
136
Yes, he had made a mistake in his original data. It's depressing that the wrong results are what stick in people's minds :(

He also deleted the Naples numbers, and his original numbers had Graviton2 winning the ST SPEC2006 outright. What happened?

edit: the Naples numbers were important in that they demonstrated some of the scaling trouble Graviton2 has, albeit not to the extent that the (apparently erroneous) Rome numbers did. Naples was losing to Graviton2 in ST but winning in MT. Now with the Naples numbers removed from the ST comparison, that isn't even evident anymore.
 

soresu

Platinum Member
Dec 19, 2014
2,695
1,902
136
Yeah, they have gone from the great Wii
More like the GameCube to the Wii.

The GC was brand new hardware (custom ATI graphics? first use of PowerPC in a console?), whereas the Wii was basically an overclocked GC hardware platform with new motion controllers. It was an innovative move for a console, to be sure, but the underlying hardware was already years old by then.

I actually wonder if the devkit was a modified PowerPC Mac.
Nintendo Switch has 0.2 TFLOPS FP32
The Switch is based on the Tegra X1 from the SHIELD TV, which could manage up to 500 gigaflops. Obviously that would be more than a stretch in handheld mode, but it should get a lot closer to that maximum in docked mode, certainly higher than 200 gigaflops.

Bear in mind that the TX1 also has double-rate FP16, like Pascal, Vega, and the PS4 Pro. It is the only Maxwell GPU that has this feature unless I'm mistaken, somewhat like the PS4 Pro is the only Polaris GPU that has it.
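
For reference, the peak-FLOPS arithmetic behind those figures, using the commonly cited TX1/Switch GPU clocks (treat the clocks as approximate):

```python
# Peak FP32 throughput = CUDA cores * 2 FLOPs/clock (FMA) * clock speed.
# Clocks below are the commonly reported ones, not official spec sheets.
CORES = 256  # Maxwell CUDA cores in the Tegra X1

def gflops_fp32(clock_mhz: float) -> float:
    return CORES * 2 * clock_mhz / 1000.0

for label, mhz in [("TX1 max (SHIELD TV)", 1000.0),
                   ("Switch docked", 768.0),
                   ("Switch handheld", 307.2)]:
    fp32 = gflops_fp32(mhz)
    # Double-rate FP16 doubles the FP32 figure on this GPU
    print(f"{label:20s} {fp32:6.1f} GFLOPS FP32 / {fp32 * 2:6.1f} GFLOPS FP16")
```

That puts docked mode around 390 GFLOPS FP32, consistent with "certainly higher than 200 gigaflops" above.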
 
Last edited:

Nothingness

Platinum Member
Jul 3, 2013
2,442
767
136
He also deleted the Naples numbers, and his original numbers had Graviton2 winning the ST SPEC2006 outright. What happened?

edit: the Naples numbers were important in that they demonstrated some of the scaling trouble Graviton2 has, albeit not to the extent that the (apparently erroneous) Rome numbers did. Naples was losing to Graviton2 in ST but winning in MT. Now with the Naples numbers removed from the ST comparison, that isn't even evident anymore.
You can make that computation yourself, it's easy. That will prevent you from thinking there's some hidden agenda :D
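
(If you want to redo it, it's just a per-socket normalization; a minimal sketch with placeholder scores, not the article's actual numbers:)

```python
# Placeholder scores for illustration only; substitute the real SPECrate results.
two_socket_rome = 400.0   # hypothetical 2S Rome SPECrate score
graviton2_1s    = 210.0   # hypothetical 1S Graviton2 SPECrate score

# The spreadsheet error: 2-socket scores must be halved before a
# per-socket comparison against a 1-socket system.
rome_per_socket = two_socket_rome / 2
print(f"Rome per socket: {rome_per_socket:.0f}")
print(f"Graviton2 (1S):  {graviton2_1s:.0f}")
print(f"ratio:           {graviton2_1s / rome_per_socket:.2f}x")
```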
 

soresu

Platinum Member
Dec 19, 2014
2,695
1,902
136
Gaming is about the fun factor, and that's where Nintendo excels. Most Android phones could serve as a gaming console from a performance point of view. I think a combination of 8x Cortex-A78 cores with a second-generation Valhall GPU, the Mali-G78, could deliver pretty decent performance, around 1.5 TFLOPS. I'm not sure how much they can scale it up. To 6 TFLOPS max, maybe?
Valhall is still a question mark, really. We all hope ARM fixed whatever ailed Bifrost/Bifrost2 (G76), but that hope needs vindication with good perf/watt results versus Adreno.

If the G78 doesn't deliver for ARM, then, coupled with interference from the US government, high-end Mali may wither on the vine, as it were. With Samsung switching to RDNA, it has too few customers now that Huawei is an uncertain customer due to imperial entanglements.
 

amrnuke

Golden Member
Apr 24, 2019
1,181
1,772
136
He must have changed his numbers? The difference I saw in MT was huge for Rome. Naples, on the other hand, wasn't as big, though there was still a swing. Naples lost in ST but won in MT.
Yes, big apologies. I made a follow-up post; I had calculated something wrong somewhere. Someone said something that made me start thinking, and I went back and re-calculated.
 

amrnuke

Golden Member
Apr 24, 2019
1,181
1,772
136
Yes, he had made a mistake in his original data. It's depressing that the wrong results are what stick in people's minds :(
:(

I want to go back, revisit the comparison, and look more closely at the data, but I got caught up in a discussion about SMT in the Graviton thread.

Interesting that I'm talking about Graviton in this thread and SMT in the Graviton thread, haha... I'm arguing with a guy who is CERTAIN that SMT4 is coming in Zen 3, trying to convince him that SMT2 is actually beneficial. It's almost comical.
 

SarahKerrigan

Senior member
Oct 12, 2014
383
558
136
He also deleted the Naples numbers, and his original numbers had Graviton2 winning the ST SPEC2006 outright. What happened?

edit: the Naples numbers were important in that they demonstrated some of the scaling trouble Graviton2 has, albeit not to the extent that the (apparently erroneous) Rome numbers did. Naples was losing to Graviton2 in ST but winning in MT. Now with the Naples numbers removed from the ST comparison, that isn't even evident anymore.

The spreadsheet error applied to both: the Naples and Rome numbers should have been cut in half, and they weren't. You can see the raw numbers (for 2S Naples) here:


for comparison against Grav2, here:


One-socket Graviton2 is competitive against dual-socket 7601, although it typically loses by a little bit. Saying "oh, it has scaling issues because a one-socket ARM chip sometimes loses against dual-socket x86" does not help your case.
 

amrnuke

Golden Member
Apr 24, 2019
1,181
1,772
136
One-socket Graviton2 is competitive against dual-socket 7601, although it typically loses by a little bit. Saying "oh, it has scaling issues because a one-socket ARM chip sometimes loses against dual-socket x86" does not help your case.
Keep in mind that first-gen EPYC had MAJOR NUMA issues, and the inter-socket latency likely hammers 7601 performance quite a bit in these tests.
 

Hitman928

Diamond Member
Apr 15, 2012
5,368
8,180
136
Regarding AnandTech's SPEC scaling numbers: looking back at their compiler flags, I think they're missing a couple of options that would really help (e.g. jemalloc), which all the builders use when compiling for these high core/thread-count machines. I don't know how much it would help for the rate benchmarks, but all the system vendors use it when running rate, and Ampere used it for their SPEC rate estimate scores as well.
 
  • Like
Reactions: lightmanek

amrnuke

Golden Member
Apr 24, 2019
1,181
1,772
136
Regarding AnandTech's SPEC scaling numbers: looking back at their compiler flags, I think they're missing a couple of options that would really help (e.g. jemalloc), which all the builders use when compiling for these high core/thread-count machines. I don't know how much it would help for the rate benchmarks, but all the system vendors use it when running rate, and Ampere used it for their SPEC rate estimate scores as well.
Exactly - and one thing to keep in mind for Graviton vs EPYC in these tests: even though AMD improved NUMA in Rome, it is still a limitation, with intersocket latencies of 200+ns. This is for sure better than 250+ns on Naples but is likely hampering total system throughput on SPEC rate scores.
 

SarahKerrigan

Senior member
Oct 12, 2014
383
558
136
Exactly - and one thing to keep in mind for Graviton vs EPYC in these tests: even though AMD improved NUMA in Rome, it is still a limitation, with intersocket latencies of 200+ns. This is for sure better than 250+ns on Naples but is likely hampering total system throughput on SPEC rate scores.

I would be extremely surprised if rate is going to care much about inter-socket coherence latency; each hardware context is running its own SPEC copy and there's no real sharing. As for jemalloc, the improvements I've seen myself with it are minimal.
 

amrnuke

Golden Member
Apr 24, 2019
1,181
1,772
136
I would be extremely surprised if rate is going to care much about inter-socket coherence latency; each hardware context is running its own SPEC copy and there's no real sharing. As for jemalloc, the improvements I've seen myself with it are minimal.
Interesting. Would it be an issue on a 7601, with 4 NUMA nodes per socket?
 

SarahKerrigan

Senior member
Oct 12, 2014
383
558
136
Interesting. Would it be an issue on a 7601, with 4 NUMA nodes per socket?

Realistically, no; everything is going to be using its home node's memory anyway, barring occasional small spills. SPEC rate scales well to thousands of sockets; it's one of the reasons it's a relatively poor MT benchmark for commercial code (which tends to do a lot of sharing, especially when locks, etc, are used heavily.) It's slightly better for highly parallel workstation code but I still don't find rate terribly useful for MT throughput.

For ST, it remains the gold standard (although the reporting rules in '17 have some annoying aspects that IMO reduce the usefulness of submitted results.)
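
As a toy illustration of that sharing point (my own sketch, nothing to do with SPEC itself): lock-heavy "commercial-style" work serializes on shared state, while rate-style copies touch nothing in common.

```python
import multiprocessing as mp
import time

N_PROCS = 4
ITERS = 100_000

def shared_worker(counter):
    # Lock-heavy sharing, as in commercial code: every increment
    # contends for a single cross-process lock.
    for _ in range(ITERS):
        with counter.get_lock():
            counter.value += 1

def independent_worker(results, idx):
    # Rate-style: a private counter, no sharing at all.
    local = 0
    for _ in range(ITERS):
        local += 1
    results[idx] = local

def run(target, args_for):
    procs = [mp.Process(target=target, args=args_for(i)) for i in range(N_PROCS)]
    t0 = time.perf_counter()
    for p in procs: p.start()
    for p in procs: p.join()
    return time.perf_counter() - t0

if __name__ == "__main__":
    counter = mp.Value("q", 0)
    t_shared = run(shared_worker, lambda i: (counter,))
    results = mp.Array("q", N_PROCS)
    t_indep = run(independent_worker, lambda i: (results, i))
    print(f"lock-shared: {t_shared:.2f}s   independent: {t_indep:.2f}s")
```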
 

Hitman928

Diamond Member
Apr 15, 2012
5,368
8,180
136
Realistically, no; everything is going to be using its home node's memory anyway, barring occasional small spills. SPEC rate scales well to thousands of sockets; it's one of the reasons it's a relatively poor MT benchmark for commercial code (which tends to do a lot of sharing, especially when locks, etc, are used heavily.) It's slightly better for highly parallel workstation code but I still don't find rate terribly useful for MT throughput.

For ST, it remains the gold standard (although the reporting rules in '17 have some annoying aspects that IMO reduce the usefulness of submitted results.)

Do you know if asm is disabled for the x264 test?
 

amrnuke

Golden Member
Apr 24, 2019
1,181
1,772
136
Realistically, no; everything is going to be using its home node's memory anyway, barring occasional small spills. SPEC rate scales well to thousands of sockets; it's one of the reasons it's a relatively poor MT benchmark for commercial code (which tends to do a lot of sharing, especially when locks, etc, are used heavily.) It's slightly better for highly parallel workstation code but I still don't find rate terribly useful for MT throughput.

For ST, it remains the gold standard (although the reporting rules in '17 have some annoying aspects that IMO reduce the usefulness of submitted results.)
What is a benchmark that realistically parallels commercial code?

I know AT specifically points out that they believe the rate score to be terrible for multithreading evaluation, but it's the only MT benchmark we have for an ARM system with more than a handful of big cores. So when we talk about scaling ARM up to the laptop/desktop/HEDT world, it's very interesting and important to consider, I believe. Hence why we keep circling back to Graviton2's results.
 

Nothingness

Platinum Member
Jul 3, 2013
2,442
767
136
What is a benchmark that realistically parallels commercial code?

I know AT specifically points out that they believe the rate score to be terrible for multithreading evaluation, but it's the only MT benchmark we have for an ARM system with more than a handful of big cores. So when we talk about scaling ARM up to the laptop/desktop/HEDT world, it's very interesting and important to consider, I believe. Hence why we keep circling back to Graviton2's results.
SPECrate has nothing to do with multithreading at all. It's multiple processes doing the same thing, launched at the same time. That's a very poor characterization of properly parallelized/multithreaded tasks: for some of the subtests, it's basically a stress test of memory bandwidth. It's not even a good proxy for parallel compilation.
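
To make that concrete, here's a minimal sketch of what a rate-style harness effectively does (a hypothetical stand-in workload, not the actual SPEC tooling):

```python
import multiprocessing as mp
import time

def one_copy(_):
    # Stand-in for one benchmark copy: fully independent work with
    # no locks and no shared state; memory traffic is the only
    # resource the copies compete for.
    total = 0
    for x in range(200_000):
        total += x * x
    return total

if __name__ == "__main__":
    copies = mp.cpu_count()  # rate runs one copy per hardware context
    t0 = time.perf_counter()
    with mp.Pool(copies) as pool:
        pool.map(one_copy, range(copies))
    elapsed = time.perf_counter() - t0
    print(f"{copies} identical copies in {elapsed:.2f}s "
          f"-> {copies / elapsed:.1f} copies/s throughput")
```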
 
  • Like
Reactions: thigobr

amrnuke

Golden Member
Apr 24, 2019
1,181
1,772
136
SPECrate has nothing to do with multithreading at all. It's multiple processes doing the same thing, launched at the same time. That's a very poor characterization of properly parallelized/multithreaded tasks: for some of the subtests, it's basically a stress test of memory bandwidth. It's not even a good proxy for parallel compilation.
In other words we still know nothing about how ARM scales up.
 

SarahKerrigan

Senior member
Oct 12, 2014
383
558
136
What is a benchmark that realistically parallels commercial code?

I know AT specifically points out that they believe the rate score to be terrible for multithreading evaluation, but it's the only MT benchmark we have for an ARM system with more than a handful of big cores. So when we talk about scaling ARM up to the laptop/desktop/HEDT world, it's very interesting and important to consider, I believe. Hence why we keep circling back to Graviton2's results.

For enterprise-type parallel code, SAP SD-2 is pretty good. Any database bench is going to hit coherency structures harder than SPECrate will, I suspect.
 

Andrei.

Senior member
Jan 26, 2015
316
386
136
Most phones also run their SoC with much higher power limits than the Switch. They prioritise benchmark wins over long term thermal performance and battery life... whereas the Switch deliberately aims to provide consistent performance for multiple hours. There's a reason why they drastically underclocked the Tegra X1.

Anyway, I doubt Nintendo would ever swap to a non-NVidia vendor for a Switch follow up. They have a custom low level API written by NVidia, NVN, so they can't just swap in any old ARM SoC.
Hahaha, what?! The Switch is downclocked because otherwise the X1 would melt a hole in it and the battery would last 1-2 hours. At full power, the GPU alone draws about 7 W out of a 13-14 W SoC on a 3D workload. Why the heck do you think the Switch is actively cooled while phones are passive? Phone SoCs are far more efficient and draw less power.
 
  • Like
Reactions: Lodix