Solved! ARM Apple High-End CPU - Intel replacement


Richie Rich

Senior member
Jul 28, 2019
470
229
76
Here is the first rumor about an Intel replacement in Apple products:
  • ARM based high-end CPU
  • 8 cores, no SMT
  • IPC +30% over Cortex A77
  • desktop performance (Core i7/Ryzen R7) with much lower power consumption
  • introduction with the new-gen MacBook Air in mid-2020 (MacBook Pro and iMac also under consideration)
  • massive AI accelerator

Source Coreteks:
 
Solution
What an understatement :D And it looks like it doesn't want to die. Yet.


Yes, A13 is competitive against Intel chips but the emulation tax is about 2x. So given that A13 ~= Intel, for emulated x86 programs you'd get half the speed of an equivalent x86 machine. This is one of the reasons they haven't yet switched.

Another reason is that it would prevent the use of Windows on their machines, something some say is very important.

The level of ignorance in this thread would be shocking if it weren't depressing.
Let's state some basics:

(a) History. Apple has never let backward compatibility limit what they do. They are not Intel, they are not Windows. They don't sell perpetual compatibility as a feature. Christ, the big...

NostaSeronx

Diamond Member
Sep 18, 2011
3,683
1,218
136
Is this a RPi speculation thread?

RPi 5 is probably going to be on 18FDS and use Cortex-A58/VideoCore 7/WiFi 6E (802.11ax w/ 3 bands)

Cortex-A58 should be in the upcoming updated BYOD(build-your-own-device) IP library for 18FDS. Broadcom's Wifi 6E chip is on 16nm, and 18FDS supports RF IP down to 11nm.
 

soresu

Platinum Member
Dec 19, 2014
2,617
1,812
136
@soresu

Rockchip is using someone's 8nm process for the RK3588. I could see Pi using that as well.
Rockchip is an SoC vendor like Broadcom, which supplies the BCM2711 - they can afford to fab RK3588 for whoever will buy it.

The Pi Foundation's pockets are not deep, and they are very frugal in their economising.

I believe that even once the open Panfrost Bifrost GPU driver is done, PiF still won't replace the VCx GPU with an ARM Mali because of the expense of ARM's licensing over Broadcom's own solutions - though I'd welcome being wrong about that.

It would be great to have a more dependable SBC with a higher end SoC, I just doubt that it will happen any time soon.

For that matter RK3588 was delayed even before COVID, I wonder what's up in Rock land.....
 

name99

Senior member
Sep 11, 2010
404
303
136
I agree 100%. What nobody seems to get, or understand, is that Apple and ARM AT THE MOMENT seems to be very strong in single core, non IO dependent benchmarks. They were designed for that purpose and do it well. But what about things that have high IO and multi-threaded requirements? Blender is just one. What about a huge database server that serves Anandtech? Or Amazon? Or a medical database that serves 10 million consumer records and is used every day for 1000 different purposes? I used to work for such a company, with data centers measured in square miles. They require terabytes of memory, and the amount of IO for an Oracle database is staggering. The database is 100s of terabytes in size, maybe thousands. Just my one little system had a 300 terabyte database, and that is summarized data for analytical purposes, no details.

You mean exactly the sort of things that are running successfully on Graviton 2 RIGHT NOW?
 

Richie Rich

No offense, but I'm going to ignore your Blender numbers, there's nothing here that says they are trustworthy. It's not personal, just factual. I would have the same reaction no matter who posted this.
After five and a half hours I took a screenshot because I didn't want to re-test that :D But I think other RPi4 owners can confirm those numbers. It's not so difficult to install Blender from the repository and run it. It just takes a lot of time. Soresu, could you please sacrifice yourself? :D


Is that why my RPi4 is slow as balls?

Damn, kinda wish they would warn people about Raspbian, or pursue a better default OS package.
I was surprised too. A 94% IPC benefit from the swap to a 64-bit OS is huge. But that's specific to the Blender raytracing engine and some other heavy computing applications (optimized for using all 31 registers?). I searched for a 64-bit Geekbench for ARM, but unfortunately there is no such version right now (the GB website says there is no market demand).


Not gonna happen.

Maybe if 8nm becomes ridiculously cheap as everyone else chases sub 5nm nodes - it's a case of expense in the end, there's a pretty good reason BCM2711 is made on 28nm instead of 16/14/12nm which many products like Fire TV device SoC's are being made on now.

The Pi Foundation guys have no other income than sales of the Pi and its accessories to defray the costs of the SoC - unlike Amazon who have an entire empire, not to mention the content streamed to it from Amazon Video which is the real money maker for them.

Bear in mind that even A72 is having severe thermal issues without a heatsink in BCM2711 - A78 would probably not be viable even with 16/12nm.

At best I would expect A73, or at a stretch A75 in RPi5 as A73 is much more power efficient than A72, and A75 supposedly has a similar power/clock figure to A73.
RPi4 can run 2 GHz easily with a cooler. IMHO an RPi5 with A78 cores manufactured on cheap GF 12nm or TSMC 10nm could achieve 2.5-3 GHz very easily as well. With a bunch of USB4 ports you can connect an 8K monitor, a fast SSD and whatever you want. 8GB of RAM will be enough for 99% of applications. Even at an increased price of $100, this RPi5 has the potential to destroy the cheap PC market entirely.


Is this a RPi speculation thread?

RPi 5 is probably going to be on 18FDS and use Cortex-A58/VideoCore 7/WiFi 6E (802.11ax w/ 3 bands)

Cortex-A58 should be in the upcoming updated BYOD(build-your-own-device) IP library for 18FDS. Broadcom's Wifi 6E chip is on 16nm, and 18FDS supports RF IP down to 11nm.
If A58 is the A55 successor among in-order cores, then I doubt RPi would take a performance step down. For the same reason I doubt they will use the 2xALU OoO A73/75 - it's still too slow. My favorite is something from the new Austin cores: A76 or maybe A77/78. Even on dirt cheap 28nm that A78 would be a great performer at around 2 GHz. 2.5x the IPC and +50% clock - that's almost 4x the ST performance compared to A72. Pretty huge step.
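
For anyone who wants to check the "almost 4x" figure, here is a minimal sketch of the arithmetic. It assumes single-thread performance scales as IPC times clock, and it takes the 2.5x IPC and +50% clock as given guesses rather than measurements. The function name is made up for illustration.

```python
# Rough single-thread speedup estimate, assuming performance ~ IPC x clock.
# The inputs are speculative figures from the post, not benchmark results.

def st_speedup(ipc_ratio: float, clock_ratio: float) -> float:
    """Relative single-thread performance under the IPC * frequency model."""
    return ipc_ratio * clock_ratio

# A78 vs A72: 2.5x the IPC, clock raised by 50% (1.5x).
a78_vs_a72 = st_speedup(ipc_ratio=2.5, clock_ratio=1.5)
print(f"Estimated ST speedup over A72: {a78_vs_a72:.2f}x")
```

Under those assumptions the product comes out at 3.75x, which is where "almost 4x" comes from.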
 

NostaSeronx

If A58 is the A55 successor among in-order cores, then I doubt RPi would take a performance step down. For the same reason I doubt they will use the 2xALU OoO A73/75 - it's still too slow. My favorite is something from the new Austin cores: A76 or maybe A77/78. Even on dirt cheap 28nm that A78 would be a great performer at around 2 GHz. 2.5x the IPC and +50% clock - that's almost 4x the ST performance compared to A72. Pretty huge step.
It(A58) should be a >2-wide OoO w/ a fused-op cache and perf spec'd around A75/A76 w/ power optimizations of A77/A78.
 

Hitman928

Diamond Member
Apr 15, 2012
5,181
7,631
136
After five and a half hours I took a screenshot because I didn't want to re-test that :D But I think other RPi4 owners can confirm those numbers. It's not so difficult to install Blender from the repository and run it. It just takes a lot of time. Soresu, could you please sacrifice yourself? :D

It's not only the Rpi4 score, but the scenes used, blender settings, OS used, operating frequencies, etc. There is nothing in your post that allows anyone to even check to see if your numbers are reasonable let alone accurate.
 

NostaSeronx

Sounds a bit of a stretch considering how much above the relatively new 3 wide A65 that would be perf wise.
A65 has the same gen number as 35/55/75, so it will be behind processors with a gen number of 8(38/58/78).

Also, decode width is 2-wide OoO in A65, which should be the same for A58 (2x4B decode); the A58 then adds an L0/fused-op cache (>512 entries x 8B => 4 KB L0).
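
The 4 KB figure is just entries times entry size. A quick sketch of that calculation, taking the 512-entry lower bound and 8 B per fused op from the line above (both speculative, and the helper name is only for illustration):

```python
# Size of a small op cache: number of entries times bytes per entry.
# 512 entries and 8 bytes per entry are the speculated figures from the post.

def cache_bytes(entries: int, bytes_per_entry: int) -> int:
    """Total storage for the cache payload, ignoring tags and metadata."""
    return entries * bytes_per_entry

l0_bytes = cache_bytes(entries=512, bytes_per_entry=8)
print(f"L0 payload: {l0_bytes} bytes = {l0_bytes // 1024} KiB")
```

Since the post says ">512 entries", 4 KB would be the minimum size, and tag or metadata storage is not counted here.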
 

soresu

A65 has the same gen number as 35/55/75, so it will be behind stuff with a gen number of 8(38/58/78).
A65 has a higher ST IPC than A55 (20% ish), taken with SMT its MT IPC shreds A55 completely.

A35 is also not ISA compatible with A55: it implements v8.0-A, whereas A55 implements v8.2-A to match A75.
 

Richie Rich

A75 is 3 wide.

Part of what makes me so amazed at the Sophia team's engineering skills is that it is so close to the 2 wide power/clock of A73.
Well, that's correct, A75 is 3-wide at decode/front end.
But it still has the same scalar back-end width as A73: 2xALUs + 1xBranch, 2xLSU

A73
arm-a75_a55-cpu_diagram-a73.png

A75
arm-a75_a55-cpu_diagram-a75.png
 

DrMrLordX

Lifer
Apr 27, 2000
21,582
10,785
136
You mean exactly the sort of things that are running successfully on Graviton 2 RIGHT NOW?

Oh I didn't know Apple made that chip.

. . .

heh.

In all seriousness though, can you even buy time on a Graviton2 instance yet? I thought they were doing test runs by invitation only.
 

name99

Oh I didn't know Apple made that chip.

. . .

heh.

In all seriousness though, can you even buy time on a Graviton2 instance yet? I thought they were doing test runs by invitation only.

The point was to demonstrate the vapidity of the claim "What nobody seems to get, or understand, is that Apple and ARM AT THE MOMENT seems to be very strong in single core, non IO dependent benchmarks."

(a) ARM is in that claim. ARM of course provides the cores for Graviton (and most of the other ARMv8 server SoC's)...

(b) Just like EVERY DAMN STEP OF THIS PROCESS, the x86 crowd keep insisting that there's some magic ingredient in x86 cores that no-one else can reproduce. At first it was IPC, till Apple beat that. Then it was absolute performance, till Apple matched that. Then it was total throughput/memory bandwidth/IO, till Graviton 2 matched that.
When will you get it through your heads that there is no magic there?!? The only core competency of x86 SoCs is executing x86 code. If Amazon or Marvell are capable of creating a SoC with many attached memory controllers, lots of PCIe lanes, and plenty of cores, believe me, Apple is capable of doing the exact same thing -- IF they have a reason to do so.

For some of you the game might be "x86 uber alles", but for the rest of us, we're just sick of seeing the forums polluted by the massive ignorance. You have people who don't have a clue about Marvell (high thread count), don't have a clue about Ampere (dual socket support), don't have a clue about Apple (extremely high IPC), don't have a clue about Amazon (commercial performance/dollar advantage) making these grand statements about a world of which they know not a damn thing.
I mean, christ, how can you be making claims about how "ARM has lousy support for large SoCs running substantial IO and memory footprints" if you're not even following what AWS is doing? WTF are you basing your claims on if you refuse to even track the single most obvious (but not the only) example of ARM being used for precisely those tasks???

There ARE true, negative, statements about ARM that can be made -- like, right now, there are some fundamental libraries (eg bignum or crypto) that have not had nearly the optimization put into them that x86 has seen. How about in future we stick to claims like that rather than wild fantasies about "I have no idea how one might design a SoC with a high PCIe lane count, therefore, obviously, Apple similarly has no idea"?
 

DrMrLordX

(b) Just like EVERY DAMN STEP OF THIS PROCESS, the x86 crowd keep insisting that there's some magic ingredient in x86 cores that no-one else can reproduce.

Since Amazon is letting people purchase time on Graviton2 instances, why don't we just run some big iron workloads on one and see for sure?
 

soresu

There ARE true, negative, statements about ARM that can be made -- like, right now, there are some fundamental libraries (eg bignum or crypto) that have not had nearly the optimization put into them that x86 has seen.
It depends on what you are doing with it as to what you get out of it - for me Android has long since been a slick, fast and generally pleasant UX because it has clearly been the focus of Linaro efforts as well as obviously Google and likely several Android vendors on top of that.

Ordinary Linux on the other hand is not nearly as impressive for me when experienced on an ARM SBC - I certainly expected far more from the base RPi 4 experience on Raspbian than the reality when first using it.

Definitely much more work to be done.

Something like a Clear Linux equivalent for ARM.
 

Richie Rich

It's not only the Rpi4 score, but the scenes used, blender settings, OS used, operating frequencies, etc. There is nothing in your post that allows anyone to even check to see if your numbers are reasonable let alone accurate.
The scene is the BMW of course because it's the smallest one. I don't want to wait 2 days to render Grohe's Pavilion Barcelona.
For Ubuntu 64-bit it is Blender 2.82a and for 32-bit Raspbian it's 2.79b, both from repository, default settings.
 

Hitman928

The scene is the BMW of course because it's the smallest one. I don't want to wait 2 days to render Grohe's Pavilion Barcelona.
For Ubuntu 64-bit it is Blender 2.82a and for 32-bit Raspbian it's 2.79b, both from repository, default settings.

Well, I don't have a 3700X, but I have a 2700 and I get very different results from the 3700X numbers in your post.

I downclocked my 2700 to 1.5 GHz and 4 cores with no SMT to match the Rpi4. System info can be found in the Geekbench4 link. BMW demo file and default settings for Blender.


My Zen+ : GB4 ST - 1487 pts/GHz
Your Rpi4: GB4 ST - 645 pts/GHz

Zen+ = 130.5% faster performance per clock in GB4

My Zen+ : Blender (v2.82) 4T - 1356s
Your Rpi4: Blender (v2.82) 4T - 4077s

Zen+ = 200% faster performance per clock in Blender
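
The two percentages above can be reproduced with a short sketch. It assumes both machines ran at 1.5 GHz with 4 threads (as described for the downclocked 2700), so the clock cancels and the ratios are already per-clock. The helper names are made up for illustration.

```python
# Reproduce the per-clock comparison from the posted numbers.
# Assumes both systems ran at the same clock (1.5 GHz) and thread count (4).

def percent_faster_score(fast: float, slow: float) -> float:
    """Percent advantage for higher-is-better scores (e.g. GB4 points per GHz)."""
    return (fast / slow - 1.0) * 100.0

def percent_faster_time(fast_s: float, slow_s: float) -> float:
    """Percent advantage for lower-is-better times (e.g. render seconds)."""
    return (slow_s / fast_s - 1.0) * 100.0

gb4_adv = percent_faster_score(fast=1487, slow=645)          # Zen+ vs RPi4, pts/GHz
blender_adv = percent_faster_time(fast_s=1356, slow_s=4077)  # Zen+ vs RPi4, seconds

print(f"GB4 ST:  Zen+ is {gb4_adv:.1f}% faster per clock")
print(f"Blender: Zen+ is {blender_adv:.1f}% faster per clock")
```

Note that the render times have to be inverted before comparing, since a lower time is better while a higher GB4 score is better.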

That's a pretty big difference. Obviously Zen2 will be even faster. I'd prefer both to be done in a controlled environment though. Perhaps I'll have to pick up a Rpi4 to play with. Maybe GB5 would be closer to Blender.

1589647729167.png
 

name99

Since Amazon is letting people purchase time on Graviton2 instances, why don't we just run some big iron workloads on one and see for sure?

Once again, you miss the point: where do you get your certainty as to the performance of ARM if you haven't followed the people who have been doing just that?

If you ASKED people for references to Graviton2 performance (or better yet, use this amazing new thing called Google) you'd get plenty of results. What I am complaining about is your absolute certainty that you already know the answer, when it's clear that you're not even following this space.
 

Doug S

Platinum Member
Feb 8, 2020
2,202
3,405
136
For some of you the game might be "x86 uber alles", but for the rest of us, we're just sick of seeing the forums polluted by the massive ignorance.


For some of us this is a story we've heard before. In the 90s when Pentium Pro came out and x86 started beating RISCs in integer all the same arguments and excuses we are hearing here were made for why x86 could never compete with PA-RISC and Alpha. You aren't comparing with the right benchmarks, they'll never match RISC in floating point, you need bigger benchmarks to properly measure the memory system, you aren't taking I/O into account etc. etc.

People who think x86 has some unique advantages over ARM have their heads in the sand. I guess it will take Apple releasing the first ARM Macs for them to finally admit this, though I imagine some will still manage to find a few things x86 does better and try to claim those are the things that really matter.
 

Richie Rich

@Hitman928

Blender ST results:
  • Zen2 Ryzen 3700X ... 7463 s/GHz
  • Cortex A72 (RPi4) ... 15443 s/GHz .... that's 48% PPC of Zen2
But there are some differences between your measurements and mine:
  • my comparison was in ST, where A72 could benefit from 64-bit. A 3-core load is still OK, while 4-core load performance suffers a lot (probably due to a memory bandwidth bottleneck). So for a core-to-core comparison, an ST load is much more realistic because it eliminates that bottleneck.
  • no downclock of my 3700X and SMT on, just a recalculation based on the given frequency per thread (16t)

Blender MT results:
  • Zen2 Ryzen 3700X 8c/16t ..... 179 s ..... 11 466 s/GHz/thread
  • Cortex A72 (RPi4) 4c/4t ...... 4077 s ..... 24 464 s/GHz/thread .... that's only 47% PPC per thread of Zen2
Please note that we compare an A72 core to a Zen2 thread, which means AMD can get more than 4x higher PPC out of a Zen2 core thanks to SMT2 (while the RPi4 bottlenecks at all-core load). This corresponds with your 200% higher PPC (Zen2 has +300%).
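
A minimal sketch of how these ratios fall out of the posted figures. It assumes the values labeled s/GHz are render time multiplied by clock (and by thread count for MT), so lower is better. The RPi4 clock of 1.5 GHz is taken from earlier in the thread. The 3700X clock is not stated anywhere, so the script only backs out the value the posted number implies.

```python
# Relative performance per clock (PPC) from the posted normalized figures.
# Assumption: each figure = render_seconds * GHz (* threads for MT), lower is better.

def ppc_percent(reference_cost: float, other_cost: float) -> float:
    """PPC of 'other' as a percentage of 'reference'."""
    return reference_cost / other_cost * 100.0

st_ppc = ppc_percent(reference_cost=7463, other_cost=15443)   # A72 vs Zen2, ST
mt_ppc = ppc_percent(reference_cost=11466, other_cost=24464)  # A72 core vs Zen2 thread, MT

# One Zen2 core runs two threads (SMT2), so its per-core cost is half the per-thread cost.
zen2_core_vs_a72_core = 24464 / (11466 / 2)

# Sanity checks against the raw times in the post.
rpi4_cost = 4077 * 4 * 1.5               # seconds * threads * GHz (1.5 GHz assumed)
implied_zen2_ghz = 11466 / (179 * 16)    # clock implied by the posted MT figure

print(f"ST: A72 has {st_ppc:.0f}% of Zen2 PPC")
print(f"MT: A72 core has {mt_ppc:.0f}% of Zen2 per-thread PPC")
print(f"Zen2 core vs A72 core under all-core load: {zen2_core_vs_a72_core:.2f}x")
print(f"RPi4 cost recomputed: {rpi4_cost:.0f} (posted: 24464)")
print(f"Implied 3700X clock: {implied_zen2_ghz:.2f} GHz")
```

The recomputed RPi4 figure lands within a couple of units of the posted 24 464, and the implied 3700X clock comes out close to 4 GHz, so the posted numbers look internally consistent under this reading.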

What is your time when you run Blender single-threaded? It would be more interesting to compare whether Zen+ scales similarly to Zen2.