Solved! ARM Apple High-End CPU - Intel replacement

Richie Rich · Oct 14, 2019

There is a first rumor about an Intel replacement in Apple products:

ARM based high-end CPU
8 cores, no SMT
IPC +30% over Cortex A77
desktop performance (Core i7/Ryzen R7) with much lower power consumption
introduction with the new-gen MacBook Air in mid-2020 (MacBook Pro and iMac also under consideration)
massive AI accelerator

Source Coreteks:

DrMrLordX · May 15, 2020

@soresu

Rockchip is using someone's 8nm process for the RK3588. I could see Pi using that as well.

NostaSeronx · May 15, 2020

Is this a RPi speculation thread?

RPi 5 is probably going to be on 18FDS and use Cortex-A58/VideoCore 7/WiFi 6E (802.11ax w/ 3 bands)

Cortex-A58 should be in the upcoming updated BYOD(build-your-own-device) IP library for 18FDS. Broadcom's Wifi 6E chip is on 16nm, and 18FDS supports RF IP down to 11nm.

soresu · May 15, 2020

DrMrLordX said:
@soresu

Rockchip is using someone's 8nm process for the RK3588. I could see Pi using that as well.

Rockchip is an SoC vendor like Broadcom, which supplies the BCM2711 - they can afford to fab RK3588 for whoever will buy it.

The Pi Foundation's pockets are not deep, and they are very frugal in their economising.

I believe that even once the open Panfrost Bifrost GPU driver is done, PiF still won't replace the VCx GPU with an ARM Mali because of the expense of ARM's licensing over Broadcom's own solutions - though I'd welcome being wrong about that.

It would be great to have a more dependable SBC with a higher end SoC, I just doubt that it will happen any time soon.

For that matter RK3588 was delayed even before COVID, I wonder what's up in Rock land.....

name99 · May 15, 2020

Markfw said:
I agree 100%. What nobody seems to get, or understand, is that Apple and ARM AT THE MOMENT seems to be very strong in single core, non IO dependent benchmarks. They were designed for that purpose and do it well. But what about things that have high IO and multi-threaded requirements? Blender is just one. What about a huge database server that serves Anandtech? Or Amazon? Or a medical database that serves 10 million consumers' records and is used every day for 1000 different purposes? I used to work for such a company, with data centers measured in square miles. They require terabytes of memory, and the amount of IO for an Oracle database is staggering. The database is hundreds of terabytes in size, maybe thousands. Just my one little system had a 300 terabyte database, and that is summarized data for analytical purposes, no details.

You mean exactly the sort of things that are running successfully on Graviton 2 RIGHT NOW?

Richie Rich · May 15, 2020

Hitman928 said:
No offense, but I'm going to ignore your Blender numbers, there's nothing here that says they are trustworthy. It's not personal, just factual. I would have the same reaction no matter who posted this.

After five and a half hours I took a screenshot because I didn't want to re-test that.

But I think other RPi4 owners can confirm those numbers. It's not so difficult to install Blender from the repository and run it. It just takes a lot of time. Soresu, could you please sacrifice yourself?

soresu said:
Is that why my RPi4 is slow as balls?

Damn, kinda wish they would warn people about Raspbian, or pursue a better default OS package.

I was surprised too. A 94% IPC benefit from switching to a 64-bit OS is huge. But that's specific to the Blender raytracing engine and some other compute-heavy applications (optimized to use all 31 registers?). I searched for a 64-bit GeekBench for ARM, but unfortunately there is no such version right now (the GB website says there is no market demand).

soresu said:
Not gonna happen.

Maybe if 8nm becomes ridiculously cheap as everyone else chases sub 5nm nodes - it's a case of expense in the end, there's a pretty good reason BCM2711 is made on 28nm instead of 16/14/12nm which many products like Fire TV device SoC's are being made on now.

The Pi Foundation guys have no other income than sales of the Pi and its accessories to defray the costs of the SoC - unlike Amazon, who have an entire empire, not to mention the content streamed to it from Amazon Video, which is the real money maker for them.

Bear in mind that even A72 is having severe thermal issues without a heatsink in BCM2711 - A78 would probably not be viable even with 16/12nm.

At best I would expect A73, or at a stretch A75 in RPi5 as A73 is much more power efficient than A72, and A75 supposedly has a similar power/clock figure to A73.

RPi4 can run 2 GHz easily with a cooler. IMHO RPi5 with A78 cores manufactured at cheap GF 12nm, or TSMC 10nm can achieve 2.5-3 GHz very easily as well. With a bunch of USB4 ports you can connect an 8K monitor, a fast SSD and whatever you want. 8GB of RAM will be enough for 99% of applications. Even at an increased price of $100, this RPi5 has the potential to destroy the cheap PC market entirely.

NostaSeronx said:
Is this a RPi speculation thread?

RPi 5 is probably going to be on 18FDS and use Cortex-A58/VideoCore 7/WiFi 6E (802.11ax w/ 3 bands)

Cortex-A58 should be in the upcoming updated BYOD(build-your-own-device) IP library for 18FDS. Broadcom's Wifi 6E chip is on 16nm, and 18FDS supports RF IP down to 11nm.

If A58 is the in-order successor to A55 then I doubt RPi would take a performance step down. For the same reason I doubt they will use 2xALU OoO A73/75 - it's still too slow. My favorite is something from the new Austin cores: A76 or maybe A77/78. Even at dirt cheap 28nm that A78 would be a great performer around 2 GHz. 2.5x the IPC and +50% clock - that's almost 4x ST performance compared to A72. Pretty huge step.
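The "almost 4x" figure above is just IPC ratio multiplied by clock ratio. A minimal Python sketch of that arithmetic follows; the 2.5x IPC value is the poster's estimate rather than a measurement, and the function name `st_speedup` is made up for illustration. The second call assumes the RPi4's stock 1.5 GHz clock, in which case "around 2 GHz" is closer to +33% than +50%.

```python
def st_speedup(ipc_ratio: float, clock_ratio: float) -> float:
    """Single-thread speedup modeled as IPC ratio times clock ratio."""
    return ipc_ratio * clock_ratio


# Figures as claimed in the post: 2.5x IPC and +50% clock.
claimed = st_speedup(2.5, 1.5)

# Same IPC estimate, but with 2.0 GHz over an assumed 1.5 GHz baseline.
at_2ghz = st_speedup(2.5, 2.0 / 1.5)

print(f"claimed (+50% clock): {claimed:.2f}x")
print(f"2.0 GHz vs 1.5 GHz:   {at_2ghz:.2f}x")
```

Either way the estimate depends entirely on the assumed IPC ratio, so it is only as good as that number.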

NostaSeronx · May 15, 2020

Richie Rich said:
If A58 is the in-order successor to A55 then I doubt RPi would take a performance step down. For the same reason I doubt they will use 2xALU OoO A73/75 - it's still too slow. My favorite is something from the new Austin cores: A76 or maybe A77/78. Even at dirt cheap 28nm that A78 would be a great performer around 2 GHz. 2.5x the IPC and +50% clock - that's almost 4x ST performance compared to A72. Pretty huge step.

It(A58) should be a >2-wide OoO w/ a fused-op cache and perf spec'd around A75/A76 w/ power optimizations of A77/A78.

soresu · May 15, 2020

Richie Rich said:
IMHO RPi5 with A78 cores manufactured at cheap GF 12nm, or TSMC 10nm can achieve 2.5-3 GHz very easily as well.

Not sure what you pulled that from considering the only news concerning it on process specifics mentioned 5nm at Samsung.

soresu · May 15, 2020

Richie Rich said:
For the same reason I doubt they will use 2xALU OoO A73/75 - it's still too slow.

A75 is 3 wide.

Part of what makes me so amazed at the Sophia team's engineering skills is that it is so close to the 2 wide power/clock of A73.

Hitman928 · May 15, 2020

Richie Rich said:
After five and a half hours I took a screenshot because I didn't want to re-test that. But I think other RPi4 owners can confirm those numbers. It's not so difficult to install Blender from the repository and run it. It just takes a lot of time. Soresu, could you please sacrifice yourself?

It's not only the Rpi4 score, but the scenes used, blender settings, OS used, operating frequencies, etc. There is nothing in your post that allows anyone to even check to see if your numbers are reasonable let alone accurate.

soresu · May 15, 2020

NostaSeronx said:
It should be a >2-wide OoO w/ a fused-op cache and spec'd around A75/A76 w/ power optimizations of A77/A78.

Sounds a bit of a stretch considering how much above the relatively new 3 wide A65 that would be perf wise.

NostaSeronx · May 15, 2020

soresu said:
Sounds a bit of a stretch considering how much above the relatively new 3 wide A65 that would be perf wise.

A65 has the same gen number as 35/55/75, so it will be behind processors with a gen number of 8(38/58/78).

Also, decode width is 2-wide OoO in A65, which should be the same with A58 (2x4B decode); the A58 then adds an L0/fused-op cache (>512 entries x 8B => 4 KB L0).

soresu · May 15, 2020

NostaSeronx said:
A65 has the same gen number as 35/55/75, so it will be behind stuff with a gen number of 8(38/58/78).

A65 has a higher ST IPC than A55 (20% ish), taken with SMT its MT IPC shreds A55 completely.

A35 is also not ISA compatible with A55, having v8.0-A, whereas A55 has v8.2-A to match A75.

Richie Rich · May 15, 2020

soresu said:
A75 is 3 wide.

Part of what makes me so amazed at the Sophia team's engineering skills is that it is so close to the 2 wide power/clock of A73.

Well, that's correct, A75 is 3-wide at the decode/front end.
But the scalar back end is still the same width as in A73: 2xALUs + 1xBranch, 2xLSU

A73

https://images.anandtech.com/doci/11441/arm-a75_a55-cpu_diagram-a73.png

A75

https://images.anandtech.com/doci/11441/arm-a75_a55-cpu_diagram-a75.png

soresu · May 15, 2020

NostaSeronx said:
Also, decode width is 2-wide OoO in A65.

2 wide decode/dispatch, 3 wide OoO issue....

[Image: Neoverse E1 slide from Infra Tech Day 2019 (Jamil)]

DrMrLordX · May 15, 2020

name99 said:
You mean exactly the sort of things that are running successfully on Graviton 2 RIGHT NOW?

Oh I didn't know Apple made that chip.

. . .

heh.

In all seriousness though, can you even buy time on a Graviton2 instance yet? I thought they were doing test runs by invitation only.

Hitman928 · May 15, 2020

DrMrLordX said:
Oh I didn't know Apple made that chip.

. . .

heh.

In all seriousness though, can you even buy time on a Graviton2 instance yet? I thought they were doing test runs by invitation only.

They became available May 11.

DrMrLordX · May 15, 2020

Hitman928 said:
They became available May 11.

Oh really? Hmmm interesting.

name99 · May 15, 2020

DrMrLordX said:
Oh I didn't know Apple made that chip.

. . .

heh.

In all seriousness though, can you even buy time on a Graviton2 instance yet? I thought they were doing test runs by invitation only.

The point was to demonstrate the vapidity of the claim "What nobody seems to get, or understand, is that Apple and ARM AT THE MOMENT seems to be very strong in single core, non IO dependent benchmarks."

(a) ARM is in that claim. ARM of course provides the cores for Graviton (and most of the other ARMv8 server SoC's)...

(b) Just like EVERY DAMN STEP OF THIS PROCESS, the x86 crowd keep insisting that there's some magic ingredient in x86 cores that no-one else can reproduce. At first it was IPC, till Apple beat that. Then it was absolute performance, till Apple matched that. Then it was total throughput/memory bandwidth/IO, till Graviton 2 has matched that.
When will you get it through your heads that there is no magic there?!? The only core competency of x86 SoCs is executing x86 code. If Amazon or Marvell are capable of creating a SoC with many attached memory controllers, lots of PCIe lanes, and plenty of cores, believe me, Apple is capable of doing the exact same thing -- IF they have a reason to do so.

For some of you the game might be "x86 uber alles", but for the rest of us, we're just sick of seeing the forums polluted by the massive ignorance. You have people who don't have a clue about Marvell (high thread count), don't have a clue about Ampere (dual socket support), don't have a clue about Apple (extremely high IPC), don't have a clue about Amazon (commercial performance/dollar advantage) making these grand statements about a world of which they know not a damn thing.
I mean, christ, how can you be making claims about how "ARM has lousy support for large SoCs running substantial IO and memory footprints" if you're not even following what AWS is doing? WTF are you basing your claims on if you refuse to even track the single most obvious (but not the only) example of ARM being used for precisely those tasks???

There ARE true, negative, statements about ARM that can be made -- like, right now, there are some fundamental libraries (eg bignum or crypto) that have not had nearly the optimization put into them that x86 has seen. How about in future we stick to claims like that rather than wild fantasies about "I have no idea how one might design a SoC with a high PCIe lane count, therefore, obviously, Apple similarly has no idea"?

DrMrLordX · May 15, 2020

name99 said:
(b) Just like EVERY DAMN STEP OF THIS PROCESS, the x86 crowd keep insisting that there's some magic ingredient in x86 cores that no-one else can reproduce.

Since Amazon is letting people purchase time on Graviton2 instances, why don't we just run some big iron workloads on one and see for sure?

soresu · May 15, 2020

name99 said:
There ARE true, negative, statements about ARM that can be made -- like, right now, there are some fundamental libraries (eg bignum or crypto) that have not had nearly the optimization put into them that x86 has seen.

It depends on what you are doing with it as to what you get out of it - for me Android has long since been a slick, fast and generally pleasant UX because it has clearly been the focus of Linaro efforts as well as obviously Google and likely several Android vendors on top of that.

Ordinary Linux on the other hand is not nearly as impressive for me when experienced on an ARM SBC - I certainly expected far more from the base RPi 4 experience on Raspbian than the reality when first using it.

Definitely much more work to be done.

Something like a Clear Linux equivalent for ARM.

Richie Rich · May 16, 2020

Hitman928 said:
It's not only the Rpi4 score, but the scenes used, blender settings, OS used, operating frequencies, etc. There is nothing in your post that allows anyone to even check to see if your numbers are reasonable let alone accurate.

The scene is the BMW of course because it's the smallest one. I don't want to wait 2 days to render Grohe's Pavilion Barcelona.
For Ubuntu 64-bit it is Blender 2.82a and for 32-bit Raspbian it's 2.79b, both from repository, default settings.

Hitman928 · May 16, 2020

Richie Rich said:
The scene is the BMW of course because it's the smallest one. I don't want to wait 2 days to render Grohe's Pavilion Barcelona.
For Ubuntu 64-bit it is Blender 2.82a and for 32-bit Raspbian it's 2.79b, both from repository, default settings.

Well, I don't have a 3700x, but I have a 2700 and I get very different results from the 3700x numbers in your post.

I downclocked my 2700 to 1.5 GHz and 4 cores with no SMT to match the Rpi4. System info can be found in the Geekbench4 link. BMW demo file and default settings for Blender.

My Zen+ : GB4 ST - 1487 pts/GHz
Your Rpi4: GB4 ST - 645 pts/GHz

Geekbench 4 result (browser.geekbench.com): Gigabyte Technology Co., Ltd. X570 AORUS ELITE with an AMD Ryzen 7 2700 processor.

Zen+ = 130.5% faster performance per clock in GB4

My Zen+ : Blender (v2.82) 4T - 1356s
Your Rpi4: Blender (v2.82) 4T - 4077s

Zen+ = 200% faster performance per clock in Blender

That's a pretty big difference. Obviously Zen2 will be even faster. I'd prefer it to both be done in a controlled environment though. Perhaps I'll have to pick up a Rpi4 to play with. Maybe GB5 would be closer to Blender.
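The two "percent faster" figures above can be reproduced with a few lines of Python. This is only a sketch of the arithmetic using the numbers quoted in the post; the helper names are hypothetical. Because both machines ran at 1.5 GHz with 4 threads, the raw ratio is already a per-clock ratio.

```python
def percent_faster_by_score(score_a: float, score_b: float) -> float:
    """Percent by which A beats B when a higher score is better."""
    return (score_a / score_b - 1.0) * 100.0


def percent_faster_by_time(time_a: float, time_b: float) -> float:
    """Percent by which A beats B when a lower time is better."""
    return (time_b / time_a - 1.0) * 100.0


# Figures quoted in the post (both systems at 1.5 GHz, 4 threads).
gb4_zen_plus, gb4_rpi4 = 1487.0, 645.0            # GB4 ST points per GHz
blender_zen_plus, blender_rpi4 = 1356.0, 4077.0   # BMW render time, seconds

gb4_gain = percent_faster_by_score(gb4_zen_plus, gb4_rpi4)
blender_gain = percent_faster_by_time(blender_zen_plus, blender_rpi4)

print(f"GB4 per clock:     Zen+ is {gb4_gain:.1f}% faster")
print(f"Blender per clock: Zen+ is {blender_gain:.1f}% faster")
```

Note the direction of the ratio flips between the two helpers: Geekbench reports a score (higher is better), Blender reports a time (lower is better).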

name99 · May 16, 2020

DrMrLordX said:
Since Amazon is letting people purchase time on Graviton2 instances, why don't we just run some big iron workloads on one and see for sure?

Once again, you miss the point: where do you get your certainty as to the performance of ARM if you haven't followed the people who have been doing just that?

If you ASKED people for references to Graviton2 performance (or better yet, use this amazing new thing called Google) you'd get plenty of results. What I am complaining about is your absolute certainty that you already know the answer, when it's clear that you're not even following this space.

Doug S · May 16, 2020

name99 said:
For some of you the game might be "x86 uber alles", but for the rest of us, we're just sick of seeing the forums polluted by the massive ignorance.

For some of us this is a story we've heard before. In the 90s when Pentium Pro came out and x86 started beating RISCs in integer all the same arguments and excuses we are hearing here were made for why x86 could never compete with PA-RISC and Alpha. You aren't comparing with the right benchmarks, they'll never match RISC in floating point, you need bigger benchmarks to properly measure the memory system, you aren't taking I/O into account etc. etc.

People who think x86 has some unique advantages over ARM have their heads in the sand. I guess it will take Apple releasing the first ARM Macs for them to finally admit this, though I imagine some will still manage to find a few things x86 does better and try to claim those are the things that really matter.

Richie Rich · May 16, 2020

@Hitman928

Blender ST results:

Zen2 Ryzen 3700X ... 7463 s/GHz

Cortex A72 (RPi4) ... 15443 s/GHz .... that's 48% PPC of Zen2

But there are some differences between your measurements and mine:

My comparison was in ST, where A72 could benefit from 64-bit. A 3-core load is still OK, while 4-core load performance suffers a lot (probably due to a memory bandwidth bottleneck). So for a core-to-core comparison an ST load is much more realistic, since it eliminates the bottleneck.
No downclock of my 3700X and SMT ON, just a recalculation based on the given frequency per thread (16t).

Blender MT results:

Zen2 Ryzen 3700X 8c/16t ..... 179 s ..... 11 466 s/GHz/thread
Cortex A72 (RPi4) 4c/4t ...... 4077 s ..... 24 464 s/GHz/thread .... that's only 47% PPC per thread of Zen2

Please note that we compare an A72 core to a Zen2 thread, which means that AMD can get more than 4x higher PPC out of a Zen2 core thanks to SMT2 (while RPi4 bottlenecks at all-core load). This corresponds with your 200% higher PPC (Zen2 has +300%).

What is your time when you run Blender single-threaded? It would be more interesting to compare whether Zen+ scales similarly to Zen2.
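For anyone checking the percentages above, here is a rough Python sketch of the normalization. It assumes the posted figures are render time multiplied by clock (and by thread count for MT), which is what the RPi4 numbers imply at 1.5 GHz even though they are labeled "s/GHz". The 3700X clock is not stated in the post, so only its already-normalized values are used. Function names are hypothetical.

```python
def normalized_time(seconds: float, ghz: float, threads: int = 1) -> float:
    """Render time scaled by clock and thread count (lower is better)."""
    return seconds * ghz * threads


def ppc_ratio(reference: float, other: float) -> float:
    """Per-clock performance of `other` relative to `reference`."""
    return reference / other


# Normalized figures exactly as posted.
st_ratio = ppc_ratio(7463, 15443)    # A72 vs Zen2, single thread
mt_ratio = ppc_ratio(11466, 24464)   # A72 core vs Zen2 thread, all threads

# RPi4 MT figure rebuilt from its raw time: 4077 s, 4 threads, 1.5 GHz.
rpi4_mt = normalized_time(4077, 1.5, 4)

# Zen2 runs two threads per core (SMT2), so per core it delivers twice
# the per-thread figure relative to one A72 core.
zen2_core_vs_a72 = 2 / mt_ratio

print(f"ST: A72 at {st_ratio:.0%} of Zen2 per clock")
print(f"MT: A72 core at {mt_ratio:.0%} of a Zen2 thread per clock")
print(f"RPi4 MT normalized time: {rpi4_mt:.0f}")
print(f"Zen2 core vs A72 core (MT): {zen2_core_vs_a72:.2f}x")
```

The rebuilt RPi4 value lands within a couple of units of the posted 24 464, so the small gap looks like rounding.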
