Povray on ARM

sefsefsefsef · Aug 20, 2012

Ferzerp said:
It has absolutely no practical meaning by itself. It is only when combined with clockspeed and power usage that it has any meaning at all.

IPC always goes down as clockspeed goes up, so it turns out that IPC is never that interesting of a metric, even if you also know clockspeed and power consumption. Workload throughput and latency are all that really matters.

Power consumption is also meaningless on its own, and it's only when you also know the throughput/latency of a workload and can therefore compute some meaningful *energy consumed* metrics that "power consumption" becomes interesting. Who cares if you use 1/8th the power of some other chip if it offers 10x the performance? The high performance, "high power consumption" chip in that scenario is the more energy-efficient choice.
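The arithmetic behind that comparison can be sketched in a few lines of Python. The wattages and run times below are made-up numbers chosen only to match the 1/8th-power, 10x-performance ratio; they are not measurements of any real chip.

```python
def energy_joules(power_watts: float, runtime_seconds: float) -> float:
    """Energy used by a fixed workload: average power times run time."""
    return power_watts * runtime_seconds


# Hypothetical chips running the same job (illustrative numbers only).
fast_power_w, fast_runtime_s = 80.0, 10.0
slow_power_w = fast_power_w / 8        # 1/8th the power draw
slow_runtime_s = fast_runtime_s * 10   # 1/10th the performance

fast_energy = energy_joules(fast_power_w, fast_runtime_s)
slow_energy = energy_joules(slow_power_w, slow_runtime_s)

if __name__ == "__main__":
    print(f"fast chip: {fast_energy:.0f} J per job")
    print(f"slow chip: {slow_energy:.0f} J per job")
```

With these numbers the low-power chip spends 1000 J on the job and the fast chip 800 J, so the fast chip uses about 20% less energy per job despite drawing 8x the power.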

ARM is only successful because their CPUs run workloads where performance doesn't matter. ARM isn't using magic to create low-power chips, it's just making low performance chips, and then they get low-power for free. If ARM wants to have as high of performance as an Intel chip, then they cannot escape the fact that they must draw as much power as Intel chips.

Ferzerp · Aug 20, 2012

sefsefsefsef said:
IPC always goes down as clockspeed goes up, so it turns out that IPC is never that interesting of a metric, even if you also know clockspeed and power consumption. Workload throughput and latency are all that really matters.

Power consumption is also meaningless on its own, and it's only when you also know the throughput/latency of a workload and can therefore compute some meaningful *energy consumed* metrics that "power consumption" becomes interesting. Who cares if you use 1/8th the power of some other chip if it offers 10x the performance? The high performance, "high power consumption" chip in that scenario is the more energy-efficient choice.

ARM is only successful because their CPUs run workloads where performance doesn't matter. ARM isn't using magic to create low-power chips, it's just making low performance chips, and then they get low-power for free. If ARM wants to have as high of performance as an Intel chip, then they cannot escape the fact that they must draw as much power as Intel chips.

That was kind of my point

I've just said exactly that so many times that I got tired of it.

Haserath · Aug 20, 2012

sefsefsefsef said:
IPC always goes down as clockspeed goes up, so it turns out that IPC is never that interesting of a metric, even if you also know clockspeed and power consumption. Workload throughput and latency are all that really matters.

Power consumption is also meaningless on its own, and it's only when you also know the throughput/latency of a workload and can therefore compute some meaningful *energy consumed* metrics that "power consumption" becomes interesting. Who cares if you use 1/8th the power of some other chip if it offers 10x the performance? The high performance, "high power consumption" chip in that scenario is the more energy-efficient choice.

ARM is only successful because their CPUs run workloads where performance doesn't matter. ARM isn't using magic to create low-power chips, it's just making low performance chips, and then they get low-power for free. If ARM wants to have as high of performance as an Intel chip, then they cannot escape the fact that they must draw as much power as Intel chips.

Actually, whether IPC goes down depends on whether the workload fits within cache and whether or not the cache is clocked based on core clock.

Sandy's IPC shouldn't go down as long as Povray fits within the cache. Latency shouldn't be a big issue, because it's mostly throughput that matters.

It's hard to tell whether ARM or x86 is better for high end processors. You have ARM at ~1 W and the lowest Ivy is 17 W (without everything). Lots of optimization has gone into Intel's x86 design over the years.

sefsefsefsef · Aug 20, 2012

Haserath said:
Actually, whether IPC goes down depends on whether the workload fits within cache and whether or not the cache is clocked based on core clock.

Sandy's IPC shouldn't go down as long as Povray fits within the cache. Latency shouldn't be a big issue, because it's mostly throughput that matters.

It's hard to tell whether ARM or x86 is better for high end processors. You have ARM at ~1 W and the lowest Ivy is 17 W (without everything). Lots of optimization has gone into Intel's x86 design over the years.

You are correct about the cache capacity and speed being the determining factor in IPC scaling with clockspeed. What useful, real-world workloads really fit completely in the cache though? It's the degree of cache spilling that makes some workloads scale better with higher clockspeeds than others.

I kept saying "throughput/latency" because some workloads care about throughput, and others care more about latency. I wasn't just talking specifically about POV-Ray. I was using "throughput/latency" as a substitute for "performance."

Even a 17W IB isn't necessarily more energy-efficient than a 77W IB. Reviews of low-power IB chips have revealed that the highest-end K-series i7s are actually more energy efficient, but that comes at the cost of higher peak power draw, as seen in:
http://www.tomshardware.com/reviews/core-i5-3570-low-power,3204-14.html

Cerb · Aug 20, 2012

Haserath said:
It's hard to tell whether ARM or x86 is better for high end processors. You have ARM at ~1 W and the lowest Ivy is 17 W (without everything). Lots of optimization has gone into Intel's x86 design over the years.

It's that optimization that makes it superior. Back when memory wasn't so slow, to get decent dumb bench scores, like MIPS and FLOPS, RISCs could load/store less, and properly time loads and stores to get the most from the ALU.

X86 CPU designers had no choice but to make powerful memory subsystems on the CPU. Most RISCs have/had 12-64 GPRs, while IA32 has 0, though sometimes you can use as many as 4 (OTOH, several SFRs make some of those GPRs unnecessary). Now we're up to 8 more. So, addressing the heap has to be fast, and stack operations have to be fast. There also aren't nearly as many clever ways to hide mispredict penalties and page faults.

If Intel had need to design a high performance ARM CPU, I'm sure they could make one competitive with their x86 ones.

sefsefsefsef said:
I kept saying "throughput/latency" because some workloads care about throughput, and others care more about latency.

Shared memory multithreaded and multiprocess applications (think games, browsers, web and DB servers, etc.) need both. Outside of scientific computing (i.e., easy problems), it is much more common for one to be the more apparent bottleneck at a given time for a given app than for only one to matter, even if all you generally measure is throughput.

sm625 · Aug 21, 2012

jhu said:
Core i5 2400S (2.5 GHz): 235.18 pps; 94.07 pps/GHz
Celeron 220 (1.2 GHz): 81.15 pps; 67.62 pps/GHz
Athlon II x4 (2.8 GHz): 179.82 pps ; 64.22 pps/GHz
PowerPC 750 (700 MHz): 20.47 pps ; 29.25 pps/GHz
Pentium !!! (450 MHz): 12.43 pps ; 27.62 pps/GHz
Exynos 4210 (1.2 GHz): 29.90 pps ; 24.91 pps/GHz (-mfloat-abi=hard)
Pentium 4m (1.5 GHz): 36.24 pps; 24.16 pps/GHz
Exynos 4210 (1.2 GHz): 21.99 pps ; 18.32 pps/GHz (-mfloat-abi=softfp)
Atom N270 (1.6 GHz): 28.96 pps ; 18.10 pps/GHz

There's the real winner. I wonder what the 807UE does. It is a 10W chip.
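For anyone who wants to re-derive the per-clock column, here is a minimal Python sketch. The clock speeds and pps scores are copied from the quoted list; the variable and function names are just illustrative.

```python
# (chip, clock in GHz, POV-Ray score in pps), copied from the quoted list.
results = [
    ("Core i5 2400S", 2.5, 235.18),
    ("Celeron 220", 1.2, 81.15),
    ("Athlon II x4", 2.8, 179.82),
    ("PowerPC 750", 0.7, 20.47),
    ("Pentium !!!", 0.45, 12.43),
    ("Exynos 4210 (hard)", 1.2, 29.90),
    ("Pentium 4m", 1.5, 36.24),
    ("Exynos 4210 (softfp)", 1.2, 21.99),
    ("Atom N270", 1.6, 28.96),
]


def pps_per_ghz(pps: float, ghz: float) -> float:
    """Score normalised by clock speed."""
    return pps / ghz


# Highest per-clock score first.
ranked = sorted(results, key=lambda r: pps_per_ghz(r[2], r[1]), reverse=True)

if __name__ == "__main__":
    for name, ghz, pps in ranked:
        print(f"{name:22s} {pps:7.2f} pps  {pps_per_ghz(pps, ghz):6.2f} pps/GHz")
```

A couple of the recomputed figures may differ from the quoted ones in the last digit, but the ordering comes out the same.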

jhu · Aug 21, 2012

Updated:

Added numbers for OMAP 3621 from my Nook Color, overclocked to 1.2 GHz. I'm surprised how slow this thing is. At first I didn't believe it, so I wrote a small factoring program using double precision fp, ran it, then looked at the assembler source. Sure enough it's passing function arguments in the fp registers and using fp instructions. I guess Cortex-A9 really is that much faster than Cortex-A8.

Cerb · Aug 21, 2012

It's not just the A8. Until recently, most mobile SoCs have not had high performance memory controllers.

jpiniero · Aug 21, 2012

Don't think you are taking into account that the 2400S can turbo to 3.3.
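To see how much the clock assumption matters, the per-clock figure can be recomputed under both clocks. This is only a sketch: 235.18 pps and 2.5 GHz come from the list quoted earlier, 3.3 GHz is the turbo figure mentioned here, and whether the chip actually ran at that speed during the benchmark is not established.

```python
score_pps = 235.18   # Core i5 2400S score from the list
base_ghz = 2.5       # clock used in the list
turbo_ghz = 3.3      # turbo clock, assumed here as the worst case

per_clock_base = score_pps / base_ghz
per_clock_turbo = score_pps / turbo_ghz

if __name__ == "__main__":
    print(f"at base clock:  {per_clock_base:.2f} pps/GHz")
    print(f"at turbo clock: {per_clock_turbo:.2f} pps/GHz")
```

Even at the turbo clock the i5 stays at the top of the per-clock list, since the next entry is 67.62 pps/GHz.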

soccerballtux · Aug 21, 2012

jhu said:
Updated:

Added numbers for OMAP 3621 from my Nook Color, overclocked to 1.2 GHz. I'm surprised how slow this thing is. At first I didn't believe it, so I wrote a small factoring program using double precision fp, ran it, then looked at the assembler source. Sure enough it's passing function arguments in the fp registers and using fp instructions. I guess Cortex-A9 really is that much faster than Cortex-A8.

A8 is horribly slow until you get into 1.5ghz territory unless you have a snapdragon or something in which case it's not quite as bad at 1.2ghz. The HTCs from that era felt like fast phones.

OGDroid had an Omap A8 running at 550mhz. SOC supported 600mhz which Verizon downclocked to 550 for some reason, when we were all able to overclock to 1.0ghz easy, 1.2ghz with much higher voltage (and terrible battery). Even at 1.2ghz it was awfully slow...
Part of its problem was the SGX 530 couldn't render the android UI faster than ~20FPS so everything felt slow. Whatever proc HTC used in their phones or the Galaxy S1 had a better GPU which helped cover up slowness in the CPU...

this post accomplished nothing sorry

jhu · Aug 21, 2012

jpiniero said:
Don't think you are taking into account that the 2400S can turbo to 3.3.

Oh, forgot about that. I'll check again.

jhu · Aug 22, 2012

jpiniero said:
Don't think you are taking into account that the 2400S can turbo to 3.3.

Just checked. Doesn't look like it is using turbo:

$ grep MHz /proc/cpuinfo
cpu MHz : 1600.000
cpu MHz : 1600.000
cpu MHz : 2501.000
cpu MHz : 1600.000
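The same check can be scripted. Below is a small Python sketch that pulls the per-core clock readings out of /proc/cpuinfo-style text; the function name and the fallback sample are illustrative, and the match is case-insensitive on purpose. A single read is only a snapshot, so it is assumed you would run it while the benchmark is rendering.

```python
import re
from typing import List


def cpu_mhz_values(cpuinfo_text: str) -> List[float]:
    """Return the 'cpu MHz' readings found in /proc/cpuinfo-style text."""
    pattern = re.compile(r"^cpu mhz\s*:\s*([0-9.]+)",
                         re.IGNORECASE | re.MULTILINE)
    return [float(value) for value in pattern.findall(cpuinfo_text)]


# Fallback sample mirroring the output above, for systems without the file.
sample = """\
cpu MHz : 1600.000
cpu MHz : 1600.000
cpu MHz : 2501.000
cpu MHz : 1600.000
"""

if __name__ == "__main__":
    try:
        with open("/proc/cpuinfo") as f:
            text = f.read()
    except OSError:
        text = sample
    readings = cpu_mhz_values(text)
    print("per-core MHz:", readings)
    print("highest reading:", max(readings, default=None))
```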

jhu · Feb 28, 2013

Put up OMAP 4430. Not surprisingly, it performs about as well as the Exynos in my phone. Also put up an Ivy Bridge processor. Still haven't gotten around to compiling 3.7 for ARM yet.

Exophase · Feb 28, 2013

Saw there was commentary about perf on Cortex-A8..

There's a huge penalty (~20 cycles) for moving from VFP to integer registers, and any instructions on VFP are slow (7+ cycles for FADD etc), slower if double precision (especially divides). The only way to get fast FP code on Cortex-A8 is with single precision that gets in NEON registers and stays in NEON registers. So no soft-ABI, no double precision, and a competent compiler (and since that probably won't be available, competent hand ASM)

But that doesn't mean it's that much slower in general. Heavily hand optimized NEON can actually be faster clock-per-clock on Cortex-A8 than A9.

jhu · Feb 28, 2013

Exophase said:
Saw there was commentary about perf on Cortex-A8..

There's a huge penalty (~20 cycles) for moving from VFP to integer registers, and any instructions on VFP are slow (7+ cycles for FADD etc), slower if double precision (especially divides). The only way to get fast FP code on Cortex-A8 is with single precision that gets in NEON registers and stays in NEON registers. So no soft-ABI, no double precision, and a competent compiler (and since that probably won't be available, competent hand ASM)

But that doesn't mean it's that much slower in general. Heavily hand optimized NEON can actually be faster clock-per-clock on Cortex-A8 than A9.

Good to know. Unfortunately I gave away my Nook Color and can't play around with it anymore. OTOH, the Nook Tablet I bought really is a lot faster than its predecessor and cost less too!

jhu · Jul 31, 2013

Just got my Samsung Chromebook with Exynos 5 (Cortex A15). It's not a bad piece of computer for $210. And it's faster than my old Atom-based netbook. Updated list with Exynos 5 results.

Nothingness · Aug 1, 2013

jhu said:
Just got my Samsung Chromebook with Exynos 5 (Cortex A15). It's not a bad piece of computer for $210. And it's faster than my old Atom-based netbook. Updated list with Exynos 5 results.

Nice result, thanks. Was it with hard or softfp float ABI?

soccerballtux · Aug 1, 2013

thanks.

jhu · Aug 1, 2013

Nothingness said:
Nice result, thanks. Was it with hard or softfp float ABI?

hard float. unable to compile soft-float on this system.

jhu · Sep 9, 2013

Just got a Nexus 4 and tested it out. I'm surprised at how slow it is in Povray. It's touted as being similar to Cortex-A15. Yet, it's slower than a Pentium 4. I made sure to lock the processor at 1.5 GHz. Seems a little anomalous, but I can't figure out what's wrong. Or if nothing's wrong, why it's so slow compared to the Exynos 5.

Oh, other people have noticed this too.

soccerballtux · Sep 10, 2013

Odd that it's not faster clock/clock than the Exynos 4210. I almost bought a Nexus 4 to upgrade from GS2.

Exophase · Sep 10, 2013

jhu said:
Just got a Nexus 4 and tested it out. I'm surprised at how slow it is in Povray. It's touted as being similar to Cortex-A15. Yet, it's slower than a Pentium 4. I made sure to lock the processor at 1.5 GHz. Seems a little anomalous, but I can't figure out what's wrong. Or if nothing's wrong, why it's so slow compared to the Exynos 5.

Oh, other people have noticed this too.

The only time Krait has come close to Cortex-A15 is when comparing a Krait 400 @ 2.3GHz to a Cortex-A15 @ 1.8-1.9GHz or so. Not only are you looking at a lower clock speed instead of a higher one but you're also looking at an older uarch (Krait 200), missing several improvements made in Krait 300 and one improvement in Krait 400 (lower L2 latency).

On top of that, Krait's performance vs other CPUs can be pretty erratic, doing much better in some benchmarks than others. Krait 200 has been seen to occasionally do no better than Cortex-A9 at the same clock speed.

Note that Krait 200 came out like 10 months ahead of the first Cortex-A15 SoC..

jhu · Sep 11, 2013

soccerballtux said:
Odd that it's not faster clock/clock than the Exynos 4210. I almost bought a Nexus 4 to upgrade from GS2.

I actually did get the Nexus 4 to replace my GS2. The price is unbeatable for an unlocked, non-contract phone. If anything, it at least makes a good backup since I have a tendency to break phones.

NostaSeronx · Sep 11, 2013

jhu, are you able to put the Athlon on ICC 11.1?

icc 11.1, -xSSE3/-msse3 + -O3 + -fast

jhu · Sep 11, 2013

NostaSeronx said:
jhu, are you able to put the Athlon on ICC 11.1?

icc 11.1, -xSSE3/-msse3 + -O3 + -fast

I only have icc 13. The only binary that would run was the one compiled for pentium 4. All others gave seg faults on my Athlon.
