SANDRA for Android benchmarks...

Mar 10, 2006
Anybody see this from a while ago?

http://www.sisoftware.net/?d=qa&f=xolo

In this test we have the Intel Atom Z2460, the HTC One S (dual core 1.5GHz Krait 200), and the Tegra 2, I believe.

xolo_cpuaa.png


In native processing performance (GIPS and MFLOPS), the dual-core Krait @ 1.5GHz comes in at roughly 2x the speed of the single-core Atom at 1.6GHz, suggesting similar integer performance per core per clock.
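The per-core, per-clock comparison above is just back-of-the-envelope arithmetic, which can be sketched like this (the scores are illustrative stand-ins, not the actual Sandra numbers):

```python
# Hypothetical normalized Dhrystone-style scores (illustrative only).
krait_score = 2.0   # dual-core Krait @ 1.5 GHz
atom_score = 1.0    # single-core Atom @ 1.6 GHz

# Normalize each to per-core, per-GHz throughput.
krait_per_core_ghz = krait_score / (2 * 1.5)
atom_per_core_ghz = atom_score / (1 * 1.6)

ratio = krait_per_core_ghz / atom_per_core_ghz
print(round(ratio, 2))  # 1.07 -> roughly similar integer IPC per core/clock
```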

In floating point, the Atom actually wins slightly, but it's apparent that the benchmark isn't optimized for ARM's NEON (which would likely give a significant performance boost).

Then there are the multimedia tests, which do seem to be NEON-optimized for ARM. The Atom does pretty well on FPU/SIMD code even in a benchmark that supports NEON, though the fact that the Atom is single-core while the Krait is dual-core makes me question the validity of these results:

xolo_cpumm.png


Now, things get ugly for the Intel part when running code through the Dalvik VM:

xolo_javaaa.png


the dual-core Krait mops the floor with the single-core Atom.

Discuss?
 

Exophase

Diamond Member
Apr 19, 2012
Big caveat is Dhrystone sucks (surprising, right?).. This one at least has had an industry consensus behind it for some time. The code is a bunch of totally random-looking operations, but more than anything it's 100% resident in L1 cache and has very good branch properties; it doesn't test some of the most performance-critical parts of the CPU.
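To make the L1-residency complaint concrete: Dhrystone's working set is tiny next to either chip's L1 data cache, so everything past L1 never gets exercised. A rough sketch (the byte counts are illustrative order-of-magnitude figures, not exact Dhrystone or datasheet numbers):

```python
# Illustrative working-set sizes, in bytes (not exact figures).
dhrystone_working_set = 400   # a handful of globals and small records
krait_l1d = 16 * 1024         # Krait's 16KB L1 dcache (per the post above)
atom_l1d = 24 * 1024          # assumed Atom L1 dcache size, for comparison

# Either cache holds the whole benchmark many times over, so L2 latency,
# memory bandwidth, and prefetchers are essentially never tested.
print(dhrystone_working_set < krait_l1d)           # True
print(krait_l1d // dhrystone_working_set)          # 40 -> plenty of headroom
```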

Another thing the SiSoft description alludes to is that it spends a lot of time in library code, like string functions. So a highly tweaked libc can make a big difference.

There's no native threading, so thread scaling means running multiple instances on separate cores. Considering how much Atom benefits from HT, Krait gives a very strong showing here, much stronger than it does in a lot of better benches. I don't know a lot about Krait so I can't give definitive reasons why it has problems. All I really know is that it uses smaller L1 caches than most (16KB/16KB), has one of the L1 dcache ways set aside for power reasons (which adds latency on misses), probably pays in L2 latency because the L2 has to be less tightly coupled to support asynchronous clock domains, and apparently had no data prefetch before Krait 300. That last one, if true, could be one of the bigger reasons it sucks at a lot of benchmarks.
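Since there's no native threading, the "multi-threaded" score just comes from launching independent copies of the benchmark on separate cores, roughly like this sketch (the workload function is a made-up stand-in, not Sandra's actual kernel):

```python
from multiprocessing import Pool


def instance(n):
    # Stand-in for one benchmark instance: a purely CPU-bound integer loop.
    total = 0
    for i in range(n):
        total += i * i
    return total


if __name__ == "__main__":
    work = 200_000
    # "Single-threaded" score: one instance on one core.
    single = instance(work)
    # "Multi-threaded" score: two independent instances, one per core.
    # There's no shared state, so scaling only measures how well the cores
    # (and, on Atom, Hyper-Threading) run unrelated copies side by side.
    with Pool(2) as pool:
        results = pool.map(instance, [work, work])
    print(single == results[0] == results[1])  # True
```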

Atom's integer SIMD is actually very good for its class - it has 2 128-bit pipelines, at least for simple ALU operations. I don't know if what SiSoft said about ARM being penalized for moving data from NEON to scalar registers is really a big factor - that was a huge problem on Cortex-A8, but none of these are Cortex-A8. What I heard is that A9 was supposed to "fix" this, although I haven't timed it.

But I don't think that alone explains the results. I'd have to look at the ASM carefully (blegh...) to see if it's software-related. If we're talking auto-vectorization, then I'd expect even GCC to do better with x86 than ARM - NEON auto-vectorization is a lot less mature. If they're using ICC then it'll do a better job still. If they're using intrinsics or hand-rolled ASM then all bets are off. They say "standard math libraries" but I still have no idea what they're actually using.

It'd be great to know what the actual test is.

I have no idea what the deal is with the Dalvik tests. Maybe the JIT does badly with fewer registers (since the Atom is just 32-bit x86 it has 8 GPRs vs 15 on ARM). But that's a total shot in the dark. This test would be a lot harder to analyze since you can't look at native code disassembly :/

Weird that SiSoft doesn't comment on how poor the Dalvik Whetstone scores are.. what are they doing, using emulated floating point?
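On the emulated-FP guess: if a VM falls back to software floating point, every FP operation becomes a sequence of integer operations instead of one hardware instruction. A crude fixed-point sketch of the idea (pure illustration, not Dalvik's actual scheme; real soft-float, with exponent and mantissa handling, is far more expensive than this):

```python
SCALE = 1 << 16  # Q16.16 fixed-point format


def to_fix(x: float) -> int:
    # Encode a real number as a scaled integer.
    return int(round(x * SCALE))


def fix_mul(a: int, b: int) -> int:
    # One "float" multiply becomes an integer multiply plus a shift.
    return (a * b) >> 16


a, b = to_fix(1.5), to_fix(2.25)
result = fix_mul(a, b) / SCALE
print(result)  # 3.375
```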
 

Nothingness

Diamond Member
Jul 3, 2013
Three comments:
- if it's icc vs gcc, the native tests might be skewed, given Intel's history of tuning their compiler for benchmarks
- comparing Dalvik from Android 2 vs 4 makes little sense and I bet X900 would get much better results
- L1 latency of Cortex-A9 is variable and it seems they picked a slowish case; see http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.ddi0388h/Babghcej.html

As far as I know this page is several months old, but Sandra still doesn't seem to be available.