Okay, I have some results from testing native x86 vs translated ARM code on the DS emulator, plus figures from some ARM devices for comparison.
Code:
                                      Mario Kart  NSMB world 1  NSMB world 1
                                      countdown   map           in-game
Galaxy S3      4x1.4GHz A9            135         140           290
JXD S7300      2x1.2GHz A9            85          90            180
POV Mobii      2x1.0GHz A9 (Tegra 2)  52          52            135
Xperia Play    1x1.0GHz Scorpion      50          58            135
Venue 8 (X86)  2x2.0GHz Atom Z2580    62          56            150
Venue 8 (ARM)  2x2.0GHz Atom Z2580    26          27            70
These are in percentages (so 100 = full speed), taken without frameskip. There are three different versions of the code running:
1) ARMv7 NEON: CPU emulation with ARM recompiler, 2D/3D/Geometry emulation with hand-optimized NEON routines.
2) ARMv7 compatibility: CPU emulation with ARM recompiler, 2D/3D/Geometry emulation with C functions (this is only used on Tegra 2)
3) x86: CPU emulation with less advanced x86 recompiler, 2D/3D/Geometry emulation with C functions.
If multiple cores are available, 3D emulation will use 2-3 threads to divide the work between them and 2D emulation will use 2 threads. The Venue 8 uses 3 threads (but the difference between this and 2 threads is negligible, hyperthreading doesn't really help here), the Xperia Play uses 1, the POV Mobii and JXD S7300 use 2, and the Galaxy S3 uses 3.
Of the three scenes tested, the first two have heavy CPU and 3D loads, while the third has a much lighter CPU and 3D load (but a heavier 2D load, which doesn't impact things as much). The x86 version is hit hardest in the second test (NSMB world 1 map) because its CPU emulation is weaker than that of all the ARM versions.
But despite the fact that the x86 version is running much less optimized code than the ARM NEON version - which is what the translation layer is using, because NEON support is reported in this mode - the x86 native version is a lot faster. About 2 to 2.4 times faster.
A better comparison may actually be between the JXD S7300 and the Venue 8 in ARM mode, since they're both running the same code (ARM NEON), both have two cores, and their thread configurations are effectively equivalent (the Venue 8's third thread makes a negligible difference). The Venue 8's up-to-2GHz Saltwell cores should ostensibly be decently faster than the S7300's 1.2GHz Cortex-A9 cores. Instead, the S7300 is about 2.6 to 3.3 times faster.
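For anyone who wants to check the ratios quoted above, here's a quick sketch that recomputes them from the table. The figures are copied straight from the results; the dictionary layout and the `speedup` helper are just my own illustration, not anything from the emulator.

```python
# Percent-of-full-speed figures copied from the results table (no frameskip).
scenes = ["Mario Kart countdown", "NSMB world 1 map", "NSMB world 1 in-game"]
results = {
    "JXD S7300": [85, 90, 180],
    "Venue 8 (X86)": [62, 56, 150],
    "Venue 8 (ARM)": [26, 27, 70],
}

def speedup(faster, slower):
    """Per-scene ratio of two rows in the table, rounded to 2 decimals."""
    return [round(f / s, 2) for f, s in zip(results[faster], results[slower])]

# x86 native build vs the ARM NEON build run through the translation layer.
native_vs_translated = speedup("Venue 8 (X86)", "Venue 8 (ARM)")
# Same ARM NEON code on a real Cortex-A9 vs translated on the Atom.
s7300_vs_translated = speedup("JXD S7300", "Venue 8 (ARM)")

for scene, a, b in zip(scenes, native_vs_translated, s7300_vs_translated):
    print(f"{scene:22} native/translated {a:.2f}x   S7300/translated {b:.2f}x")
```

That gives roughly 2.1x to 2.4x for native vs translated on the Venue 8, and roughly 2.6x to 3.3x for the S7300 vs the translated Venue 8.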
To be fair though, our code - both the ARM recompiler and the NEON functions - may not exactly be representative. It has heavy register pressure (which is going to punish x86 with its 8 general-purpose registers, and especially with only 8 SSE and 8 MMX registers) and the NEON instructions often won't map gracefully to SSE instructions. But this is what you get with something that is heavily optimized and also desperately needs as much performance as it can get. Running under ARM translation, it's pretty much useless on current-gen x86 Android hardware. I don't know how much Silvermont cores will help, but it'll probably still be pretty bad. Running the x86 native code is an unfortunate compromise, but it's at least good enough to be worth using (if you throw in frameskip), whereas running through the ARM translation is unacceptable.
UPDATE: We also tried testing the Venue 8 using the ARM library without NEON, and it was actually about 5-12% faster. Meaning that, at least in our case, we're better off having less optimized scalar ARM code translated to scalar x86 code than more optimized NEON code translated to SSE! So I guess I was right, translating (our) NEON code to SSE is a mess...