Geekbench Is Broken?

Arachnotronic · Jul 17, 2013

Apparently an i5-4670 at 3.4GHz is less than twice as fast as a quad Cortex A15 at 1.9GHz.

http://browser.primatelabs.com/geekbench2/compare/2162592/2025974

I mean, I get that the sharpen/blur results throw things off significantly, but there's even some weirdness on the dot product benchmark. I find it real hard to believe that a Tegra 4 can do dot products better than a 4670K.

Nothingness · Jul 17, 2013

Didn't you read the previous posts about FP issues in Geekbench?

Also, comparing Geekbench scores across different OSes is not a very good idea, because different compilers and settings are used. My understanding is that the Geekbench author uses the most commonly used compiler for each platform. So this means Visual C++ for Windows and gcc for Android.

EDIT: BTW the results you use for 4670 look too low. Pick this: http://browser.primatelabs.com/geekbench2/2164671
and for 64-bit: http://browser.primatelabs.com/geekbench2/2146689

Arachnotronic · Jul 17, 2013

Nothingness said:
Didn't you read the previous posts about FP issues in Geekbench?

Yes, I was the one who brought this issue to light in the first place by posting my interaction with the Intel engineer. Still doesn't explain the dot product test; only the sharpen/blur ones.

Also, comparing Geekbench scores across different OSes is not a very good idea, because different compilers and settings are used.

I see. So why does Primate Labs say this on its website?

Compare apples and oranges. Or Macs and PCs. Geekbench is available for a variety of platforms enabling you to benchmark different computers running different operating systems.

So now the most pervasive benchmark in the mobile world seems to be completely broken on Intel (and possibly AMD?) platforms - not just the Atom. Why hasn't there been a media fiasco about this?

Inquiring minds want to know.

Nothingness · Jul 17, 2013

Intel17 said:
So now the most pervasive benchmark in the mobile world seems to be completely broken on Intel (and possibly AMD?) platforms - not just the Atom. Why hasn't there been a media fiasco about this?

Inquiring minds want to know.

That's very different from the AnTuTu fiasco, don't you think? AnTuTu was and still is not using the platform compiler. AnTuTu is in all cases only run on a single OS. AnTuTu still is using poor flag settings for ARM.

Now if you can demonstrate that something bad is done against Intel running Android that favors ARM, we'll be able to discuss a fiasco.

Anyway I have a hard time trusting closed source benchmarks. But IMHO if correctly used Geekbench is among the better benchmarks for mobile machines.

Arachnotronic · Jul 17, 2013

Nothingness said:
That's very different from the AnTuTu fiasco, don't you think? AnTuTu was and still is not using the platform compiler. AnTuTu is in all cases only run on a single OS. AnTuTu still is using poor flag settings for ARM.

AnTuTu is a really crappy benchmark and I am glad that its credibility is now out the window, courtesy of our very own Exophase. I only seek the truth, and believe he did the community a great service.

Now if you can demonstrate that something bad is done against Intel running Android that favors ARM, we'll be able to discuss a fiasco.

How about broken sharpen/blur tests? Did you know that Geekbench was one of the benchmarks used in that article by Jim McGregor to whine about how AnTuTu was terrible? Also, it's not clear if these mobile benchmarks are optimized for NEON but not optimized for SSE2/3/4. Further, it is clear that Geekbench must not be properly utilizing AVX2/FMA3 on the Haswell chips because there is a negligible increase in floating point scores from IVB -> HSW. Note that a "dot product" is an operation RIPE for FMA instructions since it is just the computation:

A*B = a1*b1 + a2*b2 + ... + an*bn.
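To make the FMA point concrete, here is a minimal C sketch of that computation. It is not Geekbench's code (the benchmark is closed source); the name `dot_product` and its signature are made up for illustration. Each loop iteration is a single multiply-accumulate, which is the operation a fused multiply-add instruction performs. Whether a compiler actually emits FMA for it depends on the target and on the flags used, which this sketch does not assume.

```c
#include <assert.h>
#include <stddef.h>

/* Scalar dot product: A*B = a1*b1 + a2*b2 + ... + an*bn.
 * Hypothetical helper for illustration, not taken from Geekbench. */
float dot_product(const float *a, const float *b, size_t n)
{
    assert(n == 0 || (a != NULL && b != NULL));

    float sum = 0.0f;
    for (size_t i = 0; i < n; i++) {
        /* One multiply and one add per element: sum = sum + a[i] * b[i].
         * This is the shape an FMA instruction computes in one step. */
        sum += a[i] * b[i];
    }
    return sum;
}
```

Calling `dot_product(a, b, 4)` with a = {1, 2, 3, 4} and b = {5, 6, 7, 8} sums 5 + 12 + 21 + 32.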

Anyway I have a hard time trusting closed source benchmarks. But IMHO if correctly used Geekbench is among the better benchmarks for mobile machines.

Yes. Comparing ARM chip to ARM chip, Geekbench is great & awesome, but for some reason, the floating point tests completely barf on Intel's chips. Again, we now know why because the Intel engineer was so kind as to point it out to me, and I posted it all over the place...but if I hadn't done that, would anybody have cared? Or would we go around all thinking that a quad A15 has more per clock FPU performance than a quad Haswell @ twice the clock?

Nothingness · Jul 17, 2013

Intel17 said:
Also, it's not clear if these mobile benchmarks are optimized for NEON but not optimized for SSE2/3/4. Further, it is clear that Geekbench must not be properly utilizing AVX2/FMA3 on the Haswell chips because there is a negligible increase in floating point scores from IVB -> HSW. Note that a "dot product" is an operation RIPE for FMA instructions since it is just the computation:

A*B = a1*b1 + a2*b2 + ... + an*bn.

Blame Intel for that: their utterly stupid segmentation makes it hard for developers to support new features when even the latest generation doesn't support them.

ARM code doesn't use NEON:

Code:

   a42a8:       eb06 0c03       add.w   ip, r6, r3
   a42ac:       eddc 6a00       vldr    s13, [ip]
   a42b0:       eb05 0c03       add.w   ip, r5, r3
   a42b4:       ed9c 7a00       vldr    s14, [ip]
   a42b8:       ee46 7a87       vmla.f32        s15, s13, s14
   a42bc:       3101            adds    r1, #1
   a42be:       3304            adds    r3, #4
   a42c0:       42b9            cmp     r1, r7
   a42c2:       d1f1            bne.n   a42a8 <_ZN10DotProduct12workerScalarEi+0x18>
   a42c4:       3201            adds    r2, #1
   a42c6:       42a2            cmp     r2, r4
   a42c8:       d003            beq.n   a42d2 <_ZN10DotProduct12workerScalarEi+0x42>
   a42ca:       2300            movs    r3, #0
   a42cc:       6ac7            ldr     r7, [r0, #44]   ; 0x2c
   a42ce:       4619            mov     r1, r3
   a42d0:       e7f6            b.n     a42c0 <_ZN10DotProduct12workerScalarEi+0x30>

Note that the code is extremely bad, probably due to a low level of optimization.

Yes. Comparing ARM chip to ARM chip, Geekbench is great & awesome, but for some reason, the floating point tests completely barf on Intel's chips. Again, we now know why because the Intel engineer was so kind as to point it out to me, and I posted it all over the place...but if I hadn't done that, would anybody have cared? Or would we go around all thinking that a quad A15 has more per clock FPU performance than a quad Haswell @ twice the clock?

The denormal issue also exists on ARM chips. Except that ARM chips don't rely on microcode to handle them.
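For readers unfamiliar with the term, a denormal (subnormal) float is a nonzero value whose magnitude is below `FLT_MIN`, the smallest normal float. The sketch below is only an illustration of how such values arise. The helper names `is_denormal` and `halve_n_times` are hypothetical, and the code says nothing about how Geekbench's sharpen/blur tests produce denormals.

```c
#include <assert.h>
#include <float.h>
#include <stdbool.h>

/* True when x is nonzero but smaller in magnitude than FLT_MIN,
 * i.e. a denormal (subnormal) single-precision value. */
bool is_denormal(float x)
{
    float mag = x < 0.0f ? -x : x;
    return mag != 0.0f && mag < FLT_MIN;
}

/* Hypothetical demo helper: multiply x by 0.5 k times.
 * The volatile store forces each step to be rounded to float. */
float halve_n_times(float x, int k)
{
    assert(k >= 0);

    volatile float v = x;
    for (int i = 0; i < k; i++) {
        v = v * 0.5f;
    }
    return v;
}
```

Starting from `FLT_MIN`, a single halving already lands in the denormal range, and enough further halvings underflow to zero. This assumes default IEEE 754 behaviour with no flush-to-zero mode enabled.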

Arachnotronic · Jul 17, 2013

Nothingness said:
Blame Intel for that: their utterly stupid segmentation makes it hard for developers to support new features when even the latest generation doesn't support them.

ARM code doesn't use NEON:

Code:

   a42a8:       eb06 0c03       add.w   ip, r6, r3
   a42ac:       eddc 6a00       vldr    s13, [ip]
   a42b0:       eb05 0c03       add.w   ip, r5, r3
   a42b4:       ed9c 7a00       vldr    s14, [ip]
   a42b8:       ee46 7a87       vmla.f32        s15, s13, s14
   a42bc:       3101            adds    r1, #1
   a42be:       3304            adds    r3, #4
   a42c0:       42b9            cmp     r1, r7
   a42c2:       d1f1            bne.n   a42a8 <_ZN10DotProduct12workerScalarEi+0x18>
   a42c4:       3201            adds    r2, #1
   a42c6:       42a2            cmp     r2, r4
   a42c8:       d003            beq.n   a42d2 <_ZN10DotProduct12workerScalarEi+0x42>
   a42ca:       2300            movs    r3, #0
   a42cc:       6ac7            ldr     r7, [r0, #44]   ; 0x2c
   a42ce:       4619            mov     r1, r3
   a42d0:       e7f6            b.n     a42c0 <_ZN10DotProduct12workerScalarEi+0x30>

Note that the code is extremely bad, probably due to a low level of optimization.

The denormal issue also exists on ARM chips. Except that ARM chips don't rely on microcode to handle them.

Thanks for the post, Nothingness. Very helpful.

Nothingness · Jul 17, 2013

I'll dig out the Windows code later, it's not easy for me as I'm only familiar with Linux

Arachnotronic · Jul 17, 2013

Nothingness said:
I'll dig out the Windows code later, it's not easy for me as I'm only familiar with Linux

Much appreciated.

Nothingness · Jul 18, 2013

Here you go:

Code:

Linux 32-bit
 8099eb0:    31 c0                    xor    %eax,%eax
 8099eb2:    85 d2                    test   %edx,%edx
 8099eb4:    74 0f                    je     8099ec5
 8099eb6:    d9 04 83                 flds   (%ebx,%eax,4)
 8099eb9:    d8 0c 81                 fmuls  (%ecx,%eax,4)
 8099ebc:    83 c0 01                 add    $0x1,%eax
 8099ebf:    39 d0                    cmp    %edx,%eax
 8099ec1:    de c1                    faddp  %st,%st(1)
 8099ec3:    75 f1                    jne    8099eb6
 8099ec5:    83 c6 01                 add    $0x1,%esi
 8099ec8:    39 fe                    cmp    %edi,%esi
 8099eca:    75 e4                    jne    8099eb0

Windows 32-bit
  441e63:    83 79 38 00              cmpl   $0x0,0x38(%ecx)
  441e67:    76 1d                    jbe    0x441e86
  441e69:    8b 71 38                 mov    0x38(%ecx),%esi
  441e6c:    8b d3                    mov    %ebx,%edx
  441e6e:    8b c7                    mov    %edi,%eax
  441e70:    2b d7                    sub    %edi,%edx
  441e72:    d9 04 02                 flds   (%edx,%eax,1)
  441e75:    83 c0 04                 add    $0x4,%eax
  441e78:    4e                       dec    %esi
  441e79:    d8 48 fc                 fmuls  -0x4(%eax)
  441e7c:    d8 44 24 14              fadds  0x14(%esp)
  441e80:    d9 5c 24 14              fstps  0x14(%esp)
  441e84:    75 ec                    jne    0x441e72
  441e86:    4d                       dec    %ebp
  441e87:    75 da                    jne    0x441e63

Android x86
   c54d8:    f3 0f 10 0c 97           movss  (%edi,%edx,4),%xmm1
   c54dd:    f3 0f 59 0c 96           mulss  (%esi,%edx,4),%xmm1
   c54e2:    42                       inc    %edx
   c54e3:    f3 0f 58 c1              addss  %xmm1,%xmm0
   c54e7:    3b 55 f0                 cmp    0xfffffff0(%ebp),%edx
   c54ea:    75 ec                    jne    c54d8
   c54ec:    41                       inc    %ecx
   c54ed:    3b 4d ec                 cmp    0xffffffec(%ebp),%ecx
   c54f0:    74 0a                    je     c54fc
   c54f2:    8b 50 28                 mov    0x28(%eax),%edx
   c54f5:    89 55 f0                 mov    %edx,0xfffffff0(%ebp)
   c54f8:    31 d2                    xor    %edx,%edx
   c54fa:    eb eb                    jmp    c54e7

Android ARMv7
   a42a8:    eb06 0c03     add.w    ip, r6, r3
   a42ac:    eddc 6a00     vldr    s13, [ip]
   a42b0:    eb05 0c03     add.w    ip, r5, r3
   a42b4:    ed9c 7a00     vldr    s14, [ip]
   a42b8:    ee46 7a87     vmla.f32    s15, s13, s14
   a42bc:    3101          adds    r1, #1
   a42be:    3304          adds    r3, #4
   a42c0:    42b9          cmp    r1, r7
   a42c2:    d1f1          bne.n    a42a8
   a42c4:    3201          adds    r2, #1
   a42c6:    42a2          cmp    r2, r4
   a42c8:    d003          beq.n    a42d2
   a42ca:    2300          movs    r3, #0
   a42cc:    6ac7          ldr    r7, [r0, #44]    ; 0x2c
   a42ce:    4619          mov    r1, r3
   a42d0:    e7f6          b.n    a42c0

All loops look similar. As already said, Windows 32-bit uses x87 (as does Linux), while Android uses SSE.

Arachnotronic · Jul 18, 2013

Nothingness said:

Here you go:

Code:

Linux 32-bit
 8099eb0:    31 c0                    xor    %eax,%eax
 8099eb2:    85 d2                    test   %edx,%edx
 8099eb4:    74 0f                    je     8099ec5
 8099eb6:    d9 04 83                 flds   (%ebx,%eax,4)
 8099eb9:    d8 0c 81                 fmuls  (%ecx,%eax,4)
 8099ebc:    83 c0 01                 add    $0x1,%eax
 8099ebf:    39 d0                    cmp    %edx,%eax
 8099ec1:    de c1                    faddp  %st,%st(1)
 8099ec3:    75 f1                    jne    8099eb6
 8099ec5:    83 c6 01                 add    $0x1,%esi
 8099ec8:    39 fe                    cmp    %edi,%esi
 8099eca:    75 e4                    jne    8099eb0

Windows 32-bit
  441e63:    83 79 38 00              cmpl   $0x0,0x38(%ecx)
  441e67:    76 1d                    jbe    0x441e86
  441e69:    8b 71 38                 mov    0x38(%ecx),%esi
  441e6c:    8b d3                    mov    %ebx,%edx
  441e6e:    8b c7                    mov    %edi,%eax
  441e70:    2b d7                    sub    %edi,%edx
  441e72:    d9 04 02                 flds   (%edx,%eax,1)
  441e75:    83 c0 04                 add    $0x4,%eax
  441e78:    4e                       dec    %esi
  441e79:    d8 48 fc                 fmuls  -0x4(%eax)
  441e7c:    d8 44 24 14              fadds  0x14(%esp)
  441e80:    d9 5c 24 14              fstps  0x14(%esp)
  441e84:    75 ec                    jne    0x441e72
  441e86:    4d                       dec    %ebp
  441e87:    75 da                    jne    0x441e63

Android x86
   c54d8:    f3 0f 10 0c 97           movss  (%edi,%edx,4),%xmm1
   c54dd:    f3 0f 59 0c 96           mulss  (%esi,%edx,4),%xmm1
   c54e2:    42                       inc    %edx
   c54e3:    f3 0f 58 c1              addss  %xmm1,%xmm0
   c54e7:    3b 55 f0                 cmp    0xfffffff0(%ebp),%edx
   c54ea:    75 ec                    jne    c54d8
   c54ec:    41                       inc    %ecx
   c54ed:    3b 4d ec                 cmp    0xffffffec(%ebp),%ecx
   c54f0:    74 0a                    je     c54fc
   c54f2:    8b 50 28                 mov    0x28(%eax),%edx
   c54f5:    89 55 f0                 mov    %edx,0xfffffff0(%ebp)
   c54f8:    31 d2                    xor    %edx,%edx
   c54fa:    eb eb                    jmp    c54e7

Android ARMv7
   a42a8:    eb06 0c03     add.w    ip, r6, r3
   a42ac:    eddc 6a00     vldr    s13, [ip]
   a42b0:    eb05 0c03     add.w    ip, r5, r3
   a42b4:    ed9c 7a00     vldr    s14, [ip]
   a42b8:    ee46 7a87     vmla.f32    s15, s13, s14
   a42bc:    3101          adds    r1, #1
   a42be:    3304          adds    r3, #4
   a42c0:    42b9          cmp    r1, r7
   a42c2:    d1f1          bne.n    a42a8
   a42c4:    3201          adds    r2, #1
   a42c6:    42a2          cmp    r2, r4
   a42c8:    d003          beq.n    a42d2
   a42ca:    2300          movs    r3, #0
   a42cc:    6ac7          ldr    r7, [r0, #44]    ; 0x2c
   a42ce:    4619          mov    r1, r3
   a42d0:    e7f6          b.n    a42c0

All loops look similar. As already said, Windows 32-bit uses x87 (as does Linux), while Android uses SSE.

Just SSE1?

Nothingness · Jul 18, 2013

Intel17 said:
Just SSE1?

Yes, it's not vectorized, which makes comparisons against ARM more fair

Exophase · Jul 18, 2013

It's really SSE2 where Intel introduced scalar operations</nitpick>

The code quality is really crappy for both. Not at all representative of what real FP code should look like. But there is a lot of crappy code out there for one reason or another..

I hope the integer stuff is better optimized :/
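As a rough idea of what less naive FP code could look like, one common technique is to split the sum across several independent accumulators so that consecutive adds do not depend on each other. The sketch below is an assumption about what "better" might mean here, not anything taken from Geekbench. The name `dot_product_4acc` is hypothetical. Because it changes the order of the additions, its result can differ from the simple scalar loop in the last bits.

```c
#include <assert.h>
#include <stddef.h>

/* Dot product with four independent partial sums.
 * Hypothetical illustration, not Geekbench's implementation. */
float dot_product_4acc(const float *a, const float *b, size_t n)
{
    assert(n == 0 || (a != NULL && b != NULL));

    float s0 = 0.0f, s1 = 0.0f, s2 = 0.0f, s3 = 0.0f;
    size_t i = 0;

    /* Main loop: four elements per iteration, one per accumulator,
     * so the four add chains can proceed independently. */
    for (; i + 4 <= n; i += 4) {
        s0 += a[i]     * b[i];
        s1 += a[i + 1] * b[i + 1];
        s2 += a[i + 2] * b[i + 2];
        s3 += a[i + 3] * b[i + 3];
    }

    float sum = (s0 + s1) + (s2 + s3);

    /* Tail loop: handle the remaining 0 to 3 elements. */
    for (; i < n; i++) {
        sum += a[i] * b[i];
    }
    return sum;
}
```

The four-way split also lines up with a 128-bit SIMD register holding four floats, which is one reason a loop written this way tends to be easier for a compiler to vectorize.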

tempestglen · Jul 19, 2013

Exophase said:
It's really SSE2 where Intel introduced scalar operations</nitpick>

The code quality is really crappy for both. Not at all representative of what real FP code should look like. But there is a lot of crappy code out there for one reason or another..

I hope the integer stuff is better optimized :/

Why not use SPEC2000 or SPEC06?

Nowadays it's common for Android phones to have more than 2GBit of memory.
